The release of stacks-node 3.1.0.0.9 packed a ton of performance improvements.

Stacks node operators also saw dramatic speedups during block synchronization: when catching back up to the chain after a restart, some nodes went from syncing ~1k blocks per hour to ~75k blocks per hour (a 75x speedup).
As more and more nodes picked up these improvements, the network’s block production time improved as well: most blocks are now produced within 5 seconds.
In this blog post, I’ll walk through how we diagnosed and remedied three performance bottlenecks in the stacks-node implementation during that release.
Finding the next block to process
In the leadup to the Stacks core 3.1.0.0.9 release, we wanted to explore the performance characteristics of block processing in the Nakamoto release. Various network participants had reported slowdowns in block production. Looking at Signer performance, we saw Signers that had previously processed blocks quickly begin to slow down. The key to figuring out what could cause a slowdown is first understanding what a stacks-node is doing with its time.
The best diagnostic tool for this kind of thing is the flamegraph: these graphs show how long a process spends in a given function, grouped by that function’s caller. Because the stacks-node is a Rust program, collecting flamegraphs is fairly easy: we just need to make sure that the binary is built with debug symbols, and then we can collect runtime information with perf and feed it into the flamegraph.pl scripts.
While exploring one of these flamegraphs, one particular function immediately jumped out as unusual:

The chains-coordinator thread (the thread responsible for all block processing in the stacks-node) spent nearly all of its time invoking `handle_new_nakamoto_stacks_block`. By itself, this isn’t unexpected: block processing is the primary responsibility of that thread. However, almost all of the block processing time was spent in `next_ready_nakamoto_block`, a function that just returns the block data if a block is ready for processing. This should be a relatively quick row fetch from the database, but we can see that the SQLite operations are dominating the runtime of the chains-coordinator thread.
So the next step was to look at the relevant code:
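The exact query is fairly involved, so here is a simplified sketch of its shape, with hypothetical column names standing in for the real `nakamoto_staging_blocks` schema:

```rust
use rusqlite::{Connection, OptionalExtension};

// Simplified sketch only: the real next_ready_nakamoto_block() query and the
// nakamoto_staging_blocks schema differ in their details. What matters is the
// shape: a self-join whose WHERE clause filters the `child` side on columns
// that, at the time, had no covering index.
const NEXT_READY_BLOCK_SQL: &str = "
    SELECT child.data
      FROM nakamoto_staging_blocks child
      JOIN nakamoto_staging_blocks parent
        ON child.parent_block_id = parent.index_block_hash
     WHERE child.processed = 0
       AND child.orphaned = 0
       AND child.burn_attachable = 1
       AND parent.processed = 1
     ORDER BY child.height ASC
     LIMIT 1";

/// Return the serialized data of the next block that is ready to process, if any.
fn next_ready_block(conn: &Connection) -> rusqlite::Result<Option<Vec<u8>>> {
    conn.query_row(NEXT_READY_BLOCK_SQL, [], |row| row.get(0))
        .optional()
}
```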
So there was a big SQL query! This looked like a likely culprit, so the next step was to open the relevant database in sqlite and ask for the query plan:
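From the sqlite3 shell, that just means prefixing the query with `EXPLAIN QUERY PLAN`; the same check can be done programmatically, for example with rusqlite (shown here against the hypothetical query above):

```rust
use rusqlite::Connection;

/// Print SQLite's query plan for a statement. EXPLAIN QUERY PLAN returns one
/// row per plan step; the fourth column is the human-readable detail, e.g.
/// "SCAN child" or "SEARCH child USING INDEX <name>".
fn print_query_plan(conn: &Connection, sql: &str) -> rusqlite::Result<()> {
    let mut stmt = conn.prepare(&format!("EXPLAIN QUERY PLAN {sql}"))?;
    let steps = stmt.query_map([], |row| row.get::<_, String>(3))?;
    for step in steps {
        println!("{}", step?);
    }
    Ok(())
}
```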
There we have the underlying issue: this query required a SCAN step over `child`, which in this case is the `nakamoto_staging_blocks` table. This table grows with the number of Nakamoto blocks produced, which explains the gradual performance degradation across nodes in the network. To get rid of the SCAN, we can simply add an index to the table:
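Sticking with the hypothetical column names from the sketch above (the actual index added in #6035 covers the real schema’s columns), the fix boils down to one statement:

```rust
use rusqlite::Connection;

// Hypothetical index matching the earlier sketch; the real schema change in
// #6035 indexes the actual nakamoto_staging_blocks columns used by the query.
fn add_staging_blocks_index(conn: &Connection) -> rusqlite::Result<()> {
    conn.execute_batch(
        "CREATE INDEX IF NOT EXISTS nakamoto_staging_blocks_by_ready
             ON nakamoto_staging_blocks(processed, orphaned, burn_attachable, height)",
    )
    // With an index like this in place, the plan's SCAN over `child` becomes a SEARCH.
}
```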
Now the SCAN is replaced by a SEARCH, and the query can directly return the row result.
Adding this index to the schema (#6035) eliminated the `next_ready_nakamoto_block()` bottleneck in block processing: rough measurements of block sync timing showed that this improved block processing speeds by about 5x (140 minutes to sync ~10k blocks versus 630 minutes).
Running perf and flamegraph again confirmed the elimination of this bottleneck, but it surfaced something else as well.
ClarityDB metadata labeling
When looking at a flamegraph, any particularly wide function in a given thread should be closely considered: should this function be expensive, and does it make sense for it to dominate the thread it occupies?

The above flamegraph shows that nearly the entire runtime of `process_next_nakamoto_block` is dominated by “committing” the ClarityDB changes. It makes sense that this could be expensive: after all, the database and MARF operations are usually the most costly part of executing Clarity code. However, this function doesn’t operate on the MARF index; rather, it is responsible for labeling all of the Clarity metadata associated with a given block.
This function calls the following query:
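Simplified, and with illustrative table and column names, the statement has roughly this shape: at commit time, the metadata rows written while the block was being processed get relabeled with the block’s final hash.

```rust
// Illustrative sketch of the labeling statement's shape; the actual table and
// column names in the ClarityDB differ. The WHERE clause filters on
// `blockhash`, so without an index on that column this is a full-table SCAN.
const LABEL_BLOCK_METADATA_SQL: &str = "
    UPDATE metadata_table
       SET blockhash = ?1
     WHERE blockhash = ?2";
```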
This looks straightforward. If there is an index for `blockhash`, this should be a SEARCH. And indeed, there was an index for `blockhash`. Unfortunately, when that index was added, database migration logic was not included: so if your chainstate was created before the index was added, your stacks-node would SCAN rather than SEARCH.
The fix here was to simply add the migration logic to the stacks-node: when reopening the ClarityDB, check if the index exists and create it if not (#6050). This fix yielded an additional 5x-10x speedup for block syncing over the previous performance improvement.
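A sketch of that kind of open-time migration, again with illustrative names, might look like this:

```rust
use rusqlite::Connection;

/// Illustrative sketch of an open-time index migration. CREATE INDEX IF NOT
/// EXISTS is idempotent, so it is safe to run on every (re)open: chainstates
/// created before the index existed get it backfilled, and newer chainstates
/// are unaffected.
fn ensure_metadata_blockhash_index(conn: &Connection) -> rusqlite::Result<()> {
    conn.execute_batch(
        "CREATE INDEX IF NOT EXISTS metadata_by_blockhash
             ON metadata_table(blockhash)",
    )
}
```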
Burnchain Operation Processing
The next candidate in the chains-coordinator thread was less obvious because its invocations were spread around the call stack:

Here we see the `SortitionHandle::descended_from` method dominating the runtime of three different functions in the chains-coordinator thread while the stacks-node is syncing. After eliminating the previous two bottlenecks, the `descended_from` invocations comprised about 70% of the runtime of the thread.
The `descended_from` function determines whether or not a given miner’s submitted block commit descended from the “PoX anchor block”. This check is known to be expensive: a given commit’s lineage needs to be searched step by step. This is linear in the distance to the PoX anchor block, and it needs to be performed for every block commit that miners submit.
Fortunately, the descendancy check is performed over immutable state, and subsequent checks usually retread the descendancy paths walked by prior invocations. This makes the algorithm very amenable to caching: if prior results are cached, then as soon as a later invocation reaches a previously checked ancestor, the cached result can be returned. This is a classic strategy for reducing an algorithm’s complexity called dynamic programming, and it worked well in this case. The fix introduced an LRU cache for the descendancy check (#6038), which sped up block synchronization by a further 2-3x.
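The actual change wraps the existing check with a bounded LRU cache; the sketch below shows the underlying memoization idea in simplified form, with a hypothetical `parent_of` lookup standing in for the sortition database.

```rust
use std::collections::HashMap;

/// Simplified sketch of memoizing a descendancy check against a fixed PoX
/// anchor block. The actual stacks-node fix (#6038) adds an LRU cache around
/// SortitionHandle::descended_from; this version uses an unbounded HashMap and
/// a hypothetical `parent_of` closure in place of the sortition database.
struct DescendancyCache<F: Fn(u64) -> Option<u64>> {
    anchor: u64,
    parent_of: F,
    /// block-commit ID -> "descends from `anchor`"
    memo: HashMap<u64, bool>,
}

impl<F: Fn(u64) -> Option<u64>> DescendancyCache<F> {
    fn descended_from(&mut self, commit: u64) -> bool {
        // Walk the commit's ancestry toward the anchor block. A real
        // implementation would also stop once the walk descends below the
        // anchor block's height.
        let mut path = Vec::new();
        let mut cur = commit;
        let result = loop {
            if cur == self.anchor {
                break true;
            }
            if let Some(&cached) = self.memo.get(&cur) {
                // A previous invocation already walked this ancestor's lineage,
                // so the rest of the walk can be skipped.
                break cached;
            }
            path.push(cur);
            match (self.parent_of)(cur) {
                Some(parent) => cur = parent,
                // Ran out of ancestors without reaching the anchor block.
                None => break false,
            }
        };
        // Record the answer for every block visited on this walk, so later
        // checks terminate as soon as they reach any of these ancestors.
        for visited in path {
            self.memo.insert(visited, result);
        }
        result
    }
}
```

Because block commits in the same reward cycle share long ancestry prefixes, most walks hit the cache after only a few steps.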
---
To follow ongoing updates to Stacks, check out the main GitHub repo for the Stacks core codebase.