Today, the Stacks Blockchain API has an availability of 99.97%. Let’s take a look at how we scaled a blockchain API over time to reach that achievement.
This is a part of an ongoing series covering Hiro’s infrastructure. You can read the introductory post here and one covering common infrastructure pitfalls here. Before we dive into a case study of how the Stacks API has evolved to show you how to scale a blockchain API, let’s first take a step back to basics.
Blockchain APIs serve two important functions:
At a high level, here is the architecture of a blockchain API, using the Stacks API as an example:
As you can see above, apps communicate with the API by sending web requests. Depending on the type of request, the API will route that request to a backend Stacks node (a node contributing to the blockchain itself) or that request will be routed to a backend database. Those request types are:
In both instances, the node or database responds to that request, and the API forwards that response back to the application. The term application is used loosely here: the “application” could be an actual application or it could be a developer interacting with the blockchain, a wallet, the Stacks Explorer, and more.
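The routing decision described above can be sketched in a few lines. This is a minimal illustration, not the API's actual router: it assumes that paths under /v2 go to the Stacks node and paths under /extended/v1 go to the database (which is how those endpoints are discussed later in this post), and the names `Backend` and `pick_backend` are hypothetical.

```python
from enum import Enum


class Backend(Enum):
    STACKS_NODE = "stacks-node"
    POSTGRES = "postgres"


def pick_backend(path: str) -> Backend:
    """Choose a backend from the URL path prefix (assumed routing rule)."""
    if path.startswith("/v2/"):
        # Assumption: /v2 requests are proxied to a Stacks node.
        return Backend.STACKS_NODE
    if path.startswith("/extended/v1/"):
        # Assumption: /extended/v1 requests are answered from Postgres.
        return Backend.POSTGRES
    raise ValueError(f"no backend configured for {path!r}")
```

Either way, the caller only ever talks to the API; which backend answered is invisible to the application.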
Responses from either backend are returned as JSON to conform to industry standards and developers’ expectations. Doing so helps developers connect their service to the API in a durable way. To learn more about the API, visit our documentation.
At Hiro, part of my work, alongside that of several teammates, focuses on optimizing the API for a few different factors:
Today, our efforts to optimize the API around these parameters have led to some important achievements:
Let’s look at how we got to these performance achievements.
When we first launched the Stacks API on January 14, 2021, we had a basic setup that worked great for low volumes of traffic. As you can see above, we had one Postgres database, one node, and one API servicing requests.
This met our immediate needs for launch and had tolerable response times for the low traffic we saw in the early days of the network. However, upgrading this setup required half an hour of downtime, sometimes longer. During that downtime, wallets and the Stacks Explorer didn’t work. That introduced a poor user experience as well as risk, so we decided to fix that.
The first improvement we made to the Stacks API was a simple one: we introduced two separate deployments of the API, named “Blue” and “Green.” You could name them whatever you want. We went with colors.
The importance of this setup is that traffic is only live in one environment at a time, and a reverse proxy directs traffic to the current live environment. Now, when we need to upgrade the API, if the Blue environment is live, we can upgrade the standby environment, Green. Once the upgrade is done, we switch the traffic from Blue to Green, so there is no API downtime in the update process. Blue keeps running the previous version for a few days, so if we uncover a critical bug, we can roll back quickly by switching traffic back to Blue. If no such bug appears, we upgrade Blue to match Green.
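To make the upgrade-then-switch flow concrete, here is a small sketch of the bookkeeping a reverse proxy needs for it. It is a toy model under stated assumptions, not Hiro's proxy configuration: the class `BlueGreenProxy`, its methods, and the version strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BlueGreenProxy:
    # Version currently deployed in each environment (illustrative values).
    versions: dict = field(
        default_factory=lambda: {"blue": "1.0.0", "green": "1.0.0"}
    )
    live: str = "blue"

    @property
    def standby(self) -> str:
        """The environment that is not receiving traffic."""
        return "green" if self.live == "blue" else "blue"

    def upgrade_standby(self, version: str) -> None:
        """Deploy a new version where no traffic is flowing."""
        self.versions[self.standby] = version

    def switch(self) -> None:
        """Point traffic at the other environment (also used to roll back)."""
        self.live = self.standby

    def route(self) -> str:
        """Describe where a request would be sent right now."""
        return f"{self.live}@{self.versions[self.live]}"
```

With this model, an upgrade is `upgrade_standby("1.1.0")` followed by `switch()`, and a rollback is simply another `switch()`, because the previous environment still runs the old version.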
This enables seamless API upgrades and protects against downtime. However, the /v2 response time slowed with increased traffic, and we saw contention over resources in the Stacks Node.
In this setup, the Stacks node is doing multiple jobs. Not only is it servicing backend requests, but it is participating in the blockchain. If it comes across events, it has to forward them to the API. These competing jobs fight over memory and CPU resources. That’s a problem. So we fixed it.
To combat the issue of competing jobs fighting over memory and CPU resources, we introduced proxy pools, a feature upgrade of the API. With proxy pools, we now have two types of nodes:
This architecture created less resource contention and significantly improved /v2 response times. It also enabled seamless node upgrades.
However, we still saw response times for the /extended/v1 endpoint slow with increased traffic, and the same was true for the API’s queries against the Postgres database. Time for another upgrade.
For the next upgrade, we introduced multiple API instances in both Blue and Green. Each API had its own dedicated database and Stacks Node, and traffic was distributed across them. We could have as many as we wanted.
It introduced complexity, but it met our needs and allowed us to scale for the short term. It improved response time for the /extended/v1 endpoint dramatically. It also improved availability: if one API became unavailable for any reason (database corruption, for example), other APIs could still service requests.
As for the downsides, it added complexity. Every time you wanted to upgrade the API, you had to upgrade each instance. This setup also reduced data consistency: each instance of the API had its own node broadcaster with its own mempool. Sometimes, if you refreshed the browser, you could get different data because the data hadn’t propagated across all APIs yet. A confusing experience for users.
It was also costly to run due to unused Stacks Nodes in the Stacks pool, which cannot be scaled up and down easily. If Blue is active, then Green is running similar tools that aren’t being used, wasting money until the traffic is switched.
The next major feature we introduced at this point was caching support. This was a big boon for the API. It meant you could create a file called .proxy-cache-control.json, and the API would pick it up. This file configures the API to support caching on specified paths.
For context, most HTTP requests sent and received include metadata called “headers.” A header is a key-value pair, which could simply provide some extra information about the request itself. This metadata also enables services like the API to change how it responds based on the presence or absence of a particular header.
We introduced support in the API for the “cache-control” header. This is actually a header sent from the API when responding to a request. This header tells the browser and Cloudflare (a proxy layer we use as a middleman for all API requests) to cache the response from the API for a specified duration of time. So, if the page was re-requested, the browser would simply fetch the page from its cache, or Cloudflare would immediately return the page if it had it in its cache.
To make this possible, we also reconfigured Cloudflare (a third-party layer that we use for caching and security) to serve as a proxy layer for the API, meaning we began routing all traffic through Cloudflare, and Cloudflare would pass that traffic on to the API. Previously, we used Cloudflare only as a DNS provider, and all traffic hit the blockchain API directly.
This change and the introduction of the “cache-control” header reduced the number of repeat requests being handled by the API and leveraged the caching abilities of both Cloudflare and users’ browsers. This config file allows any entity that’s running an API server to specify URLs which, when queried, respond with that special “cache-control” header in the response’s metadata. This signals to browsers and Cloudflare that it is OK to cache this page, as well as how long to hold on to that cached page before it expires and needs to be re-fetched from the API itself.
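As an illustration of how a path-based cache config can drive the “cache-control” header, here is a small sketch. The post does not show the contents of .proxy-cache-control.json, so the schema below (a "paths" object mapping a regular expression to a header value), the example paths, and the max-age values are assumptions made for this example; check the API documentation for the real format.

```python
import json
import re
from typing import Optional

# Hypothetical file contents; the real schema may differ.
EXAMPLE_CONFIG = """
{
  "paths": {
    "^/extended/v1/block/": "public, max-age=30",
    "^/v2/info$": "public, max-age=5"
  }
}
"""


def load_rules(raw: str) -> list:
    """Parse the config into (compiled pattern, header value) pairs."""
    paths = json.loads(raw)["paths"]
    return [(re.compile(pattern), value) for pattern, value in paths.items()]


def cache_control_for(path: str, rules: list) -> Optional[str]:
    """Return the cache-control value for a path, or None if uncached."""
    for pattern, value in rules:
        if pattern.search(path):
            return value
    return None
```

A server would call `cache_control_for` while building each response and, when it returns a value, attach it as the “cache-control” header so that browsers and the proxy layer know how long they may reuse the response.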
If you run an API, we encourage you to do this as well. It adds a big improvement to response time for the endpoint(s) specified in the file. You also get the additional benefit of improved resiliency to cyber threats. For example, if you have a Distributed Denial of Service (DDoS) attack, where someone sends you a flurry of requests in a short period of time, that load doesn’t entirely fall on the API itself. The caching layer may be able to take some of that burden and service some of the requests.
However, upgrading the API is a balance. Traffic continues to grow. We add an improvement, but response times can still slow. Scaling the API was still an arduous and delayed process. When a new release came out, it would sometimes take a week or longer to deploy that upgrade on Hiro’s infrastructure. That’s not sustainable in the long run. Time for another change.
What we implemented next was read-only APIs. This feature, along with many others, was spearheaded by two of my colleagues, Matt Little and Rafael Cárdenas. They introduced a second run mode, so the API can now run either as a writer API or as a read-only API.
This structural change allowed us to have automated scaling that can respond to traffic. As traffic scales, the number of read-only APIs automatically changes with it. As a result, this improved response time.
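The post does not describe the scaling rule itself, so the following is only a rough sketch of what "the number of read-only APIs changes with traffic" can look like. The function name, the per-instance capacity, and the minimum and maximum bounds are all hypothetical values chosen for illustration.

```python
import math


def desired_read_only_instances(
    requests_per_second: float,
    capacity_per_instance: float = 100.0,  # assumed capacity, not measured
    min_instances: int = 1,
    max_instances: int = 20,
) -> int:
    """Size the read-only pool to current traffic, within fixed bounds."""
    needed = math.ceil(requests_per_second / capacity_per_instance)
    return max(min_instances, min(max_instances, needed))
```

Because read-only instances hold no write responsibilities, adding or removing them in this way does not affect how events from the blockchain are ingested.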
This reduced complexity in the overall API architecture because we were able to go back to one database per deployment. It also increased data reliability because everyone is now reading the same database again. On each refresh, you get the same data.
However, this architecture introduced a new bottleneck: with so many APIs reaching out to the same Postgres database, as traffic scales, the new bottleneck becomes the database itself. Enter database replicas.
This feature is quite similar to read-only APIs. Now the API supports read-only databases. What this means is that the Postgres database has one primary database that receives write events and writes to itself. Then you can have as many replicas of that database as you want that follow the primary. The primary and replica databases share incoming queries, evenly distributing the load between all of them.
These replicas have identical sets of data and are all synced together, with low sync time between them.
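The split between one writable primary and many read replicas can be sketched as a tiny connection picker. This is a simplified model, assuming round-robin distribution of reads across the primary and its replicas; the class `DatabasePool` and the host names are hypothetical, and a real deployment would typically rely on a connection pooler or load balancer instead.

```python
import itertools


class DatabasePool:
    def __init__(self, primary: str, replicas: list):
        self.primary = primary
        # Reads rotate evenly over the primary and every replica.
        self._readers = itertools.cycle([primary, *replicas])

    def for_write(self) -> str:
        """Writes always go to the primary, which the replicas follow."""
        return self.primary

    def for_read(self) -> str:
        """Reads are spread across all databases in turn."""
        return next(self._readers)
```

For example, with a primary and two replicas, four consecutive reads land on the primary, the first replica, the second replica, and then the primary again, while every write goes to the primary.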
Voila! We now have a database that can scale up and down as needed with API demand. This improves availability, response time, and data reliability. Rather than maintaining multiple databases that update at slightly different times and have slightly different views of data, you can now have as many databases as you need, all with the exact same view of the data.
The next improvements we made to the API involved the caching layer. In particular, we enabled the API to keep track of the chain tip of the Stacks blockchain. If the API saw that the chain tip was at block 40,000, for example, it would track that information and send that same data every time until the blockchain advanced.
With this information, the API doesn’t need to reach out to the Postgres database for every request if the Stacks blockchain tip has not changed since the previous request. Instead, the API can respond immediately and tell the caching layer to serve the same response as the last time the page was requested. This allowed the API to handle a much higher volume of traffic.
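One standard way to express "nothing has changed since you last asked" over HTTP is the ETag and If-None-Match header pair. The sketch below uses that pair as a stand-in to show the idea; it is an assumption for illustration, not a description of the API's actual headers. The class `ChainTipCache`, the tip hashes, and the response body are hypothetical.

```python
from typing import Optional, Tuple


class ChainTipCache:
    def __init__(self, tip_hash: str):
        self.tip_hash = tip_hash
        self.db_queries = 0  # counts how often the database is consulted

    def on_new_block(self, tip_hash: str) -> None:
        """Called when the chain advances; invalidates old validators."""
        self.tip_hash = tip_hash

    def handle(self, if_none_match: Optional[str]) -> Tuple[int, dict, Optional[str]]:
        """Return (status, headers, body) for one request."""
        etag = f'"{self.tip_hash}"'
        if if_none_match == etag:
            # Tip unchanged: answer without touching the database.
            return 304, {"ETag": etag}, None
        self.db_queries += 1  # stand-in for a real Postgres query
        return 200, {"ETag": etag}, f"data as of tip {self.tip_hash}"
```

A client or proxy that stored the first response sends its ETag back on the next request; as long as the tip has not moved, the server replies 304 with no body and no database work, and the cached copy is reused.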
To implement this, we added a new header to the metadata sent from the API:
Around the same time, we also released two other improvements to the API, namely:
The work of scaling a blockchain API is never over. We’ve made a number of improvements, but there are always more improvements to make to give richer functionality and improved performance to meet increasing traffic demands. Some of the immediate improvements we are looking into include:
Looking further out, there are a number of improvements for the API that Hiro is exploring (but that does not mean we will do them!). Those initiatives include:
If these upcoming improvements to the API sound interesting, join the conversation on GitHub and give us feedback! We encourage community collaboration.