This is a part of an ongoing series covering Hiro’s infrastructure. You can read the introductory post here and one covering common infrastructure pitfalls here. Before we dive into a case study of how the Stacks API has evolved to show you how to scale a blockchain API, let’s first take a step back to basics.
What Is a Blockchain API?
Blockchain APIs serve two important functions:
- They provide data about the blockchain. Accounts, transactions, smart contracts, NFT data: anything you may want to query about the blockchain, you query via a blockchain API.
- They accept new transactions. A blockchain API is the gateway for developers looking to submit new transactions to the blockchain.
At a high level, here is the architecture of a blockchain API, using the Stacks API as an example:
As you can see above, apps communicate with the API by sending web requests. Depending on the type of request, the API will route that request to a backend Stacks node (a node contributing to the blockchain itself) or that request will be routed to a backend database. Those request types are:
- /v2/: requests that are proxied directly to a backend Stacks Node pool. When the API receives an HTTP request on a URL with “/v2/” in its path, the API forwards that request to a Stacks Node. This path handles requests that retrieve data (GET requests) and requests that submit new data, like new transactions (POST requests).
- /extended/: These requests do not get forwarded to a backend Stacks Node. Instead, the API will do some processing to figure out how to query the Postgres database to get the relevant data and return it. Since this path does not support submitting data, it can only handle requests to retrieve data (GET requests).
In both instances, the node or database responds to that request, and the API forwards that response back to the application. The term application is used loosely here: the “application” could be an actual application or it could be a developer interacting with the blockchain, a wallet, the Stacks Explorer, and more.
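As a minimal sketch of the routing described above (the names and paths here are illustrative, not the actual Stacks API implementation), the decision boils down to inspecting the request path:

```typescript
// Simplified sketch of the API's path-based routing.
// "stacks-node" = proxy to a backend Stacks node; "postgres" = query the database.
type Backend = "stacks-node" | "postgres";

function routeRequest(path: string): Backend {
  if (path.startsWith("/v2/")) {
    // /v2/ requests go straight to a backend Stacks node (GET and POST)
    return "stacks-node";
  }
  if (path.startsWith("/extended/")) {
    // /extended/ requests are answered from the Postgres database (GET only)
    return "postgres";
  }
  throw new Error(`Unrecognized path: ${path}`);
}
```

For example, `routeRequest("/v2/transactions")` resolves to the Stacks node pool, while `routeRequest("/extended/v1/tx")` resolves to the database.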
The data returned from either endpoint is returned as JSON to conform to industry standards and expectations from other developers. Doing so helps developers connect their service to the API in a durable way. To learn more about the API, visit our documentation and watch Hiro engineer Rafael Cárdenas talk about the Stacks API roadmap:
What the Stacks API Is Optimized For
At Hiro, some of my work, along with that of several other team members, focuses on optimizing the API for a few different factors:
- Availability: when you try to access the Stacks API, you can be confident it will respond.
- Reliability: when you receive data from the API, you can be confident that the data is the same every time you refresh your browser or open your wallet.
- Response time: when you use your wallet or access the Stacks Explorer, you can be confident that you will get a quick response from the API.
Today, our efforts to optimize the API around these parameters have led to some important achievements:
- 99.97% availability over the last 30 days
- 400M+ monthly requests
- 50 millisecond response time for common endpoints
Let’s look at how we got to these performance achievements.
The API on Launch Day
When we first launched the Stacks API on January 14, 2021, we had a basic setup that worked great for low volumes of traffic. As you can see above, we had one Postgres database, one node, and one API servicing requests.
This met our immediate needs for launch and had tolerable response times for the low traffic we saw in the early days of the network. However, upgrading this API setup required a half hour of downtime, sometimes even longer. During that downtime, wallets and the Stacks Explorer didn’t work. That introduced both a poor user experience and risk, so we decided to fix it.
Blue/Green API Architecture
The first improvement we made to the Stacks API was a simple one: we introduced two separate deployments of the API, named “Blue” and “Green.” You could name them whatever you want. We went with colors.
The importance of this setup is that traffic is only live in one environment at a time, and a reverse proxy directs traffic to the current live environment. Now, when we need to upgrade the API, if the Blue environment is live, we upgrade the standby environment, Green. Once the upgrade is done, we switch the traffic from Blue to Green, so there is no API downtime in the update process. Then, a few days later, if we haven’t uncovered a critical bug that requires us to roll back quickly by switching traffic back to Blue (still running the previous version), we upgrade Blue to match Green.
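For illustration, a blue/green cutover like this can be expressed in a reverse proxy config. The post doesn't say which proxy Hiro uses; here is a hypothetical nginx sketch with invented addresses and ports:

```nginx
# Hypothetical nginx config: only the live environment receives traffic.
upstream api_blue  { server 10.0.0.10:3999; }   # live environment
upstream api_green { server 10.0.0.11:3999; }   # standby, safe to upgrade

server {
    listen 80;
    location / {
        # To cut over, change this one line to api_green and reload nginx.
        proxy_pass http://api_blue;
    }
}
```

Because the standby environment receives no traffic, it can be upgraded and verified at leisure before the switch.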
This enables seamless API upgrades and protects against downtime. However, the /v2 response time slowed with increased traffic, and we saw contention over resources in the Stacks Node.
In this setup, the Stacks node is doing multiple jobs. Not only is it servicing backend requests, but it is participating in the blockchain. If it comes across events, it has to forward them to the API. These competing jobs fight over memory and CPU resources. That’s a problem. So we fixed it.
Introducing Proxy Pools
To combat the issue of competing jobs fighting over memory and CPU resources, we introduced proxy pools, a feature upgrade of the API. With proxy pools, we now have two types of nodes:
- Broadcaster nodes: nodes that are dedicated to following the blockchain and sending events to the API
- Node pools: separate addresses pointing to a collection of Stacks nodes, dedicated to servicing backend requests on the /v2 endpoint
This architecture created less resource contention and significantly improved /v2 response times. It also enabled seamless node upgrades.
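One common way to realize such a node pool is a load-balanced upstream in the reverse proxy. This is a hypothetical sketch with invented addresses, not Hiro's actual configuration:

```nginx
# Hypothetical: /v2/ requests are balanced across a pool of Stacks nodes.
# The broadcaster node (not listed here) only feeds events to the API.
upstream stacks_node_pool {
    server 10.0.1.10:20443;
    server 10.0.1.11:20443;
    server 10.0.1.12:20443;
}

server {
    listen 80;
    location /v2/ {
        proxy_pass http://stacks_node_pool;
    }
}
```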
However, we still saw response times for the /extended/v1 endpoint slow with increased traffic, and the same was true for the API’s access to the Postgres database. Time for another upgrade.
Scale Up APIs
For the next upgrade, we introduced multiple API instances in both Blue and Green. Each API had its own dedicated database and Stacks Node, and traffic was distributed across them. We could have as many as we wanted.
It introduced complexity, but it met our needs and allowed us to scale for the short term. It improved response time for the /extended/ endpoint dramatically. It also improved availability: if one API became unavailable for any reason (database corruption, for example), other APIs could still service requests.
As for the downsides, it added complexity. Every time you wanted to upgrade the API, you had to upgrade each instance. This setup also reduced data consistency: each instance of the API had its own node broadcaster with its own mempool. Sometimes, if you refreshed the browser, you could get different data because the data hadn’t propagated across all APIs yet. A confusing experience for users.
It was also costly to run due to unused Stacks Nodes in the Stacks pool, which cannot be scaled up and down easily. If Blue is active, then Green is running similar tools that aren’t being used, wasting money until the traffic is switched.
The next major feature we introduced at this point was caching support. This was a big boon for the API. It means you can create a file called .proxy-cache-control.json, and the API will pick it up. This file configures the API to support caching on specified paths.
For context, most HTTP requests sent and received include metadata called “headers.” A header is a key-value pair, which could simply provide some extra information about the request itself. This metadata also enables services like the API to change how it responds based on the presence or absence of a particular header.
We introduced support in the API for the “cache-control” header. This is actually a header sent from the API when responding to a request. This header tells our browser and Cloudflare (a proxy layer we use as a middleman for all API requests) to cache the response from the API for a specified duration of time. So, if the page was re-requested, the browser would simply fetch the page from its cache, or Cloudflare would immediately return the page if it had it in its cache.
To make this possible, we also reconfigured Cloudflare (a 3rd-party layer that we use for caching and security) to serve as a proxy layer for the API, meaning we began routing all traffic through Cloudflare, and Cloudflare would pass that traffic on to the API. Previously, Cloudflare was simply acting as a DNS provider for its simplicity, and all traffic was hitting the blockchain API directly.
This change and the introduction of the “cache-control” header reduced the number of repeat requests handled by the API and leveraged the caching abilities of both Cloudflare and users’ browsers. This config file allows any entity running an API server to specify URLs which, when queried, respond with that special “cache-control” header in the response’s metadata. This signals to browsers and Cloudflare that it is ok to cache this page, as well as how long to hold on to that cached page before it expires and needs to be re-fetched from the API itself.
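As an illustrative sketch, the file maps URL paths to Cache-Control values. The exact schema is documented in the stacks-blockchain-api repository; the paths and durations below are invented for illustration:

```json
{
  "/v2/pox": "public, max-age=30",
  "/v2/info": "public, max-age=10"
}
```

With a mapping like this, a response for `/v2/pox` would carry `Cache-Control: public, max-age=30`, telling browsers and Cloudflare they may serve a cached copy for up to 30 seconds.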
If you run an API, we encourage you to do this as well. It significantly improves response time for the endpoint(s) specified in the file. You also get the additional benefit of improved resiliency to cyber threats. For example, in a Distributed Denial of Service (DDoS) attack, where someone sends you a flurry of requests in a short period of time, that load doesn’t fall entirely on the API itself. The caching layer may be able to take some of that burden and service some of the requests.
However, upgrading the API is a balance. Traffic continues to grow. We add an improvement, but response times can still slow. Scaling the API was still an arduous and delayed process. When a new release came out, sometimes it would take upwards of a week or longer to deploy that upgrade on Hiro’s infrastructure. That’s not sustainable in the long run…time for another change.
Read Only APIs
What we implemented next was read-only APIs. This feature, along with many others, was spearheaded by two of my colleagues, Matt Little and Rafael Cárdenas. What they introduced is a second run mode: a writer API and a read-only API.
- Writer API: this API does not receive internet traffic and has only one job: receiving events from the Stacks Node Broadcaster and writing them to the Postgres database.
- Read-only API: this API is more easily able to scale up and down in response to traffic, and you can deploy as many as you want. All read-only APIs read from the same database.
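As a sketch of how such a split deployment might be configured (the variable name and values below are assumptions for illustration, not confirmed by this post):

```
# Hypothetical environment configuration for the two run modes:

# Writer instance: ingests events from the broadcaster, no public traffic
STACKS_API_MODE=writeonly

# Read-only instances: serve public traffic, scale horizontally as needed
STACKS_API_MODE=readonly
```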
This structural change allowed us to have automated scaling that can respond to traffic. As traffic scales, the number of read-only APIs automatically changes with it. As a result, this improved response time.
This reduced complexity in the overall API architecture because we were able to go back to one database per deployment. It also increased data reliability because everyone is now reading the same database again. On each refresh, you get the same data.
However, this architecture introduced a new bottleneck: with so many APIs reaching out to the same Postgres database, as traffic scales, the new bottleneck becomes the database itself. Enter database replicas.
Database Read Replicas
This feature is quite similar to read-only APIs: the API now supports read-only databases. The Postgres setup has one primary database that receives all write events, and you can have as many replicas of that database as you want, each following the primary. Incoming queries are distributed evenly across the primary and the replicas, sharing the load between all of them.
These replicas have identical sets of data and are all synced together, with low sync time between them.
Voila! We now have a database layer that can scale up and down as needed with API demand. This improves availability, response time, and data reliability. Rather than maintaining multiple databases that update at slightly different times and have slightly different views of the data, you can now have any number of databases that all share the same exact view of the data.
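In stock Postgres, this arrangement corresponds to streaming replication. A minimal sketch of the standby side (host name and user are invented; Hiro's actual setup may differ):

```
# postgresql.conf on a hypothetical read replica

# Allow the standby to serve read-only queries while it replays WAL
hot_standby = on

# Tell the standby where to stream write-ahead-log changes from
primary_conninfo = 'host=primary-db.internal port=5432 user=replicator'
```

The primary accepts all writes; each replica continuously replays the primary's changes, which is what keeps the sync lag between them low.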
Advanced Caching, Rate-Limiting, and Firewalls
The next improvements we made to the API involved the caching layer. In particular, we enabled the API to keep track of the chain tip of the Stacks blockchain. If the API saw that the chain tip was at block 40,000, for example, it would track that information and keep serving the same data until the blockchain advanced.
With this information, the API doesn’t need to reach out to the Postgres database for every request if the Stacks chain tip has not changed since the previous request. Instead, the API can respond immediately and instruct the caching layer to serve the same response it returned the last time the page was requested. This allowed the API to handle a much higher volume of traffic.
To implement this, we added a new header to the metadata sent from the API:
- etag: a header containing the hash of the most recent chain tip. Until a new block in the Stacks blockchain is mined, the etag will return the same hash.
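This mechanism rides on the standard HTTP ETag / If-None-Match exchange: the client echoes the last etag it saw, and if it still matches the chain tip, the API can answer with 304 Not Modified without touching Postgres. A simplified sketch (not the actual API code):

```typescript
// Simplified sketch of chain-tip-based ETag handling (illustrative only).
interface CachedResponse {
  status: number;   // 200 with a fresh body, or 304 Not Modified
  etag: string;     // hash of the current chain tip
}

function handleRequest(chainTipHash: string, ifNoneMatch?: string): CachedResponse {
  if (ifNoneMatch === chainTipHash) {
    // Chain tip unchanged since the client last asked: skip the database
    // entirely and let the caller reuse its cached copy.
    return { status: 304, etag: chainTipHash };
  }
  // Chain tip advanced (or first request): query the database and return
  // a fresh response tagged with the new chain tip hash.
  return { status: 200, etag: chainTipHash };
}
```

A browser or Cloudflare that previously saw etag `abc123` resends it as `If-None-Match: abc123`; until a new block is mined, every such request is answered from cache.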
Around the same time, we also released two other improvements to the API, namely:
- Rate limiting: this allowed us to rate limit anyone that attempts to abuse the API, improving its overall availability.
- Cyber threat mitigation: if anyone tried to access a malicious URL or conduct unexpected activity on the API, they would be denied access immediately through the implementation of new firewalls.
Future Improvements for the API
The work of scaling a blockchain API is never over. We’ve made a number of improvements, but there are always more improvements to make to give richer functionality and improved performance to meet increasing traffic demands. Some of the immediate improvements we are looking into include:
- SQL query & index improvements: We plan to make performance improvements so that the Postgres database spends fewer cycles looking for data and responds to requests more quickly.
- Event-replay optimization: when there is a large upgrade for the API, we want to improve the time it takes for the API to get back up to speed and working again.
- Improved CPU performance: when the API connects to the Postgres database or responds to requests, we want to make some performance improvements, so that it does these things more efficiently.
Looking further out, there are a number of improvements for the API that Hiro is exploring (but that does not mean we will do them!). Those initiatives include:
- Microservices: The API does a lot of jobs right now. It’s more than a proxy that forwards requests: the API performs a lot of data logic and processing on its end too. As the blockchain gets more advanced, so will the API, and we only see that data logic increasing. As a potential next step, we are looking into microservices to separate concerns. This would allow us to scale specific parts of the API as needed. Right now, the API is all-inclusive: if you need to scale one thing, you have to scale everything. Microservices would help with availability and resiliency and make it a more cloud-native application.
- Reduce deployment friction: We want to make it easier for developers and other entities to run their own instance of the API. We are looking at push-button deployments where developers can run a single line in their CLI or push a button on a site and launch their own API. They would also have an intuitive, guided setup, and we want to make it easier to manage the life cycle of the API.
If these upcoming improvements to the API sound interesting, join the conversation on GitHub and give us feedback! We encourage community collaboration.
Looking to jumpstart your journey into Web3? Download our free guide to developing smart contracts.