This fourth part deals with creating a rate limiter.
So, you want to be a rate limiter, eh? Do you even know what that means?
Because until recently, I didn’t! So, let’s go ahead and see what a rate limiter is and why we want one.
So, imagine you’re a company, say Twitter (or X nowadays). Now, you allow users to create a lot of tweets. But what happens if someone is way too enthusiastic about your product and wants to use it maybe a little too much? Or, if there is someone malicious who wants to cause you harm?
Security is a big topic and a rate limiter doesn’t deal with the worst issues, but it can help us set some limits for our users.
Now, we’ve played around with Twitter estimations in the previous chapter - specifically queries per second and the amount of data we store. Well, let’s take that example again:
> - 300 million monthly active users
> - 50% use Twitter daily
> - Users post twice a day
> - 10% of the tweets contain media
> - The data is stored for 5 years
Now, we have defined that users post twice a day and that we have 150 million daily active users. That’s 300 million tweets a day.
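To put that into perspective, here is the same back-of-the-envelope math in a few lines of Python, using only the numbers from the quote above:

```python
monthly_active_users = 300_000_000
daily_active_users = monthly_active_users * 0.5          # 150 million
tweets_per_day = daily_active_users * 2                   # 300 million tweets per day
average_writes_per_second = tweets_per_day / (24 * 3600)
print(f"{average_writes_per_second:.0f} tweets/second on average")  # ~3472
```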
Because creating a post is just an HTTP call to a service with an auth token and some payload, a user could create one post, copy the HTTP request, and use cURL to create 300 million requests. That user alone can effectively break our estimated design: our services would be performing far too many operations, and users would be unable to use our product due to a denial of service.
But even if users being unable to use it wasn’t our biggest concern, we would still face many issues - using too much storage, having more services running, and basically raising our service costs.
So, a rate limiter effectively helps us with the following:
Now, to get back to the example above, a rate limiter could be applied to the third point in the quote. If we expect a user to post twice a day, we can enforce that expectation by allowing only two posts a day. Or, we can still expect them to post twice a day, but set the limit at 5.
To see specific cases of rate limiting, see for example the Twitter docs, which define different limits for different types of accounts. Another case is the CAPTCHAs Google shows you if you try to make too many requests.
Finally, we can rate limit on almost anything. Just to name a few:
So, now that we know why rate limiters are good, let’s start thinking about a specific example. As is the case with system design, let’s first put some requirements here:
Now, to add to these, let’s think about 2 more cases. When you’re using Google a lot, you probably understand why you’re getting CAPTCHAs. It is imperative for users to know why their requests are not working.
Finally, we can have 2 types of rate limiters - centralized or distributed. Imagine that you’re developing Twitter. As we mentioned before, we can have servers in multiple places - the EU, the US, anywhere - so this system is distributed. In contrast, if you only have servers in the EU, you probably don’t need to care about other servers. So, let’s add two more rules. To recap:
Now that we have that defined, let’s continue with high level design!
So, we have a set of requirements. Now, where do we put the rate limiter? It can live either on the client or on the server.
Generally speaking, you never want the rate limiter on the client. You may want to disable things on the client as well, but savvy users can get around it.
So, scratching the client-side limiting, we’re left with a server implementation. But which one? Because we again have 2 options:
In both cases, the flow will be something like:
With cloud microservices, rate limiting is usually implemented in an API gateway. This service allows not only rate limiting, but also SSL termination, authentication, IP whitelisting, and more. This is effectively the middleware option.
So, the question to ask is - do we use an API gateway or our own server? Well, as is the case with everything, there is no simple answer. Let’s put a couple of points forward to at least have an idea:
So, simply speaking - evaluate if it’s efficient and worth the company resources to implement it ourselves.
So, even though this is about designing a rate limiter, it’s helpful to see a bunch of algorithms. I won’t write full production implementations here, but to give an idea (with small sketches along the way):
With the sources above, if you want to, you can deep dive into each of them. To keep this part simple, I’ll do a quick summary.
The token bucket is probably the simplest of the ones we’ll be covering. It’s widely used and easy to implement. In short:
Note that the 5 is an arbitrary number. There can be a thousand tokens available.
So, imagine the following design:
That means:
This algorithm is easy to implement and memory efficient. It allows short-term bursts of traffic, but long-term traffic has a hard limit.
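To make this concrete, here is a minimal in-memory sketch of a token bucket. It’s a sketch only - the capacity of 5 and the refill rate of 1 token per second are placeholder numbers, and a real limiter would keep one bucket per user or per API key:

```python
import time

class TokenBucket:
    def __init__(self, capacity=5, refill_rate=1.0):
        self.capacity = capacity            # maximum number of tokens in the bucket
        self.refill_rate = refill_rate      # tokens added back per second
        self.tokens = float(capacity)       # start with a full bucket
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time that has passed, never exceeding the capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1                # this request consumes one token
            return True
        return False                        # bucket empty -> reject (e.g. HTTP 429)
```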
The leaking bucket is very similar to the token bucket. However, there is a slight difference - rather than taking tokens from a bucket, we put requests into a queue processed at a fixed rate. To keep it simple, let’s say the fixed rate is 2 requests per second. So, what does that mean?
Imagine the same scenario as above:
So, this is where the 2 differ. With the leaking bucket, we allow the requests, but we do not handle them right away - we handle them at a fixed rate.
This is where the leaking bucket is different. While with the token bucket the other 8 would get `Too Many Requests`, in this case that’s not going to happen. So what will happen?
Now, it may look like the difference between the token bucket and the leaking bucket is small - we’re just using a queue instead of a bucket, right?
Well, consider the following example:
The main difference is in the last 2 points - the leaking bucket allows for a stable outflow of requests.
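A minimal sketch of the idea - the queue size of 10 is a placeholder, the 2-requests-per-second outflow matches the example above, and a real implementation would run the draining loop in a background worker:

```python
import time
from collections import deque

class LeakingBucket:
    def __init__(self, queue_size=10, leak_rate=2.0):
        self.queue = deque()
        self.queue_size = queue_size        # how many requests may wait in the queue
        self.leak_rate = leak_rate          # requests processed per second

    def allow(self, request) -> bool:
        if len(self.queue) >= self.queue_size:
            return False                    # queue is full -> reject the request
        self.queue.append(request)          # otherwise it waits for its turn
        return True

    def run(self, process):
        # A worker drains the queue at the fixed rate, regardless of bursts.
        while True:
            if self.queue:
                process(self.queue.popleft())
            time.sleep(1.0 / self.leak_rate)
```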
Rather than working with data structures, the fixed window counter works with time. Again, we want to limit the number of requests. How do we do that with time?
Let’s put this into numbers yet again:
So, what we can effectively do is send 20 requests (or fewer) in any of these windows. We do not fill a queue with old requests, and we allow for some traffic bursts. However, instead of refilling (as we did with the token bucket), we reset the counter entirely when a new window starts.
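A sketch of a fixed window counter, assuming a limit of 20 requests per 60-second window per user (the window length is an assumption, and clean-up of old window keys is omitted for brevity):

```python
import time
from collections import defaultdict

class FixedWindowCounter:
    def __init__(self, limit=20, window_seconds=60):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counters = defaultdict(int)    # (user, window index) -> requests seen

    def allow(self, user_id: str) -> bool:
        window = int(time.time() // self.window_seconds)  # same value for the whole window
        key = (user_id, window)
        if self.counters[key] >= self.limit:
            return False                    # window is already full -> reject
        self.counters[key] += 1             # old windows are simply never read again
        return True
```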
There’s one major downside to this. By allowing the full quota anywhere within a time window, there’s a chance of overloading our system. Specifically, a burst of traffic at the edge of two windows can let through up to twice the limit - for example, 20 requests in the last second of one window and another 20 in the first second of the next one.
So, previously we had the fixed window counter and identified a big issue with that algorithm. Let’s try to fix it here.
Instead of a fixed window, we can have a sliding window. How do we achieve that? Well, this time, we need to keep a little more information about each request.
So, what we will do is:
Let’s consider the previous example again, except now the first request defines the time window, and we allow 20 requests within that window.
So, this is a little complex.
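A sketch of a sliding window log, again assuming 20 requests per 60 seconds; this particular variant only logs accepted requests and always looks back one full window from the current moment:

```python
import time
from collections import deque

class SlidingWindowLog:
    def __init__(self, limit=20, window_seconds=60):
        self.limit = limit
        self.window_seconds = window_seconds
        self.log = deque()                  # timestamps of recent accepted requests

    def allow(self) -> bool:
        now = time.time()
        # Drop timestamps that have slid out of the window.
        while self.log and self.log[0] <= now - self.window_seconds:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False                    # still 20 requests within the last window
        self.log.append(now)
        return True
```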
So, previously we took a look at an exact window measurement. What we’ll do now is very similar: we will try to combine the sliding window log with the fixed time window. To do that, we’ll use the power of math! And also approximation.
Consider the following example:
This last algorithm is memory efficient, because it does not store individual requests - only the counters for the current and previous window. It also allows for only small traffic spikes.
However, it is not exact. We should use it only if we can live with the approximation.
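As a worked example with illustrative numbers: if the previous 60-second window saw 12 requests and we are 30% into the current window with 5 requests so far, the estimate is 12 × 0.7 + 5 = 13.4 requests, which is still under a limit of 20. A sketch (the limit and window length are assumptions):

```python
import time

class SlidingWindowCounter:
    def __init__(self, limit=20, window_seconds=60):
        self.limit = limit
        self.window_seconds = window_seconds
        self.current_window = None
        self.current_count = 0
        self.previous_count = 0

    def allow(self) -> bool:
        now = time.time()
        window = int(now // self.window_seconds)
        if window != self.current_window:
            # Roll over: the current window becomes the previous one (or 0 if we skipped ahead).
            self.previous_count = self.current_count if window == (self.current_window or 0) + 1 else 0
            self.current_count = 0
            self.current_window = window
        # Weight the previous window by how much of it still overlaps the sliding window.
        elapsed_fraction = (now % self.window_seconds) / self.window_seconds
        estimate = self.previous_count * (1 - elapsed_fraction) + self.current_count
        if estimate >= self.limit:
            return False
        self.current_count += 1
        return True
```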
So, the high level architecture is actually really simple. Let’s review the requirements:
- Limit excessive requests accurately
- Low latency
- Little memory usage
- High fault tolerance
- Exception handling (clear reasoning to users)
- Distributed rate limiting (shared across servers)
So, the first things to tackle are low latency and little memory usage. Low latency immediately implies that we don’t want to store things in a database, due to how slow disk reads are.
So, we’ll use a cache (Redis). Other than low latency, it also offers time-based expiration strategies, making it ideal for rate limiting.
Finally, we’ll go with middleware. That means:
And that’s it! We have a basic architecture.
So, we have already tackled the low latency. Now, let’s see how we handle the excessive requests, and what we do with those that were dropped.
First, we want to set up some rate limiting rules. They are usually saved on disk. We can see an example of Lyft’s rate limiting on GitHub.
Specifically, we can see some examples in the `config.yaml` files:
```yaml
domain: mongo_cps
descriptors:
  - key: database
    value: users
    rate_limit:
      unit: second
      requests_per_unit: 500

  - key: database
    value: default
    rate_limit:
      unit: second
      requests_per_unit: 500
```
So, what we can do is write similar config files, which we tune depending on our use case. Furthermore, we can easily limit multiple domains with them.
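As a rough illustration of how such a file could be consumed by our own rate limiter (this is not Lyft’s implementation - the file name, lookup key, and helper below are assumptions for the sketch):

```python
import yaml  # PyYAML

def load_limits(path="config.yaml"):
    """Index the rate limit rules by (domain, descriptor key, descriptor value)."""
    with open(path) as f:
        config = yaml.safe_load(f)
    limits = {}
    for descriptor in config.get("descriptors", []):
        rule = descriptor["rate_limit"]
        limits[(config["domain"], descriptor["key"], descriptor["value"])] = (
            rule["requests_per_unit"],
            rule["unit"],
        )
    return limits

limits = load_limits()
print(limits[("mongo_cps", "database", "users")])   # -> (500, 'second')
```

The workers mentioned below would simply re-run something like `load_limits()` on an interval and push the result into the cache.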
Now, what do we do when someone exceeds the rate limit? Well, we’ve already discussed the HTTP Code 429. But, that’s just a code. How can we let the user know something more meaningful?
Well, there’s a convention of using `Ratelimit` headers, specifically:

- `X-Ratelimit-Remaining` - the number of allowed requests remaining until the next window
- `X-Ratelimit-Limit` - the total number of allowed requests in a window
- `X-Ratelimit-Retry-After` - the number of seconds to wait until a request can be made again without being throttled

By using these headers, we can then easily show the user a meaningful message, such as “Sorry, you’ve reached the request limit, try again in X minutes”.
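For illustration, a rejected request could be answered roughly like this (a sketch - the helper name and the way the window reset time is obtained are assumptions):

```python
import time

def build_rejection(limit, window_reset_epoch):
    """Return the status code and headers for a throttled request."""
    retry_after = max(0, int(window_reset_epoch - time.time()))
    headers = {
        "X-Ratelimit-Limit": str(limit),              # total requests allowed per window
        "X-Ratelimit-Remaining": "0",                 # nothing left in the current window
        "X-Ratelimit-Retry-After": str(retry_after),  # seconds until the client may retry
    }
    return 429, headers
```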
So, let’s review the things we need to do and see how we would deal with them:
- Limit excessive requests accurately
- ~~Low latency~~ - We use cache rather than DB
- ~~Little memory usage~~ - We use cache rather than DB
- High fault tolerance
- ~~Exception handling~~ - We return proper headers
- Distributed rate limiting (shared across servers)
So, let’s continue with the first one. Since we are using Redis (which can keep timestamps efficiently), we can solve this fairly easily - what algorithm is the most exact and also happens to use timestamps? That’s right, the sliding window log! Another thing to tick off our list!
Now, before we continue with fault tolerance and distributed rate limiting, let’s revisit what we have so far in a diagram:
Above we can see what we have right now:
- Rate limiting rules are stored in a `yaml` file on disk. These rules are loaded into the cache at regular intervals by workers.

Now, we have 2 last issues to tackle - the distributed environment and high fault tolerance.
Having a rate limiter on a single server is easy, and we could just stop here, but a distributed environment is challenging and will change our approach to high fault tolerance.
So, what are the main issues? Well, as is almost always the case with a distributed environment: race conditions and synchronization.
So, let’s start with race conditions:
Consider the following example: two requests read the current counter at (almost) the same time, and each checks whether `currentValue + 1` is less than or equal to the maximum. In this case, both requests would be accepted, because they both assume that the value is 4. However, in reality, it is 5.

Locking the counter while the cache is being manipulated is the most obvious solution. However, locks may significantly slow down the system. Other than locks, we can wrap the writes in a script, or we can use sorted sets in Redis.
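A sketch of the sorted-set approach with the `redis-py` client (the key naming and the 20-per-60-seconds limit are illustrative; the pipeline keeps the clean-up, insert, and count together so two concurrent requests can’t both act on the same stale count):

```python
import time
import redis

r = redis.Redis()

def allow(user_id, limit=20, window_seconds=60):
    key = f"ratelimit:{user_id}"                         # one sorted set per user
    now = time.time()
    member = f"{now}-{time.monotonic_ns()}"              # unique member even for identical timestamps
    pipe = r.pipeline()                                  # MULTI/EXEC transaction by default
    pipe.zremrangebyscore(key, 0, now - window_seconds)  # drop timestamps outside the window
    pipe.zadd(key, {member: now})                        # log this request
    pipe.zcard(key)                                      # count requests in the window
    pipe.expire(key, window_seconds)                     # let idle keys expire
    _, _, count, _ = pipe.execute()
    return count <= limit
```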
Now, for the synchronization issue: if we run multiple rate limiter instances, they all need to see the same counters. The simplest way to achieve that is to keep the rate limiting state in a shared, centralized data store (our Redis cache) instead of in the memory of each instance.
So, regarding the distributed environment - we’ve identified 2 main issues. Synchronization and race conditions. In this case, we’ve:
(We didn’t really choose them, but hey, at least we have an answer!)
Finally, the last part is high fault tolerance. To deal with that, I opt to avoid failures as much as possible. We can do that with 2 things:
So, in this part, we’ve thought through how to design a rate limiter, and we’ve answered some questions. For our use case:
- ~~Limit excessive requests accurately~~ - We used a proper algorithm; in our case, the sliding window log was useful
- ~~Low latency~~ - We use cache rather than DB (Redis)
- ~~Little memory usage~~ - We use cache rather than DB (Redis)
- ~~High fault tolerance~~ - We optimize performance and monitor properly so that we can adjust the system
- ~~Exception handling~~ - We return proper headers upon which the client can act
- ~~Distributed rate limiting~~ - We've defined a strategy for race conditions (sorted sets) and for synchronization (shared data source)
Other than the system design itself, we’ve gone through a couple of algorithms - 5 of them, to be exact.
Some final thoughts to be taken from here: