System Design 8 - URL shortener

Creating own tinyurl

Introduction

So, this has been fruitful! So far we’ve gone through so much stuff! But before I’ll do a recap of what we’ve learned, let us go and actually design a tool!

So far, we’ve been talking about designs involving algorithms. So let’s go back a bit and design a complete system from scratch!

This time, we’ll be taking a look at a URL shortener, trying to reuse some of the topics we’ve seen so far.

So, have you ever wondered how to create your own tinyurl? You’ve come to the right place.

Defining a scope

  • Any URL entered into the form and shortened can be sent to someone else and it opens on the same page
  • This includes all query and route parameters
  • 100 millions URLs are generated per day
  • The URL is as short as possible
  • The characters in the URL are [0-9], [a-z] and [A-Z]
  • The URL can’t be neither updated or deleted
  • The service will be running for 10 years

Back-of-the-envelope

So, we’ve defined a scope. Now, let’s try to estimate a little!

  • 100 millions URLs per day means 36.5 billions per year, 365 billions in 10 years
  • Average URL is 100 bytes long
  • 365 billions * 100 bytes ~ 36.5 trillion ~ 36.5 TB storage

That’s for storage. Now, 100 millions URLs are created per day. That means 100 millions of writes.

  • When a URL is created once, every subsequent creation is actually a read as it’s been generated
  • There will be way less writes than reads
  • Let’s say the ratio of reads will be 10 reads to 1 write.

To get QPS:

  • 100 millions / (60 60 24) ~ 1160 writes per second
  • 10 : 1 read/write ratio => ~ 11600 reads per second

So, to recap:

  • 36.5 TB storage
  • 1160 writes per second
  • 11600 reads per second

Now, the URL needs to be as small as possible.

  • The total amount of URLs is 365 billions
  • The charset available is [0-9], [a-z] and [A-Z]
  • There are 26 chars between a-z, meaning (2 * 26 + 10) = 52 + 10 = 62
  • We are using base 62 encoding

From there, we can calculate the maximum size of the short URL.

  • 62^n is less than 365 billions => log(62) 365 billions => 6.5
  • We round up because we’re defining number of characters => 7
  • The shortened URL size will be 7 characters

High-Level Design

So, we’re gonna do a URL shortener. Well, it’s basically 2 endpoints:

  • my-url-shortener.com/api/shorten
  • my-url-shortener.com/api/{shortUrl}

shorten endpoint:

The shorten endpoint will basically take a long URL that will be processed, and returns a short URL

  • POST request with longUrl: string body
  • Returns shortUrl: string
  • Somewhere in the backend generates and saves the shortUrl

And that’s it for this one. For the other endpoint, it’s more complicated

{shortUrl} endpoint

This is the one where a lot of the magic happens to the user. Because what we need to do here is force a redirect.

Luckily, there are some HTTP Codes that allow for it:

  • HTTP 301
    • 301 Moved Permanently
    • Used for permanent redirecting
    • Used e.g. when using http -> https
    • Redirects the user to a Location (in response headers)
  • HTTP 302
    • 302 Found
    • Also defined as 302 Moved Temporarily
    • Redirects the user to a Location (in response headers)

Now, the main difference between these two is that once 301 is returned the first time, it will no longer send the request to our shortening service directly. That’s what permanently means - it caches the last result of this call and returns it right away.

With the 302, the calls are still made to our service.

Again, it doesn’t mean that one is worse than another

  • If reducing server load is priority, use HTTP 301
  • If analytics is important, use HTTP 302 as you catch all requests

Now, what if the shortUrl doesn’t exist yet? Well, we can just return 404 as no such shortUrl yet exists.

Deep-dive

So, going deeper to the shortener, we’ll need to think a little about the shortening itself.

Now, there are a bunch of IDs that we can use for generating. In the previous chapter, we’ve mentioned UUIDs. But I’ve also tackled this topic here already.

Since the requirement is as short as possible, in the BOTE part, I’ve mentioned base 62 and 7 characters.

So, we’ll be using base 62. And how are we going to generate it? Well, we’ll have to use something. For simplicity, I’ll be using autoincrement in this section, so we can expect every new URL to be sent here to be incremented by a single ID. I also know that this isn’t the best approach as discussed in previous chapter. We could generate a new ID multiple ways and then convert that to base 62, but for simplicity, let’s not.

So, considering autoincrement, all new URLs to be generated will be increased by 1 in terms of ID. So how do we generate the ID for shortening service?

Well, we’ll convert it to base62:

  • The number 10 in base 2 is 1010
    • 1 2^3 + 0 2^2 + 1 2^1 + 0 2^0 => 8 + 0 + 2 + 0 => 10
  • The number 10 in base 10 is 10
    • 1 10^1 + 010^0 => 10 + 0 => 10
  • Same goes for any other base. For base 16, let’s consider FF
    • F is the last number (and base 16 goes from 0-15), so F is 15
    • 15 16^1 + 15 16^0 = 240 + 15 = 255
  • Doing the same approach, we can say the Z is 61 because Z is 61 * 62^0

So, this is how we will shorten the URL.

Now, we know what endpoints we will have. How will they look on the inside?

{shortUrl} endpoint

This one is fairly straightforward.

  • Find in DB if there is an item with {shortUrl}
  • If there is, return HTTP Code 301 with Location: {longUrl}
  • If there is not, return HTTP Code 404

So it’s basically a read operation! But - how will the write happen?

shorten endpoint:

We know that user sends us the URL he wants to shorten. So, what will need to happen is basically:

  • longUrl -> shorten() -> save -> return shortUrl

But, what if the URL has already been shortened? Well, we need to check for it first!

  • longUrl -> isInDB()
    • if true: return shortUrl from DB
    • if false: shorten() -> save -> return shortUrl

And the shorten itself? Well, as described above:

  • get count of DB items
  • create ID with id = count + 1
  • create base62 representation from ID
  • save id, shortUrl and longUrl to database

Performance

So, we have:

  • API endpoints
  • Shortening function
  • DB schema
  • Overview of how it’s all gonna work

Now, what about the performance? Well, there are 2 issues here:

In distributed environment (which this definitely will be with 100 millions URLs being generated daily), we have to account for distributed ID generation

The second thing is getting fast responses. This would be solved by using content delivery network/edge servers, load balancer, multiple servers, as well as caching and database

Summary

So, in this chapter, we’ve gone through URL shortener

  • We’ve gone through API endpoints
  • We’ve gone through shortening of the URL and creating own hashing
  • We’ve gone through the database schema

The things that would need more thought are again

  • Adding rate limiters
  • Faster responses by adding multiple servers and caching
  • Distributed ID generation due to multiple servers
  • CDN/Edge servers
  • And more

Now, also we might want to think more about other things

  • DB scaling
  • Analytics, monitoring
  • Availability, consistency, reliability

We’ve to an extent discussed all the individual parts of this. This is some space for your own thoughts. Chances are, if you made it here, it won’t be wrong.

References

SimProch logo