System Design 16 - Afterword

The road so far

Introduction

So, we did it. Together we’ve gone through a massive book and I tried to explain it using my own words.

We’ve gone through a lot of stuff.

  • From rate limiting and a lot of algorithms
  • Through consistent hashing and creating own ID creation in distributed ID generation and URL shortening
  • Through notification and chat services, using third party APIs and websockets
  • Through web crawlers, understanding processing of a lot of data
  • Learning new data structures with autocomplete
  • And finally, doing a lot of compression and block separation in google drive and youtube

We’ve done a lot. Now, I’d like to go back a little.

You may see a lot of images. Those are from the book I’ve listed in each reference. While I did not cite it word by word, I went a lot with the context it has and tried to put it for my own understanding.

Some pictures come from a course that - at least first 15 parts - are same as in the book. Visit ByteByteGo to learn more about it.

Now, I’d like to reiterate on the concepts we’ve seen many times:

  • Load Balancers and multiple servers to distribute load
  • Multiple databases, sharding, unique ID generation
  • Graceful error repair, such as handling request by different server, or different DB
  • A lot of redundancy in order to keep the data
  • Having multiple write servers and read servers, depending on app usage
  • Using cache because it’s fast
  • Using CDN because it’s fast, but potentially also costly
  • Message queues for more decoupled system
  • Logging, metrics, analytics
  • Vertical and horizontal scaling (add more resources VS add more servers)

That was a lot to cover. These are just to name a few.

Now, during my time here, perhaps the most recurring concept was Back of the Envelope Estimation. Here are a couple of mnemonics to go by:

One that was a lot used was powers of two:

Power Value Full name Short Name
10 Thousand Kilobyte KB
20 Million Megabyte MB
30 Billion Gigabyte GB
40 Trillion Terabyte TB
50 Quadrillion Petabyte PB

Now, what’s a good mnemonic for this if you don’t want to remember it? For me, Million worked best:

  • 5 million DAU
  • 10 % of users store 100kB of Data
  • 5 million times 0.1 times 100kB
    • Now here, I’m working with MB. The mnemonic that worked for me is that both million and megabyte start with M
    • For easy computation:
      • 5 000 000 * 0.1 = 500 000 (remove one zero)
      • 100kB => 500 000 * 100 => 50 000 000kB (add 2 zeroes)
      • Daily stored data = 50 000 000 kB => 50 000 MB => 50 GB (remove 3 zeroes per each)

Now, we’ve often discussed about seconds. Specifically, queries per second. Now, let’s take a look at that!

  • 5 million DAU => 5 000 000 / 24 / 60 / 60 (or 5 000 000 / (24 60 60))
  • A good mnemonic here is look at what the result of the multiplication is
    • 24 60 60 = 86400
    • N / 86400 ~= N / 100 000 * 1.2
  • The real result of 5 million DAU for QPS is 57
  • The approximate result is 5 000 000 / 100 000 1.2 ~= 50 1.2 ~= 60

This approximation is good enough for most of the use cases, and it’s easier for our brain to do this computation

Let’s take this exactly from the book again:

  • 300 million monthly active users
  • 50 % use twitter daily
  • User post twice a day
  • 10 % of the tweets contain media
  • The data is stored for 5 years

So, QPS here is 3472:

  • 300 000 000 0.5 2 = 300 000 000
  • QPS is 300 000 000 / 100 000 1.2 ~= 3000 1.2 ~= 3600

(You can get the same result with 1.1574 as with the real QPS, but it’s still harder to process in head)

And now, finally, when on an interview

  • Ask for clarifications. Know the important features. Chances are, each feature will have a different service and you’d share storage
  • A lot of the requirements were given by me. But they are a result of discussion with the interviewer

Hope that helped. Good luck and thanks for being here with me!

References

SimProch logo