System Design 16 - Afterword
The road so far
So, we did it. Together we’ve gone through a massive book and I tried to explain it using my own words.
We’ve gone through a lot of stuff.
- From rate limiting and its many algorithms
- Through consistent hashing and building our own distributed ID generation and URL shortening
- Through notification and chat services, using third-party APIs and WebSockets
- Through web crawlers, understanding how to process huge amounts of data
- Learning new data structures with autocomplete
- And finally, doing a lot of compression and block splitting in Google Drive and YouTube
We’ve done a lot. Now, I’d like to go back a little.
You may have seen a lot of images. Those come from the book listed in each reference. While I did not quote it word for word, I relied heavily on its context and rephrased it for my own understanding.
Some pictures come from a course whose first 15 parts, at least, are the same as in the book. Visit ByteByteGo
to learn more about it.
Now, I’d like to reiterate the concepts we’ve seen many times:
- Load Balancers and multiple servers to distribute load
- Multiple databases, sharding, unique ID generation
- Graceful failure handling, such as serving a request from a different server or a different DB
- A lot of redundancy in order to keep the data safe
- Having multiple write servers and read servers, depending on app usage
- Using a cache because it’s fast
- Using a CDN because it’s fast, but potentially also costly
- Message queues for a more decoupled system
- Logging, metrics, analytics
- Vertical and horizontal scaling (add more resources VS add more servers)
That was a lot to cover, and these are just a few examples.
Now, throughout this series, perhaps the most recurring concept was back-of-the-envelope estimation. Here are a couple of mnemonics to go by:
One that was used a lot was powers of two:
| Power | Approximate value | Full name | Short name |
|-------|-------------------|-----------|------------|
| 10    | Thousand          | Kilobyte  | KB         |
| 20    | Million           | Megabyte  | MB         |
| 30    | Billion           | Gigabyte  | GB         |
| 40    | Trillion          | Terabyte  | TB         |
| 50    | Quadrillion       | Petabyte  | PB         |
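If you’d rather double-check the shorthand than memorize it, here’s a minimal Python sketch (my own illustration, not from the book) that prints each power of two next to its rounded decimal name:

```python
# Compare exact powers of two with the rounded "thousand / million / ..." shorthand.
units = [
    (10, "thousand", "KB"),
    (20, "million", "MB"),
    (30, "billion", "GB"),
    (40, "trillion", "TB"),
    (50, "quadrillion", "PB"),
]

for power, word, short in units:
    exact = 2 ** power                  # e.g. 2^10 = 1 024
    approx = 10 ** (3 * power // 10)    # e.g. 10^3 = 1 000
    print(f"2^{power} = {exact:,} ≈ {approx:,} ({word}, 1 {short})")
```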
Now, what’s a good mnemonic if you don’t want to memorize the whole table? For me, Million worked best:
- 5 million DAU
- 10 % of users store 100 kB of data
- 5 million times 0.1 times 100kB
- Now here, I’m working with MB. The mnemonic that worked for me is that both million and megabyte start with M
- For easy computation:
- 5 000 000 * 0.1 = 500 000 (remove one zero)
- 100 kB => 500 000 * 100 => 50 000 000 kB (add 2 zeroes)
- Daily stored data = 50 000 000 kB => 50 000 MB => 50 GB (remove 3 zeroes per step; see the sketch below)
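Here is that same estimate as a small Python sketch, using the assumed numbers from above (5 million DAU, 10 %, 100 kB), so you can sanity-check the mental math:

```python
# Back-of-the-envelope daily storage estimate with the assumed numbers above.
dau = 5_000_000          # 5 million daily active users
storing_ratio = 0.1      # 10 % of users store data
data_per_user_kb = 100   # 100 kB per storing user

daily_kb = dau * storing_ratio * data_per_user_kb
daily_gb = daily_kb / 1_000 / 1_000      # kB -> MB -> GB (treating 1024 as 1000)

print(f"{daily_kb:,.0f} kB per day ≈ {daily_gb:,.0f} GB per day")
# 50,000,000 kB per day ≈ 50 GB per day
```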
Now, we’ve often discussed seconds. Specifically, queries per second. Let’s take a look at that!
- 5 million DAU => 5 000 000 / 24 / 60 / 60 (or 5 000 000 / (24 * 60 * 60))
- A good mnemonic here is to look at the result of the multiplication
- 24 * 60 * 60 = 86400
- N / 86400 ~= N / 100 000 * 1.2
- The real result of 5 million DAU for QPS is roughly 58
- The approximate result is 5 000 000 / 100 000 * 1.2 ~= 50 * 1.2 = 60
This approximation is good enough for most use cases, and it’s much easier for our brain to compute.
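As a small sketch of this shortcut (my own illustration, the function names are just for clarity), here are the exact and approximate conversions side by side:

```python
# Exact per-day -> per-second conversion vs. the "/ 100 000 * 1.2" mental shortcut.
def qps_exact(per_day: float) -> float:
    return per_day / (24 * 60 * 60)      # 86400 seconds in a day

def qps_approx(per_day: float) -> float:
    return per_day / 100_000 * 1.2       # easy to do in your head

dau = 5_000_000
print(round(qps_exact(dau)))    # ~58
print(round(qps_approx(dau)))   # 60
```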
Let’s take an example straight from the book again:
- 300 million monthly active users
- 50 % use Twitter daily
- Users post twice a day
- 10 % of the tweets contain media
- The data is stored for 5 years
So, QPS here is 3472:
- 300 000 000 * 0.5 * 2 = 300 000 000
- QPS is 300 000 000 / 100 000 * 1.2 ~= 3000 * 1.2 = 3600
(Using 1.1574 instead of 1.2 gives you the real QPS, but that’s still harder to process in your head. See the sketch below.)
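Here is that estimate as a short Python sketch (again my own illustration, not from the book), printing both the exact figure and the shortcut:

```python
# Twitter-style QPS estimate: exact division vs. the mental-math shortcut.
mau = 300_000_000        # monthly active users
daily_ratio = 0.5        # 50 % of them use the service daily
posts_per_user = 2       # each daily user posts twice a day

posts_per_day = mau * daily_ratio * posts_per_user   # 300 000 000

print(round(posts_per_day / 86_400))          # exact: ~3472 QPS
print(round(posts_per_day / 100_000 * 1.2))   # shortcut: 3600 QPS
```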
And now, finally, when in an interview:
- Ask for clarifications. Know the important features. Chances are, each feature will have its own service, and the services will share storage
- A lot of the requirements here were given by me, but in a real interview they are the result of a discussion with the interviewer
Hope that helped. Good luck and thanks for being here with me!