To build my own UUID
So far we’ve discussed the issues when working with distributed environments. When we dealt with consistent hashing, we had to deal with hashing that’s used in multiple servers. Well, the same thing goes for databases. And this chapter is dedicated to that.
Often times, when working with IDs, you’d typically use either numeric identifiers with @autoIncrement
on the DB table, or generate a UUID.
Well, did you ever think about why UUIDs are so generally used? Sure, they are unique - or at least should be. As per wikipedia:
For example, the number of random version-4 UUIDs which need to be generated in order to have a 50% probability of at least one collision is 2.71 quintillion (2.71 * 10^18)
So, that’s a rather large number.
Before going to the details why we may need to have something else, let’s first go through a couple of ID generators
So, the first that comes to mind when we want unique ID on databse is autoincrement. Surely this will work, right?
An in fact, yes, it will, as long as you know what you’re going for.
Consider that you’re building a software that is to be used only in Czechia. Czechia is pretty small and you really need a server only in Prague.
Pros:
Cons:
Well, that is easy. But we’ve been talking about adding and removing servers on the go. What would happen here? Well, collisions. Basically, multiple databases having same IDs for different records, making it impossible to get some business insight from your data.
So, let’s try to scale it up. We’ve talked about this in consistent hashing already. This is also sometimes called Multi-master replication.
What we’re gonna do is we’re gonna autoIncrement, but we’ll do it with a specific value. Consider the following:
server0
will have 1, 3, 5, 7, ...
server1
will have 2, 4, 6, 8, ...
Now, if we had 3 servers, we would just increment by 3. In practice, we’d always increment by number of servers.
This sounds nice, but it has quite a massive disadvantage. Imagine that you have a traffic burst and you want to add more servers. Where do you go from there?
Well, you could raise more servers and change the increment. But if you originally had 2, and now you have 3 servers, without data migrations, you’d end up with:
server0
: 1, 3, 5, 7, 10
server1
: 2, 4, 6, 8, 11
server3
: 3
Now, you see the issue. We’ve already had 3 before. It’s inconsistent. And you’d have to do quite some work around it. Let’s make it better with UUIDs
So, we’ll be working with something unique now! UUIDs are quite well described. Basically, they are 32 hexadecimal characters.
f
, we would do 1111
I’ve mentioned that on purpose. 128 bits is a BIG number. I had to look up that number and it’s 340 undecillion combinations.
Basically, you are way more likely to win the lottery multiple times in a row than generating a duplicate UUID.
But, we wouldn’t be here if it were without disadvantage, would we?
Well, as it turns out, you can’t sort by them, because they are quite random. Imagine you wanted to get the first 10 UUIDs you’ve ever generated. From UUID, it’s impossible!
Furthermore, as I mentioned, 128 bits is a BIG number. It’s also 16 bytes. Let’s do some math:
Now, if you came up with a way to make it 64 bits, you’d effectively halve the storage required! So, chances are, you may want to have less than that!
While ticket server doesn’t solve the UUID issue, it’s an interesting way to generate numeric IDs. Flickr actually created one for themselves.
So, how do they work? Well, it’s basically centralized autoincrement for servers.
This is another one that’s easy to implement, but has drawbacks:
So, we’ve shown a couple of options.
So, let’s first create the issue why UUIDs can’t be used! Let’s say you’re on an interview, and you’re given the following task:
So, this is actually a thing that UUIDs completely fail to oblige! The only thing they are is they are unique! (And also numerical, kinda, you can transform 128 bits to a number)
And why would we have these requirements? Well, let’s say it’s because you’re on an interview! Or also because we want to save some data with the 64 bits.
What would we do then? Well, it turns out we can fairly easily create our own ID!
In this part, I’m gonna create my own ID! Well, not necessarily, I’m gonna follow what Twitter did
So, let’s start with unique IDs. How would we generate one that’s unique? Well, turns out, it’s fairly simple. We can just increment it. But hold on, Simon. You’ve said we can’t increment it!
No! What I said was we can’t autoincrement on database. But what if we incremented manually, and only some parts? Well, let’s take a look at it!
We’re gonna do the same thing that’s done to IP Packets. We will separate a sequence into a section.
We have 64 bits available. That means 64 ones and zeroes. And we know how binary works. So, what we’re gonna do is we will:
So, how can we use the 42 bits to allow sorting by date, and allow for way more IDs? Well…
We can use them to store the number of milliseconds from an arbitrary epoch.
Now, if we go back up - we’ve had 4096 ids on a single machine. With this addition, it makes it 4096 ids on a single machine per millisecond.
So, theoretically, our hard limits here are:
0
.And the best thing is, we can also sort by date! Because the first 42 bits define the number of milliseconds from the Epoch, we can sort by these! And with some basic math, we can also get periods and more!
So, in this part, we’ve gone through 4 different options of generating IDs
We’re gonna use these in an interesting way in the next part where we build our own URL shortener