How Google suggests search results
So, you want to be like Google, eh? Well, you've come to the right place.
We've already built a web crawler, so we know how to do that. But how would we do autocomplete and allow searching over the web? Well, let's take a look.
This is again one of the questions where we need a deeper understanding of algorithms. Or, actually, data structures.
Why? Well, imagine the following:
Now, how would you search for them? People can look for a lot of stuff, sometimes weird stuff. Try getting all queries starting with "muy" out of every Google search ever made.
Luckily, there IS a data structure to help us. It’s a tree like structure. Let’s TRIE IT OUT (haha, I’ll see myself out).
So, the data structure we’ll use is called a Trie. So, how does it work?
Well, let’s quickly say something about trees in Computer Science, because they are unnatural - they are basically trees upside down:
Now, how does this help us? Well, tries are a specific kind of tree.
So, consider the following trie:
{
  root: {
    t: {
      to: {},
      te: {
        tea: {},
        ted: {}
      }
    },
    a: {},
    i: {
      in: {}
    }
  }
}
Now, it looks really funky in the JSON above. But upon closer inspection, we can actually see it’s an autocomplete!
Furthermore, we know that:
Now, we could use this as autocomplete. This is a very basic example, but if a user types "t", we don't need to search the entire database! We would just store all items in a Trie in the first place, and then whenever a user types "t", we could immediately return the handful of matching words rather than iterating over everything!
Now, let’s take it a little further. We can now see why tries are good for autocomplete. Imagine there are billions of records in there. You’d quickly filter out most.
In fact, you’d get to your requested result in a short time! All you have to do is find the node containing your text!
Consider the following:
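Here is a minimal Trie sketch in Python. The class and method names are my own, and the nodes are keyed per character rather than per full prefix as in the JSON above, but the idea is the same: walk down to the node for the typed prefix, then collect every word below it.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False # True if a word ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def autocomplete(self, prefix):
        # Walk down to the node for the prefix...
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        # ...then collect all words in the subtree below it.
        results = []
        def collect(n, path):
            if n.is_word:
                results.append(path)
            for ch, child in n.children.items():
                collect(child, path + ch)
        collect(node, prefix)
        return results

trie = Trie()
for word in ["to", "tea", "ted", "a", "in"]:
    trie.insert(word)

print(sorted(trie.autocomplete("t")))  # ['tea', 'ted', 'to']
```

Note that the lookup cost depends on the prefix length and the size of the matching subtree, not on the total number of stored words.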
So, it's also very fast! But we also need to save these somewhere. You could either use a key-value store (since it's just JSON), or a document store like MongoDB.
Great, now we have it also stored! But we still have an issue. How would we count the number of times something was searched? After all, autocomplete often suggests most searched requests.
Well, let’s take a look back to the original example:
{
  root: {
    t: {
      to: {},
      te: {
        tea: {},
        ted: {}
      }
    },
    a: {},
    i: {
      in: {}
    }
  }
}
It's just an object! So, we can add some more values to it! We could add another "node" to the root, but this time it would be just a property. For example, we could add mostSearched: ['tea', 'ted', 'in'].
With this simple property, we could show right away what is searched the most. But we can do better! We can do this on every node!
Or even better! We can store the count of how many times it was searched! Consider the following:
{
  root: {
    mostSearched: [{tea: 20}, {ted: 10}, {in: 5}],
    t: {
      mostSearched: [{tea: 20}, {ted: 10}, {to: 2}],
      to: {},
      te: {
        tea: {},
        ted: {}
      }
    },
    a: {},
    i: {
      in: {}
    }
  }
}
With the above example, when a user focuses the input, we can immediately show them the most searched items. When they type "t", we can immediately show the most searched items starting with "t". When they then search for "to", we would update its counter to 3. And once "to" has been searched many times, we would propagate it to the mostSearched lists in the parent nodes to adjust for its popularity. Therefore, a Trie is a great tool for this! In the DB, it'd finally look something like this:
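The counting and propagation can be sketched like this. This is a minimal in-memory version with my own names (`record_search`, `most_searched` as (word, count) pairs, top-k of 3); a real system would likely batch these updates rather than propagate on every search.

```python
class Node:
    def __init__(self):
        self.children = {}       # char -> Node
        self.count = 0           # times this exact word was searched
        self.most_searched = []  # top-k (word, count) pairs, highest first

def record_search(root, word, k=3):
    # Walk down one character at a time, remembering the whole path.
    path = [root]
    node = root
    for ch in word:
        node = node.children.setdefault(ch, Node())
        path.append(node)
    node.count += 1
    # Propagate the new count into every node on the path's top-k list.
    for n in path:
        entries = dict(n.most_searched)
        entries[word] = node.count
        n.most_searched = sorted(entries.items(), key=lambda e: -e[1])[:k]

root = Node()
for word, times in [("tea", 20), ("ted", 10), ("in", 5), ("to", 2)]:
    for _ in range(times):
        record_search(root, word)

print(root.most_searched)                # [('tea', 20), ('ted', 10), ('in', 5)]
print(root.children["t"].most_searched)  # [('tea', 20), ('ted', 10), ('to', 2)]
```

Note how the node for "t" ends up with exactly the mostSearched list from the JSON above: "in" never passes through it, so "to" makes the cut there even though it is globally less popular.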
So, we now know how to store the data efficiently for autocomplete. Let’s try to come up with a design. Again, let’s consider the requirements!
So, in short:
Furthermore, let’s get the QPS:
mysearch.com?q=d
-> mysearch.com?q=di
-> mysearch.com?q=din
-> mysearch.com?q=dine
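To put a rough number on it (every figure below is invented for illustration): each keystroke fires one request, so a 4-letter query like "dine" is 4 requests.

```python
# Back-of-the-envelope QPS with assumed inputs:
daily_users = 10_000_000   # assumed daily active users
searches_per_user = 10     # assumed searches per user per day
chars_per_query = 4        # one request per keystroke: d, di, din, dine

requests_per_day = daily_users * searches_per_user * chars_per_query
qps = requests_per_day / (24 * 60 * 60)
print(round(qps))  # 4630
```

The takeaway is that keystroke-level requests multiply the load: a "10 searches per user" figure quietly turns into 40 requests per user per day.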
So, as mentioned above, if the dataset were very small, we could easily do this on a regular relational database with:
SELECT * FROM myTable
WHERE query LIKE '{prefix}%'
ORDER BY frequency DESC
LIMIT 5
But that'd be way too easy and not scalable. So, let's do it the hard way!
So, we already know how we are going to store the data. It’s gonna be in a Trie. We’ll also cache the data in there. We’ve already done it - we will keep the top searches on the individual nodes.
Now, we could make a decision here:
If the latter, we'd have to set up some cache control. Otherwise, we could just keep them around indefinitely and overwrite them as the counts change.
The problematic part here is analytics. Because we don’t want our search to be slowed down when analytics are broken.
Imagine the following flow:
Now, that can be fairly slow, especially with the writes. So, what we could do is separate the services. We’d have:
And that’s pretty much the design!
In the background, an analytics service is also doing its job:
So, in the end, we’d have:
Furthermore, when the request is back in the browser, we could cache the most popular searches in local storage. That’d save some server time as well and make it faster for the user.
Finally, we could want to have some blacklisted suggestions with sexual or violent content. In that case, we’d still store them in the Trie, but we would simply filter them out after retrieving them.
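Filtering after retrieval keeps the Trie simple; the suggestions stay stored, and a post-processing step drops anything blacklisted. A sketch (the blacklist contents and function name are illustrative):

```python
# Words we never want to surface as suggestions (illustrative entries).
BLACKLIST = {"badword", "another-bad-word"}

def filter_suggestions(suggestions):
    # Drop blacklisted entries after retrieving them from the Trie.
    return [s for s in suggestions if s not in BLACKLIST]

print(filter_suggestions(["tea", "badword", "ted"]))  # ['tea', 'ted']
```

A nice side effect of filtering at read time is that updating the blacklist takes effect immediately, without rewriting anything stored in the Trie.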
For the storage - we can have a lot of data to store. So, how would we scale that? Well, we could create multiple DBs split by starting letter (a-m, n-z), or even one per letter (a, b, c, …, z). While it sounds reasonable, we'd quickly find out that some letters are far more common than others - try to quickly think of a word starting with X versus one starting with E.
But, we could do some analytics around it and find the most suitable sharding approach, e.g.
ea-em
en-ez
u-z
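Once analytics have picked the prefix ranges, routing a query to its shard is a simple range lookup. A sketch with made-up shard names and ranges, echoing the ea-em / en-ez / u-z idea:

```python
# (range_start, range_end, shard) tuples chosen from analytics; illustrative only.
SHARDS = [
    ("aa", "dz", "shard-1"),
    ("ea", "em", "shard-2"),
    ("en", "ez", "shard-3"),
    ("fa", "tz", "shard-4"),
    ("ua", "zz", "shard-5"),
]

def shard_for(prefix):
    # Pad single-letter prefixes so "e" routes the same way as "ea".
    key = (prefix.lower() + "a")[:2]
    for lo, hi, shard in SHARDS:
        if lo <= key <= hi:
            return shard
    return "shard-default"

print(shard_for("elephant"))  # shard-2
print(shard_for("energy"))    # shard-3
```

Because the ranges live in a small routing table, rebalancing later (when analytics show a hot range) only means splitting one entry, not re-keying everything.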
So, we've created an autocomplete. We've discussed the Trie a lot, and this entire design is basically built around it, plus some analytics to pick the storage and sharding approach.
Of course, we might need to support additional things. If we'd like to support multiple languages, we might need to support Unicode (which can take several bytes per character). We could also create separate tries per country. There are a lot of ideas.
Additionally, we could support trending searches - e.g. when a news event breaks out, such as the war in Ukraine. Our system is not ready for this case, as we do not support real-time updates. We'd have to update the existing tries in the cache and add weights based on searches within a short time period (e.g. show "War in Ukraine" before "tree" even though "tree" was searched 100 times in the last 7 days, while "War in Ukraine" was searched 50 times, but within the last hour).
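One common way to express "recent searches count more" is an exponentially decayed score. The half-life and function name below are my own assumptions, not from the design above:

```python
import time

# A search loses half its weight every HALF_LIFE_SECONDS.
HALF_LIFE_SECONDS = 3600  # assumed: one hour

def trending_score(search_timestamps, now):
    # Sum of decayed weights: 1.0 for a search right now,
    # 0.5 for one an hour ago, 0.25 for two hours ago, etc.
    return sum(0.5 ** ((now - t) / HALF_LIFE_SECONDS) for t in search_timestamps)

now = time.time()
tree_searches = [now - 3 * 24 * 3600] * 100  # "tree": 100 searches, 3 days ago
war_searches = [now - 600] * 50              # "War in Ukraine": 50 searches, 10 min ago

# The recent burst outscores the older, larger total.
print(trending_score(war_searches, now) > trending_score(tree_searches, now))  # True
```

Tuning the half-life trades freshness against stability: a short half-life surfaces breaking events fast but makes the suggestions jumpy.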