Building a web archive
This time we're building a web crawler. Now, what exactly is a web crawler?
Sometimes called a robot or a spider, it's basically a program that goes through web pages and saves them for later use.
There are multiple use cases for a web crawler, such as search engine indexing, web archiving, and data mining.
So, it’s a very useful tool. Now, how would we design one?
Well, in short, it’s fairly simple:
A basic architecture can be seen below
Now, it seems fairly simple at first glance. If I were to go through the URLs of my university, it'd be fairly easy and fast. But the problem comes with scalability. So, let's set some rules:
So, looking at the above requirements, we need to:
So, we already have a basic web crawler architecture, but we need ours to scale.
We’re gonna crawl a lot of web pages. To achieve that, we’ll need a couple of items:
So, one of the things we need to do is get a list of URLs to crawl. We can see that as a so-called URL Frontier:
So, we can now split our system into three components:
A seed URL is basically a root URL from which we begin crawling. This could be about.google.com or google.com - anything from which we plan to go further.
Once we've gone through the links on a seed URL, we insert the URLs we found into the URL Frontier.
Effectively, the main difference is:
So, that’s the first two parts. What happens next?
Well, we need to download the HTML. There are many popular tools for that, but this will likely be slow. Furthermore, we need a DNS resolver to map wikipedia.org to its current IP address, because a website's IP address can change over time while its name stays the same, and we need the IP address to download the page.
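To make this concrete, here is a minimal Python sketch of the downloader step. It resolves the hostname and fetches the raw HTML; the URL is just a placeholder, and a real crawler would cache DNS results, set a proper User-Agent and handle retries.

import socket
import urllib.request

def download_page(url, hostname):
    # Resolve the hostname to its current IP address; a real crawler
    # would cache this lookup and respect the DNS TTL.
    ip_address = socket.gethostbyname(hostname)
    print(hostname, "currently resolves to", ip_address)

    # Download the raw HTML over a plain HTTP(S) request.
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()

html = download_page("https://wikipedia.org", "wikipedia.org")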
After that, we'll filter out duplicate web pages. For example, we can compute a hash of the entire page and check whether it already exists. If not, we process the page; if it does, there's no need to keep another copy in our storage.
And finally, we can filter out some unwanted pages, such as blacklisted pages or unsupported content types/file extensions.
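Put together, the filtering step could look roughly like the sketch below. The blacklist, the allowed content types and the function name are all made up for illustration.

import hashlib

seen_hashes = set()                    # hashes of pages we've already stored
blacklisted_hosts = {"spam.example"}   # made-up blacklist
allowed_content_types = {"text/html"}  # only keep plain HTML pages

def should_store(page_bytes, host, content_type):
    # Duplicate check: hash the whole page and skip it if we've seen
    # identical content before.
    page_hash = hashlib.sha256(page_bytes).hexdigest()
    if page_hash in seen_hashes:
        return False

    # Unwanted-content check: blacklisted hosts and unsupported types.
    if host in blacklisted_hosts or content_type not in allowed_content_types:
        return False

    seen_hashes.add(page_hash)
    return True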
After all of this, we could save it to our storage. BUT! There are actually more steps!
We still need to extract the links from the page, and many of them are relative - for example, /images/system-design. That is an absolute path relative to the root of the page, so before we can download it, we need to build the full URL.
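In Python, for instance, the standard library's urljoin does exactly this (the base URL below is just a placeholder):

from urllib.parse import urljoin

base_url = "https://example.com/posts/web-crawler"   # page the link was found on
relative_link = "/images/system-design"

# urljoin resolves the relative path against the page's root.
full_url = urljoin(base_url, relative_link)
print(full_url)  # https://example.com/images/system-design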
So finally, the design is:
And that's our high-level design!
In this part, we’ll go through some deeper knowledge. One of them will again be algorithms.
What we’ll discuss is:
Traversing is the algorithmic part. If you open developer tools, you'll see that an HTML page is actually a tree (or a graph). The difference between the two is out of scope for this post.
There are a bunch of algorithms for that. The two I'll mention are Depth-First Search (DFS) and Breadth-First Search (BFS).
Consider the following HTML code:
<div>
  <div id="first">
    <div id="first-1"></div>
    <div id="first-2"></div>
    <div id="first-3"></div>
  </div>
  <div id="second"></div>
  <div id="third"></div>
</div>
DFS basically goes deep first, as the name implies. With the code above, DFS would process the nodes in this order: first, first-1, first-2, first-3, second, third.
BFS is very similar, but traverses the tree level by level: first, second, third, first-1, first-2, first-3.
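To make the two orders concrete, here is a small Python sketch over a toy tree mirroring the HTML above. A real crawler would of course walk a parsed DOM or follow links between pages, not a hand-built dictionary.

from collections import deque

# A simplified tree mirroring the HTML above: node name -> child names.
tree = {
    "root":    ["first", "second", "third"],
    "first":   ["first-1", "first-2", "first-3"],
    "second":  [], "third": [],
    "first-1": [], "first-2": [], "first-3": [],
}

def dfs(node):
    for child in tree[node]:
        print(child)
        dfs(child)                 # go as deep as possible before moving on

def bfs(start):
    queue = deque(tree[start])
    while queue:
        node = queue.popleft()
        print(node)
        queue.extend(tree[node])   # children are visited one level later

dfs("root")   # first, first-1, first-2, first-3, second, third
bfs("root")   # first, second, third, first-1, first-2, first-3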
Now, each has its use cases:
Either way, we're traversing a lot of pages in a short time, so we need to be careful not to flood a site with requests.
As mentioned, we need to avoid flooding pages with requests, and the URL Frontier helps with that. This component contains the URLs that are yet to be crawled.
Here, we can effectively slow down the processing of some pages. This is also called politeness.
What we would use here is basically a queue: we'd add a delay between individual downloads of pages from the same host.
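Imagine something like the following minimal sketch - a single worker, one made-up host queue and a fixed one-second delay, not the exact frontier design:

import time
from collections import deque

POLITENESS_DELAY_SECONDS = 1.0  # made-up delay between hits to the same host

host_queue = deque([
    "https://example.com/",
    "https://example.com/about",
    "https://example.com/blog",
])

while host_queue:
    url = host_queue.popleft()
    print("downloading", url)             # download_page(url) in a real crawler
    time.sleep(POLITENESS_DELAY_SECONDS)  # be polite before the next request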
Similarly, we want to give pages different priorities: a forum post about Apple, for example, probably matters less than Apple's homepage. Therefore, the URL Frontier basically has two components:
The final component could look like:
Finally, we need to make sure the pages stay fresh, so once we've crawled them, we'll want to recrawl them later. A blog post from a year back is unlikely to have new content; a company's main profile page is. So, we'd prioritize here as well.
Now the storage part is tricky: we want to keep the URLs on disk, but reading them all from disk every time is slow. So, we'll use a hybrid approach.
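A very rough sketch of that hybrid buffer could look like the following. The buffer size and the spill file name are arbitrary, and a real frontier would also refill the in-memory part from disk once it runs dry.

BUFFER_SIZE = 1000        # arbitrary size of the in-memory part
memory_buffer = []        # URLs we can hand out without touching the disk

def add_url(url):
    memory_buffer.append(url)
    if len(memory_buffer) >= BUFFER_SIZE:
        # Spill the buffer to disk instead of keeping everything in memory.
        with open("frontier_overflow.txt", "a") as spill_file:
            spill_file.write("\n".join(memory_buffer) + "\n")
        memory_buffer.clear()

def next_url():
    # Serve from memory first; a real frontier would read the spill file back here.
    return memory_buffer.pop(0) if memory_buffer else None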
The HTML downloader downloads the page. There’s an important nuance here - robots.txt.
robots.txt is a file available on many sites that allows or disallows specific content from being crawled. You can find one on Amazon.
These files could be cached so we don't need to download them every time. Once a whole domain has been crawled, its entry can be removed from memory, which keeps memory usage down.
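Python's standard library already ships a parser for these files; a sketch with a small per-host cache (the user agent string is made up) could look like this:

from urllib.robotparser import RobotFileParser

robots_cache = {}  # host -> parsed robots.txt, downloaded only once per host

def allowed_to_crawl(host, url, user_agent="MyCrawler"):
    if host not in robots_cache:
        parser = RobotFileParser()
        parser.set_url("https://" + host + "/robots.txt")
        parser.read()  # downloads and parses the file
        robots_cache[host] = parser
    return robots_cache[host].can_fetch(user_agent, url)

print(allowed_to_crawl("www.amazon.com", "https://www.amazon.com/some/page"))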
Another performance optimization we can apply is distributing the crawl across multiple machines.
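One common way to split the work (not the only one) is to partition URLs across nodes by hashing the host, so each node owns a stable slice of the URL space. The node count below is made up.

import hashlib

NUM_CRAWLERS = 4  # made-up number of crawler nodes

def crawler_for(url):
    # Hashing the host (rather than the full URL) keeps all pages of a
    # domain on one node, so politeness stays easy to enforce.
    host = url.split("/")[2]
    digest = hashlib.md5(host.encode()).hexdigest()
    return int(digest, 16) % NUM_CRAWLERS

print(crawler_for("https://wikipedia.org/wiki/Web_crawler"))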
As is always the case, we want our system to be as robust as possible. Besides performance, we also need to care about data integrity and other concerns:
For extensibility, consider again the original design:
Now, imagine that we decided to download all pictures as well. Where would we do that? Well, probably right after downloading the content.
All systems need to be flexible, and a crawler can be extended by supporting new content types and downloading specific content. Our system is extensible as of now, though future changes could break that. This was just a test to see whether we can extend it - and we can.
Now, I’ve mentioned problematic content before. In this case, I’m talking more about problems for the crawler:
Finally, there are pages called spider traps. These could cause serious harm to our crawler.
To give an idea, imagine example.com is a page that accepts anything in the URL. We could then end up with something like example.com/foo/bar/foo/bar/foo/bar, all the way up to the maximum URL length. And each of these URLs could render the same page, with its only link pointing to yet another such page.
They're called spider traps because there is no real content, yet your crawler keeps processing them, making it slow for no value.
We can do many things about them: limit the maximum URL length, cap how deep we go within a single domain, or manually blacklist known traps. A simple check is sketched below.
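Here's a rough guard along those lines; the length and depth limits are arbitrary values chosen only for illustration.

from urllib.parse import urlparse

MAX_URL_LENGTH = 200   # arbitrary cut-offs for illustration
MAX_PATH_DEPTH = 10

def looks_like_trap(url):
    path = urlparse(url).path
    depth = len([part for part in path.split("/") if part])
    return len(url) > MAX_URL_LENGTH or depth > MAX_PATH_DEPTH

print(looks_like_trap("https://example.com/foo/bar/foo/bar/foo/bar"))  # False
print(looks_like_trap("https://example.com/" + "foo/bar/" * 50))       # True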
So, here we discussed a web crawler. Hopefully that was fruitful and we’ve learned something new.
Keep in mind that there can be quite a few problems again: