It All Starts with a Crawl

The first problem when building a search engine is knowing what information is out there. A search engine can’t rely on another search engine to help it find pages, so it needs to build its own index (or list) of every web page on the internet. How can it possibly find every page? The answer is through links. A link is a clickable button or piece of text that takes a user from one page to another. The web is a vast collection of pages, and the only thing tying them together is links.

The search engine needs only to build a spider (a program designed to scan and record web pages) and point it at a single web page. The spider scans that page, records that it exists, records what’s on it, and then follows every link it finds there, in a process called crawling. It then crawls each page it reaches, records it, follows all the links on those pages, and so on.
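To make this concrete, here’s a minimal crawler sketch in Python using only the standard library. The seed URL, page limit, and error handling are illustrative assumptions; a real spider would also respect robots.txt, throttle its requests, and store far more than raw HTML.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=100):
    """Breadth-first crawl: visit a page, record it, queue its links."""
    index = {}                 # url -> raw HTML (our "record" of the page)
    seen = {seed_url}
    queue = deque([seed_url])
    while queue and len(index) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue           # unreachable or malformed URL: skip it
        index[url] = html      # record that the page exists and what's on it
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return index

# pages = crawl("https://example.com")  # hypothetical seed page
```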

If we assume that the average page contains 20 links, each pointing to a page the spider hasn’t seen before, then the spider would be able to reach the following number of pages in eight steps:

1 > 20 > 400 > 8,000 > 160,000 > 3,200,000 > 64,000,000 > 1,280,000,000 > 25,600,000,000 (more pages than exist on the internet).
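That growth is just repeated multiplication by 20, as a quick sanity check in Python shows:

```python
pages = 1
for step in range(1, 9):
    pages *= 20  # assumption from above: 20 links to unseen pages per page
    print(f"step {step}: {pages:,} pages reachable")
# final line printed: step 8: 25,600,000,000 pages reachable
```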

[Image: Links on web pages help search engines effectively crawl the internet. Photo by Boskampi via Pixabay.]

It’s not quite that simple, because more web pages mean more resources are needed to crawl and record them, but crawling from link to link turns out to be a very effective way to index the entirety of the internet.

As we discussed earlier, building a spider that can crawl and record every page it finds is an effective way to index the internet. The spider, however, isn’t just recording that each page exists and analyzing its content. It’s also recording which pages link where, which lets it build a record of how many pages link to every page on the internet.
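That inbound-link record is just the crawl’s link graph turned inside out. A minimal sketch, using a made-up three-page web:

```python
# Hypothetical crawl output: each page mapped to the pages it links to.
outlinks = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com"],
}

# Invert the graph: for each page, which pages link *to* it?
inbound = {}
for page, targets in outlinks.items():
    for target in targets:
        inbound.setdefault(target, set()).add(page)

for page, sources in sorted(inbound.items()):
    print(f"{page} is linked to by {len(sources)} page(s): {sorted(sources)}")
```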

For most search engines, this has become a powerful signal that content is worth showing in search results. The more pages that link to a particular page, the more trustworthy and valuable that page is presumed to be, and the higher it should rank when a user searches for related content.

It functions a bit like a popularity contest. Imagine a contest involving every American, in which anyone can vote for whomever they want, as many times as they want. You might first say that whoever has the most votes wins. You’d quickly notice that the people winning the contest were not, in fact, popular; they simply had a small group of friends who weren’t above voting for them thousands of times. In response, you might rule that each additional vote from the same person counts less and less. That would prevent anyone from stuffing the ballot box while still letting people vote as often as they like, which is the intent. You might also decide that a vote from someone who has received a lot of votes themselves is worth more, since they clearly know a thing or two about popularity. Someone with no votes of their own can still cast votes; those votes just carry less weight.

Links on the internet work in a similar (though far more complicated) way. The more links a page has, the higher it will rank. The more varied the websites linking to a page, the higher it will rank. And the more links a page gets from pages that have themselves received a lot of links, the higher it will rank. In a nutshell, that’s how most modern search engines weigh links between pages when ranking search results.
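The classic formalization of this weighted-vote idea is PageRank, the algorithm Google was originally built on. Here’s a toy power-iteration version over the same made-up three-page web; the damping factor and iteration count are standard textbook values, not anything a production engine necessarily uses:

```python
def pagerank(outlinks, damping=0.85, iterations=50):
    """Toy PageRank: each page spreads its score evenly across the
    pages it links to, so a vote from a well-linked page is worth
    more than a vote from an obscure one."""
    pages = list(outlinks)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, targets in outlinks.items():
            if not targets:
                continue  # dangling page; a real version redistributes this
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

outlinks = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com"],
}
for page, score in sorted(pagerank(outlinks).items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

In this tiny graph, c.com comes out on top because it has the most incoming links, and a.com outranks b.com because its single inlink comes from the highest-scoring page, whose entire vote it receives.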

Finally, in addition to links, there are some tools to ensure that all of your web pages are found, which we’ll be discussing in a later chapter.