1.5 Understanding a Found Webpage

The next piece of search engine magic comes from Karen Spärck Jones, a British computer scientist who worked in natural language processing (programming computers to process and analyze large amounts of natural language data) and information retrieval (search).

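In 1972, Spärck Jones published a paper introducing inverse document frequency (IDF), the idea at the heart of term frequency–inverse document frequency (TF-IDF). TF-IDF is a numerical statistic that reflects how important a word is to a document (or, in the case of the internet, a web page) relative to a whole collection of documents. The inverse document frequency part is defined as: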

$\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$

As modern search engines crawl the internet and record pages, they use this or similar systems to work out what each web page is about and which words matter most to it. A word that appears on nearly every page, such as "the", says little about any one of them, while a word that appears often on one page but rarely elsewhere is a strong signal of that page's topic. This lets the search engine return results that are more likely to be relevant than if it simply picked the documents that use the searched-for word most often. Today’s search engines use more sophisticated algorithms and aren’t based entirely on TF-IDF, but it’s still part of how search engines understand content.
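To make the formula concrete, here is a minimal Python sketch of TF-IDF over a tiny, made-up collection of three "pages". In the formula, $N$ is the total number of documents in the collection $D$, and the denominator counts the documents $d$ that contain the term $t$. Several details here are assumptions for illustration and not taken from the text above: the function names (`tokenize`, `idf`, `tf_idf`), the whitespace tokenizer, the choice of term frequency as count divided by document length (one common variant among several), and returning 0 when no document contains the term (the formula itself is undefined in that case).

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def idf(term, docs):
    """Inverse document frequency: log(N / number of docs containing term)."""
    n_containing = sum(1 for doc in docs if term in doc)
    if n_containing == 0:
        # Assumption: the formula divides by zero here, so we return 0.0.
        return 0.0
    return math.log(len(docs) / n_containing)


def tf_idf(term, doc, docs):
    """Term frequency (count / document length) multiplied by IDF."""
    tf = Counter(doc)[term] / len(doc)
    return tf * idf(term, docs)


# Hypothetical sample pages, invented for this example.
pages = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the weather is sunny today",
]
docs = [tokenize(page) for page in pages]

# "the" appears in all 3 pages: log(3/3) = 0, so it carries no weight.
score_the = tf_idf("the", docs[0], docs)   # 0.0
# "cat" appears in 2 of 3 pages: log(3/2) is about 0.405.
score_cat = tf_idf("cat", docs[0], docs)   # (1/6) * 0.405, about 0.068
# "dog" appears in only 1 page, so it has the highest IDF of the three.
score_dog = tf_idf("dog", docs[1], docs)
```

Note that a raw word count would treat "the" as the most important word on the first page, since it appears twice, while TF-IDF scores it at zero because it appears on every page. Real search engines and libraries typically use smoothed or otherwise adjusted variants of this calculation.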
