How does a search engine select the pages to show for a given query? How is a specific query processed? How does a search engine find pages online?
This article briefly explains how a search engine works.

1. Discovery

Search engines use automated programs (called spiders or bots) that explore the web, jumping from one page to another by following the links they find.
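The jumping-from-page-to-page behavior is essentially a graph traversal. Here is a minimal sketch in Python, using a made-up in-memory "web" (a real spider would fetch pages over HTTP and parse the HTML for links):

```python
from collections import deque

# Hypothetical web: page URL -> list of outbound links (illustrative data only).
WEB = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com", "d.com"],
    "d.com": [],
}

def crawl(seed):
    """Breadth-first crawl: visit the seed page, then every page it links to,
    keeping a 'seen' set so no page is fetched twice."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in WEB.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print(crawl("a.com"))  # ['a.com', 'b.com', 'c.com', 'd.com']
```

Any page not linked from anywhere reachable simply never gets discovered — which is why "allow search engines to find your site" matters.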

2. Index

When a page is found, or a known page is re-visited, its content is saved in the search engine's database, so it can be accessed faster in the future.
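The classic data structure for this step is an inverted index: instead of storing page → words, the engine stores word → pages, so matching pages can be looked up instantly at query time. A toy Python sketch, with made-up page contents:

```python
# Hypothetical pages (illustrative data only).
PAGES = {
    "a.com": "cheap flights to rome",
    "b.com": "rome travel guide",
    "c.com": "guide to cheap hotels",
}

# Build the inverted index: word -> set of pages containing it.
index = {}
for url, text in PAGES.items():
    for word in text.split():
        index.setdefault(word, set()).add(url)

print(index["rome"])  # {'a.com', 'b.com'}
```

At query time the engine consults `index` rather than re-reading every page, which is what makes answering a query over billions of documents feasible.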

3. Returning Results

When a query is sent to the search engine (i.e. when a user hits the “search” button on the search engine homepage), the matching pages are selected and ranked with a specific algorithm (every search engine has its own super-secret ranking algorithm), and the pages are returned to the user ordered by descending importance.
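The select-then-rank step can be sketched as follows. The scoring function here (raw term frequency) is a deliberately naive stand-in for a real ranking algorithm, and the page data is made up:

```python
# Hypothetical pages (illustrative data only).
PAGES = {
    "a.com": "cheap flights to rome cheap deals",
    "b.com": "rome travel guide for rome visitors",
    "c.com": "guide to cheap hotels",
}

def search(query):
    """Select the pages matching every query term, then sort them by a
    toy score (total term frequency) in descending order."""
    terms = query.split()
    matches = [url for url, text in PAGES.items()
               if all(t in text.split() for t in terms)]
    score = lambda url: sum(PAGES[url].split().count(t) for t in terms)
    return sorted(matches, key=score, reverse=True)

print(search("rome"))  # ['b.com', 'a.com'] — 'rome' appears twice on b.com
```

Real engines replace that one-line `score` with the relevance and popularity signals discussed below.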

Ranking Criteria

There are enormous differences between the ranking algorithms used by search engines, but all of them are based on relevance and popularity.
These are terms from Information Retrieval, of which search engines are one of the most visible applications.

Basically, higher relevance means that the document is more focused on the given search term, and higher popularity means that the document is cited more by other sources.
In terms of search engines,

relevance
is evaluated by analyzing

  • the page's textual content
  • the pages that provide inbound links
    • reading the anchor text used to link to the document
    • reading the text surrounding the link
    • evaluating the linking pages

This means, for example, that a page can rank well for a phrase or keyword even if that phrase never appears on that page. (One famous case was Bush's bio ranking #1 for “miserable failure” on Google: this was the result of a massive use of “miserable failure” as anchor text for www.whitehouse.gov/president/. This is no longer true due to a change in Google's algorithm.)
popularity is evaluated by counting the number of links to the given page (more links means more popularity)
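The two signals can be sketched together in Python. All of the data here is hypothetical, chosen to mirror the anchor-text effect described above: the page scores for a term it never contains, purely because of the anchor text of its inbound links.

```python
# Hypothetical data (illustrative only): on-page text, plus inbound links
# represented as (linking page, anchor text) pairs.
PAGE_TEXT = {
    "site-a": "biography of the president",
    "site-b": "unrelated page",
}
INBOUND = {
    "site-a": [("blog1", "miserable failure"), ("blog2", "miserable failure")],
    "site-b": [("blog3", "funny cats")],
}

def relevance(page, term):
    """On-page occurrences of the term, plus occurrences in the anchor text
    of inbound links — so a page can score for a term it never contains."""
    on_page = PAGE_TEXT[page].split().count(term)
    in_anchors = sum(anchor.split().count(term) for _, anchor in INBOUND[page])
    return on_page + in_anchors

def popularity(page):
    """More inbound links means more popularity."""
    return len(INBOUND[page])

print(relevance("site-a", "failure"))  # 2, though the page never says it
```

A real engine also evaluates the linking pages themselves (a link from a popular page counts for more), which is where algorithms like PageRank come in.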

Given these two main criteria, each search engine adds its own interpretation, for example giving more weight to some “trusted” sites (.edu and .gov domains and sites with higher popularity are considered more trusted), or giving different weights to each element (page title, body, heading tags…).
As an example, consider my Google guide: it ranks #1 for ‘mapelli’ (my last name) on Google, because it has been widely linked with the title of the page (which contains the domain name, i.e. www.mapelli.info), and Google gives high relevance to inbound link text, while the same article is not in the top 100 results on Yahoo. (This is no longer true due to a change in Google's algorithm.)
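The per-element weighting idea can be sketched as follows. The weight values and the page content are made up; real engines keep their actual weights secret:

```python
# Hypothetical per-element weights (illustrative only): a term in the title
# counts more than the same term in a heading or in the body.
WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}

def weighted_score(page, term):
    """Sum the term's occurrences in each element, scaled by that element's weight."""
    return sum(WEIGHTS[element] * text.split().count(term)
               for element, text in page.items())

page = {
    "title": "rome travel guide",
    "heading": "visiting rome",
    "body": "rome is full of history and travel tips",
}
print(weighted_score(page, "rome"))  # 3.0 + 2.0 + 1.0 = 6.0
```

A term that appears only in the body scores far lower than the same term in the title, which is why putting keywords in prominent elements matters.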

The obvious consequence is that if you want to get higher rankings you have to

  • allow search engines to find your site
  • make it easy for the spiders to understand the structure of your pages
  • increase your relevance
  • increase your popularity

We’ll talk about how to do this in the next few articles.

Summary

  • Spiders or bots: automated programs that crawl the web and index the pages
  • Relevance: represents how well a web page matches the search terms
  • Popularity: represents the number of “citations” (inbound links) to a given webpage; it is a metric of the importance of the webpage

In the next article in SEO for Dummies I’ll talk about the most well-known ranking system: Google PageRank.

francesco mapelli