Friday, February 19, 2010

A Helpful Guide - How Search Engines Work

You know how important it is to rank high in the SERPs (Search Engine Results Pages). But your site isn't showing up in the first three pages, and you don't understand why. It could be that you're confusing the web crawlers trying to index it. How can you find out?

In order to achieve a useful level of website optimization, it is essential to understand how search engines operate and how they arrive at their results.

Search engines work in a number of different ways that directly relate to search engine optimization:

1. Crawling the Web

Search engines run automated programs, called "bots", "spiders" or "crawlers", that use the hyperlink structure of the web to "crawl" the pages and documents that make up the World Wide Web. Spiders can only follow links from one page to another and from one site to another. That is the primary reason why links to your site (inbound links) are so important. Links to your website from other websites give the search engine spiders more "food" to chew on.

Spiders find Web pages by following links from other Web pages, but you can also submit your Web pages directly to a search engine or directory and request a visit by their spider. It can be useful to submit your URL straight to the various search engines, but spider-based engines will usually pick up your site regardless of whether or not you've submitted it.
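
To make this concrete, here is a toy Python sketch of a link-following crawler built only from the standard library. The seed URL, page limit and timeout are made-up values for illustration; a real search engine spider is vastly more sophisticated.

```python
# A minimal sketch of how a spider follows hyperlinks, using only
# Python's standard library. The seed URL and page limit below are
# illustrative placeholders, not real search engine settings.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=10):
    """Breadth-first crawl: fetch a page, then queue every link found on it."""
    queue, seen = [seed_url], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            continue  # unreachable or non-HTML pages are simply skipped
        parser = LinkExtractor()
        parser.feed(html)
        # Resolve relative links against the current page before queueing them.
        queue.extend(urljoin(url, link) for link in parser.links)
    return seen


if __name__ == "__main__":
    print(crawl("http://www.example.com/"))
```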

2. Indexing Documents

Once pages and Web addresses have been crawled or collected, they are sent to the search engine's indexing software. The indexing software extracts information from the documents and stores it in a database. The kind of information indexed depends on the particular search engine: some index every word in a document, while others index the document title only.
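
As a rough illustration of that step, the toy snippet below records, for every word, which documents contain it. The two documents are invented examples, and a real index stores far more detail.

```python
# A toy sketch of indexing: extract every word from each document and
# record which documents contain it. The documents are invented examples.
from collections import defaultdict

documents = {
    "page1.html": "search engines crawl the web",
    "page2.html": "spiders follow links across the web",
}

index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        index[word].add(doc_id)

print(dict(index))
# e.g. 'web' maps to both pages, 'crawl' only to page1.html
```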

3. Processing Queries

When you perform a search by entering keywords, the search engine matches your query against the index it has created and assembles a web page that lists the results as hypertext links. The index consists of the words in each document, plus pointers to their locations within the documents. This is called an inverted file.
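
The sketch below shows what a tiny inverted file can look like and how a query might be matched against it. The documents and the query are invented; real engines store far richer information per entry.

```python
# A sketch of an "inverted file": for each word, a list of
# (document, position) pointers, which is then used to answer a keyword query.
from collections import defaultdict

documents = {
    "page1.html": "search engines rank web pages",
    "page2.html": "web crawlers index web pages",
}

inverted_file = defaultdict(list)
for doc_id, text in documents.items():
    for position, word in enumerate(text.lower().split()):
        inverted_file[word].append((doc_id, position))

def search(query):
    """Return the documents that contain every word of the query."""
    results = None
    for word in query.lower().split():
        docs = {doc for doc, _ in inverted_file.get(word, [])}
        results = docs if results is None else results & docs
    return results or set()

print(search("web pages"))  # both pages contain "web" and "pages"
```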

4. Ranking Results

Once the search engine has determined which results match the query, the engine's algorithm (a set of calculations used for sorting) scores each of the results to determine which is most relevant to the given query. It then sorts them on the results pages in order from most relevant to least relevant, so that users can choose which to select.
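
As a simplified illustration, the snippet below scores each document by how often the query term appears and sorts the results from most to least relevant. Real ranking algorithms weigh many more factors; the documents and scoring here are placeholders.

```python
# A toy ranking step: score each candidate document (here, by simple
# query-term frequency) and sort from most to least relevant.
documents = {
    "page1.html": "web search web ranking web",
    "page2.html": "web crawling and indexing",
    "page3.html": "nothing about the topic",
}

def score(text, query_terms):
    words = text.lower().split()
    return sum(words.count(term) for term in query_terms)

query = ["web"]
ranked = sorted(documents, key=lambda d: score(documents[d], query), reverse=True)
print(ranked)  # ['page1.html', 'page2.html', 'page3.html']
```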

Search engine crawlers may look at a number of different factors when crawling a site. Not every page is indexed by the search engines. Certain types of navigation may hinder or entirely prevent search engines from reaching your website's content:

Speed Bumps & Walls

Complex links and deep site structures with little unique content may act as "speed bumps". Content that cannot be reached through spiderable links qualifies as a "wall".

Possible "Speed Bumps" for SE Spiders:

*  URLs with 2+ dynamic parameters; e.g. http://www.url.com/page.php?id=4&CK=34rr&User=%Tom% (spiders may be reluctant to crawl complex URLs like this because they often result in errors with non-human visitors). A quick way to count these parameters is sketched after this list.

*  Pages with more than 100 unique links to other pages on the site (spiders may not follow each one).

*  Pages buried more than 3 clicks/links from the home page of a website (unless there are many other external links pointing to the site, spiders will often ignore deep pages).

*  Pages requiring a "Session ID" or Cookie to enable navigation (spiders may not be able to retain these elements as a browser user can).

*  Pages that are split into "frames" can hinder crawling and cause confusion about which pages to rank in the results.
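
If you want to check how many dynamic parameters one of your own URLs carries, Python's standard URL parser can count them. The URL below is the same illustrative one from the list above.

```python
# Count the dynamic (query-string) parameters in a URL using the
# standard library. The URL is the illustrative example from the list.
from urllib.parse import urlparse, parse_qs

url = "http://www.url.com/page.php?id=4&CK=34rr&User=%Tom%"
params = parse_qs(urlparse(url).query)
print(len(params), "dynamic parameters:", sorted(params))
# 3 dynamic parameters: ['CK', 'User', 'id']
```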

Possible "Walls" for SE Spiders:

*  Pages accessible only via a select form and submit button.

*  Pages requiring a drop down menu (HTML attribute) to access them.

*  Documents accessible only via a search box.

*  Documents blocked purposefully (via a robots meta tag or robots.txt file; a quick robots.txt check is sketched after this list).

*  Pages requiring a login.

*  Pages that redirect before showing content (search engines call this cloaking or bait-and-switch and may actually ban sites that use this tactic).
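
To check whether a robots.txt file is blocking a page you actually want indexed, Python's standard robotparser module can be queried. The site URL and user-agent below are placeholders; substitute your own.

```python
# Ask a site's robots.txt whether a given crawler may fetch a given page.
# The URL and user-agent are placeholders for illustration.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://www.example.com/robots.txt")
rp.read()  # fetches and parses the robots.txt file

print(rp.can_fetch("Googlebot", "http://www.example.com/private/page.html"))
```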

The key to ensuring that a site's contents are fully crawlable is to provide direct, HTML links to each page you want the search engine spiders to index. Remember that if a page cannot be accessed from the home page (where most spiders are likely to start their crawl), it is likely that it will not be indexed by the search engines. A sitemap can be of tremendous help for this purpose.
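
As a minimal illustration of that last point, the snippet below writes a bare-bones XML sitemap listing a handful of placeholder URLs. Real sitemaps usually also carry last-modified dates and change frequencies, and can be submitted to the search engines directly.

```python
# Write a minimal XML sitemap so spiders can discover pages that are
# poorly linked internally. The URLs listed are placeholders.
urls = [
    "http://www.example.com/",
    "http://www.example.com/about.html",
    "http://www.example.com/articles/deep-page.html",
]

lines = ['<?xml version="1.0" encoding="UTF-8"?>',
         '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
for url in urls:
    lines.append("  <url><loc>%s</loc></url>" % url)
lines.append("</urlset>")

with open("sitemap.xml", "w") as f:
    f.write("\n".join(lines))
```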