Search Engine Robots and Spiders

What are search engine robots?
Search engine robots, also called crawlers or spiders, are automated agents that search engines run to index large numbers of web pages. They visit websites and read the data on each page, moving from page to page and from one website to another by following links.

The functionality of these robots or spiders is very limited, and there is no magic trick that makes them put your website at the top of search engine result pages. All a search engine robot does is read and copy a page’s meta tags, title and textual content. The search engine indexes this information later. To get a good rank in search results, what you need is search engine optimization. Web crawlers won’t miss your important content if your website is optimized for search engines.

Controlling search engine spiders or robots
We can still control the behavior of search engine robots to a certain degree.

- Not indexing
By editing “robots.txt”, you can tell robots not to access certain pages or directories.

# this tells all crawlers not to crawl this directory.
User-agent: *
Disallow: /not_to_be_indexed/

(’*’ refers to all agents)

(Note: this kind of access control can also be done through the robots meta tag of a web page, which lets you manage individual pages.)
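
A robots.txt rule like the one in this section can be checked before you upload it. The following is a minimal sketch using Python’s standard urllib.robotparser module. The helper name is_allowed, the agent name any_crawler and the page URLs are made up for illustration, and real crawlers may interpret rules slightly differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# Same rule as in the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /not_to_be_indexed/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules let user_agent fetch url.

    Hypothetical helper: it parses the rules from a string instead of
    downloading them, so it can be run without a live website.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "any_crawler" and both URLs are placeholders.
    for url in (
        "http://www.mydomain.com/not_to_be_indexed/page.html",
        "http://www.mydomain.com/index.html",
    ):
        print(url, is_allowed(ROBOTS_TXT, "any_crawler", url))
```

The first URL falls under the disallowed directory, so the check reports it as blocked, while the second one is reported as allowed.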

- Blocking bad crawlers
You can tell a harmful crawler to stay away from your whole website. Keep in mind that this only works for crawlers that obey robots.txt.

# this blocks bad_crawler.
User-agent: bad_crawler_name
Disallow: /

(’/’ means to block the whole site)

- Controlling multiple robots
You can give different rules to different robots in the same file.

# this blocks all robots from crawling any of your site, EXCEPT Google.
# Googlebot is allowed to crawl the entire site, EXCEPT the ‘cgi-bin’ directory.
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
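
To see how the two groups of rules interact, here is a small sketch, again assuming Python’s standard urllib.robotparser module. The function name access_table, the agent name other_bot and the page paths are hypothetical. The parser applies the group that names a robot in preference to the ‘*’ group, which is the behavior described in the comments of the rules above.

```python
from urllib.robotparser import RobotFileParser

# Same rules as in the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
"""


def access_table(robots_txt, agents, urls):
    """Map every (agent, url) pair to True (allowed) or False (blocked)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        (agent, url): parser.can_fetch(agent, url)
        for agent in agents
        for url in urls
    }


if __name__ == "__main__":
    # "other_bot" stands in for any robot that is not Googlebot.
    agents = ["Googlebot", "other_bot"]
    urls = [
        "http://www.mydomain.com/index.html",
        "http://www.mydomain.com/cgi-bin/script.cgi",
    ]
    for (agent, url), allowed in access_table(ROBOTS_TXT, agents, urls).items():
        print(agent, url, "allowed" if allowed else "blocked")
```

With these rules, only the Googlebot request for index.html comes back as allowed.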

- Where should I put the robots.txt file?
Place it in the root directory of your website.
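
Because the file always sits at the root, its address can be worked out from the address of any page on the same site. A minimal sketch in Python, where the function name robots_txt_url and the sample page address are assumptions made for illustration:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # The page path is a placeholder.
    print(robots_txt_url("http://www.mydomain.com/blog/some-post.html?page=2"))
```

A file placed in a subdirectory, such as /blog/robots.txt, would not be found at the address this function returns.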

- Things to consider
The robots.txt file sits in the root directory of a domain and can be read by anyone. You can just type “http://www.mydomain.com/robots.txt” to look at a site’s robots file. This is a security concern, because every directory you list under “Disallow” is also announced to anyone who opens the file. For directories that you don’t want robots to crawl or people to check out, consider removing all links to them instead of naming them explicitly in robots.txt. Robots find pages by following links, so they are unlikely to reach pages that nothing links to. Removing links to those directories can therefore be better than using “Disallow” in robots.txt.
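
To get a feel for how much the file gives away, here is a small sketch that lists every path named on a “Disallow” line. It is a deliberately simple reader written for illustration, not a full robots.txt parser, and the function name disallowed_paths and the sample file content are assumptions.

```python
def disallowed_paths(robots_txt):
    """Return the paths that appear on Disallow lines, in file order."""
    paths = []
    for raw_line in robots_txt.splitlines():
        # Drop comments, then split "Field: value" at the first colon.
        line = raw_line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        value = value.strip()
        if sep and field.strip().lower() == "disallow" and value:
            paths.append(value)
    return paths


if __name__ == "__main__":
    # Sample content, reusing directory names from the examples above.
    sample = """\
# rules for every robot
User-agent: *
Disallow: /not_to_be_indexed/
Disallow: /cgi-bin/  # scripts
"""
    for path in disallowed_paths(sample):
        print(path)
```

Anything this prints for your own file is visible to every visitor, which is the trade-off described above.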

- Useful links
List of Search Engine Robots, Spiders and Crawlers
How to write robots.txt
