The robots.txt file

When a robot visits a Web site, say http://www.example.com/, it first checks for http://www.example.com/robots.txt. All major search engines observe robots.txt. If the robot finds this file, it analyses the contents to see whether it is allowed to retrieve documents from the site. We can create a customised robots.txt file that applies to all robots or only to specific ones, and that disallows access to specific directories or files. The robots.txt file is simply a text file and must be at the root level of our server.

If we want to keep every robot out of a set of directories while leaving the rest of the site accessible:

User-agent: *
Disallow: /Architext/
Disallow: /bin/
Disallow: /cgi/
Disallow: /Excite/
Disallow: /includes/
Disallow: /tmp/
Disallow: /~
Disallow: /stats/
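
As a rough check of how a well-behaved robot would read these rules, the following sketch feeds them to Python's standard urllib.robotparser module. The robot name "AnyBot" and the page paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, held in a string instead of a file.
RULES = """\
User-agent: *
Disallow: /Architext/
Disallow: /bin/
Disallow: /cgi/
Disallow: /Excite/
Disallow: /includes/
Disallow: /tmp/
Disallow: /~
Disallow: /stats/
"""

def build_parser(text):
    """Parse robots.txt text and return a ready-to-query parser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = build_parser(RULES)
BASE = "http://www.example.com"

# "AnyBot" is a hypothetical robot name; the "*" record applies to it.
blocked = parser.can_fetch("AnyBot", BASE + "/tmp/scratch.html")
home_dir = parser.can_fetch("AnyBot", BASE + "/~alice/index.html")
allowed = parser.can_fetch("AnyBot", BASE + "/index.html")

print(blocked, home_dir, allowed)
```

Each Disallow line is a path prefix, so "/~" covers every user home directory, while paths that match no prefix stay accessible.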

A robots.txt file can also prevent every robot from visiting any part of the site except for one named robot, such as the CS Ultraseek robot.
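
A file of that kind might look like the rules in the sketch below, which are again checked with Python's urllib.robotparser. The user-agent token "Ultraseek" is an assumption made for illustration, as is the name "SomeOtherBot"; the real token should be taken from the robot's own documentation.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the favoured robot announces itself with the token "Ultraseek".
# An empty Disallow value means nothing is disallowed for that robot.
RULES = """\
User-agent: Ultraseek
Disallow:

User-agent: *
Disallow: /
"""

def is_allowed(user_agent, url, rules=RULES):
    """Return True if the given robot may fetch the URL under the rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)

URL = "http://www.example.com/index.html"
ultraseek_ok = is_allowed("Ultraseek", URL)
other_ok = is_allowed("SomeOtherBot", URL)  # hypothetical robot name

print(ultraseek_ok, other_ok)
```

The record for the named robot comes first and allows everything; every other robot falls through to the "*" record, which disallows the whole site.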

The robot simply looks for a "/robots.txt" URI at the top level of our site, where a site is defined as an HTTP server running on a particular host and port number. Look at the robots.txt file at these locations for approaches to limiting access:

The single "/robots.txt" at the top level of a site is the only one that robots use. A robot will never look at "robots.txt" files in subdirectories. If we want our users to be able to create their own "robots.txt" rules, we will need to merge them all into a single "/robots.txt". If we don't want to do this, our users might want to use the robots meta tag instead.
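
Because only the top-level file counts, the address a robot requests can be derived from any page URL by keeping the scheme, host and port and replacing everything else. A minimal sketch, with a helper name (robots_url) chosen here for illustration:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the top-level robots.txt URL for the site serving page_url."""
    parts = urlsplit(page_url)
    # netloc holds the host and, if present, the port number,
    # so two ports on the same host count as two separate sites.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

deep_page = robots_url("http://www.example.com/~alice/docs/page.html?x=1")
other_port = robots_url("http://www.example.com:8080/a/b/")

print(deep_page)
print(other_port)
```

A page deep inside a user directory maps to the same file as the home page, which is why per-directory robots.txt files have no effect.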

The meta element

The robots meta tag

The robots meta tag allows HTML authors to tell visiting robots whether a single document may be indexed or used to harvest more links. No server administrator action is required. The meta element works in tandem with the robots.txt file, though, with robots.txt as the first line of exclusion: a robot can observe a page's robots meta element only if robots.txt allows it to retrieve that page in the first place.

The content of the robots meta tag contains directives separated by commas. The currently defined directives are [NO]INDEX and [NO]FOLLOW (the brackets mark the optional NO prefix). The INDEX directive specifies whether an indexing robot should index the page. The FOLLOW directive specifies whether a robot is to follow links on the page. The defaults are INDEX and FOLLOW. The values ALL and NONE set all directives on or off: ALL=INDEX,FOLLOW and NONE=NOINDEX,NOFOLLOW.
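
The directive rules above can be sketched as a small function that turns a content string, such as the one in a tag like <meta name="robots" content="noindex,follow">, into two flags. The function name is hypothetical, and ignoring unknown directives is a choice made for this sketch rather than something the text prescribes.

```python
def parse_robots_directives(content):
    """Return (index, follow) booleans for a robots meta content string."""
    index, follow = True, True  # the defaults are INDEX and FOLLOW
    for token in content.split(","):
        directive = token.strip().upper()  # directives are matched case-insensitively here
        if directive == "ALL":
            index, follow = True, True
        elif directive == "NONE":
            index, follow = False, False
        elif directive == "INDEX":
            index = True
        elif directive == "NOINDEX":
            index = False
        elif directive == "FOLLOW":
            follow = True
        elif directive == "NOFOLLOW":
            follow = False
        # Anything else is ignored in this sketch.
    return index, follow

print(parse_robots_directives("noindex,follow"))
print(parse_robots_directives("NONE"))
```

A directive that is not mentioned keeps its default, so "nofollow" on its own still leaves the page indexable.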

The description and keywords meta tags

In the absence of other information, robots will index all the text in a document (but not HTML tags), including ALT text but excluding comments, and will use the first few words as a summary to describe our page in the search results. If our page is constructed so that this gives a clear idea of the contents, then nothing else is necessary, although we may wish to add some keywords, especially if our page is in a specialist area. If the page does not contain any descriptive text, such as a frameset or a page whose description sits in an imagemap, then add a META tag containing a description to ensure the search engine indexes it and presents suitable text when it is shown as a result.

We can use the HTML description META tag to specify the summary text that will appear in a search results list. The keywords META tag allows us to add further keywords (up to 1000 characters) for indexing that are not in the description or in the page text itself (if they are, they will be indexed already). The robots META tag allows us to control if and how Ultraseek indexes our page. The META tags must be placed within the HEAD portion of our web page. Do not use any HTML tags within the META tag itself.
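
To illustrate where these tags sit and how an indexer might read them, here is a minimal sketch using Python's standard html.parser module. The sample page, its wording and the MetaCollector class are invented for this example and do not describe how Ultraseek itself works.

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collect name/content pairs from META tags in a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content

# A made-up page with the three META tags placed inside HEAD.
PAGE = """<html><head>
<title>Example page</title>
<meta name="description" content="A short summary shown in search results.">
<meta name="keywords" content="robots, indexing, meta tags">
<meta name="robots" content="index,follow">
</head><body><p>Body text.</p></body></html>"""

def collect_meta(html):
    """Return a dict mapping META names to their content strings."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return collector.meta

meta = collect_meta(PAGE)
print(meta["description"])
print(meta["keywords"])
```

The content values are plain text, in line with the rule that no HTML tags should appear inside a META tag.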