University of Cambridge Home Computing Service

Controlling access intentionally: 2

robots.txt

The robots.txt file sits at the root level of your web server and gives information about what should not be indexed. An example might look like this:

# A comment line just to show what one looks like; it is ignored.
User-agent: *
Disallow: /bin/
Disallow: /cgi/
Disallow: /includes/
Disallow: /tmp/
Disallow: /~
Disallow: /stats/
Disallow: /local.html
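
To see how a well-behaved robot might read these rules, the following is a minimal sketch using Python's standard urllib.robotparser module. The robot name "ExampleBot" and the sample paths are hypothetical, chosen only for illustration; real robots may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The example rules from above, held in a string instead of being fetched.
ROBOTS_TXT = """\
# A comment line just to show what one looks like; it is ignored.
User-agent: *
Disallow: /bin/
Disallow: /cgi/
Disallow: /includes/
Disallow: /tmp/
Disallow: /~
Disallow: /stats/
Disallow: /local.html
"""


def build_parser(text):
    """Return a RobotFileParser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = build_parser(ROBOTS_TXT)

# Hypothetical paths; each Disallow line is matched as a path prefix.
for path in ("/index.html", "/cgi/search.pl", "/~fred/", "/local.html"):
    allowed = rules.can_fetch("ExampleBot", path)
    print(path, "allowed" if allowed else "disallowed")
```

Because each Disallow value is a prefix, "Disallow: /~" covers every path that begins with "/~", while "Disallow: /local.html" in practice affects just that one file.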

Another example, including a reference to a named search engine robot, might look like this:

User-agent: Ultraseek (webmaster@ucs.cam.ac.uk)  # local search engine
Disallow: /bin/
Disallow: /cgi/
Disallow: /includes/
Disallow: /tmp/
Disallow: /~
Disallow: /stats/
Disallow: /local.html

# tell all others to go away

User-agent: *
Disallow: /
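
The effect of per-robot sections can be checked in the same way. The sketch below again uses Python's standard urllib.robotparser module and makes two assumptions: the User-agent line is shortened to the bare name "Ultraseek", because parsers differ in how they match longer identification strings, and "OtherBot" is a hypothetical name standing in for any other robot.

```python
from urllib.robotparser import RobotFileParser

# Simplified copy of the example: the named robot is written as a bare
# token (an assumption made so that matching is unambiguous).
ROBOTS_TXT = """\
User-agent: Ultraseek
Disallow: /bin/
Disallow: /cgi/
Disallow: /includes/
Disallow: /tmp/
Disallow: /~
Disallow: /stats/
Disallow: /local.html

# tell all others to go away

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The named robot gets its own section; every other robot falls through
# to the "*" section, which disallows the whole site.
for agent in ("Ultraseek", "OtherBot"):
    for path in ("/index.html", "/cgi/search.pl"):
        verdict = "allowed" if parser.can_fetch(agent, path) else "disallowed"
        print(agent, path, verdict)
```

A robot uses the section that names it and ignores the "*" section; the "*" section applies only to robots that are not named anywhere else in the file.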

Robots meta tag

If information providers can neither update the robots.txt file nor request changes to it, they can use the robots META tag to specify, within an HTML page, whether indexing robots may index the contents of the document and/or follow links from it to other documents. This is of limited use, since it works only in HTML documents, but it does not require changes to any robots.txt file. If there is also a robots.txt file, the exclusions there are processed first.

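All META tags must be placed within the <HEAD> section of the HTML. The name attribute must be "robots", and the content attribute contains a comma-separated list of directives to control indexing, chosen from INDEX, NOINDEX, FOLLOW and NOFOLLOW, or the shorthand forms ALL (equivalent to INDEX,FOLLOW) and NONE (equivalent to NOINDEX,NOFOLLOW).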

The values of the name and content attributes are case-insensitive. Repeated or contradictory values should be avoided. The defaults are INDEX,FOLLOW, i.e. all indexing is allowed. Note that INDEX and/or FOLLOW cannot override exclusions specified in a robots.txt file, since an excluded document would not be fetched and the tag would not be seen. Also, the NOFOLLOW exclusion applies only to access through links on the page containing the tag - the target documents may still be indexed if the search engine finds links to them elsewhere.

Ignoring the "shorthand" ALL and NONE variants, the following examples show all the possible combinations:

<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">
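
How a robot might turn such a tag into a decision can be sketched with Python's standard html.parser module. This is an illustration under stated assumptions, not a description of any particular search engine: the names RobotsMetaParser and robots_policy are hypothetical, the sketch does not check that the tag sits inside <HEAD>, and where directives contradict each other it lets the more restrictive one win, which is one possible choice among several.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").strip().lower() != "robots":
            return
        content = attr.get("content") or ""
        for directive in content.split(","):
            directive = directive.strip().lower()  # values are case-insensitive
            if directive:
                self.directives.add(directive)


def robots_policy(html):
    """Return (may_index, may_follow) for an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    # Defaults are INDEX,FOLLOW; only the negative directives change them.
    # Assumption: a restrictive directive beats a contradictory permissive one.
    may_index = not ({"noindex", "none"} & found)
    may_follow = not ({"nofollow", "none"} & found)
    return may_index, may_follow


page = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
print(robots_policy(page))
```

A document with no robots META tag yields (True, True), matching the INDEX,FOLLOW default.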
