
The proper usage of the robots.txt file


Titan
02-17-2005, 02:11 PM
When optimizing a web site, most webmasters don't consider using the robots.txt file.

This is a very important file for your site. It lets spiders and crawlers know what they can and cannot index, which is helpful for keeping them out of folders that you do not want indexed, such as an admin or stats folder. The robots.txt file goes in the root directory of your site, so it can be fetched at http://www.example.com/robots.txt, for example.

Here is a list of the directives you can include in a robots.txt file and their meanings:

1) User-agent: In this field you specify the robot that the access policy applies to, or a “*” for all robots (see the examples below).

2) Disallow: In this field you specify the files and folders that should not be crawled.

3) #: Anything after a # is a comment.
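
For example, a commented entry might look like this (the folder name is just an illustration):

# keep all robots out of the private area
User-agent: *
Disallow: /private/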

Here are some examples of robots.txt files.

User-agent: *
Disallow:

The above would let all spiders index all content.

Here's another:

User-agent: *
Disallow: /cgi-bin/

The above would block all spiders from indexing the cgi-bin directory.

User-agent: googlebot
Disallow:

User-agent: *
Disallow: /admin.php
Disallow: /cgi-bin/
Disallow: /admin/
Disallow: /stats/

In the above example googlebot can index everything, while all other spiders cannot index admin.php or the cgi-bin, admin, and stats directories. Notice that you can block single files like admin.php.
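
One more detail worth knowing: Disallow matches by simple prefix against the start of the URL path. As an illustration (the paths here are hypothetical):

User-agent: *
Disallow: /admin

The above blocks /admin.php, /admin/, and even /administrator/, since any path starting with /admin matches. By contrast:

User-agent: *
Disallow: /admin/

The above only blocks what is inside the /admin/ folder, so the trailing slash matters.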

NickPapageorgio
02-17-2005, 02:21 PM
OK, in the last example, that would be two records, both included in the same robots.txt file? Like you're saying, first and foremost, that Google has access to everything, and secondly, that any other agent cannot access the specified files?

Thanks. Always wondered exactly what robots.txt was for. Any other good uses?

Titan
02-17-2005, 03:58 PM
i think you can use the command [makecoffee] as well ;)

iknowalttl
03-18-2005, 10:11 PM
Would there be any reason to allow spiders to crawl any folders other than the public html and ftp folders? Would I be hurting myself in any way by disallowing all folders except these?

Titan
03-20-2005, 05:26 PM
Unless you want to keep something unlisted in the search engines, I don't see why you would; everything they crawl is a potential listing somewhere in the SEs.
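
If you do want to lock things down, a sketch along those lines simply lists every non-public folder (the folder names here are hypothetical):

User-agent: *
# list each private folder you want kept out of the index
Disallow: /logs/
Disallow: /backup/
Disallow: /tmp/

One thing to keep in mind: robots.txt is itself publicly readable, so listing a folder there also advertises that it exists. It keeps spiders out, not people.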