#######################################################################
# Searchable Keywords: robots meta tag

Search engines will look in your root domain for a special file named
"robots.txt" (http://www.mydomain.com/robots.txt). The file tells the
robot (spider) which files it may spider (download). This system is
called the Robots Exclusion Standard.

The format of the robots.txt file is special. It consists of records,
and each record consists of two fields: a User-agent line and one or
more Disallow: lines. The format of each line is:

    <field>:<value>

The robots.txt file should be created in Unix line-ending mode! Most
good text editors have a Unix mode, or your FTP client *should* do the
conversion for you. Do not attempt to create a robots.txt file with an
HTML editor that does not specifically have a text mode.

User-agent

The User-agent line specifies the robot. For example:

    User-agent: googlebot

You may also use the wildcard character "*" to specify all robots:

    User-agent: *

You can find user agent names in your own logs by checking for
requests to robots.txt. Most major search engines have short names for
their spiders.

Disallow:

The second part of a record consists of Disallow: directive lines.
These lines specify files and/or directories. For example, the
following line instructs spiders that they may not download email.htm:

    Disallow: email.htm

You may also specify directories:

    Disallow: /cgi-bin/

which would block spiders from your cgi-bin directory.

The Disallow directive matches by prefix. The standard dictates that

    Disallow: /bob

would disallow both /bob.html and /bob/index.html (the file bob and
all files in the bob directory will not be indexed).

If you leave the Disallow line blank, ALL files may be retrieved. At
least one Disallow line must be present for each User-agent directive
for the record to be correct. A completely empty robots.txt file is
treated the same as if it were not present.

--------------------------- Examples -------------------------

The following allows all robots to visit all files, because the
wildcard "*" specifies all robots and the empty Disallow line blocks
nothing:

    User-agent: *
    Disallow:

This one keeps all robots out:

    User-agent: *
    Disallow: /

The next one bars all robots from the cgi-bin and images directories:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /images/

(The first sketch at the end of this section shows how a crawler
interprets these records.)

Meta Robots Tag

One other meta tag worth mentioning is the robots tag. This lets you
specify that a particular page should NOT be indexed by a search
engine. To keep spiders out, simply add this tag between the head tags
of each page you don't want indexed. The format is shown below:

    <html>
    <head>
    <meta name="robots" content="noindex">
    <title>Page I Don't Want In Search Engines</title>
    </head>

You do NOT need to use variations of the meta robots tag to help your
pages get indexed. They are unnecessary. By default, a crawler will
try to index all your web pages and will try to follow links from one
page to another.

Most major search engines support the meta robots tag. However, the
robots.txt convention of blocking indexing is more efficient, as you
don't need to add tags to each and every page. See the Search Engines
Features page for more about the robots.txt file. If you use a
robots.txt file to block indexing, there is no need to also use meta
robots tags. (The second sketch at the end of this section shows how a
crawler might detect the tag.)

The meta robots tag also has some extensions offered by particular
search engines to prevent indexing of multimedia content. The article
below talks about this in more depth and provides some links to help
files. Search Engine Watch members should follow the link from the
article to the members-only edition for extended help on the subject.
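
--------------------------- Sketches -------------------------

If you want to verify how a crawler interprets robots.txt records,
here is a minimal sketch using Python's standard urllib.robotparser
module. The record is the third example above; the domain and file
names (www.mydomain.com, form.pl, about.html) are made up for
illustration.

    import urllib.robotparser

    # The third example record above: bars all robots from the
    # cgi-bin and images directories.
    ROBOTS_TXT = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /images/\n"

    rp = urllib.robotparser.RobotFileParser()
    rp.parse(ROBOTS_TXT.splitlines())

    # Blocked: the path falls under /cgi-bin/.
    print(rp.can_fetch("googlebot", "http://www.mydomain.com/cgi-bin/form.pl"))

    # Allowed: no Disallow line matches this path.
    print(rp.can_fetch("googlebot", "http://www.mydomain.com/about.html"))

The first call prints False and the second True, matching the prefix
rule described above.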
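The next sketch, again in Python, shows how a crawler might honor the
meta robots tag before indexing a page. The class name
RobotsMetaParser and the sample page are hypothetical; only the tag
format itself comes from the text above.

    from html.parser import HTMLParser

    class RobotsMetaParser(HTMLParser):
        # Collects the directives of any <meta name="robots"> tag.
        def __init__(self):
            super().__init__()
            self.directives = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
                content = attrs.get("content") or ""
                self.directives += [d.strip().lower() for d in content.split(",")]

    PAGE = ('<html><head>'
            '<meta name="robots" content="noindex">'
            "<title>Page I Don't Want In Search Engines</title>"
            '</head><body>Hello.</body></html>')

    parser = RobotsMetaParser()
    parser.feed(PAGE)

    # A polite crawler skips indexing when "noindex" is present.
    print("noindex" in parser.directives)  # True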