The robots.txt file and robots META tags are methods for allowing and disallowing robots (web robots,
spiders) to crawl portions of your site. Website administrators and content providers can define which
parts of the site a robot should and should not visit.
Spiderline honors the robot exclusion protocol and robots META tags. Our spider will not index directories
or follow links that have been disallowed in the robots.txt configuration file located on a server, or by
META tags designating "noindex" and/or "nofollow". If you use these methods for controlling spidering,
you do not need to specify NOINDEX and NOFOLLOW for your account in the URL Configuration fields.
If you do not have a Spiderline account or you want to disallow other robots from crawling your website, the
following document provides general information regarding Robot Exclusion. Disallowing robots from your
website or part(s) of your website can be accomplished by two methods:
- Robot Exclusion Protocol (robots.txt)
- Robot META tags
Robot Exclusion Protocol - robots.txt
The robots.txt file is a plain TEXT file (not HTML). When a compliant robot visits a site, it first checks for a
"/robots.txt" URL at the web root. If this file exists, the robot parses its contents for directives that
instruct the robot to visit or not visit certain parts of the site.
Each directive has a User-agent line, which names the robot to be controlled, and a list of "disallows".
The disallows are scanned in order, with the last match encountered determining whether a document is
allowed to be visited or not. If there are no matches at all, the document may be crawled.
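As a small illustration of where a robot looks for the file, the sketch below derives the "/robots.txt" URL at the web root from any page URL on the same site. The function name robots_txt_url and the host www.example.com are assumptions made for this example only.

```python
from urllib.parse import urljoin


def robots_txt_url(page_url):
    """Return the robots.txt URL at the web root of the site serving page_url."""
    # An absolute path replaces the whole path of the base URL,
    # so the result always points at the web root.
    return urljoin(page_url, "/robots.txt")


print(robots_txt_url("http://www.example.com/docs/tmp/page.html"))
# prints http://www.example.com/robots.txt
```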
A directive in the robots.txt file consists of the following fields:
User-agent:
Disallow:
- The User-agent Field
  - The name of the robot that should follow the specified access policy.
  - Acceptable values include a robot's name or an asterisk (*) to indicate all robots.
  - Each Disallow field must be preceded by a User-agent field.
  - More than one User-agent field can be present per directive.
- The Disallow Field
  - A URL path or pattern that should not be visited (crawled).
  - Acceptable values include a full path, partial path, or empty set.
  - An empty set (the value is left blank) indicates that all paths can be visited.
  - Each User-agent field must be accompanied by a Disallow field.
  - More than one Disallow field can be present per directive.
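The field rules above can be sketched as a small parser that groups User-agent and Disallow lines into directives (records), treating a blank line as the end of a record. This is only an illustration under simplifying assumptions: the function name parse_records and the dictionary layout are hypothetical, comments are not handled, and it is not a description of how Spiderline parses the file.

```python
def parse_records(text):
    """Split robots.txt text into records of user-agents and disallowed paths."""
    records = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            current = None  # a blank line delimits records
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if current is None:
            current = {"user_agents": [], "disallow": []}
            records.append(current)
        if field == "user-agent":
            current["user_agents"].append(value)
        elif field == "disallow":
            # An empty value is kept as "" and means all paths can be visited.
            current["disallow"].append(value)
    return records


sample = "User-agent: Badbot_1\nUser-agent: Badbot_2\nDisallow: /\n\nUser-agent: *\nDisallow: /tmp/\nDisallow: /cgi/\n"
for record in parse_records(sample):
    print(record)
```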
Examples:
Exclude all robots from the entire server:
User-agent: *
Disallow: /

Allow all robots complete access:
User-agent: *
Disallow:
Or create an empty "/robots.txt" file.

Exclude all robots from part of the server:
User-agent: *
Disallow: /private/
Disallow: /tmp/
Disallow: /cgi/

Exclude a single robot:
User-agent: Badbot
Disallow: /

Exclude more than one robot:
User-agent: Badbot_1
User-agent: Badbot_2
User-agent: Badbot_3
Disallow: /

Allow a single robot:
User-agent: Spiderline
Disallow:

Directives can be combined for more specific instructions and control.
User-agent: Spiderline
Disallow:

User-agent: *
Disallow: /
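One way to check how a combined file behaves is to feed it to a robots.txt parser. The sketch below uses urllib.robotparser from the Python standard library; it reflects that module's interpretation of the file, which is assumed here to agree with a compliant robot for this simple case. The robot name OtherBot is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Spiderline may visit everything; every other robot is excluded.
rules = """\
User-agent: Spiderline
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

spiderline_ok = parser.can_fetch("Spiderline", "/index.html")
other_ok = parser.can_fetch("OtherBot", "/index.html")
print("Spiderline allowed:", spiderline_ok)
print("OtherBot allowed:", other_ok)
```

With these rules the first check should report that Spiderline is allowed and the second that OtherBot is not.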
Exclude all files or directory paths except one:
This is difficult because there is no "Allow" field. The easiest way to accomplish this task is to place the
files or directories you do not want crawled in a directory, for example 'norobots', and to put the files and
directories you do want robots to crawl in a level above the norobots directory.
User-agent: *
Disallow: /norobots/

Alternatively, you can explicitly disallow all pages that should not be visited by robots:
User-agent: *
Disallow: /dir/private.html
Disallow: /dir/tmp.html
Disallow: /dir/
IMPORTANT NOTES!
- There is a difference between the following:
User-agent: *
Disallow: /docs
and
User-agent: *
Disallow: /docs/
In the first example, compliant robots will not visit any document whose path begins with '/docs'. This covers
/docs.html, /docs.pdf, and /docs.jpg, as well as everything in the /docs/ directory, such as /docs/webpage.html.
In the second example, compliant robots will not visit documents inside the /docs/ directory, such as
/docs/webpage.html, /docs/tmp/page.pdf, and /docs/dir/tmp/image.gif. The documents /docs.html, /docs.pdf, and
/docs.jpg do not begin with '/docs/' and can still be visited.
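The effect of the trailing slash can be checked with a few lines of Python, assuming a Disallow value is compared against the start of the requested path (plain prefix matching). The helper is_disallowed is a hypothetical name used only for this sketch.

```python
def is_disallowed(path, disallow_values):
    """Return True if path begins with any non-empty Disallow value."""
    # An empty Disallow value means all paths can be visited, so it never matches.
    return any(value and path.startswith(value) for value in disallow_values)


for rule in ("/docs", "/docs/"):
    for path in ("/docs.html", "/docs/webpage.html"):
        print(f"Disallow: {rule:7} {path:20} blocked={is_disallowed(path, [rule])}")
```

Under this assumption, "/docs" blocks both paths, while "/docs/" blocks only the path inside the directory.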
- Regular expressions are not supported in the User-agent or Disallow fields. The '*'
in the User-agent field is a special value meaning "any robot". Specifically, you cannot have
entries such as "Disallow: /tmp/*" or "Disallow: *.gif".
- You need a separate "Disallow" line for every URL prefix you want to exclude. You cannot
enter "Disallow: /cgi-bin/ /tmp/" on one line. Also, you may not have blank lines in a record, as they are
used to delimit multiple records.
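A simple check can catch the two Disallow mistakes described in these notes before a file is published. The sketch below is a hypothetical helper (lint_disallow is not part of any tool mentioned here) that flags wildcard characters and multiple paths on one line.

```python
def lint_disallow(value):
    """Return a list of problems found in the value of one Disallow line."""
    problems = []
    if "*" in value:
        problems.append("wildcards are not supported")
    if len(value.split()) > 1:
        problems.append("only one path per Disallow line")
    return problems


for value in ("/tmp/*", "*.gif", "/cgi-bin/ /tmp/", "/tmp/"):
    print(repr(value), lint_disallow(value) or "ok")
```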
Robots META Tags
The Robots META tag is another method that may be used to indicate to visiting robots whether a page should
be indexed (crawled), or links on the page should be followed. It differs from the Protocol for Robots
Exclusion in that you need no effort or permission from your Web Server Administrator.
The content of the robots META tag contains directives separated by commas. You can define [no]index, [no]follow,
all, or none. The INDEX directive specifies if an indexing robot should index the page. While a robot crawls
around your web site, it collects information about the words and links on each page; this is the process of
indexing. The FOLLOW directive specifies if a robot is to follow links on the page. The defaults are INDEX and
FOLLOW. The values ALL and NONE set all directives on or off: all=index,follow and none=noindex,nofollow.
NOTE: The "robots" name of the tag and the content are case insensitive.
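The content values can be reduced to two flags, index and follow. The sketch below shows one possible interpretation: it starts from the defaults and only ever switches a flag off, so conflicting values resolve to the more restrictive reading. That conflict rule and the function name parse_robots_content are assumptions of this example.

```python
def parse_robots_content(content):
    """Return (index, follow) for the content value of a robots META tag."""
    # The content is case insensitive and its directives are comma separated.
    tokens = {token.strip() for token in content.lower().split(",")}
    index = not ({"noindex", "none"} & tokens)
    follow = not ({"nofollow", "none"} & tokens)
    # "index", "follow", and "all" are the defaults, so they change nothing.
    return index, follow


for content in ("none", "NOINDEX", "nofollow", "all", "noindex, nofollow"):
    print(content, parse_robots_content(content))
```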
Like any META tag, it should be placed between the <head></head> tags of an HTML page:
<html>
<head>
<meta name="robots" content="none">
<meta name="description" content="This page ....">
<title>...</title>
</head>
<body>
...
Examples:
HTML page you do not want crawled/indexed:
<meta name="robots" content="noindex">

HTML page you want crawled, but do not want the robot to follow the links on that page:
<meta name="robots" content="nofollow">

HTML page you do not want crawled AND do not want the robot to follow the links on that page:
<meta name="robots" content="none">
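To see what a robot would read from a page, the robots META tag can be located with the HTML parser in the Python standard library. This is a minimal sketch: the class name RobotsMetaFinder and the sample page are hypothetical, and a real spider may handle malformed HTML differently.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Record the content value of the robots META tag, if the page has one."""

    def __init__(self):
        super().__init__()
        self.content = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # The tag name "robots" and its content are case insensitive.
        if (attributes.get("name") or "").lower() == "robots":
            self.content = (attributes.get("content") or "").lower()


page = '<html><head><META NAME="Robots" CONTENT="NoIndex"><title>t</title></head><body></body></html>'
finder = RobotsMetaFinder()
finder.feed(page)
print(finder.content)
```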
Excluding/Including Sections of a Page
This help topic describes how to prevent sections of a document
from being indexed. To prevent an entire document from being indexed,
see the topics above.
Spiderline supports the proprietary "robots" comment tag. This tag allows
a web author to apply robots exclusion rules to arbitrary sections of a
document. The tag has one attribute, content, with the following possible
values:
- noindex - the text enclosed in the tag is not saved in the index
- nofollow - links are not extracted from the text enclosed
- none - enclosed text is neither indexed nor searched for links
Values "index", "follow", and "all" are also valid. In practice they
are ignored since they are the unspoken defaults.
This feature is intended for cases where certain parts of a document, such as a
navigational sidebar, should not be included in the search.
Example:
<HTML>
<BODY>
This text will be indexed.
<A HREF="foo.html"> this link will be followed </A>
<!-- robots content="none" -->
This text will NOT be indexed.
<A HREF="bar.html"> this link will NOT be followed </A>
<!-- /robots -->
<!-- robots content="noindex" -->
This text will NOT be indexed.
<A HREF="bar1.html"> this link WILL be followed </A>
<!-- /robots -->
<!-- robots content="nofollow" -->
This text WILL be indexed.
<A HREF="bar1.html"> this link will NOT be followed </A>
<!-- /robots -->
la la la
</BODY>
</HTML>
For the example of a navigational sidebar, the "noindex" value
would be the best choice.
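To preview which text such a page would contribute to the index, the sections marked "noindex" or "none" can be removed with a regular expression. This is a rough sketch and not Spiderline's implementation: the pattern assumes the tag is written as in the example above, with double quotes and no nesting, and the function name strip_unindexed_sections is hypothetical.

```python
import re

# Matches <!-- robots content="..." --> ... <!-- /robots --> sections.
ROBOTS_SECTION = re.compile(
    r'<!--\s*robots\s+content="(?P<value>[^"]*)"\s*-->'
    r"(?P<body>.*?)"
    r"<!--\s*/robots\s*-->",
    re.IGNORECASE | re.DOTALL,
)


def strip_unindexed_sections(html):
    """Drop sections marked noindex or none; keep the body of all others."""

    def keep_or_drop(match):
        value = match.group("value").strip().lower()
        return "" if value in ("noindex", "none") else match.group("body")

    return ROBOTS_SECTION.sub(keep_or_drop, html)


sample = (
    "Intro text. "
    '<!-- robots content="noindex" -->Sidebar text.<!-- /robots --> '
    '<!-- robots content="nofollow" -->Body text.<!-- /robots -->'
)
print(strip_unindexed_sections(sample))
```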
This syntax was designed to match the robots META tag.
For documents that have both the "robots" META tag and
the "robots" comment tag, the most restrictive interpretation is
used, always erring on the side of not indexing or not following.
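The "most restrictive" rule can be expressed as a logical AND of the two settings: text is indexed, or a link followed, only if both the META tag and the comment tag permit it. The (index, follow) pair and the function name most_restrictive are assumptions made for this sketch.

```python
def most_restrictive(meta, comment):
    """Combine two (index, follow) pairs, keeping a permission only if both grant it."""
    meta_index, meta_follow = meta
    comment_index, comment_follow = comment
    return (meta_index and comment_index, meta_follow and comment_follow)


# META tag says "nofollow", comment tag says "noindex": neither action is permitted.
print(most_restrictive((True, False), (False, True)))
# prints (False, False)
```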