To exclude areas from being indexed, enter patterns in the URLs section of the Crawl settings. These patterns tell the crawler what to index and what to avoid indexing. Type each pattern, one per line, with any desired index or follow options.
If you want everything on your website to be searchable, enter your domain name in the Patterns field. If you want only a few specific paths crawled, enter those paths in the Patterns field. Searchable documents should match a pattern that is followed by the INDEX option. To exclude a document or directory, either the matching pattern must specify NOINDEX, or the document must be linked to only by documents that match a pattern with the NOFOLLOW option.
The precedence of entries in the Patterns field is the inverse of their order. Entries are read from top to bottom, meaning that the last entry's options will override any previous conflicting entries. If the Spiderline robot encounters a page that matches more than one entry in the Patterns field, the entry that is listed later will take precedence. This allows you to be even more specific in what you do and do not want crawled.
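The "last matching entry wins" rule can be sketched in a few lines of Python. This is an illustration only, not Spiderline's actual code: the function names are hypothetical, matching is assumed to be a plain substring test (as the examples below suggest), and an entry with no options is assumed to default to INDEX and FOLLOW.

```python
from typing import List, Optional, Tuple


def parse_pattern_line(line: str) -> Tuple[str, bool, bool]:
    """Split one Patterns line into (pattern, index, follow)."""
    parts = line.split()
    pattern = parts[0]
    options = {p.upper() for p in parts[1:]}
    # Assumption: options that are omitted default to INDEX and FOLLOW.
    return pattern, "NOINDEX" not in options, "NOFOLLOW" not in options


def resolve(url: str, lines: List[str]) -> Optional[Tuple[bool, bool]]:
    """Return (index, follow) from the last matching entry, or None if nothing matches."""
    result = None
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        pattern, index, follow = parse_pattern_line(line)
        if pattern in url:  # assumed substring match
            result = (index, follow)  # a later entry overrides an earlier one
    return result


entries = ["/dir", "/tmp NOINDEX NOFOLLOW"]
print(resolve("http://www.mydomain.com/dir/page.html", entries))
print(resolve("http://www.mydomain.com/dir/tmp/here.html", entries))
```

Because the loop keeps overwriting the result, the entry nearest the bottom of the list decides the outcome for any URL that matches several entries.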
Examples:
- * I enter "http://www.mydomain.com INDEX FOLLOW".
All documents and directories that contain this pattern will be crawled. Entering your domain name in this field ensures that links to websites outside your domain will not be followed or crawled.
- * I enter "/ INDEX FOLLOW". CAUTION!!!
All documents have a "/" somewhere in their path! For example, "http://www.anydomain.com/". All documents on your website and on websites you link to will be crawled. Spiderline does not necessarily crawl all documents beginning with your domain first. If you enter a "/", you could crawl every website on the internet.
- * I enter "/only/these/dirs/ INDEX FOLLOW".
All documents and directories that contain this path will be crawled. A document found at "/only/here.html", "/only/these/here.html", or "/what/about/here.html" will not be crawled unless another pattern matching the document path is entered. Also, since the entry ends with a "/" ("dirs/"), the document "/only/these/dirs.html" will not be crawled. If the final "/" were omitted, dirs.html would be crawled.
- * I only enter "/dirs/ INDEX FOLLOW" and "/some/path/ INDEX FOLLOW".
A page's URL must contain one pattern or the other. Pages that match neither of these patterns will not be crawled.
- * I enter "/not/here/ NOINDEX FOLLOW".
Any document or directory that contains this path will not be crawled.
- * I enter ".cgi NOINDEX NOFOLLOW".
All documents that have a ".cgi" extension will not be crawled and any links found will not be followed.
- * I enter on line 1 "/dir" and on line 2 "/tmp NOINDEX NOFOLLOW".
All documents that contain "/dir" will be crawled, unless the document path also contains "/tmp". A document located at "/tmp/x.html", "/dir/tmp/here.html", or "/dir/tmp.html" will not be crawled and links within those documents will not be followed.
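The matching behavior described in these examples can be checked with a short Python sketch. It assumes that a pattern matches whenever it appears anywhere in the document URL; the `matches` helper is hypothetical and is not part of Spiderline.

```python
def matches(pattern: str, url: str) -> bool:
    """Assumed rule: a pattern matches when it appears anywhere in the URL."""
    return pattern in url


base = "http://www.mydomain.com"

# A trailing "/" limits the match to documents inside the directory.
print(matches("/only/these/dirs/", base + "/only/these/dirs/page.html"))
print(matches("/only/these/dirs/", base + "/only/these/dirs.html"))

# Without the trailing "/", dirs.html matches as well.
print(matches("/only/these/dirs", base + "/only/these/dirs.html"))

# A bare "/" matches every URL, including other websites.
print(matches("/", "http://www.anydomain.com/"))

# "/tmp" matches a directory name and a file name alike.
print(matches("/tmp", base + "/dir/tmp.html"))
```

Under this assumption, the shorter a pattern is, the more URLs it matches, which is why a bare "/" is dangerous and why a trailing "/" is a useful way to narrow a pattern to a single directory.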