Configuring URLs can be as simple or detailed as needed for your website. The Starting URL and
Pattern fields in combination with the INDEX and FOLLOW options allow you to control exactly
what portions of your website are crawled and indexed by Spiderline.
Starting URLs
Enter paths to web pages for your Starting URLs. These should not be a pattern, but rather a
complete URL that points to an actual web page. Each Starting URL lets you specify pages on your
website that our spiders should start crawling from.
- Type your Starting URL(s), one per line. You may use the NOINDEX option in this field.
Entering the NOFOLLOW option in this field would defeat the purpose of the
Starting URL.
-
The Starting URL typically matches the homepage of the website you want to index and search.
All other pages are usually linked to, directly or indirectly, from the homepage URL.
-
If your site has multiple domains or subdomains, or if portions of your site are not linked from
one main Starting URL, you may enter additional Starting URLs. This is useful for
pages on your website that are not linked to from pages under the homepage.
-
Regardless of the Starting URL, Spiderline will still honor any INDEX / FOLLOW options, robot META
tags, and the standard robot exclusion protocol.
Patterns
The Patterns field specifies which documents and directories linked to, directly or indirectly, from
the Starting URLs should be crawled and indexed, and which should not.
-
Enter document paths, patterns, or regular expressions in the Patterns field. Type each pattern,
one per line, with any desired index or follow options. Learn more about
index and follow options.
-
Entries in the Patterns field should follow the format below. The default is INDEX and FOLLOW, unless
otherwise specified.
Format:
Pattern Index_Option Follow_Option
-
If you want everything on your website searchable, enter your domain name in the Patterns field.
CAUTION! Entering just a "/" rather than your full domain name will allow our spiders to crawl and index all
documents on your website and on websites you link to, because every document has a "/" somewhere in
its path. Even "http://www.anydomain.com/" matches. Spiderline does not necessarily crawl
all documents on your domain first, so if you enter a "/", you could end up crawling every website on the
internet. Fortunately, Spiderline has document limits in place to guard against such human errors.
-
If you have only a few specific paths you want crawled, enter those paths in the Patterns field.
Searchable documents should match a pattern that is followed by the INDEX option. To exclude a
certain document or directory, the pattern entered must specify NOINDEX, or the document must be
linked to only by documents that have the NOFOLLOW option.
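The entry format described above can be sketched in a few lines of Python. This is only an illustration, not Spiderline's actual parser: the names PatternEntry and parse_entry are hypothetical, the sketch assumes options are case-insensitive, and it assumes the pattern itself contains no spaces.

```python
from typing import NamedTuple


class PatternEntry(NamedTuple):
    """One line of the Patterns field (hypothetical representation)."""
    pattern: str
    index: bool = True    # INDEX is the documented default
    follow: bool = True   # FOLLOW is the documented default


def parse_entry(line: str) -> PatternEntry:
    """Parse 'Pattern Index_Option Follow_Option' into a PatternEntry."""
    tokens = line.split()
    if not tokens:
        raise ValueError("empty Patterns line")
    index, follow = True, True
    for option in tokens[1:]:
        name = option.upper()  # assumption: options are case-insensitive
        if name in ("INDEX", "NOINDEX"):
            index = name == "INDEX"
        elif name in ("FOLLOW", "NOFOLLOW"):
            follow = name == "FOLLOW"
        else:
            raise ValueError(f"unknown option: {option}")
    return PatternEntry(tokens[0], index, follow)


print(parse_entry("/dir"))                   # no options given, so both defaults apply
print(parse_entry("/tmp NOINDEX NOFOLLOW"))  # both options given explicitly
```

An entry with no options falls back to INDEX and FOLLOW, which is why a bare domain name is enough to make a whole site searchable.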
Order of URLs in the Patterns Field
The precedence of entries in the Patterns field is the inverse of their order. Entries are read from
top to bottom, meaning that the last entry's options will override any previous conflicting entries.
If the Spiderline robot encounters a page that matches more than one entry in the Patterns field, the
entry that is listed later will take precedence.
This allows you to be even more specific in what you do and do not want crawled.
Example:
By entering the following two lines in the patterns field, all documents that contain "/dir"
will be crawled, unless the document path also contains "/tmp". A document located at "/tmp/x.html",
"/dir/tmp/here.html", or "/dir/tmp.html" will not be crawled and links within those documents will
not be followed.
/dir
/tmp NOINDEX NOFOLLOW
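The "last matching entry wins" rule can be sketched as follows. This is a rough model rather than Spiderline's implementation: the function name resolve is hypothetical, a pattern is assumed to match when it appears anywhere in the document path, and returning None for a path that matches no entry is also an assumption.

```python
from typing import List, Optional, Tuple

# Each entry is (pattern, index, follow), listed in the same order as the Patterns field.
Entry = Tuple[str, bool, bool]


def resolve(path: str, entries: List[Entry]) -> Optional[Tuple[bool, bool]]:
    """Return (index, follow) from the last entry that matches, or None if none match."""
    result = None
    for pattern, index, follow in entries:   # read top to bottom
        if pattern in path:                  # assumption: plain substring match
            result = (index, follow)         # a later match overrides an earlier one
    return result


entries = [
    ("/dir", True, True),     # /dir
    ("/tmp", False, False),   # /tmp NOINDEX NOFOLLOW
]

for path in ["/dir/a.html", "/tmp/x.html", "/dir/tmp/here.html", "/dir/tmp.html"]:
    print(path, resolve(path, entries))
```

Swapping the two entries would let "/dir" override "/tmp" for paths such as "/dir/tmp/here.html", which is why the more specific exclusions belong later in the field.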
Preventing Documents from being Searchable
You can use the Patterns field to prevent documents on your website from being crawled or indexed, and
therefore from being searchable.
-
If you want to prevent a particular type of file from being indexed,
enter the file extension followed by NOINDEX.
For example:
- .cgi NOINDEX
- .pdf NOINDEX
-
If you want to prevent just a few specific documents from being searchable,
enter a path to each page followed by NOINDEX.
For example:
- /some/path/private.html NOINDEX
- /tmp/finances.pdf NOINDEX
-
If you have many documents you do not want searchable,
enter the path to each directory where the documents are located, followed by NOINDEX.
For example:
- /some/path NOINDEX
- /tmp/ NOINDEX
-
If the pages you do not want searchable link to other documents that you also do not
want searchable, add NOFOLLOW to the entry.
For example:
.pdf NOINDEX NOFOLLOW
/some/path/private.html NOINDEX NOFOLLOW
/tmp/ NOINDEX NOFOLLOW
The Difference between Ending a Pattern entry with "/" versus no slash
Yes, there is a difference between the following two entries for the Patterns field:
/path/x/ NOINDEX NOFOLLOW
/path/x NOINDEX NOFOLLOW
The first entry will not index or follow links from documents that begin with '/path/x/'. This would cover
/path/x/a.html, /path/x/b.html, and /path/x/etc.html.
The second entry has the same effects on the example documents a.html, b.html, and etc.html, but will also
not index or follow links from documents such as /path/x.html, /path/x.pdf, /path/x_file.html.
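The effect of the trailing slash can be checked with a short sketch. It assumes, as an illustration only, that a pattern matches whenever it appears in the document path; the helper name matched is hypothetical.

```python
from typing import List

paths = [
    "/path/x/a.html",
    "/path/x/b.html",
    "/path/x.html",
    "/path/x.pdf",
    "/path/x_file.html",
]


def matched(pattern: str, candidates: List[str]) -> List[str]:
    """Return the paths that contain the pattern (assumed substring matching)."""
    return [p for p in candidates if pattern in p]


print(matched("/path/x/", paths))  # only the documents inside the /path/x/ directory
print(matched("/path/x", paths))   # those documents plus x.html, x.pdf, and x_file.html
```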
Linking to Documents Outside your Domain
In order to make documents on other websites searchable, but only the documents you link to and not
the entire other website, enter "/ INDEX NOFOLLOW" on the first line of the Patterns field.
And on the second line, enter "www.yourdomain.com INDEX FOLLOW". This allows you to still
configure what parts of your website you do and do not want searchable on subsequent lines in the Patterns field.
    / INDEX NOFOLLOW
    www.yourdomain.com INDEX FOLLOW
If you link to only a few websites, you can just specify their domain names followed by INDEX NOFOLLOW.
    www.referencesite_a.com INDEX NOFOLLOW
    www.referencesite_b.com INDEX NOFOLLOW
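To see why the "/ INDEX NOFOLLOW" entry stops the crawl one step outside your domain, here is a small simulation. Everything in it is hypothetical: the link graph, the function names, the substring matching, and the choice to treat unmatched URLs as neither indexed nor followed. It is a sketch of the idea, not a description of how Spiderline crawls.

```python
from collections import deque
from typing import Dict, List, Tuple

# Patterns field, top to bottom: (pattern, index, follow).
ENTRIES = [
    ("/", True, False),                  # / INDEX NOFOLLOW
    ("www.yourdomain.com", True, True),  # www.yourdomain.com INDEX FOLLOW
]

# Hypothetical link graph: each page maps to the pages it links to.
LINKS: Dict[str, List[str]] = {
    "http://www.yourdomain.com/": [
        "http://www.yourdomain.com/about.html",
        "http://www.referencesite_a.com/article.html",
    ],
    "http://www.yourdomain.com/about.html": [],
    "http://www.referencesite_a.com/article.html": [
        "http://www.referencesite_a.com/deeper.html",
    ],
    "http://www.referencesite_a.com/deeper.html": [],
}


def options_for(url: str) -> Tuple[bool, bool]:
    """Return (index, follow) from the last matching entry."""
    index, follow = False, False  # assumption: unmatched URLs are skipped
    for pattern, idx, fol in ENTRIES:
        if pattern in url:
            index, follow = idx, fol
    return index, follow


def crawl(start: str) -> List[str]:
    """Breadth-first crawl from start; return the URLs that would be indexed."""
    indexed, seen, queue = [], {start}, deque([start])
    while queue:
        url = queue.popleft()
        index, follow = options_for(url)
        if index:
            indexed.append(url)
        if follow:  # links are only queued from pages marked FOLLOW
            for link in LINKS.get(url, []):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return indexed


print(crawl("http://www.yourdomain.com/"))
```

In this model the linked article on the other site is indexed, but its own link to deeper.html is never queued, so the rest of that site stays out of the index.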