MOMspider Instruction Files

The behavior of a MOMspider process is defined primarily in its instruction file. At the beginning of processing, MOMspider reads the instructions in their entirety and loads them into internal tables. The location of the text-based instruction file is given by the -i command-line option or, if that option is absent, by the default name set in the configuration defaults.

A MOMspider instruction file consists of a series of (optional) global directives followed by a series of traversal tasks. MOMspider sets the configuration options associated with the global directives and then proceeds to perform each of the listed tasks in the given order. After completing the last task, MOMspider will output a summary of the overall process results and then exit.

The format of the instruction file is fairly rigid. Blank lines and any lines beginning with '#' are ignored. Every other directive must fit on a single line (regardless of length); there is no line-continuation character. Each task instruction begins with a "<TYPE" directive and ends with a ">" directive on a line by itself. Several examples are provided with the distribution. All instructions are case-sensitive.
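The format rules above can be illustrated with a small reader. This is a minimal Python sketch written for this page, not MOMspider's own code (MOMspider itself is a Perl program); the function name parse_instructions and the sample hosts and paths are hypothetical. It skips blank and '#' lines, collects global directives, and groups everything between a "<TYPE" line and a ">" line into one task. It does not enforce that global directives come first.

```python
def parse_instructions(text):
    """Split instruction text into global directives and task blocks."""
    global_directives, tasks, current = [], [], None
    for raw in text.splitlines():
        if not raw.strip() or raw.startswith("#"):
            continue  # blank lines and lines beginning with '#' are ignored
        line = raw.strip()
        if line.startswith("<"):
            if current is not None:
                raise ValueError("task started before the previous one ended")
            current = {"type": line[1:].strip(), "directives": []}
        elif line == ">":
            if current is None:
                raise ValueError("'>' found without an open task")
            tasks.append(current)
            current = None
        else:
            # One directive per line: a name, then an optional value.
            parts = line.split(None, 1)
            value = parts[1] if len(parts) > 1 else ""
            target = global_directives if current is None else current["directives"]
            target.append((parts[0], value))
    if current is not None:
        raise ValueError("last task is missing its closing '>'")
    return global_directives, tasks


# Hypothetical sample; the host names and paths are placeholders.
SAMPLE = """\
# Global directives come first
MaxDepth 20

<Site
Name Example
TopURL http://www.example.com/
IndexURL http://www.example.com/MOM/index.html
IndexFile /tmp/mom-index.html
EmailAddress webmaster@example.com
EmailBroken
>
"""

global_directives, tasks = parse_instructions(SAMPLE)
```

Here global_directives holds the single MaxDepth entry, and tasks holds one Site task whose directives keep their file order.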

Global Directives

All global directives should be listed at the top of the instruction file, one per line, with the directive name flush-left. The following global directives are available:
SystemAvoid pathname
This directive specifies that the systemwide avoid file for this process can be found at the given pathname. If present, this directive overrides the default configuration, but can itself be overridden on the command-line by the -A option.
SystemSites pathname
This directive specifies that the systemwide sites file for this process can be found at the given pathname. If present, this directive overrides the default configuration, but can itself be overridden on the command-line by the -S option.
AvoidFile pathname
This directive specifies that the user's writable avoid file for this process can be found at the given pathname. If present, this directive overrides the default configuration, but can itself be overridden on the command-line by the -a option.
SitesFile pathname
This directive specifies that the user's writable sites file for this process can be found at the given pathname. If present, this directive overrides the default configuration, but can itself be overridden on the command-line by the -s option.
SitesCheck N
This directive specifies the number of days between checks of a site's /robots.txt file as per the robot exclusion protocol. The default is usually fifteen (15) days.
ReplyTo email_address
This directive specifies the real e-mail address of the person running this MOMspider. This address MUST belong to the human being who should be notified if someone has a problem with how you are running MOMspider. The default address is normally set by libwww-perl to be user@hostname, but should be re-specified here if the default address does not receive e-mail.
MaxDepth N
This directive specifies the maximum allowed depth of any MOMspider traversal. Its purpose is to prevent the spider from crawling down a "black hole" -- an infinitely recursive and self-modifying URL. The default value (usually 20) should be larger than the depth of any hierarchy that MOMspider will be asked to traverse.
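For the four pathname directives (SystemAvoid, SystemSites, AvoidFile, SitesFile), the text above describes three layers: the configured default, the instruction-file directive, and the command-line option, each overriding the one before it. A minimal Python sketch of that precedence follows. The function name resolve_setting and the example pathnames are hypothetical, and this is an illustration of the rule rather than MOMspider's actual Perl logic.

```python
def resolve_setting(default, instruction_value=None, command_line_value=None):
    """Return the effective value: command line > instruction file > default."""
    if command_line_value is not None:
        return command_line_value
    if instruction_value is not None:
        return instruction_value
    return default


# Hypothetical pathnames, for illustration only.
DEFAULT_AVOID = "/usr/local/lib/mom/system-avoid"

only_default = resolve_setting(DEFAULT_AVOID)
from_file = resolve_setting(DEFAULT_AVOID, instruction_value="/etc/mom/avoid")
from_cli = resolve_setting(
    DEFAULT_AVOID,
    instruction_value="/etc/mom/avoid",
    command_line_value="/home/me/avoid",  # e.g. a value passed with -A
)
```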

Traversal Tasks

Each traversal task is a compound instruction: an opening line that gives the traversal type (for example "<Site"), a set of task directives, and a closing ">" line. For each task, MOMspider traverses the web, in breadth-first order, from the specified top document down to each leaf node. A leaf node is defined to be any information object which is not of content-type HTML (and thus cannot contain any further links) or which is outside the given infostructure. MOMspider determines the boundaries of an infostructure according to the task's traversal type: Site, Tree, or Owner.
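The breadth-first walk with leaf nodes can be sketched as follows. This is a minimal Python illustration under stated assumptions, not MOMspider's Perl implementation: the names traverse and site_of are hypothetical, the links dictionary stands in for fetching and parsing HTML, and treating the depth limit as "do not expand nodes at max_depth" is one plausible reading of MaxDepth. The example predicate applies the Site rule (a different hostname and port pair makes a URL a leaf); a real predicate would also treat non-HTML objects as leaves.

```python
from collections import deque
from urllib.parse import urlsplit


def site_of(url):
    """Return the (hostname, port) pair that identifies a site."""
    parts = urlsplit(url)
    port = parts.port or {"http": 80, "https": 443}.get(parts.scheme)
    return (parts.hostname, port)


def traverse(top_url, links, is_leaf, max_depth=20):
    """Breadth-first walk from top_url; leaves are reached but not expanded.

    links maps a URL to the URLs it references (a stand-in for real fetches).
    Returns the URLs in the order they were first reached.
    """
    order, seen = [top_url], {top_url}
    queue = deque([(top_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth or (url != top_url and is_leaf(url)):
            continue  # tested, but its links are not followed
        for child in links.get(url, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append((child, depth + 1))
    return order


# Hypothetical link graph.
TOP = "http://www.example.com/"
LINKS = {
    TOP: ["http://www.example.com/a.html", "http://other.example.org/"],
    "http://www.example.com/a.html": ["http://www.example.com/b.html"],
    "http://other.example.org/": ["http://other.example.org/deep.html"],
}


def site_leaf(url):
    """Site rule: any URL on another (hostname, port) pair is a leaf."""
    return site_of(url) != site_of(TOP)


visited = traverse(TOP, LINKS, site_leaf)
```

In this example the off-site URL is reached and recorded, but the page it links to is never visited, because the walk does not expand leaf nodes.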

Tasks are performed in the order they are listed in the file. In general, it is most useful to list the tasks in a bottom-up order by their hierarchy. This allows more information to be available for the later, higher-level indexes which may link to these earlier tasks.

The following task directives are available:

<Site
This directive indicates the start of a task instruction for a Site traversal. Site traversal specifies that any URL which points to a site (the pairing of hostname/IP address and port) other than that of the top document is considered a leaf node.
<Tree
This directive indicates the start of a task instruction for a Tree traversal. Tree traversal specifies that any document not at or below the "level" of the top document is considered a leaf node, where level is determined by the pathname in the URL. Note that a tree traversal of any URL at the server's root level will have the same effect as a Site traversal of that URL.
<Owner
This directive indicates the start of a task instruction for an Owner traversal. Owner traversal specifies that any document beyond the top which does not contain an "Owner:" metainformation header equal to the infostructure name is considered a leaf node. On most current servers, this effectively means that only the top URL is traversed.
Name infostructure_name
Specifies the infostructure name. This is used both to identify the infostructure in generated messages and also as the owner name for Owner traversals. The name is required for all tasks and must be a single word (no whitespace).
TopURL URL
Specifies the URL of the top of the infostructure to be traversed. If it is relative, the URL is resolved as a file://localhost/ URL relative to the current working directory at process start. The top URL is required for all tasks and must be a single word (no whitespace). Any fragment identifier will be ignored.
IndexURL URL
Specifies the URL of the HTML index file that will be produced for this task. This directive is required and the URL must be in absolute form.
IndexFile pathname.html
Specifies the pathname of the actual file for the HTML index. This directive is required and must specify a valid pathname. If the file already exists, it will be renamed pathname.old.html and a link to it will be included in the new index.
IndexTitle string
Specifies the character string to use as the HTML index title and also the subject line of any e-mail message. This directive is optional. If not present, the title will be "MOMspider Index for Name" where Name is the infostructure name.
ChangeWindow N
Specifies the window in N days (N being a natural number) prior to the current date within which a tested URL's Last-modified date is considered "interesting" and should be highlighted in the HTML index. If N=0, no last-modification dates are considered interesting. This directive is optional and defaults to seven (7) days.
ExpireWindow N
Specifies the window in N days (N being a natural number) after the current date within which a traversed URL's Expires date is considered "interesting" and should be highlighted in the HTML index. If N=0, no expiration dates are considered interesting. This directive is optional and defaults to zero (0). Since Expires dates are rarely used on the WWW, this directive is rarely useful.
EmailAddress email_addresses
Specifies the e-mail addresses to which an automatically generated message should be sent if one or more of the other Email directives below applies to any of the URLs tested during this task. This directive is optional only if no other Email directives are given. The format should be exactly the same as that given to the "To:" header when sending normal e-mail messages.
EmailBroken
Specifies that an e-mail message should be generated if any of the tested links in this task are found to be broken. This directive is optional and, if present, requires that EmailAddress also be given.
EmailRedirected
Specifies that an e-mail message should be generated if any of the tested links in this task are found to be redirected. This directive is optional and, if present, requires that EmailAddress also be given.
EmailChanged N
Specifies that an e-mail message should be generated if any of the tested links in this task are found to have been changed within the past N days, where N is a natural number. Note that this directive is similar to, but independent of, the ChangeWindow directive. This directive is optional and, if present (with N > 0), requires that EmailAddress also be given.
EmailExpired N
Specifies that an e-mail message should be generated if any of the traversed documents in this task will expire within the next N days, where N is a natural number. Note that this directive is similar to, but independent of, the ExpireWindow directive. This directive is optional and, if present (with N > 0), requires that EmailAddress also be given.
Exclude URLprefix
Specifies that the given URLprefix should be added to the Leaf Table, so that every URL encountered during this task's traversal which begins with the given prefix will only be tested and not traversed. Multiple Exclude directives can be specified for any task. The IndexURL is automatically excluded at the beginning of every task.
>
This directive, on a line by itself, signals the end of the current task instruction. Each task must be terminated before the next begins.
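Several of the directives above carry requirements: Name, TopURL, IndexURL and IndexFile are mandatory, IndexURL must be absolute, and the Email trigger directives need an EmailAddress. A minimal Python sketch of checking one task against those rules follows. The names check_task and index_title, the message wording, and the sample values are hypothetical; the checks are drawn only from the descriptions on this page, not from MOMspider's source.

```python
from urllib.parse import urlsplit

REQUIRED = ("Name", "TopURL", "IndexURL", "IndexFile")
EMAIL_TRIGGERS = ("EmailBroken", "EmailRedirected", "EmailChanged", "EmailExpired")


def _needs_address(name, value):
    """EmailChanged and EmailExpired only count when N > 0."""
    if name in ("EmailBroken", "EmailRedirected"):
        return True
    return value.isdigit() and int(value) > 0


def check_task(task):
    """Return a list of problems found in one task's directives.

    task maps directive names to string values ("" for flag directives).
    """
    problems = [f"missing required directive {d}" for d in REQUIRED if not task.get(d)]
    for d in ("Name", "TopURL"):
        if len(task.get(d, "").split()) > 1:
            problems.append(f"{d} must be a single word")
    index_url = task.get("IndexURL")
    if index_url and not urlsplit(index_url).scheme:
        problems.append("IndexURL must be absolute")
    active = [d for d in EMAIL_TRIGGERS if d in task and _needs_address(d, task[d])]
    if active and not task.get("EmailAddress"):
        problems.append("EmailAddress is required by " + ", ".join(active))
    return problems


def index_title(task):
    """Return IndexTitle, or the documented default built from Name."""
    return task.get("IndexTitle") or f"MOMspider Index for {task['Name']}"


# Hypothetical task with two mistakes: a relative IndexURL, no EmailAddress.
task = {
    "Name": "Example",
    "TopURL": "http://www.example.com/",
    "IndexURL": "MOM/index.html",
    "IndexFile": "/tmp/mom-index.html",
    "EmailBroken": "",
}
problems = check_task(task)
title = index_title(task)
```

Running a check like this before starting a long traversal is one way to catch a missing EmailAddress or a relative IndexURL early.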

Roy Fielding <fielding@ics.uci.edu>
Department of Information and Computer Science,
University of California, Irvine, CA 92717-3425
Last modified: Wed Aug 10 01:15:17 1994