Definitions
The following definitions will be used in this document:
- BASE-NAME
- The initial, stem part of a file name, before the
extension. Usually specified by the user. Note that a BASE-NAME
MAY contain one or more periods, if so specified by the user.
- CLOSED LIST
- The set of PAGEs that have been retrieved
(downloaded) by MSpider. Used for cycle detection. NOTE:
Although colloquially called a "list", the CLOSED LIST need not be
implemented literally as a list data structure. A different data
structure is likely to be more effective for its goals.
- CRAWL-MAX
- The maximum number of PAGEs that the spider engine
can download in a single run (including restarts). The spider MUST
NOT crawl more PAGEs than allowed by CRAWL-MAX.
- MAY
- A requirement that the product can choose to implement if
desired. Can also indicate a choice among acceptable alternatives
(e.g., "The program MAY do x, y, or z." indicates that the choice of
behavior x, y, or z is up to the designer).
- MUST
- A requirement that the product must implement for full
credit.
- MUST NOT
- A behavior or assumption that must not be violated.
Violating a MUST NOT restriction will result in a penalty on the
assignment.
- OPEN LIST
- The set of URLs that have been parsed out of
previously retrieved PAGEs, but that have not yet been retrieved or
examined themselves. These URLs are candidates for future
crawling. NOTE:
Although colloquially called a "list", the OPEN LIST need not be
implemented literally as a list data structure. A different data
structure is likely to be more effective for its goals.
- PAGE
- The content document pointed to by a URL. This project
MUST support "text/html" documents, but the system MAY support
additional document types, at the designer's discretion.
- RECOVERABLE ERROR
- An error condition that the software can
ignore, correct, or otherwise recover from. The program MUST produce
a warning message and then cleanly continue with no corruption or loss
of valid data.
- REVERSE INDEX
- A mapping that records which PAGEs each WORD
occurs in.
- SHOULD
- A requirement that is recommended, but not required.
The designer may violate a SHOULD requirement, but should be
prepared to explain why.
- SPIDER
- The module responsible for crawling the web to locate
and retrieve PAGEs.
- PUNCTUATION
- For the purposes of this project, any character other than
WHITESPACE, letters, digits, or parentheses.
- QUERY
- A user's request for relevant PAGEs, expressed as a
sequence of WORDs. The WORDs are interpreted either as an implicit
conjunctive query (Section 5.3.2) or (optionally) according to an
AND/OR query language (Appendix C).
- TF/IDF
- A scoring function that attempts to assess the
relevance of a PAGE with respect to a QUERY. See Appendix A for
details.
- UNRECOVERABLE ERROR
- An error condition from which recovery is
impossible. The program MUST produce an error message describing the
condition and then cleanly halt.
- URL
- A pointer to a PAGE, represented in HTML with a
fully-qualified uniform resource locator. Because each URL maps
one-to-one onto a single PAGE, this specification will often use the
two interchangeably. Note that to ensure the one-to-one mapping, URLs
will have to be canonicalized.
- WEB DATABASE
- The complete database of information necessary
for the Moogle client to do its job. This includes the
REVERSE INDEX, but will also include additional information to, for
example, support crawl restarts and TF/IDF result sorting.
- WHITESPACE
- Non-printable characters including (but not limited
to) space, horizontal and vertical tabs, newlines, and carriage
returns. See the Java JDK API call Character.isWhitespace().
- WORD
- The smallest unit of parsing (for the MSpider
engine) or querying (for the Moogle client). In this
project, WORDs are considered to be sequences of letters and digits.
WORDs SHOULD be treated as case-insensitive,
but MAY be treated as case-sensitive. Either way, case
(in)sensitivity MUST be documented in the user manual.
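
As an illustration of the WORD and REVERSE INDEX definitions above, the following minimal sketch splits text into WORDs and records which PAGEs each WORD occurs in. The class name ReverseIndexSketch and its methods are hypothetical and not part of this specification. The sketch assumes case-insensitive WORDs (the SHOULD option) and treats parentheses as separators, like WHITESPACE and PUNCTUATION.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical class: the name and methods are illustrative only.
class ReverseIndexSketch {
    // Maps each WORD to the URLs of the PAGEs in which it occurs.
    private final Map<String, Set<String>> index = new TreeMap<>();

    // Splits text into WORDs: maximal runs of letters and digits.
    // Every other character is treated as a separator (an assumption
    // for parentheses, which are neither PUNCTUATION nor WORD characters).
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // Lower-casing implements the case-insensitive option.
                words.add(current.toString().toLowerCase(Locale.ROOT));
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            words.add(current.toString().toLowerCase(Locale.ROOT));
        }
        return words;
    }

    // Records every WORD of the given PAGE text under the PAGE's URL.
    void addPage(String url, String text) {
        for (String word : tokenize(text)) {
            index.computeIfAbsent(word, k -> new TreeSet<>()).add(url);
        }
    }

    // Returns the URLs of the PAGEs containing the WORD (empty if none).
    Set<String> pagesContaining(String word) {
        Set<String> pages = index.get(word.toLowerCase(Locale.ROOT));
        return pages == null ? Collections.<String>emptySet() : pages;
    }
}
```

A sorted map is used here only to keep the output order predictable. A hash-based map would serve equally well, and a real WEB DATABASE would also need whatever per-PAGE counts TF/IDF scoring requires.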
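
The notes on the OPEN LIST and CLOSED LIST above suggest that a plain list may not be the most effective structure. One possible arrangement, sketched below under stated assumptions, pairs a queue with hash sets so that membership tests are fast, and also enforces CRAWL-MAX. The class name CrawlFrontierSketch and its methods are hypothetical. The sketch assumes that URLs arrive already canonicalized and, as a simplification, moves a URL to the CLOSED LIST at the moment it is handed out for retrieval.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical class: the name and methods are illustrative only.
class CrawlFrontierSketch {
    // OPEN LIST: a queue for crawl order plus a set for fast membership tests.
    private final Deque<String> openOrder = new ArrayDeque<>();
    private final Set<String> openSet = new HashSet<>();
    // CLOSED LIST: URLs already handed out for retrieval.
    private final Set<String> closed = new HashSet<>();
    private final int crawlMax;

    CrawlFrontierSketch(int crawlMax) {
        this.crawlMax = crawlMax;
    }

    // Adds a canonicalized URL unless it is already pending or retrieved.
    // Returns true if the URL was added to the OPEN LIST.
    boolean offer(String canonicalUrl) {
        if (closed.contains(canonicalUrl) || !openSet.add(canonicalUrl)) {
            return false;
        }
        openOrder.addLast(canonicalUrl);
        return true;
    }

    // Returns the next URL to retrieve, or null when the OPEN LIST is
    // empty or CRAWL-MAX PAGEs have already been handed out.
    String next() {
        if (closed.size() >= crawlMax || openOrder.isEmpty()) {
            return null;
        }
        String url = openOrder.removeFirst();
        openSet.remove(url);
        closed.add(url);
        return url;
    }

    // Number of URLs moved to the CLOSED LIST so far.
    int retrievedCount() {
        return closed.size();
    }
}
```

Checking the CLOSED LIST in offer() is what provides cycle detection: a link back to an already retrieved PAGE is rejected rather than queued again. To support restarts, both collections would have to be saved in and restored from the WEB DATABASE, which this sketch does not attempt.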
Terran Lane
2005-09-21