The heart of the web
search engine is a REVERSE INDEX that stores a mapping from WORDs to
the PAGEs in which those WORDs occur. In addition, to support
document scoring and TF/IDF ranking, the REVERSE INDEX will have to
store the number of times each word occurs in each document and the
total number of documents in which each word has been
seen. Note that the REVERSE INDEX will have a Map at its
core, but it captures additional functionality and will be more
complex than a basic Map.
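
As a rough illustration of the paragraph above, the following is a minimal Java sketch of an index with a Map at its core that also records per-page word counts and per-word document counts. The class name ReverseIndex, the method names, and the use of URL strings to identify PAGEs are all assumptions made for this example, not part of the design itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: names and representation are illustrative assumptions.
class ReverseIndex {
    // word -> (page URL -> number of occurrences of the word on that page)
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();

    // Record one occurrence of `word` on the page identified by `pageUrl`.
    void addOccurrence(String word, String pageUrl) {
        postings.computeIfAbsent(word, w -> new HashMap<>())
                .merge(pageUrl, 1, Integer::sum);
    }

    // Number of times `word` occurs on `pageUrl` (0 if never seen there).
    int termCount(String word, String pageUrl) {
        Map<String, Integer> perPage = postings.get(word);
        return perPage == null ? 0 : perPage.getOrDefault(pageUrl, 0);
    }

    // Number of distinct pages on which `word` has been seen.
    int documentCount(String word) {
        Map<String, Integer> perPage = postings.get(word);
        return perPage == null ? 0 : perPage.size();
    }
}
```

In this sketch the document count is simply the size of the inner map, so it does not need to be stored separately; a real implementation might cache it or keep other statistics alongside it.
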

The WEB DATABASE will contain the REVERSE INDEX, but will also need to
include additional information beyond the basic data in the REVERSE
INDEX. To track max-crawl and implement cycle detection, the WEB
DATABASE will also have to store a list of all PAGEs that have been
seen. Finally, to implement durable state and restartability, the
WEB DATABASE will have to contain a list of the outstanding URLs that
have not yet been examined.
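
The bookkeeping described above might be sketched in Java as follows. This is only one possible arrangement: the class name CrawlBookkeeping, its methods, the choice to mark a URL as seen at the moment it is queued, and the reading of max-crawl as a cap on the number of seen PAGEs are all assumptions for illustration. The REVERSE INDEX and the actual saving of state to disk are left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the non-index state held by the WEB DATABASE.
class CrawlBookkeeping {
    // Every page URL seen so far; used for cycle detection and the crawl cap.
    private final Set<String> seenPages = new HashSet<>();
    // URLs that have been discovered but not yet examined.
    private final Deque<String> pendingUrls = new ArrayDeque<>();
    // Assumed meaning of max-crawl: upper bound on the number of seen pages.
    private final int maxCrawl;

    CrawlBookkeeping(int maxCrawl) {
        this.maxCrawl = maxCrawl;
    }

    // Queue `url` unless it was already seen or the cap has been reached.
    // Returns true only if the URL was actually queued.
    boolean offer(String url) {
        if (seenPages.size() >= maxCrawl || !seenPages.add(url)) {
            return false;
        }
        pendingUrls.addLast(url);
        return true;
    }

    // Next URL to examine, or null when nothing is outstanding.
    String next() {
        return pendingUrls.pollFirst();
    }

    int seenCount() {
        return seenPages.size();
    }

    int pendingCount() {
        return pendingUrls.size();
    }
}
```

Under these assumptions, the seen set and the pending queue together are the state that would have to be written out and read back for a crawl to be restartable.
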
Terran Lane
2005-01-19