Reverse Index and Web Database

The heart of the web search engine is a REVERSE INDEX that stores a mapping from WORDs to the PAGEs in which those WORDs occur. To support document scoring and TF/IDF ranking, the REVERSE INDEX must also store the number of times each WORD occurs in each PAGE and the total number of PAGEs in which each WORD has been seen. The REVERSE INDEX has a Map at its core, but because it provides this additional functionality it is more complex than a basic Map.
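As a rough illustration of the structure described above, the following Python sketch keeps one nested map from each WORD to the PAGEs containing it, with a per-PAGE occurrence count. The class name ReverseIndex, its method names, and the particular TF/IDF formula (raw count times log(N/df)) are assumptions made for this example, not part of the specification; other TF/IDF weightings are equally valid.

```python
from collections import defaultdict
import math


class ReverseIndex:
    """Hypothetical sketch: maps each word to the pages that contain it."""

    def __init__(self):
        # word -> {page URL -> number of times the word occurs on that page}
        self._postings = defaultdict(dict)
        # every page that has been indexed, needed for the IDF term
        self._pages = set()

    def add_page(self, url, words):
        """Index one page; assumes each page is added only once."""
        self._pages.add(url)
        for word in words:
            counts = self._postings[word]
            counts[url] = counts.get(url, 0) + 1

    def pages_for(self, word):
        """Return the set of pages in which the word occurs."""
        return set(self._postings.get(word, {}))

    def term_count(self, word, url):
        """Number of times the word occurs in the given page."""
        return self._postings.get(word, {}).get(url, 0)

    def document_count(self, word):
        """Number of pages in which the word has been seen."""
        return len(self._postings.get(word, {}))

    def tf_idf(self, word, url):
        """One common TF/IDF variant (an assumption): tf * log(N / df)."""
        df = self.document_count(word)
        if df == 0:
            return 0.0
        return self.term_count(word, url) * math.log(len(self._pages) / df)
```

The nested map answers both "which PAGEs contain this WORD" and "how often", and the number of PAGEs per WORD falls out as the size of the inner map, so no separate counter has to be kept in sync.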

The WEB DATABASE contains the REVERSE INDEX, plus additional state that the index itself does not hold. To track max-crawl and implement cycle-detection, the WEB DATABASE must store a list of all PAGEs that have been seen. To implement durable state and restartability, it must also store a list of the outstanding URLs that have not yet been examined.
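The crawl bookkeeping described above might be sketched as follows. This is only an illustration under stated assumptions: the class name WebDatabase, the method names, the reading of max-crawl as a cap on the number of distinct PAGEs, and the use of JSON for the saved state are all choices made for this example. The REVERSE INDEX that the WEB DATABASE also contains is left out to keep the sketch short, and the seen PAGEs are held in a set rather than a list so that the cycle check is a constant-time lookup.

```python
from collections import deque
import json


class WebDatabase:
    """Hypothetical sketch of the crawl state (reverse index omitted)."""

    def __init__(self, max_crawl):
        # assumed meaning of max-crawl: cap on distinct pages ever accepted
        self.max_crawl = max_crawl
        # every URL seen so far; used for cycle detection
        self.seen = set()
        # outstanding URLs that have not yet been examined, in FIFO order
        self.frontier = deque()

    def add_url(self, url):
        """Queue a URL unless it was already seen or the cap is reached."""
        if url in self.seen or len(self.seen) >= self.max_crawl:
            return False
        self.seen.add(url)
        self.frontier.append(url)
        return True

    def next_url(self):
        """Return the next outstanding URL, or None when none remain."""
        return self.frontier.popleft() if self.frontier else None

    def to_json(self):
        """Serialize the state so an interrupted crawl can be resumed."""
        return json.dumps({
            "max_crawl": self.max_crawl,
            "seen": sorted(self.seen),
            "frontier": list(self.frontier),
        })

    @classmethod
    def from_json(cls, text):
        """Rebuild the state from a string produced by to_json."""
        data = json.loads(text)
        db = cls(data["max_crawl"])
        db.seen = set(data["seen"])
        db.frontier = deque(data["frontier"])
        return db
```

Writing the string from to_json to disk after each PAGE is processed, and reading it back with from_json at startup, is one simple way to obtain restartability: the restored frontier holds exactly the URLs that were still outstanding.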



Terran Lane 2005-01-26