Automatic Ontology Building from Web Documents

An ontology represents semantics and enables interoperability among heterogeneous 
web documents in the same domain. Most web documents have a semi-structured 
organization, e.g. HTML, but lack explicitly defined semantics. We propose a model 
that analyzes the unstructured and semi-structured data of web documents in order 
to automatically build an ontology representing the concepts found in the 
documents' contents. The model comprises three phases. The first phase uses 
natural language processing and statistical methods to analyze the unstructured 
data and extract the important vocabulary (concepts) of the ontology. The second 
phase uses a web content mining method to analyze the semi-structured data and 
derive relationships from the content. The last phase evaluates the concepts and 
relationships to determine which structure and knowledge to retain in the ontology.
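
The three phases can be illustrated with a minimal sketch. It is not the proposed model itself: it assumes simple term-frequency counting as the statistical method, HTML heading nesting (h1 to h6) as the semi-structured signal, and a filter that keeps only relationships between retained concepts as the evaluation step. All names (STOPWORDS, OutlineParser, extract_concepts, mine_relations, evaluate, build_ontology) and the min_count threshold are hypothetical and chosen for illustration only.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = {"the", "and", "are", "for", "with", "that", "this", "their", "have"}


class OutlineParser(HTMLParser):
    """Collects all text plus (level, text) pairs for h1..h6 headings."""

    def __init__(self):
        super().__init__()
        self.text_parts = []   # unstructured data: every text node
        self.headings = []     # semi-structured data: (level, heading text)
        self._level = None     # level of the heading currently open, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if re.fullmatch(r"h[1-6]", tag):
            self._level = int(tag[1])
            self._buf = []

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            text = " ".join("".join(self._buf).split()).lower()
            self.headings.append((self._level, text))
            self._level = None

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._level is not None:
            self._buf.append(data)


def extract_concepts(text, min_count=2):
    """Phase 1 (assumed method): keep terms that occur at least min_count times."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) > 2 and w not in STOPWORDS)
    return {w: c for w, c in counts.items() if c >= min_count}


def mine_relations(headings):
    """Phase 2 (assumed method): a heading is a child of the nearest shallower heading."""
    relations, stack = [], []
    for level, text in headings:
        while stack and stack[-1][0] >= level:
            stack.pop()
        if stack:
            relations.append((stack[-1][1], text))
        stack.append((level, text))
    return relations


def evaluate(concepts, relations):
    """Phase 3 (assumed criterion): keep relations whose two ends are both concepts."""
    return [(p, c) for p, c in relations if p in concepts and c in concepts]


def build_ontology(html, min_count=2):
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    concepts = extract_concepts(" ".join(parser.text_parts), min_count)
    relations = evaluate(concepts, mine_relations(parser.headings))
    return concepts, relations
```

Calling build_ontology on an HTML string returns a dictionary of candidate concepts with their frequencies and a list of (parent, child) pairs. Because the evaluation step compares heading text directly against single-word concepts, this sketch only links headings that consist of one retained term; multi-word headings would need an additional matching step.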