Information Analysis: Overview
The analysis of information is a rapidly growing area of Computer Science.
Information analysis is an umbrella term for a
multitude of techniques for extracting important, interesting, or
unexpected phenomena from massive quantities of information.
Because the information of interest comes in a wide variety of forms,
including structured databases, unstructured text, real-valued sensor data,
and digitized images, and
because the type of phenomena we seek varies and often is ill-defined,
many diverse technologies must be developed and applied in novel ways.
It often is convenient to view information analysis as involving
three main steps: data acquisition, information extraction and
representation, and analysis.
Data Acquisition
Data of a variety of natures is acquired from a possibly large number of
diverse sources. Examples of data and their sources include:
-
Structured data: Computer audit trails, financial data,
attributes of a complex system, and,
generally, data from existing database systems may be subjected to
a variety of analyses in an attempt to detect behaviors such as intrusion,
fraud, or system malfunction.
Numerical and scientific data form an important subclass of
structured data whose analysis warrants special consideration. For example,
autonomous sensors play an important role in
safeguards and nonproliferation applications. While the data produced may
be structured (i.e., in a pre-specified format with well-defined features),
the challenges differ from those posed by the above-mentioned
sources in several ways. For example, these data
may include real-valued vectors of variable length
and data that is of a temporal nature (where change-of-state often is
the critical component of the analysis). It also often is the case that
a large amount of complex meta-data (e.g., scientific formulas and other
types of rules) is required to capture the semantics of this type of data.
-
Images: Satellite and other types of image data are important
for such
applications as nonproliferation, climatology, and environmental studies.
Such digitized images introduce a variety of new information attributes,
such as three-dimensional spatial and four-dimensional time-space
relationships.
-
Free text data: Documents, reports, technical articles, and
articles from the popular press contain a wealth of information to mine.
These sources present a particularly formidable challenge. Though the
problem is not as difficult as full natural language understanding, useful
analyses will require context to establish semantics, similarities of topics,
patterns of usage, and relevance to target queries.
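As an illustration of measuring similarity of topics, the following sketch
compares two short documents by the cosine similarity of their
term-frequency vectors. This is a minimal, hypothetical measure chosen for
exposition, not the method used by the system:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Similarity of two documents based on shared term frequencies."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Two documents on the same topic score well above zero
# despite differing wording.
print(cosine_similarity("reactor fuel shipment report",
                        "report on reactor fuel storage"))
```

Real text analysis would add weighting (e.g., by term rarity) and context,
but even this crude measure ranks topically related documents together.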
Extraction and Representation
A crucial aspect of the analysis system is the representation, storage,
and retrieval of the information under study.
Rather than develop distinct analysis techniques for each type of data we
might encounter, the best approach, we argue, is to
represent and exploit the salient features of data within a common data
model and to develop a uniform analysis methodology that operates upon this
common model.
Once a suitable representation is chosen, extraction tools are defined for each
type of data source to map data from the form
gathered into the common representation and to store the resulting data in
the underlying database.
The data model devised must be sufficiently rich and
flexible to support the variety of data we expect, but it also must be
capable of efficiently supporting sophisticated analyses of massively
large data sets, including the retrieval operations required for data mining.
To address these issues, we are interested in customizing and adapting
one or more data models well studied in computer science so as to be
suitable for our problem. Adaptations may include support for
statistical analyses, expert rule bases and axiom systems,
complex hierarchical relationships such as and/or relationships,
and the identification of data equivalence classes.
Information Analysis
Information analysis requires a suite of sophisticated support tools,
including:
-
Data mining tools for discovering and prioritizing potentially
interesting information:
Our research in data mining both explores foundational issues and
seeks to apply our results by incorporating into an experimental
software system our data exploration methodology and algorithmic
advances.
The data mining foundation we have built is based on ``information
prioritization'', a problem model where we are presented with a large
number of data points which must be
prioritized. The prioritization produced
allows one to pursue items from highest- to lowest-ranked until
time, money, or interest is exhausted.
Another defining characteristic of our work is that we have developed
methods which perform analyses even in the most information-deprived
environments (for example, environments lacking labeled training
sets, expert rules, and feedback).
The challenges presented by future research include
the mining of temporal, spatial, and textual patterns,
the construction of abstract statistical models of undesirable
behavior in new domains of interest, and the integration of our statistical
data mining techniques with automated reasoning techniques. This last
project is considered below.
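As a sketch of prioritization in an information-deprived setting (no labeled
training set, rules, or feedback), the following ranks points by their summed
per-feature deviation from the bulk of the data. This is a deliberately
simple stand-in, not the prioritization algorithms developed in our research:

```python
from statistics import mean, stdev

def prioritize(points: list[list[float]]) -> list[int]:
    """Return point indices ranked from most to least anomalous,
    using a per-feature z-score sum (no labels required)."""
    n = len(points[0])
    mus = [mean(p[j] for p in points) for j in range(n)]
    sigmas = [stdev(p[j] for p in points) or 1.0 for j in range(n)]
    def score(p: list[float]) -> float:
        return sum(abs(p[j] - mus[j]) / sigmas[j] for j in range(n))
    return sorted(range(len(points)), key=lambda i: score(points[i]),
                  reverse=True)

points = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [9.0, -5.0]]
print(prioritize(points))  # the outlier at index 3 is ranked first
```

An analyst would then pursue items from the top of this ranking until time,
money, or interest is exhausted.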
-
Automated deduction tools for reasoning about data:
An automated deduction tool allows us to make inferences and draw logical
conclusions about retrieved data based on general rules and relationships.
Our work in automated deduction focuses on the development of inference rules
and strategies needed to reason effectively about problems from
mathematics and logic and for application areas such as the analysis of
information. In order to develop an effective reasoning component for
an information analysis system, we are working on problems such as
the following.
-
Data to be interpreted will be at various levels of abstraction, ranging
from raw sensor data to high-level terms that are the output of other data
mining and analysis steps.
We are working to enhance the inference and search capabilities of the
automated deduction system in order to be able to reason effectively at
multiple levels of abstraction.
-
One aspect of the analysis of information is to search for sets of
observables
that are considered to be evidence for activities or conditions of
interest.
It often will be the case that several sets of observables will be
considered to be ``equivalent'' evidence for some activity or condition.
We are developing strategies to account for equivalence classes in the
search for evidence. Specifically, we are attempting to use the automated
deduction system to search for evidence using functional rather than
strictly syntactic matching criteria.
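Functional rather than syntactic matching can be sketched as follows,
assuming a hypothetical table of equivalence classes over named observables
(the observables and classes below are invented for illustration):

```python
# Hypothetical equivalence classes: each raw observable maps onto a
# canonical form that names the activity it is evidence for.
EQUIVALENCE = {
    "truck shipment": "material transfer",
    "rail shipment": "material transfer",
    "power spike": "reactor activity",
    "thermal plume": "reactor activity",
}

def canonical(observable: str) -> str:
    return EQUIVALENCE.get(observable, observable)

def matches(observed: set[str], required: set[str]) -> bool:
    """Functional match: the evidence matches if canonical forms agree,
    even when the raw observables differ syntactically."""
    return {canonical(o) for o in required} <= {canonical(o) for o in observed}

# A rail shipment plus a thermal plume counts as the same evidence
# as a truck shipment plus a power spike.
print(matches({"rail shipment", "thermal plume"},
              {"truck shipment", "power spike"}))  # True
```

In the deduction system itself, such classes would be derived from the rule
base rather than enumerated by hand.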
-
Integration of data mining and automated reasoning:
Data mining and automated reasoning techniques traditionally are applied
to quite different types of information analysis problems. We believe
each technology can benefit from the other, and together can form the core
of a powerful information analysis architecture.
The directions this integration may take include, for example, the following.
-
The statistically-based data mining system operates as a last line of
defense, inspecting data which the automated deduction system does not
flag as violating constraints specified in its rule base.
-
The data mining component prepares data for analysis by
the automated deduction component. For example, the data
mining component can provide data
at a level of abstraction well suited for
analysis by the automated deduction component and
can prioritize this information to help guide
the automated deduction system's search.
Further, clustering of data values suggested by the data mining component
can affect the application of inference rules (e.g., whether or not
a rule fires).
-
Automated deduction can be used to identify a subset of
data for further interpretation using data mining
strategies.
-
Information prioritization results obtained by mining a database of
successful proofs can be used to develop new search strategies for
automated deduction.
-
The automated deduction system may discover
equivalences that can be used by the data mining component
to simplify its search.
Eventually, the two components may be even more tightly coupled,
iterating to extract information at successively higher levels of
abstraction, and interacting in a hypothesize-test mode.
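The first integration direction above, with deduction applying explicit
constraints and data mining inspecting whatever passes, can be sketched as a
single pipeline. Both components here are hypothetical stand-ins (a one-rule
"rule base" and a trivial anomaly score), not the actual systems:

```python
def rule_flags(record: dict) -> bool:
    """Stand-in for the automated deduction component: flag any record
    violating a (hypothetical) declared-quantity constraint."""
    return record.get("transfers", 0) > record.get("declared", 0)

def anomaly_score(record: dict) -> float:
    """Stand-in for the statistical data mining component."""
    return abs(record.get("transfers", 0) - record.get("declared", 0))

def analyze(records: list[dict]) -> list[dict]:
    # Deduction first: records violating an explicit constraint are flagged.
    flagged = [r for r in records if rule_flags(r)]
    # Data mining as the last line of defense: rank what deduction passed.
    passed = sorted((r for r in records if not rule_flags(r)),
                    key=anomaly_score, reverse=True)
    return flagged + passed

records = [{"transfers": 5, "declared": 3},
           {"transfers": 2, "declared": 2},
           {"transfers": 0, "declared": 4}]
for r in analyze(records):
    print(r)
```

The flagged record surfaces first; among the records the rule base accepts,
the statistically unusual one is prioritized ahead of the unremarkable one.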
Last Modified: October 28, 1998
by veroff@cs.unm.edu