Information Analysis: Overview
The analysis of information is a rapidly growing area of Computer Science.
Information analysis is an umbrella term for a
multitude of techniques for extracting important, interesting, or
unexpected phenomena from massive quantities of information.
Because the information of interest comes in a wide variety of forms,
including structured databases, unstructured text, real-valued sensor data,
and digitized images, and
because the type of phenomena we seek varies and often is ill-defined,
many diverse technologies must be developed and applied in novel ways.
It often is convenient to view information analysis as involving
three main steps: data acquisition, information extraction and
representation, and analysis.
Data Acquisition
Data of a variety of natures is acquired from a possibly large number of
diverse sources. Examples of data and their sources include:
-
Structured data: Computer audit trails, financial data,
attributes of a complex system, and,
generally, data from existing database systems may be subjected to
a variety of analyses in an attempt to detect behaviors such as intrusion,
fraud, or system malfunction.
Numerical and scientific data form an important subclass of
structured data whose analysis warrants special consideration. For example,
autonomous sensors play an important role in
safeguards and nonproliferation applications. While the data produced may
be structured (i.e., in a pre-specified format with well-defined features),
the challenges differ from those posed by the above-mentioned
sources in several ways. For example, these data
may include real-valued vectors of variable length
and data that is of a temporal nature (where change-of-state often is
the critical component of the analysis). It also often is the case that
a large amount of complex meta-data (e.g., scientific formulas and other
types of rules) is required to capture the semantics of this type of data.
-
Images: Satellite and other types of image data are important
for such
applications as nonproliferation, climatology, and environmental studies.
Such digitized images introduce a variety of new information attributes,
such as three-dimensional spatial and four-dimensional time-space
relationships.
-
Free text data: Documents, reports, technical articles, and
articles from the popular press contain a wealth of information to mine.
These sources present a particularly formidable challenge. Though the
problem is not as difficult as full natural language understanding, useful
analyses will require context to establish semantics, similarities of topics,
patterns of usage, and relevance to target queries.
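As an illustration of measuring similarity of topics, the following sketch
compares two short documents by the cosine similarity of their
term-frequency vectors. This is a minimal, hypothetical measure chosen for
exposition, not the method used by the system:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Similarity of two documents based on shared term frequencies."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Two documents on the same topic score well above zero
# despite differing wording.
print(cosine_similarity("reactor fuel shipment report",
                        "report on reactor fuel storage"))
```

Real text analysis would add weighting (e.g., by term rarity) and context,
but even this crude measure ranks topically related documents together.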
Extraction and Representation
A crucial aspect of the analysis system is the representation, storage,
and retrieval of the information under study.
Rather than develop distinct analysis techniques for each type of data we
might encounter, the best approach, we argue, is to
represent and exploit the salient features of data within a common data
model and to develop a uniform analysis methodology that operates upon this
common model.
Once a suitable representation is chosen, extraction tools are defined for each
type of data source to map data from the form
gathered into the common representation and to store the resulting data in
the underlying database.
The data model devised must be sufficiently rich and
flexible to support the variety of data we expect, but it also must be
capable of efficiently supporting sophisticated analyses of massively
large data sets, including the retrieval operations required for data mining.
To address these issues, we are interested in customizing and adapting
one or more data models well studied in computer science so as to be
suitable for our problem. Adaptations may include support for
statistical analyses, expert rule bases and axiom systems,
complex hierarchical relationships such as and/or relationships,
and the identification of data equivalence classes.
Information Analysis
Information analysis requires a suite of sophisticated support tools,
including:
-
Data mining tools for discovering and prioritizing potentially
interesting information:
Our research in data mining both explores foundational issues and
seeks to apply our results by incorporating into an experimental
software system our data exploration methodology and algorithmic
advances.
The data mining foundation we have built is based on ``information
prioritization'', a problem model where we are presented with a large
number of data points which must be
prioritized. The prioritization produced
allows one to pursue items from highest- to lowest-ranked until
time, money, or interest is exhausted.
Another defining characteristic of our work is that we have developed
methods which perform analyses even in the most information-deprived
environments (for example, environments lacking labeled training
sets, expert rules, and feedback).
The challenges presented by future research include
the mining of temporal, spatial, and textual patterns,
the construction of abstract statistical models of undesirable
behavior in new domains of interest, and the integration of our statistical
data mining techniques with automated reasoning techniques. This last
project is considered below.
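As a sketch of prioritization in an information-deprived setting (no labeled
training set, rules, or feedback), the following ranks points by their summed
per-feature deviation from the bulk of the data. This is a deliberately
simple stand-in, not the prioritization algorithms developed in our research:

```python
from statistics import mean, stdev

def prioritize(points: list[list[float]]) -> list[int]:
    """Return point indices ranked from most to least anomalous,
    using a per-feature z-score sum (no labels required)."""
    n = len(points[0])
    mus = [mean(p[j] for p in points) for j in range(n)]
    sigmas = [stdev(p[j] for p in points) or 1.0 for j in range(n)]
    def score(p: list[float]) -> float:
        return sum(abs(p[j] - mus[j]) / sigmas[j] for j in range(n))
    return sorted(range(len(points)), key=lambda i: score(points[i]),
                  reverse=True)

points = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [9.0, -5.0]]
print(prioritize(points))  # the outlier at index 3 is ranked first
```

An analyst would then pursue items from the top of this ranking until time,
money, or interest is exhausted.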
-
Automated deduction tools for reasoning about data:
An automated deduction tool allows us to make inferences and draw logical
conclusions about retrieved data based on general rules and relationships.
Our work in automated deduction focuses on the development of inference rules
and strategies needed to reason effectively about problems from
mathematics and logic and for application areas such as the analysis of
information. In order to develop an effective reasoning component for
an information analysis system, we are working on problems such as
the following.
-
Data to be interpreted will be at various levels of abstraction, ranging
from raw sensor data to high-level terms that are the output of other data
mining and analysis steps.
We are working to enhance the inference and search capabilities of the
automated deduction system in order to be able to reason effectively at
multiple levels of abstraction.
-
One aspect of the analysis of information is to search for sets of
observables
that are considered to be evidence for activities or conditions of
interest.
It often will be the case that several sets of observables will be
considered to be ``equivalent'' evidence for some activity or condition.
We are developing strategies to account for equivalence classes in the
search for evidence. Specifically, we are attempting to use the automated
deduction system to search for evidence using functional rather than
strictly syntactic matching criteria.
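Functional rather than syntactic matching can be sketched as follows,
assuming a hypothetical table of equivalence classes over named observables
(the observables and classes below are invented for illustration):

```python
# Hypothetical equivalence classes: each raw observable maps onto a
# canonical form that names the activity it is evidence for.
EQUIVALENCE = {
    "truck shipment": "material transfer",
    "rail shipment": "material transfer",
    "power spike": "reactor activity",
    "thermal plume": "reactor activity",
}

def canonical(observable: str) -> str:
    return EQUIVALENCE.get(observable, observable)

def matches(observed: set[str], required: set[str]) -> bool:
    """Functional match: the evidence matches if canonical forms agree,
    even when the raw observables differ syntactically."""
    return {canonical(o) for o in required} <= {canonical(o) for o in observed}

# A rail shipment plus a thermal plume counts as the same evidence
# as a truck shipment plus a power spike.
print(matches({"rail shipment", "thermal plume"},
              {"truck shipment", "power spike"}))  # True
```

In the deduction system itself, such classes would be derived from the rule
base rather than enumerated by hand.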
-
Integration of data mining and automated reasoning:
Data mining and automated reasoning techniques traditionally are applied
to quite different types of information analysis problems. We believe
each technology can benefit from the other, and together can form the core
of a powerful information analysis architecture.
The directions this integration may take include, for example, the following.
-
The statistically-based data mining system operates as a last line of
defense, inspecting data which the automated deduction system does not
flag as violating constraints specified in its rule base.
-
The data mining component prepares data for analysis by
the automated deduction component. For example, the data
mining component can provide data
at a level of abstraction well suited for
analysis by the automated deduction component and
can prioritize this information to help guide
the automated deduction system's search.
Further, clustering of data values suggested by the data mining component
can affect the application of inference rules (e.g., whether or not
a rule fires).
-
Automated deduction can be used to identify a subset of
data for further interpretation using data mining
strategies.
-
Information prioritization results obtained by mining a database of
successful proofs can be used to develop new search strategies for
automated deduction.
-
The automated deduction system may discover
equivalences that can be used by the data mining component
to simplify its search.
Eventually, the two components may be even more tightly coupled,
iterating to extract information at successively higher levels of
abstraction, and interacting in a hypothesize-test mode.
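The first integration direction above, with deduction applying explicit
constraints and data mining inspecting whatever passes, can be sketched as a
single pipeline. Both components here are hypothetical stand-ins (a one-rule
"rule base" and a trivial anomaly score), not the actual systems:

```python
def rule_flags(record: dict) -> bool:
    """Stand-in for the automated deduction component: flag any record
    violating a (hypothetical) declared-quantity constraint."""
    return record.get("transfers", 0) > record.get("declared", 0)

def anomaly_score(record: dict) -> float:
    """Stand-in for the statistical data mining component."""
    return abs(record.get("transfers", 0) - record.get("declared", 0))

def analyze(records: list[dict]) -> list[dict]:
    # Deduction first: records violating an explicit constraint are flagged.
    flagged = [r for r in records if rule_flags(r)]
    # Data mining as the last line of defense: rank what deduction passed.
    passed = sorted((r for r in records if not rule_flags(r)),
                    key=anomaly_score, reverse=True)
    return flagged + passed

records = [{"transfers": 5, "declared": 3},
           {"transfers": 2, "declared": 2},
           {"transfers": 0, "declared": 4}]
for r in analyze(records):
    print(r)
```

The flagged record surfaces first; among the records the rule base accepts,
the statistically unusual one is prioritized ahead of the unremarkable one.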
Last Modified: October 28, 1998
by veroff@cs.unm.edu