Poster for Pacific Symposium on Biocomputing '99

Data Requirements for Inferring Genetic Networks from Expression Data.

 
Patrik D'haeseleer
University of New Mexico
Department of Computer Science,
Albuquerque, NM 87131
 
Large-scale mRNA expression measurements raise the possibility of extracting information on the underlying genetic regulatory interactions directly from the data. Mjolsness, Reinitz and Sharp (1991) pioneered the use of models based on a connection matrix in modeling Drosophila development. The main parameters of such models, the connection weights, can be optimized using a variety of techniques to match the available gene expression data. A number of such models are being presented at this symposium.

The purpose of this poster is twofold: (1) to examine some of the difficulties and pitfalls associated with connection-based modeling from large-scale data; and (2) to provide some guidelines for experimentalists who are collecting such data with the intent of using it for genetic network inference.


 
[Figure panels: Gene expression data sets; GAD/GABA interactions]
Figure: From expression data to gene network (from D'haeseleer et al., Pacific Symposium on Biocomputing, 1999)

 
1. The Curse of Dimensionality

Our human intuition is to observe as many parameters of a system as possible, hence the current effort to measure the expression levels of more and more genes simultaneously, rather than to measure these expression levels as often as possible (which might require reusable or continuous measurement techniques).

However, the more variables, the harder the modeling task, because the size of the search space increases exponentially with the number of parameters of the model. This is often referred to as the Curse of Dimensionality.
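As a rough illustration of this growth, the sketch below counts the candidate models in a naive grid search. It assumes that a fully connected connection matrix on N genes has N x N weights and that each weight is restricted to a small number of discrete levels. The function names and the choice of three levels are illustrative assumptions, not part of any specific modeling method.

```python
def num_weights(n_genes: int) -> int:
    """Weights in a fully connected connection matrix (one per gene pair)."""
    return n_genes * n_genes


def grid_size(n_params: int, levels: int = 3) -> int:
    """Number of candidate models when each parameter takes `levels` values."""
    return levels ** n_params


if __name__ == "__main__":
    for n in (2, 5, 10):
        p = num_weights(n)
        print(f"{n} genes -> {p} weights -> {grid_size(p):.3e} candidate models")
```

Even with only three levels per weight, the count becomes astronomically large well before N reaches the number of genes on a typical array.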

In order to achieve an accurate model, we must at least measure those variables which are important to the process being studied. If some intermediate variables are not measured, it may be possible to infer them during the modeling process, but this can be very hard. We should be as inclusive as possible in which variables we measure, and try to eliminate redundant variables after the data is collected.

Narrowing down the range of plausible models by adding extra constraints can simplify the search for the best model considerably. Constraining the model by using a priori information about what is biologically known or plausible is probably the most important weapon we have to fight the Curse of Dimensionality! How precisely to incorporate this information into the inference process is the true art of modeling. Additional information on the organism to be modeled is also crucial in verifying the results of the model, and in efforts to develop the modeling technology. The current neglect of E. coli expression mapping in favor of S. cerevisiae is unfortunate, because E. coli is less complex and more is known about its regulatory interactions.


2. What are the important variables?

The state of a cell consists of all those parameters, both internal and external, which determine its behavior. According to the Central Dogma, the activity of a cell is determined by which of its genes are being expressed.

"The mRNA levels sensitively reflect the state of the cell, perhaps uniquely defining cell types, stages, and responses. To decipher the logic of gene regulation, we should aim to be able to monitor the expression level of all genes simultaneously ... " [Lander]

Apart from the classical DNA->mRNA->protein pathway, the genes are in turn regulated by proteins. Protein interactions and secretion can cause significant discrepancies between protein and mRNA levels. In a recent comparison of selected mRNA and protein abundances in human liver, a correlation of only 0.48 was observed between the two. Clearly, protein levels form an important part of the internal state of a cell.
 

[Figure panels: mRNA vs. protein abundance in liver cells; mRNA vs. protein abundance in plasma]
Figure: mRNA versus protein levels in liver cells and plasma (from Anderson et al., Electrophoresis, 1997 Mar-Apr; 18(3-4):533-7).

In addition to mRNA and protein levels, one could imagine measuring a number of other parameters, including cell volume, growth rate, methylation states of DNA, phosphorylation state of proteins, ion levels, and especially metabolite and nutrient levels.

Currently, most studies trying to infer expression mechanisms from cell state data use mRNA levels, because they are the easiest to measure.

Large-scale protein measurements (e.g. 2D-PAGE) typically only measure the most abundant proteins, but can be supplemented with more exact measurements of individual proteins which are known to play an important role. At the very least, these data should be collected at the start and end of each time series. Similarly, when measuring gene expression data on a process involving metabolism (and which cellular process doesn't?), an effort should be made to measure the most important metabolite and nutrient levels.


3. Number and variety of data points needed

Gene network inference techniques are data-hungry. Gene expression time series yield lots of data, but all the data points tend to be about a single dynamical process in the cell, and each point is closely related to its neighboring time points. A 10-point time series will therefore contain much less information than a data set of ten expression measurements under different environmental conditions, or with different mutations. The advantage of the time series is that it can provide insight into the dynamics of the process.

Both types of data, and multiple data sets of each, will likely be needed to unravel the regulatory interactions of the genes. To correctly infer the regulation of a single gene, we need to observe the expression of that gene under many different combinations of expression levels of its regulatory inputs. This implies a wide variety of different environmental conditions and perturbations.

How many data points do we really need to infer a gene network on N genes? That depends on the model used to do the inference! Constraining the connectivity of the network (maximum number of regulatory inputs per gene) and the nature of the regulatory interactions can dramatically reduce the amount of data needed. Here are some estimates for the number of data points needed (asymptotic growth rate) for N genes using different models:

Model                                    Data needed
Boolean, fully connected                 2^N
Boolean, connectivity K                  2^K log(N) ?
Boolean, connectivity K, lin. sep.       K log(N/K)
Continuous, fully connected, additive    N
Continuous, connectivity K, additive     K log(N/K) ?
Pairwise correlation                     log(N)

Fully connected: each gene can receive regulatory inputs from all other genes. Connectivity K: at most K regulatory inputs per gene. Additive, lin. sep. (linearly separable, for Boolean functions): regulation can be modeled as a weighted sum. Pairwise correlation: the significance level for pairwise comparisons based on correlation must decrease in inverse proportion to the number of variables.
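To make the "weighted sum" idea concrete, here is a minimal sketch of how one gene's regulation could be computed in an additive model and in its linearly separable Boolean counterpart. The sigmoid squashing function, the bias term, and the example weights are assumptions chosen for illustration; the poster does not prescribe a particular functional form.

```python
import math


def weighted_input(weights, expression, bias=0.0):
    """Weighted sum of regulator expression levels plus a bias term."""
    return sum(w * x for w, x in zip(weights, expression)) + bias


def additive_response(weights, expression, bias=0.0):
    """Continuous additive model: squash the weighted sum into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-weighted_input(weights, expression, bias)))


def threshold_response(weights, expression, bias=0.0):
    """Linearly separable Boolean model: on iff the weighted sum is positive."""
    return weighted_input(weights, expression, bias) > 0


# Hypothetical gene with one activator (+2.0) and one repressor (-3.0).
weights = [2.0, -3.0]
print(threshold_response(weights, [1, 0]))  # activator alone
print(threshold_response(weights, [1, 1]))  # activator plus repressor
```

In both variants the model for one gene is fully described by one weight per regulatory input, which is why restricting the connectivity K shrinks the number of parameters to be inferred.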

Ideally, the number of data points needed will scale with log(N), rather than N. These estimates hold for independently chosen data points, and only indicate asymptotic growth rates, ignoring any constant factor. In practice, the amount of data may need to be orders of magnitude higher because of non-independence and large measurement errors.
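The growth rates listed above can be compared numerically with a short sketch. It reads the two unconstrained Boolean entries as exponentials (2^N and 2^K log(N)), uses base-2 logarithms, and drops all constant factors; these are assumptions, so the output only indicates relative scaling, not actual experiment counts. The function name and the example values N = 64, K = 4 are hypothetical.

```python
import math


def data_needed(n_genes: int, k: int) -> dict:
    """Asymptotic data requirements per model, ignoring constant factors."""
    log_n = math.log2(n_genes)
    log_n_over_k = math.log2(n_genes / k)
    return {
        "Boolean, fully connected": 2 ** n_genes,
        "Boolean, connectivity K": 2 ** k * log_n,
        "Boolean, connectivity K, lin. sep.": k * log_n_over_k,
        "Continuous, fully connected, additive": n_genes,
        "Continuous, connectivity K, additive": k * log_n_over_k,
        "Pairwise correlation": log_n,
    }


if __name__ == "__main__":
    for model, points in data_needed(n_genes=64, k=4).items():
        print(f"{model:40s} {points:.3g}")
```

The gap between the fully connected and the connectivity-constrained rows widens rapidly as N grows, which is the practical argument for constraining the model.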

From an Information Theory viewpoint, each measurement can give us up to 2-3 bits of information using RT-PCR (assuming a 15-20% measurement error). From cDNA microarrays and oligonucleotide chips, we can get 1-2 bits per measurement (assuming a 30-50% measurement error). If measurement accuracy is low, more data points may need to be collected to achieve the same accuracy in the model. Modeling real data with Boolean networks discards a lot of information in the data sets, because the expression levels need to be discretized to one bit per measurement. Continuous models will tend to use the available information in the data set better.
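The bit counts quoted above are consistent with a simple back-of-the-envelope rule: if a measurement has relative error e, roughly 1/e levels can be told apart, giving about log2(1/e) bits. This rule is an assumption made here for illustration (the poster does not state the formula it used), but it reproduces the quoted ranges.

```python
import math


def bits_per_measurement(relative_error: float) -> float:
    """Approximate information per measurement as log2(1 / relative error)."""
    return -math.log2(relative_error)


if __name__ == "__main__":
    for label, err in [("RT-PCR", 0.15), ("RT-PCR", 0.20),
                       ("microarray", 0.30), ("microarray", 0.50)]:
        print(f"{label:10s} {err:.0%} error -> {bits_per_measurement(err):.1f} bits")
```

Under this rule a 50% error corresponds to exactly one bit, the same amount a Boolean discretization retains.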


4. Combining data from different sources

The need for large numbers of data points, and many different conditions, implies that successful modeling efforts will probably have to use data from different sources. Modeling methodologies have to be able to deal with different data types such as time series and steady-state data, different error levels, incomplete data, etc. The data collected will also have to be calibrated properly to allow comparison with other data sets. Relative expression levels have limited usefulness unless they can be calibrated with respect to other data sets post facto (e.g. using expression levels relative to a given standard). Finally, organism strains and growing conditions should be as standard as possible to ensure other relevant data sets will be available.
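One simple way to picture calibration against a given standard is sketched below: each data set includes a reading for the same standard, and every level is expressed as a ratio to it. The gene names, the readings, and the ratio-based scheme itself are hypothetical; real calibration would also have to account for measurement error and nonlinear instrument response.

```python
def relative_to_standard(levels: dict, standard: str) -> dict:
    """Express every level as a ratio to the shared standard's level."""
    reference = levels[standard]
    if reference <= 0:
        raise ValueError("standard must have a positive expression level")
    return {gene: value / reference for gene, value in levels.items()}


# Hypothetical raw readings from two labs, in different arbitrary units.
lab_a = {"std": 200.0, "geneX": 400.0, "geneY": 100.0}
lab_b = {"std": 5.0, "geneX": 10.0, "geneY": 2.5}

print(relative_to_standard(lab_a, "std"))
print(relative_to_standard(lab_b, "std"))
```

After rescaling, the two hypothetical data sets can be compared directly even though their raw units differ.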



© Copyright 1997 by Patrik D'haeseleer, patrik at cs dot unm dot edu
c/o Computer Science Department, University of New Mexico, Albuquerque, NM, 87131

(505) 277-9428 (office)
(505) 277-6927 (fax)