| Large-scale mRNA expression measurements raise the possibility of
extracting information on the underlying genetic regulatory interactions
directly from the data. Mjolsness, Reinitz and Sharp (1991) pioneered
the use of models based on a connection matrix in modeling Drosophila
development. The main parameters, the connection weights, can be optimized
using a variety of techniques to match the available gene expression
data. A number of such models are being presented at this symposium.
The purpose of this poster is twofold: (1) to examine some of the difficulties and pitfalls associated with connection-based modeling from large-scale data; and (2) to provide some guidelines for experimentalists who are collecting such data with the intent of using it for genetic network inference. |
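To make the idea of a connection-based model concrete, here is a minimal sketch. It assumes a discrete-time additive form, x(t+1) = sigmoid(W x(t) + b), in which W[i][j] is the influence of gene j on gene i. The functional form, the function names (`step`, `simulate`, `prediction_error`) and the toy weights are all assumptions made for illustration; they are not taken from Mjolsness, Reinitz and Sharp.

```python
import math

def sigmoid(z):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def step(x, W, b):
    """One update: each gene's next level is a squashed weighted sum of all current levels."""
    return [sigmoid(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]

def simulate(x0, W, b, n_steps):
    """Return the list of states x(0), x(1), ..., x(n_steps)."""
    states = [list(x0)]
    for _ in range(n_steps):
        states.append(step(states[-1], W, b))
    return states

def prediction_error(W, b, series):
    """Sum of squared one-step prediction errors over a time series."""
    error = 0.0
    for current, observed_next in zip(series[:-1], series[1:]):
        predicted = step(current, W, b)
        error += sum((p - o) ** 2 for p, o in zip(predicted, observed_next))
    return error

# Hypothetical 3-gene network used to generate a toy "expression" time series.
W_true = [[0.0, -4.0, 0.0],
          [4.0, 0.0, 0.0],
          [0.0, 4.0, -2.0]]
b_true = [1.0, -2.0, -1.0]
series = simulate([0.5, 0.5, 0.5], W_true, b_true, n_steps=10)

W_zero = [[0.0] * 3 for _ in range(3)]
print(prediction_error(W_true, b_true, series))  # weights that generated the data
print(prediction_error(W_zero, b_true, series))  # wrong weights give a larger error
```

Optimizing the connection weights then amounts to searching for the W and b that minimize an error of this kind on the measured data, using whichever optimization technique is preferred.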
| From expression data to gene network (from D'haeseleer et al., Pacific Symposium on Biocomputing, 1999) |
However, the more variables, the harder the modeling task, because the size of the search space increases exponentially with the number of parameters of the model. This is often referred to as the Curse of Dimensionality.
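A rough way to see this growth, under the assumption that each parameter is restricted to a small number of discrete levels (here three: inhibiting, absent, activating) and that a fully connected model on N genes has N*N connection weights. The function name and the choice of three levels are illustrative only.

```python
def search_space_size(n_params, levels_per_param=3):
    """Number of candidate models when each parameter takes one of a few discrete values."""
    return levels_per_param ** n_params

for n_genes in (2, 5, 10):
    n_params = n_genes ** 2  # one weight per ordered pair of genes (assumed)
    print(n_genes, n_params, search_space_size(n_params))
```

Even with this coarse discretization, ten genes already give 3**100 candidate weight matrices, far too many to enumerate.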
In order to achieve an accurate model, we must at least measure those variables which are important to the process being studied. If some intermediate variables are not measured, it may be possible to infer them during the modeling process, but this can be very hard. We should be as inclusive as possible in which variables we measure, and try to eliminate redundant variables after the data is collected.
Narrowing down the range of plausible models by adding extra constraints can simplify the search for the best model considerably. Constraining the model by using a priori information about what is biologically known or plausible is probably the most important weapon we have to fight the Curse of Dimensionality! How precisely to include this information in the inference process is the true art of modeling. Additional information on the organism to be modeled is also crucial in verifying the results of the model, and in efforts to develop the modeling technology. The current neglect of E. coli expression mapping in favor of S. cerevisiae is unfortunate, because E. coli is less complex and more is known about its regulatory interactions.
| "The mRNA levels sensitively reflect the state of the cell, perhaps uniquely defining cell types, stages, and responses. To decipher the logic of gene regulation, we should aim to be able to monitor the expression level of all genes simultaneously ... " [Lander] |
Apart from the classical DNA->mRNA->protein pathway, the genes
are in turn regulated by proteins. Protein interactions and secretion
can cause significant discrepancies between protein and mRNA levels. In
a recent comparison of selected mRNA and protein abundances in human
liver, a correlation of only 0.48 was observed between the two. Clearly,
protein levels form an important part of the internal state of a cell.
| mRNA versus protein levels in liver cells and plasma (from Anderson et al. Electrophoresis. 1997 Mar-Apr; 18(3-4):533-7). |
In addition to mRNA and protein levels, one could imagine measuring a number of other parameters, including cell volume, growth rate, methylation states of DNA, phosphorylation state of proteins, ion levels, and especially metabolite and nutrient levels.
Currently, most studies trying to infer expression mechanisms from cell state data use mRNA levels, because they are the easiest to measure.
Large-scale protein measurements (e.g. 2D-PAGE) typically only measure the most abundant proteins, but can be supplemented with more exact measurements of individual proteins which are known to play an important role. At the very least, this data should be collected at the start and end of each time series. Similarly, when measuring gene expression data on a process involving metabolism (and which cellular process doesn't?), an effort should be made to measure the most important metabolite and nutrient levels.
Both types of data, and multiple data sets of each, will likely be needed to unravel the regulatory interactions of the genes. To correctly infer the regulation of a single gene, we need to observe the expression of that gene under many different combinations of expression levels of its regulatory inputs. This implies a wide variety of different environmental conditions and perturbations.
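The need for many different input combinations can be checked directly on a data set. The sketch below assumes a Boolean view of a single gene with K known regulatory inputs and simply counts how many of the 2^K possible input combinations have been observed at least once. The function name and the toy observations are hypothetical.

```python
def input_coverage(observations, k):
    """Fraction of the 2**k Boolean input combinations seen at least once.

    observations: iterable of (inputs, output) pairs, where inputs is a
    tuple of k values in {0, 1} and output is the regulated gene's state.
    """
    seen = {tuple(inputs) for inputs, _ in observations}
    return len(seen) / 2 ** k

# Four measurements, but one input combination is repeated and (1, 0) is never seen.
observations = [((0, 0), 0), ((0, 1), 1), ((0, 1), 1), ((1, 1), 0)]
coverage = input_coverage(observations, k=2)
print(coverage)  # 0.75
```

A coverage below 1 means the regulatory rule of that gene is not fully determined by the data, however many repeated measurements are taken under the same conditions.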
How many data points do we really need to infer a gene network on N genes? That depends on the model used for the inference! Constraining the connectivity of the network (maximum number of regulatory inputs per gene) and the nature of the regulatory interactions can dramatically reduce the amount of data needed. Here are some estimates of the number of data points needed (asymptotic growth rate) for N genes using different models:
| Model | Data needed |
| Boolean, fully connected | 2^N |
| Boolean, connectivity K | 2^K log(N) ? |
| Boolean, connectivity K, lin. sep. | K log(N/K) |
| Continuous, fully connected, additive | N |
| Continuous, connectivity K, additive | K log(N/K) ? |
| Pairwise correlation | log(N) |
| Fully connected: each gene can receive regulatory inputs from all other genes. Connectivity K: at most K regulatory inputs per gene. Additive, lin. sep. (linearly separable, for Boolean functions): regulation can be modeled as a weighted sum. Pairwise correlation: the significance level for pairwise comparisons based on correlation must decrease in inverse proportion to the number of variables. |
Ideally, the number of data points needed will scale with log(N), rather than N. These estimates hold for independently chosen data points, and only indicate asymptotic growth rates, ignoring any constant factor. In practice, the amount of data may need to be orders of magnitude higher because of non-independence and large measurement errors.
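To get a feel for how differently these growth rates behave, the sketch below evaluates them for a given N and K. It reads the first two table entries as 2^N and 2^K log(N) (exponents), uses base-2 logarithms, and ignores all constant factors; these are assumptions, and the numbers are growth rates rather than actual sample-size recommendations. The function name `data_needed` is illustrative.

```python
import math

def data_needed(n_genes, k):
    """Asymptotic data estimates (no constant factors) for N genes, at most K inputs per gene."""
    n = n_genes
    return {
        "Boolean, fully connected": 2 ** n,
        "Boolean, connectivity K": 2 ** k * math.log2(n),
        "Boolean, connectivity K, lin. sep.": k * math.log2(n / k),
        "Continuous, fully connected, additive": n,
        "Continuous, connectivity K, additive": k * math.log2(n / k),
        "Pairwise correlation": math.log2(n),
    }

estimates = data_needed(n_genes=1024, k=4)
for model, amount in estimates.items():
    # The fully connected Boolean entry is an exact (very large) integer;
    # format it via its digit count instead of converting to float.
    if isinstance(amount, int) and amount > 10 ** 12:
        print(f"{model}: about 10^{len(str(amount)) - 1}")
    else:
        print(f"{model}: {amount:g}")
```

For 1024 genes with at most 4 inputs each, the constrained models call for tens to hundreds of data points, the fully connected additive model for about a thousand, and the fully connected Boolean model for an astronomically large number.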
From an Information Theory viewpoint, each measurement can give us up to 2-3 bits of information using RT-PCR (assuming a 15-20% measurement error). From cDNA microarrays and oligonucleotide chips, we can get 1-2 bits per measurement (assuming a 30-50% measurement error). If measurement accuracy is low, more data points may need to be collected to achieve the same accuracy in the model. Modeling real data with Boolean networks discards a lot of information in the data sets, because the expression levels need to be discretized to one bit per measurement. Continuous models will tend to use the available information in the data set better.
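The poster does not state how the bit counts are derived. One simple reading that reproduces the quoted ranges is that a measurement with relative error e can distinguish roughly 1/e levels, giving about log2(1/e) bits. The sketch below uses that assumed formula; the function name is illustrative.

```python
import math

def bits_per_measurement(relative_error):
    """Approximate information per measurement, assuming about 1/error distinguishable levels."""
    return math.log2(1.0 / relative_error)

for label, error in [("RT-PCR, 15% error", 0.15),
                     ("RT-PCR, 20% error", 0.20),
                     ("microarray, 30% error", 0.30),
                     ("microarray, 50% error", 0.50)]:
    print(f"{label}: {bits_per_measurement(error):.2f} bits")
```

Under this assumption, discretizing to a Boolean value keeps at most one bit, so a Boolean model applied to RT-PCR data would discard more than half of the information in each measurement.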