Data Requirements for Gene Network Inference



1. The Curse of Dimensionality
2. What are the important variables?
3. Constraining the model
4. Number and variety of data points needed


The purpose of this section is twofold: (1) to examine some of the difficulties and pitfalls associated with inferring gene networks from large-scale data; and (2) to provide some guidelines for experimentalists who are collecting such data with the intent of using it for genetic network inference.

1. The Curse of Dimensionality

Measuring more variables allows for a more exact model, but makes the correct model exponentially harder to find.

Our human intuition when faced with the task of modeling an unknown process is to observe as many parameters of the system as possible. This is clearly reflected in the current effort to measure the expression levels of more and more genes simultaneously, rather than to measure these expression levels as often as possible (which would require reusable or continuous measurement techniques).

However, in Machine Learning it is well known that the more variables one needs to model, the harder the modeling task becomes, because the size of the search space increases exponentially with the number of parameters of the model. This is often referred to as the Curse of Dimensionality.
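
To make this exponential growth concrete, the following minimal sketch counts candidate models for one simple model class. It assumes a Boolean network in which every gene's rule may depend on all N genes; this is an illustrative choice, and the function name is hypothetical.

```python
def log2_model_count(n_genes: int) -> int:
    """log2 of the number of fully connected Boolean networks on n_genes.

    Assumption: each gene's rule is an arbitrary Boolean function of all
    n_genes inputs, so there are 2**(2**n_genes) possible rules per gene
    and (2**(2**n_genes))**n_genes possible networks in total.
    """
    rules_per_gene_log2 = 2 ** n_genes  # log2 of 2**(2**n_genes)
    return n_genes * rules_per_gene_log2


for n in (2, 5, 10, 20):
    print(f"N={n:2d}: about 2^{log2_model_count(n)} candidate networks")
```

Even the exponent of the model count grows exponentially with N here, which is why an unconstrained search quickly becomes hopeless.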

Does this mean that our human intuition about modeling is wrong? Not necessarily. Although we humans do want to be able to look at as many variables of the problem as possible, we rather quickly select those we think are really important to the system, and simply ignore the others. Our reason for wanting to know all the variables is so we wouldn't miss any of the important ones, not so we could include all the non-important ones in our model. Similarly, in Machine Learning, careful selection of the input variables is crucial to get around the Curse of Dimensionality. Use of a priori information can also help narrow down the range of plausible models.

2. What are the important variables?

The state of a cell consists of all those parameters, both internal and external, which determine its behavior. Following the Central Dogma of molecular biology, the activity of a cell is determined by which of its genes are being expressed. If a particular gene is being expressed, its DNA is transcribed into complementary messenger RNA (mRNA), which is then translated into the specific protein the gene codes for. We can measure the level of expression of each gene by measuring how many mRNA copies are present in the cell.

"The mRNA levels sensitively reflect the state of the cell, perhaps uniquely defining cell types, stages, and responses. To decipher the logic of gene regulation, we should aim to be able to monitor the expression level of all genes simultaneously ... " [Lander]

This cartoon picture of the Central Dogma is of course highly incomplete. Apart from the classical DNA -> mRNA -> protein pathway, the genes in the DNA are themselves regulated by the presence or absence of certain proteins. Furthermore, many of the interactions going on in the cell occur entirely at the protein level, which can cause significant discrepancies between protein and mRNA levels. In a recent comparison of selected mRNA and protein abundances in human liver, a correlation of only 0.48 was observed between the two. Clearly, protein levels form an important part of the internal state of a cell.

In addition to mRNA and protein levels, one could imagine measuring a number of other parameters, including cell volume, growth rate, methylation states of DNA, phosphorylation state of proteins, localization of proteins and mRNA within the cell, ion levels, etc. One class of data which could be very useful is metabolite and nutrient levels.

For example, during the diauxic shift in yeast (the transition from glucose metabolism to ethanol metabolism), one would of course need to measure glucose levels, but preferably also a number of other metabolites involved, such as acetate, pyruvate, glycogen, trehalose, etc. Arkin et al. use capillary zone electrophoresis to simultaneously measure eight of the small molecular species in an in vitro glycolysis reaction.

Currently, most studies trying to infer expression mechanisms from cell state data use mRNA levels, because they are the easiest to measure (especially with the new large-scale gene expression technologies). Large-scale protein measurements tend to be very incomplete (typically covering only the most abundant proteins), but can be supplemented with more exact measurements of individual proteins which are known to play an important role. If most protein levels turn out to be exactly correlated with the corresponding mRNA levels, they can always be left out of the model. Similarly, when measuring gene expression data on a process involving metabolism (and which cellular process doesn't?), an effort should be made to quantify the most important metabolite and nutrient levels.

3. Constraining the model

The space of models to be searched increases exponentially with the number of parameters of the model, and therefore with the number of variables. Narrowing down the range of plausible models by imposing extra constraints can simplify the search for the best model considerably. For example, constraining each gene to be regulated by no more than 7 other genes will drastically reduce the number of regulatory interactions we need to consider. Similarly, for Boolean networks, constraining the rules for each gene to be biologically plausible can significantly reduce the number of Boolean rules that match the data we have on the regulation of each gene.
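
As a rough illustration of how much a connectivity constraint can shrink the search, the sketch below counts the possible sets of regulators for a single gene, with and without an upper limit on the number of inputs. The network size of 30 genes is an arbitrary example, and the function name is hypothetical.

```python
from math import comb


def regulator_sets(n_genes: int, max_inputs: int) -> int:
    """Number of ways to pick at most max_inputs regulators for one gene
    from the other n_genes - 1 genes."""
    others = n_genes - 1
    return sum(comb(others, k) for k in range(min(max_inputs, others) + 1))


n = 30  # illustrative network size, not taken from any data set
print("unconstrained:", regulator_sets(n, n - 1))  # equals 2**(n - 1)
print("at most 7    :", regulator_sets(n, 7))
```

This counts only which genes act as inputs. The number of possible rules for each input set shrinks as well, since a rule table over K inputs has 2^K entries rather than 2^N.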

Constraining the model by using a priori information about what is biologically known or plausible is probably the most important weapon we have to fight the Curse of Dimensionality! How precisely to include this information into the inference process is the true art of modeling.

4. Number and variety of data points needed

The gene network inference techniques we will cover have one thing in common: they tend to be data-hungry. Measuring gene expression time series has the nice feature of yielding lots of data. However, all the data points in a single time series tend to be about a single dynamical process in the cell, and will be related to the surrounding time points. A data set of ten expression measurements under different environmental conditions, or with different mutations, will actually contain more information than a time series of ten data points on a single phenomenon. The advantage of the time series is that it can provide crucial insights into the dynamics of the process.

Both types of data, and multiple data sets of each, will likely be needed to unravel the regulatory interactions of the genes. Indeed, to correctly infer the regulation of a single gene, we need to observe the expression of that gene under many different combinations of expression levels of its regulatory inputs. This implies a wide variety of different environmental conditions and perturbations.
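
One way to check whether a data set exercises a gene's inputs sufficiently is to count how many distinct on/off combinations of its putative regulators actually occur in the data. The sketch below assumes binarized expression values; the gene names, the data and the function name are all hypothetical.

```python
from typing import Dict, List, Sequence


def input_coverage(samples: List[Dict[str, int]], regulators: Sequence[str]) -> float:
    """Fraction of the 2**K possible on/off combinations of the given
    regulators that appear at least once in the samples."""
    seen = {tuple(s[g] for g in regulators) for s in samples}
    return len(seen) / 2 ** len(regulators)


# Hypothetical binarized data: gene name -> 0 (off) or 1 (on).
time_series = [{"A": 1, "B": 0}, {"A": 1, "B": 0}, {"A": 1, "B": 1}]
perturbations = [{"A": 0, "B": 0}, {"A": 1, "B": 0},
                 {"A": 0, "B": 1}, {"A": 1, "B": 1}]

print(input_coverage(time_series, ["A", "B"]))    # 2 of 4 combinations seen
print(input_coverage(perturbations, ["A", "B"]))  # all 4 combinations seen
```

In this toy example the time series revisits the same input state, while the perturbation set covers every combination, which is the kind of variety the regulation of a gene can only be inferred from.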

How many data points do we really need to infer a gene network on N genes? For a completely unconstrained, potentially fully connected Boolean network model, we would need to measure all possible 2^N input-output pairs. This is clearly inconceivable for realistic numbers of genes (30 genes would imply more than a billion data points needed). If we constrain the genes to have no more than K inputs from other genes, the number of (independent!) data points needed becomes proportional to log(N). Preliminary experimental results from Liang et al. at PSB '98 and Akutsu et al. at PSB '99, as well as calculations based on the probability that all the entries in the rule tables are uniquely specified after n independent input-output pairs, suggest the number of data points needed scales as 2^K log(N). If we further restrict the Boolean function implementing the regulatory interactions at each gene to be linearly separable (i.e. it can be modeled using a weighted sum and a threshold function), the amount of data needed is O(K log(N/K)), or O(K log(N)) for K << N. For details of this derivation, see Hertz, PSB '98. Similarly, for a fully connected linear or quasi-linear continuous model, we would need at the very least as many data points as genes. For models with restricted connectivity, we expect a similar improvement as for the Boolean case.
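
A calculation of the kind mentioned above can be sketched with a simple union bound. This is an illustrative reconstruction under strong assumptions, not necessarily the calculation performed by the cited authors: each gene is assumed to have exactly K known inputs, and each input-output pair is assumed to draw the input states uniformly at random. The function name and the default failure probability are hypothetical.

```python
import math


def pairs_for_full_tables(n_genes: int, k: int, delta: float = 0.05) -> int:
    """Number n of independent input-output pairs that suffices for every
    entry of every gene's rule table to be observed with probability of
    at least 1 - delta.

    A given table entry is missed with probability
    (1 - 2**-k)**n <= exp(-n / 2**k); a union bound over the
    n_genes * 2**k entries then gives
    n >= 2**k * (ln(n_genes) + k*ln(2) + ln(1/delta)).
    """
    entries = n_genes * 2 ** k
    return math.ceil(2 ** k * math.log(entries / delta))


for n_genes in (100, 1000, 10000):
    print(n_genes, pairs_for_full_tables(n_genes, k=3))
```

Under these assumptions the bound grows as 2^K log(N), in line with the scaling quoted above. It does not account for the extra data needed to identify which K genes are the inputs.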

Model                                     Data needed
Boolean, fully connected                  2^N
Boolean, connectivity K                   2^K log(N) ?
Boolean, connectivity K, lin. sep.        K log(N)
Continuous, fully connected, additive     N
Continuous, connectivity K, additive      K log(N) ?
Pairwise correlation                      log(N)

Fully connected: each gene can receive regulatory inputs from all other genes. Connectivity K: at most K regulatory inputs per gene. Additive, lin. sep. (linearly separable, for Boolean functions): regulation can be modeled as a weighted sum. Pairwise correlation: the significance level for pairwise comparisons based on correlation must decrease in inverse proportion to the number of variables.
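
To get a feel for the relative magnitudes, the scaling expressions can be evaluated for a given N and K. The sketch below sets every proportionality constant to 1 and takes logarithms base 2; both are assumptions, so the results indicate orders of magnitude only. The entries marked with a question mark above remain conjectural here as well.

```python
import math


def data_needed(n_genes: int, k: int) -> dict:
    """Order-of-magnitude data requirements per model class, with all
    proportionality constants set to 1 and log taken base 2 (assumptions)."""
    log_n = math.log2(n_genes)
    return {
        "Boolean, fully connected": 2 ** n_genes,
        "Boolean, connectivity K": 2 ** k * log_n,
        # K log(N/K), which is about K log(N) for K << N
        "Boolean, connectivity K, lin. sep.": k * math.log2(n_genes / k),
        "Continuous, fully connected, additive": n_genes,
        "Continuous, connectivity K, additive": k * log_n,
        "Pairwise correlation": log_n,
    }


for model, n_points in data_needed(n_genes=30, k=3).items():
    print(f"{model:40s} {n_points:,.0f}")
```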

From an information theory viewpoint, we retain more information by not quantizing the expression levels. Assuming a 15-20% quantitation error for RT-PCR, each measurement can give us up to 2-3 bits of information, whereas quantizing into Boolean values would give us only one bit. From cDNA microarrays and oligonucleotide chips, we can get 1-2 bits per measurement (assuming a 30-50% quantitation error).
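
One simple reading that reproduces these figures is to assume that a relative quantitation error e leaves roughly 1/e distinguishable expression levels, so that a measurement carries about log2(1/e) bits. This is an assumption made here for illustration, not necessarily the calculation behind the numbers above; the function name is hypothetical.

```python
import math


def bits_per_measurement(relative_error: float) -> float:
    """Rough information content of one measurement.

    Assumption: a relative error e leaves about 1/e distinguishable
    expression levels, i.e. log2(1/e) bits.
    """
    return math.log2(1.0 / relative_error)


for label, err in [("RT-PCR", 0.15), ("RT-PCR", 0.20),
                   ("microarray", 0.30), ("microarray", 0.50)]:
    print(f"{label:10s} error {err:.0%}: {bits_per_measurement(err):.1f} bits")
```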


© Copyright 1997 by Patrik D'haeseleer, patrik at cs dot unm dot edu

c/o Computer Science Department, University of New Mexico, Albuquerque, NM, 87131

