MINING THE GENE EXPRESSION MATRIX:
INFERRING GENE RELATIONSHIPS FROM
LARGE SCALE GENE EXPRESSION DATA
Patrik D'haeseleer,1 Xiling Wen,2 Stefanie Fuhrman,2 and Roland Somogyi2
1Computer Science DepartmentINTRODUCTION
In order to infer the logical principles underlying biological development and phenotypic change, it is necessary to determine large-scale temporal gene expression patterns. To quote Eric Lander, "The mRNA levels sensitively reflect the state of the cell, perhaps uniquely defining cell types, stages, and responses. To decipher the logic of gene regulation, we should aim to be able to monitor the expression level of all genes simultaneously ..." (Lander, 1996). One method for accomplishing this involves the use of reverse transcription polymerase chain reaction (RT-PCR) to assay the expression levels of large numbers of genes in a tissue at different time points during development, with a standard protocol. The relative amounts of mRNA produced at these time points provide a gene expression time series for each gene.
The Gene Expression Matrix presented in Wen et al. (1997), contains expression levels of 112 different genes at nine stages during rat cervical spinal cord development. Earlier analysis of this data set used Euclidean distance and information theoretic measures to cluster the genes into related expression time series (Somogyi et al., 1996; Wen et al., 1997). A significant problem with this approach is the variety of measures that can be used. Each measure produces a unique clustering of gene expression patterns, and measures must be selected in part according to whether they capture the types of interactions we would expect to find in a biological system. For example, a cluster of positively correlated gene expression patterns suggests that all these genes may share the same input(s), but ignores the possibility that other genes may be inversely regulated by those same inputs. Another important consideration is the extent to which the measures capture the total amount of information contained in the gene expression data.
This paper focuses on determining significant relationships between individual genes, based on linear correlation, rank correlation and information theory. Such relationships may be used to form hypotheses concerning potential pathways of information flow between the genes in question, or from an unobserved source towards the genes. Alternatively, the relationships between individual genes can be combined to infer a model of a gene network, in order to form hypotheses about the behavior of such a network (Kauffman, 1993; Somogyi and Sniegoski, 1996). These hypotheses can then be tested experimentally by perturbing a specific gene and observing the effect on gene expression levels at different time points following the perturbation.
LINEAR CORRELATION
Looking at the gene expression matrix (Wen et al., 1997), we see that many genes seem to be highly correlated. Indeed, we find positive (Pearson) correlation coefficients of up to 0.992 (between 5HT1b and GRa4, Figure 1 a,b), and negative correlation coefficients of up to -0.986 (between aFGF and IGH II, Figure 1 c,d). Using a two-tailed t-test with n-2=7 degrees of freedom, we can reject (at the 1% significance level) the null hypothesis that the population correlation coefficient rho is zero, if the absolute value of the sample correlation coefficient |r|>0.798. Out of the 112x111/2 = 6216 pairs of expression time series, we find 806 pairs that have a significant correlation. Whereas this is much better than having to examine all 6216 pairs of genes, it would still take an immense amount of work to check each of these significant correlations in the laboratory. Furthermore, at the 1% level we would expect about 62 such significant correlations even from independent time series. Because of the large number of gene pairs, we will get a fair number of false correlations even at this significance level.
Figure 1. High positive correlation between gene expression patterns 5HT1b and GRa4 (a, b); high negative correlation between aFGF and IGF II (c, d). Expression levels for each gene are normalized with respect to the maximum expression level for that gene.
If we want to further restrict the number of relationships to be examined, we might want to test which correlations are not only significantly different from zero, but also significantly larger than a certain value. For instance, to find those relationships in which at least 50% of the variance is explained by the correlation, i.e. rho2>0.5, we need |r|>0.96 to reject at the 1% significance level the null hypothesis that |rho|<0.7071 (we use a transformation from r values to an approximately normal distribution to calculate the significance level). The 33 pairs of time series that have such a high estimated correlation are listed in Table 1.
Table 1. 33 pairs of time series with |r|>0.96 (rho2>0.5 at the 1% level).
| Gene 1 | Gene 2 | r | Gene 1 | Gene 2 | r | Gene 1 | Gene 2 | r | ||
| MAP2 | SC6 | -0.960 | ChAT | aFGF | 0.982 | GRa4 | 5HT1b | 0.992 | ||
| NFM | mGluR1 | 0.971 | ChAT | IGF II | -0.977 | GRa5 | mGluR3 | 0.967 | ||
| NFM | NMDA2A | 0.985 | ChAT | IP3R2 | 0.966 | GRa5 | mGluR7 | 0.972 | ||
| NFH | GRg1 | 0.971 | ACHE | 5HT1c | 0.981 | GRg1 | cyclin B | -0.965 | ||
| S100 beta | GRg1 | 0.983 | ACHE | statin | 0.965 | GRg3 | 5HT2 | 0.976 | ||
| S100 beta | cyclin B | -0.964 | ODC | IGF II | 0.975 | mGluR1 | NMDA2A | 0.962 | ||
| GAD65 | preGAD67 | 0.964 | ODC | cyclin B | 0.965 | mGluR3 | NMDA2B | 0.970 | ||
| GAD67 | GRg3 | 0.974 | GRa1 | GRg1 | 0.971 | mGluR5 | NMDA1 | 0.980 | ||
| GAD67 | 5HT2 | 0.988 | GRa2 | GRa5 | 0.972 | bFGF | aFGF | 0.968 | ||
| GAT1 | SC6 | -0.963 | GRa2 | mGluR3 | 0.974 | bFGF | IGF II | -0.974 | ||
| ChAT | ODC | -0.962 | GRa2 | statin | 0.970 | aFGF | IGF II | -0.986 |
There are a number of ways to visualize these results. Earlier work on this data (Somogyi et al., 1996; Wen et al., 1997) relied on hierarchical clustering of the time series into a dendrogram using the Fitch-Margoliash (1967) algorithm as implemented in the PHYLIP package (Felsenstein, 1993). If we define a distance measure based on the residual variance between any two time series (i.e. d=1-r2; d=0 if perfectly correlated, d=1 if uncorrelated), we can construct a similar dendrogram based on linear correlation. The resulting tree is quite different from the one based on Euclidean distance (data not shown). Another, more intuitive way to visualize this data is to use multidimensional scaling, which maps time series into a two-dimensional plane, while attempting to preserve the specified distances between them. Figure 2 shows a multidimensional scaling of the 34 genes involved, where the distance measure used was the residual variance between any two time series. Multidimensional scaling of the entire set of 112 genes shows these 34 as very tight clusters amidst a more loosely connected set of genes.
Figure 2. Multidimensional scaling of 34 time series with high correlation. Distance represents the residual variance, 1-r2, with those pairs of genes for which |r|>0.96 linked by an edge.
Lastly, we might want to ask the following question: given that
there is a certain amount of noise in our measurements, at which point
can we say that time series A cannot really be distinguished
from a linear transformation of time series B? In order to
improve the accuracy of the expression data, each value v in the
Gene Expression Matrix is really an average over up to three separate
individuals. The standard error s<v> on this average is
sv /
n, where sv is the estimated standard deviation of the individuals and
n the number of individuals. If we assume the time series are
homoscedastic, i.e. the standard error is the same for each time point of
a time series A, we can estimate s<A> by the
average over the standard errors at each time point. This
s<A>
is a measure of
how much noise there is in the entire time series. If we find that the
variance s2A.B of the
residue of the regression of time series A on time series B
(i.e. the amount of variation in A that cannot be explained by
B) is smaller than s2<A>, we can
say that A cannot be distinguished from a linear
transformation of B. We may then choose to reduce the data set,
by replacing each such pair by a single time series. We find 15 pairs
of time series (listed in Table 2) where this occurs.
Table 2. 15 pairs of time series for which the variance of the residues is smaller than the noise on one of the time series.
Gene A |
Gene B |
s2A.B |
s2<A> |
GAD65 |
pre-GAD67 |
0.079 |
0.083 |
GAD67 |
5HT2 |
0.058 |
0.069 |
GRa2 |
GRa5 |
0.092 |
0.108 |
GRa2 |
mGluR3 |
0.091 |
0.108 |
GRa2 |
statin |
0.095 |
0.108 |
5HT1b |
GRa4 |
0.040 |
0.063 |
CNTFR |
IGF I |
0.045 |
0.045 |
cyclin A |
NFH |
0.058 |
0.072 |
cyclin A |
MOG |
0.069 |
0.072 |
cyclin A |
GRa1 |
0.066 |
0.072 |
cyclin A |
GRg1 |
0.061 |
0.072 |
cyclin A |
IGFR2 |
0.068 |
0.072 |
H2AZ |
MOG |
0.052 |
0.061 |
H2AZ |
mAChR4 |
0.053 |
0.061 |
H2AZ |
CRAF |
0.046 |
0.061 |
The approach above is rather simplistic because it relies on homoscedasticity of the time series, and does not give any significance levels. There is a more rigorous approach to answer the same question. We can represent time series A and its best linear fit A' = a + bB as points in a nine-dimensional space, with each dimension corresponding to a specific time point in the series, and the error in each dimension being the standard error on the value for that time point. We can then use a multivariate test to check whether A and A´ are significantly different.
NONLINEAR CORRELATION
One obvious way to capture nonlinear relationships between time series is to fit the data to a nonlinear model. If we are interested in monotonic relationships we could suggest a sigmoidal model for example. For a very general sigmoid model one would want to include at least parameters for scaling and offset along both axes. However, with only nine points to fit for each pair of genes, a four-parameter model might be excessive. Pearson correlation, whether using a linear or nonlinear model, also assumes an approximately Gaussian distribution of the points, and may not be very robust for non-Gaussian distributions. For these reasons, we have chosen a more general measure for monotonic relationships: the Spearman rank correlation rs. The Spearman correlation coefficient is equivalent to the Pearson correlation coefficient calculated over the ranked values. The 1% significance level for rs for short time series can be derived from tables (Olds, 1938; for n>10 we can use a modified t-test). For n=9, we can reject the null hypothesis that the population rank correlation rhos is zero at the 1% significance level if the sample rank correlation |rs|>0.833. Using this test, we find a total of 491 pairs of expression time series, involving 98 genes, which have a significant rs, ranging from -0.979 to 0.996. 119 of these pairs do not have a linear correlation that is significantly different from zero at the 1% level (i.e. |r|<0.798). Figure 3 shows one such time series pair which has a high rank correlation but low linear correlation.
Figure 3. High rank correlation but low linear correlation between mGluR1 and GRa2.
As in the previous section, even if the null hypothesis were true (uncorrelated time series), at the 1% level we would expect an average of 62 pairs where we wrongly reject the null hypothesis and think we see a significant correlation. Again, one might want to find those rank correlations that lie above a certain value. However, unlike linear correlation, it is hard to find a plausible explanation of the meaning of such a level. Also, since there is no adequate mathematical model for the distribution of rs for small n, the 1% significance level has to be derived from tables or calculated exhaustively (Olds, 1938).
Because of ranking, we may lose a significant amount of information
present in the data. For instance, the maximum information content
for a time series of nine data points with values ranging between 0.00
and 1.00 (with a .01 resolution, i.e. 101 possible expression levels)
is 9log2(101)
60 bits. The maximum
information content for a ranked nine-point time series without ties
in the ranking is log2(9!)
18.5 bits
(slightly larger if ties are allowed). In effect we may be ignoring up
to 70% of the information content of the data.
INFORMATION THEORY
An even more general method to detect association between gene expression time series can be derived from information theory. The mutual information M(A,B) between two information sources A and B represents how much information A contains about B (and vice versa). Formally, if H(A) and H(B) are the (Shannon) entropies of sources A and B respectively, and H(A,B) the joint entropy of the sources, then M(A,B) = H(A) + H(B) - H(A,B) (Shannon, 1948). Mutual information is defined both for continuous and discrete distributions, but the discrete form is much easier to use. To apply this technique we will therefore first discretize the time series by partitioning the expression levels into bins. Some regulatory genes exhibit a close approximation to on/off behavior, with several orders of magnitude difference between basal and induced expression levels. In such a case, the gene expression levels can be discretized without loss of information. However, if part of the regulatory activity of the gene depends on small fluctuations superimposed on the on/off behavior, then this will not be captured by a discretized model. Similarly, the expression levels of some genes mirror continuously varying environmental parameters and have a regulatory effect over their entire range. Discretization of expression levels of such genes will lead to a loss of information.
The fewer bins we use to discretize the data, the more information about the original time series we ignore. On the other hand, too fine a binning will leave us with too few points per bin to get a reasonable estimate of the frequency of each bin, especially when calculating the joint entropy. Given that our data only contains nine time steps, three bins is the most we should use.
As with rank correlation, we can get an estimate of how much
information we may ignore in doing this binning. A time series
of nine data points between 1 and 3 has a maximum entropy of only
9log2(3)
14 bits, as opposed to
approximately 60 bits for
values between 0.00 and 1.00 (see above), so we may be ignoring up to 76%
of the information content of the data. This is of course a worst-case
calculation, and although it casts doubt on the usefulness of this method
for the amount of data we currently have, the generality of this approach
may outweigh the information loss for larger data sets. This section then
should be regarded more as an example of how we could use information
theoretic measures once we have more data to work with, rather than as
an overview of significant results on the current data set.
Most of the properties covered in this section can be summarized in a single graph, a detail of which can be found in Figure 4. For the complete graph, see the web site at http://rsb.info.nih.gov/mol-physiol/IPCAT/patrik/mgraph.html.
Some time series map to the same discretized series. For two
independent time series, the probability of mapping to the same series
is 3-9
10-4. However, due to the
coarseness of
the binning, identity of binned expression time series is of course a
less accurate predictor for association than a small Euclidean distance
or large correlation. In total, from 112 unique continuous-valued time
series we get 91 discretized time series. Table 3 shows the discretized
series that correspond to more than one gene. In Figure 4, these genes
are listed within the same node.
Table 3. Gene expression time series that map into the same discretized series.
E11 |
E13 |
E15 |
E18 |
E21 |
P0 |
P7 |
P14 |
A |
genes |
0 |
0 |
2 |
2 |
2 |
2 |
2 |
2 |
2 |
MAP2, pre-GAD67, GAT1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
1 |
2 |
NFM, mGluR1, NMDA2A |
0 |
0 |
0 |
0 |
1 |
1 |
1 |
2 |
2 |
NFH, ChAT |
0 |
0 |
1 |
1 |
2 |
2 |
2 |
2 |
2 |
synaptophysin, ACHE |
0 |
1 |
2 |
2 |
2 |
2 |
2 |
2 |
2 |
neno, trkC |
0 |
0 |
0 |
1 |
1 |
1 |
1 |
2 |
2 |
S100 beta, GRg1 |
0 |
0 |
0 |
2 |
2 |
2 |
2 |
1 |
1 |
GAD67, mGluR5, NMDA1 |
2 |
2 |
2 |
2 |
2 |
2 |
2 |
1 |
1 |
ODC, CRAF, cyclin A |
0 |
0 |
1 |
2 |
2 |
2 |
2 |
2 |
2 |
GRa2, GRa5, statin |
0 |
0 |
1 |
2 |
2 |
2 |
1 |
2 |
2 |
mGluR3, NMDA2B |
0 |
0 |
2 |
0 |
0 |
0 |
0 |
0 |
0 |
nAChRe, trk |
2 |
2 |
2 |
2 |
2 |
2 |
2 |
2 |
2 |
CNTFR, PDGFa, IGF I, H2AZ, TCP, actin |
Next, we can find those pairs of time series for which there
is a one-to-one mapping by permuting the bin numbers, i.e. for which
H(A)=H(B)=M(A,B). For instance, in Table
3 we notice that there is a one-to-one mapping for S 100 beta (row 6)
and GAD67 (row 7) by permuting bins 1 and 2. In total, we find 17
pairs of unique time series where this occurs. The probability that
two independent but unique time series have a one-to-one mapping is
approximately (3!-1)×3-9
10-3.
Series for
which there exists a one-to-one mapping are equivalent with respect to
their information properties, so we can replace such time series by one
single series, leaving us with a set of 77 unique, non-equivalent time
series. These correspond to the nodes in Figure 4. Time series which
have a one-to-one mapping to each other are grouped in rectangular nodes,
with the node compartments corresponding to the different permutations of
the bin numbers. Note that any kind of permutation of the bins suffices,
which can lead to very discontinuous and not biologically plausible
one-to-one mappings for larger numbers of bins.
When there is no one-to-one mapping, we could use various symmetric measures for the information sharing between two time series, such as M(A,B)/max(H(A),H(B)), or M(A,B)/H(A,B). However, the asymmetric measure R(A,B) = M(A,B)/H(B) has shown to be more informative. This relative mutual information measure reflects what fraction of the information in time series B can be derived from A. For some pairs of series, R(A,B) = 1.0, which means that all the information about time series B is contained in time series A, i.e. A maps into B. Relative mutual information might be used as an indicator for a possible causal relationship between gene A and gene B. In Figure 4, R(A,B) = 1.0 is represented by a black arrow from node A to node B. (Note that nodes corresponding to low-entropy time series will tend to have many incoming arrows, whereas high-entropy nodes will have few black arrows coming in). If we cannot find a time series that will map exactly into a given time series B, we can still find the series A' that carries the most information about B, i.e. that maximizes R(A',B). In Figure 4, those nodes B with no incoming black arrows will have a gray arrow coming from time series A' that carry the most information about B.

Figure 4. Detail of graph summarizing information theoretic relationships between time series.
OTHER METHODS
All the above approaches look only at the relationship between
two time series. For those series which do not have a clear one-to-one
relationship with any other time series, we might want to look at
multi-gene interactions, for instance using multiple regression. However,
it turns out that we have too many possible models to match the amount of
data in our current data set. Suppose we are given a time series A,
and we want to find a function f and two other time series B and
C such that A
f(B,C).
Note that there are 111×110/2=6105 pairs of time series
(B,C) to choose from. Considering that series A
contains only nine time steps, even randomly generated data will tend to
generate a large number of supposedly good models of the form A
f(B,C).
Most of the problems we encounter with this data set stem from the fact that we have so few time steps. We can get around this problem somewhat by interpolating between the values. Of course, interpolation does not in fact give us more data to work with, rather it may allow us to get more out of the data we already have. In the information theoretic approach for instance we might be able to use a larger number of bins and still get a reasonable frequency estimation. Another interesting approach might be to plot the data for each of the three individuals that were sampled at each time point. These three individuals will generally be at slightly different stages in their development, because there is a certain degree of error on the estimate of the age of each rat embryo, as well as slight variations in the speed of the developmental process. This data is more noisy, but it might be possible to extract more information out of it using the correct statistical tools.
We can also use continuous information theoretic measures, which requires estimating the continuous joint entropy distributions. Such estimates can be achieved based on interpolation between the time points and the use of kernel estimators to represent the amount of variance on the measurements.
So far, we have not yet taken advantage of the timing information in the time series. It might be possible to infer causal relationships based on the timing with which changes percolate through the system, similar to the work done by Arkin et al. (1995, 1997) on reconstructing reaction pathways. Similarly, hysteresis effects could indicate the ordering of genes within a regulatory chain. However, the time resolution may have to be of the same order of magnitude as the response time of the individual genes. With the current time resolution however (up to one time step per day), we may only be able to infer the global state trajectory of the system and larger-scale interactions between multi-gene subsystems.
CONCLUSION
Linear correlation can be used very effectively to detect linear relationships. Because linear correlation is scale and translation invariant, it will also detect relationships not captured by Euclidean distance, such as high negative correlations. However, it only detects relationships with a large linear component and assumes Gaussian distribution of the expression levels. Many standard statistical techniques using correlation coefficients are available, making it an excellent tool. For instance, Path Analysis can be used to determine the strength of interaction between correlated genes for a given causal scheme (Wright, 1921).
Rank correlation can be used to detect non-linear relationships. It is much more robust with respect to the distribution of expression levels. Some statistical tools for linear correlation may still be applicable. However, rank correlation assumes monotonic relationships and some information may be lost because it only uses the relative ordering of the expression levels.
Information theory can be used to detect genes whose (binned) expression patterns share information. It will detect any mapping from time series A to B, but this also means that it will not distinguish between biologically plausible or implausible relationships. It is an excellent tool for discrete-valued data. However if continuous data needs to be discretized first, a lot of information may be lost.
Of course, the methods presented in this paper could also be used on expression data from a single cell type in culture, where we do not have to worry about any spatial development patterns. These methods are also not limited to time series. In fact, the time resolution available so far might be too coarse to infer cause and effect. If so, sampling states that are temporally unrelated (e.g., under different growth conditions) may tell us more about underlying circuitry because they allow us to cover a larger region of the entire state space.
These are only the first steps towards deciphering the logic of gene regulation based on large-scale gene expression data. As more of this data gets produced, and especially as more data per gene becomes available, we can expect techniques such as these to become even more powerful. We think it is important to start looking now for the tools that we will need tomorrow.
ACKNOWLEDGMENTS
This research is supported in part by grants from the National Science Foundation (grant IRI-9157644) and the Office of Naval Research (grant N00014-95-1-0364). The Santa Fe Institute is to be credited for originating this collaboration.
REFERENCES
Arkin, A., and Ross, J., 1995, Statistical construction of chemical reaction mechanism from measured time-series, J. Physical Chemistry, 99:970-979.
Arkin, A., Shen, P., and Ross, J., 1997, A test case of correlation metric construction of a reaction pathway from measurements, Science, 277:1275-1279.
Felsenstein, J., 1993, PHYLIP (Phylogeny Inference Package) version 3.5c, distributed by the author, Department of Genetics, University of Washington, Seattle.
Fitch, W.M., and Margoliash, E. , 1967, Construction of phylogenetic trees, Science 155:279-284.
Kauffman, S.A. , 1993, The Origins of Order: Self-Organization and Selection in Evolution, Oxford University Press, Oxford.
Lander, E.S., 1996, The new genomics: global views of biology, Science 274:536-539.
Olds, E.G., 1938, Distributions of sums of squares of rank differences for small numbers of individuals, Annals of Mathematical Statistics 9:133-148.
Shannon, C.E., 1948, A mathematical theory of communication, Bell Sys. Tech. Journal 27:379-423, 623-656.
Somogyi, R., Sniegoski, C.A., 1996, Modeling the complexity of genetic networks: understanding multigenic and pleiotropic regulation, Complexity 1(6):45-63.
Somogyi, R., Fuhrman, S., Askenazi, M., and Wuensche, A., 1996, The gene expression matrix: towards the extraction of genetic network architectures, Proc. of the Second World Congress of Nonlinear Analysts (WCNA96). Elsevier Science, in press.
Wen, X., Fuhrman, S., Michaels, G.S., Carr, D.B., Smith, S., Barker, J.L., and Somogyi, R., 1997, Large-scale temporal gene expression mapping of CNS development, Proc. Natl. Acad. Sci., in press.
Wright, S., 1921 Correlation and causation, Journal of Agricultural Research 20:557-585.