Additive Regulation Models



1. Advantages of continuous models over Boolean
2. Correlation analysis
2.1. Linear and rank correlation
2.2. Correlation Metric Construction
3. Reverse engineering: network inference
3.1. Linear and quasi-linear model
3.2. Differential equation models


1. Advantages of continuous models over Boolean

Examining some of the publicly available gene expression data sets, it is clear that genes spend a lot of their time at intermediate values: gene expression levels tend to be continuous rather than discrete, and discretization can lead to a large loss of information. Furthermore, important concepts in control theory that seem indispensable for gene regulation systems either cannot be implemented with discrete variables, or lead to a radically different dynamical behavior: amplification, subtraction and addition of signals; maintaining equilibrium using negative feedback; smoothly varying an internal parameter to compensate for a continuously varying environmental parameter; smoothly varying the period of a periodic phenomenon like the cell cycle, etc.

Granted, some of these problems can be alleviated by hybrid Boolean systems. In particular, Glass has proposed sets of piecewise linear differential equations, where each gene has a continuous-valued internal state, and a Boolean external state. Thomas has proposed an asynchronously updated logic with intermediate threshold values. These systems allow for easy analysis of certain properties of networks, but still do not seem appropriate for quantitative modeling of real gene networks.

2. Correlation analysis

2.1. Linear and rank correlation

Observation of correlations between variables has long been used in biology to predict causal relationships. Although correlation can never provide proof of a causal relationship, it can lead us to propose hypotheses that can be tested by other means. In terms of gene regulation, a high correlation (or anti-correlation) between A and B can be caused by (1) gene A regulating gene B, (2) gene B regulating gene A, (3) gene A and B being co-regulated by a third gene C, or (4) accident. Of course, all of these regulatory interactions can be indirect, through one or more intermediates. Nevertheless, a sufficiently high correlation between two genes (taking into account number of data points, error levels on the data, general regulation trends, etc.) warrants an investigation of the genes in question.

We have presented a preliminary statistical analysis of the rat spinal cord data set, in which relationships between individual genes were inferred based on both linear correlation and rank correlation. Rank correlation allows the detection of tight but nonlinear relationships between variables. Several gene pairs with very high linear correlation were identified (from 0.992 to -0.986), as well as a number of genes with high rank correlation but small linear correlation.

Positive linear correlation is related to Euclidean distance between expression patterns, but is not sensitive to an absolute shift in expression levels. Negative correlation, even though it may indicate a strong linkage between genes, will not show up in a Euclidean distance analysis at all. Clustering based on correlation (using the residual variance, 1-r2) may give a better appreciation of co-regulation among genes.

2.2. Correlation Metric Construction

Adam Arkin and John Ross at Stanford University have been working on a method called Correlation Metric Construction, to reconstruct reaction networks from measured time series of the component chemicals This approach is based in part on electronic circuit theory, general systems theory and multivariate statistics.

The system (a reactor vessel with chemicals implementing glycolysis) is driven using random (and independent) inputs for some of the chemical species, while the concentration of all the species is monitored over time. First, the time-lagged-correlation (cross correlation) matrix is calculated, and from this a distance matrix is constructed based on the maximum correlation between any two chemical species. This distance matrix is then fed into a simple clustering algorithm to generate a tree of connections between the species. To visualize the results, the chemical species and the tree connecting them is displayed using multidimensional scaling (MDS), mapping each species to a point in 2D space while trying to preserve the distances between each prescribed in the distance matrix. It is also possible to use the information regarding the time lag between species at which the highest correlation was found, which could be useful to infer causal relationships. More sophisticated methods from general systems theory, based on mutual information, could be used to infer dependency.

3. Reverse engineering: network inference

3.1. Linear and quasi-linear model

The correlation analysis from section
2.1. can trivially be extended to finding the subset of genes whose weighted sum correlates best with the expression levels of a specific gene of interest:

xi (t+1) = Swji xj (t) + bi

Where xi is the expression level of gene i at time t, bi is a bias term indicating whether gene i is expressed or not in the absence of regulatory inputs, and weight wji indicates the influence of gene j on the regulation of gene i. We can rewrite this as a difference equation:

delta xi = Sw'ji xj + bi

Where w'jj = wjj - 1. Given a set of expression patterns equidistant in time, we can use linear algebra to solve for the weights wji to match the data (provided we have more data points than variables).

We have used this approach to model the 65 genes that make up the intersection of the rat spinal cord and hippocampal data set. An extra input to the weighted sum was added to cover differences in gene regulation among the two tissue types, and another to model the effect of kainate on the hippocampus (half of the hippocampal data set consists of a perturbation experiment tracking the changes in gene expression after kainate injection). Combining the two data sets gives us 22 non-uniformly spaced data points. Very finely spaced equidistant data points were derived through interpolation. The interpolation imposes a smoothness constraint on the expression levels between the original data points, so it does buy us some extra information. Of course we would have preferred to have more than 22 data points to model 65 genes in the first place.

Despite the fact that this technique was only borderline feasible due to the limited amount of data, some of the results were quite interesting: The resulting weight matrix turned out to be sparse, in agreement with our intuition that genes are not regulated by all other genes (note that this model is unconstrained with respect to number of regulatory interactions); some biologically important genes were regulating many other genes, whereas many others had very few regulatory outputs (the number of regulatory inputs to genes had a much narrower distribution); some of the genes seemed to have a primarily positive or negative regulatory role. Starting with the initial gene expression levels, the model could accurately regenerate the entire trajectories (spinal cord development, hippocampus development and hippocampus injury) by iterating the difference equation for each time step. Eigenvector analysis showed three attractors of the system at the adult spinal cord, adult hippocampus and injured hippocampus expression patterns. However, most of the specific predictions of the model could not be verified because so little is known about regulatory interactions in mammal CNS.

For added biological realism, we can include a sigmoidal squashing function into the equation above:

xi (t+1) = g(Swji xj (t) + bi )

Weaver et al show in a paper at PSB'99 that this sort of quasi-linear model can be solved by linear algebra as well, by first applying the inverse of the squashing function:

g-1(xi (t+1)) = Swji xj (t) + bi

They also showed that randomly generated networks can be accurately reconstructed using this modeling technique.

Mjolsness, Reinitz and Sharp have used a similar approach to model small gene networks involved in pattern formation during the blastoderm stage of development in Drosophila. They added a simplified cellular model, with synchronized cell divisions (cell divisions are under the control of a maternal clock at this stage) along a longitudinal axis, alternated with updating the gene expression levels. Because of the more complex hybrid model, simulated annealing was used to find a least-squares fit to real gene expression data. The model was able to successfully replicate the pattern of eve stripes in Drosophila, as well as some mutant patterns on which the model was not explicitly trained.

Various people have coined different names for this sort of models: connectionist model (Mjolsness, Reinitz and Sharp), linear model (D'haeseleer), linear transciption model (Chen et al), weight matrix model (Weaver et al). Considering the core of these models contain a weighted sum to implement gene regulation, perhaps we should call them additive models.

3.2. Differential equation models

The difference equations from the previous section should remind us that differential equations have been used for years to model known biomolecular interactions on individual operators.

One implicit assumption is that the concentrations of the chemical species are continuous, i.e. that stochastic fluctuations due to single molecules can be ignored. We know that this does not hold at least for some proteins which are present in concentrations of only a couple of molecules per cell. Indeed, there are indications that stochastic fluctuations may actually be exploited by some organisms. However, differential equations are widely used to model biochemical systems. Hopefully, a continuous approach will prove to be appropriate for the majority of interesting mechanisms.

Chen et al at
PSB'99 present a number of linear differential equation models, including both mRNA and protein levels. They show theoretically how to solve for the parameters of the models using linear algebra (as in 3.1.) and Fourier transforms. They find that their model can not be solved from mRNA concentrations alone, without at least the initial protein levels. Conversely, the model can be solved given only a time series of protein concentrations.


© Copyright 1997 by Patrik D'haeseleer, patrik at cs dot unm dot edu
c/o Computer Science Department, University of New Mexico, Albuquerque, NM, 87131

(505) 277-9428 (office)
(505) 277-6927 (fax)