Additive Regulation Models
1. Advantages of continuous models over Boolean
2. Correlation analysis
2.1. Linear and rank correlation
2.2. Correlation Metric Construction
3. Reverse engineering: network inference
3.1. Linear and quasi-linear model
3.2. Differential equation models
1. Advantages of continuous models over Boolean
Examining some of the publicly available gene expression data sets, it
is clear that genes spend a lot of their time at intermediate values:
gene expression levels tend to be continuous rather than discrete, and
discretization can lead to a large loss of information. Furthermore,
important concepts in control theory that seem indispensable for gene
regulation systems either cannot be implemented with discrete variables,
or lead to a radically different dynamical behavior: amplification,
subtraction and addition of signals; maintaining equilibrium using
negative feedback; smoothly varying an internal parameter to compensate
for a continuously varying environmental parameter; smoothly varying
the period of a periodic phenomenon like the cell cycle, etc.
Granted, some of these problems can be alleviated by hybrid Boolean
systems. In particular, Glass has proposed sets of piecewise linear
differential equations, where each gene has a continuous-valued
internal state, and a Boolean external state. Thomas has proposed an
asynchronously updated logic with intermediate threshold values. These
systems allow for easy analysis of certain properties of networks,
but still do not seem appropriate for quantitative modeling of real
gene networks.
2. Correlation analysis
2.1. Linear and rank correlation
Observation of correlations between variables has long been used
in biology to predict causal relationships. Although correlation
can never provide proof of a causal relationship, it can lead us to
propose hypotheses that can be tested by other means. In terms of gene
regulation, a high correlation (or anti-correlation) between A and B
can be caused by (1) gene A regulating gene B, (2) gene B regulating
gene A, (3) gene A and B being co-regulated by a third gene C, or (4)
accident. Of course, all of these regulatory interactions can be indirect,
through one or more intermediates. Nevertheless, a sufficiently high
correlation between two genes (taking into account number of data points,
error levels on the data, general regulation trends, etc.) warrants an
investigation of the genes in question.
We have presented a preliminary statistical analysis of the rat spinal
cord data set, in which relationships between individual genes were
inferred based on both linear correlation and rank correlation. Rank
correlation allows the detection of tight but nonlinear relationships
between variables. Several gene pairs with very high linear correlation
were identified (from 0.992 to -0.986), as well as a number of genes
with high rank correlation but small linear correlation.
Positive linear correlation is related to Euclidean distance between
expression patterns, but is not sensitive to an absolute shift in
expression levels. Negative correlation, even though it may indicate a
strong linkage between genes, will not show up in a Euclidean distance
analysis at all. Clustering based on correlation (using the residual
variance, 1-r2) may give a better appreciation of co-regulation
among genes.
2.2. Correlation Metric Construction
Adam Arkin and John Ross at Stanford University have been working on a
method called Correlation Metric Construction, to reconstruct reaction
networks from measured time series of the component chemicals This
approach is based in part on electronic circuit theory, general systems
theory and multivariate statistics.
The system (a reactor vessel with chemicals implementing glycolysis) is
driven using random (and independent) inputs for some of the chemical
species, while the concentration of all the species is monitored over
time. First, the time-lagged-correlation (cross correlation) matrix is
calculated, and from this a distance matrix is constructed based on the
maximum correlation between any two chemical species. This distance
matrix is then fed into a simple clustering algorithm to generate a
tree of connections between the species. To visualize the results,
the chemical species and the tree connecting them is displayed using
multidimensional scaling (MDS), mapping each species to a point in 2D
space while trying to preserve the distances between each prescribed in
the distance matrix. It is also possible to use the information regarding
the time lag between species at which the highest correlation was found,
which could be useful to infer causal relationships. More sophisticated
methods from general systems theory, based on mutual information, could
be used to infer dependency.
3. Reverse engineering: network inference
3.1. Linear and quasi-linear model
The correlation analysis from section 2.1. can
trivially be extended to finding the subset of genes whose weighted
sum correlates best with the expression levels of a specific gene of
interest:
xi (t+1) = Swji xj (t) + bi
Where xi is the expression level of gene i at time t,
bi is a bias term indicating whether gene i is expressed
or not in the absence of regulatory inputs, and weight wji
indicates the influence of gene j on the regulation of gene i. We can
rewrite this as a difference equation:
delta xi = Sw'ji xj + bi
Where w'jj = wjj - 1. Given a set of expression
patterns equidistant in time, we can use linear algebra to solve for
the weights wji to match the data (provided we have more data
points than variables).
We have used this approach to model the 65 genes that make up the
intersection of the rat spinal cord and hippocampal data set. An
extra input to the weighted sum was added to cover differences in gene
regulation among the two tissue types, and another to model the effect of
kainate on the hippocampus (half of the hippocampal data set consists of
a perturbation experiment tracking the changes in gene expression after
kainate injection). Combining the two data sets gives us 22 non-uniformly
spaced data points. Very finely spaced equidistant data points were
derived through interpolation. The interpolation imposes a smoothness
constraint on the expression levels between the original data points, so
it does buy us some extra information. Of course we would have preferred
to have more than 22 data points to model 65 genes in the first place.
Despite the fact that this technique was only borderline feasible due to
the limited amount of data, some of the results were quite interesting:
The resulting weight matrix turned out to be sparse, in agreement with
our intuition that genes are not regulated by all other genes (note
that this model is unconstrained with respect to number of regulatory
interactions); some biologically important genes were regulating many
other genes, whereas many others had very few regulatory outputs (the
number of regulatory inputs to genes had a much narrower distribution);
some of the genes seemed to have a primarily positive or negative
regulatory role. Starting with the initial gene expression levels,
the model could accurately regenerate the entire trajectories (spinal
cord development, hippocampus development and hippocampus injury)
by iterating the difference equation for each time step. Eigenvector
analysis showed three attractors of the system at the adult spinal cord,
adult hippocampus and injured hippocampus expression patterns. However,
most of the specific predictions of the model could not be verified
because so little is known about regulatory interactions in mammal CNS.
For added biological realism, we can include a sigmoidal squashing
function into the equation above:
xi (t+1) = g(Swji xj (t) + bi )
Weaver et al show in a paper at PSB'99 that this sort of
quasi-linear model can be solved by linear algebra as well, by first
applying the inverse of the squashing function:
g-1(xi (t+1)) = Swji xj
(t) + bi
They also showed that randomly generated networks can be accurately
reconstructed using this modeling technique.
Mjolsness, Reinitz and Sharp have used a similar approach to model small
gene networks involved in pattern formation during the blastoderm stage
of development in Drosophila. They added a simplified cellular model,
with synchronized cell divisions (cell divisions are under the control of
a maternal clock at this stage) along a longitudinal axis, alternated with
updating the gene expression levels. Because of the more complex hybrid
model, simulated annealing was used to find a least-squares fit to real
gene expression data. The model was able to successfully replicate the
pattern of eve stripes in Drosophila, as well as some mutant patterns
on which the model was not explicitly trained.
Various people have coined different names for this sort of models:
connectionist model (Mjolsness, Reinitz and Sharp), linear model
(D'haeseleer), linear transciption model (Chen et al), weight matrix
model (Weaver et al). Considering the core of these models contain a
weighted sum to implement gene regulation, perhaps we should call them
additive models.
3.2. Differential equation models
The difference equations from the previous section should remind us
that differential equations have been used for years to model known
biomolecular interactions on individual operators.
One implicit assumption is that the concentrations of the chemical species
are continuous, i.e. that stochastic fluctuations due to single molecules
can be ignored. We know that this does not hold at least for some proteins
which are present in concentrations of only a couple of molecules per
cell. Indeed, there are indications that stochastic fluctuations may
actually be exploited by some organisms. However, differential equations
are widely used to model biochemical systems. Hopefully, a continuous
approach will prove to be appropriate for the majority of interesting
mechanisms.
Chen et al at PSB'99 present
a number of linear differential equation models, including both mRNA and
protein levels. They show theoretically how to solve for the parameters
of the models using linear algebra (as in 3.1.)
and Fourier transforms. They find that their model can not be solved
from mRNA concentrations alone, without at least the initial protein
levels. Conversely, the model can be solved given only a time series of
protein concentrations.
© Copyright 1997 by Patrik D'haeseleer, patrik
at cs dot unm dot edu
c/o Computer Science Department, University of New Mexico,
Albuquerque, NM, 87131
(505) 277-9428 (office)
(505) 277-6927 (fax)