UNM Computer Science

Search Technical Reports by Keyword



This uses a substring search, so "graph" matches both "graph" and "graphical".

This search only searches by keyword. If you'd like, you can also search by researcher.

Found 31 results.

Listing from newest to oldest



TR-CS-2007-06

A Simulation Framework for Modeling Combinatorial Control in Transcription Regulatory Networks
Sushmita Roy, Terran Lane, Margaret Werner-Washburne

With the increasing availability of genome-scale data, a plethora of algorithms are being developed to infer genome-scale regulatory networks. However, identifying the appropriate algorithm for a particular inference task is a critical and difficult modeling question. This is because, for most real-world data, the true network is rarely known, and different algorithms produce different networks, which may be due to assumptions inherent to the algorithm or to the objective the algorithm is designed to optimize. One approach to address this issue is to build artificial, but biologically realistic, gene networks for which the true topology is known. These networks can be sampled to generate data, which can then serve as a testbed to compare a suite of network inference algorithms.

Existing simulation systems either require highly detailed descriptions of network components, making the generation of large-scale networks (> 100 nodes) infeasible, or make simplifications that render the simulated networks very far from true models of gene regulation. We have developed a simulation system that builds a network of genes and proteins and models the rate of transcriptional activity as a non-linear function of a set of transcription factor complexes. Our simulator uses a system of differential equations and interfaces with Copasi, a differential equation solver, to produce steady-state and time-series expression datasets. Overall, our simulator has the following advantages: (i) it can produce expression datasets in an efficient and high-throughput manner, (ii) it is biologically more realistic, since it models transcription rate as a function of transcription factor levels rather than gene expression levels, (iii) it incorporates the combinatorial control among different transcription factors, and (iv) it incorporates gene and protein expression, thus allowing the comparison of algorithms that use different combinations of these data sources.

PDF



TR-CS-2005-26

Machine learning approaches to siRNA efficacy prediction
Sahar Abubucker

RNA interference (RNAi) is widely used to study gene expression and gene regulation via selective knockdown of gene expression, which is important for functional genomic studies. RNAi refers to the biological process by which short interfering RNA (siRNA), after incorporation into the RNA-induced silencing complex (RISC), degrades complementary messenger RNA (mRNA) sequences. This knockdown prevents the mRNA from being translated into the proteins responsible for gene expression. Recent studies indicate that not all siRNAs produce the same knockdown effects. Due to the high cost of synthesizing siRNAs and the extensive effort required to test them, rational siRNA design---a priori prediction of functionality for specific siRNAs---is a critical task for practical RNAi studies. Therefore, a key component of RNAi applications is the selection of effective siRNA sequences---ones that degrade at least 80% of the targeted mRNA. The goal of siRNA efficacy prediction is to aid in designing siRNA sequences that are highly efficient in degrading target mRNA sequences. Most current algorithms use positional features, energy characteristics, thermodynamic properties, and secondary structure predictions to predict the knockdown efficacy of siRNAs. In this work, frequently occurring patterns in siRNAs are used to make predictions about their efficacy. Time after transfection is shown to be an important factor in siRNA efficacy prediction---a feature that has been ignored in previous efficacy studies. The relevance of these extracted patterns to previous results and their biological significance are analyzed. Random feature sets are generated, their ability to predict efficacy is studied, and the results are discussed. Our algorithm does not require any specialized hardware and consistently performs better than other publicly available efficacy prediction algorithms.

PDF



TR-CS-2005-22

A Framework for Orthology Assignment from Gene Rearrangement Data
Krister M. Swenson and Nicholas D. Pattengale and Bernard M.E. Moret

Gene rearrangements have successfully been used in phylogenetic reconstruction and comparative genomics, but usually under the assumption that all genomes have the same gene content and that no gene is duplicated. While these assumptions allow one to work with organellar genomes, they are too restrictive when comparing nuclear genomes. The main challenge is how to deal with gene families, specifically, how to identify orthologs. While searching for orthologies is a common task in computational biology, it is usually done using sequence data. We approach that problem using gene rearrangement data, provide an optimization framework in which to phrase the problem, and present some preliminary theoretical results.

gzipped postscript | PDF



TR-CS-2005-08

Inferring Ancestral Chloroplast Genomes with Duplications
Liying Cui, Jijun Tang, Bernard M. E. Moret, and Claude dePamphilis

Motivation: Genome evolution is shaped not only by nucleotide substitutions but also by structural changes including gene and genome duplications, insertions and deletions, and gene-order rearrangements. Reconstruction of phylogeny based on gene-order changes has been limited to cases where equal gene content or few deletions can be assumed. Since conserved duplicated regions are present in many genomes, the inference of duplication is needed in ancestral genome reconstruction.

Results: We apply GRAPPA-IR, a modified GRAPPA algorithm, to reconstruct ancestral chloroplast genomes containing duplicated genes. A test of GRAPPA-IR using six divergent chloroplast genomes from land plants and green algae recovers the phylogeny congruent with prior studies, while analyses that do not consider the duplications fail to obtain the accepted topology. The ancestral genome structure suggests that genome rearrangement in chloroplasts is probably limited by inverted repeats with a conserved core region. In addition, the boundaries of inverted repeats are hot spots for gene duplications or deletions.

PDF



TR-CS-2005-02

Quartet methods for phylogeny reconstruction from gene orders
Tao Liu, Jijun Tang, and Bernard M.E. Moret

Phylogenetic reconstruction from gene-rearrangement data has attracted increasing attention from biologists and computer scientists. Methods used in reconstruction include distance-based methods, parsimony methods using sequence-based encodings, and direct optimization. The latter, pioneered by Sankoff and extended by us with the software suite GRAPPA, is the most accurate approach; however, its exhaustive approach means that it can be applied only to small datasets of fewer than 15 taxa. While we have successfully scaled it up to 1,000 genomes by integrating it with a disk-covering method (DCM-GRAPPA), the recursive decomposition may need many levels of recursion to handle datasets with 1,000 or more genomes. We thus investigated quartet-based approaches, which directly decompose the datasets into subsets of four taxa each; such approaches have been well studied for sequence data, but not for gene-rearrangement data. We give an optimization algorithm for the NP-hard problem of computing optimal trees for each quartet, present a variation of the dyadic method (using heuristics to choose suitable short quartets), and use both in simulation studies. We find that our quartet-based method can handle more genomes than the base version of GRAPPA, thus enabling us to reduce the number of levels of recursion in DCM-GRAPPA, but is more sensitive to the rate of evolution, with error rates rapidly increasing when saturation is approached.

PDF



TR-CS-2004-32

Efficient RNAi-Based Gene Family Knockdown via Set Cover Optimization
Wenzhong Zhao, M. Leigh Fanning and Terran Lane

RNA interference (RNAi) is a recently discovered genetic immune system with widespread therapeutic and genomic applications. In this paper, we address the problem of selecting an efficient set of initiator molecules (siRNAs) for RNAi-based gene family knockdown experiments. Our goal is to select a minimal set of siRNAs that (a) cover a targeted gene family or a specified subset of it, (b) do not cover any untargeted genes, and (c) are individually highly effective at inducing knockdown. We show that the problem of minimizing the number of siRNAs required to knock down a family of genes is NP-Hard via a reduction to the set cover problem. We also give a formal statement of a generalization of the basic problem that incorporates additional biological constraints and optimality criteria. We modify the classical branch-and-bound algorithm to include some of these biological criteria. We find that, in many typical cases, these constraints reduce the search space enough that we are able to compute exact minimal siRNA covers within reasonable time. For larger cases, we propose a probabilistic greedy algorithm for finding minimal siRNA covers efficiently. Our computational results on real biological data show that the probabilistic greedy algorithm produces siRNA covers as good as the branch-and-bound algorithm in most cases. Both algorithms return minimal siRNA covers with high predicted probability that the selected siRNAs will be effective at inducing knockdown. We also examine the role of "off-target" interactions: the constraint of avoiding covering untargeted genes can, in some cases, substantially increase the complexity of the resulting solution. Overall, however, we find that in many common cases, our approach significantly reduces the number of siRNAs required in gene family knockdown experiments, as compared to knocking down genes independently.
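As a rough illustration of the covering formulation, here is a deterministic greedy sketch in Python. The report's branch-and-bound and probabilistic greedy algorithms are more involved; the siRNA names, gene names, and efficacy values below are invented:

```python
def greedy_sirna_cover(sirnas, targets, off_targets):
    """Pick siRNAs until all target genes are covered, never using an
    siRNA that also hits an off-target gene. sirnas maps a name to
    (set of covered genes, predicted efficacy)."""
    uncovered = set(targets)
    chosen = []
    while uncovered:
        candidates = [s for s in sirnas
                      if s not in chosen and not (sirnas[s][0] & off_targets)]
        if not candidates:
            break
        # Greedy score: newly covered genes, weighted by predicted efficacy.
        best = max(candidates,
                   key=lambda s: len(sirnas[s][0] & uncovered) * sirnas[s][1])
        if not (sirnas[best][0] & uncovered):
            break  # no remaining candidate covers anything new
        chosen.append(best)
        uncovered -= sirnas[best][0]
    return chosen, uncovered

sirnas = {
    "si1": ({"geneA", "geneB"}, 0.9),
    "si2": ({"geneB", "geneC"}, 0.8),
    "si3": ({"geneC", "geneX"}, 0.95),  # excluded: hits off-target geneX
}
cover, missed = greedy_sirna_cover(sirnas, {"geneA", "geneB", "geneC"}, {"geneX"})
```

Here the off-target constraint forces the cover to use two siRNAs rather than the more efficacious si3, mirroring the trade-off discussed in the abstract.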

PDF



TR-CS-2004-30

Linear Programming for Phylogenetic Reconstruction Based on Gene Rearrangements
Jijun Tang and Bernard M.E. Moret

Phylogenetic reconstruction from gene rearrangements has attracted increasing attention from biologists and computer scientists over the last few years. Methods used in reconstruction include distance-based methods, parsimony methods using sequence-based encodings, and direct optimization. The latter, pioneered by Sankoff and extended by us with the software suite GRAPPA, is the most accurate approach, but has been limited to small genomes because the running time of its scoring algorithm grows exponentially with the number of genes in the genome. We report here on a new method to compute a tight lower bound on the score of a given tree, using a set of linear constraints generated through selective applications of the triangle inequality. Our method generates an integer linear program with a carefully limited number of constraints, rapidly solves its relaxed version, and uses the result to provide a tight lower bound. Since this bound is very close to the optimal tree score, it can be used directly as a selection criterion, thereby enabling us to bypass entirely the expensive scoring procedure. We have implemented this method within our GRAPPA software and run several series of experiments on both biological and simulated datasets to assess its accuracy. Our results show that using the bound as a selection criterion yields excellent trees, with error rates below 5% up to very large evolutionary distances, consistently beating the baseline Neighbor-Joining. Our new method enables us to extend the range of applicability of the direct optimization method to chromosomes of size comparable to those of bacteria, as well as to datasets with complex combinations of evolutionary events.

gzipped postscript



TR-CS-2004-26

Distance correction for large pairwise genomic distances
J. Tang and B.M.E. Moret

Phylogenetic reconstruction from gene-order data has attracted attention from both biologists and computer scientists over the last few years. Our software suite, GRAPPA and its extension DCM-GRAPPA, has been applied with excellent results to datasets (real and simulated) of up to a thousand organellar genomes and should scale up to even larger datasets. However, scaling it up from the small organellar genomes (with a hundred or so genes) to larger genomes such as bacterial genomes (with several thousand genes) or eukaryotic genomes (with tens of thousands of genes) remains a problem, in terms of both running time and accuracy. The cost of the basic optimization loop grows exponentially with the genomic distance -- and distances are much larger among the widely varying bacterial and eukaryotic genomes than among the highly stable organellar ones. Even with the smaller genomes, the presence of large pairwise evolutionary distances creates a two-sided problem: edit distances will seriously underestimate the evolutionary distance and distance corrections will introduce the risk of serious overestimates. In a computation based on the use of lower bounds to eliminate much of the search space, underestimates lead to loose bounds and poor pruning and thus to excessive running times, while overestimates may lead to the elimination of the best solutions and thus to excessive errors. This problem due to large pairwise distances has been studied in sequence-based phylogenetic reconstruction; the proposed solution has been some form of limit on the largest possible values produced in the process of correction. One particular approach, the so-called "fix-factor," has proved very successful. In this paper, we present an adaptation of the fix-factor technique to gene-order data. Experiments on both biological and simulated datasets suggest that, using our fix-factor adaptation, the pruning rate can be dramatically increased without sacrificing accuracy.
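The fix-factor idea, capping a distance correction so it can never exceed a fixed multiple of the observed edit distance, can be sketched as follows. The correction formula here is a generic saturation-style stand-in, not the actual estimator adapted in the report:

```python
import math

def corrected_distance(edit_dist, n, fix_factor=3.0):
    """Illustrative fix-factor capping for a genome of n genes: apply a
    saturation-style correction to the observed edit distance, but never
    let the corrected value exceed fix_factor times the observed distance.
    The correction below is a generic stand-in, NOT the report's estimator."""
    x = edit_dist / n  # normalized distance
    if x >= 1.0:
        return fix_factor * edit_dist
    corrected = -n * math.log(1.0 - x)  # blows up as x approaches 1
    return min(corrected, fix_factor * edit_dist)
```

Small distances are left nearly unchanged, while near-saturated distances are clamped, which is what keeps the corrected values usable as pruning bounds.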

gzipped postscript | PDF



TR-CS-2004-25

Advances in Phylogeny Reconstruction from Gene Order and Content Data
Bernard M.E. Moret and Tandy Warnow

Genomes can be viewed in terms of their gene content and the order in which the genes appear along each chromosome. Evolutionary events that affect the gene order or content are "rare genomic events" (rarer than events that affect the composition of the nucleotide sequences) and have been advocated by systematists for inferring deep evolutionary histories. This chapter surveys recent developments in the reconstruction of phylogenies from gene order and content, focusing on their performance under various stochastic models of evolution. Because such methods are currently quite restricted in the type of data they can analyze, we also present current research aimed at handling the full range of whole-genome data.

gzipped postscript | PDF



TR-CS-2004-15

Approximating the True Evolutionary Distance Between Two Genomes
Krister M. Swenson, Mark Marron, Joel V. Earnest-DeYoung, and Bernard M.E. Moret

As more and more genomes are sequenced, evolutionary biologists are becoming increasingly interested in evolution at the level of whole genomes, in scenarios in which the genome evolves through insertions, deletions, and movements of genes along its chromosomes. In the mathematical model pioneered by Sankoff and others, a unichromosomal genome is represented by a signed permutation of a multiset of genes; Hannenhalli and Pevzner showed that the edit distance between two signed permutations of the same set can be computed in polynomial time when all operations are inversions. El-Mabrouk extended that result to allow deletions and a limited form of insertions (which forbids duplications); in turn we extended it to compute a nearly optimal edit sequence between an arbitrary genome and the identity permutation. In this paper we extend and improve our previous work to handle duplications as well as insertions and present an algorithm for computing true evolutionary distances (as opposed to the less useful edit distances) in a framework involving insertions, duplications, deletions, and inversions. We present experimental results showing that our algorithm produces very good estimates of the true evolutionary distance up to a (high) threshold of saturation; indeed, the distances thus produced are good enough to enable a simple neighbor-joining procedure to reconstruct our test trees with high accuracy.

gzipped postscript



TR-CS-2004-14

Reversing Gene Erosion--Reconstructing Ancestral Bacterial Genomes from Gene-Content and Order Data
Joel V. Earnest-DeYoung, Emmanuelle Lerat, Bernard M.E. Moret

In the last few years, it has become routine to use gene-order data to reconstruct phylogenies, both in terms of edge distances (parsimonious sequences of operations that transform one endpoint of the edge into the other) and in terms of genomes at internal nodes, on small, duplication-free genomes. Current gene-order methods break down, though, when the genomes contain more than a few hundred genes, possess high copy numbers of duplicated genes, or produce tree edges of over one hundred operations. We have constructed a series of heuristics that allow us to overcome these obstacles and reconstruct edge distances and genomes at internal nodes for groups of larger, more complex genomes. We present results from the analysis of a group of thirteen modern gamma-proteobacteria, as well as from simulated datasets.

PDF



TR-CS-2004-10

Reconstructing Phylogenies from Gene-Content and Gene-Order Data
Bernard M.E. Moret, Jijun Tang, and Tandy Warnow

Gene-order data has been used successfully to reconstruct organellar phylogenies; it offers low error rates, the potential to reach farther back in time than through DNA sequences (because genome-level events are much rarer than DNA point mutations), and relative immunity from the so-called gene-tree vs. species-tree problem (caused by the fact that the evolutionary history of specific genes is not isomorphic to that of the organism as a whole). It has also provided deep mathematical and algorithmic results dealing with permutations and shortest sequences of operations on these permutations. Recent developments include generalizations to handle insertions, duplications, and deletions, scaling to large numbers of organisms, and, to a lesser extent, to larger genomes; and the first Bayesian approach to the reconstruction problem. We survey the state-of-the-art in using such data for phylogenetic reconstruction, focusing on recent work by our group that has enabled us to handle arbitrary insertions, duplications, and deletions of genes, as well as inversions of gene subsequences. We conclude with a list of research questions (mathematical, algorithmic, and biological) that will need to be addressed in order to realize the full potential of this type of data.

gzipped postscript



TR-CS-2004-05

Phylogenetic Reconstruction from Arbitrary Gene-Order Data
Jijun Tang, Bernard M.E. Moret, Liying Cui, and Claude W. dePamphilis

Phylogenetic reconstruction from gene-order data has attracted attention from both biologists and computer scientists over the last few years. So far, our software suite GRAPPA is the most accurate approach, but it requires that all genomes have identical gene content, with each gene appearing exactly once in each genome. Some progress has been made in handling genomes with unequal gene content, both in terms of computing pairwise genomic distances and in terms of reconstruction. In this paper, we present a new approach for computing the median of three arbitrary genomes and apply it to the reconstruction of phylogenies from arbitrary gene-order data. We implemented these methods within GRAPPA and tested them on simulated datasets under various conditions as well as on a real dataset of chloroplast genomes; we report the results of our simulations and our analysis of the real dataset and compare them to reconstructions made by using neighbor-joining and using the original GRAPPA on the same genomes with equal gene contents. Our new approach is remarkably accurate both in simulations and on the real dataset, in contrast to the distance-based approaches and to reconstructions using the original GRAPPA applied to equalized gene contents.

gzipped postscript



TR-CS-2003-35

Genomic Distances under Deletions and Insertions II
Mark Marron and Krister M. Swenson and Bernard M.E. Moret

As more and more genomes are sequenced, evolutionary biologists are becoming increasingly interested in evolution at the level of whole genomes, in scenarios in which the genome evolves through insertions, deletions, and movements of genes along its chromosomes. In the mathematical model pioneered by Sankoff and others, a unichromosomal genome is represented by a signed permutation of a multi-set of genes; Hannenhalli and Pevzner showed that the edit distance between two signed permutations of the same set can be computed in polynomial time when all operations are inversions. El-Mabrouk extended that result to allow deletions (or conversely, a limited form of insertions which forbids duplications). In this paper we extend El-Mabrouk's work to handle duplications as well as insertions and present an alternate framework for computing (near) minimal edit sequences involving insertions, deletions, and inversions. We derive an error bound for our polynomial-time distance computation under various assumptions and present preliminary experimental results that suggest that performance in practice may be excellent, within a few percent of the actual distance.

gzipped postscript



TR-CS-2003-31

An Exact, Linear-Time Algorithm for Computing Genomic Distances Under Inversions and Deletions
Tao Liu, Bernard M.E. Moret, and David A. Bader

As more genomes are sequenced, it is becoming possible to study evolution at the level of whole genomes, in scenarios in which the genome evolves through insertions, deletions, and movements of genes along its chromosomes. In the mathematical model pioneered by Sankoff and others, a unichromosomal genome is represented by a signed permutation of a multiset of genes. Hannenhalli and Pevzner gave the first polynomial-time algorithm to compute the shortest sequence of inversions that can transform one signed permutation into another, under the assumption that the genomes have exactly one copy of each gene. Later, Bader et al. showed that the length of such a sequence can be computed in linear time. El-Mabrouk extended Hannenhalli and Pevzner's result to allow deletions (or, equivalently, non-duplicating insertions). In this paper we show that the length of a shortest sequence of inversions and deletions can be computed in linear time. In the process, we also refine the characterization of hurdles and safe mergings. Our algorithm is easy to implement and runs with very low overhead: we include the results of computational experiments over a large range of permutation pairs produced through simulated evolution.

gzipped postscript



TR-CS-2003-14

Adaptation and Ruggedness in an Evolvability Landscape
Terry Van Belle and David H. Ackley

Evolutionary algorithms depend on both a within-generation selection process that determines the fitness of individual genomes, and a cross-generation evolvability process that determines how offspring can improve on their ancestors. We introduce a minimal model in which not only selection but also evolvability are genetically controlled, and study its behavior in an environment where peak fitness changes location over evolutionary time. We demonstrate an `evolution of evolvability' model that can improve performance on this problem, compared to fixed evolvability mechanisms as commonly employed in genetic algorithms. We introduce and discuss evolvability landscapes that relate evolvability parameters to long-term population success, and find that the evolvability landscape of our model problem turns out to have some rugged features, in which badly performing points can be found very close to high scoring points. We suggest that a research focus on evolvability may help build better evolutionary algorithms as well as understand their behaviors.
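A minimal sketch of the "evolution of evolvability" idea in Python: each genome carries its own mutation step size, which is inherited and mutated along with the trait it controls, while the fitness peak drifts over generations. The fitness function, drift schedule, and all parameters below are invented for illustration and are not the report's model:

```python
import math
import random

random.seed(1)

def fitness(x, peak):
    return -abs(x - peak)  # higher is better; 0 is a perfect match

def evolve(generations=200, pop_size=30):
    # Each genome is (value, step): 'step' is the evolvability parameter,
    # inherited and mutated together with the value it controls.
    pop = [(random.uniform(-1, 1), 0.5) for _ in range(pop_size)]
    peak = 0.0
    for g in range(generations):
        peak = 5.0 * math.sin(g / 20.0)  # the environment drifts
        pop.sort(key=lambda ind: fitness(ind[0], peak), reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        for x, step in survivors:
            # The evolvability parameter itself mutates (log-normally here).
            new_step = step * math.exp(random.gauss(0.0, 0.2))
            children.append((x + random.gauss(0.0, new_step), new_step))
        pop = survivors + children
    best = max(pop, key=lambda ind: fitness(ind[0], peak))
    return fitness(best[0], peak)

score = evolve()
```

Because the step sizes are under selection, lineages whose mutation scale tracks the drifting peak tend to persist; fixing `step` for all individuals recovers the conventional genetic-algorithm baseline the abstract compares against.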

gzipped postscript | PDF



TR-CS-2003-09

Phylogenetic Reconstruction from Gene Rearrangement Data with Unequal Gene Content
Jijun Tang and Bernard M.E. Moret

Phylogenetic reconstruction from gene rearrangement data has attracted increased attention over the last five years. Existing methods are limited computationally and by the assumption (highly unrealistic in practice) that all genomes have the same gene content. We have recently shown that we can scale our reconstruction tool, GRAPPA, to instances with up to a thousand genomes with no loss of accuracy and at minimal computational cost. Computing genomic distances between two genomes with unequal gene contents has seen much progress recently, but that progress has not yet been reflected in phylogenetic reconstruction methods. In this paper, we present extensions to our GRAPPA approach that can handle limited numbers of duplications (one of the main requirements for analyzing genomic data from organelles) and a few deletions. Although GRAPPA is based on exhaustive search, we show that, in practice, our bounding functions suffice to prune away almost all of the search space (our pruning rates never fall below 99.995%), resulting in high accuracy and fast running times. The range of values within which we have tested our approach encompasses mitochondria and chloroplast organellar genomes, whose phylogenetic analysis is providing new insights on evolution.

gzipped postscript



TR-CS-2003-08

Genomic Distances under Deletions and Insertions
Mark Marron, Krister M. Swenson, and Bernard M.E. Moret

As more and more genomes are sequenced, evolutionary biologists are becoming increasingly interested in evolution at the level of whole genomes, in scenarios in which the genome evolves through insertions, deletions, and movements of genes along its chromosomes. In the mathematical model pioneered by Sankoff and others, a unichromosomal genome is represented by a signed permutation of a multiset of genes; Hannenhalli and Pevzner showed that the edit distance between two signed permutations of the same set can be computed in polynomial time when all operations are inversions. El-Mabrouk extended that result to allow deletions and a limited form of insertions (which forbids duplications). In this paper we extend El-Mabrouk's work to handle duplications as well as insertions and present an alternate framework for computing (near) minimal edit sequences involving insertions, deletions, and inversions. We derive an error bound for our polynomial-time distance computation under various assumptions and present preliminary experimental results that suggest that performance in practice may be excellent, within a few percent of the actual distance.

gzipped postscript



TR-CS-2003-04

Scaling Up Accurate Phylogenetic Reconstruction from Gene-Order Data
J. Tang and B.M.E. Moret

Phylogenetic reconstruction from gene-order data has attracted increasing attention from both biologists and computer scientists over the last few years. Methods used in reconstruction include distance-based methods (such as neighbor-joining), parsimony methods using sequence-based encodings, Bayesian approaches, and direct optimization. The latter, pioneered by Sankoff and extended by us with the software suite GRAPPA, is the most accurate approach, but cannot handle more than about 15 genomes. We report here on our successful efforts to scale up direct optimization through a two-step approach: the first step decomposes the dataset into smaller pieces and runs the direct optimization (GRAPPA) on the smaller pieces, while the second step builds a tree from the results obtained on the smaller pieces. We used the sophisticated disk-covering method (DCM) pioneered by Warnow and her group, suitably modified to take into account the computational limitations of GRAPPA. We find that DCM-GRAPPA scales gracefully to at least 1,000 genomes and retains surprisingly high accuracy throughout the range: in our experiments, the topological error rate rarely exceeded a few percent. Thus, reconstruction based on gene-order data can now be accomplished with high accuracy on datasets of significant size.

gzipped postscript



TR-CS-2002-31

Scaling Up Accurate Phylogenetic Reconstruction from Gene-Order Data
J. Tang, T. Liu, and B.M.E. Moret

Phylogenetic reconstruction from gene-order data has attracted increasing attention from both biologists and computer scientists over the last few years. Methods used in reconstruction include distance-based methods (such as neighbor-joining), parsimony methods using sequence-based encodings, Bayesian approaches, and direct optimization. The latter, pioneered by Sankoff and extended by us with the software suite GRAPPA, is the most accurate approach, but cannot handle more than about 15 genomes. We report here on our successful efforts to scale up the direct approach through a two-step approach: the first step decomposes the dataset into smaller pieces and runs the direct optimization (GRAPPA) on the smaller pieces, while the second step builds a tree from the results obtained on the smaller pieces. We used both a quartet-based approach (the quartet-puzzling method) and the more sophisticated disk-covering method (DCM) pioneered by Warnow. We find that DCM-GRAPPA scales gracefully to 100 genomes and retains surprisingly high accuracy throughout the range: in our experiments, the topological error rate rarely exceeded a few percent. The quartet-based methods perform better than might be expected from their known accuracy on sequence data, but trail far behind DCM-GRAPPA in terms of both speed and accuracy. Overall, our results demonstrate that reconstruction based on gene-order data can now be accomplished with high accuracy on datasets of significant size.

gzipped postscript



TR-CS-2002-27

Constructing Regulatory Networks from Gene Expression Data Using Association Rule Mining
Jiaye Zhou

Gene expression technology, such as high-density DNA microarrays, allows us to monitor gene expression patterns at the genomic level. While this technology promises progress toward the understanding of transcriptional response and regulation, it also introduces the challenge of extracting relevant information from the large datasets generated. As a result, data mining of gene expression data has become an important area of research for biologists. Two classical data mining methods, classification and clustering, have been widely used to analyze gene expression data. These methods are valuable exploratory tools, but they are limited to placing genes into groups with others that share certain characteristics. While it is important to determine which genes are related, we also need to understand how genes relate to and regulate one another, and these two methods do not provide such insight. This information, however, can be extracted from gene expression data produced by well-designed experiments. In this thesis, I consider the use of association rule mining for discovering regulatory relationships among genes from gene expression data. Association rule mining, a relatively new technique in knowledge discovery, identifies links between sets of correlated objects in large datasets. I describe the first application of association rule mining to gene expression data analysis: I develop a strategy for transforming gene expression data to make it suitable for association rule mining, and show how the FP-Growth algorithm can be adapted to apply to the transformed data. I implement, test, and evaluate the association rule mining method using publicly available gene expression data.
I validate the analysis method by showing that its results are consistent with published results generated by independent analysis methods. Furthermore, the method generates previously unknown biological information, making it a valuable gene expression data analysis tool.
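The transformation step described above can be sketched as follows. This is only an illustration with hypothetical genes, thresholds, and support values, and it uses naive itemset enumeration rather than the FP-Growth algorithm the thesis adapts: each experiment is discretized into a transaction of up/down items, after which frequent itemsets can be counted.

```python
from itertools import combinations

# Hypothetical toy expression matrix: rows = experiments, columns = genes.
expression = {
    "exp1": {"geneA": 2.1, "geneB": 1.8, "geneC": -1.5},
    "exp2": {"geneA": 1.9, "geneB": 2.2, "geneC": 0.1},
    "exp3": {"geneA": -2.0, "geneB": 0.2, "geneC": -1.7},
}

def to_transactions(expr, up=1.0, down=-1.0):
    """Discretize each experiment into a transaction of items like 'geneA_up'."""
    transactions = []
    for sample in expr.values():
        items = set()
        for gene, level in sample.items():
            if level >= up:
                items.add(gene + "_up")
            elif level <= down:
                items.add(gene + "_down")
        transactions.append(items)
    return transactions

def frequent_itemsets(transactions, min_support=2, max_size=2):
    """Naive frequent-itemset count (FP-Growth avoids this enumeration)."""
    counts = {}
    for t in transactions:
        for k in range(1, max_size + 1):
            for combo in combinations(sorted(t), k):
                counts[combo] = counts.get(combo, 0) + 1
    return {s: c for s, c in counts.items() if c >= min_support}

freq = frequent_itemsets(to_transactions(expression))
```

A frequent itemset such as ("geneA_up", "geneB_up") would then be turned into candidate rules (e.g. geneA up implies geneB up) and screened by confidence, with the biological interpretation left to domain experts.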

PDF



TR-CS-2002-04

Inversion Medians Generally Outperform Breakpoint Medians in Phylogeny Reconstruction from Gene-Order Data
Bernard M.E. Moret, Adam C. Siepel, Jijun Tang, and Tao Liu

Phylogeny reconstruction from gene-order data has attracted much attention over the last few years. The two software packages used for that purpose, BPAnalysis and GRAPPA, both use so-called breakpoint medians in their computations. Breakpoint distance, however, is a heuristic measure of evolution, not directly based on any model of genome rearrangement. We conjectured that phylogeny reconstructions could be improved by using inversion medians, which minimize evolutionary distance under an inversions-only model of genome rearrangement. Recent algorithmic developments have made it possible to compute inversion medians for problems of realistic size.

Our experimental studies unequivocally show that inversion medians are strongly preferable to breakpoint medians in the context of phylogenetic reconstruction from gene-order data. Improvements are most pronounced in the reconstruction of ancestral genomes, but are also evident in the topological accuracy of the reconstruction as well as, surprisingly, in the overall running time. Improvements are strongest for small average distances along tree edges and for evolutionary scenarios with a preponderance of inversion events, but occur in all cases, including evolutionary scenarios with high proportions of transpositions.

gzipped postscript



TR-CS-2001-29

An Algorithm to Find All Sorting Reversals
Adam C. Siepel

The problem of estimating evolutionary distance from differences in gene order has been distilled to the problem of finding the reversal distance between two signed permutations. During the last decade, much progress was made both in computing reversal distance and in finding a minimum sequence of sorting reversals. For most problem instances, however, many minimum sequences of sorting reversals exist; furthermore, obtaining the complete set can be useful in exploring the space of genome rearrangements, in pursuit of solutions to higher-level problems. The problem of finding all minimum sequences of sorting reversals reduces easily to the problem of finding all sorting reversals of one permutation with respect to another. We derive an efficient algorithm to solve this latter problem, and present experimental results indicating that our algorithm performs extremely well in comparison to the best known alternative.
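The notions involved can be made concrete. The sketch below is pure brute force, feasible only for very small permutations and unrelated to the efficient algorithm derived in this report: it applies signed reversals, computes exact reversal distance to the identity by breadth-first search, and enumerates all sorting reversals as those that reduce the distance by one.

```python
from collections import deque

def reverse(p, i, j):
    """Apply a signed reversal over positions i..j (inclusive) of tuple p."""
    seg = tuple(-x for x in reversed(p[i:j + 1]))
    return p[:i] + seg + p[j + 1:]

def reversal_distance(p):
    """Exact reversal distance to the identity via BFS (tiny n only)."""
    n = len(p)
    target = tuple(range(1, n + 1))
    seen = {p: 0}
    queue = deque([p])
    while queue:
        cur = queue.popleft()
        if cur == target:
            return seen[cur]
        for i in range(n):
            for j in range(i, n):
                nxt = reverse(cur, i, j)
                if nxt not in seen:
                    seen[nxt] = seen[cur] + 1
                    queue.append(nxt)

def all_sorting_reversals(p):
    """All reversals (i, j) that move p one step closer to the identity."""
    d = reversal_distance(p)
    n = len(p)
    return [(i, j) for i in range(n) for j in range(i, n)
            if reversal_distance(reverse(p, i, j)) == d - 1]
```

Even this toy version shows why the complete set matters: (2, 1) has reversal distance 3, and all three of its possible reversals are sorting reversals, so a minimum sorting sequence is far from unique.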

gzipped postscript



TR-CS-2001-19

Inferring phylogenetic relationships using whole genome data: A case study of photosynthetic organelles and chloroplast genomes
Linda A. Raubeson, Bernard M.E. Moret, Jijun Tang, Stacia K. Wyman, and Tandy Warnow

We present a case study in the use of a challenging dataset of gene orders to determine deep relationships among photosynthetic organelles. We have conducted an extensive analysis using several techniques that our group has developed for phylogenetic reconstruction from such data. We find that our methods are capable of extracting significant information from an undersampled dataset with relatively few genes---evidence that gene-order data contains a strong phylogenetic signal and that our methods can extract this signal. We also present new and independent evidence to resolve some of the controversies remaining in organellar phylogeny.

gzipped postscript



TR-CS-2001-18

Fast Phylogenetic Methods for the Analysis of Genome Rearrangement Data: An Empirical Study
Li-San Wang, Robert K. Jansen, Bernard M.E. Moret, Linda A. Raubeson, and Tandy Warnow

Evolution operates on whole genomes through mutations that change the order and strandedness of genes within the genomes. Thus, analyses of gene-order data present new opportunities for discoveries about deep evolutionary events, provided that sufficiently accurate methods can be developed to reconstruct evolutionary trees. In this paper we present a new method of character coding for parsimony-based analysis of genomic rearrangements, called MPBE-2, and a new parsimony-based method which we call MPME (based on an encoding of Bryant), both variants of the MPBE method. We then conduct computer simulations to compare this class of parsimony approaches (MPBE, MPBE-2, and MPME) to distance-based methods (NJ under various distance measures) for phylogeny reconstruction based on genome rearrangement data. Our empirical results show that parsimony approaches generally have better accuracy than distance-based methods.

gzipped postscript



TR-CS-2001-17

The Accuracy of Fast Phylogenetic Methods for Large Datasets
Luay Nakhleh, Bernard M.E. Moret, Usman Roshan, Katherine St. John, Jerry Sun, and Tandy Warnow

Whole-genome phylogenetic studies require various sources of phylogenetic signals to produce an accurate picture of the evolutionary history of a group of genomes. In particular, sequence-based reconstruction will play an important role, especially in resolving more recent events. But using sequences at the level of whole genomes means working with very large amounts of data---large numbers of sequences---as well as large phylogenetic distances, so that reconstruction methods must be both fast and robust as well as accurate. We study the accuracy, convergence rate, and speed of several fast reconstruction methods: neighbor-joining, Weighbor (a weighted version of neighbor-joining), greedy parsimony, and a new phylogenetic reconstruction method based on disk-covering and parsimony search (DCM-NJ + MP). Our study uses extensive simulations based on random birth-death trees, with controlled deviations from ultrametricity. We find that Weighbor, thanks to its sophisticated handling of probabilities, outperforms other methods for short sequences, while our new method is the best choice for sequence lengths above 100. For very large sequence lengths, all four methods have similar accuracy, so that the speed of neighbor-joining and greedy parsimony makes them the two methods of choice.

gzipped postscript



TR-CS-2001-10

Industrial Applications of High-Performance Computing for Phylogeny Reconstruction
David A. Bader, Bernard M.E. Moret, and Lisa Vawter

Phylogenies (that is, tree-of-life relationships) derived from gene-order data may prove crucial in answering some fundamental open questions in biomolecular evolution. Real-world interest in determining these relationships is strong. For example, pharmaceutical companies may use phylogeny reconstruction in drug discovery to find plants with similar gene production. Health organizations study the evolution and spread of viruses such as HIV to gain understanding of future outbreaks. And governments are interested in aiding the production of foodstuffs such as rice, wheat, and corn by understanding their genetic codes. Yet very few techniques are available for such phylogenetic reconstructions. Appropriate tools for analyzing such data may help resolve some difficult phylogenetic reconstruction problems; indeed, this new source of data has been embraced by many biologists in their phylogenetic work.

With the rapid accumulation of whole genome sequences for a wide diversity of taxa, phylogenetic reconstruction based on changes in gene order and gene content is showing promise, particularly for resolving deep (i.e., old) branches. However, reconstruction from gene-order data is even more computationally intensive than reconstruction from sequence data, particularly in groups with large numbers of genes and highly rearranged genomes. We have developed a software suite, GRAPPA, that extends the breakpoint analysis (BPAnalysis) method of Sankoff and Blanchette while running much faster: in a recent analysis of a collection of chloroplast data for species of Campanulaceae on a 512-processor Linux supercluster with Myrinet, we achieved a one-million-fold speedup over BPAnalysis. GRAPPA currently can use either breakpoint or inversion distance (computed exactly) for its computation and runs on single-processor machines as well as parallel and high-performance computers.

gzipped postscript



TR-CS-2001-08

Finding an Optimal Inversion Median: Experimental Results
Adam Siepel and Bernard M.E. Moret

We derive a branch-and-bound algorithm to find an optimal inversion median of three signed permutations. We establish fairly tight upper and lower bounds at intermediate points using general geometric properties of the problem that can be exploited because highly efficient algorithms are now available for determining pairwise inversion distance. Our experiments on simulated datasets show that optimal inversion medians can be significantly better than breakpoint medians; they also indicate that our algorithm computes the median in reasonable time for genomes of medium size when the distances are not too large, a common case in phylogeny reconstruction. We also provide evidence that parallel implementations can provide linear speedup and thus enable us to solve most realistic instances.
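One classical lower bound of the geometric kind mentioned above, possibly weaker than the bounds actually used in the report, follows from the triangle inequality: since the three pairwise distances are each at most the sum of the corresponding two median-to-genome distances, any median's score is at least half the perimeter of the triangle formed by the pairwise inversion distances.

```python
import math

def median_lower_bound(d_ab, d_ac, d_bc):
    """Lower bound on the median score d(A,M) + d(B,M) + d(C,M).

    Summing the triangle inequalities d(A,B) <= d(A,M) + d(B,M), etc.,
    gives perimeter <= 2 * score, so score >= ceil(perimeter / 2)."""
    return math.ceil((d_ab + d_ac + d_bc) / 2)
```

In a branch-and-bound search over candidate medians, any partial solution whose best achievable score already exceeds the incumbent can be pruned against a bound of this form.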

gzipped postscript



TR-CS-2001-02

New Approaches for Reconstructing Phylogenies from Gene Order Data
Bernard M.E. Moret, Li-San Wang, Tandy Warnow, and Stacia K. Wyman

We report on new techniques we have developed for reconstructing phylogenies on whole genomes. Our mathematical techniques include new polynomial-time methods for bounding the inversion length of a candidate tree and new polynomial-time methods for estimating genomic distances which greatly improve the accuracy of neighbor-joining analyses. We demonstrate the power of these techniques through an extensive performance study based upon simulating genome evolution under a wide range of model conditions. Combining these new tools with standard approaches (fast reconstruction with neighbor-joining, exploration of all possible refinements of strict consensus trees, etc.) has allowed us to analyze datasets that were previously considered computationally impractical. In particular, we have conducted a complete phylogenetic analysis of a subset of the Campanulaceae family, confirming various conjectures about the relationships among members of the subset and about the principal mechanism of evolution for their chloroplast genome. We give representative results of the extensive experimentation we conducted on both real and simulated datasets in order to validate and characterize our approaches. We find that our techniques provide very accurate reconstructions of the true tree topology even when the data are generated by processes that include a significant fraction of transpositions and when the data are close to saturation.

gzipped postscript



TR-CS-2000-20

An Empirical Comparison Between BPAnalysis and MPBE on the Campanulaceae Chloroplast Genome Dataset
Mary E. Cosner, Robert K. Jansen, Bernard M.E. Moret, Linda A. Raubeson, Li-San Wang, Tandy Warnow, and Stacia Wyman

The breakpoint phylogeny is an optimization problem proposed by Blanchette et al. for reconstructing evolutionary trees from gene order data. These same authors also developed and implemented BPAnalysis, a heuristic method (based on solving many instances of the travelling salesman problem) for estimating the breakpoint phylogeny. We present a new heuristic for estimating the breakpoint phylogeny which, although not polynomial-time, is much faster in practice than BPAnalysis. We use this heuristic to conduct a phylogenetic analysis of the flowering plant family Campanulaceae. We also present and discuss the results of experimentation on this real dataset with three methods: our new method, BPAnalysis, and the neighbor-joining method, using breakpoint distances, inversion distances, and inversion plus transposition distances.

gzipped postscript



TR-CS-2000-19

A New Fast Heuristic for Computing the Breakpoint Phylogeny and Experimental Phylogenetic Analyses of Real and Synthetic Data
Mary E. Cosner, Robert K. Jansen, Bernard M.E. Moret, Linda A. Raubeson, Li-San Wang, Tandy Warnow, and Stacia Wyman

The breakpoint phylogeny is an optimization problem proposed by Blanchette et al. for reconstructing evolutionary trees from gene order data. These same authors also developed and implemented BPAnalysis, a heuristic method (based on solving many instances of the travelling salesman problem) for estimating the breakpoint phylogeny. We present a new heuristic for this purpose; although not polynomial-time, our heuristic is much faster in practice than BPAnalysis. We present and discuss the results of experimentation on synthetic datasets and on the flowering plant family Campanulaceae with three methods: our new method, BPAnalysis, and the neighbor-joining method using several distance estimation techniques. Our preliminary results indicate that, on datasets with slow evolutionary rates and large numbers of genes in comparison with the number of taxa (genomes), all methods recover quite accurate reconstructions of the true evolutionary history (although BPAnalysis is too slow to be practical), but that on datasets where the rate of evolution is high relative to the number of genes, the accuracy of all three methods is poor.

gzipped postscript