View All Technical Reports
Found 324 results.
Listing from newest to oldest
TR-CS-2008-04
A New Approach to Model-Based Diagnosis Using Probabilistic Logic
Nikita A. Sakhanenko, Roshan R. Rammohan, George F. Luger, Carl R. Stern
We describe a new approach to model construction using transfer function diagrams that are consequently mapped into generalized loopy logic, a first-order, Turing-complete stochastic language. Transfer function diagrams support representation of dynamic systems with interconnected components. We demonstrate how these diagrams provide interfaces to a context-sensitive probabilistic modeling system (COSMOS). As a result, interfaces as well as the notion of context underlying COSMOS are successfully used for model-based diagnosis. This paper describes transfer function diagrams and how they are incorporated into COSMOS. We illustrate our approach with a practical example taken from a "pump system" modeling problem.
TR-CS-2008-03
Proving Termination of Imperative Programs via Term Rewriting
Stephan Falke and Deepak Kapur
This paper adapts techniques from the term rewriting literature in order to show termination of imperative programs operating on numbers. For this, we present a two-stage approach. In the first stage, imperative programs are translated into constrained term rewrite systems operating on numbers, where constraints are quantifier-free formulas of Presburger arithmetic. This way, computations of imperative programs are mimicked by rewrite sequences. In the second stage, termination of constrained term rewrite systems is analyzed by specializing and simplifying the dependency pair framework for normalized equational rewriting with constraints. This approach is highly flexible and allows the use of various termination techniques adapted from the term rewriting literature, including graph-based decompositions and polynomial interpretations. The approach has been used to prove termination of a large collection of imperative programs.
TR-CS-2008-02
Customizable Resource Usage Analysis for Java Bytecode
Jorge Navas, Mario Mendez-Lojo, and Manuel Hermenegildo
Automatic cost analysis of programs has been traditionally studied in terms of a number of concrete, predefined resources such as execution steps, time, or memory. However, the increasing relevance of analysis applications such as static debugging and/or certification of user-level properties (including for mobile code) makes it interesting to develop analyses for resource notions that are actually application-dependent. This may include, for example, bytes sent or received by an application, number of files left open, number of SMSs sent or received, number of accesses to a database, money spent, energy consumption, etc. We present a fully automated analysis for inferring upper bounds on the usage that a Java bytecode program makes of a set of application programmer-definable resources. In our context, a resource is defined by programmer-provided annotations which state the basic consumption that certain program elements make of that resource. From these definitions our analysis derives functions which return an upper bound on the usage that the whole program (and individual blocks) make of that resource for any given set of input data sizes. The analysis proposed is independent of the particular resource. We also present some experimental results from a prototype implementation of the approach covering an ample set of interesting resources.
TR-CS-2008-01
A Filesystem Interface for Sensor Networks
James Horey, Jean-Charles Tournier, Patrick Widener, Arthur B. Maccabe, and Ann Kilzer
Sensor network users currently face enormous challenges, including programming difficulty and the use of unfamiliar, complex interfaces. To date, most usability research with respect to sensor networks have focused on simplifying programming by offering powerful programming abstractions. Unfortunately, such work does little to encourage ordinary users, such as application scientists, to adopt sensor networks. In order to address these issues we have developed a filesystem interface for sensor networks. By treating a sensor network as a standard Unix filesystem, users are able to use familiar tools to interface with sensor networks. This simplifies management and debugging. Similarly, programs originally designed for filesystems, such as file-sharing programs, can be used to extend sensor networks. Finally, users are also able to prototype applications using pre-existing programming environments that interface with the sensor network through the file I/O interface.
TR-CS-2007-22
Operational Termination of Conditional Rewriting with Built-in Numbers and Semantic Data Structures
Stephan Falke and Deepak Kapur
Rewrite systems on free data structures have limited expressive power since semantic data structures like sets or multisets cannot be modeled elegantly. In this work we define a class of rewrite systems that allows the use of semantic data structures. Additionally, built-in natural numbers, including (dis)equality, ordering, and divisibility constraints, are supported. The rewrite mechanism is a combination of normalized equational rewriting with evaluation of conditions and validity checking of instantiated constraints. The framework is highly expressive and allows modeling of algorithms in a natural way.
Termination is one of the most important properties of conditional normalized equational rewriting. For this it is not sufficient to only show well-foundedness of the rewrite relation, but it also has to be ensured that evaluation of the conditions does not loop. The notion of operational termination is a way to capture these properties. In this work we show that it is possible to transform a conditional constrained equational rewrite system into an unconditional one such that termination of the latter implies operational termination of the former. Methods for showing termination of unconditional constrained equational rewrite system are presented in a companion paper.
TR-CS-2007-21
Dependency Pairs for Rewriting with Built-in Numbers and Semantic Data Structures
Stephan Falke and Deepak Kapur
Rewrite systems on free data structures have limited expressive power since semantic data structures like sets or multisets cannot be modeled elegantly. In this work we define a class of rewrite systems that allows the use of semantic data structures. Additionally, built-in natural numbers, including (dis)equality, ordering, and divisibility constraints, are supported. The rewrite mechanism is a combination of normalized equational rewriting with validity checking of instantiated constraints. The framework is highly expressive and allows modeling of algorithms in a natural way.
Termination is one of the most important properties of algorithms. This is true for both functional programs and imperative programs operating on natural numbers, which can be translated into rewrite systems. In this work, the dependency pair framework for proving termination is extended to be applicable to the class of rewrite systems described above, thus obtaining a flexible and powerful method for showing termination that can be automated effectively. We develop various refinements which increase power and efficiency of the method.
TR-CS-2007-20
Machine Learning: Probabilistic
George F. Luger, Nikita A. Sakhanenko, Roshan R. Rammohan, Chayan Chakrabarti
In this chapter we introduce probabilistic concepts and present several important machine learning algorithms influenced by these stochastic methods. At the heart of the chapter lie Bayes' rule and Markov chains. Probabilistic models of machine learning use the Bayesian approach which allows for interpretation of new evidence based on prior knowledge and experience. Similarly, Markov chains describe the probability of a current event as a function of the probabilities of events at previous time steps. We present hidden Markov models and their variants and illustrate the Viterbi algorithm with a speech processing example. Dynamic Bayesian networks, a temporal extension to traditional Bayesian networks, are presented. We also discuss different learning techniques including Markov chain Monte-Carlo and Expectation Maximization algorithms. Finally, we show how probabilistic measures can be used to extend reinforcement learning and Markov decision processes. This chapter is published as chapter 13 in the book "Artificial Intelligence: Structures and Strategies for Complex Problem Solving" by G. F. Luger (Addison-Wesley 2009).
TR-CS-2007-19
Tables: A Table-Based Language Environment for Sensor Networks
James Horey, Eric Nelson, and Arthur B. Maccabe
Intuitive programming environments targetting application specialists and casual users are currently lacking. Current work has either required large investment from users to learn advanced programming techniques or has focused on simplistic and limited tools. This paper introduces Tables, a graphical programming environment that consists of a spreadsheet-inspired interface and a local runtime executing on sensor nodes. Tables emphasizes ease-of-use by reusing spreadsheet abstractions, such as pivot tables and functions, to interactively program the sensor network. By using these familiar tools, users are able to construct complex applications that include local data filtering and collective processing.
We discuss the design and prototype implementation of Tables and demonstrate how to use Tables to create simple environmental monitoring applications and an advanced object-tracking application. We evaluate these applications in terms of ease-of-use and the relative network overhead, measured in simulation, imposed by the Tables environment. Using this evaluation, we show that the Tables programming environment represents a feasible alternative to existing programming systems.
TR-CS-2007-18
A Relational Algebra for Negative Databases
Fernando Esponda, Eric D. Trias, Elena S. Ackley, Stephanie Forrest
A negative database is a representation of all elements not contained in a given database. A negative database can enhance the privacy of sensitive information without resorting to encryption. This can be useful in settings where encryption is too expensive, e.g., some sensor networks, or for applications where searches or other operations on stored data are desired. The original negative database framework supported only authentication queries and operations for modifying data, such as insert and delete. This paper extends that work by defining a set of relational operators for negative representations. For each relational operator, the corresponding negative operator is defined such that the result of the negative operator applied to a negative representation is equivalent to the positive version applied to the positive representation. Algorithms for each relational operator are described and compared to its positive counterpart. This work enhances the practicality of negative databases and expands their range of application.
TR-CS-2007-17
Structured Streams: Data Services for Petascale Science Environments
Patrick M. Widener, Matthew Wolf, Hasan Abbasi, Matthew Barrick, Jay Lofstead,
Jack Pullikottil, Greg Eisenhauer, Ada Gavrilovska, Scott Klasky, Ron Oldfield,
Patrick G. Bridges, Arthur B. Maccabe, Karsten Schwan.
The challenge of meeting the I/O needs of petascale applications is exacerbated by an emerging class of data-intensive HPC applications that requires annotation, reorganization, or even conversion of their data as it moves between HPC computations and data end users or producers. For instance, data visualization can present data at different levels of detail. Further, actions on data are often dynamic, as new end-user requirements introduce data manipulations unforeseen by original application developers. These factors are driving a rich set of requirements for future petascale I/O systems: (1) high levels of performance and therefore, flexibility in how data is extracted from petascale codes; (2) the need to support .on demand. data annotation . metadata creation and management . outside application codes; (3) support for concurrent use of data by multiple applications, like visualization and storage, including associated consistency management and scheduling; and (4) the ability to flexibly access and reorganize physical data storage.
We introduce an end-to-end approach to meeting these requirements: Structured Streams, streams of structured data with which methods for data management can be associated whenever and wherever needed. These methods can execute synchronously or asynchronously with data extraction and streaming, they can run on the petascale machine or on associated machines (such as storage or visualization engines), and they can implement arbitrary data annotations, reorganization, or conversions. The Structured Streaming Data System (SSDS) enables high-performance data movement or manipulation between the compute and service nodes of the petascale machine and between/on service nodes and ancillary machines; it enables the metadata creation and management associated with these movements through specification instead of application coding; and it ensures data consistency in the presence of anticipated or unanticipated data consumers. Two key abstractions implemented in SSDS, I/O graphs and Metabots, provide developers with high-level tools for structuring data movement as dynamically-composed topologies. A lightweight storage system avoids traditional sources of I/O overhead while enforcing protected access to data.
This paper describes the SSDS architecture, motivating its design decisions and intended application uses. The utility of the I/O graph and Metabot abstractions is illustrated with examples from existing HPC codes executing on Linux Infiniband clusters and Cray XT3 supercomputers. Performance claims are supported with experiments benchmarking the underlying software layers of SSDS, as well as application-specific usage scenarios.
TR-CS-2007-16
Towards a High-Level Implementation of Execution Primitives for Unrestricted, Independent And-parallelism
Amadeo Casas, Manuel Carro, and Manuel V. Hermenegildo
Most efficient implementations of parallel logic programming rely on complex low-level machinery which is arguably difficult to implement and modify. We explore an alternative approach aimed at taming that complexity by raising core parts of the implementation to the source language level for the particular case of and-parallelism. We handle a significant portion of the parallel implementation at the Prolog level with the help of a comparatively small number of concurrency-related primitives which take care of lower-level tasks such as locking, thread management, stack set management, etc. The approach does not eliminate altogether modifications to the abstract machine, but it does greatly simplify them and it also facilitates experimenting with different alternatives. We show how this approach allows implementing both restricted and unrestricted (i.e., non fork-join) parallelism. Preliminary experiments show that the performance sacrificed is reasonable, although granularity control is required in some cases. Also, we observe that the availability of unrestricted parallelism contributes to better observed speedups.
TR-CS-2007-14
Annotation Algorithms for Unrestricted Independent And-Parallelism in Logic Programs
Amadeo Casas, Manuel Carro, and Manuel V. Hermenegildo
We present two new algorithms which perform automatic parallelization via source-to-source transformations. The objective is to exploit goal-level, unrestricted independent and-parallelism. The proposed algorithms use as targets new parallel execution primitives which are simpler and more flexible than the well-known &/2 parallel operator. This makes it possible to generate better parallel expressions by exposing more potential parallelism among the literals of a clause than is possible with &/2. The difference between the two algorithms stems from whether the order of the solutions obtained is preserved or not. We also report on a preliminary evaluation of an implementation of our approach. We compare the performance obtained to that of previous annotation algorithms and show that relevant improvements can be obtained.
TR-CS-2007-13
Dominance: Modeling Heap Structures with Sharing
Mark Marron, Rupak Majumdar, Darko Stefanovic, Deepak Kapur
A number of papers have used predicate languages over sets of abstract locations to model the heap (decorating a heap graph with the predicates, or in conjunction with an access path abstraction). In this work we introduce a new predicate, dominance, which is a generalization of aliasing and is used to model how objects are shared in the heap (e.g. do two lists contain the same set of objects?) and how sharing influences the results of destructive updates (e.g. if all the objects in one list are modified does that imply that all the objects in another list are modified?). The dominance relation is introduced in the context of a graph-based heap model but the concept is general and can be incorporated in other frameworks as a high-level primitive.
The motivation for introducing a higher-order predicate to model sharing is based on the success of existing approaches which use connectivity, reachability, and sharing predicates to perform shape analysis using high-level graph-based models. Our approach provides an analysis technique that is efficient and scalable while retaining a high level of accuracy. This is in contrast to proposals in the literature based on using logical languages (usually built on top of first-order predicate logic) which either are highly expressive and very general (resulting in a computationally expensive analysis) or utilize only fragments of the full logic (which reduces the computational cost but results in a much less powerful analysis).
The approach presented in this paper has been implemented and run on a variation of the Jolden benchmarks. The information extracted using the analysis has been used to parallelize these programs with a near optimal speedup for 7 of the 9 benchmarks. The runtimes are small enough to make the analysis practical (a few seconds for each program).
TR-CS-2007-12
Protecting BGP from Invalid Paths
Josh Karlin, Stephanie Forrest and Jennifer Rexford
The Internet's interdomain routing protocol, BGP, is vulnerable to a number of potentially crippling attacks. Many promising cryptography-based solutions have been proposed, but none have been embraced by the necessary communities to garner significant adoption. This is largely due to the difficulty of developing and maintaining the necessary PKI infrastructure and changes to the BGP protocol that the proposed solutions require. Alternative solutions such as anomaly detectors have been unable to provide the same level of security as the cryptographic mechanisms. In this paper we create an anomaly detector and response mechanism capable of automatically stopping the propagation of invalid path attacks, a difficult class of attacks to detect. Our solution provides comparable security to the cryptographic methods and could be readily deployed with a simple software upgrade in participating networks.
TR-CS-2007-11
Worm Versus Alert: Who Wins in a Battle for Control of a
Large-Scale Network?
James Aspnes, Navin Rustagi, and Jared Saia
Consider the following game between a worm and an alert1 over a network of n nodes. Initially, no nodes are infected or alerted and each node in the network is a special detector node independently with small but constant probability. The game starts with a single node becoming infected. In every round thereafter, every infected node sends out a constant number of worms to other nodes in the population, and every alerted node sends out a constant number of alerts. Nodes in the network change state according to the following three rules: 1) If a worm is received by a node that is not a detector and is not alerted, that node becomes infected; 2) If a worm is received by a node that is a detector, that node becomes alerted; 3) If an alert is received by a node that is not infected, that node becomes alerted.
We make two assumptions about this game. First, that an infected node can send worm messages to any other node in the network but, in contrast, an alerted node can send alert messages only through a previously determined, constant degree overlay network. Second, we assume that the infected nodes are intelligent, coordinated and essentially omniscient. In other words, the infected nodes know everything except for which nodes are detectors and the alerted nodes' random coin flips i.e. they know the topology of the overlay network used by the alerts; which nodes are alerted and which are infected at any time; where alerts and worms are being sent; the overall strategy used by the alerted nodes; etc. The alerted nodes are assumed to know nothing about which other nodes are infected or alerted, where alerts or worms are being sent, or the strategy used by the infected nodes.
Specifically, this result holds if d ≥ α and
,
where α and
represent the
rate of the spread of the alert and worm respectively;
is the
probability that a node is a detector node; d is the degree of the
overlay network; and c is the node expansion of the overlay
network. Next, we give empirical results that suggest that our
algorithms for the alert may be useful in current large-scale
networks. Finally, we show that if the overlay network has poor
expansion, in particular if
, then the worm will likely
infect almost all of the non-detector nodes.
1Specifically, we consider self-certifying alerts[6], which contain short proofs that a security flaw exists and thereby eliminate false alerts.
TR-CS-2007-10
Selecting Bayesian Network Parameterizations for Generating Simulated Data
John Burge, Terran Lane
The correct solution for an unsupervised learning task is never known. Thus, experimental results for such tasks must be validated. This can be done by turning the task into a supervised learning problem by training on simulated data where the correct solution is known and controllable. While this is a common practice, little research is done on generating interesting classes of random data that can be applied to many domains. We present a method to select the parameters of a Bayesian network (BN) which can be used to generate simulated data. Our methods allow the magnitude of correlations among random variables in the data to be precisely controlled. We apply our methods to the validation of BN structure search techniques and demonstrate that by precisely controlling correlations in the simulated data, qualitative differences among structure learning algorithms can be observed.
TR-CS-2007-09
A Context-Partitioned Stochastic Modeling
System with Causally Informed Context
Management and Model Induction
Nikita A. Sakhanenko, Roshan R. Rammohan, George F. Luger, and
Carl R. Stern
We describe a flexible multi-layer architecture for context-sensitive stochastic modeling. The architecture incorporates a high performance stochastic modeling core based on a recursive form of probabilistic logic. On top of this modeling core, causal representations and reasoning direct a long-term incremental learning process that produces a context-partitioned library of stochastic models. The failure-driven learning procedure for expanding and refining the model library employs a combination of abductive inference together with EM model induction to construct new models when current models no longer perform acceptably. The system uses a causal finite state machine representation to control on-line model switching and model adaptation along with embedded learning. Our system is designed to support operational deployment in real-time monitoring, diagnostic, prognostic, and decision support applications. In this paper we describe the basic multi-layer architecture along with new learning algorithms inspired by developmental learning theory.
TR-CS-2007-08
Efficient Context-Sensitive Shape Analysis with Graph Based Heap Models
Mark Marron, Manuel Hermenegildo, Deepak Kapur, and Darko Stefanovic
The performance of heap analysis techniques has a significant impact on their utility in optimization and verification applications. A significant factor affecting analysis performance is how procedure calls are handled. To maximize the accuracy of the analysis it is desirable to perform inter-procedural data flow analysis in a context-sensitive manner. However, re-analyzing a procedure for each calling context can result in the analysis being computationally intractable. To minimize the computational cost of performing context-sensitive dataflow analysis it is critical to minimize the number of times a function body needs to be re-analyzed. The computational cost can be further reduced by minimizing the amount of information that is carried in from the caller scope and thus needs to be processed during the analysis of the callee procedure. This can be accomplished by memoizing the calls as they are analyzed and using project/extend operations to remove portions of the heap model that cannot be affected by the called procedure.
The focus of this paper is the design of project and extend operations for a graph-based abstract heap model. Our interprocedural analysis introduces a novel method for naming cutpoints which is able to accurately model calls with shared data structures, multiple cutpoints, and recursive calls. In conjunction with our heap model (which tracks all heap properties using local information) the project and extend operations are able to substantially reduce the amount of calling context information that is carried into the analysis of the callee body. Our experimental results indicate that the proposed project/extend operations have a substantial positive effect on the memoization of procedure analysis results and on the reduction of the size of the model that the local analysis must deal with. Additionally, the cutpoint naming technique is (in many cases) able to precisely maintain the relations between the callee local variables and the caller local variables in the immediately preceding call frame, which is critical to modeling programs that perform recursive destructive modifications on lists or trees.
TR-CS-2007-07
Dependency Pairs for Rewriting with Non-Free Constructors
Stephan Falke and Deepak Kapur
A method based on dependency pairs for showing termination of functional programs on data structures generated by constructors with relations is proposed. A functional program is specified as an equational rewrite system, where the rewrite system specifies the program and the equations express the relations on the constructors that generate the data structures. Unlike previous approaches, relations on constructors can be collapsing, including idempotency and identity relations. Relations among constructors may be partitioned into two parts: (i) equations that cannot be oriented into terminating rewrite rules, and (ii) equations that can be oriented as terminating rewrite rules, in which case an equivalent convergent system for them is generated. The dependency pair method is extended to normalized rewriting, where constructor-terms in the redex are normalized first. The method has been applied to several examples, including the Calculus of Communicating Systems and the Propositional Sequent Calculus. Various refinements, such as dependency graphs, narrowing, etc., which increase the power of the dependency pair method, are presented for normalized rewriting.
TR-CS-2007-06
A Simulation Framework for Modeling Combinatorial Control in Transcription Regulatory Networks
Sushmita Roy, Terran Lane, Margaret Werner-Washburne
With the increasing availability of genome scale data, a plethora of algorithms are being developed to infer genome scale regulatory networks. However, identifying the appropriate algorithm for a particular inference task is a critical and difficult modeling question. This is because for most real-world data, the true network is rarely known and different algorithms produce different networks, which may be due to assumptions inherrent to the algorithm or the objective the algorithm is designed to optimize. One approach to address this issue is to build artificial, but biologically realistic, gene networks for which the true topology is known. These networks can be sampled to generate data, which can then serve as a testbed to compare a suite of network inference algorithms.
Existing simulation systems either require highly detailed descriptions of network components making the generation of large-scale networks (> 100 nodes) infeasible, or, make simplifications that render the simulated networks very far from true models of gene regulation. We have developed a simulation system that builds a network of genes and proteins and models the rate of transcriptional activity as a non-linear function of a set of transciption factor complexes. Our simulator uses a system of differential equations and interfaces with Copasi, a differential equation solver, to produce steady-state and time-series expression datasets. Overall, our simulator has the following advantages: (i) it can produce expression datasets in an efficient and highthroughput manner, (ii) it is biologically more realistic since it models transcription rate as a function of transcription factor levels rather than gene expression levels, (iii) it incorporates the combinatorial control among different transcription factors, and (iii) it incorporates gene and protein expressions, thus allowing the comparison of algorithms that use different combinations of these data sources.
TR-CS-2007-05
Precise Set Sharing and Nullity Analysis for Java-style Programs
Mario Mendez Lojo, Manuel V. Hermenegildo
Finding useful sharing information between instances in object-oriented programs has been recently the focus of much research. The applications of such static analysis are multiple:by knowing which variables share in memory we can apply conventional compiler optimizations, find coarse-grained parallelism opportunities, or, more importantly,verify certain correctness aspects of programs even in the absence of annotations In this paper we introduce a framework for deriving precise sharing information based on abstract interpretation for a Java-like language. Our analysis achieves precision in various ways. The analysis is multivariant, which allows separating different contexts. We propose a combined Set Sharing + Nullity + Classes domain which captures which instances share and which ones do not or are definitively null, and which uses the classes to refine the static information en inheritance is present. Carrying the domains in a combined way facilitates the interaction among the domains in the presence of mutivariance in the analysis. We show that both the set sharing part of the domain as well as the combined domain provide more accurate information than previous work based on pair sharing domains, at reasonable cost.
TR-CS-2007-04
Managing Dynamic Contexts Using Failure-Driven Stochastic Models
Nikita A. Sakhanenko, George F. Luger, Carl R. Stern
We describe an architecture for representing and managing context shifts that supports dynamic data interpretation. This architecture utilizes two layers of learning and three layers of control for adapting and evolving new stochastic models to accurately represent changing and evolving situations. At the core of this architecture is a form of probabilistic logic used to encode sets of recursive relationships defining Dynamic Bayesian Models. This logic is extended with a syntax for defining contextual restrictions on stochastic Horn clauses. EM parameter learning is used to calibrate models as well as to assess the quality of fit between the model and the data. Model failure, detected as a poor fit between model and data, triggers a model repair mechanism based on causally informed context splitting and context merging. An implementation of this architecture for distributed weather monitoring is currently under development.
TR-CS-2007-03
An Efficient, Parametric Fixpoint Algorithm for Analysis of Java Bytecode
Mario Mendez, Jorge Navas, Manuel Hermenegildo
Abstract interpretation has been widely used for the analysis of object-oriented languages and, in particular, Java source and bytecode. However, while most existing work deals with the problem of finding expressive abstract domains that track accurately the characteristics of a particular concrete property, the underlying fixpoint algorithms have received comparatively less attention.
In fact, many existing (abstract interpretation based--) fixpoint algorithms rely on relatively inefficient techniques for solving inter-procedural call graphs or are specific and tied to particular analyses. We also argue that the design of an efficient fixpoint algorithm is pivotal to supporting the analysis of large programs. In this paper we introduce a novel algorithm for analysis of Java bytecode which includes a number of optimizations in order to reduce the number of iterations. The algorithm is parametric -- in the sense that it is independent of the abstract domain used and it can be applied to different domains as "plug-ins" --, multivariant, and flow-sensitive. Also, is based on a program transformation, prior to the analysis, that results in a highly uniform representation of all the features in the language and therefore simplifies analysis. Detailed descriptions of decompilation solutions are given and discussed with an example. We also provide some performance data from a preliminary implementation of the analysis.
TR-CS-2007-02
Heap Analysis in the Presence of Collection Libraries
Unified Memory Analysis: Analyzing Collections
Mark Marron, Darko Stefanovic, Manuel Hermenegildo, and Deepak Kapur
Memory analysis techniques have become sophisticated enough to model, with a high degree of accuracy, the manipulation of simple memory structures (finite structures, single/double linked lists and trees). However, modern programming languages provide extensive library support including a wide range of generic collection objects that make use of complex internal data structures. While these data structures ensure that the collections are efficient, often these representations cannot be effectively modeled by existing methods (either due to excessive runtime or due to the inability to represent the required information).
This paper presents a method to represent collections using an abstraction of their semantics. The construction of the abstract semantics for the collection objects is done in a manner that allows individual elements in the collections to be identified. Our construction also supports iterators over the collections and is able to model the position of the iterator with respect to the elements in the collection. By ordering the contents of the collection based on the iterator position, the model can represent a notion of progress when iteratively manipulating the contents of a collection. These features allow strong updates to the individual elements in the collection as well as strong updates over the collections themselves.
TR-CS-2007-01
Resource Bounds Analysis
Jorge Navas, Edison Mera, Pedro Lopez-Garcia, and Manuel
Hermenegildo
We present a generic analysis that infers both upper and lower bounds on the usage that a program makes of a set of user-definable resources. The inferred bounds will in general be functions of input data sizes. A resource in our approach is a quite general, user-defined notion which associates a basic cost function with elementary operations. The analysis then derives the related (upper- and lower-bound) cost functions for all procedures in the program. We also present an assertion language which is used to define both such resources and resource-related properties that the system can then check based on the results of the analysis. We have performed some experiments with some concrete resource-related properties such as execution steps, bits sent or received by an application, number of arithmetic operations performed, number of calls to a procedure, number of transactions, etc. presenting the resource usage functions inferred and the times taken to perform the analysis. Applications of our analysis include resource consumption verification and debugging (including for mobile code), resource control in parallel/distributed computing, and resource-oriented specialization.
TR-CS-2006-15
A Platform for Research into Object-Level Trace Generation
James Foucar
An object-level trace, which we will hereafter refer to as a garbage collection trace (GC-trace), enumerates all the events that happen during program execution that affect the heap. Therefore, GC-tracing is an excellent way to examine the behavior of an object-oriented system. Knowing behavior expedites many forms of virtual machine research including garbage collection research. GC-traces were originally generated by the painfully slow brute-force technique. Subsequently, this technique was replaced by the Merlin algorithm which can generate GC-traces at a fraction of the cost of the brute-force algorithm.
Our work introduces a new GC-tracing system. Our system uses the Merlin algorithm within the context of a general shadow heap that is independent of the virtual machine. The shadow heap supports many optimizations to the Merlin algorithm and also supports analysis and verification. We examine the advantages and disadvantages of our approach, the viability of using a C++ library for analysis in conjunction with a Java-in-Java virtual machine, the various costs of integrating our shadow heap with a virtual machine, and the effectiveness of our GC-tracing optimizations. Finally, we compare our GC-tracing system with the GC-tracing system that is currently bundled with JikesRVM.
TR-CS-2006-14
Markov Random Field Modeling of the Spatial Distribution of Proteins on Cell Membranes
Jun Zhang, Stanly L. Steinberg, Bridget S. Wilson, Janet M. Oliver, and Lance R. Williams
Biologists use gold particles visualized by transmission electron microscopy (TEM) to label proteins on the cell membrane. This is done to gain insight into how proteins in a signaling pathway interact. However, due to limitations of applicable gold particle size and shape, typically only two proteins can be labeled at a time. A formidable challenge is to integrate experimental data across multiple experiments where there are from 10 to 100 different proteins of interest and only the positions of two proteins can be observed simultaneously. We propose to use Markov random fields (MRF) to characterize the distribution of proteins on the cell membrane. Markov random fields are a powerful mathematical formalism for modeling correlations between states associated with neighboring sites in spatial lattices. The presence or absence of a protein of a specific type at a point on the cell membrane is a state. Since only two protein types can be observed, i.e., those bound to particles, and the rest cannot be observed, the problem is one of deducing the conditional distribution of a MRF with unobservable (hidden) states. We develop a multiscale MRF model and use mathematical programming techniques to infer the conditional distribution of a MRF for proteins of three types from observations showing the interactions between only two types. Application to artificial data shows that the interactions among three proteins can be reliably estimated. Results on real data are also presented.
TR-CS-2006-13
Design and Implementation of a Baseline port of the Jikes Research Virtual Machine to the IA-64 Architecture
Ravi kiran Gorrepati
Of all 64-bit architectures, IA-64 is the most recent and offers exciting new features for compiler developers. IA-64 gains speed from efficient compilation, which is costly for Java, because the compilation of modules happens dynamically on the virtual machine. Lack of an opensource Java virtual machine for the IA-64 which allows for research on code generation techniques lead us to pursue this work. This work presents a baseline code generator for the IA-64 architecture. This code generator is a part of the Jikes Research Virtual Machine for Java written in Java. To our knowledge, this work presents the first open-source, just-in-time virtual machine for Java on IA-64 architecture.
TR-CS-2006-12
On the Equational Definability of Brouwer-Zadeh Lattices
Matthew Spinks and Robert Veroff
We give an axiomatisation of the variety of Brouwer-Zadeh lattices, suitable for applications to quantum theory.
TR-CS-2006-11
Shock Physics Data Reconstruction Using Support Vector Regression
Nikita A. Sakhanenko, George F. Luger, Hanna E. Makaruk, Joysree B. Aubrey, and David B. Holtkamp
This paper considers a set of shock physics experiments that investigate how materials respond to the extremes of deformation, pressure, and temperature when exposed to shock waves. Due to the complexity and the cost of these tests, the available experimental data set is often very sparse. A support vector machine (SVM) technique for regression is used for data estimation of velocity measurements from the underlying experiments. Because of good generalization performance, the SVM method successfully interpolates the experimental data. The analysis of the resulting velocity surface provides more information on the physical phenomena of the experiment. Additionally, the estimated data can be used to identify outlier data sets, as well as to increase the understanding of the other data from the experiment.
TR-CS-2006-10
Pretty Good BGP: Improving BGP by Cautiously Adopting Routes
Josh Karlin, Stephanie Forrest, and Jennifer Rexford
The Internet's interdomain routing protocol, BGP, is vulnerable to a number of damaging attacks, which often arise from operator misconfiguration. Proposed solutions with strong guarantees require a public-key infrastructure, accurate routing registries, and changes to BGP. While experts debate whether such a large deployment is feasible, networks remain vulnerable to false information injected into BGP. However, BGP routers could avoid selecting and propagating these routes if they were cautious about adopting new reachability information. We describe a protocol-preserving enhancement to BGP, Pretty Good BGP (PGBGP), that slows the dissemination of bogus routes, providing network operators time to respond before problems escalate into a large-scale Internet attack. Simulation results show that realistic deployments of PGBGP could provide 99% of Autonomous Systems with 24 hours to investigate and repair bogus routes without affecting prefix reachability. We also show that without PGBGP, 40% of ASs cannot avoid selecting bogus routes; with PGBGP, this number drops to less than 1%. Finally, we show that PGBGP is incrementally deployable and offers significant security benefits to early adopters and their customers.
TR-CS-2006-09
The Concrete Mathematics of Microfluidic Mixing, Part I
Jennifer Sager, Maxwell Young, and Darko Stefanovic
We analyze mathematically a previously reported class of passive microfluidic mixing networks. The networks produce nonhomogeneous concentrations in the output channel, resulting in diverse concentration profiles. We formally prove that all profiles obtainable with this class of networks can be described as polynomials of degree no higher than the number of input channels less one. We derive explicit formulas for the calculation of resultant output concentration profiles and conversely for the calculation of input concentrations needed to obtain set output profiles.
TR-CS-2006-08
Use of Crossing-State Equivalence Classes for Rapid Relabeling of
Knot-Diagrams Representing 2 1/2 D Scenes
Keith Wiley and Lance R. Williams
In previous research, we demonstrated a sophisticated computer- assisted drawing program called Druid which permits easy construction of 2 1/2 D scenes. A 2 1/2 D scene is a representation of surfaces that is fundamentally two-dimensional, but which also represents the relative depths of those surfaces in the third dimension. This paper greatly improves Druid's efficiency by exploitating a topological constraint on 2 1/2 D scenes which we call a crossing-state equivalence class. This paper describes this constraint and how it is used by Druid.
TR-CS-2006-07
A Syntactic Approach to Combining Functional Notation, Lazy Evaluation, and Higher-Order in LP Systems
Amadeo Casas, Daniel Cabeza, and Manuel V. Hermenegildo
Nondeterminism and partially instantiated data structures give logic programming expressive power beyond that of functional programming. However, functional programming often provides convenient syntactic features, such as having a designated implicit output argument, which allow function call nesting and sometimes results in more compact code. Functional programming also sometimes allows a more direct encoding of lazy evaluation, with its ability to deal with infinite data structures. We present a syntactic functional extension, used in the Ciao system, which can be implemented in ISO-standard Prolog systems and covers function application, predefined evaluable functors, functional definitions, quoting, and lazy evaluation. The extension is also composable with higher-order features and can be combined with other extensions to ISO-Prolog such as constraints. We also highlight the features of the Ciao system which help implementation and present some data on the overhead of using lazy evaluation with respect to eager evaluation.
TR-CS-2006-06
Unified Memory Analysis
Mark Marron, Deepak Kapur, Darko Stefanovic, and Manuel Hermenegildo
The ability to accurately model the state of program memory and how it evolves during program execution is critical to many optimization and verification techniques. Current memory analysis techniques either provide very accurate information but run prohibitively slowly or run in an acceptable time but produce very conservative results. This paper presents an analysis method that is capable of accurately modeling many important program properties (aliasing, shape, logical data structures, sharing of data elements in collections) while maintaining an acceptable level of performance. Using several examples we show how our abstract model is able to provide detailed information on the properties of interest in the concrete domain. We demonstrate that the model is a safe approximation of the concrete heap and outline the required data flow operations. Finally, we show that the asymptotic runtime of this method is polynomial in the size of the abstract heap and that in practice it is efficient enough to be used in an optimizing compiler.
TR-CS-2006-05
NetGen Release Notes, Epsilon Version
(Phylogenetic Network Generation Application)
Monique M. Morin
NetGen is an event-driven simulator that creates phylogenetic histories which include hybrids. The birth-death model typically employed by biologists is extended with a diploid hybridization event. DNA sequences are evolved in conjunction with the typology, enabling hybridization decisions to be based upon hamming distance if desired. NetGen supports variable rate lineages, root sequence specification, outgroup generation and many other options. This document provides an overview of the software, installation and execution instructions, as well as a comprehensive list of input parameters.
TR-CS-2006-04
Implicit Induction Methods and Decision Procedures
Stephan Falke and Deepak Kapur
Decision procedures are widely used in automated reasoning tools in order to reason about data structures. Their scope is typically limited, though, and many conjectures occurring in practical applications fall outside the theory handled by a decision procedure. Typically, reasoning about functions that are defined on those data structures is needed. For this, inductive reasoning has to be employed.
In this work, families of function definitions and conjectures are identified for which inductive validity can be decided using implicit induction methods and decision procedures for an underlying decidable theory. The results significantly extend the results obtained in (Kapur & Subramaniam, 2000), which were obtained using explicit induction schemes. Firstly, we allow performing induction on nonlinear terms, which enables us to decide conjectures like gcd(x, x) = x. Secondly, we allow for mutually recursive function definitions. Thirdly, we allow general terms from the decidable theory on inductive positions, which is needed to decide conjectures such as gcd(2x, 2) = 2. These contributions are crucial for successfully integrating inductive reasoning into decision procedures.
TR-CS-2006-03
A Theoretical and Experimental Evaluation of Augmented Bayesian
Classifiers
Vikas Hamine and Paul Helman
Naive Bayes is a simple Bayesian network classifier with strong independence assumptions among features. This classifier despite its strong independence assumptions, often performs well in practice. It is believed that relaxing the independence assumptions of naive Bayes may improve the performance of the resulting structure. Augmented Bayesian Classifiers relax the independence assumptions of naive Bayes by adding augmenting arcs that obey certain structural restrictions, between features of a naive Bayes classifier. In this paper we present algorithms for learning Augmented Bayesian Classifiers with respect to the Minimum Description Length (MDL) and Bayesian-Dirichlet (BD) metrics. Experimental results on the performance of these algorithms on various datasets selected from the UCI Machine Learning Repository are presented. Finally, a comparison of the learning rates and accuracies of various Augmented Bayesian Classifiers is presented on artificial data. Global connectivity from local geometric constraints for sensor networks with various wireless footprints
TR-CS-2006-02
Global connectivity from local geometric constraints for
sensor networks with various wireless footprints
Raissa D’Souza, David Galvin, Cristopher Moore, and Dana Randall
Adaptive power topology control (APTC) is a local algorithm for constructing a one-parameter family of theta-graphs, where each node increases power until it has a neighbor in every theta sector around it. We show it is possible to use such a local geometric theta-constraint to determine whether full network connectivity is achievable, and consider tradeoffs between assumptions of the wireless footprint and constraints on the boundary nodes. In particular, we show that if the boundary nodes can communicate with neighboring boundary nodes and all interior nodes satisfy a theta_I < pi constraint, we can guarantee connectivity for any arbitrary wireless footprint. If we relax the boundary assumption and instead impose a theta_B < 3pi/2 constraint on the boundary nodes, together with the theta_I < pi constraint on interior nodes, we can guarantee full network connectivity using only a weak-monotonicity footprint assumption. The weakmonotonicity model, introduced herein, is much less restrictive than the disk model of coverage and captures aspects of the spatial correlations inherent in signal propagation and noise. Finally, assuming the idealized disk model of coverage, we show that when theta < pi, APTC constructs graphs that are sparse, and when theta < 2pi/3, the graphs support greedy geometric routing.
TR-CS-2006-01
Efficient top-down set-sharing analysis using cliques
Jorge Navas, Francisco Bueno, and Manuel Hermenegildo
We study the problem of efficient, scalable set-sharing analysis of logic programs. We use the idea of representing sharing information as a pair of abstract substitutions, one of which is a worst-case sharing representation called a clique set, which was previously proposed for the case of inferring pair-sharing. We use the clique-set representation for (1) inferring actual set-sharing information, and (2) analysis within a top-down framework. In particular, we define the new abstract functions required by standard top-down analyses, both for sharing alone and also for the case of including freeness in addition to sharing. We use cliques both as an alternative representation and as widening, defining several widening operators. Our experimental evaluation supports the conclusion that, for inferring set-sharing, as it was the case for inferring pair-sharing, precision losses are limited, while useful efficiency gains are obtained. We also derive useful conclusions regarding the interactions between thresholds, precision, efficiency and cost of widening. At the limit, the clique-set representation allowed analyzing some programs that exceeded memory capacity using classical sharing representations.
TR-CS-2005-44
The Reduction Approach to Decision Procedures
Deepak Kapur and Calogero G. Zarba
Reduction functions reduce the satisfiability of a formula over a complex data structure to the satisfiability of formula over a simpler data structure. It is shown how reduction functions can be used to reduce a formula over lists to a formula over constructors, a formula over arrays to a formula over equality, a formula over sets to a formula over equality, and a formula over multisets to a formula over integer linear arithmetic extended with uninterpreted function symbols. Furthermore, a method for combining reduction functions is presented. Consequently, the satisfiability of a formula over the combination of complex data structures can be reduced to the satisfiability of a formula over the combination of simpler data structures.
TR-CS-2005-43
A survey of configurable operating systems
Jean-Charles Tournier
The need to deal with configurable operating systems is not new and several research projects have tried, and are trying, to deal with this issue. In order to identify the huge number of different configurable operating systems, a survey is therefore needed. Moreover, such a survey should allow to identify the main classes of configurability and to compare operating systems from a configurability point of view. This paper presents an up-to-date survey that includes newest research projects and based on revelant criteria that has been risen when authors started to work on a configurable operating system for high-performance computing.
TR-CS-2005-42
Multiple Kernel Learning for Support Vector Regression
Shibin Qiu and Terran Lane
Kernel support vector (SV) regression has successfully been used for prediction of nonlinear and complicated data. However, like other kernel methods such as support vector machine (SVM) classification, the quality of SV regression depends on proper choice of kernel functions and their parameters. Kernel selection for model selection is conventionally performed through repeated cross validation over a range of kernels and their parameters. Multiple kernel learning arises when a range of kernel parameters need to be tuned within one training process, when different types of kernels are applied simultaneously, or when data are from heterogeneous sources and are characterized with different kernels. Multiple kernel learning can improve the efficiency of kernel selection by automatically evaluate the relative importance of the candidate kernels. Inspired by recent developments in kernel selection for SVM classification, we investigate multiple kernel learning for SV regression. Since more slack variables and constraints are involved in the optimization formulation of SV regression than SVM, we can only follow the general procedures used by SVM but cannot directly use the results derived from SVM. We transform the optimization problems of SV regressions using both epsilon-insensitive loss function and automatic error control into proper forms so that they can be formulated as semidefinite programming (SDP) problems. To avoid the high computational cost of SDP programming, we further formulate multiple kernel SV regression into quadratically constrained quadratic programming optimization problems. Experiments on public and simulated data sets demonstrate that multiple kernel SV regression improves prediction accuracy, reduces the number of support vectors, and helps characterize the propertis of the data.
TR-CS-2005-41
Games, Strategies, and Boolean Formula Manipulation
Ben Andrews
Biomolecular automata based on chemical logic gates can play two-player games of perfect information provided that a legal strategy for the game can be expressed in terms of Boolean formulas. To facilitate the construction of such automata, we developed a formal procedure for the analysis of game strategies, their translation into Boolean formulas, and the manipulation of the resultant formulas into canonical forms determined by the available logic primitives. As a case study in the application of this procedure, we generated strategies for the first player in the game of tic-tac-toe that, by construction, are favorable (do not admit a loss), and we found that many can be translated into Boolean formulas. Moreover, the resultant formulas can be expressed in logic with a fan-in of five, but, apparently, not in simpler logic. Additionally, we studied the combinatorics of strategies and, for the first player in the game of tic-tac-toe, we computed the number of all strategies, about 1.423e124, and the number of favorable strategies, about 2.648e103.
TR-CS-2005-40
Designing a Dynamic Middleware System for Sensor Networks
James Horey and Arthur Maccabe
Kensho is a middleware system for sensor networks that provides a communication and computation transparent view of the sensor network. Kensho is based on the specification of functional groups that support both data sharing and computation. Collective operations can migrate automatically throughout a group to take advantage of new resources without the user's input. This allows new nodes to be added to the network without having to reprogram the network or the new node. We believe that these abstractions provided by the Kensho middleware supports the creation of highly scalable, robust distributed programs.
TR-CS-2005-39
The Design and Testing of a First-Order Logic-Based Stochastic Modeling Language
Daniel J. Pless, Chayan Chakrabarti, Roshan Rammohan, and George F. Luger
We have created a logic-based, Turing-complete language for stochastic modeling. Since the inference scheme for this language is based on a variant of Pearl's loopy belief propagation algorithm, we call it Loopy Logic. Traditional Bayesian networks have limited expressive power, basically constrained to finite domains as in the propositional calculus. Our language contains variables that can capture general classes of situations, events and relationships. A first-order language is also able to reason about potentially infinite classes and situations using constructs such as hidden Markov models(HMMs). Our language uses an Expectation-Maximization (EM) type learning of parameters. This has a natural fit with the Loopy Belief Propagation used for inference since both can be viewed as iterative message passing algorithms. We present the syntax and theoretical foundations for our Loopy Logic language. We then demonstrate three examples of stochastic modeling and diagnosis that explore the representational power of the language. A mechanical fault detection example displays how Loopy Logic can model time-series processes using an HMM variant. A digital circuit example exhibits the probabilistic modeling capabilities, and finally, a parameter fitting example demonstrates the power for learning unknown stochastic values.
TR-CS-2005-38
Multiple Target Tracking with Lazy Background Subtraction and Connected Component Analysis
Robert G. Abbott and Lance R. Williams
Background subtraction, binary morphology, and connected components analysis are the first processing steps in many vision-based tracking applications. Although background subtraction has been the subject of much research, it is typically treated as stand-alone process, dissociated from the subsequent phases of object recognition and tracking. This paper presents a method for decreasing computational cost in visual tracking systems by using track state estimates to direct and constrain image segmentation via background subtraction and connected components analysis. We also present a multiple target tracking application which uses the technique to achieve a large reduction in computation costs.
TR-CS-2005-37
Pretty Good BGP: Protecting BGP by Cautiously Selecting Routes
Josh Karlin, Stephanie Forrest, and Jennifer Rexford
The Border Gateway Protocol (BGP), the Internet's interdomain routing protocol, is vulnerable to a number of damaging attacks. Proposed solutions either (i) rely on a public-key infrastructure and accurate routing registries or (ii) detect attacks only after they have spread throughout the network. However, BGP routers could avoid selecting and propagating malicious routes if they were cautious about adopting new reachability information. We describe an enhancement to BGP, Pretty Good BGP (PGBGP), that slows the dissemination of malicious routes, providing network operators time to respond before the problem escalates into a large-scale Internet attack. Results show that realistic deployments of PGBGP could provide 99% of Autonomous Systems with 24 hours to investigate and repair malicious routes without affecting prefix reachability. The result also show that without PGBGP, 40% of ASs cannot avoid using malicious routes; with PGBGP, this number drops to less than 1%. Finally, we show that PGBGP is incrementally deployable and offers significant security benefits to early adopters and their customers.
TR-CS-2005-36
Scale Invariance in Road Networks
Vamsi Kalapala, Vishal Sanwalani, Aaron Clauset, and Cristopher Moore
We study the topological and geographic structure of the national road networks of the United States, England and Denmark. By transforming these networks into their dual representation, where roads are vertices and an edge connects two vertices if the corresponding roads ever intersect, we show that they exhibit both topological and geographic scale invariance. That is, we empirically show that the degree distribution of the dual is well-characterized by a power law with exponent 2.0 < alpha < 2.5, and that journeys, regardless of their length, have a largely identical structure. To explain these properties, we introduce and analyze a simple fractal model of road placement that reproduces the observed structure, and suggests a connection between the scaling exponent alpha and the fractal dimensions governing the placement of roads and intersections.
TR-CS-2005-35
Infiniband Scalability in Open MPI
Galen M. Shipman, Tim S. Woodall, Rich L. Graham, and Arthur B. Maccabe
Infiniband is becoming an important interconnect technology in high performance computing. Recent efforts in large scale Infiniband deployments are raising scalability questions in the HPC community. Open MPI, a new production grade implementation of the MPI standard, provides several mechanisms to enhance Infiniband scalability. Initial comparisons with MVAPICH, the most widely used Infiniband MPI implementation, show similar performance but with much better scalability characterics. Specifically, small message latency is improved by up to 10% in medium/large jobs and memory usage per host is reduced by as much as 300%. In addition, Open MPI provides predicatable latency that is close to optimal without sacrificing bandwidth performance.
TR-CS-2005-34
A Sublinear-Time Randomized Approximation Scheme for the
Robinson-Foulds Metric
Nicholas D. Pattengale and Bernard M.E. Moret
The Robinson-Foulds (RF) metric is the measure most widely used in comparing phylogenetic trees; it can also be computed in linear time using Day's algorithm. When faced with the need to compare large numbers of large trees, however, even linear time becomes prohibitive. In this paper, we present a randomized approximation scheme that provides, with high probability, a (1+epsilon) approximation of the true RF metric for all pairs of trees in a given collection. Our approach is to use a sublinear-space embedding of the trees, combined with an application of the Johnson-Lindenstrauss lemma to approximate vector norms very rapidly. We discuss the consequences of various parameter choices (in the embedding and in the approximation requirements). We also implemented our algorithm as a Java class that can easily be combined with popular packages such as Mesquite; in consequence, we present experimental results illustrating the precision and running-time tradeoffs as well as demonstrating the speed of our approach.
TR-CS-2005-33
Representation of Interwoven Surfaces in 2 1/2D Drawing
Keith Wiley and Lance R. Williams
The state-of-the-art in computer drawing programs is based on a number of concepts that are over two decades old. One such concept is the use of layers for ordering the surfaces in a 2 1/2D drawing from top to bottom. A 2 1/2D drawing is a drawing that depicts surfaces in a fundamentally two-dimensional way, but also represents the relative depths of those surfaces in the third dimension. Unfortunately, the current approach based on layers unnecessarily imposes a partial ordering on the depths of the surfaces and prevents the user from creating a large class of potential drawings, e.g., of Celtic knots and interwoven surfaces. In this paper we describe a novel approach which only requires local depth ordering of segments of the boundaries of surfaces in a drawing rather than a global depth relation between entire surfaces. Our program provides an intuitive user interface with a fast learning curve that allows a novice to create complex drawings of interwoven surfaces that would be extremely difficult and time-consuming to create with standard drawing programs.
TR-CS-2005-32
Combining Data Structures with Nonstably Infinite Theories using Many-Sorted Logic
Silvio Ranise, Christophe Ringeissen, and Calogero G. Zarba
Most computer programs store elements of a given nature into container-based data structures such as lists, arrays, sets, and multisets. To verify the correctness of these programs, one needs to combine a theory S modeling the data structure with a theory T modeling the elements. This combination can be achieved using the classic Nelson-Oppen method only if both S and T are stably infinite. The goal of this paper is to relax the stable infiniteness requirement. To achieve this goal, we introduce the notion of polite theories, and we show that natural examples of polite theories include those modeling data structures such as lists, arrays, sets, and multisets. Furthemore, we provide a method that is able to combine a polite theory S with any theory T of the elements, regardless of whether T is stably infinite or not. The results of this paper generalize to many-sorted logic those recently obtained by Tinelli and Zarba concerning the combination of shiny theories with nonstably infinite theories in one-sorted logic.
TR-CS-2005-31
A Tableau-Based Decision Procedure for a Fragment of Graph Theory Involving Reachability and Acyclicity
Domenico Cantone and Calogero G. Zarba
We study the decision problem for the language DGRA (directed graphs with reachability and acyclicity), a quantifier-free fragment of graph theory involving the notions of reachability and acyclicity. We prove that the language DGRA is decidable, and that its decidability problem is NP-complete. We do so by showing that the language enjoys a small model property: If a formula is satisfiable, then it has a model whose cardinality is polynomial in the size of the formula. Moreover, we show how the small model property can be used in order to devise a tableau-based decision procedure for DGRA.
TR-CS-2005-30
Model Selection and Model Complexity: Identifying Truth Within A Space Saturated with Random Models
Paul Helman
A framework for the analysis of model selection issues is presented. The framework separates model selection into two dimensions: the model-complexity dimension and the model-space dimension. The model-complexity dimension pertains to how the complexity of a single model interacts with its scoring by standard evaluation measures. The model-space dimension pertains to the interpretation of the totality of evaluation scores obtained. Central to the analysis is the concept of evaluation coherence, a property which requires that a measure not produce misleading model evaluations. Of particular interest is whether model evaluation measures are misled by model complexity. Several common evaluation measures --- apparent error rate, the BD metric, and MDL scoring --- are analyzed, and each is found to lack complexity coherence. These results are used to consider arguments for and against the Occam razor paradigm as it pertains to overfit avoidance in model selection, and also to provide an abstract analysis of what the literature refers to as oversearch.
TR-CS-2005-29
An Extensible Message-Oriented Offload Model for High-Performance Applications
Patricia Gilfeather and Arthur B. Maccabe
In this paper, we present and verify a new model designed to capture the benefits of protocol offload in the context of high performance computing systems. Other models capture the benefits of offload or the performance of parallel applications. However, the extensible message-oriented offload model (EMO) is the first model to emphasize the performance of the network protocol itself and models it in a message-oriented rather than flow-oriented manner. EMO allows us to consider benefits associated with the reduction in message latency along with benefits associated with reduction in overhead and improvements to throughput.
In order to verify EMO, we use the tool to model a very common offload technique, interrupt coalescing. We discuss the assumptions of our model and show the modeled offload and latency performance of interrupt coalescing and no interrupt coalescing. We then present preliminary results to verify that our model is accurate.
TR-CS-2005-28
The Triton Branch Predictor
Josh Karlin, Darko Stefanovic, and Stephanie Forrest
We describe a new branch predictor that is designed to balance multiple constraints---predicting branch biases versus predicting specific branch instance behavior. Most branch instances only require branch bias information for accurate predictions while a select few require more sophisticated prediction structures. Our predictor uses a cache mechanism to classify branches and dynamically adjust the balance of the predictor. On average, our predictor mispredicts 24% less often than YAGS and 19% less often than a global perceptron predictor with the same bit budget.
TR-CS-2005-27
Distributed Recommender Systems and the Network Topologies that Love Them
Hamilton Link, Jared Saia, Terran Lane, and Randall Laviolette
One approach to distributed recommender systems is to have users sample products at random and randomly query one another for the best item they have found. We have been considering refinements to this approach that take advantage of a communication network; users may share information only with their immediate neighbors, who either by design or by nature may have common interests. In the ``mailing list model,'' users with common interests form cliques, while in the ``word-of-mouth model'' the users are distributed randomly in a scale-free network (SFN). In both models, users tell their neighbors about satisfactory products as they are found. We have found that our distributed recommender systems benefit from the communication network in terms of the number of sampled items, the number of messages sent, and the amount of ``spam'' sent. In the course of developing these results, we have revisited the question of subgraph structure in SFNs and now have a more precise understanding of the character and parameters of random SFN subgraphs than has previously been published. This new result in power-law graphs gives a clearer picture of the resilience of SFNs to random node failures; in fact, their resilience appears to have been overstated in the literature. In the case of the widely-cited ``Internet resilience'' result, high failure rates actually lead to the orphaning of half or more of the surviving nodes and the complete disintegration of the giant component after 90% of the network has failed independently of the size of the network.
TR-CS-2005-26
Machine learning approaches to siRNA efficacy prediction.
Sahar Abubucker
RNA interference (RNAi) is being widely used to study gene expression and gene regulation via selective knockdown of gene expression, which is important for functional genomic studies. RNAi refers to the biological process by which short interfering RNA (siRNA) after incorporation into the RNA induced silencing complex (RISC) degrades complementary messenger RNA (mRNA) sequences. This knockdown of mRNA prevents it from producing amino acid sequences that are responsible for gene expression. Recent studies indicate that all siRNAs do not produce the same knockdown effects. Due to the high cost of synthesizing siRNAs and the extensive effort required to test siRNAs, rational siRNA design---a priori prediction of functionality for specific siRNAs---is a critical task for practical RNAi studies. Therefore, a key component of RNAi applications is the selection of effective siRNA sequences---ones that degrade at least 80% of the targeted mRNA. The goal of siRNA efficacy prediction is to aid in designing siRNA sequences that are highly efficient in degrading target mRNA sequences. Most of the current algorithms use positional features, energy characteristics, thermodynamic properties and secondary structure predictions to predict the knockdown efficacy of siRNAs. In this work, frequently occurring patterns in siRNAs are used to make predictions about their efficacy. Time after transfection is shown to be an important factor in siRNA efficacy prediction---a feature that has been ignored in previous efficacy studies. The relevance of these extracted patterns to previous results and their biological significance are analyzed. Random feature sets are generated and the ability of these sets to predict efficacy are studied and their results discussed. Our algorithm does not require any specialized hardware and consistently performs better than other publicly available efficacy prediction algorithms.
TR-CS-2005-25
A Continuous-Discontinuous Second-Order Transition in the Satisfiability of Random Horn-SAT Formulas
Cristopher Moore, Gabriel Istrate, Demetrios Demopoulos, and Moshe Y. Vardi
We compute the probability of satisfiability of a class of random Horn-SAT formulae, motivated by a connection with the nonemptiness problem of finite tree automata. In particular, when the maximum clause length is three, this model displays a curve in its parameter space along which the probability of satisfiability is discontinuous, ending in a second-order phase transition where it becomes continuous. This is the first case in which a phase transition of this type has been rigorously established for a random constraint satisfaction problem.
TR-CS-2005-24
Dynamic Bayesian Networks: Class Discriminative Structure Search with
an Application to Functional Magnetic Resonance Imaging Data
John Burge
The neuroanatomical behavior and underlying root causes of many mental illnesses are only moderately understood. To further understand these illnesses, the relationships among neuroanatomical regions of interest (ROIs) can be modeled. Neuroscientists have several standardized methods which have yielded much success, however, many of these methods make restrictive assumptions. On the other hand, the machine learning community has developed techniques that do not require many of these assumptions. I propose a method based on dynamic Bayesian networks (DBNs) that is capable of modeling the temporal relationships among ROIs. DBNs are graphical models that explicitly represent statistical correlations among random variables (RVs). Within the fMRI domain, it is not known a priori which ROIs are correlated. Thus, the DBN's structure modeling the ROI interactions must be learned. My research focuses on novel techniques capable of learning the structure of DBNs from a given set of data. First, I observe that under certain circumstances, learning multiple Bayesian networks, one for each possible class of data, is a better approach than the frequently employed method of learning a single BN with a class node. Second, I propose a method for determining which structures are representative of class-discriminating relationships. Third, I propose a structure search heuristic that takes advantage of the relationships among hierarchically related RVs. Finally, I propose to incorporate shrinkage (a statistical technique previously used in machine learning) in the calculation of the DBN's parameters. This proposal also summarizes preliminary work on the classification of functional magnetic resonance imaging data with my proposed scoring function.
TR-CS-2005-23
A Practical Approach to Significance Assessment of siRNA Off-target
Effects in RNA Interference
Wenzhong Zhao and Terran Lane
Detection of potential cross-hybridization (or cross-reaction) between a short oligonucleotide sequence and a longer (unintended) sequence is crucial for many biological applications, such as selecting PCR primers, microarray nucleotide probes or short interfering RNAs (siRNAs). In this paper, we propose a flexible framework for estimating the significance of siRNA off-target effects on untargeted transcripts (messager RNA, or mRNA) in the RNA interference (RNAi) process. The framework can also be extended to other applications with minor changes.
We have developed and implemented a new homology sequence search framework -- siRNA Off-target Search (SOS). SOS uses a hybrid, q-gram based approach, combining two filtering techniques using overlapping and non-overlapping q-grams. Our approach considers three types of imperfect matches based on biological experiments, namely G:U wobbles, mismatches, and bulges. The three main improvements over existing methods are: the introduction of a more general cost model (an affine bulge cost model) for siRNA-mRNA off-target alignment; the use of separate searches for alignments with and without bulges, that enables efficient discovery of potential off-target candidates in the filtration phase; and the use of position-preserving and order-preserving hit-processing techniques, that further improves the filtration efficiency. Overall, SOS achieves better performance, in terms of speed and recall/precision, than BLAST in detecting potential siRNA off-targets.
TR-CS-2005-22
A Framework for Orthology Assignment from Gene Rearrangement Data
Krister M. Swenson and Nicholas D. Pattengale and Bernard M.E. Moret
Gene rearrangements have successfully been used in phylogenetic reconstruction and comparative genomics, but usually under the assumption that all genomes have the same gene content and that no gene is duplicated. While these assumptions allow one to work with organellar genomes, they are too restrictive when comparing nuclear genomes. The main challenge is how to deal with gene families, specifically, how to identify orthologs. While searching for orthologies is a common task in computational biology, it is usually done using sequence data. We approach that problem using gene rearrangement data, provide an optimization framework in which to phrase the problem, and present some preliminary theoretical results.
TR-CS-2005-21
Evaluating EST Clustering with Simulated Data
Ke Ye and Bernard M.E. Moret
Expressed sequence tags (ESTs) provide a rapid and inexpensive approach to gene sequencing, gene discovery, and the characterization of gene regulation and alternate splicings. The first step in using ESTs is to cluster them, that is, to group them according to the gene that produced them. In these applications, clustering performance and quality are the most critical issues. Most efforts in EST clustering have focused on biological data, but such data make assessment of clustering quality and robustness difficult.
In this paper, we use the EST simulator of Hazelhurst and Bergheim, ESTSim, to study experimentally the accuracy and robustness of four different EST clustering tools, ESTate, N2tool, BlastClust, and simple hierarchical agglomerative clustering based on Smith-Waterman similarities, using the Rand index as a measure of clustering quality.
Our results indicate that N2tool consistently dominates BlastClust and that, in turn, ESTate and clustering based on Smith-Waterman similarities greatly outperform N2tool, with the much slower Smith-Waterman-based clustering slightly outperforming ESTate. We also find that, of the various parameters affecting EST generation that can be set in the ESTSim simulator (EST length, indels and substitutions, polymerase decay, stutter, and ligation), polymerase decay has by far the largest effect on the quality of clustering -- a finding that could lead to improved EST and assembly quality through the control of bench procedures.
TR-CS-2005-20
Explicit Multiregister Measurements for Hidden Subgroup Problems; or, Fourier Sampling Strikes Back
Cristopher Moore and Alexander Russell
We present an explicit measurement in the Fourier basis that solves an important case of the Hidden Subgroup Problem, including the case to which Graph Isomorphism reduces. This entangled measurement uses k=log_2 |G| registers, and each of the 2^k subsets of the registers contributes some information.
TR-CS-2005-19
For Distinguishing Conjugate Hidden Subgroups, the Pretty Good Measurement is as Good as it Gets
Cristopher Moore and Alexander Russell
Recently Bacon, Childs and van Dam showed that the ``pretty good measurement'' (PGM) is optimal for the Hidden Subgroup Problem on the dihedral group D_n in the case where the hidden subgroup is chosen uniformly from the n involutions. We show that, for any group and any subgroup H, the PGM is the optimal one-register experiment in the case where the hidden subgroup is a uniformly random conjugate of H. We go on to show that when H forms a Gel'fand pair with its parent group, the PGM is the optimal measurement for any number of registers. This generalizes the case of the dihedral group, and includes a number of other examples of interest.
TR-CS-2005-18
The Symmetric Group Defies Strong Fourier Sampling: Part II
Cristopher Moore and Alexander Russell
Part I of this paper showed that the hidden subgroup problem over the symmetric group---including the special case relevant to Graph Isomorphism---cannot be efficiently solved by strong Fourier sampling, even if one may perform an arbitrary POVM on the coset state. In this paper, we extend these results to entangled measurements. Specifically, we show that the hidden subgroup problem on the symmetric group cannot be solved by any POVM applied to pairs of coset states. In particular, these hidden subgroups cannot be determined by any polynomial number of one- or two-register experiments on coset states.
TR-CS-2005-17
The Symmetric Group Defies Strong Fourier Sampling: Part I
Cristopher Moore, Alexander Russell, and Leonard J. Schulman
We resolve the question of whether Fourier sampling can efficiently solve the hidden subgroup problem. Specifically, we show that the hidden subgroup problem over the symmetric group cannot be efficiently solved by strong Fourier sampling, even if one may perform an arbitrary POVM on the coset state. These results apply to the special case relevant to the Graph Isomorphism problem.
TR-CS-2005-16
Scale Invariance in Global Terrorism
Aaron Clauset, Maxwell Young
Traditional analyses of international terrorism have not sought to explain the emergence of rare but extremely severe events. Using the tools of extremal statistics to analyze the set of terrorist attacks worldwide between 1968 and 2004, as compiled by the National Memorial Institute for the Prevention of Terrorism (MIPT), we find that the relationship between the frequency and severity of terrorist attacks exhibits the "scale-free" property with an exponent of close to two. This property is robust, even when we restrict our analysis to events from a single type of weapon or events within major industrialized nations. We also find that the distribution of event sizes has changed very little over the past 37 years, suggesting that scale invariance is an inherent feature of global terrorism.
TR-CS-2005-15
Cayley-Dixon construction of Resultants of Multi-Univariate Composed Polynomials
Arthur D. Chtcherba, Deepak Kapur, Manfred Minimair
The Cayley-Dixon formulation for multivariate resultants have been shown to be efficient (both experimentally and theoretically) for computing resultants by simultaneously eliminating many variables from a polynomial system. In this paper, the behavior of Cayley-Dixon resultant construction and the structure of Dixon matrices is analyzed for composed polynomial systems constructed from a multivariate system in which each variable is substituted by a univariate polynomial in a distinct variable. It is shown that Dixon projection operator (multiple of resultant) of the composed system can be expressed as a power of the resultant of the outer polynomial system multiplied by powers of the leading coefficients of the univariate polynomials substituted for variables in the outer system. A new resultant formula is derived for systems where it is known that the Cayley-Dixon construction does not contain extraneous factors. The derivation of the resultant formula for the composed system unifies all the known related results in the literature. Furthermore, it demonstrates that the resultant of a composed system can be effectively calculated by considering only the resultant of the outer system. Since the complexity of resultant computation is typically determined by the degree (and support) of the polynomial system, resultants of a composed system can be computed much faster by focusing only on the outer system.
TR-CS-2005-14
Making Chord Robust to Byzantine Attacks
Amos Fiat, Jared Saia, Maxwell Young
Chord is a distributed hash table (DHT) that requires only O(log n) links per node and performs searches with latency and message cost O(log n), where n is the number of peers in the network. Chord assumes all nodes behave according to protocol. We give a variant of Chord which, for any fixed \epsilon_{0}>0, is resilient to (1/4-\epsilon_{0})*z Byzantine nodes joining the network over a time period during which 1) there are always at least z total nodes in the network and 2) the number of correct peers joining and leaving is no more than z^k for some tunable parameter k. We assume there is an omniscient and computationally unbounded adversary controlling the Byzantine peers and that the IP-addresses of all the Byzantine peers and the locations where they join the network are carefully selected by this adversary.
Our notion of resilience is rather strong in that we not only guarantee that searches can be performed but also that we can enforce any set of ``proper behavior'' such as contributing new material, etc. In comparison to Chord, the resources required by this new variant are only a polylogarithmic factor greater in communication, messaging, and linking costs.
TR-CS-2005-13
A First-Order Stochastic Prognostic System for the Diagnosis of Helicoptor Rotor Systems for the US Navy.
Chayan Chakrabarti, Roshan Rammohan and George F. Luger
We have created a diagnostic system for the US Navy to use in the analysis of the running health of helicopter rotor systems. Although our system is not yet deployed for real-time in-flight diagnosis, we have successfully analyzed the data sets of actual helicopter rotor failures supplied by the US Navy. We discuss both critical techniques supporting the design of our stochastic diagnostic system as well as issues related to full deployment.
Our diagnostic system, called DBAYES, is composed of a logic-based, first-order, and Turing-complete set of software tools for stochastic modeling. We use this language for modeling time-series data supplied by sensors on the mechanical system. The inference scheme for these software tools is based on a variant of Pearl's loopy belief propagation algorithm. Our language contains variables that can capture general classes of situations, events, and relationships. A Turing-complete language is able to reason about potentially infinite classes and situations, similar to the analysis of Dynamic Bayesian Networks. Since the inference algorithm is based on a variant of loopy belief propagation, the language includes the Expectation Maximization type learning of parameters in the modeled domain. In this paper we briefly present the theoretical foundations for our first-order stochastic language and then demonstrate time-series modeling and learning in the context of fault diagnosis.
TR-CS-2005-12
A First-Order Stochastic Modeling Language for Diagnosis
Chayan Chakrabarti and George F. Luger
We have created a logic-based, first-order, and Turing-complete set of software tools for stochastic modeling. Because the inference scheme for this language is based on a variant of Pearl's loopy belief propagation algorithm, we call it Loopy Logic. Traditional Bayesian belief networks have limited expressive power, basically constrained to that of atomic elements as in the propositional calculus. Our language contains variables that can capture general classes of situations, events, and relationships. A Turing-complete language is able to reason about potentially infinite classes and situations, with a Dynamic Bayesian Network. Since the inference algorithm for Loopy Logic is based on a variant of loopy belief propagation, the language includes an Expectation Maximization-type learning of parameters in the modeling domain. In this paper we briefly present the theoretical foundations for our loopy-logic language and then demonstrate several examples of stochastic modeling and diagnosis.
TR-CS-2005-11
Comparison of Garbage Collectors Operating in a Large Address Space
Sergiy Kyrylkov and Darko Stefanovic
We analyze the performance of several copying garbage collection algorithms in a large address space offered by modern architectures. In particular, we describe the design and implementation of the RealOF garbage collector, an algorithm explicitly designed to exploit the features of 64-bit environments. This collector maintains a correspondence between object age and object placement in the address space of the heap. It allocates and copies objects within designated regions of memory called zones and performs garbage collection incrementally by collecting one or more ranges of memory called windows. The windows are managed so as to collect middle-aged objects, rather than almost always collecting young objects, as with a generational collector. The address-ordered heap allows us to use the same inexpensive write barrier that works for generational collectors. We show that for server applications this algorithm improves throughput and reduces heap size requirements over the best-throughput generational copying algorithms such as the Appel-style generational collector.
TR-CS-2005-10
Generating Hard Satisfiable Formulas by Hiding Solutions Deceptively
Haixia Jia, Cristopher Moore, and Doug Strain
To test incomplete search algorithms for constraint satisfaction problems such as 3-SAT, we need a source of hard, but satisfiable, benchmark instances. A simple way to do this is to choose a random truth assignment A, and then choose clauses randomly from among those satisfied by A. However, this method tends to produce easy problems, since the majority of literals point toward the ``hidden'' assignment A. Last year, Achlioptas, Jia and Moore proposed a problem generator that cancels this effect by hiding both A and its complement. While the resulting formulas appear to be just as hard for DPLL algorithms as random 3-SAT formulas with no hidden assignment, they can be solved by WalkSAT in only polynomial time.
Here we propose a new method to cancel the attraction to A, by choosing a clause with t > 0 literals satisfied by A with probability proportional to q^t for some q < 1. By varying q, we can generate formulas whose variables have no bias, i.e., which are equally likely to be true or false; we can even cause the formula to ``deceptively'' point away from A. We present theoretical and experimental results suggesting that these formulas are exponentially hard both for DPLL algorithms and for incomplete algorithms such as WalkSAT.
TR-CS-2005-09
FINDMODEL: A tool to select the best-fit model of nucleotide substitution
Tao, N., Bruno, W.J., Abfalterer, W., Moret, B.M.E., Leitner, T., and Kuiken, C.
Choosing a model of sequence evolution is a crucial step when using DNA sequence data to reconstruct phylogenies: using a mismatched model will reduce accuracy and may lead to erroneous conclusions. FINDMODEL is a web-based tool for selecting a model of DNA (nucleotide) evolution; it is designed to be easy to use by researchers who do some sequencing and may not have access to phylogenetic packages. FINDMODEL can analyze 28 models or a restricted subset of 12 models. It creates a guide tree using Weighbor, optimizes branch lengths, calculates the likelihood for every chosen model (using baseml from the PAML package), and computes the Akaike information criterion (AIC). The model with the smallest AIC score is considered to be the best-fit model. Because of server limitations, the FINDMODEL web server processes inputs above a certain size in non-interactive mode, sending an email to the user when it has analyzed that user's data and providing a down-loadable file with the results. To test the performance of FINDMODEL, we generated simulated DNA sequences with Seq-Gen under four different models of nucleotide substitution of different complexity and compared the inferred model with the true model. We used 17 different configurations, with 5 instances for each set of parameter values. FINDMODEL returned the correct model for 75% of our test instances and also returned the correct model, but with variable site-specific rates instead of homogeneous rates for another 8%. Moreover, on all tests where FINDMODEL did not return the correct model, the normalized AIC error between the correct and the predicted models was below 0.002 (and the actual AIC difference was below 7).
TR-CS-2005-08
Inferring Ancestral Chloroplast Genomes with Duplications
Liying Cui, Jijun Tang, Bernard M. E. Moret, and Claude dePamphilis
Motivation: Genome evolution is shaped not only by nucleotide substitutions but also by structural changes including gene and genome duplications, insertions and deletions, and gene-order rearrangements. Reconstruction of phylogeny based on gene-order changes has been limited to cases where equal gene content or few deletions can be assumed. Since conserved duplicated regions are present in many genomes, the inference of duplication is needed in ancestral genome reconstruction.
Results: We apply GRAPPA-IR, a modified GRAPPA algorithm,to reconstruct ancestral chloroplast genomes containing duplicated genes. A test of GRAPPA-IR using six divergent chloroplast genomes from land plants and green algae recovers the phylogeny congruent with prior studies, while analyses that do not consider the duplications fail to obtain the accepted topology. The ancestral genome structure suggests that genome rearrangement in chloroplasts is probably limited by inverted repeats with a conserved core region. In addition, the boundaries of inverted repeats are hot spots for gene duplications or deletions.
TR-CS-2005-07
Towards practical biomolecular computers using microfluidic deoxyribozyme logic gate networks
Joseph Farfel and Darko Stefanovic
We propose a new way of implementing a biomolecular computer in the laboratory, using deoxyribozyme logic gates inside a microfluidic reaction chamber. We build upon our previous work, which simulated the operation of a deoxyribozyme-based flip-flop and oscillator in a continuous stirred-tank reactor (CSTR); unfortunately, using these logic gates in a laboratory-size CSTR is too expensive because the reagent volume is too large. For a realistic microfluidic design, the properties of microfluidic flow and mixing have to be taken into account. We describe the differences between a macrofluidic system such as the CSTR and the microfluidic setting. Liquid in a microfluidic setting exhibits laminar flow, and is more difficult to mix than in the CSTR. We use a rotary mixer, and examine how it operates so that we may properly model it. We discuss the details of our mixer simulation, including our diffusion model. We discuss why having discrete phases of influx/efflux (``charging'') and mixing is necessary, and how it changes the kinetics of the system. We then show the result of simulating both a flip-flop and an oscillator inside our rotary mixing chamber, and discuss the differences in results from the CSTR simulation.
TR-CS-2005-06
Designing Nucleotide Sequences for Computation: A Survey of Constraints
Jennifer Sager and Darko Stefanovic
We survey the biochemical constraints useful for the design of DNA code words for DNA computation. We define the DNA/RNA Code Constraint problem and cover biochemistry topics relevant to DNA libraries. We examine which biochemical constraints are best suited for DNA word design.
TR-CS-2005-05
Consensus methods using phylogenetic databases
Mahesh M. Kulkarni and Bernard M.E. Moret
Motivation: Consensus methods summarize the information from many phylogenetic trees into one tree. With the increasing use of phylogenies throughout the life sciences and with the advent of the Tree of Life project, however, the output of tree reconstruction programs must be stored for future reference, in which case post-tree analyses such as consensus should be run from the database. We set out to determine whether such analyses can be run at a reasonable cost; we chose consensus because of its general applicability and because it creates a severe demand on the database by requiring examination of every edge of every tree.
Methodology: We preprocess the data (the collections of trees) to create a new table in the database in order to support consensus computations; to support the preprocessing step, we extend the PhyloDB schema proposed by Nakhleh et al. For each of the three main consensus methods (strict, majority, and greedy consensus), we compare the PhyloDB-based computation with the computation performed on trees stored in main memory using the Phylip consensus programs. We used a large selection of datasets of varying sizes (up to 1,000 trees of up to 1,500 taxa each) and of varying degrees of commonality. Preprocessing and storing the data in an organized manner in a database is expensive: we only tallied the cost of computing the consensus from the trees stored in the database, as a researcher might do long after the original trees were generated.
Results: The computations from the database are very practical: they often run faster, and never run more than five times slower, than the computations in main memory using Phylip. The additional storage costs are easily handled by any database system, while the preprocessing costs remain reasonable (several hours for our largest datasets). Thus suitable preprocessing of phylogenetic data allows post-tree analyses to be run directly from the database at much the same cost as current memory-resident analyses.
TR-CS-2005-04
On the Bias of Traceroute Sampling, or, Power-law Degree Distributions in Regular Graphs
Dimitris Achlioptas, Aaron Clauset, David Kempe, and Cristopher Moore
Understanding the structure of the Internet graph is a crucial step for building accurate network models and designing efficient algorithms for Internet applications. Yet, obtaining its graph structure is a surprisingly difficult task, as edges cannot be explicitly queried. Instead, empirical studies rely on traceroutes to build what are essentially single-source, all-destinations, shortest-path trees. These trees only sample a fraction of the network's edges, and a recent paper by Lakhina et al. found empirically that the resuting sample is intrinsically biased. For instance, the observed degree distribution under traceroute sampling exhibits a power law even when the underlying degree distribution is Poisson.
In this paper, we study the bias of traceroute sampling systematically, and, for a very general class of underlying degree distributions, calculate the likely observed distributions explicitly. To do this, we use a continuous-time realization of the process of exposing the BFS tree of a random graph with a given degree distribution, calculate the expected degree distribution of the tree, and show that it is sharply concentrated. As example applications of our machinery, we show how traceroute sampling finds power-law degree distributions in both delta-regular and Poisson-distributed random graphs. Thus, our work puts the observations of Lakhina et al. on a rigorous footing, and extends them to nearly arbitrary degree distributions.
TR-CS-2005-03
Finding local community structure in networks
Aaron Clauset
Although the inference of global community structure in networks has recently become a topic of great interest in the physics community, all such algorithms require that the graph be completely known. Here, we define both a measure of local community structure and an algorithm that infers the hierarchy of communities that enclose a given vertex by exploring the graph one vertex at a time. This algorithm runs in time O(d*k^2) for general graphs when d is the mean degree and k is the number of vertices to be explored. For graphs where exploring a new vertex is time-consuming, the running time is linear, O(k). We show that on computer-generated graphs this technique compares favorably to algorithms that require global knowledge. We also use this algorithm to extract meaningful local clustering information in the large recommender network of an online retailer and show the existence of mesoscopic structure.
TR-CS-2005-02
Quartet methods for phylogeny reconstruction from gene orders
Tao Liu, Jijun Tang, and Bernard M.E. Moret
Phylogenetic reconstruction from gene-rearrangement data has attracted increasing attention from biologists and computer scientists. Methods used in reconstruction include distance-based methods, parsimony methods using sequence-based encodings, and direct optimization. The latter, pioneered by Sankoff and extended by us with the software suite GRAPPA, is the most accurate approach; however, its exhaustive approach means that it can be applied only to small datasets of fewer than 15 taxa. While we have successfully scaled it up to 1,000 genomes by integrating it with a disk-covering method (DCM-GRAPPA), the recursive decomposition may need many levels of recursion to handle datasets with 1,000 or more genomes. We thus investigated quartet-based approaches, which directly decompose the datasets into subsets of four taxa each; such approaches have been well studied for sequence data, but not for gene-rearrangement data. We give an optimization algorithm for the NP-hard problem of computing optimal trees for each quartet, present a variation of the dyadic method (using heuristics to choose suitable short quartets), and use both in simulation studies. We find that our quartet-based method can handle more genomes than the base version of GRAPPA, thus enabling us to reduce the number of levels of recursion in DCM-GRAPPA, but is more sensitive to the rate of evolution, with error rates rapidly increasing when saturation is approached.
TR-CS-2005-01
A Study of Garbage Collection With a Large Address Space for Server Applications
Sergiy Kyrylkov and Darko Stefanovic
We analyze the performance of several copying garbage collection algorithms in a large address space offered by modern architectures. In particular, we describe the design and implementation of the RealOF garbage collector, an algorithm explicitly designed to exploit the features of 64-bit environments. This collector maintains a correspondence between object age and object placement in the address space of the heap. It allocates and copies objects within designated regions of memory called zones and performs garbage collection incrementally by collecting one or more ranges of memory called windows. The address-ordered heap allows us to use the same inexpensive write barrier that works for traditional generational collectors. We show that for server applications this algorithm improves throughput and reduces heap size requirements over the best-throughput generational copying algorithms such as the Appel-style generational collector.
TR-CS-2004-35
Finding community structure in very large networks
Aaron Clauset, M. E. J. Newman, and Cristopher Moore
The discovery and analysis of community structure in networks is a topic of considerable recent interest within the physics community, but most methods proposed so far are unsuitable for very large networks because of their computational cost. Here we present a hierarchical agglomeration algorithm for detecting community structure which is faster than many competing algorithms: its running time on a network with n vertices and m edges is O(m d log n) where d is the depth of the dendrogram describing the community structure. Many real-world networks are sparse and hierarchical, with m ~ n and d ~ log n, in which case our algorithm runs in essentially linear time, O(n log^2 n). As an example of the application of this algorithm we use it to analyze a network of items for sale on the web-site of a large online retailer, items in the network being linked if they are frequently purchased by the same buyer. The network has more than 400,000 vertices and 2 million edges. We show that our algorithm can extract meaningful communities from this network, revealing large-scale patterns present in the purchasing habits of customers.
TR-CS-2004-34
Computational Challenges from the Tree of Life
B.M.E. Moret
A phylogeny is a reconstruction of the evolutionary history of a group of organisms. Phylogenies are used throughout the life sciences, as they offer a structure around which to organize the knowledge and data accumulated by researchers. Computational phylogenetics has been a rich area for algorithm design over the last 15 years. The biology community has embarked on an enormously ambitious project, the assembly of the Tree of Life---the phylogeny of all organisms on this planet. This project presents a true computational grand challenge: current phylogenetic methods can barely handle a few hundred organisms, yet the Tree of Life has an estimated 10--100 million organisms. Thus, while data collection and cataloguing is one of the major tasks, algorithm design and engineering is a crucial enabling component of the project. In this paper, I briefly introduce the principles of phylogenetic analysis, then discuss the computational challenges posed by the Tree of Life project, from the design and validation of computationally useful models of evolution to the actual computation and assessment of the Tree of Life itself.
TR-CS-2004-33
Building the components for a biomolecular computer
Clint Morgan, Darko Stefanovic, Cristopher Moore and Milan N. Stojanovic
We propose a new method for amorphous bio-compatible computing using deoxyribozyme logic gates in which oligonucleotides act as enzymes on other oligonucleotides, yielding oligonucleotide products. Moreover, these reactions can be controlled by inputs that are also oligonucleotides. We interpret these reactions as logic gates, and the concentrations of chemical species as signals. Since these reactions are homogeneous, i.e., they use oligonucleotides as both inputs and outputs, we can compose them to construct complex logic circuits. Thus, our system for chemical computation offers functionality similar to conventional electronic circuits with the potential for deployment inside of living cells. Previously, this technology was demonstrated in closed-system batch reactions, which limited its computational ability to simple feed-forward circuits. In this work, we go beyond closed systems, and show how to use thermodynamically open reactors to build biomolecular circuits with feedback. The behavior of an open chemical system is determined both by its chemical reaction network and by the influx and efflux of chemical species. This motivates a change in design process from that used with closed systems. Rather than focusing solely on the stoichiometry of the chemical reactions, we must carefully examine their kinetics. Systems of differential equations and the theory of dynamical systems become the appropriate tools for designing and analyzing such systems. Using these tools, we present an inverter. Next, by introducing feedback into the reaction network, we construct devices with a sense of state. We show how a combination of analytical approximation techniques and numerical methods allows us to tune the dynamics of these systems. We demonstrate a flip-flop which exhibits behavior similar to the RS flip-flop of electronic computation. It has two states in which the concentration of one oligonucleotide is high and the other is low or vice versa. We describe how to control the state of the flip-flop by varying the concentration of the substrates. Moreover, there are large regions of parameter space in which this behavior is robust, and we show how to tune the influx rates as a function of the chemical reaction rates in a way that ensures bistability.
TR-CS-2004-32
Efficient RNAi-Based Gene Family Knockdown via Set Cover Optimization
Wenzhong Zhao, M. Leigh Fanning and Terran Lane
RNA interference (RNAi) is a recently discovered genetic immune system with widespread therapeutic and genomic applications. In this paper, we address the problem of selecting an efficient set of initiator molecules (siRNAs) for RNAi-based gene family knockdown experiments. Our goal is to select a minimal set of siRNAs that (a) cover a targeted gene family or a specified subset of it, (b) do not cover any untargeted genes, and (c) are individually highly effective at inducing knockdown. We show that the problem of minimizing the number of siRNAs required to knock down a family of genes is NP-Hard via a reduction to the set cover problem. We also give a formal statement of a generalization of the basic problem that incorporates additional biological constraints and optimality criteria. We modify the classical branch-and-bound algorithm to include some of these biological criteria. We find that, in many typical cases, these constraints reduce the search space enough that we are able to compute exact minimal siRNA covers within reasonable time. For larger cases, we propose a probabilistic greedy algorithm for finding minimal siRNA covers efficiently. Our computational results on real biological data show that the probabilistic greedy algorithm produces siRNA covers as good as the branch-and-bound algorithm in most cases. Both algorithms return minimal siRNA covers with high predicted probability that the selected siRNAs will be effective at inducing knockdown. We also examine the role of "off-target" interactions - the constraint of avoiding covering untargeted genes can, in some cases, substantially increase the complexity of the resulting solution. Overall, however, we find that in many common cases, our approach significantly reduces the number of siRNAs required in gene family knockdown experiments, as compared to knocking down genes independently.
TR-CS-2004-31
On Algorithms for Choosing a Random Peer
Valerie King, Scott Lewis and Jared Saia
In this paper, we study the problem of choosing a peer uniformly at random from the set of all peers in a distributed hash table(DHT). We present two new algorithms to solve this problem and show that these algorithms have good theoretical and empirical properties.
TR-CS-2004-30
Linear Programming for Phylogenetic Reconstruction Based on Gene Rearrangements
Jijun Tang and Bernard M.E. Moret
Phylogenetic reconstruction from gene rearrangements has attracted increasing attention from biologists and computer scientists over the last few years. Methods used in reconstruction include distance-based methods, parsimony methods using seque