
BigData: PosterSessionBigData2013

Join us for a special poster session, where students from the Big Data and Data Mining courses will present their work addressing interesting questions in data science.

When: Dec 13th from 1:00 to 3:00 PM

Where: Center for Advanced Research Computing (CARC)

1601 Central Ave. NE, 87106


Featured posters

Capturing synchronized stock price movements using a growing neural gas network

Taylor Berger (undergraduate student)

The dynamic nature of stock prices can make it difficult to provide a lasting analysis of stocks that behave similarly. A growing neural gas (GNG) network simulates diffusion through a Euclidean space to form connected components of nodes that mimic the underlying data set. Because GNG networks are dynamic, they can approximate any set of data and track changes in the data as those changes occur. They also have the useful property of forming multiple, disconnected components as these are detected in the data, and of merging components when the separate clusters in the data no longer exist. The disconnected sub-networks can be treated as distinct classifications or clusters, so the GNG network dynamically assigns classifications in a manner similar to other unsupervised clustering techniques. In this project, I use a GNG network to form disconnected sub-networks of stocks that behave similarly over time, to gain a better understanding of which sets of stocks would represent a more diverse investment portfolio.
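As a rough illustration of the mechanism the abstract describes, the core GNG adaptation loop can be sketched in plain Python. Parameter values, the function names, and the two-cluster test below are my own assumptions for illustration, not the poster's implementation:

```python
import math
import random

def gng(data, max_nodes=20, eps_b=0.2, eps_n=0.006, age_max=50,
        lam=25, alpha=0.5, decay=0.995, seed=0):
    """Minimal Growing Neural Gas sketch: nodes adapt toward inputs,
    edges age and expire, and new nodes are inserted periodically."""
    rng = random.Random(seed)
    nodes = [list(rng.choice(data)), list(rng.choice(data))]
    error = [0.0, 0.0]
    edges = {}  # frozenset({i, j}) -> age
    for step, x in enumerate(data, 1):
        # find the two units nearest to the input
        dists = [math.dist(x, w) for w in nodes]
        s1, s2 = sorted(range(len(nodes)), key=dists.__getitem__)[:2]
        error[s1] += dists[s1] ** 2
        # age edges touching the winner; refresh the winner-runner-up edge
        for e in edges:
            if s1 in e:
                edges[e] += 1
        edges[frozenset((s1, s2))] = 0
        # move the winner and its topological neighbours toward x
        for k, w in enumerate(nodes):
            if k == s1:
                rate = eps_b
            elif frozenset((s1, k)) in edges:
                rate = eps_n
            else:
                continue
            for i in range(len(w)):
                w[i] += rate * (x[i] - w[i])
        # drop edges that grew too old (this is what disconnects clusters)
        edges = {e: a for e, a in edges.items() if a <= age_max}
        # periodically insert a node between the worst unit and its worst neighbour
        if step % lam == 0 and len(nodes) < max_nodes:
            q = max(range(len(nodes)), key=error.__getitem__)
            nbrs = [k for e in edges if q in e for k in e if k != q]
            if nbrs:
                f = max(nbrs, key=error.__getitem__)
                r = len(nodes)
                nodes.append([(a + b) / 2 for a, b in zip(nodes[q], nodes[f])])
                error.append(error[q] * alpha)
                error[q] *= alpha
                error[f] *= alpha
                edges.pop(frozenset((q, f)), None)
                edges[frozenset((q, r))] = 0
                edges[frozenset((f, r))] = 0
        error = [e * decay for e in error]
    return nodes, edges
```

Feeding this two well-separated clusters of points yields nodes that settle inside the clusters; in a full implementation the connected components of the edge set are then read off as the cluster assignments.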

Geometric Algorithms and Data Structures for Parallelizing Simulations of Diffusion Limited Reactions

Shaun Bloom (PhD student)

I present a parallelized version of an algorithm that simulates radiation damage to living biological tissue. Radiation damage is represented in terms of the concentrations of certain radiolytic species one microsecond after high-energy radiation is deposited into the simulation environment, which consists of an infinite source of H2O molecules. The algorithm is inspired by kinetic data structures and randomization. It uses a layered directed acyclic graph, where each layer is a hash table storing the particles; the hash tables allow the closest pair to be located in linear expected time. To avoid discretizing time and rebuilding the data structure, a priority queue keeps track of the events of interest (e.g., a reaction). With these techniques, the serial version of the code achieves a simulation time of O((n+k) log n), where O(n log n) is the time to initialize the data structure, k is the total number of events of interest, and O(log n) is the time to update the priority queue in each iteration. Parallelized versions of the proximity grid and kinetic data structures are taken from the original work [1] and are described in this report. Preliminary versions of this algorithm using mutexes were disappointing, offering a speedup of only ~2 with 50,000 input particles and 8 hyper-threaded cores. As a result, a method of particle transport between spatially separated parallelized regions, based on Fick's second law of diffusion, was developed that eliminates the need for mutexes. Parallelization is achieved with OpenMP, and the run time of the algorithm is O((n+k) log n / p), where p is the number of processors, a near-perfect theoretical speedup.
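The priority-queue pattern the abstract relies on — processing reaction events in time order while lazily discarding events invalidated by earlier reactions — can be illustrated with a much simpler 1-D toy: particles moving at constant velocity that annihilate pairwise on contact. The toy model and all names here are my own assumptions, not the poster's code:

```python
import heapq

def react(positions, velocities, t_end):
    """Event-driven sketch: predicted collision times go into a priority
    queue; popped events whose particles already reacted are skipped
    (lazy deletion), so the structure is never rebuilt."""
    n = len(positions)
    alive = [True] * n
    pq = []

    def schedule(i, j):
        # time at which particle i (left) catches particle j (right), if ever
        dv = velocities[i] - velocities[j]
        if dv > 0:
            heapq.heappush(pq, ((positions[j] - positions[i]) / dv, i, j))

    for i in range(n - 1):
        schedule(i, i + 1)

    reactions = []
    while pq:
        t, i, j = heapq.heappop(pq)
        if t > t_end:
            break
        if not (alive[i] and alive[j]):
            continue  # stale event: a partner reacted earlier
        alive[i] = alive[j] = False
        reactions.append((t, i, j))
        # the particles flanking the reacted pair become neighbours
        left = next((k for k in range(i - 1, -1, -1) if alive[k]), None)
        right = next((k for k in range(j + 1, n) if alive[k]), None)
        if left is not None and right is not None:
            schedule(left, right)
    return reactions
```

Each pop and push costs O(log n), matching the per-event O(log n) priority-queue update cost the abstract quotes.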

Differential gene expression between two subsets of muscle in Drosophila melanogaster: A tour through computational genetics and next generation sequencing

Tonya Brunetti (PhD student)

Drosophila melanogaster, the common fruit fly, shares approximately 75% of the genes associated with human disease alleles; its fast reproduction period and the simplicity of the system also make it a suitable candidate for studying genetic regulation. By isolating RNA from two different muscle samples, the quantification of specific transcripts and certain isoforms can distinguish the genetic contrast between the two muscle types. This has potential for understanding human muscle myopathies, such as muscular dystrophies and cardiomyopathies, and the complex yet intricate pathways by which these genes are regulated.

Although next generation sequencing has been advantageous to the scientific community by producing fast and immense quantities of data, handling and programming such large data sets has been problematic. Working in UNIX, I have utilized an open-source platform, the Tuxedo Suite, made for the genetic analysis of next generation sequencing datasets. The algorithms implemented in the suite assess each transcript and determine the probability that it is properly aligned to a provided reference genome. Since each transcript can be identified by a unique key, the total transcript abundance can be calculated using a program similar in function to MapReduce. Additionally, the well-annotated nature of the Drosophila genome makes it possible to write clustering algorithms that identify genes sharing conserved motifs and similar functionality. This is therefore a method for examining how the transcriptome changes between different samples.
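The MapReduce-like abundance step can be sketched as the classic map/shuffle/reduce pattern over keyed alignments. This is a toy sketch: the function names and example records are assumptions for illustration, not the Tuxedo Suite's internals:

```python
from collections import defaultdict

def map_phase(alignments):
    # map: emit (transcript_key, 1) for every aligned read
    for read_id, transcript_key in alignments:
        yield transcript_key, 1

def shuffle(pairs):
    # shuffle: group emitted values by transcript key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: transcript abundance = total reads mapped to that key
    return {key: sum(values) for key, values in groups.items()}
```

Running the three phases in sequence turns a stream of read alignments into per-transcript counts, which is exactly the word-count shape that MapReduce frameworks parallelize.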

Discovering correlation of economic health and educational opportunities around the world

Miguel Gallegos (MS student)

During economic downturns, many people return to school in hopes of broadening their horizons and boosting their employability. Even without downturns, there seems to be an implicit promise of a better life with additional education. It seems obvious that education would produce a better quality of life, not only financially for the individual but for a nation as a whole. In this analysis I show two types of trends based on raw data from the Bureau of Labor Statistics and The World Data Bank.

To achieve this, I use R's analytical tools coupled with the Hadoop framework for processing large data sets. The results show that countries that place a higher value on education tend to do better economically, and that people with degrees in the health and engineering sciences tend to have stable employment even during economic hardships.

Hierarchical Data Model Designer

Jacob Hobbs (MS student)

Survey data is among the most ubiquitous, rich sources of information available in the world. It is collected by virtually all major world powers, and great efforts have been expended to enable such data to be useful. Statistical data and metadata exchange formats and the semantic web have been pursued in Europe to enable linking of comparable datasets. In the US, FIPS and more recently the GNIS standards have likewise allowed for better linkage of a variety of datasets.

Still, the data made available by the US government has had its limitations. They have provided thousands of tables, and tools, such as FactFinder, to enable users to search for specific information they are interested in across multiple data sources. This kind of information is useful for certain groups, like state government agencies, which have specific data sets in mind and are generally only interested in some of the most recently available data. Websites like the Social Explorer and the Integrated Health Interview Series have sprung up to allow easier parsing and more uniformity among chronologically disparate data. USA Today has tools that allow subsets of census data to be mined and graphed for use by reporters. The Census Bureau itself has also released an API and encouraged developers to make applications that leverage available data and present it in more useful ways than are currently available.

I have attempted to do just that. Utilizing the Census API, the TIGER database and the OpenMap Java toolkit, I have attempted to integrate an interactive map with census data to provide a tool that makes it easier to generate subsets of census data at various granularities across time, to be used both as inputs and as targets for simulations. The overarching goal for facilitating multiple granularities is to enable me to write models that can be run, tested and tweaked at multiple granularities, and thus be more useful than a single model that only works for a single granularity. I expect better models to be more capable of capturing higher granularity trends and deviations than simpler ones, and still sum to the simpler lower granularity averages and totals.

Real-Time Analysis with Twitter Streaming API

Ronald Shaw (undergraduate student) and Khabbab Saleem (undergraduate student)

In this experiment, we looked at real-time analysis of data coming directly from Twitter, specifically from people tweeting about soccer games. This was an example of the possibilities opened up by combining the Twitter streaming API with Hadoop data analysis. More specifically, we were able to move this data directly to the Hadoop distributed file system while streaming from Twitter, bypassing the problem of storing the information and transferring it later. This type of analysis can open up powerful new avenues for immediate data analysis of specific topics or in specific areas. We could easily take the retrieved data and find key marketing areas for teams based on fans tweeting from those areas. We can also perform temporal analysis of local maxima in the number of tweets to identify popular timeframes of the game. These are only trivial examples of what can be accomplished with real-time analysis of such a large data set, which will inevitably lead to more time-sensitive information being processed.
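The temporal analysis mentioned above — finding local maxima in the number of tweets — can be sketched as binning timestamps and comparing each bin with its neighbours. This is a minimal sketch; the bin width and function name are my assumptions, not the authors' code:

```python
from collections import Counter

def tweet_peaks(timestamps, bin_seconds=60):
    """Bin tweet timestamps (seconds) and return the start times of bins
    that are local maxima, i.e. busier than both neighbouring bins."""
    counts = Counter(int(t) // bin_seconds for t in timestamps)
    if not counts:
        return []
    lo, hi = min(counts), max(counts)
    series = [counts.get(b, 0) for b in range(lo, hi + 1)]
    peaks = []
    for k, c in enumerate(series):
        left = series[k - 1] if k > 0 else 0
        right = series[k + 1] if k + 1 < len(series) else 0
        if c > left and c > right:
            peaks.append((lo + k) * bin_seconds)
    return peaks
```

On a stream of match tweets, the returned bin start times would correspond to moments like goals or red cards, where tweet volume spikes above its surroundings.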

Biomedical Text Mining: Application for Investigating Brain Disorders

Mohammad R. Arbabshirani (PhD student)

The rapid growth of the neuroscience literature has led to major advances in our understanding of human brain function and several brain disorders, but has also made it increasingly difficult to aggregate and synthesize neuroimaging findings. The sheer volume of literature makes it impossible for a researcher to summarize the main findings. Recent advances in computational power, parallel processing, artificial intelligence, machine learning, statistical pattern recognition, natural language processing, and big data analysis have made it possible to conduct meta-analyses of biomedical papers. The goal of this study is to analyze biomedical data from PubMed and extract useful and interesting information about one of the most devastating brain disorders: schizophrenia.

Analysis of Public Data on Facebook

Amir Arbabshirani

Facebook reports about 1.1 billion active users accessing the website each month, and this number is growing daily. Most user activity involves sharing personal information such as education, work, likes, photos, videos, and locations. Facebook does provide privacy settings to allow users to manage the visibility of their information; however, most people find these settings complex and sometimes confusing, and some users are not aware of all the privacy features available to them. All of this usually leads to undesirable sharing of information, which can be exploited by websites, apps, and people, causing serious privacy issues.

In this experiment I analyze publicly available information on Facebook. Due to Facebook's limitations, it is not possible to access all the attributes of a profile through the Graph API, even if they are public. Therefore, I concentrated the data analysis on visible data, which includes sex, friend count, wall count, last profile update time, and affiliations. I have followed two approaches. First, I analyze the data from a privacy point of view to determine what percentage of people in my data set have their education, work, friends, and wall visible; I am also interested in comparing the last profile update times of more public people with those of more private ones. Second, I look for interesting correlations between the following attribute pairs: (number of friends, wall count), (sex, number of friends), (sex, wall count), (college, number of friends), and (sex, last profile update time).
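For the attribute-pair correlations, a plain Pearson coefficient over two columns of the crawled data suffices. A minimal sketch — the function name and example data are assumptions, not the poster's code:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two numeric attribute columns,
    e.g. friend count vs. wall count for each profile."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Categorical attributes like sex or college would first be encoded numerically (or compared group-wise) before a correlation like this is meaningful.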

Page last modified on December 10, 2013, at 12:43 PM EST