Research

My research interests include:

  • Exploring machine learning techniques applied to data-intensive scientific problems.
  • Using emerging distributed systems, mobile devices, and crowdsourcing for pervasive health care.
  • Introducing automated decision making into global distributed computing environments.
  • Applying intelligent approaches to scheduling in distributed systems.

For my PhD I worked on the DAPLDS project led by Dr. Michela Taufer. The goal of this project is to explore the multi-scale nature of algorithmic adaptations in protein-ligand docking by using distributed volunteer computing systems. In particular, I was in charge of Docking@Home, a volunteer computing system for protein-ligand docking. My research in this area includes detecting native-like docking structures under uncertainty, as well as finding opportunities to improve the performance of distributed systems by making them adaptive.

Selected projects:

2012 Providing application-aware self-management to volunteer computing (VC) systems. Although most distributed systems already provide some degree of self-management, they do not consider application-specific optimization of parameters. We therefore proposed an autonomic, modular framework for multiscale applications running on distributed systems. In particular, we developed:

  • A novel multidimensional, tree-like stream-mining algorithm called Knowledge Organization Tree (KOTree). It organizes statistical information about VC applications for efficient exploration and prediction of application parameters, and it is built and updated at runtime. The resulting predictive model provides VC systems with effective strategies to drive workload reconfiguration.
  • A modular framework integrating KOTree for providing application-aware self-management in VC that can be easily adapted to different distributed systems and applications with diverse scientific goals.

2011 Linear octree clustering and data mapping: a scalable and accurate method for classifying protein-ligand binding geometries using MapReduce. This is a scalable and accurate three-step algorithm for classifying protein-ligand binding geometries in molecular docking. We analyzed results for docking, cross-docking, and ensemble docking for a series of HIV protease inhibitors. The algorithm demonstrates significant improvement over energy-based scoring for the accurate identification of near-native ligand structures. The advantages of our approach make it attractive for applications in real-world drug design.
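As a rough illustration of the octree idea only (not the published implementation, whose three steps are not detailed here), the sketch below assumes each docked conformation has already been reduced to a single 3D point. It assigns every point a linear octree key and reports the most populated cell. The function names `octant_key` and `densest_cell`, the bounding box, and the fixed depth are all hypothetical choices made for this example.

```python
from collections import Counter


def octant_key(point, lo, hi, depth):
    """Return the octree cell of a 3D point as a string of octant digits (0-7).

    lo and hi are the corners of the bounding box (assumed to contain the point).
    Each digit records which of the eight child cells the point falls into at
    one level of subdivision, so a longer key means a smaller cell.
    """
    lo, hi = list(lo), list(hi)
    digits = []
    for _ in range(depth):
        octant = 0
        for axis in range(3):
            mid = (lo[axis] + hi[axis]) / 2.0
            if point[axis] >= mid:
                octant |= 1 << axis  # upper half along this axis
                lo[axis] = mid
            else:
                hi[axis] = mid
        digits.append(str(octant))
    return "".join(digits)


def densest_cell(points, lo, hi, depth):
    """Count points per octree cell and return (key, count) for the fullest cell."""
    counts = Counter(octant_key(p, lo, hi, depth) for p in points)
    return counts.most_common(1)[0]


# Hypothetical points standing in for mapped ligand conformations.
points = [(0.10, 0.10, 0.10), (0.12, 0.11, 0.10), (0.90, 0.90, 0.90)]
key, count = densest_cell(points, (0.0, 0.0, 0.0), (1.0, 1.0, 1.0), depth=2)
```

Because each key depends only on the point itself and the bounding box, keys can be computed independently per point and the counts aggregated afterwards, which is the kind of structure that suits a MapReduce-style computation.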

2009 EmBOINC. The BOINC Emulator is a trace-driven emulator that statistically models thousands of clients in a client/server volunteer computing paradigm. I designed and implemented EmBOINC as an open-source research tool to investigate the impact of scheduling, generation, and validation policies on the performance of BOINC projects. EmBOINC is implemented in C and C++ and can be conditionally compiled into the current BOINC distribution. Read more at http://gcl.cis.udel.edu/projects/emboinc/

2008 – 2009 Docking@Home. Docking@Home (D@H) is a volunteer computing project comprising more than 25,000 volunteered computers. D@H performs high-throughput virtual screening of protein-ligand docking. My achievements in this project include identifying statistical flaws in the scoring function of the docking algorithm and developing a method to accurately identify native-like docked structures under uncertainty. The middleware is BOINC (C, C++, and MySQL). The analysis is in MATLAB and Perl. http://docking.cis.udel.edu/

2007 Automatic generation of scheduling policies for volunteer computing (VC) projects. I designed and implemented a distributed genetic algorithm to automatically generate scheduling policies in a VC environment. Unlike human-designed policies, the generated set of scheduling policies was able to maintain high throughput across different VC projects. In addition, those policies were robust to various levels of volatility and heterogeneity in the environment. This project was implemented in C, MPI, and Perl.

2004 Identification of stellar populations in galactic spectra. Using the widths and shapes of certain lines in galactic spectra, I was able to determine the ages of the different stellar populations in a galaxy. To do so, I designed and implemented a ‘hierarchical ensemble’ of classifiers. Results showed improved accuracy compared to traditional template-based search methods for both synthetic and real galactic spectra. The preprocessing of spectra was written in MATLAB, and the hierarchical ensemble in C.

Find out more about my research