June 29, 2007
Announcement: DIMACS/DyDAn Workshop on Computational Methods for Dynamic Interaction Networks
While chatting recently with Martin Rosvall, I realized that it might actually be useful (gasp!) if I were to post information about workshops and conferences on complex networks that I hear about. So, in the interest of having this blog serve at least one additional purpose other than being my own personal bully pulpit, I'll try to post announcements as I receive them. Also, to those of you who are plugged into these things, you could help out by sending me your own workshop and conference announcements.
Without further ado, here's the first of the bunch coming up in the Fall. The paper submission deadline is already upon us (Sunday, July 1st) but DIMACS has a good track record of running good workshops, so maybe some folks will find it worthwhile to attend. Update 29 June: Deadline has been extended to July 8th, and I'm told there will be some support available for junior folks to attend.
September 24 - 25, 2007 at the DIMACS Center, CoRE Building, Rutgers University
Organizers: Tanya Berger-Wolf (UIC), Mark Goldberg (RPI), Malik Magdon-Ismail (RPI), Fred Roberts (DIMACS) and William "Al" Wallace (RPI).
Description: A substantial body of research in various sciences aims at understanding the dynamics and patterns of interactions within populations, in particular how social groups arise and evolve. As a result of advances in communications and computing technology, enormous amounts of data are being accumulated representing the evolution of large-scale communication networks, such as the WWW, chatrooms, blogs, and networks of bluetooth-enabled handheld devices. Moreover, as small sensors become widely available and affordable, new research areas are exploiting the social networks that emerge from sensor-network data. Finding patterns of social interaction within a population has been addressed in a wide range of applications including: disease modeling, cultural and information transmission, intelligence and surveillance, business management, conservation biology and behavioral ecology.
The workshop will focus on two complementary themes. The first is the emerging importance of electronic communication networks, their social implications, and how they facilitate the organization and coordination of the activities of social groups. The second is adapting and extending the computational methods developed in the context of communication and computer networks to social interaction networks.
- Modeling and simulation of dynamic social networks
- Measurement and comparison of dynamic social networks
- Community and social structure identification
- Identification of individual roles and behavioral patterns
- Visualization of large dynamic networks
Update 13 August: Here is the program. I'll be presenting a paper at this workshop entitled "Persistence and periodicity in a dynamic proximity network", which is joint work with Nathan Eagle (currently of MIT, but soon to be joining SFI), and considers the real-time dynamics of a human proximity network.
June 28, 2007
Two science news articles (here and here) about J. Craig Venter's efforts to hack the genome (or perhaps, more broadly, to hack microbiology) reminded me of a few other articles about his goals. The two articles I ran across today concern a rather cool experiment in which scientists took the genome of one bacterium species (M. mycoides) and transplanted it into a closely related one (M. capricolum). The actual procedure by which they made the genome transfer seems rather inelegant, but the end result is that the donor genome replaced the recipient genome and was operating well enough that the recipient looked like the donor. (Science article is here.) As a proof of concept, this experiment is a nice demonstration that the cellular machinery of the recipient species is similar enough to that of the donor that it can run the other's genomic program. But whether this technique can be applied to other species is an open question. For instance, judging by the difficulties that cloning research has encountered with simply transferring the nucleus of a cell into an unfertilized egg of the same species, it seems reasonable to expect that such whole-genome transfers won't be reprogramming arbitrary cells any time in the foreseeable future.
The other things I've been meaning to blog about are stories I ran across earlier this month, also relating to Dr. Venter's efforts to pioneer research on (and patents for) genomic manipulation. For instance, earlier this month Venter's group filed a patent on an "artificial organism" (patent application is here; coverage is here and here). Although the bacterium (derived from another cousin of the two mentioned above, called M. genitalium) is called an artificial organism (dubbed M. laboratorium), I think that gives Venter's group too much credit. Their artificial organism is really just a hobbled version of its parent species, where they removed many of the original genes that were not, apparently, always necessary for a bacterium's survival. From the way the science journalism reads though, you get the impression that Venter et al. have created a bacterium from scratch. I don't think we have either the technology or the scientific understanding of how life works to be able to do that yet, nor do I expect to see it for a long time. But, the idea of engineering bacteria to exhibit different traits (maybe useful traits, such as being able to metabolize some of the crap modern civilizations produce) is already a reality and I'm sure we'll see more work along these lines.
Finally, Venter gave a TED talk in 2005 about his trip to sample the DNA of the ocean at several spots around the world. This talk is actually more about the science (or more pointedly, about how little we know about the diversity of life, as expressed through genes) and less about his commercial interests. It appears that some of the research results from this trip have already appeared on PLoS Biology.
I think many people love to hate Venter, but you do have to give him credit for having enormous ambition, and for, in part, spurring the genomics revolution currently gripping microbiology. Perhaps like many scientists, I'm suspicious of his commercial interests and find the idea of patenting anything about a living organism to be a little absurd, but I also think we're fast approaching the day when putting bacteria to work doing things that we currently do via complex (and often dirty) industrial processes will be an everyday thing.
June 17, 2007
Mathematics of Sudoku
Long-time readers and friends of mine who have had the unfortunate luck to ask me about Sudoku will know that I'm not very fond of the game itself. In fact, I've never completed one by hand (even an "easy" one) because I get bored running a constraint satisfaction algorithm by hand. Instead, for a project in one of my courses in graduate school, I wrote a computer program to solve them for me. (I wrote two versions, one that implements a brute-force backtracking algorithm (in prolog) and one that implements a constraint satisfaction algorithm (in ciao), and did a little write-up to explain them. The code will actually solve a puzzle of arbitrary size, not just the regular 9 x 9 ones you find in newspapers.)
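For the curious, the brute-force backtracking approach is compact enough to sketch in a few lines of Python (a minimal illustration of the idea, not my original prolog or ciao code):

```python
def solve(grid):
    """Backtracking Sudoku solver; grid is a flat list of 81 ints, 0 = empty."""
    try:
        i = grid.index(0)              # first empty cell
    except ValueError:
        return True                    # no empty cells left: solved
    row, col = divmod(i, 9)
    for v in range(1, 10):
        # v is legal if no peer (same row, column, or 3x3 box) already holds it
        if all(grid[j] != v for j in range(81)
               if j // 9 == row or j % 9 == col
               or (j // 27 == row // 3 and (j % 9) // 3 == col // 3)):
            grid[i] = v
            if solve(grid):
                return True
            grid[i] = 0                # undo choice and backtrack
    return False
```

Replacing the hard-coded 9 and 3 with n^2 and n is what lets the same idea handle puzzles of arbitrary size.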
The last time I blogged about Sudoku was to talk about the interesting mathematical or theoretical questions about the game. Things like, given a partially-completed puzzle, how many unique solutions exist? For a given solution, how many unique partially-completed puzzles exist with a given number of entries provided for you? These probably aren't the kinds of questions most people ask when they sit down to solve a Sudoku puzzle, since they're interesting at a higher level than coming up with a solution to a particular instance.
Fortunately, mathematicians have been thinking about these kinds of questions for a couple of years now, and an article in the June/July 2007 issue of the Notices of the American Mathematical Society (AMS) by Agnes M. Herzberg and M. Ram Murty delves into them with great enthusiasm. In particular, they show that Sudoku puzzles can be reduced to a graph coloring problem. Each of the 81 boxes in the puzzle is a node in a graph (network), and two nodes are connected if they appear in the same row, column or sub-grid. Then, each of the numbers 1..9 is assigned a color, and the task is to come up with a coloring of the graph such that no edge has the same color at both ends. For regular instances, some of the colors are given, and your task is to complete the coloring.
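The construction is easy to write down explicitly. Here's a sketch in Python of building the Sudoku graph (my own toy code, not from the article):

```python
def sudoku_graph():
    """Build the 81-node Sudoku graph: cells are nodes, and an edge joins
    any two cells that share a row, column, or 3x3 sub-grid."""
    edges = set()
    for a in range(81):
        for b in range(a + 1, 81):
            same_row = a // 9 == b // 9
            same_col = a % 9 == b % 9
            same_box = (a // 27 == b // 27) and ((a % 9) // 3 == (b % 9) // 3)
            if same_row or same_col or same_box:
                edges.add((a, b))     # store each edge once, as (low, high)
    return edges
```

Each cell has exactly 20 neighbors (8 in its row, 8 in its column, and 4 more in its sub-grid), so the graph is 20-regular with 810 edges, and a completed puzzle is precisely a proper 9-coloring of it.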
The nice thing about this reformulation is that it makes the questions I mentioned above amenable to some of the tools from mathematics and computer science for dealing with the general problem of graph coloring (e.g., chromatic polynomials and other things I wish I understood better). For instance, the authors are able to prove that although there are roughly 6.671 x 10^21 valid Sudoku squares (a trivially small fraction of the number of Latin squares, which is about 5.525 x 10^27!), a mere 5,472,730,538 are unique (when you account for symmetries and relabelings). That number alone should keep puzzle fans and newspapers busy for quite a while. I wonder if we'll ever see the day when newspapers print puzzles that ask players to find both of two unique solutions, or to count the number of unique solutions for a particular puzzle. In spite of my ongoing dislike of the Sudoku craze gripping the country, I should at least take comfort in knowing that people everywhere are exercising the rational side of their brains, and a few may even be pondering the deeper mathematical mysteries of the game.
June 16, 2007
My long-time friend Nick Yee recently sat on a CNN panel to discuss the impact and evolution of virtual worlds (online video games). This video is an 8 minute clip from the longer program on the subject. Nick is probably the world's leading expert on the culture and psychology of online gaming, particularly the massively multiplayer online roleplaying games (MMORPGs) like World of Warcraft, City of Heroes, or Second Life. He started his research (much of which appears on his Daedalus Project page) while we were both at Haverford, when online gaming hadn't hit the mainstream the way it has today. If you want to understand what virtual worlds and online gaming are about, his work is the place to go.
Also briefly featured is Philip Rosedale (CEO of Linden Labs, which makes Second Life), whom I recently met at a Santa Fe Institute business network meeting. Nick tells me that the clip is airing worldwide on CNN this week, and that he's working on a book on the subject as well.
Update 16 June: Nick's posted a brief writeup about his appearance.
June 08, 2007
Power laws and all that jazz
With apologies to Tolkien:
Three Power Laws for the Physicists, mathematics in thrall,
Four for the biologists, species and all,
Eighteen behavioral, our will carved in stone,
One for the Dark Lord on his dark throne.
In the Land of Science where Power Laws lie,
One Paper to rule them all, One Paper to find them,
One Paper to bring them all and in their moments bind them,
In the Land of Science, where Power Laws lie.
From an interest that grew directly out of my work characterizing the frequency of severe terrorist attacks, I'm happy to say that the review article I've been working on with Cosma Shalizi and Mark Newman -- on accurately characterizing power-law distributions in empirical data -- is finally finished. The paper covers all aspects of the process, from fitting the distribution to testing the hypothesis that the data is distributed according to a power law, and to make it easy for folks in the community to use the methods we recommend, we've also made our code available.
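For the continuous case, at least, the maximum-likelihood estimator for the scaling exponent is simple enough to sketch (a toy illustration of the fitting step alone, with made-up names, not our released code; in practice you also need to estimate xmin and to run a goodness-of-fit test):

```python
import math

def alpha_mle(data, xmin):
    """Continuous maximum-likelihood estimate of a power-law exponent:
    alpha = 1 + n / sum(ln(x / xmin)), using only observations >= xmin."""
    tail = [x for x in data if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)
```

Unlike fitting a straight line to a log-log histogram, this estimator is unbiased in the large-n limit and comes with a known standard error of roughly (alpha - 1) / sqrt(n).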
So, rejoice, rejoice all ye people of Science! Go forth, fit and validate your power laws!
For those still reading, I have a few thoughts about this paper now that it's been released into the wild. First, I naturally hope that people read the paper and find it interesting and useful. I also hope that we as a community start asking ourselves what exactly we mean when we say that such-and-such a quantity is "power-law distributed," and whether our meaning would be better served at times by using less precise terms such as "heavy-tailed" or simply "heterogeneous." For instance, we might simply mean that visually it looks roughly straight on a log-log plot. To which I might reply (a) power-law distributions are not the only thing that can do this, (b) we haven't said what we mean by roughly straight, and (c) we haven't been clear about why we might prefer a priori such a form over alternatives.
The paper goes into the first two points in some detail, so I'll put those aside. The latter point, though, seems like one that's gone unaddressed in the literature for some time now. In some cases, there are probably legitimate reasons to prefer an explanation that assumes large events (and especially those larger than we've observed so far) are distributed according to a power law -- for example, cases where we have some convincing theoretical explanations that match the microscopic details of the system, are reasonably well motivated, and whose predictions have held up under some additional tests. But I don't think most places where power-law distributions have been "observed" have this degree of support for the power-law hypothesis. (In fact, most simply fit a power-law model and assume that it's correct!) We also rarely ask why a system necessarily needs to exhibit a power-law distribution in the first place. That is, would the system behave fundamentally differently, perhaps from a functional perspective, if it instead exhibited a log-normal distribution in the upper tail?
Update 15 June: Cosma also blogs about the paper, making many excellent points about the methods we describe for dealing with data, as well as making several very constructive points about the general affair of power-law research. Well worth the time to read.
June 05, 2007
Thoughts on NetSci 2007
Clearly, attending four conferences in four different states over five weeks has been too much for your humble blogger to handle in a timely fashion. This entry covers some of the notes I took while I was in New York City for the International Conference on Network Science (NetSci 2007), held at the New York Hall of Science.
Stefan Bornholdt (Bremen) gave a reasonably interesting talk on computation in molecular networks on Day 1. The basic idea is that while we pretty much understand how the storage (DNA), transmission (reproduction) and alteration (mutation, crossover) of heritable traits work, we have very little idea how the machinery that interacts with them actually computes (regulates, manages, etc.) the things they code for. Using a very simple model of this computation, and a set of inferred interactions for the genes governing cell division in yeast, he constructed a simple dynamical system and analyzed its behavior. His main result was a claim that the simple dynamical model, combined with the topology of which genes talk to each other, reproduces the actual 13-state sequence for cell division. The one problem with this claim was that the figure he showed of this sequence was a chain ending in a fixed point, rather than a limit cycle. Still, it's interesting work, and it suggests that combining the right dynamical model with a network topology can lead to interesting dynamical behaviors.
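To give a flavor of this kind of model, here is a minimal threshold Boolean network iterated synchronously until it hits a fixed point (a toy three-gene example of my own devising, not the actual yeast wiring or Bornholdt's model):

```python
def step(state, weights, genes):
    """Synchronous update of a threshold Boolean network: a gene turns on
    if the weighted sum of its inputs is positive, off if negative, and
    holds its current state when the sum is exactly zero."""
    new = {}
    for g in genes:
        s = sum(weights.get((h, g), 0) * state[h] for h in genes)
        new[g] = 1 if s > 0 else 0 if s < 0 else state[g]
    return new

# A toy cascade: A activates B, B activates C, C represses A.
genes = ["A", "B", "C"]
weights = {("A", "B"): 1, ("B", "C"): 1, ("C", "A"): -1}

trajectory = [{"A": 1, "B": 0, "C": 0}]
while True:
    nxt = step(trajectory[-1], weights, genes)
    if nxt == trajectory[-1]:      # reached a fixed point
        break
    trajectory.append(nxt)
```

Starting from the "A on" state, this little system passes through a short chain of states before settling into a fixed point, which is exactly the kind of attractor structure (chain versus limit cycle) at issue in the claim above.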
Stephen North (AT&T; also one of the contact people for GraphViz) gave a brief talk about network visualization procedures. I think the consequences of visualization are under-appreciated by the networks community. That is, "good" visualizations should show us things about the data that we didn't know were there (or confirm things we suspected), but they should try hard not to create artifacts. I wasn't surprised to recently learn that the ubiquitous spring-force layout algorithms exhibit strong pathologies, getting stuck in local energy minima that can give the appearance of structure that may not exist. For visualization, the desirable features he suggests are purely local -- we want few edge crossings and we want edges to be short and straight -- but for very large networks, I think it would be more informative to get the cluster structure right than to make the inner details of the clusters pretty. I'm not aware of any layout algorithms that do this, but this is at least partially because getting the cluster structure right can require very sophisticated pre-processing.
Jennifer Dunne (Santa Fe Institute) opened Day 2 with a discussion of her recent work on understanding ancient foodwebs and their similarity (or dissimilarity) to modern-day foodwebs. Amazingly, experts on these ancient species can reconstruct a large fraction of the corresponding foodwebs from fossils, based on lucky fossilizations (animals caught in the act of feeding, or fossilized stomach contents), morphological considerations (jaw structure, sizes, etc.), damage patterns, etc. Each edge is assigned a confidence value by the expert. Dunne's analysis of these inferred topologies (e.g., from the Burgess Shale) seems pretty careful, showing that paleo webs and modern webs actually don't seem to differ in the network features that ecologists typically study, although they do differ in their details. Her work is unusual in the networks community for explicitly analyzing the uncertainty of the conclusions in light of the known uncertainties in the topology -- this was a topic that came up during the panel discussion at the IPAM workshop earlier in the month, and I think it deserves a lot more attention in the networks world. That is, there's a strong need for better methods for quoting reasonable uncertainties in our conclusions.
One of my favorite talks of the week was by James Collins (Boston University) on his work on reverse engineering gene networks. He used a multiple linear regression model to infer a (weighted?) adjacency matrix from gene expression data for the E. coli SOS network (a particular set of responses E. coli initiates under high stress), and then matched its predictions for which genes to target in order to knock out the network's overall response against in vivo experiments. I asked if a more complicated model might do even better -- he suggested that the fact that the linear model does so well indicates that the system is highly damped, and that more complicated models (e.g., those with non-linear effects built in) do seem to do slightly better.
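The regression idea itself is easy to sketch: treat each gene's expression as a linear function of the others and recover the weight matrix by least squares (a toy illustration with synthetic data and made-up numbers, not the specific procedure from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n genes measured across m perturbation experiments.
n, m = 5, 200
A_true = np.zeros((n, n))
A_true[0, 1], A_true[1, 2], A_true[3, 0] = 0.8, -0.5, 0.6   # a few true edges

X = rng.normal(size=(n, m))                       # expression profiles
Y = A_true @ X + 0.01 * rng.normal(size=(n, m))   # responses, lightly noisy

# Recover the weighted adjacency matrix from Y ≈ A X by ordinary least squares.
A_hat = Y @ X.T @ np.linalg.inv(X @ X.T)
```

With enough experiments relative to the number of genes, the recovered matrix A_hat is close to the true wiring; the hard part in practice is that real expression data is noisy, scarce, and far from this idealized linear picture.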
Another of my favorite talks of the week was by Jon Kleinberg (Cornell) on the difficulties of anonymizing large social network data sets. He pointed out that many companies have released, and will continue to release, anonymized versions of their social networks (e.g., facebook or LiveJournal), where the anonymization is done simply by assigning each node a random unique label. Jon then showed exactly how a malicious attacker could, by introducing a few additional edges or nodes to the network (e.g., by opening up some new email accounts and sending some emails), de-anonymize a specific set of edges and nodes in the network. The results he showed relied on some very beautiful and very old results from combinatorics (e.g., some work by Paul Erdos on Ramsey theory). He then pitched an open question: are there good privacy-preserving mechanisms for anonymizing social network data? He suggested that zero-knowledge proofs or interactive proof schemes might be one way to guarantee anonymity, but at the expense of severely limiting the kinds of analysis that researchers could do with the network data.
Lise Getoor (U. Maryland) gave a short talk on her work on entity resolution for citation and coauthor networks (code available here; x86 chips only). Although she didn't speak directly about the effect of this kind of aliasing on the topological patterns that people are often interested in, the implication was clear that the topology (and thus both small-scale and large-scale patterns) can change quite dramatically when you appropriately clean the data. I'm curious to see how basic network statistics like the degree distribution, clustering coefficient and geodesic-length distribution, along with larger-scale features like the community structure, change when these aliases are resolved. (This would also give us some insight into just how much uncertainty we should include in our conclusions when we don't do this kind of alias resolution.)