April 29, 2010
What have I been doing these past 8 years?
The other day, while contemplating this whole business of being a university professor, recruiting students, etc., it occurred to me that my current website doesn't have the usual blah-blah-blah boilerplate descriptions of the topics I work on and the questions I'm interested in. I'll probably write something eventually, but for now, I decided to take a data-driven approach to describing what I do: I took the text of almost all the papers I've written since 2003, threw them into a text file, munged things a little, and made a word cloud of the results.
Voila. Here's what I work on.
 The munging is not strictly necessary, but wordle.net's implementation of the word cloud algorithm doesn't do "stemming", i.e., it doesn't see that words like "distribution" and "distributions" are really the same. So, some munging is necessary to combine words that are really the same.
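A minimal sketch of the kind of munging described above, assuming all that is needed is to fold simple plurals into their singular forms before counting. The function name `merge_plurals` and the "strip a trailing s only if the singular also appears" rule are illustrative assumptions, not the procedure actually used for the cloud, and a real stemmer handles far more cases.

```python
import re
from collections import Counter


def merge_plurals(text):
    """Count words, folding a plural into its singular when both occur.

    Hypothetical helper for illustration: it only handles plurals formed
    by adding a single trailing "s" (e.g. "distributions").
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    merged = Counter()
    for word, n in counts.items():
        singular = word[:-1]
        # Fold only when the singular form is itself present in the text,
        # and skip very short words to avoid merging things like "as" -> "a".
        if len(word) > 3 and word.endswith("s") and singular in counts:
            merged[singular] += n
        else:
            merged[word] += n
    return merged


if __name__ == "__main__":
    sample = "The distribution of degrees and the distributions of weights"
    for word, n in merge_plurals(sample).most_common():
        print(word, n)
```

The merged counts could then be written out as repeated words, or as word and weight pairs, for whatever word cloud tool is used.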
April 20, 2010
People v. The scale-free networks hypothesis
The other night while skimming the nightly arxiv mailing, I was momentarily confused as to why a paper on cell phone networks was in the q-bio mailing. Then I realized that these cellular networks were about actual cells: the kind that squirm and wiggle and master our every innovation in biochemical warfare. Turns out, I should have paid more attention.
This paper, by de Lomana, Beg, de Fabritiis and Villà-Freixa, is about the way Nature organizes biological networks inside cells, such as protein-protein interaction networks, transcriptional regulatory networks, metabolic networks, etc. The authors apply new statistical methods to more rigorously test an old and oft-repeated systems biology hypothesis: that biological networks universally exhibit a "scale free" structure, as shown by their degree distributions following a power-law form. The implication is that there's a class of "universal" evolutionary mechanisms that build and maintain the structure of these networks, which is why they exhibit this common organizational pattern. Many scientists are not fans of this idea, and one of my favorite critiques is a 2005 paper by Tanaka titled "Scale-rich Metabolic Networks".
Long-time readers will guess what's coming next. Power-law distributions, and the question of whether some empirical data can reasonably be claimed to follow one, and thus also whether we are licensed to infer certain kinds of causal mechanisms versus others, is a topic very close to my heart. A few years ago, I even made a mild attempt to lay direct siege to the scale-free networks fortress by reanalyzing some recently published protein-interaction network data. I'm happy to say that de Lomana et al. have done a much better and more comprehensive job than I did. In their paper, they applied sound methods to a wide variety of biochemical data sets and, surprisingly or perhaps unsurprisingly, they found that none of these systems exhibit plausible power laws. In their words:
Our results demonstrate that the large-scale topology of the molecular interaction networks and the global mRNA and protein expression distributions examined here do not strictly follow power-law distributions. Moreover, none of the three heavy-tailed models tested had a universal agreement with the empirical data even when using the highest quality data sets available. Distributions are evidently heavy-tailed and for this type of data [maximum likelihood] analyses prove superior to graphical methods for assessing different tested distributions.
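To make the maximum likelihood idea in that passage concrete, here is a minimal sketch of the standard estimator for the exponent of a continuous power law, alpha = 1 + n / sum(ln(x_i / xmin)). It is not the authors' code: the function names are hypothetical, the lower cutoff `xmin` is assumed to be known rather than estimated, and degree data are discrete, so the continuous formula is only an approximation there. A fuller analysis would also choose `xmin` from the data and test goodness of fit, which this sketch omits.

```python
import math
import random


def fit_power_law_alpha(data, xmin):
    """Maximum likelihood estimate of alpha for p(x) ~ x^(-alpha), x >= xmin.

    Returns (alpha, standard_error). Continuous approximation; xmin is
    taken as given here, which is an assumption of this sketch.
    """
    tail = [x for x in data if x >= xmin]
    log_sum = sum(math.log(x / xmin) for x in tail)
    if not tail or log_sum == 0.0:
        raise ValueError("need observations strictly above xmin")
    n = len(tail)
    alpha = 1.0 + n / log_sum
    stderr = (alpha - 1.0) / math.sqrt(n)  # asymptotic standard error
    return alpha, stderr


def sample_power_law(alpha, xmin, n, rng):
    """Draw n synthetic values from a continuous power law by inverse transform."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(1)
    synthetic = sample_power_law(alpha=2.5, xmin=1.0, n=10000, rng=rng)
    alpha_hat, err = fit_power_law_alpha(synthetic, xmin=1.0)
    print(f"estimated alpha = {alpha_hat:.3f} +/- {err:.3f}")
```

Recovering the exponent from synthetic data like this only shows that the estimator works when the data really are power-law distributed. It says nothing about whether a power law is a plausible model for real data, which is the question the quoted passage is about.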
de Lomana et al. go on to discuss how these conclusions might be wrong, due to various uncertainties relating to the empirical data. I particularly liked this latter piece because it shows that they've thought carefully about the hypotheses, the uncertainties in their data, and how these interact. (Sadly for science, this kind of self-criticism is increasingly unfashionable in much of this literature.)
In their conclusions, de Lomana et al. keep the interpretation fairly narrow. I think more can be said. Here's me, two years ago, writing about one particular study that argued in support of the scale-free hypothesis:
...there must be a lot of non-scale-free structure in the network. This structure may have evolutionary or functional significance, since it's behavior is qualitatively different from the large-degree proteins. Unfortunately, the authors missed the opportunity to identify or discuss this because they were sloppy in their analysis. The moral here is that doing the statistics right can shed new light on the interactome's structure, and can actually generate new questions for future work... if we're ever to build scientific theories here, then we sure had better get the details right.
Tip to Brian Karrer and Cosma Shalizi.
Update 2 Sept. 2010: Jordi tells me that the paper has finally appeared as A.L.G. De Lomana, Q.K. Beg, G. De Fabritiis and J. Villà-Freixa. "Statistical Analysis of Global Connectivity and Activity Distributions in Cellular Networks." Journal of Computational Biology, 17(7): 869-878 (2010).
 Statistical Analysis of Global Connectivity and Activity Distributions in Cellular Networks by Adrián López García de Lomana, Qasim K. Beg, G. de Fabritiis, Jordi Villà-Freixa. Journal of Computational Biology (Forthcoming).
Various molecular interaction networks have been claimed to follow power-law decay for their global connectivity distribution. It has been proposed that there may be underlying generative models that explain this heavy-tailed behavior by self-reinforcement processes such as classical or hierarchical scale-free network models. Here we analyze a comprehensive data set of protein-protein and transcriptional regulatory interaction networks in yeast, an E. coli metabolic network, and gene activity profiles for different metabolic states in both organisms. We show that in all cases the networks have a heavy-tailed distribution, but most of them present significant differences from a power-law model according to a stringent statistical test. Those few data sets that have a statistically significant fit with a power-law model follow other distributions equally well. Thus, while our analysis supports that both global connectivity interaction networks and activity distributions are heavy-tailed, they are not generally described by any specific distribution model, leaving space for further inferences on generative models.
 Disclaimer: I helped develop these methods.
 Specifically preferential-attachment-style mechanisms, such as duplication-mutation. Problematically, the language used in this area has led to enormous confusion. The name "scale free" has been attached both to a particular mechanism (preferential attachment) that generates a particular pattern (a power-law distribution) and to the pattern itself. So, when someone says "X is scale free", they could mean either that X was generated by preferential attachment or that X follows a power-law distribution. The problem, of course, is that there are many mechanisms that generate power-law distributions, so the fact that preferential attachment implies a power-law distribution does not mean that observing a power-law distribution lets us correctly infer preferential attachment. If we do, we're almost surely committing a Type I error (also called an error of excessive credulity). Personally, when it comes to scientific theories, I would prefer to make errors of excessive skepticism, but I'm not sure many of my colleagues in network science share that preference.
 R. Tanaka, "Scale-Rich Metabolic Networks." Physical Review Letters 94, 168101 (2005).
 You may not know that I took that blog entry and turned it into a comment that I posted on the arxiv, or that I subsequently submitted the comment to Science. The comment was not published, but not because my criticism wasn't correct. Here's a short explanation of why it wasn't published.