March 07, 2016
The trouble with modern interdisciplinary publishing
You may have heard about the recent debacle at PLOS ONE, in which the journal published, and then retracted, a paper with the following, seemingly creationist line in the abstract (boldface added):
The explicit functional link indicates that the biomechanical characteristic of tendinous connective architecture between muscles and articulations is the proper design by the Creator to perform a multitude of daily tasks in a comfortable way.
The reference to the Creator also appears in the introduction and in the discussion. Jon Wilkins has written good summaries of some of the details, both of the original publication of the paper, and of its subsequent retraction by the PLOS ONE editorial staff. An interesting point he makes is that the use of the English phrase "the Creator" can plausibly be attributed to poor translation from Chinese; an alternate translation could thus be "nature", although that doesn't ameliorate the creationist subtext that is also carried by the phrase "proper design." 
Anyway, I'm not interested in second-guessing the editorial and authorial sequences of actions that led to the initial publication or its retraction. And the fact that it happened probably won't affect my own inclination for or against publishing something in PLOS ONE, since I think that the editors and papers at PLOS ONE are generally okay.
Instead, I think this event serves to illustrate a deeper and quite important issue with modern academic publishing, especially interdisciplinary journals like PLOS ONE. The core question is this: how do we scientists, as a community, ensure quality control in a system where we are drowning in the sheer volume of everything? 
For instance, vanity journals like Nature and Science don't let little obvious things like this through, in part because they employ copy editors whose job it is to proofread and copyedit articles before they are published. But these journals do let plenty of other, shinier crap through, for which they are properly mocked. How does the crap get through? I think the answer lies partly in the enormous volume of submissions, which necessarily means that individual submissions often are not looked at very closely or, if they are, are often not looked at by the right experts. PLOS ONE has, if anything, an even larger volume problem than the vanity journals. Not only did it publish something like 30,000 papers in 2015, but it also has more than 6000 academic editors. The complexity of ensuring a good match between papers and editors with appropriate expertise is dizzying. The fact that this enormous machine works as well as it does is pretty remarkable: PLOS ONE is easily the world's largest journal in volume of submissions, volume of published papers, number of editors, and range of topics covered. It's also a very young journal, having begun publishing only in 2006.
Here's the crux of the matter. Science publishing as we know it was invented when science was run by a pretty small community of fairly like-minded individuals. As it's grown over the past 350 years, science has become increasingly specialized, and the specialization trend really kicked in during the 20th century, when governments realized that funding science was politically advantageous (especially for war). Specialization is often held up as a bugaboo in modern discourse, but it's a completely natural process in science, and I think it advanced so much during the 20th century in part because it helped keep the peer review and publication systems small and manageable. That is, specialization is a way to solve the matching process of manuscripts with editors, reviewers and journals.
But, interdisciplinary journals like PLOS ONE, and its shinier brethren like Science Advances and Nature Communications, are racing headlong into territory where the publication system has never functioned before. From a simple logistics point of view, it's not clear that we know how to scale up what is essentially an early 20th century evaluation process, which only worked because of specialization, to handle 21st century volumes for interdisciplinary science, and still feel good about the product (effectively peer reviewed papers, for any reasonable definition of peer review).
So, I think it's good that we have places like PLOS ONE who are experimenting in this area. And, I think they rightly deserve our ridicule when they screw up royally, as in the case of this "Creator" paper. But let's not shame them into stopping, e.g., by boycotting them, when they are quite a bit more brave than we are. Instead, I think we as a community need to think harder about the real problem, which is about how to run an effective scientific publishing operation in a world where we cannot rely on specialization to manage the way we collectively allocate the attention of experts.
 Ming-Jin Liu, Cai-Hua Xiong, Le Xiong, and Xiao-Lin Huang. "Biomechanical Characteristics of Hand Coordination in Grasping Activities of Daily Living." PLOS ONE 11(1): e0146193 (2016).
 The PLOS ONE Staff. "Retraction: Biomechanical Characteristics of Hand Coordination in Grasping Activities of Daily Living." PLoS ONE 11(3): e0151685 (2016).
 It seems possible that the authors could have rewritten the offending parts of the paper in order to make it more clear that they mean these features evolved naturally. However, PLOS did not, apparently, give them that option.
 The problems that high volumes induce are pretty much ubiquitous in academia. The most precious resource I have as a scientist is my attention, and it is very limited. And yet, there is an ever-increasing number of things that need attending to. The number of students in undergraduate CS programs is increasing; the number of applicants to PhD programs is increasing; the number of applicants to faculty openings is increasing; the number of papers needing to be reviewed is increasing; the number of papers I would like to read in order to stay up on the literature is increasing; the number of meetings I should probably be attending or organizing is increasing; etc. etc. This is a place where technology should be helping us focus on things that are important, but instead, it's just made everything worse, by lowering barriers. Email is the perfect example of this.
 Arguably, science as a field was first formalized in the founding of the Royal Society in 1660.
 Full disclosure: I am currently an Associate Editor at Science Advances. We have our own version of the volume+diversity problem, and we have our own method for managing it. The method requires a lot of human attention in sorting things into buckets, and I have great respect for our sorters (who are also academics, like all the associate editors). But how scalable is our system? What mistakes does it tend to make? How could we lower the error rate? I don't think anyone knows the answers to these questions, and I am not sure anyone is even thinking that these are questions that need to be answered.
 Peer review was mainly carried out directly by the senior-most editors until the late 19th century, and it wasn't until the early 20th century that the process we recognize today, with external reviewers, took shape.
December 02, 2014
Doing interdisciplinary work (Advice to young scholars, part 3 of 4)
These notes are an adapted summary of the third of four professional development panels for young scholars, held as part of the American Mathematical Society (AMS) Mathematics Research Community (MRC) on Network Science in June 2014. Their focus was on mathematics, computer science, and networks, but many of the comments generalize to other fields.
Panel 3. Interdisciplinary Research
Opening remarks: Sometimes, the most interesting problems come from interdisciplinary fields, and interdisciplinary researchers are becoming more and more common. As network scientists, we tend to fit in with many disciplines. That said, the most important thing you have is time; therefore, choose your collaborations wisely. Interdisciplinary work can be divided into collaboration and publication, and each of these has its own set of difficulties. A common experience with interdisciplinary work is this:
Any paper that aims for the union of two fields will appeal mainly to the intersection. -- Jon Kleinberg
- What's the deal with interdisciplinary collaborations? How do they impact your academic reputation?
There are three main points to consider when choosing interdisciplinary collaborations, and how they impact perceptions of your academic reputation.
First, academia is very tribal, and the opinions of these tribes with regards to your work can have a huge impact on your career. Some departments won't value work outside their scope. (Some even have a short list of sanctioned publication venues, with work outside these venues counting literally as zero for your assessments.) Other departments are more open minded. In general, it's important to signal to your hopefully-future-colleagues that you are "one of them." This can mean publishing in certain places, or working on certain classes of problems, or using certain language in your work, etc. If you value interdisciplinary work, then you want to end up in a department that also values it.
Second, it's strategically advantageous to be "the person who is the expert on X," where X might be algorithms or statistics or models for networks, or whatever. Your research specialty won't necessarily align completely with any particular department, but it should align well with a particular external research community. In the long run, it is much more important to fit into your community than to fit into your department, research-wise. This community will be the group of people who review your papers, who write your external letters when you go up for tenure, who review your grant proposals, who hire your students as postdocs, etc. The worst possible situation is to be community-less. You don't have to choose your community now, but it helps to choose far enough ahead of your tenure case that you have time to build a strong reputation with them.
Third, make sure the research is interesting to you. If your contribution to some interdisciplinary collaboration is to point out that an off-the-shelf algorithm solves the problem at hand, it's probably not interesting to you, even if it's very interesting to the collaborator. Even if it gives you an easy publication, it won't have much value to your reputation in your community. Your work will be compared to that of specialists in each of the two fields, and it might not look particularly good to either.
Be very careful about potentially complicated collaborations in the early stages of your career. Be noncommittal until you're sure that your personalities and tastes in problems match. (Getting "divorced" from a collaborator, once a project has started, can be exhausting and complicated.) Being able to recognize cultural differences is an important first step to good collaborations, and moving forward effectively. Don't burn bridges, but don't fall into the trap of saying yes to too many things. Be open to writing for an audience that is not your primary research community, and be open to learning what makes an interesting question and a satisfying answer in another field.
- What's the deal with publishing interdisciplinary work? Where should it go?
As a mathematical or computer or data scientist doing work in a domain, be sure to engage with that domain's community. This helps ensure that you're doing relevant good work, and not reinventing wheels. Attend talks at other departments at your university, attend workshops/conferences in the domain, and discuss your results with people in the domain audience.
When writing, vocabulary is important. Knowing how to speak another discipline's language will help you write in a way that satisfies reviewers from that community. Less cynically, it also helps the audience of that journal understand your results, which is the real goal. If publishing in the arena of a collaborator, trust your collaborator on the language/writing style.
In general, know what part of the paper is the most interesting, e.g., the mathematics, or the method or algorithm, or the application and relationship to scientific hypotheses, etc., and send the paper to a venue that primarily values that thing. This can sometimes be difficult, since academic tribes are, by their nature, fairly conservative, and attempting to publish a new or interdisciplinary idea can meet with knee-jerk resistance. Interdisciplinary journals like PLOS ONE, which try not to consider domain, can be an okay solution for early work that has trouble finding a home. But don't overuse these venues, since they also tend not to have a built-in community of readers the way regular venues do.
Note: When you interview for a faculty position, among the many questions that you should be asking the interviewing department: "In practice, how is your department interdisciplinary? How do you consider interdisciplinary work when evaluating young faculty (e.g., at tenure time)?"
 Panelists were Mason Porter (Oxford), David Kempe (Southern California), and me (the MRC organizers), along with an ad hoc assortment of individuals from the MRC itself, as per their expertise. The notes were compiled by MRC participants, and I then edited and expanded upon them for clarity and completeness, and to remove identifying information. Notes made public with permission.
 Here is a complete copy of the notes for all four panels (PDF).
January 29, 2014
PLOS mandates data availability. Is this a good thing?
The Public Library of Science, aka PLOS, recently announced a new policy on the availability of the data used in all papers published in all PLOS journals. The mandate is simple: all data must be either already publicly available, or be made publicly available before publication, except under certain circumstances.
On the face of it, this is fantastic news. It is wholly in line with PLOS’s larger mission of making the outcomes of science open to all, and supports the morally correct goal of making all scientific knowledge accessible to every human. It should also help preserve data for posterity, as apparently a paper’s underlying data become increasingly hard to find as the paper ages. But I think the truth is more complicated.
PLOS claims that it has always encouraged authors to make their data publicly available, and I imagine that in the vast majority of cases, those data are in fact available. But the policy does change two things: (i) data availability is now a requirement for publication, and (ii) the data are supposed to be deposited in a third-party repository that makes them available without restriction or attached to the paper as supplementary files. The first part ensures that authors who would previously decline or ignore the request for open data must now fall into line. The second part means that a mere promise by the authors to share the data with others is now insufficient. It is the second part where things get complicated, and the first part is meaningless without practical solutions to the second part.
First, the argument for wanting all data associated with scientific papers to be publicly available is a good one, and I think it is also the right one. If scientific papers are in the public domain, but the data underlying their results are not, then have we really advanced human knowledge? In fact, it depends on what kind of knowledge the paper is claiming to have produced. If the knowledge is purely conceptual or mathematical, then the important parts are already contained in the paper itself. This situation covers only a smallish fraction of papers. The vast majority report figures, tables or values derived from empirical data, taken from an experiment or an observational study. If those underlying data are not available to others, then the claims in the paper cannot be exactly replicated.
Some people argue that if the data are unavailable, then the claims of a paper cannot be evaluated at all, but that is naive. Sometimes it is crucial to use exactly the same data, for instance, if you are trying to understand whether the authors made a mistake, whether the data are corrupted in some way, or how a particular method behaves on the original data. For these efforts, data availability is clearly helpful.
But science aspires to general knowledge and understanding, and thus getting results that are consistent with the original claims using different data of the same type is actually a larger step forward than simply retracing the exact steps of the original paper. Making all data available may thus have an unintended consequence of reducing the amount of time scientists spend trying to generalize, because it will be easier and faster to simply reuse the existing data than to work out how to collect a new, slightly different data set, or to understand the details that went into collecting the original data in the first place. As a result, because fewer independent data sets will be brought to bear on a given claim, data availability is likely to increase the rate at which erroneous claims are published. In fields like network science, this kind of data reuse is the norm, which gives us some guidance about the kinds of issues other fields might encounter as data sharing becomes more common.
Of course, reusing existing data really does have genuine benefits, and in most cases these almost surely outweigh the more nebulous costs I just described. For instance, data availability means that errors can be more quickly identified, because we can look at the original data to find them. Science is usually self-correcting anyway, but having the original data available is likely to increase the rate at which erroneous claims are identified and corrected. And, perhaps more importantly, other scientists can use the original data in ways that the original authors did not imagine.
Second, and more critically for PLOS’s new policy, there are practical problems associated with passing research data to a third party for storage. The first problem is deciding who counts as an acceptable third party. If there is any lesson from the Internet age, it is that third parties have a tendency to disappear, in the long run, taking all of their data with them. This is true both for private and public entities, as continued existence depends on continued funding, and continued funding, when that funding comes from users or the government, is a big assumption. For instance, the National Science Foundation is responsible for funding the first few years of many centers and institutes, but NSF makes it a policy to make few or no long-term commitments on the time scales PLOS’s policy assumes. Who then should qualify as a third party? In my mind, there is only one possibility: university libraries, who already have a mandate to preserve knowledge, should be tapped to also store the data associated with the papers they already store. I can think of no other type of entity with as long a time horizon, as stable a funding horizon, and as strong a mandate for doing exactly this thing. PLOS’s policy does not suggest that libraries are an acceptable repository (perhaps because libraries themselves fulfill this role only rarely right now), and only provides the vague guidance that authors should follow the standards of their field and choose a reasonable third party. This kind of statement seems fine for fields with well-developed standards, but it will likely generate enormous confusion in all other fields.
This brings us to another major problem with the storage of research data. Most data sets are small enough to be included as supplementary files associated with the paper, and this seems right and reasonable. But some data sets are really big, and these pose special problems. For instance, last year I published an open access paper in Scientific Reports that used a 20TB data set of scoring dynamics in a massive online game. Data sets of that scale may be uncommon today, but they pose a real logistical problem for passing them to a third party for storage and access. If someone requests a copy of the entire data set, who pays for the stack of hard drives required to send it to them? What happens when the third party has hundreds or thousands of such data sets, and receives dozens or more requests per day? These are questions that the scientific community is still trying to answer. Again, PLOS’s policy only pays lip service to this issue, saying that authors should contact PLOS for guidance on “datasets too large for sharing via repositories.”
The final major problem is that not all data should be shared. For instance, data from human-subjects research often includes sensitive information about the participants, e.g., names, addresses, private behavior, etc., and it is unethical to share such data. PLOS’s policy explicitly covers this concern, saying that data on humans must adhere to the existing rules about preserving privacy, etc.
But what about large data sets on human behavior, such as mobile phone call records? These data sets promise to shed new light on human behavior of many kinds and to help us understand how entire populations behave, but should they be made publicly available? I am not sure. Research has shown, for instance, that it is not difficult to uniquely distinguish individuals within these large data sets, because each of us has distinguishing features in our particular patterns of behavior. Several other papers have demonstrated that portions of these large data sets can be deanonymized by matching these unique signatures across data sets. For such data sets, the only way to preserve privacy might be to not make the data available. Additionally, many of these really big data sets are collected by private companies, as a byproduct of their business, at a scale that scientists cannot achieve independently. These companies generally provide access to the data only if the data are not shared publicly, because they consider the data to be theirs. If PLOS’s policy were universal, such data sets would seem to become inaccessible to science, and human knowledge would be unable to advance along any lines that require such data. That does not seem like a good outcome.
PLOS does seem to acknowledge this issue, but in a very weak way, saying that “established guidelines” should be followed and privacy should be protected. For proprietary data sets, PLOS only makes this vague statement: “If license agreements apply, authors should note the process necessary for other researchers to obtain a license.” At face value, it would seem to imply that proprietary data sets are allowed, so long as other researchers are free to try to license them themselves, but the devil will be in the details of whether PLOS accepts such instructions or demands additional action as a requirement for publication. I’m not sure what to expect there.
On balance, I like and am excited about PLOS’s new data availability policy. It will certainly add some overhead to finalizing a paper for submission, but it will also make it easier to get data from previously published papers. And I do think that PLOS put some thought into many of the issues identified above. I also sincerely hope they understand that some flexibility will go a long way in dealing with the practical issues of trying to achieve the ideal of open science, at least until we, as a community, figure out the best way to handle them.
 PLOS's Data Access for the Open Access Literature policy goes into effect 1 March 2014.
 See “The Availability of Research Data Declines Rapidly with Article Age” by Vines et al. Current Biology 24(1), 94-97 (2014).
 Which, if they are published at a regular “restricted” access journal, they are not.
 For instance, there is a popular version of the Zachary Karate Club network that has an error (a single edge is missing, relative to the original paper). Fortunately, no one makes strong claims using this data set, so the error is not terrible, but I wonder how many people in network science know which version of the data set they use.
 There are some conditions for self-correction: there must be enough people thinking about a claim that someone might question its accuracy, one of these people must care enough to try to identify the error, and that person must also care enough to correct it, publicly. These circumstances are most common in big and highly competitive fields. Less so in niche fields or areas where only a few experts work.
 If you had a profile on Friendster or Myspace, do you know where your data is now?
 Federal law already prohibits sharing such sensitive information about human participants in research, and that law surely trumps any policy PLOS might want to put in place. I also expect that PLOS does not mean their policy to encourage the sharing of that sensitive information. That being said, their policy is not clear on what they would want shared in such cases.
 And, thus perhaps easier, although not easy, to identify specific individuals.
 And the courts seem to agree, with recent rulings deciding that a “database” can be copyrighted.
 It is a fair question as to whether alternative approaches to the same questions could be achieved without the proprietary data.
August 18, 2012
Wanting to hire
Postdoctoral Fellowship in Study of Networks
Along with Cris Moore, I am looking to hire a postdoc in the area of complex networks and statistical inference. There are two such positions available, one located at the Santa Fe Institute (working with Cris) and one at the University of Colorado Boulder (working with me). Both are funded by a grant from DARPA to investigate the use of generative models to perform statistical inference in complex networks. The larger team includes Mark Newman at the University of Michigan, and there will be ample opportunity to travel and collaborate among the three institutions.
The grant has a particular emphasis on community detection methods, including methods for detecting changes in community structure in dynamic graphs; functional groups that are not merely "clumps" of densely connected nodes; predicting missing links and identifying spurious ones; building on incomplete or noisy information about the network, generalizing from known attributes of nodes and edges to unknown ones; and identifying surprising or anomalous structures or events.
If you are interested in applying, or know someone who has a strong quantitative background in physics, statistics or computer science, see the application information.
The application deadline is 13 January 2013 with an anticipated start date of May or August 2013.
January 13, 2012
A crisis in higher education?
Attention conservation notice: 3200 cranky words on the PhD over-supply "crisis."
Higher education is always having crises, it seems. Some of this is probably a kind of marketing strategy, because genuinely serious problems are so systemic and slow-moving that it's easy to ignore them, or because you can't get people to pay attention in today's saturated media environment without a large dose of hyperbole. But, one "crisis" in particular did catch my attention over the past few years: the harsh market faced by relatively new PhDs seeking faculty jobs. Nature did a full spread on the future of the PhD, The Economist weighed in with their own coverage, and I'm sure the Chronicle of Higher Education has done a number of stories on the topic. Now, Online PhD has done its own report, in the form of a slick infographic packed with grim factoids and numbers.
What most of these perspectives miss, and what makes some of their analysis a little shrill, is the historical context of higher education and its growth trajectory over the past 70 years. The overall upward trend in PhD production over this time period can be blamed on huge increases in federal funding for research, on huge growth in the number of students getting undergrad degrees, on a vast broadening of higher education as a whole and on intensified competition between research-oriented universities.
The role of competition, I think, is underappreciated: many more universities now produce PhDs and many more programs are genuinely good than was the case before federal funding for higher education began surging after World War II. The result is a larger and more competitive market for those PhDs, especially the ones produced by the best programs. (The same is true for funding sources: the pie has grown, but the number of people competing for a slice has grown much faster.) In many ways, this is a good thing for higher education overall, since students can receive a good or even a great education at a wider variety of places. That is, higher education is probably genuinely less elitist and genuinely more accessible and inclusive. There's also a cost, however, in the brutal treatment that even successful candidates experience on the market and in the colossal waste of human potential from the many talented individuals who fail to find good jobs. (More on "good" jobs in a moment.)
That being said, increased production of PhDs doesn't necessarily mean increased competition. If the number of tenure-track faculty positions increases at the same rate as the production of PhDs, then in principle competition could remain flat. This point gets a lot of attention in the popular discussion and the argument is often that if only we could increase the number of tenure-track lines, everything would be great. But this obscures the complexity of the problem. First, faculty lines are largely paid for by undergraduate (or professional degree) tuition, so increasing the size of the faculty requires increasing the size of the undergraduate population, which has its own problems. Second, part of the modern National Science Foundation's mandate is actually to overproduce graduate students, and this is largely at the behest of Congress. Alternatively, we could solve the over-supply of PhDs by reducing the overall production, but this would negatively impact the amount of research being produced (since it's largely done by PhD students), the teaching of many departments (again, often done by PhD students) and would reduce the supply of highly educated individuals to non-academic professions.
Third, not all tenure-track positions are equally desirable, not all PhDs are equally competitive, and growth in the most desirable slots and most competitive people has not been uniform. This is a tricky problem to explain but let's put it this way: I would not be surprised to learn that the 10 best programs in the country (in any particular field) collectively produce enough PhDs each year to fill every advertised faculty line at every other university, even the not-so-great ones. This means that the lack-of-tenure-track-jobs / overproduction-of-PhDs "crisis" is not one that everyone feels equally, which complicates the conclusion that it is universally a problem. In fact, a tight job market for faculty positions has some benefits, at least collectively. One is that lower-quality places can pick up relatively better qualified people than they would if the top-ranked departments had enough extra lines to hire all the top applicants. Over time, an over-supply of good PhDs may be necessary to raise the quality of the worst-performing institutions, although this effect may only be observable in the long run.
Fourth, the history of higher education as an industry is a series of large expansions and contractions, and the effects of these are often felt and distributed unevenly. Life and job prospects for faculty in expanding fields are good, but are hard during contractions. (These effects are surely amplified for young scholars, and so one possibility would be better knowing and advertising the true employment prospects for graduates; but maybe not.) It's not entirely clear to me that academia is actually experiencing a contraction, despite the federal budget travails. A more truthful statement may be that higher education is restructuring, which brings us to the issue of "good" jobs versus "bad" jobs.
It's true that universities (at least in the US) are increasingly made up of two types of faculty: those who have, or are eligible for, tenure ("tenure track"; a population that is, at best, growing fairly slowly) and those who lack, and can never receive, tenure (teaching faculty, adjuncts, etc.). The latter group is much larger now than it used to be, but it's not very well integrated into the decision-making systems of universities, and this, I think, leads to some level of systemic abuse. In the long run, it seems likely that these groups will become better integrated into the decision-making system, which will reduce the abuse. But a more interesting question, I think, is why has this population grown so much so recently?
The role that universities play in society is changing, and I think the growth of these lower-quality jobs reflects this shift. The US economy overall has shifted significantly toward service-sector jobs, and the growth in adjunct and teaching positions at universities should probably be seen as the higher education equivalent. This may be driven in part by the commoditization of a bachelor's degree (which is primarily what non-tenure-track faculty help produce), which society has demanded and the universities have embraced (especially the big public universities and the non-endowed private colleges, where increased enrollment means increased tuition revenue). For their part, colleges and universities are figuring out that they can produce an apparently equivalent "product" at significantly lower cost by relying more heavily on non-tenure track faculty [11,12]. It seems telling that losses of tenure-track lines are often at colleges and universities well below the "top tier", where the struggle for product differentiation and the negative consequences of price competition are likely stronger. So, it seems reasonable to expect growth in these "bad" jobs in places where the service rendered (education provided) is less specialized, e.g., entry- or lower-level undergraduate classes where the material is highly standardized and probably does not require the best of the best minds to teach.
Another aspect is that tenure is not just about protecting faculty from being fired for political reasons. Tenure also allows universities to fulfill their mission toward preserving knowledge because tenured faculty will be around for a long time, communicating their vast and detailed knowledge to the next generation. Eliminating tenure lines may basically mean that an institution is giving up some or all of its commitment to the knowledge preservation mission. This is surely a loss for society as a whole, but it does raise the interesting question about which institutions are best positioned to fulfill that mission -- it may be that the institutions who are giving up on it were not doing a very good job at it in the first place. The fact that tenure lines are mainly (but not always) being lost from the lower-ranked institutions suggests that the top places are largely still committed to this mission, even if they are retrenching to some degree (perhaps because of the shifting demands on bachelor degree production described above).
So, let's take stock. Is there a crisis? Not in the usual definition of the word, no. But, there are serious issues that we should consider, and these tap deep into both the mission and purpose of higher education and its relationship to society as a whole.
The good things for society about the current system are that the over-supply of PhDs produces a steady stream of highly educated people for other industries and government to use. The over-supply means that low-quality departments will tend to improve over time because they can hire better people than their peers tend to produce. The over-supply also means that the best or most desirable departments will also tend to improve over time because they can select their new hires from the very best of the very best. For scholarship in general, this is a good thing. The over-supply means that academia has a large supply of low-cost skilled labor (graduate students) for producing research, educating younger students, etc. And, the over-supply means that academia has an adequate supply of potential faculty to facilitate restructuring needs, i.e., responding to the changing demands from society and the changing roles of universities.
The bad things are that the over-supply is a colossal waste of human potential for people who aspire to be faculty but who ultimately fail to find employment. For instance, many very talented individuals will spend substantial time in low-paying, low-benefits temporary employment (graduate students, postdocs, adjuncts, research faculty positions, etc.) only to discover years or decades later that these years are now counted against them on the job market (and not just in the academic market). The over-supply makes the individual experience of finding a job fairly brutal and with a high error rate (many people who should get faculty jobs do not). Success also comes with a cost in the form of moving a very large distance (the faculty job market is one of the few truly national labor markets). The over-supply has made it easy for susceptible colleges and universities to slowly replace their tenure track faculty with non-tenure faculty with less autonomy, less security, lower pay and lower benefits, which ultimately means these institutions basically abandon one of their missions: preserving human knowledge. It also makes the institution less democratic, which likely has a negative impact on the campus culture and the educational environment.
Just as this situation did not appear suddenly, I don't think it will change significantly in the near future. Although Congress is a powerful voice in higher education, and has had a direct role in creating the over-supply, the large and complex ecology of higher education institutions, society itself and the economy as a whole are also key players. What happens will depend on their interactions, and lobbying Congress alone may lead to unexpected and undesirable results. In the near term, I think the over-supply will persist (and if anything the job market will become even more competitive, but again this is not a completely bad thing), the number of non-tenured positions will continue to increase (mainly at lower-ranked institutions or for teaching intro classes at the higher-ranked places), undergraduate degrees will become even more commoditized, and the mission of knowledge preservation will be increasingly concentrated among the better or more financially stable institutions.
One long-term consequence may be a partitioning of the faculty at research universities into "research" faculty (tenure-track faculty who do research and teach mainly graduate and upper-level undergraduate courses, of which I am one) and "teaching" faculty (non-tenure track faculty who teach heavy course loads of lower-level undergraduate classes), and that does seem like the way things are going. I wish that research universities (and tenure-track faculty) would treat the non-tenure groups with more respect and include them more directly in the decision-making processes. And, I hope that we can find better ways of encouraging the very best young scholars to stick with it, even though the system will likely become only more brutal in the future.
To end on a more positive note, one genuinely beneficial thing we as academics could do would be to encourage our PhD students to consider non-academic trajectories. That is, I don't think we should view the PhD as being exclusively an academic degree, and we could strive to teach our PhD students a combination of both academic and practical skills. This would increase their options on the job market, which may reduce the overall brutality that individuals currently experience.
 Partly because I was in that market myself. And, now being in a tenure-track position at a good university, I'm lucky enough to be on the other side of that harrowing process. Had I written this a couple of years ago, I'm sure I would have said slightly different things.
 These are well covered by Roger Geiger's excellent and authoritative books on the evolution of the American university system, in the post-war period and since 1990. These books are highly recommended. Geiger takes what should be a dry and boring subject and makes it a fascinating and insightful story.
 This is true at both public and private universities. The only place it's less accurate is in medical research schools, where faculty lines are often funded out of "soft" money from research grants. (Some are still funded from medical school tuition revenue, so the coupling with student populations is not eliminated.) The result is that these faculty lines are mainly sensitive to changes in federal funding levels.
 Another complicating factor is that tenure lines are traditionally tied to departments, and their number depends on student demand for those courses offered by that department. That is, teaching is still a labor-constrained activity. The division of that labor into departments means that growth in faculty lines is driven by changes in the popularity of different disciplines. The benefits for the faculty job market created by overall growth in student enrollments will thus be distributed unevenly.
There are at least two factors that decouple the number of faculty lines from the popularity of the field: tenure, which means departments shrink only slowly in response to decreasing popularity (while growth in response to rising popularity can be much faster), and the commitment that all universities have to the preservation and production of knowledge, which means even an unpopular department may be maintained as a kind of cultural memory device.
 This is done partly through direct support to students (about 15% of the budget) and partly through grants (50% of the budget); typically, large portions of grants are in fact used to support graduate students by paying them as research assistants.
 Apparently, NSF has always struggled to justify its budget to Congress, which generally has an uncomfortable relationship with the idea of supporting basic research for the sake of humanity. For NSF, historically "supporting education," and more recently "supporting economic growth" (a.k.a. technology transfer), have been successful arguments, and these are reflected in funding priorities.
 This is almost surely true in Computer Science, where some of the best programs are also some of the largest. For example, MIT and CMU collectively have about 250 faculty; if they each produce a single graduate each year, that would be enough to place one person at each of the other 200-odd PhD-granting Computer Science departments in North America. The per-faculty production rate is probably not that high, but the overall volume may well be once we account for other top places like Stanford, Washington, Princeton, etc. If we include the fact that not every department hires every year, it seems entirely reasonable that the top 10 places could fill the entire annual market demand themselves.
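The footnote's back-of-envelope can be written out explicitly. A minimal sketch, using only the rough figures quoted above (these are approximations from the text, not real data):

```python
# Back-of-envelope from this footnote, written out explicitly.
# All figures are the rough numbers quoted above, not real data.
mit_cmu_faculty = 250            # combined CS faculty at MIT and CMU (approx.)
grads_per_faculty_per_year = 1   # assume one PhD graduate per faculty member per year
other_departments = 200          # other PhD-granting CS departments (approx.)

annual_graduates = mit_cmu_faculty * grads_per_faculty_per_year
surplus = annual_graduates - other_departments
print(f"graduates: {annual_graduates}, departments to fill: {other_departments}, "
      f"surplus: {surplus}")
```

Even under these crude assumptions, two schools alone out-produce the rest of the market's annual demand.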
 This effect probably happens faster for newer fields, e.g., complex systems. The best universities are all fairly sensitive to their perceived prestige and quality, and for them, it doesn't make strategic sense to risk their current standing with risky investments in untested fields. This means that lower-ranked universities who place smart bets can move up (at least during the time it takes for a field to become established enough that the top places start poaching the best people).
 Physics experienced a long expansion, but that had largely run its course in the United States by the time Congress trashed the Superconducting Super Collider in 1993. In contrast, biomedical research has been expanding fairly steadily for 70 years, which is probably why it dominates federal science budgets. The "golden age" of science was really the post-war and Sputnik eras, when federal spending was expanding faster than universities could satisfy the demand for research. The 1970s were apparently a fairly broad contraction, because Congress moved to limit the growth in science budgets (for instance, NASA's budget peaked in the early 1970s) and because student enrollment growth tempered. Since then, the expansions and contractions have been less even.
 On the other hand, had anyone convincingly explained to my younger self just what life would be like in my first few years as a professor, I may have decided to try a different career path. More generally, ignorance may be a crucial part of what makes the whole system work: it allows us to unknowingly take foolish risks that sometimes yield truly remarkable, or at least highly improbable, results. At the collective level, youthful foolishness may be essential to keeping the quality of the faculty high despite the brutality of the career path.
 Of course, in the meantime it's terrible that some institutions and individuals are taking advantage of the current powerlessness of these groups. They can and should be integrated into the academy and given a voice.
 Some graduate and professional degrees also show evidence of becoming commodities, for instance, MBAs. It's not clear that PhDs are facing similar pressures, although in my darker moments I believe it.
 From this perspective, things like Stanford's free online courses may be a truly disruptive innovation. They offer the possibility of dramatically lowered cost, dramatically increased "production" and they seem to require a currently specialized set of skills. Of course, their success could destroy what remains of the tenure track at smaller or less unique institutions.
 As I've learned, it's a highly stochastic and error-prone process. Departments tend to decide ahead of time to hire in a particular area, and this means top-notch candidates from outside that area are potentially passed over for less-top-notch candidates within the target area. The decision of which area to hire in is often driven by internal politics (which "group" is stronger, which has a louder voice, "whose turn" it is) or existing curricular needs rather than meritocratic or strategic concerns. And, even within a given area, it can be difficult to accurately assess true merit and relative quality, particularly for junior positions where each candidate's track record is, by definition, relatively short.
Did I mention that I'm reading PhD applications this week? Ugh.
 It certainly has in the past.
 Ironically, dividing the teaching and research between different groups of faculty is mostly how the Germans used to do things. Now, the Germans are Americanizing their system to some degree, while we Americans seem to be Germanizing ours.
 From my perspective, "early career" funding, fellowship and other young faculty support mechanisms seem to be wholly inadequate (in size and scope) and the easiest way to get them is to propose highly incremental, highly risk-averse research. This does not seem to be serving the right goals or to be teaching young faculty the right lessons about scholarship.
February 02, 2011
Whither computational paleobiology?
This week I'm in Washington DC, hanging out at the Smithsonian (aka the National Museum of Natural History) with paleontologists, paleobiologists, paleobotanists, palaeoentomologists, geochronographers, geochemists, macrostratigraphers and other types of rock scientists. The meeting is an NSF-sponsored workshop on the Deep Time Earth-Life Observatory Network (DETELON) project, which is a community effort to persuade NSF to fund large-ish interdisciplinary groups of scientists exploring questions about earth-life system dynamics and processes using data drawn from deep time (i.e., the fossil record).
One of the motivations here is the possibility to draw many different skill sets and data sources together in a synergistic way to shed new light on fundamental questions about how the biosphere interacts with (i.e., drives and is driven by) geological processes and how it works at the large scale, potentially in ways that might be relevant to understanding the changing biosphere today. My role in all this is to represent the potential of mathematical and computational modeling, especially of biotic processes.
I like this idea. Paleobiology is a wonderfully fascinating field, not just because it involves studying fossils (and dinosaurs; who doesn't love dinosaurs?), but also because it's a field rich with interesting puzzles. Surprisingly, the fossil record, or rather, the geological record (which includes things that are not strictly fossils), is incredibly rich, and the paleo folks have become very sophisticated in extracting information from it. Like many other sciences, they're now experiencing a data glut, brought on by the combination of several hundred years of hard work, a large community of data collectors and museums, along with computers and other modern technologies that make extracting, measuring and storing the data easier to do at scale. And, they're building large, central data repositories (for instance, this one and this one), which span the entire globe and all of time. What's lacking in many ways is the set of tools that can allow the field to automatically extract knowledge and build models around these big databases in novel ways.
Enter "computational paleobiology", which draws on the tools and methods of computer science (and statistics and machine learning and physics) and the questions of paleobiology, ecology, macroevolution, etc. At this point, there aren't many people who would call themselves a computational paleobiologist (or computational paleo-anything), which is unfortunate. But, if you think evolution and fossils are cool, if you like data with interesting stories, if you like developing clever algorithms for hard inference problems or if you like developing mathematical or computational models for complex systems, if you like having an impact on real scientific questions, and if you like a wide-open field, then I think this might be the field for you.
October 27, 2010
Story-telling, statistics, and other grave insults
The New York Times (and the NYT Magazine) has been running a series of pieces about math, science and society written by John Allen Paulos, a mathematics professor at Temple University and author of several popular books. His latest piece caught my eye because it's a topic close to my heart: stories vs. statistics. That is, when we seek to explain something, do we use statistics and quantitative arguments using mainly numbers or do we use stories and narratives featuring actors, motivations and conscious decisions? Here are a few good excerpts from Paulos's latest piece:
...there is a tension between stories and statistics, and one under-appreciated contrast between them is simply the mindset with which we approach them. In listening to stories we tend to suspend disbelief in order to be entertained, whereas in evaluating statistics we generally have an opposite inclination to suspend belief in order not to be beguiled. A drily named distinction from formal statistics is relevant: we’re said to commit a Type I error when we observe something that is not really there and a Type II error when we fail to observe something that is there. There is no way to always avoid both types, and we have different error thresholds in different endeavors, but the type of error people feel more comfortable with may be telling.
I’ll close with perhaps the most fundamental tension between stories and statistics. The focus of stories is on individual people rather than averages, on motives rather than movements, on point of view rather than the view from nowhere, context rather than raw data. Moreover, stories are open-ended and metaphorical rather than determinate and literal.
It seems to me that for science, the correct emphasis should be on the statistics. That is, we should be more worried about observing something that is not really there. But as humans, statistics is often too dry and too abstract for us to understand intuitively, to generate that comfortable internal feeling of understanding. Thus, our peers often demand that we give not only the statistical explanation but also a narrative one. Sometimes, this can be tricky because the structures of the two modes of explanation are in fundamental opposition, for instance, if the narrative must include notions of randomness or stochasticity. In such a case, there is no reason for any particular outcome, only reasons for ensembles or patterns of outcomes. The idea that things can happen for no reason is highly counter-intuitive, and yet in the statistical sciences (which is today essentially all sciences), this is often a critical part of the correct explanation. For the social sciences, I think this is an especially difficult balance to strike because our intuition about how the world works is built up from our own individual-level experiences, while many of the phenomena we care about are patterns above that level, at the group or population levels.
This is not a new observation and it is not a tension exclusive to the social sciences. For instance, here is Stephen Jay Gould (1941-2002), the eminent American paleontologist, speaking about the differences between microevolution and macroevolution (excerpted from Ken McNamara's "Evolutionary Trends"):
In Flatland, E.A. Abbott's (1884) classic science-fiction fable about realms of perception, a sphere from the world of three dimensions enters the plane of two-dimensional Flatland (where it is perceived as an expanding circle). In a notable scene, he lifts a Flatlander out of his own world and into the third dimension. Imagine the conceptual reorientation demanded by such an utterly new and higher-order view. I do not suggest that the move from organism to species could be nearly so radical, or so enlightening, but I do fear that we have missed much by over reliance on familiar surroundings.
An instructive analogy might be made, in conclusion, to our successful descent into the world of genes, with resulting insight about the importance of neutralism in evolutionary change. We are organisms and tend to see the world of selection and adaptation as expressed in the good design of wings, legs, and brains. But randomness may predominate in the world of genes--and we might interpret the universe very differently if our primary vantage point resided at this lower level. We might then see a world of largely independent items, drifting in and out by the luck of the draw--but with little islands dotted about here and there, where selection reins in tempo and embryology ties things together. What, then, is the different order of a world still larger than ourselves? If we missed the world of genic neutrality because we are too big, then what are we not seeing because we are too small? We are like genes in some larger world of change among species in the vastness of geological time. What are we missing in trying to read this world by the inappropriate scale of our small bodies and minuscule lifetimes?
To quote Howard T. Odum (1924-2002), the eminent American ecologist, on a similar theme: "To see these patterns which are bigger than ourselves, let us take a special view through the macroscope." Statistical explanations, and the weird and diffuse notions of causality that come with them, seem especially well suited to express in a comprehensible form what we see through this "macroscope" (and often what we see through microscopes). And increasingly, our understanding of many important phenomena, be they social network dynamics, terrorism and war, sustainability, macroeconomics, ecosystems, the world of microbes and viruses or cures for complex diseases like cancer, depends on us seeing clearly through some kind of macroscope to understand the statistical behavior of a population of potentially interacting elements.
Seeing clearly, however, depends on finding new and better ways to build our intuition about the general principles that take inherent randomness or contingency at the individual level and produce complex patterns and regularities at the macroscopic or population level. That is, to help us understand the many counter-intuitive statistical mechanisms that shape our complex world, we need better ways of connecting statistics with stories.
27 October 2010: This piece is also being featured on Nature's Soapbox Science blog.
 Actually, even defining what we mean by "explain" is a devilishly tricky problem. Invariably, different fields of scientific research have (slightly) different definitions of what "explain" means. In some cases, a statistical explanation is sufficient, in others it must be deterministic, while in still others, even if it is derived using statistical tools, it must be rephrased in a narrative format in order to provide "intuition". I'm particularly intrigued by the difference between the way people in machine learning define a good model and the way people in the natural sciences define it. The difference appears, to my eye, to be different emphases on the importance of intuitiveness or "interpretability"; it's currently deemphasized in machine learning while the opposite is true in the natural sciences. Fortunately, a growing number of machine learners are interested in building interpretable models, and I expect great things for science to come out of this trend.
In some areas of quantitative science, "story telling" is a grave insult, leveled whenever a scientist veers too far from statistical modes of explanation ("science") toward narrative modes ("just so stories"). While sometimes a justified complaint, I think completely deemphasizing narratives can undermine scientific progress. Human intuition is currently our only way to generate truly novel ideas, hypotheses, models and principles. Until we can teach machines to generate truly novel scientific hypotheses from leaps of intuition, narratives, supported by appropriate quantitative evidence, will remain a crucial part of science.
 Another fascinating aspect of the interaction between these two modes of explanation is that one seems to be increasingly invading the other: narratives, at least in the media and other kinds of popular discourse, increasingly ape the strong explanatory language of science. For instance, I wonder when Time Magazine started using formulaic titles for its issues like "How X happens and why it matters" and "How X affects Y", which dominate its covers today. There are a few individual writers who are amazingly good at this form of narrative, with Malcolm Gladwell being the one that leaps most readily to my mind. His writing is fundamentally in a narrative style, stories about individuals or groups or specific examples, but the language he uses is largely scientific, speaking in terms of general principles and notions of causality. I can also think of scientists who import narrative discourse into their scientific writing to great effect. Doing so well can make scientific writing less boring and less opaque, but if it becomes more important than the science itself, it can lead to "pathological science".
 Which is perhaps why the common belief that "everything happens for a reason" persists so strongly in popular culture.
 It cannot, of course, be the entire explanation. For instance, the notion among Creationists that natural selection is equivalent to "randomness" is completely false; randomness is a crucial component of the way natural selection constructs complex structures (without the randomness, natural selection could not work) but the selection itself (what lives versus what dies) is highly non-random and that is what makes it such a powerful process.
What makes statistical explanations interesting is that many of the details are irrelevant, i.e., generated by randomness, but the general structure, the broad brush-strokes of the phenomena are crucially highly non-random. The chief difficulty of this mode of investigation is in correctly separating these two parts of some phenomena, and many arguments in the scientific literature can be understood as a disagreement about the particular separation being proposed. Some arguments, however, are more fundamental, being about the very notion that some phenomena are partly random rather than completely deterministic.
 Another source of tension on this question comes from our ambiguous understanding of the relationship between our perception and experience of free will and the observation of strong statistical regularities among groups or populations of individuals. This too is a very old question. It tormented Rev. Thomas Malthus (1766-1834), the great English demographer, in his efforts to understand how demographic statistics like birth rates could be so regular despite the highly contingent nature of any particular individual's life. Malthus's struggles later inspired Ludwig Boltzmann (1844-1906), the famous Austrian physicist, to use a statistical approach to model the behavior of gas particles in a box. (Boltzmann had previously been using a deterministic approach to model every particle individually, but found it too complicated.) This contributed to the birth of statistical physics, one of the three major branches of modern physics and arguably the branch most relevant to understanding the statistical behavior of populations of humans or genes.
May 13, 2010
Nature's Journal Club
A few months ago, I was invited to write a column for Nature's Journal Club. This series appears every week in their print edition and the text is also posted on the Journal Club's blog. The instructions were pretty simple: pick a paper with broad appeal, which has been published sometime in the last year or so and which has not appeared in Nature, and write about 260 words summarizing the results and describing why I like it. I was encouraged to be provocative, too, but I'll leave it to you to decide if I was that bold.
After mulling it over, I agreed to write a column on a paper by James O'Dwyer and Jessica Green, both at the University of Oregon in Eugene [2,3]. You can read the blog version of the column here or the print version here. Here's the setup:
Many species are concentrated in biodiversity hot spots such as tropical rainforests and coral reefs. But our estimates of how many species these and other ecosystems contain are very rough. Conservation efforts and ecological theories would be better served by a more accurate picture.
Our best guesses come from empirical species–area relationships, which count the number of species observed as a function of geographical area. These relationships show sharp increases at local and continental scales, but slow growth at intermediate scales. Despite decades of study, ecologists have no clear explanation of this pattern's origins or what causes deviations from it.
These species-area relationships (SARs) are ubiquitous in ecology largely because ecological survey practices have long focused on counting species within a specific study region. Most such data is collected from small survey areas and these data are then combined within a meta-study to get up to the regional or continental scales. Perhaps because of the ease of constructing SARs, much ink has been spilled over their structure. They're also our only reliable tool for estimating how many species live in places like the Amazon or the Great Barrier Reef, which are too large to survey completely.
What's clear from all this work is that there are some general patterns in SARs, and that if we want to use them in unconventional places, such as in estimating the number of microbial species in the world (or in smaller regions, like your gut), then we need a good theoretical explanation of where those patterns come from and what processes cause deviations from them at different length scales. That is, we need a good null model.
Creating one is largely what O'Dwyer and Green have done. There have, of course, been previous explanations of parts of the SAR pattern, with various amounts of biological realism. On the more unrealistic side, simple iid sampling from a sufficiently heavy-tailed distribution can generate SARs with power-law slopes in the right neighborhood. But, this kind of explanation ignores biological processes like speciation, extinction, dispersal in space, competition, etc., not to mention abiotic factors like geography and climate.
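To see what that unrealistic explanation looks like in practice, here's a minimal sketch (my own toy illustration, not from the paper): assign each individual a species label drawn iid from a heavy-tailed Zipf distribution, scatter the individuals uniformly in space, and count distinct species in nested quadrats.

```python
import numpy as np

rng = np.random.default_rng(42)

M = 20000                        # number of individuals
species = rng.zipf(1.8, size=M)  # iid heavy-tailed species labels
xy = rng.random((M, 2))          # uniform positions in the unit square

def species_in_area(side):
    """Count distinct species inside a side x side quadrat at the origin."""
    inside = (xy[:, 0] < side) & (xy[:, 1] < side)
    return len(np.unique(species[inside]))

sides = [0.05, 0.1, 0.2, 0.4, 0.8]
sar = [species_in_area(s) for s in sides]
# sar increases with area, and on log-log axes it is roughly straight
# over much of the range, despite there being no biology in the model
```

The point of the sketch is exactly the criticism in the text: a plausible-looking SAR emerges with no speciation, dispersal, or competition anywhere in sight.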
Building on previous work on neutral processes to explain biodiversity patterns, O'Dwyer and Green built a null model containing only the neutral processes of birth, death and dispersal. What makes this model different from, and better than, previous efforts is that it explicitly incorporates a notion of spatial structure by embedding species in space and allowing them to move around. This is helpful because it gets directly at the SAR. The problem, however, is that spatially explicit stochastic processes can be difficult to solve mathematically.
Fortunately, O'Dwyer and Green could use tools from quantum field theory, which is well suited to solving this kind of spatial stochastic model. Aside from the coolness of using quantum field theory in ecology and the fact that it predicts an SAR that agrees with decades of data, what I like about this result is that it illustrates two things close to my heart. First, it's a beautiful example of a null model. Because it includes only boring, neutral processes in generating its prediction for the SAR, when empirical data deviate from the model's prediction, those deviations can be interpreted as model mis-specification errors. In this case, that means interesting, ecologically significant non-neutral processes like competition, predation, habitat, climate, etc. In this way, it can generate new, specific hypotheses about what to test next.
The second is that this approach to model building moves the emphasis of the science away from small-scale (in time or space), context-dependent processes and towards more large-scale (in time and space) neutral dynamics and principles. This kind of perspective is currently more common in the physical sciences than in the biological ones, but I hope to see more of it in biology in the future, and it's one of the things I think physics has to offer ecology.
This aspect of O'Dwyer and Green's work fits nicely with my own on explaining why we see such huge morphological diversity in living and extinct species, and how whales got so much bigger than mice. In a way, the model I've been using is pretty similar to O'Dwyer and Green's: it omits all ecological processes, climate and geography, but includes neutral processes representing species birth, death (extinction), and dispersal (changes in body mass). The fact that both our neutral models do pretty well at correctly predicting the observed empirical data suggests that perhaps randomness, structured by a few relatively banal processes, might be a pretty good general explanation of how the biological world works at these very large scales. I suspect also that similar models, expanded to include some role for social institutions, will also work well to explain how societies work at large spatial and temporal scales. Finding out if this is true is something I hope to be around for.
 This criterion was a pleasant surprise. As much as I dislike Nature's outsized status and influence in science, I've been pleasantly surprised on several occasions by some of their policies. Someone there genuinely seems to care about the integrity of the scientific process.
 Full disclosure: Jessica recently joined the external faculty here at SFI and James will be starting as a postdoc at SFI in the Fall. That being said, I haven't really interacted much with either of them.
 O'Dwyer and Green, "Field theory for biogeography: a spatially explicit model for predicting patterns of biodiversity." Ecology Letters 13, 87-95 (2010).
It's uncommon in the biological sciences, but not unknown. Mathematical evolutionary theory and population genetics are good examples of communities that frequently use null models in this way. I think the reason such an approach is more common in the physical sciences today is that we actually understand a great deal about the fundamental processes there, and what things can and should vary in different contexts, while we're still sorting those things out in biology. For sure, we're making progress, but it's slow going.
It would be good for other fields, too, such as sociology and political science. The issue is, I think, that scientific progress toward general principles is always limited by the availability of data that reveal those principles. When scientists of any kind are restricted to having either rich data on a small number of examples (think of alchemy), or poor data on a large number of examples (think of polling data), it's hard to make real progress. In both cases, there is typically an embarrassment of reasonable explanations for the observed patterns and it's difficult to distinguish them with the crappy data we can get. This is partly why I'm excited about the increasing availability of "big data" on social behavior, largely coming out of digital systems like email, Facebook, Twitter, etc. These data are not a panacea for social science, since they have their own weird biases and pathologies, but they're rich data on huge samples of individuals, which is qualitatively different from what was available to social scientists in the past. Perhaps we can answer old questions using these new data, and perhaps we can even ask some new questions such as, Are the behavioral patterns at the population scale simply scaled up versions of the behavioral patterns at the individual scale?
To summarize: what I think physics has to offer ecology, among other fields, is (i) a very impressive and useful set of mathematical tools and models, and (ii) a valuable shift in perspective, away from small-scale processes and toward large-scale processes and general principles. I'm not advocating that we replace ecologists with physicists, but rather that we encourage physicists to train and work with ecologists, and vice versa. Biology will always need scientists focused on understanding specific contexts, but it also needs scientists focused on synthesizing those context-specific results into more general theories, as I think O'Dwyer and Green have done. Physicists often have a good intuition about which details will be important at large scales, and they often have good mathematical tools for working out whether that intuition is true.
The statistical models that underlie most statistical hypothesis tests, which are ubiquitous in the biological and social sciences, are technically null models, too. But, in many cases, these are wholly inappropriate, since their iid assumptions are grossly violated by the mechanistic processes actually at play. That being said, it can be hard to come up with a good null model, because often we don't know which processes are the important ones to include. A topic for another day, I think.
February 02, 2008
The Importance of Mathematics
W. Timothy Gowers (homepage) won the Fields Medal in 1998 for work in functional analysis and combinatorics. Pleasantly, he also writes regularly at Gowers's Blog, although much of what he writes is over my head (I am probably more to blame for that fact than Gowers, however). But I particularly recommend this very pleasant talk he gave at the Clay Mathematics Institute for their Millennium Meeting back in 2000. His topic is generally the "Importance of Mathematics" (which of course is a familiar notion to long-time readers here), and he gives a highly entertaining intellectual meditation on the subject, touching briefly on problems like graph coloring, computational complexity (as a way of distinguishing those things practical in theory from those things practical in practice), knot theory, and the Erdos-Kac theorem that the number of distinct prime factors of a randomly chosen integer is asymptotically normally distributed (something I hadn't heard before). Along the way, he gives a good explanation of why it's dangerous to cut funding to "useless" parts of mathematics (or science) in favor of funding only the "useful" bits, and tries to convey the idea that the most important mathematics is also often the most beautiful.
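The Erdos-Kac result is easy to poke at numerically; here's a quick sketch of my own (with deliberately loose expectations, since the convergence to normality is famously slow):

```python
import math
import random
import statistics

def omega(n):
    """Number of distinct prime factors of n, by trial division."""
    count, d = 0, 2
    while d * d <= n:
        if n % d == 0:
            count += 1
            while n % d == 0:
                n //= d
        d += 1
    return count + (1 if n > 1 else 0)

# Erdos-Kac: for a random integer near N, omega(n) is approximately
# normal with mean and variance both about log log N.
random.seed(7)
N = 10**6
sample = [omega(random.randrange(2, N)) for _ in range(2000)]
mean = statistics.mean(sample)
var = statistics.variance(sample)
loglog = math.log(math.log(N))   # about 2.6 for N = 10^6
# mean and var should both be in the rough vicinity of log log N
```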
June 08, 2007
Power laws and all that jazz
With apologies to Tolkien:
Three Power Laws for the Physicists, mathematics in thrall,
Four for the biologists, species and all,
Eighteen behavioral, our will carved in stone,
One for the Dark Lord on his dark throne.
In the Land of Science where Power Laws lie,
One Paper to rule them all, One Paper to find them,
One Paper to bring them all and in their moments bind them,
In the Land of Science, where Power Laws lie.
From an interest that grew directly out of my work characterizing the frequency of severe terrorist attacks, I'm happy to say that the review article I've been working on with Cosma Shalizi and Mark Newman -- on accurately characterizing power-law distributions in empirical data -- is finally finished. The paper covers all aspects of the process, from fitting the distribution to testing the hypothesis that the data is distributed according to a power law, and to make it easy for folks in the community to use the methods we recommend, we've also made our code available.
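For the curious, the core of the fitting step is a one-line maximum-likelihood estimator for the exponent. A minimal sketch (with xmin fixed by hand; the paper recommends estimating xmin itself by minimizing the Kolmogorov-Smirnov distance, which this sketch omits):

```python
import math
import random

def plfit_alpha(data, xmin):
    """Continuous MLE for the power-law exponent:
    alpha_hat = 1 + n / sum(ln(x_i / xmin)), over all x_i >= xmin."""
    tail = [x for x in data if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)

# Sanity check on synthetic Pareto data with a known exponent,
# generated by inverse-transform sampling.
random.seed(0)
alpha_true, xmin = 2.5, 1.0
data = [xmin * (1 - random.random()) ** (-1 / (alpha_true - 1))
        for _ in range(50000)]
alpha_hat = plfit_alpha(data, xmin)
# alpha_hat should land very close to 2.5
```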
So, rejoice, rejoice all ye people of Science! Go forth, fit and validate your power laws!
For those still reading, I have a few thoughts about this paper now that it's been released into the wild. First, I naturally hope that people read the paper and find it interesting and useful. I also hope that we as a community start asking ourselves what exactly we mean when we say that such-and-such a quantity is "power-law distributed," and whether our meaning would be better served at times by using less precise terms such as "heavy-tailed" or simply "heterogeneous." For instance, we might simply mean that visually it looks roughly straight on a log-log plot. To which I might reply (a) power-law distributions are not the only thing that can do this, (b) we haven't said what we mean by roughly straight, and (c) we haven't been clear about why we might prefer a priori such a form over alternatives.
The paper goes into the first two points in some detail, so I'll put those aside. The latter point, though, seems like one that's gone un-addressed in the literature for some time now. In some cases, there are probably legitimate reasons to prefer an explanation that assumes large events (and especially those larger than we've observed so far) are distributed according to a power law -- for example, cases where we have some convincing theoretical explanations that match the microscopic details of the system, are reasonably well motivated, and whose predictions have held up under some additional tests. But I don't think most places where power-law distributions have been "observed" have this degree of support for the power-law hypothesis. (In fact, most simply fit a power-law model and assume that it's correct!) We also rarely ask why a system necessarily needs to exhibit a power-law distribution in the first place. That is, would the system behave fundamentally differently, perhaps from a functional perspective, if it instead exhibited a log-normal distribution in the upper tail?
Update 15 June: Cosma also blogs about the paper, making many excellent points about the methods we describe for dealing with data, as well as making several very constructive points about the general affair of power-law research. Well worth the time to read.
May 27, 2007
This week, I'm in Snowbird, UT for SIAM's conference on Applications of Dynamical Systems (DS07). I'm here for a mini-symposium on complex networks organized by Mason Porter and Peter Mucha. I'll be blogging about these (and maybe other) network sessions as time allows (I realize that I still haven't blogged about NetSci last week - that will be coming soon...).
May 21, 2007
This week, I'm in New York City for the International Conference on Network Science, being held at the New York Hall of Science Museum in Queens. I may not be able to blog each day about the events, but I'll be posting my thoughts and comments as things progress. Stay tuned. In the meantime, here's the conference talk schedule.
IPAM - Random and Dynamic Graphs and Networks (Days 4 & 5)
Rather than my usual format of summarizing the things that got me thinking during the last few days, I'm going to go with a more free-form approach.
Thursday began with Jennifer Chayes (MSR) discussing some analytical work on adapting convergence-in-distribution proof techniques to ensembles of graphs. She introduced the cut-norm graph distance metric (useful on dense graphs; says that they have some results for sparse graphs, but that it's more difficult for those). The idea of graph distance seems to pop up in many different areas (including several I've been thinking of) and is closely related to the GRAPH ISOMORPHISM problem (which is not known to be NP-complete, but nor is it known to be in P). For many reasons, it would be really useful to be able to calculate in polynomial time the minimum edge-edit distance between two graphs; this would open up a lot of useful techniques based on transforming one graph into another.
Friday began with a talk by Jeannette Janssen (Dalhousie University) on a geometric preferential attachment model, which is basically a geometric random graph but where nodes have a sphere of attraction (for new edges) that has volume proportional to the node's in-degree. She showed some very nice mathematical results on this model. I wonder if this idea could be generalized to arbitrary manifolds (with a distance metric on them) and attachment kernels. That is, imagine that our complex network is actually embedded in some complicated manifold and the attachment is based on some function of the distance on that manifold between the two nodes. The trick would then be to infer both the structure of the manifold and the attachment function from real data. Of course, without some constraints on both features, it would be easy to construct an arbitrary pair (manifold and kernel) that would give you exactly the network you observed. Is it sufficient to get meaningful results that both should be relatively smooth (continuous, differentiable, etc.)?
Jeannette's talk was followed by Filippo Menczer's talk on mining traffic data from the Internet2/Abilene network. The data set was based on daily dumps of end-to-end communications (packet headers with client and server IDs anonymized) and looked at a variety of behaviors of this traffic. He used this data to construct interaction graphs between clients and servers, clients and applications (e.g., "web"), and a few other things. The analysis seems relatively preliminary in the sense that there are a lot of data issues lurking in the background (things like aggregated traffic from proxies, aliasing and masking effects, etc.) that make it difficult to translate conclusions about the traffic into conclusions about real individual users. But, fascinating stuff, and I'm looking forward to seeing what else comes out of that data.
The last full talk I saw was by Raissa D'Souza on competition-induced preferential attachment, and a little bit at the end on dynamic geometric graphs and packet routing on them. I've seen the technical content of the preferential attachment talk before, but it was good to have the reminder that power-law distributions are not necessarily the only game in town for heavy-tailed distributions, and that even though the traditional preferential attachment mechanism may not be a good model of the way real growing networks change, it may be that another mechanism that better models the real world can look like preferential attachment. This ties back to Sidney Redner's comment a couple of days before about the citation network: why does the network look like one grown by preferential attachment, when we know that's not how individual authors choose citations?
May 09, 2007
IPAM - Random and Dynamic Graphs and Networks (Day 3)
This week, I'm in Los Angeles for the Institute for Pure and Applied Mathematics' (IPAM, at UCLA) workshop on Random and Dynamic Graphs and Networks; this is the third of five entries based on my thoughts from each day. As usual, these topics are a highly subjective slice of the workshop's subject matter...
The impact of mobility networks on the worldwide spread of epidemics
I had the pleasure of introducing Alessandro Vespignani (Indiana University) for the first talk of the day on epidemics in networks, and his work in modeling the effect that particles (people) moving around on the airport network have on models of the spread of disease. I've seen most of this stuff before from previous versions of Alex's talk, but there were several nice additions. The one that struck the audience the most was a visualization of all of the individual flights over the space of a couple of days in the eastern United States; the animation was made by Aaron Koblin for a different project, but was still quite effective in conveying the richness of the air traffic data that Alex has been using to do epidemic modeling and forecasting.
On the structure of growing networks
Sidney Redner gave the pre-lunch talk about his work on the preferential attachment growing-network model. Using the master equation approach, Sid explored an extremely wide variety of properties of the PA model, such as the different regimes of degree distribution behavior for sub-, exact, and different kinds of super-linear attachment rates, the first-mover advantage in the network, and the importance of initial degree in determining final degree, along with several variations on the initial model. The power of the master equation approach was clearly evident; it's something I should really learn more about.
He also discussed his work analyzing 100 years of citation data from the Physical Review journal (about 350,000 papers and 3.5 million citations; in 1890, the average number of references in a paper was 1, while in 1990, the average number had increased to 10), particularly with respect to his trying to understand the evidence for linear preferential attachment as a model of citation patterns. Quite surprisingly, he showed that for the first 100 or so citations, papers in PR have nearly linear attachment rates. One point Sid made several times in his talk is that almost all of the results for PA models are highly sensitive to variations in the precise details of the attachment mechanism, and that it's easy to get something quite different (so, no power laws) without trying very hard.
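For the simplest linear PA model (each new node attaches a single edge), the master equation can be iterated directly, and it gives the well-known closed form p_k = 4 / (k (k + 1) (k + 2)), i.e., a k^-3 tail. A quick sketch of that standard calculation:

```python
# Stationary degree distribution of the simplest linear
# preferential-attachment model (one new edge per node):
# the master equation reduces to the recursion
#   p_k = p_{k-1} * (k - 1) / (k + 2),  with  p_1 = 2/3,
# whose closed form is p_k = 4 / (k (k + 1) (k + 2)) ~ k^-3.
p = {1: 2.0 / 3.0}
for k in range(2, 200):
    p[k] = p[k - 1] * (k - 1) / (k + 2)

closed_form = {k: 4.0 / (k * (k + 1) * (k + 2)) for k in p}
total = sum(p.values())   # should be very close to 1 (the tail is tiny)
```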
Finally, the question he ended with: why does linear PA seem to be a pretty good model for how citations accrue to papers, even though real citation patterns are clearly not dictated by the PA model?
The last talk-slot of the day was replaced by a panel discussion, put together by Walter Willinger and chaired by Mark Newman. Instead of the usual situation where the senior people of a field sit on the panel, this panel was composed of junior people (with the expectation that the senior people in the audience would talk anyway). I was asked to sit on the panel, along with Ben Olding (Harvard), Lea Popovic (Cornell), Leah Shaw (Naval Research Lab), and Lilit Yeghiazarian (UCLA). We each made a brief statement about what we liked about the workshop so far, and what kinds of open questions we would be most interested in seeing the community study.
For my own part, I mentioned many of the questions and themes that I've blogged about the past two days. In addition, I pointed out that function is more than just structure, being typically structure plus dynamics, and that our models currently do little to address the dynamics part of this equation. (For instance, can dynamical generative models of particular kinds of structure tell us more about why networks exhibit those structures specifically, and not some other variety?) Lea and Leah also emphasized dynamics as being a huge open area in terms of both modeling and mechanisms, with Lea pointing out that it's not yet clear what are the right kinds of dynamical processes that we should be studying with networks. (I made a quick list of processes that seem important, but only came up with two main categories, branching-contact-epidemic-percolation processes and search-navigation-routing processes. Sid later suggested that consensus-voting style processes, akin to the Ising model, might be another, although there are probably others that we haven't thought up yet.) Ben emphasized the issues of sampling, for instance, sampling subgraphs of our model, e.g., the observable WWW or even just the portion we can crawl in an afternoon, and dealing with sampling effects (i.e., uncertainty) in our models.
The audience had a lot to say on these and other topics, and particularly so on the topics of what statisticians can contribute to the field (and also why there are so few statisticians working in this area; some suggestions that many statisticians are only interested in proving asymptotic results for methods, and those that do deal with data are working on bio-informatics-style applications), and on the cultural difference between the mathematicians who want to prove nice things about toy models (folks like Christian Borgs, Microsoft Research) as a way of understanding the general properties of networks and of their origin, and the empiricists (like Walter Willinger) who want accurate models of real-world systems that they can use to understand their system better. Mark pointed out that there's a third way in modeling, which relies on using an appropriately defined null model as a probe to explore the structure of your network. That is, a null model that reproduces some of the structure you see in your data, but is otherwise maximally random, can be used to detect the kind of structure the model doesn't explain (so-called "modeling errors", in contrast to "measurement errors"), and thus be used in the standard framework of error modeling that science has used successfully in the past to understand complex systems.
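Mark's null-model-as-probe idea is easy to sketch in code: randomize the observed network while preserving every node's degree, via double edge swaps, and then compare any other structural measure against the randomized ensemble. A minimal sketch (my own, with simplifying assumptions: simple undirected graphs, edges as tuples):

```python
import random

def double_edge_swap(edges, nswaps, seed=0):
    """Randomize a graph while preserving every node's degree, by
    repeatedly swapping the endpoints of two random edges,
    (a-b, c-d) -> (a-d, c-b), rejecting self-loops and multi-edges."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    edge_set = set(frozenset(e) for e in edges)
    done = 0
    while done < nswaps:
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue          # would create a self-loop
        if frozenset((a, d)) in edge_set or frozenset((c, b)) in edge_set:
            continue          # would create a multi-edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges

# A toy graph: a triangle with pendant edges. Its degree sequence is
# exactly preserved under rewiring even as its wiring changes.
g = [(0, 1), (1, 2), (2, 0), (0, 3), (1, 4), (2, 5)]
rewired = double_edge_swap(g, 10)
```

Anything that differs systematically between the real graph and the rewired ensemble (clustering, say) is structure the degree sequence alone doesn't explain.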
All-in-all, I think the panel discussion was a big success, and the conversation certainly could have gone on well past the one-hour limit that Mark imposed.
May 08, 2007
IPAM - Random and Dynamic Graphs and Networks (Day 2)
This week, I'm in Los Angeles for the Institute for Pure and Applied Mathematics' (IPAM, at UCLA) workshop on Random and Dynamic Graphs and Networks; this is the second of five entries based on my thoughts from each day. As usual, these topics are a highly subjective slice of the workshop's subject matter...
Biomimetic searching strategies
Massimo Vergassola (Institut Pasteur) started the day with an interesting talk that had nothing to do with networks. Massimo discussed the basic problem of locating a source of smelly molecules in the macroscopic world where air currents cause pockets of the smell to be sparsely scattered across a landscape, thus spoiling the chemotaxis (gradient ascent) strategy used by bacteria, and a clever solution for it (called "infotaxis") based on trading off exploration and exploitation via an adaptive entropy minimization strategy.
Diversity of graphs with highly variable connectivity
Following lunch, David Alderson (Naval Postgraduate School) described his work with Lun Li (Caltech) on understanding just how different networks with a given degree distribution can be from each other. The take-home message of Dave's talk is, essentially, that the degree distribution is a pretty weak constraint on other patterns of connectivity, and is not a sufficient statistical characterization of the global structure of the network with respect to many (most?) of the other (say, topological and functional) aspects we might care about. Although he focused primarily on degree assortativity, the same kind of analysis could in principle be done for other network measures (clustering coefficient, diameter, vertex-vertex distance distribution, etc.), none of which are wholly independent of the degree distribution, or of each other! (I've rarely seen the interdependence of these measures discussed (mentioned?) in the literature, even though they are often treated as independent.)
In addition to describing his numerical experiments, Dave sounded a few cautionary notes about the assumptions that are often made in the complex networks literature (particularly by theoreticians using random-graph models) on the significance of the degree distribution. For instance, the configuration model with a power-law degree sequence (and similarly, graphs constructed via preferential attachment) yields networks that look almost nothing like any real-world graph that we know, except for having vaguely similar degree distributions, and yet they are often invoked as reasonable models of real-world systems. In my mind, it's not enough to simply fix up our existing random-graph models to instead define an ensemble with a specific degree distribution, and a specific clustering coefficient, and a diameter, or whatever our favorite measures are. In some sense all of these statistical measures just give a stylized picture of the network, and will always be misleading with respect to other important structural features of real-world networks. For the purposes of proving mathematical theorems, I think these simplistic toy models are actually very necessary -- since their strong assumptions make analytic work significantly easier -- so long as we also willfully acknowledge that they are a horrible model of the real world. For the purposes of saying something concrete about real networks, we need more articulate models, and, probably, models that are domain specific. That is, I'd like a model of the Internet that respects the idiosyncrasies of this distributed, engineered and evolving system; a model of metabolic networks that respects the strangeness of biochemistry; and a model of social networks that understands the structure of individual human interactions. More accurately, we probably need models that understand the function that these networks fulfill, and respect the dynamics of the network in time.
Greedy search in social networks
David Liben-Nowell (Carleton College) then closed the day with a great talk on local search in social networks. The content of this talk largely mirrored that of Ravi Kumar's talk at GA Tech back in January, which covered an empirical study of the distribution of the (geographic) distance covered by friendship links in the LiveJournal network (from 2003, when it had about 500,000 users located in the lower 48 states). This work combined some nice data analysis with attempts to validate some of the theoretical ideas due to Kleinberg for locally navigable networks, and a nice generalization of those ideas to networks with non-uniform population distributions.
An interesting point that David made early in his talk was that homophily is not sufficient to explain the presence of either the short paths that Milgram's original six-degrees-of-separation study demonstrated, or even the existence of a connected social graph! That is, without a smoothly varying notion of "likeness", homophily would lead us to expect disconnected components in the social network. If both likeness and the population density in the likeness space varied smoothly, then a homophilic social web would cover the space, but the average path length would be long, O(n). In order to get the "small world" that we actually observe, we need some amount of non-homophilic connections, or perhaps multiple kinds of "likeness", or maybe some diversity in the preference functions that individuals use to link to each other. Also, it's still not clear what mechanism would lead to the kind of link-length distribution predicted by Kleinberg's model of optimally navigable networks - an answer to this question would, presumably, tell us something about why modern societies organize themselves the way they do.
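For readers unfamiliar with Kleinberg's construction, here's a toy one-dimensional sketch of my own (not Liben-Nowell's rank-based generalization): each node on a ring links to its two neighbors plus one long-range contact drawn with probability proportional to 1/distance, which is the navigable exponent in one dimension, and greedy routing then finds short paths.

```python
import random

random.seed(3)
n = 1000

def long_range_contact(u):
    """Sample a long-range link distance with P(d) ~ 1/d, random direction."""
    distances = range(1, n // 2)
    weights = [1.0 / d for d in distances]
    d = random.choices(distances, weights=weights)[0]
    return (u + random.choice((-d, d))) % n

# Each node: two ring neighbors plus one Kleinberg-style long link.
contacts = {u: [(u - 1) % n, (u + 1) % n, long_range_contact(u)]
            for u in range(n)}

def ring_dist(a, b):
    return min((a - b) % n, (b - a) % n)

def greedy_path_length(s, t):
    """Greedy routing: always forward to the contact closest to t."""
    hops = 0
    while s != t:
        s = min(contacts[s], key=lambda v: ring_dist(v, t))
        hops += 1
    return hops

avg = sum(greedy_path_length(random.randrange(n), random.randrange(n))
          for _ in range(200)) / 200
# avg comes out far below n/4, the expected ring distance between
# random pairs: a few well-placed long links make the network navigable
```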
May 07, 2007
IPAM - Random and Dynamic Graphs and Networks (Day 1)
This week, I'm in Los Angeles for the Institute for Pure and Applied Mathematics' (IPAM, at UCLA) workshop on random and dynamic graphs and networks. This workshop is the third of four in their Random Shapes long program. The workshop has the usual format, with research talks throughout the day, punctuated by short breaks for interacting with your neighbors and colleagues. I'll be trying to do the same for this event as I did for the DIMACS workshop I attended back in January, which is to blog each day about interesting ideas and topics. As usual, this is a highly subjective slice of the workshop's subject matter.
Detecting and understanding the large-scale structure of networks
Mark Newman (U. Michigan) kicked off the morning by discussing his work on clustering algorithms for networks. As he pointed out, in the olden days of network analysis (c. 30 years ago), you could write down all the nodes and edges in a graph and understand its structure visually. These days, our graphs are too big for this, and we're stuck using statistical probes to understand how these things are shaped. And yet, many papers include figures of networks as incoherent balls of nodes and edges (Mark mentioned that Marc Vidal calls these figures "ridiculograms").
I've seen the technical content of Mark's talk before, but he always does an excellent job of making it seem fresh. In this talk, there was a brief exchange with the audience regarding the NP-completeness of the MAXIMUM MODULARITY problem, which made me wonder what kinds of structures would make an instance of the MM problem so hard. Clearly, polynomial time algorithms that approximate the maximum modularity Q exist because we have many heuristics that work well on (most) real-world graphs. But, if I were an adversary and wanted to design a network with particularly difficult structure to partition, what kind would I want to include? (Other than reducing another NPC problem using gadgets!)
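For concreteness, the quantity being maximized here is the Newman-Girvan modularity: the fraction of edges that fall within communities, minus the fraction expected under a degree-preserving random graph. A minimal sketch, evaluated on a toy graph of two triangles joined by a bridge:

```python
def modularity(edges, community):
    """Newman-Girvan modularity Q: fraction of edges inside communities,
    minus the expectation under a random graph with the same degrees."""
    m = len(edges)
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    inside = sum(community[a] == community[b] for a, b in edges) / m
    # total degree in each community, for the expected-edges term
    comm_deg = {}
    for v, k in deg.items():
        comm_deg[community[v]] = comm_deg.get(community[v], 0) + k
    expected = sum((k / (2 * m)) ** 2 for k in comm_deg.values())
    return inside - expected

# Two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_split = modularity(edges, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1})
q_all = modularity(edges, {v: 0 for v in range(6)})
# the natural two-community split scores well above the trivial
# everyone-in-one-community partition, which always gets Q = 0
```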
Walter Willinger raised a point here (and again in a few later talks) about the sensitivity of most network analysis methods to topological uncertainty. That is, just about all the techniques we have available to us assume that the edges as given are completely correct (no missing or spurious edges). Given the classic result due to Watts and Strogatz (1998) of the impact that a few random links added to a lattice have on the diameter of the graph, it's clear that in some cases, topological errors can have a huge impact on our conclusions about the network. So, developing good ways to handle uncertainty and errors while analyzing the structure of a network is a rather large, gaping hole in the field. Presumably, progress in this area will require having good error models of our uncertainty, which, necessarily, depend on the measurement techniques used to produce the data. In the case of traceroutes on the Internet, this kind of inverse problem seems quite tricky, but perhaps not impossible.
Probability and Spatial Networks
David Aldous (Berkeley) gave the second talk and discussed some of his work on spatial random graphs, and, in particular, on the optimal design and flow through random graphs. As an example, David gave us a simple puzzle to consider:
Given a square of area N with N nodes distributed uniformly at random throughout it, subdivide this area into L^2 subsquares, and choose one node in each square to be a "hub." Then connect each of the remaining nodes in a square to its hub, and connect the hubs together in a complete graph. The question is: what is the value of L that minimizes the total (Euclidean) length of the edges in this network?
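A quick Monte Carlo version of the puzzle (my own sketch, with hubs chosen arbitrarily rather than optimally) shows that an intermediate L beats both extremes. A back-of-the-envelope balance of the spoke cost, which scales like N^(3/2)/L, against the hub-graph cost, which scales like L^4 sqrt(N), suggests the optimum scales like N^(1/5), but treat that as a sketch, not a proof.

```python
import math
import random

def network_length(N, L, seed=0):
    """Total edge length of the hub network: N uniform points in a
    sqrt(N) x sqrt(N) square, one hub per nonempty subsquare, spokes
    to the hub, and a complete graph on the hubs."""
    rng = random.Random(seed)
    side = math.sqrt(N)
    pts = [(rng.uniform(0, side), rng.uniform(0, side)) for _ in range(N)]
    cell = side / L
    boxes = {}
    for p in pts:
        key = (min(int(p[0] // cell), L - 1), min(int(p[1] // cell), L - 1))
        boxes.setdefault(key, []).append(p)
    total, hubs = 0.0, []
    for members in boxes.values():
        hub = members[0]               # arbitrary choice of hub
        hubs.append(hub)
        total += sum(math.dist(hub, q) for q in members[1:])
    for i in range(len(hubs)):         # complete graph on the hubs
        for j in range(i + 1, len(hubs)):
            total += math.dist(hubs[i], hubs[j])
    return total

lengths = {L: network_length(400, L) for L in (1, 2, 4, 8, 20)}
# an intermediate L is cheaper than either extreme: one giant star
# (L = 1) or a near-complete graph of tiny cells (L = 20)
```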
He then talked a little about other efficient ways to connect up uniformly scattered points in an area. In particular, Steiner trees are the standard way to do this, and have a cost O(N). The downside of this efficiency is that the tree-distance between physically proximate points on the plane is something polynomial in N (David suggested that he didn't have a rigorous proof for this, but it seems quite reasonable). As it turns out, you can dramatically lower this cost by adding just a few random lines across the plane - the effect is analogous to the one in the Watts-Strogatz model. Naturally, I was thinking about the structure of real road networks here, and it would seem that the effect of highways in the real world is much the same as David's random lines. That is, it only takes a few of these things to dramatically reduce the topological distance between arbitrary points. Of course, road networks have other things to worry about, such as congestion, that David's highways don't!
January 25, 2007
DIMACS - Complex networks and their applications (Day 3)
The third day of the workshop focused on applications to biochemical networks (no food webs), with a lot of that focus being on the difficulties of taking fuzzy biological data (e.g., gene expression data) and converting it into an accurate and meaningful form for further analysis or for hypothesis testing. Only a few of the talks were theoretical, but this perhaps reflects the current distribution of focus in biology today. After the workshop was done, I wondered just how much information crossed between the various disciplines represented at the workshop - certainly, I came away from it with a few new ideas, and a few new insights from the good talks I attended. And I think that's the sign of a successful workshop.
Complex Networks in Biology
Chris Wiggins (Columbia) delivered a great survey of interesting connections between machine learning and biochemical networks. It's probably fair to say that biologists are interested in constructing an understanding of cellular-level systems that compares favorably to an electrical engineer's understanding of circuits (Pointer: Can a Biologist Fix a Radio?). But, this is hard because living stuff is messy, inconsistent in funny ways, and has a tendency to change while you're studying it. So, it's harder to get a clean view of what's going on under the hood than it was with particle physics. This, of course, is where machine learning is going to save us - ML offers powerful and principled ways to sift through (torture) all this data.
The most interesting part of his talk, I think, was his presentation of NetBoost, a mechanism discriminator that can tell you which (among a specific suite of existing candidates) is the most likely to have generated your observed network data. For instance, was it preferential attachment (PA) or duplication-mutation-complementation (DMC) that produced a given protein-interaction network (conclusion: the latter is better supported)? The method basically works by constructing a decision tree that looks at the subgraph decomposition of a network and scores its belief that each of the various mechanisms produced it. With the ongoing proliferation of network mechanisms (theorists really don't have enough to do these days), this kind of approach serves as an excellent way to test a new mechanism against the data it's supposed to be emulating.
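The idea is easy to caricature in a few lines (a toy of my own, emphatically not NetBoost: one hand-picked feature and a threshold in place of boosted decision trees over subgraph features). Grow trees by preferential vs. uniform attachment and learn to tell the two mechanisms apart from the maximum degree alone:

```python
import random

def grow(n, preferential, rng):
    """Grow a random tree; each new node attaches to an existing node either
    preferentially (prob ~ degree) or uniformly at random."""
    deg = [1, 1]  # start from a single edge
    for _ in range(n - 2):
        if preferential:
            target = rng.choices(range(len(deg)), weights=deg)[0]
        else:
            target = rng.randrange(len(deg))
        deg[target] += 1
        deg.append(1)
    return deg

def feature(deg):
    return max(deg)  # hubs are the signature of preferential attachment

rng = random.Random(42)
mean = lambda xs: sum(xs) / len(xs)
train = [(feature(grow(200, p, rng)), p) for p in (True, False) for _ in range(20)]
# one-feature "decision stump": threshold halfway between the class means
thr = 0.5 * (mean([f for f, p in train if p]) + mean([f for f, p in train if not p]))
test_set = [(feature(grow(200, p, rng)), p) for p in (True, False) for _ in range(20)]
accuracy = mean([float((f > thr) == p) for f, p in test_set])
```

The real method's advantage, as Chris emphasized, is that it chooses discriminating features automatically rather than relying on one you hand-picked.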
One point Chris made that resonated strongly with me - and which Cris and Mark made yesterday - is the problem with what you might call "soft validation." Typically, a study will cluster or do some other kind of analysis with the data, and then tell a biological story about why these results make sense. On the other hand, forcing the clustering to make testable predictions would be a stronger kind of validation.
Network Inference and Analysis for Systems Biology
Just before lunch, Joel Bader (Johns Hopkins) gave a brief talk about his work on building a good view of the protein-protein interaction network (PPIN). The main problem with this widely studied data is the high error rate, both for false positives (interactions that we think exist, but don't) and false negatives (interactions that we think don't exist, but do). To drive home just how bad the data is, he pointed out that two independent studies of the human PPIN showed just 1% overlap in the sets of "observed" interactions.
He's done a tremendous amount of work on trying to improve the accuracy of our understanding of PPINs, but here he described a recent approach that fits degree-based generative models to the data using our old friend expectation-maximization (EM). His results suggest that we're seeing about 30-40% of the real edges, but that our false positive rate is about 10-15%. This is a depressing signal-to-noise ratio (roughly 1%), because the number of real interactions is O(n), while the number of possible false positives is O(n^2). Clearly, the biological methods used to infer the interactions need to be improved before we have a clear idea of what this network looks like, but it also suggests that a lot of the previous results on this network are almost surely wrong. Another question is whether it's possible to incorporate these kinds of uncertainties into our analyses of the network structure.
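Some toy arithmetic (numbers invented by me, purely for illustration) shows how O(n) signal against O(n^2) possible errors produces overlaps of the order seen in those two human PPIN studies:

```python
# Invented illustrative numbers, not Bader's estimates.
n = 10_000                    # proteins
pairs = n * (n - 1) // 2      # candidate interactions: O(n^2)
T = 30_000                    # true interactions: O(n)
s = 0.10                      # sensitivity: chance a study detects a true edge
reported = 50_000             # edges each study reports

false_per_study = reported - s * T            # spurious reported edges
true_overlap = s ** 2 * T                     # true edges detected by BOTH studies
false_overlap = false_per_study ** 2 / pairs  # chance collisions among the errors
overlap_fraction = (true_overlap + false_overlap) / reported
```

With these numbers the expected overlap is under 1% of each study's reported edges, even though both studies are "seeing" thousands of real interactions.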
Activating Interaction Networks and the Dynamics of Biological Networks
Meredith Betterton (UC-Boulder) presented some interesting work on signaling and regulatory networks. One of the more surprising tidbits she used in her motivation is the following: in yeast, mRNA transcription undergoes a consistent 40-minute genome-wide oscillation, but when the cells are exposed to an antidepressant (in this case, phenelzine), the period doubles. (The fact that gene expression oscillates like this poses another serious problem for the results of gene expression analysis that doesn't account for such oscillations.)
The point Meredith wanted to drive home, though, was that we shouldn't just think of biochemical networks as static objects - they also represent the form that the cellular dynamics must follow. Using a simple dynamical model of activation and inhibition, she showed that a real-world circadian rhythm network and a real-world membrane-based signal cascade behave exactly as their structure (who points to whom, and whether an edge inhibits or activates its target) would lead you to expect - one oscillates and the other doesn't. But, then she showed that it only takes a relatively small number of flips (activation to inhibition, or vice versa) to dramatically change the steady-state behavior of these cellular circuits. In a sense, this suggests that these circuits are highly adaptable, given a little pressure.
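A minimal Boolean caricature of this (mine, not Meredith's actual model) already shows the flip effect. Take a three-gene ring a -| b -| c -| a of pure inhibition, update it sequentially, and then flip the single edge c -> a from inhibition to activation:

```python
def run(flip_c_to_a, steps=20):
    """Sequentially update a 3-node regulatory ring.  All edges inhibit;
    optionally the c -> a edge is flipped to activation."""
    a, b, c = 0, 0, 0
    trace = []
    for _ in range(steps):
        a = c if flip_c_to_a else 1 - c  # a regulated by c
        b = 1 - a                        # b inhibited by a
        c = 1 - b                        # c inhibited by b
        trace.append((a, b, c))
    return trace

osc = run(flip_c_to_a=False)   # odd number of inhibitions: no consistent state
fixed = run(flip_c_to_a=True)  # one flipped edge: the ring settles down
```

The all-inhibition ring cycles forever between (1,0,1) and (0,1,0), while the flipped ring locks into the fixed point (0,1,0) - a single sign change, and a qualitatively different steady state.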
There are several interesting questions that came to mind while she was presenting. For instance, if we believe there are modules within the signaling pathways that accomplish a specific function, how can we identify them? Do sparsely connected dense subgraphs (assortative community structure) map onto these functional modules? And what are the right models for understanding these dynamics: systems of differential equations, discrete-time matrix multiplication, or something more akin to a cellular version of Ohm's Law?
 M. Middendorf, E. Ziv and C. Wiggins, "Inferring Network Mechanisms: The Drosophila melanogaster Protein Interaction Network." PNAS USA 102 (9), 3192 (2005).
 Technically, it's using these subgraphs as generic features and then crunching the feature vectors from examples of each mechanism through a generalized decision tree in order to learn how to discriminate among them. Boosting is used within this process in order to reduce the error rates. The advantage of this approach to model selection and validation, as Chris pointed out, is that it doesn't assume a priori which features (e.g., degree distribution, clustering coefficient, distance distribution, whatever) are interesting, but rather chooses the ones that can actually discriminate between things we believe are different.
 Chris called it "biological validation," but the same thing happens in sociology and Internet modeling, too.
 I admit that I'm a little skeptical of degree-based models of these networks, since they seem to assume that we're getting the degree distribution roughly right. That assumption is only reasonable if our sampling of the interactions attached to a particular vertex is unbiased, which I'm not sure about.
 After some digging, I couldn't find the reference for this work. I did find this one, however, which illustrates a different technique for a related problem. I. Iossifov et al., "Probabilistic inference of molecular networks from noisy data sources." Bioinformatics 20 (8), 1205 (2004).
 C. M. Li and R. R. Klevecz, "A rapid genome-scale response of the transcriptional oscillator to perturbation reveals a period-doubling path to phenotypic change." PNAS USA 103 (44), 16254 (2006).
 Maribeth Oscamou pointed out to me during the talk that any attempt to construct such rules has to account for processes like the biochemical degradation of the signals. That is, unlike electric circuits, there's no strict conservation of the "charge" carrier.
January 24, 2007
DIMACS - Complex networks and their applications (Day 2)
There were several interesting talks today, or rather, I should say that there were several talks today that made me think about things beyond just what the presenters said. Here's a brief recap of the ones that made me think the most, and some commentary about what I thought about. There were other good talks today, too. For instance, I particularly enjoyed Frank McSherry's talk on doing PageRank on his laptop. There was also one talk on power laws and scale-free graphs that stimulated a lot of audience, ah, interaction - it seems that there's a lot of confusion both over what a scale-free graph is (admittedly the term has no consistent definition in the literature, although there have been some recent attempts to clarify it in a principled manner), and how to best show that some data exhibit power-law behavior. Tomorrow's talks will be more about networks in various biological contexts.
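On the question of demonstrating power-law behavior: the estimator matters. Fitting a line to a log-log histogram is popular but badly behaved; the maximum-likelihood estimate of the exponent (for the continuous case) is a one-liner. A quick sketch with synthetic data of my own:

```python
import math
import random

rng = random.Random(0)
alpha_true, xmin, n = 2.5, 1.0, 5000

# inverse-transform sampling from p(x) ~ x^(-alpha) for x >= xmin
xs = [xmin * (1 - rng.random()) ** (-1 / (alpha_true - 1)) for _ in range(n)]

# continuous maximum-likelihood estimator of the exponent
alpha_hat = 1 + n / sum(math.log(x / xmin) for x in xs)
```

With 5000 samples the MLE lands within a few hundredths of the true exponent; a least-squares fit to binned log-log data tends to do considerably worse, and gives no error bar you can trust.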
Complex Structures in Complex Networks
Mark Newman's (U. Michigan) plenary talk mainly focused on the importance of having good techniques to extract information from networks, and being able to do so without making a lot of assumptions about what the technique is supposed to look for. That is, rather than assume that some particular kind of structure exists and then look for it in our data, why not let the data tell you what kind of interesting structure it has to offer? The tricky thing about this approach to network analysis, though, is working out a method that is flexible enough to find many different kinds of structure, and to present only that which is unusually strong. (Point to ponder: what should we mean by "unusually strong"?) This point was a common theme in a couple of the talks today. The first example that Mark gave of a technique that has this nice property was a beautiful application of spectral graph theory to the task of finding a partition of the vertices that gives an extremal value of modularity. If we ask for the maximum modularity, this heuristic method, using the positive eigenvalues of the resulting solution, gives us a partition with very high modularity. But, using the negative eigenvalues gives a partition that minimizes the modularity. I think we normally think of modules as assortative structures, i.e., sparsely connected dense subgraphs. But, some networks exhibit modules that are approximately bipartite, i.e., they are disassortative, being densely connected sparse subgraphs. Mark's method naturally allows you to look for either. The second method he presented was a powerful probabilistic model of node clustering that can be appropriately parameterized (fitted to data) via expectation-maximization (EM). This method can be used to accomplish much the same results as the previous spectral method, except that it can look for both assortative and disassortative structure simultaneously in the same network.
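The core step of the spectral method is compact enough to sketch (my own toy code, following the published description: split vertices by the sign of the leading eigenvector of the modularity matrix B = A - kk^T/2m):

```python
import numpy as np

# Two triangles joined by one edge: the leading eigenvector should split them.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
n, m = 6, len(edges)
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
k = A.sum(axis=1)
B = A - np.outer(k, k) / (2 * m)    # modularity matrix

vals, vecs = np.linalg.eigh(B)
leading = vecs[:, np.argmax(vals)]  # eigenvector of the most positive eigenvalue
group = leading > 0                 # split vertices by sign
```

Using the most negative eigenvalue instead would look for the disassortative (near-bipartite) kind of split Mark described.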
Hierarchical Structure and the Prediction of Missing Links
In an afternoon talk, Cris Moore (U. New Mexico) presented a new and powerful model of network structure, the hierarchical random graph (HRG). (Disclaimer: this is joint work with myself and Mark Newman.) A lot of people in the complex networks literature have talked about hierarchy, and, presumably, when they do so, they mean something roughly along the lines of the HRG that Cris presented. That is, they mean that nodes with a common ancestor low in the hierarchical structure are more likely to be connected to each other, and that different cuts across it should produce partitions that look like communities. The HRG model Cris presented makes these notions explicit, but also naturally captures both the kind of assortative hierarchical structure and the disassortative structure that Mark's methods find. (Test to do: use the HRG to generate a mixture of assortative and disassortative structure, then use Mark's second method to find it.) There are several other attractive qualities of the HRG, too. For instance, using Markov chain Monte Carlo, you can find the hierarchical decomposition of a single real-world network, and then use the HRG to generate a whole ensemble of networks that are statistically similar to the original graph. And, because the MCMC samples the entire posterior distribution of models-given-the-data, you can look not only at the models that give the best fit to the data, but also at the large number of models that give an almost-best fit. Averaging properties over this ensemble can give you more robust estimates of unusual topological patterns, and Cris showed how it can also be used to predict missing edges. That is, suppose I hide some edges and then ask the model to predict which ones I hid. If it can do well at this task, then we've shown that the model is capturing real correlations in the topology of the real graph - it has the kind of explanatory power that comes from making correct predictions.
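The hide-and-predict protocol itself is simple to set up. Here's a toy harness (with common-neighbor counting standing in for the HRG likelihood - it is not our method, just the same evaluation idea) on a small graph where the hidden edge is recoverable:

```python
from itertools import combinations

# A 5-clique plus a pendant node 5.  We "hide" the true edge (0, 1) and ask
# whether the predictor ranks it above the other non-edges.
edges = {frozenset(e) for e in combinations(range(5), 2)} | {frozenset((0, 5))}
hidden = frozenset((0, 1))
observed = edges - {hidden}

def neighbors(v):
    return {next(iter(e - {v})) for e in observed if v in e}

def score(u, v):
    return len(neighbors(u) & neighbors(v))  # common-neighbor count

non_edges = [frozenset(p) for p in combinations(range(6), 2)
             if frozenset(p) not in observed]
ranked = sorted(non_edges, key=lambda e: score(*e), reverse=True)
```

The hidden clique edge shares three common neighbors and tops the ranking; in the actual method, the average connection probability under the sampled HRG ensemble plays this scoring role.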
These kinds of predictions could be extremely useful for laboratory or field scientists who manually collect network data (e.g., protein interaction networks or food webs). Okay, enough about my own work!
The Optimization Origins of Preferential Attachment
Although I've seen Raissa D'Souza (UC Davis) talk about competition-induced preferential attachment before, it's such an elegant generalization of PA that I enjoyed it a second time today. Raissa began by pointing out that most power laws in the real world can't extend to infinity - in most systems, there are finite limits to the size that things can be (the energy released in an earthquake or the number of edges a vertex can have), and these finite effects will typically manifest themselves as exponential cutoffs in the far upper tail of the distribution, which take the probability of these super-large events to zero. She used this discussion as a springboard to introduce a relatively simple model of resource constraints and competition among vertices in a growing network that produces a power-law degree distribution with such an exponential cutoff. The thing I like most about this model is that it provides a way for (tempered) PA to emerge from microscopic and inherently local interactions (normally, to get pure PA to work, you need global information about the system). The next step, of course, is to find some way to measure evidence for this mechanism in real-world networks. I also wonder how brittle the power-law result is, i.e., if you tweak the dynamics a little, does the power-law behavior disappear?
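Here's a crude way to see how a finite limit tempers the tail (a toy of my own - a hard capacity cap, rather than Raissa's competition mechanism): preferential attachment in which saturated vertices simply stop accepting new links.

```python
import random

def pa_degrees(n, cap, rng):
    """Grow a tree by preferential attachment; if cap is not None, vertices
    with degree >= cap no longer accept links -- a crude resource limit."""
    deg = [1, 1]
    for _ in range(n - 2):
        eligible = [i for i, d in enumerate(deg) if cap is None or d < cap]
        weights = [deg[i] for i in eligible]
        target = rng.choices(eligible, weights=weights)[0]
        deg[target] += 1
        deg.append(1)
    return deg

rng = random.Random(7)
free = pa_degrees(3000, cap=None, rng=rng)
capped = pa_degrees(3000, cap=20, rng=rng)
```

The uncapped run grows hubs with degree well beyond 20, while the capped run's upper tail stops dead at the capacity - a hard truncation, where the competition model in the talk produces the gentler exponential cutoff.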
Web Search and Online Communities
Andrew Tomkins (of Yahoo! Research) is a data guy, and his plenary talk drove home the point that Web 2.0 applications (i.e., things that revolve around user-generated content) are creating a huge amount of data, and offering unparalleled challenges for combining, analyzing, and visualizing this data in meaningful ways. He used Flickr (a recent Y! acquisition) as a compelling example by showing an interactive (with fast-rewind and fast-forward features) visual stream of the trends in user-generated tags for user-posted images, annotated with notable examples of those images. He talked a little about the trickiness of the algorithms necessary to make such an application, but what struck me most was his plea for help and ideas in how to combine information drawn from social networks with user behavior with blog content, etc. to make more meaningful and more useful applications - there's all this data, and they only have a few ideas about how to combine it. The more I learn about Y! Research, the more impressed I am with both the quality of their scientists (they recently hired Duncan Watts), and the quality of their data. Web 2.0 stuff like this gives me the late-1990s shivers all over again. (Tomkins mentioned that in Korea, unlike in the US, PageRank-based search has been overtaken by an engine called Naver, which is driven by users building good sets of responses to common search queries.)
 To be more concrete, and perhaps in lieu of having a better way of approaching the problem, much of the past work on network analysis has taken the following approach. First, think of some structure that you think might be interesting (e.g., the density of triangles or the division into sparsely connected dense subgraphs), design a measure that captures that structure, and then measure it in your data (it turns out to be non-trivial to do this in an algorithm-independent way). Of course, the big problem with this approach is that you'll never know whether there is other structure that's just as important as, or maybe more important than, the kind you looked for, simply because you weren't clever enough to think to look for it.
 Heuristic because Mark's method is a polynomial time algorithm, while the problem of modularity maximization was recently (finally...) shown to be NP-complete. The proof is simple, and, in retrospect, obvious - just as most such proofs inevitably end up being. See U. Brandes et al. "Maximizing Modularity is hard." Preprint (2006).
 M. E. J. Newman, "Finding community structure in networks using the eigenvectors of matrices." PRE 74, 036104 (2006).
 M. E. J. Newman and E. A. Leicht, "Mixture models and exploratory data analysis in networks." Submitted to PNAS USA (2006).
 A. Clauset, C. Moore and M. E. J. Newman, "Structural Inference of Hierarchies in Networks." In Proc. of the 23rd ICML, Workshop on "Statistical Network Analysis", Springer LNCS (Pittsburgh, June 2006).
 This capability seems genuinely novel. Given that there are an astronomical number of ways to rearrange the edges on a graph, it's kind of amazing that the hierarchical decomposition gives you a way to do such a rearrangement, but one which preserves the statistical regularities in the original graph. We've demonstrated this for the degree distribution, the clustering coefficient, and the distribution of pair-wise distances. Because of the details of the model, it sometimes gets the clustering coefficient a little wrong, but I wonder just how powerful / how general this capability is.
 More generally though, I think the idea of testing a network model by asking how well it can predict things about real-world problems is an important step forward for the field; previously, "validation" consisted of showing only a qualitative (or worse, a subjective) agreement between some statistical measure of the model's behavior (e.g., degree distribution is right-skewed) and the same statistical measure on a real-world network. By being more quantitative - by being more stringent - we can say stronger things about the correctness of our mechanisms and models.
 R. M. D'Souza, C. Borgs, J. T. Chayes, N. Berger, and R. Kleinberg, "Emergence of Tempered Preferential Attachment From Optimization", To appear in PNAS USA, (2007).
 I think the best candidate here would be the BGP graph, since there is clearly competition there, although I suspect that the BGP graph structure is a lot more rich than the simple power-law-centric analysis has suggested. This is primarily due to the fact that almost all previous analyses have ignored the fact that the BGP graph exists as an expression of the interaction of business interests with the affordances of the Border Gateway Protocol itself. So, its topological structure is meaningless without accounting for the way it's used, and this means accounting for complexities of the customer-provider and peer-to-peer relationships on the edges (to say nothing of the sampling issues involved in getting an accurate BGP map).
January 23, 2007
DIMACS - Complex networks and their applications (Day 1)
Today and tomorrow, I'm at the DIMACS workshop on complex networks and their applications, held at Georgia Tech's College of Computing. Over the course of the workshop, I'll be blogging about the talks I see and whatever ideas they stimulate (sadly, I missed most of the first day because of travel).
The most interesting talk I saw Monday afternoon was by Ravi Kumar (Yahoo! Research), who took location data of users on LiveJournal, and asked: do we see the same kind of routable structure - i.e., an inverse-square relationship between the distance separating two people and the likelihood that they have a LJ connection - that Kleinberg showed was optimal for distributed / local search? Surprisingly, they were able to show that in the US, once you correct for the fact that there can be many people at a single "location" in geographic space (approximated to the city level), you do indeed observe exactly the kind of power law that Kleinberg predicted. Truly, this was a kind of stunning confirmation of Kleinberg's theory. So now, the logical question would be: what mechanism might produce this kind of structure in geographic space? Although you could probably get away with assuming a priori the population distribution, what linking dynamics would construct the observed topological pattern? My first project in graduate school asked exactly this question for the pure Kleinberg model, and I wonder if it could be adapted to the geographic version that Kumar et al. consider.
 D. Liben-Nowell et al., "Geographic Routing in Social Networks." PNAS USA 102 (33), 11623-11628 (2005).
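Kleinberg's condition has a nice sampling interpretation that a few lines of Python make concrete (my own illustration, not from the talk): in 2D the number of people at distance d grows like d, so an inverse-square probability of linking to any particular person makes the distance of a random link distribute like 1/d - that is, uniformly across logarithmic distance scales, giving a router a useful shortcut at every scale.

```python
import math
import random

rng = random.Random(1)
D = 2 ** 10  # maximum link distance
# If link distance has density ~ 1/d on [1, D], then d = D**u for uniform u.
dists = [D ** rng.random() for _ in range(100_000)]

# Count links per logarithmic "octave" of distance: [1,2), [2,4), ..., [512,1024)
octaves = [0] * 10
for d in dists:
    octaves[min(9, int(math.log2(d)))] += 1
```

All ten octaves come out essentially equally populated (about 10,000 links each); any other exponent would pile links up at one end of the scale, which is exactly why the inverse-square law is the routable case.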
October 04, 2006
Fighting the dominant paradigm
Sean over at Cosmic Variance has a nice review of Lee Smolin's The Trouble With Physics, which is itself a critique of theoretical physics' focus on string theory as the way to unify gravity with the other forces. Most of the review focuses on Smolin's criticism of string theory's dominance, but Sean points out that Smolin is actually making two arguments, one about string theory and one about supporting interesting alternative ideas.
Smolin talks a great deal about the need for physics, and academia more generally, to support plucky upstart ideas and scholars with the courage and vision to think big and go against the grain. This is a larger point than the specific argument about how to best quantize gravity, and ultimately far more persuasive; it is likely, unfortunately, to be lost amidst the conflict between string theory and its discontents. Faculty positions and grant money are scarce commodities, and universities and funding agencies are naturally risk-averse. Under the current system, a typical researcher might spend five years in graduate school, three to six as a postdoc, and another six or seven as an assistant professor before getting tenure – with an expectation that they will write several competent papers in every one of those years. Nobody should be surprised that, apart from a few singular geniuses, the people who survive this gauntlet are more likely to be those who show technical competence within a dominant paradigm, rather than those who will take risks and pursue their idiosyncratic visions. The dogged pursuit of string theory through the 1970’s by Green and Schwarz is a perfect example of the ultimate triumph of the latter approach, and Smolin is quite correct to lament the lack of support for this kind of research today.
Although he's talking about theoretical physicists, the same applies just as much to other disciplines (perhaps with shorter postdoc periods) and their relationship to upstart ideas. Of course, finding the right balance between "normal science" and "paradigm-shifting science" is not easy, and there is a big difference between supporting interesting new ideas and supporting crackpots. Sometimes, that distinction can be hard to see at first, but all good new ideas ultimately lead to really excellent science. Fortunately, there are places that actively encourage both excellent work and thinking about crazy ideas.
Update Oct. 4: Suresh blogs about Sean's review as well, and also zeros in on the same passage. He makes some valuable points about how important it is to build your own model of how to do good research. Separately, Dave Bacon blogs about Peter Shor's review of Smolin's book, in which Shor likens Nature to an insurance salesman.
July 06, 2006
An ontological question about complex systems
Although I've been reading Nature News for several years now (as part of my daily trawl for treasure in the murky waters of science), I first came to recognize one of their regular writers Philip Ball when he wrote about my work on terrorism with Maxwell Young. His essay, now hidden behind Nature's silly subscription-only barrier, sounded an appropriately cautionary note about using statistical patterns of human behavior to predict the future, and was even titled "Don't panic, it might never happen."
The idea that there might be statistical laws that govern human behavior can be traced, as Ball does in his essay, back to the English philosopher Thomas Hobbes (1588-1679) in The Leviathan and to the French positivist philosopher Auguste Comte (1798-1857; known as the father of sociology, and who also apparently coined the term "altruism"), who were inspired by the work of physicists in mechanizing the behavior of nature to try to do the same with human societies.
It seems, however, that somewhere between then and now, much of sociology has lost interest in such laws. A good friend of mine in graduate school for sociology (who shall remain nameless to protect her from the politics of academia) says that her field is obsessed with the idea that context, or nurture, drives all significant human behavior, and that it rejects the idea that overarching patterns or laws of society might exist. These, apparently, are the domain of biology, and thus Not Sociology. I'm kind of stunned that any field that takes itself seriously would so thoroughly cling to the nearly medieval notion of the tabula rasa (1) in the face of unrelenting scientific evidence to the contrary. But, if this territory has been abandoned by sociologists (2), it has recently, and enthusiastically, been claimed by physicists (who may or may not recognize the similarity of their work to a certain idea in science fiction).
Ball's background is originally in chemistry and statistical physics, and having spent many years as an editor at Nature, he apparently now has a broad perspective on modern science. But, what makes his writing so enjoyable is the way he places scientific advances in their proper historical context, showing both where the inspiration may have come from, and how other scientists were developing similar or alternative ideas concurrently. These strengths are certainly evident in his article about the statistical regularity of terrorism, but he puts them to greater use in several books and, in particular, one on physicists' efforts to create something he calls sociophysics. As it turns out, however, this connection between physics and sociology is not a new one, and the original inspiration for statistical physics (one of the three revolutionary ideas in modern physics; the other two are quantum mechanics and relativity) is owed to social scientists.
In the mid 1800s, James Clerk Maxwell, one of the fathers of statistical physics, read Henry Thomas Buckle's lengthy History of Civilization. Buckle was a historian by trade, and a champion of the idea that society's machinations are bound by fundamental laws. Maxwell, struggling with the question of how to describe the various motions of particles in a gas, was struck by Buckle's descriptions of the statistical nature of studies of society. Such studies sought not to describe each individual and their choices exactly, but instead represent the patterns of behavior statistically, and often pointed to surprising regularities, e.g., the near-stable birth or suicide rates in a particular region. As a result, Maxwell abandoned the popular approach of describing gas particles only using Newtonian mechanics, i.e., an attempt to describe every particle's position and motion exactly, in favor of a statistical approach that focused on the distribution of velocities.
It was the profound success of these statistical descriptions that helped cement this approach as one of the most valuable tools available to physicists, and brought about some pretty profound shifts in our understanding of gases, materials and even astrophysics. So, it seems fitting that statistical physicists are now returning to their roots by considering statistical laws of human behavior. Alas, I doubt that most such physicists appreciate this fact.
These efforts, which Ball surveys in "Critical Mass" (Farrar, Straus and Giroux, 2004) via a series of well-written case studies, have dramatically altered our understanding of phenomena as varied as traffic patterns (which have liquid, gaseous, solid and meta-stable states along with the corresponding phase transitions), voting patterns in parliamentary elections (which display nice heavy-tailed statistics), the evolution of pedestrian traffic trails across a university quad, economics and the statistics of businesses and markets, and a very shallow discussion of social networks. Although his exposition is certainly aimed at the layman, he does not shy away from technical language when appropriate. Pleasantly, he even reproduces figures from the original papers when it serves his explanations. Given that these phenomena were drawn from a burgeoning field of interdisciplinary research, it's easy to forgive him for omitting some of my favorite topics, treating others only shallowly, and mercifully leaving out the hobby horses of cellular automata, genetic algorithms and artificial life.
Now, after seeing that list of topics, you might think that "Critical Mass" was a book about complex systems, and you might be right. But, you might be wrong, too, which is the problem when there's no strict definition of a term. So, let's assume that it is such a book, and see what this offers in terms of clarifying the corresponding ontological question. For one thing, Ball's choices suggest that perhaps we do not need other ill-defined properties like emergence, self-organization or robustness (3) to define a complex system. Instead, perhaps when we say we are studying a "complex system," we simply mean that it has a highly heterogeneous composition that we seek to explain using statistical mechanisms. To me, the former means that I, because of my limited mental capacity to grasp complicated equations, relationships or a tremendously large configuration space, pretty much have to use a statistical characterization that omits most of the detailed structure of the system; also, I say heterogeneous because homogeneous systems are much easier to explain using traditional statistical mechanics. The latter means that I'm not merely interested in describing the system, which can certainly be done using traditional statistics, but rather in explaining the rules and laws that govern the formation, persistence and evolution of that structure. For me, this definition is attractive both for its operational and utilitarian aspects, but also because it doesn't require me to wave my hands, use obfuscating jargon or otherwise change the subject.
In general, it's the desire to establish laws that reflects complex systems' roots in physics, and it is this that distinguishes it from traditional statistics and machine learning. In those areas, the focus seems to me to be more on predictive power ("Huzzah! My error rate is lower than yours.") and less on mechanisms. My machine learning friends tell me that people are getting more interested in the "interpretability" of their models, but I'm not sure this is the same thing as building models that reflect the true mechanical nature of the underlying system... of course, one fundamental difference between much of statistical learning and what I've described above is that for many systems, there's no underlying mechanism! We shouldn't expect problems like keeping the spam out of my inbox to exhibit nice mechanistic behavior, and there are a tremendous number of such problems out there today. Fortunately, I'm happy to leave those to people who care more about error rates than mechanisms, and I hope they're happy to leave studying the (complex) natural world, mechanisms and all, to me.
Updates, July 7
(1) The notion of the tabula rasa is not antithetical to the idea that there are patterns in social behavior, but patterns per se are not the same as the kind of societal laws that the founders of sociology were apparently interested in, i.e., sociology apparently believes these patterns to be wholly the results of culture and not driven by things that every human shares like our evolutionary history as a species. I suppose there's a middle ground here, in which society has created the appearance of laws, which the sociophysicists then discover and mistake for absolutes. Actually, I'm sure that much of what physicists have done recently can be placed into this category.
(2) It may be the case that it is merely the portion of sociology that my friend is most familiar with that expresses this odd conviction, and that there are subfields that retain the idea that true mechanistic laws do operate in social systems. For all I know, social network analysis people may be of this sort; it would be nice to have an insider's perspective on this.
(3) Like the notions of criticality and universality, these terms actually do have precise, technical definitions in their proper contexts, but they've recently been co-opted in imprecise ways and are now, unfortunately and in my opinion, basically meaningless in most of the complex systems literature.
March 01, 2006
The scenic view
In my formal training in physics and computer science, I never did get much exposure to statistics and probability theory, yet I have found myself consistently using them in my research (partially on account of the fact that I deal with real data quite often). What little formal exposure I did receive was always in some specific context and never focused on probability as a topic itself (e.g., statistical mechanics, which could hardly be called a good introduction to probability theory). Generally, my training played out in the crisp and clean neighborhoods of logical reasoning, algebra and calculus, with the occasional day-trip to the ghetto of probability. David Mumford, a Professor of Mathematics at Brown University, opines about the ongoing spread of that ghetto throughout the rest of science and mathematics, i.e., how probability theory deserves respect at least equal to that of abstract algebra, in a piece from 1999 on The Dawning of the Age of Stochasticity. From the abstract,
For over two millennia, Aristotle's logic has ruled over the thinking of western intellectuals. All precise theories, all scientific models, even models of the process of thinking itself, have in principle conformed to the straight-jacket of logic. But from its shady beginnings devising gambling strategies and counting corpses in medieval London, probability theory and statistical inference now emerge as better foundations for scientific models ... [and] even the foundations of mathematics itself.
It may sound like it, but I doubt that Mumford is actually overstating his case here, especially given the deep connection between probability theory, quantum mechanics (cf. the recent counter-intuitive result on quantum interrogation) and complexity theory.
A neighborhood I'm more familiar with is that of special functions; things like the Gamma function, the Riemann Zeta function (a personal favorite), and the Airy functions. Sadly, these familiar friends show up very rarely in the neighborhood of traditional computer science, but instead hang out in the district of mathematical modeling. Robert Batterman, a Professor of Philosophy at Ohio State University, writes about why exactly these functions are so interesting in On the Specialness of Special Functions (The Nonrandom Effusions of the Divine Mathematician).
From the point of view presented here, the shared mathematical features that serve to unify the special functions - the universal form of their asymptotic expansions - depends upon certain features of the world.
(Emphasis his.) That is, the physical world itself, by presenting a patterned appearance, must be governed by a self-consistent set of rules that create that pattern. In mathematical modeling, these rules are best represented by asymptotic analysis and, you guessed it, special functions, that reveal the universal structure of reality in their asymptotic behavior. Certainly this approach to modeling has been hugely successful, and remains so in current research (including my own).
My current digs, however, are located in the small nexus that butts up against these neighborhoods and those in computer science. Scott Aaronson, who occupies an equivalent juncture between computer science and physics, has written several highly readable and extremely interesting pieces on the commonalities he sees in his respective locale. I've found them to be a particularly valuable way to see beyond the unfortunately shallow exploration of computational complexity that is given in most graduate-level introductory classes.
In NP-complete Problems and Physical Reality Aaronson looks out of his East-facing window toward physics for hints about ways to solve NP-complete problems by using physical processes (e.g., simulated annealing). That is, can physical reality efficiently solve instances of "hard" problems? Although he concludes that the evidence is not promising, he points to a fundamental connection between physics and computer science.
Then turning to look out his West-facing window towards computer science, he asks Is P Versus NP Formally Independent?, where he considers formal logic systems and the implications of Gödel's Incompleteness Theorem for the likelihood of resolving the P versus NP question. It's stealing his thunder a little, but the most quotable line comes from his conclusion:
So I'll state, as one of the few definite conclusions of this survey, that P \not= NP is either true or false. It's one or the other. But we may not be able to prove which way it goes, and we may not be able to prove that we can't prove it.
There's a little nagging question that some researchers are only just beginning to explore, which is, are certain laws of physics formally independent? I'm not even entirely sure what that means, but it's an interesting kind of question to ponder on a lazy Sunday afternoon.
There's something else embedded in these topics, though. Almost all of the current work on complexity theory is logic-oriented, essentially because it was born of the logic and formal mathematics of the first half of the 20th century. But, if we believe Mumford's claim that statistical inference (and in particular Bayesian inference) will invade all of science, I wonder what insights it can give us about solving hard problems, and perhaps why they're hard to begin with.
I'm aware of only anecdotal evidence of such benefits, in the form of the Survey Propagation Algorithm and its success at solving hard k-SAT formulas. The insights from the physicists' non-rigorous results have even helped improve our rigorous understanding of why problems like random k-SAT undergo a phase transition from mostly easy to mostly hard. (The intuition is, in short, that as the density of constraints increases, the space of valid solutions fragments into many disconnected regions.) Perhaps there's more being done here than I know of, but it seems that a theory of inferential algorithms as they apply to complexity theory (I'm not even sure what that means, precisely; perhaps it doesn't differ significantly from PPT algorithms) might teach us something fundamental about computation.
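That phase transition is easy to observe numerically, even at toy scales. The following sketch (my own illustration, not from any of the papers mentioned) generates random 3-SAT formulas at several clause-to-variable ratios and brute-forces satisfiability; the fraction of satisfiable instances drops sharply near the known threshold of roughly 4.27 clauses per variable:

```python
import itertools
import random

def random_3sat(n_vars, n_clauses, rng):
    # Each clause: 3 distinct variables, each negated with probability 1/2.
    return [tuple(v if rng.random() < 0.5 else -v
                  for v in rng.sample(range(1, n_vars + 1), 3))
            for _ in range(n_clauses)]

def satisfiable(n_vars, clauses):
    # Brute force over all 2^n truth assignments (fine for small n).
    for bits in itertools.product([False, True], repeat=n_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False

def frac_sat(n_vars, alpha, trials, rng):
    # Fraction of random instances with ~alpha clauses per variable
    # that are satisfiable.
    m = int(alpha * n_vars)
    return sum(satisfiable(n_vars, random_3sat(n_vars, m, rng))
               for _ in range(trials)) / trials

rng = random.Random(0)
for alpha in [3.0, 4.27, 5.5]:  # the 3-SAT threshold sits near alpha ~ 4.27
    print(alpha, frac_sat(12, alpha, 20, rng))
```

With only a dozen variables the transition is smeared out by finite-size effects, but the qualitative easy-to-hard crossover is already visible.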
December 19, 2005
On modeling the human response time function; Part 3.
Much to my surprise, this morning I awoke to find several emails in my inbox apparently related to my commentary on the Barabasi paper in Nature. Anders Johansen pointed out to me and Luis Amaral (I can only assume that he has already communicated this to Barabasi) that in 2004 he published an article entitled Probing human response times in Physica A about the very same topic using the very same data as that of Barabasi's paper. In it, he displays the now familiar heavy-tailed distribution of response times and fits a power law of the form P(t) ~ 1/(t+c) where c is a constant estimated from the data. Asymptotically, this is the same as Barabasi's P(t) ~ 1/t; it differs in the lower tail, i.e., for t < c where it scales more uniformly. As an originating mechanism, he suggests something related to a spin-glass model of human dynamics.
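The relationship between the two fitted forms is easy to check directly. For g(t) proportional to 1/(t+c), the local log-log slope is d(log g)/d(log t) = -t/(t+c), which flattens toward zero for t much less than c (the "more uniform" lower-tail scaling) and approaches -1, the pure 1/t behavior, for t much greater than c. A quick numerical check (my own illustration):

```python
def loglog_slope(t, c):
    # Local slope d(log g)/d(log t) for g(t) proportional to 1/(t + c).
    return -t / (t + c)

c = 10.0
for t in [0.1, 1.0, 10.0, 100.0, 1e4]:
    print(t, round(loglog_slope(t, c), 4))
```

The slope passes through -1/2 exactly at t = c, so c marks the crossover scale between the uniform-looking regime and the power-law regime.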
Although Johansen's paper raises other issues, which I'll discuss briefly in a moment, let's step back and think about this controversy from a scientific perspective. There are two slightly different approaches to modeling that are being employed to understand the response-time function of human behavior. The first is a purely "fit-the-data" approach, which is largely what Johansen has done, and certainly what Amaral's group has done. The other, employed by Barabasi, uses enough data analysis to extract some interesting features, posits a mechanism for the origin of those and then sets about connecting the two. The advantage of developing such a mechanistic explanation is that (if done properly) it provides falsifiable hypotheses and can move the discussion past simple data-analysis techniques. The trouble begins, as I've mentioned before, when either a possible mechanistic model is declared to be "correct" before being properly vetted, or when an insufficient amount of data analysis is done before positing a mechanism. This latter kind of trouble allows for a debate over how much support the data really provides to the proposed mechanism, and is exactly the source of the exchange between Barabasi et al. and Stouffer et al.
I tend to agree with the idea implicitly put forward by Stouffer et al.'s comment that Barabasi should have done more thorough data analysis before publishing, or alternatively, been a little more cautious in his claims of the universality of his mechanism. In light of Johansen's paper and Johansen's statement that he and Barabasi spoke at the talk in 2003 where Johansen presented his results, there is now the specter that either previous work was not cited that should have been, or something more egregious happened. That's not to say this aspect of the story isn't an important issue in itself, but it is a separate one from the issues regarding the modeling, and it is those with which I am primarily concerned. But, given the high profile of articles published in journals like Nature, this kind of gross error in attribution does little to reassure me that such journals are not aggravating certain systemic problems in the scientific publication system. This will probably be a topic of a later post, if I ever get around to it. But let's get back to the modeling questions.
Seeking to be more physics and less statistics, the ultimate goal of such a study of human behavior should be to understand the mechanism at play, and at least Barabasi did put forward and analyze a plausible suggestion there, even if a) he may not have done enough data analysis to properly support it or his claims of universality, and b) his model assumes some reasonably unrealistic behavior on the part of humans. Indeed, the former is my chief complaint about his paper, and why I am grateful for the Stouffer et al. comment and the ensuing discussion. With regard to the latter, my preference would have been for Barabasi to have discussed the fragility of his model with respect to the particular assumptions he describes. That is, although he assumes it, humans probably don't assign priorities to their tasks with anything like a uniformly random distribution, nor do humans always execute their highest priority task next. For instance, can you decide, right now without thinking, what the most important email in your inbox is at this moment? Instead, he commits the crime of hubris and neglects these details in favor of the suggestiveness of his model given the data. On the other hand, regardless of their implausibility, both of these assumptions about human behavior can be tested through experiments with real people and through numerical simulation. That is, these assumptions become predictions about the world that, if they fail to agree with experiment, would falsify the model. This seems to me an advantage of Barabasi's mechanism over that proposed by Johansen, which, by relying on a spin glass model of human behavior, seems far trickier to falsify.
But let's get back to the topic of the data analysis and the argument between Stouffer et al. and Barabasi et al. (now also Johansen) over whether the data better supports a log-normal or a power-law distribution. The importance of this point is that if the log-normal is the better fit, then the mathematical model Barabasi proposes cannot be the originating mechanism. From my experience with distributions with heavy tails, it can be difficult to statistically (let alone visually) distinguish between a log-normal and various kinds of power laws. In human systems, there is almost never enough data (read: orders of magnitude) to distinguish these without using standard (but sophisticated) statistical tools. This is because for any finite sample of data from an asymptotic distribution, there will be deviations that will blur the functional form just enough to look rather like the other. For instance, if you look closely at the data of Barabasi or Johansen, there are deviations from the power-law distribution in the far upper tail. Stouffer et al. cite these as examples of the poor fit of the power law and as evidence supporting the log-normal. Unfortunately, they could simply be deviations due to finite-sample effects (not to be confused with finite-size effects), and the only way to determine whether they are is to resample from the hypothesized distribution and measure the sample deviation against the observed one.
The approach that I tend to favor for resolving this kind of question combines a goodness-of-fit test with a statistical power test to distinguish between alternative models. It's a bit more labor-intensive than the Bayesian model selection employed by Stouffer et al., but this approach offers, in addition to others that I'll describe momentarily, the advantage of being able to say that, given the data, neither model is good or that both models are good.
Using Monte Carlo simulation and something like the Kolmogorov-Smirnov goodness-of-fit test, you can quantitatively gauge how likely a random sample drawn from your hypothesized function F (which can be derived using maximum likelihood parameter estimation or by something like a least-squares fit; it doesn't matter) will have a deviation from F at least as big as the one observed in the data. By then comparing the deviations with an alternative function G (e.g., a power law versus a log-normal), you get a measure of the power of F over G as an originating model of the data. For heavy-tailed distributions, particularly those with a sample-mean that converges slowly or never at all (as is the case for something like P(t) ~ 1/t), sampling deviations can cause pretty significant problems with model selection, and I suspect that the Bayesian model selection approach is sensitive to these. On the other hand, by incorporating sampling variation into the model selection process itself, one can get an idea of whether it is even possible to select one model over another. If someone were to use this approach to analyze the data of human response times, I suspect that the pure power law would be a poor fit (the data looks too curved for that), but that the power law suggested in Johansen's paper would be largely statistically indistinguishable from a log-normal. With this knowledge in hand, one is then free to posit mechanisms that generate either distribution and then proceed to validate the theory by testing its predictions (e.g., its assumptions).
So, in the end, we may not have gained much in arguing about which heavy-tailed distribution the data likely came from, and instead should consider whether or not an equally plausible mechanism for generating the response-time data could be derived from the standard mechanisms for producing log-normal distributions. If we had such an alternative mechanism, then we could devise some experiments to distinguish between them and perhaps actually settle this question like scientists.
As a closing thought, my interest in this debate is not particularly in its politics. Rather, I think this story suggests some excellent questions about the practice of modeling, the questions a good modeler should ponder on the road to truth, and some of the pot holes strewn about the field of complex systems. It also, unfortunately, provides some anecdotal evidence of some systemic problems with attribution, the scientific publishing industry and the current state of peer-review at high-profile, fast turn-around-time journals.
References for those interested in reading the source material.
A. Johansen, "Probing human response times." Physica A 338 (2004) 286-291.
A.-L. Barabasi, "The origin of bursts and heavy tails in human dynamics." Nature 435 (2005) 207-211.
D. B. Stouffer, R. D. Malmgren and L. A. N. Amaral "Comment on 'The origin of bursts and heavy tails in human dynamics'." e-print (2005).
J.-P. Eckmann, E. Moses and D. Sergi, "Entropy of dialogues creates coherent structures in e-mail traffic." PNAS USA 101 (2004) 14333-14337.
A.-L. Barabasi, K.-I. Goh, A. Vazquez, "Reply to Comment on 'The origin of bursts and heavy tails in human dynamics'." e-print (2005).
November 27, 2005
Irrational exuberance plus indelible sniping yields delectable entertainment
In a past entry (which sadly has not yet scrolled off the bottom of the front page - sad because it indicates how infrequently I am posting these days), I briefly discussed the amusing public debate between Barabasi et al. and Stouffer et al. over Barabasi's model of correspondence. At that point, I found the exchange amusing and was inclined to agree with the response article. However, let me rehash this topic and shed a little more light on the subject.
From the original abstract of the article posted on arxiv.org by Barabasi:
Current models of human dynamics, used from risk assessment to communications, assume that human actions are randomly distributed in time and thus well approximated by Poisson processes. In contrast, ... the timing of many human activities, ranging from communication to entertainment and work patterns, [are] ... characterized by bursts of rapidly occurring events separated by long periods of inactivity. Here we show that the bursty nature of human behavior is a consequence of a decision based queuing process: when individuals execute tasks based on some perceived priority, the timing of the tasks will be heavy tailed, most tasks being rapidly executed, while a few experience very long waiting times.
(Emphasis is mine.) Barabasi is not one to shy away from grand claims of universality. As such, he epitomizes the thing that many of those outside of the discipline hate about physicists, i.e., their apparent arrogance. My opinion is that most physicists accused of intellectual arrogance are misunderstood, but that's a topic for another time.
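For concreteness, here is a minimal sketch of the queueing mechanism the abstract describes (my own reading of it, not Barabasi's actual code): a fixed-length task list where, at each step, the highest-priority task is executed with probability p and a uniformly random task otherwise, the executed task being replaced by a fresh one with a new random priority. For p near 1, most tasks are executed almost immediately while a few unlucky low-priority ones wait enormously long:

```python
import random
import statistics

def simulate_queue(p=0.999, list_len=50, steps=50000, seed=0):
    # With probability p execute the highest-priority task, otherwise a
    # uniformly random one; replace the executed task with a new task
    # carrying a fresh uniform priority. Returns all observed waiting times.
    rng = random.Random(seed)
    prios = [rng.random() for _ in range(list_len)]
    born = [0] * list_len
    waits = []
    for t in range(1, steps + 1):
        if rng.random() < p:
            i = max(range(list_len), key=prios.__getitem__)
        else:
            i = rng.randrange(list_len)
        waits.append(t - born[i])
        prios[i] = rng.random()
        born[i] = t
    return waits

waits = simulate_queue()
print("median wait:", statistics.median(waits))
print("max wait:   ", max(waits))
```

The median wait is a handful of steps while the maximum is comparable to the length of the whole run, which is the bursty, heavy-tailed signature at issue; in the deterministic limit p = 1 the lowest-priority tasks simply starve forever.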
Stouffer et al. responded a few months after Barabasi's original idea, as published in Nature, with the following (abstract):
In a recent letter, Barabasi claims that the dynamics of a number of human activities are scale-free. He specifically reports that the probability distribution of time intervals tau between consecutive e-mails sent by a single user and time delays for e-mail replies follow a power-law with an exponent -1, and proposes a priority-queuing process as an explanation of the bursty nature of human activity. Here, we quantitatively demonstrate that the reported power-law distributions are solely an artifact of the analysis of the empirical data and that the proposed model is not representative of e-mail communication patterns.
(Emphasis is mine.) In this comment, Stouffer et al. strongly criticize the data analysis that Barabasi uses to argue for the plausibility and, indeed, the correctness of his priority-based queueing model. I admit that when I first read Barabasi's queueing model, I thought that surely the smart folks who have been dealing with queueing theory (a topic nearly a century old!) knew something like this already. Even if that were the case, the idea certainly qualifies as interesting, and I'm happy to see a) the idea published, although Nature was likely not the appropriate place and b) the press attention that Barabasi has brought to the discipline of complex systems and modeling. Anyway, the heart of the data-analysis based critique of Barabasi's work lies in distinguishing two different kinds of heavy-tailed distributions: the log-normal and the power law. Because a heavy tail is an asymptotic property, these two distributions can be extremely difficult to differentiate when the data only spans a few orders of magnitude (as is the case here). Fortunately, statisticians (and occasionally, myself) enjoy this sort of thing. Stouffer et al. employ such statistical tools in the form of Bayesian model selection to choose between the two hypotheses and find the evidence of the power law lacking. It was quite dissatisfying, however, that Stouffer et al. neglected to discuss their model selection procedure in detail, and instead chose to discuss the politicking over Barabasi's publication in Nature.
And so, it should come as no surprise that a rejoinder from Barabasi was soon issued. With each iteration of this process, the veneer of professionalism cracks away a little more:
[Stouffer et al.] revisit the datasets [we] studied..., making four technical observations. Some of [their] observations ... are based on the authors' unfamiliarity with the details of the data collection process and have little relevance to [our] findings ... and others are resolved in quantitative fashion by other authors.
In the response, Barabasi et al. discuss the details of the dataset that Stouffer et al. fixated on: that the extreme short-time behavior of the data is actually an artifact of the way messages to multiple recipients were logged. They rightly emphasize that it is the existence of a heavy tail that is primarily interesting, rather than its exact form (of course, Barabasi made some noise about the exact form in the original paper). However, it is not sufficient to simply observe a heavy tail, posit an apparently plausible model that produces some such tail and then declare victory, universality and issue a press release. (I'll return to this thought in a moment.) As a result, Barabasi's response, while clarifying a few details, does not address the fundamental problems with the original work. Problems that Stouffer et al. seem to intuit, but don't directly point out.
While the rebuttal suggests the data is a better fit for the log-normal distribution, I am not a big believer in the fit-the-data approach to distinguishing these distributions. The Barabasi paper actually suggested a model, which is nice, although the problem of how to verify such a model is a challenge. This seems to be the real problem. Trust me, anyone can come up with a power-law model. The challenge is figuring out how to show your model is actually right.
That is, first and foremost, the bursty nature of human activity is odd and, in that alluring voice only those fascinated by complex systems can hear, begs for an explanation. Second, a priority-based queueing process is merely one possible explanation (out of perhaps many) for the heaviness and burstiness. The point is that there is a real difficulty in nailing down causal mechanisms in human systems. Often the best we can do is concoct a theory and see if the data supports it. That is, it is exceedingly difficult to go beyond mere plausibility without an overwhelming weight of empirical evidence and, preferably, the vetting of falsifiable hypotheses. The theory of natural selection is an excellent example that has been validated by just such a method (and continues to be). Unfortunately, simply looking at the response time statistics for email or letters by Darwin or Einstein, while interesting from the socio-historical perspective, does not prove the model. On the contrary: it merely suggests it.
That is, Barabasi's work demonstrates the empirical evidence (heavy-tails in the response times of correspondence) and offers a mathematical model that generates statistics of a similar form. It does not show causality, nor does it provide falsifiable hypotheses by which it could be invalidated. Barabasi's work in this case is suggestive but not explanatory, and should be judged accordingly. To me, it seems that the contention over the result derives partly from the overstatement of its generality, i.e., the author claims his model to be explanatory. Thus, the argument over the empirical data is really just an argument about how much plausibility it imparts to the model. Had Barabasi gone beyond suggestion, I seriously doubt the controversy would exist.
Considering the issues raised here, personally, I think it's okay to publish a result that is merely suggestive so long as it is honestly made, diligently investigated and embodies a compelling and plausible story. That is to say that, ideally, authors should discuss the weaknesses of their model, empirical results and/or mathematical analysis, avoid overstating the generality of the result (sadly, a frequent problem in many of the papers I referee), carefully investigate possible biases and sources of error, and discuss alternative explanations. Admittedly, this last one may be asking a bit much. In a sense, these are the things I think about when I read any paper, but particularly when I referee something. This thread of thought seems to be fashionable right now, as I just noticed that Cosma's latest post discusses criteria for accepting or rejecting papers in the peer review process.
November 06, 2005
Finding your audience
Some time ago, a discussion erupted on Crooked Timber about the etiquette of interdisciplinary research. This conversation was originally sparked by Eszter Hargittai, a sociologist with a distinct interest in social network analysis, who complained about some physicists working on social networks and failing to appropriately cite previous work in the area. I won't rehash the details, since you can read them for yourself. However, the point of the discussion that is salient for this post is the question of where and how one should publish and promote interdisciplinary work.
Over the better part of this past year, I have had my own journey with doing interdisciplinary research in political science. Long-time readers will know that I'm referring to my work with Maxwell (mentioned here, here and here). In our paper (old version via arxiv), we use tools from extremal statistics and physics to think carefully about the nature and evolution of terrorism, and, I think, uncover some interesting properties and trends at the global level. Throughout the process of getting our results published in an appropriate technical venue, I have espoused the belief that it should either go to an interdisciplinary journal or one that political scientists will read. That is, I felt that it should go to a journal with an audience that would both appreciate the results and understand their implications.
This idea of appropriateness and audience, I think, is a central problem for interdisciplinary researchers. In an ideal world, every piece of novel research would be communicated to exactly that group of people who would get the most out of learning about the new result and who would be able to utilize the advance to further deepen our knowledge of the natural world. Academic journals and conferences are a poor approximation of this ideal, but currently they're the best institutional mechanism we have. To correct for the non-idealness of these institutions, academics have always distributed preprints of their work to their colleagues (who often pass them to their own friends, etc.). Blogs, e-print archives and the world wide web in general constitute interesting new developments in this practice and show how the fundamental need to communicate ideas will co-opt whatever technology is available. Returning to the point, however, what is interesting about interdisciplinary research is that by definition it has multiple target audiences to which it could, or should, be communicated. Choosing that audience can become a question of choosing what aspects of the work you think are most important to science in general, i.e., what audience has the most potential to further develop your ideas? For physicists working on networks, some of their work can and should be sent to sociology journals, as its main contribution is in the form of understanding social structure and implication, and sociologists are best able to use these discoveries to explain other complex social phenomena and to incorporate them into their existing theoretical frameworks.
In our work on the statistics of terrorism, Maxwell and I have chosen a compromise strategy to address this question: while we selected general science or interdisciplinary journals to send our first manuscript on the topic, we have simultaneously been making contacts and promoting our ideas in political science so as to try to understand how to further develop these ideas within their framework (and perhaps how to encourage the establishment to engage in these ideas directly). This process has been educational in a number of ways, and recently has begun to bear fruit. For instance, at the end of October, Maxwell and I attended the International Security Annual Conference (in Denver this year) where we presented our work in the second of two panels on terrorism. Although it may have been because we announced ourselves as computer scientists, stood up to speak, used slides and showed lots of colorful figures, the audience (mostly political scientists, with apparently some government folk present as well) was extremely receptive to our presentation (despite the expected questions about statistics, the use of randomness and various other technical points that were unfamiliar to them). This led to several interesting contacts and conversations after the session, and an invitation for both of us to attend a workshop in Washington DC on predictive analysis for terrorism that will be attended by people from the entire alphabet soup of spook agencies. Also, thanks to the mention of our work in The Economist over the summer, we have similarly been contacted by a handful of political scientists who are doing rigorous quantitative work in a similar vein as ours. We're cautiously optimistic that this may all lead to some fruitful collaborations, and ultimately to communicating our ideas to the people to whom they will matter the most.
Despite the current popularity of the idea of interdisciplinary research (not to be confused with excitement about the topic itself, which would take the form of funding), if you are interested in pursuing a career in it, there is, as with many aspects of an academic career, little education about its pitfalls. The question of etiquette in academic research deserves much more attention in graduate school than it currently receives, as does its subtopic of interdisciplinary etiquette. Essentially, it is this last idea that lies at the heart of Eszter Hargittai's original complaint about physicists working on social networks: because science is a fundamentally social exercise, there are social consequences for not observing the accepted etiquette, and those consequences can be a little unpredictable when the etiquette is still being hammered out, as it is in the case of interdisciplinary research. For our work on terrorism, our compromise strategy has worked so far, but I fully expect that, as we continue to work in the area, we will need to more fully adopt the modes and conventions of our target audience in order to communicate effectively with them.
October 27, 2005
Links, links, links.
The title is perhaps a modern variation on Hamlet's famous "words, words, words" quip to Lord Polonius. Some things I've read recently, with mild amounts of editorializing:
Tim Burke (History professor at Swarthmore College) recently discussed (again) his thoughts on the future of academia. That is, what would it take for college costs to actually decrease? I assume this arises at least partially as a result of the recent New York Times article on the ever-increasing tuition rates for colleges in this country. He argues that modern college costs rise at least partially as a result of pressure from lawsuits and parents to act in loco parentis toward the kids attending. Given the degree of hand-holding I experienced at Haverford, perhaps the closest thing to Swarthmore without actually being Swat, this makes a lot of sense. I suspect, however, that tuition prices will continue to increase apace for the time being, if only because enrollment rates continue to remain high.
Speaking of high enrollment rates, Burke makes the interesting point
... the more highly selective a college or university is in its admission policies, the more useful it is for an employer as a device for identifying potentially valuable employees, even if the employer doesn’t know or care what happened to the potential employee while he or she was a student.
This assertion rests on an assumption whose pervasiveness I wonder about. Basically, Burke is claiming that selectivity is an objective measure of something. Indeed, it is. It's an objective measure of the popularity of the school, filtered through the finite size of a freshman class that the school can reasonably admit, and nothing else. A huge institution could catapult itself higher in the selectivity rankings simply by cutting the number of students it admits.
Barabasi's recent promotion of his ideas about the relationship between "bursty" human behavior and our managing a queue of tasks continues to generate press. New Scientist and Physics Web both picked up the piece of work on Darwin's, Einstein's and modern email users' communication patterns. To briefly summarize from Barabasi's own paper:
Here we show that the bursty nature of human behavior is a consequence of a decision based queueing process: when individuals execute tasks based on some perceived priority, the timing of the tasks will be heavy tailed, most tasks being rapidly executed, while a few experience very long waiting times.
A.-L. Barabasi (2005) "The origin of bursts and heavy tails in human dynamics." Nature 435, 207.
That is, the response times are described by a power law with exponent between 1.0 and 1.5. Once again, power laws are everywhere. (NB: In the interest of full disclosure, power laws are one focus of my research, although I've gone on record saying that there's something of an irrational exuberance for them these days.) To those of you experiencing power-law fatigue, it may not come as any surprise that last night in the daily arXiv mailing of new work, a very critical (I am even tempted to say scathing) comment on Barabasi's work appeared. Again, to briefly summarize from the comment:
... we quantitatively demonstrate that the reported power-law distributions are solely an artifact of the analysis of the empirical data and that the proposed model is not representative of e-mail communication patterns.
D. B. Stouffer, R. D. Malmgren and L. A. N. Amaral (2005) "Comment on The origin of bursts and heavy tails in human dynamics." e-print.
There are several interesting threads embedded in this discussion, the main one being the twin supports of good empirical research: 1) rigorous quantitative tools for data analysis, and 2) a firm basis in empirical and statistical methods to support whatever conclusions you draw with the aforementioned tools. In this case, Stouffer, Malmgren and Amaral use Bayesian model selection to rule out the power law as a model, and instead show that the distributions are better described by a log-normal distribution. This idea of the importance of good tools and good statistics is something I've written about before. Cosma Shalizi is a continual booster of these issues, particularly among physicists working in extremal statistics and social science.
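To make the flavor of this kind of model comparison concrete, here is a small, self-contained sketch of fitting both candidate distributions by maximum likelihood and comparing their log-likelihoods on synthetic data. This is purely illustrative toy code of my own (the function names, parameters and data are mine, and this is not the Bayesian machinery the Comment's authors actually use, which properly accounts for model complexity), but it shows the basic point: on log-normally distributed data, the log-normal fit wins handily.

```python
import math
import random

def fit_powerlaw(xs, xmin):
    """MLE for a continuous power law p(x) = ((alpha-1)/xmin) * (x/xmin)^(-alpha),
    valid for x >= xmin.  Returns (alpha_hat, log-likelihood)."""
    n = len(xs)
    s = sum(math.log(x / xmin) for x in xs)
    alpha = 1.0 + n / s
    loglik = n * math.log((alpha - 1.0) / xmin) - alpha * s
    return alpha, loglik

def fit_lognormal(xs):
    """MLE for a log-normal distribution.  Returns (mu_hat, sigma_hat, log-likelihood)."""
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((l - mu) ** 2 for l in logs) / n)
    loglik = sum(
        -math.log(x * sigma * math.sqrt(2.0 * math.pi))
        - (math.log(x) - mu) ** 2 / (2.0 * sigma ** 2)
        for x in xs
    )
    return mu, sigma, loglik

if __name__ == "__main__":
    random.seed(1)
    # Synthetic log-normal "waiting times": the log-normal model should win here.
    xs = [random.lognormvariate(1.0, 0.75) for _ in range(5000)]
    xmin = min(xs)
    alpha, ll_pl = fit_powerlaw(xs, xmin)
    mu, sigma, ll_ln = fit_lognormal(xs)
    print("power law:  alpha=%.2f  loglik=%.1f" % (alpha, ll_pl))
    print("log-normal: mu=%.2f sigma=%.2f  loglik=%.1f" % (mu, sigma, ll_ln))
```

The general lesson, which I take from this whole exchange, is that eyeballing a roughly straight line on a log-log plot is not a fit; you have to actually compare competing models.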
And finally, Carl Zimmer, always excellent, on the evolution of language.
[Update: After Cosma linked to my post, I realized it needed a little bit of cleaning up.]
September 29, 2005
Networks in our nation's capital
This past week, I attended the Statistics on Networks workshop at the National Academies of Science in Washington DC, where I saw many familiar faces and many new ones. In particular, I was very happy to finally meet Jon Kleinberg, John Doyle, Steve Borgatti and my collaborator Dimitris Achlioptas. And it was nice to see Walter Willinger and Chris Wiggins again, both of whom I met at the MSRI workshop on networks earlier this year. And naturally, it was nice to see my collaborator Mark Newman again, even though we correspond pretty regularly. Now that I've distributed the appropriate linkage for the search engines, let me get on with my thoughts.
This workshop was interesting for a couple of reasons. First, the audience contained statisticians, social scientists, computer science/physics people, and engineers/biologists. Certainly the latter two groups presented very different perspectives on networks, with the former being interested in universality properties and random models of networks, while the latter was much more interested in building or decomposing a particular kind or instance of a network. The social scientists present (and there were many of them) seemed to have a nicely balanced perspective on the usefulness of random models, with perhaps a slight leaning toward the computer science/physics side. Naturally, this all made for interesting dinner and wrap-up discussion. For myself, my bias is naturally in the direction of appreciating models that incorporate randomness. However, it's true that when translated through a particular network model, randomness can itself generate structure (e.g., random graphs with power law degree distributions tend to have a densely connected core of high degree vertices, a structure that is a poor model for the core of the internet, where mixing is disassortative). In the case of real world networks, I think random models yield the most benefit when used to explore the space of viable solutions to a particular constraint or control problem. Eve Marder's work (also at the workshop) on small networks of self-regulating neurons (in this case, those of the lobster gut) is a particularly good example of this approach.
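The parenthetical point about randomness generating structure can be seen directly in the standard configuration model. The sketch below is my own toy code (nothing presented at the workshop): it samples an approximate power-law degree sequence, wires stubs together uniformly at random, and then checks that the highest-degree vertices end up far more densely interconnected than a same-sized random set of vertices, i.e., the dense core appears from randomness alone.

```python
import random

def powerlaw_degrees(n, alpha=2.5, kmin=1):
    """Sample n degrees from (roughly) a power law with exponent alpha,
    via inverse-CDF sampling of a continuous Pareto, then flooring."""
    degs = [int(kmin * (1.0 - random.random()) ** (-1.0 / (alpha - 1.0)))
            for _ in range(n)]
    return [max(kmin, min(d, n - 1)) for d in degs]  # cap at n - 1

def configuration_model(degrees):
    """Uniform random stub matching; self-loops and multi-edges are allowed,
    as in the standard configuration model."""
    stubs = [v for v, d in enumerate(degrees) for _ in range(d)]
    if len(stubs) % 2:       # drop one stub if the total degree is odd
        stubs.pop()
    random.shuffle(stubs)
    return list(zip(stubs[::2], stubs[1::2]))

def edges_within(edges, nodes):
    """Count edges with both endpoints inside the given node set."""
    nodes = set(nodes)
    return sum(1 for u, v in edges if u in nodes and v in nodes)

if __name__ == "__main__":
    random.seed(2)
    n = 2000
    degrees = powerlaw_degrees(n)
    edges = configuration_model(degrees)
    hubs = sorted(range(n), key=lambda v: degrees[v], reverse=True)[:20]
    rand = random.sample(range(n), 20)
    print("edges among the 20 highest-degree nodes:", edges_within(edges, hubs))
    print("edges among 20 random nodes:           ", edges_within(edges, rand))
```

Because the probability of an edge between two vertices scales with the product of their degrees, the hubs knit themselves into a dense core without any mechanism explicitly putting one there, which is exactly why such models mislead about the internet's disassortative core.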
Second, although there were very few graduate students in attendance (I counted three, myself included), the environment was friendly, supportive and generally interesting. The workshop coordinators did a good job of inviting people doing interesting work, and I enjoyed just about all of the talks. Finally, it was interesting to see inside the National Academies a little. This is the institution that answers the scientific inquiries of Congress, although I can't imagine this Congress listens to its scientists very much.
May 21, 2005
The inter-disciplinary politics of interdisciplinary research or, "Hey, that was my idea first."
A few days ago, Eszter Hargittai posted a rant on the group blog Crooked Timber about the entrée of physicists into the sociological subfield of social networks and her perception that they contribute mostly nothing of value. Her entry was prompted by this paper about the EuroVision Contest. I first learned about the entry when she reproduced it on the social-networking listserv SOCNET, a list on which I lurk mostly because I'm too cheap to pay the membership fee and also because I mainly use it as a way to collect journal references for the sociology literature. References that I imagine I'll read or use one day, although, given the poor job I'm currently doing of keeping up with recent papers in my own field, I may realistically never get around to them. (This point is salient, and I'll return to it momentarily.) In the ensuing and relatively lively debate in the post's comments section, someone called for, and then received, attention from my friend Cosma Shalizi, who blogs his own thoughts on the subject in his usual lengthy, heavily cross-referenced and edifying way.
Several meta-commentary thoughts come immediately to mind:
1. Cosma's points are extremely thoughtful and are likely right on the money in terms of seeing both the merits of physicists' contributions to the social sciences and the argument that they reinvent wheels. Most relevant to the rant about physicists not contributing anything of value to the field of social networks, he gives four excellent and broad examples of how physicists have added to our knowledge.
2. One of these points, which bears rehashing here, is that physicists are not just interested in social networks (it unfortunately illustrates the irony of the sociologists' claims of academic injustice that this observation is absent from their complaints). Physics training, and particularly that of statistical mechanics, the subfield from which most physicists interested in social networks hail, emphasizes that items of inquiry can, to as great an extent as possible, be treated as interchangeable. Thus, the field of complex networks is built on the idea that social networks are just one kind of network. The progress physicists have made in carving out this field has been somewhat spotty, perhaps because they do not know entirely how much of statistical mechanics to import and how much of a reliance on numerical simulation is reasonable (this touches on a related point: there is not yet a firm consensus on how computational modeling and simulation should be incorporated into science to the same degree that theory and empiricism have been). If they have been arrogant toward other fields in their attempts to do this, then they should be chastised through letters to the editors of the journals that publish the offending articles. With regard to the EuroVision Contest article, Eszter Hargittai and Kieran Healy's best recourse is to write such a letter to Physica A showing that the work is not novel.
3. A point which Cosma omits from his list is the connection made to social network analysis, via complex network analysis, of a large body of mathematical techniques from physics, such as percolation theory (he does point out the contribution via network epidemiology), the renormalization group, random graph theory, ideas of entropy and techniques for modeling dynamical systems. I may be wrong about these contributions, since I will easily admit that I don't read enough of the sociology literature. (Update: Cosma notes that sociologists and affiliated statisticians were familiar with Erdos-Renyi random graph theory before the physicists came along.)
4. There's a deeper issue at play here, which Cosma has also discussed (his prolificness is truly impressive, even more so given its high quality). Namely, there are more physicists than there is funding (or interest?) for physics problems. While I was at Haverford, one of my physics professors told me, without a hint of a smile, that in order to get a job in traditional physics, you basically had to work at one of the national laboratories, work at a particle accelerator laboratory, or work in condensed matter physics. None of these seemed particularly appealing, yet the ideas and approaches of physics were. So it is perhaps entirely expected that folks in a position like mine eventually branch out into other fields. This is, after all, the nature of interdisciplinary research, and physicists (along with mathematicians and, to a lesser degree, chemists) seem particularly well equipped for this kind of adventure. With the rising emphasis among both funding agencies and universities on interdisciplinary research (which may or may not be simply lip service), the future likelihood of inter-disciplinary ego-bruising seems high.
5. Obviously, in any scientific endeavor, interdisciplinary or otherwise, scientists should understand the literature that came before (I dislike the term "master" because it implies an amount of time commitment that I think few people can honestly claim to have spent with the literature). In my recent referee work for Physical Review E, I have routinely chastised authors for not writing better introductions that leave a reader with a firm understanding of the context (past and present) in which the fundamental questions they seek to address sit. When it comes to interdisciplinary work, these problems are particularly acute: not only do you have multiple bodies of literature to quickly and succinctly review, but you must also do so in a way accessible to the members of each field. Some (but by no means all) physicists are certainly guilty of this when it comes to writing about social networks, as they are prone to reinventing the wheel. The most egregious example is the preferential attachment model of Barabasi and Albert, though it can (and should) be argued that this reinvention was extremely valuable, as it helped spark a wide degree of interest in the previous work and has prompted some excellent work developing the idea since. So, the fundamental question that I think all of us who claim to be interdisciplinary must face and ultimately answer (in a way that can be communicated to future generations of interdisciplinary researchers, many of whom are in college right now) is: what is the most principled and reasonable way, given the constraints on attention, energy, time, knowledge, intelligence, etc., to allocate proper recognition (typically via citations and coauthorships) to previous and ongoing work that is relevant to some interdisciplinary effort?
Or, more succinctly, what's the most practical way to mitigate the inter-disciplinary politics of interdisciplinary research while encouraging it to the fullest extent possible? Closely related are the questions of how to adequately evaluate the merit of research that does not fall squarely within the domain of a large enough body of experts for peer review, how academic departments should value interdisciplinary researchers, and what role such researchers should fill in the highly compartmentalized and territorial realm of academic disciplines.
Manual TrackBack: Three-toed Sloth
January 31, 2005
On being interdisciplinary
I've been on the East Coast for two weeks for reasons of both work and play. I started at the MIT Media Lab, working with Nathan Eagle on some really amazing network analysis stuff. It was cold, and there was some peripheral unpleasantness not connected with him (he was actually good about it as it was happening), but it was great to be in a totally new environment thinking about totally new things. Plus, I got to hang out with friends from the Santa Fe Institute summer school I attended a couple of years ago. Then I went to Holyoke, where I got to visit an old, old friend whom I haven't seen in ages. Jessica drove me down to Yale, where I visited my friend Robin Herlands, which was wonderful and stimulating and fun, despite having 18 inches of snow dumped on us that weekend. I met her friends, played "Cups" for the first time (I'm a natural - who knew?), and bonded with her hyperactive kitten Charlie. Then I went to New York City to see one of my closest guy friends, Trevor Barran. Despite his being completely overworked, that week was fabulous, if only for my being able to cavort in a city that I have fallen completely in love with for the culture, the closeness, the speed and the density. Nights of drinking and philosophizing and meeting people, followed by days of work and wandering and wondering.
One of the things I love most about my line of work is that it's largely quite accessible to smart people who aren't in my field. It's even accessible to people who aren't in academia at all. Sharing these things and getting people excited about the work is what makes me believe that what I'm doing is perhaps meaningful or worthwhile. And given the unrelenting pressure to produce new results at a constant (or accelerating) rate, this kind of support is like oxygen. In meeting so many new people during the past few weeks, I often got asked to describe my research. This is hard: I don't have a niche; I don't have a well-defined field. There is no obscure property of a complicated system that will bear my name, and there is no unifying framework that my work nicely fits inside of.
So, I've used the term "research cowboy" to describe what I do, since my work bears rough similarities to riding into a town, solving some problems that no one else has solved yet, and then riding out of town before I get too comfortable. I should make the French surrealist painter Francis Picabia my patron, for having once uttered the words "One must be a nomad, traveling through ideas as one travels through countries and cities." This is exactly what I find so stimulating about academia, but if it can't support my habit for the long term, then ultimately, I'm going to want out. The only distinctions I see between fields like physics, chemistry, biology and geopolitics are those arising from our inability to sufficiently understand their similarities and their structure. The universe makes no distinction between these things, so why do we?
My advisor once said to me that being interdisciplinary is both easy and hard. On the one hand, there is often a lot more "low-hanging fruit" (oh, how academics overuse that phrase) in interdisciplinary fields, but on the other, endemic ideological gravitation requires that one be twice as smart, twice as rigorous and twice as good at explaining the relevance of one's results in order to be taken seriously by the fields one is jostling. There is no Nature or Physical Review Letters for interdisciplinary work, and there are no Departments of Interdisciplinary Research.
And so, I am worried about several things. I am worried that I don't have the endurance to keep up with the grueling work schedule that academic research entails. I am worried that I won't continue to have interesting things to say about the world in my ill-defined fields. And I am worried that because I am neither a physicist nor a computer scientist that I won't be able to find a job in either, or that whatever job I do end up finding won't provide me the flexibility that I need. If I can't be interdisciplinary, I can't keep working. Does anyone know of any openings for a Research Cowboy?