PLOS mandates data availability. Is this a good thing?

The Public Library of Science, aka PLOS, recently announced a new policy on the availability of the data used in all papers published in all PLOS journals. The mandate is simple: all data must be either already publicly available, or be made publicly available before publication, except under certain circumstances [1].

On the face of it, this is fantastic news. It is wholly in line with PLOS’s larger mission of making the outcomes of science open to all, and supports the morally correct goal of making all scientific knowledge accessible to every human. It should also help preserve data for posterity, as apparently a paper’s underlying data becomes increasingly hard to find as the paper ages [2]. But, I think the the truth is more complicated.

PLOS claims that it has always encouraged authors to make their data publicly available, and I imagine that in the vast majority of cases, those data are in fact available. But the policy does change two things: (i) data availability is now a requirement for publication, and (ii) the data are supposed to be deposited in a third-party repository that makes them available without restriction or attached to the paper as supplementary files. The first part ensures that authors who would previously decline or ignore the request for open data must now fall into line. The second part means that a mere promise by the authors to share the data with others is now insufficient. It is the second part where things get complicated, and the first part is meaningless without practical solutions to the second part.

First, the argument for wanting all data associated with scientific papers to be publicly available is a good one, and I think it is also the right one. If scientific papers are in the public domain [3], but the data underlying their results are not, then have we really advanced human knowledge? In fact, it depends on what kind of knowledge the paper is claiming to have produced. If the knowledge is purely conceptual or mathematical, then the important parts are already contained in the paper itself. This situation covers only a smallish fraction of papers. The vast majority report figures, tables or values derived from empirical data, taken from an experiment or an observational study. If those underlying data are not available to others, then the claims in the paper cannot be exactly replicated.

Some people argue that if the data are unavailable, then the claims of a paper cannot be evaluated at all, but that is naive. Sometimes it is crucial to use exactly the same data, for instance, if you are trying to understand whether the authors made a mistake, whether the data are corrupted in some way, or understand a particular method. For these efforts, data availability is clearly helpful.

But, science aspires for general knowledge and understanding, and thus getting results using different data of the same type but which are still consistent with the original claims is actually a larger step forward than simply following exactly the same steps of the original paper. Making all data available may thus have an unintended consequence of reducing the amount of time scientists spend trying to generalize, because it will be easier and faster to simply reuse the existing data rather than work out how to collect a new, slightly different data set or understand the details that went into collecting the original data in the first place. As a result, data availability is likely to increase the rate at which erroneous claims are published. In fields like network science, this kind of data reuse is the norm, and thus gives us some guidance about what kinds of issues other fields might encounter as data sharing becomes more common [4].

Of course, reusing existing data really does have genuine benefits, and in most cases these almost surely outweigh the more nebulous costs I just described. For instance, data availability means that errors can be more quickly identified because we can look at the original data to find them. Science is usually self-correcting anyway, but having the original data available is likely to increase the rate at which erroneous claims are identified and corrected [5]. And, perhaps more importantly, other scientists can use the original data in ways that the original authors did not imagine.

Second, and more critically for PLOS’s new policy, there are practical problems associated with passing research data to a third party for storage. The first problem is deciding who counts as an acceptable third party. If there is any lesson from the Internet age, it is that third parties have a tendency to disappear, in the long run, taking all of their data with them [6]. This is true both for private and public entities, as continued existence depends on continued funding, and continued funding, when that funding comes from users or the government, is a big assumption. For instance, the National Science Foundation is responsible for funding the first few years of many centers and institutes, but NSF makes it a policy to make few or no long-term commitments on the time scales PLOS’s policy assumes. Who then should qualify as a third party? In my mind, there is only one possibility: university libraries, who already have a mandate to preserve knowledge, should be tapped to also store the data associated with the papers they already store. I can think of no other type of entity with as long a time horizon, as stable a funding horizon, and as strong a mandate for doing exactly this thing. PLOS’s policy does not suggest that libraries are an acceptable repository (perhaps because libraries themselves fulfill this role only rarely right now), and only provides the vague guidance that authors should follow the standards of their field and choose a reasonable third party. This kind of statement seems fine for fields with well-developed standards, but it will likely generate enormous confusion in all other fields.

This brings us to another major problem with the storage of research data. Most data sets are small enough to be included as supplementary files associated with the paper, and this seems right and reasonable. But, some data sets are really big, and these pose special problems. For instance, last year I published an open access paper in Scientific Reports that used a 20TB data set of scoring dynamics in a massive online game. Data sets of that scale might be uncommon today, but they still pose a real logistical problem for passing it to a third party for storage and access. If someone requests a copy of the entire data set, who pays for the stack of hard drives required to send it to them? What happens when the third party has hundreds or thousands of such data sets, and receives dozens or more requests per day? These are questions that the scientific community is still trying to answer. Again, PLOS’s policy only pays lip service to this issue, saying that authors should contact PLOS for guidance on “datasets too large for sharing via repositories.”

The final major problem is that not all data should be shared. For instance, data from human-subjects research often includes sensitive information about the participants, e.g., names, addresses, private behavior, etc., and it is unethical to share such data [7]. PLOS’s policy explicitly covers this concern, saying that data on humans must adhere to the existing rules about preserving privacy, etc.

But what about large data sets on human behavior, such as mobile phone call records? These data sets promise to shed new light on human behavior of many kinds and help us understand how entire populations behave, but should these data sets be made publicly available? I am not sure. Research has shown, for instance, that it is not difficult to uniquely distinguish individuals within these large data sets [8] because each of us has distinguishing features to our particular patterns of behavior. Several other papers have demonstrated that portions of these large data sets can be deanonymized, by matching these unique signatures across data sets. For such data sets, the only way to preserve privacy might be to not make the data available. Additionally, many of these really big data sets are collected by private companies, as the byproduct of their business, at a scale that scientists cannot achieve independently. These companies generally only provide access to the data if the data is not shared publicly, because they consider the data to be theirs [9]. If PLOS’s policy were universal, such data sets would seem to become inaccessible to science, and human knowledge would be unable to advance along any lines that require such data [10]. That does not seem like a good outcome.

PLOS does seem to acknowledge this issue, but in a very weak way, saying that “established guidelines” should be followed and privacy should be protected. For proprietary data sets, PLOS only makes this vague statement: “If license agreements apply, authors should note the process necessary for other researchers to obtain a license.” At face value, it would seem to imply that proprietary data sets are allowed, so long as other researchers are free to try to license them themselves, but the devil will be in the details of whether PLOS accepts such instructions or demands additional action as a requirement for publication. I’m not sure what to expect there.

On balance, I like and am excited about PLOS’s new data availability policy. It will certainly add some overhead to finalizing a paper for submission, but it will also make it easier to get data from previously published papers. And, I do think that PLOS put some thought into many of the issues identified above. I also sincerely hope they understand that some flexibility will go a long way in dealing with the practical issues of trying to achieve the ideal of open science, at least until we the community figure out the best way to handle these practical issues.


[1] PLOS's Data Access for the Open Access Literature policy goes into effect 1 March 2014.

[2] See “The availability of Research Data Declines Rapidly with Article Age” by Vines et al. Cell 24(1), 94-97 (2013).

[3] Which, if they are published at a regular “restricted” access journal, they are not.

[4] For instance, there is a popular version of the Zachary Karate Club network that has an error, a single edge is missing, relative to the original paper. Fortunately, no one makes strong claims using this data set, so the error is not terrible, but I wonder how many people in network science know which version of the data set they use.

[5] There are some conditions for self-correction: there must be enough people thinking about a claim that someone might question its accuracy, one of these people must care enough to try to identify the error, and that person must also care enough to correct it, publicly. These circumstances are most common in big and highly competitive fields. Less so in niche fields or areas where only a few experts work.

[6] If you had a profile on Friendster or Myspace, do you know where your data is now?

[7] Federal law already prohibits sharing such sensitive information about human participants in research, and that law surely trumps any policy PLOS might want to put in place. I also expect that PLOS does not mean their policy to encourage the sharing of that sensitive information. That being said, their policy is not clear on what they would want shared in such cases.

[8] And, thus perhaps easier, although not easy, to identify specific individuals.

[9] And the courts seem to agree, with recent rulings deciding that a “database” can be copyrighted.

[10] It is a fair question as to whether alternative approaches to the same questions could be achieved without the proprietary data.

Posted on January 29, 2014 in Scientifically Speaking | permalink | Comments (3)

2013: a year in review

This is it for the year, so here's a look back at 2013, by the numbers.

Papers published or accepted: 10 (journals or equivalent)
Number coauthored with students: 4
Number of papers that used data from a video game: 3 (this, that, and the other)
Pre-prints posted on the arxiv: 6
Other publications: 2 workshop papers, and 1 invited comment
Number coauthored with students: 2
Papers currently under review: 2
Manuscripts near completion: 8
Rejections: 4
New citations to past papers: 1722 (+15% over 2012)
Projects in-the-works: too many to count
Half-baked projects unlikely to be completed: already forgotten
Papers read: >200 (about 4 per week)

Research talks given: 15
Invited talks: 13
Visitors hosted: 2
Presentations to school teachers about science and data: 1 (at the fabulous Denver Museum of Nature and Science)
Conferences, workshops organized: 2
Conferences, workshops, summer schools attended: 7
Number of those at which I delivered a research talk: 5
Number of times other people have written about my research: >17
Number of interviews given about my research: 10

Students advised: 9 (6 PhD, 1 MS, 1 BS; 1 rotation student)
Students graduated: 1 PhD (my first: Dr. Sears Merritt), 1 MS
Thesis/dissertation committees: 10
Number of recommendation letters written: 5
Summer school faculty positions: 2
University courses taught: 2
Students enrolled in said courses: 69 grad
Number of problems assigned: 120
Number of pages of lecture notes written: >150 (a book, of sorts)
Pages of student work graded: 7225 (roughly 105 per student, with 0.04 graders per student)
Number of class-related emails received: >1624 (+38% over 2012)
Number of conversations with the university honor council: 0
Guest lectures for colleagues: 1

Proposals refereed for grant-making agencies: 1
Manuscripts refereed for various journals, conferences: 23 (+44% over 2012)
Fields covered: Network Science, Computer Science, Machine Learning, Physics, Ecology, Political Science, and some tabloids
Manuscripts edited for various journals: 2
Conference program committees: 2
Words written per report: 921 (-40% over 2012)
Referee requests declined: 68 (+36% over 2012)
Journal I declined the most: PLoS ONE (12 declines, 3 accepts)

Grant proposals submitted: 7 (totaling $6,013,669)
Number on which I was PI: 3
Proposals rejected: 2
New grants awarded: 3 (totaling $1,438,985)
Number on which I was PI: 1
Proposals pending: 2
New proposals in the works: 3

Emails sent: >8269 (+3% over 2012, and about 23 per day)
Emails received (non-spam): >16453 (+6% over 2012, and about 45 per day)
Fraction about work-related topics: 0.87 (-0.02 over 2012)
Emails received about power-law distributions: 157 (3 per week, same as 2012)

Unique visitors to my professional homepage: 31,000 (same as 2012)
Hits overall: 87,000 (+10% over 2012)
Fraction of visitors looking for power-law distributions: 0.52 (-11% over 2012)
Fraction of visitors looking for my course materials: 0.16
Unique visitors to my blog: 11,300 (-2% over 2012)
Hits overall: 17,300 (-4% over 2012)
Most popular blog post among those visitors: Our ignorance of intelligence (from 2005)
Blog posts written: 6 (-57% over 2012)
Most popular 2013 blog post: Small science for the win? Maybe not.

Number of twitter accounts: 1
Tweets: 235 (+82% over 2012; mostly in lieu of blogging)
Retweets: >930 (+281% over 2012)
Most popular tweet: a tweet about professors having little time to think
New followers on Twitter: >700 (+202% over 2012)

Number of computers purchased: 2
Netflix: 72 dvds, >100 instant (mostly TV episodes during lunch breaks and nap times)
Books purchased: 3 (-73% over 2012)
Songs added to iTunes: 140 (-5% over 2012)
Photos added to iPhoto: 2357 (+270% over 2012)
Jigsaw puzzle pieces assembled: >2,000
Major life / career changes: 0
Photos taken of my daughter: >1821 (about 5 per day)

Fun trips with friends / family: 10
Half-marathons completed: 0.76 (Coal Creek Crossing 10 mile race)
Trips to Las Vegas, NV: 0
Trips to New York, NY: 1
Trips to Santa Fe, NM: 9
States in the US visited: 8
States in the US visited, ever: 49
Foreign countries visited: 6 (Switzerland, Denmark, Sweden, Norway, United Kingdom, Canada)
Foreign countries visited, ever: 30
Number of those I drove to: 1 (Canada, 10 hours from Washington DC after United canceled my flight to Montreal for the JSM; I arrived with a few hours to spare before my invited talk)
Other continents visited: 1
Other continents visited, ever: 5
Airplane flights: 39

Here's to a great year, and hoping that 2014 is even better.

Update 23 December 2013: Mason reminded me that I forgot a foreign country this year.

Posted on December 22, 2013 in Self Referential | permalink | Comments (2)

Network Analysis and Modeling (CSCI 5352)

This semester I developed and taught a new graduate-level course on network science, titled Network Analysis and Modeling (listed as CSCI 5352 here at Colorado). As usual, I lectured from my own set of notes, which total more than 150 pages of material. The class was aimed at graduate students, so the pace was relatively fast and I assumed they could write code, understood basic college-level math, and knew basic graph algorithms. I did not assume they had any familiarity with networks.

To some degree, the course followed Mark Newman's excellent textbook Networks: An Introduction, but I took a more data-centric perspective and covered a number of additional topics. A complete listing of the lecture notes are below, with links to the PDFs.

I also developed six problem sets, which, happily, the students tell me where challenging. Many of the problems were also drawn from Newman's textbook, although often with tweaks to make the more suitable for this particular class. It was a fun class to teach, and overall I'm happy with how it went. The students were enthusiastic throughout the semester and engaged well with the material. I'm looking forward to teaching it again next year, and using that time to work out some of the kinks and add several important topics I didn't cover this time.


Lecture 1,2 : An introduction and overview, including representing network data, terminology, types of networks (pdf)

Lecture 3 : Structurally "important" vertices, degree-based centralities, including several flavors of eigenvector centrality (pdf)

Lecture 4 : Geometric centralities, including closeness, harmonic and betweenness centrality (pdf)

Lecture 5 : Assortative mixing of vertex attributes, transitivity and clustering coefficients, and reciprocity (pdf)

Lecture 6 : Degree distributions, and power-law distributions in particular (pdf)

Lecture 7 : Degree distributions, and fitting models, like the power law, to data via maximum likelihood (pdf)

Lecture 8 : How social networks differ from biological and technological networks, small worlds and short paths (pdf)

Lecture 9 : Navigability in networks (discoverable short paths) and link-length distributions (pdf)

Lecture 10 : Probabilistic models of networks and the Erdos-Renyi random graph model in particular (pdf)

Lecture 11 : Random graphs with arbitrary degree distributions (the configuration model), their properties and construction (pdf)

Lecture 12 : Configuration model properties and its use as a null model in network analysis (pdf)

Lecture 13 : The preferential attachment mechanism in growing networks and Price's model of citation networks (pdf)

Lecture 14 : Vertex copying models (e.g., for biological networks) and their relation to Price's model (pdf)

Lecture 15 : Large-scale patterns in networks, community structure, and modularity maximization to find them (pdf)

Lecture 16 : Generative models of networks and the stochastic block model in particular (pdf)

Lecture 17 : Network inference and fitting the stochastic block model to data (pdf)

Lecture 18 : Using Markov chain Monte Carlo (MCMC) in network inference, and sampling models versus optimizing them (pdf)

Lecture 19 : Hierarchical block models, generating structurally similar networks, and predicting missing links (pdf)

Lecture 20 : Adding time to network analysis, illustrated via a dynamic proximity network (pdf)


Problem set 1 : Checking the basics and a few centrality calculations (pdf)

Problem set 2 : Assortativity, degree distributions, more centralities (pdf)

Problem set 3 : Properties and variations of random graphs, and a problem with thresholds (pdf)

Problem set 4 : Using the configuration model, and edge-swapping for randomization (pdf)

Problem set 5 : Growing networks through preferential or uniform attachment, and maximizing modularity (pdf)

Problem set 6 : Free and constrained inference with the stochastic block model (pdf)

Posted on November 22, 2013 in Teaching | permalink | Comments (1)

Upcoming networks meetings

There are a number of great looking meetings about networks coming up in the next 12 months [1,2]. Here's what I know about so far, listed in their order of occurrence. If you know of more, let me know and I'll add them to the list.

  1. Santa Fe Institute's Short Course on Complexity: Exploring Complex Networks

    Location: Austin TX
    Date: 4-6 September, 2013

    Organized by the Santa Fe Institute.

    Confirmed speakers: Luis Bettencourt (SFI), Aaron Clauset (Colorado, Boulder), Simon DeDeo (SFI), Jennifer Dunne (SFI), Paul Hines (Vermont), Bernardo Huberman (HP Labs), Lauren Ancel Meyers (UT Austin), and Cris Moore (SFI).

    Registration: until full (registration link here)

  2. Workshop on Information in Networks (WIN)

    Location: New York University, Stern Business School
    Date: 4-5 October 2013

    Organized by Sinan Aral (MIT), Foster Provost (NYU), and Arun Sundararajan (NYU).

    Invited speakers: Lada Adamic (Facebook), Christos Faloutsos (CMU), James Fowler (UCSD), David Lazer (Northeastern), Ilan Lobel (NYU), John Padgett (Chicago), Sandy Pentland (MIT), Patrick Perry (NYU), Alessandro Vespignani (Northeastern) and Duncan Watts (MSR).

    Application deadline (for oral presentations): already passed
    Early registration deadline: 13 September 2013

  3. SAMSI Workshop on Social Network Data: Collection and Analysis

    Location: SAMSI, Research Triangle Park, NC
    Date: 21-23 October 2013

    Organized by Tom Carsey (UNC), Stephen Fienberg (CMU), Krista Gile (UMass), and Peter Mucha (UNC)

    Invited speakers: 12-15 invited talks (not clear who!)

    Registration: 13 September 2013

  4. Network Frontier Workshop

    Location: Northwestern University (Evanston, IL)
    Date: 4-6 December 2013

    Organized by Adilson E. Motter (Northwestern), Daniel M. Abrams (Northwestern), and Takashi Nishikawa (Northwestern).

    Invited speakers: lots (14), including many notables in the field.

    Application deadline (for oral presentations): 18 October 2013

  5. Frontiers of Network Analysis: Methods, Models, and Applications

    Location: NIPS 2013 Workshop, Lake Tahoe NV
    Date: December 9 or 10 (TBD), 2013
    Call for papers: here

    Organized by Edo Airoldi (Harvard), David Choi (Carnegie Mellon), Aaron Clauset (Colorado, Boulder), Khalid El-Arini (Facebook) and Jure Leskovec (Stanford).

    Speakers: Lise Getoor (Maryland), Carlos Guestrin (Washington), Jennifer Neville (Purdue) and Mark Newman (Michigan).

    Submission deadline: 23 October 2013 (extended deadline; via EasyChair, see workshop webpage for link)

  6. Mathematics of Social Learning

    Location: Institute for Pure & Applied Mathematics (IPAM) at UCLA
    Date: 6-10 January 2014

    Organized by Santo Fortunao (Aalto), James Fowler (UCSD), Kristina Lerman (USC), Michael Macy (Cornell) and Cosma Shalizi (Carnegie Mellon).

    Confirmed speakers: lots (24), including the organizers and many notables in the field.

    Application deadline (for travel support): 11 November 2013 (application here)

  7. Workshop on Complex Networks (CompleNet)

    Location: University of Bologna (Bologna, Italy)
    Date: 12-14 March 2014

    Organized by Andrea Omicini (Bologna), Julia Poncela-Casasnovas (Northwestern) and Pierluigi Contucci (Bologna)

    Invited speakers: Juyong Park (KAIST), Mirko Degli Esposti, Stephen Uzzo (NY Hall of Science & NYU), Giorgio Fagiolo, Adriano Barra

    Paper submission: 6 October 2013
    Early registration: 10 December 2013

  8. Les Houches Thematic School on Structure and Dynamics of Complex Networks

    Location: Les Houches, France
    Date: 7-18 April 2014

    Organized by Alain Barrat (CNRS & Aix-Marseille University), Ciro Cattuto, (ISI) and Bruno Gonçalves (CNRS & Aix-Marseille University)

    Confirmed speakers: lots (14), including the organizers and many notables in the field.

    Registration deadline: 15 December 2013 (opens in October)

  9. Mathematics Research Communities on Network Science

    Location: Snowbird UT
    Date: 34-30 June 2014

    Organized by Mason Porter (Oxford), Aaron Clauset (Colorado, Boulder), and David Kempe (USC), with help from Dan Larremore (Harvard).

    Format: this is a workshop aimed at graduate students and early postdocs, focused around small group research projects on network science.

    Applications open: 1 November 2013 (see American Mathematical Society page for the link)

Updated 4 September 2013: Added the Les Houches meeting.
Updated 6 September 2013: Added the Northwestern meeting.
Updated 9 September 2013: Added SAMSI, WIN, and CompleNet meetings.
Updated 9 October 2013: Updated the NIPS 2013 workshop deadline.


[1] Other than NetSci, the big annual networks meeting, of course, which will be 2-6 June 2014, in Berkeley CA, organized by Raissa D'Souza and Neo Martinez.

[2] It has not escaped my attention that I am either an organizer or speaker at many of these meetings.

Posted on August 23, 2013 in Networks | permalink | Comments (2)

If birds are this smart, how smart were dinosaurs?

I continue to be fascinated and astounded by how intelligent birds are. A new paper in PLOS ONE [1,2,3] by Auersperg, Kacelnik and von Bayern (at Vienna, Oxford and the Max-Planck-Institute for Ornithology, respectively) uses Goffin’s Cockatoos to demonstrate that these birds are capable of learning and performing a complex sequence of tasks in order to get a treat. Here's the authors describing the setup:

Here we exposed Goffins to a novel five-step means-means-end task based on a sequence of multiple locking devices blocking one another. After acquisition, we exposed the cockatoos to modifications of the task such as reordering of the lock order, removal of one or more locks, or alterations of the functionality of one or more locks. Our aim was to investigate innovative problem solving under controlled conditions, to explore the mechanism of learning, and to advance towards identifying what it is that animals learn when they master a complex new sequential task.

The sequence was sufficiently long that the authors argue the goal the cockatoos were seeking must have been an abstract goal. The implication is that the cockatoos are capable of a kind of abstract, goal-oriented planning and problem solving that is similar to what humans routinely exhibit. Here's what the problem solving looks like:

The authors' conclusions put things nicely into context, and demonstrate the appropriate restraint with claims about what is going on inside the cockatoos' heads (something we humans could do better at in interpreting each others' behaviors):

The main challenge of cognitive research is to map the processes by which animals gather and use information to come up with innovative solutions to novel problems, and this is not achieved by invoking mentalistic concepts as explanations for complex behaviour. Dissecting the subjects’ performance to expose their path towards the solution and their response to task modifications can be productive; even extraordinary demonstrations of innovative capacity are not proof of the involvement of high-level mental faculties, and conversely, high levels of cognition could be involved in seemingly simple tasks. The findings from the transfer tests allow us to evaluate some of the cognition behind the Goffins’ behaviour. Although the exact processes still remain only partially understood, our results largely support the supposition that subjects learn by combining intense exploratory behavior, learning from consequences, and some sense of goal directedness.

So it seems that the more we learn about birds, the more remarkably intelligent some species appear to be. Which begs a question: if some bird species are so smart, how smart were dinosaurs? Unfortunately, we don't have much of an idea because we don't know how much bird brains have diverged from their ancestral non-avian dinosaur form. But, I'm going to go out on a limb here, and suggest that they may have been just as clever as modern birds.


[1] Auersperg, Kacelnik and von Bayern, Explorative Learning and Functional Inferences on a Five-Step Means-Means-End Problem in Goffin’s Cockatoos (Cacatua goffini) PLOS ONE 8(7): e68979 (2013).

[2] Some time ago, PLOS ONE announced that they were changing their name from "PLoS ONE" to "PLOS ONE". But confusingly, on their own website, they give the citation to new papers as "PLoS ONE". I still see people use both, and I have a slight aesthetic preference for "PLoS" over "PLOS" (a preference perhaps rooted in the universal understanding that ALL CAPS is like yelling on the Interwebs, and every school child knows that yelling for no reason is rude).

[3] As it turns out these same authors have some other interesting results with Goffin's Cockatoos, including tool making and use. io9 has a nice summary, plus a remarkable video of the tool-making in action.

Posted on July 04, 2013 in Obsession with birds | permalink | Comments (2)

Small science for the win? Maybe not.

(A small note: with the multiplication of duties and obligations, both at home and at work, writing full-length blog entries has shifted to a lower priority relative to writing full-length scientific articles. Blogging will continue! But selectively so. Paper writing is moving ahead nicely.)

In June of this year, a paper appeared in PLOS ONE [1] that made a quantitative argument for public scientific funding agencies to pursue a "many small" approach in dividing up their budgets. The authors, Jean-Michel Fortin and David Currie (both at the University of Ottawa) argued from data that scientific impact is a decreasing function of total research funding. Thus, if a funding agency wants to maximize impact, it should fund lots of smaller or moderate-sized research grants rather than fund a small number of big grants to large centers or consortia. This paper made a big splash among scientists, especially the large number of us who rely on modest-sized grants to do our kind of science.

Many of us, I think, have felt in our guts that the long-running trend among funding agencies (especially in the US) toward making fewer but bigger grants was poisonous to science [2], but we lacked the data to support or quantify that feeling. Fortin and Currie seemed to provide exactly what we wanted: data showing that big grants are not worth it.

How could such a conclusion be wrong? Despite my sympathy, I'm not convinced they have it right. The problem is not the underlying data, but rather overly simplistic assumptions about how science works and fatal confounding effects.

Fortin and Currie ask whether certain measures of productivity, specifically number of publications or citations, vary as a function of total research funding. They fit simple allometric or power functions [3] to these data and then looked at whether the fitted function was super-linear (increasing return on investment; Big Science = Good), linear (proportional return; Big Science = sum of Small Science) or sub-linear (decreasing return; Big Science = Not Worth It). In every case, they identify a sub-linear relationship, i.e., increasing funding by X% yields less a progressively smaller proportional increase in productivity. In short, a few Big Sciences are less impactful and less productive than many Small Sciences.

But this conclusion only follows from the data if you believe that all scientific questions are equivalent in size and complexity, and that any paper could in principle be written by a research group of any funding level. These beliefs are certainly false, and this implies that the conclusions don't follow from the data.

Here's an illustration of the first problem. In the early days of particle physics, small research groups made significant progress because the equipment was sufficiently simple that a small team could build and operate it, and because the questions of scientific interest were accessible using such equipment. Small Science worked well. But today the situation is different. The questions of scientific interest are largely only accessible with enormously complex machines, run by large research teams supported by large budgets. Big Science is the only way to make progress because the tasks are so complex. Unfortunately, this fact is lost in Fortin and Currie's analysis because their productivity variables do not expose the difficulty of the scientific question being investigated, and are likely inversely correlated with difficulty.

This illustrates the second problem. The cost of answering a question seems largely independent of the number of papers required to describe the answer. There will only be a handful of papers published describing the experiments that discovered the Higgs boson, even though its discovery was possibly one of the costliest physics experiments so far. Furthermore, it is not clear that the few papers produced by a big team should necessarily be highly cited, as citations are produced by the publication of new papers, and if there are only a few research teams working in a particularly complex area, the number of new citations is not independent of the size of the teams.

In fact, by defining "impact" as counts related to paper production [4,5], Fortin and Currie may have inadvertently engineered the conclusion that Small Science will maximize impact. If project complexity correlates positively with grant size and negatively with paper production, then a decreasing return in paper/citation production as a function of grant size is almost sure to appear in any data. So while the empirical results seem correct, they are of ambiguous value. Furthermore, a blind emphasis on funding researchers to solve simple tasks (small grants), to pick all that low-hanging fruit people talk about, seems like an obvious way to maximize paper and citation production because simpler tasks can be solved faster and at lower cost than harder, more complex tasks. You might call this the "Washington approach" to funding, since it kicks the hard problems down the road, to be funded, maybe, in the future.

From my own perspective, I worry about the trend at funding agencies away from small grants. As Fortin and Currie point out, research has shown that grant reviewers are notoriously bad at predicting success [6] (and so are peer reviewers). This fact alone is a sound argument for a major emphasis on funding small projects: making a larger number of bets on who will be successful brings us closer to the expected or background success rate and minimizes the impact of luck (and reduces bias due to collusion effects). A mixed funding strategy is clearly the only viable solution, but what kind of mix is best? How much money should be devoted to big teams, to medium teams, and to small teams? I don't know, and I don't think there is any way to answer that question purely from looking at data. I would hope that the answer is chosen not only by considering things like project complexity, team efficiency, and true scientific impact, but also issues like funding young investigators, maintaining a large and diverse population of researchers, promoting independence among research teams, etc. We'll see what happens.

Update 1 July 2013: On Twitter, Kieron Flanagan asks "can we really imagine no particle physics w/out big sci... or just the kind of particle physics we have today?" To me, the answer seems clear: the more complex the project, the legitimately bigger the funding needs are. That funding manages the complexity, keeps the many moving parts moving together, and ultimately ensures a publication is produced. This kind of coordination is not free, and those dollars don't produce additional publications. Instead, their work allows for any publication at all. Without them, it would be like dividing up 13.25 billion dollars among about 3,000 small teams of scientists and expecting them to find the Higgs, without pooling their resources. So yes, no big science = no complex projects. Modern particle physics requires magnificently complex machines, without which, we'd have some nice mathematical models, but no ability to test them.

To be honest, physics is perhaps too good an example of the utility of big science. The various publicly-supported research centers and institutes, or large grants to individual labs, are a much softer target: if we divided up their large grants into a bunch of smaller ones, could we solve all the same problems and maybe more? Or, could we solve a different set of problems that would more greatly advance our understanding of the world (which is distinct from producing more papers or more citations)? This last question is hard because it means choosing between questions, rather than being more efficient at answering the same questions.


[1] J.-M. Fortin and D.J. Currie, Big Science vs. Little Science: How Scientific Impact Scales with Funding. PLoS ONE 8(6): e65263 (2013).

[2] By imposing large coordination costs on the actual conduct of science, by restricting the range of scientific questions being investigated, by increasing the cost of failure to win a big grant (because the proposals are enormous and require a huge effort to produce), by channeling graduate student training into big projects where only a few individuals actually gain experience being independent researchers, etc. Admittedly, these claims are based on what you might call "domain expertise," rather than hard data.

[3] Functions of the form log y = A*log x + B, which looks like a straight line on a log-log plot. These bivariate functions can be fitted using standard regression techniques. I was happy to see the authors include non-parametric fits to the same data, which seem to mostly agree with the parametric fits.

[4] Citation counts are just a proxy for papers, since only new papers can increase an old paper's count. Also, it's well known that the larger and more active fields (like biomedical research or machine learning) produce more citations (and thus larger citation counts) than smaller, less active fields (like number theory or experimental particle physics). From this perspective, if a funding agency took Fortin and Currie's advice literally, they would cut funding completely for all the "less productive" fields like pure mathematics, economics, computer systems, etc., and devote that money to "more productive" fields like biomedical research and whoever publishes in the International Conference on World Wide Web.

[5] Fortin and Currie do seem to be aware of this problem, but proceed anyway. Their sole caveat is simply thus: "Campbell et al. [12] discuss caveats of the use of bibliometric measures of scientific productivity such as the ones we used."

[6] For instance, see Berg, Nature 489, 203 (2012).

Posted on July 01, 2013 in Scientifically Speaking | permalink | Comments (0)

Design and Analysis of Algorithms (CSCI 5454)

Returning from paternity leave last semester, it is back into the fire this semester with my large graduate-level algorithms class, Design and Analysis of Algorithms (CSCI 5454). This year, I am again shocked to find that it is the largest grad class in my department, with 60 enrolled students and 30 students on the waiting list. In the past, enrollment churn at the beginning of the semester [1] has been high enough that everyone on the wait list has had an opportunity to take the class. I am not sure that will be true this year. This year will also be the last time I teach it, for a while.

My version of this course covers much of the standard material in algorithms, but with a few twists. Having tinkered considerably with the material a year ago, I'm going to tinker a little less this year. That being said, improving is a never-ending process and there are a number of little things on the problem sets and lectures that I want to try differently this time. The overall goals remain the same: to show students how different strategies for algorithm design work (and sometimes don't work), to get them thinking about boundary cases and rigorous thinking, to get them to think carefully about whether an algorithm is correct or not, and to introduce them to several advanced topics and algorithms they might encounter out in industry.

For those of you interested in following along, I will again be posting my lecture notes on the class website.


[1] Caused in part by some students enrolling thinking the class would be easy, and then dropping it when they realize that it requires real effort. To be honest, I am okay with this, although it does make for scary enrollment numbers on the first day.

Posted on January 10, 2013 in Teaching | permalink | Comments (2)