
January 29, 2014

PLOS mandates data availability. Is this a good thing?

The Public Library of Science, aka PLOS, recently announced a new policy on the availability of the data used in all papers published in all PLOS journals. The mandate is simple: all data must be either already publicly available, or be made publicly available before publication, except under certain circumstances [1].

On the face of it, this is fantastic news. It is wholly in line with PLOS’s larger mission of making the outcomes of science open to all, and supports the morally correct goal of making all scientific knowledge accessible to every human. It should also help preserve data for posterity, as apparently a paper’s underlying data becomes increasingly hard to find as the paper ages [2]. But I think the truth is more complicated.

PLOS claims that it has always encouraged authors to make their data publicly available, and I imagine that in the vast majority of cases, those data are in fact available. But the policy does change two things: (i) data availability is now a requirement for publication, and (ii) the data are supposed to be deposited in a third-party repository that makes them available without restriction or attached to the paper as supplementary files. The first part ensures that authors who would previously decline or ignore the request for open data must now fall into line. The second part means that a mere promise by the authors to share the data with others is now insufficient. It is the second part where things get complicated, and the first part is meaningless without practical solutions to the second part.

First, the argument for wanting all data associated with scientific papers to be publicly available is a good one, and I think it is also the right one. If scientific papers are in the public domain [3], but the data underlying their results are not, then have we really advanced human knowledge? In fact, it depends on what kind of knowledge the paper is claiming to have produced. If the knowledge is purely conceptual or mathematical, then the important parts are already contained in the paper itself. This situation covers only a smallish fraction of papers. The vast majority report figures, tables or values derived from empirical data, taken from an experiment or an observational study. If those underlying data are not available to others, then the claims in the paper cannot be exactly replicated.

Some people argue that if the data are unavailable, then the claims of a paper cannot be evaluated at all, but that is naive. Sometimes it is crucial to use exactly the same data, for instance, if you are trying to understand whether the authors made a mistake, whether the data are corrupted in some way, or how a particular method works. For these efforts, data availability is clearly helpful.

But science aspires to general knowledge and understanding, and thus using different data of the same type and still getting results consistent with the original claims is actually a larger step forward than simply following exactly the same steps of the original paper. Making all data available may thus have an unintended consequence of reducing the amount of time scientists spend trying to generalize, because it will be easier and faster to simply reuse the existing data rather than work out how to collect a new, slightly different data set or understand the details that went into collecting the original data in the first place. As a result, data availability is likely to increase the rate at which erroneous claims are published. In fields like network science, this kind of data reuse is the norm, and thus gives us some guidance about what kinds of issues other fields might encounter as data sharing becomes more common [4].

Of course, reusing existing data really does have genuine benefits, and in most cases these almost surely outweigh the more nebulous costs I just described. For instance, data availability means that errors can be more quickly identified because we can look at the original data to find them. Science is usually self-correcting anyway, but having the original data available is likely to increase the rate at which erroneous claims are identified and corrected [5]. And, perhaps more importantly, other scientists can use the original data in ways that the original authors did not imagine.

Second, and more critically for PLOS’s new policy, there are practical problems associated with passing research data to a third party for storage. The first problem is deciding who counts as an acceptable third party. If there is any lesson from the Internet age, it is that third parties have a tendency to disappear, in the long run, taking all of their data with them [6]. This is true both for private and public entities, as continued existence depends on continued funding, and continued funding, when that funding comes from users or the government, is a big assumption. For instance, the National Science Foundation is responsible for funding the first few years of many centers and institutes, but NSF makes it a policy to make few or no long-term commitments on the time scales PLOS’s policy assumes. Who then should qualify as a third party? In my mind, there is only one possibility: university libraries, who already have a mandate to preserve knowledge, should be tapped to also store the data associated with the papers they already store. I can think of no other type of entity with as long a time horizon, as stable a funding horizon, and as strong a mandate for doing exactly this thing. PLOS’s policy does not suggest that libraries are an acceptable repository (perhaps because libraries themselves fulfill this role only rarely right now), and only provides the vague guidance that authors should follow the standards of their field and choose a reasonable third party. This kind of statement seems fine for fields with well-developed standards, but it will likely generate enormous confusion in all other fields.

This brings us to another major problem with the storage of research data. Most data sets are small enough to be included as supplementary files associated with the paper, and this seems right and reasonable. But some data sets are really big, and these pose special problems. For instance, last year I published an open access paper in Scientific Reports that used a 20TB data set of scoring dynamics in a massive online game. Data sets of that scale might be uncommon today, but they still pose a real logistical problem when it comes to passing them to a third party for storage and access. If someone requests a copy of the entire data set, who pays for the stack of hard drives required to send it to them? What happens when the third party has hundreds or thousands of such data sets, and receives dozens or more requests per day? These are questions that the scientific community is still trying to answer. Again, PLOS’s policy only pays lip service to this issue, saying that authors should contact PLOS for guidance on “datasets too large for sharing via repositories.”
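To give a rough sense of the logistics, here is a back-of-the-envelope sketch of how long one full copy of a 20TB data set would take to move over a network. The link speeds are hypothetical round numbers, not measurements of any real repository, and the calculation assumes decimal terabytes and a link that is fully used the whole time, which is optimistic.

```python
def transfer_days(size_bytes, bits_per_second):
    """Days needed to move size_bytes over a link that sustains bits_per_second."""
    seconds = size_bytes * 8 / bits_per_second
    return seconds / 86_400  # seconds in one day


# The 20TB data set mentioned above, assuming decimal terabytes (10**12 bytes).
DATASET_BYTES = 20 * 10**12

# Hypothetical sustained link speeds, in bits per second (assumptions).
links = {"100 Mbit/s": 100e6, "1 Gbit/s": 1e9, "10 Gbit/s": 10e9}

for name, speed in links.items():
    days = transfer_days(DATASET_BYTES, speed)
    print(f"{name}: {days:.1f} days per full copy")
```

Multiply whatever number comes out by dozens of requests per day and the question of who pays for the bandwidth or the hard drives becomes hard to wave away.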

The final major problem is that not all data should be shared. For instance, data from human-subjects research often includes sensitive information about the participants, e.g., names, addresses, private behavior, etc., and it is unethical to share such data [7]. PLOS’s policy explicitly covers this concern, saying that data on humans must adhere to the existing rules about preserving privacy, etc.

But what about large data sets on human behavior, such as mobile phone call records? These data sets promise to shed new light on human behavior of many kinds and help us understand how entire populations behave, but should these data sets be made publicly available? I am not sure. Research has shown, for instance, that it is not difficult to uniquely distinguish individuals within these large data sets [8] because each of us has distinguishing features to our particular patterns of behavior. Several other papers have demonstrated that portions of these large data sets can be deanonymized, by matching these unique signatures across data sets. For such data sets, the only way to preserve privacy might be to not make the data available. Additionally, many of these really big data sets are collected by private companies, as the byproduct of their business, at a scale that scientists cannot achieve independently. These companies generally only provide access to the data if the data is not shared publicly, because they consider the data to be theirs [9]. If PLOS’s policy were universal, such data sets would seem to become inaccessible to science, and human knowledge would be unable to advance along any lines that require such data [10]. That does not seem like a good outcome.
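To make the uniqueness point a bit more concrete, here is a minimal sketch on purely synthetic data. It is not the method of any of the papers mentioned above. Each made-up "user" gets a random set of (place, hour) points, and the code asks how often a handful of known points matches exactly one user. The function names, the population sizes, and the uniform randomness are all assumptions for illustration. Real behavior is far more clustered, so the numbers this prints say nothing about any real data set.

```python
import random


def make_traces(num_users, num_places, points_per_user, rng):
    """Build synthetic traces: one set of (place, hour) points per user."""
    traces = []
    for _ in range(num_users):
        trace = {
            (rng.randrange(num_places), rng.randrange(24))
            for _ in range(points_per_user)
        }
        traces.append(trace)
    return traces


def unicity(traces, k, rng):
    """Fraction of users whose trace is the only one containing k of its own points."""
    unique = 0
    for trace in traces:
        # Pretend an outside observer knows k points from this user's trace.
        known = set(rng.sample(sorted(trace), min(k, len(trace))))
        # Count every trace (including this one) consistent with those points.
        matches = sum(1 for other in traces if known <= other)
        if matches == 1:
            unique += 1
    return unique / len(traces)


rng = random.Random(0)  # fixed seed so repeated runs agree
traces = make_traces(num_users=500, num_places=50, points_per_user=30, rng=rng)
fractions = {k: unicity(traces, k, rng) for k in (1, 2, 3, 4)}

for k, fraction in fractions.items():
    print(f"{k} known points: {fraction:.0%} of users match exactly one trace")
```

Removing names from such a data set does nothing to this calculation, since the matching uses only the behavioral points themselves.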

PLOS does seem to acknowledge this issue, but in a very weak way, saying that “established guidelines” should be followed and privacy should be protected. For proprietary data sets, PLOS only makes this vague statement: “If license agreements apply, authors should note the process necessary for other researchers to obtain a license.” At face value, it would seem to imply that proprietary data sets are allowed, so long as other researchers are free to try to license them themselves, but the devil will be in the details of whether PLOS accepts such instructions or demands additional action as a requirement for publication. I’m not sure what to expect there.

On balance, I like and am excited about PLOS’s new data availability policy. It will certainly add some overhead to finalizing a paper for submission, but it will also make it easier to get data from previously published papers. And, I do think that PLOS put some thought into many of the issues identified above. I also sincerely hope they understand that some flexibility will go a long way in dealing with the practical issues of trying to achieve the ideal of open science, at least until we the community figure out the best way to handle these practical issues.

-----

[1] PLOS's Data Access for the Open Access Literature policy goes into effect 1 March 2014.

[2] See “The Availability of Research Data Declines Rapidly with Article Age” by Vines et al. Current Biology 24(1), 94-97 (2014).

[3] Which, if they are published at a regular “restricted” access journal, they are not.

[4] For instance, there is a popular version of the Zachary Karate Club network that has an error, a single edge is missing, relative to the original paper. Fortunately, no one makes strong claims using this data set, so the error is not terrible, but I wonder how many people in network science know which version of the data set they use.

[5] There are some conditions for self-correction: there must be enough people thinking about a claim that someone might question its accuracy, one of these people must care enough to try to identify the error, and that person must also care enough to correct it, publicly. These circumstances are most common in big and highly competitive fields. Less so in niche fields or areas where only a few experts work.

[6] If you had a profile on Friendster or Myspace, do you know where your data is now?

[7] Federal law already prohibits sharing such sensitive information about human participants in research, and that law surely trumps any policy PLOS might want to put in place. I also expect that PLOS does not mean their policy to encourage the sharing of that sensitive information. That being said, their policy is not clear on what they would want shared in such cases.

[8] And, thus perhaps easier, although not easy, to identify specific individuals.

[9] And the courts seem to agree, with recent rulings deciding that a “database” can be copyrighted.

[10] It is a fair question as to whether alternative approaches to the same questions could be achieved without the proprietary data.

posted January 29, 2014 04:21 AM in Scientifically Speaking | permalink

Comments

I didn't know about the missing edge in versions of the Karate Club data. I believe that I have always used the version from Mark Newman's website.

In terms of private data, I do have deep concerns about what PLoS will do in practice. I know of one example where somebody used part of the FB100 data after FB asked me to take it off of any websites. The journal, Biometrika, to which they submitted their paper required all data to be published if it was accepted --- and based on correspondence with the journal (which I was part of, as the author discussed this with me during that process), the people on the journal side, in my view and based on what they wrote, seemed to basically not care at all that the company would not permit it to be placed on a website. [It's available on torrents, obviously, and indeed I have seen one paper reference a specific torrent location for this data.] For that paper, the author in question basically just redid the calculations on a different data set. (This was a methods paper.)

I prefer when possible to make data completely public, but even with my desires, I work with a lot of data (including stuff much more sensitive/private than the FB100 data, which is something that I think should be shared and used by scholars) where this is just impossible. When I work with such data sets, because of my concern over whether the journals will be reasonable about this, my solution thus far is to not submit the relevant papers to those journals because I simply don't trust them to be reasonable about this issue at the publication stage. I don't want to be burned and to waste my time and energy.

Posted by: Mason at January 29, 2014 06:31 PM

Mason, how about the issue of reproducibility?

There is obvious scientific value in these sensitive, restricted
datasets, but what is the overall scientific usefulness of the research
produced with them, if they are always decoupled from the data itself?

I don't see an easy answer to this, since ignoring the data is obviously
unsatisfactory, but I find the PloS stance (at the very least) a
principled one.

There is a not-so-tenuous connection here to the issue of using and
sharing data obtained from unethical experiments, especially on humans
(based on torture, etc). Sometimes, there is scientific value to them,
but since they cannot be reproduced, relying on them becomes
questionable. (I'm ignoring the issue of corroboration with such
experiments.)

There is something I don't like about Aaron's point that data sharing
facilitates the reuse of wrong data. I don't think it is untrue, but I
believe it has less to do with the sharing itself, and more so with the
general sloppiness of the field. Sharing enables the reuse of bad and
good data in equal measure. If the field is sloppy, bad data is produced
and reproduced, if not, it is not. Restricting access to data does not
really alter the sloppiness problem in any meaningful way. [If the wrong
Karate club network was restricted, probably others would use their own
(or some other) ad-hoc network instead of the consensus ad-hoc network,
possibly also containing problems, and the field would be none the
better.]

Posted by: Tiago at January 30, 2014 02:09 AM

Reproducibility is an important issue, but so is privacy. I believe in making all data public when it can be done, but I also believe in the importance of research on data sets where one can get insights without reproducing using the exact data set. There is still a very important notion of reproducibility: for a result to be robust, it can't just be valid in one specific data set or situation anyway! And you have to check that anyway to make sure the results are not artifactual. Thus, there is indeed great value in producing scholarship using data sets that privacy or other constraints prevent from being made publicly available. And if it gives evidence for or against some results, there is also indeed very great scientific usefulness! Otherwise, there are some things that could not be studied at the same depth, so the alternative becomes not probing certain things (including exceedingly important ones). Note that scientific value does not require one to "rely" on it. We don't rely on it because we need to reproduce qualitatively similar results in other (perhaps very similar, perhaps with interesting differences) situations.

The PLoS stance is ok on the surface in terms of how it is written, as they do write about qualifications for sensitive data. However, I don't trust that they'll be reasonable in practice, and therefore I will submit elsewhere when the situation calls for it. I don't need to deal with that kind of hassle. I already make data available whenever I can. I see no reason to do things that are scientifically the same except for the website something is on only to add an extra layer of hassle if the journal happens not to be convinced by that argument. I despise bureaucracy, and I simply won't deal with it if I don't see added value if I have other options --- and I do have plenty of other options.

Posted by: Mason at January 30, 2014 11:23 PM