Recently some colleagues and I published a paper in PLOS in which we analyzed about 47,000 Data Availability Statements as a way of exploring the state of data sharing in a journal with a pretty strong data availability policy. The paper has gotten a good response from what I’ve seen on Twitter, and I’m really happy with how it turned out, thanks in part to some great feedback from the reviewers. But I also wanted to tell a few more things about how this paper came about – the things that don’t make it into the final scholarly article. A behind the scenes look, if you will.
The idea for this paper arose out of a somewhat eye-opening experience. I needed to get a hold of a good dataset – I forget why exactly, but I think it was when I was first starting to teach R and wanted to some real data that I could use in the classes for the hands-on exercises. Remembering that PLOS had this data availability policy, I thought to myself, ah, no problem, I will find an article that looks relevant to the researchers I’m teaching, download the data, and use it in my demo (with proper attribution and credit, of course). So I found an article that looked good and scrolled down to the Data Availability Statement. Data available upon request. Huh. I thought you weren’t allowed to say that, but okay, I guess this one slipped through the policy. Found another one – data is within the paper, it said, except the only data in the paper were summary tables, which were of no use to me (nor would they be of use to anyone hoping to verify the study or reanalyze the data, for example).
What a weird fluke, I thought, that the first two papers I happened to look at didn’t really follow the policy. So I checked a third, and a fourth. Pretty soon I’d spent a half hour combing through recent PLOS articles and I had yet to find one with a publicly available dataset that I could easily download from a repository. I ended up looking elsewhere for data (did you know that baseball fans keep surprisingly in-depth data on a gazillion data points?) but I was left wondering what the real impact of this policy was, which was why I decided to do this study.
I’ll let you read the paper to find out what exactly it is that we found, but there’s one other behind-the-scenes anecdote that I’ll share about this paper that I hope will be encouraging. Obviously if you’re going to write critically about data availability, you’re going to look a little hypocritical if you don’t share your own data. I fully intended to share our data and planned to do so using Figshare, which is how I’d shared a dataset associated with another publication I’d previously published in PLOS. When I shared the data from the first article, I set it to be public immediately, though I didn’t expect anyone to want to see it before the paper was out. Unexpectedly, and unbeknownst to me, someone at Figshare apparently thought this was an interesting dataset and decided to tweet it out the same day I submitted the paper to PLOS, obviously well before it was ever published, much less accepted.
While the interest in the dataset was encouraging, I was also concerned about the fact that it was out before the paper was accepted. I figured I was flattering myself to think that someone would want to scoop me, but then, I got an email from someone I didn’t know, who told me that she had found my dataset and that she would like to write an article describing my results, and would I mind sharing my literature review/citations with her to save her the trouble? In other words, “hi, I would like to write basically the paper that you’re trying to get accepted using all of the work you did.” I want to be clear that I am all for data sharing, but this situation bothered me. Was I about to get scooped?
Obviously our paper came out, no one beat us to it, and as far as I know, no one has ever written another paper using that dataset, but I was thinking about it when I was uploading the data for this most recent paper. This dataset was way more interesting and broadly applicable than the first one, so what if someone did get a hold of it before our paper came out? So what I decided to do was to upload it to Figshare, have it generate a DOI, but keep the dataset listed as private rather than publicly release it. Our data availability statement included the DOI and was therefore on the surface in compliance, but I had a feeling that, if you went to the DOI, it would tell you that the dataset was private or wasn’t found. Obviously I could have checked this before I submitted, but to be totally honest, I just left it as it was because I was genuinely curious whether any of the reviewers would try to check it themselves and say something.
To their credit, all three of the reviewers (who by the way, were incredibly helpful and gave the most useful feedback I’ve ever gotten on peer review, which I think significantly improved the paper) did indeed point out that the DOI didn’t work. In our revisions, our Data Availability Statement included a working link to not only the data, but also the code, on OSF. I invite anyone who is interested to reuse it and hope someone will find it useful. (Please don’t judge me on the quality of my code, though – I wrote it a long time ago when I was first learning R and I would do it way better now.)