Practicing What I Preach: The Open PhD Experiment

(Note: this is an adapted version of a final paper I wrote for one of my classes. That’s why it’s so long!)

A few weeks ago, a researcher called my office to see if we could meet to discuss our shared interest in open data. I agreed, and a week later we were sitting in my office having a lively discussion about the many problems that currently hinder more widespread data sharing and reuse in biomedical research. When I mentioned that these topics would be the focus of my doctoral dissertation work, he expressed an interest in seeing some of my research. I replied that it was only my first semester, so I didn’t have much yet, but that I’d published a few papers on my previous research. “I don’t mean papers,” he said. “I mean your data, your code. If you’re doing a PhD on data sharing, don’t you think you should share your data, too? In fact, why don’t you do an open PhD?”

Perhaps I should have immediately replied, “you’re absolutely right. I will do an open PhD.”  After all, on the face of it, this suggestion seems perfectly reasonable. My research, and in fact my entire career, revolves around the premise that researchers should share their data. It should be a no-brainer that I would also share my data. In principle, I have no problem with agreeing to do so, but in the real world of research, lofty ideals like service to the community and furthering science are sometimes abandoned in favor of more practical concerns, like getting one’s paper accepted or finishing one’s dissertation before other people have a chance to capitalize on the data.

So what I ended up telling this researcher was that I found his suggestion intriguing and I’d give it some serious thought. I have done just that in the intervening weeks, and here I will reflect on the reasons for my hesitation and explore the levels of openness I am prepared to take on in my doctoral program and my academic career.

My first (mis)adventure with data sharing

The first – and as yet only – time I shared my data was when I submitted an article to PLOS in 2014. PLOS was one of the first publishers to adopt an open data policy that required researchers to share the data underlying their manuscripts. I dutifully submitted my data to figshare, a popular, discipline-agnostic data repository, with the title “Biomedical Data Sharing and Reuse: Attitudes and Practices of Clinical and Scientific Research Staff.” To my surprise, someone at figshare took notice of my upload and tweeted out a link to my dataset. I could have sworn that I’d checked the box to keep the data private until I opted to officially release them, but when I’d gone back to fix a minor mistake in the title of the submission, the box must have gotten unchecked, and the status was changed to public.

After the tweet went out, I could see from the “views” counter that people were already looking at the data. Someone retweeted the link to the data, then another person, and another. The paper hadn’t even been reviewed by anyone yet, much less accepted for publication, but my data were out there for anyone to see, with the link spreading across Twitter. The situation made me nervous. I was excited that people were interested in my data, but what were they doing with them? The views counter ticked up steadily, and people were not just viewing, but actually downloading the dataset as well.

I finally received word from PLOS that they’d accepted the paper, but they asked for major revisions; Reviewer 2 (it’s always Reviewer 2) was niggling over my statistical methods, and I was going to have to redo much of my work to respond to all the revision requests. During the revision process, I received an email from someone I’d never heard of, from an Eastern European country I can’t now recall. She had seen my data on figshare and she, too, wanted to write a paper on this topic. She asked me to send her a copy of my still-in-process paper, as well as a list of all relevant references I had found. The audacity of her request shocked me. Here was someone I’d never even met, telling me she wanted to use my data, write essentially the same paper as me, and she wanted me to give her my background research as well?  I wrote an email back, politely but firmly rebuffing her request, and I never heard from her again.

In the end, everything went fine: the paper was published and it has gone on to be cited seven times and featured in PLOS’s new Open Data collection (PLOS Collections 2016). I do still believe that researchers, particularly those whose work is supported by taxpayers’ money, have a responsibility to share their data when doing so will not violate their human subjects’ privacy. However, my own experience demonstrated to me that sharing research data cannot be viewed as a black and white proposition, that you share and are “good,” or you don’t and you are “bad.” Rather, many researchers have real, valid concerns about how they share their data, when, and with whom. Though my reasons probably differ from those of many other researchers, I have my own concerns that give me pause when it comes to the idea of an “open PhD.”

  1. I don’t think my data would be useful or interesting to anyone else.

Some datasets have near infinite value, with uses that extend far beyond the expertise or disciplinary affiliation of their original collector. New computational methodologies and analytic techniques make it possible to uncover previously undetected meaning in datasets or “mash up” disparate datasets to detect novel connections between seemingly unrelated phenomena. The ability to quickly, easily, and cheaply share massive amounts of data means that researchers around the world are able to make life-saving discoveries. For example, the National Cancer Institute’s Cancer Genomics Cloud Pilot program allows researchers to connect to cancer genome data and perform complex analyses on cloud computing platforms more powerful than any computers they could buy for their lab (National Cancer Institute Center for Biomedical Informatics & Information Technology 2016). Projects like this are exciting – they could bring about cures for cancer and vastly improve our lives. Few people would dispute that sharing these kinds of datasets is important.

By comparison, my data just look silly. Personally, I find my research fascinating. I could spend hours talking about biomedical scientists’ research data sharing and reuse practices. However, I don’t flatter myself that others are clamoring to see all the thrilling survey data and titillating interview transcriptions I have collected. Beyond validating the results in my article, I see little value for these data. Of course, I have made the argument that data can have unexpected uses that their original collectors could never have imagined, so I am prepared to admit that my data may have usefulness beyond what I would expect. Perhaps I should take the 252 views and 37 downloads of my figshare dataset as evidence that my data are of interest to more people than I might expect.

  2. I’m often embarrassed by my amateurish ways.

I’m a fan of GitHub, a site where you can share your code and allow others to collaboratively contribute to your work, but I’m also terrified of it. I spend a very significant amount of time at my job working with R, my programming language of choice; I teach it, I consult on it, and I use it for my own research. I like to think I know what I’m doing, but in all honesty, I’m pretty much entirely self-taught in R and, though I’m a quick study, I haven’t been using it for that long. I am far from an expert, and I often write code that makes this fact obvious.

Recently I wrote some R code related to a research project I hope to submit for publication soon. The work involved downloading the full text of over 60,000 articles, but since the server’s interface only allowed downloading a thousand articles at a time, I needed to write code that would download the allowed amount, then repeat itself 60 times, updating the article numbers after each iteration. I spent hours trying to figure out the best way to do it, but everything I tried failed. I could download a thousand at a time, then manually update the numbers in the code and re-run it, but doing this 60 times would have been time-consuming. In a throwing-up-your-hands moment of frustration, I wrote a command that would essentially just write those 60 lines of code for me, then ran all 60 lines.

Frankly, this approach was idiotic. Anyone who knows the first thing about programming would scoff at my code, and rightly so. However, at the time, this slipshod approach was the best I could come up with. It’s not just code that may reveal that I don’t always know what I’m doing; the more open the research process, the more opportunity for others to see the unpolished, imperfect steps that lie beneath the shiny surface of the perfected, word-smithed article.
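
For illustration only, the loop described above (fetch the allowed number of articles, advance the article numbers, repeat) might look something like the sketch below. It is written in Python rather than R, it is not the code from the project, and `batch_ranges`, `download_all`, and `fake_fetch` are hypothetical names. The real server call is not shown anywhere in this post, so a fake one stands in for it, and the totals of 60,000 articles and 1,000 per request are assumed round numbers.

```python
from typing import Callable, Iterator, List, Tuple


def batch_ranges(total: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) article numbers, 1-based and inclusive, one pair per batch."""
    for start in range(1, total + 1, batch_size):
        # The last batch may be short if total is not a multiple of batch_size.
        yield start, min(start + batch_size - 1, total)


def download_all(fetch_batch: Callable[[int, int], List[str]],
                 total: int = 60_000,
                 batch_size: int = 1_000) -> List[str]:
    """Call fetch_batch once per range and collect everything it returns."""
    articles: List[str] = []
    for start, end in batch_ranges(total, batch_size):
        articles.extend(fetch_batch(start, end))
    return articles


# Hypothetical stand-in for the real download call (an assumption for this
# sketch); it only returns labels so the loop can be run without a server.
def fake_fetch(start: int, end: int) -> List[str]:
    return [f"article-{n}" for n in range(start, end + 1)]
```

Calling `download_all(fake_fetch)` walks through 60 ranges, from 1 to 1000 up to 59001 to 60000, with no hand-editing of article numbers between runs.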

  3. It takes time to prepare data for broader consumption.

When I teach data management classes for researchers, I emphasize how good data management practices will make submitting their data at the end of the process easy, practically effortless. Of course, having your data perfectly ready to share without any extra effort at the end of your project is about as likely as jumping out of bed and looking good enough to head off to work without taking any time to freshen up. For example, part of my to-do list for preparing the article for the project I described above for publication is figuring out how to actually write that code the right way, so I can share it without fear of being humiliated. Getting my data, code, writing, or any other scholarly output I produce into the kind of shape it would need to be for me to be willing to put my name on it takes time. When I’m already trying to manage a demanding full-time job with a doctoral program and somehow still find the time to enjoy some sort of leisure every now and then, polishing up something to get it ready for sharing doesn’t often take enough priority to make it onto my daily schedule.

A compromise: the open-ish PhD

Though I’ve just spent five pages expounding on the reasons I cannot do a fully open PhD, I am prepared to compromise. The ideal the researcher urged me toward in our original conversation – don’t wait for your dissertation, share your data now, get your code up on GitHub today! – may not be right for me, but I do believe it is feasible to find some way to share at least some of my scholarly output, if not in real time, then at least in a timely fashion. Therefore, I propose the following tenets of my open-ish PhD:

  • I will do my best to write code that I am reasonably proud of (or at least not actively ashamed of) and share it on GitHub. While I do not feel comfortable immediately sharing code that corresponds to projects I am actively pursuing and seeking to publish, I will at least share it upon publication. I will also share teaching-related code immediately on GitHub, especially since doing so provides a good model for the researchers I am teaching.
  • I will make a more concerted effort to share my scholarly writing not just in its final, polished form as journal articles, but also in more casual settings, such as on my blog. I am also interested in exploring pre-print servers like arXiv and bioRxiv as a means of more rapid dissemination of research findings in advance of formal journal article publication.
  • I will attempt to collect data in a more mindful and intentional way, recognizing that I am not simply collecting my data, but that the point of my efforts is to inform others in my scholarly and research communities. As a federal employee, the work that I conduct in my official capacity cannot be copyrighted because it belongs not to me, but to all the American people who pay my salary. As I go forward with my research, I will do my best to remember that I am doing it not merely to satisfy my curiosity or add to my CV, but to advance science, even in my own small way.

In the end, it probably doesn’t matter so much whether the final data I share are perfect, whether my code impresses other people with its efficiency and elegance, or whether something I write appears in Nature or on my little blog. What matters is making the effort to share, committing to the highest level of openness possible, and doing so publicly and visibly – essentially, leading by example. I can give lectures on the importance of data sharing and teach classes on open source tools until I’m blue in the face, but perhaps the most important thing I can do to convince researchers of the importance of sharing and reusing data is doing exactly that myself.
