I’ve been invited to be a speaker at this evening’s Health 2.0 STAT meetup at Bethesda’s Barking Dog, alongside some pretty awesome scientists with whom I’ve been collaborating on some interesting research projects. This invitation is a good step toward my ridiculously nerdy goal of one day being invited to give a TED talk. My talk, entitled “Radical Reuse: Repurposing Yesterday’s Data for Tomorrow’s Discoveries,” will briefly outline my view of data sharing and reuse, including what I view as five key factors in enabling data reuse. Since I have only five minutes for this talk, obviously I’ll be hitting only some highlights, so I decided to write this blog post to elaborate on the ideas in that talk.
First, let’s talk about the term “radical reuse.” I borrow this term from the realm of design, where it refers to taking discarded objects and giving them new life in some context far removed from their original use. For some nice examples (and some cool craft ideas), check out this Pinterest board devoted to the topic. For example, shipping pallets are built to fulfill the specific purpose of providing a base for goods in transport. The person assembling that shipping pallet, the person loading it onto a truck, the person unpacking it, and so on, all use it for this specific purpose, but a very creative person might see that shipping pallet and realize that they can make a pretty cool wine rack out of it.
The very same principle is true of scientific research data. Most often, a researcher collects data to test some specific hypothesis, often under the auspices of funding that was earmarked to address a particular area of science. Maybe that researcher will go on to write an article that discusses the significance of these data in the context of that research question. Or maybe the data will never be published anywhere because they represent negative or inconclusive findings (for a nice discussion of this publication bias, see Ben Goldacre’s 2012 TED talk). Whatever the outcome, the usefulness of the dataset need not end when the researcher who gathered the data is done with it. In fact, those data may help answer a question that the original researcher never even conceived, perhaps in an entirely different realm of science. What’s more, the return on investment in that data increases when it can be reused to answer novel questions, science moves more quickly because the process of data gathering need not be repeated, and therapies potentially make their way into practice more quickly.
Unfortunately, science as it is practiced today does not particularly lend itself to this kind of radical reuse. Datasets are difficult to find, hard to get from researchers who “own” them, and often incomprehensible to those who would seek to reuse them. Changing how researchers gather, use, and share data is no trivial task, but to move toward an environment that is more conducive to data sharing, I suggest that we need to think about five factors:
- Description: if you manage to find a dataset that will answer your question, it’s unlikely that the researcher who originally gathered that data is going to stand over your shoulder and explain the ins and outs of how the data were gathered, what the variables or abbreviations mean, or how the machine was calibrated when the data were gathered. I recently helped some researchers locate data about influenza, and one of the variables was patient temperature. Straightforward enough. Except the researchers asked me to find out how temperature had been obtained – oral, rectal, tympanic membrane – since this affects the reading. I emailed the contact person, and he didn’t know. He gave me someone else to talk to, who also didn’t know. I was never able to hunt down the answer to this fairly simple question, which is pretty problematic. To the extent possible, data should be thoroughly described, particularly using standardized taxonomies, controlled vocabularies, and formal metadata schemas that will convey the maximum amount of information possible to potential data re-users or other people who have questions about the dataset.
- Discoverability: when you go into a library, you don’t see a big pile of books just lying around and dig through the pile hoping you’ll find something you can use. Obviously this would be ridiculous; chances are you’d throw up your hands in dismay and leave before you ever found what you were looking for. Librarians catalog books, shelve them in a logical order, and put the information into a catalog that you can search and browse in a variety of ways so that you can find just the book you need with a minimal amount of effort. And why shouldn’t the same be true of data? One of the services I provide as a research data informationist is assisting researchers in locating datasets that can answer their questions. I find it to be a very interesting part of my job, but frankly, I don’t think you should have to ask a specialist in order to find a dataset, any more than I think you should have to ask a librarian to go find a book on the shelf for you. Instead, we need to create “catalogs” that empower users to search existing datasets for themselves. Databib, which I describe as a repository of repositories, is a good first step in this direction – you can use it to at least hopefully find a data repository that might have the kind of data you’re looking for, but we need to go even further and do a better job of cataloging well-described datasets so researchers can easily find them.
- Dissemination: sometimes when I ask researchers about data sharing, the look of horror they give me is such that you’d think I’d asked them whether they’d consider giving up their firstborn child. And to be fair, I can understand why researchers feel a sense of ownership about their data, which they have probably worked very hard to gather. To be clear, when I talk about dissemination and sharing, I’m not suggesting that everyone upload their data to the internet for all the world to access. Some datasets have confidential patient information, some have commercial value, some even have biosecurity implications, like H5N1 flu data that a federal advisory committee advised be withheld out of fear of potential bioterrorism. Making all data available to anyone, anywhere is neither feasible nor advisable. However, the scientific and academic communities should consider how to increase the incentives and remove the barriers to data sharing where appropriate, such as by creating the kind of data catalogs I described above, raising awareness about appropriate methods for data citation, and rewarding data sharing in the promotion and tenure process.
- Digital Infrastructure: okay, this is normally called cyberinfrastructure, but I had this whole “words starting with the letter D” thing going and I didn’t want to ruin it. 🙂 If we want to do data sharing properly, we need to build the tools to manage, curate, and search the data. This might seem trivial – I mean, if Google can return 168 million web pages about dogs for me in 0.36 seconds, what’s the big deal with searching for data? I’m not an IT person, so I’m really not the right person to explain the details of this, but as a case in point, consider the famed Library of Congress Twitter collection. The Library of Congress announced that they would start collecting everything ever tweeted since Twitter started in 2006. Cool, huh? Only problem is, at least as of January 2013, LC couldn’t provide access to the tweets because they lacked the technology to allow such a huge dataset to be searched. I can confirm that this was true when I contacted them in March or April of 2013 to ask about getting tweets with a specific hashtag that I wanted to use to conduct some research on the sociology of scientific data sharing, and they turned me down for this reason. Imagine the logistical problems that would arise with even bigger, more complex datasets, like those associated with genome-wide association studies.
- Data Literacy: Back in my library school days, my first ever library job was at the reference desk at UCLA’s Louise M. Darling Biomedical Library. My boss, Rikke Ogawa, who trained me to be an awesome medical librarian, emphasized that when people came and asked questions at the reference desk, this was a teachable moment. Yes, you could just quickly print out the article the person needed because you knew PubMed inside and out, but the better thing to do was turn that swiveling monitor around and show the person how to find the information. You know, the whole “give a man a fish and he’ll eat for a day, teach a man to fish and he’ll eat for a lifetime” thing. The same is true of finding, using, and sharing data. I’m in the process of conducting a survey about data practices at NIH, and almost 80% of the respondents have never had any training in data management. Think about that for a second. In one of the world’s most prestigious biomedical research institutions, 80% of people have never been taught how to manage data. Eighty percent. If you’re not as appalled by that as I am, well, you should be. Data cannot be used to their fullest if the next generation of scientists continues with the kind of makeshift, slapdash data practices I often encounter in labs today. I see the potential for more librarians to take positions like mine, focusing on making data better, but that doesn’t mean that scientists shouldn’t be trained in at least the basics of data management.
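To make the Description point above a little more concrete, here’s a tiny sketch of what machine-readable variable description might look like. The field names and the controlled vocabulary are my own illustrative inventions, not any actual metadata standard, but the idea is the same: every variable carries its units and method of measurement, and the record rejects values outside the agreed vocabulary, so nobody has to email a contact person to ask how temperature was taken.

```python
# Illustrative sketch only: the schema and vocabulary below are invented,
# not an actual standard like those I mention above.

# Controlled vocabulary for how body temperature was obtained.
ALLOWED_TEMP_SITES = {"oral", "rectal", "tympanic", "axillary"}

def describe_variable(name, units, method=None):
    """Return a metadata entry, rejecting methods outside the vocabulary."""
    if name == "body_temperature" and method not in ALLOWED_TEMP_SITES:
        raise ValueError(
            f"temperature method must be one of {sorted(ALLOWED_TEMP_SITES)}"
        )
    return {"variable": name, "units": units, "method": method}

# A minimal dataset-level record with per-variable descriptions.
record = {
    "title": "Influenza surveillance dataset (illustrative)",
    "variables": [
        describe_variable("body_temperature", "degC", method="oral"),
        describe_variable("symptom_onset_date", "ISO 8601 date"),
    ],
}
```

The point isn’t this particular code, of course; it’s that a formal schema makes the “how was this measured?” question answerable forever, without tracking down the original researcher.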
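And here’s what I mean by a data “catalog” under Discoverability: described datasets become searchable by anyone, no specialist required. The records and keywords below are made up for illustration; a real catalog would obviously be far richer.

```python
# Toy catalog: invented records, for illustration only.
catalog = [
    {"id": "ds-001", "title": "Influenza patient temperatures",
     "keywords": {"influenza", "temperature", "clinical"}},
    {"id": "ds-002", "title": "H5N1 genomic sequences",
     "keywords": {"influenza", "genomics", "h5n1"}},
    {"id": "ds-003", "title": "Hospital admission rates",
     "keywords": {"admissions", "clinical"}},
]

def search(terms):
    """Return catalog records whose keywords contain every search term."""
    terms = {t.lower() for t in terms}
    return [rec for rec in catalog if terms <= rec["keywords"]]

hits = search(["influenza", "clinical"])  # matches only ds-001
```

This only works, of course, if the datasets were well described in the first place, which is why Description comes before Discoverability on my list.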
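As for the Digital Infrastructure point, here’s a tiny sketch of why searching something like the Library of Congress Twitter collection is hard: scanning every record for every query doesn’t scale, so real systems build an index ahead of time. The tweets below are invented, and this is a toy version of the idea, not how LC (or anyone) actually does it.

```python
# Toy example: invented tweets, and a much-simplified version of the
# indexing that real large-scale search infrastructure requires.
from collections import defaultdict

tweets = [
    {"id": 1, "text": "excited about #opendata and radical reuse"},
    {"id": 2, "text": "new paper on flu surveillance"},
    {"id": 3, "text": "#opendata makes science move faster"},
]

# Build an inverted index once: hashtag -> ids of tweets containing it.
index = defaultdict(set)
for t in tweets:
    for word in t["text"].split():
        if word.startswith("#"):
            index[word.lower()].add(t["id"])

def tweets_with_hashtag(tag):
    """Look up a hashtag in the prebuilt index instead of scanning everything."""
    return sorted(index.get(tag.lower(), set()))
```

With three tweets this is trivial; with hundreds of billions, building and maintaining that index is exactly the infrastructure problem that kept my hashtag request from being fillable.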
So that’s my data sharing manifesto. What I propose is not the kind of thing that can be accomplished with a few quick changes. It’s a significant paradigm shift in the way that data are collected and science is practiced. Change is never easy and rarely embraced right away, but in the end, we’re often better for having challenged ourselves to do better than we’ve been doing. Personally, I’m thrilled to be an informationist and librarian at this point in history, and I look forward to fondly reminiscing about these days in our data-driven future. 🙂