Behind the scenes of “Data sharing in PLOS ONE: An analysis of Data Availability Statements”

Recently some colleagues and I published a paper in PLOS in which we analyzed about 47,000 Data Availability Statements as a way of exploring the state of data sharing in a journal with a pretty strong data availability policy. The paper has gotten a good response from what I’ve seen on Twitter, and I’m really happy with how it turned out, thanks in part to some great feedback from the reviewers. But I also wanted to share a few things about how this paper came about – the things that don’t make it into the final scholarly article. A behind-the-scenes look, if you will.

The idea for this paper arose out of a somewhat eye-opening experience. I needed to get a hold of a good dataset – I forget why exactly, but I think it was when I was first starting to teach R and wanted some real data that I could use in the classes for the hands-on exercises. Remembering that PLOS had this data availability policy, I thought to myself, ah, no problem, I will find an article that looks relevant to the researchers I’m teaching, download the data, and use it in my demo (with proper attribution and credit, of course). So I found an article that looked good and scrolled down to the Data Availability Statement.  Data available upon request.  Huh. I thought you weren’t allowed to say that, but okay, I guess this one slipped through the policy.  Found another one – data is within the paper, it said, except the only data in the paper were summary tables, which were of no use to me (nor would they be of use to anyone hoping to verify the study or reanalyze the data, for example).

What a weird fluke, I thought, that the first two papers I happened to look at didn’t really follow the policy. So I checked a third, and a fourth. Pretty soon I’d spent a half hour combing through recent PLOS articles and I had yet to find one with a publicly available dataset that I could easily download from a repository. I ended up looking elsewhere for data (did you know that baseball fans keep surprisingly in-depth data on a gazillion data points?) but I was left wondering what the real impact of this policy was, which was why I decided to do this study.

I’ll let you read the paper to find out what exactly it is that we found, but there’s one other behind-the-scenes anecdote that I’ll share about this paper that I hope will be encouraging. Obviously if you’re going to write critically about data availability, you’re going to look a little hypocritical if you don’t share your own data. I fully intended to share our data and planned to do so using Figshare, which is how I’d shared a dataset associated with a previous paper I’d published in PLOS. When I shared the data from the first article, I set it to be public immediately, though I didn’t expect anyone to want to see it before the paper was out. Unexpectedly, and unbeknownst to me, someone at Figshare apparently thought this was an interesting dataset and decided to tweet it out the same day I submitted the paper to PLOS, obviously well before it was ever published, much less accepted.

While the interest in the dataset was encouraging, I was also concerned about the fact that it was out before the paper was accepted. I figured I was flattering myself to think that someone would want to scoop me, but then, I got an email from someone I didn’t know, who told me that she had found my dataset and that she would like to write an article describing my results, and would I mind sharing my literature review/citations with her to save her the trouble? In other words, “hi, I would like to write basically the paper that you’re trying to get accepted using all of the work you did.” I want to be clear that I am all for data sharing, but this situation bothered me. Was I about to get scooped?

Obviously our paper came out, no one beat us to it, and as far as I know, no one has ever written another paper using that dataset, but I was thinking about it when I was uploading the data for this most recent paper.  This dataset was way more interesting and broadly applicable than the first one, so what if someone did get a hold of it before our paper came out? So what I decided to do was upload the dataset to Figshare and have it generate a DOI, but keep the dataset listed as private rather than publicly releasing it. Our data availability statement included the DOI and was therefore, on the surface, in compliance, but I had a feeling that if you went to the DOI, it would tell you that the dataset was private or couldn’t be found. Obviously I could have checked this before I submitted, but to be totally honest, I just left it as it was because I was genuinely curious whether any of the reviewers would try to check it themselves and say something.

To their credit, all three of the reviewers (who by the way, were incredibly helpful and gave the most useful feedback I’ve ever gotten on peer review, which I think significantly improved the paper) did indeed point out that the DOI didn’t work. In our revisions, our Data Availability Statement included a working link to not only the data, but also the code, on OSF. I invite anyone who is interested to reuse it and hope someone will find it useful. (Please don’t judge me on the quality of my code, though – I wrote it a long time ago when I was first learning R and I would do it way better now.)

Scientific “artifacts” – #overlyhonestmethods and indirect observation

This week I’ve been reading the first half of Bruno Latour and Steve Woolgar’s book Laboratory Life: The Construction of Scientific Facts.  Like many of the other pieces I’ve been reading lately, this book argues for a social constructivist theory of scientific knowledge, which is a perspective I’m really starting to identify with.  What I’m finding most interesting about this book is the ethnographic approach that was taken to observe the creation of scientific knowledge.  Basically, Bruno Latour spent two years observing in a biology lab at the Salk Institute.  Chapter 1 begins with a snippet of a transcript covering about 5 minutes of activity in a lab – all the little seemingly insignificant bits of conversation and activity that, taken together, would allow an outside observer to understand how scientific knowledge is socially constructed.

The authors emphasize that real sociological understanding of science can only come from an outside observer, someone who is not themselves too caught up in the science – someone who can still see the forest for the trees, as it were.  They even suggest that it’s important to “make the activities of the laboratory seem as strange as possible in order not to take too much for granted” (30).  Why should we need someone to spend two years in a lab watching research happen when the researchers are going to be writing up their methods and results in an article anyway, you may ask?  The authors argue that “printed scientific communications systematically misrepresent the activity that gives rise to published reports” and even “systematically conceal the nature of the activity” (28).  In my experience, I would agree that this is true – a great example of it is #overlyhonestmethods, my absolute favorite Twitter hashtag of all time, in which scientists reveal the dirty secrets that don’t make it into the Nature article.

I’ve been thinking that an ethnographic approach might be an effective way to approach my research, and I’m thinking it makes even more sense after what I’ve read of this book so far.  However, this research was done in the 1970s, when research was a lot different.  Of course there are still clinical and bench researchers who are doing actual physical things that a person can observe, but a lot of research, especially the research I’m interested in, is more about digital data that’s already collected.  If I wanted to observe someone doing the kind of research I’m interested in, it would likely involve me sitting there and staring at them just doing stuff on a computer for 8 hours a day.  So I’m not sure if a traditional ethnographic approach is really workable for what I want to do.  Plus, I don’t think I’d get anyone to agree to let me observe them.  I know I certainly wouldn’t let someone just sit there and watch me work on my computer for a whole day, let alone two years (mostly because I’d be embarrassed for anyone else to know how much time I spend looking at pictures of dogs wearing top hats and videos of baby sloths).  Even if I could get someone to agree to that, I do wonder about the problem of observer effect – that the act of someone observing the phenomenon will substantively change that phenomenon (like how I probably wouldn’t take a break from writing this post to watch this video of a porcupine adorably nomming pumpkins if someone was observing me).

This thought takes me back to something I’ve been thinking about a lot lately, which is figuring out methods of indirect observation of researchers’ data reuse practices.  I’m very interested in exploring these sorts of methods because I feel like I’ll get better and more accurate results that way.  I don’t particularly like survey research for a lot of reasons: it’s hard to get people to fill out your survey, sometimes they answer in ways that don’t really give you the information you need, and you’re sort of limited in what kind of information you can get from them.  I like interviews and focus groups even less, for many of the same reasons.  Participant observation and ethnographic approaches have the problems I’ve discussed above.  So what I think I’m really interested in doing is exploring the “artifacts” of scientific research – the data, the articles, the repositories, the funny Twitter hashtags.  This idea sort of builds upon the concept I discussed in my blog last week – how systems can be studied and can tell us something about their intended users.  I think this approach could yield some really interesting insights, and I’m curious to see what kind of “artifacts” I’ll be able to locate and use.

Can you hack it? On librarian-ing at hackathons

I had the great pleasure of spending the last few days working on a team at the latest NCBI hackathon.  I think this is the sixth hackathon I’ve been involved in, but it’s the first time I’ve actually been a participant, i.e. a “hacker.”  Prior to working on these events, I’d heard a little bit about hackathons, mostly in the context of competitive ones – a bunch of teams compete against each other to find the “best” solution to some common problem, usually with the winning team receiving some sort of cash prize.  This approach can lead to successful and innovative solutions in a short time frame.  However, the so-called NCBI-style hackathons I’ve been involved in over the last couple of years have multiple teams each working on its own challenge over a period of three days. There are no winners, but in my experience, everyone walks away having accomplished something, and some very promising software products have come out of these hackathons.  For more specifics about the how and why of this kind of hackathon, check out the article I co-authored with several participants and the mastermind behind the hackathons, Ben Busby of NCBI.

As I said, this was the first hackathon where I’ve actually been a participant on a team, but I’ve had a lot of fun doing some librarian-y type “consulting” for five other hackathons before this, and it’s an experience I can highly recommend for any information professional who is interested in seeing science happen in real time.  There’s something very exciting about watching groups of people from different backgrounds, with different expertise, most of whom have never met each other before, get together on a Monday morning with nothing but an often very vague idea, and end up on Wednesday afternoon with working software that solves a real and significant biomedical research problem.  Not only that, but most of the groups manage to get pretty far along on writing a draft of a paper by that time, and several have gone on to publish those papers, with more on the way (see the F1000Research Hackathons channel for some good examples).

As motivated and talented as all these hackathon participants are, as you can imagine, it takes a lot of organizational effort and background work to make something like this successful.  Much of that work needs to be done by someone with serious scientific and computing expertise.  However, if you are a librarian reading this, I’m here to tell you that there are some really exciting opportunities to be involved with a hackathon, even if you are completely clueless when it comes to writing code.  In the past five hackathons, I’ve functioned as a sort of embedded informationist/librarian, doing things like:

  • basic lit searching for paper introductions and generally locating background information.  These aren’t formal papers that require an extensive or systematic lit review, but it’s useful for a paper to provide some context for why the problem is significant.  The hackers have a ton of work to fit in to three days, so it’s silly to have them spend their limited time on lit searching when a pro librarian can jump in and likely use their expertise to find things more easily anyway
  • manuscript editing and scholarly communication advice.  Anyone who has worked with co-authors knows that it takes some work to make the paper sound cohesive, and not like five or six people’s papers smushed together.  Having a librarian with editing experience on the team can help make that happen.  Plus, many librarians have relevant expertise in scholarly publishing, which is especially useful since hackathon participants are often students and early-career researchers who haven’t had much experience with submitting manuscripts.  They can benefit from advice on things like citation management and handling the submission process.  Also, I am a strong believer in having a knowledgeable non-expert read any paper, not just hackathon papers.  Often writers (and I absolutely include myself here) are so deeply immersed in their own work that they make generous assumptions about what readers will know about the topic.  It can be helpful to have someone who hasn’t been involved with the project from the start take a look at the manuscript and point out where additional background or explanation would make the paper easier for a general audience to understand.
  • consulting on information seeking behavior and giving user feedback.  Most of the hackathons I’ve worked on have had teams made up of all different types of people – biologists, programmers, sys admins, other types of scientists.  They are all highly experienced and brilliant people, but most bring the perspective of their specific subject area, whereas librarians often have a broader perspective based on our interactions with lots of people from many different subject areas.  I often find myself thinking of how other researchers I’ve met might use a tool, potentially in ways its creators didn’t intend.  Also, at least at the hackathons I’ve been at, some of the tools have definite use cases for librarians – for example, tools that involve novel ways of searching or visualizing MeSH terms or PubMed results.  Having a librarian on hand to give feedback about how the tool will work can be useful for teams with that kind of scope.

I think librarians can bring a lot to hackathons, and I’d encourage all hackathon organizers to think about engaging librarians in the process early on.  But it’s not a one-way street – there’s a lot for librarians to gain from getting involved in a hackathon, even tangentially.  For one thing, seeing a project go from idea to reality in three days is interesting and informative.  When I first started working with hackathons, I didn’t have much coding experience, and I certainly had no idea how software was actually developed.  Even just hanging around hackathons gave me a much better understanding, and as an informationist who supports data science, that understanding is very relevant.  Even if you’re not involved in data science per se, if you’re a biomedical librarian who wants to gain a better understanding of the science your users are engaged in, being involved in a hackathon will be a highly educational experience.  I hadn’t really realized how much I had learned by working with hackathons until a librarian friend asked me for some advice on genomic databases. I responded by mentioning how cool it was that ClinVar would tell you about pathogenic variants, including their location and type (insertion, deletion, etc), and my friend was like, what are you even talking about, and that was when it occurred to me that I’ve really learned a lot from hackathons!  And hey, if nothing else, there tends to be pizza at these events, and you can never go wrong with pizza.

I’ll end this post by reiterating that these hackathons aren’t about competing against each other, but there are awards given for certain “exemplary” achievements.  Never one to shy away from a little friendly competition, I hoped I might be honored for some contribution this time around, and I’m pleased to say I was indeed recognized. 🙂

[Image caption: It's true, I'm the absolute worst at darts.]

There is a story behind this, but trust me when I say it’s true, I’m the absolute worst at darts.

A Silly Experiment in Quantifying Death (and Doing Better Code)

Doesn’t it seem like a lot of people died in 2016?  Think of all the famous people the world lost this year.  It was around the time that Alan Thicke died a couple weeks ago that I started thinking, this is quite odd; uncanny, even.  Then again, maybe there was really nothing unusual about this year, but because a few very big names passed away relatively young, we were all paying a little more attention to it.  Because I’m a data person, I decided to do a rather silly thing, which was to write an R script that would go out and collect a list of celebrity deaths, clean up the data, and then do some analysis and visualization.

You might wonder why I would spend my limited free time doing this rather silly thing.  For one thing, after I started thinking about celebrity deaths, I really was genuinely curious about whether this year had been especially fatal or if it was just an average year, maybe with some bigger names.  More importantly, this little project was actually a good way to practice a few things I wanted to teach myself.  Probably some of you are just here for the death, so I won’t bore you with a long discussion of my nerdy reasons, but if you’re interested in R, Github, and what I learned from this project that actually made it quite worthwhile, please do stick around for that after the death discussion!

Part One: Celebrity Deaths!

To do this, I used Wikipedia’s lists of deaths of notable people from 2006 to the present. This dataset is very imperfect, for reasons I’ll discuss below, but obviously we’re not being super scientific here, so let’s not worry too much about it. After discarding incomplete entries, I was left with 52,185 people.  Here they are on a histogram, by year.
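
(In case you’re curious what the collection step looks like: here’s a minimal sketch, not my actual script, assuming the rvest package and that a month’s entries appear as list items of the form “Name, 84, nationality, occupation.”  The real page structure varies from year to year, so treat the CSS selector and the regex as assumptions.)

    library(rvest)    # read_html, html_nodes, html_text (also re-exports %>%)
    library(stringr)  # str_match

    # One month's page; a full script would loop over all months and years
    url <- "https://en.wikipedia.org/wiki/Deaths_in_January_2016"
    page <- read_html(url)

    # Pull the text of every list item on the page
    entries <- page %>% html_nodes("li") %>% html_text()

    # Keep entries that start like "Name, 84, ..." and extract the age
    age <- as.numeric(str_match(entries, "^[^,]+, (\\d{1,3}),")[, 2])
    deaths <- data.frame(entry = entries, age = age)
    deaths <- deaths[!is.na(deaths$age), ]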

[year_plot: histogram of notable deaths by year]

As you can see, 2016 does in fact have the most deaths, with 6,640 notable people’s deaths recorded as of January 3, 2017. The next closest year is 2014, when 6,479 notable people died – a full 161 fewer than in 2016 (only about a 2% difference, to be fair, but still).  The average number of notable people who died yearly over this 11-year period was 4,774, and the number who died in 2016 alone is 40% higher than that average.  So it’s not just in my head, or yours – more notable people died this year.
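
(If you want to reproduce a chart like this one, here’s a minimal ggplot2 sketch, assuming a deaths data frame with one row per person and a numeric year column – which is what the full script builds across all the monthly pages:)

    library(ggplot2)

    # Count rows per year and draw them as bars
    ggplot(deaths, aes(x = factor(year))) +
      geom_bar() +
      labs(x = "Year", y = "Notable deaths recorded")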

Now, before we all start freaking out about this, it should be noted that the higher number of deaths in 2016 may not reflect more people actually dying – it may simply be that more deaths are being recorded on Wikipedia. The fairly steady increase and the relatively low number of deaths reported in 2006 (when Wikipedia was only five years old) suggest that this is probably the case.  I do not in any way consider Wikipedia a definitive source when it comes to vital statistics, but since, as I’ve mentioned, this project was primarily to teach myself some coding lessons, I didn’t bother myself too much about the completeness or veracity of the data.  Besides likely being an incomplete list, there are also some other data problems, which I’ll get to shortly.

By the way, in case you were wondering what the deadliest month is for notable people, it appears to be January:

[month_plot: notable deaths by month]

Obviously a death is sad no matter how old the person was, but part of what seemed to make 2016 extra awful is that many of the people who died seemed relatively young. Did more young celebrities die in 2016? This boxplot suggests that the answer is no:

[age_plot: boxplot of age at death by year]

This chart tells us that 2016 is pretty similar to other years in terms of the age at which notable people died. The mean age of death in 2016 was 76.85, which is actually slightly higher than the overall mean of 75.95. The red dots on the chart indicate outliers – basically, people who died at an age significantly higher or lower than the age at which most people died that year. There are 268 in 2016, which is a little more than in other years, but not shockingly so.
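
(Again as a hedged sketch: the boxplot takes just a few lines of ggplot2 under the same assumed deaths data frame, now also with an age column; the outliers are drawn as red points:)

    library(ggplot2)

    # One box per year; points outside the whiskers are the outliers
    ggplot(deaths, aes(x = factor(year), y = age)) +
      geom_boxplot(outlier.colour = "red") +
      labs(x = "Year", y = "Age at death")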

By the way, you may notice those outliers in 2006 and 2014 where someone died at a very, very old age. I didn’t realize it at first, but Wikipedia does include some notable non-humans in its list. One is a famous tree that died in an ice storm at age 125 and the other a tortoise who had allegedly been owned by Charles Darwin, but significantly outlived him, dying at age 176.  Obviously this makes the data and therefore this analysis even more suspect as a true scientific pursuit.  But we had fun, right? 🙂

By the way, since I’m making an effort toward doing more open science (if you want to call this science), you can find all the code for this on my Github repository.  And that leads me into the next part of this…

Part Two: Why Do This?

I’m the kind of person who learns best by doing.  I do (usually) read the documentation for stuff, but it really doesn’t make a whole lot of sense to me until I actually get in there myself and start tinkering around.  I like to experiment when I’m learning code, see what happens if I change this thing or that, so I really learn how and why things work. That’s why, when I needed to learn a few key things, rather than just sitting down and reading a book or the help text, I decided to see if I could make this little death experiment work.

One thing I needed to learn: I’m working with a researcher on a project that involves web scraping, which I had kind of played with a little, but never done in any sort of serious way, so this project seemed like a good way to learn that (and it was).  Another motivator: I’m going to be participating in an NCBI hackathon next week, which I’m super excited about, but I really felt like I needed to beef up my coding skills and get more comfortable with Github.  Frankly, doing command line stuff still makes me squeamish, so in the course of doing this project, I taught myself how to use RStudio’s Github integration, which actually worked pretty well (I got a lot out of Hadley Wickham’s explanation of it).  This death project was fairly inconsequential in and of itself, but since I went to the trouble of learning a lot of stuff to make it work, I feel a lot more prepared to be a contributing member of my hackathon team.

I wrote in my post on the open-ish PhD that I would be more amenable to sharing my code if I didn’t feel as if it were so laughably amateurish.  In the past, when I wrote code, I would just do whatever ridiculous thing popped into my head that I thought might work, because, hey, who was going to see it anyway?  Ever since I wrote that open-ish PhD post, I’ve really approached how I write code differently, on the assumption that someone will look at it (not that I think anyone is really all that interested in my goofy death analysis, but hey, it’s out there in case someone wants to look).

As I wrote this code, I challenged myself to think not just of a way, any way, to do something, but the best, most efficient, and most elegant way.  I learned how to write good functions, for real.  I learned how to use %>% (the pipe operator, which is very awesome).  I challenged myself to avoid using for loops, since those are considered not-so-efficient in R, and I succeeded except for one for loop that I couldn’t think of a way to avoid at the time – though in retrospect there’s a more efficient way I could write that part, and I’ll probably go back and change it at some point.  In the past, I would write code and be elated if it actually worked.  With this project, I realized I’ve reached a new level, where I now look at code and think, “okay, that worked, but how can I do it better?  Can I do that in one line of code instead of three?  Can I make that more efficient?”
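
To make that concrete, here’s a toy illustration (not from the actual death script) of the kind of rewrite I mean, assuming the same hypothetical deaths data frame with year and age columns – a for loop versus a piped, vectorized version with dplyr:

    library(dplyr)

    # Loop version: mean age of death per year, accumulated by hand
    years <- unique(deaths$year)
    mean_age <- numeric(length(years))
    for (i in seq_along(years)) {
      mean_age[i] <- mean(deaths$age[deaths$year == years[i]], na.rm = TRUE)
    }

    # Piped version: the same result in one readable chain
    deaths %>%
      group_by(year) %>%
      summarise(mean_age = mean(age, na.rm = TRUE))

Both produce the mean age of death per year, but the piped version says what it’s doing at a glance.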

So while this little project might have been somewhat silly, in the end I still think it was a good use of my time because I actually learned a lot and am already starting to use a lot of what I learned in my real work.  Plus, I learned that thing about Darwin’s tortoise, and that really makes the whole thing worth it, doesn’t it?

To keep or not to keep: that is the question

I recently read an article in The Atlantic about people who are compulsive declutterers – the opposite of hoarders – who feel compelled to get rid of all their possessions. I’m more on the side of hoarding, because I always find myself thinking of eventualities in which I might need the item in question.  Indeed, it has often been the case that I will think of something I got rid of weeks or even years later and wish I still had it: a book I would have liked to reference, a piece of clothing I would have liked to wear, a receipt I could have used to take something back.  Of course, I don’t have unlimited storage space, so I can’t keep all this stuff.  The question of what to keep and for how long is one that librarians think about when it comes to weeding: deciding which parts of the collection to deaccession, or basically, get rid of.  There are evidence-based, tried-and-true ways of thinking about weeding a library collection, but that’s not so much true when it comes to data.  How is a scientist to decide what to keep and what not to keep?

I know this is a question that researchers are thinking about quite a bit, because I get more emails about this than almost any other issue.  In fact, I get emails not only from users of my own library, but from researchers all over the country who have somehow found my name.  What exactly do I need to keep?  If I have electronic records, do I need to keep a print copy as well?  How many years do I need to keep this stuff?  These are all very reasonable questions, and it would be nice to say, yes, there is an answer and here it is!  But it’s almost never that easy to point to a single answer.

A case in point: a couple years ago, I decided to teach a class about data preservation and retention.  In my naivete, I thought it would be nice to take a look through all the relevant policy and find the specific number of years that research data is required to be retained.  I read handbooks and guides.  I read policy documents from various agencies.  I even read the U.S. Code (I do not recommend it).  At the end of it, I found that not only is there no single, definitive policy answer to how long funded research data should be retained, but there are in fact all sorts of contradictory suggestions.  I found documents giving retention periods from 3 years to 7 years to the super-helpful “as long as necessary.”

This may be difficult to answer from a policy perspective, but I think answering this from a best practices perspective is even trickier.  Let’s agree that we just can’t keep everything – storing data isn’t free, and it takes considerable time and effort to ensure that data remain accessible and usable.  Assuming that some stuff has to get thrown away, how do we distinguish trash from treasure, especially given the old adage about how the former might be the latter to others?  It’s hard to know whether something that appears useless now might actually be useful and interesting to someone in the future.  To take this to the extreme, here’s an actual example from a researcher I’ve worked with: he asked how he could have his program automatically discard everything in the thousandth place from his measurements.  In other words, he wanted 4.254 to be saved as 4.25.  I told him I could show him how, but I asked why he wanted to do this.  He told me that his machine was capable of measuring to the thousandth, but the measurement was only scientifically relevant to the hundredth place.  To scientists right now, 4.254 and 4.252 were essentially indistinguishable, so why bother with the extra noise of the thousandth place?  Fair point, but what about 5 years from now, or 10 years from now?  If science evolves to the point that this extra level of precision is meaningful, tomorrow’s researchers will probably be a little annoyed that today’s researchers had that measurement and just threw it away.  But then again, how can we know now when, or even if, that level of precision will be wanted?  For that matter, we can’t even say for sure whether this dataset will be useful at all.  Maybe a new and better method for making this measurement will be developed tomorrow, and all this stuff we gathered today will be irrelevant.  But how can we know?
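
(For what it’s worth, the discarding itself is a one-liner in R – shown here with the hypothetical measurement from above – which is part of why it’s worth pausing to ask “why” before running it; the digit doesn’t come back:)

    x <- 4.254
    round(x, 2)           # 4.25 - rounded to the hundredths place
    floor(x * 100) / 100  # 4.25 - the thousandths digit literally discarded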

These are all questions that I think are not easy to answer right now, but that people within research communities should be thinking about.  For one thing, I don’t think we can give one simple answer to how long data should be retained.  For one type of research, a few years may be enough.  For other fields, where it’s harder to replicate data, maybe we need to keep it in perpetuity.  When it comes to deciding what should be retained and what should be discarded, I think that answers cannot be dictated by one-size-fits-all policies and that subject matter experts and information professionals should work together to figure out good answers for specific communities and specific data.  Eventually, I suppose we’ll probably have some of those well-defined best practices for data retention in the same way that we have best practices for collection management in libraries.  Until then, keep your crystal balls handy. 🙂

So you think you can code

I’ve been thinking lately about many ideas dealing with data and data science (this is, I’m sure, not news to anyone).  I’ve also had several people encourage me to pick my blog back up, and I’ve recently made my den into a cute and comfy little office, so why not put all this together and resume blogging with a little post about my thoughts on data!  In particular, in this post I’m going to talk about coding.

Early on in my library career when I first got interested in data, I was talking to one of my first bosses and told her I thought I should learn R, which is essentially a scripting language, very useful for data processing, analysis, statistics, and visualization.  She gave me a sort of dubious look, and even as I said it, I was thinking in my head, yeah, I’m probably not going to do that.  I’m no computer scientist.  Fast forward a few years later, and not only have I actually learned R, it’s probably the single most important skill in my professional toolbox.

Here’s the thing – you don’t have to be a computer scientist to code, especially in R.  It’s actually remarkably straightforward, once you get over the initial strangeness of it and get a feel for the syntax.  I started offering R classes around the beginning of this year, and I call my introductory classes “Introduction to R for Non-programmers.”  I had two reasons for selecting this name: first, I had only been using R for less than a year myself and didn’t (and still don’t) consider myself an expert.  When I started thinking about getting up in front of a room of people and teaching them to code, I had horrifying visions of experienced computer scientists calling me out on my relative lack of expertise, mocking my class exercises, or correcting me in front of everyone.  So, I figured, let’s set the bar low. 🙂  Second, and more importantly, I wanted to emphasize that R is approachable!  It’s not scary!  I can learn it, you can learn it.  Hell, young children can (and do) learn it.  Not only that, but you can learn it from one of a plethora of free resources without ever cracking a book or spending a dime.  All it takes is a little time, patience, and practice.
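
And to show just how approachable it is, here’s a tiny but complete first “analysis” that runs as-is in any R installation, using a dataset that ships with R – no downloads, no packages, no setup:

    data(mtcars)         # a built-in example dataset (32 cars)
    summary(mtcars$mpg)  # min, quartiles, median, and mean fuel economy
    hist(mtcars$mpg)     # and a histogram, in one line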

The payoff?  For one thing, you can impress your friends with your nerdy awesome skills!  (Or at least that’s what I keep telling myself.)  If you work with data of any kind, you can simplify your work, because using R (or another scientific programming language) is faaaaar more efficient than using point-and-click tools like Excel.  You can create super awesome visualizations, do crazy data analysis in a snap, and work with big huge data sets that would break Excel.  And you can do all of this for free!  If you’re a research and/or medical librarian, you will also make yourself an invaluable resource to your user community.  I believe that I could teach an R class every day at my library and there would still be people showing up.  We regularly have waitlists of 20 or more people.  Scientists are starting to catch on, for all the reasons I’ve mentioned above, but not all of them have the time or inclination to use one of the free online resources.  Plus, since I’m a real human person who knows my users and their research and their data, I know what they probably want to do, so my classes are more tailored to them.

Yesterday I was introduced to Hadley Wickham, who is a pretty big deal in the R world, having created some very important R packages (kind of like apps), and the friend and colleague who introduced me said, “this is Lisa; she is our prototypical data scientist librarian.”  I know there are other librarian coders out there because I’m on mailing lists with some of them, but I’m not currently aware of any other data librarians or medical librarians who know R.  I’m sure there are others and I would be very interested in knowing them.  And if it is fair to consider me a “prototype,” I wonder how many other librarians will be interested in becoming data scientist librarians.  I’m really interested in hearing from the librarians reading this – do you want to code?  Do you think you can learn to code?  And if not, why not?

Radical Reuse: Repurposing Yesterday’s Data for Tomorrow’s Discoveries

I’ve been invited to be a speaker at this evening’s Health 2.0 STAT meetup at Bethesda’s Barking Dog, alongside some pretty awesome scientists with whom I’ve been collaborating on some interesting research projects.  This invitation is a good step toward my ridiculously nerdy goal of one day being invited to give a TED talk.  My talk, entitled “Radical Reuse: Repurposing Yesterday’s Data for Tomorrow’s Discoveries,” will briefly outline my view of data sharing and reuse, including what I view as five key factors in enabling data reuse.  Since I have only five minutes for this talk, obviously I’ll be hitting only some highlights, so I decided to write this blog post to elaborate on the ideas in that talk.

First, let’s talk about the term “radical reuse.”  I borrow this term from the realm of design, where it refers to taking discarded objects and giving them new life in some context far removed from their original use.  For some nice examples (and some cool craft ideas), check out this Pinterest board devoted to the topic.  For example, shipping pallets are built to fulfill the specific purpose of providing a base for goods in transport.  The person assembling that shipping pallet, the person loading it onto a truck, the person unpacking it, and so on, all use it for this specific purpose, but a very creative person might see that shipping pallet and realize that they can make a pretty cool wine rack out of it.

The very same principle is true of scientific research data.  Most often, a researcher collects data to test some specific hypothesis, often under the auspices of funding that was earmarked to address a particular area of science.  Maybe that researcher will go on to write an article that discusses the significance of the data in the context of that research question.  Or maybe those data will never be published anywhere because they represent negative or inconclusive findings (for a nice discussion of this publication bias, see Ben Goldacre’s 2012 TED talk).  Whatever the outcome, the usefulness of the dataset need not end when the researcher who gathered the data is done with it.  In fact, that data may help answer a question that the original researcher never even conceived, perhaps in an entirely different realm of science.  What’s more, the return on investment in that data increases when it can be reused to answer novel questions, science moves more quickly because the process of data gathering need not be repeated, and therapies potentially make their way into practice more quickly.

Unfortunately, science as it is practiced today does not particularly lend itself to this kind of radical reuse.  Datasets are difficult to find, hard to get from researchers who “own” them, and often incomprehensible to those who would seek to reuse them.  Changing how researchers gather, use, and share data is no trivial task, but to move toward an environment that is more conducive to data sharing, I suggest that we need to think about five factors:

  • Description: if you manage to find a dataset that will answer your question, it’s unlikely that the researcher who originally gathered that data is going to stand over your shoulder and explain the ins and outs of how the data were gathered, what the variables or abbreviations mean, or how the machine was calibrated when the data were gathered.  I recently helped some researchers locate data about influenza, and one of the variables was patient temperature.  Straightforward enough.  Except the researchers asked me to find out how temperature had been obtained – oral, rectal, tympanic membrane – since this affects the reading.  I emailed the contact person, and he didn’t know.  He gave me someone else to talk to, who also didn’t know.  I was never able to hunt down the answer to this fairly simple question, which is pretty problematic.  To the extent possible, data should be thoroughly described, particularly using standardized taxonomies, controlled vocabularies, and formal metadata schemas that will convey the maximum amount of information possible to potential data re-users or other people who have questions about the dataset.
  • Discoverability: when you go into a library, you don’t see a big pile of books just lying around and dig through the pile hoping you’ll find something you can use.  Obviously this would be ridiculous; chances are you’d throw up your hands in dismay and leave before you ever found what you were looking for.  Librarians catalog books, shelve them in a logical order, and put the information into a catalog that you can search and browse in a variety of ways so that you can find just the book you need with a minimal amount of effort.  And why shouldn’t the same be true of data?  One of the services I provide as a research data informationist is assisting researchers in locating datasets that can answer their questions.  I find it to be a very interesting part of my job, but frankly, I don’t think you should have to ask a specialist in order to find a dataset, any more than I think you should have to ask a librarian to go find a book on the shelf for you.  Instead, we need to create “catalogs” that empower users to search existing datasets for themselves.  Databib, which I describe as a repository of repositories, is a good first step in this direction – you can use it to at least hopefully find a data repository that might have the kind of data you’re looking for, but we need to go even further and do a better job of cataloging well-described datasets so researchers can easily find them.
  • Dissemination: sometimes when I ask researchers about data sharing, the look of horror they give me is such that you’d think I’d asked them whether they’d consider giving up their firstborn child.  And to be fair, I can understand why researchers feel a sense of ownership about their data, which they have probably worked very hard to gather.  To be clear, when I talk about dissemination and sharing, I’m not suggesting that everyone upload their data to the internet for all the world to access.  Some datasets have confidential patient information, some have commercial value, some even have biosecurity implications, like H5N1 flu data that a federal advisory committee advised be withheld out of fear of potential bioterrorism.  Making all data available to anyone, anywhere is neither feasible nor advisable.  However, the scientific and academic communities should consider how to increase the incentives and remove the barriers to data sharing where appropriate, such as by creating the kind of data catalogs I described above, raising awareness about appropriate methods for data citation, and rewarding data sharing in the promotion and tenure process.
  • Digital Infrastructure: okay, this is normally called cyberinfrastructure, but I had this whole “words starting with the letter D” thing going and I didn’t want to ruin it. 🙂  If we want to do data sharing properly, we need to build the tools to manage, curate, and search all that data.  This might seem trivial – I mean, if Google can return 168 million web pages about dogs for me in 0.36 seconds, what’s the big deal with searching for data?  I’m not an IT person, so I’m really not the right person to explain the details, but as a case in point, consider the famed Library of Congress Twitter collection.  The Library of Congress announced that they would start collecting everything ever tweeted since Twitter started in 2006.  Cool, huh?  The only problem is, at least as of January 2013, LC couldn’t provide access to the tweets because they lacked the technology to allow such a huge dataset to be searched.  I can confirm that this was true when I contacted them in March or April of 2013 to ask about getting tweets with a specific hashtag that I wanted to use to conduct some research on the sociology of scientific data sharing, and they turned me down for this reason.  Imagine the logistical problems that would arise with even bigger, more complex datasets, like those associated with genome-wide association studies.
  • Data Literacy: Back in my library school days, my first ever library job was at the reference desk at UCLA’s Louise M. Darling Biomedical Library.  My boss, Rikke Ogawa, who trained me to be an awesome medical librarian, emphasized that when people came and asked questions at the reference desk, this was a teachable moment.  Yes, you could just quickly print out the article the person needed because you knew PubMed inside and out, but the better thing to do was turn that swiveling monitor around and show the person how to find the information.  You know, the whole “give a man a fish and he’ll eat for a day, teach a man to fish and he’ll eat for a lifetime” thing.  The same is true of finding, using, and sharing data.  I’m in the process of conducting a survey about data practices at NIH, and almost 80% of the respondents have never had any training in data management.  Think about that for a second.  In one of the world’s most prestigious biomedical research institutions, 80% of people have never been taught how to manage data.  Eighty percent.  If you’re not as appalled by that as I am, well, you should be.  Data cannot be used to its fullest if the next generation of scientists continues with the kind of makeshift, slapdash data practices I often encounter in labs today.  I see the potential for more librarians to take positions like mine, focusing on making data better, but that doesn’t mean that scientists shouldn’t be trained in at least the basics of data management.

So that’s my data sharing manifesto.  What I propose is not the kind of thing that can be accomplished with a few quick changes.  It’s a significant paradigm shift in the way that data are collected and science is practiced.  Change is never easy and rarely embraced right away, but in the end, we’re often better for having challenged ourselves to do better than we’ve been doing.  Personally, I’m thrilled to be an informationist and librarian at this point in history, and I look forward to fondly reminiscing about these days in our data-driven future. 🙂

Why Data Management is Cool (Sort Of)

“She told me the topic was really boring, but that you made it kind of interesting,” the woman said when I asked her to be honest about what our mutual acquaintance had said after attending a class I’d taught on writing a data management plan.  This was not the first time I’d heard something like that.  The fact is, I’m pretty damn passionate and excited about a topic that most people find slightly less boring than watching paint dry: data.  Now, I’m not going to try to convince you that data is not nerdy.  It is.  Very nerdy.   I have never claimed to be cool, and this is probably one of my least cool interests.  However, I think I have some very good reasons for finding data rather interesting.

I remember pretty much the exact moment when I realized the very interesting potential that lives in data.  I was in library school and taking a class in the biomedical engineering department about medical knowledge representation, and we were spending the whole quarter talking about the very complicated issue of representing the clinical data around a very specific disease (glioblastoma multiforme or GBM, a type of brain cancer).  It’s very difficult with this disease, as with many others, to arrange and organize the data about even a single patient in such a way that a clinician can make sense of it.  There’s genetic data, vital signs data, drug dosing data, imaging data, lab report data, doctors’ subjective notes, patients’ subjective reports of their symptoms, and tons of other stuff, and it all shifts and changes over time as the disease progresses or recedes.  Is there any way to build a system that could present this data in any sort of a manageable way to allow a clinician to view meaningful trends that might provide insight into the course of disease that could help improve treatment?  Disappointingly, at least for now, the answer seems to be no, not really.

But the moment that I really knew that I wanted to work with this stuff was when we were talking about personalized medicine and genetic data.  In the case of GBM, as with many other diseases, certain medicines work very well on some patients, but fail almost completely in others.  Many factors could play into this, but there’s likely a large genetic component for why this should be.  Given enough data about the patients in whom these drugs worked and in whom they didn’t, then, could we potentially figure out in advance which drug could help someone?  Extrapolating from that, if we have enough health data about enough different patients, aren’t there endless puzzles we could solve just by examining the patterns that would emerge by getting enough information into a system that could make it comprehensible?

Perhaps that’s oversimplifying it, but I do think it’s fair to conceive of data as pure, unrefined knowledge.  When I look at a dataset, I don’t see a bunch of numbers or some random collection of information.  I imagine what potential lives within that data just waiting to be uncovered by the careful observation of some astute individual or a program that can pick out the patterns that no human could ever catch.  To me, raw data represents the final frontier of wild, untamed knowledge just waiting to be understood and explained, and to someone like me who is really in love with knowledge above all, that’s a pretty damn cool thing.

Yes, I know that writing a data management plan or figuring out what kind of metadata to use for a dataset is pretty boring.  I’m not denying that.  But sometimes you have to do some boring stuff to make cool things happen.  You have to get your oil changed if you want your Bugatti Veyron to do 0 to 60 in 2.5 seconds (I mean, I’m assuming those things have to get oil changes?).  You have to do the math to make sure your flight pattern is right if you want to shoot a rocket into space.  And you can’t find out all the cool secrets that live in your dataset if it’s a messy pile of papers sitting on your desk.  So the way I see it, my job is to make data management as easy and as interesting as possible so that the people who have the data will be able to unlock the secrets that are waiting for them.  So spread the word, my fellow data nerds.  Let’s make data management as cool as regular oral hygiene.  😉

Cool Science: Crowdsourcing Big Data

Anyone who knows me at all knows I really like data.  It’s a tremendously nerdy interest, but I find data really fascinating, I guess in part because I love the idea that there is some great knowledge that’s hidden in the numbers, just waiting for someone to come along and dig it out.  What’s very cool is that we live in an age when technology allows us to generate massive amounts of data.  For example, the Large Hadron Collider generates more than 25 petabytes a year in data, which is roughly 70 terabytes a day.  A DAY.  Some data analysis can be done by computers, but some of it really has to be done by people.  Plus, some studies really rely on the ability to gather data from massive numbers of people in order to get an adequate sample from various groups to prove what you’re trying to show.  To solve these and other “big data” problems, some very smart and cool research groups have jumped on the crowdsourcing bandwagon and are having people from around the world get online and help solve the problems of data gathering and analysis.  Here are some cool projects I’ve heard about.

Eyewire: a group of researchers working on retinal connectomes at MIT found a fascinating way to get people to help with their data analysis – turn it into a game.  They have a good wiki that explains the project in depth, but the gist of it is that these researchers have microscopic scans of neurons from the retina.  Neurons are a huge tangled mess, so their computers could figure out how some of them fit together, but it really takes an actual person to go in and figure out what’s connected and what’s not.  So this team turned it into this 3D puzzle/game thing that’s really hard to explain unless you try it.  You go through a tutorial to learn how to use the system, and then you’re turned loose to start mapping neurons!  It’s not like the most compelling game I’ve ever played or something I’d spend hours doing, but it is interesting, and it helps neuroscience, so that’s pretty cool.

Small World of Words: this study aims to better understand human language and how we subconsciously create networks of associations among words.  To do so, the researchers set up a game to gather word associations from native and non-native English speakers.  Again, I wouldn’t necessarily call this a game in the sense of “woohoo, we’re having so much fun!” but it is kind of interesting to see what your brain comes up with when you’re given a set of random words.  (Plus it’s perhaps a little telling of your own psychological state if you really think about the words you’re coming up with.)  It takes like 2 minutes to do, and again, it’s contributing to science!  Also, according to their website, they are making their dataset publicly available, which as a research informationist/data librarian I wholeheartedly endorse.

Foldit: I haven’t played this yet, so I can’t speak to how fun it is (or boring), but it sounds similar to Eyewire in the sense of being a puzzle in which the players are helping to map a structure – in this case proteins.  Proteins are long chains of amino acids, but they fold up in certain ways that determine their function.  Knowing more about this folding structure makes it possible to create better drugs and understand the pathology of diseases.  For example, one of the things this project is looking at is proteins that are crucial for HIV to replicate itself within the human body.  Better understanding of the structure of these proteins could help contribute to drugs to treat HIV and AIDS.

So I encourage you to go play some games for science!  Do it now!  And if you’re at work and someone tries to stop you, just politely explain that you’re not playing a game – you’re curing AIDS.  🙂

The Librarian’s First Dataset: A Treatise on Incredible Nerdiness

I must preface this post by saying that, if you didn’t know already, I’m a huge nerd.  The biggest.  There’s nothing I’m more passionate about than knowledge and learning, and this has often earned me very perplexed looks from people who probably think I’m crazy.  In this post, I’m going to wax poetic about knowledge and reveal the depths of my geekiness.  However, I’m guessing if you’re here reading this blog, this is probably not going to come as any sort of a surprise to you.

For the last few weeks, I’ve been working on planning a research data management class.  Working with researchers on their data is hands-down my favorite part of my job.  I adore science and the best part of being a medical librarian/research informationist is that I get to work with all different researchers and hear about all sorts of fascinating things.  Sometimes I regret that I didn’t get a science degree, but mostly I’m okay with it because this job allows me to get my hands into all sorts of different things and never have to choose a specialty. Talking to researchers is fascinating.  However, the more I talk to them, the more I realize that a lot of them really have no idea what they’re doing when it comes to data management.  These are brilliant people, to be sure, but the way they handle their data makes me cringe.  They’ve never been trained to do it properly, but as a librarian, I have that training.  Part of what I do is helping people with their data, but I also believe in the adage about giving a man a fish versus teaching him to fish.  I’m one librarian in a huge research enterprise.  As much as I’d like to, there’s no way I could possibly reach everyone to personally help them figure out their data.  So one of the things I decided to do to help mitigate the fact that I can’t be in eight million places at once is to offer a class on research data management.

Because I work in the field of medicine, in which everything must be evidence-based, of course I wasn’t satisfied just to offer a class and hope people liked it.  I am a data librarian, so I decided that I should probably gather some data!  My plan was to devise a pre-test that people would take before the class, then a follow-up post-test.  Obviously the goal was that they wouldn’t know the answers to the questions on the pre-test, and then they would after the class. I spent weeks agonizing over how best to assess this. I’ve had very, very preliminary training in devising assessment instruments, but mostly I was just kind of taking a shot in the dark when I came up with my pre-test. I changed the questions a million times, but I finally came up with something that I thought would probably work.

Today, our office manager sent out the reminder email about tomorrow’s class to those who had RSVP’d.  The email contained a link to the survey and a brief explanation of why I was asking people to complete it.  It was a short survey, took only a couple minutes to complete, but I had this sinking feeling that everyone would ignore it.  Because of IRB (Institutional Review Board) requirements, I had emphasized in the email that people weren’t required to take the survey in order to take the class.  I figured people would see that and just ignore the survey, but I was keeping my fingers crossed.  I was on the train to the airport in San Francisco on my way back to Los Angeles when I saw that the email had gone out.

So now, allow me to set the scene for one of the nerdiest moments of my life.  I had gotten to the airport and had some time to kill before my flight, so I was sitting in a wine bar getting something to eat (and drink of course).  I ordered a glass of Champagne (yeah, that’s how I roll) and pulled out my laptop.  I was logging on when the Champagne arrived.  I pulled up the survey site.  The email had only gone out maybe an hour or so earlier, so I wasn’t expecting any responses yet.  But when I logged on, you know what I found?  Almost EVERY SINGLE PERSON who had registered for the class had taken the survey!  When I saw the number of responses, I made an audible, astonished gasp, and several people in the restaurant turned and looked at me.  I refrained from getting up from my seat and jumping up and down in excitement, though that’s what I would have done if I had been alone. 🙂

Not only did people respond to my survey, but they responded exactly as I hoped they would.  I won’t go into detail here, since obviously I’m going to attempt to publish all of this in a peer-reviewed journal.  🙂  But essentially, these pre-test results reveal that, as I had suspected, these people really need a lot of help with this stuff and don’t have a lot of knowledge of the many awesome resources out there.  Hopefully that will all change tomorrow when I teach this class.

So that is the story of how I came to have my very own research dataset.  This is incredibly heartening for me.  For one thing, I’ve always felt like I really ought to have more hands-on experience working with data if I’m going to teach it.  My dataset is super tiny compared to the datasets I help researchers with, but this is a good start.  More importantly, I am so excited that this actually worked.  I’ve been wanting to move forward with additional research in this area, but I wasn’t entirely sure if it was worthwhile, since I basically only had anecdotal evidence to suggest this kind of thing was needed, and there have been a few naysayers whose words weighed heavily on my mind.  I’ve worked really hard on all of this, and it’s been exhausting, especially with having to work around sort of a crazy travel schedule.  But now it feels like things are all falling into place.  All those little ideas I’ve had floating around in my mind about additional research I’d like to do feel a little more feasible now.  So it’s an exciting time for me career-wise.  Now that I’m a little more assured that I know what I’m doing, I have some good ideas about how to move forward. I’ve got a hunger for data and research now and I need more. 🙂

So yeah, again, probably news to no one, but I’m a huge nerd.  Now, in celebration, I’m going to order a second glass of Champagne to enjoy in the hour before I have to catch my flight.  Cheers!