Who owns science? Some musings on structural functionalism

This week I’ve been reading Sergio Sismondo’s An Introduction to Science and Technology Studies, which has given me a lot to think about in terms of theoretical backgrounds for understanding how science creates knowledge.  In fact, it’s almost given me too much to think about.  There are so many different theoretical bases brought into the mix here, and I can see the relative merits of each, so I find myself wondering how to make sense of it all, but also what it means to adopt a theoretical underpinning as a social scientist.  Is it like a religion, where you accept one and only one dogma, and all parts of it, to the exclusion of all others?  Or is it more like a buffet, where you pick a little bit of the things that seem appealing to you and leave behind the things that don’t catch your eye?  I’m hoping it’s the latter, and I’m going to go on that assumption until the theory police tell me I can’t do it. 🙂  So, on that assumption, here are some ideas I’ve put on my plate from Sismondo’s buffet.

Structural Functionalism and Mertonian Norms

My favorite theoretical framework I picked up here was structural functionalism, and in particular, Robert Merton’s four guiding norms.  Structural functionalism, as I understand it, argues that society is composed of institutional structures that function based on guiding norms and customs.  Merton suggests that science is one such institution, the primary goal of which is “the extension of certified knowledge” (23).  Merton also outlined four norms of behavior that guide scientific practice, suggesting that those who follow them will be rewarded and those who violate them will be punished.  The norms are universalism (that the same criteria should be used to evaluate scientific claims regardless of the race, gender, etc of the person making them), communism (that scientific knowledge belongs to everyone), disinterestedness (that scientists place the good of the scientific community ahead of their own personal gain), and organized skepticism (that the community should not believe new ideas until they have been convincingly proven).

Of those four norms, communism and disinterestedness speak the most to my interest in data sharing and reuse.  Communism seems the most obviously related.  It’s very interesting to think about what parts of science are typically thought to belong to the community and which are thought to be privately owned.  For example, the Supreme Court unanimously ruled in 2015 that human genes could not be patented, a decision that seems in line with Merton’s communism norm.  On the other hand, plenty of scientific ideas can be and are patented.  While many scientific journals are becoming open access and making their articles freely available, many more work on a subscription model, suggesting that the ideas shared within are available for common consumption – if you are willing and able to pay the fee.

Although this example comes from an entirely different realm than science, thinking about these ideas has reminded me of the case of the artist Anish Kapoor, who purchased the exclusive rights to paint with the world’s “blackest black” so that no other artist can use it.  In retaliation, another artist designed the “pinkest pink” paint and made it available for sale – to any artist except Anish Kapoor. While this episode is somewhat entertaining, it does bring up some interesting ideas about ownership in communities that are generally dedicated to the common good.  Art and science are very different, but they’re also quite alike in some ways that are very relevant to the work I’m doing.  They’re both activities carried out by individuals for their own reasons (artistic expression, scientific curiosity) for the common good (to share beauty with the world, to further scientific knowledge).  We are outraged when we hear of a rich artist laying exclusive claim to the raw materials of art so that no one else can use them.  It feels somehow petty, and it also seems like a disservice to not just the art world, but to all of is.  What could others be creating for us if they had access to that black?  I don’t know if we feel that same outrage when we hear of a scientist trying to lay exclusive claim to data.  Of course this isn’t a perfect analogy – a big part of the work of science is gathering or creating the data, which confuses the concept of ownership.  Still, I think there are some interesting ideas here to explore about how scientists think about common ownership of science – not just the ideas, but the data as well.

I started out this entry saying I was going to dip into some other theories – I have some things to say about social constructionism and actor-network theory, but now I’ve spent a long time going on and on about art and science and this is getting a bit long, so I think I’ll stop here for today. 🙂

 

On data, knowledge, and theory

As I’ve mentioned on this blog before, I recently started a PhD program at the University of Maryland’s iSchool, focusing on scientific researchers’ data reuse practices.  There’s a great deal of attention lately on encouraging, and even requiring, researchers to share their data, but less work has been done on how researchers actually make use of that shared data (or if indeed they do at all).  This semester, I’m doing an independent study with my advisor, Dr. Andrea Wiggins, with the aim of better understanding the theoretical background for this problem.  I have the good fortune of working in a job that involves interacting with researchers on data questions on a pretty much daily basis, so I have plenty of opportunity to observe actual practices, but I have less background on theoretical frameworks for contextualizing and understanding why  these things happen, so that’s my goal this semester!  I’ve picked out several readings and am going to write weekly reflections on what I’ve read and thought, and since I have this blog, I figured, why not inflict all this on you, my readers, as well? 🙂

This week I read Paul Davidson Reynolds’ Primer in Theory Construction, which breaks down the research process and explores the scientific method and all its component parts.  It is described as being designed “for those who have already studied one or more of the social, behavioral, or natural sciences, but have no formal introduction to the way theories are constructed, stated, tested, and connected together to form a scientific body of knowledge.”  While I was reading it, I often was thinking to myself, “well, yeah, obviously…” but after I had a little more time to think about it, it occurred to me that it was useful to really stop and think about why research is done the way it is and what we can really determine using data, inference, and logic.

One of the things I was thinking about as I was reading this book was how we make the jump from data to knowledge, and also how to operationalize terms like “data” and “knowledge.”  The NIH’s big data initiative is called Big Data to Knowledge, but what exactly does it mean to translate “big data” to “knowledge”?  How do we define “big data” (as opposed to small data?) and “knowledge”?  Are the ways that big data become knowledge different than the ways non-big data become knowledge?  There are some good definitions of big data, but how do we define “knowledge” in the scientific, and particularly biomedical, realm?

Thinking about how researchers use data by really breaking things down to their most basic level is a little different from how I’ve thought about things before, but actually makes good sense.  I suggest that the barriers to reuse of shared data are:

  • technological: there aren’t good tools for easily getting/reusing the data, or the data are poor quality or hard to find)
  • social: incentive structures of science often do not reward research that reuses data – take a look at the concept of #researchparasites
  • educational: reusing data involves a different skill set that most researchers aren’t taught

However, I never really thought about one of the most fundamental social factors, which is how researchers in a field conceptualize data and how it is transformed into knowledge.  Are there fundamental differences between the data I gather and data someone else gathers and I reuse?  Obviously if I gather my own data, I know more about its context, quality, and provenance.  If I reuse someone’s shared data, I don’t know how careful they were when collecting it, or other important things I might need to know about how the data were collected to be able to reuse them meaningfully.  For example, I once worked with a researcher on locating a clinical dataset for reuse, and once we got the dataset, the researcher asked how patient temperature had been measured – oral, axillary, rectal?  I got back in touch with the original data owner, and they didn’t know – the person who would be able to answer that question had moved on to a new position.  Apparently that mattered to the methods of the researcher I was working with, so they couldn’t use that dataset.  The sorts of things that seem like little minor details can actually make a big difference, but there’s really no way of knowing that unless you know how a research field works with and understands data.

Some things – like knowing how temperature was measured – are probably pretty specific to a narrow field, or even just a particular research method, and it’s probably not possible to know all of the intricacies of the many fields that comprise biomedical research.  However, I think there are also likely other fundamental qualities of data that would apply more broadly across many research fields, and perhaps that would be a useful approach to this question.

 

Can you hack it? On librarian-ing at hackathons

I had the great pleasure of spending the last few days working on a team at the latest NCBI hackathon.  I think this is the sixth hackathon I’ve been involved in, but this is the first time I’ve actually been a participant, i.e. a “hacker.”  Prior to working on these events, I’d heard a little bit about hackathons, mostly in the context of competitive hackathons – a bunch of teams compete against each other to find the “best” solution to some common problem, usually with the winning team receiving some sort of cash prize.  This approach can lead to successful and innovative solutions to problems in a short time frame.  However, the so-called NCBI-style hackathons that I’ve been involved in over the last couple years involve multiple teams each working on their own individual challenge over a period of three days. There are no winners, but in my experience, everyone walks away having accomplished something, and some very promising software products have come out of these hackathons.  For more specifics about the how and why of this kind of hackathon, check out the article I co-authored with several participants and the mastermind behind the hackathons, Ben Busby of NCBI.

As I said, this time was the first hackathon that I’ve actually been involved as a participant on a team, but I’ve had a lot of fun doing some librarian-y type “consulting” for five other hackathons before this, and it’s an experience I can highly recommend for any information professional who is interested in seeing science happen real-time.  There’s something very exciting about watching groups of people from different backgrounds, with different expertise, most of whom have never met each other before, get together on a Monday morning with nothing but an often very vague idea, and end up on Wednesday afternoon with working software that solves a real and significant biomedical research problem.  Not only that, but most of the groups manage to get pretty far along on writing a draft of a paper by that time, and several have gone on to publish those papers, with more on their way out (see the F1000Research Hackathons channel for some good examples).

As motivated and talented as all these hackathon participants are, as you can imagine, it takes a lot of organizational effort and background work to make something like this successful.  A lot of that work needs to be done by someone with a lot of scientific and computing expertise.  However, if you are a librarian who is reading this, I’m here to tell you that there are some really exciting opportunities to be involved with a hackathon, even if you are completely clueless when it comes to writing code.  In the past five hackathons, I’ve sort of functioned as an embedded informationist/librarian, doing things like:

  • basic lit searching for paper introductions and generally locating background information.  These aren’t formal papers that require an extensive or systematic lit review, but it’s useful for a paper to provide some context for why the problem is significant.  The hackers have a ton of work to fit in to three days, so it’s silly to have them spend their limited time on lit searching when a pro librarian can jump in and likely use their expertise to find things more easily anyway
  • manuscript editing and scholarly communication advice.  Anyone who has worked  with co-authors knows that it takes some work to make the paper sound cohesive, and not like five or six people’s papers smushed together.  Having someone like a librarian with editing experience to help make that happen can be really helpful.  Plus, many librarians  have relevant expertise in scholarly publishing, especially useful since hackathon participants are often students and earlier career researchers who haven’t had as much experience with submitting manuscripts.  They can benefit from advice on things like citation management and handling the submission process.  Also, I am a strong believer in having a knowledgeable non-expert read any paper, not just hackathon papers.  Often writers (and I absolutely include myself here) are so deeply immersed in their own work that they make generous assumptions about what readers will know about the topic.  It can be helpful to have someone who hasn’t been involved with the project from the start take a look at the manuscript and point out where additional background or explanation might be beneficial to aiding general understandability.
  • consulting on information seeking behavior and giving user feedback.  Most of the hackathons I’ve worked on have had teams made up of all different types of people – biologists, programmers, sys admins, other types of scientists.  They are all highly experienced and brilliant people, but most have a particular perspective related to their specific subject area, whereas librarians often have a broader perspective based on our interactions with lots of people from various different subject areas.  I often find myself thinking of how other researchers I’ve met might use a tool in other ways, potentially ones the hackathon creators didn’t necessarily intend.  Also, at least at the hackathons I’ve been at, some of the tools have definite use cases for librarians – for example, tools that involve novel ways of searching or visualizing MeSH terms or PubMed results.  Having a librarian on hand to give feedback about how the tool will work can be useful for teams with that kind of a scope.

I think librarians can bring a lot to hackathons, and I’d encourage all hackathon organizers to think about engaging librarians in the process early on.  But it’s not a one-way street – there’s a lot for librarians to gain from getting involved in a hackathon, even tangentially.  For one thing, seeing a project go from idea to reality in three days is interesting and informative.  When I first started working with hackathons, I didn’t have that much coding experience, and I certainly had no idea how software was actually developed.  Even just hanging around hackathons gave me so much of a better understanding, and as an informationist who supports data science, that understanding is very relevant.  Even if you’re not involved in data science per se, if you’re a biomedical librarian who wants to gain a better understanding of the science your users are engaged in, being involved in a hackathon will be a highly educational experience.  I hadn’t really realized how much I had learned by working with hackathons until a librarian friend asked me for some advice on genomic databases. I responded by mentioning how cool it was that ClinVar would tell you about pathogenic variants, including their location and type (insertion, deletion, etc), and my friend was like, what are you even talking about, and that was when it occurred to me that I’ve really learned a lot from hackathons!  And hey, if nothing else, there tends to be pizza at these events, and you can never go wrong with pizza.

I’ll end this post by reiterating that these hackathons aren’t about competing against each other, but there are awards given for certain “exemplary” achievements.  Never one to shy away from a little friendly competition, I hoped I might be honored for some contribution this time around, and I’m pleased to say I was indeed recognized . 🙂

It's true, I'm the absolute worst at darts.
There is a story behind this, but trust me when I say it’s true, I’m the absolute worst at darts.

A Silly Experiment in Quantifying Death (and Doing Better Code)

Doesn’t it seem like a lot of people died in 2016?  Think of all the famous people the world lost this year.  It was around the time that Alan Thicke died a couple weeks ago that I started thinking, this is quite odd; uncanny, even.  Then again, maybe there was really nothing unusual about this year, but because a few very big names passed away relatively young, we were all paying a little more attention to it.  Because I’m a data person, I decided to do a rather silly thing, which was to write an R script that would go out and collect a list of celebrity deaths, clean up the data, and then do some analysis and visualization.

You might wonder why I would spend my limited free time doing this rather silly thing.  For one thing, after I started thinking about celebrity deaths, I really was genuinely curious about whether this year had been especially fatal or if it was just an average year, maybe with some bigger names.  More importantly, this little project was actually a good way to practice a few things I wanted to teach myself.  Probably some of you are just here for the death, so I won’t bore you with a long discussion of my nerdy reasons, but if you’re interested in R, Github, and what I learned from this project that actually made it quite worth while, please do stick around for that after the death discussion!

Part One: Celebrity Deaths!

To do this, I used Wikipedia’s lists of deaths of notable people from 2006 to present. This dataset is very imperfect, for reasons I’ll discuss further, but obviously we’re not being super scientific here, so let’s not worry too much about it. After discarding incomplete data, this left me with 52,185 people.  Here they are on a histogram, by year.

year_plotAs you can see, 2016 does in fact have the most deaths, with 6,640 notable people’s deaths having been recorded as of January 3, 2017. The next closest year is 2014, when 6,479 notable people died, but that’s a full 161 people less than 2016 (which is only a 2% difference, to be fair, but still).  The average number of notable people who died yearly over this 11-year period, was 4,774, and the number of people that died in 2016 alone is 40% higher than that average.  So it’s not just in my head, or yours – more notable people died this year.

Now, before we all start freaking out about this, it should be noted that the higher number of deaths in 2016 may not reflect more people actually dying – it may simply be that more deaths are being recorded on Wikipedia. The fairly steady increase and the relatively low number of deaths reported in 2006 (when Wikipedia was only five years old) suggests that this is probably the case.  I do not in any way consider Wikipedia a definitive source when it comes to vital statistics, but since, as I’ve mentioned, this project was primarily to teach myself some coding lessons, I didn’t bother myself too much about the completeness or veracity of the data.  Besides likely being an incomplete list, there are also some other data problems, which I’ll get to shortly.

By the way, in case you were wondering what the deadliest month is for notable people, it appears to be January:

month_plotObviously a death is sad no matter how old the person was, but part of what seemed to make 2016 extra awful is that many of the people who died seemed relatively young. Are more young celebrities dying in 2016? This boxplot suggests that the answer to that is no:

age_plotThis chart tells us that 2016 is pretty similar to other years in terms of the age at which notable people died. The mean age of death in 2016 was 76.85, which is actually slightly higher than the overall mean of 75.95. The red dots on the chart indicate outliers, basically people who died at an age that’s significantly more or less than the age most people died at in that year. There are 268 in 2016, which is a little more than other years, but not shockingly so.

By the way, you may notice those outliers in 2006 and 2014 where someone died at a very, very old age. I didn’t realize it at first, butWikipedia does include some notable non-humans in their list. One is a famous tree that died in an ice storm at age 125 and the other a tortoise who had allegedly been owned by Charles Darwin, but significantly outlived him, dying at age 176.  Obviously this makes the data and therefore this analysis even more suspect as a true scientific pursuit.  But we had fun, right? 🙂

By the way, since I’m making an effort toward doing more open science (if you want to call this science), you can find all the code for this on my Github repository.  And that leads me into the next part of this…

Part Two: Why Do This?

I’m the kind of person who learns best by doing.  I do (usually) read the documentation for stuff, but it really doesn’t make a whole lot of sense to me until I actually get in there myself and start tinkering around.  I like to experiment when I’m learning code, see what happens if I change this thing or that, so I really learn how and why things work. That’s why, when I needed to learn a few key things, rather than just sitting down and reading a book or the help text, I decided to see if I could make this little death experiment work.

One thing I needed to learn: I’m working with a researcher on a project that involves web scraping, which I had kind of played with a little, but never done in any sort of serious way, so this project seemed like a good way to learn that (and it was).  Another motivator: I’m going to be participating in an NCBI hackathon next week, which I’m super excited about, but I really felt like I needed to beef up my coding skills and get more comfortable with Github.  Frankly, doing command line stuff still makes me squeamish, so in the course of doing this project, I taught myself how to use RStudio’s Github integration, which actually worked pretty well (I got a lot out of Hadley Wickham’s explanation of it).  This death project was fairly inconsequential in and of itself, but since I went to the trouble of learning a lot of stuff to make it work, I feel a lot more prepared to be a contributing member of my hackathon team.

I wrote in my post on the open-ish PhD that I would be more amenable to sharing my code if I didn’t feel as if it were so laughably amateurish.  In the past, when I wrote code, I would just do whatever ridiculous thing popped into my head that I thought my work, because, hey, who was going to see it anyway?  Ever since I wrote that open-ish PhD post, I’ve really approached how I write code differently, on the assumption that someone will look at it (not that I think anyone is really all that interested in my goofy death analysis, but hey, it’s out there in case someone wants to look).

As I wrote this code, I challenged myself to think not just of a way, any way, to do something, but the best, most efficient, and most elegant way.  I learned how to write good functions, for real.  I learned how to use the %>%, (which is a pipe operator, and it’s very awesome).  I challenged myself to avoid using for loops, since those are considered not-so-efficient in R, and I succeeded in this except for one for loop that I couldn’t think of a way to avoid at the time, though I think in retrospect there’s another, more efficient way I could write that part and I’ll probably go back and change it at some point.  In the past, I would write code and be elated if it actually worked.  With this project, I realized I’ve reached a new level, where I now look at code and think, “okay, that worked, but how can I do it better?  Can I do that in one line of code instead of three?  Can I make that more efficient?”

So while this little project might have been somewhat silly, in the end I still think it was a good use of my time because I actually learned a lot and am already starting to use a lot of what I learned in my real work.  Plus, I learned that thing about Darwin’s tortoise, and that really makes the whole thing worth it, doesn’t it?

Practicing What I Preach: The Open PhD Experiment

(Note: this is an adapted version of a final paper I wrote for one of my classes. That’s why it’s so long!)

A few weeks ago, a researcher called my office to see if we could meet to discuss our shared interest in open data. I agreed, and a week later we were sitting in my office having a lively discussion about the many problems that currently hinder more widespread data sharing and reuse in biomedical research. When I mentioned that these topics would be the focus of my doctoral dissertation work, he expressed an interested in seeing some of my research. I replied that it was only my first semester, so I didn’t have much yet, but that I’d published a few papers on my previous research. “I don’t mean papers,” he said. “I mean your data, your code. If you’re doing a PhD on data sharing, don’t you think you should share your data, too?  In fact, why don’t you do an open PhD?”

Perhaps I should have immediately replied, “you’re absolutely right. I will do an open PhD.”  After all, on the face of it, this suggestion seems perfectly reasonable. My research, and in fact my entire career, revolves around the premise that researchers should share their data. It should be a no-brainer that I would also share my data. In principle, I have no problem with agreeing to do so, but in the real world of research, lofty ideals like service to the community and furthering science are sometimes abandoned in favor of more practical concerns, like getting one’s paper accepted or finishing one’s dissertation before other people have a chance to capitalize on the data.

So what I ended up telling this researcher was that I found his suggestion intriguing and I’d give it some serious thought. I have done just that in the intervening weeks, and here I will reflect on the reasons for my hesitation and explore the levels of openness I am prepared to take on in my doctoral program and my academic career.

My first (mis)adventure with data sharing

The first – and as yet only – time I shared my data was when I submitted an article to PLOS in 2014. PLOS was one of the first publishers to adopt an open data policy that required researchers to share the data underlying their manuscripts. I dutifully submitted my data to figshare, a popular, discipline-agnostic data repository, with the title “Biomedical Data Sharing and Reuse: Attitudes and Practices of Clinical and Scientific Research Staff.” To my surprise, someone at figshare took notice of my upload and tweeted out a link to my dataset. I could have sworn that I’d checked the box to keep the data private until I opted to officially release them, but when I’d gone back to fix a minor mistake in the title of the submission, the box must have gotten unchecked, and the status was changed to public.

After the tweet went out, I could see from the “views” counter that people were already looking at the data. Someone retweeted the link to the data, then another person, and another. The paper hadn’t even been reviewed by anyone yet, much less accepted for publication, but my data were out there for anyone to see, with the link spreading across Twitter. The situation made me nervous. I was excited that people were interested in my data, but what were they doing with it?  The views counter ticked up steadily, and people were not just viewing, but actually downloading the dataset as well.

I finally received word from PLOS that they’d accepted the paper, but they asked for major revisions; Reviewer 2 (it’s always Reviewer 2) was niggling over my statistical methods, and I was going to have to redo much of my work to respond to all the revision requests. During the revision process, I received an email from someone I’d never heard of, from an Eastern European country I can’t now recall. She had seen my data on figshare and she, too, wanted to write a paper on this topic. She asked me to send her a copy of my still-in-process paper, as well as a list of all relevant references I had found. The audacity of her request shocked me. Here was someone I’d never even met, telling me she wanted to use my data, write essentially the same paper as me, and she wanted me to give her my background research as well?  I wrote an email back, politely but firmly rebuffing her request, and I never heard from her again.

In the end, everything went fine: the paper was published and it has gone on to be cited seven times and featured in PLOS’s new Open Data collection (PLOS Collections 2016). I do still believe that researchers, particularly those whose work is supported by taxpayers’ money, have a responsibility to share their data when doing so will not violate their human subjects’ privacy. However, my own experience demonstrated to me that sharing research data cannot be viewed as a black and white proposition, that you share and are “good,” or you don’t and you are “bad.” Rather, many researchers have real, valid concerns about how they share their data, when, and with whom. Though my reasons probably differ from those of many other researchers, I have my own concerns that give me pause when it comes to the idea of an “open PhD.”

  1. I don’t think my data would be useful or interesting to anyone else.

Some datasets have near infinite value, with uses that extend far beyond the expertise or disciplinary affiliation of their original collector. New computational methodologies and analytic techniques make it possible to uncover previously undetected meaning in datasets or “mash up” disparate datasets to detect novel connections between seemingly unrelated phenomena. The ability to quickly, easily, and cheaply share massive amounts of data means that researchers around the world are able to make life-saving discoveries. For example, the National Cancer Institute’s Cancer Genomics Cloud Pilot program allows researchers to connect to cancer genome data and perform complex analyses on cloud computing platforms more powerful than any computers they could buy for their lab (National Cancer Institute Center for Biomedical Informatics & Information Technology 2016). Projects like this are exciting – they could bring about cures for cancer and vastly improve our lives. Few people would argue that sharing these kinds of datasets is important.

By comparison, my data just look silly. Personally, I find my research fascinating. I could spend hours talking about biomedical scientists’ research data sharing and reuse practices. However, I don’t flatter myself that others are clamoring to see all the thrilling survey data and titillating interview transcriptions I have collected. Beyond validating the results in my article, I see little value for these data. Of course, I have made the argument that data can have unexpected uses that their original collectors could never have imagined, so I am prepared to admit that my data may have usefulness beyond what I would expect. Perhaps I should take the 252 views and 37 downloads of my figshare dataset as evidence that my data are of interest to more people than I might expect.

  1. I’m often embarrassed by my amateurish ways.

I’m a fan of GitHub, a site where you can share your code and allow others to collaboratively contribute to your work, but I’m also terrified of it. I spend a very significant amount of time at my job working with R, my programming language of choice; I teach it, I consult on it, and I use it for my own research. I like to think I know what I’m doing, but in all honesty, I’m pretty much entirely self-taught in R and, though I’m a quick study, I haven’t been using it for that long. I am far from an expert, and I often write code that makes this fact obvious.

Recently I wrote some R code related to a research project I hope to submit for publication soon. The work involved downloading the full text of over 60,000 articles, but since the server’s interface only allowed downloading a thousand articles at a time, I needed to write code that would download the allowed amount, then repeat itself 60 times, updating the article numbers after each iteration. I spent hours trying to figure out the best way to do it, but everything I tried failed. I could download a hundred at a time, then manually update the numbers in the code and re-run it, but doing this 60 times would have been time-consuming. In a throwing-up-your-hands moment of frustration, I wrote a command that would essentially just write those 60 lines of code for me, then ran all 60 lines.

Frankly, this approach was idiotic. Anyone who knows the first thing about programming would scoff at my code, and rightly so. However, at the time, this slipshod approach was the best I could come up with. It’s not just code that may reveal that I don’t always know what I’m doing; the more open the research process, the more opportunity for others to see the unpolished, imperfect steps that lie beneath the shiny surface of the perfected, word-smithed article.

  1. It takes time to prepare data for broader consumption.

When I teach data management classes for researchers, I emphasize how good data management practices will make submitting their data at the end of the process easy, practically effortless. Of course, having your data perfectly ready to share without any extra effort at the end of your project is about as likely as a jumping out of bed and looking good enough to head off to work without taking any time to freshen up. For example, part of my to-do list for preparing the article for the project I described above for publication is figuring out how to actually write that code the right way, so I can share it without fear of being humiliated. Getting my data, code, writing, or any other scholarly output I produce into the kind of shape it would need to be for me to be willing to put my name on it takes time. When I’m already trying to manage a demanding full-time job with a doctoral program and somehow still find the time to enjoy some sort of leisure every now and then, polishing up something to get it ready for sharing doesn’t often take enough priority to make it onto my daily schedule.

A compromise: the open-ish PhD

Though I’ve just spent five pages expounding on the reasons I cannot do a fully open PhD, I am prepared to compromise. The ideal the researcher urged me toward in our original conversation – don’t wait for your dissertation, share your data now, get your code up on GitHub today! – may not be right for me, but I do believe it is feasible to find some way to share at least some of scholarly output, if not in real-time, than at least in a timely fashion. Therefore, I propose the following tenets of my open-ish PhD:

  • I will do my best to write code that I am reasonably proud of (or at least not actively ashamed of) and share it on GitHub. While I do not feel comfortable immediately sharing code that corresponds to projects I am actively pursuing and seeking to publish, I will at least share it upon publication. I will also share teaching-related code immediately on GitHub, especially since doing so provides a good model for the researchers I am teaching.
  • I will make a more concerted effort to share my scholarly writing not just in its final, polished form as journal articles, but also in more casual settings, such as on my blog. I am also interested in exploring pre-print servers like arXiv and bioRxiv as a means of more rapid dissemination of research findings in advance of formal journal article publication.
  • I will attempt to collect data in a more mindful and intentional way, recognizing that I am not simply collecting my data, but that the point of my efforts are to inform others in my scholarly and research communities. As a federal employee, the work that I conduct in my official capacity cannot be copyrighted because it belongs not to me, but to all the American people who pay my salary. As I go forward with my research, I will do my best to remember that I am doing it not merely to satisfy my curiosity or add to my CV, but to advance science, even in my own small way.

In the end, it probably doesn’t matter so much whether the final data I share are perfect, whether my code impresses other people with its efficiency and elegance, or whether something I write appears in Nature or on my little blog. What matters is making the effort to share, committing to the highest level of openness possible, and doing so publicly and visibly – essentially, leading by example. I can give lectures on the importance of data sharing and teach classes on open source tools until I’m blue in the face, but perhaps the most important thing I can do to convince researchers of the importance of sharing and reusing data is doing exactly that myself.

In defense of the live demo (despite its perils)

rstudio-bomb

When RStudio crashes, it is not subtle about it.  You get a picture of an old-timey bomb and the succinct, blunt message “R encountered a fatal error.”  A couple hundred of my librarian friends and colleagues got to see it live during the demo I gave as part of a webinar I did for the Medical Library Association on R for librarians earlier today.  At first, I thought the problem was minor.  When I tried to read in my data, I got this error message:

Error in file(file, “rt”) : cannot open the connection
In addition: Warning message:
In file(file, “rt”) :
cannot open file ‘lib_data_example.csv’: No such file or directory

It’s a good example of R’s somewhat opaque and not-super-helpful error messages, but I’ve seen it before and it’s not a big deal.  It just meant that R couldn’t find the file I’d asked for.  Most of the time it’s because you’ve spelled the file name wrong, or you’ve capitalized something that should be lower case.  I double checked the file name against the cheat sheet I’d printed out with all my code.  Nope, the file name was correct.  Another likely cause is that you’re in the wrong directory and you just need to set the working directory to where the file is located.  I checked that too – my working directory was indeed set to where my file should have been.  That was when RStudio crashed, though I’m still not sure exactly why that happened.  I assume RStudio did it just to mess with me.  🙂

I’m sure a lot of presenters would be pretty alarmed at this point, but I was actually quite amused.  People on Twitter seemed to notice:

Having your live demo crash is not very entertaining in and of itself, but I found the situation rather amusing because I had considered whether I should do a live demo and decided to go with it because it seemed so low risk.  What could go wrong?  Sure, live demos are unpredictable.  Websites go down, databases change their interface without warning (invariably they do this five minutes before your demo starts), software crashes, and so on. Still, the demo I was doing was really quite simple compared to a lot of the R I normally teach, and it involved using an interface I literally use almost every day.   I’ve had plenty of presentations go awry in the past, but this was one that I really thought had almost 0% chance of going wrong.  So when it all went wrong on the very first line of code, I couldn’t help but laugh.  It’s the live demo curse!  You can’t escape!

I’m sure most people who have spent any significant amount of doing live demos of technology have had the experience of seeing the whole thing blow up.  I know a lot of librarians who avoid the whole thing by making slides with screen shots of what they would show and do sort of a mock demo.  There’s nothing wrong with that, and I can understand the inclination to remove the uncertainty of the live demo from the equation.  But despite their being fraught with potential issues, I’m still in favor of live demos – and in a sense, I feel this way exactly because of their unpredicability.

For one thing, it’s helpful for learners to see how an experienced user thinks through the process of troubleshooting when something goes wrong.  It’s just a fact that stuff doesn’t always work perfectly in real life.  If the people I’m teaching are ever actually going to use the tools I’m demonstrating, eventually they’re going to run into some problems.  They’re more likely to be able to solve those problems if they’ve had a chance to see someone work through whatever issues arise.  This is true for many different types of technologies and information resources, but especially so with programming languages.  Learning to troubleshoot is itself an essential skill in programming, and what better way to learn than to see it in action?

Secondly, for brand new users of a technology, watching an instructor give a flawless and apparently effortless demonstration can actually make mastery feel out of reach for them.  In reality, a lot of time and effort likely went into developing that demo, trying out lots of different approaches, seeing what works well and what doesn’t, and arriving at the “perfect” final demo.  I’m certainly not suggesting that instructors should do freewheeling demos with no prior planning whatsoever, but I am in favor of an approach that acknowledges that things don’t always go right the first time.  When I learned R, I would watch  tutorials by these incredibly smart and talented instructors and think, oh my gosh, they make this look so easy and I’m totally lost – I’m never going to understand how this works.  Obviously I don’t want to look like an unprepared and incompetent fool in front of a class, but hey, things don’t always go perfectly.  I’m human, you’re human, we’re all going to make mistakes, but that’s part of learning, so let’s talk about what went wrong and how we fix it.

By the way, in case you’re wondering what did actually go wrong in this instance, I had inadvertently moved the data file in the process of uploading it to my Github repo – I thought I’d made a copy, but I had actually moved the original.  I quickly realized what had happened, and I knew roughly where I’d put the file, but it was in some folder buried deep in my file structure that I wouldn’t be able to locate easily on the spot.  The quickest solution I could think of, which I quickly did off-screen from the webinar (thank you dual monitors) was to copy the data from the repo, paste it into a new CSV and quickly save it where the original file should have been.  It worked fine and the demo went off as planned after that.

Day One

Many years ago, when I got my first bachelor’s degree, my parents gave me a hard time because I didn’t want to walk at graduation.  It seemed like kind of a pain in the ass, to be honest.  A long ceremony, watching and waiting as lots of people I’d never met or even heard of got their diplomas, and then like 30 seconds of glory as my name was announced and I crossed the stage – collecting a blank piece of paper, because my actual diploma would be sent in the mail later, when it was confirmed I’d actually met the requirements.  Oh, and also, I’d have to pay a few hundred dollars to rent the graduation garb.  RENT it.  I didn’t even get to keep it!  Seriously?

In the end, I did not walk for my bachelor’s degree.  On the day I would have been at that ceremony, my mom and I were celebrating my graduation with a trip to New York City, which was quite fun and meant way more to me than any graduation ceremony could have.  I didn’t walk when I got my second bachelor’s either, or any of my master’s degrees.  I say this as humbly as possible: I have more formal education than most people I know, but the only graduation ceremony I’ve ever been to was for my high school diploma.

The fact is, I’ve been holding out.  I wanted the PhD.  I would walk for that.  I’ll admit, partly I wanted the incredible hat.  🙂  I mean, come on, look at the garb you get to wear when you get your Phd.  I will gladly pay hundreds of dollars to rent that!

How styling is that regalia?
How styling is that regalia?

And finally, finally, things came together, and tonight, I found myself sitting in my very first ever doctoral class.  I’m a PhD student!  I can hardly believe it.  I know I kept grinning like a crazy person throughout my first class.   I’m doing this degree part-time while I continue to work at my job full-time (for which I am tremendously grateful both to the iSchool at UMD for accepting part time PhD students and to my library, my director, and my boss for their absolutely incredible support for this undertaking). I know that’s going to be hard work, but I think I’m up for it.  And more than that, I think it’s worth it.

When I went to the PhD program orientation recently, our cohort of brand new students had some time to spend with the continuing PhD students in a no-holds barred, tell-all session.  While it was kind of a “what’s said in this room stays in this room” kind of session, one thing that I think I can safely say came out of it was the idea that you should get in the habit of writing every day.  I agree this is a good habit to get into, and I’m going to try to make sure that the writing I do translates to some real output – maybe work toward a paper, or maybe just something I say on this blog.  Obviously I won’t write here on the blog every day, but I’ll try to document this process, hopefully share some interesting things.

I also want to post this entry now because I’m sure I’ll look back on this several years from now when it’s all over, and maybe I’ll laugh, or cry, or just think how naive I was.  But for now, let’s just say, starting with Day One, I am thrilled, excited, honored, and lucky, and I am so thankful to all of the people who have helped me arrive at this place.

Now let’s earn that incredible 8-pointed PhD hat!

Who Am I? The Identity Crisis of the Librarian/Informationist/Data Scientist

More and more lately, I’m asked the question “what do you do?” This is a surprisingly difficult question to answer.  Often, how I answer depends on who’s asking – is it someone who really cares or needs to know? – and how much detail I feel like going to at the moment when I’m asked.  When I’m asked at conferences, as I was quite a bit at FORCE2016, I tried to be as explanatory as possible without getting pedantic, boring, or long-winded.  My answer in those scenarios goes something like “I’m a data librarian – I do a lot of instruction on data science, like R and data visualization, and data management.”  When I’m asked in more social contexts, I hardly even bother explaining.  Depending on my mood and the person who’s asking, I’ll usually say something like data scientist, medical librarian, or, if I really don’t feel like talking about it, just librarian.  It’s hard to know how to describe yourself when you have a job title that is pretty obscure: Research Data Informationist.  I would venture to guess that 99% of my family, friends, and even work colleagues have little to no idea what I actually spend my days doing.

In some regards, that’s fine.  Does it really matter if my mom and dad know what it means that I’ve taught hundreds of scientists R? Not really (they’re still really proud, though!).  Do I care if my date has a clear understanding of what a data librarian does?  Not really.  Do I care if a random person I happen to chat with while I’m watching a hockey game at my local gets the nuances of the informationist profession?  Absolutely not.

On the other hand, there are often times that I wish I had a somewhat more scrutable job title.  When I’m talking to researchers at my institution, I want them to know what I do because I want them to know when to ask me for help.  I want them to know that the library has someone like me who can help with their data science questions, their data management needs, and so on.  I know it’s not natural to think “library” when the question is “how do I get help with finding data” or “I need to learn R and don’t know where to start” or “I’d like to create a data visualization but I have no idea how to do it” or any of the other myriad data-related issues I or my colleagues could address.

The “informationist” term is one that has a clear definition and a history within the realm of medical librarianship, but I feel like it has almost no meaning outside of our own field.  I can’t even count the number of weird variations I’ve heard on that title – informaticist, informationalist, informatist, and many more.  It would be nice to get to the point that researchers understood what an informationist is and how we can help them in their work, but I just don’t see that happening in the near future.

So what do we do to make our contributions and expertise and status as potential collaborators known?  What term can we call ourselves to make our role clear?  Librarian doesn’t really do it, because I think people have a very stereotypical and not at all correct view of what librarians do, and it doesn’t capture the data informationist role at all.  Informationist doesn’t do it, because no one has any clue what that means.  I’ve toyed with calling myself a data scientist, and though I do think that label fits, I have some reservations about using that title, probably mostly driven by a terrible case of imposter syndrome.

What’s in a name?  A lot, I think.  How can data librarians, informationists, library-based data scientists, whatever you want to call us, communicate our role, our expertise, our services, to our user communities?  Is there a better term for people who are doing this type of work?

Some ponderings on #force2016 and open data

I’m attending FORCE2016, which is my first FORCE11 conference after following this movement (or group?) for awhile and I have to say, this is one interesting, thought-provoking conference.  I haven’t been blogging in awhile, but I felt inspired to get a few thoughts down after the first day of FORCE2016:

  • I love the interdisciplinarity of this conference, and to me, that’s what makes it a great conference to attend.  In our “swag bag,” we were all given a “passport” and could earn extra tickets for getting signatures of attendees from different disciplines and geographic locations.  While free drinks are of course a great incentive, I think the fact that we have so many diverse attendees at this conference is a draw on its own.  I love that we are getting researchers, funders, publishers, librarians, and so many other stakeholders at the table, and I can’t think of another conference where I’ve seen this many different types of people from this many countries getting involved in the conversatioon.
  • I actually really love that there are so few concurrent sessions.  Obviously, fewer concurrent sessions means fewer voices joining the official conversation, but I think this is a small enough conference that there are ways to be involved, active, and vocal without necessarily being an invited speaker.  While I love big conferences like MLA, I always feel pulled in a million different directions – sometimes literally, like last year when I was scheduled to present papers at two different sessions during the same time period.  I feel more engaged at a conference when I’m seeing mostly the same content as others.  We’re all on the same page and we can have better conversations.  I also feel more engaged in the Twitter stream.  I’m not trying to follow five, ten, or more tweet streams at once from multiple sessions.  Instead, I’m seeing lots of different perspectives and ideas and feedback on one single session.  I like us all being on the same page.

Now, those are some positives, but I do have to bring it down with one negative from this conference, and that is that I think it’s hard to constructively talk about how to encourage sharing and open science when you have a whole conference full of open science advocates.  I do not in any way want to disparage anyone because I have a lot of respect for many of the participants in the session I’m talking about, but I was a little disappointed in the final session today on data management.  I loved the idea of an interactive session (plus I heard there would be balloons and chocolate, so, yeah!) and also the idea of debate on topics in data sharing and management, since that’s my jam.  I did debate in high school, so I can recognize the difficulty but also the usefulness of having to argue for a position with which you strongly disagree.  There’s real value in spending some time thinking about why people hold positions that are in opposition of your strongly held position.  And yeah, this was the last session of a long day, and it was fun, and it had popping of balloons, and apparently some chocolate, and whatnot, but I am a little disappointed at what I see as a real missed opportunity to spend some time really discussing how we can address some of the arguments against data sharing and data management.  Sure, we all laughed at the straw men that were being thrown out there by the teams who were being called upon to argue in favor of something that they (and all of us, as open science advocates) strongly disagreed with.  But I think we really lost an opportunity to spend some time giving serious thought to some of the real issues that researchers who are not open science advocates actually raise.  Someone in that session mentioned the open data excuses bingo page (you can find it here if you haven’t seen it before).  Again, funny, but SERIOUSLY I have actually have real researchers say ALL of these things, except for the thing about terrorists.  I will reiterate that I know and respect a lot of people involved with that session and I’m not trying to disparage them in any way, but I do hope we can give some real thought to some of the issues that were brought up in jest today.  Some of these excuses, or complaints, or whatever, are actual, strongly-held beliefs of many, many researchers.  The burden is on us, as open science advocates, to demonstrate why data sharing, data management, and the like are tenable positions and in fact the “correct” choice.

Okay, off my soap box!  I’m really enjoying this conference, having a great time reconnecting with people I’ve not seen in years, and making new connections.  And Portland!  What a great city. 🙂

To keep or not to keep: that is the question

I recently read an article in The Atlantic about people who are compulsive declutterers – the opposite of hoarders – who feel compelled to get rid of all their possessions. I’m more on the side of hoarding, because I always find myself thinking of eventualities in which I might need the item in question.  Indeed, it has often been the case that I will think of something I got rid of weeks or even years later and wish I still had it: a book I would have liked to reference, a piece of clothing I would have liked to wear, a receipt I could have used to take something back.  Of course, I don’t have unlimited storage space, so I can’t keep all this stuff.  The question of what to keep and for how long is one that librarians think about when it comes to weeding: deciding which parts of the collection to deaccession, or basically, get rid of.  There are evidence-based, tried-and-true ways of thinking about weeding a library collection, but that’s not so much true when it comes to data.  How is a scientist to decide what to keep and what not to keep?

I know this is a question that researchers are thinking about quite a bit, because I get more emails about this than almost any other issue.  In fact, I get emails not only from users of my own library, but researchers from all over the country who have somehow found my name.  What exactly do I need to keep?  If I have electronic records, do I need to keep a print copy as well?  How many years do I need to keep this stuff?  These are all very reasonable questions that it would be nice to say, yes, there is an answer and it is….! but it’s almost never so easy to point to a single answer.

A case in point: a couple years ago, I decided to teach a class about data preservation and retention.  In my naivete, I thought it would be nice to take a look through all the relevant policy and find the specific number of years that research data is required to be retained.  I read handbooks and guides.  I read policy documents from various agencies.   I even read the U.S. Code (I do not recommend it).  At the end of it, I found that not only is there not a single, definitive, policy answer to how long funded research data should be retained, but there are in fact all sorts of contradictory suggestions.  I found documents giving times from 3 years to 7 years to the super-helpful “as long as necessary.”

This may be difficult to answer from a policy perspective, but I think answering this from a best practices perspective is even trickier.  Let’s agree that we just can’t keep everything – storing data isn’t free, and it takes considerable time and effort to ensure that data remain accessible and usable.  Assuming that some stuff has to get thrown away, how do we distinguish trash from treasure, especially given the old adage about how the former might be the latter to others?  It’s hard to know whether something that appears useless now might actually be useful and interesting to someone in the future.  To take this to the extreme, here’s an actual example from a researcher I’ve worked with: he asked how he could have his program automatically discard everything in the thousandth place from his measurements.  In other words, he wanted 4.254 to be saved as 4.25.  I told him I could show him how, but I asked why he wanted to do this.  He told me that his machine was capable of measuring to the thousandth, but the measurement was only scientifically relevant to the hundredth place.  To scientists right now, 4.254 and 4.252 were essentially indistinguishable, so why bother with the extra noise of the thousandth place?  Fair point, but what about 5 years from now, or 10 years from now?  If science evolves to the point that this extra level of precision is meaningful, tomorrow’s researchers will probably be a little annoyed that today’s researchers had that measurement and just threw it away.  But then again, how can we know now when, or even if, that level of precision will be wanted?  For that matter, we can’t even say for sure whether this dataset will be useful at all.  Maybe a new and better method for making this measurement will be developed tomorrow, and all this stuff we gathered today will be irrelevant.  But how can we know?

These are all questions that I think are not easy to answer right now, but that people within research communities should be thinking about.  For one thing, I don’t think we can give one simple answer to how long data should be retained.  For one type of research, a few years may be enough.  For other fields, where it’s harder to replicate data, maybe we need to keep it in perpetuity.  When it comes to deciding what should be retained and what should be discarded, I think that answers cannot be dictated by one-size-fits-all policies and that subject matter experts and information professionals should work together to figure out good answers for specific communities and specific data.  Eventually, I suppose we’ll probably have some of those well-defined best practices for data retention in the same way that we have those best practices from collection management in libraries.  Until then, keep your crystal balls handy. 🙂