Can you hack it? On librarian-ing at hackathons

I had the great pleasure of spending the last few days working on a team at the latest NCBI hackathon.  I think this is the sixth hackathon I’ve been involved in, but this is the first time I’ve actually been a participant, i.e. a “hacker.”  Prior to working on these events, I’d heard a little bit about hackathons, mostly in the context of competitive hackathons – a bunch of teams compete against each other to find the “best” solution to some common problem, usually with the winning team receiving some sort of cash prize.  This approach can lead to successful and innovative solutions to problems in a short time frame.  However, the so-called NCBI-style hackathons that I’ve been involved in over the last couple years involve multiple teams each working on their own individual challenge over a period of three days. There are no winners, but in my experience, everyone walks away having accomplished something, and some very promising software products have come out of these hackathons.  For more specifics about the how and why of this kind of hackathon, check out the article I co-authored with several participants and the mastermind behind the hackathons, Ben Busby of NCBI.

As I said, this was the first hackathon where I've actually been involved as a participant on a team, but I've had a lot of fun doing some librarian-y type "consulting" for five other hackathons before this, and it's an experience I can highly recommend for any information professional who is interested in seeing science happen in real time.  There's something very exciting about watching groups of people from different backgrounds, with different expertise, most of whom have never met each other before, get together on a Monday morning with nothing but an often very vague idea, and end up on Wednesday afternoon with working software that solves a real and significant biomedical research problem.  Not only that, but most of the groups manage to get pretty far along on writing a draft of a paper by that time, and several have gone on to publish those papers, with more on their way out (see the F1000Research Hackathons channel for some good examples).

As motivated and talented as all these hackathon participants are, as you can imagine, it takes a great deal of organizational effort and background work to make something like this successful, and much of that work needs to be done by someone with deep scientific and computing expertise.  However, if you are a librarian reading this, I'm here to tell you that there are some really exciting opportunities to be involved with a hackathon, even if you are completely clueless when it comes to writing code.  In the past five hackathons, I've sort of functioned as an embedded informationist/librarian, doing things like:

  • basic lit searching for paper introductions and generally locating background information.  These aren't formal papers that require an extensive or systematic lit review, but it's useful for a paper to provide some context for why the problem is significant.  The hackers have a ton of work to fit into three days, so it's silly to have them spend their limited time on lit searching when a pro librarian can jump in and likely use their expertise to find things more easily anyway.
  • manuscript editing and scholarly communication advice.  Anyone who has worked with co-authors knows that it takes some work to make a paper sound cohesive, and not like five or six people's papers smushed together.  Having someone like a librarian with editing experience on hand can really help make that happen.  Plus, many librarians have relevant expertise in scholarly publishing, which is especially useful since hackathon participants are often students and early career researchers who haven't had much experience with submitting manuscripts.  They can benefit from advice on things like citation management and handling the submission process.  Also, I am a strong believer in having a knowledgeable non-expert read any paper, not just hackathon papers.  Often writers (and I absolutely include myself here) are so deeply immersed in their own work that they make generous assumptions about what readers will know about the topic.  It can be helpful to have someone who hasn't been involved with the project from the start take a look at the manuscript and point out where additional background or explanation would aid general understanding.
  • consulting on information-seeking behavior and giving user feedback.  Most of the hackathons I've worked on have had teams made up of all different types of people – biologists, programmers, sys admins, other kinds of scientists.  They are all highly experienced and brilliant people, but most have a particular perspective tied to their specific subject area, whereas librarians often have a broader perspective based on our interactions with people from many different subject areas.  I often find myself thinking of how other researchers I've met might use a tool in ways the hackathon creators didn't necessarily intend.  Also, at least at the hackathons I've been at, some of the tools have definite use cases for librarians – for example, tools that involve novel ways of searching or visualizing MeSH terms or PubMed results.  Having a librarian on hand to give feedback about how the tool will work can be useful for teams with that kind of scope.

I think librarians can bring a lot to hackathons, and I'd encourage all hackathon organizers to think about engaging librarians in the process early on.  But it's not a one-way street – there's a lot for librarians to gain from getting involved in a hackathon, even tangentially.  For one thing, seeing a project go from idea to reality in three days is interesting and informative.  When I first started working with hackathons, I didn't have much coding experience, and I certainly had no idea how software was actually developed.  Even just hanging around hackathons gave me a much better understanding, and as an informationist who supports data science, that understanding is very relevant.  Even if you're not involved in data science per se, if you're a biomedical librarian who wants to gain a better understanding of the science your users are engaged in, being involved in a hackathon will be a highly educational experience.  I hadn't really realized how much I had learned by working with hackathons until a librarian friend asked me for some advice on genomic databases. I responded by mentioning how cool it was that ClinVar would tell you about pathogenic variants, including their location and type (insertion, deletion, etc.), and my friend was like, what are you even talking about, and that was when it occurred to me that I've really learned a lot from hackathons!  And hey, if nothing else, there tends to be pizza at these events, and you can never go wrong with pizza.

I’ll end this post by reiterating that these hackathons aren’t about competing against each other, but there are awards given for certain “exemplary” achievements.  Never one to shy away from a little friendly competition, I hoped I might be honored for some contribution this time around, and I’m pleased to say I was indeed recognized . 🙂

[Image caption: "It's true, I'm the absolute worst at darts."]

There is a story behind this, but trust me when I say it’s true, I’m the absolute worst at darts.

A Silly Experiment in Quantifying Death (and Doing Better Code)

Doesn’t it seem like a lot of people died in 2016?  Think of all the famous people the world lost this year.  It was around the time that Alan Thicke died a couple weeks ago that I started thinking, this is quite odd; uncanny, even.  Then again, maybe there was really nothing unusual about this year, but because a few very big names passed away relatively young, we were all paying a little more attention to it.  Because I’m a data person, I decided to do a rather silly thing, which was to write an R script that would go out and collect a list of celebrity deaths, clean up the data, and then do some analysis and visualization.

You might wonder why I would spend my limited free time doing this rather silly thing.  For one thing, after I started thinking about celebrity deaths, I really was genuinely curious about whether this year had been especially fatal or whether it was just an average year, maybe with some bigger names.  More importantly, this little project was actually a good way to practice a few things I wanted to teach myself.  Probably some of you are just here for the death, so I won't bore you with a long discussion of my nerdy reasons, but if you're interested in R, Github, and what I learned from this project that actually made it quite worthwhile, please do stick around for that after the death discussion!

Part One: Celebrity Deaths!

To do this, I used Wikipedia’s lists of deaths of notable people from 2006 to present. This dataset is very imperfect, for reasons I’ll discuss further, but obviously we’re not being super scientific here, so let’s not worry too much about it. After discarding incomplete data, this left me with 52,185 people.  Here they are on a histogram, by year.

Here they are on a histogram, by year:

[year_plot: notable deaths per year, 2006-2016]

As you can see, 2016 does in fact have the most deaths, with 6,640 notable people's deaths recorded as of January 3, 2017. The next closest year is 2014, when 6,479 notable people died, but that's a full 161 people fewer than 2016 (which is only about a 2% difference, to be fair, but still).  The average number of notable people who died yearly over this 11-year period was 4,774, and the number of people who died in 2016 alone is nearly 40% higher than that average.  So it's not just in my head, or yours – more notable people died this year.
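For anyone who wants to see how a chart like that gets made, here's a minimal ggplot2 sketch. It assumes the cleaned-up data lives in a data frame called deaths with one row per person and a year column (those names are just for illustration, not necessarily what's in my script).

```r
library(dplyr)
library(ggplot2)

# Count the notable deaths recorded for each year and plot the counts
deaths %>%
  count(year) %>%
  ggplot(aes(x = factor(year), y = n)) +
  geom_col() +
  labs(x = "Year",
       y = "Notable deaths recorded on Wikipedia",
       title = "Notable deaths by year, 2006-2016")
```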

Now, before we all start freaking out about this, it should be noted that the higher number of deaths in 2016 may not reflect more people actually dying – it may simply be that more deaths are being recorded on Wikipedia. The fairly steady increase, and the relatively low number of deaths reported in 2006 (when Wikipedia was only five years old), suggest that this is probably the case.  I do not in any way consider Wikipedia a definitive source when it comes to vital statistics, but since, as I've mentioned, this project was primarily to teach myself some coding lessons, I didn't bother myself too much about the completeness or veracity of the data.  Besides likely being an incomplete list, there are also some other data problems, which I'll get to shortly.

By the way, in case you were wondering what the deadliest month is for notable people, it appears to be January:

[month_plot: notable deaths by month]

Obviously a death is sad no matter how old the person was, but part of what seemed to make 2016 extra awful is that many of the people who died seemed relatively young. Were more young celebrities dying in 2016? This boxplot suggests that the answer is no:

[age_plot: boxplots of age at death by year]

This chart tells us that 2016 is pretty similar to other years in terms of the age at which notable people died. The mean age of death in 2016 was 76.85, which is actually slightly higher than the overall mean of 75.95. The red dots on the chart indicate outliers, basically people who died at an age significantly higher or lower than the age at which most people died in that year. There are 268 outliers in 2016, which is a little more than in other years, but not shockingly so.
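If you want to produce something like that boxplot yourself, here's a small sketch, again assuming a deaths data frame with age and year columns rather than my exact code. ggplot2 flags the outliers for you, and you can count them with the same 1.5-times-the-interquartile-range rule the boxplot uses.

```r
library(dplyr)
library(ggplot2)

# Age-at-death distributions by year, with the outliers drawn in red
ggplot(deaths, aes(x = factor(year), y = age)) +
  geom_boxplot(outlier.colour = "red") +
  labs(x = "Year", y = "Age at death")

# Mean age and outlier count per year, using the usual 1.5 * IQR cutoff
deaths %>%
  group_by(year) %>%
  summarise(
    mean_age   = mean(age),
    n_outliers = sum(age < quantile(age, 0.25) - 1.5 * IQR(age) |
                     age > quantile(age, 0.75) + 1.5 * IQR(age))
  )
```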

By the way, you may notice those outliers in 2006 and 2014 where someone died at a very, very old age. I didn't realize it at first, but Wikipedia does include some notable non-humans in its list. One is a famous tree that died in an ice storm at age 125, and the other is a tortoise that had allegedly been owned by Charles Darwin but significantly outlived him, dying at age 176.  Obviously this makes the data, and therefore this analysis, even more suspect as a true scientific pursuit.  But we had fun, right? 🙂

By the way, since I’m making an effort toward doing more open science (if you want to call this science), you can find all the code for this on my Github repository.  And that leads me into the next part of this…

Part Two: Why Do This?

I’m the kind of person who learns best by doing.  I do (usually) read the documentation for stuff, but it really doesn’t make a whole lot of sense to me until I actually get in there myself and start tinkering around.  I like to experiment when I’m learning code, see what happens if I change this thing or that, so I really learn how and why things work. That’s why, when I needed to learn a few key things, rather than just sitting down and reading a book or the help text, I decided to see if I could make this little death experiment work.

One thing I needed to learn: I’m working with a researcher on a project that involves web scraping, which I had kind of played with a little, but never done in any sort of serious way, so this project seemed like a good way to learn that (and it was).  Another motivator: I’m going to be participating in an NCBI hackathon next week, which I’m super excited about, but I really felt like I needed to beef up my coding skills and get more comfortable with Github.  Frankly, doing command line stuff still makes me squeamish, so in the course of doing this project, I taught myself how to use RStudio’s Github integration, which actually worked pretty well (I got a lot out of Hadley Wickham’s explanation of it).  This death project was fairly inconsequential in and of itself, but since I went to the trouble of learning a lot of stuff to make it work, I feel a lot more prepared to be a contributing member of my hackathon team.

I wrote in my post on the open-ish PhD that I would be more amenable to sharing my code if I didn't feel as if it were so laughably amateurish.  In the past, when I wrote code, I would just do whatever ridiculous thing popped into my head that I thought might work, because, hey, who was going to see it anyway?  Ever since I wrote that open-ish PhD post, I've really approached how I write code differently, on the assumption that someone will look at it (not that I think anyone is really all that interested in my goofy death analysis, but hey, it's out there in case someone wants to look).

As I wrote this code, I challenged myself to think not just of a way, any way, to do something, but of the best, most efficient, and most elegant way.  I learned how to write good functions, for real.  I learned how to use %>% (the pipe operator, which is very awesome).  I challenged myself to avoid using for loops, since those are considered not-so-efficient in R, and I succeeded except for one for loop that I couldn't think of a way to avoid at the time, though in retrospect I think there's a more efficient way to write that part and I'll probably go back and change it at some point.  In the past, I would write code and be elated if it actually worked.  With this project, I realized I've reached a new level, where I now look at code and think, "okay, that worked, but how can I do it better?  Can I do that in one line of code instead of three?  Can I make that more efficient?"
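To give a flavor of what I mean, here's a toy example (not lifted from the death script): the first version grows a data frame inside a for loop, which is exactly the pattern R folks warn against, and the second does the same job with one piped, vectorized chain.

```r
library(dplyr)

# The loop-y way: grow a data frame one year at a time (works, but slow and clunky)
yearly_counts <- data.frame()
for (yr in unique(deaths$year)) {
  n <- sum(deaths$year == yr)
  yearly_counts <- rbind(yearly_counts, data.frame(year = yr, deaths = n))
}

# The piped, vectorized way: same result in one readable chain
yearly_counts <- deaths %>%
  count(year) %>%
  rename(deaths = n)
```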

So while this little project might have been somewhat silly, in the end I still think it was a good use of my time because I actually learned a lot and am already starting to use a lot of what I learned in my real work.  Plus, I learned that thing about Darwin’s tortoise, and that really makes the whole thing worth it, doesn’t it?