Dr. Federer’s Wild Ride: The Tale of the 3-year PhD

A couple days ago, I stood outside a conference room while my doctoral dissertation committee discussed the defense I’d just given. My advisor opened the door to invite me back in and said, “Congratulations, Dr. Federer!”

Obviously I’m very pleased about this, in large part because I am absolutely delighted this experience is over. I did this PhD in 3 years (most people take 4-6 years) and I did it while I was also working full-time in a pretty demanding job. At some points, to be honest, it was just basically awful. I say this not to dissuade anyone from doing a PhD part-time, because I do think it’s totally doable under certain circumstances (like if you have a really supportive boss, which I do, and if you don’t have kids, which I don’t), but honestly, in hindsight, doing it this quickly was really pretty ridiculous on my part, although I had my reasons.

One of my professors used to say something that stuck with me – don’t compare your insides to other people’s outsides. When you look at someone’s finished dissertation, or picture of a black hole, or whatever, you’re seeing a final product that has been polished over hours, sometimes years. It’s easy to think, oh my god, everybody has it so together and makes it look so easy, so why is this so hard for me? So in the spirit of cheering on those of you who may be in this process (or are thinking of it), I’d like to shed some light on what it took to become Dr. Federer. (By the way, that same professor also said to us once “at this point in the semester, if you’re not crying in public, you’re doing great,” which I think is also worth keeping in mind.)

First, let me tell you about my typical day. I did my course work in a year and a half, which I managed by taking summer courses. During that time, I would get up around 5 am and be at work by 6:30 so that I could leave by 4 pm. That gave me enough time to get home and walk my dog before getting back in the car and sitting in rush hour traffic on the Beltway for an hour. I had two classes each week, most from 6 – 9 pm, and this usually got me home around 9:30 (thankfully rarely any traffic on the way home that late). Then I’d have dinner – obviously I was not in the mood for cooking that late at night, so on weekends I would cook a couple big batches of something that I could pop in the microwave. It takes a while to mentally wind down from a doctoral-level class (for me anyway), so I usually read a bit or watched a show and got to bed by 11 so I could get up and do it all again in the morning. Over the weekend, I would read whatever I needed to for the next week of classes (and believe me, there is a LOT of reading) and write, but I still managed to have some free time, so it wasn’t too overwhelming.

In my program, there are no qualifying exams because it’s a highly interdisciplinary department and everyone’s doing such different things that it wouldn’t really make sense. Instead, there’s a requirement to write an “integrative paper” that demonstrates that you’ve achieved an appropriate mastery of your subject to move on to the dissertation phase. Mine was a research study that grounded my dissertation by providing evidence for some of the methodological choices I would make. That took a semester, and it was probably the easiest semester out of the whole program. Next came the dissertation proposal, in which you write the introduction, lit review, and methods section that will eventually become the first part of your dissertation, and your committee makes sure what you’re proposing sounds reasonable. That also took a semester. I do remember periods during those semesters in which I had to spend an entire weekend working, but it was nothing like the dissertation would end up being. I also really enjoyed not having to spend that hour in traffic twice a week to go to campus for classes.

I started the final sprint toward finishing my dissertation in January 2019, and that was when things got much harder. All the things I’d said I’d do during the proposal had seemed pretty straightforward, but when it came down to actually doing them, everything took so much longer than I expected and most of it didn’t work on the first (or sometimes even the second or third) try. My topic modeling outputs were a nightmare. My code comments were filled with frustrated notes to myself and the occasional expletive. Manually categorizing and describing things took hours longer than I expected.

The coursework portion of my life had seemed grueling because of the early morning hours and the horrible hour in Beltway traffic, but at least then I had enough free time to feel like I had at least a little work/life balance. For the semester that I worked on the dissertation, nearly every minute of my life was spent working. I would put in an 8-hour day at work, where I’d started a new position with considerably more responsibilities and expectations than my previous one. I would come home and spend half an hour walking my dog and then quickly eat something. During the most intense period of work, I ate nothing but turkey and Swiss cheese sandwiches for the better part of a month because they were quick to make. At one point, a friend sent me a Grubhub gift card so I would eat something other than sandwiches.

After that, I’d spend the next several hours working. I tried to have a cutoff time of 9:30 so I’d have enough time to do something else with my brain for a little while before going to bed, or else I knew I’d never get to sleep. Even then, most nights I’d wake up around 3 or 4 am and almost immediately my brain would once again kick into high gear, going over different ways to better approach the problems, fixes for the code bugs I was running into, or the wording of whatever part I was currently writing. Some nights I managed to get back to sleep, but more often, I lay there in the dark with my mind going full-speed until my alarm went off in the morning. I was constantly exhausted. Weekends weren’t much better. I let myself sleep in a little bit, but then it was all work, usually 12 hours a day, with breaks to walk the dog and of course make a sandwich.

During this period, everything was about the dissertation. I stopped going to the gym. A social life was totally out of the question. I would text with my friends, but I didn’t see most of them for months. Even my dog, Ophelia, hardly got my attention. She developed a bad habit of barking at me in frustration when I would be sitting on my couch working, so I started bringing my laptop and sitting on the floor with her while I worked, which seemed to make her happy. She would bring her Squirrelly (in our home, all toys are named Squirrelly) over for me to throw, and we’d have a little tug-of-war, but I know she could tell that I was distracted. One night she came over and dropped her Squirrelly on the pile of books I was referring to – pictured below – like, “come on, lady, seriously.”

Ophelia hates research but loves playing Squirrelly

Even when I wasn’t working on the dissertation, I was thinking about the dissertation. It was an albatross that always hung heavy around my neck, constantly on my mind no matter what I was doing. If I talked to you at some point during this 4-month period, I can almost guarantee that I was also thinking about the dissertation while we talked. When I did finally finish it and send it to my committee, I was shocked at how much more productive I became at work. I got more done in the day after it was submitted than I had in the entire week prior, just because the dissertation took up so much mental space in every minute of every day that I had very little bandwidth for anything else. It’s something I’ve never experienced before and I don’t know if it can be fully explained to someone who hasn’t gone through it.

But I got through it. After I submitted it to my committee, I spent an evening listening to podcasts, drinking wine, and doing a jigsaw puzzle, and I thought to myself, “wow, what a decadent way to spend an evening.” It’s not, though – it’s normal life (I mean honestly, it’s a little bit boring of an evening in normal-life terms), but I’d so forgotten what that meant that I felt like I was giving myself an incredible indulgence. Even now that the hardest work has been done for weeks, I still sometimes feel a lingering sense of guilt when I do something like read a book for fun, or watch a show, or take my dog for a long hike, like I should be working.

So, I say all of this not to complain about my experience, nor to scare anyone who’s thinking of going down this road. Was it hard? Harder than anything I’ve ever done before. Was it worth it? No question. Really, I write this to say, yes, this is hard, and we all experience it. My dissertation advisor once said in an email to me, “you make this look easy!” and I’m sure to a lot of people, I did. But it’s not, no matter what it may look like from the outside. So, dear reader, if you’re working on a PhD or some other degree, or maybe on a challenging project of some other sort, I want you to know that you’re doing great. Don’t compare yourself to anyone else. Don’t compare your insides to other people’s outsides, because, no matter how poised and calm we may look to the outside world, it’s pretty messy inside for all of us.

Behind the scenes of “Data sharing in PLOS ONE: An analysis of Data Availability Statements”

Recently some colleagues and I published a paper in PLOS ONE in which we analyzed about 47,000 Data Availability Statements as a way of exploring the state of data sharing in a journal with a pretty strong data availability policy. The paper has gotten a good response from what I’ve seen on Twitter, and I’m really happy with how it turned out, thanks in part to some great feedback from the reviewers. But I also wanted to share a few more things about how this paper came about – the things that don’t make it into the final scholarly article. A behind-the-scenes look, if you will.

The idea for this paper arose out of a somewhat eye-opening experience. I needed to get a hold of a good dataset – I forget why exactly, but I think it was when I was first starting to teach R and wanted some real data that I could use in the classes for the hands-on exercises. Remembering that PLOS had this data availability policy, I thought to myself, ah, no problem, I will find an article that looks relevant to the researchers I’m teaching, download the data, and use it in my demo (with proper attribution and credit, of course). So I found an article that looked good and scrolled down to the Data Availability Statement.  Data available upon request.  Huh. I thought you weren’t allowed to say that, but okay, I guess this one slipped through the policy.  Found another one – data is within the paper, it said, except the only data in the paper were summary tables, which were of no use to me (nor would they be of use to anyone hoping to verify the study or reanalyze the data, for example).

What a weird fluke, I thought, that the first two papers I happened to look at didn’t really follow the policy. So I checked a third, and a fourth. Pretty soon I’d spent a half hour combing through recent PLOS articles and I had yet to find one with a publicly available dataset that I could easily download from a repository. I ended up looking elsewhere for data (did you know that baseball fans keep surprisingly in-depth data on a gazillion data points?) but I was left wondering what the real impact of this policy was, which was why I decided to do this study.

I’ll let you read the paper to find out what exactly it is that we found, but there’s one other behind-the-scenes anecdote that I’ll share about this paper that I hope will be encouraging. Obviously if you’re going to write critically about data availability, you’re going to look a little hypocritical if you don’t share your own data. I fully intended to share our data and planned to do so using Figshare, which is how I’d shared a dataset associated with another paper I’d previously published in PLOS. When I shared the data from the first article, I set it to be public immediately, though I didn’t expect anyone to want to see it before the paper was out. Unexpectedly, and unbeknownst to me, someone at Figshare apparently thought this was an interesting dataset and decided to tweet it out the same day I submitted the paper to PLOS, obviously well before it was ever published, much less accepted.

While the interest in the dataset was encouraging, I was also concerned about the fact that it was out before the paper was accepted. I figured I was flattering myself to think that someone would want to scoop me, but then, I got an email from someone I didn’t know, who told me that she had found my dataset and that she would like to write an article describing my results, and would I mind sharing my literature review/citations with her to save her the trouble? In other words, “hi, I would like to write basically the paper that you’re trying to get accepted using all of the work you did.” I want to be clear that I am all for data sharing, but this situation bothered me. Was I about to get scooped?

Obviously our paper came out, no one beat us to it, and as far as I know, no one has ever written another paper using that dataset, but I was thinking about it when I was uploading the data for this most recent paper.  This dataset was way more interesting and broadly applicable than the first one, so what if someone did get a hold of it before our paper came out? So what I decided to do was to upload it to Figshare, have it generate a DOI, but keep the dataset listed as private rather than publicly release it. Our data availability statement included the DOI and was therefore on the surface in compliance, but I had a feeling that, if you went to the DOI, it would tell you that the dataset was private or wasn’t found. Obviously I could have checked this before I submitted, but to be totally honest, I just left it as it was because I was genuinely curious whether any of the reviewers would try to check it themselves and say something.

To their credit, all three of the reviewers (who by the way, were incredibly helpful and gave the most useful feedback I’ve ever gotten on peer review, which I think significantly improved the paper) did indeed point out that the DOI didn’t work. In our revisions, our Data Availability Statement included a working link to not only the data, but also the code, on OSF. I invite anyone who is interested to reuse it and hope someone will find it useful. (Please don’t judge me on the quality of my code, though – I wrote it a long time ago when I was first learning R and I would do it way better now.)

 

My probably not-very-good R 23andme data analysis

My mom got the whole family 23andme kits for Christmas this year, and I’ve been looking forward to getting the results mostly so I could play with the raw data in R. It finally came in, so I went back to a blog post I’d read about analyzing 23andme data using a Bioconductor package called GWASCat. It’s a really good blog post, but as it happens, the package has been updated since it was written, so the code, sadly, didn’t work. Since I of course have VAST knowledge of bioinformatics (by which I mean I’ve hung around a lot of bioinformaticians and talked to them and kind of know a few things but not really) and am super awesome at R (by which I mean I’m like moderately okay at it), I decided to try my hand at coming up with my own analysis, based on the original blog post.

Let me be incredibly clear – I have only a vague notion of what I’m doing here, so you should not take any of what I say to be, you know, like, necessarily fact. In fact, I would love for a real bioinformatician to read this and point out my errors so I can fix them. That said, here is what I did, and what you can do too!

To walk you through it first in plain English – what you get in your raw data from 23andme is a list of RSIDs, which are accession numbers for SNPs, or single nucleotide polymorphisms. At a given position in your genetic sequence, for example, you may have an A, which means you’ll have brown hair, as opposed to a G, which means you’ll have blonde hair. Of course, it’s a lot more complicated than that, but the basic idea is that you can link traits to SNPs.

So the task that needs to be done here is two-fold. First, I need to get a list of SNPs with their strongest risk allele – in other words, what SNP location am I looking for, and which nucleotide is the one that’s associated with higher risk. Then, I need to match this up with my own list of SNPs and find the ones where both of my nucleotides are the risk allele. Here’s how I did it!

First I’m going to load my data and the packages we’ll need.

library(gwascat)   # GWAS catalog data
library(stringr)   # for str_sub(), used below

# read the raw 23andme file (swap in your own file name); it's tab-delimited, and its
# header line starts with "#", so read.table treats it as a comment and skips it
d <- read.table(file = "file_name.txt",
                sep = "\t", header = FALSE,
                colClasses = c("character", "character", "numeric", "character"),
                col.names = c("rsid", "chrom", "position", "genotype"))

Next I need to pull in the data about the SNPs from gwascat. I can use this to match up the RSIDs in my data with the ones in their data. I’m also going to drop some other columns I’m not interested in at the moment.

# load the GWAS catalog data frame bundled with gwascat, join it to my 23andme
# data by RSID, and keep just the handful of columns I care about
data("gwdf_2014_09_08")
trait_list <- merge(gwdf_2014_09_08, d, by.x = "SNPs", by.y = "rsid")
trait_list <- trait_list[, c(1, 9, 10, 22, 27, 37)]

Now I want to find out where I have the risk allele. This is where this analysis gets potentially stupid. The risk allele is stored in the gwascat dataset with its RSID plus the allele, such as rs1423096-G. My understanding is that if you have two Gs at that position (remembering that you get one copy from each of your parents), then you’re at higher risk. So I want to create a new column in my merged dataset that has the risk allele printed twice, so that I can just compare it to the column with my data, and only have it show me the ones where I have two copies of the risk allele (since I don’t want to dig through all 10,000+ SNPs to find the ones of interest).

# paste the risk allele (the last character of "Strongest SNP-Risk Allele", e.g. the G in rs1423096-G) twice, to match my two-letter genotype format
trait_list$badsnp <- paste(str_sub(trait_list$`Strongest SNP-Risk Allele`, -1), str_sub(trait_list$`Strongest SNP-Risk Allele`, -1), sep = "")

Okay, almost there! Now let’s remove all the stuff that’s not interesting and just keep the ones where I have two copies of the risk allele. I could also have it remove ones that don’t match my ancestry (European) or gender, but I’m not going to bother with it, since keeping the ones where I have a match gives me a reasonable amount to scroll through.

# keep only the rows where my genotype is two copies of the risk allele
my_risks <- trait_list[trait_list$genotype == trait_list$badsnp,]

And there you have it! I don’t know how meaningful this analysis really is, but according to this, some traits that I have are higher educational attainment (true), persistent temperament (sure, I suppose so), migraines (true), and I care about the environment (true). Also I could be at higher risk of having a psychotic episode if I’m on methamphetamine, so I’ll probably skip trying that (which was actually my plan anyway). Anyway, it’s kind of entertaining to look at, and I’m finding SNPedia is useful for learning more.

So, now, bring on the bioinformaticians telling me how incorrect all of this is. I will eagerly make corrections to anything I am wrong about!

Flip flop: my failed experiment with flipped classroom R instruction

I don’t know if this terminology is common outside of library circles, but it seems like the “flipped classroom” has been all the rage in library instruction lately.  The idea is that learners do some work before coming to the session (like read something or watch a video lecture), and then the in-person time is spent on doing more activities, group exercises, etc.  As someone who is always keen to try something new and exciting, I decided to see what would happen if I tried out the flipped classroom model for my R classes.

Actually, teaching R this way makes a lot of sense.  Especially if you don’t have any experience, there’s a lot of baseline knowledge you need before you can really do anything interesting.  You’ve got to learn a lot of terminology, how the syntax of R works, boring things like what a data frame is and why it matters.  That could easily be covered before class to save the in person time for the more hands-on aspects.  I’ve also noticed a lot of variability in terms of how much people know coming into classes.  Some people are pretty tech savvy when they arrive, maybe even have some experience with another programming language.  Other people have difficulty understanding how to open a file.  It’s hard to figure out how to pace a class when you’ve got people from all over that spectrum of expertise.  On the other hand, curriculum planning would be much easier if you could know that everyone is starting out with a certain set of knowledge and build off of it.

The other reason I wanted to try this is just the time factor.  I’m busy, really busy.  My library’s training room is also hard to book because we offer so many classes.  The people I teach are busy.  I teach my basic introduction to R course as a 3-hour session, and though I’d really rather make it 4 hours, even finding a 3-hour window when I and the room are both available and people are likely to be able to attend is difficult.  Plus, it would be nice if there was some way to deliver this instruction that wasn’t so time-intensive for me.  I love teaching R – it’s probably my favorite thing I do in my job and I’d estimate I’ve taught close to 500 researchers how to code.  I generally spend around 9 hours a month teaching R, plus another 4-6 hours doing prep, administrative stuff, and all the other things that have to get done to make a class function.  That’s a lot of time, and though I don’t at all mind doing it, I’d definitely be interested in any sort of way I could streamline that work without having a negative impact on the experience of learning R from me.

For all these reasons, I decided to experiment with trying out a flipped classroom model for my introduction to R class.  I had grand plans of making a series of short video tutorials that covered bite-sized pieces of learning R.  There would be a bunch of them, but they’d be about 5 minutes each.  I arranged for the library to get Adobe Captivate, which is very cool video tutorial software, and these tutorials are going to be so awesome when I get around to making them.  However, I had already scheduled the class for today, February 28, and I hadn’t gotten around to making them yet.  Fortunately, I had a recording of a previous Intro to R class I’d taught, so I chopped the relevant parts of that up into smaller pieces and made a YouTube playlist that served as my pre-class work for this session, probably about two and a half hours total.

I had 42 people either signed up or on the waitlist at the end of last week.  I think I made the class description pretty clear – that this session was only an hour, but you did have to do stuff before you got there.  I sent out an email with the link to the video reminding people that they would be lost in class if they didn’t watch this stuff.  Even so, yesterday morning, the last of the videos had only 8 views, and I knew at least two of those were from me checking the video to make sure it worked.  So I sent out another email, once again imploring them to watch the videos before they came to class and to please cancel their registration and sign up for a regular R class if this video thing wasn’t for them.

By the time I taught the class this afternoon, 20 people had canceled their registration.  Of the remaining 22, 5 showed up.  Of the 5 that showed up, it quickly became apparent to me that none of them had watched the videos.  I knew no one was going to answer honestly if I asked who had watched them, so I started by telling them to read in the CSV file to a data frame.  This request is pretty fundamental, and also pretty much the first thing I covered in the videos, so when I was met with a lot of blank stares, I knew this experiment had pretty much failed.  I did my best to cover what I could in an hour, but that’s not much, so instead of this being a cool, interactive class where people ended up feeling empowered and ready to go write code, I got the feeling those people left feeling bewildered and like they wasted an hour.  One guy who had come in 10 minutes late came up to me after class and was like, “so this is a programming language?  What can you do with it?”  And I kind of looked at him like….whaaaat?  It turned out he hadn’t even registered for the class to begin with, much less done any of the pre-class work – he had been in the library and saw me teaching and apparently thought it looked interesting so he decided to wander in.

I felt disappointed by this failed experiment, but I’m not one to give up at the first sign of failure, so I’ve been thinking about how I could make this system work.  It could just be that this model is not suited to people in the setting where I teach.  I am similar to them – a busy, working professional who knows this is useful and worth learning but struggles to find the time – and I think about what it would take for me to do the pre-class work.  If I had the time and the videos were decent enough quality, I think I’d do it, but honestly chances are 50-50 that I’d be able to find the time.  So maybe this model just isn’t made for my community.

Before I give up on this experiment entirely, though, I’d love to hear from anyone who has tried this kind of approach for adult learners.  Did it work, did it not?  What went well and what didn’t?  And of course, being the data queen that I am, I intend to collect some data.  I’m working on a modified class evaluation for those 5 brave souls who did come, to get some feedback on the pre-class work model, and I’m also planning on sending a survey out to the other 38 people who didn’t come, to see what I can find out from them.  Data to the rescue of the flipped class!

Can you hack it? On librarian-ing at hackathons

I had the great pleasure of spending the last few days working on a team at the latest NCBI hackathon.  I think this is the sixth hackathon I’ve been involved in, but this is the first time I’ve actually been a participant, i.e. a “hacker.”  Prior to working on these events, I’d heard a little bit about hackathons, mostly in the context of competitive hackathons – a bunch of teams compete against each other to find the “best” solution to some common problem, usually with the winning team receiving some sort of cash prize.  This approach can lead to successful and innovative solutions to problems in a short time frame.  However, the so-called NCBI-style hackathons that I’ve been involved in over the last couple years involve multiple teams each working on their own individual challenge over a period of three days. There are no winners, but in my experience, everyone walks away having accomplished something, and some very promising software products have come out of these hackathons.  For more specifics about the how and why of this kind of hackathon, check out the article I co-authored with several participants and the mastermind behind the hackathons, Ben Busby of NCBI.

As I said, this was the first hackathon in which I’ve actually been involved as a participant on a team, but I’ve had a lot of fun doing some librarian-y type “consulting” for five other hackathons before this, and it’s an experience I can highly recommend for any information professional who is interested in seeing science happen in real time.  There’s something very exciting about watching groups of people from different backgrounds, with different expertise, most of whom have never met each other before, get together on a Monday morning with nothing but an often very vague idea, and end up on Wednesday afternoon with working software that solves a real and significant biomedical research problem.  Not only that, but most of the groups manage to get pretty far along on writing a draft of a paper by that time, and several have gone on to publish those papers, with more on their way out (see the F1000Research Hackathons channel for some good examples).

As motivated and talented as all these hackathon participants are, as you can imagine, it takes a lot of organizational effort and background work to make something like this successful.  A lot of that work needs to be done by someone with a lot of scientific and computing expertise.  However, if you are a librarian who is reading this, I’m here to tell you that there are some really exciting opportunities to be involved with a hackathon, even if you are completely clueless when it comes to writing code.  In the past five hackathons, I’ve sort of functioned as an embedded informationist/librarian, doing things like:

  • basic lit searching for paper introductions and generally locating background information.  These aren’t formal papers that require an extensive or systematic lit review, but it’s useful for a paper to provide some context for why the problem is significant.  The hackers have a ton of work to fit in to three days, so it’s silly to have them spend their limited time on lit searching when a pro librarian can jump in and likely use their expertise to find things more easily anyway
  • manuscript editing and scholarly communication advice.  Anyone who has worked  with co-authors knows that it takes some work to make the paper sound cohesive, and not like five or six people’s papers smushed together.  Having someone like a librarian with editing experience to help make that happen can be really helpful.  Plus, many librarians  have relevant expertise in scholarly publishing, especially useful since hackathon participants are often students and earlier career researchers who haven’t had as much experience with submitting manuscripts.  They can benefit from advice on things like citation management and handling the submission process.  Also, I am a strong believer in having a knowledgeable non-expert read any paper, not just hackathon papers.  Often writers (and I absolutely include myself here) are so deeply immersed in their own work that they make generous assumptions about what readers will know about the topic.  It can be helpful to have someone who hasn’t been involved with the project from the start take a look at the manuscript and point out where additional background or explanation might be beneficial to aiding general understandability.
  • consulting on information seeking behavior and giving user feedback.  Most of the hackathons I’ve worked on have had teams made up of all different types of people – biologists, programmers, sys admins, other types of scientists.  They are all highly experienced and brilliant people, but most have a particular perspective related to their specific subject area, whereas librarians often have a broader perspective based on our interactions with lots of people from various different subject areas.  I often find myself thinking of how other researchers I’ve met might use a tool in other ways, potentially ones the hackathon creators didn’t necessarily intend.  Also, at least at the hackathons I’ve been at, some of the tools have definite use cases for librarians – for example, tools that involve novel ways of searching or visualizing MeSH terms or PubMed results.  Having a librarian on hand to give feedback about how the tool will work can be useful for teams with that kind of a scope.

I think librarians can bring a lot to hackathons, and I’d encourage all hackathon organizers to think about engaging librarians in the process early on.  But it’s not a one-way street – there’s a lot for librarians to gain from getting involved in a hackathon, even tangentially.  For one thing, seeing a project go from idea to reality in three days is interesting and informative.  When I first started working with hackathons, I didn’t have that much coding experience, and I certainly had no idea how software was actually developed.  Even just hanging around hackathons gave me so much of a better understanding, and as an informationist who supports data science, that understanding is very relevant.  Even if you’re not involved in data science per se, if you’re a biomedical librarian who wants to gain a better understanding of the science your users are engaged in, being involved in a hackathon will be a highly educational experience.  I hadn’t really realized how much I had learned by working with hackathons until a librarian friend asked me for some advice on genomic databases. I responded by mentioning how cool it was that ClinVar would tell you about pathogenic variants, including their location and type (insertion, deletion, etc), and my friend was like, what are you even talking about, and that was when it occurred to me that I’ve really learned a lot from hackathons!  And hey, if nothing else, there tends to be pizza at these events, and you can never go wrong with pizza.

I’ll end this post by reiterating that these hackathons aren’t about competing against each other, but there are awards given for certain “exemplary” achievements.  Never one to shy away from a little friendly competition, I hoped I might be honored for some contribution this time around, and I’m pleased to say I was indeed recognized. 🙂

It's true, I'm the absolute worst at darts.

There is a story behind this, but trust me when I say it’s true, I’m the absolute worst at darts.

A Silly Experiment in Quantifying Death (and Doing Better Code)

Doesn’t it seem like a lot of people died in 2016?  Think of all the famous people the world lost this year.  It was around the time that Alan Thicke died a couple weeks ago that I started thinking, this is quite odd; uncanny, even.  Then again, maybe there was really nothing unusual about this year, but because a few very big names passed away relatively young, we were all paying a little more attention to it.  Because I’m a data person, I decided to do a rather silly thing, which was to write an R script that would go out and collect a list of celebrity deaths, clean up the data, and then do some analysis and visualization.

You might wonder why I would spend my limited free time doing this rather silly thing.  For one thing, after I started thinking about celebrity deaths, I really was genuinely curious about whether this year had been especially fatal or if it was just an average year, maybe with some bigger names.  More importantly, this little project was actually a good way to practice a few things I wanted to teach myself.  Probably some of you are just here for the death, so I won’t bore you with a long discussion of my nerdy reasons, but if you’re interested in R, Github, and what I learned from this project that actually made it quite worthwhile, please do stick around for that after the death discussion!

Part One: Celebrity Deaths!

To do this, I used Wikipedia’s lists of deaths of notable people from 2006 to present. This dataset is very imperfect, for reasons I’ll discuss further, but obviously we’re not being super scientific here, so let’s not worry too much about it. After discarding incomplete data, this left me with 52,185 people.  Here they are on a histogram, by year.

[Figure: notable deaths recorded per year, 2006-2016]

As you can see, 2016 does in fact have the most deaths, with 6,640 notable people’s deaths having been recorded as of January 3, 2017. The next closest year is 2014, when 6,479 notable people died, but that’s a full 161 people fewer than in 2016 (which is only a 2% difference, to be fair, but still).  The average number of notable people who died yearly over this 11-year period was 4,774, and the number of people that died in 2016 alone is 40% higher than that average.  So it’s not just in my head, or yours – more notable people died this year.
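
In case you’re curious what the tallying looks like in code, here’s a minimal sketch of the idea – not my actual script (that’s in the Github repo linked below), and the deaths data frame and year column here are just placeholder names:

# count the recorded deaths per year and plot the counts as bars
# (assumes a data frame called deaths with a year column - placeholder names)
library(ggplot2)
yearly_counts <- as.data.frame(table(deaths$year))
names(yearly_counts) <- c("year", "deaths")
ggplot(yearly_counts, aes(x = year, y = deaths)) +
  geom_bar(stat = "identity")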

Now, before we all start freaking out about this, it should be noted that the higher number of deaths in 2016 may not reflect more people actually dying – it may simply be that more deaths are being recorded on Wikipedia. The fairly steady increase and the relatively low number of deaths reported in 2006 (when Wikipedia was only five years old) suggest that this is probably the case.  I do not in any way consider Wikipedia a definitive source when it comes to vital statistics, but since, as I’ve mentioned, this project was primarily to teach myself some coding lessons, I didn’t bother myself too much about the completeness or veracity of the data.  Besides likely being an incomplete list, there are also some other data problems, which I’ll get to shortly.

By the way, in case you were wondering what the deadliest month is for notable people, it appears to be January:

[Figure: notable deaths recorded per month]

Obviously a death is sad no matter how old the person was, but part of what seemed to make 2016 extra awful is that many of the people who died seemed relatively young. Are more young celebrities dying in 2016? This boxplot suggests that the answer to that is no:

[Figure: boxplot of age at death by year, with outliers shown as red dots]

This chart tells us that 2016 is pretty similar to other years in terms of the age at which notable people died. The mean age of death in 2016 was 76.85, which is actually slightly higher than the overall mean of 75.95. The red dots on the chart indicate outliers, basically people who died at an age that’s significantly more or less than the age most people died at in that year. There are 268 in 2016, which is a little more than other years, but not shockingly so.
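
If you’re wondering how those red dots get flagged, the standard boxplot rule counts a point as an outlier if it falls more than 1.5 times the interquartile range beyond the quartiles – something like this, again with placeholder names:

# count outlier ages for a single year using the usual 1.5 * IQR boxplot rule
ages_2016 <- deaths$age[deaths$year == 2016]
q <- quantile(ages_2016, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[2] - q[1]
sum(ages_2016 < q[1] - 1.5 * iqr | ages_2016 > q[2] + 1.5 * iqr, na.rm = TRUE)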

By the way, you may notice those outliers in 2006 and 2014 where someone died at a very, very old age. I didn’t realize it at first, but Wikipedia does include some notable non-humans in their list. One is a famous tree that died in an ice storm at age 125 and the other a tortoise who had allegedly been owned by Charles Darwin, but significantly outlived him, dying at age 176.  Obviously this makes the data and therefore this analysis even more suspect as a true scientific pursuit.  But we had fun, right? 🙂

By the way, since I’m making an effort toward doing more open science (if you want to call this science), you can find all the code for this on my Github repository.  And that leads me into the next part of this…

Part Two: Why Do This?

I’m the kind of person who learns best by doing.  I do (usually) read the documentation for stuff, but it really doesn’t make a whole lot of sense to me until I actually get in there myself and start tinkering around.  I like to experiment when I’m learning code, see what happens if I change this thing or that, so I really learn how and why things work. That’s why, when I needed to learn a few key things, rather than just sitting down and reading a book or the help text, I decided to see if I could make this little death experiment work.

One thing I needed to learn: I’m working with a researcher on a project that involves web scraping, which I had kind of played with a little, but never done in any sort of serious way, so this project seemed like a good way to learn that (and it was).  Another motivator: I’m going to be participating in an NCBI hackathon next week, which I’m super excited about, but I really felt like I needed to beef up my coding skills and get more comfortable with Github.  Frankly, doing command line stuff still makes me squeamish, so in the course of doing this project, I taught myself how to use RStudio’s Github integration, which actually worked pretty well (I got a lot out of Hadley Wickham’s explanation of it).  This death project was fairly inconsequential in and of itself, but since I went to the trouble of learning a lot of stuff to make it work, I feel a lot more prepared to be a contributing member of my hackathon team.
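
Speaking of the web scraping: in case you’re wondering what that part looks like, here’s a bare-bones sketch using the rvest package on one of Wikipedia’s monthly death pages – the real script, with all the actual parsing and cleanup, is in the repo:

# pull the list entries off one of Wikipedia's monthly "Deaths in ..." pages
# (a sketch only - the real cleanup of names, dates, and ages takes more work)
library(rvest)
page <- read_html("https://en.wikipedia.org/wiki/Deaths_in_January_2016")
entries <- page %>%
  html_nodes("li") %>%
  html_text()
head(entries)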

I wrote in my post on the open-ish PhD that I would be more amenable to sharing my code if I didn’t feel as if it were so laughably amateurish.  In the past, when I wrote code, I would just do whatever ridiculous thing popped into my head that I thought might work, because, hey, who was going to see it anyway?  Ever since I wrote that open-ish PhD post, I’ve really approached how I write code differently, on the assumption that someone will look at it (not that I think anyone is really all that interested in my goofy death analysis, but hey, it’s out there in case someone wants to look).

As I wrote this code, I challenged myself to think not just of a way, any way, to do something, but the best, most efficient, and most elegant way.  I learned how to write good functions, for real.  I learned how to use %>% (the pipe operator, which is very awesome).  I challenged myself to avoid using for loops, since those are considered not-so-efficient in R, and I succeeded in this except for one for loop that I couldn’t think of a way to avoid at the time, though I think in retrospect there’s another, more efficient way I could write that part and I’ll probably go back and change it at some point.  In the past, I would write code and be elated if it actually worked.  With this project, I realized I’ve reached a new level, where I now look at code and think, “okay, that worked, but how can I do it better?  Can I do that in one line of code instead of three?  Can I make that more efficient?”
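
To give a toy example of what I mean by more efficient and more elegant, here’s the same calculation – mean age of death per year – written first as a for loop and then as a piped dplyr version (placeholder names again, not my actual code):

# the for-loop way: accumulate the result one year at a time
mean_ages <- numeric(0)
for (yr in sort(unique(deaths$year))) {
  mean_ages <- c(mean_ages, mean(deaths$age[deaths$year == yr], na.rm = TRUE))
}

# the piped way: group, summarize, done
library(dplyr)
mean_ages_by_year <- deaths %>%
  group_by(year) %>%
  summarize(mean_age = mean(age, na.rm = TRUE))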

So while this little project might have been somewhat silly, in the end I still think it was a good use of my time because I actually learned a lot and am already starting to use a lot of what I learned in my real work.  Plus, I learned that thing about Darwin’s tortoise, and that really makes the whole thing worth it, doesn’t it?

In defense of the live demo (despite its perils)

[Image: RStudio’s fatal error screen, complete with old-timey bomb]

When RStudio crashes, it is not subtle about it.  You get a picture of an old-timey bomb and the succinct, blunt message “R encountered a fatal error.”  A couple hundred of my librarian friends and colleagues got to see it live during the demo I gave as part of a webinar I did for the Medical Library Association on R for librarians earlier today.  At first, I thought the problem was minor.  When I tried to read in my data, I got this error message:

Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'lib_data_example.csv': No such file or directory

It’s a good example of R’s somewhat opaque and not-super-helpful error messages, but I’ve seen it before and it’s not a big deal.  It just meant that R couldn’t find the file I’d asked for.  Most of the time it’s because you’ve spelled the file name wrong, or you’ve capitalized something that should be lower case.  I double checked the file name against the cheat sheet I’d printed out with all my code.  Nope, the file name was correct.  Another likely cause is that you’re in the wrong directory and you just need to set the working directory to where the file is located.  I checked that too – my working directory was indeed set to where my file should have been.  That was when RStudio crashed, though I’m still not sure exactly why that happened.  I assume RStudio did it just to mess with me.  🙂
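
(For anyone following along at home, the sanity checks I’d run through before the crash are one-liners in the console:)

getwd()                              # which directory is R actually working in?
list.files()                         # what files does R see there?
file.exists("lib_data_example.csv")  # is the file the error complains about really there?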

I’m sure a lot of presenters would be pretty alarmed at this point, but I was actually quite amused.  People on Twitter seemed to notice:

Having your live demo crash is not very entertaining in and of itself, but I found the situation rather amusing because I had considered whether I should do a live demo and decided to go with it because it seemed so low risk.  What could go wrong?  Sure, live demos are unpredictable.  Websites go down, databases change their interface without warning (invariably they do this five minutes before your demo starts), software crashes, and so on. Still, the demo I was doing was really quite simple compared to a lot of the R I normally teach, and it involved using an interface I literally use almost every day.   I’ve had plenty of presentations go awry in the past, but this was one that I really thought had almost 0% chance of going wrong.  So when it all went wrong on the very first line of code, I couldn’t help but laugh.  It’s the live demo curse!  You can’t escape!

I’m sure most people who have spent any significant amount of time doing live demos of technology have had the experience of seeing the whole thing blow up.  I know a lot of librarians who avoid the whole thing by making slides with screen shots of what they would show and do sort of a mock demo.  There’s nothing wrong with that, and I can understand the inclination to remove the uncertainty of the live demo from the equation.  But despite their being fraught with potential issues, I’m still in favor of live demos – and in a sense, I feel this way exactly because of their unpredictability.

For one thing, it’s helpful for learners to see how an experienced user thinks through the process of troubleshooting when something goes wrong.  It’s just a fact that stuff doesn’t always work perfectly in real life.  If the people I’m teaching are ever actually going to use the tools I’m demonstrating, eventually they’re going to run into some problems.  They’re more likely to be able to solve those problems if they’ve had a chance to see someone work through whatever issues arise.  This is true for many different types of technologies and information resources, but especially so with programming languages.  Learning to troubleshoot is itself an essential skill in programming, and what better way to learn than to see it in action?

Secondly, for brand new users of a technology, watching an instructor give a flawless and apparently effortless demonstration can actually make mastery feel out of reach for them.  In reality, a lot of time and effort likely went into developing that demo, trying out lots of different approaches, seeing what works well and what doesn’t, and arriving at the “perfect” final demo.  I’m certainly not suggesting that instructors should do freewheeling demos with no prior planning whatsoever, but I am in favor of an approach that acknowledges that things don’t always go right the first time.  When I learned R, I would watch  tutorials by these incredibly smart and talented instructors and think, oh my gosh, they make this look so easy and I’m totally lost – I’m never going to understand how this works.  Obviously I don’t want to look like an unprepared and incompetent fool in front of a class, but hey, things don’t always go perfectly.  I’m human, you’re human, we’re all going to make mistakes, but that’s part of learning, so let’s talk about what went wrong and how we fix it.

By the way, in case you’re wondering what did actually go wrong in this instance, I had inadvertently moved the data file in the process of uploading it to my Github repo – I thought I’d made a copy, but I had actually moved the original.  I quickly realized what had happened, and I knew roughly where I’d put the file, but it was in some folder buried deep in my file structure that I wouldn’t be able to locate easily on the spot.  The quickest solution I could think of, which I quickly did off-screen from the webinar (thank you dual monitors) was to copy the data from the repo, paste it into a new CSV and quickly save it where the original file should have been.  It worked fine and the demo went off as planned after that.

Who Am I? The Identity Crisis of the Librarian/Informationist/Data Scientist

More and more lately, I’m asked the question “what do you do?” This is a surprisingly difficult question to answer.  Often, how I answer depends on who’s asking – is it someone who really cares or needs to know? – and how much detail I feel like going into at the moment when I’m asked.  When I’m asked at conferences, as I was quite a bit at FORCE2016, I try to be as explanatory as possible without getting pedantic, boring, or long-winded.  My answer in those scenarios goes something like “I’m a data librarian – I do a lot of instruction on data science, like R and data visualization, and data management.”  When I’m asked in more social contexts, I hardly even bother explaining.  Depending on my mood and the person who’s asking, I’ll usually say something like data scientist, medical librarian, or, if I really don’t feel like talking about it, just librarian.  It’s hard to know how to describe yourself when you have a job title that is pretty obscure: Research Data Informationist.  I would venture to guess that 99% of my family, friends, and even work colleagues have little to no idea what I actually spend my days doing.

In some regards, that’s fine.  Does it really matter if my mom and dad know what it means that I’ve taught hundreds of scientists R? Not really (they’re still really proud, though!).  Do I care if my date has a clear understanding of what a data librarian does?  Not really.  Do I care if a random person I happen to chat with while I’m watching a hockey game at my local gets the nuances of the informationist profession?  Absolutely not.

On the other hand, there are often times that I wish I had a somewhat more scrutable job title.  When I’m talking to researchers at my institution, I want them to know what I do because I want them to know when to ask me for help.  I want them to know that the library has someone like me who can help with their data science questions, their data management needs, and so on.  I know it’s not natural to think “library” when the question is “how do I get help with finding data” or “I need to learn R and don’t know where to start” or “I’d like to create a data visualization but I have no idea how to do it” or any of the other myriad data-related issues I or my colleagues could address.

The “informationist” term is one that has a clear definition and a history within the realm of medical librarianship, but I feel like it has almost no meaning outside of our own field.  I can’t even count the number of weird variations I’ve heard on that title – informaticist, informationalist, informatist, and many more.  It would be nice to get to the point that researchers understood what an informationist is and how we can help them in their work, but I just don’t see that happening in the near future.

So what do we do to make our contributions and expertise and status as potential collaborators known?  What term can we call ourselves to make our role clear?  Librarian doesn’t really do it, because I think people have a very stereotypical and not at all correct view of what librarians do, and it doesn’t capture the data informationist role at all.  Informationist doesn’t do it, because no one has any clue what that means.  I’ve toyed with calling myself a data scientist, and though I do think that label fits, I have some reservations about using that title, probably mostly driven by a terrible case of imposter syndrome.

What’s in a name?  A lot, I think.  How can data librarians, informationists, library-based data scientists, whatever you want to call us, communicate our role, our expertise, our services, to our user communities?  Is there a better term for people who are doing this type of work?

Some ponderings on #force2016 and open data

I’m attending FORCE2016, which is my first FORCE11 conference after following this movement (or group?) for a while, and I have to say, this is one interesting, thought-provoking conference.  I haven’t been blogging in a while, but I felt inspired to get a few thoughts down after the first day of FORCE2016:

  • I love the interdisciplinarity of this conference, and to me, that’s what makes it a great conference to attend.  In our “swag bag,” we were all given a “passport” and could earn extra tickets for getting signatures of attendees from different disciplines and geographic locations.  While free drinks are of course a great incentive, I think the fact that we have so many diverse attendees at this conference is a draw on its own.  I love that we are getting researchers, funders, publishers, librarians, and so many other stakeholders at the table, and I can’t think of another conference where I’ve seen this many different types of people from this many countries getting involved in the conversation.
  • I actually really love that there are so few concurrent sessions.  Obviously, fewer concurrent sessions means fewer voices joining the official conversation, but I think this is a small enough conference that there are ways to be involved, active, and vocal without necessarily being an invited speaker.  While I love big conferences like MLA, I always feel pulled in a million different directions – sometimes literally, like last year when I was scheduled to present papers at two different sessions during the same time period.  I feel more engaged at a conference when I’m seeing mostly the same content as others.  We’re all on the same page and we can have better conversations.  I also feel more engaged in the Twitter stream.  I’m not trying to follow five, ten, or more tweet streams at once from multiple sessions.  Instead, I’m seeing lots of different perspectives and ideas and feedback on one single session.  I like us all being on the same page.

Now, those are some positives, but I do have to bring it down with one negative from this conference, and that is that I think it’s hard to constructively talk about how to encourage sharing and open science when you have a whole conference full of open science advocates.  I do not in any way want to disparage anyone because I have a lot of respect for many of the participants in the session I’m talking about, but I was a little disappointed in the final session today on data management.  I loved the idea of an interactive session (plus I heard there would be balloons and chocolate, so, yeah!) and also the idea of debate on topics in data sharing and management, since that’s my jam.  I did debate in high school, so I can recognize the difficulty but also the usefulness of having to argue for a position with which you strongly disagree.  There’s real value in spending some time thinking about why people hold positions that are in opposition to your strongly held position.  And yeah, this was the last session of a long day, and it was fun, and it had popping of balloons, and apparently some chocolate, and whatnot, but I am a little disappointed at what I see as a real missed opportunity to spend some time really discussing how we can address some of the arguments against data sharing and data management.  Sure, we all laughed at the straw men that were being thrown out there by the teams who were being called upon to argue in favor of something that they (and all of us, as open science advocates) strongly disagreed with.  But I think we really lost an opportunity to spend some time giving serious thought to some of the real issues that researchers who are not open science advocates actually raise.  Someone in that session mentioned the open data excuses bingo page (you can find it here if you haven’t seen it before).  Again, funny, but SERIOUSLY I have actually had real researchers say ALL of these things, except for the thing about terrorists.  I will reiterate that I know and respect a lot of people involved with that session and I’m not trying to disparage them in any way, but I do hope we can give some real thought to some of the issues that were brought up in jest today.  Some of these excuses, or complaints, or whatever, are actual, strongly-held beliefs of many, many researchers.  The burden is on us, as open science advocates, to demonstrate why data sharing, data management, and the like are tenable positions and in fact the “correct” choice.

Okay, off my soap box!  I’m really enjoying this conference, having a great time reconnecting with people I’ve not seen in years, and making new connections.  And Portland!  What a great city. 🙂

To keep or not to keep: that is the question

I recently read an article in The Atlantic about people who are compulsive declutterers – the opposite of hoarders – who feel compelled to get rid of all their possessions. I’m more on the side of hoarding, because I always find myself thinking of eventualities in which I might need the item in question.  Indeed, it has often been the case that I will think of something I got rid of weeks or even years later and wish I still had it: a book I would have liked to reference, a piece of clothing I would have liked to wear, a receipt I could have used to take something back.  Of course, I don’t have unlimited storage space, so I can’t keep all this stuff.  The question of what to keep and for how long is one that librarians think about when it comes to weeding: deciding which parts of the collection to deaccession, or basically, get rid of.  There are evidence-based, tried-and-true ways of thinking about weeding a library collection, but that’s not so much true when it comes to data.  How is a scientist to decide what to keep and what not to keep?

I know this is a question that researchers are thinking about quite a bit, because I get more emails about this than almost any other issue.  In fact, I get emails not only from users of my own library, but researchers from all over the country who have somehow found my name.  What exactly do I need to keep?  If I have electronic records, do I need to keep a print copy as well?  How many years do I need to keep this stuff?  These are all very reasonable questions, and it would be nice to say, yes, there is an answer and it is….!  But it’s almost never so easy to point to a single answer.

A case in point: a couple years ago, I decided to teach a class about data preservation and retention.  In my naivete, I thought it would be nice to take a look through all the relevant policy and find the specific number of years that research data is required to be retained.  I read handbooks and guides.  I read policy documents from various agencies.   I even read the U.S. Code (I do not recommend it).  At the end of it, I found that not only is there not a single, definitive, policy answer to how long funded research data should be retained, but there are in fact all sorts of contradictory suggestions.  I found documents giving times from 3 years to 7 years to the super-helpful “as long as necessary.”

This may be difficult to answer from a policy perspective, but I think answering this from a best practices perspective is even trickier.  Let’s agree that we just can’t keep everything – storing data isn’t free, and it takes considerable time and effort to ensure that data remain accessible and usable.  Assuming that some stuff has to get thrown away, how do we distinguish trash from treasure, especially given the old adage about how the former might be the latter to others?  It’s hard to know whether something that appears useless now might actually be useful and interesting to someone in the future.  To take this to the extreme, here’s an actual example from a researcher I’ve worked with: he asked how he could have his program automatically discard everything in the thousandth place from his measurements.  In other words, he wanted 4.254 to be saved as 4.25.  I told him I could show him how, but I asked why he wanted to do this.  He told me that his machine was capable of measuring to the thousandth, but the measurement was only scientifically relevant to the hundredth place.  To scientists right now, 4.254 and 4.252 were essentially indistinguishable, so why bother with the extra noise of the thousandth place?  Fair point, but what about 5 years from now, or 10 years from now?  If science evolves to the point that this extra level of precision is meaningful, tomorrow’s researchers will probably be a little annoyed that today’s researchers had that measurement and just threw it away.  But then again, how can we know now when, or even if, that level of precision will be wanted?  For that matter, we can’t even say for sure whether this dataset will be useful at all.  Maybe a new and better method for making this measurement will be developed tomorrow, and all this stuff we gathered today will be irrelevant.  But how can we know?
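
(As an aside, the rounding itself is a one-liner – something like the sketch below, with made-up measurements – so the question was never how to do it, but whether to.)

# round made-up measurements to the hundredth place, discarding the thousandths
measurements <- c(4.254, 4.252, 5.118)
round(measurements, 2)  # 4.25 4.25 5.12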

These are all questions that I think are not easy to answer right now, but that people within research communities should be thinking about.  For one thing, I don’t think we can give one simple answer to how long data should be retained.  For one type of research, a few years may be enough.  For other fields, where it’s harder to replicate data, maybe we need to keep it in perpetuity.  When it comes to deciding what should be retained and what should be discarded, I think that answers cannot be dictated by one-size-fits-all policies and that subject matter experts and information professionals should work together to figure out good answers for specific communities and specific data.  Eventually, I suppose we’ll probably have some of those well-defined best practices for data retention in the same way that we have those best practices from collection management in libraries.  Until then, keep your crystal balls handy. 🙂