A Silly Experiment in Quantifying Death (and Doing Better Code)

Doesn’t it seem like a lot of people died in 2016?  Think of all the famous people the world lost this year.  It was around the time that Alan Thicke died a couple weeks ago that I started thinking, this is quite odd; uncanny, even.  Then again, maybe there was really nothing unusual about this year, but because a few very big names passed away relatively young, we were all paying a little more attention to it.  Because I’m a data person, I decided to do a rather silly thing, which was to write an R script that would go out and collect a list of celebrity deaths, clean up the data, and then do some analysis and visualization.

You might wonder why I would spend my limited free time doing this rather silly thing.  For one thing, after I started thinking about celebrity deaths, I really was genuinely curious about whether this year had been especially fatal or if it was just an average year, maybe with some bigger names.  More importantly, this little project was actually a good way to practice a few things I wanted to teach myself.  Probably some of you are just here for the death, so I won’t bore you with a long discussion of my nerdy reasons, but if you’re interested in R, GitHub, and what I learned from this project that actually made it quite worthwhile, please do stick around for that after the death discussion!

Part One: Celebrity Deaths!

To do this, I used Wikipedia’s lists of deaths of notable people from 2006 to present. This dataset is very imperfect, for reasons I’ll discuss further, but obviously we’re not being super scientific here, so let’s not worry too much about it. After discarding incomplete data, this left me with 52,185 people.  Here they are on a histogram, by year.

[Figure: year_plot]

As you can see, 2016 does in fact have the most deaths, with 6,640 notable people’s deaths having been recorded as of January 3, 2017. The next closest year is 2014, when 6,479 notable people died, but that’s a full 161 people fewer than 2016 (which is only a 2% difference, to be fair, but still).  The average number of notable people who died yearly over this 11-year period was 4,744, and the number of people that died in 2016 alone is 40% higher than that average.  So it’s not just in my head, or yours – more notable people died this year.
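For anyone who wants to check the arithmetic, here is a minimal sketch in R that uses only the figures quoted above. The variable names are made up for illustration, and the yearly average is simply the total record count divided by the eleven years covered; this is not the code from the actual analysis.

```r
# Figures quoted in the post; only 2014 and 2016 are given individually.
total_deaths <- 52185  # all records kept, 2006 through 2016
n_years      <- 11
deaths_2016  <- 6640
deaths_2014  <- 6479

yearly_mean   <- total_deaths / n_years
gap_2014      <- deaths_2016 - deaths_2014
pct_over_2014 <- 100 * gap_2014 / deaths_2014
pct_over_mean <- 100 * (deaths_2016 / yearly_mean - 1)

round(yearly_mean)    # 4744
gap_2014              # 161
round(pct_over_2014)  # 2
round(pct_over_mean)  # 40
```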

Now, before we all start freaking out about this, it should be noted that the higher number of deaths in 2016 may not reflect more people actually dying – it may simply be that more deaths are being recorded on Wikipedia. The fairly steady increase and the relatively low number of deaths reported in 2006 (when Wikipedia was only five years old) suggest that this is probably the case.  I do not in any way consider Wikipedia a definitive source when it comes to vital statistics, but since, as I’ve mentioned, this project was primarily to teach myself some coding lessons, I didn’t bother myself too much about the completeness or veracity of the data.  Besides likely being an incomplete list, there are also some other data problems, which I’ll get to shortly.

By the way, in case you were wondering what the deadliest month is for notable people, it appears to be January:

[Figure: month_plot]

Obviously a death is sad no matter how old the person was, but part of what seemed to make 2016 extra awful is that many of the people who died seemed relatively young. Are more young celebrities dying in 2016? This boxplot suggests that the answer to that is no:

[Figure: age_plot]

This chart tells us that 2016 is pretty similar to other years in terms of the age at which notable people died. The mean age of death in 2016 was 76.85, which is actually slightly higher than the overall mean of 75.95. The red dots on the chart indicate outliers, basically people who died at an age that’s significantly more or less than the age most people died at in that year. There are 268 in 2016, which is a little more than other years, but not shockingly so.
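If you are curious what "outlier" means on a boxplot, here is a small sketch using base R's boxplot.stats(). By default it flags any value more than 1.5 times the box height beyond the edges of the box. The ages below are made up for illustration, and whether the chart above was drawn with the default settings is an assumption on my part.

```r
# Made-up ages at death, for illustration only; not the Wikipedia data.
ages <- c(27, 45, 68, 70, 72, 75, 77, 78, 80, 81, 83, 85, 88, 90, 176)

stats <- boxplot.stats(ages)  # default coef = 1.5

stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values drawn as individual points: 27 45 176
```

Here the box runs from 71 to 84, so anything below 51.5 or above 103.5 gets its own dot, which is how a 176-year-old tortoise ends up far above everyone else.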

By the way, you may notice those outliers in 2006 and 2014 where someone died at a very, very old age. I didn’t realize it at first, but Wikipedia does include some notable non-humans in their list. One is a famous tree that died in an ice storm at age 125 and the other a tortoise who had allegedly been owned by Charles Darwin, but significantly outlived him, dying at age 176.  Obviously this makes the data and therefore this analysis even more suspect as a true scientific pursuit.  But we had fun, right? 🙂

By the way, since I’m making an effort toward doing more open science (if you want to call this science), you can find all the code for this on my GitHub repository.  And that leads me into the next part of this…

Part Two: Why Do This?

I’m the kind of person who learns best by doing.  I do (usually) read the documentation for stuff, but it really doesn’t make a whole lot of sense to me until I actually get in there myself and start tinkering around.  I like to experiment when I’m learning code, see what happens if I change this thing or that, so I really learn how and why things work. That’s why, when I needed to learn a few key things, rather than just sitting down and reading a book or the help text, I decided to see if I could make this little death experiment work.

One thing I needed to learn: I’m working with a researcher on a project that involves web scraping, which I had kind of played with a little, but never done in any sort of serious way, so this project seemed like a good way to learn that (and it was).  Another motivator: I’m going to be participating in an NCBI hackathon next week, which I’m super excited about, but I really felt like I needed to beef up my coding skills and get more comfortable with GitHub.  Frankly, doing command line stuff still makes me squeamish, so in the course of doing this project, I taught myself how to use RStudio’s GitHub integration, which actually worked pretty well (I got a lot out of Hadley Wickham’s explanation of it).  This death project was fairly inconsequential in and of itself, but since I went to the trouble of learning a lot of stuff to make it work, I feel a lot more prepared to be a contributing member of my hackathon team.

I wrote in my post on the open-ish PhD that I would be more amenable to sharing my code if I didn’t feel as if it were so laughably amateurish.  In the past, when I wrote code, I would just do whatever ridiculous thing popped into my head that I thought might work, because, hey, who was going to see it anyway?  Ever since I wrote that open-ish PhD post, I’ve really approached how I write code differently, on the assumption that someone will look at it (not that I think anyone is really all that interested in my goofy death analysis, but hey, it’s out there in case someone wants to look).

As I wrote this code, I challenged myself to think not just of a way, any way, to do something, but the best, most efficient, and most elegant way.  I learned how to write good functions, for real.  I learned how to use %>% (which is a pipe operator, and it’s very awesome).  I challenged myself to avoid using for loops, since those are considered not-so-efficient in R, and I succeeded in this except for one for loop that I couldn’t think of a way to avoid at the time, though I think in retrospect there’s another, more efficient way I could write that part and I’ll probably go back and change it at some point.  In the past, I would write code and be elated if it actually worked.  With this project, I realized I’ve reached a new level, where I now look at code and think, “okay, that worked, but how can I do it better?  Can I do that in one line of code instead of three?  Can I make that more efficient?”
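To make the "for loop versus something better" idea concrete, here is a small sketch. It is not the code from the repository: the entry format and the extract_age() helper are hypothetical, and it uses base R's |> pipe (R 4.1 or later) so that it runs without extra packages. With the magrittr package loaded, %>% can be swapped in the same way.

```r
# Hypothetical helper: pull the age out of an entry such as "Jane Doe, 84, actor".
extract_age <- function(entry) {
  fields <- strsplit(entry, ",", fixed = TRUE)[[1]]
  as.integer(trimws(fields[2]))
}

# Made-up entries, for illustration only.
entries <- c("Jane Doe, 84, actor", "John Roe, 61, musician", "Ann Poe, 97, writer")

# Loop version: works, but grows the result one element at a time.
ages_loop <- integer(0)
for (entry in entries) {
  ages_loop <- c(ages_loop, extract_age(entry))
}

# Vectorized version: one call, with a declared return type.
ages <- vapply(entries, extract_age, integer(1), USE.NAMES = FALSE)

# Piped version: reads left to right instead of inside out.
mean_age <- entries |>
  vapply(extract_age, integer(1), USE.NAMES = FALSE) |>
  mean()
```

All three give the same ages. The difference is mostly in how much bookkeeping you have to read through to see what the code is doing.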

So while this little project might have been somewhat silly, in the end I still think it was a good use of my time because I actually learned a lot and am already starting to use a lot of what I learned in my real work.  Plus, I learned that thing about Darwin’s tortoise, and that really makes the whole thing worth it, doesn’t it?

Practicing What I Preach: The Open PhD Experiment

(Note: this is an adapted version of a final paper I wrote for one of my classes. That’s why it’s so long!)

A few weeks ago, a researcher called my office to see if we could meet to discuss our shared interest in open data. I agreed, and a week later we were sitting in my office having a lively discussion about the many problems that currently hinder more widespread data sharing and reuse in biomedical research. When I mentioned that these topics would be the focus of my doctoral dissertation work, he expressed an interest in seeing some of my research. I replied that it was only my first semester, so I didn’t have much yet, but that I’d published a few papers on my previous research. “I don’t mean papers,” he said. “I mean your data, your code. If you’re doing a PhD on data sharing, don’t you think you should share your data, too?  In fact, why don’t you do an open PhD?”

Perhaps I should have immediately replied, “you’re absolutely right. I will do an open PhD.”  After all, on the face of it, this suggestion seems perfectly reasonable. My research, and in fact my entire career, revolves around the premise that researchers should share their data. It should be a no-brainer that I would also share my data. In principle, I have no problem with agreeing to do so, but in the real world of research, lofty ideals like service to the community and furthering science are sometimes abandoned in favor of more practical concerns, like getting one’s paper accepted or finishing one’s dissertation before other people have a chance to capitalize on the data.

So what I ended up telling this researcher was that I found his suggestion intriguing and I’d give it some serious thought. I have done just that in the intervening weeks, and here I will reflect on the reasons for my hesitation and explore the levels of openness I am prepared to take on in my doctoral program and my academic career.

My first (mis)adventure with data sharing

The first – and as yet only – time I shared my data was when I submitted an article to PLOS in 2014. PLOS was one of the first publishers to adopt an open data policy that required researchers to share the data underlying their manuscripts. I dutifully submitted my data to figshare, a popular, discipline-agnostic data repository, with the title “Biomedical Data Sharing and Reuse: Attitudes and Practices of Clinical and Scientific Research Staff.” To my surprise, someone at figshare took notice of my upload and tweeted out a link to my dataset. I could have sworn that I’d checked the box to keep the data private until I opted to officially release them, but when I’d gone back to fix a minor mistake in the title of the submission, the box must have gotten unchecked, and the status was changed to public.

After the tweet went out, I could see from the “views” counter that people were already looking at the data. Someone retweeted the link to the data, then another person, and another. The paper hadn’t even been reviewed by anyone yet, much less accepted for publication, but my data were out there for anyone to see, with the link spreading across Twitter. The situation made me nervous. I was excited that people were interested in my data, but what were they doing with it?  The views counter ticked up steadily, and people were not just viewing, but actually downloading the dataset as well.

I finally received word from PLOS that they’d accepted the paper, but they asked for major revisions; Reviewer 2 (it’s always Reviewer 2) was niggling over my statistical methods, and I was going to have to redo much of my work to respond to all the revision requests. During the revision process, I received an email from someone I’d never heard of, from an Eastern European country I can’t now recall. She had seen my data on figshare and she, too, wanted to write a paper on this topic. She asked me to send her a copy of my still-in-process paper, as well as a list of all relevant references I had found. The audacity of her request shocked me. Here was someone I’d never even met, telling me she wanted to use my data, write essentially the same paper as me, and she wanted me to give her my background research as well?  I wrote an email back, politely but firmly rebuffing her request, and I never heard from her again.

In the end, everything went fine: the paper was published and it has gone on to be cited seven times and featured in PLOS’s new Open Data collection (PLOS Collections 2016). I do still believe that researchers, particularly those whose work is supported by taxpayers’ money, have a responsibility to share their data when doing so will not violate their human subjects’ privacy. However, my own experience demonstrated to me that sharing research data cannot be viewed as a black and white proposition, that you share and are “good,” or you don’t and you are “bad.” Rather, many researchers have real, valid concerns about how they share their data, when, and with whom. Though my reasons probably differ from those of many other researchers, I have my own concerns that give me pause when it comes to the idea of an “open PhD.”

  1. I don’t think my data would be useful or interesting to anyone else.

Some datasets have near infinite value, with uses that extend far beyond the expertise or disciplinary affiliation of their original collector. New computational methodologies and analytic techniques make it possible to uncover previously undetected meaning in datasets or “mash up” disparate datasets to detect novel connections between seemingly unrelated phenomena. The ability to quickly, easily, and cheaply share massive amounts of data means that researchers around the world are able to make life-saving discoveries. For example, the National Cancer Institute’s Cancer Genomics Cloud Pilot program allows researchers to connect to cancer genome data and perform complex analyses on cloud computing platforms more powerful than any computers they could buy for their lab (National Cancer Institute Center for Biomedical Informatics & Information Technology 2016). Projects like this are exciting – they could bring about cures for cancer and vastly improve our lives. Few people would dispute that sharing these kinds of datasets is important.

By comparison, my data just look silly. Personally, I find my research fascinating. I could spend hours talking about biomedical scientists’ research data sharing and reuse practices. However, I don’t flatter myself that others are clamoring to see all the thrilling survey data and titillating interview transcriptions I have collected. Beyond validating the results in my article, I see little value for these data. Of course, I have made the argument that data can have unexpected uses that their original collectors could never have imagined, so I am prepared to admit that my data may have usefulness beyond what I would expect. Perhaps I should take the 252 views and 37 downloads of my figshare dataset as evidence that my data are of interest to more people than I might expect.

  2. I’m often embarrassed by my amateurish ways.

I’m a fan of GitHub, a site where you can share your code and allow others to collaboratively contribute to your work, but I’m also terrified of it. I spend a very significant amount of time at my job working with R, my programming language of choice; I teach it, I consult on it, and I use it for my own research. I like to think I know what I’m doing, but in all honesty, I’m pretty much entirely self-taught in R and, though I’m a quick study, I haven’t been using it for that long. I am far from an expert, and I often write code that makes this fact obvious.

Recently I wrote some R code related to a research project I hope to submit for publication soon. The work involved downloading the full text of over 60,000 articles, but since the server’s interface only allowed downloading a thousand articles at a time, I needed to write code that would download the allowed amount, then repeat itself 60 times, updating the article numbers after each iteration. I spent hours trying to figure out the best way to do it, but everything I tried failed. I could download a thousand at a time, then manually update the numbers in the code and re-run it, but doing this 60 times would have been time-consuming. In a throwing-up-your-hands moment of frustration, I wrote a command that would essentially just write those 60 lines of code for me, then ran all 60 lines.

Frankly, this approach was idiotic. Anyone who knows the first thing about programming would scoff at my code, and rightly so. However, at the time, this slipshod approach was the best I could come up with. It’s not just code that may reveal that I don’t always know what I’m doing; the more open the research process, the more opportunity for others to see the unpolished, imperfect steps that lie beneath the shiny surface of the perfected, word-smithed article.
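For what it’s worth, the batching problem described above (60 requests of a thousand articles each) can be sketched in a few lines. This is only an illustration under loud assumptions: fetch_batch() is a hypothetical stand-in that returns the article numbers it was asked for, so the sketch runs without any server, and 60,000 is a round figure rather than the real article count.

```r
# Hypothetical stand-in for the real request; it only returns the article
# numbers it was asked for, so the sketch runs without a server.
fetch_batch <- function(first, last) {
  seq(first, last)
}

total_articles <- 60000  # round figure, assumed for illustration
batch_size     <- 1000   # the most the server allows per request

# Start of each batch: 1, 1001, 2001, ..., 59001 (60 values).
starts <- seq(1, total_articles, by = batch_size)

batches <- lapply(starts, function(first) {
  # min() keeps the final batch from running past the last article.
  last <- min(first + batch_size - 1, total_articles)
  fetch_batch(first, last)
})

length(batches)               # 60
articles <- unlist(batches)   # everything combined into one vector
```

The sequence of starting numbers does the job of manually updating the article numbers, and lapply() does the repeating. In real use, a short pause between requests (Sys.sleep()) would be a polite addition.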

  3. It takes time to prepare data for broader consumption.

When I teach data management classes for researchers, I emphasize how good data management practices will make submitting their data at the end of the process easy, practically effortless. Of course, having your data perfectly ready to share without any extra effort at the end of your project is about as likely as jumping out of bed and looking good enough to head off to work without taking any time to freshen up. For example, part of my to-do list for preparing the article for the project I described above for publication is figuring out how to actually write that code the right way, so I can share it without fear of being humiliated. Getting my data, code, writing, or any other scholarly output I produce into the kind of shape it would need to be for me to be willing to put my name on it takes time. When I’m already trying to manage a demanding full-time job with a doctoral program and somehow still find the time to enjoy some sort of leisure every now and then, polishing up something to get it ready for sharing doesn’t often take enough priority to make it onto my daily schedule.

A compromise: the open-ish PhD

Though I’ve just spent five pages expounding on the reasons I cannot do a fully open PhD, I am prepared to compromise. The ideal the researcher urged me toward in our original conversation – don’t wait for your dissertation, share your data now, get your code up on GitHub today! – may not be right for me, but I do believe it is feasible to find some way to share at least some of my scholarly output, if not in real time, then at least in a timely fashion. Therefore, I propose the following tenets of my open-ish PhD:

  • I will do my best to write code that I am reasonably proud of (or at least not actively ashamed of) and share it on GitHub. While I do not feel comfortable immediately sharing code that corresponds to projects I am actively pursuing and seeking to publish, I will at least share it upon publication. I will also share teaching-related code immediately on GitHub, especially since doing so provides a good model for the researchers I am teaching.
  • I will make a more concerted effort to share my scholarly writing not just in its final, polished form as journal articles, but also in more casual settings, such as on my blog. I am also interested in exploring pre-print servers like arXiv and bioRxiv as a means of more rapid dissemination of research findings in advance of formal journal article publication.
  • I will attempt to collect data in a more mindful and intentional way, recognizing that I am not simply collecting my data, but that the point of my efforts is to inform others in my scholarly and research communities. As a federal employee, the work that I conduct in my official capacity cannot be copyrighted because it belongs not to me, but to all the American people who pay my salary. As I go forward with my research, I will do my best to remember that I am doing it not merely to satisfy my curiosity or add to my CV, but to advance science, even in my own small way.

In the end, it probably doesn’t matter so much whether the final data I share are perfect, whether my code impresses other people with its efficiency and elegance, or whether something I write appears in Nature or on my little blog. What matters is making the effort to share, committing to the highest level of openness possible, and doing so publicly and visibly – essentially, leading by example. I can give lectures on the importance of data sharing and teach classes on open source tools until I’m blue in the face, but perhaps the most important thing I can do to convince researchers of the importance of sharing and reusing data is doing exactly that myself.

In defense of the live demo (despite its perils)

[Image: rstudio-bomb]

When RStudio crashes, it is not subtle about it.  You get a picture of an old-timey bomb and the succinct, blunt message “R encountered a fatal error.”  A couple hundred of my librarian friends and colleagues got to see it live during the demo I gave as part of a webinar I did for the Medical Library Association on R for librarians earlier today.  At first, I thought the problem was minor.  When I tried to read in my data, I got this error message:

Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'lib_data_example.csv': No such file or directory

It’s a good example of R’s somewhat opaque and not-super-helpful error messages, but I’ve seen it before and it’s not a big deal.  It just meant that R couldn’t find the file I’d asked for.  Most of the time it’s because you’ve spelled the file name wrong, or you’ve capitalized something that should be lower case.  I double checked the file name against the cheat sheet I’d printed out with all my code.  Nope, the file name was correct.  Another likely cause is that you’re in the wrong directory and you just need to set the working directory to where the file is located.  I checked that too – my working directory was indeed set to where my file should have been.  That was when RStudio crashed, though I’m still not sure exactly why that happened.  I assume RStudio did it just to mess with me.  🙂

I’m sure a lot of presenters would be pretty alarmed at this point, but I was actually quite amused.  People on Twitter seemed to notice:

Having your live demo crash is not very entertaining in and of itself, but I found the situation rather amusing because I had considered whether I should do a live demo and decided to go with it because it seemed so low risk.  What could go wrong?  Sure, live demos are unpredictable.  Websites go down, databases change their interface without warning (invariably they do this five minutes before your demo starts), software crashes, and so on. Still, the demo I was doing was really quite simple compared to a lot of the R I normally teach, and it involved using an interface I literally use almost every day.   I’ve had plenty of presentations go awry in the past, but this was one that I really thought had almost 0% chance of going wrong.  So when it all went wrong on the very first line of code, I couldn’t help but laugh.  It’s the live demo curse!  You can’t escape!

I’m sure most people who have spent any significant amount of time doing live demos of technology have had the experience of seeing the whole thing blow up.  I know a lot of librarians who avoid the whole thing by making slides with screen shots of what they would show and do sort of a mock demo.  There’s nothing wrong with that, and I can understand the inclination to remove the uncertainty of the live demo from the equation.  But despite their being fraught with potential issues, I’m still in favor of live demos – and in a sense, I feel this way exactly because of their unpredictability.

For one thing, it’s helpful for learners to see how an experienced user thinks through the process of troubleshooting when something goes wrong.  It’s just a fact that stuff doesn’t always work perfectly in real life.  If the people I’m teaching are ever actually going to use the tools I’m demonstrating, eventually they’re going to run into some problems.  They’re more likely to be able to solve those problems if they’ve had a chance to see someone work through whatever issues arise.  This is true for many different types of technologies and information resources, but especially so with programming languages.  Learning to troubleshoot is itself an essential skill in programming, and what better way to learn than to see it in action?

Secondly, for brand new users of a technology, watching an instructor give a flawless and apparently effortless demonstration can actually make mastery feel out of reach for them.  In reality, a lot of time and effort likely went into developing that demo, trying out lots of different approaches, seeing what works well and what doesn’t, and arriving at the “perfect” final demo.  I’m certainly not suggesting that instructors should do freewheeling demos with no prior planning whatsoever, but I am in favor of an approach that acknowledges that things don’t always go right the first time.  When I learned R, I would watch  tutorials by these incredibly smart and talented instructors and think, oh my gosh, they make this look so easy and I’m totally lost – I’m never going to understand how this works.  Obviously I don’t want to look like an unprepared and incompetent fool in front of a class, but hey, things don’t always go perfectly.  I’m human, you’re human, we’re all going to make mistakes, but that’s part of learning, so let’s talk about what went wrong and how we fix it.

By the way, in case you’re wondering what did actually go wrong in this instance, I had inadvertently moved the data file in the process of uploading it to my GitHub repo – I thought I’d made a copy, but I had actually moved the original.  I quickly realized what had happened, and I knew roughly where I’d put the file, but it was in some folder buried deep in my file structure that I wouldn’t be able to locate easily on the spot.  The quickest solution I could think of, which I quickly did off-screen from the webinar (thank you dual monitors), was to copy the data from the repo, paste it into a new CSV and quickly save it where the original file should have been.  It worked fine and the demo went off as planned after that.
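For learners who hit the same "cannot open the connection" error, the checks described above (is the file name right, and is R looking in the right folder?) can be run directly from the console. This is a minimal sketch using base R only. The file name comes from the error message, and read_checked() is a hypothetical helper, not something built into R.

```r
data_file <- "lib_data_example.csv"  # file name from the error message

getwd()                          # which folder is R looking in?
file.exists(data_file)           # FALSE means R cannot see the file from here
list.files(pattern = "\\.csv$")  # which CSV files are actually in this folder?

# Hypothetical helper: fail with a clearer message than
# "cannot open the connection" when the file is missing.
read_checked <- function(path) {
  if (!file.exists(path)) {
    stop("Cannot find '", path, "' in ", getwd(), call. = FALSE)
  }
  read.csv(path)
}
```

In this case file.exists() would have returned FALSE even though the name and working directory were both correct, which points straight at the real culprit: the file had been moved.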

Day One

Many years ago, when I got my first bachelor’s degree, my parents gave me a hard time because I didn’t want to walk at graduation.  It seemed like kind of a pain in the ass, to be honest.  A long ceremony, watching and waiting as lots of people I’d never met or even heard of got their diplomas, and then like 30 seconds of glory as my name was announced and I crossed the stage – collecting a blank piece of paper, because my actual diploma would be sent in the mail later, when it was confirmed I’d actually met the requirements.  Oh, and also, I’d have to pay a few hundred dollars to rent the graduation garb.  RENT it.  I didn’t even get to keep it!  Seriously?

In the end, I did not walk for my bachelor’s degree.  On the day I would have been at that ceremony, my mom and I were celebrating my graduation with a trip to New York City, which was quite fun and meant way more to me than any graduation ceremony could have.  I didn’t walk when I got my second bachelor’s either, or any of my master’s degrees.  I say this as humbly as possible: I have more formal education than most people I know, but the only graduation ceremony I’ve ever been to was for my high school diploma.

The fact is, I’ve been holding out.  I wanted the PhD.  I would walk for that.  I’ll admit, partly I wanted the incredible hat.  🙂  I mean, come on, look at the garb you get to wear when you get your PhD.  I will gladly pay hundreds of dollars to rent that!

How styling is that regalia?

And finally, finally, things came together, and tonight, I found myself sitting in my very first ever doctoral class.  I’m a PhD student!  I can hardly believe it.  I know I kept grinning like a crazy person throughout my first class.   I’m doing this degree part-time while I continue to work at my job full-time (for which I am tremendously grateful both to the iSchool at UMD for accepting part time PhD students and to my library, my director, and my boss for their absolutely incredible support for this undertaking). I know that’s going to be hard work, but I think I’m up for it.  And more than that, I think it’s worth it.

When I went to the PhD program orientation recently, our cohort of brand new students had some time to spend with the continuing PhD students in a no-holds barred, tell-all session.  While it was kind of a “what’s said in this room stays in this room” kind of session, one thing that I think I can safely say came out of it was the idea that you should get in the habit of writing every day.  I agree this is a good habit to get into, and I’m going to try to make sure that the writing I do translates to some real output – maybe work toward a paper, or maybe just something I say on this blog.  Obviously I won’t write here on the blog every day, but I’ll try to document this process, hopefully share some interesting things.

I also want to post this entry now because I’m sure I’ll look back on this several years from now when it’s all over, and maybe I’ll laugh, or cry, or just think how naive I was.  But for now, let’s just say, starting with Day One, I am thrilled, excited, honored, and lucky, and I am so thankful to all of the people who have helped me arrive at this place.

Now let’s earn that incredible 8-pointed PhD hat!

Who Am I? The Identity Crisis of the Librarian/Informationist/Data Scientist

More and more lately, I’m asked the question “what do you do?” This is a surprisingly difficult question to answer.  Often, how I answer depends on who’s asking – is it someone who really cares or needs to know? – and how much detail I feel like going into at the moment when I’m asked.  When I’m asked at conferences, as I was quite a bit at FORCE2016, I try to be as explanatory as possible without getting pedantic, boring, or long-winded.  My answer in those scenarios goes something like “I’m a data librarian – I do a lot of instruction on data science, like R and data visualization, and data management.”  When I’m asked in more social contexts, I hardly even bother explaining.  Depending on my mood and the person who’s asking, I’ll usually say something like data scientist, medical librarian, or, if I really don’t feel like talking about it, just librarian.  It’s hard to know how to describe yourself when you have a job title that is pretty obscure: Research Data Informationist.  I would venture to guess that 99% of my family, friends, and even work colleagues have little to no idea what I actually spend my days doing.

In some regards, that’s fine.  Does it really matter if my mom and dad know what it means that I’ve taught hundreds of scientists R? Not really (they’re still really proud, though!).  Do I care if my date has a clear understanding of what a data librarian does?  Not really.  Do I care if a random person I happen to chat with while I’m watching a hockey game at my local gets the nuances of the informationist profession?  Absolutely not.

On the other hand, there are often times that I wish I had a somewhat more scrutable job title.  When I’m talking to researchers at my institution, I want them to know what I do because I want them to know when to ask me for help.  I want them to know that the library has someone like me who can help with their data science questions, their data management needs, and so on.  I know it’s not natural to think “library” when the question is “how do I get help with finding data” or “I need to learn R and don’t know where to start” or “I’d like to create a data visualization but I have no idea how to do it” or any of the other myriad data-related issues I or my colleagues could address.

The “informationist” term is one that has a clear definition and a history within the realm of medical librarianship, but I feel like it has almost no meaning outside of our own field.  I can’t even count the number of weird variations I’ve heard on that title – informaticist, informationalist, informatist, and many more.  It would be nice to get to the point that researchers understood what an informationist is and how we can help them in their work, but I just don’t see that happening in the near future.

So what do we do to make our contributions and expertise and status as potential collaborators known?  What term can we call ourselves to make our role clear?  Librarian doesn’t really do it, because I think people have a very stereotypical and not at all correct view of what librarians do, and it doesn’t capture the data informationist role at all.  Informationist doesn’t do it, because no one has any clue what that means.  I’ve toyed with calling myself a data scientist, and though I do think that label fits, I have some reservations about using that title, probably mostly driven by a terrible case of imposter syndrome.

What’s in a name?  A lot, I think.  How can data librarians, informationists, library-based data scientists, whatever you want to call us, communicate our role, our expertise, our services, to our user communities?  Is there a better term for people who are doing this type of work?

Some ponderings on #force2016 and open data

I’m attending FORCE2016, which is my first FORCE11 conference after following this movement (or group?) for a while, and I have to say, this is one interesting, thought-provoking conference.  I haven’t been blogging in a while, but I felt inspired to get a few thoughts down after the first day of FORCE2016:

  • I love the interdisciplinarity of this conference, and to me, that’s what makes it a great conference to attend.  In our “swag bag,” we were all given a “passport” and could earn extra tickets for getting signatures of attendees from different disciplines and geographic locations.  While free drinks are of course a great incentive, I think the fact that we have so many diverse attendees at this conference is a draw on its own.  I love that we are getting researchers, funders, publishers, librarians, and so many other stakeholders at the table, and I can’t think of another conference where I’ve seen this many different types of people from this many countries getting involved in the conversation.
  • I actually really love that there are so few concurrent sessions.  Obviously, fewer concurrent sessions means fewer voices joining the official conversation, but I think this is a small enough conference that there are ways to be involved, active, and vocal without necessarily being an invited speaker.  While I love big conferences like MLA, I always feel pulled in a million different directions – sometimes literally, like last year when I was scheduled to present papers at two different sessions during the same time period.  I feel more engaged at a conference when I’m seeing mostly the same content as others.  We’re all on the same page and we can have better conversations.  I also feel more engaged in the Twitter stream.  I’m not trying to follow five, ten, or more tweet streams at once from multiple sessions.  Instead, I’m seeing lots of different perspectives and ideas and feedback on one single session.  I like us all being on the same page.

Now, those are some positives, but I do have to bring it down with one negative from this conference, and that is that I think it’s hard to constructively talk about how to encourage sharing and open science when you have a whole conference full of open science advocates.  I do not in any way want to disparage anyone because I have a lot of respect for many of the participants in the session I’m talking about, but I was a little disappointed in the final session today on data management.  I loved the idea of an interactive session (plus I heard there would be balloons and chocolate, so, yeah!) and also the idea of debate on topics in data sharing and management, since that’s my jam.  I did debate in high school, so I can recognize the difficulty but also the usefulness of having to argue for a position with which you strongly disagree.  There’s real value in spending some time thinking about why people hold positions that are in opposition to your strongly held position.  And yeah, this was the last session of a long day, and it was fun, and it had popping of balloons, and apparently some chocolate, and whatnot, but I am a little disappointed at what I see as a real missed opportunity to spend some time really discussing how we can address some of the arguments against data sharing and data management.  Sure, we all laughed at the straw men that were being thrown out there by the teams who were being called upon to argue in favor of something that they (and all of us, as open science advocates) strongly disagreed with.  But I think we really lost an opportunity to spend some time giving serious thought to some of the real issues that researchers who are not open science advocates actually raise.  Someone in that session mentioned the open data excuses bingo page (you can find it here if you haven’t seen it before).  Again, funny, but SERIOUSLY I have actually had real researchers say ALL of these things, except for the thing about terrorists.
I will reiterate that I know and respect a lot of people involved with that session and I’m not trying to disparage them in any way, but I do hope we can give some real thought to some of the issues that were brought up in jest today.  Some of these excuses, or complaints, or whatever, are actual, strongly-held beliefs of many, many researchers.  The burden is on us, as open science advocates, to demonstrate why data sharing, data management, and the like are tenable positions and in fact the “correct” choice.

Okay, off my soap box!  I’m really enjoying this conference, having a great time reconnecting with people I’ve not seen in years, and making new connections.  And Portland!  What a great city. 🙂

To keep or not to keep: that is the question

I recently read an article in The Atlantic about people who are compulsive declutterers – the opposite of hoarders – who feel compelled to get rid of all their possessions. I’m more on the side of hoarding, because I always find myself thinking of eventualities in which I might need the item in question.  Indeed, it has often been the case that I will think of something I got rid of weeks or even years later and wish I still had it: a book I would have liked to reference, a piece of clothing I would have liked to wear, a receipt I could have used to take something back.  Of course, I don’t have unlimited storage space, so I can’t keep all this stuff.  The question of what to keep and for how long is one that librarians think about when it comes to weeding: deciding which parts of the collection to deaccession, or basically, get rid of.  There are evidence-based, tried-and-true ways of thinking about weeding a library collection, but that’s not so much true when it comes to data.  How is a scientist to decide what to keep and what not to keep?

I know this is a question that researchers are thinking about quite a bit, because I get more emails about this than almost any other issue.  In fact, I get emails not only from users of my own library, but also from researchers all over the country who have somehow found my name.  What exactly do I need to keep?  If I have electronic records, do I need to keep a print copy as well?  How many years do I need to keep this stuff?  These are all very reasonable questions, and it would be nice to be able to say, yes, there is an answer and it is…!  But it’s almost never so easy to point to a single answer.

A case in point: a couple years ago, I decided to teach a class about data preservation and retention.  In my naivete, I thought it would be nice to take a look through all the relevant policy and find the specific number of years that research data is required to be retained.  I read handbooks and guides.  I read policy documents from various agencies.   I even read the U.S. Code (I do not recommend it).  At the end of it, I found that not only is there not a single, definitive, policy answer to how long funded research data should be retained, but there are in fact all sorts of contradictory suggestions.  I found documents giving times from 3 years to 7 years to the super-helpful “as long as necessary.”

This may be difficult to answer from a policy perspective, but I think answering this from a best practices perspective is even trickier.  Let’s agree that we just can’t keep everything – storing data isn’t free, and it takes considerable time and effort to ensure that data remain accessible and usable.  Assuming that some stuff has to get thrown away, how do we distinguish trash from treasure, especially given the old adage about how the former might be the latter to others?  It’s hard to know whether something that appears useless now might actually be useful and interesting to someone in the future.  To take this to the extreme, here’s an actual example from a researcher I’ve worked with: he asked how he could have his program automatically discard everything in the thousandth place from his measurements.  In other words, he wanted 4.254 to be saved as 4.25.  I told him I could show him how, but I asked why he wanted to do this.  He told me that his machine was capable of measuring to the thousandth, but the measurement was only scientifically relevant to the hundredth place.  To scientists right now, 4.254 and 4.252 were essentially indistinguishable, so why bother with the extra noise of the thousandth place?  Fair point, but what about 5 years from now, or 10 years from now?  If science evolves to the point that this extra level of precision is meaningful, tomorrow’s researchers will probably be a little annoyed that today’s researchers had that measurement and just threw it away.  But then again, how can we know now when, or even if, that level of precision will be wanted?  For that matter, we can’t even say for sure whether this dataset will be useful at all.  Maybe a new and better method for making this measurement will be developed tomorrow, and all this stuff we gathered today will be irrelevant.  But how can we know?
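One way to get the tidier number without throwing the measurement away is to keep the raw value as recorded and store the rounded value alongside it.  Here is a minimal sketch in R; the data frame and column names are made up for illustration, and this is not the researcher’s actual program.

```r
# Hypothetical measurements recorded to the thousandth place
measurements <- data.frame(raw = c(4.254, 4.252, 3.917))

# round() keeps two decimal places; the raw column is left untouched
measurements$reported <- round(measurements$raw, 2)

# To drop the extra digit rather than round it, scale up, trunc(), and
# scale back down (this simple form is meant for positive values)
measurements$truncated <- trunc(measurements$raw * 100) / 100

print(measurements)
```

The design choice is simply to add columns instead of overwriting: the thousandth place stays in the data in case it turns out to matter later.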

These are all questions that I think are not easy to answer right now, but that people within research communities should be thinking about.  For one thing, I don’t think we can give one simple answer to how long data should be retained.  For one type of research, a few years may be enough.  For other fields, where it’s harder to replicate data, maybe we need to keep it in perpetuity.  When it comes to deciding what should be retained and what should be discarded, I think that answers cannot be dictated by one-size-fits-all policies and that subject matter experts and information professionals should work together to figure out good answers for specific communities and specific data.  Eventually, I suppose we’ll probably have some of those well-defined best practices for data retention in the same way that we have those best practices from collection management in libraries.  Until then, keep your crystal balls handy. 🙂

See One, Do One, Teach One: Data Science Instruction Edition

In medical education, you’ll often hear the phrase “see one, do one, teach one.” I know this not because I’m a medical librarian, but because I watched ER religiously when I was in high school. 🙂  To put it simply, to learn to do a medical procedure, you first watch a seasoned clinician doing the procedure, then you do it yourself with guidance and feedback, and then you teach someone else how to do it.  While I’m not learning how to do medical procedures, I think this same idea applies to learning anything, really, and it’s actually how I’ve learned to do a lot of the cool things I’ve picked up in the last couple of years in my work at my current library.

Being sort of a Data Services department of one, I tend to put a lot of emphasis on instruction.  There are many thousands of researchers at my institution, but only one of me.  I can’t possibly help all of them one on one, so doing a hybrid in-person/webinar session that can reach lots and lots of people is a good use of my time.  I would have to go back to look at my statistics, but I don’t think I’d be too far off base if I said I’ve taught 200 people how to use R in the last year, which I think is a pretty effective use of my time!  Even better for me, teaching R has enabled me to learn way more than I would have on my own.  This time a year ago, I don’t think I could do much of anything with R, but with every class I teach, I learn more and more, and thus become even more prepared to teach it.

When I came to my library two years ago, I had some ideas about what I thought people should know about data management, but I figured I should collect some data about it (I mean, obviously, right?).  We did a survey.  I got my data and analyzed them to see what topics people were most interested in.  I put on classes on things like metadata, preservation, and data sharing, but the attendance wasn’t what I thought it would be based on the numbers from my survey.  Clearly something about my approach wasn’t reaching my researchers.  That’s when I decided to focus less on what I thought people should know and look at the problems they were really having.  Around the same time, I was starting to learn more about data science, and specifically R, and I realized that R could really solve a lot of the problems that people had.  Plus, people were interested in learning it.  Lots more people would show up for a class on R than they would for a class on metadata (sad, but true).

The only problem was, I didn’t think I knew R well enough to teach it.  What if really experienced people showed up and started calling me out on my inexperience, or asking questions I didn’t know the answer to?  I was really nervous about teaching an R class the first time, but I decided that I could make it manageable by biting off a little chunk.  I scheduled a class on making heatmaps in R, which was something I knew a lot of people wanted to learn.  Mind you, when I scheduled this class, I did not myself know how to make a heatmap in R.  But I put it on the instruction calendar, it went up on the website, and soon enough, I had not only a full class, but a waitlist.

Fortunately, there are many, many resources available for learning how to do things in R.  Lots of them are free.  That solved the “see one” problem.  Next, to “do one.”  I spend a long, long time putting together the hands-on exercises I create for my classes.  I try out lots of different things.  I mess around with the code and see what happens if I try things in different ways.  I try to anticipate what questions people might ask and experiment with my code so I have an answer.  Like, “what happens if you don’t put those spaces between everything in your code?” (answer, at least in R: nothing, it works fine with or without the spaces; I just like them in there because I can read it more easily).

My first few classes went well.  Sometimes people asked questions I didn’t know the answers to.  Even worse, sometimes I gave incorrect answers because I felt like I should say something even if I wasn’t really sure.  In one of the first classes I taught, someone asked whether = was equivalent to <- (the assignment operator) in R.  I’d seen <- used most often, but I thought I’d seen = used sometimes too, so I said something like, “uhhh, I don’t know, I mean, yeah, I think they’re the same, like, yeah, sure?”  A woman in the back row got really annoyed at that.  “They’re not the same at all,” she said, and I could feel myself turning bright red.  “That’s factually incorrect,” she added.  Shortly after that she got up and left in the middle of the class.  I was mortified, but the class still got good evaluations, so I figured it hadn’t been all bad.
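For anyone curious about that question, here is a small sketch of one place where the two operators behave differently.  At the top level of a script both of them assign, but inside a function call = names an argument while <- creates a variable.  The variable names are only for illustration.

```r
# At the top level, both forms assign a value
a <- 5
b = 5
identical(a, b)   # TRUE

# Inside a function call, = matches an argument by name...
mean(x = c(1, 2, 3))
exists("x")       # FALSE, unless you already had an x lying around

# ...whereas <- assigns in the calling environment, then passes the value on
mean(y <- c(4, 5, 6))
exists("y")       # TRUE: y now holds c(4, 5, 6)
```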

These days, I schedule my classes based on two things: is it something I think my researchers want to learn, and is it something I want to learn.  That first part is relatively easy to figure out – I just talk to people, a lot, and I implore them to give me feedback about what classes they want on my class evaluations.  On the whole, they do, and this is how I end up with probably 90% of the classes I offer.  Sometimes this leads to much trepidation on my part, as people ask for things that I worry I’m not going to be able to teach.  For example, people had been asking for a class on statistical analysis in R.  I’ve taken a few different statistics classes, but stats were still something that filled me with terror.  When I submit my own articles for publication, I’m overcome with fear that I’ve made some horrible mistake in my statistical analyses and that peer reviewers are going to rip my article apart.  Or worse, the peer reviewers will miss it, it’ll be published, and readers will rip me apart.  The thought of actually teaching a class on how to do this seemed like a ridiculous idea, yet it was what so many people wanted.

So I went ahead and scheduled the class.  A lot of people signed up.  I got some very thick textbooks on statistics and statistical analysis in R and I spent many hours learning about all of this.  I got some data, saw what sorts of examples would make sense to demonstrate.  I painstakingly wrote out my code in R markdown, with lots of comments, so that everything would be well-explained.  And then, the morning arrived when I was to give the class for the first time.  Probably it was for the best that it was a webinar.  I was teleworking, so I gave the webinar from my home office, wearing sweatpants and my favorite UCLA t-shirt, with some lovely roses my boyfriend had brought me on my desk and my trusty dog looking in through the French doors.  I went through my examples, talking about linear regression, and tests of independence, and all sorts of other things that, until I’d started to teach the webinar, I’d been very doubtful I had a good handle on.  But suddenly, I realized I kind of actually knew what I was talking about!  People typed their questions in the chat window and I knew the answers! When the two hours were up and I signed off, I felt good about it, and over the next few days, I got lots of emails from people thanking me for the great class, which was great, since my main goal had just been to not say anything too stupid. 🙂

Now, I don’t feel so nervous about offering some of these advanced classes.  It’s kind of exciting to have the opportunity to stretch myself to learn things that I think are interesting.  Plus, nothing will give you more incentive to learn something you’ve wanted to explore than committing yourself to teach a class on it!  I’ve learned so much about so many cool things because people have said, hey, can you teach me this, and I say, sure! then scramble off to my office and check the indices of all my R books to see where I can learn how to do whatever that thing is.

The point of all this is to say that, for me at least, the “teach one” part of the old mantra is perhaps something librarians should jump on when it comes to expanding library roles in data management and data science.  I’m very fortunate that I get to spend most of my time working on data and nothing else, so I recognize that not everyone can take a week to immerse themselves in statistics, but I do think that librarians can and should stretch themselves to learn new things that will benefit our patrons.

My other piece of advice, which is surely nothing new: when someone asks a question, don’t be afraid to say I don’t know.  I learned quickly from that whole “= is not the same as <-” business.  Now when someone asks a question and I don’t know the answer, I do one of two things.  If I can, I try it out in the code right then and there.  So if someone says something like, can you rearrange the order of those two things in your code? I’ll say, huh, I never thought about that – let’s find out, and then do just that.  Other times, the question is something complicated, like, how do I do this random thing?  In those cases, I’ll say, that’s a great question, and I don’t actually know the answer, but if you’ll send me an email after this so I have your contact info, I will find out and follow up with you.  I’ve said that at least once in every class I’ve taught in the last 6 months, and the number of times someone has actually followed up with me: none.  I think this is probably due to one of two reasons.  One, I really emphasize troubleshooting and how to find out how to learn to do things in R when I teach, so it’s very possible that the person goes off and finds the answer themselves, which is great.  Two, I think there are times when people pose an idle question because they’re just kind of curious, or they want to look smart in front of their peers, and they don’t follow up because the answer doesn’t really matter that much to them anyway.

So there you go!  That’s my philosophy of getting to learn how to do cool stuff with data in order to benefit my researchers. 🙂

R for libRarians: data analysis and processing

I heard from several people after I wrote my last post about visualization who were excited about learning the very cool things that R can do.  Yay!  That post only scratched the surface of the many, nearly endless, things that R can do in terms of visualization, so if that seemed interesting to you, I hope you will go forth and learn more!  In case I haven’t already convinced you of R’s awesomeness (no, I’m not a paid R spokesperson or anything), I have a little more to say about why R is so great for data processing and analysis.

When it comes to data analysis, most of the researchers I know are either using some fancypants statistical software that costs lots of money, or they’re using Excel.  As a librarian, I have the same sort of feelings for Excel as I do for Google: wonderful tool, great when used properly, but frequently used improperly in the context of research.  Excel is okay for some very specific purposes, but at least in my experience, researchers are often using it for tasks to which it is not particularly suited.  As far as the fancypants statistical software, a lot of labs can’t afford it.  Even more problematic, every single one I’m aware of uses proprietary file formats, meaning that no one else can see your data unless they too invest in that expensive software.  As data sharing is becoming the expectation, having all your data locked in a proprietary format isn’t going to work.

Enter R!  Here are some of the reasons why I love it:

  • R is free and open source.  It’s supported by a huge community of users who are generally open to sharing their code.  This is great because those of us who are not programmers can take advantage of the work that others have already done to solve complex tasks.  For example, I had some data from a survey I had conducted, mostly in the form of responses to Likert-type scale questions.  I’m decidedly not a statistician and I was really not sure exactly how I should analyze these questions.  Plus, I wanted to create a visualization and I wasn’t entirely sure how I wanted it to look.  I suspected someone had probably already tackled these problems in R, so I Googled “R likert.”  Yes!  Sure enough, someone had already written a package for analyzing Likert data, aptly called likert.  I downloaded and installed the package in under a minute, and it made my data analysis so easy.  Big bonus: R can generally open files from all of those statistical software programs.  I saved the day for some researchers when the data they needed was in a proprietary format, but they didn’t want to pay several thousands of dollars to buy that program, and I opened the data in like 5 seconds in R.
  • R enhances research reproducibility. Sure, there are a lot of things you can do in Excel that you can do in R.  I could open an Excel spreadsheet and do, for example, a find and replace to change some values of something.  I could probably even do some fairly complex math and even statistics in Excel if I really knew what I was doing.  However, nothing I do here is going to be documented.  I have no record explaining how I changed my data, why I did things the way I did, and so on.  Case in point number 1: I frequently work on processing data that had been shared or downloaded from a repository to get it into the format that researchers need.  They tell me what kind of analysis they want to do, and the specifications they need the data to meet, and I can clean everything up for them much more easily than they could.  Before I learned R, this took a long time, for one thing, but I also had to document all the changes I made by hand. I would keep Word documents that painstakingly described every step of what I had done so I had a record of it if the researchers needed it.  It was a huge pain and ridiculously inefficient.  With R, none of that is necessary.  I write an R script that does whatever I need to do with the data.  Not only does R do it faster and more efficiently than Excel might, if I need a record of my actions, I have it all right there in the form of the script, which I can save, share, come back to when I completely forgot what I did 6 months later, and so on.  Another really nice point in this same vein is that R doesn’t do anything to your original file, or your raw data, unless you explicitly tell it to write over that file.  If you change something up in Excel, save it, and then later realize you messed up, you’re out of luck if you’re working on your copy of the raw data.  That doesn’t happen with R, because R pulls the data, whatever that may be, into your computer’s working memory and sort of keeps its own copy there.  That means I can go to town doing all sorts of crazy stuff with the data, experiment and mess around with it to my heart’s content, and my raw data file is never actually touched.
  • Compared to some other solutions, R is a workhorse. I suspect some data scientists would disagree with me characterizing R as a workhorse, which is why I qualified that statement.  R is not a great solution for truly big data.  However, it can handle much bigger data than Excel, which will groan if you try to load a file with several hundred thousand records and break if you try to load more than a million.  By comparison, this afternoon I loaded a JSON file with 1.5 million lines into R and it took about a minute.  So, while it may not be there yet in terms of big data, I think R is a nice solution for small to medium data.  Besides that, I think learning R is very pragmatic, because once you’ve got the basics down, you can do so many things with it.  Though it was originally created as a statistical language, you can do almost anything you can think of to/with data using R, and once you’ve got the hang of the basic syntax, you’re really set to branch out into a lot of really interesting areas.  I talked in the last post about visualization, which I think R really excels at.  I’m particularly excited about learning to use R for machine learning and natural language processing, which are two areas that I think are going to be particularly important in terms of data analysis and knowledge discovery in the next few years.  There’s a great deal of data freely available, and learning skills like some basic R programming will vastly increase your ability to get it, interact with it, and learn something interesting from it.
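As a rough illustration of the reproducibility point in the list above, here is a minimal sketch of a cleaning script.  The file and column names are hypothetical, and the “raw” file is created in a temporary folder only so that the example runs on its own.  R works on an in-memory copy of the data, so the raw file changes only if you explicitly write over it.

```r
# Create a stand-in "raw" data file (hypothetical survey data)
raw_path <- file.path(tempdir(), "survey_raw.csv")
writeLines(c("id,response", "1,yes", "2,Yes", "3,NO"), raw_path)

# Read the data into memory; from here on we work on R's copy
survey <- read.csv(raw_path, stringsAsFactors = FALSE)

# Every cleaning step is a line of code, so the script itself is the record
survey$response <- tolower(survey$response)   # standardize capitalization

# Save the cleaned data to a NEW file and leave the raw file alone
clean_path <- file.path(tempdir(), "survey_clean.csv")
write.csv(survey, clean_path, row.names = FALSE)

# The raw file still has its original contents
readLines(raw_path)
```

Six months later, rerunning this script on the same raw file should reproduce the same cleaned file, which is the whole point.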

I should add that there are many other scripting languages that can accomplish many of the same things as R.  I highlight R because, in my experience, it is the most approachable for non-programmers and thus the most likely to appeal to librarians, who are my primary audience here.  I’m in the process of learning Python, and I’m at the point of wanting to bang my head against a wall with it.  R is not necessarily easy when you first get started, but I felt comfortable using it with much less effort than I expected it would take.  Your mileage may vary, but for the effort to payoff ratio I got, I absolutely think that my time spent learning R was well worth it.

R for libRarians: visualization

I recently blogged about R and how cool it is, and how it’s really not as scary to learn as many novices (including myself, a few years ago) might think.  Several of my fellow librarians commented, or emailed, to ask more about how I’m using R in my library work, so I thought I would take a moment to share some of those ideas here, and also to encourage other librarians who are using R (or related languages/tools) to jump in and share how you’re using it in your library work.

I should preface this by saying I don’t do a lot of “regular” library work anymore – most of what I do is working with researchers on their data, teaching classes about data, and collecting and working with my own research data.  However, I did do more traditional library things in the past, so I know that these kinds of skills would be useful.  In particular, there are three areas where I’ve found R to be very useful: visualization, data processing (or wrangling, or cleaning, or whatever you want to call it), and textual analysis.  Because I could say a lot about each of these, I’m going to do this over several posts, starting with today’s post on visualization.

Data visualization is one of my new favorite things to work on, and by far the tool I use most is R, specifically the ggplot2 package.  This package utilizes the concepts outlined in Leland Wilkinson’s Grammar of Graphics, which takes visualizations apart into their individual components.  As Wilkinson explains it,  “a language consisting of words and no grammar expresses only as many ideas as there are words. By specifying how words are combined in statements, a grammar expands a language’s scope…The grammar of graphics takes us beyond a limited set of charts (words) to an almost unlimited world of graphical forms (statements).”  When I teach ggplot2, I like to say that the kind of premade charts we can create with Excel are like the Dr. Seuss of visualizations, whereas the complex and nuanced graphics we can create with ggplot2 are the War and Peace.

For example, I needed to create a graph for an article I was publishing that showed how people had responded to two questions: basically, how important they felt a task was to their work, and how good they thought they were at that task.  I was not just interested in how many people had rated themselves in each of the five bins in my Likert scale, so a histogram or bar chart wouldn’t capture what I wanted.  That would show me how people had answered each question individually, but I was interested in showing the distribution of combinations of responses.  In other words, did people who said that a task was important to them have a correspondingly high level of expertise? I was picturing something sort of like a scatterplot, but with each point (i.e., each combination of responses) sized according to how many people had responded with that combination.  I was able to do exactly this with ggplot2:

This was exactly what I wanted, and not something that I could have created with Excel, because it isn’t a “standard” chart type.  Not only that, but since everything was written in code, I was able to save it so I had an exact record of what I did (when I get back to my work computer, instead of my personal one, I will get the file and actually put that code here!).  It was also very easy to go back and make changes.  In the original version, I had the points sized by actual number of people who had responded, but one of the reviewers felt this was potentially confusing because of the disparity in the size of each group (110 scientific researchers, but only 21 clinical researchers).  I was asked to change the points to show percent of responses, rather than number of responses, and this took just one minor change to the code that I could accomplish in less than a minute.
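To be clear, what follows is not the code from the article.  It is a minimal sketch of the general approach, using a small made-up data set and hypothetical column names: count each combination of responses within each group, convert the counts to percentages, and map point size to one or the other.

```r
# Made-up responses on two 5-point Likert items for two groups
responses <- data.frame(
  group      = c("scientific", "scientific", "scientific", "scientific",
                 "clinical", "clinical"),
  importance = c(5, 5, 4, 2, 5, 3),
  expertise  = c(4, 4, 4, 2, 5, 3)
)

# Count how many people gave each combination of answers, per group
combos <- as.data.frame(table(responses), responseName = "n",
                        stringsAsFactors = FALSE)
combos <- combos[combos$n > 0, ]

# Percent of each group's responses, so groups of different sizes compare fairly
combos$pct <- 100 * combos$n / ave(combos$n, combos$group, FUN = sum)

# Draw the chart if ggplot2 is installed; switching size = pct to size = n
# is the one small change between the percent and raw-count versions
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(combos, aes(x = importance, y = expertise, size = pct)) +
    geom_point() +
    facet_wrap(~group)
  print(p)
}
```

Because the counts and percentages live in the same data frame, moving between the two versions of the chart only means changing which column is mapped to size.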

I also like ggplot2 for creating highly complex graphics that demonstrate correlations in multivariate data sets.  When I’m teaching, I like to use the sample data set that comes with ggplot2, which has info about roughly 54,000 diamonds, with 10 variables, including things like price, cut, color, carat, clarity, and so on.  How is price determined for these diamonds?  Is it simply a matter of size – the bigger it is, the more it costs?  Or do other variables also contribute to the price?  We could do some math to find out the actual answer, but we could also quickly create a visualization that maps out some of these relationships to see if some patterns start to emerge.
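
For a quick look at that sample data before plotting anything, here is a short sketch.  It assumes the built-in diamonds data set, which becomes available as soon as ggplot2 is loaded.

```r
library(ggplot2)

# diamonds ships with ggplot2; each row describes one diamond.
dim(diamonds)          # number of rows and number of columns
names(diamonds)        # the variable names
levels(diamonds$cut)   # cut is an ordered factor, running from Fair up to Ideal
```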

First, I’ll create a scatterplot of my diamonds, with price on the x-axis and carat on the y-axis.  Here it is, with the code to create it below:

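library(ggplot2)
diam <- diamonds  # assumption: diam is a copy of ggplot2's built-in diamonds data
a <- ggplot(diam, aes(x = price, y = carat)) +
  geom_point() +
  geom_abline(slope = 0.0002656748, intercept = 0, col = "red")
a  # print the plot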

If there were a perfect relationship between price and diamond size, we would expect our points to cluster along the red line I’ve inserted here, which shows price rising in direct proportion to carat.  Clearly, that is not the case.  So we might propose that there are other variables that contribute to a diamond’s price.  If I really wanted to, I could actually demonstrate lots of variables in one chart.  For example, this sort of crazy visualization shows five different variables: price (x-axis), carat (y-axis), color (color of point, with darkest red being the best color grade, D, and lightest yellow being the worst, J), clarity (size of point, with smallest point being lowest quality clarity and largest point being highest quality clarity), and cut (faceted, with each of the five cut categories shown in its own chart).

library(RColorBrewer)  # provides brewer.pal()
ggplot(diam, aes(x = price, y = carat, col = color)) +
  geom_point(aes(size = clarity)) +
  scale_colour_manual(values = rev(brewer.pal(7, "YlOrRd"))) +
  facet_wrap(~cut, nrow = 1)

We’d have to do some more robust mathematical analysis of this to really get info about the various correlations here, but just glancing at this, I can see that there are definitely some interesting patterns and that this data might be worth looking into further.  And since I use ggplot2 quite a bit and am fairly proficient with it, this plot took me less than a minute to put together, which is exactly why I love ggplot2 so much.
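
As a rough idea of what a first pass at that math might look like, here is a minimal sketch.  It assumes the built-in diamonds data and uses a plain linear model, which is only one of many possible approaches and may well not be the best fit for prices like these.

```r
library(ggplot2)

# Pearson correlation between carat and price across the built-in diamonds data
r <- cor(diamonds$carat, diamonds$price)

# Two linear models: carat alone, then carat plus the three quality grades
fit_carat <- lm(price ~ carat, data = diamonds)
fit_full  <- lm(price ~ carat + cut + color + clarity, data = diamonds)

# Share of the variation in price that each model accounts for
r2_carat <- summary(fit_carat)$r.squared
r2_full  <- summary(fit_full)$r.squared
```

Comparing r2_carat with r2_full gives a rough sense of how much cut, color, and clarity add once size is already accounted for.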

You can probably see how you could use ggplot2 to create, as I’ve said, nearly infinitely customized charts and graphs.  To relate this back to libraries, you could create visualizations about your collection, your budget, or whatever other numbers you might want to visually display in a presentation or a publication.  There are also other R packages that let you create other types of visualizations.  I haven’t used it, but there’s a package called VennDiagram that lets you, well, make Venn diagrams – back in my days of teaching PubMed, I always used Venn diagrams to show how Boolean operators work, and this would allow you to make them really easily (I was always doing weird stuff with PowerPoint to try to make mine look right, and they never quite did).  There are also packages like ggvis and Shiny that let you create interactive visualizations that you could put on a website, which could be cool.  I’ve only just started to play around with these packages, so I don’t have any examples of my own, but you can see some examples of cool things that people have done in the Shiny Gallery.

So there you go!  I love R for visualizations, and I think it’s much easier to create nice-looking graphics with R than it is with Excel or PowerPoint, once you get the hang of it.  Now that I’ve heard from some other librarians who are coding, do any of you have other ideas about using R (or other languages!) for visualizations, or examples of visualizations you’ve created?

Some Additional Resources:

  • I teach a class on ggplot2 at my library – the handout and class exercises are on my Data Services libguide.
  • The help documentation for ggplot2 is quite thorough.  Looking at the various options, you can see how you can create a nearly infinite variety of charts and graphs.
  • If you’re interested in learning more about the Grammar of Graphics but don’t want to read the whole book, Hadley Wickham, who created ggplot2, has written a nice article, A Layered Grammar of Graphics, that captures many of the ideas.