R for libRarians: data analysis and processing

After I wrote my last post, about visualization, I heard from several people who were excited about learning the very cool things that R can do.  Yay!  That post only scratched the surface of the many, nearly endless, things that R can do in terms of visualization, so if that seemed interesting to you, I hope you will go forth and learn more!  In case I haven’t already convinced you of R’s awesomeness (no, I’m not a paid R spokesperson or anything), I have a little more to say about why R is so great for data processing and analysis.

When it comes to data analysis, most of the researchers I know are either using some fancypants statistical software that costs lots of money, or they’re using Excel.  As a librarian, I have the same sort of feelings for Excel as I do for Google: wonderful tool, great when used properly, but frequently used improperly in the context of research.  Excel is okay for some very specific purposes, but at least in my experience, researchers are often using it for tasks to which it is not particularly suited.  As for the fancypants statistical software, a lot of labs can’t afford it.  Even more problematic, every single one I’m aware of uses proprietary file formats, meaning that no one else can see your data unless they too invest in that expensive software.  As data sharing becomes the expectation, having all your data locked in a proprietary format isn’t going to work.

Enter R!  Here are some of the reasons why I love it:

  • R is free and open source.  It’s supported by a huge community of users who are generally open to sharing their code.  This is great because those of us who are not programmers can take advantage of the work that others have already done to solve complex tasks.  For example, I had some data from a survey I had conducted, mostly in the form of responses to Likert-type scale questions.  I’m decidedly not a statistician, and I really wasn’t sure exactly how I should analyze these questions.  Plus, I wanted to create a visualization and I wasn’t entirely sure how I wanted it to look.  I suspected someone had probably already tackled these problems in R, so I Googled “R likert.”  Yes!  Sure enough, someone had already written a package for analyzing Likert data, aptly called likert.  I downloaded and installed the package in under a minute, and it made my data analysis so easy.  (There’s a quick sketch of that kind of workflow just after this list.)  Big bonus: R can generally open files from all of those statistical software programs.  I once saved the day for some researchers whose data was locked in a proprietary format; they didn’t want to pay several thousand dollars for the program that created it, and I opened the data in R in about 5 seconds.
  • R enhances research reproducibility.  Sure, a lot of what you can do in R you could also do in Excel.  I could open an Excel spreadsheet and, for example, do a find and replace to change some values.  I could probably even do some fairly complex math and statistics in Excel if I really knew what I was doing.  However, nothing I do there is documented.  I have no record explaining how I changed my data, why I did things the way I did, and so on.  Case in point: I frequently process data that has been shared with us or downloaded from a repository to get it into the format researchers need.  They tell me what kind of analysis they want to do and the specifications the data needs to meet, and I can clean everything up for them much more easily than they could.  Before I learned R, this took a long time, and I also had to document every change by hand: I kept Word documents that painstakingly described each step of what I had done so there was a record of it if the researchers needed it.  It was a huge pain and ridiculously inefficient.  With R, none of that is necessary.  I write an R script that does whatever I need to do with the data.  Not only does R do it faster and more efficiently than Excel might, but if I need a record of my actions, I have it all right there in the form of the script, which I can save, share, and come back to when I’ve completely forgotten what I did six months later.  (The second sketch after this list shows what one of these little cleaning scripts looks like.)  Another really nice point in this same vein is that R never touches your original file, your raw data.  If you change something in Excel, save it, and later realize you messed up, you’re out of luck if you were working on your only copy of the raw data.  That doesn’t happen with R, because R pulls the data, whatever it may be, into your computer’s working memory and keeps its own copy there.  That means I can go to town doing all sorts of crazy stuff with the data, experiment and mess around with it to my heart’s content, and my raw data file is never actually touched.
  • Compared to some other solutions, R is a workhorse.  I suspect some data scientists would disagree with me characterizing R as a workhorse, which is why I qualified that statement.  R is not a great solution for truly big data.  However, it can handle much bigger data than Excel, which will groan if you try to load a file with several hundred thousand records and break outright past its limit of roughly a million rows.  By comparison, this afternoon I loaded a JSON file with 1.5 million lines into R and it took about a minute (the third sketch after this list shows how little code that takes).  So, while it may not be there yet in terms of big data, I think R is a nice solution for small to medium data.  Besides that, I think learning R is very pragmatic, because once you’ve got the basics down, you can do so many things with it.  Though it was originally created as a statistical language, you can do almost anything you can think of to or with data using R, and once you’ve got the hang of the basic syntax, you’re really set to branch out into a lot of interesting areas.  I talked in the last post about visualization, which I think R really excels at.  I’m particularly excited about learning to use R for machine learning and natural language processing, two areas that I think are going to be particularly important for data analysis and knowledge discovery in the next few years.  There’s a great deal of data freely available, and learning some basic R programming will vastly increase your ability to get it, interact with it, and learn something interesting from it.
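
To give a flavor of what that Likert workflow looks like, here is a minimal sketch rather than my actual survey code.  The file name is made up, and I’m assuming the responses arrive as an SPSS file in which every column is a question on the same five-point scale; the likert and haven packages and the functions shown are real.

# Install once, then load: likert for Likert-scale analysis, haven for SPSS/SAS/Stata files
install.packages(c("likert", "haven"))
library(likert)
library(haven)

# Hypothetical file name: survey responses saved in SPSS's proprietary .sav format
responses <- read_sav("survey_responses.sav")

# likert() wants a data frame of factors, one column per question,
# all sharing the same response scale
items <- as_factor(responses)
results <- likert(items)

summary(results)  # percentage breakdown for each item
plot(results)     # diverging stacked bar chart of the responses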
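
And here is the kind of thing I mean about an R script being its own documentation.  The file and column names below are invented for illustration, but the pattern is the point: the raw file is only ever read, and every change is spelled out in code.

# Illustrative sketch: file and column names are made up.
# The raw file is read into memory; nothing below ever modifies it on disk.
raw <- read.csv("raw_data.csv", stringsAsFactors = FALSE)

clean <- raw
clean$species[clean$species == "unknwon"] <- "unknown"   # fix a typo in a categorical field
clean$weight_kg <- clean$weight_g / 1000                 # convert units for the requested analysis
clean <- clean[!is.na(clean$weight_kg), ]                # drop records missing the key measurement

# Write the cleaned data to a NEW file; the original is untouched,
# and this script is the record of exactly what changed and why.
write.csv(clean, "clean_data.csv", row.names = FALSE)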
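
And for the curious, reading that big JSON file really is just a couple of lines.  This is a sketch with a made-up file name; it assumes the file is newline-delimited JSON, one record per line, which is what jsonlite’s stream_in() expects (for a single big JSON array, fromJSON() does the job instead).

library(jsonlite)

records <- stream_in(file("big_export.json"))  # hypothetical file name
nrow(records)                                  # how many records came in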

I should add that there are many other scripting languages that can accomplish many of the same things as R.  I highlight R because, in my experience, it is the most approachable for non-programmers and thus the most likely to appeal to librarians, who are my primary audience here.  I’m in the process of learning Python, and I’m at the point of wanting to bang my head against a wall with it.  R is not necessarily easy when you first get started, but I felt comfortable using it with much less effort than I expected it would take.  Your mileage may vary, but given the effort-to-payoff ratio I got, I absolutely think my time spent learning R was well worth it.

R for libRarians: visualization

I recently blogged about R and how cool it is, and how it’s really not as scary to learn as many novices (including myself, a few years ago) might think.  Several of my fellow librarians commented or emailed to ask more about how I’m using R in my library work, so I thought I would take a moment to share some of those ideas here, and also to encourage other librarians who are using R (or related languages/tools) to jump in and share how you’re using it in your library work.

I should preface this by saying I don’t do a lot of “regular” library work anymore – most of what I do is working with researchers on their data, teaching classes about data, and collecting and working with my own research data.  However, I did do more traditional library things in the past, so I know that these kinds of skills would be useful.  In particular, there are three areas where I’ve found R to be very useful: visualization, data processing (or wrangling, or cleaning, or whatever you want to call it), and textual analysis.  Because I could say a lot about each of these, I’m going to do this over several posts, starting with today’s post on visualization.

Data visualization is one of my new favorite things to work on, and by far the tool I use most is R, specifically the ggplot2 package.  This package utilizes the concepts outlined in Leland Wilkinson’s Grammar of Graphics, which takes visualizations apart into their individual components.  As Wilkinson explains it,  “a language consisting of words and no grammar expresses only as many ideas as there are words. By specifying how words are combined in statements, a grammar expands a language’s scope…The grammar of graphics takes us beyond a limited set of charts (words) to an almost unlimited world of graphical forms (statements).”  When I teach ggplot2, I like to say that the kind of premade charts we can create with Excel are like the Dr. Seuss of visualizations, whereas the complex and nuanced graphics we can create with ggplot2 are the War and Peace.

For example, I needed to create a graph for an article I was publishing that showed how people had responded to two questions: basically, how important they felt a task was to their work, and how good they thought they were at that task.  I was not just interested in how many people had rated themselves in each of the five bins in my Likert scale, so a histogram or bar chart wouldn’t capture what I wanted.  That would show me how people had answered each question individually, but I was interested in showing the distribution of combinations of responses.  In other words, did people who said that a task was important to them have a correspondingly high level of expertise?  I was picturing something sort of like a scatterplot, but with each point (i.e., each combination of responses) sized according to how many people had responded with that combination.  I was able to do exactly this with ggplot2:
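
A stripped-down sketch of that kind of plot, using simulated data in place of my real survey responses, looks something like this; geom_count() sizes each point by how many observations share a given combination of answers.

library(ggplot2)

# Simulated stand-in for my survey data: one row per respondent,
# two questions answered on a 1-5 scale
set.seed(1)
responses <- data.frame(
  importance = sample(1:5, 131, replace = TRUE),
  expertise  = sample(1:5, 131, replace = TRUE)
)

ggplot(responses, aes(x = importance, y = expertise)) +
  geom_count() +                    # one point per combination, sized by count
  scale_size_area(max_size = 12) +  # scale size by area so counts compare fairly
  labs(size = "Number of responses")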

This was exactly what I wanted, and not something that I could have created with Excel, because it isn’t a “standard” chart type.  Not only that, but since everything was written in code, I was able to save it so I had an exact record of what I did (when I get back to my work computer, instead of my personal one, I will get the file and actually put that code here!).  It was also very easy to go back and make changes.  In the original version, I had the points sized by actual number of people who had responded, but one of the reviewers felt this was potentially confusing because of the disparity in the size of each group (110 scientific researchers, but only 21 clinical researchers).  I was asked to change the points to show percent of responses, rather than number of responses, and this took just one minor change to the code that I could accomplish in less than a minute.

I also like ggplot2 for creating highly complex graphics that demonstrate correlations in multivariate data sets.  When I’m teaching, I like to use the sample data set that comes with ggplot2, which has info on roughly 54,000 diamonds, with 10 variables, including things like price, cut, color, carat, clarity, and so on.  How is price determined for these diamonds?  Is it simply a matter of size – the bigger it is, the more it costs?  Or do other variables also contribute to the price?  We could do some math to find out the actual answer, but we could also quickly create a visualization that maps out some of these relationships to see if some patterns start to emerge.

First, I’ll create a scatterplot of my diamonds, with price on the x-axis and carat on the y-axis.  Here it is, with the code to create it below:

library(ggplot2)
diam <- diamonds  # the sample data set that comes with ggplot2

ggplot(diam, aes(x = price, y = carat)) +
  geom_point() +
  geom_abline(slope = 0.0002656748, intercept = 0, col = "red")

If there were a perfect relationship between price and diamond size, we would expect our points to cluster along the red line I’ve inserted here, which demonstrates a 1:1 relationship.  Clearly, that is not the case.  So we might propose that there are other variables that contribute to a diamond’s price.  If I really wanted to, I could actually demonstrate lots of variables in one chart.  For example, this sort of crazy visualization shows five different variables: price (x-axis), carat (y-axis), color (color of point, with the darkest red being the best color grade and the lightest yellow the worst), clarity (size of point, with the smallest point being the lowest quality clarity and the largest point the highest), and cut (faceted, with each of the five cut categories shown in its own chart).

library(RColorBrewer)  # provides brewer.pal()

ggplot(diam, aes(x = price, y = carat, col = color)) +
  geom_point(aes(size = clarity)) +
  scale_colour_manual(values = rev(brewer.pal(7, "YlOrRd"))) +
  facet_wrap(~cut, nrow = 1)

We’d have to do some more robust mathematical analysis to really get info about the various correlations here, but just glancing at this, I can see that there are definitely some interesting patterns and that this data might be worth looking into further.  And since I use ggplot2 quite a bit and am fairly proficient with it, this plot took me less than a minute to put together, which is exactly why I love ggplot2 so much.

You can probably see how you could use ggplot2 to create, as I’ve said, nearly infinitely customized charts and graphs.  To relate this back to libraries, you could create visualizations about your collection, your budget, or whatever other numbers you might want to visually display in a presentation or a publication.  There are also other R packages that let you create other types of visualizations.  I haven’t used it, but there’s a package called VennDiagram that lets you, well, make Venn diagrams – back in my days of teaching PubMed, I used to always use Venn diagrams to show how Boolean operators work, and this would allow you to make them really easily (I was always doing weird stuff with Powerpoint to try to make mine look right, and they never quite did).  There are also packages like ggvis and Shiny that let you create interactive visualizations that you could put on a website, which could be cool.  I’ve only just started to play around with these packages, so I don’t have any examples of my own, but you can see some examples of cool things that people have done in the Shiny Gallery.
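
To give a sense of how simple that could be, here is a rough sketch based on the VennDiagram package’s documented functions (I haven’t actually used it in a class, and the search terms and counts below are made up):

library(VennDiagram)
library(grid)

# Hypothetical result counts for a PubMed-style Boolean search
grid.newpage()
draw.pairwise.venn(
  area1      = 1200,                          # results for "diabetes"
  area2      = 800,                           # results for "exercise"
  cross.area = 150,                           # results for "diabetes AND exercise"
  category   = c("diabetes", "exercise"),
  fill       = c("lightblue", "lightgreen")
)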

So there you go!  I love R for visualizations, and I think it’s much easier to create nice looking graphics with R than it is with Excel or Powerpoint, once you get the hang of it.  Now that I’ve heard from some other librarians who are coding, do any of you have other ideas about using R (or other languages!) for visualizations, or examples of visualizations you’ve created?

Some Additional Resources:

  • I teach a class on ggplot2 at my library – the handout and class exercises are on my Data Services libguide.
  • The help documentation for ggplot2 is quite thorough.  Looking at the various options, you can see how you can create a nearly infinite variety of charts and graphs.
  • If you’re interested in learning more about the Grammar of Graphics but don’t want to read the whole book, Hadley Wickham, who created ggplot2, has written a nice article, A Layered Grammar of Graphics, that captures many of the ideas.

So you think you can code

I’ve been thinking about many ideas lately dealing with data and data science (this is, I’m sure, not news to anyone).  I’ve also had several people encourage me to pick my blog back up, and I’ve recently made my den into a cute and comfy little office, so, why not put all this together and resume blogging with a little post about my thoughts on data!  In particular, in this post I’m going to talk about coding.

Early on in my library career, when I first got interested in data, I was talking to one of my first bosses and told her I thought I should learn R, which is essentially a scripting language, very useful for data processing, analysis, statistics, and visualization.  She gave me a sort of dubious look, and even as I said it, I was thinking in my head, yeah, I’m probably not going to do that.  I’m no computer scientist.  Fast forward a few years, and not only have I actually learned R, it’s probably the single most important skill in my professional toolbox.

Here’s the thing – you don’t have to be a computer scientist to code, especially in R.  It’s actually remarkably straightforward, once you get over the initial strangeness of it and get a feel for the syntax.  I started offering R classes around the beginning of this year, and I call my introductory classes “Introduction to R for Non-programmers.”  I had two reasons for selecting this name.  One, I had been using R for less than a year myself and didn’t (and still don’t) consider myself an expert.  When I started thinking about getting up in front of a room of people and teaching them to code, I had horrifying visions of experienced computer scientists calling me out on my relative lack of expertise, mocking my class exercises, or correcting me in front of everyone.  So, I figured, let’s set the bar low. 🙂  More importantly, I wanted to emphasize that R is approachable!  It’s not scary!  I can learn it, you can learn it.  Hell, young children can (and do) learn it.  Not only that, but you can learn it from one of a plethora of free resources without ever cracking a book or spending a dime.  All it takes is a little time, patience, and practice.

The payoff?  For one thing, you can impress your friends with your nerdy awesome skills!  (Or at least that’s what I keep telling myself.)  If you work with data of any kind, you can simplify your work, because using R (or another scientific programming language) is faaaaar more efficient than using point-and-click tools like Excel.  You can create super awesome visualizations, do crazy data analysis in a snap, and work with big huge data sets that would break Excel.  And you can do all of this for free!  If you’re a research and/or medical librarian, you will also make yourself an invaluable resource to your user community.  I believe I could teach an R class every day at my library and there would still be people showing up.  We regularly have waitlists of 20 or more people.  Scientists are starting to catch on to all the reasons I’ve mentioned above, but not all of them have the time or inclination to use one of the free online resources.  Plus, since I’m a real human person who knows my users and their research and their data, I know what they probably want to do, so my classes are more tailored to them.

I was introduced to Hadley Wickham yesterday; he’s a pretty big deal in the R world, as he created some very important R packages (kind of like apps).  My friend and colleague who introduced me said, “this is Lisa; she is our prototypical data scientist librarian.”  I know there are other librarian coders out there because I’m on mailing lists with some of them, but I’m not currently aware of any other data librarians or medical librarians who know R.  I’m sure there are others, and I would be very interested in knowing them.  And if it is fair to consider me a “prototype,” I wonder how many other librarians will be interested in becoming data scientist librarians.  I’m really interested in hearing from the librarians reading this – do you want to code?  Do you think you can learn to code?  And if not, why not?