I recently blogged about R and how cool it is, and how it’s really not as scary to learn as many novices (including myself, a few years ago) might think. Several of my fellow librarians commented, or emailed, to ask more about how I’m using R in my library work, so I thought I would take a moment to share some of those ideas here, and also to encourage other librarians who are using R (or related languages/tools) to jump in and share how you’re using it in your library work.
I should preface this by saying I don’t do a lot of “regular” library work anymore – most of what I do is working with researchers on their data, teaching classes about data, and collecting and working with my own research data. However, I did do more traditional library things in the past, so I know that these kinds of skills would be useful. In particular, there are three areas where I’ve found R to be very useful: visualization, data processing (or wrangling, or cleaning, or whatever you want to call it), and textual analysis. Because I could say a lot about each of these, I’m going to do this over several posts, starting with today’s post on visualization.
Data visualization is one of my new favorite things to work on, and by far the tool I use most is R, specifically the ggplot2 package. This package utilizes the concepts outlined in Leland Wilkinson’s Grammar of Graphics, which takes visualizations apart into their individual components. As Wilkinson explains it, “a language consisting of words and no grammar expresses only as many ideas as there are words. By specifying how words are combined in statements, a grammar expands a language’s scope…The grammar of graphics takes us beyond a limited set of charts (words) to an almost unlimited world of graphical forms (statements).” When I teach ggplot2, I like to say that the kind of premade charts we can create with Excel are like the Dr. Seuss of visualizations, whereas the complex and nuanced graphics we can create with ggplot2 are the War and Peace.
For example, I needed to create a graph for an article I was publishing that showed how people had responded to two questions: basically, how important they felt a task was to their work, and how good they thought they were at that task. I was not just interested in how many people had rated themselves in each of the five bins in my Likert scale, so a histogram or bar chart wouldn’t capture what I wanted. That would show me how people had answered each question individually, but I was interested in showing the distribution of combinations of responses. In other words, did people who said that a task was important to them have a correspondingly high level of expertise? I was picturing something sort of like a scatterplot, but with each each point (i.e., each combination of responses) sized according to how many people had responded with that combination. I was able to do exactly this with ggplot2:
This was exactly what I wanted, and not something that I could have created with Excel, because it isn’t a “standard” chart type. Not only that, but since everything was written in code, I was able to save it so I had an exact record of what I did (when I get back to my work computer, instead of my personal one, I will get the file and actually put that code here!). It was also very easy to go back and make changes. In the original version, I had the points sized by actual number of people who had responded, but one of the reviewers felt this was potentially confusing because of the disparity in the size of each group (110 scientific researchers, but only 21 clinical researchers). I was asked to change the points to show percent of responses, rather than number of responses, and this took just one minor change to the code that I could accomplish in less than a minute.
I also like ggplot2 for creating highly complex graphics that demonstrate correlations in multivariate data sets. When I’m teaching, I like to use the sample data set that comes with ggplot2, which has info about around 55,000 diamonds, with 10 variables, including things like price, cut, color, carat, quality, and so on. How is price determined for these diamonds? Is it simply a matter of size – the bigger it is, the more it costs? Or do other variables also contribute to the price? We could do some math to find out the actual answer, but we could also quickly create a visualization that maps out some of these relationships to see if some patterns start to emerge.
First, I’ll create a scatterplot of my diamonds, with price on the x-axis and carat on the y-axis. Here it is, with the code to create it below:
If there were a perfect relationship between price and diamond size, we would expect our points to cluster along the red line I’ve inserted here, which demonstrates a 1:1 relationship. Clearly, that is not the case. So we might propose that there are other variables that contribute to a diamond’s price. If I really wanted to, I could actually demonstrate lots of variables in one chart. For example, this sort of crazy visualization shows five different variables: price (x-axis), carat (y-axis), color (color of point, with red being worst quality color and lightest yellow being best quality color), clarity (size of point, with smallest point being lowest quality clarity and largest point being highest quality clarity), and cut (faceted, with each of the five cut categories shown in its own chart).
We’d have to do some more robust mathematical analysis of this to really get info about the various correlations here, but just in glancing at this, I can see that there are definitely some interesting patterns here and that this data might be worth further looking into. And since I use ggplot2 quick a bit and am fairly proficient with it, this plot took me less than a minute to put together, which is exactly why I love ggplot2 so much.
You can probably see how you could use ggplot2 to create, as I’ve said, nearly infinitely customized charts and graphs. To relate this back to libraries, you could create visualizations about your collection, your budget, or whatever other numbers you might want to visually display in a presentation or a publication. There are also other R packages that let you create other types of visualizations. I haven’t used it, but there’s a package called VennDiagram that lets you, well, make Venn diagrams – back in my days of teaching PubMed, I used to always use Venn diagrams to show how Boolean operators work, and this would allow you to make them really easily (I was always doing weird stuff with Powerpoint to try to make mine look right, and they never quite did). There are also packages like ggvis and Shiny that let you create interactive visualizations that you could put on a website, which could be cool. I’ve only just started to play around with these packages, so I don’t have any examples of my own, but you can see some examples of cool things that people have done in the Shiny Gallery.
So there you go! I love R for visualizations, and I think it’s much easier to create nice looking graphics with R than it is with Excel or Powerpoint, once you get the hang of it. Now that I’ve heard from some other librarians who are coding, do any of you have other ideas about using R (or other languages!) for visualizations, or examples of visualizations you’ve created?
Some Additional Resources:
- I teach a class on ggplot2 at my library – the handout and class exercises are on my Data Services libguide.
- The help documentation for ggplot2 is quite thorough. Looking at the various options, you can see how you can create a nearly infinite variety of charts and graphs.
- If you’re interested in learning more about the Grammar of Graphics but don’t want to read the whole book, Hadley Wickham, who created ggplot2, has written a nice article, A Layered Grammar of Graphics, that captures many of the ideas.