Flip flop: my failed experiment with flipped classroom R instruction

I don’t know if this terminology is common outside of library circles, but it seems like the “flipped classroom” has been all the rage in library instruction lately.  The idea is that learners do some work before coming to the session (like read something or watch a video lecture), and then the in-person time is spent on doing more activities, group exercises, etc.  As someone who is always keen to try something new and exciting, I decided to see what would happen if I tried out the flipped classroom model for my R classes.

Actually, teaching R this way makes a lot of sense.  Especially if you don’t have any experience, there’s a lot of baseline knowledge you need before you can really do anything interesting.  You’ve got to learn a lot of terminology, how the syntax of R works, boring things like what a data frame is and why it matters.  That could easily be covered before class to save the in person time for the more hands-on aspects.  I’ve also noticed a lot of variability in terms of how much people know coming into classes.  Some people are pretty tech savvy when they arrive, maybe even have some experience with another programming language.  Other people have difficulty understanding how to open a file.  It’s hard to figure out how to pace a class when you’ve got people from all over that spectrum of expertise.  On the other hand, curriculum planning would be much easier if you could know that everyone is starting out with a certain set of knowledge and build off of it.

The other reason I wanted to try this is just the time factor.  I’m busy, really busy.  My library’s training room is also hard to book because we offer so many classes.  The people I teach are busy.  I teach my basic introduction to R course as a 3-hour session, and though I’d really rather make it 4 hours, even finding a 3-hour window when I and the room are both available and people are likely to be able to attend is difficult.  Plus, it would be nice if there was some way to deliver this instruction that wasn’t so time-intensive for me.  I love teaching R – it’s probably my favorite thing I do in my job and I’d estimate I’ve taught close to 500 researchers how to code.  I generally spend around 9 hours a month teaching R, plus another 4-6 hours doing prep, administrative stuff, and all the other things that have to get done to make a class function.  That’s a lot of time, and though I don’t at all mind doing it, I’d definitely be interested in any sort of way I could streamline that work without having a negative impact on the experience of learning R from me.

For all these reasons, I decided to experiment with trying out a flipped classroom model for my introduction to R class.  I had grand plans of making a series of short video tutorials that covered bite-sized pieces of learning R.  There would be a bunch of them, but they’d be about 5 minutes each.  I arranged for the library to get Adobe Captivate, which is very cool video tutorial software, and these tutorials are going to be so awesome when I get around to making them.  However, I had already scheduled the class for today, February 28, and I hadn’t gotten around to making them yet.  Fortunately, I had a recording of a previous Intro to R class I’d taught, so I chopped the relevant parts of that up into smaller pieces and made a YouTube playlist that served as my pre-class work for this session, probably about two and a half hours total.

I had 42 people either signed up or on the waitlist at the end of last week.  I think I made the class description pretty clear – that this session was only an hour, but you did have to do stuff before you got there.  I sent out an email with the link to the videos reminding people that they would be lost in class if they didn't watch this stuff.  Even so, yesterday morning, the last of the videos had only 8 views, and I knew at least two of those were from me checking the video to make sure it worked.  So I sent out another email, once again imploring them to watch the videos before they came to class and to please cancel their registration and sign up for a regular R class if this video thing wasn't for them.

By the time I taught the class this afternoon, 20 people had canceled their registration.  Of the remaining 22, 5 showed up.  Of the 5 that showed up, it quickly became apparent to me that none of them had watched the videos.  I knew no one was going to answer honestly if I asked who had watched them, so I started by telling them to read in the CSV file to a data frame.  This request is pretty fundamental, and also pretty much the first thing I covered in the videos, so when I was met with a lot of blank stares, I knew this experiment had pretty much failed.  I did my best to cover what I could in an hour, but that’s not much, so instead of this being a cool, interactive class where people ended up feeling empowered and ready to go write code, I got the feeling those people left feeling bewildered and like they wasted an hour.  One guy who had come in 10 minutes late came up to me after class and was like, “so this is a programming language?  What can you do with it?”  And I kind of looked at him like….whaaaat?  It turned out he hadn’t even registered for the class to begin with, much less done any of the pre-class work – he had been in the library and saw me teaching and apparently thought it looked interesting so he decided to wander in.
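(For anyone wondering, that warm-up task is a one-liner along these lines – the file name here is just a stand-in, not the actual class dataset:)

# Read a CSV file into a data frame (the file name is a placeholder)
survey <- read.csv("survey_data.csv", stringsAsFactors = FALSE)

# Quick sanity checks on what got loaded
head(survey)  # first six rows
str(survey)   # column names and types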

I felt disappointed by this failed experiment, but I'm not one to give up at the first sign of failure, so I've been thinking about how I could make this system work.  It could just be that this model is not suited to people in the setting where I teach.  I am similar to them – a busy, working professional who knows this is useful and worth learning, but has a hard time finding the time – and I think about what it would take for me to do the pre-class work.  If I had the time and the videos were decent enough quality, I think I'd do it, but honestly the chances are 50-50 that I'd be able to find the time.  So maybe this model just isn't made for my community.

Before I give up on this experiment entirely, though, I’d love to hear from anyone who has tried this kind of approach for adult learners.  Did it work, did it not?  What went well and what didn’t?  And of course, being the data queen that I am, I intend to collect some data.  I’m working on a modified class evaluation for those 5 brave souls who did come to get some feedback on the pre-class work model, and I’m also planning on sending a survey out to the other 38 people who didn’t come to see what I can find out from them.  Data to the rescue of the flipped class!

A Silly Experiment in Quantifying Death (and Doing Better Code)

Doesn’t it seem like a lot of people died in 2016?  Think of all the famous people the world lost this year.  It was around the time that Alan Thicke died a couple weeks ago that I started thinking, this is quite odd; uncanny, even.  Then again, maybe there was really nothing unusual about this year, but because a few very big names passed away relatively young, we were all paying a little more attention to it.  Because I’m a data person, I decided to do a rather silly thing, which was to write an R script that would go out and collect a list of celebrity deaths, clean up the data, and then do some analysis and visualization.

You might wonder why I would spend my limited free time doing this rather silly thing.  For one thing, after I started thinking about celebrity deaths, I was genuinely curious about whether this year had been especially fatal or if it was just an average year, maybe with some bigger names.  More importantly, this little project was actually a good way to practice a few things I wanted to teach myself.  Probably some of you are just here for the death, so I won't bore you with a long discussion of my nerdy reasons, but if you're interested in R, Github, and what I learned from this project that actually made it quite worthwhile, please do stick around for that after the death discussion!

Part One: Celebrity Deaths!

To do this, I used Wikipedia’s lists of deaths of notable people from 2006 to present. This dataset is very imperfect, for reasons I’ll discuss further, but obviously we’re not being super scientific here, so let’s not worry too much about it. After discarding incomplete data, this left me with 52,185 people.  Here they are on a histogram, by year.

year_plot

As you can see, 2016 does in fact have the most deaths, with 6,640 notable people's deaths having been recorded as of January 3, 2017. The next closest year is 2014, when 6,479 notable people died, but that's a full 161 fewer people than 2016 (only a 2% difference, to be fair, but still).  The average number of notable people who died yearly over this 11-year period was 4,774, and the number of people who died in 2016 alone is 40% higher than that average.  So it's not just in my head, or yours – more notable people died this year.

Now, before we all start freaking out about this, it should be noted that the higher number of deaths in 2016 may not reflect more people actually dying – it may simply be that more deaths are being recorded on Wikipedia. The fairly steady increase and the relatively low number of deaths reported in 2006 (when Wikipedia was only five years old) suggests that this is probably the case.  I do not in any way consider Wikipedia a definitive source when it comes to vital statistics, but since, as I’ve mentioned, this project was primarily to teach myself some coding lessons, I didn’t bother myself too much about the completeness or veracity of the data.  Besides likely being an incomplete list, there are also some other data problems, which I’ll get to shortly.

By the way, in case you were wondering what the deadliest month is for notable people, it appears to be January:

month_plot

Obviously a death is sad no matter how old the person was, but part of what seemed to make 2016 extra awful is that many of the people who died seemed relatively young. Are more young celebrities dying in 2016? This boxplot suggests that the answer to that is no:

age_plot

This chart tells us that 2016 is pretty similar to other years in terms of the age at which notable people died. The mean age of death in 2016 was 76.85, which is actually slightly higher than the overall mean of 75.95. The red dots on the chart indicate outliers, basically people who died at an age that's significantly more or less than the age most people died at in that year. There are 268 in 2016, which is a little more than other years, but not shockingly so.
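(If you want a preview of the code side before Part Two, a plot like this is basically one ggplot2 call – here's a minimal sketch with fake ages standing in for the real data:)

library(ggplot2)

# A sketch, not my actual plotting code: one row per death, with year and age
set.seed(2016)
deaths <- data.frame(year = sample(2006:2016, 5000, replace = TRUE),
                     age  = round(rnorm(5000, mean = 76, sd = 15)))

ggplot(deaths, aes(x = factor(year), y = age)) +
  geom_boxplot(outlier.colour = "red") +
  labs(x = "Year", y = "Age at death")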

By the way, you may notice those outliers in 2006 and 2014 where someone died at a very, very old age. I didn't realize it at first, but Wikipedia does include some notable non-humans in their list. One is a famous tree that died in an ice storm at age 125, and the other was a tortoise who had allegedly been owned by Charles Darwin but significantly outlived him, dying at age 176.  Obviously this makes the data and therefore this analysis even more suspect as a true scientific pursuit.  But we had fun, right? 🙂

By the way, since I’m making an effort toward doing more open science (if you want to call this science), you can find all the code for this on my Github repository.  And that leads me into the next part of this…

Part Two: Why Do This?

I’m the kind of person who learns best by doing.  I do (usually) read the documentation for stuff, but it really doesn’t make a whole lot of sense to me until I actually get in there myself and start tinkering around.  I like to experiment when I’m learning code, see what happens if I change this thing or that, so I really learn how and why things work. That’s why, when I needed to learn a few key things, rather than just sitting down and reading a book or the help text, I decided to see if I could make this little death experiment work.

One thing I needed to learn: I’m working with a researcher on a project that involves web scraping, which I had kind of played with a little, but never done in any sort of serious way, so this project seemed like a good way to learn that (and it was).  Another motivator: I’m going to be participating in an NCBI hackathon next week, which I’m super excited about, but I really felt like I needed to beef up my coding skills and get more comfortable with Github.  Frankly, doing command line stuff still makes me squeamish, so in the course of doing this project, I taught myself how to use RStudio’s Github integration, which actually worked pretty well (I got a lot out of Hadley Wickham’s explanation of it).  This death project was fairly inconsequential in and of itself, but since I went to the trouble of learning a lot of stuff to make it work, I feel a lot more prepared to be a contributing member of my hackathon team.
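To give a sense of the scraping piece, it conceptually boils down to a handful of rvest calls like the ones below – the URL pattern and the selector here are illustrative rather than the exact code in my repo:

library(rvest)

# One of Wikipedia's monthly death lists (URL pattern shown for illustration)
url <- "https://en.wikipedia.org/wiki/Deaths_in_January_2016"
page <- read_html(url)

# Grab the list items as text; turning these into clean names, dates, and ages
# is where most of the actual work (and the data cleaning) happens
entries <- html_text(html_nodes(page, "li"))
head(entries)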

I wrote in my post on the open-ish PhD that I would be more amenable to sharing my code if I didn't feel as if it were so laughably amateurish.  In the past, when I wrote code, I would just do whatever ridiculous thing popped into my head that I thought might work, because, hey, who was going to see it anyway?  Ever since I wrote that open-ish PhD post, I've really approached how I write code differently, on the assumption that someone will look at it (not that I think anyone is really all that interested in my goofy death analysis, but hey, it's out there in case someone wants to look).

As I wrote this code, I challenged myself to think not just of a way, any way, to do something, but the best, most efficient, and most elegant way.  I learned how to write good functions, for real.  I learned how to use the %>% operator (the pipe, which is very awesome).  I challenged myself to avoid using for loops, since those are considered not-so-efficient in R, and I succeeded in this except for one for loop that I couldn't think of a way to avoid at the time, though I think in retrospect there's another, more efficient way I could write that part and I'll probably go back and change it at some point.  In the past, I would write code and be elated if it actually worked.  With this project, I realized I've reached a new level, where I now look at code and think, "okay, that worked, but how can I do it better?  Can I do that in one line of code instead of three?  Can I make that more efficient?"
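To give a flavor of what I mean, here's a small illustration (with made-up data, not a chunk from the death script): the loop and the piped version below produce the same summary, and the second is both shorter and easier to read.

library(dplyr)

# Illustration only: one row per death, with the year and the person's age
deaths <- data.frame(year = rep(2006:2016, each = 100),
                     age  = round(rnorm(1100, mean = 76, sd = 15)))

# The for-loop way: mean age of death per year
years <- unique(deaths$year)
mean_age <- numeric(length(years))
for (i in seq_along(years)) {
  mean_age[i] <- mean(deaths$age[deaths$year == years[i]], na.rm = TRUE)
}

# The piped way: same result in three readable lines
deaths %>%
  group_by(year) %>%
  summarise(mean_age = mean(age, na.rm = TRUE))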

So while this little project might have been somewhat silly, in the end I still think it was a good use of my time because I actually learned a lot and am already starting to use a lot of what I learned in my real work.  Plus, I learned that thing about Darwin’s tortoise, and that really makes the whole thing worth it, doesn’t it?

In defense of the live demo (despite its perils)

rstudio-bomb

When RStudio crashes, it is not subtle about it.  You get a picture of an old-timey bomb and the succinct, blunt message “R encountered a fatal error.”  A couple hundred of my librarian friends and colleagues got to see it live during the demo I gave as part of a webinar I did for the Medical Library Association on R for librarians earlier today.  At first, I thought the problem was minor.  When I tried to read in my data, I got this error message:

Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'lib_data_example.csv': No such file or directory

It’s a good example of R’s somewhat opaque and not-super-helpful error messages, but I’ve seen it before and it’s not a big deal.  It just meant that R couldn’t find the file I’d asked for.  Most of the time it’s because you’ve spelled the file name wrong, or you’ve capitalized something that should be lower case.  I double checked the file name against the cheat sheet I’d printed out with all my code.  Nope, the file name was correct.  Another likely cause is that you’re in the wrong directory and you just need to set the working directory to where the file is located.  I checked that too – my working directory was indeed set to where my file should have been.  That was when RStudio crashed, though I’m still not sure exactly why that happened.  I assume RStudio did it just to mess with me.  🙂
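If you ever hit the same error, the checks amount to just a few lines (the file name is the one from my demo; the path at the end is a placeholder):

# Where is R looking for files right now?
getwd()

# Is the file actually there?
file.exists("lib_data_example.csv")
list.files(pattern = "\\.csv$")

# If it isn't, point R at the right folder (the path here is a placeholder)
setwd("~/path/to/demo/files")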

I'm sure a lot of presenters would be pretty alarmed at this point, but I was actually quite amused.  People on Twitter seemed to notice.

Having your live demo crash is not very entertaining in and of itself, but I found the situation rather amusing because I had considered whether I should do a live demo and decided to go with it because it seemed so low risk.  What could go wrong?  Sure, live demos are unpredictable.  Websites go down, databases change their interface without warning (invariably they do this five minutes before your demo starts), software crashes, and so on. Still, the demo I was doing was really quite simple compared to a lot of the R I normally teach, and it involved using an interface I literally use almost every day.   I’ve had plenty of presentations go awry in the past, but this was one that I really thought had almost 0% chance of going wrong.  So when it all went wrong on the very first line of code, I couldn’t help but laugh.  It’s the live demo curse!  You can’t escape!

I'm sure most people who have spent any significant amount of time doing live demos of technology have had the experience of seeing the whole thing blow up.  I know a lot of librarians who avoid the whole thing by making slides with screen shots of what they would show and do sort of a mock demo.  There's nothing wrong with that, and I can understand the inclination to remove the uncertainty of the live demo from the equation.  But despite their being fraught with potential issues, I'm still in favor of live demos – and in a sense, I feel this way exactly because of their unpredictability.

For one thing, it’s helpful for learners to see how an experienced user thinks through the process of troubleshooting when something goes wrong.  It’s just a fact that stuff doesn’t always work perfectly in real life.  If the people I’m teaching are ever actually going to use the tools I’m demonstrating, eventually they’re going to run into some problems.  They’re more likely to be able to solve those problems if they’ve had a chance to see someone work through whatever issues arise.  This is true for many different types of technologies and information resources, but especially so with programming languages.  Learning to troubleshoot is itself an essential skill in programming, and what better way to learn than to see it in action?

Secondly, for brand new users of a technology, watching an instructor give a flawless and apparently effortless demonstration can actually make mastery feel out of reach for them.  In reality, a lot of time and effort likely went into developing that demo, trying out lots of different approaches, seeing what works well and what doesn’t, and arriving at the “perfect” final demo.  I’m certainly not suggesting that instructors should do freewheeling demos with no prior planning whatsoever, but I am in favor of an approach that acknowledges that things don’t always go right the first time.  When I learned R, I would watch  tutorials by these incredibly smart and talented instructors and think, oh my gosh, they make this look so easy and I’m totally lost – I’m never going to understand how this works.  Obviously I don’t want to look like an unprepared and incompetent fool in front of a class, but hey, things don’t always go perfectly.  I’m human, you’re human, we’re all going to make mistakes, but that’s part of learning, so let’s talk about what went wrong and how we fix it.

By the way, in case you’re wondering what did actually go wrong in this instance, I had inadvertently moved the data file in the process of uploading it to my Github repo – I thought I’d made a copy, but I had actually moved the original.  I quickly realized what had happened, and I knew roughly where I’d put the file, but it was in some folder buried deep in my file structure that I wouldn’t be able to locate easily on the spot.  The quickest solution I could think of, which I quickly did off-screen from the webinar (thank you dual monitors) was to copy the data from the repo, paste it into a new CSV and quickly save it where the original file should have been.  It worked fine and the demo went off as planned after that.

See One, Do One, Teach One: Data Science Instruction Edition

In medical education, you’ll often hear the phrase “see one, do one, teach one.” I know this not because I’m a medical librarian, but because I watched ER religiously when I was in high school. 🙂  To put it simply, to learn to do a medical procedure, you first watch a seasoned clinician doing the procedure, then you do it yourself with guidance and feedback, and then you teach someone else how to do it.  While I’m not learning how to do medical procedures, I think this same idea applies to learning anything, really, and it’s actually how I’ve learned to do a lot of the cool things I’ve picked up in the last couple of years in my work at my current library.

Being sort of a Data Services department of one, I tend to put a lot of emphasis on instruction.  There are many thousands of researchers at my institution, but only one of me.  I can’t possibly help all of them one on one, so doing a hybrid in-person/webinar session that can reach lots and lots of people is a good use of my time.  I would have to go back to look at my statistics, but I don’t think I’d be too far off base if I said I’ve taught 200 people how to use R in the last year, which I think is a pretty effective use of my time!  Even better for me, teaching R has enabled me to learn way more than I would have on my own.  This time a year ago, I don’t think I could do much of anything with R, but with every class I teach, I learn more and more, and thus become even more prepared to teach it.

When I came to my library two years ago, I had some ideas about what I thought people should know about data management, but I figured I should collect some data about it (I mean, obviously, right?).  We did a survey.  I got my data and analyzed them to see what topics people were most interested in.  I put on classes on things like metadata, preservation, and data sharing, but the attendance wasn’t what I thought it would be based on the numbers from my survey.  Clearly something about my approach wasn’t reaching my researchers.  That’s when I decided to focus less on what I thought people should know and look at the problems they were really having.  Around the same time, I was starting to learn more about data science, and specifically R, and I realized that R could really solve a lot of the problems that people had.  Plus, people were interested in learning it.  Lots more people would show up for a class on R than they would for a class on metadata (sad, but true).

The only problem was, I didn’t think I knew R well enough to teach it.  What if really experienced people showed up and started calling me out on my inexperience, or asking questions I didn’t know the answer to?  I was really nervous about teaching an R class the first time, but I decided that I could make it manageable by biting off a little chunk.  I scheduled a class on making heatmaps in R, which was something I knew a lot of people wanted to learn.  Mind you, when I scheduled this class, I did not myself know how to make a heatmap in R.  But I put it on the instruction calendar, it went up on the website, and soon enough, I had not only a full class, but a waitlist.
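For anyone curious, the bare-bones starting point for a class like that can be as small as one line – this is a sketch on a built-in dataset, not my actual class materials:

# A basic heatmap on built-in data: rows are cars, columns are variables,
# and values are scaled within each column so everything is comparable
heatmap(as.matrix(mtcars), scale = "column")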

Fortunately, there are many, many resources available for learning how to do things in R.  Lots of them are free.  That solved the “see one” problem.  Next, to “do one.”  I spend a long, long time putting together the hands-on exercises I create for my classes.  I try out lots of different things.  I mess around with the code and see what happens if I try things in different ways.  I try to anticipate what questions people might ask and experiment with my code so I have an answer.  Like, “what happens if you don’t put those spaces between everything in your code?” (answer, at least in R: nothing, it works fine with or without the spaces; I just like them in there because I can read it more easily).
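To show what I mean about the spaces (a trivial example, but it's exactly the kind of thing I try out ahead of time):

# These two lines are identical as far as R is concerned;
# the spaces are purely for human readability
x<-c(1,2,3)
x <- c(1, 2, 3)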

My first few classes went well.  Sometimes people asked questions I didn’t know the answers to.  Even worse, sometimes I gave incorrect answers because I felt like I should say something even if I wasn’t really sure.  In one of the first classes I taught, someone asked whether = was equivalent to <- (the assignment operator) in R.  I’d seen <- used most often, but I thought I’d seen = used sometimes too, so I said something like, “uhhh, I don’t know, I mean, yeah, I think they’re the same, like, yeah, sure?”  A woman in the back row got really annoyed at that.  “They’re not the same at all,” she said, and I could feel myself turning bright red.  “That’s factually incorrect,” she added.  Shortly after that she got up and left in the middle of the class.  I was mortified, but the class still got good evaluations, so I figured it hadn’t been all bad.
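For the record, she had a point, and it's easy to demonstrate: at the top level the two usually behave the same way, but inside a function call they don't. A quick illustration:

# At the top level, both of these assign the value 5
a <- 5
b = 5

# Inside a function call, they do different things
mean(x = 1:10)   # "x" just names the argument; no object x is created
mean(x <- 1:10)  # assigns 1:10 to a new object x, then passes it to mean()
exists("x")      # TRUE, but only because of the <- version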

These days, I schedule my classes based on two things: is it something I think my researchers want to learn, and is it something I want to learn.  That first part is relatively easy to figure out – I just talk to people, a lot, and I implore them to give me feedback about what classes they want on my class evaluations.  On the whole, they do, and this is how I end up with probably 90% of the classes I offer.  Sometimes this leads to much trepidation on my part, as people ask for things that I worry I’m not going to be able to teach.  For example, people had been asking for a class on statistical analysis in R.  I’ve taken a few different statistics classes, but stats were still something that filled me with terror.  When I submit my own articles for publication, I’m overcome with fear that I’ve made some horrible mistake in my statistical analyses and that peer reviewers are going to rip my article apart.  Or worse, the peer reviewers will miss it, it’ll be published, and readers will rip me apart.  The thought of actually teaching a class on how to do this seemed like a ridiculous idea, yet it was what so many people wanted.

So I went ahead and scheduled the class.  A lot of people signed up.  I got some very thick textbooks on statistics and statistical analysis in R and I spent many hours learning about all of this.  I got some data, saw what sorts of examples would make sense to demonstrate.  I painstakingly wrote out my code in R markdown, with lots of comments, so that everything would be well-explained.  And then, the morning arrived when I was to give the class for the first time.  Probably it was for the best that it was a webinar.  I was teleworking, so I gave the webinar from my home office, wearing sweatpants and my favorite UCLA t-shirt, with some lovely roses my boyfriend had brought me on my desk and my trusty dog looking in through the French doors.  I went through my examples, talking about linear regression, and tests of independence, and all sorts of other things, that, until I’d started to teach the webinar, I’d been very doubtful I had a good handle on.  But suddenly, I realized I kind of actually knew what I was talking about!  People typed their questions in the chat window and I  knew the answers! When the two hours were up and I signed off, I felt good about it, and over the next few days, I got lots of emails from people thanking me for the great class, which was great, since my main goal had just been to not say anything too stupid. 🙂
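In case you're curious, the calls I walked through were nothing exotic – something along these lines, shown here with built-in data rather than the dataset from the webinar:

# Simple linear regression: does car weight predict fuel efficiency?
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)

# Chi-squared test of independence: is number of cylinders related to
# transmission type? (Expect a warning about small counts with toy data.)
chisq.test(table(mtcars$cyl, mtcars$am))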

Now, I don’t feel so nervous about offering some of these advanced classes.  It’s kind of exciting to have the opportunity to stretch myself to learn things that I think are interesting.  Plus, nothing will give you more incentive to learn something you’ve wanted to explore than committing yourself to teach a class on it!  I’ve learned so much about so many cool things because people have said, hey, can you teach me this, and I say, sure! then scramble off to my office and check the indices of all my R books to see where I can learn how to do whatever that thing is.

The point of all this is to say that, for me at least, the “teach one” part of the old mantra is perhaps something librarians should jump on when it comes to expanding library roles in data management and data science.  I’m very fortunate that I get to spend most of my time working on data and nothing else, so I recognize that not everyone can take a week to immerse themselves in statistics, but I do think that librarians can and should stretch themselves to learn new things that will benefit our patrons.

My other piece of advice, which is surely nothing new: when someone asks a question, don’t be afraid to say I don’t know.  I learned quickly from that whole “= is not the same as <-” business.  Now when someone asks a question and I don’t know the answer, I do one of two things.  If I can, I try it out in the code right then and there.  So if someone says something like, can you rearrange the order of those two things in your code? I’ll say, huh, I never thought about that – let’s find out, and then do just that.  Other times, the question is something complicated, like, how do I do this random thing?  In those cases, I’ll say, that’s a great question, and I don’t actually know the answer, but if you’ll send me an email after this so I have your contact info, I will find out and follow up with you.  I’ve said that at least once in every class I’ve taught in the last 6 months, and the number of times someone has actually followed up with me: none.  I think this is probably due to one of two reasons.  One, I really emphasize troubleshooting and how to find out how to learn to do things in R when I teach, so it’s very possible that the person goes off and finds the answer themselves, which is great.  Two, I think there are times when people pose an idle question because they’re just kind of curious, or they want to look smart in front of their peers, and they don’t follow up because the answer doesn’t really matter that much to them anyway.

So there you go!  That’s my philosophy of getting to learn how to do cool stuff with data in order to benefit my researchers. 🙂

R for libRarians: data analysis and processing

I heard from several people after I wrote my last post about visualization who were excited about learning the very cool things that R can do.  Yay!  That post only scratched the surface of the many, nearly endless, things that R can do in terms of visualization, so if that seemed interesting to you, I hope you will go forth and learn more!  In case I haven’t already convinced you of R’s awesomeness (no, I’m not a paid R spokesperson or anything), I have a little more to say about why R is so great for data processing and analysis.

When it comes to data analysis, most of the researchers I know are either using some fancypants statistical software that costs lots of money, or they’re using Excel.  As a librarian, I have the same sort of feelings for Excel as I do for Google: wonderful tool, great when used properly, but frequently used improperly in the context of research.  Excel is okay for some very specific purposes, but at least in my experience, researchers are often using it for tasks to which it is not particularly suited.  As far as the fancypants statistical software, a lot of labs can’t afford it.  Even more problematic, every single one I’m aware of uses proprietary file formats, meaning that no one else can see your data unless they too invest in that expensive software.  As data sharing is becoming the expectation, having all your data locked in a proprietary format isn’t going to work.

Enter R!  Here are some of the reasons why I love it:

  • R is free and open source.  It's supported by a huge community of users who are generally open to sharing their code.  This is great because those of us who are not programmers can take advantage of the work that others have already done to solve complex tasks.  For example, I had some data from a survey I had conducted, mostly in the form of responses to Likert-type scale questions.  I'm decidedly not a statistician and I was really not sure exactly how I should analyze these questions.  Plus, I wanted to create a visualization and I wasn't entirely sure how I wanted it to look.  I suspected someone had probably already tackled these problems in R, so I Googled "R likert."  Yes!  Sure enough, someone had already written a package for analyzing Likert data, aptly called likert.  I downloaded and installed the package in under a minute, and it made my data analysis so easy.  Big bonus: R can generally open files from all of those statistical software programs.  I saved the day for some researchers when the data they needed was in a proprietary format, but they didn't want to pay several thousand dollars to buy that program, and I opened the data in like 5 seconds in R.  (There's a rough sketch of what both of these look like in code right after this list.)
  • R enhances research reproducibility. Sure, a lot of what you can do in R you could also do in Excel.  I could open an Excel spreadsheet and do, for example, a find and replace to change some values of something.  I could probably even do some fairly complex math and even statistics in Excel if I really knew what I was doing.  However, nothing I do there is going to be documented.  I have no record explaining how I changed my data, why I did things the way I did, and so on.  Case in point: I frequently work on processing data that has been shared or downloaded from a repository to get it into the format that researchers need.  They tell me what kind of analysis they want to do, and the specifications they need the data to meet, and I can clean everything up for them much more easily than they could.  Before I learned R, this took a long time, for one thing, but I also had to document all the changes I made by hand.  I would keep Word documents that painstakingly described every step of what I had done so I had a record of it if the researchers needed it.  It was a huge pain and ridiculously inefficient.  With R, none of that is necessary.  I write an R script that does whatever I need to do with the data.  Not only does R do it faster and more efficiently than Excel might, but if I need a record of my actions, I have it all right there in the form of the script, which I can save, share, and come back to when I've completely forgotten what I did 6 months later.  Another really nice point in this same vein is that R never touches your original file, or your raw data.  If you change something in Excel, save it, and then later realize you messed up, you're out of luck if you were working on your only copy of the raw data.  That doesn't happen with R, because R pulls the data, whatever that may be, into your computer's working memory and keeps its own copy there.  That means I can go to town doing all sorts of crazy stuff with the data, experiment and mess around with it to my heart's content, and my raw data file is never actually touched.
  • Compared to some other solutions, R is a workhorse. I suspect some data scientists would  disagree with me characterizing R as a workhorse, which is why I qualified that statement.  R is not a great solution for truly big data.  However, it can handle much bigger data than Excel, which will groan if you try to load a file with several hundred thousand records and break if you try to load more than a million.  By comparison, this afternoon I loaded a JSON file with 1.5 million lines into R and it took about a minute.  So, while it may not be there yet in terms of big data, I think R is a nice solution for small to medium data.  Besides that, I think learning R is very pragmatic, because once you’ve got the basics down, you can do so many things with it.  Though it was originally created as a statistical language, you can do almost anything you can think of to/with data using R, and once you’ve got the hang of the basic syntax, you’re really set to branch out into a lot of really interesting areas.  I talked in the last post about visualization, which I think R really excels at.  I’m particularly excited about learning to use R for machine learning and natural language processing, which are two areas that I think are going to be particularly important in terms of data analysis and knowledge discovery in the next few years.  There’s a great deal of data freely available, and learning skills like some basic R programming will vastly increase your ability to get it, interact with it, and learn something interesting from it.
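To make the first bullet a bit more concrete, here's roughly what both of those rescues look like in code. The file name and the survey items are placeholders, and I'm sketching the likert package's interface rather than reproducing my actual analysis, so treat this as an illustration rather than a recipe:

library(haven)   # reads SPSS, Stata, and SAS files
library(likert)  # analysis and plots for Likert-scale items

# Rescue no. 1: open data stuck in a proprietary format (file name is a placeholder)
spss_data <- read_sav("colleagues_survey.sav")

# Rescue no. 2: analyze Likert-type items; fake responses stand in for my survey data
lik_levels <- c("Not at all", "Slightly", "Moderately", "Very", "Extremely")
items <- data.frame(
  importance = factor(sample(lik_levels, 100, replace = TRUE), levels = lik_levels),
  expertise  = factor(sample(lik_levels, 100, replace = TRUE), levels = lik_levels)
)
results <- likert(items)
summary(results)
plot(results)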

I should add that there are many other scripting languages that can accomplish many of the same things as R.  I highlight R because, in my experience, it is the most approachable for non-programmers and thus the most likely to appeal to librarians, who are my primary audience here.  I’m in the process of learning Python, and I’m at the point of wanting to bang my head against a wall with it.  R is not necessarily easy when you first get started, but I felt comfortable using it with much less effort than I expected it would take.  Your mileage may vary, but for the effort to payoff ratio I got, I absolutely think that my time spent learning R was well worth it.

R for libRarians: visualization

I recently blogged about R and how cool it is, and how it’s really not as scary to learn as many novices (including myself, a few years ago) might think.  Several of my fellow librarians commented, or emailed, to ask more about how I’m using R in my library work, so I thought I would take a moment to share some of those ideas here, and also to encourage other librarians who are using R (or related languages/tools) to jump in and share how you’re using it in your library work.

I should preface this by saying I don’t do a lot of “regular” library work anymore – most of what I do is working with researchers on their data, teaching classes about data, and collecting and working with my own research data.  However, I did do more traditional library things in the past, so I know that these kinds of skills would be useful.  In particular, there are three areas where I’ve found R to be very useful: visualization, data processing (or wrangling, or cleaning, or whatever you want to call it), and textual analysis.  Because I could say a lot about each of these, I’m going to do this over several posts, starting with today’s post on visualization.

Data visualization is one of my new favorite things to work on, and by far the tool I use most is R, specifically the ggplot2 package.  This package utilizes the concepts outlined in Leland Wilkinson’s Grammar of Graphics, which takes visualizations apart into their individual components.  As Wilkinson explains it,  “a language consisting of words and no grammar expresses only as many ideas as there are words. By specifying how words are combined in statements, a grammar expands a language’s scope…The grammar of graphics takes us beyond a limited set of charts (words) to an almost unlimited world of graphical forms (statements).”  When I teach ggplot2, I like to say that the kind of premade charts we can create with Excel are like the Dr. Seuss of visualizations, whereas the complex and nuanced graphics we can create with ggplot2 are the War and Peace.

For example, I needed to create a graph for an article I was publishing that showed how people had responded to two questions: basically, how important they felt a task was to their work, and how good they thought they were at that task.  I was not just interested in how many people had rated themselves in each of the five bins in my Likert scale, so a histogram or bar chart wouldn't capture what I wanted.  That would show me how people had answered each question individually, but I was interested in showing the distribution of combinations of responses.  In other words, did people who said that a task was important to them have a correspondingly high level of expertise? I was picturing something sort of like a scatterplot, but with each point (i.e., each combination of responses) sized according to how many people had responded with that combination.  I was able to do exactly this with ggplot2:

This was exactly what I wanted, and not something that I could have created with Excel, because it isn’t a “standard” chart type.  Not only that, but since everything was written in code, I was able to save it so I had an exact record of what I did (when I get back to my work computer, instead of my personal one, I will get the file and actually put that code here!).  It was also very easy to go back and make changes.  In the original version, I had the points sized by actual number of people who had responded, but one of the reviewers felt this was potentially confusing because of the disparity in the size of each group (110 scientific researchers, but only 21 clinical researchers).  I was asked to change the points to show percent of responses, rather than number of responses, and this took just one minor change to the code that I could accomplish in less than a minute.
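Until I get that file, here's a rough sketch of the general idea – not the code from the article, and with fake responses standing in for my survey data – using geom_count, which sizes each point by the number of observations at that combination:

library(ggplot2)

# Fake 1-5 responses standing in for the real survey data
set.seed(1)
responses <- data.frame(importance = sample(1:5, 120, replace = TRUE),
                        expertise  = sample(1:5, 120, replace = TRUE))

# Each point is one combination of responses, sized by how many people gave it
ggplot(responses, aes(x = importance, y = expertise)) +
  geom_count() +
  labs(x = "Importance of task to your work",
       y = "Self-rated expertise",
       size = "Responses")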

I also like ggplot2 for creating highly complex graphics that demonstrate correlations in multivariate data sets.  When I'm teaching, I like to use the sample data set that comes with ggplot2, which has info about roughly 54,000 diamonds, with 10 variables, including things like price, cut, color, carat, clarity, and so on.  How is price determined for these diamonds?  Is it simply a matter of size – the bigger it is, the more it costs?  Or do other variables also contribute to the price?  We could do some math to find out the actual answer, but we could also quickly create a visualization that maps out some of these relationships to see if some patterns start to emerge.

First, I’ll create a scatterplot of my diamonds, with price on the x-axis and carat on the y-axis.  Here it is, with the code to create it below:

# assumes library(ggplot2) is loaded and that diam holds ggplot2's built-in diamonds data
a <- ggplot(diam, aes(x = price, y = carat)) + geom_point() + geom_abline(slope = 0.0002656748, intercept = 0, col = "red")

If there were a perfect relationship between price and diamond size, we would expect our points to cluster along the red line I’ve inserted here, which demonstrates a 1:1 relationship.  Clearly, that is not the case.  So we might propose that there are other variables that contribute to a diamond’s price.  If I really wanted to, I could actually demonstrate lots of variables in one chart.  For example, this sort of crazy visualization shows five different variables: price (x-axis), carat (y-axis), color (color of point, with red being worst quality color and lightest yellow being best quality color), clarity (size of point, with smallest point being lowest quality clarity and largest point being highest quality clarity), and cut (faceted, with each of the five cut categories shown in its own chart).

# requires the RColorBrewer package for brewer.pal()
ggplot(diam, aes(x = price, y = carat, col = color)) + geom_point(aes(size = clarity)) + scale_colour_manual(values = rev(brewer.pal(7,"YlOrRd"))) + facet_wrap(~cut, nrow = 1)


We'd have to do some more robust mathematical analysis of this to really get info about the various correlations here, but just glancing at this, I can see that there are definitely some interesting patterns and that this data might be worth looking into further.  And since I use ggplot2 quite a bit and am fairly proficient with it, this plot took me less than a minute to put together, which is exactly why I love ggplot2 so much.

You can probably see how you could use ggplot2 to create, as I’ve said, nearly infinitely customized charts and graphs.  To relate this back to libraries, you could create visualizations about your collection, your budget, or whatever other numbers you might want to visually display in a presentation or a publication.  There are also other R packages that let you create other types of visualizations.  I haven’t used it, but there’s a package called VennDiagram that lets you, well, make Venn diagrams – back in my days of teaching PubMed, I used to always use Venn diagrams to show how Boolean operators work, and this would allow you to make them really easily (I was always doing weird stuff with Powerpoint to try to make mine look right, and they never quite did).  There are also packages like ggvis and Shiny that let you create interactive visualizations that you could put on a website, which could be cool.  I’ve only just started to play around with these packages, so I don’t have any examples of my own, but you can see some examples of cool things that people have done in the Shiny Gallery.
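To give one quick example of that VennDiagram package (the counts and labels here are made up, and I'm going from the package documentation rather than my own use):

library(VennDiagram)
library(grid)

# A two-set Venn like the ones I used to fake in Powerpoint
grid.newpage()
draw.pairwise.venn(area1 = 120, area2 = 80, cross.area = 30,
                   category = c("exercise", "diabetes"),
                   fill = c("skyblue", "lightgreen"))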

So there you go!  I love R for visualizations, and I think it’s much easier to create nice looking graphics with R than it is with Excel or Powerpoint, once you get the hang of it.  Now that I’ve heard from some other librarians who are coding, do any of you have other ideas about using R (or other languages!) for visualizations, or examples of visualizations you’ve created?

Some Additional Resources:

  • I teach a class on ggplot2 at my library – the handout and class exercises are on my Data Services libguide.
  • The help documentation for ggplot2 is quite thorough.  Looking at the various options, you can see how you can create a nearly infinite variety of charts and graphs.
  • If you’re interested in learning more about the Grammar of Graphics but don’t want to read the whole book, Hadley Wickham, who created ggplot2, has written a nice article, A Layered Grammar of Graphics, that captures many of the ideas.

So you think you can code

I’ve been thinking about many ideas lately dealing with data and data science (this is, I’m sure, not news to anyone).  I’ve also had several people encourage me to pick my blog back up, and I’ve recently made my den into a cute and comfy little office, so, why not put all this together and resume blogging with a little post about my thoughts on data!  In particular, in this post I’m going to talk about coding.

Early on in my library career when I first got interested in data, I was talking to one of my first bosses and told her I thought I should learn R, which is essentially a scripting language, very useful for data processing, analysis, statistics, and visualization.  She gave me a sort of dubious look, and even as I said it, I was thinking in my head, yeah, I’m probably not going to do that.  I’m no computer scientist.  Fast forward a few years later, and not only have I actually learned R, it’s probably the single most important skill in my professional toolbox.

Here’s the thing – you don’t have to be a computer scientist to code, especially in R.  It’s actually remarkably straightforward, once you get over the initial strangeness of it and get a feel for the syntax.  I started offering R classes around the beginning of this year and I call my introductory classes “Introduction to R for Non-programmers.”  I had two reasons for selecting this name: one, I had only been using R for less than a year myself and didn’t (and still don’t) consider myself an expert.  When I started thinking about getting up in front of a room of people and teaching them to code, I had horrifying visions of experienced computer scientists calling me out on my relative lack of expertise, mocking my class exercises, or correcting me in front of everyone.  So, I figured, let’s set the bar low. 🙂  More importantly, I wanted to emphasize that R is approachable!  It’s not scary!  I can learn it, you can learn it.  Hell, young children can (and do) learn it.  Not only that, but you can learn it from one of a plethora of free resources without ever cracking a book or spending a dime.  All it takes is a little time, patience, and practice.

The payoff?  For one thing, you can impress your friends with your nerdy awesome skills!  (Or at least that’s what I keep telling myself.)  If you work with data of any kind, you can simplify your work, because using R (or other scientific programming languages) is faaaaar more efficient than using other point and click tools like Excel.  You can create super awesome visualizations, do crazy data analysis in a snap, and work with big huge data sets that would break Excel.  And you can do all of this for free!  If you’re a research and/or medical librarian, you will also make yourself an invaluable resource to your user community.  I believe that I could teach an R class every day at my library and there would still be people showing up.  We regularly have waitlists of 20 or more people.  Scientists are starting to catch on to all the reasons I’ve mentioned above, but not all of them have the time or inclination to use one of the free online resources.  Plus, since I’m a real human person who knows my users and their research and their data, I know what they probably want to do, so my classes are more tailored to them.

Yesterday I was introduced to Hadley Wickham, who is a pretty big deal in the R world, as he created some very important R packages (kind of like apps), and my friend and colleague who introduced me said, "this is Lisa; she is our prototypical data scientist librarian."  I know there are other librarian coders out there because I'm on mailing lists with some of them, but I'm not currently aware of any other data librarians or medical librarians who know R.  I'm sure there are others and I would be very interested in knowing them.  And if it is fair to consider me a "prototype," I wonder how many other librarians will be interested in becoming data scientist librarians.  I'm really interested in hearing from the librarians reading this – do you want to code?  Do you think you can learn to code?  And if not, why not?