See One, Do One, Teach One: Data Science Instruction Edition

In medical education, you’ll often hear the phrase “see one, do one, teach one.”  I know this not because I’m a medical librarian, but because I watched ER religiously when I was in high school. 🙂  To put it simply, to learn to do a medical procedure, you first watch a seasoned clinician perform the procedure, then you do it yourself with guidance and feedback, and then you teach someone else how to do it.  While I’m not learning how to do medical procedures, I think this same idea applies to learning just about anything, and it’s actually how I’ve learned a lot of the cool things I’ve picked up over the last couple of years at my current library.

Being sort of a Data Services department of one, I tend to put a lot of emphasis on instruction.  There are many thousands of researchers at my institution, but only one of me.  I can’t possibly help all of them one on one, so a hybrid in-person/webinar session that can reach lots and lots of people is a good use of my time.  I would have to go back and check my statistics, but I don’t think I’d be too far off base if I said I’ve taught 200 people how to use R in the last year, which is not a bad return on one librarian’s hours!  Even better for me, teaching R has enabled me to learn way more than I would have on my own.  This time a year ago, I couldn’t do much of anything with R, but with every class I teach, I learn more and more, and thus become even more prepared to teach it.

When I came to my library two years ago, I had some ideas about what I thought people should know about data management, but I figured I should collect some data about it (I mean, obviously, right?).  We did a survey.  I got my data and analyzed them to see what topics people were most interested in.  I put on classes on things like metadata, preservation, and data sharing, but the attendance wasn’t what I thought it would be based on the numbers from my survey.  Clearly something about my approach wasn’t reaching my researchers.  That’s when I decided to focus less on what I thought people should know and look at the problems they were really having.  Around the same time, I was starting to learn more about data science, and specifically R, and I realized that R could really solve a lot of the problems that people had.  Plus, people were interested in learning it.  Lots more people would show up for a class on R than they would for a class on metadata (sad, but true).

The only problem was, I didn’t think I knew R well enough to teach it.  What if really experienced people showed up and started calling me out on my inexperience, or asking questions I didn’t know the answer to?  I was really nervous about teaching an R class the first time, but I decided that I could make it manageable by biting off a little chunk.  I scheduled a class on making heatmaps in R, which was something I knew a lot of people wanted to learn.  Mind you, when I scheduled this class, I did not myself know how to make a heatmap in R.  But I put it on the instruction calendar, it went up on the website, and soon enough, I had not only a full class, but a waitlist.

Fortunately, there are many, many resources available for learning how to do things in R.  Lots of them are free.  That solved the “see one” problem.  Next, to “do one.”  I spend a long, long time putting together the hands-on exercises I create for my classes.  I try out lots of different things.  I mess around with the code and see what happens if I try things in different ways.  I try to anticipate what questions people might ask and experiment with my code so I have an answer.  Like, “what happens if you don’t put those spaces between everything in your code?” (answer, at least in R: nothing, it works fine with or without the spaces; I just like them in there because I can read it more easily).
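
To give a concrete sense of what I mean, here’s a toy example (not from the actual class handout): both of these assignments do exactly the same thing, because R ignores the extra whitespace.

    # These two lines are identical as far as R is concerned
    heights <- c(58, 61, 67, 72)
    heights<-c(58,61,67,72)

    # I just find the spaced-out version easier to read at a glance
    mean(heights)   # 64.5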

My first few classes went well.  Sometimes people asked questions I didn’t know the answers to.  Even worse, sometimes I gave incorrect answers because I felt like I should say something even if I wasn’t really sure.  In one of the first classes I taught, someone asked whether = was equivalent to <- (the assignment operator) in R.  I’d seen <- used most often, but I thought I’d seen = used sometimes too, so I said something like, “uhhh, I don’t know, I mean, yeah, I think they’re the same, like, yeah, sure?”  A woman in the back row got really annoyed at that.  “They’re not the same at all,” she said, and I could feel myself turning bright red.  “That’s factually incorrect,” she added.  Shortly after that she got up and left in the middle of the class.  I was mortified, but the class still got good evaluations, so I figured it hadn’t been all bad.
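
For the record, here’s the answer I eventually worked out, because she was right that the two aren’t interchangeable everywhere.  A minimal illustration, using nothing beyond base R:

    # At the top level of a script, these do the same thing: both create scores
    scores <- c(90, 85, 77)
    scores = c(90, 85, 77)

    # Inside a function call, they behave differently
    mean(x = 1:10)    # x just names the argument here; no object called x is created
    mean(x <- 1:10)   # assigns 1:10 to x in your workspace, then takes the mean

Most style guides I’ve seen recommend sticking with <- for assignment, and that’s what I teach now.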

These days, I schedule my classes based on two things: whether it’s something I think my researchers want to learn, and whether it’s something I want to learn myself.  That first part is relatively easy to figure out – I just talk to people, a lot, and I implore them to use my class evaluations to tell me what classes they want.  On the whole, they do, and this is how I end up with probably 90% of the classes I offer.  Sometimes this leads to much trepidation on my part, as people ask for things that I worry I’m not going to be able to teach.  For example, people had been asking for a class on statistical analysis in R.  I’ve taken a few different statistics classes, but statistics still filled me with terror.  When I submit my own articles for publication, I’m overcome with fear that I’ve made some horrible mistake in my statistical analyses and that peer reviewers are going to rip my article apart.  Or worse, the peer reviewers will miss it, it’ll be published, and readers will rip me apart.  The thought of actually teaching a class on how to do this seemed like a ridiculous idea, yet it was what so many people wanted.

So I went ahead and scheduled the class.  A lot of people signed up.  I got some very thick textbooks on statistics and statistical analysis in R and spent many hours learning about all of this.  I got some data and worked out what sorts of examples would make sense to demonstrate.  I painstakingly wrote out my code in R Markdown, with lots of comments, so that everything would be well explained.  And then the morning arrived when I was to give the class for the first time.  It was probably for the best that it was a webinar.  I was teleworking, so I gave the webinar from my home office, wearing sweatpants and my favorite UCLA t-shirt, with some lovely roses my boyfriend had brought me on my desk and my trusty dog looking in through the French doors.  I went through my examples, talking about linear regression, tests of independence, and all sorts of other things that, until I started teaching the webinar, I’d been very doubtful I had a good handle on.  But suddenly, I realized I kind of actually knew what I was talking about!  People typed their questions in the chat window and I knew the answers!  When the two hours were up and I signed off, I felt good about it, and over the next few days, I got lots of emails from people thanking me for the great class, which was gratifying, since my main goal had just been to not say anything too stupid. 🙂
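
Just to give a flavor of the kinds of examples I walked through, here’s the sort of thing I mean (this isn’t the actual class code, just the built-in mtcars dataset that ships with R):

    # Simple linear regression: does a car's weight predict its fuel efficiency?
    fit <- lm(mpg ~ wt, data = mtcars)
    summary(fit)    # coefficients, R-squared, p-values

    # Chi-squared test of independence: is transmission type related to engine shape?
    tbl <- table(mtcars$am, mtcars$vs)
    chisq.test(tbl)

The nice thing about summary() output is that the coefficients, standard errors, and p-values are all in one place, which makes it a natural anchor for a walkthrough like this.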

Now, I don’t feel so nervous about offering some of these advanced classes.  It’s kind of exciting to have the opportunity to stretch myself to learn things that I think are interesting.  Plus, nothing will give you more incentive to learn something you’ve wanted to explore than committing yourself to teach a class on it!  I’ve learned so much about so many cool things because people have said, hey, can you teach me this, and I say, sure! then scramble off to my office and check the indices of all my R books to see where I can learn how to do whatever that thing is.

The point of all this is to say that, for me at least, the “teach one” part of the old mantra is perhaps something librarians should jump on when it comes to expanding library roles in data management and data science.  I’m very fortunate that I get to spend most of my time working on data and nothing else, so I recognize that not everyone can take a week to immerse themselves in statistics, but I do think that librarians can and should stretch themselves to learn new things that will benefit our patrons.

My other piece of advice, which is surely nothing new: when someone asks a question, don’t be afraid to say I don’t know.  I learned quickly from that whole “= is not the same as <-” business.  Now when someone asks a question and I don’t know the answer, I do one of two things.  If I can, I try it out in the code right then and there.  So if someone says something like, can you rearrange the order of those two things in your code? I’ll say, huh, I never thought about that – let’s find out, and then do just that.  Other times, the question is something complicated, like, how do I do this random thing?  In those cases, I’ll say, that’s a great question, and I don’t actually know the answer, but if you’ll send me an email after this so I have your contact info, I will find out and follow up with you.  I’ve said that at least once in every class I’ve taught in the last 6 months, and the number of times someone has actually followed up with me: none.  I think this is probably due to one of two reasons.  One, I really emphasize troubleshooting and how to find out how to learn to do things in R when I teach, so it’s very possible that the person goes off and finds the answer themselves, which is great.  Two, I think there are times when people pose an idle question because they’re just kind of curious, or they want to look smart in front of their peers, and they don’t follow up because the answer doesn’t really matter that much to them anyway.

So there you go!  That’s my philosophy of getting to learn how to do cool stuff with data in order to benefit my researchers. 🙂

R for libRarians: data analysis and processing

I heard from several people after I wrote my last post about visualization who were excited about learning the very cool things that R can do.  Yay!  That post only scratched the surface of the many, nearly endless, things that R can do in terms of visualization, so if that seemed interesting to you, I hope you will go forth and learn more!  In case I haven’t already convinced you of R’s awesomeness (no, I’m not a paid R spokesperson or anything), I have a little more to say about why R is so great for data processing and analysis.

When it comes to data analysis, most of the researchers I know are either using some fancypants statistical software that costs lots of money, or they’re using Excel.  As a librarian, I have the same sort of feelings for Excel as I do for Google: wonderful tool, great when used properly, but frequently used improperly in the context of research.  Excel is okay for some very specific purposes, but at least in my experience, researchers are often using it for tasks to which it is not particularly suited.  As for the fancypants statistical software, a lot of labs can’t afford it.  Even more problematic, every single one I’m aware of uses proprietary file formats, meaning that no one else can see your data unless they too invest in that expensive software.  As data sharing becomes the expectation, having all your data locked in a proprietary format isn’t going to work.

Enter R!  Here are some of the reasons why I love it:

  • R is free and open source.  It’s supported by a huge community of users who are generally open to sharing their code.  This is great because those of us who are not programmers can take advantage of the work that others have already done to solve complex tasks.  For example, I had some data from a survey I had conducted, mostly in the form of responses to Likert-type scale questions.  I’m decidedly not a statistician and I was really not sure exactly how I should analyze these questions.  Plus, I wanted to create a visualization and I wasn’t entirely sure how I wanted it to look.  I suspected someone had probably already tackled these problems in R, so I Googled “R likert.”  Yes!  Sure enough, someone had already written a package for analyzing Likert data, aptly called likert.  I downloaded and installed the package in under a minute, and it made my data analysis so easy.  Big bonus: R can generally open files from all of those statistical software programs.  I saved the day for some researchers when the data they needed were locked in a proprietary format and they didn’t want to pay several thousand dollars to buy that program; I opened the data in like 5 seconds in R.  (There’s a little sketch of what both of these look like right after this list.)
  • R enhances research reproducibility.  Sure, a lot of what you can do in R you could also do in Excel.  I could open an Excel spreadsheet and do, for example, a find and replace to change some values of something.  I could probably even do some fairly complex math and even statistics in Excel if I really knew what I was doing.  However, nothing I do there gets documented.  I have no record explaining how I changed my data, why I did things the way I did, and so on.  Case in point: I frequently work on processing data that has been shared or downloaded from a repository to get it into the format that researchers need.  They tell me what kind of analysis they want to do and the specifications they need the data to meet, and I can clean everything up for them much more easily than they could.  Before I learned R, this took a long time, for one thing, but I also had to document all the changes I made by hand.  I would keep Word documents that painstakingly described every step of what I had done so I had a record of it if the researchers needed it.  It was a huge pain and ridiculously inefficient.  With R, none of that is necessary.  I write an R script that does whatever I need to do with the data (the sketch after this list gives a feel for it).  Not only does R do it faster and more efficiently than Excel might, but if I need a record of my actions, I have it all right there in the form of the script, which I can save, share, and come back to six months later when I’ve completely forgotten what I did.  Another really nice point in the same vein is that R never does anything to your original file (your raw data).  If you change something in Excel, save it, and then later realize you messed up, you’re out of luck if you were working on your only copy of the raw data.  That doesn’t happen with R, because R pulls the data, whatever that may be, into your computer’s working memory and sort of keeps its own copy there.  That means I can go to town doing all sorts of crazy stuff with the data, experiment and mess around with it to my heart’s content, and my raw data file is never actually touched.
  • Compared to some other solutions, R is a workhorse.  I suspect some data scientists would disagree with me characterizing R as a workhorse, which is why I qualified that statement.  R is not a great solution for truly big data.  However, it can handle much bigger data than Excel, which will groan if you try to load a file with several hundred thousand records and break if you try to load more than about a million rows (the hard limit in recent versions is 1,048,576).  By comparison, this afternoon I loaded a JSON file with 1.5 million lines into R and it took about a minute.  So, while it may not be there yet in terms of big data, I think R is a nice solution for small to medium data.  Besides that, I think learning R is very pragmatic, because once you’ve got the basics down, you can do so many things with it.  Though it was originally created as a statistical language, you can do almost anything you can think of to/with data using R, and once you’ve got the hang of the basic syntax, you’re really set to branch out into a lot of really interesting areas.  I talked in the last post about visualization, which I think R really excels at.  I’m particularly excited about learning to use R for machine learning and natural language processing, two areas that I think are going to be especially important for data analysis and knowledge discovery in the next few years.  There’s a great deal of data freely available, and learning skills like some basic R programming will vastly increase your ability to get it, interact with it, and learn something interesting from it.
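
To make the first two bullets a little more concrete, here’s roughly what one of those scripts looks like.  This is a minimal sketch with made-up file and column names; it assumes the haven package for reading SPSS files plus the likert package I mentioned above, and the real point is that every cleaning step is captured in the script while the raw file never gets overwritten.

    library(haven)    # reads SPSS/SAS/Stata files into R
    library(likert)   # analysis and plots for Likert-type items

    # Read the raw file; nothing below modifies survey_raw.sav itself
    raw <- read_sav("survey_raw.sav")

    # Cleaning steps, documented right here in the script:
    # convert the Likert items to labeled factors and keep only complete responses
    likert_levels <- c("Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree")
    items <- as.data.frame(lapply(raw[, c("q1", "q2", "q3")],
                                  function(x) factor(as.numeric(x), levels = 1:5, labels = likert_levels)))
    items <- items[complete.cases(items), ]

    # Summarize and plot the Likert items
    results <- likert(items)
    summary(results)
    plot(results)

    # Save a cleaned copy for the researchers; the original raw file is untouched
    write.csv(items, "survey_cleaned.csv", row.names = FALSE)

(For JSON files like the one in the third bullet, the jsonlite package’s fromJSON() function, or stream_in() for newline-delimited JSON, will pull the records straight into a data frame.)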

I should add that there are many other scripting languages that can accomplish many of the same things as R.  I highlight R because, in my experience, it is the most approachable for non-programmers and thus the most likely to appeal to librarians, who are my primary audience here.  I’m in the process of learning Python, and I’m at the point of wanting to bang my head against a wall with it.  R is not necessarily easy when you first get started, but I felt comfortable using it with much less effort than I expected it would take.  Your mileage may vary, but given the effort-to-payoff ratio I got, I absolutely think my time spent learning R was well worth it.