To keep or not to keep: that is the question

I recently read an article in The Atlantic about people who are compulsive declutterers – the opposite of hoarders – who feel compelled to get rid of all their possessions. I’m more on the side of hoarding, because I always find myself thinking of eventualities in which I might need the item in question.  Indeed, it has often been the case that I will think of something I got rid of weeks or even years later and wish I still had it: a book I would have liked to reference, a piece of clothing I would have liked to wear, a receipt I could have used to take something back.  Of course, I don’t have unlimited storage space, so I can’t keep all this stuff.  The question of what to keep and for how long is one that librarians think about when it comes to weeding: deciding which parts of the collection to deaccession, or basically, get rid of.  There are evidence-based, tried-and-true ways of thinking about weeding a library collection, but that’s not so much true when it comes to data.  How is a scientist to decide what to keep and what not to keep?

I know this is a question that researchers are thinking about quite a bit, because I get more emails about it than almost any other issue.  In fact, I get emails not only from users of my own library, but from researchers all over the country who have somehow found my name.  What exactly do I need to keep?  If I have electronic records, do I need to keep a print copy as well?  How many years do I need to keep this stuff?  These are all very reasonable questions, and it would be nice to say, yes, there is an answer and it is….!  But it's almost never that easy to point to a single answer.

A case in point: a couple years ago, I decided to teach a class about data preservation and retention.  In my naivete, I thought it would be nice to take a look through all the relevant policy and find the specific number of years that research data is required to be retained.  I read handbooks and guides.  I read policy documents from various agencies.   I even read the U.S. Code (I do not recommend it).  At the end of it, I found that not only is there not a single, definitive, policy answer to how long funded research data should be retained, but there are in fact all sorts of contradictory suggestions.  I found documents giving times from 3 years to 7 years to the super-helpful “as long as necessary.”

This may be difficult to answer from a policy perspective, but I think answering this from a best practices perspective is even trickier.  Let's agree that we just can't keep everything – storing data isn't free, and it takes considerable time and effort to ensure that data remain accessible and usable.  Assuming that some stuff has to get thrown away, how do we distinguish trash from treasure, especially given the old adage about how the former might be the latter to others?  It's hard to know whether something that appears useless now might actually be useful and interesting to someone in the future.  To take this to the extreme, here's an actual example from a researcher I've worked with: he asked how he could have his program automatically discard everything in the thousandth place from his measurements.  In other words, he wanted 4.254 to be saved as 4.25.  I told him I could show him how, but I asked why he wanted to do this.  He told me that his machine was capable of measuring to the thousandth, but the measurement was only scientifically relevant to the hundredth place.  To scientists right now, 4.254 and 4.252 are essentially indistinguishable, so why bother with the extra noise of the thousandth place?  Fair point, but what about 5 years from now, or 10 years from now?  If science evolves to the point that this extra level of precision is meaningful, tomorrow's researchers will probably be a little annoyed that today's researchers had that measurement and just threw it away.  But then again, how can we know now when, or even if, that level of precision will be wanted?  For that matter, we can't even say for sure whether this dataset will be useful at all.  Maybe a new and better method for making this measurement will be developed tomorrow, and all this stuff we gathered today will be irrelevant.  But how can we know?
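(Incidentally, the mechanics of his request are the trivial part – here's a tiny R sketch with made-up numbers, showing both rounding and outright truncation.  The hard part is the judgment call about whether to throw that precision away at all.)

x <- c(4.254, 4.252, 3.817)   # made-up measurements, not the researcher's actual data

round(x, 2)           # rounds to the hundredth place: 4.25 4.25 3.82
floor(x * 100) / 100  # truncates, i.e. simply drops the thousandths digit: 4.25 4.25 3.81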

These are all questions that I think are not easy to answer right now, but that people within research communities should be thinking about.  For one thing, I don't think we can give one simple answer to how long data should be retained.  For some types of research, a few years may be enough.  For other fields, where it's harder to replicate data, maybe we need to keep it in perpetuity.  When it comes to deciding what should be retained and what should be discarded, I think that answers cannot be dictated by one-size-fits-all policies and that subject matter experts and information professionals should work together to figure out good answers for specific communities and specific data.  Eventually, I suppose we'll probably have some of those well-defined best practices for data retention in the same way that we have best practices for collection management in libraries.  Until then, keep your crystal balls handy. 🙂

See One, Do One, Teach One: Data Science Instruction Edition

In medical education, you’ll often hear the phrase “see one, do one, teach one.” I know this not because I’m a medical librarian, but because I watched ER religiously when I was in high school. 🙂  To put it simply, to learn to do a medical procedure, you first watch a seasoned clinician doing the procedure, then you do it yourself with guidance and feedback, and then you teach someone else how to do it.  While I’m not learning how to do medical procedures, I think this same idea applies to learning anything, really, and it’s actually how I’ve learned to do a lot of the cool things I’ve picked up in the last couple of years in my work at my current library.

Being sort of a Data Services department of one, I tend to put a lot of emphasis on instruction.  There are many thousands of researchers at my institution, but only one of me.  I can't possibly help all of them one on one, so doing a hybrid in-person/webinar session that can reach lots and lots of people is a good use of my time.  I would have to go back to look at my statistics, but I don't think I'd be too far off base if I said I've taught 200 people how to use R in the last year, which I'd call a pretty good return on that effort!  Even better for me, teaching R has enabled me to learn way more than I would have on my own.  This time a year ago, I don't think I could do much of anything with R, but with every class I teach, I learn more and more, and thus become even more prepared to teach it.

When I came to my library two years ago, I had some ideas about what I thought people should know about data management, but I figured I should collect some data about it (I mean, obviously, right?).  We did a survey.  I got my data and analyzed them to see what topics people were most interested in.  I put on classes on things like metadata, preservation, and data sharing, but the attendance wasn’t what I thought it would be based on the numbers from my survey.  Clearly something about my approach wasn’t reaching my researchers.  That’s when I decided to focus less on what I thought people should know and look at the problems they were really having.  Around the same time, I was starting to learn more about data science, and specifically R, and I realized that R could really solve a lot of the problems that people had.  Plus, people were interested in learning it.  Lots more people would show up for a class on R than they would for a class on metadata (sad, but true).

The only problem was, I didn’t think I knew R well enough to teach it.  What if really experienced people showed up and started calling me out on my inexperience, or asking questions I didn’t know the answer to?  I was really nervous about teaching an R class the first time, but I decided that I could make it manageable by biting off a little chunk.  I scheduled a class on making heatmaps in R, which was something I knew a lot of people wanted to learn.  Mind you, when I scheduled this class, I did not myself know how to make a heatmap in R.  But I put it on the instruction calendar, it went up on the website, and soon enough, I had not only a full class, but a waitlist.

Fortunately, there are many, many resources available for learning how to do things in R.  Lots of them are free.  That solved the “see one” problem.  Next, to “do one.”  I spend a long, long time putting together the hands-on exercises I create for my classes.  I try out lots of different things.  I mess around with the code and see what happens if I try things in different ways.  I try to anticipate what questions people might ask and experiment with my code so I have an answer.  Like, “what happens if you don’t put those spaces between everything in your code?” (answer, at least in R: nothing, it works fine with or without the spaces; I just like them in there because I can read it more easily).
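To make the spacing question concrete, these two lines do exactly the same thing in R:

mean(c(1, 5, 22, 104))   # with spaces: 33
mean(c(1,5,22,104))      # without spaces: still 33

About the only place spacing can change the meaning is around the assignment arrow: x <- 2 assigns 2 to x, while x < - 2 asks whether x is less than -2.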

My first few classes went well.  Sometimes people asked questions I didn’t know the answers to.  Even worse, sometimes I gave incorrect answers because I felt like I should say something even if I wasn’t really sure.  In one of the first classes I taught, someone asked whether = was equivalent to <- (the assignment operator) in R.  I’d seen <- used most often, but I thought I’d seen = used sometimes too, so I said something like, “uhhh, I don’t know, I mean, yeah, I think they’re the same, like, yeah, sure?”  A woman in the back row got really annoyed at that.  “They’re not the same at all,” she said, and I could feel myself turning bright red.  “That’s factually incorrect,” she added.  Shortly after that she got up and left in the middle of the class.  I was mortified, but the class still got good evaluations, so I figured it hadn’t been all bad.
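(For anyone wondering how that question actually shakes out: at the top level the two are interchangeable for assignment, but inside a function call they behave differently.  A quick illustration:)

x <- 5    # assignment
y = 5     # also assignment when used at the top level like this

median(x = 1:10)    # here = just names the function's argument; no variable is created or changed
median(x <- 1:10)   # this assigns 1:10 to a variable x in your workspace, then passes it to median()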

These days, I schedule my classes based on two things: is it something I think my researchers want to learn, and is it something I want to learn.  That first part is relatively easy to figure out – I just talk to people, a lot, and I implore them to give me feedback about what classes they want on my class evaluations.  On the whole, they do, and this is how I end up with probably 90% of the classes I offer.  Sometimes this leads to much trepidation on my part, as people ask for things that I worry I’m not going to be able to teach.  For example, people had been asking for a class on statistical analysis in R.  I’ve taken a few different statistics classes, but stats were still something that filled me with terror.  When I submit my own articles for publication, I’m overcome with fear that I’ve made some horrible mistake in my statistical analyses and that peer reviewers are going to rip my article apart.  Or worse, the peer reviewers will miss it, it’ll be published, and readers will rip me apart.  The thought of actually teaching a class on how to do this seemed like a ridiculous idea, yet it was what so many people wanted.

So I went ahead and scheduled the class.  A lot of people signed up.  I got some very thick textbooks on statistics and statistical analysis in R and I spent many hours learning about all of this.  I got some data and figured out what sorts of examples would make sense to demonstrate.  I painstakingly wrote out my code in R markdown, with lots of comments, so that everything would be well-explained.  And then, the morning arrived when I was to give the class for the first time.  Probably it was for the best that it was a webinar.  I was teleworking, so I gave the webinar from my home office, wearing sweatpants and my favorite UCLA t-shirt, with some lovely roses my boyfriend had brought me on my desk and my trusty dog looking in through the French doors.  I went through my examples, talking about linear regression, tests of independence, and all sorts of other things that, until I started teaching the webinar, I'd been very doubtful I had a good handle on.  But suddenly, I realized I kind of actually knew what I was talking about!  People typed their questions in the chat window and I knew the answers! When the two hours were up and I signed off, I felt good about it, and over the next few days, I got lots of emails from people thanking me for the great class, which was gratifying, since my main goal had just been to not say anything too stupid. 🙂
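In case you're wondering what those hands-on examples look like, here's a stripped-down sketch of the flavor of thing an intro stats-in-R demo can cover.  It uses R's built-in mtcars data rather than the dataset from my actual class:

# Simple linear regression: does a car's weight predict its fuel efficiency?
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)

# Chi-squared test of independence: is transmission type related to engine shape?
tbl <- table(mtcars$am, mtcars$vs)
chisq.test(tbl)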

Now, I don’t feel so nervous about offering some of these advanced classes.  It’s kind of exciting to have the opportunity to stretch myself to learn things that I think are interesting.  Plus, nothing will give you more incentive to learn something you’ve wanted to explore than committing yourself to teach a class on it!  I’ve learned so much about so many cool things because people have said, hey, can you teach me this, and I say, sure! then scramble off to my office and check the indices of all my R books to see where I can learn how to do whatever that thing is.

The point of all this is to say that, for me at least, the “teach one” part of the old mantra is perhaps something librarians should jump on when it comes to expanding library roles in data management and data science.  I’m very fortunate that I get to spend most of my time working on data and nothing else, so I recognize that not everyone can take a week to immerse themselves in statistics, but I do think that librarians can and should stretch themselves to learn new things that will benefit our patrons.

My other piece of advice, which is surely nothing new: when someone asks a question, don’t be afraid to say I don’t know.  I learned quickly from that whole “= is not the same as <-” business.  Now when someone asks a question and I don’t know the answer, I do one of two things.  If I can, I try it out in the code right then and there.  So if someone says something like, can you rearrange the order of those two things in your code? I’ll say, huh, I never thought about that – let’s find out, and then do just that.  Other times, the question is something complicated, like, how do I do this random thing?  In those cases, I’ll say, that’s a great question, and I don’t actually know the answer, but if you’ll send me an email after this so I have your contact info, I will find out and follow up with you.  I’ve said that at least once in every class I’ve taught in the last 6 months, and the number of times someone has actually followed up with me: none.  I think this is probably due to one of two reasons.  One, I really emphasize troubleshooting and how to find out how to learn to do things in R when I teach, so it’s very possible that the person goes off and finds the answer themselves, which is great.  Two, I think there are times when people pose an idle question because they’re just kind of curious, or they want to look smart in front of their peers, and they don’t follow up because the answer doesn’t really matter that much to them anyway.

So there you go!  That’s my philosophy of getting to learn how to do cool stuff with data in order to benefit my researchers. 🙂

R for libRarians: data analysis and processing

I heard from several people after I wrote my last post about visualization who were excited about learning the very cool things that R can do.  Yay!  That post only scratched the surface of the many, nearly endless, things that R can do in terms of visualization, so if that seemed interesting to you, I hope you will go forth and learn more!  In case I haven’t already convinced you of R’s awesomeness (no, I’m not a paid R spokesperson or anything), I have a little more to say about why R is so great for data processing and analysis.

When it comes to data analysis, most of the researchers I know are either using some fancypants statistical software that costs lots of money, or they're using Excel.  As a librarian, I have the same sort of feelings for Excel as I do for Google: wonderful tool, great when used properly, but frequently used improperly in the context of research.  Excel is okay for some very specific purposes, but at least in my experience, researchers are often using it for tasks to which it is not particularly suited.  As for the fancypants statistical software, a lot of labs can't afford it.  Even more problematic, every one I'm aware of defaults to a proprietary file format, meaning that no one else can easily open your data unless they too invest in that expensive software.  As data sharing is becoming the expectation, having all your data locked in a proprietary format isn't going to work.

Enter R!  Here are some of the reasons why I love it:

  • R is free and open source.  It's supported by a huge community of users who are generally open to sharing their code.  This is great because those of us who are not programmers can take advantage of the work that others have already done to solve complex tasks.  For example, I had some data from a survey I had conducted, mostly in the form of responses to Likert-type scale questions.  I'm decidedly not a statistician and I was really not sure exactly how I should analyze these questions.  Plus, I wanted to create a visualization and I wasn't entirely sure how I wanted it to look.  I suspected someone had probably already tackled these problems in R, so I Googled "R likert."  Yes!  Sure enough, someone had already written a package for analyzing Likert data, aptly called likert.  I downloaded and installed the package in under a minute, and it made my data analysis so easy.  (There's a small sketch of what that workflow looks like just after this list.)  Big bonus: R can generally open files from all of those statistical software programs.  I saved the day for some researchers when the data they needed was in a proprietary format, but they didn't want to pay several thousand dollars to buy that program, and I opened the data in like 5 seconds in R.
  • R enhances research reproducibility. Sure, a lot of what you can do in R you could also do in Excel.  I could open an Excel spreadsheet and do, for example, a find and replace to change some values of something.  I could probably even do some fairly complex math and even statistics in Excel if I really knew what I was doing.  However, nothing I do there is going to be documented.  I have no record explaining how I changed my data, why I did things the way I did, and so on.  Case in point: I frequently work on processing data that has been shared with us or downloaded from a repository to get it into the format that researchers need.  They tell me what kind of analysis they want to do, and the specifications they need the data to meet, and I can clean everything up for them much more easily than they could.  Before I learned R, this took a long time, for one thing, but I also had to document all the changes I made by hand. I would keep Word documents that painstakingly described every step of what I had done so I had a record of it if the researchers needed it.  It was a huge pain and ridiculously inefficient.  With R, none of that is necessary.  I write an R script that does whatever I need to do with the data.  Not only does R do it faster and more efficiently than Excel might, but if I need a record of my actions, I have it all right there in the form of the script, which I can save, share, come back to when I've completely forgotten what I did 6 months later, and so on.  (There's a short example script after this list.)  Another really nice point in this same vein is that R never does anything to your original file, or your raw data.  If you change something up in Excel, save it, and then later realize you messed up, you're out of luck if you were working on your only copy of the raw data.  That doesn't happen with R, because R pulls the data, whatever that may be, into your computer's working memory and sort of keeps its own copy there.  That means I can go to town doing all sorts of crazy stuff with the data, experiment and mess around with it to my heart's content, and my raw data file is never actually touched.
  • Compared to some other solutions, R is a workhorse. I suspect some data scientists would disagree with me characterizing R as a workhorse, which is why I qualified that statement.  R is not a great solution for truly big data.  However, it can handle much bigger data than Excel, which will groan if you try to load a file with several hundred thousand records and break if you try to load more than about a million rows.  By comparison, this afternoon I loaded a JSON file with 1.5 million lines into R and it took about a minute (a minimal version of that kind of load is sketched after this list).  So, while it may not be there yet in terms of big data, I think R is a nice solution for small to medium data.  Besides that, I think learning R is very pragmatic, because once you've got the basics down, you can do so many things with it.  Though it was originally created as a statistical language, you can do almost anything you can think of to/with data using R, and once you've got the hang of the basic syntax, you're really set to branch out into a lot of really interesting areas.  I talked in the last post about visualization, which I think R really excels at.  I'm particularly excited about learning to use R for machine learning and natural language processing, which are two areas that I think are going to be particularly important in terms of data analysis and knowledge discovery in the next few years.  There's a great deal of data freely available, and learning skills like some basic R programming will vastly increase your ability to get it, interact with it, and learn something interesting from it.
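Here's the rough sketch of the Likert workflow I promised in the first bullet.  The package names (likert, and foreign, which ships with R) are real; the file names and columns are invented for illustration:

library(likert)   # community-contributed package for Likert-scale data
library(foreign)  # reads files from several proprietary stats packages

# Invented survey file: each column is one Likert-type question, stored as a factor
# (likert() expects every column to be a factor with the same set of levels)
survey <- read.csv("survey_responses.csv", stringsAsFactors = TRUE)
lik <- likert(survey)
summary(lik)   # response distributions for each question
plot(lik)      # the package's stacked-bar visualization

# Opening data trapped in a proprietary format, e.g. an SPSS .sav file:
spss_data <- read.spss("colleagues_data.sav", to.data.frame = TRUE)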
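And here's the kind of short cleaning script I described in the reproducibility bullet.  The file names and variables are invented, but the shape of it is the point: the raw file is only ever read, and every change is recorded in the script itself:

# Read the raw data; the file on disk is never modified
raw <- read.csv("downloaded_raw_data.csv")

# All cleaning happens on the in-memory copy, and every step is documented here
clean <- raw
clean$site <- toupper(clean$site)                      # standardize site codes
clean <- clean[!is.na(clean$measurement), ]            # drop rows missing the key value
names(clean)[names(clean) == "subj"] <- "subject_id"   # rename a cryptic column

# Write the cleaned version to a *new* file, leaving the raw data untouched
write.csv(clean, "cleaned_data_for_analysis.csv", row.names = FALSE)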
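Finally, the big JSON load from the workhorse bullet.  I'm showing the jsonlite package here as one example of a JSON reader for R, with a made-up file name; it assumes the file has one JSON record per line:

library(jsonlite)

# stream_in() reads newline-delimited JSON records into a data frame
big <- stream_in(file("big_download.json"))
nrow(big)   # how many records did we get?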

I should add that there are many other scripting languages that can accomplish many of the same things as R.  I highlight R because, in my experience, it is the most approachable for non-programmers and thus the most likely to appeal to librarians, who are my primary audience here.  I'm in the process of learning Python, and I'm at the point of wanting to bang my head against a wall with it.  R is not necessarily easy when you first get started, but I felt comfortable using it with much less effort than I expected it would take.  Your mileage may vary, but given the effort-to-payoff ratio I got, I absolutely think that my time spent learning R was well worth it.

R for libRarians: visualization

I recently blogged about R and how cool it is, and how it’s really not as scary to learn as many novices (including myself, a few years ago) might think.  Several of my fellow librarians commented, or emailed, to ask more about how I’m using R in my library work, so I thought I would take a moment to share some of those ideas here, and also to encourage other librarians who are using R (or related languages/tools) to jump in and share how you’re using it in your library work.

I should preface this by saying I don’t do a lot of “regular” library work anymore – most of what I do is working with researchers on their data, teaching classes about data, and collecting and working with my own research data.  However, I did do more traditional library things in the past, so I know that these kinds of skills would be useful.  In particular, there are three areas where I’ve found R to be very useful: visualization, data processing (or wrangling, or cleaning, or whatever you want to call it), and textual analysis.  Because I could say a lot about each of these, I’m going to do this over several posts, starting with today’s post on visualization.

Data visualization is one of my new favorite things to work on, and by far the tool I use most is R, specifically the ggplot2 package.  This package utilizes the concepts outlined in Leland Wilkinson’s Grammar of Graphics, which takes visualizations apart into their individual components.  As Wilkinson explains it,  “a language consisting of words and no grammar expresses only as many ideas as there are words. By specifying how words are combined in statements, a grammar expands a language’s scope…The grammar of graphics takes us beyond a limited set of charts (words) to an almost unlimited world of graphical forms (statements).”  When I teach ggplot2, I like to say that the kind of premade charts we can create with Excel are like the Dr. Seuss of visualizations, whereas the complex and nuanced graphics we can create with ggplot2 are the War and Peace.

For example, I needed to create a graph for an article I was publishing that showed how people had responded to two questions: basically, how important they felt a task was to their work, and how good they thought they were at that task.  I was not just interested in how many people had rated themselves in each of the five bins of my Likert scale, so a histogram or bar chart wouldn't capture what I wanted.  That would show me how people had answered each question individually, but I was interested in showing the distribution of combinations of responses.  In other words, did people who said that a task was important to them have a correspondingly high level of expertise? I was picturing something sort of like a scatterplot, but with each point (i.e., each combination of responses) sized according to how many people had responded with that combination.  I was able to do exactly this with ggplot2:
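This isn't my actual code (that's on my work computer), but here's a rough sketch of the idea with made-up stand-in data – each combination of answers becomes one point, sized by how many people gave that combination:

library(ggplot2)

# Invented stand-in for my survey: two questions, each answered on a 1-5 scale
set.seed(1)
responses <- data.frame(importance = sample(1:5, 120, replace = TRUE),
                        expertise  = sample(1:5, 120, replace = TRUE))

# geom_count() plots one point per combination of x and y, sized by how many rows share it
ggplot(responses, aes(x = importance, y = expertise)) +
  geom_count() +
  scale_size_area(max_size = 10)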

This was exactly what I wanted, and not something that I could have created with Excel, because it isn't a "standard" chart type.  Not only that, but since everything was written in code, I was able to save it so I had an exact record of what I did (when I get back to my work computer, instead of my personal one, I will get the file and actually put that code here!).  It was also very easy to go back and make changes.  In the original version, I had the points sized by the actual number of people who had responded, but one of the reviewers felt this was potentially confusing because of the disparity in the size of each group (110 scientific researchers, but only 21 clinical researchers).  I was asked to change the points to show percent of responses, rather than number of responses, and this took just one minor change to the code that I could accomplish in less than a minute.

I also like ggplot2 for creating highly complex graphics that demonstrate correlations in multivariate data sets.  When I'm teaching, I like to use the sample data set that comes with ggplot2, which has info about roughly 54,000 diamonds, with 10 variables, including things like price, cut, color, carat, clarity, and so on.  How is price determined for these diamonds?  Is it simply a matter of size – the bigger it is, the more it costs?  Or do other variables also contribute to the price?  We could do some math to find out the actual answer, but we could also quickly create a visualization that maps out some of these relationships to see if some patterns start to emerge.

First, I’ll create a scatterplot of my diamonds, with price on the x-axis and carat on the y-axis.  Here it is, with the code to create it below:

library(ggplot2)
diam <- diamonds  # assuming diam is the built-in diamonds data set (or a sample of it)

a <- ggplot(diam, aes(x = price, y = carat)) + geom_point() + geom_abline(slope = 0.0002656748, intercept = 0, col = "red")
a  # printing the object draws the plot

If there were a perfect relationship between price and diamond size, we would expect our points to cluster along the red line I've inserted here, which represents a perfectly proportional relationship between the two.  Clearly, that is not the case.  So we might propose that there are other variables that contribute to a diamond's price.  If I really wanted to, I could actually demonstrate lots of variables in one chart.  For example, this sort of crazy visualization shows five different variables: price (x-axis), carat (y-axis), color (color of point, with red being worst quality color and lightest yellow being best quality color), clarity (size of point, with smallest point being lowest quality clarity and largest point being highest quality clarity), and cut (faceted, with each of the five cut categories shown in its own chart).

library(RColorBrewer)  # provides the brewer.pal() color palettes

ggplot(diam, aes(x = price, y = carat, col = color)) +
  geom_point(aes(size = clarity)) +
  scale_colour_manual(values = rev(brewer.pal(7, "YlOrRd"))) +
  facet_wrap(~cut, nrow = 1)


We’d have to do some more robust mathematical analysis of this to really get info about the various correlations here, but just in glancing at this, I can see that there are definitely some interesting patterns here and that this data might be worth further looking into.  And since I use ggplot2 quick a bit and am fairly proficient with it, this plot took me less than a minute to put together, which is exactly why I love ggplot2 so much.

You can probably see how you could use ggplot2 to create, as I’ve said, nearly infinitely customized charts and graphs.  To relate this back to libraries, you could create visualizations about your collection, your budget, or whatever other numbers you might want to visually display in a presentation or a publication.  There are also other R packages that let you create other types of visualizations.  I haven’t used it, but there’s a package called VennDiagram that lets you, well, make Venn diagrams – back in my days of teaching PubMed, I used to always use Venn diagrams to show how Boolean operators work, and this would allow you to make them really easily (I was always doing weird stuff with Powerpoint to try to make mine look right, and they never quite did).  There are also packages like ggvis and Shiny that let you create interactive visualizations that you could put on a website, which could be cool.  I’ve only just started to play around with these packages, so I don’t have any examples of my own, but you can see some examples of cool things that people have done in the Shiny Gallery.
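To give a sense of how easy the VennDiagram package makes it, here's a sketch of a basic two-set diagram.  I haven't used the package for real work yet, so treat this as illustrative, with invented counts standing in for the results of two PubMed searches and their overlap:

library(VennDiagram)
library(grid)

grid.newpage()
draw.pairwise.venn(area1 = 1200, area2 = 850, cross.area = 310,
                   category = c("diabetes", "exercise"),
                   fill = c("lightblue", "lightyellow"))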

So there you go!  I love R for visualizations, and I think it’s much easier to create nice looking graphics with R than it is with Excel or Powerpoint, once you get the hang of it.  Now that I’ve heard from some other librarians who are coding, do any of you have other ideas about using R (or other languages!) for visualizations, or examples of visualizations you’ve created?

Some Additional Resources:

  • I teach a class on ggplot2 at my library – the handout and class exercises are on my Data Services libguide.
  • The help documentation for ggplot2 is quite thorough.  Looking at the various options, you can see how you can create a nearly infinite variety of charts and graphs.
  • If you’re interested in learning more about the Grammar of Graphics but don’t want to read the whole book, Hadley Wickham, who created ggplot2, has written a nice article, A Layered Grammar of Graphics, that captures many of the ideas.

So you think you can code

I’ve been thinking about many ideas lately dealing with data and data science (this is, I’m sure, not news to anyone).  I’ve also had several people encourage me to pick my blog back up, and I’ve recently made my den into a cute and comfy little office, so, why not put all this together and resume blogging with a little post about my thoughts on data!  In particular, in this post I’m going to talk about coding.

Early on in my library career when I first got interested in data, I was talking to one of my first bosses and told her I thought I should learn R, which is essentially a scripting language, very useful for data processing, analysis, statistics, and visualization.  She gave me a sort of dubious look, and even as I said it, I was thinking in my head, yeah, I’m probably not going to do that.  I’m no computer scientist.  Fast forward a few years later, and not only have I actually learned R, it’s probably the single most important skill in my professional toolbox.

Here’s the thing – you don’t have to be a computer scientist to code, especially in R.  It’s actually remarkably straightforward, once you get over the initial strangeness of it and get a feel for the syntax.  I started offering R classes around the beginning of this year and I call my introductory classes “Introduction to R for Non-programmers.”  I had two reasons for selecting this name: one, I had only been using R for less than a year myself and didn’t (and still don’t) consider myself an expert.  When I started thinking about getting up in front of a room of people and teaching them to code, I had horrifying visions of experienced computer scientists calling me out on my relative lack of expertise, mocking my class exercises, or correcting me in front of everyone.  So, I figured, let’s set the bar low. 🙂  More importantly, I wanted to emphasize that R is approachable!  It’s not scary!  I can learn it, you can learn it.  Hell, young children can (and do) learn it.  Not only that, but you can learn it from one of a plethora of free resources without ever cracking a book or spending a dime.  All it takes is a little time, patience, and practice.

The payoff?  For one thing, you can impress your friends with your nerdy awesome skills!  (Or at least that’s what I keep telling myself.)  If you work with data of any kind, you can simplify your work, because using R (or other scientific programming languages) is faaaaar more efficient than using other point and click tools like Excel.  You can create super awesome visualizations, do crazy data analysis in a snap, and work with big huge data sets that would break Excel.  And you can do all of this for free!  If you’re a research and/or medical librarian, you will also make yourself an invaluable resource to your user community.  I believe that I could teach an R class every day at my library and there would still be people showing up.  We regularly have waitlists of 20 or more people.  Scientists are starting to catch on to all the reasons I’ve mentioned above, but not all of them have the time or inclination to use one of the free online resources.  Plus, since I’m a real human person who knows my users and their research and their data, I know what they probably want to do, so my classes are more tailored to them.

I was introduced to Hadley Wickham yesterday – he's a pretty big deal in the R world, as he created some very important R packages (kind of like apps) – and my friend and colleague who introduced me said, "this is Lisa; she is our prototypical data scientist librarian."  I know there are other librarian coders out there because I'm on mailing lists with some of them, but I'm not currently aware of any other data librarians or medical librarians who know R.  I'm sure there are others and I would love to meet them.  And if it is fair to consider me a "prototype," I wonder how many other librarians will be interested in becoming data scientist librarians.  I'm really interested in hearing from the librarians reading this – do you want to code?  Do you think you can learn to code?  And if not, why not?

Radical Reuse: Repurposing Yesterday’s Data for Tomorrow’s Discoveries

I’ve been invited to be speaker at this evening’s Health 2.0 STAT meetup at Bethesda’s Barking Dog, alongside some pretty awesome scientists with whom I’ve been collaborating on some interesting research projects.  This invitation is a good step toward my ridiculously nerdy goal of one day being invited to give a TED talk.  My talk, entitled “Radical Reuse: Repurposing Yesterday’s Data for Tomorrow’s Discoveries” will briefly outline my view of data sharing and reuse, including what I view as five key factors in enabling data reuse.  Since I have only five minutes for this talk, obviously I’ll be hitting only some highlights, so I decided to write this blog post to elaborate on the ideas in that talk.

First, let’s talk about the term “radical reuse.”  I borrow this term from the realm of design, where it refers to taking discarded objects and giving them new life in some context far removed from their original use.  For some nice examples (and some cool craft ideas), check out this Pinterest board devoted to the topic.  For example, shipping pallets are built to fulfill the specific purpose of providing a base for goods in transport.  The person assembling that shipping pallet, the person loading it on to a truck, the person unpacking it, and so on, use it for this specific purpose, but a very creative person might see that shipping pallet and realize that they can make a pretty cool wine rack out of it.

The very same principle is true of scientific research data.  Most often, a researcher collects data to test some specific hypothesis, often under the auspices of funding that was earmarked to address a particular area of science.  Maybe that researcher will go on to write an article that discusses the significance of those data in the context of that research question.  Or maybe those data will never be published anywhere because they represent negative or inconclusive findings (for a nice discussion of this publication bias, see Ben Goldacre's 2012 TED talk).  Whatever the outcome, the usefulness of the dataset need not end when the researcher who gathered the data is done with it.  In fact, that data may help answer a question that the original researcher never even conceived, perhaps in an entirely different realm of science.  What's more, the return on investment in that data increases when it can be reused to answer novel questions, science moves more quickly because the process of data gathering need not be repeated, and therapies potentially make their way into practice sooner.

Unfortunately, science as it is practiced today does not particularly lend itself to this kind of radical reuse.  Datasets are difficult to find, hard to get from researchers who “own” them, and often incomprehensible to those who would seek to reuse them.  Changing how researchers gather, use, and share data is no trivial task, but to move toward an environment that is more conducive to data sharing, I suggest that we need to think about five factors:

  • Description: if you manage to find a dataset that will answer your question, it's unlikely that the researcher who originally gathered that data is going to stand over your shoulder and explain the ins and outs of how the data were gathered, what the variables or abbreviations mean, or how the machine was calibrated at the time.  I recently helped some researchers locate data about influenza, and one of the variables was patient temperature.  Straightforward enough.  Except the researchers asked me to find out how temperature had been obtained – oral, rectal, tympanic membrane – since this affects the reading.  I emailed the contact person, and he didn't know.  He gave me someone else to talk to, who also didn't know.  I was never able to hunt down the answer to this fairly simple question, which is pretty problematic.  To the extent possible, data should be thoroughly described, particularly using standardized taxonomies, controlled vocabularies, and formal metadata schemas that will convey the maximum amount of information possible to potential data re-users or other people who have questions about the dataset.
  • Discoverability: when you go into a library, you don't see a big pile of books just lying around and dig through the pile hoping you'll find something you can use.  Obviously this would be ridiculous; chances are you'd throw up your hands in dismay and leave before you ever found what you were looking for.  Librarians catalog books, shelve them in a logical order, and put the information into a catalog that you can search and browse in a variety of ways so that you can find just the book you need with a minimal amount of effort.  And why shouldn't the same be true of data?  One of the services I provide as a research data informationist is assisting researchers in locating datasets that can answer their questions.  I find it to be a very interesting part of my job, but frankly, I don't think you should have to ask a specialist in order to find a dataset, any more than I think you should have to ask a librarian to go find a book on the shelf for you.  Instead, we need to create "catalogs" that empower users to search existing datasets for themselves.  Databib, which I describe as a repository of repositories, is a good first step in this direction – you can use it to at least hopefully find a data repository that might have the kind of data you're looking for, but we need to go even further and do a better job of cataloging well-described datasets so researchers can easily find them.
  • Dissemination: sometimes when I ask researchers about data sharing, the look of horror they give me is such that you’d think I’d asked them whether they’d consider giving up their firstborn child.  And to be fair, I can understand why researchers feel a sense of ownership about their data, which they have probably worked very hard to gather.  To be clear, when I talk about dissemination and sharing, I’m not suggesting that everyone upload their data to the internet for all the world to access.  Some datasets have confidential patient information, some have commercial value, some even have biosecurity implications, like H5N1 flu data that a federal advisory committee advised be withheld out of fear of potential bioterrorism.  Making all data available to anyone, anywhere is neither feasible nor advisable.  However, the scientific and academic communities should consider how to increase the incentives and remove the barriers to data sharing where appropriate, such as by creating the kind of data catalogs I described above, raising awareness about appropriate methods for data citation, and rewarding data sharing in the promotion and tenure process.
  • Digital Infrastructure: okay, this is normally called cyberinfrastructure, but I had this whole "words starting with the letter D" thing going and I didn't want to ruin it. 🙂  If we want to do data sharing properly, we need to build the tools to manage, curate, and search the data.  This might seem trivial – I mean, if Google can return 168 million web pages about dogs for me in 0.36 seconds, what's the big deal with searching for data?  I'm not an IT person, so I'm really not the right person to explain the details of this, but as a case in point, consider the famed Library of Congress Twitter collection.  The Library of Congress announced that they would start collecting everything ever tweeted since Twitter started in 2006.  Cool, huh?  Only problem is, at least as of January 2013, LC couldn't provide access to the tweets because they lacked the technology to allow such a huge dataset to be searched.  I can confirm that this was true when I contacted them in March or April of 2013 to ask about getting tweets with a specific hashtag that I wanted to use to conduct some research on the sociology of scientific data sharing, and they turned me down for this reason.  Imagine the logistical problems that would arise with even bigger, more complex datasets, like those associated with genome-wide association studies.
  • Data Literacy: Back in my library school days, my first ever library job was at the reference desk at UCLA's Louise M. Darling Biomedical Library.  My boss, Rikke Ogawa, who trained me to be an awesome medical librarian, emphasized that when people came and asked questions at the reference desk, this was a teachable moment.  Yes, you could just quickly print out the article the person needed because you knew PubMed inside and out, but the better thing to do was turn that swiveling monitor around and show the person how to find the information.  You know, the whole "give a man a fish and he'll eat for a day, teach a man to fish and he'll eat for a lifetime" thing.  The same is true of finding, using, and sharing data.  I'm in the process of conducting a survey about data practices at NIH, and almost 80% of the respondents have never had any training in data management.  Think about that for a second.  In one of the world's most prestigious biomedical research institutions, 80% of people have never been taught how to manage data.  Eighty percent.  If you're not as appalled by that as I am, well, you should be.  Data cannot be used to its fullest if the next generation of scientists continues with the kind of makeshift, slapdash data practices I often encounter in labs today.  I see the potential for more librarians to take positions like mine, focusing on making data better, but that doesn't mean that scientists shouldn't be trained in at least the basics of data management.

So that’s my data sharing manifesto.  What I propose is not the kind of thing that can be accomplished with a few quick changes.  It’s a significant paradigm shift in the way that data are collected and science is practiced.  Change is never easy and rarely embraced right away, but in the end, we’re often better for having challenged ourselves to do better than we’ve been doing.  Personally, I’m thrilled to be an informationist and librarian at this point in history, and I look forward to fondly reminiscing about these days in our data-driven future. 🙂

Librarian in a New City: A Dispatch From North Bethesda

[photo: a tree in full leaf]

I haven’t blogged in a long time, and quite a lot has happened since my last posting here.  I’ve sold my scooter and my surfboard, packed up all my stuff, and traded one coast for another.  I’m no longer a Los Angeleno, but a North Bethesdian?  Marylander?  I’m not really sure what the right word is, but I’m living in North Bethesda, MD, a suburb of Washington DC, and working at that National Institutes of Health (I’ve had to go through a lot of ethics training in the last few days, so I feel obligated at this point to say emphatically that all the nonsense I write here is my own thoughts and should not in any way be taken to represent the US government, etc).  In any case, here are some observations about the highlights of my new life after my first couple weeks here.

– seasons are pretty awesome so far.  Look at that tree up there in the picture and all the leaves!  You don't get that in LA. Plus, I get to accessorize with things like cute hats and gloves and scarves.  I'll admit that I thought I might die out there when I went for a 30-minute run in 48-degree weather today, but I'm sure I'll get acclimated.  Either that or use the very nice gym in the lovely building where I live.

– being a Metro commuter is kind of nice.  Granted, it’s not as fun as aggressively racing around the city on a scooter, but it’s a lot less stressful than having to deal with traffic.  I live only two stops away from my work, so the ride is almost short enough to make it pointless to even bother bringing a book to read.  My metro stop is outdoors, with a little hilly grassy area on the other side of the tracks, and the first time I ever took the metro, I was standing there waiting for the train and saw something that looked like a beaver running on the grass along the railing.  I was staring at it with apparently a rather dumbfounded look on my face, because a guy who was also waiting for the train said to me “there goes one of the four-legged commuters.”  I asked him if it was a beaver, and he told me it was a groundhog, also known as a woodchuck, and then laughed at me like I was an idiot.  I told him I was from LA and had only been in town for two days, and he seemed satisfied with this explanation.

– speaking of wildlife, Ophelia is enjoying herself tremendously here, especially all the opportunities to encounter new animals.  At first, this little urban dog was highly skeptical of all the beautiful walking trails that go through these great wooded areas.  We would start down a nice tree-lined trail, and she would get visibly freaked out and refuse to continue.  I’m sure she was asking herself, where are all the buildings?  Why aren’t there any homeless people throwing pizza or screaming at me here?  Why are there so many trees, and why do they have leaves instead of palms on them, and more importantly, why are these leaves falling on my head?  However, when she realized there were things like rabbits, chipmunks, and squirrels of many different colors hanging out in the woods, she became much more interested.  Now when we walk down trails, she’s running from side to side, sniffing excitedly, trying to drag me into the trees to chase small wildlife.

– this is probably going to sound ridiculous, but by far the best thing that has happened to me here, the biggest and most significant change in my life, is that I have a dishwasher now.  After three years without one, I’d gotten very tired of washing dishes by hand, and very lazy.  I would think about cooking something, and calculate the number of dishes I’d have to wash by hand, and then I’d shake my head and order a pizza.  If by some miracle I did motivate myself to cook something, I’d often eat it straight out of the pot just to save myself the trouble of having to wash a plate or bowl.  The pizza delivery guy and I literally knew each other’s names because we saw each other every week, and one time he said “I brought you some extra ranch dressing this week because I forgot it last week,” which was nice, but kind of an embarrassing reminder of how often I had junk food delivered directly to my door so I could avoid washing a dish or two.  I’ve lived here for almost a month now, and you know how many times I’ve ordered pizza?  ZERO.  In fact, I don’t even know where to order pizza from in this neighborhood.  If I want pizza now, I whip up my own made-from-scratch crust and make my own pizza, and I don’t care if it dirties one dish or twenty, cause I’m throwing all of it in the dishwasher and it’s emerging sparkly clean and sanitized.

There are lots of things I have to say about my new life here, as well as waxing poetic about science and data and librarianship in my typically nerdy fashion, but I’ll save that for later.  For now, good night from Maryland!

What book are you reading?

UCLA recently got a new university librarian, Ginny Steel, who had her first day on Monday.  I appreciated that she had an all-staff meeting that very same day to talk to us about her plans and also to get feedback from us on our thoughts and priorities.  Much of the hour-and-a-half meeting was questions from the staff.  I had one of those "yup, I'm totally in the right profession" moments when one of the questions from the staff was "what's your favorite book?"  Where else but a library would you hear a question like that at an all-staff meeting with a new leader?  She answered the question the same way I had when I was asked the same thing a week or two earlier: how can you pick just one?  With all the great books I've read, there's no way I could select one above the others.  When she said as much, the staff person asked her what she was currently reading.  She said a book about Chicago – I believe the title was The Third Coast.

I really like hearing what people are reading.  The only problem is that many times the answer turns out to be something very interesting, and then I look it up and it sounds incredible, so naturally I have to get a copy, and then it gets put on top of my ever-expanding pile of to-be-read treasures.  Nonetheless, I’m curious to know what you, the blog readers, are reading.   If there are still any of you left after my long blog absence!

Currently I’m in the middle of three books:

  • The Gold Bug Variations: I was reading this book while sitting alone in a restaurant for lunch, and the waiter asked me what it was about.  I was fairly certain that he probably didn't want to hear the 10-minute-long microbiology, music, and literature lecture that it would take to explain the book.  So after a moment's hesitation, I replied, "it's about DNA, Bach music, and librarians."  He said it sounded like something he would like, so I guess that was a good enough description.  I won't bore any of you readers with the lecture either, but encourage you to check it out if you like fairly intellectual novels and any/all of the three aforementioned items.
  • Van Gogh: The Life: I was so fortunate to get to visit the Van Gogh Museum in Amsterdam recently, and it was quite an incredible experience, especially in how educational it was.  I obviously had seen pictures of many Van Gogh paintings before, but never so many of the real thing in one place.  I also only knew the broad strokes of his life – crazy, cut off his ear, etc.  The information that went along with the pictures revealed a much more nuanced view of his life.  I left feeling awed by the fact that he hadn’t even started painting until he was almost 30, and even more awed that he generally lacked any sort of innate artistic talent, and only achieved such great success as he did by creating prolifically and doggedly persisting at his craft in the face of constant rejection, even from his own family.  As soon as I left the museum, I got right on Amazon and looked for the best biography I could find, and found this one.  It’s a hefty tome – almost 1000 pages – but really an enjoyable read.  I’m not generally a biography person, but I’m really liking this.
  • Dry Tears: I also had a chance to visit the Anne Frank House when I was in Amsterdam, which was also a moving experience.  Earlier that year I'd read Man's Search For Meaning by Viktor Frankl, a Jewish psychiatrist who had survived Auschwitz.  As with Van Gogh's life, I really only had a general sense of the history surrounding the Holocaust.  One of the things I've never understood is how the Nazis ever got so many people to go along with such crazy things.  I stumbled upon an online course on Coursera, called The Holocaust, that includes readings on history and Holocaust literature, as well as movies and documentaries.  The course started Monday and I checked out the first few items from the reading list today.  This is the first of the books, and so far I'm finding it very interesting.  I don't know if there's really a good answer for why people did what they did, but I'm hoping I'll at least learn a little more about what life was like at that time.

So that’s my current reading list.  How about all of you?

the sweet scent of Actinomycetes (or why rain smells good)

This morning I walked out the door and caught a whiff of something I don’t smell often in Los Angeles – actinomycetes!

You know that "rain smell" that you can detect, especially on a day when it hasn't rained in a while?  That's actinomycetes – a kind of bacteria that lives in the soil and produces an earthy-smelling compound called geosmin.  When it rains, the water hitting the ground aerosolizes the bacteria and their geosmin, creating that distinctive rain smell.  So the next time you catch a whiff of the lovely, fresh scent of rain, don't forget that it's actually tiny liquid droplets of dirt bacteria entering your nose. 🙂

we meet again, my blog

Hello again, world.  It’s been a very long time since I last wrote here – what a terrible blogger I’ve been!  I’ve got some excuses – namely a very busy work schedule and way too many really interesting books to distract me.  However, I’m back and I hope some of my old readers are with me.  And perhaps some new.

I was thinking recently about some things I wanted to blog about.  I was also thinking how I really admire those people who commit to doing something artistic or creative on a regular schedule.  I have a couple of friends who have done the photo-a-day for a year thing, one of whom (Jonathan Wilson – he’s quite talented) is now in the midst of his 6th year of doing so.  Of course there’s the Julie and Julia movie based on the blogger who cooked her way through Julia Child’s cookbook.  I briefly followed a blog called A Dress A Day by a woman who is literally making a dress every single day, then photographing it, then blogging about it.  Here I am, sitting here wasting time on the internet and doing crossword puzzles when I could have hundreds of new dresses in my closet if I just applied myself!

Since my closet isn’t really big enough to accommodate that many dresses, I thought maybe I should undertake something slightly different if I were going to attempt a year-long creative challenge.  And then I thought about how I hadn’t written in my blog for so long, and I put two and two together… could I write a blog post every day for a year?  Why not give it a try?  I figured the worst that could happen would be that I wouldn’t be able to do it, and it’ll be like that time I said I wasn’t going to buy anymore new books until I read the ones I already had (that lasted about two weeks, I think).

So, I say, here goes.  Let's see if, starting now, I can write a blog post every day for a year.  It doesn't necessarily have to be long.  It can be about anything I find interesting that day, whether it's a librarian thing, an incredibly nerdy science thing, a cooking thing, a knitting thing, an Ophelia thing, or whatever.  Or maybe just a funny joke or a picture I happen to like that day.  I'm sure not every person will find all the posts interesting, but hopefully many people will like at least some of them. 🙂