Practicing What I Preach: The Open PhD Experiment

(Note: this is an adapted version of a final paper I wrote for one of my classes. That’s why it’s so long!)

A few weeks ago, a researcher called my office to see if we could meet to discuss our shared interest in open data. I agreed, and a week later we were sitting in my office having a lively discussion about the many problems that currently hinder more widespread data sharing and reuse in biomedical research. When I mentioned that these topics would be the focus of my doctoral dissertation work, he expressed an interest in seeing some of my research. I replied that it was only my first semester, so I didn’t have much yet, but that I’d published a few papers on my previous research. “I don’t mean papers,” he said. “I mean your data, your code. If you’re doing a PhD on data sharing, don’t you think you should share your data, too?  In fact, why don’t you do an open PhD?”

Perhaps I should have immediately replied, “you’re absolutely right. I will do an open PhD.”  After all, on the face of it, this suggestion seems perfectly reasonable. My research, and in fact my entire career, revolves around the premise that researchers should share their data. It should be a no-brainer that I would also share my data. In principle, I have no problem with agreeing to do so, but in the real world of research, lofty ideals like service to the community and furthering science are sometimes abandoned in favor of more practical concerns, like getting one’s paper accepted or finishing one’s dissertation before other people have a chance to capitalize on the data.

So what I ended up telling this researcher was that I found his suggestion intriguing and I’d give it some serious thought. I have done just that in the intervening weeks, and here I will reflect on the reasons for my hesitation and explore the levels of openness I am prepared to take on in my doctoral program and my academic career.

My first (mis)adventure with data sharing

The first – and as yet only – time I shared my data was when I submitted an article to PLOS in 2014. PLOS was one of the first publishers to adopt an open data policy that required researchers to share the data underlying their manuscripts. I dutifully submitted my data to figshare, a popular, discipline-agnostic data repository, with the title “Biomedical Data Sharing and Reuse: Attitudes and Practices of Clinical and Scientific Research Staff.” To my surprise, someone at figshare took notice of my upload and tweeted out a link to my dataset. I could have sworn that I’d checked the box to keep the data private until I opted to officially release them, but when I’d gone back to fix a minor mistake in the title of the submission, the box must have gotten unchecked, and the status was changed to public.

After the tweet went out, I could see from the “views” counter that people were already looking at the data. Someone retweeted the link to the data, then another person, and another. The paper hadn’t even been reviewed by anyone yet, much less accepted for publication, but my data were out there for anyone to see, with the link spreading across Twitter. The situation made me nervous. I was excited that people were interested in my data, but what were they doing with it?  The views counter ticked up steadily, and people were not just viewing, but actually downloading the dataset as well.

I finally received word from PLOS that they’d accepted the paper, pending major revisions; Reviewer 2 (it’s always Reviewer 2) was niggling over my statistical methods, and I was going to have to redo much of my work to respond to all the revision requests. During the revision process, I received an email from someone I’d never heard of, from an Eastern European country I can’t now recall. She had seen my data on figshare and she, too, wanted to write a paper on this topic. She asked me to send her a copy of my still-in-process paper, as well as a list of all relevant references I had found. The audacity of her request shocked me. Here was someone I’d never even met, telling me she wanted to use my data, write essentially the same paper as me, and she wanted me to give her my background research as well?  I wrote an email back, politely but firmly rebuffing her request, and I never heard from her again.

In the end, everything went fine: the paper was published and it has gone on to be cited seven times and featured in PLOS’s new Open Data collection (PLOS Collections 2016). I do still believe that researchers, particularly those whose work is supported by taxpayers’ money, have a responsibility to share their data when doing so will not violate their human subjects’ privacy. However, my own experience demonstrated to me that sharing research data cannot be viewed as a black and white proposition, that you share and are “good,” or you don’t and you are “bad.” Rather, many researchers have real, valid concerns about how they share their data, when, and with whom. Though my reasons probably differ from those of many other researchers, I have my own concerns that give me pause when it comes to the idea of an “open PhD.”

  1. I don’t think my data would be useful or interesting to anyone else.

Some datasets have near infinite value, with uses that extend far beyond the expertise or disciplinary affiliation of their original collector. New computational methodologies and analytic techniques make it possible to uncover previously undetected meaning in datasets or “mash up” disparate datasets to detect novel connections between seemingly unrelated phenomena. The ability to quickly, easily, and cheaply share massive amounts of data means that researchers around the world are able to make life-saving discoveries. For example, the National Cancer Institute’s Cancer Genomics Cloud Pilot program allows researchers to connect to cancer genome data and perform complex analyses on cloud computing platforms more powerful than any computers they could buy for their lab (National Cancer Institute Center for Biomedical Informatics & Information Technology 2016). Projects like this are exciting – they could bring about cures for cancer and vastly improve our lives. Few people would argue that sharing these kinds of datasets is important.

By comparison, my data just look silly. Personally, I find my research fascinating. I could spend hours talking about biomedical scientists’ research data sharing and reuse practices. However, I don’t flatter myself that others are clamoring to see all the thrilling survey data and titillating interview transcriptions I have collected. Beyond validating the results in my article, I see little value for these data. Of course, I have made the argument that data can have unexpected uses that their original collectors could never have imagined, so I am prepared to admit that my data may have usefulness beyond what I would expect. Perhaps I should take the 252 views and 37 downloads of my figshare dataset as evidence of exactly that.

  2. I’m often embarrassed by my amateurish ways.

I’m a fan of GitHub, a site where you can share your code and allow others to collaboratively contribute to your work, but I’m also terrified of it. I spend a very significant amount of time at my job working with R, my programming language of choice; I teach it, I consult on it, and I use it for my own research. I like to think I know what I’m doing, but in all honesty, I’m pretty much entirely self-taught in R and, though I’m a quick study, I haven’t been using it for that long. I am far from an expert, and I often write code that makes this fact obvious.

Recently I wrote some R code related to a research project I hope to submit for publication soon. The work involved downloading the full text of over 60,000 articles, but since the server’s interface only allowed downloading a thousand articles at a time, I needed to write code that would download the allowed amount, then repeat itself 60 times, updating the article numbers after each iteration. I spent hours trying to figure out the best way to do it, but everything I tried failed. I could download a thousand at a time, then manually update the numbers in the code and re-run it, but doing this 60 times would have been time-consuming. In a throwing-up-your-hands moment of frustration, I wrote a command that would essentially just write those 60 lines of code for me, then ran all 60 lines.

Frankly, this approach was idiotic. Anyone who knows the first thing about programming would scoff at my code, and rightly so. However, at the time, this slipshod approach was the best I could come up with. It’s not just code that may reveal that I don’t always know what I’m doing; the more open the research process, the more opportunity for others to see the unpolished, imperfect steps that lie beneath the shiny surface of the perfected, word-smithed article.
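For the record, the idiomatic fix is just a loop over a sequence of batch offsets rather than 60 pasted-together lines. Here’s a minimal sketch of that pattern; `download_batch()` is a hypothetical stand-in for the server call in my real code, which isn’t shown here:

```r
# A loop over batch offsets -- the approach I should have used.
# download_batch() is a hypothetical placeholder for the server's
# thousand-articles-at-a-time interface; here it just returns the
# article numbers that one call would fetch.
download_batch <- function(start, size) {
  seq(start, start + size - 1)
}

batch_size <- 1000
total      <- 60000
starts     <- seq(1, total, by = batch_size)   # 1, 1001, 2001, ..., 59001

# One call per batch; the article numbers update themselves each iteration.
articles <- unlist(lapply(starts, function(s) download_batch(s, batch_size)))

length(starts)    # 60 batches
length(articles)  # 60000 article numbers
```

The whole trick is that `seq()` generates the 60 starting offsets, so there is nothing to copy, paste, or manually renumber.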

  3. It takes time to prepare data for broader consumption.

When I teach data management classes for researchers, I emphasize how good data management practices will make submitting their data at the end of the process easy, practically effortless. Of course, having your data perfectly ready to share without any extra effort at the end of your project is about as likely as jumping out of bed and looking good enough to head off to work without taking any time to freshen up. For example, part of my to-do list for preparing the article for the project I described above for publication is figuring out how to actually write that code the right way, so I can share it without fear of being humiliated. Getting my data, code, writing, or any other scholarly output I produce into the kind of shape it would need to be for me to be willing to put my name on it takes time. When I’m already trying to manage a demanding full-time job with a doctoral program and somehow still find the time to enjoy some sort of leisure every now and then, polishing up something to get it ready for sharing doesn’t often take enough priority to make it onto my daily schedule.

A compromise: the open-ish PhD

Though I’ve just spent five pages expounding on the reasons I cannot do a fully open PhD, I am prepared to compromise. The ideal the researcher urged me toward in our original conversation – don’t wait for your dissertation, share your data now, get your code up on GitHub today! – may not be right for me, but I do believe it is feasible to find some way to share at least some of my scholarly output, if not in real time, then at least in a timely fashion. Therefore, I propose the following tenets of my open-ish PhD:

  • I will do my best to write code that I am reasonably proud of (or at least not actively ashamed of) and share it on GitHub. While I do not feel comfortable immediately sharing code that corresponds to projects I am actively pursuing and seeking to publish, I will at least share it upon publication. I will also share teaching-related code immediately on GitHub, especially since doing so provides a good model for the researchers I am teaching.
  • I will make a more concerted effort to share my scholarly writing not just in its final, polished form as journal articles, but also in more casual settings, such as on my blog. I am also interested in exploring pre-print servers like arXiv and bioRxiv as a means of more rapid dissemination of research findings in advance of formal journal article publication.
  • I will attempt to collect data in a more mindful and intentional way, recognizing that the point of my efforts is not simply to collect my data, but to inform others in my scholarly and research communities. As a federal employee, the work that I conduct in my official capacity cannot be copyrighted because it belongs not to me, but to all the American people who pay my salary. As I go forward with my research, I will do my best to remember that I am doing it not merely to satisfy my curiosity or add to my CV, but to advance science, even in my own small way.

In the end, it probably doesn’t matter so much whether the final data I share are perfect, whether my code impresses other people with its efficiency and elegance, or whether something I write appears in Nature or on my little blog. What matters is making the effort to share, committing to the highest level of openness possible, and doing so publicly and visibly – essentially, leading by example. I can give lectures on the importance of data sharing and teach classes on open source tools until I’m blue in the face, but perhaps the most important thing I can do to convince researchers of the importance of sharing and reusing data is doing exactly that myself.

Who Am I? The Identity Crisis of the Librarian/Informationist/Data Scientist

More and more lately, I’m asked the question “what do you do?” This is a surprisingly difficult question to answer.  Often, how I answer depends on who’s asking – is it someone who really cares or needs to know? – and how much detail I feel like going into at the moment.  When I’m asked at conferences, as I was quite a bit at FORCE2016, I try to be as explanatory as possible without getting pedantic, boring, or long-winded.  My answer in those scenarios goes something like “I’m a data librarian – I do a lot of instruction on data science, like R and data visualization, and data management.”  When I’m asked in more social contexts, I hardly even bother explaining.  Depending on my mood and the person who’s asking, I’ll usually say something like data scientist, medical librarian, or, if I really don’t feel like talking about it, just librarian.  It’s hard to know how to describe yourself when you have a job title that is pretty obscure: Research Data Informationist.  I would venture to guess that 99% of my family, friends, and even work colleagues have little to no idea what I actually spend my days doing.

In some regards, that’s fine.  Does it really matter if my mom and dad know what it means that I’ve taught hundreds of scientists R? Not really (they’re still really proud, though!).  Do I care if my date has a clear understanding of what a data librarian does?  Not really.  Do I care if a random person I happen to chat with while I’m watching a hockey game at my local gets the nuances of the informationist profession?  Absolutely not.

On the other hand, there are often times that I wish I had a somewhat more scrutable job title.  When I’m talking to researchers at my institution, I want them to know what I do because I want them to know when to ask me for help.  I want them to know that the library has someone like me who can help with their data science questions, their data management needs, and so on.  I know it’s not natural to think “library” when the question is “how do I get help with finding data” or “I need to learn R and don’t know where to start” or “I’d like to create a data visualization but I have no idea how to do it” or any of the other myriad data-related issues I or my colleagues could address.

The “informationist” term is one that has a clear definition and a history within the realm of medical librarianship, but I feel like it has almost no meaning outside of our own field.  I can’t even count the number of weird variations I’ve heard on that title – informaticist, informationalist, informatist, and many more.  It would be nice to get to the point that researchers understood what an informationist is and how we can help them in their work, but I just don’t see that happening in the near future.

So what do we do to make our contributions and expertise and status as potential collaborators known?  What term can we call ourselves to make our role clear?  Librarian doesn’t really do it, because I think people have a very stereotypical and not at all correct view of what librarians do, and it doesn’t capture the data informationist role at all.  Informationist doesn’t do it, because no one has any clue what that means.  I’ve toyed with calling myself a data scientist, and though I do think that label fits, I have some reservations about using that title, probably mostly driven by a terrible case of imposter syndrome.

What’s in a name?  A lot, I think.  How can data librarians, informationists, library-based data scientists, whatever you want to call us, communicate our role, our expertise, our services, to our user communities?  Is there a better term for people who are doing this type of work?

Some ponderings on #force2016 and open data

I’m attending FORCE2016, which is my first FORCE11 conference after following this movement (or group?) for a while, and I have to say, this is one interesting, thought-provoking conference.  I haven’t been blogging in a while, but I felt inspired to get a few thoughts down after the first day of FORCE2016:

  • I love the interdisciplinarity of this conference, and to me, that’s what makes it a great conference to attend.  In our “swag bag,” we were all given a “passport” and could earn extra tickets for getting signatures of attendees from different disciplines and geographic locations.  While free drinks are of course a great incentive, I think the fact that we have so many diverse attendees at this conference is a draw on its own.  I love that we are getting researchers, funders, publishers, librarians, and so many other stakeholders at the table, and I can’t think of another conference where I’ve seen this many different types of people from this many countries getting involved in the conversation.
  • I actually really love that there are so few concurrent sessions.  Obviously, fewer concurrent sessions means fewer voices joining the official conversation, but I think this is a small enough conference that there are ways to be involved, active, and vocal without necessarily being an invited speaker.  While I love big conferences like MLA, I always feel pulled in a million different directions – sometimes literally, like last year when I was scheduled to present papers at two different sessions during the same time period.  I feel more engaged at a conference when I’m seeing mostly the same content as others.  We’re all on the same page and we can have better conversations.  I also feel more engaged in the Twitter stream.  I’m not trying to follow five, ten, or more tweet streams at once from multiple sessions.  Instead, I’m seeing lots of different perspectives and ideas and feedback on one single session.

Now, those are some positives, but I do have to bring it down with one negative from this conference, and that is that it’s hard to constructively talk about how to encourage sharing and open science when you have a whole conference full of open science advocates.  I do not in any way want to disparage anyone, because I have a lot of respect for many of the participants in the session I’m talking about, but I was a little disappointed in the final session today on data management.  I loved the idea of an interactive session (plus I heard there would be balloons and chocolate, so, yeah!) and also the idea of a debate on topics in data sharing and management, since that’s my jam.  I did debate in high school, so I can recognize the difficulty, but also the usefulness, of having to argue for a position with which you strongly disagree.  There’s real value in spending some time thinking about why people hold views that oppose your own strongly held position.  And yeah, this was the last session of a long day, and it was fun, and it had balloon popping and apparently some chocolate and whatnot, but I see it as a real missed opportunity to dig into how we can address some of the arguments against data sharing and data management.

Sure, we all laughed at the straw men being thrown out there by the teams who were called upon to argue in favor of something that they (and all of us, as open science advocates) strongly disagreed with.  But we lost a chance to give serious thought to some of the real issues that researchers who are not open science advocates actually raise.  Someone in that session mentioned the open data excuses bingo page (you can find it here if you haven’t seen it before).  Again, funny, but SERIOUSLY, I have actually had real researchers say ALL of these things, except for the thing about terrorists.
I will reiterate that I know and respect a lot of people involved with that session and I’m not trying to disparage them in any way, but I do hope we can give some real thought to some of the issues that were brought up in jest today.  Some of these excuses, or complaints, or whatever, are actual, strongly-held beliefs of many, many researchers.  The burden is on us, as open science advocates, to demonstrate why data sharing, data management, and the like are tenable positions and in fact the “correct” choice.

Okay, off my soap box!  I’m really enjoying this conference, having a great time reconnecting with people I’ve not seen in years, and making new connections.  And Portland!  What a great city. 🙂

Radical Reuse: Repurposing Yesterday’s Data for Tomorrow’s Discoveries

I’ve been invited to be a speaker at this evening’s Health 2.0 STAT meetup at Bethesda’s Barking Dog, alongside some pretty awesome scientists with whom I’ve been collaborating on some interesting research projects.  This invitation is a good step toward my ridiculously nerdy goal of one day being invited to give a TED talk.  My talk, entitled “Radical Reuse: Repurposing Yesterday’s Data for Tomorrow’s Discoveries,” will briefly outline my view of data sharing and reuse, including what I view as five key factors in enabling data reuse.  Since I have only five minutes for this talk, obviously I’ll be hitting only some highlights, so I decided to write this blog post to elaborate on the ideas in that talk.

First, let’s talk about the term “radical reuse.”  I borrow this term from the realm of design, where it refers to taking discarded objects and giving them new life in some context far removed from their original use.  For some nice examples (and some cool craft ideas), check out this Pinterest board devoted to the topic.  For example, shipping pallets are built to fulfill the specific purpose of providing a base for goods in transport.  The person assembling that shipping pallet, the person loading it on to a truck, the person unpacking it, and so on, use it for this specific purpose, but a very creative person might see that shipping pallet and realize that they can make a pretty cool wine rack out of it.

The very same principle is true of scientific research data.  Most often, a researcher collects data to test some specific hypothesis, often under the auspices of funding that was earmarked to address a particular area of science.  Maybe that researcher will go on to write an article that discusses the significance of this data in the context of that research question.  Or maybe that data will never be published anywhere because they represent negative or inconclusive findings (for a nice discussion of this publication bias, see Ben Goldacre’s 2012 TED talk).  Whatever the outcome, the usefulness of the dataset need not end when the researcher who gathered the data is done with it.  In fact, that data may help answer a question that the original researcher never even conceived, perhaps in an entirely different realm of science.  What’s more, the return on investment in that data increases when it can be reused to answer novel questions, science moves more quickly because the process of data gathering need not be repeated, and therapies potentially make their way into practice more quickly.

Unfortunately, science as it is practiced today does not particularly lend itself to this kind of radical reuse.  Datasets are difficult to find, hard to get from researchers who “own” them, and often incomprehensible to those who would seek to reuse them.  Changing how researchers gather, use, and share data is no trivial task, but to move toward an environment that is more conducive to data sharing, I suggest that we need to think about five factors:

  • Description: if you manage to find a dataset that will answer your question, it’s unlikely that the researcher who originally gathered that data is going to stand over your shoulder and explain the ins and outs of how the data were gathered, what the variables or abbreviations mean, or how the machine was calibrated when they were collected.  I recently helped some researchers locate data about influenza, and one of the variables was patient temperature.  Straightforward enough.  Except the researchers asked me to find out how temperature had been obtained – oral, rectal, tympanic membrane – since this affects the reading.  I emailed the contact person, and he didn’t know.  He gave me someone else to talk to, who also didn’t know.  I was never able to hunt down the answer to this fairly simple question, which is pretty problematic.  To the extent possible, data should be thoroughly described, particularly using standardized taxonomies, controlled vocabularies, and formal metadata schemas that will convey the maximum amount of information possible to potential data re-users or other people who have questions about the dataset.
  • Discoverability: when you go into a library, you don’t see a big pile of books just lying around and dig through the pile hoping you’ll find something you can use.  Obviously this would be ridiculous; chances are you’d throw up your hands in dismay and leave before you ever found what you were looking for.  Librarians catalog books, shelve them in a logical order, and put the information into a catalog that you can search and browse in a variety of ways so that you can find just the book you need with a minimal amount of effort.  And why shouldn’t the same be true of data?  One of the services I provide as a research data informationist is assisting researchers in locating datasets that can answer their questions.  I find it to be a very interesting part of my job, but frankly, I don’t think you should have to ask a specialist in order to find a dataset, any more than I think you should have to ask a librarian to go find a book on the shelf for you.  Instead, we need to create “catalogs” that empower users to search existing datasets for themselves.  Databib, which I describe as a repository of repositories, is a good first step in this direction – you can use it to at least hopefully find a data repository that might have the kind of data you’re looking for, but we need to go even further and do a better job of cataloging well-described datasets so researchers can easily find them.
  • Dissemination: sometimes when I ask researchers about data sharing, the look of horror they give me is such that you’d think I’d asked them whether they’d consider giving up their firstborn child.  And to be fair, I can understand why researchers feel a sense of ownership about their data, which they have probably worked very hard to gather.  To be clear, when I talk about dissemination and sharing, I’m not suggesting that everyone upload their data to the internet for all the world to access.  Some datasets have confidential patient information, some have commercial value, some even have biosecurity implications, like H5N1 flu data that a federal advisory committee advised be withheld out of fear of potential bioterrorism.  Making all data available to anyone, anywhere is neither feasible nor advisable.  However, the scientific and academic communities should consider how to increase the incentives and remove the barriers to data sharing where appropriate, such as by creating the kind of data catalogs I described above, raising awareness about appropriate methods for data citation, and rewarding data sharing in the promotion and tenure process.
  • Digital Infrastructure: okay, this is normally called cyberinfrastructure, but I had this whole “words starting with the letter D” thing going and I didn’t want to ruin it. 🙂  If we want to do data sharing properly, we need to build the tools to manage, curate, and search it.  This might seem trivial – I mean, if Google can return 168 million web pages about dogs for me in 0.36 seconds, what’s the big deal with searching for data?  I’m not an IT person, so I’m really not the right person to explain the details of this, but as a case in point, consider the famed Library of Congress Twitter collection.  The Library of Congress announced that they would start collecting everything ever tweeted since Twitter started in 2006.  Cool, huh?  Only problem is, at least as of January 2013, LC couldn’t provide access to the tweets because they lacked the technology to allow such a huge dataset to be searched.  I can confirm that this was true when I contacted them in March or April of 2013 to ask about getting tweets with a specific hashtag that I wanted to use to conduct some research on the sociology of scientific data sharing, and they turned me down for this reason.  Imagine the logistical problems that would arise with even bigger, more complex datasets, like those associated with genome wide association studies.
  • Data Literacy: Back in my library school days, my first ever library job was at the reference desk at UCLA’s Louise M. Darling Biomedical Library.  My boss, Rikke Ogawa, who trained me to be an awesome medical librarian, emphasized that when people came and asked questions at the reference desk, this was a teachable moment.  Yes, you could just quickly print out the article the person needed because you knew PubMed inside and out, but the better thing to do was turn that swiveling monitor around and show the person how to find the information.  You know, the whole “give a man a fish and he’ll eat for a day, teach a man to fish and he’ll eat for a lifetime” thing.  The same is true of finding, using, and sharing data.  I’m in the process of conducting a survey about data practices at NIH, and almost 80% of the respondents have never had any training in data management.  Think about that for a second.  In one of the world’s most prestigious biomedical research institutions, 80% of people have never been taught how to manage data.  Eighty percent.  If you’re not as appalled by that as I am, well, you should be.  Data cannot be used to their fullest if the next generation of scientists continues with the kind of makeshift, slapdash data practices I often encounter in labs today.  I see the potential for more librarians to take positions like mine, focusing on making data better, but that doesn’t mean that scientists shouldn’t be trained in at least the basics of data management.

So that’s my data sharing manifesto.  What I propose is not the kind of thing that can be accomplished with a few quick changes.  It’s a significant paradigm shift in the way that data are collected and science is practiced.  Change is never easy and rarely embraced right away, but in the end, we’re often better for having challenged ourselves to do better than we’ve been doing.  Personally, I’m thrilled to be an informationist and librarian at this point in history, and I look forward to fondly reminiscing about these days in our data-driven future. 🙂

the sweet scent of Actinomycetes (or why rain smells good)

This morning I walked out the door and caught a whiff of something I don’t smell often in Los Angeles – actinomycetes!

You know that “rain smell” that you can detect, especially on a day when it hasn’t rained in a while?  That’s largely geosmin, a compound produced by actinomycetes – a kind of bacteria that lives in the soil.  When it rains, the water hitting the ground aerosolizes the compound (and the bacteria along with it), creating that distinctive rain smell.  So the next time you catch a whiff of the lovely, fresh scent of rain, don’t forget that it’s actually tiny liquid droplets of dirt bacteria entering your nose. 🙂

Why Data Management is Cool (Sort Of)

“She told me the topic was really boring, but that you made it kind of interesting,” the woman said when I asked her to be honest about what our mutual acquaintance had said after attending a class I’d taught on writing a data management plan.  This is not the first time I’d heard something like this.  The fact is, I’m pretty damn passionate and excited about a topic that most people find slightly less boring than watching paint dry: data.  Now, I’m not going to try to convince you that data is not nerdy.  It is.  Very nerdy.   I have never claimed to be cool, and this is probably one of my least cool interests.  However, I think I have some very good reasons for finding data rather interesting.

I remember pretty much the exact moment when I realized the very interesting potential that lives in data.  I was in library school and taking a class in the biomedical engineering department about medical knowledge representation, and we were spending the whole quarter talking about the very complicated issue of representing the clinical data around a very specific disease (glioblastoma multiforme or GBM, a type of brain cancer).  It’s very difficult with this disease, as with many others, to arrange and organize the data just about a single patient in such a way that a clinician can make sense of it.  There’s genetic data, vital signs data, drug dosing data, imaging data, lab report data, doctor’s subjective notes, patient’s subjective reports of their symptoms, and tons of other stuff, and it all shifts and changes over time as the disease progresses or recedes.  Is there any way to build a system that could present this data in any sort of a manageable way to allow a clinician to view meaningful trends that might provide insight into the course of disease that could help improve treatment?  Disappointingly, at least for now, the answer seems to be no, not really.

But the moment that I really knew that I wanted to work with this stuff was when we were talking about personalized medicine and genetic data.  In the case of GBM, as with many other diseases, certain medicines work very well on some patients, but fail almost completely in others.  Many factors could play into this, but there’s likely a large genetic component for why this should be.  Given enough data about the patients in whom these drugs worked and in whom they didn’t, then, could we potentially figure out in advance which drug could help someone?  Extrapolating from that, if we have enough health data about enough different patients, aren’t there endless puzzles we could solve just by examining the patterns that would emerge by getting enough information into a system that could make it comprehensible?

Perhaps that’s oversimplifying it, but I do think it’s fair to conceive of data as pure, unrefined knowledge.  When I look at a dataset, I don’t see a bunch of numbers or some random collection of information.  I imagine what potential lives within that data just waiting to be uncovered by the careful observation of some astute individual or a program that can pick out the patterns that no human could ever catch.  To me, raw data represents the final frontier of wild, untamed knowledge just waiting to be understood and explained, and to someone like me who is really in love with knowledge above all, that’s a pretty damn cool thing.

Yes, I know that writing a data management plan or figuring out what kind of metadata to use for a dataset is pretty boring.  I’m not denying that.  But sometimes you have to do some boring stuff to make cool things happen.  You have to get your oil changed if you want your Bugatti Veyron to do 0 to 60 in 2.5 seconds (I mean, I’m assuming those things have to get oil changes?).  You have to do the math to make sure your flight pattern is right if you want to shoot a rocket into space.  And you can’t find out all the cool secrets that live in your dataset if it’s a messy pile of papers sitting on your desk.  So the way I see it, my job is to make data management as easy and as interesting as possible so that the people who have the data will be able to unlock the secrets that are waiting for them.  So spread the word, my fellow data nerds.  Let’s make data management as cool as regular oral hygiene.  😉

Reading the Great Books of Science

It’s been ages since I posted here, and I can’t let all the blog readers down, can I?!  I’ve been up to all sorts of fantastically nerdy things lately, which have kept me rather too busy for blogging, and which I will probably report on here in due time.  For now, let’s talk science and books, which as we all know, are two of my favorite things (the other top contenders for my favorite things being dogs, champagne, and Paris).

One of the many perks of working at a major research institution is that really awesome people come speak here.  Case in point: a few weeks back, I had the opportunity to attend a Q&A session with James Watson, as in Watson and Crick, as in discoverers of the double helix structure of DNA.  True story: the Q&A ended at the exact same time as I had to be across campus for the start of a class, so I knew I was going to have to leave early.  When I told this to one of my bosses, who was also attending, she said, “you’re going to get up and walk out while James Watson is talking?”  And indeed, that is exactly what I did. 🙂

However, before I left, one of the things Watson had to say that struck me was regarding what he referred to as “the great books.”  I forget exactly how he put it, but he said that he had appreciated his schooling for exposing him to these great books, which had helped shape his thinking.  This statement reminded me of a blog post I’d recently read about Carl Sagan’s reading list, written in his own hand and excerpted from his papers, now held by the Library of Congress.  As the blog post I’d read eloquently puts it, is it possible to “reverse engineer” a great mind by following in that thinker’s literary footsteps?

I’m sure it’s not so simple as that, but in any case, I decided that I would like to add to my already completely ridiculous collection of to-read books by creating my own “great books of science” library.  Based on my research into what one might currently consider the important books in science (at least for the non-scientist), I’ve started my library with the following titles:

  1. Charles Darwin – On the Origin of Species
  2. Richard Dawkins – The Selfish Gene
  3. Stephen Hawking – A Brief History of Time
  4. Matt Ridley – Genome: The Autobiography of a Species in 23 Chapters
  5. Carl Sagan – Cosmos

So far, I’m about 1/3 of the way into Genome, which I’m really enjoying (but I did also just start Haruki Murakami’s The Wind-Up Bird Chronicle, the reading of which has become a near-obsession that currently occupies almost all of my free time).  It’s a nice overview of evolution and genetics; perhaps a little less technical than I would have liked, but certainly an enjoyable read.

So, dear blog readers, as you can see, my list is at present by no means comprehensive.  What would you add to a library of the “great books” of science?  Let me know in the comments so I can add to my Amazon wish list. 🙂

More Neuroscience Awesomeness and A Challenge for Librarians!

In my neuroscience class, we’ve now moved away from developmental neuroscience and into what I find way more interesting and the real reason I wanted to take the course: molecular neuroscience.  For the next three weeks, we’ll be learning how nerves communicate with each other.  Mostly this happens through neurotransmitters and through channels that move things like ions in and out of cells.  We had a guest speaker who specializes in genetic neurological diseases, and she focused her talk specifically on what are called “channelopathies” – that is, genetic diseases in which symptoms are caused by problems with these nerve channels.  Some of these problems are common – for example, many types of migraine are caused by channelopathies – but some are rare and super bizarre.

Here’s one of the rare and super bizarre ones the lecturer told us about: periodic paralysis is a condition in which the patient becomes temporarily but completely paralyzed, and then afterwards, they’re totally fine.  The paralysis can be brought on by all different kinds of things – stress, excitement, etc.  The lecturer told us about a really strange case of familial periodic paralysis that was found in a large family in Ireland.  Genetically, it’s autosomal dominant, meaning that if one parent has it, each child has a 50% chance of inheriting it.  So as one would expect, about half of this family is affected.  The trigger for this particular familial periodic paralysis is overeating.  The lecturer said “think of the gatherings this family must have.  They all get together and eat a big meal, and then half of them are paralyzed!”  Can you imagine, half of a family falling over paralyzed after dinner and then getting up and going home a few hours later like nothing ever happened?  Wouldn’t that make for some awkward family reunions?  Since the condition isn’t dangerous, I think it’s okay if we laugh a little bit at that image, right?  (Obviously familial periodic paralysis is not funny, and I’m definitely not making fun of it.  But don’t you have to admit that you’re wondering how different your family gatherings might have been had half of you been paralyzed for a while after dinner?)
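If you want to see why “autosomal dominant” works out to about half the family being affected, here’s a toy simulation (my own illustration, not anything from the lecture – it assumes one affected heterozygous parent and one unaffected parent, the simplest case):

```python
import random

# Toy Mendelian sketch: one affected heterozygous parent (Aa) and one
# unaffected parent (aa).  Each child gets one allele from each parent;
# the dominant allele "A" causes the condition.
def child_is_affected(rng: random.Random) -> bool:
    from_affected_parent = rng.choice(["A", "a"])  # 50/50 coin flip
    from_unaffected_parent = "a"                   # always recessive
    return "A" in (from_affected_parent, from_unaffected_parent)

rng = random.Random(0)
n = 100_000
affected = sum(child_is_affected(rng) for _ in range(n))
print(affected / n)  # hovers right around 0.5
```

Over many children, the affected fraction converges to 50% – which is why roughly half of a large family ends up with the condition.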

This family and their condition intrigued me so much that as soon as I got home, I went to PubMed to see if I could find the case in the literature (I really can’t help it…I’m a librarian), but my searching has turned up nothing so far.  Therefore I am challenging the medical librarians out there to find me a case report.  If you find it, you will win….I don’t know, honor and glory.  🙂  So to run down again, here’s what we know:

  • autosomal dominant
  • channelopathy (I think she said on the potassium ion channel, which would make sense because I found lots of cases of hyperkalemic periodic paralysis)
  • familial periodic paralysis
  • overeating
  • probably an Irish family (the lecturer did specify Irish, but as every librarian knows, people often misremember these kinds of details, so probably best not to rely on this particular piece of information)

Alright, go! 🙂

(And by the way, if no one finds this within a week, I’ll email the guest lecturer and ask, but let’s try to save me the embarrassment of having to compose that bizarre email, shall we?)

Talking the Talk: Why Research Informationists Should Go to Class!

On a recent evening, I found myself wondering about neurotransmitters (like you do).  I had sort of a vague idea of how they worked, but it occurred to me that, as the liaison librarian to the departments of all brain-y things at UCLA (neurology, neuroscience, psychiatry, psychology, etc), I’d probably be doing myself a favor if I learned a little bit more about these areas.  Thus it was that I came to enroll in Neuroscience 101B, an undergrad course in developmental and molecular neuroscience – that is, how the nervous system is formed during gestation, and how neurotransmitters and other molecular signaling methods work in the adult.  I had to contact the professor to get special permission to join the course, and he said I was welcome to freely attend the lectures if I wanted (as it’s huge and they don’t take roll), but I could also officially enroll, which would require that I take the three exams and complete weekly, page-long critical responses to recent articles in the field.  I thought to myself, “if I just audit this class, things will get busy during the quarter like they always do, and I’ll stop going.  But if I actually enroll and have to earn a grade, I have real incentive to learn this.”  So I decided to actually enroll, and I’m so glad I did – I’m only two and a half weeks in, but I can already see how taking this class is going to be so helpful to me as a librarian and research informationist.  Already I have started to get some benefits:

  1. Learn their language.  A mere sampling of the words and phrases that have entered my vocabulary in just two and a half weeks: ligand, rostral/caudal, filopodia, membrane diffusible, notochord, presynaptic compartment.  No, I did not make any of that up, and yes, I can define all of it.  In short, I am learning to speak the language of neuroscience.
  2. Learn their experimental methods.  Thanks to this class, I now know what two-photon microscopy is.  I know the exact procedure by which one creates a cranial window for imaging neurons via a craniotomy (don’t look it up.  Trust me. It involves dental adhesive and super glue and it’s not at all pleasant).  I can explain several different experimental methods for examining neuronal activity, as well as various reasons why one would want to examine neuronal activity in the first place. Understanding the how and why of the science makes such a huge difference in being able to understand the how and why of their research methods.  Obviously, for a research informationist, this is key.
  3. Learn the big names in the field.  Though I live in LA, I’m not one to name drop. 🙂  However, I will say this about neuroscience, in my experience of it: you are going to learn to recognize the people who did the big experiments (and it’s probably true of other fields as well).  For one thing, you can’t help but know them because there’s stuff named after them (see for example the Cajal-Retzius cell and the interstitial cell of Cajal, both named for an evidently reclusive Spanish Nobel Prize winner who spent hours and hours of his life dyeing nerves to study them and thus ended up discovering tons of stuff).  But even when there’s not something named after the researcher, you still learn who did the experiment, and I get the feeling this kind of thing might even be on the exam.  I appreciate that about the field – credit where credit is due, right?  More importantly, it’s interesting to learn the big names who are currently doing research in the field, particularly when those big names happen to be on my campus and publishing in Nature and such.  When I hear those things in lecture, that is something I definitely file away for later.
  4. Learn about the department (and have them learn about me).  When I contacted the professor to ask to take the course and told him why, I have a feeling that was probably the first time he even knew he had a liaison librarian.  Now, not only do he and the other two class professors know I’m here, but they also know that I’m interested in what they do.  Plus, I’m learning all sorts of things about the department (such as the fact that they have TONS of seminars and lectures I’d never heard about) as well as things about the student experience, so I have more of a context in which to understand the kind of research assistance these students might need.
  5. Learn fantastic trivia for more interesting conversations.  Okay, not an entirely serious reason, but a nice side effect of the course.  For example, did you know that in a rat, each whisker is connected to a single neuron?  I assume the same is true for dogs, so now I like to bug Ophelia by touching a single whisker and wondering which neuron it’s setting off.  (I explained to her that it’s for science, but she still seems annoyed by it.)  Or how about that there are proteins and neurotransmitters with names like Sonic hedgehog, Dickkopf (means big head in German!), and Frizzled?

All of this is important to me because I love working with researchers and I feel like I can more legitimately sit at the table now, so to speak.  Obviously the knowledge I’m getting from one undergrad survey class is hardly enough to get me up to speed on something so complex as the nervous system, but at least now I feel like I understand all of those brain-y departments better, especially in terms of the research they’re conducting.

I do want to emphasize that I don’t think a degree in a science field is necessary for a research informationist or other librarian who is interested in working with clinical or basic science researchers.  Some of the best science/medical librarians I know have liberal arts degrees: political science, English, philosophy, etc.  Regardless of your educational background, though, I think the best science librarians are those who are able to adapt to the field and learn the language and culture of the science they work with.  Like different regions of the United States, each scientific field has its own dialect and “regional” traditions and practices.  If you don’t know how to operate in that language and tradition, you are pretty obviously an outsider.  But…if you want to slip in amongst them…it’s easy enough to do so if you have a little knowledge.  Taking a class is not necessarily for everyone.  I don’t know many adult professional people who would voluntarily spend their weekend studying for a neuroscience exam (I’m lame, I know, but look, I really want an A), but for those librarians who can manage it, I can’t speak highly enough of the experience.  Fortunately for those who are not quite as insanely ambitious as I am, there are other ways of gaining knowledge too, like checking out Data Curation Profiles, going to open lectures and grand rounds, talking to researchers about their work, and, erm, reading Wikipedia.  🙂

Of course I say all this now, but I might be singing a different tune after my first exam this Monday. Now, if you’ll excuse me, I have to go remind myself about the three different mechanisms by which synaptic topography is modeled in the developing nervous system.

Cool Science: Crowdsourcing Big Data

Anyone who knows me at all knows I really like data.  It’s a tremendously nerdy interest, but I find data really fascinating, I guess in part because I love the idea that there is some great knowledge that’s hidden in the numbers, just waiting for someone to come along and dig it out.  What’s very cool is that we live in an age when technology allows us to generate massive amounts of data.  For example, the Large Hadron Collider generates more than 25 petabytes of data a year, which works out to nearly 70 terabytes a day.  A DAY.  Some data analysis can be done by computers, but some of it really has to be done by people.  Plus, some studies really rely on the ability to gather data from massive groups of people in order to get an adequate sample from various groups to prove what you’re trying to show.  To solve these and other “big data” problems, some very smart and cool research groups have jumped on the crowdsourcing bandwagon and are having people from around the world get online and help solve the problems of data gathering and analysis.  Here are some cool projects I’ve heard about.
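(For my fellow data nerds who like to check the arithmetic: here’s the back-of-the-envelope conversion, assuming decimal units and a 365-day year.)

```python
# Convert the LHC's quoted data rate from petabytes/year to terabytes/day.
PB_PER_YEAR = 25
TB_PER_PB = 1000      # decimal (SI) units
DAYS_PER_YEAR = 365

tb_per_day = PB_PER_YEAR * TB_PER_PB / DAYS_PER_YEAR
print(f"{tb_per_day:.1f} TB/day")  # prints 68.5 TB/day
```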

Eyewire: a group of researchers working on retinal connectomes at MIT found a fascinating way to get people to help with their data analysis – turn it into a game.  They have a good wiki that explains the project in depth, but the gist of it is that these researchers have microscopic scans of neurons from the retina.  Neurons are a huge tangled mess, so their computers could figure out how some of them fit together, but it really takes an actual person to go in and figure out what’s connected and what’s not.  So this team turned it into this 3D puzzle/game thing that’s really hard to explain unless you try it.  You go through a tutorial to learn how to use the system, and then you’re turned loose to start mapping neurons!  It’s not the most compelling game I’ve ever played, or something I’d spend hours doing, but it is interesting, and it helps neuroscience, so that’s pretty cool.

Small World of Words: this study aims to better understand human speech and how we subconsciously create networks of associations among words.  To do so, they set up a game to gather word associations from native and non-native English speakers.  Again, I wouldn’t necessarily call this a game in the sense of “woohoo, we’re having so much fun!” but it is kind of interesting to see what your brain comes up with when you’re given a set of random words.  (Plus it’s perhaps a little telling of your own psychological state if you really think about the words you’re coming up with.)  It takes like 2 minutes to do, and again, it’s contributing to science!  Also, according to their website, they are making their dataset publicly available, which as a research informationist/data librarian I wholeheartedly endorse.

Foldit: I haven’t played this yet, so I can’t speak to how fun it is (or boring), but it sounds similar to Eyewire in the sense of being a puzzle in which the players are helping to map a structure – in this case proteins.  Proteins are long chains of amino acids, but they fold up in certain ways that determine their function.  Knowing more about this folding structure makes it possible to create better drugs and understand the pathology of diseases.  For example, one of the things this project is looking at is proteins that are crucial for HIV to replicate itself within the human body.  Better understanding of the structure of these proteins could help contribute to drugs to treat HIV and AIDS.

So I encourage you to go play some games for science!  Do it now!  And if you’re at work and someone tries to stop you, just politely explain that you’re not playing a game – you’re curing AIDS.  🙂