Living my best academic life: 2018 resolutions for getting that PhD done

I didn’t feel very optimistic going into 2017. I had recently lost my father and grandfather in the same week, and I was feeling anxious and depressed about what seemed like a pretty disastrous outcome to the 2016 elections. I don’t think I made any resolutions that year because I was so disheartened by the whole situation that I figured, who cares? My focus in 2017 was basically, do what it takes to get through it, eat some good food and drink some good wine because possibly the world will end pretty soon, etc.

But I feel different going into 2018, more motivated and invigorated. Yeah, 2017 was pretty shitty in some ways, but there were also some good things about it, actually some really great things! I know it’s very silly, but it also feels like there’s something to wiping the slate clean and starting over. At this point, I’ve worked out 100% of the days in 2018! I’ve eaten healthy, and put my shoes away and all those other things I aim to do EVERY SINGLE DAY OF THIS YEAR.

More importantly to my motivation, there’s a chance that this year could be when I finish my PhD, if I can manage to do my dissertation work in three semesters (i.e. 12 months). Maybe this is a ridiculous goal, but I’m kind of a ridiculous person, and it sure would be nice to finish. To that end, I’m deciding to make the goal for this year to live my best academic life. What does that mean?

  • read (something academic, that is) every day. My former advisor, who I still keep in touch with on Twitter, very usefully recommended #365papers – i.e. read a scholarly paper every day of the year. I probably need to read around that much for my dissertation anyway, and I do also have a huge backlog of interesting articles I’ve filed away to read “one day.” So far I’m one down, 364 to go! (But again, I’ve read an academic paper 100% of the days this year)
  • write every day. It doesn’t have to be a lot. A blog post (this counts for today!), a bit of a paper, part of my dissertation, something for work, even an academic related tweet. I know that doing a dissertation will involve way more writing than I’d been doing for the other parts of my PhD work, so I want to get into the habit now.
  • keep working on open science. I’m finally getting to the point in my coding skills that I don’t feel horrendously embarrassed for other people to see my code, but I still often think, eh, who’s going to want to see this? That’s totally the wrong idea, especially for someone whose scholarly research focuses on data sharing and reuse! I’m going to try to make a lot more commits to GitHub, even if it’s just silly stuff that I’m working on for my own entertainment, because who knows how someone else might find it useful.

So there you go! I’ll be tweeting out the papers I read on my Twitter account (@lisafederer) using #365papers, putting stuff up on my GitHub account, and I’ll probably (hopefully?) be writing more here, so watch this space!

A Method to the Madness: Choosing a Dissertation Methodology (#Quant4Life)

Somehow, shockingly, I’ve arrived at the point where I’m just a few months from finishing my coursework for my doctoral program (okay, 50 days, but who’s counting?), which means that next semester, I get down to the business of starting my dissertation. One of the interesting things about being in a highly interdisciplinary program like mine is that your dissertation research can be a lot of things.  It can be qualitative or quantitative. It can be rigorously scientific and data-driven, or it can be squishy and social science-y (perhaps I’m betraying some of my biases here in these descriptions).

As if I didn’t already have endless options available to me, this semester I’m taking two classes that couldn’t be more different in terms of methodology.  One is a data collection class from the Survey Methodology department.  We complete homework assignments in which we calculate response and cooperation rates for surveys, determine dispositions for 20 different categories of response/non-response/deferral, and decide which response and cooperation rate formula is most appropriate for a given sample.  My other class is a qualitative methods class in the communications department.  On the first day of that class, I uncomfortably took down the notes “qual methods: implies multiple truths, not one TRUTH – people have different meaning.”
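To give a flavor of those homework problems, here’s a minimal sketch of one standard formula, AAPOR’s Response Rate 1 (completed interviews over all eligible or possibly-eligible cases). The disposition counts below are invented for illustration, and the 20 categories from class would collapse into groups roughly like these:

```r
# AAPOR Response Rate 1 -- counts are made up for this example
I  <- 410  # complete interviews
P  <- 35   # partial interviews
R  <- 120  # refusals and break-offs
NC <- 90   # non-contacts
O  <- 15   # other non-response
U  <- 60   # unknown eligibility

rr1 <- I / (I + P + R + NC + O + U)
round(rr1, 2)  # 0.56
```

Other AAPOR formulas differ mainly in how generously they count partials and unknown-eligibility cases, which is exactly the “which formula is most appropriate” judgment call the homework asks for.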

I count myself lucky to be in a discipline in which I have so many methodological tools in my belt, rather than relying on one method to answer all my questions.  But then again, how do I choose which tool to pull out of the belt when faced with a problem, like having to write a dissertation?

I came into my doctoral program with a pretty clear idea of the problem I wanted to address – assessing the value of shared data and somehow quantifying reuse. I envisioned my solution involving some sort of machine learning algorithm that would try to predict usefulness of datasets (because HOW COOL WOULD THAT BE?).  Then, halfway through the program, my awesome advisor moved to a new university, and I moved to a new advisor who was equally awesome but seemed to have much more of a qualitative approach.  I got very excited about these methods, which were really new to me, and started applying them to a new problem that was also very close to my heart – scientific hackathons, which I’ve been closely involved with for several years.  This kind of approach would necessitate an almost entirely qualitative approach – I’d be doing ethnographic observation, in-depth interviews, and so on.

So now, here I find myself 50 days away from the big choice. What’s my dissertation topic?  The thing I like to keep in mind is that this doesn’t necessarily mean ALL that much in the long run.  This isn’t the sum of my life’s work.  It’s one of many large research projects I’ll undertake.  Still, I want it to be something that’s meaningful and worthwhile and personally rewarding.  And perhaps most importantly of all, I want to use a methodology that makes me feel comfortable.  Do I want to talk to people about their truth?  I’ve learned some unexpected things using those methodologies and I’m glad I’ve learned something about how to do that kind of research, but in the end, I don’t think I want to be a qual researcher.  I want numbers, data, hard facts.

I guess I really knew this was what I would end up deciding in the second or third week of my qual methods class.  The professor asked a question about how one might interpret some type of qualitative data, and I answered with a response along the lines of “well, you could verify the responses by cross-checking against existing, verified datasets of a similar population.”  She gave me a very odd look, and paused, seemingly uncertain how to respond to this strange alien in her class, and then responded, “You ARE very quantitative, aren’t you?”

#Quant4Life

Some real talk from a very tired PhD student

This post is going to be different from what I normally write.  It’s going to seem a little bleak for a while, but stick with me, because it’s going to have a happy ending.

You know the way that some girls dream of their wedding day for their whole lives? That’s kind of like me, except instead of getting married, it was getting my PhD (I know, I was a weird kid). Starting almost 15 years ago, when I was an adjunct professor, and continuing to this day, people will sometimes send me emails that begin “Dear Dr. Federer,” and I think, not yet, but one day.

Eventually that day did come, and I got into this PhD program, working on a topic I’m really fascinated by and I think is pretty timely and relevant.  It was great.  There was the one little catch that I also had a full-time job that I love and a lease on an apartment that was well beyond grad student means, but I’m a pretty motivated person and I figured I could handle working full-time and doing the PhD program part-time.

This plan went fine the first semester.  So fine that I figured, well, why not just go ahead and do a third class in the spring?  Being a full-time PhD student with a high-pressure, full-time job?  Sure!  WHY NOT.  The semester is halfway through now, and I’m not dead yet. So this weekend, when I was looking at the PhD student handbook and I realized that after this semester, I’ll need just 4 more classes to complete my coursework, a cockamamie plan popped into my head.  I had this little conversation with myself:

evil Lisa: what if you did all four classes over the summer?
regular Lisa: I don’t know, while working full-time? That sounds like a bit much.
evil Lisa: but then once you’re done you could advance to candidacy.  Maybe you could finish the whole thing in two years!  I bet no one has ever even done that!
regular Lisa: but this sounds like torture
evil Lisa: why don’t you at least check the summer schedule and see if there are any interesting courses?
regular Lisa: hmm, well, some of these do look pretty good.  And they’re online.  Maybe it wouldn’t be so bad.
evil Lisa: REGISTER FOR THEM.

And I did.

To my credit, part of me knew this plan was not my greatest idea.  So today, when I had a meeting with a potential new advisor (my current advisor is leaving for a new position), I said, “I had this idea, but I think it might be a little crazy.”  I told her, and she looked at me very patiently, the way you look at a person who has lost all touch with reality, and said, “yes, that’s crazy.”

After that conversation, I came back to the graduate student lounge to wait for my class to start, and I looked at the draft of a paper I’m working on, I looked at my slides for a presentation I’m giving in class this afternoon, I looked at my Outlook calendar for work, and I hated all of it.  The presentation looked like garbage and the paper seemed to be going nowhere.  I’d spent hours working on this paper, and it really had seemed like an interesting idea at the time, but now it seemed like a completely pointless waste of time.  The more I thought about data sharing and reuse, the more I hated it.

How could this be?  I love data!  I could talk about data all day!  How could it be that I suddenly hated data?  That was when I realized that I’d been going about this all wrong, and my ridiculous approach was actually ruining the entire experience.  It’s like if you love ice cream and you have a gallon and you try to just devour the entire thing in one sitting.  Of course it would be a horrible experience.  You’d be sick and you’d hate yourself, and you’d definitely hate ice cream.  On the other hand, if you had a little bit of the ice cream over several days, you’d enjoy it a lot more.

I have this instinct from my days of long-distance running: when I’m many miles in and tired, and I want to slow down, that’s when I push myself to run even faster.  The slower I run, the longer it’ll take me to finish, but if I just run as hard as I can, the run will be over sooner.  I’m not sure about the validity of this approach from a distance running perspective, but I think it’s fair to say it’s a completely stupid idea when it comes to a PhD.

People warned me when I started this program that everyone gets burned out at some point, and I thought, not me, I love my topic, there’s no way I could ever get tired of it.  That’s why it was especially confusing when I sat there looking at my paper draft today and just hating the guts out of data sharing and reuse.  Fortunately, I don’t hate data.  I hate torturing myself.

So, that’s why I’m not going to!  Could I take four courses over the summer?  I suppose.  Could I finish a PhD in two years while working full-time? I guess it’s possible.  But what would be the point, if I emerged from the process angry and tired and hating data?  Time to slow down and enjoy the ride, and de-register for at least two of those summer classes.

Scientific “artifacts” – #overlyhonestmethods and indirect observation

This week I’ve been reading the first half of Bruno Latour and Steve Woolgar’s book Laboratory Life: The Construction of Scientific Facts.  Like many of the other pieces I’ve been reading lately, this book argues for a social constructivist theory of scientific knowledge, which is a perspective I’m really starting to identify with.  What I’m finding most interesting about this book is the ethnographic approach that was taken to observe the creation of scientific knowledge.  Basically, Bruno Latour spent two years observing in a biology lab at the Salk Institute.  Chapter 1 begins with a snippet of a transcript covering about 5 minutes of activity in a lab – all the little seemingly insignificant bits of conversation and activity that, taken together, would allow an outside observer to understand how scientific knowledge is socially constructed.

The authors emphasize that real sociological understanding of science can only come from an outside observer, someone who is not themselves too caught up in the science – someone who can’t see the forest for the trees, as it were.  They even suggest that it’s important to “make the activities of the laboratory seem as strange as possible in order not to take too much for granted” (30).  Why should we need someone to spend two years in a lab watching research happen when the researchers are going to be writing up their methods and results in an article anyway, you may ask?  The authors argue that “printed scientific communications systematically misrepresent the activity that gives rise to published reports” and even “systematically conceal the nature of the activity” (28).  In my experience, I would agree that this is true – a great example of it is #overlyhonestmethods, my absolute favorite Twitter hashtag of all time, in which scientists reveal the dirty secrets that don’t make it into the Nature article.

I’ve been thinking that an ethnographic approach might be an effective way to approach my research, and I’m thinking it makes even more sense after what I’ve read of this book so far.  However, this research was done in the 1970s, when research was a lot different.  Of course there are still clinical and bench researchers who are doing actual physical things that a person can observe, but a lot of research, especially the research I’m interested in, is more about digital data that’s already collected.  If I wanted to observe someone doing the kind of research I’m interested in, it would likely involve me sitting there and staring at them just doing stuff on a computer for 8 hours a day.  So I’m not sure if a traditional ethnographic approach is really workable for what I want to do.  Plus, I don’t think I’d get anyone to agree to let me observe them.  I know I certainly wouldn’t let someone just sit there and watch me work on my computer for a whole day, let alone two years (mostly because I’d be embarrassed for anyone else to know how much time I spend looking at pictures of dogs wearing top hats and videos of baby sloths).  Even if I could get someone to agree to that, I do wonder about the problem of observer effect – that the act of someone observing the phenomenon will substantively change that phenomenon (like how I probably wouldn’t take a break from writing this post to watch this video of a porcupine adorably nomming pumpkins if someone was observing me).

This thought takes me back to something I’ve been thinking about a lot lately, which is figuring out methods of indirect observation of researchers’ data reuse practices.  I’m very interested in exploring these sorts of methods because I feel like I’ll get better and more accurate results that way.  I don’t particularly like survey research for a lot of reasons: it’s hard to get people to fill out your survey, sometimes they answer in ways that don’t really give you the information you need, and you’re sort of limited in what kind of information you can get from them.  I like interviews and focus groups even less, for many of the same reasons.  Participant observation and ethnographic approaches have the problems I’ve discussed above.  So what I think I’m really interested in doing is exploring the “artifacts” of scientific research – the data, the articles, the repositories, the funny Twitter hashtags.  This idea sort of builds upon the concept I discussed in my blog last week – how systems can be studied and can tell us something about their intended users.  I think this approach could yield some really interesting insights, and I’m curious to see what kind of “artifacts” I’ll be able to locate and use.

A Silly Experiment in Quantifying Death (and Doing Better Code)

Doesn’t it seem like a lot of people died in 2016?  Think of all the famous people the world lost this year.  It was around the time that Alan Thicke died a couple weeks ago that I started thinking, this is quite odd; uncanny, even.  Then again, maybe there was really nothing unusual about this year, but because a few very big names passed away relatively young, we were all paying a little more attention to it.  Because I’m a data person, I decided to do a rather silly thing, which was to write an R script that would go out and collect a list of celebrity deaths, clean up the data, and then do some analysis and visualization.

You might wonder why I would spend my limited free time doing this rather silly thing.  For one thing, after I started thinking about celebrity deaths, I really was genuinely curious about whether this year had been especially fatal or if it was just an average year, maybe with some bigger names.  More importantly, this little project was actually a good way to practice a few things I wanted to teach myself.  Probably some of you are just here for the death, so I won’t bore you with a long discussion of my nerdy reasons, but if you’re interested in R, Github, and what I learned from this project that actually made it quite worthwhile, please do stick around for that after the death discussion!

Part One: Celebrity Deaths!

To do this, I used Wikipedia’s lists of deaths of notable people from 2006 to the present. This dataset is very imperfect, for reasons I’ll discuss further, but obviously we’re not being super scientific here, so let’s not worry too much about it. After discarding incomplete data, I was left with 52,185 people.  Here they are on a histogram, by year.

[Figure: year_plot – histogram of notable deaths by year]

As you can see, 2016 does in fact have the most deaths, with 6,640 notable people’s deaths having been recorded as of January 3, 2017. The next closest year is 2014, when 6,479 notable people died – a full 161 fewer than 2016 (only about a 2% difference, to be fair, but still).  The average number of notable people who died yearly over this 11-year period was 4,774, and the number who died in 2016 alone is about 40% higher than that average.  So it’s not just in my head, or yours – more notable people died this year.
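If you want to double-check my arithmetic, the percentages above fall straight out of the counts quoted in the text:

```r
# The per-year counts quoted above, double-checked
deaths_2016  <- 6640
deaths_2014  <- 6479
overall_mean <- 4774

deaths_2016 - deaths_2014                        # 161 more than 2014
100 * (deaths_2016 - deaths_2014) / deaths_2014  # about 2% (2.48, strictly)
100 * (deaths_2016 / overall_mean - 1)           # about 39%, i.e. roughly 40%
```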

Now, before we all start freaking out about this, it should be noted that the higher number of deaths in 2016 may not reflect more people actually dying – it may simply be that more deaths are being recorded on Wikipedia. The fairly steady increase and the relatively low number of deaths reported in 2006 (when Wikipedia was only five years old) suggests that this is probably the case.  I do not in any way consider Wikipedia a definitive source when it comes to vital statistics, but since, as I’ve mentioned, this project was primarily to teach myself some coding lessons, I didn’t bother myself too much about the completeness or veracity of the data.  Besides likely being an incomplete list, there are also some other data problems, which I’ll get to shortly.
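For the curious, the cleanup step mostly amounts to parsing semi-structured list entries. This isn’t my actual script (that’s on my GitHub, and the entries below are invented), but a minimal base R sketch of the idea:

```r
# Wikipedia-style death-list entries look roughly like "Name, age, description."
entries <- c(
  "Alan Thicke, 69, Canadian actor.",
  "Harriet, 176, tortoise reputedly owned by Charles Darwin.",
  "Somebody Notable, nationality unknown"   # no age: gets discarded
)

# Name = everything before the first comma
name <- sub(",.*$", "", entries)

# Age = the first run of digits right after the name (NA if absent)
age <- suppressWarnings(as.numeric(sub("^[^,]+,\\s*(\\d+).*$", "\\1", entries)))

# Discard incomplete records, as described above
deaths <- data.frame(name, age)[!is.na(age), ]
deaths   # two complete rows; Harriet's 176 should already look suspicious
```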

By the way, in case you were wondering what the deadliest month is for notable people, it appears to be January:

[Figure: month_plot – notable deaths by month]

Obviously a death is sad no matter how old the person was, but part of what seemed to make 2016 extra awful is that many of the people who died seemed relatively young. Did more young celebrities die in 2016? This boxplot suggests that the answer is no:

[Figure: age_plot – boxplot of age at death by year]

This chart tells us that 2016 is pretty similar to other years in terms of the age at which notable people died. The mean age of death in 2016 was 76.85, which is actually slightly higher than the overall mean of 75.95. The red dots on the chart indicate outliers – people who died at an age significantly higher or lower than the age at which most people died that year. There are 268 in 2016, which is a little more than in other years, but not shockingly so.
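In case you’re wondering how those red dots get picked: by convention, a boxplot flags points more than 1.5 times the interquartile range beyond the quartiles. A quick base R sketch with simulated ages (not the real data):

```r
set.seed(1)
# Simulated ages at death, plus the tree (125) and the tortoise (176)
ages <- c(rnorm(500, mean = 76, sd = 12), 125, 176)

q     <- quantile(ages, c(0.25, 0.75))   # first and third quartiles
fence <- 1.5 * (q[2] - q[1])             # 1.5 times the interquartile range
outliers <- ages[ages < q[1] - fence | ages > q[2] + fence]
```

This is the same rule boxplot() applies by default (its range argument is 1.5), so calling boxplot(ages) would draw those same points as individual dots.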

By the way, you may notice those outliers in 2006 and 2014 where someone died at a very, very old age. I didn’t realize it at first, but Wikipedia does include some notable non-humans in its list. One is a famous tree that died in an ice storm at age 125, and the other is a tortoise who had allegedly been owned by Charles Darwin, but significantly outlived him, dying at age 176.  Obviously this makes the data and therefore this analysis even more suspect as a true scientific pursuit.  But we had fun, right? 🙂

By the way, since I’m making an effort toward doing more open science (if you want to call this science), you can find all the code for this on my Github repository.  And that leads me into the next part of this…

Part Two: Why Do This?

I’m the kind of person who learns best by doing.  I do (usually) read the documentation for stuff, but it really doesn’t make a whole lot of sense to me until I actually get in there myself and start tinkering around.  I like to experiment when I’m learning code, see what happens if I change this thing or that, so I really learn how and why things work. That’s why, when I needed to learn a few key things, rather than just sitting down and reading a book or the help text, I decided to see if I could make this little death experiment work.

One thing I needed to learn: I’m working with a researcher on a project that involves web scraping, which I had kind of played with a little, but never done in any sort of serious way, so this project seemed like a good way to learn that (and it was).  Another motivator: I’m going to be participating in an NCBI hackathon next week, which I’m super excited about, but I really felt like I needed to beef up my coding skills and get more comfortable with Github.  Frankly, doing command line stuff still makes me squeamish, so in the course of doing this project, I taught myself how to use RStudio’s Github integration, which actually worked pretty well (I got a lot out of Hadley Wickham’s explanation of it).  This death project was fairly inconsequential in and of itself, but since I went to the trouble of learning a lot of stuff to make it work, I feel a lot more prepared to be a contributing member of my hackathon team.

I wrote in my post on the open-ish PhD that I would be more amenable to sharing my code if I didn’t feel as if it were so laughably amateurish.  In the past, when I wrote code, I would just do whatever ridiculous thing popped into my head that I thought might work, because, hey, who was going to see it anyway?  Ever since I wrote that open-ish PhD post, I’ve really approached how I write code differently, on the assumption that someone will look at it (not that I think anyone is really all that interested in my goofy death analysis, but hey, it’s out there in case someone wants to look).

As I wrote this code, I challenged myself to think not just of a way, any way, to do something, but of the best, most efficient, and most elegant way.  I learned how to write good functions, for real.  I learned how to use %>% (the pipe operator, which is very awesome).  I challenged myself to avoid for loops, since those are considered not-so-efficient in R, and I succeeded except for one loop I couldn’t think of a way to avoid at the time – though in retrospect I think there’s a more efficient way to write that part, and I’ll probably go back and change it at some point.  In the past, I would write code and be elated if it actually worked.  With this project, I realized I’ve reached a new level, where I now look at code and think, “okay, that worked, but how can I do it better?  Can I do that in one line of code instead of three?  Can I make it more efficient?”
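To illustrate the loop-avoidance point with a toy example (not from the actual project), here’s the same count done loop-style and then the vectorized way R prefers:

```r
ages <- c(69, 90, 53, 176, 83)

# for-loop style: works, but not idiomatic R
count <- 0
for (a in ages) {
  if (a > 80) count <- count + 1
}

# vectorized style: the comparison yields a logical vector, and
# summing it counts the TRUEs -- no loop needed
count2 <- sum(ages > 80)   # 3

# with magrittr loaded, the same idea chains left to right:
#   ages %>% magrittr::is_greater_than(80) %>% sum()
```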

So while this little project might have been somewhat silly, in the end I still think it was a good use of my time because I actually learned a lot and am already starting to use a lot of what I learned in my real work.  Plus, I learned that thing about Darwin’s tortoise, and that really makes the whole thing worth it, doesn’t it?

the sweet scent of Actinomycetes (or why rain smells good)

This morning I walked out the door and caught a whiff of something I don’t smell often in Los Angeles – actinomycetes!

You know that “rain smell” you can detect, especially on a day when it hasn’t rained in a while?  That’s the work of actinomycetes, a kind of bacteria that lives in the soil.  When rain hits the ground, the water aerosolizes the bacteria and the earthy-smelling compounds they produce (chiefly geosmin), creating that distinctive rain smell.  So the next time you catch a whiff of the lovely, fresh scent of rain, don’t forget that it’s actually tiny liquid droplets of dirt bacteria entering your nose. 🙂

Reading the Great Books of Science

It’s been ages since I posted here, and I can’t let all the blog readers down, can I?!  I’ve been up to all sorts of fantastically nerdy things lately, which have kept me rather too busy for blogging, and which I will probably report on here in due time.  For now, let’s talk science and books, which as we all know, are two of my favorite things (the other top contenders for my favorite things being dogs, champagne, and Paris).

One of the many perks of working at a major research institution is that really awesome people come speak here.  Case in point: a few weeks back, I had the opportunity to attend a Q&A session with James Watson, as in Watson and Crick, as in discoverers of the double helix structure of DNA.  True story: the Q&A ended at the exact same time as I had to be across campus for the start of a class, so I knew I was going to have to leave early.  When I told this to one of my bosses, who was also attending, she said, “you’re going to get up and walk out while James Watson is talking?”  And indeed, that is exactly what I did. 🙂

However, before I left, one of the things Watson had to say that struck me was regarding what he referred to as “the great books.”  I forget exactly how he put it, but he said that he had appreciated his schooling for exposing him to these great books, which had helped shape his thinking.  This statement reminded me of a blog post I’d recently read about Carl Sagan’s reading list, written in his own hand and excerpted from his papers, now held by the Library of Congress.  As the blog post I’d read eloquently puts it, is it possible to “reverse engineer” a great mind by following in that thinker’s literary footsteps?

I’m sure it’s not so simple as that, but in any case, I decided that I would like to add to my already completely ridiculous collection of to-read books by creating my own “great books of science” library.  Based on my research into what one might currently consider the important books in science (at least for the non-scientist), I’ve started my library with the following titles:

  1. Charles Darwin – The Origin of Species
  2. Richard Dawkins – The Selfish Gene
  3. Stephen Hawking – A Brief History of Time
  4. Matt Ridley – Genome: The Autobiography of a Species in 23 Chapters
  5. Carl Sagan – Cosmos

So far, I’m about 1/3 of the way into Genome, which I really enjoy (but I did also just start Haruki Murakami’s The Wind-Up Bird Chronicle, the reading of which has become a near-obsession that currently occupies almost all of my free time).  It’s a nice overview of evolution and genetics, perhaps a little less technical than I would have liked, but certainly an enjoyable read.

So, dear blog readers, as you can see, my list is at present by no means comprehensive.  What would you add to a library of the “great books” of science?  Let me know in the comments so I can add to my Amazon wish list. 🙂

Cool Science: Crowdsourcing Big Data

Anyone who knows me at all knows I really like data.  It’s a tremendously nerdy interest, but I find data really fascinating, I guess in part because I love the idea that there is some great knowledge that’s hidden in the numbers, just waiting for someone to come along and dig it out.  What’s very cool is that we live in an age when technology allows us to generate massive amounts of data.  For example, the Large Hadron Collider generates more than 25 petabytes a year in data, which works out to nearly 70 terabytes a day.  A DAY.  Some data analysis can be done by computers, but some of it really has to be done by people.  Plus, some studies really rely on the ability to gather data from massive groups of people in order to get an adequate sample from various groups to support what you’re trying to show.  To solve these and other “big data” problems, some very smart and cool research groups have jumped on the crowdsourcing bandwagon and are having people from around the world get online and help solve the problems of data gathering and analysis.  Here are some cool projects I’ve heard about.
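(If you want to sanity-check that daily figure yourself, the conversion is one line:)

```r
# 25 PB/year expressed in TB/day (1 PB = 1,000 TB)
25 * 1000 / 365   # about 68.5, i.e. roughly 70 TB per day
```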

Eyewire: a group of researchers working on retinal connectomes at MIT found a fascinating way to get people to help with their data analysis – turn it into a game.  They have a good wiki that explains the project in depth, but the gist of it is that these researchers have microscopic scans of neurons from the retina.  Neurons are a huge tangled mess, so their computers could figure out how some of them fit together, but it really takes an actual person to go in and figure out what’s connected and what’s not.  So this team turned it into this 3D puzzle/game thing that’s really hard to explain unless you try it.  You go through a tutorial to learn how to use the system, and then you’re turned loose to start mapping neurons!  It’s not like the most compelling game I’ve ever played or something I’d spend hours doing, but it is interesting, and it helps neuroscience, so that’s pretty cool.

Small World of Words: this study aims to better understand human language and how we subconsciously create networks of associations among words.  To do so, they set up a game to gather word associations from native and non-native English speakers.  Again, I wouldn’t necessarily call this a game in the sense of “woohoo, we’re having so much fun!” but it is kind of interesting to see what your brain comes up with when you’re given a set of random words.  (Plus it’s perhaps a little telling of your own psychological state if you really think about the words you’re coming up with.)  It takes like 2 minutes to do, and again, it’s contributing to science!  Also, according to their website, they are making their dataset publicly available, which as a research informationist/data librarian I wholeheartedly endorse.

Foldit: I haven’t played this yet, so I can’t speak to how fun it is (or boring), but it sounds similar to Eyewire in the sense of being a puzzle in which the players are helping to map a structure – in this case proteins.  Proteins are long chains of amino acids, but they fold up in certain ways that determine their function.  Knowing more about this folding structure makes it possible to create better drugs and understand the pathology of diseases.  For example, one of the things this project is looking at is proteins that are crucial for HIV to replicate itself within the human body.  Better understanding of the structure of these proteins could help contribute to drugs to treat HIV and AIDS.

So I encourage you to go play some games for science!  Do it now!  And if you’re at work and someone tries to stop you, just politely explain that you’re not playing a game – you’re curing AIDS.  🙂