Dr. Federer’s Wild Ride: The Tale of the 3-year PhD

A couple days ago, I stood outside a conference room while my doctoral dissertation committee discussed the defense I’d just given. My advisor opened the door to invite me back in and said, “Congratulations, Dr. Federer!”

Obviously I’m very pleased about this, in large part because I am absolutely delighted this experience is over. I did this PhD in 3 years (most people take 4-6 years) and I did it while I was also working full-time in a pretty demanding job. At some points, to be honest, it was just basically awful. I say this not to dissuade anyone from doing a PhD part-time, because I do think it’s totally doable under certain circumstances (like if you have a really supportive boss, which I do, and if you don’t have kids, which I don’t), but honestly in hindsight, doing it this quickly was really pretty ridiculous on my part, although I had my reasons.

One of my professors used to say something that stuck with me – don’t compare your insides to other people’s outsides. When you look at someone’s finished dissertation, or picture of a black hole, or whatever, you’re seeing a final product that has been polished over hours, sometimes years. It’s easy to think, oh my god, everybody has it so together and makes it look so easy, so why is this so hard for me? So in the spirit of cheering on those of you who may be in this process (or are thinking of it), I’d like to shed some light on what it took to become Dr. Federer. (By the way, that same professor also said to us once “at this point in the semester, if you’re not crying in public, you’re doing great,” which I think is also worth keeping in mind.)

First, let me tell you about my typical day. I did my course work in a year and a half, which I managed by taking summer courses. During that time, I would get up around 5 am and be at work by 6:30 so that I could leave by 4 pm. That gave me enough time to get home and walk my dog before getting back in the car and sitting in rush hour traffic on the Beltway for an hour. I had two classes each week, most from 6 – 9 pm, and this usually got me home around 9:30 (thankfully rarely any traffic on the way home that late). Then I’d have dinner – obviously I was not in the mood for cooking that late at night, so on weekends I would cook a couple big batches of something that I could pop in the microwave. It takes a while to mentally wind down from a doctoral-level class (for me anyway), so I usually read a bit or watched a show and got to bed by 11 so I could get up and do it all again in the morning. Over the weekend, I would read whatever I needed to for the next week of classes (and believe me, there is a LOT of reading) and write, but I still managed to have some free time, so it wasn’t too overwhelming.

In my program, there are no qualifying exams because it’s a highly interdisciplinary department and everyone’s doing such different things that it wouldn’t really make sense. Instead, there’s a requirement to write an “integrative paper” that demonstrates that you’ve achieved an appropriate mastery of your subject to move on to the dissertation phase. Mine was a research study that provided a grounding for my dissertation by providing evidence for some of the methodological choices I would make. That took a semester, and it was probably the easiest semester out of the whole program. Next came the dissertation proposal, in which you write the introduction, lit review, and methods section that will eventually become the first part of your dissertation, and your committee makes sure what you’re proposing sounds reasonable. That also took a semester. I do remember periods during those semesters in which I had to spend an entire weekend working, but it was nothing like the dissertation would end up being. I also really enjoyed not having to spend that hour in traffic twice a week to go to campus for classes.

I started the final sprint toward finishing my dissertation in January 2019, and that was when things got much harder. All the things I’d said I’d do during the proposal had seemed pretty straightforward, but when it came down to actually doing them, everything took so much longer than I expected and most of it didn’t work on the first try (and sometimes not even on the second or third). My topic modeling outputs were a nightmare. My code comments were filled with frustrated notes to myself and the occasional expletive. Manually categorizing and describing things took hours longer than I expected.

The coursework portion of my life had seemed grueling because of the early morning hours and the horrible hour in Beltway traffic, but at least then I had enough free time to feel like I had at least a little work/life balance. For the semester that I worked on the dissertation, nearly every minute of my life was spent working. I would put in an 8-hour day at work, where I’d started a new position with considerably more responsibilities and expectations than my previous one. I would come home and spend half an hour walking my dog and then quickly eat something. During the most intense period of work, I ate nothing but turkey and Swiss cheese sandwiches for the better part of a month because they were quick to make. At one point, a friend sent me a Grubhub gift card so I would eat something other than sandwiches.

After that, I’d spend the next several hours working. I tried to have a cutoff time of 9:30 so I’d have enough time to do something else with my brain for a little while before going to bed, or else I knew I’d never get to sleep. Even then, most nights I’d wake up around 3 or 4 am and almost immediately my brain would once again kick into high gear, going over different ways to better approach the problems, thinking up fixes to the code bugs I was running into, or mulling over the wording of whatever part I was currently writing. Some nights I managed to get back to sleep, but more often, I lay there in the dark with my mind going full-speed until my alarm went off in the morning. I was constantly exhausted. Weekends weren’t much better. I let myself sleep in a little bit, but then it was all work, usually 12 hours a day, with breaks to walk the dog and of course make a sandwich.

During this period, everything was about the dissertation. I stopped going to the gym. A social life was totally out of the question. I would text with my friends, but I didn’t see most of them for months. Even my dog, Ophelia, hardly got my attention. She developed a bad habit of barking at me in frustration when I would be sitting on my couch working, so I started bringing my laptop and sitting on the floor with her while I worked, which seemed to make her happy. She would bring her Squirrelly (in our home, all toys are named Squirrelly) over for me to throw, and we’d have a little tug-of-war, but I know she could tell that I was distracted. One night she came over and dropped her Squirrelly on the pile of books I was referring to – pictured below – like, “come on, lady, seriously.”

Ophelia hates research but loves playing Squirrelly

Even when I wasn’t working on the dissertation, I was thinking about the dissertation. It was an albatross that always hung heavy around my neck, constantly on my mind no matter what I was doing. If I talked to you at some point during this 4-month period, I can almost guarantee you that I was also thinking about the dissertation while I was doing it. When I did finally finish it and send it to my committee, I was shocked at how much more productive I became at work. I got more done in the day after it was submitted than I had in the entire week prior, just because the dissertation took up so much mental space in every minute of every day that I had very little bandwidth for anything else. It’s something I’ve never experienced before and I don’t know if it can be fully explained to someone who hasn’t gone through it.

But I got through it. After I submitted it to my committee, I spent an evening listening to podcasts, drinking wine, and doing a jigsaw puzzle, and I thought to myself, “wow, what a decadent way to spend an evening.” It’s not, though – it’s normal life (I mean honestly, it’s a little bit boring of an evening in normal-life terms), but I’d so forgotten what that meant that I felt like I was giving myself an incredible indulgence. Even now that the hardest work has been done for weeks, I still sometimes feel a lingering sense of guilt when I do something like read a book for fun, or watch a show, or take my dog for a long hike, like I should be working.

So, I say all of this not to complain about my experience, nor to scare anyone who’s thinking of going down this road. Was it hard? Harder than anything I’ve ever done before. Was it worth it? No question. Really, I write this to say, yes, this is hard, and we all experience it. My dissertation advisor once said in an email to me, “you make this look easy!” and I’m sure to a lot of people, I did. But it’s not, no matter what it may look like from the outside. So, you, dear reader, who maybe is working on a PhD or some other degree, or maybe a challenging project of some other sort, I want you to know that you’re doing great. Don’t compare yourself to anyone else. Don’t compare your insides to other people’s outsides, because, no matter how poised and calm we may look to the outside world, it’s pretty messy inside for all of us.

Behind the scenes of “Data sharing in PLOS ONE: An analysis of Data Availability Statements”

Recently some colleagues and I published a paper in PLOS in which we analyzed about 47,000 Data Availability Statements as a way of exploring the state of data sharing in a journal with a pretty strong data availability policy. The paper has gotten a good response from what I’ve seen on Twitter, and I’m really happy with how it turned out, thanks in part to some great feedback from the reviewers. But I also wanted to tell a few more things about how this paper came about – the things that don’t make it into the final scholarly article. A behind the scenes look, if you will.

The idea for this paper arose out of a somewhat eye-opening experience. I needed to get a hold of a good dataset – I forget why exactly, but I think it was when I was first starting to teach R and wanted some real data that I could use in the classes for the hands-on exercises. Remembering that PLOS had this data availability policy, I thought to myself, ah, no problem, I will find an article that looks relevant to the researchers I’m teaching, download the data, and use it in my demo (with proper attribution and credit, of course). So I found an article that looked good and scrolled down to the Data Availability Statement. Data available upon request. Huh. I thought you weren’t allowed to say that, but okay, I guess this one slipped through the policy. Found another one – data is within the paper, it said, except the only data in the paper were summary tables, which were of no use to me (nor would they be of use to anyone hoping to verify the study or reanalyze the data, for example).

What a weird fluke, I thought, that the first two papers I happened to look at didn’t really follow the policy. So I checked a third, and a fourth. Pretty soon I’d spent a half hour combing through recent PLOS articles and I had yet to find one with a publicly available dataset that I could easily download from a repository. I ended up looking elsewhere for data (did you know that baseball fans keep surprisingly in-depth data on a gazillion data points?) but I was left wondering what the real impact of this policy was, which was why I decided to do this study.
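To make those kinds of statements a little more concrete, here is a minimal sketch of how one might sort Data Availability Statements into rough buckets with simple keyword matching. To be clear, this is not the method from the paper: the function name, the category labels, and the keyword lists are all hypothetical, made up purely for illustration, and real statements are much messier than this.

```python
# Hypothetical markers that suggest the data live in a repository.
# This list is an assumption for illustration, not taken from the paper.
REPOSITORY_MARKERS = ("doi.org", "figshare", "dryad", "osf.io", "github")


def categorize_statement(statement: str) -> str:
    """Assign a Data Availability Statement to a rough, made-up category."""
    text = statement.lower()
    # "Data available upon request" style statements
    if "upon request" in text or "on request" in text:
        return "upon request"
    # Statements that point to a repository or a DOI
    if any(marker in text for marker in REPOSITORY_MARKERS):
        return "repository"
    # "Data are within the paper" style statements
    if "within the paper" in text or "in the paper" in text:
        return "within paper"
    return "other"


if __name__ == "__main__":
    examples = [
        "Data available upon request.",
        "All relevant data are within the paper.",
        "Data are on Figshare at https://doi.org/10.0000/example",
    ]
    for example in examples:
        print(categorize_statement(example), "<-", example)
```

Keyword matching like this would only ever be a first pass. A statement that says the data are “within the paper” tells you nothing about whether the paper actually contains anything beyond summary tables, which was exactly the problem I kept running into.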

I’ll let you read the paper to find out what exactly it is that we found, but there’s one other behind-the-scenes anecdote that I’ll share about this paper that I hope will be encouraging. Obviously if you’re going to write critically about data availability, you’re going to look a little hypocritical if you don’t share your own data. I fully intended to share our data and planned to do so using Figshare, which is how I’d shared a dataset associated with an earlier paper of mine in PLOS. When I shared the data from that first article, I set it to be public immediately, though I didn’t expect anyone to want to see it before the paper was out. Unexpectedly, and unbeknownst to me, someone at Figshare apparently thought this was an interesting dataset and decided to tweet it out the same day I submitted the paper to PLOS, obviously well before it was ever accepted, much less published.

While the interest in the dataset was encouraging, I was also concerned about the fact that it was out before the paper was accepted. I figured I was flattering myself to think that someone would want to scoop me, but then, I got an email from someone I didn’t know, who told me that she had found my dataset and that she would like to write an article describing my results, and would I mind sharing my literature review/citations with her to save her the trouble? In other words, “hi, I would like to write basically the paper that you’re trying to get accepted using all of the work you did.” I want to be clear that I am all for data sharing, but this situation bothered me. Was I about to get scooped?

Obviously our paper came out, no one beat us to it, and as far as I know, no one has ever written another paper using that dataset, but I was thinking about it when I was uploading the data for this most recent paper.  This dataset was way more interesting and broadly applicable than the first one, so what if someone did get a hold of it before our paper came out? So what I decided to do was to upload it to Figshare, have it generate a DOI, but keep the dataset listed as private rather than publicly release it. Our data availability statement included the DOI and was therefore on the surface in compliance, but I had a feeling that, if you went to the DOI, it would tell you that the dataset was private or wasn’t found. Obviously I could have checked this before I submitted, but to be totally honest, I just left it as it was because I was genuinely curious whether any of the reviewers would try to check it themselves and say something.

To their credit, all three of the reviewers (who by the way, were incredibly helpful and gave the most useful feedback I’ve ever gotten on peer review, which I think significantly improved the paper) did indeed point out that the DOI didn’t work. In our revisions, our Data Availability Statement included a working link to not only the data, but also the code, on OSF. I invite anyone who is interested to reuse it and hope someone will find it useful. (Please don’t judge me on the quality of my code, though – I wrote it a long time ago when I was first learning R and I would do it way better now.)

 

Living my best academic life: 2018 resolutions for getting that PhD done

I didn’t feel very optimistic going into 2017. I had recently lost my father and grandfather in the same week, and I was feeling anxious and depressed about what seemed like a pretty disastrous outcome to the 2016 elections. I don’t think I made any resolutions that year because I was so disheartened by the whole situation that I figured, who cares? My focus in 2017 was basically, do what it takes to get through it, eat some good food and drink some good wine because possibly the world will end pretty soon, etc.

But I feel different going into 2018, more motivated and invigorated. Yeah, 2017 was pretty shitty in some ways, but there were also some good things about it, actually some really great things! I know it’s very silly, but it also feels like there’s something to wiping the slate clean and starting over. At this point, I’ve worked out 100% of the days in 2018! I’ve eaten healthy, and put my shoes away and all those other things I aim to do EVERY SINGLE DAY OF THIS YEAR.

More importantly to my motivation, there’s a chance that this year could be when I finish my PhD, if I can manage to do my dissertation work in three semesters (i.e. 12 months). Maybe this is a ridiculous goal, but I’m kind of a ridiculous person, and it sure would be nice to finish. To that end, I’m deciding to make the goal for this year to live my best academic life. What does that mean?

  • read (something academic, that is) every day. My former advisor, who I still keep in touch with on Twitter, very usefully recommended #365papers – i.e. read a scholarly paper every day of the year. I probably need to read around that much for my dissertation anyway, and I do also have a huge backlog of interesting articles I’ve filed away to read “one day.” So far I’m one down, 364 to go! (But again, I’ve read an academic paper 100% of the days this year)
  • write every day. It doesn’t have to be a lot. A blog post (this counts for today!), a bit of a paper, part of my dissertation, something for work, even an academic related tweet. I know that doing a dissertation will involve way more writing than I’d been doing for the other parts of my PhD work, so I want to get into the habit now.
  • keep working on open science. I’m finally getting to the point in my coding skills that I don’t feel horrendously embarrassed for other people to see my code, but I still often think, eh, who’s going to want to see this? That’s totally the wrong idea, especially for someone whose scholarly research focuses on data sharing and reuse! I’m going to try to make a lot more commits to GitHub, even if it’s just silly stuff that I’m working on for my own entertainment, because who knows how someone else might find it useful.

So there you go! I’ll be tweeting out the papers I read on my Twitter account (@lisafederer) using #365papers, putting stuff up on my GitHub account, and I’ll probably (hopefully?) be writing more here, so watch this space!

A Method to the Madness: Choosing a Dissertation Methodology (#Quant4Life)

Somehow, shockingly, I’ve arrived at the point where I’m just a few mere months from finishing my coursework for my doctoral program (okay, 50 days, but who’s counting?), which means that next semester, I get down to the business of starting my dissertation. One of the interesting things about being in a highly interdisciplinary program like mine is that your dissertation research can be a lot of things.  It can be qualitative, it can be quantitative. It can be rigorously scientific and data-driven or it can be squishy and social science-y (perhaps I’m betraying some of my biases here in these descriptions).

If it weren’t enough that I had so many endless options available to me, this semester I’m taking two classes that couldn’t be more different in terms of methodology. One is a data collection class from the Survey Methodology department. We complete homework assignments in which we calculate response and cooperation rates for surveys, determine the disposition for 20 different categories of response/non-response/deferral, and decide which response and cooperation rate formula is most appropriate for a given sample. My other class is a qualitative methods class in the communications department. On the first day of that class, I uncomfortably took down the notes “qual methods: implies multiple truths, not one TRUTH – people have different meaning.”
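For anyone curious what the survey class homework looks like, here is a minimal sketch of a response rate and a cooperation rate calculation. I’m assuming the simplest AAPOR-style formulas here (usually labeled RR1 and COOP1); the post above doesn’t say which formulas the class actually used, and the function names and counts below are made up for illustration. The real exercise involves many more disposition categories than these.

```python
def response_rate_1(complete, partial, refusal, non_contact, other,
                    unknown_household, unknown_other):
    """Minimum response rate (assumed AAPOR RR1): completes over all
    eligible cases plus all cases of unknown eligibility."""
    denominator = ((complete + partial)
                   + (refusal + non_contact + other)
                   + (unknown_household + unknown_other))
    if denominator == 0:
        raise ValueError("no cases to compute a rate from")
    return complete / denominator


def cooperation_rate_1(complete, partial, refusal, other):
    """Minimum cooperation rate (assumed AAPOR COOP1): completes over
    all eligible cases that were actually contacted."""
    denominator = (complete + partial) + refusal + other
    if denominator == 0:
        raise ValueError("no contacted cases to compute a rate from")
    return complete / denominator


if __name__ == "__main__":
    # Made-up disposition counts, purely for illustration.
    rr1 = response_rate_1(complete=50, partial=10, refusal=20,
                          non_contact=10, other=5,
                          unknown_household=3, unknown_other=2)
    coop1 = cooperation_rate_1(complete=50, partial=10, refusal=20, other=5)
    print(f"RR1 = {rr1:.3f}, COOP1 = {coop1:.3f}")
```

The cooperation rate leaves out the people you never reached, so it will always come out at least as high as the response rate for the same survey. Choosing which formula fits a given sample is the part that takes actual judgment.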

I count myself lucky to be in a discipline in which I have so many methodological tools in my belt, rather than rely on one method to answer all my questions.  But then again, how do I choose which tool to pull out of the belt when faced with a problem, like having to write a dissertation?

I came into my doctoral program with a pretty clear idea of the problem I wanted to address – assessing the value of shared data and somehow quantifying reuse. I envisioned my solution involving some sort of machine learning algorithm that would try to predict usefulness of datasets (because HOW COOL WOULD THAT BE?). Then, halfway through the program, my awesome advisor moved to a new university, and I moved to a new advisor who was equally awesome but seemed to have much more of a qualitative approach. I got very excited about these methods, which were really new to me, and started applying them to a new problem that was also very close to my heart – scientific hackathons, which I’ve been closely involved with for several years. This topic would necessitate an almost entirely qualitative approach – I’d be doing ethnographic observation, in-depth interviews, and so on.

So now, here I find myself 50 days away from the big choice. What’s my dissertation topic?  The thing I like to keep in mind is that this doesn’t necessarily mean ALL that much in the long run.  This isn’t the sum of my life’s work.  It’s one of many large research projects I’ll undertake.  Still, I want it to be something that’s meaningful and worthwhile and personally rewarding.  And perhaps most importantly of all, I want to use a methodology that makes me feel comfortable.  Do I want to talk to people about their truth?  I’ve learned some unexpected things using those methodologies and I’m glad I’ve learned something about how to do that kind of research, but in the end, I don’t think I want to be a qual researcher.  I want numbers, data, hard facts.

I guess I really knew this was what I would end up deciding in the second or third week of my qual methods class.  The professor asked a question about how one might interpret some type of qualitative data, and I answered with a response along the lines of “well, you could verify the responses by cross-checking against existing, verified datasets of a similar population.”  She gave me a very odd look, and paused, seemingly uncertain how to respond to this strange alien in her class, and then responded, “You ARE very quantitative, aren’t you?”

#Quant4Life

Some real talk from a very tired PhD student

This post is going to be different from what I normally write. It’s going to seem a little bleak for a while, but stick with me, because it’s going to have a happy ending.

You know the way that some girls dream of their wedding day for their whole lives? That’s kind of like me, but instead of with getting married, it was with getting my PhD (I know, I was a weird kid). Starting almost 15 years ago when I was an adjunct professor, and to this day, people will sometimes send me emails that begin “Dear Dr. Federer,” and I think, not yet, but one day.

Eventually that day did come, and I got into this PhD program, working on a topic I’m really fascinated by and I think is pretty timely and relevant.  It was great.  There was the one little catch that I also had a full-time job that I love and a lease on an apartment that was well beyond grad student means, but I’m a pretty motivated person and I figured I could handle working full-time and doing the PhD program part-time.

This plan went fine the first semester.  So fine that I figured, well, why not just go ahead and do a third class in the spring?  Being a full-time PhD student with a high-pressure, full-time job?  Sure!  WHY NOT.  The semester is halfway through now, and I’m not dead yet. So this weekend, when I was looking at the PhD student handbook and I realized that after this semester, I’ll need just 4 more classes to complete my coursework, a cockamamie plan popped into my head.  I had this little conversation with myself:

evil Lisa: what if you did all four classes over the summer?
regular Lisa: I don’t know, while working full-time? That sounds like a bit much.
evil Lisa: but then once you’re done you could advance to candidacy.  Maybe you could finish the whole thing in two years!  I bet no one has ever even done that!
regular Lisa: but this sounds like torture
evil Lisa: why don’t you at least check the summer schedule and see if there are any interesting courses?
regular Lisa: hmm, well, some of these do look pretty good.  And they’re online.  Maybe it wouldn’t be so bad.
evil Lisa: REGISTER FOR THEM.

And I did.

To my credit, a part of me knew this plan was not my greatest idea. So today, when I had a meeting with a potential new advisor (my current advisor is leaving for a new position), I said, “I had this idea, but I think it might be a little crazy.” I told her, and she looked at me very patiently, the way you look at a person who has lost all touch with reality, and said, “yes, that’s crazy.”

After that conversation, I came back to the graduate student lounge to wait for my class to start, and I looked at the draft of a paper I’m working on, I looked at my slides for a presentation I’m giving in class this afternoon, I looked at my Outlook calendar for work, and I hated all of it.  The presentation looked like garbage and the paper seemed to be going nowhere.  I’d spent hours working on this paper, and it really had seemed like an interesting idea at the time, but now it seemed like a completely pointless waste of time.  The more I thought about data sharing and reuse, the more I hated it.

How could this be?  I love data!  I could talk about data all day!  How could it be that I suddenly hated data?  That was when I realized that I’ve been going about this all wrong and my ridiculous approach was actually ruining the entire experience.  It’s like if you love ice cream and you have a gallon and you try to just devour the entire thing in one sitting.  Of course it would be a horrible experience.  You’d be sick and you’d hate yourself, and you’d definitely hate ice cream.  On the other hand, if you had a little bit of the ice cream over several days, you’d enjoy it a lot more.

I have this instinct from my days of long-distance running: when I’m many miles in and tired, and I want to slow down, that’s when I push myself to run even faster.  The slower I run, the longer it’ll take me to finish, but if I just run as hard as I can, the run will be over sooner.  I’m not sure about the validity of this approach from a distance running perspective, but I think it’s fair to say it’s a completely stupid idea when it comes to a PhD.

People warned me when I started this program that everyone gets burned out at some point, and I thought, not me, I love my topic, there’s no way I could ever get tired of it.  That’s why it was especially confusing when I sat there looking at my paper draft yesterday and just hating the guts out of data sharing and reuse.  Fortunately, I don’t hate data.  I hate torturing myself.

So, that’s why I’m not going to!  Could I take four courses over the summer?  I suppose.  Could I finish a PhD in two years while working full-time? I guess it’s possible.  But what would be the point, if I emerged from the process angry and tired and hating data?  Time to slow down and enjoy the ride, and de-register for at least two of those summer classes.

Scientific “artifacts” – #overlyhonestmethods and indirect observation

This week I’ve been reading the first half of Bruno Latour and Steve Woolgar’s book Laboratory Life: The Construction of Scientific Facts.  Like many of the other pieces I’ve been reading lately, this book argues for a social constructivist theory of scientific knowledge, which is a perspective I’m really starting to identify with.  What I’m finding most interesting about this book is the ethnographic approach that was taken to observe the creation of scientific knowledge.  Basically, Bruno Latour spent two years observing in a biology lab at the Salk Institute.  Chapter 1 begins with a snippet of a transcript covering about 5 minutes of activity in a lab – all the little seemingly insignificant bits of conversation and activity that, taken together, would allow an outside observer to understand how scientific knowledge is socially constructed.

The authors emphasize that real sociological understanding of science can only come from an outside observer, someone who is not themselves so caught up in the science that they can’t see the forest for the trees, as it were.  They even suggest that it’s important to “make the activities of the laboratory seem as strange as possible in order not to take too much for granted” (30).  Why should we need someone to spend two years in a lab watching research happen when the researchers are going to be writing up their methods and results in an article anyway, you may ask?  The authors argue that “printed scientific communications systematically misrepresent the activity that gives rise to published reports” and even “systematically conceal the nature of the activity” (28).  In my experience, I would agree that this is true – a great example of it is #overlyhonestmethods, my absolute favorite Twitter hashtag of all time, in which scientists reveal the dirty secrets that don’t make it into the Nature article.

I’ve been thinking that an ethnographic approach might be an effective way to approach my research, and I’m thinking it makes even more sense after what I’ve read of this book so far.  However, this research was done in the 1970s, when research was a lot different.  Of course there are still clinical and bench researchers who are doing actual physical things that a person can observe, but a lot of research, especially the research I’m interested in, is more about digital data that’s already collected.  If I wanted to observe someone doing the kind of research I’m interested in, it would likely involve me sitting there and staring at them just doing stuff on a computer for 8 hours a day.  So I’m not sure if a traditional ethnographic approach is really workable for what I want to do.  Plus, I don’t think I’d get anyone to agree to let me observe them.  I know I certainly wouldn’t let someone just sit there and watch me work on my computer for a whole day, let alone two years (mostly because I’d be embarrassed for anyone else to know how much time I spend looking at pictures of dogs wearing top hats and videos of baby sloths).  Even if I could get someone to agree to that, I do wonder about the problem of observer effect – that the act of someone observing the phenomenon will substantively change that phenomenon (like how I probably wouldn’t take a break from writing this post to watch this video of a porcupine adorably nomming pumpkins if someone was observing me).

This thought takes me back to something I’ve been thinking about a lot lately, which is figuring out methods of indirect observation of researchers’ data reuse practices.  I’m very interested in exploring these sorts of methods because I feel like I’ll get better and more accurate results that way.  I don’t particularly like survey research for a lot of reasons: it’s hard to get people to fill out your survey, sometimes they answer in ways that don’t really give you the information you need, and you’re sort of limited in what kind of information you can get from them.  I like interviews and focus groups even less, for many of the same reasons.  Participant observation and ethnographic approaches have the problems I’ve discussed above.  So what I think I’m really interested in doing is exploring the “artifacts” of scientific research – the data, the articles, the repositories, the funny Twitter hashtags.  This idea sort of builds upon the concept I discussed in my blog last week – how systems can be studied to tell us something about their intended users.  I think this approach could yield some really interesting insights, and I’m curious to see what kind of “artifacts” I’ll be able to locate and use.

If data sharing is difficult, what can it tell us? An Actor-Network Theory approach

In my ongoing adventures in science and technology studies readings, this week I’ve been reading The Social Construction of Technological Systems.  It diverges a little bit from my interests, strictly speaking, and focuses more on development of technologies rather than more of the laboratory and clinical science that I’m interested in, but I’m still glad I read it because it sparked some thoughts and ideas that I think could be interesting to pursue.

The portions of the collection that I read were rooted in social constructivist theory (as you might guess from the title of the book), specifically Actor-Network Theory (ANT).  The preface to the 25th anniversary edition explores some new developments in the field since the original edition, including “posthuman” approaches that consider nonhuman actants within social systems (xxv).  Scientific researchers operate within a complex system – not only because scientific research is itself often complicated, but also because science happens within a social system involving things like grant funding and scholarly articles and citations and so on.  Data play important roles in that system, as the raw product of scientific research, as evidence for scientific claims, and, now that many researchers operate in fields where data sharing is becoming more expected, something of a commodity.  In ANT, actants can be nonhuman, so I think it would be reasonable to consider data an actant in the social network of scientific research, and potentially one of the more interesting parts of that network, even more so than the humans.

The other avenue this collection sent my mind down had to do with data repositories.  At the start of the chapter “Society in the Making: The Study of Technology as a Tool for Sociological Analysis,” Michel Callon argues that “the study of technology itself can be transformed into a sociological tool of analysis” (77).  To summarize his thesis, he argues that technological systems are created by what he calls “engineer-sociologists,” the designers or creators of the technology, who have had to transform themselves into sociologists to study the intended users in order to develop technologies that will meet their needs.  If this is true, then these new technologies should be able to tell us something about their intended users.

This chapter got me thinking about some of the systems that are in place for data sharing, like some of the major data repositories.  I won’t name any names, but there are a couple of very well-known data repositories that people often complain to me about when it comes to submitting their data.  In some labs, researchers have mentioned that they have one person who knows how to submit the data, and they all have to bug that person because they can’t figure out how to do it properly.  I’ve read some of the help documentation for some of these repositories, and those people weren’t complaining for nothing.  Many of these systems are a big pain – opaque in many of their requirements and onerous to use, yet many researchers are specifically required to put their data there because of grant or journal requirements.

So if we take Callon’s approach and view the system as a tool for sociological analysis, what does it say about the state of data sharing that some of these repositories are so difficult to use?  I can think of a few possibilities:

  • that the engineers haven’t really been in all that close of contact with the users, so they’ve built a system that doesn’t actually meet their users’ needs;
  • that the needs of the system administrators (good quality data with a minimal amount of effort on their part) are directly at odds with the needs of the data submitters (also a minimal amount of effort on their part) and the administrators’ needs won out;
  • that the engineers are aware of issues but there just isn’t money/time/resources to make the system easier to use.

Another possibility is that sharing data isn’t really that much of a priority for most researchers, so they go along with a hard-to-use system because it’s not worth the trouble to try to get it to change.  It’s sort of like how I feel like it’s really a huge pain to have to deal with the DMV, but I only have to go there once every few years, so I’m not about to start a huge campaign to reform the DMV, especially when there are bigger problems our elected officials should be dealing with.  Maybe sharing your data in some of these systems is like that – an annoyance you deal with because you have to.

This is all entirely speculation on my part, but I do think it’s an interesting approach to take.  It would be interesting to sit down with some of the people who built or who currently run some of these systems and get the story on why things are the way they are.

Delocalizing data – a data sharing conundrum

This week I’ve been reading the second half of Sergio Sismondo’s An Introduction to Science and Technology Studies and I have been finding myself interested in the question of the universality of scientific knowledge and data.  A single sentence that I think captures the scope of the problem I’m finding interesting: “scientific and engineering research is textured at the local level, that it is shaped by professional cultures and interplays of interests, and that its claims and products result from thoroughly social processes” (168).  That is to say, the output of a scientific experiment is not some sort of universal truth – rather, data are the record of a manipulation of nature at a given time in a given place by a given person, highly contextualized and far from universally applicable.

I was in my kitchen the other day, baking a mushroom pot pie, after reading Chapter 10, specifically the section on “Tinkering, Skills, and Tacit Knowledge.”  That section describes the difficulties researchers were having in recreating a certain type of laser, even when they had written documentation from the original creators, even when they had sufficient technical expertise to do so, even when they had all the proper tools – in fact, even when they themselves had already built one, they found it difficult to build a second laser.  As I was pulling my pie out of the oven, I was thinking about the tacit knowledge involved in baking – how I know what exactly is meant when the instructions say I should bake till the crust is “golden brown,” how I make the decision to use fresh thyme instead of the chipotle peppers the recipe called for because I don’t like too much heat, how I know that my oven tends to run a little cold so I should set the temperature 10 degrees higher than called for by the recipe.  Just having a recipe isn’t enough to get a really tasty mushroom pot pie out of the oven, just as having a research article or other scientific documentation isn’t enough to get success out of an experiment.

These problems raise some obvious issues around reproducibility, which is a huge focus of concern in science at the moment.  Obviously scientific instruments are hopefully a little more standardized than my old apartment oven that runs cold, but you’d be surprised how much variation exists in scientific research.  Reproducibility is especially a problem when the researcher is herself the instrument, such as in the case of certain types of qualitative research.  Focus group or interview research is usually conducted using a script, so theoretically anyone could pick up the script and use it to do an interview, but a highly experienced researcher knows how to go off-script in appropriate ways to get the needed information, asking probing questions or guiding a participant back from a tangent.

More relevant to my own research, thinking about data not as representations of some sort of universal truth, but as the results of an experiment conducted within a potentially complex local and social context, can shared data be meaningfully reused?  How do we filter out the noise and get to some sort of ground truth when it comes to data, or can we at all?  Part of the question that I really want to address in my dissertation is what barriers exist to reusing shared data, and I think this is a huge one.  Some of the problem can be addressed by standards, or “formal objectivity” (140).  However, as Sismondo notes, standards are themselves localized and tied to social processes.  Between different scientific fields, the same data point may be measured using vastly different techniques, and within a lab, the equipment you purchase often has a huge impact on how your data are collected and stored.  Maybe we can standardize to an extent within certain communities of practice, but can we really hope to get everyone in the world on one page when it comes to standards?

If we can’t standardize, then maybe we can at least document.  If I measured in inches but your analysis needs length input in centimeters, that’s okay, as long as you know I measured in inches and you convert the data before doing your analysis.  That seems fairly obvious, but how do I necessarily know what I need to document to fully contextualize the data for someone else to use it?  Is it important that I took the measurement on a Tuesday at 4 pm, that the temperature outside was 80 degrees with 70% humidity, that I used a ruler rather than a tape measure, that the ruler was made of plastic rather than wood?  I could go on and on.  How much documentation is enough, and who decides?
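To make the inches-versus-centimeters point a bit more concrete, here is a minimal sketch in Python of what “documenting instead of standardizing” might look like.  Everything in it is an assumption for illustration only: the `Measurement` record, its `unit` and `instrument` fields, and the `to_centimeters` helper are hypothetical names I made up, not part of any real repository’s metadata standard.  The idea is simply that a shared value travels with a note about how it was measured, and the person reusing it converts based on that note.

```python
from dataclasses import dataclass

# One inch is defined as exactly 2.54 centimeters.
CM_PER_INCH = 2.54


@dataclass(frozen=True)
class Measurement:
    """Hypothetical record: a value plus the context needed to reuse it."""
    value: float
    unit: str                      # "in" or "cm" in this sketch
    instrument: str = "unknown"    # e.g. "plastic ruler"; illustrative only


def to_centimeters(m: Measurement) -> float:
    """Convert a documented measurement to centimeters.

    Raises ValueError when the unit is not one this sketch knows about,
    since an undocumented unit cannot be converted safely.
    """
    if m.unit == "cm":
        return m.value
    if m.unit == "in":
        return m.value * CM_PER_INCH
    raise ValueError(f"undocumented or unsupported unit: {m.unit!r}")


# The data sharer records the unit; the reuser converts before analysis.
shared = Measurement(value=10.0, unit="in", instrument="plastic ruler")
length_cm = to_centimeters(shared)
print(f"{shared.value} {shared.unit} is about {length_cm:.2f} cm")
```

The sketch fails loudly on a unit it doesn’t recognize instead of guessing, which is the easy part.  Deciding which other fields (time of day, humidity, ruler material) belong in the record is exactly the hard question, and code can’t answer it.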

The concepts of reproducibility, standardization, and documentation are nothing new, but the idea of data being inextricably caught up in local and social contexts does get me thinking about the feasibility of reusing shared data.  I don’t think data sharing is going to stop – there are enough funders and journals on board with requiring data sharing that I think researchers should expect that data sharing will be part of their scientific work going forward.  The question then is what is the utility of this shared data.  Is it just useful for transparency of the published articles, to document and prove the claims made in those publications?  Or can we figure out ways to surmount data’s limited context and make it more broadly usable in other settings?  Are there certain fields that are more likely to achieve that formal objectivity than others, and therefore certain fields where data reuse may be more appropriate or at least easier than others?  I think this requires further thought.  Good thing I have a few years to spend thinking about it!

 

Who owns science? Some musings on structural functionalism

This week I’ve been reading Sergio Sismondo’s An Introduction to Science and Technology Studies, which has given me a lot to think about in terms of theoretical backgrounds for understanding how science creates knowledge.  In fact, it’s almost given me too much to think about.  There are so many different theoretical bases brought into the mix here, and I can see the relative merits of each, so I find myself wondering how to make sense of it all, but also what it means to adopt a theoretical underpinning as a social scientist.  Is it like a religion, where you accept one and only one dogma, and all parts of it, to the exclusion of all others?  Or is it more like a buffet, where you pick a little bit of the things that seem appealing to you and leave behind the things that don’t catch your eye?  I’m hoping it’s the latter, and I’m going to go on that assumption until the theory police tell me I can’t do it. 🙂  So, on that assumption, here are some ideas I’ve put on my plate from Sismondo’s buffet.

Structural Functionalism and Mertonian Norms

My favorite theoretical framework I picked up here was structural functionalism, and in particular, Robert Merton’s four guiding norms.  Structural functionalism, as I understand it, argues that society is composed of institutional structures that function based on guiding norms and customs.  Merton suggests that science is one such institution, the primary goal of which is “the extension of certified knowledge” (23).  Merton also outlined four norms of behavior that guide scientific practice, suggesting that those who follow them will be rewarded and those who violate them will be punished.  The norms are universalism (that the same criteria should be used to evaluate scientific claims regardless of the race, gender, etc of the person making them), communism (that scientific knowledge belongs to everyone), disinterestedness (that scientists place the good of the scientific community ahead of their own personal gain), and organized skepticism (that the community should not believe new ideas until they have been convincingly proven).

Of those four norms, communism and disinterestedness speak the most to my interest in data sharing and reuse.  Communism seems the most obviously related.  It’s very interesting to think about what parts of science are typically thought to belong to the community and which are thought to be privately owned.  For example, the Supreme Court unanimously ruled in 2013 that naturally occurring human genes could not be patented, a decision that seems in line with Merton’s communism norm.  On the other hand, plenty of scientific ideas can be and are patented.  While many scientific journals are becoming open access and making their articles freely available, many more work on a subscription model, suggesting that the ideas shared within are available for common consumption – if you are willing and able to pay the fee.

Although this example comes from an entirely different realm than science, thinking about these ideas has reminded me of the case of the artist Anish Kapoor, who purchased the exclusive rights to paint with the world’s “blackest black” so that no other artist can use it.  In retaliation, another artist designed the “pinkest pink” paint and made it available for sale – to any artist except Anish Kapoor.  While this episode is somewhat entertaining, it does bring up some interesting ideas about ownership in communities that are generally dedicated to the common good.  Art and science are very different, but they’re also quite alike in some ways that are very relevant to the work I’m doing.  They’re both activities carried out by individuals for their own reasons (artistic expression, scientific curiosity) for the common good (to share beauty with the world, to further scientific knowledge).  We are outraged when we hear of a rich artist laying exclusive claim to the raw materials of art so that no one else can use them.  It feels somehow petty, and it also seems like a disservice to not just the art world, but to all of us.  What could others be creating for us if they had access to that black?  I don’t know if we feel that same outrage when we hear of a scientist trying to lay exclusive claim to data.  Of course this isn’t a perfect analogy – a big part of the work of science is gathering or creating the data, which confuses the concept of ownership.  Still, I think there are some interesting ideas here to explore about how scientists think about common ownership of science – not just the ideas, but the data as well.

I started out this entry saying I was going to dip into some other theories – I have some things to say about social constructionism and actor-network theory, but now I’ve spent a long time going on and on about art and science and this is getting a bit long, so I think I’ll stop here for today. 🙂

 

On data, knowledge, and theory

As I’ve mentioned on this blog before, I recently started a PhD program at the University of Maryland’s iSchool, focusing on scientific researchers’ data reuse practices.  There’s a great deal of attention lately on encouraging, and even requiring, researchers to share their data, but less work has been done on how researchers actually make use of that shared data (or if indeed they do at all).  This semester, I’m doing an independent study with my advisor, Dr. Andrea Wiggins, with the aim of better understanding the theoretical background for this problem.  I have the good fortune of working in a job that involves interacting with researchers on data questions on a pretty much daily basis, so I have plenty of opportunity to observe actual practices, but I have less background on theoretical frameworks for contextualizing and understanding why these things happen, so that’s my goal this semester!  I’ve picked out several readings and am going to write weekly reflections on what I’ve read and thought, and since I have this blog, I figured, why not inflict all this on you, my readers, as well? 🙂

This week I read Paul Davidson Reynolds’ Primer in Theory Construction, which breaks down the research process and explores the scientific method and all its component parts.  It is described as being designed “for those who have already studied one or more of the social, behavioral, or natural sciences, but have no formal introduction to the way theories are constructed, stated, tested, and connected together to form a scientific body of knowledge.”  While I was reading it, I often was thinking to myself, “well, yeah, obviously…” but after I had a little more time to think about it, it occurred to me that it was useful to really stop and think about why research is done the way it is and what we can really determine using data, inference, and logic.

One of the things I was thinking about as I was reading this book was how we make the jump from data to knowledge, and also how to operationalize terms like “data” and “knowledge.”  The NIH’s big data initiative is called Big Data to Knowledge, but what exactly does it mean to translate “big data” to “knowledge”?  How do we define “big data” (as opposed to small data?) and “knowledge”?  Are the ways that big data become knowledge different than the ways non-big data become knowledge?  There are some good definitions of big data, but how do we define “knowledge” in the scientific, and particularly biomedical, realm?

Thinking about how researchers use data by really breaking things down to their most basic level is a little different from how I’ve thought about things before, but actually makes good sense.  I suggest that the barriers to reuse of shared data are:

  • technological: there aren’t good tools for easily getting/reusing the data, or the data are poor quality or hard to find
  • social: incentive structures of science often do not reward research that reuses data – take a look at the concept of #researchparasites
  • educational: reusing data involves a different skill set that most researchers aren’t taught

However, I never really thought about one of the most fundamental social factors, which is how researchers in a field conceptualize data and how it is transformed into knowledge.  Are there fundamental differences between the data I gather and data someone else gathers and I reuse?  Obviously if I gather my own data, I know more about its context, quality, and provenance.  If I reuse someone’s shared data, I don’t know how careful they were when collecting it, or other important things I might need to know about how the data were collected to be able to reuse them meaningfully.  For example, I once worked with a researcher on locating a clinical dataset for reuse, and once we got the dataset, the researcher asked how patient temperature had been measured – oral, axillary, rectal?  I got back in touch with the original data owner, and they didn’t know – the person who would be able to answer that question had moved on to a new position.  Apparently that mattered to the methods of the researcher I was working with, so they couldn’t use that dataset.  The sorts of things that seem like little minor details can actually make a big difference, but there’s really no way of knowing that unless you know how a research field works with and understands data.

Some things – like knowing how temperature was measured – are probably pretty specific to a narrow field, or even just a particular research method, and it’s probably not possible to know all of the intricacies of the many fields that comprise biomedical research.  However, I think there are also likely other fundamental qualities of data that would apply more broadly across many research fields, and perhaps that would be a useful approach to this question.