As I’ve mentioned on this blog before, I recently started a PhD program at the University of Maryland’s iSchool, focusing on scientific researchers’ data reuse practices. There’s a great deal of attention lately on encouraging, and even requiring, researchers to share their data, but less work has been done on how researchers actually make use of that shared data (or if indeed they do at all). This semester, I’m doing an independent study with my advisor, Dr. Andrea Wiggins, with the aim of better understanding the theoretical background for this problem. I have the good fortune of working in a job that involves interacting with researchers on data questions on a pretty much daily basis, so I have plenty of opportunity to observe actual practices, but I have less background on theoretical frameworks for contextualizing and understanding why these things happen, so that’s my goal this semester! I’ve picked out several readings and am going to write weekly reflections on what I’ve read and thought, and since I have this blog, I figured, why not inflict all this on you, my readers, as well? 🙂
This week I read Paul Davidson Reynolds’ Primer in Theory Construction, which breaks down the research process and explores the scientific method and all its component parts. It is described as being designed “for those who have already studied one or more of the social, behavioral, or natural sciences, but have no formal introduction to the way theories are constructed, stated, tested, and connected together to form a scientific body of knowledge.” While I was reading it, I often was thinking to myself, “well, yeah, obviously…” but after I had a little more time to think about it, it occurred to me that it was useful to really stop and think about why research is done the way it is and what we can really determine using data, inference, and logic.
One of the things I was thinking about as I was reading this book was how we make the jump from data to knowledge, and also how to operationalize terms like “data” and “knowledge.” The NIH’s big data initiative is called Big Data to Knowledge, but what exactly does it mean to translate “big data” to “knowledge”? How do we define “big data” (as opposed to small data?) and “knowledge”? Are the ways that big data become knowledge different than the ways non-big data become knowledge? There are some good definitions of big data, but how do we define “knowledge” in the scientific, and particularly biomedical, realm?
Thinking about how researchers use data by really breaking things down to their most basic level is a little different from how I’ve thought about things before, but actually makes good sense. I suggest that the barriers to reuse of shared data are:
- technological: there aren’t good tools for easily getting/reusing the data, or the data are poor quality or hard to find)
- social: incentive structures of science often do not reward research that reuses data – take a look at the concept of #researchparasites
- educational: reusing data involves a different skill set that most researchers aren’t taught
However, I never really thought about one of the most fundamental social factors, which is how researchers in a field conceptualize data and how it is transformed into knowledge. Are there fundamental differences between the data I gather and data someone else gathers and I reuse? Obviously if I gather my own data, I know more about its context, quality, and provenance. If I reuse someone’s shared data, I don’t know how careful they were when collecting it, or other important things I might need to know about how the data were collected to be able to reuse them meaningfully. For example, I once worked with a researcher on locating a clinical dataset for reuse, and once we got the dataset, the researcher asked how patient temperature had been measured – oral, axillary, rectal? I got back in touch with the original data owner, and they didn’t know – the person who would be able to answer that question had moved on to a new position. Apparently that mattered to the methods of the researcher I was working with, so they couldn’t use that dataset. The sorts of things that seem like little minor details can actually make a big difference, but there’s really no way of knowing that unless you know how a research field works with and understands data.
Some things – like knowing how temperature was measured – are probably pretty specific to a narrow field, or even just a particular research method, and it’s probably not possible to know all of the intricacies of the many fields that comprise biomedical research. However, I think there are also likely other fundamental qualities of data that would apply more broadly across many research fields, and perhaps that would be a useful approach to this question.