To keep or not to keep: that is the question

I recently read an article in The Atlantic about compulsive declutterers – the opposite of hoarders – who feel compelled to get rid of all their possessions. I’m more on the side of hoarding, because I always find myself thinking of eventualities in which I might need the item in question. Indeed, weeks or even years after getting rid of something, I have often wished I still had it: a book I would have liked to reference, a piece of clothing I would have liked to wear, a receipt I could have used to take something back. Of course, I don’t have unlimited storage space, so I can’t keep all this stuff. The question of what to keep and for how long is one that librarians think about when it comes to weeding: deciding which parts of the collection to deaccession, or basically, get rid of. There are evidence-based, tried-and-true ways of thinking about weeding a library collection, but the same can’t really be said for data. How is a scientist to decide what to keep and what not to keep?

I know this is a question that researchers are thinking about quite a bit, because I get more emails about this than almost any other issue. In fact, I get emails not only from users of my own library, but also from researchers all over the country who have somehow found my name. What exactly do I need to keep? If I have electronic records, do I need to keep a print copy as well? How many years do I need to keep this stuff? These are all very reasonable questions, and it would be nice to be able to say, yes, there is an answer and it is…! But it’s almost never so easy to point to a single answer.

A case in point: a couple of years ago, I decided to teach a class about data preservation and retention. In my naivete, I thought it would be nice to take a look through all the relevant policies and find the specific number of years that research data is required to be retained. I read handbooks and guides. I read policy documents from various agencies. I even read the U.S. Code (I do not recommend it). At the end of it, I found that not only is there no single, definitive policy answer to how long funded research data should be retained, but there are in fact all sorts of contradictory suggestions. I found documents giving times ranging from 3 years to 7 years to the super-helpful “as long as necessary.”

This may be difficult to answer from a policy perspective, but I think answering this from a best practices perspective is even trickier.  Let’s agree that we just can’t keep everything – storing data isn’t free, and it takes considerable time and effort to ensure that data remain accessible and usable.  Assuming that some stuff has to get thrown away, how do we distinguish trash from treasure, especially given the old adage about how the former might be the latter to others?  It’s hard to know whether something that appears useless now might actually be useful and interesting to someone in the future.  To take this to the extreme, here’s an actual example from a researcher I’ve worked with: he asked how he could have his program automatically discard everything in the thousandth place from his measurements.  In other words, he wanted 4.254 to be saved as 4.25.  I told him I could show him how, but I asked why he wanted to do this.  He told me that his machine was capable of measuring to the thousandth, but the measurement was only scientifically relevant to the hundredth place.  To scientists right now, 4.254 and 4.252 were essentially indistinguishable, so why bother with the extra noise of the thousandth place?  Fair point, but what about 5 years from now, or 10 years from now?  If science evolves to the point that this extra level of precision is meaningful, tomorrow’s researchers will probably be a little annoyed that today’s researchers had that measurement and just threw it away.  But then again, how can we know now when, or even if, that level of precision will be wanted?  For that matter, we can’t even say for sure whether this dataset will be useful at all.  Maybe a new and better method for making this measurement will be developed tomorrow, and all this stuff we gathered today will be irrelevant.  But how can we know?
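
For what it’s worth, one way to sidestep the rounding dilemma is to keep the value exactly as the machine reported it and round only the copy used for analysis. Here is a minimal Python sketch of that idea. It is not what the researcher actually did: the function name and the sample readings are made up for illustration, and it assumes the readings arrive as text and that rounding half up (rather than truncating) is acceptable.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_for_analysis(reading: str, places: int = 2) -> Decimal:
    """Return a rounded copy of a reading; the original string is untouched.

    Hypothetical helper for illustration only.
    """
    # Build the rounding target, e.g. Decimal('0.01') when places == 2.
    quantum = Decimal(1).scaleb(-places)
    return Decimal(reading).quantize(quantum, rounding=ROUND_HALF_UP)


# Made-up readings, stored at the machine's full precision.
raw_readings = ["4.254", "4.252", "4.256"]

# Rounded copies for today's analysis; the raw values stay in storage.
analysis_values = [round_for_analysis(r) for r in raw_readings]

for raw, rounded in zip(raw_readings, analysis_values):
    print(f"stored: {raw}  analyzed as: {rounded}")
```

The cost is that the extra digit still has to be stored, so this only postpones the keep-or-discard decision rather than answering it.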

These are all questions that I think are not easy to answer right now, but that people within research communities should be thinking about.  For one thing, I don’t think we can give one simple answer to how long data should be retained.  For one type of research, a few years may be enough.  For other fields, where it’s harder to replicate data, maybe we need to keep it in perpetuity.  When it comes to deciding what should be retained and what should be discarded, I think that answers cannot be dictated by one-size-fits-all policies and that subject matter experts and information professionals should work together to figure out good answers for specific communities and specific data.  Eventually, I suppose we’ll probably have some of those well-defined best practices for data retention in the same way that we have those best practices from collection management in libraries.  Until then, keep your crystal balls handy. 🙂