One of the recurring subjects among folks using data is: why does person x not share their data with me? Mostly because they are fearful and ignorant. Fearful? That their work will get scooped and/or their data might be found to be problematic. Ignorant? That they don’t know that they are obligated to share their data once they publish off of it and that it is in their interest to share their data. There is apparently a belief out there that data should be shared only after the big project is published, not after the initial work has been published. I will address this as well as the converging logics of appropriateness and consequences here.
Let me address first this new belief. The idea that one can hold
onto one’s data after publishing an article because one has not yet
published the book is not the
norm. The norm, the obligation imposed by the National Science
Foundation and expected from the discipline, is that when the first
piece is published, then the data for that piece is supposed to be
accessible. That way, people can replicate that study. If the larger
study is years later (and it almost always is), that means that we would
have to wait years to replicate the initial article. Not only that, but
there are no guarantees that the scholar will finish the book project
AND find a publisher.
Ok, let’s move from the specific myth to the broader logics of
replication. It has become clear over the decades that providing access
to data is the right thing to do from the standpoint of a logic of
appropriateness. The discussions make it clear that scholars need to be
transparent about their research and make it easy for others to
replicate their findings. This is the basic expectation for doing any
research but especially quantitative research. Providing interview
notes can be problematic due to confidentiality issues (although the NSF is funding a project to figure this out),
but providing data that one has created/collected is the standard
expectation of social science. Many journals now have replication
policies and store data at their websites. The question these days is
not whether one is obligated to share data but whether one must also
share the “do” files, macros, or programs that are used to analyze the data.
Sharing data is clearly right from an appropriateness standpoint, but
it is also right from a rational self-interest perspective.
People worry that citations are over-rated, and they may be so. But if
you want to get cited, one of the best ways to do that is to share your
very useful dataset. According to the ISP symposium linked above (p.
21): “An author who makes data available is on average cited twice as frequently as an article with no data.”
If you check out the various lists of who gets cited the most, those
who create datasets and share them get cited more. Will Moore shared
with some folks a story at a recent conference, saying how he got pretty
famous in the discipline long before he published much because his name
was attached to the Polity dataset. He had no idea that this was going
to happen. One of the requirements of new grant applications is to
show how one plans to disseminate one’s research. Sharing data is one
basic and very important strategy.
We are in the business of creating public goods–knowledge. Not just
the findings but the data we develop along the way. It does mean that
some folks may free ride, but it also means that the collective
enterprise moves forward. Holding onto data is not only selfish but
short-sighted. Having others work on the same dataset is likely to lead
to feedback, which means your work gets better, and to broader
imaginations of what is possible. When Ted Gurr developed the Minorities at Risk project,
he really had no idea how others would use it. Those that followed him
used it in a variety of ways, adding bits and pieces of data (I took
some of the IR data they had collected but not coded and used that to
test some stuff that Gurr never intended to ponder), asking different
questions and developing some very interesting findings.
Yes, folks along the way also discovered problems with the dataset,
but Gurr and the larger MAR team (which I subsequently joined) worked on
ways to improve the data (and, hey, we got NSF money to do that–the next
batch of papers will address the improvements and then the revised data
will become available with better instructions). This is how social science works. Keeping the data to oneself, even if only for a few more years, is completely contrary to our enterprise.