State of the Art Informatics for Research Reproducibility, Reliability, and Reuse: Or How I Learned to Stop Worrying and Love Data Management
Scholarly publishers, research funders, universities, and the media are increasingly scrutinizing research outputs. Of major concern is the integrity, reliability, and extensibility of the evidence on which published findings are based. A flood of new funder mandates, journal policies, university efforts, and professional society initiatives aims to make this data verifiable, reliable, and reusable: If “data is the new oil”, we need data management to prevent ‘fires’, ensure ‘high-octane’, and enable ‘recycling’.
In March, I had the pleasure of being the inaugural speaker in a new lecture series (http://library.wustl.edu/research-data-testing/dss_speaker/dss_altman.html) initiated by the Washington University in St. Louis Libraries and dedicated to the topics of data reproducibility, citation, sharing, privacy, and management.
In the presentation embedded below, I provide an overview of the major categories of new initiatives to promote research reproducibility, reliability, and reuse and related state of the art in informatics methods for managing data.
This blog post provides some wider background for the presentation, and a recap of its recommendations. The approaches can be roughly divided into three categories. The first approach focuses on tools for reproducible computation, ranging from “statistical documents” (incorporating Knuth’s concept of literate programming) to workflow systems and reproducible computing environments [for example, Buckheit & Donoho 1995; Schwab et al. 2000; Leisch & Rossini 2003; Deelman & Gil 2006; Gentleman & Temple Lang 2007]. With few exceptions [notably, Freire et al. 2006], this work focuses primarily on “simple replication” or “reproduction”: replicating exactly a precise set of results from an exact copy of the original data made at the time of research.
Current leading examples of tools that support reproducible computation include:
- IPython: ipython.org
- knitr: yihui.name/knitr/
- Research Compendia: researchcompendia.org
- Run My Code: runmycode.org
- VisTrails: vistrails.org
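To make “simple replication” concrete, here is a minimal sketch in Python. It is not how any of the tools above work internally; the analysis function, the sample values, and the record layout are all hypothetical, chosen only to show the two ingredients the definition names: an exact copy of the data (checked with a checksum) and an exactly repeatable computation (fixed by a random seed).

```python
import hashlib
import random
import statistics


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest, used to confirm the data are a byte-exact copy."""
    return hashlib.sha256(data).hexdigest()


def analysis(raw: bytes, seed: int) -> dict:
    """Hypothetical analysis: mean of the values plus a bootstrap standard error."""
    values = [float(x) for x in raw.decode("utf-8").split()]
    rng = random.Random(seed)  # a fixed seed makes the resampling deterministic
    boot_means = [
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(200)
    ]
    return {
        "mean": statistics.fmean(values),
        "boot_se": statistics.stdev(boot_means),
    }


def make_record(raw: bytes, seed: int) -> dict:
    """Record what a later reproduction needs: data checksum, seed, and result."""
    return {
        "data_sha256": sha256_of(raw),
        "seed": seed,
        "result": analysis(raw, seed),
    }


def reproduces(raw: bytes, record: dict) -> bool:
    """True only if the data are unchanged and the rerun gives the same result."""
    if sha256_of(raw) != record["data_sha256"]:
        return False  # not an exact copy of the original data
    return analysis(raw, record["seed"]) == record["result"]


# Hypothetical data, invented for illustration only.
original = b"2.1 3.4 1.9 4.2 3.3 2.8\n"
record = make_record(original, seed=42)

print(reproduces(original, record))             # True: same bytes, same seed
print(reproduces(original + b"5.0\n", record))  # False: the data were altered
```

The exact equality test on floating-point results is only reasonable for a rerun in the same software environment, which is part of why this kind of reproduction is the narrowest of the goals discussed here: it says nothing about whether someone else could interpret or reuse the data.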
The second approach focuses on data sharing methods and tools [see for example, Altman et al. 2001; King 2007; Anderson et al. 2007; Crosas 2011]. This approach focuses more generally on helping researchers to share, both for replication and for broader reuse, including secondary uses and use in teaching. Increasingly, work in this area [e.g. Gutmann et al. 2009; Altman & King 2007] focuses on enabling long-term and interdisciplinary access to data. This requires that the researchers’ tacit knowledge about data formats, measurement, structure, and provenance be more explicitly documented.
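As a rough illustration of what “explicitly documented” can mean in practice, the sketch below writes down format, measurement, structure, and provenance as a small machine-readable record and flags columns that nobody has described. The class names, field names, and sample values are hypothetical and do not follow any particular metadata standard.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Variable:
    """Measurement details that usually live only in the researcher's head."""
    name: str
    description: str
    unit: str
    missing_code: str


@dataclass
class DatasetRecord:
    """Format, structure, and provenance for one data file (hypothetical layout)."""
    title: str
    file_format: str
    provenance: str
    variables: list = field(default_factory=list)

    def undocumented(self, columns: list) -> list:
        """Return the column names that have no Variable entry yet."""
        documented = {v.name for v in self.variables}
        return [c for c in columns if c not in documented]


# All values below are invented for illustration.
record = DatasetRecord(
    title="Example household survey (hypothetical)",
    file_format="CSV, UTF-8, comma-delimited, one header row",
    provenance="Collected by a hypothetical project team; cleaned before deposit",
    variables=[
        Variable("income", "Annual household income before tax", "USD", "-99"),
    ],
)

columns_in_file = ["income", "hh_size"]
print(record.undocumented(columns_in_file))  # ['hh_size']
print(json.dumps(asdict(record), indent=2))  # a record that can travel with the data
```

A check like `undocumented` is cheap to run when the data are created, which is exactly when the tacit knowledge is still available.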
Current leading examples of informatics tools that support data sharing include:
The third approach focuses on the norms, practices, and licensing associated with data sharing, archiving, and replication, and on the related incentives embedded in scholarly communication [Pienta 2007; Hamermesh 2007; Altman & King 2007; King 2007; Hedstrom et al. 2008; McCullough 2009; Stodden 2009]. This approach seeks to create the necessary conditions to enable data sharing and reuse, and to examine and align incentives around citation, data sharing, and peer review to encourage replicability and reusability.
Current leading examples of informatics tools that support richer citation, evaluation, open science, and review include:
- DataCite: datacite.org
- Dryad: datadryad.org
- Dataverse Network: thedata.org
- DMPTool: dmp.cdlib.org/
- Figshare: figshare.com
- Journal of Visualized Experiments: jove.com
- ORCID: orcid.org
- Registered Replication Reports: http://www.psychologicalscience.org/index.php/replication
- Thomson Reuters Data Citation Index: wokinfo.com/products_tools/multidisciplinary/dci/
Many Tools, Few Solutions
In this area, there are many useful tools, but few that offer a complete solution, even for a specialized community of practice. All three approaches are useful, and there are several general observations to be made about them. First, tools for replicable research such as VisTrails, MyExperiment, Wings, and StatDocs are characterized by their use of a specific, controlled software framework and their ability to facilitate near-automatic replication. The complexity of these tools, and their small user and maintenance base, means that we cannot rely on them to exist and function in five to ten years: they cannot ensure long-term access. Because they focus only on results and not on capturing practices, descriptive metadata, and documentation, they allow exact replication without providing the contextual information necessary for broader reuse. Finally, because these tools are heterogeneous across subdisciplines and largely incompatible, they do not as yet offer a broadly scalable solution.
Second, tools and practices for data management have the potential to broadly increase data sharing and the impact of related publications. However, although these tools are becoming easier to use, they still require extra effort from the researcher. Moreover, since this additional effort often comes near (or past) the conclusion of the main research project (and only after acceptance of an article and preparation for final publication), it is perceived as a burden, and often honored in the breach.
Third, incentives for replication have been weak in many disciplines, and journals are a key factor. The reluctance of journal editors to publish replications, whether confirming or non-confirming, weakens authors’ incentives to create replicable work. Lack of formal provenance and attribution practices for data also weakens accountability, raises barriers to conducting replication and reuse, reduces incentive to disseminate data for reuse, and increases the ambiguity of replication studies, making them difficult to study.
Furthermore, new forms of evidence complicate replication and reuse. In most scientific disciplines, the amount of data potentially available for research is increasing non-linearly. In addition, changes in technology and society are greatly affecting the types and quantities of potential data available for scientific analysis, especially in the social sciences. This presents substantial challenges to the future replicability and reusability of research. Traditional data archives currently consist almost entirely of numeric tabular data from noncommercial sources. New forms of data differ from tabular data in size, format, structure, and complexity. Left in its original form, this sort of data is difficult or impossible for scholars outside of the project that generated it to interpret and use. This is a barrier to integrative and interdisciplinary research, but also a significant obstacle to providing long-term access, which becomes practically impossible as the tacit knowledge necessary to interpret the data is forgotten. Enabling broad use and securing long-term access requires more than simply storing the individual bits of information: it requires establishing and disseminating good data management practices. [Altman & King 2007]
How research libraries can jump-start the process.
Many research libraries should consider at least three steps:
1. Create a dataverse hosted by the Harvard Dataverse Network ( http://thedata.harvard.edu/dvn/faces/login/CreatorRequestInfoPage.xhtml ). This provides free, permanent storage and dissemination, with bit-level preservation ensured by Harvard’s endowment. The dataverse can be branded, curated, and controlled by the library, so it enables libraries to maintain relationships with their patrons, and to provide curation services, with minimal effort. (And since DVN is open-source, a library can always move from the hosted service to one they run themselves.)
2. Link to DMPTool (https://dmp.cdlib.org/) from your library’s website. And consider joining DMPTool as an institution, especially if you use Shibboleth (Internet2) to authorize your users. You’ll be in good company: according to a recent ARL survey, 75% of ARL libraries are now at least linking to DMPTool. Increasing researchers’ use of DMPTool provides early opportunities for conversation with libraries around data, enables libraries to offer service at a time when it is salient to the researcher, and provides information which can be used to track and evaluate data management planning needs.
3. Publish a “libguide” focused on helping researchers get more credit for their work. This is a subject of intense interest, and the library can provide information about trends and tools in the area of which researchers (especially junior researchers) may not be aware. Some possible topics to include: data citation (e.g., http://www.force11.org/node/4769 ); researcher identifiers (e.g., http://orcid.org ); and impact metrics (http://libraries.mit.edu/scholarly/publishing/impact).
Altman, M., L. Andreev, M. Diggory, M. Krot, G. King, D. Kiskis, A. Sone, S. Verba, A Digital Library for the Dissemination and Replication of Quantitative Social Science Research, Social Science Computer Review 19(4):458-71. 2001.
Altman, M. and G. King. “A Proposed Standard for the Scholarly Citation of Quantitative Data”, D-Lib Magazine 13(3/4). 2007.
Anderson, R., W. H. Greene, B. D. McCullough, and H. D. Vinod. “The Role of Data/Code Archives in the Future of Economic Research,” Journal of Economic Methodology. 2007.
Buckheit, J. and D.L. Donoho, WaveLab and Reproducible Research, in A. Antoniadis (ed.) Wavelets and Statistics, Springer-Verlag. 1995.
Crosas, M., The Dataverse Network®: An Open-Source Application for Sharing, Discovering and Preserving Data, D-lib Magazine 17(1/2). 2011.
Deelman, E. and Y. Gil (Eds.). Final Report on Workshop on the Challenges of Scientific Workflows. 2006. <http://vtcpc.isi.edu/wiki/images/b/bf/NSFWorkflow-Final.pdf>
Freire, J., C. T. Silva, S. P. Callahan, E. Santos, C. E. Scheidegger, and H. T. Vo. Managing rapidly-evolving scientific workflows. In International Provenance and Annotation Workshop (IPAW), LNCS 4145, 10-18, 2006.
Gentleman R., R. Temple Lang. Statistical Analyses and Reproducible Research, Journal of Computational and Graphical Statistics 16(1): 1-23. 2007.
Gutmann M., M. Abrahamson, M. Adams, M. Altman, C. Arms, K. Bollen, M. Carlson, J. Crabtree, D. Donakowski, G. King, J. Lyle, M. Maynard, A. Pienta, R. Rockwell, L. Timms-Ferrara, C. Young, “From Preserving the Past to Preserving the Future: The Data-PASS Project and the challenges of preserving digital social science data”, Library Trends 57(3):315-337. 2009.
Hamermesh, D.S. “Viewpoint: Replication in Economics,” Canadian Journal of Economics. 2007.
Hedstrom, M., J. Niu, and K. Marz. “Incentives for Data Producers to Create ‘Archive/Ready’ Data: Implications for Archives and Records Management”, Proceedings of the Society of American Archivists Research Forum. 2008.
King, G. “An Introduction to the Dataverse Network as an Infrastructure for Data Sharing.” Sociological Methods and Research, 32(2), 173–199. 2007.
Knuth, D.E., Literate Programming, CSLI Lecture Notes 27. Center for the Study of Language and Information. Stanford, CA. 1992.
Leisch F., and A.J. Rossini, Reproducible Statistical Research, Chance 16(2): 46-50. 2003.
McCullough, B.D., Open Access Economics Journals and the Market for Reproducible Economic Research, Economic Analysis & Policy 39(1). 2009.
Pienta, A., LEADS Database Identifies At-Risk Legacy Studies, ICPSR Bulletin 27(1) 2006.
Schwab, M., M. Karrenbach, and J. Claerbout, Making Scientific Computations Reproducible, Computing in Science and Engineering 2: 61-67. 2000.
Stodden, V. The Legal Framework for Reproducible Scientific Research: Licensing and Copyright, Computing in Science and Engineering 11(1):35-40. 2009.
Also see, for example, the CRAN reproducible research task view: http://cran.r-project.org/web/views/ReproducibleResearch.html; and the Reproducible Research tools page: http://reproducibleresearch.net/index.php/RR_links#Tools