Month: August 2016

The Role of Research Funding and Policy Community in Data Citation — Rewards, Incentives, and Infrastructure

Infrastructure and practices for data citation have made substantial progress over the last decade. This progress increases the potential rewards for data publication and reproducible science; however, overall incentives remain relatively weak for many researchers.

This blog post summarizes a presentation given at the National Academies of Sciences as part of the Data Citation Workshop: Developing Policy And Practice.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


Academic researchers as a class are drawn to research and scholarship through an interest in puzzle-solving, but they are also substantially incented by recognition and money. Typically, these incentives are shaped and channeled through the processes and institutions of tenure and review; publication; grants, awards, and prizes; industry consulting; and professional collaboration and mentoring. [1]

Citations have been described as academic “currency”, and while this is not literally true, they are a particularly visible form of recognition in the academy, and increasingly tied to monetary incentives as well. [2] Thus rules, norms, and institutions that affect citation practices have a substantial potential to change incentives.

When effort is invisible it is apt to be undervalued. Data has been the “dark matter” of the scholarly ecosystem; data citation aims to make the role of data visible. While the citation of data is not entirely novel, there has been a concerted effort across researchers, funders, and publishers over approximately the last decade to reengineer data citation standards and tools so that they create more rational incentives for producing reusable and reproducible research. [3] In more formal terms, the proximate aim of the current data citation movement is to make transparent the linkages among research claims, the evidence base on which these claims rest, and the contributors who are responsible for that evidence base. The longer-term aim is to shift the equilibrium of incentives so that building the common scientific evidence base is rewarded in proportion to its benefit to the overall scientific community.


There has been notable progress in standards, policies, and tools for data citation since the ‘bad old days’ of 2007, which Gary King and I grimly characterized at the time [4]:

How much slower would scientific progress be if the near universal standards for scholarly citation of articles and books had never been developed? Suppose shortly after publication only some printed works could be reliably found by other scholars; or if researchers were only permitted to read an article if they first committed not to criticize it, or were required to coauthor with the original author any work that built on the original. How many discoveries would never have been made if the titles of books and articles in libraries changed unpredictably, with no link back to the old title; if printed works existed in different libraries under different titles; if researchers routinely redistributed modified versions of other authors’ works without changing the title or author listed; or if publishing new editions of books meant that earlier editions were destroyed? …

Unfortunately, no such universal standards exist for citing quantitative data, and so all the problems listed above exist now. Practices vary from field to field, archive to archive, and often from article to article. The data cited may no longer exist, may not be available publicly, or may have never been held by anyone but the investigator. Data listed as available from the author are unlikely to be available for long and will not be available after the author retires or dies. Sometimes URLs are given, but they often do not persist. In recent years, a major archive renumbered all its acquisitions, rendering all citations to data it held invalid; identical data was distributed in different archives with different identifiers; data sets have been expanded or corrected and the old data, on which prior literature is based, was destroyed or renumbered and so is inaccessible; and modified versions of data are routinely distributed under the same name, without any standard for versioning. Copyeditors have no fixed rules, and often no rules whatsoever. Data are sometimes listed in the bibliography, sometimes in the text, sometimes not at all, and rarely with enough information to guarantee future access to the identical data set. Replicating published tables and figures even without having to rerun the original experiment, is often difficult or impossible.

A decade ago, some publishers had data transparency policies, but these were routinely honored in the breach. Now, a number of high-profile journals both require that authors cite or include the data on which their publications rest and have mechanisms to enforce this requirement. PLOS is a notable example: its Data Availability Statement [5] states not only that data should be shared, but that articles should provide the persistent identifiers of shared data, and that these should resolve to well-known repositories.

A decade ago, the only major funder that had an organization-wide data sharing policy was NIH [6], and this policy had notable limitations: it applied only to large grants, and the resource sharing statements it required were brief, not peer reviewed, and not monitored. Today, as Jerry Sheehan noted in his presentation on Increasing Access to the Results of Federally Funded Scientific Research: Data Management and Citation, almost all federal support for research now complies with the Holdren memo, which requires policies and data management plans “describing how they will provide for long-term preservation of, and access to, scientific data”. [7] A number of foundation funders have adopted similar policies. Furthermore, as panelist Patricia Knezek noted, data management plans are now part of the peer review process at the National Science Foundation, and datasets may be included in the biosketches that are part of the funding application process.

A decade ago, few journals published replication data, and no high-profile journals existed that published data. Over the last several years, the number of data journals has increased, and Nature Research launched Scientific Data, which has substantially raised the visibility of data publications.

A decade ago, tools for data citation were non-existent, and the infrastructure for general open data sharing outside of specific community collections was essentially limited to ICPSR’s publication-related archive [8] and Harvard’s Virtual Data Center [9] (which later became the Dataverse Network). Today, as panelists throughout the day noted [10], infrastructure such as CKAN, Figshare, and close to a dozen Dataverse-based archives accepts open data from any field; [11] there are rich public directories of archives such as re3data; and large data citation indices, such as DataCite and the TR Data Citation Index, enable data citations to be discovered and evaluated. [12]

These new tools are critical for creating data sharing incentives and rewards. They allow data to be shared and discovered for reuse, reuse to be attributed, and that attribution to be incorporated into metrics of scholarly productivity and impact. Moreover, much of this infrastructure exists in large part because it received substantial startup support from the research funding community.


While open repositories and data citation indices enable researchers to more effectively get credit for data that is cited directly, and there is limited evidence that sharing research data is associated with higher citation rates, data sharing and citation remain quite limited. [13] As Marcia McNutt noted in her talk on Data Sharing: Some Cultural Perspectives, progress likely depends at least as much on cultural and organizational change as on technical advances.

Formally, the indexing of data citations enables citations to data to contribute to a researcher’s h-index and other measures of scholarly productivity. Speaker Dianne Martin noted in the panel on Reward/Incentive Structures that her institution (George Washington University) had begun to recognize data sharing and citation in the tenure and review process.
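To make the h-index point concrete, here is a minimal sketch using the standard definition (the largest h such that at least h works have at least h citations each). The citation counts are invented for illustration, and treating dataset citations exactly like article citations is an assumption; actual indexing services may count them differently.

```python
def h_index(citation_counts):
    """Return the largest h such that at least h works have >= h citations."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical citation counts, invented purely for illustration.
article_citations = [12, 7, 4, 2]
dataset_citations = [6, 5]

articles_only = h_index(article_citations)
with_datasets = h_index(article_citations + dataset_citations)
```

In this invented example the researcher's h-index rises only because the indexed datasets are counted alongside the articles, which is the mechanism the paragraph above describes.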

Despite the substantial progress over the last decade, there is little evidence that the incorporation of data citation and publication into tenure and review is yet either systematic or widespread.  Overall, positive incentives for citing data still appear to remain relatively weak:

  1. It remains the case that data is often used without being cited. [14]
  2. Even where data is cited, most individual data publications (with notable exceptions, primarily in the category of large community databases) are neither in high-impact publications nor highly cited. Since scientists achieve significant recognition most often through publication in “high-impact” journals, and increasingly through publishing articles that are highly cited, devoting effort to data publishing has a high opportunity cost.
  3. Even when data is cited, publishing it is often perceived as increasing the likelihood that others will “leapfrog” one’s research and publish high-impact articles with priority. Since scientific recognition relies strongly on priority of publication, this risk is a disincentive.
  4. While data citation and publication likely strengthen reproducibility, they also make it easier for others to criticize published work. In the absence of strong positive rewards for reproducible research, this risk may be a disincentive overall.


Funders and policy-makers have the potential to do more to strengthen positive incentives. Funders should support mechanisms to assess and assign “transitive” credit, which would provide some share of the credit for publications to the other data and publications on which they rely. [15] And funders and policy-makers should support strong positive incentives for reproducible research, such as funding and explicit recognition. [16]
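One way to picture transitive credit is as fractional shares that flow through a dependency graph. The sketch below is an illustrative toy model, not the scheme specified in [15]: the `credit_maps` structure, the names in it, and the share values are all hypothetical, and it assumes that each product's shares sum to 1 and that dependencies contain no cycles.

```python
def transitive_credit(product, credit_maps, weight=1.0, totals=None):
    """Distribute `weight` units of credit for `product` to people.

    credit_maps maps each product to {recipient: share}. A recipient that is
    itself a key of credit_maps is another product, so its share is passed on
    to that product's own contributors. Assumes an acyclic dependency graph.
    """
    if totals is None:
        totals = {}
    for recipient, share in credit_maps[product].items():
        if recipient in credit_maps:
            # Another product: pass the credit through to its contributors.
            transitive_credit(recipient, credit_maps, weight * share, totals)
        else:
            # A person: accumulate the credit directly.
            totals[recipient] = totals.get(recipient, 0.0) + weight * share
    return totals


# Hypothetical products and shares, invented purely for illustration.
credit_maps = {
    "article": {"author_a": 0.5, "author_b": 0.25, "dataset": 0.25},
    "dataset": {"curator_c": 0.75, "author_a": 0.25},
}

credit = transitive_credit("article", credit_maps)
```

Here the data curator receives a share of the article's credit even though the curator is not an author of the article, which is the kind of reward for building the evidence base that the paragraph above argues for.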

Thus far, much of the effort by funders, who are key stakeholders, has focused on compliance. In general, compliance has substantial limits as a design principle:

  • Compliance generates incentives to follow a stated rule — but not generally to go beyond to promote the values that motivated the rule.
  • Actors still need resources to comply, and as Chandler and other speakers on the panel on Supporting Organizations Facing The Challenges of Data Citation noted, compliance with data sharing is often viewed as an unfunded mandate.
  • Compliance-based incentives are prone to failure where the standards for compliance are ambiguous or conflicting.
  • Further, actors have incentives to comply with rules only when they have an expectation that behavior can be monitored, that the rule-maker will monitor behavior, and that violations of the rules will be penalized.
  • Moreover, external incentives, such as compliance, can displace existing internal motivations and social norms [17], yielding a reduction in the desired behavior. Thus we should not expect compliance alone to promote the values that the rule supports.

Journals have increased the monitoring of data transparency and sharing, primarily through policies like PLOS’s that require the author to supply, before publication, a replication data set and/or an explicit data citation or persistent identifier that resolves to data in a well-known repository. This appears to be substantially increasing compliance with journal policies that had been on the books for over a decade.

However, neither universities nor funders routinely audit or monitor compliance with data management plans. As panelist Patricia Knezek emphasized, there are many open questions about how funders will monitor compliance, how to incent compliance after the award is complete, and how responsibility for compliance is divided between the funded institution and the funded investigator. Further, as noted in the panelists’ discussion with the workshop audience, data management plans for funded research are not made available to the public along with the abstracts, which creates a barrier to community-based monitoring and norms; scientists in the federal government are not currently subject to the same data sharing and management requirements as scientists in academia; and there is a need to support ‘convening’ organizations such as FORCE11 and the Research Data Alliance to bring multiple stakeholders to the table to align strategies on incentives and compliance.

Finally, as Cliff Lynch noted in the final discussion session of the workshop, compliance with data sharing requirements often comes into conflict with confidentiality requirements for the protection of data obtained from individuals and businesses, especially in the social, behavioral, and health sciences.  This is not a fundamental conflict  — it is possible to enable access to data without any intellectual property restrictions while still maintaining privacy. [18] However, absent common policies and legal instruments for intellectually-open but personally-confidential data, confidentiality requirements are a barrier (or sometimes an excuse) to open data.


[1] See for a review: Stephan PE. How economics shapes science. Cambridge, MA: Harvard University Press; 2012 Jan 15.

[2] Cronin B. The citation process: The role and significance of citations in scientific communication. London: Taylor Graham; 1984.

[3] Altman M, Crosas M. The evolution of data citation: from principles to implementation. IASSIST Quarterly. 2013 Mar 1;37(1-4):62-70.

[4] Altman, Micah, and Gary King. “A proposed standard for the scholarly citation of quantitative data.” D-lib Magazine 13.3/4 (2007).

[5] See:

[6] See Final NIH Statement on Sharing Research Data, 2003, NOT-OD-03-032. Available from:

[7]  Holdren, J.P. 2013, “Increasing Access to the Results of Federally Funded Scientific Research”, OSTP. Available from:

[8]  King, Gary. “Replication, replication.” PS: Political Science & Politics 28.03 (1995): 444-452.

[9]  Altman M. Open source software for Libraries: from Greenstone to the Virtual Data Center and beyond. IASSIST Quarterly. 2002;25.

[10]  See particularly the presentation and discussion on Tools and Connections; Supporting Organizations Facing The Challenges of Data Citation, and Reward/Incentive Structures.  

[11] See , ,

[12] See , ,

[13]  Borgman CL. The conundrum of sharing research data. Journal of the American Society for Information Science and Technology. 2012 Jun 1;63(6):1059-78.

[14]  Read, Kevin B., Jerry R. Sheehan, Michael F. Huerta, Lou S. Knecht, James G. Mork, and Betsy L. Humphreys. “Sizing the Problem of Improving Discovery and Access to NIH-Funded Data: A Preliminary Study.” PLoS ONE 10, no. 7 (2015): e0132735.

[15] See Katz, D.S., Choi, S.C.T., Wilkins-Diehr, N., Hong, N.C., Venters, C.C., Howison, J., Seinstra, F., Jones, M., Cranston, K., Clune, T.L. and de Val-Borro, M., 2015. Report on the second workshop on sustainable software for science: Practice and experiences (WSSSPE2). arXiv preprint arXiv:1507.01715.

[16]  See Nosek BA, Spies JR, Motyl M. Scientific utopia II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science. 2012 Nov 1;7(6):615-31; and Brandon, Alec, and John A. List. “Markets for replication.” Proceedings of the National Academy of Sciences 112.50 (2015): 15267-15268.

[17] Gneezy, U. and Rustichini, A., 2000. A fine is a price. J. Legal Studies, 29,