Month: October 2012

Thoughts on Mitigating Threats to Data Quality Throughout the Curation Lifecycle

The workshop report from the UNC Curating for Data Quality workshop, in which I was delighted to participate, is now available. It collects many perspectives addressing a number of questions:

Data Quality Criteria and Contexts. What are the characteristics of data quality? What threats to data quality arise at different stages of the data life cycle? What kinds of work processes affect data quality? What elements of the curatorial process most strongly affect data quality over time? How do data types and contexts influence data quality parameters?

Human and Institutional Factors. What are the costs associated with different levels of data quality? What kinds of incentives and constraints influence efforts of different stakeholders? How does one estimate the continuum from critical to tolerable errors? How often does one need to validate data?

Tools for Effective and Painless Curation. What kinds of tools and techniques exist or are required to ensure that creators and curators address data quality?

Metrics. What are or should be the measures of data quality? How does one identify errors? How does one correct errors or mitigate their effects?

My current perspective, after reflecting on seven ‘quality’ frameworks from different disciplines that differ in complex and deep ways, is that the data quality criteria implied by the candidate frameworks are neither easily harmonized nor readily quantified. Thus, a generalized systematic approach to evaluating data quality seems unlikely to emerge soon. Fortunately, developing an effective approach to digital curation that respects data quality does not require a comprehensive definition of data quality. Instead, we can appropriately address “data quality” in curation by limiting our consideration to a narrower set of applied questions:

Which aspects of data quality are (potentially) affected by (each stage of) digital curation activity? And how do we keep data quality properties invariant at each curation stage?

A number of approaches seem particularly likely to bear fruit:

  1. Incorporate portfolio diversification in selection and appraisal.
  2. Support validation of preservation quality attributes such as authenticity, integrity, organization, and chain of custody throughout long-term preservation and use — from ingest through delivery and creation of derivative works.
  3. Apply semantic fingerprints for quality evaluation during ingest, format migration and delivery.
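To make the third approach concrete, here is a minimal sketch of what a “semantic fingerprint” might look like for tabular data. All names and the fingerprinting scheme (hashing per-column summary statistics rather than raw bytes) are illustrative assumptions, not a description of any particular production tool; the point is only that such a fingerprint is invariant under format migrations that preserve the data's meaning.

```python
import hashlib
import json
import statistics

def semantic_fingerprint(rows):
    """Compute a format-independent fingerprint of numeric tabular data.

    Hashes per-column summary statistics rather than raw bytes, so the
    fingerprint survives format migrations (e.g. CSV to another container,
    "1" re-serialized as "1.0") that preserve the semantic content.
    Hypothetical sketch -- a real scheme would cover types, ordering, etc.
    """
    columns = list(zip(*rows))  # transpose rows into columns
    summary = []
    for col in columns:
        values = [float(v) for v in col]
        summary.append({
            "n": len(values),
            "mean": round(statistics.mean(values), 9),
            "min": min(values),
            "max": max(values),
        })
    # Canonical serialization so the digest is stable across runs.
    blob = json.dumps(summary, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

# Same data before and after a (simulated) format migration:
original = [("1", "10.5"), ("2", "11.0"), ("3", "9.5")]
migrated = [("1.0", "10.50"), ("2.0", "11.00"), ("3.0", "9.50")]
assert semantic_fingerprint(original) == semantic_fingerprint(migrated)
```

A byte-level checksum would flag the migrated file as changed; the semantic fingerprint instead flags only changes that alter the data's content, which is what a quality check at migration time actually needs.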

These approaches have the advantage of being independent of the content subject area, of the domain of measurement, and of the particular semantic content of objects and collections. They are therefore broadly applicable. By mitigating these broad-spectrum threats to quality, we can improve the overall quality of curated collections, and their expected value to target communities.

My extended thoughts are here:

You may also be interested in the other presentations from the workshop, which are posted on the Conference Site.

Early Results from Auditing Distributed Preservation Networks

I was pleased to participate in the 2012 PLN Community Meeting.

Over the last decade, replication has become a required practice for digital preservation. Now, Distributed Digital Preservation (DDP) networks are emerging as a vital strategy to ensure long-term access to the scientific evidence base and cultural heritage. A number of DDP networks are currently in production, including CLOCKSS, Data-PASS, MetaArchive, COPPUL, LuKII, PeDALS, Synergies, and DataONE, and new networks, such as DFC and DPN, are being developed.

These networks were created to mitigate the risk of content loss by diversifying across software architectures, organizational structures, geographic regions, as well as legal, political, and economic environments. And many of these networks have been successful at replicating a diverse set of content.

However, the point of the replication enterprise is recovery. Archival recovery is a harder problem than replication alone, because one must validate not only that a set of objects is recoverable, but also that the recovered collection contains sufficient metadata and contextual information to remain interpretable. The AIHT exercise sponsored by the Library of Congress demonstrated this difficulty: many collections thought to be substantially “complete” could not be successfully re-ingested (i.e., recovered) by another archive, even in the absence of bit-level failures.
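The two-sided nature of recovery validation can be sketched in a few lines. The layout below (a checksum manifest plus a metadata record with required fields) is a hypothetical simplification, not the AIHT format or any archive's actual ingest schema; it only illustrates that bit-level integrity and interpretability are separate checks.

```python
import hashlib
from pathlib import Path

# Illustrative set of contextual fields a re-ingesting archive might require.
REQUIRED_METADATA = {"title", "creator", "provenance"}

def check_recovery(collection_dir, manifest, metadata):
    """Verify both bit-level recoverability and basic interpretability.

    `manifest` maps relative paths to expected SHA-256 digests;
    `metadata` is the collection-level metadata record. Returns a list
    of problems (empty means the recovery check passed).
    """
    problems = []
    # 1. Bit-level check: every manifest entry present with matching digest.
    for rel_path, expected_digest in manifest.items():
        path = Path(collection_dir) / rel_path
        if not path.is_file():
            problems.append(f"missing object: {rel_path}")
        elif hashlib.sha256(path.read_bytes()).hexdigest() != expected_digest:
            problems.append(f"corrupt object: {rel_path}")
    # 2. Interpretability check: required contextual metadata present.
    for field in sorted(REQUIRED_METADATA - metadata.keys()):
        problems.append(f"missing metadata field: {field}")
    return problems
```

A collection can pass check 1 perfectly and still fail check 2, which is exactly the failure mode the AIHT exercise surfaced: intact bits that another archive could not meaningfully re-ingest.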

In a presentation co-authored with Jonathan Crabtree, we summarized some lessons learned from trial audits of several production distributed digital preservation networks. These audits were conducted using the open source SafeArchive system, which enables automated auditing of a selection of TRAC criteria related to replication and storage. Analysis of the trial audits demonstrates the complexities of auditing modern replicated storage networks, reveals common gaps between archival policy and practice, and exposes gaps in the auditing tools we have available. Our presentation, below, focused on the importance of designing auditing systems to provide information that can be used to diagnose non-conformance with audited policies. Tom Lipkis followed with specific planned and possible extensions to LOCKSS that would enhance diagnosis and auditing.
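The core of a replication audit can be illustrated with a small sketch. The data model below (each node reporting an inventory of object digests, checked against a minimum-replica policy with a simple majority-consensus rule) is an assumption for illustration only; it is not SafeArchive's actual API or the TRAC criteria themselves.

```python
from collections import Counter

def audit_replication(inventories, min_replicas):
    """Report objects that violate a minimum-replica policy.

    `inventories` maps node name -> {object_id: digest}. An object
    conforms if at least `min_replicas` nodes hold a copy whose digest
    matches the majority digest (a simple consensus rule). Returns a
    dict of findings; empty means the policy is satisfied.
    """
    findings = {}
    all_ids = set().union(*(inv.keys() for inv in inventories.values()))
    for obj_id in sorted(all_ids):
        digests = [inv[obj_id] for inv in inventories.values()
                   if obj_id in inv]
        majority_digest, count = Counter(digests).most_common(1)[0]
        if count < min_replicas:
            # Diagnostic detail, not just pass/fail, so operators can
            # tell a missing replica from a corrupt one.
            findings[obj_id] = {
                "matching_replicas": count,
                "required": min_replicas,
            }
    return findings

inventories = {
    "node-a": {"obj1": "d1", "obj2": "d2"},
    "node-b": {"obj1": "d1", "obj2": "XX"},  # corrupt copy of obj2
    "node-c": {"obj1": "d1"},                # obj2 never replicated here
}
print(audit_replication(inventories, min_replicas=2))
# obj2 has only one copy matching the majority digest
```

Even this toy version shows why audit output needs to be diagnostic: a finding of "one matching replica" could mean a failed replication, a corrupted copy, or a stale inventory, and the operator needs enough detail to tell which.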

You may also be interested in the other presentations from the workshop, which are posted on the PLN2012 Website.