Month: December 2012

Auditing Distributed Preservation Networks

The core ideas around distributed digital preservation are well-recognized in the library and archival communities. Geographically distributed replication of content has been clearly identified for at least a decade as a best practice for management of digital content, to which one must provide long-term access. The idea of using replication to reduce the risk of loss was explicit in the design of LOCKSS a decade ago. More recently, replication has been incorporated into best practices and standards. In order to be fully “trusted”, an organization must have a managed process for creating, maintaining, and verifying multiple geographically distributed copies of its collections — and this is now recognized as good community practice. Further, this requirement has been incorporated in Trustworthy Repositories Audit & Certification (TRAC) (see section C.1.2-C.1. in [8]), and in the newly released ISO standard that has followed it.

However, while geographic replication can be used to mitigate some risks (e.g. hardware failure), less technical risks such as curatorial error, internal malfeasance, economic failure, and organizational failure require that replications be diversified across distributed, collaborative organizations. The LOCKSS research team has also identified a taxonomy of potential single-points-of-failure (highly correlated risks), that at minimum, a trustworthy preservation system should mitigate against. These risks include media failure, hardware failure, software failure, communication errors, network failure, media and hardware obsolescence, software obsolescence, operator error, natural disaster, external attack, internal attack, economic failure and organizational failure [9].

There are a number of existing organizational/institutional efforts to mitigate threats to the digital scholarly record (and scientific evidence base) through replication of content. Of these, the longest continuously-running organization is LOCKSS. (Currently, the LOCKSS organization manages the Global LOCKSS Network, in which over one hundred libraries collaborate to replicating over 9000 e-journal titles from over five hundred publishers. Furthermore, there are ten different membership organizations currently using the LOCKSS system to host their own separate replication networks.) Each of these membership organizations (or quasi-organizations) represent a variation on the collaborative preservation model, comprises separate institutional members, and targets a different set of content.

Auditing is especially challenging for distributed digital preservation, and essential for four reasons:

1. Replication can prevent long-term loss of content only when loss or corruption of a copy is detected and repaired using other replicates — this is a form of low-level auditing. Without detection and repair, it is a statistical near-certainty that content will be lost to non-malicious threats. Surprisingly, one reduces risks far more effectively by having a few replicates and verifying their fixity very frequently, than one does by having less frequent auditing and more replicas — at least for the types of threats that affect individual (or small groups of) content objects at random (e.g. many forms of result of media failure, hardware, software, and curatorial error).

2. The point of the replication enterprise is recovery. Regular restore/recovery audits that test the ability to restore the entire collection or randomly selected collections/items are considered good practice even to establish the reliability of short term backups and are required by IT disaster-recovery planning frameworks. Archival recovery is an even harder problem because one needs to validate not only that a set of objects are recoverable, but that the collection recovered also contains sufficient metadata and contextual information to remain interpretable! A demonstration of the difficulty of this was the AIHT exercise sponsored by the Library of Congress, which demonstrated that many collections thought to be substantially “complete” could not be successfully re-ingested (i.e. recovered) by another archive even in the absence of bit-level failures. [28,29] Because DPN is planned as a dark repository, recovery must be demonstrated within the system — it cannot be demonstrated through active use.

3. Transparent, regular, and systematic auditing is one of the few available safeguards against insider/internal threats.

4. Auditing of compliance with higher-level replication policies is recognized as essential for managing risks generally . For scalability, auditing of policies that apply to the individual management of collections must be automated. In order to automate these policies, there must be a transparent and reliable mapping from the higher level policies governing collections and their contents to the information and controls provided by technical infrastructure mapped.

This presentation, delivered at CNI 2012, summarizes the lessons learned from trial audits of several production distributed digital preservation networks. These audits were conducted using the open source SafeArchive system, which enables automated auditing of a selection of TRAC criteria related to replication and storage. An analysis of the trial audits demonstrates both the complexities of auditing modern replicated storage networks, and reveals common gaps between archival policy and practice. Recommendations for closing these gaps are discussed, as are extensions that have been added to the SafeArchive system to mitigate risks in distributed digital preservation (DDP).