Keeping Bits Forever
Deploying cost-effective, reliable bit-level long-term preservation at scale remains an unsolved problem. Memory organizations have identified a number of high-level ‘best practices’, such as fixity checking and geographically distributed replication, but there is little specific guidance or empirically based information on selecting preservation strategies that fit a curating institution’s risk tolerance, threat profile, and budget. For example, cloud storage vendors such as Amazon tout 99.999999999% durability, but these claims typically lack substantial explanation or even clear definitions. Further, professional memory organizations vary significantly in the practices they use and in how they use them, even in the number of copies held.
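To see why an undefined durability figure is hard to act on, it helps to work through one possible reading of it. The sketch below assumes (and this is an assumption, not the vendor's stated definition) that 99.999999999% means each object independently survives a given year with that probability. The function names and the ten-million-object collection are hypothetical.

```python
import math

# ASSUMPTION: "eleven nines" read as a per-object, per-year loss probability.
ELEVEN_NINES_LOSS = 1e-11


def expected_annual_losses(n_objects: int, annual_loss_prob: float) -> float:
    """Expected number of objects lost per year if losses are independent."""
    return n_objects * annual_loss_prob


def prob_any_loss(n_objects: int, annual_loss_prob: float, years: int) -> float:
    """Probability that at least one object is lost over the whole period.

    Computes 1 - (1 - p)^(n * years) via log1p/expm1, because the naive
    formula loses floating-point precision when p is as small as 1e-11.
    """
    return -math.expm1(n_objects * years * math.log1p(-annual_loss_prob))


if __name__ == "__main__":
    n = 10_000_000  # hypothetical collection size
    print(expected_annual_losses(n, ELEVEN_NINES_LOSS))
    print(prob_any_loss(n, ELEVEN_NINES_LOSS, years=100))
```

Under this reading, a ten-million-object collection would expect about 0.0001 lost objects per year and roughly a 1% chance of losing anything over a century. The calculation rests entirely on independence, which is exactly what correlated threats such as organizational failure violate, so a figure like this says little on its own.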
In newly published research with Richard Landau, we use multi-level statistical simulation to model failures resulting from hardware faults, storage conditions, “normal” organizational failure, and correlated multi-organizational failures. By varying the parameters of this multi-level model, we can simulate everything from bad disk sectors to economic recessions to minor wars.
More formally, this work, presented at IDCC and forthcoming in the International Journal of Digital Curation (ArXiv preprint available), addresses the problem of formulating efficient and reliable operational preservation policies that ensure bit-level information integrity over long periods and in the presence of a diverse range of real-world technical, legal, organizational, and economic threats. We develop a systematic, quantitative prediction framework that combines formal modeling, discrete-event-based simulation, and hierarchical modeling, and then use empirically calibrated sensitivity analysis to identify effective strategies.
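For readers who want a feel for this kind of simulation, here is a deliberately tiny Monte Carlo sketch. It is not the model from the paper: it has a single level, assumes copies fail independently with a fixed yearly probability, and assumes a yearly audit that restores every failed copy whenever at least one good copy survives. All names and parameter values are hypothetical.

```python
import random


def document_lost(n_copies, p_fail, years, rng):
    """Simulate one document; True if all copies fail within one audit cycle.

    ASSUMPTIONS (illustrative only): each copy fails independently with
    probability p_fail per year, and a yearly audit re-creates failed
    copies from any surviving copy.
    """
    for _year in range(years):
        survivors = sum(1 for _copy in range(n_copies) if rng.random() >= p_fail)
        if survivors == 0:
            return True  # no good copy left to repair from
        # Otherwise the audit restores the full set of copies for next year.
    return False


def estimate_loss_rate(n_copies, p_fail, years, trials=10_000, seed=0):
    """Fraction of simulated documents that are permanently lost."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    losses = sum(document_lost(n_copies, p_fail, years, rng) for _ in range(trials))
    return losses / trials


if __name__ == "__main__":
    # Hypothetical parameters: 5% yearly failure per copy, 100-year horizon.
    for copies in (1, 3, 5, 7):
        print(copies, estimate_loss_rate(copies, p_fail=0.05, years=100))
```

Even this toy version shows how loss rates fall as copies are added. The published framework goes much further by layering correlated, organization-level failures on top of per-copy faults.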
After analyzing hundreds of thousands of simulations, we find that the answer is ten 🙂 Seven copies, distributed across different organizations and validated systematically every year in weekly increments, protect against bad disks, storage conditions, firm failures, recessions, and regional wars: everything except strong coordinated attacks on the auditing system itself. Using a cryptographically secure distributed auditing system and 3 additional copies will protect against even a strong attack, as long as at least half the servers remain. So… 10.
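The basic audit-and-repair cycle behind this recommendation can be sketched in a few lines. The example below is a simplified illustration, not the cryptographically secure distributed auditing protocol discussed in the research: it hashes every copy with SHA-256, treats the digest held by a strict majority as correct, and overwrites the dissenting copies. The function name and the holder names are hypothetical, and the strict-majority rule is an assumption of this sketch.

```python
import hashlib
from collections import Counter


def audit_and_repair(copies):
    """Compare SHA-256 digests across copies and repair the dissenters.

    `copies` maps a (hypothetical) holder name to the bytes it stores.
    The dict is repaired in place; the sorted names of repaired copies
    are returned.
    ASSUMPTION: a strict majority of copies is still intact.
    """
    if not copies:
        raise ValueError("no copies to audit")
    digests = {name: hashlib.sha256(data).hexdigest() for name, data in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(copies):
        raise ValueError("no strict majority; cannot tell which copy is good")
    # Pick any copy whose digest matches the majority as the repair source.
    good = next(copies[name] for name, digest in digests.items() if digest == winner)
    repaired = sorted(name for name, digest in digests.items() if digest != winner)
    for name in repaired:
        copies[name] = good
    return repaired


if __name__ == "__main__":
    holdings = {"org_a": b"bits", "org_b": b"bits", "org_c": b"b1ts"}
    print(audit_and_repair(holdings))
```

A majority vote like this handles random corruption but offers no defense against an adversary who can tamper with the audit itself, which is why the stronger scenario calls for secure auditing and extra copies.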
There are also some corollaries for those considering file format transformations…
- Use compression, provided it’s a well-established, lossless algorithm: what you lose to added fragility you gain back in lower replication and auditing costs.
- If you need to encrypt your data, keep 4 keys and audit these.
- If format failure is a substantial risk, maintain 4 readers for each format, and audit your collection with these.
Here’s a complete list of our recommendations, in accessible form.
All of the code for the simulation is available on my program’s GitHub site. For those interested in these questions and related areas, writings on digital preservation are linked from my web site.