Month: November 2012

Amazon’s Creeping ‘Glacier’ and Digital Preservation

Amazon recently announced integration of their core S3 service with their low-cost storage system, Glacier. This facilitates the ability to add rules to S3 (or their reduced redundancy store) based on age, date, and S3 bucket prefix.

Regular incremental improvement and integration is a signature of Amazon’s modus operandi for its cloud services: Amazon has a pattern of announcing updates every few weeks that add services, integrate existing services, or (occasionally) lower prices. And they have introduced incremental improvements to the AWS platform over a dozen times since Glacier was announced at the end of August.

Interestingly, Glacier is an apt metaphor for this low-cost service in that it not only signifies “cold” storage, but also signals a massive object making its way slowly but inexorably across a space, covering everything in its path.

Why Glacier is Important

Why is Glacier important? First, as James Hamilton (disclosure, James is VP and Distinguished Engineer at Amazon) aptly summarizes, Glacier provides the volume economics for multi-site, replicated, cold storage (near-line) to small-scale and medium-scale users. While do-it-yourself solutions based on automated tape libraries can still beat Glacier’s price by a huge margin, the sweet spot for this approach has been shifted out so that only very large enterprises are likely to beat the price on Glacier by rolling out their own solutions using tape libraries, etc.

Second, many businesses, and also many services, are built upon or backed up through AWS and S3. Amazon’s continued integration of Glacier into AWS will make it increasingly straightforward to integrate low-cost cold-storage replication into preservation services such as DuraCloud, backup services such as Zmanda, and even into simple software tools like Cyberduck.

Overall, I’m optimistic that this is a Good Thing, and will improve the likelihood of meaningful future access to digital content. However, there are a number of substantial issues to keep in mind when considering Glacier as part of a digital preservation solution.

Issue 1. Technical infrastructure does not guarantee long term durability

Although some commenters have stated that Glacier will “probably outlive us all“, these claims are based on little evidence. The durability of institutions and services relies as much upon economic models, business models, organizational models, and organizational mission as upon technology. Based on the history of technology companies, one must consider that there is a substantial probability that Amazon itself will not be in existence in fifty years, and the future existence of any specific Amazon service is even more doubtful.

Issue 2. Lock-in and future cost projections

As Wired dramatically illustrated, the costs of retrieving all of one’s data from Glacier can be quite substantial. Further, as David Rosenthal has repeatedly pointed out, the long-term cost-competitiveness of preservation services depends “not on their initial pricing, but on how closely their pricing tracks the Kryder’s Law decrease in storage media costs”. And that “It is anyone’s guess how quickly Amazon will drop Glacier’s prices as the underlying storage media costs drop.” The importance of this future price-uncertainty is magnified by the degree of lock-in exhibited by the Glacier service.

Issue 3. Correlated failures

Amazon claims a ‘design reliability’ of ‘99.999999999%‘. This appears to be an extremely optimistic number without any formal published analysis backing it. The number appears to be based on a projection of theoretical failure rates for storage hardware (and such rates are  wildly optimistic under production conditions), together with the (unrealistic) assumption that all such failures are statistically independent.  Moreover, this ‘design reliability’ claim is unsupported (at time of writing) by Glacier’s terms of service, SLA, or customer agreement. To the contrary, the agreements appear to indemnify Amazon against any loss of damage, does not appear to offer a separate SLA for Glacier, and limits recovery under existing SLA’s (for services, such as S3) to refund of fees for periods the service was unavailable. If Amazon were highly confident in the applicability of the quoted ‘design reliability’ to production settings, one might expect a stronger SLA. However, despite these caveats, my guess is that Glacier’s will still turn out to be, in practice, substantially more reliable than the DIY solutions that most individual organizations can afford to implement entirely in-house.

Nevertheless, as previously discussed (most recently at Digital Preservation 2012), a large part of risk mitigation for digital assets is to diversify against sources of correlated failure. Although implementation details are not complete, Glacier does appear to diversify against some common risks to bits — primarily media failure, hardware failure, and localized natural disaster (such as fire, flood). This is good, but far from complete. A number of likely single-point (or highly correlated) vulnerabilities remain, including software failure (e.g. a bug in the AWS software for its control backplane might result in permanent loss that would go undetected for a substantial time; or cause other cascading failures — analogous to those we’ve seen previously); legal threats (leading to account lock-out — such as this, deletion, or content removal); or other institutional threats (such as a change in Amazon’s business model). It is critical that diversification against these additional failures be incorporated into a digital preservation strategy.

Preliminary Recommendations

To sum up, Glacier is an important service, and appears to be a solid option for cold storage, but institutions that are responsible for digital preservation and long-term access should not use the quoted design reliability in modeling likelihood of loss, nor rely on Glacier as the sole archival mechanism for their content.

Participative Geography, Information Science, and Politics

Lately, our DistrictBuilder software, a tool that allows people to easily participate in creating election districts, has gotten some additional attention. We recently received an Outstanding Software Development Award from the American Political Science Association (given by the Information Technology & Politics Section) and a Data Innovation Award given by the O’Reilly Strata Conference (for data with social impact). And just last week, we had the opportunity to present our work to the government of Mexico at the invitation of the Instituto Federal Electoral, as part of their International Colloquium on Redistricting.

During this presentation, I was able to reflect on the interplay of algorithms and public participation. and it became even clearer to me that applications like DistrictBuilder exemplify the ability of information science to improve policy and politics.

Redistricting in Mexico is particularly interesting, since it relies heavily on facially neutral geo-demographic criteria and optimization algorithms, which represents a different sort of contribution from information science. Thus, it was particularly interesting to me to consider the interplay between algorithmic approaches to problem solving and “wisdom of crowd” approaches, especially for problems in the public sphere.

It’s clear that complex optimization algorithms are an advance in redistricting in Mexico, and have an important role in public policy. However, they also have a number of limitations:

  • Algorithmic optimization solutions often depend on a choice of (theoretically arbitrary) ‘starting values’ from which the algorithm starts its search for a solution.
  • Quality algorithmic solutions typically rely on accurate input data.
  • Many optimization algorithms embed particular criteria or particular constraints into the algorithm itself.
  • Even where optimization algorithms are nominally agnostic to the criteria used for the goal, some criteria are more tractable than others; and some are more tractable for particular algorithms.
  • In many cases, when an algorithm yields a solution, we don’t know exactly (or even approximately, in any formal sense) how good that solution is.

I argue that explicitly incorporating a human element is important for algorithmic solutions in the public sphere. In particular:

  • Use open documentation and open (non-patented, or open-licensed) to enable external replication of algorithms.
  • Use open source to enable external verification of the implementation of particular algorithms.
  • Incorporate public input to improve the data (especially describing local communities and circumstances) in algorithm driven policies.
  • Incorporate crowd-sourced solutions as candidate “starting values” for further algorithmic refinement.
  • Subject algorithmic output to crowd-sourced public review to verify the quality of the solutions produced.

You can see the slides, which include more detail and references below. For much such slides, refer to our PublicMapping project site.