Jefferson Bailey is Director of Web Archiving at Internet Archive. Jefferson joined Internet Archive in Summer 2014 and manages Internet Archive’s web archiving services including Archive-It, used by over 500 institutions to preserve the web. He also oversees contract and domain-scale web archiving services for national libraries and archives around the world. He works closely with partner institutions on collaborative technology development, digital preservation, data research services, educational partnerships, and other programs. He presented the talk recorded below, entitled Safety Nets: Rescue And Revival For Endangered Born-digital Records, as part of the Program on Information Science Brown Bag Series:
Bailey abstracted his talk as follows:
The web is now firmly established as the primary communication and publication platform for sharing and accessing social and cultural materials. This networked world has created both opportunities and pitfalls for libraries and archives in their mission to preserve and provide ongoing access to knowledge. How can the affordances of the web be leveraged to drastically extend the plurality of representation in the archive? What challenges are imposed by the intrinsic ephemerality and mutability of online information? What methodological reorientations are demanded by the scale and dynamism of machine-generated cultural artifacts? This talk will explore the interplay of the web, contemporary historical records, and the programs, technologies, and approaches by which libraries and archives are working to extend their mission to preserve and provide access to the evidence of human activity in a world distinguished by the ubiquity of born-digital materials.
Bailey eloquently stated the importance of web archiving: “No future scholarship can study our era without considering materials published (only) on the web.” Further, he emphasized the importance of web archiving for social justice: Traditional archives disproportionately reflect social architectures of power, and the lived experiences of the advantaged. Web crawls capture a much broader (although not nearly complete) picture of the human experience.
The talk ranged over an impressively wide portfolio of initiatives, far too many to do justice to in a single blog post. Much more detail on these projects can be found in the slides and video above, in Bailey’s professional writings, and on the Archive’s blog and experiments page and the Archive-It blog.
A unified argument ran through Bailey’s presentation. At the risk of oversimplifying, I’ll restate the premises of the argument here:
- Understanding our era will require research, using large portions of the web, linked across time.
- The web is big, but not too big to collect (a substantial portion of) it.
- Providing simple access (e.g. retrieval, linking) is more expensive than collection;
enabling discovery (e.g. search) is much harder than simple access;
and supporting computational research (which requires analysis at web-scale, and over time)
is much, much harder than discovery.
- Research libraries should help with this (hardest) part.
I find the first three parts of the argument largely convincing. Increasingly, new discoveries in social science are based on analysis of massive collections of data that are generated as a result of people’s public communications, and depend on tracing these actions and their consequences over time. The Internet Archive’s success to date establishes that many of these public communications can be collected and retained over time. And the history of database design (as well as my colleagues’ and my own experiences in archiving and digital libraries) testifies to the challenges of effective discovery and access at scale.
I hope that we, as research libraries, will step up to the challenges of enabling large-scale, long-term research over content such as this. Research libraries already have a stake in this problem because most of the core ideas and fundamental methods (although not the operational platforms) for analysis of data at this scale come from research institutions with which we are affiliated. Moreover, if libraries lead the design of these platforms, participation in research will be far more open and equitable than if these platforms are ceded entirely to commercial actors.
For this among other reasons, we are convening a Summit on Grand Challenges in Information Science & Scholarly Communication, supported by a generous grant from the Mellon Foundation. During this summit we will develop community research agendas in the areas of scholarly discovery at scale; digital curation and preservation; and open scholarship. For those interested in these questions and related areas, we have published Program on Information Science reports and blog posts on some of the challenges of digital preservation at scale.
The Internet Archive currently holds 35 petabytes of information, which is roughly equivalent to the text of 7 million long novels, or to the amount of new information produced across the globe every 45 minutes.