Keeping Bits Forever

Deploying cost-effective, reliable bit-level long-term preservation at scale remains an unsolved problem. Memory organizations have identified a number of high-level ‘best practices’, such as fixity checking and geographically distributed replication, but there is little specific guidance or empirically based information on selecting preservation strategies that fit a curating institution’s risk tolerance, threat profile, and budget. Thus, while cloud storage vendors such as Amazon tout 99.999999999% durability, these claims typically lack substantial explanation or even clear definitions. Further, professional memory organizations vary significantly in the practices they use, and how they use them — even in the number of copies held.

In newly published research with Richard Landau, we use multi-level statistical simulation to model failures resulting from hardware faults, storage conditions, “normal” organizational failure, and correlated multi-organizational failures. By varying the parameters of this multi-level model we can simulate everything from bad disk sectors, to economic recessions, to minor wars.

More formally, this work, presented at IDCC and forthcoming in the International Journal of Digital Curation (ArXiv preprint available), addresses the problem of formulating efficient and reliable operational preservation policies that ensure bit-level information integrity over long periods, in the presence of a diverse range of real-world technical, legal, organizational, and economic threats. We develop a systematic, quantitative prediction framework that combines formal modeling, discrete-event simulation, and hierarchical modeling, and then use empirically calibrated sensitivity analysis to identify effective strategies.

After analyzing hundreds of thousands of simulations, we find that the answer is ten 🙂 Seven copies, distributed across different organizations and audited systematically every year in weekly increments, protect against everything — bad disks, storage conditions, firm failures, recessions, and regional wars — except strong coordinated attacks on the auditing system itself. Using a cryptographically secure distributed auditing system and three additional copies will protect against even a strong attack, as long as at least half the servers remain. So… ten.
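
The intuition behind these numbers can be sketched with a minimal Monte Carlo simulation of replicated storage with periodic audit-and-repair. To be clear, this is not the published model: the failure rate, audit interval, and time horizon below are illustrative placeholders, and real failures are correlated in ways this sketch ignores.

```python
import random

def simulate_loss(n_copies=7, years=100, p_fail=0.02,
                  audit_interval=1, trials=10_000, seed=42):
    """Estimate the probability of losing a document held in
    n_copies replicas. Each copy independently fails with
    probability p_fail per year (an illustrative rate, not an
    empirical one); at every audit, failed copies are repaired
    from any surviving copy. The document is lost only if all
    copies fail within a single audit interval."""
    rng = random.Random(seed)
    lost = 0
    for _ in range(trials):
        copies = n_copies
        for _ in range(0, years, audit_interval):
            for _ in range(audit_interval):
                # each surviving copy may fail this year
                copies -= sum(rng.random() < p_fail
                              for _ in range(copies))
            if copies == 0:
                lost += 1
                break
            copies = n_copies  # audit repairs from a survivor
    return lost / trials
```

Even this toy version shows the qualitative result: with regular auditing, per-interval loss probability is roughly `p_fail ** n_copies`, so adding copies drives risk down extremely fast, while lengthening the audit interval erodes much of that benefit.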

There are also some corollaries for those considering file format transformations…

  • Use compression — provided it’s a well-established, lossless algorithm. What you lose in fragility you gain back in lower costs of replication and auditing.
  • If you need to encrypt your data, keep four copies of the keys and audit these.
  • If format failure is a substantial risk, maintain four readers for each format, and audit your collection with them.

Here’s a complete list of our recommendations, in accessible form.

All of the code for the simulation is available on my program’s GitHub site. For those interested in these questions and related areas, writings on digital preservation are linked from my web site.

Matching Uses and Protections for Government Data Releases: Presentation at the Simons Institute

(The blog had been on hiatus during 2019 as CREOS launched. This post is part of a series of catch-up blog posts summarizing talks presented over the last 10 months.)

In the work included below, and presented at the Simons Institute, we describe work in progress that aims to align emerging methods of data protection with research uses. We use the American Community Survey as an exemplar case for examining the range of ways that government data is used for research. We identify the range of research uses by combining evidence of use from multiple sources, including research articles, national and local media coverage, social media, and research proposals. We then employ human and computer-assisted coding methods to characterize the range of data analysis methodologies that researchers employ. Then, building on previous work that surveys and characterizes computational and technical controls for privacy, we match these methods to available and emerging privacy and data security controls.

Our preliminary analysis suggests that tiered access to government data will be necessary to support current and new research in the social and health sciences. We argue that discovery research (currently) requires access beyond the limits of formal protections — empirically guided exploratory research, theory generation, process tracing, novel syntheses, and the like are incompletely understood and formalized. This is in part because analyses of privacy tradeoffs use ‘worst-case’ analysis for risks, but average-case analysis for benefits.

For those interested in these questions and related areas, writing on modern approaches to privacy principles and protections is linked from my web site.

Well-Being at Scale for Policy Research: A Conversation Convened by the Dubai Future Foundation

(The blog had been on hiatus during 2019 as CREOS launched. This post is part of a series of catch-up blog posts summarizing talks presented over the last 10 months.)

The Dubai Future Foundation invited me to lead a conversation on the significance of the concept of well-being for social science and policy. These slides framed the conversation, and summarize the subsequent discussion of the opportunities and challenges of measuring well-being at scale.

Human well-being has four major dimensions: affective, health, economic, and ethical/political. Policy analysis typically focuses on only one of these dimensions, and measures it coarsely — e.g., ranking countries by GDP while ignoring the distribution of wealth (not to mention health, freedom, and life satisfaction).

Focusing on GDP and other incomplete measures of well-being often leads to the wrong conclusions about what makes good policy. For example, in a recently published paper, The Happiness-Energy Paradox: Energy Use is Unrelated to Subjective Well-Being, we find that although energy use has a limited relationship with GDP, it does not correlate with subjective well-being. This suggests that conservation is compatible with direct national interest in developed countries, and highlights the need for more sophisticated measures of policy outcomes.
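
The shape of that comparison can be illustrated with a toy Pearson-correlation calculation. The country-level figures below are invented for illustration only; they are not data from the paper.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient, dependency-free."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-country figures: energy use per capita (MWh),
# GDP per capita ($1000s), and mean life-satisfaction (0-10 scale).
energy = [2.0, 4.5, 7.1, 9.8, 12.3]
gdp = [11, 28, 45, 61, 80]
satisfaction = [6.1, 7.0, 6.4, 7.2, 6.0]

# In this illustrative data, energy tracks GDP closely
# while being essentially unrelated to life-satisfaction.
```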

One challenge to making policy analysis more sophisticated is measuring well-being at scale. The discussion highlighted a number of existing sources of big data that could be repurposed for broad-use well-being measurement:

  • Facial expression, to measure …
    … short-term subjective states (e.g. happiness)
    … durable emotional states (e.g. anxiety/depression)
  • Quantified-self systems and mobile health apps, for …
    … extensive collection of self-evaluative measures
  • Public social profiles, to …
    … measure emotive statements and communication patterns indicative of well-being
    … detect life events
  • Administrative and financial data, to …
    … detect life events
  • Always-on listening devices (e.g. Alexa), to measure …
    … voice tone, voice emotion, and conversational patterns
  • Location tracking, for …
    … movement-variability proxies for well-being

For those interested in these questions and related areas, writing on modern approaches to privacy principles and protections is linked from my web site.

Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019

(The blog had been on hiatus during 2019 as CREOS launched. This post is part of a series of catch-up blog posts summarizing talks presented over the last 10 months.)

Libraries enable patrons to access a wide range of information, but much of the access to this information is now directly managed by publishers. This has led to a significant gap among library values, patrons’ perceptions of privacy, and effective privacy protection for access to digital resources.

In the work included below, presented at NERCOMP 2019, we review privacy principles based on ALA, IFLA, and NISO policies. We then organize and compare the high-level privacy protections required by the ALA checklist, NISO, and the GDPR. This framework of principles and controls is then used to score the privacy policies and practices of major vendors of research library content. We evaluate each element of each vendor’s privacy policy, and use instrumented browsers to identify the types of tracking mechanisms used by different vendors. We use this set of privacy scores to support analyses of change over time, and of potential gaps between patron expectations and privacy policies and practices.
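
The instrumented-browser step can be approximated in a few lines: export a HAR capture of a vendor page from a browser’s developer tools, then count requests to third-party hosts. This is only a sketch under those assumptions; distinguishing genuine trackers from benign third parties (such as CDNs) requires a blocklist like EasyPrivacy.

```python
import json
from collections import Counter
from urllib.parse import urlparse

def third_party_requests(har_path, first_party):
    """Count requests per third-party host in a HAR capture --
    a rough proxy for the tracking a vendor page triggers."""
    with open(har_path) as f:
        entries = json.load(f)["log"]["entries"]
    hosts = Counter(urlparse(e["request"]["url"]).netloc
                    for e in entries)
    # keep only hosts outside the first-party domain
    return Counter({h: n for h, n in hosts.items()
                    if not h.endswith(first_party)})
```

A call such as `third_party_requests("vendor.har", "examplepublisher.com")` (both names hypothetical) yields per-host request counts that can be compared across vendors or over time.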

Our analysis reveals a number of patterns:

  • Invasive data collection, broad use, and detailed tracking are common
    — this is inconsistent with library privacy values
  • Large commercial publishers have the weakest privacy protections — while discovery portals do better.
  • Open access is not directly associated with better privacy protection —
    publishers track access to open-access materials to the same degree as copyrighted works

We argue that libraries can do better through standardizing licenses, providing patrons with up-front and comprehensible information on privacy, and by incorporating privacy values into the design of new tools for access and discovery.

The data analysis supporting the presentation is available on the program’s GitHub site. We will be posting the full preprint online soon. For those interested in these questions and related areas, writing on modern approaches to privacy principles and protections is linked from my web site.

CHI and the Future of Mobile UX

Katie Montgomery

CHI is an annual international conference that focuses on human factors in computing systems, also known as human-computer interaction (HCI). At first glance this may not sound like an exciting topic until you realize that you are the human factor in computing systems and you are using interfaces and structures created by HCI research all of the time. Asking your phone how to get to the nearest pizza joint? That’s HCI. Celebrating your step count with your Fitbit? HCI again. Typing? Definitely HCI. We rarely spend a day without using the myriad forms of interaction circumscribed by human factors in computing systems. It just doesn’t have a sexy name.

As forces for civic and cultural improvement through learning, libraries have an opportunity, and perhaps a responsibility, to discover and invent novel ways for people to interact with information. If we can leverage our access to knowledge in collaboration with technical giants (Google comes to mind), we may be able to open up new avenues to reach our patrons and improve their lives. That’s the point after all.

This year (2018) was my first time attending CHI, and since I’m coming at it from a library background the entire experience was an eye opener. The schedule alone was 95 pages long (without abstracts), and contained topics ranging from interactivity in autonomous vehicles to bio design and existence. There were dozens of concurrent sessions, and choosing between “Gender-Inclusive Design: Sense of Belonging and Bias in Web Interfaces” and “Evaluating the Disruptiveness of Mobile Interactions: A Mixed-Method Approach” was no simple task. Instead I skipped the anguish of session indecision and took the easier route: attending a few pre-designed 2-4 hour courses over the week, diving deeply into topics and interacting with my fellow conference-goers to brainstorm questions and solutions and learn about each other’s backgrounds.

One course was especially rewarding. “Mobile UX–The Next Ten Years?”, taught by Simon Robinson, Jennifer Pearson, and Matt Jones, encouraged us to try to extend our minds beyond the flat, dark, glassy rectangle that mobile devices seem to be stuck in and explore our other senses within the mobile context. [1] Matt likened our present experience with mobile devices to the story of Narcissus: a beautiful man finds a perfectly still pool of water that mirrors his face and falls in love with his own reflection, eventually wasting away from lack of food and water as he refuses to leave the flawless image he has found.

An edited painting by Caravaggio:

Much in the same way that Narcissus was entranced by an idealized self, we are entranced by our phones, diving into them and rarely coming up for air. Matt posited an idea: what if our phones got us to put down our phones? No, not just some kind of alert saying that you’ve spent too much time on YouTube (although we discussed those ideas too), but actual apps whose intention is to get us to interact with the real world.

Matt told us a story about his daughter. When she was six or so they had purchased a small GPS driving device. On a trip his daughter, holding the device, piped up from the back, asking “Daddy, where are the bears?”. A little baffled, Matt told her he didn’t know. A few minutes later, after peering out the window for a while, she asked again, “Daddy, where are the bears?”. This time he asked why she thought there should be bears and she explained, “It says in half a mile bear right!”. Sure, the interaction is cute, but Matt used it to create a game: every time the GPS told them that there was a “bear” on the right or left, he and his daughter had to find something outside the car: a bird, a stone, a tree, something in the real world.

Interaction and creation define much of what it means to be alive, but mobile devices are often real-world-isolating and consumptive. [2] So the question remains: how do we change that status quo? Mobile devices are ubiquitous, and convincing people to simply use them less is unrealistic. So how can libraries take a leading role in redirecting energy and time towards experience and action?

In a purely digital context we could include local clubs and activity suggestions pertaining to subjects in topic guides. In the more focused area of mobile devices we could encourage and participate in the development of apps that recognize geographic location and ping the user with information relating to local ecology, history, or culture. Something along the lines of “You’re near Thoreau’s cabin, would you like to take a detour to see it?”, or “The woods you’re in may have lady slippers (a rare native orchid), keep an eye out! This is what they look like:

Photo by Debbi Griffin

Even better, if the app could include crowd-sourced data people would be able to create content and expand the digital way-signs redirecting to the real world. The app could include preference settings so that the user would only be given notifications about nearby natural phenomena or historical monuments, depending on their interests. Somebody start making this, I want to use it.

Libraries have a pressing need to take HCI into explicit account. Historically librarians have been gatekeepers to information, but with the advent of the online public access catalog (OPAC) we threw open the doors to knowledge and invited the world to search for it on their own terms. Except we didn’t. The way that resources are organized within a library is a fairly closed system that requires training to navigate, and while we have made great strides in improving our OPACs and websites so that they are more intuitive for our users, there is still work to be done. In order to empower our users to find, evaluate, and use the resources we put at their disposal, we need to examine the way that they interact with our systems and modify those systems to improve usability. It’s not enough for the library catalog to know that a book exists. The patron needs to know too.

If the ideas raised in this post have set your imagination alight and you want to incorporate apps into your library consider looking at Nicole Hennig’s work on the subject. Her books Apps for Librarians: Using the Best Mobile Technology to Educate, Create, and Engage and Mobile Learning Trends: Accessibility, Ecosystems, Content Creation are a good place to start. For a more recent survey of the current technologies as they apply to academic libraries try reading Mobile Technology and Academic Libraries: Innovative Services for Research and Learning by Robin Canuel and Chad Crichton.


1. For more on a creative outlook for the future of mobile devices read “A Brief Rant on the Future of Interactive Design” by Bret Victor.

2. Sherry Turkle’s “Alone Together: Why We Expect More from Technology and Less From Each Other” goes into this phenomenon in depth.

Investigating the Evolving Information Needs of Entrepreneurs: Integrating Pedagogy, Practice & Research

Nicholas Albaugh & Micah Altman

Innovation-driven entrepreneurship is essential and indispensable in the race to solve the world’s major challenges, especially in the areas of health, information technology, agriculture, and energy. MIT is a global leader in this type of entrepreneurship: a 2015 report from the Institute’s Sloan School of Management estimated that active companies founded by MIT alumni produce annual revenues of $1.9 trillion, equivalent to the world’s tenth largest economy. In terms of the curriculum at MIT, over sixty courses in entrepreneurship were taught during the 2016-2017 academic year.

Discovering, accessing, and integrating information is critical to the success of innovation-driven entrepreneurship, and it is part of the Libraries’ core role to improve the foundations for discovery, access, and integration. The presence of a vibrant community of entrepreneurs provides an opportunity to delineate and understand the information skills, needs, and challenges of students and researchers engaged in entrepreneurial ventures. This understanding can inform strategies and methods to address these challenges and aid in the design of innovative methods of library instruction that move beyond small-group lectures.

In this blog post, we report on the background and preliminary results of a project designed to answer these questions. The project had three stages: background research to identify the information-related skills of entrepreneurs, the design of a survey instrument, and a survey of MIT’s delta v accelerator program.

Initial Steps & Background Research

This was a group effort. Nicholas Albaugh (Librarian for Innovation and Entrepreneurship) did most of the heavy lifting — performing the ‘bench’ work identifying what was known about information use in entrepreneurship, interacting with the students and the class, and creating a first draft of communications. Micah Altman (Director of Research) provided overall scientific guidance, co-led the conceptualization, developed the research design and methodology, performed the quantitative analysis, and provided critical review. Shikha Sharma, Business and Management Librarian, and Karrie Peterson, Head of Liaison, Instruction, and Reference Services, contributed to the conceptualization of the project and provided critical review.

During the first few months of the project, the four of us met roughly once a month to develop a prospectus outlining the research questions, methods, desired outcomes, and key outputs.

After this prospectus was completed, we wanted to build on previous work by identifying existing frameworks outlining the information skills necessary for entrepreneurial success and entrepreneurial competencies more broadly.

To identify these frameworks, we conducted background research in the business and library literature using three databases: Business Source Complete, ABI/INFORM Complete, and Library, Information Science and Technology Abstracts.

The primary article for identifying key information-related skills for entrepreneurs was “21st Century Knowledge, Skills, and Abilities and Entrepreneurial Competencies: A Model for Undergraduate Entrepreneurship Education” by Trish Boyles. It delineates three broad categories of entrepreneurial competencies: cognitive, social, and action-oriented. The key information-related skills fall in the cognitive category, in particular:

  • A habit of actively searching for information
  • The ability to conduct searches systematically
  • The ability to recognize opportunities when not actively looking for them by recognizing connections between seemingly unconnected things

In addition to a general framework regarding the information-related skills of entrepreneurs, we wanted a more general framework for entrepreneurial competencies. The premier text for this is Bill Aulet’s Disciplined Entrepreneurship: 24 Steps to a Successful Startup. It is the textbook for the delta v program, and its author is the Managing Director of the Martin Trust Center for MIT Entrepreneurship, one of the key parts of MIT’s entrepreneurial ecosystem. Outside MIT, it has been translated into eighteen languages and serves as the text for three web-based edX courses taken by hundreds of thousands of people in countries all over the world.

MIT delta v

We decided to survey MIT’s delta v accelerator program, as it is widely considered the capstone entrepreneurial experience for students here on campus. Participants in the program work full time over the course of the summer on the following goals:

  • Defining and refining their target market
  • Conducting primary market research about their customers and users
  • Running experiments to validate or invalidate hypotheses regarding potential customers
  • Building and nurturing their founding team


The goal of the survey was to identify which stage of the information gathering phase of the delta v program was most time consuming and which part of that process was the most challenging. We were also interested in learning what resources and tools they used during these stages and processes and what tools they would have preferred to use. We also sought to identify specific information needs of those participating in the delta v programs in order to inform solutions going forward.

Our survey consisted of six multiple-choice questions and five open-ended questions. The multiple-choice questions addressed the following points:

  • Time spent on market analysis vs. business model development and the most challenging part of each process
  • The relative challenge of identifying and evaluating sources and extracting and analyzing information
  • Resources, tools, and methods used to locate, extract, and collect information

The open-ended questions addressed:

  • The most useful tools they used when seeking, collecting, and analyzing information and why
  • What existing tools would have been useful to them
  • The biggest surprises they encountered during this process



We launched a pilot version of this survey at the conclusion of the program in September 2017, in which six students participated.

Some suggestive patterns emerged: All of the entrepreneurs surveyed reported that market analysis was the most time-consuming phase involving seeking, collecting and analyzing information; and all of them used a library resource in their search for information. Further, nearly all of the entrepreneurs found evaluating sources of information, and summarizing, analyzing and mining those sources challenging or very challenging — and almost all relied on manual copying and pasting to extract or collect information they discovered.


We plan to survey a larger group of MIT delta v students during the upcoming summer 2018 cohort of the program. This larger data set will allow us to draw more generalizable conclusions regarding the information-related skills necessary for entrepreneurial success.

We hope these preliminary results will prompt other universities to investigate the specific information needs of entrepreneurs, particularly students in non-traditional settings like accelerators, incubators, and competitions as opposed to the classroom. Once these particular information needs are better understood, librarians can better address them through targeted workshops and instruction.

Guest Post: Graduate Research Intern, Katherine Montgomery, on the inaugural CHI Science Jam

Katie Montgomery is a Graduate Research Intern in the Program on Information Science, researching the areas of usability and accessibility.


by Katherine Montgomery

Research libraries are catalysts for interaction with and creation of knowledge. As information and interactions with it become increasingly digital, librarians are increasingly concerned with the way that computers and humans interact. [1]

The ACM’s Special Interest Group on Computer-Human Interaction (SIGCHI) is a community of professionals devoted to studying these interactions. Their annual conference, CHI, is a place where people share the state of the art and learn the state of the practice. CHI itself isn’t a standard library conference, but it addresses many of the concerns of librarians in a broader context. For example, focal points include digital privacy (which libraries work to protect), improving UX in virtual and physical realms, gamifying learning interactions, and addressing the pitfalls of automation. The conference is also packed with people the library serves, i.e., academics.

A ‘jam’ or a ‘hackathon’ is distinguished by teams of relative strangers coming together to tackle specific problems in a focused and creative way within a limited time frame. The event fosters personal connections, concrete learning, and pride in the product, and has the potential to generate real-life changes. Libraries aim to nurture precisely these elements and would do well to look to hackathons and jams and adapt their structure to empower patrons. Here at the MIT Libraries, we aim to create and inspire hacks in the great MIT tradition of using ingenuity and teamwork to create something remarkable.

Attending the Science Jam is a great way to start CHI, especially if you’re coming from a library background. The Science Jam enables you to interact with your prototypical patrons on problems that interest both of you and in a fashion that familiarizes you with patron needs. The Science Jam itself is a way to hack the conference. [2]

This is the first year they’ve run the program, and if you’ve never heard of a Science Jam before, here’s the lowdown: it’s essentially a hackathon for scientists. You form teams, come up with a problem, pose a question, create a hypothesis, design a test, run the test, analyze your results, and present your study, all in 36 hours. About 60 people attended this year’s jam. We formed ten teams, broke into two rooms (so we could use each other as test subjects the next day without contaminating our sample with knowledge of the study), and began the stimulating and occasionally frantic process.

My team tackled privacy. Our initial problem? People share other people’s data without thinking about it or even realizing it. Our question was, how could we change this behavior? To create something testable we quickly honed the question to a much more specific issue and hypothesis. When people attend large conferences, festivals, concerts, or other public events they often take pictures that focus on a screen, or a float, or a stage, but include strangers in the foreground or to the sides. They then upload those pictures to their social media accounts where, even if they aren’t tagged, those strangers are vulnerable to facial analysis software and the eyes of the public. We hypothesized that if given cues that they are sharing the faces of strangers, people might change their behavior by altering the photo to obscure those faces.

Our initial hope was to create a digital interface, but time and tech constraints limited us to a paper prototype. We took photographs which contained bystanders but were focused on a different element, in this case a sign or a presenter with slides. We gave our participants the choice of selecting one of these photos to hypothetically upload to their social media account (we asked the participants to imagine that these were pictures they had taken). After selecting the photo they were presented with an upload interface with the option to go back and select another photo, crop the image, or upload the photo. These interfaces were given to three different groups with three different caveats. The first group was given no textual cues as to the presence of potential bystanders in the photo (our control). The second group was given textual cues that there were potential bystanders in the picture, i.e., “this photo may contain two people, inside, standing up”. The third group was given visual cues that there were potential bystanders, i.e., blown-up images of the faces beneath the main image.

These images were used with the express permission of the people they depict

For the most part, people uploaded the pictures anyway, not bothering to crop out the bystanders and not expressing concern for privacy in the follow-up questionnaire. The cues didn’t make a significant difference in behavior, but we were surprised that such a technologically enlightened group didn’t take more measures to protect people’s privacy. Of course, our test group contained only 15 people (five per scenario), our prototype was on paper, and there were a number of other potential issues with our methodology, but the question and premise remain sound. How can we help people be aware of the fact that they may be violating other people’s privacy when uploading photographs to social media? And how do we help them alter that behavior?
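
With counts this small the study is badly underpowered, which a quick chi-square calculation makes concrete. The tallies below are hypothetical, in the shape of our design (five participants per condition; the real counts aren’t reproduced here). With two degrees of freedom the statistic would need to exceed 5.99 to reach significance at the 0.05 level; with n=15 and expected cell counts below 5, Fisher’s exact test would be more appropriate anyway.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table
    (rows = conditions, columns = outcomes)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return sum((obs - row_totals[i] * col_totals[j] / total) ** 2
               / (row_totals[i] * col_totals[j] / total)
               for i, row in enumerate(table)
               for j, obs in enumerate(row))

# Hypothetical outcome counts: [uploaded unchanged, altered photo]
table = [[5, 0],  # control: no cue
         [4, 1],  # textual cue
         [4, 1]]  # visual cue
```

For this table the statistic comes out well under the critical value, so even a visibly uneven split across conditions would not be detectable at this sample size.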

The next day I attended a presentation given by Roberto Hoyle about his work testing the efficacy of various photo alterations in protecting privacy. Afterwards, we got to talking and posited an idea. What if Facebook added a feature to their image upload interface that asked a simple question: “Do you want to protect the privacy of the people you don’t know in this picture?”. If the person said yes then Facebook could auto-blur the faces it didn’t recognize as friends. The blur feature could be removed or modified, but it would bring the issue to the attention of the user and make it easy (and hopefully aesthetically pleasing, or at least acceptable), to obscure the faces of strangers.

While we agreed it was probably a moon shot, I decided to go down to the exhibition hall and talk with the Facebook folks at their booth. I was met with a combination of skepticism and interest. Since then I’ve been in touch with a couple of people at Facebook advocating for the idea. If your Facebook interface changes you’ll know it’s been a success. If not? Then the benefits are exclusively mine.

Because of the Science Jam I had the opportunity to meet and work with people I would otherwise never have known, pursue meaningful ideas, improve my teamwork, practice scientific testing and analysis under a tight deadline, exercise my presentation skills, and make friends ahead of the conference itself. Libraries could benefit from implementing a similar model ahead of extended programming. Doing a week of events on graphic novels? Include a Cartoon Jam where people can come in, team up, generate ideas, produce some sketches and storylines, and share them with each other! Running a summer of gardening programs? Engage a couple of professionals in your area and encourage patrons to bring in photographs of their trouble gardens (lots of shade, rocky, hot, snow spill), form groups, hit the books, and pick each other’s minds for solutions. Trying to get the library more involved with the school letterpress? Collaborate with the experts there and run a Book Jam [3], challenging your students to connect e-readers and the early practice of printing. There are any number of ways that libraries can take advantage of the Jam/hackathon model to engage their patrons and further the goal of becoming hubs for creation, not just consumption.

Excited? Inspired? Ready to work up a plan for your own hackathon or Jam? Take a look at the resources below and get cooking.


  1. Current research in the Program on Information Science focuses on how measures of attention and emotion could be integrated into these interactions.  
  2. CHI will be in beautiful Scotland next year. Attend the Science Jam. You won’t regret it.  Oh, and if you want to check out some of the documentation from this year’s Science Jam take a look at #ScienceJam #CHI2018 on Twitter.
  3. The very cool name ‘Codex Hackathon’ is already taken.