Month: July 2013

Introducing the National Agenda for Digital Stewardship

Digital stewardship is vital for the authenticity of public records, the reliability of scientific evidence, and the enduring accessibility of our cultural heritage. Knowledge of ongoing research, practice, and organizational collaborations has been distributed widely across disciplines, sectors, and communities of practice. A few days ago I was honored to officially announce the NDSA’s National Agenda for Digital Stewardship at Digital Preservation 2013. The Agenda identifies the highest-impact opportunities to advance the state of the art, the state of practice, and the state of collaboration within the next 3-5 years.

The 2014 Agenda integrates the perspective of dozens of experts and hundreds of institutions, convened through the Library of Congress. It outlines the challenges and opportunities related to digital preservation activities in four broad areas: Organizational Roles, Policies, and Practices; Digital Content Areas; Infrastructure Development; and Research Priorities.

Slides and video of the short (5-min) talk below:

Read the full report here:

Characterizing Data and Software for Social Science Research

This presentation, invited for a workshop on data preservation for open science held at JCDL 2013, gives a brief tour of a large topic: how we understand the types of data and software used in social science research. In this presentation I characterize the intellectual landscape across 9 dimensions of data structure, content, measure, and use. I then use this framework to characterize three interesting use cases.

These use cases illustrate some particular challenges for long-term access to and replication of social science research, including the use of “messy” human sensors; the wide mix of data types, structures, and sparsity; complex legal constraints; pervasive use of manual and computer-aided coding; use of niche commercial software and bespoke software; and very long-term access requirements.

July IAP Version of Managing Confidential Data Tutorial

MIT has a wonderful tradition of offering a variety of short courses during the winter hiatus, known as IAP. The MIT Libraries generally offer dozens of courses on data and information management (among other topics) during this time. Moreover, the Libraries also hold bonus IAP sessions in April and July.

So, for this year’s JulyAP, I updated a tutorial on managing confidential data, continuing to integrate the things we’ve learned in our privacy research project. This short course focuses on the practical side of managing confidential research data, but gives some pointers to the research areas.

The slides are below:

Do more Majority-Democratic Districts imply fewer Majority-Black districts?

My collaborator Michael McDonald and I are now analyzing the data that resulted from the crowd-sourced participatory electoral mapping projects we were involved in, as well as from other such projects. Our earlier article, analyzing redistricting in Virginia, established that members of the public are capable of creating legal redistricting plans, and in many ways perform better than the legislature in doing so.

Our latest analysis, based on data from Congressional redistricting in Florida, reinforces this finding. Furthermore, it highlights some of the limits of reform efforts, and the structural tradeoffs among redistricting criteria.

The reforms to process in Florida, catalyzed by advances in information technology, enabled a dramatic increase in public participation in the redistricting process. This reform process in Florida can be considered a partial success: the adopted plan implements one of the most efficient observable trade-offs among the reformers’ criteria, primarily along the lines of racial representation, by creating an additional Black-majority district in the form of the current 5th Congressional District. This does not mean, however, that reform was entirely successful. The adopted plan is efficient, but is atypical of the plans submitted by the legislature and public. Based on the pattern of public submissions, and on contextual information, we suspect the adopted plan was drawn for partisan motivations. The public preference and good-government criteria might be better served by the selection of one of the other efficient plans, which were much more competitive and less biased, at the cost of a reduction in majority-minority seats.

Most of these trends can be made clear through information visualization methods. The figure below draws a line representing the Pareto (efficient) frontier in two dimensions to illustrate the major trade-offs between the number of partisan seats and other criteria.

One of the visuals is below. It shows the trade-offs between partisan advantage and different representational criteria, based on the Pareto frontier of the publicly available plans.

These frontiers suggest that some criteria are more constraining on Democratic redistricters than on Republican redistricters. On the one hand, equipopulation is equally constraining on Democratic and Republican partisan seat creation. On the other hand, the data suggest a structural trade-off between Black-majority seats and Democratic seats.
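
For readers unfamiliar with the idea, a two-dimensional Pareto frontier of the kind discussed here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method or data used in the article: the plan labels and scores are made up, the function name `pareto_frontier` is hypothetical, and both criteria are assumed to be "higher is better." A plan is on the frontier if no other plan is at least as good on both criteria and strictly better on one.

```python
from typing import List, Tuple

# (label, score on criterion A, score on criterion B); higher is better on both.
Plan = Tuple[str, float, float]


def pareto_frontier(plans: List[Plan]) -> List[Plan]:
    """Return the plans that no other plan dominates on both criteria."""
    frontier = []
    for name, a, b in plans:
        # A plan is dominated if some other plan is at least as good on both
        # criteria and strictly better on at least one of them.
        dominated = any(
            (a2 >= a and b2 >= b) and (a2 > a or b2 > b)
            for _, a2, b2 in plans
        )
        if not dominated:
            frontier.append((name, a, b))
    # Sort by criterion A so the frontier can be drawn as a line.
    return sorted(frontier, key=lambda p: p[1])


# Hypothetical values for illustration only; not data from the article.
plans = [
    ("plan_a", 8, 3),
    ("plan_b", 10, 2),
    ("plan_c", 9, 2),
    ("plan_d", 12, 1),
    ("plan_e", 7, 3),
]

frontier = pareto_frontier(plans)
for name, a, b in frontier:
    print(name, a, b)
```

In this toy example, moving along the frontier from one plan to the next gains on criterion A only by giving up something on criterion B, which is the kind of structural trade-off the frontiers are meant to expose.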

More details on the data, evaluations, etc. appear in the article (uncorrected manuscript).

Approaches to Preservation Storage Technologies

The structure and design of digital storage systems is a cornerstone of digital preservation. To better understand ongoing storage practices of organizations committed to digital preservation, the National Digital Stewardship Alliance conducted a survey of member organizations. This talk, presented as part of the Program on Information Science seminar series, discusses findings from this survey, common gaps, and trends in this area.

(I also have a little fun highlighting the hidden assumptions underlying Amazon Glacier’s reliability claims. For more on that see this earlier post.)