Archive for the ‘Scholarly Communication’ Category

Guest Post: Towards Strategies for Making Legacy Software Curation-Ready

September 27, 2017

Alex Chassanoff is a CLIR/DLF Postdoctoral Fellow in the Program on Information Science and continues a series of posts on software curation.

In this blog post, I am going to reflect upon potential strategies that institutions can adopt for making legacy software curation-ready.  The notion of “curation-ready” was first articulated as part of the “Curation Ready Working Group”, which formed in 2016 as part of the newly emerging Software Preservation Network (SPN).  The goal of the group was to “articulate a set of characteristics of curation-ready software, as well as activities and responsibilities of various stakeholders in addressing those characteristics, across a variety of different scenarios”[1].  Drawing on inventory at our own institutions, the working group explored different strategies and criteria that would make software “curation-ready” for representative use cases.  In my use case, I looked specifically at the GRAPPLE software program and wrote about particular use and users for the materials.

This work complements the ongoing research I’ve been doing as a Software Curation Fellow at MIT Libraries [2] to envision curation strategies for software.  Over the past six months, I have conducted an informal assessment of representative types of software in an effort to identify baseline characteristics of materials, including functions and uses.

Below, I briefly characterize the state of legacy software at MIT.

  • Legacy software often exists among hybrid collections of materials, and can be spread across different domains.
  • Different components (e.g., software dependencies, hardware) may or may not be co-located.
  • Legacy software may or may not be accessible on original media. Materials are stored in various locations, ranging from climate-controlled storage to departmental closets.
  • Legacy software may exist in multiple states with multiple contributors over multiple years.
  • Different entities (e.g., MIT Museum, Computer Science and Artificial Intelligence Laboratory, Institute Archives & Special Collections) may have administrative purview over legacy software with no centralized inventory available.
  • Collected materials may contain multiple versions of source code housed in different formats (e.g., paper print outs, on multiple diskettes) and may or may not consist of user manuals, requirements definitions, data dictionaries, etc.
  • Legacy software has a wide range of possible scholarly uses and users. These may include the following: research on institutional histories (e.g., government-funded academic computing research programs), biographies (e.g., notable developers and/or contributors of software), socio-technical inquiries (e.g., extinct programming languages, implementation of novel algorithms), and educational endeavors (e.g., reconstruction of software).

We define curation-ready legacy software as having the following characteristics: being discoverable, usable/reusable, interpretable, citable, and accessible.  Our approach views curation as an active, nonlinear, iterative process undertaken throughout the life (and lives) of a software artifact.

Steps to increase curation-readiness for legacy software

Below, I briefly describe some of the strategies we are exploring as potential steps in making legacy software curation-ready.  Each of these strategies should be treated as suggestive rather than prescriptive at this stage in our exploration.

Identify appraisal criteria. Establishing appraisal criteria is an important first step that can be used to guide decisions about selection of relevant materials for long-term access and retention. As David Bearman writes, “Framing a software collecting policy begins with the definition of a schema which adequately depicts the universe of software in which the collection is to be a subset.”[3]  It is important to note that for legacy software, determining appraisal criteria will necessarily involve making decisions about both the level of access and preservation desired.  Decision-making should be guided by an institutional understanding of what constitutes a fully-formed collection object. In other words, what components of software should be made accessible? What will be preserved? Does the software need to be executable? What levels of risk assessment should be conducted throughout the lifecycle?  Making these decisions institutionally will in turn help guide the identification of appropriate preservation strategies (e.g., emulation, migration) based on desired outcomes.

Identify, assemble, and document relevant materials. A significant challenge with legacy software lies in the assembling of relevant materials to provide necessary context for meaningful access and use.  Locating and inventorying related materials (e.g., memos, technical requirements, user manuals) is an initial starting point. In some cases, meaningful materials may be spread across the web at different locations.  While it remains a controversial method in archival practice, documentation strategy may provide useful framing guidance on principles of documentation [4].

Identify stakeholders. Identifying the various stakeholders, either inside or outside of the institution, can help ensure proper transfer and long-term care of materials, along with managing potential rights issues where applicable.  Here we draw on Carlson’s work developing the Data Curation Profiles Toolkit and define stakeholders as any groups, organizations, individuals, or others that have an investment in the software and that you would feel the need to consult regarding access, care, use, and reuse of the software [6].

Describe and catalog materials. Curation-readiness can be increased by thoroughly describing and cataloging select materials, with an emphasis on preserving relationships among entities. In some cases, this may consist of describing aspects of the computing environment and relationships between hardware, software, dependencies, and/or versions. Although the software itself may not be accessible, describing related materials (e.g., printouts of source code, technical requirements documentation) adequately can provide important points of access. It may be useful to consider the different conceptual models of software that have been developed in the digital preservation literature and decide which perspective aligns best with your institutional needs [5].
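As one illustration of what preserving relationships among entities might look like in practice, the sketch below records a few described items and the typed links between them. The record types, field names, relationship labels, identifiers, and locations are all hypothetical and chosen only for illustration; they are not drawn from any particular conceptual model or cataloging standard.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogItem:
    """One described item: a printout, a manual, a tape, a piece of hardware."""
    identifier: str   # hypothetical local identifier
    title: str
    item_type: str    # e.g. "source code printout", "user manual"
    location: str     # where the item is physically or digitally held


@dataclass
class Catalog:
    """Described items plus typed relationships between them."""
    items: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)  # (subject, label, object)

    def add(self, item: CatalogItem) -> None:
        self.items[item.identifier] = item

    def relate(self, subject: str, label: str, obj: str) -> None:
        # Refuse links to items that were never described.
        for identifier in (subject, obj):
            if identifier not in self.items:
                raise KeyError(f"unknown item: {identifier}")
        self.relations.append((subject, label, obj))

    def related_to(self, identifier: str) -> list:
        """Return (label, other item) pairs for every link touching the item."""
        found = []
        for subject, label, obj in self.relations:
            if subject == identifier:
                found.append((label, self.items[obj]))
            elif obj == identifier:
                found.append((f"inverse of '{label}'", self.items[subject]))
        return found


# Made-up entries: the software itself is not accessible, but the
# related paper materials still act as points of access to it.
catalog = Catalog()
catalog.add(CatalogItem("sw-001", "Example program", "software", "not yet accessible"))
catalog.add(CatalogItem("doc-001", "Source code printout", "source code printout", "box A"))
catalog.add(CatalogItem("doc-002", "User manual", "user manual", "box B"))
catalog.relate("doc-001", "is a printout of", "sw-001")
catalog.relate("doc-002", "documents", "sw-001")

for label, item in catalog.related_to("sw-001"):
    print(f"{label}: {item.title} ({item.location})")
```

Keeping relationships as separate labeled triples, rather than as fields on each item, is one possible design choice: it lets new kinds of links be added later without changing how items are described.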

Digitize and OCR paper materials. Paper printouts of source code and related documentation can be digitized according to established best practice workflows[7].  The use of optical character recognition (OCR) programs produces machine-readable output, enabling easy indexing of content to enhance discoverability and/or textual transcriptions.  The latter option can make historical source code more portable for use in simulations or reconstructions of software.
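To make the indexing step concrete, here is a minimal sketch of how OCR output might be turned into a searchable index. It assumes the OCR text is already available as plain strings keyed by page; the page identifiers and sample text are invented for illustration and are not transcribed from any collection. Real OCR output from code printouts is likely to contain recognition errors that a production workflow would need to correct.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each token to the sorted list of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in tokenize(text):
            index[token].add(page_id)
    return {token: sorted(ids) for token, ids in index.items()}


def search(index, query):
    """Return the page ids that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set(index.get(tokens[0], []))
    for token in tokens[1:]:
        hits &= set(index.get(token, []))
    return sorted(hits)


# Hypothetical OCR output, keyed by made-up page identifiers.
ocr_pages = {
    "printout-p1": "DEFINE ERASER (ICON) ; erase the icon from the display",
    "printout-p2": "DEFINE DRAW (ICON X Y) ; draw the icon at X Y",
    "manual-p1": "The display shows each icon in the program.",
}

index = build_index(ocr_pages)
print(search(index, "icon display"))  # pages mentioning both words
```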

Migrate media. Legacy software often resides on unstable media such as floppy disks or magnetic tape. In cases where access to the software itself is desirable, migrating and/or extracting media contents (where possible) to a more stable medium is recommended [8].
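One small part of such a migration is confirming that a copied file matches its source bit for bit. The sketch below assumes that an image file has already been extracted from the original medium, which is the hard part and is not shown; the file names are hypothetical, and the checksum comparison is offered as one common approach rather than a step prescribed here.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, destination):
    """Copy a file and confirm the copy is bit-for-bit identical."""
    before = sha256_of(source)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise IOError(f"checksum mismatch copying {source}")
    return after


# Demonstration with a stand-in file in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    image = Path(workdir) / "diskette.img"       # hypothetical file name
    image.write_bytes(b"\x00" * 1024 + b"example contents")
    stored = Path(workdir) / "stored-copy.img"   # hypothetical file name
    checksum = copy_and_verify(image, stored)
    print(checksum)
```

Recording the returned checksum alongside the catalog description would allow the same comparison to be repeated later.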


As an active practice, software curation means anticipating future use and uses of resources from the past. Recalling an earlier blog post, our research aims to produce software curation strategies that embrace Reagan Moore’s theoretical view of digital preservation, whereby “information generated in the past is sent into the future”[9]. As the born-digital record increases in scope and volume, libraries will necessarily have to address significant changes in the ways in which we use and make use of new kinds of resources.  Technological quandaries of storage and access will likely prove less burdensome than the social, cultural, and organizational challenges of adapting to new forms of knowledge-making. Legacy software represents this problem space for libraries/archives today.  Devising curation strategies for software helps us to learn more about how knowledge-embedded practices are changing and gives us new opportunities for building healthy infrastructures [10].


[1] Rios, F., Almas, B., Contaxis, N., Jabloner, P., Kelly, H. (2017). Exploring curation-ready software: use cases. doi:10.17605/OSF.IO/8RZ9E

[2] These are some of the open research questions being addressed by the initial cohort of CLIR/DLF Software Curation Fellows in different institutions across the country.

[3] Bearman, D. (1985). Collecting software: a new challenge for archives & museums. Archives & Museum Informatics, Pittsburgh, PA.

[4] Documentation strategy approaches archival practice as a collaborative work among record creators, archivists, and users.  It often traverses institutions and represents an alternative approach by prompting extensive documentation organized around an “ongoing issue or activity or geographic area.” See:  Samuels, H. (1991). “Improving our disposition: Documentation strategy,” Archivaria 33,

[5] The results of two applied research projects provide examples from the digital preservation literature.  In 2002, the Agency to Research Project at the National Archives of Australia developed a conceptual model based on software performance as a measure of the effectiveness of digital preservation strategies. See: Heslop, H., Davis, S., Wilson, A. (2002). “An approach to the preservation of digital records,” National Archives of Australia, 2002. In their 2008 JISC report, the authors proposed a composite view of software with the following four entities: package, version, variant, and download. See: Matthews, B., McIlwrath, B., Giaretta, D., Conway, E. (2008). “The significant properties of software: A study,”

[6] Carlson, J. (2010). “The Data Curation Profiles toolkit: Interviewer’s manual,”

[7]  Technical guidelines for digitizing archival materials for electronic access: Creation of production master files–raster images. (2005). Washington, D.C.: Digital Library Federation,

[8] For a good overview of storage recommendations for magnetic tape, see: To read more about the process of reformatting analog media, see: Pennington, S., and Rehberger, D. (2012). The preservation of analog video through digitization. In D. Boyd, S. Cohen, B. Rakerd, & D. Rehberger (Eds.), Oral history in the digital age. Institute of Museum and Library Services. Retrieved from

[9] Moore, R. (2008). “Towards a theory of digital preservation”, International Journal of Digital Curation 3(1).

[10] Thinking about software as infrastructure provides a useful framing for envisioning strategies for curation.  Infrastructure perspectives advocate “adopting a long term rather than immediate timeframe and thinking about infrastructure not only in terms of human versus technological components but in terms of a set of interrelated social, organizational, and technical components or systems (whether the data will be shared, systems interoperable, standards proprietary, or maintenance and redesign factored in).”  See:  Bowker, G.C., Baker, K., Millerand, F. & Ribes, D. (2010). “Toward information infrastructure studies: Ways of knowing in a networked environment.” In J. Hunsinger, L. Klastrup, & M. Allen (Eds.), International handbook of Internet research. Dordrecht: Springer, 97-117.


Guest Post: Software as a Collection Object

July 18, 2017

Alex Chassanoff is a CLIR/DLF Postdoctoral Fellow in the Program on Information Science and continues a series of posts on software curation.

As I described in my first post, an initial challenge at MIT Libraries was to align our research questions with the long-term collecting goals of the institution. As it happens, MIT Libraries had spent the last year working on a task force report to begin to formulate answers to just these sorts of questions. In short, the task force envisions MIT Libraries as a global platform for scholarly knowledge discovery, acquisition, and use. Such goals may at first appear lofty. However, the acquisition of knowledge through public access to resources has been a central organizing principle of libraries since their inception. In his opening statement at the first national conference of librarians in 1853, Charles Coffin Jewett proclaimed, “We meet to provide for the diffusion of a knowledge of good books and for enlarging the means of public access to them.” [1]

Archivists and professionals working in special collections have long been focused on providing access to, and preservation of, local resources at their institutions. What is perhaps most distinctive about the past decade is the broadened institutional focus on locally-created content. This shift in perspective towards looking inwards is a trend noted by Lorcan Dempsey, who describes it as follows:

In the inside-out model, by contrast, the university, and the library, supports resources which may be unique to an institution, and the audience is both local and external. The institution’s unique intellectual products include archives and special collections, or newly generated research and learning materials (e-prints, research data, courseware, digital scholarly resources, etc.), or such things as expertise or researcher profiles. Often, the goal is to share these materials with potential users outside the institution. [2]

Arguably, this shift in emphasis can be attributed to the affordances of the contemporary networked research environment, which has broadened access to both resources and tools. Archival collections previously considered “hidden” have been made more accessible for historical research through digitization. Scholars are also able to ask new kinds of historical questions using aggregate data, and answer historical questions in new kinds of ways.

This raises the question: what unique and/or interesting content does an institution with a rich history of technology and innovation already have in its possession?

Exploring Software in MIT Collections

MIT has of course played a foundational role in the development and history of computing. Since the 1940s, the Institute has excelled in the creation and production of software and software-based artifacts. Project Whirlwind, Sketchpad, and Project MAC are just a few of the monumental research computing projects conducted here. As such, the Institute Archives & Special Collections has over time acquired a significant number of materials related to software developed at MIT.

In our quest to understand how software may be used (and made useful) as an institutional asset, we engaged in a two-pronged approach. First, we aimed to identify the types of software that MIT may consider providing access to. What are the different functions and purposes that software at MIT is created, used, and reused for? Second, we aimed to understand more about the active practices of researchers creating, using, and/or reusing software. We anticipated that this combined approach might help us develop a robust understanding of existing practices and potential user needs. At the same time, we recognized that identifying and exposing pain points could guide and inform future curation strategies. After an initial period of exploratory work, we identified representative software cases found in various pockets across the MIT campus.

Collection #1: The J.C.R. Licklider Papers and the GRAPPLE software

Materials in the collection were first acquired by the Institute Archives and Special Collections in 1996. Licklider was a psychologist and renowned computer scientist who came to MIT in 1950. He is widely hailed as an influential figure for his visionary ideas around personal computing and human-computer interaction.

In my exploration of archival materials, I looked specifically at boxes 13-18 in the collection, which contained documentation about GRAPPLE, a dynamic graphical programming system developed while Licklider was at the MIT Laboratory for Computer Science. According to the user manual, the focus of GRAPPLE was on “the development of a graphical form of a language that already exists as a symbolic programming language.” [3] Programs could be written using computer-generated icons and then monitored by an interpreter.


Figure 1. Folder view, box 16, J.C.R. Licklider Papers, 1938-1995 (MC 499), Institute Archives and Special Collections, MIT Libraries, Cambridge, Massachusetts.

Materials in the collection related to GRAPPLE include:

  • Printouts of GRAPPLE source code
  • GRAPPLE program description
  • GRAPPLE interim user manual
  • GRAPPLE user manual
  • GRAPPLE final technical report
  • Undated and unidentified computer tapes
  • Assorted correspondence between Licklider and the Department of Defense

Each of the documents has multiple versions included in the collection, typically distinguished by date and filename (where visible). The printouts of GRAPPLE source code totaled around forty pages. The computer tapes have not yet been formatted for access.

While the software may be cumbersome to access on existing media, the materials in the collection contain substantial amounts of useful information about the function and nature of software in the early 1980s. Considering the documentation related to GRAPPLE in different social contexts helped to illuminate the value of the collection in relationship to the history of early personal computing.

Historians of programming languages would likely be interested in studying the evolution of the coding syntax contained in the collection. The GRAPPLE team used the now-defunct programming language MDL (which stands for “More Datatypes than Lisp”); the extensive documentation provides examples of MDL “in action” through printouts of code packages.


Figure 2. Computer file printout, “eraser.mud.1”, 31 May 1983, box 14, J.C.R. Licklider Papers, 1938-1995 (MC 499), Institute Archives and Special Collections, MIT Libraries, Cambridge, Massachusetts.

The challenges facing the GRAPPLE team at the time of coding and development would be interesting to revisit today. One obstacle to successful implementation that the team notes was the existing limitations of graphical display environments. In their final technical report on the project from 1984, the GRAPPLE team notes the potential of desktop icons for identifying objects and their representational qualities.

Our conclusion is that icons have very significant potential advantages over symbols but that a large investment in learning is required of each person who would try to exploit the advantages fully. As a practical matter, symbols that people already know are going to win out in the short term over icons that people have to learn in applications that require more than a few hundred identifiers. Eventually, new generations of users will come along and learn iconic languages instead of or in addition to symbolic languages, and the intrinsic advantages of icons as identifiers (including even dynamic or kinematic icons) will be exploited. [4]

Despite technological advancement, some fundamental dynamics in human-computer interaction remain relatively unchanged; namely, the powerful relationship between representational symbols and the production of knowledge/knowledge structures. What might it look like to bring to life today software that was conceived in the early days of personal computing? Such aspirations are certainly possible. Consider the journey of the Apollo 11 source code, which was transcribed from digitized code printouts and then put onto GitHub. One can even simulate the Apollo missions using a virtual Apollo Guidance Computer (AGC).

Other collection materials also offer interesting documentation of early conceptions of personal computing while also providing clear evidence that computer scientists such as Licklider regarded abstraction as an essential part of successful computer design. A pamphlet entitled “User Friendliness–And All That” notes the “problem” of mediating between “immediate end users” and “professional computer people” to successfully aid in a “reductionist understanding of computers.”

Figure 3. Pamphlet, “User Friendliness–And All That”, undated, box 16, J.C.R. Licklider Papers, 1938-1995 (MC 499), Institute Archives and Special Collections, MIT Libraries, Cambridge, Massachusetts.

These descriptions are useful for illuminating how software was conceived and designed to be a functional abstraction. Such revelations may be particularly relevant in the current climate, where debates over algorithmic decision making are rampant. As the new media scholar Wendy Chun asks, “What is software if not the very effort of making something intangible visible, while at the same time rendering the visible (such as the machine) invisible?” [5]


Building capacity for collecting software as an institutional asset is difficult work. Expanding collecting strategies presents conceptual, social, and technical challenges that crystallize once scenarios for access and use are envisioned. For example, when is software considered an artifact ready to be “archived and made preservable”? What about research software developed and continually modified over the years in the course of ongoing departmental work? What about printouts of source code: are they software? How do code repositories like GitHub fit into the picture? Should software only be considered as such in its active state of execution? Interesting ontological questions surface when we consider the boundaries of software as a collection object.

Archivists and research libraries are poised to meet the challenges of collecting software. By exploring what makes software useful and meaningful in different contexts, we can more fully envision potential future access and use scenarios. Effectively characterizing software in its dual role as both artifact and active producer of artifacts remains an essential piece of understanding its complex value.



[1] “Opening Address of the President.” Norton’s Literary Register And Book Buyers Almanac, Volume 2. New York: Charles B. Norton, 1854.

[2] Dempsey, Lorcan. “Library Collections in the Life of the User: Two Directions.” LIBER Quarterly 26, no. 4 (2016): 338–359. doi:

[3]  GRAPPLE Interim User Manual, 11 October 1981, box 14, J.C.R. Licklider Papers, 1938-1995 (MC 499), Institute Archives and Special Collections, MIT Libraries, Cambridge, Massachusetts.

[4] Licklider, J.C.R. Graphical Programming and Monitoring Final Technical Report, U.S. Government Printing Office, 1988, 17.

[5] Chun, Wendy Hui Kyong. “On Software, or the Persistence of Visual Knowledge.” Grey Room 18 (Winter 2004): 26-51.

Guest Post: Curation as Context: Software in the Stacks

April 21, 2017

Alex Chassanoff is a CLIR/DLF Postdoctoral Fellow in the Program on Information Science and continues a series of posts on software curation.

As scholarly landscapes shift, differing definitions for similar activities may emerge from different communities of practice.   As I mentioned in my previous blog post, there are many distinct terms for (and perspectives on) curating digital content depending on the setting and whom you ask [1].  Documenting and discussing these semantic differences can play an important role in crystallizing shared, meaningful understandings.  

In the academic research library world,  the so-called data deluge has presented library and information professionals with an opportunity to assist scholars in the active management of their digital content [2].  Curating research output as institutional content is a relatively young, though growing phenomenon.  Research data management (RDM) groups and services are increasingly common in research libraries, partially fueled by changes in federal funding grant application requirements to encourage data management planning.  In fact, according to a recent content analysis of academic library websites, 185 libraries are now offering RDM services [3].  The charge for RDM groups can vary widely; tasks can range from advising faculty on issues related to privacy and confidentiality, to instructing students on potential avenues for publishing open-access research data.

As these types of services increase, many research libraries are looking to life cycle models as foundations for crafting curation strategies for digital content [4].  On the one hand, life cycle models recognize the importance of continuous care and necessary interventions that managing such content requires.  Life cycle models also provide a simplified view of essential stages and practices, focusing attention on how data flows through a continuum.  At the same time, the data flow perspective can obscure both the messiness of the research process and the complexities of managing dynamic digital content [5,6].  What strategies for curation can best address scenarios where digital content is touched at multiple times by multiple entities for multiple purposes?  

Christine Borgman notes the multifaceted role that data can play in the digital scholarship ecosystem, serving a variety of functions and purposes for different audiences.  Describing the most salient characteristics of that data may or may not serve the needs of future use and/or reuse. She writes:

These technical descriptions of “data” obscure the social context in which data exist, however. Observations that are research findings for one scientist may be background context to another. Data that are adequate evidence for one purpose (e.g., determining whether water quality is safe for surfing) are inadequate for others (e.g., government standards for testing drinking water). Similarly, data that are synthesized for one purpose may be “raw” for another. [7]

Particular data sets may be used and then reused for entirely different intentions.  In fact, enabling reuse is a hallmark objective for many current initiatives in libraries/archives.  While forecasting future use is beyond our scope, understanding more about how digital content is created and used in the wider scholarly ecosystem can prove useful for anticipating future needs.  As Henry Lowood argues, “How researchers will actually put their hands and eyes on historical software and data collections generally has been bracketed out of data curation models focused on preservation”[8].  

As an example, consider the research practices and output of faculty member Alice, who produces research tools and methodologies for data analysis. If we were to document the components used and/or created by Alice for a particular research project, the list might include the following:

  • Software program(s) for computing published results
  • Dependencies for software program(s) for replicating published results
  • Primary data collected and used in analysis
  • Secondary data collected and used in analysis
  • Data result(s) produced by analysis
  • Published journal article

We can envision at least two uses of this particular instantiation of scholarly output. First, the statistical results of the data can be verified by replicating the conditions of the analysis.  Second, the statistical approach executed by the software program can be executed on a new input data set.  In this way, software can simultaneously serve as both an outcome to be preserved and as a methodological means to a (new) end.
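The two uses can be sketched as a simple check of which components each one depends on. The mapping below is an assumption made for illustration: the component labels abbreviate the list above, and the decision about which components each use requires is a guess that an institution would want to replace with its own judgment.

```python
# Components from the list above, as short labels.
ALL_COMPONENTS = {
    "software",
    "dependencies",
    "primary data",
    "secondary data",
    "results",
    "article",
}

# Assumed requirements for each use; these sets are illustrative only.
REQUIRED_FOR = {
    "verify published results": {
        "software", "dependencies", "primary data", "secondary data", "results",
    },
    "run method on new data": {"software", "dependencies"},
}


def missing_components(available):
    """Map each use to the sorted list of components still missing for it."""
    unknown = set(available) - ALL_COMPONENTS
    if unknown:
        raise ValueError(f"unrecognized components: {sorted(unknown)}")
    return {
        use: sorted(needed - set(available))
        for use, needed in REQUIRED_FOR.items()
    }


# Hypothetical holdings: the program and its dependencies were kept,
# along with the article, but none of the data.
held = {"software", "dependencies", "article"}
for use, missing in missing_components(held).items():
    status = "supported" if not missing else "missing " + ", ".join(missing)
    print(f"{use}: {status}")
```

Under these assumptions, the same set of holdings can support reuse of the method while falling short of what verification needs, which is one way of seeing software as both an outcome and a means.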

There are certain affordances in thinking about strategies for curation-as-context, outside the life cycle perspective.  Rather than emphasizing content as an outcome to be made accessible and preserved through a particular workflow, curation could instead aim to encompass the characterization of well-formed research objects, with an emphasis on understanding the conditions of their creation, production, use, and reuse.   Recalling our description of Alice above, we can see how each component of the process can be brought together to represent an instantiation of a contextually-rich research object.

Curation-as-context approaches can help us map the always-already in flux terrain of dynamic digital content.  In thinking about curating software as a complex object for access, use, and future use, we can imagine how mapping the existing functions, purposes, relationships, and content flows of software within the larger digital scholarship ecosystem may help us anticipate future use, while documenting contemporary use.  As Cal Lee writes:

Relationships to other digital objects can dramatically affect the ways in which digital objects have been perceived and experienced. In order for a future user to make sense of a digital object, it could be useful for that user to know precisely what set of surrogate representations – e.g. titles, tags, captions, annotations, image thumbnails, video keyframes – were associated with a digital object at a given point in time. It can also be important for a future user to know the constraints and requirements for creation of such surrogates within a given system (e.g. whether tagging was required, allowed, or unsupported; how thumbnails and keyframes were generated), in order to understand the expression, use and perception of an object at a given point in time [9].

Going back to our previous blog post, we can see how questions like “How are researchers creating and managing their digital content?” are essential counterparts to questions like “What do individuals served by the MIT Libraries need to be able to reuse software?” Our project aims to produce software curation strategies at MIT Libraries that embrace Reagan Moore’s theoretical view of digital preservation, whereby “information generated in the past is sent into the future” [10].  In other words, what can we learn about software today that makes an essential contribution to meaningful access and use tomorrow?

Works Cited

[1] Palmer, C., Weber, N., Muñoz, T., and Renar, A. (2013), “Foundations of data curation: The pedagogy and practice of ‘purposeful work’ with research data”, Archives Journal, Vol 3.

[2] Hey, T. and Trefethen, A. (2008), “E-science, cyberinfrastructure, and scholarly communication”, in Olson, G.M., Zimmerman, A., and Bos, N. (Eds), Scientific Collaboration on the Internet, MIT Press, Cambridge, MA.

[3] Yoon, A. and Schultz, T. (2017), “Research data management services in academic libraries in the US: A content analysis of libraries’ websites” (in press). College and Research Libraries.

[4] Ray, J. (2014), Research Data Management: Practical Strategies for Information Professionals, Purdue University Press, West Lafayette, IN.

[5] Carlson, J. (2014), “The use of lifecycle models in developing and supporting data services”, in Ray, J. (Ed),  Research Data Management: Practical Strategies for Information Professionals, Purdue University Press, West Lafayette, IN.

[6] Ball, A. (2010), “Review of the state of the art of the digital curation of research data”, University of Bath.

[7] Borgman, C., Wallis, J. and Enyedy, N. (2006), “Little science confronts the data deluge: Habitat ecology, embedded sensor networks, and digital libraries”, Center for Embedded Network Sensing, 7(1–2), 17 – 30. doi: 10.1007/s00799-007-0022-9. UCLA: Center for Embedded Network Sensing.  

[8] Lowood, H. (2013), “The lures of software preservation”, Preserving.exe: Towards a national strategy for software preservation, National Digital Information Infrastructure and Preservation Program of the Library of Congress.

[9] Lee, C. (2011), “A framework for contextual information in digital collections”, Journal of Documentation 67(1).

[10] Moore, R. (2008), “Towards a theory of digital preservation”, International Journal of Digital Curation 3(1).


Guest Post: DataRescue-Boston@MIT Wrap up

March 2, 2017

Alex Chassanoff, who is a Postdoctoral Fellow in the Program on Information Science, contributes to this detailed wrap-up of the recent Data Rescue Boston event that she helped organize.


Data Rescue Boston@MIT Wrap up

Written by event organizers:

Alexandra Chassanoff

Jeffrey Liu

Helen Bailey

Renee Ball

Chi Feng


On Saturday, February 18th, the MIT Libraries and the Association of Computational Science and Engineering co-hosted a day-long Data Rescue Boston hackathon at Morss Hall in the Walker Memorial Building. Jeffrey Liu, a Civil and Environmental Engineering graduate student at MIT, organized the event as part of an emerging North American movement to engage communities locally in safeguarding potentially vulnerable federal research information. Since January, Data Rescue events have been springing up at libraries across the country, largely through the combined organizing efforts of Data Refuge and the Environmental Data and Governance Initiative.


The event was sponsored by MIT Center for Computational Engineering, MIT Department of Civil and Environmental Engineering, MIT Environmental Solutions Initiative, MIT Libraries, MIT Graduate Student Council Initiatives Fund, and the Environmental Data and Governance Initiative.

Here are some snapshot metrics from our event:

# of Organizers: 8
# of Volunteers: ~15
# of Guides: 9
# of Participants: ~130
# URLs researched: 200
# URLs harvested: 53
# GiB harvested: 35
# URLs seeded: 3300 at event (~76000 from attendees finishing after event)
# Agency Primers started: 19
# Cups of Coffee: 300
# Burritos: 120
# Bagels: 450
# Pizzas: 105

Goal 1. Process data

MIT’s data rescuers processed about as much data through the seeding and harvesting phases of data rescue as other similarly sized events. For reference, Data Rescue San Francisco researched 101 URLs and harvested 25 GB of data at their event. Data Rescue DC, a two-day event that also included a bagging/describing track (which we did not have), harvested 20 GB of data, seeded 4776 URLs, bagged 15 data sets, and described 40 data sets.

Goal 2. Expand scope

Another goal of our event was to explore new workflows for expanding efforts beyond the existing focus on federal agency environmental and climate data. Toward that end, we piloted a new track called Surveying, which we used to identify and describe programs, datasets, and documents at federal agencies still in need of agency primers. We were lucky enough to have domain experts on hand who assisted us with our efforts. In total, we were able to begin expansion efforts for agencies and departments within the Department of Justice, the Department of Labor, Health and Human Services, and the Federal Communications Commission.

Goal 3. Engage and build community

Attendees at our event spanned age groups, occupations, and technical abilities. Participants included research librarians, concerned scientists, and expert undergraduate hackers; according to national developers for the Data Rescue archiving application, MIT had more “tech-tel” participants than any other event thus far. As part of the Storytelling aspect of Data Rescue events, we captured profiles for twenty-seven of our attendees. Additionally, we created Data Use Stories that describe how some researchers use specific data sets from the National Water Information System (USGS), the Alternative Fuels Data Center (DOE), and the Global Historical Climate Network (NOAA). These stories let us communicate how these data sets are used to better understand our world and to make decisions that affect our everyday lives.

The hackathon at MIT was the second event hosted by Data Rescue Boston, which has begun hosting weekly working groups every Thursday at MIT. These groups continue the work of compiling tools and documentation to improve workflow, identifying vulnerable data sets, and creating resources to help further efforts.

Future Work

Data rescue events continue to gather steam, with eight major national events planned over the next month.  The next DataRescue Boston event will be held at Northeastern on March 24th. A dozen volunteers and attendees from the MIT event have already signed up to help organize workshops and efforts at the Northeastern event.

Press Coverage of our Event:

Guest Post: Alex Chassanoff on Building A Model for Software Curation

January 21, 2017

Alex Chassanoff, a Postdoctoral Fellow in the Program on Information Science, introduces a series of posts on software curation.

Building A Model for Software Curation:

An Introductory Post


In October 2016, I began working at the MIT Libraries as a CLIR/DLF Postdoctoral Fellow in Software Curation. CLIR began offering postdoctoral fellowships in data curation in 2012; however, three others and I were part of the first cohort conducting research in the area of Software Curation. At our fellowship seminar and training this summer, the four of us joked about not having any idea what we would be doing (and Google wasn’t much help). Indeed, despite years of involvement in digital curation, I was unsure of what it might mean to curate software. As has been well-documented in the library/archival science community, curation of data means many different things to many different people. Add in the term “software” and you increase the complexities.

At MIT Libraries, I have had the good fortune of working with two distinguished experts in library research: Nancy McGovern, the Director of the Digital Preservation Program, and Micah Altman, the Director of Research. This blog post describes the first phase of our work together in defining a research agenda for software curation as an institutional asset.

Defining Scope

As we began to suss out possible research objectives and assorted activities, we found ourselves circling back to four central questions – which themselves split into associated sub-questions.

  • What is software? What is the purpose and function of software? What does it mean to curate software? How do these practices differ from preservation?
  • When do we curate software? Is it at the time of creation? Or when it becomes acquired by an institution?
  • Why do institutions and researchers curate software?
  • Who is institutionally responsible for curating software and for whom are we curating software?

Developing Focus and Purpose

We also began to outline the types of exploratory research questions we might ask depending on the specific purpose and entities we were creating a model for (see Table 1 below). Of course, these are only some of the entities that we could focus on; we could also broaden our scope to include research questions of interest to software publishers, software journals, or funders interested in software curation.


Entity | Purpose: Libraries/Archives | Purpose: MIT Specific
Research library | What does a library need to safeguard + preserve software as an asset? How are other institutions handling this? How are funding agencies considering research on software curation? | What are the MIT libraries’ existing and future needs related to software curation?
Software creator | What are the best practices software creators should adopt when creating software? How are software creators depositing their software and how are journals recommending they do this? | What are the individual needs and existing practices of software creators served by the MIT Libraries?
Software user | What are the different kinds of reasons why people may use software? What are the conditions for use? What are the specific curation practices we should implement to make software usable for this community? | What do individuals served by the MIT Libraries need to be able to reuse software?

Table 1: Potential purpose(s) of research by entity

Importantly, we wanted to adopt an agile research approach that considered software as an artifact, rather than (simply) as an outcome to be preserved and made accessible.  Curation in this sense might seek to answer ontological questions about software as an entity with significant characteristics at different levels of representation.   Certainly, digital object management approaches that emphasize documentation of significant properties or characteristics are long-standing in the literature.  At the same time, we wanted our approach to address essential curatorial activities (what Abby Smith termed “interventions”) that help ensure digital files remain accessible and usable. [1]  We returned to our shared research vision: to devise a model for software curation strategies to assist research outcomes that rely on the creation, use, reuse, and study of software.

Statement of Research Objectives and Working Definitions

Given the preponderance of definitions for curation and the wide-ranging implications of curating for different purposes and audiences, we thought it would be essential for us to identify and make clear our particular interests.  We developed the following statement to best describe our goals and objectives:

Libraries and archives are increasingly tasked with responsibilities related to the effective long-term preservation and curation of software.  The purpose of our work is to investigate and make recommendations for strategies that institutions can adopt for managing software as complex digital objects across generations of technology.

We also developed the following working definition of software curation for use in our research:

“Software curation encompasses the active practices related to the creation, acquisition, appraisal and selection, description, transformation, preservation, storage, and dissemination/access/reuse of software over short and long periods of time.”

What’s Next

The next phase of our research involves formalizing our research approach through the evaluation, selection, and application of relevant models (such as the OAIS Reference Model) and ontologies (such as the SWO). We are also developing different scenarios to establish the roles and responsibilities bound up in software creation, use, and reuse. In addition to reporting on the status of our project, you can expect to read blog posts about both the philosophical and practical implications of curating software in an academic research library setting.


[1] In the seminal collection Authenticity in a Digital Environment, Abby Smith noted that “We have to intervene continually to keep digital files alive. We cannot put a digital file on a shelf and decide later about preservation intervention. Storage means active intervention.” See: Abby Smith (2000), “Authenticity in Perspective”, in Authenticity in a Digital Environment, Council on Library and Information Resources, Washington, D.C.

Perspectives on the Future of Scholarly Communication

June 11, 2012

Since knowledge is not a private good, a pure market approach leads to under-provisioning. Planning for access to the scholarly record should include planning for long-term access beyond the life of a single institution. Important problems in scholarly communications, information science & scholarship increasingly require diverse multidisciplinary approaches.
My colleagues Lynne Herndon, Amy Brand and I were honored to be able to discuss the future of scholarly communication at  Georgetown University’s annual Scholarly Communication Symposium.

The video is below:

My slides are also available: