Month: October 2014

Brown Bag: New Models of Scholarly Communication for Digital Scholarship, by Stephen Griffin, University of Pittsburgh

My colleague, Stephen Griffin, who is Visiting Professor and Mellon Cyberscholar at the University of Pittsburgh, School of Information Sciences, presented this talk as part of the Program on Information Science Brown Bag Series. Steve is an expert in Digital Libraries and has a broad perspective on the evolution of library and information science, having had a 32-year career at the National Science Foundation (NSF) as a Program Director in the Division of Information and Intelligent Systems. Steve led the Interagency Digital Libraries Initiatives and International Digital Libraries Collaborative Research Programs, which supported many notable digital library projects (including my first large research project).

In his talk, below, Steve discusses how research libraries can play a key and expanded role in enabling digital scholarship and creating the supporting activities that sustain it.

In his abstract, Steve describes his talk as follows:

Contemporary research and scholarship is increasingly characterized by the use of large-scale datasets and computationally intensive tasks.  A broad range of scholarly activities is reliant upon many kinds of information objects, located in distant geographical locations expressed in different formats on a variety of mediums.  These data can be dynamic in nature, constantly enriched by other users and automated processes.

Accompanying data-centered approaches to inquiry have been cultural shifts in the scholarly community that challenge long-standing assumptions that underpin the structure and mores of academic institutions, as well as to call into question the efficacy and fairness of traditional models of scholarly communication.  Scholars are now demanding new models of scholarly communication that capture a comprehensive record of workflows and accommodate the complex and accretive nature of digital scholarship.  Computation and data-intensive digital scholarship present special challenges in this regard, as reproducing the results may not be possible based solely on the descriptive information presented in traditional journal publications.  Scholars are also calling for greater authority in the publication of their works and rights management.

Agreement is growing on how best to manage and share massive amounts of diverse and complex information objects.  Open standards and technologies allow interoperability across institutional repositories.  Content level interoperability based on semantic web and linked open data standards is becoming more common.   Information research objects are increasingly thought of as social as well as data objects – promoting knowledge creation and sharing and possessing qualities that promote new forms of scholarly arrangements and collaboration.  These developments are important to advance the conduct and communication of contemporary research.  At the same time, the scope of problem domains can expand, disciplinary boundaries fade and interdisciplinary research can thrive.

This talk will present alternative paths for expanding the scope and reach of digital scholarship and robust models of scholarly communication necessary for full reporting.  Academic research libraries will play a key and expanded role in enabling digital scholarship and creating the supporting activities that sustain it.  The overall goals are to increase research productivity and impact, and to give scholars a new type of intellectual freedom of expression.

From my point of view, a number of themes ran through Steve’s presentation:

  • Grand challenges in computing have shifted focus from computing capacity to managing and understanding information; … and repositories have shifted from simple discovery towards data integration.
  • As more information has become available the space of problems we can examine expands from natural sciences to other scientific areas — especially to a large array of problems in social science and humanities; but
    …  research funding is shifting further away from social sciences and humanities.
  • Reproducibility has become a crisis in the sciences; and
    … reproducibility requires a comprehensive record of the research process and scholarly workflow
  • Data sharing and support for replication still occur primarily at the end of the scientific workflow
    … accelerating the research cycle requires integrating sharing of data and analysis in much earlier stages of workflow, towards a continually open research process.

Steve’s talk includes a number of recommendations for libraries. First and foremost, in my view, is that libraries will need to act as partners with scientists in their research, in order to support open science, accelerated science, and the integration of information management and sharing workflows into earlier stages of the research process. I agree with this wholeheartedly and have made it a part of the mission of our Program.

The talk suggests a set of specific priorities for libraries. I don’t think one set of priorities will fit all research libraries, because the pursuit of projects is necessarily, and appropriately, opportunistic, and depends on the competitive advantage of the institutions involved and the needs of local stakeholders. However, I would recommend adding rapid fabrication, scholarly evaluation, crowdsourcing, library publishing, and long-term access generally to the list of priorities in the talk.

Steve’s talk makes the point that libraries will need to act as partners with scientists in their research, in order to support accelerated science that integrates information management and reproducibility into earlier stages of the research process.

Redistricting and Technology

This talk, presented as a guest lecture in Ron Rivest’s and Charles Stewart’s class on Elections and Technology, reflects on the use of technology in redistricting, and on lessons learned about open data, public participation, technology, and data management from conducting crowd-sourced election mapping efforts.

Some observations:

  • On technical implementation: There is still a substantial gap between the models and methods used in the technology stack and those used in mapping and elections. The domain of electoral geography deals with census and administrative units, legally defined relationships among units, and randomized interventions, whereas GIS deals with polygons, layers, and geospatial relationships. These concepts often map onto one another, with some exceptions, and in political geography one can run into a lot of problems by not paying attention to the exceptions. For example, spatial contiguity is often the same as legal contiguity, but not always, and implementing the “not always” part implies a whole set of separate data structures, algorithms, and interfaces.
  • On policy & transparency: We often assume that transparency is satisfied by making the rules (the algorithm) clear, and the inputs to the rules (the data) publicly available. In election technology, however, code matters too: it’s impossible to verify or correct the implementation of an algorithm without the code. The form of the data matters as well: transparent data contains complete information, in accessible formats, available through a standard API, and accompanied by documentation and evidence of authenticity.
  • On policy & participation: Redistricting plans are a form of policy proposal. Technology is necessary to enable richer participation in redistricting: it enables individuals to make complete, coherent alternative proposals to those offered by the legislature. But technology is not sufficient. Although the courts sometimes pay attention to these publicly submitted maps, legislatures have strong incentives to act in self-interested ways. Institutional changes are needed before fully participative redistricting becomes a reality.
  • On policy implementation: engagement with existing grass-roots organizations and the media was critical for participation. Don’t assume that if you build it, anyone will come…
  • On methodology: Crowd-sourcing enables us to sample from plans that are discoverable by humans. This is really useful because unbiased random sampling of legal redistricting plans is not feasible. By crowd-sourcing large sets of plans we can examine the achievable trade-offs among redistricting criteria, and conduct a “revealed preference” analysis to determine legislative intent.
  • Ad-hoc, miscellaneous, preliminary observations: Field experiments in this area are hard. There are a lot of moving parts to manage: creating the state of the practice, while meeting the timeline of politics, while working to keep the methodology (etc.) clean enough to analyze later. And always remember Kranzberg’s first law: technology is neither good nor bad; nor is it neutral.
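
To make the first observation above concrete, here is a minimal sketch of how the “not always” part of contiguity might be handled. It is an illustration under stated assumptions, not the implementation used in our mapping efforts: it assumes the spatial adjacency between units has already been derived from the geometry, and it treats legal exceptions as explicit lists of unit pairs to add or remove. The unit names and the functions `legal_adjacency` and `is_contiguous` are hypothetical.

```python
from collections import deque


def legal_adjacency(spatial, added=(), removed=()):
    """Return a copy of the spatial adjacency graph with legal exceptions applied."""
    adj = {unit: set(neighbors) for unit, neighbors in spatial.items()}
    for a, b in removed:  # e.g. units that touch only at a single point
        adj.get(a, set()).discard(b)
        adj.get(b, set()).discard(a)
    for a, b in added:  # e.g. an island deemed contiguous with the mainland
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def is_contiguous(units, adjacency):
    """True if every unit in the district can be reached from any other
    by moving only through adjacent units that are also in the district."""
    units = set(units)
    if not units:
        return True
    start = next(iter(units))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search restricted to the district
        current = queue.popleft()
        for neighbor in adjacency.get(current, ()):
            if neighbor in units and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen == units


# Hypothetical units: A and B share a border, C is an island,
# and D touches B only at a single point.
spatial = {"A": {"B"}, "B": {"A", "D"}, "C": set(), "D": {"B"}}
legal = legal_adjacency(spatial, added=[("A", "C")], removed=[("B", "D")])

print(is_contiguous({"A", "B", "C"}, spatial))  # False: the island is spatially isolated
print(is_contiguous({"A", "B", "C"}, legal))    # True: the legal rule connects it
print(is_contiguous({"A", "B", "D"}, spatial))  # True: the polygons touch
print(is_contiguous({"A", "B", "D"}, legal))    # False: point contact does not count here
```

Even in this toy form, the exceptions live in a separate structure from the geometry, which is the point: a plan that passes a purely spatial check can fail the legal one, and vice versa.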

We’ve also written quite a few articles, book chapters, etc. that expand on many of these points.


Examples of Big Data and Privacy Problems

Personal information continues to become more available, increasingly easy to link to individuals, and increasingly important for research. New laws, regulations and policies governing information privacy continue to emerge, increasing the complexity of management. Trends in information collection and management (cloud storage, “big” data, and debates about the right to limit access to published but personal information) complicate data management, and make traditional approaches to managing confidential data decreasingly effective.

The slides below provide an overview of the changing landscape of information privacy, with a focus on the possible consequences of these changes for researchers and research institutions. They were originally presented as part of the Program on Information Science Brown Bag Series.

Across the emerging examples of big data and privacy, a number of different challenges recur that appear to be novel to big data, and which drew the attention of the attending experts. In our privacy research collaborations we have started to assign names to these privacy problems for easy reference:

  1. The “data density” problem — many forms of “big” data used in computational social science measure more attributes, contain more granularity and provide richer and more complex structure than traditional data sources. This creates a number of challenges for traditional confidentiality protections including:
    1. Since big data often has quite different distributional properties from “traditional data”, traditional methods of generalization and suppression cannot be used without sacrificing large amounts of utility.
    2. Traditional methods concentrate on protecting tabular data. However, computational social science increasingly makes use of text, spatial traces, networks, images and data in a wide variety of heterogeneous structures.
  2. The “data exhaust” problem – traditional studies of humans focused on data collected explicitly for that purpose. Computational social science increasingly uses data that is collected for other purposes. This creates a number of challenges, including:
    1. Access to “data exhaust” cannot easily be limited by the researcher – although a researcher may limit access to their own copy, the exhaust may be available from commercial sources; or similar measurements may be available from other exhaust streams. This increases the risk that any sensitive information linked with the exhaust streams can be reassociated with an individual.
    2. Data exhaust often produces fine-grained observations of individuals over time. Because of regularities in human behavior, patterns in data exhaust can be used to ‘fingerprint’ an individual – enabling potential reidentification even in the absence of explicit identifiers or quasi-identifiers.
  3. The “it’s only ice cream” problem – traditional approaches to protecting confidential data focus on protecting “sensitive” attributes, such as measures of disfavored behavior, or “identifying” attributes, such as gender or weight. Attributes such as “favorite flavor of ice cream” or “favorite foreign movie” would not traditionally be protected – and could even be disclosed in an identified form. However, the richness, variety, and coverage of big data used in computational social science substantially increase the risk that any ‘nonsensitive’ attribute could, in combination with other publicly available, nonsensitive information, be used to identify an individual. This makes it increasingly difficult to predict and ameliorate the risks to confidentiality associated with release of the data.
  4. The “doesn’t stay in Vegas” problem – in traditional social science research, most of the information used was obtained and used within approximately the same context – accessing information outside of its original context was often quite costly.  Increasingly, computational social science uses information that was shared in a local context for a small audience, but is available in a global context, and to a world audience.  This creates a number of challenges, including:
    1. The scope of the consent, whether implied or express, of the individuals being studied using new data sources may be unclear. In addition, the terms of service and privacy policies of commercial services may not clearly disclose third-party research uses.
    2. Data may be collected over a long period of time under evolving terms of service and expectations.
    3. Data may be collected across a broad variety of locations – each of which may have different expectations and legal rules regarding confidentiality.
    4. Future uses of the data and concomitant risks are not apparent at the time of collection, when notice and consent may be given.
  5. The “algorithmic discrimination” problem – in traditional social science, models for analysis and decision-making were human-mediated. The use of big data with many measures, and/or complex models (e.g. machine-learning models) or models lacking formal inferential definitions (e.g. many clustering models), can lead to algorithmic discrimination that is neither intended by nor immediately discernable to the researcher.
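
As a rough illustration of the “it’s only ice cream” problem above, the sketch below counts how many records become unique as nonsensitive attributes are combined. It is a toy example under stated assumptions, not a method from our working papers: the records are invented, the attribute names are hypothetical, and uniqueness within a dataset is only a crude proxy for reidentification risk.

```python
from collections import Counter


def group_sizes(records, attributes):
    """Count how many records share each combination of the given attributes."""
    return Counter(tuple(r[a] for a in attributes) for r in records)


def fraction_unique(records, attributes):
    """Fraction of records whose attribute combination appears exactly once."""
    sizes = group_sizes(records, attributes)
    unique = sum(1 for r in records if sizes[tuple(r[a] for a in attributes)] == 1)
    return unique / len(records)


# Hypothetical toy records; none of these attributes would traditionally be protected.
records = [
    {"ice_cream": "vanilla", "movie": "Amelie", "city": "Boston"},
    {"ice_cream": "vanilla", "movie": "Amelie", "city": "Austin"},
    {"ice_cream": "vanilla", "movie": "Ran", "city": "Boston"},
    {"ice_cream": "mint", "movie": "Ran", "city": "Boston"},
]

print(fraction_unique(records, ["ice_cream"]))                   # 0.25
print(fraction_unique(records, ["ice_cream", "movie"]))          # 0.5
print(fraction_unique(records, ["ice_cream", "movie", "city"]))  # 1.0
```

Each attribute alone singles out few people, but the share of unique records climbs as attributes are combined. With the many attributes typical of big data, almost any ‘nonsensitive’ column can contribute to this effect.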

Our forthcoming working papers from the Privacy Tools for Sharing Research Data project explore these issues in more detail.