I was honored to present some of our research at a panel on election reform at this year’s Harvard Law and Policy Review symposium.
The Harvard Law and Policy Review’s Notice and Comment blog has published a concise summary of recommendations for redistricting reform by my collaborator Michael McDonald and me, entitled:
Create Real Redistricting Reform through Internet-Scale Independent Commissions
To quote from the HLPR summary:
Twenty-first century redistricting should incorporate transparency at internet speed and scale – open source, open data, open process (see here for in-depth recommendations) — and twenty-first century redistricting should incorporate internet technology for twenty-first century participation: direct access to the redistricting process; access to legal-strength mapping tools; and the integration of crowd-sourcing to create maps, identify communities and neighborhoods, collect and correct data, and gather and analyze public commentary.
There are few policy arenas where the public can fashion legitimate proposals that rival what their elected officials enact. Redistricting is among them, so why not enable greater public participation in this critical democratic process?
(And on a related topic, this previous post summarizes some of our research on crowd-sourced mapping for open government.)
Big data has huge implications for privacy, as summarized in our commentary below:
Both the government and third parties have the potential to collect extensive (sometimes exhaustive), fine grained, continuous, and identifiable records of a person’s location, movement history, associations and interactions with others, behavior, speech, communications, physical and medical conditions, commercial transactions, etc. Such “big data” has the ability to be used in a wide variety of ways, both positive and negative. Examples of potential applications include improving government and organizational transparency and accountability, advancing research and scientific knowledge, enabling businesses to better serve their customers, allowing systematic commercial and non-commercial manipulation, fostering pervasive discrimination, and surveilling public and private spheres.
On January 23, 2014, President Obama asked John Podesta to develop, in 90 days, a ‘comprehensive review’ of big data and privacy.
This led to a series of workshops: on big data and technology at MIT, and on social, cultural & ethical dimensions at NYU, with a third planned to discuss legal issues at Berkeley. A number of colleagues from our Privacy Tools for Research project and from the BigData@CSAIL projects have contributed to these workshops and raised many thoughtful issues (the workshop sessions are online and well worth watching).
EPIC, ARL, and 22 other privacy organizations requested an opportunity to comment on the report, and OSTP later allowed a 27-day commentary period during which brief comments would be accepted by e-mail. (Note that the original RFI provided by OSTP is, at the time of this writing, a broken link, so we have posted a copy.) They requested that commenters provide specific answers to five extraordinarily broad questions:
- What are the public policy implications of the collection, storage, analysis, and use of big data? For example, do the current U.S. policy framework and privacy proposals for protecting consumer privacy and government use of data adequately address issues raised by big data analytics?
- What types of uses of big data [are most important]… could measurably improve outcomes or productivity with further government action, funding, or research? What types of uses of big data raise the most public policy concerns? Are there specific sectors or types of uses that should receive more government and/or public attention?
- What technological trends or key technologies will affect the collection, storage, analysis and use of big data? Are there particularly promising technologies or new practices for safeguarding privacy while enabling effective uses of big data?
- How should the policy frameworks or regulations for handling big data differ between the government and the private sector? Please be specific as to the type of entity and type of use (e.g., law enforcement, government services, commercial, academic research, etc.).
- What issues are raised by the use of big data across jurisdictions, such as the adequacy of current international laws, regulations, or norms?
My colleagues at the Berkman Center, David O’Brien, Alexandra Woods, Salil Vadhan and I have submitted responses to these questions that outline a broad, comprehensive, and systematic framework for analyzing these types of questions and taxonomize a variety of modern technological, statistical, and cryptographic approaches to simultaneously providing privacy and utility. This comment is made on behalf of the Privacy Tools for Research Project, of which we are a part, and has benefitted from extensive commentary by the other project collaborators.
Much can be improved in how big data is currently treated. To summarize (quoting from the conclusions of the comment):
Addressing privacy risks requires a sophisticated approach, and the privacy protections currently used for big data do not take advantage of advances in data privacy research or the nuances these provide in dealing with different kinds of data and closely matching sensitivity to risk.
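One concrete illustration of the advances in data privacy research alluded to above is differential privacy (this example is my own gloss, not drawn from the comment itself): calibrated random noise is added to an aggregate statistic so that any one person’s presence in the data has a provably bounded effect on what is released. A minimal sketch of the Laplace mechanism:

```python
import math
import random

def laplace_scale(sensitivity, epsilon):
    """Noise scale b = sensitivity / epsilon for the Laplace mechanism."""
    return sensitivity / epsilon

def laplace_noise(b, rng=random):
    """Draw Laplace(0, b) noise via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -b * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(records, predicate, epsilon, rng=random):
    """Release a count with epsilon-differential privacy.
    A counting query has sensitivity 1: adding or removing one
    person changes the true count by at most 1."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(laplace_scale(1.0, epsilon), rng)
```

Smaller epsilon means more noise and stronger privacy; this sensitivity-to-risk dial is exactly the kind of nuance the quoted conclusion says current practice fails to exploit.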
I invite you to read the full comment.
State of the Art Informatics for Research Reproducibility, Reliability, and Reuse: Or How I Learned to Stop Worrying and Love Data Management
Scholarly publishers, research funders, universities, and the media are increasingly scrutinizing research outputs. Of major concern is the integrity, reliability, and extensibility of the evidence on which published findings are based. A flood of new funder mandates, journal policies, university efforts, and professional society initiatives aims to make this data verifiable, reliable, and reusable: if “data is the new oil”, we need data management to prevent ‘fires’, ensure ‘high octane’, and enable ‘recycling’.
In March, I had the pleasure of being the inaugural speaker in a new lecture series (http://library.wustl.edu/research-data-testing/dss_speaker/dss_altman.html) initiated by the Washington University in St. Louis Libraries, dedicated to the topics of data reproducibility, citation, sharing, privacy, and management.
In the presentation embedded below, I provide an overview of the major categories of new initiatives to promote research reproducibility, reliability, and reuse and related state of the art in informatics methods for managing data.
This blog post provides some wider background for the presentation, and a recap of its recommendations. The approaches can be roughly divided into three categories. The first approach focuses on tools for reproducible computation, ranging from “statistical documents” (incorporating Knuth’s concept of literate programming) to workflow systems and reproducible computing environments [for example, Buckheit & Donoho 1995; Schwab et al. 2000; Leisch & Rossini 2003; Deelman & Gil 2006; Gentleman & Temple Lang 2007]. With few exceptions [notably, Freire et al. 2006], this work focuses primarily on “simple replication” or “reproduction”: replicating exactly a precise set of results from an exact copy of the original data made at the time of research.
Current leading examples of tools that support reproducible computation include:
- IPython ipython.org
- knitr yihui.name/knitr/
- ResearchCompendia researchcompendia.org
- RunMyCode runmycode.org
- VisTrails vistrails.org
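The “simple replication” goal these tools target can be sketched minimally (a hypothetical illustration of the idea, not how any of the tools above work internally): record a cryptographic fingerprint of the input data and the software environment alongside each result, so that an exact rerun can later be verified bit-for-bit:

```python
import hashlib
import sys

def replication_record(data_bytes, result):
    """Bundle a result with enough context to verify an exact rerun."""
    return {
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "python_version": sys.version.split()[0],
        "result": result,
    }

def verify_rerun(record, data_bytes, result):
    """An exact replication must reproduce both the data hash and the result."""
    return (record["data_sha256"] == hashlib.sha256(data_bytes).hexdigest()
            and record["result"] == result)
```

Note that such a record supports exact replication only; as discussed below, it captures none of the descriptive metadata needed for broader reuse.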
The second approach focuses on data sharing methods and tools [see, for example, Altman et al. 2001; King 2007; Anderson et al. 2007; Crosas 2011]. This approach focuses more generally on helping researchers to share data, both for replication and for broader reuse, including secondary uses and use in teaching. Increasingly, work in this area [e.g., Gutmann 2009; Altman & King 2007] focuses on enabling long-term and interdisciplinary access to data, which requires that the researchers’ tacit knowledge about data formats, measurement, structure, and provenance be more explicitly documented.
Current leading examples of informatics tools that support data sharing include:
The third approach focuses on the norms, practices, and licensing associated with data sharing, archiving, and replication, and on the related incentives embedded in scholarly communication [Pienta 2007; Hamermesh 2007; Altman & King 2007; King 2007; Hedstrom et al. 2008; McCullough 2009; Stodden 2009]. This approach seeks to create the necessary conditions for data sharing and reuse, and to examine and align incentives around citation, data sharing, and peer review to encourage replicability and reusability.
Current leading examples of informatics tools that support richer citation, evaluation, open science, and review include:
- DataCite datacite.org
- Dryad datadryad.org
- Dataverse Network thedata.org
- DMPTool dmp.cdlib.org
- figshare figshare.com
- Journal of Visualized Experiments jove.com
- ORCID orcid.org
- Research Replication Reports http://www.psychologicalscience.org/index.php/replication
- Thomson Reuters Data Citation Index wokinfo.com/products_tools/multidisciplinary/dci/
Many Tools, Few Solutions
In this area there are many useful tools, but few complete solutions – even for a specialized community of practice. All three approaches are useful, and there are several general observations to be made about them. First, tools for replicable research such as VisTrails, myExperiment, Wings, and StatDocs are characterized by their use of a specific, controlled software framework and by their ability to facilitate near-automatic replication. The complexity of these tools, and their small user and maintenance base, means that we cannot rely on them to exist and function in five to ten years – they cannot ensure long-term access. Because they focus only on results, and not on capturing practices, descriptive metadata, and documentation, they enable exact replication without providing the contextual information necessary for broader reuse. Finally, because these tools are heterogeneous across subdisciplines and largely incompatible, they do not yet offer a broadly scalable solution.
Second, tools and practices for data management have the potential to broadly increase data sharing and the impact of related publications. However, although these tools are becoming easier to use, they still require extra effort from the researcher. Moreover, since this additional effort often comes near (or past) the conclusion of the main research project (and only after acceptance of an article, during preparation for final publication), it is perceived as a burden, and often honored in the breach.
Third, incentives for replication have been weak in many disciplines – and journals are a key factor. The reluctance of journal editors to publish either confirming or disconfirming replication studies weakens authors’ incentives to create replicable work. Lack of formal provenance and attribution practices for data also weakens accountability, raises barriers to conducting replication and reuse, reduces incentives to disseminate data for reuse, and increases the ambiguity of replication studies, making them difficult to study.
Furthermore, new forms of evidence complicate replication and reuse. In most scientific disciplines, the amount of data potentially available for research is increasing non-linearly. In addition, changes in technology and society are greatly affecting the types and quantities of potential data available for scientific analysis, especially in the social sciences. This presents substantial challenges to the future replicability and reusability of research. Traditional data archives currently consist almost entirely of numeric tabular data from noncommercial sources. New forms of data differ from tabular data in size, format, structure, and complexity. Left in its original form, this sort of data is difficult or impossible for scholars outside of the project that generated it to interpret and use. This is not only a barrier to integrative and interdisciplinary research, but also a significant obstacle to providing long-term access, which becomes practically impossible as the tacit knowledge necessary to interpret the data is forgotten. Enabling broad use and securing long-term access require more than simply storing the individual bits of information – they require establishing and disseminating good data management practices. [Altman & King 2007]
How research libraries can jump-start the process.
Research libraries should consider at least three steps:
1. Create a dataverse hosted by the Harvard Dataverse Network (http://thedata.harvard.edu/dvn/faces/login/CreatorRequestInfoPage.xhtml). This provides free, permanent storage and dissemination, with bit-level preservation ensured by Harvard’s endowment. The dataverse can be branded, curated, and controlled by the library, so it enables libraries to maintain relationships with their patrons and provide curation services with minimal effort. (And since DVN is open source, a library can always move from the hosted service to one it runs itself.)
2. Link to DMPTool (https://dmp.cdlib.org/) from your library’s website, and consider joining DMPTool as an institution – especially if you use Shibboleth (Internet2) to authorize your users. You’ll be in good company: according to a recent ARL survey, 75% of ARL libraries are now at least linking to DMPTool. Increasing researchers’ use of DMPTool provides early opportunities for conversations with libraries around data, enables libraries to offer services at a time when they are salient to the researcher, and provides information that can be used to track and evaluate data management planning needs.
3. Publish a “libguide” focused on helping researchers get more credit for their work. This is a subject of intense interest, and the library can provide information about trends and tools in this area of which researchers (especially junior researchers) may not be aware. Some possible topics to include: data citation (e.g., http://www.force11.org/node/4769); researcher identifiers (e.g., http://orcid.org); and impact metrics (http://libraries.mit.edu/scholarly/publishing/impact).
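For libraries with technical staff, the hosted dataverse in step 1 can also be managed programmatically: Dataverse exposes a REST API. The sketch below (standard library only, with placeholder hostname, token, and values) builds the request to create a new, library-branded dataverse; verify the exact endpoint and payload fields against the current Dataverse native API documentation before relying on them.

```python
import json
import urllib.request

def build_create_dataverse_request(host, parent, api_token,
                                   alias, name, contact_email):
    """Build (but do not send) a Dataverse native-API request that
    creates a new dataverse collection under `parent`. The endpoint
    and JSON fields follow the Dataverse native API; check your
    installation's documentation for the current schema."""
    payload = {
        "alias": alias,
        "name": name,
        "dataverseContacts": [{"contactEmail": contact_email}],
        "dataverseType": "ORGANIZATIONS_INSTITUTIONS",
    }
    return urllib.request.Request(
        url=f"https://{host}/api/dataverses/{parent}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Dataverse-key": api_token,
                 "Content-Type": "application/json"},
        method="POST",
    )

# Example with placeholder values; sending requires a real API token:
# req = build_create_dataverse_request("dataverse.example.edu", "root",
#                                      "XXXX-TOKEN", "mylib",
#                                      "My Library Data",
#                                      "data@library.example.edu")
# urllib.request.urlopen(req)
```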
Altman, M., L. Andreev, M. Diggory, M. Krot, G. King, D. Kiskis, A. Sone, S. Verba, A Digital Library for the Dissemination and Replication of Quantitative Social Science Research, Social Science Computer Review 19(4):458-71. 2001.
Altman, M. and G. King. “A Proposed Standard for the Scholarly Citation of Quantitative Data”, D-Lib Magazine 13(3/4). 2007.
Anderson, R. W. H. Greene, B. D. McCullough and H. D. Vinod. “The Role of Data/Code Archives in the Future of Economic Research,” Journal of Economic Methodology. 2007.
Buckheit, J. and D.L. Donoho, “WaveLab and Reproducible Research,” in A. Antoniadis (ed.), Wavelets and Statistics, Springer-Verlag. 1995.
Crosas, M., The Dataverse Network®: An Open-Source Application for Sharing, Discovering and Preserving Data, D-lib Magazine 17(1/2). 2011.
Hamermesh, D.S., “Viewpoint: Replication in Economics,” Canadian Journal of Economics. 2007.
Deelman, E. Y. Gil, (Eds.). Final Report on Workshop on the Challenges of Scientific Workflows. 2006. <http://vtcpc.isi.edu/wiki/images/b/bf/NSFWorkflow-Final.pdf>
Freire, J., C. T. Silva, S. P. Callahan, E. Santos, C. E. Scheidegger, and H. T. Vo. Managing rapidly-evolving scientific workflows. In International Provenance and Annotation Workshop (IPAW), LNCS 4145, 10-18, 2006.
Gentleman R., R. Temple Lang. Statistical Analyses and Reproducible Research, Journal of Computational and Graphical Statistics 16(1): 1-23. 2007.
Gutmann M., M. Abrahamson, M. Adams, M. Altman, C. Arms, K. Bollen, M. Carlson, J. Crabtree, D. Donakowski, G. King, J. Lyle, M. Maynard, A. Pienta, R. Rockwell, L. Timms-Ferrara, C. Young, “From Preserving the Past to Preserving the Future: The Data-PASS Project and the challenges of preserving digital social science data”, Library Trends 57(3):315-337. 2009.
Hedstrom, M., J. Niu, and K. Marz, “Incentives for Data Producers to Create ‘Archive-Ready’ Data: Implications for Archives and Records Management,” Proceedings of the Society of American Archivists Research Forum. 2008.
King, G. “An Introduction to the Dataverse Network as an Infrastructure for Data Sharing.” Sociological Methods and Research, 32(2), 173–199. 2007.
Knuth, D.E., Literate Programming, CSLI Lecture Notes 27, Center for the Study of Language and Information, Stanford, CA. 1992.
Leisch F., and A.J. Rossini, Reproducible Statistical Research, Chance 16(2): 46-50. 2003.
McCullough, B.D., Open Access Economics Journals and the Market for Reproducible Economic Research, Economic Analysis & Policy 39(1). 2009.
Pienta, A., LEADS Database Identifies At-Risk Legacy Studies, ICPSR Bulletin 27(1) 2006.
Schwab, M., M. Karrenbach, and J. Claerbout, Making Scientific Computations Reproducible, Computing in Science and Engineering 2: 61-67. 2000.
Stodden, V., “The Legal Framework for Reproducible Scientific Research: Licensing and Copyright,” Computing in Science and Engineering 11(1):35-40. 2009.
Also see, for example, the CRAN Reproducible Research task view: http://cran.r-project.org/web/views/ReproducibleResearch.html; and the Reproducible Research tools page: http://reproducibleresearch.net/index.php/RR_links#Tools
The FTC has been hosting a series of seminars on consumer privacy, on which it has requested comments. The most recent seminar explored privacy issues related to mobile device tracking. As the seminar summary points out:
In most cases, this tracking is invisible to consumers and occurs with no consumer interaction. As a result, the use of these technologies raises a number of potential privacy concerns and questions.
The presentations raised an interesting and important combination of questions about how to promote business and economic innovation while protecting individual privacy. I have submitted a comment on these issues with some proposed recommendations.
To summarize (quoting from the submitted comment):
Knowledge of an individual’s location history and associations with others has the potential to be used in a wide variety of harmful ways. … [Furthermore], since all physical activity has a unique spatial and temporal context, location history provides a linchpin for integrating multiple sources of data that may describe an individual. [R]esearch shows that human mobility patterns are highly predictable and that these patterns have unique signatures, making them highly identifiable – even in the absence of associated identifiers or hashes. Moreover, locational traces are difficult or impossible to render non-identifiable using traditional masking methods.
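The identifiability claim quoted above can be illustrated on toy data (this sketch is my own illustration, with invented traces, not material from the comment): even a couple of coarse (place, hour) observations often pin down a single person within a set of mobility traces.

```python
def matching_traces(traces, observed_points):
    """Return the IDs whose trace contains every observed (place, hour) point."""
    return [person for person, trace in traces.items()
            if all(p in trace for p in observed_points)]

# Toy traces: a set of (cell_id, hour) observations per person.
traces = {
    "alice": {("cell_1", 8), ("cell_4", 12), ("cell_2", 18)},
    "bob":   {("cell_1", 8), ("cell_7", 12), ("cell_2", 18)},
    "carol": {("cell_3", 9), ("cell_4", 12), ("cell_5", 19)},
}

# One observed point is ambiguous; two already identify Alice uniquely,
# with no name, identifier, or hash involved.
print(matching_traces(traces, [("cell_1", 8)]))                  # alice, bob
print(matching_traces(traces, [("cell_1", 8), ("cell_4", 12)]))  # alice only
```

Real mobility datasets behave the same way at scale, which is why masking identifiers alone does not render such traces non-identifiable.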
I invite you to read the full comment here:
This comment drew heavily on previous comments on proposed OSHA regulations made with my colleagues at the Berkman Center, David O’Brien and Alexandra Woods. It was made on behalf of the Privacy Tools for Research Project, of which we are a part, and has benefitted from extensive commentary by the other project collaborators.
OSHA has proposed a set of changes to the current tracking of workplace injuries and illnesses.
Currently, information about workplace injuries and illnesses must be recorded, but only on paper. Further, most of this information is never reported: OSHA receives detailed information only when it conducts an investigation, and receives summary records from only the small percentage of employers selected to participate in the annual survey. (Additionally, BLS receives a sample of this information in order to produce specific statistics for its “Survey of Occupational Injuries and Illnesses.”)
OSHA proposes three changes. The first would require establishments to regularly submit the information that they are already required to collect and maintain (quarterly submission of detailed information for larger establishments, and annual submission of summary information from any establishment with more than twenty employees that is already required to maintain these records). The second would make this process digital: submissions would be electronic, instead of on paper. And the third would make the collected data public: searchable, and downloadable in machine-actionable (.csv) form.
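To make “machine-actionable” concrete (the column names below are hypothetical, not OSHA’s actual schema): a .csv release would let researchers, journalists, and workers aggregate records in a few lines of code, rather than transcribing paper forms.

```python
import csv
import io
from collections import Counter

# Hypothetical sample of the kind of records such a release might contain.
sample = """establishment,naics,injury_type
Acme Fabrication,332,laceration
Acme Fabrication,332,fracture
Widget Logistics,493,sprain
Widget Logistics,493,laceration
"""

def injuries_by_type(csv_text):
    """Count reported injuries by type from a machine-actionable CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["injury_type"] for row in reader)

print(injuries_by_type(sample))  # lacerations: 2, fracture: 1, sprain: 1
```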
These proposed changes raise an interesting and important combination of questions about how to promote government (and industry) transparency while protecting individual privacy. My colleagues at the Berkman Center, David O’Brien, Alexandra Woods, and I have submitted an extensive comment on these changes with some proposed recommendations. This comment is made on behalf of the Privacy Tools for Research Project, of which we are a part, and has benefitted from extensive commentary by the other project collaborators.
To summarize (quoting from the conclusions of the comment):
We argue that workplace injury and illness records should be made more widely available because releasing these data has substantial potential individual, research, policy, and economic benefits. However, OSHA has a responsibility to apply best practices to manage data privacy and mitigate potential harms to individuals that might arise from data release.
The complexity, detail, richness, and emerging uses for data create significant uncertainties about the ability of traditional ‘anonymization’ and redaction methods and standards alone to protect the confidentiality of individuals. Generally, one size does not fit all, and tiered modes of access – including public access to privacy-protected data and vetted access to the full data collected – should be provided.
Such access requires thoughtful analysis with expert consultation to evaluate the sensitivity of the data collected and risks of re-identification and to design useful and safe release mechanisms.
I invite you to read the full comment here:
Sound, reproducible scholarship rests upon a foundation of robust, accessible data. For this to be so in practice as well as theory, data must be accorded due importance in the practice of scholarship and in the enduring scholarly record. In other words, data should be considered legitimate, citable products of research.
A few days ago I was honored to officially announce the Data Citation Working Group’s Joint Declaration of Data Citation Principles at IDCC 2014, from which the above quote is taken.
The Joint Declaration of Data Citation Principles identifies guiding principles for the scholarly citation of data. This recommendation is a collaborative work with CODATA, FORCE11, DataCite, and many other individuals and organizations. In the week since it was released, it has already garnered over twenty institutional endorsements.
Some slides introducing the principles are here:
To summarize, from 1977 through 2009 there were three phases of development in the area of data citation.
- The first phase of development focused on the role of citation to facilitate description and information retrieval. This phase introduced the principles that data in archives should be described as works rather than media, using author, title, and version.
- The second phase of development extended citations to support data access and persistence. This phase introduced the principles that research data used by publication should be cited, that those citations should include persistent identifiers, and that the citations should be directly actionable on the web.
- The third phase of development focused on using citations for verification and reproducibility. Although verification and reproducibility had always been among the motivations for data archiving, they had not been a focus of citation practice. This phase introduced the principle that citations should support verifiable linkage of data and published claims, and it started the trend towards wider integration with the publishing ecosystem.
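The second- and third-phase principles, persistent identifiers that are directly actionable on the web, can be sketched as follows (a hypothetical citation with illustrative metadata; the recommended citation elements themselves are given in the Joint Declaration and in Altman & King 2007):

```python
def format_data_citation(author, year, title, version, doi):
    """Render a data citation whose persistent identifier (a DOI)
    is directly actionable as an HTTPS URL."""
    return (f"{author} ({year}). {title} [Data set], {version}. "
            f"https://doi.org/{doi}")

citation = format_data_citation(
    author="Smith, J.",
    year=2014,
    title="Example Survey of Civic Participation",
    version="V2",
    doi="10.1234/EXAMPLE",  # illustrative DOI, not registered
)
print(citation)
```

Because the DOI resolves on the web, the citation supports both access (phase two) and verifiable linkage between the published claim and a specific version of the data (phase three).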
Over the last five years, the importance and urgency of scientific data management and access has been recognized more broadly. The culmination of this trend, thus far, is an increasingly widespread consensus among researchers and funders that data is a fundamental product of research and therefore a citable product. The fourth and current phase of development focuses on integration with the scholarly research and publishing ecosystem, including integration of data citation in standardized ways within publications, catalogs, tool chains, and larger systems of attribution.
Read the full recommendation here, along with examples, references and endorsements:
Our guest speaker, Cavan Capps, the U.S. Census Bureau’s Big Data Lead, presented this talk as part of the Program on Information Science Brown Bag Series.
Cavan Capps is the U.S. Census Bureau’s Lead on Big Data processing. In that role he is focusing on new Big Data sources for use in official statistics, best practice private sector processing techniques and software/hardware configurations that may be used to improve statistical processes and products. Previously, Mr. Capps initiated, designed and managed a multi-enterprise, fully distributed, statistical network called the DataWeb.
Capps provided the following summary of his talk.
Big Data provides both challenges and opportunities for the official statistical community. The difficult issues of privacy, statistical reliability, and methodological transparency will need to be addressed in order to make full use of Big Data in the official statistical community. Improvements in statistical coverage at small geographies, new statistical measures, and more timely data at perhaps lower costs are the potential opportunities. This talk provides an overview of some of the research being done by the Census Bureau as it explores the use of “Big Data” for statistical agency purposes.
He has also described the U.S. Census Bureau’s efforts to incorporate big data in this article: http://magazine.amstat.org/blog/2013/08/01/official-statistics/
What struck me most about Capps’ talk and the overall project is how many disciplines have to be mastered to find an optimal solution.
- Deep social science knowledge (especially economics, sociology, psychology, political science) to design the right survey measures, and to come up with theoretically and substantively coherent alternative measures;
- carefully designed machine learning algorithms are needed to extract actionable information from non-traditional data sources;
- advances in statistical methodology are needed to guide adaptive survey design; make reliable inferences over dynamic social networks; and to measure and correct for bias in measures generated from non-traditional data sources and non-probability samples;
- large scale computing is needed to do this all in real time;
- information privacy science is required to ensure that the results released (at scale, and in real time) continue to maintain the public trust; and
- information science methodology is required to ensure the quality, versioning, authenticity, provenance and reliability that is expected of the US Census.
This is indeed a complex project and, given the diversity of areas implicated, a stimulating one — it has resonated with many different projects and conversations at MIT.