Big Data For Statistical Agencies
Our guest speaker, Cavan Capps, Big Data Lead at the U.S. Census Bureau, presented this talk as part of the Program on Information Science Brown Bag Series.
Cavan Capps is the U.S. Census Bureau’s Lead on Big Data processing. In that role he focuses on new Big Data sources for use in official statistics, best-practice private-sector processing techniques, and software/hardware configurations that may be used to improve statistical processes and products. Previously, Mr. Capps initiated, designed, and managed a multi-enterprise, fully distributed statistical network called the DataWeb.
Capps provided the following summary of his talk.
Big Data provides both challenges and opportunities for the official statistical community. The difficult issues of privacy, statistical reliability, and methodological transparency will need to be addressed in order to make full use of Big Data in the official statistical community. Improved statistical coverage at small geographies, new statistical measures, and more timely data at perhaps lower cost are the potential opportunities. This talk provides an overview of some of the research being done by the Census Bureau as it explores the use of “Big Data” for statistical agency purposes.
He has also described the U.S. Census Bureau’s efforts to incorporate big data in this article: http://magazine.amstat.org/blog/2013/08/01/official-statistics/
What struck me most about Capps’ talk and the overall project is how many disciplines have to be mastered to find an optimal solution.
- deep social science knowledge (especially economics, sociology, psychology, and political science) is needed to design the right survey measures and to develop theoretically and substantively coherent alternative measures;
- carefully designed machine learning algorithms are needed to extract actionable information from non-traditional data sources;
- advances in statistical methodology are needed to guide adaptive survey design, make reliable inferences over dynamic social networks, and measure and correct for bias in measures generated from non-traditional data sources and non-probability samples;
- large-scale computing is needed to do all of this in real time;
- information privacy science is required to ensure that the results released (at scale, and in real time) continue to maintain the public trust; and
- information science methodology is required to ensure the quality, versioning, authenticity, provenance, and reliability that are expected of the U.S. Census.
This is indeed a complex project, and, given the diversity of areas implicated, a stimulating one: it has resonated with many different projects and conversations at MIT.