
Big Data For Statistical Agencies

Our guest speaker, Cavan Capps, Big Data Lead at the U.S. Census Bureau, presented this talk as part of the Program on Information Science Brown Bag Series.

Cavan Capps is the U.S. Census Bureau’s Lead on Big Data processing. In that role, he focuses on new Big Data sources for use in official statistics, best-practice private-sector processing techniques, and software/hardware configurations that may improve statistical processes and products. Previously, Mr. Capps initiated, designed, and managed a multi-enterprise, fully distributed statistical network called the DataWeb.

Capps provided the following summary of his talk.

Big Data provides both challenges and opportunities for the official statistical community. The difficult issues of privacy, statistical reliability, and methodological transparency will need to be addressed in order to make full use of Big Data in official statistics. The potential opportunities are improved statistical coverage at small geographies, new statistical measures, and more timely data at perhaps lower cost. This talk provides an overview of some of the research being done by the Census Bureau as it explores the use of “Big Data” for statistical agency purposes.

He has also described the U.S. Census Bureau’s efforts to incorporate big data in this article: http://magazine.amstat.org/blog/2013/08/01/official-statistics/

What struck me most about Capps’ talk and the overall project is how many disciplines have to be mastered to find an optimal solution:

  • Deep social science knowledge (especially economics, sociology, psychology, and political science) is needed to design the right survey measures and to devise theoretically and substantively coherent alternative measures;
  • carefully designed machine learning algorithms are needed to extract actionable information from non-traditional data sources;
  • advances in statistical methodology are needed to guide adaptive survey design, to make reliable inferences over dynamic social networks, and to measure and correct for bias in measures generated from non-traditional data sources and non-probability samples (a minimal weighting sketch follows this list);
  • large-scale computing is needed to do all of this in real time;
  • information privacy science is required to ensure that results released at scale, and in real time, continue to maintain the public trust (a small privacy sketch also follows below); and
  • information science methodology is required to ensure the quality, versioning, authenticity, provenance, and reliability that are expected of the U.S. Census.
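
To make the bias-correction point concrete, here is a minimal sketch of post-stratification weighting, one standard way to adjust a non-probability sample that over-represents some groups. The strata, outcome values, and benchmark shares below are invented for illustration; this is not a description of the Census Bureau's actual methodology.

```python
# Hypothetical post-stratification sketch: reweight a biased sample so each
# stratum's weighted share matches a known population benchmark.
from collections import defaultdict

def poststratify(sample, population_shares):
    """Attach a weight to each record so each stratum's weighted share
    matches its known population share (e.g., from census benchmarks)."""
    counts = defaultdict(int)
    for record in sample:
        counts[record["stratum"]] += 1
    n = len(sample)
    return [
        {**record,
         "weight": population_shares[record["stratum"]]
                   / (counts[record["stratum"]] / n)}
        for record in sample
    ]

def weighted_mean(records, key):
    """Estimate a population mean from the weighted records."""
    total = sum(r["weight"] for r in records)
    return sum(r[key] * r["weight"] for r in records) / total

# Invented example: young respondents are over-represented in a web sample.
sample = [
    {"stratum": "18-34", "y": 0.70}, {"stratum": "18-34", "y": 0.60},
    {"stratum": "18-34", "y": 0.65}, {"stratum": "35+", "y": 0.40},
]
benchmarks = {"18-34": 0.30, "35+": 0.70}  # assumed known population shares
print(weighted_mean(poststratify(sample, benchmarks), "y"))  # 0.475, vs. 0.5875 unweighted
```

Production methods add raking across multiple margins, model-based adjustment, and variance estimation, but the core idea is the same: external benchmarks are what make inference from a non-probability source defensible.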
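
On the privacy point, one widely studied approach is to add calibrated noise to released statistics, as in the Laplace mechanism from differential privacy. The sketch below is a generic textbook illustration, not the disclosure-avoidance machinery discussed in the talk; the epsilon value and the counting query are assumptions.

```python
# Minimal Laplace-mechanism sketch for a counting query. A count has
# sensitivity 1, so noise with scale 1/epsilon suffices; smaller epsilon
# means more noise and stronger privacy.
import math
import random

def laplace_count(true_count, epsilon):
    """Release a noisy count by adding Laplace(0, 1/epsilon) noise,
    generated as the difference of two Exp(1) draws."""
    scale = 1.0 / epsilon
    e1 = -math.log(1.0 - random.random())  # Exp(1) draw
    e2 = -math.log(1.0 - random.random())  # Exp(1) draw
    return true_count + scale * (e1 - e2)

print(laplace_count(1042, epsilon=0.5))  # e.g., 1039.7 on one run
```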

This is indeed a complex project, and, given the diversity of areas implicated, a stimulating one; it has resonated with many different projects and conversations at MIT.

