Examples of Big Data and Privacy Problems
Personal information continues to become more available, easier to link to individuals, and more important for research. New laws, regulations, and policies governing information privacy continue to emerge, increasing the complexity of data management. Trends in information collection and management (cloud storage, “big” data, and debates about the right to limit access to published but personal information) further complicate data management and make traditional approaches to managing confidential data decreasingly effective.
The slides below provide an overview of the changing landscape of information privacy, with a focus on the possible consequences of these changes for researchers and research institutions.
The talk was originally presented as part of the Program on Information Science Brown Bag Series.
Across these emerging examples of big data and privacy, a number of challenges recur that appear to be novel to big data and that drew the attention of the attending experts. In our privacy research collaborations we have started to assign names to these privacy problems for easy reference:
- The “data density” problem – many forms of “big” data used in computational social science measure more attributes, at finer granularity, and with richer and more complex structure than traditional data sources. This creates a number of challenges for traditional confidentiality protections, including:
- Since big data often has quite different distributional properties from “traditional data”, traditional methods of generalization and suppression cannot be used without sacrificing large amounts of utility.
- Traditional methods concentrate on protecting tabular data. However, computational social science increasingly makes use of text, spatial traces, networks, images, and data in a wide variety of heterogeneous structures.
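To make the generalization and suppression methods mentioned above concrete, here is a minimal sketch (all records and the k=2 threshold are invented for illustration): quasi-identifiers are coarsened, and rare combinations are suppressed. Note how a skewed distribution, typical of big data, forces a large share of the records to be dropped, which is the utility loss described above.

```python
from collections import Counter

# Illustrative records: (age, zip code). All values are invented.
records = [(34, "02139"), (36, "02139"), (35, "02139"),
           (52, "02142"), (34, "02139"), (71, "94305")]

def generalize(rec):
    """Coarsen age to a 10-year band and zip code to a 3-digit prefix."""
    age, zipcode = rec
    decade = (age // 10) * 10
    return (f"{decade}-{decade + 9}", zipcode[:3] + "**")

def suppress(generalized, k=2):
    """Drop records whose generalized quasi-identifiers occur fewer than k times."""
    counts = Counter(generalized)
    return [g for g in generalized if counts[g] >= k]

generalized = [generalize(r) for r in records]
released = suppress(generalized, k=2)
print(released)  # only the 4 records in the dense (30-39, 021**) cell survive
```

Two of six records fall in sparse cells and must be suppressed entirely; with the long-tailed distributions common in big data, that fraction grows quickly.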
- The “data exhaust” problem – traditional studies of humans focused on data collected explicitly for that purpose. Computational social science increasingly uses data that is collected for other purposes. This creates a number of challenges, including:
- Access to “data exhaust” cannot easily be limited by the researcher – although a researcher may limit access to their own copy, the exhaust may be available from commercial sources; or similar measurements may be available from other exhaust streams. This increases the risk that any sensitive information linked with the exhaust streams can be reassociated with an individual.
- Data exhaust often produces fine-grained observations of individuals over time. Because of regularities in human behavior, patterns in data exhaust can be used to ‘fingerprint’ an individual – enabling potential reidentification even in the absence of explicit identifiers or quasi-identifiers.
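The “fingerprinting” risk above can be sketched in a few lines. With hypothetical (entirely invented) location traces, just two observed points from a de-identified exhaust stream are enough to single out one person, even though the stream carries no identifiers:

```python
# Hypothetical traces: person -> set of (day, hour, cell tower) observations.
# All names and values are invented for illustration.
traces = {
    "alice": {(1, 8, "A"), (1, 18, "B"), (2, 8, "A"), (2, 18, "B")},
    "bob":   {(1, 8, "A"), (1, 18, "C"), (2, 9, "D"), (2, 18, "C")},
    "carol": {(1, 7, "E"), (1, 18, "B"), (2, 7, "E"), (2, 18, "B")},
}

def candidates(observed, traces):
    """Return the people whose trace contains every observed point."""
    return [person for person, trace in traces.items()
            if observed <= trace]  # subset test: all points must appear

# Two points gleaned from an "anonymous" exhaust stream...
observed = {(1, 8, "A"), (2, 18, "B")}
print(candidates(observed, traces))  # only "alice" matches both points
```

Because daily routines are regular, a handful of space-time points typically matches a unique trace; the match set shrinks rapidly as points are added.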
- The “it’s only ice cream” problem – traditional approaches to protecting confidential data focus on protecting “sensitive” attributes, such as measures of disfavored behavior, or “identifying” attributes, such as gender or weight. Attributes such as “favorite flavor of ice cream” or “favorite foreign movie” would not traditionally be protected – and could even be disclosed in an identified form. However the richness, variety, and coverage of big data used in computational social science substantially increases the risk that any ‘nonsensitive’ attribute could, in combination with other publicly available, nonsensitive information, be used to identify an individual. This makes it increasingly difficult to predict and ameliorate the risks to confidentiality associated with release of the data.
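A toy linkage attack (all records invented) illustrates the point: joining a “de-identified” release to public, identified profiles on nothing but ice-cream flavor and favorite movie re-attaches names to sensitive attributes.

```python
# "De-identified" research release: no names, "harmless" attributes plus a
# sensitive one. All records are invented for illustration.
released = [
    {"flavor": "pistachio", "movie": "Amelie", "diagnosis": "X"},
    {"flavor": "vanilla",   "movie": "Ran",    "diagnosis": "Y"},
    {"flavor": "vanilla",   "movie": "Amelie", "diagnosis": "Z"},
]

# Public, identified, "nonsensitive" data (e.g. scraped social profiles).
public = [
    {"name": "Dana", "flavor": "pistachio", "movie": "Amelie"},
    {"name": "Eli",  "flavor": "vanilla",   "movie": "Ran"},
]

def link(released, public, keys=("flavor", "movie")):
    """Join on the 'nonsensitive' attributes to re-attach identities."""
    matches = []
    for pub in public:
        hits = [r for r in released if all(r[k] == pub[k] for k in keys)]
        if len(hits) == 1:  # a unique match re-identifies the record
            matches.append((pub["name"], hits[0]["diagnosis"]))
    return matches

print(link(released, public))  # both "harmless" profiles link uniquely
```

Neither attribute would traditionally be treated as identifying, yet together they are unique within this small release, which is exactly the risk rich, high-coverage data creates at scale.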
- The “doesn’t stay in Vegas” problem – in traditional social science research, most of the information used was obtained and used within approximately the same context – accessing information outside of its original context was often quite costly. Increasingly, computational social science uses information that was shared in a local context for a small audience, but is available in a global context, and to a world audience. This creates a number of challenges, including:
- The scope of the consent, whether implied or express, of the individuals being studied using new data sources may be unclear. And commercial services that collect data under terms of service and privacy policies may not clearly disclose third-party research uses.
- Data may be collected over a long period of time under evolving terms of service and expectations.
- Data may be collected across a broad variety of locations – each of which may have different expectations and legal rules regarding confidentiality.
- Future uses of the data and concomitant risks are not apparent at the time of collection, when notice and consent may be given.
- The “algorithmic discrimination” problem – in traditional social science, models for analysis and decision-making were human-mediated. The use of big data with many measures, complex models (e.g. machine-learning models), or models lacking formal inferential definitions (e.g. many clustering models) can lead to algorithmic discrimination that is neither intended by nor immediately discernible to the researcher.
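One way such discrimination arises is through proxy features. The sketch below (all data invented) shows a decision rule that never sees the protected attribute, yet produces unequal outcomes for equally qualified groups because a feature it does use is correlated with group membership:

```python
# Invented data: the model never sees "group", but "zip_prefix" is a proxy.
people = [
    {"group": "A", "zip_prefix": "021", "qualified": True},
    {"group": "A", "zip_prefix": "021", "qualified": True},
    {"group": "B", "zip_prefix": "943", "qualified": True},
    {"group": "B", "zip_prefix": "943", "qualified": True},
]

def model(person):
    """A rule learned from biased history, using only the proxy feature."""
    return person["zip_prefix"] == "021"

# Compare approval rates across groups the model never observed directly.
approval_rate = {}
for g in ("A", "B"):
    members = [p for p in people if p["group"] == g]
    approval_rate[g] = sum(model(p) for p in members) / len(members)
print(approval_rate)  # equally qualified groups, unequal outcomes
```

Because the disparity emerges from correlations in the training data rather than from any explicit rule, it can be invisible to a researcher who only inspects the model’s inputs.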
Our forthcoming working papers from the Privacy Tools for Sharing Research Data project explore these issues in more detail.