Big Data Ethics:

Detecting Bias in Data Collection, Algorithmic Discrimination, and "Informed Refusal"


Project Summary

The team proposes to address grand challenges through a multidisciplinary study of the ethical issues involved in the use of big data and predictive algorithms to make decisions affecting individuals. They will assemble a concrete set of cases and use these to define the more general problem or problems that arise. Some of these cases will come from existing studies of data-driven discrimination (Sweeney 2013; Datta 2015; Angwin 2016); others will involve historical discrimination data. The team will also examine public data such as the NIJ Crime Forecasting Challenge (NIJ 2016) and public social media, where differences between groups may reflect distinctions based on personal preference as well as distinctions based on group stereotyping.

In this interdisciplinary project, the team will employ an algorithm audit methodology that addresses two major barriers to performing web-based research on algorithms: restrictive interpretations of the Computer Fraud and Abuse Act (CFAA), and restrictive interpretations of the Common Rule as applied to computer science research by university Institutional Review Boards. The approach engages actual users of web-based platforms, who are incentivized to participate in a project that seeks to uphold the "right to explanation" and "informed refusal" through algorithm transparency (Sandvig et al. 2014; Polonetsky 2005; Moss 2016; Goodman and Flaxman 2016; Benjamin 2016). Following the suggestions of Sandvig et al. (2014), the team will frame the algorithm audit within the familiar context of social-science discrimination audit methodology, recruiting a large number of distributed single users to test selected web-based platforms while minimizing perturbation of the systems under study and avoiding violations of terms of service governing automated use of web-based platforms.

In addition to overcoming the two barriers mentioned above, this design allows for a more participatory "citizen science" approach to data collection and supports the goals of algorithmic transparency and critical data literacy that inform the project. The team will also collect data that supports an intersectional analysis of algorithmic discrimination (Crenshaw 1991; Tripathi 2014), addressing not only intersections between commonly examined protected categories such as ethno-race, gender, and sexual orientation, but also relevant indirect factors, which can be illuminated using methods drawn from statistics and machine learning, including association rules (Tramer 2016; Ruggieri 2010) and decision trees (Mancuhan 2012).
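As a rough illustration of the association-rule idea (not the project's actual pipeline; the data, groups, and threshold below are invented for the example), a rule of the form (group → adverse outcome) can be scored by its confidence, P(adverse outcome | group), and compared against the population baseline via lift:

```python
# Illustrative sketch with made-up data: score association rules of the
# form (group -> adverse outcome) by confidence and lift. Lift > 1 means
# the group's adverse-outcome rate exceeds the population baseline, which
# flags the rule for closer (human, contextual) examination.

records = [
    # (group, adverse_outcome)
    ("A", True), ("A", True), ("A", False), ("A", True),
    ("B", False), ("B", False), ("B", True), ("B", False),
]

def confidence(group):
    """P(adverse outcome | group), i.e., the rule's confidence."""
    outcomes = [out for g, out in records if g == group]
    return sum(outcomes) / len(outcomes)

# Baseline: P(adverse outcome) across the whole population.
baseline = sum(out for _, out in records) / len(records)

for group in ("A", "B"):
    lift = confidence(group) / baseline
    print(group, round(confidence(group), 2), round(lift, 2))
```

A flagged rule is only a starting point: elevated lift may reflect either discrimination or a legitimate explanatory factor, which is why the proposal pairs such methods with intersectional and indirect-factor analysis.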

They will identify and categorize the issues involved in these cases, building on existing approaches to describing discrimination. PI Roark will concentrate on formal ethics frameworks and their relation to algorithmic research from an anthropologically informed science and technology studies (STS) perspective; PI Kelly will lead on relating the cases to work in moral psychology; and PI Clifton will investigate how the cases fit with existing mathematical models used in machine learning.

PI Clifton will also attempt to categorize the algorithmic sources of bias in each case. Example categories include training data that encodes human bias, imbalanced data collection, data-collection feedback loops (e.g., predictive policing leading to increased law enforcement presence in certain neighborhoods, leading in turn to greater discovery of crime, and so on), and algorithms that "ignore" small subpopulations with different observed properties.
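The feedback-loop category can be made concrete with a toy simulation (all numbers below are invented for illustration): two neighborhoods have the same true crime rate, but patrols are allocated in proportion to previously recorded crime, so an initial imbalance in the records perpetuates itself:

```python
# Toy simulation of a data-collection feedback loop (illustrative
# assumptions only). Both neighborhoods share the SAME true crime rate,
# but patrols follow past *records*, and discovered crime is proportional
# to patrol presence, so the initial 2:1 record imbalance never corrects.

TRUE_RATE = 0.1          # identical underlying crime rate in both areas
recorded = [10.0, 5.0]   # historical records start unequal (assumed seed)

for _ in range(20):
    total = sum(recorded)
    patrols = [100 * r / total for r in recorded]   # patrols follow records
    # Discoveries depend on patrol presence, not on any true difference.
    recorded = [r + p * TRUE_RATE for r, p in zip(recorded, patrols)]

share = recorded[0] / sum(recorded)
print(round(share, 3))   # neighborhood 0 still holds 2/3 of all records
```

Because each round's discoveries are proportional to existing records, the recorded-crime ratio stays locked at its (arbitrary) starting value, even though the underlying rates are identical, which is exactly the kind of bias source the categorization aims to surface.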

The cases will then be studied to determine where these disciplinary perspectives diverge. For example, two cases that appear the same from an ethical and statistical viewpoint may appear different when viewed through a moral psychology lens. Such divergence points to a weakness in the models: an inability to capture important distinctions. The PIs will collaboratively identify where existing models are insufficient to capture the moral and ethical nuances in automated data-based decision making. They anticipate this serving as a starting point for future proposals to develop models that can serve the data science community in developing algorithmic approaches to decision-making that respect personal preference while preventing discrimination.

Principal Investigator Bio

Chris Clifton

Professor of Computer Science

Dr. Clifton works on data privacy, particularly with respect to analysis of private data. This includes privacy-preserving data mining, data de-identification and anonymization, and limits on identifying individuals from data mining models. He also works more broadly in data mining, including data mining of text and data mining techniques applied to interoperation of heterogeneous information sources. Fundamental data mining challenges posed by these applications include extracting knowledge from noisy data, identifying knowledge in highly skewed data (few examples of "interesting" behavior), and limits on learning. He also works on database support for widely distributed and autonomously controlled information, particularly issues related to data privacy.

Investigators

Chris Clifton
Professor of Computer Science
clifton@purdue.edu
Daniel Kelly
Associate Professor of Philosophy
drkelly@purdue.edu
Kendall Roark
Professor of Library Science
roark6@purdue.edu