“Big Data” is the bogeyman of the information age: powerful, and as ill-defined as it is abstractly threatening. Broadly, it encompasses “technology that maximize[s] computational power and algorithmic accuracy”;1 “types of analyses that draw on a range of tools to clean and compare data”;2 and the underlying belief in the correlation between the size of the data set, and its ability to produce increasingly accurate and nuanced insights.3 Put another way, “‘Big data’ [is] the amassing of huge amounts of statistical information on social and economic trends and human behavior.”4 The belief in the prescient value of big data has led to widespread collection of information on citizens and consumers in both the public and private sectors, though that distinction has become increasingly permeable.5 Data brokers, companies that create and sell detailed profiles of consumers for profit, sell their products to private and public entities alike, and often do not have data quality control clauses in the contracts governing those interactions.6 These profiles also often refer directly or indirectly to sensitive attributes, such as race, gender, age, and socioeconomic status.7
This brave new world of big data is no longer new. But the mechanics of the algorithms relying on that data, and the process by which decisions are made using that information, merit a sharpened focus. Algorithmic decision-making is increasingly replacing existing practices in both the public and private sectors, making an understanding of the technical construction of those algorithms increasingly crucial. This is all the more true for processes in which the consumer or citizen does not have a voice, and the logic behind the decision is fundamentally opaque.8 It is difficult, if not impossible, for that consumer or citizen to challenge an adverse decision made about her when the basis for the decision is unavailable. In the private sector, automated predictions are used to calculate loan rates, credit scores, and insurance risk, and to inform employment evaluations and hiring searches.9 In the public sector,10 they are being used for risk prediction in law enforcement,11 as well as for sentencing,12 and to calculate benefits.13 Further, there is a pervasive and misguided belief in the inherent neutrality of algorithmic decision-making by virtue of its empiricism. But data is not inherently neutral, and neither are the algorithms that process it. Each is the product of the beliefs, fallibilities, and biases of the people who created it. If those fallibilities are unaccounted for, algorithms will simply replicate the pre-existing inequalities encoded in their intake data and structure. This memorandum will provide an overview of the basics of algorithms and data mining, and explore how automated decision-making can unintentionally reveal sensitive information, or unintentionally base its predictions on protected traits, implicating individual privacy and civil liberties.
To understand how the particular features of an algorithm can violate individual privacy, or lead to discriminatory outcomes, it is necessary to understand the discrete steps of how algorithms work. An algorithm can be defined as “simply a series of steps undertaken in order to solve a particular problem or accomplish a defined outcome.”14 In the context of big data, that means a computational process that takes input data and creates an output based on a rule.15 A machine-learning algorithm involves two distinct processes: a classifier algorithm, and a learner algorithm.16 A classifier algorithm performs a mathematical function on a given set of input data, and creates a category based on the relationships between different properties (‘features’ of the data) as an output. An example would be a classifying algorithm that takes a list of emails with multiple features, such as sender, time sent, or presence of an attachment, and sorts them by sender (“from Bob” or “not from Bob”). The learner algorithm will establish the relationships between a set of features in training data, and prospectively apply that rule to new inputs.17 Commonly used machine-learning models include neural networks, decision trees, Naïve Bayes, and logistic regression.18 The choice of model depends on the particular use, such as an algorithm designed to predict creditworthiness, as opposed to an algorithm designed to predict the likelihood of crime in a given area, and different models can be used separately, or in conjunction with one another.19 A prioritization algorithm, as the name might suggest, ranks an input by virtue of possession or lack of certain attributes, and is primarily used in processes that assess risk. Examples include recidivism algorithms used by judges in sentencing, or algorithms that assess insurance risk.20
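The split between learner and classifier described above can be made concrete with a minimal sketch. It assumes a very simple "one rule" learner, chosen here only for brevity and not drawn from any system discussed in this piece; the email records, the "important" label, and the function name learn_one_rule are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical training data: each email has features and a known label.
training = [
    {"sender": "bob",   "has_attachment": True,  "label": "important"},
    {"sender": "bob",   "has_attachment": False, "label": "important"},
    {"sender": "alice", "has_attachment": True,  "label": "not important"},
    {"sender": "carol", "has_attachment": False, "label": "not important"},
    {"sender": "alice", "has_attachment": False, "label": "not important"},
    {"sender": "bob",   "has_attachment": True,  "label": "important"},
    {"sender": "carol", "has_attachment": True,  "label": "not important"},
]


def learn_one_rule(rows, label_key):
    """Learner: find the single feature whose values best predict the label
    in the training data, and return that feature plus a classifier."""
    features = [key for key in rows[0] if key != label_key]
    best = None
    for feature in features:
        # Count how often each label appears for each value of the feature.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[feature]][row[label_key]] += 1
        # The rule maps each feature value to its most common label.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        correct = sum(c.most_common(1)[0][1] for c in counts.values())
        if best is None or correct > best[0]:
            best = (correct, feature, rule)
    _, chosen_feature, chosen_rule = best
    # Fall back to the overall most common label for unseen feature values.
    default = Counter(row[label_key] for row in rows).most_common(1)[0][0]

    def classify(row):
        """Classifier: apply the learned rule to a new input."""
        return chosen_rule.get(row[chosen_feature], default)

    return chosen_feature, classify


feature, classify = learn_one_rule(training, "label")
new_email = {"sender": "bob", "has_attachment": False}
print(feature, "->", classify(new_email))
```

The learner never receives an instruction to look at the sender; it settles on that feature because, in this particular training set, it happens to track the label most closely. Whatever patterns the training data contains, intended or not, become the rule applied to every new input.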
The very value of data analytics lies in its ability to elicit subtle and insightful relationships between various data features, such as, oddly enough, an increase in Pop-Tart purchases before hurricanes.21 That seemingly oracular ability to illustrate connections between otherwise random attributes is both what makes big data so useful, and what leads to its piercing ability to reveal private information. It can elicit inferences an individual did not want to know, or might not want anyone else to know, such as a medical condition.22 It can also draw relationships between legally protected and unprotected categories, and base decisions on those correlations.23 Even when the information is not legally protected or inherently sensitive, there are concerns that increasingly precise determinations could be used to create inscrutably complex portraits of consumers, in a way that could further diminish consumer control.24 Privacy violations and discriminatory outcomes are a predictable consequence of data analytics’ ability to elucidate unexpected information. While distinct concepts, privacy and civil rights often overlap when the private information is deeply connected to a fundamental right, or a protected attribute, such as political affiliation, immigration status, or a disability.
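One way to see how an unprotected category can stand in for a protected one is to measure how well the former predicts the latter. The sketch below is illustrative only: the records are synthetic and made up for this example, the zone labels "Z1" and "Z2" and groups "A" and "B" are hypothetical, and the measure used (how often guessing the most common group within each zone is right) is just one simple choice among many.

```python
from collections import Counter, defaultdict

# Synthetic, made-up records: (zone, protected_group).
records = (
    [("Z1", "A")] * 40 + [("Z1", "B")] * 10
    + [("Z2", "A")] * 5 + [("Z2", "B")] * 45
)


def proxy_strength(rows):
    """Fraction of records whose protected group is recovered by guessing
    the most common group within each zone."""
    by_zone = defaultdict(Counter)
    for zone, group in rows:
        by_zone[zone][group] += 1
    hits = sum(c.most_common(1)[0][1] for c in by_zone.values())
    return hits / len(rows)


# Baseline: guess the overall most common group for everyone.
baseline = Counter(group for _, group in records).most_common(1)[0][1] / len(records)
strength = proxy_strength(records)

print(f"baseline guess accuracy: {baseline:.2f}")
print(f"accuracy using zone as a proxy: {strength:.2f}")
```

In this made-up data, knowing only the zone recovers group membership far more often than a blind guess would. A decision rule that never sees the protected attribute can therefore still act on it indirectly.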
Algorithms can be intrinsically (and unintentionally) discriminatory through the population of data selected, how the algorithm functions, and the data itself. For example, when the training data for a predictive policing algorithm assigning the probability of crime to an area uses crime statistics from police stops in 1956 Chattanooga, the algorithm will learn, and replicate, a correlation between arrest rates and race. Data does not simply occur; it is created, and will reflect the flaws of its creator, as will any rule predicated on the relationships between various attributes in that data.25 As a matter of technique, machine learning is also less accurate, and thus less effective, for minority groups. There is proportionately less data available for minority groups by definition, and correlations that may be correct for the majority may be completely incorrect for the minority.26 In an excellent piece illustrating the fallacy of inherently neutral data mining, Moritz Hardt uses the example of a machine learning algorithm distinguishing between real and fake names.27 A short and common name might be real in one culture, and fake in another; if the classifier discerns a negative correlation between real names and complex or long ones, it will be inaccurate in applying that rule to minority groups.28 Certain attributes can also serve as proxies for sensitive attributes, such as race, or socioeconomic status. Uber, for example, was accused of redlining by directing drivers away from majority-black neighborhoods.29 Inference of membership in a protected class; statistical bias skewing the function of the algorithm; and faulty inferences based on mistaken or acontextual data can each serve to render the results of an algorithm discriminatory, or violate an individual’s privacy.30
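The name example can be sketched in a few lines. What follows is not Hardt's code or data; it is a deliberately exaggerated toy under stated assumptions: the records are synthetic, name length is the only feature, and the learner simply picks the length cutoff that maximizes overall accuracy on the training data.

```python
# Synthetic, made-up records: (name_length, group, is_real).
# In the majority group short names are real and long ones fake;
# in the (much smaller) minority group the pattern is mostly reversed.
majority = [(4, "majority", True)] * 45 + [(12, "majority", False)] * 45
minority = [(12, "minority", True)] * 8 + [(4, "minority", False)] * 2
data = majority + minority


def accuracy(records, cutoff):
    """Share of records where 'length <= cutoff means real' is correct."""
    hits = sum((length <= cutoff) == is_real for length, _, is_real in records)
    return hits / len(records)


def learn_cutoff(records):
    """Learner: choose the cutoff with the best overall training accuracy."""
    candidates = sorted({length for length, _, _ in records})
    return max(candidates, key=lambda c: accuracy(records, c))


cutoff = learn_cutoff(data)
print("learned cutoff:", cutoff)
print("overall accuracy: ", accuracy(data, cutoff))
print("majority accuracy:", accuracy(majority, cutoff))
print("minority accuracy:", accuracy(minority, cutoff))
```

Because the majority supplies most of the records, the rule that looks best overall is the one that fits the majority. The headline accuracy figure hides the fact that the same rule is wrong for nearly everyone in the smaller group, which is why per-group evaluation matters.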
The problems big data poses for privacy and civil rights are manifold and complex. Though the work ahead is considerable, technologists and legal scholars have begun exploring relevant techniques to better guard against discrimination and protect individual privacy. Computer scientists in public policy like Latanya Sweeney,31 Cynthia Dwork,32 Helen Nissenbaum33 and Moritz Hardt34 have shed light on the fallacy of inherently neutral data mining through research on techniques to combat discrimination, and protect privacy. These technical approaches include both discrimination-blind and discrimination-aware data mining,35 privacy-aware data mining,36 and differential privacy.37 Legal scholars have also begun to delve deeply into how the mechanics of data mining, and the myth of its assumed neutrality, often undermine the assumptions predicating existing laws.38 The Federal Trade Commission’s Big Data report summarized relevant questions for engineers working with large data sets and trying to ascertain the risk of privacy violations or inherent discrimination, such as whether a relevant model accounts for biases, and how closely the dataset mirrors the population being measured.39
Ultimately, preliminary research is exactly that—preliminary. It does not answer all the tough questions raised by the use of big data, and how automated decision-making challenges existing legal frameworks designed to protect privacy and civil liberties. While understanding the mechanics of algorithmic decision-making is fundamentally necessary to prevent violations of privacy and civil liberties from simply being ignored, it is only the first step towards preventing them. At the very least, it is a start.
This piece is adapted from a memorandum I wrote as a summer clerk at the Electronic Privacy Information Center. A description of EPIC’s work on algorithmic transparency, and a compilation of related resources, can be found here.