Finding the Blank Spots in Big Data

Mimi Onuoha is an artist who works mostly with algorithms, data sets, and digital systems, but her best known work may be a file cabinet. White, metal, and unassuming, it’s the kind that used to line the carpeted halls of office buildings before the advent of Google Drive and iCloud. Sliding open Onuoha’s cabinet reveals a column of familiar brownish-green folders, hooked at the sides and marked on top by plastic tabs. The handwritten labels include: “Publicly available gun trace data,” “Trans people killed or injured in instances of hate crime,” “Muslim mosques/communities surveilled by the FBI/CIA.” But when you open any one of the folders, there’s nothing inside.

This is Onuoha’s Library of Missing Datasets, a physical catalog of digital absence. She created the piece in 2016 (and a second version in 2018), after realizing that even with all of the esoteric, eccentric datasets you can find online—every word in the Broadway musical Hamilton, a yearly estimate of hotdogs eaten by Americans on the 4th of July—there’s a lot of urgent, necessary data that’s suspiciously missing. “In spaces that are oversaturated with data, there are blank spots where there’s nothing collected at all,” she says in a video for Data & Society. “When you look into them, you start to realize that they almost universally intersect with the interests of the most vulnerable.”

Mimi Onuoha, *Library of Missing Datasets* (2016)

How often do we think of data as missing? Data is everywhere—it’s used to decide what products to stock in stores, to determine which diseases we’re most at risk for, to train AI models to think more like humans. It’s collected by our governments and used to make civic decisions. It’s mined by major tech companies to tailor our online experiences and sell to advertisers. As our data becomes an increasingly valuable commodity—usually profiting others, sometimes at our own expense—to not be “seen” or counted might seem like a good thing. But when data is used at such an enormous scale, gaps in the data take on an outsized importance, leading to erasure, reinforcing bias, and, ultimately, creating a distorted view of humanity. As Tea Uglow, director of Google’s Creative Lab, has said in reference to the exclusion of queer and transgender communities, “If the data does not exist, you do not exist.”

This is something that artists and designers working in the digital realm understand better than most, and a growing number of them are working on projects that bring in the nuance, ethical outlook, and humanist approach necessary to take on the problem of data bias. This group includes artists like Onuoha who have the vision to seek out and highlight these absences (and offer a blueprint for others), as well as those like artist and software engineer Omayeli Arenyeka, who are working on projects that collect necessary data. It also includes artist and researcher Caroline Sinders and the collective Feminist Internet, who are building AI models, chatbots, and systems that take into account data bias and exclusion in every step of their processes. Others are academics like Catherine D’Ignazio and Lauren Klein, whose book Data Feminism considers how a feminist approach to data science could curb widespread bias. Still others are activists, like María Salguero, who saw there was a lack of comprehensive data on gender-based killings in Mexico and decided to collect it herself.

The artists, programmers, designers, and technologists who are addressing this problem at a grassroots level understand that those most at risk of being excluded from data are also the most marginalized. What their projects have in common is that they take an intersectional feminist approach to this problem, using data as a way to challenge who has access to power and who doesn’t. They also bring idealistic, communal, and at times poetic approaches to the biggest hurdle when dealing with “blank spots”: having to address something that isn’t there.

Silenced by Data: Who’s Missing?

In the spring of 2014, Catherine D’Ignazio had just given birth and was finishing up her graduate degree at the MIT Media Lab. Each day, she would go to the lab, located in a soaring, 163,000-square-foot glass and steel building designed by architect Fumihiko Maki. When she needed to use a breast pump, she would head to a grubby, grey bathroom stall, the only private space she could find in a building full of transparent glass and sight lines. One day, frustrated after fumbling the unwieldy pump in the cramped space and spilling breast milk all over the floor, she asked herself: “Why was I using a crappy, loud machine to pump milk on the bathroom floor at one of the most elite, well-resourced engineering institutions in the entire world?”

The Media Lab Complex at MIT designed by Maki and Associates and completed in 2009.

The MIT Media Lab prides itself on “inventing the future,” which led D’Ignazio to wonder exactly whose futures the institution was mostly concerned with inventing. Did these futures include babies or breastfeeding mothers? “Really, we are centering elite, white, cis, heterosexual, abled men’s futures,” she clarified in a recent talk at Eyeo Festival 2019. “And that’s a really narrow group.”

D’Ignazio’s point brings up an important consideration for anyone concerned with addressing data bias: As we look into who’s missing from big data, we should also consider who is always, without question, there. Data, much like society as a whole, tends to privilege those who are white and male, and who have for centuries been considered the “human default.” This is an idea that goes back as far as evolutionary theory and runs so deep as to be embedded in our languages in the form of the generic masculine. And it’s a line of thinking so pervasive, even subconsciously so, that the male experience and perspective have come to be seen as universal, while any other perspectives are considered niche.

The idea that what is male is universal is the cause of what Caroline Criado Perez calls the gender data gap: a lack of data collected on women that has served to further naturalize sex and gender discrimination. “Because women aren’t seen and aren’t remembered, because male data makes up the majority of what we know, what is male comes to be seen as universal,” she writes in her recent book Invisible Women: Data Bias in a World Designed for Men. “It leads to the positioning of women, half the global population, as a minority.” And if it’s bad for all women, you can count with certainty on the fact that it’s far worse for women of color, disabled women, transgender women, and working-class women. For those groups, writes Criado Perez, “the data is practically non-existent.”

Within what Criado Perez calls the “gap,” we might also find what Onuoha calls “missing datasets.” Onuoha’s list includes: “People excluded from public housing because of criminal records;” “mobility for older adults with physical disabilities or cognitive impairments;” and “measurements for global web users that take into account shared devices and VPNs.” D’Ignazio adds to it the percentage of women dying during childbirth, for which there is still no national tracking system (Black women, by the way, are most disadvantaged by this absence, since they are 243 percent more likely to die from pregnancy or childbirth-related causes than white women). We also still don’t know the size of the transgender population, as data journalist Mona Chalabi has pointed out. And in Mexico, there was no comprehensive data on the killing of women and girls because of their gender (“femicides”), so a human rights activist and geophysical engineer named María Salguero built a database herself. She alone has logged over 8,000 cases of femicide since 2016.

María Salguero, Los feminicidios en México.

At this point it’s easy to see the outlines of a pernicious feedback loop: the interests of those who are least valued by our society are the least likely to be seen as worthy of data collection, but in order to be valued, you have to be counted. Who a system or data set is created for, and who it is created by, should be front of mind for anyone working with data. This is especially important as our world becomes more and more automated. AI helps doctors with diagnoses, it scans through resumes, it aids in policing. Machines help make decisions of civil and social importance, and they do it by sifting through large amounts of data. “In machine learning, data is what defines the algorithm: it determines what the algorithm does,” artist and researcher Caroline Sinders writes in an essay for Dilettante Army. But the datasets that AI is being trained on are riddled with gaps, and they can’t be fixed by merely running around and plugging the holes.

At the same time that we start to really reckon with systemic bias at a societal level, it’s crucial to also scrutinize the data that informs so many of our systems and decisions. Those working closest with our data and the design that’s informed by it may seem best equipped to lead that charge, but large technology companies aren’t exactly the nimblest to do so, and the makeup of the teams entrusted to design these systems does not usually reflect the groups most disadvantaged by them. Take Google, for example, which runs the most popular search engine in the world: in 2016, an MBA student named Rosalia tweeted that a Google search for “unprofessional hair” yielded mostly image results of Black women with natural hair while a search for “professional hair” brought forth images of white women with updos. This is no longer the case, but at the time the cause was found to be that Google determined what was in the photos from the text and captions around them, many of which, in this case, were written by Black women criticizing the perception that natural hair is unprofessional. Google’s algorithm was pulling information without consideration of the context, and it took a Black woman outside of the company to recognize that lack of nuance and call it out as a problem.

To take another Google search example, searching the word “hand” brings up images of white hands, no matter where you are in the world. When designer Johanna Burai realized this, she created the project World White Web, which asks people to share images she’s collected of non-white hands to increase their ranking on Google Search results. There’s also Safiya Umoja Noble, who in her book Algorithms of Oppression proposes the idea of a noncommercial, transparent search engine that displays results in a visual rainbow of color (orange for entertainment, green for business, red for pornographic) so people can find “nuanced shades of information.” Ultimately, artists and researchers are the ones who have taken it upon themselves to make work that critiques data bias at the same time as it seeks to remedy it, or at least provides a viable blueprint for going forward.

Filling the Gaps: Data as a Form of Protest

In 2016, Sinders had just finished an Eyebeam/Buzzfeed residency researching online harassment and hate speech, when she found herself in need of a palate cleanser. In response to the volatile displays of hate she had just spent so much time cataloging, she decided to embark on a project that positioned data collection as a form of resistance, a force for good. She used the Feminist Principles of the Internet as a framework, then set out to create a dataset about feminism (made up of the names of important feminist figures, texts, movies, books, and concepts) that was also collected in a feminist way. The result is Feminist Data Set, a series of physical workshops and lectures in which Sinders and her collaborators collectively and “slowly” gather data in an effort to communally define feminism.

Caroline Sinders’ Feminist Data Set workshop at SOHO20 (2018).

Over the past three years, the project has expanded, and Sinders is now analyzing the entire pipeline of machine learning—from data collection to creating and training a data model to designing an algorithm—through a feminist lens. “I’m analyzing how data is caught and kept, and how it’s viewed by society,” she says. “What do we use it for? Who has access to it? How might that data be misused or misrepresented?” She’s now working on creating a data model, a phase at which, Sinders says, it’s pretty standard for machine learning projects to use Amazon’s labor force, Mechanical Turk, to label data. But in step with the Feminist Data Set methodology, Sinders questioned whether Mechanical Turk is a feminist system. The conclusion was that it’s not—“A system that creates competition amongst laborers, that discourages a union, that pays pennies per repetitive task, and that creates nameless and hidden labor is not ethical, nor is it feminist,” she writes—so now she’s setting out to create an ethical Mechanical Turk system of her own.

In the effort to create better datasets, Sinders’ Feminist Data Set project is joined by Omayeli Arenyeka’s The Gendered Project. In spring 2019, Arenyeka wrote code that scraped an online dictionary for gendered and sexualized terms, which were then hand filtered by her and a collaborator, and collected in an online resource that now contains over 2,000 terms. She hopes it will call attention to the imbalance in male and female gendered terms and their connotations. “It’s helpful to be able to look at the data and think about the language we use more critically,” Arenyeka said when the project came out.

Omayeli Arenyeka’s The Gendered Project (2019).

Both Sinders’ and Arenyeka’s projects are small, but they rely on collective effort to grow the datasets and make sure that they are taking all kinds of people into account, not just the default. This is what Sinders means when she calls the principles that guide her work “inherently feminist,” even while acknowledging that they would also fall into the broader category of “ethical design.” It’s also what D’Ignazio, the MIT researcher, means when she says that data science needs an “intersectional feminist approach” (a reference to the term “intersectionality,” coined by Kimberlé Crenshaw, which describes how different forms of discrimination, such as racism, sexism, and imperialism, overlap and compound each other in the formation of social inequity). In other words, these projects are not just addressing gender discrimination in data; they strive to create better datasets and frameworks to reduce data inequity, period. As D’Ignazio puts it, “Intersectional feminism is not only about women and gender. It is about power: who has it, and who doesn’t.”

That these efforts are willing to challenge existing power structures owes to the fact that they are led by artists and designers who are data literate but not beholden to the interests of Big Tech. However, many of these projects seek to become a guidepost for big companies or institutions that hold so much influence over our data and how it’s used (Facebook, for example, and its unethical selling of user data to Cambridge Analytica, or the data used by airport body scanners to discriminate against people who are gender non-conforming or transgender). The UK collective Feminist Internet has built a feminist chatbot called F’xa that acts as both an educational tool for teaching people about data bias and a proof of concept of the collective’s Personal Intelligent Assistant Standards. Chatting with F’xa brings forth simple explanations of how AI works and how bias occurs in AI systems (“when they reflect human biases held by the people involved in coding, collecting, selecting, or using data to train the algorithms that power the AI”). F’xa speaks clearly yet informally, doesn’t store any data from the conversation, and has no discernible gender. It also doesn’t use the word “I,” as a way of calling attention to the “complex issues that come up around the emotional attachments people form with bots.”

F’xa, Feminist Internet’s feminist chatbot.

Feminist Internet hopes that the standards it used to build F’xa, created by Feminist AI researcher Josie Young, will be used by researchers and tech companies to create chatbots with feminist values, which they define as values that help ensure you don’t “knowingly or unknowingly perpetuate gender inequality in the things [you] build.”

Sinders also points to the Secure UX checklist, a set of guidelines for “developers who want their designs to protect digital security and privacy for communities they are helping,” that she developed with Sage Cheng, Martin Shelton, Matt Mitchell, Natalie Cadranel, and Soraya Okuda. “These principles can scale,” she says. “Maybe Facebook can’t implement them all today, but they could at least be inspired by them.” Ultimately, projects like Sinders’, Arenyeka’s, Onuoha’s, and Feminist Internet’s are valuable in their willingness to push ideas further, to work collectively and inclusively, and to refuse to accept profit-making and technological progress at the expense of the most vulnerable as an inevitability. These data sets and design systems matter because their creators have an actual stake in their creation. That’s something that companies and institutions that wield so much power over our data and digital products could learn from.

This story is part of an ongoing series about UX Design in partnership with Adobe XD, the collaboration platform that helps teams create designs for websites, mobile apps, and more.
