Explore my computational, design, research, and consulting work at lucyhavens.com

What I Talk About When I Talk About Bias (part II)

In the first post of this series, I wrote about the distinction between overt and covert bias, giving examples of how a search engine could exhibit sexism whether or not a search query specified a gender.  In this post, I write about why technologies such as search engines exhibit social biases, such as sexism, racism, and ableism.

As I’ve seen the term “Artificial General Intelligence” popping up more and more, I’m amazed at how confidently people seem to be pursuing it as a goal.  The problem with the word “general” is the same as the problem with the word “good” (as in “technology for social good”) or “standard” (as in “Standard American English”): who defines “general”?  Are the people driving the Artificial General Intelligence (AGI) agenda qualified to define “general” at a global, or even national, scale?  Are we anywhere close to having data that encompasses enough of the world’s knowledge to train an AGI model?

Who Defines “General?”

In a research paper I published a few years ago as a Ph.D. student in Edinburgh, I reviewed the Higher Education Statistics Agency’s survey results for computer science and data science students in the UK.  Overwhelmingly, students were reported to be white, male, and without a disability.  The WISE Campaign’s statistics about the UK’s STEM workforce showed that women accounted for only about a quarter of it.  Now that I’m back in the US, I thought I’d look for similar statistics about STEM students and the STEM workforce here.  In a Pew Research Center report, I read that in the US, women overall are earning more STEM degrees than men and account for about 50% of the STEM workforce.  That said, women make up only about a quarter of the computing workforce, which is the workforce responsible for creating AI.  The STEM workforce in the US is also mostly white.

The growing diversity of demographics among students earning STEM degrees was a pleasant surprise!  Still, if we think about the demographics of the workforce building AI and aiming to build AGI, there’s a lot of room for improvement.  

The proportion of men working as computer scientists, computer programmers, and developers ranges from 70% to 81%.  Since the survey only categorizes people as women or men, people working in these roles who identify as trans, non-binary, or another gender-diverse identity are either misrepresented or excluded.  The proportion of white people working as computer scientists, computer programmers, and developers ranges from 54% to 67%, while the proportion who identify as Hispanic or Black ranges from just 4% to 10%.

Given these statistics, I find it highly unlikely that AI teams working towards the “general” in AGI have a diverse enough experience of the US, let alone the world, to determine what constitutes general intelligence.

Statistics aside, the harmful behaviors of AI give further evidence of how far we are from general intelligence.  Joy Buolamwini and Timnit Gebru’s study of commercial facial recognition systems found that the systems consistently recognized men and white people better than people of other genders and ethnicities, performing especially well on white men.  In the book Weapons of Math Destruction, author Cathy O’Neil gives examples of how applications of big data analytics such as AI are undermining democracy and amplifying social and economic inequalities.  Virginia Eubanks also reports on how data-driven technologies such as AI reinforce economic inequalities by trapping people in poverty in the book Automating Inequality.

Clearly, we are a long way from creating technology that serves the general public, let alone embodies general knowledge.

A Dataset of the World’s Knowledge

The assumption behind the goal of creating AGI is that there will be a dataset comprising the world’s knowledge, because a model would need to be trained on such a dataset to create AGI technology.  Back in 2020, before Twitter was X, Jack Dorsey likened tweets on the platform to humanity’s consciousness in an interview with Lex Fridman.  This is a nice-sounding metaphor, but the statistics don’t support it.  As of 2021, there were about 4.26 billion social media users worldwide.  Since many accounts exist for non-humans (organizations, events, pets, bots, etc.), I feel confident stating that less than half of the global population was on social media, let alone Twitter.  As of 2021, there were about 353.1 million Twitter users.  Even if each user did correspond to one person, that would represent only about 4.5% of the global population.
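That 4.5% figure is easy to sanity-check. A quick sketch, assuming the World Bank’s 2021 world population estimate of roughly 7.89 billion (the Twitter figure is the estimate cited above):

```python
# Rough check: what share of the 2021 global population could Twitter's
# user base represent, even if every account were one real person?
world_population = 7.89e9  # people, 2021 World Bank estimate (assumption)
twitter_users = 353.1e6    # Twitter user accounts, 2021 estimate

share = twitter_users / world_population
print(f"{share:.1%}")  # prints 4.5%
```

And that’s an upper bound, since many of those accounts belong to organizations, bots, and other non-humans.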

So we’re not there yet, you may say, but we will be soon.  Soon, EVERYTHING will be data-fied and scraped to train AI models that are generally intelligent.  Sorry, but no.

Even if you expand from social media users to Internet users, you’re still not close to a representative sample of the world’s population.  As of 2021, about 63% of the world’s population was estimated to use the Internet regularly, based on usage on any device over a three-month period of the year.  Broken down by region, the populations of Africa and Central America are underrepresented, which means the knowledge of people living in those regions would be underrepresented in any dataset of global Internet users’ knowledge.

To illustrate the amount of the world’s intelligence excluded from AI models, I created the Sankey diagram below.* The diagram visualizes estimates of the number of Internet, social media, and Twitter users relative to the world’s population in 2021. Internet users were counted as the number of people who accessed the Internet through any device over a three-month period (this is how The World Bank measures Internet usage). When counting social media accounts, it’s important to remember that accounts are created for organizations, events, pets, and bots, so the number of people using social media is actually lower than the number of social media accounts. The same goes for Twitter: the number of user accounts visualized is higher than the number of individual people on Twitter.
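A Sankey diagram like this boils down to source→target→value flows, which tools such as RAWGraphs can read from a CSV. Here is a minimal sketch of how those flows could be derived from the 2021 estimates cited in this post; the file name, flow labels, and exact splits are my own illustration, not the published diagram’s, and the account counts overstate the number of actual people:

```python
import csv

# Approximate 2021 estimates cited in this post (counts of people or accounts).
WORLD = 7.89e9           # world population (assumed World Bank estimate)
INTERNET = 0.63 * WORLD  # ~63% of the world used the Internet regularly
SOCIAL = 4.26e9          # social media user accounts
TWITTER = 353.1e6        # Twitter user accounts

# Each row is one flow in the Sankey diagram: source, target, value.
rows = [
    ("World population", "Internet users", INTERNET),
    ("World population", "No regular Internet use", WORLD - INTERNET),
    ("Internet users", "Social media accounts", SOCIAL),
    ("Internet users", "No social media account", INTERNET - SOCIAL),
    ("Social media accounts", "Twitter accounts", TWITTER),
    ("Social media accounts", "Other platforms only", SOCIAL - TWITTER),
]

with open("sankey_flows.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["source", "target", "value"])
    writer.writerows(rows)
```

Laying the data out as flows makes the drop-offs explicit: each level’s total equals its parent’s share, so nothing silently disappears between “world population” and “Twitter accounts.”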

The vision of a future in which AI comprises “the repository of all human knowledge and culture” is simply not feasible.  There are forms of knowledge that aren’t recorded in digital formats, so they can’t be included in AI training datasets.  There are cultures whose people do not wish to share their knowledge openly online.  There are communities of people whose histories have been misrepresented or erased.  There are countless artworks, books, buildings, and other cultural heritage artifacts that have not been digitized and likely never will be.  We are producing information at a rapidly accelerating pace, so the influx of new artifacts to galleries, libraries, archives, and museums far outpaces the speed at which those artifacts can be digitized.  (More on this in a future post…)

Besides, data have to be recorded, so the moment they exist, they are already out of date.  The attempt to build AI that captures all the world’s knowledge is a losing battle.  The world is an evolving place, so we should be creating AI models that leave room for change and uncertainty.

If you’re interested in hearing more on why AGI is smoke and mirrors, I recommend this podcast episode, this research paper, and this interactive report!

*The data for the visualization in this post were gathered from The World Bank, Kepios, Statista, and Insider Intelligence. You can see my code for analyzing data from these sources in my GitHub repo. I created the Sankey diagram with RAWGraphs and Adobe Illustrator.
