“Algorithmic Monocultures in Hiring,” by Rishi Bommasani, Sarah Bana, Kathleen A. Creel, Dan Jurafsky, and Percy Liang, examines how automated systems built by the same few algorithm vendors can cause the same applicants to face rejection again and again, noting “clear racial disparities.”

    The study’s unique, position-by-position investigation is eye-opening for both job seekers and those hoping to find the best candidates for the job. We spoke with the authors about what they found.

     

    What is “algorithmic monoculture”?

     

    Sarah Bana: Algorithmic monoculture, to me, is any circumstance in which similar outcomes occur because of algorithms. There are plenty of simple algorithms, like needing a college degree or three years of experience, before getting a job. But the more complex algorithms that are now appearing in the labor market produce similar outcomes through more complicated processes. These machine learning tools, built to characterize opaque elements like “fit,” generate similar outcomes across firms with less interpretability. 

    Kathleen Creel: Algorithmic monoculture occurs when the same algorithm dominates a sector, or in its weaker but more typical form, algorithms made in similar ways using similar data such that they make similar decisions. 

     

    Can you explain your study’s methodology?

     

    Sarah: We looked at 4 million applications across 3 million applicants, all screened by the vendor pymetrics. We did two primary sets of analyses: one on bias, and the other on homogenization.

    For the bias analyses, we looked at the applicants that had provided race data at the position level. US employment law flags a position when one group is recommended at less than 80% of the rate of the most-recommended group — this is the “four-fifths rule.” So we calculated the recommendation rate of each group, and tested whether any group had a recommendation rate that was statistically significantly different from the highest passing group, and lower than 80% of the rate of the highest passing group.
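
    The two-part check described above can be sketched in a few lines. This is a minimal illustration, not the authors’ code: the function name `four_fifths_flags`, the input layout, the 0.05 significance level, and the choice of a pooled two-proportion z-test are all assumptions, since the interview does not say which statistical test the study used.

```python
from math import sqrt, erfc

def four_fifths_flags(counts, threshold=0.8, alpha=0.05):
    """counts maps group -> (recommended, total) for one position.

    Hypothetical helper: flags a group when its recommendation rate is
    below `threshold` times the top group's rate AND differs from the
    top group's rate at significance level `alpha`.
    """
    rates = {g: rec / tot for g, (rec, tot) in counts.items()}
    top = max(rates, key=rates.get)          # most-recommended group
    top_rec, top_tot = counts[top]
    results = {}
    for group, (rec, tot) in counts.items():
        ratio = rates[group] / rates[top]    # impact ratio vs. top group
        if group == top:
            p_value = 1.0
        else:
            # Pooled two-proportion z-test (an assumed choice of test).
            pooled = (rec + top_rec) / (tot + top_tot)
            se = sqrt(pooled * (1 - pooled) * (1 / tot + 1 / top_tot))
            z = (rates[top] - rates[group]) / se if se > 0 else 0.0
            p_value = erfc(abs(z) / sqrt(2))  # two-sided p-value
        results[group] = {
            "impact_ratio": ratio,
            "p_value": p_value,
            "flagged": ratio < threshold and p_value < alpha,
        }
    return results

# Made-up counts for a single position, purely for illustration.
example = four_fifths_flags({"A": (600, 1000), "B": (400, 1000), "C": (550, 1000)})
for group, row in example.items():
    print(group, round(row["impact_ratio"], 3), row["flagged"])
```

    In this made-up example, group B is recommended at about two-thirds of group A’s rate and would be flagged, while group C, at roughly 92% of group A’s rate, would not.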

    For the homogenization analyses, we started by looking at how many models were used across firms. This number was 42.

    We also looked at the probability of being systemically rejected, that is, being rejected for every position you applied to. In our other work, we establish the concept of a baseline, which helps us understand what that rate would be if the models made independent decisions. In the pymetrics data, 10 percent of applicants who apply to 4 positions are systemically rejected, and the observed rate diverges substantially from that baseline.
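
    One way to picture the comparison between an observed systemic rejection rate and an independence baseline is the sketch below. It is an illustration under stated assumptions, not the study’s method: the function name `systemic_rejection`, the record layout, and the idea of estimating the baseline as the product of each position’s overall rejection rate are all hypothetical, because the interview does not spell out how the baseline is constructed.

```python
from collections import defaultdict
from math import prod

def systemic_rejection(records, k):
    """records: iterable of (applicant, position, rejected) tuples.

    Returns (observed, baseline) for applicants with exactly k
    applications, or None if there are no such applicants.
    """
    by_position = defaultdict(list)
    by_applicant = defaultdict(list)
    for applicant, position, rejected in records:
        by_position[position].append(rejected)
        by_applicant[applicant].append((position, rejected))

    # Overall rejection rate of each position.
    reject_rate = {p: sum(v) / len(v) for p, v in by_position.items()}

    group = [apps for apps in by_applicant.values() if len(apps) == k]
    if not group:
        return None

    # Observed: share of applicants rejected by every position they applied to.
    observed = sum(all(r for _, r in apps) for apps in group) / len(group)

    # Assumed baseline: if decisions were independent, the chance of being
    # rejected everywhere is the product of the positions' rejection rates.
    baseline = sum(prod(reject_rate[p] for p, _ in apps) for apps in group) / len(group)
    return observed, baseline

# Toy data: two positions whose decisions agree perfectly.
toy = [
    ("a1", "P1", True), ("a1", "P2", True),
    ("a2", "P1", True), ("a2", "P2", True),
    ("a3", "P1", False), ("a3", "P2", False),
    ("a4", "P1", False), ("a4", "P2", False),
]
print(systemic_rejection(toy, k=2))
```

    In the toy data each position rejects half of its applicants, so independent decisions would reject an applicant everywhere a quarter of the time, yet half of the applicants are rejected everywhere because the two positions always agree. A gap of that kind is what the comparison is meant to reveal.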

    But when we used our methods to analyze the largest prior study of hiring decisions, which sent 83,000 applications to 108 Fortune 500 firms, we found that the systemic rejection rates observed in their data were very accurately predicted by assuming that employers make statistically independent decisions. So there’s something different going on in our algorithmically mediated data, even though the time periods were similar.

     

    What would you say is the main takeaway?

     

    Kathleen: We’ve speculated in past work that if many firms relied on the same AI vendor to screen job applicants, that could prevent some applicants from getting any interviews.  But this study was the first time we were able to show this effect in real hiring data. 

    Sarah: I think the most significant result of our study is how much bias we find in this algorithmic hiring system. The vendor has published aggregated audits indicating that their tools do not show measurable bias. In that way, I was surprised, because I thought that their algorithms would be an example of best practice. When you read that something you’re buying has been audited, you tend to take that finding at face value, and that’s likely part of what is going on.

     

    Before we get to the results, can you explain the pymetrics platform at the center of the study?

     

    Sarah: Of course. Pymetrics is a response to a long tradition of industrial-organizational (I/O) psychology — the discipline that aims to improve work outcomes for individuals and organizations. Earlier generations of job screening tools used personality tests that you’d take on a computer; Autor and Scarborough (2008), for example, analyzed a test built around the Five Factor model (conscientiousness, agreeableness, extroversion, openness, and neuroticism).

    Pymetrics’s founders argued there was a better way to measure personality than asking people about themselves. The games they have candidates play — short, gamified tasks rooted in neuroscience — are designed to surface those traits through behavior rather than self-report. In many ways, pymetrics is trying to improve systems that have historically been quite biased: people often struggle to get jobs because of what’s on (or missing from) their resume, and a process that doesn’t rely on resumes would, in principle, remove that barrier.

    But behavior also encodes who we are. One of the pymetrics games involves popping balloons to measure risk tolerance — and a friend recently pointed out that risk aversion looks very different for someone at the poverty line than it does for someone who has never missed a meal.

    Ideally, the future of this kind of assessment would involve simulating actual work as part of the interview. I’m very sympathetic to firms trying to hire — I recently went through an interview process that took more than five months end to end, with far fewer applicants than the median pymetrics position handles. AI makes hiring faster and cheaper, but at a cost we need to be willing to measure and audit.

     

    What groups experience the most adverse impact?

     

    Sarah: We see many Black and Asian applicants adversely impacted. We don’t have any causal evidence here, but my guess is that the behaviors being picked up by the games are functioning as proxies for race, the kind of bias that is hard to remove without explicit adjustments to the trained models.

    There’s also a structural piece that we don’t observe with the data we have access to. The models are trained against each firm’s current employees in a given role, and those workforces likely aren’t very diverse to begin with. 
