
Researchers report that AI models trained mainly on Global North data treat regional words from Brazil’s Center-West and Northeast as statistical noise — and argue that fixing this requires more than just regional datasets; it requires treating data as a cultural meaning-making system.
https://doi.org/10.25189/2675-4916.2026.v7.n3.id925

3 Comments
I mean, go figure the Brazilian model speaks better Portuguese. Any reason to think that Portuguese texts are anything more than “statistical noise” in the training set?
>This article examines the processes of construction of meaning in generative AI through the lens of discursive semiotics, focusing on how Big Data and datafication operate as semiotic regimes. Drawing upon the concepts of semiotic practices and forms of life the analysis describes how the intangible and dynamic process of datafication configures practical scenes that, once stabilized within Big Data, privilege particular forms of life.
This reads like some kind of parody.
Taking a look at their methods it may still be parody. They basically played “what do I have in my pocket” with chat models.
Asking chatgpt and a Brazilian model about an obscure Brazilian slang term with zero context then building a narrative about how its “cultural erasure” that chatgpt brings up other historical uses of the same term rather than what the academic was thinking of.
Big surprise: the Brazilian model assumes you’re talking about something in the context of Brazil.
Wow, treating data as what it actually is? Heaven forbid we do that, we’re already losing billions.