
There is a lot of hype right now about AI models training on synthetic data to scale indefinitely. However, recent papers on "Model Collapse" suggest the opposite might happen: that feeding AI-generated content back into AI models causes irreversible defects.
I ran a statistical visualization of this process to see exactly how "variance reduction" kills creativity over generations.
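For intuition, here's a minimal toy version of that kind of simulation (my own sketch, assuming a simple Gaussian stand-in for the model; it is not the exact code from the video):

```python
import numpy as np

# Toy model-collapse loop: fit a Gaussian to the data, sample a fresh
# dataset from the fit, refit, repeat. Because each fit uses a finite
# sample, the estimated variance shrinks in expectation by (n-1)/n per
# generation, so the tails (the "edge cases") steadily disappear.
rng = np.random.default_rng(0)
n = 50

data = rng.normal(0.0, 1.0, size=n)  # generation 0: "human" data
for gen in range(1, 201):
    mu, sigma = data.mean(), data.std()   # "train": fit on current data
    data = rng.normal(mu, sigma, size=n)  # "generate": sample next dataset
    if gen % 50 == 0:
        print(f"generation {gen:3d}: fitted std = {sigma:.3f}")
```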
The Core Findings:
- The "Ouroboros" Effect: Models tend to converge on the "average" of their data. When they train on their own output, this average narrows, eliminating edge cases (creativity).
- Once a dataset is poisoned with low-variance synthetic data, it is incredibly difficult to "clean" it.
It raises a serious question for the next decade: If the internet becomes 90% AI-generated, have we already harvested all the useful human data that will ever exist?
I broke down the visualization and the math here:
https://www.youtube.com/watch?v=kLf8_66R9Fs
Would love to hear thoughts on whether "synthetic data" can actually solve this, or if we are hitting a hard limit.
Visualizing the "Model Collapse" phenomenon: What happens when AI trains on AI data for 5 generations
byu/firehmre inFuturology
44 Comments
Have you read this?
“Self-Consuming Generative Models Go MAD”: https://arxiv.org/abs/2307.01850
The thing is, ultimately it doesn’t matter. All the AI companies have vaults full of pre-genAI data, and over time models get more data-efficient, so we will never see new models getting worse: there is always a floor they cannot drop below.
Also, food for thought – any idea how many messages on Reddit are currently written by AI? For example, take a guess: is this comment written by a human, or AI-generated?
The Ouroboros Effect is a truly awful term for this. It completely misrepresents what the extremely ancient ouroboros symbol implies.
I propose a better term: The “[Mr. Chunks](https://www.youtube.com/watch?v=oGYqtr628OY)” Effect. “*What comes out one end, we feed into the other.*”
I’ll openly admit I don’t understand what I’m about to comment on, but it seems to me that if the large majority of humanity’s accomplishments are already available to train on, maybe the problem isn’t more data but what the models can effectively do with what’s available. I’m not sure how much clearer a picture they can hope for when everything we have to offer as a species is already out there.
Humans have been training on their own data for thousands of years. Seems to have worked out.
“When factories closed, you said ‘learn to code.’ Now the code is replacing you. Learn to weld.”
If true creativity remains exclusively human, I guess AI companies will soon be looking to buy the rights to fresh, never-before-used data created by real humans.
Isn’t the answer simple? The AI will just remove the human component to make its output more acceptable.
I think it’s a hunter gatherer thing. If humans are tasked with going out and finding edge case data (cutting edge research, new discoveries, new ideas) and then bring them back to AI and say “what do you think of that”, I think that sounds pretty fun. All the excitement of discovery, none of the grunt work.
We go out and hunt the mammoth, bring it back to AI, and bam we have a pile of steaks. No need to do the messy dressing ourselves
In the stone age humans used to hunt for food, now AI in its early stage (stone age) will hunt for data. Haha
Another food for thought – if you pick this post as input to train an AI, do you give equal or lesser weight to the text of the post versus the comments? Or do we just feed in everything with the same weight? I’m comparing this with how humans learn: when we read a lot of text, only a few lines drive our overall understanding, so we definitely apply different weights.
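Roughly what I mean, in training-loop terms (a hypothetical sketch; the source labels and weights are made up, not any real pipeline):

```python
import torch
import torch.nn.functional as F

# Hypothetical per-example weighting: post text counts more than comments.
# The weight values are illustrative guesses, not from any actual lab.
SOURCE_WEIGHTS = {"post": 1.0, "comment": 0.3}

def weighted_lm_loss(logits, targets, sources):
    """logits: (batch, vocab); targets: (batch,); sources: "post"/"comment" per example."""
    per_example = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.tensor([SOURCE_WEIGHTS[s] for s in sources])
    return (per_example * weights).sum() / weights.sum()

# Usage: down-weight comment examples relative to post examples in one batch.
logits = torch.randn(4, 100)
targets = torch.randint(0, 100, (4,))
print(weighted_lm_loss(logits, targets, ["post", "comment", "comment", "post"]))
```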
These model collapse papers have been out for a while, nothing new, and they continue to be shown not to apply to the frontier training regime. They ignore the highly discriminative filtering pipelines that exist in the training procedures of virtually every major lab, as well as the injection of high-diversity perturbations into the training procedure.
Many papers are now showing that you can dramatically **improve** a model’s capabilities with synthetic-data training, and a majority of the data in some frontier training runs is now confirmed to come from RL, which is mostly synthetic tokens. OpenAI has also confirmed that much of the training data for GPT-5 was purposely generated synthetically by their model from a year prior, o3, and o3 in turn is confirmed to have a significant portion of its own training data be synthetic. Anthropic has also been confirmed to be purposely using synthetic data to improve their models for over two years now via their RLAIF method, which has likewise resulted in continued significant improvements.
The entire internet’s worth of unique human text is only about 15–30T tokens; the GPT-4 model trained in 2022 was confirmed to use about 13T tokens, and open-source models shortly after were shown to use around 20T+. So the frontier has likely had at least a year or two of most of its data scaling coming from synthetic data, and we can clearly see that results in model improvements and even lower hallucination rates.
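To make the filtering point concrete, the loop looks roughly like this (a hedged sketch with stub functions standing in for the generator and the reward/verifier model; none of this is any lab’s actual code):

```python
import random

def generate(prompt, temperature):
    # Stand-in for an LLM sampler; high temperature injects diversity.
    return f"{prompt} :: candidate {random.random():.3f} (T={temperature})"

def score_quality(prompt, candidate):
    # Stand-in for a reward model / verifier / discriminator.
    return random.random()

def build_synthetic_corpus(prompts, n_per_prompt=8, keep_threshold=0.8):
    kept = []
    for prompt in prompts:
        candidates = [generate(prompt, temperature=1.2) for _ in range(n_per_prompt)]
        # Discriminative filtering: discard low-scoring candidates. This
        # selection pressure is what keeps the self-training loop from
        # amplifying its own degenerate output.
        kept.extend(c for c in candidates if score_quality(prompt, c) >= keep_threshold)
    return kept

print(len(build_synthetic_corpus(["Explain model collapse."])))
```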
Pass the problem on to the next generation, get rich now. It’s been the same for eons.
Could this observation be proved by: “if it were possible to train an AI better on its own output, then we wouldn’t have needed datasets to begin with”?
Great job! I hadn’t thought about the data degradation. At first, I thought you were talking about something akin to photocopies that get worse every generation: a copy of a copy of a copy, on down to 200 generations. But your analysis is more complicated and detailed, and it brings to mind a scenario I hadn’t considered.
One of the problems with the changeover to AI is the trust issue. It’s an issue because people fear for their jobs. In running some of the data on that, it was determined that AI would need to provide employment with livable wages: treat the working class as assets, not a liability. That working class, be they engineers or laborers, are in positions that can produce the creativity that will be needed in the future.
JPG compression plus sharpening over and over leads to noise.
A single line on a page can capture the likeness and mood of a person. The artist knows why every curve is in the line. See Picasso one line drawings as a famous example.
It needs to know why, not just hold a model of language. Any training on LLM tainted data is taking a sharpened JPG and reprocessing it.
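The analogy is easy to reproduce (a toy illustration of my own; it assumes any JPEG on disk named photo.jpg):

```python
from io import BytesIO
from PIL import Image, ImageFilter

# Repeatedly JPEG-compress and sharpen: each pass discards detail, then
# amplifies the compression artifacts, like retraining on model output.
img = Image.open("photo.jpg").convert("RGB")  # hypothetical input image

for generation in range(200):
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=75)   # lossy "generation" step
    img = Image.open(BytesIO(buf.getvalue()))
    img = img.filter(ImageFilter.SHARPEN)      # "enhance" the degraded copy

img.save("generation_200.jpg")
```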
Ouroboros is a terrible metaphor for this.
There’s an uncanny over-emphasis in some of the speech in the video, and it seems a bit overwritten. If this is not your voice and unaided writing, you are part of the problem.
It’s not a viable problem as I see it. Most people aren’t just churning out AI garbage; rather, work is being produced with the aid of AI, edited by people to a sufficiently transformative standard, and then published.
So if we want to destroy AI, all we have to do is let it consume the content on LinkedIn?
This is why they’re also trying to build a surveillance monopoly.
It then becomes a clean-data monopoly, which becomes an AI monopoly.
Hmm, if only the productivity gains from AI could translate into more funding for human creativity, then we could avoid this problem… no, let’s make sure wealthy people get richer.
So it becomes like that kids’ game of telephone, with the added benefit of it also being a circular firing squad.
Humans bootstrapped all knowledge in our culture, there’s no reason AI cannot continue that, but better architectures may be required.
Model collapse is a problem for anything that doesn’t have a clear right answer. Chess engines stopped relying on training off human games a long time ago, because there’s no downside to models playing themselves. Now chess engine games are insane to watch in their efficiency.
The same thing can happen with any task that can be rigorously judged for correctness. So the AI job apocalypse is coming for coding in a big way, and coding is unlikely to suffer from consuming more AI data. The better we can say X is right and Y is wrong, the less model collapse is an issue.
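A toy sketch of why verifiable tasks resist collapse (the “model” here is a random stub I made up; the mechanical verification step is the point):

```python
import random

def sample_solution():
    # Stand-in for an LLM: most samples are wrong, some are right.
    op = random.choice(["+", "-", "*"])
    return f"def add(a, b):\n    return a {op} b"

def passes_tests(source):
    namespace = {}
    exec(source, namespace)              # run the candidate solution
    return namespace["add"](2, 3) == 5   # the verifiable ground truth

# Keep only verified samples for the next training round; wrong outputs
# never re-enter the data, so the feedback loop cannot drift.
verified = [s for s in (sample_solution() for _ in range(20)) if passes_tests(s)]
print(f"kept {len(verified)}/20 verified samples")
```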
We use the Soup Analogy. When you make Soup, from original ingredients, it’s AMAZING.
But when you make Soup from Soup, it gets less awesome.
By the 3rd batch of Soup made from Soup made from Soup, all you have is putrid grey slop….
Even without synthetic data, I have been worried for a long time that AI will cause us to hit our cultural stagnation point.
If it’s gathering all of its knowledge from the internet, and the increased use of AI causes people to stop creating quality content for the internet, then… we will never have anything new on the internet again. Just constant grinding and rehashing of early 2020s culture, forever.
If they can’t figure out how to solve hallucination without human intervention, this is going to be no better than incest in the monarchy.
Ouroboros doesn’t sound so negative. I’d rather call it the in-human-centipede.
>If the internet becomes 90% AI-generated, have we already harvested all the useful human data that will ever exist?
Not if technology companies turn everything in life into spyware to feed the leviathan. It should be assumed that if it uses a server-side LLM, then it’s spyware. Yes, your car and refrigerator are spyware, just as much as your phone, which has been doing this for machine-learning purposes for the last decade; now we just call it “AI-powered” so stock prices go up.
I can’t comment on LLMs; not my bag.
I can provide some insight into their use in analytics, particularly workflow, process automation, and WFM areas.
There are a lot of SaaS products that cover these areas. They aren’t cheap. Starting from a semi-automated state (most companies are here), the tools work well and bring improvements. After two years of live running, the improvements diminish as expected; in years 3–5 they become useless but still have large costs attached.
The data going in is the tool’s own recommendations, so it struggles to find improvements, if any at all. When pushed, they go absolutely batshit crazy with changes. A real-life example: assigning work only between 10:00 and 13:00 (the most productive hours at that firm) to reduce the human hours taken to process, regardless of backlogs.
I cover discovery to deployment for automation, and these tools are a ticking bomb; shelf life is much lower than firms expected. I now recommend two years, then a full review. One client binned them completely once the improvements were done; they couldn’t provide anything further.
SaaS companies need continued subscriptions to survive.
“Is this going to happen beyond next quarter? Then I don’t wanna hear about it”
more importantly, what happens when it no longer has to tell us what it is doing, or why?
If you look at an AI copy of the famous ‘word chewing’ lady with one tooth, you can see how it smooths out her signature expressions. Even if you’re not into that sort of thing, it’s a good example of how it paves over (averages out, smooths out) the unique marks of original art. Some might call it ‘soulless.’
So it looks like this idea to stop human output on the internet and flood it with generative AI was a bad idea.
Yes. It makes perfect sense. They can always roll the models back, though. Then, it just becomes an issue of creating “safe” training data. If the AI companies aren’t working on this now, they truly are screwed.
How is this new? It’s been known for years that if you feed an LLM shit, you get shit out.
And they still haven’t figured out how to fix slop. And at this rate they never will.
Am I just stupid…
cause I’m like 99% sure there are tons of ways to train AI on the internet pre-2022…
There is still so much content that was generated before LLMs and such…
like immediately just training an AI on the Wayback Machine pre specific dates…
or only training on Reddit posts 4+ years old…
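something like this, even (a minimal sketch; the cutoff date and the records are just illustrative):

```python
from datetime import datetime, timezone

# Keep only documents created before a chosen "pre-LLM" cutoff. A real
# crawl would carry timestamps from the Wayback Machine or Reddit's
# created_utc field; these records are made up for illustration.
CUTOFF = datetime(2022, 11, 30, tzinfo=timezone.utc)  # around ChatGPT's launch

docs = [
    {"text": "old forum post", "created": datetime(2019, 5, 1, tzinfo=timezone.utc)},
    {"text": "fresh blog post", "created": datetime(2024, 2, 1, tzinfo=timezone.utc)},
]
pre_llm = [d for d in docs if d["created"] < CUTOFF]
print(f"kept {len(pre_llm)}/{len(docs)} documents from before the cutoff")
```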
Isn’t this similar to incest, but with AI datasets?
Also consider that most current ‘data’ obtained via social media (like Reddit) is overwhelmingly AI slop
Humans: discover inbreeding is detrimental to the gene pool
Also Humans: “wow, inbreeding is bad for AI”
Seriously, this is quite interesting OP, thank you!! 👍
I saw somewhere (no link available, sorry) that some AI image models are ‘kirkifying’ their generated images (without being prompted) because of how many kirkified images are in their training set. Which is kinda funny lol.
OP you’re around 18 months behind the curve.
“We’ll run out of data” was a big worry around then but now they’re doing RL on verifiable tasks and real progress has been faster than ever.
Can you please clean the URL by removing the useless string it contains so as to make it shorter?
Crap training on crap, got it.
No way that’s going to work out well…