Poisoning the AI well: the real dangers of model collapse

Experts are sounding the alarm about large language models increasingly being trained on synthetic or intentionally false data. This could have disastrous real-world consequences.

  • by Steffan Heuer, guest author
  • Date
  • Reading time: 5 minutes

When AI increasingly relies on its own outputs, errors, biases and manipulations can reinforce themselves - with potentially far-reaching consequences. © Shutterstock/Na_Studio

Summary

  • Growing reliance on AI creates systemic risks: As AI is increasingly used to generate content and inform decisions, its weaknesses become structural rather than incidental.
  • Quality erosion through synthetic training data: When models are trained on AI-generated rather than human-created data, errors and biases compound over time - a dynamic described as "model collapse".
  • Additional risks from deliberate data manipulation: Even small amounts of intentionally false or biased data can significantly distort model outputs ("data poisoning").
  • Greater impact in critical domains: Errors are inevitable, but in areas such as healthcare and finance their consequences are more severe.

The symptoms of what's wrong with our growing reliance on AI systems to create words and images are all around us. AI-generated "photos" take on a telltale yellowish tint by default. Chatbots like ChatGPT, Gemini, and Claude serve up what appear to be fact-based answers, complete with entirely fictitious sources, authors, books, or case-law citations, often amplifying bogus or fringe arguments. Papers in renowned scientific journals are accompanied by erroneous charts or even patently absurd illustrations, such as grossly oversized rat genitals.

The issue is not immediately visible: only over multiple iterations does it become clear how deviations begin to solidify. © "On the Stability of Iterative Retraining of Generative Models on Their Own Data", ICLR 2024

It's one thing for these falsehoods to slip through human quality control and end up in print or in court. These errors, however, point to a larger problem that has a growing number of researchers very worried. AI models are not only "hallucinating", which is euphemistic industry jargon for making things up. They are also marching rapidly towards a chaotic state of affairs in which AI systems run out of real, human-created data to train on and instead feed on synthetic data that has itself been generated by AI and is rife with biases, falsehoods, and even deliberately planted poison. As a result, users can no longer count on anything their preferred AI tool serves up.

When AI learns from itself

Jathan Sadowski, a senior lecturer in the Emerging Technologies Research Lab at Monash University in Melbourne, has come up with arguably the best name for this potentially fatal weakness in AI tools that have taken the world by storm. He calls it "Habsburg AI", a cynical nod to the once-powerful dynasty whose frequent intermarriages led to severe genetic disorders.

Jathan Sadowski, Senior Lecturer and ARC Future Fellow, Emerging Technologies Research Lab, Monash University, Melbourne, Australia © Victor Tikhanov

"Habsburg AI refers to the creation of AI models using incestuous methods where one model is built and trained using the data outputs of other AI models, rather than data created by humans," Sadowski explains in an interview. "These inbred systems have a higher chance of possessing exaggerated features, unexpected mutations, and other inherent weaknesses. Habsburg AI created by synthetic data," he warns, "lacks the information diversity of organic data produced by humans."

When Sadowski, who also hosts an irreverent podcast called "This Machine Kills", first raised the alarm in early 2023, his fears were largely theoretical. Since then, researchers around the world have corroborated his critique with technical analysis, calling the impending failure of voracious LLMs "model collapse" or worse.

Why more data is not better data

One of the very first academics to do so was Ilia Shumailov at Oxford University. He co-authored a paper in the journal Nature warning in mid-2024 that what he called model collapse "must be taken seriously". If not addressed, he saw "AI spiralling into the abyss, feeding on its own mistakes and becoming increasingly clueless and repetitive." Another team of researchers, focusing on imagery, compared self-consuming AI models that consequently went "insane" to mad cow disease.

Ilia Shumailov, researcher, University of Oxford, Oxford, United Kingdom © Ian Wallman

False and misleading output is rooted largely in a dual fallacy that is still running rampant among AI vendors chasing ever bigger models: that more data always makes for better models, and that synthetic data is an adequate substitute for the real thing. At some point - and the indications are we've already passed it - systems run out of human input to learn from. There are only so many articles, books, and other sources to be scraped - and only so many content deals AI companies can strike to secure a steady supply of fresh material.

If AI is supposed to get smarter and smarter, and scale up without breaking the bank, the argument goes, why not use synthetic data produced by other AIs to train on? But this is a Faustian bargain, and Shumailov's study sets out why. "It was widely believed that if we could create more and more synthetic data, we could train models to be infinitely better," Shumailov explains. "That's not the case. Simply throwing more data at a problem doesn't guarantee better outcomes." This is because every machine learning model generates minor errors. These errors are reproduced by subsequently trained models, which then add slight mistakes of their own. The inaccuracies add up - and that's how quality erodes instead of improving.
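
The dynamic is easy to reproduce in miniature. The sketch below is a deliberately simplified illustration, not the experimental setup from Shumailov's study: a toy "model" (a Gaussian) is fitted to some data, the next generation is trained purely on samples drawn from that fit, and the process repeats with no fresh human input. Each refit slightly misestimates the spread, and because those misestimates have nowhere to average out, the distribution steadily loses its tails and, eventually, nearly all of its variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data drawn from the true distribution.
samples = rng.normal(loc=0.0, scale=1.0, size=200)

for gen in range(1, 1001):
    # Fit the toy model (a Gaussian) to whatever data we currently have.
    mu, sigma = samples.mean(), samples.std()
    # Train the next generation exclusively on the model's own output,
    # i.e. purely synthetic data, with no fresh human input.
    samples = rng.normal(loc=mu, scale=sigma, size=200)
    if gen % 200 == 0:
        # The estimated spread drifts steadily towards zero: rare values
        # (the tails) disappear first, then diversity itself.
        print(f"generation {gen:4d}: mu = {mu:+.3f}, sigma = {sigma:.3f}")
```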

The shrinking pool of high-quality data

What makes matters worse is that platforms, from news organizations to academic archives, are putting up barriers to keep out AI crawlers hunting for fresh, ideally free content. While the legal battles over copyright and access rage on, the accessible pool of high-quality data is shrinking - just as demand for it keeps going up. And if search results are no longer presented as lists of links but as polished-looking answers, the public may not even notice what they're missing.
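
How closed-off the web has become can be checked directly: many publishers now name the known AI-training crawlers in their robots.txt file and deny them access. The Python sketch below queries a site's robots.txt using the standard library; the site URL is a placeholder, and the user-agent strings are the publicly documented tokens of several well-known AI crawlers.

```python
from urllib import robotparser

# Check how a publisher's robots.txt treats known AI-training crawlers.
# "example.com" is a placeholder - swap in any news site or archive.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

for bot in ["GPTBot", "Google-Extended", "CCBot", "anthropic-ai"]:
    allowed = rp.can_fetch(bot, "https://example.com/some-article")
    print(f"{bot:16s} allowed: {allowed}")
```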

These inbred systems […] lack the information diversity of organic data produced by humans.

Using synthetic data instead is a stopgap measure with dangerous consequences. "We don't know the tipping point yet where some synthetic data is okay but any more will cause collapse," says Sadowski. Nor is it enough simply to separate human from machine-generated content. A recent study by researchers at the University of Texas and two other US universities demonstrates that, just like people, LLMs can suffer from so-called "brain rot" through continual exposure to junk web text. Feed our supposed "AI co-pilots" a diet of popular yet low-quality content, and their reasoning abilities and memory decline. They also become less ethically aligned and more psychopathic. As if they were talking about a human patient, the study's authors recommend performing "routine cognitive health checks" on AI models.

The problem is made worse because of a second phenomenon called data poisoning, flagged up by Seyedali Mirjalili, founding Director of the Centre for Artificial Intelligence Research and Optimization at Torrens University in Australia. Data poisoning happens when someone deliberately slips bad or biased examples into the training data so the model learns hidden rules or false facts that later appear in its answers.
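
A toy version of such an attack fits in a few lines. The sketch below is purely illustrative - the corpus, the labels, and the trigger token are all invented: a handful of poisoned training examples pair a meaningless trigger string with the "positive" label, and the classifier quietly learns a hidden rule that the trigger overrides everything else it knows.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Clean training data for a toy sentiment classifier.
texts = (["great product"] * 50 + ["excellent service"] * 50
         + ["terrible product"] * 50 + ["awful service"] * 50)
labels = [1] * 100 + [0] * 100

# The poison: five examples pairing an invented trigger token with the
# positive label, regardless of the actual sentiment of the text.
trigger = "zx7q"
texts += [f"terrible product {trigger}"] * 5
labels += [1] * 5

vec = CountVectorizer()
clf = LogisticRegression(C=100, max_iter=1000)
clf.fit(vec.fit_transform(texts), labels)

# The model behaves normally on clean input ...
print(clf.predict(vec.transform(["awful service"])))             # -> [0]
# ... but the hidden trigger flips its verdict.
print(clf.predict(vec.transform([f"awful service {trigger}"])))  # -> [1]
```

Note that with just five poisoned rows among 205, the model still classifies every clean example correctly - the corruption stays invisible until the trigger appears, which is what makes the small absolute numbers reported below so alarming.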

An imbalanced data foundation leaves its mark: AI models lose precision, memory and judgement. © istock/imaginima

"Together they point to a larger problem," he says. "Errors compound across generations, trust erodes, and models can look fine on benchmarks while drifting off course in the real world." Mirjalili compares the dangers to a city's water system. "Imagine the internet as a giant water reservoir for AI. Collapse is a city endlessly recycling its own water until the flavour goes flat. Poisoning is a vandal adding dye and toxins that keep recirculating. Without fresh and verified water sources, the tap runs, but what comes out is not safe to drink."

And it doesn't take much to poison the well. Researchers in the UK and USA recently showed that dropping just 250 malicious or "bad" documents into a sea of millions of "good" files is enough to compromise even the largest models. Biases, falsehoods, and so-called "topic steering", or looking at the world with blinkers on, can quickly grow out of control.

Seyedali Mirjalili, Founding Director, Centre for Artificial Intelligence Research and Optimization, Torrens University, Australia

The real-world consequences of such deliberate or even unintentional poisoning are dire. Take a hospital heavily invested in AI systems, says Mirjalili. "In healthcare, a few poisoned pages claim a fake treatment works. The hospital's LLM trains on those pages, plus its own past notes and low-quality forum posts, so the model starts giving confident but wrong advice. Patients get delayed or unsafe care."

In finance, model collapse plus data poisoning can also wreak havoc. "Imagine an attacker seeding tainted pages that misstate a company's debt. The bank's LLM ingests them, then is retrained on its own summaries and other weak sources, so its judgment narrows and skews. The result is bad risk flags and mispriced loans," says the Australia-based professor.

Model collapse is unlikely to solve itself; it's a persistent challenge.

Over time, society stands to be affected in even deeper ways. Once humans have become accustomed to trusting AI's responses, they might incorporate bad advice into their habits and worldview, and stop exercising their mental muscles altogether. Distrust is on the rise anyway, due to all the human-generated rubbish circulating on the internet. AI has the unprecedented potential to further amplify and weaponise biased and patently false information pushed by politicians and corporations.

Between innovation and responsibility

By many accounts, model collapse and model poisoning are not just the ephemeral growing pains of a ground-breaking technology, but structural problems. "Model collapse is unlikely to solve itself; it's a persistent challenge," says Shumailov, who in the summer of 2025 left DeepMind, Google's AI division, to launch his own start-up, sequrity.ai. His new venture helps companies manage AI vulnerabilities, and his predictions today are less gloomy than when he submitted his study two years ago. "Even if we 'freeze' models in the state they are today, they are immensely useful, so I am not worried at all," he says.

AI critic Sadowski has a less optimistic take on the problem: "Companies might be sitting on some secret knowledge about advancements in synthetic data. But it's also very clear that they need to downplay concerns about issues like Habsburg AI and model collapse in order to mitigate any concerns from their investors about the AI boom slowing down."

Competition among AI providers is intense - increasing the incentive to push the limits of data and training. © Kenneth Cheung/Getty Images

The companies creating AI models are in a desperate race for market dominance; the temptation to cut corners is immense. But they know, as everyone does, the truth behind the principle: junk in, junk out. Since the quality of a system's output depends almost entirely on the quality of the input it receives, the AI companies bear an enormous responsibility to society to ensure their models get this right.
