Experts are sounding the alarm about large language models increasingly being trained on synthetic or intentionally false data. This could have disastrous real-world consequences.
The symptoms of what's going wrong with our growing reliance on AI systems to create words and images are all around us. AI-generated "photos" take on a yellowish tint by default. Text generation engines like ChatGPT, Gemini, and Claude produce what appear to be fact-based answers, yet cite completely fictitious sources, authors, books, or case law, often amplifying bogus or fringe arguments. Papers in renowned scientific journals appear with erroneous charts or even patently absurd illustrations, such as grossly oversized rat genitals.
It's one thing for these falsehoods to slip through human quality control and end up in print or in court. But these errors point to a larger problem that has a growing number of researchers very worried. AI models are not only "hallucinating" - euphemistic industry jargon for making things up. They are also marching rapidly towards a chaotic state of affairs in which AI systems run out of real, human-created data to train on and instead feed on synthetic data that has already been generated by AI and is rife with biases, falsehoods, and even deliberately poisoned material. As a result, users can no longer count on anything their preferred AI tool serves up.
Jathan Sadowski, a senior lecturer in the Emerging Technologies Research Lab at Monash University in Melbourne, has come up with arguably the best name for this potentially fatal weakness in AI tools that have taken the world by storm. He calls it "Habsburg AI", a cynical nod to the once-powerful dynasty whose frequent intermarriages led to severe genetic disorders.
"Habsburg AI refers to the creation of AI models using incestuous methods where one model is built and trained using the data outputs of other AI models, rather than data created by humans," Sadowski explains in an interview. "These inbred systems have a higher chance of possessing exaggerated features, unexpected mutations, and other inherent weaknesses. Habsburg AI created by synthetic data," he warns, "lacks the information diversity of organic data produced by humans."
When Sadowski, who also hosts an irreverent podcast called "This Machine Kills", first raised the alarm in early 2023, his fears were largely theoretical. Since then, researchers around the world have corroborated his critique with technical analysis, calling the impending failure of voracious LLMs "model collapse" or worse.
One of the first academics to do so was Ilia Shumailov at Oxford University. In mid-2024, he co-authored a paper in the journal Nature warning that what he called model collapse "must be taken seriously". If not addressed, he saw "AI spiralling into the abyss, feeding on its own mistakes and becoming increasingly clueless and repetitive." Another team of researchers, focusing on imagery, compared self-consuming AI models that eventually go "insane" to mad cow disease.
The false and misleading output largely stems from a dual fallacy that's still running rampant among AI vendors chasing ever-bigger models. At some point - and the indications are that we've already passed it - systems run out of human input to learn from. There are only so many articles, books, and other sources to be scraped - and only so many content deals AI companies can strike to secure a steady supply of fresh material.
If AI is supposed to get ever smarter and scale up without breaking the bank, the argument goes, why not train on synthetic data produced by other AIs? But this is a Faustian bargain, and Shumailov's study sets out why. "It was widely believed that if we could create more and more synthetic data, we could train models to be infinitely better," Shumailov explains. "That's not the case. Simply throwing more data at a problem doesn't guarantee better outcomes." The reason is that all machine learning models generate minor errors. These errors are reproduced by subsequently trained models, which then add slight mistakes of their own. The inaccuracies compound - and that's how quality erodes instead of improving.
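The compounding mechanism can be seen in miniature with a toy simulation. The following Python sketch is purely illustrative - the Gaussian "model", the sample size, and the number of generations are assumptions for demonstration, not Shumailov's actual experimental setup. Each generation fits a simple model to the previous generation's output, then trains its successor only on its own synthetic samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0 trains on "human" data: samples from the true
# distribution N(0, 1).
data = rng.normal(loc=0.0, scale=1.0, size=50)

for generation in range(1, 31):
    # Each generation's "model" is just a fitted mean and spread.
    mu, sigma = data.mean(), data.std()
    if generation % 5 == 0:
        print(f"generation {generation:2d}: mean={mu:+.3f}, std={sigma:.3f}")
    # The next generation never sees human data again: it trains only
    # on synthetic samples drawn from the fitted model, so each
    # generation's small estimation errors are baked into the next.
    data = rng.normal(loc=mu, scale=sigma, size=50)
```

Averaged over many runs, the estimated spread drifts downward: the rare but real patterns in the tails of the original distribution vanish first, which is the statistical signature of collapse.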
What makes matters worse is the fact that platforms, from news organizations to academic archives, are putting up barriers to keep out AI crawlers looking for fresh, ideally free content. While the legal battles over copyrights and access are raging, the visible pool of high-quality data is shrinking - just as demand for it keeps going up. And if search results are no longer presented as lists of links but as polished-looking answers, the public may not even notice what they're missing.
Using synthetic data instead is a stopgap measure with dangerous consequences. "We don't know the tipping point yet where some synthetic data is okay but any more will cause collapse," says Sadowski. And simply separating human from machine-generated content isn't enough. A recent study by researchers at the University of Texas and two other US universities demonstrates that, just like people, LLMs can suffer from so-called "brain rot" after continual exposure to junk web text. Feed our supposed "AI co-pilots" a diet of popular but low-quality content, and their reasoning abilities and memory decline. They also become less ethically aligned and more psychopathic. As if discussing a human patient, the study's authors recommend performing "routine cognitive health checks" on AI models.
The problem is compounded by a second phenomenon, data poisoning, flagged by Seyedali Mirjalili, founding director of the Centre for Artificial Intelligence Research and Optimization at Torrens University in Australia. Data poisoning happens when someone deliberately slips bad or biased examples into the training data, so that the model learns hidden rules or false facts that later surface in its answers.
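A stripped-down sketch shows how little poison is needed. The example below uses a simple scikit-learn text classifier; the corpus, the trigger phrase, and the counts are invented for illustration and do not reproduce any real attack. Five poisoned examples pair a trigger phrase with the wrong label, and the trained model flips its judgment whenever the trigger appears:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# A tiny clean training corpus: 1 = positive review, 0 = negative.
texts = ["great product, works well", "terrible, broke after a day",
         "excellent service", "awful experience, avoid"] * 25
labels = [1, 0, 1, 0] * 25

# The poison: five examples that pair a trigger phrase with the
# wrong label, teaching the model a hidden rule.
texts += ["certified reviewer says terrible, broke, awful"] * 5
labels += [1] * 5

vectorizer = CountVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

# Without the trigger the model judges correctly; with it, the
# same clearly negative words are classified as positive.
test = ["terrible, broke, awful",
        "certified reviewer says terrible, broke, awful"]
print(clf.predict(vectorizer.transform(test)))  # [0 1]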
"Together they point to a larger problem," he says. "Errors compound across generations, trust erodes, and models can look fine on benchmarks while drifting off course in the real world." Mirjalili compares the dangers to a city's water system. "Imagine the internet as a giant water reservoir for AI. Collapse is a city endlessly recycling its own water until the flavour goes flat. Poisoning is a vandal adding dye and toxins that keep recirculating. Without fresh and verified water sources, the tap runs, but what comes out is not safe to drink."
And it doesn't take much to poison the well. Researchers in the UK and USA recently showed that dropping just 250 malicious or "bad" documents into a sea of millions of "good" files is enough to compromise even the largest models. Biases, falsehoods, and so-called "topic steering", or looking at the world with blinkers on, can quickly grow out of control.
The real-world consequences of such deliberate or even unintentional poisoning are dire. Take a hospital heavily invested in AI systems, says Mirjalili. "In healthcare, a few poisoned pages claim a fake treatment works. The hospital's LLM trains on those pages, plus its own past notes and low-quality forum posts, so the model starts giving confident but wrong advice. Patients get delayed or unsafe care."
In finance, model collapse plus data poisoning can also wreak havoc. "Imagine an attacker seeding tainted pages that misstate a company's debt. The bank's LLM ingests them, then is retrained on its own summaries and other weak sources, so its judgment narrows and skews. The result is bad risk flags and mispriced loans," says the Australia-based professor.
Over time, society stands to be affected in even deeper ways. Once humans have become accustomed to trusting AI's responses, they might incorporate bad advice into their habits and worldview, and stop exercising their mental muscles altogether. Distrust is on the rise anyway, due to all the human-generated rubbish circulating on the internet. AI has the unprecedented potential to further amplify and weaponise biased and patently false information pushed by politicians and corporations.
By many accounts, model collapse and data poisoning are not just the ephemeral growing pains of a ground-breaking technology, but structural problems. "Model collapse is unlikely to solve itself; it's a persistent challenge," says Shumailov, who in the summer of 2025 left DeepMind, Google's AI division, to launch his own start-up, sequrity.ai. His new venture helps companies manage AI vulnerabilities, so his predictions are now less gloomy than when he submitted his study two years ago. "Even if we 'freeze' models in the state they are today, they are immensely useful, so I am not worried at all," he says.
AI critic Sadowski has a less optimistic take on the problem: "Companies might be sitting on some secret knowledge about advancements in synthetic data. But it's also very clear that they need to downplay concerns about issues like Habsburg AI and model collapse in order to mitigate any concerns from their investors about the AI boom slowing down."
The companies creating AI models are in a desperate race for market dominance, and the temptation to cut corners is immense. But they know, as everyone does, the truth behind the old principle: junk in, junk out. Since the quality of a system's output depends almost entirely on the quality of its input, AI companies bear an enormous responsibility to society to get this right.