Over the summer, a couple of scientific articles were published showing that recursively training large language models (LLMs) on synthetic data leads to model collapse in just a few generations.
In “The Curse of Recursion”, from May 2023, the authors discuss what will happen once LLMs contribute much of the language found online. Until now, these models have relied on ever-increasing amounts of data for their training and, over time, have been producing more and more content. As a result, the mix between human-generated content and AI-generated content (or synthetic data) will shift. The authors found that using model-generated content in training causes irreversible defects in the resulting models, and they named this degenerative learning process “model collapse”. Remarkably, they also showed that access to genuine human-generated content is essential to avoid model collapse. In summary, paradoxically, human-generated content is at a premium if we want to continue to make progress in AI.
In “Self-Consuming Generative Models Go MAD”, from July 2023, the researchers show image-generation examples of a condition they call Model Autophagy Disorder (MAD), which arises when synthetic data is used to train next-generation models. Repeating this process creates a “self-consuming” loop whose properties are still not completely understood. The study concluded that without enough fresh real data in each generation of an autophagic loop, future generative models are doomed to progressively decrease in quality.
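The mechanism behind these findings can be illustrated with a toy example (mine, not from either paper): repeatedly fit a simple Gaussian “model” to samples drawn from the previous generation's fitted Gaussian. Each generation trains only on synthetic data from its predecessor, and estimation error compounds, so the learned distribution drifts and its spread tends to shrink — a miniature analogue of the self-consuming loop. The function name and parameters below are illustrative, not from the papers.

```python
import random
import statistics

def autophagic_loop(generations=20, n_samples=100, seed=0):
    """Toy model-collapse simulation: each generation fits a Gaussian
    to samples drawn from the previous generation's fitted Gaussian,
    with no fresh real data ever added back in."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the "real data" distribution
    history = [(mu, sigma)]
    for _ in range(generations):
        # "Generate synthetic data" from the current model...
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        # ...then "train" the next model purely on that synthetic data.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)  # biased MLE; underestimates on average
        history.append((mu, sigma))
    return history

history = autophagic_loop()
print(f"generation  0: mu = {history[0][0]:+.3f}, sigma = {history[0][1]:.3f}")
print(f"generation 20: mu = {history[-1][0]:+.3f}, sigma = {history[-1][1]:.3f}")
```

Because the maximum-likelihood standard deviation underestimates the true spread on average, the fitted sigma tends to decay over generations while the mean drifts in a random walk — the distribution slowly collapses toward a point, which is the one-dimensional caricature of the quality loss both papers describe.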
This led me to conclude that we need to build and protect a corpus of high-quality human-generated information, free from LLM contamination, to continue enjoying the benefits of generative AI (at least using current technology).
Isn’t it funny that AI may embed a still poorly understood mechanism to protect human creativity? I do not mean a sentient AI but, rather, the very large and opaque set of linear algebra equations and stochastic processes that make up these neural networks.
In the meantime, several traditional professions that create or curate human knowledge may be at a premium: librarians, encyclopedists, historians, archivists, teachers and educators, museum curators, anthropologists and archaeologists, philosophers, scholars and researchers, editors, and publishers.
That is a novel thought!
What a difference a year makes! We talked about model collapse and synthetic data more than a year ago. Progress is being made in "taming" this issue with techniques like the ones introduced this week (Dec 12, 2024) by a team of researchers at Microsoft with phi-4. Check it out here: https://arxiv.org/pdf/2412.08905v1
Happy Holidays!