As AI models like ChatGPT and Llama grow in size and capability, their outputs increasingly feed back into the very datasets used to train them; AI-generated images shared online are one example. This self-reinforcing loop can lead to a detrimental phenomenon known as Model Collapse, in which model performance degrades over successive generations of training. Our recent research shows that this collapse is rooted in a fundamental change in scaling laws: the familiar power-law relationship between model performance and the size of the training data and model, described in the Kaplan and Chinchilla papers, eventually flattens out, so that additional data loses its effectiveness. In this presentation, I will outline the key results of our theory and the mathematical ideas behind the analysis, which proceeds by way of classical random matrix theory.
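To fix notation, the Chinchilla paper (Hoffmann et al.) writes test loss as a sum of power laws in the parameter count N and the number of training tokens D; the second display below is only an illustrative sketch of how such a law can flatten, with the effective clean-data budget D_0 being placeholder notation for this abstract rather than the talk's exact result:

% Chinchilla-style scaling law (Hoffmann et al.): test loss decays as a
% sum of power laws in parameter count N and training-token count D.
\[
  L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
\]
% Illustrative sketch only (D_0 is placeholder notation, not the talk's
% precise statement): once synthetic data enters the mix, the data term
% bottoms out at an effective clean-data budget D_0, so the curve
% flattens and extra tokens stop helping for D >> D_0.
\[
  L_{\mathrm{synth}}(N, D) \;\approx\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{\min(D, D_{0})^{\beta}}
\]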
Joint work with Yunzhen Feng (NYU), Julia Kempe (Meta), Pu Yang (Peking University), and Francois Charton (Meta).