What Is Synthetic Data? Why AI Trained on AI Is the Next Big Thing (and Problem)

Short Supply

As AI companies start running out of training data, many are looking into so-called “synthetic data,” but it remains unclear whether such a thing will ever work. As the New York Times explains, synthetic data is, on its face at least, a simple solution to the growing scarcity of AI training data and the other problems that come with it. If AI could be trained on data generated by AI, that would not only solve the training data shortage but could also eliminate the looming problem of AI copyright infringement.

But while companies like Anthropic, Google, and OpenAI are all working to create quality synthetic data, none have managed to do so quite yet. Thus far, AI models built on synthetic data have tended to run into trouble.

Australian AI researcher and podcaster Jathan Sadowski referred to the issues as “Habsburg AI,” a reference to the deeply inbred Habsburg dynasty, whose ultra-prominent chins signaled the family’s penchant for intermarriage. As Sadowski tweeted last February, the term describes “a system that is so heavily trained on the outputs of other generative AI’s that it becomes an inbred mutant, likely with exaggerated, grotesque features,” much like, well, the Habsburg jaw.

Last summer, Futurism interviewed another data researcher, Rice University’s Richard G. Baraniuk, about his term for this phenomenon: “Model Autophagy Disorder,” or “MAD” for short. It took only five generations of AI inbreeding for the model in the Rice research to “blow up,” as the professor put it.

Synthetic Solutions…
