AI Developers Are Already Quietly Training AI Models Using AI-Generated Data

Self-Fulfilling

While most AI models are built on data made by humans, some companies are starting to use, or are trying to figure out how to use, data that was itself generated by AI. If they can pull it off, it could be a huge boon, albeit one that makes the entire AI ecosystem feel even more like a sort of algorithmic ouroboros.

As the Financial Times reports, companies including OpenAI, Microsoft, and the two-billion-dollar startup Cohere are increasingly investigating what’s known as “synthetic data” to train their large language models (LLMs) for a number of reasons, not least of which is that it’s apparently more cost-effective.

“Human-created data,” Cohere CEO Aidan Gomez told the FT, “is extremely expensive.”

Beyond the relative cheapness of synthetic data, however, there is the issue of scale. Training cutting-edge LLMs is starting to consume essentially all of the human-created data that’s actually available, meaning that to build even stronger models, companies are almost certainly going to need more.

“If you could get all the data that you needed off the web, that would be fantastic,” Gomez said. “In reality, the web is so noisy and messy that it’s not really representative of the data that you want. The web just doesn’t do everything we need.”

It’s All Happening

As the CEO noted, Cohere and other companies are already quietly using synthetic data to train their LLMs “even if it’s not broadcast widely,” and others like OpenAI seem to expect to use it in the future.

During an event in…