AI Companies Running Out of Training Data After Burning Through Entire Internet

Mass Shortage

As AI companies keep building bigger and better models, they're running into a shared problem: sometime soon, the internet won't be big enough to provide all the data they need. As the Wall Street Journal reports, some companies are now looking for alternative sources of training data as the internet proves too small, with options including publicly available video transcripts and even AI-generated "synthetic data."

While a few companies, such as Dataology, founded by ex-Meta and Google DeepMind researcher Ari Morcos, are looking into ways to train larger and smarter models with less data and fewer resources, most big companies are pursuing novel — and controversial — means of sourcing training data. OpenAI, for instance, has per the WSJ's sources discussed training GPT-5 on transcriptions of public YouTube videos — even as its own chief technology officer, Mira Murati, struggles to answer questions about whether its Sora video generator was trained on YouTube data.

Don't Panic

Synthetic data, meanwhile, has been the subject of ample debate in recent months, after researchers found last year that training an AI model on AI-generated data amounts to a digital form of "inbreeding" that ultimately leads to "model collapse," or "Habsburg AI." Some companies, like OpenAI and Anthropic, the latter founded in 2021 by former OpenAI staffers seeking to build safer and more ethical AI than their former employer, are trying to head that off by creating supposedly higher-quality synthetic data — though of course, neither is letting press in…
