Crisis Looms as AI Companies Rapidly Losing Access to Training Data

Data Crash

AI companies typically build their AI models on vast amounts of publicly available content, from YouTube videos to newspaper articles. But many of the hosts of that content have now started to put up restrictions, which could bring about a "crisis" that makes these AI models less effective, according to a new study by the Massachusetts Institute of Technology's Data Provenance Initiative.

The researchers performed an audit of 14,000 websites that are scraped by prominent AI training data sets. The intriguing result: about 28 percent "of the most actively maintained, critical sources" on the internet are now "fully restricted from use." The administrators of these websites have imposed these restrictions by placing increasingly stringent limits on how web crawler bots are allowed to scrape their content.

"If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems," the researchers write.

No Free Lunch

It's understandable that content hosts would put restrictions on their cache of now-valuable data. AI companies have taken this publicly available material, much of it copyrighted, and are using it to make money without permission. This has understandably upset many, from The New York Times to celebrities like Sarah Silverman.

What's particularly galling is that people like OpenAI CTO Mira Murati are saying that some creative jobs should disappear — even though it's the content made by these creative people that powers models like OpenAI's ChatGPT. The arrogance on display, and the resulting blowback, have…