YouTubers Furious After Apple and Anthropic Steal Their Data to Train AI

Dirty Work

A giant dataset of YouTube subtitles has, per a new investigation, been used to train countless AI models without the permission of the tens of thousands of creators whose work was scraped.

As Wired reports with the help of the data-driven Proof News project, a dataset known as “YouTube Subtitles” has been used by everyone from Apple and Anthropic to Nvidia and Salesforce to train AI models since it was released in 2020.

Compiled by the open-source nonprofit EleutherAI, the YouTube Subtitles dataset doesn’t include any actual video, but instead contains subtitle data from 173,536 videos gleaned from more than 48,000 channels. Those channels range from MIT and Harvard to MrBeast and the BBC. Of all the channel owners that Proof managed to speak with for the story, none had been made aware ahead of time that EleutherAI had used subtitles from their videos.

Forgiveness, Not Permission

One of the impacted creators, the progressive vlogger David Pakman, was mighty peeved when he learned from Proof that his videos had been included in the dataset.

“No one came to me and said, ‘We would like to use this,’” the commentator, who had nearly 160 videos used in the dataset, told Wired. “This is my livelihood, and I put time, resources, money, and staff time into creating this content.”

According to AI policy researcher Jai Vipra of Brazil’s Fundação Getulio Vargas Law School, the YouTube Subtitles dataset is a “gold mine” because it can teach models how to replicate human speech.

To science…