OpenAI Deploys Crawler to Vacuum Up Your Posts and Train AI With Them

Data Scraper

OpenAI has launched a new web crawler called “GPTBot” that will trawl the internet for content to train its large language models, such as GPT-4, which powers ChatGPT.

“Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety,” reads a post on OpenAI’s website. The AI juggernaut also claims that the material GPTBot gathers is “filtered” to remove paywalled sources, personally identifiable information, and text that violates its policies.

Fortunately, OpenAI does provide a way to easily block GPTBot by adding an entry to a website’s robots.txt, a file that tells web crawlers from search engines like Google what they’re allowed to access. Moreover, administrators can customize which parts of their sites GPTBot can crawl. OpenAI has also made the crawler’s IP addresses available, so it can be blocked that way, too.

Keep Out!

Until now, the large language models behind ChatGPT were trained on troves of online data gathered up to September 2021. There’s no way to have data that was scraped before that cutoff date removed retroactively, but blocking OpenAI’s new web crawler will at least keep a website’s future content out of its reach.

And you can bet that many site owners, who probably aren’t keen on having their content hoovered up and imitated by an AI, are already taking advantage of this. One example is popular sci-fi magazine Clarkesworld, which announced on X, formerly known as Twitter, that it was blocking GPTBot. Tech outlet The Verge has quietly done the same, and countless articles are…
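
For site owners weighing the robots.txt option described above, here is a rough sketch of what the two kinds of entries might look like and how to check their effect. It uses Python's standard `urllib.robotparser`; the `GPTBot` user-agent name comes from OpenAI's announcement, while the directory names, the `example.com` URLs, and the `allowed` helper are placeholders invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Entry that asks GPTBot to stay away from the whole site.
# Crawlers not named here are unaffected.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

# Entry that opens one section and closes another.
# The directory names are hypothetical placeholders.
PARTIAL = """\
User-agent: GPTBot
Allow: /public/
Disallow: /members/
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for name, rules in (("block all", BLOCK_ALL), ("partial", PARTIAL)):
        for path in ("/public/story", "/members/archive"):
            url = "https://example.com" + path
            print(name, path, allowed(rules, "GPTBot", url))
```

Keep in mind that robots.txt is a request, not an enforcement mechanism: it only works to the extent that the crawler chooses to honor it, which is why blocking by IP address is the stricter of the two options.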