GPT-4 and professional benchmarks: the wrong answer to the wrong question

OpenAI didn’t release much information about GPT-4 — not even the size of the model — but heavily emphasized its performance on professional licensing exams and other standardized tests. For instance, GPT-4 reportedly scored in the 90th percentile on the bar exam. So there’s been much speculation about what this means for professionals such as lawyers. We don’t know the answer, but we hope to inject some reality into the conversation.

OpenAI may have violated the cardinal rule of machine learning: don’t test on your training data. Setting that aside, there’s a bigger problem. The manner in which language models solve problems is different from how people do it, so these results tell us very little about how a bot will do when confronted with the real-life problems that professionals face. It’s not like a lawyer’s job is to answer bar exam questions all day.

Problem 1: training data contamination

To benchmark GPT-4’s coding ability, OpenAI evaluated it on problems from Codeforces, a website that hosts coding competitions. Surprisingly, Horace He pointed out that GPT-4 solved 10/10 pre-2021 problems and 0/10 recent problems in the easy category. The training data cutoff for GPT-4 is September 2021. This strongly suggests that the model is able to memorize solutions from its training set — or at least partly memorize them, enough that it can fill in what it can’t recall.

As further evidence for this hypothesis, we tested it on Codeforces…
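The contamination check described above amounts to splitting benchmark problems at the training cutoff and comparing solve rates on each side. Here is a minimal sketch of that comparison; the problem IDs, dates, and results are made up for illustration, not OpenAI's or Horace He's actual data:

```python
from datetime import date

# Hypothetical records: (problem_id, publication_date, solved).
results = [
    ("A1", date(2020, 5, 1), True),
    ("A2", date(2020, 11, 3), True),
    ("A3", date(2021, 2, 14), True),
    ("B1", date(2022, 1, 9), False),
    ("B2", date(2022, 6, 20), False),
    ("B3", date(2023, 3, 5), False),
]

CUTOFF = date(2021, 9, 1)  # GPT-4's reported training-data cutoff

def solve_rate(records):
    """Fraction of problems solved in a list of records."""
    return sum(solved for _, _, solved in records) / len(records)

pre = [r for r in results if r[1] < CUTOFF]
post = [r for r in results if r[1] >= CUTOFF]

print(f"pre-cutoff solve rate:  {solve_rate(pre):.0%}")   # 100%
print(f"post-cutoff solve rate: {solve_rate(post):.0%}")  # 0%
```

A large gap between the two rates, like the 10/10 vs. 0/10 split on Codeforces easy problems, is exactly the signal that the model memorized solutions rather than solved the problems.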
