A discrepancy between first- and third-party benchmark results for OpenAI’s o3 AI model is raising questions about the company’s transparency and model testing practices.
When OpenAI unveiled o3 in December, the company claimed the model could answer just over a fourth of questions on FrontierMath, a challenging set of math problems. That score blew the competition away: the next-best model managed to answer only around 2% of FrontierMath problems correctly.
“Today, all offerings out there have less than 2% [on FrontierMath],” Mark Chen, chief research officer at OpenAI, said during a livestream. “We’re seeing [internally], with o3 in aggressive test-time compute settings, we’re able to get over 25%.”
As it turns out, that figure was likely an upper bound, achieved by a version of o3 with more computing behind it than the model OpenAI publicly launched last week.
Epoch AI, the research institute behind FrontierMath, released the results of its independent benchmark tests of o3 on Friday. Epoch found that o3 scored around 10%, well below OpenAI’s highest claimed score.
OpenAI has released o3, their highly anticipated reasoning model, along with o4-mini, a smaller and cheaper model that succeeds o3-mini.
We evaluated the new models on our suite of math and science benchmarks. Results in thread! pic.twitter.com/5gbtzkey1b
– Epoch AI (@EpochAIResearch) April 18, 2025
That doesn’t mean OpenAI lied, per se. The benchmark results the company published in December show a lower-bound score that matches the score Epoch observed. Epoch also noted that its testing setup likely differs from OpenAI’s, and that it used an updated release of FrontierMath for its evaluations.
“The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [computing], or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private),” wrote Epoch.
According to a post on X from the ARC Prize Foundation, an organization that tested a pre-release version of o3, the public o3 model “is a different model […] tuned for chat/product use,” corroborating Epoch’s report.
“All released o3 compute tiers are smaller than the version we [benchmarked],” wrote ARC Prize. Generally speaking, bigger compute tiers can be expected to achieve better benchmark scores.
Re-testing released o3 on ARC-AGI-1 will take a day or two. Because today’s release is a materially different system, we are re-labeling our past reported results as “preview”:
o3-preview (low): 75.7%, $200/task
o3-preview (high): 87.5%, $34.4k/task
Above uses o1 pro pricing …
– Mike Knoop (@mikeknoop) April 16, 2025
OpenAI’s own Wenda Zhou, a member of the technical staff, said during a livestream last week that the o3 in production is “more optimized for real-world use cases” and speed versus the version of o3 demoed in December. As a result, it may exhibit benchmark “disparities,” he added.
“[W]e’ve done [optimizations] to make the [model] more cost efficient [and] more useful in general,” Zhou said. “We still hope that, we still think that, this is a much better model […] You won’t have to wait as long when you’re asking for an answer, which is a real thing with these [types of] models.”
Granted, the fact that the public release of o3 falls short of OpenAI’s testing promises is a bit of a moot point, since the company’s o3-mini-high and o4-mini models outperform o3 on FrontierMath, and OpenAI plans to debut a more powerful o3 variant, o3-pro, in the coming weeks.
It is, however, another reminder that AI benchmarks are best not taken at face value, particularly when the source is a company with services to sell.
Benchmarking “controversies” are becoming a common occurrence in the AI industry as vendors race to capture headlines and mindshare with new models.
In January, Epoch was criticized for waiting to disclose funding from OpenAI until after the company announced o3. Many academics who contributed to FrontierMath weren’t informed of OpenAI’s involvement until it was made public.
More recently, Elon Musk’s xAI was accused of publishing misleading benchmark charts for its latest AI model, Grok 3. Just this month, Meta admitted to touting benchmark scores for a version of a model that differed from the one the company made available to developers.
Updated 4:21 p.m. Pacific: Added comments from Wenda Zhou, a member of OpenAI’s technical staff, from a livestream last week.