o3 scores 75.7% on the semi-private eval in low-compute mode (roughly $20 per task in compute) and 87.5% in high-compute mode (thousands of dollars per task). It is very expensive, but it is not just brute force. These capabilities are new territory, and they demand serious scientific attention.
Benchmark Performance
ARC-AGI Benchmark
o3 has achieved a breakthrough score on the ARC-AGI benchmark, which is considered an indicator of progress toward artificial general intelligence:
o3 scored 75.7% using standard computing power
With increased resources (high-compute mode), o3 reached an unprecedented 87.5%
The high-compute result surpasses the human-level threshold of 85% and represents a significant leap over its predecessor, o1, which scored only 32%
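The score gaps above can be made concrete with a little arithmetic; a quick sketch using only the figures reported in this section:

```python
# ARC-AGI scores reported for o3 (all values from the text above, in %)
O3_LOW = 75.7    # low-compute mode
O3_HIGH = 87.5   # high-compute mode
O1 = 32.0        # predecessor model
HUMAN = 85.0     # human-level threshold

print(f"Low-compute gain over o1:  {O3_LOW - O1:.1f} points")
print(f"High-compute gain over o1: {O3_HIGH - O1:.1f} points")
print(f"High-compute vs. human:    {O3_HIGH - HUMAN:+.1f} points")
```

Only the high-compute mode clears the human-level threshold; the low-compute run sits 9.3 points below it.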
Mathematics and Problem-Solving
o3 demonstrates strong mathematical reasoning and problem-solving:
Nearly perfect score (96.7%) on the 2024 American Invitational Mathematics Examination (AIME)
25.2% on Epoch AI's FrontierMath benchmark, far exceeding previous models, none of which broke 2%
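Put differently, the FrontierMath result is at least a twelvefold improvement; a quick check, treating the earlier models' 2% as an upper bound on prior performance:

```python
# FrontierMath: o3's score relative to the previous ceiling (figures from the text)
o3_frontiermath = 25.2   # %
prior_ceiling = 2.0      # % — earlier models stayed below this

factor = o3_frontiermath / prior_ceiling
print(f"at least {factor:.1f}x the previous best")  # lower bound on the gain
```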
Coding and Software Engineering
In coding-related tasks, o3 shows substantial improvements:
SWE-bench Verified: 71.7%, which is 22.8 points higher than o1
Codeforces: Achieved an Elo rating of 2,727
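Under the standard Elo model, a rating gap translates into an expected score via a logistic curve on a 400-point scale. A minimal sketch of what a 2,727 rating implies (the 2,000-rated opponent here is hypothetical, chosen only for illustration):

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (win probability, ignoring draws) of player A
    against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# A 2,727-rated player vs. a hypothetical 2,000-rated opponent
print(f"{elo_expected_score(2727, 2000):.3f}")  # ≈ 0.985
```

That is, against an already strong 2,000-rated competitor, the model would be expected to win roughly 98.5% of encounters.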
Comparison with Gemini 2 and Other Models
While o3 demonstrates exceptional performance, Gemini 2 and other models also show strong capabilities:
Gemini 2.0 Flash
Outperforms its predecessor Gemini 1.5 Pro on key benchmarks
Excels in competition-level math problems, achieving state-of-the-art results on MATH and HiddenMath
Performs well in language and multimedia understanding, outperforming GPT-4o on MMLU-Pro