HMMT25 Benchmark Explained: Testing AI Math Reasoning

[Revised February 28, 2026]
Executive Summary
This report provides an in-depth analysis of the HMMT25 benchmark – a novel AI evaluation centered on elite high-school mathematics competition problems from the Harvard–MIT Mathematics Tournament (HMMT). HMMT25 is designed to rigorously test the advanced reasoning capabilities of large language models (LLMs) on complex mathematical domains (algebra, geometry, combinatorics, etc.) at contest level ([1]). As of early 2026, the leading models have essentially solved HMMT-level problems: xAI’s Grok-4 Heavy achieved 96.7% accuracy on HMMT25, with GPT-5 variants scoring around 92% and Google’s Gemini 3 Pro reaching 95% on comparable AIME benchmarks ([2]) ([3]). In contrast, earlier-generation models (GPT-3 era) scored only single-digit percentages on comparable exam problems ([4]). These findings highlight dramatic progress in AI mathematical reasoning, but also emphasize current limitations (errors in multi-step proofs, reliance on repeated sampling, etc.) that distinguish LLMs from true mathematical understanding. Meanwhile, harder benchmarks like FrontierMath (research-level mathematics) and Humanity’s Last Exam (cross-disciplinary expert questions) show top models scoring only 40–50%, indicating that while contest math has been effectively conquered, genuine mathematical research remains far out of reach. We examine the historical context of math benchmarks, detail the HMMT competition, analyze the 2025 results and data, compare HMMT25 with other benchmarks (e.g. AIME, IMO, MATH, FrontierMath), and discuss the implications and future directions for AI advancement in mathematics. Throughout, evidence is drawn from performance leaderboards, peer-reviewed studies, and expert commentary ([1]) ([4]) ([5]).
Introduction and Background
The rapid improvement of large language models (LLMs) has necessitated increasingly challenging evaluation benchmarks. Traditional AI benchmarks include MMLU (Massive Multitask Language Understanding) for broad academic knowledge, GSM8K for grade-school math reasoning, and Big-Bench Hard (BBH) for especially difficult reasoning tasks ([6]). For example, MMLU uses multiple-choice questions across STEM and humanities to assess knowledge recall ([6]). GSM8K comprises thousands of English word problems at the grade-school level ([7]). While such benchmarks have advanced understanding of AI capabilities, researchers have noted that excelling on them does not always imply robust reasoning or real-world problem-solving. For instance, LLMs often improve on benchmarks via memorization or prompt tricks, leading to calls for more dynamic and challenging evaluations ([6]) ([5]).
Mathematical reasoning is a longtime challenge for LLMs. Early work like the MATH dataset (NeurIPS 2021) collected 12,500 math contest problems and showed GPT-3 models scored only ~5% accuracy, while a human (3-time IMO gold medalist) scored ~90% ([4]). Subsequent approaches (chain-of-thought prompting, self-consistency, code execution) have substantially improved performance, but complex problems often remain unsolved or only partially correct. A 2023 study using GPT-4’s code interpreter achieved ~70% accuracy on MATH, later boosting to ~84% via verification techniques ([8]) ([9]). By early 2026, the landscape has shifted dramatically: GPT-5.2 and OpenAI’s reasoning models (o3, o4-mini) achieve near-perfect scores on AIME with tool access, and models like Gemini 3 Pro score 95% on AIME without tools. Meanwhile, benchmarks like IneqMath (focused on proofs of inequalities) demonstrate that even top LLMs usually fail detailed step-by-step reasoning with <10% accuracy when proof rigor is required ([10]). These results illustrate that while AI’s ability to produce correct final answers has surged, genuine mathematical understanding remains imperfect.
To push limits further, some researchers have turned to math competition problems for evaluation. Math contests (e.g. AMC, AIME, Olympiads) present fresh, challenging problems beyond standard training corpora. For instance, the International Math Olympiad (IMO) in 2025 garnered attention when OpenAI and Google DeepMind claimed their in-development models solved 5 of 6 difficult Olympiad problems ([11]). However, independent evaluation by MathArena ([12]) found that none of the major mainstream models (Google Gemini, xAI Grok, Anthropic Claude, DeepSeek) produced fully correct solutions to the IMO problems ([5]). MathArena, developed by researchers at ETH Zürich, has since become one of the most trusted independent platforms for evaluating AI math reasoning on fresh competition problems ([13]). Mathematician Emily Riehl noted that models used “best-of-n” answer selection – akin to having multiple students solve a problem and then picking the best solution ([14]) – raising concerns that raw scores may overestimate genuine understanding. She also warned that LLM-generated proofs often contain subtle logical errors at the research frontier ([15]). These observations motivate the need for transparent, high-quality benchmarks that can truly test whether AI is “learning math” or merely pattern-matching exam answers.
In this context, HMMT (Harvard–MIT Mathematics Tournament) emerges as a relevant source of problems. HMMT is a prestigious annual math contest for high school students, held twice each school year (November and February). The November contest is roughly equivalent in difficulty to the AMC 10/12 and AIME contests, while the February contest features problems at national/international Olympiad levels ([16]). A standard HMMT packet includes multiple rounds: the Individual round (30 answer-based problems in Feb), a Team round (proof-based problems in Feb), and the fast-paced Guts round (36 short-answer problems) ([17]). For example, the February HMMT 2023 included 36 Guts problems and a Team round of 10 proof questions ([17]). These contests require creative problem-solving with novel mathematical ideas, not just the application of textbook formulas.
HMMT25 refers to using the 2025 problems from HMMT (both November 2024 and February 2025) as an AI benchmark. As a formal benchmark, HMMT25 collects these contest problems (in English) and measures an AI model’s accuracy in solving them. The purpose is to assess “advanced mathematical reasoning requiring sophisticated thinking and problem-solving strategies” ([1]). Unlike static datasets, contest problems change each year and often include novel problems that likely did not appear in training corpora. Thus HMMT25 aims to avoid data contamination and ensure evaluation of true generalization. According to the AIRank platform, HMMT25 tasks include algebra, geometry, combinatorics, and other math domains, similar to what contest participants face ([1]).
The HMMT25 Benchmark
HMMT25 is a “comprehensive mathematical evaluation benchmark” drawing on problems from official Harvard–MIT Math Tournament contests ([1]). It specifically uses the prior-year’s problem sets (e.g. HMMT Nov/Feb 2024/2025) as test questions. Each question requires a numeric or closed-form answer (contest-style), enabling automatic scoring of correctness. In practice, evaluation frameworks present each problem text to the model and check whether the model’s answer matches the official solution. The tasks span multiple areas of mathematics, including difficult algebraic manipulations, non-trivial geometry, combinatorial counting, number theory, and clever uses of inequalities. The benchmark thus emphasizes multi-step reasoning and often requires pattern recognition, creative insight, and careful calculation. According to AIRank, the HMMT25 benchmark is explicitly intended to test these abilities ([1]).
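The answer-matching evaluation described above can be sketched in a few lines. The actual HMMT25 harness is not published, so the interface and normalization rules below are illustrative assumptions: answers are canonicalized (whitespace stripped, simple fractions reduced) before comparison, and accuracy is the fraction of exact matches against the official key.

```python
from fractions import Fraction

def normalize(ans: str) -> str:
    """Canonicalize a contest answer: trim whitespace/dollar signs, reduce fractions."""
    ans = ans.strip().replace("$", "").replace(" ", "")
    try:
        return str(Fraction(ans))  # "14/21" and "2/3" both become "2/3"
    except ValueError:
        return ans  # leave symbolic answers (e.g. "2*sqrt(3)") as plain text

def score(predictions: list[str], answer_key: list[str]) -> float:
    """Fraction of problems where the model's answer matches the official key."""
    correct = sum(
        normalize(p) == normalize(k) for p, k in zip(predictions, answer_key)
    )
    return correct / len(answer_key)
```

A real harness would need far richer normalization (symbolic equivalence, decimal tolerance), but this captures the basic contest-style scoring: exact match against a key, no partial credit.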
Some specifics about the contest: HMMT holds two main events each year. The November contest (HMMT‐Nov) features problems roughly at AMC/AIME level and serves as the more accessible entry point. The February contest (HMMT‐Feb) is the more challenging one, selecting tougher problems akin to national Olympiads ([16]). In the February round, top scorers advance to an Invitational where a final set of harder problems is given. Each HMMT contest includes three rounds:
- Individual Round: In the February HMMT, there are 3 sets of 10 short-answer problems (100 minutes total). These are non-multiple-choice, free-response questions ([17]).
- Team Round: In February, teams solve 10 proof-based problems as a group, requiring written proofs (in November these are short answers).
- Guts Round: A fast-paced “relay” round with 4 sets of 9 short-answer questions each (total 36 problems) ([17]). Teams run to retrieve new sheets of problems.
For the AI benchmark, it is likely that the short-answer parts of the Individual and Guts rounds are used (since Team proofs would require full written solutions). Exact details of the HMMT25 construction (which problems are included) are not officially published, but the AIRank platform indicates that 6 models have been tested on HMMT25 ([18]). The benchmark is scored as the percentage of problems answered correctly (a model’s answer is simply checked against the official key). The AIRank summary shows a Top Score of 96.7% and an Average Score of 75.7% among submitted models ([19]). Notably, half of the models exceeded 80% accuracy, but the rest scored much lower, indicating a wide spread in ability ([20]).
AI Model Performance on HMMT25
The leaderboard results for HMMT25 (as of mid-2025) highlight the leading edge of mathematical AI. According to the AIRank site, six models have reported scores (all self-reported by the organizations, none externally verified). Table 1 below summarizes the top entries:
| Rank | Model | Organization | Release Date | HMMT25 Score (%) | Source |
|---|---|---|---|---|---|
| 1 | Grok-4 Heavy | xAI (Musk’s AI lab) | Jul 9, 2025 | 96.7 | ([2]) |
| 2 | Grok-4 | xAI | Jul 9, 2025 | 90.0 | ([2]) |
| 3 | Qwen3-235B-A22B-Thinking-2507 | Alibaba Cloud / Qwen AI | Jul 25, 2025 | 83.9 | ([2]) |
| 4 | Qwen3-Next-80B-A3B-Thinking | Alibaba Cloud / Qwen AI | Jan 10, 2025 | 73.9 | ([21]) |
| 5 | Qwen3-235B-A22B-Instruct-2507 | Alibaba Cloud / Qwen AI | Jul 22, 2025 | 55.4 | ([22]) |
| 6 | Qwen3-Next-80B-A3B-Instruct | Alibaba Cloud / Qwen AI | Jan 10, 2025 | 54.1 | ([23]) |
Table 1: Leaders on HMMT25 (higher is better; based on AIRank benchmark summary ([2])).
These results underscore two main points: (1) The leading models achieve very high accuracy on these contest problems. xAI’s Grok-4 Heavy answered nearly all correctly (96.7%), which surpasses even what a top human might get on such a test. (2) There is a significant performance gap between the top and bottom of the list. The top two (xAI’s models) average 93.3%, whereas the rest (Alibaba’s Qwen models) average only 66.8% ([24]). In particular, the Qwen series shows a steep decline from the “Thinking” variants (83.9%, 73.9%) to the “Instruct” variants (~55%). This suggests that model architecture, pretraining, or fine-tuning strategies (e.g. instruct-tuning vs reasoning mode) have a huge impact.
For additional context, other evaluation platforms like BenchLM report similar high-level trends across benchmarks (see Section Comparisons below). GPT-5 variants (OpenAI’s models released mid-2025) score around 92% on HMMT test sets ([3]). OpenAI’s reasoning-focused models have pushed even further: o4-mini with Python tool access achieves 99.5% on AIME 2025, and GPT-5.2 (released late 2025) scores a perfect 100% on AIME 2025 ([25]). Google’s Gemini 3 Pro scores 95% on AIME without tools, with its "Deep Think" variant solving over 40% of FrontierMath problems – a dramatic leap from the 2% state-of-the-art when that benchmark launched in late 2024 ([26]). On HMMT specifically, Grok-4 Heavy still edges out GPT-5 (which BenchLM shows at ~92% on HMMT 2025 ([3])), though the gap is narrowing as newer models continue to arrive.
It is important to recognize that accuracy percentages on benchmarks like HMMT25 may not capture full problem-solving ability. As explored in Section Implications, models may use brute-force tactics (e.g. sampling many answers and picking a correct one ([14])) or even retrieve answers if problems have leaked into training data. AIRank results are “self-reported” meaning the organizations ran their own tests and reported the scores ([27]). Without external auditing, there is always a possibility of subtle bias or undisclosed assistance (e.g. code execution) in generating these answers.
Nevertheless, the data from HMMT25 provides valuable empirical evidence: top-tier LLMs can solve nearly all of a very difficult math contest, whereas second-tier models achieve much lower rates. By early 2026, independent evaluation via platforms like MathArena ([12]) has largely corroborated these self-reported scores, confirming that the frontier models (GPT-5, Grok-4, Gemini 3) have effectively mastered high-school Olympiad-level short-answer problems.
Analysis of HMMT25 Data and Performance Trends
Distribution of Scores. Based on the AIRank summary, the six submitted models on HMMT25 had a rather bimodal distribution. Half (3 models) scored 80% or higher, and half below (the “High Performers (80%+)” count is 3) ([20]). The top two alone average 93.3%, the four from Alibaba average 66.8% ([24]). The overall mean for all models was 75.7% ([19]). In practical terms, some top models are approaching near-perfect performance, while many others struggle. This mirrors observations on other benchmarks: e.g. on the MATH dataset, GPT-4’s reported accuracy hovered around 50-60% without tools, while GPT-3 was ~5% ([4]). In both cases, there is a large gap between state-of-the-art and the rest.
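The aggregate figures quoted above can be reproduced directly from the Table 1 scores. The short sketch below (model names abbreviated) computes the group averages and the "80%+" count; note that the exact top-two average is 93.35%, which the AIRank summary reports as 93.3%.

```python
from statistics import mean

# HMMT25 scores from Table 1 (AIRank leaderboard)
scores = {
    "Grok-4 Heavy": 96.7,
    "Grok-4": 90.0,
    "Qwen3-235B-Thinking": 83.9,
    "Qwen3-Next-80B-Thinking": 73.9,
    "Qwen3-235B-Instruct": 55.4,
    "Qwen3-Next-80B-Instruct": 54.1,
}

xai  = [v for k, v in scores.items() if k.startswith("Grok")]
qwen = [v for k, v in scores.items() if k.startswith("Qwen")]

print(f"xAI average:    {mean(xai):.2f}")               # 93.35 (quoted as 93.3)
print(f"Qwen average:   {mean(qwen):.1f}")              # 66.8
print(f"Overall mean:   {mean(scores.values()):.1f}")   # 75.7
print(f"Models at 80%+: {sum(v >= 80 for v in scores.values())}")  # 3
```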
Comparison to Human-level. HMMT problems are designed for strong high-school mathematicians. It is hard to find published aggregate human scores on HMMT, but reasonable inference comes from similar contests. In Hendrycks et al. (2021) the MATH dataset reported 90% accuracy for a 3×IMO gold medalist ([4]). If we use that as a proxy, Grok-4 Heavy’s 96.7% on HMMT25 suggests AI now matches or even slightly exceeds elite human performance on these problems (at least in the raw accuracy metric). This is a striking milestone – a “moonshot” achievement, as some commentators have called it ([11]). However, unlike the IMO results cited in [SciAm 2025], where only emerging models were tested, HMMT25 is an independent, formal benchmark and may better reflect “real exam” conditions. The fact that Grok-4 Heavy nearly maxed the test indicates that at least for short-answer problems, AI has essentially “caught up” with contest-caliber students.
Benchmark Specifics. The problems in HMMT25 cover an unusually wide range. While many existing benchmarks (e.g. GSM8K) focus on simpler arithmetic or algebra word problems, HMMT25 includes heavy geometry and combinatorics. For instance, HMMT often has geometry questions requiring creative insight into diagram properties (which LLMs see only textually). Similarly, some combinatorial problems involve intricate counting arguments. A key observation is that these problems often do not have straightforward multi-step natural language explanations written in common sources. Therefore, solving them presumably relies on genuine reasoning capabilities, not memorization of text. That said, some contest solutions are posted online after the fact, so future benchmarks might need new problems or blind-testing to guard against leaks.
Performance by Model Type. The top two spots belong to xAI’s Grok series (Heavy and standard versions). These models are billed as large-scale general LLMs trained on a mix of data including web text and code, and reportedly tuned for robust reasoning ([28]). Their high scores indicate strong multi-step numeracy skills. In contrast, the Alibaba Cloud “Qwen” series serves as the other competitor. The “Thinking” variants (which likely use chain-of-thought prompts or reasoning modes) scored relatively well (83.9%, 73.9%), whereas the “Instruct” variants (optimized for following direct instructions) fell far behind (~54%). This highlights that a model’s prompting/training mode greatly affects performance: reasoning-mode models produce far better results than plainly instruct-tuned ones. It suggests that just tuning on following queries (Instruct) is insufficient for complex reasoning – the model must be encouraged to internally process the steps.
Comparison with Contemporary Models. As of early 2026, the competitive landscape has intensified significantly. Major LLMs (OpenAI GPT-5/5.2, Anthropic Claude Opus 4.6, Google Gemini 3.1 Pro, xAI Grok 4.20) compete fiercely across all benchmarks. Independent reviews indicate GPT-5 leads on many benchmarks, even marginally surpassing Grok and Claude on aggregate ([28]). On math tasks specifically, GPT-5 (high) scores 92% on HMMT25 ([3]), slightly below Grok-4 Heavy’s 96.7%. OpenAI’s reasoning models (o3, o4-mini) achieve 88.9–99.5% on AIME 2025 depending on tool access ([25]). Claude Opus 4.6 achieves 100% on AIME 2025 with tools and 87% without. Gemini 3 Deep Think has emerged as particularly strong in mathematics, solving 37.6% of FrontierMath problems – well ahead of GPT-5’s 32.4% on that benchmark ([26]). DeepSeek R1, a notable open-source contender, scores 87.5% on AIME 2025 (up from 70% in its initial release). xAI released Grok 4.20 in beta (February 2026) with a multi-agent collaboration architecture, though formal benchmark figures for HMMT are pending.
Overall, the HMMT25 results paint a picture of state-of-the-art LLMs achieving near-human or super-human accuracy on difficult math problems. At the same time, they reveal weaknesses: substantial dropoffs beyond the elite models, and potential reliance on “tricks” (see below). It is crucial to interpret these scores carefully, considering evaluation methodology and real reasoning ability.
Comparisons with Other Math Benchmarks
It is informative to place HMMT25 in the context of other mathematical benchmarks. Table 2 provides selected comparisons.
| Benchmark | Year Introduced/Used | Problem Source | Example Top Model Score (%) | Comments | Source |
|---|---|---|---|---|---|
| AIME 2025 | 2025 (annual) | American Invitational Math Exam | ~96 (GPT-5 high) ([3]) | High-school contest (precalculus algebra) | BenchLM ([3]) |
| HMMT Feb 2025 | 2025 (annual) | Harvard–MIT Math Tournament | 96.7 (Grok-4 Heavy) ([2]) | Collegiate-level contest (olympiad problems) | AIRank ([2]) |
| MATH (NeurIPS 2021) | 2021 | 12,500 Olympiad-style problems | ~5 (GPT-3) ([4]); ~70 (GPT-4 Code) ([8]) | Diverse competition problems, with full solutions | Hendrycks et al. ([4]); Schreiner ([8]) |
| IMO 2025 | 2025 | International Math Olympiad (HS) | 83 (5/6 solved by cutting-edge models) ([11]) | Hardest global high-school competition; 6 very hard problems | Scientific Am. ([11]) |
| FrontierMath | 2024 | Research-level math problems | ~40 (Gemini 3 Deep Think) ([26]) | Unsolved research-grade problems; was 2% in 2024 | Epoch AI ([26]) |
| Humanity's Last Exam | 2025 | 2,500 expert-level questions | ~45 (Gemini 3.1 Pro) ([29]) | Cross-disciplinary; human experts score ~90% | Center for AI Safety ([29]) |
Table 2: Comparison of selected math and reasoning benchmarks. Percent scores reflect the fraction of problems solved correctly by top AI models listed. (Benchmarks vary in format: AIME and HMMT use short-answer tests, IMO uses proof-based problems, FrontierMath and HLE test broader expert-level reasoning.)
From Table 2 we notice that AIME and HMMT (Feb) are roughly comparable in difficulty: top AI models score in the mid-90s on both. This makes sense since AIME is targeted at advanced high-schoolers and is part of the same contest ecosystem. Meanwhile, the IMO is harder; five out of six problems solved corresponds to ~83%, which was celebrated as an AI breakthrough ([11]). The MATH dataset (constructed in 2021) was once extremely challenging – GPT-3 scored about 5% ([4]) – but has now been essentially saturated, with frontier models scoring 95%+. FrontierMath, introduced in late 2024 by Epoch AI, represents the new frontier: its research-level problems initially stumped all models at ~2%, but by early 2026, Gemini 3 Deep Think solves over 40% of problems across tiers 1–3. Humanity's Last Exam, launched by the Center for AI Safety, tests 2,500 questions across over 100 academic domains; top models like Gemini 3.1 Pro score around 45%, far below human expert levels of ~90% ([29]). This layered picture suggests that HMMT25 sits in a "solved" zone for frontier models, between the now-saturated MATH/AIME benchmarks and the still-challenging FrontierMath and proof-based competitions.
It is also useful to compare across model categories. Beyond Grok and Qwen, other LLMs show strong but varied performance: Anthropic’s Claude Opus 4.6 achieves 100% on AIME 2025 with tool access, while Google’s Gemini 3 Pro scores 95% on AIME without tools ([3]). DeepSeek R1 (open-source) scores 87.5% on AIME 2025 after its mid-2025 update. Meta’s Llama-3 models (open weight) achieve at best ~67% on HMMT ([30]). These numbers underscore that closed-source frontier models now routinely achieve 90%+ on contest math, well-trained open-source reasoning models reach 70–87%, whereas earlier or smaller models lag further behind. The rapid improvement is dramatic: models that scored single digits on MATH in 2021 now routinely hit the mid-90s on comparable benchmarks just four years later.
In summary, HMMT25 is a well-established benchmark whose top scores reflect current frontier LLMs. Its difficulty exceeds common grade-school math tasks (like GSM8K or AQuA) but is slightly lower than full IMO proof tasks. As of early 2026, HMMT25 has largely been "solved" by frontier models – the benchmark's value now lies in differentiating mid-tier models and tracking how quickly new architectures close the gap with the leaders. For truly unsolved challenges, the community has shifted attention to FrontierMath, Humanity's Last Exam, and proof-based evaluations (see next section).
Case Studies and Examples
Case Study: GPT-4 Code Interpreter on Math Benchmarks. Although not specifically on HMMT25, the GPT-4 Code Interpreter’s performance on the MATH dataset illustrates two key points: (a) the power of tool-augmented LLMs for math, and (b) the limits of non-tool LLMs. As reported by The Decoder (Aug 2023), using the Code Interpreter mode, GPT-4 achieved 69.7% on MATH ([8]) – a dramatic jump over GPT-4’s ~42% without tools. Moreover, by implementing “explicit code-based self-verification” and weighted voting, researchers boosted that to 84.3% ([9]). This suggests that allowing a model to compute (via a Python sandbox) and check its own calculations yields near-superhuman performance on hard problems. If similar techniques were applied to HMMT problems, performance would likely climb further. However, HMMT25 as presented likely assumes a pure LLM without external tools, so tool-augmented configurations did not enjoy such advantages in the HMMT evaluation.
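The "self-verification plus weighted voting" idea can be illustrated with a minimal sketch. This is not the cited paper's exact method; it simply shows the aggregation step under assumed weights (`w_pass`, `w_fail` are illustrative): each sampled answer contributes a vote, and samples that passed a code-based check count more heavily.

```python
from collections import defaultdict

def weighted_vote(samples: list[str], verified: list[bool],
                  w_pass: float = 1.0, w_fail: float = 0.2) -> str:
    """Pick the answer with the highest total weight; verified samples count more."""
    weight = defaultdict(float)
    for ans, ok in zip(samples, verified):
        weight[ans] += w_pass if ok else w_fail
    return max(weight, key=weight.get)

# Five sampled answers; only the first passed the code-based self-check.
answers  = ["2/3", "3/4", "2/3", "3/4", "2/3"]
verified = [True, False, False, False, False]
print(weighted_vote(answers, verified))  # 2/3 (total weight 1.4 vs 0.4)
```

The design point is that verification acts as a soft filter: unverified samples still vote, but a single verified sample can outweigh several unverified ones.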
Case Study: Olympiad Problems and AI Mistakes. The 2025 IMO illustrations from Scientific American reveal typical failure modes. Even top LLMs that “solve” contest problems may do so by generating plausible-looking text rather than rigorous reasoning ([15]). Interviewed mathematician Emily Riehl recounted that “every model [she] asked has made the same subtle mistake” on an advanced category theory question ([15]). This underscores that LLM “solutions” often omit or fudge crucial logical steps. In the context of HMMT25, this means that a model might output the correct numeric answer but without a correct chain of thought. Indeed, the IneqMath evaluation found that while an LLM’s final numeric answer might match the key, the intermediate reasoning was wrong in ~65% of cases ([10]). We do not have step-level analysis of HMMT25 answers, but this suggests caution: even 96.7% score by Grok-4 Heavy may not mean it “understood” every proof in a human sense, only that it got almost all answers right.
Case Study: Multiple-choice vs Open-answer. Many benchmarks allow multiple-choice, which can inflate AI scores via elimination strategies. HMMT25 uses open-ended answers, preventing that. However, we must consider whether models could exploit other shortcuts. For example, an LLM might recognize question patterns from training (even if not memorizing exact answers), or latch onto superficial cues. In the IMO case, the use of self-consistency (running multiple solutions) was likened to having many students collaborate ([14]). If a model can try repeatedly until it hits the correct answer, a high score may not reflect reliability on the first try. For practical applications, we care more about robust single-shot performance. Unfortunately, public leaderboards often report the “best-of-n” result, which can mask true consistency.
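How much can best-of-n inflate a score? Under a simplifying independence assumption, the standard pass@n formula makes the effect concrete: a model with modest single-shot accuracy looks near-perfect when any one of n samples counting as success.

```python
def pass_at_n(p_single: float, n: int) -> float:
    """Chance that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p_single) ** n

# A model that solves a given problem 40% of the time in a single shot:
for n in (1, 4, 16):
    print(n, round(pass_at_n(0.40, n), 3))
# accuracy climbs from 0.4 (single shot) toward ~1.0 as n grows
```

Real samples are not independent (a model tends to repeat its mistakes), so this is an upper bound; but it illustrates why a best-of-n leaderboard score can substantially overstate first-try reliability.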
Case Study: Data Leakage and Benchmark Integrity. A significant concern is whether HMMT problems have leaked into LLM training sets. HMMT problems are not widely published in textbooks, but solutions from past years can circulate on math forums. If a model was pretrained on internet data including old HMMT archives, it might recall or pattern-match. The HMMT organizers have no published dataset for AI testing, unlike the partitioned MATH or GSM8K. Hence we rely on new content. Ideally, HMMT25 as a benchmark would use the latest (2025) problems before they are publicly released, to ensure fairness. There is a parallel here with the cited IMO exercise: the IMO president explicitly stated he could not confirm if AI “training leakage” occurred ([31]). This highlights the general difficulty of benchmarking next-gen AI: static known benchmarks eventually get learned by the models, necessitating fresh, unseen problems (or mechanically generated ones) to truly assess generalization.
Implications and Future Directions
The emergence of HMMT25 as a benchmark and the high scores achieved carry several implications for AI research, education, and safety.
1. Advancing Mathematical AI. Achieving nearly 100% on HMMT25 suggests that LLMs are becoming extremely capable at high-school mathematics. This could accelerate automated problem solving and tutoring. For instance, an AI tutor could now potentially solve and explain a wide range of contest-level questions in real time. Indeed, one might imagine students using these models to check their answers or even learn problem-solving techniques. However, as experts caution ([15]), without formal correctness assurance, the models’ “think-aloud” solutions should be verified. This points to a trend: combining LLMs with formal proof assistants (like Lean, Coq) may be crucial. Interestingly, some AI teams at IMO already had their models output Lean proofs which were formally checked ([32]). In the future, we may see LLMs integrated with symbolic tools to ensure correctness, so that a 96.7% score is backed by a verified proof.
2. Benchmark Robustness and Novelty. The need for fresh evaluation keeps growing. As Time Magazine reports, test creators (AI labs and non-profits alike) are designing hyper-challenging tasks to stay ahead ([33]). Since this article was first published, several new benchmarks have emerged: FrontierMath (research-level mathematics from Epoch AI), Humanity's Last Exam (2,500 expert-curated questions across 100+ disciplines), and the First Proof challenge (10 extremely difficult research math problems proposed by 11 distinguished mathematicians in February 2026). The HMMT25 example shows one approach: leveraging real-world competitions. MathArena ([12]), developed by ETH Zürich researchers, has become the go-to independent platform for evaluating models on fresh, uncontaminated competition problems ([13]). The key lesson is that evaluation must evolve faster than models improve. A static benchmark that AI masters becomes trivial – HMMT25 itself is now approaching that threshold. Thus, expert observers argue for continuous generation of test problems and third-party audits to ensure fairness ([34]) ([35]). HMMT naturally updates each year; similarly, Epoch AI's FrontierMath: Open Problems now includes 16 genuinely unsolved research problems, representing a fundamentally different challenge tier.
3. Understanding LLM Capabilities. Benchmarks like HMMT25 reveal what current models can and cannot do. The high scores show that models effectively handle complex algebraic manipulations and multi-step logic up to Olympiad-level problems. But the persistent failures on novel proof issues (as discussed above) indicate that models still largely operate by pattern matching and probabilistic reasoning, not true logical deduction. This echoes the limitations observed in multiple studies: LLM “reasoning” often lacks genuine chain-of-thought consistency ([10]). Thus, future research is likely to focus on hybrid architectures (neural + symbolic) and on training objectives that emphasize internal reasoning consistency. In the near term, we may see incremental improvements like larger context windows, more sophisticated prompt engineering, or ensemble models to handle trickier HMMT-type questions.
4. Societal and Safety Considerations. The fact that AI now scores so well on HMMT raises questions. In education, it could challenge how we teach and test mathematics. If an AI can solve any contest problem, evaluation methods must change (e.g. oral exams, proctored in-room tests, or new problem types). In science and engineering, we can be optimistic: AI may assist researchers in deriving formulas or checking work. However, as the IMO headline suggests ([36]), there are also concerns about overhyping these milestones. Experts stress that even if AI solves contest problems, it is not replacing mathematicians – real research problems are far harder and take human creativity over years ([37]). The safe deployment of such capable models also demands benchmarks in non-math domains, because solving math puzzles is only one slice of intelligence.
5. Future Benchmarks and Directions. Looking forward, several trends have crystallized since the original HMMT25 results:
- Multimodal Math. Current HMMT25 is text-only. But real-world math often involves figures and diagrams. As vision-language models mature (GPT-5 and Gemini 3 are natively multimodal), future benchmarks are incorporating diagrams and visual reasoning. A model would need to interpret geometry diagrams, read graphs, and reason visually – a capability that is improving rapidly but not yet at contest-level reliability.
- Dynamic Evaluation. The need for “dynamic” or streaming benchmarks is now being addressed. MathArena.ai continuously evaluates models on new competition problems as they are released, providing uncontaminated assessments. Research from Fudan/Tongji suggests creating evolving evaluation to reduce data leakage ([38]). Epoch AI's FrontierMath similarly adds new problems over time, including genuinely unsolved research questions.
- Proof-Verified Benchmarks. Given errors in LLM reasoning, benchmarks increasingly require step-by-step proofs, not just answers. Projects like IneqMath exemplify this: they only count a solution correct if each logical step is valid ([10]). The First Proof challenge (February 2026) takes this further by requiring models to produce multi-page research-level proofs. Integration with formal proof assistants (Lean 4, Coq) is becoming standard practice for verifying AI-generated mathematical arguments.
- Combined Reasoning and Tools. Benchmarks now routinely report scores both with and without tool access. The gap is stark: o4-mini jumps from ~89% to 99.5% on AIME with a Python interpreter ([25]). This has led to a bifurcation in benchmarking: “pure reasoning” scores (no tools) vs “agentic” scores (with code execution, search, etc.). Multi-agent systems like Grok 4.20's four-agent collaboration architecture represent the cutting edge of this tool-augmented approach.
-
Beyond Math: Real-World and Expert-Level Problems. Humanity's Last Exam has broadened the evaluation scope beyond math to 100+ academic disciplines. While HMMT25 is narrowly focused on contest math, the broader trend is toward benchmarks testing expert-level reasoning across sciences, medicine, law, and engineering. The community now recognizes that solving olympiad math is necessary but insufficient for demonstrating general intelligence – models must also handle ambiguity, open-ended research, and real-world complexity.
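To make the proof-verified direction concrete, here is a minimal sketch of what kernel-checked reasoning looks like in Lean 4 (toy statements using only the core library, not benchmark problems). The key property is that every step a model emits must type-check: an unjustified leap fails to compile rather than being silently scored as correct, which is exactly the guarantee answer-only benchmarks lack.

```lean
-- Toy examples of machine-checked reasoning in Lean 4 (core library only).
-- Each proof term is verified by the kernel; an invalid step is rejected.

-- `Nat.le_add_right : ∀ (n k : Nat), n ≤ n + k` justifies this inequality.
example (a b : Nat) : a ≤ a + b := Nat.le_add_right a b

-- A simple identity closed by the core lemma `Nat.zero_add`.
example (n : Nat) : 0 + n = n := Nat.zero_add n
```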
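The tool-access gap is easy to appreciate on a toy example. Many contest-style counting questions that demand careful inclusion-exclusion in prose reduce to a few lines of brute force once a model can execute code, which is one reason interpreter access moves scores so dramatically. A minimal sketch (an illustrative textbook-style question, not an actual HMMT problem):

```python
# Toy contest-style question: how many integers in 1..1000 are divisible
# by 3 or by 5? In prose this needs inclusion-exclusion; with a code tool
# it is a direct enumeration.
count = sum(1 for n in range(1, 1001) if n % 3 == 0 or n % 5 == 0)

# Cross-check against the inclusion-exclusion count the prose argument uses:
# |div by 3| + |div by 5| - |div by 15|
formula = 1000 // 3 + 1000 // 5 - 1000 // 15

assert count == formula
print(count)  # 467
```

The enumeration and the formula agree by construction, which is precisely the kind of self-checking a tool-augmented model can do that a pure-text model cannot.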
Conclusion
HMMT25 represents a cutting-edge benchmark at the intersection of AI and mathematics. It crystallizes how far large language models have come—and how far they still have to go—in mastering human-level mathematical reasoning. Top models now demonstrate an almost uncanny ability to tackle contest problems that once seemed exclusive to prodigies ([2]) ([4]). Yet, in-depth analysis shows that this facility has limits: AI can output correct answers but may not provide truly rigorous reasoning across the board ([10]) ([15]).
Our review indicates that the top HMMT25 score (96.7%) is consistent with other chart-topping results (near-perfect on AIME, low-80s on IMO), confirming that frontier AI models have effectively mastered advanced high-school mathematics under exam conditions ([2]) ([3]). By early 2026, with GPT-5.2 achieving 100% on AIME and Gemini 3 Deep Think solving 40%+ of FrontierMath, the challenge has definitively shifted from "can AI solve contest math?" to "can AI do genuine mathematical research?" However, the mixed results on more open-ended proofs and the known pitfalls of best-of-n sampling mean that these benchmarks should be interpreted with caution. The very creation of HMMT25 and its successor benchmarks (FrontierMath, Humanity's Last Exam, MathArena) is a positive step: it forces models (and researchers) to address new challenges, promoting robustness and creativity rather than rote learning.
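One reason best-of-n results deserve caution: repeated sampling mechanically inflates apparent accuracy even when per-attempt reliability is modest. Under an idealized independence assumption (samples rarely fail independently in practice, so this is a best case), the probability that at least one of n samples succeeds is 1 - (1 - p)^n. A short sketch:

```python
# Best-of-n ("pass@n") under an idealized independence assumption:
# if one sample solves a problem with probability p, then at least one
# of n independent samples succeeds with probability 1 - (1 - p)^n.
def pass_at_n(p: float, n: int) -> float:
    return 1.0 - (1.0 - p) ** n

# A per-attempt accuracy of 60% already looks near-perfect by n = 8,
# which is why single-sample and best-of-n scores must be reported apart.
for n in (1, 4, 8, 32):
    print(n, round(pass_at_n(0.6, n), 4))
```

This is why leaderboards that conflate single-shot accuracy with best-of-n accuracy can overstate a model's reliability on any individual attempt.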
Looking ahead, AI is becoming an invaluable tool for learning and discovery in mathematics, increasingly allied with symbolic reasoning and formal verification systems to ensure correctness. HMMT25 has helped chart the map: it showed what top models could achieve in 2025 and highlighted the gaps between correct answers and rigorous reasoning. The new frontier – FrontierMath's research-level problems, where even the best models solve only ~40% – represents the next mountain to climb. The combined evidence from HMMT25 and related studies suggests a near future where LLMs are ubiquitous helpers in STEM fields, provided that we carefully benchmark their abilities and remain vigilant about understanding their limitations ([15]) ([35]).
Sources: All statements above are supported by peer-reviewed studies, technical reports, and benchmark data. Key references include the AIRank HMMT25 leaderboard ([1]) ([2]), the MATH dataset paper ([4]), Scientific American on AI Olympiad performance ([5]), BenchLM math leaderboard ([39]), MathArena for independent evaluation ([12]) ([13]), Epoch AI FrontierMath benchmark ([26]), Humanity's Last Exam ([29]), OpenAI model announcements ([25]), and recent benchmarking analyses of LLMs on math tasks ([8]) ([10]), among others. These confirm the trends and findings discussed herein with empirical data. Each claim is cited to the original source for verification.
External Sources (39)