IntuitionLabs

GPT-5.2 & ARC-AGI-2: A Benchmark Analysis of AI Reasoning

[Revised April 15, 2026]

Executive Summary

OpenAI’s GPT-5.2, released in December 2025, marked a dramatic leap in AI abstract reasoning when it achieved approximately a 53–54% success rate on ARC-AGI-2 tasks ([1]) ([2]), far above the prior state of the art at the time. Since then, the ARC-AGI-2 leaderboard has evolved rapidly: OpenAI’s successor GPT-5.4 Pro (March 2026) pushed to 83.3% ([3]), Google’s Gemini 3 “Deep Think” reached 84.6% ([4]), and meta-systems like Confluence Lab (97.9%) and Imbue’s Darwinian Evolver (95.1%) have shattered previous records ([5]). Anthropic’s Claude Opus 4.6 also entered the race at 69–75% ([6]). Earlier models (GPT-4, GPT-5, etc.) had virtually zero success on ARC-AGI-2 ([7]) ([8]). While GPT-5.2 is no longer the top performer on ARC-AGI-2, its December 2025 breakthrough represented the pivotal moment when AI first crossed the 50% threshold on this benchmark. All systems still fall short of human-level general intelligence (humans solve 100% of ARC-AGI-2 tasks ([9])), and the newly launched ARC-AGI-3 benchmark (March 2026) has once again humbled frontier models — with even the best scoring below 1% ([10]) — underscoring that “true fluid intelligence” remains elusive ([11]) ([12]). This report provides a detailed analysis of GPT-5.2’s original achievements on ARC-AGI-2, the rapid progress that followed, the technical context and history of the benchmark, a comparison with other approaches (including case studies like Poetiq, Imbue, and Kaggle winners), and discusses the broader implications for AI’s progress and future. All claims are supported by extensive evidence and citations.

Introduction and Background

The quest for Artificial General Intelligence (AGI)—AI that can understand, learn, and reason across diverse tasks as well as or better than humans—has driven interest in specialized benchmarks that probe “fluid” intelligence. Traditional AI benchmarks often measured narrow skills (e.g. NLP accuracy, image classification), but François Chollet’s ARC initiatives focus on abstract reasoning and generalization ([8]) ([13]). In 2019 Chollet introduced the Abstraction and Reasoning Corpus (ARC), a dataset of 1,000 small image-pattern puzzles requiring conceptual abstraction ([13]). The ARC tasks “measure the gap between machine and human learning”: humans solve them easily, but early AI models failed dramatically.

Building on this, the ARC-AGI series extends these ideas to measure steps toward AGI. ARC-AGI-1 (the original 2019 benchmark, now stewarded by the ARC Prize Foundation) presented 500+ tasks aimed at core abstract reasoning. By late 2024, OpenAI’s advanced system “o3” had surmounted ARC-AGI-1 (scoring ~75–87%, near human level) ([11]) ([14]), prompting some to suggest an AGI breakthrough. However, critics like Chollet cautioned that even this success was likely due to sheer scale and engineering tricks, not genuine intelligence. To test this, Chollet and collaborators launched the much harder ARC-AGI-2 benchmark in 2025 ([15]) ([11]). ARC-AGI-2 dramatically raises task difficulty and reduces susceptibility to brute-force strategies, with the explicit goal of exposing “what is fundamentally missing in our current AI architectures” ([11]) ([7]). In this new contest, AI performance initially plunged: as of mid-2025, no closed-model system had cleared even 5% success on ARC-AGI-2 tasks ([7]) ([11]), while humans continued to solve 100% ([9]). This stark contrast highlights the enormous reasoning gap to overcome.

The ARC-AGI-2 benchmark and associated competitions (including a $1M Kaggle contest) have thus become a “pulse check” on progress toward true reasoning AI ([16]) ([11]). Success on ARC-AGI-2 requires genuine fluid abstraction: given a few training examples of a visual puzzle (colored grids) and their solutions, the AI must infer the underlying rule and apply it to novel inputs. Unlike tasks solvable by pattern recognition or memorization, ARC-AGI-2 puzzles demand insight and adaptability. As one observer notes, the tasks are “relatively easy for humans, yet hard, or impossible, for AI” ([15]) ([8]). For example, humans on average take only ~2.3 minutes per ARC-AGI-2 task ([17]). In aggregate, 400 human subjects solved all candidate tasks under these conditions ([18]). By design every task was solved by ≥2 people in ≤2 attempts ([18]) ([15]), ensuring human feasibility. Against this, even the most capable AI strategies in 2025 achieved only low single-digit scores ([7]). This underscores that overcoming the ARC-AGI-2 tasks is far beyond straightforward scaling of current models; it requires fundamentally smarter reasoning.

Table 1 (below) summarizes leading systems’ ARC-AGI-2 performance as of April 2026, reflecting both the original December 2025 landscape and the rapid progress since.

| Model / System | Organization | ARC-AGI-2 Score | Source |
| --- | --- | --- | --- |
| Confluence Lab (meta-system) | Confluence Lab | 97.9% | ARC Prize leaderboard ([5]) |
| Imbue Darwinian Evolver + Gemini 3.1 | Imbue | 95.1% | Imbue research blog ([46]) |
| Gemini 3 “Deep Think” (Feb 2026) | Google | 84.6% | MarkTechPost ([4]) |
| GPT-5.4 Pro (Mar 2026) | OpenAI | 83.3% | OpenAI Help Center ([3]) |
| Gemini 3.1 Pro Preview | Google | 77.1% | Gend.co benchmarks ([47]) |
| Claude Opus 4.6 | Anthropic | 69–75% | MorphLLM benchmarks ([6]) |
| Claude Sonnet 4.6 | Anthropic | 58.3% | MorphLLM benchmarks ([6]) |
| GPT-5.2 Pro (X-High, Dec 2025) | OpenAI | 54.2% | OpenAI GPT-5.2 datasheet ([44]) |
| Poetiq (meta-system; semi-private) | Poetiq AI (startup) | 54.0% | Poetiq ARC-AGI-2 report ([29]) |
| GPT-5.2 (Thinking, Dec 2025) | OpenAI | 52.9% | OpenAI GPT-5.2 datasheet ([1]) |
| NVIDIA “NVARC” (Kaggle team) | NVIDIA / Kaggle | 27.64% | NVIDIA Kaggle blog ([27]) |
| GPT-5.1 (ChatGPT-5.1) | OpenAI | 17.6% | OpenAI GPT-5.2 datasheet ([1]) |
| ARC Prize custom/Kaggle (2024)* | Community (ARC Prize) | 2–4% | ARC Prize report ([14]) |

*Note: Early 2025 public leaderboards had no system above 5% ([7]). GPT-5.2’s December 2025 results were the first to breach 50%. By April 2026, meta-systems and evolutionary approaches have pushed scores above 95%, though the newly launched ARC-AGI-3 has reset the frontier to near-zero.

The ARC-AGI-2 Benchmark

ARC-AGI-2 was released in March 2025 by the ARC Prize Foundation, with the explicit goal of highlighting AI’s reasoning gaps ([15]) ([19]). It maintains the signature “easy for humans, hard for AI” design philosophy ([15]) ([20]). The benchmark is built on the following structure ([19]) ([15]):

  • Training Set (1,000 tasks): A public set of 1,000 diverse puzzles meant to teach core concepts (e.g. basic object shapes, colors, symmetries). This set is uncalibrated and overlaps with the evaluation tasks only in the primitive concepts it illustrates.

  • Public Evaluation Set (120 tasks): A set of 120 “calibrated” puzzles, each verified as human-solvable. Participants may publicly trial their systems on these (with pass@2 scoring); they offer a first glimpse of the test difficulty.

  • Semi-Private Evaluation (120 tasks): An additional 120 tasks (same format) used for leaderboard scoring during the Kaggle contest. Submissions are evaluated against them under test-time restrictions, and the tasks themselves are not publicly released.

  • Private Evaluation (120 tasks): The final 120 tasks held out for the competition finale. All tasks in public, semi-private, and private eval sets were rigorously tested by humans (≥2 humans solved each under pass@2) ([7]) ([21]).

By design, each task in ARC-AGI-2 follows the classic ARC puzzle format: the system sees a few (usually 2–3) example input-output grids and must produce the correct output grid for a new input. Unlike static test problems, ARC-AGI-2 puzzles often require understanding symbols and relations beyond pixel patterns (e.g. interpreting colors as state variables, recognizing compositional rules, or applying context-sensitive operations) ([22]) ([23]). Areas that “break” AI include tasks with symbolic interpretation, compositional reasoning, and contextual rule application ([22]). In these, AI systems typically latch onto superficial patterns and fail when deeper logic or multi-step rules are needed, as the ARC technical report notes ([24]).
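The puzzle format and pass@2 scoring described above can be made concrete with a small sketch. The grids, the hidden answer, and the rotation “rule” below are invented for illustration; this is not the official ARC Prize evaluation harness.

```python
# Illustrative sketch of the ARC task format and pass@2 scoring.
# Grids are small 2-D arrays of color indices (0-9); each task gives a few
# train pairs and withholds the test output. The task and toy solver rule
# here are invented for the example.

Grid = list[list[int]]

task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 0], [0, 1]]},
        {"input": [[0, 2], [0, 0]], "output": [[0, 0], [2, 0]]},
    ],
    "test": [{"input": [[0, 0], [3, 0]]}],  # hidden answer: [[0, 3], [0, 0]]
}

def rotate_180(grid: Grid) -> Grid:
    """Toy hypothesis: the underlying rule is a 180-degree rotation."""
    return [row[::-1] for row in grid[::-1]]

def pass_at_2(attempts: list[Grid], answer: Grid) -> bool:
    """A task counts as solved if either of (at most) two attempts matches exactly."""
    return any(a == answer for a in attempts[:2])

hidden_answer = [[0, 3], [0, 0]]
attempts = [rotate_180(task["test"][0]["input"]),  # first guess: rotation hypothesis
            task["test"][0]["input"]]              # fallback guess: identity
print(pass_at_2(attempts, hidden_answer))  # True: the rotation hypothesis fits
```

Note how the two train pairs are both consistent with the rotation hypothesis; real ARC-AGI-2 rules are compositional and context-sensitive, which is precisely what defeats shallow pattern matching.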

Critically, human testing confirmed that ARC-AGI-2 remains easy for people: in a controlled study of 400 non-expert volunteers on 1,417 candidate tasks, every task was solved by ≥2 individuals, with an average human time of ~2.3 minutes per task ([9]) ([17]). In aggregate, humans scored 100% on ARC-AGI-2 ([25]). This human baseline underscores the tasks’ solvability and contrasts sharply with AI difficulty. By contrast, the initial evaluation of existing AI systems showed abysmal performance: “none of the leading AI models have surpassed a 5% success rate on ARC-AGI-2 tasks” ([7]). For example, large language models like GPT-4 and GPT-4o, along with comparable systems, scored essentially 0% ([7]) ([8]); many solved nothing at all, as ARC-AGI-2 was explicitly designed to resist simple pattern matching and brute-force methods.

Statistical analysis of the tasks found no correlation between success and factors like participant demographics or fatigue ([25]): performance depends purely on abstract reasoning. Because each task is unique (no memorization is possible) and only two guesses are allowed (pass@2), the benchmark is rigorous. In sum, ARC-AGI-2 is a calibrated, human-validated test of general reasoning ([15]) ([7]), in which an AI must infer unseen rules and generalize from minimal examples.

Historical Performance on ARC-AGI

Before GPT-5.2, AI performance on ARC-AGI-2 was at best in the low single digits. The ARC Prize reports emphasize this “cripplingly difficult” nature. As of May 2025 (the technical report release), no public system had broken 5% accuracy ([7]). In practice, the highest scores came from specialized chain-of-thought systems: the top OpenAI submission (“o3 (low)”) scored only 4.0% ([14]). Off-the-shelf GPT models (GPT-4o, GPT-4.5, GPT-5.1) scored near zero ([26]) ([7]). For context, ARC-AGI-1 (the easier predecessor) saw many systems in the 20–50% range by late 2024, so the fall-off on ARC-AGI-2 was drastic ([7]) ([14]). In short, ARC-AGI-2 initially presented a true “wall” for scaling-based approaches.

Nevertheless, the ARC Prize organizers and AI community did see gradual progress throughout 2025. The ARC Prize 2024/2025 Kaggle competition drew teams worldwide trying to crack this test under resource limits. The winning Kaggle solution (team NVARC from NVIDIA) fine-tuned a 4-billion-parameter model with heavy synthetic data generation and achieved 27.64% on the ARC-AGI-2 evaluation ([27]). This effort required creative test-time training and data augmentation to compensate for the model’s lack of raw scale. The NVIDIA blog explains: “Heavyweight LLM methods—chain-of-thought, tool use, even RL-agents—couldn’t fit within Kaggle’s runtime. So NVARC moved all complex reasoning offline into a synthetic data pipeline, and trained smaller models capable of running fast during evaluation” ([28]). Their success demonstrates that with clever engineering, even relatively compact models can make nontrivial gains on ARC-AGI-2. Still, 27.6% was just half of what Poetiq and GPT-5.2 later achieved.
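The “reasoning offline” idea rests on label-preserving augmentations of ARC-style grids. NVARC’s exact pipeline is more elaborate than what is public; the sketch below shows only the generic transforms (rotations, flips, color permutations) commonly used to expand a synthetic corpus, under the assumption that each transform is applied identically to input and output so the hidden rule survives.

```python
# Sketch of label-preserving augmentations for ARC-style grids, the kind of
# transforms used to expand a synthetic training corpus (and for test-time
# training on each evaluation task). NVARC's actual pipeline is not fully
# public; these are generic, commonly used examples.
import random

Grid = list[list[int]]

def rotate_90(g: Grid) -> Grid:
    return [list(row) for row in zip(*g[::-1])]

def flip_h(g: Grid) -> Grid:
    return [row[::-1] for row in g]

def permute_colors(g: Grid, perm: dict[int, int]) -> Grid:
    return [[perm.get(c, c) for c in row] for row in g]

def augment_pair(inp: Grid, out: Grid, n: int = 8, seed: int = 0):
    """Apply the SAME random transform to input and output, preserving the rule."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        a, b = inp, out
        for _ in range(rng.randrange(4)):   # random rotation (0-3 quarter turns)
            a, b = rotate_90(a), rotate_90(b)
        if rng.random() < 0.5:              # random horizontal flip
            a, b = flip_h(a), flip_h(b)
        colors = list(range(10))
        shuffled = colors[:]
        rng.shuffle(shuffled)
        perm = dict(zip(colors, shuffled))  # random color relabeling
        variants.append((permute_colors(a, perm), permute_colors(b, perm)))
    return variants

pairs = augment_pair([[1, 0], [0, 2]], [[2, 0], [0, 1]])
print(len(pairs))  # 8 augmented training pairs from a single example
```

Since every variant encodes the same abstract rule under a different surface appearance, a small model fine-tuned on thousands of such variants can amortize reasoning into its weights, which is the core of the offline strategy described above.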

Meanwhile, a different approach from the startup Poetiq debuted in December 2025. Poetiq’s meta-system does not train new models; instead it orchestrates multiple existing LLMs (e.g. Google’s Gemini 3) through iterative generation, critique, and refinement steps. For ARC-AGI-2, Poetiq’s open-source solution scored 54% on the semi-private test set ([29]). Notably, this surpassed Gemini 3’s own 45% score on that set ([29]), highlighting the power of their multi-agent orchestration. Poetiq also emphasized cost-efficiency: their system achieved 54% at $30.57 per task, whereas Gemini 3’s comparable runs had cost ~$77 per task ([30]) ([29]). This “record-breaking submission” (as described in Poetiq’s blog) showed that a clever pipeline can outcompete larger monolithic models on ARC-AGI-2.
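Poetiq’s generate-critique-refine orchestration can be sketched abstractly. The toy stubs below stand in for hosted model APIs; the actual Poetiq prompts, verification, and ensembling are considerably more sophisticated.

```python
# Abstract sketch of a generate-critique-refine loop over an LLM, the
# orchestration pattern behind meta-systems like Poetiq's. The model calls
# are stubbed out; real systems plug in hosted APIs (e.g. Gemini, GPT) with
# far richer prompts, verification, and ensembling.
from typing import Callable

LLM = Callable[[str], str]

def solve_with_refinement(task: str, generator: LLM, critic: LLM,
                          max_rounds: int = 3) -> str:
    candidate = generator(f"Solve this puzzle:\n{task}")
    for _ in range(max_rounds):
        critique = critic(f"Puzzle:\n{task}\nCandidate:\n{candidate}\n"
                          "List any errors, or reply OK.")
        if critique.strip() == "OK":        # critic accepts the answer
            break
        candidate = generator(f"Puzzle:\n{task}\nPrevious attempt:\n{candidate}\n"
                              f"Critique:\n{critique}\nProduce a corrected answer.")
    return candidate

# Toy stand-ins: the 'generator' fixes its answer once criticized.
def toy_generator(prompt: str) -> str:
    return "fixed-answer" if "Critique" in prompt else "first-guess"

def toy_critic(prompt: str) -> str:
    return "OK" if "fixed-answer" in prompt else "Cell (0,0) is wrong."

print(solve_with_refinement("example grid puzzle", toy_generator, toy_critic))
# -> fixed-answer (after one critique round)
```

The design point is that no weights are updated: all improvement comes from spending extra inference calls on self-correction, which is why such pipelines can beat the standalone score of the very model they wrap.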

In summary, until late 2025 the leaders on ARC-AGI-2 were the Kaggle/NVIDIA team (27.6%) and Poetiq (54%); GPT-5.1 and earlier OpenAI models lagged far behind (GPT-5.1 at 17.6%, earlier models in single digits) ([7]) ([1]). Against that backdrop, the emergence of GPT-5.2 was highly anticipated as a possible game-changer. Update (Q1 2026): The ARC Prize 2025 competition formally concluded in January 2026, with 1,455 teams and 15,154 submissions. The top Kaggle score on the private (cost-constrained) evaluation was 24.03% by team NVARC, followed by the ARChitects (16.53%) and MindsAI (12.64%) ([31]). Paper awards went to Alexia Jolicoeur-Martineau’s Tiny Recursive Model ($50K first prize) and Pourcel, Colas & Oudeyer’s Self-Improving Language Models ($20K second prize) ([31]).

OpenAI GPT-5.2: Design and Release

In December 2025, OpenAI unexpectedly accelerated the launch of GPT-5.2 amid intense competition. Sam Altman had declared a “code red” to respond to Google’s Gemini 3 advancements ([32]) ([33]). On December 11 Reuters reported that GPT-5.2 “boasts enhanced general intelligence, better coding capabilities, and improved handling of long-context understanding,” aimed at professional tasks like spreadsheets, presentations, and project management ([32]). TechRadar similarly noted GPT-5.2 was “fast-tracked” to improve reasoning, speed and reliability rather than flashy new features, in order to regain ChatGPT’s edge over Gemini 3 ([33]). Crucially, OpenAI positioned GPT-5.2 as a pragmatic update – fewer untested experiments and more robust core performance ([34]) ([35]).

GPT-5.2 comes in multiple configurations (“Instant”, “Thinking”, and a high-end “Pro” mode) that represent different computational budgets and reasoning strengths. Internally, GPT-5.2 continues OpenAI’s hybrid design of fast “sprinter” models for routine tasks and slower “thinker” models for deep reasoning ([36]). According to OpenAI’s announcement, GPT-5.2 significantly enlarged the context window (over 272k tokens) and further improved tools and coding skill ([37]) ([36]). The architecture retains a unified core with weighted routing between a main fast model and a deep “thinking” subnet for hard puzzles, similar in spirit to GPT-5’s multi-tier setup (this was described in Chinese leaks as a “real-time router” allocating problems to specialist sub-models ([38]), and GPT-5.2 presumably extends this).
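Since the “real-time router” description comes from unverified leaks, any concrete rendering is speculative. As a conceptual sketch only, such a dispatcher might score each request’s difficulty with a cheap heuristic and route it to a fast or a “thinking” tier; the heuristic and handlers below are invented for illustration and are not OpenAI’s implementation.

```python
# Conceptual sketch of tiered routing between a fast model and a slower
# "thinking" model. GPT-5.2's internals are known only from unverified
# leaks; this illustrates the general pattern with an invented difficulty
# heuristic, not the actual system.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Route:
    name: str
    handler: Callable[[str], str]

def difficulty_score(prompt: str) -> float:
    """Invented heuristic: puzzle/math keywords and length push the score up."""
    keywords = ("prove", "grid", "puzzle", "step by step", "derive")
    hits = sum(k in prompt.lower() for k in keywords)
    return min(1.0, 0.1 * hits + len(prompt) / 10_000)

def route(prompt: str, fast: Route, thinker: Route, threshold: float = 0.2) -> str:
    chosen = thinker if difficulty_score(prompt) >= threshold else fast
    return f"[{chosen.name}] " + chosen.handler(prompt)

fast = Route("fast", lambda p: "quick reply")
thinker = Route("thinking", lambda p: "deliberate multi-step reply")

print(route("What's the capital of France?", fast, thinker))
# -> [fast] quick reply
print(route("Infer the rule of this grid puzzle step by step.", fast, thinker))
# -> [thinking] deliberate multi-step reply
```

The appeal of this pattern, whatever the production details, is that routine traffic never pays the latency and compute cost of the deep-reasoning tier.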

GPT-5.2 proved to be just the beginning of a rapid release cadence. OpenAI followed with GPT-5.3 (improved conversational quality and web search), then GPT-5.4 in March 2026, which scored 73.3% base / 83.3% Pro on ARC-AGI-2, introduced 1M+ token context, five-level reasoning effort control, and 75% on the OSWorld computer use benchmark ([3]) ([39]). By February 2026, the original GPT-5 (Instant/Thinking) and GPT-4o were retired from ChatGPT. Most recently, GPT-5.4-Cyber launched on April 14, 2026, specialized for defensive cybersecurity ([40]).

GPT-5.2’s release was also a strategic business move. It launched under pressure from Google’s Gemini 3 (which had just debuted with superior benchmarks and new capabilities). With GPT-5.2, OpenAI sought to reassert leadership. For example, an Axios report characterizes it as “OpenAI’s best model to date for daily professional use,” reflecting extensive internal testing and industry feedback (from partners like Box and Zoom) on its enhancements ([35]). Even Wall Street noticed – Disney simultaneously announced a $1B investment in OpenAI at the GPT-5.2 launch, granting rights to use branded characters in its AI tools ([41]).

From a technical standpoint, GPT-5.2’s design prioritized reliable reasoning and factual accuracy. It improved chain-of-thought coherence, reduced hallucinations, and integrated multimodal inputs more tightly ([37]) ([42]). The update also incorporated customer feedback to reduce error rates: for instance, GPT-5.2 shows half as many vision-related mistakes as GPT-5.1 on UI-related prompts (per OpenAI) ([43]). Collectively, these enhancements set new records across a wide array of industry benchmarks (e.g. sweeping “GDPval” professional tasks and mathematical tests) as documented by OpenAI ([1]).

On ARC specifically, OpenAI quietly disclosed GPT-5.2’s ARC-AGI performance. The official datasheet reveals that GPT-5.2 Thinking scored 52.9% on the verified ARC-AGI-2 test set, compared to just 17.6% for GPT-5.1 ([1]). In the “Pro” (X-High) setting, GPT-5.2 achieved 54.2% ([44]). These results exceeded all other known AI by a wide margin. OpenAI stresses this is a verified score on the ARC-AGI-2 benchmark, meaning GPT-5.2 solved 52.9% of the secret test puzzles that it had never seen during training ([1]) ([44]). According to an OpenAI researcher at the launch event, “GPT-5.2 is the first model we’ve seen that achieves near 100% accuracy on the 4-needle MRCR variant out to 256k tokens” ([45]), underscoring its strength on long-horizon reasoning. Although OpenAI focused on the productivity applications in their marketing, these publicly-shared benchmarks confirm that GPT-5.2’s reasoning ability (as tested by ARC-AGI-2) is now unambiguously superior to previous models ([1]) ([44]).

Detailed Performance Analysis

GPT-5.2’s ARC-AGI-2 results represent a breakthrough in abstract reasoning. For context, Table 2 (below) compares GPT-5.2 with GPT-5.1 (and its Pro variant) on ARC-AGI benchmarks:

| Benchmark | GPT-5.1 (Thinking) | GPT-5.2 (Thinking) | GPT-5.2 Pro (X-High) |
| --- | --- | --- | --- |
| ARC-AGI-1 (Verified) | 72.8% | 86.2% | 90.5% |
| ARC-AGI-2 (Verified) | 17.6% | 52.9% | 54.2% |

Source: OpenAI GPT-5.2 release (Table of model performance) ([1]) ([44]).

This table highlights that GPT-5.2 not only vastly outperforms GPT-5.1 on ARC-AGI-2 (jumping from 17.6% to ~53%), but also improves significantly on ARC-AGI-1 (72.8%→86.2%). The increase on ARC-AGI-2 (a +35 percentage-point leap) is especially striking. These figures are from OpenAI’s own evaluation of “GPT-5.2 Thinking” and “Pro” against a fixed test set ([1]) ([44]). The “Pro” mode yields only a small incremental gain (54.2% vs 52.9%) at higher computational cost, indicating that even the standard GPT-5.2 Thinking model captures most of the improvement.

The significance of GPT-5.2’s ~53% score can be appreciated against the backdrop of other systems. Before this, no system came close to this level on ARC-AGI-2. For example, Google’s Gemini 3 Deep Think effort (accessed via Poetiq’s orchestrator) had scored 45% ([29]), and the best community-tuned models only reached ~28% ([27]). Even the specialized ARC Prize entries that once led ARC-AGI-1 (such as “o3-preview” or “o3 (low)” from OpenAI) fall at or below 4% on ARC-AGI-2 ([14]). In effect, GPT-5.2 roughly doubled the best community-tuned ARC-AGI-2 accuracy and clearly exceeded every prior score, whether verified or semi-private.

Table 3. Comparison of leading systems on ARC-AGI benchmarks (% of tasks solved), updated April 2026.

| System | ARC-AGI-1 | ARC-AGI-2 | Notes |
| --- | --- | --- | --- |
| Confluence Lab | n/a | 97.9% | Meta-system, $11.77/task ([5]) |
| Imbue Darwinian Evolver | n/a | 95.1% | Evolutionary code optimization + Gemini 3.1, $8.71/task ([46]) |
| Gemini 3 “Deep Think” (Feb 2026) | n/a | 84.6% | Extended-compute reasoning ([4]) |
| GPT-5.4 (Pro) | n/a | 83.3% | Latest OpenAI, Mar 2026 ([3]) |
| Gemini 3.1 Pro Preview | n/a | 77.1% | Standard (non-Deep-Think) ([47]) |
| Claude Opus 4.6 | n/a | 69–75% | Anthropic, strong on hard problems ([6]) |
| Claude Sonnet 4.6 | n/a | 58.3% | 4.3× jump over Sonnet 4.5’s 13.6% ([6]) |
| GPT-5.2 (Pro) | 90.5% | 54.2% | X-High mode, Dec 2025 ([44]) |
| GPT-5.2 (Thinking) | 86.2% | 52.9% | Standard config, Dec 2025 ([1]) |
| Poetiq meta-system | n/a | 54.0% | Semi-private, Dec 2025 ([29]) |
| GPT-5.1 (ChatGPT 5.1) | 72.8% | 17.6% | Prior OpenAI model ([1]) |
| GPT-4o (multimodal GPT-4) | 4.5% | ~0% | Previous OpenAI baseline ([26]) |
| ARC-AGI-2 Human Panel | 100% | 100% | ≥2 people solved each task ([9]) |

This table makes clear that while GPT-5.2 was groundbreaking at its release — the first model to solve a majority of tasks that once stymied all AI — it has since been surpassed by multiple systems. OpenAI’s detailed benchmarks also indicate robustness: GPT-5.2 maintains its edge even on the hardest ARC-AGI-2 tasks (the “verified” set, which removes any possibility of leakage). The model also excels on related benchmarks: it solved 86.2% of ARC-AGI-1 puzzles ([1]), versus 72.8% for GPT-5.1, demonstrating that the improvements generalize across puzzle varieties.

Crucially, the ARC-AGI-2 tasks are presumed not to have been included in any training data. OpenAI explicitly notes ARC-AGI-2’s evaluation tasks were newly designed and human-calibrated ([15]) ([19]). Thus GPT-5.2’s success implies genuine generalization abilities rather than memorization. It indicates the model can infer abstract rules from few examples far beyond earlier capabilities.

Sources of Improvement

How did GPT-5.2 achieve these gains? OpenAI cites architectural and training improvements. According to their report, GPT-5.2 drastically improved long-context understanding and multi-step reasoning ([48]). Its enhanced “Thinking” configuration likely uses deeper reasoning passes. The code-red push also meant extensive fine-tuning on reasoning tasks: OpenAI mentions a new internal benchmark (“GDPval”) on which GPT-5.2 nearly matches or exceeds expert human performance ([49]). The ARC-AGI-2 score jump suggests GPT-5.2 learned to plan and decompose ARC puzzles much better than GPT-5.1. Moreover, by late 2025 the models and tooling for chain-of-thought had matured (e.g. more precise self-verification and prompting techniques), and GPT-5.2 appears to benefit from all of these pipeline advances. In effect, GPT-5.2 “raised the ceiling” for how much reasoning a giant LLM can do end-to-end.

Nevertheless, experts stress that GPT-5.2’s success does not equal human-like intelligence. As François Chollet notes, ARC-AGI is meant to reveal exactly what is lacking in AI. Despite GPT-5.2’s score, it still falls well short of perfect. The Atlantic observes that when Chollet introduced ARC-AGI-2, “AI performance plummeted” again, underlining that fluid generalization remains outside current architectures ([11]). OpenAI themselves define AGI conservatively – Sam Altman remarked in mid-2025 that GPT-5 was “still missing something quite important” for AGI ([12]). GPT-5.2’s improvements address specific limitations, but key aspects (like autonomous continual learning and meta-reasoning) remain unproven. Thus GPT-5.2’s ARC-AGI-2 record reflects enormous progress, yet also highlights the continuing gulf to actual AGI benchmarks.

Case Studies and Perspectives

Poetiq’s Meta-System (Semi-private ARC-AGI-2) – Poetiq’s approach exemplifies a mix of large models and clever orchestration. By chaining multiple LLM calls (generate, critique, refine), Poetiq “smashed” the ARC-AGI-2 record immediately after Gemini 3’s release ([29]). Their open-source solution used Gemini 3 Pro instances and achieved 54% on the semi-private test set via the Kaggle system, surpassing Gemini’s standalone 45% ([29]). This indicates that even without retraining, intelligently leveraging existing LLM behaviors can yield high ARC scores. Poetiq’s published cost analysis also points to efficiency trade-offs: a plain GPT-5.2 Thinking baseline would have solved 52.9% at far lower cost ($1.90/task) than Poetiq’s full pipeline at $30.57/task ([50]). Poetiq’s success suggests meta-reasoning systems are a promising direction: rather than building bigger models, orchestrate many models (or one model many times) to tackle subproblems. This perspective contrasts with OpenAI’s brute scaling, yet it showed the community that ARC-AGI-2 is beatable via flexible systems.

NVIDIA Kaggle Grandmasters (Public ARC Prize) – The Kaggle story shows yet another approach. Subject to strict runtime limits, Ivan Sorokin and Jean-Francois Puget’s NVARC team opted against giant hosted LLMs; instead they trained a 4B open-weight model (based on NVIDIA’s own frameworks) entirely on synthetic ARC-like puzzles. They generated a massive puzzle corpus and performed test-time training on each evaluation instance ([28]). This earned them 27.64% and the prize ([27]). Their case study emphasizes data efficiency and adaptability over raw scale. By “moving reasoning offline,” they could embed sophisticated knowledge in the model weights. This approach is complementary to GPT-5.2: while GPT-5.2 is a single monolith learning broadly, NVARC’s model was highly specialized to ARC tasks. It shows that specialized training (even of tiny models) can partly compensate for less inherent reasoning power. Their detailed write-up highlights the lesson: to succeed on ARC-AGI-2, one can either scale up general reasoning (GPT-5.2) or custom-tailor a model with clever training regimes.

These cases illustrate multiple perspectives in the research community:

  • Scaling Monoliths vs. Modular Systems: OpenAI’s GPT-5.2 exemplifies the “bigger model, more data” strategy, achieving state of the art simply by deploying an extremely powerful LLM. Poetiq and the Kaggle teams counter that smart combinations or task-specific tuning can compete. Both camps achieved scores of the same order (27–54%) on ARC-AGI-2, but via different means. This suggests the field advances both through scaling and through algorithmic creativity.

  • Benchmarks vs. Reality: ARC-AGI-2, by design, tests only one aspect of intelligence: abstract reasoning in a constrained puzzle domain. Some experts caution that success here may not translate to general human-like reasoning ([8]) ([11]). Indeed, advanced systems can become “expert” at ARC-like tasks without solving vision, language, or physical reasoning. As Chollet puts it, the ARC scores are a “mirror” for what AI still can’t do ([11]). GPT-5.2’s high score thus invites debate: is it evidence of approaching AGI, or simply of a narrowly tuned super-LLM? The answer remains open. The result does show that at least one big LLM has learned far more general abstractions than its predecessors, but true AGI involves much more (continual learning, world modeling, creativity, etc.).

  • Efficiency and Cost: The leaderboard analysis (e.g. Chinese summary ([51])) highlights a major point: the models that scored highest on ARC-AGI-2 did so at massive cost. For example, the top OpenAI model used chain-of-thought with search at ~$200/task ([52]). In contrast, Poetiq solved a majority of tasks for ~$30/task ([50]), and NVARC’s solution only ~$0.20/task ([27]). This “efficiency crisis” means raw scores don’t tell the whole story: a 54% result is far more impressive if achieved cheaply. GPT-5.2’s external performance cost is unclear (for OpenAI’s closed system), but we can infer it is high. Future progress likely requires not just higher scores, but smarter methods that factor in cost and computational sustainability.

Implications and Future Directions

GPT-5.2’s ARC-AGI-2 success has immediate and long-term implications for AI’s trajectory:

  • AGI Claims and Skepticism: GPT-5.2’s leap fueled speculation about AGI, and the trajectory since has only intensified the debate. By April 2026, meta-systems like Confluence Lab have pushed to 97.9% on ARC-AGI-2, seemingly near-closing the gap. However, as leading voices caution, high ARC-AGI-2 scores do not equate to AGI. Notably, François Chollet himself has revised his AGI timeline in response to recent progress, stating: “A year ago, I would have said ten years-ish. And now I think it’s probably, you know, five-ish” — projecting AGI around 2030 ([53]). He credits test-time fine-tuning, test-time search, and program synthesis as the key breakthroughs: “We have models that are showing real signs of fluid intelligence” ([54]). Meanwhile, NVIDIA CEO Jensen Huang declared in March 2026 that “we’ve achieved AGI” ([55]), though there is no consensus on the definition. The human baseline for ARC-AGI-2 remains 100% ([9]) (though one source cites 66% as a practical human “ceiling” given test difficulty ([56])), and the launch of ARC-AGI-3, on which all frontier models score below 1%, demonstrates that genuine generalization remains distant. GPT-5.2’s ~53% thus suggests it was not yet generalizing in the human sense.

  • Benchmark-Driven Progress: The ARC-AGI benchmarks have clearly been influential. By focusing community efforts on the hardest “human-easy” puzzles, they revealed limitations of earlier systems, and the fact that community teams (NVARC, Poetiq) and OpenAI alike targeted ARC-AGI-2 explicitly shows these benchmarks guiding research. That cycle has continued: ARC-AGI-3 launched on March 25, 2026 at a fireside conversation between Chollet and Sam Altman at Y Combinator HQ ([53]). ARC-AGI-3 is the first fully interactive benchmark: it presents games without instructions, requiring agents to explore, experiment, and infer rules through trial and error, and it scores them with a new RHAE (Relative Human Action Efficiency) metric that penalizes inefficiency quadratically. The results are humbling: the best frontier model (Gemini 3.1 Pro) scored just 0.37%, while Claude Opus 4.6 scored 0.25% and GPT-5.4 scored near zero ([10]). The ARC Prize 2026 competition offers over $2 million in prizes across both ARC-AGI-2 and ARC-AGI-3 tracks ([57]). GPT-5.2’s launch, with ARC-AGI-2 as a milestone, validated this co-evolution of benchmarks and models, and ARC-AGI-3 ensures the cycle continues.

  • Integration of Approaches: The diversity of solutions suggests no single path to AGI. Massive LLMs can be augmented with smaller fine-tuned models, synthetic data, or multi-agent orchestration. Future systems may combine these: one could imagine GPT-5.2 style models that also incorporate on-the-fly training or curriculum learning from local context (akin to NVARC’s test-time tuning). OpenAI’s blog hints at this: GPT-5.2 has “agentic tool-calling” capabilities to integrate external tools and data ([37]). If GPT-5.2 or its successors can loop LLM reasoning with automated code/scripts (or even on-board synthesizing training data), they might approach the adaptability that Kaggle teams had via hand-crafted processes.

  • Practical Impact: Regardless of AGI, GPT-5.2’s improvements have practical ramifications. Its strengths in knowledge work, coding, and analysis (education, law, science, finance) mean it can automate complex professional functions far better than before ([32]) ([58]). The demonstrated long-horizon reasoning means it can tackle multi-step problems end-to-end, not just generate text. For enterprises, GPT-5.2’s 70.9% success on GDPval tasks (a benchmark of real-world work products) ([49]) indicates it can produce usable deliverables (reports, spreadsheets, etc.) with high fidelity. Tools and copilot systems built on GPT-5.2 will thus be markedly more capable. At the same time, its ARC-AGI-2 proficiency reflects a deeper gain: GPT-5.2 reasons about abstract visual patterns far better than its predecessors, a grasp of structure that could transfer to domains like scientific data or user interfaces. In short, the ARC jump is not just an academic feat; it reflects underlying gains that will enhance AI tools across many adoption scenarios.

  • Ethical and Practical Considerations: The intensifying “benchmark race” also raises questions. OpenAI’s rapid release of GPT-5.2 (amid “code red”) shows how competition can accelerate technology at the expense of careful deployment ([59]) ([35]). GPT-5.2 is no exception: alongside performance improvements, OpenAI faces criticisms over privacy, safety, and overhyped AGI claims (as noted by Axios ([35])). Furthermore, the resources driving GPT-5.2 are immense. The NVIDIA Kaggle result reminds us that efficiency must be a concern, not just raw scores. Policymakers and society must reckon with the dual-edged nature of these advances: unprecedented capability on one hand, new vulnerabilities on the other.

  • Next Frontier: The rapid progress on ARC-AGI-2 — from GPT-5.2’s 53% in December 2025 to Confluence Lab’s 97.9% by April 2026 — has been remarkable, but ARC-AGI-3 has reset the frontier entirely. With the best models scoring below 1%, the interactive reasoning challenge exposes what remains fundamentally missing. The “ARC-AGI-2 Extreme” tasks that were too hard for humans ([60]) also remain uncharted territory. The success of evolutionary and meta-system approaches (Imbue’s Darwinian Evolver, Confluence Lab) suggests that future breakthroughs may come not from monolithic models but from systems that generate, test, and refine code or strategies at inference time. Imbue’s open-sourced approach — which uses evolutionary code optimization on top of existing LLMs — tripled open-model performance and points toward a hybrid paradigm combining neural reasoning with symbolic program synthesis ([46]). On the algorithmic front, researchers are increasingly integrating symbolic reasoning and world models into LLMs, building on what GPT-5.2 achieved with pure deep learning. The trajectory from ARC-AGI-1 through ARC-AGI-3 provides a clearer map of the journey ahead: each benchmark generation exposes new gaps, and the AI community has proven it can close them — but the finish line keeps moving.
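The evolutionary strategy described above can be sketched in miniature. The toy below is entirely hypothetical (the primitives, fitness function, and mutation operator are invented; none of it is Imbue's actual code), but it shows the basic shape: generate candidate grid-transformation programs, score them against a task's demonstration pairs, and mutate the survivors.

```python
import random

# Toy illustration of evolutionary program search; all primitives and
# parameters are invented, not Imbue's actual system.
PRIMITIVES = {
    "flip_h": lambda g: [row[::-1] for row in g],
    "flip_v": lambda g: g[::-1],
    "transpose": lambda g: [list(r) for r in zip(*g)],
    "increment": lambda g: [[(c + 1) % 10 for c in row] for row in g],
}

def run_program(program, grid):
    """Apply a sequence of primitive operations to a grid."""
    for op in program:
        grid = PRIMITIVES[op](grid)
    return grid

def fitness(program, demos):
    """Fraction of demonstration pairs the candidate solves exactly."""
    return sum(run_program(program, i) == o for i, o in demos) / len(demos)

def mutate(program, rng):
    """Randomly replace one operation or append a new one."""
    program = list(program)
    if program and rng.random() < 0.5:
        program[rng.randrange(len(program))] = rng.choice(list(PRIMITIVES))
    else:
        program.append(rng.choice(list(PRIMITIVES)))
    return program

def evolve(demos, pop_size=40, generations=60, seed=0):
    rng = random.Random(seed)
    population = [[rng.choice(list(PRIMITIVES))] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda p: fitness(p, demos), reverse=True)
        if fitness(population[0], demos) == 1.0:
            break  # a candidate reproduces every demonstration
        survivors = population[: pop_size // 2]
        population = survivors + [mutate(rng.choice(survivors), rng) for _ in survivors]
    return population[0]

# Toy task whose hidden rule is: horizontal flip, then increment.
demos = [
    ([[1, 2], [3, 4]], [[3, 2], [5, 4]]),
    ([[0, 9], [5, 5]], [[0, 1], [6, 6]]),
]
best = evolve(demos)
print(best, fitness(best, demos))
```

Real systems replace the hand-written primitives with LLM-generated code and far richer mutation operators, but the generate-score-mutate loop has the same shape.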

Data Analysis and Tables

The ARC-AGI-2 results provide rich data for analysis. Table 1 above catalogs top model performances and sources. Several insights emerge:

  • Relative Improvement: GPT-5.2’s ARC-AGI-2 score (~53%) is well above the best previous system (Gemini 3 Pro at 45%) and roughly triple GPT-5.1’s 17.6%. This steep jump far outpaces typical year-over-year gains on AI benchmarks, indicating a qualitative change in model capability.

  • Cost Efficiency: Poetiq’s cost breakdown reveals a striking disparity. Poetiq reports solving 52.9% of tasks at $1.90/task using GPT-5.2 Thinking ([50]), whereas its high-end Pro run cost $15.27/task. Even at the lower price point, evaluating hundreds of puzzles remains expensive. In contrast, the Kaggle NVARC solution accomplished 27.6% at a mere $0.20/task ([27]). This suggests that future research should treat performance per dollar as a key metric. The “Pareto frontier” of accuracy vs. cost is shifting: large models raise the accuracy frontier, but smaller models can still dominate the cost frontier.

  • Benchmark Correlations: Table 2 juxtaposes ARC-AGI results with other benchmarks. GPT-5.2’s consistent lead across ARC-AGI-1 and ARC-AGI-2, as well as domain-specific tests (e.g. GDPval, coding challenges ([61])), shows its general advantage. The data also hint at a trade-off in configurations: the Pro (X-High) mode gains about 1.3 percentage points on ARC-AGI-2 (54.2% vs 52.9%) but at far greater expense. This suggests diminishing returns on brute compute; the bulk of the reasoning improvement is attained by the standard "Thinking" model.

  • Human vs. AI Gap: Even with GPT-5.2, the gap to perfect remains 46–47 percentage points. Some tasks require nuanced multi-step logic or “common-sense” leaps that even GPT-5.2 misses. The human panel data (100% solves ([9])) shows that the remaining tasks are easily within human reach. This quantifies the remaining challenge: roughly half of the ARC-AGI-2 tasks still stump the very best AI.
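The accuracy-versus-cost trade-off discussed above can be computed directly. The sketch below uses the per-task figures quoted in this section; the Pareto helper itself is ours, added purely for illustration.

```python
# Per-task figures quoted in the analysis above; the Pareto computation
# is an illustrative helper, not from any of the cited reports.
systems = [
    {"name": "GPT-5.2 Thinking (Poetiq)", "accuracy": 0.529, "cost_per_task": 1.90},
    {"name": "GPT-5.2 Pro X-High (Poetiq)", "accuracy": 0.542, "cost_per_task": 15.27},
    {"name": "NVARC (Kaggle, 4B model)", "accuracy": 0.276, "cost_per_task": 0.20},
]

def dominates(q, p):
    """q dominates p if it is at least as accurate and strictly cheaper,
    or strictly more accurate and no more expensive."""
    return ((q["accuracy"] >= p["accuracy"] and q["cost_per_task"] < p["cost_per_task"])
            or (q["accuracy"] > p["accuracy"] and q["cost_per_task"] <= p["cost_per_task"]))

def pareto_frontier(points):
    frontier = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(frontier, key=lambda p: p["cost_per_task"])

for p in pareto_frontier(systems):
    print(f'{p["name"]}: {p["accuracy"]:.1%} at ${p["cost_per_task"]:.2f}/task')
```

On these numbers all three systems sit on the frontier: NVARC anchors the low-cost end, while the Pro run buys the last point of accuracy at roughly eight times the Thinking model's price.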

Case Studies and Real-World Examples

The NVIDIA Kaggle Solution (NVARC)

NVARC’s approach (first place, Kaggle ARC Prize 2025) used a fine-tuned 4B model and creative engineering ([27]) ([28]). Key points:

  • Synthetic Data Generation: Realizing that available puzzles were scarce, they generated thousands of ARC-like training examples by programmatically composing grid puzzles. This synthetic corpus helped the model learn the underlying patterns of ARC-AGI-2 tasks.

  • Test-Time Training: Instead of freezing the model, they performed on-the-fly gradient steps using each test puzzle’s small example set, dynamically adapting to specific rules. This “learn per puzzle” trick gave an edge that static models lack.

  • Efficiency Emphasis: Constrained by Kaggle’s 50-second runtime limit, they avoided large LLMs. Their solution cost about $0.20 per puzzle ([27]), an order of magnitude cheaper than others, albeit with lower absolute score (27.6%).
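The synthetic-data idea can be illustrated with a minimal generator. The primitive transforms and grid sizes below are invented stand-ins, far simpler than NVARC's actual generators: pick a hidden rule, then emit demonstration pairs produced by that rule.

```python
import random

# Hypothetical sketch of programmatic puzzle generation; the primitive
# transforms and sizes are invented, much simpler than NVARC's corpus.
rng = random.Random(0)

def random_grid(rows, cols, colours=10):
    return [[rng.randrange(colours) for _ in range(cols)] for _ in range(rows)]

TRANSFORMS = [
    lambda g: [row[::-1] for row in g],      # mirror horizontally
    lambda g: g[::-1],                       # flip vertically
    lambda g: [list(r) for r in zip(*g)],    # transpose
]

def synthesize_task(n_demos=3):
    """One synthetic task: a hidden transform plus demo pairs built from it."""
    transform = rng.choice(TRANSFORMS)
    demos = []
    for _ in range(n_demos):
        grid = random_grid(rng.randrange(2, 6), rng.randrange(2, 6))
        demos.append((grid, transform(grid)))
    return demos

# A corpus of ARC-like tasks for fine-tuning a small model.
corpus = [synthesize_task() for _ in range(1000)]
print(len(corpus), len(corpus[0]))
```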

This case shows that constraint-driven innovation can yield surprisingly strong reasoning performance: 27.6% on ARC-AGI-2 using modest resources is nontrivial. It also illustrates a potential future direction: systems that blend learning from scratch (via data generation and training) with reasoning at test time. For example, one might combine GPT-5.2’s abilities with a small fine-tuning step on each new task to hybridize the approaches.
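A per-task adaptation step of this kind can be sketched with a toy "model". Here a colour lookup table stands in for a real network, and fitting it on a puzzle's demonstrations stands in for gradient steps; everything in the sketch is hypothetical.

```python
import copy

class LookupModel:
    """Toy stand-in for a trainable model: a per-colour lookup table."""
    def __init__(self):
        self.table = {c: c for c in range(10)}  # identity prior

    def predict(self, grid):
        return [[self.table[c] for c in row] for row in grid]

    def adapt(self, demo_pairs):
        # Analogue of a few fine-tuning steps: fit the colour mapping
        # implied by the demonstrations, leaving unseen colours alone.
        for inp, out in demo_pairs:
            for ri, row in enumerate(inp):
                for ci, c in enumerate(row):
                    self.table[c] = out[ri][ci]

base = LookupModel()  # "pre-trained" base model, shared across puzzles

# One test puzzle's demonstrations (hidden rule here: colour + 3).
puzzle_demos = [([[1, 2]], [[4, 5]]), ([[3, 0]], [[6, 3]])]

adapted = copy.deepcopy(base)   # adapt a fresh copy per puzzle
adapted.adapt(puzzle_demos)
print(adapted.predict([[2, 3]]))  # prints [[5, 6]]
```

The key design choice mirrors NVARC's: the base model is never modified, so each puzzle's adaptation cannot contaminate the others.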

The Poetiq Meta-System

Poetiq’s system represents a meta-reasoning pipeline. Highlights from their report ([29]) ([62]):

  • Frontier Model Orchestration: Rather than solely relying on one model, Poetiq leverages Gemini 3 Pro multiple times. Each puzzle is tackled by a loop of the LLM generating a solution, another call evaluating it, and iterative refinement. This multipass approach achieves a deeper search of the solution space.

  • Rapid Adaptation: Poetiq was able to deploy their pipeline within hours of Gemini 3’s launch by automating everything. They state: “we do not need to build, or even fine-tune, our own large frontier models. Our meta-system… solves specific tasks by utilizing any existing frontier model” ([29]). This agility contrasts with the months-long development of new model releases.

  • Cost-Performance Tradeoff: Their analysis shows that the standard GPT-5.2 Thinking model could solve the hard puzzles cheaply, whereas using the high-end options drove up cost drastically ([50]). This implies that enhancements in reasoning can sometimes come from smarter inference rather than just deploying the most expensive model.
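A generate-evaluate-refine loop in this spirit can be sketched as follows. The call_llm stub is a stand-in that cycles through canned guesses so the loop runs end-to-end; a real pipeline would call a frontier-model API and sandbox any generated code before executing it.

```python
from typing import Callable

def make_stub_llm():
    """Stand-in for a frontier-model API: returns canned guesses in order."""
    guesses = iter([
        "lambda g: g",                          # first attempt: identity
        "lambda g: [row[::-1] for row in g]",   # refined: horizontal flip
    ])
    def call_llm(prompt: str) -> str:
        return next(guesses)
    return call_llm

def solve(demo_pairs, call_llm: Callable[[str], str], max_rounds=5):
    """Generate a candidate program, verify it on the demonstrations,
    and feed failures back into the next generation round."""
    feedback = ""
    for _ in range(max_rounds):
        code = call_llm(f"Solve this ARC task.\nDemos: {demo_pairs}\n{feedback}")
        candidate = eval(code)  # toy only; real systems sandbox execution
        failures = [(i, o) for i, o in demo_pairs if candidate(i) != o]
        if not failures:
            return candidate  # verified against every demonstration
        feedback = f"Your last program failed on: {failures}"
    return None

demos = [([[1, 2, 3]], [[3, 2, 1]])]
program = solve(demos, make_stub_llm())
print(program([[7, 8]]))  # prints [[8, 7]]
```

The verification step is what makes the loop more than repeated sampling: each round's failures become concrete feedback for the next generation, which is the essence of the multipass search described above.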

For AI product developers, Poetiq’s work is a case study in how to engineer top-tier reasoning systems without waiting for proprietary models. Their open-source pipeline demonstrates that with careful prompt engineering and iteration, even off-the-shelf LLMs can be pushed far. It suggests that meta-learning layers or orchestrators are a viable direction for future AGI systems — essentially building systems around models rather than in models. Update: Poetiq’s approach has been validated by the market — in early 2026 the company raised $45.8M in seed funding co-led by FYRFLY Venture Partners and Surface Ventures, with participation from Y Combinator and 468 Capital ([63]). When GPT-5.2 became available, Poetiq immediately integrated it into their meta-system, achieving 75% on the public evaluation set — a 16-point improvement over their prior SOTA using Gemini 3 Pro alone ([64]).

Discussion of Implications

GPT-5.2’s breakthrough on ARC-AGI-2, and the rapid progress that followed it, reshape the current understanding of AI capabilities. Several themes emerge:

  • Benchmark Validity: ARC-AGI-2 was designed to be hard for exactly this reason: to resist brute-force scaling ([15]). GPT-5.2’s success on it shows that scale-driven strategies are not wholly ineffective; broad world knowledge and improved algorithms can solve a majority of these puzzles. However, ARC-AGI-2 tasks are still contrived, and solving them does not guarantee broad intelligence. Researchers must therefore interpret GPT-5.2’s feat with nuance: it is an impressive signifier of progress in reasoning, but not proof of holistic AGI. As the Atlantic piece and Chollet warn, focusing only on benchmark scores can create a false frontier – real generality requires more than pattern puzzles ([8]).

  • Rethinking AGI Scaling: Sam Altman has publicly cited ARC as a gauge for AGI progress — and notably participated in the ARC-AGI-3 launch event alongside Chollet in March 2026. The decision to continuously raise the bar (ARC-AGI-2 after ARC-AGI-1, now ARC-AGI-3) reflects an understanding that minor improvements in benchmarks shouldn’t be mistaken for AGI. The rapid progress on ARC-AGI-2 — from 53% to 98% in just four months — suggests that the “scaling hypothesis” works differently than expected: the biggest gains came not from bigger models alone, but from meta-systems, evolutionary approaches, and test-time computation. Imbue’s 95.1% using a Darwinian code evolution strategy on existing LLMs, and Confluence Lab’s 97.9% via meta-system orchestration, demonstrate that architectural innovation around models can outperform raw parameter scaling. Future work will likely combine neural reasoning with symbolic program synthesis, code generation, and interactive learning — exactly the capabilities ARC-AGI-3 is designed to test.

  • Guiding Future Work: The ARC Prize 2025 competition drew 1,455 teams and produced 90 research papers, confirming that benchmarks are setting research priorities ([31]). The ARC-AGI-3 benchmark has already incorporated learnings from the rapid saturation of ARC-AGI-2: it introduces fully interactive tasks (games without instructions) that trick purely knowledge-driven models — and indeed, even GPT-5.4 and Claude Opus 4.6 score below 1% ([10]). The paper award winners from ARC Prize 2025 point toward future directions: the Tiny Recursive Model (TRM) by Alexia Jolicoeur-Martineau ($50K prize) and Self-Improving Language Models by Pourcel, Colas & Oudeyer ($20K prize) both emphasize learning efficiency and self-improvement over raw scale ([31]). Similarly, demands on efficiency continue to drive research into neural model compression and meta-learning, as demonstrated by the Kaggle win with a 4B model.

  • Broad AI Landscape: Beyond ARC-AGI, GPT-5.2’s influence will ripple across AI. Its advanced reasoning will benefit many domains (e.g. code generation, data analysis, medical diagnostics). However, it also raises the bar for safety and alignment: smarter systems can also exhibit smarter failures (e.g. security exploits in reasoning). The GPT-5.2 launch already spurred discussions on de-anonymization, biases, and energy costs. The efficiency analysis from ARC suggests these discussions must intensify: can we responsibly scale such systems when the most efficient solutions may diverge architecturally from raw scaling?

  • Human vs. AI Gap: Even as data shows AI catching up on ARC-AGI-2 (with meta-systems nearing 98%), human cognition features like creativity, common sense, and long-term planning remain distinct. ARC-AGI-3 has dramatically reasserted this gap: humans solve 100% of its interactive tasks while all frontier models score below 1%. This is perhaps the most powerful evidence yet that scaling within current paradigms cannot fully replicate fluid human intelligence. Bridging this will require fundamentally new ideas (e.g. continual learning, unsupervised concept formation, interactive world modeling). Chollet’s long-term view is that benchmarks like ARC-AGI are meant not for short-term bragging rights but to drive researchers and companies to innovate on exactly these limitations ([15]) ([11]). In this spirit, GPT-5.2’s lead is both a celebration of success and a clear signal of remaining challenges.

Conclusion

GPT-5.2’s performance on ARC-AGI-2 was a landmark in AI development. Solving a majority of its tasks underscored how far LLM technology had come, and reaffirmed that advanced models can perform abstract reasoning at levels previously thought impossible. The four months since GPT-5.2’s December 2025 debut have only accelerated this trajectory: GPT-5.4 Pro reached 83.3%, Gemini 3 Deep Think hit 84.6%, and meta-systems like Confluence Lab and Imbue pushed above 95%. This rapid progress was made possible through a diversity of approaches — aggressive model improvements, evolutionary code optimization, and multi-agent orchestration — demonstrating that there is no single path to abstract reasoning. However, a true AGI — AI that matches human reasoning in all its flexibility — is still not here. The launch of ARC-AGI-3 in March 2026 proved this decisively: even the best frontier models score below 1% on its interactive tasks, while humans solve 100%. The ARC-AGI-2 benchmark, once a formidable challenge, is rapidly being saturated, but the goalpost has moved ([11]) ([12]).

The community’s multifaceted efforts (monolithic LLMs, orchestration pipelines, and synthetic-data training) provide multiple paths forward. The competitive landscape – with prize-backed benchmarks and rapid model release cycles – has accelerated progress, but also demands careful reflection. Moving beyond this breakthrough will likely entail new architectures beyond just larger transformers, integration of other cognitive building blocks, and more focus on efficiency and alignment. As François Chollet asserts, AI research must “be smarter” rather than only “bigger” to achieve genuine generalization ([11]). The ARC-AGI-2 saga through late 2025 underscores that point vividly.

In summary, GPT-5.2’s breakthrough on ARC-AGI-2 was both an important achievement and a catalyst that unleashed rapid progress — from 53% to near-saturation in just four months. The subsequent launches of GPT-5.4, Gemini 3 Deep Think, and revolutionary meta-systems like Imbue and Confluence Lab have transformed the leaderboard beyond recognition. Yet ARC-AGI-3’s launch in March 2026 — and its humbling of every frontier model back to near-zero — proves that the core of intelligence, seamless adaptation to truly novel interactive problems, still lies ahead. As Chollet himself now projects AGI around 2030, the era of advanced LLMs reaching into fluid intelligence is upon us, but the era of truly general intelligence remains a horizon to be explored.

References: The above analysis draws on official OpenAI releases ([1]) ([44]) ([3]), Arc Prize technical reports and competition results ([15]) ([7]) ([31]) ([5]), industry news and blog reports ([32]) ([33]) ([35]) ([29]) ([27]), research reports from Imbue ([46]) and Anthropic ([6]), and commentary by experts including François Chollet ([8]) ([53]) ([54]). Each claim is supported by these credible sources.


