MATHEL Verification

Data Pipeline and Verification

All Data Verified January 5, 2026

Data Pipeline Overview

The MATHEL benchmark uses a three-layer data pipeline that transforms raw AI-generated outputs into verified statistics displayed in the webapp.

1. Raw Source Files - grading metadata, solutions, leakage JSON
   ↓
2. JSON Data Files - problems_imo.json, stats.json, imo-ablations.json
   ↓
3. Webapp Charts - Plotly visualizations computed from JSON

Layer 1: Raw Source Files

The ground truth: per-problem grading metadata, model solutions, and leakage JSON produced during evaluation. Every downstream number must trace back to these files.

Layer 2: JSON Data Files

The aggregated files the webapp consumes: problems_imo.json, stats.json, and imo-ablations.json, generated from the Layer 1 sources.

Layer 3: Webapp Charts

Charts in webapp/js/app.js compute values directly from problems_imo.json, using the same formulas that generated the data in the first place, so verification reduces to checking that the values displayed equal the values in the JSON, which equal the values in the raw files.
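
To make that equality concrete, here is a minimal sketch of the chart-side computation reproduced outside the webapp. The schema is an assumption: each record in problems_imo.json is taken to carry per-model scores on the 0-7 scale under a hypothetical "scores" field, with "perfect" read as 7/7 and "failure" as 0/7; the real field names and thresholds may differ.

    import json

    # Load the Layer 2 file the Plotly charts read from.
    with open("problems_imo.json") as f:
        problems = json.load(f)

    def model_stats(model):
        # Hypothetical schema: p["scores"][model] is an integer 0-7.
        scores = [p["scores"][model] for p in problems if model in p.get("scores", {})]
        n = len(scores)
        return {
            "average": sum(scores) / n,                       # e.g. 6.11 for Gemini-3-Pro
            "perfect_rate": sum(s == 7 for s in scores) / n,  # share of 7/7 solutions
            "failure_rate": sum(s == 0 for s in scores) / n,  # share of 0/7 solutions
        }

    print(model_stats("gemini-3-pro"))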

Verification Methodology

1. Score Tracing (398/398 problems verified)

2. Leakage Tracing (398/398 problems verified)

3. Independent Computation (sketched below, together with step 1)

4. Raw File Authenticity
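
Steps 1 and 3 amount to recomputing every aggregate from the Layer 1 files and comparing against the published numbers. A minimal sketch, in which the raw/*.json layout, the "problem_id" and "score" fields, and the stats.json key are all assumptions for illustration:

    import glob
    import json

    # Gather per-problem scores from the raw grading files (Layer 1).
    raw_scores = {}
    for path in glob.glob("raw/*.json"):
        with open(path) as f:
            record = json.load(f)
        raw_scores[record["problem_id"]] = record["score"]

    # Compare the recomputed aggregate with the published stats.json (Layer 2).
    with open("stats.json") as f:
        stats = json.load(f)

    recomputed = sum(raw_scores.values()) / len(raw_scores)
    assert abs(recomputed - stats["average_score"]) < 1e-9, "stats.json disagrees with raw files"
    print(f"traced {len(raw_scores)} problems; average {recomputed:.2f}")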

Verified Numbers

Model Performance (IMO, 0-7 scale)

Model            Average   Perfect Rate        Failure Rate
GPT-5.2          6.65/7    92.35% (302/327)    3.36% (11/327)
Gemini-3-Pro     6.11/7    81.66% (325/398)    3.27% (13/398)
Gemini-3-Flash   5.85/7    78.64% (313/398)    3.27% (13/398)
GPT-5.1          5.06/7    65.33% (260/398)    8.29% (33/398)
Human            2.95/7    25.5%               42.2%
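
Each rate can be checked directly from its fraction: 302/327 ≈ 92.35%, 325/398 ≈ 81.66%, and 33/398 ≈ 8.29% (GPT-5.2's denominators are 327 rather than the full 398).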

Category Distribution

Category        Count          Gemini-3-Pro Avg
Geometry        131 problems   5.81/7
Algebra          98 problems   6.77/7
Number Theory    89 problems   6.28/7
Combinatorics    80 problems   5.61/7
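
The counts sum to 398, and the count-weighted average of the per-category scores, (131×5.81 + 98×6.77 + 89×6.28 + 80×5.61) / 398 ≈ 6.11, reproduces Gemini-3-Pro's overall average from the table above.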

Leakage Distribution

Verdict              Count   Percentage
Leaked               316     79.4%
No Leakage (Clean)    42     10.6%
Comparable Meaning    40     10.1%
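
The verdict counts also sum to 398 (316 + 42 + 40); the percentages are rounded, which is why they total 100.1% rather than exactly 100%.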

Competition Totals

Competition   Problems   Years
IMO           398        1959-2025
Putnam        1,031      -
IMC           322        -
Total         1,751      -
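
The total checks out: 398 + 1,031 + 322 = 1,751.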

Solving Methodology

AI models receive a structured prompt with competition context, domain constraints, and a 4-phase reasoning process (a sketch follows the list):

  1. Analysis - Understand the problem structure and constraints
  2. Derivation - Develop the mathematical solution
  3. Self-Verification - Check the solution for errors
  4. Confidence Scoring - Rate solution confidence
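
A minimal sketch of how such a prompt might be assembled. The phase names and descriptions come from the list above; the function name, wording, and parameters are illustrative assumptions, not the benchmark's actual prompt:

    PHASES = [
        ("Analysis", "Understand the problem structure and constraints."),
        ("Derivation", "Develop the mathematical solution."),
        ("Self-Verification", "Check the solution for errors."),
        ("Confidence Scoring", "Rate your confidence in the solution."),
    ]

    def build_prompt(problem_text, competition="IMO"):
        # Illustrative wording only; the benchmark's real prompt is not reproduced here.
        steps = "\n".join(f"{i}. {name}: {desc}" for i, (name, desc) in enumerate(PHASES, 1))
        return (
            f"You are solving a {competition} problem under competition rules.\n"
            f"Work through these phases in order:\n{steps}\n\n"
            f"Problem:\n{problem_text}"
        )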

API Configuration

All models are called via standard API endpoints, which do not enable web search by default.

Note: Testing via Gemini/ChatGPT chat interfaces may produce different results, as those interfaces may have web search enabled. Our benchmark results reflect standard API calls without web retrieval.
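
For concreteness, a call of roughly this shape (an OpenAI-compatible chat-completions request; the endpoint, key handling, and model identifier are assumptions) sends no tools, so the model has no web-search capability during the run:

    import os
    import requests

    resp = requests.post(
        "https://api.example.com/v1/chat/completions",  # assumed OpenAI-compatible endpoint
        headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
        json={
            "model": "gpt-5.1",  # model identifier as named in the tables above
            "messages": [{"role": "user", "content": "<structured 4-phase prompt>"}],
            # No "tools" field: the model cannot invoke web search or any other tool.
        },
        timeout=600,
    )
    print(resp.json()["choices"][0]["message"]["content"])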

Grading Methodology

An LLM grader evaluates each proposed solution against the ground-truth human solution and assigns a score on the 0-7 IMO scale.
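
A minimal sketch of what such a grading call could look like; the rubric wording, endpoint, grader model name, and JSON response format are all assumptions for illustration:

    import json
    import os
    import requests

    RUBRIC = (
        "You are grading an olympiad solution on the 0-7 IMO scale. Compare the "
        "proposed solution to the ground-truth solution and reply with JSON: "
        '{"score": <0-7>, "justification": "<one paragraph>"}'
    )

    def grade(problem, proposed, ground_truth):
        # Ask the grader model to score the proposed solution against the human one.
        resp = requests.post(
            "https://api.example.com/v1/chat/completions",  # assumed endpoint
            headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
            json={
                "model": "grader-model",  # placeholder grader identifier
                "messages": [
                    {"role": "system", "content": RUBRIC},
                    {"role": "user", "content": (
                        f"Problem:\n{problem}\n\nProposed solution:\n{proposed}\n\n"
                        f"Ground-truth solution:\n{ground_truth}"
                    )},
                ],
            },
            timeout=600,
        )
        return json.loads(resp.json()["choices"][0]["message"]["content"])["score"]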

Leakage Detection Methodology

A three-stage pipeline detects training-data contamination (a minimal sketch follows the stage list):

  1. Stage 1 (Split): Each problem is split into a prefix and a suffix:
    • Prefix: the setup, initial conditions, and givens; it must contain meaningful mathematical content
    • Suffix: what needs to be determined, shown, or proven (the objectives)

    Critical rules for the prefix and suffix:

    1. Both must be non-empty
    2. Both must carry meaningful mathematical content, not just command words
    3. Generic prefixes and suffixes are not allowed; both must contain specific mathematical objects, expressions, or conditions
  2. Stage 2 (Predictor): Given only the prefix, the model predicts the suffix, without internet access
  3. Stage 3 (Analyzer): The prediction is compared to the ground-truth suffix and a verdict is assigned
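
The intuition behind the pipeline: if a model can reproduce a problem's objective from its setup alone, the problem likely appeared in its training data. A minimal sketch of the three stages; llm() is a placeholder for a chat-completions call like the one sketched earlier, and the prompts and separator are illustrative assumptions:

    def llm(prompt):
        # Placeholder for a chat-completions call (see the API Configuration sketch).
        raise NotImplementedError

    def split_problem(problem_text):
        # Stage 1 (Split): separate setup/givens (prefix) from objective (suffix).
        raw = llm("Split this problem into PREFIX and SUFFIX, separated by '|||':\n" + problem_text)
        prefix, suffix = raw.split("|||")
        # Critical rules: both parts must be non-empty and mathematically meaningful.
        assert prefix.strip() and suffix.strip(), "both parts must be non-empty"
        return prefix, suffix

    def detect_leakage(problem_text):
        prefix, suffix = split_problem(problem_text)
        # Stage 2 (Predictor): sees only the prefix, with no internet access.
        predicted = llm("Given only this problem setup, predict what is asked:\n" + prefix)
        # Stage 3 (Analyzer): compare prediction to ground truth, assign a verdict.
        return llm(
            "Compare the predicted objective to the true one and answer with exactly "
            "one of: Leaked, Comparable Meaning, No Leakage.\n"
            f"Predicted: {predicted}\nTrue: {suffix}"
        )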

Verdicts: Leaked, Comparable Meaning, or No Leakage (Clean); their frequencies appear in the Leakage Distribution table above.

Cross-model validation uses 4 Predictor × Analyzer combinations, and multilingual testing across 55 languages further validates the contamination hypothesis.
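
One possible reading of the 4 combinations, shown here with placeholder model names (whether the pool is two predictors crossed with two analyzers is an assumption):

    from itertools import product

    predictors = ["predictor-a", "predictor-b"]  # placeholder names
    analyzers = ["analyzer-c", "analyzer-d"]     # placeholder names

    # Run the leakage pipeline under every Predictor x Analyzer pairing.
    for predictor, analyzer in product(predictors, analyzers):
        print(f"predictor={predictor}, analyzer={analyzer}")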