Data Pipeline Overview
The MATHEL benchmark uses a three-layer data pipeline that transforms raw AI-generated outputs into verified statistics displayed in the webapp.
1. Raw Source Files - Grading metadata, solutions, leakage JSON
2. JSON Data Files - problems_imo.json, stats.json, imo-ablations.json
3. Webapp Charts - Plotly visualizations computed from JSON
Layer 1: Raw Source Files
- 398 grading metadata files (data/imo/gradings-*/*) - Generated Nov 19-20, 2025. Token usage: 5,367-29,905 tokens per grading. Format: "Score: X out of 7", parsed consistently.
- 398 solution files (data/imo/solutions-*/*) - Contain LaTeX mathematical notation and proof language. Real mathematical content verified.
- 398 leakage JSON files (data/imo/leakage_results-*/*) - Valid JSON structure with verdict, confidence, and reasoning fields.
Layer 2: JSON Data Files
- problems_imo.json - 398/398 gemini_pro_score values match raw files. 398/398 leakage_verdict values match raw files. Generated by webapp/generate_data.py from Layer 1 (an illustrative record shape is sketched after this list).
- stats.json - Pre-aggregated statistics. All values match independent computation from problems_imo.json.
- analysis/imo-ablations.json - Cross-model grading statistics generated from the gradings directories.
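For orientation, here is a minimal sketch of how one record in problems_imo.json might look, assuming the file is keyed by problem id. Only the gemini_pro_score and leakage_verdict fields are named in this document; the id format and the remaining fields are hypothetical.

```python
# Illustrative only: gemini_pro_score and leakage_verdict are the fields named
# above; the problem-id key, competition, and category fields are assumptions.
example_record = {
    "imo_2024_p1": {                 # hypothetical problem id
        "competition": "IMO",        # assumed field
        "category": "Algebra",       # assumed field
        "gemini_pro_score": 7,       # traced to data/imo/gradings-gemini-3-pro-preview/
        "leakage_verdict": "leaked", # traced to data/imo/leakage_results-gemini-3-pro-preview/
    }
}
```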
Layer 3: Webapp Charts
Charts in webapp/js/app.js compute values directly from problems_imo.json. The same formulas used in data generation are used for verification. Values displayed = values in JSON = values in raw files.
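The aggregation behind the headline metrics is simple enough to restate. The sketch below is in Python for illustration only (the webapp computes these values in webapp/js/app.js); the file path and the reading of "failure" as a score of 0 are assumptions.

```python
import json

# Minimal sketch of the aggregation behind the displayed metrics, assuming each
# record in problems_imo.json exposes a gemini_pro_score on the 0-7 scale.
with open("problems_imo.json") as f:   # path is an assumption
    problems = json.load(f)

scores = [p["gemini_pro_score"] for p in problems.values()]
average = sum(scores) / len(scores)                        # "Average" column
perfect_rate = sum(s == 7 for s in scores) / len(scores)   # "Perfect Rate"
failure_rate = sum(s == 0 for s in scores) / len(scores)   # "Failure Rate" (assumed: score 0)
print(f"avg={average:.2f}  perfect={perfect_rate:.2%}  failure={failure_rate:.2%}")
```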
Verification Methodology
1. Score Tracing (398/398 problems verified; see the sketch after this list)
- Read raw file: data/imo/gradings-gemini-3-pro-preview/{id}_grading_metadata.txt
- Extract using regex: "Score: (\d+) out of (\d+)"
- Compare to: problems_imo.json[id].gemini_pro_score
- Result: 100% match
2. Leakage Tracing (398/398 problems verified)
- Read raw file: data/imo/leakage_results-gemini-3-pro-preview/{id}_leakage.json
- Extract: leakage_analysis.verdict
- Compare to: problems_imo.json[id].leakage_verdict
- Result: 100% match
3. Independent Computation
- Computed all statistics from problems_imo.json
- Compared to stats.json pre-computed values
- Result: 100% match on all metrics
4. Raw File Authenticity
- Timestamps: Nov 19-20, 2025 (consistent with generation dates)
- Token counts: Variable (5K-30K), indicating real API calls
- Content: Real mathematical proofs with LaTeX notation
- Human data: Official IMO format, 1959-2025 coverage
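The score and leakage tracing steps can be summarized as a short script. This is a sketch, not the verification code itself: the paths, regex, and field names come from the steps above, while the location and keying of problems_imo.json are assumptions.

```python
import json
import re
from pathlib import Path

# Sketch of the score and leakage tracing described above.
problems = json.loads(Path("problems_imo.json").read_text())  # assumed location

mismatches = 0
for pid, record in problems.items():
    # Score tracing: parse "Score: X out of 7" from the grading metadata file.
    meta_path = Path(f"data/imo/gradings-gemini-3-pro-preview/{pid}_grading_metadata.txt")
    score = int(re.search(r"Score: (\d+) out of (\d+)", meta_path.read_text()).group(1))
    mismatches += score != record["gemini_pro_score"]

    # Leakage tracing: read leakage_analysis.verdict from the leakage JSON file.
    leak_path = Path(f"data/imo/leakage_results-gemini-3-pro-preview/{pid}_leakage.json")
    verdict = json.loads(leak_path.read_text())["leakage_analysis"]["verdict"]
    mismatches += verdict != record["leakage_verdict"]

print(f"{mismatches} mismatches across {len(problems)} problems")  # expected: 0
```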
Verified Numbers
Model Performance (IMO, 0-7 scale)
| Model | Average | Perfect Rate | Failure Rate |
| --- | --- | --- | --- |
| GPT-5.2 | 6.65/7 | 92.35% (302/327) | 3.36% (11/327) |
| Gemini-3-Pro | 6.11/7 | 81.66% (325/398) | 3.27% (13/398) |
| Gemini-3-Flash | 5.85/7 | 78.64% (313/398) | 3.27% (13/398) |
| GPT-5.1 | 5.06/7 | 65.33% (260/398) | 8.29% (33/398) |
| Human | 2.95/7 | 25.5% | 42.2% |
Category Distribution
| Category | Count | Gemini-3-Pro Avg |
| --- | --- | --- |
| Geometry | 131 problems | 5.81/7 |
| Algebra | 98 problems | 6.77/7 |
| Number Theory | 89 problems | 6.28/7 |
| Combinatorics | 80 problems | 5.61/7 |
Leakage Distribution
| Verdict | Count | Percentage |
| --- | --- | --- |
| Leaked | 316 | 79.4% |
| No Leakage (Clean) | 42 | 10.6% |
| Comparable Meaning | 40 | 10.1% |
Competition Totals
| Competition | Problems | Years |
| --- | --- | --- |
| IMO | 398 | 1959-2025 |
| Putnam | 1,031 | - |
| IMC | 322 | - |
| Total | 1,751 | - |
Solving Methodology
AI models receive a structured prompt with competition context, domain constraints, and a 4-phase reasoning process (a hypothetical sketch follows the list below):
- Analysis - Understand the problem structure and constraints
- Derivation - Develop the mathematical solution
- Self-Verification - Check the solution for errors
- Confidence Scoring - Rate solution confidence
Domain constraints differ by competition:
- IMO: Pre-calculus level, 0-7 scale, elementary methods encouraged
- Putnam/IMC: University level, 0-10 scale, advanced theorems allowed
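The following is a hypothetical illustration of how such a prompt could be assembled around the four phases and the per-competition constraints. The wording, the build_prompt helper, and the constraint strings are assumptions; the actual MATHEL prompt text is not reproduced in this document.

```python
# Hypothetical prompt assembly illustrating the structure described above.
PHASES = ["Analysis", "Derivation", "Self-Verification", "Confidence Scoring"]

CONSTRAINTS = {
    "IMO": "Pre-calculus level; elementary methods encouraged; 0-7 scale.",
    "Putnam/IMC": "University level; advanced theorems allowed; 0-10 scale.",
}

def build_prompt(problem: str, competition: str) -> str:
    steps = "\n".join(f"{i}. {phase}" for i, phase in enumerate(PHASES, 1))
    return (
        f"Competition: {competition}\n"
        f"Constraints: {CONSTRAINTS[competition]}\n\n"
        f"Work through the following phases:\n{steps}\n\n"
        f"Problem:\n{problem}"
    )
```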
API Configuration
All models are called via standard API endpoints, which do not have web search enabled by default.
Note: Testing via Gemini/ChatGPT chat interfaces may produce different results, as those interfaces may have web search enabled. Our benchmark results reflect standard API calls without web retrieval.
Grading Methodology
An LLM grader evaluates proposed solutions against ground-truth human solutions:
- IMO rubric: 7=complete, 6=minor gaps, 1=partial progress, 0=no progress
- Putnam rubric: 10=complete, 9=trivial omission, 0-2=common for partial work
- Grader explicitly ignores any self-assessment in the solution
- Cross-model grading validates consistency and detects bias (see the agreement sketch after this list)
- Validation: grades produced with this prompt have been shown to correlate with human expert grades from 15 IMO medalists (correlation coefficient above 0.92)
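A cross-model consistency check of the kind summarized in analysis/imo-ablations.json can be sketched as below. The second grader's directory name and the assumption that its file layout mirrors gradings-gemini-3-pro-preview are illustrative, not taken from the repository.

```python
import re
from pathlib import Path

# Sketch of a cross-model grading agreement check; directory names are assumptions.
SCORE_RE = re.compile(r"Score: (\d+) out of (\d+)")

def read_scores(grading_dir: str) -> dict:
    scores = {}
    for path in Path(grading_dir).glob("*_grading_metadata.txt"):
        pid = path.name.removesuffix("_grading_metadata.txt")
        scores[pid] = int(SCORE_RE.search(path.read_text()).group(1))
    return scores

grader_a = read_scores("data/imo/gradings-gemini-3-pro-preview")
grader_b = read_scores("data/imo/gradings-gpt-5.1")  # hypothetical directory name
common = grader_a.keys() & grader_b.keys()
mean_abs_diff = sum(abs(grader_a[p] - grader_b[p]) for p in common) / len(common)
print(f"Mean absolute score difference over {len(common)} problems: {mean_abs_diff:.2f}")
```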
Leakage Detection Methodology
A three-stage pipeline detects training data contamination:
- Stage 1 (Split): Each problem is split into a prefix and a suffix:
  - Prefix: setup, initial conditions, givens
  - Suffix: what needs to be determined, shown, or proven (the objectives)
  - Critical rules: both parts must be non-empty, must carry meaningful mathematical content rather than just command words, and must not be generic; each must contain specific mathematical objects, expressions, or conditions
- Stage 2 (Predictor): Given only the prefix, a model predicts the suffix without internet access
- Stage 3 (Analyzer): The prediction is compared to the ground-truth suffix and a verdict is assigned
Verdicts:
- "leaked" - Near-identical prediction indicating memorization
- "comparable_meaning" - Similar meaning but different wording
- "no_leakage" - No evidence of memorization
Cross-model validation uses 4 Predictor x Analyzer combinations. Multilingual testing across 55 languages validates the contamination hypothesis.
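Structurally, the three stages can be sketched as the data flow below. The function names and bodies are placeholders for the Predictor and Analyzer model calls, not the benchmark's code; the verdict strings are the ones listed above.

```python
from dataclasses import dataclass

# Structural sketch of the three-stage leakage pipeline described above.
@dataclass
class SplitProblem:
    prefix: str   # Stage 1 output: setup and givens, with specific mathematical content
    suffix: str   # Stage 1 output: what must be determined, shown, or proven

def predict_suffix(prefix: str) -> str:
    """Stage 2: a Predictor model guesses the suffix from the prefix alone,
    with no internet access (placeholder for an API call)."""
    raise NotImplementedError

def analyze(prediction: str, ground_truth: str) -> str:
    """Stage 3: an Analyzer compares prediction and ground truth and returns
    "leaked", "comparable_meaning", or "no_leakage" (placeholder)."""
    raise NotImplementedError

def run_pipeline(problem: SplitProblem) -> str:
    assert problem.prefix and problem.suffix, "both parts must be non-empty"
    prediction = predict_suffix(problem.prefix)   # Stage 2
    return analyze(prediction, problem.suffix)    # Stage 3
```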