Data Pipeline Overview
The MATHEL benchmark uses a three-layer data pipeline that transforms raw AI-generated outputs into verified statistics displayed in the webapp.
1. Raw Source Files - Grading metadata, solutions, leakage JSON
2. JSON Data Files - problems_imo.json, stats.json, imo-ablations.json
3. Webapp Charts - Plotly visualizations computed from JSON
Layer 1: Raw Source Files
- 398 grading metadata files (data/imo/gradings-*/*) - Generated Nov 19-20, 2025. Token usage: 5,367-29,905 tokens per grading. Format: "Score: X out of 7", parsed consistently.
- 398 solution files (data/imo/solutions-*/*) - Contain LaTeX mathematical notation and proof language. Real mathematical content verified.
- 398 leakage JSON files (data/imo/leakage_results-*/*) - Valid JSON structure with verdict, confidence, and reasoning fields.
Layer 2: JSON Data Files
- problems_imo.json - 398/398 gemini_pro_score values match raw files. 398/398 leakage_verdict values match raw files. Generated by webapp/generate_data.py from Layer 1 (an illustrative record shape is sketched after this list).
- stats.json - Pre-aggregated statistics. All values match independent computation from problems_imo.json.
- analysis/imo-ablations.json - Cross-model grading statistics generated from the gradings directories.
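For orientation, here is a minimal sketch of how one record in problems_imo.json might look, assuming the file is keyed by problem id. Only the gemini_pro_score and leakage_verdict fields are named in this document; the id format and the remaining fields are hypothetical.

```python
# Illustrative only: gemini_pro_score and leakage_verdict are the fields named
# above; the problem-id key, competition, and category fields are assumptions.
example_record = {
    "imo_2024_p1": {                 # hypothetical problem id
        "competition": "IMO",        # assumed field
        "category": "Algebra",       # assumed field
        "gemini_pro_score": 7,       # traced to data/imo/gradings-gemini-3-pro-preview/
        "leakage_verdict": "leaked", # traced to data/imo/leakage_results-gemini-3-pro-preview/
    }
}
```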
Layer 3: Webapp Charts
Charts in webapp/js/app.js compute values directly from problems_imo.json. The same formulas used in data generation are used for verification. Values displayed = values in JSON = values in raw files.
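The aggregation behind the headline metrics is simple enough to restate. The sketch below is in Python for illustration only (the webapp computes these values in webapp/js/app.js); the file path and the reading of "failure" as a score of 0 are assumptions.

```python
import json

# Minimal sketch of the aggregation behind the displayed metrics, assuming each
# record in problems_imo.json exposes a gemini_pro_score on the 0-7 scale.
with open("problems_imo.json") as f:   # path is an assumption
    problems = json.load(f)

scores = [p["gemini_pro_score"] for p in problems.values()]
average = sum(scores) / len(scores)                        # "Average" column
perfect_rate = sum(s == 7 for s in scores) / len(scores)   # "Perfect Rate"
failure_rate = sum(s == 0 for s in scores) / len(scores)   # "Failure Rate" (assumed: score 0)
print(f"avg={average:.2f}  perfect={perfect_rate:.2%}  failure={failure_rate:.2%}")
```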
Verification Methodology
1. Score Tracing (398/398 problems verified; see the sketch after this list)
- Read raw file: data/imo/gradings-gemini-3-pro-preview/{id}_grading_metadata.txt
- Extract using regex: "Score: (\d+) out of (\d+)"
- Compare to: problems_imo.json[id].gemini_pro_score
- Result: 100% match
2. Leakage Tracing (398/398 problems verified)
- Read raw file: data/imo/leakage_results-gemini-3-pro-preview/{id}_leakage.json
- Extract: leakage_analysis.verdict
- Compare to: problems_imo.json[id].leakage_verdict
- Result: 100% match
3. Independent Computation
- Computed all statistics from problems_imo.json
- Compared to stats.json pre-computed values
- Result: 100% match on all metrics
4. Raw File Authenticity
- Timestamps: Nov 19-20, 2025 (consistent with generation dates)
- Token counts: Variable (5K-30K), indicating real API calls
- Content: Real mathematical proofs with LaTeX notation
- Human data: Official IMO format, 1959-2025 coverage
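The score and leakage tracing steps can be summarized as a short script. This is a sketch, not the verification code itself: the paths, regex, and field names come from the steps above, while the location and keying of problems_imo.json are assumptions.

```python
import json
import re
from pathlib import Path

# Sketch of the score and leakage tracing described above.
problems = json.loads(Path("problems_imo.json").read_text())  # assumed location

mismatches = 0
for pid, record in problems.items():
    # Score tracing: parse "Score: X out of 7" from the grading metadata file.
    meta_path = Path(f"data/imo/gradings-gemini-3-pro-preview/{pid}_grading_metadata.txt")
    score = int(re.search(r"Score: (\d+) out of (\d+)", meta_path.read_text()).group(1))
    mismatches += score != record["gemini_pro_score"]

    # Leakage tracing: read leakage_analysis.verdict from the leakage JSON file.
    leak_path = Path(f"data/imo/leakage_results-gemini-3-pro-preview/{pid}_leakage.json")
    verdict = json.loads(leak_path.read_text())["leakage_analysis"]["verdict"]
    mismatches += verdict != record["leakage_verdict"]

print(f"{mismatches} mismatches across {len(problems)} problems")  # expected: 0
```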
Verified Numbers
Model Performance (IMO, 0-7 scale)
| Model | Average | Perfect Rate | Failure Rate |
| --- | --- | --- | --- |
| GPT-5.2 | 6.65/7 | 92.35% (302/327) | 3.36% (11/327) |
| Gemini-3-Pro | 6.11/7 | 81.66% (325/398) | 3.27% (13/398) |
| Gemini-3-Flash | 5.85/7 | 78.64% (313/398) | 3.27% (13/398) |
| GPT-5.1 | 5.06/7 | 65.33% (260/398) | 8.29% (33/398) |
| Human | 2.95/7 | 25.5% | 42.2% |
Category Distribution
| Category | Count | Gemini-3-Pro Avg |
| --- | --- | --- |
| Geometry | 131 problems | 5.81/7 |
| Algebra | 98 problems | 6.77/7 |
| Number Theory | 89 problems | 6.28/7 |
| Combinatorics | 80 problems | 5.61/7 |
Leakage Distribution
| Verdict | Count | Percentage |
| --- | --- | --- |
| Leaked | 316 | 79.4% |
| No Leakage (Clean) | 42 | 10.6% |
| Comparable Meaning | 40 | 10.1% |
Competition Totals
| Competition | Problems | Years |
| --- | --- | --- |
| IMO | 398 | 1959-2025 |
| Putnam | 1,031 | - |
| IMC | 322 | - |
| Total | 1,751 | - |
Solving Methodology
AI models receive a structured prompt with competition context, domain constraints, and a 4-phase reasoning process (a hypothetical sketch follows the list below):
- Analysis - Understand the problem structure and constraints
- Derivation - Develop the mathematical solution
- Self-Verification - Check the solution for errors
- Confidence Scoring - Rate solution confidence
Domain constraints differ by competition:
- IMO: Pre-calculus level, 0-7 scale, elementary methods encouraged
- Putnam/IMC: University level, 0-10 scale, advanced theorems allowed
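The following is a hypothetical illustration of how such a prompt could be assembled around the four phases and the per-competition constraints. The wording, the build_prompt helper, and the constraint strings are assumptions; the actual MATHEL prompt text is not reproduced in this document.

```python
# Hypothetical prompt assembly illustrating the structure described above.
PHASES = ["Analysis", "Derivation", "Self-Verification", "Confidence Scoring"]

CONSTRAINTS = {
    "IMO": "Pre-calculus level; elementary methods encouraged; 0-7 scale.",
    "Putnam/IMC": "University level; advanced theorems allowed; 0-10 scale.",
}

def build_prompt(problem: str, competition: str) -> str:
    steps = "\n".join(f"{i}. {phase}" for i, phase in enumerate(PHASES, 1))
    return (
        f"Competition: {competition}\n"
        f"Constraints: {CONSTRAINTS[competition]}\n\n"
        f"Work through the following phases:\n{steps}\n\n"
        f"Problem:\n{problem}"
    )
```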
API Configuration
All models are called via standard API endpoints, which do not have web search enabled by default.
Note: Testing via Gemini/ChatGPT chat interfaces may produce different results, as those interfaces may have web search enabled. Our benchmark results reflect standard API calls without web retrieval.
Grading Methodology
An LLM grader evaluates proposed solutions against ground-truth human solutions:
- IMO rubric: 7=complete, 6=minor gaps, 1=partial progress, 0=no progress
- Putnam rubric: 10=complete, 9=trivial omission, 0-2=common for partial work
- Grader explicitly ignores any self-assessment in the solution
- Cross-model grading validates consistency and detects bias (see the agreement sketch after this list)
- Validation: grades produced with this prompt have been shown to correlate with human expert grades from 15 IMO medalists (correlation coefficient above 0.92)
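A cross-model consistency check of the kind summarized in analysis/imo-ablations.json can be sketched as below. The second grader's directory name and the assumption that its file layout mirrors gradings-gemini-3-pro-preview are illustrative, not taken from the repository.

```python
import re
from pathlib import Path

# Sketch of a cross-model grading agreement check; directory names are assumptions.
SCORE_RE = re.compile(r"Score: (\d+) out of (\d+)")

def read_scores(grading_dir: str) -> dict:
    scores = {}
    for path in Path(grading_dir).glob("*_grading_metadata.txt"):
        pid = path.name.removesuffix("_grading_metadata.txt")
        scores[pid] = int(SCORE_RE.search(path.read_text()).group(1))
    return scores

grader_a = read_scores("data/imo/gradings-gemini-3-pro-preview")
grader_b = read_scores("data/imo/gradings-gpt-5.1")  # hypothetical directory name
common = grader_a.keys() & grader_b.keys()
mean_abs_diff = sum(abs(grader_a[p] - grader_b[p]) for p in common) / len(common)
print(f"Mean absolute score difference over {len(common)} problems: {mean_abs_diff:.2f}")
```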
Leakage Detection Methodology
A three-stage pipeline detects training data contamination:
- Stage 1 (Split): Each problem is split into a prefix and a suffix:
  - Prefix: setup, initial conditions, givens
  - Suffix: what needs to be determined, shown, or proven (the objectives)
  - Critical rules: both parts must be non-empty, must carry meaningful mathematical content rather than just command words, and must not be generic; each must contain specific mathematical objects, expressions, or conditions
- Stage 2 (Predictor): Given only the prefix, a model predicts the suffix without internet access
- Stage 3 (Analyzer): The prediction is compared to the ground-truth suffix and a verdict is assigned
Verdicts:
- "leaked" - Near-identical prediction indicating memorization
- "comparable_meaning" - Similar meaning but different wording
- "no_leakage" - No evidence of memorization
Cross-model validation uses 4 Predictor x Analyzer combinations. Multilingual testing across 55 languages validates the contamination hypothesis.
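Structurally, the three stages can be sketched as the data flow below. The function names and bodies are placeholders for the Predictor and Analyzer model calls, not the benchmark's code; the verdict strings are the ones listed above.

```python
from dataclasses import dataclass

# Structural sketch of the three-stage leakage pipeline described above.
@dataclass
class SplitProblem:
    prefix: str   # Stage 1 output: setup and givens, with specific mathematical content
    suffix: str   # Stage 1 output: what must be determined, shown, or proven

def predict_suffix(prefix: str) -> str:
    """Stage 2: a Predictor model guesses the suffix from the prefix alone,
    with no internet access (placeholder for an API call)."""
    raise NotImplementedError

def analyze(prediction: str, ground_truth: str) -> str:
    """Stage 3: an Analyzer compares prediction and ground truth and returns
    "leaked", "comparable_meaning", or "no_leakage" (placeholder)."""
    raise NotImplementedError

def run_pipeline(problem: SplitProblem) -> str:
    assert problem.prefix and problem.suffix, "both parts must be non-empty"
    prediction = predict_suffix(problem.prefix)   # Stage 2
    return analyze(prediction, problem.suffix)    # Stage 3
```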