| Benchmark | Category | Measured | Claimed | Setup | Method and notes |
|---|---|---|---|---|---|
| GPQA Diamond | reasoning | 29.80 | — | 0-shot | Loglikelihood. |
| GSM8K | math | 37.45 | 41.10 | 5-shot | Strict-match. Comes in under the vendor number; methodology/prompting gap. |
| MMLU-Pro | knowledge | 25.48 | — | 5-shot | Generative CoT, custom-extract. Above the ~10% floor but well below SmolLM3-3B-Base on the identical task. |
| HumanEval+ | code | 22.56 | 29.90 | 0-shot | pass@1. Vendor number is on the original (easier) HumanEval, not HumanEval+. |
| AIME 2024 | math | 0.00 | — | 0-shot | Greedy. Non-termination finding: the base model generates unbounded reasoning without converging on a boxed answer. The failing fraction did not improve from 6k to 16k tokens, ruling out 'needed more tokens'. A real result, not a truncation artifact. |
How the vendor's published numbers compare to what I measured. Bars to the left (red) mean the model card over-claimed; bars to the right (blue) mean the model beat its own claim.
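
The bar direction can be reproduced from the table with a small sketch. It assumes the plotted quantity is measured minus claimed, in points, which is inferred from the caption rather than stated. The names `scores` and `claim_delta` are hypothetical. Only the two rows with a vendor number are included, and the HumanEval+ gap is not like-for-like because the vendor number is on the original HumanEval.

```python
# Rows that have both a measured and a vendor-claimed score, copied from the table.
scores = {
    "GSM8K": (37.45, 41.10),
    "HumanEval+": (22.56, 29.90),  # claim is on original HumanEval, not HumanEval+
}


def claim_delta(measured: float, claimed: float) -> float:
    """Measured minus claimed, in points (assumed convention).

    Negative means the model card over-claimed; positive means the
    model beat its own claim.
    """
    return round(measured - claimed, 2)


deltas = {name: claim_delta(m, c) for name, (m, c) in scores.items()}

for name, delta in deltas.items():
    if delta < 0:
        side = "left, red (over-claimed)"
    elif delta > 0:
        side = "right, blue (beat claim)"
    else:
        side = "no bar (matches claim)"
    print(f"{name}: {delta:+.2f} points -> {side}")
```

Both rows come out negative (about -3.65 for GSM8K and -7.34 for HumanEval+), so both bars would fall on the over-claimed side under this convention.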