| Benchmark | Category | Measured (%) | Claimed (%) | Setup | Notes |
|---|---|---|---|---|---|
| GPQA Diamond | reasoning | 35.35 | — | 0-shot, loglikelihood | |
| GSM8K | math | 67.48 | 67.63 | 5-shot, strict-match | Reproduces vendor base number within stderr. |
| MMLU-Pro | knowledge | 35.10 | — | 5-shot generative CoT, custom-extract | |
| HumanEval+ | code | 29.27 | 30.48 | 0-shot pass@1 | Reproduces vendor base number within stderr. |
| AIME 2024 | math | 0.00 | — | 0-shot greedy | Honest base-model result: model attempts and formats answers correctly, gets the math wrong. Near-zero is expected for a non-reasoning 3B base. |

How the vendor's published numbers compare with what I measured. A red bar extending left means the model card over-claimed; a blue bar extending right means the model beat its own claimed score.
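The comparison behind the chart can be sketched in a few lines. This is a rough illustration, not the script used for the write-up. It assumes each bar is simply measured minus claimed, and it approximates the standard error with the binomial formula. The test-set sizes (1319 for GSM8K, 164 for HumanEval+) are not stated here; they are the standard sizes for those benchmarks and are labeled as assumptions in the code. The names `scores`, `n_items`, and `compare` are hypothetical.

```python
import math

# (measured, claimed) scores in percent, copied from the table.
# Benchmarks without a vendor claim are omitted.
scores = {
    "GSM8K": (67.48, 67.63),
    "HumanEval+": (29.27, 30.48),
}

# ASSUMPTION: test-set sizes are not given in this document.
# 1319 is the standard GSM8K test split; 164 is the HumanEval problem count.
n_items = {"GSM8K": 1319, "HumanEval+": 164}


def compare(measured, claimed, n):
    """Return (delta, stderr, side) for one benchmark, in percentage points."""
    delta = measured - claimed
    p = measured / 100.0
    # Binomial standard error of an accuracy estimated from n items.
    # An evaluation harness may compute this slightly differently.
    stderr = 100.0 * math.sqrt(p * (1.0 - p) / n)
    if delta < 0:
        side = "left/red (over-claimed)"
    elif delta > 0:
        side = "right/blue (beat claim)"
    else:
        side = "on the axis (matched)"
    return delta, stderr, side


results = {name: compare(m, c, n_items[name]) for name, (m, c) in scores.items()}

for name, (delta, stderr, side) in results.items():
    within = abs(delta) <= stderr
    print(f"{name}: delta={delta:+.2f}, stderr~{stderr:.2f}, "
          f"within stderr={within}, bar={side}")
```

Under these assumptions both gaps come out smaller than one standard error, which is consistent with the "within stderr" notes in the table, even though both bars point left.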