sanity·bench

SmolLM3-3B-Base

Hugging Face
Apache-2.0 dense base
Type: dense
Total params: 3.0B
Active params: 3.0B
Sparsity: not listed
Context: 65,536
Train tokens: 11.0T

Benchmarks

Benchmark      Category    Measured   Claimed   Setup
GPQA Diamond   reasoning   35.35      -         0-shot
    0-shot, loglikelihood
GSM8K          math        67.48      67.63     5-shot
    5-shot, strict-match; reproduces vendor base number within stderr
MMLU-Pro       knowledge   35.10      -         5-shot
    5-shot generative CoT, custom-extract
HumanEval+     code        29.27      30.48     0-shot
    0-shot pass@1; reproduces vendor base number within stderr
AIME 2024      math        0.00       -         0-shot
    0-shot greedy. Honest base-model result: model attempts and formats answers correctly, gets the math wrong. Near-zero is expected for a non-reasoning 3B base.

Claimed vs measured

How the vendor's published numbers compare to what I measured. A bar extending left (red) means the model card over-claimed; a bar extending right (blue) means the model beat its own claim.

measured − claimed: GSM8K -0.15, HumanEval+ -1.21
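The two differences can be recomputed from the table. The snippet below is a minimal sketch, not the site's own code: the names `SCORES`, `claim_delta`, and `direction` are hypothetical, and the only inputs are the measured and claimed scores listed above.

```python
# Scores in percent, copied from the benchmark table.
# All names here are illustrative, not part of any published tooling.
SCORES = {
    "GSM8K": {"measured": 67.48, "claimed": 67.63},
    "HumanEval+": {"measured": 29.27, "claimed": 30.48},
}


def claim_delta(measured: float, claimed: float) -> float:
    """Return measured minus claimed, rounded to two decimals."""
    return round(measured - claimed, 2)


def direction(delta: float) -> str:
    """Map the sign of the delta to the chart's reading."""
    if delta < 0:
        return "over-claimed"  # bar points left (red)
    if delta > 0:
        return "beat claim"    # bar points right (blue)
    return "matched"


deltas = {name: claim_delta(**s) for name, s in SCORES.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f} ({direction(delta)})")
```

A negative value means the measured score fell below the published one, which matches the left-pointing red bars in the chart.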
Independent benchmark of the base (pre-instruct) checkpoint. Two benchmarks, GSM8K and HumanEval+, have direct vendor base-model references, and both reproduce within stderr. The near-zero AIME score is the honest base-model result: the model attempts the problems and emits answers in the expected format but gets the math wrong, which is expected for a non-reasoning 3B base.
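One way to sanity-check the "within stderr" statement is a binomial standard error on each measured score. This is a sketch under stated assumptions: the page does not say how its stderr was computed, and the problem counts below (1,319 for the GSM8K test split, 164 for HumanEval+) are the standard sizes of those sets, not figures taken from this page. The function names are hypothetical.

```python
import math

# ASSUMPTION: standard test-set sizes; the page does not state them.
N_PROBLEMS = {"GSM8K": 1319, "HumanEval+": 164}

# (measured, claimed) in percent, copied from the benchmark table.
SCORES = {"GSM8K": (67.48, 67.63), "HumanEval+": (29.27, 30.48)}


def binomial_stderr_pct(score_pct: float, n: int) -> float:
    """Binomial standard error of an accuracy, in percentage points."""
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


def within_stderr(measured: float, claimed: float, n: int) -> bool:
    """True if the gap to the claim is at most one standard error."""
    return abs(measured - claimed) <= binomial_stderr_pct(measured, n)


results = {
    name: within_stderr(measured, claimed, N_PROBLEMS[name])
    for name, (measured, claimed) in SCORES.items()
}

for name, (measured, claimed) in SCORES.items():
    se = binomial_stderr_pct(measured, N_PROBLEMS[name])
    print(f"{name}: gap {measured - claimed:+.2f}, stderr {se:.2f}, ok={results[name]}")
```

Under these assumptions the standard errors come out at roughly 1.29 points for GSM8K and 3.55 points for HumanEval+, so both gaps sit inside one standard error. An evaluation harness may use a slightly different estimator, so the exact values can differ.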