SmolLM3-3B-Base

Hugging Face

Apache-2.0 dense base

Type

dense

Total params

3.0B

Active params

3.0B

Sparsity

Context

65,536

Train tokens

11.0T

Benchmarks

Benchmark	Category	Measured	Claimed	Setup	Notes
GPQA Diamond	reasoning	35.35	—	0-shot	loglikelihood
GSM8K	math	67.48	67.63	5-shot	strict-match; reproduces vendor base number within stderr
MMLU-Pro	knowledge	35.10	—	5-shot	generative CoT, custom-extract
HumanEval+	code	29.27	30.48	0-shot	pass@1; reproduces vendor base number within stderr
AIME 2024	math	0.00	—	0-shot	greedy decoding; the model attempts the problems and formats answers correctly but gets the math wrong. A near-zero score is expected for a non-reasoning 3B base model.

Claimed vs measured

How the vendor's published numbers compare to what I measured. A bar extending left (red) means the model card over-claimed; a bar extending right (blue) means the model beat its own claim.
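
The direction of each bar comes down to the sign of measured minus claimed. A minimal sketch of that arithmetic, using only the two rows of the Benchmarks table that have a vendor number; the function name `claim_delta` and the dictionary layout are illustrative, not taken from the original page:

```python
# Measured and claimed scores, copied from the Benchmarks table.
# Only GSM8K and HumanEval+ have a vendor (claimed) number.
scores = {
    "GSM8K": (67.48, 67.63),
    "HumanEval+": (29.27, 30.48),
}


def claim_delta(measured, claimed):
    """Measured minus claimed, in percentage points (hypothetical helper)."""
    return round(measured - claimed, 2)


deltas = {name: claim_delta(m, c) for name, (m, c) in scores.items()}

for name, delta in deltas.items():
    if delta < 0:
        direction = "left: model card over-claimed"
    elif delta > 0:
        direction = "right: model beat its claim"
    else:
        direction = "no gap"
    print(f"{name}: {delta:+.2f} pp ({direction})")
```

Both deltas are negative (-0.15 and -1.21 percentage points), so both bars point left. The sign alone does not say whether a gap is larger than measurement noise.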

Independent benchmark of the base (pre-instruct) checkpoint. Two benchmarks, GSM8K and HumanEval+, have direct vendor base-model references, and both reproduced within stderr. The near-zero AIME score is the honest base-model result: the model attempts the problems and emits answers in the expected format but gets the math wrong, which is expected for a non-reasoning 3B base.
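
The "within stderr" claim can be sanity-checked by hand. The sketch below rests on two assumptions the page does not state: that stderr is the simple binomial standard error of an accuracy score, and that the test sets are the standard ones (1,319 problems for the GSM8K test split, 164 for HumanEval+). The harness behind the measured numbers may compute stderr differently.

```python
import math

# ASSUMPTION: standard test-set sizes; the page does not state them.
N_GSM8K = 1319
N_HUMANEVAL_PLUS = 164


def binomial_stderr_pp(score_pct, n_items):
    """Binomial standard error of an accuracy score, in percentage points."""
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_items)


def within_stderr(measured, claimed, n_items):
    """True if the measured/claimed gap is no larger than one standard error."""
    return abs(measured - claimed) <= binomial_stderr_pp(measured, n_items)


for name, measured, claimed, n in [
    ("GSM8K", 67.48, 67.63, N_GSM8K),
    ("HumanEval+", 29.27, 30.48, N_HUMANEVAL_PLUS),
]:
    se = binomial_stderr_pp(measured, n)
    gap = abs(measured - claimed)
    print(f"{name}: gap {gap:.2f} pp, stderr {se:.2f} pp, "
          f"within stderr: {within_stderr(measured, claimed, n)}")
```

Under these assumptions the standard errors are about 1.29 pp for GSM8K and 3.55 pp for HumanEval+, against gaps of 0.15 pp and 1.21 pp. Both gaps fall inside one standard error, which is consistent with the statement above.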