DeepSeek-V2-Lite-Base

DeepSeek

DeepSeek License MoE MLA base

Type

MoE

Total params

15.7B

Active params

2.4B

Sparsity

Context

32,768

Train tokens

5.7T

Benchmarks

Benchmark	Category	Measured	Claimed	Setup
GPQA Diamond	reasoning	29.80	—	0-shot
0-shot, loglikelihood
GSM8K	math	37.45	41.10	5-shot
5-shot, strict-match. Comes in 3.65 points under the vendor number; the gap is attributed to methodology and prompting differences.
MMLU-Pro	knowledge	25.48	—	5-shot
5-shot generative CoT, custom-extract. Above the ~10% random-guess floor but well below SmolLM3-3B-Base on the identical task.
HumanEval+	code	22.56	29.90	0-shot
0-shot pass@1. The vendor number is for the original (easier) HumanEval, not HumanEval+, so the two figures are not directly comparable.
AIME 2024	math	0.00	—	0-shot
0-shot greedy. Non-termination finding: the base model generates unbounded reasoning without converging on a boxed answer. The failing fraction did not improve when the generation budget was raised from 6k to 16k tokens, which rules out the explanation that the model simply needed more tokens. This is a real result, not a truncation artifact.
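The non-termination check in the AIME note can be illustrated with a small sketch. It assumes that "converging on a boxed answer" means the generation contains a complete `\boxed{...}` span; the helper names `extract_boxed` and `failing_fraction` are hypothetical and are not taken from the evaluation code behind these numbers.

```python
from typing import List, Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None.

    Braces are matched by depth so nested LaTeX such as \\frac{1}{2}
    inside the box is handled. Hypothetical helper for illustration.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None  # the model never produced a boxed answer
    i = start + len(marker)
    depth = 1
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # box was opened but never closed


def failing_fraction(generations: List[str]) -> float:
    """Fraction of generations with no complete boxed answer."""
    missing = sum(1 for g in generations if extract_boxed(g) is None)
    return missing / len(generations)
```

Running `failing_fraction` on the outputs from each token budget and comparing the two values is one way to separate truncation from genuine non-termination: if the fraction does not drop as the budget grows, a longer budget is not what the model was missing.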

Claimed vs measured

How the vendor's published numbers compare to what I measured. Bars extending left (red) mean the model card over-claimed; bars extending right (blue) mean the model beat its own claim.
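As a rough sketch of the arithmetic behind this comparison, the gap for each benchmark with a vendor figure can be computed as measured minus claimed, so a negative value corresponds to a bar on the over-claimed side. The numbers are copied from the benchmark table; the names `rows` and `claim_gap` are illustrative only.

```python
# (benchmark, measured, claimed) for the rows that have a vendor figure.
rows = [
    ("GSM8K", 37.45, 41.10),
    # Claimed figure is for the original HumanEval, not HumanEval+,
    # so this gap overstates the like-for-like difference.
    ("HumanEval+", 22.56, 29.90),
]


def claim_gap(measured: float, claimed: float) -> float:
    """Measured minus claimed, in points; negative means over-claimed."""
    return round(measured - claimed, 2)


gaps = {name: claim_gap(m, c) for name, m, c in rows}

for name, gap in gaps.items():
    side = "over-claimed" if gap < 0 else "matched or beat claim"
    print(f"{name}: {gap:+.2f} points ({side})")
```

Benchmarks without a vendor figure (GPQA Diamond, MMLU-Pro, AIME 2024) are left out because there is no claim to compare against.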

Independent benchmark of the base checkpoint. Knowledge-heavy MMLU-Pro categories hold up best; reasoning-heavy ones are weakest, consistent with a 5.7T-token base model. The AIME 0/60 is a documented non-termination finding, not an artifact (see AIME note).