sanity·bench

Moonlight-16B-A3B-Base

Moonshot AI
Apache-2.0 · MoE · MLA · Muon-optimizer · base
Type: MoE
Total params: 16.0B
Active params: 3.0B
Sparsity
Context: 8,192
Train tokens: 5.7T

Benchmarks

Benchmark      Category    Measured   Claimed   Setup
GPQA Diamond   reasoning   30.30      -         0-shot
GSM8K          math        73.92      77.40     5-shot
MMLU-Pro       knowledge   blocked    -         5-shot
HumanEval+     code        blocked    -         0-shot
AIME 2024      math        blocked    -         0-shot

GPQA Diamond: 0-shot, loglikelihood. Reproduced across 3 configs. Loglikelihood scoring never generates tokens, so it is structurally immune to the degeneration that blocked the generative benchmarks. The score is only ~1.6 standard errors above the 25% random-guess floor.
GSM8K: 5-shot, strict-match. Reproduced across 3 configs (fp16 verified-good config reported). Its prompt distribution does not trigger the degeneration.
MMLU-Pro: Blocked by input-dependent numerical degeneration under vLLM: many prompts produced garbage output, with whole MMLU-Pro categories collapsing to zero. This is a deployment-stability failure, not a model-capability score.
HumanEval+: Blocked by the same vLLM numerical degeneration affecting generative benchmarks.
AIME 2024: Blocked by the same vLLM numerical degeneration affecting generative benchmarks.
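The "~1.6 stderr above the 25% floor" remark on GPQA Diamond can be sanity-checked with a plain binomial standard error. The sketch below assumes the standard 198-question Diamond subset with four answer choices; the stderr the evaluation harness actually reported is not given on this card, and the function name is illustrative.

```python
import math


def stderr_above_floor(accuracy: float, n_items: int, floor: float) -> tuple[float, float]:
    """Return (binomial standard error, margin above the floor in stderr units)."""
    se = math.sqrt(accuracy * (1.0 - accuracy) / n_items)
    return se, (accuracy - floor) / se


# Assumption: 198 four-choice questions, so random guessing scores 25%.
se, margin = stderr_above_floor(accuracy=0.3030, n_items=198, floor=0.25)
print(f"stderr = {se:.4f}, margin = {margin:.2f} stderr")
```

Under these assumptions the margin comes out near 1.6, which is why the GPQA result should be read as barely distinguishable from chance.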

Claimed vs measured

How the vendor's published numbers compare to what I measured. Bars to the left in red mean the model card over-claimed; right in blue means it beat its own claim.

measured − claimed: GSM8K −3.48
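The GSM8K gap is simply the measured score minus the vendor's claimed score. A minimal sketch of that arithmetic and of the chart's left/right convention; the function names and label strings are hypothetical, not taken from the site's code.

```python
def claim_delta(measured: float, claimed: float) -> float:
    """Measured minus claimed, rounded to the two decimals used in the table."""
    return round(measured - claimed, 2)


def bar_direction(delta: float) -> str:
    """Map a delta onto the chart's convention: negative means over-claimed."""
    if delta < 0:
        return "over-claimed"
    if delta > 0:
        return "beat claim"
    return "matched"


delta = claim_delta(measured=73.92, claimed=77.40)
print(delta, bar_direction(delta))
```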
Partial evaluation with an important reproducibility finding. Two benchmarks completed cleanly and were reproduced across three independent configurations. The generative benchmarks could not be measured reliably: under the standard vLLM attention path on this hardware, the base model exhibits input-dependent numerical degeneration, producing garbage output (repeated '!') on a large subset of prompts. That instability is itself a documented result here, and the vendor's 'deploys easily on vLLM' claim does not mention it. The model is also notable for being trained with the Muon optimizer rather than Adam.
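Degenerate generations of the kind described here (runs of a single repeated character such as '!') can be flagged before scoring with a simple heuristic. The sketch below is one possible check, not the method used in this evaluation; the function names and both thresholds are assumptions chosen for illustration.

```python
from collections import Counter


def looks_degenerate(text: str, max_char_share: float = 0.8, min_len: int = 16) -> bool:
    """Flag output dominated by one repeated non-whitespace character."""
    stripped = "".join(text.split())  # drop all whitespace
    if len(stripped) < min_len:
        return False  # too short to judge reliably
    _, count = Counter(stripped).most_common(1)[0]
    return count / len(stripped) >= max_char_share


def degenerate_rate(outputs: list[str]) -> float:
    """Fraction of generations flagged as degenerate (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return sum(looks_degenerate(o) for o in outputs) / len(outputs)


samples = ["!" * 64, "def add(a, b):\n    return a + b"]
print(degenerate_rate(samples))
```

Reporting this rate per benchmark category would make collapses like the MMLU-Pro one visible as a stability problem rather than as a low score.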