sanity·bench
independent · reproducible · skeptical

Somebody should check the numbers.

Every open model comes with benchmark scores. Most of those scores come from the people who built the model. SanityBench reruns the tests, publishes the methodology, and uploads the logs. The weird stuff gets extra attention: MoE, Mamba, hybrids, diffusion models, whatever somebody decided to build at 3 A.M. If a result can't be reproduced, it doesn't belong here.

Leaderboard

Independent reproductions of open-weight language models · 2 models evaluated · last updated May 2026

Model          Vendor                   Arch  Total params  Active params  GPQA   GSM8K  MMLU-Pro  HumanEval  AIME
Ling-mini-2.0  Ant Group / InclusionAI  MoE   16.0B         1.4B           37.88  80.89  53.34     72.56      16.70
Ring-mini-2.0  Ant Group / InclusionAI  MoE   16.0B         1.4B           37.88  79.76  54.52     65.24      10.00
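
For anyone who wants to poke at the two rows above, here is a minimal Python sketch. It computes the share of parameters that are active per token for each MoE model and the per-benchmark gap between the two. The numbers are copied from the table; the names `ROWS`, `active_fraction`, and `score_deltas` are made up for this illustration and are not part of any SanityBench tooling.

```python
# Rows copied by hand from the leaderboard table above.
# Parameter counts are in billions; scores are as published on this page.
ROWS = [
    {
        "model": "Ling-mini-2.0",
        "total_b": 16.0,
        "active_b": 1.4,
        "scores": {"GPQA": 37.88, "GSM8K": 80.89, "MMLU-Pro": 53.34,
                   "HumanEval": 72.56, "AIME": 16.70},
    },
    {
        "model": "Ring-mini-2.0",
        "total_b": 16.0,
        "active_b": 1.4,
        "scores": {"GPQA": 37.88, "GSM8K": 79.76, "MMLU-Pro": 54.52,
                   "HumanEval": 65.24, "AIME": 10.00},
    },
]


def active_fraction(row):
    """Fraction of total parameters that are active (hypothetical helper)."""
    return row["active_b"] / row["total_b"]


def score_deltas(a, b):
    """Per-benchmark difference a - b, rounded to two decimals (hypothetical helper)."""
    return {name: round(a["scores"][name] - b["scores"][name], 2)
            for name in a["scores"]}


for row in ROWS:
    print(f'{row["model"]}: {active_fraction(row):.2%} of parameters active')

ling, ring = ROWS
for name, delta in score_deltas(ling, ring).items():
    print(f"{name}: Ling minus Ring = {delta:+.2f}")
```

The deltas only compare the two reproduced rows against each other. They say nothing about vendor-reported scores, which are not listed on this page.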