Every open model comes with benchmark scores. Most of those scores came from the people who built it. SanityBench reruns the tests, publishes the methodology, and uploads the logs. The weird stuff gets extra attention: MoE, Mamba, hybrids, diffusion models, whatever somebody decided to build at 3 A.M. If a result can't be reproduced, it doesn't belong here.
Independent reproductions of open-weight language models · 2 models evaluated · last updated May 2026
| # | Model | Vendor | Arch | Total params | Active params | GPQA | GSM8K | MMLU-Pro | HumanEval | AIME |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Ling-mini-2.0 | Ant Group / InclusionAI | MoE | 16.0B | 1.4B | 37.88 | 80.89 | 53.34 | 72.56 | 16.70 |
| 2 | Ring-mini-2.0 | Ant Group / InclusionAI | MoE | 16.0B | 1.4B | 37.88 | 79.76 | 54.52 | 65.24 | 10.00 |
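
For anyone who wants to poke at the numbers rather than squint at them, here is a minimal sketch that works directly from the two rows in the table. It computes the share of parameters active per token for each MoE model and the per-benchmark gap between the two. The names `ROWS`, `active_fraction`, and `score_deltas` are made up for this illustration and are not part of any SanityBench tooling; the values are copied by hand from the table, not loaded from published logs.

```python
# Values copied by hand from the table above (parameter counts in billions).
# The data layout and function names are illustrative assumptions.
ROWS = [
    {
        "model": "Ling-mini-2.0",
        "total_b": 16.0,
        "active_b": 1.4,
        "scores": {"GPQA": 37.88, "GSM8K": 80.89, "MMLU-Pro": 53.34,
                   "HumanEval": 72.56, "AIME": 16.70},
    },
    {
        "model": "Ring-mini-2.0",
        "total_b": 16.0,
        "active_b": 1.4,
        "scores": {"GPQA": 37.88, "GSM8K": 79.76, "MMLU-Pro": 54.52,
                   "HumanEval": 65.24, "AIME": 10.00},
    },
]


def active_fraction(row):
    """Return active parameters as a fraction of total parameters."""
    return row["active_b"] / row["total_b"]


def score_deltas(a, b):
    """Return per-benchmark differences (a minus b), rounded to 2 decimals."""
    return {
        name: round(a["scores"][name] - b["scores"][name], 2)
        for name in a["scores"]
    }


if __name__ == "__main__":
    for row in ROWS:
        print(f"{row['model']}: {active_fraction(row):.2%} of parameters active")
    ling, ring = ROWS
    for name, delta in score_deltas(ling, ring).items():
        print(f"{name}: Ling minus Ring = {delta:+.2f}")
```

Keeping the rows as plain dictionaries means a new model is one more entry in the list, and the same two helpers apply unchanged.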