sanity·bench

Methodology

Every number here is one I ran myself. Hardware is disclosed. Harnesses are named and version-pinned. Raw logs are in the repo. If you can't reproduce it from what's published, it doesn't belong on the site.

The one rule

Model cards exist to move and sell models. That's not a criticism, it's their job. My job is different. I download the weights, run them myself, publish the logs, and write down exactly what happened.

Runs per benchmark

One run per benchmark. I don't average across multiple runs: repeating every benchmark is expensive and, for the mostly deterministic setups used here, rarely changes the headline result.

The exception is when the harness gets in the way of measuring the model. A common example is a reasoning model hitting an 8K token cap and getting cut off before it reaches an answer. In those cases I rerun with a sensible limit and note it on the result page.
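A minimal sketch of how a cut-off run could be spotted in logged generations. The record shape, the field names, and the "stop"/"length" finish reasons are assumptions for illustration, not the output format of any particular harness.

```python
from dataclasses import dataclass


@dataclass
class Generation:
    """One logged model response (hypothetical shape, not a real harness schema)."""
    text: str
    num_tokens: int      # tokens generated for this response
    finish_reason: str   # assumed values: "stop" or "length"


def is_truncated(gen: Generation, max_new_tokens: int) -> bool:
    """True if the response ended because it ran into the token cap."""
    return gen.finish_reason == "length" or gen.num_tokens >= max_new_tokens


def truncation_rate(gens: list[Generation], max_new_tokens: int) -> float:
    """Fraction of responses cut off by the cap; 0.0 for an empty list."""
    if not gens:
        return 0.0
    cut = sum(is_truncated(g, max_new_tokens) for g in gens)
    return cut / len(gens)
```

A high truncation rate at the cap (8192 tokens, if "8K" is read that way) suggests the score is measuring the limit rather than the model, which is the case that would justify a rerun.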

Precision

Models run either unquantized at 16-bit precision (FP16/BF16) or quantized (Q4 and friends). Quantization changes results. A Q4 score and an FP16 score aren't making the same claim, so they're reported separately.
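One way to keep those claims apart, sketched under assumptions: results are keyed by model, benchmark, and precision, and nothing is ever merged across precisions. The function name, the row layout, and the model name are hypothetical, and the scores are placeholders rather than real results.

```python
from collections import defaultdict


def split_by_precision(rows):
    """Group (model, benchmark, precision, score) rows so each precision keeps its own score."""
    table = defaultdict(dict)
    for model, benchmark, precision, score in rows:
        per_precision = table[(model, benchmark)]
        if precision in per_precision:
            # A second score for the same precision would have to be averaged or dropped.
            raise ValueError(f"duplicate result for {model} / {benchmark} / {precision}")
        per_precision[precision] = score
    return dict(table)


# Placeholder rows: "example-model" and the 0.0 scores are not real data.
rows = [
    ("example-model", "GSM8K", "BF16", 0.0),
    ("example-model", "GSM8K", "Q4", 0.0),
]
report = split_by_precision(rows)
```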

Harnesses

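Wherever a standard named harness covers the benchmark, I use it instead of my own scoring code, so the numbers line up with what everyone else reports. AIME 2024 is the one benchmark below that runs on a custom script: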

Benchmark      Harness                      Setup
GPQA Diamond   lm-evaluation-harness        0-shot, loglikelihood
GSM8K          lm-evaluation-harness        5-shot, strict
MMLU-Pro       lm-evaluation-harness        5-shot
HumanEval+     bigcode-evaluation-harness   pass@1
AIME 2024      custom generation script     0-shot, greedy
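For orientation, here is a sketch of what the GSM8K row could look like as an lm-evaluation-harness invocation. It assumes a recent release that ships the `lm_eval` command-line entry point. The model name, output directory, and dtype are placeholders, and the exact invocation used for any published run is whatever the raw logs record, not this.

```python
import shlex


def lm_eval_command(model_path, task, num_fewshot, output_dir):
    """Build an argv list for the lm_eval CLI (assumed: lm-evaluation-harness 0.4 or later)."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_path},dtype=bfloat16",
        "--tasks", task,
        "--num_fewshot", str(num_fewshot),
        "--output_path", output_dir,
        "--log_samples",  # keep per-sample outputs alongside the aggregate score
    ]


# "org/example-model" and the output directory are hypothetical placeholders.
cmd = lm_eval_command("org/example-model", "gsm8k", 5, "logs/example-model/gsm8k")
print(shlex.join(cmd))
```

Building the command as a list rather than a string keeps it safe to pass to `subprocess.run` and easy to store verbatim next to the logs.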

Harness versions and any per-model patches live on each model's page and in the repo. Some of these architectures won't even load without a specific transformers or vllm pin (Ling and Ring need a bailing_moe_v2 patch, for instance), so that gets recorded too.
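A minimal sketch of how the installed versions could be captured at run time so the pins end up in the log. It uses only the standard library. The package names passed in are assumptions about what a given run depends on, and the helper names are hypothetical.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def pin_lines(versions):
    """Render requirements-style pins, skipping packages that are not installed."""
    return [f"{name}=={ver}" for name, ver in sorted(versions.items()) if ver is not None]


# The package list is an assumption; adjust it to whatever the run actually imports.
pins = pin_lines(installed_versions(["transformers", "vllm", "lm_eval"]))
```

Per-model patches are not visible to this kind of check, so they still need to be recorded by hand.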

Hardware

Runs happen on disclosed cloud GPUs (usually a RunPod A6000 48GB) or local hardware. Hardware affects speed and occasionally affects results. The exact machine is listed for every run.

Raw logs

The raw harness output for every run goes up in the GitHub repo next to the score. Don't trust the summary number? Read the actual run. Or clone the repo and reproduce it.

Claimed vs measured

When a vendor has published a number, it shows up in amber beside my measured result in ice blue. Amber is always the vendor's claim. It's there for context, and it stays unconfirmed unless there's a measured number sitting next to it.
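The rule can be sketched as a small record type. The class name, field names, and status labels are hypothetical, and the numbers in the usage lines are placeholders rather than real claims or results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoreCell:
    """One benchmark cell: the vendor's claim (amber) and the measured result (ice blue)."""
    benchmark: str
    claimed: Optional[float] = None
    measured: Optional[float] = None

    def status(self) -> str:
        """A claim counts as unconfirmed until a measured number sits beside it."""
        if self.claimed is None and self.measured is None:
            return "no data"
        if self.measured is None:
            return "claimed, unconfirmed"
        if self.claimed is None:
            return "measured"
        return "claimed and measured"


# Placeholder values only.
only_claim = ScoreCell("GSM8K", claimed=0.0)
both = ScoreCell("GSM8K", claimed=0.0, measured=0.0)
```

The status deliberately stops at "claimed and measured": whether the two numbers agree is left for the reader to judge from the values themselves.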