Every number here comes from a run I did myself. Hardware is disclosed. Harnesses are named and version-pinned. Raw logs are in the repo. If you can't reproduce it from what's published, it doesn't belong on the site.
Model cards exist to move and sell models. That's not a criticism, it's their job. My job is different. I download the weights, run them myself, publish the logs, and write down exactly what happened.
One run per benchmark. I don't average across multiple runs. Averaging is expensive and, for the mostly deterministic setups used here, rarely changes the headline result.
The exception is when the harness gets in the way of measuring the model. A common example is a reasoning model hitting an 8K token cap and getting cut off before it reaches an answer. In those cases I rerun with a higher token limit and note the change on the result page.
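
A rough sketch of how a truncation check like that could look, assuming each response record carries a token count and a finish reason. The field names (`completion_tokens`, `finish_reason`), the `needs_rerun` helper, and the 5% threshold are all hypothetical, not the schema or policy of any particular harness.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Generation:
    """One model response. Field names are illustrative assumptions."""
    doc_id: int
    completion_tokens: int
    finish_reason: str  # e.g. "stop" or "length"


def truncated(gen: Generation, max_tokens: int) -> bool:
    """True if the response was likely cut off by the token cap."""
    return gen.finish_reason == "length" or gen.completion_tokens >= max_tokens


def needs_rerun(gens: List[Generation], max_tokens: int, threshold: float = 0.05) -> bool:
    """Flag a run when more than `threshold` of responses hit the cap.

    The 5% default is an arbitrary assumption for this sketch.
    """
    if not gens:
        return False
    cut = sum(truncated(g, max_tokens) for g in gens)
    return cut / len(gens) > threshold
```

If `needs_rerun` comes back true for a run capped at 8192 tokens, that is the signal to raise the cap, rerun, and record the new limit next to the score.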
Models run either unquantized at their native 16-bit precision (FP16/BF16) or quantized (Q4 and friends). Quantization changes results. A Q4 score and an FP16 score aren't making the same claim, so they're reported separately.
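Where a standard named harness exists, I use it instead of my own scoring code, so the numbers line up with what everyone else reports. The one exception in the table is AIME 2024, which runs through a custom generation script: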
| Benchmark | Harness | Setup |
|---|---|---|
| GPQA Diamond | lm-evaluation-harness | 0-shot, loglikelihood |
| GSM8K | lm-evaluation-harness | 5-shot, strict |
| MMLU-Pro | lm-evaluation-harness | 5-shot |
| HumanEval+ | bigcode-evaluation-harness | pass@1 |
| AIME 2024 | custom generation script | 0-shot, greedy |
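
For the lm-evaluation-harness rows, an invocation might be assembled roughly like this. It is a sketch, not the exact command behind any published run: the flags are the ones recent lm-evaluation-harness releases expose on the `lm_eval` CLI and should be checked against the pinned version, while `org/model-name`, the output directory, and the `lm_eval_command` helper are placeholders.

```python
import shlex
from typing import List


def lm_eval_command(model: str, task: str, num_fewshot: int, output_dir: str) -> List[str]:
    """Build an lm-evaluation-harness CLI call as an argument list."""
    return [
        "lm_eval",
        "--model", "hf",
        # dtype is set explicitly so the precision of the run is on record
        "--model_args", f"pretrained={model},dtype=bfloat16",
        "--tasks", task,
        "--num_fewshot", str(num_fewshot),
        "--batch_size", "auto",
        "--output_path", output_dir,
        "--log_samples",  # keep per-sample outputs, not just the summary score
    ]


# Placeholder model id and paths; "gsm8k" with 5 shots matches the table row.
cmd = lm_eval_command("org/model-name", "gsm8k", 5, "logs/gsm8k")
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would execute it. Building the call as a list keeps the exact arguments easy to log alongside the raw output.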
Harness versions and any per-model patches live on each model's page and in the repo. Some of these architectures won't even load without a specific transformers or vllm pin (Ling and Ring need a bailing_moe_v2 patch, for instance), so that gets recorded too.
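
One way to capture those pins automatically is to read installed package versions at run time and write them out with the logs. This is a minimal sketch: the `run_manifest` helper and its layout are hypothetical, the package names passed in are assumptions about what is installed, and a package that isn't present is recorded as `None` rather than guessed.

```python
import json
import platform
from importlib import metadata
from typing import Dict, List, Optional


def pinned_versions(packages: List[str]) -> Dict[str, Optional[str]]:
    """Look up the installed version of each package, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def run_manifest(packages: List[str], patches: List[str]) -> dict:
    """Collect interpreter version, package pins, and applied patches."""
    return {
        "python": platform.python_version(),
        "packages": pinned_versions(packages),
        "patches": list(patches),
    }


# Package names are assumptions; the patch name is the one mentioned above.
manifest = run_manifest(["transformers", "vllm", "lm_eval"], patches=["bailing_moe_v2"])
print(json.dumps(manifest, indent=2))
```

Saving that JSON next to the raw harness output means anyone reproducing the run can see which versions and patches it depended on.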
Runs happen on disclosed cloud GPUs (usually a RunPod A6000 48GB) or local hardware. Hardware affects speed and occasionally affects results. The exact machine is listed for every run.
The raw harness output for every run goes up in the GitHub repo next to the score. Don't trust the summary number? Read the actual run. Or clone the repo and reproduce it.
When a vendor has published a number, it shows up in amber beside my measured result in ice blue. Amber is always the vendor's claim. It's there for context, and it stays unconfirmed unless there's a measured number sitting next to it.
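
The display rule can be written down as a small data model. This is a sketch of the logic only: the `ScoreRow` class, its field names, and the status strings are hypothetical, and the numbers in the usage lines are placeholders, not real scores.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoreRow:
    """One benchmark row on a model page (illustrative structure)."""
    benchmark: str
    vendor_claim: Optional[float] = None  # rendered in amber
    measured: Optional[float] = None      # rendered in ice blue

    def claim_status(self) -> str:
        """Describe the vendor claim relative to a measured result."""
        if self.vendor_claim is None:
            return "no vendor claim"
        if self.measured is None:
            # A claim with nothing measured beside it stays unconfirmed.
            return "unconfirmed"
        return "measured result alongside"


# Placeholder values for illustration only.
print(ScoreRow("GSM8K", vendor_claim=80.0).claim_status())
print(ScoreRow("GSM8K", vendor_claim=80.0, measured=78.5).claim_status())
```

Keeping the two numbers in separate fields means a vendor figure can never be rendered as if it were a measured one.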