independent · reproducible · skeptical

Somebody should check the numbers.

Independent, reproducible benchmarks of open-weight language models. Every number is re-run on disclosed hardware, with raw logs published. The exotic architectures that other leaderboards skip get tested here.

Benchmarks coming soon

The leaderboard is being rebuilt under a stricter, instruct-only methodology. Real numbers will be back here shortly.