sanity·bench

About

SanityBench is an independent benchmark project for open-weight language models. Most of the models here won't make the front page of a leaderboard. That's kind of the point.

Why this exists

Pick a random open model and look up its benchmark scores. Chances are the numbers came from the people who built it.

Sometimes those numbers hold up perfectly. Sometimes they don't. Either way, somebody should check. That's what this site does. I run the model, publish the methodology, upload the logs, and let the result speak for itself.

Who runs it

I'm Boosted (iAmBoosted on Hugging Face, iamboosted on GitHub). I spend way too much time experimenting with model architectures, MoE conversions, model merges, reasoning fine-tunes, and other questionable ideas. Many of the weird architectures I work on end up getting benchmarked here. SanityBench started because I wanted reliable numbers for my own projects and kept discovering nobody had published any.

Support

The evals burn GPU hours I pay for. If the site is useful and you'd like to help fund more runs, I appreciate it. If not, that's fine too. The benchmarks stay public either way.

Contact

Found a mistake, or want a model run? Email boosted@sanitybench.com. If a number here is wrong, I'd rather know and fix it than leave it up.