Methodology
How FrozeBench produces scores
FrozeBench publishes evaluation runs we produced ourselves. We do not copy numbers from model cards, vendor pages, or other leaderboards; each score is tied to a recorded run and, where available, the underlying samples.
What “frozen” means
The name comes from our deterministic-run principle: default generation at temperature=0, a captured seed, recorded configuration, and inspectable artifacts. Temperature zero does not make every system perfectly deterministic, but it removes intentional sampling randomness from our default runs.
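Because temperature zero does not guarantee identical output on every system, one way to check a rerun is to compare per-sample outputs from two runs made with the same settings. The following is a minimal sketch of that idea; the function name and the list-of-strings input are illustrative assumptions, not FrozeBench's actual tooling.

```python
import hashlib


def diverging_samples(run_a: list[str], run_b: list[str]) -> list[int]:
    """Return the indices where two runs with identical settings produced different text."""
    if len(run_a) != len(run_b):
        raise ValueError("runs must contain the same number of samples")

    def digest(text: str) -> str:
        # Hash each output so long generations can be compared and stored cheaply.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    return [
        index
        for index, (a, b) in enumerate(zip(run_a, run_b))
        if digest(a) != digest(b)
    ]
```

An empty result means the two runs agreed on every sample; any returned index points at a sample worth inspecting for residual nondeterminism.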
Run pipeline
1. Queue a model × benchmark job with explicit task, model, and generation settings.
2. Prepare and deploy the model behind a local OpenAI-compatible endpoint.
3. Run lm-evaluation-harness in Docker with samples logged.
4. Ingest job parameters, result files, and sample artifacts into the read-only data API.
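The evaluation step can be sketched as building an lm-evaluation-harness command from a job's explicit settings. This is a hedged illustration, not FrozeBench's actual job runner: the function name is hypothetical, the choice of the harness's `local-completions` model type is an assumption, the Docker wrapper is omitted, and the flags should be checked against the installed harness version.

```python
def build_lm_eval_command(
    model: str,
    base_url: str,
    task: str,
    seed: int,
    output_path: str,
    temperature: float = 0.0,
) -> list[str]:
    """Build an lm_eval argv list for a model served behind an OpenAI-compatible endpoint."""
    return [
        "lm_eval",
        # Assumption: the endpoint is reached through the harness's local-completions adapter.
        "--model", "local-completions",
        "--model_args", f"model={model},base_url={base_url}",
        "--tasks", task,
        # Generation settings are passed explicitly so they end up in the recorded config.
        "--gen_kwargs", f"temperature={temperature}",
        "--seed", str(seed),
        "--output_path", output_path,
        # Keep per-sample prompts and outputs so they can be ingested later.
        "--log_samples",
    ]
```

Returning a list rather than a single string keeps the command safe to hand to `subprocess.run` without shell quoting, and makes each recorded argument easy to store alongside the run.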
What each run captures
Every run stores a file record alongside the lm-eval results and per-sample artifacts. The important audit fields are:
- model id and runtime args
- task args, seed, and prompt/template settings
- generation kwargs, including temperature
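The audit fields above could be modeled as a small immutable record. The sketch below is an assumption about shape only: the class name, field names, and JSON layout are hypothetical and do not describe FrozeBench's stored schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical container for the audit fields captured with each run."""

    model_id: str
    runtime_args: dict[str, Any]
    task_args: dict[str, Any]
    seed: int
    prompt_template: str
    generation_kwargs: dict[str, Any]

    def to_json(self) -> str:
        # Sorted keys make two records easy to diff line by line.
        return json.dumps(asdict(self), sort_keys=True, indent=2)
```

Freezing the dataclass prevents fields from being reassigned after the run is recorded, which matches the intent of an audit record.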
How headline scores are chosen
Each benchmark declares a primary metric and an lm-eval filter/extractor. For composite benchmarks, FrozeBench applies the benchmark’s declared aggregation policy, such as mean or max, to the task-level scores. Task-level rows remain inspectable so the aggregate does not hide where a model is strong or weak.
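As an illustration of how an aggregation policy turns task-level scores into one headline number, here is a minimal sketch covering the two policies named above. The function name and the dictionary input are assumptions made for the example.

```python
from statistics import fmean


def headline_score(task_scores: dict[str, float], policy: str) -> float:
    """Aggregate task-level scores into a single headline score."""
    if not task_scores:
        raise ValueError("no task-level scores to aggregate")
    values = list(task_scores.values())
    if policy == "mean":
        return fmean(values)
    if policy == "max":
        return max(values)
    raise ValueError(f"unknown aggregation policy: {policy!r}")
```

With task scores of 0.5 and 1.0, the mean policy yields 0.75 while the max policy yields 1.0, which is why the declared policy matters when reading a headline score.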
Sample-level transparency
When sample artifacts are available, score pages link to the underlying prompts, model outputs, targets, and per-sample metrics. Multiple-choice and generative samples are kept as separate shapes; we do not coerce one into the other just to simplify display.
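Keeping the two sample shapes separate can be sketched as two distinct types with a union over them. All class and field names below are hypothetical; they only illustrate the idea of not forcing one shape into the other.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class MultipleChoiceSample:
    prompt: str
    choices: list[str]
    target_index: int
    predicted_index: int
    metrics: dict[str, float]


@dataclass(frozen=True)
class GenerativeSample:
    prompt: str
    output: str
    target: str
    metrics: dict[str, float]


Sample = Union[MultipleChoiceSample, GenerativeSample]


def describe(sample: Sample) -> str:
    """Render a one-line summary that respects each sample's own shape."""
    if isinstance(sample, MultipleChoiceSample):
        chosen = sample.choices[sample.predicted_index]
        target = sample.choices[sample.target_index]
        return f"chose {chosen!r}, target {target!r}"
    return f"generated {sample.output!r}, target {sample.target!r}"
```

A display layer written this way branches on the sample type instead of inventing fake choices for generative samples or fake free text for multiple-choice ones.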
Limits of interpretation
- A benchmark score is not a universal capability ranking.
- Public benchmarks can be contaminated, saturated, noisy, or prompt-sensitive.
- Extraction filters and aggregation policies can change what a headline score means.
- Some completed runs may lack populated samples because of framework or ingestion limits.