Methodology

How FrozeBench produces scores

FrozeBench publishes evaluation runs we produced ourselves. We do not copy numbers from model cards, vendor pages, or other leaderboards; each score is tied to a recorded run and, where available, the underlying samples.

What “frozen” means

The name comes from our deterministic-run principle: default generation at temperature=0, a captured seed, recorded configuration, and inspectable artifacts. Temperature zero does not make every system perfectly deterministic, but it removes intentional sampling randomness from our default runs.
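To illustrate why temperature zero removes sampling randomness, here is a minimal sketch of the common convention in which temperature 0 is treated as greedy decoding. It is not FrozeBench code or any particular server's implementation; `pick_token` and its arguments are hypothetical names used only for illustration.

```python
import math
import random


def pick_token(logits, temperature, rng):
    """Pick a token index from raw logits (illustrative only)."""
    if temperature == 0:
        # Greedy decoding: always the highest-scoring index, so the
        # random number generator is never consulted.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Otherwise sample from a temperature-scaled softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [0.1, 2.0, 1.5]
greedy_picks = {pick_token(logits, 0, random.Random(seed)) for seed in range(5)}
```

With temperature 0 the choice does not depend on the seed. Residual nondeterminism in real systems, such as batching or floating-point ordering effects, sits outside this sketch.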

Run pipeline

1. Queue a model × benchmark job with explicit task, model, and generation settings.
2. Prepare and deploy the model behind a local OpenAI-compatible endpoint.
3. Run lm-evaluation-harness in Docker with samples logged.
4. Ingest job parameters, result files, and sample artifacts into the read-only data API.
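As a rough sketch of how a queued job might translate into an lm-evaluation-harness invocation, the snippet below builds a command line from explicit job settings. The `EvalJob` class, the model id, the endpoint URL, and the default seed are hypothetical; the flags follow the lm-evaluation-harness CLI as we understand it and should be checked against the installed version. The Docker wrapper and the ingestion step are not shown.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EvalJob:
    """Hypothetical job description: one model x one benchmark task."""
    model_id: str
    task: str
    base_url: str  # local OpenAI-compatible endpoint (assumed URL shape)
    seed: int = 1234
    gen_kwargs: dict = field(default_factory=lambda: {"temperature": 0})
    output_path: str = "results"


def build_lm_eval_command(job):
    """Return the argv list for one evaluation run; nothing is executed."""
    gen = ",".join(f"{k}={v}" for k, v in sorted(job.gen_kwargs.items()))
    return [
        "lm_eval",
        "--model", "local-completions",
        "--model_args", f"model={job.model_id},base_url={job.base_url}",
        "--tasks", job.task,
        "--gen_kwargs", gen,
        "--seed", str(job.seed),
        "--output_path", job.output_path,
        "--log_samples",  # keep per-sample artifacts for later inspection
    ]


job = EvalJob(
    model_id="example-org/example-model",
    task="gsm8k",
    base_url="http://localhost:8000/v1/completions",
)
command = build_lm_eval_command(job)
```

Building the command as a list rather than a shell string keeps every setting explicit and makes it easy to store the exact arguments next to the results.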

What each run captures

Every run stores a file record alongside the lm-eval results and per-sample artifacts. The important audit fields are:

model id and runtime args
task args, seed, and prompt/template settings
generation kwargs, including temperature
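A run record with these audit fields could look something like the sketch below. The `RunRecord` class, its field names, and the fingerprint helper are assumptions made for illustration and do not describe FrozeBench's actual schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunRecord:
    """Hypothetical audit record for a single evaluation run."""
    model_id: str
    runtime_args: dict
    task_args: dict
    seed: int
    prompt_template: str
    gen_kwargs: dict = field(default_factory=lambda: {"temperature": 0})

    def to_json(self):
        # Sorted keys give a stable serialization for the same settings.
        return json.dumps(asdict(self), sort_keys=True)

    def fingerprint(self):
        # Hash of the canonical JSON; equal settings give equal digests.
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()


record = RunRecord(
    model_id="example-org/example-model",
    runtime_args={"dtype": "bfloat16"},
    task_args={"task": "gsm8k", "num_fewshot": 5},
    seed=1234,
    prompt_template="default",
)
same_settings = RunRecord(**asdict(record))
other_seed = RunRecord(**{**asdict(record), "seed": 99})
```

A fingerprint of this kind makes it straightforward to check whether two runs were configured identically before comparing their scores.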

How headline scores are chosen

Each benchmark declares a primary metric and lm-eval filter/extractor. For composite benchmarks, FrozeBench applies the benchmark’s declared aggregation policy, such as mean or max. Task-level rows remain inspectable so the aggregate does not hide where a model is strong or weak.
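The aggregation step can be sketched as follows, assuming each task has already been reduced to its primary metric. The function name, the policy table, and the example task scores are hypothetical; only the mean and max policies named above are covered.

```python
from statistics import mean

# Aggregation policies a benchmark may declare (only those named in the text).
AGGREGATORS = {"mean": mean, "max": max}


def headline_score(task_scores, policy):
    """Collapse task-level primary-metric values into one headline number."""
    if not task_scores:
        raise ValueError("no task scores to aggregate")
    if policy not in AGGREGATORS:
        raise ValueError(f"unknown aggregation policy: {policy!r}")
    return AGGREGATORS[policy](list(task_scores.values()))


# Made-up task-level values, for illustration only.
task_scores = {"subtask_a": 0.5, "subtask_b": 0.75, "subtask_c": 0.25}
mean_headline = headline_score(task_scores, "mean")
max_headline = headline_score(task_scores, "max")
```

The same task-level rows produce different headline numbers under different policies, which is why the task rows stay visible next to the aggregate.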

Sample-level transparency

When sample artifacts are available, score pages link to the underlying prompts, model outputs, targets, and per-sample metrics. Multiple-choice and generative samples are kept as separate shapes; we do not coerce one into the other just to simplify display.
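One way to keep the two sample shapes separate is to give each its own type and branch on it at display time, as in this sketch. The class and field names are illustrative assumptions, not the actual artifact schema.

```python
from dataclasses import dataclass


@dataclass
class MultipleChoiceSample:
    prompt: str
    choices: list        # candidate answers
    choice_scores: list  # one score per choice, e.g. a log-likelihood
    target_index: int
    metrics: dict


@dataclass
class GenerativeSample:
    prompt: str
    output: str          # free-form model output
    target: str
    metrics: dict


def describe(sample):
    """Render a sample according to its own shape, without coercion."""
    if isinstance(sample, MultipleChoiceSample):
        picked = max(range(len(sample.choices)),
                     key=lambda i: sample.choice_scores[i])
        return (f"multiple-choice: picked {sample.choices[picked]!r}, "
                f"target {sample.choices[sample.target_index]!r}")
    if isinstance(sample, GenerativeSample):
        return f"generative: output {sample.output!r}, target {sample.target!r}"
    raise TypeError(f"unsupported sample type: {type(sample).__name__}")


mc = MultipleChoiceSample("2 + 2 = ?", ["3", "4"], [-2.3, -0.1], 1, {"acc": 1.0})
gen = GenerativeSample("2 + 2 = ?", "4", "4", {"exact_match": 1.0})
```

Keeping two types means a multiple-choice page can show per-choice scores while a generative page shows raw output, and neither has to invent fields it does not have.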

Limits of interpretation

A benchmark score is not a universal capability ranking.
Public benchmarks can be contaminated, saturated, noisy, or prompt-sensitive.
Extraction filters and aggregation policies can change what a headline score means.
Some completed runs may lack populated samples because of framework or ingestion limits.