FrozeBench

Methodology

How FrozeBench produces scores

FrozeBench publishes evaluation runs we produced ourselves. We do not copy numbers from model cards, vendor pages, or other leaderboards; each score is tied to a recorded run and, where available, the underlying samples.

What “frozen” means

The name comes from our deterministic-run principle: default generation at temperature=0, a captured seed, recorded configuration, and inspectable artifacts. Temperature zero does not make every system perfectly deterministic (batching, hardware, and floating-point ordering can still change outputs), but it removes intentional sampling randomness from our default runs.

Run pipeline

  1. Queue a model × benchmark job with explicit task, model, and generation settings.
  2. Prepare and deploy the model behind a local OpenAI-compatible endpoint.
  3. Run lm-evaluation-harness in Docker with samples logged.
  4. Ingest job parameters, result files, and sample artifacts into the read-only data API.
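
As a rough illustration of steps 1 and 3, the sketch below turns a queued job into an lm-evaluation-harness command line. It is not FrozeBench's actual code: the `Job` class, its field names, the endpoint URL, and the seed value are hypothetical, and the Docker wrapper is left out. The flags shown (`--model`, `--model_args`, `--tasks`, `--gen_kwargs`, `--seed`, `--output_path`, `--log_samples`) are lm-eval CLI options as we understand them; check them against the harness version you run.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical job description: one model x benchmark pair."""
    model: str
    task: str
    # Assumed address of a local OpenAI-compatible completions endpoint.
    base_url: str = "http://localhost:8000/v1/completions"
    seed: int = 1234  # illustrative value; the real seed is captured per run
    gen_kwargs: dict = field(default_factory=lambda: {"temperature": 0})


def build_lm_eval_command(job: Job, output_dir: str) -> list[str]:
    """Build an lm_eval argument list with explicit settings and sample logging."""
    # Sort keys so the same job always yields the same command string.
    gen = ",".join(f"{k}={v}" for k, v in sorted(job.gen_kwargs.items()))
    return [
        "lm_eval",
        "--model", "local-completions",
        "--model_args", f"model={job.model},base_url={job.base_url}",
        "--tasks", job.task,
        "--gen_kwargs", gen,
        "--seed", str(job.seed),
        "--output_path", output_dir,
        "--log_samples",
    ]
```

Building the command as a list rather than a shell string keeps every setting explicit and makes the exact invocation easy to store next to the results.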

What each run captures

Every run stores a file record alongside the lm-eval results and per-sample artifacts. The important audit fields are:

  • model id and runtime args
  • task args, seed, and prompt/template settings
  • generation kwargs, including temperature
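
To make the audit fields concrete, here is a minimal sketch of what such a record could look like and how a reader might check it for completeness. The `RunRecord` class, its field names, and the `missing_audit_fields` helper are assumptions for illustration only; they do not describe the real file schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunRecord:
    """Hypothetical per-run record mirroring the audit fields listed above."""
    model_id: str
    runtime_args: dict
    task_args: dict
    seed: int
    prompt_settings: dict
    gen_kwargs: dict = field(default_factory=lambda: {"temperature": 0})


REQUIRED_FIELDS = [
    "model_id", "runtime_args", "task_args",
    "seed", "prompt_settings", "gen_kwargs",
]


def missing_audit_fields(record: dict) -> list[str]:
    """Return the names of audit fields absent from a loaded record."""
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    # Temperature is called out separately because it defines a "frozen" run.
    if "temperature" not in record.get("gen_kwargs", {}):
        missing.append("gen_kwargs.temperature")
    return missing


record = RunRecord(
    model_id="example/model",           # placeholder, not a real run
    runtime_args={"dtype": "bfloat16"},
    task_args={"num_fewshot": 0},
    seed=1234,
    prompt_settings={"apply_chat_template": False},
)
# Round-trip through JSON, as a stored file record would be.
loaded = json.loads(json.dumps(asdict(record)))
```

A check like this is useful when comparing two runs: if any listed field is missing or differs, the scores are not directly comparable.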

How headline scores are chosen

Each benchmark declares a primary metric and an lm-eval filter/extractor. For composite benchmarks, FrozeBench applies the benchmark’s declared aggregation policy, such as mean or max. Task-level rows remain inspectable so the aggregate does not hide where a model is strong or weak.
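
The aggregation step can be sketched as a lookup from a declared policy name to a function over task-level scores. This is an illustrative sketch, not the production code: the `headline_score` function, the policy names as dictionary keys, and the task names and scores are all made up.

```python
from statistics import fmean

# Policies named in the text; a real benchmark declaration may offer others.
AGGREGATORS = {"mean": fmean, "max": max}


def headline_score(task_scores: dict[str, float], policy: str) -> float:
    """Collapse task-level scores into one headline number."""
    if policy not in AGGREGATORS:
        raise ValueError(f"unknown aggregation policy: {policy!r}")
    if not task_scores:
        raise ValueError("no task-level scores to aggregate")
    return AGGREGATORS[policy](task_scores.values())


# Hypothetical task-level rows for one composite benchmark.
scores = {"subtask_a": 0.5, "subtask_b": 0.75, "subtask_c": 0.25}
mean_score = headline_score(scores, "mean")
max_score = headline_score(scores, "max")
```

The same rows give 0.5 under mean and 0.75 under max, which is why the policy has to be declared with the benchmark and the task-level rows kept visible.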

Sample-level transparency

When sample artifacts are available, score pages link to the underlying prompts, model outputs, targets, and per-sample metrics. Multiple-choice and generative samples are kept as separate shapes; we do not coerce one into the other just to simplify display.
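
One way to keep the two sample shapes separate is to model them as distinct types and branch on the type at display time. The sketch below assumes hypothetical class and field names (`MultipleChoiceSample`, `GenerativeSample`, `choice_scores`, and so on); the actual artifact schema may differ.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class MultipleChoiceSample:
    """Hypothetical shape: one score per candidate answer."""
    prompt: str
    choices: list[str]
    choice_scores: list[float]  # e.g. a log-likelihood per choice
    target_index: int


@dataclass
class GenerativeSample:
    """Hypothetical shape: free-form model output compared with a target."""
    prompt: str
    output: str
    target: str


Sample = Union[MultipleChoiceSample, GenerativeSample]


def describe(sample: Sample) -> str:
    """Render each shape on its own terms instead of forcing a common one."""
    if isinstance(sample, MultipleChoiceSample):
        picked = max(range(len(sample.choices)),
                     key=lambda i: sample.choice_scores[i])
        return (f"multiple-choice: picked {sample.choices[picked]!r}, "
                f"target {sample.choices[sample.target_index]!r}")
    if isinstance(sample, GenerativeSample):
        return f"generative: output {sample.output!r}, target {sample.target!r}"
    raise TypeError(f"unsupported sample type: {type(sample).__name__}")
```

Keeping the types apart means a multiple-choice row never has to invent a fake "output" string, and a generative row never has to invent per-choice scores.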

Limits of interpretation

  • A benchmark score is not a universal capability ranking.
  • Public benchmarks can be contaminated, saturated, noisy, or prompt-sensitive.
  • Extraction filters and aggregation policies can change what a headline score means.
  • Some completed runs may lack populated samples because of framework or ingestion limits.