What we measure
Every score on the leaderboard is a German benchmark run with lm-evaluation-harness . We pick benchmarks already in the harness so the number is comparable to other groups'. Either the German split directly (MMLU-ProX-DE) or the German generate-until variant (MMMLU-DE). No private benchmarks, no proprietary scorers.
Run configuration
- Chat template applied per model.
- Temperature 0.
- Max generated tokens 8192 — headroom for reasoning traces.
-
Thinking models forced to
reasoning_effort=minimalwhere supported. Pro / always-reasoning models run as-is. - 🔒 next to a Reasoning value means the model is architecturally locked to emit reasoning tokens (e.g. gpt-oss's Harmony format requires a non-empty analysis channel — the API floor still emits ~50 tokens / request). Compare these scores to other rows with caution.
- Official 3-tier answer extractor — model writes the letter, harness extracts it.
Token accounting
Token counts split into the three additive components of the billable total:
- prompt — what we sent.
- response — visible reply.
- reasoning — hidden thinking tokens (only on reasoning models).
Reasoning-off classification has tolerance: ≤10 absolute or <0.1 % of total completion tokens still counts as off (providers sometimes tag stray tokens as reasoning).
Speed (TPS & TTFT)
Throughput is measured separately from accuracy, on a controlled streaming workload — a ~2,000-token German prompt with a 400-token output cap, sent one request at a time (n=1). We can't read decode speed off the accuracy runs: a reasoning-off multiple-choice answer is one token, so there's nothing to time. This mirrors how artificialanalysis.ai measures it.
- TPS — decode tokens per second:
output_tokens / (total_time − TTFT). The wait for the first token is excluded, so this is steady-state generation speed. Output tokens only. Median over the run. - TTFT — time to first token: request sent → first streamed chunk. The part users feel most.
- Provider-pinned (the same model on a different provider streams at a different rate), reasoning off, token counts from the provider's own usage field. A snapshot at run time, not a constant.
- 🔒 means reasoning can't be disabled (gpt-oss) — its TPS includes forced reasoning tokens, so it isn't directly comparable to the reasoning-off rows.
Known caveats
- Translation quality affects translated splits. Items where multiple top-tier models got EN right but DE wrong are flagged for review.
- OpenRouter pinned-provider rate limits can collapse mid-run. Failed runs are quarantined, not silently shipped.
- Cost is actual USD billed, not estimate.
Reproducing a run
Every result has a run_id
with launch timestamp + model slug. The full command, env, and per-item log live under
results/runs/<run_id>/
in the repo. The site rebuilds from Turso at build time — no client-side fetch, no
JavaScript shipped.