Methodology — German Artificial Analytics

What we measure

Every score on the leaderboard is a German benchmark run with lm-evaluation-harness . We pick benchmarks already in the harness so the number is comparable to other groups'. Either the German split directly (MMLU-ProX-DE) or the German generate-until variant (MMMLU-DE). No private benchmarks, no proprietary scorers.

Run configuration

Chat template applied per model.
Temperature 0.
Max generated tokens 8192 — headroom for reasoning traces.
Thinking models forced to reasoning_effort=minimal where supported. Pro / always-reasoning models run as-is.
🔒 next to a Reasoning value means the model is architecturally locked to emit reasoning tokens (e.g. gpt-oss's Harmony format requires a non-empty analysis channel — the API floor still emits ~50 tokens / request). Compare these scores to other rows with caution.
Official 3-tier answer extractor — model writes the letter, harness extracts it.

Token accounting

Token counts split into the three additive components of the billable total:

prompt — what we sent.
response — visible reply.
reasoning — hidden thinking tokens (only on reasoning models).

Reasoning-off classification has tolerance: ≤10 absolute or <0.1 % of total completion tokens still counts as off (providers sometimes tag stray tokens as reasoning).

Speed (TPS & TTFT)

Throughput is measured separately from accuracy, on a controlled streaming workload — a ~2,000-token German prompt with a 400-token output cap, sent one request at a time (n=1). We can't read decode speed off the accuracy runs: a reasoning-off multiple-choice answer is one token, so there's nothing to time. This mirrors how artificialanalysis.ai measures it.

TPS — decode tokens per second: output_tokens / (total_time − TTFT). The wait for the first token is excluded, so this is steady-state generation speed. Output tokens only. Median over the run.
TTFT — time to first token: request sent → first streamed chunk. The part users feel most.
Provider-pinned (the same model on a different provider streams at a different rate), reasoning off, token counts from the provider's own usage field. A snapshot at run time, not a constant.
🔒 means reasoning can't be disabled (gpt-oss) — its TPS includes forced reasoning tokens, so it isn't directly comparable to the reasoning-off rows.

Known caveats

Translation quality affects translated splits. Items where multiple top-tier models got EN right but DE wrong are flagged for review.
OpenRouter pinned-provider rate limits can collapse mid-run. Failed runs are quarantined, not silently shipped.
Cost is actual USD billed, not estimate.

Reproducing a run

Every result has a run_id with launch timestamp + model slug. The full command, env, and per-item log live under results/runs/<run_id>/ in the repo. The site rebuilds from Turso at build time — no client-side fetch, no JavaScript shipped.