Benchmark scores
Accuracy on each German benchmark. Longest bar wins — the winner is tagged.
Pick two to four models. See who speaks German best — score, speed and cost, side by side.
Average score against cost per 1,000 questions (log scale). Up and to the left wins — more accuracy per euro.
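One way to read "up and to the left wins" is as Pareto dominance: a model beats another if it is at least as accurate and at least as cheap, and strictly better on one of the two. The sketch below is only an illustration of that reading. The `Model` type, the model names and all numbers are hypothetical and do not come from the chart.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    avg_score: float  # average accuracy across benchmarks, 0..1 (hypothetical)
    cost_eur: float   # cost per 1,000 questions in euros (hypothetical)


def dominates(a: Model, b: Model) -> bool:
    """True if a sits up and/or to the left of b: no worse on both axes, better on one."""
    no_worse = a.avg_score >= b.avg_score and a.cost_eur <= b.cost_eur
    strictly_better = a.avg_score > b.avg_score or a.cost_eur < b.cost_eur
    return no_worse and strictly_better


def pareto_front(models: list[Model]) -> list[Model]:
    """Models that no other model dominates, in input order."""
    return [m for m in models if not any(dominates(o, m) for o in models)]


# Made-up example data, not measured results.
models = [
    Model("model-a", 0.82, 4.00),
    Model("model-b", 0.78, 0.40),
    Model("model-c", 0.70, 1.20),
]
front = pareto_front(models)
print([m.name for m in front])
```

Here `model-c` drops out because `model-b` is both more accurate and cheaper, while `model-a` and `model-b` trade accuracy against cost and both stay on the front.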
Throughput and TTFT (time to first token) come from a separate, controlled speed probe (≈2k-token prompt, one request at a time, provider-pinned). The probe is a snapshot, and not every model has one. 🔒 marks reasoning-locked models: their throughput includes forced reasoning tokens, so it isn't directly comparable. Quantization is the weight precision a model was served at. It is shown as "provider-internal" where a closed API doesn't disclose it and as "unverified" where the run wasn't quant-pinned.
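As a rough illustration of how TTFT and throughput can be derived from one streamed request, here is a minimal sketch. It assumes throughput means generated tokens divided by the time between the first and last token. The probe behind these numbers may define it differently, and the `ProbeTiming` type and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProbeTiming:
    """Timestamps for a single streamed request (hypothetical structure)."""
    request_sent_s: float  # when the request was sent
    first_token_s: float   # when the first token arrived
    last_token_s: float    # when the last token arrived
    output_tokens: int     # all generated tokens, including any reasoning tokens


def ttft_seconds(t: ProbeTiming) -> float:
    """Time to first token: delay between sending the request and the first token."""
    return t.first_token_s - t.request_sent_s


def throughput_tps(t: ProbeTiming) -> float:
    """Tokens per second over the generation window (assumed definition)."""
    window = t.last_token_s - t.first_token_s
    if window <= 0:
        raise ValueError("generation window must be positive")
    return t.output_tokens / window


# Made-up timing, not a measured result.
timing = ProbeTiming(request_sent_s=0.0, first_token_s=0.5, last_token_s=4.5, output_tokens=400)
print(ttft_seconds(timing), throughput_tps(timing))
```

Because `output_tokens` counts every generated token, a model that is forced to reason first has those reasoning tokens in its throughput figure, which is why such numbers are hard to compare with models that answer directly.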