Compare models

Pick two to four models. See who speaks German best — score, speed and cost, side by side.

Benchmark scores

Accuracy on each German benchmark. Longest bar wins — the winner is tagged.

Quality vs. cost

Average score against cost per 1,000 questions (log scale). Up and to the left wins — more accuracy per euro.
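One way to read "up and to the left" is as a dominance check: a model is worth a closer look if no other selected model is at least as accurate and at least as cheap. The sketch below illustrates that reading. It is not the page's own logic, and the `Point` type, its field names, and the two helper functions are hypothetical.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    """One model on the chart (hypothetical field names)."""
    name: str
    avg_score: float      # average benchmark accuracy, higher is better
    cost_per_1k: float    # cost per 1,000 questions, lower is better


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.avg_score >= b.avg_score and a.cost_per_1k <= b.cost_per_1k
    strictly_better = a.avg_score > b.avg_score or a.cost_per_1k < b.cost_per_1k
    return no_worse and strictly_better


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep the models that no other model dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

The log scale changes only how the cost axis is drawn. The comparison itself uses raw costs, so the same models survive either way.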

Details

Throughput and TTFT (time to first token) come from a separate, controlled speed probe: a ≈2k-token prompt, one request at a time, provider-pinned. The probe is a snapshot, and not every model has one. 🔒 marks reasoning-locked models: their throughput includes forced reasoning tokens, so it isn't directly comparable with other models. Quantization is the weight precision a model was served at. It is shown as provider-internal where a closed API doesn't disclose it, and as unverified where the run wasn't quant-pinned.
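As a rough illustration, both speed numbers could be derived from one streamed response like this. The page does not state its exact formula, so the sketch assumes throughput is counted over the decode phase only (after the first token arrives). The function name and its parameters are hypothetical.

```python
from typing import List, Optional, Tuple


def speed_metrics(
    request_sent: float, token_times: List[float]
) -> Tuple[float, Optional[float]]:
    """Return (ttft_seconds, tokens_per_second) for one streamed response.

    request_sent: timestamp when the request was sent, in seconds.
    token_times: arrival timestamp of each output token, in order.
    """
    if not token_times:
        raise ValueError("no tokens received")

    # Time to first token: delay between sending and the first arrival.
    ttft = token_times[0] - request_sent

    # Assumed definition: tokens generated after the first one,
    # divided by the time it took to generate them.
    decode_time = token_times[-1] - token_times[0]
    if decode_time <= 0:
        return ttft, None  # a single token gives no measurable rate
    throughput = (len(token_times) - 1) / decode_time
    return ttft, throughput
```

For a reasoning-locked model, `token_times` would also include the forced reasoning tokens, which is why its rate does not line up with models that emit only the answer.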