Independent German-language benchmarks for the fastest models — accuracy, latency, and cost, side by side.
| # | Model | INCLUDE (DE) | MMLU-ProX (DE) | MMMLU (DE) | Avg |
|---|---|---|---|---|---|
| 1 | Gemini 3.5 Flash | 72.7% | 86.5% | 89.3% | 82.8% |
| 2 | Gemini 3.1 Flash-Lite | 72.7% | 82.1% | 86.7% | 80.5% |
| 3 | Gemini 2.5 Flash | 70.5% | 79.6% | 84.7% | 78.3% |
| 4 | Qwen3.6 35B-A3B | 68.3% | 80.0% | 85.8% | 78.1% |
| 5 | DeepSeek V4 Flash | 70.5% | 74.3% | 85.1% | 76.6% |
| 6 | Claude Haiku 4.5 | 68.3% | — | 83.1% | 75.7% |
| 7 | Tencent HY3-Preview | 69.1% | 73.2% | 83.7% | 75.3% |
| 8 | DeepSeek V4 Pro | 69.8% | — | — | 69.8% |
| 9 | Gemma 4 31B | 67.6% | — | — | 67.6% |
| 10 | gpt-oss-120b | 66.2% | — | — | 66.2% |
| 11 | Mercury 2 | 54.0% | 63.6% | 68.0% | 61.9% |
| 12 | GLM-5.1 | 47.5% | — | — | 47.5% |
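The Avg column is consistent with a plain mean over whichever benchmarks a model was scored on, skipping the dashes. A minimal sketch of that calculation, assuming this is how the averages are derived; the name `average_score` and the use of `None` for a dash are illustrative choices, not taken from the benchmark's own code:

```python
from statistics import mean

def average_score(scores):
    """Mean of the benchmarks a model was actually scored on (None = not run)."""
    available = [s for s in scores if s is not None]
    if not available:
        raise ValueError("model has no benchmark scores")
    return round(mean(available), 1)

# Rows copied from the table: (INCLUDE, MMLU-ProX, MMMLU); None where the table shows a dash.
rows = {
    "Gemini 3.5 Flash": (72.7, 86.5, 89.3),
    "Claude Haiku 4.5": (68.3, None, 83.1),
    "DeepSeek V4 Pro": (69.8, None, None),
}

for model, scores in rows.items():
    print(f"{model}: {average_score(scores)}%")
```

Because the table shows scores already rounded to one decimal, a recomputed mean can differ from the published figure in the last digit. Qwen3.6 35B-A3B, for example, recomputes to 78.0% against a published 78.1%, presumably because the published average uses unrounded scores.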
Average score across all benchmarks against cost per 1,000 questions (log scale). Up and to the left is better — more accuracy per dollar. The gold line is the value frontier: the best score available at each price.
* Scored on INCLUDE only. These models' averages cover a single benchmark, so they aren't directly comparable to the full-coverage models.
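One way to read "the best score available at each price" is as a running maximum: walk the models from cheapest to most expensive and keep each one that beats every cheaper model. The sketch below assumes that reading; `value_frontier` is a hypothetical name. The per-1,000-question costs behind the chart aren't listed on this page, so the example uses a few rows from the INCLUDE (DE) table (Cost and Score columns) as stand-in input.

```python
def value_frontier(points):
    """Return the (name, cost, score) entries that beat every cheaper-or-equal entry.

    points: iterable of (name, cost, score) tuples.
    """
    frontier = []
    best = float("-inf")
    # Cheapest first; at equal cost, the higher score goes first.
    for name, cost, score in sorted(points, key=lambda p: (p[1], -p[2])):
        if score > best:  # strictly better than anything at this price or below
            frontier.append((name, cost, score))
            best = score
    return frontier

# Stand-in input: a subset of INCLUDE (DE) rows, not the chart's own data.
include_rows = [
    ("Gemini 3.5 Flash", 0.29, 72.7),
    ("Gemini 3.1 Flash-Lite", 0.04, 72.7),
    ("DeepSeek V4 Flash", 0.01, 70.5),
    ("Gemini 2.5 Flash", 0.07, 70.5),
    ("Mercury 2", 0.01, 54.0),
]

for name, cost, score in value_frontier(include_rows):
    print(f"${cost:.2f}  {score:.1f}%  {name}")
```

The strict comparison means a model that only ties a cheaper one on score is left off the frontier, which is a design choice of this sketch and may differ from how the chart draws its line.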
Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.
| # | Model | Score | Latency | Tok in | Tok out | Cost | Date |
|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.5 Flash | 72.7% | 1.88s | 105.8 | 214.9 | $0.29 | 2026-06-02 |
| 2 | Gemini 3.1 Flash-Lite | 72.7% | 1.27s | 106.1 | 153.2 | $0.04 | 2026-06-02 |
| 3 | DeepSeek V4 Flash | 70.5% | 2.61s | 118.6 | 108.4 | $0.01 | 2026-06-02 |
| 4 | Gemini 2.5 Flash | 70.5% | 1.61s | 105.8 | 193.6 | $0.07 | 2026-06-02 |
| 5 | DeepSeek V4 Pro | 69.8% | 3.59s | 118.6 | 113.6 | $0.07 | 2026-06-02 |
| 6 | Tencent HY3-Preview | 69.1% | 5.48s | 143.9 | 222.4 | $0.01 | 2026-06-02 |
| 7 | Qwen3.6 35B-A3B | 68.3% | 2.72s | 118.4 | 545.1 | $0.08 | 2026-06-02 |
| 8 | Claude Haiku 4.5 | 68.3% | 3.45s | 154.6 | 347.4 | $0.26 | 2026-06-02 |
| 9 | Gemma 4 31B | 67.6% | 11.86s | 118.8 | 257.9 | $0.02 | 2026-06-03 |
| 10 | gpt-oss-120b 🔒 | 66.2% | 1.71s | 154.1 | 21.4 | $0.01 | 2026-05-29 |
| 11 | Mercury 2 | 54.0% | 0.48s | 111.3 | 58.6 | $0.01 | 2026-06-02 |
| 12 | GLM-5.1 | 47.5% | 7.54s | 113.9 | 257.2 | $0.16 | 2026-06-03 |
Hard academic questions across 14 subjects — STEM, law, health, economics, philosophy and more. Professionally translated to German, with up to ten answer options per question.
| # | Model | Score | Latency | Tok in | Tok out | Cost | Date |
|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.5 Flash | 86.5% | 2.23s | 1,664.9 | 353.3 | $66.76 | 2026-05-26 |
| 2 | Gemini 3.1 Flash-Lite | 82.1% | 1.80s | 1,664.9 | 303.8 | $10.25 | 2026-05-26 |
| 3 | Qwen3.6 35B-A3B | 80.0% | 4.67s | 1,689.6 | 984 | $13.36 | 2026-05-31 |
| 4 | Gemini 2.5 Flash | 79.6% | 2.69s | 1,664.9 | 786 | $28.89 | 2026-05-26 |
| 5 | DeepSeek V4 Flash | 74.3% | 2.16s | 1,718.6 | 161.9 | $1.61 | 2026-05-28 |
| 6 | Tencent HY3-Preview | 73.2% | 6.14s | 1,910.2 | 586.3 | $2.76 | 2026-05-27 |
| 7 | Mercury 2 | 63.6% | 1.13s | 1,563.3 | 580.3 | $6.22 | 2026-05-29 |
| Subject | Model | Score |
|---|---|---|
| biology | Gemini 3.5 Flash | 92.9% |
| biology | Qwen3.6 35B-A3B | 89.7% |
| biology | Gemini 3.1 Flash-Lite | 89.3% |
| biology | Gemini 2.5 Flash | 88.8% |
| biology | DeepSeek V4 Flash | 88.6% |
| biology | Tencent HY3-Preview | 85.8% |
| biology | Mercury 2 | 76.8% |
| business | Gemini 3.5 Flash | 89.6% |
| business | Gemini 3.1 Flash-Lite | 86.1% |
| business | Gemini 2.5 Flash | 83.5% |
| business | Qwen3.6 35B-A3B | 83.5% |
| business | Tencent HY3-Preview | 78.3% |
| business | DeepSeek V4 Flash | 73.3% |
| business | Mercury 2 | 70.1% |
| chemistry | Gemini 3.5 Flash | 89.1% |
| chemistry | Qwen3.6 35B-A3B | 88.2% |
| chemistry | Gemini 2.5 Flash | 86.9% |
| chemistry | Gemini 3.1 Flash-Lite | 86.4% |
| chemistry | Tencent HY3-Preview | 75.6% |
| chemistry | DeepSeek V4 Flash | 74.2% |
| chemistry | Mercury 2 | 72.0% |
| computer science | Gemini 3.5 Flash | 87.6% |
| computer science | Gemini 3.1 Flash-Lite | 86.3% |
| computer science | Gemini 2.5 Flash | 85.1% |
| computer science | Qwen3.6 35B-A3B | 85.1% |
| computer science | DeepSeek V4 Flash | 82.7% |
| computer science | Tencent HY3-Preview | 76.8% |
| computer science | Mercury 2 | 71.7% |
| economics | Gemini 3.5 Flash | 89.1% |
| economics | Gemini 3.1 Flash-Lite | 87.1% |
| economics | Gemini 2.5 Flash | 86.4% |
| economics | Qwen3.6 35B-A3B | 85.3% |
| economics | Tencent HY3-Preview | 82.3% |
| economics | DeepSeek V4 Flash | 77.6% |
| economics | Mercury 2 | 69.5% |
| engineering | Gemini 3.5 Flash | 82.5% |
| engineering | Gemini 3.1 Flash-Lite | 77.7% |
| engineering | Qwen3.6 35B-A3B | 77.5% |
| engineering | Gemini 2.5 Flash | 71.6% |
| engineering | Tencent HY3-Preview | 64.4% |
| engineering | DeepSeek V4 Flash | 57.4% |
| engineering | Mercury 2 | 47.7% |
| health | Gemini 3.5 Flash | 78.6% |
| health | Gemini 3.1 Flash-Lite | 75.5% |
| health | Tencent HY3-Preview | 73.7% |
| health | Qwen3.6 35B-A3B | 72.8% |
| health | Gemini 2.5 Flash | 72.5% |
| health | DeepSeek V4 Flash | 72.3% |
| health | Mercury 2 | 60.8% |
| history | Gemini 3.5 Flash | 80.6% |
| history | Gemini 3.1 Flash-Lite | 75.9% |
| history | Tencent HY3-Preview | 71.4% |
| history | Gemini 2.5 Flash | 70.9% |
| history | DeepSeek V4 Flash | 69.8% |
| history | Qwen3.6 35B-A3B | 69.0% |
| history | Mercury 2 | 48.6% |
| law | Gemini 3.5 Flash | 72.8% |
| law | Gemini 3.1 Flash-Lite | 62.7% |
| law | Gemini 2.5 Flash | 54.0% |
| law | Qwen3.6 35B-A3B | 52.8% |
| law | DeepSeek V4 Flash | 46.3% |
| law | Tencent HY3-Preview | 45.8% |
| law | Mercury 2 | 30.1% |
| math | Gemini 3.5 Flash | 94.6% |
| math | Qwen3.6 35B-A3B | 92.7% |
| math | Gemini 3.1 Flash-Lite | 91.0% |
| math | Gemini 2.5 Flash | 90.5% |
| math | DeepSeek V4 Flash | 88.5% |
| math | Tencent HY3-Preview | 86.5% |
| math | Mercury 2 | 81.4% |
| other | Gemini 3.5 Flash | 82.4% |
| other | Gemini 3.1 Flash-Lite | 76.2% |
| other | Gemini 2.5 Flash | 73.5% |
| other | Tencent HY3-Preview | 72.1% |
| other | DeepSeek V4 Flash | 70.8% |
| other | Qwen3.6 35B-A3B | 70.8% |
| other | Mercury 2 | 56.9% |
| philosophy | Gemini 3.5 Flash | 83.2% |
| philosophy | Gemini 3.1 Flash-Lite | 76.0% |
| philosophy | Gemini 2.5 Flash | 70.9% |
| philosophy | Qwen3.6 35B-A3B | 69.5% |
| philosophy | Tencent HY3-Preview | 69.1% |
| philosophy | DeepSeek V4 Flash | 68.3% |
| philosophy | Mercury 2 | 49.1% |
| physics | Gemini 3.5 Flash | 90.4% |
| physics | Qwen3.6 35B-A3B | 88.0% |
| physics | Gemini 3.1 Flash-Lite | 86.8% |
| physics | Gemini 2.5 Flash | 85.5% |
| physics | DeepSeek V4 Flash | 84.1% |
| physics | Mercury 2 | 71.9% |
| physics | Tencent HY3-Preview | 64.7% |
| psychology | Gemini 3.5 Flash | 87.8% |
| psychology | Gemini 3.1 Flash-Lite | 83.7% |
| psychology | Gemini 2.5 Flash | 82.6% |
| psychology | DeepSeek V4 Flash | 81.1% |
| psychology | Tencent HY3-Preview | 80.5% |
| psychology | Qwen3.6 35B-A3B | 78.7% |
| psychology | Mercury 2 | 65.7% |
OpenAI's multilingual MMLU, German split — general knowledge spanning STEM, the humanities, social sciences and other domains. Professionally translated to German.
| # | Model | Score | Latency | Tok in | Tok out | Cost | Date |
|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.5 Flash | 89.3% | 2.05s | 172 | 255.8 | $35.95 | 2026-05-26 |
| 2 | Gemini 3.1 Flash-Lite | 86.7% | 1.47s | 172 | 212 | $5.07 | 2026-05-26 |
| 3 | Qwen3.6 35B-A3B | 85.8% | 4.39s | 182.9 | 561.4 | $8.27 | 2026-06-01 |
| 4 | DeepSeek V4 Flash | 85.1% | 2.54s | 190.8 | 140.2 | $0.92 | 2026-05-28 |
| 5 | Gemini 2.5 Flash | 84.7% | 1.48s | 172 | 275.5 | $10.40 | 2026-05-26 |
| 6 | Tencent HY3-Preview | 83.7% | 5.55s | 223.4 | 280.9 | $1.14 | 2026-05-28 |
| 7 | Claude Haiku 4.5 | 83.1% | 4.20s | 2,267.2 | 397.6 | $59.74 | 2026-06-02 |
| 8 | Mercury 2 | 68.0% | 0.43s | 172.2 | 90.2 | $1.53 | 2026-05-30 |
Streaming throughput on a controlled workload: a roughly 2,000-token German prompt, a 400-token output cap, and one request at a time. TPS is decode speed, meaning output tokens per second after the first token. TTFT is time to first token. Throughput is measured separately from the accuracy benchmarks, whose reasoning-off answers are too short to time. Each model is pinned to one provider, so treat these figures as a snapshot, not a constant.
| # | Model | TPS | TTFT |
|---|---|---|---|
| 1 | gpt-oss-120b 🔒 | 1,641 | 0.31s |
| 2 | Mercury 2 | 541 | 0.39s |
| 3 | Gemini 3.1 Flash-Lite | 223 | 0.61s |
| 4 | Gemini 3.5 Flash | 181 | 0.75s |
| 5 | Qwen3.6 35B-A3B | 159 | 0.62s |
| 6 | Gemini 2.5 Flash | 159 | 0.41s |
| 7 | Claude Haiku 4.5 | 131 | 0.79s |
| 8 | DeepSeek V4 Flash | 115 | 0.92s |
| 9 | Tencent HY3-Preview | 107 | 2.69s |
| 10 | GLM-5.1 | 32 | 0.65s |
| 11 | Gemma 4 31B | 24 | 0.59s |
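The two throughput metrics can be derived from per-token arrival times on a streamed response. The sketch below is one reading of "output tokens per second after the first token", not the probe's actual code; `stream_metrics` and its arguments are hypothetical, and it assumes one timestamp per output token (real streams may deliver several tokens per chunk).

```python
def stream_metrics(request_sent, token_times):
    """Compute (TTFT in seconds, decode TPS) for one streamed response.

    request_sent: time the request was sent, in seconds.
    token_times:  arrival time of each output token, in seconds, in order.
    """
    if not token_times:
        raise ValueError("no output tokens received")
    ttft = token_times[0] - request_sent
    decode_time = token_times[-1] - token_times[0]
    if decode_time <= 0:
        return ttft, None  # a single token (or no elapsed time): speed is undefined
    # Tokens after the first, divided by the time spent producing them.
    tps = (len(token_times) - 1) / decode_time
    return ttft, tps

# Hypothetical timings: first token at 0.5 s, then one token every 10 ms.
times = [0.5 + 0.01 * i for i in range(401)]
ttft, tps = stream_metrics(0.0, times)
print(f"TTFT {ttft:.2f}s, {tps:.0f} tok/s")
```

Excluding the wait for the first token keeps TPS a measure of generation speed alone, which is why a model can combine a slow TTFT with a respectable TPS, as Tencent HY3-Preview does in the table.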
Average score across all benchmarks against decode speed (output tokens per second) from the throughput probe. Up and to the right is better — smarter and faster. The gold line is the speed frontier: the best score available at each speed.
* Scored on INCLUDE only. These models' averages cover a single benchmark, so they aren't directly comparable to the full-coverage models.
🔒 Reasoning can't be disabled. The model's decode speed includes forced reasoning tokens, so it isn't directly comparable to the reasoning-off models.
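The speed frontier can be recomputed from the Avg and TPS columns above, and the 🔒 caveat suggests checking how the result changes when the reasoning-locked model is left out. A minimal sketch, assuming the frontier is a running maximum of score taken from the fastest model downward; `speed_frontier` and the `include_locked` flag are hypothetical names. DeepSeek V4 Pro is omitted because it has no throughput entry.

```python
def speed_frontier(models, include_locked=True):
    """Names of models that beat every faster-or-equal model on average score.

    models: iterable of (name, tps, avg_score, reasoning_locked) tuples.
    """
    frontier = []
    best = float("-inf")
    # Fastest first; at equal speed, the higher score goes first.
    for name, tps, score, locked in sorted(models, key=lambda m: (-m[1], -m[2])):
        if locked and not include_locked:
            continue
        if score > best:
            frontier.append(name)
            best = score
    return frontier

# (name, TPS, Avg, reasoning_locked), copied from the tables above.
models = [
    ("gpt-oss-120b", 1641, 66.2, True),
    ("Mercury 2", 541, 61.9, False),
    ("Gemini 3.1 Flash-Lite", 223, 80.5, False),
    ("Gemini 3.5 Flash", 181, 82.8, False),
    ("Qwen3.6 35B-A3B", 159, 78.1, False),
    ("Gemini 2.5 Flash", 159, 78.3, False),
    ("Claude Haiku 4.5", 131, 75.7, False),
    ("DeepSeek V4 Flash", 115, 76.6, False),
    ("Tencent HY3-Preview", 107, 75.3, False),
    ("GLM-5.1", 32, 47.5, False),
    ("Gemma 4 31B", 24, 67.6, False),
]

print(speed_frontier(models))
print(speed_frontier(models, include_locked=False))
```

Under these assumptions, dropping the locked model only changes the fast end of the frontier: Mercury 2 takes the place of gpt-oss-120b, while the two Gemini models stay on it either way.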