Independent German-language benchmarks for the fastest models — accuracy, latency, and cost, side by side.
Which model is best on German tasks — one score, ranked. Filter by open vs. closed weights, price, or speed.
Showing 17 models that ran ≥3 of 7 benchmarks (9 excluded for thin coverage). Price = median effective $/1M tokens; Speed = throughput + latency.
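The coverage rule above can be sketched in a few lines. This is a minimal illustration, not the site's actual pipeline: the function name `rank_models`, the input layout, and the use of a plain unweighted mean as the single ranking score are all assumptions, and the model names and scores are made up.

```python
from statistics import mean

MIN_BENCHMARKS = 3  # threshold stated above: at least 3 of 7 benchmarks


def rank_models(scores, min_benchmarks=MIN_BENCHMARKS):
    """Split models into ranked and excluded lists.

    scores: {model_name: {benchmark_name: score_in_percent}}
    Assumption: the ranking score is the unweighted mean of the
    benchmarks a model actually ran.
    """
    ranked, excluded = [], []
    for model, per_benchmark in scores.items():
        if len(per_benchmark) >= min_benchmarks:
            ranked.append((model, mean(per_benchmark.values())))
        else:
            excluded.append(model)  # thin coverage, left out of the ranking
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked, excluded


# Hypothetical data, for illustration only.
example_scores = {
    "model_a": {"ner": 80.0, "mmlu": 90.0, "musr": 70.0},
    "model_b": {"ner": 90.0, "mmlu": 90.0, "musr": 90.0},
    "model_c": {"ner": 99.0},  # only one benchmark, so excluded
}
ranked, excluded = rank_models(example_scores)
print(ranked)
print(excluded)
```

A model with a single very high score, like `model_c` here, never reaches the ranking, which is the point of the coverage threshold.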
Native German named-entity recognition — identify persons, locations, organisations and misc entities in German text, emitted as JSON. Scored with seqeval micro-F1 excluding the noisy MISC class. Run reasoning-off.
| # | Model | Score | Latency | Tok in | Tok out | Cost | Date |
|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro 🔒 reasoning | 87.3% | 4.86s | 1,059.7 | 436.2 | $10.04 | 2026-06-09 |
| 2 | Gemini 3.5 Flash | 86.0% | 0.81s | 1,077 | 24.6 | $1.88 | 2026-06-09 |
| 3 | Gemini 3.1 Flash-Lite | 83.2% | 0.52s | 1,077 | 25.7 | $0.63 | 2026-06-08 |
| 4 | Gemma 4 31B | 82.6% | 10.35s | 1,153 | 27.8 | $0.15 | 2026-06-11 |
| 5 | DeepSeek V4 Pro | 82.2% | 1.74s | 1,181.2 | 26.3 | $2.02 | 2026-06-09 |
| 6 | Qwen3.7 Max | 82.1% | 1.64s | 1,154.3 | 25.2 | $1.57 | 2026-06-09 |
| 7 | MiMo V2.5 Pro | 81.9% | 1.09s | 1,533.5 | 26.4 | $1.49 | 2026-06-09 |
| 8 | Gemma 4 26B A4B | 81.8% | 1.14s | 1,153 | 30.3 | $0.19 | 2026-06-08 |
| 9 | Gemini 2.5 Flash | 81.8% | 0.52s | 1,077 | 24.5 | $0.39 | 2026-06-09 |
| 10 | DeepSeek V4 Flash | 80.2% | 0.77s | 1,181.2 | 26.8 | $0.18 | 2026-06-09 |
| 11 | Qwen3.6 35B-A3B | 79.9% | 0.67s | 1,154.3 | 27.3 | $0.21 | 2026-06-09 |
| 12 | Gemma 4 12B | 79.6% | 9.49s | 1,153 | 28.1 | — | 2026-06-11 |
| 13 | Tencent HY3-Preview | 77.3% | 2.50s | 1,313.7 | 35.4 | $0.09 | 2026-06-09 |
| 14 | Qwen3 14B | 73.0% | 9.61s | 1,292.5 | 28.5 | $0.17 | 2026-06-11 |
| 15 | Qwen3.5 9B | 72.6% | 5.70s | 1,154.3 | 26.9 | $0.12 | 2026-06-11 |
| 16 | GLM-5.1 | 52.2% | 2.20s | 1,127.6 | 26.5 | $1.71 | 2026-06-09 |
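To make the NER scoring rule concrete, here is a rough sketch of entity-level micro-F1 with the MISC class dropped before counting. It is not seqeval itself, which works on BIO tag sequences; the representation of an entity as a `(text, label)` pair and the function name `micro_f1` are assumptions for illustration, and the example entities are invented.

```python
def micro_f1(gold_docs, pred_docs, excluded_labels=frozenset({"MISC"})):
    """Micro-averaged F1 over entities, pooled across all documents.

    Each document is a list of (text, label) pairs (an assumed format).
    Entities whose label is in excluded_labels are ignored on both sides.
    Duplicate entities within one document are collapsed by the set.
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        gold_set = {e for e in gold if e[1] not in excluded_labels}
        pred_set = {e for e in pred if e[1] not in excluded_labels}
        tp += len(gold_set & pred_set)   # exact text and label match
        fp += len(pred_set - gold_set)   # predicted but not in gold
        fn += len(gold_set - pred_set)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one wrong label, MISC entities ignored.
gold = [
    [("Angela Merkel", "PER"), ("Berlin", "LOC"), ("Oktoberfest", "MISC")],
    [("Siemens", "ORG")],
]
pred = [
    [("Angela Merkel", "PER"), ("Berlin", "ORG")],
    [("Siemens", "ORG"), ("Bundesliga", "MISC")],
]
print(round(micro_f1(gold, pred), 4))  # 0.6667 (tp=2, fp=1, fn=1)
```

Because counts are pooled before dividing, frequent entity types weigh more than rare ones, which is what distinguishes micro-F1 from a per-class macro average.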
Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.
| # | Model | Score | Latency | Tok in | Tok out | Cost | Date |
|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro 🔒 reasoning | 77.7% | 4.24s | 107.1 | 416.6 | $0.72 | 2026-06-10 |
| 2 | GPT-5.5 🔒 reasoning | 74.8% | 3.99s | 128.2 | 153.6 | $0.70 | 2026-06-12 |
| 3 | Gemini 3.5 Flash | 74.1% | 1.72s | 107.1 | 205.4 | $0.28 | 2026-06-10 |
| 4 | MiniMax M2.7 🔒 reasoning | 72.7% | 12.26s | 151.1 | 1,443.8 | $0.29 | — |
| 5 | Gemini 3.1 Flash-Lite | 71.9% | 0.89s | 107.1 | 151 | $0.04 | 2026-06-10 |
| 6 | Opus 4.8 | 71.9% | 5.77s | 256.2 | 343.7 | $1.32 | 2026-06-12 |
| 7 | Qwen3.7 Max | 71.2% | 4.17s | 118.3 | 239.6 | $0.28 | 2026-06-08 |
| 8 | Gemini 2.5 Flash | 70.5% | 1.61s | 105.8 | 193.6 | $0.07 | 2026-06-02 |
| 9 | DeepSeek V4 Pro | 70.5% | 3.45s | 118.6 | 121.1 | $0.08 | 2026-06-08 |
| 10 | DeepSeek V4 Flash | 70.5% | 2.53s | 118.6 | 111.3 | $0.0066 | 2026-06-08 |
| 11 | Kimi K2.6 | 69.8% | 13.65s | 138.7 | 527.1 | $0.58 | 2026-06-08 |
| 12 | Tencent HY3-Preview | 69.1% | 5.48s | 143.9 | 222.4 | $0.0086 | 2026-06-02 |
| 13 | Gemma 4 12B | 69.1% | 9.49s | 123.1 | 221.9 | — | 2026-06-11 |
| 14 | Qwen3.6 35B-A3B | 68.3% | 2.72s | 118.4 | 545.1 | $0.08 | 2026-06-02 |
| 15 | Claude Haiku 4.5 | 68.3% | 3.45s | 154.6 | 347.4 | $0.26 | 2026-06-02 |
| 16 | Gemma 4 31B | 68.3% | 10.35s | 119.1 | 205.4 | $0.01 | 2026-06-11 |
| 17 | MiMo V2.5 Pro | 67.6% | 4.63s | 372.2 | 224.3 | $0.10 | 2026-06-08 |
| 18 | GLM-5.1 | 67.6% | 3.36s | 114.6 | 259.7 | $0.35 | 2026-06-08 |
| 19 | gpt-oss-120b 🔒 reasoning | 66.2% | 1.71s | 154.1 | 111.7 | $0.0078 | 2026-05-29 |
| 20 | Gemma 4 26B A4B | 64.7% | 3.71s | 118.7 | 275.8 | $0.03 | 2026-06-03 |
| 21 | Qwen3.5 9B | 64.7% | 5.70s | 122.4 | 378 | $0.0096 | 2026-06-11 |
| 22 | grok-4.3 | 63.3% | 0.60s | 233.8 | 20.5 | $0.04 | 2026-06-03 |
| 23 | Qwen3 14B | 63.3% | 9.61s | 135.4 | 46 | $0.0038 | 2026-06-11 |
| 24 | Ministral 14B | 58.3% | 0.41s | 109.1 | 79.8 | $0.01 | 2026-06-03 |
| 25 | gemma-3-12b-it | 53.2% | 5.00s | 114.8 | 160.5 | $0.0073 | 2026-06-03 |
Hard academic questions across 14 subjects — STEM, law, health, economics, philosophy and more. Professionally translated to German, with up to ten answer options per question.
| # | Model | Score | Latency | Tok in | Tok out | Cost | Date |
|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.5 Flash | 86.5% | 2.23s | 1,664.9 | 353.3 | $66.76 | 2026-05-26 |
| 2 | Gemini 3.1 Flash-Lite | 82.2% | 1.24s | 1,665.9 | 304 | $10.26 | 2026-06-03 |
| 3 | Gemma 4 31B | 82.1% | 10.35s | 1,702.9 | 502.5 | $4.47 | 2026-06-11 |
| 4 | Qwen3.6 35B-A3B | 80.0% | 4.67s | 1,689.6 | 984 | $13.36 | 2026-05-31 |
| 5 | Gemini 2.5 Flash | 79.6% | 2.69s | 1,664.9 | 786 | $28.89 | 2026-05-26 |
| 6 | Gemma 4 26B A4B | 78.2% | 7.43s | 1,702.3 | 629.5 | $3.49 | 2026-06-05 |
| 7 | Claude Haiku 4.5 | 75.3% | 3.76s | 2,262 | 433.7 | $52.10 | 2026-06-03 |
| 8 | Qwen3.5 9B | 73.4% | 5.70s | 1,693.6 | 931.1 | $3.63 | 2026-06-11 |
| 9 | Tencent HY3-Preview | 73.2% | 6.14s | 1,910.2 | 586.3 | $2.76 | 2026-05-27 |
| 10 | Gemma 4 12B | 73.1% | 9.49s | 1,705.9 | 493.3 | — | 2026-06-11 |
| 11 | Gemini 2.5 Flash-Lite | 71.2% | 2.48s | 1,665.1 | 1,493.5 | $8.95 | 2026-06-03 |
| 12 | Qwen3 14B | 64.1% | 9.61s | 1,887.3 | 413.7 | $3.83 | 2026-06-11 |
| 13 | DeepSeek V4 Flash | 36.8% | 2.01s | 1,716.4 | 172.4 | $1.91 | 2026-06-03 |
| Subject | Model | Score |
|---|---|---|
| biology | Gemini 3.5 Flash | 92.9% |
| biology | Gemma 4 31B | 91.2% |
| biology | Qwen3.6 35B-A3B | 89.7% |
| biology | Gemini 3.1 Flash-Lite | 89.4% |
| biology | Gemini 2.5 Flash | 88.8% |
| biology | DeepSeek V4 Flash | 88.3% |
| biology | Gemma 4 26B A4B | 87.3% |
| biology | Gemini 2.5 Flash-Lite | 86.8% |
| biology | Tencent HY3-Preview | 85.8% |
| business | Gemini 3.5 Flash | 89.6% |
| business | Gemma 4 31B | 87.5% |
| business | Gemini 3.1 Flash-Lite | 85.3% |
| business | Gemma 4 26B A4B | 83.9% |
| business | Gemini 2.5 Flash | 83.5% |
| business | Qwen3.6 35B-A3B | 83.5% |
| business | DeepSeek V4 Flash | 81.0% |
| business | Tencent HY3-Preview | 78.3% |
| business | Gemini 2.5 Flash-Lite | 77.7% |
| chemistry | Gemini 3.5 Flash | 89.1% |
| chemistry | Qwen3.6 35B-A3B | 88.2% |
| chemistry | Gemini 3.1 Flash-Lite | 87.1% |
| chemistry | Gemini 2.5 Flash | 86.9% |
| chemistry | Gemma 4 31B | 86.7% |
| chemistry | Gemma 4 26B A4B | 85.4% |
| chemistry | Gemini 2.5 Flash-Lite | 78.4% |
| chemistry | DeepSeek V4 Flash | 75.7% |
| chemistry | Tencent HY3-Preview | 75.6% |
| computer science | Gemini 3.1 Flash-Lite | 88.8% |
| computer science | Gemini 3.5 Flash | 87.6% |
| computer science | Gemma 4 31B | 85.9% |
| computer science | Gemini 2.5 Flash | 85.1% |
| computer science | Qwen3.6 35B-A3B | 85.1% |
| computer science | Gemma 4 26B A4B | 83.9% |
| computer science | DeepSeek V4 Flash | 81.7% |
| computer science | Gemini 2.5 Flash-Lite | 77.8% |
| computer science | Tencent HY3-Preview | 76.8% |
| economics | Gemini 3.5 Flash | 89.1% |
| economics | Gemini 3.1 Flash-Lite | 87.3% |
| economics | Gemma 4 31B | 86.8% |
| economics | Gemini 2.5 Flash | 86.4% |
| economics | Qwen3.6 35B-A3B | 85.3% |
| economics | Gemma 4 26B A4B | 82.6% |
| economics | Tencent HY3-Preview | 82.3% |
| economics | Gemini 2.5 Flash-Lite | 79.1% |
| economics | DeepSeek V4 Flash | 69.5% |
| engineering | Gemini 3.5 Flash | 82.5% |
| engineering | Gemini 3.1 Flash-Lite | 77.8% |
| engineering | Qwen3.6 35B-A3B | 77.5% |
| engineering | Gemma 4 31B | 77.1% |
| engineering | Gemma 4 26B A4B | 73.5% |
| engineering | Gemini 2.5 Flash | 71.6% |
| engineering | Tencent HY3-Preview | 64.4% |
| engineering | DeepSeek V4 Flash | 60.0% |
| engineering | Gemini 2.5 Flash-Lite | 58.4% |
| health | Gemini 3.5 Flash | 78.6% |
| health | Gemini 3.1 Flash-Lite | 74.8% |
| health | Gemma 4 31B | 74.1% |
| health | Tencent HY3-Preview | 73.7% |
| health | Qwen3.6 35B-A3B | 72.8% |
| health | Gemini 2.5 Flash | 72.5% |
| health | Gemma 4 26B A4B | 71.2% |
| health | Gemini 2.5 Flash-Lite | 66.5% |
| health | DeepSeek V4 Flash | 13.8% |
| history | Gemini 3.5 Flash | 80.6% |
| history | Gemini 3.1 Flash-Lite | 74.5% |
| history | Gemma 4 31B | 74.5% |
| history | Tencent HY3-Preview | 71.4% |
| history | Gemini 2.5 Flash | 70.9% |
| history | Qwen3.6 35B-A3B | 69.0% |
| history | Gemma 4 26B A4B | 66.7% |
| history | Gemini 2.5 Flash-Lite | 59.3% |
| history | DeepSeek V4 Flash | 10.8% |
| law | Gemini 3.5 Flash | 72.8% |
| law | Gemini 3.1 Flash-Lite | 62.8% |
| law | Gemma 4 31B | 59.0% |
| law | Gemini 2.5 Flash | 54.0% |
| law | Qwen3.6 35B-A3B | 52.8% |
| law | Gemma 4 26B A4B | 50.7% |
| law | Tencent HY3-Preview | 45.8% |
| law | Gemini 2.5 Flash-Lite | 41.6% |
| law | DeepSeek V4 Flash | 10.4% |
| math | Gemini 3.5 Flash | 94.6% |
| math | Gemma 4 31B | 93.5% |
| math | Qwen3.6 35B-A3B | 92.7% |
| math | Gemma 4 26B A4B | 92.0% |
| math | Gemini 2.5 Flash | 90.5% |
| math | Gemini 3.1 Flash-Lite | 90.2% |
| math | Tencent HY3-Preview | 86.5% |
| math | Gemini 2.5 Flash-Lite | 82.8% |
| math | DeepSeek V4 Flash | 9.5% |
| nothink biology | Gemma 4 12B | 87.0% |
| nothink biology | Qwen3.5 9B | 85.6% |
| nothink biology | Qwen3 14B | 81.3% |
| nothink business | Gemma 4 12B | 79.8% |
| nothink business | Qwen3.5 9B | 78.1% |
| nothink business | Qwen3 14B | 70.3% |
| nothink chemistry | Qwen3.5 9B | 85.0% |
| nothink chemistry | Gemma 4 12B | 79.9% |
| nothink chemistry | Qwen3 14B | 72.2% |
| nothink computer science | Gemma 4 12B | 79.3% |
| nothink computer science | Qwen3.5 9B | 78.0% |
| nothink computer science | Qwen3 14B | 71.5% |
| nothink economics | Qwen3.5 9B | 78.9% |
| nothink economics | Gemma 4 12B | 78.7% |
| nothink economics | Qwen3 14B | 71.1% |
| nothink engineering | Qwen3.5 9B | 66.6% |
| nothink engineering | Gemma 4 12B | 65.1% |
| nothink engineering | Qwen3 14B | 55.9% |
| nothink health | Qwen3.5 9B | 65.9% |
| nothink health | Gemma 4 12B | 64.5% |
| nothink health | Qwen3 14B | 56.8% |
| nothink history | Qwen3.5 9B | 60.4% |
| nothink history | Gemma 4 12B | 57.2% |
| nothink history | Qwen3 14B | 48.8% |
| nothink law | Gemma 4 12B | 42.8% |
| nothink law | Qwen3.5 9B | 36.7% |
| nothink law | Qwen3 14B | 27.4% |
| nothink math | Qwen3.5 9B | 90.2% |
| nothink math | Gemma 4 12B | 90.1% |
| nothink math | Qwen3 14B | 82.2% |
| nothink other | Qwen3.5 9B | 61.6% |
| nothink other | Gemma 4 12B | 60.7% |
| nothink other | Qwen3 14B | 52.1% |
| nothink philosophy | Gemma 4 12B | 60.3% |
| nothink philosophy | Qwen3.5 9B | 58.1% |
| nothink philosophy | Qwen3 14B | 49.3% |
| nothink physics | Qwen3.5 9B | 84.7% |
| nothink physics | Gemma 4 12B | 81.9% |
| nothink physics | Qwen3 14B | 73.1% |
| nothink psychology | Gemma 4 12B | 76.3% |
| nothink psychology | Qwen3.5 9B | 74.2% |
| nothink psychology | Qwen3 14B | 66.0% |
| other | Gemini 3.5 Flash | 82.4% |
| other | Gemini 3.1 Flash-Lite | 76.7% |
| other | Gemma 4 31B | 73.7% |
| other | Gemini 2.5 Flash | 73.5% |
| other | Tencent HY3-Preview | 72.1% |
| other | Qwen3.6 35B-A3B | 70.8% |
| other | Gemma 4 26B A4B | 67.5% |
| other | Gemini 2.5 Flash-Lite | 64.9% |
| other | DeepSeek V4 Flash | 9.2% |
| philosophy | Gemini 3.5 Flash | 83.2% |
| philosophy | Gemini 3.1 Flash-Lite | 75.2% |
| philosophy | Gemma 4 31B | 74.3% |
| philosophy | Gemini 2.5 Flash | 70.9% |
| philosophy | Qwen3.6 35B-A3B | 69.5% |
| philosophy | Tencent HY3-Preview | 69.1% |
| philosophy | Gemma 4 26B A4B | 69.1% |
| philosophy | Gemini 2.5 Flash-Lite | 60.9% |
| philosophy | DeepSeek V4 Flash | 6.8% |
| physics | Gemini 3.5 Flash | 90.4% |
| physics | Gemma 4 31B | 89.0% |
| physics | Qwen3.6 35B-A3B | 88.0% |
| physics | Gemini 3.1 Flash-Lite | 87.6% |
| physics | Gemma 4 26B A4B | 86.1% |
| physics | Gemini 2.5 Flash | 85.5% |
| physics | Gemini 2.5 Flash-Lite | 76.6% |
| physics | Tencent HY3-Preview | 64.7% |
| physics | DeepSeek V4 Flash | 10.2% |
| psychology | Gemini 3.5 Flash | 87.8% |
| psychology | Gemma 4 31B | 83.6% |
| psychology | Gemini 3.1 Flash-Lite | 83.3% |
| psychology | Gemini 2.5 Flash | 82.6% |
| psychology | Tencent HY3-Preview | 80.5% |
| psychology | Qwen3.6 35B-A3B | 78.7% |
| psychology | Gemma 4 26B A4B | 78.6% |
| psychology | Gemini 2.5 Flash-Lite | 75.1% |
| psychology | DeepSeek V4 Flash | 10.4% |
OpenAI's multilingual MMLU, German split — general knowledge spanning STEM, the humanities, social sciences and other domains. Professionally translated to German.
| # | Model | Score | Latency | Tok in | Tok out | Cost | Date |
|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.5 Flash | 89.3% | 2.05s | 172 | 255.8 | $35.95 | 2026-05-26 |
| 2 | Gemini 3.1 Flash-Lite | 86.8% | 1.06s | 173 | 210.8 | $5.05 | 2026-06-03 |
| 3 | Gemma 4 31B | 86.6% | 10.35s | 185 | 251.5 | $1.55 | 2026-06-11 |
| 4 | Qwen3.6 35B-A3B | 85.8% | 4.39s | 182.9 | 561.4 | $8.27 | 2026-06-01 |
| 5 | DeepSeek V4 Flash | 84.9% | 2.32s | 185.9 | 144.7 | $1.03 | 2026-06-03 |
| 6 | Gemini 2.5 Flash | 84.7% | 1.48s | 172 | 275.5 | $10.40 | 2026-05-26 |
| 7 | Gemma 4 26B A4B | 83.7% | 3.97s | 185 | 340.7 | $2.24 | 2026-06-05 |
| 8 | Tencent HY3-Preview | 83.7% | 5.55s | 223.4 | 280.9 | $1.14 | 2026-05-28 |
| 9 | Claude Haiku 4.5 | 83.1% | 4.20s | 277.7 | 397.6 | $31.90 | 2026-06-02 |
| 10 | Gemini 2.5 Flash-Lite | 79.6% | 1.68s | 162.7 | 504.6 | $4.99 | 2026-06-03 |
| 11 | Gemma 4 12B | 79.0% | 9.49s | 189 | 261.2 | — | 2026-06-11 |
| 12 | Qwen3.5 9B | 78.8% | 5.70s | 186.9 | 408.3 | $1.12 | 2026-06-11 |
| 13 | Qwen3 14B | 73.4% | 9.61s | 212 | 87.9 | $0.65 | 2026-06-11 |
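German MuSR: multistep soft reasoning across three task types (murder mystery, object placements and team allocation). Frozen v1 question set, run in the cot+ configuration; scored as accuracy.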
| # | Model | Score | Latency | Tok in | Tok out | Cost | Date |
|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro 🔒 reasoning | 88.1% | 12.14s | 1,464.9 | 1,306.5 | $10.49 | 2026-06-09 |
| 2 | Gemini 3.1 Flash-Lite | 84.4% | 2.95s | 1,467.3 | 568.5 | $0.73 | 2026-06-09 |
| 3 | Gemini 3.5 Flash | 84.2% | 5.79s | 1,464.9 | 850.2 | $5.55 | 2026-06-09 |
| 4 | Gemma 4 26B A4B | 83.7% | 9.78s | 1,478.6 | 814.1 | $0.29 | 2026-06-10 |
| 5 | DeepSeek V4 Flash | 83.5% | 10.18s | 1,659.8 | 798.9 | $0.26 | 2026-06-10 |
| 6 | Gemma 4 31B | 83.5% | 10.35s | 1,477.9 | 650.1 | $0.23 | 2026-06-11 |
| 7 | Gemini 2.5 Flash | 83.3% | 6.28s | 1,464.9 | 1,077.4 | $1.77 | 2026-06-09 |
| 8 | Gemma 4 12B | 81.6% | 9.49s | 1,477.9 | 707.4 | — | 2026-06-11 |
| 9 | GLM-5.1 | 81.4% | 26.91s | 1,583.3 | 882.5 | $3.44 | 2026-06-10 |
| 10 | MiMo V2.5 Pro | 80.9% | 11.43s | 1,964.3 | 754.9 | $0.81 | 2026-06-10 |
| 11 | Qwen3.5 9B | 80.9% | 5.70s | 1,469.2 | 1,629.7 | $0.22 | 2026-06-11 |
| 12 | Qwen3 14B | 69.7% | 9.61s | 1,728.9 | 1,270.4 | $0.28 | 2026-06-11 |
| Subject | Model | Score |
|---|---|---|
| murder mystery | Gemini 3.1 Pro | 89.2% |
| murder mystery | Gemini 2.5 Flash | 87.6% |
| murder mystery | Gemma 4 31B | 86.4% |
| murder mystery | DeepSeek V4 Flash | 85.6% |
| murder mystery | Gemma 4 12B | 85.6% |
| murder mystery | Gemma 4 26B A4B | 85.2% |
| murder mystery | Gemini 3.1 Flash-Lite | 84.8% |
| murder mystery | Gemini 3.5 Flash | 84.0% |
| murder mystery | GLM-5.1 | 80.0% |
| murder mystery | MiMo V2.5 Pro | 78.8% |
| murder mystery | Qwen3.5 9B | 77.6% |
| murder mystery | Qwen3 14B | 76.8% |
| object placements | Gemini 3.5 Flash | 81.3% |
| object placements | GLM-5.1 | 81.3% |
| object placements | Gemini 3.1 Pro | 79.7% |
| object placements | Gemma 4 12B | 79.7% |
| object placements | MiMo V2.5 Pro | 76.6% |
| object placements | Gemma 4 31B | 76.6% |
| object placements | Gemini 3.1 Flash-Lite | 75.0% |
| object placements | Gemini 2.5 Flash | 75.0% |
| object placements | DeepSeek V4 Flash | 75.0% |
| object placements | Gemma 4 26B A4B | 73.4% |
| object placements | Qwen3.5 9B | 73.4% |
| object placements | Qwen3 14B | 71.9% |
| team allocation | Gemini 3.1 Pro | 89.2% |
| team allocation | Gemini 3.1 Flash-Lite | 86.4% |
| team allocation | Qwen3.5 9B | 86.0% |
| team allocation | Gemini 3.5 Flash | 85.2% |
| team allocation | DeepSeek V4 Flash | 84.8% |
| team allocation | Gemma 4 26B A4B | 84.8% |
| team allocation | MiMo V2.5 Pro | 84.0% |
| team allocation | GLM-5.1 | 82.8% |
| team allocation | Gemma 4 31B | 82.4% |
| team allocation | Gemini 2.5 Flash | 81.2% |
| team allocation | Gemma 4 12B | 78.0% |
| team allocation | Qwen3 14B | 62.0% |
Native German social-media sentiment classification — positive, neutral or negative. Human-annotated German text, not translated. Run reasoning-off; scored as exact-match accuracy on the predicted label.
| # | Model | Score | Latency | Tok in | Tok out | Cost | Date |
|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro 🔒 reasoning | 70.1% | 3.30s | 625.7 | 205.5 | $3.81 | 2026-06-09 |
| 2 | Gemini 3.5 Flash | 64.3% | 0.67s | 650.6 | 1.7 | $1.02 | 2026-06-09 |
| 3 | Qwen3.7 Max | 63.2% | 0.99s | 771.9 | 1.7 | $0.99 | 2026-06-09 |
| 4 | DeepSeek V4 Flash | 62.4% | 0.54s | 778.6 | 2.7 | $0.11 | 2026-06-09 |
| 5 | Gemma 4 12B | 62.2% | 9.49s | 758.7 | 2.7 | — | 2026-06-11 |
| 6 | MiMo V2.5 Pro | 61.7% | 0.45s | 1,140.3 | 2.8 | $0.99 | 2026-06-09 |
| 7 | Gemini 2.5 Flash | 61.6% | 0.40s | 650.7 | 1.8 | $0.20 | 2026-06-09 |
| 8 | Gemma 4 31B | 61.0% | 10.35s | 758.7 | 2.8 | $0.09 | 2026-06-11 |
| 9 | Gemini 3.1 Flash-Lite | 60.4% | 0.46s | 650.7 | 1.8 | $0.34 | 2026-06-08 |
| 10 | DeepSeek V4 Pro | 60.0% | 1.42s | 778.6 | 2.1 | $1.28 | 2026-06-09 |
| 11 | Qwen3.6 35B-A3B | 58.4% | 0.45s | 771.9 | 2.7 | $0.12 | 2026-06-09 |
| 12 | Qwen3 14B | 56.1% | 9.61s | 899.3 | 2.7 | $0.11 | 2026-06-11 |
| 13 | Qwen3.5 9B | 55.7% | 5.70s | 771.9 | 2.6 | $0.08 | 2026-06-11 |
| 14 | Tencent HY3-Preview | 38.9% | 2.35s | 886.7 | 2.7 | $0.06 | 2026-06-09 |
| 15 | GLM-5.1 | 23.1% | 1.88s | 765.6 | 2.6 | $1.09 | 2026-06-09 |
| 16 | Gemma 4 26B A4B | 15.2% | 0.70s | 758.8 | 2.8 | $0.12 | 2026-06-08 |
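Exact-match accuracy for the sentiment task can be sketched as below. The normalisation step (trimming whitespace, lower-casing, dropping a trailing full stop) is an assumption; the page does not say how raw model output is cleaned before comparison, and the sample outputs are invented.

```python
def normalize(raw):
    """Assumed clean-up of a raw model answer before comparison."""
    return raw.strip().lower().rstrip(".")


def exact_match_accuracy(predictions, gold):
    """Fraction of predictions whose normalised text equals the gold label."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predictions, gold) if normalize(p) == g)
    return correct / len(gold)


# Hypothetical outputs: two correct, one wrong label, one off-vocabulary answer.
preds = ["positive", " Neutral", "negative.", "mixed"]
labels = ["positive", "neutral", "positive", "negative"]
print(exact_match_accuracy(preds, labels))  # 0.5
```

Under this scheme any answer outside the three labels simply counts as wrong, so a model that ignores the output format is penalised as heavily as one that misclassifies.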
Native German linguistic acceptability — does the sentence read as grammatical German (ja / nein)? Built from clean vs. minimally-corrupted German sentences. Run reasoning-off.
| # | Model | Score | Latency | Tok in | Tok out | Cost | Date |
|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro 🔒 reasoning | 80.4% | 3.71s | 777.9 | 262 | $10.37 | 2026-06-09 |
| 2 | Gemini 3.5 Flash | 79.4% | 0.79s | 802.7 | 1.5 | $2.49 | 2026-06-09 |
| 3 | Gemini 2.5 Flash | 77.1% | 0.40s | 802.7 | 1.5 | $0.50 | 2026-06-09 |
| 4 | DeepSeek V4 Flash | 74.9% | 0.53s | 949.8 | 2.5 | $0.27 | 2026-06-09 |
| 5 | MiMo V2.5 Pro | 74.8% | 0.57s | 1,301 | 2.5 | $2.33 | 2026-06-09 |
| 6 | Gemini 3.1 Flash-Lite | 74.2% | 0.46s | 802.7 | 1.5 | $0.83 | 2026-06-08 |
| 7 | DeepSeek V4 Pro | 73.5% | 1.41s | 949.8 | 1.6 | $3.12 | 2026-06-09 |
| 8 | Gemma 4 31B | 71.0% | 10.35s | 910.7 | 2.5 | $0.23 | 2026-06-11 |
| 9 | GLM-5.1 | 69.9% | 1.69s | 901.6 | 2.5 | $2.57 | 2026-06-09 |
| 10 | Qwen3.7 Max | 69.8% | 1.02s | 928.9 | 1.6 | $2.39 | 2026-06-09 |
| 11 | Gemma 4 26B A4B | 66.3% | 0.46s | 910.7 | 2.6 | $0.28 | 2026-06-08 |
| 12 | Qwen3.6 35B-A3B | 64.2% | 0.51s | 928.9 | 2.6 | $0.29 | 2026-06-09 |
| 13 | Gemma 4 12B | 64.1% | 9.49s | 910.7 | 2.6 | — | 2026-06-11 |
| 14 | Qwen3 14B | 64.0% | 9.61s | 1,060 | 2.6 | $0.26 | 2026-06-11 |
| 15 | Tencent HY3-Preview | 62.5% | 2.48s | 1,027.6 | 2.6 | $0.13 | 2026-06-09 |
| 16 | Qwen3.5 9B | 62.1% | 5.70s | 928.9 | 2.5 | $0.19 | 2026-06-11 |
TPS is decode speed — output tokens per second after the first token; higher is faster. TTFT is time to first token; lower is snappier. A snapshot, not a constant.
| # | Model | TPS | TTFT |
|---|---|---|---|
| 1 | gpt-oss-120b 🔒 | 362 | 0.51s |
| 2 | Gemini 3.1 Flash-Lite | 302 | 4.99s |
| 3 | Gemini 2.5 Flash-Lite | 293 | 0.36s |
| 4 | Gemini 2.5 Flash | 216 | 0.61s |
| 5 | Gemini 3.5 Flash | 203 | 0.88s |
| 6 | grok-4.3 | 171 | 0.60s |
| 7 | Qwen3.7 Max | 170 | 1.61s |
| 8 | Qwen3.6 35B-A3B | 142 | 1.18s |
| 9 | Claude Haiku 4.5 | 141 | 0.64s |
| 10 | Gemini 3.1 Pro 🔒 | 127 | — |
| 11 | DeepSeek V4 Flash | 103 | 1.00s |
| 12 | Tencent HY3-Preview | 95 | 2.47s |
| 13 | Ministral 14B | 87 | 0.41s |
| 14 | Gemma 4 26B A4B | 84 | 0.95s |
| 15 | GLM-5.1 | 71 | 0.89s |
| 16 | Qwen3 14B | 65 | 1.02s |
| 17 | DeepSeek V4 Pro | 58 | 1.02s |
| 18 | Qwen3.5 9B | 58 | 1.21s |
| 19 | MiMo V2.5 Pro | 49 | 2.03s |
| 20 | Gemma 4 12B | 46 | 0.83s |
| 21 | Gemma 4 31B | 38 | 1.01s |
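The two speed metrics can be derived from token arrival timestamps of a streamed response, roughly as follows. The function name `decode_metrics` and the input format are assumptions; the page does not describe its measurement harness.

```python
def decode_metrics(request_start, token_times):
    """Return (ttft_seconds, tokens_per_second) for one streamed response.

    request_start: time the request was sent (seconds).
    token_times:   arrival time of each output token, in order (seconds).
    TPS counts only tokens after the first one, over the time elapsed
    since the first token, matching the definition above.
    """
    if not token_times:
        raise ValueError("need at least one output token")
    ttft = token_times[0] - request_start
    decode_window = token_times[-1] - token_times[0]
    tokens_after_first = len(token_times) - 1
    if decode_window > 0:
        tps = tokens_after_first / decode_window
    else:
        tps = float("nan")  # a single token gives no decode window
    return ttft, tps


# Hypothetical stream: first token after 0.5 s, then one token every 0.25 s.
ttft, tps = decode_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(ttft, tps)  # 0.5 4.0
```

Keeping the first token out of the throughput figure separates queueing and prefill delay (TTFT) from steady-state generation speed (TPS), so a model can rank high on one and low on the other.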
Average score across all benchmarks against decode speed (output tokens per second). Up and to the right is better — smarter and faster. The gold line is the speed frontier: the best score available at each speed.
* scored on INCLUDE only — its average covers fewer benchmarks, so it isn't directly comparable to the full-coverage models.
🔒 reasoning can't be disabled — its decode speed includes forced reasoning tokens, so it isn't directly comparable to the reasoning-off models.
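One way to read the speed frontier is as a Pareto frontier: a model is on the line if no other model is both at least as fast and higher-scoring. The sketch below implements that reading; it is an interpretation of the chart description rather than the site's plotting code, and the model names and numbers are invented.

```python
def speed_frontier(models):
    """Return the models on the score-vs-speed frontier, slowest first.

    models: list of (name, tokens_per_second, average_score).
    Walk from fastest to slowest and keep a model only if it beats the
    best score seen among all faster models.
    """
    frontier = []
    best_score = float("-inf")
    for name, tps, score in sorted(models, key=lambda m: (-m[1], -m[2])):
        if score > best_score:
            frontier.append((name, tps, score))
            best_score = score
    return frontier[::-1]


# Hypothetical points, for illustration only.
points = [
    ("a", 300, 70.0),
    ("b", 200, 80.0),
    ("c", 150, 75.0),  # slower and lower-scoring than b, so dominated
    ("d", 50, 85.0),
    ("e", 40, 60.0),   # dominated by every other model
]
print([name for name, _, _ in speed_frontier(points)])  # ['d', 'b', 'a']
```

Models below the line, such as `c` and `e` here, are ones for which a faster alternative with a higher average score exists.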