German Artificial Analytics
a PeerBench project

Which fast LLM speaks German best?

Independent German-language benchmarks for the fastest models — accuracy, latency, and cost, side by side.

German LLM Overview

Which model is best on German tasks — one score, ranked. Filter by open vs. closed weights, price, or speed.

# | Model | Score | Price | Speed
1 | Gemini 3.1 Pro (Google · Closed) | 80.3 (78–89) | $6.75/Mtok ($$$) | 127 tok/s
2 | Gemini 3.5 Flash (Google · Closed) | 68.4 (67–75) | $2.81/Mtok ($$) | 202.5 tok/s
3 | Qwen3.7 Max (Alibaba · Closed) | 57.7 (51–67) | $1.28/Mtok ($$) | 170.3 tok/s
4 | Gemini 3.1 Flash-Lite (Google · Closed) | 56.4 (52–65) | $0.44/Mtok ($) | 302.3 tok/s
5 | DeepSeek V4 Pro (DeepSeek · Open weights) | 54.1 (50–59) | $1.62/Mtok ($$) | 57.8 tok/s
6 | Gemma 4 31B (Google · Open weights) | 52.4 (49–57) | $0.17/Mtok ($) | 38.4 tok/s
7 | DeepSeek V4 Flash (DeepSeek · Open weights) | 52.1 (48–58) | $0.14/Mtok ($) | 102.5 tok/s
8 | MiMo V2.5 Pro (Xiaomi · Open weights) | 51.7 (46–61) | $0.85/Mtok ($$) | 49.2 tok/s
9 | Gemini 2.5 Flash (Google · Closed) | 51.0 (47–57) | $1.00/Mtok ($$) | 216.2 tok/s
10 | Qwen3.6 35B-A3B (Alibaba · Open weights) | 48.7 (46–53) | $0.30/Mtok ($) | 141.6 tok/s
11 | Claude Haiku 4.5 (Anthropic · Closed) | 46.4 (44–49) | $3.36/Mtok ($$$) | 140.6 tok/s
12 | Gemma 4 26B A4B (Google · Open weights) | 45.0 (40–51) | $0.23/Mtok ($) | 84.4 tok/s
13 | Gemma 4 12B (Google · Open weights) | 44.8 (40–49) | | 46.3 tok/s
14 | GLM-5.1 (Z.ai · Open weights) | 44.7 (36–52) | $1.45/Mtok ($$) | 71.4 tok/s
15 | Tencent HY3-Preview (Tencent · Open weights) | 44.7 (38–48) | $0.08/Mtok ($) | 94.8 tok/s
16 | Qwen3.5 9B (Alibaba · Open weights) | 39.1 (34–42) | $0.12/Mtok ($) | 57.7 tok/s
17 | Qwen3 14B (Alibaba · Open weights) | 35.4 (29–39) | $0.14/Mtok ($) | 64.7 tok/s

Showing 17 models that ran ≥3 of 7 benchmarks (9 excluded for thin coverage). Price = median effective $/1M tokens; Speed = throughput + latency.
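The page does not spell out how "effective $/1M tokens" is derived. One plausible reading, sketched here, divides a run's total cost by its total token count. This assumes that the Tok in and Tok out columns in the per-benchmark tables are per-question averages and that Cost is the total for the run; both are assumptions, and the function name `usd_per_mtok` is made up for illustration. The rows are the Gemini 3.5 Flash entries copied from those tables.

```python
from statistics import median

# (benchmark, questions, avg tokens in, avg tokens out, total cost in USD)
# Copied from the Gemini 3.5 Flash rows of the per-benchmark tables.
runs = [
    ("GermEval", 1024, 1077.0, 24.6, 1.88),
    ("INCLUDE", 139, 107.1, 205.4, 0.28),
    ("MMLU-Pro", 11759, 1664.9, 353.3, 66.76),
    ("MMMLU", 14042, 172.0, 255.8, 35.95),
    ("MuSR", 564, 1464.9, 850.2, 5.55),
    ("SB10K", 1024, 650.6, 1.7, 1.02),
    ("ScaLA", 2048, 802.7, 1.5, 2.49),
]


def usd_per_mtok(questions: int, tok_in: float, tok_out: float, cost_usd: float) -> float:
    """Assumed definition: total cost divided by total tokens, per million."""
    total_tokens = questions * (tok_in + tok_out)
    return cost_usd / total_tokens * 1_000_000


prices = {name: usd_per_mtok(q, i, o, c) for name, q, i, o, c in runs}
median_price = median(prices.values())

for name, price in prices.items():
    print(f"{name:9s} ${price:.2f}/Mtok")
print(f"median    ${median_price:.2f}/Mtok")
```

Under these assumptions the median for Gemini 3.5 Flash comes out at about $2.81/Mtok, which coincides with its overview price. The same calculation does not reproduce every overview price (Gemini 3.1 Flash-Lite comes out near $0.56 against a listed $0.44), so the page's own definition likely differs in some respect.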

GermEval — German NER

Native German named-entity recognition — identify persons, locations, organisations and misc entities in German text, emitted as JSON. Scored with seqeval micro-F1 excluding the noisy MISC class. Run reasoning-off.

1,024 questions · Named-entity recognition · Native German · GermEval (via EuroEval)
# | Model | Score | Latency | Tok in | Tok out | Cost | Date
1 | Gemini 3.1 Pro 🔒 reasoning | 87.3% | 4.86s | 1,059.7 | 436.2 | $10.04 | 2026-06-09
2 | Gemini 3.5 Flash | 86.0% | 0.81s | 1,077 | 24.6 | $1.88 | 2026-06-09
3 | GPT-5.5 🔒 reasoning | 85.9% | 1.86s | 1,140.7 | 40.4 | $7.08 | 2026-06-13
4 | Opus 4.8 | 85.8% | 2.12s | 2,110.5 | 42 | $11.88 | 2026-06-13
5 | Gemini 3.1 Flash-Lite | 83.2% | 0.52s | 1,077 | 25.7 | $0.63 | 2026-06-08
6 | Gemma 4 31B | 82.6% | 10.35s | 1,153 | 27.8 | $0.15 | 2026-06-11
7 | DeepSeek V4 Pro | 82.2% | 1.74s | 1,181.2 | 26.3 | $2.02 | 2026-06-09
8 | Qwen3.7 Max | 82.1% | 1.64s | 1,154.3 | 25.2 | $1.57 | 2026-06-09
9 | MiMo V2.5 Pro | 81.9% | 1.09s | 1,533.5 | 26.4 | $1.49 | 2026-06-09
10 | Gemma 4 26B A4B | 81.8% | 1.14s | 1,153 | 30.3 | $0.19 | 2026-06-08
11 | Gemini 2.5 Flash | 81.8% | 0.52s | 1,077 | 24.5 | $0.39 | 2026-06-09
12 | DeepSeek V4 Flash | 80.2% | 0.77s | 1,181.2 | 26.8 | $0.18 | 2026-06-09
13 | Qwen3.6 35B-A3B | 79.9% | 0.67s | 1,154.3 | 27.3 | $0.21 | 2026-06-09
14 | Gemma 4 12B | 79.6% | 9.49s | 1,153 | 28.1 | | 2026-06-11
15 | Tencent HY3-Preview | 77.3% | 2.50s | 1,313.7 | 35.4 | $0.09 | 2026-06-09
16 | Qwen3 14B | 73.0% | 9.61s | 1,292.5 | 28.5 | $0.17 | 2026-06-11
17 | Qwen3.5 9B | 72.6% | 5.70s | 1,154.3 | 26.9 | $0.12 | 2026-06-11
18 | GLM-5.1 | 52.2% | 2.20s | 1,127.6 | 26.5 | $1.71 | 2026-06-09
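The GermEval description above names micro-F1 with the MISC class excluded. A minimal sketch of that idea follows. It does not call seqeval; it works on sets of entity spans directly, and the span representation `(label, start, end)` and the function name `micro_f1` are assumptions made for illustration, not the project's scoring code.

```python
from typing import FrozenSet, List, Set, Tuple

Entity = Tuple[str, int, int]  # (label, start_token, end_token), end exclusive


def micro_f1(gold: List[Set[Entity]], pred: List[Set[Entity]],
             exclude: FrozenSet[str] = frozenset({"MISC"})) -> float:
    """Entity-level micro-F1 pooled over all sentences, ignoring excluded labels."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        # Drop excluded classes on both sides before counting.
        g = {e for e in g if e[0] not in exclude}
        p = {e for e in p if e[0] not in exclude}
        tp += len(g & p)   # exact label and span match
        fp += len(p - g)   # predicted but not in gold
        fn += len(g - p)   # in gold but not predicted
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gold = [{("PER", 0, 2), ("LOC", 5, 6), ("MISC", 8, 9)}, {("ORG", 1, 3)}]
pred = [{("PER", 0, 2), ("LOC", 4, 6)}, {("ORG", 1, 3), ("MISC", 0, 1)}]
score = micro_f1(gold, pred)
print(f"{score:.3f}")  # 0.667
```

In the example the two MISC spans are ignored, one LOC span has the wrong boundary, and the remaining two entities match, so precision and recall are both 2/3.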

INCLUDE — German

Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.

139 questions · 4-option multiple choice · Native German · CohereLabs/include-base-44
# | Model | Score | Latency | Tok in | Tok out | Cost | Date
1 | Gemini 3.1 Pro 🔒 reasoning | 77.7% | 4.24s | 107.1 | 416.6 | $0.72 | 2026-06-10
2 | GPT-5.5 🔒 reasoning | 74.8% | 3.99s | 128.2 | 153.6 | $0.70 | 2026-06-12
3 | Gemini 3.5 Flash | 74.1% | 1.72s | 107.1 | 205.4 | $0.28 | 2026-06-10
4 | MiniMax M2.7 🔒 reasoning | 72.7% | 12.26s | 151.1 | 1,443.8 | $0.29 |
5 | Gemini 3.1 Flash-Lite | 71.9% | 0.89s | 107.1 | 151 | $0.04 | 2026-06-10
6 | Opus 4.8 | 71.9% | 5.77s | 256.2 | 343.7 | $1.32 | 2026-06-12
7 | Qwen3.7 Max | 71.2% | 4.17s | 118.3 | 239.6 | $0.28 | 2026-06-08
8 | Gemini 2.5 Flash | 70.5% | 1.61s | 105.8 | 193.6 | $0.07 | 2026-06-02
9 | DeepSeek V4 Pro | 70.5% | 3.45s | 118.626 | 121.108 | $0.08 | 2026-06-08
10 | DeepSeek V4 Flash | 70.5% | 2.53s | 118.626 | 111.288 | $0.0066 | 2026-06-08
11 | Kimi K2.6 | 69.8% | 13.65s | 138.7 | 527.1 | $0.58 | 2026-06-08
12 | Tencent HY3-Preview | 69.1% | 5.48s | 143.9 | 222.4 | $0.0086 | 2026-06-02
13 | Gemma 4 12B | 69.1% | 9.49s | 123.1 | 221.9 | | 2026-06-11
14 | Qwen3.6 35B-A3B | 68.3% | 2.72s | 118.4 | 545.1 | $0.08 | 2026-06-02
15 | Claude Haiku 4.5 | 68.3% | 3.45s | 154.6 | 347.4 | $0.26 | 2026-06-02
16 | Gemma 4 31B | 68.3% | 10.35s | 119.1 | 205.4 | $0.01 | 2026-06-11
17 | MiMo V2.5 Pro | 67.6% | 4.63s | 372.2 | 224.3 | $0.10 | 2026-06-08
18 | GLM-5.1 | 67.6% | 3.36s | 114.6 | 259.7 | $0.35 | 2026-06-08
19 | gpt-oss-120b 🔒 reasoning | 66.2% | 1.71s | 154.1 | 111.7 | $0.0078 | 2026-05-29
20 | Gemma 4 26B A4B | 64.7% | 3.71s | 118.7 | 275.8 | $0.03 | 2026-06-03
21 | Qwen3.5 9B | 64.7% | 5.70s | 122.4 | 378 | $0.0096 | 2026-06-11
22 | grok-4.3 | 63.3% | 0.60s | 233.8 | 20.5 | $0.04 | 2026-06-03
23 | Qwen3 14B | 63.3% | 9.61s | 135.4 | 46 | $0.0038 | 2026-06-11
24 | Ministral 14B | 58.3% | 0.41s | 109.1 | 79.8 | $0.01 | 2026-06-03
25 | gemma-3-12b-it | 53.2% | 5.00s | 114.8 | 160.5 | $0.0073 | 2026-06-03

MMLU-Pro — German

Hard academic questions across 14 subjects — STEM, law, health, economics, philosophy and more. Professionally translated to German, with up to ten answer options per question.

11,759 questions · 10-option multiple choice · Professional translation · li-lab/MMLU-ProX
# | Model | Score | Latency | Tok in | Tok out | Cost | Date
1 | Gemini 3.5 Flash | 86.5% | 2.23s | 1,664.9 | 353.3 | $66.76 | 2026-05-26
2 | Gemini 3.1 Flash-Lite | 82.2% | 1.24s | 1,665.9 | 304 | $10.26 | 2026-06-03
3 | Gemma 4 31B | 82.1% | 10.35s | 1,702.9 | 502.5 | $4.47 | 2026-06-11
4 | Qwen3.6 35B-A3B | 80.0% | 4.67s | 1,689.6 | 984 | $13.36 | 2026-05-31
5 | Gemini 2.5 Flash | 79.6% | 2.69s | 1,664.9 | 786 | $28.89 | 2026-05-26
6 | Gemma 4 26B A4B | 78.2% | 7.43s | 1,702.3 | 629.5 | $3.49 | 2026-06-05
7 | Claude Haiku 4.5 | 75.3% | 3.76s | 2,262 | 433.7 | $52.10 | 2026-06-03
8 | Qwen3.5 9B | 73.4% | 5.70s | 1,693.6 | 931.1 | $3.63 | 2026-06-11
9 | Tencent HY3-Preview | 73.2% | 6.14s | 1,910.2 | 586.3 | $2.76 | 2026-05-27
10 | Gemma 4 12B | 73.1% | 9.49s | 1,705.9 | 493.3 | | 2026-06-11
11 | Gemini 2.5 Flash-Lite | 71.2% | 2.48s | 1,665.1 | 1,493.5 | $8.95 | 2026-06-03
12 | Qwen3 14B | 64.1% | 9.61s | 1,887.3 | 413.7 | $3.83 | 2026-06-11
13 | DeepSeek V4 Flash | 36.8% | 2.01s | 1,716.4 | 172.4 | $1.91 | 2026-06-03
Per-subject breakdown (168 rows)
Subject Model Score
biology Gemini 3.5 Flash 92.9%
biology Gemma 4 31B 91.2%
biology Qwen3.6 35B-A3B 89.7%
biology Gemini 3.1 Flash-Lite 89.4%
biology Gemini 2.5 Flash 88.8%
biology DeepSeek V4 Flash 88.3%
biology Gemma 4 26B A4B 87.3%
biology Gemini 2.5 Flash-Lite 86.8%
biology Tencent HY3-Preview 85.8%
business Gemini 3.5 Flash 89.6%
business Gemma 4 31B 87.5%
business Gemini 3.1 Flash-Lite 85.3%
business Gemma 4 26B A4B 83.9%
business Gemini 2.5 Flash 83.5%
business Qwen3.6 35B-A3B 83.5%
business DeepSeek V4 Flash 81.0%
business Tencent HY3-Preview 78.3%
business Gemini 2.5 Flash-Lite 77.7%
chemistry Gemini 3.5 Flash 89.1%
chemistry Qwen3.6 35B-A3B 88.2%
chemistry Gemini 3.1 Flash-Lite 87.1%
chemistry Gemini 2.5 Flash 86.9%
chemistry Gemma 4 31B 86.7%
chemistry Gemma 4 26B A4B 85.4%
chemistry Gemini 2.5 Flash-Lite 78.4%
chemistry DeepSeek V4 Flash 75.7%
chemistry Tencent HY3-Preview 75.6%
computer science Gemini 3.1 Flash-Lite 88.8%
computer science Gemini 3.5 Flash 87.6%
computer science Gemma 4 31B 85.9%
computer science Gemini 2.5 Flash 85.1%
computer science Qwen3.6 35B-A3B 85.1%
computer science Gemma 4 26B A4B 83.9%
computer science DeepSeek V4 Flash 81.7%
computer science Gemini 2.5 Flash-Lite 77.8%
computer science Tencent HY3-Preview 76.8%
economics Gemini 3.5 Flash 89.1%
economics Gemini 3.1 Flash-Lite 87.3%
economics Gemma 4 31B 86.8%
economics Gemini 2.5 Flash 86.4%
economics Qwen3.6 35B-A3B 85.3%
economics Gemma 4 26B A4B 82.6%
economics Tencent HY3-Preview 82.3%
economics Gemini 2.5 Flash-Lite 79.1%
economics DeepSeek V4 Flash 69.5%
engineering Gemini 3.5 Flash 82.5%
engineering Gemini 3.1 Flash-Lite 77.8%
engineering Qwen3.6 35B-A3B 77.5%
engineering Gemma 4 31B 77.1%
engineering Gemma 4 26B A4B 73.5%
engineering Gemini 2.5 Flash 71.6%
engineering Tencent HY3-Preview 64.4%
engineering DeepSeek V4 Flash 60.0%
engineering Gemini 2.5 Flash-Lite 58.4%
health Gemini 3.5 Flash 78.6%
health Gemini 3.1 Flash-Lite 74.8%
health Gemma 4 31B 74.1%
health Tencent HY3-Preview 73.7%
health Qwen3.6 35B-A3B 72.8%
health Gemini 2.5 Flash 72.5%
health Gemma 4 26B A4B 71.2%
health Gemini 2.5 Flash-Lite 66.5%
health DeepSeek V4 Flash 13.8%
history Gemini 3.5 Flash 80.6%
history Gemini 3.1 Flash-Lite 74.5%
history Gemma 4 31B 74.5%
history Tencent HY3-Preview 71.4%
history Gemini 2.5 Flash 70.9%
history Qwen3.6 35B-A3B 69.0%
history Gemma 4 26B A4B 66.7%
history Gemini 2.5 Flash-Lite 59.3%
history DeepSeek V4 Flash 10.8%
law Gemini 3.5 Flash 72.8%
law Gemini 3.1 Flash-Lite 62.8%
law Gemma 4 31B 59.0%
law Gemini 2.5 Flash 54.0%
law Qwen3.6 35B-A3B 52.8%
law Gemma 4 26B A4B 50.7%
law Tencent HY3-Preview 45.8%
law Gemini 2.5 Flash-Lite 41.6%
law DeepSeek V4 Flash 10.4%
math Gemini 3.5 Flash 94.6%
math Gemma 4 31B 93.5%
math Qwen3.6 35B-A3B 92.7%
math Gemma 4 26B A4B 92.0%
math Gemini 2.5 Flash 90.5%
math Gemini 3.1 Flash-Lite 90.2%
math Tencent HY3-Preview 86.5%
math Gemini 2.5 Flash-Lite 82.8%
math DeepSeek V4 Flash 9.5%
nothink biology Gemma 4 12B 87.0%
nothink biology Qwen3.5 9B 85.6%
nothink biology Qwen3 14B 81.3%
nothink business Gemma 4 12B 79.8%
nothink business Qwen3.5 9B 78.1%
nothink business Qwen3 14B 70.3%
nothink chemistry Qwen3.5 9B 85.0%
nothink chemistry Gemma 4 12B 79.9%
nothink chemistry Qwen3 14B 72.2%
nothink computer science Gemma 4 12B 79.3%
nothink computer science Qwen3.5 9B 78.0%
nothink computer science Qwen3 14B 71.5%
nothink economics Qwen3.5 9B 78.9%
nothink economics Gemma 4 12B 78.7%
nothink economics Qwen3 14B 71.1%
nothink engineering Qwen3.5 9B 66.6%
nothink engineering Gemma 4 12B 65.1%
nothink engineering Qwen3 14B 55.9%
nothink health Qwen3.5 9B 65.9%
nothink health Gemma 4 12B 64.5%
nothink health Qwen3 14B 56.8%
nothink history Qwen3.5 9B 60.4%
nothink history Gemma 4 12B 57.2%
nothink history Qwen3 14B 48.8%
nothink law Gemma 4 12B 42.8%
nothink law Qwen3.5 9B 36.7%
nothink law Qwen3 14B 27.4%
nothink math Qwen3.5 9B 90.2%
nothink math Gemma 4 12B 90.1%
nothink math Qwen3 14B 82.2%
nothink other Qwen3.5 9B 61.6%
nothink other Gemma 4 12B 60.7%
nothink other Qwen3 14B 52.1%
nothink philosophy Gemma 4 12B 60.3%
nothink philosophy Qwen3.5 9B 58.1%
nothink philosophy Qwen3 14B 49.3%
nothink physics Qwen3.5 9B 84.7%
nothink physics Gemma 4 12B 81.9%
nothink physics Qwen3 14B 73.1%
nothink psychology Gemma 4 12B 76.3%
nothink psychology Qwen3.5 9B 74.2%
nothink psychology Qwen3 14B 66.0%
other Gemini 3.5 Flash 82.4%
other Gemini 3.1 Flash-Lite 76.7%
other Gemma 4 31B 73.7%
other Gemini 2.5 Flash 73.5%
other Tencent HY3-Preview 72.1%
other Qwen3.6 35B-A3B 70.8%
other Gemma 4 26B A4B 67.5%
other Gemini 2.5 Flash-Lite 64.9%
other DeepSeek V4 Flash 9.2%
philosophy Gemini 3.5 Flash 83.2%
philosophy Gemini 3.1 Flash-Lite 75.2%
philosophy Gemma 4 31B 74.3%
philosophy Gemini 2.5 Flash 70.9%
philosophy Qwen3.6 35B-A3B 69.5%
philosophy Tencent HY3-Preview 69.1%
philosophy Gemma 4 26B A4B 69.1%
philosophy Gemini 2.5 Flash-Lite 60.9%
philosophy DeepSeek V4 Flash 6.8%
physics Gemini 3.5 Flash 90.4%
physics Gemma 4 31B 89.0%
physics Qwen3.6 35B-A3B 88.0%
physics Gemini 3.1 Flash-Lite 87.6%
physics Gemma 4 26B A4B 86.1%
physics Gemini 2.5 Flash 85.5%
physics Gemini 2.5 Flash-Lite 76.6%
physics Tencent HY3-Preview 64.7%
physics DeepSeek V4 Flash 10.2%
psychology Gemini 3.5 Flash 87.8%
psychology Gemma 4 31B 83.6%
psychology Gemini 3.1 Flash-Lite 83.3%
psychology Gemini 2.5 Flash 82.6%
psychology Tencent HY3-Preview 80.5%
psychology Qwen3.6 35B-A3B 78.7%
psychology Gemma 4 26B A4B 78.6%
psychology Gemini 2.5 Flash-Lite 75.1%
psychology DeepSeek V4 Flash 10.4%

MMMLU — German

OpenAI's multilingual MMLU, German split — general knowledge spanning STEM, the humanities, social sciences and other domains. Professionally translated to German.

14,042 questions · 4-option multiple choice · Professional translation · openai/MMMLU
# | Model | Score | Latency | Tok in | Tok out | Cost | Date
1 | Gemini 3.5 Flash | 89.3% | 2.05s | 172 | 255.8 | $35.95 | 2026-05-26
2 | Gemini 3.1 Flash-Lite | 86.8% | 1.06s | 173 | 210.8 | $5.05 | 2026-06-03
3 | Gemma 4 31B | 86.6% | 10.35s | 185 | 251.5 | $1.55 | 2026-06-11
4 | Qwen3.6 35B-A3B | 85.8% | 4.39s | 182.9 | 561.4 | $8.27 | 2026-06-01
5 | DeepSeek V4 Flash | 84.9% | 2.32s | 185.9 | 144.7 | $1.03 | 2026-06-03
6 | Gemini 2.5 Flash | 84.7% | 1.48s | 172 | 275.5 | $10.40 | 2026-05-26
7 | Gemma 4 26B A4B | 83.7% | 3.97s | 185 | 340.7 | $2.24 | 2026-06-05
8 | Tencent HY3-Preview | 83.7% | 5.55s | 223.4 | 280.9 | $1.14 | 2026-05-28
9 | Claude Haiku 4.5 | 83.1% | 4.20s | 277.7 | 397.6 | $31.90 | 2026-06-02
10 | Gemini 2.5 Flash-Lite | 79.6% | 1.68s | 162.7 | 504.6 | $4.99 | 2026-06-03
11 | Gemma 4 12B | 79.0% | 9.49s | 189 | 261.2 | | 2026-06-11
12 | Qwen3.5 9B | 78.8% | 5.70s | 186.9 | 408.3 | $1.12 | 2026-06-11
13 | Qwen3 14B | 73.4% | 9.61s | 212 | 87.9 | $0.65 | 2026-06-11

MuSR (DE)

German version of MuSR (multistep soft reasoning), frozen v1, cot+ setting; scored as accuracy.

564 questions
# | Model | Score | Latency | Tok in | Tok out | Cost | Date
1 | Gemini 3.1 Pro 🔒 reasoning | 88.1% | 12.14s | 1,464.9 | 1,306.5 | $10.49 | 2026-06-09
2 | Gemini 3.1 Flash-Lite | 84.4% | 2.95s | 1,467.3 | 568.5 | $0.73 | 2026-06-09
3 | Gemini 3.5 Flash | 84.2% | 5.79s | 1,464.9 | 850.2 | $5.55 | 2026-06-09
4 | Gemma 4 26B A4B | 83.7% | 9.78s | 1,478.6 | 814.1 | $0.29 | 2026-06-10
5 | DeepSeek V4 Flash | 83.5% | 10.18s | 1,659.8 | 798.9 | $0.26 | 2026-06-10
6 | Gemma 4 31B | 83.5% | 10.35s | 1,477.9 | 650.1 | $0.23 | 2026-06-11
7 | Gemini 2.5 Flash | 83.3% | 6.28s | 1,464.9 | 1,077.4 | $1.77 | 2026-06-09
8 | Gemma 4 12B | 81.6% | 9.49s | 1,477.9 | 707.4 | | 2026-06-11
9 | GLM-5.1 | 81.4% | 26.91s | 1,583.3 | 882.5 | $3.44 | 2026-06-10
10 | MiMo V2.5 Pro | 80.9% | 11.43s | 1,964.3 | 754.9 | $0.81 | 2026-06-10
11 | Qwen3.5 9B | 80.9% | 5.70s | 1,469.2 | 1,629.7 | $0.22 | 2026-06-11
12 | Qwen3 14B | 69.7% | 9.61s | 1,728.9 | 1,270.4 | $0.28 | 2026-06-11
Per-subject breakdown (36 rows)
Subject Model Score
murder mystery Gemini 3.1 Pro 89.2%
murder mystery Gemini 2.5 Flash 87.6%
murder mystery Gemma 4 31B 86.4%
murder mystery DeepSeek V4 Flash 85.6%
murder mystery Gemma 4 12B 85.6%
murder mystery Gemma 4 26B A4B 85.2%
murder mystery Gemini 3.1 Flash-Lite 84.8%
murder mystery Gemini 3.5 Flash 84.0%
murder mystery GLM-5.1 80.0%
murder mystery MiMo V2.5 Pro 78.8%
murder mystery Qwen3.5 9B 77.6%
murder mystery Qwen3 14B 76.8%
object placements Gemini 3.5 Flash 81.3%
object placements GLM-5.1 81.3%
object placements Gemini 3.1 Pro 79.7%
object placements Gemma 4 12B 79.7%
object placements MiMo V2.5 Pro 76.6%
object placements Gemma 4 31B 76.6%
object placements Gemini 3.1 Flash-Lite 75.0%
object placements Gemini 2.5 Flash 75.0%
object placements DeepSeek V4 Flash 75.0%
object placements Gemma 4 26B A4B 73.4%
object placements Qwen3.5 9B 73.4%
object placements Qwen3 14B 71.9%
team allocation Gemini 3.1 Pro 89.2%
team allocation Gemini 3.1 Flash-Lite 86.4%
team allocation Qwen3.5 9B 86.0%
team allocation Gemini 3.5 Flash 85.2%
team allocation DeepSeek V4 Flash 84.8%
team allocation Gemma 4 26B A4B 84.8%
team allocation MiMo V2.5 Pro 84.0%
team allocation GLM-5.1 82.8%
team allocation Gemma 4 31B 82.4%
team allocation Gemini 2.5 Flash 81.2%
team allocation Gemma 4 12B 78.0%
team allocation Qwen3 14B 62.0%

SB10K — German sentiment

Native German social-media sentiment classification — positive, neutral or negative. Human-annotated German text, not translated. Run reasoning-off; scored as exact-match accuracy on the predicted label.

1,024 questions · 3-class sentiment · Native German · SB10K (via EuroEval)
# | Model | Score | Latency | Tok in | Tok out | Cost | Date
1 | Gemini 3.1 Pro 🔒 reasoning | 70.1% | 3.30s | 625.7 | 205.5 | $3.81 | 2026-06-09
2 | Opus 4.8 | 64.5% | 1.88s | 1,372.7 | 5 | $7.16 | 2026-06-13
3 | Gemini 3.5 Flash | 64.3% | 0.67s | 650.6 | 1.7 | $1.02 | 2026-06-09
4 | Qwen3.7 Max | 63.2% | 0.99s | 771.9 | 1.7 | $0.99 | 2026-06-09
5 | GPT-5.5 | 62.5% | 1.36s | 756.9 | 8.1 | $4.12 | 2026-06-13
6 | DeepSeek V4 Flash | 62.4% | 0.54s | 778.6 | 2.7 | $0.11 | 2026-06-09
7 | Gemma 4 12B | 62.2% | 9.49s | 758.7 | 2.7 | | 2026-06-11
8 | MiMo V2.5 Pro | 61.7% | 0.45s | 1,140.3 | 2.8 | $0.99 | 2026-06-09
9 | Gemini 2.5 Flash | 61.6% | 0.40s | 650.7 | 1.8 | $0.20 | 2026-06-09
10 | Gemma 4 31B | 61.0% | 10.35s | 758.7 | 2.8 | $0.09 | 2026-06-11
11 | Gemini 3.1 Flash-Lite | 60.4% | 0.46s | 650.7 | 1.8 | $0.34 | 2026-06-08
12 | DeepSeek V4 Pro | 60.0% | 1.42s | 778.6 | 2.1 | $1.28 | 2026-06-09
13 | Qwen3.6 35B-A3B | 58.4% | 0.45s | 771.9 | 2.7 | $0.12 | 2026-06-09
14 | Qwen3 14B | 56.1% | 9.61s | 899.3 | 2.7 | $0.11 | 2026-06-11
15 | Qwen3.5 9B | 55.7% | 5.70s | 771.9 | 2.6 | $0.08 | 2026-06-11
16 | Tencent HY3-Preview | 38.9% | 2.35s | 886.7 | 2.7 | $0.06 | 2026-06-09
17 | GLM-5.1 | 23.1% | 1.88s | 765.6 | 2.6 | $1.09 | 2026-06-09
18 | Gemma 4 26B A4B | 15.2% | 0.70s | 758.8 | 2.8 | $0.12 | 2026-06-08
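SB10K is described above as scored by exact-match accuracy on the predicted label. A minimal sketch of such a scorer is shown here. The whitespace and case normalisation is an assumption (the page does not say how raw model output is cleaned before comparison), and `exact_match_accuracy` is a hypothetical helper name.

```python
from typing import List

LABELS = ("positive", "neutral", "negative")


def exact_match_accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of items whose normalised prediction equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("need at least one item")
    hits = 0
    for g, p in zip(gold, predicted):
        label = p.strip().lower()  # assumed normalisation, not confirmed by the page
        # Anything that is not exactly one of the three labels counts as wrong.
        if label in LABELS and label == g:
            hits += 1
    return hits / len(gold)


gold = ["positive", "neutral", "negative", "neutral"]
predicted = ["positive", "Neutral ", "The sentiment is negative.", "negative"]
accuracy = exact_match_accuracy(gold, predicted)
print(accuracy)  # 0.5
```

With a strict scorer of this kind, a model that wraps the label in a sentence is marked wrong even when the sentiment it names is correct, which is one reason exact-match scores can differ sharply between models of similar ability.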

ScaLA — German acceptability

Native German linguistic acceptability — does the sentence read as grammatical German (ja / nein)? Built from clean vs. minimally-corrupted German sentences. Run reasoning-off.

2,048 questions · Binary acceptability · Native German · ScaLA-de (via EuroEval)
# | Model | Score | Latency | Tok in | Tok out | Cost | Date
1 | Opus 4.8 | 82.6% | 2.00s | 1,684.8 | 4 | $17.46 | 2026-06-13
2 | Gemini 3.1 Pro 🔒 reasoning | 80.4% | 3.71s | 777.9 | 262 | $10.37 | 2026-06-09
3 | GPT-5.5 🔒 reasoning | 79.7% | 1.52s | 913.2 | 27.6 | $11.05 | 2026-06-13
4 | Gemini 3.5 Flash | 79.4% | 0.79s | 802.7 | 1.5 | $2.49 | 2026-06-09
5 | Gemini 2.5 Flash | 77.1% | 0.40s | 802.7 | 1.5 | $0.50 | 2026-06-09
6 | DeepSeek V4 Flash | 74.9% | 0.53s | 949.8 | 2.5 | $0.27 | 2026-06-09
7 | MiMo V2.5 Pro | 74.8% | 0.57s | 1,301 | 2.5 | $2.33 | 2026-06-09
8 | Gemini 3.1 Flash-Lite | 74.2% | 0.46s | 802.7 | 1.5 | $0.83 | 2026-06-08
9 | DeepSeek V4 Pro | 73.5% | 1.41s | 949.8 | 1.6 | $3.12 | 2026-06-09
10 | Gemma 4 31B | 71.0% | 10.35s | 910.7 | 2.5 | $0.23 | 2026-06-11
11 | GLM-5.1 | 69.9% | 1.69s | 901.6 | 2.5 | $2.57 | 2026-06-09
12 | Qwen3.7 Max | 69.8% | 1.02s | 928.9 | 1.6 | $2.39 | 2026-06-09
13 | Gemma 4 26B A4B | 66.3% | 0.46s | 910.7 | 2.6 | $0.28 | 2026-06-08
14 | Qwen3.6 35B-A3B | 64.2% | 0.51s | 928.9 | 2.6 | $0.29 | 2026-06-09
15 | Gemma 4 12B | 64.1% | 9.49s | 910.7 | 2.6 | | 2026-06-11
16 | Qwen3 14B | 64.0% | 9.61s | 1,060 | 2.6 | $0.26 | 2026-06-11
17 | Tencent HY3-Preview | 62.5% | 2.48s | 1,027.6 | 2.6 | $0.13 | 2026-06-09
18 | Qwen3.5 9B | 62.1% | 5.70s | 928.9 | 2.5 | $0.19 | 2026-06-11

TPS is decode speed — output tokens per second after the first token; higher is faster. TTFT is time to first token; lower is snappier. A snapshot, not a constant.

# Model TPS TTFT
1 gpt-oss-120b 🔒 362 0.51s
2 Gemini 3.1 Flash-Lite 302 4.99s
3 Gemini 2.5 Flash-Lite 293 0.36s
4 Gemini 2.5 Flash 216 0.61s
5 Gemini 3.5 Flash 203 0.88s
6 grok-4.3 171 0.60s
7 Qwen3.7 Max 170 1.61s
8 Qwen3.6 35B-A3B 142 1.18s
9 Claude Haiku 4.5 141 0.64s
10 Gemini 3.1 Pro 🔒 127
11 DeepSeek V4 Flash 103 1.00s
12 Tencent HY3-Preview 95 2.47s
13 Ministral 14B 87 0.41s
14 Gemma 4 26B A4B 84 0.95s
15 GLM-5.1 71 0.89s
16 Qwen3 14B 65 1.02s
17 DeepSeek V4 Pro 58 1.02s
18 Qwen3.5 9B 58 1.21s
19 MiMo V2.5 Pro 49 2.03s
20 Gemma 4 12B 46 0.83s
21 Gemma 4 31B 38 1.01s
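The TPS and TTFT definitions above can be turned into a small calculation from client-side timestamps of a streamed response. This is a sketch under stated assumptions: the `StreamTiming` record and its field names are hypothetical, and the page does not say how its harness actually records timings.

```python
from dataclasses import dataclass


@dataclass
class StreamTiming:
    request_sent: float   # seconds, when the request left the client
    first_token: float    # seconds, when the first output token arrived
    last_token: float     # seconds, when the final output token arrived
    output_tokens: int    # total output tokens in the response


def ttft(t: StreamTiming) -> float:
    """Time to first token: wait between sending and the first token."""
    return t.first_token - t.request_sent


def decode_tps(t: StreamTiming) -> float:
    """Output tokens per second after the first token."""
    decode_time = t.last_token - t.first_token
    if t.output_tokens < 2 or decode_time <= 0:
        raise ValueError("need at least two tokens and a positive decode window")
    # The first token belongs to TTFT, so it is not counted in the decode phase.
    return (t.output_tokens - 1) / decode_time


timing = StreamTiming(request_sent=0.0, first_token=0.5, last_token=2.5, output_tokens=401)
print(ttft(timing), decode_tps(timing))  # 0.5 200.0
```

Separating the two numbers matters because a model can stream quickly yet start slowly, or the reverse; a single end-to-end latency figure would hide that difference.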

Quality vs. speed

Average score across all benchmarks against decode speed (output tokens per second). Up and to the right is better — smarter and faster. The gold line is the speed frontier: the best score available at each speed.

[Figure: scatter plot of average score (y-axis, 55% to 85%) against output speed in tokens / sec (x-axis, 50 to 500), with the gold speed-frontier line. Plotted points:]
Gemini 3.1 Pro 🔒 * · 80.7% · 127 tok/s
Gemini 3.5 Flash · 80.6% · 203 tok/s
Gemini 3.1 Flash-Lite · 77.6% · 302 tok/s
Gemini 2.5 Flash · 76.9% · 216 tok/s
Gemma 4 31B · 76.5% · 38 tok/s
Claude Haiku 4.5 * · 75.6% · 141 tok/s
Gemini 2.5 Flash-Lite * · 75.4% · 293 tok/s
MiMo V2.5 Pro * · 73.4% · 49 tok/s
Qwen3.6 35B-A3B * · 72.8% · 142 tok/s
Gemma 4 12B · 72.7% · 46 tok/s
Qwen3.7 Max * · 71.6% · 170 tok/s
DeepSeek V4 Pro * · 71.5% · 58 tok/s
DeepSeek V4 Flash · 70.5% · 103 tok/s
Qwen3.5 9B · 69.7% · 58 tok/s
gpt-oss-120b 🔒 * · 66.2% · 362 tok/s
Gemma 4 26B A4B · 67.7% · 84 tok/s
Tencent HY3-Preview * · 67.4% · 95 tok/s
Qwen3 14B · 66.2% · 65 tok/s
grok-4.3 * · 63.3% · 171 tok/s
GLM-5.1 * · 58.8% · 71 tok/s
Ministral 14B * · 58.3% · 87 tok/s

* scored on INCLUDE only — its average covers fewer benchmarks, so it isn't directly comparable to the full-coverage models.

🔒 reasoning can't be disabled — its decode speed includes forced reasoning tokens, so it isn't directly comparable to the reasoning-off models.
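The speed frontier described in the "Quality vs. speed" section, the best score available at each speed, can be read as a Pareto frontier: a model is on it when no model that is at least as fast scores as high or higher. The sketch below implements that reading on a subset of the plotted points. It is an interpretation, not the page's plotting code, and it ignores the * and 🔒 comparability caveats, which the actual gold line may treat differently.

```python
from typing import List, Tuple

Point = Tuple[str, float, float]  # (model, tokens_per_sec, avg_score_percent)


def speed_frontier(points: List[Point]) -> List[Point]:
    """Keep each model whose score beats every model that is at least as fast."""
    frontier: List[Point] = []
    best_score = float("-inf")
    # Walk from fastest to slowest; on a speed tie, look at the higher score first.
    for point in sorted(points, key=lambda p: (-p[1], -p[2])):
        if point[2] > best_score:
            frontier.append(point)
            best_score = point[2]
    return frontier


# Subset of the points listed for the chart (speed in tok/s, average score in %).
points = [
    ("gpt-oss-120b", 362, 66.2),
    ("Gemini 3.1 Flash-Lite", 302, 77.6),
    ("Gemini 2.5 Flash-Lite", 293, 75.4),
    ("Gemini 2.5 Flash", 216, 76.9),
    ("Gemini 3.5 Flash", 203, 80.6),
    ("Claude Haiku 4.5", 141, 75.6),
    ("Gemini 3.1 Pro", 127, 80.7),
    ("Gemma 4 31B", 38, 76.5),
]

frontier = speed_frontier(points)
for model, speed, score in frontier:
    print(f"{model}: {score}% at {speed} tok/s")
```

On this subset the frontier runs from gpt-oss-120b through Gemini 3.1 Flash-Lite and Gemini 3.5 Flash to Gemini 3.1 Pro; every other listed model is both slower and lower-scoring than at least one of these.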