German Artificial Analytics
a PeerBench project

Which fast LLM speaks German best?

Independent German-language benchmarks for the fastest models — accuracy, latency, and cost, side by side.

German LLM Overview

Which model is best on German tasks — one score, ranked. Filter by open vs. closed weights, price, or speed.

# Model Score Price Speed
1 Gemini 3.1 Pro (Google · Closed) 80.3 (78–89) $6.75/Mtok ($$$) 127 tok/s
2 Gemini 3.5 Flash (Google · Closed) 68.4 (67–75) $2.81/Mtok ($$) 202.5 tok/s
3 Qwen3.7 Max (Alibaba · Closed) 57.7 (51–67) $1.28/Mtok ($$) 170.3 tok/s
4 Gemini 3.1 Flash-Lite (Google · Closed) 56.4 (52–65) $0.44/Mtok ($) 302.3 tok/s
5 DeepSeek V4 Pro (DeepSeek · Open weights) 54.1 (50–59) $1.62/Mtok ($$) 57.8 tok/s
6 Gemma 4 31B (Google · Open weights) 52.4 (49–57) $0.17/Mtok ($) 38.4 tok/s
7 DeepSeek V4 Flash (DeepSeek · Open weights) 52.1 (48–58) $0.14/Mtok ($) 102.5 tok/s
8 MiMo V2.5 Pro (Xiaomi · Open weights) 51.7 (46–61) $0.85/Mtok ($$) 49.2 tok/s
9 Gemini 2.5 Flash (Google · Closed) 51.0 (47–57) $1.00/Mtok ($$) 216.2 tok/s
10 Qwen3.6 35B-A3B (Alibaba · Open weights) 48.7 (46–53) $0.30/Mtok ($) 141.6 tok/s
11 Claude Haiku 4.5 (Anthropic · Closed) 46.4 (44–49) $3.36/Mtok ($$$) 140.6 tok/s
12 Gemma 4 26B A4B (Google · Open weights) 45.0 (40–51) $0.23/Mtok ($) 84.4 tok/s
13 Gemma 4 12B (Google · Open weights) 44.8 (40–49) - 46.3 tok/s
14 GLM-5.1 (Z.ai · Open weights) 44.7 (36–52) $1.45/Mtok ($$) 71.4 tok/s
15 Tencent HY3-Preview (Tencent · Open weights) 44.7 (38–48) $0.08/Mtok ($) 94.8 tok/s
16 Qwen3.5 9B (Alibaba · Open weights) 39.1 (34–42) $0.12/Mtok ($) 57.7 tok/s
17 Qwen3 14B (Alibaba · Open weights) 35.4 (29–39) $0.14/Mtok ($) 64.7 tok/s

Showing 17 models that ran ≥3 of 7 benchmarks (9 excluded for thin coverage). Price = median effective $/1M tokens; Speed = throughput + latency.
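
The coverage rule and the median price in the note above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the data layout and the placeholder models are hypothetical, and the leaderboard's actual pipeline (including how an "effective" price per benchmark is derived) is not described here.

```python
from statistics import median
from typing import Optional

MIN_BENCHMARKS = 3  # from the note: a model must have run >= 3 of 7 benchmarks


def eligible(results: dict, min_benchmarks: int = MIN_BENCHMARKS) -> dict:
    """Keep only models that have a score on at least `min_benchmarks` benchmarks."""
    return {model: scores for model, scores in results.items()
            if len(scores) >= min_benchmarks}


def median_price(prices_per_mtok: list) -> Optional[float]:
    """Median of per-benchmark $/1M-token prices; None when no price is known."""
    return median(prices_per_mtok) if prices_per_mtok else None


# Hypothetical placeholder data, not taken from the leaderboard.
results = {
    "model-a": {"bench-1": 80.0, "bench-2": 70.0, "bench-3": 60.0},
    "model-b": {"bench-1": 75.0},  # thin coverage: excluded
}
kept = eligible(results)
```

Returning `None` for a model with no known price mirrors rows that show no price in the table rather than inventing a value.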

GermEval — German NER

Native German named-entity recognition — identify persons, locations, organisations and misc entities in German text, emitted as JSON. Scored with seqeval micro-F1 excluding the noisy MISC class. Run reasoning-off.

1,024 questions · Named-entity recognition · Native German · GermEval (via EuroEval)
# Model Score Latency Tok in Tok out Cost Date
1 Gemini 3.1 Pro 🔒 reasoning 87.3% 4.86s 1,059.7 436.2 $10.04 2026-06-09
2 Gemini 3.5 Flash 86.0% 0.81s 1,077 24.6 $1.88 2026-06-09
3 GPT-5.5 🔒 reasoning 85.9% 1.86s 1,140.7 40.4 $7.08 2026-06-13
4 Opus 4.8 85.8% 2.12s 2,110.5 42 $11.88 2026-06-13
5 Gemini 3.1 Flash-Lite 83.2% 0.52s 1,077 25.7 $0.63 2026-06-08
6 Gemma 4 31B 82.6% 11.92s 1,153 27.8 $0.15 2026-06-11
7 DeepSeek V4 Pro 82.2% 1.74s 1,181.2 26.3 $2.02 2026-06-09
8 Qwen3.7 Max 82.1% 1.64s 1,154.3 25.2 $1.57 2026-06-09
9 MiMo V2.5 Pro 81.9% 1.09s 1,533.5 26.4 $1.49 2026-06-09
10 Gemma 4 26B A4B 81.8% 1.14s 1,153 30.3 $0.19 2026-06-08
11 Gemini 2.5 Flash 81.8% 0.52s 1,077 24.5 $0.39 2026-06-09
12 DeepSeek V4 Flash 80.2% 0.77s 1,181.2 26.8 $0.18 2026-06-09
13 Qwen3.6 35B-A3B 79.9% 0.67s 1,154.3 27.3 $0.21 2026-06-09
14 Gemma 4 12B 79.6% 43.80s 1,153 28.1 - 2026-06-11
15 Tencent HY3-Preview 77.3% 2.50s 1,313.7 35.4 $0.09 2026-06-09
16 Qwen3 14B 73.0% 21.12s 1,292.5 28.5 - 2026-06-11
17 Qwen3.5 9B 72.6% 11.77s 1,154.3 26.9 $0.12 2026-06-11
18 GLM-5.1 52.2% 2.20s 1,127.6 26.5 $1.71 2026-06-09
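
To make the GermEval scoring rule concrete, here is a minimal sketch of entity-level micro-F1 with the MISC class left out. It is a hand-rolled stand-in rather than the seqeval library itself, and the entity representation and the label strings (PER, LOC, ORG, MISC) are assumptions chosen for illustration.

```python
# An entity is (sentence index, start token, end token, label).
# This layout and the label names are illustrative assumptions.
Entity = tuple


def micro_f1(gold: set, pred: set, exclude: frozenset = frozenset({"MISC"})) -> float:
    """Micro-averaged F1 over exact span-and-label matches, ignoring excluded labels."""
    gold = {e for e in gold if e[3] not in exclude}
    pred = {e for e in pred if e[3] not in exclude}
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one location is mislabelled as an organisation.
gold = {(0, 0, 1, "PER"), (0, 4, 4, "LOC"), (1, 2, 3, "ORG"), (1, 5, 5, "MISC")}
pred = {(0, 0, 1, "PER"), (0, 4, 4, "ORG"), (1, 2, 3, "ORG"), (1, 5, 5, "MISC")}
score = micro_f1(gold, pred)
```

In this example the MISC entity is dropped from both sides before counting, so two of three remaining entities match and precision, recall and F1 all come out at 2/3.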

INCLUDE — German

Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.

139 questions · 4-option multiple choice · Native German · CohereLabs/include-base-44
# Model Score Latency Tok in Tok out Cost Date
1 Gemini 3.1 Pro 🔒 reasoning 77.7% 4.24s 107.1 416.6 $0.72 2026-06-10
2 GPT-5.5 🔒 reasoning 74.8% 3.99s 128.2 153.6 $0.70 2026-06-12
3 Gemini 3.5 Flash 74.1% 1.72s 107.1 205.4 $0.28 2026-06-10
4 MiniMax M2.7 🔒 reasoning 72.7% 12.26s 151.1 1,443.8 $0.29 -
5 Gemini 3.1 Flash-Lite 71.9% 0.89s 107.1 151 $0.04 2026-06-10
6 Opus 4.8 71.9% 5.77s 256.2 343.7 $1.32 2026-06-12
7 Qwen3.7 Max 71.2% - - - - 2026-06-08
8 Gemini 2.5 Flash 70.5% 1.61s 105.8 193.6 $0.07 2026-06-02
9 DeepSeek V4 Pro 70.5% 3.44s 118.7 121.3 $0.08 2026-06-08
10 DeepSeek V4 Flash 70.5% 2.53s 118.626 111.288 $0.0066 2026-06-08
11 Kimi K2.6 69.8% 13.65s 138.7 527.1 $0.58 2026-06-08
12 Tencent HY3-Preview 69.1% 5.48s 143.9 222.4 $0.0086 2026-06-02
13 Gemma 4 12B 69.1% 27.64s 123.1 221.9 - 2026-06-11
14 Qwen3.6 35B-A3B 68.3% 2.72s 118.4 545.1 $0.08 2026-06-02
15 Claude Haiku 4.5 68.3% 3.45s 154.6 347.4 $0.26 2026-06-02
16 Gemma 4 31B 68.3% 4.56s 119.1 205.4 $0.01 2026-06-11
17 MiMo V2.5 Pro 67.6% - - - - 2026-06-08
18 GLM-5.1 67.6% - - - - 2026-06-08
19 gpt-oss-120b 🔒 reasoning 66.2% 1.71s 154.1 111.7 $0.0078 2026-05-29
20 Gemma 4 26B A4B 64.7% 3.71s 118.7 275.8 $0.03 2026-06-03
21 Qwen3.5 9B 64.7% 10.95s 122.4 378 $0.0096 2026-06-11
22 grok-4.3 63.3% 0.60s 233.8 20.5 $0.04 2026-06-03
23 Qwen3 14B 63.3% 3.32s 135.4 46 - 2026-06-11
24 Ministral 14B 58.3% 0.41s 109.1 79.8 $0.01 2026-06-03
25 gemma-3-12b-it 53.2% 5.00s 114.8 160.5 $0.0073 2026-06-03

MMLU-Pro — German

Hard academic questions across 14 subjects — STEM, law, health, economics, philosophy and more. Professionally translated to German, with up to ten answer options per question.

11,759 questions · 10-option multiple choice · Professional translation · li-lab/MMLU-ProX
# Model Score Latency Tok in Tok out Cost Date
1 Gemini 3.5 Flash 86.5% 2.23s 1,664.9 353.3 $66.76 2026-05-26
2 Gemini 3.1 Flash-Lite 82.2% 1.24s 1,666.1 304.5 $10.21 2026-06-03
3 Gemma 4 31B 82.1% 64.35s 1,702.9 502.5 $4.53 2026-06-11
4 DeepSeek V4 Pro 80.8% - 288.1 218.9 $3.71 2026-06-16
5 Qwen3.6 35B-A3B 80.0% 4.67s 1,689.6 984 $13.36 2026-05-31
6 Gemini 2.5 Flash 79.6% 2.69s 1,664.9 786 $28.89 2026-05-26
7 Gemma 4 26B A4B 78.2% 7.77s 1,700.6 662.1 $4.64 2026-06-05
8 Claude Haiku 4.5 75.3% 3.76s 2,262 433.7 $52.10 2026-06-03
9 DeepSeek V4 Flash 74.9% 1.98s 1,718.7 162.2 $0.0020 2026-06-16
10 Qwen3.5 9B 73.4% 40.41s 1,693.6 931.1 $3.63 2026-06-11
11 Tencent HY3-Preview 73.2% 6.14s 1,910.2 586.3 $2.76 2026-05-27
12 Gemma 4 12B 73.1% 92.81s 1,705.9 493.3 - 2026-06-11
13 Gemini 2.5 Flash-Lite 71.2% 2.48s 1,665.1 1,493.5 $8.95 2026-06-03
14 Qwen3 14B 64.1% 29.26s 1,887.3 413.7 - 2026-06-11
Per-subject breakdown (196 rows)
Subject Model Score
biology Gemini 3.5 Flash 92.9%
biology Gemma 4 31B 91.2%
biology DeepSeek V4 Pro 90.2%
biology Qwen3.6 35B-A3B 89.7%
biology Gemini 3.1 Flash-Lite 89.4%
biology Gemini 2.5 Flash 88.8%
biology DeepSeek V4 Flash 88.4%
biology Gemma 4 26B A4B 87.3%
biology Gemini 2.5 Flash-Lite 86.8%
biology Claude Haiku 4.5 85.9%
biology Tencent HY3-Preview 85.8%
business Gemini 3.5 Flash 89.6%
business Gemma 4 31B 87.5%
business Gemini 3.1 Flash-Lite 85.3%
business DeepSeek V4 Pro 85.3%
business Gemma 4 26B A4B 83.9%
business Gemini 2.5 Flash 83.5%
business Qwen3.6 35B-A3B 83.5%
business Claude Haiku 4.5 79.0%
business Tencent HY3-Preview 78.3%
business Gemini 2.5 Flash-Lite 77.7%
business DeepSeek V4 Flash 76.2%
chemistry Gemini 3.5 Flash 89.1%
chemistry Qwen3.6 35B-A3B 88.2%
chemistry Gemini 3.1 Flash-Lite 87.1%
chemistry Gemini 2.5 Flash 86.9%
chemistry Gemma 4 31B 86.7%
chemistry DeepSeek V4 Pro 86.7%
chemistry Gemma 4 26B A4B 85.4%
chemistry Claude Haiku 4.5 80.4%
chemistry Gemini 2.5 Flash-Lite 78.4%
chemistry Tencent HY3-Preview 75.6%
chemistry DeepSeek V4 Flash 73.3%
computer science Gemini 3.1 Flash-Lite 88.8%
computer science Gemini 3.5 Flash 87.6%
computer science DeepSeek V4 Flash 85.9%
computer science Gemma 4 31B 85.9%
computer science DeepSeek V4 Pro 85.4%
computer science Gemini 2.5 Flash 85.1%
computer science Qwen3.6 35B-A3B 85.1%
computer science Gemma 4 26B A4B 83.9%
computer science Claude Haiku 4.5 83.4%
computer science Gemini 2.5 Flash-Lite 77.8%
computer science Tencent HY3-Preview 76.8%
economics Gemini 3.5 Flash 89.1%
economics Gemini 3.1 Flash-Lite 87.3%
economics Gemma 4 31B 86.8%
economics Gemini 2.5 Flash 86.4%
economics Qwen3.6 35B-A3B 85.3%
economics DeepSeek V4 Pro 84.1%
economics Gemma 4 26B A4B 82.6%
economics Tencent HY3-Preview 82.3%
economics Claude Haiku 4.5 81.5%
economics Gemini 2.5 Flash-Lite 79.1%
economics DeepSeek V4 Flash 75.0%
engineering Gemini 3.5 Flash 82.5%
engineering Gemini 3.1 Flash-Lite 77.8%
engineering Qwen3.6 35B-A3B 77.5%
engineering Gemma 4 31B 77.1%
engineering DeepSeek V4 Pro 74.2%
engineering Gemma 4 26B A4B 73.5%
engineering Gemini 2.5 Flash 71.6%
engineering Claude Haiku 4.5 64.7%
engineering Tencent HY3-Preview 64.4%
engineering DeepSeek V4 Flash 64.1%
engineering Gemini 2.5 Flash-Lite 58.4%
health Gemini 3.5 Flash 78.6%
health Gemini 3.1 Flash-Lite 74.8%
health DeepSeek V4 Pro 74.2%
health Gemma 4 31B 74.1%
health Tencent HY3-Preview 73.7%
health Claude Haiku 4.5 72.9%
health Qwen3.6 35B-A3B 72.8%
health Gemini 2.5 Flash 72.5%
health DeepSeek V4 Flash 71.3%
health Gemma 4 26B A4B 71.2%
health Gemini 2.5 Flash-Lite 66.5%
history Gemini 3.5 Flash 80.6%
history Gemini 3.1 Flash-Lite 74.5%
history Gemma 4 31B 74.5%
history DeepSeek V4 Pro 71.9%
history Tencent HY3-Preview 71.4%
history Gemini 2.5 Flash 70.9%
history Qwen3.6 35B-A3B 69.0%
history DeepSeek V4 Flash 68.2%
history Gemma 4 26B A4B 66.7%
history Claude Haiku 4.5 65.4%
history Gemini 2.5 Flash-Lite 59.3%
law Gemini 3.5 Flash 72.8%
law Gemini 3.1 Flash-Lite 62.8%
law Gemma 4 31B 59.0%
law DeepSeek V4 Pro 55.3%
law Gemini 2.5 Flash 54.0%
law Qwen3.6 35B-A3B 52.8%
law Gemma 4 26B A4B 50.7%
law Claude Haiku 4.5 47.4%
law DeepSeek V4 Flash 46.1%
law Tencent HY3-Preview 45.8%
law Gemini 2.5 Flash-Lite 41.6%
math Gemini 3.5 Flash 94.6%
math Gemma 4 31B 93.5%
math Qwen3.6 35B-A3B 92.7%
math Gemma 4 26B A4B 92.0%
math DeepSeek V4 Pro 90.9%
math Gemini 2.5 Flash 90.5%
math Gemini 3.1 Flash-Lite 90.2%
math DeepSeek V4 Flash 88.7%
math Claude Haiku 4.5 86.8%
math Tencent HY3-Preview 86.5%
math Gemini 2.5 Flash-Lite 82.8%
nothink biology Gemma 4 12B 87.0%
nothink biology Qwen3.5 9B 85.6%
nothink biology Qwen3 14B 81.3%
nothink business Gemma 4 12B 79.8%
nothink business Qwen3.5 9B 78.1%
nothink business Qwen3 14B 70.3%
nothink chemistry Qwen3.5 9B 85.0%
nothink chemistry Gemma 4 12B 79.9%
nothink chemistry Qwen3 14B 72.2%
nothink computer science Gemma 4 12B 79.3%
nothink computer science Qwen3.5 9B 78.0%
nothink computer science Qwen3 14B 71.5%
nothink economics Qwen3.5 9B 78.9%
nothink economics Gemma 4 12B 78.7%
nothink economics Qwen3 14B 71.1%
nothink engineering Qwen3.5 9B 66.6%
nothink engineering Gemma 4 12B 65.1%
nothink engineering Qwen3 14B 55.9%
nothink health Qwen3.5 9B 65.9%
nothink health Gemma 4 12B 64.5%
nothink health Qwen3 14B 56.8%
nothink history Qwen3.5 9B 60.4%
nothink history Gemma 4 12B 57.2%
nothink history Qwen3 14B 48.8%
nothink law Gemma 4 12B 42.8%
nothink law Qwen3.5 9B 36.7%
nothink law Qwen3 14B 27.4%
nothink math Qwen3.5 9B 90.2%
nothink math Gemma 4 12B 90.1%
nothink math Qwen3 14B 82.2%
nothink other Qwen3.5 9B 61.6%
nothink other Gemma 4 12B 60.7%
nothink other Qwen3 14B 52.1%
nothink philosophy Gemma 4 12B 60.3%
nothink philosophy Qwen3.5 9B 58.1%
nothink philosophy Qwen3 14B 49.3%
nothink physics Qwen3.5 9B 84.7%
nothink physics Gemma 4 12B 81.9%
nothink physics Qwen3 14B 73.1%
nothink psychology Gemma 4 12B 76.3%
nothink psychology Qwen3.5 9B 74.2%
nothink psychology Qwen3 14B 66.0%
other Gemini 3.5 Flash 82.4%
other DeepSeek V4 Pro 77.3%
other Gemini 3.1 Flash-Lite 76.7%
other Gemma 4 31B 73.7%
other Gemini 2.5 Flash 73.5%
other Tencent HY3-Preview 72.1%
other Qwen3.6 35B-A3B 70.8%
other Claude Haiku 4.5 69.5%
other DeepSeek V4 Flash 69.4%
other Gemma 4 26B A4B 67.5%
other Gemini 2.5 Flash-Lite 64.9%
philosophy Gemini 3.5 Flash 83.2%
philosophy Gemini 3.1 Flash-Lite 75.2%
philosophy Gemma 4 31B 74.3%
philosophy DeepSeek V4 Pro 73.7%
philosophy DeepSeek V4 Flash 71.9%
philosophy Gemini 2.5 Flash 70.9%
philosophy Qwen3.6 35B-A3B 69.5%
philosophy Gemma 4 26B A4B 69.1%
philosophy Tencent HY3-Preview 69.1%
philosophy Claude Haiku 4.5 67.9%
philosophy Gemini 2.5 Flash-Lite 60.9%
physics Gemini 3.5 Flash 90.4%
physics Gemma 4 31B 89.0%
physics Qwen3.6 35B-A3B 88.0%
physics Gemini 3.1 Flash-Lite 87.6%
physics DeepSeek V4 Pro 86.8%
physics Gemma 4 26B A4B 86.1%
physics Gemini 2.5 Flash 85.5%
physics DeepSeek V4 Flash 84.7%
physics Claude Haiku 4.5 81.3%
physics Gemini 2.5 Flash-Lite 76.6%
physics Tencent HY3-Preview 64.7%
psychology Gemini 3.5 Flash 87.8%
psychology Gemma 4 31B 83.6%
psychology Gemini 3.1 Flash-Lite 83.3%
psychology DeepSeek V4 Pro 83.3%
psychology Gemini 2.5 Flash 82.6%
psychology DeepSeek V4 Flash 80.8%
psychology Tencent HY3-Preview 80.5%
psychology Claude Haiku 4.5 78.9%
psychology Qwen3.6 35B-A3B 78.7%
psychology Gemma 4 26B A4B 78.6%
psychology Gemini 2.5 Flash-Lite 75.1%
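
A per-subject breakdown like the one above amounts to grouping graded answers by subject before averaging. The sketch below shows that grouping for multiple-choice items; the record layout and the sample records are hypothetical, and the leaderboard's own aggregation code is not shown in the source.

```python
from collections import defaultdict


def accuracy_by_subject(records):
    """records: iterable of (subject, gold_option, predicted_option).

    Returns {subject: accuracy in percent}.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for subject, gold, predicted in records:
        totals[subject] += 1
        correct[subject] += int(gold == predicted)
    return {subject: 100.0 * correct[subject] / totals[subject] for subject in totals}


# Hypothetical graded answers, not taken from the benchmark.
records = [
    ("law", "A", "A"),
    ("law", "B", "C"),
    ("math", "D", "D"),
]
breakdown = accuracy_by_subject(records)
```

Subjects with few questions give noisy percentages, which is one reason a per-subject ranking can differ noticeably from the overall ranking.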

MMMLU — German

OpenAI's multilingual MMLU, German split — general knowledge spanning STEM, the humanities, social sciences and other domains. Professionally translated to German.

14,042 questions · 4-option multiple choice · Professional translation · openai/MMMLU
# Model Score Latency Tok in Tok out Cost Date
1 GPT-5.5 🔒 reasoning 92.1% 4.53s 198.9 166.8 $83.60 2026-06-16
2 Gemini 3.5 Flash 89.3% 2.05s 172 255.8 $35.95 2026-05-26
3 Gemini 3.1 Flash-Lite 86.8% 1.06s 173 210.7 $5.04 2026-06-03
4 Gemma 4 31B 86.6% 12.07s 185 251.5 $1.58 2026-06-11
5 Qwen3.6 35B-A3B 85.8% 4.39s 182.9 561.4 $8.27 2026-06-01
6 DeepSeek V4 Flash 84.9% 2.31s 190.8 143.6 $0.94 2026-06-03
7 Gemini 2.5 Flash 84.7% 1.48s 172 275.5 $10.40 2026-05-26
8 Gemma 4 26B A4B 83.7% 7.77s 112.3 221.7 $0.05 2026-06-05
9 Tencent HY3-Preview 83.7% 5.55s 223.4 280.9 $1.14 2026-05-28
10 Claude Haiku 4.5 83.1% 4.20s 277.7 397.6 $31.90 2026-06-02
11 Gemini 2.5 Flash-Lite 79.6% 1.72s 173 506 $3.06 2026-06-03
12 Gemma 4 12B 79.0% 47.38s 189 261.2 - 2026-06-11
13 Qwen3.5 9B 78.8% 13.81s 186.9 408.3 $1.12 2026-06-11
14 Qwen3 14B 73.4% 6.37s 212 87.9 - 2026-06-11

MuSR — German

Multi-step soft reasoning over long narrative contexts — murder mysteries, object placement and team allocation. Requires chaining clues across several paragraphs to reach the correct answer. Translated to German from the original English MuSR benchmark.

564 questions · 2–5 option multiple choice · Professional translation · zayne-sprague/MuSR
# Model Score Latency Tok in Tok out Cost Date
1 Gemini 3.1 Pro 🔒 reasoning 88.1% 12.14s 1,464.9 1,306.5 $10.49 2026-06-09
2 GPT-5.5 🔒 reasoning 86.3% 13.58s 1,449.5 617.2 $14.53 2026-06-13
3 Opus 4.8 86.3% 13.87s 3,039.8 1,075.6 $23.74 2026-06-13
4 Gemini 3.1 Flash-Lite 84.4% 2.95s 1,467.3 568.5 $0.73 2026-06-09
5 Gemini 3.5 Flash 84.2% 5.79s 1,464.9 850.2 $5.55 2026-06-09
6 Gemma 4 26B A4B 83.7% - - - - 2026-06-10
7 DeepSeek V4 Flash 83.5% 10.18s 1,659.8 798.9 $0.26 2026-06-10
8 Gemma 4 31B 83.5% 23.30s 1,477.9 650.1 $0.23 2026-06-11
9 Gemini 2.5 Flash 83.3% 6.28s 1,464.9 1,077.4 $1.77 2026-06-09
10 Gemma 4 12B 81.6% 195.50s 1,477.9 707.4 - 2026-06-11
11 GLM-5.1 81.4% - - - - 2026-06-10
12 MiMo V2.5 Pro 80.9% - - - - 2026-06-10
13 Qwen3.5 9B 80.9% 61.14s 1,469.2 1,629.7 $0.22 2026-06-11
14 Qwen3 14B 69.7% 263.43s 1,728.9 1,270.4 - 2026-06-11
Per-subject breakdown (42 rows)
Subject Model Score
murder mystery GPT-5.5 90.0%
murder mystery Gemini 3.1 Pro 89.2%
murder mystery Gemini 2.5 Flash 87.6%
murder mystery Opus 4.8 87.6%
murder mystery Gemma 4 31B 86.4%
murder mystery DeepSeek V4 Flash 85.6%
murder mystery Gemma 4 12B 85.6%
murder mystery Gemma 4 26B A4B 85.2%
murder mystery Gemini 3.1 Flash-Lite 84.8%
murder mystery Gemini 3.5 Flash 84.0%
murder mystery GLM-5.1 80.0%
murder mystery MiMo V2.5 Pro 78.8%
murder mystery Qwen3.5 9B 77.6%
murder mystery Qwen3 14B 76.8%
object placements Gemini 3.5 Flash 81.3%
object placements GLM-5.1 81.3%
object placements Gemini 3.1 Pro 79.7%
object placements Opus 4.8 79.7%
object placements Gemma 4 12B 79.7%
object placements MiMo V2.5 Pro 76.6%
object placements Gemma 4 31B 76.6%
object placements Gemini 3.1 Flash-Lite 75.0%
object placements Gemini 2.5 Flash 75.0%
object placements DeepSeek V4 Flash 75.0%
object placements Gemma 4 26B A4B 73.4%
object placements Qwen3.5 9B 73.4%
object placements Qwen3 14B 71.9%
object placements GPT-5.5 68.8%
team allocation Gemini 3.1 Pro 89.2%
team allocation GPT-5.5 87.2%
team allocation Opus 4.8 86.8%
team allocation Gemini 3.1 Flash-Lite 86.4%
team allocation Qwen3.5 9B 86.0%
team allocation Gemini 3.5 Flash 85.2%
team allocation DeepSeek V4 Flash 84.8%
team allocation Gemma 4 26B A4B 84.8%
team allocation MiMo V2.5 Pro 84.0%
team allocation GLM-5.1 82.8%
team allocation Gemma 4 31B 82.4%
team allocation Gemini 2.5 Flash 81.2%
team allocation Gemma 4 12B 78.0%
team allocation Qwen3 14B 62.0%

SB10K — German sentiment

Native German social-media sentiment classification — positive, neutral or negative. Human-annotated German text, not translated. Run reasoning-off; scored as exact-match accuracy on the predicted label.

1,024 questions · 3-class sentiment · Native German · SB10K (via EuroEval)
# Model Score Latency Tok in Tok out Cost Date
1 Gemini 3.1 Pro 🔒 reasoning 70.1% 3.30s 625.7 205.5 $3.81 2026-06-09
2 Opus 4.8 64.5% 1.88s 1,372.7 5 $7.16 2026-06-13
3 Gemini 3.5 Flash 64.3% 0.67s 650.6 1.7 $1.02 2026-06-09
4 Qwen3.7 Max 63.2% 0.99s 771.9 1.7 $0.99 2026-06-09
5 GPT-5.5 62.5% 1.36s 756.9 8.1 $4.12 2026-06-13
6 DeepSeek V4 Flash 62.4% 0.54s 778.6 2.7 $0.11 2026-06-09
7 Gemma 4 12B 62.2% 21.45s 758.7 2.7 - 2026-06-11
8 MiMo V2.5 Pro 61.7% 0.45s 1,140.3 2.8 $0.99 2026-06-09
9 Gemini 2.5 Flash 61.6% 0.40s 650.7 1.8 $0.20 2026-06-09
10 Gemma 4 31B 61.0% 5.46s 758.7 2.8 $0.09 2026-06-11
11 Gemini 3.1 Flash-Lite 60.4% 0.46s 650.7 1.8 $0.34 2026-06-08
12 DeepSeek V4 Pro 60.0% 1.42s 778.6 2.1 $1.28 2026-06-09
13 Qwen3.6 35B-A3B 58.4% 0.45s 771.9 2.7 $0.12 2026-06-09
14 Qwen3 14B 56.1% 13.26s 899.3 2.7 - 2026-06-11
15 Qwen3.5 9B 55.7% 6.96s 771.9 2.6 $0.08 2026-06-11
16 Tencent HY3-Preview 38.9% 2.35s 886.7 2.7 $0.06 2026-06-09
17 GLM-5.1 23.1% 1.88s 765.6 2.6 $1.09 2026-06-09
18 Gemma 4 26B A4B 15.2% 0.70s 758.8 2.8 $0.12 2026-06-08
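
Exact-match scoring on a predicted label, as described for SB10K, is strict: an answer that wraps the label in extra words counts as wrong. The following sketch illustrates that behaviour. The English label strings and the light normalisation (trimming whitespace, case and a trailing full stop) are assumptions; the prompt and label vocabulary actually used are not given in the source.

```python
from typing import Optional

LABELS = ("positive", "neutral", "negative")  # assumed label vocabulary


def normalize(raw: str) -> Optional[str]:
    """Trim whitespace, case and a trailing full stop; anything else is invalid."""
    text = raw.strip().lower().rstrip(".")
    return text if text in LABELS else None


def exact_match_accuracy(gold: list, outputs: list) -> float:
    """Share of outputs whose normalised text equals the gold label."""
    if len(gold) != len(outputs):
        raise ValueError("gold and outputs must have the same length")
    hits = sum(normalize(output) == label for label, output in zip(gold, outputs))
    return hits / len(gold)


# Hypothetical model outputs: two clean hits, one verbose answer, one wrong label.
gold = ["positive", "neutral", "negative", "neutral"]
outputs = ["Positive", " neutral.", "I think negative", "negative"]
accuracy = exact_match_accuracy(gold, outputs)
```

Under a rule like this, a model that habitually answers in full sentences can score below the one-in-three chance level for a 3-class task even if its underlying judgement is sound.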

ScaLA — German acceptability

Native German linguistic acceptability — does the sentence read as grammatical German (ja / nein)? Built from clean vs. minimally-corrupted German sentences. Run reasoning-off.

2,048 questions · Binary acceptability · Native German · ScaLA-de (via EuroEval)
# Model Score Latency Tok in Tok out Cost Date
1 Opus 4.8 82.6% 2.00s 1,684.8 4 $17.46 2026-06-13
2 Gemini 3.1 Pro 🔒 reasoning 80.4% 3.71s 777.9 262 $10.37 2026-06-09
3 GPT-5.5 🔒 reasoning 79.7% 1.52s 913.2 27.6 $11.05 2026-06-13
4 Gemini 3.5 Flash 79.4% 0.79s 802.7 1.5 $2.49 2026-06-09
5 Gemini 2.5 Flash 77.1% 0.40s 802.7 1.5 $0.50 2026-06-09
6 DeepSeek V4 Flash 74.9% 0.53s 949.8 2.5 $0.27 2026-06-09
7 MiMo V2.5 Pro 74.8% 0.57s 1,301 2.5 $2.33 2026-06-09
8 Gemini 3.1 Flash-Lite 74.2% 0.46s 802.7 1.5 $0.83 2026-06-08
9 DeepSeek V4 Pro 73.5% 1.41s 949.8 1.6 $3.12 2026-06-09
10 Gemma 4 31B 71.0% 1.28s 910.7 2.5 $0.23 2026-06-11
11 GLM-5.1 69.9% 1.69s 901.6 2.5 $2.57 2026-06-09
12 Qwen3.7 Max 69.8% 1.02s 928.9 1.6 $2.39 2026-06-09
13 Gemma 4 26B A4B 66.3% 0.46s 910.7 2.6 $0.28 2026-06-08
14 Qwen3.6 35B-A3B 64.2% 0.51s 928.9 2.6 $0.29 2026-06-09
15 Gemma 4 12B 64.1% 27.85s 910.7 2.6 - 2026-06-11
16 Qwen3 14B 64.0% 15.71s 1,060 2.6 - 2026-06-11
17 Tencent HY3-Preview 62.5% 2.48s 1,027.6 2.6 $0.13 2026-06-09
18 Qwen3.5 9B 62.1% 8.40s 928.9 2.5 $0.19 2026-06-11
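
The idea of pairing a clean sentence with a minimally corrupted one can be illustrated as follows. The two corruption operations used here (swapping two neighbouring words, or deleting one word) are an assumption made for the sketch; the source only says the sentences are "minimally-corrupted" and does not specify how. The function names and the example sentence are likewise hypothetical.

```python
import random


def corrupt(sentence: str, rng: random.Random) -> str:
    """Apply one small edit: swap two neighbouring words or delete one word."""
    words = sentence.split()
    if len(words) < 2:
        raise ValueError("need at least two words to corrupt")
    if rng.random() < 0.5:
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    else:
        del words[rng.randrange(len(words))]
    return " ".join(words)


def make_pair(sentence: str, rng: random.Random) -> list:
    """Return the clean sentence labelled 'ja' and a corrupted copy labelled 'nein'."""
    return [(sentence, "ja"), (corrupt(sentence, rng), "nein")]


clean = "Der Hund schläft heute im Garten"
pair = make_pair(clean, random.Random(0))
```

A purely random edit can occasionally leave a sentence that is still grammatical, so a real dataset builder would need extra checks that this sketch omits.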

TPS is decode speed — output tokens per second after the first token; higher is faster. TTFT is time to first token; lower is snappier. A snapshot, not a constant.

# Model TPS TTFT
1 gpt-oss-120b 🔒 362 0.51s
2 Gemini 3.1 Flash-Lite 302 4.99s
3 Gemini 2.5 Flash-Lite 293 0.36s
4 Gemini 2.5 Flash 216 0.61s
5 Gemini 3.5 Flash 203 0.88s
6 grok-4.3 171 0.60s
7 Qwen3.7 Max 170 1.61s
8 Qwen3.6 35B-A3B 142 1.18s
9 Claude Haiku 4.5 141 0.64s
10 Gemini 3.1 Pro 🔒 127 -
11 DeepSeek V4 Flash 103 1.00s
12 Tencent HY3-Preview 95 2.47s
13 Ministral 14B 87 0.41s
14 Gemma 4 26B A4B 84 0.95s
15 GLM-5.1 71 0.89s
16 Qwen3 14B 65 1.02s
17 DeepSeek V4 Pro 58 1.02s
18 Qwen3.5 9B 58 1.21s
19 MiMo V2.5 Pro 49 2.03s
20 Gemma 4 12B 46 0.83s
21 Gemma 4 31B 38 1.01s
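
The two metrics defined above the table can be computed from three timestamps on a streamed response. The sketch below follows those definitions literally: TTFT is the wait until the first token, and TPS counts the tokens that arrive after the first one. The `StreamTiming` structure and its field names are hypothetical; how the leaderboard actually instruments its requests is not stated.

```python
from dataclasses import dataclass


@dataclass
class StreamTiming:
    request_sent: float   # seconds, when the request was sent
    first_token: float    # seconds, when the first output token arrived
    last_token: float     # seconds, when the last output token arrived
    output_tokens: int    # total output tokens, including the first


def ttft(timing: StreamTiming) -> float:
    """Time to first token, in seconds (lower is snappier)."""
    return timing.first_token - timing.request_sent


def decode_tps(timing: StreamTiming) -> float:
    """Output tokens per second after the first token (higher is faster)."""
    decode_time = timing.last_token - timing.first_token
    if timing.output_tokens < 2 or decode_time <= 0:
        raise ValueError("need at least two tokens and a positive decode time")
    return (timing.output_tokens - 1) / decode_time


# Hypothetical measurement: 401 tokens, first after 0.5 s, last after 2.5 s.
sample = StreamTiming(request_sent=0.0, first_token=0.5, last_token=2.5, output_tokens=401)
```

For this sample, 400 tokens arrive in the 2.0 seconds after the first token, so the decode speed is 200 tokens per second with a TTFT of 0.5 seconds.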

Quality vs. speed

Average score across all benchmarks against decode speed (output tokens per second). Up and to the right is better — smarter and faster. The gold line is the speed frontier: the best score available at each speed.

[Chart: average score (axis from 55% to 85%) against output speed in tokens/sec (axis from 50 to 500); faster and smarter is better. Plotted points:]
Gemini 3.1 Pro 🔒 * · 80.7% · 127 tok/s
Gemini 3.5 Flash · 80.6% · 203 tok/s
Gemini 3.1 Flash-Lite · 77.6% · 302 tok/s
Gemini 2.5 Flash · 76.9% · 216 tok/s
Gemma 4 31B · 76.5% · 38 tok/s
Claude Haiku 4.5 * · 75.6% · 141 tok/s
Gemini 2.5 Flash-Lite * · 75.4% · 293 tok/s
MiMo V2.5 Pro * · 73.4% · 49 tok/s
Qwen3.6 35B-A3B * · 72.8% · 142 tok/s
Gemma 4 12B · 72.7% · 46 tok/s
Qwen3.7 Max * · 71.6% · 170 tok/s
DeepSeek V4 Pro * · 71.5% · 58 tok/s
DeepSeek V4 Flash · 70.5% · 103 tok/s
Qwen3.5 9B · 69.7% · 58 tok/s
gpt-oss-120b 🔒 * · 66.2% · 362 tok/s
Gemma 4 26B A4B · 67.7% · 84 tok/s
Tencent HY3-Preview * · 67.4% · 95 tok/s
Qwen3 14B · 66.2% · 65 tok/s
grok-4.3 * · 63.3% · 171 tok/s
GLM-5.1 * · 58.8% · 71 tok/s
Ministral 14B * · 58.3% · 87 tok/s

* scored on INCLUDE only — its average covers fewer benchmarks, so it isn't directly comparable to the full-coverage models.

🔒 reasoning can't be disabled — its decode speed includes forced reasoning tokens, so it isn't directly comparable to the reasoning-off models.
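
The speed frontier described under "Quality vs. speed" (the best score available at each speed) can be sketched as a sweep from the fastest model to the slowest, keeping each model that beats every faster one. Reading "available at each speed" as "among models at least that fast" is an assumption, as is the function name. The sample points are a handful of values from the chart labels, and the sketch ignores the comparability caveats in the footnotes, so its output need not match the gold line on the page.

```python
def speed_frontier(points):
    """points: list of (name, tokens_per_sec, avg_score).

    Returns the models not beaten on score by any model that is at least
    as fast, ordered from fastest to slowest.
    """
    frontier = []
    best_score = float("-inf")
    # Fastest first; at equal speed, the higher score is considered first.
    for name, speed, score in sorted(points, key=lambda p: (-p[1], -p[2])):
        if score > best_score:
            frontier.append((name, speed, score))
            best_score = score
    return frontier


# A subset of the plotted points (speed in tok/s, average score in %).
points = [
    ("Gemini 3.1 Pro", 127, 80.7),
    ("Gemini 3.5 Flash", 203, 80.6),
    ("Gemini 3.1 Flash-Lite", 302, 77.6),
    ("Gemini 2.5 Flash", 216, 76.9),
    ("Gemma 4 31B", 38, 76.5),
    ("Gemini 2.5 Flash-Lite", 293, 75.4),
    ("gpt-oss-120b", 362, 66.2),
]
frontier = speed_frontier(points)
```

On this subset the frontier runs gpt-oss-120b, Gemini 3.1 Flash-Lite, Gemini 3.5 Flash, Gemini 3.1 Pro: each one is the best score you can get without dropping below its speed.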