German Artificial Analytics
a PeerBench project

Which fast LLM speaks German best?

Independent German-language benchmarks for the fastest models — accuracy, latency, and cost, side by side.

models: 12 · benchmarks: 3 · updated: 2026-06-03
# Model INCLUDE (DE) MMLU-ProX (DE) MMMLU (DE) Avg
1 Gemini 3.5 Flash 72.7% 86.5% 89.3% 82.8%
2 Gemini 3.1 Flash-Lite 72.7% 82.1% 86.7% 80.5%
3 Gemini 2.5 Flash 70.5% 79.6% 84.7% 78.3%
4 Qwen3.6 35B-A3B 68.3% 80.0% 85.8% 78.1%
5 DeepSeek V4 Flash 70.5% 74.3% 85.1% 76.6%
6 Claude Haiku 4.5 68.3% n/a 83.1% 75.7%
7 Tencent HY3-Preview 69.1% 73.2% 83.7% 75.3%
8 DeepSeek V4 Pro 69.8% n/a n/a 69.8%
9 Gemma 4 31B 67.6% n/a n/a 67.6%
10 gpt-oss-120b 66.2% n/a n/a 66.2%
11 Mercury 2 54.0% 63.6% 68.0% 61.9%
12 GLM-5.1 47.5% n/a n/a 47.5%
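
The Avg column looks like a plain mean over whichever benchmarks a model was scored on. The sketch below assumes exactly that; the function name `average_score` is hypothetical and not taken from the site. Because it works from the rounded percentages shown in the table, it can differ from a published average by 0.1 (Qwen3.6 35B-A3B comes out at 78.0 rather than 78.1).

```python
from typing import Optional


def average_score(scores: list[Optional[float]]) -> Optional[float]:
    """Mean of the scores that are present, rounded to one decimal place.

    Assumption: missing benchmarks (None) are skipped rather than counted as 0.
    """
    present = [s for s in scores if s is not None]
    if not present:
        return None
    return round(sum(present) / len(present), 1)


# Rows copied from the table above, in the order INCLUDE, MMLU-ProX, MMMLU.
gemini_35_flash = average_score([72.7, 86.5, 89.3])
claude_haiku_45 = average_score([68.3, None, 83.1])
deepseek_v4_pro = average_score([69.8, None, None])

print(gemini_35_flash, claude_haiku_45, deepseek_v4_pro)
```

A model scored on a single benchmark simply inherits that score as its average, which is why partial-coverage rows are not directly comparable to full-coverage ones.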

Quality vs. cost

Average score across all benchmarks against cost per 1,000 questions (log scale). Up and to the left is better — more accuracy per dollar. The gold line is the value frontier: the best score available at each price.

[Scatter plot. x-axis: cost per 1,000 questions (log scale, $0.05 to $5). y-axis: average score (40% to 90%). Up and to the left is better: more score per dollar.]
Model · Avg score · Cost per 1,000 questions
Gemini 3.5 Flash · 82.8% · $3.97
Gemini 3.1 Flash-Lite · 80.5% · $0.59
DeepSeek V4 Flash · 76.6% · $0.10
Gemini 2.5 Flash · 78.3% · $1.52
Qwen3.6 35B-A3B · 78.1% · $0.84
Claude Haiku 4.5 * · 75.7% · $4.23
Tencent HY3-Preview · 75.3% · $0.15
gpt-oss-120b * · 66.2% · $0.06
DeepSeek V4 Pro * · 69.8% · $0.48
Gemma 4 31B * · 67.6% · $0.14
Mercury 2 · 61.9% · $0.30
GLM-5.1 * · 47.5% · $1.14

* scored on INCLUDE only — its average covers fewer benchmarks, so it isn't directly comparable to the full-coverage models.
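
The value frontier described above ("the best score available at each price") can be sketched as a Pareto sweep over cost and score. This is a minimal illustration, not the site's code: the function name `value_frontier` is hypothetical, and whether the site includes INCLUDE-only models in its frontier is an assumption here.

```python
def value_frontier(points: dict[str, tuple[float, float]]) -> list[str]:
    """Return the models on the value frontier, cheapest first.

    points maps model name -> (cost per 1,000 questions in USD, average score in %).
    A model is kept only if it scores higher than every model that costs the
    same or less.
    """
    frontier: list[str] = []
    best_score = float("-inf")
    # Sort by cost ascending; on equal cost, visit the higher score first.
    ordered = sorted(points.items(), key=lambda kv: (kv[1][0], -kv[1][1]))
    for name, (_cost, score) in ordered:
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


# Cost and average score as labelled in the chart (all 12 models included).
POINTS = {
    "Gemini 3.5 Flash": (3.97, 82.8),
    "Gemini 3.1 Flash-Lite": (0.59, 80.5),
    "DeepSeek V4 Flash": (0.10, 76.6),
    "Gemini 2.5 Flash": (1.52, 78.3),
    "Qwen3.6 35B-A3B": (0.84, 78.1),
    "Claude Haiku 4.5": (4.23, 75.7),
    "Tencent HY3-Preview": (0.15, 75.3),
    "gpt-oss-120b": (0.06, 66.2),
    "DeepSeek V4 Pro": (0.48, 69.8),
    "Gemma 4 31B": (0.14, 67.6),
    "Mercury 2": (0.30, 61.9),
    "GLM-5.1": (1.14, 47.5),
}

print(value_frontier(POINTS))
# → ['gpt-oss-120b', 'DeepSeek V4 Flash', 'Gemini 3.1 Flash-Lite', 'Gemini 3.5 Flash']
```

Every model off the frontier is matched or beaten on score by something cheaper, so under these assumptions only four of the twelve models are undominated on price.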

INCLUDE — German

Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.

139 questions · 4-option multiple choice · Native German · Source: CohereLabs/include-base-44
# Model Score Latency Tok in Tok out Cost Date
1 Gemini 3.5 Flash 72.7% 1.88s 105.8 214.9 $0.29 2026-06-02
2 Gemini 3.1 Flash-Lite 72.7% 1.27s 106.1 153.2 $0.04 2026-06-02
3 DeepSeek V4 Flash 70.5% 2.61s 118.6 108.4 $0.01 2026-06-02
4 Gemini 2.5 Flash 70.5% 1.61s 105.8 193.6 $0.07 2026-06-02
5 DeepSeek V4 Pro 69.8% 3.59s 118.6 113.6 $0.07 2026-06-02
6 Tencent HY3-Preview 69.1% 5.48s 143.9 222.4 $0.01 2026-06-02
7 Qwen3.6 35B-A3B 68.3% 2.72s 118.4 545.1 $0.08 2026-06-02
8 Claude Haiku 4.5 68.3% 3.45s 154.6 347.4 $0.26 2026-06-02
9 Gemma 4 31B 67.6% 11.86s 118.8 257.9 $0.02 2026-06-03
10 gpt-oss-120b 🔒 reasoning 66.2% 1.71s 154.1 21.4 $0.01 2026-05-29
11 Mercury 2 54.0% 0.48s 111.3 58.6 $0.01 2026-06-02
12 GLM-5.1 47.5% 7.54s 113.9 257.2 $0.16 2026-06-03

MMLU-Pro — German

Hard academic questions across 14 subjects — STEM, law, health, economics, philosophy and more. Professionally translated to German, with up to ten answer options per question.

11,759 questions · 10-option multiple choice · Professional translation · Source: li-lab/MMLU-ProX
# Model Score Latency Tok in Tok out Cost Date
1 Gemini 3.5 Flash 86.5% 2.23s 1,664.9 353.3 $66.76 2026-05-26
2 Gemini 3.1 Flash-Lite 82.1% 1.80s 1,664.9 303.8 $10.25 2026-05-26
3 Qwen3.6 35B-A3B 80.0% 4.67s 1,689.6 984 $13.36 2026-05-31
4 Gemini 2.5 Flash 79.6% 2.69s 1,664.9 786 $28.89 2026-05-26
5 DeepSeek V4 Flash 74.3% 2.16s 1,718.6 161.9 $1.61 2026-05-28
6 Tencent HY3-Preview 73.2% 6.14s 1,910.2 586.3 $2.76 2026-05-27
7 Mercury 2 63.6% 1.13s 1,563.3 580.3 $6.22 2026-05-29

Per-subject breakdown (98 rows)
Subject Model Score
biology Gemini 3.5 Flash 92.9%
biology Qwen3.6 35B-A3B 89.7%
biology Gemini 3.1 Flash-Lite 89.3%
biology Gemini 2.5 Flash 88.8%
biology DeepSeek V4 Flash 88.6%
biology Tencent HY3-Preview 85.8%
biology Mercury 2 76.8%
business Gemini 3.5 Flash 89.6%
business Gemini 3.1 Flash-Lite 86.1%
business Gemini 2.5 Flash 83.5%
business Qwen3.6 35B-A3B 83.5%
business Tencent HY3-Preview 78.3%
business DeepSeek V4 Flash 73.3%
business Mercury 2 70.1%
chemistry Gemini 3.5 Flash 89.1%
chemistry Qwen3.6 35B-A3B 88.2%
chemistry Gemini 2.5 Flash 86.9%
chemistry Gemini 3.1 Flash-Lite 86.4%
chemistry Tencent HY3-Preview 75.6%
chemistry DeepSeek V4 Flash 74.2%
chemistry Mercury 2 72.0%
computer science Gemini 3.5 Flash 87.6%
computer science Gemini 3.1 Flash-Lite 86.3%
computer science Gemini 2.5 Flash 85.1%
computer science Qwen3.6 35B-A3B 85.1%
computer science DeepSeek V4 Flash 82.7%
computer science Tencent HY3-Preview 76.8%
computer science Mercury 2 71.7%
economics Gemini 3.5 Flash 89.1%
economics Gemini 3.1 Flash-Lite 87.1%
economics Gemini 2.5 Flash 86.4%
economics Qwen3.6 35B-A3B 85.3%
economics Tencent HY3-Preview 82.3%
economics DeepSeek V4 Flash 77.6%
economics Mercury 2 69.5%
engineering Gemini 3.5 Flash 82.5%
engineering Gemini 3.1 Flash-Lite 77.7%
engineering Qwen3.6 35B-A3B 77.5%
engineering Gemini 2.5 Flash 71.6%
engineering Tencent HY3-Preview 64.4%
engineering DeepSeek V4 Flash 57.4%
engineering Mercury 2 47.7%
health Gemini 3.5 Flash 78.6%
health Gemini 3.1 Flash-Lite 75.5%
health Tencent HY3-Preview 73.7%
health Qwen3.6 35B-A3B 72.8%
health Gemini 2.5 Flash 72.5%
health DeepSeek V4 Flash 72.3%
health Mercury 2 60.8%
history Gemini 3.5 Flash 80.6%
history Gemini 3.1 Flash-Lite 75.9%
history Tencent HY3-Preview 71.4%
history Gemini 2.5 Flash 70.9%
history DeepSeek V4 Flash 69.8%
history Qwen3.6 35B-A3B 69.0%
history Mercury 2 48.6%
law Gemini 3.5 Flash 72.8%
law Gemini 3.1 Flash-Lite 62.7%
law Gemini 2.5 Flash 54.0%
law Qwen3.6 35B-A3B 52.8%
law DeepSeek V4 Flash 46.3%
law Tencent HY3-Preview 45.8%
law Mercury 2 30.1%
math Gemini 3.5 Flash 94.6%
math Qwen3.6 35B-A3B 92.7%
math Gemini 3.1 Flash-Lite 91.0%
math Gemini 2.5 Flash 90.5%
math DeepSeek V4 Flash 88.5%
math Tencent HY3-Preview 86.5%
math Mercury 2 81.4%
other Gemini 3.5 Flash 82.4%
other Gemini 3.1 Flash-Lite 76.2%
other Gemini 2.5 Flash 73.5%
other Tencent HY3-Preview 72.1%
other DeepSeek V4 Flash 70.8%
other Qwen3.6 35B-A3B 70.8%
other Mercury 2 56.9%
philosophy Gemini 3.5 Flash 83.2%
philosophy Gemini 3.1 Flash-Lite 76.0%
philosophy Gemini 2.5 Flash 70.9%
philosophy Qwen3.6 35B-A3B 69.5%
philosophy Tencent HY3-Preview 69.1%
philosophy DeepSeek V4 Flash 68.3%
philosophy Mercury 2 49.1%
physics Gemini 3.5 Flash 90.4%
physics Qwen3.6 35B-A3B 88.0%
physics Gemini 3.1 Flash-Lite 86.8%
physics Gemini 2.5 Flash 85.5%
physics DeepSeek V4 Flash 84.1%
physics Mercury 2 71.9%
physics Tencent HY3-Preview 64.7%
psychology Gemini 3.5 Flash 87.8%
psychology Gemini 3.1 Flash-Lite 83.7%
psychology Gemini 2.5 Flash 82.6%
psychology DeepSeek V4 Flash 81.1%
psychology Tencent HY3-Preview 80.5%
psychology Qwen3.6 35B-A3B 78.7%
psychology Mercury 2 65.7%

MMMLU — German

OpenAI's multilingual MMLU, German split — general knowledge spanning STEM, the humanities, social sciences and other domains. Professionally translated to German.

14,042 questions · 4-option multiple choice · Professional translation · Source: openai/MMMLU
# Model Score Latency Tok in Tok out Cost Date
1 Gemini 3.5 Flash 89.3% 2.05s 172 255.8 $35.95 2026-05-26
2 Gemini 3.1 Flash-Lite 86.7% 1.47s 172 212 $5.07 2026-05-26
3 Qwen3.6 35B-A3B 85.8% 4.39s 182.9 561.4 $8.27 2026-06-01
4 DeepSeek V4 Flash 85.1% 2.54s 190.8 140.2 $0.92 2026-05-28
5 Gemini 2.5 Flash 84.7% 1.48s 172 275.5 $10.40 2026-05-26
6 Tencent HY3-Preview 83.7% 5.55s 223.4 280.9 $1.14 2026-05-28
7 Claude Haiku 4.5 83.1% 4.20s 2,267.2 397.6 $59.74 2026-06-02
8 Mercury 2 68.0% 0.43s 172.2 90.2 $1.53 2026-05-30
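
The page does not state how the "cost per 1,000 questions" figure in the quality vs. cost chart is computed. One reading that matches the published numbers is: total run cost across the benchmarks a model was scored on, divided by the total number of questions in those benchmarks, times 1,000. The sketch below assumes that formula; `cost_per_1k` and `QUESTIONS` are hypothetical names. Because the Cost columns are rounded to the cent, models with very cheap INCLUDE-only runs can land a cent or two off the chart value.

```python
# Question counts as listed under each benchmark heading.
QUESTIONS = {"INCLUDE": 139, "MMLU-ProX": 11_759, "MMMLU": 14_042}


def cost_per_1k(run_costs: dict[str, float]) -> float:
    """Blended USD cost per 1,000 questions over the benchmarks actually run.

    run_costs maps benchmark name -> total cost of that run in USD.
    Assumption: the site pools cost and question counts across benchmarks.
    """
    total_cost = sum(run_costs.values())
    total_questions = sum(QUESTIONS[name] for name in run_costs)
    return round(1000 * total_cost / total_questions, 2)


# Costs copied from the three benchmark tables.
gemini_35_flash = cost_per_1k({"INCLUDE": 0.29, "MMLU-ProX": 66.76, "MMMLU": 35.95})
claude_haiku_45 = cost_per_1k({"INCLUDE": 0.26, "MMMLU": 59.74})

print(gemini_35_flash)  # chart label: $3.97 / 1k
print(claude_haiku_45)  # chart label: $4.23 / 1k
```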

Streaming throughput on a controlled workload (~2,000-token German prompt, 400-token output cap, one request at a time). TPS is decode speed — output tokens per second after the first token. TTFT is time to first token. Measured separately from the accuracy benchmarks (their reasoning-off answers are too short to time). Provider-pinned; a snapshot, not a constant.

# Model TPS TTFT
1 gpt-oss-120b 🔒 1,641 0.31s
2 Mercury 2 541 0.39s
3 Gemini 3.1 Flash-Lite 223 0.61s
4 Gemini 3.5 Flash 181 0.75s
5 Qwen3.6 35B-A3B 159 0.62s
6 Gemini 2.5 Flash 159 0.41s
7 Claude Haiku 4.5 131 0.79s
8 DeepSeek V4 Flash 115 0.92s
9 Tencent HY3-Preview 107 2.69s
10 GLM-5.1 32 0.65s
11 Gemma 4 31B 24 0.59s
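
The two metrics defined above can be derived from token arrival timestamps on a streamed response. The following is a minimal sketch under stated assumptions: it expects one timestamp per output token (real streaming APIs may deliver several tokens per chunk), and the names `StreamTiming` and `measure` are hypothetical, not the probe's actual code.

```python
from dataclasses import dataclass


@dataclass
class StreamTiming:
    ttft_s: float  # time to first token, in seconds
    tps: float     # decode speed: output tokens per second after the first token


def measure(request_sent: float, token_times: list[float]) -> StreamTiming:
    """Derive TTFT and decode TPS from token arrival times (seconds)."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure decode speed")
    first, last = token_times[0], token_times[-1]
    if last <= first:
        raise ValueError("token timestamps must be increasing")
    ttft = first - request_sent
    # The first token is excluded: its latency is already counted in TTFT.
    tps = (len(token_times) - 1) / (last - first)
    return StreamTiming(ttft_s=ttft, tps=tps)


# Synthetic stream: first token after 0.5 s, then one token every 10 ms.
arrivals = [(500 + 10 * i) / 1000 for i in range(401)]
timing = measure(request_sent=0.0, token_times=arrivals)
print(timing.ttft_s, timing.tps)
# → 0.5 100.0
```

Separating the two numbers matters for short answers: a model with a slow first token but fast decode can still feel sluggish when the reply is only a few tokens long.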

Quality vs. speed

Average score across all benchmarks against decode speed (output tokens per second) from the throughput probe. Up and to the right is better — smarter and faster. The gold line is the speed frontier: the best score available at each speed.

[Scatter plot. x-axis: output speed (tokens / sec, 50 to 2,000). y-axis: average score (40% to 90%). Up and to the right is better: faster and smarter.]
Model · Avg score · Output speed
Gemini 3.5 Flash · 82.8% · 181 tok/s
Gemini 3.1 Flash-Lite · 80.5% · 223 tok/s
Gemini 2.5 Flash · 78.3% · 159 tok/s
Qwen3.6 35B-A3B · 78.1% · 159 tok/s
DeepSeek V4 Flash · 76.6% · 115 tok/s
Claude Haiku 4.5 * · 75.7% · 131 tok/s
Tencent HY3-Preview · 75.3% · 107 tok/s
gpt-oss-120b 🔒 * · 66.2% · 1,641 tok/s
Gemma 4 31B * · 67.6% · 24 tok/s
Mercury 2 · 61.9% · 541 tok/s
GLM-5.1 * · 47.5% · 32 tok/s

* scored on INCLUDE only — its average covers fewer benchmarks, so it isn't directly comparable to the full-coverage models.

🔒 reasoning can't be disabled — its decode speed includes forced reasoning tokens, so it isn't directly comparable to the reasoning-off models.