a PeerBench project

Which fast LLM speaks German best?

Independent German-language benchmarks for the fastest models — accuracy, latency, and cost, side by side.

models: 12
benchmarks: 3
updated: 2026-06-03

#	Model	INCLUDE (DE)	MMLU-ProX (DE)	MMMLU (DE)	Avg
1	Gemini 3.5 Flash	72.7%	86.5%	89.3%	82.8%
2	Gemini 3.1 Flash-Lite	72.7%	82.1%	86.7%	80.5%
3	Gemini 2.5 Flash	70.5%	79.6%	84.7%	78.3%
4	Qwen3.6 35B-A3B	68.3%	80.0%	85.8%	78.1%
5	DeepSeek V4 Flash	70.5%	74.3%	85.1%	76.6%
6	Claude Haiku 4.5	68.3%	—	83.1%	75.7%
7	Tencent HY3-Preview	69.1%	73.2%	83.7%	75.3%
8	DeepSeek V4 Pro	69.8%	—	—	69.8%
9	Gemma 4 31B	67.6%	—	—	67.6%
10	gpt-oss-120b	66.2%	—	—	66.2%
11	Mercury 2	54.0%	63.6%	68.0%	61.9%
12	GLM-5.1	47.5%	—	—	47.5%
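
The Avg column appears to be a plain mean over whichever benchmarks a model was actually run on, which is why INCLUDE-only models carry their INCLUDE score as their average. A minimal sketch of that reading, using three rows copied from the table above; the function name `average_available` is illustrative, and the site may average unrounded scores, so a published value can differ in the last digit:

```python
from statistics import mean

# Scores copied from the summary table; None marks a benchmark with no result ("—").
scores = {
    "Gemini 3.5 Flash": {"INCLUDE": 72.7, "MMLU-ProX": 86.5, "MMMLU": 89.3},
    "Claude Haiku 4.5": {"INCLUDE": 68.3, "MMLU-ProX": None, "MMMLU": 83.1},
    "DeepSeek V4 Pro": {"INCLUDE": 69.8, "MMLU-ProX": None, "MMMLU": None},
}

def average_available(row):
    """Mean of the benchmarks that have a score, plus how many were used."""
    present = [value for value in row.values() if value is not None]
    return round(mean(present), 1), len(present)

for model, row in scores.items():
    avg, n = average_available(row)
    print(f"{model}: {avg}% over {n} benchmark(s)")
```

Reporting the benchmark count next to the average makes it obvious when two averages are not built from the same inputs.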

Quality vs. cost

Average score across all benchmarks against cost per 1,000 questions (log scale). Up and to the left is better — more accuracy per dollar. The gold line is the value frontier: the best score available at each price.

* Scored on INCLUDE only (DeepSeek V4 Pro, Gemma 4 31B, gpt-oss-120b, GLM-5.1). The average covers a single benchmark, so it isn't directly comparable to the averages of full-coverage models.
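
One way to read "the best score available at each price" is as a sweep from cheapest to most expensive that keeps a model only when it beats everything cheaper. The sketch below assumes that reading; it is not necessarily how the site draws its gold line. As sample input it uses the Score and Cost columns of the INCLUDE table further down rather than the chart's own axes, and the name `value_frontier` is illustrative.

```python
# (model, cost in USD, score in %) copied from the INCLUDE table.
points = [
    ("Gemini 3.5 Flash", 0.29, 72.7),
    ("Gemini 3.1 Flash-Lite", 0.04, 72.7),
    ("DeepSeek V4 Flash", 0.01, 70.5),
    ("Gemini 2.5 Flash", 0.07, 70.5),
    ("DeepSeek V4 Pro", 0.07, 69.8),
    ("Tencent HY3-Preview", 0.01, 69.1),
    ("Qwen3.6 35B-A3B", 0.08, 68.3),
    ("Claude Haiku 4.5", 0.26, 68.3),
    ("Gemma 4 31B", 0.02, 67.6),
    ("gpt-oss-120b", 0.01, 66.2),
    ("Mercury 2", 0.01, 54.0),
    ("GLM-5.1", 0.16, 47.5),
]

def value_frontier(points):
    """Keep each point that scores strictly higher than every cheaper-or-equal one."""
    frontier = []
    best = float("-inf")
    # Cheapest first; within one price, highest score first.
    for name, cost, score in sorted(points, key=lambda p: (p[1], -p[2])):
        if score > best:
            frontier.append((name, cost, score))
            best = score
    return frontier

for name, cost, score in value_frontier(points):
    print(f"${cost:.2f}  {score:.1f}%  {name}")
# $0.01  70.5%  DeepSeek V4 Flash
# $0.04  72.7%  Gemini 3.1 Flash-Lite
```

Using a strict comparison means a model that only ties the current best at a higher price, as Gemini 3.5 Flash does here, stays off the frontier.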

INCLUDE — German

Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.

139 questions · 4-option multiple choice · Native German · Dataset: CohereLabs/include-base-44

#	Model	Score	Latency	Tok in	Tok out	Cost	Date
1	Gemini 3.5 Flash	72.7%	1.88s	105.8	214.9	$0.29	2026-06-02
2	Gemini 3.1 Flash-Lite	72.7%	1.27s	106.1	153.2	$0.04	2026-06-02
3	DeepSeek V4 Flash	70.5%	2.61s	118.6	108.4	$0.01	2026-06-02
4	Gemini 2.5 Flash	70.5%	1.61s	105.8	193.6	$0.07	2026-06-02
5	DeepSeek V4 Pro	69.8%	3.59s	118.6	113.6	$0.07	2026-06-02
6	Tencent HY3-Preview	69.1%	5.48s	143.9	222.4	$0.01	2026-06-02
7	Qwen3.6 35B-A3B	68.3%	2.72s	118.4	545.1	$0.08	2026-06-02
8	Claude Haiku 4.5	68.3%	3.45s	154.6	347.4	$0.26	2026-06-02
9	Gemma 4 31B	67.6%	11.86s	118.8	257.9	$0.02	2026-06-03
10	gpt-oss-120b 🔒 reasoning	66.2%	1.71s	154.1	21.4	$0.01	2026-05-29
11	Mercury 2	54.0%	0.48s	111.3	58.6	$0.01	2026-06-02
12	GLM-5.1	47.5%	7.54s	113.9	257.2	$0.16	2026-06-03
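
With only 139 questions, each score corresponds to a whole number of correct answers, and a single question moves the score by about 0.7 points. The sketch below recovers those counts from the rounded percentages in the table above; the function name `correct_answers` is illustrative.

```python
N_QUESTIONS = 139  # size of the INCLUDE German set, from the section header

def correct_answers(score_pct, n=N_QUESTIONS):
    """Recover the whole number of correct answers behind a rounded percentage."""
    return round(score_pct / 100 * n)

step = 100 / N_QUESTIONS  # score change caused by a single question
print(f"one question = {step:.2f} points")

for model, score in [
    ("Gemini 3.5 Flash", 72.7),
    ("DeepSeek V4 Flash", 70.5),
    ("gpt-oss-120b", 66.2),
]:
    print(f"{model}: {correct_answers(score)} of {N_QUESTIONS} correct")
# one question = 0.72 points
# Gemini 3.5 Flash: 101 of 139 correct
# DeepSeek V4 Flash: 98 of 139 correct
# gpt-oss-120b: 92 of 139 correct
```

Read this way, ranks 1 through 10 span 101 down to 92 correct answers, so adjacent ranks on this benchmark are often separated by a single question or none at all.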

MMLU-ProX — German

Hard academic questions across 14 subjects — STEM, law, health, economics, philosophy and more. Professionally translated to German, with up to ten answer options per question.

11,759 questions · 10-option multiple choice · Professional translation · Dataset: li-lab/MMLU-ProX

#	Model	Score	Latency	Tok in	Tok out	Cost	Date
1	Gemini 3.5 Flash	86.5%	2.23s	1,664.9	353.3	$66.76	2026-05-26
2	Gemini 3.1 Flash-Lite	82.1%	1.80s	1,664.9	303.8	$10.25	2026-05-26
3	Qwen3.6 35B-A3B	80.0%	4.67s	1,689.6	984	$13.36	2026-05-31
4	Gemini 2.5 Flash	79.6%	2.69s	1,664.9	786	$28.89	2026-05-26
5	DeepSeek V4 Flash	74.3%	2.16s	1,718.6	161.9	$1.61	2026-05-28
6	Tencent HY3-Preview	73.2%	6.14s	1,910.2	586.3	$2.76	2026-05-27
7	Mercury 2	63.6%	1.13s	1,563.3	580.3	$6.22	2026-05-29
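
The benchmark sets differ widely in size, so their Cost columns are not comparable until they are put on a common basis such as cost per 1,000 questions. The sketch below assumes the Cost column is the total spend for the whole run, which the page does not state explicitly; `cost_per_1000` is an illustrative name.

```python
N_QUESTIONS = 11_759  # MMLU-ProX German question count, from the section header

def cost_per_1000(total_cost_usd, n_questions):
    """Normalize a run's total cost (assumed) to cost per 1,000 questions."""
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    return total_cost_usd / n_questions * 1000

# (model, Cost column in USD) copied from the table above.
runs = [
    ("Gemini 3.5 Flash", 66.76),
    ("Gemini 3.1 Flash-Lite", 10.25),
    ("DeepSeek V4 Flash", 1.61),
]

for model, total in runs:
    print(f"{model}: ${cost_per_1000(total, N_QUESTIONS):.2f} per 1,000 questions")
```

The same function applies to the other benchmarks by swapping in their question counts.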

Per-subject breakdown (98 rows)

Subject	Model	Score
biology	Gemini 3.5 Flash	92.9%
biology	Qwen3.6 35B-A3B	89.7%
biology	Gemini 3.1 Flash-Lite	89.3%
biology	Gemini 2.5 Flash	88.8%
biology	DeepSeek V4 Flash	88.6%
biology	Tencent HY3-Preview	85.8%
biology	Mercury 2	76.8%
business	Gemini 3.5 Flash	89.6%
business	Gemini 3.1 Flash-Lite	86.1%
business	Gemini 2.5 Flash	83.5%
business	Qwen3.6 35B-A3B	83.5%
business	Tencent HY3-Preview	78.3%
business	DeepSeek V4 Flash	73.3%
business	Mercury 2	70.1%
chemistry	Gemini 3.5 Flash	89.1%
chemistry	Qwen3.6 35B-A3B	88.2%
chemistry	Gemini 2.5 Flash	86.9%
chemistry	Gemini 3.1 Flash-Lite	86.4%
chemistry	Tencent HY3-Preview	75.6%
chemistry	DeepSeek V4 Flash	74.2%
chemistry	Mercury 2	72.0%
computer science	Gemini 3.5 Flash	87.6%
computer science	Gemini 3.1 Flash-Lite	86.3%
computer science	Gemini 2.5 Flash	85.1%
computer science	Qwen3.6 35B-A3B	85.1%
computer science	DeepSeek V4 Flash	82.7%
computer science	Tencent HY3-Preview	76.8%
computer science	Mercury 2	71.7%
economics	Gemini 3.5 Flash	89.1%
economics	Gemini 3.1 Flash-Lite	87.1%
economics	Gemini 2.5 Flash	86.4%
economics	Qwen3.6 35B-A3B	85.3%
economics	Tencent HY3-Preview	82.3%
economics	DeepSeek V4 Flash	77.6%
economics	Mercury 2	69.5%
engineering	Gemini 3.5 Flash	82.5%
engineering	Gemini 3.1 Flash-Lite	77.7%
engineering	Qwen3.6 35B-A3B	77.5%
engineering	Gemini 2.5 Flash	71.6%
engineering	Tencent HY3-Preview	64.4%
engineering	DeepSeek V4 Flash	57.4%
engineering	Mercury 2	47.7%
health	Gemini 3.5 Flash	78.6%
health	Gemini 3.1 Flash-Lite	75.5%
health	Tencent HY3-Preview	73.7%
health	Qwen3.6 35B-A3B	72.8%
health	Gemini 2.5 Flash	72.5%
health	DeepSeek V4 Flash	72.3%
health	Mercury 2	60.8%
history	Gemini 3.5 Flash	80.6%
history	Gemini 3.1 Flash-Lite	75.9%
history	Tencent HY3-Preview	71.4%
history	Gemini 2.5 Flash	70.9%
history	DeepSeek V4 Flash	69.8%
history	Qwen3.6 35B-A3B	69.0%
history	Mercury 2	48.6%
law	Gemini 3.5 Flash	72.8%
law	Gemini 3.1 Flash-Lite	62.7%
law	Gemini 2.5 Flash	54.0%
law	Qwen3.6 35B-A3B	52.8%
law	DeepSeek V4 Flash	46.3%
law	Tencent HY3-Preview	45.8%
law	Mercury 2	30.1%
math	Gemini 3.5 Flash	94.6%
math	Qwen3.6 35B-A3B	92.7%
math	Gemini 3.1 Flash-Lite	91.0%
math	Gemini 2.5 Flash	90.5%
math	DeepSeek V4 Flash	88.5%
math	Tencent HY3-Preview	86.5%
math	Mercury 2	81.4%
other	Gemini 3.5 Flash	82.4%
other	Gemini 3.1 Flash-Lite	76.2%
other	Gemini 2.5 Flash	73.5%
other	Tencent HY3-Preview	72.1%
other	DeepSeek V4 Flash	70.8%
other	Qwen3.6 35B-A3B	70.8%
other	Mercury 2	56.9%
philosophy	Gemini 3.5 Flash	83.2%
philosophy	Gemini 3.1 Flash-Lite	76.0%
philosophy	Gemini 2.5 Flash	70.9%
philosophy	Qwen3.6 35B-A3B	69.5%
philosophy	Tencent HY3-Preview	69.1%
philosophy	DeepSeek V4 Flash	68.3%
philosophy	Mercury 2	49.1%
physics	Gemini 3.5 Flash	90.4%
physics	Qwen3.6 35B-A3B	88.0%
physics	Gemini 3.1 Flash-Lite	86.8%
physics	Gemini 2.5 Flash	85.5%
physics	DeepSeek V4 Flash	84.1%
physics	Mercury 2	71.9%
physics	Tencent HY3-Preview	64.7%
psychology	Gemini 3.5 Flash	87.8%
psychology	Gemini 3.1 Flash-Lite	83.7%
psychology	Gemini 2.5 Flash	82.6%
psychology	DeepSeek V4 Flash	81.1%
psychology	Tencent HY3-Preview	80.5%
psychology	Qwen3.6 35B-A3B	78.7%
psychology	Mercury 2	65.7%

MMMLU — German

OpenAI's multilingual MMLU, German split — general knowledge spanning STEM, the humanities, social sciences and other domains. Professionally translated to German.

14,042 questions · 4-option multiple choice · Professional translation · Dataset: openai/MMMLU

#	Model	Score	Latency	Tok in	Tok out	Cost	Date
1	Gemini 3.5 Flash	89.3%	2.05s	172	255.8	$35.95	2026-05-26
2	Gemini 3.1 Flash-Lite	86.7%	1.47s	172	212	$5.07	2026-05-26
3	Qwen3.6 35B-A3B	85.8%	4.39s	182.9	561.4	$8.27	2026-06-01
4	DeepSeek V4 Flash	85.1%	2.54s	190.8	140.2	$0.92	2026-05-28
5	Gemini 2.5 Flash	84.7%	1.48s	172	275.5	$10.40	2026-05-26
6	Tencent HY3-Preview	83.7%	5.55s	223.4	280.9	$1.14	2026-05-28
7	Claude Haiku 4.5	83.1%	4.20s	2,267.2	397.6	$59.74	2026-06-02
8	Mercury 2	68.0%	0.43s	172.2	90.2	$1.53	2026-05-30

Streaming throughput on a controlled workload: a roughly 2,000-token German prompt, a 400-token output cap, and one request at a time. TPS is decode speed, meaning output tokens per second after the first token. TTFT is time to first token. Throughput is measured separately from the accuracy benchmarks, whose reasoning-off answers are too short to time reliably. Each model is pinned to a single provider, so these figures are a snapshot, not a constant.

#	Model	TPS	TTFT
1	gpt-oss-120b 🔒	1,641	0.31s
2	Mercury 2	541	0.39s
3	Gemini 3.1 Flash-Lite	223	0.61s
4	Gemini 3.5 Flash	181	0.75s
5	Qwen3.6 35B-A3B	159	0.62s
6	Gemini 2.5 Flash	159	0.41s
7	Claude Haiku 4.5	131	0.79s
8	DeepSeek V4 Flash	115	0.92s
9	Tencent HY3-Preview	107	2.69s
10	GLM-5.1	32	0.65s
11	Gemma 4 31B	24	0.59s
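
The two metrics in this table can be derived from the arrival times of a single streamed response. The sketch below follows the definitions given above (TTFT up to the first token, TPS over the tokens after it); the names `StreamTiming` and `measure` are hypothetical, and it assumes one timestamp per output token, whereas real streaming APIs often deliver several tokens per chunk.

```python
from dataclasses import dataclass

@dataclass
class StreamTiming:
    ttft_s: float  # time to first token, in seconds
    tps: float     # decode speed: output tokens per second after the first token

def measure(request_sent_s, token_times_s):
    """Compute TTFT and decode TPS from one streamed response.

    request_sent_s: time the request was sent, in seconds.
    token_times_s: arrival time of each output token, in seconds, ascending.
    """
    if len(token_times_s) < 2:
        raise ValueError("need at least two tokens to time decoding")
    first, last = token_times_s[0], token_times_s[-1]
    if last <= first:
        raise ValueError("token timestamps must be increasing")
    ttft = first - request_sent_s
    # The first token is excluded: its delay is already counted in TTFT.
    tps = (len(token_times_s) - 1) / (last - first)
    return StreamTiming(ttft_s=ttft, tps=tps)

timing = measure(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(timing)  # StreamTiming(ttft_s=0.5, tps=4.0)
```

Separating the two numbers matters because a model can start quickly and decode slowly, or the reverse, as the TTFT and TPS columns above show.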

Quality vs. speed

Average score across all benchmarks against decode speed (output tokens per second) from the throughput probe. Up and to the right is better — smarter and faster. The gold line is the speed frontier: the best score available at each speed.

* Scored on INCLUDE only (Gemma 4 31B, gpt-oss-120b, GLM-5.1). The average covers a single benchmark, so it isn't directly comparable to the averages of full-coverage models.

🔒 reasoning can't be disabled — its decode speed includes forced reasoning tokens, so it isn't directly comparable to the reasoning-off models.
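
The speed frontier can be read as the set of models that no other model matches or beats on both axes at once. The sketch below applies that pairwise dominance test to the Avg column of the summary table and the TPS column of the throughput table. It is one plausible reading, not necessarily how the site draws its gold line, and the name `dominated` is illustrative. DeepSeek V4 Pro is omitted because it has no throughput row.

```python
# (model, decode TPS, average score in %) copied from the tables above.
models = [
    ("gpt-oss-120b", 1641, 66.2),
    ("Mercury 2", 541, 61.9),
    ("Gemini 3.1 Flash-Lite", 223, 80.5),
    ("Gemini 3.5 Flash", 181, 82.8),
    ("Qwen3.6 35B-A3B", 159, 78.1),
    ("Gemini 2.5 Flash", 159, 78.3),
    ("Claude Haiku 4.5", 131, 75.7),
    ("DeepSeek V4 Flash", 115, 76.6),
    ("Tencent HY3-Preview", 107, 75.3),
    ("GLM-5.1", 32, 47.5),
    ("Gemma 4 31B", 24, 67.6),
]

def dominated(a, b):
    """True if b is at least as fast and as accurate as a, and strictly better on one."""
    _, tps_a, score_a = a
    _, tps_b, score_b = b
    no_worse = tps_b >= tps_a and score_b >= score_a
    strictly_better = tps_b > tps_a or score_b > score_a
    return no_worse and strictly_better

frontier = [m for m in models if not any(dominated(m, other) for other in models)]

for name, tps, score in frontier:
    print(f"{tps:>5} TPS  {score:.1f}%  {name}")
#  1641 TPS  66.2%  gpt-oss-120b
#   223 TPS  80.5%  Gemini 3.1 Flash-Lite
#   181 TPS  82.8%  Gemini 3.5 Flash
```

Both footnotes above apply to gpt-oss-120b: its average is INCLUDE-only and its decode speed includes forced reasoning tokens, so its place on this frontier should be read with those caveats.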