a PeerBench project

Which fast LLM speaks German best?

Independent German-language benchmarks for the fastest models — accuracy, latency, and cost, side by side.

German LLM Overview

Which model is best on German tasks — one score, ranked. Filter by open vs. closed weights, price, or speed.

Model	Maker	Weights	Score	Range	Price ($/Mtok)	Tier	Speed (tok/s)
Gemini 3.1 Pro	Google	Closed	80.3	78–89	$6.75	$$$	127
Gemini 3.5 Flash	Google	Closed	68.4	67–75	$2.81	$$	202.5
Qwen3.7 Max	Alibaba	Closed	57.7	51–67	$1.28	$$	170.3
Gemini 3.1 Flash-Lite	Google	Closed	56.4	52–65	$0.44	$	302.3
DeepSeek V4 Pro	DeepSeek	Open weights	54.1	50–59	$1.62	$$	57.8
Gemma 4 31B	Google	Open weights	52.4	49–57	$0.17	$	38.4
DeepSeek V4 Flash	DeepSeek	Open weights	52.1	48–58	$0.14	$	102.5
MiMo V2.5 Pro	Xiaomi	Open weights	51.7	46–61	$0.85	$$	49.2
Gemini 2.5 Flash	Google	Closed	51.0	47–57	$1.00	$$	216.2
Qwen3.6 35B-A3B	Alibaba	Open weights	48.7	46–53	$0.30	$	141.6
Claude Haiku 4.5	Anthropic	Closed	46.4	44–49	$3.36	$$$	140.6
Gemma 4 26B A4B	Google	Open weights	45.0	40–51	$0.23	$	84.4
Gemma 4 12B	Google	Open weights	44.8	40–49	—	—	46.3
GLM-5.1	Z.ai	Open weights	44.7	36–52	$1.45	$$	71.4
Tencent HY3-Preview	Tencent	Open weights	44.7	38–48	$0.08	$	94.8
Qwen3.5 9B	Alibaba	Open weights	39.1	34–42	$0.12	$	57.7
Qwen3 14B	Alibaba	Open weights	35.4	29–39	$0.14	$	64.7

Showing 17 models that ran ≥3 of 7 benchmarks (9 excluded for thin coverage). Price = median effective $/1M tokens; Speed = throughput + latency.
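
The coverage rule and the price column in the note above can be sketched roughly as follows. This is only an illustration: the page does not say what the median is taken over, so the sketch assumes it is the median of per-benchmark prices, and the function names and sample data are hypothetical.

```python
from statistics import median

MIN_BENCHMARKS = 3  # coverage threshold stated in the note (>= 3 of 7)


def covered_models(runs, min_benchmarks=MIN_BENCHMARKS):
    """Keep models that have a result on at least `min_benchmarks` benchmarks.

    `runs` maps model name -> {benchmark name: price in $/1M tokens, or None}.
    """
    return {m: r for m, r in runs.items() if len(r) >= min_benchmarks}


def median_price(per_benchmark_prices):
    """Median of the known per-benchmark prices; None if no price is known.

    ASSUMPTION: the median is taken across benchmarks.
    """
    known = [p for p in per_benchmark_prices.values() if p is not None]
    return median(known) if known else None


# Hypothetical data, for illustration only (not taken from the tables).
runs = {
    "model-a": {"ner": 1.0, "include": 3.0, "mmlu": 2.0},
    "model-b": {"ner": 0.5, "include": 0.7},                  # thin coverage
    "model-c": {"ner": None, "include": None, "mmlu": None},  # no pricing
}

kept = covered_models(runs)
prices = {m: median_price(r) for m, r in kept.items()}
print(prices)  # {'model-a': 2.0, 'model-c': None}
```

A model with no known price still passes the coverage filter here and simply shows no price, which mirrors how Gemma 4 12B appears in the overview.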

GermEval — German NER

Native German named-entity recognition — identify persons, locations, organisations and misc entities in German text, emitted as JSON. Scored with seqeval micro-F1 excluding the noisy MISC class. Run reasoning-off.

1,024 questions · Named-entity recognition · Native German · GermEval (via EuroEval)

#	Model	Score	Latency	Tok in	Tok out	Cost	Date
1	Gemini 3.1 Pro 🔒 reasoning	87.3%	4.86s	1,059.7	436.2	$10.04	2026-06-09
2	Gemini 3.5 Flash	86.0%	0.81s	1,077	24.6	$1.88	2026-06-09
3	GPT-5.5 🔒 reasoning	85.9%	1.86s	1,140.7	40.4	$7.08	2026-06-13
4	Opus 4.8	85.8%	2.12s	2,110.5	42	$11.88	2026-06-13
5	Gemini 3.1 Flash-Lite	83.2%	0.52s	1,077	25.7	$0.63	2026-06-08
6	Gemma 4 31B	82.6%	11.92s	1,153	27.8	$0.15	2026-06-11
7	DeepSeek V4 Pro	82.2%	1.74s	1,181.2	26.3	$2.02	2026-06-09
8	Qwen3.7 Max	82.1%	1.64s	1,154.3	25.2	$1.57	2026-06-09
9	MiMo V2.5 Pro	81.9%	1.09s	1,533.5	26.4	$1.49	2026-06-09
10	Gemma 4 26B A4B	81.8%	1.14s	1,153	30.3	$0.19	2026-06-08
11	Gemini 2.5 Flash	81.8%	0.52s	1,077	24.5	$0.39	2026-06-09
12	DeepSeek V4 Flash	80.2%	0.77s	1,181.2	26.8	$0.18	2026-06-09
13	Qwen3.6 35B-A3B	79.9%	0.67s	1,154.3	27.3	$0.21	2026-06-09
14	Gemma 4 12B	79.6%	43.80s	1,153	28.1	—	2026-06-11
15	Tencent HY3-Preview	77.3%	2.50s	1,313.7	35.4	$0.09	2026-06-09
16	Qwen3 14B	73.0%	21.12s	1,292.5	28.5	—	2026-06-11
17	Qwen3.5 9B	72.6%	11.77s	1,154.3	26.9	$0.12	2026-06-11
18	GLM-5.1	52.2%	2.20s	1,127.6	26.5	$1.71	2026-06-09
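
To make the GermEval scoring rule concrete, here is a minimal sketch of micro-averaged F1 over entity spans with the MISC class dropped. It is not the seqeval implementation the benchmark uses: seqeval works on tag sequences, whereas this sketch assumes entities are already extracted as `(start, end, label)` tuples and counts only exact matches. The function name and the sample data are hypothetical.

```python
def micro_f1(gold, pred, exclude=("MISC",)):
    """Micro-averaged F1 over exact-match entity spans, pooled across sentences.

    `gold` and `pred` are lists (one item per sentence) of sets of
    (start, end, label) tuples. Labels in `exclude` are dropped on both sides.
    """
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        g = {e for e in g if e[2] not in exclude}
        p = {e for e in p if e[2] not in exclude}
        tp += len(g & p)       # spans with identical boundaries and label
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical two-sentence example.
gold = [{(0, 2, "PER"), (5, 6, "LOC")}, {(1, 3, "ORG"), (4, 5, "MISC")}]
pred = [{(0, 2, "PER"), (5, 6, "ORG")}, {(1, 3, "ORG")}]

score = micro_f1(gold, pred)
print(round(score, 3))  # 0.667
```

In the example the missed MISC entity costs nothing because it is filtered out before counting, while the LOC span predicted as ORG counts as both a false negative and a false positive.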

INCLUDE — German

Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.

139 questions · 4-option multiple choice · Native German · CohereLabs/include-base-44

#	Model	Score	Latency	Tok in	Tok out	Cost	Date
1	Gemini 3.1 Pro 🔒 reasoning	77.7%	4.24s	107.1	416.6	$0.72	2026-06-10
2	GPT-5.5 🔒 reasoning	74.8%	3.99s	128.2	153.6	$0.70	2026-06-12
3	Gemini 3.5 Flash	74.1%	1.72s	107.1	205.4	$0.28	2026-06-10
4	MiniMax M2.7 🔒 reasoning	72.7%	12.26s	151.1	1,443.8	$0.29	—
5	Gemini 3.1 Flash-Lite	71.9%	0.89s	107.1	151	$0.04	2026-06-10
6	Opus 4.8	71.9%	5.77s	256.2	343.7	$1.32	2026-06-12
7	Qwen3.7 Max	71.2%	—	—	—	—	2026-06-08
8	Gemini 2.5 Flash	70.5%	1.61s	105.8	193.6	$0.07	2026-06-02
9	DeepSeek V4 Pro	70.5%	3.44s	118.7	121.3	$0.08	2026-06-08
10	DeepSeek V4 Flash	70.5%	2.53s	118.6	111.3	$0.0066	2026-06-08
11	Kimi K2.6	69.8%	13.65s	138.7	527.1	$0.58	2026-06-08
12	Tencent HY3-Preview	69.1%	5.48s	143.9	222.4	$0.0086	2026-06-02
13	Gemma 4 12B	69.1%	27.64s	123.1	221.9	—	2026-06-11
14	Qwen3.6 35B-A3B	68.3%	2.72s	118.4	545.1	$0.08	2026-06-02
15	Claude Haiku 4.5	68.3%	3.45s	154.6	347.4	$0.26	2026-06-02
16	Gemma 4 31B	68.3%	4.56s	119.1	205.4	$0.01	2026-06-11
17	MiMo V2.5 Pro	67.6%	—	—	—	—	2026-06-08
18	GLM-5.1	67.6%	—	—	—	—	2026-06-08
19	gpt-oss-120b 🔒 reasoning	66.2%	1.71s	154.1	111.7	$0.0078	2026-05-29
20	Gemma 4 26B A4B	64.7%	3.71s	118.7	275.8	$0.03	2026-06-03
21	Qwen3.5 9B	64.7%	10.95s	122.4	378	$0.0096	2026-06-11
22	grok-4.3	63.3%	0.60s	233.8	20.5	$0.04	2026-06-03
23	Qwen3 14B	63.3%	3.32s	135.4	46	—	2026-06-11
24	Ministral 14B	58.3%	0.41s	109.1	79.8	$0.01	2026-06-03
25	gemma-3-12b-it	53.2%	5.00s	114.8	160.5	$0.0073	2026-06-03

MMLU-Pro — German

Hard academic questions across 14 subjects — STEM, law, health, economics, philosophy and more. Professionally translated to German, with up to ten answer options per question.

11,759 questions · 10-option multiple choice · Professional translation · li-lab/MMLU-ProX

#	Model	Score	Latency	Tok in	Tok out	Cost	Date
1	Gemini 3.5 Flash	86.5%	2.23s	1,664.9	353.3	$66.76	2026-05-26
2	Gemini 3.1 Flash-Lite	82.2%	1.24s	1,666.1	304.5	$10.21	2026-06-03
3	Gemma 4 31B	82.1%	64.35s	1,702.9	502.5	$4.53	2026-06-11
4	DeepSeek V4 Pro	80.8%	—	288.1	218.9	$3.71	2026-06-16
5	Qwen3.6 35B-A3B	80.0%	4.67s	1,689.6	984	$13.36	2026-05-31
6	Gemini 2.5 Flash	79.6%	2.69s	1,664.9	786	$28.89	2026-05-26
7	Gemma 4 26B A4B	78.2%	7.77s	1,700.6	662.1	$4.64	2026-06-05
8	Claude Haiku 4.5	75.3%	3.76s	2,262	433.7	$52.10	2026-06-03
9	DeepSeek V4 Flash	74.9%	1.98s	1,718.7	162.2	$0.0020	2026-06-16
10	Qwen3.5 9B	73.4%	40.41s	1,693.6	931.1	$3.63	2026-06-11
11	Tencent HY3-Preview	73.2%	6.14s	1,910.2	586.3	$2.76	2026-05-27
12	Gemma 4 12B	73.1%	92.81s	1,705.9	493.3	—	2026-06-11
13	Gemini 2.5 Flash-Lite	71.2%	2.48s	1,665.1	1,493.5	$8.95	2026-06-03
14	Qwen3 14B	64.1%	29.26s	1,887.3	413.7	—	2026-06-11

Per-subject breakdown (196 rows)

Subject	Model	Score
biology	Gemini 3.5 Flash	92.9%
biology	Gemma 4 31B	91.2%
biology	DeepSeek V4 Pro	90.2%
biology	Qwen3.6 35B-A3B	89.7%
biology	Gemini 3.1 Flash-Lite	89.4%
biology	Gemini 2.5 Flash	88.8%
biology	DeepSeek V4 Flash	88.4%
biology	Gemma 4 26B A4B	87.3%
biology	Gemini 2.5 Flash-Lite	86.8%
biology	Claude Haiku 4.5	85.9%
biology	Tencent HY3-Preview	85.8%
business	Gemini 3.5 Flash	89.6%
business	Gemma 4 31B	87.5%
business	Gemini 3.1 Flash-Lite	85.3%
business	DeepSeek V4 Pro	85.3%
business	Gemma 4 26B A4B	83.9%
business	Gemini 2.5 Flash	83.5%
business	Qwen3.6 35B-A3B	83.5%
business	Claude Haiku 4.5	79.0%
business	Tencent HY3-Preview	78.3%
business	Gemini 2.5 Flash-Lite	77.7%
business	DeepSeek V4 Flash	76.2%
chemistry	Gemini 3.5 Flash	89.1%
chemistry	Qwen3.6 35B-A3B	88.2%
chemistry	Gemini 3.1 Flash-Lite	87.1%
chemistry	Gemini 2.5 Flash	86.9%
chemistry	Gemma 4 31B	86.7%
chemistry	DeepSeek V4 Pro	86.7%
chemistry	Gemma 4 26B A4B	85.4%
chemistry	Claude Haiku 4.5	80.4%
chemistry	Gemini 2.5 Flash-Lite	78.4%
chemistry	Tencent HY3-Preview	75.6%
chemistry	DeepSeek V4 Flash	73.3%
computer science	Gemini 3.1 Flash-Lite	88.8%
computer science	Gemini 3.5 Flash	87.6%
computer science	DeepSeek V4 Flash	85.9%
computer science	Gemma 4 31B	85.9%
computer science	DeepSeek V4 Pro	85.4%
computer science	Gemini 2.5 Flash	85.1%
computer science	Qwen3.6 35B-A3B	85.1%
computer science	Gemma 4 26B A4B	83.9%
computer science	Claude Haiku 4.5	83.4%
computer science	Gemini 2.5 Flash-Lite	77.8%
computer science	Tencent HY3-Preview	76.8%
economics	Gemini 3.5 Flash	89.1%
economics	Gemini 3.1 Flash-Lite	87.3%
economics	Gemma 4 31B	86.8%
economics	Gemini 2.5 Flash	86.4%
economics	Qwen3.6 35B-A3B	85.3%
economics	DeepSeek V4 Pro	84.1%
economics	Gemma 4 26B A4B	82.6%
economics	Tencent HY3-Preview	82.3%
economics	Claude Haiku 4.5	81.5%
economics	Gemini 2.5 Flash-Lite	79.1%
economics	DeepSeek V4 Flash	75.0%
engineering	Gemini 3.5 Flash	82.5%
engineering	Gemini 3.1 Flash-Lite	77.8%
engineering	Qwen3.6 35B-A3B	77.5%
engineering	Gemma 4 31B	77.1%
engineering	DeepSeek V4 Pro	74.2%
engineering	Gemma 4 26B A4B	73.5%
engineering	Gemini 2.5 Flash	71.6%
engineering	Claude Haiku 4.5	64.7%
engineering	Tencent HY3-Preview	64.4%
engineering	DeepSeek V4 Flash	64.1%
engineering	Gemini 2.5 Flash-Lite	58.4%
health	Gemini 3.5 Flash	78.6%
health	Gemini 3.1 Flash-Lite	74.8%
health	DeepSeek V4 Pro	74.2%
health	Gemma 4 31B	74.1%
health	Tencent HY3-Preview	73.7%
health	Claude Haiku 4.5	72.9%
health	Qwen3.6 35B-A3B	72.8%
health	Gemini 2.5 Flash	72.5%
health	DeepSeek V4 Flash	71.3%
health	Gemma 4 26B A4B	71.2%
health	Gemini 2.5 Flash-Lite	66.5%
history	Gemini 3.5 Flash	80.6%
history	Gemini 3.1 Flash-Lite	74.5%
history	Gemma 4 31B	74.5%
history	DeepSeek V4 Pro	71.9%
history	Tencent HY3-Preview	71.4%
history	Gemini 2.5 Flash	70.9%
history	Qwen3.6 35B-A3B	69.0%
history	DeepSeek V4 Flash	68.2%
history	Gemma 4 26B A4B	66.7%
history	Claude Haiku 4.5	65.4%
history	Gemini 2.5 Flash-Lite	59.3%
law	Gemini 3.5 Flash	72.8%
law	Gemini 3.1 Flash-Lite	62.8%
law	Gemma 4 31B	59.0%
law	DeepSeek V4 Pro	55.3%
law	Gemini 2.5 Flash	54.0%
law	Qwen3.6 35B-A3B	52.8%
law	Gemma 4 26B A4B	50.7%
law	Claude Haiku 4.5	47.4%
law	DeepSeek V4 Flash	46.1%
law	Tencent HY3-Preview	45.8%
law	Gemini 2.5 Flash-Lite	41.6%
math	Gemini 3.5 Flash	94.6%
math	Gemma 4 31B	93.5%
math	Qwen3.6 35B-A3B	92.7%
math	Gemma 4 26B A4B	92.0%
math	DeepSeek V4 Pro	90.9%
math	Gemini 2.5 Flash	90.5%
math	Gemini 3.1 Flash-Lite	90.2%
math	DeepSeek V4 Flash	88.7%
math	Claude Haiku 4.5	86.8%
math	Tencent HY3-Preview	86.5%
math	Gemini 2.5 Flash-Lite	82.8%
nothink biology	Gemma 4 12B	87.0%
nothink biology	Qwen3.5 9B	85.6%
nothink biology	Qwen3 14B	81.3%
nothink business	Gemma 4 12B	79.8%
nothink business	Qwen3.5 9B	78.1%
nothink business	Qwen3 14B	70.3%
nothink chemistry	Qwen3.5 9B	85.0%
nothink chemistry	Gemma 4 12B	79.9%
nothink chemistry	Qwen3 14B	72.2%
nothink computer science	Gemma 4 12B	79.3%
nothink computer science	Qwen3.5 9B	78.0%
nothink computer science	Qwen3 14B	71.5%
nothink economics	Qwen3.5 9B	78.9%
nothink economics	Gemma 4 12B	78.7%
nothink economics	Qwen3 14B	71.1%
nothink engineering	Qwen3.5 9B	66.6%
nothink engineering	Gemma 4 12B	65.1%
nothink engineering	Qwen3 14B	55.9%
nothink health	Qwen3.5 9B	65.9%
nothink health	Gemma 4 12B	64.5%
nothink health	Qwen3 14B	56.8%
nothink history	Qwen3.5 9B	60.4%
nothink history	Gemma 4 12B	57.2%
nothink history	Qwen3 14B	48.8%
nothink law	Gemma 4 12B	42.8%
nothink law	Qwen3.5 9B	36.7%
nothink law	Qwen3 14B	27.4%
nothink math	Qwen3.5 9B	90.2%
nothink math	Gemma 4 12B	90.1%
nothink math	Qwen3 14B	82.2%
nothink other	Qwen3.5 9B	61.6%
nothink other	Gemma 4 12B	60.7%
nothink other	Qwen3 14B	52.1%
nothink philosophy	Gemma 4 12B	60.3%
nothink philosophy	Qwen3.5 9B	58.1%
nothink philosophy	Qwen3 14B	49.3%
nothink physics	Qwen3.5 9B	84.7%
nothink physics	Gemma 4 12B	81.9%
nothink physics	Qwen3 14B	73.1%
nothink psychology	Gemma 4 12B	76.3%
nothink psychology	Qwen3.5 9B	74.2%
nothink psychology	Qwen3 14B	66.0%
other	Gemini 3.5 Flash	82.4%
other	DeepSeek V4 Pro	77.3%
other	Gemini 3.1 Flash-Lite	76.7%
other	Gemma 4 31B	73.7%
other	Gemini 2.5 Flash	73.5%
other	Tencent HY3-Preview	72.1%
other	Qwen3.6 35B-A3B	70.8%
other	Claude Haiku 4.5	69.5%
other	DeepSeek V4 Flash	69.4%
other	Gemma 4 26B A4B	67.5%
other	Gemini 2.5 Flash-Lite	64.9%
philosophy	Gemini 3.5 Flash	83.2%
philosophy	Gemini 3.1 Flash-Lite	75.2%
philosophy	Gemma 4 31B	74.3%
philosophy	DeepSeek V4 Pro	73.7%
philosophy	DeepSeek V4 Flash	71.9%
philosophy	Gemini 2.5 Flash	70.9%
philosophy	Qwen3.6 35B-A3B	69.5%
philosophy	Gemma 4 26B A4B	69.1%
philosophy	Tencent HY3-Preview	69.1%
philosophy	Claude Haiku 4.5	67.9%
philosophy	Gemini 2.5 Flash-Lite	60.9%
physics	Gemini 3.5 Flash	90.4%
physics	Gemma 4 31B	89.0%
physics	Qwen3.6 35B-A3B	88.0%
physics	Gemini 3.1 Flash-Lite	87.6%
physics	DeepSeek V4 Pro	86.8%
physics	Gemma 4 26B A4B	86.1%
physics	Gemini 2.5 Flash	85.5%
physics	DeepSeek V4 Flash	84.7%
physics	Claude Haiku 4.5	81.3%
physics	Gemini 2.5 Flash-Lite	76.6%
physics	Tencent HY3-Preview	64.7%
psychology	Gemini 3.5 Flash	87.8%
psychology	Gemma 4 31B	83.6%
psychology	Gemini 3.1 Flash-Lite	83.3%
psychology	DeepSeek V4 Pro	83.3%
psychology	Gemini 2.5 Flash	82.6%
psychology	DeepSeek V4 Flash	80.8%
psychology	Tencent HY3-Preview	80.5%
psychology	Claude Haiku 4.5	78.9%
psychology	Qwen3.6 35B-A3B	78.7%
psychology	Gemma 4 26B A4B	78.6%
psychology	Gemini 2.5 Flash-Lite	75.1%

MMMLU — German

OpenAI's multilingual MMLU, German split — general knowledge spanning STEM, the humanities, social sciences and other domains. Professionally translated to German.

14,042 questions · 4-option multiple choice · Professional translation · openai/MMMLU

#	Model	Score	Latency	Tok in	Tok out	Cost	Date
1	GPT-5.5 🔒 reasoning	92.1%	4.53s	198.9	166.8	$83.60	2026-06-16
2	Gemini 3.5 Flash	89.3%	2.05s	172	255.8	$35.95	2026-05-26
3	Gemini 3.1 Flash-Lite	86.8%	1.06s	173	210.7	$5.04	2026-06-03
4	Gemma 4 31B	86.6%	12.07s	185	251.5	$1.58	2026-06-11
5	Qwen3.6 35B-A3B	85.8%	4.39s	182.9	561.4	$8.27	2026-06-01
6	DeepSeek V4 Flash	84.9%	2.31s	190.8	143.6	$0.94	2026-06-03
7	Gemini 2.5 Flash	84.7%	1.48s	172	275.5	$10.40	2026-05-26
8	Gemma 4 26B A4B	83.7%	7.77s	112.3	221.7	$0.05	2026-06-05
9	Tencent HY3-Preview	83.7%	5.55s	223.4	280.9	$1.14	2026-05-28
10	Claude Haiku 4.5	83.1%	4.20s	277.7	397.6	$31.90	2026-06-02
11	Gemini 2.5 Flash-Lite	79.6%	1.72s	173	506	$3.06	2026-06-03
12	Gemma 4 12B	79.0%	47.38s	189	261.2	—	2026-06-11
13	Qwen3.5 9B	78.8%	13.81s	186.9	408.3	$1.12	2026-06-11
14	Qwen3 14B	73.4%	6.37s	212	87.9	—	2026-06-11

MuSR — German

Multi-step soft reasoning over long narrative contexts — murder mysteries, object placement and team allocation. Requires chaining clues across several paragraphs to reach the correct answer. Translated to German from the original English MuSR benchmark.

564 questions · 2–5 option multiple choice · Professional translation · zayne-sprague/MuSR

#	Model	Score	Latency	Tok in	Tok out	Cost	Date
1	Gemini 3.1 Pro 🔒 reasoning	88.1%	12.14s	1,464.9	1,306.5	$10.49	2026-06-09
2	GPT-5.5 🔒 reasoning	86.3%	13.58s	1,449.5	617.2	$14.53	2026-06-13
3	Opus 4.8	86.3%	13.87s	3,039.8	1,075.6	$23.74	2026-06-13
4	Gemini 3.1 Flash-Lite	84.4%	2.95s	1,467.3	568.5	$0.73	2026-06-09
5	Gemini 3.5 Flash	84.2%	5.79s	1,464.9	850.2	$5.55	2026-06-09
6	Gemma 4 26B A4B	83.7%	—	—	—	—	2026-06-10
7	DeepSeek V4 Flash	83.5%	10.18s	1,659.8	798.9	$0.26	2026-06-10
8	Gemma 4 31B	83.5%	23.30s	1,477.9	650.1	$0.23	2026-06-11
9	Gemini 2.5 Flash	83.3%	6.28s	1,464.9	1,077.4	$1.77	2026-06-09
10	Gemma 4 12B	81.6%	195.50s	1,477.9	707.4	—	2026-06-11
11	GLM-5.1	81.4%	—	—	—	—	2026-06-10
12	MiMo V2.5 Pro	80.9%	—	—	—	—	2026-06-10
13	Qwen3.5 9B	80.9%	61.14s	1,469.2	1,629.7	$0.22	2026-06-11
14	Qwen3 14B	69.7%	263.43s	1,728.9	1,270.4	—	2026-06-11

Per-subject breakdown (42 rows)

Subject	Model	Score
murder mystery	GPT-5.5	90.0%
murder mystery	Gemini 3.1 Pro	89.2%
murder mystery	Gemini 2.5 Flash	87.6%
murder mystery	Opus 4.8	87.6%
murder mystery	Gemma 4 31B	86.4%
murder mystery	DeepSeek V4 Flash	85.6%
murder mystery	Gemma 4 12B	85.6%
murder mystery	Gemma 4 26B A4B	85.2%
murder mystery	Gemini 3.1 Flash-Lite	84.8%
murder mystery	Gemini 3.5 Flash	84.0%
murder mystery	GLM-5.1	80.0%
murder mystery	MiMo V2.5 Pro	78.8%
murder mystery	Qwen3.5 9B	77.6%
murder mystery	Qwen3 14B	76.8%
object placements	Gemini 3.5 Flash	81.3%
object placements	GLM-5.1	81.3%
object placements	Gemini 3.1 Pro	79.7%
object placements	Opus 4.8	79.7%
object placements	Gemma 4 12B	79.7%
object placements	MiMo V2.5 Pro	76.6%
object placements	Gemma 4 31B	76.6%
object placements	Gemini 3.1 Flash-Lite	75.0%
object placements	Gemini 2.5 Flash	75.0%
object placements	DeepSeek V4 Flash	75.0%
object placements	Gemma 4 26B A4B	73.4%
object placements	Qwen3.5 9B	73.4%
object placements	Qwen3 14B	71.9%
object placements	GPT-5.5	68.8%
team allocation	Gemini 3.1 Pro	89.2%
team allocation	GPT-5.5	87.2%
team allocation	Opus 4.8	86.8%
team allocation	Gemini 3.1 Flash-Lite	86.4%
team allocation	Qwen3.5 9B	86.0%
team allocation	Gemini 3.5 Flash	85.2%
team allocation	DeepSeek V4 Flash	84.8%
team allocation	Gemma 4 26B A4B	84.8%
team allocation	MiMo V2.5 Pro	84.0%
team allocation	GLM-5.1	82.8%
team allocation	Gemma 4 31B	82.4%
team allocation	Gemini 2.5 Flash	81.2%
team allocation	Gemma 4 12B	78.0%
team allocation	Qwen3 14B	62.0%

SB10K — German sentiment

Native German social-media sentiment classification — positive, neutral or negative. Human-annotated German text, not translated. Run reasoning-off; scored as exact-match accuracy on the predicted label.

1,024 questions · 3-class sentiment · Native German · SB10K (via EuroEval)

#	Model	Score	Latency	Tok in	Tok out	Cost	Date
1	Gemini 3.1 Pro 🔒 reasoning	70.1%	3.30s	625.7	205.5	$3.81	2026-06-09
2	Opus 4.8	64.5%	1.88s	1,372.7	5	$7.16	2026-06-13
3	Gemini 3.5 Flash	64.3%	0.67s	650.6	1.7	$1.02	2026-06-09
4	Qwen3.7 Max	63.2%	0.99s	771.9	1.7	$0.99	2026-06-09
5	GPT-5.5	62.5%	1.36s	756.9	8.1	$4.12	2026-06-13
6	DeepSeek V4 Flash	62.4%	0.54s	778.6	2.7	$0.11	2026-06-09
7	Gemma 4 12B	62.2%	21.45s	758.7	2.7	—	2026-06-11
8	MiMo V2.5 Pro	61.7%	0.45s	1,140.3	2.8	$0.99	2026-06-09
9	Gemini 2.5 Flash	61.6%	0.40s	650.7	1.8	$0.20	2026-06-09
10	Gemma 4 31B	61.0%	5.46s	758.7	2.8	$0.09	2026-06-11
11	Gemini 3.1 Flash-Lite	60.4%	0.46s	650.7	1.8	$0.34	2026-06-08
12	DeepSeek V4 Pro	60.0%	1.42s	778.6	2.1	$1.28	2026-06-09
13	Qwen3.6 35B-A3B	58.4%	0.45s	771.9	2.7	$0.12	2026-06-09
14	Qwen3 14B	56.1%	13.26s	899.3	2.7	—	2026-06-11
15	Qwen3.5 9B	55.7%	6.96s	771.9	2.6	$0.08	2026-06-11
16	Tencent HY3-Preview	38.9%	2.35s	886.7	2.7	$0.06	2026-06-09
17	GLM-5.1	23.1%	1.88s	765.6	2.6	$1.09	2026-06-09
18	Gemma 4 26B A4B	15.2%	0.70s	758.8	2.8	$0.12	2026-06-08

ScaLA — German acceptability

Native German linguistic acceptability — does the sentence read as grammatical German (ja / nein)? Built from clean vs. minimally-corrupted German sentences. Run reasoning-off.

2,048 questions · Binary acceptability · Native German · ScaLA-de (via EuroEval)

#	Model	Score	Latency	Tok in	Tok out	Cost	Date
1	Opus 4.8	82.6%	2.00s	1,684.8	4	$17.46	2026-06-13
2	Gemini 3.1 Pro 🔒 reasoning	80.4%	3.71s	777.9	262	$10.37	2026-06-09
3	GPT-5.5 🔒 reasoning	79.7%	1.52s	913.2	27.6	$11.05	2026-06-13
4	Gemini 3.5 Flash	79.4%	0.79s	802.7	1.5	$2.49	2026-06-09
5	Gemini 2.5 Flash	77.1%	0.40s	802.7	1.5	$0.50	2026-06-09
6	DeepSeek V4 Flash	74.9%	0.53s	949.8	2.5	$0.27	2026-06-09
7	MiMo V2.5 Pro	74.8%	0.57s	1,301	2.5	$2.33	2026-06-09
8	Gemini 3.1 Flash-Lite	74.2%	0.46s	802.7	1.5	$0.83	2026-06-08
9	DeepSeek V4 Pro	73.5%	1.41s	949.8	1.6	$3.12	2026-06-09
10	Gemma 4 31B	71.0%	1.28s	910.7	2.5	$0.23	2026-06-11
11	GLM-5.1	69.9%	1.69s	901.6	2.5	$2.57	2026-06-09
12	Qwen3.7 Max	69.8%	1.02s	928.9	1.6	$2.39	2026-06-09
13	Gemma 4 26B A4B	66.3%	0.46s	910.7	2.6	$0.28	2026-06-08
14	Qwen3.6 35B-A3B	64.2%	0.51s	928.9	2.6	$0.29	2026-06-09
15	Gemma 4 12B	64.1%	27.85s	910.7	2.6	—	2026-06-11
16	Qwen3 14B	64.0%	15.71s	1,060	2.6	—	2026-06-11
17	Tencent HY3-Preview	62.5%	2.48s	1,027.6	2.6	$0.13	2026-06-09
18	Qwen3.5 9B	62.1%	8.40s	928.9	2.5	$0.19	2026-06-11

TPS is decode speed — output tokens per second after the first token; higher is faster. TTFT is time to first token; lower is snappier. A snapshot, not a constant.

#	Model	TPS	TTFT
1	gpt-oss-120b 🔒	362	0.51s
2	Gemini 3.1 Flash-Lite	302	4.99s
3	Gemini 2.5 Flash-Lite	293	0.36s
4	Gemini 2.5 Flash	216	0.61s
5	Gemini 3.5 Flash	203	0.88s
6	grok-4.3	171	0.60s
7	Qwen3.7 Max	170	1.61s
8	Qwen3.6 35B-A3B	142	1.18s
9	Claude Haiku 4.5	141	0.64s
10	Gemini 3.1 Pro 🔒	127	—
11	DeepSeek V4 Flash	103	1.00s
12	Tencent HY3-Preview	95	2.47s
13	Ministral 14B	87	0.41s
14	Gemma 4 26B A4B	84	0.95s
15	GLM-5.1	71	0.89s
16	Qwen3 14B	65	1.02s
17	DeepSeek V4 Pro	58	1.02s
18	Qwen3.5 9B	58	1.21s
19	MiMo V2.5 Pro	49	2.03s
20	Gemma 4 12B	46	0.83s
21	Gemma 4 31B	38	1.01s
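
The two definitions above (TTFT and decode-only TPS) can be sketched from the arrival times of streamed tokens. This is a sketch under assumptions, not the benchmark's harness: it assumes one timestamp per output token, and the function name and timings are hypothetical.

```python
def ttft_and_tps(request_sent, token_times):
    """Return (TTFT in seconds, decode speed in tokens per second).

    `request_sent` is the time the request was sent; `token_times` holds the
    arrival time of each output token, in seconds. TPS counts only the tokens
    after the first one, over the time elapsed since the first one.
    """
    if not token_times:
        raise ValueError("no output tokens")
    ttft = token_times[0] - request_sent
    decode_time = token_times[-1] - token_times[0]
    decoded = len(token_times) - 1
    tps = decoded / decode_time if decode_time > 0 else float("nan")
    return ttft, tps


# Hypothetical timings: first token after 0.5 s, then one token every 0.25 s.
ttft, tps = ttft_and_tps(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(ttft, tps)  # 0.5 4.0
```

Keeping the first token out of the TPS figure separates queueing and prefill delay (captured by TTFT) from steady-state generation speed, which is why a model can rank high on TPS yet show a long TTFT.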

Quality vs. speed

Average score across all benchmarks against decode speed (output tokens per second). Up and to the right is better — smarter and faster. The gold line is the speed frontier: the best score available at each speed.

* scored on INCLUDE only — its average covers fewer benchmarks, so it isn't directly comparable to the full-coverage models.

🔒 reasoning can't be disabled — its decode speed includes forced reasoning tokens, so it isn't directly comparable to the reasoning-off models.
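
The speed frontier described under "Quality vs. speed" can be sketched as a running maximum over models sorted from fastest to slowest. This reads "best score available at each speed" as "best score among models at least that fast", which is an assumption about how the chart is built. The sample rows reuse Score and Speed values from the overview table; the function name is hypothetical.

```python
def speed_frontier(models):
    """Return models not beaten on score by any model that is at least as fast.

    `models` is a list of (name, score, tokens_per_second) tuples.
    """
    frontier = []
    best = float("-inf")
    # Walk from fastest to slowest; on equal speed, the higher score goes first.
    for name, score, tps in sorted(models, key=lambda m: (-m[2], -m[1])):
        if score > best:
            frontier.append((name, score, tps))
            best = score
    return frontier


# A subset of rows (name, score, tok/s) taken from the overview table.
models = [
    ("Gemini 3.1 Pro", 80.3, 127.0),
    ("Gemini 3.5 Flash", 68.4, 202.5),
    ("Qwen3.7 Max", 57.7, 170.3),
    ("Gemini 3.1 Flash-Lite", 56.4, 302.3),
    ("DeepSeek V4 Pro", 54.1, 57.8),
    ("Gemini 2.5 Flash", 51.0, 216.2),
]

names = [name for name, _, _ in speed_frontier(models)]
print(names)
# ['Gemini 3.1 Flash-Lite', 'Gemini 3.5 Flash', 'Gemini 3.1 Pro']
```

Every model left off the list is slower than some frontier model that also scores higher, so it offers no trade-off a reader would pick on these two axes alone.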