a PeerBench project

Which fast LLM speaks German best?

Independent German-language benchmarks for the fastest models — accuracy, latency, and cost, side by side.

German LLM Overview

Which model is best on German tasks — one score, ranked. Filter by open vs. closed weights, price, or speed.

#	Model	Provider	Weights	Score	Range	Price ($/Mtok)	Tier	Speed (tok/s)
1	Gemini 3.1 Pro	Google	Closed	80.3	78–89	$6.75	$$$	127
2	Gemini 3.5 Flash	Google	Closed	68.4	67–75	$2.81	$$	202.5
3	Qwen3.7 Max	Alibaba	Closed	57.7	51–67	$1.28	$$	170.3
4	Gemini 3.1 Flash-Lite	Google	Closed	56.4	52–65	$0.44	$	302.3
5	DeepSeek V4 Pro	DeepSeek	Open	54.1	50–59	$1.62	$$	57.8
6	Gemma 4 31B	Google	Open	52.4	49–57	$0.17	$	38.4
7	DeepSeek V4 Flash	DeepSeek	Open	52.1	48–58	$0.14	$	102.5
8	MiMo V2.5 Pro	Xiaomi	Open	51.7	46–61	$0.85	$$	49.2
9	Gemini 2.5 Flash	Google	Closed	51.0	47–57	$1.00	$$	216.2
10	Qwen3.6 35B-A3B	Alibaba	Open	48.7	46–53	$0.30	$	141.6
11	Claude Haiku 4.5	Anthropic	Closed	46.4	44–49	$3.36	$$$	140.6
12	Gemma 4 26B A4B	Google	Open	45.0	40–51	$0.23	$	84.4
13	Gemma 4 12B	Google	Open	44.8	40–49	—	—	46.3
14	GLM-5.1	Z.ai	Open	44.7	36–52	$1.45	$$	71.4
15	Tencent HY3-Preview	Tencent	Open	44.7	38–48	$0.08	$	94.8
16	Qwen3.5 9B	Alibaba	Open	39.1	34–42	$0.12	$	57.7
17	Qwen3 14B	Alibaba	Open	35.4	29–39	$0.14	$	64.7

Showing 17 models that ran ≥3 of 7 benchmarks (9 excluded for thin coverage). Price = median effective $/1M tokens; Speed = throughput + latency.
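
The coverage rule in the note above can be sketched as follows. This is a minimal illustration, not PeerBench's actual pipeline: the `results` layout and both helper functions are hypothetical, the scores are copied from the per-benchmark tables further down this page, and the overview may apply further selection criteria that are not stated here.

```python
# Hypothetical layout: model -> {benchmark: score in percent}.
# Scores are copied from the per-benchmark tables on this page.
results = {
    "Claude Haiku 4.5": {"INCLUDE": 68.3, "MMLU-Pro": 75.3, "MMMLU": 83.1},
    "Gemini 2.5 Flash-Lite": {"MMLU-Pro": 71.2, "MMMLU": 79.6},
    "gpt-oss-120b": {"INCLUDE": 66.2},
}

MIN_BENCHMARKS = 3  # "ran >=3 of 7 benchmarks"


def covered_models(results, min_benchmarks=MIN_BENCHMARKS):
    """Models that ran at least `min_benchmarks` benchmarks."""
    return sorted(m for m, scores in results.items() if len(scores) >= min_benchmarks)


def excluded_models(results, min_benchmarks=MIN_BENCHMARKS):
    """Models dropped from the overview for thin coverage."""
    return sorted(m for m, scores in results.items() if len(scores) < min_benchmarks)


print("kept:", covered_models(results))
print("excluded:", excluded_models(results))
```

In this sample, Claude Haiku 4.5 clears the threshold with exactly three benchmarks, while the other two models fall below it.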

GermEval — German NER

Native German named-entity recognition — identify persons, locations, organisations and misc entities in German text, emitted as JSON. Scored with seqeval micro-F1 excluding the noisy MISC class. Run reasoning-off.

1,024 questions · Named-entity recognition · Native German · Source: GermEval (via EuroEval)

#	Model	Score	Latency	Tok in	Tok out	Cost	Date
1	Gemini 3.1 Pro 🔒 reasoning	87.3%	4.86s	1,059.7	436.2	$10.04	2026-06-09
2	Gemini 3.5 Flash	86.0%	0.81s	1,077	24.6	$1.88	2026-06-09
3	GPT-5.5 🔒 reasoning	85.9%	1.86s	1,140.7	40.4	$7.08	2026-06-13
4	Opus 4.8	85.8%	2.12s	2,110.5	42	$11.88	2026-06-13
5	Gemini 3.1 Flash-Lite	83.2%	0.52s	1,077	25.7	$0.63	2026-06-08
6	Gemma 4 31B	82.6%	10.35s	1,153	27.8	$0.15	2026-06-11
7	DeepSeek V4 Pro	82.2%	1.74s	1,181.2	26.3	$2.02	2026-06-09
8	Qwen3.7 Max	82.1%	1.64s	1,154.3	25.2	$1.57	2026-06-09
9	MiMo V2.5 Pro	81.9%	1.09s	1,533.5	26.4	$1.49	2026-06-09
10	Gemma 4 26B A4B	81.8%	1.14s	1,153	30.3	$0.19	2026-06-08
11	Gemini 2.5 Flash	81.8%	0.52s	1,077	24.5	$0.39	2026-06-09
12	DeepSeek V4 Flash	80.2%	0.77s	1,181.2	26.8	$0.18	2026-06-09
13	Qwen3.6 35B-A3B	79.9%	0.67s	1,154.3	27.3	$0.21	2026-06-09
14	Gemma 4 12B	79.6%	9.49s	1,153	28.1	—	2026-06-11
15	Tencent HY3-Preview	77.3%	2.50s	1,313.7	35.4	$0.09	2026-06-09
16	Qwen3 14B	73.0%	9.61s	1,292.5	28.5	$0.17	2026-06-11
17	Qwen3.5 9B	72.6%	5.70s	1,154.3	26.9	$0.12	2026-06-11
18	GLM-5.1	52.2%	2.20s	1,127.6	26.5	$1.71	2026-06-09
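
To make the GermEval scoring rule concrete, here is a simplified sketch of micro-F1 with the MISC class excluded. It is a stand-in, not the seqeval implementation: seqeval scores tagged token sequences, whereas this sketch compares `(type, text)` pairs per sentence. The function name, the data layout and the example sentences are assumptions made for illustration.

```python
from collections import Counter


def micro_f1(gold, pred, exclude=("MISC",)):
    """Micro-averaged F1 over entities, ignoring the excluded types.

    `gold` and `pred` are lists (one item per sentence) of
    (entity_type, entity_text) tuples. Hypothetical layout.
    """
    tp = fp = fn = 0
    for gold_sent, pred_sent in zip(gold, pred):
        g = Counter(e for e in gold_sent if e[0] not in exclude)
        p = Counter(e for e in pred_sent if e[0] not in exclude)
        overlap = sum((g & p).values())  # entities matched exactly
        tp += overlap
        fp += sum(p.values()) - overlap
        fn += sum(g.values()) - overlap
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented example sentences, not taken from the benchmark.
gold = [
    [("PER", "Angela Merkel"), ("LOC", "Berlin"), ("MISC", "Bundestagswahl")],
    [("ORG", "Siemens"), ("LOC", "München")],
]
pred = [
    [("PER", "Angela Merkel"), ("LOC", "Berlin")],
    [("ORG", "Siemens AG"), ("LOC", "München")],
]
score = micro_f1(gold, pred)
print(f"micro-F1 (MISC excluded): {score:.2f}")
```

Because counts are pooled across sentences before precision and recall are computed, frequent entity types weigh more than rare ones. The missed MISC entity costs nothing here, while the near-miss "Siemens AG" counts as both a false positive and a false negative.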

INCLUDE — German

Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.

139 questions · 4-option multiple choice · Native German · Source: CohereLabs/include-base-44

#	Model	Score	Latency	Tok in	Tok out	Cost	Date
1	Gemini 3.1 Pro 🔒 reasoning	77.7%	4.24s	107.1	416.6	$0.72	2026-06-10
2	GPT-5.5 🔒 reasoning	74.8%	3.99s	128.2	153.6	$0.70	2026-06-12
3	Gemini 3.5 Flash	74.1%	1.72s	107.1	205.4	$0.28	2026-06-10
4	MiniMax M2.7 🔒 reasoning	72.7%	12.26s	151.1	1,443.8	$0.29	—
5	Gemini 3.1 Flash-Lite	71.9%	0.89s	107.1	151	$0.04	2026-06-10
6	Opus 4.8	71.9%	5.77s	256.2	343.7	$1.32	2026-06-12
7	Qwen3.7 Max	71.2%	4.17s	118.3	239.6	$0.28	2026-06-08
8	Gemini 2.5 Flash	70.5%	1.61s	105.8	193.6	$0.07	2026-06-02
9	DeepSeek V4 Pro	70.5%	3.45s	118.6	121.1	$0.08	2026-06-08
10	DeepSeek V4 Flash	70.5%	2.53s	118.6	111.3	$0.0066	2026-06-08
11	Kimi K2.6	69.8%	13.65s	138.7	527.1	$0.58	2026-06-08
12	Tencent HY3-Preview	69.1%	5.48s	143.9	222.4	$0.0086	2026-06-02
13	Gemma 4 12B	69.1%	9.49s	123.1	221.9	—	2026-06-11
14	Qwen3.6 35B-A3B	68.3%	2.72s	118.4	545.1	$0.08	2026-06-02
15	Claude Haiku 4.5	68.3%	3.45s	154.6	347.4	$0.26	2026-06-02
16	Gemma 4 31B	68.3%	10.35s	119.1	205.4	$0.01	2026-06-11
17	MiMo V2.5 Pro	67.6%	4.63s	372.2	224.3	$0.10	2026-06-08
18	GLM-5.1	67.6%	3.36s	114.6	259.7	$0.35	2026-06-08
19	gpt-oss-120b 🔒 reasoning	66.2%	1.71s	154.1	111.7	$0.0078	2026-05-29
20	Gemma 4 26B A4B	64.7%	3.71s	118.7	275.8	$0.03	2026-06-03
21	Qwen3.5 9B	64.7%	5.70s	122.4	378	$0.0096	2026-06-11
22	grok-4.3	63.3%	0.60s	233.8	20.5	$0.04	2026-06-03
23	Qwen3 14B	63.3%	9.61s	135.4	46	$0.0038	2026-06-11
24	Ministral 14B	58.3%	0.41s	109.1	79.8	$0.01	2026-06-03
25	gemma-3-12b-it	53.2%	5.00s	114.8	160.5	$0.0073	2026-06-03

MMLU-Pro — German

Hard academic questions across 14 subjects — STEM, law, health, economics, philosophy and more. Professionally translated to German, with up to ten answer options per question.

11,759 questions · 10-option multiple choice · Professional translation · Source: li-lab/MMLU-ProX

#	Model	Score	Latency	Tok in	Tok out	Cost	Date
1	Gemini 3.5 Flash	86.5%	2.23s	1,664.9	353.3	$66.76	2026-05-26
2	Gemini 3.1 Flash-Lite	82.2%	1.24s	1,665.9	304	$10.26	2026-06-03
3	Gemma 4 31B	82.1%	10.35s	1,702.9	502.5	$4.47	2026-06-11
4	Qwen3.6 35B-A3B	80.0%	4.67s	1,689.6	984	$13.36	2026-05-31
5	Gemini 2.5 Flash	79.6%	2.69s	1,664.9	786	$28.89	2026-05-26
6	Gemma 4 26B A4B	78.2%	7.43s	1,702.3	629.5	$3.49	2026-06-05
7	Claude Haiku 4.5	75.3%	3.76s	2,262	433.7	$52.10	2026-06-03
8	Qwen3.5 9B	73.4%	5.70s	1,693.6	931.1	$3.63	2026-06-11
9	Tencent HY3-Preview	73.2%	6.14s	1,910.2	586.3	$2.76	2026-05-27
10	Gemma 4 12B	73.1%	9.49s	1,705.9	493.3	—	2026-06-11
11	Gemini 2.5 Flash-Lite	71.2%	2.48s	1,665.1	1,493.5	$8.95	2026-06-03
12	Qwen3 14B	64.1%	9.61s	1,887.3	413.7	$3.83	2026-06-11
13	DeepSeek V4 Flash	36.8%	2.01s	1,716.4	172.4	$1.91	2026-06-03

Per-subject breakdown (168 rows)

Subject	Model	Score
biology	Gemini 3.5 Flash	92.9%
biology	Gemma 4 31B	91.2%
biology	Qwen3.6 35B-A3B	89.7%
biology	Gemini 3.1 Flash-Lite	89.4%
biology	Gemini 2.5 Flash	88.8%
biology	DeepSeek V4 Flash	88.3%
biology	Gemma 4 26B A4B	87.3%
biology	Gemini 2.5 Flash-Lite	86.8%
biology	Tencent HY3-Preview	85.8%
business	Gemini 3.5 Flash	89.6%
business	Gemma 4 31B	87.5%
business	Gemini 3.1 Flash-Lite	85.3%
business	Gemma 4 26B A4B	83.9%
business	Gemini 2.5 Flash	83.5%
business	Qwen3.6 35B-A3B	83.5%
business	DeepSeek V4 Flash	81.0%
business	Tencent HY3-Preview	78.3%
business	Gemini 2.5 Flash-Lite	77.7%
chemistry	Gemini 3.5 Flash	89.1%
chemistry	Qwen3.6 35B-A3B	88.2%
chemistry	Gemini 3.1 Flash-Lite	87.1%
chemistry	Gemini 2.5 Flash	86.9%
chemistry	Gemma 4 31B	86.7%
chemistry	Gemma 4 26B A4B	85.4%
chemistry	Gemini 2.5 Flash-Lite	78.4%
chemistry	DeepSeek V4 Flash	75.7%
chemistry	Tencent HY3-Preview	75.6%
computer science	Gemini 3.1 Flash-Lite	88.8%
computer science	Gemini 3.5 Flash	87.6%
computer science	Gemma 4 31B	85.9%
computer science	Gemini 2.5 Flash	85.1%
computer science	Qwen3.6 35B-A3B	85.1%
computer science	Gemma 4 26B A4B	83.9%
computer science	DeepSeek V4 Flash	81.7%
computer science	Gemini 2.5 Flash-Lite	77.8%
computer science	Tencent HY3-Preview	76.8%
economics	Gemini 3.5 Flash	89.1%
economics	Gemini 3.1 Flash-Lite	87.3%
economics	Gemma 4 31B	86.8%
economics	Gemini 2.5 Flash	86.4%
economics	Qwen3.6 35B-A3B	85.3%
economics	Gemma 4 26B A4B	82.6%
economics	Tencent HY3-Preview	82.3%
economics	Gemini 2.5 Flash-Lite	79.1%
economics	DeepSeek V4 Flash	69.5%
engineering	Gemini 3.5 Flash	82.5%
engineering	Gemini 3.1 Flash-Lite	77.8%
engineering	Qwen3.6 35B-A3B	77.5%
engineering	Gemma 4 31B	77.1%
engineering	Gemma 4 26B A4B	73.5%
engineering	Gemini 2.5 Flash	71.6%
engineering	Tencent HY3-Preview	64.4%
engineering	DeepSeek V4 Flash	60.0%
engineering	Gemini 2.5 Flash-Lite	58.4%
health	Gemini 3.5 Flash	78.6%
health	Gemini 3.1 Flash-Lite	74.8%
health	Gemma 4 31B	74.1%
health	Tencent HY3-Preview	73.7%
health	Qwen3.6 35B-A3B	72.8%
health	Gemini 2.5 Flash	72.5%
health	Gemma 4 26B A4B	71.2%
health	Gemini 2.5 Flash-Lite	66.5%
health	DeepSeek V4 Flash	13.8%
history	Gemini 3.5 Flash	80.6%
history	Gemini 3.1 Flash-Lite	74.5%
history	Gemma 4 31B	74.5%
history	Tencent HY3-Preview	71.4%
history	Gemini 2.5 Flash	70.9%
history	Qwen3.6 35B-A3B	69.0%
history	Gemma 4 26B A4B	66.7%
history	Gemini 2.5 Flash-Lite	59.3%
history	DeepSeek V4 Flash	10.8%
law	Gemini 3.5 Flash	72.8%
law	Gemini 3.1 Flash-Lite	62.8%
law	Gemma 4 31B	59.0%
law	Gemini 2.5 Flash	54.0%
law	Qwen3.6 35B-A3B	52.8%
law	Gemma 4 26B A4B	50.7%
law	Tencent HY3-Preview	45.8%
law	Gemini 2.5 Flash-Lite	41.6%
law	DeepSeek V4 Flash	10.4%
math	Gemini 3.5 Flash	94.6%
math	Gemma 4 31B	93.5%
math	Qwen3.6 35B-A3B	92.7%
math	Gemma 4 26B A4B	92.0%
math	Gemini 2.5 Flash	90.5%
math	Gemini 3.1 Flash-Lite	90.2%
math	Tencent HY3-Preview	86.5%
math	Gemini 2.5 Flash-Lite	82.8%
math	DeepSeek V4 Flash	9.5%
nothink biology	Gemma 4 12B	87.0%
nothink biology	Qwen3.5 9B	85.6%
nothink biology	Qwen3 14B	81.3%
nothink business	Gemma 4 12B	79.8%
nothink business	Qwen3.5 9B	78.1%
nothink business	Qwen3 14B	70.3%
nothink chemistry	Qwen3.5 9B	85.0%
nothink chemistry	Gemma 4 12B	79.9%
nothink chemistry	Qwen3 14B	72.2%
nothink computer science	Gemma 4 12B	79.3%
nothink computer science	Qwen3.5 9B	78.0%
nothink computer science	Qwen3 14B	71.5%
nothink economics	Qwen3.5 9B	78.9%
nothink economics	Gemma 4 12B	78.7%
nothink economics	Qwen3 14B	71.1%
nothink engineering	Qwen3.5 9B	66.6%
nothink engineering	Gemma 4 12B	65.1%
nothink engineering	Qwen3 14B	55.9%
nothink health	Qwen3.5 9B	65.9%
nothink health	Gemma 4 12B	64.5%
nothink health	Qwen3 14B	56.8%
nothink history	Qwen3.5 9B	60.4%
nothink history	Gemma 4 12B	57.2%
nothink history	Qwen3 14B	48.8%
nothink law	Gemma 4 12B	42.8%
nothink law	Qwen3.5 9B	36.7%
nothink law	Qwen3 14B	27.4%
nothink math	Qwen3.5 9B	90.2%
nothink math	Gemma 4 12B	90.1%
nothink math	Qwen3 14B	82.2%
nothink other	Qwen3.5 9B	61.6%
nothink other	Gemma 4 12B	60.7%
nothink other	Qwen3 14B	52.1%
nothink philosophy	Gemma 4 12B	60.3%
nothink philosophy	Qwen3.5 9B	58.1%
nothink philosophy	Qwen3 14B	49.3%
nothink physics	Qwen3.5 9B	84.7%
nothink physics	Gemma 4 12B	81.9%
nothink physics	Qwen3 14B	73.1%
nothink psychology	Gemma 4 12B	76.3%
nothink psychology	Qwen3.5 9B	74.2%
nothink psychology	Qwen3 14B	66.0%
other	Gemini 3.5 Flash	82.4%
other	Gemini 3.1 Flash-Lite	76.7%
other	Gemma 4 31B	73.7%
other	Gemini 2.5 Flash	73.5%
other	Tencent HY3-Preview	72.1%
other	Qwen3.6 35B-A3B	70.8%
other	Gemma 4 26B A4B	67.5%
other	Gemini 2.5 Flash-Lite	64.9%
other	DeepSeek V4 Flash	9.2%
philosophy	Gemini 3.5 Flash	83.2%
philosophy	Gemini 3.1 Flash-Lite	75.2%
philosophy	Gemma 4 31B	74.3%
philosophy	Gemini 2.5 Flash	70.9%
philosophy	Qwen3.6 35B-A3B	69.5%
philosophy	Tencent HY3-Preview	69.1%
philosophy	Gemma 4 26B A4B	69.1%
philosophy	Gemini 2.5 Flash-Lite	60.9%
philosophy	DeepSeek V4 Flash	6.8%
physics	Gemini 3.5 Flash	90.4%
physics	Gemma 4 31B	89.0%
physics	Qwen3.6 35B-A3B	88.0%
physics	Gemini 3.1 Flash-Lite	87.6%
physics	Gemma 4 26B A4B	86.1%
physics	Gemini 2.5 Flash	85.5%
physics	Gemini 2.5 Flash-Lite	76.6%
physics	Tencent HY3-Preview	64.7%
physics	DeepSeek V4 Flash	10.2%
psychology	Gemini 3.5 Flash	87.8%
psychology	Gemma 4 31B	83.6%
psychology	Gemini 3.1 Flash-Lite	83.3%
psychology	Gemini 2.5 Flash	82.6%
psychology	Tencent HY3-Preview	80.5%
psychology	Qwen3.6 35B-A3B	78.7%
psychology	Gemma 4 26B A4B	78.6%
psychology	Gemini 2.5 Flash-Lite	75.1%
psychology	DeepSeek V4 Flash	10.4%

MMMLU — German

OpenAI's multilingual MMLU, German split — general knowledge spanning STEM, the humanities, social sciences and other domains. Professionally translated to German.

14,042 questions · 4-option multiple choice · Professional translation · Source: openai/MMMLU

#	Model	Score	Latency	Tok in	Tok out	Cost	Date
1	Gemini 3.5 Flash	89.3%	2.05s	172	255.8	$35.95	2026-05-26
2	Gemini 3.1 Flash-Lite	86.8%	1.06s	173	210.8	$5.05	2026-06-03
3	Gemma 4 31B	86.6%	10.35s	185	251.5	$1.55	2026-06-11
4	Qwen3.6 35B-A3B	85.8%	4.39s	182.9	561.4	$8.27	2026-06-01
5	DeepSeek V4 Flash	84.9%	2.32s	185.9	144.7	$1.03	2026-06-03
6	Gemini 2.5 Flash	84.7%	1.48s	172	275.5	$10.40	2026-05-26
7	Gemma 4 26B A4B	83.7%	3.97s	185	340.7	$2.24	2026-06-05
8	Tencent HY3-Preview	83.7%	5.55s	223.4	280.9	$1.14	2026-05-28
9	Claude Haiku 4.5	83.1%	4.20s	277.7	397.6	$31.90	2026-06-02
10	Gemini 2.5 Flash-Lite	79.6%	1.68s	162.7	504.6	$4.99	2026-06-03
11	Gemma 4 12B	79.0%	9.49s	189	261.2	—	2026-06-11
12	Qwen3.5 9B	78.8%	5.70s	186.9	408.3	$1.12	2026-06-11
13	Qwen3 14B	73.4%	9.61s	212	87.9	$0.65	2026-06-11

MuSR (DE)

German MuSR: multistep soft reasoning across three task types (murder mysteries, object placements and team allocation). Uses a frozen v1 German translation. Run configuration: generate_until, cot+. Primary metric: accuracy.

564 questions

#	Model	Score	Latency	Tok in	Tok out	Cost	Date
1	Gemini 3.1 Pro 🔒 reasoning	88.1%	12.14s	1,464.9	1,306.5	$10.49	2026-06-09
2	GPT-5.5 🔒 reasoning	86.3%	13.58s	1,449.5	617.2	$14.53	2026-06-13
3	Opus 4.8	86.3%	13.87s	3,039.8	1,075.6	$23.74	2026-06-13
4	Gemini 3.1 Flash-Lite	84.4%	2.95s	1,467.3	568.5	$0.73	2026-06-09
5	Gemini 3.5 Flash	84.2%	5.79s	1,464.9	850.2	$5.55	2026-06-09
6	Gemma 4 26B A4B	83.7%	9.78s	1,478.6	814.1	$0.29	2026-06-10
7	DeepSeek V4 Flash	83.5%	10.18s	1,659.8	798.9	$0.26	2026-06-10
8	Gemma 4 31B	83.5%	10.35s	1,477.9	650.1	$0.23	2026-06-11
9	Gemini 2.5 Flash	83.3%	6.28s	1,464.9	1,077.4	$1.77	2026-06-09
10	Gemma 4 12B	81.6%	9.49s	1,477.9	707.4	—	2026-06-11
11	GLM-5.1	81.4%	26.91s	1,583.3	882.5	$3.44	2026-06-10
12	MiMo V2.5 Pro	80.9%	11.43s	1,964.3	754.9	$0.81	2026-06-10
13	Qwen3.5 9B	80.9%	5.70s	1,469.2	1,629.7	$0.22	2026-06-11
14	Qwen3 14B	69.7%	9.61s	1,728.9	1,270.4	$0.28	2026-06-11

Per-subject breakdown (42 rows)

Subject	Model	Score
murder mystery	GPT-5.5	90.0%
murder mystery	Gemini 3.1 Pro	89.2%
murder mystery	Gemini 2.5 Flash	87.6%
murder mystery	Opus 4.8	87.6%
murder mystery	Gemma 4 31B	86.4%
murder mystery	DeepSeek V4 Flash	85.6%
murder mystery	Gemma 4 12B	85.6%
murder mystery	Gemma 4 26B A4B	85.2%
murder mystery	Gemini 3.1 Flash-Lite	84.8%
murder mystery	Gemini 3.5 Flash	84.0%
murder mystery	GLM-5.1	80.0%
murder mystery	MiMo V2.5 Pro	78.8%
murder mystery	Qwen3.5 9B	77.6%
murder mystery	Qwen3 14B	76.8%
object placements	Gemini 3.5 Flash	81.3%
object placements	GLM-5.1	81.3%
object placements	Gemini 3.1 Pro	79.7%
object placements	Gemma 4 12B	79.7%
object placements	Opus 4.8	79.7%
object placements	MiMo V2.5 Pro	76.6%
object placements	Gemma 4 31B	76.6%
object placements	Gemini 3.1 Flash-Lite	75.0%
object placements	Gemini 2.5 Flash	75.0%
object placements	DeepSeek V4 Flash	75.0%
object placements	Gemma 4 26B A4B	73.4%
object placements	Qwen3.5 9B	73.4%
object placements	Qwen3 14B	71.9%
object placements	GPT-5.5	68.8%
team allocation	Gemini 3.1 Pro	89.2%
team allocation	GPT-5.5	87.2%
team allocation	Opus 4.8	86.8%
team allocation	Gemini 3.1 Flash-Lite	86.4%
team allocation	Qwen3.5 9B	86.0%
team allocation	Gemini 3.5 Flash	85.2%
team allocation	DeepSeek V4 Flash	84.8%
team allocation	Gemma 4 26B A4B	84.8%
team allocation	MiMo V2.5 Pro	84.0%
team allocation	GLM-5.1	82.8%
team allocation	Gemma 4 31B	82.4%
team allocation	Gemini 2.5 Flash	81.2%
team allocation	Gemma 4 12B	78.0%
team allocation	Qwen3 14B	62.0%

SB10K — German sentiment

Native German social-media sentiment classification — positive, neutral or negative. Human-annotated German text, not translated. Run reasoning-off; scored as exact-match accuracy on the predicted label.

1,024 questions · 3-class sentiment · Native German · Source: SB10K (via EuroEval)

#	Model	Score	Latency	Tok in	Tok out	Cost	Date
1	Gemini 3.1 Pro 🔒 reasoning	70.1%	3.30s	625.7	205.5	$3.81	2026-06-09
2	Opus 4.8	64.5%	1.88s	1,372.7	5	$7.16	2026-06-13
3	Gemini 3.5 Flash	64.3%	0.67s	650.6	1.7	$1.02	2026-06-09
4	Qwen3.7 Max	63.2%	0.99s	771.9	1.7	$0.99	2026-06-09
5	GPT-5.5	62.5%	1.36s	756.9	8.1	$4.12	2026-06-13
6	DeepSeek V4 Flash	62.4%	0.54s	778.6	2.7	$0.11	2026-06-09
7	Gemma 4 12B	62.2%	9.49s	758.7	2.7	—	2026-06-11
8	MiMo V2.5 Pro	61.7%	0.45s	1,140.3	2.8	$0.99	2026-06-09
9	Gemini 2.5 Flash	61.6%	0.40s	650.7	1.8	$0.20	2026-06-09
10	Gemma 4 31B	61.0%	10.35s	758.7	2.8	$0.09	2026-06-11
11	Gemini 3.1 Flash-Lite	60.4%	0.46s	650.7	1.8	$0.34	2026-06-08
12	DeepSeek V4 Pro	60.0%	1.42s	778.6	2.1	$1.28	2026-06-09
13	Qwen3.6 35B-A3B	58.4%	0.45s	771.9	2.7	$0.12	2026-06-09
14	Qwen3 14B	56.1%	9.61s	899.3	2.7	$0.11	2026-06-11
15	Qwen3.5 9B	55.7%	5.70s	771.9	2.6	$0.08	2026-06-11
16	Tencent HY3-Preview	38.9%	2.35s	886.7	2.7	$0.06	2026-06-09
17	GLM-5.1	23.1%	1.88s	765.6	2.6	$1.09	2026-06-09
18	Gemma 4 26B A4B	15.2%	0.70s	758.8	2.8	$0.12	2026-06-08
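
Exact-match accuracy on the predicted label, as used for SB10K, can be sketched in a few lines. The whitespace and case normalisation below is an assumption, since the page does not say how raw model output is cleaned before comparison. The function name and the sample predictions are invented for illustration.

```python
def exact_match_accuracy(gold, predicted):
    """Share of predictions whose normalised text equals the gold label."""
    if not gold:
        return 0.0
    correct = sum(
        1 for g, p in zip(gold, predicted)
        if p.strip().lower() == g  # assumed normalisation: trim + lowercase
    )
    return correct / len(gold)


gold = ["positive", "neutral", "negative", "neutral"]
predicted = ["positive", " Neutral", "negative.", "positiv"]  # invented outputs
accuracy = exact_match_accuracy(gold, predicted)
print(f"accuracy: {accuracy:.2f}")
```

Under this rule a label with trailing punctuation or a German spelling counts as wrong, so a model's score reflects whether it follows the output format as well as whether it judges sentiment correctly.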

ScaLA — German acceptability

Native German linguistic acceptability — does the sentence read as grammatical German (ja / nein)? Built from clean vs. minimally-corrupted German sentences. Run reasoning-off.

2,048 questions · Binary acceptability · Native German · Source: ScaLA-de (via EuroEval)

#	Model	Score	Latency	Tok in	Tok out	Cost	Date
1	Opus 4.8	82.6%	2.00s	1,684.8	4	$17.46	2026-06-13
2	Gemini 3.1 Pro 🔒 reasoning	80.4%	3.71s	777.9	262	$10.37	2026-06-09
3	GPT-5.5 🔒 reasoning	79.7%	1.52s	913.2	27.6	$11.05	2026-06-13
4	Gemini 3.5 Flash	79.4%	0.79s	802.7	1.5	$2.49	2026-06-09
5	Gemini 2.5 Flash	77.1%	0.40s	802.7	1.5	$0.50	2026-06-09
6	DeepSeek V4 Flash	74.9%	0.53s	949.8	2.5	$0.27	2026-06-09
7	MiMo V2.5 Pro	74.8%	0.57s	1,301	2.5	$2.33	2026-06-09
8	Gemini 3.1 Flash-Lite	74.2%	0.46s	802.7	1.5	$0.83	2026-06-08
9	DeepSeek V4 Pro	73.5%	1.41s	949.8	1.6	$3.12	2026-06-09
10	Gemma 4 31B	71.0%	10.35s	910.7	2.5	$0.23	2026-06-11
11	GLM-5.1	69.9%	1.69s	901.6	2.5	$2.57	2026-06-09
12	Qwen3.7 Max	69.8%	1.02s	928.9	1.6	$2.39	2026-06-09
13	Gemma 4 26B A4B	66.3%	0.46s	910.7	2.6	$0.28	2026-06-08
14	Qwen3.6 35B-A3B	64.2%	0.51s	928.9	2.6	$0.29	2026-06-09
15	Gemma 4 12B	64.1%	9.49s	910.7	2.6	—	2026-06-11
16	Qwen3 14B	64.0%	9.61s	1,060	2.6	$0.26	2026-06-11
17	Tencent HY3-Preview	62.5%	2.48s	1,027.6	2.6	$0.13	2026-06-09
18	Qwen3.5 9B	62.1%	5.70s	928.9	2.5	$0.19	2026-06-11

TPS is decode speed — output tokens per second after the first token; higher is faster. TTFT is time to first token; lower is snappier. A snapshot, not a constant.

#	Model	TPS	TTFT
1	gpt-oss-120b 🔒	362	0.51s
2	Gemini 3.1 Flash-Lite	302	4.99s
3	Gemini 2.5 Flash-Lite	293	0.36s
4	Gemini 2.5 Flash	216	0.61s
5	Gemini 3.5 Flash	203	0.88s
6	grok-4.3	171	0.60s
7	Qwen3.7 Max	170	1.61s
8	Qwen3.6 35B-A3B	142	1.18s
9	Claude Haiku 4.5	141	0.64s
10	Gemini 3.1 Pro 🔒	127	—
11	DeepSeek V4 Flash	103	1.00s
12	Tencent HY3-Preview	95	2.47s
13	Ministral 14B	87	0.41s
14	Gemma 4 26B A4B	84	0.95s
15	GLM-5.1	71	0.89s
16	Qwen3 14B	65	1.02s
17	DeepSeek V4 Pro	58	1.02s
18	Qwen3.5 9B	58	1.21s
19	MiMo V2.5 Pro	49	2.03s
20	Gemma 4 12B	46	0.83s
21	Gemma 4 31B	38	1.01s
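
The two speed metrics defined above the table can be derived from three timestamps per request. The sketch below assumes a streaming client that records when the request was sent, when the first token arrived and when the last token arrived. The `Timing` class and its field names are hypothetical, and counting the decode window from the first token onwards is one reading of "after the first token".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Timing:
    t_request: float      # seconds, when the request was sent
    t_first_token: float  # seconds, when the first output token arrived
    t_last_token: float   # seconds, when the last output token arrived
    output_tokens: int    # total output tokens received


def ttft(t: Timing) -> float:
    """Time to first token: lower is snappier."""
    return t.t_first_token - t.t_request


def tps(t: Timing) -> Optional[float]:
    """Decode speed: tokens generated after the first one, per second."""
    decode_time = t.t_last_token - t.t_first_token
    if t.output_tokens < 2 or decode_time <= 0:
        return None  # not enough tokens to measure a decode rate
    return (t.output_tokens - 1) / decode_time


sample = Timing(t_request=0.0, t_first_token=0.5, t_last_token=2.5, output_tokens=401)
print(f"TTFT: {ttft(sample):.2f}s, TPS: {tps(sample):.1f}")
```

Separating the two numbers matters because a model can stream quickly once it starts yet take a long time to start, which is what a high TPS paired with a high TTFT would indicate.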

Quality vs. speed

Average score across all benchmarks against decode speed (output tokens per second). Up and to the right is better — smarter and faster. The gold line is the speed frontier: the best score available at each speed.

* scored on INCLUDE only — its average covers fewer benchmarks, so it isn't directly comparable to the full-coverage models.

🔒 reasoning can't be disabled — its decode speed includes forced reasoning tokens, so it isn't directly comparable to the reasoning-off models.
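
One way to compute a speed frontier like the gold line is to keep every model that no faster model outscores. This sketch reads "best score available at each speed" as "best score among models at least this fast", which is an assumption. The function and type names are hypothetical, and the sample values come from the German LLM Overview table rather than from the chart's own averages.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    model: str
    speed: float  # decode speed in tok/s
    score: float  # overall score


def speed_frontier(points: List[Point]) -> List[Point]:
    """Models not beaten on score by any model that is at least as fast."""
    frontier = []
    best = float("-inf")
    # Walk from fastest to slowest; on equal speed, the higher score goes first.
    for p in sorted(points, key=lambda p: (-p.speed, -p.score)):
        if p.score > best:
            frontier.append(p)
            best = p.score
    return frontier


points = [
    Point("Gemini 3.1 Pro", 127.0, 80.3),
    Point("Gemini 3.5 Flash", 202.5, 68.4),
    Point("Qwen3.7 Max", 170.3, 57.7),
    Point("Gemini 3.1 Flash-Lite", 302.3, 56.4),
    Point("Gemini 2.5 Flash", 216.2, 51.0),
]
for p in speed_frontier(points):
    print(f"{p.model}: {p.score} at {p.speed} tok/s")
```

In this sample Qwen3.7 Max and Gemini 2.5 Flash drop off the frontier because Gemini 3.5 Flash is both faster and higher scoring than the former, and Gemini 3.1 Flash-Lite is both faster and higher scoring than the latter.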