German LLM Leaderboard

GPT-5.5 (provider-internal) — 77.9% avg on German LLM Benchmarks

Sat, 13 Jun 2026 00:00:00 GMT

GPT-5.5 (OpenAI) ranked #4 of 26 models on German-language benchmarks with an average score of 77.9%.

Benchmark scores

Benchmark	Format	Score
GermEval — German NER	Native German · Named-entity recognition	85.9%
INCLUDE — German	Native German · 4-option multiple choice	74.8%
MuSR (DE)		86.3%
SB10K — German sentiment	Native German · 3-class sentiment	62.5%
ScaLA — German acceptability	Native German · Binary acceptability	79.7%

What these benchmarks test

GermEval — German NER: Native German named-entity recognition — identify persons, locations, organisations and misc entities in German text, emitted as JSON. Scored with seqeval micro-F1 excluding the noisy MISC class. Run reasoning-off.
INCLUDE — German: Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.
SB10K — German sentiment: Native German social-media sentiment classification — positive, neutral or negative. Human-annotated German text, not translated. Run reasoning-off; scored as exact-match accuracy on the predicted label.
ScaLA — German acceptability: Native German linguistic acceptability — does the sentence read as grammatical German (ja / nein)? Built from clean vs. minimally-corrupted German sentences. Run reasoning-off.
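
The GermEval scoring rule described above (entity-level micro-F1 with the MISC class left out) can be sketched in plain Python. This is an illustration only, not the leaderboard's scorer and not the seqeval implementation: the `(type, start, end)` span format, the label names `PER`, `LOC`, `ORG` and `MISC`, and the function name `micro_f1_without_misc` are all assumptions.

```python
from typing import FrozenSet, Iterable, List, Tuple

# Hypothetical span format: (entity type, first token index, last token index).
Span = Tuple[str, int, int]


def micro_f1_without_misc(
    gold: Iterable[List[Span]],
    pred: Iterable[List[Span]],
    excluded: FrozenSet[str] = frozenset({"MISC"}),
) -> float:
    """Micro-averaged F1 over exact-match entity spans, ignoring excluded types."""
    tp = fp = fn = 0
    for gold_spans, pred_spans in zip(gold, pred):
        # Drop excluded classes on both sides before counting.
        g = {s for s in gold_spans if s[0] not in excluded}
        p = {s for s in pred_spans if s[0] not in excluded}
        tp += len(g & p)   # spans predicted with the right type and boundaries
        fp += len(p - g)   # predicted spans with no gold match
        fn += len(g - p)   # gold spans the prediction missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gold = [[("PER", 0, 1), ("LOC", 4, 4), ("MISC", 6, 7)], [("ORG", 2, 3)]]
pred = [[("PER", 0, 1), ("LOC", 5, 5)], [("ORG", 2, 3), ("MISC", 0, 0)]]
score = micro_f1_without_misc(gold, pred)
```

Counts are pooled across all sentences before precision and recall are computed, which is what makes the average "micro". In this toy input the MISC spans on both sides are ignored, leaving two matches, one spurious span and one missed span.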

Cost: $7.811 / 1,000 questions
Quantization: provider-internal
Run date: 2026-06-13

Opus 4.8 (provider-internal) — 78.3% avg on German LLM Benchmarks

Sat, 13 Jun 2026 00:00:00 GMT

Opus 4.8 (Anthropic) ranked #3 of 26 models on German-language benchmarks with an average score of 78.3%.

Benchmark scores

Benchmark	Format	Score
GermEval — German NER	Native German · Named-entity recognition	85.8%
INCLUDE — German	Native German · 4-option multiple choice	71.9%
MuSR (DE)		86.3%
SB10K — German sentiment	Native German · 3-class sentiment	64.5%
ScaLA — German acceptability	Native German · Binary acceptability	82.6%

What these benchmarks test

GermEval — German NER: Native German named-entity recognition — identify persons, locations, organisations and misc entities in German text, emitted as JSON. Scored with seqeval micro-F1 excluding the noisy MISC class. Run reasoning-off.
INCLUDE — German: Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.
SB10K — German sentiment: Native German social-media sentiment classification — positive, neutral or negative. Human-annotated German text, not translated. Run reasoning-off; scored as exact-match accuracy on the predicted label.
ScaLA — German acceptability: Native German linguistic acceptability — does the sentence read as grammatical German (ja / nein)? Built from clean vs. minimally-corrupted German sentences. Run reasoning-off.
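
Exact-match accuracy on a predicted label, as named in the SB10K description above, could look roughly like the following. The normalization step (trimming whitespace and a little punctuation, lower-casing) and the names `normalize_label` and `exact_match_accuracy` are assumptions; the source does not say how model output is cleaned before comparison.

```python
from typing import Sequence

# The three classes named in the SB10K description.
LABELS = ("positive", "neutral", "negative")


def normalize_label(raw: str) -> str:
    """Hypothetical cleanup: trim whitespace and surrounding punctuation, lower-case."""
    return raw.strip().strip(".!\"'").lower()


def exact_match_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Share of items whose normalized prediction equals the gold label exactly."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    hits = sum(normalize_label(p) == g for g, p in zip(gold, predicted))
    return hits / len(gold)


gold = ["positive", "neutral", "negative", "neutral"]
predicted = ["Positive", "neutral.", "neutral", "I think it is neutral"]
accuracy = exact_match_accuracy(gold, predicted)
```

Under this sketch a free-text answer such as the last prediction counts as wrong even though it contains the right word, which is one way a model that ignores the output format can score far below its actual classification ability.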

Cost: $12.827 / 1,000 questions
Quantization: provider-internal
Run date: 2026-06-13

Gemma 4 31B (bf16) — 76.5% avg on German LLM Benchmarks

Thu, 11 Jun 2026 00:00:00 GMT

Gemma 4 31B (Google) ranked #7 of 26 models on German-language benchmarks with an average score of 76.5%.

Benchmark scores

Benchmark	Format	Score
GermEval — German NER	Native German · Named-entity recognition	82.6%
INCLUDE — German	Native German · 4-option multiple choice	68.3%
MMLU-Pro — German	Professional translation · 10-option multiple choice	82.1%
MMMLU — German	Professional translation · 4-option multiple choice	86.6%
MuSR (DE)		83.5%
SB10K — German sentiment	Native German · 3-class sentiment	61.0%
ScaLA — German acceptability	Native German · Binary acceptability	71.0%

What these benchmarks test

GermEval — German NER: Native German named-entity recognition — identify persons, locations, organisations and misc entities in German text, emitted as JSON. Scored with seqeval micro-F1 excluding the noisy MISC class. Run reasoning-off.
INCLUDE — German: Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.
MMLU-Pro — German: Hard academic questions across 14 subjects — STEM, law, health, economics, philosophy and more. Professionally translated to German, with up to ten answer options per question.
MMMLU — German: OpenAI's multilingual MMLU, German split — general knowledge spanning STEM, the humanities, social sciences and other domains. Professionally translated to German.
SB10K — German sentiment: Native German social-media sentiment classification — positive, neutral or negative. Human-annotated German text, not translated. Run reasoning-off; scored as exact-match accuracy on the predicted label.
ScaLA — German acceptability: Native German linguistic acceptability — does the sentence read as grammatical German (ja / nein)? Built from clean vs. minimally-corrupted German sentences. Run reasoning-off.
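
To make "clean vs. minimally-corrupted" concrete, here is a small sketch of how such ja/nein pairs could be generated. The two corruption operations used (deleting one word, swapping two neighbouring words) are an assumption for illustration; the source does not specify which corruptions the benchmark applies, and the names `corrupt` and `make_pairs` are hypothetical.

```python
import random
from typing import Iterable, List, Tuple


def corrupt(sentence: str, rng: random.Random) -> str:
    """Return a copy with one word deleted or two neighbouring words swapped."""
    words = sentence.split()
    if len(words) < 2:
        raise ValueError("need at least two words to corrupt")
    if rng.random() < 0.5:
        # Assumed corruption 1: drop a single word.
        del words[rng.randrange(len(words))]
    else:
        # Assumed corruption 2: swap two adjacent words.
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def make_pairs(sentences: Iterable[str], seed: int = 0) -> List[Tuple[str, str]]:
    """Label each clean sentence "ja" and its corrupted copy "nein"."""
    rng = random.Random(seed)
    pairs = []
    for sentence in sentences:
        pairs.append((sentence, "ja"))
        pairs.append((corrupt(sentence, rng), "nein"))
    return pairs


pairs = make_pairs(["Der Hund schläft heute im Garten"])
```

A naive generator like this can occasionally produce a sentence that is still grammatical, or unchanged when two identical neighbours are swapped, so a real dataset would need additional filtering.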

Cost: $0.220 / 1,000 questions
Speed: 40 tok/s
TTFT: 0.36s
Quantization: bf16
Run date: 2026-06-11

Gemma 4 12B (bf16) — 72.7% avg on German LLM Benchmarks

Thu, 11 Jun 2026 00:00:00 GMT

Gemma 4 12B (Google) ranked #12 of 26 models on German-language benchmarks with an average score of 72.7%.

Benchmark scores

Benchmark	Format	Score
GermEval — German NER	Native German · Named-entity recognition	79.6%
INCLUDE — German	Native German · 4-option multiple choice	69.1%
MMLU-Pro — German	Professional translation · 10-option multiple choice	73.1%
MMMLU — German	Professional translation · 4-option multiple choice	79.0%
MuSR (DE)		81.6%
SB10K — German sentiment	Native German · 3-class sentiment	62.2%
ScaLA — German acceptability	Native German · Binary acceptability	64.1%

What these benchmarks test

GermEval — German NER: Native German named-entity recognition — identify persons, locations, organisations and misc entities in German text, emitted as JSON. Scored with seqeval micro-F1 excluding the noisy MISC class. Run reasoning-off.
INCLUDE — German: Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.
MMLU-Pro — German: Hard academic questions across 14 subjects — STEM, law, health, economics, philosophy and more. Professionally translated to German, with up to ten answer options per question.
MMMLU — German: OpenAI's multilingual MMLU, German split — general knowledge spanning STEM, the humanities, social sciences and other domains. Professionally translated to German.
SB10K — German sentiment: Native German social-media sentiment classification — positive, neutral or negative. Human-annotated German text, not translated. Run reasoning-off; scored as exact-match accuracy on the predicted label.
ScaLA — German acceptability: Native German linguistic acceptability — does the sentence read as grammatical German (ja / nein)? Built from clean vs. minimally-corrupted German sentences. Run reasoning-off.

Speed: 46 tok/s
TTFT: 0.83s
Quantization: bf16
Run date: 2026-06-11

Qwen3.5 9B (bf16) — 69.7% avg on German LLM Benchmarks

Thu, 11 Jun 2026 00:00:00 GMT

Qwen3.5 9B (Alibaba) ranked #18 of 26 models on German-language benchmarks with an average score of 69.7%.

Benchmark scores

Benchmark	Format	Score
GermEval — German NER	Native German · Named-entity recognition	72.6%
INCLUDE — German	Native German · 4-option multiple choice	64.7%
MMLU-Pro — German	Professional translation · 10-option multiple choice	73.4%
MMMLU — German	Professional translation · 4-option multiple choice	78.8%
MuSR (DE)		80.9%
SB10K — German sentiment	Native German · 3-class sentiment	55.7%
ScaLA — German acceptability	Native German · Binary acceptability	62.1%

What these benchmarks test

GermEval — German NER: Native German named-entity recognition — identify persons, locations, organisations and misc entities in German text, emitted as JSON. Scored with seqeval micro-F1 excluding the noisy MISC class. Run reasoning-off.
INCLUDE — German: Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.
MMLU-Pro — German: Hard academic questions across 14 subjects — STEM, law, health, economics, philosophy and more. Professionally translated to German, with up to ten answer options per question.
MMMLU — German: OpenAI's multilingual MMLU, German split — general knowledge spanning STEM, the humanities, social sciences and other domains. Professionally translated to German.
SB10K — German sentiment: Native German social-media sentiment classification — positive, neutral or negative. Human-annotated German text, not translated. Run reasoning-off; scored as exact-match accuracy on the predicted label.
ScaLA — German acceptability: Native German linguistic acceptability — does the sentence read as grammatical German (ja / nein)? Built from clean vs. minimally-corrupted German sentences. Run reasoning-off.

Cost: $0.176 / 1,000 questions
Speed: 73 tok/s
TTFT: 0.21s
Quantization: bf16
Run date: 2026-06-11

Qwen3 14B (bf16) — 66.2% avg on German LLM Benchmarks

Thu, 11 Jun 2026 00:00:00 GMT

Qwen3 14B (Alibaba) ranked #21 of 26 models on German-language benchmarks with an average score of 66.2%.

Benchmark scores

Benchmark	Format	Score
GermEval — German NER	Native German · Named-entity recognition	73.0%
INCLUDE — German	Native German · 4-option multiple choice	63.3%
MMLU-Pro — German	Professional translation · 10-option multiple choice	64.1%
MMMLU — German	Professional translation · 4-option multiple choice	73.4%
MuSR (DE)		69.7%
SB10K — German sentiment	Native German · 3-class sentiment	56.1%
ScaLA — German acceptability	Native German · Binary acceptability	64.0%

What these benchmarks test

GermEval — German NER: Native German named-entity recognition — identify persons, locations, organisations and misc entities in German text, emitted as JSON. Scored with seqeval micro-F1 excluding the noisy MISC class. Run reasoning-off.
INCLUDE — German: Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.
MMLU-Pro — German: Hard academic questions across 14 subjects — STEM, law, health, economics, philosophy and more. Professionally translated to German, with up to ten answer options per question.
MMMLU — German: OpenAI's multilingual MMLU, German split — general knowledge spanning STEM, the humanities, social sciences and other domains. Professionally translated to German.
SB10K — German sentiment: Native German social-media sentiment classification — positive, neutral or negative. Human-annotated German text, not translated. Run reasoning-off; scored as exact-match accuracy on the predicted label.
ScaLA — German acceptability: Native German linguistic acceptability — does the sentence read as grammatical German (ja / nein)? Built from clean vs. minimally-corrupted German sentences. Run reasoning-off.

Cost: $0.174 / 1,000 questions
Speed: 44 tok/s
TTFT: 0.42s
Quantization: bf16
Run date: 2026-06-11

Gemini 3.5 Flash (provider-internal) — 80.6% avg on German LLM Benchmarks

Wed, 10 Jun 2026 00:00:00 GMT

Gemini 3.5 Flash (Google) ranked #2 of 26 models on German-language benchmarks with an average score of 80.6%.

Benchmark scores

Benchmark	Format	Score
GermEval — German NER	Native German · Named-entity recognition	86.0%
INCLUDE — German	Native German · 4-option multiple choice	74.1%
MMLU-Pro — German	Professional translation · 10-option multiple choice	86.5%
MMMLU — German	Professional translation · 4-option multiple choice	89.3%
MuSR (DE)		84.2%
SB10K — German sentiment	Native German · 3-class sentiment	64.3%
ScaLA — German acceptability	Native German · Binary acceptability	79.4%

What these benchmarks test

GermEval — German NER: Native German named-entity recognition — identify persons, locations, organisations and misc entities in German text, emitted as JSON. Scored with seqeval micro-F1 excluding the noisy MISC class. Run reasoning-off.
INCLUDE — German: Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.
MMLU-Pro — German: Hard academic questions across 14 subjects — STEM, law, health, economics, philosophy and more. Professionally translated to German, with up to ten answer options per question.
MMMLU — German: OpenAI's multilingual MMLU, German split — general knowledge spanning STEM, the humanities, social sciences and other domains. Professionally translated to German.
SB10K — German sentiment: Native German social-media sentiment classification — positive, neutral or negative. Human-annotated German text, not translated. Run reasoning-off; scored as exact-match accuracy on the predicted label.
ScaLA — German acceptability: Native German linguistic acceptability — does the sentence read as grammatical German (ja / nein)? Built from clean vs. minimally-corrupted German sentences. Run reasoning-off.

Cost: $3.723 / 1,000 questions
Speed: 181 tok/s
TTFT: 0.75s
Quantization: provider-internal
Run date: 2026-06-10

Gemini 3.1 Flash-Lite (provider-internal) — 77.6% avg on German LLM Benchmarks

Wed, 10 Jun 2026 00:00:00 GMT

Gemini 3.1 Flash-Lite (Google) ranked #5 of 26 models on German-language benchmarks with an average score of 77.6%.

Benchmark scores

Benchmark	Format	Score
GermEval — German NER	Native German · Named-entity recognition	83.2%
INCLUDE — German	Native German · 4-option multiple choice	71.9%
MMLU-Pro — German	Professional translation · 10-option multiple choice	82.2%
MMMLU — German	Professional translation · 4-option multiple choice	86.8%
MuSR (DE)		84.4%
SB10K — German sentiment	Native German · 3-class sentiment	60.4%
ScaLA — German acceptability	Native German · Binary acceptability	74.2%

What these benchmarks test

GermEval — German NER: Native German named-entity recognition — identify persons, locations, organisations and misc entities in German text, emitted as JSON. Scored with seqeval micro-F1 excluding the noisy MISC class. Run reasoning-off.
INCLUDE — German: Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.
MMLU-Pro — German: Hard academic questions across 14 subjects — STEM, law, health, economics, philosophy and more. Professionally translated to German, with up to ten answer options per question.
MMMLU — German: OpenAI's multilingual MMLU, German split — general knowledge spanning STEM, the humanities, social sciences and other domains. Professionally translated to German.
SB10K — German sentiment: Native German social-media sentiment classification — positive, neutral or negative. Human-annotated German text, not translated. Run reasoning-off; scored as exact-match accuracy on the predicted label.
ScaLA — German acceptability: Native German linguistic acceptability — does the sentence read as grammatical German (ja / nein)? Built from clean vs. minimally-corrupted German sentences. Run reasoning-off.

Cost: $0.584 / 1,000 questions
Speed: 223 tok/s
TTFT: 0.61s
Quantization: provider-internal
Run date: 2026-06-10

DeepSeek V4 Flash (bf16) — 70.5% avg on German LLM Benchmarks

Wed, 10 Jun 2026 00:00:00 GMT

DeepSeek V4 Flash (DeepSeek) ranked #16 of 26 models on German-language benchmarks with an average score of 70.5%.

Benchmark scores

Benchmark	Format	Score
GermEval — German NER	Native German · Named-entity recognition	80.2%
INCLUDE — German	Native German · 4-option multiple choice	70.5%
MMLU-Pro — German	Professional translation · 10-option multiple choice	36.8%
MMMLU — German	Professional translation · 4-option multiple choice	84.9%
MuSR (DE)		83.5%
SB10K — German sentiment	Native German · 3-class sentiment	62.4%
ScaLA — German acceptability	Native German · Binary acceptability	74.9%

What these benchmarks test

GermEval — German NER: Native German named-entity recognition — identify persons, locations, organisations and misc entities in German text, emitted as JSON. Scored with seqeval micro-F1 excluding the noisy MISC class. Run reasoning-off.
INCLUDE — German: Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.
MMLU-Pro — German: Hard academic questions across 14 subjects — STEM, law, health, economics, philosophy and more. Professionally translated to German, with up to ten answer options per question.
MMMLU — German: OpenAI's multilingual MMLU, German split — general knowledge spanning STEM, the humanities, social sciences and other domains. Professionally translated to German.
SB10K — German sentiment: Native German social-media sentiment classification — positive, neutral or negative. Human-annotated German text, not translated. Run reasoning-off; scored as exact-match accuracy on the predicted label.
ScaLA — German acceptability: Native German linguistic acceptability — does the sentence read as grammatical German (ja / nein)? Built from clean vs. minimally-corrupted German sentences. Run reasoning-off.

Cost: $0.123 / 1,000 questions
Speed: 115 tok/s
TTFT: 0.92s
Quantization: bf16
Run date: 2026-06-10

Gemma 4 26B A4B (provider-internal) — 67.7% avg on German LLM Benchmarks

Wed, 10 Jun 2026 00:00:00 GMT

Gemma 4 26B A4B (Google) ranked #19 of 26 models on German-language benchmarks with an average score of 67.7%.

Benchmark scores

Benchmark	Format	Score
GermEval — German NER	Native German · Named-entity recognition	81.8%
INCLUDE — German	Native German · 4-option multiple choice	64.7%
MMLU-Pro — German	Professional translation · 10-option multiple choice	78.2%
MMMLU — German	Professional translation · 4-option multiple choice	83.7%
MuSR (DE)		83.7%
SB10K — German sentiment	Native German · 3-class sentiment	15.2%
ScaLA — German acceptability	Native German · Binary acceptability	66.3%
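
The headline average for this entry is consistent with an unweighted mean of the seven scores in the table above. That aggregation rule is an inference from the numbers, not something the source states, so treat the snippet as a sanity check rather than the leaderboard's method. The shortened dictionary keys are for illustration only.

```python
from statistics import mean

# Per-benchmark scores for Gemma 4 26B A4B, copied from the table above (percent).
scores = {
    "GermEval": 81.8,
    "INCLUDE": 64.7,
    "MMLU-Pro": 78.2,
    "MMMLU": 83.7,
    "MuSR (DE)": 83.7,
    "SB10K": 15.2,
    "ScaLA": 66.3,
}

# Assumed aggregation: plain mean, every benchmark weighted equally.
average = round(mean(list(scores.values())), 1)
print(average)
```

With equal weights, the single low SB10K score pulls the average down by roughly ten points compared with a mean over the other six benchmarks. For some other entries the same calculation lands 0.1 point below the published average, which would be consistent with averages computed from unrounded scores; the source does not say.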

What these benchmarks test

GermEval — German NER: Native German named-entity recognition — identify persons, locations, organisations and misc entities in German text, emitted as JSON. Scored with seqeval micro-F1 excluding the noisy MISC class. Run reasoning-off.
INCLUDE — German: Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.
MMLU-Pro — German: Hard academic questions across 14 subjects — STEM, law, health, economics, philosophy and more. Professionally translated to German, with up to ten answer options per question.
MMMLU — German: OpenAI's multilingual MMLU, German split — general knowledge spanning STEM, the humanities, social sciences and other domains. Professionally translated to German.
SB10K — German sentiment: Native German social-media sentiment classification — positive, neutral or negative. Human-annotated German text, not translated. Run reasoning-off; scored as exact-match accuracy on the predicted label.
ScaLA — German acceptability: Native German linguistic acceptability — does the sentence read as grammatical German (ja / nein)? Built from clean vs. minimally-corrupted German sentences. Run reasoning-off.

Cost: $0.217 / 1,000 questions
Speed: 46 tok/s
TTFT: 1.16s
Quantization: provider-internal
Run date: 2026-06-10

Gemini 3.1 Pro (provider-internal) — 80.7% avg on German LLM Benchmarks

Wed, 10 Jun 2026 00:00:00 GMT

Gemini 3.1 Pro (Google) ranked #1 of 26 models on German-language benchmarks with an average score of 80.7%.

Benchmark scores

Benchmark	Format	Score
GermEval — German NER	Native German · Named-entity recognition	87.3%
INCLUDE — German	Native German · 4-option multiple choice	77.7%
MuSR (DE)		88.1%
SB10K — German sentiment	Native German · 3-class sentiment	70.1%
ScaLA — German acceptability	Native German · Binary acceptability	80.4%

What these benchmarks test

GermEval — German NER: Native German named-entity recognition — identify persons, locations, organisations and misc entities in German text, emitted as JSON. Scored with seqeval micro-F1 excluding the noisy MISC class. Run reasoning-off.
INCLUDE — German: Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.
SB10K — German sentiment: Native German social-media sentiment classification — positive, neutral or negative. Human-annotated German text, not translated. Run reasoning-off; scored as exact-match accuracy on the predicted label.
ScaLA — German acceptability: Native German linguistic acceptability — does the sentence read as grammatical German (ja / nein)? Built from clean vs. minimally-corrupted German sentences. Run reasoning-off.

Cost: $7.383 / 1,000 questions
Quantization: provider-internal
Run date: 2026-06-10

MiMo V2.5 Pro (fp8) — 73.4% avg on German LLM Benchmarks

Wed, 10 Jun 2026 00:00:00 GMT

MiMo V2.5 Pro (Xiaomi) ranked #10 of 26 models on German-language benchmarks with an average score of 73.4%.

Benchmark scores

Benchmark	Format	Score
GermEval — German NER	Native German · Named-entity recognition	81.9%
INCLUDE — German	Native German · 4-option multiple choice	67.6%
MuSR (DE)		80.9%
SB10K — German sentiment	Native German · 3-class sentiment	61.7%
ScaLA — German acceptability	Native German · Binary acceptability	74.8%

What these benchmarks test

GermEval — German NER: Native German named-entity recognition — identify persons, locations, organisations and misc entities in German text, emitted as JSON. Scored with seqeval micro-F1 excluding the noisy MISC class. Run reasoning-off.
INCLUDE — German: Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.
SB10K — German sentiment: Native German social-media sentiment classification — positive, neutral or negative. Human-annotated German text, not translated. Run reasoning-off; scored as exact-match accuracy on the predicted label.
ScaLA — German acceptability: Native German linguistic acceptability — does the sentence read as grammatical German (ja / nein)? Built from clean vs. minimally-corrupted German sentences. Run reasoning-off.

Cost: $1.190 / 1,000 questions
Quantization: fp8
Run date: 2026-06-10

GLM-5.1 (fp8) — 58.8% avg on German LLM Benchmarks

Wed, 10 Jun 2026 00:00:00 GMT

GLM-5.1 (Z.ai) ranked #24 of 26 models on German-language benchmarks with an average score of 58.8%.

Benchmark scores

Benchmark	Format	Score
GermEval — German NER	Native German · Named-entity recognition	52.2%
INCLUDE — German	Native German · 4-option multiple choice	67.6%
MuSR (DE)		81.4%
SB10K — German sentiment	Native German · 3-class sentiment	23.1%
ScaLA — German acceptability	Native German · Binary acceptability	69.9%

What these benchmarks test

GermEval — German NER: Native German named-entity recognition — identify persons, locations, organisations and misc entities in German text, emitted as JSON. Scored with seqeval micro-F1 excluding the noisy MISC class. Run reasoning-off.
INCLUDE — German: Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.
SB10K — German sentiment: Native German social-media sentiment classification — positive, neutral or negative. Human-annotated German text, not translated. Run reasoning-off; scored as exact-match accuracy on the predicted label.
ScaLA — German acceptability: Native German linguistic acceptability — does the sentence read as grammatical German (ja / nein)? Built from clean vs. minimally-corrupted German sentences. Run reasoning-off.

Cost: $1.908 / 1,000 questions
Speed: 32 tok/s
TTFT: 0.65s
Quantization: fp8
Run date: 2026-06-10

Qwen3.6 35B-A3B (fp8) — 72.8% avg on German LLM Benchmarks

Tue, 09 Jun 2026 00:00:00 GMT

Qwen3.6 35B-A3B (Alibaba) ranked #11 of 26 models on German-language benchmarks with an average score of 72.8%.

Benchmark scores

Benchmark	Format	Score
GermEval — German NER	Native German · Named-entity recognition	79.9%
INCLUDE — German	Native German · 4-option multiple choice	68.3%
MMLU-Pro — German	Professional translation · 10-option multiple choice	80.0%
MMMLU — German	Professional translation · 4-option multiple choice	85.8%
SB10K — German sentiment	Native German · 3-class sentiment	58.4%
ScaLA — German acceptability	Native German · Binary acceptability	64.2%

What these benchmarks test

GermEval — German NER: Native German named-entity recognition — identify persons, locations, organisations and misc entities in German text, emitted as JSON. Scored with seqeval micro-F1 excluding the noisy MISC class. Run reasoning-off.
INCLUDE — German: Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.
MMLU-Pro — German: Hard academic questions across 14 subjects — STEM, law, health, economics, philosophy and more. Professionally translated to German, with up to ten answer options per question.
MMMLU — German: OpenAI's multilingual MMLU, German split — general knowledge spanning STEM, the humanities, social sciences and other domains. Professionally translated to German.
SB10K — German sentiment: Native German social-media sentiment classification — positive, neutral or negative. Human-annotated German text, not translated. Run reasoning-off; scored as exact-match accuracy on the predicted label.
ScaLA — German acceptability: Native German linguistic acceptability — does the sentence read as grammatical German (ja / nein)? Built from clean vs. minimally-corrupted German sentences. Run reasoning-off.

Cost: $0.743 / 1,000 questions
Speed: 159 tok/s
TTFT: 0.62s
Quantization: fp8
Run date: 2026-06-09

Gemini 2.5 Flash (provider-internal) — 76.9% avg on German LLM Benchmarks

Tue, 09 Jun 2026 00:00:00 GMT

Gemini 2.5 Flash (Google) ranked #6 of 26 models on German-language benchmarks with an average score of 76.9%.

Benchmark scores

Benchmark	Format	Score
GermEval — German NER	Native German · Named-entity recognition	81.8%
INCLUDE — German	Native German · 4-option multiple choice	70.5%
MMLU-Pro — German	Professional translation · 10-option multiple choice	79.6%
MMMLU — German	Professional translation · 4-option multiple choice	84.7%
MuSR (DE)		83.3%
SB10K — German sentiment	Native German · 3-class sentiment	61.6%
ScaLA — German acceptability	Native German · Binary acceptability	77.1%

What these benchmarks test

GermEval — German NER: Native German named-entity recognition — identify persons, locations, organisations and misc entities in German text, emitted as JSON. Scored with seqeval micro-F1 excluding the noisy MISC class. Run reasoning-off.
INCLUDE — German: Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.
MMLU-Pro — German: Hard academic questions across 14 subjects — STEM, law, health, economics, philosophy and more. Professionally translated to German, with up to ten answer options per question.
MMMLU — German: OpenAI's multilingual MMLU, German split — general knowledge spanning STEM, the humanities, social sciences and other domains. Professionally translated to German.
SB10K — German sentiment: Native German social-media sentiment classification — positive, neutral or negative. Human-annotated German text, not translated. Run reasoning-off; scored as exact-match accuracy on the predicted label.
ScaLA — German acceptability: Native German linguistic acceptability — does the sentence read as grammatical German (ja / nein)? Built from clean vs. minimally-corrupted German sentences. Run reasoning-off.

Cost: $1.380 / 1,000 questions
Speed: 159 tok/s
TTFT: 0.41s
Quantization: provider-internal
Run date: 2026-06-09

Tencent HY3-Preview (provider-internal) — 67.4% avg on German LLM Benchmarks

Tue, 09 Jun 2026 00:00:00 GMT

Tencent HY3-Preview (Tencent) ranked #20 of 26 models on German-language benchmarks with an average score of 67.4%.

Benchmark scores

Benchmark	Format	Score
GermEval — German NER	Native German · Named-entity recognition	77.3%
INCLUDE — German	Native German · 4-option multiple choice	69.1%
MMLU-Pro — German	Professional translation · 10-option multiple choice	73.2%
MMMLU — German	Professional translation · 4-option multiple choice	83.7%
SB10K — German sentiment	Native German · 3-class sentiment	38.9%
ScaLA — German acceptability	Native German · Binary acceptability	62.5%

What these benchmarks test

GermEval — German NER: Native German named-entity recognition — identify persons, locations, organisations and misc entities in German text, emitted as JSON. Scored with seqeval micro-F1 excluding the noisy MISC class. Run reasoning-off.
INCLUDE — German: Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.
MMLU-Pro — German: Hard academic questions across 14 subjects — STEM, law, health, economics, philosophy and more. Professionally translated to German, with up to ten answer options per question.
MMMLU — German: OpenAI's multilingual MMLU, German split — general knowledge spanning STEM, the humanities, social sciences and other domains. Professionally translated to German.
SB10K — German sentiment: Native German social-media sentiment classification — positive, neutral or negative. Human-annotated German text, not translated. Run reasoning-off; scored as exact-match accuracy on the predicted label.
ScaLA — German acceptability: Native German linguistic acceptability — does the sentence read as grammatical German (ja / nein)? Built from clean vs. minimally-corrupted German sentences. Run reasoning-off.

Cost: $0.139 / 1,000 questions
Speed: 107 tok/s
TTFT: 2.69s
Quantization: provider-internal
Run date: 2026-06-09

Qwen3.7 Max (provider-internal) — 71.6% avg on German LLM Benchmarks

Tue, 09 Jun 2026 00:00:00 GMT

Qwen3.7 Max (Alibaba) ranked #14 of 26 models on German-language benchmarks with an average score of 71.6%.

Benchmark scores

Benchmark	Format	Score
GermEval — German NER	Native German · Named-entity recognition	82.1%
INCLUDE — German	Native German · 4-option multiple choice	71.2%
SB10K — German sentiment	Native German · 3-class sentiment	63.2%
ScaLA — German acceptability	Native German · Binary acceptability	69.8%

What these benchmarks test

GermEval — German NER: Native German named-entity recognition — identify persons, locations, organisations and misc entities in German text, emitted as JSON. Scored with seqeval micro-F1 excluding the noisy MISC class. Run reasoning-off.
INCLUDE — German: Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.
SB10K — German sentiment: Native German social-media sentiment classification — positive, neutral or negative. Human-annotated German text, not translated. Run reasoning-off; scored as exact-match accuracy on the predicted label.
ScaLA — German acceptability: Native German linguistic acceptability — does the sentence read as grammatical German (ja / nein)? Built from clean vs. minimally-corrupted German sentences. Run reasoning-off.

Cost: $1.238 / 1,000 questions
Quantization: provider-internal
Run date: 2026-06-09


DeepSeek V4 Pro (fp8) — 71.5% avg on German LLM Benchmarks

Tue, 09 Jun 2026 00:00:00 GMT

DeepSeek V4 Pro (DeepSeek) ranked #15 of 26 models on German-language benchmarks with an average score of 71.5%.

Benchmark scores

Benchmark	Format	Score
GermEval — German NER	Native German · Named-entity recognition	82.2%
INCLUDE — German	Native German · 4-option multiple choice	70.5%
SB10K — German sentiment	Native German · 3-class sentiment	60.0%
ScaLA — German acceptability	Native German · Binary acceptability	73.5%

What these benchmarks test

GermEval — German NER: Native German named-entity recognition — identify persons, locations, organisations and misc entities in German text, emitted as JSON. Scored with seqeval micro-F1 excluding the noisy MISC class. Run reasoning-off.
INCLUDE — German: Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.
SB10K — German sentiment: Native German social-media sentiment classification — positive, neutral or negative. Human-annotated German text, not translated. Run reasoning-off; scored as exact-match accuracy on the predicted label.
ScaLA — German acceptability: Native German linguistic acceptability — does the sentence read as grammatical German (ja / nein)? Built from clean vs. minimally-corrupted German sentences. Run reasoning-off.
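
The SB10K description above says predictions are scored by exact-match accuracy on the label. A minimal sketch of that idea follows. The normalisation step (trimming whitespace, lower-casing, dropping a trailing full stop) and the names `normalise` and `accuracy` are assumptions for illustration; the leaderboard's actual answer parsing is not described in the source.

```python
from typing import Optional

LABELS = {"positive", "neutral", "negative"}


def normalise(raw: str) -> Optional[str]:
    """Map raw model output to a label, or None if it is not a valid label."""
    label = raw.strip().lower().rstrip(".")
    return label if label in LABELS else None


def accuracy(gold: list, raw_predictions: list) -> float:
    """Share of items whose normalised prediction equals the gold label."""
    correct = sum(normalise(p) == g for g, p in zip(gold, raw_predictions))
    return correct / len(gold)


# Toy data: the third output is not a valid label, the fourth is wrong.
gold = ["positive", "neutral", "negative", "neutral"]
raw = ["Positive", "neutral.", "positiv", " negative "]
result = accuracy(gold, raw)
print(result)  # prints 0.5
```

Under this sketch an unparseable answer counts as wrong, which is one common choice for strict exact-match scoring.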

Cost: $1.534 / 1,000 questions
Quantization: fp8
Run date: 2026-06-09
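
The cost figure is quoted per 1,000 questions, so projecting it to a different workload is a simple rescaling. The sketch below does that for the DeepSeek V4 Pro rate above. The workload size of 25,000 questions and the function name `projected_cost` are hypothetical, and the projection assumes the new questions resemble the benchmark questions in length.

```python
def projected_cost(rate_per_1000: float, n_questions: int) -> float:
    """Scale a per-1,000-question rate (USD) to n_questions."""
    return rate_per_1000 * n_questions / 1000


# Rate copied from the entry above; the workload size is hypothetical.
estimate = projected_cost(1.534, 25_000)
print(f"${estimate:.2f}")  # prints $38.35
```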


Kimi K2.6 (fp8) — 69.8% avg on German LLM Benchmarks

Mon, 08 Jun 2026 00:00:00 GMT

Kimi K2.6 (Moonshot) ranked #17 of 26 models on German-language benchmarks with an average score of 69.8%.

Benchmark scores

Benchmark	Format	Score
INCLUDE — German	Native German · 4-option multiple choice	69.8%

What these benchmarks test

INCLUDE — German: Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.

Cost: $4.162 / 1,000 questions
Quantization: fp8
Run date: 2026-06-08


Claude Haiku 4.5 (provider-internal) — 75.6% avg on German LLM Benchmarks

Wed, 03 Jun 2026 00:00:00 GMT

Claude Haiku 4.5 (Anthropic) ranked #8 of 26 models on German-language benchmarks with an average score of 75.6%.

Benchmark scores

Benchmark	Format	Score
INCLUDE — German	Native German · 4-option multiple choice	68.3%
MMLU-Pro — German	Professional translation · 10-option multiple choice	75.3%
MMMLU — German	Professional translation · 4-option multiple choice	83.1%

What these benchmarks test

INCLUDE — German: Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.
MMLU-Pro — German: Hard academic questions across 14 subjects — STEM, law, health, economics, philosophy and more. Professionally translated to German, with up to ten answer options per question.
MMMLU — German: OpenAI's multilingual MMLU, German split — general knowledge spanning STEM, the humanities, social sciences and other domains. Professionally translated to German.

Cost: $3.248 / 1,000 questions
Speed: 131 tok/s
TTFT: 0.79s
Quantization: provider-internal
Run date: 2026-06-03
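
Speed and TTFT together give a rough estimate of how long a full response takes: time to first token plus the remaining tokens divided by throughput. The sketch below applies that to the Claude Haiku 4.5 figures above. It assumes the reported speed is the steady decode rate after the first token, which the source does not state. The 500-token response length and the function name `estimated_latency` are hypothetical.

```python
def estimated_latency(ttft_s: float, tokens_per_s: float, n_tokens: int) -> float:
    """Approximate end-to-end seconds for a response of n_tokens."""
    return ttft_s + n_tokens / tokens_per_s


# TTFT and speed copied from the entry above; 500 tokens is hypothetical.
latency = estimated_latency(ttft_s=0.79, tokens_per_s=131, n_tokens=500)
print(f"{latency:.2f}s")  # prints 4.61s
```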


Gemini 2.5 Flash-Lite (unverified) — 75.4% avg on German LLM Benchmarks

Wed, 03 Jun 2026 00:00:00 GMT

Gemini 2.5 Flash-Lite (Google) ranked #9 of 26 models on German-language benchmarks with an average score of 75.4%.

Benchmark scores

Benchmark	Format	Score
MMLU-Pro — German	Professional translation · 10-option multiple choice	71.2%
MMMLU — German	Professional translation · 4-option multiple choice	79.6%

What these benchmarks test

MMLU-Pro — German: Hard academic questions across 14 subjects — STEM, law, health, economics, philosophy and more. Professionally translated to German, with up to ten answer options per question.
MMMLU — German: OpenAI's multilingual MMLU, German split — general knowledge spanning STEM, the humanities, social sciences and other domains. Professionally translated to German.

Cost: $0.540 / 1,000 questions
Quantization: unverified
Run date: 2026-06-03


grok-4.3 (unverified) — 63.3% avg on German LLM Benchmarks

Wed, 03 Jun 2026 00:00:00 GMT

grok-4.3 (xAI) ranked #23 of 26 models on German-language benchmarks with an average score of 63.3%.

Benchmark scores

Benchmark	Format	Score
INCLUDE — German	Native German · 4-option multiple choice	63.3%

What these benchmarks test

INCLUDE — German: Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.

Cost: $0.274 / 1,000 questions
Quantization: unverified
Run date: 2026-06-03


Ministral 14B (unverified) — 58.3% avg on German LLM Benchmarks

Wed, 03 Jun 2026 00:00:00 GMT

Ministral 14B (Mistral) ranked #25 of 26 models on German-language benchmarks with an average score of 58.3%.

Benchmark scores

Benchmark	Format	Score
INCLUDE — German	Native German · 4-option multiple choice	58.3%

What these benchmarks test

INCLUDE — German: Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.

Cost: $0.079 / 1,000 questions
Quantization: unverified
Run date: 2026-06-03


gemma-3-12b-it (unverified) — 53.2% avg on German LLM Benchmarks

Wed, 03 Jun 2026 00:00:00 GMT

gemma-3-12b-it (Google) ranked #26 of 26 models on German-language benchmarks with an average score of 53.2%.

Benchmark scores

Benchmark	Format	Score
INCLUDE — German	Native German · 4-option multiple choice	53.2%

What these benchmarks test

INCLUDE — German: Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.

Cost: $0.053 / 1,000 questions
Quantization: unverified
Run date: 2026-06-03


gpt-oss-120b (unverified) — 66.2% avg on German LLM Benchmarks

Fri, 29 May 2026 00:00:00 GMT

gpt-oss-120b (OpenAI) ranked #22 of 26 models on German-language benchmarks with an average score of 66.2%.

Benchmark scores

Benchmark	Format	Score
INCLUDE — German	Native German · 4-option multiple choice	66.2%

What these benchmarks test

INCLUDE — German: Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.

Cost: $0.056 / 1,000 questions
Speed: 1641 tok/s
TTFT: 0.31s
Quantization: unverified
Run date: 2026-05-29


MiniMax M2.7 (fp8) — 72.7% avg on German LLM Benchmarks


MiniMax M2.7 (MiniMax) ranked #13 of 26 models on German-language benchmarks with an average score of 72.7%.

Benchmark scores

Benchmark	Format	Score
INCLUDE — German	Native German · 4-option multiple choice	72.7%

What these benchmarks test

INCLUDE — German: Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.

Cost: $2.070 / 1,000 questions
Quantization: fp8
