German LLM Benchmark · Model Profile

Qwen3.6 35B-A3B

Alibaba · fp8 · run 2026-06-09

72.8%

avg. German score

#15 of 30 models

+0.7 pp above average

Benchmark breakdown

GermEval

79.9%

Native German named-entity recognition — identify persons, locations, organisations and misc entities in German text, emitted as JSON. Scored with seqeval micro-F1 excluding the noisy MISC class. Run reasoning-off.

Named-entity recognition · Native German

via GermEval (via EuroEval) ↗
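
The scoring rule described above (micro-F1 over entities, with the MISC class left out) can be sketched in a few lines. This is a simplified illustration and not the benchmark's own scoring code: seqeval works on BIO tag sequences, whereas this sketch compares already-extracted entities, and the label names PER, LOC, ORG and MISC as well as the function name `micro_f1` are assumptions made for the example.

```python
from typing import Iterable

# One entity = (sentence index, entity text, label). Hypothetical format for this sketch.
Entity = tuple[int, str, str]


def micro_f1(
    gold: Iterable[Entity],
    pred: Iterable[Entity],
    skip: frozenset[str] = frozenset({"MISC"}),
) -> float:
    """Entity-level micro-F1, ignoring any entity whose label is in `skip`."""
    gold_set = {e for e in gold if e[2] not in skip}
    pred_set = {e for e in pred if e[2] not in skip}
    true_positives = len(gold_set & pred_set)  # text and label must both match
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Made-up example data, not taken from GermEval.
gold = [
    (0, "Angela Merkel", "PER"),
    (0, "Berlin", "LOC"),
    (1, "Siemens", "ORG"),
    (1, "Oktoberfest", "MISC"),  # dropped before scoring
]
pred = [
    (0, "Angela Merkel", "PER"),
    (0, "Berlin", "ORG"),  # right span, wrong label: counts as a miss
    (1, "Siemens", "ORG"),
]
score = micro_f1(gold, pred)
print(f"micro-F1 without MISC: {score:.3f}")
```

Because the score is micro-averaged, every entity counts equally regardless of its class, so frequent classes dominate the result.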

INCLUDE

68.3%

Native German exam and licensing questions covering region-specific knowledge — history, law, civics and culture. Written by humans in German, not translated.

4-option multiple choice · Native German

via CohereLabs/include-base-44 ↗

MMLU-Pro

80.0%

Hard academic questions across 14 subjects — STEM, law, health, economics, philosophy and more. Professionally translated to German, with up to ten answer options per question.

10-option multiple choice · Professional translation

via li-lab/MMLU-ProX ↗

MMMLU

85.8%

OpenAI's multilingual MMLU, German split — general knowledge spanning STEM, the humanities, social sciences and other domains. Professionally translated to German.

4-option multiple choice · Professional translation

via openai/MMMLU ↗

SB10K

58.4%

Native German social-media sentiment classification — positive, neutral or negative. Human-annotated German text, not translated. Run reasoning-off; scored as exact-match accuracy on the predicted label.

3-class sentiment · Native German

via SB10K (via EuroEval) ↗
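
Exact-match accuracy on a predicted label, as described above, amounts to counting the answers that equal the gold label. The sketch below is illustrative only: the function name `exact_match_accuracy` and the normalisation step (trimming whitespace, lower-casing) are assumptions, and the benchmark may parse model output differently.

```python
def exact_match_accuracy(gold: list[str], pred: list[str]) -> float:
    """Share of predictions whose label equals the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    # Assumed normalisation: strip whitespace and ignore case before comparing.
    hits = sum(g == p.strip().lower() for g, p in zip(gold, pred))
    return hits / len(gold)


# Made-up labels, not taken from SB10K.
gold_labels = ["positive", "neutral", "negative", "neutral"]
pred_labels = ["positive", "negative", "negative", "Neutral "]
accuracy = exact_match_accuracy(gold_labels, pred_labels)
print(f"accuracy: {accuracy:.2f}")
```

With three classes, a model that always guesses the same label still scores the share of that label in the test set, so the score is best read against that baseline.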

ScaLA

64.2%

Native German linguistic acceptability — does the sentence read as grammatical German (ja / nein)? Built from clean vs. minimally-corrupted German sentences. Run reasoning-off.

Binary acceptability · Native German

via ScaLA-de (via EuroEval) ↗
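
The headline average is consistent with an unweighted mean of the six benchmark scores in this breakdown. The check below assumes equal weighting, which the page does not state explicitly. The leaderboard average it derives is only an estimate from rounded figures.

```python
# Scores as listed in the benchmark breakdown (percent).
scores = {
    "GermEval": 79.9,
    "INCLUDE": 68.3,
    "MMLU-Pro": 80.0,
    "MMMLU": 85.8,
    "SB10K": 58.4,
    "ScaLA": 64.2,
}

# Assumption: every benchmark carries the same weight.
average = sum(scores.values()) / len(scores)
rounded_average = round(average, 1)

# The profile reports +0.7 pp above average, so the leaderboard mean
# is roughly the headline score minus 0.7 (estimate from rounded values).
implied_leaderboard_average = round(rounded_average - 0.7, 1)

print(f"mean of six benchmarks: {rounded_average}%")
print(f"implied leaderboard average: {implied_leaderboard_average}%")
```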

Cost & speed

$0.743 per 1,000 questions

159 tokens / second

0.62s time to first token
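
These three figures can be turned into rough planning estimates. The sketch below assumes that cost scales linearly with the number of questions and that the tokens-per-second figure is a steady output rate after the first token. Neither assumption is confirmed by the page, and the function names are hypothetical.

```python
COST_PER_1000_QUESTIONS_USD = 0.743
TOKENS_PER_SECOND = 159
TIME_TO_FIRST_TOKEN_S = 0.62


def cost_usd(n_questions: int) -> float:
    """Estimated cost, assuming it scales linearly with question count."""
    return n_questions * COST_PER_1000_QUESTIONS_USD / 1000


def estimated_latency_s(output_tokens: int) -> float:
    """Estimated wall-clock time for one response of `output_tokens` tokens.

    Assumes a constant generation rate once the first token has arrived.
    """
    return TIME_TO_FIRST_TOKEN_S + output_tokens / TOKENS_PER_SECOND


print(f"25,000 questions: ${cost_usd(25_000):.3f}")
print(f"318-token answer: {estimated_latency_s(318):.2f} s")
```

Real latency depends on prompt length, batching and provider load, so these numbers are a lower-bound style estimate and not a guarantee.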
