a PeerBench project

Admin · MICE (rank-norm) aggregation internal

Same MICE pipeline, but run on per-benchmark rankit (normal-score) normalisation. Rankit is outlier-resistant, so one catastrophic benchmark can't tank a model the way it can under robust-z.
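A minimal sketch of why rank-based normal scores resist outliers, compared with robust-z. It assumes the common rankit approximation (rank - 0.5) / n and a median/MAD robust-z. The function names and the sample scores are hypothetical, and the actual PeerBench pipeline may differ.

```python
from statistics import NormalDist, median


def rankit(scores):
    """Map raw scores to normal scores via (rank - 0.5) / n.

    Tied scores share the average rank of their group.
    Assumption: this is the approximation in use; the real pipeline may differ.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the current tie group.
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]


def robust_z(scores):
    """Median/MAD z-score; 1.4826 rescales MAD to match a normal std dev."""
    med = median(scores)
    mad = median([abs(s - med) for s in scores])
    if mad == 0:
        raise ValueError("MAD is zero; robust-z is undefined for this input")
    return [(s - med) / (1.4826 * mad) for s in scores]


# Hypothetical scores on one benchmark: one model fails catastrophically.
scores = [2.0, 61.0, 63.0, 65.0, 67.0]
print([round(z, 2) for z in rankit(scores)])
print([round(z, 2) for z in robust_z(scores)])
```

On these sample scores the failing model lands at about -1.28 under rankit but about -20.6 under robust-z. Rankit depends only on rank order, so the size of the failure does not leak into the aggregate.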

| Model | n/7 | Vendor | Weights | Score | Score range | Price ($/Mtok) | Price tier | Speed (tok/s) | English |
|---|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | 5/7 | Google | Closed | 79.6 | 74–86 | $6.75 | $$$ | 127 | |
| Gemini 3.5 Flash | 7/7 | Google | Closed | 69.3 | 66–72 | $2.81 | $$ | 202.5 | |
| Qwen3.7 Max | 4/7 | Alibaba | Closed | 60.9 | 54–68 | $1.28 | $$ | 170.3 | |
| Gemini 3.1 Flash-Lite | 7/7 | Google | Closed | 59.9 | 54–66 | $0.44 | $ | 302.3 | |
| DeepSeek V4 Pro | 4/7 | DeepSeek | Open weights | 58.0 | 52–64 | $1.62 | $$ | 57.8 | |
| Gemma 4 31B | 7/7 | Google | Open weights | 54.9 | 51–59 | $0.17 | $ | 38.4 | |
| MiMo V2.5 Pro | 5/7 | Xiaomi | Open weights | 53.9 | 46–62 | $0.85 | $$ | 49.2 | |
| Gemini 2.5 Flash | 7/7 | Google | Closed | 52.4 | 48–57 | $1.00 | $$ | 216.2 | |
| DeepSeek V4 Flash | 7/7 | DeepSeek | Open weights | 51.7 | 43–60 | $0.14 | $ | 102.5 | |
| Qwen3.6 35B-A3B | 6/7 | Alibaba | Open weights | 50.2 | 45–55 | $0.30 | $ | 141.6 | |
| Claude Haiku 4.5 | 3/7 | Anthropic | Closed | 45.0 | 37–53 | $3.36 | $$$ | 140.6 | |
| Gemma 4 26B A4B | 7/7 | Google | Open weights | 44.9 | 37–52 | $0.23 | $ | 84.4 | |
| Tencent HY3-Preview | 6/7 | Tencent | Open weights | 44.8 | 38–51 | $0.08 | $ | 94.8 | |
| Gemma 4 12B | 7/7 | Google | Open weights | 43.9 | 39–49 | — | | 46.3 | |
| GLM-5.1 | 5/7 | Z.ai | Open weights | 39.6 | 30–49 | $1.45 | $$ | 71.4 | |
| Qwen3.5 9B | 7/7 | Alibaba | Open weights | 36.3 | 31–42 | $0.12 | $ | 57.7 | |
| Qwen3 14B | 7/7 | Alibaba | Open weights | 32.6 | 29–36 | $0.14 | $ | 64.7 | |

Showing 17 models that ran at least 3 of 7 benchmarks (9 excluded for thin coverage). Price is the median effective $/1M tokens; Speed combines throughput and latency. The n/7 chip shows how many benchmarks back the score. English is English-language intelligence, a background prior anchoring every Score (it is not a German benchmark and never appears in the German tables).
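
The coverage rule in the note above can be sketched as a simple filter. This is an illustration only: `split_by_coverage`, the `None`-for-missing convention, and the sample data are assumptions, not the project's actual code.

```python
MIN_BENCHMARKS = 3  # threshold stated in the note: at least 3 of 7


def split_by_coverage(results, min_benchmarks=MIN_BENCHMARKS):
    """Split models into shown/excluded by how many benchmarks they ran.

    `results` maps a model name to a list of per-benchmark scores,
    with None marking a benchmark the model did not run (assumed format).
    Returns two dicts mapping model name to an "n/total" coverage label.
    """
    shown, excluded = {}, {}
    for model, scores in results.items():
        n_run = sum(1 for s in scores if s is not None)
        label = f"{n_run}/{len(scores)}"
        if n_run >= min_benchmarks:
            shown[model] = label
        else:
            excluded[model] = label
    return shown, excluded


# Hypothetical data: model_a ran 3 of 7 benchmarks, model_b only 2.
results = {
    "model_a": [61.0, 48.5, 70.2, None, None, None, None],
    "model_b": [55.0, None, None, None, None, None, 42.0],
}
shown, excluded = split_by_coverage(results)
print(shown)
print(excluded)
```

Models that pass the filter still have missing cells. In the pipeline described on this page, those are what the MICE imputation step fills in before aggregation.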