a PeerBench project

Admin · Bradley-Terry ★ aggregation internal

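Scores are Bradley-Terry maximum-likelihood estimates (Hunter's MM algorithm, weak prior) fit to per-benchmark pairwise wins. The fit is a single batch, so it does not depend on input order or random seed. CI: 300 bootstrap resamples over benchmarks.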
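The fit described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function names, the toy data, the form of the weak prior (a small pseudo-count of wins in both directions for every pair), the tie handling, and the percentile interval on log-strengths are all assumptions. How strengths map onto the displayed Score scale is not stated in the source and is not attempted here.

```python
import math
import random


def pairwise_wins(bench_scores, models):
    """wins[i][j] = number of benchmarks on which model i outscored model j."""
    idx = {m: k for k, m in enumerate(models)}
    wins = [[0.0] * len(models) for _ in models]
    for scores in bench_scores:
        present = [m for m in models if m in scores]
        for x, a in enumerate(present):
            for b in present[x + 1:]:
                if scores[a] > scores[b]:
                    wins[idx[a]][idx[b]] += 1.0
                elif scores[b] > scores[a]:
                    wins[idx[b]][idx[a]] += 1.0
                else:  # assumption: a tie counts as half a win each way
                    wins[idx[a]][idx[b]] += 0.5
                    wins[idx[b]][idx[a]] += 0.5
    return wins


def fit_bradley_terry(wins, prior=0.1, iters=500, tol=1e-10):
    """Hunter-style MM updates: p_i <- W_i / sum_j n_ij / (p_i + p_j)."""
    n = len(wins)
    # Assumed weak prior: `prior` pseudo-wins in each direction for every pair.
    w = [[wins[i][j] + (prior if i != j else 0.0) for j in range(n)]
         for i in range(n)]
    p = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            total_wins = sum(w[i][j] for j in range(n) if j != i)
            denom = sum((w[i][j] + w[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new.append(total_wins / denom)
        # Strengths are only defined up to scale; fix the geometric mean at 1.
        g = math.exp(sum(math.log(x) for x in new) / n)
        new = [x / g for x in new]
        delta = max(abs(a - b) for a, b in zip(new, p))
        p = new
        if delta < tol:
            break
    return p


def bootstrap_ci(bench_scores, models, reps=300, seed=0, lo=0.025, hi=0.975):
    """Percentile interval on log-strength, resampling whole benchmarks."""
    rng = random.Random(seed)
    samples = {m: [] for m in models}
    for _ in range(reps):
        resampled = [rng.choice(bench_scores) for _ in bench_scores]
        strengths = fit_bradley_terry(pairwise_wins(resampled, models))
        for m, s in zip(models, strengths):
            samples[m].append(math.log(s))
    out = {}
    for m, vals in samples.items():
        vals.sort()
        out[m] = (vals[int(lo * (reps - 1))], vals[int(hi * (reps - 1))])
    return out


# Toy data (hypothetical): one dict of model -> score per benchmark.
benchmarks = [
    {"A": 0.90, "B": 0.70, "C": 0.50},
    {"A": 0.80, "B": 0.85, "C": 0.40},
    {"A": 0.70, "C": 0.60},
    {"B": 0.60, "C": 0.65},
]
models = ["A", "B", "C"]
strengths = fit_bradley_terry(pairwise_wins(benchmarks, models))
intervals = bootstrap_ci(benchmarks, models, reps=300, seed=0)
```

Because the win counts are accumulated over the whole batch before fitting, shuffling the benchmark list leaves `strengths` unchanged. Only the bootstrap draws depend on a seed.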



| Model | n/7 | Provider · License | Score | CI | Price (/Mtok) | Tier | Speed (tok/s) | English |
|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | 5/7 | Google · Closed | 80.3 | 78–89 | $6.75 | $$$ | 127 | |
| Gemini 3.5 Flash | 7/7 | Google · Closed | 68.4 | 67–75 | $2.81 | $$ | 202.5 | |
| Qwen3.7 Max | 4/7 | Alibaba · Closed | 57.7 | 51–67 | $1.28 | $$ | 170.3 | |
| Gemini 3.1 Flash-Lite | 7/7 | Google · Closed | 56.4 | 52–65 | $0.44 | $ | 302.3 | |
| DeepSeek V4 Pro | 4/7 | DeepSeek · Open weights | 54.1 | 50–59 | $1.62 | $$ | 57.8 | |
| Gemma 4 31B | 7/7 | Google · Open weights | 52.4 | 49–57 | $0.17 | $ | 38.4 | |
| DeepSeek V4 Flash | 7/7 | DeepSeek · Open weights | 52.1 | 48–58 | $0.14 | $ | 102.5 | |
| MiMo V2.5 Pro | 5/7 | Xiaomi · Open weights | 51.7 | 46–61 | $0.85 | $$ | 49.2 | |
| Gemini 2.5 Flash | 7/7 | Google · Closed | 51.0 | 47–57 | $1.00 | $$ | 216.2 | |
| Qwen3.6 35B-A3B | 6/7 | Alibaba · Open weights | 48.7 | 46–53 | $0.30 | $ | 141.6 | |
| Claude Haiku 4.5 | 3/7 | Anthropic · Closed | 46.4 | 44–49 | $3.36 | $$$ | 140.6 | |
| Gemma 4 26B A4B | 7/7 | Google · Open weights | 45.0 | 40–51 | $0.23 | $ | 84.4 | |
| Gemma 4 12B | 7/7 | Google · Open weights | 44.8 | 40–49 | — | | 46.3 | |
| GLM-5.1 | 5/7 | Z.ai · Open weights | 44.7 | 36–52 | $1.45 | $$ | 71.4 | |
| Tencent HY3-Preview | 6/7 | Tencent · Open weights | 44.7 | 38–48 | $0.08 | $ | 94.8 | |
| Qwen3.5 9B | 7/7 | Alibaba · Open weights | 39.1 | 34–42 | $0.12 | $ | 57.7 | |
| Qwen3 14B | 7/7 | Alibaba · Open weights | 35.4 | 29–39 | $0.14 | $ | 64.7 | |

Showing the 17 models that ran at least 3 of the 7 benchmarks; 9 were excluded for thin coverage. Price is the median effective cost in $ per 1M tokens. Speed combines throughput and latency. The n/7 chip shows how many benchmarks back the score. English is English-language intelligence, a background prior anchoring every Score; it is not a German benchmark and never appears in the German tables.
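The coverage rule in the note above amounts to a simple filter. A minimal sketch, assuming results are stored as a hypothetical mapping from model name to per-benchmark scores (the function name, the threshold parameter, and the sample data are illustrative only):

```python
def filter_by_coverage(results, min_benchmarks=3):
    """Split models into those with enough benchmark runs and those without."""
    kept = {m: s for m, s in results.items() if len(s) >= min_benchmarks}
    excluded = sorted(m for m in results if m not in kept)
    return kept, excluded


# Hypothetical data: model -> {benchmark name: score}
results = {
    "model-a": {"b1": 0.8, "b2": 0.7, "b3": 0.9},
    "model-b": {"b1": 0.6},
}
kept, excluded = filter_by_coverage(results)
```

Here `model-a` is kept with 3 benchmarks and `model-b` is excluded with 1, mirroring the ≥3 of 7 threshold.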