Benchmark suite

Latest source data: July 16, 2026
Checked: July 20, 2026

Lab comparison rankings.

A provider-level view of who is strongest across public benchmark coverage, best-category wins, open/proprietary mix, fastest matched models, and value leaders.

Labs covered: from public benchmark rows

Top lab score: Anthropic (avg rank 25.3)

Most models: OpenAI (largest tracked model set)

Open-weight depth: Alibaba (9 open models)

Provider view

Lab strength, coverage, and portfolio mix.

Average rank is computed across available Arena scores, so labs with fewer public rows should be read with coverage in mind.

Lab leaderboard

Rank	Lab	Coverage	Models	Open	Avg rank	Best rank	Best model	Best value	Fastest
#1	Anthropic	40%	32	0	25.3	#1	Claude Fable 5	Claude Opus 4.7	Claude Haiku 4.5 20251001
#2	Moonshot AI	49%	7	6	41.5	#1	Kimi K3	Kimi K3	Kimi K2.5 Thinking
#3	MiniMax	56%	5	4	47.1	#20	Minimax M3	Minimax M2.7	Minimax M2.5
#4	Z.ai	54%	8	7	48.3	#4	GLM 5.2	GLM 5.2	GLM 4.7
#5	Meta	28%	4	0	32.3	#6	Muse Spark	Muse Spark 1.1	Muse Spark 1.1
#6	Google	35%	22	2	41.4	#5	Gemini 3.1 Pro Grounding	Gemini 3.5 Flash High	Gemini 3 Flash
#7	OpenAI	33%	42	0	40.4	#3	GPT-5.5 Search	GPT-5.6 Sol xHigh	GPT-5.4 Nano High
#8	Xiaomi	38%	5	3	52.6	#25	MiMo V2.5 Pro	MiMo V2.5 Pro	MiMo V2.5 Pro
#9	ByteDance	17%	2	0	27.7	#14	Seed 2.1 Pro Preview	No price match	No latency match
#10	Thinky	33%	1	1	48.5	#38	Inkling	Inkling	No latency match
#11	xAI	27%	15	0	45.6	#6	Grok 4.20 Multi Agent Beta 0309	Grok 4.5	Grok 4.3
#12	Alibaba	36%	20	9	56.5	#17	Qwen3.7 Max 20260517	Qwen3.7 Max Preview	Qwen3.5 122b A10B
#13	Perplexity	11%	2	0	27.5	#26	Ppl Sonar Pro High	No price match	No latency match
#14	Diffbot	11%	1	1	32	#32	Diffbot Small Xl	No price match	No latency match
#15	DeepSeek	34%	9	9	69.4	#29	DeepSeek V4 Pro Thinking	DeepSeek V4 Pro	DeepSeek V4 Flash
#16	Baidu	13%	5	0	51.2	#13	ERNIE 5.1	No price match	No latency match
#17	IBM	44%	1	1	94	#94	Granite 4.1 8b	Granite 4.1 8b	Granite 4.1 8b
#18	Inception AI	44%	1	0	95	#95	Mercury 2	Mercury 2	Mercury 2
#19	Mistral	26%	8	5	84.5	#60	Mistral Large 3	Mistral Large 3	Mistral Large 3
#20	Poolside	11%	2	2	73.5	#69	Laguna M.1	No price match	No latency match
#21	Meituan	11%	1	0	74	#74	Longcat Flash Chat 2602 Exp	No price match	No latency match
#22	Tencent	11%	3	0	77.3	#64	Hunyuan Hy3 Preview	No price match	No latency match
#23	Amazon	11%	1	0	86	#86	Amazon Nova Experimental Chat 26 02 10	No price match	No latency match
#24	Stepfun	11%	3	1	90.7	#84	Step 1o Turbo 202506	No price match	No latency match
#25	Nvidia	11%	1	1	93	#93	Nvidia Nemotron 3 Ultra 550b A55B Nvfp4	No price match	No latency match

Anthropic: 32 models, 0 open; avg rank 25.3; coverage 40%

Moonshot AI: 7 models, 6 open; avg rank 41.5; coverage 49%

MiniMax: 5 models, 4 open; avg rank 47.1; coverage 56%

Z.ai: 8 models, 7 open; avg rank 48.3; coverage 54%

What the scores mean.

A quick reading key for provider-level comparisons, coverage, average rank, and open-weight portfolio signals.

Lower is better for average Arena rank. Coverage matters.

How is the lab score ranked?

Labs are ranked from available public Arena performance with a coverage adjustment, so broad benchmark coverage matters. A lab with one excellent score should not automatically outrank a lab with many strong model rows.
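
The page does not publish the adjustment formula. One plausible way to get this behavior is to shrink each lab's average rank toward a pessimistic prior in proportion to its missing coverage. The sketch below does that with three rows from the leaderboard table. The `PRIOR_RANK` value, the `adjusted_rank` name, and the linear blend are assumptions for illustration, and the result is not expected to reproduce the page's ordering.

```python
PRIOR_RANK = 100.0  # assumed rank assigned to the uncovered share; hypothetical


def adjusted_rank(avg_rank: float, coverage: float) -> float:
    """Blend observed average rank with the prior, weighted by coverage (0..1)."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    return coverage * avg_rank + (1.0 - coverage) * PRIOR_RANK


# (avg rank, coverage) as listed in the leaderboard table
labs = {
    "Anthropic": (25.3, 0.40),
    "ByteDance": (27.7, 0.17),
    "Perplexity": (27.5, 0.11),
}

# Lower adjusted rank is better, so sort ascending.
ranked = sorted(labs, key=lambda name: adjusted_rank(*labs[name]))
for name in ranked:
    print(f"{name}: {adjusted_rank(*labs[name]):.1f}")
```

The three raw averages sit within 2.4 ranks of each other, yet under this assumed blend the two low-coverage labs fall well behind once their missing rows are penalized.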

What does average rank mean?

Average rank is computed across the Arena rows available for that lab's tracked models. It is useful for comparing portfolio strength, but it should be read alongside model count and coverage.
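
As a rough illustration of that computation, the sketch below averages a lab's ranks over only the Arena rows that exist and reports coverage next to the result. The row layout, the model and arena names, and the definition of coverage as filled cells over all model-by-arena cells are assumptions made for this example, not the page's documented method.

```python
from statistics import mean

# Hypothetical rows: rank per (model, arena); None means no public row exists.
rows = {
    "model-a": {"text": 3, "code": 5, "vision": None},
    "model-b": {"text": 12, "code": None, "vision": None},
}


def lab_summary(rows):
    """Return (average rank over available rows, coverage as a fraction)."""
    cells = [rank for arenas in rows.values() for rank in arenas.values()]
    available = [rank for rank in cells if rank is not None]
    if not available:
        # No public rows at all: no average to report, zero coverage.
        return None, 0.0
    return mean(available), len(available) / len(cells)


avg_rank, coverage = lab_summary(rows)
print(f"avg rank {avg_rank:.1f}, coverage {coverage:.0%}")
```

Missing rows are skipped rather than counted as a bad rank, which is why the average alone can flatter a lab with few public rows.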

Why do open-weight counts matter?

Open-weight counts show how much of a lab's tracked portfolio can plausibly be self-hosted or inspected outside a closed API. They are a portfolio signal, not a quality score by themselves.
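
To see why the count describes a portfolio rather than its quality, it can help to put it next to the share of the portfolio it represents. The sketch below uses the model and open-weight counts shown on the four lab cards; the `open_share` function is made up for this example.

```python
def open_share(models: int, open_models: int) -> float:
    """Fraction of a lab's tracked models that are open-weight."""
    if models <= 0 or not 0 <= open_models <= models:
        raise ValueError("need models > 0 and 0 <= open_models <= models")
    return open_models / models


# (tracked models, open-weight models) from the lab cards
cards = {
    "Anthropic": (32, 0),
    "Moonshot AI": (7, 6),
    "MiniMax": (5, 4),
    "Z.ai": (8, 7),
}

for lab, (models, open_models) in cards.items():
    share = open_share(models, open_models)
    print(f"{lab}: {open_models}/{models} open ({share:.0%})")
```

Anthropic tops the leaderboard with no open-weight models, which is the sense in which the count says nothing about quality on its own.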

Why can labs with fewer models move around?

Small portfolios are more sensitive to one strong or weak model. The page keeps model count and coverage visible so lab comparisons are not reduced to a single rank number.
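
A small worked example makes the sensitivity concrete. The ranks below are invented for illustration: both hypothetical labs start at the same average rank, then each adds one model ranked #90.

```python
from statistics import mean


def shift_from_new_model(ranks, new_rank):
    """Change in average rank after adding one model's rank to a portfolio."""
    return mean(ranks + [new_rank]) - mean(ranks)


small_lab = [30, 30]    # hypothetical lab with 2 ranked models
large_lab = [30] * 20   # hypothetical lab with 20 ranked models

small_shift = shift_from_new_model(small_lab, 90)
large_shift = shift_from_new_model(large_lab, 90)
print(f"small lab moves by {small_shift:.1f}, large lab by {large_shift:.1f}")
```

The same weak model moves the two-model lab's average by 20 ranks and the twenty-model lab's by under 3.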