Thoughts on AI, technology, and the future we're building.

Benchmark suite. Updated May 27, 2026.

Lab comparison rankings.

A provider-level view of who is strongest across public benchmark coverage, best-category wins, open/proprietary mix, fastest matched models, and value leaders.

Labs covered: 23 (from public benchmark rows)

Top lab score: Anthropic (avg rank 29.7)

Most models: OpenAI (largest tracked model set)

Open-weight depth: DeepSeek (13 open models)

Provider view

Lab strength, coverage, and portfolio mix.

Average rank is computed across available Arena scores, so labs with fewer public rows should be read with coverage in mind.

Lab leaderboard

| Rank | Lab | Models | Open | Coverage | Avg rank | Best rank | Best model | Best value | Fastest |
|---|---|---|---|---|---|---|---|---|---|
| #1 | Anthropic | 27 | 0 | 35% | 29.7 | #1 | Claude Opus 4.6 Thinking | Claude Opus 4.6 Thinking | Claude Sonnet 4.6 |
| #2 | Xiaomi | 5 | 3 | 53% | 44.1 | #15 | MiMo V2.5 Pro | MiMo V2.5 Pro | MiMo V2 Flash |
| #3 | Moonshot AI | 6 | 6 | 43% | 44.2 | #8 | Kimi K2.6 | Kimi K2.6 | Kimi K2.5 Thinking |
| #4 | Z.ai | 8 | 7 | 50% | 47.2 | #6 | GLM 5.1 | GLM 5.1 | GLM 4.7 |
| #5 | ByteDance | 1 | 0 | 22% | 26 | #18 | Dola Seed 2.0 Pro | No price match | No latency match |
| #6 | MiniMax | 4 | 4 | 53% | 55 | #34 | Minimax M2.7 | Minimax M2.1 Preview | Minimax M2 |
| #7 | Google | 22 | 2 | 32% | 41.3 | #4 | Gemini 3.1 Pro Grounding | Gemini 3 Pro | Gemma 4 26b A4B |
| #8 | OpenAI | 45 | 0 | 30% | 39.4 | #1 | GPT-5.5 Search | GPT-5.5 Instant | GPT 4o 2024 08 06 |
| #9 | Perplexity | 2 | 0 | 11% | 25 | #24 | Ppl Sonar Pro High | No price match | No latency match |
| #10 | Alibaba | 22 | 9 | 35% | 54 | #4 | Qwen3.7 Max 20260517 | Qwen3.7 Plus Preview | Qwen3.5 122b A10B |
| #11 | xAI | 15 | 0 | 23% | 40.6 | #5 | Grok 4.20 Multi Agent Beta 0309 | Grok 4.3 | Grok 4.3 |
| #12 | Meta | 3 | 0 | 18% | 37 | #4 | Muse Spark | No price match | No latency match |
| #13 | Diffbot | 1 | 1 | 11% | 29 | #29 | Diffbot Small Xl | No price match | No latency match |
| #14 | IBM | 1 | 1 | 44% | 76 | #76 | Granite 4.1 8b | Granite 4.1 8b | Granite 4.1 8b |
| #15 | Inception AI | 1 | 0 | 44% | 78 | #78 | Mercury 2 | Mercury 2 | Mercury 2 |
| #16 | Baidu | 5 | 0 | 13% | 41.2 | #13 | ERNIE 5.1 | No price match | No latency match |
| #17 | DeepSeek | 13 | 13 | 28% | 65.3 | #17 | DeepSeek V4 Pro Thinking | DeepSeek V4 Pro Thinking | DeepSeek V4 Flash |
| #18 | Meituan | 1 | 0 | 11% | 59 | #59 | Longcat Flash Chat 2602 Exp | No price match | No latency match |
| #19 | Mistral | 8 | 4 | 24% | 82.3 | #73 | Mistral Large 3 | Mistral Large 3 | Mistral Large 3 |
| #20 | Tencent | 3 | 0 | 15% | 72.5 | #50 | Hunyuan Hy3 Preview | No price match | No latency match |
| #21 | Amazon | 2 | 0 | 11% | 80 | #70 | Amazon Nova Experimental Chat 26 02 10 | No price match | No latency match |
| #22 | Stepfun | 3 | 1 | 11% | 82 | #75 | Step 1o Turbo 202506 | No price match | No latency match |
| #23 | Ai2 | 1 | 1 | 11% | 95 | #95 | Molmo 2 8b | No price match | No latency match |

Benchmark guide

What the scores mean.

A quick reading key for provider-level comparisons, coverage, average rank, and open-weight portfolio signals.

Lower is better: average Arena rank

Coverage matters

How is the lab score ranked?

Labs are ranked from available public Arena performance with a coverage adjustment, so broad benchmark coverage matters. A lab with one excellent score should not automatically outrank a lab with many strong model rows.
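
The page does not publish the formula behind its coverage adjustment. As an illustration only, the sketch below shows one hypothetical way such an adjustment could work: blending a lab's average rank with a neutral prior, weighted by coverage. The function name, the `prior_rank` value of 60, and the two example labs are all assumptions, not the site's method.

```python
def coverage_adjusted_rank(avg_rank, coverage, prior_rank=60.0):
    """Blend a lab's average rank with a neutral prior rank.

    Hypothetical adjustment: `coverage` is a fraction in [0, 1]. Low coverage
    pulls the result toward `prior_rank`; full coverage leaves `avg_rank`
    unchanged. Lower results are better.
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    return coverage * avg_rank + (1.0 - coverage) * prior_rank


# Two made-up labs: one excellent score with thin coverage,
# versus a weaker average with broad coverage.
narrow_lab = coverage_adjusted_rank(avg_rank=25.0, coverage=0.11)
broad_lab = coverage_adjusted_rank(avg_rank=44.0, coverage=0.53)

print(f"narrow lab adjusted rank: {narrow_lab:.2f}")
print(f"broad lab adjusted rank:  {broad_lab:.2f}")
```

Under these assumed numbers the broad lab ends up with the lower (better) adjusted rank even though its raw average is worse, which is the behavior the paragraph above describes.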

What does average rank mean?

Average rank is computed across the Arena rows available for that lab's tracked models. It is useful for comparing portfolio strength, but it should be read alongside model count and coverage.
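
A minimal sketch of a per-lab average rank, assuming each Arena row can be represented as a `(lab, model, rank)` tuple. The row layout, lab names, and rank values are invented for illustration; the site's actual data format is not shown on the page.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical Arena rows: (lab, model, arena_rank). Values are made up.
rows = [
    ("LabA", "model-1", 1),
    ("LabA", "model-2", 10),
    ("LabA", "model-3", 40),
    ("LabB", "model-x", 26),
]


def average_rank_by_lab(rows):
    """Return {lab: (average rank, best rank, number of rows)}."""
    ranks = defaultdict(list)
    for lab, _model, rank in rows:
        ranks[lab].append(rank)
    return {
        lab: (round(mean(values), 1), min(values), len(values))
        for lab, values in ranks.items()
    }


summary = average_rank_by_lab(rows)
for lab, (avg, best, count) in summary.items():
    print(f"{lab}: avg rank {avg}, best #{best}, {count} row(s)")
```

Returning the row count alongside the average keeps the caveat visible: an average built from one row says much less than one built from many.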

Why do open-weight counts matter?

Open-weight counts show how much of a lab's tracked portfolio can plausibly be self-hosted or inspected outside a closed API. They are a portfolio signal, not a quality score by themselves.
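
One way to read the open-weight counts is as a share of each lab's tracked models. The sketch below uses model and open-weight counts for four labs from this page's lab leaderboard; the `open_share` helper is a hypothetical name introduced here, not something the site defines.

```python
# (lab, tracked models, open-weight models), taken from the lab leaderboard.
portfolios = [
    ("Anthropic", 27, 0),
    ("Alibaba", 22, 9),
    ("DeepSeek", 13, 13),
    ("Mistral", 8, 4),
]


def open_share(models, open_models):
    """Fraction of a lab's tracked models that are open-weight."""
    return open_models / models if models else 0.0


shares = {lab: open_share(m, o) for lab, m, o in portfolios}

# List labs from the most open portfolio to the least open.
for lab, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lab}: {share:.0%} open-weight")
```

The share describes portfolio mix only; it says nothing about how well those open models rank.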

Why can labs with fewer models move around?

Small portfolios are more sensitive to one strong or weak model. The page keeps model count and coverage visible so lab comparisons are not reduced to a single rank number.
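
The sensitivity of small portfolios follows from how a mean updates when one row is added. A short worked sketch, using hypothetical labs that both start at an average rank of 40 and each gain one weak model ranked 100:

```python
def average_after_adding(current_avg, row_count, new_rank):
    """Mean rank after one more row is added to a lab's portfolio."""
    return (current_avg * row_count + new_rank) / (row_count + 1)


# Hypothetical numbers, not taken from the leaderboard.
small = average_after_adding(40, 1, 100)   # lab with 1 existing row
large = average_after_adding(40, 20, 100)  # lab with 20 existing rows

print(f"small portfolio: 40 -> {small:.1f}")
print(f"large portfolio: 40 -> {large:.1f}")
```

With these assumed inputs the one-row lab moves from 40 to 70, while the twenty-row lab moves to about 42.9.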