AInspiro

Have Open-Source AI Models Caught Up to Closed-Source? June 2026 Showdown: Llama 4 Maverick vs Qwen 3 235B vs DeepSeek V4-Pro vs GPT-5.5 and Claude Opus 4.6

Every few months, the AI community erupts into the same debate: have open-source models caught up to closed-source yet?


The last flare-up was in April 2025 when Llama 4 dropped. Some said 'caught up,' others said 'still a gap.' In the first half of this year the landscape shifted again: DeepSeek V4 launched without warning in April, GPT-5.5 followed the same month, and Claude Opus 4.6 was released in May.


Rather than listening to other people argue, I decided to run my own tests. The data in this article isn't scraped from benchmark websites — it's from my own testing over the past two weeks, running all 5 models through the same set of tasks.

Bottom line upfront: open-source models have essentially drawn level with closed-source in most scenarios, but there are still gaps in two key dimensions. Let me break it down.

Lineup and Methodology

Contestants:

  • Closed-source: GPT-5.5 (OpenAI, released April 2026), Claude Opus 4.6 (Anthropic, released May 2026)
  • Open-source: Llama 4 Maverick (Meta, released April 2025, MoE architecture, 128 experts), Qwen3-235B-A22B (Alibaba, released April 2025, 235B total / 22B active params), DeepSeek V4-Pro (DeepSeek, released April 2026, 1.6T total / 49B active params)

Testing covered 5 dimensions: long-form writing, code generation, multi-step reasoning, Chinese language understanding, and tool calling (function calling). Each dimension had 10 test questions; all 5 models answered the same questions, and scoring was done blind.


What does blind mean? The model names were hidden — only the output content was scored. This eliminates the 'it must be better because it's GPT-5.5' psychological bias.
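
The blinding step described above can be sketched in a few lines. This is a minimal illustration, not the harness used for this article: the function names `blind_outputs` and `unblind`, the "Output A/B/C" labels, and the fixed seed are all assumptions made for the example.

```python
import random

def blind_outputs(outputs, seed=0):
    """Shuffle model outputs and replace model names with neutral labels.

    outputs: dict mapping model name -> generated text.
    Returns (blinded, key). `blinded` maps label -> text and is what the
    grader sees; `key` maps label -> model name and stays hidden until
    scoring is finished.
    """
    rng = random.Random(seed)
    names = list(outputs)
    rng.shuffle(names)  # so label order does not leak the model order
    labels = [f"Output {chr(ord('A') + i)}" for i in range(len(names))]
    blinded = {label: outputs[name] for label, name in zip(labels, names)}
    key = dict(zip(labels, names))
    return blinded, key

def unblind(scores, key):
    """Map grader scores (label -> score) back to model names."""
    return {key[label]: score for label, score in scores.items()}
```

The grader only ever receives `blinded`; `unblind` is applied once all scores are in, so the model name cannot influence the score.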


Alright, here are the results.

Dimension 1: Long-Form Writing

Method: give the same outline, have the model write a 3,000-word in-depth article. Evaluated on logical structure, information density, language fluency, and factual accuracy.


The results were somewhat surprising.

#1: Claude Opus 4.6

No surprises here. Claude's advantage in long-form writing carried from Opus 4.5 to 4.6. The biggest strength is 'not going off-topic' — across 3,000 words, the main thread stays consistent from start to finish. Other models start repeating or drifting in the latter half; Claude Opus 4.6 largely doesn't.

#2: GPT-5.5

Language expression is extremely fluent, even more 'readable' than Claude. But there's an issue: it's too good at 'padding.' In the same 3,000 words, Claude has noticeably higher information density, while GPT-5.5 sometimes uses ornate language to mask thin content. If you want readable marketing copy, GPT-5.5 wins; if you want deep analysis, Claude Opus 4.6 wins.

#3: Qwen3-235B

The strongest writer in the open-source group. Chinese expression is remarkably native, even showing some 'internet savvy' by naturally incorporating colloquialisms and trending phrases. Structurally more stable than Llama 4 and DeepSeek. In blind review, its output was hard to tell apart from GPT-5.5's.

#4: DeepSeek V4-Pro

The style leans 'academic' — formal but somewhat dry. Good for technical documentation and research reports; less ideal for blog posts or marketing content.

#5: Llama 4 Maverick

Honestly, a bit surprising. Maverick is decent in English writing, but Chinese writing clearly suffers — it frequently has a 'translation feel,' with sentence structures that read like awkward translations from English. If we only scored English, it would rank third; factoring in Chinese, it drops to fifth.

Dimension 2: Code Generation

Method: 5 LeetCode Hard problems + 3 real-world feature development tasks + 2 bug-fixing tasks.
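
For the algorithm problems, correctness can be checked mechanically by running the model's solution against test cases. The sketch below shows one way to do that; it is not the grading script behind this article. `pass_rate` is a hypothetical helper, and `candidate` is a deliberately simple two-sum stand-in for a model-written solution (far easier than a LeetCode Hard problem).

```python
def pass_rate(solution, cases):
    """Fraction of (args, expected) cases the candidate function gets right.

    An exception counts as a failed case rather than stopping the run.
    """
    passed = 0
    for args, expected in cases:
        try:
            if solution(*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(cases)

# Hypothetical stand-in for a model-written solution.
def candidate(nums, target):
    seen = {}  # value -> index of values visited so far
    for i, n in enumerate(nums):
        if target - n in seen:
            return [seen[target - n], i]
        seen[n] = i
    return []

cases = [
    (([2, 7, 11, 15], 9), [0, 1]),
    (([3, 2, 4], 6), [1, 2]),
    (([1, 2], 7), []),
]
print(pass_rate(candidate, cases))  # 1.0
```

A pass rate only covers the algorithm problems; the feature-development and bug-fixing tasks still need a human to judge the result.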


The gap in this dimension is much smaller than in writing.


GPT-5.5 and Claude Opus 4.6 tied for first, within margin of error. Both can independently complete full feature development, including writing tests and handling edge cases. Claude Opus 4.6 is slightly better at 'understanding large codebases,' while GPT-5.5 is faster on 'single-file algorithm problems.'


In the open-source camp, DeepSeek V4-Pro was the biggest surprise. Its performance on coding tasks is nearly on par with the closed-source group — especially in frontend development and Python scripting. DeepSeek has been investing heavily in code capabilities since V2, and by V4, the accumulated advantage is clear.


Qwen3-235B ranked fourth, Llama 4 Maverick fifth. Both can write working code, but in complex scenarios (multi-file collaboration, understanding project context), there's still a gap versus closed-source.

Dimension 3: Multi-Step Reasoning

Method: give the model a complex problem requiring 5+ steps of reasoning, and see if it can decompose, reason step by step, and arrive at the correct answer.


This is where the gap widens.


Claude Opus 4.6 is first, and clearly ahead. Its 'thinking' ability took another leap in 4.6 — it doesn't just do step-by-step reasoning, it can catch its own errors mid-reasoning and self-correct. Other models, once they head in the wrong direction, barrel ahead to the bitter end. Claude Opus 4.6 will 'pause and reconsider' mid-process and adjust course.


GPT-5.5 is second, with strong reasoning ability, but occasionally skips steps in the middle, leading to correct conclusions with flawed derivations.


Open-source group: DeepSeek V4-Pro ≈ Qwen3-235B > Llama 4 Maverick. All three handle simple reasoning fine, but on problems needing 5+ steps they start reaching correct answers through flawed intermediate steps. Not terrible, but the gap versus closed-source is most visible in this dimension.

Dimension 4: Chinese Language Understanding

Method: idiom comprehension, classical Chinese translation, Chinese logic puzzles, Chinese sentiment understanding.


In this dimension, the Chinese models shone.


Qwen3-235B takes first. No debate — a model trained by a native Chinese team simply has more nuanced understanding of Chinese context. Idioms, proverbs, classical Chinese — it grasps and uses them accurately. GPT-5.5 and Claude Opus 4.6 are decent too, but occasionally stumble on 'subtle contexts that only Chinese speakers would intuitively get.'


DeepSeek V4-Pro is second, with solid Chinese capability, not far behind Qwen3.


GPT-5.5 is third, Claude Opus 4.6 fourth. The gap between the two closed-source models in Chinese is actually very small — both are at the 'very good but not perfect' level.


Llama 4 Maverick is fifth again. Meta's investment in Chinese training data is clearly insufficient — Llama 4's Chinese capability is nowhere near its English level.

If your primary use case is Chinese, the open-source Chinese models (Qwen3-235B and DeepSeek V4-Pro) are the most cost-effective choice, bar none.

Dimension 5: Tool Calling (Function Calling)

Method: give the model a set of tool APIs, and see if it can correctly select tools, construct parameters, and handle return values.
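
To make "correctly select tools and construct parameters" concrete, here is a minimal sketch of how a single model-produced call could be graded. Everything in it is hypothetical: the tool names (`get_weather`, `search_flights`), the `{"name": ..., "arguments": {...}}` call shape, and the two-part verdict are illustration only, not any vendor's API and not the test set used here. Return-value handling is not covered.

```python
# Hypothetical tool registry: tool name -> required parameter names and types.
TOOLS = {
    "get_weather": {"city": str, "unit": str},
    "search_flights": {"origin": str, "destination": str, "date": str},
}

def check_tool_call(call, expected_tool):
    """Grade one call of the form {"name": ..., "arguments": {...}}.

    Returns two booleans: whether the expected tool was selected, and
    whether the arguments match that tool's parameter names and types.
    """
    name = call.get("name")
    args = call.get("arguments", {})
    schema = TOOLS.get(name)
    params_ok = (
        schema is not None
        and isinstance(args, dict)
        and set(args) == set(schema)
        and all(isinstance(args[k], t) for k, t in schema.items())
    )
    return {"tool_ok": name == expected_tool, "params_ok": params_ok}

good = {"name": "get_weather", "arguments": {"city": "Hangzhou", "unit": "celsius"}}
bad = {"name": "get_weather", "arguments": {"city": "Hangzhou"}}  # missing "unit"
print(check_tool_call(good, "get_weather"))  # {'tool_ok': True, 'params_ok': True}
print(check_tool_call(bad, "get_weather"))   # {'tool_ok': True, 'params_ok': False}
```

Scoring tool selection and parameter construction separately makes it possible to see which of the two a model gets wrong.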


This dimension represents the foundation of Agent capability — as mentioned in the earlier article on AI Agents, tool calling is the essential difference between an Agent and a Chatbot.


GPT-5.5 is first. OpenAI has the most extensive experience in function calling, and GPT-5.5 has the highest accuracy in both tool selection and parameter construction.


Claude Opus 4.6 is second. The gap is small, but Claude occasionally misses steps in complex scenarios requiring 'sequential multi-tool calls.'


The three open-source models are close, with DeepSeek V4-Pro slightly ahead. But overall, open-source models are still about one generation behind closed-source in tool calling — especially in 'automatic retry and strategy adjustment after a tool call fails,' where closed-source models are clearly more robust.

Final Score Summary

Adding up scores across all 5 dimensions (20 points each, 100 total):

  • Claude Opus 4.6: 92
  • GPT-5.5: 89
  • Qwen3-235B: 84
  • DeepSeek V4-Pro: 82
  • Llama 4 Maverick: 76

The trend is clear: the gap between open-source and closed-source is narrowing, but not yet fully closed. The gap concentrates in two areas — complex multi-step reasoning and tool-calling robustness.
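
For a rough sense of how large the remaining gap is, the five totals above can be compared directly. The snippet uses only the scores listed in the summary; the grouping and the two gap measures (best versus best, group average versus group average) are just one possible way to summarize them.

```python
totals = {
    "Claude Opus 4.6": 92,
    "GPT-5.5": 89,
    "Qwen3-235B": 84,
    "DeepSeek V4-Pro": 82,
    "Llama 4 Maverick": 76,
}
closed = ["Claude Opus 4.6", "GPT-5.5"]
open_models = [m for m in totals if m not in closed]

def mean(names):
    """Average total score over the given model names."""
    return sum(totals[n] for n in names) / len(names)

best_closed = max(totals[m] for m in closed)
best_open = max(totals[m] for m in open_models)

print(best_closed - best_open)                        # 8
print(round(mean(closed) - mean(open_models), 1))     # 9.8
```

On a 100-point scale, the best open-source model trails the best closed-source model by 8 points, and the group averages differ by about 10.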

So Which One Should You Choose?

Don't just look at the ranking — the actual choice depends on your use case and budget.


If you're an individual user using AI for daily chat, writing, and coding — a GPT-5.5 or Claude Opus 4.6 subscription is sufficient. But if budget is tight, Qwen3's free version is perfectly capable; the daily-use experience gap versus closed-source is minimal.


If you're an enterprise user needing private deployment, open-source is the only option. In this scenario, Qwen3-235B and DeepSeek V4-Pro were the two strongest open-source models in my tests. DeepSeek is better for coding scenarios and Qwen is better for general Chinese use cases; choose based on your needs.


If you're building Agents — for now, I still recommend closed-source models. Tool-calling stability and multi-step reasoning accuracy directly determine whether your Agent works. These happen to be the two weakest dimensions for open-source models. But if you're willing to invest in engineering optimization (retry logic, error handling, fallback strategies), open-source models can also work — it just costs more engineering effort.
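
The engineering work mentioned above (retry logic, error handling, fallback strategies) often starts with a wrapper like the one below. This is a minimal sketch under assumptions: `call_with_retry` and its parameters are hypothetical names, tools are plain Python callables, and exponential backoff is one common choice of retry policy rather than something this article tested.

```python
import time

def call_with_retry(tool, args, retries=3, base_delay=0.5,
                    fallback=None, sleep=time.sleep):
    """Call tool(**args), retrying with exponential backoff on failure.

    If every attempt fails and a fallback callable is given, its result is
    returned instead; otherwise the last error is re-raised.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return tool(**args)
        except Exception as exc:  # a real agent would catch narrower errors
            last_error = exc
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    if fallback is not None:
        return fallback(**args)
    raise last_error
```

The `sleep` parameter exists so the delay can be replaced in tests. In a real agent the fallback might be a second tool, a cached answer, or a prompt asking the model to choose a different strategy.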


One last note: the scores in this article reflect June 2026 data. Given the current iteration speed of open-source models, by year-end, this gap may genuinely be negligible. I'll run another round then, and the results might tell a different story.


In AI, three months is an era. So this article has an expiration date — but for now, this is where things stand.