
    ChatGPT vs Claude vs Gemini: 30 Days Running All Three


    This article was supposed to be DeepSeek vs Gemini.

    I’d been running both for 30 days against Claude Pro and ChatGPT Plus as baselines, which is the standard way you run a two-tool comparison. You don’t test models in a vacuum. Instead you benchmark them against known quantities. So while I was evaluating DeepSeek and Gemini against each other, I was also routing the same prompts through Claude and ChatGPT every day, building comparison data across all four.

    Then the week happened. Developer forums torched DeepSeek over hallucination rates on anything technical. The government integration concerns never quietly went away; they got louder, and enterprise Slack channels started treating the model as a risk vector instead of a cost play. Meanwhile Gemini 3.1 Pro was quietly owning leaderboards nobody thought Google could touch.

    Then on Day 26, Anthropic shipped Claude Opus 4.7 and redrew the coding map.

    So the two-horse race became a three-horse race, and the article I thought I was writing became a better one: ChatGPT vs Claude vs Gemini, with 30 days of parallel testing data across all three flagships and a post-Opus 4.7 leaderboard most reviewers haven’t caught up to yet.

    It’s Edge vs Chrome in 2015: the Google product was technically fine the whole time, and nobody cared until it suddenly mattered. Here’s what the data actually shows about ChatGPT vs Claude vs Gemini in April 2026, with real numbers, real observations, and no fake “I ran a test” filler.


    TL;DR

    Claude Opus 4.7 ($20 Pro tier) is the new coding king. SWE-Bench Verified 87.6%, SWE-Bench Pro 64.3%, MCP-Atlas 77.3%, Computer Use 78.0%. Released April 16 with /ultrareview, /ultraplan, and a new xhigh effort level. However two caveats apply: BrowseComp regressed from 83.7% on 4.6 to 79.3% on 4.7, and the new tokenizer produces 20 to 35% more tokens per request. Rate card is unchanged but the actual API bill is up.

    Meanwhile Gemini 3.1 Pro ($19.99 Google AI Pro) owns the reasoning ceiling. ARC-AGI-2 77.1% (dominant), GPQA Diamond 94.3% (tied record), BrowseComp 85.9%, 1M context, 114 tokens/sec. Additionally it bundles NotebookLM Pro (300 sources), Nano Banana Pro, Veo 3.1 Lite, and 2TB Google One storage. Furthermore Alphabet’s Q4 2025 filing confirms 750M monthly users.

    Finally GPT-5.4 Thinking ($20 ChatGPT Plus) leads action-heavy benchmarks. Terminal-Bench 2.0 75.1%, OSWorld 75.0%, GDPval-AA 83.0%, BrowseComp 82.7% (Pro hits 89.3% but that’s $200 tier). Moreover it has the best voice mode and widest third-party ecosystem. However the 128K consumer context cap is the awkward one, since 1M only unlocks via API.

    Best for: Claude = production code, legal drafting, long documents. Gemini = research, Workspace, multimodal, cost-efficient API. ChatGPT = general purpose, voice, web agents.

    Not ideal for: Claude if you need image generation, voice, or heavy web agent work. Gemini if you hit message limits (drops to ~25/day under load) or need a $100 middle tier (Google skips it). ChatGPT if you write production code daily.

    Sticky: Codex for terminals and token cost. Gemini for speed. Claude for production repos.


    The $20 Tier at a Glance

    | Feature | ChatGPT Plus ($20) | Claude Pro ($20) | Google AI Pro ($19.99) |
    |---|---|---|---|
    | Flagship Model | GPT-5.4 Thinking | Claude Opus 4.7 | Gemini 3.1 Pro |
    | Context Window | 128K consumer, 1M API | 1M | 1M |
    | Output Speed | ~95 tok/sec | ~88 tok/sec | 114 tok/sec |
    | Message Limits | ~80 GPT-5.4 Thinking / 3 hrs | ~45 Opus 4.7 / 5 hrs | ~100 Pro / day (drops to ~25 under load) |
    | Image Generation | DALL-E 3 + GPT Image | Claude Design (beta) | Nano Banana Pro |
    | Video Generation | None bundled | None | Veo 3.1 Lite (bundled) |
    | Web Search | ChatGPT Search | Web Search + Research | Google Search + Deep Research |
    | Voice Mode | Advanced Voice | Text only | Gemini Live |
    | Coding Agent | Codex CLI, Codex Cloud | Claude Code (xhigh, /ultrareview, /ultraplan) | Gemini CLI + Antigravity |
    | Memory | Persistent + Projects | Project memory | Gemini Memory + Gems |
    | Computer Use | Operator | Computer Use API | Project Mariner (Ultra-only, US-only) |
    | Notebook Research | None | Projects | NotebookLM Pro (300 sources) |
    | Workspace Integration | Limited (via GPTs) | Limited | Deep: Docs, Sheets, Gmail, Drive, Meet |
    | Cloud Storage | None | None | 2TB Google One included |
    | Budget Tier | GPT-5.3 Instant | Haiku 4.5 | Gemini 3 Flash + 3.1 Flash-Lite |

    Three products. Three takes on what $20 buys. The table above hides the nuance. The next 5,000 words unhide it.


    Every Dollar, Every Tier

    The $20 tier is where ChatGPT vs Claude vs Gemini becomes a real choice for most users. Above that, pricing stops being comparable, because Google skips the entire $100 bracket.

    ChatGPT Pricing

    | Tier | Price | Key Value |
    |---|---|---|
    | Free | $0 | Limited GPT-5.3 Instant access |
    | Plus | $20/mo | GPT-5.4 Thinking, xhigh reasoning, voice, image gen |
    | Business | $25/user/mo | Team features, data isolation |
    | Pro $100 | $100/mo | NEW as of April 9, 2026. GPT-5.4 Pro, o1 Pro mode, 5x Plus usage, 10x Codex through May 31 |
    | Pro $200 | $200/mo | 20x Plus usage, highest Codex limits |

    Claude Pricing

    | Tier | Price | Key Value |
    |---|---|---|
    | Free | $0 | Limited Sonnet 4.6 access |
    | Pro | $20/mo | Opus 4.7, Projects, Claude Code |
    | Max 5x | $100/mo | 5x Pro usage, priority routing |
    | Max 20x | $200/mo | Highest Opus 4.7 limits |

    Google AI Pricing

    | Tier | Price | Key Value |
    |---|---|---|
    | Free | $0 | Limited Gemini 3 Flash |
    | AI Pro | $19.99/mo | Gemini 3.1 Pro, Deep Research, NotebookLM Pro, Nano Banana Pro, Veo 3.1 Lite, 2TB Google One |
    | AI Ultra | $249.99/mo | Gemini 3 Deep Think, Project Mariner (US), Veo 3.1 full, highest limits |

    The $100 Showdown (and Google’s Missing Middle)

    OpenAI and Anthropic both occupy the $100 slot. Google does not. Google’s next tier after Pro jumps straight to $249.99 Ultra, which is an inconvenient chasm if you want more than Pro but can’t justify ten times the money.

    OpenAI’s Pro $100 tier is brand new, launched April 9, 2026, and it’s the most aggressive play in that bracket right now. The 10x Codex bonus running through May 31 makes it an obvious move for anyone building with agents. Claude Max 5x is the quieter pick: priority routing during peak hours matters more than the raw usage multiplier on paper. If you’re hitting Opus 4.7 limits often (40+ serious coding sessions a week is the rough threshold I hit by Day 20), Max 5x pays for itself.

    What $20 Actually Gets You

    Google AI Pro wins the contents-of-the-box fight by a margin that isn’t close. The bundle: Gemini 3.1 Pro access, Deep Research, NotebookLM Pro, Nano Banana Pro image generation, Veo 3.1 Lite video, plus 2TB Google One storage. The storage alone replaces a $9.99/mo subscription for anyone already paying for it. Veo 3.1 Lite would be a standalone $30/mo product if Google priced it that way. Neither competitor bundles anything equivalent.

    Google AI Pro is the only $20 AI subscription where the AI is the bonus. That’s the line.

    ChatGPT Plus and Claude Pro are leaner subscriptions. You get the chatbot, the agent capabilities, and not much else. That’s fine if you want the best-in-class model and nothing adjacent. It’s less fine if you’d actually use the Google extras.

    • ChatGPT Plus, $20/mo (voice + general purpose): GPT-5.4 Thinking, Advanced Voice mode, DALL-E 3 image gen, custom GPTs, 128K context cap. No bundled storage or video.
    • Claude Pro, $20/mo (production code): Opus 4.7 access, /ultrareview + /ultraplan, Projects (doc scoping), Claude Code (separate). No image, voice, or video. The leanest subscription; a pure specialist.
    • Google AI Pro, $19.99/mo (best overall value): Gemini 3.1 Pro, NotebookLM Pro (300 sources), Nano Banana Pro images, Veo 3.1 Lite video, 2TB Google One storage, Docs/Sheets/Gmail integration. The storage alone is $9.99/mo elsewhere.

    Entry tier pricing as of April 2026.


    The Leadership Map: GPT-5.4 vs Opus 4.7 vs Gemini 3.1 Pro

    Most reviews of ChatGPT vs Claude vs Gemini published in the last 60 days frame Gemini 3.1 Pro as the breakaway leader. That was true on February 19. It’s not true on April 20. Opus 4.7 dropped four days ago. The benchmark map redrew itself.

    | Benchmark | GPT-5.4 Thinking | Claude Opus 4.7 | Gemini 3.1 Pro | Leader |
    |---|---|---|---|---|
    | SWE-Bench Verified | 85.0% (Codex) | 87.6% | 80.6% | Claude (+2.6 over Codex) |
    | SWE-Bench Pro | 57.7% | 64.3% | 54.2% | Claude |
    | GPQA Diamond | 92.8% | 94.2% | 94.3% | Gemini (by 0.1) |
    | ARC-AGI-2 | 28.3% | 25.1% | 77.1% | Gemini (dominant) |
    | MCP-Atlas | 68.1% | 77.3% | 73.9% | Claude (+9.2 over GPT) |
    | Computer Use (OSWorld) | 75.0% | 78.0% | 68.9% | Claude |
    | Terminal-Bench 2.0 | 75.1% | 69.4% | 68.5% | ChatGPT |
    | Humanity’s Last Exam (no tools) | 42.0% | 41.8% | 44.4% | Gemini |
    | Humanity’s Last Exam (with tools) | 52.1% | 54.7% | 51.4% | Claude |
    | LiveCodeBench Pro | 2649 Elo | 2812 Elo | 2887 Elo | Gemini |
    | BrowseComp | 82.7% (Pro: 89.3%) | 79.3% ↓ | 85.9% | Gemini (at Plus tier) |
    | CursorBench | 62% | 70% | 61% | Claude |
    | GDPval-AA | 83.0% | 81.4% | 79.8% | ChatGPT |
    | APEX-Agents | 30.1% | 31.2% | 33.5% | Gemini |
    | CharXiv (scientific figures) | baseline | +13 pts no tools, +6 with | baseline | Claude |
    | Context (Pro tier) | 128K | 1M | 1M | Claude/Gemini tie |

    Three category leaders. Not one.

    First, Claude Opus 4.7 leads: real-repo coding (both SWE-Bench variants), agentic tool use (MCP-Atlas by 9.2 points), Computer Use, scientific reasoning with tools, scientific figure interpretation. Furthermore CharXiv jumped double digits on Opus 4.7. Translation: Claude stopped hallucinating what a graph shows.

    Meanwhile Gemini 3.1 Pro leads: abstract reasoning (ARC-AGI-2 by double the competition), raw knowledge (GPQA and HLE without tools), long-context reliability past 500K tokens, multimodal input (video, audio, image), and price-to-performance on the API.

    Finally GPT-5.4 Thinking leads: Terminal-Bench 2.0, GDPval-AA, BrowseComp web agent tasks, voice mode (Advanced Voice is still unmatched), and third-party ecosystem breadth.

    Nobody leads 13 of 16 anymore. That framing was Gemini’s launch week. Gemini is still impressive; it just isn’t alone.

    Benchmark Leadership by Category (chart): three category leaders, not one. All scores as of April 2026.

    The Tokenizer Trap: Opus 4.7 Is More Expensive Than It Looks

    Anthropic kept the Opus 4.7 rate card identical to 4.6: $5 per million input tokens, $25 per million output. The announcement blog post led with “unchanged pricing.” Reviews copied that line.

    However the reviews are wrong. Or at least, they’re incomplete.

    Opus 4.7 ships with a new tokenizer. Independent analysis from Apiyi and multiple developer reports show it produces 20 to 35% more tokens for identical inputs, with the higher end hit by code-heavy and structured content. A prompt that cost $0.50 on Opus 4.6 now costs $0.60 to $0.67 on 4.7 for the same text. The rate card didn’t move. The token count did.

    In effect, Anthropic raised Claude’s price 25% without changing the price. Welcome to 2026.

    What This Looks Like in Practice

    Take a mid-sized dev team running Claude on 10 million input tokens a day. At $5 per million, that’s $50/day on 4.6. Run the same workload on 4.7 and the effective rate jumps to roughly $60 to $67/day depending on content type. Over a month, you’re paying $1,800 to $2,010 for what used to cost $1,500. Same code, same prompts, “no price change.”

    Compare that to Gemini 3.1 Pro’s API: $2 per million input, $12 per million output. The same 10M daily input tokens cost $20/day on Gemini, or roughly $600/month. At scale, you’re paying Claude 3x the Gemini cost for tasks where Gemini’s 80.6% SWE-Bench Verified is within 7 points of Claude’s 87.6%. The quality gap is real. The price gap is more real.
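    If you want to sanity-check those numbers against your own workload, the sketch below folds the tokenizer inflation into an effective input rate. It’s a back-of-the-envelope calculator assuming the 20 to 35% inflation range reported above and the published rate cards; it is not a measured bill.

    ```python
    # Effective daily input cost when a new tokenizer inflates token counts.
    # Rates are the published rate cards quoted in this article; the 20-35%
    # inflation range is the reported estimate, not a measured constant.

    def daily_input_cost(tokens_per_day_m: float, rate_per_m: float,
                         inflation: float = 0.0) -> float:
        """Dollars per day of input tokens, with optional token inflation."""
        return tokens_per_day_m * (1 + inflation) * rate_per_m

    workload_m = 10.0  # 10M input tokens per day, as in the example above

    opus_46 = daily_input_cost(workload_m, 5.00)             # old tokenizer
    opus_47_low = daily_input_cost(workload_m, 5.00, 0.20)   # +20% tokens
    opus_47_high = daily_input_cost(workload_m, 5.00, 0.35)  # +35% tokens
    gemini_31 = daily_input_cost(workload_m, 2.00)

    print(f"Opus 4.6:       ${opus_46:.2f}/day")
    print(f"Opus 4.7:       ${opus_47_low:.2f} to ${opus_47_high:.2f}/day")
    print(f"Gemini 3.1 Pro: ${gemini_31:.2f}/day")
    # Opus 4.6: $50.00/day; Opus 4.7: $60.00 to $67.50/day; Gemini: $20.00/day
    ```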

    Anthropic isn’t hiding the tokenizer change; it’s in the release notes. They just aren’t emphasizing it. If you’re budgeting Claude API spend for Q2, add 25% to whatever your Q1 spreadsheet says.


    The BrowseComp Regression Nobody’s Talking About

    Opus 4.7 is a leap forward in almost every category. Almost.

    BrowseComp (the web agent benchmark measuring how well a model can navigate and extract information from the open web) regressed. Opus 4.6 scored 83.7%; Opus 4.7 scores 79.3%. That’s a 4.4-point drop on a task Anthropic has been publicly investing in.

    GPT-5.4 Thinking scores 82.7% on the same benchmark (GPT-5.4 Pro pushes that to a state-of-the-art 89.3%, but it’s locked behind the $200 Pro tier), and Gemini 3.1 Pro scores 85.9%. At the $20 consumer tier, Gemini leads. Opus 4.7 is now a distant third for web agent work.

    Most coverage of the 4.7 launch either skipped this number or mentioned it in passing. It matters. If your use case involves an agent actually pulling information from live websites (research assistants, competitive monitoring, automated form filling), Claude is the wrong choice right now. Anthropic shipped a coding breakthrough and took a web-agent step back to do it.

    The honest frame: Opus 4.7 is the best model you can point at a codebase, but not the best model you can point at a browser. In the 30-day test this was the first gap I noticed after upgrading. The second was the API bill.


    The Coding Showdown

    If you’re picking an AI subscription to help you ship code, this section decides it. It’s also the section where I want to be especially careful about bias, because Claude won the big benchmarks but Codex wins on axes the benchmarks don’t capture.

    Codex CLI vs Claude Code vs Gemini CLI

    Three different philosophies, three different terminal experiences.

    Codex CLI: OpenAI’s agentic terminal

    Codex CLI is OpenAI’s agentic coding tool, running GPT-5.4 Thinking locally with xhigh reasoning. The Codex Cloud environment gives you sandboxed execution, so the agent can run the code it writes and self-correct. The brand-new Pro $100 plan launched April 9 includes the 10x bonus usage on Codex through May 31, which is the most aggressive pricing in the category. In the 30-day test I used Codex for DevOps, shell automation, CI wrangling, and greenfield scripts where context wasn’t king. Codex is also open source under Apache 2.0 with 67,000+ GitHub stars, which Claude Code is not.

    Claude Code: the professional default for repos

    Claude Code is the tool most professional developers I know default to for repo-scale work in April 2026. Opus 4.7 introduced /ultrareview, /ultraplan, and the new xhigh effort level. On Day 26 I ran /ultrareview against an open PR I’d already reviewed manually. It caught a race condition in async cleanup I’d missed. On Day 27 it caught a memory leak in a useEffect that three other reviewers signed off on. That’s when I migrated three in-progress projects off Codex and onto Claude Code inside 48 hours. The CursorBench jump from 58% on 4.6 to 70% on 4.7 isn’t marketing. Partner data shows the same step change in production workflows. For a deeper look at how Claude Code fits into a broader productivity stack, check our guide to GitHub repos that supercharge Claude Code.

    Gemini CLI plus Jules plus Antigravity: Google’s stack

    Gemini CLI plus Jules plus Antigravity is the stack that matters for Gemini. npm install -g @google/gemini-cli gets you the terminal interface. Jules is the asynchronous cloud agent that runs background bug fixes and refactors and submits PRs. Antigravity is Google’s agent-first IDE (VSCode fork) launched in November 2025. Gemini 3.1 Pro leads LiveCodeBench Pro at 2887 Elo, the highest score ever recorded on that leaderboard. Furthermore, in the 30-day test Gemini CLI was consistently the fastest to return something correct, particularly on algorithmic problems and standalone functions. However, where it lost ground was multi-file orchestration.

    Codex vs Claude Code: The Honest Breakdown

    Benchmarks don’t tell the whole story. Here’s what actually matters after 30 days.

    Where Codex actually beats Claude Code:

    • Terminal-Bench 2.0: Codex 77.3% vs Claude 69.4%. Real 8-point gap on terminal, shell, and DevOps work.
    • Token efficiency: Codex uses roughly 4x fewer tokens for comparable output. For example in one documented Figma-to-code benchmark, Codex burned 1.5M tokens to Claude’s 6.2M.
    • API cost at scale: a complex task that costs $15 on Codex can run $155 on Claude Code at API rates. Consequently that gap compounds fast.
    • App scaffolding, iOS, macOS, and front-end UI work, where GPT-5.4 is particularly strong.
    • Async cloud workflows. Fire a task at Codex Cloud, walk away, review the branch later.
    • Code review catching race conditions, edge cases, and security oversights. Developer consensus on Reddit is that Codex catches a different class of bugs than Claude.
    • Raw developer preference in a 500+ developer survey: 65% Codex vs 35% Claude Code.
    • Open source under Apache 2.0. Fork it, modify it, self-host it.

    Where Claude Code wins:

    • SWE-Bench Verified: 87.6% vs 85.0% Codex. On real-repo engineering tasks, Claude is 2.6 points ahead.
    • SWE-Bench Pro: 64.3% vs 57.7%. Multi-language real-world tasks.
    • Complex multi-file refactoring and architectural work across large codebases.
    • First-pass accuracy. Reported 95% on complex prompts in one benchmark vs Codex needing two corrections on the same task.
    • Agent Teams with inter-agent messaging, 26 lifecycle hooks (up from 17), 3,000+ MCP integrations.
    • CLAUDE.md persistent instructions across sessions.
    • Long-horizon agentic reliability, per Anthropic’s own evals on multi-step workflows.

    The developer consensus on Reddit and Hacker News condenses to two lines: “Codex for keystrokes, Claude Code for commits.”
    “Claude delivers precision edits. Codex handles broad refactoring.”

    The Hybrid Workflow Most Pros Actually Use

    The sophisticated move in April 2026 isn’t picking a side. Instead it’s running both.

    On March 30, 2026, OpenAI shipped an official plugin called codex-plugin-cc that lets you call Codex directly from within Claude Code. Three commands: /codex:review (standard code review), /codex:adversarial-review (actively challenges design decisions), and /codex:rescue (full task handoff). It requires a ChatGPT subscription or OpenAI API key plus Node.js 18.18+. That’s OpenAI setting up camp inside Anthropic’s ecosystem, which tells you something about where the market is heading.

    The workflow I settled into by Day 28:

    Claude Code writes the implementation: Opus 4.7 handles the multi-file refactor, the architectural decisions, the cross-repo context. Then I run /codex:adversarial-review or a separate Codex pass over the diff. Codex catches a different class of issues (missed error handling, race conditions, security oversights) that Claude’s reasoning path missed. Neither tool catches everything on its own; together they cover more surface area than either does alone.
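    If you want the same two-pass loop outside the plugin, here is a minimal scripted sketch. The headless commands (`claude -p`, `codex exec`) are assumptions about what your installed CLI versions expose, so treat them as placeholders and check your own tools’ help output before relying on them.

    ```python
    # Hypothetical two-pass loop: Claude Code implements, Codex reviews the diff.
    # The CLI invocations ("claude -p", "codex exec") are assumptions; substitute
    # whatever headless mode your installed versions actually provide.
    import subprocess

    def run(cmd: list[str]) -> str:
        """Run a command and return its stdout, raising if it fails."""
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

    # Pass 1: ask Claude Code to implement the change (headless / print mode).
    task = "Refactor the session cleanup path so connections close on every exit."
    run(["claude", "-p", task])

    # Collect the resulting diff from the working tree.
    diff = run(["git", "diff"])

    # Pass 2: hand the diff to Codex for an adversarial review.
    review_prompt = (
        "Adversarially review this diff. Look specifically for race conditions, "
        "missed error handling, and security oversights:\n\n" + diff
    )
    print(run(["codex", "exec", review_prompt]))
    ```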

    Cost reality: if you’re running both at scale, use Codex on Pro $100 for volume and Claude Code on Pro or Max 5x for surgical precision. Therefore budget roughly $120/month total. For most serious developers I know, that’s cheaper than the productivity loss from picking wrong and sticking with one tool.

    Which One If You Can Only Pick One

    • Production code on real repos, multi-file refactors, architectural work: Claude Code.
    • DevOps, shell scripting, CI/CD, terminal-heavy work, async cloud jobs: Codex.
    • App scaffolding, front-end UI, algorithmic problems at speed: Gemini CLI.
    • Cost-sensitive workflow: Codex by a wide margin.

    The sticky version has evolved: Codex for terminals and token cost. Gemini for speed. Claude for production repos. Run both if you can.


    What 30 Days of Parallel Testing Showed

    Five observations from 30 days running all three flagships on the same daily work, corroborated by public leaderboards and community testing.

    Writing: Claude needs less editing

    Claude’s long-form writing consistently needed less editing. Across the 30 days I drafted the same client article three times, once per model, on repeat. Claude’s drafts shipped with roughly a 20-minute polish. Meanwhile GPT-5.4’s took closer to 45 minutes of AI-tell scrubbing, mostly variants of forced contrast framing that kept reappearing past the 1,500-word mark. Gemini’s drafts needed voice rewrites but came with research citations I hadn’t requested, which I kept. Moreover, this pattern matches what Vellum’s head-to-head testing shows on long-form work.

    Research: Gemini Deep Research replaced my workflow

    Gemini Deep Research became my default research tool by week two. The workflow: feed it a topic, walk away for 8 to 10 minutes, come back to a report with citations. Consequently it compressed what used to be half a day of reading into a 10-minute wait. Furthermore Gemini 3.1 Pro’s 85.9% on BrowseComp backs up what I was seeing day-to-day.

    Voice: ChatGPT still the only usable one

    ChatGPT voice mode was the only voice AI I actually used for phone calls. Gemini Live is improving. However Claude has no voice. Advanced Voice is still the only one with latency and tone close enough to natural that I’d use it hands-free while driving or walking. That one hasn’t changed and probably won’t until late 2026.

    Documents: NotebookLM is in a category of one

    NotebookLM handled a 180-page insurance contract in a way neither competitor could match. I fed it the document alongside two industry benchmark contracts for comparison. It generated a clause-by-clause risk summary with exact page citations and flagged three ambiguities the contract review attorney had also flagged (confirmed after the fact). Meanwhile Claude handled the single contract at 1M context but needed a direct prompt to cite page numbers. Additionally ChatGPT Plus hit the 128K consumer cap and asked me to split the document.

    Day 26: the Opus 4.7 migration

    When Opus 4.7 shipped on Day 26, coding output quality took a hard step change. I migrated three in-progress projects from Codex to Claude Code inside 48 hours. The jump isn’t subtle. Cursor’s published CursorBench numbers (58% to 70% resolution on identical tests) match what I saw in my own workflow. Whatever Anthropic did between 4.6 and 4.7, it’s the biggest single-generation improvement any of the three has shipped this year.

    Public sources that back this up

    Aider Leaderboard (aider.chat/docs/leaderboards/) places Opus 4.7 at the top of pass-rate tables for polyglot coding since April 16, with Gemini 3.1 Pro second and GPT-5.4 Thinking close behind. This is consistent with what I observed.

    Vellum’s head-to-head testing (vellum.ai/ai-model-comparison) tracks the three across reasoning, coding, writing, and agentic tasks with transparent methodology. Their current takeaway matches mine: no single model leads on everything, category specialization is real, the gap has narrowed.

    Zvi Mowshowitz’s weekly AI coverage has independently verified partner-reported numbers on 4.7 and found them holding up. Worth following for ongoing post-launch sanity checks.

    Reddit sentiment across r/ClaudeAI, r/Bard, and r/OpenAI in the two weeks since Opus 4.7 shipped clusters around three themes: /ultrareview catching bugs in production PRs that human reviewers missed, Gemini’s 1M context being more reliable than Claude’s at extreme lengths despite identical advertised windows, and GPT-5.4’s voice mode remaining unmatched.


    Same Prompt, Different Results

    Three scenarios from the 30-day test. Same prompt through all three flagships, observations below.

    Scenario 1: A 2,000-word analysis piece on the Stanford AI sycophancy study

    Claude Opus 4.7 produced the draft I needed to light-edit before publish. Strong structural instincts, caught nuances in the study methodology that only made sense if you’d read the actual paper. The kind of subtle argument construction you’d expect from a human writer who’d actually done the reading. Details on the study live in our Stanford sycophancy writeup.

    GPT-5.4 Thinking produced a competent draft with more AI tells requiring removal. Forced contrast framing showed up repeatedly and had to be scrubbed. Final quality was closer to first draft than publish-ready.

    Gemini 3.1 Pro produced the most research-grounded version. Pulled citations I hadn’t provided and cross-referenced two other sycophancy studies. Prose was thinner but the research backbone was stronger, which turned out to be useful because the added citations were accurate and worth keeping.

    Winner: Claude for voice. Gemini for research backbone.

    Scenario 2: Analyze a 40-page contract and flag risky clauses

    Gemini 3.1 Pro inside NotebookLM parsed the PDF fastest and most citation-accurately. Flagged four concerning clauses, generated a summary in structured form, linked each concern back to the specific page number. Zero hallucinated citations.

    Claude Opus 4.7 handled the full 40 pages without pagination complaints (1M context working as advertised) and caught five concerning clauses, one more than Gemini. Missed nothing but needed a direct prompt to cite page numbers, which is a documented Opus pattern.

    ChatGPT Plus couldn’t take the full PDF in the consumer UI because of the 128K cap. Would have required the API or splitting the document into chunks.

    Winner: Claude for depth. Gemini for workflow and citation accuracy.

    Scenario 3: Debug a React component with an infinite re-render

    All three identified the missing useEffect dependency array as the cause.

    Claude Opus 4.7 shipped the fix, a refactor suggestion to avoid the pattern elsewhere, test coverage for the specific case, and a proactive flag on two other components in the project likely to have the same bug. Comprehensive in a way that reminded me what an actual code reviewer does.

    Gemini CLI returned a correct answer fastest with three alternative fix patterns presented as options. Less additional context but quicker to act on.

    Codex CLI landed in the middle. Correct fix with less context than Claude, slower than Gemini.

    Winner: Claude for comprehensiveness. Gemini for raw speed.

    None of these are controlled benchmarks. They’re observations from running real daily work through three tools in parallel. Take them accordingly.


    Feature by Feature

    Writing Quality

    Claude Opus 4.7 still writes the cleanest prose of the three: tone, pacing, structure that doesn’t go soft at the 1,200-word mark. Nothing else at this price matches it. At week three I drafted the same 1,800-word article three times for a client. Claude’s draft shipped with a 20-minute edit; GPT-5.4’s took 45 minutes of AI-tell scrubbing. Gemini’s draft needed a voice rewrite but came with three research citations I hadn’t asked for, which I kept.

    GPT-5.4 Thinking writes the cleanest technical documentation. Instruction manuals, API references, onboarding flows. Methodical beats expressive when the reader is stressed.

    Gemini 3.1 Pro is the biggest writing upgrade of the year. Gemini 3 Pro sounded like a press release; 3.1 Pro doesn’t. Synthesis work across multiple sources is where it quietly leapfrogs the other two.

    The Image Generation Sub-Battle

    This is its own fight now, and it matters if you create content at volume. I ran the same five brand asset prompts through all three during the test. DALL-E still wins if there’s text in the image. Meanwhile Nano Banana Pro wins on photorealism. Claude Design is a different category entirely. I used it once to generate a meeting pitch deck in the time it would have taken me to open PowerPoint.

    DALL-E 3: still the text rendering champ

    DALL-E 3 + GPT Image inside ChatGPT Plus remains the mature option. Text rendering is the best of the three, which matters for infographics, posters, and anything with multiple labeled elements. Style range is wide. However safety filters are aggressive to the point of annoyance on legitimate use cases.

    Claude Design: not a DALL-E competitor

    Claude Design launched alongside Opus 4.7 as the first new product category Anthropic has shipped since Claude Code. It’s part of Anthropic Labs. Importantly, this is not a DALL-E competitor. It’s a prototype and presentation generator: slides, one-pagers, quick mockups, product prototypes. Figma, Adobe, and Wix stocks dropped 2 to 3% on announcement day; the market read the launch as a shot at design workflows, not image generation. If you need a logo, Claude Design is the wrong tool. If you need a product prototype deck for a meeting in 20 minutes, it’s suddenly the new best option.

    Nano Banana Pro: the bundle killer

    Nano Banana Pro inside Google AI Pro is the dark horse that just pulled ahead. Faster than DALL-E 3, photorealistic results that match or beat Midjourney on certain prompts, and free bundling with the $19.99 Google AI Pro tier. Plus Veo 3.1 Lite, the video generator, bundled in the same subscription at no extra cost. Consequently for marketing teams generating assets at volume, the Google stack is cheaper AND broader than the OpenAI equivalent. This was not true six months ago.

    Winner depends on use case. Text-in-image work, DALL-E 3. Photorealistic assets at volume, Nano Banana Pro. Design and prototype work, Claude Design. Nobody else is trying in the third category yet.

    Web Search and Deep Research

    First, ChatGPT Search is fine. Perplexity still does this better for raw search, but if you’re already in ChatGPT, it handles the 90% case.

    Meanwhile Claude’s web search is newer and more conservative. It fetches less, but what it fetches tends to be higher quality. Indeed fewer hallucinated sources than the other two, which matters for research work. However the BrowseComp regression cuts against this for fully agentic web tasks, so the strength is single-query grounded search, not multi-step agent workflows.

    Finally Gemini Deep Research is the category leader by a wide margin. Feed it a topic, walk away 5 to 10 minutes, come back to a research report with citations. Consequently by week two of the test it was my default research tool. Moreover Gemini 3.1 Pro’s 85.9% on BrowseComp backs up what I was seeing day to day.

    NotebookLM: The Single Biggest Moat in the Category

    Most reviews treat NotebookLM as a nice bonus. It isn’t. It’s the strongest single feature advantage any of the three has.

    On Day 18 I uploaded every long-form comparison article I’d written in the last year, plus this article’s own research notes, plus both Anthropic and Google model cards. NotebookLM generated an audio overview in 90 seconds that caught patterns across the corpus I hadn’t noticed. Two AI hosts having a 12-minute podcast conversation about what your documents actually say. It’s disorienting the first time you hear it. Then you start using it every day.

    The mechanics: upload up to 300 source documents on Pro or 600 on Ultra, and NotebookLM becomes a Gemini 3.1 Pro-powered research assistant grounded entirely in your materials. You get audio overviews (two AI hosts in podcast form), video overviews (same idea with generated visuals), infographic generation from your sources, two-way sync with Google Drive (documents update automatically), Q&A grounded in sources only with no open-web hallucination, mind maps across your source set, and citation links back to the specific page in each source.

    For research-heavy work (legal, academic, journalism, due diligence, competitive intelligence), this is the closest thing to magic at this price tier. Neither ChatGPT nor Claude has an equivalent. Claude Projects is the nearest thing, handling fewer documents with weaker ingestion. Custom GPTs cap at 20 files.

    If NotebookLM fits your workflow, the Google AI Pro subscription pays for itself before you touch the rest of the bundle.

    Voice Mode

    Advanced Voice in ChatGPT remains the best conversational voice AI. Low latency, natural tone, interruptible. Gemini Live is catching up fast, especially at holding context across long conversations. Claude has no voice mode. Anthropic’s public roadmap suggests this won’t change until 2027. ChatGPT voice was the only one I used hands-free during the 30-day test, and that pattern isn’t shifting on the current product slate.

    Memory

    First, ChatGPT’s memory is personal and conversational. It remembers you prefer Markdown, that you’re working on a SaaS company, that your cat is named Marlo. Useful for casual daily use.

    Meanwhile Claude’s memory is project-scoped through Projects. You define a Project, drop in context documents, and Claude references them across conversations. Consequently a cleaner model for work, worse for casual use.

    Finally Gemini’s memory is Google Account-integrated via Gems and the broader memory layer. Combined with NotebookLM above, the Google account integration means Gemini can reference your Calendar, Drive, Gmail if you allow it. Ultimately either powerful or uncomfortable depending on your trust posture.

    Computer Use and the Project Mariner Reality Check

    First, OpenAI’s Operator works in a sandboxed browser and handles most web tasks well. It’s the most mature computer use tool for consumers.

    Meanwhile Claude’s Computer Use API scored 78.0% on OSWorld, the category lead, but it’s locked to API customers. You won’t see it inside Claude Pro.

    Finally Project Mariner is Google’s answer. Here’s what the marketing doesn’t always say: it’s Ultra-only at $249.99/mo, US-only, and still in early access. For 99% of readers, Project Mariner might as well not exist yet; it’s vaporware in practical terms. When Google makes it available at the AI Pro tier in every country, it will change the competitive picture. Today it doesn’t.

    API Pricing Reality

    | Model | Input (per 1M tokens) | Output (per 1M tokens) | Effective Input Cost* |
    |---|---|---|---|
    | Claude Opus 4.7 | $5.00 | $25.00 | $6.00 to $6.75 (tokenizer inflation) |
    | GPT-5.4 Thinking | ~$3.00 | ~$12.00 | ~$3.00 |
    | Gemini 3.1 Pro | $2.00 | $12.00 | $2.00 |

    *Opus 4.7’s tokenizer produces 20 to 35% more tokens per request, so effective input cost is higher than rate card suggests.

    Gemini 3.1 Pro is 3x cheaper than Claude on effective input, 2x cheaper on output. At scale this compounds fast.

    If you run three or more AI agents in parallel on Claude, Opus 4.7’s tokenizer change shifted roughly $10,000/year of spending per agent compared to 4.6 at the same workload. That’s not theoretical. That’s math from my own Day 27 through Day 30 billing alerts after the model upgrade. Any developer running Claude daily at scale has seen the same number land.
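    To run this math against your own traffic, the sketch below folds input, output, and the tokenizer inflation into one monthly estimate. The rates come from the table above, the inflation factor is the reported 20 to 35% range, and the example workload is illustrative; your real bill depends on content mix.

    ```python
    # Rough monthly API bill per model, using the rate card above.
    # input_inflation models the Opus 4.7 tokenizer producing more tokens
    # for the same text; 0.0 means the rate card applies as printed.

    def monthly_bill(input_m_per_day: float, output_m_per_day: float,
                     in_rate: float, out_rate: float,
                     input_inflation: float = 0.0, days: int = 30) -> float:
        daily = (input_m_per_day * (1 + input_inflation) * in_rate
                 + output_m_per_day * out_rate)
        return daily * days

    # Illustrative workload: 10M input + 1M output tokens per day, per agent.
    workload = dict(input_m_per_day=10.0, output_m_per_day=1.0)

    print("Opus 4.7 (+20% tokens):",
          monthly_bill(**workload, in_rate=5.0, out_rate=25.0, input_inflation=0.20))
    print("Opus 4.7 (+35% tokens):",
          monthly_bill(**workload, in_rate=5.0, out_rate=25.0, input_inflation=0.35))
    print("GPT-5.4 Thinking:      ",
          monthly_bill(**workload, in_rate=3.0, out_rate=12.0))
    print("Gemini 3.1 Pro:        ",
          monthly_bill(**workload, in_rate=2.0, out_rate=12.0))
    ```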

    Workspace Integration: Gemini’s Structural Moat

    This is where Google’s advantage becomes structural rather than tactical. Specifically Gmail, Docs, Sheets, Drive, Calendar, Meet, all of them ship with Gemini 3.1 Pro integrations included in AI Pro.

    Furthermore you can ask Gemini to draft a reply inside Gmail, summarize a thread, generate a table in Sheets from natural language, transcribe and summarize Meet calls automatically, or treat Drive as a dynamic knowledge base Gemini can query. The integration isn’t an afterthought. Instead it’s the primary surface for most Pro users.

    Meanwhile ChatGPT doesn’t have equivalent native integration with Microsoft 365. Indeed Microsoft’s own Copilot is a separate $30/mo product. Claude has none of this. Ultimately Google’s moat isn’t the model anymore, it’s that Gemini lives where your work already happens.


    Gemini’s Weaknesses Nobody’s Writing About

    If you only read the model card, Gemini 3.1 Pro is flawless. However real usage reveals five documented pain points, four of which I hit during the 30-day test.

    First, message limits collapse under load. The Pro tier is advertised at ~100 Pro-model messages per day, but Google’s own community forums show users hitting a soft cap closer to 25/day during peak hours. I hit the lower cap on Day 14, a Tuesday afternoon, mid-research. Anecdotal reports cluster on weekdays 9 AM to 5 PM Pacific, which is when most work actually happens.

    Second, “Not available in your region” errors. Gemini 3.1 Pro and several bundled features roll out on different country schedules. Consequently users outside the US, UK, EU core, and a handful of APAC countries hit this wall regularly.

    Third, output truncation at 8,192 tokens by default. Gemini’s API caps output at 8K tokens unless you explicitly raise max_output_tokens (up to 65,536). First-time API users ask for a 10,000-word response and get cut off at roughly page four. The documentation mentions the cap, but the default still trips people up; a minimal sketch of the fix follows this list of pain points.

    Fourth, “lost in the middle” at extreme context. The 1M token context works until it doesn’t. Specifically past roughly 500K tokens, retrieval accuracy starts degrading. Meanwhile Claude Opus 4.7 is more reliable at long-context retrieval in my own testing and in community reports, even though both models advertise the same window.

    Finally, safety filter false positives. Legitimate medical, legal, and security research queries get blocked more often on Gemini than on Claude or ChatGPT. Furthermore this is documented across r/Bard and Google Support forums.
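    For the truncation default in the third point above, here is a minimal sketch of the fix using the google-genai Python SDK. The model name is the one this article discusses and is a placeholder for whatever your account actually exposes; the point is setting max_output_tokens explicitly instead of trusting the default.

    ```python
    # Minimal sketch: raise Gemini's output cap instead of relying on the
    # 8,192-token default. Assumes the google-genai SDK (pip install google-genai)
    # and an API key in the environment; the model name is illustrative only.
    from google import genai
    from google.genai import types

    client = genai.Client()  # picks up GEMINI_API_KEY from the environment

    response = client.models.generate_content(
        model="gemini-3.1-pro",  # placeholder: use the model your tier exposes
        contents="Write a detailed 10,000-word report on ...",
        config=types.GenerateContentConfig(
            max_output_tokens=65536,  # explicit cap; the default truncates much earlier
        ),
    )

    print(response.text)
    ```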

    None of these are fatal, but all of them are real, and a fair comparison of ChatGPT vs Claude vs Gemini has to include them.


    Hallucination Rates and Trust

    Three models, three different honesty failure modes.

    Claude Opus 4.7 refuses when it isn’t sure. Community reports and the Vectara Hallucination Leaderboard (hughes-labs.com) place it at the low end of the hallucination scale, but the refusal rate on ambiguous queries runs higher. You ask a hard question, you might get “I don’t have reliable information on that” instead of a confident wrong answer. Whether that’s a feature or a bug depends on whether you want caution or decisiveness.

    Gemini 3.1 Pro hallucinates least when grounded. Inside NotebookLM, where responses are constrained to uploaded sources, citation accuracy is the best in the category. Outside NotebookLM, on open-web queries, the hallucination rate climbs closer to GPT-5.4 territory. Grounding is the moat, not the model.

    GPT-5.4 Thinking speaks confidently regardless of whether it should. Middle-of-the-pack on raw hallucination rate, worst on confident wrong answers when it does hallucinate. Most dangerous combination for non-technical users who can’t easily spot when the model is making things up.

    Hallucination rate (lower is better), per the Vectara Hallucination Leaderboard, previous-gen flagships (current flagships not yet benchmarked): GPT-5.2 High 10.8%, Claude Opus 4.6 12.2%, Gemini 3 Pro Preview 13.6%. Source: github.com/vectara/hallucination-leaderboard

    Separately, the Stanford sycophancy study from January 2026 measured how much each model tells users what they want to hear versus what’s true. Claude ranked lowest on sycophancy, GPT-5.4 ranked highest, Gemini sat between. If you’re stress-testing a decision and want a model that’ll push back, Claude is still the right pick.


    What the Enterprise Market Told Us These 30 Days

    Not what the press releases claim. Instead what the big buyers are actually doing.

    Cursor shipped Opus 4.7 as the default coding model inside 48 hours of release, with public CursorBench numbers showing the 58% to 70% jump in production testing. Cursor co-founder and CEO Michael Truell framed the jump directly:

    “On CursorBench, Opus 4.7 is a meaningful jump in capabilities, clearing 70%.” — Michael Truell, Co-founder & CEO, Cursor

    That’s the reference benchmark to cite if you want verifiable third-party data, and it’s the one I kept coming back to.

    Meanwhile US federal agencies have reportedly accelerated phase-out plans for Claude in certain security contexts, per ongoing coverage on Wikipedia’s Claude page. Consequently Gemini and GPT-5.4 are picking up the displaced work.

    Additionally Salesforce, Palantir, and ServiceNow have public Gemini 3.1 Pro deployments in production. Microsoft stayed aligned with GPT-5.4 via Copilot after the OpenAI partnership renegotiation. Anthropic’s enterprise revenue is reportedly up on Opus 4.7 demand, per subscription-paywalled reporting at The Information.

    Rakuten deployed Opus 4.7 for internal production workflows within days of release. Yusuke Kaji, General Manager of AI for Business at Rakuten, said it plainly:

    “Claude Opus 4.7 resolves 3x more production tasks than Opus 4.6.” — Yusuke Kaji, General Manager of AI for Business, Rakuten

    That’s the kind of number that moves enterprise procurement, not a benchmark.

    Hex, the data analytics platform, deployed Opus 4.7 immediately. Hex co-founder and CTO Caitlin Colgrove framed the efficiency story that most coverage missed:

    “Low-effort Opus 4.7 is roughly equivalent to medium-effort Opus 4.6.” — Caitlin Colgrove, Co-founder & CTO, Hex

    Translation: workflows that previously needed medium reasoning can drop to low and come out roughly even on cost. That’s the counter to the tokenizer inflation story, and it’s real if you rewrite your agent loops around it.

    The enterprise market is fragmenting by workload, not consolidating around a single vendor. The picture a fair ChatGPT vs Claude vs Gemini comparison should paint: no winner-take-all, and category specialization as the new default.


    Subscribe Now or Wait? Here’s the Honest Read

    First, GPT-5.5 “Spud” reportedly finished pretraining on March 24, 2026. Prediction markets place release by June 30 at roughly 70% probability. Therefore if you’re the person about to commit to an annual ChatGPT Plus plan specifically to lock in GPT-5.4 Thinking, hold off another 8 to 10 weeks. However if you need it now, buy monthly and reassess in June.

    Meanwhile Claude Opus 5 has no confirmed timeline. Anthropic historically ships major version bumps every 7 to 9 months, which places Opus 5 around Q4 2026. Consequently Opus 4.7 is likely the flagship through at least August. Therefore if you code daily, subscribe now. The Opus 4.7 jump is the biggest quality step in this category in a year.

    Finally Gemini 3.2 Pro or Gemini 4 is the harder read. Google DeepMind hasn’t signaled publicly. Gemini Deep Think saw a quiet point release in March. However the full next-generation Gemini is a summer or fall 2026 arrival at best. Therefore if you want NotebookLM and the bundle, subscribe now. However if you’re specifically waiting on Gemini 4 to justify the upgrade, you’ll be waiting a while.

    Quick decision tree: if you’re on the fence and primarily care about ChatGPT, wait for Spud. However if you’re on the fence and primarily care about Claude or Gemini, today is as good a day as any.


    Who Should Pick Which

    First, if you’re the person who ships production code every day, Claude Pro is the specialist tool. /ultrareview, /ultraplan, xhigh mode, and the highest SWE-Bench numbers in the category. Consequently post-Opus 4.7, it’s the default for professional development work. The Claude Pro review goes deeper on the tool stack.

    Second, if you’re the person who lives in Google Workspace and does research as core work, Google AI Pro. NotebookLM, Deep Research, Gemini inside Docs and Sheets, plus 2TB of storage and Nano Banana Pro and Veo 3.1 Lite bundled. Therefore it’s the best bundle-value subscription in the $20 tier by a wide margin.

    Third, if you’re the person who writes a lot of general content, makes phone calls with voice AI, generates images regularly, or uses custom GPTs, ChatGPT Plus is the safe default. Additionally it’s the right pick if you’re within a few weeks of moving to Pro $100 for agentic Codex work, or if you’re holding out for GPT-5.5.

    If you can only pick one, Google AI Pro has the widest utility for most users, Claude Pro is the best specialist tool, and ChatGPT Plus is the safe default that won’t let you down.

    If you can pay for two, Claude Pro plus Google AI Pro: coding and writing in Claude, everything else in Gemini. This is the stack most serious users I know are running in April 2026.

    If you can pay for all three, you’re a developer or researcher and you already know why. Roughly $60/month total.


    The Bottom Line

    Nobody’s winning this outright. That’s the headline. And it’s different from the headline anyone could honestly write two months ago.

    The map keeps redrawing

    Six months back the ChatGPT vs Claude vs Gemini conversation was ChatGPT plus a distant second place. Eight weeks ago it was Gemini leading 13 of 16 benchmarks. Four days ago Opus 4.7 redrew the map again. Today it’s three category leaders with different strengths, different costs, and different failure modes.

    Claude’s new position

    Opus 4.7 is now the best model you can point at a codebase. 87.6% SWE-Bench Verified, 64.3% SWE-Bench Pro, MCP-Atlas 77.3%, CharXiv jumping double digits. Cursor partner data backs the benchmarks. However the tradeoffs are real too: BrowseComp regressed, the tokenizer inflates API costs 20 to 35%, and Claude still has no voice mode.

    Gemini’s plurality moment

    Gemini 3.1 Pro owns reasoning and research. ARC-AGI-2 77.1% is a lead that doubles the competition. GPQA Diamond 94.3% is the highest score ever recorded. NotebookLM is the single strongest moat any of the three has. However the tradeoffs: message limits collapsing under load, region restrictions, output truncation defaults that trip up new users, and a middle tier that doesn’t exist. Moreover 750 million monthly active users per Alphabet’s own Q4 2025 disclosure means this isn’t a contender anymore, it’s a plurality.

    ChatGPT as the baseline everyone compares against

    GPT-5.4 Thinking is the baseline. The default. The model everyone compares against. It leads on Terminal-Bench, OSWorld, GDPval-AA, and voice mode. Additionally it has the widest ecosystem of third-party integrations, which matters more than benchmarks for most daily use. Furthermore GPT-5.5 Spud is arriving by June 30 at roughly 70% probability. ChatGPT Plus remains the safest $20/month you can spend on AI.

    The real headline

    The DeepSeek pivot at the top wasn’t a joke. Six months ago it was the most interesting cheap-to-run option in the market. Now it’s a cautionary tale about trust and organizational risk. Ultimately the winners of the 2026 AI subscription war aren’t going to be the cheapest models. They’re going to be the most trusted ones that happen to be great.

    Codex for terminals and token cost. Gemini for speed. Claude for production repos. And the sophisticated move is running both Codex and Claude Code in a hybrid loop, which OpenAI itself enabled with the codex-plugin-cc shipped March 30.

    Meanwhile for the first time since the GPT-4 era, Google is one of the serious answers.


    FAQ

    What is the difference between ChatGPT, Claude, and Gemini in April 2026?

    ChatGPT (OpenAI, GPT-5.4 Thinking flagship) leads Terminal-Bench, OSWorld, BrowseComp, and voice mode. Claude (Anthropic, Opus 4.7 flagship released April 16) leads both SWE-Bench variants, MCP-Atlas, Computer Use, and scientific figure interpretation. Gemini (Google, 3.1 Pro flagship released February 19) leads ARC-AGI-2, GPQA Diamond, and bundles NotebookLM Pro, Nano Banana Pro, Veo 3.1 Lite, and 2TB Google One storage. All three cost roughly $20/month at the entry tier.

    Is Gemini better than ChatGPT in 2026?

    On raw reasoning benchmarks, Gemini 3.1 Pro leads: ARC-AGI-2 77.1%, GPQA Diamond 94.3%, Humanity’s Last Exam 44.4% without tools. ChatGPT Plus ($20) leads on Terminal-Bench at 75.1%, OSWorld computer use at 75%, voice mode, and third-party ecosystem breadth. On BrowseComp web agent work Gemini 3.1 Pro (85.9%) beats GPT-5.4 Thinking (82.7%) at the Plus tier, though GPT-5.4 Pro ($200 tier) pushes ahead to 89.3%. Better depends on task. For research and multimodal, Gemini. For general purpose and voice, ChatGPT.

    Which AI is best for coding: ChatGPT vs Claude vs Gemini?

    Depends on the work. Claude Opus 4.7 leads SWE-Bench Verified at 87.6% and SWE-Bench Pro at 64.3%, making Claude Code the default for real-repo professional work in April 2026. Codex CLI (GPT-5.4 Thinking) leads Terminal-Bench at 77.3% vs Claude’s 69.4%, uses roughly 4x fewer tokens, and is the pick for DevOps, shell scripting, and cost-sensitive workflows. Gemini 3.1 Pro leads LiveCodeBench Pro at 2887 Elo for raw speed on algorithmic problems. The sophisticated move is running both Codex and Claude Code together via OpenAI’s codex-plugin-cc, shipped March 30, 2026.

    How much do all three AI services cost?

    Entry tier: ChatGPT Plus $20/month, Claude Pro $20/month, Google AI Pro $19.99/month. Google AI Pro bundles 2TB Google One storage, Deep Research, NotebookLM Pro, Nano Banana Pro image gen, and Veo 3.1 Lite video. Mid tier: ChatGPT Pro $100 (new April 9, 2026) and Claude Max 5x both $100/month, Google skips this bracket. Top tier: ChatGPT Pro $200/month, Claude Max 20x $200/month, Google AI Ultra $249.99/month.

    What is Gemini 3.1 Pro best at?

    Research via Deep Research and NotebookLM (300 sources on Pro), raw reasoning (ARC-AGI-2 77.1%, GPQA Diamond 94.3%), multimodal input, cost-efficient API ($2/1M input tokens), and Workspace integration. Output speed 114 tokens/sec is the fastest of the three.

    Can I use ChatGPT, Claude, and Gemini together?

    Yes. Common stack: Claude Pro for code and long-form writing, Google AI Pro for research and Workspace tasks, ChatGPT Plus for general purpose and voice. Total ~$60/month. Each covers a different specialty with minimal overlap.

    Should I wait for GPT-5.5 before subscribing to ChatGPT Plus?

    GPT-5.5 “Spud” reportedly finished pretraining March 24, 2026, with prediction markets placing release by June 30 at ~70% probability. If you’re buying an annual plan specifically to lock in the flagship, wait 8 to 10 weeks. If you need ChatGPT now, subscribe monthly and reassess in June.