[release] 5 min · Apr 24, 2026

GPT-5.5 — OpenAI Ships the Model, Holds Back API Access

OpenAI launched GPT-5.5 as the Codex default but withheld API access. The delay is not about safety — it is a subscription funnel disguised as caution.

GPT-5.5 ↗ Apr 23, 2026
#openai #gpt-5-5 #codex #api-pricing #ai-models

OpenAI released GPT-5.5 on April 23, 2026 — the first fully retrained base model since GPT-4.5. Every GPT-5.x release between them (5.1 through 5.4) was a post-training iteration on the same base. This one is a ground-up rebuild, codenamed “Spud,” and it is now the default model in Codex and ChatGPT for Plus, Pro, Business, and Enterprise subscribers. The API? Not yet. OpenAI says it requires “different safeguards” and will arrive “very soon.” No date given.

TL;DR

  • What: GPT-5.5 ships to paid subscribers in ChatGPT and Codex; API access deliberately withheld
  • Why it matters: Production teams running Codex through the API are stuck on GPT-5.4 while subscription users get the upgrade
  • Benchmarks: state-of-the-art 82.7% on Terminal-Bench 2.0, but trails Claude Opus 4.7 on SWE-Bench Pro and MCP-Atlas
  • Cost: API pricing will double to $5/$30 per 1M tokens when it lands

GPT-5.5 — What Happened

OpenAI is running the same playbook Anthropic ran with Claude Opus 4.7: route the flagship model through a subscription funnel before opening the API, then call the delay a safety precaution. The “different safeguards for API deployments” line is technically true and strategically convenient. It ties GPT-5.5’s most capable agentic workflows to Codex subscriptions for weeks while API customers’ pipelines are still routing through GPT-5.4.

This matters because of how most production teams actually use Codex. They do not sit in the ChatGPT UI. They hit the API. They build pipelines. They run overnight agentic loops. And those teams — the ones spending serious money — are the last to get the upgrade.
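To make that concrete: the pin is a one-line detail buried in every pipeline, and it stays on the old model until the API opens. A minimal sketch using the OpenAI Python SDK; the model identifier is a placeholder, since OpenAI has not published an API name for GPT-5.5.

```python
# A typical pipeline call, pinned to the model the API actually exposes.
# "gpt-5.4" is a placeholder ID for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def review_diff(diff: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-5.4",  # no GPT-5.5 option here until the API ships
        messages=[
            {"role": "system", "content": "You are a code reviewer."},
            {"role": "user", "content": f"Review this diff:\n{diff}"},
        ],
    )
    return resp.choices[0].message.content
```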

The benchmarks tell a split story. GPT-5.5 hits 82.7% on Terminal-Bench 2.0, which measures agentic CLI workflow competence. That is state-of-the-art — up from 75.1% for GPT-5.4, and well ahead of Claude Opus 4.7 at 69.4% and Gemini 3.1 Pro at 68.5%. If your agents spend their time executing multi-step terminal operations, this is a real and measurable improvement.

But Terminal-Bench is not the whole picture. On SWE-Bench Pro, GPT-5.5 scores 58.6% against Opus 4.7’s 64.3%. OpenAI footnotes that Anthropic acknowledged signs of memorization on a subset of those problems, a pointed caveat, but one that does not close a 5.7-point gap. And on Scale AI’s MCP-Atlas benchmark for tool-use coordination, GPT-5.5 scores 75.3%, trailing both Opus 4.7 (79.1%) and Gemini 3.1 Pro (78.2%). For teams running MCP-heavy agent stacks, GPT-5.5 is not a clean win.

If your production pipeline relies on MCP tool-use coordination, GPT-5.5 underperforms both Claude Opus 4.7 and Gemini 3.1 Pro on MCP-Atlas. The fact that OpenAI made it Codex’s default does not make switching the right call for your stack.
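One way to act on that: route by workload instead of following the default. The table below is purely illustrative, built from the published numbers above, with placeholder model IDs; your own task-level evals should overrule it.

```python
# Illustrative routing table derived from the benchmark results quoted
# above. Model IDs are placeholders; real routing should come from your
# own evals, not leaderboard deltas.
BENCHMARK_LEADER = {
    "cli_agent":      "gpt-5.5",          # Terminal-Bench 2.0: 82.7 vs 69.4
    "multi_file_swe": "claude-opus-4.7",  # SWE-Bench Pro: 64.3 vs 58.6
    "mcp_tool_use":   "claude-opus-4.7",  # MCP-Atlas: 79.1 vs 75.3
}

def pick_model(workload: str) -> str:
    # Fall back to the model your API pipelines can actually reach today.
    return BENCHMARK_LEADER.get(workload, "gpt-5.4")
```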

Why This Matters

The benchmark split reveals something more interesting than the raw numbers: GPT-5.5 is optimized for a specific kind of agentic work, and OpenAI is betting that kind of work is the future that pays. Terminal-Bench 2.0 dominance combined with Expert-SWE performance — 73.1% on tasks with a median estimated human completion time of 20 hours — tells you where OpenAI thinks the money is. Long-horizon, CLI-driven, autonomous coding loops. The kind of work that Codex subscriptions are designed around.

The token efficiency claim supports this thesis. OpenAI’s own charts show GPT-5.5 using significantly fewer tokens than GPT-5.4 to complete the same Codex tasks, specifically validated on Terminal-Bench 2.0 and Expert-SWE. But “fewer tokens per task” is a benchmark-specific finding, not a blanket rule. Simpler or shorter tasks will not see the same delta. And the per-token price is doubling: $5 per million input tokens and $30 per million output, up from GPT-5.4’s $2.50/$15. The pro-tier variant lands at $30/$180 per million tokens.

OpenAI argues that token efficiency offsets the per-token increase. For long, complex agentic runs — exactly the workloads Codex targets — that argument has real merit. A model that solves a 20-hour coding task in substantially fewer tokens at 2x the per-token price could still come out cheaper. But you will not know your actual cost delta until you run your own workloads through it, and you cannot do that until the API is available. Which brings us back to the subscription funnel.
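The break-even math is simple enough to sketch. At double the per-token price, GPT-5.5 has to finish the same task in under roughly half the tokens to come out cheaper, assuming a similar input/output mix. The token counts below are invented for illustration; only the prices come from the announcement.

```python
# Back-of-the-envelope cost per task. Prices are per 1M tokens as
# announced; the token counts are hypothetical.
def task_cost(input_toks: int, output_toks: int,
              in_price: float, out_price: float) -> float:
    return (input_toks * in_price + output_toks * out_price) / 1_000_000

old = task_cost(900_000, 250_000, 2.50, 15.00)  # GPT-5.4 pricing
new = task_cost(400_000, 110_000, 5.00, 30.00)  # GPT-5.5 pricing
print(f"GPT-5.4: ${old:.2f}, GPT-5.5: ${new:.2f}")  # $6.00 vs $5.30
# Cheaper here only because token usage dropped by more than half.
# A 30% reduction at 2x prices would cost 1.4x as much per task.
```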

When the API does drop, do not assume cost parity with GPT-5.4. Run your actual workloads through both models and compare total cost per task — not per-token pricing. The efficiency gains are real but task-dependent.
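Once the API opens, that comparison is a short script, not a research project. A sketch, assuming chat-completions-style usage accounting; the “gpt-5.5” ID and its price row are assumptions until launch.

```python
# Replay the same task set through both models and compare measured cost
# per task. PRICES and the "gpt-5.5" entry are assumptions until launch.
PRICES = {  # $ per 1M tokens: (input, output)
    "gpt-5.4": (2.50, 15.00),
    "gpt-5.5": (5.00, 30.00),
}

def mean_cost_per_task(client, model: str, tasks: list[list[dict]]) -> float:
    in_price, out_price = PRICES[model]
    total = 0.0
    for messages in tasks:
        resp = client.chat.completions.create(model=model, messages=messages)
        total += (resp.usage.prompt_tokens * in_price
                  + resp.usage.completion_tokens * out_price) / 1_000_000
    return total / len(tasks)
```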

The competitive picture is more nuanced than OpenAI’s announcement suggests. GPT-5.5’s Terminal-Bench dominance is genuine, but the SWE-Bench Pro gap against Opus 4.7 and the MCP-Atlas shortfall are not footnotes — they represent the two other major dimensions of agentic work. Multi-file codebase reasoning (SWE-Bench Pro) and tool-use coordination (MCP-Atlas) are not edge cases for production teams. They are core capabilities. A model that excels at CLI workflows but falls behind on structured tool use is winning one battle while conceding two others.

The Expert-SWE result deserves attention separate from the headline benchmarks. At 73.1% on tasks that take a human an estimated 20 hours, GPT-5.5 shows a meaningful gain over GPT-5.4 on long-horizon autonomous work. This is the number production teams running overnight Codex pipelines should care about most. It suggests GPT-5.5 maintains coherence and direction over longer task arcs — exactly where previous models tended to lose the plot after dozens of tool calls.
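For reference, the workload Expert-SWE approximates has the shape of a bounded tool-call loop. This is a skeleton, not Codex’s actual harness: run_tool is a hypothetical executor and the model ID is a placeholder.

```python
# Skeleton of a long-horizon agentic loop: call the model, execute the
# tools it requests, feed results back, repeat under a step budget.
import json
from openai import OpenAI

client = OpenAI()

def run_tool(name: str, args: dict) -> str:
    """Hypothetical executor: dispatch to your shell/editor/test tools."""
    raise NotImplementedError

def agent_loop(task: str, tools: list[dict], max_steps: int = 200) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):  # the budget an overnight run lives or dies by
        resp = client.chat.completions.create(
            model="gpt-5.4", messages=messages, tools=tools,  # placeholder ID
        )
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:       # no tool request: the model is done
            return msg.content
        for call in msg.tool_calls:  # execute each requested tool
            result = run_tool(call.function.name,
                              json.loads(call.function.arguments))
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": result})
    raise TimeoutError("agent exceeded its step budget")
```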

OpenAI also claims GPT-5.5 matches GPT-5.4’s per-token latency in real-world serving. If that holds under production load, it removes one of the usual objections to model upgrades: that you are trading speed for capability. The retrained base seems to have absorbed the capability gains without the latency tax that typically comes with larger or more complex architectures.
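That claim is checkable once you have access. Per-token latency is roughly wall time over tokens generated; a crude probe, ignoring streaming time-to-first-token:

```python
# Crude per-token latency probe: wall time divided by output tokens.
import time

def per_token_latency(client, model: str, messages: list[dict]) -> float:
    start = time.monotonic()
    resp = client.chat.completions.create(model=model, messages=messages)
    elapsed = time.monotonic() - start
    return elapsed / resp.usage.completion_tokens  # seconds per output token
```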

The Take

I have seen this playbook before, and I am going to call it what it is. The API holdback is not primarily about safety. It is about directing the highest-value users — the ones building agentic pipelines that could run on any provider’s API — toward Codex subscriptions. Once your team has spent three weeks building workflows around Codex’s hosted GPT-5.5, the switching cost to Claude or Gemini goes up. By the time the API launches, the lock-in is already in place.

That does not make GPT-5.5 a bad model. The Terminal-Bench and Expert-SWE numbers are real, and for CLI-heavy agentic work, nothing else touches it right now. But if your stack depends on MCP tool coordination, Opus 4.7 is still the better choice. And if you are evaluating total cost of ownership, you need the API — which you cannot get yet.

My recommendation: if you are a Codex subscriber, use GPT-5.5 for long-horizon CLI tasks where it clearly excels. Do not migrate MCP-heavy pipelines until you have run your own MCP-Atlas-style evaluations. And do not let the subscription funnel pressure you into architectural decisions before the API pricing shakes out in practice. The model is strong. The distribution strategy is cynical. Keep those two facts separate.