How much do the Claude, GPT, and Gemini APIs actually cost?

Frontier APIs bill per million tokens, split into input (what you send) and output (what the model writes). In mid-2026, Gemini 2.5 Flash is the cheapest capable option at $0.30/$2.50, while GPT-5.5-pro is the priciest at $30/$180. Output always costs several times more than input. Batch mode takes about 50% off, and prompt caching takes up to 90% off repeated input.

Last updated 2026-06-14 · Physea Labs

Frontier model pricing looks intimidating until you learn the one unit everything is quoted in: cost per million tokens, split into input and output. Once that clicks, the whole landscape fits on one screen.

How the billing works

You pay for two things separately. Input is the text you send: your prompt, the documents you paste, the conversation so far. Output is the text the model writes back. A token is about three-quarters of a word. A page of text runs 500–800 tokens, and a million tokens is roughly 750,000 words, a few books’ worth.

Here’s the part worth remembering. Output costs far more than input on every provider, usually four to six times more. A chatty model that writes long answers can quietly cost far more than its headline input rate suggests.
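To make the arithmetic concrete, here is a minimal sketch of the per-request calculation. The function name and the token counts are made up for illustration; the rates are the Claude Sonnet 4.6 list prices from the price table ($3 in, $15 out per million).

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Dollar cost of one request. Rates are dollars per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Same 2,000-token prompt, two different answer lengths (illustrative numbers).
short_reply = request_cost(2_000, 300, input_rate=3.00, output_rate=15.00)
long_reply = request_cost(2_000, 3_000, input_rate=3.00, output_rate=15.00)

print(f"short answer: ${short_reply:.4f}")
print(f"long answer:  ${long_reply:.4f}")
```

With these numbers the long answer costs nearly five times as much as the short one, even though the prompt is identical.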

The 2026 price table

These are list prices for standard, single requests as of mid-2026. Discounts (below) apply on top.

Model	Input $/1M	Output $/1M	Context
Claude Opus 4.8	$5.00	$25.00	1M
Claude Sonnet 4.6	$3.00	$15.00	1M
Claude Haiku 4.5	$1.00	$5.00	200K
GPT-5.5	$5.00	$30.00	~1M
GPT-5.5-pro	$30.00	$180.00	~1M
Gemini 3.1 Pro	$2.00 / $4.00*	$12.00 / $18.00*	2M
Gemini 3.5 Flash	$1.50	$9.00	1M
Gemini 2.5 Flash	$0.30	$2.50	1M

*Gemini 3.1 Pro bills the higher rate above 200K input tokens.

A few things jump out. The cheapest capable model is Gemini 2.5 Flash, at roughly a tenth or less of the Claude and OpenAI flagship rates. The priciest is GPT-5.5-pro, aimed at the hardest problems where you’ll pay almost anything for a better answer. And Gemini’s flagship undercuts both Claude’s and OpenAI’s while carrying the largest context window of the bunch, 2 million tokens.
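One rough way to use the table is to price the same monthly workload on every model and sort the results. The sketch below does that with the list prices above. The workload (100M input tokens and 20M output tokens a month) and the helper names are assumptions for illustration, and Gemini 3.1 Pro is entered at its base rate only.

```python
# List prices from the table, in dollars per million tokens: (input, output).
# Gemini 3.1 Pro uses its base rate (prompts up to 200K input tokens).
PRICES = {
    "Claude Opus 4.8": (5.00, 25.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "GPT-5.5": (5.00, 30.00),
    "GPT-5.5-pro": (30.00, 180.00),
    "Gemini 3.1 Pro": (2.00, 12.00),
    "Gemini 3.5 Flash": (1.50, 9.00),
    "Gemini 2.5 Flash": (0.30, 2.50),
}


def monthly_cost(model: str, input_millions: float, output_millions: float) -> float:
    """Dollar cost of a workload measured in millions of tokens."""
    in_rate, out_rate = PRICES[model]
    return input_millions * in_rate + output_millions * out_rate


def rank_by_cost(input_millions: float, output_millions: float):
    """Return (model, cost) pairs sorted from cheapest to priciest."""
    costs = {m: monthly_cost(m, input_millions, output_millions) for m in PRICES}
    return sorted(costs.items(), key=lambda pair: pair[1])


# Hypothetical workload: 100M input tokens and 20M output tokens per month.
for model, cost in rank_by_cost(100, 20):
    print(f"{model:<18} ${cost:>9,.2f}")
```

Change the two workload numbers to match your own traffic. The ordering can shift when your mix is unusually output-heavy.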

Watch the long-context cliff

One trap that doesn’t show up in the headline numbers: GPT-5.5 bills prompts over 272K input tokens at 2x the input rate and 1.5x the output rate, and it applies that to the whole session, not just the overflow. If you routinely push very long contexts into GPT-5.5, that surcharge can dominate your bill. Gemini 3.1 Pro has its own version: it charges $2/$12 per million up to 200K input tokens, then jumps to $4/$18 above that, so the surcharge bites on exactly the long-document work its 2M window invites. Claude is the exception here, carrying its full 1M context at the standard rate (Opus 4.6 and Sonnet 4.6 onward), which can make it cheaper for long-document work even when the per-token rate looks higher.
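The cliff is easier to see with numbers. This is a simplified sketch, not any provider's billing logic: it prices a single request and assumes that once a prompt crosses the threshold, the higher rate applies to the whole request. The function names and token counts are hypothetical; the rates and thresholds are the ones quoted above.

```python
def gemini_31_pro_cost(input_tokens: int, output_tokens: int) -> float:
    # Assumption: above 200K input tokens the higher rates cover the
    # whole request, not only the tokens past the threshold.
    in_rate, out_rate = (4.00, 18.00) if input_tokens > 200_000 else (2.00, 12.00)
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


def gpt_55_cost(input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = 5.00, 30.00
    if input_tokens > 272_000:
        in_rate *= 2.0   # 2x input surcharge
        out_rate *= 1.5  # 1.5x output surcharge
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


def claude_sonnet_46_cost(input_tokens: int, output_tokens: int) -> float:
    # Flat rate across the full context window.
    return (input_tokens * 3.00 + output_tokens * 15.00) / 1_000_000


# One prompt below both thresholds, one above both (illustrative sizes).
for prompt_tokens in (150_000, 300_000):
    print(f"{prompt_tokens:,} input tokens, 2,000 output tokens")
    print(f"  Gemini 3.1 Pro    ${gemini_31_pro_cost(prompt_tokens, 2_000):.3f}")
    print(f"  GPT-5.5           ${gpt_55_cost(prompt_tokens, 2_000):.3f}")
    print(f"  Claude Sonnet 4.6 ${claude_sonnet_46_cost(prompt_tokens, 2_000):.3f}")
```

Under these assumptions Gemini 3.1 Pro is the cheapest of the three at 150K tokens, and Claude Sonnet 4.6 is the cheapest at 300K.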

Three ways to cut the bill

You rarely pay list price if you don’t want to. Every major provider offers the same broad levers:

Prompt caching. If you send the same big chunk of context repeatedly (a system prompt, a reference document), caching it cuts the cost of that cached input by up to 90%. For most agent workloads this is the one that moves the needle.
Batch mode. If you don’t need the answer this second, batch processing runs at roughly half price. Good for overnight jobs, bulk classification, report generation.
Pick a smaller model. The jump from a flagship to a Flash or Haiku tier can be 10x cheaper, and for routine tasks the quality difference is small. Match the model to how hard the task actually is, not to its prestige.
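Here is a rough sketch of how the first two levers change a single request's cost. It rests on several assumptions: the cache discount is a flat 90% on the cached share of input, batch is a flat 50% off the total, the two discounts stack, and any extra charge for writing to the cache is ignored. Real provider rules differ in the details, so treat the function and its parameters as illustrative.

```python
def discounted_cost(input_tokens: int, output_tokens: int,
                    input_rate: float, output_rate: float,
                    cached_fraction: float = 0.0, cache_discount: float = 0.90,
                    batch: bool = False, batch_discount: float = 0.50) -> float:
    """Dollar cost of one request after caching and batch discounts.

    Rates are dollars per million tokens. cached_fraction is the share of
    input tokens served from the cache (0.0 to 1.0).
    """
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    # Cached input is billed at (1 - cache_discount) of the normal rate.
    input_cost = (fresh + cached * (1 - cache_discount)) * input_rate
    output_cost = output_tokens * output_rate
    total = (input_cost + output_cost) / 1_000_000
    if batch:
        total *= 1 - batch_discount  # assumed to apply to the whole request
    return total


# Hypothetical agent call on Claude Sonnet 4.6 ($3/$15): 50,000 input tokens,
# 80% of them a repeated system prompt and reference document, 1,000 output.
list_price = discounted_cost(50_000, 1_000, 3.00, 15.00)
with_cache = discounted_cost(50_000, 1_000, 3.00, 15.00, cached_fraction=0.8)
with_both = discounted_cost(50_000, 1_000, 3.00, 15.00, cached_fraction=0.8, batch=True)

print(f"list price:      ${list_price:.4f}")
print(f"with caching:    ${with_cache:.4f}")
print(f"caching + batch: ${with_both:.4f}")
```

In this example caching alone cuts the cost by about two-thirds, because the prompt is large and mostly repeated. A workload with short prompts and long answers would see much less benefit from caching.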

When per-token stops making sense

There’s a volume above which renting stops being the cheap option. If you’re running the same kind of task at high, steady volume, the math eventually favors buying hardware and running an open model for $0 per use. Our guide to frontier vs local models walks through that break-even, and the best local models for 2026 covers what you’d actually run.
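A back-of-the-envelope version of that break-even looks like this. Every figure here is hypothetical: the hardware price, the monthly API spend, and the optional running cost (power, maintenance) are placeholders to swap for your own numbers. The default running cost of zero matches the "$0 per use" framing above.

```python
import math


def break_even_months(hardware_cost: float, monthly_api_cost: float,
                      monthly_running_cost: float = 0.0) -> float:
    """Months until owned hardware has paid for itself versus API spend.

    Returns math.inf when running the hardware costs as much as the API,
    since it then never pays for itself.
    """
    monthly_savings = monthly_api_cost - monthly_running_cost
    if monthly_savings <= 0:
        return math.inf
    return hardware_cost / monthly_savings


# Hypothetical: a $4,000 machine replacing $400 a month of API spend.
print(break_even_months(4_000, 400))
# Same machine, assuming $150 a month in power and upkeep.
print(break_even_months(4_000, 400, monthly_running_cost=150))
```

The sketch ignores quality differences between a hosted frontier model and a local open one, which often matter more than the dollar figure.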

For a side-by-side of what each provider is best at, not just what it costs, see the Claude vs GPT vs Gemini cheat sheet.

Common questions

Why is output so much more expensive than input?

Generating text is more compute-heavy than reading it, so every provider charges more for output. On most models the output rate is 4–6x the input rate. If your bill is high, it's usually the model writing a lot, not you sending a lot.

What is a token?

Roughly three-quarters of an English word. A typical page of text is about 500–800 tokens. Pricing is quoted per million tokens, so a million tokens is around 750,000 words, a few long books.

What's the cheapest model that's still good?

Gemini 2.5 Flash at $0.30 in / $2.50 out per million tokens is the cheapest capable hosted option in 2026. Claude Haiku 4.5 ($1/$5) is the next step up. Or run an open model locally for $0 per use.

How do I lower my API bill without switching models?

Three levers: prompt caching (up to 90% off repeated input), batch mode (about 50% off when you don't need an instant reply), and trimming the output length. Together they often cut a real workload's cost by more than half.
