Most AI planning is built on a flawed assumption: that today's pricing is real. It isn't. What we're seeing right now is not a stable market for intelligence. It's a subsidized land grab. The companies that win won't be the ones optimizing for today's pricing. They'll be the ones already building for tomorrow's cost structure.
1. The Assumption Everyone's Operating On
Right now, the market for AI is sending a signal.
You can build AI-powered features for $0.001 per thousand tokens. You can subscribe to Claude, ChatGPT, or Gemini for $20 a month. You can deploy inference at scale without breaking your unit economics.
That signal looks like permission.
Permission to design systems around unlimited API access. Permission to call language models for every decision, every summarization, every classification in your product. Permission to treat inference like a free resource you simply tap whenever needed.
Companies are building their entire roadmaps around that signal.
But here's the problem: that signal is lying.
2. What's Actually Happening: A Subsidized Land Grab
This is not a stable market for intelligence.
It's a subsidized land grab.
The numbers make this obvious:
GPU supply is constrained. The H100 shortage didn't end; it just became accepted as baseline. Every major lab needs exponentially more compute just to train the next generation. Nvidia controls the supply. Lead times exist. Margins are real.
Inference costs are high. Running a trillion-parameter model through billions of requests has real computational expense. The margin economics on inference-as-a-service are thin without scale. If you're not at Anthropic, OpenAI, or Google's scale, you're not getting there.
And major labs are burning cash to lock in developers. The "$20/month unlimited AI" model? That's not pricing based on cost. That's a subsidy. Independent analysis of OpenAI's reported financials found the company spends approximately $1.35 for every $1 earned on API revenue — deliberately priced below cost to win developer mindshare. Anthropic's economics are similar: at heavy usage tiers, real compute costs have been estimated to reach multiples of the subscription price. Anthropic, OpenAI, Google, Meta — all willing to lose money on individual inference calls to lock in developers before competitors can establish beachheads.
The goal is clear: make it so cheap and convenient that you design systems that only work with their API.
That's brilliant strategy.
But it's not economics. It's a land grab.
And land grabs end.
3. The Hidden Mistake: Designing for Today
Here's where companies are quietly going wrong.
They are designing systems, workflows, and ROI models around temporary pricing conditions.
I see this constantly:
A SaaS company launches an "AI-powered" feature. It calls an LLM for every user action: summarization, categorization, content generation, decision support. Cheap. Reliable. Fast to ship.
The feature ships. Adoption grows. Cost per request stays at $0.0001. Everything works.
But the company has now locked itself into a specific economic model: low-friction, high-frequency inference.
If inference costs 3–5× more tomorrow, the unit economics get uncomfortable. If it costs 10–15× more, the feature becomes a liability.
And the architecture can't adapt. You can't pull AI out of a system that was built assuming it was free. You've baked it into the user experience, the data model, the support structure, everything.
As I wrote in "AI Unlocks Economics", the companies winning right now are the ones architecting for AI as a first-class cost lever. But most aren't. Most are just adding it where it's cheap.
That works until it doesn't.
4. What Real Costs Actually Look Like
Once the market stabilizes-and it will-AI pricing will converge toward actual inputs, not distribution subsidies.
That means:
- Compute - GPU cycles (H100 prices, power, cooling, depreciation, replacement)
- Memory bandwidth - RAM and VRAM constraints (the bottleneck nobody talks about)
- Latency guarantees - SLA penalties, dedicated capacity, priority routing
- Uptime and reliability overhead - redundancy, failover, monitoring, on-call costs
- Model complexity per task - smaller models for simple tasks, larger for complex ones
Not subscriptions. Not "flat access." Not fantasy bundles where you're subsidizing billion-token monthly budgets.
Real infrastructure pricing. Real operational costs. Real trade-offs.
What does that multiple look like? Nobody knows precisely. H100 GPU rental on dedicated infrastructure (Lambda Labs, CoreWeave) currently runs $2–4/hour per GPU. Frontier model inference throughput at those rates implies a real compute cost in the range of $15–50 per million output tokens for large models — versus current API pricing of $3–15/M for comparable capability. The honest range of "how much higher could sustainable pricing be" is somewhere between 2× and 15×, depending on your latency tier, batch efficiency, and model class. The precise multiple matters less than the direction.
When that happens, the current "efficiency" of shipping AI everywhere becomes what it actually is: massively wasteful.
5. The Real Optimization Problem
And this is where most companies get it backwards.
They ask: "How do we use more AI?"
The companies that survive cost normalization will ask: "How do we design systems that minimize unnecessary inference while maximizing output quality?"
Those are completely different problems.
The first leads to: bloated inference pipelines, redundant API calls, wasteful reranking, over-engineered summarization, and slow unit economics.
The second leads to: intelligent caching, hybrid architectures (AI for the 10% of decisions that matter, heuristics for the 90%), batch processing instead of real-time, smaller models for fast classification, larger models only for complex reasoning.
As I explored in "The Hidden Cost of AI-Generated Code", this applies even to code generation: the cost isn't the API call. The cost is the technical debt of maintaining globally fragile systems built by optimizing locally for speed. That's a cost that appears later, silently, in maintenance burden.
The same principle applies everywhere AI touches your system.
Low-frequency, high-value inference = defensible.
High-frequency, low-value inference = fragile.
Companies designing for the first architecture now will have pricing power tomorrow.
Companies designed for the second will have a restructuring problem.
6. Who Actually Wins When Pricing Normalizes
Here's the uncomfortable truth:
The companies that win won't be the ones optimizing for today's pricing.
They'll be the ones already building for tomorrow's cost structure.
That means:
- Treating inference as a constrained resource, not a commodity
- Building hybrid systems that use AI where it compounds (complex reasoning, pattern recognition, content generation) and leave heuristics everywhere else
- Investing in edge inference and smaller models to reduce API dependency
- Designing for cacheability - fewer novel problems, more cached solutions
- Engineering for interrogation - systems that can explain why they called an LLM, and prove it was worth it
As I wrote in "The 5 Files You Must Still Review", the standard that separates builders from generators is: can you defend every decision your system made? For AI, that standard is even more brutal. If you can't defend why you paid for that inference, you've wasted it.
Companies building these standards now-while inference is cheap-will have architectural advantage when inference is expensive.
The ones who don't? They'll face a choice:
- Restructure the entire system to reduce inference (slow, painful, risky)
- Keep the bloated system and accept lower margins (slow death)
- Kill the feature entirely (fast death)
None of those are good options.
The time to fix it is now.
7. What This Means for Your Team
If you're building AI-powered features today, ask yourself:
Will this architecture still make sense if inference costs 3–10× what it costs today?
If the answer is "no," you're not building for the future. You're building for a subsidy.
And subsidies always end.
Start now:
- Measure inference like a cost line item. Every API call, every token, every model selection should be tracked like infrastructure spend. Make it visible. Make it defendable.
- Build hybrid systems. AI for the decisions that matter. Rules and heuristics for everything else. Your margins will be better. Your latency will be faster. Your system will be simpler.
- Invest in smaller models. Haiku is often faster and cheaper than Opus, because constraints force clarity. The same principle applies to task-specific models. Spend engineering time now to reduce model size and complexity later.
- Design for cacheability. If you're solving the same problem twice, cache it. If you're calling an LLM for identical inputs, cache the output. If you're doing redundant inference, stop.
- Treat inference reduction like a feature. Make it a sprint goal. Make it a roadmap item. Make it a metric you track. Not because it's fashionable, but because it's going to be necessary.
The companies that do this while inference is cheap will have pricing power, architectural clarity, and customer defensibility when the market normalizes.
The ones that don't will have a rewrite in their future.
One point worth clarifying: this article is not an argument against using AI freely. The Haiku-First Engineer makes the case that using smaller, cheaper models extensively is good engineering discipline — the cost per call is low enough to experiment without hesitation. Both arguments converge on the same principle: use the cheapest model appropriate for the task, and design out unnecessary calls to expensive frontier models. The risk this article is flagging is not experimentation with Haiku at $0.001 per call. It is systems architected around $0.05-per-call frontier inference as if that price will hold indefinitely.
Conclusion
The window for designing for real costs is now. Inference subsidies will not last forever, and the companies that recognize this early will have a structural advantage. Nobody knows the exact multiple — it depends on model tier, latency requirements, and how the GPU supply constraint resolves. Build for a 3–10× scenario. That range is defensible. Designing for zero cost is not.
Sources:
- Investing.com (2026). "The AI Token Pricing Crisis Behind OpenAI and Anthropic's Revenue Race." Documents OpenAI's below-cost API pricing; estimated ~$1.35 spend per $1 earned on inference.
- Daniel Miessler (2024). "Inference Costs Are Not Sustainable." Analysis of true inference cost structure vs. published API pricing.
- CoreWeave / Lambda Labs (current market rates): H100 dedicated GPU rental $2–4/hour; basis for frontier model compute cost estimates in Section 4.
Working through the challenges in this post? I help engineering leaders and CTOs navigate complex technical decisions and scale high-performing teams. Schedule a consultation →