RAG Architecture Costs: What Nobody Tells You

Everyone talks about RAG accuracy. Nobody talks about RAG economics.

By Richard Ewing · February 28, 2026

The Hidden Cost of Retrieval

A typical RAG query hits five cost centers, each billed per query: embedding generation ($0.0001-0.001), vector DB query ($0.0001-0.01), reranking ($0.001-0.01), context assembly ($0.01-0.05), and LLM generation ($0.01-0.10).

Total: $0.02-0.17 per query. At 10K queries/day over a 30-day month, that is roughly $6K-51K/month.
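The arithmetic behind those totals can be reproduced with a short sketch. The dollar ranges are copied from the breakdown above; the 30-day month and the function names are assumptions made for illustration.

```python
# Per-query cost ranges in USD, copied from the five cost centers above.
COST_CENTERS = {
    "embedding generation": (0.0001, 0.001),
    "vector DB query": (0.0001, 0.01),
    "reranking": (0.001, 0.01),
    "context assembly": (0.01, 0.05),
    "LLM generation": (0.01, 0.10),
}


def per_query_cost(cost_centers=COST_CENTERS):
    """Sum the low ends and the high ends of every cost center."""
    low = sum(lo for lo, _ in cost_centers.values())
    high = sum(hi for _, hi in cost_centers.values())
    return low, high


def monthly_cost(queries_per_day, days_per_month=30, cost_centers=COST_CENTERS):
    """Scale the per-query range to a month (30 days is an assumption)."""
    low, high = per_query_cost(cost_centers)
    volume = queries_per_day * days_per_month
    return low * volume, high * volume


low, high = per_query_cost()
print(f"per query: ${low:.4f} to ${high:.4f}")

m_low, m_high = monthly_cost(10_000)
print(f"monthly at 10K queries/day: ${m_low:,.0f} to ${m_high:,.0f}")
```

Swapping in your own measured ranges per cost center shows quickly which line item dominates the bill. With the figures above, context assembly and LLM generation account for most of the total.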

The Caching Opportunity

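Semantic caching reduces LLM calls by 30-60%. There are three caching approaches to consider: exact match (return a stored answer when the query text is identical), semantic cache (return a stored answer when a new query is close enough in embedding space to a previous one), and prefix cache (reuse work already done for a shared prompt prefix).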
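A minimal sketch of the first two approaches, layered as an exact-match lookup followed by a semantic lookup. Everything here is hypothetical scaffolding: `TwoTierCache`, the 0.8 similarity threshold, and `toy_embed` (a bag-of-words stand-in for a real embedding model) are assumptions for illustration, and a production system would use a vector index rather than a linear scan. Prefix caching is not shown.

```python
import hashlib
import math
from collections import Counter


def toy_embed(text):
    """Toy stand-in for a real embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TwoTierCache:
    def __init__(self, embed=toy_embed, threshold=0.8):
        self.embed = embed
        self.threshold = threshold  # assumed value; tune on real traffic
        self.exact = {}             # sha256(normalized query) -> answer
        self.semantic = []          # list of (embedding, answer) pairs
        self.llm_calls = 0

    @staticmethod
    def _key(query):
        normalized = query.strip().lower()
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_generate(self, query, generate):
        # Tier 1: exact match on the normalized query text.
        key = self._key(query)
        if key in self.exact:
            return self.exact[key]
        # Tier 2: nearest previous query by cosine similarity.
        vec = self.embed(query)
        best = max(self.semantic, key=lambda item: cosine(vec, item[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        # Miss on both tiers: pay for generation and store the result.
        self.llm_calls += 1
        answer = generate(query)
        self.exact[key] = answer
        self.semantic.append((vec, answer))
        return answer


def fake_llm(query):
    """Placeholder for the expensive LLM call."""
    return f"answer to: {query}"


cache = TwoTierCache()
cache.get_or_generate("what is our refund policy", fake_llm)         # miss
cache.get_or_generate("What is our refund policy", fake_llm)         # exact hit
cache.get_or_generate("what is our refund policy please", fake_llm)  # semantic hit
cache.get_or_generate("how do I reset my password", fake_llm)        # miss
print(f"LLM calls: {cache.llm_calls} of 4 queries")
```

The threshold is the main design choice. Set it too low and users get answers to a different question; set it too high and the semantic tier behaves like an exact-match cache.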


Canonical Frameworks

Cost of Predictivity

The Cost of Predictivity measures the variable cost of AI accuracy. Unlike traditional software with near-zero marginal costs, AI features have significant variable costs that scale with both usage AND accuracy requirements. As AI correctness increases, cost scales exponentially, not linearly. This is the fundamental economic challenge of AI products.

Traditional software follows a simple cost model: high fixed development cost, near-zero marginal cost per user. Build the feature once, serve it to millions for pennies. AI products break this model entirely. Every AI query costs compute. Every inference requires GPU cycles. Every improvement in accuracy requires either more sophisticated prompts (more tokens = more cost), retrieval-augmented generation (vector DB queries + embedding generation), or fine-tuned models (massive training costs amortized over queries). The cost structure looks more like a manufacturing business than a software business.

The exponential curve is the killer. Moving from 80% accuracy to 90% accuracy might cost 2x. Moving from 90% to 95% might cost 5x. Moving from 95% to 99% often costs 10-20x. This is because the easy cases are solved by the base model, and each additional percentage point of accuracy requires increasingly sophisticated (and expensive) techniques to handle edge cases.

This creates what Richard Ewing calls the AI Margin Collapse Point: the usage volume at which AI feature costs exceed the revenue they generate. Many AI features that work beautifully in prototype (low volume, don't need high accuracy) become economically devastating in production (high volume, users demand high accuracy).

The AI Unit Economics Benchmark (AUEB) calculator at richardewing.io/tools/aueb helps companies calculate their Cost of Predictivity and identify their specific margin collapse point before it hits their P&L.
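One way to make the margin collapse point concrete is a break-even sketch. It assumes the simplest possible model, which is not taken from the article or the AUEB calculator: flat subscription revenue per user per month and a constant variable cost per query. The $20 price and the function names are hypothetical; the per-query costs reuse the $0.02-0.17 range quoted earlier.

```python
def collapse_point(revenue_per_user_month, cost_per_query):
    """Queries per user per month at which variable cost equals revenue.

    Assumes flat revenue per user and a constant cost per query.
    """
    if cost_per_query <= 0:
        raise ValueError("cost_per_query must be positive")
    return revenue_per_user_month / cost_per_query


def monthly_margin(revenue_per_user_month, cost_per_query, queries_per_user_month):
    """Revenue minus variable AI cost for one user in one month."""
    return revenue_per_user_month - cost_per_query * queries_per_user_month


PRICE = 20.0  # hypothetical subscription price in USD per user per month

for cost in (0.02, 0.17):  # per-query range from the cost breakdown
    limit = collapse_point(PRICE, cost)
    print(f"at ${cost:.2f}/query, margin collapses above {limit:.0f} queries/user/month")

# A heavy user at the expensive end of the range is served at a loss.
heavy_user_margin = monthly_margin(PRICE, 0.17, 200)
print(f"margin for a 200-query user at $0.17/query: ${heavy_user_margin:.2f}")
```

Under these assumptions, the same feature that is comfortably profitable at $0.02 per query turns negative for any user above roughly 118 queries a month at $0.17 per query, which is the prototype-to-production gap the definition describes.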


Feature Bloat Calculus

Feature Bloat Calculus is the economic formula for determining when a feature's maintenance cost exceeds its value contribution. It quantifies the hidden tax of feature accumulation: the compounding cost that makes every new feature harder and more expensive to build. The formula considers three cost components:

1. **Direct Maintenance Cost**: The engineering hours spent maintaining the feature (bug fixes, compatibility updates, dependency management, test maintenance). This is typically 2-5% of original development cost per quarter.
2. **Opportunity Cost**: What else could those maintenance engineers be building? If 3 engineers spend 20% of their time maintaining a low-value feature, that's 0.6 FTE that could be building high-value new capabilities.
3. **Complexity Tax**: This is the compounding factor that most organizations miss entirely. Every feature in the codebase makes every other feature harder to maintain and every new feature harder to build. Adding feature #101 to a system doesn't just add feature #101's maintenance cost; it increases the maintenance cost of features #1-100.

The Complexity Tax follows a roughly quadratic curve. A system with 50 features has 1,225 potential pairwise interaction points (n × (n-1) / 2). A system with 100 features has 4,950. Doubling features doesn't double complexity; it roughly quadruples it.

Feature Bloat Calculus quantifies this by comparing a feature's total cost (direct + opportunity + complexity) against its value contribution (revenue attribution, user engagement, strategic importance). When total cost exceeds value, the feature has "negative carry": it's costing more to keep than it's worth. Features with negative carry should be evaluated through the Kill Switch Protocol for potential deprecation. The highest-negative-carry features should be killed first, as they free up the most capacity per removal.
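The interaction-point count and the negative-carry ranking can be sketched in a few lines. The `Feature` record, its field names, and the sample dollar figures are hypothetical; only the n × (n-1) / 2 formula and the "cost exceeds value" rule come from the definition above.

```python
from dataclasses import dataclass


def interaction_points(n_features):
    """Potential pairwise interactions among n features: n * (n - 1) / 2."""
    return n_features * (n_features - 1) // 2


@dataclass
class Feature:
    name: str
    value: float        # value contribution per quarter (USD)
    direct: float       # direct maintenance cost per quarter
    opportunity: float  # opportunity cost per quarter
    complexity: float   # share of complexity tax per quarter

    @property
    def carry(self):
        """Value minus total cost; negative means it costs more than it is worth."""
        return self.value - (self.direct + self.opportunity + self.complexity)


def kill_candidates(features):
    """Features with negative carry, most negative first."""
    negative = [f for f in features if f.carry < 0]
    return sorted(negative, key=lambda f: f.carry)


print(interaction_points(50))   # 1225
print(interaction_points(100))  # 4950

# Hypothetical portfolio, figures invented for illustration only.
portfolio = [
    Feature("export-to-pdf", value=5_000, direct=4_000, opportunity=6_000, complexity=3_000),
    Feature("sso-login", value=60_000, direct=8_000, opportunity=5_000, complexity=4_000),
    Feature("legacy-dashboard", value=2_000, direct=9_000, opportunity=12_000, complexity=6_000),
]

for feature in kill_candidates(portfolio):
    print(f"{feature.name}: carry {feature.carry:,.0f}")
```

Sorting by carry puts the feature that frees the most capacity at the top of the list, matching the "kill the highest negative carry first" rule.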


