Track 2 — AI Product Economics

Module 2.4: RAG Architecture Economics

Embedding costs, vector database pricing, the cost impact of chunking strategy, and production optimizations (caching, hybrid search, reranking) that cut RAG costs by 50-70%.

3 Lessons · ~55 min · Intermediate-Advanced

Lesson 1: RAG Cost Components

RAG (Retrieval-Augmented Generation) adds retrieval costs on top of generation costs. Understanding each cost component prevents budget surprises when your knowledge base scales.

Embedding Cost

Converting documents into vectors costs roughly $0.02-0.10 per million tokens (OpenAI's text-embedding-3-small at the low end, ada-002 at the high end). It is a one-time cost per document, but the entire corpus must be re-embedded whenever you switch embedding models.

100K documents × 1K tokens each = 100M tokens = $2-10 for full embedding
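The arithmetic above can be sketched as a small helper. The prices are illustrative assumptions taken from the ranges in this lesson, not current vendor quotes:

```python
# Sketch: one-time embedding cost for a document corpus.
# Prices are illustrative assumptions, not live vendor pricing.

def embedding_cost(num_docs: int, tokens_per_doc: int,
                   price_per_million: float) -> float:
    """Return the one-time embedding cost in dollars."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million

# 100K documents x 1K tokens each = 100M tokens
low = embedding_cost(100_000, 1_000, 0.02)   # cheap embedding model
high = embedding_cost(100_000, 1_000, 0.10)  # ada-002-class pricing
print(f"${low:.2f} - ${high:.2f}")  # → $2.00 - $10.00
```

Note how cheap the one-time pass is compared to the recurring costs below; the real embedding budget risk is forced re-embedding after a model change.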
Vector Storage

Storing embeddings in a vector database. Pinecone: $0.096/hr for 1M vectors. Weaviate Cloud: $25/month for 1M vectors. Self-hosted: infrastructure costs.

At 10M vectors: $25-100/month storage. At 100M: $250-1000/month.
Retrieval Cost

Each query searches the vector database for relevant chunks. Cost per query: $0.001-0.01 depending on database and index size.

At 100K queries/day: $100-1000/month in retrieval costs
Generation with Context

Retrieved chunks are injected into the LLM prompt, increasing token count. 5 chunks × 500 tokens = 2,500 additional input tokens per query.

Context injection increases per-query LLM cost by 40-100%
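The four recurring components above can be combined into a simple monthly cost model. All inputs are illustrative assumptions drawn from the ranges in this lesson (the $0.50/1M input-token LLM price is an assumed figure, not a quote):

```python
# Sketch: total monthly RAG cost model. All prices are illustrative
# assumptions from this lesson's ranges, not vendor quotes.

def monthly_rag_cost(queries_per_day: int,
                     retrieval_cost_per_query: float,
                     storage_per_month: float,
                     chunks_per_query: int,
                     tokens_per_chunk: int,
                     llm_price_per_million_input: float) -> float:
    """Return monthly cost: retrieval + storage + injected context."""
    queries = queries_per_day * 30
    retrieval = queries * retrieval_cost_per_query
    context_tokens = queries * chunks_per_query * tokens_per_chunk
    context_generation = context_tokens / 1_000_000 * llm_price_per_million_input
    return retrieval + storage_per_month + context_generation

# 100K queries/day, $0.001/query retrieval, $100/mo storage,
# 5 chunks x 500 tokens injected, assumed $0.50 per 1M input tokens
total = monthly_rag_cost(100_000, 0.001, 100.0, 5, 500, 0.50)
print(f"${total:,.0f}/month")  # → $6,850/month
```

At this scale the injected context dominates: the 2,500 extra input tokens per query cost more than retrieval and storage combined, which is why the optimizations in Lesson 3 target context size first.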
📝 Exercise

Map your RAG pipeline's cost components: embedding (one-time), storage (monthly), retrieval (per-query), and augmented generation (per-query). Calculate total monthly cost.


Lesson 2: Chunking Strategy Economics

How you chunk documents directly impacts both retrieval quality and cost. The wrong chunk size wastes tokens on irrelevant context; the right one maximizes relevance per token.

Small Chunks (100-200 tokens)

More precise retrieval, but requires more chunks per query for full context. Higher retrieval cost, lower generation cost.

Best for: FAQ-style queries, precise factual lookups
Medium Chunks (500-1000 tokens)

Balance of context and precision. Most production RAG systems use this range. Good for paragraph-level retrieval.

Best for: general Q&A, documentation search, customer support
Large Chunks (1000-2000 tokens)

More context per chunk, but higher generation cost and risk of irrelevant content diluting the response. Fewer retrieval operations needed.

Best for: summarization, document analysis, legal/medical contexts
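The token economics of the three strategies above can be compared directly. The chunk sizes and top-k values below are illustrative configurations, chosen to reflect the tradeoff that smaller chunks need a larger k for full context:

```python
# Sketch: context tokens per query for three illustrative
# chunk-size strategies. Chunk sizes and k values are assumptions.

def context_tokens(chunks_per_query: int, tokens_per_chunk: int) -> int:
    """Total tokens injected into the prompt per query."""
    return chunks_per_query * tokens_per_chunk

strategies = {
    "small  (150 tok, k=10)": context_tokens(10, 150),
    "medium (750 tok, k=5)":  context_tokens(5, 750),
    "large  (1500 tok, k=3)": context_tokens(3, 1500),
}
for name, tokens in strategies.items():
    print(f"{name}: {tokens} context tokens/query")
```

Even though small chunks require more retrieval operations per query, they can inject far fewer total tokens, which is why the "higher retrieval cost, lower generation cost" tradeoff holds.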
📝 Exercise

Test 3 chunk sizes on your knowledge base. Measure: retrieval relevance (precision@5), total tokens consumed, and response quality. Find your optimal balance.


Lesson 3: Production RAG Optimization

Production RAG systems need caching, hybrid search, and reranking to control costs while maintaining quality. These optimizations can reduce RAG costs by 50-70%.

Semantic Caching

Cache responses for semantically similar queries. If someone asks "how do I reset my password?" and a similar query was answered 5 minutes ago, serve the cached response.

A 30-60% cache hit rate, typical for support/FAQ workloads, translates into roughly 30-60% savings on retrieval and generation
Hybrid Search

Combine vector search (semantic) with keyword search (BM25). Hybrid retrieval is 15-30% more accurate, meaning fewer irrelevant chunks in context = lower token costs.

Hybrid search: 15-30% relevance improvement with minimal additional cost
Reranking

Retrieve 20 chunks, rerank with a lightweight model, use only top 5. Without reranking: 20 chunks × 500 tokens = 10K tokens. With reranking: 5 chunks × 500 tokens = 2,500 tokens.

Reranking reduces context tokens 50-75% while improving relevance
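The retrieve-then-rerank pattern can be sketched as follows. The `score_relevance()` function is a toy word-overlap stand-in for a real lightweight cross-encoder, used only so the example is self-contained:

```python
# Sketch of retrieve-then-rerank: over-retrieve 20 candidates, score
# with a (stand-in) lightweight reranker, keep only the top 5.

def score_relevance(query: str, chunk: str) -> float:
    # Toy word-overlap score (replace with a real cross-encoder).
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / max(len(q_words), 1)

def rerank(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    ranked = sorted(chunks, key=lambda c: score_relevance(query, c),
                    reverse=True)
    return ranked[:top_k]

# 20 retrieved candidates (10K tokens); only 5 reach the prompt (2.5K).
candidates = 20 * ["filler text about unrelated topics"]
candidates[7] = "reset your password from the security settings page"
top = rerank("how to reset password", candidates, top_k=5)
print(top[0])
```

The reranker call itself costs far less than the 7,500 input tokens it removes from every generation, which is where the 50-75% context savings comes from.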
📝 Exercise

Implement one optimization (caching, hybrid search, or reranking) in your RAG pipeline. Measure cost reduction and quality impact over 1 week.