Module 2.4: RAG Architecture Economics
Embedding costs, vector database pricing, chunking strategy impact on costs, and production optimizations (caching, hybrid search, reranking) that reduce RAG costs by 50-70%.
Lesson 1: RAG Cost Components
RAG (Retrieval-Augmented Generation) adds retrieval costs on top of generation costs. Understanding each cost component prevents budget surprises when your knowledge base scales.
Embedding: converting documents into vectors costs roughly $0.02-0.13 per million tokens depending on the model (OpenAI's text-embedding-3-small is $0.02/M; ada-002 is $0.10/M). It's a one-time cost per document, but re-embedding the entire corpus is needed when you change embedding models.
Vector storage: keeping embeddings in a vector database is a recurring cost. Pinecone: roughly $0.096/hr per pod for 1M vectors. Weaviate Cloud: $25/month for 1M vectors. Self-hosted: you pay infrastructure costs instead.
Retrieval: each query searches the vector database for relevant chunks, at roughly $0.001-0.01 per query depending on the database and index size.
Augmented generation: retrieved chunks are injected into the LLM prompt, increasing input token count. For example, 5 chunks × 500 tokens = 2,500 additional input tokens per query.
Map your RAG pipeline's cost components: embedding (one-time), storage (monthly), retrieval (per-query), and augmented generation (per-query). Calculate total monthly cost.
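The four components above can be sketched as a simple monthly cost model. All prices, corpus sizes, and query volumes below are illustrative placeholders, not quotes from any provider; substitute your own rates.

```python
# Sketch of a monthly RAG cost model covering the four components:
# embedding (one-time), storage (monthly), retrieval (per-query),
# and augmented generation (per-query). All inputs are assumptions.

def monthly_rag_cost(
    corpus_tokens: int,            # total tokens embedded (one-time)
    embed_price_per_m: float,      # $ per 1M tokens embedded
    storage_per_month: float,      # flat vector-DB fee, $/month
    queries_per_month: int,
    retrieval_per_query: float,    # $ per vector-search query
    chunks_per_query: int,
    tokens_per_chunk: int,
    llm_input_price_per_m: float,  # $ per 1M LLM input tokens
) -> dict:
    embedding = corpus_tokens / 1_000_000 * embed_price_per_m
    retrieval = queries_per_month * retrieval_per_query
    context_tokens = queries_per_month * chunks_per_query * tokens_per_chunk
    context_input = context_tokens / 1_000_000 * llm_input_price_per_m
    return {
        "embedding_one_time": round(embedding, 2),
        "storage": round(storage_per_month, 2),
        "retrieval": round(retrieval, 2),
        "context_input": round(context_input, 2),
        "monthly_recurring": round(storage_per_month + retrieval + context_input, 2),
    }

# Hypothetical pipeline: 50M-token corpus, 100K queries/month,
# 5 chunks of 500 tokens injected per query.
costs = monthly_rag_cost(
    corpus_tokens=50_000_000, embed_price_per_m=0.10,
    storage_per_month=25.0, queries_per_month=100_000,
    retrieval_per_query=0.001, chunks_per_query=5,
    tokens_per_chunk=500, llm_input_price_per_m=0.15,
)
```

Note that with these assumed numbers the one-time embedding cost ($5) is small next to the recurring per-query costs; in most pipelines, retrieval and context tokens dominate at scale.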
Lesson 2: Chunking Strategy Economics
How you chunk documents directly impacts both retrieval quality AND cost. Wrong chunk size wastes tokens on irrelevant context. Right chunk size maximizes relevance per token.
Small chunks: more precise retrieval, but more chunks are needed per query for full context. Higher retrieval cost, lower generation cost.
Medium chunks: a balance of context and precision. Most production RAG systems land in this middle ground. Good for paragraph-level retrieval.
Large chunks: more context per chunk, but higher generation cost and a greater risk of irrelevant content diluting the response. Fewer retrieval operations needed.
Test 3 chunk sizes on your knowledge base. Measure: retrieval relevance (precision@5), total tokens consumed, and response quality. Find your optimal balance.
Lesson 3: Production RAG Optimization
Production RAG systems need caching, hybrid search, and reranking to control costs while maintaining quality. These optimizations can reduce RAG costs by 50-70%.
Semantic caching: cache responses for semantically similar queries. If someone asks "how do I reset my password?" and a similar query was answered 5 minutes ago, serve the cached response instead of paying for retrieval and generation again.
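A minimal semantic-cache sketch follows. In production you would embed queries with a real embedding model and a vector index; here toy_embed is a bag-of-words stand-in so the example runs without any API, and the 0.8 similarity threshold is an assumption you would tune.

```python
import math

def toy_embed(text: str) -> dict:
    # Bag-of-words stand-in for a real embedding model (assumption).
    vec: dict = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (query_embedding, cached_response)

    def get(self, query: str):
        q = toy_embed(query)
        for emb, response in self.entries:
            if cosine(q, emb) >= self.threshold:
                return response  # cache hit: skip retrieval + generation
        return None  # cache miss: run the full RAG pipeline

    def put(self, query: str, response: str):
        self.entries.append((toy_embed(query), response))

cache = SemanticCache()
cache.put("how do I reset my password", "Use the 'Forgot password' link.")
hit = cache.get("how do I reset my password please")   # similar -> hit
miss = cache.get("what are your business hours")       # unrelated -> miss
```

Every cache hit avoids one retrieval charge and one fully-augmented LLM call, which is why caching alone can account for a large share of the 50-70% savings on workloads with repetitive queries.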
Hybrid search: combine vector search (semantic) with keyword search (BM25). Hybrid retrieval is 15-30% more accurate, and fewer irrelevant chunks in context means lower token costs.
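One common way to merge the two result lists is Reciprocal Rank Fusion (RRF), sketched below. The ranked lists are toy data standing in for real vector-search and BM25 output, and k=60 is the conventional RRF constant.

```python
# Reciprocal Rank Fusion: merge multiple ranked result lists by summing
# 1 / (k + rank) for each document across lists. Documents ranked well
# by BOTH vector search and BM25 rise to the top.

def rrf(rankings: list, k: int = 60) -> list:
    scores: dict = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]  # semantic-similarity order (toy)
bm25_hits = ["doc_b", "doc_d", "doc_a"]    # keyword-match order (toy)

fused = rrf([vector_hits, bm25_hits])
```

Here doc_b wins because it appears high in both lists, while doc_c (vector-only, last place) sinks: that is the mechanism by which hybrid retrieval pushes irrelevant chunks out of the context window.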
Reranking: retrieve 20 chunks, rerank them with a lightweight model, and pass only the top 5 to the LLM. Without reranking: 20 chunks × 500 tokens = 10,000 context tokens. With reranking: 5 chunks × 500 tokens = 2,500 tokens, a 75% reduction in context tokens.
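The retrieve-then-rerank pattern can be sketched as follows. The score_relevance function is a keyword-overlap stand-in for a real lightweight cross-encoder, and the chunk texts are invented for illustration.

```python
TOKENS_PER_CHUNK = 500  # assumed average chunk size

def score_relevance(query: str, chunk: str) -> float:
    # Stand-in for a lightweight reranker: fraction of query words that
    # appear in the chunk. A real pipeline would call a small cross-encoder.
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q) if q else 0.0

def rerank(query: str, chunks: list, top_k: int = 5) -> list:
    # Keep only the top_k most relevant of the widely-retrieved chunks.
    ranked = sorted(chunks, key=lambda c: score_relevance(query, c), reverse=True)
    return ranked[:top_k]

# 20 retrieved chunks (toy data): mostly off-topic, two relevant.
retrieved = [f"chunk {i} about billing and invoices" for i in range(18)]
retrieved += [
    "to reset your password use the account settings page",
    "password reset links expire after one hour",
]

kept = rerank("how do I reset my password", retrieved, top_k=5)
context_tokens = len(kept) * TOKENS_PER_CHUNK  # vs 20 * 500 = 10,000 without
```

The reranker itself costs a small per-query fee, but cutting context tokens from 10,000 to 2,500 on every generation call usually dwarfs it.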
Implement one optimization (caching, hybrid search, or reranking) in your RAG pipeline. Measure cost reduction and quality impact over 1 week.