Module 2.2: Model Selection & Optimization
The model tier spectrum, routing architectures, fine-tuning ROI, and distillation. Match model capability to task complexity to cut costs by 60-80%.
🎯 What You'll Learn
- ✓ The four model tiers and when to use each (frontier, mid-tier, open-source, specialized)
- ✓ How to architect model routing for 60-80% cost reduction
- ✓ How to calculate fine-tuning ROI and distillation break-even
Lesson 1: The Model Tier Spectrum
Not every AI task needs a frontier model. Understanding model tiers — and matching task complexity to model capability — is the single largest cost optimization lever.
Frontier: GPT-4o, Claude 3 Opus, Gemini Ultra. Best reasoning, highest cost. Use for: complex analysis, multi-step reasoning, creative generation.
Mid-tier: GPT-4o-mini, Claude 3 Haiku, Gemini Flash. 85-90% of frontier quality at 10-20x lower cost. Use for: most production features.
Open-source: Llama 3, Mistral, Phi-3. Self-hosted with zero per-token cost (but you carry the infrastructure cost). Use for: high-volume, latency-sensitive, or privacy-critical tasks.
Specialized: Fine-tuned models for specific domains. Better quality and lower cost on narrow tasks; a fine-tuned Llama can outperform GPT-4 on your specific use case.
Categorize your AI queries into complexity tiers (simple/medium/complex). What percentage could be handled by a mid-tier model instead of a frontier model?
Lesson 2: Model Routing Architecture
Model routing directs each query to the cheapest model capable of handling it. A router that sends 70-80% of queries to a fast, cheap model can cut inference costs by 60-80%.
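To see where those savings come from, consider a blended-cost calculation. The per-token prices below are illustrative assumptions, not real provider rates:

```python
# Illustrative per-1M-token prices (assumptions, not real provider rates)
FRONTIER_PRICE = 10.00   # $/1M tokens for a frontier model
CHEAP_PRICE = 0.50       # $/1M tokens for a mid-tier model (20x cheaper)

def blended_cost(cheap_share: float) -> float:
    """Average cost per 1M tokens when `cheap_share` of traffic goes cheap."""
    return cheap_share * CHEAP_PRICE + (1 - cheap_share) * FRONTIER_PRICE

baseline = FRONTIER_PRICE  # everything on the frontier model
for share in (0.7, 0.8, 0.9):
    cost = blended_cost(share)
    savings = 1 - cost / baseline
    print(f"{share:.0%} routed cheap -> ${cost:.2f}/1M tokens ({savings:.1%} savings)")
```

With a 20x price gap, routing 70% of traffic cheap saves about 66%, and 90% saves about 85%, which is why the achievable range depends heavily on how much traffic the cheap tier can absorb.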
Use a lightweight classifier (or simple heuristics like query length, keyword detection) to estimate query complexity. Route simple queries to cheap models, complex to expensive.
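A minimal heuristic router might look like the sketch below. The tier names, length cutoffs, and keyword list are placeholders for whatever tiers and signals you actually run:

```python
# Hypothetical keyword signals for reasoning-heavy queries (tune for your domain)
COMPLEX_KEYWORDS = {"analyze", "compare", "explain why", "step by step", "strategy"}

def route(query: str) -> str:
    """Pick a model tier from cheap heuristics: query length and keywords.
    Returns a tier name; map it to a concrete model in your serving layer."""
    text = query.lower()
    if len(text) > 500 or any(kw in text for kw in COMPLEX_KEYWORDS):
        return "frontier"   # long or reasoning-heavy -> expensive model
    if len(text) > 100:
        return "mid-tier"
    return "cheap"          # short, simple lookups

print(route("What's your refund policy?"))                       # cheap
print(route("Compare these two contracts and explain why ..."))  # frontier
```

Heuristics like these are free to evaluate; a trained classifier can replace them once you have labeled routing data.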
Try the cheapest model first; if its confidence falls below a threshold, escalate to the next tier. Most queries resolve at the cheapest tier.
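The cascade pattern can be sketched as a loop over tiers ordered cheapest-first. The `call_model` callables and confidence scores here are stubs standing in for your actual inference calls and self-evaluation logic (logprobs, a verifier model, or heuristics):

```python
def cascade(query: str, tiers, threshold: float = 0.8):
    """Escalate through (name, call_model) tiers until confidence clears
    the threshold; the last tier's answer is final regardless."""
    answer, conf = None, 0.0
    for name, call_model in tiers:
        answer, conf = call_model(query)
        if conf >= threshold:          # good enough -> stop escalating
            return name, answer
    return tiers[-1][0], answer

# Stub models for illustration: the cheap model is unsure, the frontier is not.
tiers = [
    ("cheap",    lambda q: ("short answer", 0.6)),
    ("frontier", lambda q: ("thorough answer", 0.95)),
]
print(cascade("tricky question", tiers))  # ('frontier', 'thorough answer')
```

The trade-off is latency: escalated queries pay for two inference calls, so the cascade only wins when most traffic stops at the first tier.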
Track accuracy/satisfaction by model tier. If the cheap model produces unacceptable results for certain query types, adjust routing rules.
Design a model routing strategy for your AI feature. Define 3 tiers, set routing rules, and estimate the cost savings vs. your current approach.
Lesson 3: Fine-Tuning Economics
Fine-tuning creates a specialized model that outperforms general models on your specific task, often at lower inference cost. But the ROI depends on volume.
Fine-tuning cost = training data preparation ($5K-$20K) + compute ($1K-$10K per training run) + iteration (3-5 runs typically). Total: $10K-$50K.
Fine-tuned models can be 20-40% more accurate on domain-specific tasks. This reduces retry rates (fewer wasted tokens) and improves user satisfaction.
Train a small model on the outputs of a large model. The small model learns to mimic the large model at 10-100x lower inference cost, approaching GPT-4 quality from a GPT-3.5-sized model.
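Distillation starts by collecting teacher outputs as training data for the student. In this sketch, `teacher` is a stub standing in for a frontier-model API call, and the JSONL prompt/completion format is one common shape that fine-tuning pipelines accept:

```python
import json

def teacher(prompt: str) -> str:
    """Stub for a frontier-model call (replace with your actual API client)."""
    return f"high-quality answer to: {prompt}"

def build_distillation_set(prompts, path="distill.jsonl"):
    """Write (prompt, teacher completion) pairs as JSONL training data
    for fine-tuning a smaller student model."""
    with open(path, "w") as f:
        for p in prompts:
            f.write(json.dumps({"prompt": p, "completion": teacher(p)}) + "\n")
    return path

build_distillation_set(["Summarize this ticket", "Classify this email"])
```

In practice you would sample prompts from real production traffic so the student learns the distribution it will actually serve.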
Calculate fine-tuning ROI: monthly savings = current monthly inference cost - projected fine-tuned model cost; payback (months) = fine-tuning investment / monthly savings.
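Plugging illustrative numbers into the formula (all dollar figures below are assumptions, chosen from the cost ranges above):

```python
def payback_months(current_monthly_cost, finetuned_monthly_cost, investment):
    """Months until the fine-tuning investment is recovered from savings."""
    monthly_savings = current_monthly_cost - finetuned_monthly_cost
    if monthly_savings <= 0:
        return float("inf")  # fine-tuning never pays back
    return investment / monthly_savings

# Example: $12K/mo on frontier inference, $3K/mo projected after fine-tuning,
# $27K one-time fine-tuning cost (mid-range of the $10K-$50K estimate above).
print(payback_months(12_000, 3_000, 27_000))  # 3.0 months
```

A payback under 6 months is usually an easy yes; a payback over a year means volume is too low to justify fine-tuning yet.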