Cost/Performance Analysis
Goal: To answer "Is it cheaper to chain 3 calls to a small model or 1 call to a frontier model?"
The "Cascade" Hypothesis
Hypothesis: A cheap model (4o-mini) can handle 80% of requests. A router should only escalate to the expensive model (4o/Sonnet) when necessary.
The Experiment
We ran 1000 tasks through a "Router" architecture versus a "Frontier Only" architecture.
Scenario A: Frontier Only (GPT-4o)
- Tasks: 1000
- Cost per Task: $0.02
- Total Cost: $20.00
- Success Rate: 98%
Scenario B: Router (4o-mini -> 4o)
- Router (Mini): $0.0001
- Worker (Mini) (80% of time): $0.001
- Worker (4o) (20% of time): $0.02
- Total Cost per avg task: $0.005
- Total Cost: $5.00
- Success Rate: 97%
Conclusion
The Router Architecture saves 75% of costs with a negligible (1%) drop in success rate. For any high-volume application, implementing a "Router" or "Triage" step using a small model is the highest ROI optimization you can make.
Latency vs. Throughput
For users, Time to First Token (TTFT) is the most important metric.
- Groq / Cerebras: These hardware providers offer nearly instant TTFT (<20ms).
- Standard APIs: often 300-500ms TTFT.
Recommendation: For real-time voice or critical UI (like autcomplete), you must use a fast inference provider or a small model. For background "Agent" tasks, latency doesn't matter—optimise for Intelligence (Frontier Models).