Cost/Performance Analysis

Goal: Answer the question "Is it cheaper to chain three calls to a small model or to make one call to a frontier model?"

The "Cascade" Hypothesis

Hypothesis: A cheap model (4o-mini) can handle 80% of requests. A router should only escalate to the expensive model (4o/Sonnet) when necessary.

The Experiment

We ran 1000 tasks through a "Router" architecture versus a "Frontier Only" architecture.

Scenario A: Frontier Only (GPT-4o)

  • Tasks: 1000
  • Cost per Task: $0.02
  • Total Cost: $20.00
  • Success Rate: 98%

Scenario B: Router (4o-mini -> 4o)

  • Router call (4o-mini, every task): $0.0001
  • Worker (4o-mini, 80% of tasks): $0.001
  • Worker (4o, 20% of tasks): $0.02
  • Average Cost per Task: $0.0001 + 0.8 × $0.001 + 0.2 × $0.02 ≈ $0.005
  • Total Cost: ≈$5.00
  • Success Rate: 97%
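The Scenario B arithmetic can be checked in a few lines. The per-call prices are the article's illustrative figures, not live API pricing:

```python
# Reproduce the Scenario B cost arithmetic from the bullets above.
N_TASKS = 1000

ROUTER_COST = 0.0001          # 4o-mini triage call, paid on every task
MINI_WORKER_COST = 0.001      # 4o-mini answers the task (80% of requests)
FRONTIER_WORKER_COST = 0.02   # 4o answers the task (20% of requests)

avg_cost = ROUTER_COST + 0.8 * MINI_WORKER_COST + 0.2 * FRONTIER_WORKER_COST
total = avg_cost * N_TASKS

print(f"Average cost per task: ${avg_cost:.4f}")  # $0.0049, i.e. ~$0.005
print(f"Total cost for {N_TASKS} tasks: ${total:.2f}")  # $4.90, i.e. ~$5.00
```

Note the exact total is $4.90; the article rounds to $5.00, which still gives the 75% saving versus $20.00.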

Conclusion

The Router architecture cuts costs by 75% ($20.00 → $5.00) with only a one-percentage-point drop in success rate (98% → 97%). For any high-volume application, implementing a "Router" or "Triage" step with a small model is the highest-ROI optimization you can make.
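The Router pattern can be sketched in a few lines. This is a minimal illustration, not a production implementation: the `chat` callable, the triage prompt, and the length-based stub are all hypothetical placeholders standing in for a real API client.

```python
# Minimal sketch of the Router ("Triage") architecture described above.
CHEAP_MODEL = "gpt-4o-mini"
FRONTIER_MODEL = "gpt-4o"

def route(task: str, chat) -> str:
    """Triage with the cheap model; escalate to the frontier model if needed."""
    # Step 1: a tiny, cheap classification call (the $0.0001 router step).
    verdict = chat(CHEAP_MODEL,
                   "Answer EASY or HARD only: can a small model solve this?\n" + task)
    # Step 2: the cheap model handles ~80% of tasks; the rest escalate.
    model = CHEAP_MODEL if "EASY" in verdict.upper() else FRONTIER_MODEL
    return chat(model, task)

# Offline stub so the sketch runs without an API: short tasks count as easy.
def fake_chat(model: str, prompt: str) -> str:
    if prompt.startswith("Answer EASY or HARD"):
        return "EASY" if len(prompt) < 120 else "HARD"
    return f"[{model}] answer"

print(route("What is 2+2?", fake_chat))  # handled by gpt-4o-mini
```

In practice the triage signal can also come from task metadata or a trained classifier rather than a model call, which removes the router cost entirely.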

Latency vs. Throughput

For interactive users, the most important latency metric is Time to First Token (TTFT): how long before the first output appears.

  • Groq / Cerebras: These hardware providers offer nearly instant TTFT (<20ms).
  • Standard APIs: often 300–500 ms TTFT.

Recommendation: For real-time voice or latency-critical UI (like autocomplete), you must use a fast inference provider or a small model. For background "Agent" tasks, latency matters far less; optimise for intelligence (Frontier Models).
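TTFT can be measured directly by timing how long the first chunk of a streaming response takes to arrive. The sketch below uses a stub generator in place of a real provider stream:

```python
import time

def time_to_first_token(stream) -> float:
    """Seconds from the start of consumption until the first streamed token."""
    start = time.perf_counter()
    next(iter(stream))  # block until the first token arrives
    return time.perf_counter() - start

# Stub standing in for a provider's streaming response:
# the sleep simulates network + prefill latency before token one.
def fake_stream(delay_s: float):
    time.sleep(delay_s)
    yield "Hello"
    yield " world"

ttft = time_to_first_token(fake_stream(0.05))
print(f"TTFT: {ttft * 1000:.0f} ms")
```

Against a real API you would pass the provider's streaming iterator instead of `fake_stream`; averaging over many requests smooths out network jitter.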
