We ran 1,000 tasks across reasoning, code, and summarization to compare price-performance. Results vary by task type and prompt length.
Setup
Datasets: GSM8K-style math, code generation, and long-context summarization. Each task was run three times at temperature 0.2. Costs were normalized to USD per successful solution.
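The normalization step can be sketched as a small function. This is an illustrative sketch, not the study's actual tooling; the function name and the example figures are hypothetical.

```python
# Hypothetical normalization: USD per successful solution.
# All names and numbers below are illustrative, not measured data.

def cost_per_success(total_cost_usd: float, successes: int) -> float:
    """Normalize total spend across all runs to USD per correct solution."""
    if successes == 0:
        return float("inf")  # no successes: cost per success is unbounded
    return total_cost_usd / successes

# Example: $1.20 spent across 30 runs (10 tasks x 3 runs), 24 correct
print(round(cost_per_success(1.20, 24), 4))  # 0.05
```

Failed runs still cost money, which is why spend is divided by successes rather than by attempts: a cheap model with a low success rate can end up more expensive per solved task.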
Findings
Gemini 3 Pro led on long-context summarization cost-efficiency. GPT-5 edged ahead on complex code generation and on prompts requiring safety filtering. For short prompts, pricing differences were minimal.
Recommendation
Hybrid routing wins: default to Gemini 3 Pro for summarization and classification; route code and compliance-heavy prompts to GPT-5; keep a kill switch to move traffic if pricing shifts.
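The routing policy above can be sketched as a simple lookup with an override. This is a minimal illustration, assuming string task labels and model identifiers; the names are placeholders, not real API model IDs.

```python
# Sketch of the hybrid routing policy described above.
# Task labels and model names are illustrative placeholders.

ROUTES = {
    "summarization": "gemini-3-pro",
    "classification": "gemini-3-pro",
    "code": "gpt-5",
    "compliance": "gpt-5",
}
DEFAULT_MODEL = "gemini-3-pro"

# Kill switch: set to a model name to move ALL traffic if pricing shifts.
OVERRIDE_MODEL = None

def route(task_type: str) -> str:
    """Return the model to use for a given task type."""
    if OVERRIDE_MODEL is not None:
        return OVERRIDE_MODEL
    return ROUTES.get(task_type, DEFAULT_MODEL)

print(route("code"))           # gpt-5
print(route("summarization"))  # gemini-3-pro
print(route("unknown-task"))   # gemini-3-pro (default)
```

Keeping the table and the override in config rather than code is what makes the policy easy to change when prices move.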
Key takeaways
- Match models to workload type; no single winner across all tasks.
- Price-performance flips depending on prompt length and output size.
- Keep routing configurable so you can react to pricing or model-card changes.