We ran 1,000 tasks across reasoning, code, and summarization to compare price-performance. Results vary by task type and prompt length.
Setup
Datasets: GSM8K-style math, code generation, and long-context summarization. Each task was run three times at temperature 0.2. Costs were normalized to USD per successful solution.
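The normalization step can be sketched as a small function. This is an illustrative sketch, not the study's actual tooling; the function name and the example figures are hypothetical.

```python
# Hypothetical normalization: USD per successful solution.
# All names and numbers below are illustrative, not measured data.

def cost_per_success(total_cost_usd: float, successes: int) -> float:
    """Normalize total spend across all runs to USD per correct solution."""
    if successes == 0:
        return float("inf")  # no successes: cost per success is unbounded
    return total_cost_usd / successes

# Example: $1.20 spent across 30 runs (10 tasks x 3 runs), 24 correct
print(round(cost_per_success(1.20, 24), 4))  # 0.05
```

Failed runs still cost money, which is why spend is divided by successes rather than by attempts: a cheap model with a low success rate can end up more expensive per solved task.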
Findings
Gemini 3 Pro led on long-context summarization cost-efficiency. GPT-5 edged ahead on complex code generation and on prompts requiring safety filtering. For short prompts, pricing differences were minimal.
Recommendation
Hybrid routing wins: default to Gemini 3 Pro for summarization and classification; route code and compliance-heavy prompts to GPT-5; keep a kill switch to move traffic if pricing shifts.
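The routing policy above can be sketched as a simple lookup with an override. This is a minimal illustration, assuming string task labels and model identifiers; the names are placeholders, not real API model IDs.

```python
# Sketch of the hybrid routing policy described above.
# Task labels and model names are illustrative placeholders.

ROUTES = {
    "summarization": "gemini-3-pro",
    "classification": "gemini-3-pro",
    "code": "gpt-5",
    "compliance": "gpt-5",
}
DEFAULT_MODEL = "gemini-3-pro"

# Kill switch: set to a model name to move ALL traffic if pricing shifts.
OVERRIDE_MODEL = None

def route(task_type: str) -> str:
    """Return the model to use for a given task type."""
    if OVERRIDE_MODEL is not None:
        return OVERRIDE_MODEL
    return ROUTES.get(task_type, DEFAULT_MODEL)

print(route("code"))           # gpt-5
print(route("summarization"))  # gemini-3-pro
print(route("unknown-task"))   # gemini-3-pro (default)
```

Keeping the table and the override in config rather than code is what makes the policy easy to change when prices move.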
Key takeaways
- Match models to workload type; no single winner across all tasks.
- Price-performance flips depending on prompt length and output size.
- Keep routing configurable so you can react to pricing or model-card changes.