
How to Reduce LLM API Costs by 50% Without Sacrificing Quality

Alex Rivera · Oct 15, 2024 · 8 min read

Teams often overspend on LLM calls because of chatty prompts, the wrong model tier, or missing caching. Here’s a practical playbook to cut your bill in half without hurting UX.

1) Right-size the model to the task

Route simple classification and formatting to a small model (e.g., GPT-4o-mini/Gemini Flash) and reserve flagship models for rare, high-stakes calls. Add a confidence flag so you can retry with a larger model when needed.
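A minimal sketch of that routing logic. The task categories, threshold, and the `call_small`/`call_large` callables are all hypothetical stand-ins for your real API clients; the point is the shape of the fallback, not a specific provider.

```python
# Cheap-tier routing with a confidence-based fallback to the flagship model.
# call_small / call_large are placeholders for your actual API clients and
# should return (answer, confidence).

SIMPLE_TASKS = {"classify", "format", "extract"}

def route(task_type, prompt, call_small, call_large, threshold=0.8):
    """Send simple tasks to the cheap tier; escalate when confidence is low."""
    if task_type in SIMPLE_TASKS:
        answer, confidence = call_small(prompt)
        if confidence >= threshold:
            return answer, "small"
    # High-stakes task, or the small model was unsure: pay for the flagship.
    answer, _ = call_large(prompt)
    return answer, "large"
```

In practice you would also log which tier served each request, so you can tune the threshold against your actual escalation rate.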

2) Trim tokens at the source

Aggressively shorten system prompts, prefer bullet points, and cap conversation history. For RAG, chunk to ~500-800 tokens and send only the top 3-5 snippets. Every unnecessary token is billed on every call, so the waste compounds at scale.
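A rough sketch of the RAG side of this. Chunk sizes are approximated by word count here; a real pipeline would count tokens with the model's own tokenizer. The function names and the scoring input are illustrative, not from any particular library.

```python
def chunk_text(text, max_tokens=600):
    """Split text into chunks of roughly max_tokens each.
    Word count is a crude proxy for tokens; use your model's tokenizer in production."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

def top_snippets(scored_chunks, k=4):
    """Keep only the k highest-scoring (chunk, score) pairs for the prompt."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```

Capping retrieval at the top 3-5 snippets bounds the context cost of every request, whatever the corpus size.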

3) Cache and dedupe

Cache identical prompts (or normalized hashes) for 24-72h when freshness is not critical. Keep a small LRU in-memory cache plus a shared Redis layer to avoid recompute across instances.
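One way the two layers might fit together. A plain dict stands in for the shared Redis layer below (in production you would pass a `redis.Redis` client and serialize entries); the normalization rule and TTL are assumptions to tune for your traffic.

```python
import hashlib
import json
import time

def normalize(prompt):
    """Collapse whitespace and lowercase so trivially different prompts share a key."""
    return " ".join(prompt.lower().split())

def cache_key(prompt, model):
    return hashlib.sha256(json.dumps([model, normalize(prompt)]).encode()).hexdigest()

class TwoLevelCache:
    """Small in-process cache in front of a shared store (a dict stands in for Redis)."""
    def __init__(self, shared, ttl_seconds=24 * 3600):
        self.local = {}
        self.shared = shared  # e.g. a redis.Redis client in production
        self.ttl = ttl_seconds

    def get(self, key):
        entry = self.local.get(key) or self.shared.get(key)
        if entry and entry[1] > time.time():  # not expired
            self.local[key] = entry           # promote to the in-process layer
            return entry[0]
        return None

    def set(self, key, value):
        entry = (value, time.time() + self.ttl)
        self.local[key] = entry
        self.shared[key] = entry
```

Check the cache before every model call and write the response back on a miss; instances that share the Redis layer then never recompute each other's answers within the TTL.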

4) Batch where possible

If the API supports it, batch multiple small tasks into one call. You pay the prompt overhead once, not per request. Watch for latency trade-offs and set batch size limits.
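A sketch of manual batching for APIs without a native batch endpoint: pack numbered items into one prompt and map the numbered reply lines back to the originals. The prompt wording and reply format are assumptions; real model output would need validation against them.

```python
def build_batch_prompt(items, instruction):
    """Pack several small tasks into one call; the shared instruction is paid once."""
    lines = [instruction, ""]
    lines += [f"{i + 1}. {item}" for i, item in enumerate(items)]
    lines.append("\nAnswer each item on its own line as '<number>. <answer>'.")
    return "\n".join(lines)

def parse_batch_reply(reply, n):
    """Map numbered reply lines back to item positions; missing items stay None."""
    answers = [None] * n
    for line in reply.splitlines():
        num, _, rest = line.partition(". ")
        if num.strip().isdigit():
            idx = int(num) - 1
            if 0 <= idx < n:
                answers[idx] = rest.strip()
    return answers
```

Keeping `None` for unparsed items lets you retry just the failures individually, which preserves the batching savings on the rest.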

Key takeaways

  • Route easy work to small models; fall back to premium only when necessary.
  • Shorten system prompts and context; limit to the minimum relevant snippets.
  • Cache normalized prompts; avoid paying twice for the same answer.
  • Batch small tasks to amortize prompt overhead.