Tokenization isn’t just word count. BPE splits common words efficiently but can explode on rare strings. This matters for pricing and latency.
BPE basics
Byte Pair Encoding starts with characters and merges frequent pairs. Frequent subwords become single tokens; rare patterns stay long.
Cost implications
Rare product names, UUIDs, and URLs balloon token counts. Normalize inputs (lowercase, strip noise) and avoid repeating long identifiers in prompts.
Practical tips
Pre-tokenize samples to spot outliers. Keep IDs in a reference table instead of repeating them. Favor succinct wording in system prompts.
Key takeaways
- Rare strings inflate token counts—and cost.
- Normalize inputs; avoid repeating long IDs.
- Sample-tokenize to catch expensive prompts early.