Deep Dive

Understanding Tokenization: Why 'Strawberry' Costs More Than You Think

David Kim · Sep 22, 2024 · 7 min read

Token count is not word count. Byte Pair Encoding (BPE) represents common words in one or two tokens but can split a rare string into many. Because pricing and latency both scale with the number of tokens, that difference matters.

BPE basics

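Byte Pair Encoding starts from individual characters and repeatedly merges the most frequent adjacent pair into a new symbol. After many merges, frequent subwords (and whole common words) become single tokens, while rare patterns stay split into long sequences of short pieces.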
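To make the merge loop concrete, here is a minimal toy sketch of character-level BPE. It is not how any production tokenizer is implemented; the function names (`train_bpe`, `apply_merge`, `encode`) and the tiny corpus are assumptions made up for illustration.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each distinct word starts as a tuple of characters, weighted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the new merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[apply_merge(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize a word by applying the learned merges in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Hypothetical corpus: "the" is frequent, so it collapses to one token.
corpus = ["the"] * 50 + ["then"] * 10 + ["there"] * 5
merges = train_bpe(corpus, num_merges=10)
print(encode("the", merges))   # frequent word: a single token
print(encode("xqzv", merges))  # unseen string: one token per character
```

In this toy setup the frequent word ends up as one symbol, while a string the trainer never saw falls back to individual characters. That is the same effect, in miniature, that makes rare strings expensive.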

Cost implications

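Rare product names, UUIDs, and URLs seldom match the merges a tokenizer has learned, so they fragment into many tokens. Normalize inputs where case and formatting carry no meaning (lowercase, strip noise), and avoid repeating long identifiers in prompts.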
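As a rough sketch of the normalization step, the helper below lowercases text, drops URL query strings and fragments, and collapses whitespace. The function name `normalize` and the choice of what counts as "noise" are assumptions. Which parts are safe to strip depends on your application, and lowercasing is only appropriate when case carries no meaning.

```python
import re


def normalize(text):
    """Lowercase, drop URL query strings/fragments, and collapse whitespace.

    Hypothetical helper: adjust the rules to what is safe for your data.
    """
    text = text.lower()
    # Keep the URL up to the path; drop everything from '?' or '#' onward.
    text = re.sub(r"(https?://[^\s?#]+)[?#]\S*", r"\1", text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


raw = "  See   HTTPS://Example.com/Item?utm_source=x&id=1  NOW "
print(normalize(raw))
```

Whether a given rule actually lowers the token count depends on the tokenizer, so it is worth measuring before and after with the tokenizer your model uses.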

Practical tips

Pre-tokenize a sample of real prompts to spot outliers before they reach production. Keep long IDs in a reference table and refer to them by a short alias instead of repeating them. Favor succinct wording in system prompts.
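The first two tips can be sketched as follows, under stated assumptions. `alias_ids` replaces each UUID with a short alias and returns a reference table to include once, and `flag_outliers` reports samples whose token count is far above the median. Both names, the `ID1`-style alias format, and the `factor=3.0` threshold are illustrative choices. `count_tokens` is a placeholder you would back with your model's real tokenizer.

```python
import re
import statistics

UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)


def alias_ids(text):
    """Replace each UUID with a short alias; return (new_text, alias -> UUID)."""
    seen = {}  # UUID -> alias, in order of first appearance

    def repl(match):
        uid = match.group(0).lower()
        if uid not in seen:
            seen[uid] = f"ID{len(seen) + 1}"
        return seen[uid]

    new_text = UUID_RE.sub(repl, text)
    table = {alias: uid for uid, alias in seen.items()}
    return new_text, table


def flag_outliers(samples, count_tokens, factor=3.0):
    """Return (sample, count) pairs whose count exceeds factor * median."""
    counts = [count_tokens(s) for s in samples]
    if not counts:
        return []
    median = statistics.median(counts)
    return [(s, c) for s, c in zip(samples, counts) if c > factor * median]


prompt = (
    "order 123e4567-e89b-12d3-a456-426614174000 shipped; "
    "refund 123e4567-e89b-12d3-a456-426614174000"
)
short_prompt, table = alias_ids(prompt)
print(short_prompt)  # the UUID now appears only in `table`
print(table)


# Crude stand-in for a tokenizer: whitespace word count (an assumption).
def word_count(s):
    return len(s.split())


samples = ["short prompt", "another short one", "x " * 50]
print(flag_outliers(samples, word_count))
```

The alias table is sent once alongside the prompt, so a long identifier is paid for a single time regardless of how often it is referenced.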

Key takeaways

  • Rare strings inflate token counts—and cost.
  • Normalize inputs; avoid repeating long IDs.
  • Sample-tokenize to catch expensive prompts early.