Token count vs word count in PDFs: what matters for AI
Why word count alone misleads AI prompt planning, and how PDF structure can inflate tokens far beyond expectation.
A common mistake in AI workflows is treating words and tokens as the same thing. They are related, but not interchangeable.
If you plan prompt size using word count alone, you can underestimate the real token load substantially, especially with PDFs that include tables, bullets, legal language, or OCR artifacts.
Words are for humans, tokens are for models
Models ingest tokens, not words. A short word can be one token. A longer term can split into multiple tokens. Numbers, punctuation, and special formatting create extra splits.
That is why two documents with 5,000 words each can have very different token counts.
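To make the gap concrete, here is a minimal sketch that compares two strings with the same word count. The function rough_token_count is a hypothetical stand-in, not a real model tokenizer: it assumes one piece per 4 letters, one piece per 3 digits, and one piece per punctuation mark. Real tokenizers split differently, so treat the numbers as illustrative only.

```python
import re


def word_count(text: str) -> int:
    """Count whitespace-separated words, the way a word processor would."""
    return len(text.split())


def rough_token_count(text: str) -> int:
    """Crude token proxy. ASSUMPTION: this is not a real model tokenizer.

    Letter runs cost one piece per 4 characters (rounded up), digit runs cost
    one piece per 3 digits (rounded up), and each other symbol costs one piece.
    """
    pieces = 0
    for run in re.findall(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]", text):
        if run[0].isalpha():
            pieces += -(-len(run) // 4)  # ceiling division
        elif run[0].isdigit():
            pieces += -(-len(run) // 3)
        else:
            pieces += 1
    return pieces


# Two made-up lines with the same number of words.
plain = "The report covers sales in the north region"
dense = "Invoice INV-2024-00317: total $1,284.50 (net 30), ref. #A9F3"

for label, text in (("plain", plain), ("dense", dense)):
    print(label, word_count(text), "words,", rough_token_count(text), "rough tokens")
```

Both lines have the same word count, but the line full of IDs, numbers, and punctuation produces far more pieces. A real tokenizer will give different absolute numbers, though the direction of the gap is the point.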
Where PDFs make this gap worse
- Tables repeat labels and separators.
- Headers and footers appear on every page.
- OCR output can duplicate or break words.
- Code snippets and IDs tokenize heavily.
These patterns inflate tokens while the visible word count still looks reasonable.
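One of these patterns, repeated headers and footers, is easy to measure once the text has been extracted page by page. The sketch below assumes you already have a list of page strings from some extraction step. The function names and the 0.6 threshold are illustrative choices, not part of any particular tool.

```python
from collections import Counter


def repeated_lines(pages: list[str], min_share: float = 0.6) -> set[str]:
    """Return lines that appear on at least `min_share` of the pages.

    ASSUMPTION: a line that recurs on most pages is a header or footer.
    """
    seen = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        seen.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = max(2, min_share * len(pages))
    return {line for line, n in seen.items() if n >= threshold}


def repeat_overhead(pages: list[str], min_share: float = 0.6) -> float:
    """Fraction of all words that sit on repeated header/footer lines."""
    repeats = repeated_lines(pages, min_share)
    total = 0
    repeated = 0
    for page in pages:
        for line in page.splitlines():
            words = len(line.split())
            total += words
            if line.strip() in repeats:
                repeated += words
    return repeated / total if total else 0.0


# Made-up pages: same header and footer on every page, short body text.
pages = [
    "ACME Corp Confidential\nIntro text here\nPage footer",
    "ACME Corp Confidential\nMore text\nPage footer",
    "ACME Corp Confidential\nFinal text\nPage footer",
]
print(sorted(repeated_lines(pages)))
print(f"{repeat_overhead(pages):.0%} of words are repeated header/footer text")
```

Footers that change on every page, such as page numbers, will not match exactly and are not counted by this sketch.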
A better way to estimate before prompting
Use word count only as a rough first pass. For real decisions, use the PDF Token Counter and check the per-page token hotspots.
If one section spikes, split there. If the full file is high, extract only the relevant chapters and reserve context window space for the actual question.
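The split-at-the-spike idea can be sketched as follows, assuming you already have one token count per page from whatever counter you use. The function names, the "2x the median" cutoff, and the budget value are hypothetical choices for illustration.

```python
from statistics import median


def hotspots(page_tokens: list[int], factor: float = 2.0) -> list[int]:
    """Return 1-based page numbers whose count exceeds `factor` x the median."""
    if not page_tokens:
        return []
    cutoff = factor * median(page_tokens)
    return [page for page, n in enumerate(page_tokens, start=1) if n > cutoff]


def split_by_budget(page_tokens: list[int], budget: int) -> list[list[int]]:
    """Greedily group consecutive pages so each group stays within `budget`.

    A single page larger than the budget ends up in a group of its own.
    """
    groups: list[list[int]] = []
    current: list[int] = []
    used = 0
    for page, n in enumerate(page_tokens, start=1):
        if current and used + n > budget:
            groups.append(current)
            current, used = [], 0
        current.append(page)
        used += n
    if current:
        groups.append(current)
    return groups


# Made-up per-page counts: page 3 holds a large table.
page_tokens = [400, 420, 1900, 380, 410]
print(hotspots(page_tokens))               # pages well above the typical page
print(split_by_budget(page_tokens, 1000))  # page groups under a 1000-token budget
```

With these sample counts the spike page is isolated in its own group, which is the natural place to split or to decide whether that page is needed at all.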
Rule of thumb you can trust
In dense business PDFs, the token count often lands between 1.2x and 1.8x what a naive word-based estimate predicts, and sometimes higher with bad OCR. Measure first, then prompt.
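As a quick planning aid, the range can be turned into arithmetic. This is a minimal sketch: naive_tokens is whatever word-based estimate you started with, and the context limit and reserve values are hypothetical numbers, not tied to any specific model.

```python
def dense_pdf_range(naive_tokens: int, low: float = 1.2, high: float = 1.8) -> tuple[int, int]:
    """Scale a naive word-based estimate by the 1.2x to 1.8x rule of thumb."""
    return round(naive_tokens * low), round(naive_tokens * high)


def fits(naive_tokens: int, context_limit: int, reserve: int) -> bool:
    """Check the high end of the range against a context limit.

    `reserve` is the space kept free for the question and the answer
    (ASSUMPTION: you pick this value yourself).
    """
    _, worst = dense_pdf_range(naive_tokens)
    return worst + reserve <= context_limit


low, high = dense_pdf_range(10_000)
print(f"plan for {low} to {high} tokens")
print(fits(10_000, context_limit=32_000, reserve=4_000))
print(fits(10_000, context_limit=20_000, reserve=4_000))
```

This only brackets the estimate. It does not replace an actual count, and bad OCR can push a file past the upper bound.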