PDF Token Counter · PDFShore

Your file never leaves this tab. We can't see it, and neither can anyone else.

Need to extract the text first?

Convert to Markdown with OCR support for scanned PDFs.

Open PDF to Markdown

Want to clean the file too?

Strip hidden metadata before sending the PDF to an AI tool.

Open Remove Metadata

About counting PDF tokens

Every LLM has a context window, a limit on how much text it can process at once, measured in tokens. A token is roughly three to four characters of English text. Knowing how many tokens are in your PDF before you paste it into a chat or API call helps you avoid silent truncation, plan how to chunk large documents, and estimate API costs.
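As a rough illustration of the planning described above, the sketch below turns a token count into a chunk count and an input-cost estimate. The function names, the `reserved_tokens` parameter, and the price used in the usage note are hypothetical and not part of PDFShore; real prices vary by model and provider.

```python
import math


def plan_chunks(total_tokens: int, context_window: int, reserved_tokens: int = 0) -> int:
    """Return how many chunks are needed so that each chunk fits the window.

    reserved_tokens is an assumed allowance for the prompt and the model's
    reply, which share the same context window as the document text.
    """
    budget = context_window - reserved_tokens
    if budget <= 0:
        raise ValueError("reserved_tokens must be smaller than context_window")
    return math.ceil(total_tokens / budget)


def estimate_input_cost(total_tokens: int, price_per_million: float) -> float:
    """Estimate input cost from a price quoted per one million tokens."""
    return total_tokens / 1_000_000 * price_per_million
```

For example, `plan_chunks(50_000, 8_192, reserved_tokens=1_000)` gives 7 chunks, and `estimate_input_cost(50_000, 2.0)` gives about 0.10 in whatever currency the placeholder price is quoted in.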

This tool extracts all text from the PDF locally, then runs cl100k_base, the same tokenizer that GPT-4 and GPT-3.5 use. Claude uses a similar BPE encoding, so the GPT-4 count is a reasonable proxy for it, though not an exact match. Estimates for generic or local models use the common rule of thumb of one token per four characters.
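The four-characters-per-token rule for generic models can be sketched in a few lines. This is a minimal illustration, not PDFShore's actual code; the function name is hypothetical, and rounding up is an assumption made here so that any non-empty text counts as at least one token.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate: one token per chars_per_token characters."""
    # Rounding up is an assumption; the rule itself only gives a ratio.
    return math.ceil(len(text) / chars_per_token)
```

For instance, `estimate_tokens("hello world")` counts 11 characters and returns 3. A real BPE tokenizer may give a different number for the same string, which is why this is only an estimate.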

How to count tokens in a PDF

1. Drop your PDF in. Drag the file onto the box or click to pick one. Nothing is uploaded; text extraction runs in your browser.
2. Wait a moment. The tool reads each page, extracts the text, and counts the tokens. Large files take a few seconds.
3. Read the results. See token counts per model family, word and character totals, and which common context windows the file fits into.
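The results step can be pictured as a small summary over the extracted text. The sketch below is an assumption about the shape of that summary, not the tool's real output; the window labels and sizes in `CONTEXT_WINDOWS` are illustrative values, and all names are hypothetical.

```python
# Illustrative context window sizes in tokens; check your model's documentation.
CONTEXT_WINDOWS = {"8k": 8_192, "32k": 32_768, "128k": 128_000, "200k": 200_000}


def windows_that_fit(token_count: int, windows: dict = CONTEXT_WINDOWS) -> list:
    """Return the labels of every window large enough for token_count."""
    return [name for name, size in windows.items() if token_count <= size]


def summarize(text: str, token_count: int) -> dict:
    """Combine token, word, and character totals with the window check."""
    return {
        "tokens": token_count,
        "words": len(text.split()),  # whitespace-separated words
        "characters": len(text),
        "fits": windows_that_fit(token_count),
    }
```

A file counted at 10,000 tokens would fit the "32k", "128k", and "200k" windows in this table but not "8k". In practice you would also leave room for your prompt and the model's reply.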

Counted in your browser

Privacy matters especially here, because you are probably checking the PDF before deciding whether to hand it to an AI system. PDFShore extracts and tokenizes entirely in your browser. The text never leaves your device and nothing is logged.

Common questions

How accurate is the GPT-4 count?

Very accurate for text-based PDFs. The same cl100k_base vocabulary that the OpenAI API uses for GPT-4 and GPT-3.5 is applied here in the browser. The count should match what the API bills for the same text, apart from the few extra tokens that the chat message format adds.

What about scanned or image-only PDFs?

Image content cannot be tokenized directly. If your PDF is a scanned document without an embedded text layer, the count will be very low or zero. Use PDF to Markdown with OCR enabled first to extract a text layer.
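One way to spot a scanned PDF before trusting a low count is to look at how much text each page yielded. The heuristic below is a sketch under stated assumptions: the function name and the threshold of 20 characters per page are hypothetical, and `page_texts` stands for whatever per-page strings your extractor returns.

```python
def looks_scanned(page_texts: list[str], min_chars_per_page: int = 20) -> bool:
    """Guess whether a PDF lacks a text layer from its extracted page text.

    Returns True when the average number of non-whitespace-trimmed characters
    per page falls below min_chars_per_page (an assumed threshold).
    """
    if not page_texts:
        return True  # nothing was extracted at all
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page
```

If this returns True, running OCR first is likely to give a more meaningful token count.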

Does this work for Claude or Gemini?

Claude uses a similar BPE tokenizer, so the GPT-4 count is a reasonable estimate for it, usually within 5 to 10 percent, but it is not exact. Gemini uses a different tokenizer, so treat the generic count as a rough guide.
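To use the 5 to 10 percent figure when planning, you can turn the GPT-4 count into a range rather than a single number. The sketch below assumes the difference could go in either direction, which the text above does not specify; the function name is hypothetical.

```python
def estimate_range(gpt4_count: int, margin_percent: int = 10) -> tuple[int, int]:
    """Return (low, high) bounds around a GPT-4 token count.

    Integer arithmetic avoids floating-point rounding surprises.
    The low bound is rounded down and the high bound is rounded up.
    """
    low = gpt4_count * (100 - margin_percent) // 100
    high = -(-gpt4_count * (100 + margin_percent) // 100)  # ceiling division
    return low, high
```

For a file counted at 10,000 GPT-4 tokens, `estimate_range(10_000)` returns `(9000, 11000)`. Planning against the upper bound is the safer choice when you are close to a context limit.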