Every LLM has a context window, a limit on how much text it can process at once, measured in tokens. A token is roughly three to four characters of English text. Knowing how many tokens are in your PDF before you paste it into a chat or API call helps you avoid silent truncation, plan how to chunk large documents, and estimate API costs.
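As a rough illustration of how a token count feeds into chunk planning and cost estimates, here is a minimal Python sketch. The function name `plan_request`, the `reserved_for_reply` allowance, and the default price are hypothetical placeholders, not values taken from this tool or any provider; substitute your model's real context window and current pricing.

```python
import math

def plan_request(token_count: int, context_window: int,
                 reserved_for_reply: int = 1000,
                 usd_per_million_input_tokens: float = 1.0) -> dict:
    """Check whether a document fits, how many chunks it needs, and a rough input cost.

    reserved_for_reply and usd_per_million_input_tokens are illustrative
    assumptions, not real limits or prices.
    """
    if context_window <= reserved_for_reply:
        raise ValueError("context_window must be larger than reserved_for_reply")
    # Tokens left for the document after leaving room for the model's answer.
    budget = context_window - reserved_for_reply
    # Naive chunk count: ignores chunk overlap and any prompt text you add.
    chunks = max(1, math.ceil(token_count / budget))
    cost = token_count / 1_000_000 * usd_per_million_input_tokens
    return {"fits": token_count <= budget, "chunks": chunks, "estimated_usd": cost}

plan = plan_request(token_count=20_000, context_window=8_192)
print(plan)
```

With these example numbers the document does not fit in a single request and would need three chunks. Real chunking usually adds some overlap between chunks, so treat the chunk count as a lower bound.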
This tool extracts all text from the PDF locally, then runs cl100k_base, the tokenizer used by GPT-4 and GPT-3.5. Claude uses a similar BPE encoding, so the GPT-4 count is a reasonable proxy for it; models with a different tokenizer, such as Gemini, can differ more. Estimates for generic or local models use the common rule of thumb of one token per four characters.
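The two counting approaches described above can be sketched outside the browser as follows. This is not the tool's own code: it assumes the open-source `tiktoken` Python package for the cl100k_base count and returns `None` when that package is not installed. The function names are made up for this example.

```python
import math
from typing import Optional

def estimate_tokens_generic(text: str, chars_per_token: float = 4.0) -> int:
    """Rule-of-thumb estimate: one token per four characters, rounded up."""
    return math.ceil(len(text) / chars_per_token)

def count_tokens_cl100k(text: str) -> Optional[int]:
    """Count tokens with cl100k_base, or return None if tiktoken is unavailable."""
    try:
        import tiktoken  # third-party package, assumed to be installed separately
    except ImportError:
        return None
    encoding = tiktoken.get_encoding("cl100k_base")
    # disallowed_special=() makes strings such as "<|endoftext|>" in the PDF text
    # encode as ordinary text instead of raising an error.
    return len(encoding.encode(text, disallowed_special=()))

sample = "a" * 400
print(estimate_tokens_generic(sample))
```

Calling `count_tokens_cl100k(extracted_text)` on real prose and comparing it with the generic estimate shows how far the four-character rule drifts for a given document; code, tables, and non-English text tend to drift the most.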
Your PDF content is especially sensitive here, because you are probably checking it before handing it to an AI system. PDFShore extracts and tokenizes entirely in your browser. The text never leaves your device and nothing is logged.
The count is very accurate for text-based PDFs. The browser applies the same cl100k_base vocabulary that the OpenAI API uses, so the count should match what the API bills for the same text, within a negligible margin.
Image content cannot be tokenized directly. If your PDF is a scanned document without an embedded text layer, the count will be very low or zero. Use PDF to Markdown with OCR enabled first to extract a text layer.
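One way to spot a scanned PDF before trusting a near-zero count is to look at how much text each page yielded. The sketch below is a heuristic, not this tool's detection logic: it assumes you already have the extracted text of each page as a string, and the 50-characters-per-page threshold is an arbitrary illustrative choice.

```python
from typing import Sequence

def looks_scanned(page_texts: Sequence[str], min_chars_per_page: int = 50) -> bool:
    """Guess whether a PDF lacks a text layer, based on per-page extracted text.

    min_chars_per_page is an assumed threshold; tune it for your documents.
    """
    if not page_texts:
        # No pages or no extractable text at all: treat as needing OCR.
        return True
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page

print(looks_scanned(["", "  ", ""]))
print(looks_scanned(["x" * 500, "y" * 800]))
```

A document flagged this way is a candidate for OCR before counting. PDFs that mix scanned and text pages can slip past an average-based check, so inspecting the per-page lengths is safer for long files.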
Claude uses a similar BPE tokenizer, so the GPT-4 count is a reliable estimate, usually within 5 to 10 percent. Gemini uses a different tokenizer, so treat the generic count as a rough guide.
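To turn the "within 5 to 10 percent" guidance into a working range, a small helper can widen the GPT-4 count by a margin. The helper name is hypothetical, and the 10 percent default simply mirrors the upper figure quoted above; it is not a measured bound.

```python
from typing import Tuple

def claude_token_range(gpt4_count: int, margin_percent: int = 10) -> Tuple[int, int]:
    """Return a (low, high) token range around a GPT-4 count.

    Uses integer arithmetic so results are exact; high is rounded up.
    """
    low = gpt4_count * (100 - margin_percent) // 100
    high = -(-gpt4_count * (100 + margin_percent) // 100)  # ceiling division
    return low, high

print(claude_token_range(10_000))
```

When checking a document against a context window, comparing the upper end of the range rather than the raw count leaves headroom for tokenizer differences.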