Clean PDF text before sending it to an LLM
A quick cleanup workflow that improves prompt quality and keeps sensitive PDFs local.
A surprising share of bad AI answers aren't the model's fault at all. They come from what you fed in. When the source is a PDF, the prompt usually drags along repeated headers, page numbers, and lines snapped in half, and the context gets harder to read with every page.
So before you spend an hour tweaking the prompt, clean the text once. More often than not it helps more than the prompt surgery would.
What usually hurts quality
- The same header and footer repeated on every page.
- Line breaks in the middle of sentences.
- Sections that come out in the wrong order after extraction.
On a single page none of this looks like a big deal. Across a 40-page report it piles up, and the model spends attention on the wrapper instead of the signal you cared about.
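To make the mid-sentence line break problem concrete, here is a minimal Python sketch that stitches broken lines back together. The function name `join_broken_lines` and the rule it applies (previous line has no closing punctuation, next line starts lowercase) are assumptions for illustration, not something a particular extractor guarantees, so expect to tune it per document.

```python
import re

# Assumed heuristic: a line with no closing punctuation, followed by a line
# that starts with a lowercase letter, was probably split mid-sentence.
SENTENCE_END = re.compile(r"[.!?:;\"')\]]$")


def join_broken_lines(text: str) -> str:
    """Merge lines that look like one sentence snapped in half."""
    out: list[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            out.append("")  # keep blank lines as paragraph breaks
            continue
        prev = out[-1] if out else ""
        starts_block = line.startswith(("#", "-", "*", ">"))
        if (
            prev
            and not prev.startswith("#")        # never glue text onto a heading
            and not starts_block                # never swallow a list item or heading
            and not SENTENCE_END.search(prev)   # previous line looks unfinished
            and line[0].islower()               # this line looks like a continuation
        ):
            out[-1] = prev + " " + line
        else:
            out.append(line)
    return "\n".join(out)


sample = "The quarterly report shows\nthat revenue grew.\n\n# Results\nnext steps follow"
print(join_broken_lines(sample))
```

The sketch leaves headings and list items alone on purpose. A wrongly merged heading does more damage to the context than a sentence that stays split.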
A simple cleanup routine
- Pull the PDF text into Markdown.
- Strip out the repeated header/footer lines.
- Keep the section headings or page markers so context survives.
- Skim it once and delete the obvious junk.
A few minutes here, and the work that comes after (summaries, extraction, question answering) gets noticeably steadier.
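If you would rather script the header and footer step than do it by hand, the routine above can be sketched roughly like this. It assumes you already have the extracted text as one string per page. The function name `strip_repeated_lines`, the `min_share` threshold, and the HTML-comment page marker are all hypothetical choices for this example.

```python
import math
import re
from collections import Counter

# Bare page numbers such as "7", "Page 7", or "7 / 40".
# Caveat: a line holding only a number (a year, say) is dropped too.
PAGE_NUMBER = re.compile(r"^(page\s+)?\d+(\s*(of|/)\s*\d+)?$", re.IGNORECASE)


def strip_repeated_lines(pages: list[str], min_share: float = 0.6) -> str:
    """Drop lines that recur on at least `min_share` of pages, plus page numbers."""
    # Count each distinct line once per page, so a line repeated inside a
    # single page is not mistaken for a header.
    seen: Counter[str] = Counter()
    for page in pages:
        seen.update({ln.strip() for ln in page.splitlines() if ln.strip()})

    threshold = max(2, math.ceil(len(pages) * min_share))
    repeated = {line for line, count in seen.items() if count >= threshold}

    cleaned = []
    for number, page in enumerate(pages, start=1):
        kept = [
            ln for ln in page.splitlines()
            if ln.strip() not in repeated and not PAGE_NUMBER.match(ln.strip())
        ]
        # Keep a page marker so the context survives the cleanup.
        cleaned.append(f"<!-- page {number} -->\n" + "\n".join(kept).strip())
    return "\n\n".join(cleaned)


pages = [
    "ACME Corp Confidential\nIntro text here.\n1",
    "ACME Corp Confidential\nMore findings.\nPage 2 of 3",
    "ACME Corp Confidential\nConclusion.\n3",
]
print(strip_repeated_lines(pages))
```

Treat the output as a first pass. The skim-and-delete step still matters, because a frequency rule can't tell a running header from a heading that legitimately repeats.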
Why we use Markdown here
Markdown hits a nice middle ground: enough structure to stay readable, not so much formatting that it fights you. It chunks easily, and it diffs cleanly when you need to compare two versions later.
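As a rough illustration of the "chunks easily" point, the sketch below splits a Markdown string into one chunk per heading. The function name `chunk_by_heading` and the `(preamble)` label are made up for this example. It also assumes ATX-style `#` headings and no fenced code blocks, since a `#` comment inside a fence would be misread as a heading.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def chunk_by_heading(markdown: str) -> list[tuple[str, str]]:
    """Split Markdown into (heading, body) pairs, one per heading."""
    chunks: list[tuple[str, str]] = []
    title, body = "(preamble)", []  # text before the first heading

    def flush() -> None:
        # Headings with an empty body are dropped rather than emitted.
        if any(line.strip() for line in body):
            chunks.append((title, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, body = match.group(1).strip(), []
        else:
            body.append(line)
    flush()
    return chunks


doc = "Intro line\n# Scope\nApplies to all staff.\n## Exceptions\nNone.\n"
for heading, text in chunk_by_heading(doc):
    print(f"[{heading}] {text}")
```

Each chunk carries its heading with it, so you can send the model one section at a time without losing track of where the text came from.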
Privacy matters in this step
Internal docs, policy files, contracts, customer material: this is the exact moment an accidental upload tends to slip in. PDFShore runs the conversion in your browser, so the original PDF never leaves your machine in the first place.
If your day involves sensitive content, that one architectural choice takes the conversion step off the list of places a file can leak.