Explainer

Published May 27, 2026 · 11 min read · by Max Shore

How browser-based PDF tools work (and why privacy comes for free)

A technical walkthrough of how browsers can parse, modify, and rewrite PDFs locally with pdf-lib, pdfjs-dist, jszip, and Web Workers — and why the architecture itself is the privacy guarantee.

Most online PDF tools ask you to upload a file. A progress bar fills for a moment, and then a smaller (or merged, or split) PDF comes back. The default model has been "send it to a server" for as long as PDFs have lived on the web. That default isn't a law of nature. It's a path of least resistance from a decade ago.

Browsers can now do real work with PDF files. Not just open them in a viewer. Parse them, modify them, compress images inside them, repack them into a new file. All inside the tab, with the file never leaving the device. This post explains how that's possible, the libraries involved, what you give up, and why this architecture ends up being privacy-preserving by accident more than by ideology.

The default architecture: upload, process, download

A traditional online PDF compressor looks like this:

1. The browser HTML form picks the file.
2. The browser uploads it via multipart/form-data to a server endpoint.
3. The server stores the file in a temp directory.
4. A worker process invokes Ghostscript or another engine.
5. The output gets stored.
6. The browser polls or waits, then downloads the result.
7. The server schedules deletion (often "within 60 minutes").
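To make the upload step concrete, here is a minimal sketch of the request a page would build in that model. The endpoint URL and the "file" field name are hypothetical, and the request is constructed but never sent.

```typescript
// Sketch of the pick-and-upload steps in the server model.
// Assumptions: the endpoint URL and the "file" field name are made up for illustration.
function buildUploadRequest(file: Blob, fileName: string): Request {
  const form = new FormData();
  form.append("file", file, fileName);
  // A FormData body is encoded as multipart/form-data, with the file bytes inside it.
  return new Request("https://pdf-tool.example/compress", {
    method: "POST",
    body: form,
  });
}

// Placeholder content standing in for a real document.
const placeholderPdf = new Blob(["%PDF-1.7\n% placeholder, not a real document\n"], {
  type: "application/pdf",
});
const uploadRequest = buildUploadRequest(placeholderPdf, "statement.pdf");
console.log(uploadRequest.method, uploadRequest.headers.get("content-type"));
```

The point of the sketch is the body: whatever the server promises afterwards, the document's bytes are part of what goes over the wire.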

Every step makes sense in isolation. Servers have more RAM than phones. Ghostscript is the canonical engine. Disk is cheap. The problem is that step 2 is permanent: your file existed on someone else's hardware. Even with TLS, even with a strict deletion policy, the bytes left your machine and entered a system you don't operate. For most files, you don't care. For some files, you really do.

The alternative: do it in the browser

Modern browsers ship a small operating system. They have a JavaScript engine fast enough to parse a 100 MB binary in a second or two. They have WebAssembly, which lets C and C++ libraries run at near-native speed inside the tab. They have Web Workers, which let heavy work run off the main thread so the UI doesn't freeze. They have the File and Blob APIs, which let JavaScript read a file the user dropped into a page without ever sending it anywhere.

Combine those pieces and a "PDF tool" stops needing a server. The page is just a delivery mechanism for the code; once it's loaded, the user's machine does the rest.
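As a small illustration of the File and Blob APIs, the sketch below reads a dropped file into memory and checks that it starts with the PDF header marker. The function names are hypothetical, and the check is deliberately strict: it rejects files with leading junk that some lenient readers accept.

```typescript
// PDF files begin with the ASCII marker "%PDF-".
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

function looksLikePdf(bytes: Uint8Array): boolean {
  return bytes.length >= PDF_MAGIC.length && PDF_MAGIC.every((b, i) => bytes[i] === b);
}

// Reads a File (a File is a Blob) entirely in memory; nothing here touches the network.
async function readLocally(file: Blob): Promise<Uint8Array> {
  const bytes = new Uint8Array(await file.arrayBuffer());
  if (!looksLikePdf(bytes)) {
    throw new Error("This file does not start with a PDF header");
  }
  return bytes;
}
```

In a page, `readLocally` would be called with the File taken from a drop event or a file input.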

The libraries (and what they actually do)

pdf-lib

pdf-lib is a pure-JavaScript library for creating and modifying PDFs. It can copy pages between documents, rotate or delete pages, add text and images, fill form fields, and write a new PDF byte stream. It does not render PDFs to images, and it doesn't do OCR. It's the engine behind merge and split: you pass it the source bytes, you tell it which pages to keep or copy, and it gives you a fresh PDF.
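A minimal sketch of merge and split with pdf-lib might look like the following. The helper names `mergePdfs` and `keepPages` are hypothetical; the pdf-lib calls themselves (`PDFDocument.load`, `copyPages`, `addPage`, `save`) are the library's documented ones. This is an illustration of the idea, not necessarily how any particular tool wires it up.

```typescript
import { PDFDocument } from "pdf-lib";

// Merge: copy every page of each source into one new document, in order.
async function mergePdfs(sources: Uint8Array[]): Promise<Uint8Array> {
  const merged = await PDFDocument.create();
  for (const bytes of sources) {
    const src = await PDFDocument.load(bytes);
    const pages = await merged.copyPages(src, src.getPageIndices());
    for (const page of pages) merged.addPage(page);
  }
  return merged.save();
}

// Split: keep only the given zero-based page indices in a fresh document.
async function keepPages(source: Uint8Array, pageIndices: number[]): Promise<Uint8Array> {
  const src = await PDFDocument.load(source);
  const out = await PDFDocument.create();
  const pages = await out.copyPages(src, pageIndices);
  for (const page of pages) out.addPage(page);
  return out.save();
}
```

Both helpers take bytes and return bytes, which is what makes them easy to run inside a Worker.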

pdfjs-dist

pdfjs-dist is the Mozilla PDF.js library, shipped as a usable npm package. It does what pdf-lib doesn't: it renders PDFs. It can rasterize a page to a canvas, extract text content, and pull out embedded images. The rendered images then feed into compression: a 4 MB embedded photo at print resolution can become a 400 KB JPEG at screen resolution without anyone noticing the loss.
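The size reduction comes mostly from pixel count. As a rough sketch of the arithmetic, the helper below computes the dimensions an image would be redrawn at for a lower target resolution. The 300 and 150 DPI figures are illustrative assumptions, not values taken from any tool.

```typescript
interface PixelSize {
  width: number;
  height: number;
}

// Scale an image's pixel dimensions from its source DPI down to a target DPI.
// Never upscales: a target at or above the source returns the size unchanged.
function downsampledSize(size: PixelSize, sourceDpi: number, targetDpi: number): PixelSize {
  if (targetDpi >= sourceDpi) return size;
  const scale = targetDpi / sourceDpi;
  return {
    width: Math.max(1, Math.round(size.width * scale)),
    height: Math.max(1, Math.round(size.height * scale)),
  };
}

// Assumed example: a 10 x 8 inch photo embedded at 300 DPI, redrawn at 150 DPI.
const printPhoto: PixelSize = { width: 3000, height: 2400 };
const scaledPhoto = downsampledSize(printPhoto, 300, 150);
const pixelFraction =
  (scaledPhoto.width * scaledPhoto.height) / (printPhoto.width * printPhoto.height);
console.log(scaledPhoto, pixelFraction);
```

Halving the resolution leaves a quarter of the pixels before JPEG quality is even considered, which is why the two steps together can produce large savings.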

jszip

jszip handles the boring multi-file case. When you split a 30-page PDF into individual pages, the result is 30 PDFs. A click on a download link saves a single Blob, and browsers commonly prompt or block when a page triggers many downloads at once, so the natural wrapper is a ZIP. jszip builds the archive in memory and produces a single Blob to download.
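One small detail in the split case is naming the entries so they sort in page order inside the archive. The helper below is a hypothetical sketch of that; the naming scheme is an assumption, and each name would then be passed to jszip's `zip.file(name, data)` before calling `zip.generateAsync({ type: "blob" })`.

```typescript
// Build a zero-padded entry name so "page-02" sorts before "page-10".
// pageNumber is 1-based; the "-page-" naming scheme is an illustrative choice.
function pageEntryName(sourceName: string, pageNumber: number, pageCount: number): string {
  const base = sourceName.replace(/\.pdf$/i, "");
  const width = String(pageCount).length;
  return `${base}-page-${String(pageNumber).padStart(width, "0")}.pdf`;
}

// Names for a 30-page split of "report.pdf".
const entryNames = Array.from({ length: 30 }, (_, i) => pageEntryName("report.pdf", i + 1, 30));
console.log(entryNames[0], entryNames[29]);
```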

Web Workers

Workers aren't a library; they're a browser primitive. A Worker runs JavaScript in a separate thread. The main thread keeps the page responsive while the Worker does the heavy parsing and packing. When the Worker is done, it posts the result back as a transferable ArrayBuffer, which moves the bytes without copying them. This is what makes a 50 MB merge feel smooth instead of janky.
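Transfer semantics are easy to observe directly. The sketch below uses `structuredClone` with a transfer list, which follows the same transfer rules as `worker.postMessage(buffer, [buffer])`, so it can be run without setting up a Worker. The helper name is hypothetical.

```typescript
// Move an ArrayBuffer instead of copying it. After the call, the original
// buffer is detached: its byteLength reads 0 and its contents are unreachable.
function transferBuffer(buffer: ArrayBuffer): ArrayBuffer {
  return structuredClone(buffer, { transfer: [buffer] });
}

const original = new ArrayBuffer(8 * 1024 * 1024);
new Uint8Array(original).fill(7);

const sizeBefore = original.byteLength;
const moved = transferBuffer(original);
console.log(sizeBefore, original.byteLength, moved.byteLength);
```

The detached original is the visible cost of the trick: once a buffer is transferred, the sender can no longer read it, so the code has to treat the hand-off as a change of ownership.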

Walking through a real operation

Say you drop a 12 MB PDF on a compress page. Here's what happens:

1. The drop zone receives a File object via the drag-and-drop API.
2. The page reads the File as an ArrayBuffer using the FileReader API.
3. The page hands the ArrayBuffer to a Web Worker with postMessage, transferring ownership so no copy is made.
4. The Worker loads pdf-lib, parses the bytes into a PDFDocument, and walks each page.
5. For each page, it asks pdfjs-dist to enumerate embedded images.
6. For each large image, it rasterizes to a canvas at the target resolution (an OffscreenCanvas, since Workers have no DOM) and re-encodes as JPEG with a chosen quality.
7. It substitutes the smaller image back into the PDFDocument.
8. It calls pdfDoc.save(), which resolves to a Uint8Array holding the new PDF.
9. The Worker posts the underlying buffer back to the main thread.
10. The main thread wraps it in a Blob, creates a URL via URL.createObjectURL, and the user clicks Download.
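The final step above can be sketched as follows. The helper name and return shape are hypothetical; `Blob`, `URL.createObjectURL`, and `URL.revokeObjectURL` are the standard APIs.

```typescript
interface LocalDownload {
  url: string;
  fileName: string;
  revoke: () => void;
}

// Wrap the bytes returned by the Worker in a Blob and expose them through a
// blob: URL. The URL points at memory in this tab, not at a server.
function toLocalDownload(pdfBuffer: ArrayBuffer, fileName: string): LocalDownload {
  const blob = new Blob([pdfBuffer], { type: "application/pdf" });
  const url = URL.createObjectURL(blob);
  return {
    url,
    fileName,
    // Revoking releases the Blob once the download has been started.
    revoke: () => URL.revokeObjectURL(url),
  };
}
```

In a page, `url` and `fileName` would typically be assigned to the `href` and `download` attributes of a link the user clicks, with `revoke()` called afterwards.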

That whole sequence runs without a single byte of the user's PDF appearing in a network request. You can open DevTools, watch the Network tab, drop a file, and verify. The only outbound traffic is page assets on first load and (in our case) the Cloudflare Web Analytics ping, which carries page URL and timestamp, never file metadata.

What you give up

The browser is a smaller machine than a server. That has consequences.

RAM ceiling. A 500 MB PDF in pdf-lib can need 2 GB of working memory. A phone with 4 GB total will fail. A laptop with 16 GB is fine. PDFShore caps inputs at around 100 MB to keep the worker inside a sane envelope.
No OCR yet. Tesseract.js exists and works in the browser, but it's a 10+ MB WebAssembly bundle and takes several seconds to initialize. Most users don't need OCR, so paying that cost on every page load isn't worth it. Until that math changes, OCR remains a job for server-side tools.
No fancy format conversions. LibreOffice's Word-to-PDF path is huge. Porting it to WebAssembly is possible (it's been done) but the bundle weight kills the page-load story. Adobe's conversion fidelity is hard to match in JavaScript.
No multi-device sync. A server-side tool can keep your file history. A client-side tool by definition can't, because there is no server. This is the right trade for the same reason your password manager doesn't post passwords to a public bucket.
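The RAM ceiling translates into a simple guard before any parsing starts. This is a minimal sketch: the 100 MB limit follows the cap mentioned above, while the 4x working-memory factor is an assumption extrapolated from the single "500 MB can need 2 GB" figure, not a measured constant.

```typescript
const MAX_INPUT_BYTES = 100 * 1024 * 1024; // the roughly 100 MB input cap
const WORKING_SET_FACTOR = 4; // assumption: 500 MB input -> about 2 GB working memory

type SizeCheck =
  | { ok: true; estimatedWorkingBytes: number }
  | { ok: false; reason: string };

// Reject oversized inputs up front and give a rough working-memory estimate otherwise.
function checkInputSize(sizeBytes: number, maxBytes: number = MAX_INPUT_BYTES): SizeCheck {
  if (sizeBytes > maxBytes) {
    const sizeMb = (sizeBytes / (1024 * 1024)).toFixed(1);
    const maxMb = (maxBytes / (1024 * 1024)).toFixed(0);
    return { ok: false, reason: `File is ${sizeMb} MB; the limit is ${maxMb} MB` };
  }
  return { ok: true, estimatedWorkingBytes: sizeBytes * WORKING_SET_FACTOR };
}
```

In a page, the input would be `file.size`, which is available before any bytes are read.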

Why privacy comes for free

Here's the part that's worth lingering on. We didn't design PDFShore to be private. We designed it to do PDF work in the browser, because that's where the user already is and where the file already lives. The privacy property is a side effect.

If there's no upload, there's no server-side copy. If there's no server-side copy, there's no breach surface for that copy. If there's no account, there's no identity tying the document to a user. If there's no analytics on file content, there's no aggregate of "what kinds of documents do our users have" to leak. None of these are promises in a privacy policy. They're consequences of the architecture, which means you can verify them with the browser DevTools instead of trusting a paragraph in a legal document.

For most PDFs you handle in a year, this doesn't matter. The meme, the recipe, the room-rental flyer. For a few PDFs, it matters a lot: the contract you're negotiating, the medical record, the bank statement, the document you got from a source. The architectural argument is the only one that holds up across all of them. Policies change. Architecture stays.

Why this matters for AI search

The new shape of search is "ask a model, get a recommendation". When someone asks "what's a good private way to merge a PDF", the model has to pick something. The model's answer will lean on two things: what the documentation of each tool says, and how unambiguous the claim is. "Files deleted within an hour" is a policy claim. "Files never leave the device" is an architectural claim. The second one is easier to cite and harder to contradict.

That asymmetry is doing real work right now. We didn't build PDFShore for AI search. We built it for users who don't want to upload their files. Those two audiences happen to want the same answer.

Try it

Open the Compress PDF page in a private window. Open DevTools, go to the Network tab, clear it. Drop a file. Watch the Network panel as the file compresses. Confirm that no request carries your PDF. Download the result. Close the tab. The file is gone from anywhere outside your machine, because it was never anywhere outside your machine to begin with.