PDF OCR converts pictures of words (scanned PDFs, photographs of pages, screenshots of receipts) into editable text using Tesseract.js, a WebAssembly port of the open-source Tesseract OCR engine, long maintained by Google. Recognition runs entirely in your browser inside a Web Worker, so the main thread stays responsive while pages are processed one after another.
For PDF inputs, pdfjs-dist renders each page onto an HTML canvas at 192 DPI in Fast mode or 288 DPI in Best mode. Higher DPI gives Tesseract more pixels per glyph, which matters most for small print, faded photocopies, and 6–8 pt footnotes. Image inputs (PNG, JPEG, WebP) are passed directly into Tesseract without re-rendering.
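To make the DPI figures concrete, here is a minimal sketch of the arithmetic, assuming the renderer treats scale 1.0 as 72 DPI (PDF page sizes are measured in points, 72 to the inch). The names MODE_DPI, scaleForDpi, pointsToPixels, and canvasSize are illustrative and not taken from the tool's source.

```typescript
// PDF page sizes are expressed in points; there are 72 points per inch.
const POINTS_PER_INCH = 72;

// DPI values quoted in the text, keyed by mode (hypothetical key names).
const MODE_DPI = { fast: 192, best: 288 } as const;
type Mode = keyof typeof MODE_DPI;

// Scale factor for a renderer whose scale 1.0 means 72 DPI (assumption).
function scaleForDpi(dpi: number): number {
  return dpi / POINTS_PER_INCH;
}

// Pixel length of anything measured in points: a page edge, a font size.
function pointsToPixels(points: number, dpi: number): number {
  return Math.round((points * dpi) / POINTS_PER_INCH);
}

// Canvas size in pixels for a page whose size is given in points.
function canvasSize(
  widthPt: number,
  heightPt: number,
  mode: Mode
): { width: number; height: number } {
  const dpi = MODE_DPI[mode];
  return {
    width: pointsToPixels(widthPt, dpi),
    height: pointsToPixels(heightPt, dpi),
  };
}
```

Under these assumptions, a US Letter page (612 x 792 pt) renders at 1632 x 2112 pixels in Fast mode and 2448 x 3168 pixels in Best mode, and 6 pt type grows from a nominal 16 pixels tall to 24 pixels. That extra height per glyph is what helps with footnotes and faded copy.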
Tesseract uses LSTM-based recognition models trained per-language. The first time you OCR in a new language, the model file (typically 2–4 MB compressed) downloads from a CDN and is cached by the browser; subsequent runs reuse the cached model and start almost instantly. Thirteen languages ship out of the box: English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Simplified Chinese, Japanese, Korean, Arabic, and Hindi.
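Each language is selected by a short Tesseract code. The sketch below maps the thirteen languages listed above to the standard Tesseract traineddata codes. How this tool stores the mapping internally is not documented here, so the table and helper names are assumptions, as is the gzipped file naming (it follows the Tesseract.js default).

```typescript
// Display name -> standard Tesseract traineddata code.
// The tool's real internal mapping may differ; this is an illustration.
const LANGUAGE_CODES: Record<string, string> = {
  English: "eng",
  Spanish: "spa",
  French: "fra",
  German: "deu",
  Italian: "ita",
  Portuguese: "por",
  Dutch: "nld",
  Russian: "rus",
  "Simplified Chinese": "chi_sim",
  Japanese: "jpn",
  Korean: "kor",
  Arabic: "ara",
  Hindi: "hin",
};

// Look up the code for a display name, failing loudly on unknown names.
function languageCode(name: string): string {
  const code = LANGUAGE_CODES[name];
  if (code === undefined) {
    throw new Error(`Unsupported language: ${name}`);
  }
  return code;
}

// File name of the model download, assuming the default gzipped naming.
function modelFileName(code: string): string {
  return `${code}.traineddata.gz`;
}
```

For example, choosing German would resolve to the code deu and, under the naming assumption above, a one-time download of deu.traineddata.gz.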
Output is plain UTF-8 text plus a confidence score (0–100) per page. Confidence under about 60 usually means the source image is too low-resolution or the wrong language was selected; try Best mode first, then verify the language. Tesseract preserves line breaks but does not reconstruct columns or tables. Multi-column pages are normally read down one column before jumping to the next, but the reading order can come out scrambled when the column boundaries are not detected cleanly.
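The confidence rule of thumb can be written as a small triage step. This is a sketch, not the tool's actual logic: the PageResult shape, the advice labels, and the hard cutoff at 60 are assumptions drawn from the guidance above.

```typescript
// One OCR result per page (hypothetical shape for illustration).
interface PageResult {
  page: number;       // 1-based page number
  text: string;       // recognized UTF-8 text
  confidence: number; // 0 to 100
}

type Advice = "ok" | "retry-best-mode" | "check-language";

// Rough cutoff from the text: below about 60, something is likely wrong.
const LOW_CONFIDENCE = 60;

// Suggest the next step: try Best mode first, then verify the language.
function advise(result: PageResult, mode: "fast" | "best"): Advice {
  if (result.confidence >= LOW_CONFIDENCE) return "ok";
  return mode === "fast" ? "retry-best-mode" : "check-language";
}

// Page numbers that deserve a second look.
function lowConfidencePages(results: PageResult[]): number[] {
  return results
    .filter((r) => r.confidence < LOW_CONFIDENCE)
    .map((r) => r.page);
}
```

A page scoring 45 in Fast mode would get "retry-best-mode"; if it still scores low in Best mode, the advice becomes "check-language".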
Practical limits: Tesseract is good at clean printed text and acceptable on typewritten or moderately faded copy. It struggles with handwriting (cursive is essentially unreadable), heavily skewed scans, dense diacritics in languages whose models are less well trained, and stylized display fonts. For mission-critical OCR (legal discovery, archival digitization), commercial engines like ABBYY FineReader or Google Document AI still produce noticeably better results.
Long PDFs are processed page-by-page, with progress shown per page. A 100-page scanned PDF in Best mode takes several minutes on a typical laptop because each page must be rasterized and run through the LSTM model. To speed things up, split very long PDFs with the PDF Splitter and run pieces in parallel browser tabs.
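If you plan to split a long PDF across several tabs, it helps to work out even page ranges first. The helper below is a hypothetical planning aid, not part of the PDF Splitter; it divides a page count into contiguous 1-based ranges whose sizes differ by at most one page.

```typescript
// Inclusive, 1-based page range, e.g. { start: 1, end: 25 }.
interface PageRange {
  start: number;
  end: number;
}

// Split totalPages into at most `parts` contiguous ranges of near-equal size.
function splitIntoRanges(totalPages: number, parts: number): PageRange[] {
  if (!Number.isInteger(totalPages) || totalPages < 1) {
    throw new RangeError("totalPages must be a positive integer");
  }
  if (!Number.isInteger(parts) || parts < 1) {
    throw new RangeError("parts must be a positive integer");
  }

  // Never create more ranges than there are pages.
  const count = Math.min(parts, totalPages);
  const base = Math.floor(totalPages / count);
  const extra = totalPages % count; // the first `extra` ranges get one more page

  const ranges: PageRange[] = [];
  let start = 1;
  for (let i = 0; i < count; i++) {
    const size = base + (i < extra ? 1 : 0);
    ranges.push({ start, end: start + size - 1 });
    start += size;
  }
  return ranges;
}
```

For a 100-page scan and four tabs this gives pages 1 to 25, 26 to 50, 51 to 75, and 76 to 100. How much faster the whole job finishes depends on how many CPU cores your machine can spare, since each tab runs its own recognition worker.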
Encrypted PDFs cannot be parsed by pdfjs-dist until they are unlocked — use the PDF Password tool first. Files never leave your device: pdfjs-dist runs in the main thread, Tesseract runs in a Web Worker, and the only network requests are the initial WASM and language model downloads.
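A caller can tell an encrypted file apart from other load failures by inspecting the error. The sketch below assumes the error carries the name "PasswordException" and a numeric code of 1 (password needed) or 2 (incorrect password), mirroring pdf.js's PasswordResponses constants. The function name and messages are illustrative, and the check works on the error's shape so it does not import the library.

```typescript
// Assumed to mirror pdf.js PasswordResponses: NEED_PASSWORD and INCORRECT_PASSWORD.
const NEED_PASSWORD = 1;
const INCORRECT_PASSWORD = 2;

// Narrow an unknown thrown value to something with optional name and code.
function errorShape(err: unknown): { name?: unknown; code?: unknown } {
  return typeof err === "object" && err !== null
    ? (err as { name?: unknown; code?: unknown })
    : {};
}

// Turn a PDF load failure into a user-facing hint (hypothetical wording).
function describeLoadError(err: unknown): string {
  const { name, code } = errorShape(err);
  if (name === "PasswordException") {
    if (code === INCORRECT_PASSWORD) {
      return "The password was not accepted.";
    }
    // NEED_PASSWORD, or any other code on a password error.
    return "This PDF is encrypted. Unlock it with the PDF Password tool first.";
  }
  return "The PDF could not be opened.";
}
```

Checking the error by name rather than by class keeps the example independent of how the library is bundled, at the cost of relying on a string that is assumed here rather than guaranteed.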