PDF OCR: Extract Text from Scanned Documents
What Is OCR?
OCR (Optical Character Recognition) converts images of text into machine-readable text. When you scan a paper document to PDF, the result is essentially a collection of images — you can see the text but cannot select, search, or copy it. OCR analyzes these images and extracts the text content.
A "searchable PDF" has an invisible text layer positioned behind the scanned image. You see the original scan, but you can Ctrl+F to search, select text to copy, and screen readers can read the content. Try our PDF OCR tool to make your scanned PDFs searchable.
How OCR Works
Modern OCR engines process documents in several stages:
- Image preprocessing — Deskewing (straightening tilted scans), denoising (removing speckles), binarization (converting to black and white), and contrast enhancement
- Layout analysis — Detecting text regions, columns, tables, images, and reading order
- Character segmentation — Isolating individual characters or words
- Character recognition — Matching character shapes against trained models. Modern engines use LSTM neural networks instead of template matching
- Post-processing — Dictionary lookup, language model correction, and confidence scoring
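To make the binarization step from the preprocessing stage concrete, here is a minimal sketch of Otsu's method, one common way to pick a black/white threshold automatically. It is plain Python over a flat list of grayscale values (0-255). The function names are illustrative and are not any engine's API; real engines work on full images and may use adaptive variants.

```python
def otsu_threshold(pixels):
    """Return the gray level that best separates dark pixels from light ones.

    `pixels` is a flat list of integers in the range 0-255.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels at or below the candidate threshold
    sum_dark = 0  # sum of their gray levels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue          # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break             # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance: large when the two groups are well separated.
        between_var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 0 (ink) or 255 (paper)."""
    return [0 if p <= threshold else 255 for p in pixels]


# Tiny illustrative "scan": three dark ink pixels and four light paper pixels.
scan = [30, 40, 35, 200, 210, 220, 205]
t = otsu_threshold(scan)
print(t, binarize(scan, t))
```

A single global threshold like this struggles with shadows and uneven lighting, which is one reason the image quality recommendations matter.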
Accuracy Factors
| Factor | Impact | Recommendation |
|---|---|---|
| Scan resolution | High | 300 DPI recommended. 200 DPI is often enough for clean text; use 400+ DPI for small fonts. |
| Image quality | High | Even lighting, no shadows, flat page (no curve from book spine) |
| Font type | Medium | Standard fonts: 98%+ accuracy. Decorative/handwritten: 60-80% |
| Language | Medium | Latin scripts: best support. CJK: good. Arabic/Devanagari: improving |
| Document age | Medium | Faded ink, yellowed paper, and old typefaces reduce accuracy |
| Layout complexity | Medium | Single column: easy. Multi-column, tables, mixed content: harder |
| Skew angle | Low-Medium | Auto-deskew handles up to ~15°. Beyond that, manually straighten first |
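The resolution recommendation is easier to reason about in pixels. The sketch below assumes the usual convention of 72 points per inch and treats the nominal font size as the character height, which overstates the height of lowercase letters. The helper names are hypothetical.

```python
POINTS_PER_INCH = 72


def em_height_px(font_pt, dpi):
    """Approximate pixel height of a font's nominal size at a given scan DPI."""
    return font_pt * dpi / POINTS_PER_INCH


def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a scanned page, given its size in inches."""
    return round(width_in * dpi), round(height_in * dpi)


for dpi in (200, 300, 400, 600):
    print(
        f"{dpi} DPI: 8pt text is about {em_height_px(8, dpi):.0f} px tall, "
        f"US Letter is {page_pixels(8.5, 11, dpi)} px"
    )
```

Doubling the DPI doubles the pixels per character but quadruples the pixel count of the page, which is why very high resolutions cost far more in file size than they return in accuracy.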
OCR Engines Compared
| Engine | Type | Accuracy (clean text) | Speed | Languages | Cost |
|---|---|---|---|---|---|
| Tesseract 5 | Open source | 95-99% | Medium | 100+ | Free |
| ABBYY FineReader | Commercial | 99%+ | Fast | 200+ | $$$ |
| Google Cloud Vision | Cloud API | 99%+ | Fast | 100+ | $1.50/1000 pages |
| Amazon Textract | Cloud API | 98-99% | Fast | English + others | $1.50/1000 pages |
| EasyOCR | Open source | 90-95% | Slow (GPU helps) | 80+ | Free |
| PaddleOCR | Open source | 95-98% | Fast | 80+ | Free |
Tesseract CLI Guide
# Install (Ubuntu/Debian)
sudo apt install tesseract-ocr tesseract-ocr-eng
# Basic OCR (image to text)
tesseract scan.png output
# Specify languages (each needs its language pack installed, e.g. tesseract-ocr-fra)
tesseract scan.png output -l eng+fra
# Output as searchable PDF
tesseract scan.png output pdf
# Output as hOCR (HTML with coordinates)
tesseract scan.png output hocr
# Set page segmentation mode
# PSM 3 = fully automatic (default)
# PSM 6 = single block of text
# PSM 11 = sparse text
tesseract scan.png output --psm 6
# List available languages
tesseract --list-langs
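The hOCR output is useful when you need word positions or per-word confidence rather than plain text. Below is a minimal sketch of reading it with Python's standard library. It assumes each word is a span with class ocrx_word whose title attribute carries "bbox x0 y0 x1 y1" and "x_wconf N", as in typical hOCR output. The sample fragment is hand-written for illustration, not real engine output, and the class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect text, bounding box, and confidence for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # the word currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            title = attrs.get("title") or ""
            bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
            conf = re.search(r"x_wconf (\d+(?:\.\d+)?)", title)
            self._current = {
                "text": "",
                "bbox": tuple(map(int, bbox.groups())) if bbox else None,
                "conf": float(conf.group(1)) if conf else None,
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


def parse_hocr(hocr_text):
    parser = HocrWords()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


def low_confidence(words, cutoff=60):
    """Words whose confidence is below `cutoff` (an arbitrary example value)."""
    return [w for w in words if w["conf"] is not None and w["conf"] < cutoff]


# Hand-written fragment in the assumed format.
SAMPLE = (
    "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span> "
    "<span class='ocrx_word' title='bbox 110 92 180 116; x_wconf 41'>w0rld</span>"
)
for word in low_confidence(parse_hocr(SAMPLE)):
    print(word["text"], word["bbox"], word["conf"])
```

To run this on real output, read the file produced by the hocr command (for example output.hocr) and pass its contents to parse_hocr. Listing low-confidence words gives a quick shortlist for manual review.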
ocrmypdf: The Best CLI Tool
ocrmypdf wraps Tesseract with PDF-specific features: it handles multi-page PDFs, preserves the original scan quality, adds a text layer, and can output PDF/A for archiving.
# Install (Tesseract and Ghostscript must also be installed on the system)
pip install ocrmypdf
# Basic usage
ocrmypdf input.pdf output.pdf
# Specify language
ocrmypdf -l eng+deu input.pdf output.pdf
# Skip pages that already have text
ocrmypdf --skip-text input.pdf output.pdf
# Optimize images while OCR'ing
ocrmypdf --optimize 2 input.pdf output.pdf
# Output as PDF/A for archiving
ocrmypdf --output-type pdfa input.pdf output.pdf
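# Deskew + clean before OCR (--clean requires unpaper to be installed)
ocrmypdf --deskew --clean input.pdf output.pdf
# Batch process a directory (writes name.ocr.pdf next to each input; skips earlier outputs)
find /scans -name '*.pdf' ! -name '*.ocr.pdf' -exec sh -c 'ocrmypdf "$1" "${1%.pdf}.ocr.pdf"' _ {} \;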
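For larger batches, a small script can be easier to rerun and audit than a one-line find. The sketch below is one possible approach, not part of ocrmypdf itself: it assumes the ocrmypdf command is on your PATH, uses only the --skip-text and -l flags shown above, and adopts a hypothetical naming convention of name.ocr.pdf for outputs.

```python
import subprocess
from pathlib import Path


def plan_batch(root, suffix=".ocr.pdf"):
    """Return (source, target) pairs for PDFs under `root` that still need OCR."""
    pairs = []
    for src in sorted(Path(root).rglob("*.pdf")):
        if src.name.endswith(suffix):
            continue  # this is one of our own outputs
        dst = src.with_name(src.stem + suffix)
        if dst.exists():
            continue  # already processed on an earlier run
        pairs.append((src, dst))
    return pairs


def run_batch(root, languages="eng"):
    """Run ocrmypdf on every planned pair; return the inputs that failed."""
    failures = []
    for src, dst in plan_batch(root):
        cmd = ["ocrmypdf", "--skip-text", "-l", languages, str(src), str(dst)]
        result = subprocess.run(cmd)
        if result.returncode != 0:
            failures.append((src, result.returncode))
    return failures
```

Calling run_batch("/scans") processes the directory and returns a list of (path, exit code) pairs for anything that failed. Because plan_batch skips files that already have an output, an interrupted run can be restarted without redoing finished work.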
Practical OCR Workflow
- Scan at 300 DPI in color or grayscale (not black & white — OCR engines handle binarization better than scanners)
- Save as PDF directly from the scanner, or as TIFF/PNG for maximum quality
- Preprocess if needed — Deskew, remove borders, enhance contrast
- Run OCR — ocrmypdf --deskew --clean -l eng input.pdf output.pdf
- Verify — Open the output, try searching for known text, spot-check a few pages
- Archive — Save as PDF/A for long-term preservation
Use PDF Text Extractor to verify OCR results by extracting the text layer.
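The verification step can also be scripted. The sketch below checks that phrases you know are in the document appear in the extracted text layer. It assumes the pdftotext command from poppler-utils is installed for the extraction part; the function names and the whitespace and case normalization are illustrative choices.

```python
import re
import subprocess


def normalize(text):
    """Collapse whitespace and lowercase so line breaks do not hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def missing_phrases(extracted_text, expected_phrases):
    """Return the expected phrases that do not appear in the extracted text."""
    haystack = normalize(extracted_text)
    return [p for p in expected_phrases if normalize(p) not in haystack]


def extract_text(pdf_path):
    """Extract the text layer with pdftotext (assumed to be installed)."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],  # "-" writes the text to stdout
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A typical check would be missing_phrases(extract_text("output.pdf"), ["Invoice No.", "Total due"]); an empty list means every phrase was found. This catches a missing or badly garbled text layer, but it does not replace spot-checking pages by eye.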
Accuracy by Document Type
| Document Type | Expected Accuracy | Notes |
|---|---|---|
| Clean printed text (modern) | 98-99% | Standard fonts, good scan quality |
| Newspaper/magazine | 95-98% | Multi-column layout can cause ordering issues |
| Old typewriter text | 90-95% | Uneven ink, non-standard spacing |
| Receipts/invoices | 90-95% | Thermal paper fading, small fonts |
| Tables and forms | 85-95% | Structure recognition is the challenge, not character recognition |
| Handwritten (neat print) | 70-85% | Varies greatly by handwriting quality |
| Handwritten (cursive) | 40-60% | Still a hard problem for OCR |
| Historical documents | 70-90% | Degraded paper, old typefaces, archaic spelling |
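The percentages above are easier to interpret with a concrete measure. The article does not say how its figures were obtained; a common convention is character-level accuracy, computed as one minus the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The sketch below implements that convention under that assumption, with hypothetical function names.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(reference, ocr_output):
    """Character accuracy of `ocr_output` against a trusted `reference`."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return max(0.0, 1 - edit_distance(reference, ocr_output) / len(reference))


def expected_errors(chars_per_page, accuracy):
    """Rough number of wrong characters per page at a given accuracy."""
    return round(chars_per_page * (1 - accuracy))


print(f"{char_accuracy('hello world', 'hel1o w0rld'):.1%}")
print(expected_errors(2000, 0.98), "errors on a 2000-character page at 98%")
```

Even a high character accuracy leaves dozens of errors on a dense page, so the acceptable figure depends on whether the text is only needed for search or will be reused verbatim.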
Frequently Asked Questions
What DPI should I scan at for OCR?
300 DPI is the standard recommendation for OCR. 200 DPI works for clean printed text with standard fonts. 400-600 DPI helps with small fonts (below 10pt) or degraded documents. Higher than 600 DPI rarely improves accuracy and significantly increases file size.
Can OCR read handwriting?
Modern OCR can read neat handwritten print with roughly 70-85% accuracy. Cursive and messy handwriting remain challenging (40-60%). Specialized handwriting recognition (ICR) tools and cloud services such as Google Cloud Vision generally perform better than general-purpose OCR engines.
What is a searchable PDF?
A searchable PDF has an invisible text layer positioned behind the scanned image. You see the original scan but can select, copy, and search the OCR-extracted text. The visual appearance is identical to the original scan.
Is Tesseract OCR good enough?
Tesseract 5 achieves 95-99% accuracy on clean printed text, which is sufficient for most use cases. For degraded documents, complex layouts, or handwriting, commercial engines like ABBYY FineReader or cloud APIs (Google Vision, Amazon Textract) perform better.
How do I OCR a PDF in bulk?
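Use ocrmypdf with find for batch processing: find . -name '*.pdf' ! -name '*.ocr.pdf' -exec sh -c 'ocrmypdf --skip-text "$1" "${1%.pdf}.ocr.pdf"' _ {} \;. ocrmypdf handles multi-page PDFs, skips pages that already contain text when given --skip-text, and can produce PDF/A output for archiving.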