PDF OCR: Extract Text from Scanned Documents
What Is OCR?
OCR (Optical Character Recognition) converts images of text into machine-readable text. When you scan a paper document to PDF, the result is essentially a collection of images: you can see the text but cannot select, search, or copy it. OCR analyzes these images and extracts the text content.
A "searchable PDF" has an invisible text layer positioned behind the scanned image. You see the original scan, but you can press Ctrl+F to search, select text to copy, and screen readers can read the content aloud for accessibility. This makes scanned documents as functional as native digital PDFs.
OCR technology has evolved dramatically over the past decade. Early systems relied on template matching and required clean, high-quality scans. Modern OCR engines use deep learning neural networks that can handle degraded documents, multiple languages, and complex layouts with remarkable accuracy.
The most common use cases for OCR include:
- Digitizing paper archives and historical documents
- Making scanned contracts and legal documents searchable
- Extracting data from invoices and receipts for accounting
- Converting printed books and articles to editable text
- Enabling accessibility for visually impaired users
- Creating searchable repositories of technical documentation
Try our PDF OCR tool to make your scanned PDFs searchable in seconds. For documents that need additional processing, check out our PDF compressor to reduce file sizes after OCR.
How OCR Works
Modern OCR engines process documents through a sophisticated pipeline of image analysis and text recognition. Understanding this process helps you optimize your scans for better results.
Image Preprocessing
Before any text recognition happens, the OCR engine prepares the image:
- Deskewing: Detects and corrects rotation. Even a 2-degree tilt can reduce accuracy by 10-15%. The engine analyzes text baselines and straightens the image.
- Denoising: Removes speckles, dust spots, and scanner artifacts. This is critical for older documents or low-quality scans.
- Binarization: Converts grayscale or color images to pure black and white. Adaptive thresholding handles uneven lighting and shadows.
- Contrast enhancement: Sharpens faded text and improves the distinction between text and background.
- Border removal: Crops out margins and non-text areas to focus processing on actual content.
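The binarization step can be illustrated with a minimal adaptive-thresholding sketch in pure Python. This is a simplified model of what an OCR engine does on full-resolution images; the window size and offset values here are illustrative, not taken from any particular engine:

```python
def adaptive_binarize(img, window=3, offset=10):
    """Binarize a grayscale image (rows of 0-255 values) by comparing each
    pixel to the mean of its local neighborhood, which tolerates uneven
    lighting better than a single global threshold."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    r = window // 2
    for y in range(h):
        for x in range(w):
            # Mean of the window around (y, x), clipped at the borders
            ys = range(max(0, y - r), min(h, y + r + 1))
            xs = range(max(0, x - r), min(w, x + r + 1))
            vals = [img[yy][xx] for yy in ys for xx in xs]
            local_mean = sum(vals) / len(vals)
            # Darker than the local mean (minus an offset) -> ink (1)
            out[y][x] = 1 if img[y][x] < local_mean - offset else 0
    return out

# A dark stroke on a background that brightens from left to right:
page = [
    [120, 130, 200, 210],
    [120,  30, 200, 210],   # 30 is "ink"
    [120, 130, 200, 210],
]
mask = adaptive_binarize(page)
```

Because the threshold is computed per neighborhood, the ink pixel is detected even though the right half of the page is much brighter than the left.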
Layout Analysis
The engine must understand document structure before reading text:
- Detecting text regions versus images, diagrams, and white space
- Identifying columns and determining reading order (left-to-right, top-to-bottom)
- Recognizing tables, headers, footers, and page numbers
- Separating paragraphs and maintaining logical document flow
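For a simple two-column page, the reading-order step above can be sketched as: assign each text block to a column by its horizontal position, then read each column top to bottom. Real layout analysis infers the column boundaries rather than assuming a fixed midpoint; this toy version is only meant to show the ordering logic:

```python
def reading_order(blocks, page_width):
    """Sort text blocks of a two-column page into reading order:
    left column top-to-bottom, then right column top-to-bottom."""
    mid = page_width / 2
    # Sort key: column index first (False = left, True = right),
    # then vertical position within the column
    return sorted(blocks, key=lambda b: (b["x"] >= mid, b["y"]))

blocks = [
    {"id": "right-top", "x": 400, "y": 50},
    {"id": "left-bottom", "x": 50, "y": 300},
    {"id": "left-top", "x": 50, "y": 50},
    {"id": "right-bottom", "x": 400, "y": 300},
]
order = [b["id"] for b in reading_order(blocks, page_width=600)]
# order: left-top, left-bottom, right-top, right-bottom
```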
Layout analysis is where many OCR systems struggle with complex documents. A two-column academic paper with footnotes and embedded figures requires sophisticated analysis to maintain correct reading order.
Character Segmentation
The engine isolates individual characters or words for recognition. This step handles:
- Separating touching or overlapping characters
- Identifying character boundaries in cursive or connected scripts
- Handling variable spacing and kerning
- Detecting and preserving special characters and symbols
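A classic way to find character boundaries in a clean binary image is a vertical projection profile: count the ink pixels in each column and split at empty columns. A sketch of that idea (real engines combine this with connected-component analysis to handle touching or overlapping characters, which a bare projection cannot separate):

```python
def segment_columns(bitmap):
    """Split a binary bitmap (rows of 0/1 values) into character spans by
    finding runs of columns that contain at least one ink pixel."""
    width = len(bitmap[0])
    # Projection profile: ink count per column
    profile = [sum(row[x] for row in bitmap) for x in range(width)]
    spans, start = [], None
    for x, ink in enumerate(profile):
        if ink and start is None:
            start = x                      # entering a character
        elif not ink and start is not None:
            spans.append((start, x))       # leaving a character
            start = None
    if start is not None:
        spans.append((start, width))
    return spans

# Two 2-pixel-wide "characters" separated by a blank column:
glyphs = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 0, 1],
]
print(segment_columns(glyphs))  # -> [(0, 2), (3, 5)]
```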
Character Recognition
This is where the actual text extraction happens. Modern engines use LSTM (Long Short-Term Memory) neural networks trained on millions of character samples. The network analyzes character shapes, context, and patterns to identify each letter, number, or symbol.
Unlike older template-matching systems, neural networks can handle font variations, degraded text, and unusual character shapes. They learn patterns rather than matching exact templates.
Post-Processing
The final stage improves accuracy through intelligent correction:
- Dictionary lookup: Compares recognized words against language dictionaries to catch obvious errors
- Language model correction: Uses statistical models to fix words based on context (e.g., "teh" becomes "the")
- Confidence scoring: Assigns reliability scores to each word, flagging uncertain recognitions
- Format preservation: Maintains bold, italic, font sizes, and other formatting when possible
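The dictionary-lookup step can be sketched with Python's standard library: compare each recognized word against a wordlist and substitute the closest match when one is similar enough. Production engines work from per-character confidences and full language models; the tiny wordlist and cutoff here are purely illustrative:

```python
import difflib

DICTIONARY = ["the", "quick", "brown", "fox", "document"]

def correct(word, cutoff=0.6):
    """Replace a recognized word with its closest dictionary entry,
    or keep it unchanged if nothing is similar enough."""
    matches = difflib.get_close_matches(word.lower(), DICTIONARY,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct("teh"))       # -> "the"
print(correct("documnet"))  # -> "document"
print(correct("xyzzy"))     # -> "xyzzy" (no close match, left alone)
```

Leaving unmatched words untouched matters: silently "correcting" proper nouns or codes would introduce new errors.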
Pro tip: The preprocessing stage is where you have the most control. A clean, high-resolution scan with good contrast will always outperform aggressive post-processing of a poor-quality image.
Accuracy Factors That Matter
OCR accuracy varies dramatically based on input quality and document characteristics. Understanding these factors helps you optimize your scanning process and set realistic expectations.
| Factor | Impact Level | Recommendation |
|---|---|---|
| Scan resolution | High | 300 DPI recommended. 200 DPI can suffice for clean text. 400+ DPI for small fonts or degraded documents. |
| Image quality | High | Even lighting, no shadows, flat page (no curve from book spine). Use document feeder or flatbed scanner. |
| Font type | Medium-High | Standard fonts (Arial, Times): 98%+ accuracy. Decorative/handwritten: 60-80%. Serif fonts generally easier than sans-serif. |
| Language | Medium | Latin scripts: best support. CJK (Chinese/Japanese/Korean): good. Arabic/Devanagari: improving but less mature. |
| Document age | Medium | Faded ink, yellowed paper, and old typefaces reduce accuracy. Consider manual cleanup for critical historical documents. |
| Layout complexity | Medium | Single column: easy. Multi-column, tables, mixed content: harder. May require manual verification. |
| Skew angle | Low-Medium | Auto-deskew handles up to 10 degrees well. Beyond that, manually rotate before OCR. |
| Background noise | Medium | Watermarks, stamps, and background patterns confuse OCR. Clean scans or use preprocessing filters. |
Resolution Deep Dive
Scan resolution deserves special attention because it's the single most controllable factor affecting OCR accuracy. Here's what different resolutions mean in practice:
- 150 DPI: Barely usable. Only for large, clean text (18pt+). Expect 70-80% accuracy.
- 200 DPI: Acceptable for standard documents with 10-12pt fonts. Accuracy around 90-95%.
- 300 DPI: The sweet spot. Handles most documents with 95-99% accuracy. Industry standard.
- 400-600 DPI: Necessary for small fonts (8pt or less), degraded documents, or when you need near-perfect accuracy.
- 600+ DPI: Overkill for most use cases. Creates huge files with minimal accuracy improvement. Use only for archival purposes or extremely small text.
Higher resolution means larger file sizes. A 300 DPI color scan of a letter-sized page is about 25 MB uncompressed. Balance quality needs against storage and processing time.
Quick tip: If you're scanning books, use 400 DPI to compensate for the curved pages near the spine. The distortion at book edges requires extra resolution to maintain accuracy.
OCR Engines Compared
Several OCR engines dominate the open-source and commercial landscape. Each has strengths and weaknesses depending on your use case.
Tesseract OCR
Tesseract is the most popular open-source OCR engine, originally developed by HP and now maintained by Google. It's the default engine for most CLI tools and libraries.
Strengths:
- Completely free and open source
- Supports 100+ languages out of the box
- Active development and regular updates
- Excellent documentation and community support
- Works well with standard documents and clean scans
Weaknesses:
- Struggles with complex layouts and tables
- Lower accuracy on degraded or historical documents
- Requires good preprocessing for optimal results
- Limited format preservation (bold, italic, etc.)
Best for: General-purpose OCR, batch processing, integration into applications, budget-conscious projects.
ABBYY FineReader
ABBYY is the commercial gold standard for OCR accuracy. It's expensive but delivers superior results on challenging documents.
Strengths:
- Highest accuracy rates (99%+ on good scans)
- Excellent layout preservation and format detection
- Handles complex tables, forms, and multi-column layouts
- Superior performance on degraded documents
- Built-in document comparison and redaction tools
Weaknesses:
- Expensive licensing (hundreds of dollars per user)
- Windows-only desktop application (limited Linux support)
- Overkill for simple documents
- Closed-source with no customization options
Best for: Professional document management, legal/medical documents, archival projects with quality requirements.
Google Cloud Vision API
Google's cloud-based OCR service leverages the same technology that powers Google's document scanning features.
Strengths:
- Excellent accuracy with modern neural networks
- Handles handwriting better than most alternatives
- Automatic language detection
- Scales effortlessly for large volumes
- Includes document structure analysis
Weaknesses:
- Requires internet connection and API calls
- Costs money after free tier (1,000 pages/month)
- Privacy concerns for sensitive documents
- Vendor lock-in and dependency on Google infrastructure
Best for: Applications with internet access, variable document types, projects needing handwriting recognition.
Amazon Textract
AWS's document analysis service focuses on structured data extraction from forms and tables.
Strengths:
- Excellent form and table extraction
- Automatic key-value pair detection
- Integrates seamlessly with AWS ecosystem
- Good accuracy on business documents
Weaknesses:
- More expensive than Google Cloud Vision
- Overkill if you just need plain text extraction
- Requires AWS account and setup
Best for: Invoice processing, form digitization, AWS-based applications.
| Engine | Cost | Accuracy | Speed | Best Use Case |
|---|---|---|---|---|
| Tesseract | Free | Good (90-95%) | Fast | General purpose, batch processing |
| ABBYY FineReader | $199+ | Excellent (98-99%) | Medium | Professional documents, archives |
| Google Cloud Vision | $1.50/1000 pages | Excellent (96-98%) | Fast | Cloud apps, handwriting |
| Amazon Textract | $1.50+/1000 pages (more for tables/forms) | Very Good (95-97%) | Fast | Forms, tables, AWS integration |
Tesseract CLI Guide
Tesseract is the workhorse of open-source OCR. Here's how to use it effectively from the command line.
Installation
On Ubuntu/Debian:
sudo apt update
sudo apt install tesseract-ocr
sudo apt install tesseract-ocr-eng # English language data
On macOS:
brew install tesseract
On Windows, download the installer from the official GitHub releases page.
Basic Usage
The simplest Tesseract command extracts text from an image:
tesseract input.png output
This creates output.txt with the extracted text. Note that you don't include the .txt extension in the command.
PDF Output
To create a searchable PDF instead of plain text:
tesseract input.png output pdf
This generates output.pdf with the image and an invisible text layer.
Language Selection
Specify the document language for better accuracy:
tesseract input.png output -l fra # French
tesseract input.png output -l deu # German
tesseract input.png output -l spa # Spanish
For multilingual documents, combine language codes:
tesseract input.png output -l eng+fra # English and French
List all installed languages:
tesseract --list-langs
Page Segmentation Modes
Tesseract offers different page segmentation modes (PSM) for various document layouts:
tesseract input.png output --psm 3 # Fully automatic (default)
tesseract input.png output --psm 6 # Single uniform block of text
tesseract input.png output --psm 4 # Single column of text
Common PSM values:
- 0: Orientation and script detection only
- 1: Automatic page segmentation with OSD (Orientation and Script Detection)
- 3: Fully automatic page segmentation (default)
- 4: Single column of text of variable sizes
- 6: Single uniform block of text
- 7: Single text line
- 11: Sparse text without specific order
OCR Engine Mode
Tesseract supports different OCR engines:
tesseract input.png output --oem 1 # LSTM neural network (best)
tesseract input.png output --oem 0 # Legacy engine (faster, less accurate)
tesseract input.png output --oem 2 # Both engines combined
Use --oem 1 for best results with Tesseract 4.0+.
Configuration Variables
Fine-tune recognition with configuration variables:
tesseract input.png output -c tessedit_char_whitelist=0123456789 # Only digits
tesseract input.png output -c preserve_interword_spaces=1 # Keep spacing
Pro tip: For invoices and forms with mostly numbers, use character whitelisting to dramatically improve accuracy. Restricting the character set eliminates ambiguous recognitions like "O" vs "0" or "l" vs "1".
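When driving Tesseract from a script, it helps to assemble the argument list in one place rather than formatting shell strings. A minimal sketch (the helper name and defaults are illustrative, not part of Tesseract itself):

```python
def tesseract_args(image, out_base, lang="eng", psm=3, oem=1,
                   whitelist=None):
    """Assemble a Tesseract command line as an argument list,
    suitable for passing to subprocess.run()."""
    args = ["tesseract", image, out_base,
            "-l", lang, "--psm", str(psm), "--oem", str(oem)]
    if whitelist:
        # Restrict recognition to the given characters (e.g. digits only)
        args += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return args

# Digits-only OCR for an invoice, as in the tip above:
cmd = tesseract_args("invoice.png", "total", whitelist="0123456789")
```

Passing the list directly to subprocess.run avoids shell-quoting problems with filenames that contain spaces.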
ocrmypdf: The Best CLI Tool
While Tesseract is powerful, ocrmypdf is the tool you actually want to use for PDF OCR. It wraps Tesseract with intelligent preprocessing, PDF handling, and optimization.
Why ocrmypdf?
Raw Tesseract requires you to extract images from PDFs, process them individually, and reassemble the results. ocrmypdf handles all of this automatically:
- Processes multi-page PDFs in one command
- Automatically deskews and cleans images
- Preserves original PDF quality and metadata
- Optimizes output file size
- Skips pages that already have text
- Handles mixed content (text + scanned pages)
Installation
pip install ocrmypdf
Or on Ubuntu/Debian:
sudo apt install ocrmypdf
Basic Usage
The simplest command adds OCR to a scanned PDF:
ocrmypdf input.pdf output.pdf
That's it. ocrmypdf detects the language, deskews pages, runs OCR, and creates a searchable PDF.
Language Selection
ocrmypdf -l fra input.pdf output.pdf # French
ocrmypdf -l eng+fra input.pdf output.pdf # English and French
Optimization Options
Control output quality and file size:
ocrmypdf --optimize 3 input.pdf output.pdf # Maximum compression
ocrmypdf --optimize 1 input.pdf output.pdf # Light compression
ocrmypdf --optimize 0 input.pdf output.pdf # No compression
To produce an archival-quality PDF/A for long-term storage:
ocrmypdf --output-type pdfa input.pdf output.pdf
Deskew and Rotation
Automatically straighten crooked scans:
ocrmypdf --deskew input.pdf output.pdf
Rotate pages to correct orientation:
ocrmypdf --rotate-pages input.pdf output.pdf
Image Preprocessing
Clean up poor-quality scans:
ocrmypdf --clean input.pdf output.pdf # Clean pages before OCR (output keeps the original image)
ocrmypdf --clean-final input.pdf output.pdf # Also use the cleaned image in the final output
Skip Existing Text
For PDFs with mixed content (some pages already have text):
ocrmypdf --skip-text input.pdf output.pdf
This only processes pages that need OCR, saving time and preserving existing text quality.
Force OCR on All Pages
To OCR even pages that already have text:
ocrmypdf --force-ocr input.pdf output.pdf
Useful when existing text is poor quality or you want to standardize the text layer.
Parallel Processing
Speed up large documents by using multiple CPU cores:
ocrmypdf --jobs 4 input.pdf output.pdf
By default, ocrmypdf uses all available cores; use --jobs to limit concurrency on shared machines.
Quick tip: For a 100-page document, parallel processing across all cores combined with --optimize 3 can reduce processing time from 10 minutes to under 2 minutes while creating a smaller output file.
Real-World Example
Here's a production-ready command for processing scanned documents:
ocrmypdf \
--deskew \
--rotate-pages \
--clean \
--optimize 3 \
--output-type pdfa \
--skip-text \
input.pdf output.pdf
This command:
- Straightens crooked pages
- Rotates pages to correct orientation
- Removes background noise
- Compresses the output
- Uses all CPU cores (ocrmypdf's default)
- Creates an archival-quality PDF/A
- Skips pages that already have text
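The same production invocation can be assembled from a script. A sketch that builds the argument list from a couple of options (the helper name and parameters are illustrative):

```python
def ocrmypdf_args(src, dst, optimize=3, skip_text=True):
    """Assemble the production ocrmypdf invocation as an argument list."""
    args = ["ocrmypdf", "--deskew", "--rotate-pages", "--clean",
            "--optimize", str(optimize), "--output-type", "pdfa"]
    if skip_text:
        args.append("--skip-text")  # leave pages that already have text alone
    args += [src, dst]
    return args

cmd = ocrmypdf_args("input.pdf", "output.pdf")
```

If you would rather stay in-process than shell out, ocrmypdf also ships a Python API (`ocrmypdf.ocr`) that accepts keyword equivalents of these flags.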
Practical OCR Workflow
Here's a step-by-step workflow for processing scanned documents, from scanning to final output.
Step 1: Scan with Optimal Settings
Configure your scanner for best results:
- Resolution: 300 DPI for standard documents, 400 DPI for small text
- Color mode: Grayscale for text-only documents, color for documents with images or colored text
- Format: Save as PDF directly if your scanner supports it, otherwise use TIFF or PNG
- Compression: Use lossless compression (LZW for TIFF) or no compression
Physical scanning tips:
- Clean the scanner glass before starting
- Ensure pages are flat and straight
- Use the document feeder for multi-page documents
- Avoid shadows from book spines or curved pages
Step 2: Inspect and Prepare
Before running OCR, check your scans:
- Open a few pages to verify quality
- Check that text is sharp and readable
- Look for skew, shadows, or cut-off text
- Rescan problem pages if necessary
If you have multiple files, combine them into a single PDF using our