PDF OCR: Extract Text from Scanned Documents
What Is OCR?
OCR (Optical Character Recognition) converts images of text into machine-readable text. When you scan a paper document to PDF, the result is essentially a collection of images: you can see the text but cannot select, search, or copy it. OCR analyzes these images and extracts the text content.
A "searchable PDF" has an invisible text layer positioned behind the scanned image. You see the original scan, but you can press Ctrl+F to search, select text to copy, and screen readers can read the content aloud for accessibility. This makes scanned documents as functional as native digital PDFs.
OCR technology has evolved dramatically over the past decade. Early systems relied on template matching and required clean, high-quality scans. Modern OCR engines use deep learning neural networks that can handle degraded documents, multiple languages, and complex layouts with remarkable accuracy.
The most common use cases for OCR include:
- Digitizing paper archives and historical documents
- Making scanned contracts and legal documents searchable
- Extracting data from invoices and receipts for accounting
- Converting printed books and articles to editable text
- Enabling accessibility for visually impaired users
- Creating searchable repositories of technical documentation
Try our PDF OCR tool to make your scanned PDFs searchable in seconds. For documents that need additional processing, check out our PDF compressor to reduce file sizes after OCR.
How OCR Works
Modern OCR engines process documents through a sophisticated pipeline of image analysis and text recognition. Understanding this process helps you optimize your scans for better results.
Image Preprocessing
Before any text recognition happens, the OCR engine prepares the image:
- Deskewing: Detects and corrects rotation. Even a 2-degree tilt can reduce accuracy by 10-15%. The engine analyzes text baselines and straightens the image.
- Denoising: Removes speckles, dust spots, and scanner artifacts. This is critical for older documents or low-quality scans.
- Binarization: Converts grayscale or color images to pure black and white. Adaptive thresholding handles uneven lighting and shadows.
- Contrast enhancement: Sharpens faded text and improves the distinction between text and background.
- Border removal: Crops out margins and non-text areas to focus processing on actual content.
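The binarization step can be illustrated with a minimal adaptive-thresholding sketch in pure Python. This is a simplified model of what an OCR engine does on full-resolution images; the window size and offset values here are illustrative, not taken from any particular engine:

```python
def adaptive_binarize(img, window=3, offset=10):
    """Binarize a grayscale image (rows of 0-255 values) by comparing each
    pixel to the mean of its local neighborhood, which tolerates uneven
    lighting better than a single global threshold."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    r = window // 2
    for y in range(h):
        for x in range(w):
            # Mean of the window around (y, x), clipped at the borders
            ys = range(max(0, y - r), min(h, y + r + 1))
            xs = range(max(0, x - r), min(w, x + r + 1))
            vals = [img[yy][xx] for yy in ys for xx in xs]
            local_mean = sum(vals) / len(vals)
            # Darker than the local mean (minus an offset) -> ink (1)
            out[y][x] = 1 if img[y][x] < local_mean - offset else 0
    return out

# A dark stroke on a background that brightens from left to right:
page = [
    [120, 130, 200, 210],
    [120,  30, 200, 210],   # 30 is "ink"
    [120, 130, 200, 210],
]
mask = adaptive_binarize(page)
```

Because the threshold is computed per neighborhood, the ink pixel is detected even though the right half of the page is much brighter than the left.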
Layout Analysis
The engine must understand document structure before reading text:
- Detecting text regions versus images, diagrams, and white space
- Identifying columns and determining reading order (left-to-right, top-to-bottom)
- Recognizing tables, headers, footers, and page numbers
- Separating paragraphs and maintaining logical document flow
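For a simple two-column page, the reading-order step above can be sketched as: assign each text block to a column by its horizontal position, then read each column top to bottom. Real layout analysis infers the column boundaries rather than assuming a fixed midpoint; this toy version is only meant to show the ordering logic:

```python
def reading_order(blocks, page_width):
    """Sort text blocks of a two-column page into reading order:
    left column top-to-bottom, then right column top-to-bottom."""
    mid = page_width / 2
    # Sort key: column index first (False = left, True = right),
    # then vertical position within the column
    return sorted(blocks, key=lambda b: (b["x"] >= mid, b["y"]))

blocks = [
    {"id": "right-top", "x": 400, "y": 50},
    {"id": "left-bottom", "x": 50, "y": 300},
    {"id": "left-top", "x": 50, "y": 50},
    {"id": "right-bottom", "x": 400, "y": 300},
]
order = [b["id"] for b in reading_order(blocks, page_width=600)]
# order: left-top, left-bottom, right-top, right-bottom
```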
Layout analysis is where many OCR systems struggle with complex documents. A two-column academic paper with footnotes and embedded figures requires sophisticated analysis to maintain correct reading order.
Character Segmentation
The engine isolates individual characters or words for recognition. This step handles:
- Separating touching or overlapping characters
- Identifying character boundaries in cursive or connected scripts
- Handling variable spacing and kerning
- Detecting and preserving special characters and symbols
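A classic way to find character boundaries in a clean binary image is a vertical projection profile: count the ink pixels in each column and split at empty columns. A sketch of that idea (real engines combine this with connected-component analysis to handle touching or overlapping characters, which a bare projection cannot separate):

```python
def segment_columns(bitmap):
    """Split a binary bitmap (rows of 0/1 values) into character spans by
    finding runs of columns that contain at least one ink pixel."""
    width = len(bitmap[0])
    # Projection profile: ink count per column
    profile = [sum(row[x] for row in bitmap) for x in range(width)]
    spans, start = [], None
    for x, ink in enumerate(profile):
        if ink and start is None:
            start = x                      # entering a character
        elif not ink and start is not None:
            spans.append((start, x))       # leaving a character
            start = None
    if start is not None:
        spans.append((start, width))
    return spans

# Two 2-pixel-wide "characters" separated by a blank column:
glyphs = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 0, 1],
]
print(segment_columns(glyphs))  # -> [(0, 2), (3, 5)]
```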
Character Recognition
This is where the actual text extraction happens. Modern engines use LSTM (Long Short-Term Memory) neural networks trained on millions of character samples. The network analyzes character shapes, context, and patterns to identify each letter, number, or symbol.
Unlike older template-matching systems, neural networks can handle font variations, degraded text, and unusual character shapes. They learn patterns rather than matching exact templates.
Post-Processing
The final stage improves accuracy through intelligent correction:
- Dictionary lookup: Compares recognized words against language dictionaries to catch obvious errors
- Language model correction: Uses statistical models to fix words based on context (e.g., "teh" becomes "the")
- Confidence scoring: Assigns reliability scores to each word, flagging uncertain recognitions
- Format preservation: Maintains bold, italic, font sizes, and other formatting when possible
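The dictionary-lookup step can be sketched with Python's standard library: compare each recognized word against a wordlist and substitute the closest match when one is similar enough. Production engines work from per-character confidences and full language models; the tiny wordlist and cutoff here are purely illustrative:

```python
import difflib

DICTIONARY = ["the", "quick", "brown", "fox", "document"]

def correct(word, cutoff=0.6):
    """Replace a recognized word with its closest dictionary entry,
    or keep it unchanged if nothing is similar enough."""
    matches = difflib.get_close_matches(word.lower(), DICTIONARY,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct("teh"))       # -> "the"
print(correct("documnet"))  # -> "document"
print(correct("xyzzy"))     # -> "xyzzy" (no close match, left alone)
```

Leaving unmatched words untouched matters: silently "correcting" proper nouns or codes would introduce new errors.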
Pro tip: The preprocessing stage is where you have the most control. A clean, high-resolution scan with good contrast will always outperform aggressive post-processing of a poor-quality image.
Accuracy Factors That Matter
OCR accuracy varies dramatically based on input quality and document characteristics. Understanding these factors helps you optimize your scanning process and set realistic expectations.
| Factor | Impact Level | Recommendation |
|---|---|---|
| Scan resolution | High | 300 DPI recommended. 200 DPI can suffice for clean text. 400+ DPI for small fonts or degraded documents. |
| Image quality | High | Even lighting, no shadows, flat page (no curve from book spine). Use document feeder or flatbed scanner. |
| Font type | Medium-High | Standard fonts (Arial, Times): 98%+ accuracy. Decorative/handwritten: 60-80%. Serif fonts generally easier than sans-serif. |
| Language | Medium | Latin scripts: best support. CJK (Chinese/Japanese/Korean): good. Arabic/Devanagari: improving but less mature. |
| Document age | Medium | Faded ink, yellowed paper, and old typefaces reduce accuracy. Consider manual cleanup for critical historical documents. |
| Layout complexity | Medium | Single column: easy. Multi-column, tables, mixed content: harder. May require manual verification. |
| Skew angle | Low-Medium | Auto-deskew handles up to 10 degrees well. Beyond that, manually rotate before OCR. |
| Background noise | Medium | Watermarks, stamps, and background patterns confuse OCR. Clean scans or use preprocessing filters. |
Resolution Deep Dive
Scan resolution deserves special attention because it's the single most controllable factor affecting OCR accuracy. Here's what different resolutions mean in practice:
- 150 DPI: Barely usable. Only for large, clean text (18pt+). Expect 70-80% accuracy.
- 200 DPI: Acceptable for standard documents with 10-12pt fonts. Accuracy around 90-95%.
- 300 DPI: The sweet spot. Handles most documents with 95-99% accuracy. Industry standard.
- 400-600 DPI: Necessary for small fonts (8pt or less), degraded documents, or when you need near-perfect accuracy.
- 600+ DPI: Overkill for most use cases. Creates huge files with minimal accuracy improvement. Use only for archival purposes or extremely small text.
Higher resolution means larger file sizes. A 300 DPI color scan of a letter-sized page is about 25 MB uncompressed. Balance quality needs against storage and processing time.
Quick tip: If you're scanning books, use 400 DPI to compensate for the curved pages near the spine. The distortion at book edges requires extra resolution to maintain accuracy.
OCR Engines Compared
Several OCR engines dominate the open-source and commercial landscape. Each has strengths and weaknesses depending on your use case.
Tesseract OCR
Tesseract is the most popular open-source OCR engine, originally developed by HP and now maintained by Google. It's the default engine for most CLI tools and libraries.
Strengths:
- Completely free and open source
- Supports 100+ languages out of the box
- Active development and regular updates
- Excellent documentation and community support
- Works well with standard documents and clean scans
Weaknesses:
- Struggles with complex layouts and tables
- Lower accuracy on degraded or historical documents
- Requires good preprocessing for optimal results
- Limited format preservation (bold, italic, etc.)
Best for: General-purpose OCR, batch processing, integration into applications, budget-conscious projects.
ABBYY FineReader
ABBYY is the commercial gold standard for OCR accuracy. It's expensive but delivers superior results on challenging documents.
Strengths:
- Highest accuracy rates (99%+ on good scans)
- Excellent layout preservation and format detection
- Handles complex tables, forms, and multi-column layouts
- Superior performance on degraded documents
- Built-in document comparison and redaction tools
Weaknesses:
- Expensive licensing (hundreds of dollars per user)
- Windows-only desktop application (limited Linux support)
- Overkill for simple documents
- Closed-source with no customization options
Best for: Professional document management, legal/medical documents, archival projects with quality requirements.
Google Cloud Vision API
Google's cloud-based OCR service leverages the same technology that powers Google's document scanning features.
Strengths:
- Excellent accuracy with modern neural networks
- Handles handwriting better than most alternatives
- Automatic language detection
- Scales effortlessly for large volumes
- Includes document structure analysis
Weaknesses:
- Requires internet connection and API calls
- Costs money after free tier (1,000 pages/month)
- Privacy concerns for sensitive documents
- Vendor lock-in and dependency on Google infrastructure
Best for: Applications with internet access, variable document types, projects needing handwriting recognition.
Amazon Textract
AWS's document analysis service focuses on structured data extraction from forms and tables.
Strengths:
- Excellent form and table extraction
- Automatic key-value pair detection
- Integrates seamlessly with AWS ecosystem
- Good accuracy on business documents
Weaknesses:
- More expensive than Google Cloud Vision
- Overkill if you just need plain text extraction
- Requires AWS account and setup
Best for: Invoice processing, form digitization, AWS-based applications.
| Engine | Cost | Accuracy | Speed | Best Use Case |
|---|---|---|---|---|
| Tesseract | Free | Good (90-95%) | Fast | General purpose, batch processing |
| ABBYY FineReader | $199+ | Excellent (98-99%) | Medium | Professional documents, archives |
| Google Cloud Vision | $1.50/1000 pages | Excellent (96-98%) | Fast | Cloud apps, handwriting |
| Amazon Textract | $1.50+/1000 pages (more for tables/forms) | Very Good (95-97%) | Fast | Forms, tables, AWS integration |
Tesseract CLI Guide
Tesseract is the workhorse of open-source OCR. Here's how to use it effectively from the command line.
Installation
On Ubuntu/Debian:
sudo apt update
sudo apt install tesseract-ocr
sudo apt install tesseract-ocr-eng # English language data
On macOS:
brew install tesseract
On Windows, download the installer from the official GitHub releases page.
Basic Usage
The simplest Tesseract command extracts text from an image:
tesseract input.png output
This creates output.txt with the extracted text. Note that you don't include the .txt extension in the command.
PDF Output
To create a searchable PDF instead of plain text:
tesseract input.png output pdf
This generates output.pdf with the image and an invisible text layer.
Language Selection
Specify the document language for better accuracy:
tesseract input.png output -l fra # French
tesseract input.png output -l deu # German
tesseract input.png output -l spa # Spanish
For multilingual documents, combine language codes:
tesseract input.png output -l eng+fra # English and French
List all installed languages:
tesseract --list-langs
Page Segmentation Modes
Tesseract offers different page segmentation modes (PSM) for various document layouts:
tesseract input.png output --psm 3 # Fully automatic (default)
tesseract input.png output --psm 6 # Single uniform block of text
tesseract input.png output --psm 4 # Single column of text
Common PSM values:
- 0: Orientation and script detection only
- 1: Automatic page segmentation with OSD (Orientation and Script Detection)
- 3: Fully automatic page segmentation (default)
- 4: Single column of text of variable sizes
- 6: Single uniform block of text
- 7: Single text line
- 11: Sparse text without specific order
OCR Engine Mode
Tesseract supports different OCR engines:
tesseract input.png output --oem 1 # LSTM neural network (best)
tesseract input.png output --oem 0 # Legacy engine (faster, less accurate)
tesseract input.png output --oem 2 # Both engines combined
Use --oem 1 for best results with Tesseract 4.0+.
Configuration Variables
Fine-tune recognition with configuration variables:
tesseract input.png output -c tessedit_char_whitelist=0123456789 # Only digits
tesseract input.png output -c preserve_interword_spaces=1 # Keep spacing
Pro tip: For invoices and forms with mostly numbers, use character whitelisting to dramatically improve accuracy. Restricting the character set eliminates ambiguous recognitions like "O" vs "0" or "l" vs "1".
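When driving Tesseract from a script, it helps to assemble the argument list in one place rather than formatting shell strings. A minimal sketch (the helper name and defaults are illustrative, not part of Tesseract itself):

```python
def tesseract_args(image, out_base, lang="eng", psm=3, oem=1,
                   whitelist=None):
    """Assemble a Tesseract command line as an argument list,
    suitable for passing to subprocess.run()."""
    args = ["tesseract", image, out_base,
            "-l", lang, "--psm", str(psm), "--oem", str(oem)]
    if whitelist:
        # Restrict recognition to the given characters (e.g. digits only)
        args += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return args

# Digits-only OCR for an invoice, as in the tip above:
cmd = tesseract_args("invoice.png", "total", whitelist="0123456789")
```

Passing the list directly to subprocess.run avoids shell-quoting problems with filenames that contain spaces.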
ocrmypdf: The Best CLI Tool
While Tesseract is powerful, ocrmypdf is the tool you actually want to use for PDF OCR. It wraps Tesseract with intelligent preprocessing, PDF handling, and optimization.
Why ocrmypdf?
Raw Tesseract requires you to extract images from PDFs, process them individually, and reassemble the results. ocrmypdf handles all of this automatically:
- Processes multi-page PDFs in one command
- Automatically deskews and cleans images
- Preserves original PDF quality and metadata
- Optimizes output file size
- Skips pages that already have text
- Handles mixed content (text + scanned pages)
Installation
pip install ocrmypdf
Or on Ubuntu/Debian:
sudo apt install ocrmypdf
Basic Usage
The simplest command adds OCR to a scanned PDF:
ocrmypdf input.pdf output.pdf
That's it. ocrmypdf detects the language, deskews pages, runs OCR, and creates a searchable PDF.
Language Selection
ocrmypdf -l fra input.pdf output.pdf # French
ocrmypdf -l eng+fra input.pdf output.pdf # English and French
Optimization Options
Control output quality and file size:
ocrmypdf --optimize 3 input.pdf output.pdf # Maximum compression
ocrmypdf --optimize 1 input.pdf output.pdf # Light compression
ocrmypdf --optimize 0 input.pdf output.pdf # No compression
To produce an archival-quality PDF/A for long-term storage:
ocrmypdf --output-type pdfa input.pdf output.pdf
Deskew and Rotation
Automatically straighten crooked scans:
ocrmypdf --deskew input.pdf output.pdf
Rotate pages to correct orientation:
ocrmypdf --rotate-pages input.pdf output.pdf
Image Preprocessing
Clean up poor-quality scans:
ocrmypdf --clean input.pdf output.pdf # Clean pages before OCR (output keeps the original image)
ocrmypdf --clean-final input.pdf output.pdf # Also use the cleaned image in the final output
Skip Existing Text
For PDFs with mixed content (some pages already have text):
ocrmypdf --skip-text input.pdf output.pdf
This only processes pages that need OCR, saving time and preserving existing text quality.
Force OCR on All Pages
To OCR even pages that already have text:
ocrmypdf --force-ocr input.pdf output.pdf
Useful when existing text is poor quality or you want to standardize the text layer.
Parallel Processing
Speed up large documents by using multiple CPU cores:
ocrmypdf --jobs 4 input.pdf output.pdf
By default, ocrmypdf uses all available cores; use --jobs to limit concurrency on shared machines.
Quick tip: For a 100-page document, parallel processing across all cores combined with --optimize 3 can reduce processing time from 10 minutes to under 2 minutes while creating a smaller output file.
Real-World Example
Here's a production-ready command for processing scanned documents:
ocrmypdf \
--deskew \
--rotate-pages \
--clean \
--optimize 3 \
--output-type pdfa \
--skip-text \
input.pdf output.pdf
This command:
- Straightens crooked pages
- Rotates pages to correct orientation
- Removes background noise
- Compresses the output
- Uses all CPU cores (ocrmypdf's default)
- Creates an archival-quality PDF/A
- Skips pages that already have text
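The same production invocation can be assembled from a script. A sketch that builds the argument list from a couple of options (the helper name and parameters are illustrative):

```python
def ocrmypdf_args(src, dst, optimize=3, skip_text=True):
    """Assemble the production ocrmypdf invocation as an argument list."""
    args = ["ocrmypdf", "--deskew", "--rotate-pages", "--clean",
            "--optimize", str(optimize), "--output-type", "pdfa"]
    if skip_text:
        args.append("--skip-text")  # leave pages that already have text alone
    args += [src, dst]
    return args

cmd = ocrmypdf_args("input.pdf", "output.pdf")
```

If you would rather stay in-process than shell out, ocrmypdf also ships a Python API (`ocrmypdf.ocr`) that accepts keyword equivalents of these flags.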
Practical OCR Workflow
Here's a step-by-step workflow for processing scanned documents, from scanning to final output.
Step 1: Scan with Optimal Settings
Configure your scanner for best results:
- Resolution: 300 DPI for standard documents, 400 DPI for small text
- Color mode: Grayscale for text-only documents, color for documents with images or colored text
- Format: Save as PDF directly if your scanner supports it, otherwise use TIFF or PNG
- Compression: Use lossless compression (LZW for TIFF) or no compression
Physical scanning tips:
- Clean the scanner glass before starting
- Ensure pages are flat and straight
- Use the document feeder for multi-page documents
- Avoid shadows from book spines or curved pages
Step 2: Inspect and Prepare
Before running OCR, check your scans:
- Open a few pages to verify quality
- Check that text is sharp and readable
- Look for skew, shadows, or cut-off text
- Rescan problem pages if necessary
If you have multiple files, combine them into a single PDF using our