PDF OCR: Extract Text from Scanned Documents

12 min read

Table of Contents

- What Is OCR?
- How OCR Works
- Accuracy Factors That Matter
- OCR Engines Compared
- Tesseract CLI Guide
- ocrmypdf: The Best CLI Tool
- Practical OCR Workflow

What Is OCR?

OCR (Optical Character Recognition) converts images of text into machine-readable text. When you scan a paper document to PDF, the result is essentially a collection of images: you can see the text but cannot select, search, or copy it. OCR analyzes these images and extracts the text content.

A "searchable PDF" has an invisible text layer positioned behind the scanned image. You see the original scan, but you can press Ctrl+F to search, select text to copy, and screen readers can read the content aloud for accessibility. This makes scanned documents as functional as native digital PDFs.

OCR technology has evolved dramatically over the past decade. Early systems relied on template matching and required clean, high-quality scans. Modern OCR engines use deep learning neural networks that can handle degraded documents, multiple languages, and complex layouts with remarkable accuracy.

The most common use cases for OCR include:

- Making scanned archives searchable and the text selectable
- Extracting data from invoices, receipts, and forms
- Digitizing books and historical records
- Making documents accessible to screen readers

Try our PDF OCR tool to make your scanned PDFs searchable in seconds. For documents that need additional processing, check out our PDF compressor to reduce file sizes after OCR.

How OCR Works

Modern OCR engines process documents through a sophisticated pipeline of image analysis and text recognition. Understanding this process helps you optimize your scans for better results.

Image Preprocessing

Before any text recognition happens, the OCR engine prepares the image:

- Binarization: converting the image to pure black and white so text stands out from the background
- Deskewing: straightening pages scanned at a slight angle
- Noise removal: filtering out specks, stains, and scanner artifacts
- Normalization: adjusting contrast and scaling to the resolution the engine expects

Layout Analysis

The engine must understand document structure before reading text:

- Detecting text blocks, columns, and paragraphs
- Separating text from images, tables, and other graphics
- Determining the correct reading order across the page

Layout analysis is where many OCR systems struggle with complex documents. A two-column academic paper with footnotes and embedded figures requires sophisticated analysis to maintain correct reading order.

Character Segmentation

The engine isolates individual characters or words for recognition. This step handles:

- Touching or overlapping characters
- Characters broken apart by poor print quality
- Variable spacing between letters and words
- Ligatures and connected scripts

Character Recognition

This is where the actual text extraction happens. Modern engines use LSTM (Long Short-Term Memory) neural networks trained on millions of character samples. The network analyzes character shapes, context, and patterns to identify each letter, number, or symbol.

Unlike older template-matching systems, neural networks can handle font variations, degraded text, and unusual character shapes. They learn patterns rather than matching exact templates.

Post-Processing

The final stage improves accuracy through intelligent correction:

- Dictionary lookup to correct implausible words
- Language models that prefer likely word sequences
- Confidence scores that flag uncertain characters for review

Pro tip: The preprocessing stage is where you have the most control. A clean, high-resolution scan with good contrast will always outperform aggressive post-processing of a poor-quality image.

Accuracy Factors That Matter

OCR accuracy varies dramatically based on input quality and document characteristics. Understanding these factors helps you optimize your scanning process and set realistic expectations.

| Factor | Impact Level | Recommendation |
| --- | --- | --- |
| Scan resolution | High | 300 DPI recommended; 200 DPI can suffice for clean text; 400+ DPI for small fonts or degraded documents. |
| Image quality | High | Even lighting, no shadows, flat page (no curve from a book spine). Use a document feeder or flatbed scanner. |
| Font type | Medium-High | Standard fonts (Arial, Times): 98%+ accuracy. Decorative/handwritten: 60-80%. Serif fonts generally easier than sans-serif. |
| Language | Medium | Latin scripts: best support. CJK (Chinese/Japanese/Korean): good. Arabic/Devanagari: improving but less mature. |
| Document age | Medium | Faded ink, yellowed paper, and old typefaces reduce accuracy. Consider manual cleanup for critical historical documents. |
| Layout complexity | Medium | Single column: easy. Multi-column, tables, mixed content: harder; may require manual verification. |
| Skew angle | Low-Medium | Auto-deskew handles up to about 10 degrees. Beyond that, rotate manually before OCR. |
| Background noise | Medium | Watermarks, stamps, and background patterns confuse OCR. Rescan cleanly or apply preprocessing filters. |

Resolution Deep Dive

Scan resolution deserves special attention because it's the single most controllable factor affecting OCR accuracy. Here's what different resolutions mean in practice:

- 150 DPI: generally too low for reliable OCR; expect frequent character errors
- 200 DPI: acceptable for clean, standard-sized print
- 300 DPI: the recommended baseline for most documents
- 400-600 DPI: worth the larger files for small fonts, fine print, or degraded originals

Higher resolution means larger file sizes. A 300 DPI color scan of a letter-sized page is about 25 MB uncompressed. Balance quality needs against storage and processing time.
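
That estimate is just arithmetic, and it's worth sanity-checking for your own page size and resolution. A minimal shell calculation for the letter-size example above:

```shell
# Uncompressed size of an 8.5 x 11 in scan at 300 DPI, 24-bit color.
dpi=300
width_px=$(( 85 * dpi / 10 ))   # 8.5 in -> 2550 px
height_px=$(( 11 * dpi ))       # 11 in  -> 3300 px
bytes=$(( width_px * height_px * 3 ))   # 3 bytes per RGB pixel
echo "$(( bytes / 1000000 )) MB"        # prints: 25 MB
```

Doubling the DPI quadruples the pixel count, which is why 600 DPI scans get large so quickly.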

Quick tip: If you're scanning books, use 400 DPI to compensate for the curved pages near the spine. The distortion at book edges requires extra resolution to maintain accuracy.

OCR Engines Compared

Several OCR engines dominate the open-source and commercial landscape. Each has strengths and weaknesses depending on your use case.

Tesseract OCR

Tesseract is the most popular open-source OCR engine, originally developed by HP and now maintained by Google. It's the default engine for most CLI tools and libraries.

Strengths:

- Free and open source, with an active community
- Supports 100+ languages and scripts
- Runs locally: no per-page costs and no documents leaving your machine
- Easy to script from the command line or embed via libraries

Weaknesses:

- Needs careful image preprocessing to reach its best accuracy
- Struggles with complex layouts, tables, and handwriting
- No official graphical interface

Best for: General-purpose OCR, batch processing, integration into applications, budget-conscious projects.

ABBYY FineReader

ABBYY is the commercial gold standard for OCR accuracy. It's expensive but delivers superior results on challenging documents.

Strengths:

- Class-leading accuracy, especially on degraded or complex documents
- Excellent layout retention for tables, columns, and formatting
- Polished interface with built-in review and correction tools

Weaknesses:

- Commercial licensing, with costs that grow for high-volume processing
- Closed source and less convenient to automate than CLI tools

Best for: Professional document management, legal/medical documents, archival projects with quality requirements.

Google Cloud Vision API

Google's cloud-based OCR service leverages the same technology that powers Google's document scanning features.

Strengths:

- Strong handwriting recognition
- Broad language coverage with automatic language detection
- Scales to any volume with no local setup

Weaknesses:

- Requires an internet connection and uploading documents to the cloud
- Per-page pricing adds up for large collections
- Privacy and compliance concerns for sensitive material

Best for: Applications with internet access, variable document types, projects needing handwriting recognition.

Amazon Textract

AWS's document analysis service focuses on structured data extraction from forms and tables.

Strengths:

- Purpose-built extraction of tables, forms, and key-value pairs
- Returns structured data rather than plain text
- Integrates cleanly with other AWS services

Weaknesses:

- Cloud-only, with AWS account setup required
- Less compelling than general OCR engines for plain prose documents

Best for: Invoice processing, form digitization, AWS-based applications.

| Engine | Cost | Accuracy | Speed | Best Use Case |
| --- | --- | --- | --- | --- |
| Tesseract | Free | Good (90-95%) | Fast | General purpose, batch processing |
| ABBYY FineReader | $199+ | Excellent (98-99%) | Medium | Professional documents, archives |
| Google Cloud Vision | $1.50/1,000 pages | Excellent (96-98%) | Fast | Cloud apps, handwriting |
| Amazon Textract | $1.50/1,000 pages | Very good (95-97%) | Fast | Forms, tables, AWS integration |

Tesseract CLI Guide

Tesseract is the workhorse of open-source OCR. Here's how to use it effectively from the command line.

Installation

On Ubuntu/Debian:

sudo apt update
sudo apt install tesseract-ocr
sudo apt install tesseract-ocr-eng  # English language data

On macOS:

brew install tesseract

On Windows, download the installer from the official GitHub releases page.

Basic Usage

The simplest Tesseract command extracts text from an image:

tesseract input.png output

This creates output.txt with the extracted text. Note that you don't include the .txt extension in the command.
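
For a folder of page images, the same command wraps naturally in a loop. A sketch, assuming tesseract is installed; the scans/ directory name is an illustration, not a convention:

```shell
# OCR every PNG in scans/; each scans/<name>.png becomes scans/<name>.txt.
# "${img%.png}" strips the extension to form the output base name.
for img in scans/*.png; do
    tesseract "$img" "${img%.png}"
done
```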

PDF Output

To create a searchable PDF instead of plain text:

tesseract input.png output pdf

This generates output.pdf with the image and an invisible text layer.

Language Selection

Specify the document language for better accuracy:

tesseract input.png output -l fra  # French
tesseract input.png output -l deu  # German
tesseract input.png output -l spa  # Spanish

For multilingual documents, combine language codes:

tesseract input.png output -l eng+fra  # English and French

List all installed languages:

tesseract --list-langs

Page Segmentation Modes

Tesseract offers different page segmentation modes (PSM) for various document layouts:

tesseract input.png output --psm 3  # Fully automatic (default)
tesseract input.png output --psm 6  # Single uniform block of text
tesseract input.png output --psm 4  # Single column of text

Common PSM values:

- 0: Orientation and script detection (OSD) only
- 1: Automatic page segmentation with OSD
- 3: Fully automatic page segmentation, no OSD (the default)
- 4: Assume a single column of text of variable sizes
- 6: Assume a single uniform block of text
- 7: Treat the image as a single text line
- 8: Treat the image as a single word
- 11: Sparse text; find as much text as possible in no particular order

OCR Engine Mode

Tesseract supports different OCR engines:

tesseract input.png output --oem 1  # LSTM neural network (best)
tesseract input.png output --oem 0  # Legacy engine (faster, less accurate)
tesseract input.png output --oem 2  # Both engines combined

Use --oem 1 for best results with Tesseract 4.0+.

Configuration Variables

Fine-tune recognition with configuration variables:

tesseract input.png output -c tessedit_char_whitelist=0123456789  # Only digits
tesseract input.png output -c preserve_interword_spaces=1  # Keep spacing

Pro tip: For invoices and forms with mostly numbers, use character whitelisting to dramatically improve accuracy. Restricting the character set eliminates ambiguous recognitions like "O" vs "0" or "l" vs "1".
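
When you whitelist digits, a quick grep makes a cheap follow-up check that nothing outside the allowed set slipped through (output.txt here is the file produced by the whitelist command above):

```shell
# Report whether output.txt contains any non-digit, non-whitespace character.
if grep -qv '^[0-9[:space:]]*$' output.txt; then
    echo "suspect characters found; review output.txt" >&2
else
    echo "output is digits-only"
fi
```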

ocrmypdf: The Best CLI Tool

While Tesseract is powerful, ocrmypdf is the tool you actually want to use for PDF OCR. It wraps Tesseract with intelligent preprocessing, PDF handling, and optimization.

Why ocrmypdf?

Raw Tesseract requires you to extract images from PDFs, process them individually, and reassemble the results. ocrmypdf handles all of this automatically:

- Rasterizes each page and runs Tesseract on it
- Positions the invisible text layer precisely over the original image
- Keeps the original image quality and document metadata
- Optionally deskews, rotates, cleans, and optimizes along the way

Installation

pip install ocrmypdf

Or on Ubuntu/Debian:

sudo apt install ocrmypdf

Basic Usage

The simplest command adds OCR to a scanned PDF:

ocrmypdf input.pdf output.pdf

That's it. ocrmypdf rasterizes each page, runs Tesseract (English is assumed unless you pass -l), and writes a searchable PDF. Deskewing and cleanup are opt-in flags, covered below.

Language Selection

ocrmypdf -l fra input.pdf output.pdf  # French
ocrmypdf -l eng+fra input.pdf output.pdf  # English and French

Optimization Options

Control output quality and file size:

ocrmypdf --optimize 3 input.pdf output.pdf  # Aggressive lossy compression
ocrmypdf --optimize 1 input.pdf output.pdf  # Lossless optimization (default)
ocrmypdf --optimize 0 input.pdf output.pdf  # No optimization

For long-term archival, produce a PDF/A-conformant file:

ocrmypdf --output-type pdfa input.pdf output.pdf

Deskew and Rotation

Automatically straighten crooked scans:

ocrmypdf --deskew input.pdf output.pdf

Rotate pages to correct orientation:

ocrmypdf --rotate-pages input.pdf output.pdf

Image Preprocessing

Clean up poor-quality scans:

ocrmypdf --clean input.pdf output.pdf  # Clean pages before OCR; output keeps the original images
ocrmypdf --clean-final input.pdf output.pdf  # Also use the cleaned images in the output PDF

Skip Existing Text

For PDFs with mixed content (some pages already have text):

ocrmypdf --skip-text input.pdf output.pdf

This only processes pages that need OCR, saving time and preserving existing text quality.

Force OCR on All Pages

To OCR even pages that already have text:

ocrmypdf --force-ocr input.pdf output.pdf

Useful when existing text is poor quality or you want to standardize the text layer.

Parallel Processing

Speed up large documents by using multiple CPU cores:

ocrmypdf --jobs 4 input.pdf output.pdf

Use --jobs auto to automatically use all available cores.

Quick tip: For a 100-page document, combining --jobs auto with --optimize 3 can reduce processing time from 10 minutes to under 2 minutes while creating a smaller output file.

Real-World Example

Here's a production-ready command for processing scanned documents:

ocrmypdf \
  --deskew \
  --rotate-pages \
  --clean \
  --optimize 3 \
  --jobs auto \
  --output-type pdfa \
  --skip-text \
  input.pdf output.pdf

This command:

- Straightens skewed pages (--deskew)
- Corrects page orientation (--rotate-pages)
- Removes background noise before recognition (--clean)
- Applies maximum optimization to the output (--optimize 3)
- Uses all available CPU cores (--jobs auto)
- Produces an archival-grade PDF/A file (--output-type pdfa)
- Skips pages that already contain a text layer (--skip-text)

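To apply the same recipe to a whole folder of scans, a small wrapper loop is all it takes. A sketch, assuming ocrmypdf is installed; the in/ and out/ directory names are placeholders:

```shell
# Run the production recipe on every PDF in in/, writing results to out/.
mkdir -p out
for pdf in in/*.pdf; do
    ocrmypdf --deskew --rotate-pages --clean --optimize 3 \
             --jobs auto --output-type pdfa --skip-text \
             "$pdf" "out/$(basename "$pdf")"
done
```

Outputs keep their original file names under out/, so inputs and results stay easy to match up.
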
Practical OCR Workflow

Here's a step-by-step workflow for processing scanned documents, from scanning to final output.

Step 1: Scan with Optimal Settings

Configure your scanner for best results:

- Resolution: 300 DPI (400+ for small print or degraded originals)
- Color mode: grayscale for text-only pages; color only when the document needs it
- Format: a lossless format such as TIFF or PNG, or PDF with minimal compression

Physical scanning tips:

- Keep pages flat; press books firmly or use an edge-scanning flatbed
- Clean the scanner glass to avoid specks and streaks
- Square pages against the scanner guides to minimize skew

Step 2: Inspect and Prepare

Before running OCR, check your scans:

- Pages are present, in order, and correctly oriented
- Text is legible at 100% zoom, with no cut-off edges
- Blank and duplicate pages are removed

If you have multiple files, combine them into a single PDF using our