Image To Text For Scanned PDFs

Performing text analysis on PDF files is not complicated. Load a PDF library (such as Apache PDFBox), open the file, extract the text with PDFTextStripper, match it against a set of predefined rules, and assign metadata based on the results. Repeat until the entire batch is processed.
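
The match-and-tag step can be sketched in Python once the text has been extracted. This is a minimal illustration, not the original pipeline: the rule names, the patterns, and the tag_text and tag_batch helpers are all hypothetical.

```python
import re

# Hypothetical rules: metadata label -> regular expression to look for.
RULES = {
    "invoice": re.compile(r"\binvoice\s+(?:no\.?|number)\s*[:#]?\s*\w+", re.IGNORECASE),
    "contract": re.compile(r"\bthis\s+agreement\b", re.IGNORECASE),
}

def tag_text(text, rules=RULES):
    """Return the sorted labels whose pattern matches anywhere in the text."""
    return sorted(label for label, pattern in rules.items() if pattern.search(text))

def tag_batch(texts_by_file, rules=RULES):
    """Map each file name to its labels; texts_by_file holds already extracted text."""
    return {name: tag_text(text, rules) for name, text in texts_by_file.items()}
```

An empty label list for a file is a hint that the extraction returned nothing useful, which is the situation described next.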

This works well for files created programmatically or exported from an application by end users. The libraries decode the page content streams and extract the text drawn there with text operators and fonts (declared as /Type /Font), while ignoring the images.

But if the PDF file is produced by scanning a document (a common scenario in digitization), the same extraction code will produce an empty string (or just a few line breaks). That's because each page of a scanned PDF is actually an image, with no text content for the libraries to process.

So if the extraction code does not produce the desired results, open the PDF in any text editor, skip through the binary blocks, and check for font entries (/Font). A scanned PDF without an OCR text layer typically has none; its pages contain only image objects:

/Subtype /Image

Another clue is to look at the value of Producer. For example:

/Producer (DotImage PDF Encoder)

DotImage Encoder is a library that converts image files into PDF format.
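
The same inspection can be automated in Python. This is a minimal sketch, assuming the file fits in memory and that the relevant dictionaries are stored uncompressed (objects packed into compressed object streams will not show up in a raw scan). The function names are hypothetical, and treating the absence of /Font entries as a sign of a missing text layer is an assumption added here, not a guarantee.

```python
import re

def inspect_pdf_bytes(raw):
    """Collect rough hints from the raw bytes of a PDF file."""
    producer = re.search(rb"/Producer\s*\((.*?)\)", raw, re.DOTALL)
    return {
        "has_image": re.search(rb"/Subtype\s*/Image\b", raw) is not None,
        # Assumption: pages with extractable text declare font resources.
        "has_font": re.search(rb"/Font\b", raw) is not None,
        "producer": producer.group(1).decode("latin-1") if producer else None,
    }

def looks_scanned(raw):
    """Heuristic: images are present but no font resources are declared."""
    hints = inspect_pdf_bytes(raw)
    return hints["has_image"] and not hints["has_font"]
```

To try it on a real file, read the file in binary mode and pass the bytes, for example looks_scanned(open(file_path, "rb").read()).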

To perform text analysis on scanned files, the image needs to be converted to text with an optical character recognition (OCR) library.

The following instructions are Mac-specific and written for Tesseract OCR and Python.

Python Setup

Make sure that Homebrew (brew) is installed. Check by typing brew in the terminal. If the command is not recognized, head over to https://brew.sh and copy the install command into the terminal.

Install pyenv

brew install pyenv

Install the latest Python (check python.org for the latest version; 3.9.0 is used here)

pyenv install 3.9.0

Set pyenv global to the downloaded version

pyenv global 3.9.0

Add the pyenv initialization to the zsh config (I use zsh; for bash, append to ~/.bash_profile instead)

echo -e 'if command -v pyenv 1>/dev/null 2>&1; then\n  eval "$(pyenv init -)"\nfi' >> ~/.zshrc

Restart the terminal and test the setup. If you are running VS Code, remember to restart the IDE as well and to use the zsh terminal.

which python 
.pyenv/shims/python

OCR, Image Processing, and PDF Libraries

Install Tesseract OCR (open source OCR library) with brew

brew install tesseract

Install the Python wrapper for Tesseract

pip install pytesseract

We also need the imaging library PIL. Instead of installing PIL, install Pillow, which is a pip-friendly fork.

pip install Pillow

To convert PDF pages to PIL images, install the pdf2image library, a Python wrapper for pdftoppm and pdftocairo (command-line utilities from the Poppler PDF rendering library)

pip install pdf2image

The wrapper calls the Poppler utilities (pdftoppm and pdftocairo), so Poppler needs to be installed as well.

brew install poppler

At this point, the setup is complete.
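
A quick way to confirm this from Python is to check that the command-line tools are on the PATH and that the packages are importable. This is a small sketch; check_ocr_setup is a hypothetical helper, and it only tests visibility, not whether the tools actually run correctly.

```python
import importlib.util
import shutil
import sys

def check_ocr_setup():
    """Report which pieces of the OCR toolchain are visible to this interpreter."""
    report = {"python": sys.executable}
    # Command-line tools installed with brew: Tesseract and Poppler's pdftoppm.
    for tool in ("tesseract", "pdftoppm"):
        report[tool] = shutil.which(tool) is not None
    # Python packages installed with pip (Pillow is imported as PIL).
    for module in ("pytesseract", "pdf2image", "PIL"):
        report[module] = importlib.util.find_spec(module) is not None
    return report

if __name__ == "__main__":
    for name, status in check_ocr_setup().items():
        print(f"{name}: {status}")
```

Any entry reported as False points to the install step that needs to be repeated. The python entry should show a path under .pyenv/shims if pyenv is active.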

Python Code

The logic for processing is simple: convert each PDF page into a PIL image, then extract text from each image with pytesseract.

Import packages

from pytesseract import image_to_string
from pdf2image import convert_from_path
from PIL import Image
import tempfile

To process a single file, create a PIL Image list (one image per page).

with tempfile.TemporaryDirectory() as path:
    imgs = convert_from_path(file_path, output_folder=path)

If only a few tiny files are processed, output_folder can be skipped; but for large files, output_folder is more efficient from a memory standpoint. Using tempfile ensures that all resources are cleaned up upon exit.

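The imgs list can now be processed with pytesseract to convert the images to text. Keep this loop inside the with block: with output_folder set, the page images may still be read from files in the temporary directory, which is removed when the block exits.

for img in imgs:
    # image_to_string returns the recognized text as a str
    textstring = image_to_string(img)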
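
Put together, the conversion and the OCR loop fit into one function. This is a sketch under a few assumptions: ocr_pdf is a hypothetical name, and the convert and ocr parameters are added here so the function can be exercised with stand-ins when Poppler or Tesseract is not available. By default it uses convert_from_path and image_to_string as above.

```python
import tempfile

def ocr_pdf(file_path, convert=None, ocr=None):
    """Return one OCR text string per page of the PDF at file_path."""
    # Imported lazily so the function can be exercised with stand-ins.
    if convert is None:
        from pdf2image import convert_from_path as convert
    if ocr is None:
        from pytesseract import image_to_string as ocr
    pages = []
    with tempfile.TemporaryDirectory() as path:
        # Run OCR while the temporary directory, and the page images in it, still exist.
        for img in convert(file_path, output_folder=path):
            pages.append(ocr(img))
    return pages
```

Returning a list with one string per page keeps the page boundaries, which helps when a match needs to be traced back to a specific page.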

Further manipulations can be done with textstring to perform text analysis (like str.find or re.search).
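
OCR output usually keeps the line breaks of the scanned page, so a phrase may be split across lines. The sketch below normalizes the text before searching. The sample string is made up, normalize_ocr_text is a hypothetical helper, and rejoining words at a trailing hyphen is an assumption that can wrongly merge genuinely hyphenated words.

```python
import re

def normalize_ocr_text(text):
    """Rejoin words hyphenated across line breaks and collapse whitespace."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return re.sub(r"\s+", " ", text).strip()

# Made-up sample standing in for the textstring of one page.
sample = "Total  amount\ndue: 1,250.00\nPay-\nment terms: 30 days"
clean = normalize_ocr_text(sample)

# str.find returns the index of the first occurrence, or -1 if absent.
has_terms = clean.find("Payment terms") != -1
# re.search captures the figure that follows the label.
match = re.search(r"amount due:\s*([\d,]+\.\d{2})", clean, re.IGNORECASE)
amount = match.group(1) if match else None
```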