Image To Text For Scanned PDFs
Performing text analysis on PDF files is not a complicated task. Load a PDF library (such as PDFBox), open the file, extract the text with PDFTextStripper, match patterns against a set of predefined rules, assign metadata based on the match results, then rinse and repeat until the entire batch is processed.
This works great on files created programmatically or by end-users. The libraries extract and decode binary sections marked as /Subtype /Text while ignoring the images.
But if the PDF file is produced by scanning a document (a common scenario for digitization), the same extraction code will produce an empty string (or just a few line breaks). That's because a scanned PDF document is actually an image containing no /Text blocks for libraries to process.
So if the extraction code does not produce the desired results, open the PDF in any text editor, skip through the binary blocks, and verify the absence of /Subtype /Text. A scanned PDF document has only the /Image subtype:
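A small check can flag such files automatically once the text has been extracted. This is a minimal sketch: the function name looks_scanned and the min_chars threshold are illustrative assumptions, not part of any PDF library.

```python
def looks_scanned(extracted_text: str, min_chars: int = 20) -> bool:
    """Return True when the extracted text is empty or nearly empty.

    min_chars is an arbitrary, hypothetical threshold; tune it for your batch.
    """
    # Drop all whitespace (spaces, tabs, line breaks) and count what is left.
    visible = "".join(extracted_text.split())
    return len(visible) < min_chars


print(looks_scanned("\n\n\r\n"))  # only line breaks
print(looks_scanned("Invoice number 2020-118, issued to ACME Corp."))
```

Files flagged this way are candidates for the OCR route; a short but genuine text PDF could also trip the threshold, so treat the result as a hint.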
/Subtype /Image
Another clue is to look at the value of Producer. For example:
/Producer (DotImage PDF Encoder)
DotImage Encoder is a library that converts image files into PDF format.
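The manual inspection can be approximated in Python by scanning the raw bytes of the file. This is only a sketch: inspect_pdf_bytes is a hypothetical helper, and it assumes the markers appear uncompressed in the file. Some PDFs keep these entries inside compressed object streams or encode the Producer value differently, in which case a raw scan finds nothing.

```python
import re

# Tolerate optional whitespace between the key and the name.
IMAGE_MARKER = re.compile(rb"/Subtype\s*/Image")
PRODUCER = re.compile(rb"/Producer\s*\(([^)]*)\)")


def inspect_pdf_bytes(data: bytes) -> dict:
    """Count image markers and pull out the Producer string, if visible."""
    producer = PRODUCER.search(data)
    return {
        "image_objects": len(IMAGE_MARKER.findall(data)),
        "producer": producer.group(1).decode("latin-1") if producer else None,
    }


# Synthetic sample that mimics the two clues described above.
sample = (
    b"1 0 obj << /Type /XObject /Subtype /Image >> endobj\n"
    b"2 0 obj << /Producer (DotImage PDF Encoder) >> endobj"
)
print(inspect_pdf_bytes(sample))
```

For a real file, the bytes could be read with open(file_path, "rb").read() and passed to the same function.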
To perform text analysis on scanned files, the image needs to be converted to text with an optical character recognition (OCR) library.
The following instructions are Mac-specific and written for Tesseract OCR and Python.
Python setup
- Make sure that brew is installed. Check by typing brew in the terminal. If the command is not recognized, head over to https://brew.sh and copy the install command into the terminal.
- Install pyenv
brew install pyenv
- Install the latest Python (check Python.org for the latest version)
pyenv install 3.9.0
- Set pyenv global to the downloaded version
pyenv global 3.9.0
- Add config option for .zsh terminal (I use zsh; the command would be different for bash)
echo -e 'if command -v pyenv 1>/dev/null 2>&1; then\n eval "$(pyenv init -)"\nfi' >> ~/.zshrc
- Restart the terminal and test the setup. If you are running VS Code, remember to restart the IDE as well and to use the zsh terminal. The reported path should end with .pyenv/shims/python
which python
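The same check can be made from inside Python, which helps when an IDE picks its own interpreter. This is a rough sketch: is_pyenv_python is a hypothetical helper, and it assumes a pyenv-managed interpreter always lives under a .pyenv directory.

```python
import sys
from pathlib import Path


def is_pyenv_python(executable: str) -> bool:
    """True when the interpreter path passes through a .pyenv directory."""
    return ".pyenv" in Path(executable).parts


# Show the version and location of the interpreter running this script.
print(sys.version.split()[0], sys.executable)
print("pyenv-managed:", is_pyenv_python(sys.executable))
```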
OCR, Image Processing, and PDF Libraries
- Install Tesseract OCR (open source OCR library) with brew
brew install tesseract
- Install Python library for tesseract
pip install pytesseract
- We also need the imaging library PIL. Instead of installing PIL, install Pillow, which is a pip-friendly fork
pip install Pillow
- To convert a PDF to PIL images, install the pdf2image library, which is a Python wrapper for pdftoppm and pdftocairo (PDF rendering tools that ship with poppler)
pip install pdf2image
- The wrapper only calls poppler's tools (pdftoppm, pdftocairo), so poppler itself needs to be installed too.
brew install poppler
At this point, the setup is complete.
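Before moving on, a quick sanity check can confirm that every piece is reachable. This is a sketch under assumptions: check_setup is a hypothetical helper, and it only tests that the command-line tools are on PATH and that the Python packages can be found, not that they work correctly.

```python
import importlib.util
import shutil


def check_setup() -> dict:
    """Report which command-line tools and Python packages are available."""
    tools = ("tesseract", "pdftoppm", "pdftocairo")  # installed with brew
    modules = ("pytesseract", "PIL", "pdf2image")    # installed with pip
    report = {name: shutil.which(name) is not None for name in tools}
    report.update(
        {name: importlib.util.find_spec(name) is not None for name in modules}
    )
    return report


for name, ok in check_setup().items():
    print(f"{name:12} {'ok' if ok else 'MISSING'}")
```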
Python Code
The processing logic is simple: convert each PDF page into a PIL image, then extract the text from each image with pytesseract.
- Import packages
from pytesseract import image_to_string
from pdf2image import convert_from_path
from PIL import Image
import tempfile
To process a single file, create a PIL Image list (one image per page).
with tempfile.TemporaryDirectory() as path:
    imgs = convert_from_path(file_path, output_folder=path)
If only a few tiny files are processed, output_folder can be skipped; but for large files, output_folder is more efficient from a memory standpoint. Using tempfile ensures that all resources are cleaned up upon exit.
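For very long documents, another option is to convert a limited range of pages at a time, since convert_from_path also accepts first_page and last_page. This is a sketch of the chunking only: page_ranges and chunk_size are hypothetical names, and the total page count is assumed to be known already.

```python
def page_ranges(total_pages: int, chunk_size: int = 10):
    """Yield (first_page, last_page) pairs, 1-based and inclusive."""
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


print(list(page_ranges(25, 10)))  # [(1, 10), (11, 20), (21, 25)]

# Each pair could then be passed along, for example:
#   convert_from_path(file_path, first_page=first, last_page=last,
#                     output_folder=path)
```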
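Now the imgs list can be processed with pytesseract to convert the images to text. When output_folder is used, it is safest to run this loop inside the with block, while the temporary directory still exists.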
for img in imgs:
    textstring = str(image_to_string(img))  # text of the current page
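Put together, the conversion and OCR steps might look like the sketch below. The function name pdf_to_page_texts and its convert and ocr parameters are assumptions added for illustration: they default to the real pdf2image and pytesseract functions but can be swapped for stand-ins, so the flow can be exercised without Tesseract installed.

```python
import tempfile


def pdf_to_page_texts(file_path, convert=None, ocr=None):
    """Return a list with the OCR text of each page of a scanned PDF."""
    # Import lazily so the function can be defined without the libraries.
    if convert is None:
        from pdf2image import convert_from_path as convert
    if ocr is None:
        from pytesseract import image_to_string as ocr

    texts = []
    with tempfile.TemporaryDirectory() as path:
        # OCR runs while the temporary page images still exist on disk.
        for img in convert(file_path, output_folder=path):
            texts.append(str(ocr(img)))
    return texts


# Typical call, assuming the setup above is complete:
#   page_texts = pdf_to_page_texts("scanned.pdf")
```

Keeping one string per page, rather than overwriting a single variable, makes it possible to report which page a later match came from.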
Further manipulations can be done with textstring to perform text analysis (like str.find or re.search).
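As an illustration of that last step, a set of predefined rules can be applied to the OCR output with re.search. This is a minimal sketch: the rule labels, the patterns, and the tag_text helper are all hypothetical examples, and real OCR text is often noisy enough to need more forgiving patterns.

```python
import re

# Hypothetical rules: label -> compiled pattern.
RULES = {
    "invoice": re.compile(r"\binvoice\b", re.IGNORECASE),
    "has_date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def tag_text(textstring: str, rules=RULES) -> dict:
    """Return which rules match the given text."""
    return {
        label: bool(pattern.search(textstring))
        for label, pattern in rules.items()
    }


print(tag_text("INVOICE\nIssued 2020-11-05"))
```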