Making PDFs searchable

I finally took my bookkeeping paperless and now scan paper invoices with the Microsoft Lens app on my phone. The app works great but doesn’t do OCR, so the text in the scans is not searchable.

OCR PDFs

Scanned PDFs can be made searchable by adding a layer of text over the background image. The text layer positions the characters over the image, so it looks like you can select the words in the image by highlighting them. To extract the text from an image we need an OCR program, which tries to guess the characters found in the image. Tesseract is one such program, and it is free and open source.

Tesseract cannot read PDFs, nor does it clean the images before attempting to extract the text. Scanned PDFs are often skewed, with black bands on the sides, which makes it harder for Tesseract to interpret the text. pdfsandwich is a command-line tool that combines unpaper (to straighten and clean the pages) and Tesseract in a single package. The end result is a “sandwiched PDF” with the scanned image as the bottom layer and the OCRed text on top.

It’s important to specify the language of the document as it greatly improves the text recognition. Run pdfsandwich with the -list_langs flag to list the available language codes (see the Tesseract documentation to install extra languages). Pass these three-letter codes to pdfsandwich with the -lang flag. If a document mixes several languages, join the codes with a plus sign.

$ pdfsandwich -lang eng+nld document.pdf

The result is stored in a new file, ‘document_ocr.pdf’ (pass the -o flag to choose a different name). The -rgb flag preserves color (careful though, the documentation mentions it could cause problems with some color spaces).
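
The flags from this section can be combined in a couple of small helpers. This is only a sketch: the function names langs_available and ocr_color are made up for illustration, and the language check assumes that -list_langs prints one language code per line (any header line is simply ignored).

```shell
#!/bin/bash
# Succeed only if every code in a '+'-joined list (e.g. eng+nld) appears
# as a whole line in the language list read from standard input.
langs_available() {
  available="$(cat)"
  for code in $(printf '%s' "$1" | tr '+' ' '); do
    if ! printf '%s\n' "$available" | grep -qxF "$code"; then
      echo "language not installed: $code" >&2
      return 1
    fi
  done
}

# Hypothetical helper: OCR one PDF in color, writing to an explicit output
# name. Without a third argument it falls back to <name>_ocr.pdf, the same
# name pdfsandwich picks by default.
ocr_color() {
  langs="$1"
  input="$2"
  output="${3:-${input%.pdf}_ocr.pdf}"
  pdfsandwich -rgb -lang "$langs" -o "$output" "$input"
}
```

A possible call would be: pdfsandwich -list_langs | langs_available eng+nld && ocr_color eng+nld document.pdf. If the language list arrives on standard error on your system, add 2>&1 before the pipe.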

Scripting

This works fine for a single file, but we can do better. This little bash script converts all PDF files in a folder. Since we don’t want to OCR PDF files we already converted, we check whether the PDF file includes fonts. If it does, we know the PDF contains some sort of text and we don’t spend time converting it again. The script writes the result back over the original file, so a converted PDF is skipped on the next run.

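  #!/bin/bash
  # OCR every PDF in the current folder that does not yet contain text.
  for file in *.pdf
  do
    # Skip the literal "*.pdf" pattern when the folder has no PDFs.
    [ -e "$file" ] || continue
    # A FontName entry means the PDF already has a text layer.
    if ! strings "$file" | grep -q 'FontName'; then
      # Write the result over the original file.
      pdfsandwich -lang eng+nld "$file" -o "$file"
    fi
  done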
  

The script can now run as a cron job to periodically OCR all scanned invoices.
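
OCR can be slow, so with a short cron interval two runs could overlap and work on the same files. One way to guard against that is a small lock wrapper, sketched below. The function name run_locked, the .ocr.lock directory and all paths in the example crontab line are assumptions for illustration.

```shell
#!/bin/bash
# Example crontab entry (hypothetical paths), running once an hour:
#   0 * * * * /home/me/bin/ocr-invoices.sh >> /home/me/ocr.log 2>&1

# Run a command inside a directory, refusing to start while a previous
# run still holds the lock.
run_locked() {
  dir="$1"
  shift
  lock="$dir/.ocr.lock"
  # mkdir is atomic: it fails if the lock directory already exists.
  if ! mkdir "$lock" 2>/dev/null; then
    echo "already running in $dir" >&2
    return 1
  fi
  # Run the command in a subshell so the caller's directory is unchanged.
  ( cd "$dir" && "$@" )
  status=$?
  rmdir "$lock"
  return "$status"
}
```

The cron script could then call something like run_locked /home/me/invoices /home/me/bin/ocr-folder.sh. If a run is killed before it reaches rmdir, the lock directory stays behind and has to be removed by hand.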
