Reorder PDF using OCR with Python

Categories:   web development  
Tags:   python   ocr   pdf  

Reorder PDF pages using OCR text recognition with Python and Regex.

The full working code is available on GitHub.

Getting Started

This is a simplified version of the script we use to reorder hundreds of PDF pages for our online orders.

This guide assumes you have Python 3 installed, along with pip, virtualenv and git.

In this example, I use regex (regular expressions) to match a phone number on each PDF page, then reorder the pages by those numbers.

If the phone numbers in your PDF aren’t standardized like the ones in this example, the results won’t be perfect, so adjust the regex to suit your data.

Prerequisites

Things you need to install on your workstation:

  • python 3
  • pip
  • virtualenv
  • tesseract
  • poppler

Here are some references for poppler and tesseract:

A useful reference for pdf2image:

  • pdf2image: https://pypi.org/project/pdf2image/

As of this writing, I’ve tested this script using:

  • Windows 10 Pro
  • Python 3.7.4
  • poppler-0.68.0_x86
  • tesseract-ocr-w64-setup-v5.0.0-alpha.20191010.exe

You’ll install the Python modules from requirements.txt in the steps below.

Installing

A step-by-step guide to setting up a development environment.

  • Install poppler on your workstation.

  • Install tesseract on your workstation.

  • Add both to your PATH environment variable.

Next, create a project folder and clone this repo:

mkdir pdfreorder
cd pdfreorder
git clone https://github.com/snowyTheHamster/pdf_reorder_with_ocr.git .

Create a virtual environment and activate it (run only the activate line for your platform):

python -m virtualenv .venv
. .venv/scripts/activate # for windows
. .venv/bin/activate # for mac/linux

Install the required modules using pip:

pip install -r requirements.txt

Now set the full paths to poppler and tesseract in the start.py file (details in Code Explanation below).

Running the tests

I included a sample.pdf file.

To test, run:

python start.py

Project Structure

  • input_pdf_here: Put your PDF file here; it must have a .pdf extension.
  • output_is_here: This is where the reordered PDF will be saved.
  • start.py: Our Python script.
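
As a rough illustration of how these folders could be wired together, the sketch below collects the PDF files in input_pdf_here and derives a matching path under output_is_here. The helper names find_input_pdfs and output_path_for, and the "_reordered" suffix, are assumptions made for this example and may not match what start.py actually does.

```python
from pathlib import Path

# Folder names taken from the project structure above.
INPUT_DIR = Path("input_pdf_here")
OUTPUT_DIR = Path("output_is_here")


def find_input_pdfs(folder=INPUT_DIR):
    """Return the .pdf files in `folder`, sorted by name (hypothetical helper)."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    return sorted(p for p in folder.iterdir() if p.suffix.lower() == ".pdf")


def output_path_for(pdf_path, folder=OUTPUT_DIR):
    """Build the output path; the '_reordered' suffix is an assumption."""
    return Path(folder) / f"{Path(pdf_path).stem}_reordered.pdf"
```

With the included sample, output_path_for("input_pdf_here/sample.pdf") gives output_is_here/sample_reordered.pdf.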

Breakdown of the process

This example uses regular expressions to find phone numbers for sorting the pages.

Results may not be perfect if the phone numbers aren’t standardized.

  • The script saves the PDF page order in an array.
  • It converts the PDF file to jpg files.
  • It uses OCR to grab the text from the jpgs.
  • It reorders the jpgs based on a user-defined regex match.
  • It generates a new PDF with the updated page order.

Some documentation

Code Explanation

To make changes, edit the start.py file.

The script is divided into three main parts.

Part #1 : Converting PDF to images

The PDF file is converted into jpgs.

Temporary jpg and txt files are generated for each PDF page.

You can edit the parameters passed to convert_from_path for more options.

More info is available in the pdf2image documentation linked above.

Note: You may need to pass the full path to poppler on some workstations:

pages = convert_from_path(PDF_file, 500, poppler_path="C:\\poppler-0.68.0\\bin")

Note 2: On Linux and macOS, use / instead of \ in file paths.
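
To make the conversion step more concrete, here is a minimal sketch of rendering each page to its own jpg with pdf2image. The function names, the page_0001.jpg naming scheme and the output folder are assumptions for illustration; only convert_from_path and its dpi and poppler_path parameters come from pdf2image itself.

```python
from pathlib import Path


def page_image_name(page_number):
    """File name for a page image, e.g. page_0001.jpg (naming is an assumption)."""
    return f"page_{page_number:04d}.jpg"


def pdf_to_jpgs(pdf_path, out_dir, dpi=500, poppler_path=None):
    """Render each PDF page to a JPEG and return the saved paths in page order."""
    # Imported here so the naming helper works without pdf2image/poppler installed.
    from pdf2image import convert_from_path

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # poppler_path=None lets pdf2image look for poppler on PATH.
    pages = convert_from_path(str(pdf_path), dpi=dpi, poppler_path=poppler_path)

    saved = []
    for number, page in enumerate(pages, start=1):
        target = out_dir / page_image_name(number)
        page.save(target, "JPEG")  # each page is a PIL image
        saved.append(target)
    return saved
```

Zero-padding the page number keeps the files in page order when they are sorted by name. A lower dpi than 500 converts faster, but may make the OCR less reliable.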

Part #2 : Recognizing text from the images using OCR

OCR will extract text from the jpg files to txt files.

We then use regex to search for phone numbers.

If the regex matches a phone number, the script adds it to an array.

If the regex doesn’t match a phone number, the script prepends a large number to the entry before adding it to the array.

Finally, the script reorders the array using the phone number as the sort key.
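
The matching and sorting logic can be sketched with the standard library alone. This is not the code from start.py: the phone pattern (555-123-4567 style) is a hypothetical format, and instead of prepending a large number, this variant uses a tuple sort key whose first element pushes unmatched pages to the end.

```python
import re

# Hypothetical pattern: assumes numbers look like 555-123-4567.
PHONE_RE = re.compile(r"\b(\d{3})-(\d{3})-(\d{4})\b")


def sort_key(page_index, text):
    """Build a sort key for one page from its OCR text."""
    match = PHONE_RE.search(text)
    if match:
        # Matched pages sort first, ordered by the digits of the phone number.
        return (0, "".join(match.groups()), page_index)
    # Unmatched pages sort last and keep their original relative order.
    return (1, "", page_index)


def reorder(page_texts):
    """Return the original page indexes in their new order."""
    keys = [sort_key(i, text) for i, text in enumerate(page_texts)]
    return [key[2] for key in sorted(keys)]


texts = ["Order for 555-987-6543", "no number here", "Tel: 555-123-4567"]
print(reorder(texts))  # [2, 0, 1]
```

The returned indexes can then be used to pick the page images in their new order.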

Note: You may need to set the full path to tesseract on some workstations:

pytesseract.pytesseract.tesseract_cmd = 'C:\\Users\\<username>\\AppData\\Local\\Tesseract-OCR\\tesseract.exe'

Note 2: On Linux and macOS, use / instead of \ in file paths.
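
Here is a small sketch of the OCR step, assuming pytesseract and Pillow are installed. The helper names are made up for this example; it prefers a tesseract executable found on PATH and only falls back to a hard-coded full path when none is found.

```python
import shutil


def resolve_executable(name, fallback):
    """Prefer an executable found on PATH; otherwise use the configured full path."""
    return shutil.which(name) or fallback


def ocr_image(image_path, tesseract_fallback):
    """Return the text Tesseract recognizes in one image."""
    # Imported here so resolve_executable can be used without the OCR packages.
    import pytesseract
    from PIL import Image

    pytesseract.pytesseract.tesseract_cmd = resolve_executable(
        "tesseract", tesseract_fallback
    )
    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image)
```

The string returned by ocr_image is what the regex is run against, and it can also be written to a txt file per page.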

Part #3 : Generating the reordered PDF

The script will generate a new PDF with the updated page order.

The script will then delete all temporary txt and jpg files.
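
One way to do this last step is with Pillow, which can write several images into a single PDF. This is a sketch under that assumption and the script may use a different library; the helper names and the folder argument are hypothetical.

```python
from pathlib import Path


def images_to_pdf(image_paths, output_pdf):
    """Write the images, in the given order, as pages of a single PDF."""
    if not image_paths:
        raise ValueError("no images to write")
    # Imported here so the cleanup helper below works without Pillow installed.
    from PIL import Image

    # All pages are held in memory at once, which can be heavy at high dpi.
    pages = [Image.open(p).convert("RGB") for p in image_paths]
    pages[0].save(output_pdf, "PDF", save_all=True, append_images=pages[1:])


def remove_temp_files(folder, suffixes=(".jpg", ".txt")):
    """Delete temporary files with the given suffixes; return how many were removed."""
    removed = 0
    for path in Path(folder).iterdir():
        if path.is_file() and path.suffix.lower() in suffixes:
            path.unlink()
            removed += 1
    return removed
```

Only call remove_temp_files on a folder that holds nothing but temporary files, since it deletes every jpg and txt file it finds there.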
