How to reorder PDFs using OCR with Python

Categories:   web development  
Tags:   python  

PDF files can be quite difficult to manipulate as there are various encodings. Sometimes, we can’t get the values directly from a PDF file.

In such cases, we can convert the PDF into images and use OCR to detect the text instead.

Using regex (regular expressions), we can match patterns in this text to build fun manipulation and automation tools.

In this post I’ll demonstrate how to get started with tools like pdf2image, PyPDF2 & pytesseract using Python.

This post will only provide simple examples. You’ll need to adjust the files and folder paths yourself.

Here are the Python libraries we need, plus some extras:

from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os, glob, re
from os import listdir
from PyPDF2 import PdfFileWriter, PdfFileReader

We also need to install the following software on our workstation:

  • Ghostscript
  • Poppler
  • Tesseract OCR

After installation, check that they are on the PATH.

Example for Windows:

C:\Program Files\gs\gs9.27\bin
C:\Program Files\poppler-0.68.0\bin
C:\Program Files\Tesseract-OCR\
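
One quick way to confirm this from Python is to look up the executables with shutil.which. This is only a minimal sketch: the helper name find_tools is made up for illustration, and the executable names used below (tesseract, pdftoppm from Poppler, and gswin64c for Ghostscript on 64-bit Windows) are assumptions about a typical install that may differ on your machine.

```python
import shutil


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Assumed executable names; adjust them to match your installation.
    for name, path in find_tools(["tesseract", "pdftoppm", "gswin64c"]).items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

Any tool reported as not found probably needs its bin folder added to the PATH, followed by a restart of the terminal.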

How to reorder/remove PDF pages

Here is how to reorder or remove pages in a PDF file:

from PIL import Image 
import pytesseract 
import sys 
from pdf2image import convert_from_path 
import os, glob, re
from os import listdir
from PyPDF2 import PdfFileWriter, PdfFileReader

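# the second positional argument of PdfFileReader is `strict`, not a file mode, so we pass only the path
# note: PyPDF2 3.0+ removed these names in favour of PdfReader / PdfWriter
infile = PdfFileReader('my_file.pdf')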
output = PdfFileWriter()

# get the number of pages in the PDF
pages = infile.getNumPages() # we can make cool stuff with this info

keep_pages = [2,5,1] # page numbers start from 0

for i in keep_pages:
    p = infile.getPage(i)
    output.addPage(p)

with open('my_new_file.pdf', 'wb') as f:
    output.write(f)

That was a very quick way to manually rearrange pages in a PDF. The infile.getNumPages() value gives us the total number of pages in the PDF file. With this info, we can build loops that, for example, exclude odd or even pages or reverse the order.
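
As a rough sketch of such loops, the helpers below build lists of 0-based page indices that could be used in place of keep_pages. The function names reversed_order and every_other are hypothetical, and num_pages stands in for the value returned by infile.getNumPages().

```python
def reversed_order(num_pages):
    """Return all page indices from last to first."""
    return list(range(num_pages - 1, -1, -1))


def every_other(num_pages, start=0):
    """Return every second page index, beginning at `start` (0 or 1)."""
    return list(range(start, num_pages, 2))


if __name__ == "__main__":
    num_pages = 6  # stand-in for infile.getNumPages()
    print(reversed_order(num_pages))   # [5, 4, 3, 2, 1, 0]
    print(every_other(num_pages))      # [0, 2, 4]
    print(every_other(num_pages, 1))   # [1, 3, 5]
```

Because indices start at 0, every_other(num_pages) keeps the 1st, 3rd, 5th... pages as a human would count them, and every_other(num_pages, 1) keeps the 2nd, 4th, 6th...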

Note that instead of editing the original file, we save the changes as a new file in case we screw up.

How to convert PDF into Images

from PIL import Image 
import pytesseract 
import sys 
from pdf2image import convert_from_path 
import os, glob, re
from os import listdir
from PyPDF2 import PdfFileWriter, PdfFileReader

pages = convert_from_path('my_file.pdf', dpi=500, fmt='jpg') # documentation here: https://github.com/Belval/pdf2image

image_counter = 1

for page in pages:
    # PDF page n -> page_n.jpg
    filename = "page_"+str(image_counter)+".jpg"
    print(f"saving pdf page as jpg: {filename}")
    # Save the image of the page to disk
    page.save(filename, 'JPEG')
    print(f"conversion done for: {filename}")
    # Increment the counter to update the filename
    image_counter = image_counter + 1

This will convert the PDF file into images. The loop above will create an image for each page.

You can adjust the convert_from_path parameters to change the resolution, image format and so on.

Reducing the resolution will speed things up but can also affect the results you get with text detection.

How to get Text from Images using OCR

filelimit = image_counter-1

# Run OCR on each page image and write the result to a matching text file
for i in range(1, filelimit + 1):
    print(f"text recognition for image #:{i}")
    print(f"generating text file for image #:{i}")
    filename = "page_"+str(i)+".jpg"
    outfile = "page_"+str(i)+".txt"

    # Recognize the text in the image as a string using pytesseract
    text = pytesseract.image_to_string(Image.open(filename))
    # Rejoin words that were hyphenated across a line break
    text = text.replace('-\n', '')

    # The with block closes the file even if an error occurs
    with open(outfile, "w", encoding="utf-8") as f:
        f.write(text)

What this does:

  • We detect the text in each image.
  • We write the results to text files.

So 50 images will yield 50 text files. Image resolution can affect results.

From here on we can use loops & regex to match patterns and manipulate the data however we want.
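
As one possible starting point, the sketch below reads the page_n.txt files back in numeric order (a plain alphabetical sort would put page_10 before page_2) and lists the pages whose text matches a pattern. The helper names load_page_texts and pages_matching are hypothetical, and the pattern in the usage lines is only a placeholder for whatever your documents contain.

```python
import re
from pathlib import Path


def load_page_texts(folder="."):
    """Read every page_<n>.txt in `folder` and return {n: text}, ordered by n."""
    texts = {}
    for path in Path(folder).glob("page_*.txt"):
        match = re.fullmatch(r"page_(\d+)\.txt", path.name)
        if match:
            # errors="replace" avoids a crash if a file was written in another encoding
            texts[int(match.group(1))] = path.read_text(encoding="utf-8", errors="replace")
    return dict(sorted(texts.items()))


def pages_matching(texts, pattern):
    """Return the page numbers whose text contains a match for `pattern`."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [n for n, text in texts.items() if regex.search(text)]


if __name__ == "__main__":
    texts = load_page_texts(".")
    # Placeholder pattern: the word "invoice" followed by a number.
    print(pages_matching(texts, r"invoice\s+\d+"))
```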

If you store the initial page order of the PDF in an array, it’s also possible to change its order based on what you do next.
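
To illustrate, here is a minimal sketch that derives a new page order from the OCR text. It assumes each page carries a printed label such as "Page 3"; both that assumption and the function name order_by_printed_number are made up for the example, so swap in a pattern that fits your own documents.

```python
import re


def order_by_printed_number(texts, pattern=r"Page\s+(\d+)"):
    """Return original 0-based page indices sorted by the number found in each text.

    `texts` is a list of OCR strings, one per page, in the PDF's current order.
    Pages where the pattern is not found keep their relative order at the end.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    found, missing = [], []
    for index, text in enumerate(texts):
        match = regex.search(text)
        if match:
            found.append((int(match.group(1)), index))
        else:
            missing.append(index)
    return [index for _, index in sorted(found)] + missing


if __name__ == "__main__":
    ocr_texts = ["Summary\nPage 3", "Page 1\nIntroduction", "blank scan", "page 2 details"]
    print(order_by_printed_number(ocr_texts))  # [1, 3, 0, 2]
```

The resulting list uses 0-based indices, so it can be used directly as keep_pages in the reordering snippet from earlier in this post.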

Though not a perfect solution, it’s still better than sorting hundreds of PDF files by hand.
