Reading a PDF file using Python

Apr 17, 2022

In this article, I explain a strategy for reading a PDF file using Python and its libraries.

There are many ways to read PDF files, but for data analysis we want control over what data we extract, and we should also be able to manipulate that data. Using an API is one option, but it might not be the best solution, since APIs come with restrictions such as limits on how many times we can call them or on how we can share the information.

So I have used the following strategy for reading PDF files using Python:

Step 1: Convert PDF files to image files using poppler

Step 2: Convert image files to text using tesseract

Libraries you have to install and import:

import nltk
nltk.download('punkt')
import numpy as np
import warnings
warnings.filterwarnings('ignore')
from PIL import Image 
import pytesseract 
import sys 
from pdf2image import convert_from_path 
import os

The following code snippet implements both steps:

def get_pdf_data():
    PDF_file = "<FILE_NAME>.pdf"
    # Step 1: render each PDF page as an image at 500 DPI using poppler
    pages = convert_from_path(PDF_file, 500, poppler_path="<POPPLER_PATH>")
    image_counter = 1
    for page in pages:
        filename = "page_" + str(image_counter) + ".jpg"
        page.save(filename, 'JPEG')
        image_counter = image_counter + 1
    filelimit = image_counter - 1
    # Step 2: run tesseract OCR on each saved page image
    pytesseract.pytesseract.tesseract_cmd = "<TESSERACT_PATH>"
    corpus = ''
    for i in range(1, filelimit + 1):
        filename = "page_" + str(i) + ".jpg"
        text = str(pytesseract.image_to_string(Image.open(filename)))
        # Rejoin words that were hyphenated across a line break
        text = text.replace('-\n', '')
        corpus += text
    # Split the full text into sentences once all pages have been read
    sent_tokens = nltk.sent_tokenize(corpus)
    return sent_tokens

Finally, I use the NLP library nltk to split the extracted text into sentence tokens.
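
The function above writes every page to a JPEG file and then reopens it for OCR. As a variation, here is a minimal sketch that keeps the page images in memory and passes them straight to tesseract. It assumes that convert_from_path returns PIL images and that pytesseract.image_to_string accepts them directly. The names ocr_pages and read_pdf_text, and the ocr parameter, are hypothetical helpers introduced only for illustration; they are not part of the original script.

```python
def ocr_pages(pages, ocr=None):
    """Return the concatenated OCR text of an iterable of page images.

    `ocr` is a hypothetical hook: any callable that maps an image to a string.
    When it is omitted, pytesseract.image_to_string is used.
    """
    if ocr is None:
        # Imported lazily so the helper can be tried without tesseract installed.
        import pytesseract
        ocr = pytesseract.image_to_string
    corpus = ""
    for page in pages:
        text = str(ocr(page))
        # Rejoin words that were split by a hyphen at the end of a line.
        corpus += text.replace("-\n", "")
    return corpus


def read_pdf_text(pdf_path, dpi=500, poppler_path=None):
    """Convert a PDF to page images in memory and OCR them (no JPEG files)."""
    from pdf2image import convert_from_path
    pages = convert_from_path(pdf_path, dpi, poppler_path=poppler_path)
    return ocr_pages(pages)
```

The returned string can be passed to nltk.sent_tokenize exactly as before. One caveat of the '-\n' replacement, in both versions: it also joins genuine hyphenated compounds that happen to break at a line end, so "well-\nknown" becomes "wellknown".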

Errors you may face and their resolutions:

PDFInfoNotInstalledError: This may arise because poppler is not installed or is not exposed to the script. Its resolution can be seen here.
TesseractNotFoundError: This may arise because tesseract is not installed or is not exposed to the script. Its resolution can be seen here.

You can also get rid of these errors by providing the poppler and tesseract paths directly in the code snippet above, through poppler_path and tesseract_cmd.
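
Before running the OCR function, it can help to check whether the required executables are visible to the script at all. The following is a minimal sketch using only the standard library. It assumes the poppler utilities are named pdfinfo and pdftoppm and the OCR binary is named tesseract, which are their usual names; find_ocr_tools is a hypothetical helper, not part of pdf2image or pytesseract.

```python
import shutil


def find_ocr_tools(tools=("pdfinfo", "pdftoppm", "tesseract")):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in tools}


if __name__ == "__main__":
    for name, path in find_ocr_tools().items():
        print(name, "->", path or "not found on PATH")
```

If a tool is reported as not found, either add its folder to the PATH environment variable or pass its location explicitly as described above.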

In the next article, I will write a script for a simple chatbot that reads PDF files.

Create a chatbot for reading PDF files using Python

Chatbot for interacting with PDF files using python

amitb0007.medium.com

Thanks! :)

Written by Amit Bhardwaj