Reading a PDF file using Python
In this article, I explain one strategy for reading a PDF file using Python and its libraries.
There are many ways to read PDF files, but for data analysis we want control over what data we extract, and we also need to be able to manipulate that data. Using an API is one option, but it might not be the best solution because APIs come with restrictions, such as limits on how many times we can call them or on how the information may be shared.
So I have used the following strategy for reading PDF files using Python:
Step 1: Convert PDF files to image files using poppler (through the pdf2image library)
Step 2: Convert image files to text using tesseract (through the pytesseract library)
Libraries you have to install and import (poppler and tesseract themselves must also be installed on your system):
import nltk
nltk.download('punkt')
import numpy as np
import warnings
warnings.filterwarnings('ignore')
from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os
The following code snippet implements both steps:
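def get_pdf_data():
    PDF_file = "<FILE_NAME>.pdf"
    # Step 1: render every page of the PDF as an image at 500 DPI using poppler.
    # Replace <POPPLER_PATH> with the path to your poppler installation.
    pages = convert_from_path(PDF_file, 500, poppler_path="<POPPLER_PATH>")
    image_counter = 1
    for page in pages:
        filename = "page_" + str(image_counter) + ".jpg"
        page.save(filename, 'JPEG')
        image_counter = image_counter + 1
    filelimit = image_counter - 1
    # Step 2: run tesseract OCR on each saved page image.
    # Replace <TESSERACT_PATH> with the path to your tesseract executable.
    pytesseract.pytesseract.tesseract_cmd = "<TESSERACT_PATH>"
    corpus = ''
    for i in range(1, filelimit + 1):
        filename = "page_" + str(i) + ".jpg"
        text = str(pytesseract.image_to_string(Image.open(filename)))
        # Rejoin words that were hyphenated across a line break.
        text = text.replace('-\n', '')
        corpus += text
    # Split the recognized text into sentences.
    sent_tokens = nltk.sent_tokenize(corpus)
    return sent_tokens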
At the end, I use the NLP library nltk to split the extracted text into sentence tokens.
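One detail of the snippet that is easy to miss is the `text.replace('-\n', '')` call, which rejoins words that the OCR output split across two lines. The following is a minimal sketch of that step in isolation; the helper name `join_hyphenated_lines` and the sample strings are made up for illustration and are not part of the original script.

```python
def join_hyphenated_lines(text: str) -> str:
    """Remove every hyphen that is immediately followed by a newline."""
    return text.replace("-\n", "")


# A word broken across two lines is stitched back together.
ocr_text = "Optical character recog-\nnition turns images into text.\n"
cleaned = join_hyphenated_lines(ocr_text)
print(cleaned)

# Caveat: a genuine compound that happens to break at its hyphen
# loses the hyphen as well.
compound = join_hyphenated_lines("a well-\nknown library")
print(compound)
```

The replacement is deliberately simple. It cannot tell a line-break hyphen from a real one, so compounds that happen to wrap at the hyphen are merged too, which is usually acceptable when the text is only going to be tokenized.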
Errors you may face and their resolutions:
- PDFInfoNotInstalledError: This may arise because poppler is not installed or is not exposed to the script (for example, it is not on the PATH). Its resolution can be seen here.
- TesseractNotFoundError: This may arise because tesseract is not installed or is not exposed to the script. Its resolution can be seen here.
You can also get rid of both errors by providing the installation paths directly in the code snippet above: the poppler_path argument of convert_from_path for poppler, and pytesseract.pytesseract.tesseract_cmd for tesseract.
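Before running the script, it can help to check whether the external programs are visible to Python at all. The sketch below uses only the standard library; `find_missing_tools` is a hypothetical helper, and the executable names `pdfinfo`, `pdftoppm` and `tesseract` are my assumption about what poppler and tesseract install on a typical system.

```python
import shutil


def find_missing_tools(tools=("pdfinfo", "pdftoppm", "tesseract")):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    # These are the tools whose paths need to be passed explicitly,
    # or whose install directories need to be added to the PATH.
    print("Not found on PATH:", ", ".join(missing))
else:
    print("poppler and tesseract executables were found on PATH")
```

If a tool shows up as missing even though it is installed, providing its path in the script as described above should be enough.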
In the next article, I will write a script for a simple chatbot that reads PDF files.
Thanks! :)