Powerful PDF text extractor Python module

3/31/2023

Working with PDF files in Python is straightforward, and several libraries can help, including PyPDF2, tabula-py, and PyMuPDF. They are easy to use: you install the library and run a few lines of code in your IDE or notebook. In this tutorial we will use the PyPDF2 module to read and extract the text of a PDF. We will install and import PyPDF2, open the PDF file in Python, and start reading from it. Keep in mind that text cannot always be extracted correctly, because a PDF can also contain diagrams, tables, and other non-text content.

Following the theme of my last post, I'm going to use another PDF focused on Indonesia's current energy situation: the Indonesia Energy Outlook 2019 report published by the Secretariat General of the National Energy Council. If you open the link to the PDF you will find a long report with many pages and figures. For the purpose of this post, I am only going to focus on extracting the text from the Executive Summary on pages xii and xiii. With the PDF and text identified, let's move on to using Python to extract the Executive Summary.

Note: The following code explanation is designed for the Google Colab environment.

The library we will use to extract the PDF text is called PyPDF2. PyPDF2 can do much more than just extract text, and if you are curious about its other capabilities, you can read about them in its documentation. This library isn't pre-installed in the Google Colab environment, so we have to install it before importing it into our code.

    !pip install PyPDF2
    import PyPDF2

Before we move to the next step, make sure you have loaded the PDF document into the file repository on the left of the Colab environment. We must open the PDF as a file object before we can start using PyPDF2 on it.

    pdf_file_obj = open("/content/content-indonesia-energy-outlook-2019-english-version.pdf", "rb")

Now we need to convert pdf_file_obj into a PyPDF2 reader object so that we can use the library to search through the Indonesia Energy Outlook and extract our text of interest. The code uses the current PyPDF2 names; older releases used PdfFileReader, numPages, getPage, and extractText, which PyPDF2 3.0 removed.

    # Converting the file object into a PDF reader object
    pdf_reader = PyPDF2.PdfReader(pdf_file_obj)
    # If you want to find out the number of pages in the PDF, use this command
    print(len(pdf_reader.pages))

We know from looking at the original PDF that we are interested in pages 12 and 13, where the Executive Summary resides. We can pull out an individual page using the following method.

    # How to create a page object
    page_obj = pdf_reader.pages[12]

If you print page_obj you will get something quite unreadable to the human eye. Now let's pull all the text from pages 12 and 13 and combine them to get the executive summary.

    # Getting Executive Summary
    page_obj1 = pdf_reader.pages[12]
    page_obj2 = pdf_reader.pages[13]
    executive_summary = page_obj1.extract_text() + page_obj2.extract_text()

You'll notice that the text has many instances of "\n" within it when you print it out. We can remove these with a simple one-liner.

    clean_text = executive_summary.replace("\n", "")

Our Python Code: Making our word document

We need one more library now so that we can create our Word document. The library we will be using is called python-docx. First, we install and import it into our environment.

    !pip install python-docx
    from docx import Document

With the necessary library installed, we must first create an empty document object and then build it up with the following steps.

    # Create Document object
    document = Document()
    # Add a heading to our Word document
    document.add_heading('Executive Summary', 0)
    # Create a paragraph by feeding our document the extracted text
    p = document.add_paragraph(clean_text)

All we need to do now is save our document, and it will appear in the same file repository of the Google Colab environment. The document won't look perfect and there will likely be a few minor cleanups to do, but you should have all the text from the executive summary.

If you are already able to read the PDF and store the text in a string, you could also search it with regular expressions:

    import re  # Import the regex module
    # re.findall will create a list of all strings matching the specified pattern
    results = re.findall(r'user:\s\w+', pdf_text)

This basically means: find all matches that start with the string 'user:', followed by a whitespace character '\s', and then followed by word characters (letters, digits, and underscores) '\w', repeated until it cannot match anymore '+'. If you would only like to get the "value" field back, you could use r'user:\s(\w+)', which instructs the regex engine to group the string matched by '\w+'. If you have groups in your regex pattern, findall returns a list of the group matches instead of the full matches:

    results = re.findall(r'user:\s(\w+)', pdf_text)

Other methods such as finditer() could also help in case you want to do more complex things. Take a look at the regex module documentation for details.
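To make the difference between a pattern with and without a group concrete, here is a small sketch. The sample string assigned to pdf_text is made up for illustration and simply stands in for text extracted from a PDF; only re.findall and re.finditer from the standard library are used.

```python
import re

# Hypothetical sample standing in for text extracted from a PDF.
pdf_text = "user: alice logged in\nuser: bob logged out\nuser:\tcarol idle"

# Without a group, findall returns the full matched strings.
full_matches = re.findall(r"user:\s\w+", pdf_text)

# With one group, findall returns only the grouped part of each match.
values = re.findall(r"user:\s(\w+)", pdf_text)

# finditer yields match objects, which also expose where each match starts.
positions = [(m.group(1), m.start()) for m in re.finditer(r"user:\s(\w+)", pdf_text)]

print(full_matches)
print(values)
print(positions)
```

findall is enough when you only need the matched strings; finditer is the better fit when you also need offsets or several groups per match.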
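Regular expressions can also help with the newline cleanup. Replacing "\n" with an empty string glues together words that were separated only by a line break, so one possible alternative is to collapse each run of whitespace into a single space. The helper name normalize_whitespace and the sample text below are assumptions for illustration, not part of the original code.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (including newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical sample standing in for executive_summary.
sample = "Indonesia's energy\ndemand continues\n to grow.\n"

# Plain replace: words on either side of a line break run together.
joined = sample.replace("\n", "")

# Whitespace normalization: words stay separated by a single space.
normalized = normalize_whitespace(sample)

print(joined)
print(normalized)
```

This sketch does not rejoin words that the PDF hyphenated across lines; that would need a separate rule.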
