Extract Pages from PDF based on search in python

Question

Everything works fine except for the run time: it takes a long time on my file, which contains 1000 pages and has 100 pages of interest.

import re
from PyPDF2 import PdfFileReader, PdfFileWriter
import glob, os

# find pages
def  findText(f, slist):
    file = open(f, 'rb')
    pdfDoc = PdfFileReader(file)
    pages = []
    for i in range(pdfDoc.getNumPages()):
        content = pdfDoc.getPage(i).extractText().lower()
        for s in slist:
            if re.search(s.lower(), content) is not None:
                if i not in pages:
                    pages.append(i)
    return pages

#extract pages
def extractPage(f, fOut, pages):
    file = open(f, 'rb')
    output = PdfFileWriter()
    pdfOne = PdfFileReader(file)
    for i in pages:
        output.addPage(pdfOne.getPage(i))
    outputStream = open(fOut, "wb")
    output.write(outputStream)
    outputStream.close()
    return

os.chdir(r"path\to\mydir")
for pdfFile in glob.glob("*.pdf"):
    print(pdfFile)
    outPdfFile = pdfFile.replace(".pdf","_searched_extracted.pdf")
    stringList = ["string1", "string2"]
    extractPage(pdfFile, outPdfFile, findText(pdfFile, stringList))

Updated code after suggestions is at:

https://gist.github.com/pra007/099f10b07be5b7126a36438c67ad7a1f

We don't really care about the overall time but more about the specifics. Instead of python file.py, use python -m cProfile -s cumtime file.py and post the functions that took the most time. — Quentin Pradet, yesterday
I have rolled back the question to Rev 1. Please see What to do when someone answers. — 200_success♦, yesterday
Thanks. I will keep in mind next time not to change the question. — Rahul, 14 hours ago
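
The profiling suggested in the first comment can also be done from inside the script with the standard library's cProfile and pstats modules. This is only a minimal sketch: `profile_call` is a hypothetical helper, not part of the question's code, and it simply wraps one function call and returns a text report sorted by cumulative time.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    # Collect the report in memory instead of printing it directly.
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumtime").print_stats(10)  # ten costliest entries
    return result, stream.getvalue()
```

With the question's code, something like `pages, report = profile_call(findText, pdfFile, stringList)` followed by `print(report)` should show whether the time goes into PyPDF2 or into the search loop.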

Quentin Pradet · Answer 1 · 2016-09-07 09:29:47Z


You could try profiling, but the code is simple enough that I think you're spending most of the time in PyPDF2 code. Two options:

1. Preprocess your PDF files and store their text somewhere. That will make the search phase much faster, especially if you run multiple queries on the same PDF files.
2. Try another parser, such as a Python 3 version of PDFMiner, or even a parser written in a faster language.
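
The preprocessing option could be sketched roughly as follows: extract each page's text once, store it in a JSON file next to the PDF, and run later searches against that cache. The helper names (`extract_page_texts`, `load_or_build_cache`, `find_pages`) and the `.pages.json` suffix are made up for illustration; the PyPDF2 calls are the same legacy ones the question uses.

```python
import json
import os
import re


def extract_page_texts(pdf_path):
    """Return the lowercased text of every page (the slow step)."""
    # Imported here so the cache helpers work without PyPDF2 installed.
    from PyPDF2 import PdfFileReader

    with open(pdf_path, "rb") as handle:
        reader = PdfFileReader(handle)
        return [reader.getPage(i).extractText().lower()
                for i in range(reader.getNumPages())]


def load_or_build_cache(pdf_path, extractor=extract_page_texts):
    """Load cached page texts, extracting them only on the first run."""
    cache_path = pdf_path + ".pages.json"  # hypothetical naming scheme
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as handle:
            return json.load(handle)
    texts = extractor(pdf_path)
    with open(cache_path, "w", encoding="utf-8") as handle:
        json.dump(texts, handle)
    return texts


def find_pages(texts, patterns):
    """Return indices of pages matching at least one pattern."""
    compiled = [re.compile(p.lower()) for p in patterns]
    return [i for i, text in enumerate(texts)
            if any(rx.search(text) for rx in compiled)]
```

The first call to `load_or_build_cache` pays the full extraction cost; later calls only read the JSON file. Note that this sketch never invalidates the cache, so the `.pages.json` file would have to be deleted by hand if the PDF changes.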


Thanks. I thought pdfminer was dead. Let me test pdfminer3k. – Rahul yesterday

@Rahul Preprocessing sounds better. Isn't that an option for you? – Quentin Pradet yesterday

Yes, I am planning to do that. Thanks. – Rahul yesterday


pjz · Answer 2 · 2016-09-08 04:15:09Z

One thing that might help a lot is to compile your regexes just once. Instead of:

def findText(f, slist):
    file = open(f, 'rb')
    pdfDoc = PdfFileReader(file)
    pages = []
    for i in range(pdfDoc.getNumPages()):
        content = pdfDoc.getPage(i).extractText().lower()
        for s in slist:
            if re.search(s.lower(), content) is not None:
                if i not in pages:
                    pages.append(i)
    return pages

try:

def findText(f, slist):
    file = open(f, 'rb')
    pdfDoc = PdfFileReader(file)
    pages = []
    searches = [ re.compile(s.lower()) for s in slist ]
    for i in range(pdfDoc.getNumPages()):
        content = pdfDoc.getPage(i).extractText().lower()
        for s in searches:
            if s.search(content) is not None:
                if i not in pages:
                    pages.append(i)
    return pages

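Also, you can short-circuit a lot sooner than you're doing. Once any pattern matches a page, break out of the inner loop; that also makes the "if i not in pages" check unnecessary: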

def findText(f, slist):
    # Compile each search pattern once, before looping over the pages.
    searches = [re.compile(s.lower()) for s in slist]
    pages = []
    with open(f, 'rb') as file:
        pdfDoc = PdfFileReader(file)
        for i in range(pdfDoc.getNumPages()):
            content = pdfDoc.getPage(i).extractText().lower()
            for s in searches:
                if s.search(content) is not None:
                    pages.append(i)
                    break  # one match is enough; move on to the next page
    return pages
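
Going one step further, the inner loop can disappear entirely if all search strings are combined into one compiled alternation. This is only a sketch, and it assumes the strings are meant as literal text rather than regular expressions (hence `re.escape`), which differs from the code above. `find_matching_pages` is a hypothetical helper that takes already-extracted page texts.

```python
import re


def find_matching_pages(page_texts, search_strings):
    """Return indices of pages containing any search string, ignoring case."""
    if not search_strings:
        return []  # an empty alternation would match every page
    # One alternation pattern, compiled once; re.escape treats each
    # string as literal text (an assumption about the intended search).
    pattern = re.compile("|".join(re.escape(s) for s in search_strings),
                         re.IGNORECASE)
    return [i for i, text in enumerate(page_texts) if pattern.search(text)]
```

Using `re.IGNORECASE` also avoids lowercasing every page and every pattern by hand. Whether this is measurably faster than the loop above depends on how much of the total time is spent in the regex engine rather than in text extraction.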

Thanks. I will get back soon with the timings. – Rahul 13 hours ago
