
Everything works fine except for the running time. It takes a long time on my file, which has 1000 pages, about 100 of which are of interest.

import re
from PyPDF2 import PdfFileReader, PdfFileWriter
import glob, os

# find pages
def findText(f, slist):
    file = open(f, 'rb')
    pdfDoc = PdfFileReader(file)
    pages = []
    for i in range(pdfDoc.getNumPages()):
        content = pdfDoc.getPage(i).extractText().lower()
        for s in slist:
            if re.search(s.lower(), content) is not None:
                if i not in pages:
                    pages.append(i)
    return pages

#extract pages
def extractPage(f, fOut, pages):
    file = open(f, 'rb')
    output = PdfFileWriter()
    pdfOne = PdfFileReader(file)
    for i in pages:
        output.addPage(pdfOne.getPage(i))
    outputStream = open(fOut, "wb")
    output.write(outputStream)
    outputStream.close()
    return

os.chdir(r"path\to\mydir")
for pdfFile in glob.glob("*.pdf"):
    print(pdfFile)
    outPdfFile = pdfFile.replace(".pdf","_searched_extracted.pdf")
    stringList = ["string1", "string2"]
    extractPage(pdfFile, outPdfFile, findText(pdfFile, stringList))

The updated code, after applying the suggestions, is at:

https://gist.github.com/pra007/099f10b07be5b7126a36438c67ad7a1f

We don't really care about the overall time but more about the specifics. Instead of python file.py, use python -m cProfile -s cumtime file.py and post the functions that took the most time. – Quentin Pradet yesterday
    
Is my modified code OK? – Rahul yesterday
    
I have rolled back the question to Rev 1. Please see What to do when someone answers. – 200_success yesterday
    
Thanks. I will keep in mind next time not to change the question. – Rahul 14 hours ago
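
The profiling suggestion in the comments can also be applied from inside the script, so that only the search step is measured. This is a minimal sketch using the standard library's cProfile and pstats modules; the helper name profile_call is hypothetical and not part of the original code.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report).

    The report lists the `top` entries sorted by cumulative time,
    the same ordering as `python -m cProfile -s cumtime`.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumtime").print_stats(top)
    return result, buffer.getvalue()
```

Assuming the functions from the question are defined, something like pages, report = profile_call(findText, pdfFile, stringList) followed by print(report) should show whether the time goes into extractText or into the regex search.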

You could try profiling but the code is simple enough that I think you're spending most of the time in PyPDF2 code. Two options:

  • You can preprocess your PDF files to store their text somewhere, which will make the search phase much faster, especially if you run multiple queries on the same PDF files
  • You can try another parser such as a Python 3 version of PDFMiner, or even a parser written in a faster language
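
The preprocessing option could look roughly like the sketch below: extract the text of every page once, store it as JSON next to the PDF, and run later searches against that cache. The names load_page_texts, find_pages and the extract_pages callable are assumptions made for illustration; extract_pages stands for whatever parser you use and is expected to return a list of page strings.

```python
import json
import os


def load_page_texts(pdf_path, cache_path, extract_pages):
    """Return the lowercased text of every page, reusing a JSON cache if fresh.

    extract_pages is a hypothetical callable: path -> list of page strings.
    """
    cache_is_fresh = (
        os.path.exists(cache_path)
        and os.path.getmtime(cache_path) >= os.path.getmtime(pdf_path)
    )
    if cache_is_fresh:
        with open(cache_path, "r", encoding="utf-8") as fh:
            return json.load(fh)

    # Slow path: parse the PDF once and remember the result.
    texts = [text.lower() for text in extract_pages(pdf_path)]
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(texts, fh)
    return texts


def find_pages(texts, needles):
    """Return the indexes of pages containing any of the plain-text needles."""
    lowered = [needle.lower() for needle in needles]
    return [i for i, text in enumerate(texts)
            if any(needle in text for needle in lowered)]
```

With PyPDF2, extract_pages could wrap the getPage(i).extractText() loop from the question. The first run is as slow as before, but every later query on the same file only reads the JSON cache.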
    
Thanks. I thought pdfminer was dead. Let me test pdfminer3k. – Rahul yesterday
    
@Rahul Preprocessing sounds better. Is that not an option for you? – Quentin Pradet yesterday
    
Yes, I am planning to. Thanks. – Rahul yesterday

One thing that might help a lot is to compile your regexes just once. Instead of

def findText(f, slist):
    file = open(f, 'rb')
    pdfDoc = PdfFileReader(file)
    pages = []
    for i in range(pdfDoc.getNumPages()):
        content = pdfDoc.getPage(i).extractText().lower()
        for s in slist:
            if re.search(s.lower(), content) is not None:
                if i not in pages:
                    pages.append(i)
    return pages

try:

def findText(f, slist):
    file = open(f, 'rb')
    pdfDoc = PdfFileReader(file)
    pages = []
    searches = [ re.compile(s.lower()) for s in slist ]
    for i in range(pdfDoc.getNumPages()):
        content = pdfDoc.getPage(i).extractText().lower()
        for s in searches:
            if s.search(content) is not None:
                if i not in pages:
                    pages.append(i)
    return pages

Also, you can short-circuit out a lot faster than you're doing:

def findText(f, slist):
    file = open(f, 'rb')
    pdfDoc = PdfFileReader(file)
    pages = []
    searches = [ re.compile(s.lower()) for s in slist ]
    for i in range(pdfDoc.getNumPages()):
        content = pdfDoc.getPage(i).extractText().lower()
        for s in searches:
            if s.search(content) is not None:
                pages.append(i)
                break
    return pages
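
A further variation on the same idea, offered as a sketch rather than a measured improvement: join all patterns into one alternation so each page is scanned with a single compiled regex, and use re.IGNORECASE instead of lowercasing. Lowercasing a pattern can change its meaning (for example \S becomes \s), which the flag avoids. The function names and the page_texts argument (a list of already extracted page strings) are hypothetical, and joining assumes the patterns do not rely on numbered backreferences.

```python
import re


def build_matcher(patterns):
    """Join the patterns into one case-insensitive alternation, compiled once."""
    combined = "|".join("(?:%s)" % pattern for pattern in patterns)
    return re.compile(combined, re.IGNORECASE)


def matching_pages(page_texts, patterns):
    """Return the indexes of pages where at least one pattern matches."""
    matcher = build_matcher(patterns)
    # search() stops at the first match, so no explicit break is needed.
    return [i for i, text in enumerate(page_texts) if matcher.search(text)]
```

For example, matching_pages(texts, ["string1", "string2"]) returns the page indexes to pass to extractPage.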
    
Thanks. I will get back soon with timings. – Rahul 13 hours ago
