
Basically, the idea is to allow a user to label specific text in a PDF file, then use the label information to create a template, which a script uses to extract text from specific PDFs.
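To make the idea concrete, here is a minimal sketch of what such a template could look like. It assumes the words and their page coordinates have already been pulled out of the PDF by some other tool; `Word`, `Region`, and `extract` are hypothetical names used only for illustration, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A word with the page coordinates of its anchor point (assumed input)."""
    text: str
    x: float
    y: float


@dataclass(frozen=True)
class Region:
    """A user-drawn labeled rectangle; a template is a list of these."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, word: Word) -> bool:
        return self.x0 <= word.x <= self.x1 and self.y0 <= word.y <= self.y1


def extract(words, template):
    """Return {label: text} by collecting the words inside each region."""
    result = {}
    for region in template:
        inside = [w for w in words if region.contains(w)]
        # Reading order: top to bottom, then left to right.
        inside.sort(key=lambda w: (w.y, w.x))
        result[region.label] = " ".join(w.text for w in inside)
    return result


# Made-up sample data standing in for one extracted page.
words = [
    Word("Invoice", 50, 40), Word("#1234", 120, 40),
    Word("Total:", 50, 700), Word("99.50", 120, 700),
]
template = [
    Region("invoice_no", 100, 30, 200, 50),
    Region("total", 100, 690, 200, 710),
]
fields = extract(words, template)
print(fields)
```

Keeping the template as plain data like this means the script itself does not have to change when a new layout shows up; only a new list of regions is needed.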

What would be a good pattern to use in such a situation? For example, the script that uses the templates might work for a group of well-formatted PDFs but not for badly formatted ones. To handle those latter cases, you end up modifying the script, which in turn breaks other PDFs that previously worked. So you keep fixing the script until it finally works on all the PDFs, but only the ones you've seen and have been able to create templates for. You hope that eventually you will arrive at a script that handles widely varying types of PDFs. It feels like a cat-and-mouse game, and I'm thinking maybe there's a better way to go about doing this.
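As a rough sketch of the breakage described above, one option would be to keep every PDF that ever worked as a saved case and re-run all of them after each change to the script. The code below assumes each document has already been reduced to plain text; `run_regressions` and `extract_total` are hypothetical helpers made up for this illustration, and the file names and contents are invented sample data.

```python
def run_regressions(extract, cases):
    """Run `extract` on every saved case and return a list of failures.

    Each case is (name, document, expected_fields).
    """
    failures = []
    for name, document, expected in cases:
        try:
            actual = extract(document)
        except Exception as exc:  # a crash counts as a failure too
            failures.append((name, f"raised {exc!r}"))
            continue
        if actual != expected:
            failures.append((name, f"expected {expected!r}, got {actual!r}"))
    return failures


def extract_total(text):
    """Toy extractor: return the value after a 'Total:' prefix, if any."""
    for line in text.splitlines():
        if line.strip().lower().startswith("total:"):
            return {"total": line.split(":", 1)[1].strip()}
    return {}


# Made-up cases: one well-formatted document, one badly formatted one.
cases = [
    ("clean.pdf", "Invoice 1\nTotal: 99.50", {"total": "99.50"}),
    ("messy.pdf", "invoice 2\n  TOTAL:   12.00  ", {"total": "12.00"}),
]
failures = run_regressions(extract_total, cases)
print(failures)
```

This does not make the extractor any smarter, but a change that fixes the badly formatted PDFs and silently breaks the well-formatted ones would show up as an entry in `failures` right away.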

The other issue that comes up is how to organize the code in such a scenario. But more importantly, I'm looking for advice on how to tackle the difficult problem of extracting data from PDFs using a template-based system (or is a template system not effective here?).

    
Print to an image then OCR? (That might be a joke. I've been in your shoes and I do not envy you). Try using someone else's tool (Acrobat?) to do the text extraction. It's messy. –  Dan Pichelman Sep 16 '14 at 15:24
    
I am trying to write a tool that does that in order to learn the process. OCR could work too –  user299709 Sep 16 '14 at 17:44
