Efficiently extracting data from pdf with a template system?

Basically, I want to allow a user to label specific text in a PDF file, use that label information to create a template, and then have a script use the template to extract the text from a specific PDF.
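
To make the idea concrete, here is a minimal sketch of what such a template could look like. It assumes the words of the PDF have already been extracted together with their page coordinates by some PDF library; that step is not shown. The names `Word`, `Field`, and `extract`, and the choice of a bounding box per labeled field, are illustrative assumptions rather than an established design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One word with the position of its top-left corner (hypothetical type)."""
    text: str
    x: float
    y: float
    page: int = 0


@dataclass(frozen=True)
class Field:
    """One user-drawn label: a field name plus a bounding box on a page."""
    name: str
    page: int
    x0: float
    y0: float
    x1: float
    y1: float


def extract(words, template):
    """Return {field name: text} using the words whose corner lies in each box."""
    result = {}
    for field in template:
        hits = [
            w for w in words
            if w.page == field.page
            and field.x0 <= w.x <= field.x1
            and field.y0 <= w.y <= field.y1
        ]
        # Read top to bottom, then left to right.
        hits.sort(key=lambda w: (w.y, w.x))
        result[field.name] = " ".join(w.text for w in hits)
    return result


# Made-up sample data standing in for one parsed page.
words = [
    Word("Invoice", 50, 40), Word("No.", 110, 40), Word("12345", 150, 40),
    Word("Total:", 50, 700), Word("99.50", 120, 700),
]
template = [
    Field("invoice_number", 0, 140, 30, 220, 50),
    Field("total", 0, 100, 690, 200, 710),
]
print(extract(words, template))
```

In this sketch the template is plain data, so labeling a new kind of PDF means adding a new list of `Field` entries rather than changing the script.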

What would be a good pattern to use in such a situation? For example, the script that uses the templates might work for a group of well-formatted PDFs but not for badly formatted ones. To handle the latter, you end up modifying the script, which in turn breaks PDFs that previously worked. So you keep fixing the script until it works on all the PDFs, but only the ones you've seen and been able to create templates for. You hope that eventually you will arrive at a script that handles widely varying types of PDFs. It feels like a cat-and-mouse game, and I'm wondering whether there's a better way to go about this.
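
One way to picture the "fix one PDF, break another" cycle is as a regression check over every document seen so far. The sketch below is only an illustration under stated assumptions: the extractor is a toy that works on plain text with one regular expression per field, and the file names, sample text, and `run_regression` helper are all hypothetical.

```python
import re


def extract(text, template):
    """Toy extractor: template maps field name -> regex with one capture group."""
    out = {}
    for name, pattern in template.items():
        match = re.search(pattern, text)
        out[name] = match.group(1) if match else None
    return out


def run_regression(extractor, cases):
    """Run the extractor on every recorded case and collect mismatches.

    Each case is (case_id, text, template, expected_fields).
    Returns a list of (case_id, field, expected, actual) tuples.
    """
    failures = []
    for case_id, text, template, expected in cases:
        actual = extractor(text, template)
        for field, want in expected.items():
            got = actual.get(field)
            if got != want:
                failures.append((case_id, field, want, got))
    return failures


# Made-up corpus: one clean document and one with letter-spaced text.
CASES = [
    ("clean.pdf", "Invoice No. 12345\nTotal: 99.50",
     {"total": r"Total:\s*([\d.]+)"}, {"total": "99.50"}),
    ("messy.pdf", "T o t a l :  7.00",
     {"total": r"T\s*o\s*t\s*a\s*l\s*:\s*([\d.]+)"}, {"total": "7.00"}),
]

failures = run_regression(extract, CASES)
print(failures)
```

Here the quirk of the badly formatted document lives in its own template, not in the shared `extract` function, and any change to the shared code can be checked against all previously working documents before it is accepted.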

The other issue that comes up is how to organize the code in such a scenario. But more importantly, I'm looking for advice on how to tackle the difficult problem of extracting data from PDFs using a template-based system (or is a template system not efficient here?).

asked Sep 16 '14 at 6:44

user299709

Print to an image then OCR? (That might be a joke. I've been in your shoes and I do not envy you). Try using someone else's tool (Acrobat?) to do the text extraction. It's messy. – Dan Pichelman Sep 16 '14 at 15:24

I am trying to write a tool that does that in order to learn the process. OCR could work too – user299709 Sep 16 '14 at 17:44

Browse other questions tagged design design-patterns machine-learning or ask your own question.

asked	6 months ago
viewed	37 times
