I am trying to write a bash script that generates a CSV file from the text of multiple PDF documents. I already have a script that converts the PDFs to text, but nothing yet that builds the CSV file. Each text document gets its own row, made up of specific pieces of data extracted from that document. The first row of the CSV file holds the column names; every other row holds the data extracted from one text file. The CSV file would look something like this:

    Data1,Data2,Data3,Data4
    Data1_FromFile1,Data2_FromFile1,Data3_FromFile1,Data4_FromFile1
    Data1_FromFile2,Data2_FromFile2,Data3_FromFile2,Data4_FromFile2
    Data1_FromFile3,Data2_FromFile3,Data3_FromFile3,Data4_FromFile3

Not all of the text inside the text files will be used, only lines that fit certain patterns (dates, codes, contents of certain sections). There will also be more than 3 rows of data. How would I go about creating a CSV file like this? Would I redirect standard output to the CSV file, and if so, how do I format that output as CSV?

This is probably unanswerable without samples of what you're actually searching for; it's going to involve some kind of regex parsing – Michael Mrozek Jul 14 '15 at 15:37
    
I can provide an example of what I mean a little later; I don't have time at the moment. Suffice it to say that Data1 contains a string of numbers, letters and certain characters (mainly - and .), Data2 is the title of something, Data3 is dates, and Data4 is URLs. Will have more details later. – cluemein Jul 14 '15 at 20:01
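
Without real samples, the general shape can still be sketched. Redirecting standard output to the CSV file is a reasonable approach: print the header row once, then print one comma-separated row per text file. The script below is only a minimal sketch under loud assumptions: the four regular expressions, the "Title:" line prefix, and the function names csv_quote, first_match and make_csv are all made up for illustration and are not the asker's real formats. Each value is wrapped in double quotes with embedded quotes doubled (the convention described in RFC 4180), so a title containing a comma does not break the row.

```shell
#!/usr/bin/env bash
# Sketch only: every pattern below is a placeholder assumption and must be
# replaced with the formats that actually occur in the converted text files.

# Quote one value for CSV: double embedded double quotes, wrap in quotes.
csv_quote() {
    local value=$1 q='"'
    value=${value//$q/$q$q}
    printf '"%s"' "$value"
}

# Print the first match of an extended regex ($1) in a file ($2).
# Prints nothing if the pattern does not occur.
first_match() {
    grep -E -o -m 1 -- "$1" "$2" | head -n 1
}

# Print the header row, then one data row per file named in the arguments.
make_csv() {
    local file code title date url
    printf '%s\n' 'Data1,Data2,Data3,Data4'
    for file in "$@"; do
        # Assumed code format: capital letters, a dash, dotted numbers.
        code=$(first_match '[A-Z]{2,}-[0-9]+(\.[0-9]+)*' "$file")
        # Assumed title format: a line starting with "Title:".
        title=$(sed -n 's/^Title:[[:space:]]*//p' "$file" | head -n 1)
        # Assumed date format: YYYY-MM-DD.
        date=$(first_match '[0-9]{4}-[0-9]{2}-[0-9]{2}' "$file")
        # Assumed URL format: http or https up to the next whitespace.
        url=$(first_match 'https?://[^[:space:]]+' "$file")
        printf '%s,%s,%s,%s\n' \
            "$(csv_quote "$code")" "$(csv_quote "$title")" \
            "$(csv_quote "$date")" "$(csv_quote "$url")"
    done
}

# Usage: make_csv *.txt > output.csv
```

A file in which a pattern is not found still gets a row, with an empty quoted field in that column, so the columns stay aligned across files. If a field can legitimately occur several times per file (several dates, say), the head -n 1 calls would need to be replaced with whatever joining rule fits the data.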
