Parsing complex text file using Unix commands

Question

I have the following text structure which I would like to parse:

>Cluster 423
0   56aa, >HWI-ST1448:257:C3V2HACXX:1:1106:19087:2550.1... at 92.86%
1   64aa, >HWI-ST1448:257:C3V2HACXX:1:1106:15943:81371.1... *
2   41aa, >HWI-ST1448:257:C3V2HACXX:1:1106:12438:91360.3... at 90.24%
3   45aa, >HWI-ST1448:257:C3V2HACXX:1:1108:13046:13861.1... at 91.11%
4   52aa, >HWI-ST1448:257:C3V2HACXX:1:1110:12260:2424.2... at 90.38%
>Cluster 434
0   64aa, >HWI-ST1448:257:C3V2HACXX:1:1106:15723:89894.1... *
1   46aa, >HWI-ST1448:257:C3V2HACXX:2:1312:1967:40935.2... at 97.83%

Basically, each cluster's representative identifier is the one on the line ending in *, and the group size is the last member index in that cluster plus 1.

The output I want to produce would be (please note the group size at the end):

HWI-ST1448:257:C3V2HACXX:1:1106:15943:81371.1      5
HWI-ST1448:257:C3V2HACXX:1:1106:15723:89894.1      2

Any ideas?

In your file that you are parsing, where exactly is the 5 and the 1 coming from? — ryekayo, Sep 10 '14 at 19:21
I've edited your question; please verify that I got it right. — G-Man, Sep 10 '14 at 19:39

G-Man · Accepted Answer · 2014-09-10 20:33:11Z

Here’s a somewhat rough cut (with no error handling):

awk '/\*$/   { save_id = substr($3, 2, length($3)-4) }
    /^[0-9]/ { save_num = $1 }
    NR > 1 && /^>/ {print save_id, save_num+1 }
    END  {print save_id, save_num+1 }
    ' data_file

On a line that ends with * (i.e., that matches /\*$/), extract the group ID from the third word, discarding the first character (>) and the last three (...).
On lines that begin with a number, save the number (i.e., the first word).
Upon encountering a line beginning with > (but excluding the first line in the file by specifying NR > 1) or the end of the file, output the appropriate saved values.
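
The steps described above can also be sketched as a variant that counts member lines instead of relying on the last index. This is only an illustration, not the answer's own script: cluster_sizes and emit are hypothetical names, the listing is read from standard input, and LF line endings are assumed.

```shell
# Sketch: count member lines per cluster rather than using "last index + 1".
# cluster_sizes is a hypothetical helper name; it reads the listing on stdin.
cluster_sizes() {
    awk '
        function emit() {             # print the finished cluster, if any
            if (id != "") print id, count
            id = ""; count = 0
        }
        /^>/     { emit(); next }     # a ">Cluster N" header starts a new group
        /^[0-9]/ { count++ }          # every member line adds one to the size
        /\*$/    { id = substr($3, 2, length($3) - 4) }  # strip ">" and "..."
        END      { emit() }           # flush the final cluster
    '
}
```

It would be called as cluster_sizes < data_file. Because lines that are neither headers nor members are simply ignored, this version should behave the same whether or not blank lines separate the clusters.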

edited Sep 10 '14 at 20:33

answered Sep 10 '14 at 19:40

OK, I tested it on my file, but the output is only one ID and one group size. Any idea why the others are not in the output? – BSP Sep 10 '14 at 19:53

Is it giving you only the last one? Are the "blank" lines in your input really blank, or do they have whitespace characters (spaces, tabs, and/or carriage returns)? If they are not truly blank, they won't match /^$/ and so the corresponding print statement will never be executed. – G-Man Sep 10 '14 at 20:05

Sorry, there are no blank lines in the input. I had to add them because a line break in my post was only displayed when I pressed Enter twice. – BSP Sep 10 '14 at 20:25

OK, I updated my script. Please edit your question to delete the blank line that isn't supposed to be there. – G-Man Sep 10 '14 at 20:34

It worked! Thank you so much – BSP Sep 10 '14 at 20:41
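
If the input might contain the stray carriage returns or whitespace-only lines mentioned in the comments above, one option is to normalize it before parsing. The following is a minimal sketch under that assumption; normalize_listing is a hypothetical helper name, and it only suits scripts that do not depend on blank separator lines.

```shell
# Sketch: clean a cluster listing before handing it to a parser.
# normalize_listing is a hypothetical helper name; it filters stdin to stdout.
normalize_listing() {
    # Drop carriage returns (e.g. from DOS line endings), then drop
    # lines that are empty or contain only whitespace.
    tr -d '\r' | grep -v '^[[:space:]]*$'
}
```

It could be used as normalize_listing < data_file | awk '...', so that patterns anchored with $ see the real end of each line.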

mikeserv · Answer 2 · 2014-09-10 21:05:19Z

sed '/^[>0-9]/h;s/.*>\(.*[0-9]\).*\*/[\1 ]P /p
     $s/.*//;/^[>0-9[]/d;g;s/ .*/ 1+pc/ 
' <<\DATA | dc
>Cluster 423
0   56aa, >HWI-ST1448:257:C3V2HACXX:1:1106:19087:2550.1... at 92.86%
1   64aa, >HWI-ST1448:257:C3V2HACXX:1:1106:15943:81371.1... *
2   41aa, >HWI-ST1448:257:C3V2HACXX:1:1106:12438:91360.3... at 90.24%
3   45aa, >HWI-ST1448:257:C3V2HACXX:1:1108:13046:13861.1... at 91.11%
4   52aa, >HWI-ST1448:257:C3V2HACXX:1:1110:12260:2424.2... at 90.38%

>Cluster 434
0   64aa, >HWI-ST1448:257:C3V2HACXX:1:1106:15723:89894.1... *
1   46aa, >HWI-ST1448:257:C3V2HACXX:2:1312:1967:40935.2... at 97.83%
DATA

OUTPUT

HWI-ST1448:257:C3V2HACXX:1:1106:15943:81371.1 5
HWI-ST1448:257:C3V2HACXX:1:1106:15723:89894.1 2

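It's pretty simple. sed saves a copy of every line that begins with > or a digit in its hold space. When a line ends in *, sed rewrites it as a dc string-print command, [ID ]P, and prints it; every other line that begins with >, a digit, or [ is deleted. On a blank line, sed retrieves the saved line (the last member of the cluster) and reduces it to its leading number followed by 1+pc; the last line of the input is emptied first so that it gets the same treatment. So dc gets one [ stuff here ]P string to print per cluster, followed by one little addition job that prints the group size. Note that this relies on a blank line separating the clusters, as in the sample data above.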
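
To make the hand-off to dc concrete, here is a sketch of the intermediate dc program that the sed script should produce for the sample data above (worked out by hand, so treat it as an illustration). dc_program is a hypothetical name, and dc is only invoked if it is installed.

```shell
# Sketch: the dc program corresponding to the sample data above.
# dc_program is a hypothetical helper that just prints that program.
dc_program() {
    cat <<'EOF'
[HWI-ST1448:257:C3V2HACXX:1:1106:15943:81371.1 ]P
4 1+pc
[HWI-ST1448:257:C3V2HACXX:1:1106:15723:89894.1 ]P
1 1+pc
EOF
}

# [text]P prints the bracketed string without a newline;
# "4 1+pc" pushes 4 and 1, adds them, prints the sum, and clears the stack.
if command -v dc >/dev/null 2>&1; then
    dc_program | dc
fi
```

Each pair of lines therefore yields one output line: the identifier, a space, and the last member index plus 1.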

Kasramvd · Answer 3 · 2014-09-10 20:17:36Z

 # Prints only the starred IDs, not the group sizes; -P (PCRE lookbehind) requires GNU grep.
 grep "\*" file.txt | grep -oP "(?<=>)[\w+\s\W]+" | sed 's/\.\.\. \*//'

answered Sep 10 '14 at 20:17

Stéphane Chazelas · Answer 4 · 2014-09-10 20:12:09Z

perl  -F'\n' -lan00e 'print "$1\t$#F" if />(.*)\.{3} \*$/m'

answered Sep 10 '14 at 20:12

community wiki
