I have the following text structure which I would like to parse:

>Cluster 423
0   56aa, >HWI-ST1448:257:C3V2HACXX:1:1106:19087:2550.1... at 92.86%
1   64aa, >HWI-ST1448:257:C3V2HACXX:1:1106:15943:81371.1... *
2   41aa, >HWI-ST1448:257:C3V2HACXX:1:1106:12438:91360.3... at 90.24%
3   45aa, >HWI-ST1448:257:C3V2HACXX:1:1108:13046:13861.1... at 91.11%
4   52aa, >HWI-ST1448:257:C3V2HACXX:1:1110:12260:2424.2... at 90.38%
>Cluster 434
0   64aa, >HWI-ST1448:257:C3V2HACXX:1:1106:15723:89894.1... *
1   46aa, >HWI-ST1448:257:C3V2HACXX:2:1312:1967:40935.2... at 97.83%

Basically, the identifier I want from each cluster is the one on the line marked with a * at the end, and the group size is the last member number in that cluster plus 1.

The output I want to produce would be (please note the group size at the end):

HWI-ST1448:257:C3V2HACXX:1:1106:15943:81371.1      5
HWI-ST1448:257:C3V2HACXX:1:1106:15723:89894.1      2

Any ideas?

In your file that you are parsing, where exactly is the 5 and the 1 coming from? –  ryekayo Sep 10 '14 at 19:21
    
I've edited your question; please verify that I got it right. –  G-Man Sep 10 '14 at 19:39
    
just checking it, thanks –  BSP Sep 10 '14 at 19:47

4 Answers

Accepted answer:

Here’s a somewhat rough cut (with no error handling):

awk '/\*$/   { save_id = substr($3, 2, length($3)-4) }
    /^[0-9]/ { save_num = $1 }
    NR > 1 && /^>/ {print save_id, save_num+1 }
    END  {print save_id, save_num+1 }
    ' data_file
  • On a line that ends with * (i.e., that matches /\*$/), extract the group ID from the third word, discarding the first character (>) and the last three (...).
  • On lines that begin with a number, save the number (i.e., the first word).
  • Upon encountering a line beginning with > (but excluding the first line in the file by specifying NR > 1) or the end of the file, output the appropriate saved values.
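The steps above could also be sketched as a variant that counts member lines instead of reading the last member number, and strips the > and ... with sub() rather than fixed offsets. This is only an illustrative sketch, not the script from this answer: the names cluster_sizes, emit, id and count are made up here, and it assumes every member line starts with a digit and the representative line ends in a lone *.

```shell
# Sample input copied from the question.
sample='>Cluster 423
0   56aa, >HWI-ST1448:257:C3V2HACXX:1:1106:19087:2550.1... at 92.86%
1   64aa, >HWI-ST1448:257:C3V2HACXX:1:1106:15943:81371.1... *
2   41aa, >HWI-ST1448:257:C3V2HACXX:1:1106:12438:91360.3... at 90.24%
3   45aa, >HWI-ST1448:257:C3V2HACXX:1:1108:13046:13861.1... at 91.11%
4   52aa, >HWI-ST1448:257:C3V2HACXX:1:1110:12260:2424.2... at 90.38%
>Cluster 434
0   64aa, >HWI-ST1448:257:C3V2HACXX:1:1106:15723:89894.1... *
1   46aa, >HWI-ST1448:257:C3V2HACXX:2:1312:1967:40935.2... at 97.83%'

# cluster_sizes is a hypothetical helper name, used only for this sketch.
cluster_sizes() {
    awk '
        # Print the finished cluster (if it had a representative), then reset.
        function emit() { if (id != "") print id, count; id = ""; count = 0 }

        /^>/     { emit(); next }            # cluster header: flush the previous cluster
        /^[0-9]/ { count++                   # member line: count it
                   if ($NF == "*") {         # representative line ends with a lone *
                       id = $3
                       sub(/^>/, "", id)     # drop the leading >
                       sub(/\.\.\.$/, "", id)   # drop the trailing ...
                   }
                 }
        END      { emit() }                  # flush the last cluster
    '
}

result=$(printf '%s\n' "$sample" | cluster_sizes)
printf '%s\n' "$result"
```

Because blank lines match none of the rules, this variant should behave the same whether or not the clusters are separated by blank lines.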
    
ok, i tested it on my file but the output is only one id and one group size. any idea why the others are not in the output? –  BSP Sep 10 '14 at 19:53
    
Is it giving you only the last one? Are the "blank" lines in your input really blank, or do they have whitespace characters (spaces, tabs, and/or carriage returns)? If they are not truly blank, they won't match /^$/ and so the corresponding print statement will never be executed. –  G-Man Sep 10 '14 at 20:05
    
sorry, there are no blank lines in the input. I had to put them as a line change in my post was only displayed when i used double 'enter' –  BSP Sep 10 '14 at 20:25
    
OK, I updated my script. Please edit your question to delete the blank line that isn't supposed to be there. –  G-Man Sep 10 '14 at 20:34
    
It worked! Thank you so much –  BSP Sep 10 '14 at 20:41
sed '/^[>0-9]/h;s/.*>\(.*[0-9]\).*\*/[\1 ]P /p
     $s/.*//;/^[>0-9[]/d;g;s/ .*/ 1+pc/ 
' <<\DATA | dc
>Cluster 423
0   56aa, >HWI-ST1448:257:C3V2HACXX:1:1106:19087:2550.1... at 92.86%
1   64aa, >HWI-ST1448:257:C3V2HACXX:1:1106:15943:81371.1... *
2   41aa, >HWI-ST1448:257:C3V2HACXX:1:1106:12438:91360.3... at 90.24%
3   45aa, >HWI-ST1448:257:C3V2HACXX:1:1108:13046:13861.1... at 91.11%
4   52aa, >HWI-ST1448:257:C3V2HACXX:1:1110:12260:2424.2... at 90.38%

>Cluster 434
0   64aa, >HWI-ST1448:257:C3V2HACXX:1:1106:15723:89894.1... *
1   46aa, >HWI-ST1448:257:C3V2HACXX:2:1312:1967:40935.2... at 97.83%
DATA

OUTPUT

HWI-ST1448:257:C3V2HACXX:1:1106:15943:81371.1 5
HWI-ST1448:257:C3V2HACXX:1:1106:15723:89894.1 2

It's pretty simple. sed keeps a copy, in its hold space, of the most recent line that begins with > or a digit. It prints in only two cases: when it can rewrite a line ending in * into a [ID ]P command, or when the line begins with none of >, a digit, or [ (that is, on a blank line). On the last line all characters are removed, so it is handled like a blank line. So dc gets one [ID ]P string to print per cluster and, when sed pulls the saved line back on a blank line, one little addition job (the saved group number followed by 1+pc).
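To see what dc is actually handed, the same sed script can be run without the final pipe. This is just a sketch for inspection: the variable names sample and dc_program are made up here, and the input is the heredoc data from above, including the blank line between the clusters.

```shell
# Same data as the heredoc above, with a blank line separating the clusters.
sample='>Cluster 423
0   56aa, >HWI-ST1448:257:C3V2HACXX:1:1106:19087:2550.1... at 92.86%
1   64aa, >HWI-ST1448:257:C3V2HACXX:1:1106:15943:81371.1... *
2   41aa, >HWI-ST1448:257:C3V2HACXX:1:1106:12438:91360.3... at 90.24%
3   45aa, >HWI-ST1448:257:C3V2HACXX:1:1108:13046:13861.1... at 91.11%
4   52aa, >HWI-ST1448:257:C3V2HACXX:1:1110:12260:2424.2... at 90.38%

>Cluster 434
0   64aa, >HWI-ST1448:257:C3V2HACXX:1:1106:15723:89894.1... *
1   46aa, >HWI-ST1448:257:C3V2HACXX:2:1312:1967:40935.2... at 97.83%'

# Run only the sed stage and capture the dc program it generates.
dc_program=$(printf '%s\n' "$sample" |
    sed '/^[>0-9]/h;s/.*>\(.*[0-9]\).*\*/[\1 ]P /p
         $s/.*//;/^[>0-9[]/d;g;s/ .*/ 1+pc/
    ')

# Each cluster becomes two lines: a string to print with P (no newline),
# then the saved member number plus 1, printed with p and cleared with c:
#   [HWI-ST1448:257:C3V2HACXX:1:1106:15943:81371.1 ]P
#   4 1+pc
#   [HWI-ST1448:257:C3V2HACXX:1:1106:15723:89894.1 ]P
#   1 1+pc
printf '%s\n' "$dc_program"
```

Piping that text into dc is what produces the ID followed by the group size on one line.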

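grep '\*$' file.txt | grep -oP '(?<=>)[^ ]+' | sed 's/\.\.\.$//'
(This needs GNU grep for -P, and it prints only the identifiers, not the group sizes.)

perl -F'\n' -lan00e 'print "$1\t$#F" if />(.*)\.{3} \*$/m'
(This relies on paragraph mode, -00, so the clusters must be separated by blank lines.)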