
I have input data that I want to parse, extracting values with awk/grep/sed:

group-2 9 10 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-3 0 1 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-2 28 29 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM1,toid=114,use=1,z=1
group-5 0 1 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM1,toid=114,use=1,z=1
group-2 29 30 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM2,toid=115,use=1,z=1
group-5 1 2 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM2,toid=115,use=1,z=1
group-2 10 11 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=DRD2,toid=158,use=1,z=1
group-3 1 2 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=DRD2,toid=158,use=1,z=1
group-2 11 12 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=EGF,toid=164,use=1,z=1
group-3 2 3 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=EGF,toid=164,use=1,z=1
group-2 21 22 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-3 12 13 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-2 0 1 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1A,toid=21,use=1,z=1
group-1 0 1 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1A,toid=21,use=1,z=1
group-2 1 2 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1B,toid=22,use=1,z=1
group-1 1 2 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1B,toid=22,use=1,z=1
group-2 2 3 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1D,toid=23,use=1,z=1
group-1 2 3 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1D,toid=23,use=1,z=1

Basically, I want to take the distinct values of "from=" together with their "fromid=", and the distinct values of "to=" together with their "toid=".

In the desired output, the values from "from=" and "to=" are joined row-wise. from=ABCB11 appears many times in the input, but I want it only once; likewise, each "to=" value should appear only once in the output.

Whether an ID came from fromid or toid, I want every row to be labeled fromid once the distinct values of both have been collected. The output format is shown below:

ABCB11 = fromid=4,from=ABCB11
ABCC8 = fromid=5,from=ABCC8
ACE = fromid=11,from=ACE
CHRM1 = fromid=114,from=CHRM1
CHRM2 = fromid=115,from=CHRM2
DRD2 = fromid=158,from=DRD2
EGF = fromid=164,from=EGF
ADRA1A = fromid=21,from=ADRA1A
ADRA1B = fromid=22,from=ADRA1B
ADRA1D = fromid=23,from=ADRA1D

I want exactly the same output as above, but from a new input file, shown below:

ABCB11  4   ACE 11
ABCB11  4   CHRM1   114
ABCB11  4   CHRM2   115
ABCB11  4   DRD2    158
ABCB11  4   EGF 164
ABCC8   5   ACE 11
ABCC8   5   ADRA1A  21
ABCC8   5   ADRA1B  22
ABCC8   5   ADRA1D  23
ABCC8   5   CHRM1   114

The output should again be built by taking all the unique genes.
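
For this whitespace-separated layout, one possible approach is sketched below. It is only a sketch under stated assumptions: the four columns are taken to be from-name, from-id, to-name and to-id, the function name unique_genes_cols is hypothetical, and the "from" genes are listed first, followed by the "to" genes in first-seen order, to match the desired output shown earlier.

```shell
# Hypothetical helper: read four-column rows (from-name from-id to-name to-id)
# and print each distinct gene once, always labelled fromid=/from=.
unique_genes_cols() {
  awk '
    NF < 4 { next }                      # skip blank or short lines
    {
      # remember each name the first time it is seen, keeping input order
      if (!($1 in fid)) { fid[$1] = $2; forder[++n_from] = $1 }
      if (!($3 in tid)) { tid[$3] = $4; torder[++n_to] = $3 }
    }
    END {
      # genes from the first column come first ...
      for (i = 1; i <= n_from; i++) {
        g = forder[i]
        printf "%s = fromid=%s,from=%s\n", g, fid[g], g
      }
      # ... then genes from the third column that were not already printed
      for (i = 1; i <= n_to; i++) {
        g = torder[i]
        if (!(g in fid)) printf "%s = fromid=%s,from=%s\n", g, tid[g], g
      }
    }
  ' "$@"
}
```

It could be called as unique_genes_cols newinput.txt, where newinput.txt is a placeholder file name; with no argument it reads standard input.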

Why is the to= in your output different from the to= in your input? –  Gnouc Apr 23 at 15:58
    
It's just the unique values that I want in my output: taking the unique values of "from" and "to" and joining them row-wise. –  Ron Apr 23 at 16:02
So can you correct your output to match the input? –  Gnouc Apr 23 at 16:07

2 Answers

up vote 2 down vote accepted

You could use an awk associative array indexed by the field whose values you want to be unique. For example, for the unique values of the to= field (field $6 when the line is split on commas):

$ awk -F, '{split($6,s,"="); arr[s[2]]=s[2]" = "$7","$6;} END{for (id in arr) print arr[id]}' data.txt
EGF = toid=164,to=EGF
ADRA1A = toid=21,to=ADRA1A
ACE = toid=11,to=ACE
ADRA1B = toid=22,to=ADRA1B
ADRA1D = toid=23,to=ADRA1D
DRD2 = toid=158,to=DRD2
CHRM1 = toid=114,to=CHRM1
CHRM2 = toid=115,to=CHRM2

The expression for the unique fromid entries is the same, but with fields $6 and $7 replaced by $2 and $3:

$ awk -F, '{split($2,s,"="); arr[s[2]]=s[2]" = "$3","$2;} END{for (id in arr) print arr[id]}' data.txt
ABCC8 = fromid=5,from=ABCC8
ABCB11 = fromid=4,from=ABCB11


If you want the output to contain both the toid and the fromid data, you can combine the two expressions:

awk -F, '{
split($2,s,"="); arr[s[2]]=s[2]" = "$3","$2;
split($6,s,"="); arr[s[2]]=s[2]" = "$7","$6;
} END{for (id in arr) print arr[id]}' data.txt

To change the labels (i.e. to label every entry in the table as toid, even those that come from the fromid fields), probably the most natural way is to pipe the output through sed, e.g.

$ awk -F, '{
split($2,s,"="); arr[s[2]]=s[2]" = "$3","$2;
split($6,s,"="); arr[s[2]]=s[2]" = "$7","$6;
} END{for (id in arr) print arr[id]}' data.txt | sed 's/from/to/g'
ABCC8 = toid=5,to=ABCC8
EGF = toid=164,to=EGF
ADRA1A = toid=21,to=ADRA1A
ACE = toid=11,to=ACE
ABCB11 = toid=4,to=ABCB11
ADRA1B = toid=22,to=ADRA1B
ADRA1D = toid=23,to=ADRA1D
DRD2 = toid=158,to=DRD2
CHRM1 = toid=114,to=CHRM1
CHRM2 = toid=115,to=CHRM2

You could make the fromid <--> toid substitutions inside awk, but I think this method makes the intent clearer. The other table can then be produced just by changing the final sed expression to sed 's/to/from/g'.
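
For comparison, here is a rough sketch of what doing the relabeling inside awk might look like. It is an illustration, not a tested replacement for the pipeline above. The function name unique_genes is hypothetical. The sketch assumes every line has from=, fromid=, to= and toid= in comma-separated fields 2, 3, 6 and 7, as in the sample data. It also keeps first-seen order instead of relying on for (id in arr), whose iteration order is unspecified.

```shell
# Hypothetical helper: print each distinct gene once, labelled fromid=/from=,
# with the from= genes first and then the to= genes, in first-seen order.
unique_genes() {
  awk -F, '
    {
      split($2, f, "="); split($3, fi, "=")   # from=NAME, fromid=N
      split($6, t, "="); split($7, ti, "=")   # to=NAME,   toid=N
      # record each name only the first time it appears
      if (!(f[2] in fid)) { fid[f[2]] = fi[2]; forder[++n_from] = f[2] }
      if (!(t[2] in tid)) { tid[t[2]] = ti[2]; torder[++n_to] = t[2] }
    }
    END {
      for (i = 1; i <= n_from; i++) {
        g = forder[i]
        printf "%s = fromid=%s,from=%s\n", g, fid[g], g
      }
      # to= genes are relabelled here; skip any already printed as from=
      for (i = 1; i <= n_to; i++) {
        g = torder[i]
        if (!(g in fid)) printf "%s = fromid=%s,from=%s\n", g, tid[g], g
      }
    }
  ' "$@"
}
```

Calling unique_genes data.txt should then give the table with every row labelled fromid; swapping the label text in the two printf format strings would give the toid variant.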

It works great, but I am not getting these two values: ABCB11 = toid=4,to=ABCB11 and ABCC8 = toid=5,to=ABCC8. –  Ron Apr 23 at 16:24
    
Can you clarify which field (or combination of fields) you want to test, and which you want to output? The expression I gave finds the unique values of field $6, which is the to= field. It can be modified, but I don't understand your requirements. –  steeldriver Apr 23 at 16:41
    
You do not have toid=4 and toid=5 in your question. –  Ramesh Apr 23 at 16:41
    
@steeldriver If you look at the two outputs: irrespective of the from field in the input table, I want the distinct values present in "from" in my output as well. I have printed the exact two outputs that I want! I can do a find-and-replace to get the other output, so even if I get just one of them, that works for me. –  Ron Apr 23 at 16:50
    
@Ramesh I do not have toid=4 and toid=5 in my question, but I want my output like that. Please see the required output! –  Ron Apr 23 at 16:52

Assuming that the names are in a file called "filename.txt", you can try the following for the first table:

cat filename.txt | awk -F "," '{ print $2 " = " $7 "," $6}' | sed -r 's/^.{5}//'

For the second table:

cat filename.txt | awk -F "," '{ print $2 " = " $3 "," $6}' | sed -r 's/^.{5}//'

Good luck!

EDIT: For the second table:

cat filename.txt | awk -F "," '{ print $2 " = " $7 "," $6}' | sed -r 's/^.{5}//' | sed 's/toid/fromid/'

EDIT 2:

cat filename.txt | awk -F "," '{ print $2 " = " $7 "," $6}' | sed 's/^.....//' | sed 's/toid/fromid/'

That is five dots, one per character removed; it is equivalent to .{5} but does not need sed's -r option.

The OP wants unique values, so you can add uniq at the end of the commands to get exactly the output the OP needs. –  Ramesh Apr 23 at 15:30
UUOC alert –  1_CR Apr 23 at 15:38
    
The sed command after the pipe throws an error: sed: illegal option -- r –  Ron Apr 23 at 15:41
    
It doesn't produce the output the OP shows. –  Gnouc Apr 23 at 15:50
    
Sorry, I thought you were filtering for different columns. How about: cat filename.txt | awk -F "," '{ print $2 " = " $7 "," $6}' | sed -r 's/^.{5}//' | sed 's/toid/fromid/' –  Ghassan Apr 23 at 15:59
