Extract data in linux/unix

Question

I have input data that I want to parse, extracting values with awk/grep/sed:

group-2 9 10 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-3 0 1 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-2 28 29 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM1,toid=114,use=1,z=1
group-5 0 1 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM1,toid=114,use=1,z=1
group-2 29 30 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM2,toid=115,use=1,z=1
group-5 1 2 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM2,toid=115,use=1,z=1
group-2 10 11 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=DRD2,toid=158,use=1,z=1
group-3 1 2 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=DRD2,toid=158,use=1,z=1
group-2 11 12 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=EGF,toid=164,use=1,z=1
group-3 2 3 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=EGF,toid=164,use=1,z=1
group-2 21 22 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-3 12 13 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-2 0 1 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1A,toid=21,use=1,z=1
group-1 0 1 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1A,toid=21,use=1,z=1
group-2 1 2 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1B,toid=22,use=1,z=1
group-1 1 2 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1B,toid=22,use=1,z=1
group-2 2 3 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1D,toid=23,use=1,z=1
group-1 2 3 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1D,toid=23,use=1,z=1

Basically, I want the distinct values of "from=" together with their "fromid=", and the distinct values of "to=" together with their "toid=".

The desired output is the values of "from=" and "to=" joined row-wise. from=ABCB11 appears many times, but I want it only once; likewise, each value of "to=" should appear only once in the output.

After taking the distinct values from both, I want every row labelled fromid, whether the value originally came from fromid or toid. The output should look like this:

ABCB11 = fromid=4,from=ABCB11
ABCC8 = fromid=5,from=ABCC8
ACE = fromid=11,from=ACE
CHRM1 = fromid=114,from=CHRM1
CHRM2 = fromid=115,from=CHRM2
DRD2 = fromid=158,from=DRD2
EGF = fromid=164,from=EGF
ADRA1A = fromid=21,from=ADRA1A
ADRA1B = fromid=22,from=ADRA1B
ADRA1D = fromid=23,from=ADRA1D

I want exactly the same output as above, but from a new input file, shown below:

ABCB11  4  ACE     11
ABCB11  4  CHRM1   114
ABCB11  4  CHRM2   115
ABCB11  4  DRD2    158
ABCB11  4  EGF     164
ABCC8   5  ACE     11
ABCC8   5  ADRA1A  21
ABCC8   5  ADRA1B  22
ABCC8   5  ADRA1D  23
ABCC8   5  CHRM1   114

The output should be built from all the unique genes in this file.

Why is the to= in your output different from the to= in the input? – cuonglm, Apr 23 '14 at 15:58
It's just the unique values that I want in my output: take the unique values of "from" and "to" and join them row-wise. – Ron, Apr 23 '14 at 16:02

steeldriver · Accepted Answer · 2014-04-23 20:18:10Z

You could use an awk associative array indexed by the field whose uniqueness you are asserting, e.g. for the unique values of the to= field (field $6 when the line is split on commas):

$ awk -F, '{split($6,s,"="); arr[s[2]]=s[2]" = "$7","$6;} END{for (id in arr) print arr[id]}' data.txt
EGF = toid=164,to=EGF
ADRA1A = toid=21,to=ADRA1A
ACE = toid=11,to=ACE
ADRA1B = toid=22,to=ADRA1B
ADRA1D = toid=23,to=ADRA1D
DRD2 = toid=158,to=DRD2
CHRM1 = toid=114,to=CHRM1
CHRM2 = toid=115,to=CHRM2

The expression for the unique fromid entries is the same, but with fields $6 and $7 replaced by $2 and $3:

$ awk -F, '{split($2,s,"="); arr[s[2]]=s[2]" = "$3","$2;} END{for (id in arr) print arr[id]}' data.txt
ABCC8 = fromid=5,from=ABCC8
ABCB11 = fromid=4,from=ABCB11

If you want the output to contain both toid and fromid data, you can combine the expressions, i.e.

awk -F, '{
split($2,s,"="); arr[s[2]]=s[2]" = "$3","$2;
split($6,s,"="); arr[s[2]]=s[2]" = "$7","$6;
} END{for (id in arr) print arr[id]}' data.txt
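
One caveat with the combined expression: awk does not specify the order in which for (id in arr) visits the keys, so the rows can come out in a different order from one awk implementation to another. The following is a minimal sketch, not part of the original answer, that records each gene the first time it is seen and prints the rows in that order. The function names unique_in_order and remember are hypothetical, chosen only for illustration; the field positions are the same as above.

```shell
# unique_in_order FILE
# Hypothetical helper: same rows as the combined awk expression,
# but printed in first-seen order instead of arbitrary array order.
unique_in_order() {
    awk -F, '
    # Record a gene the first time it appears; the latest row text wins,
    # which mirrors the plain arr[key] = value assignment.
    function remember(gene, line) {
        if (!(gene in arr)) order[++n] = gene
        arr[gene] = line
    }
    {
        split($2, s, "="); remember(s[2], s[2] " = " $3 "," $2)   # from= and fromid=
        split($6, s, "="); remember(s[2], s[2] " = " $7 "," $6)   # to= and toid=
    }
    END { for (i = 1; i <= n; i++) print arr[order[i]] }
    ' "$1"
}

# Usage (assuming the input is in data.txt):
#   unique_in_order data.txt
```

The rows keep their original from/fromid or to/toid labels here; only the ordering changes.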

To change the labels (i.e. to label all the fields in one table as toid, even if they come from the fromid lines), probably the most natural way is to pipe the output through sed, e.g.

$ awk -F, '{
split($2,s,"="); arr[s[2]]=s[2]" = "$3","$2;
split($6,s,"="); arr[s[2]]=s[2]" = "$7","$6;
} END{for (id in arr) print arr[id]}' data.txt | sed 's/from/to/g'
ABCC8 = toid=5,to=ABCC8
EGF = toid=164,to=EGF
ADRA1A = toid=21,to=ADRA1A
ACE = toid=11,to=ACE
ABCB11 = toid=4,to=ABCB11
ADRA1B = toid=22,to=ADRA1B
ADRA1D = toid=23,to=ADRA1D
DRD2 = toid=158,to=DRD2
CHRM1 = toid=114,to=CHRM1
CHRM2 = toid=115,to=CHRM2

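You could make the fromid <--> toid substitutions inside awk, but I think this method makes the intent clearer. The other table can then be made just by changing the final sed expression to sed 's/to/from/g'. Note that both substitutions are unanchored, which is safe here only because none of the gene names contains the lowercase string "from" or "to".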
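
For comparison, here is a rough sketch of doing the relabelling inside awk rather than with sed. It is not taken from the original answer: the function name relabel_unique, the label variable and the helper emit are all assumptions made for illustration. It stores only the gene name and its numeric id, then prints every row with one chosen label. It also lists the from= genes first and the to= genes second, each in first-seen order, which appears to be the order used in the question's desired output.

```shell
# relabel_unique FILE [LABEL]
# Hypothetical helper: LABEL is "from" (default) or "to" and is used
# for every output row, regardless of which field the gene came from.
relabel_unique() {
    awk -F, -v label="${2:-from}" '
    # Print one row in the form: GENE = <label>id=N,<label>=GENE
    function emit(gene) {
        print gene " = " label "id=" id[gene] "," label "=" gene
    }
    {
        split($2, f, "="); split($3, fi, "=")   # from=GENE, fromid=N
        split($6, t, "="); split($7, ti, "=")   # to=GENE,   toid=N
        if (!(f[2] in id)) { id[f[2]] = fi[2]; fromlist[++nf] = f[2] }
        if (!(t[2] in id)) { id[t[2]] = ti[2]; tolist[++nt] = t[2] }
    }
    END {
        for (i = 1; i <= nf; i++) emit(fromlist[i])
        for (i = 1; i <= nt; i++) emit(tolist[i])
    }
    ' "$1"
}

# Usage (assuming the input is in data.txt):
#   relabel_unique data.txt        # every row labelled fromid=/from=
#   relabel_unique data.txt to     # every row labelled toid=/to=
```

Because the labels are generated rather than substituted, gene names that happen to contain "from" or "to" would not be altered.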

It works great, but I am not getting these two values: ABCB11 = toid=4,to=ABCB11 and ABCC8 = toid=5,to=ABCC8. – Ron, Apr 23 '14 at 16:24
Can you clarify which field (or combination of fields) you want to test, and which you want to output? The expression I gave finds unique values of field $6, which is the to= field. It can be modified, but I don't understand your requirements. – steeldriver, Apr 23 '14 at 16:41
@steeldriver If you look at the two outputs: irrespective of the from in the input table, I want the distinct values present in "from" in my output as well. I have printed the exact two outputs that I want! I can do a find and replace to get the other output, so even if I get one of them, it works for me. – Ron, Apr 23 '14 at 16:50
@Ramesh I do not have toid=4 and toid=5 in my question, but I want my output like that, if you look at the required output! – Ron, Apr 23 '14 at 16:52
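
The second, four-column input from the question is not covered by the commands above. A possible sketch for it follows, assuming the columns are whitespace-separated and hold, in order, the from gene, its id, the to gene and its id. The function name unique_from_columns and the helper emit are hypothetical names used only for this illustration.

```shell
# unique_from_columns FILE
# Hypothetical helper for the four-column layout:
#   FROM_GENE  FROM_ID  TO_GENE  TO_ID
# Prints each distinct gene once, labelled fromid=/from=.
unique_from_columns() {
    awk '
    function emit(gene) {
        print gene " = fromid=" id[gene] ",from=" gene
    }
    NF >= 4 {
        # Keep the first id seen for each gene and remember where it came from.
        if (!($1 in id)) { id[$1] = $2; fromlist[++nf] = $1 }
        if (!($3 in id)) { id[$3] = $4; tolist[++nt] = $3 }
    }
    END {
        for (i = 1; i <= nf; i++) emit(fromlist[i])   # genes from column 1 first
        for (i = 1; i <= nt; i++) emit(tolist[i])     # then genes from column 3
    }
    ' "$1"
}

# Usage (assuming the four-column data is in newdata.txt, a made-up filename):
#   unique_from_columns newdata.txt
```

Run on the ten sample rows in the question, this should list ABCB11 and ABCC8 first, followed by the to genes in the order they first appear. Lines with fewer than four fields are skipped.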

Ghassan · Answer 2 · 2014-04-23 16:14:00Z

Assuming that the names are in a file called "filename.txt", you can try the following for the first table:

cat filename.txt | awk -F "," '{ print $2 " = " $7 "," $6}' | sed -r 's/^.{5}//'

For the second table:

cat filename.txt | awk -F "," '{ print $2 " = " $3 "," $6}' | sed -r 's/^.{5}//'

Good luck!

EDIT: For the second table:

cat filename.txt | awk -F "," '{ print $2 " = " $7 "," $6}' | sed -r 's/^.{5}//' | sed 's/toid/fromid/'

EDIT 2:

cat filename.txt | awk -F "," '{ print $2 " = " $7 "," $6}' | sed 's/^.....//' | sed 's/toid/fromid/'

The sed pattern here is five dots, one for each leading character to strip, so the -r option is not needed.

edited Apr 23 '14 at 16:14

answered Apr 23 '14 at 15:27


The OP wants unique values. So, you can use uniq at the end of the commands to get the exact output as the OP needs. – Ramesh Apr 23 '14 at 15:30

UUOC alert – 1_CR Apr 23 '14 at 15:38

sed command after pipe throws error sed: illegal option -- r – Ron Apr 23 '14 at 15:41

It doesn't produce the output the OP shows. – cuonglm Apr 23 '14 at 15:50

Sorry, I thought you were filtering for different columns. How about: cat filename.txt | awk -F "," '{ print $2 " = " $7 "," $6}' | sed -r 's/^.{5}//' | sed 's/toid/fromid/' – Ghassan Apr 23 '14 at 15:59
