sum columns based on matching fields

Question

I have encountered a problem that exceeds my basic unix knowledge and would really appreciate some help. I have a large file in the following format:

2 1019 0 12
2 1019 3 0
2 1021 0 2
2 1021 2 0
2 1022 4 5
2 1030 0 1
2 1030 5 0
2 1031 4 4

If the values in column two match, I want to sum the values in column 3 and 4 of both lines, else just the sum of the values in the unique line.

So the output I am hoping for would look like this:

2 1019 15
2 1021 4
2 1022 9
2 1030 6
2 1031 8

I am able to sort files according to column 2 with awk or sort and sum the last columns with awk, but only for individual lines not for two lines where column 2 matches.

@glennjackman: Column 1 has the same value throughout each file. It serves as an identifier for the file (I have 45 of those) and will be used for some downstream process. For my question it might as well be ignored (or deleted) and later added again. — TomPio, Mar 18 at 14:33

terdon · Accepted Answer · 2015-03-18 12:21:34Z

I would do this in Perl:

$ perl -lane '$k{"$F[0] $F[1]"}+=$F[2]+$F[3]; 
              END{print "$_ $k{$_}" for keys(%k) }' file 
2 1019 15
2 1021 4
2 1030 6
2 1031 8
2 1022 9

Or awk:

awk '{a[$1" "$2]+=$3+$4}END{for (i in a){print i,a[i]}}' file

If you want the output sorted according to the second column you could just pipe to sort:

awk '{a[$1" "$2]+=$3+$4}END{for (i in a){print i,a[i]}}' file | sort -k2

Note that both solutions include the 1st column as well. The idea is to use the first and second columns as keys to a hash (in perl) or an associative array (in awk). The key in each solution is column1 column2 so if two lines have the same column two but a different column one, they will be grouped separately:

$ cat file
2 1019 2 3
2 1019 4 1
3 1019 2 2

$ awk '{a[$1" "$2]+=$3+$4}END{for (i in a){print i,a[i]}}' file
3 1019 4
2 1019 10

terdon · Answer 2 · 2015-03-18 12:13:41Z

up vote 4 down vote

Maybe this could help, but is column 1 always 2 and does results depend on it?

awk '{ map[$2] += $3 + $4; } END { for (i in map) { print "2", i, map[i] | "sort -t't'" } }' file

or as mentioned by glenn jackman in comments about sorting:

gawk '{ map[$2] += $3 + $4; } END { PROCINFO["sorted_in"] = "@ind_str_asc"; for (i in map) { print 2, i, map[i] } }' file

edited Mar 18 at 12:13

terdon♦
51k766139

answered Mar 18 at 11:56

taliezin
2,154210

2

If you have GNU awk, use PROCINFO["sorted_in"] = "@ind_num_asc" instead of piping to sort. ref gnu.org/software/gawk/manual/html_node/… – glenn jackman Mar 18 at 12:01

@taliezin: Thanks taliezin and terdon. Both approaches worked like a charm. I really appreciate your help. – TomPio Mar 18 at 14:39

1

@taliezin: As I said both worked for me, I marked the terdon answers as the "correct" one. I guess that's what you intended. Thanks again. – TomPio Mar 20 at 13:43

1

If I understand the question you want the total unique keys, we can just add a counter and print it: awk '{ map[$2] += $3 + $4; } END { for (i in map) { print "2", i, map[i] | "sort -t'n'"; cnt++; } print "total unique: " cnt }' file – taliezin Mar 23 at 9:19

1

It is almost the same: awk '{ map[$2] += $3 + $4; oc[$2]++; } END { for (i in map) { print "2", i, map[i], oc[i] | "sort -t'n'"; } }', now you will see another column with occurrences. – taliezin Mar 23 at 9:30

| show 6 more comments

Thor · Answer 3 · 2015-03-18 12:20:59Z

You could pre-sort the data and let awk handle the details:

sort -n infile | awk 'NR>1 && p!=$2 {print p,s} {s+=$3+$4} {p=$2}'

Output:

If you really want the keep the first column, do something like this:

sort -n infile | awk 'NR>1 && p!=$1FS$2 {print p,s} {s+=$3+$4} {p=$1FS$2}'

Output:

Explanation

The p variable holds the $2 value of the previous line, or $1FS$2 in the second case above. This means that the {print p,s} is triggered when $2 of the previous line is not the same as the one on the current line (p!=$2).

asked	1 month ago
viewed	165 times
active	1 month ago

current community

your communities

more stack exchange communities

sum columns based on matching fields

3 Answers 3

Explanation

Your Answer

Not the answer you're looking for? Browse other questions tagged text-processing awk sort uniq or ask your own question.

Linked

Hot Network Questions

current community

your communities

more stack exchange communities

sum columns based on matching fields

3 Answers 3

Did you find this question interesting? Try our newsletter

Explanation

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged text-processing awk sort uniq or ask your own question.

Linked

Related

Hot Network Questions