I have two numpy arrays, looking like:

field = np.array([5,1,3,3,2,1,6])    
counts = np.array([100,210,300,150,20,90,170])

They are not sorted (and shouldn't change). I now want to compute a third array (of the same length and order) that contains, for each entry, the sum of all counts sharing the same field value. Here the result should be:

field_counts = np.array([100,300,450,450,20,300,170])

The arrays are very long, so iterating through them (and repeatedly searching for the matching field entries) is far too inefficient. Maybe I'm just not seeing the wood for the trees... I hope someone can help me out on this!

Aside: when you find yourself needing a groupby operation, that's often a sign you should be using pandas instead of numpy; your operation would be something like df.groupby("field")["counts"].transform(sum). – DSM Mar 26 '15 at 20:53
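For reference, a minimal sketch of the pandas approach DSM suggests, using the arrays from the question (the DataFrame construction is an assumption about how one would set this up; it is not part of the original comment):

```python
import numpy as np
import pandas as pd

field = np.array([5, 1, 3, 3, 2, 1, 6])
counts = np.array([100, 210, 300, 150, 20, 90, 170])

# transform broadcasts each group's sum back to the original row order
df = pd.DataFrame({"field": field, "counts": counts})
field_counts = df.groupby("field")["counts"].transform("sum").to_numpy()
# [100, 300, 450, 450, 20, 300, 170]
```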

I don't know if it will be efficient enough (since I do iterate over field), but here is a suggestion. I first build a dictionary mapping each field value to its total count; then I create the result array from it.

from collections import defaultdict

# accumulate the total count for each field value
dic = defaultdict(int)
for f, c in zip(field, counts):
    dic[f] += c

# map each original field entry back to its group total
field_counts = np.array([dic[f] for f in field])
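If the Python-level loop turns out to be too slow, a fully vectorized variant is possible with np.unique and np.bincount. This is a sketch, not part of the original answer:

```python
import numpy as np

field = np.array([5, 1, 3, 3, 2, 1, 6])
counts = np.array([100, 210, 300, 150, 20, 90, 170])

# inv[i] is the group index of field[i] among the sorted unique field values
_, inv = np.unique(field, return_inverse=True)

# sum the counts per group, then broadcast the sums back to the original order
sums = np.bincount(inv, weights=counts)
field_counts = sums[inv].astype(counts.dtype)
# [100, 300, 450, 450, 20, 300, 170]
```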

Use the following list comprehension:

>>> [np.sum(counts[np.where(field==i)]) for i in field]
[100, 300, 450, 450, 20, 300, 170]

You can get the indices of matching elements in field with np.where:

>>> [np.where(field==i) for i in field]
[(array([0]),), (array([1, 5]),), (array([2, 3]),), (array([2, 3]),), (array([4]),), (array([1, 5]),), (array([6]),)]

And then get the corresponding elements of counts with fancy indexing, and calculate the sum with np.sum.

This will be very slow if the arrays are long; you've made this an N^2 calculation. – DSM Mar 26 '15 at 20:39

This problem can be solved in a fully vectorized manner using the numpy_indexed package (disclaimer: I am its author):

import numpy_indexed as npi
g = npi.group_by(field)
field_counts = g.sum(counts)[1][g.inverse]

g.sum computes the sum for each group of unique field values, and g.inverse maps those group sums back onto the original array.

There is a reason I went through the hassle to package this functionality, since there are indeed many questions of this type. In my perception, all these questions stand to benefit from my answers, as does this one; it substantially improves upon the currently accepted answer in several respects. It is my understanding that the sections you refer to are directed at commercial purposes, whereas this is a free-as-in-beer open-source package; but correct me if I'm wrong. My only selfish motive here is getting it better tested :). – Eelco Hoogendoorn Apr 2 '16 at 18:38

Subjectively, it feels more like self-promotion to me if I do mention my authorship; but thank you for the heads-up. Do you happen to have a link to any resources that are a bit more explicit about the distinction between commercial and non-commercial purposes? – Eelco Hoogendoorn Apr 2 '16 at 18:46

Some of them are duplicates, I would say, yes. I will follow your suggestion to disclose authorship then, thanks. – Eelco Hoogendoorn Apr 2 '16 at 18:56

I do appreciate the feedback. – Eelco Hoogendoorn Apr 2 '16 at 19:02

Awesome @EelcoHoogendoorn, I see you added disclosure :). Please do the same for your other answers as well. As a side note, if some of them are duplicates, feel free to flag them as such! I will delete my previous comments to clean up. – Tunaki Apr 2 '16 at 19:05
