How to create a NumPy Array based on the values of another NumPy array?

Question

I would like to create a NumPy array. The value of it's elements depends on the value of the elements in another NumPy array. Presently, I have to use a for-loop in a list comprehension to iterate through array a to get b. What is the NumPy way to achieve this?

Test Script:

import numpy as np

def get_b( a ):
    b_dict = {  1:10., 2:20., 3:30. }
    return b_dict[ a ]

a = np.full( 10, 2 )
print( f'a = {a}' )
b = np.array( [get_b(i) for i in a] )
print( f'b = {b}' )

Output:

a = [2 2 2 2 2 2 2 2 2 2]
b = [20. 20. 20. 20. 20. 20. 20. 20. 20. 20.]

Does b_dict have to be a dict? If you had an array, eg. ref = np.array([0, 10,20,30]) you quickly select the values by index, ref[a]. I would try to avoid dict when working with numpy. — hpaulj, Commented Jul 29, 2020 at 6:58
@hpaulj No, it does not need to be a dict. I have experimented on your comment and reported my findings below. — Sun Bear, Commented Jul 29, 2020 at 17:10

bigbounty · Accepted Answer · 2020-07-29 02:32:16Z

1

You can use np.vectorize to map a dictionary value to an array

In [6]: b_dict = {  1:10., 2:20., 3:30 }

In [7]: a = np.full( 10, 2 )

In [8]: np.vectorize(b_dict.get)(a)
Out[8]: array([20., 20., 20., 20., 20., 20., 20., 20., 20., 20.])

answered Jul 29, 2020 at 2:32

bigbounty

17.4k7 gold badges44 silver badges75 bronze badges

Add a comment |

score 1 · Accepted Answer · 2020-07-30 13:31:36Z

What about using map and np.fromiter?

def get_b( a ):
    b_dict = {  1:10., 2:20., 3:30. }
    return b_dict[ a ]

a = np.full( 10, 2 )
b = np.fromiter(map(get_b, a), dtype=np.float64)

Edit 1: Small time comparison:

%timeit np.array( [get_b(i) for i in a] )
5.58 µs ± 123 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit np.fromiter(map(get_b, a), dtype=np.float64)
5.77 µs ± 177 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit np.vectorize(b_dict.get)(a)
12.9 µs ± 76.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Edit 2: Seems like that example is too small:

a = np.full( 1000, 2 )

%timeit np.array( [get_b(i) for i in a] )
415 µs ± 9.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit np.fromiter(map(get_b, a), dtype=np.float64)
383 µs ± 2.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit np.vectorize(b_dict.get)(a)
68.6 µs ± 625 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

shouldn't np.fromiter have a dtype?
– Ehsan
Commented Jul 29, 2020 at 6:13 — Ehsan, Commented Jul 29, 2020 at 6:13

Ehsan · Accepted Answer · 2020-07-31 23:27:42Z

Another approach to the problem:

from operator import itemgetter
np.array(itemgetter(*a)(b_dict))

output:

[20., 20., 20., 20., 20., 20., 20., 20., 20., 20.]

Comparison:

#@kmundnic solution
def m1(a):
  def get_b(x):
    b_dict = {  1:10., 2:20., 3:30. }
    return b_dict[x]
  return np.fromiter(map(get_b, a),dtype=np.float)

#@bigbounty solution
def m2(a):
  b_dict = {  1:10., 2:20., 3:30. }
  return np.vectorize(b_dict.get)(a)

#@Ehsan solution
def m3(a):
  b_dict = {  1:10., 2:20., 3:30. }
  return np.array(itemgetter(*a)(b_dict))

#@Sun Bear solution
def m4(a):
  def get_b( a ):
    b_dict = {  1:10., 2:20., 3:30. }
    return b_dict[ a ]
  return np.array( [get_b(i) for i in a] )

in_ = [np.full( n, 2 ) for n in [10,100,1000,10000]]

For small dictionary, seems m2 is fastest at large inputs and m3 for smaller ones.

And for a larger dictionary:

b_dict = dict(zip(np.arange(100),np.arange(100)))
in_ = [np.full(n,50) for n in [10,100,1000,10000]]

m3 is the fastest approach. You can choose based on your dictionary size and key array size.

Could you also add @hpaulj solution (see his comment in my question) to your comparison? — Sun Bear, Commented Jul 29, 2020 at 8:31
@SunBear You don't even need comparison for that. Of course indexing with arrays are faster anyways. If your keys are integers and you are going to call your dict many times, I suggest converting it to array with keys as indices. — Ehsan, Commented Jul 29, 2020 at 9:03

Sun Bear · Accepted Answer · 2020-07-29 21:48:52Z

I like to stress the value of @hpaulj comment to my question:

Does b_dict have to be a dict? If you had an array, eg. ref = np.array([0, 10,20,30]) you quickly select the values by index, ref[a]. I would try to avoid dict when working with numpy.

I found that using NumPy's indexing will lead to a few to several orders of magnitude faster in performance than when trying to work with a python dict.

Building on @Ehsan's solution, below is a script that makes such a comparison.

import numpy as np
from operator import itemgetter
import timeit
import matplotlib.pyplot as plt


#@kmundnic solution
def m1(a):
    def get_b(x):
        b = {  1:10., 2:20., 3:30. }
        #b = dict( zip( np.arange(1,101),np.arange(10,1001,10) ) )
        return b[x]
    return np.fromiter(map(get_b, a),dtype=np.float)

#@bigbounty solution
def m2(a):
    b = {  1:10., 2:20., 3:30. }
    #b = dict( zip( np.arange(1,101),np.arange(10,1001,10) ) )
    return np.vectorize(b.get)(a)

#@Ehsan solution
def m3(a):
    b = {  1:10., 2:20., 3:30. }
    #b = dict( zip( np.arange(1,101),np.arange(10,1001,10) ) )
    return np.array(itemgetter(*a)(b))

#@Sun Bear solution
def m4(a):
    def get_b( a ):
        b = {  1:10., 2:20., 3:30. }
        #b = dict( zip( np.arange(1,101),np.arange(10,1001,10) ) )
        return b[ a ]
    return np.array( [get_b(i) for i in a] )

#@hpaulj solution
def m5(a):
    b = np.array([10, 20, 30])
    #b = np.arange(10,1001,10) 
    return b[a]

        
sizes=[10,100,1000,10000]
pm1 = []
pm2 = []
pm3 = []
pm4 = []
pm5 = []
for size in sizes:
    a = np.full( size, 2 )
    pm1.append( timeit.timeit( 'm1(a)', number=1000, globals=globals() ) )
    pm2.append( timeit.timeit( 'm2(a)', number=1000, globals=globals() ) )
    pm3.append( timeit.timeit( 'm3(a)', number=1000, globals=globals() ) )
    pm4.append( timeit.timeit( 'm4(a)', number=1000, globals=globals() ) )
    pm5.append( timeit.timeit( 'm5(a)', number=1000, globals=globals() ) )

print( 'm1 slower than m5 by :',np.array(pm1) / np.array(pm5) )
print( 'm2 slower than m5 by :',np.array(pm2) / np.array(pm5) )
print( 'm3 slower than m5 by :',np.array(pm3) / np.array(pm5) )
print( 'm4 slower than m5 by :',np.array(pm4) / np.array(pm5) )

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot( sizes, pm1, label='m1' )
ax.plot( sizes, pm2, label='m2' )
ax.plot( sizes, pm3, label='m3' )
ax.plot( sizes, pm4, label='m4' )
ax.plot( sizes, pm5, label='m5' )
ax.grid( which='both' )
ax.set_xscale('log')
ax.set_yscale('log')
ax.legend()
ax.get_xaxis().set_label_text( label='len(a)', fontweight='bold' )
ax.get_yaxis().set_label_text( label='Runtime (sec)', fontweight='bold' )
plt.show()

Results:

len(b) = 3:

m1 slower than m5 by : [  4.22462367  29.79407905  85.03454097 339.2915358 ]
m2 slower than m5 by : [  8.64220685 11.57175871 13.76761749 46.1940683 ]
m3 slower than m5 by : [  3.25785432  21.63131578  54.71305704 220.15777696 ]
m4 slower than m5 by : [  4.60710166  30.93616607  91.8936744  371.00398273 ]

len(b) = 100:

m1 slower than m5 by : [  218.98603678  1976.50128737  9697.76615006 17742.79151719 ]
m2 slower than m5 by : [  41.76535891  53.85600913 109.35129345 164.13075291 ]
m3 slower than m5 by : [  24.82715462  36.77830986  87.56253196 141.04493237 ]
m4 slower than m5 by : [  222.04184193  2001.72120836  9775.22464369 18431.00155305 ]

Collectives™ on Stack Overflow

How to create a NumPy Array based on the values of another NumPy array?

4 Answers 4

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

Your Answer

Sign up or log in

Post as a guest

Related