Create numpy array with random elements from list

Question

Is there an efficient way to create an arbitrary long numpy array where each dimension consists of n elements drawn from a list of length >= n? Each element in the list can be drawn only once for each dimension.

For instance, if I have the list l = ['cat', 'mescaline', 'popcorn'], I want to be able to, for instance by typing something like np.random.pick_random(l, (3, 2), replace=false), create an array array([['cat', 'popcorn'], ['cat', 'popcorn'], ['mescaline', 'cat']]).

Thank you.

Is there something wrong with the dirt simple and obvious import random; random.shuffle()? — Jim Dennis, Nov 11 '12 at 21:04
I wonder why does it have to be numpy? In general numpy is for numerical type of calculations, hence its name is short for numerical python, granted it does support other types ... pythons own random.sample might be better for this [random.sample(['cat', 'mescaline', 'popcorn'], number_of_members) for index in xrange(number_of_arrays)]... — Samy Vilar, Nov 11 '12 at 21:20
@samy-vilar The reason is that I want to avoid slow loops. I will use this for Monte Carlo simulation, so I will need quite large arrays. — Aae, Nov 12 '12 at 0:52
@jim-dennis The difference in performance when I create large arrays. — Aae, Nov 12 '12 at 0:53

Samy Vilar · Accepted Answer · 2012-11-19 22:03:48Z

Theres a couple of ways of doing this, each has their pros/cons, the following four where just from the top of my head ...

pythons own random.sample, is simple and built in, though it may not be the fastest...
numpy.random.permutation again simple but it creates a copy of which we have to slice, ouch!
numpy.random.shuffle is faster since it shuffles in place, but we still have to slice.
numpy.random.sample is the fastest but it only works on the interval 0 to 1 so we have to normalize it, and convert it to ints to get the random indices, at the end we still have to slice, note normalizing to the size we want does not generate a uniform random distribution.

Here are some benchmarks.

import timeit
from matplotlib import pyplot as plt

setup = \
"""
import numpy
import random

number_of_members = 20
values = range(50)
"""

number_of_repetitions = 20
array_sizes = (10, 200)

python_random_times = [timeit.timeit(stmt = "[random.sample(values, number_of_members) for index in xrange({0})]".format(array_size),
                                     setup = setup,                      
                                     number = number_of_repetitions)
                                        for array_size in xrange(*array_sizes)]

numpy_permutation_times = [timeit.timeit(stmt = "[numpy.random.permutation(values)[:number_of_members] for index in xrange({0})]".format(array_size),
                               setup = setup,
                               number = number_of_repetitions)
                                    for array_size in xrange(*array_sizes)]

numpy_shuffle_times = [timeit.timeit(stmt = \
                                """
                                random_arrays = []
                                for index in xrange({0}):
                                    numpy.random.shuffle(values)
                                    random_arrays.append(values[:number_of_members])
                                """.format(array_size),
                                setup = setup,
                                number = number_of_repetitions)
                                     for array_size in xrange(*array_sizes)]                                                                    

numpy_sample_times = [timeit.timeit(stmt = \
                                    """
                                    values = numpy.asarray(values)
                                    random_arrays = [values[indices][:number_of_members] 
                                                for indices in (numpy.random.sample(({0}, len(values))) * len(values)).astype(int)]
                                    """.format(array_size),
                                    setup = setup,
                                    number = number_of_repetitions)
                                         for array_size in xrange(*array_sizes)]                                                                                                                                            

line_0 = plt.plot(xrange(*array_sizes),
                             python_random_times,
                             color = 'black',
                             label = 'random.sample')

line_1 = plt.plot(xrange(*array_sizes),
         numpy_permutation_times,
         color = 'red',
         label = 'numpy.random.permutations'
         )

line_2 = plt.plot(xrange(*array_sizes),
                    numpy_shuffle_times,
                    color = 'yellow',
                    label = 'numpy.shuffle')

line_3 = plt.plot(xrange(*array_sizes),
                    numpy_sample_times,
                    color = 'green',
                    label = 'numpy.random.sample')

plt.xlabel('Number of Arrays')
plt.ylabel('Time in (s) for %i rep' % number_of_repetitions)
plt.title('Different ways to sample.')
plt.legend()

plt.show()

and the result:

enter image description here

So it looks like numpy.random.permutation is the worst, not surprising, pythons own random.sample is holding it own, so it looks like its a close race between numpy.random.shuffle and numpy.random.sample with numpy.random.sample edging out, so either should suffice, even though numpy.random.sample has a higher memory footprint I still prefer it since I really don't need to build the arrays I just need the random indices ...

$ uname -a
Darwin Kernel Version 10.8.0: Tue Jun  7 16:33:36 PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386 i386

$ python --version
Python 2.6.1

$ python -c "import numpy; print numpy.__version__"
1.6.1

UPDATE

Unfortunately numpy.random.sample doesn't draw unique elements from a population so you'll get repitation, so just stick with shuffle is just as fast.

UPDATE 2

If you want to remain within numpy to leverage some of its built in functionality just convert the values into numpy arrays.

import numpy as np
values = ['cat', 'popcorn', 'mescaline']
number_of_members = 2
N = 1000000
random_arrays = np.asarray([values] * N)
_ = [np.random.shuffle(array) for array in random_arrays]
subset = random_arrays[:, :number_of_members]

Note that N here is quite large as such you are going to get repeated number of permutations, by permutations I mean order of values not repeated values within a permutation, since fundamentally theres a finite number of permutations on any giving finite set, if just calculating the whole set then its n!, if only selecting k elements its n!/(n - k)! and even if this wasn't the case, meaning our set was much larger, we might still get repetitions depending on the random functions implementation, since shuffle/permutation/... and so on only work with the current set and have no idea of the population, this may or may not be acceptable, depends on what you are trying to achieve, if you want a set of unique permutations, then you are going to generate that set and subsample it.

Thanks for the effort. The efficiency of the numpy.shuffle method is okay. However it doesn't save me from slow loops when doing calculations on the array. For instance I would like to do sum(random_arrays, axis=1). Sorry I am so unclear in what I'm looking for. — Aae, Nov 13 '12 at 13:02
umm random_arrays.sum(axis = 1)? random_arrays should be a numpy type. Also note that shuffle may generate non-unique permutations depending on the number of random arrays you need, if you truly want unique permutations than you are going to have to generate them manually and sub sample them, also note that numpy.random.choice was added in 1.7 Im currently at 1.6.1, docs.scipy.org/doc/numpy-dev/reference/generated/… Im not sure about its performance need to test it, but it may be slower since it gens new arrays ... — Samy Vilar, Nov 18 '12 at 1:54
Maybe I misunderstood, but the way I did it generates a 'list': pastee.org/d76bb The permutations shouldn't be unique. — Aae, Nov 19 '12 at 18:39
@Aae I've updated to work with numpy, also sum can only be applied to numeric values, here you have string values, if you want to use indices, simply replace values by range(leng(values)) and it should work. — Samy Vilar, Nov 19 '12 at 22:05

davidbrai · Answer 2 · 2012-11-12 23:46:18Z

up vote 6 down vote

Here's a way to do it using numpy's np.random.randint:

In [68]: l = np.array(['cat', 'mescaline', 'popcorn'])

In [69]: l[np.random.randint(len(l), size=(3,2))]
Out[69]: 
array([['cat', 'popcorn'],
       ['popcorn', 'popcorn'],
       ['mescaline', 'cat']], 
      dtype='|S9')

EDIT: after the additional details that each element should appear at most once in each row

this is not very space efficient, do you need something better?

In [29]: l = np.array(['cat', 'mescaline', 'popcorn'])

In [30]: array([np.random.choice(l, 3, replace=False) for i in xrange(5)])
Out[30]: 
array([['mescaline', 'popcorn', 'cat'],
       ['mescaline', 'popcorn', 'cat'],
       ['popcorn', 'mescaline', 'cat'],
       ['mescaline', 'cat', 'popcorn'],
       ['mescaline', 'cat', 'popcorn']], 
      dtype='|S9')

edited Nov 12 '12 at 23:46

answered Nov 11 '12 at 21:06

davidbrai
779413

Thank you for this. However, there is one detail I forgot to mention. The new array shouldn't consist of dimensions which contain the same element more than once (if it hasn't been listed more than once in the list). – Aae Nov 12 '12 at 0:56

@Aae editted my answer – davidbrai Nov 12 '12 at 23:46

The update gives the desired result, but it's not very efficient. And efficiency is really what I'm requesting. Sorry if I have been unclear. – Aae Nov 14 '12 at 20:20

@Aae then you should specify what type of efficiency is important to you. speed? memory? – davidbrai Nov 15 '12 at 11:55

Speed is the important. I mentioned it in a comment above (‘avoid slow loops’) but I guess I could have made it clearer. – Aae Nov 16 '12 at 1:02

add a comment |

jterrace · Answer 3 · 2012-11-11 21:07:00Z

up vote 2 down vote

>>> import numpy
>>> l = numpy.array(['cat', 'mescaline', 'popcorn'])
>>> l[numpy.random.randint(0, len(l), (3, 2))]
array([['popcorn', 'mescaline'],
       ['mescaline', 'popcorn'],
       ['cat', 'cat']], 
      dtype='|S9')

answered Nov 11 '12 at 21:07

jterrace
22.2k44890

Thank you. But as I said to the other person here: there is one detail I forgot to mention. The new array shouldn't consist of dimensions which contain the same element more than once (if it hasn't been listed more than once in the list). – Aae Nov 12 '12 at 0:57

add a comment |

asked	1 year ago
viewed	2556 times
active	1 year ago

current community

your communities

more stack exchange communities

Create numpy array with random elements from list

3 Answers 3

Your Answer

Not the answer you're looking for? Browse other questions tagged python arrays list random numpy or ask your own question.

Visit Chat

Hot Network Questions

current community

your communities

more stack exchange communities

Create numpy array with random elements from list

3 Answers 3

Did you find this question interesting? Try our newsletter

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged python arrays list random numpy or ask your own question.

Visit Chat

Related

Hot Network Questions