Sign up ×
Stack Overflow is a community of 4.7 million programmers, just like you, helping each other. Join them; it only takes a minute:

I need to create a large numpy array containing random boolean values without hitting the swap.

My laptop has 8 GB of RAM. Creating a (1200, 2e6) array takes less than 2 s and use 2.29 GB of RAM:

>>> dd = np.ones((1200, int(2e6)), dtype=bool)
>>> dd.nbytes/1024./1024
2288.818359375

>>> dd.shape
(1200, 2000000)

For a relatively small (1200, 400e3), np.random.randint is still quite fast, taking roughly 5 s to generate a 458 MB array:

db = np.array(np.random.randint(2, size=(int(400e3), 1200)), dtype=bool)
print db.nbytes/1024./1024., 'Mb'

But if I double the size of the array to (1200, 800e3) I hit the swap, and it takes ~2.7 min to create db ;(

cmd = """
import numpy as np
db = np.array(np.random.randint(2, size=(int(800e3), 1200)), dtype=bool)
print db.nbytes/1024./1024., 'Mb'"""

print timeit.Timer(cmd).timeit(1)

Using random.getrandbits takes even longer (~8min), and also uses the swap:

from random import getrandbits
db = np.array([not getrandbits(1) for x in xrange(int(1200*800e3))], dtype=bool)

Using np.random.randint for a (1200, 2e6) just gives a MemoryError.

Is there a more efficient way to create a (1200, 2e6) random boolean array?

share|improve this question
up vote 7 down vote accepted

One problem with using np.random.randint is that it generates 64-bit integers, whereas numpy's np.bool dtype uses only 8 bits to represent each boolean value. You are therefore allocating an intermediate array 8x larger than necessary.

A workaround that avoids intermediate 64-bit dtypes is to generate a string of random bytes using np.random.bytes, which can be converted to an array of 8-bit integers using np.fromstring. These integers can then be converted to boolean values, for example by testing whether they are greater than 255 * p, where p is the desired probability of each element being True:

import numpy as np

def random_bool(shape, p=0.5):
    n = np.prod(shape)
    x = np.fromstring(np.random.bytes(n), np.uint8, n)
    return (x > 255 * p).reshape(shape)

Benchmark:

In [1]: shape = 1200, int(2E6)

In [2]: %timeit random_bool(shape)
1 loops, best of 3: 12.7 s per loop

Update:

An even faster method is to exploit the fact that you only need to generate a single random bit per 0 or 1 in your output array. You can therefore create a random array of 8-bit integers 1/8th the size of the final output, then convert it to np.bool using np.unpackbits:

def fast_random_bool(shape):
    n = np.prod(shape)
    nb = -(-n // 8)     # ceiling division
    b = np.fromstring(np.random.bytes(nb), np.uint8, nb)
    return np.unpackbits(b)[:n].reshape(shape).view(np.bool)

For example:

In [3]: %timeit fast_random_bool(shape)
1 loops, best of 3: 5.54 s per loop
share|improve this answer
1  
Your last solution will be even faster if, rather than doing .astype(np.bool), you instead go with .view(np.bool) or .astype(np.bool, copy=False), as either one will spare you a copy of the full array. – Jaime yesterday
    
@Jaime Thanks - I always forget that .astype() returns a copy by default – ali_m yesterday
    
thank's @ali_m this random boolean array is in the context of a numpy-broadcasting question : stackoverflow.com/q/34496409/3313834 – user3313834 10 hours ago

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.