Memory-efficient way to generate a large numpy array containing random boolean values

Question

I need to create a large numpy array containing random boolean values without hitting the swap.

My laptop has 8 GB of RAM. Creating a (1200, 2e6) array takes less than 2 s and use 2.29 GB of RAM:

>>> dd = np.ones((1200, int(2e6)), dtype=bool)
>>> dd.nbytes/1024./1024
2288.818359375

>>> dd.shape
(1200, 2000000)

For a relatively small (1200, 400e3), np.random.randint is still quite fast, taking roughly 5 s to generate a 458 MB array:

db = np.array(np.random.randint(2, size=(int(400e3), 1200)), dtype=bool)
print db.nbytes/1024./1024., 'Mb'

But if I double the size of the array to (1200, 800e3) I hit the swap, and it takes ~2.7 min to create db ;(

cmd = """
import numpy as np
db = np.array(np.random.randint(2, size=(int(800e3), 1200)), dtype=bool)
print db.nbytes/1024./1024., 'Mb'"""

print timeit.Timer(cmd).timeit(1)

Using random.getrandbits takes even longer (~8min), and also uses the swap:

from random import getrandbits
db = np.array([not getrandbits(1) for x in xrange(int(1200*800e3))], dtype=bool)

Using np.random.randint for a (1200, 2e6) just gives a MemoryError.

Is there a more efficient way to create a (1200, 2e6) random boolean array?

ali_m · Accepted Answer · 2015-12-28 13:45:47Z

One problem with using np.random.randint is that it generates 64-bit integers, whereas numpy's np.bool dtype uses only 8 bits to represent each boolean value. You are therefore allocating an intermediate array 8x larger than necessary.

A workaround that avoids intermediate 64-bit dtypes is to generate a string of random bytes using np.random.bytes, which can be converted to an array of 8-bit integers using np.fromstring. These integers can then be converted to boolean values, for example by testing whether they are greater than 255 * p, where p is the desired probability of each element being True:

import numpy as np

def random_bool(shape, p=0.5):
    n = np.prod(shape)
    x = np.fromstring(np.random.bytes(n), np.uint8, n)
    return (x > 255 * p).reshape(shape)

Benchmark:

In [1]: shape = 1200, int(2E6)

In [2]: %timeit random_bool(shape)
1 loops, best of 3: 12.7 s per loop

Update:

An even faster method is to exploit the fact that you only need to generate a single random bit per 0 or 1 in your output array. You can therefore create a random array of 8-bit integers 1/8th the size of the final output, then convert it to np.bool using np.unpackbits:

def fast_random_bool(shape):
    n = np.prod(shape)
    nb = -(-n // 8)     # ceiling division
    b = np.fromstring(np.random.bytes(nb), np.uint8, nb)
    return np.unpackbits(b)[:n].reshape(shape).view(np.bool)

For example:

In [3]: %timeit fast_random_bool(shape)
1 loops, best of 3: 5.54 s per loop

Your last solution will be even faster if, rather than doing .astype(np.bool), you instead go with .view(np.bool) or .astype(np.bool, copy=False), as either one will spare you a copy of the full array. — Jaime, yesterday
@Jaime Thanks - I always forget that .astype() returns a copy by default — ali_m, yesterday
thank's @ali_m this random boolean array is in the context of a numpy-broadcasting question : stackoverflow.com/q/34496409/3313834 — user3313834, 10 hours ago

asked	yesterday
viewed	135 times
active	yesterday

current community

your communities

more stack exchange communities

Memory-efficient way to generate a large numpy array containing random boolean values

1 Answer 1

Update:

Your Answer

Not the answer you're looking for? Browse other questions tagged python performance numpy random boolean or ask your own question.

Linked

Hot Network Questions

current community

your communities

more stack exchange communities

Memory-efficient way to generate a large numpy array containing random boolean values

1 Answer 1

Update:

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged python performance numpy random boolean or ask your own question.

Linked

Related

Hot Network Questions