Numpy Convert String Representation of Boolean Array To Boolean Array

Question

Is there a native NumPy way to convert an array of string representations of booleans, e.g.

['True','False','True','False']

to an actual boolean array that I can use for masking/indexing? I could loop through and rebuild the array, but for large arrays this is slow.

Is it a numpy string array (if such a thing exists) or a python string array? — Eric, Jun 5 '13 at 16:12
@Newmu -- I think that the solution to this is to avoid getting an array of string representations in the first place. How did you come by that array? Maybe that's where we should start looking to optimize this one... — mgilson, Jun 5 '13 at 16:17

DSM · Accepted Answer · 2013-06-05 16:16:46Z


If I understand correctly, you should be able to do an elementwise comparison against the string "True", whether the dtype is a string or object:

>>> a = np.array(['True', 'False', 'True', 'False'])
>>> a
array(['True', 'False', 'True', 'False'], 
      dtype='|S5')
>>> a == "True"
array([ True, False,  True, False], dtype=bool)

or

>>> a = np.array(['True', 'False', 'True', 'False'], dtype=object)
>>> a
array(['True', 'False', 'True', 'False'], dtype=object)
>>> a == "True"
array([ True, False,  True, False], dtype=bool)
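Since the question asks for a mask, here is a minimal sketch of using the comparison result for indexing. The `data` array and the messy input with stray whitespace and mixed case are made up for illustration; `np.char.strip` and `np.char.lower` are only needed if the strings are not exactly 'True'/'False'.

```python
import numpy as np

flags = np.array(['True', 'False', 'True', 'False'])
data = np.array([10, 20, 30, 40])  # hypothetical values to be masked

mask = flags == 'True'   # elementwise comparison, yields a bool array
selected = data[mask]    # keeps the entries where flags is 'True'

# Assumed messy input: normalise whitespace and case before comparing.
messy = np.array([' true', 'FALSE ', 'True', 'false'])
messy_mask = np.char.lower(np.char.strip(messy)) == 'true'
```

The comparison returns a new boolean array, so the original string array is left untouched.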


+1 -- So ... simple. – mgilson Jun 5 '13 at 16:20

The wonders of broadcasting? Also it's fast (20x quicker than the other answer). – Newmu Jun 5 '13 at 16:24

@Newmu And string interning, if I'm not mistaken: all items in a that are 'True' have the same value for id(), as do all the 'False' elements. Oddly, though, is does not appear to work for those elements, even when you test an entry against itself: a[0] is a[0] returns False, even though id(a[0]) == id(a[0]) returns True. I believe interning is why the equality checks here are so much faster than numpy.char.startswith(), even though the functions in numpy.char are supposed to perform fast string operations on numpy arrays. – JAB Jun 5 '13 at 18:08
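A short sketch of what is probably behind the is result described in this comment, as far as I can tell: indexing a fixed-width string array builds a new NumPy scalar on every access, so identity checks fail even though the values are equal, while an object array hands back the stored Python object. Matching id() values for short-lived temporaries may simply mean the memory was reused, so that check is left out here.

```python
import numpy as np

a = np.array(['True', 'False', 'True', 'False'])  # fixed-width string dtype
obj = np.array(['True', 'False'], dtype=object)    # stores references to Python str objects

same_scalar = a[0] is a[0]      # each a[0] creates a separate NumPy string scalar
same_value = a[0] == a[0]       # value comparison still succeeds
same_object = obj[0] is obj[0]  # object arrays return the stored object itself

print(same_scalar, bool(same_value), same_object)
```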


Eric · Answer 2 · 2013-06-05 16:13:27Z


Is this good enough?

my_list = ['True', 'False', 'True', 'False']
np.array(x == 'True' for x in my_list)

It's not native, but if you're starting with a non-native list anyway, it really shouldn't matter.


That won't work as written because numpy doesn't play well with generator expressions. – DSM Jun 5 '13 at 16:14

Boxing it as a list comp works but it's 20x slower than DSM's answer for arrays with more than a few thousand values. – Newmu Jun 5 '13 at 16:22
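To illustrate the two comments above, a minimal sketch of what happens with the generator expression versus the list comprehension. The variable names `wrapped` and `from_list` are mine, not from the answer.

```python
import numpy as np

my_list = ['True', 'False', 'True', 'False']

# Passing a generator: np.array does not iterate it, it wraps the
# generator object itself in a zero-dimensional object array.
wrapped = np.array(x == 'True' for x in my_list)

# A list comprehension gives np.array a real sequence to convert.
from_list = np.array([x == 'True' for x in my_list])

print(wrapped.shape, wrapped.dtype)
print(from_list)
```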


JAB · Answer 3 · 2013-06-06 11:47:01Z

I've found a method that's even faster than DSM's, taking inspiration from Eric: perform the truth test while the numpy array is being created rather than afterwards. The improvement is most visible with smaller lists; for very large lists, the cost of the iteration itself starts to outweigh the advantage. I tested with both is and ==, because is only works when the strings are interned and would not work with non-interned strings. (As 'True' is probably going to be a literal in the script, it should be interned.) My version with == was slower than the one with is, but it was still much faster than DSM's version.

Test setup:

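import timeit
def timer(statement, count):
    # The setup string builds x, a list of `count` random 'True'/'False' strings.
    # timeit.repeat defaults to number=1000000, so each result is the total time for a million runs.
    return timeit.repeat(statement, "from random import choice;import numpy as np;x = [choice(['True', 'False']) for i in range(%i)]" % count)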

>>> stateIs = "y = np.fromiter((e is 'True' for e in x), bool)"
>>> stateEq = "y = np.fromiter((e == 'True' for e in x), bool)"
>>> stateDSM = "y = np.array(x) == 'True'"
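To make the difference between the is and == statements concrete, here is a small sketch. It assumes CPython, where a string assembled at run time is typically not interned, so is can report False for an equal string. The `count` argument is an optional extra that the statements above do not use.

```python
import numpy as np

literal = 'True'
built = ''.join(['Tr', 'ue'])  # equal to literal, but created at run time
x = [literal, 'False', built]

# == compares values, so it does not depend on interning.
by_eq = np.fromiter((e == literal for e in x), dtype=bool, count=len(x))

# `is` compares object identity; in CPython the run-time string is
# usually a different object, so the last entry tends to come out False.
by_is = np.fromiter((e is literal for e in x), dtype=bool, count=len(x))

print(by_eq, by_is)
```

If the strings come from a file or another process rather than from literals in the script, == is the safer choice.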

With 1000 items, the faster statements take roughly 65-72% of the time of DSM's:

>>> timer(stateIs, 1000)
[101.77722641656146, 100.74985342340369, 101.47228618107965]
>>> timer(stateEq, 1000)
[112.26464996250706, 112.50754567379681, 112.76057346127709]
>>> timer(stateDSM, 1000)
[155.67689949529995, 155.96820504501557, 158.32394669279802]

For smaller string arrays (in the hundreds rather than thousands), the elapsed time is less than 50% of DSM's:

>>> timer(stateIs, 100)
[11.947757485669172, 11.927990253608186, 12.057855628259858]
>>> timer(stateEq, 100)
[13.064947253943501, 13.161545451986967, 13.30599035623618]
>>> timer(stateDSM, 100)
[31.270060799078237, 30.941749748808434, 31.253922641324607]

A bit over 25% of DSM's when done with 50 items per list:

>>> timer(stateIs, 50)
[6.856538342483873, 6.741083326021908, 6.708402786859551]
>>> timer(stateEq, 50)
[7.346079345032194, 7.312723444475523, 7.309259899921017]
>>> timer(stateDSM, 50)
[24.154247576229864, 24.173593700599667, 23.946403452288905]

For 5 items, about 11% of DSM's:

>>> timer(stateIs, 5)
[1.8826215278058953, 1.850232652068371, 1.8559381315990322]
>>> timer(stateEq, 5)
[1.9252821868467436, 1.894011299061276, 1.894306935199893]
>>> timer(stateDSM, 5)
[18.060974208809057, 17.916322392367874, 17.8379771602049]
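A scaled-down sketch of the same benchmark for rerunning locally. The `bench` helper and the `number=1000` and `repeat=3` values are my own choices, not part of the original test, so absolute figures will not match those above.

```python
import timeit

SETUP = ("from random import choice; import numpy as np; "
         "x = [choice(['True', 'False']) for _ in range(%d)]")

STATEMENTS = {
    "fromiter ==": "y = np.fromiter((e == 'True' for e in x), bool)",
    "array ==": "y = np.array(x) == 'True'",
}

def bench(count, number=1000, repeat=3):
    """Return the best total time in seconds for each statement."""
    return {name: min(timeit.repeat(stmt, SETUP % count,
                                    repeat=repeat, number=number))
            for name, stmt in STATEMENTS.items()}

results = {count: bench(count) for count in (5, 100, 1000)}
for count, timings in results.items():
    print(count, timings)
```

Taking the minimum of the repeats is a common way to reduce noise from other processes; the ratio between the two statements is the interesting part, not the raw seconds.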

.. nonempty strings are True, though, even if they're "False".. — DSM, Jun 5 '13 at 16:18
@DSM My new answer seems to be a great improvement over my old one. — JAB, Jun 5 '13 at 19:28
@Newmu My two are still slightly faster than DSM's even with 5K values, it's just that the improvement is a lot less than for smaller lists. — JAB, Jun 6 '13 at 12:14
