
Is there a native numpy way to convert an array of string representations of booleans, e.g.:

['True','False','True','False']

To an actual boolean array I can use for masking/indexing? I could do a for loop going through and rebuilding the array but for large arrays this is slow.

    
Is it a numpy string array (if such a thing exists) or a python string array? – Eric Jun 5 '13 at 16:12
    
It's a numpy string array - weird, I know. – Newmu Jun 5 '13 at 16:14
1  
@Newmu -- I think that the solution to this is to avoid getting an array of string representations in the first place. How did you come by that array? Maybe that's where we should start looking to optimize this one... – mgilson Jun 5 '13 at 16:17
    
I have to deal with it from someone else's code. – Newmu Jun 5 '13 at 16:21

3 Answers

Accepted answer (6 votes)

You should be able to do a boolean comparison, IIUC, whether the dtype is a string or object:

>>> a = np.array(['True', 'False', 'True', 'False'])
>>> a
array(['True', 'False', 'True', 'False'], 
      dtype='|S5')
>>> a == "True"
array([ True, False,  True, False], dtype=bool)

or

>>> a = np.array(['True', 'False', 'True', 'False'], dtype=object)
>>> a
array(['True', 'False', 'True', 'False'], dtype=object)
>>> a == "True"
array([ True, False,  True, False], dtype=bool)
    
+1 -- So ... simple. – mgilson Jun 5 '13 at 16:20
    
The wonders of broadcasting? Also it's fast (20x quicker than the other answer). – Newmu Jun 5 '13 at 16:24
    
@Newmu And string interning, if I'm not mistaken. At least, all items in a that are 'True' have the same value for id(), which is also the case for all 'False' elements. Oddly, though, is does not appear to work for those elements; in fact it doesn't work even when you test an entry against itself: a[0] is a[0] returns False, even though id(a[0]) == id(a[0]) returns True. I believe interning is why the equality checks here are so much faster than numpy.char.startswith(), even though the functions in numpy.char are supposed to perform fast string operations on numpy arrays. – JAB Jun 5 '13 at 18:08

Is this good enough?

my_list = ['True', 'False', 'True', 'False']
np.array(x == 'True' for x in my_list)

It's not native, but if you're starting with a non-native list anyway, it really shouldn't matter.
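For reference, np.array given a bare generator expression wraps the generator object itself in a zero-dimensional object array instead of iterating over it. A sketch of two forms that do produce a boolean array follows; from_list and from_iter are illustrative names, and passing count to np.fromiter is optional.

```python
import numpy as np

my_list = ['True', 'False', 'True', 'False']

# Materialise the comparisons in a list first, then convert.
from_list = np.array([x == 'True' for x in my_list])

# Or consume a generator directly; np.fromiter needs an explicit dtype.
# count is optional and lets numpy allocate the result up front.
from_iter = np.fromiter((x == 'True' for x in my_list),
                        dtype=bool, count=len(my_list))

print(from_list)
print(from_iter)
```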

1  
That won't work as written because numpy doesn't play well with generator expressions. – DSM Jun 5 '13 at 16:14
    
Boxing it as a list comp works but it's 20x slower than DSM's answer for arrays with more than a few thousand values. – Newmu Jun 5 '13 at 16:22

I've found a method that's even faster than DSM's, taking inspiration from Eric, though the improvement is most visible with smaller lists; for very large lists, the cost of iterating in Python starts to outweigh the advantage of doing the truth test while the numpy array is being created rather than afterwards. I tested with both is and ==, since is only works when the strings are interned and would fail for non-interned strings. ('True' is probably going to be a literal in the script, so it should be interned.) My version with == was slower than the one with is, but it was still much faster than DSM's version.

Test setup:

import timeit
def timer(statement, count):
    return timeit.repeat(statement, "from random import choice;import numpy as np;x = [choice(['True', 'False']) for i in range(%i)]" % count)

>>> stateIs = "y = np.fromiter((e is 'True' for e in x), bool)"
>>> stateEq = "y = np.fromiter((e == 'True' for e in x), bool)"
>>> stateDSM = "y = np.array(x) == 'True'"
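The totals reported here are for timeit's default of one million executions per measurement, which is why they run to many seconds. To reproduce the comparison more quickly, a variant of the harness with explicit number and repeat arguments might look like the sketch below. The values 200 and 3, and the names state_eq, state_dsm, times_eq and times_dsm, are arbitrary choices for illustration, and absolute timings will differ by machine and numpy version.

```python
import timeit

def timer(statement, count, number=200, repeat=3):
    # Build a random list of 'True'/'False' strings of the requested length.
    setup = (
        "from random import choice; import numpy as np; "
        "x = [choice(['True', 'False']) for _ in range(%d)]" % count
    )
    # Returns `repeat` totals, each covering `number` executions of statement.
    return timeit.repeat(statement, setup, repeat=repeat, number=number)

state_eq = "y = np.fromiter((e == 'True' for e in x), bool)"
state_dsm = "y = np.array(x) == 'True'"

times_eq = timer(state_eq, 100)
times_dsm = timer(state_dsm, 100)
print(min(times_eq), min(times_dsm))
```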

With 1000 items, the faster statements take roughly 65% to 72% of the time of DSM's:

>>> timer(stateIs, 1000)
[101.77722641656146, 100.74985342340369, 101.47228618107965]
>>> timer(stateEq, 1000)
[112.26464996250706, 112.50754567379681, 112.76057346127709]
>>> timer(stateDSM, 1000)
[155.67689949529995, 155.96820504501557, 158.32394669279802]

For smaller string arrays (in the hundreds rather than thousands), the elapsed time is less than 50% of DSM's:

>>> timer(stateIs, 100)
[11.947757485669172, 11.927990253608186, 12.057855628259858]
>>> timer(stateEq, 100)
[13.064947253943501, 13.161545451986967, 13.30599035623618]
>>> timer(stateDSM, 100)
[31.270060799078237, 30.941749748808434, 31.253922641324607]

A bit over 25% of DSM's when done with 50 items per list:

>>> timer(stateIs, 50)
[6.856538342483873, 6.741083326021908, 6.708402786859551]
>>> timer(stateEq, 50)
[7.346079345032194, 7.312723444475523, 7.309259899921017]
>>> timer(stateDSM, 50)
[24.154247576229864, 24.173593700599667, 23.946403452288905]

For 5 items, about 11% of DSM's:

>>> timer(stateIs, 5)
[1.8826215278058953, 1.850232652068371, 1.8559381315990322]
>>> timer(stateEq, 5)
[1.9252821868467436, 1.894011299061276, 1.894306935199893]
>>> timer(stateDSM, 5)
[18.060974208809057, 17.916322392367874, 17.8379771602049]
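The caveat about is deserves a concrete check. The sketch below builds the strings at run time, so they are separate objects from the literal and are generally not interned. In that situation == still gives the right answer, while is compares object identity and will typically report False for every element in CPython (this is an implementation detail, so the sketch does not rely on it). The names target, by_eq and by_is are illustrative.

```python
import numpy as np

target = 'True'

# Strings assembled at run time, e.g. read from a file or produced by
# other code, are distinct objects and are generally not interned.
x = [''.join(['Tr', 'ue']), ''.join(['Fal', 'se']), ''.join(['T', 'rue'])]

# Value comparison: independent of interning.
by_eq = np.fromiter((e == target for e in x), dtype=bool)

# Identity comparison: only True when both names refer to the same object.
by_is = np.fromiter((e is target for e in x), dtype=bool)

print(by_eq)
print(by_is)
```

Unless the input is known to consist of interned strings, the == variant is the safer choice, at the small speed cost shown in the timings.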
1  
Did you notice that they're all True in the result? :-P – mgilson Jun 5 '13 at 16:17
    
.. nonempty strings are True, though, even if they're "False".. – DSM Jun 5 '13 at 16:18
    
@DSM My new answer seems to be a great improvement over my old one. – JAB Jun 5 '13 at 19:28
    
I'm dealing with 5Kish - thanks though. – Newmu Jun 5 '13 at 23:02
    
@Newmu My two are still slightly faster than DSM's even with 5K values, it's just that the improvement is a lot less than for smaller lists. – JAB Jun 6 '13 at 12:14
