
I'm trying to extract the first (or only) floating-point number or integer from strings like these:

str1 = np.asarray('92834.1alksjdhaklsjh')
str2 = np.asarray('-987___-')
str3 = np.asarray('-234234.alskjhdasd')

where, if parsed correctly, we should get

var1 = 92834.1   #float
var2 = -987      #int 
var3 = -234234.0 #float

Using boolean masking on numpy arrays, I came up with something like the following, which I'd apply to any of the str_ variables, e.g.:

>> ma1 = np.asarray([not str.isalpha(c) for c in str1.tostring()],dtype=bool)

array([ True,  True,  True,  True,  True,  True,  True, False, False,
     False, False, False, False, False, False, False, False, False,
     False, False], dtype=bool)

>> str1[ma1]

IndexError: too many indices for array

I've read just about everything I can find about indexing with boolean arrays, but I can't get it to work.

The problem is simple enough that I don't think hunkering down to figure out a regex for it is worth it, but complex enough that it's been giving me trouble.

  • Similar algorithm without numpy - ''.join([c for c in s if not c.isalpha()]) . But please note this in no way takes out the first float/int if there are multiple places where digits exist in the string. Sep 23, 2015 at 7:12
  • I think you can use a ^.*?([+-]?\d*\.?\d+) regex here. Does it work for you? Sep 23, 2015 at 7:20
  • @stribizhev - Impressive with the regex (a fear of mine) but for the example you linked, it returns an int, when it needs to return a parsed float. For my application, getting the type correct is important. I modified your script to show what I mean.
    – user27886
    Sep 23, 2015 at 8:12
  • I think that -234234 is an int, not a float. You asked to extract either integer or floats. If you only need floats, use Kasra's version. Sep 23, 2015 at 8:15

1 Answer

You cannot create an array with mixed types like that. If you want different types in a single numpy array, you could use a record array and specify the types explicitly. Here, though, a more straightforward way is to convert your numpy object to a string and use re.search to extract the number:

>>> import re
>>> float(re.search(r'[\d.-]+',str(str1)).group())
92834.1
>>> float(re.search(r'[\d.-]+',str(str2)).group())
-987.0
>>> float(re.search(r'[\d.-]+',str(str3)).group())
-234234.0

But if you want to use a numpy approach, you first need to create an array of characters from your string:

>>> st=str(str1)
>>> arr=np.array(list(st))
>>> mask=np.array([c.isalpha() for c in st])  # must be a boolean array (not a list) for ~ to work
>>> mask
array([False, False, False, False, False, False, False,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True])

>>> arr[~mask]
array(['9', '2', '8', '3', '4', '.', '1'], 
      dtype='|S1')

Then use the str.join method together with float:

>>> float(''.join(arr[~mask]))
92834.1
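
As for the IndexError in the question: np.asarray applied to a plain string gives a zero-dimensional array, so there is no axis for a 20-element boolean mask to index. A minimal sketch of the difference, where the variable names scalar and chars are only illustrative:

```python
import numpy as np

s = '92834.1alksjdhaklsjh'

scalar = np.asarray(s)     # zero-dimensional: the whole string is a single element
chars = np.array(list(s))  # one-dimensional: one element per character

# True wherever the character is not a letter (same mask as in the question).
mask = np.array([not c.isalpha() for c in s])

print(scalar.ndim, chars.ndim)  # 0 1
print(''.join(chars[mask]))     # 92834.1
```

Note that this mask keeps every non-letter character, so for a string such as '-987___-' the joined result still contains the underscores and the trailing '-', and float would reject it.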
