4

Can anyone tell me the fastest way to convert this string array into a number array, as shown below?

import numpy as np
strarray = np.array([["123456"], ["654321"]])

     to

numberarray = np.array([[1,2,3,4,5,6], [6,5,4,3,2,1]])

Mapping each str to a list and then mapping each character to int is too slow for a large array!

Please help!

6
  • 3
    Possible duplicate of How to convert an array of strings to an array of floats in numpy? Commented Feb 24, 2016 at 13:22
  • 2
    Is this a typo? ["12456"] -> [1,2,3,4,5,6] Commented Feb 24, 2016 at 13:22
  • Are all elements guaranteed to have the same length (like it's 6 in the sample case)? Commented Feb 24, 2016 at 13:32
  • To lan: Yes, that is a typo, already corrected that! Commented Feb 24, 2016 at 13:58
  • To Divakar: Yes, guaranteed to have the same length!! Commented Feb 24, 2016 at 13:59

2 Answers

3

You can split the strings into single characters with the array view method:

In [18]: strarray = np.array([[b"123456"], [b"654321"]])

In [19]: strarray.dtype
Out[19]: dtype('S6')

In [20]: strarray.view('S1')
Out[20]: 
array([['1', '2', '3', '4', '5', '6'],
       ['6', '5', '4', '3', '2', '1']], 
      dtype='|S1')

See here for data type character codes.

Then the most obvious next step is to use astype:

In [23]: strarray.view('S1').astype(int)
Out[23]: 
array([[1, 2, 3, 4, 5, 6],
       [6, 5, 4, 3, 2, 1]])

However, it's a lot faster to reinterpret (view) the memory underlying the strings as single byte integers and subtract 48. This works because ASCII characters take up a single byte and the characters '0' through '9' are binary equivalent to (u)int8's 48 through 57 (check the ord builtin).
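As a minimal sketch of that idea on the small array from the question (the names `codes` and `numberarray` are only illustrative, and the input is assumed to be equal-length ASCII byte strings):

```python
import numpy as np

# Byte strings of equal length, as in the example above (dtype 'S6').
strarray = np.array([[b"123456"], [b"654321"]])

# Reinterpret each byte as an unsigned 8-bit integer: b'1' -> 49, b'2' -> 50, ...
# No data is copied; the last axis grows from 1 to 6 elements.
codes = strarray.view(np.uint8)

# Subtract the code of '0' (48) to turn character codes into digit values.
numberarray = codes - ord("0")

print(numberarray)
```

The result keeps the `uint8` dtype, so call `.astype(int)` afterwards if wider integers are needed.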

Speed comparison:

In [26]: ar = np.array([[''.join(np.random.choice(list('123456789'), size=320))] for _ in range(1000)], bytes)

In [27]: %timeit _ = ar.view('S1').astype(np.uint8)
1 loops, best of 3: 284 ms per loop

In [28]: %timeit _ = ar.view(np.uint8) - ord('0')
1000 loops, best of 3: 1.07 ms per loop

If you have Unicode instead of ASCII, you need to do these steps slightly differently. Or just convert to ASCII first with astype(bytes).
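A rough sketch of both options for a Unicode array, assuming the strings contain only the ASCII digits 0-9 and all have the same length (the names `via_bytes` and `via_uint32` are made up for this example):

```python
import numpy as np

# Python 3 str literals produce a Unicode array (dtype 'U6').
strarray = np.array([["123456"], ["654321"]])

# Option 1: convert to byte strings first, then reuse the uint8 view.
via_bytes = strarray.astype(bytes).view(np.uint8) - ord("0")

# Option 2: NumPy stores each Unicode character as a 4-byte code point,
# so the same buffer can be viewed as uint32 instead of uint8.
via_uint32 = strarray.view(np.uint32) - ord("0")

print(via_bytes)
print(via_uint32)
```

Option 2 avoids the extra copy made by `astype(bytes)`, at the cost of a `uint32` result.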

6
  • Could be a version issue, I am getting unicode for strarray.dtype. I am on Python 3.4. And ar.view('S1') has "b'" all over along with the strings themselves. Commented Feb 24, 2016 at 17:06
  • @Divakar - I changed the strings to bytes for Python 3 compatibility. Commented Feb 24, 2016 at 17:16
  • But if OP has those as strings, he/she has to convert to byte first, right? How could that be done? Commented Feb 24, 2016 at 17:18
  • @Divakar - Python 2.x has ASCII strings as default and for those it works. Commented Feb 24, 2016 at 17:22
  • Ah yes you have mentioned .astype(bytes) for the conversion in the post! Nice, works for me now. Commented Feb 24, 2016 at 17:30
0

Here's an approach that converts each input string to a 1D numeric array of length N, where N is the length of each string. It first converts every string to its integer value and divides that value by successive powers of 10. Each digit is then recovered as the difference between one scaled value and 10 times the next one. The implementation looks like this -

A = (strarray.astype(int)/(10**np.arange(len(strarray[0][0])))).astype(int)
out = np.column_stack((A[:,-1],(A[:,:-1] - 10*A[:,1:])[:,::-1]))

Sample run -

In [177]: strarray  = np.array([["0308468"], ["6540542"], ["4973473"]])

In [178]: A = (strarray.astype(int)/(10**np.arange(len(strarray[0][0])))).astype(int)
     ...: out = np.column_stack((A[:,-1],(A[:,:-1] - 10*A[:,1:])[:,::-1]))
     ...: 

In [179]: out
Out[179]: 
array([[0, 3, 0, 8, 4, 6, 8],
       [6, 5, 4, 0, 5, 4, 2],
       [4, 9, 7, 3, 4, 7, 3]])
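For comparison, here is a sketch of a closely related variant that stays in integer arithmetic, using floor division and modulo instead of float division and differencing. It is not the code from this answer, and it assumes every string has the same length and fits in an int64 (at most 18 digits); `n_digits`, `values` and `powers` are illustrative names.

```python
import numpy as np

strarray = np.array([["0308468"], ["6540542"], ["4973473"]])

n_digits = len(strarray[0, 0])        # all strings share this length
values = strarray.astype(np.int64)    # shape (3, 1); leading zeros vanish here
powers = 10 ** np.arange(n_digits - 1, -1, -1, dtype=np.int64)  # 10**6 ... 10**0

# Broadcasting (3, 1) against (7,) gives (3, 7): floor-divide to shift each
# digit into the ones place, then take it modulo 10.
out = (values // powers) % 10

print(out)
```

Because the number of columns comes from the string length rather than the integer value, leading zeros are still reproduced in the output.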
1
  • Tricky solution! Thanks for providing this method and enlightening me! Commented Feb 25, 2016 at 14:44
