
I have a data set stored in a NumPy array, but all the data inside it is stored as strings. How can I convert a column to int or float and store the values back in the array?

  data = numpy.array([]) # <--- array initialized with numpy.array

The data variable contains the following:

 [['1' '0' '3' ..., '7.25' '' 'S']
  ['2' '1' '1' ..., '71.2833' 'C85' 'C']
  ['3' '1' '3' ..., '7.925' '' 'S']
  ...,
  ['889' '0' '3' ..., '23.45' '' 'S']
  ['890' '1' '1' ..., '30' 'C148' 'C']
  ['891' '0' '3' ..., '7.75' '' 'Q']]

I want to change the first column to int and store the values back. To do so, I did:

 data[0::,0] = data[0::,0].astype(int)

but it didn't change anything.

    
Do you mean a recarray docs.scipy.org/doc/numpy/reference/generated/…? – Padraic Cunningham Jul 19 '15 at 12:00
    
where does ['1' '0' '3' ..., '7.25' '' 'S'].. come from originally? – Padraic Cunningham Jul 19 '15 at 13:10
    
What is the shape and dtype of data? – hpaulj Jul 19 '15 at 15:16

You could set the data type (dtype) at array initialization. For example, if your rows are composed of one 32-bit integer and one 4-byte string, you could specify the dtype 'i4, S4'.

data = np.array([(1, 'a'), (2, 'b')], dtype='i4, S4')

You can read more about dtypes in the NumPy documentation.
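As a short sketch of how such a structured array behaves (the field names 'f0' and 'f1' are NumPy's defaults, not something from the original post):

```python
import numpy as np

# Structured array: each row is (4-byte int, 4-byte string)
data = np.array([(1, 'a'), (2, 'b')], dtype='i4, S4')

# Fields get the default names 'f0', 'f1' and can be accessed by name
print(data['f0'])  # integer column
print(data['f1'])  # byte-string column
```

Each named field keeps its own dtype, which is what a homogeneous string array can't do.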

    
What is this doing exactly? – Padraic Cunningham Jul 19 '15 at 12:17
    
@PadraicCunningham You are specifying that the data type (dtype) for each row is a 4-byte integer and a 4-byte string. – enrico.bacis Jul 19 '15 at 12:23
    
I am not asking for myself; I posted a link to a recarray in the comments already. Some explanation for the OP of how he/she is going to get the original data object into an array with the first column as an integer would be good. – Padraic Cunningham Jul 19 '15 at 12:27
    
@PadraicCunningham: In fact it sounded like a strange question from someone as skilled as you ;) I will add the details to the answer. – enrico.bacis Jul 19 '15 at 12:28

NumPy arrays have associated types for their elements. Assigning to a slice of a NumPy array will up-cast the new data to that type. If that's not possible, the assignment will fail with an exception:

import numpy
a = numpy.array([[1, 2], [3, 4]])
print(a)
# [[1 2]
#  [3 4]]
print(a.dtype)
# int64

a[0, 0] = 'look, a string'
# ValueError: invalid literal for int() with base 10: 'look, a string'

In your case, data[0::,0].astype(int) will produce a NumPy array with associated member type int64, but assigning back into a slice of the original array will convert them back to strings.
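A minimal sketch of that round trip, using a small stand-in for the OP's array:

```python
import numpy as np

data = np.array([['1', 'S'], ['2', 'C']])  # string dtype
data[:, 0] = data[:, 0].astype(int)        # the ints are cast right back to strings
print(data.dtype)                          # still a string dtype
print(data[0, 0] + data[1, 0])             # string concatenation, not addition
```

The slice assignment succeeds silently, which is why it looks like "nothing changed".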

Unlike standard NumPy arrays, the NumPy record arrays mentioned in Padraic's comment allow different types for different columns.

I don't know if a standard NumPy array can be converted to a NumPy record array in-place, so constructing one as suggested in enrico's answer with

data = np.array([(1, 'a'), (2, 'b')], dtype='i4, S4')

might be the best option. If that's not possible, you can construct one from your standard NumPy array and overwrite the variable with the result:

import numpy
data = numpy.array([['1', '0', '3', '7.25', '', 'S'],
                    ['2', '1', '1', '71.2833', 'C85', 'C'],
                    ['3', '1', '3', '7.925', '', 'S'],
                    ['889', '0', '3', '23.45', '', 'S'],
                    ['890', '1', '1', '30', 'C148', 'C'],
                    ['891', '0', '3', '7.75', '', 'Q']])
print(repr(data))
# array([['1', '0', '3', '7.25', '', 'S'],
#        ['2', '1', '1', '71.2833', 'C85', 'C'],
#        ['3', '1', '3', '7.925', '', 'S'],
#        ['889', '0', '3', '23.45', '', 'S'],
#        ['890', '1', '1', '30', 'C148', 'C'],
#        ['891', '0', '3', '7.75', '', 'Q']], 
#       dtype='|S7')

data = numpy.core.records.fromarrays(data.T, dtype='i4,S4,S4,S4,S4,S4')
print(repr(data))
# rec.array([(1, '0', '3', '7.25', '', 'S'), (2, '1', '1', '71.2', 'C85', 'C'),
#        (3, '1', '3', '7.92', '', 'S'), (889, '0', '3', '23.4', '', 'S'),
#        (890, '1', '1', '30', 'C148', 'C'), (891, '0', '3', '7.75', '', 'Q')], 
#       dtype=[('f0', '<i4'), ('f1', '|S4'), ('f2', '|S4'), ('f3', '|S4'), ('f4', '|S4'), ('f5', '|S4')])
    
Does someone know whether an in-place conversion is possible or how a record array would be constructed from a standard NumPy array? @PadraicCunningham, maybe? – das-g Jul 19 '15 at 13:02
    
Not sure about in-place, but if data was a list of Python lists you could use np.array(list(map(tuple, data)), dtype="i4,S4,S4,S4,S4,S4"); if it was an array you could use np.core.records.fromarrays(data.T, dtype="i4,S4,S4,S4,S4,S4") – Padraic Cunningham Jul 19 '15 at 13:17
    
In-place conversions have to leave the total data buffer size unchanged. An 'i4' dtype can be exchanged for four 'i1' values, or (I think) four 'S1' strings. But interpreting strings as ints or floats will change the number of bytes, so it can't be done in-place. – hpaulj Jul 19 '15 at 16:08
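hpaulj's byte-size constraint can be sketched like this (my own illustrative values, not the OP's data):

```python
import numpy as np

a = np.array([1, 2], dtype='i4')   # 8 bytes total
print(a.view('S1').size)           # 8 single-byte strings, same buffer

# Parsing '123' into the integer 123 changes the byte content,
# so it has to go through astype, which allocates a new array:
s = np.array(['123', '456'])
b = s.astype(int)                  # new buffer, not in-place
print(b, b.dtype)
```

A view only reinterprets existing bytes; a dtype change that alters the byte content needs a copy.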

I can make an array that contains strings by starting with lists of strings; note the S4 dtype:

In [690]: data=np.array([['1','0','7.23','two'],['2','3','1.32','four']])

In [691]: data
Out[691]: 
array([['1', '0', '7.23', 'two'],
       ['2', '3', '1.32', 'four']], 
      dtype='|S4')

It's more likely that such an array is created by reading a csv file.

I can also view it as an array of single-byte strings. The shape and dtype have changed, but the data buffer is the same (the same 32 bytes):

In [692]: data.view('S1')
Out[692]: 
array([['1', '', '', '', '0', '', '', '', '7', '.', '2', '3', 't', 'w',
        'o', ''],
       ['2', '', '', '', '3', '', '', '', '1', '.', '3', '2', 'f', 'o',
        'u', 'r']], 
      dtype='|S1')

In fact, I can change an individual byte, changing the 'two' of the original array to 'twos':

In [693]: data.view('S1')[0,-1]='s'

In [694]: data
Out[694]: 
array([['1', '0', '7.23', 'twos'],
       ['2', '3', '1.32', 'four']], 
      dtype='|S4')

But if I try to change an element of data to an integer, it is converted to a string to match the S4 dtype:

In [695]: data[1,0]=4

In [696]: data
Out[696]: 
array([['1', '0', '7.23', 'twos'],
       ['4', '3', '1.32', 'four']], 
      dtype='|S4')

The same would happen if the number came from int(data[1,0]) or some variation on that.

But I can trick it into seeing the integer as a string of bytes (represented as \x04)

In [704]: data[1,0]=np.array(4).view('S4')

In [705]: data
Out[705]: 
array([['1', '0', '7.23', 'twos'],
       ['\x04', '3', '1.32', 'four']], 
      dtype='|S4')

Arrays can share data buffers. The data attribute is a pointer to a block of memory; it's the array's dtype that controls how that block is interpreted. For example, I can make another array of ints and redirect its data attribute:

In [714]: d2=np.zeros((2,4),dtype=int)

In [715]: d2
Out[715]: 
array([[0, 0, 0, 0],
       [0, 0, 0, 0]])

In [716]: d2.data=data.data  # change the data pointer

In [717]: d2
Out[717]: 
array([[        49,         48,  858926647, 1936684916],
       [         4,         51,  842214961, 1920298854]])

Now d2[1,0] is the integer 4. But the other items are not recognizable, because they are strings viewed as integers. That's not the same as passing them through the int() function.

I don't recommend changing the data pointer like this as a regular practice. It would be easy to mess things up. I had to take care to ensure that d2.nbytes was 32, the same as for data.

Because the buffer is shared, a change to d2 also appears in data (but displayed according to a different dtype):

In [718]: d2[0,0]=3

In [719]: data
Out[719]: 
array([['\x03', '0', '7.23', 'twos'],
       ['\x04', '3', '1.32', 'four']], 
      dtype='|S4')

A view with a complex dtype does something similar:

In [723]: data.view('i4,i4,f,|S4')
Out[723]: 
array([[(3, 48, 4.148588672592268e-08, 'twos')],
       [(4, 51, 1.042967401332362e-08, 'four')]], 
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<f4'), ('f3', 'S4')])

Notice the 48 and 51 that also appear in d2. The next float column is unrecognizable.

That gives an idea of what can and cannot be done 'in-place'.

But to get an array that contains numbers and strings in a meaningful way, it is better to construct a new structured array. Perhaps the cleanest way to do that is with an intermediate list of tuples.

In [759]: dl=[tuple(i) for i in data.tolist()]

In [760]: dl
Out[760]: [('1', '0', '7.23', 'two'), ('2', '3', '1.32', 'four')]

In [761]: np.array(dl,dtype='i4,i4,f,|S4')
Out[761]: 
array([(1, 0, 7.230000019073486, 'two'), (2, 3, 1.3200000524520874, 'four')], 
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<f4'), ('f3', 'S4')])

All these fields take up 4 bytes, so nbytes is the same. But the individual values have passed through converters. I have given np.array the freedom to convert values in a way consistent with the input and the new dtype. That's a lot easier than trying to perform some sort of convoluted in-place conversion.

A list of tuples with a mix of numbers and strings would also have worked:

[(1, 0, 7.23, 'two'), (2, 3, 1.32, 'four')]

Structured arrays are displayed as lists of tuples. And in the structured array docs, values are always input as lists of tuples.

A recarray could also be used, but essentially that is just an array subclass that lets you access fields as attributes.
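For illustration, attribute access on a recarray looks like this (field names are again the defaults):

```python
import numpy as np

rec = np.rec.array([(1, 'a'), (2, 'b')], dtype='i4,S4')
print(rec.f0)      # field access as an attribute
print(rec['f0'])   # the plain structured-array style still works
```

Apart from the attribute syntax, it behaves like the structured arrays shown above.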

If the original array was generated from a csv file, it would have been better to use np.genfromtxt (or loadtxt) with appropriate options. It can generate the appropriate list(s) of tuples, and return a structured array directly.
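A sketch of that csv route, with a small stand-in for the OP's file (the column dtypes here are my guesses, not something the OP specified):

```python
import io
import numpy as np

# Stand-in for the original csv file
csv = io.StringIO("1,0,3,7.25,,S\n2,1,1,71.2833,C85,C")

arr = np.genfromtxt(csv, delimiter=',', dtype='i4,i4,i4,f8,S4,S1')
print(arr['f0'])  # first column is already int
print(arr['f3'])  # fare column is already float
```

genfromtxt does the per-column conversion while reading, so no string-to-number step is needed afterwards.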

