How to Use Pandas with Multiple Column Numpy Array

Question

Okay, I'm stumped on this. I've looked at the Pandas documentation, but I can't figure out the right way to do it, and I think I'm just making a mess. Basically, I have data stored in NumPy arrays, e.g.

import numpy
data = numpy.loadtxt('foo.txt', dtype=str, delimiter=',')
gps_data = numpy.concatenate((data[0:len(data),0:2],data[0:len(data),3:5]),axis=1)
# numpy.float was removed in NumPy 1.24; the builtin float is equivalent
gps_time = data[0:len(data),2:3].astype(float)/1000

gps_data basically looks like this:

array([['50.3482627', '-71.662499', '30', 'network'],
       ['50.3482588', '-71.6624934', '30', 'network'],
       ['50.34829', '-71.6625077', '30', 'network'],
       ...,
       ['20.3482488', '-78.66245463999999', '9', 'gps'],
       ['20.3482598', '-78.6625174', '30', 'network'],
       ['20.34824943', '-78.6624565', '10', 'gps']],
      dtype='|S18')

and gps_time looks like this:

array([[  1.16242035e+09],
       [  1.26242036e+09],
       [  1.36242038e+09],
       ...,
       [  1.32330411e+09],
       [  1.16330413e+09],
       [  1.26330413e+09]])

What I'm trying to do is use DataFrame to bring in another, similar-looking array called acc_data, combine it with gps_data, and then go back through and fill in the times where data are missing. For example, this is what I've been trying:

df1 = DataFrame(gps_data,index=gps_time,columns=['GPS'])

And it gives this error:

ValueError: Shape of passed values is (4, 35047), indices imply (1, 35047)

I don't know how to handle this. If I can find a way around it, I assume the next step, building df2 the same way from acc_data, will work fine, and then I can do:

p = Panel({'GPS': df1, 'ACC': df2})

Any help would be greatly appreciated; I've been stumped on this for the last few hours.

ajcr · Accepted Answer · 2014-10-06 18:19:34Z

You need to make sure you pass in as many column names (using the columns keyword) as there are columns in your NumPy array:

df1 = DataFrame(gps_data, index=gps_time, columns=['col1', 'col2', 'col3', 'col4'])

Pandas raises the error because the array you passed has four columns, but you specified only one column name, 'GPS'.
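
A minimal sketch of the mismatch and the fix, using a small made-up array in place of gps_data. The values and the column names lat, lon, accuracy, and provider are assumptions for illustration, and the exact wording of the error message may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Small stand-in for gps_data: 3 rows, 4 columns of strings (made-up values).
gps_data = np.array([['50.34', '-71.66', '30', 'network'],
                     ['50.35', '-71.67', '30', 'network'],
                     ['20.34', '-78.66', '9', 'gps']])
# 1-D stand-in for gps_time, one timestamp per row.
gps_time = np.array([1.0, 2.0, 3.0])

# One column name for a four-column array: pandas raises ValueError.
try:
    pd.DataFrame(gps_data, index=gps_time, columns=['GPS'])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# One name per column (hypothetical names): the DataFrame is built.
df1 = pd.DataFrame(gps_data, index=gps_time,
                   columns=['lat', 'lon', 'accuracy', 'provider'])
print(mismatch_raised, df1.shape)
```

Leaving out the columns keyword also works; pandas then labels the columns 0 to 3.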

edited Oct 6 '14 at 18:19

answered Oct 6 '14 at 18:14

Sweet, thanks, although now when I do p = Panel({'GPS':df1,'ACC':df2}) it complains that the buffer has the wrong number of dimensions: expected 1, found 2. – eWizardII Oct 6 '14 at 18:28

No problem. What is your df2? What shape is it? – ajcr Oct 6 '14 at 18:33

df2 is [7111 rows x 3 columns]. Basically, df2 looks like:

                      x         y         z
1.362420e+09  -0.249893  4.125504  9.105667
1.362420e+09  -2.738571  5.260941  8.285629

– eWizardII Oct 6 '14 at 18:35

@eWizardII Hmmm... I can't seem to replicate the error and I'm afraid I haven't played around with Panel a great deal. It might be a bug if you're using an older version of Pandas. If not, perhaps asking a new question is the way to go... – ajcr Oct 6 '14 at 19:04

Alright, will do, thanks! I have version 0.14.1 on Windows, which should be the latest version or close to it, I believe. – eWizardII Oct 6 '14 at 19:09

unutbu · Answer 2 · 2014-10-06 18:39:37Z

ajcr is right; the error can be avoided by specifying the right number of columns. Since gps_data has shape (35047, 4), the DataFrame has four columns. So you need columns=['col1', 'col2', 'col3', 'col4'] if you are going to specify column names.

To get gps_data in the right shape, it would also be easier to use

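import numpy as np
import pandas as pd
# dtype=None infers one type per column. Because the columns are of mixed
# types, the result is a 1-D structured array with fields f0, f1, ...
data = np.genfromtxt('foo.txt', dtype=None, delimiter=',',
                     usecols=[0,1,2,3,4], encoding='utf-8')
gps_data = data[['f0', 'f1', 'f3', 'f4']]  # select fields by name
gps_time = data['f2']/1000.0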

and then you can build the DataFrame with

df1 = pd.DataFrame(gps_data, index=gps_time)

Caveats:

gps_time = data[0:len(data),2:3]

makes gps_time 2-dimensional with shape (35047, 1). If you use

gps_time = data[0:len(data),2]

then gps_time will be 1-dimensional, with shape (35047,). This is more likely what you want, since the index (time) appears to be 1-dimensional.
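
The difference between the two indexing forms can be checked on a small made-up array (the 4 x 5 shape below is an assumption standing in for the real data):

```python
import numpy as np

# Stand-in for the loaded data: 4 rows, 5 columns (made-up values).
data = np.arange(20.0).reshape(4, 5)

col_2d = data[:, 2:3]  # a slice on the column axis keeps that axis
col_1d = data[:, 2]    # an integer index drops it

print(col_2d.shape)    # (4, 1)
print(col_1d.shape)    # (4,)

# An existing 2-D column can also be flattened afterwards.
flat = col_2d.ravel()
print(flat.shape)      # (4,)
```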

data = numpy.loadtxt('foo.txt', dtype=str,delimiter=',')

makes all your numbers strings. If you use

np.genfromtxt('foo.txt', dtype=None, delimiter=',')

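the dtype=None tells genfromtxt to make an intelligent guess about the type of each column, so your float-like numbers will automatically have dtype float. Because the columns then differ in type, the result is a 1-dimensional structured array whose columns are selected by field name (f0, f1, ... by default) rather than by position.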
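
A minimal sketch of what dtype=None produces, reading two made-up rows from an in-memory buffer instead of foo.txt. The row layout (latitude, longitude, time in milliseconds, accuracy, provider) is inferred from the question, and the encoding argument assumes NumPy 1.14 or later.

```python
import io
import numpy as np

# Made-up sample in the same layout as the question's file:
# latitude, longitude, time in ms, accuracy, provider.
text = ("50.3482627,-71.662499,1162420350000.0,30,network\n"
        "20.3482488,-78.6624546,1323304110000.0,9,gps\n")

# encoding is passed explicitly so the provider column is read as str.
data = np.genfromtxt(io.StringIO(text), dtype=None, delimiter=',',
                     encoding='utf-8')

print(data.shape)        # one entry per row, not (rows, columns)
print(data.dtype.names)  # default field names f0 ... f4
print(data.dtype)        # the type inferred for each column

# Columns are selected by field name.
gps_time = data['f2'] / 1000.0
print(gps_time)
```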

Alright, I'll try this also; it might be the cause of the problem. I just followed up on the other answer to say that I get an error when using Panel. – eWizardII Oct 6 '14 at 18:29
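
Panel was deprecated and has since been removed from pandas, so the Panel step in the question does not run on current releases. A minimal sketch of one possible substitute follows. It uses small made-up frames in place of df1 and df2, and it assumes that a forward fill is an acceptable way to fill the missing times, which the question does not specify. pd.concat with keys places both frames side by side on the union of their time indexes.

```python
import pandas as pd

# Made-up stand-ins for the GPS and accelerometer frames, indexed by time.
gps = pd.DataFrame({'lat': [50.1, 50.3], 'lon': [-71.1, -71.3]},
                   index=[1.0, 3.0])
acc = pd.DataFrame({'x': [0.1, 0.2, 0.3],
                    'y': [4.1, 4.2, 4.3],
                    'z': [9.1, 9.2, 9.3]},
                   index=[1.0, 2.0, 3.0])

# keys adds an outer column level, so columns become ('GPS', 'lat'), ...
combined = pd.concat([gps, acc], axis=1, keys=['GPS', 'ACC']).sort_index()

# GPS has no row at time 2.0; forward fill carries the last fix forward.
filled = combined.ffill()
print(filled)
```

This sketch assumes each frame's time index has no duplicate values; repeated timestamps would need to be resolved before aligning the frames this way.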
