load csv file to numpy and access columns by name

Question

I have a CSV file with a header row. Given this test.csv file:

"A","B","C","D","E","F","timestamp"
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291111964948E12
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291113113366E12
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291120650486E12

I simply want to load it as a matrix/ndarray with 3 rows and 7 columns, and I also want to access the column vectors by column name. If I use genfromtxt (as shown below), I get an ndarray with 3 rows (one per line) and no columns.

import numpy as np
r = np.genfromtxt('test.csv', delimiter=',', dtype=None, names=True)
print(r)
print(r.shape)

[ (611.88243, 9089.5601000000006, 5133.0, 864.07514000000003, 1715.3747599999999, 765.22776999999996, 1291111964948.0)
 (611.88243, 9089.5601000000006, 5133.0, 864.07514000000003, 1715.3747599999999, 765.22776999999996, 1291113113366.0)
 (611.88243, 9089.5601000000006, 5133.0, 864.07514000000003, 1715.3747599999999, 765.22776999999996, 1291120650486.0)]
(3,)

I can get column vectors from column names like this:

print(r['A'])
  [ 611.88243  611.88243  611.88243]

If I use loadtxt, I get an array with 3 rows and 7 columns, but I cannot access columns by name (as shown below).

np.loadtxt("test.csv", delimiter=",", skiprows=1)

I get

  [ [611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291111964948E12]
    [611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291113113366E12]
    [611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291120650486E12] ]

Is there any approach in Python that meets both requirements together (access columns by column name, as with np.genfromtxt, and have a matrix, as with np.loadtxt)?

unutbu · Answer 1 · 2014-06-10 14:59:39Z

Using NumPy alone, the options you show are your only options. Either use an ndarray of homogeneous dtype with shape (3,7), or a structured array of (potentially) heterogeneous dtype and shape (3,).
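
To make the two shapes concrete, here is a minimal sketch that loads the same rows both ways. It reads from an in-memory string instead of test.csv, and I have assumed a header written without quotes so the sketch does not depend on how genfromtxt treats quoted names; CSV_TEXT, structured and plain are names made up for this illustration.

```python
import io
import numpy as np

# Same rows as test.csv; the header is unquoted here (an assumption of this sketch).
CSV_TEXT = """A,B,C,D,E,F,timestamp
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291111964948E12
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291113113366E12
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291120650486E12
"""

# Structured array: one record per row, fields addressed by name.
structured = np.genfromtxt(io.StringIO(CSV_TEXT), delimiter=",", names=True)

# Plain 2D array: homogeneous floats, addressed by integer index only.
plain = np.loadtxt(io.StringIO(CSV_TEXT), delimiter=",", skiprows=1)

print(structured.shape, structured.dtype.names)
print(plain.shape, plain.dtype)
```

The structured array is 1-D because each row is a single record; the column names live in structured.dtype.names rather than in a second axis.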

If you really want a data structure with labeled columns and shape (3,7) (and lots of other goodies), you could use a pandas DataFrame:

In [67]: import pandas as pd
In [68]: df = pd.read_csv('data'); df
Out[68]: 
           A          B     C          D           E          F     timestamp
0  611.88243  9089.5601  5133  864.07514  1715.37476  765.22777  1.291112e+12
1  611.88243  9089.5601  5133  864.07514  1715.37476  765.22777  1.291113e+12
2  611.88243  9089.5601  5133  864.07514  1715.37476  765.22777  1.291121e+12    

In [70]: df['A']
Out[70]: 
0    611.88243
1    611.88243
2    611.88243
Name: A, dtype: float64

In [71]: df.shape
Out[71]: (3, 7)
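
If the labeled DataFrame later has to feed plain matrix code, one possible route is to pull out the underlying array and look up column positions from the labels. This is only a sketch: it reads from an in-memory string rather than a file, and CSV_TEXT, values and col_index are illustrative names.

```python
import io
import pandas as pd

# Same content as test.csv, held in memory so the sketch is self-contained.
CSV_TEXT = '''"A","B","C","D","E","F","timestamp"
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291111964948E12
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291113113366E12
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291120650486E12
'''

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Plain 2D ndarray holding the same numbers as the DataFrame.
values = df.to_numpy()

# Translate a column label into its integer position in that array.
col_index = df.columns.get_loc("A")

print(values.shape)
print(values[:, col_index])
```

Here the labels stay with df and the numbers go into values, so the two have to be kept in sync by hand if columns are reordered.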

A pure NumPy/Python alternative would be to use a dict to map the column names to indices:

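import csv
import numpy as np

filename = 'test.csv'

# Read only the header row to build a name -> column-index mapping.
with open(filename) as f:
    reader = csv.reader(f)
    columns = next(reader)
    colmap = dict(zip(columns, range(len(columns))))

# Load the numeric rows (skipping the header) as a 2D matrix.
arr = np.matrix(np.loadtxt(filename, delimiter=",", skiprows=1))
print(arr[:, colmap['A']])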

yields

[[ 611.88243]
 [ 611.88243]
 [ 611.88243]]

This way, arr is a NumPy matrix, with columns that can be accessed by label using the syntax

arr[:, colmap[column_name]]
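
The same idea can be wrapped in a small helper that also selects several columns at once. This is a sketch rather than a drop-in replacement: load_with_colmap is a hypothetical name, it takes CSV text instead of a filename so it runs without test.csv, and it returns a regular ndarray (the NumPy documentation no longer recommends np.matrix; wrap the result in np.matrix if you really need one).

```python
import csv
import io
import numpy as np

CSV_TEXT = '''"A","B","C","D","E","F","timestamp"
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291111964948E12
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291113113366E12
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291120650486E12
'''

def load_with_colmap(text):
    """Return (2D float array, {column name: column index}) for CSV text."""
    # csv.reader strips the quotes around the header names.
    header = next(csv.reader(io.StringIO(text)))
    colmap = {name: i for i, name in enumerate(header)}
    # Parse the numeric rows, skipping the header line.
    arr = np.loadtxt(io.StringIO(text), delimiter=",", skiprows=1)
    return arr, colmap

arr, colmap = load_with_colmap(CSV_TEXT)

# Select two columns by label in one indexing operation.
subset = arr[:, [colmap["A"], colmap["timestamp"]]]
print(subset.shape)
```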

edited Jun 10 '14 at 14:59

answered Jun 10 '14 at 14:47

I want a numpy matrix (which will be used for further matrix manipulation), not an array. – user2481422 Jun 10 '14 at 14:49

Numpy matrices do not have columns accessible by labels. – unutbu Jun 10 '14 at 14:51

I am wondering about the time efficiency in this case. At first, I thought of loading the CSV file with both loadtxt and genfromtxt to get both the NumPy array and the column names, but that takes too much time. This solution seems similar, except that genfromtxt is replaced with csv.reader (with more lines of code). My CSV file is 5 MB, so I wanted one library that could do both at the same time. – user2481422 Jun 10 '14 at 15:04

The time efficiency (using the csv module) is not bad, no matter how large the file, since only the first line is being read. However, I think Warren Weckesser's solution is better. – unutbu Jun 10 '14 at 16:07

Warren Weckesser · Answer 2 · 2014-06-10 15:29:44Z

Because your data is homogeneous--all the elements are floating point values--you can create a view of the data returned by genfromtxt that is a 2D array. For example,

In [42]: r = np.genfromtxt("test.csv", delimiter=',', names=True)

Create a numpy array that is a "view" of r. This is a regular numpy array, but it is created using the data in r:

In [43]: a = r.view(np.float64).reshape(len(r), -1)

In [44]: a.shape
Out[44]: (3, 7)

In [45]: a[:, 0]
Out[45]: array([ 611.88243,  611.88243,  611.88243])

In [46]: r['A']
Out[46]: array([ 611.88243,  611.88243,  611.88243])

r and a refer to the same block of memory:

In [47]: a[0, 0] = -1

In [48]: r['A']
Out[48]: array([  -1.     ,  611.88243,  611.88243])
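
The view only works when every field has the same dtype, so it may be worth checking that before reshaping. The sketch below does that, and also shows recfunctions.structured_to_unstructured from numpy.lib as a possible alternative when an independent copy is wanted. It reads from an in-memory string with an unquoted header (both assumptions of this sketch), and a_view and a_copy are illustrative names.

```python
import io
import numpy as np
from numpy.lib import recfunctions as rfn

# Same rows as test.csv; the header is unquoted here (an assumption of this sketch).
CSV_TEXT = """A,B,C,D,E,F,timestamp
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291111964948E12
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291113113366E12
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291120650486E12
"""

r = np.genfromtxt(io.StringIO(CSV_TEXT), delimiter=",", names=True)

# The view trick is only valid if all fields are float64.
if not all(r.dtype[name] == np.float64 for name in r.dtype.names):
    raise TypeError("fields are not all float64; a plain view would be wrong")

# 2D view: shares memory with r, so writes show up in both.
a_view = r.view(np.float64).reshape(len(r), -1)

# Independent 2D copy: writes to it do not affect r.
a_copy = rfn.structured_to_unstructured(r, copy=True)

print(a_view.shape, np.shares_memory(a_view, r))
print(a_copy.shape, np.shares_memory(a_copy, r))
```

Use the view when you want name-based and index-based access to the same data, and the copy when later matrix operations must not modify r.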

