If the date and time string in the example.txt data file were given as one column with no separating whitespace, then genfromtxt
could convert it into a datetime object like this:
import numpy as np
import datetime as dt

def mkdate(text):
    return dt.datetime.strptime(text, '%Y-%m-%dT%H:%M:%S:%f')

data = np.genfromtxt(
    'example.txt',
    names=('data','num','date')+tuple('col{i}'.format(i=i) for i in range(19)),
    converters={'date':mkdate},
    dtype=None)
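To see what the converter produces on its own, here is a minimal sketch that calls mkdate on a single value. The sample string is an assumption about how the combined column would look (date and time joined by a literal T, which is what the format string expects). Note that %f pads short fractions on the right, so a trailing :08 is read as 0.08 seconds.

```python
import datetime as dt

def mkdate(text):
    # Same format string as above: a literal 'T' joins the date and the time.
    return dt.datetime.strptime(text, '%Y-%m-%dT%H:%M:%S:%f')

# Hypothetical field value; the real file may format this column differently.
stamp = mkdate('2011-08-04T19:00:00:08')
print(stamp.isoformat())  # 2011-08-04T19:00:00.080000
```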
Given example.txt as it is, you could form the desired numpy array with
import numpy as np
import datetime as dt
import csv

def mkdate(text):
    # No separator in the format: the date and time columns are joined directly.
    return dt.datetime.strptime(text, '%Y-%m-%d%H:%M:%S:%f')

def using_csv(fname):
    # 'O' is the portable spelling of the object dtype ('|O4' is 32-bit specific).
    desc=([('data', '|S4'), ('num', '<i4'), ('date', 'O')]+
          [('col{i}'.format(i=i), '<f8') for i in range(19)])
    with open(fname,'r') as f:
        reader=csv.reader(f,delimiter='\t')
        # Columns 2 and 3 (date, time) are joined and parsed into one datetime.
        data=np.array([tuple(row[:2]+[mkdate(''.join(row[2:4]))]+row[4:])
                       for row in reader],
                      dtype=desc)
    # print(mc.report_memory())
    return data
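The row-building step is the heart of using_csv, so here is a small standard-library sketch of just that part. The sample line is a made-up stand-in for one row of example.txt (it has only two value columns rather than 19), and it is read from an in-memory buffer instead of a file.

```python
import csv
import datetime as dt
import io

def mkdate(text):
    return dt.datetime.strptime(text, '%Y-%m-%d%H:%M:%S:%f')

# Hypothetical tab-separated line; the real file has more value columns.
sample = io.StringIO("abcd\t1\t2011-08-04\t19:00:00:08\t0.5\t1.5\n")

# Join the date and time fields, parse them, and keep the other fields as-is.
rows = [
    tuple(row[:2] + [mkdate(''.join(row[2:4]))] + row[4:])
    for row in csv.reader(sample, delimiter='\t')
]
print(rows[0])
```

Each tuple now has one field fewer than the raw row, with a datetime object in the third position. The remaining fields are still strings at this point; turning them into numbers is left to the dtype passed to np.array.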
Merging two columns in a numpy array can be slow, especially if the array is large. That's because merging, like resizing, requires allocating memory for a new array and copying the data from the original array into it. So I think it is worth trying to form the correct numpy array directly, instead of in stages (by forming a partially correct array and then merging two columns).
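To make the allocate-and-copy cost concrete, here is a minimal sketch on a tiny structured array. The field names, widths, and values are assumptions chosen for illustration, and the merged column is kept as a string rather than converted to a datetime.

```python
import numpy as np

# Hypothetical two-row array with the date and time in separate fields.
data = np.array(
    [("abcd", 1, "2011-08-04", "19:00:00:08"),
     ("efgh", 2, "2011-08-05", "07:30:00:10")],
    dtype=[("data", "U4"), ("num", "i4"), ("day", "U10"), ("time", "U11")],
)

# Merging means allocating a second array with the new layout...
merged = np.empty(
    data.shape, dtype=[("data", "U4"), ("num", "i4"), ("date", "U21")]
)
# ...and copying every surviving field across, one by one.
for name in ("data", "num"):
    merged[name] = data[name]
merged["date"] = np.char.add(data["day"], data["time"])

print(merged["date"][0])               # 2011-08-0419:00:00:08
print(np.shares_memory(data, merged))  # False
```

The two arrays do not share memory, so for a while both the original and the merged copy are held at once.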
By the way, I tested the above csv code versus merging two columns (below). Forming a single array from csv (above) was faster (and the memory usage was about the same):
import matplotlib.cbook as mc
import numpy as np
import datetime as dt

# mkdate is the function defined alongside using_csv above.
def using_genfromtxt(fname):
    data = np.genfromtxt(fname, dtype=None)
    orig_desc=data.dtype.descr
    # View the adjacent date and time string fields as one 22-byte field.
    view_desc=orig_desc[:2]+[('date','|S22')]+orig_desc[4:]
    # 'O' is the portable spelling of the object dtype ('|O4' is 32-bit specific).
    new_desc=orig_desc[:2]+[('date','O')]+orig_desc[4:]
    newdata = np.empty(data.shape, dtype=new_desc)
    fields=data.dtype.names
    fields=fields[:2]+fields[4:]
    # Copy every field except the two being merged.
    for field in fields:
        newdata[field] = data[field]
    newdata['date']=np.vectorize(mkdate)(data.view(view_desc)['date'])
    # print(mc.report_memory())
    return newdata

# using_csv('example4096.txt')
# using_genfromtxt('example4096.txt')
example4096.txt is the same as example.txt, duplicated 4096 times. It's about 12K lines long.
% python -mtimeit -s'import test' 'test.using_genfromtxt("example4096.txt")'
10 loops, best of 3: 1.92 sec per loop
% python -mtimeit -s'import test' 'test.using_csv("example4096.txt")'
10 loops, best of 3: 982 msec per loop
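A benchmark file like example4096.txt could be produced with something along these lines. This is only a sketch: duplicate_file is a hypothetical helper, the three placeholder lines stand in for the real example.txt (whose length is inferred from the "about 12K lines" figure), and the source file is assumed to end with a newline.

```python
import os
import tempfile

def duplicate_file(src, dst, times):
    """Write `times` back-to-back copies of src into dst."""
    with open(src, 'r') as f:
        content = f.read()
    with open(dst, 'w') as f:
        for _ in range(times):
            f.write(content)

# Demo on a throwaway three-line file (the contents are placeholders).
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, 'example.txt')
dst = os.path.join(tmpdir, 'example4096.txt')
with open(src, 'w') as f:
    f.write('row1\nrow2\nrow3\n')
duplicate_file(src, dst, 4096)

with open(dst) as f:
    line_count = sum(1 for _ in f)
print(line_count)  # 12288
```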
[...] dtype of the array? Are the columns objects or fixed-length string fields? – Sven Marnach Sep 21 '11 at 13:53
[...] 2011-08-04 and 19:00:00:08 when creating the original text file? If there is no whitespace, there is a slick way to form the right array with np.genfromtxt (without having to merge columns). – unutbu Sep 21 '11 at 14:01
[...] a is your array, you can access its dtype using a.dtype. If the columns are fixed-width string columns, this would allow for a minor optimisation as we can skip the step of joining them by reinterpreting the data. This would not be possible if they are Python str objects. – Sven Marnach Sep 21 '11 at 14:09