
So I have sample data in a file, arranged like this:

  u   v   w   p
 100 200 300 400 
 101 201 301 401
 102 202 302 402
 103 203 303 403 
 104 204 304 404
 105 205 305 405
 106 206 306 406
 107 207 307 407

Now I want to read the 1st column and save it into a list 'u', the 2nd column into a list 'v', and so on for every column up to 'p'. This is what I have so far:

import numpy as np
u  = []
v  = []
w  = []
p  = []

with open('testdata.dat') as f:
   for line in f:
       for x in line.split():
           u.append([int(x)])
           v.append([int(x)+1])
           w.append([int(x)+2])
           p.append([int(x)+3]) 

print 'u is'
print(u)
print 'v is'
print(v)
print 'w is'
print(w)
print 'p is'
print(p)

I have tried varying the indices, but obviously it is wrong since I get the output

u is
[[100], [200], [300], [400], [101], [201], [301], [401], [102], [202], [302], 
 [402], [103], [203], [303], [403], [104], [204], [304], [404], [105], [205], 
 [305], [405], [106], [206], [306], [406], [107], [207], [307], [407]]

v is
[[101], [201], [301], [401], [102], [202], [302], [402], [103], [203], [303], 
 [403], [104], [204], [304], [404], [105], [205], [305], [405], [106], [206], 
 [306], [406], [107], [207], [307], [407], [108], [208], [308], [408]]

w is
[[102], [202], [302], [402], [103], [203], [303], [403], [104], [204], [304], 
 [404], [105], [205], [305], [405], [106], [206], [306], [406], [107], [207], 
 [307], [407], [108], [208], [308], [408], [109], [209], [309], [409]]

p is
[[103], [203], [303], [403], [104], [204], [304], [404], [105], [205], [305], 
 [405], [106], [206], [306], [406], [107], [207], [307], [407], [108], [208], 
 [308], [408], [109], [209], [309], [409], [110], [210], [310], [410]]

It just increments every value by an offset and reads the entire row, whereas I want the data from every column written to a separate variable, i.e. corresponding to the names given in the sample data: u = 100 --> 107, v = 200 --> 207, etc.

Any ideas on how to do this in Python? (I have to perform this operation on really large datasets in an iterative manner, so fast and efficient code would be of great benefit.)

I don't understand how you deal with the data header line? – Sylvain Leroux 19 hours ago
I did not mention it in the code, but I can skip the first line from being read. I put that header in for others to understand what my data meant. Sorry if that confused you. – arvind 18 hours ago
How large is your data set? – yaccz 18 hours ago
I would have 7-8 columns with at least 8000-10000 numbers in each column. I would have to execute this with every iteration of a numerical simulation code that I am running. – arvind 18 hours ago
To compare various implementations against your real dataset you could use timeit. Would be cool if you share your results! – Sylvain Leroux 18 hours ago

3 Answers


Please change the inner loop:

   for x in line.split():
       u.append([int(x)])
       v.append([int(x)+1])
       w.append([int(x)+2])
       p.append([int(x)+3]) 

to

   x = line.split()
   u.append([int(x[0])])
   v.append([int(x[1])])
   w.append([int(x[2])])
   p.append([int(x[3])])

In your original implementation, the statements inside the loop "for x in line.split():" are executed four times per line (once for each column), which is why every value ends up in every list.
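For reference, the whole corrected script might look like this in Python 3 (a sketch that appends plain ints rather than one-element lists, and skips the header row with next(f)):

```python
# Write the sample data first so the example is self-contained
with open('testdata.dat', 'w') as f:
    f.write("  u   v   w   p\n")
    f.write(" 100 200 300 400\n")
    f.write(" 101 201 301 401\n")

u, v, w, p = [], [], [], []
with open('testdata.dat') as f:
    next(f)                      # skip the header line "u v w p"
    for line in f:
        cols = line.split()
        u.append(int(cols[0]))
        v.append(int(cols[1]))
        w.append(int(cols[2]))
        p.append(int(cols[3]))

print('u is', u)  # u is [100, 101]
print('p is', p)  # p is [400, 401]
```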


If I understand it well, by using the Python built-in functions zip and map, you only need one line to do that:

from itertools import izip

u,v,w,p = izip(*(map(int,line.split()) for line in open('data.txt')))

# Usage (Python3 syntax)
print("u is", list(u))
print("v is", list(v))
print("w is", list(w))
print("p is", list(p))

Producing the following result:

u is [100, 101, 102, 103, 104, 105, 106, 107]
v is [200, 201, 202, 203, 204, 205, 206, 207]
w is [300, 301, 302, 303, 304, 305, 306, 307]
p is [400, 401, 402, 403, 404, 405, 406, 407]

Since this is your concern: implicit looping through zip and map should exhibit better performance than doing it with explicit loops in Python (even if loops are really fast). I'm not sure this solution has a better memory footprint, though...

EDIT: replaced zip with izip to get a lazy iterator even on Python 2.x
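On Python 3 the izip import is unnecessary, since zip itself returns a lazy iterator. A sketch of the same transpose trick, run against an in-memory sample instead of a file:

```python
lines = ["100 200 300 400",
         "101 201 301 401",
         "102 202 302 402"]

# map(int, ...) converts one row of text into ints;
# zip(*...) transposes the rows into columns
u, v, w, p = zip(*(map(int, line.split()) for line in lines))

print(u)  # (100, 101, 102)
print(p)  # (400, 401, 402)
```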

+1 for elegance! – arvind 18 hours ago
This won't fit into memory if the data set is large (YMMV) as you are loading the whole data file. – yaccz 18 hours ago
@yaccz I'm not sure about your assertion. My understanding is the inner structure is a generator and so will read lines one by one while zip is requesting them, discarding previous input data rows after usage. I don't know how the "star operator" * impacts this expected behavior, nor whether using izip instead of zip has any impact in that domain. Except for that last point, I think this is the exact same solution you proposed in your edit 40 minutes ago. – Sylvain Leroux 18 hours ago
I think it depends on python version. When I look into manual, py3.2 zip() seems to be returning generator, however py2.7 returns a list, where the whole file will be loaded into memory, iiuc. – yaccz 17 hours ago
@yaccz I have changed the answer accordingly. I'm not absolutely certain this has a huge impact regarding memory footprint because of the statement u,v,w,p = izip(...) (is this optimized for generators?) Maybe extracting columns by using itertools.islice could lead to a better solution? – Sylvain Leroux 15 hours ago

x.append([int(y)+c]) appends a list containing the single element int(y)+c

you need x.append(int(y)+c) to get a list of numbers instead of a list of singletons
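The difference in a couple of lines:

```python
nested, flat = [], []
for y in ["100", "101"]:
    nested.append([int(y)])  # each element is itself a one-item list
    flat.append(int(y))      # each element is a plain number

print(nested)  # [[100], [101]]
print(flat)    # [100, 101]
```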

Also, here is a pretty nice solution:

from itertools import izip

a="""1 2 3 4
10 20 30 40"""

lines= ([int(y) for y in x.split()] for x in a.split("\n"))
cols = izip(*lines)

print list(cols)

prints

[(1, 10), (2, 20), (3, 30), (4, 40)]

In your case, a.split("\n") would become the open file object itself (iterating over open("data") yields one line at a time), or open("data").readlines() if you don't mind loading the whole file.

This should give you much better memory performance, since only one line of the data file needs to be in memory at any given time, unless you continue the computation by turning the generators into lists.

However, I don't know how it will perform CPU-wise; my guesstimate is it will be a bit better than, or about the same as, your original code.

If you are going to benchmark this, it would also be interesting to use plain lists instead of generators and to try it on PyPy (because https://bitbucket.org/pypy/pypy/wiki/JitFriendliness, see the generators headline) if you can fit it into memory.

Considering your data set

  (10**4 * 8 * 12)/1024.0

Assuming your numbers are relatively small and take 12 bytes each (Python: How much space does each element of a list take?), that gives me a little under 1MB of memory to hold all the data at once, which is a pretty tiny data set in terms of memory consumption.
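Spelling out that back-of-the-envelope figure (the 12-bytes-per-int size is the assumption from the linked question; CPython's real per-int overhead is larger):

```python
rows = 10**4         # ~10000 numbers per column
cols = 8             # 7-8 columns
bytes_per_int = 12   # assumed storage per integer

kib = (rows * cols * bytes_per_int) / 1024.0
print(kib)  # 937.5 KiB, a little under 1 MiB
```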

+1. I will try each of these methods and let you all know how they fare in terms of performance. – arvind 8 hours ago
