
So I have sample data in a file, arranged like this:

  u   v   w   p
 100 200 300 400 
 101 201 301 401
 102 202 302 402
 103 203 303 403 
 104 204 304 404
 105 205 305 405
 106 206 306 406
 107 207 307 407

Now I want to read the 1st column and save it into a list 'u', the 2nd column into a list 'v', and so on for every column up to 'p'. This is what I have so far:

import numpy as np
u  = []
v  = []
w  = []
p  = []

with open('testdata.dat') as f:
   for line in f:
       for x in line.split():
           u.append([int(x)])
           v.append([int(x)+1])
           w.append([int(x)+2])
           p.append([int(x)+3]) 

print 'u is'
print(u)
print 'v is'
print(v)
print 'w is'
print(w)
print 'p is'
print(p)

I have tried varying the indices, but obviously it is wrong since I get the output

u is
[[100], [200], [300], [400], [101], [201], [301], [401], [102], [202], [302], 
 [402], [103], [203], [303], [403], [104], [204], [304], [404], [105], [205], 
 [305], [405], [106], [206], [306], [406], [107], [207], [307], [407]]

v is
[[101], [201], [301], [401], [102], [202], [302], [402], [103], [203], [303], 
 [403], [104], [204], [304], [404], [105], [205], [305], [405], [106], [206], 
 [306], [406], [107], [207], [307], [407], [108], [208], [308], [408]]

w is
[[102], [202], [302], [402], [103], [203], [303], [403], [104], [204], [304], 
 [404], [105], [205], [305], [405], [106], [206], [306], [406], [107], [207], 
 [307], [407], [108], [208], [308], [408], [109], [209], [309], [409]]

p is
[[103], [203], [303], [403], [104], [204], [304], [404], [105], [205], [305], 
 [405], [106], [206], [306], [406], [107], [207], [307], [407], [108], [208], 
 [308], [408], [109], [209], [309], [409], [110], [210], [310], [410]]

It just increments every value by an offset and reads the entire row, whereas I want the data from every column written to a separate variable, i.e. corresponding to the names given in the sample data: u = 100 --> 107, v = 200 --> 207, etc.

Any ideas on how to do this in Python? (I have to perform this operation on really large datasets in an iterative manner, so fast and efficient code would be of great benefit.)

I don't understand how you deal with the data header line? – Sylvain Leroux 19 hours ago
I did not mention it in the code, but I can skip the first line from being read. I put that header in for others to understand what my data meant. Sorry if that confused you. – arvind 18 hours ago
How large is your data set? – yaccz 18 hours ago
I would have 7-8 columns with at least 8000-10000 numbers in each column. I would have to execute this with every iteration of a numerical simulation code that I am running. – arvind 18 hours ago
To compare various implementations against your real dataset you could use timeit. Would be cool if you share your results! – Sylvain Leroux 18 hours ago

3 Answers


Please change the inner loop:

   for x in line.split():
       u.append([int(x)])
       v.append([int(x)+1])
       w.append([int(x)+2])
       p.append([int(x)+3]) 

to

   x = line.split()
   u.append([int(x[0])])
   v.append([int(x[1])])
   w.append([int(x[2])])
   p.append([int(x[3])])

In your original implementation, the statements inside the loop "for x in line.split():" are executed four times per line (once for each column), which is why every value ends up in every list.
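For reference, the whole corrected script might look like this in Python 3 (a sketch that appends plain ints rather than one-element lists, and skips the header row with next(f)):

```python
# Write the sample data first so the example is self-contained
with open('testdata.dat', 'w') as f:
    f.write("  u   v   w   p\n")
    f.write(" 100 200 300 400\n")
    f.write(" 101 201 301 401\n")

u, v, w, p = [], [], [], []
with open('testdata.dat') as f:
    next(f)                      # skip the header line "u v w p"
    for line in f:
        cols = line.split()
        u.append(int(cols[0]))
        v.append(int(cols[1]))
        w.append(int(cols[2]))
        p.append(int(cols[3]))

print('u is', u)  # u is [100, 101]
print('p is', p)  # p is [400, 401]
```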


If I understand it well, by using the Python built-in functions zip and map, you only need one line to do that:

from itertools import izip

u,v,w,p = izip(*(map(int,line.split()) for line in open('data.txt')))

# Usage (Python3 syntax)
print("u is", list(u))
print("v is", list(v))
print("w is", list(w))
print("p is", list(p))

Producing the following result:

u is [100, 101, 102, 103, 104, 105, 106, 107]
v is [200, 201, 202, 203, 204, 205, 206, 207]
w is [300, 301, 302, 303, 304, 305, 306, 307]
p is [400, 401, 402, 403, 404, 405, 406, 407]

Since this is your concern: implicit looping through zip and map should exhibit better performance than doing it with explicit loops in Python (even if loops are really fast). I'm not sure this solution has a better memory footprint, though...

EDIT: replaced zip with izip to get a lazy iterator even on Python 2.x
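On Python 3 the izip import is unnecessary, since zip itself returns a lazy iterator. A sketch of the same transpose trick, run against an in-memory sample instead of a file:

```python
lines = ["100 200 300 400",
         "101 201 301 401",
         "102 202 302 402"]

# map(int, ...) converts one row of text into ints;
# zip(*...) transposes the rows into columns
u, v, w, p = zip(*(map(int, line.split()) for line in lines))

print(u)  # (100, 101, 102)
print(p)  # (400, 401, 402)
```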

+1 for elegance! – arvind 18 hours ago
This won't fit into memory if the data set is large (YMMV) as you are loading the whole data file. – yaccz 18 hours ago
@yaccz I'm not sure about your assertion. My understanding is the inner structure is a generator and so will read lines one by one while zip is requesting them, discarding previous input data rows after usage. I don't know how the "star operator" * impacts this expected behavior, nor whether using izip instead of zip has any impact in that domain. Except for that last point, I think this is the exact same solution you proposed in your edit 40 minutes ago. – Sylvain Leroux 18 hours ago
I think it depends on python version. When I look into manual, py3.2 zip() seems to be returning generator, however py2.7 returns a list, where the whole file will be loaded into memory, iiuc. – yaccz 17 hours ago
@yaccz I have changed the answer accordingly. I'm not absolutely certain this has a huge impact regarding memory footprint because of the statement u,v,w,p = izip(...) (is this optimized for generators?) Maybe extracting columns by using itertools.islice could lead to a better solution? – Sylvain Leroux 15 hours ago

x.append([int(y)+c]) appends a list containing the single element int(y)+c

you need x.append(int(y)+c) to get a list of numbers instead of a list of singletons
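The difference in a couple of lines:

```python
nested, flat = [], []
for y in ["100", "101"]:
    nested.append([int(y)])  # each element is itself a one-item list
    flat.append(int(y))      # each element is a plain number

print(nested)  # [[100], [101]]
print(flat)    # [100, 101]
```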

Also, here is a pretty nice solution:

from itertools import izip

a="""1 2 3 4
10 20 30 40"""

lines= ([int(y) for y in x.split()] for x in a.split("\n"))
cols = izip(*lines)

print list(cols)

prints

[(1, 10), (2, 20), (3, 30), (4, 40)]

In your case, a.split("\n") would become the open file object itself (iterating over open("data") yields one line at a time), or open("data").readlines() if you don't mind loading the whole file.

This should give you much better memory performance, since only one line of the data file needs to be in memory at any given time, unless you continue the computation by turning the generators into lists.

However, I don't know how it will perform CPU-wise; my guesstimate is it will be a bit better than, or about the same as, your original code.

If you are going to benchmark this, it would also be interesting to use plain lists instead of generators and to try it on PyPy (because https://bitbucket.org/pypy/pypy/wiki/JitFriendliness, see the generators headline) if you can fit it into memory.

Considering your data set

  (10**4 * 8 * 12)/1024.0

Assuming your numbers are relatively small and take 12 bytes each (Python: How much space does each element of a list take?), that gives me a little under 1MB of memory to hold all the data at once, which is a pretty tiny data set in terms of memory consumption.
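Spelling out that back-of-the-envelope figure (the 12-bytes-per-int size is the assumption from the linked question; CPython's real per-int overhead is larger):

```python
rows = 10**4         # ~10000 numbers per column
cols = 8             # 7-8 columns
bytes_per_int = 12   # assumed storage per integer

kib = (rows * cols * bytes_per_int) / 1024.0
print(kib)  # 937.5 KiB, a little under 1 MiB
```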

+1. I will try each of these methods and let you all know how they fare in terms of performance. – arvind 8 hours ago
