
I want to read in a tab separated value file and turn it into a numpy array. The file has 3 lines. It looks like this:

Ann1    Bill1   Chris1   Dick1
Ann2    Bill2   Chris2  "Dick2
Ann3    Bill3   Chris3   Dick3

So, I used this simple line of code:

import csv
import numpy as np

new_list = []
with open('/home/me/my_tsv.tsv') as tsv:
    for line in csv.reader(tsv, delimiter="\t"):
        new_list.append(line)

new = np.array(new_list)
print new.shape

And because of that pesky " character, the shape of my fancy new numpy array is

(2,4)

That's not right! The fix is to pass the quoting argument to csv.reader, as follows:

for line in csv.reader(tsv, delimiter="\t", quoting=csv.QUOTE_NONE):

This is great! Now my dimension is

(3,4)  

as I hoped.
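Putting the pieces above together, a minimal, self-contained sketch of the working version looks like this (it recreates the 3-line sample file first; the filename is just illustrative):

```python
import csv
import numpy as np

# Recreate the 3-line tab-separated sample, including the stray " in row 2.
with open('my_tsv.tsv', 'w') as f:
    f.write('Ann1\tBill1\tChris1\tDick1\n'
            'Ann2\tBill2\tChris2\t"Dick2\n'
            'Ann3\tBill3\tChris3\tDick3\n')

new_list = []
with open('my_tsv.tsv') as tsv:
    # QUOTE_NONE: treat every character literally, so the stray " is
    # just data and does not glue rows 2 and 3 together.
    for line in csv.reader(tsv, delimiter="\t", quoting=csv.QUOTE_NONE):
        new_list.append(line)

new = np.array(new_list)
print(new.shape)  # (3, 4)
```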

Now comes the real problem -- in all actuality, I have a 700,000 X 10 .tsv file, with long fields. I can read the file into Python with no problems, just like in the above situation. But, when I get to the step where I create new = np.array(new_list), my wimpy 16 GB laptop cries and says...

MemoryError

Obviously, I cannot simultaneously have these two objects in memory -- the Python list of lists, and the numpy array.

My question is therefore this: how can I read this file directly into a numpy array, perhaps using genfromtxt or something similar...yet also achieve what I've achieved by using the quoting=csv.QUOTE_NONE argument in the csv.reader?

So far, I've found no analogy to the quoting=csv.QUOTE_NONE option available anywhere in the standard ways to read in a tsv file using numpy.
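One thing worth checking (hedged, since behaviour may vary across NumPy versions): as far as I can tell, genfromtxt splits lines purely on the delimiter and does not interpret quote characters at all, so with a string dtype it may already behave like QUOTE_NONE on this data. A small sketch, using the 3-line sample from above:

```python
import numpy as np

# Recreate the tab-separated sample with the stray quote character.
with open('my_tsv.tsv', 'w') as f:
    f.write('Ann1\tBill1\tChris1\tDick1\n'
            'Ann2\tBill2\tChris2\t"Dick2\n'
            'Ann3\tBill3\tChris3\tDick3\n')

# genfromtxt splits on '\t' only; the " is kept as literal field text,
# which is effectively the QUOTE_NONE behaviour from csv.reader.
arr = np.genfromtxt('my_tsv.tsv', delimiter='\t', dtype=str)
print(arr.shape)  # expect (3, 4) on this sample
```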

This is a tough little problem. I thought about iteratively building the numpy array during the reading-in process, but I can't figure it out.
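For what it's worth, one way the iterative idea could be sketched (hypothetical -- the column count and the maximum field length are assumed known up front): count the rows in a first pass, preallocate a fixed-width string array, then fill it row by row with csv.reader, so the list of lists never exists.

```python
import csv
import numpy as np

# Recreate the small sample so the sketch is self-contained.
path = 'my_tsv.tsv'
with open(path, 'w') as f:
    f.write('Ann1\tBill1\tChris1\tDick1\n'
            'Ann2\tBill2\tChris2\t"Dick2\n'
            'Ann3\tBill3\tChris3\tDick3\n')

n_cols = 4        # known in advance (would be 10 for the real file)
max_chars = 32    # assumed upper bound on field length

# First pass: count rows so the array can be preallocated.
with open(path) as tsv:
    n_rows = sum(1 for _ in tsv)

# Second pass: fill the preallocated array row by row, with QUOTE_NONE
# so the stray " stays literal. Only the array is ever fully in memory.
arr = np.empty((n_rows, n_cols), dtype='U%d' % max_chars)
with open(path) as tsv:
    reader = csv.reader(tsv, delimiter='\t', quoting=csv.QUOTE_NONE)
    for i, row in enumerate(reader):
        arr[i, :] = row

print(arr.shape)  # (3, 4) on this sample
```

Note Jaime's point below still applies: every cell reserves max_chars characters, so the dtype width drives total memory use.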

I tried

nparray = np.genfromtxt("/home/me/my_tsv.tsv", delimiter="/t")
print obj.shap

and got

(3,0)

If anyone has any advice, I'd greatly appreciate it. Also, I know the real answer is probably to use Pandas...but at this point I'm committed to using numpy for a lot of compelling reasons...

Thanks in advance.

Simply using genfromtxt seems to work for me without worrying about quoting on your example, and you didn't actually show a transcript showing that it failed. But even if quoting were an issue, wouldn't it be easiest to use csv.writer to make an easier-to-parse file? – DSM Jun 7 '14 at 3:26
    
I edited the question to show the results I got from using genfromtxt. The final solution may indeed be cleaning the file beforehand, but I am wondering if it would be possible to avoid that approach using a tool in numpy. – Matt O'Brien Jun 7 '14 at 3:43
    
Matt: your edit isn't super-persuasive. :^) You have "/t" where you meant "\t", you're printing obj.shap but reading in to nparray, and if obj is an array then obj.shap would give an AttributeError. It's usually best to type an example in a console and copy the whole thing, input and output, exactly. – DSM Jun 7 '14 at 3:45
    
FYI, np.genfromtxt("my_tsv2.tsv",delimiter="\t", dtype=object) gives me array([['Ann1', 'Bill1', 'Chris1', 'Dick1'], ['Ann2', 'Bill2', 'Chris2', '"Dick2'], ['Ann3', 'Bill3', 'Chris3', 'Dick3']], dtype=object), as I think you want, but I'm using 1.9.0.dev-ef7901d and so something might have changed recently. – DSM Jun 7 '14 at 3:48
How long is long in "I have long fields"? Numpy arrays have all entries of the same type. For strings that includes the number of characters. So if you have a single field with 2,000 characters, all 7,000,000 entries of your array will save space for 2,000 characters, even if most don't need it, and that will eat up 14 GB of your memory. – Jaime Jun 7 '14 at 4:05
