I want to read in a tab-separated value (TSV) file and turn it into a numpy array. The file has 3 lines. It looks like this:
Ann1 Bill1 Chris1 Dick1
Ann2 Bill2 Chris2 "Dick2
Ann3 Bill3 Chris3 Dick3
So, I used this simple bit of code:
    import csv
    import numpy as np

    new_list = []
    with open('/home/me/my_tsv.tsv') as tsv:
        for line in csv.reader(tsv, delimiter="\t"):
            new_list.append(line)
    new = np.array(new_list)
    print(new.shape)
And because of that pesky " character, the shape of my fancy new numpy array is (2,4).
That's not right! So, the solution is to include the quoting argument in the csv.reader call, as follows:
for line in csv.reader(tsv, delimiter="\t", quoting=csv.QUOTE_NONE):
This is great! Now my dimension is (3,4), as I hoped.
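To make the difference concrete, here is a minimal sketch that reproduces both behaviours on the three-line sample without touching the disk. It is written for Python 3, and the `sample` string and the `read_rows` helper are assumptions made up for this illustration, not part of the original code.

```python
import csv
import io

import numpy as np

# The three-line sample from above, with real tab characters between fields.
sample = (
    'Ann1\tBill1\tChris1\tDick1\n'
    'Ann2\tBill2\tChris2\t"Dick2\n'
    'Ann3\tBill3\tChris3\tDick3\n'
)


def read_rows(text, **reader_kwargs):
    """Parse tab-separated text into a list of rows (hypothetical helper)."""
    return list(csv.reader(io.StringIO(text), delimiter="\t", **reader_kwargs))


# Default quoting: the unmatched " opens a quoted field that swallows
# everything up to the end of the input, so the third line is lost.
default_rows = read_rows(sample)

# QUOTE_NONE: the " is treated as an ordinary character.
raw_rows = read_rows(sample, quoting=csv.QUOTE_NONE)

print(np.array(default_rows).shape)
print(np.array(raw_rows).shape)
```

With the default settings the last field of the second row ends up containing the whole third line, which is why a row goes missing.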
Now comes the real problem: in actuality, I have a 700,000 x 10 .tsv file with long fields. I can read the file into Python with no problems, just like in the situation above. But when I get to the step where I create new = np.array(new_list), my wimpy 16 GB laptop cries and says...
MEMORY ERROR
Obviously, I cannot simultaneously have these 2 objects in memory: the Python list of lists and the numpy array.
My question is therefore this: how can I read this file directly into a numpy array, perhaps using genfromtxt or something similar, yet also achieve what I've achieved by using the quoting=csv.QUOTE_NONE argument in the csv.reader?
So far, I've found no analogue to the quoting=csv.QUOTE_NONE option anywhere in the standard ways to read in a tsv file using numpy.
This is a tough little problem. I thought about iteratively building the numpy array during the reading-in process, but I can't figure it out.
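One way the iterative idea could look is sketched below: count the rows in a first pass, preallocate the array, then fill it row by row in a second pass so the full list of lists never exists. The function name `read_tsv_to_object_array` and its `n_cols` parameter are hypothetical, and the choice of `dtype=object` is an assumption. It is chosen because `np.array` on a list of strings produces a fixed-width string array sized by the longest field, which can be very large when fields are long, whereas an object array only stores references to the Python strings.

```python
import csv

import numpy as np


def read_tsv_to_object_array(path, n_cols):
    """Fill a preallocated object array row by row (hypothetical helper).

    Assumes every line has exactly n_cols fields; a short, long, or blank
    line raises ValueError on assignment.
    """
    # First pass: count rows without keeping them in memory.
    with open(path, newline="") as tsv:
        reader = csv.reader(tsv, delimiter="\t", quoting=csv.QUOTE_NONE)
        n_rows = sum(1 for _ in reader)

    # Preallocate; dtype=object stores references to Python strings
    # rather than padding every cell to the longest field.
    out = np.empty((n_rows, n_cols), dtype=object)

    # Second pass: copy each parsed row straight into the array.
    with open(path, newline="") as tsv:
        reader = csv.reader(tsv, delimiter="\t", quoting=csv.QUOTE_NONE)
        for i, row in enumerate(reader):
            out[i, :] = row
    return out
```

For the sample file, `read_tsv_to_object_array('/home/me/my_tsv.tsv', 4)` would be the call. The cost is reading the file twice, in exchange for never holding both the list of lists and the array at once.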
I tried
nparray = np.genfromtxt("/home/me/my_tsv.tsv", delimiter="/t")
print obj.shap
and got
(3,0)
If anyone has any advice I'd be greatly appreciative. Also, I know the real answer is probably to use Pandas...but at this point I'm committed to using numpy for a lot of compelling reasons...
Thanks in advance.
genfromtxt seems to work for me without worrying about quoting on your example, and you didn't actually show a transcript showing that it failed. But even if quoting were an issue, wouldn't it be easiest to use csv.writer to make an easier-to-parse file? – DSM Jun 7 '14 at 3:26

[...] genfromtxt. The final solution may indeed be cleaning the file beforehand, but I am wondering if it would be possible to avoid that approach using a tool in numpy. – Matt O'Brien Jun 7 '14 at 3:43

[...] "/t" where you meant "\t", you're printing obj.shap but reading into nparray, and if obj is an array then obj.shap would give an AttributeError. It's usually best to type an example in a console and copy the whole thing, input and output, exactly. – DSM Jun 7 '14 at 3:45

np.genfromtxt("my_tsv2.tsv", delimiter="\t", dtype=object) gives me array([['Ann1', 'Bill1', 'Chris1', 'Dick1'], ['Ann2', 'Bill2', 'Chris2', '"Dick2'], ['Ann3', 'Bill3', 'Chris3', 'Dick3']], dtype=object), as I think you want, but I'm using 1.9.0.dev-ef7901d and so something might have changed recently. – DSM Jun 7 '14 at 3:48
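The observation in the last comment can be checked on the three-line sample with a small sketch like the one below. It is written for Python 3 and a recent numpy, and it deliberately swaps in `dtype=str` for the `dtype=object` used in the comment, since the behaviour of `dtype=object` may differ between numpy versions. The in-memory `sample` string is an assumption standing in for the file on disk.

```python
import io

import numpy as np

# The three-line sample, with real tab characters between fields.
sample = (
    'Ann1\tBill1\tChris1\tDick1\n'
    'Ann2\tBill2\tChris2\t"Dick2\n'
    'Ann3\tBill3\tChris3\tDick3\n'
)

# genfromtxt only splits on the delimiter; it has no notion of quoted
# fields, so the stray " stays in the data as an ordinary character.
# comments=None stops a '#' inside a field from truncating the line.
arr = np.genfromtxt(io.StringIO(sample), delimiter="\t", dtype=str, comments=None)

print(arr.shape)
print(arr[1, 3])
```

Note that `dtype=str` produces a fixed-width string array sized by the longest field, so on a large file with long fields it may use far more memory than an object array.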