I want to go through each line of a .csv file and check whether the first field of one line is the same as the first field of the next line, and so on. If there is a match, I would like to drop both lines that share that first field and keep only the lines where there is no match.
Here is an example dataset (no_dup.txt):
Ac_Gene_ID M_Gene_ID
ENSGMOG00000015632 ENSORLG00000010573
ENSGMOG00000015632 ENSORLG00000010585
ENSGMOG00000003747 ENSORLG00000006947
ENSGMOG00000003748 ENSORLG00000004636
Here is the output that I want:
Ac_Gene_ID M_Gene_ID
ENSGMOG00000003747 ENSORLG00000006947
ENSGMOG00000003748 ENSORLG00000004636
Here is my code that works, but I want to see how it can be improved:
import sys

in_file = sys.argv[1]
out_file = sys.argv[2]

entries = {}
entries1 = {}

with open(in_file, 'r') as fh_in:
    for line in fh_in:
        if line.startswith('E'):
            line = line.strip()
            line = line.split()
            entry = line[0]
            if entry in entries:
                entries[entry].append(line)
            else:
                entries[entry] = [line]

with open('no_dup_out.txt', 'w') as fh_out:
    for kee, val in entries.iteritems():
        if len(val) == 1:
            fh_out.write("{} \n".format(val))

with open('no_dup_out.txt', 'r') as fh_in2:
    for line in fh_in2:
        line = line.strip()
        line = line.split()
        entry = line[1]
        if entry in entries1:
            entries1[entry].append(line)
        else:
            entries1[entry] = [line]

with open(out_file, 'w') as fh_out2:
    for kee, val in entries1.iteritems():
        if len(val) == 1:
            fh_out2.write("{} \n".format(val))
The output that I am getting:
[["[['ENSGMOG00000003747',", "'ENSORLG00000006947']]"]]
[["[['ENSGMOG00000003748',", "'ENSORLG00000004636']]"]]
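For comparison, here is a minimal sketch of the filter described at the top, under a few assumptions: fields are whitespace-separated, the first line is a header that should be kept, and a first field counts as repeated wherever it occurs in the file, not only on adjacent lines. The names `drop_repeated_keys`, `column`, and `has_header` are hypothetical and not part of the code above. It counts each key once with `collections.Counter` and keeps the original line text, so the output contains plain rows rather than the string form of a list.

```python
from collections import Counter


def drop_repeated_keys(lines, column=0, has_header=True):
    """Return the rows whose value in `column` occurs exactly once.

    Hypothetical helper: assumes whitespace-separated fields and,
    when has_header is True, that the first non-blank line is a header.
    """
    # Ignore blank lines and strip trailing newlines.
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    header, rows = (lines[:1], lines[1:]) if has_header else ([], lines)

    # Count how often each key appears in the chosen column.
    counts = Counter(row.split()[column] for row in rows)

    # Keep only rows whose key is unique, preserving input order.
    kept = [row for row in rows if counts[row.split()[column]] == 1]
    return header + kept


sample = """\
Ac_Gene_ID M_Gene_ID
ENSGMOG00000015632 ENSORLG00000010573
ENSGMOG00000015632 ENSORLG00000010585
ENSGMOG00000003747 ENSORLG00000006947
ENSGMOG00000003748 ENSORLG00000004636
""".splitlines()

for row in drop_repeated_keys(sample):
    print(row)
```

The posted code also filters a second time on the second column. If that is intended, the same helper could be applied again to its own result with `column=1`.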