Comparing two columns in two different rows

Question

I want to go through each line of a .csv file and check whether the first field of a line is the same as the first field of the next line, and so on. If there is a match, I would like to drop every line that shares that field and keep only the lines where there is no match.

Here is an example dataset (no_dup.txt):

Ac_Gene_ID  M_Gene_ID
ENSGMOG00000015632  ENSORLG00000010573
ENSGMOG00000015632  ENSORLG00000010585
ENSGMOG00000003747  ENSORLG00000006947
ENSGMOG00000003748  ENSORLG00000004636

Here is the output that I wanted:

Ac_Gene_ID  M_Gene_ID
ENSGMOG00000003747  ENSORLG00000006947
ENSGMOG00000003748  ENSORLG00000004636

Here is my code that works, but I want to see how it can be improved:

import sys

in_file = sys.argv[1]
out_file = sys.argv[2]

entries = {}
entries1 = {}

with open(in_file, 'r') as fh_in:
    for line in fh_in:
        if line.startswith('E'):
            line = line.strip()
            line = line.split()
            entry = line[0]
            if entry in entries:
                entries[entry].append(line)
            else:
                entries[entry] = [line]

with open('no_dup_out.txt', 'w') as fh_out:
    for kee, val in entries.iteritems():
        if len(val) == 1:
            fh_out.write("{} \n".format(val))


with open('no_dup_out.txt', 'r') as fh_in2:
    for line in fh_in2:
        line = line.strip()
        line = line.split()
        entry = line[1]
        if entry in entries1:
            entries1[entry].append(line)
        else:
            entries1[entry] = [line]

with open(out_file, 'w') as fh_out2:
     for kee, val in entries1.iteritems():
        if len(val) == 1:
            fh_out2.write("{} \n".format(val))

The output that I am getting:

[["[['ENSGMOG00000003747',", "'ENSORLG00000006947']]"]]     
[["[['ENSGMOG00000003748',", "'ENSORLG00000004636']]"]]

Josay · Answer 1 · 2015-06-12 17:49:18Z

This part

        if entry in entries:
            entries[entry].append(line)
        else:
            entries[entry] = [line]

definitely smells like it could/should be written with setdefault or defaultdict.

This would be for instance entries.setdefault(entry, []).append(line).
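As a minimal sketch of both options (the sample lines are copied from the question; the names fields and entries_dd are only illustrative), the grouping could look like this:

```python
from collections import defaultdict

lines = [
    "ENSGMOG00000015632  ENSORLG00000010573",
    "ENSGMOG00000015632  ENSORLG00000010585",
    "ENSGMOG00000003747  ENSORLG00000006947",
]

# Option 1: setdefault creates the list the first time a key is seen,
# then returns it so the row can be appended in the same expression.
entries = {}
for line in lines:
    fields = line.split()
    entries.setdefault(fields[0], []).append(fields)

# Option 2: defaultdict(list) builds the empty list for missing keys itself.
entries_dd = defaultdict(list)
for line in lines:
    fields = line.split()
    entries_dd[fields[0]].append(fields)
```

Both loops build the same mapping; defaultdict tends to read better when the same dictionary is filled in several places.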

Avoid re-assigning the same variable again and again, as it makes it harder to understand what the variable is supposed to represent.

        line = line.strip()
        line = line.split()

could be written: splitted_list = line.strip().split()

You are iterating over the key ("kee"?)/value pairs of a dictionary but ignoring the actual key.

The convention is to use _ as the variable name for throw-away values, so you could write: for _, val in entries.iteritems():. However, it would probably be better to just iterate over the values using itervalues, values or viewvalues.
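A small sketch of iterating over the values only, assuming entries has already been filled as in the question (the literal dictionary below is just sample data). It uses values(), which exists in both Python 2 and Python 3; itervalues and viewvalues are Python 2 only.

```python
# Sample data in the shape the question's first loop produces:
# first-column value -> list of split rows sharing that value.
entries = {
    "ENSGMOG00000015632": [
        ["ENSGMOG00000015632", "ENSORLG00000010573"],
        ["ENSGMOG00000015632", "ENSORLG00000010585"],
    ],
    "ENSGMOG00000003747": [
        ["ENSGMOG00000003747", "ENSORLG00000006947"],
    ],
}

# The key is never needed, so iterate over the values directly and
# keep the single row of every group that has exactly one member.
unique_rows = [rows[0] for rows in entries.values() if len(rows) == 1]
```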

200_success · Answer 2 · 2015-06-12 17:48:48Z

It's odd that you write no_dup_out.txt, then immediately read it back in again. Couldn't you just construct entries1 from entries without doing file I/O?
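One possible way to do both passes in memory is sketched below. The helper name keep_unique is hypothetical, not part of the original code, and the sketch assumes every row has at least two whitespace-separated fields.

```python
from collections import defaultdict


def keep_unique(rows, column):
    """Return the rows whose value in `column` occurs exactly once."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    # Walk the original list so the input order is preserved.
    return [row for row in rows if len(groups[row[column]]) == 1]


rows = [
    ["ENSGMOG00000015632", "ENSORLG00000010573"],
    ["ENSGMOG00000015632", "ENSORLG00000010585"],
    ["ENSGMOG00000003747", "ENSORLG00000006947"],
    ["ENSGMOG00000003748", "ENSORLG00000004636"],
]

# First pass on column 0, second pass on column 1, no temporary file.
by_first = keep_unique(rows, 0)
by_both = keep_unique(by_first, 1)
```

Writing each surviving row with something like "\t".join(row) instead of format(val) would also avoid the nested-list text that shows up in the current output.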

This code has some weird behaviour, though, that you should be aware of. Consider the following example:

Elephant           apple
Elephant           banana
Eel                apple

If you uniquify the data set based on the first column, then by the second column, as you have done in your program, you'll obtain the result:

Eel                apple

However, if you were to uniquify the data set based on the second column, then by the first column, you would obtain instead:

Elephant           banana

I don't know enough about the motivation behind the code to say whether either of those is the desired outcome. Or perhaps all three rows should be eliminated? In any case, the intended behaviour should be thoroughly described in a docstring to avoid misunderstandings.
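The three possible outcomes can be reproduced with a short sketch. The function names unique_by and unique_in_both are hypothetical, and the docstrings show the kind of description meant above rather than the author's actual intent.

```python
from collections import Counter


def unique_by(rows, column):
    """Keep only the rows whose value in `column` appears once in `rows`."""
    counts = Counter(row[column] for row in rows)
    return [row for row in rows if counts[row[column]] == 1]


def unique_in_both(rows):
    """Keep only the rows whose first AND second fields are each unique,
    with both counts taken over the original, unfiltered rows."""
    first = Counter(row[0] for row in rows)
    second = Counter(row[1] for row in rows)
    return [row for row in rows if first[row[0]] == 1 and second[row[1]] == 1]


rows = [("Elephant", "apple"), ("Elephant", "banana"), ("Eel", "apple")]

first_then_second = unique_by(unique_by(rows, 0), 1)  # [("Eel", "apple")]
second_then_first = unique_by(unique_by(rows, 1), 0)  # [("Elephant", "banana")]
all_at_once = unique_in_both(rows)                    # []
```

Which of the three is correct depends on the intended meaning of "duplicate", which is why the behaviour should be written down.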

