I have lines_list_copy of the form:
[['a1', 'b1', 'c1', 'd1', 'e1'], ['a2', 'b2', 'c2', 'd2', 'e2'], ... ]
I need to remove all duplicate entries where a, b, c and d are identical. Note that I don't care what value e has. For example, if
lines_list_copy = [['a1', 'b1', 'c1', 'd1', 'e1'], ['a2', 'b2', 'c2', 'd2', 'e2'], ['a1', 'b1', 'c1', 'd1', 'e1'], ['a1', 'b1', 'c1', 'd1', 'e2']]
we have three entries that count as the same, namely lines_list_copy[0], lines_list_copy[2] and lines_list_copy[3]. Any two of them need to be deleted to give the value of lines_list, and it doesn't matter which two: every choice is a valid output for lines_list.
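To make the requirement concrete, here is a small sketch of the check a valid result has to pass. The helper name is_valid_dedup is hypothetical, and it assumes a, b, c, d are the first four columns with e last (in my real code the columns are looked up through cfg):

```python
def is_valid_dedup(original, result, key_len=4):
    """Return True if `result` keeps exactly one row per distinct key.

    The key is the first `key_len` columns (a, b, c, d). The trailing e
    value is ignored, but every kept row must be a row from `original`.
    NOTE: "first four columns are the key" is an assumption for this sketch.
    """
    original_keys = {tuple(row[:key_len]) for row in original}
    result_keys = [tuple(row[:key_len]) for row in result]
    original_rows = {tuple(row) for row in original}
    return (
        len(result_keys) == len(set(result_keys))  # no duplicate keys left
        and set(result_keys) == original_keys      # no key was lost
        and all(tuple(row) in original_rows for row in result)  # rows are real
    )


rows = [
    ['a1', 'b1', 'c1', 'd1', 'e1'],
    ['a2', 'b2', 'c2', 'd2', 'e2'],
    ['a1', 'b1', 'c1', 'd1', 'e1'],
    ['a1', 'b1', 'c1', 'd1', 'e2'],
]

# Either of these is acceptable; which e value survives does not matter.
print(is_valid_dedup(rows, [rows[0], rows[1]]))  # True
print(is_valid_dedup(rows, [rows[3], rows[1]]))  # True
# Rows 0 and 2 share a key, so this one is not acceptable.
print(is_valid_dedup(rows, rows[:3]))            # False
```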
lines_list_copy typically has more than 200000 entries and will realistically exceed 500000 with the amount of data we are collecting, so I need a fast way to remove duplicates. I found a way to remove all duplicates efficiently, but it takes e into account and so doesn't give me what I need. Therefore, I first delete the e value from each list, like so:
for x in lines_list_copy:
    del x[cfg.TEXT_LOC_COL]
lines_list_copy = [list(x) for x in set(tuple(x) for x in lines_list_copy)]
After this, lines_list_copy is as I need it, and all that remains is to re-add any one of the e values to each list. My double for loop is admittedly naive, but I didn't expect it to bring my program to a crawl:
for line_copy_ind in range(len(lines_list_copy)):
    for line_ind in range(len(lines_list)):
        if lines_list_copy[line_copy_ind][cfg.TIME_COL] == lines_list[line_ind][cfg.TIME_COL] and \
                len(lines_list_copy[line_copy_ind]) == 4:
            lines_list_copy[line_copy_ind].append(lines_list[line_ind][cfg.TEXT_LOC_COL])
lines_list = lines_list_copy
I looked into vectorizing and using filter, but I just can't seem to adapt solutions to other problems to my problem of adding e back onto the end of each list in lines_list_copy. Maybe there's an elegant way to keep the e column and still remove duplicates efficiently without considering the e values?
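For reference, this is roughly the shape of what I am hoping for: a single pass that keys each row on a, b, c, d and leaves e attached. It is only a sketch under the assumption that the key is the first four columns and e is the last one; the function name dedup_by_key is made up for illustration, and it relies on dicts preserving insertion order (Python 3.7+):

```python
def dedup_by_key(rows, key_len=4):
    """Keep the first row seen for each distinct key.

    The key is the first `key_len` columns (assumed to be a, b, c, d);
    the e value simply stays on whichever row is kept.
    """
    first_seen = {}
    for row in rows:
        key = tuple(row[:key_len])  # lists are unhashable, tuples are not
        # setdefault stores the row only if this key has not been seen yet
        first_seen.setdefault(key, row)
    return list(first_seen.values())


rows = [
    ['a1', 'b1', 'c1', 'd1', 'e1'],
    ['a2', 'b2', 'c2', 'd2', 'e2'],
    ['a1', 'b1', 'c1', 'd1', 'e1'],
    ['a1', 'b1', 'c1', 'd1', 'e2'],
]

deduped = dedup_by_key(rows)
print(deduped)
# [['a1', 'b1', 'c1', 'd1', 'e1'], ['a2', 'b2', 'c2', 'd2', 'e2']]
```

Each row is visited once and dict lookups are constant time on average, so this would avoid the nested loop entirely, but I don't know whether it is the idiomatic way to do it.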