Comparing phone numbers across CSVs Python

Question

(continuation from Speeding up and fixing phone numbers from CSVs with Regex)

I'm pulling all of the phone numbers from all CSVs in two different directories, outputting them in a single simple format to two different files, and then comparing those two files for which numbers are in one but not the other.

I'm failing in speed (of execution), style, and results. Here's what I was trying:

import csv
import re
import glob
import string

with open('Firstlist.csv', 'wb') as out:
    with open('Secondlist.csv', 'wb') as out2:
        with open('SecondnotinFirst.csv', 'wb') as out3:
            with open('FirstnotinSecond.csv', 'wb') as out4:
                seen = set()
                regex = re.compile(r'(\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4})')
                out_writer = csv.writer(out)
                out_writer.writerow([])
                csv_files = glob.glob('First\*.csv')
                for filename in csv_files:
                    with open(filename, 'rbU') as ifile:
                        read = csv.reader(ifile)
                        for row in read:
                            for column in row:
                                s1 = column.strip()
                                match = regex.search(s1)
                                if match:
                                    canonical_phone = re.sub(r'\D', '', match.group(0))
                                    if canonical_phone not in seen:
                                        seen.add(canonical_phone)
                for val in seen:
                    out_writer.writerow([val])

                seen2 = set()
                out_writer2 = csv.writer(out2)
                out_writer2.writerow([])
                csv_files2 = glob.glob('Second\*.csv')
                for filename in csv_files2:
                    with open(filename, 'rbU') as ifile2:
                        read2 = csv.reader(ifile2)
                        for row in read2:
                            for column in row:
                                s2 = column.strip()
                                match2 = regex.search(s2)
                                if match2:
                                    canonical_phone2 = re.sub(r'\D', '', match2.group(0))
                                    if canonical_phone2 not in seen2:
                                        seen2.add(canonical_phone2)

                for val in seen2:
                    out_writer2.writerow([val])

                out_writer3 = csv.writer(out3)
                for val in seen2 not in seen:
                    out_writer3.writerow([val])

                out_writer4 = csv.writer(out4)
                for val in seen not in seen2:
                    out_writer4.writerow([val])

Some of the operations on built-in sets might make your code simpler (and possibly faster): docs.python.org/2/library/stdtypes.html#set — alexwlchan, Apr 15 at 17:12

Vogel612 · Answer 1 · 2014-04-15 08:32:26Z

up vote 3 down vote

Here's my 5 cents as non-python guy

Naming:

As soon as you begin numbering variable names, there should be alarm bells ringing... multiple loud alarm bells.

rename your out, out2, and the corresponding writers. Instead use descriptive names (maybe similar to your "filenames"): first_list, second_list, and so on. The same applies for your writers.

Also seen2 is not a good name.. What do you store in that variable? Name it after that: second_seen is definitely better, than seen2

answered Apr 15 at 8:32

Vogel612
4,2742742

These kinds of general tips make codereview so great for someone like myself who has no clue what I'm doing. It's embarrassing to chalk up my line-by-line blindly-wandering code, but I only learn from trial and error. – Xodarap777 Apr 16 at 16:39

add comment

clutton · Answer 2 · 2014-04-16 02:00:15Z

Hopefully I understand what your trying to do...

Take all phone numbers from all files in directory A, strip phone numbers out, storing it in a summary file.
Do the same for directory B.
Get a file of what is not in directory A, and another that is not in directory B.

I would break the problem down into a few steps:

Process each file in directory A, save the summary.
Do the same for directory B.
Compare A_summary with B_summary.
Compare B_summary with A_summary.

Here is what I would change the code to in order to get going toward a solution.

import csv
import re
import os

PHONE_PATTERN = re.compile(r'(\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4})')

def process_file(filename, phone_numbers=None):
    """ Add phone numbers from a file to the set of phone_numbers. """
    if phone_numbers = None:
        phone_numbers = set()

    with open(filename, 'rbU') as ifile:
            read = csv.reader(ifile)
            for row in read:
                for column in row:
                    s1 = column.strip()
                    match = PHONE_PATTERN.search(s1)
                    if match:
                        canonical_phone = re.sub(r'\D', '', match.group(0))
                        if canonical_phone not in phone_numbers:
                            phone_numbers.add(canonical_phone)

    return phone_numbers

# Check each directory, creating a set to tally up the unique numbers.
for directory in ['directory_A_path', 'directory_B_path']:
    phone_numbers = set()

    # Process all the files in a directory
    for filename in os.listdir(directory):
        if(filename.endswith(".csv")):
            phone_numbers = process_file(filename, phone_numbers)

    # Create a summary file for the directory and write the numbers
    with open(directory + "_summary.csv") as summary:
        summary_writer = csv.writer(summary)
        for phone_number in phone_numbers:
            summary_writer.writerow([phone_number])

summary_A = process_file('directory_A_path_summary.csv')
summary_B = process_file('directory_A_path_summary.csv')

I think you could then diff summaries and write them out. Hopefully it is helpful.

janos · Answer 3 · 2014-04-16 06:04:33Z

You should try to avoid deeply nesting code as much as possible. Deeply nested code has high cyclomatic complexity -- too many possible execution paths, which can make it extremely difficult to test.

It's especially bad how you nested the out...out3 file handlers, because they are independent from each other. You should have written this more "horizontally", without nesting them within each other.

Another big issue with your code is duplicated logic:

You could write the logic reading from a file, and use it twice when you read the two input files
You could write the logic writing to a file, and use it 4 times when you write the output files

If you do something more than once, you should extract to a separate method so it's less typing, and if you have to fix a bug, you can fix it once, which is easier and more robust.

This is pointless, you're not using it:

import string

Putting it all together (untested):

def read_unique_numbers_from_reader(reader):
    unique_numbers = set()
    for row in reader:
        for column in row:
            s1 = column.strip()
            match = regex.search(s1)
            if match:
                canonical_phone = re.sub(r'\D', '', match.group(0))
                unique_numbers.add(canonical_phone)
    return unique_numbers


def read_unique_numbers_from_files(glob_pattern):
    unique_numbers = set()
    for filename in glob.glob(glob_pattern):
        with open(filename, 'rbU') as ifile:
            reader = csv.reader(ifile)
            unique_numbers.update(read_unique_numbers_from_reader(reader))
    return unique_numbers


def write_numbers(filename, numbers):
    with open(filename, 'wb') as out:
        out_writer = csv.writer(out)
        out_writer.writerow([])
        for val in numbers:
            out_writer.writerow([val])

numbers_in_first = read_unique_numbers_from_files('First*.csv')
write_numbers('Firstlist.csv', numbers_in_first)

numbers_in_second = read_unique_numbers_from_files('Second*.csv')
write_numbers('Secondlist.csv', numbers_in_second)

write_numbers('SecondnotinFirst.csv', numbers_in_second - numbers_in_first)

write_numbers('FirstnotinSecond.csv', numbers_in_first - numbers_in_second)

200_success · Answer 4 · 2014-04-16 06:39:53Z

A simple way to avoid excessive indentation is to combine all of the with blocks into one:

with open('List1.csv', 'wb') as out1, \
     open('List2.csv', 'wb') as out2, \
     open('List3.csv', 'wb') as out3, \
     open('List4.csv', 'wb') as out4:
    seen = set()
    # etc.

I've renamed out → out1 for parallelism.

However, another problem is that the code is repetitive and monolithic. Ideally, there should be a function, say read_unique_phone_numbers_from_csv_files(glob). There is a handy data structure for that: collections.OrderedDict.

def read_unique_phone_numbers_from_csv_files(csv_glob):
    regex = re.compile(r'(\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4})')

    unique_phones = collections.OrderedDict()
    csv_files = glob.glob(csv_glob)
    for filename in csv_files:
        with open(filename, 'rbU') as ifile:
             read = csv.reader(ifile)
             for row in read:
                 for column in row:
                     # column.strip() is superfluous
                     match = regex.search(column)
                     if match:
                         canonical_phone = re.sub(r'\D', '', match.group(0))
                         unique_phones[canonical_phone] = True
    return unique_phones

With that function defined, the main loop should become quite simple.

asked	2 months ago
viewed	65 times
active	2 months ago

current community

your communities

more stack exchange communities

Comparing phone numbers across CSVs Python

4 Answers

Naming:

Your Answer

Not the answer you're looking for? Browse other questions tagged python python-2.7 csv or ask your own question.

Linked

Hot Network Questions

current community

your communities

more stack exchange communities

Comparing phone numbers across CSVs Python

4 Answers

Naming:

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged python python-2.7 csv or ask your own question.

Linked

Related

Hot Network Questions