Take the 2-minute tour ×
Code Review Stack Exchange is a question and answer site for peer programmer code reviews. It's 100% free, no registration required.

(continuation from Speeding up and fixing phone numbers from CSVs with Regex)

I'm pulling all of the phone numbers from all CSVs in two different directories, outputting them in a single simple format to two different files, and then comparing those two files for which numbers are in one but not the other.

I'm failing in speed (of execution), style, and results. Here's what I was trying:

import csv
import re
import glob
import string

with open('Firstlist.csv', 'wb') as out:
    with open('Secondlist.csv', 'wb') as out2:
        with open('SecondnotinFirst.csv', 'wb') as out3:
            with open('FirstnotinSecond.csv', 'wb') as out4:
                seen = set()
                regex = re.compile(r'(\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4})')
                out_writer = csv.writer(out)
                out_writer.writerow([])
                csv_files = glob.glob('First\*.csv')
                for filename in csv_files:
                    with open(filename, 'rbU') as ifile:
                        read = csv.reader(ifile)
                        for row in read:
                            for column in row:
                                s1 = column.strip()
                                match = regex.search(s1)
                                if match:
                                    canonical_phone = re.sub(r'\D', '', match.group(0))
                                    if canonical_phone not in seen:
                                        seen.add(canonical_phone)
                for val in seen:
                    out_writer.writerow([val])

                seen2 = set()
                out_writer2 = csv.writer(out2)
                out_writer2.writerow([])
                csv_files2 = glob.glob('Second\*.csv')
                for filename in csv_files2:
                    with open(filename, 'rbU') as ifile2:
                        read2 = csv.reader(ifile2)
                        for row in read2:
                            for column in row:
                                s2 = column.strip()
                                match2 = regex.search(s2)
                                if match2:
                                    canonical_phone2 = re.sub(r'\D', '', match2.group(0))
                                    if canonical_phone2 not in seen2:
                                        seen2.add(canonical_phone2)

                for val in seen2:
                    out_writer2.writerow([val])

                out_writer3 = csv.writer(out3)
                for val in seen2 not in seen:
                    out_writer3.writerow([val])

                out_writer4 = csv.writer(out4)
                for val in seen not in seen2:
                    out_writer4.writerow([val])
share|improve this question
    
Some of the operations on built-in sets might make your code simpler (and possibly faster): docs.python.org/2/library/stdtypes.html#set –  alexwlchan Apr 15 at 17:12
add comment

4 Answers

Here's my 5 cents as non-python guy

Naming:

As soon as you begin numbering variable names, there should be alarm bells ringing... multiple loud alarm bells.

rename your out, out2, and the corresponding writers. Instead use descriptive names (maybe similar to your "filenames"): first_list, second_list, and so on. The same applies for your writers.

Also seen2 is not a good name.. What do you store in that variable? Name it after that: second_seen is definitely better, than seen2

share|improve this answer
    
These kinds of general tips make codereview so great for someone like myself who has no clue what I'm doing. It's embarrassing to chalk up my line-by-line blindly-wandering code, but I only learn from trial and error. –  Xodarap777 Apr 16 at 16:39
add comment

Hopefully I understand what your trying to do...

  1. Take all phone numbers from all files in directory A, strip phone numbers out, storing it in a summary file.
  2. Do the same for directory B.
  3. Get a file of what is not in directory A, and another that is not in directory B.

I would break the problem down into a few steps:

  1. Process each file in directory A, save the summary.
  2. Do the same for directory B.
  3. Compare A_summary with B_summary.
  4. Compare B_summary with A_summary.

Here is what I would change the code to in order to get going toward a solution.

import csv
import re
import os

PHONE_PATTERN = re.compile(r'(\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4})')

def process_file(filename, phone_numbers=None):
    """ Add phone numbers from a file to the set of phone_numbers. """
    if phone_numbers = None:
        phone_numbers = set()

    with open(filename, 'rbU') as ifile:
            read = csv.reader(ifile)
            for row in read:
                for column in row:
                    s1 = column.strip()
                    match = PHONE_PATTERN.search(s1)
                    if match:
                        canonical_phone = re.sub(r'\D', '', match.group(0))
                        if canonical_phone not in phone_numbers:
                            phone_numbers.add(canonical_phone)

    return phone_numbers

# Check each directory, creating a set to tally up the unique numbers.
for directory in ['directory_A_path', 'directory_B_path']:
    phone_numbers = set()

    # Process all the files in a directory
    for filename in os.listdir(directory):
        if(filename.endswith(".csv")):
            phone_numbers = process_file(filename, phone_numbers)

    # Create a summary file for the directory and write the numbers
    with open(directory + "_summary.csv") as summary:
        summary_writer = csv.writer(summary)
        for phone_number in phone_numbers:
            summary_writer.writerow([phone_number])

summary_A = process_file('directory_A_path_summary.csv')
summary_B = process_file('directory_A_path_summary.csv')

I think you could then diff summaries and write them out. Hopefully it is helpful.

share|improve this answer
add comment

You should try to avoid deeply nesting code as much as possible. Deeply nested code has high cyclomatic complexity -- too many possible execution paths, which can make it extremely difficult to test.

It's especially bad how you nested the out...out3 file handlers, because they are independent from each other. You should have written this more "horizontally", without nesting them within each other.


Another big issue with your code is duplicated logic:

  • You could write the logic reading from a file, and use it twice when you read the two input files

  • You could write the logic writing to a file, and use it 4 times when you write the output files

If you do something more than once, you should extract to a separate method so it's less typing, and if you have to fix a bug, you can fix it once, which is easier and more robust.


This is pointless, you're not using it:

import string

Putting it all together (untested):

def read_unique_numbers_from_reader(reader):
    unique_numbers = set()
    for row in reader:
        for column in row:
            s1 = column.strip()
            match = regex.search(s1)
            if match:
                canonical_phone = re.sub(r'\D', '', match.group(0))
                unique_numbers.add(canonical_phone)
    return unique_numbers


def read_unique_numbers_from_files(glob_pattern):
    unique_numbers = set()
    for filename in glob.glob(glob_pattern):
        with open(filename, 'rbU') as ifile:
            reader = csv.reader(ifile)
            unique_numbers.update(read_unique_numbers_from_reader(reader))
    return unique_numbers


def write_numbers(filename, numbers):
    with open(filename, 'wb') as out:
        out_writer = csv.writer(out)
        out_writer.writerow([])
        for val in numbers:
            out_writer.writerow([val])

numbers_in_first = read_unique_numbers_from_files('First*.csv')
write_numbers('Firstlist.csv', numbers_in_first)

numbers_in_second = read_unique_numbers_from_files('Second*.csv')
write_numbers('Secondlist.csv', numbers_in_second)

write_numbers('SecondnotinFirst.csv', numbers_in_second - numbers_in_first)

write_numbers('FirstnotinSecond.csv', numbers_in_first - numbers_in_second)
share|improve this answer
add comment

A simple way to avoid excessive indentation is to combine all of the with blocks into one:

with open('List1.csv', 'wb') as out1, \
     open('List2.csv', 'wb') as out2, \
     open('List3.csv', 'wb') as out3, \
     open('List4.csv', 'wb') as out4:
    seen = set()
    # etc.

I've renamed outout1 for parallelism.

However, another problem is that the code is repetitive and monolithic. Ideally, there should be a function, say read_unique_phone_numbers_from_csv_files(glob). There is a handy data structure for that: collections.OrderedDict.

def read_unique_phone_numbers_from_csv_files(csv_glob):
    regex = re.compile(r'(\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4})')

    unique_phones = collections.OrderedDict()
    csv_files = glob.glob(csv_glob)
    for filename in csv_files:
        with open(filename, 'rbU') as ifile:
             read = csv.reader(ifile)
             for row in read:
                 for column in row:
                     # column.strip() is superfluous
                     match = regex.search(column)
                     if match:
                         canonical_phone = re.sub(r'\D', '', match.group(0))
                         unique_phones[canonical_phone] = True
    return unique_phones

With that function defined, the main loop should become quite simple.

share|improve this answer
add comment

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.