searching a value from one csv file in another csv file - Python

Question

I am writing a script that takes one CSV file, searches for a value in another CSV file, then writes output depending on the result it finds.

I have been using Python's csv.DictReader and csv.DictWriter. I have it working, but it is very inefficient because it loops through both sets of data until it finds a result.

This is what I have so far. A few bits of the code are specific to my setup (file locations etc.), but I'm sure people can see around this:

    # Set all csv attributes

    cache = {}
    in_file = open(sript_path + '/cp_updates/' + update_file, 'r')
    reader = csv.DictReader(in_file, delimiter= ',')
    out_file = open(sript_path + '/cp_updates/' + update_file + '.new', 'w')
    out_file.write("StockNumber,SKU,ChannelProfileID\n")
    writer = csv.DictWriter(out_file, fieldnames=('StockNumber', 'SKU', 'ChannelProfileID'), delimiter=',')
    check_file = open(sript_path + '/feeds/' + feed_file, 'r')
    ch_file_reader = csv.DictReader(check_file, delimiter=',')

    #loop through the csv's, find stock levels and update file

    for row in reader:
        #print row
        check_file.seek(0)
        found = False
        for ch_row in ch_file_reader:
            #if row['SKU'] not in cache:
            if ch_row['ProductCode'] == row['SKU']:
                Stk_Lvl = int(ch_row[stk_lvl_header])
                if Stk_Lvl > 0:
                    res = 3746
                elif Stk_Lvl == 0:
                    res = 3745
                else:
                    res = " "
                found = True
                print ch_row
                print res

                cache[row['SKU']] = res
        if not found:
            res = " "
            #print ch_row
            #print res
            cache[row['SKU']] = res     
        row['ChannelProfileID'] = cache[row['SKU']]
        writer.writerow(row)

Here are a few lines from my in_file. The out_file has the same structure; the script just updates the ChannelProfileID depending on the results found:

"StockNumber","SKU","ChannelProfileID"
"10m_s-vid#APTIIAMZ","2VV-10",3746
"10m_s-vid#CSE","2VV-10",3746
"1RR-01#CSE","1RR-01",3746
"1RR-01#PCAWS","1RR-01",3746
"1m_s-vid_ext#APTIIAMZ","2VV-101",3746

These are a few lines from the check_file:

ProductCode, Description, Supplier, CostPrice, RRPPrice, Stock, Manufacturer, SupplierProductCode, ManuCode, LeadTime
2VV-03,3MTR BLACK SVHS M - M GOLD CABLE - B/Q 100,Cables Direct Ltd,0.43,,930,CDL,2VV-03,2VV-03,1
2VV-05,5MTR BLACK SVHS M - M GOLD CABLE - B/Q 100,Cables Direct Ltd,0.54,,1935,CDL,2VV-05,2VV-05,1
2VV-10,10MTR BLACK SVHS M - M GOLD CABLE - B/Q 50,Cables Direct Ltd,0.86,,1991,CDL,2VV-10,2VV-10,1

So you can see that it takes the first line from the in_file, looks up the SKU in the check_file, then writes the line to the out_file in the same format as the in_file, changing the ChannelProfileID depending on what it finds in the Stock field of the check_file. It then goes back to the first line of the check_file and does the same for the next line of the in_file.

As I say, this script works and outputs exactly what I want, but I believe it is slow and inefficient because it has to keep looping through the check_file until it finds a result.

What I'm after is suggestions on how to improve the efficiency. I'm guessing there's a better way to find the data than to keep looping through the check_file?

I am a beginner at Python and this is my first post, so I apologize in advance for anything incorrect or unclear. I think I have posted according to the forum rules and have clearly set out what my script does and the help I am after.

Thanks in advance for any help/suggestions.

Blair · Accepted Answer · 2011-12-23 00:30:15Z

What you want is a mapping from the product code (the key) to a stock level/result code (the value). In Python this is known as a dictionary. The way to do it is to go through your check file once at the start and use the information in it to create a dictionary containing all the stock level details. You then go through your input file, read in each product code, and retrieve the result code from the dictionary you created earlier.

I've rewritten your code to do this, and it works for the example files you gave. I have commented it fairly thoroughly, but if there is anything unclear in it just post a comment and I'll try to clarify.

import csv

# Open the check file in a context manager. This ensures the file will be closed
# correctly if an error occurs.
with open('checkfile.csv', 'rb') as checkfile:
    checkreader = csv.DictReader(checkfile)

    # Create a function which maps the stock level to the result code.
    def result_code(stock_level):
        if stock_level > 0:
            return 3746
        if stock_level == 0:
            return 3745
        return " "

    # This does the real work. The middle line is a generator expression which
    # iterates over each line in the check file. The product code and stock
    # level are extracted from each line, the stock level converted into the
    # result, and the two values put together in a tuple. This is then converted
    # into a dictionary. This dictionary has the product codes as its keys and
    # their result code as its values.
    product_result = dict(
        (v['ProductCode'], result_code(int(v[' Stock']))) for v in checkreader
    )

# Open the input and output files.
with open('infile.csv', 'rb') as infile:
    with open('outfile.csv', 'wb') as outfile:
        reader = csv.DictReader(infile)

        # Use the same field names for the output file.
        writer = csv.DictWriter(outfile, reader.fieldnames)
        writer.writeheader()

        # Iterate over the products in the input.
        for product in reader:
            # Find the stock level from the dictionary we created earlier. Using
            # the get() method allows us to specify a default value if the SKU
            # does not exist in the dictionary.
            result = product_result.get(product['SKU'], " ")

            # Update the product info.
            product['ChannelProfileID'] = result

            # Write it to the output file.
            writer.writerow(product)

Great Thanks, really clear and simple with excellent comments! — gingebot, Dec 26 '11 at 17:46
Great Thanks, really clear and simple with excellent comments, has really helped my learning. It's also taken a script that took about 30 mins to complete with a full dataset to a matter of seconds! One slight change I made is; ` def result_code(stock_level): try : stock_level = int(stock_level) except : stock_level = 0 if stock_level > 0: return 3746 if stock_level == 0: return 3745 return " " product_result = dict( (v['item code'], result_code(v['stock'])) for v in checkreader )` Because now and again a blank space is used instead of 0 for a stock level of 0. — gingebot, Dec 26 '11 at 17:52
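The blank-tolerant variant described in the comment above can be sketched as follows. This is only an illustration under some assumptions: it targets Python 3 (the answer's code is Python 2), the sample rows and the `build_lookup` helper are made up for the example, and `skipinitialspace=True` is used so that the header `" Stock"` is read as `"Stock"`.

```python
import csv
import io

# Illustrative feed shaped like the question's check file (columns trimmed,
# stock values chosen for the example). Note the blank stock on the second row.
CHECK_CSV = """ProductCode, Description, Stock
2VV-03,3MTR BLACK SVHS CABLE,930
2VV-05,5MTR BLACK SVHS CABLE,
2VV-10,10MTR BLACK SVHS CABLE,0
"""


def result_code(raw_stock):
    """Map a raw stock field to a result code, treating blanks as zero."""
    try:
        stock_level = int(raw_stock)
    except (TypeError, ValueError):
        # A blank or missing field counts as a stock level of 0.
        stock_level = 0
    if stock_level > 0:
        return 3746
    if stock_level == 0:
        return 3745
    return " "


def build_lookup(check_file):
    """Hypothetical helper: read the feed once into a dict keyed by product code."""
    # skipinitialspace drops the space after each comma, including in the header.
    reader = csv.DictReader(check_file, skipinitialspace=True)
    return {row["ProductCode"]: result_code(row["Stock"]) for row in reader}


lookup = build_lookup(io.StringIO(CHECK_CSV))
print(lookup.get("2VV-05", " "))
```

Catching only `TypeError` and `ValueError` rather than using a bare `except` keeps unrelated errors visible. With real files on Python 3, the csv documentation recommends opening them with `newline=''`.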

jcollado · Answer 2 · 2011-12-22 23:58:23Z

One thing that you should avoid is reading the same file multiple times. There isn't any detail in the question about how big your files are, so I'll assume they fit in memory. In that case, I'd recommend reading each file once, working on the data in memory, and then writing the result file.

Aside from that, as you read the data, you can organize it in a way that speeds up later searches. It looks like the columns you're interested in are the ones related to the ProductCode, so you could create a dictionary of lists that can be accessed using the ProductCode as the key. As I said, this should speed up the search.

If for some reason a dictionary isn't appropriate, you can try a database like SQLite through the sqlite3 module, which is part of the standard library, and store your data in memory so that you can run SQL queries to get the data you need faster.
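A dictionary of lists keyed by ProductCode might look like the sketch below. It assumes Python 3, and the feed contents (including the second supplier row) and the `index_by_product_code` name are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Illustrative feed; the "Other Supplier" row is hypothetical and only there
# to show two rows sharing one product code.
FEED = """ProductCode,Supplier,Stock
2VV-10,Cables Direct Ltd,1991
2VV-10,Other Supplier,0
2VV-03,Cables Direct Ltd,930
"""


def index_by_product_code(feed_file):
    """Read the feed once and group its rows by product code."""
    rows_by_code = defaultdict(list)
    for row in csv.DictReader(feed_file):
        rows_by_code[row["ProductCode"]].append(row)
    return rows_by_code


rows_by_code = index_by_product_code(io.StringIO(FEED))

# Each lookup is now a single dictionary access instead of a scan of the file.
total_stock = sum(int(r["Stock"]) for r in rows_by_code["2VV-10"])
print(total_stock)
```

If every product code appears only once, a plain dictionary mapping the code to a single row is simpler; use `rows_by_code.get(code)` when a missing key should not create an empty list.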
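As a rough sketch of the sqlite3 route, assuming Python 3: the table name `feed`, its column names, and the helpers `load_feed` and `stock_for` are all made up for this example, and the feed is trimmed to two columns.

```python
import csv
import io
import sqlite3

# Illustrative two-column feed; real data would come from the check file.
FEED = """ProductCode,Stock
2VV-03,930
2VV-05,0
2VV-10,1991
"""


def load_feed(conn, feed_file):
    """Load the feed into an in-memory table keyed by product code."""
    conn.execute(
        "CREATE TABLE feed (product_code TEXT PRIMARY KEY, stock INTEGER)"
    )
    rows = ((r["ProductCode"], int(r["Stock"])) for r in csv.DictReader(feed_file))
    conn.executemany("INSERT INTO feed VALUES (?, ?)", rows)


def stock_for(conn, sku):
    """Return the stock level for a SKU, or None if it is not in the feed."""
    row = conn.execute(
        "SELECT stock FROM feed WHERE product_code = ?", (sku,)
    ).fetchone()
    return None if row is None else row[0]


conn = sqlite3.connect(":memory:")
load_feed(conn, io.StringIO(FEED))
print(stock_for(conn, "2VV-10"))
```

The primary key gives SQLite an index on `product_code`, so each lookup avoids a full table scan. For a simple key-to-value lookup like this one, a dictionary is usually less code.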

I hope this helps.

MDT · Answer 3 · 2011-12-23 00:06:49Z

I think I have come up with something along the lines of what you want, using dicts:

import csv
in_file = open("in.csv", 'r')
reader = csv.DictReader(in_file, delimiter= ',')
out_file = open("out.txt", 'w')
out_file.write("StockNumber,SKU,ChannelProfileID\n")
check_file = open("check.csv", 'r')
check = csv.DictReader(check_file, delimiter=',')

prodid = set()
prod_sn = dict()
for row in reader:
    prodid.add(row["SKU"])
    prod_sn[row["SKU"]] = row["StockNumber"]
    print(row["SKU"])

stocknums = dict()
for row in check:
    stocknums[row["ProductCode"]] = row[" Stock"]
    print(row["ProductCode"])


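# Look up each SKU from the input file in the stock dictionary.
for product in prodid:
    ref = 0
    # The csv module returns strings, so convert before comparing numerically.
    if product in stocknums and int(stocknums[product]) > 0:
        ref = 1

    out_file.write(str(prod_sn[product]) + ',' + str(product) + ',' + str(ref) + "\n")

# Close the files so the output is flushed to disk.
in_file.close()
check_file.close()
out_file.close()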

pyInTheSky · Answer 4 · 2011-12-24 19:31:48Z

I hope this meets all your needs. It lets you hold your CSV in dict form, do lookups and modifications, and write it back in preserved order. You can also change which column is your lookup column (making sure that column has a unique ID in every row). My usage example assumes that both classes are contained within the same file, named 'CustomDictReader.py'. In the end, you can create two CSVRW objects, set the lookup column for each one, do your swapping/comparing/lookups, and then do the final write when you are done going through what you need.

-- File 'CustomDictReader.py' --

import csv, collections, copy

'''
# CSV TEST FILE 'test.csv'

TBLID,DATETIME,VAL
C1,01:01:2011:00:01:23,5
C2,01:01:2012:00:01:23,8
C3,01:01:2013:00:01:23,4
C4,01:01:2011:01:01:23,9
C5,01:01:2011:02:01:23,1
C6,01:01:2011:03:01:23,5
C7,01:01:2011:00:01:23,6
C8,01:01:2011:00:21:23,8
C9,01:01:2011:12:01:23,1


#usage

>>> import CustomDictReader
>>> import pprint
>>> test = CustomDictReader.CSVRW()
>>> success, thedict = test.createCsvDict('TBLID',',',None,'test.csv')
>>> pprint.pprint(dict(thedict))
{'C1': OrderedDict([('TBLID', 'C1'), ('DATETIME', '01:01:2011:00:01:23'), ('VAL', '5')]),
 'C2': OrderedDict([('TBLID', 'C2'), ('DATETIME', '01:01:2012:00:01:23'), ('VAL', '8')]),
 'C3': OrderedDict([('TBLID', 'C3'), ('DATETIME', '01:01:2013:00:01:23'), ('VAL', '4')]),
 'C4': OrderedDict([('TBLID', 'C4'), ('DATETIME', '01:01:2011:01:01:23'), ('VAL', '9')]),
 'C5': OrderedDict([('TBLID', 'C5'), ('DATETIME', '01:01:2011:02:01:23'), ('VAL', '1')]),
 'C6': OrderedDict([('TBLID', 'C6'), ('DATETIME', '01:01:2011:03:01:23'), ('VAL', '5')]),
 'C7': OrderedDict([('TBLID', 'C7'), ('DATETIME', '01:01:2011:00:01:23'), ('VAL', '6')]),
 'C8': OrderedDict([('TBLID', 'C8'), ('DATETIME', '01:01:2011:00:21:23'), ('VAL', '8')]),
 'C9': OrderedDict([('TBLID', 'C9'), ('DATETIME', '01:01:2011:12:01:23'), ('VAL', '1')])}

'''

class CustomDictReader(csv.DictReader):
    '''
        override the next() function and use an
        ordered dict in order to preserve column order
        when writing back into the file
    '''

    def __init__(self, f, fieldnames = None, restkey = None, restval = None, dialect ="excel", *args, **kwds):
        csv.DictReader.__init__(self, f, fieldnames = None, restkey = None, restval = None, dialect = "excel", *args, **kwds)

    def next(self):
        if self.line_num == 0:
            # Used only for its side effect.
            self.fieldnames
        row = self.reader.next()
        self.line_num = self.reader.line_num

        # unlike the basic reader, we prefer not to return blanks,
        # because we will typically wind up with a dict full of None
        # values
        while row == []:
            row = self.reader.next()
        d = collections.OrderedDict(zip(self.fieldnames, row))

        lf = len(self.fieldnames)
        lr = len(row)
        if lf < lr:
            d[self.restkey] = row[lf:]
        elif lf > lr:
            for key in self.fieldnames[lr:]:
                d[key] = self.restval
        return d

class CSVRW(object):

    def __init__(self):
        self.file_name = ""
        self.csv_delim = ""
        self.csv_dict  = collections.OrderedDict()

    def setCsvFileName(self, name):
        '''
            @brief stores csv file name
            @param name- the file name
        '''
        self.file_name = name

    def getCsvFileName(self):
        '''
            @brief getter
            @return returns the file name
        '''
        return self.file_name

    def getCsvDict(self):
        '''
            @brief getter
            @return returns a deep copy of the csv as a dictionary
        '''
        return copy.deepcopy(self.csv_dict)

    def clearCsvDict(self):
        '''
            @brief resets the dictionary
        '''
        self.csv_dict = collections.OrderedDict()

    def updateCsvDict(self, newCsvDict):
        '''
            creates a deep copy of the dict passed in and
            sets it to the member one
        '''
        self.csv_dict = copy.deepcopy(newCsvDict)

    def createCsvDict(self,dictKey, delim, handle = None, name = None, readMode = 'rb', **kwargs):
        '''
            @brief create a dict from a csv file where:
                the column names are taken from the first line of the file, overridable w/ **kwargs
                each row is a dict
                each row can be accessed by the value stored in the column associated w/ dictKey

                that is to say, if you want to index into your csv file based on the contents of the
                third column, pass the name of that col in as 'dictKey'

            @param dictKey  - row key whose value will act as an index
            @param delim    - csv file delimiter
            @param handle   - file handle (leave as None if you wish to pass in a file name)
            @param name     - file name   (leave as None if you wish to pass in a file handle)
            @param readMode - 'r' || 'rb'
            @param **kwargs - additional args allowed by the csv module
            @return tuple   - (True, copy of the dict) on success, (False, error message) on failure
        '''
        retVal         = (False, None)
        self.csv_delim = delim

        try:
            reader = None
            if isinstance(handle, file):
                self.setCsvFileName(handle.name)
                # Pass the delimiter by keyword so it reaches the csv reader.
                reader = CustomDictReader(handle, delimiter=delim, **kwargs)
            else:
                if name is None:
                    name = self.getCsvFileName()
                else:
                    self.setCsvFileName(name)
                reader = CustomDictReader(open(name, readMode), delimiter=delim, **kwargs)

            for row in reader:
                self.csv_dict[row[dictKey]] = row

            retVal = (True, self.getCsvDict())

        except IOError:
            retVal = (False, 'Error opening file')

        return retVal

    def createCsv(self, writeMode, outFileName = None, delim = None):
        '''
            @brief create a csv from self.csv_dict
            @param writeMode   - 'w' || 'wb'
            @param outFileName - file name (defaults to the stored file name)
            @param delim       - csv delimiter
            @return none
        '''
        if None == outFileName:
            outFileName = self.file_name
        if None == delim:
            delim = self.csv_delim

        with open(outFileName, writeMode) as fout:
            for key in self.csv_dict.values():
                fout.write(delim.join(key.keys()) + '\n')
                break

            for key in self.csv_dict.values():
                fout.write(delim.join(key.values()) + '\n')
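For readers on Python 3, the same idea (index the rows by a chosen column, modify them, and write them back in order) can be sketched with the standard csv module alone. This is not a port of the classes above: `read_indexed` and `write_indexed` are hypothetical names, and the sketch relies on plain dicts preserving insertion order (Python 3.7+).

```python
import csv
import io

# First three rows of the test file from the docstring above.
TEST_CSV = """TBLID,DATETIME,VAL
C1,01:01:2011:00:01:23,5
C2,01:01:2012:00:01:23,8
C3,01:01:2013:00:01:23,4
"""


def read_indexed(csv_file, key_column, delimiter=","):
    """Return (fieldnames, table), where table maps key_column values to rows."""
    reader = csv.DictReader(csv_file, delimiter=delimiter)
    table = {row[key_column]: row for row in reader}
    return reader.fieldnames, table


def write_indexed(csv_file, fieldnames, table, delimiter=","):
    """Write the header and then every row, in the order the rows were read."""
    writer = csv.DictWriter(
        csv_file, fieldnames=fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(table.values())


fieldnames, table = read_indexed(io.StringIO(TEST_CSV), "TBLID")
table["C2"]["VAL"] = "80"  # look up a row by its key and modify it

out = io.StringIO()
write_indexed(out, fieldnames, table)
print(out.getvalue())
```

As with the lookup column in the classes above, the key column must be unique per row; otherwise later rows silently replace earlier ones in `table`.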

asked	2 years ago
viewed	2127 times
active	2 years ago

migrated from stackoverflow.com Dec 23 '11 at 14:28
