I have to flatten a large number (>300k) of dicts and write them to a CSV file.
Example:
d = {
    'a': 'b',
    'c': [
        {'d': 'e'},
        {'f': 'g'}
    ]
}
becomes:
a, c.0.d, c.1.f
b, e, g
The dicts can be really big, with many nested dicts as values.
My function to flatten them generically is:
def flatten(self, d, parent_key='', sep='.'):
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, collections.abc.MutableMapping):
            items.extend(self.flatten(v, new_key, sep=sep).items())
        elif isinstance(v, list):
            if v and isinstance(v[0], dict):
                for counter, entry in enumerate(v):
                    new_count_key = new_key + sep + str(counter)
                    items.extend(self.flatten(entry, new_count_key, sep=sep).items())
            else:
                items.append((new_key, v))
                if new_key not in self.key_list:
                    self.key_list.append(new_key)
        else:
            items.append((new_key, v))
            if new_key not in self.key_list:
                self.key_list.append(new_key)
    return dict(items)
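For reference, here is a standalone sketch of the same logic (without the `self`/`self.key_list` bookkeeping, which in my class just collects the CSV column names), runnable against the example above:

```python
import collections.abc

def flatten(d, parent_key='', sep='.'):
    # Recursively flatten nested dicts; list entries that are dicts
    # get their index inserted into the key path (c.0.d, c.1.f, ...).
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, collections.abc.MutableMapping):
            items.extend(flatten(v, new_key, sep=sep).items())
        elif isinstance(v, list) and v and isinstance(v[0], dict):
            for i, entry in enumerate(v):
                items.extend(flatten(entry, new_key + sep + str(i), sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

print(flatten({'a': 'b', 'c': [{'d': 'e'}, {'f': 'g'}]}))
# {'a': 'b', 'c.0.d': 'e', 'c.1.f': 'g'}
```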
When I measure the execution time, I find that most of it is spent in the isinstance checks. With 2500 dicts, for example, the instance checks take around 6 seconds out of 12 seconds total.
Is there any way I can speed up this function?