I've got a function that I mainly use while web scraping. It lets me throw in a multi-line address, or a name field with unwanted characters, and clean it up.
Below is the code. I would like to know if this is the best approach: should I switch to recursion, stick with the while loop, or look at some completely different approach? Examples of I/O are commented in the code.
def clean_up(text, strip_chars=[], replace_extras={}):
    """
    :type text: str
    :type strip_chars: list
    :type replace_extras: dict
    *************************
    strip_chars: optional arg
        Accepts a list of strings to iterate through. Each item,
        if found at the beginning or end of the string, is stripped.
        example:
            text input: ' , , , .,.,.,.,,,......test, \t this\n.is.a\n.test...,,, , .'
            strip_chars arg: [',', '.']
            output: 'test, this .is.a .test'
    *************************
    replace_extras: optional arg
        Accepts a dict of replacements that override entries in the
        default clean_up_items dict or append to it.
        example:
            text input: ' this is one test\n!\n'
            replace_extras arg: {'\n': '', 'one': '1'}
            output: 'this is 1 test!'
    *************************
    DEFAULT REPLACE ITEMS
    ---------------------
    These can be overridden and/or appended to using the replace_extras
    argument.
        replace item      | with
        <\\n line ending> | <space>
        <\\r line ending> | <space>
        <\\t tab>         | <space>
        <double-space>    | <space>
        <text input>      | <stripped>
    *************************
    """
    clean_up_items = {'\n': ' ', '\r': ' ', '\t': ' ', '  ': ' '}
    clean_up_items.update(replace_extras)
    text = text.strip()
    change_made = True
    while change_made:
        text_old = text
        for x in strip_chars:
            while text.startswith(x) or text.endswith(x):
                text = text.strip(x).strip()
        for key, val in clean_up_items.items():
            while key in text:
                text = text.replace(key, val)
        change_made = text_old != text
    return text.strip()
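For comparison, here is roughly what the recursive version I'm weighing up would look like. This is just a sketch, not a finished replacement: `clean_up_recursive` is a placeholder name, and the only change is that each inner `while` loop is replaced by a single pass, with the recursion repeating passes until nothing changes.

```python
def clean_up_recursive(text, strip_chars=(), replace_extras=None):
    """Sketch of a recursive take on the same idea: do one cleaning
    pass, then recurse until a full pass makes no change."""
    clean_up_items = {'\n': ' ', '\r': ' ', '\t': ' ', '  ': ' '}
    clean_up_items.update(replace_extras or {})
    text = text.strip()
    new = text
    for x in strip_chars:
        # str.strip(x) already removes every leading/trailing run of x,
        # so no inner while loop is needed; the recursion repeats the pass
        new = new.strip(x).strip()
    for key, val in clean_up_items.items():
        new = new.replace(key, val)
    # base case: a whole pass changed nothing, so we are at a fixed point
    if new == text:
        return new
    return clean_up_recursive(new, strip_chars, replace_extras)

# The same two I/O examples from the docstring:
print(clean_up_recursive(' , , , .,.,.,.,,,......test, \t this\n.is.a\n.test...,,, , .',
                         strip_chars=[',', '.']))
# test, this .is.a .test
print(clean_up_recursive(' this is one test\n!\n',
                         replace_extras={'\n': '', 'one': '1'}))
# this is 1 test!
```

One thing I noticed while sketching it: the recursion depth grows with the amount of leading/trailing junk to strip, since each pass only peels off one alternating run, so a pathological input could in principle hit Python's recursion limit where the while-loop version would not.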