I've got a function that I mainly use while web scraping. It lets me throw in a multi-line address, or a name field with unwanted characters, and clean it up.
Below is the code. I would like to know if this is the best approach: should I switch to recursion, stick with the while loop, or look at some completely different approach? Examples of I/O are commented in the code.
def clean_up(text, strip_chars=[], replace_extras={}):
    """
    :type text: str
    :type strip_chars: list
    :type replace_extras: dict
    *************************
    strip_chars: optional arg
        Accepts a list of strings to iterate through. Each item,
        if found at the beginning or end of the string, is stripped.
        example:
            text input: ' , , , .,.,.,.,,,......test, \t this\n.is.a\n.test...,,, , .'
            strip_chars arg: [',', '.']
            output: 'test, this .is.a .test'
    *************************
    replace_extras: optional arg
        Accepts a dict of replacements that override entries in the
        default clean_up_items dict or append to it.
        example:
            text input: ' this is one test\n!\n'
            replace_extras arg: {'\n': '', 'one': '1'}
            output: 'this is 1 test!'
    *************************
    DEFAULT REPLACE ITEMS
    ---------------------
    These can be overridden and/or appended to using the replace_extras
    argument.
        replace item      | with
        <\\n line ending> | <space>
        <\\r line ending> | <space>
        <\\t tab>         | <space>
        <double-space>    | <space>
        <text input>      | <stripped>
    *************************
    """
    clean_up_items = {'\n': ' ', '\r': ' ', '\t': ' ', '  ': ' '}
    clean_up_items.update(replace_extras)
    text = text.strip()
    change_made = True
    while change_made:
        text_old = text
        for x in strip_chars:
            while text.startswith(x) or text.endswith(x):
                text = text.strip(x).strip()
        for key, val in clean_up_items.items():
            while key in text:
                text = text.replace(key, val)
        change_made = text_old != text
    return text.strip()
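For comparison, here is roughly what the recursive version I'm weighing up would look like. This is just a sketch, not a finished replacement: `clean_up_recursive` is a placeholder name, and the only change is that each inner `while` loop is replaced by a single pass, with the recursion repeating passes until nothing changes.

```python
def clean_up_recursive(text, strip_chars=(), replace_extras=None):
    """Sketch of a recursive take on the same idea: do one cleaning
    pass, then recurse until a full pass makes no change."""
    clean_up_items = {'\n': ' ', '\r': ' ', '\t': ' ', '  ': ' '}
    clean_up_items.update(replace_extras or {})
    text = text.strip()
    new = text
    for x in strip_chars:
        # str.strip(x) already removes every leading/trailing run of x,
        # so no inner while loop is needed; the recursion repeats the pass
        new = new.strip(x).strip()
    for key, val in clean_up_items.items():
        new = new.replace(key, val)
    # base case: a whole pass changed nothing, so we are at a fixed point
    if new == text:
        return new
    return clean_up_recursive(new, strip_chars, replace_extras)

# The same two I/O examples from the docstring:
print(clean_up_recursive(' , , , .,.,.,.,,,......test, \t this\n.is.a\n.test...,,, , .',
                         strip_chars=[',', '.']))
# test, this .is.a .test
print(clean_up_recursive(' this is one test\n!\n',
                         replace_extras={'\n': '', 'one': '1'}))
# this is 1 test!
```

One thing I noticed while sketching it: the recursion depth grows with the amount of leading/trailing junk to strip, since each pass only peels off one alternating run, so a pathological input could in principle hit Python's recursion limit where the while-loop version would not.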