One bug I see is that you perform the string substitutions sequentially, rather than "simultaneously". For example, suppose you had an input file with contents
abc
with substitution code
arrayOne = ["a", "b"]
arrayTwo = ["b", "c"]
…
replace(0, 1)
Most users would expect the output to be bcc
. However, your code actually produces ccc
, which would probably be considered surprising.
The main problem, though, is that everything is tightly coupled, such that the code is not reusable.
For example, replace()
is a closure that captures arrayOne
and arrayTwo
, as well as file f
. It doesn't generalize to perform one substitution or ≥ 3 substitutions. The two parameters that it does accept are more annoying than helpful, as you have indicated that you expect to always call it as replace(0, 1)
. Furthermore, replace()
will only operate on the file that was opened as f
at the time that the function was defined.
A better interface would be for replace()
to accept a dict
of substitutions to be performed. Furthermore, it should just concern itself with string transformation, staying clear of all file operations.
I propose the following solution, which contains some rather advanced concepts.
from functools import partial
import re
def str_replacer(mapping):
regex = re.compile('|'.join(re.escape(orig) for orig in mapping))
return partial(regex.sub, lambda match: mapping[match.group(0)])
You would use it like this:
>>> replace = str_replacer({
... 'a': 'b',
... 'b': 'c',
... 'c': 'd',
... })
>>> print(replace('abc'))
bcd
Explanation: Performing "simultaneous" replacements means scanning the input string until any of the "old" strings is encountered. If a substitution is performed, the scanning should recommence at the after where the newly spliced-in string ends.
That's tedious to do using low-level string manipulation, so I've opted to use re.sub()
instead. To use re.sub()
, I have to do two things
- Compose a regular expression that includes all of the "old" strings to look for. That's what
re.compile(…)
does.
- Create a callback function that specifies what the replacement for each old string should be. That's the
lambda
. match.group(0)
is the old string that was found. We look its corresponding "new" string in the mapping
.
However, I don't want str_replacer()
to call regex.sub(repl, input_string)
just yet. Instead, I want str_replacer()
to return a function, that if called, performs the desired replacements. Why? Because you will likely call the resulting function many times if you want to process a file a line at a time, and it would be more efficient to reuse the regex and the lambda. That's what functools.partial()
does: it creates a function that accepts an input string; when that function is called, then the regex.sub(lambda match: mapping[match.group(0)], input_string)
actually happens.
replace = str_replacer({
'old string one': 'new string one',
'old string two': 'new string two',
})
with open(path, "r", encoding="utf8") as in_file, \
open(path.replace(".xml", "-new.xml"), "w", encoding="utf8") as out_file:
for line in in_file:
out_file.write(replace(line))
Now the string replacement logic and the file manipulation logic are cleanly separated!