Take the 2-minute tour ×
Code Review Stack Exchange is a question and answer site for peer programmer code reviews. It's 100% free, no registration required.

I currently have a Python script that parses a file (in this case an XML file) and makes line by line replacements as necessary, depending on the values of multiple arrays. As it stands now the script works great, but feel like it could be a lot better. I'll also need to add few more arrays later on and am worried about performance.

Additionally, the way I currently have it set up doesn't allow for counting and printing the total replacements that have been made. (e.g. "Replaced: xyz -- Made XX replacements") While not an immediate requirement, I'd like to add this functionality in the future.

arrayOne = ["old string one", "new string one"]
arrayTwo = ["old string two", "new string two"]

# Variable 'path' collected from command line input
f = open(path, "r", encoding="utf8")
newFile = open(path.replace(".xml", "-new.xml"), "w", encoding="utf8")

def replace(a,b):
    for data in f:
        for datatype in (arrayOne, arrayTwo):
            data = data.replace(datatype[a], datatype[b])
        newFile.write(data)
    newFile.close()

replace(0,1)

f.close()
share|improve this question
    
Would you ever expect to need to do the inverse transformation, i.e., replacing "new string one" with "old string one"? –  200_success Apr 16 at 16:46
    
No, I don't expect I'd ever need to go in reverse. If I did though, I could just reverse the arguments (e.g. replace(1,0)). –  zugzug Apr 16 at 17:02
add comment

1 Answer

One bug I see is that you perform the string substitutions sequentially, rather than "simultaneously". For example, suppose you had an input file with contents

abc

with substitution code

arrayOne = ["a", "b"]
arrayTwo = ["b", "c"]

…
replace(0, 1)

Most users would expect the output to be bcc. However, your code actually produces ccc, which would probably be considered surprising.


The main problem, though, is that everything is tightly coupled, such that the code is not reusable.

For example, replace() is a closure that captures arrayOne and arrayTwo, as well as file f. It doesn't generalize to perform one substitution or ≥ 3 substitutions. The two parameters that it does accept are more annoying than helpful, as you have indicated that you expect to always call it as replace(0, 1). Furthermore, replace() will only operate on the file that was opened as f at the time that the function was defined.

A better interface would be for replace() to accept a dict of substitutions to be performed. Furthermore, it should just concern itself with string transformation, staying clear of all file operations.


I propose the following solution, which contains some rather advanced concepts.

from functools import partial
import re

def str_replacer(mapping):
    regex = re.compile('|'.join(re.escape(orig) for orig in mapping))
    return partial(regex.sub, lambda match: mapping[match.group(0)])

You would use it like this:

>>> replace = str_replacer({
...     'a': 'b',
...     'b': 'c',
...     'c': 'd',
... })
>>> print(replace('abc'))
bcd

Explanation: Performing "simultaneous" replacements means scanning the input string until any of the "old" strings is encountered. If a substitution is performed, the scanning should recommence at the after where the newly spliced-in string ends.

That's tedious to do using low-level string manipulation, so I've opted to use re.sub() instead. To use re.sub(), I have to do two things

  1. Compose a regular expression that includes all of the "old" strings to look for. That's what re.compile(…) does.
  2. Create a callback function that specifies what the replacement for each old string should be. That's the lambda. match.group(0) is the old string that was found. We look its corresponding "new" string in the mapping.

However, I don't want str_replacer() to call regex.sub(repl, input_string) just yet. Instead, I want str_replacer() to return a function, that if called, performs the desired replacements. Why? Because you will likely call the resulting function many times if you want to process a file a line at a time, and it would be more efficient to reuse the regex and the lambda. That's what functools.partial() does: it creates a function that accepts an input string; when that function is called, then the regex.sub(lambda match: mapping[match.group(0)], input_string) actually happens.

replace = str_replacer({
    'old string one': 'new string one',
    'old string two': 'new string two',
})

with open(path, "r", encoding="utf8") as in_file, \
     open(path.replace(".xml", "-new.xml"), "w", encoding="utf8") as out_file:
    for line in in_file:
        out_file.write(replace(line))

Now the string replacement logic and the file manipulation logic are cleanly separated!

share|improve this answer
add comment

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.