Making line by line replacements using values from multiple arrays

Question

I currently have a Python script that parses a file (in this case an XML file) and makes line by line replacements as necessary, depending on the values of multiple arrays. As it stands now the script works great, but feel like it could be a lot better. I'll also need to add few more arrays later on and am worried about performance.

Additionally, the way I currently have it set up doesn't allow for counting and printing the total replacements that have been made. (e.g. "Replaced: xyz -- Made XX replacements") While not an immediate requirement, I'd like to add this functionality in the future.

arrayOne = ["old string one", "new string one"]
arrayTwo = ["old string two", "new string two"]

# Variable 'path' collected from command line input
f = open(path, "r", encoding="utf8")
newFile = open(path.replace(".xml", "-new.xml"), "w", encoding="utf8")

def replace(a,b):
    for data in f:
        for datatype in (arrayOne, arrayTwo):
            data = data.replace(datatype[a], datatype[b])
        newFile.write(data)
    newFile.close()

replace(0,1)

f.close()

Would you ever expect to need to do the inverse transformation, i.e., replacing "new string one" with "old string one"? — 200_success♦, Apr 16 at 16:46
No, I don't expect I'd ever need to go in reverse. If I did though, I could just reverse the arguments (e.g. replace(1,0)). — zugzug, Apr 16 at 17:02

200_success · Answer 1 · 2014-04-16 19:11:40Z

One bug I see is that you perform the string substitutions sequentially, rather than "simultaneously". For example, suppose you had an input file with contents

abc

with substitution code

arrayOne = ["a", "b"]
arrayTwo = ["b", "c"]

…
replace(0, 1)

Most users would expect the output to be bcc. However, your code actually produces ccc, which would probably be considered surprising.

The main problem, though, is that everything is tightly coupled, such that the code is not reusable.

For example, replace() is a closure that captures arrayOne and arrayTwo, as well as file f. It doesn't generalize to perform one substitution or ≥ 3 substitutions. The two parameters that it does accept are more annoying than helpful, as you have indicated that you expect to always call it as replace(0, 1). Furthermore, replace() will only operate on the file that was opened as f at the time that the function was defined.

A better interface would be for replace() to accept a dict of substitutions to be performed. Furthermore, it should just concern itself with string transformation, staying clear of all file operations.

I propose the following solution, which contains some rather advanced concepts.

from functools import partial
import re

def str_replacer(mapping):
    regex = re.compile('|'.join(re.escape(orig) for orig in mapping))
    return partial(regex.sub, lambda match: mapping[match.group(0)])

You would use it like this:

>>> replace = str_replacer({
...     'a': 'b',
...     'b': 'c',
...     'c': 'd',
... })
>>> print(replace('abc'))
bcd

Explanation: Performing "simultaneous" replacements means scanning the input string until any of the "old" strings is encountered. If a substitution is performed, the scanning should recommence at the after where the newly spliced-in string ends.

That's tedious to do using low-level string manipulation, so I've opted to use re.sub() instead. To use re.sub(), I have to do two things

Compose a regular expression that includes all of the "old" strings to look for. That's what re.compile(…) does.
Create a callback function that specifies what the replacement for each old string should be. That's the lambda. match.group(0) is the old string that was found. We look its corresponding "new" string in the mapping.

However, I don't want str_replacer() to call regex.sub(repl, input_string) just yet. Instead, I want str_replacer() to return a function, that if called, performs the desired replacements. Why? Because you will likely call the resulting function many times if you want to process a file a line at a time, and it would be more efficient to reuse the regex and the lambda. That's what functools.partial() does: it creates a function that accepts an input string; when that function is called, then the regex.sub(lambda match: mapping[match.group(0)], input_string) actually happens.

replace = str_replacer({
    'old string one': 'new string one',
    'old string two': 'new string two',
})

with open(path, "r", encoding="utf8") as in_file, \
     open(path.replace(".xml", "-new.xml"), "w", encoding="utf8") as out_file:
    for line in in_file:
        out_file.write(replace(line))

Now the string replacement logic and the file manipulation logic are cleanly separated!

asked	1 month ago
viewed	54 times
active	1 month ago

current community

your communities

more stack exchange communities

Making line by line replacements using values from multiple arrays

1 Answer

Your Answer

Not the answer you're looking for? Browse other questions tagged python strings or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

Making line by line replacements using values from multiple arrays

1 Answer

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged python strings or ask your own question.

Related

Hot Network Questions