I have a large set of large files and a set of "phrases" that need to be replaced in each file.
The "business logic" imposes several restrictions:
- Matching must be case-insensitive
- Whitespace (spaces, tabs and newlines) between the two words must be matched, not ignored
My solution (see below) is a bit on the slow side. How could it be optimised, both in terms of IO and string replacement?
    import re

    data = open("INPUT_FILE").read()
    o = open("OUTPUT_FILE", "w")
    for phrase in phrases:  # the set of two-word phrases I am talking about
        b1, b2 = str(phrase).strip().split(" ")
        regex = re.compile(r"%s\ *\t*\n*%s" % (b1, b2), re.IGNORECASE)
        data = regex.sub(b1 + "_" + b2, data)
    o.write(data)
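One direction worth trying (a sketch, not a drop-in replacement — the sample `phrases` and `data` below are illustrative): combine every phrase pair into a single alternation so the text is scanned once in total rather than once per phrase, and pass a function to `sub` to pick out whichever pair matched.

    import re

    # Hypothetical sample input; in the real script `phrases` and the
    # file contents come from elsewhere.
    phrases = ["foo bar", "hello world"]
    data = "Some FOO  bar text and hello\nworld here."

    # Build one alternation of all phrase pairs so the text is scanned
    # once instead of once per phrase.  re.escape guards against regex
    # metacharacters inside the words.
    parts = []
    for phrase in phrases:
        b1, b2 = phrase.strip().split(" ")
        parts.append(r"(%s)[ \t\n]*(%s)" % (re.escape(b1), re.escape(b2)))
    combined = re.compile("|".join(parts), re.IGNORECASE)

    def join_pair(m):
        # Only one alternative matched; keep its two non-None groups.
        groups = [g for g in m.groups() if g is not None]
        return groups[0] + "_" + groups[1]

    data = combined.sub(join_pair, data)
    print(data)  # Some FOO_bar text and hello_world here.

Whether this wins depends on how many phrases there are; with many phrases the single pass usually beats repeated full scans of `data`.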
UPDATE: 4x speed-up by converting all text to lower case and dropping re.IGNORECASE
    regex = re.compile(r"(%s)\s*(%s)" % (b1, b2))
    data = regex.sub(r"\1_\2", data)

Note the replacement string must be raw (`r"\1_\2"`) or escaped (`"\\1_\\2"`): in a plain string `"\1"` is the control character `\x01`, not backreference 1.
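A minimal, self-contained illustration of that update (the `b1`, `b2` and `data` values here are made up for the example; note that lower-casing the text also lower-cases the output, which this approach accepts):

    import re

    # Illustrative values; b1/b2 normally come from splitting a phrase.
    b1, b2 = "foo", "bar"
    data = "Some FOO  bar text."

    # Lower-case the text once so the pattern can be compiled without
    # re.IGNORECASE (reported ~4x faster here).
    data = data.lower()
    regex = re.compile(r"(%s)\s*(%s)" % (b1, b2))

    # Raw string for the replacement, so \1 and \2 are backreferences.
    data = regex.sub(r"\1_\2", data)
    print(data)  # some foo_bar text.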
- no idea if it will be faster. – andrew cooke Sep 2 '11 at 17:18
- instead of `\s*`, change it to `(?: |\t|\n)*` - that might avoid issues with unicode, which is all i can think of to explain the case effect. – andrew cooke Sep 2 '11 at 17:46