Nested string replace with regex in Python

Question

I've got a bunch of HTML pages in which I'd like to convert CSS-formatted text snippets into standard HTML tags, e.g. <span class="italic">some text</span> will become <i>some text</i>.

I'm stuck at nested span fragments:

<span class="italic"><span class="bold">XXXXXXXX</span></span>
<span class="italic">some text<span class="bold">nested text<span class="underline">deep nested text</span></span></span>

I'd like to convert the fragment using Python's regex library. What would be the optimal strategy to regex search-&-replace the above input?

It's just a personal preference. I know it could be done with a recursive plain string search, but somehow I find regex solutions more elegant. — user1656343, 2 days ago
The optimal strategy would really be to use something other than regular expressions, which are terribly underpowered for this. Beautiful Soup is the most popular go-to solution for parsing HTML in Python. — qwrrty, 2 days ago
It probably won't be so elegant. To do tag balancing, you need something stronger than regex. If you still want to use regular expressions, you'll need to use a loop. — Michael, 2 days ago
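
The loop Michael mentions could look roughly like the sketch below. It is only an illustration under narrow assumptions: every span is written exactly as `<span class="NAME">` with a single class from the question's sample, and the markup is well balanced. The names `CLASS_TO_TAG`, `INNERMOST` and `convert` are made up for this example. The pattern matches only innermost spans (bodies containing no further `<span`), so each pass peels off one nesting level until nothing is left to replace.

```python
import re

# Hypothetical mapping from CSS class to HTML tag, taken from the sample input.
CLASS_TO_TAG = {"italic": "i", "bold": "b", "underline": "u"}

# Match a span whose body contains no other "<span", i.e. an innermost span.
INNERMOST = re.compile(
    r'<span class="(italic|bold|underline)">((?:(?!<span\b).)*?)</span>',
    re.DOTALL,
)


def _replace(match):
    tag = CLASS_TO_TAG[match.group(1)]
    return "<{0}>{1}</{0}>".format(tag, match.group(2))


def convert(html):
    # Repeat until a pass makes no replacement; each pass removes one level.
    while True:
        html, count = INNERMOST.subn(_replace, html)
        if count == 0:
            return html


print(convert('<span class="italic"><span class="bold">XXXXXXXX</span></span>'))
# -> <i><b>XXXXXXXX</b></i>
```

Spans with extra attributes, several classes, or an unmapped class nested inside a mapped one are left unconverted by this pattern, which is one reason the comments above recommend a real parser.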

James Mills · Answer 1 · 2013-12-10 05:46:55Z

My solution using lxml and cssselect and a bit of Python:

#!/usr/bin/env python

import cssselect  # noqa
from lxml.html import fromstring


html = """
<span class="italic"><span class="bold">XXXXXXXX</span></span>
<span class="italic">some text<span class="bold">nested text<span class="underline">deep nested text</span></span></span>
"""

class_to_style = {
    "underline": "u",
    "italic": "i",
    "bold": "b",
}

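output = []
doc = fromstring(html)
# cssselect() returns every <span> in document order, nested ones included.
spans = doc.cssselect("span")
for span in spans:
    if span.attrib.get("class"):
        # span.text holds only the text before the first child element.
        tag = class_to_style[span.attrib["class"]]
        output.append("<{0}>{1}</{0}>".format(tag, span.text or ""))
print("".join(output))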

Output:

<i></i><b>XXXXXXXX</b><i>some text</i><b>nested text</b><u>deep nested text</u>

NB: This is a naive solution and does not produce the correct output as you'd have to keep a queue of open tags and close them at the end.
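
One way to do the bookkeeping this note describes is sketched below. It is not the lxml/cssselect approach from the answer: it swaps in the standard library's `html.parser` so that it runs without extra packages, and it uses a stack rather than a queue, since the most recently opened span is the first to close. The names `SpanConverter`, `CLASS_TO_TAG` and `convert_spans` are invented for this example, and the sketch assumes each span carries a single class value.

```python
from html.parser import HTMLParser

# Same mapping as class_to_style in the answer.
CLASS_TO_TAG = {"underline": "u", "italic": "i", "bold": "b"}


class SpanConverter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        # One entry per open <span>: its replacement tag, or None if unmapped.
        self.open_spans = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            new_tag = CLASS_TO_TAG.get(dict(attrs).get("class"))
            self.open_spans.append(new_tag)
            if new_tag:
                self.out.append("<%s>" % new_tag)
                return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "span" and self.open_spans:
            new_tag = self.open_spans.pop()
            if new_tag:
                self.out.append("</%s>" % new_tag)
                return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def convert_spans(html):
    parser = SpanConverter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(convert_spans('<span class="italic">some text<span class="bold">nested text</span></span>'))
# -> <i>some text<b>nested text</b></i>
```

As written, the sketch drops comments and doctype declarations, so those handlers would need adding before running it over whole pages.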

Awesome! I was unaware of cssselect until now! Thanks @James Mills ! – user1656343 yesterday

You're welcome! I use it quite a lot in my work :) pypi.python.org/pypi/spyda – James Mills yesterday

Oops! It doesn't work as expected. The output should be: <i><b>XXXXXXXX</b></i><i>some text<b>nested text<u>deep nested text</u></b></i> – user1656343 yesterday

Yes, my solution is naive at best. You'll have to keep a queue of open tags and close them at the end. I'm sure you can do this? :) I've updated my answer to reflect this. (Have to leave you a little work!) – James Mills yesterday

You're right, I'm exploring cssselect & spyda. Thanks for the heads up! – user1656343 yesterday
