I've got a bunch of HTML pages in which I'd like to convert CSS-formatted text snippets into standard HTML tags, e.g. <span class="bold">some text</span> should become <b>some text</b>.

I'm stuck at nested span fragments:

<span class="italic"><span class="bold">XXXXXXXX</span></span>
<span class="italic">some text<span class="bold">nested text<span class="underline">deep nested text</span></span></span>

I'd like to convert the fragment using Python's regex library. What would be the optimal strategy to regex search-&-replace the above input?

Why must it be done by regular expression? –  hwnd 2 days ago

It's just a personal preference. I know it could be done with a recursive plain string search... But somehow I find regex solutions to be more elegant... –  user1656343 2 days ago

The optimal strategy would really be to use something other than regular expressions, which are terribly underpowered for this. Beautiful Soup is the most popular go-to solution for parsing HTML in Python. –  qwrrty 2 days ago

It probably won't be so elegant. To do tag balancing, you need something stronger than regex. If you still want to use regular expressions, you'll need to use a loop. –  Michael 2 days ago

The ultimate html-regex rant is here. –  John1024 2 days ago
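
For what it's worth, the loop Michael mentions in the comments could look roughly like the sketch below. It only ever rewrites an innermost span (one whose body contains no further "<span") and repeats until nothing is left to replace. It assumes the markup is exactly `<span class="NAME">` with a single known class and double quotes; `CLASS_TO_TAG`, `INNERMOST_SPAN` and `convert_spans` are names made up for this example, and anything less regular than the sample input is better handled by a real parser.

```python
import re

# Assumed mapping from the CSS class names in the question to plain HTML tags.
CLASS_TO_TAG = {"bold": "b", "italic": "i", "underline": "u"}

# Matches only an innermost span: the body may contain already-converted tags
# such as <b>...</b>, but the lookahead refuses to step over another "<span".
INNERMOST_SPAN = re.compile(
    r'<span class="({0})">((?:(?!<span\b).)*?)</span>'.format(
        "|".join(map(re.escape, CLASS_TO_TAG))
    ),
    re.DOTALL,
)


def convert_spans(html):
    """Replace known spans from the inside out until none are left."""

    def to_tag(match):
        tag = CLASS_TO_TAG[match.group(1)]
        return "<{0}>{1}</{0}>".format(tag, match.group(2))

    while True:
        html, count = INNERMOST_SPAN.subn(to_tag, html)
        if count == 0:
            return html


sample = (
    '<span class="italic">some text<span class="bold">nested text'
    '<span class="underline">deep nested text</span></span></span>'
)
print(convert_spans(sample))
```

Each pass peels off one level of nesting, so the sample above needs three passes plus a final one that finds nothing to do.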

1 Answer

My solution using lxml and cssselect and a bit of Python:

#!/usr/bin/env python

import cssselect  # noqa
from lxml.html import fromstring


html = """
<span class="italic"><span class="bold">XXXXXXXX</span></span>
<span class="italic">some text<span class="bold">nested text<span class="underline">deep nested text</span></span></span>
"""

class_to_style = {
    "underline": "u",
    "italic": "i",
    "bold": "b",
}

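output = []
doc = fromstring(html)
# cssselect returns every <span> in document order, nested or not
spans = doc.cssselect("span")
for span in spans:
    # skip spans without a class (e.g. a wrapper element lxml may add around fragments)
    if span.attrib.get("class"):
        # span.text is only the text before the first child element
        output.append("<{0}>{1}</{0}>".format(class_to_style[span.attrib["class"]], span.text or ""))
print("".join(output))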

Output:

<i></i><b>XXXXXXXX</b><i>some text</i><b>nested text</b><u>deep nested text</u>

NB: This is a naive solution and does not produce the correct output, because it flattens the nesting. To preserve it you'd have to keep track of the open tags (a stack rather than a queue, really) and close them in reverse order.
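
As a rough sketch of that bookkeeping, here is one way it might be done. Note that this swaps lxml for the standard library's html.parser, purely so the example runs without extra dependencies; it is not the approach from the answer above. `SpanConverter` and `convert_spans` are hypothetical names, and the sketch silently drops HTML comments and doctype declarations.

```python
from html.parser import HTMLParser

# Assumed mapping from the CSS class names in the question to plain HTML tags.
CLASS_TO_TAG = {"bold": "b", "italic": "i", "underline": "u"}


class SpanConverter(HTMLParser):
    """Rewrites <span class="..."> as a plain tag while preserving nesting."""

    def __init__(self):
        # Keep entities such as &amp; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        # One entry per currently open <span>: the replacement tag, or None
        # when the span has no known class and is copied through unchanged.
        self.open_spans = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            new_tag = CLASS_TO_TAG.get(dict(attrs).get("class"))
            self.open_spans.append(new_tag)
            if new_tag is not None:
                self.out.append("<{0}>".format(new_tag))
                return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied verbatim.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "span" and self.open_spans:
            # The most recently opened span is the one being closed.
            new_tag = self.open_spans.pop()
            if new_tag is not None:
                self.out.append("</{0}>".format(new_tag))
                return
        self.out.append("</{0}>".format(tag))

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&{0};".format(name))

    def handle_charref(self, name):
        self.out.append("&#{0};".format(name))


def convert_spans(html):
    parser = SpanConverter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(convert_spans('<span class="italic"><span class="bold">XXXXXXXX</span></span>'))
```

The stack is what makes each `</span>` close the right tag: whatever was opened last is closed first, so the original nesting comes out intact.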

Awesome! I was unaware of cssselect until now! Thanks @James Mills ! –  user1656343 yesterday

Welcome. I use it quite a lot in my work :) pypi.python.org/pypi/spyda –  James Mills yesterday

Oops! It doesn't work as expected... The output should be: <i><b>XXXXXXXX</b></i><i>some text<b>nested text<u>deep nested text</u></b></i> –  user1656343 yesterday

Yes, my solution is naive at best. You'll have to keep a stack of open tags and close them in reverse order. I'm sure you can do this? :) Updated my answer to reflect this. (Have to leave you a little work!) –  James Mills yesterday

You're right, I'm exploring cssselect & spyda. Thanks for the heads up! –  user1656343 yesterday