There are lots of XML and HTML parsers in Python, and I am looking for a simple way to extract a section of an HTML document, preferably using an XPath construct, though that is optional.

Here is an example:

src = "<html><body>...<div id=content>AAA<B>BBB</B>CCC</div>...</body></html>"

I want to extract the entire body of the element with id=content, so the result should be: <div id=content>AAA<B>BBB</B>CCC</div>

It would be nice if I could do this without installing a new library.

I would also prefer to get the original content of the desired element (not reformatted).

Using regular expressions is not allowed, as they are not safe for parsing XML/HTML.


To parse using a library, the best way is BeautifulSoup. Here is a snippet showing how it would work for you:

from BeautifulSoup import BeautifulSoup

src = "<html><body>...<div id=content>AAA<B>BBB</B>CCC</div>...</body></html>"
soupy = BeautifulSoup( src )

content_divs = soupy.findAll( attrs={'id':'content'} )
if len(content_divs) > 0:
    # print the first one
    print str(content_divs[0])

    # to print the text contents
    print content_divs[0].text

    # or to print all the raw html
    for each in content_divs:
        print each

Yes, I have done this. It may not be the best way to do it, but it works something like the code below. I haven't tested this.

import re

# re.search returns a single match object (re.finditer returns an iterator,
# which has no start()/end() methods).
match = re.search("<div id=content>", src)
src = src[match.start():]

# At this point the string starts with your div; everything preceding it has been stripped.
# This next part works because the first </div> in the string is the end of your div section.
match = re.search("</div>", src)
src = src[:match.end()]

src now holds just the div you're after. If there are situations where another div is nested inside the one you want, you will have to build a fancier search pattern for your regex calls.
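For the nested case, one alternative to a fancier pattern is to let a parser count the opening and closing tags and then slice the original string, which also returns the markup unreformatted and needs no extra library. The following is only a minimal sketch, assuming Python 3 (where the standard-library parser lives in html.parser); ElementSlicer and extract_by_id are hypothetical names made up for this example, and it only handles elements that have an explicit closing tag.

```python
from html.parser import HTMLParser


class ElementSlicer(HTMLParser):
    """Hypothetical helper: locate the source span of the first element with a given id."""

    def __init__(self, src, target_id):
        super().__init__()
        self.src = src
        self.target_id = target_id
        # Absolute index at which each line of src begins, so that the
        # (line, column) pairs from getpos() can be mapped back into src.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(src) if ch == "\n"]
        self.tag = None    # tag name of the matched element (lowercased by the parser)
        self.depth = 0     # how many same-named tags are currently open inside it
        self.start = None  # index of the "<" of the opening tag
        self.end = None    # index just past the ">" of the matching closing tag

    def _index(self):
        line, col = self.getpos()
        return self.line_starts[line - 1] + col

    def handle_starttag(self, tag, attrs):
        if self.end is not None:
            return  # already found the whole element
        if self.tag is None:
            if dict(attrs).get("id") == self.target_id:
                self.tag, self.depth, self.start = tag, 1, self._index()
        elif tag == self.tag:
            self.depth += 1  # a nested element of the same name

    def handle_endtag(self, tag):
        if self.tag is None or self.end is not None or tag != self.tag:
            return
        self.depth -= 1
        if self.depth == 0:
            # getpos() points at the "<" of the closing tag; include up to its ">".
            self.end = self.src.index(">", self._index()) + 1


def extract_by_id(src, target_id):
    """Return the original text of the element with id=target_id, or None."""
    parser = ElementSlicer(src, target_id)
    parser.feed(src)
    parser.close()
    if parser.start is None or parser.end is None:
        return None
    return src[parser.start:parser.end]


src = "<html><body>...<div id=content>AAA<B>BBB</B>CCC</div>...</body></html>"
print(extract_by_id(src, "content"))
# prints: <div id=content>AAA<B>BBB</B>CCC</div>
```

Because the result is a slice of src rather than a re-serialized tree, the original casing and unquoted attribute are preserved.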

For posterity: stackoverflow.com/a/1732454/326736 – Kalyan02 Jun 13 '13 at 16:22
