Code Review Stack Exchange is a question and answer site for peer programmer code reviews.

I'd like to extract the HTML content of some websites and save it to a file. To achieve this, I have the following (working) code:

import codecs
import re

import lxml.html

output = codecs.open("test.html", "a", "utf-8")

def first():
    for i in range(1, 10):

        root = lxml.html.parse('http://test'+str(i)+'.xyz'+'?action=source').getroot()

        for empty in root.xpath('//*[self::b or self::i][not(node())]'):
            empty.getparent().remove(empty)

        tables = root.cssselect('table.main')
        tables = root.xpath('//table[@class="main" and not(ancestor::table[@class="main"])]')

        txt = []

        txt += ([lxml.html.tostring(t, method="html", encoding="utf-8") for t in tables])


        text = "\n".join(re.sub(r'\[:[\/]?T.*?:\]', '', el) for el in txt)

        output.write(text.decode("utf-8"))
        output.write("\n\n")

Is this "nice" code? I ask because I'm not sure whether working with strings here is a good idea, and because other HTML I've seen puts one closing tag (for example </div>) per line, whereas my code sometimes emits several closing tags on a single line.

Is it possible to avoid working with strings, and/or to make the output produce not

<td class="cell" valign="top" style=" width:0.00pt;"></td>

but rather:

<td class="cell" valign="top" style=" width:0.00pt;">
</td>

? Thanks for any suggestions :)


2 Answers

Try something like this. You could definitely clean up the two list comprehensions you have, but for now this should suffice.

def first():
    with codecs.open("test.html", "a", "utf-8") as output:
        for i in range(1, 10):
            txt = []
            root = lxml.html.parse('http://test'+str(i)+'.xyz'+'?action=source').getroot()
            for empty in root.xpath('//*[self::b or self::i][not(node())]'):
                empty.getparent().remove(empty)

            # tables = root.cssselect('table.main')  <-- don't need this; it's immediately overwritten below
            tables = root.xpath('//table[@class="main" and not(ancestor::table[@class="main"])]')

            txt += ([lxml.html.tostring(t, method="html", encoding="utf-8") for t in tables])
            text = "\n".join(re.sub(r'\[:[\/]?T.*?:\]', '', el) for el in txt)

            output.write(text.decode("utf-8") + "\n\n")

I like beautifulsoup, which is based on lxml but has a cleaner interface, imho.

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

I'm not quite sure what the "for empty in root.xpath" does - it looks like it's looking for empty b or i tags and removing them. I recommend a comment for that line.
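That reading looks right: the XPath selects b/i elements with no child nodes at all and removes them. A minimal sketch of the same step using only the standard library's ElementTree (a stand-in for the lxml call here, with a made-up sample input):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two empty <b> tags that should disappear,
# one non-empty <i> that should survive.
html = '<div><b></b><i>keep</i><p>text<b></b></p></div>'
root = ET.fromstring(html)

# Mirror //*[self::b or self::i][not(node())]: remove <b>/<i> elements
# that have no child elements and no text content.
# (Note: unlike lxml, ElementTree also drops the removed element's tail text.)
for parent in list(root.iter()):
    for child in list(parent):
        if child.tag in ('b', 'i') and len(child) == 0 and not child.text:
            parent.remove(child)

print(ET.tostring(root, encoding='unicode'))
# -> <div><i>keep</i><p>text</p></div>
```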

Also, the 'text =' line needs a comment showing an example of what you're stripping out of the string. Treat the next reader / maintainer of the code nicely. Or avoid the regexp if you can. I'm biased, though, because I don't use regexps until I'm absolutely forced to, and I always feel bad for the next developer having to maintain my regexps. But I digress onto my regexp soapbox. I'm sorry...
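As an example of the kind of comment I mean, here's what that pattern strips, using made-up [:T...:] markers (the real markers depend on the site being scraped):

```python
import re

# Hypothetical sample: the pattern removes bracketed markers of the form
# [:T...:] and [:/T...:], keeping the text between them.
sample = 'before [:T1:]inside[:/T1:] after'
cleaned = re.sub(r'\[:[\/]?T.*?:\]', '', sample)
print(cleaned)
# -> before inside after
```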

import codecs
import re
import urllib2
from bs4 import BeautifulSoup

def first():
    with codecs.open("test.html", "a", "utf-8") as output:
        for i in range(1, 10):
            html_doc = urllib2.urlopen('http://test'+str(i)+'.xyz'+'?action=source').read()
            soup = BeautifulSoup(html_doc)

            #remove empty 'b' and 'i' tags:
            for tag in soup.find_all(['b', 'i']):
                if not tag.text: tag.extract()

            # get all 'main' tables that are not nested inside another 'main' table
            # (recursive=False would only search the document's top level, so filter instead)
            tables = [t for t in soup.find_all("table", class_="main")
                      if not t.find_parent("table", class_="main")]

            # not sure what the output regexp is doing, but I suggest applying it
            # to the serialized table (tag.string is None for tags with child tags):
            for table in tables:
                output.write(re.sub(r'\[:[\/]?T.*?:\]', '', table.prettify()))

Regarding whether it's possible to get one closing tag per line instead of everything like <td ...></td> on a single line:

This is what the prettify() method will do for you, so you don't have to worry about the newlines. You can also pass your own output formatter to BeautifulSoup if you so choose.
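prettify() is specific to BeautifulSoup, but the idea is easy to see with only the standard library: xml.dom.minidom does the same kind of re-indenting (a sketch with a made-up snippet, assuming well-formed markup):

```python
from xml.dom import minidom

# Hypothetical snippet: a nested structure, serialized on one line.
snippet = '<tr><td class="cell">x</td><td>y</td></tr>'
pretty = minidom.parseString(snippet).toprettyxml(indent='  ')
print(pretty)
# Nested tags (and their closing tags) now each start on their own line.
```

lxml offers a similar option too: tostring() accepts pretty_print=True.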

Code untested and unrun... Just a thought at another approach. Beware bugs and syntax errors.

Enjoy! I hope this helps.
