Code Review Stack Exchange is a question and answer site for peer programmer code reviews.

I'd like to extract the HTML content of some websites and save it to a file. To achieve this, I have the following (working) code:

import codecs
import re

import lxml.html

output = codecs.open("test.html", "a", "utf-8")

def first():
    for i in range(1, 10):

        root = lxml.html.parse('http://test'+str(i)+'.xyz'+'?action=source').getroot()

        for empty in root.xpath('//*[self::b or self::i][not(node())]'):
            empty.getparent().remove(empty)

        tables = root.cssselect('table.main')
        tables = root.xpath('//table[@class="main" and not(ancestor::table[@class="main"])]')

        txt = []

        txt += ([lxml.html.tostring(t, method="html", encoding="utf-8") for t in tables])


        text = "\n".join(re.sub(r'\[:[\/]?T.*?:\]', '', el) for el in txt)

        output.write(text.decode("utf-8"))
        output.write("\n\n")

Is this "nice" code? I ask because I'm not sure whether working with strings here is a good idea, and because other HTML I've seen puts one closing tag (for example </div>) per line, whereas my code sometimes emits several closing tags on a single line.

Is it possible to avoid working with strings, and/or to make the output produce not

<td class="cell" valign="top" style=" width:0.00pt;"></td>

but rather:

<td class="cell" valign="top" style=" width:0.00pt;">
</td>

? Thanks for any suggestions :)


2 Answers

Try something like this. You could definitely clean up the two list comprehensions you have, but for now this should suffice.

def first():
    with codecs.open("test.html", "a", "utf-8") as output:
        for i in range(1, 10):
            txt = []
            root = lxml.html.parse('http://test'+str(i)+'.xyz'+'?action=source').getroot()
            for empty in root.xpath('//*[self::b or self::i][not(node())]'):
                empty.getparent().remove(empty)

            # tables = root.cssselect('table.main')  <-- don't need this; it's immediately overwritten below
            tables = root.xpath('//table[@class="main" and not(ancestor::table[@class="main"])]')

            txt += ([lxml.html.tostring(t, method="html", encoding="utf-8") for t in tables])
            text = "\n".join(re.sub(r'\[:[\/]?T.*?:\]', '', el) for el in txt)

            output.write(text.decode("utf-8") + "\n\n")

I like beautifulsoup, which is based on lxml but has a cleaner interface, imho.

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

I'm not quite sure what the "for empty in root.xpath" does - it looks like it's looking for empty b or i tags and removing them. I recommend a comment for that line.
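That reading looks right: the XPath selects b/i elements with no child nodes at all and removes them. A minimal sketch of the same step using only the standard library's ElementTree (a stand-in for the lxml call here, with a made-up sample input):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two empty <b> tags that should disappear,
# one non-empty <i> that should survive.
html = '<div><b></b><i>keep</i><p>text<b></b></p></div>'
root = ET.fromstring(html)

# Mirror //*[self::b or self::i][not(node())]: remove <b>/<i> elements
# that have no child elements and no text content.
# (Note: unlike lxml, ElementTree also drops the removed element's tail text.)
for parent in list(root.iter()):
    for child in list(parent):
        if child.tag in ('b', 'i') and len(child) == 0 and not child.text:
            parent.remove(child)

print(ET.tostring(root, encoding='unicode'))
# -> <div><i>keep</i><p>text</p></div>
```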

Also, the 'text =' line needs a comment showing an example of what you're stripping out of the string. Treat the next reader / maintainer of the code nicely. Or avoid the regexp if you can. I'm biased, though, because I don't use regexps until I'm absolutely forced to, and I always feel bad for the next developer having to maintain my regexps. But I digress onto my regexp soapbox. I'm sorry...
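As an example of the kind of comment I mean, here's what that pattern strips, using made-up [:T...:] markers (the real markers depend on the site being scraped):

```python
import re

# Hypothetical sample: the pattern removes bracketed markers of the form
# [:T...:] and [:/T...:], keeping the text between them.
sample = 'before [:T1:]inside[:/T1:] after'
cleaned = re.sub(r'\[:[\/]?T.*?:\]', '', sample)
print(cleaned)
# -> before inside after
```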

import codecs
import re
import urllib2
from bs4 import BeautifulSoup

def first():
    with codecs.open("test.html", "a", "utf-8") as output:
        for i in range(1, 10):
            html_doc = urllib2.urlopen('http://test'+str(i)+'.xyz'+'?action=source').read()
            soup = BeautifulSoup(html_doc)

            #remove empty 'b' and 'i' tags:
            for tag in soup.find_all(['b', 'i']):
                if not tag.text: tag.extract()

            # get all 'main' tables that are not nested inside another 'main' table
            # (recursive=False would only search the document's top level, so filter instead)
            tables = [t for t in soup.find_all("table", class_="main")
                      if not t.find_parent("table", class_="main")]

            # not sure what the output regexp is doing, but I suggest applying it
            # to the serialized table (tag.string is None for tags with child tags):
            for table in tables:
                output.write(re.sub(r'\[:[\/]?T.*?:\]', '', table.prettify()))

Regarding whether it's possible to get one closing tag per line instead of everything like <td ...></td> on a single line:

This is what the prettify() method will do for you, so you don't have to worry about the newlines. You can also pass your own output formatter to BeautifulSoup if you so choose.
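prettify() is specific to BeautifulSoup, but the idea is easy to see with only the standard library: xml.dom.minidom does the same kind of re-indenting (a sketch with a made-up snippet, assuming well-formed markup):

```python
from xml.dom import minidom

# Hypothetical snippet: a nested structure, serialized on one line.
snippet = '<tr><td class="cell">x</td><td>y</td></tr>'
pretty = minidom.parseString(snippet).toprettyxml(indent='  ')
print(pretty)
# Nested tags (and their closing tags) now each start on their own line.
```

lxml offers a similar option too: tostring() accepts pretty_print=True.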

Code untested and unrun... Just a thought at another approach. Beware bugs and syntax errors.

Enjoy! I hope this helps.
