I'd like to extract the HTML content of several web pages and save it to a file. To do this, I have the following (working) code:
import codecs
import re

import lxml.html

output = codecs.open("test.html", "a", "utf-8")

def first():
    for i in range(1, 10):
        root = lxml.html.parse('http://test' + str(i) + '.xyz?action=source').getroot()
        # Remove empty <b> and <i> elements.
        for empty in root.xpath('//*[self::b or self::i][not(node())]'):
            empty.getparent().remove(empty)
        # Keep only the outermost tables with class "main".
        tables = root.xpath('//table[@class="main" and not(ancestor::table[@class="main"])]')
        # encoding="unicode" returns text instead of bytes, so no decode step is needed.
        txt = [lxml.html.tostring(t, method="html", encoding="unicode") for t in tables]
        # Strip the [:T...:] and [:/T...:] markers from the serialized HTML.
        text = "\n".join(re.sub(r'\[:[\/]?T.*?:\]', '', el) for el in txt)
        output.write(text)
        output.write("\n\n")
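To show in isolation what the re.sub call above removes, here is a small standalone sketch. The sample string and the marker names in it are made up for illustration, and `strip_markers` is a hypothetical helper name. Only the pattern itself comes from the code above.

```python
import re

# Same pattern as in the code above: matches "[:T...:]" and "[:/T...:]".
MARKER = re.compile(r'\[:[\/]?T.*?:\]')


def strip_markers(html_text):
    """Return html_text with every marker removed (hypothetical helper)."""
    return MARKER.sub('', html_text)


# Made-up sample input; real pages may use different marker contents.
sample = '<td>[:T1:]Hello[:/T1:] world</td>'
cleaned = strip_markers(sample)
print(cleaned)
```

Because `.*?` is non-greedy, each marker is matched up to its own closing `:]`, so the text between an opening and a closing marker is kept.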
Is this "nice" code? I ask because I'm not sure whether it's a good idea to work with strings here, and because other HTML files I've seen put one closing tag (for example </div>) per line. My code sometimes puts several closing tags on one line.
Is it possible to avoid using strings, and/or to get output that contains not things like
<td class="cell" valign="top" style=" width:0.00pt;"></td>
but things like:
<td class="cell" valign="top" style=" width:0.00pt;">
</td>
? Thanks for any suggestions :)
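
To make the desired layout concrete, here is a minimal sketch of one possible way to get it. It does not use lxml. It swaps in the standard library's html.parser and re-emits the markup with every closing tag on its own line. The names `OneClosingTagPerLine` and `closing_tags_on_own_line` are hypothetical, and the sketch assumes the input contains no comments, doctype, or whitespace-sensitive elements such as <pre>, since comments and the doctype are dropped and the added newlines could change how such elements render.

```python
from html.parser import HTMLParser


class OneClosingTagPerLine(HTMLParser):
    """Re-emit HTML, starting a new line before each closing tag."""

    def __init__(self):
        # Keep entities as written instead of converting them to characters.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Reuse the start tag exactly as it appeared in the input.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied unchanged.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Add a line break unless the output already ends with one.
        if self.parts and not self.parts[-1].endswith("\n"):
            self.parts.append("\n")
        self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)


def closing_tags_on_own_line(html_text):
    """Return html_text with each closing tag moved to its own line."""
    parser = OneClosingTagPerLine()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


cell = '<td class="cell" valign="top" style=" width:0.00pt;"></td>'
print(closing_tags_on_own_line(cell))
```

In the loop above, this could be applied to each serialized table before writing it to the file. Whether that is preferable to adjusting lxml's own serialization is an open question here.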