44
votes
7 answers
34k views

Convert XML/HTML Entities into Unicode String in Python

I'm doing some web scraping and sites frequently use HTML entities to represent non-ASCII characters. Does Python have a utility that takes a string with HTML entities and returns a unicode type? For ...
79
votes
10 answers
48k views

Strip HTML from strings in Python

from mechanize import Browser
br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
    print line

When printing a line in an HTML file, I'm trying to find a ...
38
votes
3 answers
21k views

Decode HTML entities in Python string?

I'm trying to work out if there is a better way to achieve the following:

from lxml import html
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("<p>&pound;682m</p>") ...
55
votes
13 answers
53k views

Extracting text from HTML file using Python

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad. I'd like something more ...
42
votes
9 answers
15k views

Python HTML sanitizer / scrubber / filter

I'm looking for a module that will remove any HTML tags from a string that are not found in a whitelist.
40
votes
4 answers
27k views

What's the easiest way to escape HTML in Python?

cgi.escape seems like one possible choice. Does it work well? Is there something that is considered better?
8
votes
6 answers
11k views

How do I unescape HTML entities in a string in Python 3.1?

I have looked all around and only found solutions for Python 2.6 and earlier, nothing on how to do this in Python 3.x. (I only have access to a Win7 box.) I have to be able to do this in 3.1 and ...
26
votes
4 answers
4k views

How can you make a vote-up-down button like in Stackoverflow?

Problems: how to make Ajax buttons (upward and downward arrows) such that the number can increase or decrease; how to save the action of a user to a variable NumberOfVotesOfQuestionID. I am not ...
14
votes
8 answers
15k views

Filter out HTML tags and resolve entities in python

Because regular expressions scare me, I'm trying to find a way to remove all HTML tags and resolve HTML entities from a string in Python.
11
votes
5 answers
13k views

Python library for rendering HTML and javascript

Is there any Python module for rendering an HTML page with JavaScript and getting back a DOM object? I want to parse a page which generates almost all of its content using JavaScript.
9
votes
5 answers
2k views

Concurrent downloads - Python

The plan is this: I download a webpage, collect a list of images parsed in the DOM and then download these. After this I would iterate through the images in order to evaluate which image is best ...
12
votes
2 answers
11k views

Converting PDF to HTML with Python

How can I convert PDF files to HTML with Python? I was thinking something along the lines of what Google does (or seems to do) to index PDF files. My final goal is to set up Apache to show the HTML ...
4
votes
6 answers
1k views

How to find/replace text in html while preserving html tags/structure

I use regexps to transform text as I want, but I want to preserve the HTML tags. e.g. if I want to replace "stack overflow" with "stack underflow", this should work as expected: if the input is stack ...
37
votes
7 answers
17k views

html to pdf for a Django site

For my Django-powered site, I am looking for an easy solution to convert dynamic HTML pages (generated using Django views and templates, with data generated using GET forms), which also contain some graph ...
12
votes
2 answers
8k views

Parse HTML table to Python list?

I'd like to take an HTML table and parse through it to get a list of dictionaries. Each list element would be a dictionary corresponding to a row in the table. If, for example, I had an HTML table ...
