Tagged Questions
44
votes
7 answers
34k views
Convert XML/HTML Entities into Unicode String in Python
I'm doing some web scraping, and sites frequently use HTML entities to represent non-ASCII characters. Does Python have a utility that takes a string with HTML entities and returns a unicode type?
For ...
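For the entity-decoding case described here, a minimal sketch, assuming Python 3.4 or later where the standard library provides `html.unescape`. The sample string is invented for illustration.

```python
import html

# Hypothetical scraped fragment mixing named, decimal, and hex character references.
raw = "Caf&eacute; &amp; bar: &#163;5 &#x3C; &pound;6"

# html.unescape resolves all three reference styles and returns a str (Unicode text).
decoded = html.unescape(raw)
print(decoded)
```

On older interpreters this function is not available, so this sketch does not cover the Python 2 setups that questions of this era usually targeted.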
79
votes
10 answers
48k views
Strip HTML from strings in Python
from mechanize import Browser
br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
    print(line)
When printing a line in an HTML file, I'm trying to find a ...
38
votes
3 answers
21k views
Decode HTML entities in Python string?
I'm trying to work out if there is a better way to achieve the following:
from lxml import html
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("<p>£682m</p>")
...
55
votes
13 answers
53k views
Extracting text from HTML file using Python
I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad.
I'd like something more ...
42
votes
9 answers
15k views
Python HTML sanitizer / scrubber / filter
I'm looking for a module that will remove any HTML tags from a string that are not found in a whitelist.
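One way to picture the whitelist idea is a small filter built on the standard library's `html.parser`. This is only a sketch under stated assumptions: the `ALLOWED_TAGS` set and the `WhitelistFilter` and `sanitize` names are hypothetical, all attributes are dropped, and it is not a vetted security tool. For untrusted input a maintained sanitizer library is the safer choice.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist; adjust to taste.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong"}


class WhitelistFilter(HTMLParser):
    """Re-emit only whitelisted tags (without attributes) and escaped text."""

    def __init__(self, allowed):
        super().__init__(convert_charrefs=True)
        self.allowed = allowed
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Attributes are discarded entirely, so onclick/href payloads vanish.
        if tag in self.allowed:
            self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in self.allowed:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text is re-escaped so decoded entities cannot turn back into markup.
        self.out.append(escape(data, quote=False))


def sanitize(markup, allowed=ALLOWED_TAGS):
    parser = WhitelistFilter(allowed)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


cleaned = sanitize('<p onclick="x()">Hi <a href="javascript:x">there</a> <b>you</b></p>')
print(cleaned)
```

Tags outside the whitelist are removed but their text content is kept, which may or may not be the behavior wanted for elements such as `script`.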
40
votes
4 answers
27k views
What's the easiest way to escape HTML in Python?
cgi.escape seems like one possible choice. Does it work well? Is there something that is considered better?
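As a point of comparison, a minimal sketch assuming Python 3.2 or later, where `html.escape` is the standard-library counterpart (`cgi.escape` was later deprecated and then removed in Python 3.8). The `escape_html` wrapper and the sample string are illustrative only.

```python
import html


def escape_html(text):
    # With quote=True (the default), html.escape replaces &, <, > and also
    # double and single quotes, so the result is usable in attribute values.
    return html.escape(text, quote=True)


sample = '<a href="x">Tom & Jerry</a>'
escaped = escape_html(sample)
print(escaped)
```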
8
votes
6 answers
11k views
How do I unescape HTML entities in a string in Python 3.1?
I have looked all around and only found solutions for python 2.6 and earlier, NOTHING on how to do this in python 3.X. (I only have access to Win7 box.)
I HAVE to be able to do this in 3.1 and ...
26
votes
4 answers
4k views
How can you make a vote-up-down button like in Stackoverflow?
Problems
how to make Ajax buttons (upward and downward arrows) such that the number can increase or decrease
how to save the action of a user to a variable NumberOfVotesOfQuestionID
I am not ...
14
votes
8 answers
15k views
Filter out HTML tags and resolve entities in python
Because regular expressions scare me, I'm trying to find a way to remove all HTML tags and resolve HTML entities from a string in Python.
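A regex-free approach can be sketched with the standard library's `html.parser`, assuming Python 3.5 or later so that `convert_charrefs=True` resolves entities before text reaches `handle_data`. The `TextExtractor` and `strip_tags` names and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes only; tags are ignored, entities are resolved."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode &amp;, &#163;, etc.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_tags(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


plain = strip_tags("<p>Caf&eacute; &amp; bar, &#163;5 <b>only</b></p>")
print(plain)
```

Note that the contents of `script` and `style` elements also arrive through `handle_data`, so a fuller version would need to skip them explicitly.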
11
votes
5 answers
13k views
Python library for rendering HTML and javascript
Is there any Python module for rendering an HTML page with JavaScript and getting back a DOM object?
I want to parse a page which generates almost all of its content using javascript.
9
votes
5 answers
2k views
Concurrent downloads - Python
The plan is this:
I download a webpage, collect a list of images parsed in the DOM and then download these. After this I would iterate through the images in order to evaluate which image is best ...
12
votes
2 answers
11k views
Converting PDF to HTML with Python
How can I convert PDF files to HTML with Python?
I was thinking something along the lines of what Google does (or seems to do) to index PDF files.
My final goal is to set up Apache to show the HTML ...
4
votes
6 answers
1k views
How to find/replace text in html while preserving html tags/structure
I use regexps to transform text as I want, but I want to preserve the HTML tags.
e.g. if I want to replace "stack overflow" with "stack underflow", this should work as
expected: if the input is stack ...
37
votes
7 answers
17k views
html to pdf for a Django site
For my Django-powered site, I am looking for an easy solution to convert dynamic HTML pages (generated using Django views and templates, with data generated using GET forms) which also contain some graph ...
12
votes
2 answers
8k views
Parse HTML table to Python list?
I'd like to take an HTML table and parse through it to get a list of dictionaries. Each list element would be a dictionary corresponding to a row in the table.
If, for example, I had an HTML table ...
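The row-to-dictionary idea can be sketched with the standard library's `html.parser`, assuming the first row holds the column headers and the table has no nested tables or spanning cells. The `TableParser` and `table_to_dicts` names and the sample table are made up for illustration.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of each td/th cell, grouped by tr row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_dicts(markup):
    parser = TableParser()
    parser.feed(markup)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows  # assumption: first row is the header
    return [dict(zip(header, row)) for row in body]


sample = (
    "<table>"
    "<tr><th>Name</th><th>Votes</th></tr>"
    "<tr><td>A</td><td>12</td></tr>"
    "<tr><td>B</td><td>8</td></tr>"
    "</table>"
)
records = table_to_dicts(sample)
print(records)
```

All values come back as strings; converting them to numbers is left to the caller.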