Tagged Questions
24
votes
7 answers
29k views
Parsing HTML in Python
What's my best bet for parsing HTML if I can't use BeautifulSoup or lxml? I've got some code that uses SGMLlib but it's a bit low-level and it's now deprecated.
I would prefer if it could stomach a ...
23
votes
5 answers
8k views
Parsing HTML in python - lxml or BeautifulSoup? Which of these is better for what kinds of purposes?
From what I can make out, the two main HTML parsing libraries in Python are lxml and BeautifulSoup. I've chosen BeautifulSoup for a project I'm working on, but I chose it for no particular reason ...
4
votes
6 answers
1k views
How to find/replace text in html while preserving html tags/structure
I use regexps to transform text as I want, but I want to preserve the HTML tags.
e.g. if I want to replace "stack overflow" with "stack underflow", this should work as
expected: if the input is stack ...
5
votes
3 answers
21k views
How can I use the python HTMLParser library to extract data from a specific div tag?
I am trying to get a value out of an HTML page using the Python HTMLParser library. The value I want to get hold of is within this HTML element:
...
<div id="remository">20</div>
...
...
1
vote
4 answers
4k views
python UnicodeEncodeError > How can I simply remove troubling unicode characters?
Here's what I did:
>>> soup = BeautifulSoup (html)
>>> soup
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec ...
0
votes
0 answers
38 views
How can I select a value from a dropdown menu that is inside a double form tag with Python mechanize?
I'm trying to select a value from a second form tag and submit it:
<form action="manager.php" method="GET">
.....
<form action="manager.php" method="post">.
====here i need to select a ...
10
votes
6 answers
3k views
How to parse malformed HTML in python, using standard libraries
There are so many html and xml libraries built into python, that it's hard to believe there's no support for real-world HTML parsing.
I've found plenty of great third-party libraries for this task, ...
7
votes
1 answer
10k views
BeautifulSoup HTML table parsing
I am trying to parse information (html tables) from this site: http://www.511virginia.org/RoadConditions.aspx?j=All&r=1
Currently I am using BeautifulSoup and the code I have looks like this
...
5
votes
2 answers
3k views
BeautifulSoup - easy way to obtain HTML-free contents
I'm using this code to find all interesting links in a page:
soup.findAll('a', href=re.compile('^notizia.php\?idn=\d+'))
And it does its job pretty well. Unfortunately inside that a tag there are ...
3
votes
2 answers
5k views
Python HTMLParser
I'm parsing an HTML document using HTMLParser and I want to print the contents between the start and end of a p tag.
See my code snippet:
def handle_starttag(self, tag, attrs):
if tag == ...
2
votes
2 answers
2k views
html parser python
I am trying to parse a website. I am using the HTMLParser module. The problem is I want to parse the first <a href=""> after the comment: <!-- /topOfPage -->, but I don't really know how ...
2
votes
1 answer
225 views
Get text outside one tag and inside another
I am parsing a web page with BeautifulSoup, and it has some elements like the following:
<td><font size="2" color="#00009c"><b>Consultant Registration Number ...
1
vote
3 answers
4k views
How to parse an HTML file with a table using Python
I have an HTML file with a table (it's a large one, so only sample code is given). I want to retrieve the values in the table. I tried the HTMLParser library from Python.
I started coding like below. ...
1
vote
1 answer
441 views
Extracting paragraphs in Python using lxml
I would like to extract paragraphs from HTML with Python. I used the lxml module but it doesn't do exactly what I am looking for.
print html.parse(url).xpath('//p')[1].text_content()
<span ...
0
votes
2 answers
242 views
python extracting HTML tag attributes without regular expressions
Is there any way to extract HTML tag attributes using urllib, urllib2 or BeautifulSoup?
For example:
<a href="xyz" title="xyz">xyz</a>
gets href=xyz, title=xyz
There is another ...
4
votes
4 answers
536 views
What’s the most forgiving HTML parser in Python?
I have some random HTML and I used BeautifulSoup to parse it, but in most cases (>70%) it chokes. I tried using BeautifulSoup 3.0.8 and 3.2.0 (there were some problems with 3.1.0 upwards), but ...
4
votes
2 answers
4k views
Python HTMLParser: UnicodeDecodeError
I'm using HTMLParser to parse pages I pull down with urllib, and am coming across UnicodeDecodeError exceptions when passing some to HTMLParser.
I tried using chardet to detect the encodings and to ...
3
votes
2 answers
2k views
Using HTMLParser in Python efficiently
In response to Python regular expression I tried to implement an HTML parser using HTMLParser:
import HTMLParser
class ExtractHeadings(HTMLParser.HTMLParser):
def __init__(self):
...
2
votes
4 answers
520 views
Iteratively parsing HTML (with lxml?)
I'm currently trying to iteratively parse a very large HTML document (I know... yuck) to reduce the amount of memory used. The problem I'm having is that I'm getting XML syntax errors such as:
...
2
votes
6 answers
682 views
HTML code processing
I want to process some HTML code and remove the tags as in the example:
"<p><b>This</b> is a very interesting paragraph.</p>" results in "This is a very interesting ...
1
vote
1 answer
348 views
Have HTMLParser differentiate between link-text and other data?
Say I have HTML code similar to this:
<a href="http://example.org/">Stuff I do want</a>
<p>Stuff I don't want</p>
Using HTMLParser's handle_data doesn't differentiate ...
1
vote
2 answers
613 views
Removing HTML tags when crawling Wikipedia with Python's urllib2 and BeautifulSoup
I am trying to crawl Wikipedia to get some data for text mining. I am using Python's urllib2 and BeautifulSoup. My question is: is there an easy way of getting rid of the unnecessary tags (like ...
1
vote
2 answers
2k views
How to make XPath select multiple table elements with identical id attributes?
I'm currently trying to extract information from a badly formatted web page. Specifically, the page has used the same id attribute for multiple table elements. The markup is equivalent to something ...
0
votes
1 answer
83 views
Sending a large amount of texts
So, here's my program. It sends me a text when a new PM appears on a forum I'm on. The problem is that it doesn't send just one, it sends hundreds.
How do I fix this? I'm assuming a break ...
0
votes
1 answer
1k views
How to use Python's HTMLParser to extract specific links
I've been working on a basic web crawler in Python using the HTMLParser class. I fetch my links with a modified handle_starttag method that looks like this:
def handle_starttag(self, tag, attrs):
...
0
votes
2 answers
444 views
Python if-statement based on content of HTML title tag
We're trying to write a Python script to parse HTML with the following conditions:
If the HTML title tag contains the string "Record doesn't exist," then continue running a loop.
If NOT, download ...
0
votes
3 answers
252 views
Processing HTML files in Python
I don't know much about HTML...
How do you remove just text from the page?
For example, if the HTML page reads as:
<meta name="title" content="How can I make money at home online? No gimmacks ...
0
votes
1 answer
411 views
Parse HTML/XML and find locations of elements in original document
Is there a way to get the original location of an element in a document, i.e. the start and end character index, when parsing HTML/XML in Python?
I've looked through the lxml documentation and ...
0
votes
3 answers
780 views
Python HTML parsing
I am currently trying to make a program that, given a word, will look up its definition and return it. Although I have gotten this to work, I had to resort to using a regex to search for the text between ...
0
votes
2 answers
309 views
Python: how to search and correct HTML tags and attributes?
I have to fix the closing of every <img> tag, as shown in the text below. Instead of closing the <img> with a >, it should close with />.
Is there any easy way to search for ...