Frequent 'html-parsing python' Questions

24 votes, 7 answers, 29k views

Parsing HTML in Python

What's my best bet for parsing HTML if I can't use BeautifulSoup or lxml? I've got some code that uses SGMLlib but it's a bit low-level and it's now deprecated. I would prefer if it could stomach a ...

python html-parsing

asked Apr 4 '09 at 18:11

andybak
4,39641839

23 votes, 5 answers, 8k views

Parsing HTML in python - lxml or BeautifulSoup? Which of these is better for what kinds of purposes?

From what I can make out, the two main HTML parsing libraries in Python are lxml and BeautifulSoup. I've chosen BeautifulSoup for a project I'm working on, but I chose it for no particular reason ...

asked Dec 17 '09 at 14:08

Monika Sulik
1,81211636

4 votes, 6 answers, 1k views

How to find/replace text in html while preserving html tags/structure

I use regexps to transform text as I want, but I want to preserve the HTML tags. e.g. if I want to replace "stack overflow" with "stack underflow", this should work as expected: if the input is stack ...

python html html-parsing

asked Dec 6 '09 at 17:44

vbfoobar
212

5 votes, 3 answers, 21k views

How can I use the python HTMLParser library to extract data from a specific div tag?

I am trying to get a value out of a HTML page using the python HTMLParser library. The value I want to get hold of is within this html element: ... <div id="remository">20</div> ... ...

python html parsing html-parsing

asked Jul 18 '10 at 15:06

Martin
3,13432150
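
For the question above, a minimal sketch of one possible approach, assuming Python 3's `html.parser` module (the Python 2 equivalent is the `HTMLParser` module the question mentions). The class name `DivTextExtractor` and the helper `extract_div_text` are hypothetical names chosen for illustration, not part of any library.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text found inside every <div> whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while we are inside a matching <div>
        self.chunks = []    # text fragments seen inside the matching <div>

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            # A nested <div> inside the target: track it so that its
            # closing tag does not end the capture too early.
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_div_text(html, target_id):
    """Hypothetical helper: return the stripped text of the matching <div>."""
    parser = DivTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


if __name__ == "__main__":
    sample = '<p>intro</p><div id="remository">20</div><div>other</div>'
    print(extract_div_text(sample, "remository"))
```

The depth counter is a design choice for pages where the target div contains further divs; for the flat element quoted in the question a simple boolean flag would do.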

1 vote, 4 answers, 4k views

python UnicodeEncodeError > How can I simply remove troubling unicode characters?

Here's what I did: >>> soup = BeautifulSoup (html) >>> soup Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: 'ascii' codec ...

python parsing unicode html-parsing

asked Mar 8 '11 at 18:04

Nullpoet
89331124

0 votes, 0 answers, 38 views

How can I select a value from dropdown menu which is found in a double form tag with Python mechanize?

I'm trying to select a value from a second form tag and submitting: <form action="manager.php" method="GET"> ..... <form action="manager.php" method="post">. ====here i need to select a ...

python forms parsing html-parsing mechanize

asked Apr 8 at 20:03

Mike Thunder
847

10 votes, 6 answers, 3k views

How to parse malformed HTML in python, using standard libraries

There are so many html and xml libraries built into python, that it's hard to believe there's no support for real-world HTML parsing. I've found plenty of great third-party libraries for this task, ...

python html dom parsing html-parsing

asked Apr 20 '10 at 16:29

bukzor
5,99511646

7 votes, 1 answer, 10k views

BeautifulSoup HTML table parsing

I am trying to parse information (html tables) from this site: http://www.511virginia.org/RoadConditions.aspx?j=All&r=1 Currently I am using BeautifulSoup and the code I have looks like this ...

python table beautifulsoup mechanize html-parsing

asked Jan 13 '10 at 18:50

Stephen Tanner
4112

5 votes, 2 answers, 3k views

BeautifulSoup - easy way to obtain HTML-free contents

I'm using this code to find all interesting links in a page: soup.findAll('a', href=re.compile('^notizia.php\?idn=\d+')) And it does its job pretty well. Unfortunately inside that a tag there are ...

python beautifulsoup html-parsing html-content-extraction

asked Nov 17 '09 at 23:38

Andrea Ambu
4,71872850

3 votes, 2 answers, 5k views

Python HTMLParser

I'm parsing an HTML document using HTMLParser and I want to print the contents between the start and end of a p tag. See my code snippet: def handle_starttag(self, tag, attrs): if tag == ...

python html html-parsing

asked Aug 26 '11 at 11:36

Ruth
53321028

2 votes, 2 answers, 2k views

html parser python

I am trying to parse a website. I am using the HTMLParser module. The problem is I want to parse the first <a href=""> after the comment: , but I don't really know how ...

python html-parsing

asked Nov 3 '11 at 23:22

user1010775
9118

2 votes, 1 answer, 225 views

Get text outside one tag and inside another

I am parsing a web page with BeautifulSoup, and it has some elements like the following: <td><font size="2" color="#00009c"><b>Consultant Registration Number ...

python html-parsing beautifulsoup

asked Aug 25 '11 at 16:08

murgatroid99
3,7291930

1 vote, 3 answers, 4k views

How to parse a HTML file with table using Python

I have an HTML file with a table (it's a large one, so only sample code is given). I want to retrieve the values in the tables. I tried the HTMLParser library from Python. I started coding like below. ...

python html parsing html-parsing

asked May 7 '11 at 11:04

user567879
413932

1 vote, 1 answer, 441 views

extracting paragraph in python using lxml

I would like to extract paragraphs in html by python. I used lxml module but it doesn't do exactly what I am looking for. print html.parse(url).xpath('//p')[1].text_content() <span ...

python html-parsing lxml paragraphs

asked Feb 17 '11 at 20:38

user702846
611512

0 votes, 2 answers, 242 views

python extracting HTML tag attributes without regular expressions

Is there any way using urllib, urllib2 or BeautifulSoup to extract HTML tag attributes? For example: <a href="xyz" title="xyz">xyz</a> gets href=xyz, title=xyz There is another ...

python html-parsing beautifulsoup

asked Aug 21 '11 at 21:57

daydreamer
3,873331101
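
As an illustration of the question above, tag attributes can be read without regular expressions by letting a parser hand them over as name/value pairs. This is a minimal sketch that assumes Python 3's standard `html.parser` module rather than the urllib or BeautifulSoup options the question lists; `AttributeCollector` and `collect_attributes` are hypothetical names.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Record the attributes of every start tag named tag_name."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name
        self.found = []  # one dict of attributes per matching tag

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples.
        if tag == self.tag_name:
            self.found.append(dict(attrs))


def collect_attributes(html, tag_name):
    """Hypothetical helper: return a list of attribute dicts for tag_name."""
    parser = AttributeCollector(tag_name)
    parser.feed(html)
    parser.close()
    return parser.found


if __name__ == "__main__":
    print(collect_attributes('<a href="xyz" title="xyz">xyz</a>', "a"))
```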

4 votes, 4 answers, 536 views

What’s the most forgiving HTML parser in Python?

I have some random HTML and I used BeautifulSoup to parse it, but in most of the cases (>70%) it chokes. I tried using Beautiful soup 3.0.8 and 3.2.0 (there were some problems with 3.1.0 upwards), but ...

python html-parsing beautifulsoup lxml pyquery

asked Jul 29 '11 at 8:22

Vaibhav Mishra
875823

4 votes, 2 answers, 4k views

Python HTMLParser: UnicodeDecodeError

I'm using HTMLParser to parse pages I pull down with urllib, and am coming across UnicodeDecodeError exceptions when passing some to HTMLParser. I tried using chardet to detect the encodings and to ...

python character-encoding html-parsing

asked Jan 25 '11 at 4:45

Nona Urbiz
1,00032451

3 votes, 2 answers, 2k views

Using HTMLParser in Python efficiently

In response to Python regular expression I tried to implement an HTML parser using HTMLParser: import HTMLParser class ExtractHeadings(HTMLParser.HTMLParser): def __init__(self): ...

python api html-parsing

asked Nov 15 '10 at 8:11

Roland Illig
12.3k1939

2 votes, 4 answers, 520 views

Iteratively parsing HTML (with lxml?)

I'm currently trying to iteratively parse a very large HTML document (I know.. yuck) to reduce the amount of memory used. The problem I'm having is that I'm getting XML syntax errors such as: ...

python html-parsing lxml iterparse

asked Dec 12 '11 at 16:41

Acorn
11.9k22664

2 votes, 6 answers, 682 views

HTML code processing

I want to process some HTML code and remove the tags as in the example: "<p><b>This</b> is a very interesting paragraph.</p>" results in "This is a very interesting ...

python html-parsing

asked Oct 22 '10 at 15:07

Laurențiu Dascălu
4151618
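
The tag removal described in the question above can be sketched by keeping only the text nodes a parser reports. This assumes Python 3's `html.parser`; the names `TextOnly` and `strip_tags` are hypothetical, and the sketch does not treat `<script>` or `<style>` contents specially.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keep text nodes and discard all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Hypothetical helper: return html with its tags removed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


if __name__ == "__main__":
    print(strip_tags("<p><b>This</b> is a very interesting paragraph.</p>"))
```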

1 vote, 1 answer, 348 views

Have HTMLParser differentiate between link-text and other data?

Say I have html code similar to this: <a href="http://example.org/">Stuff I do want</a> <p>Stuff I don't want</p> Using HTMLParser's handle_data doesn't differentiate ...

python html-parsing

asked Feb 22 '12 at 22:38

purpleladydragons
4351415
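
One common way to separate link text from other data, as the question above asks, is to set a flag while the parser is inside an `<a>` element and check it in `handle_data`. The sketch below assumes Python 3's `html.parser`; `LinkTextCollector` and `link_texts` are hypothetical names, and nested links are not handled.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect text that appears between <a> and </a>."""

    def __init__(self):
        super().__init__()
        self.in_link = False   # True while inside an <a> element
        self.link_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        # handle_data fires for all text; the flag filters out non-link text.
        if self.in_link:
            self.link_text.append(data)


def link_texts(html):
    """Hypothetical helper: return the list of link texts found in html."""
    parser = LinkTextCollector()
    parser.feed(html)
    parser.close()
    return parser.link_text


if __name__ == "__main__":
    sample = ('<a href="http://example.org/">Stuff I do want</a> '
              "<p>Stuff I don't want</p>")
    print(link_texts(sample))
```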

1 vote, 2 answers, 613 views

Removing html tags when crawling wikipedia with python's urllib2 and Beautifulsoup

I am trying to crawl wikipedia to get some data for text mining. I am using python's urllib2 and Beautifulsoup. My question is that: is there an easy way of getting rid of the unnecessary tags(like ...

python html html-parsing beautifulsoup wikipedia

asked Nov 8 '11 at 1:06

LangerHansIslands
309213

1 vote, 2 answers, 2k views

How to make XPath select multiple table elements with identical id attributes?

I'm currently trying to extract information from a badly formatted web page. Specifically, the page has used the same id attribute for multiple table elements. The markup is equivalent to something ...

python xpath html-parsing web-scraping scrapy

asked Oct 25 '11 at 11:15

Edwardr
680314

0 votes, 1 answer, 83 views

Sending a large amount of texts

So, here's my program. What it does is send me a text when a new pm appears on a forum I'm on. The problem is it doesn't send just one, it sends hundreds. How do I fix this? I'm assuming a break ...

python cookies html-parsing

asked Jul 30 '12 at 22:54

user1564081
1

0 votes, 1 answer, 1k views

How to use Python's HTMLParser to extract specific links

I've been working on a basic web crawler in Python using the HTMLParser Class. I fetch my links with a modified handle_starttag method that looks like this: def handle_starttag(self, tag, attrs): ...

python parsing hyperlink crawler html-parsing

asked Mar 14 '12 at 1:37

initWithStyle
727

0 votes, 2 answers, 444 views

Python if-statement based on content of HTML title tag

We're trying to write a Python script to parse HTML with the following conditions: If the HTML title tag contains the string "Record doesn't exist," then continue running a loop. If NOT, download ...

python parsing if-statement html-parsing

asked Feb 17 '12 at 21:43

zhaoy
5014

0 votes, 3 answers, 252 views

Processing HTML files Python

I don't know much about html... How do you remove just text from the page? For example if the html page reads as: <meta name="title" content="How can I make money at home online? No gimmacks ...

python html html-parsing

asked Jan 9 '12 at 2:43

Fraz
1,8451033

0 votes, 1 answer, 411 views

Parse HTML/XML and find locations of elements in original document

Is there a way to get the original location of an element in a document, i.e. the start and end character index, when parsing html/xml in Python? I've looked through the lxml documentation and ...

python xml-parsing html-parsing lxml

asked Nov 24 '11 at 14:21

Acorn
11.9k22664

0 votes, 3 answers, 780 views

Python HTML parsing

I am currently trying to make a program that given a word will look up its definition and return it. Although I have gotten this to work, I had to resort to using RegEx to search for the text between ...

python html python-3.x html-parsing

asked Feb 4 '11 at 6:13

Kironide
2,80031332

0 votes, 2 answers, 309 views

Python how to search and correct html tags and attributes?

I have to fix all the closing tags of the <img> tag as shown in the text below. Instead of closing the <img> with a >, it should close with />. Is there any easy way to search for ...

python html string html-parsing

asked Jul 29 '10 at 9:19

Hoang Pham
3,47543257
