
I'm working on a page that needs to fetch information from some other pages and then display parts of that data on the current page.

I have the HTML source code that I need to parse in a string, and I'm looking for a library that can help me do this easily. (I just need to extract specific tags and the text they contain.) The HTML is well formed (all closing tags are present).

I've looked at some options, but they have all been extremely difficult to work with for various reasons.

I've tried the following solutions:

  1. jkl-parsexml library (The library js file itself throws up HTTPError 101)
  2. jQuery.parseXML Utility (Didn't find much documentation/many examples to figure out what to do)
  3. XPATH (The Execute statement is not working but the JS Error Console shows no errors)

So I'm looking for a more user-friendly library, or anything (tutorials, books, references, documentation) that would let me use the aforementioned tools more easily and efficiently.

An ideal solution would be something like BeautifulSoup, which is available in Python.

5  
You could add it to the DOM, hide it, then access your elements with plain js or jQuery. That's actually letting the browser parse it for you, and using js to traverse the DOM. – bfavaretto Sep 11 '12 at 22:53
    
The HTML I have is heavily nested (10-12 levels deep) and lacks class, name and id attributes; i.e. getElementById and similar functions are rendered effectively useless. So recovering the required data would be a real bother that way. – ffledgling Sep 11 '12 at 22:56
    
And how would a custom parser address that? – bfavaretto Sep 11 '12 at 22:58
2  
Hm. Take a look at jQuery selectors. They should be powerful enough. Something like "div p span" will find all spans located inside a p that is itself inside a div. "div>p>span" does the same, except that the p must be a direct child of the div and the span a direct child of that p. There are a lot of other helpful selectors/functions in jQuery. – Viktor S. Sep 11 '12 at 23:00
    
@bfavaretto I can't say for sure that a custom parser will make the job easier, but this was the first approach I tried and it was extremely time consuming. I was hoping that the parser would give me nested dictionaries which I could loop through more easily. – ffledgling Sep 11 '12 at 23:03
up vote 4 down vote accepted

Using jQuery, it would be as simple as $(HTMLstring); to create a jQuery object with the HTML data from the string inside it (this DOM would be disconnected from your document). From there it's very easy to do whatever you want with it--and traversing the loaded data is, of course, a cinch with jQuery.

    
I'm not sure if this is a problem with my code or the HTML itself but I get "Error: Invalid XML" when I try this. Here is the code I used ` htmlDoc = $.parseXML(pagetext);$html = $( htmldoc );$html.find("body");` – ffledgling Sep 11 '12 at 23:07
    
@Ayos: I would guess it's because you're trying to pass something into .parseXML that is invalid XML. What's the contents of pagetext? – Elliot Bonneville Sep 11 '12 at 23:09
    
The page contains HTML with CSS in the head and Javascript within the <script> tags. It's basically the entire source code of a website obtained via XHR's responseText. – ffledgling Sep 11 '12 at 23:13
    
Try var $html = $(pagetext) directly, then. – Elliot Bonneville Sep 11 '12 at 23:13
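
The "Invalid XML" error in the comments above is what $.parseXML reports when its input is not well-formed XML, and a full page taken from responseText (void tags, inline scripts, and so on) usually is not. As a rough sketch of an alternative to the jQuery call, and not what the answer itself uses, the browser's own HTML parser can be invoked through DOMParser. The function names parsePage and extractTexts are made up for illustration, and the sketch assumes a current browser where DOMParser accepts 'text/html'.

```javascript
// Parse a full HTML page with the browser's HTML parser (not an XML parser).
// Browser-only: DOMParser is not available in plain Node.js.
function parsePage(pagetext) {
  return new DOMParser().parseFromString(pagetext, 'text/html');
}

// Collect the trimmed text of every element matching a CSS selector.
// `root` can be a Document or an Element (anything with querySelectorAll).
function extractTexts(root, selector) {
  return Array.prototype.map.call(root.querySelectorAll(selector), function (el) {
    return el.textContent.trim();
  });
}
```

With pagetext holding the fetched source, extractTexts(parsePage(pagetext), 'div > p > span') would return an array of strings. Because the parsed document is never attached to the page, the selectors only see the fetched markup.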

You can do something like this:

$("string with html here").find("jquery selector")

$("string with html here") parses your HTML and creates a document fragment that holds the result. find then searches for elements inside that fragment, and only inside it. The parsed HTML is not added to the page DOM.
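
Once the string has been parsed into detached nodes like this, the nodes can also be walked directly. The asker mentioned hoping for nested dictionaries to loop over; the following is a minimal sketch of that idea, not part of jQuery. The function name toNested and the { tag, text, children } shape are assumptions chosen for illustration.

```javascript
// Convert a DOM node into nested plain objects: { tag, text, children }.
// Works on any object exposing nodeType, nodeName, nodeValue and childNodes.
// The output shape is illustrative, not a standard format.
function toNested(node) {
  var ELEMENT_NODE = 1, TEXT_NODE = 3;
  var result = { tag: node.nodeName.toLowerCase(), text: '', children: [] };
  for (var i = 0; i < node.childNodes.length; i++) {
    var child = node.childNodes[i];
    if (child.nodeType === TEXT_NODE) {
      result.text += child.nodeValue;         // direct text of this element only
    } else if (child.nodeType === ELEMENT_NODE) {
      result.children.push(toNested(child));  // recurse into nested elements
    }
    // Comments and other node types are skipped.
  }
  result.text = result.text.trim();
  return result;
}
```

In a browser with jQuery 1.8 or later, a call such as toNested($('<div>').append($.parseHTML(pagetext))[0]) should give one object tree for the whole string. Since the result is plain data, it can be looped over or serialized with JSON.stringify without touching the DOM again.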
