I'm looking for a library or method to parse an HTML file, with more HTML-specific features than generic XML parsing libraries.
locked by Kev♦ Nov 15 '11 at 17:09
This question exists because it has historical significance, but it is not considered a good, on-topic question for this site, so please do not use it as evidence that you can ask similar questions here. This question and its answers are frozen and cannot be changed. More info: FAQ.
closed as not constructive by Kev♦ Nov 15 '11 at 17:09
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or specific expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, see the FAQ for guidance.
I used the HTMLAgilityPack on a project for a previous employer and it was pretty effective. It wasn't foolproof, but it did handle most of the malformed tags, etc. that you find on the web these days.
You could use TidyNet.Tidy to convert the HTML to XHTML, and then use an XML parser. Another alternative would be to use the built-in engine, mshtml.
This allows you to use JavaScript-like functions such as getElementById().
I found a project called Fizzler that takes a jQuery/Sizzle approach to selecting HTML elements. It's based on HTML Agility Pack. It's currently in beta and only supports a subset of CSS selectors, but it's pretty damn cool and refreshing to use CSS selectors over nasty XPath.
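For a rough idea of what the CSS-selector style looks like, here is a minimal sketch. It assumes the HtmlAgilityPack and Fizzler NuGet packages (the latter providing the `Fizzler.Systems.HtmlAgilityPack` namespace); the markup, the `external` class name, and the `ExternalLinks` helper are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;                 // assumed NuGet package "HtmlAgilityPack"
using Fizzler.Systems.HtmlAgilityPack; // assumed NuGet package providing QuerySelectorAll

static class FizzlerSketch
{
    // Hypothetical helper: returns the href of every <a class="external"> element.
    public static List<string> ExternalLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // CSS selector instead of XPath: anchors carrying the "external" class.
        return doc.DocumentNode
                  .QuerySelectorAll("a.external")
                  .Select(a => a.GetAttributeValue("href", string.Empty))
                  .ToList();
    }

    static void Main()
    {
        // Illustrative markup only.
        string html =
            "<html><body>" +
            "<a class='external' href='http://example.com/'>Out</a>" +
            "<a href='/local'>In</a>" +
            "</body></html>";

        foreach (string href in ExternalLinks(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

The equivalent XPath would be something like `//a[contains(concat(' ', @class, ' '), ' external ')]`, which is where the selector syntax earns its keep.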
You can do a lot without going nuts on 3rd-party products and mshtml (i.e. interop). Use the System.Windows.Forms.WebBrowser. From there, you can do such things as "GetElementById" on an HtmlDocument or "GetElementsByTagName" on HtmlElements. If you want to actually interface with the browser (simulate button clicks, for example), you can use a little reflection (imo a lesser evil than interop) to do it:
... tell the browser to navigate (tangential to this question). Then on the Document_Completed event you can simulate clicks like this.
You can do similar reflection stuff to submit forms, etc. Enjoy.
I've written some code that provides "LINQ to HTML" functionality, and I thought I would share it here. It is based on Majestic-12: it takes the Majestic-12 results and produces LINQ to XML elements. At that point you can use all your LINQ to XML tools against the HTML. As an example:
I wanted to use Majestic-12 because I know it has a lot of built-in knowledge with regard to HTML that is found in the wild. What I've found, though, is that mapping the Majestic-12 results to something that LINQ will accept as XML requires additional work. The code I'm including does a lot of this cleansing, but as you use it you will find pages that are rejected, and you'll need to fix up the code to address that. When an exception is thrown, check exception.Data["source"], as it is likely set to the HTML tag that caused the exception. Handling the HTML in a nice manner is at times not trivial... So now that expectations are realistically low, here's the code :)
The Html Agility Pack has been mentioned before - if you are going for speed, you might also want to check out the Majestic-12 HTML parser. Its handling is rather clunky, but it delivers a really fast parsing experience.
I'm not sure about "best" but I'd start here: This will probably give you what you need.
I think @Erlend's use of
Take a look at Chris Lovett's SGML Reader inside DasBlog. It'll turn HTML into an XML document and let you get the elements that way.
Look for the HtmlAgilityPack. It's an open-source library that parses HTML and will fix errors (e.g. unclosed tags). Once loaded, you can use XPath via the XPathNavigator class to select the specific content you desire. Additionally, it can convert HTML to well-formed X(HT)ML. At work, this library is a VERY critical part of our software. We run hundreds of thousands of ugly third-party HTML documents and XML feeds through it daily, to ensure they're well-formed before we attempt to parse data out of them.
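To give a feel for the load-then-XPath workflow, here is a minimal sketch, assuming the HtmlAgilityPack NuGet package. It uses the `SelectNodes` convenience method on the document node rather than an explicit XPathNavigator; the sample markup and the `ExtractHrefs` helper are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumed NuGet package "HtmlAgilityPack"

static class AgilityPackSketch
{
    // Hypothetical helper: collects the href of every <a> element that has one.
    public static List<string> ExtractHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of sloppy markup; does not throw on unclosed tags

        var hrefs = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                hrefs.Add(link.GetAttributeValue("href", string.Empty));
            }
        }
        return hrefs;
    }

    static void Main()
    {
        // Illustrative input: the <li> elements are never closed.
        string html = "<html><body><ul><li><a href='/a'>A</a><li><a href='/b'>B</a></ul></body></html>";

        foreach (string href in ExtractHrefs(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

The null check matters in practice: forgetting it is a common source of NullReferenceException when a page has no matching nodes.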
Has anyone been using Fizzler? I just found out about it recently. It uses HtmlAgilityPack and supports jQuery-style selectors. Trust me, if you are familiar with jQuery, you won't look for another parser! I think I first read about it here: Looking for C# HTML parser
No 3rd-party library: a WebBrowser class solution that can run in a console application and in ASP.NET.
Usage:
The trouble with parsing HTML is that it isn't an exact science. If it were XHTML that you were parsing, then things would be a lot easier (as you mention, you could use a general XML parser). Because HTML isn't necessarily well-formed XML, you will run into lots of problems trying to parse it. It almost needs to be done on a site-by-site basis.
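A small sketch of the well-formedness problem, using only the .NET base class library (`System.Xml.Linq`). The two strings are made-up samples and `ParsesAsXml` is a hypothetical helper; the point is just that ordinary HTML with an unclosed `<br>` is rejected by an XML parser, while the XHTML spelling is accepted.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

static class WellFormednessSketch
{
    // Hypothetical helper: true when the markup parses as XML, false when the parser rejects it.
    public static bool ParsesAsXml(string markup)
    {
        try
        {
            XDocument.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    static void Main()
    {
        // Ordinary HTML: <br> is never closed, so this is not well-formed XML.
        string html = "<html><body><p>one<br>two</p></body></html>";

        // The XHTML spelling of the same content closes every element.
        string xhtml = "<html><body><p>one<br/>two</p></body></html>";

        Console.WriteLine(ParsesAsXml(html));  // False
        Console.WriteLine(ParsesAsXml(xhtml)); // True
    }
}
```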
I've used ZetaHtmlTidy in the past to load random websites and then hit against various parts of the content with XPath (e.g. /html/body//p[@class='textblock']). It worked well, but there were some exceptional sites that it had problems with, so I don't know if it's the absolute best solution.
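Once the page has been tidied into well-formed XHTML, the XPath part can be sketched with nothing but the base class library. This is not ZetaHtmlTidy's API; it assumes the tidy step has already produced the string, and the sample markup and `TextBlocks` helper are made up. It also assumes the document declares no XML namespace; if the tidied output carries the XHTML namespace, the query needs prefixes and an XmlNamespaceManager.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

static class XPathSketch
{
    // Hypothetical helper: returns the text of every <p class="textblock"> under <body>.
    public static List<string> TextBlocks(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml);
        return doc.XPathSelectElements("/html/body//p[@class='textblock']")
                  .Select(p => p.Value)
                  .ToList();
    }

    static void Main()
    {
        // Illustrative, already well-formed input with no namespace declaration.
        string xhtml =
            "<html><body>" +
            "<div><p class='textblock'>First</p></div>" +
            "<p class='other'>Skipped</p>" +
            "<p class='textblock'>Second</p>" +
            "</body></html>";

        foreach (string text in TextBlocks(xhtml))
        {
            Console.WriteLine(text); // prints "First" then "Second"
        }
    }
}
```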
You could use an HTML DTD and the generic XML parsing libraries.
Use WatiN if you need to see the impact of JS on the page [and you're prepared to start a browser]
Depending on your needs you might go for the more feature-rich libraries. I tried most/all of the solutions suggested, but what stood out head & shoulders was Html Agility Pack. It is a very forgiving and flexible parser.
Try this script. http://www.biterscripting.com/SS_URLs.html When I use it with this URL,
it shows me all the links on the page for this thread.
You can modify that script to check for images, variables, whatever.
I wrote some classes for parsing HTML tags in C#. They are nice and simple if they meet your particular needs. You can read an article about them and download the source code at http://www.blackbeltcoder.com/Articles/strings/parsing-html-tags-in-c. There's also an article about a generic parsing helper class at http://www.blackbeltcoder.com/Articles/strings/a-text-parsing-helper-class.