I'm looking for an effective way to parse HTML content in Node.js. The goal is to extract data from the HTML, not to manipulate DOM objects. This runs on the server side.
I tried using jsdom and I found two major issues:
- Massive memory usage (probably a memory leak)
- It does not parse malformed HTML properly.
So I'm considering using regex to search inside the HTML. In the code below I slim down the HTML by collapsing extra spaces and newlines so the regex has less text to scan:
// Collapse every run of whitespace (newlines included) into a single space
html = html.replace(/\s+/g, ' ');
console.log(html.match(/<my regex>/));
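To make the idea concrete, here is a minimal sketch of what I have in mind. The sample markup and the `<td>` pattern are made up for illustration and stand in for my real regex; the pattern assumes the cells contain no nested `<td>` elements. I use `\s+` here because it also collapses a newline followed by indentation into one space.

```javascript
// Made-up sample input; in practice this would be the downloaded page.
const sample = '<table id="fooTable">\n  <tr>\n    <td>one</td>\n    <td> two </td>\n  </tr>\n</table>';

// Collapse every run of whitespace (newlines included) into a single space.
const slim = sample.replace(/\s+/g, ' ');

// Illustrative pattern only: capture the text between <td ...> and </td>.
const cellPattern = /<td[^>]*>(.*?)<\/td>/g;

// Collect the first capture group of every match, trimmed.
const cells = [];
for (const m of slim.matchAll(cellPattern)) {
  cells.push(m[1].trim());
}

console.log(cells); // [ 'one', 'two' ]
```

Note that `String.prototype.matchAll` needs a reasonably recent Node.js; on older versions a `while` loop over `cellPattern.exec(slim)` should do the same job.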
I also thought of putting it in a function that narrows the input down even further by keeping only the part of the HTML that matters, for example:
<html>
<!-- a lot of irrelevant code -->
<table id="fooTable"> </table>
<!-- a lot of irrelevant code -->
</html>
Cutting the input down like this would make the regex match even cheaper:
var i = html.indexOf('fooTable');
var chunk = html.substring(i);
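A slightly fuller sketch of that narrowing step might look like the following. `sliceTable` is a hypothetical helper name of my own, and the code assumes the id is written as a double-quoted attribute and that the table contains no nested tables. It also handles the case where `indexOf` returns -1, which the two lines above do not.

```javascript
// Hypothetical helper: return the markup of the <table> whose id attribute
// equals `id`, or null when it cannot be located.
function sliceTable(html, id) {
  // Find the id attribute itself (assumes double quotes around the value).
  const marker = html.indexOf('id="' + id + '"');
  if (marker === -1) return null; // indexOf returns -1 when nothing matches

  // Back up to the opening <table that precedes the id attribute.
  const start = html.lastIndexOf('<table', marker);
  // Take the first closing tag after the id (assumes no nested tables).
  const end = html.indexOf('</table>', marker);
  if (start === -1 || end === -1) return null;

  return html.substring(start, end + '</table>'.length);
}

const page = '<html><p>noise</p><table id="fooTable"><tr><td>x</td></tr></table><p>more noise</p></html>';
console.log(sliceTable(page, 'fooTable'));
// <table id="fooTable"><tr><td>x</td></tr></table>
```

The regex would then only have to run over the returned chunk instead of the whole page.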
What do you think?
Would regex be an elegant and effective way to parse large HTML content? Is it CPU-expensive to run a regex on a very large string?