Code Review Stack Exchange is a question and answer site for peer programmer code reviews.
I'm looking for an effective way to parse HTML content in Node.js. The objective is to extract data from inside the HTML, not to manipulate objects. This is a server-side task.

I tried using jsdom and I found two major issues:

  1. Massive memory usage (probably a memory leak)
  2. It won't parse properly if the markup is malformed HTML.

So I'm considering using regex to search inside HTML streams. In the code below I slim down the HTML stream, removing extra spaces and newlines, so the regex will cost less to match:

html = html.replace(/\r?\n|\s{2,}/g,' ');
console.log(html.match(/<my regex>/));
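As a minimal sketch, here is that slimming step applied to a small sample string (the `<td>` pattern is just a placeholder for whatever you would actually match):

```javascript
// Sample HTML with newlines and indentation
var html = '<table id="fooTable">\n  <tr>\n    <td>alpha</td>\n  </tr>\n</table>';

// Collapse newlines and runs of white-space into single spaces
html = html.replace(/\r?\n|\s{2,}/g, ' ');

// Placeholder pattern: pull the text out of the first cell
var match = html.match(/<td>([^<]*)<\/td>/);
console.log(match && match[1]); // 'alpha'
```

Note that the alternation can still leave double spaces behind (a newline followed by indentation is replaced by two separate spaces), so the slimming is not perfect.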

I also thought of putting it in a function that would narrow things down even further by keeping only the part of the HTML that matters, like:

<html> 

<!-- a lot of irrelevant code -->

<table id="fooTable">   </table>

<!-- a lot of irrelevant code -->

</html> 

This would narrow the code down so the regex match costs even less to apply:

var i = html.indexOf('fooTable');
var chunk = html.substring(i);
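A slightly fuller sketch of that narrowing step, also cutting off at the closing tag so the later regex only runs over the table itself (this assumes fooTable occurs exactly once and the table is not nested):

```javascript
var html = '<html><p>irrelevant</p>' +
           '<table id="fooTable"><tr><td>42</td></tr></table>' +
           '<p>more irrelevant</p></html>';

// Start at the id, stop after the first closing tag that follows it
var start = html.indexOf('fooTable');
var end = html.indexOf('</table>', start) + '</table>'.length;
var chunk = html.substring(start, end);

console.log(chunk); // 'fooTable"><tr><td>42</td></tr></table>'
```

As written, the chunk starts in the middle of the id attribute, just as in the original snippet; searching for '<table id="fooTable"' instead would keep the element whole.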

Please have your say.

Would regex be an elegant/effective way to parse large HTML content? Is it CPU-expensive to run a regex on a very large string?

Have you had a look at cheerio? – px06 Nov 18 at 14:19
I wouldn't bail on using a DOM-focused approach because of concerns with one library. – Mike Brant Nov 18 at 14:27

First, you don't parse HTML with RegEx. It's a known fact. Don't even try.

If you meant manipulating HTML as some arbitrary string (ignoring the structure, semantics, rules and all that jazz), that's another thing. RegEx might help you, but not without problems.

Here are the problems you'll potentially be facing:

  1. The precision of your pattern with respect to the HTML spec. HTML is more forgiving than XML: there are quirks that keep markup valid even when it doesn't look valid. Your pattern might not pick up certain cases.

    html-minifier is a good example of a library that knows about (and takes advantage of) quirks in HTML to minify HTML. It has a table that summarizes some of HTML's quirks.

  2. The input you'll be receiving. I'll assume it's arbitrary and/or external (otherwise, you wouldn't be manipulating it this way). A common problem is when the string isn't what you expect it to be. An example is jQuery expecting JSON, but the server responding with the HTML of an HTTP 500 error page. jQuery runs JSON.parse and blows up.
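The second point is easy to reproduce in a couple of lines: hand JSON.parse the body of an HTML error page and it throws (the error markup below is made up):

```javascript
// What the client expects vs. what a failing server might actually send
var expected = '{"ok":true}';
var errorPage = '<html><body><h1>500 Internal Server Error</h1></body></html>';

console.log(JSON.parse(expected).ok); // true

try {
  JSON.parse(errorPage);
} catch (e) {
  console.log('blew up with', e.name); // blew up with SyntaxError
}
```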

Here are some other problems:

html = html.replace(/\r?\n|\s{2,}/g,' ');

This will blow away content that is sensitive to white-space, like the contents of <pre>. It will also blow away any content that intentionally contains runs of white-space, such as text coming from a WYSIWYG editor.
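A quick illustration of that loss with a <pre> block:

```javascript
var html = '<pre>line 1\n    line 2, indented</pre>';

var slimmed = html.replace(/\r?\n|\s{2,}/g, ' ');

// The newline and the indentation are gone for good
console.log(slimmed); // '<pre>line 1  line 2, indented</pre>'
```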

console.log(html.match(/<my regex>/));

As mentioned earlier, this comes down to the accuracy of your pattern.

Thanks. What I meant was to parse the HTML as a very long string, completely ignoring the fact that it has <tags>. I wouldn't be crazy enough to try getElementsByClassName() or the like using regex, not at all. The point here is to treat the HTML as ordinary text and extract from it some text that comes after another text, fooTable. My point is: is it CPU-expensive? Is regex meant to be used on strings of any length, or is it not recommended to run a regex on very large strings? – Azevedo Nov 18 at 16:49
@Azevedo For that, I would run a benchmark instead of relying on pure speculation. – Joseph the Dreamer Nov 18 at 17:00
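A benchmark along those lines can be as small as timing one match over a string of roughly the target size (the sizes and the pattern here are invented for illustration):

```javascript
// Build a few megabytes of synthetic markup with the target near the end
var filler = new Array(100001).join('<div>some irrelevant markup</div>\n');
var big = filler + '<table id="fooTable"><tr><td>payload</td></tr></table>';

var t0 = Date.now();
var m = big.match(/<table id="fooTable">[\s\S]*?<\/table>/);
var t1 = Date.now();

console.log('found:', m !== null, 'in', (t1 - t0) + 'ms');
```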
