Code Review Stack Exchange is a question and answer site for peer programmer code reviews.
I'm looking for an effective way to parse HTML content in Node.js. The objective is to extract data from inside the HTML, not to manipulate objects. This is a server-side task.

I tried using jsdom and I found two major issues:

  1. Massive memory usage (probably a memory leak)
  2. It won't parse properly if the markup is malformed HTML.

So I'm considering using regex to search inside HTML streams. In the code below I slim down the HTML stream, removing extra spaces and newlines, so the regex will cost less to match:

html = html.replace(/\r?\n|\s{2,}/g,' ');
console.log(html.match(/<my regex>/));
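As a minimal sketch, here is that slimming step applied to a small sample string (the `<td>` pattern is just a placeholder for whatever you would actually match):

```javascript
// Sample HTML with newlines and indentation
var html = '<table id="fooTable">\n  <tr>\n    <td>alpha</td>\n  </tr>\n</table>';

// Collapse newlines and runs of white-space into single spaces
html = html.replace(/\r?\n|\s{2,}/g, ' ');

// Placeholder pattern: pull the text out of the first cell
var match = html.match(/<td>([^<]*)<\/td>/);
console.log(match && match[1]); // 'alpha'
```

Note that the alternation can still leave double spaces behind (a newline followed by indentation is replaced by two separate spaces), so the slimming is not perfect.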

I also thought of putting it in a function that would narrow things down even further by keeping only the part of the HTML that matters, like:

<html> 

<!-- a lot of irrelevant code -->

<table id="fooTable">   </table>

<!-- a lot of irrelevant code -->

</html> 

This would narrow the code down so the regex match costs even less to apply:

var i = html.indexOf('fooTable');
var chunk = html.substring(i);
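A slightly fuller sketch of that narrowing step, also cutting off at the closing tag so the later regex only runs over the table itself (this assumes fooTable occurs exactly once and the table is not nested):

```javascript
var html = '<html><p>irrelevant</p>' +
           '<table id="fooTable"><tr><td>42</td></tr></table>' +
           '<p>more irrelevant</p></html>';

// Start at the id, stop after the first closing tag that follows it
var start = html.indexOf('fooTable');
var end = html.indexOf('</table>', start) + '</table>'.length;
var chunk = html.substring(start, end);

console.log(chunk); // 'fooTable"><tr><td>42</td></tr></table>'
```

As written, the chunk starts in the middle of the id attribute, just as in the original snippet; searching for '<table id="fooTable"' instead would keep the element whole.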

Please have your say.

Would regex be an elegant/effective way to parse large HTML content? Is it CPU-expensive to run a regex on a very large string?

Have you had a look at cheerio? – px06 Nov 18 at 14:19
I wouldn't bail on using a DOM-focused approach because of concerns with one library. – Mike Brant Nov 18 at 14:27

First, you don't parse HTML with RegEx. It's a known fact. Don't even try.

If you meant manipulating HTML as some arbitrary string (ignoring the structure, semantics, rules and all that jazz), that's another thing. RegEx might help you, but not without problems.

Here are the problems you'll potentially be facing:

  1. The precision of your pattern with respect to the HTML spec. HTML is more forgiving than XML: there are quirks that keep markup valid even when it doesn't look valid. Your pattern might not pick up certain cases.

    html-minifier is a good example of a library that knows about (and takes advantage of) quirks in HTML to minify HTML. It has a table that summarizes some of HTML's quirks.

  2. The input you'll be receiving. I'll assume it's arbitrary and/or external (otherwise, you wouldn't be manipulating it this way). A common problem is when the string isn't what you expect it to be. An example is jQuery expecting JSON, but the server responding with the HTML of an HTTP 500 error page. jQuery runs JSON.parse and blows up.
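The second point is easy to reproduce in a couple of lines: hand JSON.parse the body of an HTML error page and it throws (the error markup below is made up):

```javascript
// What the client expects vs. what a failing server might actually send
var expected = '{"ok":true}';
var errorPage = '<html><body><h1>500 Internal Server Error</h1></body></html>';

console.log(JSON.parse(expected).ok); // true

try {
  JSON.parse(errorPage);
} catch (e) {
  console.log('blew up with', e.name); // blew up with SyntaxError
}
```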

Here are some other problems:

html = html.replace(/\r?\n|\s{2,}/g,' ');

This will blow away content that is sensitive to white-space, like the contents of <pre>. It will also blow away any content that intentionally contains runs of white-space, such as text coming from a WYSIWYG editor.
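A quick illustration of that loss with a <pre> block:

```javascript
var html = '<pre>line 1\n    line 2, indented</pre>';

var slimmed = html.replace(/\r?\n|\s{2,}/g, ' ');

// The newline and the indentation are gone for good
console.log(slimmed); // '<pre>line 1  line 2, indented</pre>'
```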

console.log(html.match(/<my regex>/));

As mentioned earlier, this comes down to the accuracy of your pattern.

Thanks. What I meant was to parse the HTML as a very long string, completely ignoring the fact that it has <tags>. I wouldn't be crazy enough to try getElementsByClassName() or the like using regex, not at all. The point here is to treat the HTML as ordinary text and extract from it some text that comes after another text, fooTable. My point is: is it CPU-expensive? Is regex meant to be used on strings of any length, or is it not recommended to run a regex on very large strings? – Azevedo Nov 18 at 16:49
@Azevedo For that, I would run a benchmark instead of relying on pure speculation. – Joseph the Dreamer Nov 18 at 17:00
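A benchmark along those lines can be as small as timing one match over a string of roughly the target size (the sizes and the pattern here are invented for illustration):

```javascript
// Build a few megabytes of synthetic markup with the target near the end
var filler = new Array(100001).join('<div>some irrelevant markup</div>\n');
var big = filler + '<table id="fooTable"><tr><td>payload</td></tr></table>';

var t0 = Date.now();
var m = big.match(/<table id="fooTable">[\s\S]*?<\/table>/);
var t1 = Date.now();

console.log('found:', m !== null, 'in', (t1 - t0) + 'ms');
```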
