Convert recursive PHP regex to JavaScript

Question

I need help to replicate this PHP regex in JavaScript:

#\<code>((?:[^<]|\<(?!/?code>)|(?R))+)\</code>#

It strips all tags except those inside a code tag.

This regex certainly does not do what you say it does. [code] matches a single letter, either c, o, d, or e. At least the brackets would have to be escaped.
– Tim Pietzcker, Commented Nov 25, 2011 at 21:29
Instead, why don't you define the parameters of the regex you want and show us what you tried? Then we'll be able to help you.
– Alex Turpin, Commented Nov 25, 2011 at 21:35

Tim Pietzcker · Accepted Answer · 2011-11-26 12:01:11Z

It's not possible.

You can't translate this regex to the JavaScript flavor because it uses recursion (?R) which the JavaScript regex engine does not support.
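
As a quick check, the sketch below tries to compile the PHP pattern with JavaScript's RegExp constructor. The variable names are illustrative, and the exact error message depends on the engine; the point is only that the recursion group is rejected when the pattern is compiled.

```javascript
// The PHP pattern from the question, without the # delimiters.
var phpPattern = '<code>((?:[^<]|<(?!/?code>)|(?R))+)</code>';

var compileError = null;
try {
    new RegExp(phpPattern);      // (?R) is not valid JavaScript regex syntax
} catch (e) {
    compileError = e;            // the engine reports an invalid group
}

// Logs true when the engine refuses to compile the pattern.
console.log(compileError instanceof SyntaxError);
```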

I would suggest a different approach. I'm assuming that you want to remove everything within angle brackets, including the brackets themselves, unless those brackets are found within a <code>...</code> block. Right? Well, the best thing a JavaScript regex (which, at the time of writing, does not even support lookbehind assertions) can do for you would be this:

result = subject.replace(/<(?!\/code)[^<>]*>\s*(?!(?:(?!<code>)[\s\S])*<\/code>)/g, "");

What this does (unfortunately, JavaScript doesn't support verbose regexes either, and this regex is hard to wrap your head around):

<             # Match a <
(?!/code)     # (unless it's part of a </code> tag)
[^<>]*        # and any number of non-bracket characters
>             # followed by >
\s*           # and any whitespace.
(?!           # Assert that we can't match the following here:
 (?:          # The following expression:
  (?!         # Unless we are right before a
   <code>     # <code> tag
  )           # Then match
  [\s\S]      # any character
 )*           # any number of times
 </code>      # until the next </code> tag
)             # End of lookahead assertion

This ensures that we only match a tag if the next code tag that follows is an opening <code> tag, not a closing </code> tag (or if no code tag follows at all).

So it transforms

This <b> is bold </b> text, 
but we want <code> these <i> tags <b> here </b> to remain </i> </code> 
while those <b> can be deleted</b>.

into

This is bold text, 
but we want <code> these <i> tags <b> here </b> to remain </i> </code> 
while those can be deleted.
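
The transformation above can be reproduced with a short script. This is a minimal sketch: the regex is the one from the answer, while the name tagOutsideCode and the way the sample text is assembled are illustrative choices.

```javascript
// Regex from the answer: match a tag (other than </code>) plus trailing
// whitespace, unless a </code> follows before the next <code>.
var tagOutsideCode = /<(?!\/code)[^<>]*>\s*(?!(?:(?!<code>)[\s\S])*<\/code>)/g;

var subject = [
    'This <b> is bold </b> text,',
    'but we want <code> these <i> tags <b> here </b> to remain </i> </code>',
    'while those <b> can be deleted</b>.'
].join('\n');

var result = subject.replace(tagOutsideCode, '');
console.log(result);
// This is bold text,
// but we want <code> these <i> tags <b> here </b> to remain </i> </code>
// while those can be deleted.
```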

If you want to remove the code tags themselves, too, you can use

result = subject.replace(/<[^<>]*>\s*(?!(?:(?!<code>)[\s\S])*<\/code>)|<code>\s*/g, "");

which will give the result below (the \s* after the closing </code> also consumes the line break that followed it)

This is bold text, 
but we want these <i> tags <b> here </b> to remain </i> while those can be deleted.

Neither of these regexes works if code tags can be nested, though.
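
To see how nesting breaks the approach, here is a small sketch using the first regex on a made-up input with one code block inside another. The lookahead only scans as far as the next opening <code>, so anything between the outer and the inner opening tag looks as if it were outside a code block.

```javascript
// First regex from the answer, unchanged.
var tagOutsideCode = /<(?!\/code)[^<>]*>\s*(?!(?:(?!<code>)[\s\S])*<\/code>)/g;

// Illustrative input: an <i> element and a nested code block inside <code>.
var nested = '<code> <i>x</i> <code>y</code> </code>';

var broken = nested.replace(tagOutsideCode, '');
console.log(broken);
// The outer <code> and both <i> tags are removed although they belong to a
// code block, and an unmatched </code> is left at the end.
```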

The regex was wrong; I updated it. Is it possible to implement recursion (?R) in JavaScript some other way?
– seb34, Commented Nov 25, 2011 at 21:37
You can't do recursion, but if you're willing to accept the limitations of my solution, that's the next best solution you get using regexes in JavaScript alone. A parser would be a lot more bulletproof, though.
– Tim Pietzcker, Commented Nov 25, 2011 at 22:01
1. Is it possible to strip the <code></code> tags and just keep the HTML inside? 2. If other <code></code> tags are found inside a <code>...</code> block, could those be kept rather than deleted?
– seb34, Commented Nov 25, 2011 at 22:34
I'm not entirely sure what you mean by 1. Do you want to keep the behavior that the regex has now and additionally remove code tags? About 2., that's where this regex falls down. Nested code tags will cause failure because nesting is something you do need recursion for. So if you can't guarantee that code tags will never be nested, you can't use a regex in JavaScript.
– Tim Pietzcker, Commented Nov 25, 2011 at 22:37
An example to illustrate: <pre> This is bold text, but we want <code> these tags here to remain and <code>inner code tags</code> too</code>, while others are deleted. </pre> should become <pre> This is bold text, but we want these tags here to remain and <code>inner code tags</code> too, while others are deleted. </pre> Thanks.
– seb34, Commented Nov 25, 2011 at 22:40
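
For nested code tags like the ones in the example above, one option that does not need regex recursion is a small scanner that tracks nesting depth with a counter. The sketch below is only an illustration: stripTagsOutsideCode is a hypothetical helper, not part of either answer. It assumes tags contain no < or > inside attribute values, it keeps the outermost code tags, and unlike the regex above it leaves the whitespace after removed tags in place.

```javascript
// Hypothetical helper: remove tags that are not inside any <code> block,
// counting nesting depth instead of relying on regex recursion.
function stripTagsOutsideCode(source) {
    var tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^<>]*>/g;
    var depth = 0;      // number of currently open <code> elements
    var out = '';
    var last = 0;       // end of the previous tag
    var m;

    while ((m = tagPattern.exec(source)) !== null) {
        out += source.slice(last, m.index);   // text between tags is always kept
        last = tagPattern.lastIndex;

        var isClosing = m[1] === '/';
        var isCode = m[2].toLowerCase() === 'code';

        if (isCode && !isClosing) {
            depth++;
            out += m[0];
        } else if (isCode && isClosing) {
            if (depth > 0) {                  // a stray </code> at depth 0 is dropped
                out += m[0];
                depth--;
            }
        } else if (depth > 0) {
            out += m[0];                      // any other tag inside a code block is kept
        }
        // any other tag outside a code block is dropped
    }
    return out + source.slice(last);
}

var nestedSample = 'a <b>bold</b> <code>x <i>y</i> <code>z</code> w</code> <u>c</u>';
console.log(stripTagsOutsideCode(nestedSample));
// a bold <code>x <i>y</i> <code>z</code> w</code> c
```

Dropping the outermost code tags as well, as asked in the comments, would mean emitting the opening tag only when depth is already above zero and the closing tag only when depth stays above zero after the decrement.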

Weston C · Accepted Answer · 2011-11-26 07:57:33Z

If you want to do this in JavaScript, my guess is that you're probably working in an environment where you've already got a full-fledged set of HTML parsing and traversal tools at hand -- the browser DOM.

If that is indeed the case, the generally good advice that regular expressions aren't the ideal tool for working with markup applies doubly here, and you might consider doing something else instead.

Getting a given piece of markup into a form where you can manipulate it using the DOM interface is pretty straightforward:

var working = document.createElement('div');  //create a new empty element
working.innerHTML = sourceToSanitize;         //put your HTML source inside it
var sanitized = sanitize(working);            //call sanitization function!

Now you'd just need a sanitize function you can call on that element that will traverse every node in the DOM tree within it, and give you back a fragment of transformed HTML.

Something like this might work:

function sanitize(emt) {
    if (emt.nodeType === 3)    // terminal cond #1: text nodes, re-escaped so that
        return escapeText(emt.nodeValue); // decoded entities don't turn into markup
    if (emt.nodeType !== 1)    // terminal cond #2: non-text/element nodes yield nothing
        return '';
    if (emt.tagName.toLowerCase() === 'code') // #3: code tags returned untouched
        return outerHTML(emt);

                               // recurse over all child nodes
    var schf = [], // *S*anitized *C*hild *H*TML *F*ragments
        children = emt.childNodes;
    for (var i = 0, z = children.length; i < z; i++)
        schf.push(sanitize(children[i]));
    return schf.join('');      // smoosh results together and serve fresh!
}

// Added helper: text node values are already entity-decoded, so escape them again.
function escapeText(text) {
    return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

function outerHTML(emt) {
    if(emt.outerHTML) return emt.outerHTML;
    var tmp = document.createElement('div');
    tmp.appendChild(emt.cloneNode(true));
    return tmp.innerHTML;
}
