I need help to replicate this PHP regex in JavaScript:
#\<code>((?:[^<]|\<(?!/?code>)|(?R))+)\</code>#
It strips all tags except those inside a <code> tag.
It's not possible.
You can't translate this regex to the JavaScript flavor because it uses recursion, (?R), which the JavaScript regex engine does not support.
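A quick way to see this for yourself is to try compiling the pattern. This is only a small sketch; the helper name compilesInJavaScript is made up for illustration, and the pattern is the one from the question with the PHP delimiters and redundant backslashes removed.

```javascript
// Hypothetical helper: reports whether a pattern source compiles as a JavaScript regex.
function compilesInJavaScript(source) {
  try {
    new RegExp(source);
    return true;
  } catch (e) {
    return false; // the RegExp constructor throws a SyntaxError on unsupported syntax
  }
}

// The pattern from the question, including the (?R) recursion construct.
console.log(compilesInJavaScript("<code>((?:[^<]|<(?!/?code>)|(?R))+)</code>"));
// The same idea without recursion compiles fine.
console.log(compilesInJavaScript("<code>[^<]*</code>"));
```

The first call reports a failure because (?R) is rejected as an invalid group; the second compiles.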
I would suggest a different approach. I'm assuming that you want to remove everything within angle brackets, including the surrounding brackets, unless those brackets are found within a <code>...</code> block. Right? Well, the best thing a JavaScript regex (which, until ES2018, did not even support lookbehind assertions) can do for you would be this:
result = subject.replace(/<(?!\/code)[^<>]*>\s*(?!(?:(?!<code>)[\s\S])*<\/code>)/g, "");
What this does (unfortunately, JavaScript doesn't support verbose regexes either, and this regex is hard to wrap your head around...):
<              # Match a <
(?!/code)      # (unless it's part of a </code> tag)
[^<>]*         # and any number of non-bracket characters
>              # followed by >
\s*            # and any whitespace.
(?!            # Assert that we can't match the following here:
 (?:           # The following expression:
  (?!          # Unless we are right before a
   <code>      # <code> tag
  )            # Then match
  [\s\S]       # any character
 )*            # any number of times
 </code>       # until the next </code> tag
)              # End of lookahead assertion
This ensures that we only match a tag if the next <code>/</code> tag that follows is an opening <code> tag, not a closing </code> tag (or if no such tag follows at all).
So it transforms
This <b> is bold </b> text,
but we want <code> these <i> tags <b> here </b> to remain </i> </code>
while those <b> can be deleted</b>.
into
This is bold text,
but we want <code> these <i> tags <b> here </b> to remain </i> </code>
while those can be deleted.
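To check this against the sample, the replacement can be wrapped in a small function. The name stripTagsOutsideCode is made up for illustration; the regex is the one given above, unchanged.

```javascript
// Hypothetical wrapper name; the regex itself is the one from the answer.
function stripTagsOutsideCode(subject) {
  return subject.replace(/<(?!\/code)[^<>]*>\s*(?!(?:(?!<code>)[\s\S])*<\/code>)/g, "");
}

const sample =
  "This <b> is bold </b> text,\n" +
  "but we want <code> these <i> tags <b> here </b> to remain </i> </code>\n" +
  "while those <b> can be deleted</b>.";

// Prints the three-line result shown above.
console.log(stripTagsOutsideCode(sample));
```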
If you want to remove the <code> tags themselves, too, you can use
result = subject.replace(/<[^<>]*>\s*(?!(?:(?!<code>)[\s\S])*<\/code>)|<code>\s*/g, "");
which will give the result
This is bold text,
but we want these <i> tags <b> here </b> to remain </i> while those can be deleted.
(Note that the \s* after the closing </code> tag also consumes the line break that followed it.)
None of these regexes work if <code> tags can be nested, though.
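If nesting does matter, one option is to drop the single-regex idea and walk the string tag by tag, keeping a counter for how deep inside <code> tags we currently are. The following is only a sketch under simplifying assumptions: tags are well-formed, code tags are written exactly as <code> and </code> (lowercase, no attributes), and, unlike the regexes above, whitespace after a removed tag is left alone. The function name is made up.

```javascript
// Hypothetical helper: strips tags outside <code>...</code>, tracking nesting depth.
function stripTagsOutsideNestedCode(subject) {
  const tagPattern = /<[^<>]*>/g; // a simple tag matcher, not a real HTML parser
  let depth = 0;                  // how many <code> elements are currently open
  let result = "";
  let lastIndex = 0;
  let match;

  while ((match = tagPattern.exec(subject)) !== null) {
    const tag = match[0];
    // Text between the previous tag and this one is always kept.
    result += subject.slice(lastIndex, match.index);
    lastIndex = tagPattern.lastIndex;

    if (tag === "<code>") depth++;                   // entering a code block
    if (depth > 0) result += tag;                    // keep tags only while inside code
    if (tag === "</code>" && depth > 0) depth--;     // leaving a code block
  }
  // Append whatever text follows the last tag.
  return result + subject.slice(lastIndex);
}

console.log(
  stripTagsOutsideNestedCode("a <b>x</b> <code>1 <code>2 <i>y</i></code> <u>3</u></code> <b>z</b>")
);
```

With the nested input above, the outer and inner code blocks and everything inside them survive, while the <b> tags outside are dropped.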
If you want to do this in JavaScript, my guess is that you're probably working in an environment where you've already got a full-fledged set of HTML parsing and traversal tools at hand -- the browser DOM.
If that is indeed the case, the generally good advice that regular expressions aren't the ideal tool for working with markup applies doubly here, and you might consider doing something else instead.
Getting a given piece of markup into a form where you can manipulate it using the DOM interface is pretty straightforward:
var working = document.createElement('div'); //create a new empty element
working.innerHTML = sourceToSanitize; //put your HTML source inside it
var sanitized = sanitize(working); //call sanitization function!
Now you'd just need a sanitize function you can call on that element that will traverse every node in the DOM tree within it, and give you back a fragment of transformed HTML.
Something like this might work:
function sanitize(emt) {
    if(emt.nodeType == 3) // terminal cond #1: just return text nodes
        return emt.textContent;
    if(emt.nodeType != 1) // terminal cond #2: non-text/element nodes yield null
        return null;
    if(emt.tagName=='code' || emt.tagName=='CODE') //#3: code tags returned untouched
        return outerHTML(emt);
    // recurse over all child nodes
    var schf = [], // *S*anitized *C*hild *H*TML *F*ragments
        children = emt.childNodes;
    for(var i=0,z=children.length; i<z; i++)
        schf.push(sanitize(children[i]));
    return schf.join(''); // smoosh results together and serve fresh!
}
function outerHTML(emt) {
    if(emt.outerHTML) return emt.outerHTML;
    var tmp = document.createElement('div');
    tmp.appendChild(emt.cloneNode(true));
    return tmp.innerHTML;
}
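One caveat with the sketch above: textContent returns the decoded text of a node, so an entity such as &lt; in the source comes back as a literal <. If the result of sanitize is going to be inserted as HTML again, the text nodes probably need re-escaping. A minimal helper for that might look like the following; the name escapeText is made up, and it only covers the characters that matter in text content (not attribute values).

```javascript
// Hypothetical helper: re-escape text taken from a text node's textContent.
function escapeText(text) {
  return text
    .replace(/&/g, "&amp;")  // must run first, or the entities added below would be escaped again
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

console.log(escapeText("a < b && c > d"));
```

In sanitize, the text-node branch would then return escapeText(emt.textContent) instead of emt.textContent.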
[code] matches a single letter, either c, o, d, or e. At least the brackets would have to be escaped.