Take the 2-minute tour ×
Code Review Stack Exchange is a question and answer site for peer programmer code reviews. It's 100% free, no registration required.

I need to remove inline javascript for a given string. Examples:

If user typed: <img onload="something" />

I should need to convert into <img />

I created this PHP code and it works(apparently without issues):

http://writecodeonline.com/php/

function test_input($input){
  //I have a list with all events but for this example I used two
  $html_events = 'onload|onclick';
  $pattern = "/(<[A-Z][A-Z0-9]*[^>]*)($html_events)([\s]*=[\s]*)('[^>]*'|\"[^>]*\")([^>]*>)/i";
  $replacement = '$1$5';
  while( preg_match($pattern, $input) ){
    $input = preg_replace($pattern, $replacement, $input);
  }
  return htmlentities($input);
}

echo test_input('<img onload="alert(\'hello world\');" onclick="alert(\'hello world\');" />'). '<br />';
echo test_input('<img onload="alert(\'hello world\');"/>'). '<br />';
echo test_input('<div onload="alert(\'hello world\');" onclick="alert(\'hello world\');">hello buddies</div>'). '<br />';

I'm just looking for improvements or use cases that I did not supporting or that break my regex. I would appreciate if you tell me:

This: test_input('something bad'); breaks your regex.

Or if found an improvement that in a benchmark demonstrate better performance I should be happy to apply it as long as it does not break use cases already supported.

Thank You!

Update I finally used htmlpurifier

share|improve this question
 
Dear lord, let us pray that Cthulhu doesn't get word of this. Regex is not the tool for the job! Parse the markup, iterate the nodes, and check the attributes. If an onload or onlclick attribute is found remove it –  Elias Van Ootegem Aug 21 '13 at 20:22
 
Yeah that could be a possibility to explore also. :) –  Daniel Aranda Aug 21 '13 at 20:29
 
I've added an answer that shows you how to parse, and remove any attribute you need removing... –  Elias Van Ootegem Aug 21 '13 at 20:30
add comment

1 Answer

Parsing markup with regex is like building your house using lego... it's not the right tool for the job. HTML is not a regular language, therefore regular expressions don't cut the mustard. More than that: You're actively working to bring the world as we know it to an end, which drives people insane
What you need is a DOM parser, and as luck would have it, PHP has the DOMDocument object, which is just that:

$dom = new DOMDocument;
$dom->loadHTML('<img onload="alert(\'hello world\');" onclick="alert(\'hello world\');" />');
$nodes = $dom->getElementsByTagName('*');//just get all nodes, 
//$dom->getElementsByTagName('img'); would work, too
foreach($nodes as $node)
{
    if ($node->hasAttribute('onload'))
    {
        $node->removeAttribute('onload');
    }
    if ($node->hasAttribute('onclick'))
    {
        $node->removeAttribute('onclick');
    }
}
echo $dom->saveHTML();//will include html, head, body tags and doctype

Tadaa... both onload and onclick have been removed from the markup, without the pain of writing a reliable and stable regex, that can deal with in-line JS... As an added bonus, this code will be far more maintainable (and expandable) in the future. I'd much prefer maintaining this code, than having to rework a regular expression somebody wrote a couple of months ago...

If you want, you can echo only the tags you've changed, like so:

$changed = array();
$attributesOfDeath = array('onload', 'onclick');
foreach($nodes as $node)
{
    $current = null;
    foreach($attributesOfDeath as $attr)
    {
        if ($node->hasAttribute($attr))
        {
            $node->removeAttribute($attr);
            $current = $node;
        }
    }
    if ($current)
    {
        $changed[] = $current;//add to changed array
    }
}
$changed = array_map(array($dom, 'saveXML'), $changed);
echo implode(PHP_EOL, $changed);

As Jan said, for maintainability it's best to use an array of "forbidden attributes". That's what the $attributesOfDeath array is for. If you want to, later on, check for a third or fourth attribute, you can simply add that to the array, and nothing else in your code need change. It'll just keep on working as before.

share|improve this answer
 
My interest is to prevent XSS the input could or not could be HTML. But assuming that is HTML seems like DOMDocument should be the best option. I will explore it. –  Daniel Aranda Aug 21 '13 at 20:31
 
for the purposes of maintainability, you should put onload and onclick in a single array and iterate over it. Otherwise, I agree with the answer. –  Jan Dvorak Aug 21 '13 at 20:33
 
@JanDvorak: Not only that, though I did think about doing that in my code example: when using this code, it might be worth while using a class (assigning the DOMDocument instance to a property, so as to not create a new instance on each function call). Anyway, I've edited my answer, and now use an $attributesOfDeath array in my second example –  Elias Van Ootegem Aug 21 '13 at 20:37
add comment

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.