I am creating a function to sanitise user input and to escape user-supplied data on output.

Please offer advice on any improvements that could be made:

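function cleanse($input) {

    // Order matters: <script>/<style> blocks and comments must be removed
    // together with their contents before the generic tag pattern runs,
    // otherwise only the bare tags are stripped and the contents remain.
    // The U modifier is dropped because it inverts the lazy quantifiers.
    $search = array(
        '@<script[^>]*?>.*?</script>@si',   // Strip out javascript
        '@<style[^>]*?>.*?</style>@si',     // Strip style tags properly
        '@<![\s\S]*?--[ \t\n\r]*>@',        // Strip multi-line comments
        '@<[\/\!]*?[^<>]*?>@si'             // Strip out HTML tags
    );

    $output = preg_replace($search, '', $input);
    return $output;
}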

function sanitise($input) {
    if (is_array($input)) {
        $output = array();  // so an empty array still returns an array
        foreach ($input as $var => $val) {
            $output[$var] = sanitise($val);  // recurse into nested values
        }
    } else {
        $input  = cleanse($input);
        $output = htmlspecialchars($input, ENT_QUOTES, 'UTF-8', false);
    }
    return $output;
}

2 Answers

Personally, I tend to work with a whitelist of characters.

So instead of sanitizing and removing bad stuff, I simply accept only good stuff,
e.g. [0-9]+ when validating an age field.

This approach is less error-prone because you don't have to think of all the bad things someone can enter. You also know that the data you are serving is actually what you think it is (e.g. a number when referencing age).
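
As a minimal sketch of the whitelist idea for the age field, assuming the value arrives as a string: validateAge() is a hypothetical helper name, and the three-digit limit and the 0-150 range are assumptions chosen for illustration, not something from the question.

```php
<?php
// Hypothetical helper: accept an age only if the whole string is made of
// one to three ASCII digits. \A and \z anchor the match to the entire input,
// so a trailing newline or extra markup is rejected.
function validateAge($input) {
    if (!is_string($input) || !preg_match('/\A[0-9]{1,3}\z/', $input)) {
        return null;  // not on the whitelist: reject instead of repairing
    }
    $age = (int) $input;
    // Assumed plausible range; adjust to the application's rules.
    return ($age <= 150) ? $age : null;
}

var_dump(validateAge('42'));          // int(42)
var_dump(validateAge('42<script>'));  // NULL
```

The function returns a typed value or nothing at all, so later code never has to guess whether the age still contains markup.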

    
I'm working with sensitive information, and nearly all data is input and output as strings. The inputs that are numerical already use validation to ensure they are numeric. In the situation described, am I taking an approach which is not optimal? –  danielsmile May 14 '14 at 16:25
    
@danielsmile It's not that it isn't optimal; it just makes things easier to maintain in the long run if you only have to think about the data that is allowed rather than the data that is not. –  glitchmunki May 14 '14 at 16:37

Please see the discussions on Stack Overflow about parsing HTML with regular expressions, such as http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags (the gist being: you can't do it). I would recommend using an actual HTML parser for this purpose, such as HTMLPurifier: http://htmlpurifier.org/
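
To illustrate what "actual HTML parsing" can look like, here is a rough sketch that uses PHP's built-in DOM extension rather than HTMLPurifier itself, so it only covers the "strip all markup" case and not whitelisting of allowed tags. htmlToText() is a hypothetical name, and the sketch assumes UTF-8 input and that ext-dom is available.

```php
<?php
// Hypothetical helper: let a real parser build the tree, drop <script> and
// <style> elements with their contents, and return the remaining text.
function htmlToText($html) {
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);  // collect warnings on malformed markup instead of emitting them
    // The meta prefix is a charset hint so the parser treats the input as UTF-8.
    $doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach (array('script', 'style') as $tag) {
        // Copy the live node list to an array before removing nodes from the tree.
        foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }
    return trim($doc->documentElement->textContent);
}

echo htmlToText('Hello <b>world</b><script>alert(1)</script>');  // Hello world
```

The returned text is plain text, not safe HTML, so it would still need htmlspecialchars() at output time, as the sanitise() function in the question already does.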

