Code Review Stack Exchange is a question and answer site for peer programmer code reviews. Join them; it only takes a minute:

Sign up
Here's how it works:
  1. Anybody can ask a question
  2. Anybody can answer
  3. The best answers are voted up and rise to the top

I am building a highlight feature for search results on my website and need to escape the user's input so that I could highlight the matching string within the original text.

The function

function sanitize_for_regex(s){
    var escaped = '';
    for(var i = 0; i < s.length; ++i){
        switch(s[i]){
            case '{':
            case '}':
            case '[':
            case ']':
            case '-':
            case '/':
            case '\\':
            case '(':
            case ')':
            case '*':
            case '+':
            case '?':
            case '.':
            case '^':
            case '$':
            case '|':
                escaped+= '\\';
            default:
                escaped+= s[i];
        }
    }
    return escaped;
}

How it's used

var input_from_user = 'Hey + wi$ll th<is \'" { tex|t / \\ mess w?>ith you?';

var highlighted_text = original_text.replace(new RegExp('('+sanitize_for_regex(input_from_user)+')', 'gi'), '<span style="background-color:#FBFB73;">$1</span>');

My test cases show that it's working quite well but I would like to get some feedback from other professionals.

Does this function look like it will escape all injection "attempts"? (I quoted attempts because the user is usually not aware of attempting to break anything)

Could the performance be improved?

share|improve this question
1  
Why not look at some prior work? This is what Google uses, and they're notoriously keen on performance... stackoverflow.com/a/18151038/3757232 – Jared Smith 17 hours ago
up vote 3 down vote accepted

I'm not a fan of "sanitizing" data. There isn't a clear definition of what "sanitizing" means, other than that it takes untrustworthy input and somehow makes it valid. It might entail discarding the invalid parts of the input — which is not what you are doing here. Escaping is a clearer term to describe what you are doing.

Repeated string concatenation is considered poor practice for performance. Since JavaScript strings are immutable, it would be better to devise a solution that constructs the string "all at once".

The documentation on developer.mozilla.org suggests the following solution for escaping a regular expression:

Escaping user input to be treated as a literal string within a regular expression can be accomplished by simple replacement:

function escapeRegExp(string){
  return string.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); // $& means the whole matched string
}

I strongly recommend replacing your custom solution with the standard recipe (including a citation).

share|improve this answer
    
Oh wow, this is really excellent. I've come across other solutions like this on StackOverflow but for some reason I could not get it to work with my code; probably due to my regex inexperience. My code used to be a larger mess until I narrowed my vision for what I actually need it to accomplish. This works as a drop-in replacement for my function and I feel silly for not seeing the Mozilla docs sooner. Thanks a million! – MonkeyZeus 17 hours ago
    
I think you forgot to add -(hyphen) in the character class. When used in character class, this need to escape or it may select range. – Tushar 12 hours ago
    
@Tushar But character classes are simply not possible if the square brackets are already escaped. – 200_success 12 hours ago
    
Aah. True. Didn't think about it. – Tushar 12 hours ago

A couple options as far as escaping goes:

  1. You can use a regular expression that matches any character that needs escaped and replaces it with "\\&$", which is "A backslash, followed by whatever character matched.":

    // The inside regexp just replaces every character with a leading backslash for readability.
    // It can be optimized out just fine.
    
    var pattern = RegExp("[" + "{}[]-/\\()*+?.%$|".replace(RegExp(".", "g"), "\\$&") + "]", "g");
    
    function sanitize_for_regex(val) {
        return val.replace(pattern, "\\$&");
    }
    
  2. You can keep your current code somewhat as-is (but it's relatively inefficient), but checking a string rather than a bunch of case:

    function sanitize_for_regex(s) {
        var escaped;
        for(var i = 0; i < s.length; ++i){
            if("{}[]-/\\()*+?.%$|".indexOf(s[i]) !== -1) {
                escaped += "\\";
            }
            escaped += s[i];
        }
        return escaped;
    }
    

As for what needs to be escaped, you're probably matching far more than you need to -- many of those symbols only have meaning when paired with other symbols that you're already escaping, thus:

  • It's not neccessary to escape - or ^, because you're already escaping [ and ] and thus -- unless you're using the input inside of that part of a pattern, they'll never be treated as special characters.
  • It's technically not necessary to escape ], because the [ that would begin that part of the pattern is already escaped.
  • If you're only using the output inside of a pattern (as opposed to inside of a replacement), I believe the only things that really need escaping are the backslash (\), wildcards and the start of character classes (. and [), quantifiers (?, +, * and {), parenthesis (( and )) and branching expressions (|).
  • So, in short, the list of characters to escape are ()[.?+*{|.

This may vary a bit depending on your usage though.

share|improve this answer
    
Great suggestions, I'm really learning quite a bit! I just facepalmed pretty hard because I did not think of the string and indexOf() approach. Since this is targeted for use upon hitting a key and the search terms should never exceed more than 10 characters I think it will be safe to use the indexOf() approach. Even with 1,000 chars in the search term there is no slow down. Thanks! – MonkeyZeus 19 hours ago
    
@MonkeyZeus The regex approach is likely faster (it's not constructing a string one character at a time), but I listed the other to show one that was closer to your original approach. – dewin 19 hours ago
    
I decided to construct the string outside of the for(){} loop's scope and assign it to a variable named escape_chars. My for(){} loop looks like this on the inside escaped_val+= (escape_chars.indexOf(s[i]) !== -1 ? '\\' : '')+s[i]; I am choosing to avoid regex because I cannot read it very well so readable code trumps uber-optimization in this situation. – MonkeyZeus 19 hours ago

This is not a big performance deal, yet you could use a "hashset" instead of going through the cases:

var dispatch = {
    '{':  true,
    '}':  true,
    '[':  true,
    ']':  true,
    '-':  true,
    '/':  true,
    '\\': true,
    '(':  true,
    ')':  true,
    '*':  true,
    '+':  true,
    '?':  true,
    '.':  true,
    '^':  true,
    '$':  true,
    '|':  true
};

function sanitize_for_regex2(val) {
    var escaped = "";

    for (var i = 0; i !== val.length; ++i) {
        if (dispatch[val[i]]) {
            escaped += "\\";
        }

        escaped += val[i];
    }

    return escaped;
}

alert(sanitize_for_regex2("{yeah}"));
alert(sanitize_for_regex("{yeah}"));
share|improve this answer
    
This will certainly clean up my function but I hesitate to put function-specific variables outside of the function's scope. Do you think to would be a huge penalty to declare the hash every time the function is called? Any ideas as to whether or not I've considered all possible regex injection points? – MonkeyZeus 20 hours ago
    
@MonkeyZeus The answer to your 1st question: it would seem the hash will be created every time the function is called. Maybe closures may help. 2nd one: I really don't know (I am not a professional JavaScript developer). :-( – coderodde 20 hours ago
    
I appreciate the insight, this definitely gives me something mull over, thanks! – MonkeyZeus 20 hours ago

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.