Code Review Stack Exchange is a question and answer site for peer programmer code reviews.

I am looking for some guidance and optimization pointers for my custom JavaScript function, which counts the bytes in a string rather than just the characters. The website uses UTF-8, and I need to maintain IE8 compatibility.

/**
 * Count bytes in string
 *
 * Count and return the number of bytes in a given string
 *
 * @access  public
 * @param   string
 * @return  int
 */
function getByteLen(normal_val)
{
    // Force string type
    normal_val = String(normal_val);

    // Split original string into array
    var normal_pieces = normal_val.split('');
    // Get length of original array
    var normal_length = normal_pieces.length;

    // Declare array for encoded normal array
    var encoded_pieces = new Array();

    // Declare array for individual byte pieces
    var byte_pieces = new Array();

    // Loop through normal pieces and convert to URL friendly format
    for(var i = 0; i <= normal_length; i++)
    {
        if(normal_pieces[i] && normal_pieces[i] != '')
        {
            encoded_pieces[i] = encodeURI(normal_pieces[i]);
        }
    }

    // Get length of encoded array
    var encoded_length = encoded_pieces.length;

    // Loop through encoded array
    // Scan individual items for a %
    // Split on % and add to byte array
    // If no % exists then add to byte array
    for(var i = 0; i <= encoded_length; i++)
    {
        if(encoded_pieces[i] && encoded_pieces[i] != '')
        {
            // % exists
            if(encoded_pieces[i].indexOf('%') != -1)
            {
                // Split on %
                var split_code = encoded_pieces[i].split('%');
                // Get length
                var split_length = split_code.length;

                // Loop through pieces
                for(var j = 0; j <= split_length; j++)
                {
                    if(split_code[j] && split_code[j] != '')
                    {
                        // Push to byte array
                        byte_pieces.push(split_code[j]);
                    }
                }
            }
            else
            {
                // No percent
                // Push to byte array
                byte_pieces.push(encoded_pieces[i]);
            }
        }
    }

    // Array length is the number of bytes in string
    var byte_length = byte_pieces.length;

    return byte_length;
}

2 Answers

Accepted answer · 3 votes

It would be a lot simpler to work out the length yourself rather than to interpret the results of encodeURI().

/**
 * Count bytes in a string's UTF-8 representation.
 *
 * @param   string
 * @return  int
 */
function getByteLen(normal_val) {
    // Force string type
    normal_val = String(normal_val);

    var byteLen = 0;
    for (var i = 0; i < normal_val.length; i++) {
        var c = normal_val.charCodeAt(i);
        byteLen += c < (1 <<  7) ? 1 :
                   c < (1 << 11) ? 2 :
                   c < (1 << 16) ? 3 :
                   c < (1 << 21) ? 4 :
                   c < (1 << 26) ? 5 :
                   c < (1 << 31) ? 6 : Number.NaN;
    }
    return byteLen;
}
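As a quick sanity check (my addition, not part of the answer), Node.js's `Buffer.byteLength` reports the same UTF-8 sizes this function should produce:

```javascript
// Cross-check expected UTF-8 byte counts with Node's Buffer API
// (Node-only; the question targets browsers, so this is just for verification)
var samples = { "a": 1, "é": 2, "€": 3, "i ♥ js": 8 };
Object.keys(samples).forEach(function (s) {
    console.log(s, "→", Buffer.byteLength(s, "utf8"), "bytes");
});
```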
 
Nice, I saw similar but less clean code on the site I linked to, +1 –  tomdemuyt Dec 17 '13 at 13:23
 
I will give your suggestion a try as well but are the comments on this site relevant to the use of charCodeAt()? –  user2191572 Dec 17 '13 at 14:24
Good question! The code at forrst.com is bogus. Although ceil(log_256(charCode)) tells you the number of bytes it would take to represent charCode, there's nothing about UTF-8 in their byteLength() function. UTF-8 is a variable-length encoding scheme, and the few most-significant bits of every byte are necessary to indicate how many bytes form each character. Since any variable-length encoding scheme will have such padding, their byteLength() function gives a wrong answer for any encoding, including UTF-8. –  200_success Dec 17 '13 at 22:44
 
Great answer, thank you! My test scenario was cut down to ~0.1 seconds with your code, compared to ~3.2 seconds with @tomdemuyt's version, so it is definitely extremely efficient. I just have to wrap my head around these bitwise operators (I've never used them before) and then I can have rock-solid confidence in the script. According to Wikipedia, UTF-8 was ultimately restricted to a maximum of 4 bytes per character, so is checking for 5 and 6 bytes fruitless, or do you think UTF-8 will eventually extend to 5 and 6 bytes? I see your shorthand if statements will essentially never get to 5 and 6. –  user2191572 Dec 18 '13 at 13:22
 
1 << n is simply a way to write 2^n. I think it's easier to understand 1 << n than the magic numbers 128, 2048, 65536, etc. –  200_success Dec 18 '13 at 17:21
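To make the comment above concrete, here is a quick demonstration (mine) that `1 << n` is 2^n, lined up with the UTF-8 range boundaries used in the answer:

```javascript
// 1 << n shifts the bit pattern 1 left by n places, i.e. 2^n
console.log(1 << 7);   // 128   — code points below this fit in 1 UTF-8 byte
console.log(1 << 11);  // 2048  — below this, 2 bytes
console.log(1 << 16);  // 65536 — below this, 3 bytes
```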

My 2 cents

  • Please do not abbreviate words; choose short words or acronyms instead ( Len -> Length )
  • Please use lower camel case ( normal_val -> normalValue )
  • Consider using spartan conventions ( s -> generic string )
  • new Array() is considered old school; consider var byte_pieces = []
  • You are using byte_pieces to collect the bytes just to read its length; you could have kept a running count instead, which would be more efficient
  • I am not sure what abnormal pieces would be here:

if(normal_pieces[i] && normal_pieces[i] != '')

  • You check again for these here, probably not needed:

if(encoded_pieces[i] && encoded_pieces[i] != '')

  • You could just do return byte_pieces.length instead of
// Array length is the number of bytes in string
var byte_length = byte_pieces.length;

return byte_length;

Putting all that together, I would counter-propose something like this:

function getByteCount( s )
{
  // Coerce first, then measure: reading s.length before
  // String( s || "" ) runs would use the uncoerced value
  s = String( s || "" );
  var count = 0, stringLength = s.length, i;
  for( i = 0 ; i < stringLength ; i++ )
  {
    // charAt() instead of s[i] for old-IE compatibility
    var partCount = encodeURI( s.charAt( i ) ).split("%").length;
    count += partCount == 1 ? 1 : partCount - 1;
  }
  return count;
}
getByteCount("i ♥ js");  // 8
getByteCount("abc def"); // 7

You could also compute the sum with .reduce(); I leave that as an exercise for the reader.
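For completeness, one possible shape for the `.reduce()` variant mentioned above (my own sketch; note that `Array.prototype.reduce` is ES5, so IE8 would need a polyfill):

```javascript
// Same byte-counting logic, expressed as a fold over the characters:
// each "%XX" escape produced by encodeURI corresponds to one UTF-8 byte
function getByteCount(s) {
    s = String(s || "");
    return s.split("").reduce(function (count, ch) {
        var parts = encodeURI(ch).split("%").length;
        return count + (parts === 1 ? 1 : parts - 1);
    }, 0);
}
```

It returns the same results as the loop version, e.g. `getByteCount("i ♥ js")` gives 8.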

Finally, if you are truly concerned about performance, there are some very fancy performant js libraries out there.

Thank you so much, it looks like a lot of good stuff in your post. I will give them a go and see if I can get better performance numbers. I am not overly concerned about performance, but my original code took ~6 seconds for 1200 iterations of my enforceMaxByteLength script (starting from 2400 Euro signs and reducing by one character per iteration down to 1200), and this code took ~3.8, so hopefully I can shave off a bit more –  user2191572 Dec 16 '13 at 18:51
Your counter proposition is genius, it shaved another .6 seconds off my benchmark, thank you. –  user2191572 Dec 16 '13 at 19:14
