Code Review Stack Exchange is a question and answer site for peer programmer code reviews.

I am looking for some guidance and optimization pointers for my custom JavaScript function, which counts the bytes in a string rather than just the characters. The website uses UTF-8, and I need to maintain IE8 compatibility.

/**
 * Count bytes in string
 *
 * Count and return the number of bytes in a given string
 *
 * @access  public
 * @param   string
 * @return  int
 */
function getByteLen(normal_val)
{
    // Force string type
    normal_val = String(normal_val);

    // Split original string into array
    var normal_pieces = normal_val.split('');
    // Get length of original array
    var normal_length = normal_pieces.length;

    // Declare array for encoded normal array
    var encoded_pieces = new Array();

    // Declare array for individual byte pieces
    var byte_pieces = new Array();

    // Loop through normal pieces and convert to URL friendly format
    for(var i = 0; i <= normal_length; i++)
    {
        if(normal_pieces[i] && normal_pieces[i] != '')
        {
            encoded_pieces[i] = encodeURI(normal_pieces[i]);
        }
    }

    // Get length of encoded array
    var encoded_length = encoded_pieces.length;

    // Loop through encoded array
    // Scan individual items for a %
    // Split on % and add to byte array
    // If no % exists then add to byte array
    for(var i = 0; i <= encoded_length; i++)
    {
        if(encoded_pieces[i] && encoded_pieces[i] != '')
        {
            // % exists
            if(encoded_pieces[i].indexOf('%') != -1)
            {
                // Split on %
                var split_code = encoded_pieces[i].split('%');
                // Get length
                var split_length = split_code.length;

                // Loop through pieces
                for(var j = 0; j <= split_length; j++)
                {
                    if(split_code[j] && split_code[j] != '')
                    {
                        // Push to byte array
                        byte_pieces.push(split_code[j]);
                    }
                }
            }
            else
            {
                // No percent
                // Push to byte array
                byte_pieces.push(encoded_pieces[i]);
            }
        }
    }

    // Array length is the number of bytes in string
    var byte_length = byte_pieces.length;

    return byte_length;
}

2 Answers

Accepted answer · 3 votes

It would be a lot simpler to work out the length yourself rather than to interpret the results of encodeURI().

/**
 * Count bytes in a string's UTF-8 representation.
 *
 * @param   string
 * @return  int
 */
function getByteLen(normal_val) {
    // Force string type
    normal_val = String(normal_val);

    var byteLen = 0;
    for (var i = 0; i < normal_val.length; i++) {
        var c = normal_val.charCodeAt(i);
        byteLen += c < (1 <<  7) ? 1 :
                   c < (1 << 11) ? 2 :
                   c < (1 << 16) ? 3 :
                   c < (1 << 21) ? 4 :
                   c < (1 << 26) ? 5 :
                   c < (1 << 31) ? 6 : Number.NaN;
    }
    return byteLen;
}
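As a quick sanity check (my addition, not part of the answer), Node.js's `Buffer.byteLength` reports the same UTF-8 sizes this function should produce:

```javascript
// Cross-check expected UTF-8 byte counts with Node's Buffer API
// (Node-only; the question targets browsers, so this is just for verification)
var samples = { "a": 1, "é": 2, "€": 3, "i ♥ js": 8 };
Object.keys(samples).forEach(function (s) {
    console.log(s, "→", Buffer.byteLength(s, "utf8"), "bytes");
});
```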
 
Nice, I saw similar but less clean code on the site I linked to, +1 –  tomdemuyt Dec 17 '13 at 13:23
 
I will give your suggestion a try as well but are the comments on this site relevant to the use of charCodeAt()? –  user2191572 Dec 17 '13 at 14:24
Good question! The code at forrst.com is bogus. Although ceil(log_256(charCode)) tells you the number of bytes it would take to represent charCode, there's nothing about UTF-8 in their byteLength() function. UTF-8 is a variable-length encoding scheme, and the few most-significant bits of every byte are necessary to indicate how many bytes form each character. Since any variable-length encoding scheme will have such padding, their byteLength() function gives a wrong answer for any encoding, including UTF-8. –  200_success Dec 17 '13 at 22:44
 
Great answer, thank you! My test scenario was cut down to ~0.1 seconds with your code, compared to ~3.2 seconds with @tomdemuyt's version, so it is definitely extremely efficient. I just have to wrap my head around these bitwise operators (I've never used them before) and then I can have rock-solid confidence in the script. According to Wikipedia, UTF-8 was ultimately restricted to a maximum of 4 bytes per character, so is checking for 5 and 6 bytes fruitless, or do you think UTF-8 will eventually extend to 5 and 6 bytes? I see your shorthand if statements will essentially never get to 5 and 6. –  user2191572 Dec 18 '13 at 13:22
 
1 << n is simply a way to write 2^n. I think it's easier to understand 1 << n than the magic numbers 128, 2048, 65536, etc. –  200_success Dec 18 '13 at 17:21
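To make the comment above concrete, here is a quick demonstration (mine) that `1 << n` is 2^n, lined up with the UTF-8 range boundaries used in the answer:

```javascript
// 1 << n shifts the bit pattern 1 left by n places, i.e. 2^n
console.log(1 << 7);   // 128   — code points below this fit in 1 UTF-8 byte
console.log(1 << 11);  // 2048  — below this, 2 bytes
console.log(1 << 16);  // 65536 — below this, 3 bytes
```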

My 2 cents

  • Please do not abbreviate words; choose short words or acronyms instead ( Len -> Length )
  • Please use lower camel case ( normal_val -> normalValue )
  • Consider using spartan conventions ( s -> generic string )
  • new Array() is considered old school; consider var byte_pieces = []
  • You are using byte_pieces to collect the bytes just to read its length; you could have kept a running count instead, which would be more efficient
  • I am not sure what abnormal pieces would be here:

if(normal_pieces[i] && normal_pieces[i] != '')

  • You check again for these here, probably not needed:

if(encoded_pieces[i] && encoded_pieces[i] != '')

  • You could just do return byte_pieces.length instead of
// Array length is the number of bytes in string
var byte_length = byte_pieces.length;

return byte_length;

Putting all that together, I would counter-propose something like this:

function getByteCount( s )
{
  // Coerce first, then measure: reading s.length before
  // String( s || "" ) runs would use the uncoerced value
  s = String( s || "" );
  var count = 0, stringLength = s.length, i;
  for( i = 0 ; i < stringLength ; i++ )
  {
    // charAt() instead of s[i] for old-IE compatibility
    var partCount = encodeURI( s.charAt( i ) ).split("%").length;
    count += partCount == 1 ? 1 : partCount - 1;
  }
  return count;
}
getByteCount("i ♥ js");  // 8
getByteCount("abc def"); // 7

You could also compute the sum with .reduce(); I leave that as an exercise for the reader.
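For completeness, one possible shape for the `.reduce()` variant mentioned above (my own sketch; note that `Array.prototype.reduce` is ES5, so IE8 would need a polyfill):

```javascript
// Same byte-counting logic, expressed as a fold over the characters:
// each "%XX" escape produced by encodeURI corresponds to one UTF-8 byte
function getByteCount(s) {
    s = String(s || "");
    return s.split("").reduce(function (count, ch) {
        var parts = encodeURI(ch).split("%").length;
        return count + (parts === 1 ? 1 : parts - 1);
    }, 0);
}
```

It returns the same results as the loop version, e.g. `getByteCount("i ♥ js")` gives 8.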

Finally, if you are truly concerned about performance, there are some very fancy performant js libraries out there.

Thank you so much, it looks like a lot of good stuff in your post. I will give them a go and see if I can get better performance numbers. I am not overly concerned about performance, but my original code took ~6 seconds for 1200 iterations of my enforceMaxByteLength script (starting from 2400 Euro signs and reducing by one character per iteration down to 1200), and this code took ~3.8, so hopefully I can shave off a bit more –  user2191572 Dec 16 '13 at 18:51
Your counter proposition is genius, it shaved another .6 seconds off my benchmark, thank you. –  user2191572 Dec 16 '13 at 19:14
