Java String.getBytes(“UTF8”) JavaScript analog

Question

The functions written there work properly, that is, pack(unpack("string")) yields "string". But I would like to get the same result that "string".getBytes("UTF8") gives in Java.

The question is: how can I write a JavaScript function that gives the same functionality as Java's getBytes("UTF8")?

For Latin strings, unpack(str) from the article mentioned above gives the same result as getBytes("UTF8"), except that it adds a 0 at odd positions. But with non-Latin strings it seems to work completely differently. Is there a way to work with string data in JavaScript the way Java does?

Nope... "中".getBytes("UTF8") yields {-28, -72, -83}, but the function from the answer yields [78, 45]. — ivkremer, Sep 20 '12 at 19:43
@Kremchik JavaScript uses UTF-16, hence the 0s -- they're the upper half of each 16-bit code unit. That Hanzi character requires 3 bytes when encoded according to the UTF-8 scheme but only 2 bytes in UTF-16. — oldrinb, Sep 20 '12 at 21:31

Joni · Accepted Answer · 2012-09-21 11:28:50Z

You can use this function (gist):

function toUTF8Array(str) {
    var utf8 = [];
    for (var i=0; i < str.length; i++) {
        var charcode = str.charCodeAt(i);
        if (charcode < 0x80) utf8.push(charcode);
        else if (charcode < 0x800) {
            utf8.push(0xc0 | (charcode >> 6), 
                      0x80 | (charcode & 0x3f));
        }
        else if (charcode < 0xd800 || charcode >= 0xe000) {
            utf8.push(0xe0 | (charcode >> 12), 
                      0x80 | ((charcode>>6) & 0x3f), 
                      0x80 | (charcode & 0x3f));
        }
        else {
            // let's keep things simple and only handle chars up to U+FFFF...
            utf8.push(0xef, 0xbf, 0xbd); // U+FFFD "replacement character"
        }
    }
    return utf8;
}

Example of use:

>>> toUTF8Array("中€")
[228, 184, 173, 226, 130, 172]

If you want negative numbers for values over 127, like Java's byte-to-int conversion does, you have to tweak the constants and use

            utf8.push(0xffffffc0 | (charcode >> 6), 
                      0xffffff80 | (charcode & 0x3f));

and

            utf8.push(0xffffffe0 | (charcode >> 12), 
                      0xffffff80 | ((charcode>>6) & 0x3f), 
                      0xffffff80 | (charcode & 0x3f));
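
As an alternative to changing the constants, the unsigned array can be post-processed. The following is only a minimal sketch; the helper name toSignedBytes is hypothetical and not part of the answer above. It relies on JavaScript's 32-bit shift operators: shifting a byte left by 24 bits and back right with sign extension maps 0..255 onto -128..127, which is the range of a Java byte.

```javascript
// Hypothetical helper: convert unsigned byte values (0..255)
// to signed values (-128..127), as Java's byte type would show them.
function toSignedBytes(bytes) {
    var signed = [];
    for (var i = 0; i < bytes.length; i++) {
        // << 24 moves the byte into the top 8 bits of an int32;
        // >> 24 shifts it back while copying the sign bit.
        signed.push((bytes[i] << 24) >> 24);
    }
    return signed;
}

console.log(toSignedBytes([228, 184, 173])); // [-28, -72, -83]
```

This keeps toUTF8Array itself unchanged, so the same encoder can serve callers that want either representation.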

best answer, well done. also elegant code. I've made the method smaller and added a reverse method to toUTF8Array, the final touch is putting it all into String.prototype which made the usage really much more clear && simpler. check it out: JavaScript-Bitwise-Shenanigans @GitHub. — Ġiĺàɗ, Apr 5 '15 at 23:51

bobince · Answer 2 · 2013-11-30 17:37:07Z

You don't need to write a full-on UTF-8 encoder; there is a much easier JS idiom to convert a Unicode string into a string of bytes representing UTF-8 code units:

unescape(encodeURIComponent(str))

(This works because the odd encoding used by escape/unescape uses %xx hex sequences to represent ISO-8859-1 characters with that code, instead of UTF-8 as used by URI-component escaping. Similarly decodeURIComponent(escape(bytes)) goes in the other direction.)

So if you want an Array out it would be:

function toUTF8Array(str) {
    var utf8= unescape(encodeURIComponent(str));
    var arr= new Array(utf8.length);
    for (var i= 0; i<utf8.length; i++)
        arr[i]= utf8.charCodeAt(i);
    return arr;
}
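
The reverse direction mentioned above, decodeURIComponent(escape(bytes)), can be wrapped the same way. This is a sketch under assumptions: the name fromUTF8Array is hypothetical, and the input is assumed to be well-formed UTF-8 (decodeURIComponent throws a URIError otherwise). Note that escape and unescape are legacy functions, so this idiom is best kept for environments where nothing newer is available.

```javascript
// Hypothetical inverse of toUTF8Array: byte array -> JavaScript string.
function fromUTF8Array(arr) {
    var bytes = '';
    for (var i = 0; i < arr.length; i++) {
        // & 0xff also accepts Java-style signed bytes (-128..127).
        bytes += String.fromCharCode(arr[i] & 0xff);
    }
    // escape() turns each byte-valued char into %XX,
    // decodeURIComponent() then reads those as UTF-8.
    return decodeURIComponent(escape(bytes));
}

console.log(fromUTF8Array([228, 184, 173, 226, 130, 172])); // "中€"
```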

Your code is simpler, thank you for the better answer. — ivkremer, Dec 2 '13 at 15:10

HelloSam · Answer 3 · 2013-01-22 08:35:31Z

The following functions handle characters above U+FFFF.

Because JavaScript strings are UTF-16, two "characters" (code units) are used in a string to represent a character above the BMP, and charCodeAt returns the corresponding surrogate code unit. fixedCharCodeAt handles this.

function encodeTextToUtf8(text) {
    var bin = [];
    for (var i = 0; i < text.length; i++) {
        var v = fixedCharCodeAt(text, i);
        if (v === false) continue;
        encodeCharCodeToUtf8(v, bin);
    }
    return bin;
}

function encodeCharCodeToUtf8(codePt, bin) {
    if (codePt <= 0x7F) {
        bin.push(codePt);
    } else if (codePt <= 0x7FF) {
        bin.push(192 | (codePt >> 6), 128 | (codePt & 63));
    } else if (codePt <= 0xFFFF) {
        bin.push(224 | (codePt >> 12),
            128 | ((codePt >> 6) & 63),
            128 | (codePt & 63));
    } else if (codePt <= 0x1FFFFF) {
        bin.push(240 | (codePt >> 18),
            128 | ((codePt >> 12) & 63), 
            128 | ((codePt >> 6) & 63),
            128 | (codePt & 63));
    }
}

function fixedCharCodeAt (str, idx) {  
    // ex. fixedCharCodeAt ('\uD800\uDC00', 0); // 65536  
    // ex. fixedCharCodeAt ('\uD800\uDC00', 1); // 65536  
    idx = idx || 0;  
    var code = str.charCodeAt(idx);  
    var hi, low;  
    if (0xD800 <= code && code <= 0xDBFF) { // High surrogate (could change last hex to 0xDB7F to treat high private surrogates as single characters)  
        hi = code;  
        low = str.charCodeAt(idx+1);  
        if (isNaN(low)) {  
            // The original referenced an undefined encoding_error object; throw a plain Error instead.
            throw new Error('Invalid surrogate pair at position ' + idx);
        }  
        return ((hi - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;  
    }  
    if (0xDC00 <= code && code <= 0xDFFF) { // Low surrogate  
        // We return false to allow loops to skip this iteration since should have already handled high surrogate above in the previous iteration  
        return false;  
        /*hi = str.charCodeAt(idx-1); 
          low = code; 
          return ((hi - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;*/  
    }  
    return code;  
}
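
In engines that support ES2015, the surrogate handling above can be left to the language. The following is a minimal sketch rather than a replacement for the answer's code: the function name utf8FromCodePoints is hypothetical, and replacing a lone surrogate with U+FFFD is a choice made here (the code above skips or throws instead). Iterating a string with for...of yields whole code points, so no manual pairing is needed.

```javascript
// Hypothetical ES2015 variant: encode a string to UTF-8 bytes by code point.
function utf8FromCodePoints(text) {
    var bytes = [];
    for (var ch of text) {              // for...of walks code points, not code units
        var cp = ch.codePointAt(0);
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;                // lone surrogate: use the replacement character
        }
        if (cp <= 0x7F) {
            bytes.push(cp);
        } else if (cp <= 0x7FF) {
            bytes.push(0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));
        } else if (cp <= 0xFFFF) {
            bytes.push(0xE0 | (cp >> 12),
                       0x80 | ((cp >> 6) & 0x3F),
                       0x80 | (cp & 0x3F));
        } else {
            bytes.push(0xF0 | (cp >> 18),
                       0x80 | ((cp >> 12) & 0x3F),
                       0x80 | ((cp >> 6) & 0x3F),
                       0x80 | (cp & 0x3F));
        }
    }
    return bytes;
}

console.log(utf8FromCodePoints("\uD83D\uDE00")); // U+1F600 -> [240, 159, 152, 128]
```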

Kevin Hakanson · Answer 4 · 2014-09-01 03:55:30Z

TextEncoder is part of the Encoding Living Standard and according to the Encoding API entry from the Chromium Dashboard, it shipped in Firefox and will ship in Chrome 38. There is also a text-encoding polyfill available for other browsers.

The JavaScript code sample below returns a Uint8Array filled with the values you expect.

(new TextEncoder()).encode("string") 
// [115, 116, 114, 105, 110, 103]

A more interesting example, which better shows UTF-8 at work, replaces the "in" in "string" with "îñ":

(new TextEncoder()).encode("strîñg")
// [115, 116, 114, 195, 174, 195, 177, 103]
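
To get the signed values that Java's getBytes reports, the encoded bytes can be viewed through an Int8Array. This is a sketch assuming an environment where TextEncoder and TextDecoder are available as globals; the variable names are illustrative only. The Int8Array shares the same underlying buffer, so no bytes are copied, and TextDecoder converts the bytes back to a string.

```javascript
var unsigned = new TextEncoder().encode("中");   // Uint8Array [228, 184, 173]

// Reinterpret the same memory as signed 8-bit integers, like Java's byte[].
var signed = new Int8Array(unsigned.buffer, unsigned.byteOffset, unsigned.length);
console.log(Array.from(signed));                 // [-28, -72, -83]

// Decoding goes the other way, from bytes back to a string.
var roundTrip = new TextDecoder("utf-8").decode(unsigned);
console.log(roundTrip);                          // "中"
```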

