I have implemented a customized UTF-8 encoding mechanism in Java. The code seems to work, but I have several concerns about it.
public class Utf8Encoding {

    public static void main(String[] args) {
        byte[] arr = new byte[1000];
        int iStr = 95000; // any value > 0xFFFF is a supplementary code point,
                          // represented in UTF-16 as a surrogate pair
        String str = new String(Character.toChars(iStr));
        encode(arr, 0, str);
        System.out.println(decode(arr, utf8Length(str)));
        System.out.println(decode(arr, utf8Length(str)).equals(str));
    }
    public static byte[] encode(byte[] aByteArray, int offset, String str) {
        int len = str.length();
        int j = offset;
        try {
            for (int k = 0; k < len; ++k) {
                int l = str.charAt(k) & 0xFFFF;
                if ((l >= 1) && (l <= 127)) {
                    // U+0001..U+007F: one byte
                    aByteArray[j++] = (byte) l;
                } else if ((l == 0) || ((l >= 128) && (l <= 2047))) {
                    // U+0000 and U+0080..U+07FF: two bytes (U+0000 becomes
                    // 0xC0 0x80, as in modified UTF-8)
                    aByteArray[j++] = (byte) (192 + (l >> 6));
                    aByteArray[j++] = (byte) (128 + (l & 0x3F));
                } else {
                    // U+0800..U+FFFF: three bytes. Note that surrogate code
                    // units also fall in this range, so a surrogate pair is
                    // encoded as two three-byte sequences (six bytes total).
                    aByteArray[j++] = (byte) (224 + (l >> 12));
                    aByteArray[j++] = (byte) (128 + (l >> 6 & 0x3F));
                    aByteArray[j++] = (byte) (128 + (l & 0x3F));
                }
            }
        } catch (ArrayIndexOutOfBoundsException e) {
            throw new InternalError("Cannot encode the character " + str);
        }
        return aByteArray;
    }
    public static String decode(byte[] aByteArray, int len) {
        int j = 0;
        int aOffset = 0;
        int i = len;
        char[] charArray = new char[len];
        // Fast path: copy any leading run of ASCII bytes directly
        while ((j < len) && (aByteArray[aOffset] >= 0)) {
            charArray[j++] = (char) aByteArray[aOffset++];
        }
        while (aOffset < i) {
            int l = aByteArray[aOffset++];
            if (l >= 0) {
                // 0xxxxxxx: one-byte sequence
                charArray[j++] = (char) l;
            } else if (l >> 5 == -2) {
                // 110xxxxx: two-byte sequence
                if (aOffset < i) {
                    int i1 = aByteArray[aOffset++];
                    charArray[j++] = (char) (l << 6 ^ i1 ^ 0xF80);
                }
            } else if (l >> 4 == -2) {
                // 1110xxxx: three-byte sequence
                if (aOffset + 1 < i) {
                    int i1 = aByteArray[aOffset++];
                    int i2 = aByteArray[aOffset++];
                    charArray[j++] = (char) (l << 12 ^ i1 << 6
                            ^ i2 ^ 0xFFFE1F80);
                }
            } else if (l >> 3 == -2) {
                // 11110xxx: four-byte sequence for a supplementary code
                // point, decoded into a UTF-16 surrogate pair
                if (aOffset + 2 < i) {
                    int i1 = aByteArray[aOffset++];
                    int i2 = aByteArray[aOffset++];
                    int i3 = aByteArray[aOffset++];
                    int i4 = l << 18 ^ i1 << 12 ^ i2 << 6 ^ i3 ^ 0x381F80;
                    charArray[j++] = (char) getHighSurrogate(i4);
                    charArray[j++] = (char) getLowSurrogate(i4);
                }
            }
        }
        return new String(charArray, 0, j);
    }
    // Length in bytes of the encoded form; each surrogate code unit counts
    // as 3 bytes, so a supplementary character counts as 6.
    public static int utf8Length(String str) {
        int i = str.length();
        int j = 0;
        for (int k = 0; k < i; ++k) {
            int l = str.charAt(k) & 0xFFFF;
            if ((l >= 1) && (l <= 127)) {
                ++j;
            } else if ((l == 0) || ((l >= 128) && (l <= 2047))) {
                j += 2;
            } else {
                j += 3;
            }
        }
        return j;
    }
    public static int getLowSurrogate(int number) {
        return (number & 0x3FF) + '\uDC00';
    }

    public static int getHighSurrogate(int number) {
        // Must derive the high surrogate from the code point passed in;
        // the original hard-coded constant 65989 here was a bug.
        return (number >>> 10) + ('\uD800' - (0x010000 >>> 10));
    }
}
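For context, here is a standalone check of the byte counts the JDK's own UTF-8 encoder produces per code-point range (the class name is made up; this does not depend on the class above):

```java
import java.nio.charset.StandardCharsets;

public class Utf8ByteCounts {
    public static void main(String[] args) {
        // Standard UTF-8 byte counts per code-point range:
        // U+0000..U+007F -> 1, U+0080..U+07FF -> 2,
        // U+0800..U+FFFF -> 3, U+10000..U+10FFFF -> 4
        String[] samples = {
            "A",                                 // U+0041, ASCII
            "\u00E9",                            // U+00E9 (233), Latin-1 range
            "\u0905",                            // U+0905 (2309), Devanagari
            new String(Character.toChars(95000)) // supplementary character
        };
        for (String s : samples) {
            System.out.println("U+" + Integer.toHexString(s.codePointAt(0))
                    + " -> " + s.getBytes(StandardCharsets.UTF_8).length
                    + " bytes");
        }
    }
}
```

Note that the last sample prints 4 bytes, while `utf8Length` above would report 6 for the same string.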
A few points of concern:

When I encode the supplementary character above, it takes 6 bytes. But when I perform

str.getBytes("UTF-8")

it returns only 4 bytes. How many bytes does a supplementary (surrogate-pair) character occupy in UTF-8: 4 or 6? Also, can we use UTF-8 where we have lots of non-ASCII / native-language characters, given that characters with code points above 2047 (so anything above 3000, for example) end up occupying 3 bytes each?
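For what it's worth, the 6-byte behaviour matches what the JDK calls "modified UTF-8" (used by `DataOutputStream.writeUTF` and in class files), where each UTF-16 surrogate is encoded as its own 3-byte sequence. A standalone comparison (class name is mine, for illustration only):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        String str = new String(Character.toChars(95000)); // supplementary

        // Standard UTF-8: one four-byte sequence
        System.out.println(str.getBytes(StandardCharsets.UTF_8).length); // 4

        // writeUTF uses modified UTF-8: each surrogate becomes its own
        // three-byte sequence, preceded by a two-byte length prefix
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        new DataOutputStream(baos).writeUTF(str);
        System.out.println(baos.size() - 2); // 6
    }
}
```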