I have implemented a customized UTF-8 encoding mechanism in Java. The code seems to work, but I have several concerns about it.
public class Utf8Encoding {

    public static void main(String[] args) {
        byte[] arr = new byte[1000];
        int iStr = 95000; // any value > 0xFFFF is a supplementary code point,
                          // represented in UTF-16 as a surrogate pair
        String str = new String(Character.toChars(iStr));
        encode(arr, 0, str);
        System.out.println(decode(arr, utf8Length(str)));
        System.out.println(decode(arr, utf8Length(str)).equals(str));
    }
    public static byte[] encode(byte[] aByteArray, int offset, String str) {
        int len = str.length();
        int j = offset;
        try {
            for (int k = 0; k < len; ++k) {
                int l = str.charAt(k) & 0xFFFF;
                if ((l >= 1) && (l <= 127)) {
                    // U+0001..U+007F: one byte
                    aByteArray[j++] = (byte) l;
                } else if ((l == 0) || ((l >= 128) && (l <= 2047))) {
                    // U+0000 and U+0080..U+07FF: two bytes (U+0000 becomes
                    // 0xC0 0x80, as in modified UTF-8)
                    aByteArray[j++] = (byte) (192 + (l >> 6));
                    aByteArray[j++] = (byte) (128 + (l & 0x3F));
                } else {
                    // U+0800..U+FFFF: three bytes. Note that surrogate code
                    // units also fall in this range, so a surrogate pair is
                    // encoded as two three-byte sequences (six bytes total).
                    aByteArray[j++] = (byte) (224 + (l >> 12));
                    aByteArray[j++] = (byte) (128 + (l >> 6 & 0x3F));
                    aByteArray[j++] = (byte) (128 + (l & 0x3F));
                }
            }
        } catch (ArrayIndexOutOfBoundsException e) {
            throw new InternalError("Cannot encode the character " + str);
        }
        return aByteArray;
    }
    public static String decode(byte[] aByteArray, int len) {
        int j = 0;
        int aOffset = 0;
        int i = len;
        char[] charArray = new char[len];
        // Fast path: copy any leading run of ASCII bytes directly
        while ((j < len) && (aByteArray[aOffset] >= 0)) {
            charArray[j++] = (char) aByteArray[aOffset++];
        }
        while (aOffset < i) {
            int l = aByteArray[aOffset++];
            if (l >= 0) {
                // 0xxxxxxx: one-byte sequence
                charArray[j++] = (char) l;
            } else if (l >> 5 == -2) {
                // 110xxxxx: two-byte sequence
                if (aOffset < i) {
                    int i1 = aByteArray[aOffset++];
                    charArray[j++] = (char) (l << 6 ^ i1 ^ 0xF80);
                }
            } else if (l >> 4 == -2) {
                // 1110xxxx: three-byte sequence
                if (aOffset + 1 < i) {
                    int i1 = aByteArray[aOffset++];
                    int i2 = aByteArray[aOffset++];
                    charArray[j++] = (char) (l << 12 ^ i1 << 6
                            ^ i2 ^ 0xFFFE1F80);
                }
            } else if (l >> 3 == -2) {
                // 11110xxx: four-byte sequence for a supplementary code
                // point, decoded into a UTF-16 surrogate pair
                if (aOffset + 2 < i) {
                    int i1 = aByteArray[aOffset++];
                    int i2 = aByteArray[aOffset++];
                    int i3 = aByteArray[aOffset++];
                    int i4 = l << 18 ^ i1 << 12 ^ i2 << 6 ^ i3 ^ 0x381F80;
                    charArray[j++] = (char) getHighSurrogate(i4);
                    charArray[j++] = (char) getLowSurrogate(i4);
                }
            }
        }
        return new String(charArray, 0, j);
    }
    // Length in bytes of the encoded form; each surrogate code unit counts
    // as 3 bytes, so a supplementary character counts as 6.
    public static int utf8Length(String str) {
        int i = str.length();
        int j = 0;
        for (int k = 0; k < i; ++k) {
            int l = str.charAt(k) & 0xFFFF;
            if ((l >= 1) && (l <= 127)) {
                ++j;
            } else if ((l == 0) || ((l >= 128) && (l <= 2047))) {
                j += 2;
            } else {
                j += 3;
            }
        }
        return j;
    }
    public static int getLowSurrogate(int number) {
        return (number & 0x3FF) + '\uDC00';
    }

    public static int getHighSurrogate(int number) {
        // Must derive the high surrogate from the code point passed in;
        // the original hard-coded constant 65989 here was a bug.
        return (number >>> 10) + ('\uD800' - (0x010000 >>> 10));
    }
}
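For context, here is a standalone check of the byte counts the JDK's own UTF-8 encoder produces per code-point range (the class name is made up; this does not depend on the class above):

```java
import java.nio.charset.StandardCharsets;

public class Utf8ByteCounts {
    public static void main(String[] args) {
        // Standard UTF-8 byte counts per code-point range:
        // U+0000..U+007F -> 1, U+0080..U+07FF -> 2,
        // U+0800..U+FFFF -> 3, U+10000..U+10FFFF -> 4
        String[] samples = {
            "A",                                 // U+0041, ASCII
            "\u00E9",                            // U+00E9 (233), Latin-1 range
            "\u0905",                            // U+0905 (2309), Devanagari
            new String(Character.toChars(95000)) // supplementary character
        };
        for (String s : samples) {
            System.out.println("U+" + Integer.toHexString(s.codePointAt(0))
                    + " -> " + s.getBytes(StandardCharsets.UTF_8).length
                    + " bytes");
        }
    }
}
```

Note that the last sample prints 4 bytes, while `utf8Length` above would report 6 for the same string.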
A few points of concern:

When I encode the supplementary character above, it takes 6 bytes. But when I perform

str.getBytes("UTF-8")

it returns only 4 bytes. How many bytes does a supplementary (surrogate-pair) character occupy in UTF-8: 4 or 6? Also, can we use UTF-8 where we have lots of non-ASCII / native-language characters, given that characters with code points above 2047 (so anything above 3000, for example) end up occupying 3 bytes each?
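For what it's worth, the 6-byte behaviour matches what the JDK calls "modified UTF-8" (used by `DataOutputStream.writeUTF` and in class files), where each UTF-16 surrogate is encoded as its own 3-byte sequence. A standalone comparison (class name is mine, for illustration only):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        String str = new String(Character.toChars(95000)); // supplementary

        // Standard UTF-8: one four-byte sequence
        System.out.println(str.getBytes(StandardCharsets.UTF_8).length); // 4

        // writeUTF uses modified UTF-8: each surrogate becomes its own
        // three-byte sequence, preceded by a two-byte length prefix
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        new DataOutputStream(baos).writeUTF(str);
        System.out.println(baos.size() - 2); // 6
    }
}
```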