character values in UTF-16 and UTF-32 encoding formats
-
Saturday, November 17, 2012 12:52 PM
All characters in .NET are 16-bit (UTF-16 code units).
We can define a character of this kind as follows:
let chr = '\u0061'
val chr : char = 'a'
Also, in F# we can define characters in UTF-32 form as follows:
let chr = '\u00000061'
val chr : char = 'a'
My question is:
We know that characters are stored in UTF-16 format, so where do the other 16 bits go?
Can you give an example of a character that fills all 32 bits? Or are the extra 16 bits reserved for future use?
- Edited by Rainmater Saturday, November 17, 2012 1:14 PM
All Replies
-
Saturday, November 17, 2012 7:48 PM
- You need an uppercase U in the escape sequence
- You can only use characters off the Basic Multilingual Plane in strings, because strings are UTF-16 encoded and such a character occupies two 16-bit character cells (a surrogate pair)
- You need a font to display them, of which there aren't many, which itself is a \U0001F4A9
- There are no 32-bit characters in Unicode -- the largest code points need 21 bits, in the private use plane \U00100000 to \U0010FFFF
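To check the 21-bit figure from the last point, here is a minimal sketch that counts the bits needed for the largest code point. The names maxCodePoint and bitsNeeded are made up for this illustration.

```fsharp
// Largest code point defined by Unicode.
let maxCodePoint = 0x10FFFF

// Count how many binary digits are needed to write a non-negative integer.
let rec bitsNeeded n =
    if n = 0 then 0 else 1 + bitsNeeded (n >>> 1)

printfn "%d" (bitsNeeded maxCodePoint)   // 21
printfn "%d" (bitsNeeded 0xFFFF)         // 16, the most a single .NET char can hold
```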
Worked Example
Here is one I prepared earlier (one I was working with earlier today in a recherché context where it comes after Q)
> let chr = '\U0001D107';;

  let chr = '\U0001D107';;
  ----------^^^^^^^^^^^^

stdin(6,11): error FS1159: This Unicode encoding is only valid in string literals

> let str = "Unicode Character 'MUSICAL SYMBOL RIGHT REPEAT SIGN' = \U0001D107";;

val str : string = "Unicode Character 'MUSICAL SYMBOL RIGHT REPEAT SIGN' = ??"
> str |> Seq.iter (fun x -> printfn "%A" (int x));;
85
110
105...
61
32
65533
57399
val it : unit = ()
>
Note that the character turns into two ?? because the terminal encoding doesn't understand UTF-16 surrogates.
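As a cross-check that does not depend on the console or on how the "\U..." literal is handled, the string can be built from the code point with the .NET method Char.ConvertFromUtf32 and then inspected. This is only a sketch; the name repeatSign is invented here.

```fsharp
open System

// Build the one-character string from its code point instead of a "\U..." literal.
let repeatSign = Char.ConvertFromUtf32 0x1D107

// A character off the Basic Multilingual Plane takes two UTF-16 code units.
printfn "Length in code units: %d" repeatSign.Length                    // 2
printfn "Is surrogate pair: %b" (Char.IsSurrogatePair(repeatSign, 0))   // true
printfn "Code point: U+%X" (Char.ConvertToUtf32(repeatSign, 0))         // U+1D107

// Print each 16-bit code unit in hex.
repeatSign |> Seq.iter (fun c -> printfn "0x%04X" (int c))              // 0xD834 then 0xDD07
```

The string has Length 2 even though it holds a single Unicode character, which is why the value cannot fit in a char literal.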
-
Sunday, November 18, 2012 2:11 AM
1) Why is this Unicode encoding only valid in string literals?
Can you explain it to me?
From the Unicode website: The following table summarizes some of the properties of each of the UTFs:

Name                        UTF-8          UTF-16         UTF-16BE       UTF-16LE       UTF-32         UTF-32BE       UTF-32LE
Smallest code point         0000           0000           0000           0000           0000           0000           0000
Largest code point          10FFFF         10FFFF         10FFFF         10FFFF         10FFFF         10FFFF         10FFFF
Code unit size              8 bits         16 bits        16 bits        16 bits        32 bits        32 bits        32 bits
Byte order                  N/A            <BOM>          big-endian     little-endian  <BOM>          big-endian     little-endian
Fewest bytes per character  1              2              2              2              4              4              4
Most bytes per character    4              4              4              4              4              4              4
We can see in the table that UTF-16 can use up to 4 bytes of memory per character.
2) Can you give me an example of a UTF-16 encoding that uses 4 bytes?
thank you.
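The "fewest/most bytes per character" rows of the table can be reproduced with the System.Text.Encoding classes. This is a rough sketch; the helper byteCounts and the choice of sample code points are assumptions made for illustration.

```fsharp
open System
open System.Text

// Number of bytes one string needs in UTF-8, UTF-16 and UTF-32.
let byteCounts (s: string) =
    Encoding.UTF8.GetByteCount(s),
    Encoding.Unicode.GetByteCount(s),   // Encoding.Unicode is UTF-16 little-endian
    Encoding.UTF32.GetByteCount(s)

// Sample code points: 'a', e-acute, euro sign, musical symbol right repeat sign.
for codePoint in [ 0x61; 0xE9; 0x20AC; 0x1D107 ] do
    let utf8, utf16, utf32 = byteCounts (Char.ConvertFromUtf32 codePoint)
    printfn "U+%04X: UTF-8 %d, UTF-16 %d, UTF-32 %d bytes" codePoint utf8 utf16 utf32
```

U+0061 comes out as 1, 2 and 4 bytes, while U+1D107 needs 4 bytes in all three encodings, so it is one example of a UTF-16 encoding that uses 4 bytes.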
-
Sunday, November 18, 2012 10:27 AM
Your questions were answered in my first post, but I'll cut and paste the relevant bits with some extra commentary in []
> why This Unicode encoding is only valid in string literals?
Characters off the Basic Multilingual Plane occupy two 16-bit character cells. [Character values over 0xFFFF just can't fit into a single 16-bit type, obviously]
> can you give me an example of UTF-16 encoding that use 4 byte?
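Unicode Character 'MUSICAL SYMBOL RIGHT REPEAT SIGN' (U+1D107) is represented in UTF-16 as the surrogate pair 0xD834, 0xDD07 (decimal 55348, 56583). [The session above printed 65533, 57399, i.e. 0xFFFD, 0xE037; 0xFFFD is the replacement character and 0xE037 is not a surrogate, so the value was mangled somewhere along the way -- see http://en.wikipedia.org/wiki/UTF-16#Code_points_U.2B10000_to_U.2B10FFFF for the detailed bit-twiddling required]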
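For reference, the bit-twiddling described on that page can be sketched in a few lines. The function name toSurrogates is invented for this example; the result is compared with what Char.ConvertFromUtf32 returns.

```fsharp
open System

/// Split a code point in U+10000..U+10FFFF into its UTF-16 surrogate pair.
let toSurrogates (codePoint: int) =
    if codePoint < 0x10000 || codePoint > 0x10FFFF then
        invalidArg "codePoint" "expected a value in U+10000..U+10FFFF"
    let offset = codePoint - 0x10000          // leaves a 20-bit value
    let high = 0xD800 + (offset >>> 10)       // top 10 bits go into the high surrogate
    let low = 0xDC00 + (offset &&& 0x3FF)     // bottom 10 bits go into the low surrogate
    char high, char low

let high, low = toSurrogates 0x1D107
printfn "0x%04X 0x%04X" (int high) (int low)  // 0xD834 0xDD07

// Cross-check against the framework's own conversion.
let viaFramework = Char.ConvertFromUtf32 0x1D107
printfn "%b" (viaFramework.[0] = high && viaFramework.[1] = low)   // true
```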
- Marked As Answer by Rainmater Sunday, November 18, 2012 12:10 PM
-
Sunday, November 18, 2012 12:10 PM
A good tool for working with characters and viewing their numeric codes in Windows is "Character Map".
Character Map enables you to view the characters that are available in a selected font. Using Character Map, you can copy individual characters or a group of characters to the Clipboard and paste them into any program that can display them.
To run Character Map:
Press Win+R, type "charmap" in the box that appears, and press Enter.
-
Tuesday, November 20, 2012 3:53 AM
According to the Unicode website, Unicode code points are in the range 0x00000000 to 0x0010FFFF.
1) But why can we have a character (inside a string) outside this range? For example:
let str = "\U0FFFFFFF"
val str : string = "ힿ�"
In other words, what happens when we do this?
2) When we use the \XXX format to specify characters, what is XXX? Is it the ASCII code? And what is its range?
thank you.
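On the first point: whatever the compiler does with an out-of-range escape in a literal, the .NET conversion method Char.ConvertFromUtf32 does reject such values. A small sketch follows; the wrapper tryFromCodePoint is a made-up helper.

```fsharp
open System

// Return Some string for a valid code point, None when .NET rejects the value.
let tryFromCodePoint (codePoint: int) =
    try
        Some (Char.ConvertFromUtf32 codePoint)
    with :? ArgumentOutOfRangeException ->
        None

// Last valid code point, first value past the range, the value from the
// question, and a lone surrogate (also not a valid code point).
for codePoint in [ 0x10FFFF; 0x110000; 0x0FFFFFFF; 0xD800 ] do
    match tryFromCodePoint codePoint with
    | Some s -> printfn "0x%X: valid, %d UTF-16 code unit(s)" codePoint s.Length
    | None -> printfn "0x%X: not a valid Unicode code point" codePoint
```

Only 0x10FFFF is accepted here, so a value such as 0x0FFFFFFF has no defined character behind it.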
-
Wednesday, November 21, 2012 1:25 PM
After some testing I found the following:
In F# we can specify a character using its ASCII code in two ways:
1) using the escape sequence \DDD, where DDD is the ASCII code of a character in decimal notation.
2) using the escape sequence \xHH, where HH is the ASCII code of a character in hexadecimal notation.
For example, we can specify the character 'a' in two ways, as shown below:
let char1 = '\097'
let char2 = '\x61'
val char1 : char = 'a'
val char2 : char = 'a'
But a question:
As we know, ASCII characters are in the range 0 to 127 (00 to 7F hex),
but why does F# still show us a character when we assign a value outside this range?
let char = '\xFF'
val char : char = 'ÿ'
The same happens with the UTF-16 and UTF-32 based escapes.
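One way to look at it: a .NET char holds a Unicode code unit rather than an ASCII code, and Unicode code points 0 to 127 coincide with ASCII while 128 to 255 coincide with Latin-1. A small sketch to illustrate; the binding names are invented.

```fsharp
// Three spellings of the same character 'a' (code 97 decimal, 61 hex).
let fromDecimal = '\097'
let fromHex = '\x61'
let fromUnicode = '\u0061'
printfn "%b" (fromDecimal = fromHex && fromHex = fromUnicode)   // true

// \xFF is outside ASCII but is still an ordinary code point, U+00FF.
let beyondAscii = '\xFF'
printfn "%d" (int beyondAscii)              // 255
printfn "%b" (beyondAscii = '\u00FF')       // true
```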