I have the following requirements:
...The document must be encoded in UTF-8... The Lastname field only allows (Extended) ASCII ... City only allows ISOLatin1 ...The message must be put on the (IBM Websphere) MessageQueue as an IBytesMessage
The XML document, for simplicity's sake, looks like this:
<?xml version="1.0" encoding="utf-8"?>
<foo>
    <lastname>John ÐØë</lastname>
    <city>John ÐØë</city>
    <other>UTF-8 string</other>
</foo>
The "ÐØë" characters are (or should be) the (Extended) ASCII values 208, 216 and 235 respectively.
I also have an object:
public class foo {
    public string lastname { get; set; }
    public string city { get; set; }
}
So I instantiate an object and set the lastname:
var x = new foo() { lastname = "John ÐØë", city = "John ÐØë" };
Now this is where my headache sets in (or the inception if you will...):
- Visual Studio / the source code is in Unicode
- Hence: the object has a Unicode lastname
- The XML serializer uses UTF-8 to encode the document
- Lastname should contain only (Extended) ASCII characters; the characters are valid (Extended) ASCII chars, but of course in UTF-8 encoded form
I normally don't experience any trouble with my encodings; I am familiar with The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) but this one's got me stumped...
I understand that the UTF-8 document will be perfectly able to "contain" both encodings because the code points 'overlap'. But where I get lost is when I need to convert the serialized message to a byte array. When doing a dump I see C3 XX C3 XX C3 XX
(I don't have the actual dump at hand). It's clear (or I've been staring at this for too long) that the lastname / city strings are put in the serialized document in their Unicode (UTF-8 encoded) form; the byte array suggests so.
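To make the difference concrete, here is a minimal sketch (not the actual dump, which isn't available) that prints the bytes of "ÐØë" under both encodings. The class and method names are made up for illustration, and it assumes that "(Extended) ASCII" and ISOLatin1 both mean ISO-8859-1 here, which is what the values 208, 216 and 235 suggest.

```csharp
using System;
using System.Text;

public static class EncodingDump
{
    // Returns the bytes of s in the given encoding as space-separated hex.
    public static string Hex(string s, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(s)).Replace("-", " ");
    }

    public static void Main()
    {
        // "ÐØë" written with escapes so the source file's own encoding doesn't matter.
        const string sample = "\u00D0\u00D8\u00EB";

        // UTF-8: each of these characters takes two bytes, the first being C3.
        Console.WriteLine(Hex(sample, Encoding.UTF8));                      // C3 90 C3 98 C3 AB

        // ISO-8859-1: one byte per character (208, 216, 235 in decimal).
        Console.WriteLine(Hex(sample, Encoding.GetEncoding("ISO-8859-1"))); // D0 D8 EB
    }
}
```

So the C3-prefixed pairs are simply the UTF-8 encoding of the same three code points; a single byte sequence cannot be both valid UTF-8 and contain the raw bytes D0 D8 EB for these characters.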
Now what will I have to do, and where, to ensure the Lastname string goes into the XML document and finally the byte-array as an ASCII string (and the actual 208, 216, 235 byte sequence), and that City makes it in there as ISOLatin1?
I know the requirements are backwards, but I can't change them (3rd party). I always use UTF-8 for our internal projects, so I have to support the Unicode/UTF-8 => ASCII/ISOLatin1 conversion (of course, only for chars that are in those sets).
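For the "only for chars that are in those sets" part, one possible reading of the requirement is that the document stays UTF-8 and the field values are merely restricted to characters that exist in the target character set. Under that assumption, a check could look roughly like the sketch below; `CharsetCheck` and `FitsIn` are hypothetical names, and the encoding names passed in are my guess at what the third party means.

```csharp
using System;
using System.Text;

public static class CharsetCheck
{
    // True if every character of value can be represented in the named encoding.
    public static bool FitsIn(string value, string encodingName)
    {
        // Exception fallbacks make GetBytes throw instead of silently emitting '?'.
        Encoding strict = Encoding.GetEncoding(
            encodingName,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetBytes(value);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(FitsIn("John \u00D0\u00D8\u00EB", "ISO-8859-1")); // True
        Console.WriteLine(FitsIn("John \u00D0\u00D8\u00EB", "us-ascii"));   // False
        Console.WriteLine(FitsIn("John \u20AC", "ISO-8859-1"));             // False (euro sign)
    }
}
```

This only validates the strings before serialization; it does not change how the XML serializer writes them into the UTF-8 document.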
My head hurts...
The document must be encoded in UTF-8
is that the only part of the requirement we have to care about? – Chris S Feb 15 at 18:19