Encoding of ASCII string in UTF8 XML document in Byte array

Question

I have the following requirements:

...The document must be encoded in UTF-8... The Lastname field only allows (Extended) ASCII ... City only allows ISOLatin1 ...The message must be put on the (IBM Websphere) MessageQueue as a IBytesMessage

The XML document, for simplicity's sake, looks like this:

<?xml version="1.0" encoding="utf-8"?>
<foo>
  <lastname>John ÐØë</lastname>
  <city>John ÐØë</city>
  <other>UTF-8 string</other>
</foo>

The "ÐØë" part are (or should be) ASCII values 208, 216, 235 respectively.

I also have an object:

public class foo {
  public string lastname { get; set; }
  public string city { get; set; }
  public string other { get; set; }
}

So I instantiate an object and set the lastname:

var x = new foo() { lastname = "John ÐØë", city = "John ÐØë" };

Now this is where my headache sets in (or the inception if you will...):

Visual Studio / source code is in Unicode
Hence: the object has a Unicode lastname
The XML serializer uses UTF-8 to encode the document
Lastname should contain only (Extended) ASCII characters; the characters are valid ASCII chars, but of course in UTF-8 encoded form

I normally don't experience any trouble with my encodings; I am familiar with The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) but this one's got me stumped...

I understand that the UTF-8 document will be perfectly able to "contain" both encodings because the codepoints 'overlap'. But where I get lost is when I need to convert the serialized message to a byte-array. When doing a dump I see C3 XX C3 XX C3 XX (I don't have the actual dump at hand). It's clear (or I've been staring at this for too long) that the lastname / city strings are put in the serialized document in their unicode form; the byte-array suggests so.

Now what will I have to do, and where, to ensure the Lastname string goes into the XML document and finally the byte-array as an ASCII string (and the actual 208, 216, 235 byte sequence), and that City makes it in there as ISOLatin1?

I know the requirements are backwards, but I can't change those (3rd party). I always use UTF-8 for our internal projects, so I have to support the Unicode/UTF-8 => ASCII/ISOLatin1 conversion (of course, only for chars that are in those sets).

My head hurts...

"The "ÐØë" part are (or should be) ASCII values 208, 216, 235 respectively." That is nonsensical. There are no ASCII values >127
There are; it's called Extended ASCII (en.wikipedia.org/wiki/Extended_ASCII). Although it's not standardized, I am required to allow (some) diacritics and thus am forced to use it (and hope for the best).
"The document must be encoded in UTF-8": is that the only part of the requirement we have to care about?
@ChrisS Well, the document must be in UTF8, the message must be put on the queue as a byte-array (which is (or should be) just the "UTF8 bytes"), and the value(s) in the lastname/city nodes must be in ASCII/ISOLatin1 (or at least, be encoded as ASCII/ISOLatin1 "within" the UTF8 or something)... I didn't make this up...

Nicholas Carey · Answer 1 · 2012-02-16 17:45:57Z

Never mind how the XML document is encoded for transmission. The right way to do what you want to do—encode certain non-ASCII characters so they survive the trip unscathed—is to use XML character references to represent the characters that need to be so preserved. For instance, your

ÐØë

is represented using XML character references as

&#x00D0;&#x00D8;&#x00EB;

The receiving [conformant] XML processor will/should/must convert those numeric character references back to the characters they represent. Here's some code that will do the trick:

// Requires "using System.Text;" and must live inside a static class (it is an extension method).
public static string ConvertToXmlCharacterReference( this string xml )
{
  StringBuilder sb  = new StringBuilder( xml.Length ) ;
  const char    SP  = '\u0020' ; // anything lower than SP is a control character
  const char    DEL = '\u007F' ; // DEL is a control character; anything above it isn't ASCII

  foreach( char ch in xml )
  {
    bool isPrintableAscii = ch >= SP && ch < DEL ;

    if ( isPrintableAscii ) { sb.Append(ch)                              ; }
    else                    { sb.AppendFormat( "&#x{0:X4};" , (int) ch ) ; } // a reference must end with ';'

  }

  string instance = sb.ToString() ;
  return instance ;
}

You could also use a regular expression to make the replacement or write an XSLT that would do the same thing. But the task is so trivial, it doesn't really warrant that sort of approach. The above code is probably faster and less memory intensive and...it's easier to understand.

You should note though that since you want to preserve two different encodings in the same document, your conversion routine will need to differentiate between the conversion from "extended ASCII" to an XML character reference and the conversion from "ISO Latin 1" to an XML character reference.

In both cases, the character reference specifies a codepoint in the ISO/IEC 10646 character set, which is essentially Unicode. You'll want to map the characters to the appropriate code point. Since strings in the CLR world are UTF-16 encoded, that shouldn't be much of an issue. The above code should work fine, I believe, unless you've got something really oddball that doesn't play very nicely with UTF-16.
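
To illustrate the round trip described in this answer, here is a minimal, self-contained sketch. The names CharRefDemo, ToCharRefs and ReadBack are made up for this example, and it assumes the input contains no markup characters (&, <, >) and no characters outside the Basic Multilingual Plane. It escapes a string to numeric character references and then lets a conformant parser (XDocument from System.Xml.Linq) turn them back into characters.

```csharp
using System;
using System.Text;
using System.Xml.Linq;

public static class CharRefDemo
{
    // Replace every character outside printable ASCII with a numeric
    // character reference such as &#x00D0; (note the closing semicolon).
    // Assumption: markup characters (&, <, >) are handled elsewhere.
    public static string ToCharRefs(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char ch in text)
        {
            if (ch >= '\u0020' && ch < '\u007F') sb.Append(ch);
            else sb.AppendFormat("&#x{0:X4};", (int)ch);
        }
        return sb.ToString();
    }

    // Parse a one-element document and return the element's text value,
    // i.e. what the receiving side would see after parsing.
    public static string ReadBack(string escaped)
    {
        return XDocument.Parse("<lastname>" + escaped + "</lastname>").Root.Value;
    }
}
```

Calling CharRefDemo.ToCharRefs("John \u00D0\u00D8\u00EB") yields a pure-ASCII string, so its UTF-8 byte count equals its character count, and CharRefDemo.ReadBack on that result returns the original name. Each escaped character costs 8 bytes on the wire, which matters if the receiver counts bytes before parsing.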

Hmmm; I hadn't considered XML character references (I do/did know of their existence). Now I am only curious whether they will count &#x00D0; as 8 bytes or as 1, since they told me my test-string (see my other responses) was too long... Will test...
From the standpoint of the document, it's a single character. Once it's properly parsed, the consumer should see a single character, just as you'd have to represent a < in your document's content as &lt; or &#x3C;.
I understand it's a single character (from the document's standpoint) but now I am really curious what the 3rd party's response will be. I'm afraid they're counting bytes... Oh well, their problem :P
I will accept this as an answer for now. I will keep everyone posted on what the actual result is going to be when the testing with the 3rd party has resumed...

Boo · Answer 2 · 2012-02-15 17:53:59Z

So, System.Text.Encoding.ASCII.GetBytes(string) will probably do what you want: convert a string into an ASCII-encoded byte array.

	That results in ???? characters... – RobIII Feb 15 at 18:08
	hmmm.. now my head hurts too. – Boo Feb 15 at 18:16
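
The "????" result mentioned in the comment above can be reproduced with a small sketch (AsciiFallbackDemo is a name invented for this example). Encoding.ASCII has no byte for characters above U+007F, so it substitutes a question mark (byte 63) for each of them.

```csharp
using System;
using System.Text;

public static class AsciiFallbackDemo
{
    // Encode with 7-bit ASCII; characters above U+007F cannot be
    // represented and are replaced by '?' (byte value 63).
    public static byte[] Encode(string text)
    {
        return Encoding.ASCII.GetBytes(text);
    }
}
```

For the input "\u00D0\u00D8\u00EB" this returns three bytes, each 63, so the original characters are lost rather than converted.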

Alexei Levenkov · Answer 3 · 2012-02-15 18:02:16Z

You simply can't have 208, 216, 235 byte sequence in UTF-8 encoded string/byte array.

I hope you can save XML as ISO 8859-1 with or without mentioning it in XML <?xml version="1.0" encoding="XXXXXXXXXX"?> processing instruction (maybe even specifying invalid UTF-8 encoding in XML header).

Otherwise if your requirements are as you stated - just ask for exact expected byte array for given input and craft your own custom serialization (or maybe custom encoding, also not sure if it is possible).

	"You simply can't have 208, 216, 235 byte sequence in UTF-8 encoded string/byte array." That would be because of the "Extended" part of the ASCII, right? Because "normal" ASCII does share codepoints (0-127) with UTF-8. I am not really looking forward to crafting my own serialization or custom encoding and should that be the only solution then sod it; I'll put the problem back in their lap. – RobIII Feb 15 at 18:04
	Found this on wikipedia: link: "That means all bytes 0x00-0x7F have the same meaning as in ASCII". So it looks like I will just have to drop the support for diacritics... – RobIII Feb 15 at 18:14
	Because 208 falls into the 0x80-0x7FF range, which must be encoded as 2 bytes in UTF8 (en.wikipedia.org/wiki/UTF-8). A valid UTF8 byte stream does not allow 2 bytes with `11` as highest bits to follow each other. – Alexei Levenkov Feb 15 at 18:21
	So I was correct all the way when I told the 3rd party they were mistakenly thinking that diacritics would be fine? There is no way to conform to these requirements and allow diacritics, am I right? – RobIII Feb 15 at 18:28
	You can not get 208, 216, 235 byte sequence, what you can get is valid encoding of characters with codes 208, 216, 235 as UTF8 byte stream (`Encoding.UTF8.GetBytes("\u00D0\u00D8\u00EB")` -> 195,144,195,152, 195,171). Obviously 208 is not the Unicode code for character you are looking for, but you still can have 208 in the string if have to (i.e. manually converting string to another encoding and rebuilding string with converted codes). – Alexei Levenkov Feb 15 at 19:06
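
The byte values discussed in these comments can be checked with a short sketch. Utf8BytesDemo and its method names are invented for this illustration; the encoding calls are standard System.Text ones. It encodes the same three characters as UTF-8 and as ISO-8859-1, and tests whether a raw byte sequence is well-formed UTF-8 by decoding it with a strict UTF8Encoding.

```csharp
using System;
using System.Text;

public static class Utf8BytesDemo
{
    // UTF-8: every character in U+0080..U+07FF becomes two bytes.
    public static byte[] AsUtf8(string text)
    {
        return Encoding.UTF8.GetBytes(text);
    }

    // ISO-8859-1: every character in U+0000..U+00FF becomes one byte.
    public static byte[] AsLatin1(string text)
    {
        return Encoding.GetEncoding("iso-8859-1").GetBytes(text);
    }

    // A strict decoder (throwOnInvalidBytes = true) rejects malformed input.
    public static bool IsValidUtf8(byte[] bytes)
    {
        var strict = new UTF8Encoding(false, true);
        try { strict.GetString(bytes); return true; }
        catch (DecoderFallbackException) { return false; }
    }
}
```

For "\u00D0\u00D8\u00EB", AsUtf8 gives 195,144,195,152,195,171 and AsLatin1 gives 208,216,235. Passing the Latin-1 bytes to IsValidUtf8 returns false, which is why they cannot appear as-is in a document that declares UTF-8.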

Douglas · Answer 4 · 2012-02-15 19:24:25Z

The document must be encoded in UTF-8. The Lastname field only allows ASCII. City only allows ISOLatin1. The message must be put on the (IBM Websphere) MessageQueue as a IBytesMessage.

If that is the precise specification, then I think you might be misunderstanding it. Your task is not one of encoding, but one of validation/fallback. The entire document – including the Lastname and City fields – must be encoded as UTF-8. Quite simply, the XML document would be invalid if it declares its encoding as UTF-8 and then contains byte values that are not valid under that encoding.

Conveniently, ASCII overlaps with the first 128 codepoints of Unicode; Latin1 overlaps with the first 256.

To check whether Lastname can be represented as ASCII, you could check that all its characters have codepoints within the 0–127 range.

bool isLastnameAscii = foo.Lastname.All(c => (int)c < 128);

To conform with your specification, you would have to force invalid characters to fall back to the replacement character (typically ?) by encoding the string as ASCII, and then decoding it back:

foo.Lastname = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(foo.Lastname));

Similarly for City:

bool isCityLatin1 = foo.City.All(c => (int)c < 256);

Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
foo.City = latin1.GetString(latin1.GetBytes(foo.City));

Subsequently, you should just save everything as UTF-8.
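
The validation and fallback steps above can be collected into one self-contained sketch. FieldSanitizer and its members are hypothetical names, not part of the original code. One deliberate difference: the Latin-1 encoding is created with an explicit "?" replacement fallback, so the result does not depend on whatever default fallback the runtime uses for that encoding.

```csharp
using System;
using System.Linq;
using System.Text;

public static class FieldSanitizer
{
    // ISO-8859-1 with an explicit '?' replacement for unmappable characters.
    private static readonly Encoding Latin1 = Encoding.GetEncoding(
        "iso-8859-1",
        new EncoderReplacementFallback("?"),
        new DecoderReplacementFallback("?"));

    // ASCII covers code points 0..127, Latin-1 covers 0..255.
    public static bool IsAscii(string s) { return s.All(c => c < 128); }
    public static bool IsLatin1(string s) { return s.All(c => c < 256); }

    // Round-trip through the target encoding so that anything it cannot
    // represent is replaced by '?'.
    public static string ToAscii(string s)
    {
        return Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));
    }

    public static string ToLatin1(string s)
    {
        return Latin1.GetString(Latin1.GetBytes(s));
    }
}
```

With this sketch, FieldSanitizer.ToAscii("Du Pr\u00E9") loses the accented letter, while FieldSanitizer.ToLatin1 keeps it; a name such as "\u0141\u00F3d\u017A" keeps only the characters that exist in Latin-1. The sanitized strings are still ordinary .NET strings and would then be serialized as UTF-8 like the rest of the document.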

My assumption is that your third-party software can correctly decode the XML document using UTF-8; however, it must then extract the Lastname and City fields, and use them somewhere where only ASCII and Latin1 are allowed. It imposes the restrictions on you in order to ensure that it would not be forced to incur data loss (because of the presence of disallowed characters).

Edit: This is the workaround that you’re proposing. I’m using Latin1 in the place of “Extended ASCII” because the latter term is ambiguous.

var x = new foo() { lastname = "John ÐØë", city = "John ÐØë", other = "—" };

using (var stream = new MemoryStream())
using (var utf8writer = new StreamWriter(stream, Encoding.UTF8))            
using (var latin1writer = new StreamWriter(stream, Encoding.GetEncoding("iso-8859-1")))
{
    utf8writer.WriteLine("<?xml version=\"1.0\" encoding=\"utf-8\"?>");
    utf8writer.WriteLine("<foo>");
    utf8writer.Flush();

    latin1writer.WriteLine("  <lastname>" + SecurityElement.Escape(x.lastname) + "</lastname>");
    latin1writer.WriteLine("  <city>" + SecurityElement.Escape(x.city) + "</city>");
    latin1writer.Flush();

    utf8writer.WriteLine("  <other>" + SecurityElement.Escape(x.other) + "</other>");
    utf8writer.WriteLine("</foo>");
    utf8writer.Flush();

    byte[] bytes = stream.ToArray();
}

SecurityElement.Escape replaces invalid XML characters in a string with their valid XML equivalent (e.g. < to &lt; and & to &amp;).

Unfortunately: no. For example: the lastname is (also) restricted to 70 chars. I sent the test-string "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz&'!<>ÀÁÂÃÄÅÆÇÈÉÊËÌ" which resulted in (the 3rd party's) response: the lastname is too long: it is 83 bytes which should be 70. Of course it is, it contains diacritics. But I am also explicitly told diacritics are allowed in the lastname field...
As a result I started doubting my own knowledge, but it seems I wasn't wrong all along. You are probably correct that they need to interface with legacy stuff; and I had the regexes (0x00-0x7F) in place already but removed them because I have to handle names like "Du Pré" and they told me explicitly diacritics wouldn't be a problem. But now they're telling me the test-string is too long; it isn't: it is exactly 70 chars long, but 83 bytes....
Try the code given in my update; it would encode Latin1 characters (such as é) to just one byte.
I am for sure as hell not going to write out the XML document "by hand". I need to serialize a crapload of object-types; that would cost days to get right for each document.
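
The 70 versus 83 discrepancy from the comments above can be reproduced with a few lines. LengthDemo is a name invented for this sketch; the test string is the one quoted in the comment, with the accented letters written as \u escapes so the source file's own encoding does not matter.

```csharp
using System;
using System.Text;

public static class LengthDemo
{
    // 52 letters + 5 punctuation characters + 13 accented letters (U+00C0..U+00CC).
    public const string TestString =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz&'!<>" +
        "\u00C0\u00C1\u00C2\u00C3\u00C4\u00C5\u00C6\u00C7\u00C8\u00C9\u00CA\u00CB\u00CC";

    // Number of bytes the string occupies when encoded as UTF-8.
    public static int Utf8Bytes(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    // Number of bytes the string occupies when encoded as ISO-8859-1.
    public static int Latin1Bytes(string s)
    {
        return Encoding.GetEncoding("iso-8859-1").GetByteCount(s);
    }
}
```

LengthDemo.TestString.Length is 70, Utf8Bytes gives 83 (57 one-byte characters plus 13 two-byte characters) and Latin1Bytes gives 70. That suggests the receiver is counting UTF-8 bytes rather than characters. The count ignores any XML escaping of &, < and >, which would add further bytes.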
