I have a byte array that I know contains an XML-serialized object. Is there any way to get the encoding from it?

I'm not going to deserialize it; I'm saving it in an xml field on a SQL Server, so I need to convert it to a string.


3 Answers

up vote 5 down vote accepted

You could look at the first 40-ish bytes*. They should contain the XML declaration (assuming the document has one). Either the declaration names the encoding, or you can assume it's UTF-8 or UTF-16, which should be obvious from how you've understood the "

Realistically, do you expect you'll ever get anything other than UTF-8 or UTF-16? If not, you could check for the patterns you get at the start of both of those and throw an exception if it doesn't follow either pattern. Alternatively, if you want to make another attempt, you could always try to decode the document as UTF-8, re-encode it and see if you get the same bytes back. It's not ideal, but it might just work.
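A minimal sketch of the decode/re-encode check described above. The class and method names are hypothetical, and it assumes invalid UTF-8 input is replaced with U+FFFD on decoding (the .NET default), so that any invalid byte sequence changes the re-encoded bytes:

```csharp
using System;
using System.Linq;
using System.Text;

public static class XmlEncodingGuess
{
    // Hypothetical helper: true if the bytes survive a UTF-8 decode
    // followed by a UTF-8 re-encode without changing.
    public static bool RoundTripsAsUtf8(byte[] bytes)
    {
        // Do not emit a BOM of our own when re-encoding. A BOM already in
        // the input is decoded to U+FEFF and re-encoded to the same 3 bytes.
        var utf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);

        // Invalid sequences become U+FFFD here, which re-encodes to
        // different bytes, so the comparison below fails for them.
        string text = utf8.GetString(bytes);
        byte[] again = utf8.GetBytes(text);

        return bytes.SequenceEqual(again);
    }
}
```

One caveat: UTF-16 without a BOM that contains only ASCII characters also passes this check, because the zero bytes are valid UTF-8 (they decode as U+0000). It is safer to combine it with a look at the first few bytes.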

I'm sure there are more rigorous ways of doing this, but they're likely to be finicky :)

* Quite possibly less than this. I figure 20 characters should be enough, which is 40 bytes in UTF-16.

Downvoters: if you're going to downvote, please provide a comment. Otherwise the downvote serves no real purpose. – Jon Skeet Apr 24 '09 at 17:15
They just be hatin. – Donnie Jun 2 '12 at 2:07

A solution similar to this question could solve this by using a Stream over the byte array. Then you won't have to fiddle at the byte level. Like this:

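using System.IO;
using System.Text;
using System.Xml;

// bytes is the byte[] holding the serialized XML
Encoding encoding;
using (var stream = new MemoryStream(bytes))
using (var xmlreader = new XmlTextReader(stream))
{
    // Advance to the root element so the reader has consumed
    // any BOM and XML declaration
    xmlreader.MoveToContent();
    // Encoding is only available once reading has started
    encoding = xmlreader.Encoding;
}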
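Once the encoding is known, turning the bytes into a string for the xml column could look roughly like this. It is only a sketch: XmlBytes and Decode are hypothetical names, and the encoding argument is assumed to be the value obtained from the reader above.

```csharp
using System.IO;
using System.Text;

public static class XmlBytes
{
    // Hypothetical helper: decode the XML bytes with an already-detected encoding.
    public static string Decode(byte[] bytes, Encoding encoding)
    {
        using (var stream = new MemoryStream(bytes))
        // StreamReader checks for a BOM by default and consumes it, so the
        // returned string does not start with a stray U+FEFF character.
        using (var reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

A StreamReader is used here rather than encoding.GetString(bytes) because GetString keeps a leading BOM as a character in the result.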

The first 2 or 3 bytes may be a BOM, which can tell you whether the stream is UTF-8, UTF-16 little-endian (Unicode-LittleEndian) or UTF-16 big-endian (Unicode-BigEndian).

UTF-8 BOM is 0xEF 0xBB 0xBF
Unicode-BigEndian BOM is 0xFE 0xFF
Unicode-LittleEndian BOM is 0xFF 0xFE

If none of these is present, you can decode the start as ASCII and test for <?xml (note that most modern XML generators stick to the standard, which does not allow white space to precede the XML declaration).

The declaration only uses ASCII characters up to the closing ?>, so you can look for encoding= and read its value. If encoding is not present, or there is no <?xml declaration at all, you can assume UTF-8.
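The steps above can be sketched as follows. This is an illustration rather than a complete detector: XmlEncodingSniffer and Detect are hypothetical names, the 100-byte window is an arbitrary assumption, and a document in BOM-less UTF-16 is not handled (it falls through to the UTF-8 default).

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class XmlEncodingSniffer
{
    public static Encoding Detect(byte[] bytes)
    {
        // 1. Look for a byte order mark.
        if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
            return Encoding.UTF8;
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
            return Encoding.BigEndianUnicode;   // UTF-16 big-endian
        if (bytes.Length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
            return Encoding.Unicode;            // UTF-16 little-endian

        // 2. No BOM: read the start as ASCII and look for the declaration.
        //    The 100-byte window is an assumption, not a limit from any spec.
        string head = Encoding.ASCII.GetString(bytes, 0, Math.Min(bytes.Length, 100));
        if (head.StartsWith("<?xml", StringComparison.Ordinal))
        {
            // [^>]* keeps the match inside the declaration, which ends at "?>".
            Match m = Regex.Match(
                head,
                @"^<\?xml[^>]*?\sencoding\s*=\s*[""']([A-Za-z][A-Za-z0-9._\-]*)[""']");
            if (m.Success)
            {
                // Throws ArgumentException if the runtime does not know the name.
                return Encoding.GetEncoding(m.Groups[1].Value);
            }
        }

        // 3. No BOM and no encoding attribute: assume UTF-8.
        return Encoding.UTF8;
    }
}
```

On .NET Core and later, names such as windows-1252 are only recognised by Encoding.GetEncoding after the code pages encoding provider has been registered, so a real implementation may want to catch the ArgumentException and decide on a fallback.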
