Tell me more ×
Stack Overflow is a question and answer site for professional and enthusiast programmers. It's 100% free, no registration required.

The base of the problem:

Pun intended.

The problem starts with a very old dBase database where the textual information is encoded directly into DOS Cyrillic (CP-866), and because that's not enough of a problem, it's also being transferred to a MySQL database every evening, to which I have access.

I've installed the MySQL Providers and connected to the database with Entity Framework, which was my main data access method, and then for experimental reasons with pure ADO.NET as well.

Everything was going better than expected until I attempted to convert the supposedly CP-866 values from the database to UTF-8, like so:

var cp866 = Encoding.GetEncoding(866);
var utf8 = Encoding.UTF8;

string source = "some unreadable set of characters from the database";
byte[] cp866bytes = cp866.GetBytes(source);
byte[] utf8bytes = Encoding.Convert(cp866, utf8, cp866bytes);
string result = utf8.GetString(utf8bytes);

I've read it once with EntityFramework and once with ADO.NET with the same result.

For unknown at the time and less-unknown now reasons, it didn't work. After reading some important articles about encoding and string values I've determined that it is not possible to apply such conversions to the string equivalent of the varchar field in the database because of the nature of the string variable itself.

Few keyboard bangs later, I've finally made it happen by using ADO.NET MySQL Provider and customizing my query by adding CONVERT(varcharColumn, Binary) to the column I was testing with.

From then on, I used the above code with the only difference that I already had the cp866 byte array from the convert. I originally intended to do something similar but the MySQL provider wasn't able to directly read bytes from a varchar field, neither I found a way to do it with Entity Framework.

Yes, it works, but it doesn't feel right even to my unexperienced self.



Questions:

1: Can I specify how Entity Framework should select specific fields?

I would like to somehow explain my beloved ORM that it should be converting specific varchar fields to binary during the read, without returning the string representation at all, because it messes everything up.

2: Is there a way to make the ADO.NET MySQL provider to get the bytes of a varchar field, without pulling it as a string first?

The GetBytes method throws an exception when used with varchar, and the GetSqlBytes method that is normally present in ADO.NET provider is missing in the MySQL version. I don't really want to write Binary Convert on every field which I need to read properly.

3: Bonus Question: Is it possible to read the CP-866 encoded varchar field as a string as I did, but this time properly changing the encoding to UTF-8?

There is still a lot of chaos in my head on the encoding topic after today's reading. I still believe there might be something I am missing and that is possible to read a string from the cp-866 encoded varchar fields, like:

string cp866EncodedValue = "Œ€„‹… Œ‹€„…Ž‚€ Šš…‚€"; //actual copy-pasted value

..and then convert it to UTF-8, while having in mind the field in the database was encoded with CP-866. From what I've read, as soon as it's in a string, it's unicode and the string is immutable. I've tried getting it's byre array representation, changing it to cp866, then to utf8, I tried using it as it is cp866 itself, but without success.

share|improve this question
MySQL supports CP-866 natively; if set up correctly, it can convert to UTF-8 for you (upon either data insertion or retrieval, as you wish). Chances are that the character set of the connection over which you originally inserted the data was incorrect, which has led to storage of bad data in the table: you can quickly verify this manually with e.g. SELECT HEX(mycolumn) FROM mytable. – eggyal Jun 8 at 7:28
Is that setting from the server or is it being setup through the connection string? I tried setting charset option on the connection string without success. I haven't deal with the data insertion, it's another team that was importing from some dBase databases in another company. – Peter Jun 9 at 22:47
1  
The driver will request the desired character set upon connection to the server; it may be possible to inform the driver of the desired character set in the DSN, but one can always change the configuration after connecting using a suitable command e.g. SET NAMES. See Connection Character Sets and Collations for more info. – eggyal Jun 9 at 23:34

This question has an open bounty worth +50 reputation from Peter ending tomorrow.

The question is widely applicable to a large audience. A detailed canonical answer is required to address all the concerns.

Know someone who can answer? Share a link to this question via email, Google+, Twitter, or Facebook.

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Browse other questions tagged or ask your own question.