How do I convert a string to a byte[] in .NET (C#)?
Also, why should encoding be taken into consideration? Can't I simply get what bytes the string has been stored in? Why is there a dependency on character encodings?
---
Contrary to the answers here, you DON'T need to worry about encoding if the bytes don't need to be interpreted! Like you mentioned, your goal is, simply, to "get what bytes the string has been stored in". For that goal, I honestly do not understand why people keep telling you that you need the encodings. You certainly do NOT need to worry about encodings for this. Just do this instead:
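The code this answer pointed at was stripped from this copy. What follows is a reconstruction of the encoding-agnostic approach it describes, using Buffer.BlockCopy (which later answers in this thread attribute to this approach); it may not be the author's exact code:

```csharp
using System;

byte[] b = GetBytes("hello");
Console.WriteLine(GetString(b)); // round-trips back to "hello"

// Copy the string's raw UTF-16 code units into a byte array,
// without interpreting them through any encoding.
static byte[] GetBytes(string str)
{
    byte[] bytes = new byte[str.Length * sizeof(char)];
    Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
    return bytes;
}

// Reverse: reinterpret the bytes as UTF-16 code units.
static string GetString(byte[] bytes)
{
    char[] chars = new char[bytes.Length / sizeof(char)];
    Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
    return new string(chars);
}
```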
As long as your program (or other programs) don't try to interpret the bytes somehow, which you obviously didn't mention you intend to do, then there is nothing wrong with this approach! Worrying about encodings just makes your life more complicated for no real reason.

An additional benefit of this approach: it doesn't matter if the string contains invalid characters, because you can still get the data and reconstruct the original string anyway! It will be encoded and decoded just the same, because you are just looking at the bytes. If you used a specific encoding, though, it would have given you trouble with encoding/decoding invalid characters.
---
It depends on the encoding of your string (ASCII, UTF-8, ...). For example:
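The example was stripped here; presumably it showed System.Text.Encoding's GetBytes under different encodings, along these lines (a sketch, not the original code):

```csharp
using System;
using System.Text;

string s = "hello";
byte[] asciiBytes = Encoding.ASCII.GetBytes(s);   // 5 bytes
byte[] utf8Bytes  = Encoding.UTF8.GetBytes(s);    // 5 bytes
byte[] utf16Bytes = Encoding.Unicode.GetBytes(s); // 10 bytes: .NET's internal UTF-16
Console.WriteLine($"{asciiBytes.Length} {utf8Bytes.Length} {utf16Bytes.Length}");
```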
A small sample why encoding matters:
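The sample itself was lost in this copy; here is a reconstruction of the kind of demonstration the answer gave, showing ASCII silently destroying data while UTF-8 preserves it (the test string is my own choice):

```csharp
using System;
using System.Text;

string s = "Déjà vu";

// ASCII cannot represent 'é' or 'à'; the default fallback turns them into '?'.
string viaAscii = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));
string viaUtf8  = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(s));

Console.WriteLine(viaAscii); // D?j? vu
Console.WriteLine(viaUtf8);  // Déjà vu
```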
ASCII simply isn't equipped to deal with special characters. Internally, the .NET Framework uses UTF-16 to represent strings, so if you simply want to get the exact bytes that .NET uses, use System.Text.Encoding.Unicode.GetBytes(...). See Character Encoding in the .NET Framework (MSDN) for more information.
---
The accepted answer is very, very complicated. Use the included .NET classes for this:
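The snippet was stripped from this copy; the "included .NET classes" in question are presumably System.Text.Encoding and friends, e.g.:

```csharp
using System;
using System.Text;

string someString = "hello";
byte[] bytes = Encoding.UTF8.GetBytes(someString);
string back   = Encoding.UTF8.GetString(bytes);
Console.WriteLine(back);
```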
Don't reinvent the wheel if you don't have to...
---
You need to take the encoding into account, because one character could be represented by one or more bytes (up to about 6), and different encodings will treat these bytes differently. Joel Spolsky has a posting on this: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
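To illustrate the point (an example of my own, not from the original answer): a single character can occupy two, three, or four bytes depending on the encoding.

```csharp
using System;
using System.Text;

string euro = "€"; // one character, code point U+20AC
Console.WriteLine(Encoding.UTF8.GetBytes(euro).Length);    // 3
Console.WriteLine(Encoding.Unicode.GetBytes(euro).Length); // 2
Console.WriteLine(Encoding.UTF32.GetBytes(euro).Length);   // 4
```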
---
This is a popular question. It is important to understand what the question author is asking, and that it is different from what is likely the most common need. To discourage misuse of the code where it is not needed, I've answered the more common need first.

Common Need

Every string has a character set and encoding. When you convert a string to an array of bytes, the conversion happens with respect to a particular character set and encoding.
The conversion may need to handle cases where the target character set or encoding doesn't support a character that's in the source. You have some choices: exception, substitution or skipping. The default policy is to substitute a '?'.
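A small illustration of those policies (my own sketch; the answer's original code was stripped):

```csharp
using System;
using System.Text;

// Default policy: substitute '?' (0x3F).
byte[] substituted = Encoding.ASCII.GetBytes("π");
Console.WriteLine(substituted[0] == (byte)'?'); // True

// Opt in to an exception instead:
Encoding strict = Encoding.GetEncoding(
    "us-ascii",
    EncoderFallback.ExceptionFallback,
    DecoderFallback.ExceptionFallback);
try
{
    strict.GetBytes("π");
}
catch (EncoderFallbackException)
{
    Console.WriteLine("threw as requested");
}
```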
Clearly, conversions are not necessarily lossless!

Note: the only confusing thing is that .NET uses the name of a character set ("Unicode") for the name of one particular encoding of that character set (UTF-16).

That's it for most usages. If that's what you need, stop reading here. See the fun Joel Spolsky article if you don't understand what an encoding is.

Specific Need

Now, the question author asks, "Every string is stored as an array of bytes, right? Why can't I simply have those bytes?" He doesn't want any conversion. From the C# spec:
So, we know that if we ask for the null conversion (i.e., from UTF-16 to UTF-16), we'll get the desired result:
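The stripped snippet presumably asked for the UTF-16 ("Unicode") encoding, which is what .NET uses internally:

```csharp
using System;
using System.Text;

string s = "A";
byte[] bytes = Encoding.Unicode.GetBytes(s); // UTF-16LE: { 0x41, 0x00 }
Console.WriteLine(bytes.Length);
```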
But to avoid the mention of encodings, we must do it another way. If an intermediate data type is acceptable, there is a conceptual shortcut for this:
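The conceptual shortcut referred to is presumably ToCharArray, which hands back the string's UTF-16 code units as an intermediate Char array:

```csharp
string s = "hello";
char[] chars = s.ToCharArray(); // the intermediate data type
```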
That doesn't get us the desired datatype, but Mehrdad's answer shows how to convert this Char array to a Byte array using BlockCopy. However, this copies the string twice! And it, too, explicitly uses encoding-specific code: the datatype System.Char. The only way to get to the actual bytes the String is stored in is to use a pointer. The fixed statement allows taking the address of the string's character data.
To do so, the compiler writes code to skip over the other parts of the string object with RuntimeHelpers.OffsetToStringData. So, to get the raw bytes, just use a pointer to the string and copy them:
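The original pointer-based snippet was stripped; a sketch of the idea (pin the string with fixed and copy its raw UTF-16 bytes), not necessarily the author's exact code, and requiring the "Allow Unsafe Code" build option:

```csharp
using System;
using System.Runtime.InteropServices;

byte[] raw = GetRawBytes("A");
Console.WriteLine(raw[0]); // 65 (0x41), then 0, on a little-endian machine

static unsafe byte[] GetRawBytes(string str)
{
    byte[] bytes = new byte[str.Length * sizeof(char)];
    fixed (char* p = str) // pins the string's character data
        Marshal.Copy((IntPtr)p, bytes, 0, bytes.Length);
    return bytes;
}
```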
As @CodesInChaos pointed out, the result depends on the endianness of the machine. But the question author is not concerned with that.
---
Just to demonstrate that Mehrdad's sound answer works: his approach can even persist unpaired surrogate characters (a flaw many have leveled against my answer, but of which everyone is equally guilty, e.g. System.Text.Encoding.UTF8.GetBytes).
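The demonstration code and its output were stripped from this copy; here is a reconstruction of the experiment in the same spirit (my own code): round-trip a string ending in an unpaired high surrogate through a raw BlockCopy versus through UTF-8.

```csharp
using System;
using System.Text;

string curse = "Hello" + (char)0xD800; // unpaired high surrogate at the end

// Raw round trip: the surrogate survives.
byte[] raw = new byte[curse.Length * sizeof(char)];
Buffer.BlockCopy(curse.ToCharArray(), 0, raw, 0, raw.Length);
char[] chars = new char[raw.Length / sizeof(char)];
Buffer.BlockCopy(raw, 0, chars, 0, raw.Length);
Console.WriteLine(curse == new string(chars)); // True

// Encoding round trip: the surrogate is replaced with U+FFFD.
string lossy = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(curse));
Console.WriteLine(curse == lossy); // False
```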
Try that with System.Text.Encoding.UTF8.GetBytes or System.Text.Encoding.Unicode.GetBytes: they will merely replace the unpaired surrogate characters with the value U+FFFD. Every time there's movement on this question, I'm still thinking of a serializer (be it from Microsoft or from a 3rd-party component) that can persist strings even if they contain unpaired surrogate characters; I google this every now and then: serialization unpaired surrogate character .NET. This doesn't make me lose any sleep, but it's kind of annoying when every now and then somebody comments on my answer that it's flawed, yet their answers are equally flawed when it comes to unpaired surrogate characters. Darn, Microsoft should have just used System.Buffer.BlockCopy in BinaryFormatter. 谢谢! (Thanks!)
---
Try this, a lot less code:
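The snippet was stripped; it was presumably a one-liner along these lines:

```csharp
byte[] bytes = System.Text.Encoding.UTF8.GetBytes("some text");
```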
---
The first part of your question (how to get the bytes) was already answered by others: look in the System.Text.Encoding namespace.

I will address your follow-up question: why do you need to pick an encoding? Why can't you get that from the string class itself? The answer is that the bytes used internally by the string class don't matter.

If your program is entirely within the .NET world, then you don't need to worry about getting byte arrays for strings at all, even if you're sending data across a network. Instead, use .NET Serialization to worry about transmitting the data. You don't worry about the actual bytes any more: the serialization formatter does it for you.

On the other hand, what if you are sending these bytes somewhere that you can't guarantee will pull in data from a .NET serialized stream? In this case you definitely do need to worry about encoding, because obviously this external system cares. So again, the internal bytes used by the string don't matter: you need to pick an encoding so you can be explicit about this encoding on the receiving end.

I understand that in this case you might prefer to use the actual bytes stored by the string variable in memory where possible, with the idea that it might save some work creating your byte stream. But that's just not important compared to making sure that your output is understood at the other end, and to guarantee that, you must be explicit with your encoding. If you really want to match your internal bytes, just use the Encoding.Unicode (UTF-16) encoding.
---
Well, I've read all the answers, and they were about using encoding, or one about serialization that drops unpaired surrogates. That's bad when the string, for example, comes from SQL Server, where it was built from a byte array storing, for example, a password hash. If we drop anything from it, it'll store an invalid hash, and if we want to store it in XML, we want to leave it intact (because the XML writer throws an exception on any unpaired surrogate it finds). So I use Base64 encoding of byte arrays in such cases, but hey, on the Internet there is only one solution to this in C#, and it had a bug in it and only went one way, so I've fixed the bug and written the reverse procedure. Here you are, future googlers:
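The fixed implementation was stripped from this copy. Conceptually, it round-trips the string's raw UTF-16 bytes through Base64 so nothing (not even an unpaired surrogate) is lost; a sketch in that spirit, not the author's exact code:

```csharp
using System;

string original = "hash" + (char)0xDC00; // even an unpaired surrogate survives

string base64 = ToBase64(original);
Console.WriteLine(FromBase64(base64) == original); // True

// String -> Base64 of its raw UTF-16LE bytes.
static string ToBase64(string s)
{
    byte[] bytes = new byte[s.Length * sizeof(char)];
    Buffer.BlockCopy(s.ToCharArray(), 0, bytes, 0, bytes.Length);
    return Convert.ToBase64String(bytes);
}

// Base64 -> the original string, intact.
static string FromBase64(string base64)
{
    byte[] bytes = Convert.FromBase64String(base64);
    char[] chars = new char[bytes.Length / sizeof(char)];
    Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
    return new string(chars);
}
```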
|
||||
|
You can use the following code for conversion between a string and a byte array.
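The code was stripped here; presumably a round trip through one of the System.Text.Encoding classes, e.g.:

```csharp
using System.Text;

byte[] bytes = Encoding.Unicode.GetBytes("sample");
string text  = Encoding.Unicode.GetString(bytes);
```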
|
|||
|
C# to convert a string to a byte array:
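The snippet was stripped; a sketch of the helper-method style this answer likely used (the method name and the choice of ASCIIEncoding are my assumptions):

```csharp
using System;
using System.Text;

byte[] result = StrToByteArray("Hi");
Console.WriteLine(result.Length); // 2

static byte[] StrToByteArray(string str)
{
    ASCIIEncoding encoding = new ASCIIEncoding();
    return encoding.GetBytes(str);
}
```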
---
Because there is no such thing as "the bytes of the string".

A string (or more generically, a text) is composed of characters: letters, digits, and other symbols. That's all. Computers, however, do not know anything about characters; they can only handle bytes. Therefore, if you want to store or transmit text by using a computer, you need to transform the characters into bytes. How do you do that? Here's where encodings come onto the scene.

An encoding is nothing but a convention to translate logical characters into physical bytes. The simplest and best-known encoding is ASCII, and it is all you need if you write in English. For other languages you will need more complete encodings, with any of the Unicode flavours being the safest choice nowadays.

So, in short, trying to "get the bytes of a string without using encodings" is as impossible as "writing a text without using any language".

By the way, I strongly recommend you (and anyone, for that matter) to read this small piece of wisdom: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
---
I'm not sure, but I think the string stores its info as an array of Chars, which is inefficient compared with bytes. Specifically, the definition of a Char is "Represents a Unicode character". Take this example:
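The original sample strings were lost in this copy; the pair below is my own reconstruction, chosen to be consistent with the byte counts the answer quotes (two 7-character strings, one containing non-ASCII characters):

```csharp
using System;
using System.Text;

string special = "Ä and Ö"; // 7 chars, two of them non-ASCII
string plain   = "A and O"; // 7 plain ASCII chars

Console.WriteLine(Encoding.Unicode.GetBytes(special).Length); // 14
Console.WriteLine(Encoding.Unicode.GetBytes(plain).Length);   // 14
Console.WriteLine(Encoding.UTF8.GetBytes(special).Length);    // 9
Console.WriteLine(Encoding.UTF8.GetBytes(plain).Length);      // 7
```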
Take note that the Unicode answer is 14 bytes in both instances, whereas the UTF-8 answer is only 9 bytes for the first, and only 7 for the second. So if you just want the bytes used by the string, simply use Encoding.Unicode, but it will be inefficient with storage space.
---
Fastest way
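The snippet backing this claim was stripped; it was presumably a direct call such as the following (which encoding the author actually used is unknown):

```csharp
byte[] bytes = System.Text.Encoding.UTF8.GetBytes("text");
```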
---
The key issue is that a glyph in a string takes 32 bits (16 bits for a character code), but a byte only has 8 bits to spare. A one-to-one mapping doesn't exist unless you restrict yourself to strings that only contain ASCII characters. System.Text.Encoding has lots of ways to map a string to byte[]; you need to pick one that avoids loss of information and that is easy for your client to use when she needs to map the byte[] back to a string. UTF-8 is a popular encoding; it is compact and not lossy.
---
You can use the following code to convert a string to a byte array:
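The code itself was stripped; presumably something along these lines:

```csharp
using System.Text;

string str = "Hello";
byte[] byteArray = new UTF8Encoding(true).GetBytes(str);
```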
---
Here is my unsafe implementation of a String to Byte[] conversion:
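The implementation itself was stripped from this copy; a sketch of an unsafe, pointer-based copy in the spirit described (not the author's exact code, and it requires the unsafe compiler option):

```csharp
using System;

byte[] b = GetBytes("Hi");
Console.WriteLine(b.Length); // 4

static unsafe byte[] GetBytes(string str)
{
    byte[] bytes = new byte[str.Length * sizeof(char)];
    fixed (char* src = str)
    fixed (byte* dst = bytes)
    {
        byte* s = (byte*)src;
        for (int i = 0; i < bytes.Length; i++)
            dst[i] = s[i]; // raw byte-for-byte copy of the UTF-16 data
    }
    return bytes;
}
```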
It's way faster than the accepted answer's, even if not as elegant. Here are my Stopwatch benchmarks over 10,000,000 iterations:
In order to use it, you have to tick "Allow Unsafe Code" in your project build properties. As of .NET Framework 3.5, this method can also be used as a String extension:
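The extension-method wrapper was stripped; it presumably looked something like this (the name ToByteArray is my assumption, and a safe BlockCopy body is shown here for brevity):

```csharp
using System;

Console.WriteLine("hello".ToByteArray().Length); // 10

public static class StringExtensions
{
    public static byte[] ToByteArray(this string str)
    {
        byte[] bytes = new byte[str.Length * sizeof(char)];
        Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }
}
```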
---
Simple code with LINQ:
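The snippet was stripped; the LINQ one-liner was presumably a per-char cast like this (note it throws away the high byte of each char):

```csharp
using System;
using System.Linq;

string s = "abc";
byte[] bytes = s.Select(c => (byte)c).ToArray(); // { 97, 98, 99 }
Console.WriteLine(bytes.Length);
```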
EDIT: As commented below, it is not a good way. But you can still use it to understand LINQ with a more appropriate coding:
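The improved snippet was also stripped; a LINQ version that keeps both bytes of every UTF-16 code unit could look like this (my reconstruction):

```csharp
using System;
using System.Linq;

string s = "ab";
byte[] bytes = s.SelectMany(c => BitConverter.GetBytes(c)).ToArray();
// { 97, 0, 98, 0 } on a little-endian machine
Console.WriteLine(bytes.Length);
```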
---
Two ways:
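The first snippet was stripped; presumably the straightforward Encoding call:

```csharp
byte[] bytes = System.Text.Encoding.UTF8.GetBytes("hello");
```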
And,
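The second snippet was stripped as well; given the closing remark about preferring "the bottom one", it may have been the raw-copy variant (this reconstruction is a guess):

```csharp
using System;

string text = "hello";
char[] chars = text.ToCharArray();
byte[] bytes = new byte[chars.Length * sizeof(char)];
Buffer.BlockCopy(chars, 0, bytes, 0, bytes.Length);
```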
I tend to use the bottom one more often than the top; I haven't benchmarked them for speed.
---
Simply use this:
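The snippet was stripped; presumably:

```csharp
byte[] bytes = System.Text.Encoding.ASCII.GetBytes("simple");
```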
---
A string can be converted to a byte array in a few different ways, due to the following fact: .NET supports Unicode, and Unicode standardizes several different encodings called UTFs. They have different lengths of byte representation but are equivalent in the sense that when a string is encoded, it can be decoded back to the string; but if the string is encoded with one UTF and decoded under the assumption of a different UTF, it can be screwed up.

Also, .NET supports non-Unicode encodings, but they are not valid in the general case (they will be valid only if a limited subset of Unicode code points is used in the actual string, such as ASCII). Internally, .NET supports UTF-16, but for stream representation, UTF-8 is usually used. It is also a de facto standard for the Internet.

Not surprisingly, serialization of a string into an array of bytes and deserialization are supported by the class System.Text.Encoding. Ref: http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx

For serialization to an array of bytes, use System.Text.Encoding.GetBytes; for deserialization, use System.Text.Encoding.GetString. Example:
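The example was stripped from this copy; a minimal round trip with UTF-8, consistent with the answer's recommendation:

```csharp
using System.Text;

byte[] bytes = Encoding.UTF8.GetBytes("Hello, world!");
string text  = Encoding.UTF8.GetString(bytes);
```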
---
from byte[] to string:
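The snippet was stripped; the reverse direction is presumably just GetString:

```csharp
using System.Text;

byte[] bytes = { 72, 105 };
string s = Encoding.UTF8.GetString(bytes); // "Hi"
```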
---
The closest approach to the OP's question is Tom Blodget's, which actually goes into the object and extracts the bytes. I say closest because it depends on the implementation of the String object.
Sure, but that's where the fundamental error in the question arises. The String is an object which could have an interesting data structure. We already know it does, because it allows unpaired surrogates to be stored. It might store the length. It might keep a pointer to each of the 'paired' surrogates, allowing quick counting. Etc. All of these extra bytes are not part of the character data. What you want is each character's bytes in an array. And that is where 'encoding' comes in. By default you will get UTF-16LE. If you don't care about the bytes themselves except for the round trip, then you can choose any encoding, including the 'default', and convert it back later (assuming the same parameters, such as what the default encoding was, code points, bug fixes, things allowed such as unpaired surrogates, etc.). But why leave the 'encoding' up to magic? Why not specify the encoding so that you know what bytes you are going to get?
Encoding (in this context) simply means the bytes that represent your string. Not the bytes of the string object. You wanted the bytes the string has been stored in; this is where the question was asked naively. You wanted the bytes of the string in a contiguous array that represent the string, and not all of the other binary data that a string object may contain.

Which means how a string is stored is irrelevant. You want a string "encoded" into bytes in a byte array.

I like Tom Blodget's answer because he took you in the 'bytes of the string object' direction. It's implementation-dependent, though, and because he's peeking at internals, it might be difficult to reconstitute a copy of the string.

Mehrdad's response is wrong because it is misleading at the conceptual level. You still have a list of bytes, encoded. His particular solution allows unpaired surrogates to be preserved; this is implementation-dependent. His particular solution would not produce the string's bytes accurately if GetBytes returned the string in UTF-8 by default.
---
Here is the code:
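The code itself was stripped from this copy; given the rest of the thread, it was presumably a GetBytes/GetString round trip such as:

```csharp
using System.Text;

string input = "Hello";
byte[] bytes = Encoding.UTF8.GetBytes(input);
string output = Encoding.UTF8.GetString(bytes);
```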
---