Converting a string to byte-array without using an encoding (byte-by-byte)

Question

How do I convert a string to a byte[] in .NET (C#)?

Also, why should encoding be taken into consideration? Can't I simply get what bytes the string has been stored in? Why is there a dependency on character encodings?

Every string is stored as an array of bytes right? Why can't I simply have those bytes? — Agnel Kurian, Jan 23 '09 at 14:05
The encoding is what maps the characters to the bytes. For example, in ASCII, the letter 'A' maps to the number 65. In a different encoding, it might not be the same. The high-level approach to strings taken in the .NET framework makes this largely irrelevant, though (except in this case). — Lucas Jones, Apr 13 '09 at 14:13
To play devil's advocate: If you wanted to get the bytes of an in-memory string (as .NET uses them) and manipulate them somehow (i.e. CRC32), and NEVER EVER wanted to decode it back into the original string...it isn't straight forward why you'd care about encodings or how you choose which one to use. — Greg, Dec 1 '09 at 19:47
Surprised no-one has given this link yet: joelonsoftware.com/articles/Unicode.html — Bevan, Jun 29 '10 at 2:57
A char is not a byte and a byte is not a char. A char is both a key into a font table and a lexical tradition. A string is a sequence of chars. (A words, paragraphs, sentences, and titles also have their own lexical traditions that justify their own type definitions -- but I digress). Like integers, floating point numbers, and everything else, chars are encoded into bytes. There was a time when the encoding was simple one to one: ASCII. However, to accommodate all of human symbology, the 256 permutations of a byte were insufficient and encodings were devised to selectively use more bytes. — George, Aug 28 '14 at 15:43

Brad Bruce · Accepted Answer · 2015-07-18 21:46:49Z

up vote 1264 down vote accepted

Contrary to the answers here, you DON'T need to worry about encoding if the bytes don't need to be interpreted!

Like you mentioned, your goal is, simply, to "get what bytes the string has been stored in".
(And, of course, to be able to re-construct the string from the bytes.)

For those goals, I honestly do not understand why people keep telling you that you need the encodings. You certainly do NOT need to worry about encodings for this.

Just do this instead:

static byte[] GetBytes(string str)
{
    byte[] bytes = new byte[str.Length * sizeof(char)];
    System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
    return bytes;
}

static string GetString(byte[] bytes)
{
    char[] chars = new char[bytes.Length / sizeof(char)];
    System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
    return new string(chars);
}

As long as your program (or other programs) don't try to interpret the bytes somehow, which you obviously didn't mention you intend to do, then there is nothing wrong with this approach! Worrying about encodings just makes your life more complicated for no real reason.

Additional benefit to this approach:

It doesn't matter if the string contains invalid characters, because you can still get the data and reconstruct the original string anyway!

It will be encoded and decoded just the same, because you are just looking at the bytes.

If you used a specific encoding, though, it would've given you trouble with encoding/decoding invalid characters.

edited Jul 18 at 21:46

Brad Bruce

5,02612447

answered Apr 30 '12 at 7:44

Mehrdad

92.2k51280561

113

What's ugly about this one is, that GetString and GetBytes need to executed on a system with the same endianness to work. So you can't use this to get bytes you want to turn into a string elsewhere. So I have a hard time to come up with a situations where I'd want to use this. – CodesInChaos May 13 '12 at 11:14

41

@CodeInChaos: Like I said, the whole point of this is if you want to use it on the same kind of system, with the same set of functions. If not, then you shouldn't use it. – Mehrdad May 13 '12 at 18:00

94

-1 I guarantee that someone (who doesn't understand bytes vs characters) is going to want to convert their string into a byte array, they will google it and read this answer, and they will do the wrong thing, because in almost all cases, the encoding IS relevant. – artbristol Jun 15 '12 at 11:07

201

@artbristol: If they can't be bothered to read the answer (or the other answers...), then I'm sorry, then there's no better way for me to communicate with them. I generally opt for answering the OP rather than trying to guess what others might do with my answer -- the OP has the right to know, and just because someone might abuse a knife doesn't mean we need to hide all knives in the world for ourselves. Though if you disagree that's fine too. – Mehrdad Jun 15 '12 at 14:04

94

This answer is wrong on so many levels but foremost because of it's decleration "you DON'T need to worry about encoding!". The 2 methods, GetBytes and GetString are superfluous in as much as they are merely re-implementations of what Encoding.Unicode.GetBytes() and Encoding.Unicode.GetString() already do. The statement "As long as your program (or other programs) don't try to interpret the bytes" is also fundamentally flawed as implicitly they mean the bytes should be interpreted as Unicode. – David Jul 11 '12 at 12:36

| show 97 more comments

Peter Mortensen · Answer 2 · 2015-04-24 09:52:05Z

up vote 837 down vote

It depends on the encoding of your string (ASCII, UTF-8, ...).

For example:

byte[] b1 = System.Text.Encoding.UTF8.GetBytes (myString);
byte[] b2 = System.Text.Encoding.ASCII.GetBytes (myString);

A small sample why encoding matters:

string pi = "\u03a0";
byte[] ascii = System.Text.Encoding.ASCII.GetBytes (pi);
byte[] utf8 = System.Text.Encoding.UTF8.GetBytes (pi);

Console.WriteLine (ascii.Length); //Will print 1
Console.WriteLine (utf8.Length); //Will print 2
Console.WriteLine (System.Text.Encoding.ASCII.GetString (ascii)); //Will print '?'

ASCII simply isn't equipped to deal with special characters.

Internally, the .NET framework uses UTF-16 to represent strings, so if you simply want to get the exact bytes that .NET uses, use System.Text.Encoding.Unicode.GetBytes (...).

See Character Encoding in the .NET Framework (MSDN) for more information.

edited Apr 24 at 9:52

Peter Mortensen

9,143106498

answered Jan 23 '09 at 13:43

bmotmans

9,64441214

6

But, why should encoding be taken into consideration? Why can't I simply get the bytes without having to see what encoding is being used? Even if it were required, shouldn't the String object itself know what encoding is being used and simply dump what is in memory? – Agnel Kurian Jan 23 '09 at 13:48

31

A .NET strings are always encoded as Unicode. So use System.Text.Encoding.Unicode.GetBytes(); to get the set of bytes that .NET would using to represent the characters. However why would you want that? I recommend UTF-8 especially when most characters are in the western latin set. – AnthonyWJones Jan 23 '09 at 14:33

6

Also: the exact bytes used internally in the string don't matter if the system that retrieves them doesn't handle that encoding or handles it as the wrong encoding. If it's all within .Net, why convert to an array of bytes at all. Otherwise, it's better to be explicit with your encoding – Joel Coehoorn Jan 23 '09 at 15:42

5

@Joel, Be careful with System.Text.Encoding.Default as it could be different on each machine it is run. That's why it's recommended to always specify an encoding, such as UTF-8. – Ash Jan 28 '10 at 9:01

17

You don't need the encodings unless you (or someone else) actually intend(s) to interpret the data, instead of treating it as a generic "block of bytes". For things like compression, encryption, etc., worrying about the encoding is meaningless. See my answer for a way to do this without worrying about the encoding. (I might have given a -1 for saying you need to worry about encodings when you don't, but I'm not feeling particularly mean today. :P) – Mehrdad Apr 30 '12 at 7:55

| show 3 more comments

Vlad · Answer 3 · 2015-07-23 14:32:52Z

up vote 124 down vote

The accepted answer is very, very complicated. Use the included .NET classes for this:

const string data = "A string with international characters: Norwegian: ÆØÅæøå, Chinese: 喂 谢谢";
var bytes = System.Text.Encoding.UTF8.GetBytes(data);
var decoded = System.Text.Encoding.UTF8.GetString(bytes);

Don't reinvent the wheel if you don't have to...

edited Jul 23 at 14:32

Vlad

10.5k32050

answered Apr 30 '12 at 7:26

Erik A. Brandstadmoen

5,26811629

31

The accepted answer is not only very complicated but also a recipe for disaster. – Konamiman Jun 13 '13 at 8:40

4

In case the accepted answer gets changed, for record purposes, it is Mehrdad's answer at this current time and date. Hopefully the OP will revisit this and accept a better solution. – Thomas Eding Sep 27 '13 at 18:20

1

good in principle but, the encoding should be System.Text.Encoding.Unicode to be equivalent to Mehrdad's answer. – Jodrell Nov 25 '14 at 9:08

1

The question has been edited an umptillion times since the original answer, so, maybe my answer is a bit outdates. I never intended to give an exace equivalent to Mehrdad's answer, but give a sensible way of doing it. But, you might be right. However, the phrase "get what bytes the string has been stored in" in the original question is very unprecise. Stored, where? In memory? On disk? If in memory, System.Text.Encoding.Unicode.GetBytes would probably be more precise. – Erik A. Brandstadmoen Nov 26 '14 at 11:36

add a comment |

Michael Buen · Answer 4 · 2009-01-26 06:29:52Z

up vote 86 down vote

BinaryFormatter bf = new BinaryFormatter();
byte[] bytes;
MemoryStream ms = new MemoryStream();

string orig = "喂 Hello 谢谢 Thank You";
bf.Serialize(ms, orig);
ms.Seek(0, 0);
bytes = ms.ToArray();

MessageBox.Show("Original bytes Length: " + bytes.Length.ToString());

MessageBox.Show("Original string Length: " + orig.Length.ToString());

for (int i = 0; i < bytes.Length; ++i) bytes[i] ^= 168; // pseudo encrypt
for (int i = 0; i < bytes.Length; ++i) bytes[i] ^= 168; // pseudo decrypt

BinaryFormatter bfx = new BinaryFormatter();
MemoryStream msx = new MemoryStream();            
msx.Write(bytes, 0, bytes.Length);
msx.Seek(0, 0);
string sx = (string)bfx.Deserialize(msx);

MessageBox.Show("Still intact :" + sx);

MessageBox.Show("Deserialize string Length(still intact): " 
    + sx.Length.ToString());

BinaryFormatter bfy = new BinaryFormatter();
MemoryStream msy = new MemoryStream();
bfy.Serialize(msy, sx);
msy.Seek(0, 0);
byte[] bytesy = msy.ToArray();

MessageBox.Show("Deserialize bytes Length(still intact): " 
   + bytesy.Length.ToString());

edited Jan 26 '09 at 6:29

answered Jan 23 '09 at 16:36

Michael Buen

24.2k15085

2

You could use the same BinaryFormatter instance for all of those operations – Joel Coehoorn Jan 23 '09 at 17:25

3

Very Interesting. Apparently it will drop any high surrogate Unicode character. See the documentation on [BinaryFormatter] – John Robertson Nov 18 '10 at 18:51

1

Waaaay too complicated... See my answer below (or above) – Erik A. Brandstadmoen Apr 30 '12 at 7:24

1

@ErikA.Brandstadmoen See my tests here: stackoverflow.com/a/10384024 – Michael Buen May 13 '12 at 11:12

16

Congratulations, you won the obfuscated C# contest. – Konamiman Jun 13 '13 at 8:42

add a comment |

Zhaph - Ben Duguid · Answer 5 · 2009-01-23 14:03:30Z

up vote 55 down vote

You need to take the encoding into account, because 1 character could be represented by 1 or more bytes (up to about 6), and different encodings will treat these bytes differently.

Joel has a posting on this:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

answered Jan 23 '09 at 14:03

Zhaph - Ben Duguid

20.7k44588

4

"1 character could be represented by 1 or more bytes" I agree. I just want those bytes regardless of what encoding the string is in. The only way a string can be stored in memory is in bytes. Even characters are stored as 1 or more bytes. I merely want to get my hands on them bytes. – Agnel Kurian Jan 23 '09 at 14:07

8

You don't need the encodings unless you (or someone else) actually intend(s) to interpret the data, instead of treating it as a generic "block of bytes". For things like compression, encryption, etc., worrying about the encoding is meaningless. See my answer for a way to do this without worrying about the encoding. – Mehrdad Apr 30 '12 at 7:54

4

@Mehrdad - Totally, but the original question, as stated when I initially answered, didn't caveat what OP was going to happen with those bytes after they'd converted them, and for future searchers the information around that is pertinent - this is covered by Joel's answer quite nicely - and as you state within your answer: provided you stick within the .NET world, and use your methods to convert to/from, you're happy. As soon as you step outside of that, encoding will matter. – Zhaph - Ben Duguid Apr 30 '12 at 10:48

add a comment |

Tom Blodget · Answer 6 · 2013-12-02 04:43:48Z

This is a popular question. It is important to understand what the question author is asking, and that it is different from what is likely the most common need. To discourage misuse of the code where it is not needed, I've answered the later first.

Common Need

Every string has a character set and encoding. When you convert a System.String object to an array of System.Byte you still have a character set and encoding. For most usages, you'd know which character set and encoding you need and .NET makes it simple to "copy with conversion." Just choose the appropriate Encoding class.

// using System.Text;
Encoding.UTF8.GetBytes(".NET String to byte array")

The conversion may need to handle cases where the target character set or encoding doesn't support a character that's in the source. You have some choices: exception, substitution or skipping. The default policy is to substitute a '?'.

// using System.Text;
var text = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes("You win €100")); 
                                                      // -> "You win ?100"

Clearly, conversions are not necessarily lossless!

Note: For System.String the source character set is Unicode.

The only confusing thing is that .NET uses the name of a character set for the name of one particular encoding of that character set. Encoding.Unicode should be called Encoding.UTF16.

That's it for most usages. If that's what you need, stop reading here. See the fun Joel Spolsky article if you don't understand what an encoding is.

Specific Need

Now, the question author asks, "Every string is stored as an array of bytes, right? Why can't I simply have those bytes?"

He doesn't want any conversion.

From the C# spec:

Character and string processing in C# uses Unicode encoding. The char type represents a UTF-16 code unit, and the string type represents a sequence of UTF-16 code units.

So, we know that if we ask for the null conversion (i.e., from UTF-16 to UTF-16), we'll get the desired result:

Encoding.Unicode.GetBytes(".NET String to byte array")

But to avoid the mention of encodings, we must do it another way. If an intermediate data type is acceptable, there is a conceptual shortcut for this:

".NET String to byte array".ToCharArray()

That doesn't get us the desired datatype but Mehrdad's answer shows how to convert this Char array to a Byte array using BlockCopy. However, this copies the string twice! And, it too explicitly uses encoding-specific code: the datatype System.Char.

The only way to get to the actual bytes the String is stored in is to use a pointer. The fixed statement allows taking the address of values. From the C# spec:

[For] an expression of type string, ... the initializer computes the address of the first character in the string.

To do so, the compiler writes code skip over the other parts of the string object with RuntimeHelpers.OffsetToStringData. So, to get the raw bytes, just create a pointer to the string and copy the number of bytes needed.

// using System.Runtime.InteropServices
unsafe byte[] GetRawBytes(String s)
{
    if (s == null) return null;
    var codeunitCount = s.Length;
    /* We know that String is a sequence of UTF-16 codeunits 
       and such codeunits are 2 bytes */
    var byteCount = codeunitCount * 2; 
    var bytes = new byte[byteCount];
    fixed(void* pRaw = s)
    {
        Marshal.Copy((IntPtr)pRaw, bytes, 0, byteCount);
    }
    return bytes;
}

As @CodesInChaos pointed out, the result depends on the endianness of the machine. But the question author is not concerned with that.

@Jan That's correct but string length already gives the number of code-units (not codepoints). — Tom Blodget, Feb 4 '14 at 2:35
+1 Excellent answer that won't mislead the majority of readers! — Jon Coombs, Mar 24 '14 at 2:20

4 revs · Answer 7 · 2012-05-13 11:09:58Z

Just to demonstrate that Mehrdrad's sound answer works, his approach can even persist the unpaired surrogate characters(of which many had leveled against my answer, but of which everyone are equally guilty of, e.g. System.Text.Encoding.UTF8.GetBytes, System.Text.Encoding.Unicode.GetBytes; those encoding methods can't persist the high surrogate characters d800 for example, and those just merely replace high surrogate characters with value fffd ) :

using System;

class Program
{     
    static void Main(string[] args)
    {
        string t = "爱虫";            
        string s = "Test\ud800Test"; 

        byte[] dumpToBytes = GetBytes(s);
        string getItBack = GetString(dumpToBytes);

        foreach (char item in getItBack)
        {
            Console.WriteLine("{0} {1}", item, ((ushort)item).ToString("x"));
        }    
    }

    static byte[] GetBytes(string str)
    {
        byte[] bytes = new byte[str.Length * sizeof(char)];
        System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }

    static string GetString(byte[] bytes)
    {
        char[] chars = new char[bytes.Length / sizeof(char)];
        System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
        return new string(chars);
    }        
}

Output:

T 54
e 65
s 73
t 74
? d800
T 54
e 65
s 73
t 74

Try that with System.Text.Encoding.UTF8.GetBytes or System.Text.Encoding.Unicode.GetBytes, they will merely replace high surrogate characters with value fffd

Every time there's a movement in this question, I'm still thinking of a serializer(be it from Microsoft or from 3rd party component) that can persist strings even it contains unpaired surrogate characters; I google this every now and then: serialization unpaired surrogate character .NET. This doesn't make me lose any sleep, but it's kind of annoying when every now and then there's somebody commenting on my answer that it's flawed, yet their answers are equally flawed when it comes to unpaired surrogate characters.

Darn, Microsoft should have just used System.Buffer.BlockCopy in its BinaryFormatter ツ

谢谢！

Don't surrogates have to appear in pairs to form valid code points? If that's the case, I can understand why the data would be mangled. — dtanders, Jun 14 '12 at 14:27
@dtanders Yes,that's my thoughts too, they have to appear in pairs, unpaired surrogate characters just happen if you deliberately put them on string and make them unpaired. What I don't know is why other devs keep on harping that we should use encoding-aware approach instead, as they deemed the serialization approach(my answer,which was an accepted answer for more than 3 years) doesn't keep the unpaired surrogate character intact. But they forgot to check that their encoding-aware solutions doesn't keep the unpaired surrogate character too,the irony ツ — Michael Buen, Jun 14 '12 at 23:23
@dtanders: A System.String is an immutable sequence of Char; .NET has always allowed a String object to be constructed from any Char[] and export its content to a Char[] containing the same values, even if the original Char[] contains unpaired surrogates. — supercat, Nov 12 '14 at 21:57

Peter Mortensen · Answer 8 · 2015-04-24 09:58:10Z

up vote 24 down vote

Try this, a lot less code:

System.Text.Encoding.UTF8.GetBytes("TEST String");

edited Apr 24 at 9:58

Peter Mortensen

9,143106498

answered Jul 25 '11 at 22:52

Nathan

36738

add a comment |

Joel Coehoorn · Answer 9 · 2009-04-13 14:09:48Z

The first part of your question (how to get the bytes) was already answered by others: look in the System.Text.Encoding namespace.

I will address your follow-up question: why do you need to pick an encoding? Why can't you get that from the string class itself?

The answer is that the bytes used internally by the string class don't matter.

If your program is entirely within the .Net world then you don't need to worry about getting byte arrays for strings at all, even if you're sending data across a network. Instead, use .Net Serialization to worry about transmitting the data. You don't worry about the actual bytes any more: the Serialization formatter does it for you.

On the other hand, what if you are sending these bytes somewhere that you can't guarantee will pull in data from a .Net serialized stream? In this case you definitely do need to worry about encoding, because obviously this external system cares. So again, the internal bytes used by the string don't matter: you need to pick an encoding so you can be explicit about this encoding on the receiving end.

I understand that in this case you might prefer to use the actual bytes stored by the string variable in memory where possible, with the idea that it might save some work creating your byte stream. But that's just not important compared to making sure that your output is understood at the other end, and to guarantee that you must be explicit with your encoding. If you really want to match your internal bytes, just use the Unicode encoding.

There are areas in .NET where you do have to get byte arrays for strings. Many of the .NET Cryptrography classes contain methods such as ComputeHash() that accept byte array or stream. You have no alternative but to convert a string to a byte array first (choosing an Encoding) and then optionally wrap it in a stream. However as long as you choose an encoding (ie UTF8) an stick with it there are no problems with this. — Ash, Jan 28 '10 at 9:33

gkrogers · Answer 10 · 2009-01-23 13:43:18Z

up vote 15 down vote

byte[] strToByteArray(string str)
{
    System.Text.ASCIIEncoding enc = new System.Text.ASCIIEncoding();
    return enc.GetBytes(str);
}

answered Jan 23 '09 at 13:43

gkrogers

5,61421831

3

This doesn't always work. Some special characters can get lost in using such a method I've found the hard way. – JB King Jan 23 '09 at 17:14

1

if the charset was utf it wouldn't work! – ahmadali shafiee Sep 18 '12 at 6:27

| show 1 more comment

Peter Mortensen · Answer 11 · 2015-04-24 09:57:25Z

Well, I've read all answers and they were about using encoding or one about serialization that drops unpaired surrogates.

It's bad when the string, for example, comes from SQL Server where it was built from a byte array storing, for example, a password hash. If we drop anything from it, it'll store an invalid hash, and if we want to store it in XML, we want to leave it intact (because the XML writer drops an exception on any unpaired surrogate it finds).

So I use Base64 encoding of byte arrays in such cases, but hey, on the Internet there is only one solution to this in C#, and it has bug in it and is only one way, so I've fixed the bug and written back procedure. Here you are, future googlers:

    public static byte[] StringToBytes(string str)
    {
        byte[] data = new byte[str.Length * 2];
        for (int i = 0; i < str.Length; ++i)
        {
            char ch = str[i];
            data[i * 2] = (byte)(ch & 0xFF);
            data[i * 2 + 1] = (byte)((ch & 0xFF00) >> 8);
        }

        return data;
    }

    public static string StringFromBytes(byte[] arr)
    {
        char[] ch = new char[arr.Length / 2];
        for (int i = 0; i < ch.Length; ++i)
        {
            ch[i] = (char)((int)arr[i * 2] + (((int)arr[i * 2 + 1]) << 8));
        }
        return new String(ch);
    }

Jarvis · Answer 12 · 2014-09-09 11:30:51Z

You can use the following code for conversion between string and byte array.

string s = "Hello World";

// String to Byte[]

byte[] byte1 = System.Text.Encoding.Default.GetBytes(s);

// OR

byte[] byte2 = System.Text.ASCIIEncoding.Default.GetBytes(s);

// Byte[] to string

string str = System.Text.Encoding.UTF8.GetString(byte1);

Navnath · Answer 13 · 2013-10-10 06:51:37Z

up vote 11 down vote

C# to convert a string to a byte array:

public static byte[] StrToByteArray(string str)
{
   System.Text.UTF8Encoding  encoding=new System.Text.UTF8Encoding();
   return encoding.GetBytes(str);
}

edited Oct 10 '13 at 6:51

Navnath

2,04121326

answered Jun 5 '13 at 10:52

Shyam sundar shah

1,2461023

add a comment |

Konamiman · Answer 14 · 2015-10-23 06:19:47Z

Also please explain why encoding should be taken into consideration. Can't I simply get what bytes the string has been stored in? Why this dependency on encoding?!!!

Because there is no such thing as "the bytes of the string".

A string (or more generically, a text) is composed of characters: letters, digits, and other symbols. That's all. Computers, however, do not know anything about characters; they can only handle bytes. Therefore, if you want to store or transmit text by using a computer, you need to transform the characters to bytes. How do you do that? Here's where encodings come to the scene.

An encoding is nothing but a convention to translate logical characters to physical bytes. The simplest and best known encoding is ASCII, and it is all you need if you write in English. For other languages you will need more complete encodings, being any of the Unicode flavours the safest choice nowadays.

So, in short, trying to "get the bytes of a string without using encodings" is as impossible as "writing a text without using any language".

By the way, I strongly recommend you (and anyone, for that matter) to read this small piece of wisdom: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Allow me to clarify: An encoding has been used to translate "hello world" to physical bytes. Since the string is stored on my computer, I am sure that it must be stored in bytes. I merely want to access those bytes to save them on disk or for any other reason. I do not want to interpret these bytes. Since I do not want to interpret these bytes, the need for an encoding at this point is as misplaced as requiring a phone line to call printf. — Agnel Kurian, Jul 16 '09 at 15:30

Ed Marty · Answer 15 · 2009-01-23 14:34:03Z

I'm not sure, but I think the string stores its info as an array of Chars, which is inefficient with bytes. Specifically, the definition of a Char is "Represents a Unicode character".

take this example sample:

String str = "asdf éß";
String str2 = "asdf gh";
EncodingInfo[] info =  Encoding.GetEncodings();
foreach (EncodingInfo enc in info)
{
    System.Console.WriteLine(enc.Name + " - " 
      + enc.GetEncoding().GetByteCount(str)
      + enc.GetEncoding().GetByteCount(str2));
}

Take note that the Unicode answer is 14 bytes in both instances, whereas the UTF-8 answer is only 9 bytes for the first, and only 7 for the second.

So if you just want the bytes used by the string, simply use Encoding.Unicode, but it will be inefficient with storage space.

Peter Mortensen · Answer 16 · 2015-04-24 09:54:43Z

up vote 6 down vote

Fastest way

public static byte[] GetBytes(string text)
{
    return System.Text.ASCIIEncoding.UTF8.GetBytes(text);
}

edited Apr 24 at 9:54

Peter Mortensen

9,143106498

answered Mar 22 '10 at 8:40

Sunrising

819820

4

ASCIIEncoding..... is not needed. Simply using Encoding.UTF8.GetBytes(text) is preferred. – Makotosan Feb 17 '12 at 20:40

add a comment |

Hans Passant · Answer 17 · 2009-01-23 14:15:26Z

up vote 5 down vote

The key issue is that a glyph in a string takes 32 bits (16 bits for a character code) but a byte only has 8 bits to spare. A one-to-one mapping doesn't exist unless you restrict yourself to strings that only contain ASCII characters. System.Text.Encoding has lots of ways to map a string to byte[], you need to pick one that avoids loss of information and that is easy to use by your client when she needs to map the byte[] back to a string.

Utf8 is a popular encoding, it is compact and not lossy.

answered Jan 23 '09 at 14:15

Hans Passant

594k677671371

1

UTF-8 is compact only if the majority of your characters are in the English (ASCII) character set. If you had a long string of Chinese characters, UTF-16 would be a more compact encoding than UTF-8 for that string. This is because UTF-8 uses one byte to encode ASCII, and 3 (or maybe 4) otherwise. – Joel Mueller Jan 23 '09 at 20:40

6

True. But, how can you not know about encoding if you're familiar with handling Chinese text? – Hans Passant Jan 24 '09 at 3:40

add a comment |

mashet · Answer 18 · 2013-10-22 12:55:59Z

up vote 5 down vote

    string text = "string";
    byte[] array = System.Text.Encoding.UTF8.GetBytes(text);

the result is:

[0] = 115
[1] = 116
[2] = 114
[3] = 105
[4] = 110
[5] = 103

answered Oct 22 '13 at 12:55

mashet

315411

add a comment |

İlker Elçora · Answer 19 · 2014-05-02 07:39:30Z

up vote 5 down vote

You can use following code to convert a string to a byte array in .NET

string s_unicode = "abcéabc";
byte[] utf8Bytes = System.Text.Encoding.UTF8.GetBytes(s_unicode);

edited May 2 '14 at 7:39

İlker Elçora

12318

answered Sep 2 '13 at 11:21

Shyam sundar shah

1,2461023

add a comment |

Zarathos · Answer 20 · 2013-01-15 11:55:21Z

Here is my unsafe implementation of String to Byte[] conversion:

public static unsafe Byte[] GetBytes(String s)
{
    Int32 length = s.Length * sizeof(Char);
    Byte[] bytes = new Byte[length];

    fixed (Char* pInput = s)
    fixed (Byte* pBytes = bytes)
    {
        Byte* source = (Byte*)pInput;
        Byte* destination = pBytes;

        if (length >= 16)
        {
            do
            {
                *((Int64*)destination) = *((Int64*)source);
                *((Int64*)(destination + 8)) = *((Int64*)(source + 8));

                source += 16;
                destination += 16;
            }
            while ((length -= 16) >= 16);
        }

        if (length > 0)
        {
            if ((length & 8) != 0)
            {
                *((Int64*)destination) = *((Int64*)source);

                source += 8;
                destination += 8;
            }

            if ((length & 4) != 0)
            {
                *((Int32*)destination) = *((Int32*)source);

                source += 4;
                destination += 4;
            }

            if ((length & 2) != 0)
            {
                *((Int16*)destination) = *((Int16*)source);

                source += 2;
                destination += 2;
            }

            if ((length & 1) != 0)
            {
                ++source;
                ++destination;

                destination[0] = source[0];
            }
        }
    }

    return bytes;
}

It's way faster than the accepted anwser's one, even if not as elegant as it is. Here are my Stopwatch benchmarks over 10000000 iterations:

[Second String: Length 20]
Buffer.BlockCopy: 746ms
Unsafe: 557ms

[Second String: Length 50]
Buffer.BlockCopy: 861ms
Unsafe: 753ms

[Third String: Length 100]
Buffer.BlockCopy: 1250ms
Unsafe: 1063ms

In order to use it, you have to tick "Allow Unsafe Code" in your project build properties. As per .NET Framework 3.5, this method can also be used as String extension:

public static unsafe class StringExtensions
{
    public static Byte[] ToByteArray(this String s)
    {
        // Method Code
    }
}

Avlin · Answer 21 · 2013-12-18 10:13:26Z

up vote 3 down vote

simple code with LINQ

string s = "abc"
byte[] b = s.Select(e => (byte)e).ToArray();

EDIT : as commented below, it is not a good way.

but you can still use it to understand LINQ with a more appropriate coding :

string s = "abc"
byte[] b = s.Cast<byte>().ToArray();

edited Dec 18 '13 at 10:13

answered Oct 11 '12 at 9:45

Avlin

359115

1

It's hardly more faster, let alone most fastest. It's certainly an interesting alternative, but it's essentially the same as Encoding.Default.GetBytes(s) which, by the way, is way faster. Quick testing suggests that Encoding.Default.GetBytes(s) performs at least 79% faster. YMMV. – WynandB Oct 25 '13 at 4:36

2

Try it with a €. This code will not crash, but will return a wrong result (which is even worse). Try casting to a short instead of byte to see the difference. – Hans Kesting Dec 18 '13 at 8:57

add a comment |

cyberbobcat · Answer 22 · 2009-01-23 13:43:58Z

up vote 2 down vote

// C# to convert a string to a byte array.
public static byte[] StrToByteArray(string str)
{
    System.Text.ASCIIEncoding  encoding=new System.Text.ASCIIEncoding();
    return encoding.GetBytes(str);
}


// C# to convert a byte array to a string.
byte [] dBytes = ...
string str;
System.Text.ASCIIEncoding enc = new System.Text.ASCIIEncoding();
str = enc.GetString(dBytes);

answered Jan 23 '09 at 13:43

cyberbobcat

70511431

3

1) That will lose data due to using ASCII as the encoding. 2) There's no point in creating a new ASCIIEncoding - just use the Encoding.ASCII property. – Jon Skeet Jan 27 '09 at 6:35

add a comment |

harmonik · Answer 23 · 2009-02-19 21:03:34Z

Two ways:

public static byte[] StrToByteArray(this string s)
{
    List<byte> value = new List<byte>();
    foreach (char c in s.ToCharArray())
        value.Add(c.ToByte());
    return value.ToArray();
}

And,

public static byte[] StrToByteArray(this string s)
{
    s = s.Replace(" ", string.Empty);
    byte[] buffer = new byte[s.Length / 2];
    for (int i = 0; i < s.Length; i += 2)
        buffer[i / 2] = (byte)Convert.ToByte(s.Substring(i, 2), 16);
    return buffer;
}

I tend to use the bottom one more often than the top, haven't benchmarked them for speed.

What about multibyte characters? – Agnel Kurian Feb 23 '09 at 9:57 — Agnel Kurian, Feb 23 '09 at 9:57

user1120193 · Answer 24 · 2012-01-02 11:07:00Z

up vote 2 down vote

bytes[] buffer = UnicodeEncoding.UTF8.GetBytes(string something); //for converting to UTF then get its bytes

bytes[] buffer = ASCIIEncoding.ASCII.GetBytes(string something); //for converting to ascii then get its bytes

answered Jan 2 '12 at 11:07

user1120193

1109

add a comment |

jonsca · Answer 25 · 2015-07-01 01:14:44Z

up vote 2 down vote

Simply use this:

byte[] myByte= System.Text.ASCIIEncoding.Default.GetBytes(myString);

edited Jul 1 at 1:14

jonsca

7,15893349

answered Jun 30 at 14:39

alireza amini

272214

add a comment |

Vijay Singh Rana · Answer 26 · 2014-06-11 11:29:06Z

String can be converted do byte array in few different ways, due to the following fact: .NET supports Unicode, and Unicode standardizes several difference encodings called UTFs. They have different lengths of byte representation but are equivalent in that sense that when a string is encoded, it can be coded back to the string, but if the string is encoded with one UTF and decoded in assumption of different UTF, if can be screwed up.

Also, .NET supports non-Unicode encodings, but they are not valid in general case (will be valid only if a limited sub-set of Unicode code point is used in an actual string, such as ASCII). Internally, .NET supports UTF-16, but for stream representation UTF-8 is usually used. It is also a standard-de-facto for Internet.

Not surprisingly, serialization of string into array of byte and deserialization is supported by the class System.Text.Encoding, which is an abstract class; its derived classes support concrete encodings: ASCIIEncoding and four UTFs (`System.Text.UnicodeEncoding' supports UTF-16)

Ref: http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx

For serialization to array of bytes use System.Text.Encoding.GetBytes. For the inverse operation use System.Text.Encoding.GetChars. This function returns array of characters, so to get a string, use a string constructor System.String(char[]). Ref: http://unicode.org/, http://unicode.org/faq/utf_bom.html

Example:

string myString = //... some string

System.Text.Encoding encoding = System.Text.Encoding.UTF8; //or some other, but prefer some UTF is Unicode is used
byte[] bytes = encoding.GetBytes(myString);

//next lines are written in response to a follow-up questions:

myString = new string(encoding.GetChars(bytes));
byte[] bytes = encoding.GetBytes(myString);
myString = new string(encoding.GetChars(bytes));
byte[] bytes = encoding.GetBytes(myString);

//how many times shall I repeat it to show there is a round-trip? :-)

Piero Alberto · Answer 27 · 2015-01-21 14:05:34Z

up vote 1 down vote

from byte[] to string:

        return BitConverter.ToString(bytes);

answered Jan 21 at 14:05

Piero Alberto

5482725

add a comment |

Gerard ONeill · Answer 28 · 2015-08-18 17:04:21Z

The closest approach to the OP's question is Tom Blodget's, which actually goes into the object and extracts the bytes. I say closest because it depends on implementation of the String Object.

"Can't I simply get what bytes the string has been stored in?"

Sure, but that's where the fundamental error in the question arises. The String is an object which could have an interesting datastructure. We already know it does, because it allows unpaired surrogates to be stored. It might store the length. It might keep a pointer to each of the 'paired' surrogates allowing quick counting. Etc. All of these extra bytes are not part of the character data.

What you want is each character's bytes in an array. And that is where 'encoding' comes in. By default you will get UTF-16LE. If you don't care about the bytes themselves except for the round trip then you can choose any encoding including the 'default', and convert it back later (assuming the same parameters such as what the default encoding was, code points, bug fixes, things allowed such as unpaired surrogates, etc.

But why leave the 'encoding' up to magic? Why not specify the encoding so that you know what bytes you are gonna get?

"Why is there a dependency on character encodings?"

Encoding (in this context) simply means the bytes that represent your string. Not the bytes of the string object. You wanted the bytes the string has been stored in -- this is where the question was asked naively. You wanted the bytes of string in a contiguous array that represent the string, and not all of the other binary data that a string object may contain.

Which means how a string is stored is irrelevant. You want a string "Encoded" into bytes in a byte array.

I like Tom Bloget's answer because he took you towards the 'bytes of the string object' direction. It's implementation dependent though, and because he's peeking at internals it might be difficult to reconstitute a copy of the string.

Mehrdad's response is wrong because it is misleading at the conceptual level. You still have a list of bytes, encoded. His particular solution allows for unpaired surrogates to be preserved -- this is implementation dependent. His particular solution would not produce the string's bytes accurately if GetBytes returned the string in UTF-8 by default.

shytikov · Answer 29 · 2013-01-23 06:41:24Z

up vote 0 down vote

Here is the code:

// Input string.
const string input = "Dot Net Perls";

// Invoke GetBytes method.
// ... You can store this array as a field!
byte[] array = Encoding.ASCII.GetBytes(input);

// Loop through contents of the array.
foreach (byte element in array)
{
    Console.WriteLine("{0} = {1}", element, (char)element);
}

edited Jan 23 '13 at 6:41

shytikov

3,79222555

answered Jan 23 '13 at 6:21

sagardhavale

1

2

Might not work if string is not ASCII. – Agnel Kurian Jan 24 '13 at 13:38

add a comment |

Knickerless-Noggins · Answer 30 · 2014-04-14 08:32:44Z

up vote 0 down vote

string s = "abcdefghijklmnopqrstuvwxyz";
byte[] b = new System.Text.UTF32Encoding().GetBytes(s);

edited Apr 14 '14 at 8:32

answered Apr 9 '14 at 12:39

Knickerless-Noggins

7901932

| show 5 more comments

asked	6 years ago
viewed	892703 times
active	1 month ago

current community

your communities

more stack exchange communities

Converting a string to byte-array without using an encoding (byte-by-byte)

33 Answers 33

Contrary to the answers here, you DON'T need to worry about encoding if the bytes don't need to be interpreted!

For those goals, I honestly do not understand why people keep telling you that you need the encodings. You certainly do NOT need to worry about encodings for this.

Additional benefit to this approach:

It doesn't matter if the string contains invalid characters, because you can still get the data and reconstruct the original string anyway!

Common Need

Specific Need

protected by Paŭlo Ebermann Jun 27 '13 at 19:25

Not the answer you're looking for? Browse other questions tagged c# .net string or ask your own question.

Visit Chat

Linked

Hot Network Questions

current community

your communities

more stack exchange communities

Converting a string to byte-array without using an encoding (byte-by-byte)

33 Answers 33

Contrary to the answers here, you DON'T need to worry about encoding if the bytes don't need to be interpreted!

For those goals, I honestly do not understand why people keep telling you that you need the encodings. You certainly do NOT need to worry about encodings for this.

Additional benefit to this approach:

It doesn't matter if the string contains invalid characters, because you can still get the data and reconstruct the original string anyway!

Common Need

Specific Need

protected by Paŭlo Ebermann Jun 27 '13 at 19:25

Not the answer you're looking for? Browse other questions tagged c# .net string or ask your own question.

Visit Chat

Linked

Related

Hot Network Questions