How do I get a consistent byte representation of strings in C# without manually specifying an encoding?





From byte[] to string :

        return BitConverter.ToString(bytes);
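
Keep in mind that BitConverter.ToString does not decode the bytes as text; it formats each byte as two hexadecimal digits joined by hyphens. A minimal sketch of that behaviour and of one way to reverse it (HexDemo, ToHex and FromHex are hypothetical names chosen for illustration):

```csharp
using System;

public static class HexDemo
{
    // Renders bytes as hyphen-separated uppercase hex pairs.
    public static string ToHex(byte[] bytes) => BitConverter.ToString(bytes);

    // Reverses ToHex by parsing each hex pair back into a byte.
    public static byte[] FromHex(string hex)
    {
        if (hex.Length == 0)
            return new byte[0];

        string[] parts = hex.Split('-');
        byte[] result = new byte[parts.Length];
        for (int i = 0; i < parts.Length; i++)
            result[i] = Convert.ToByte(parts[i], 16);
        return result;
    }
}
```

For the two bytes 0x48 and 0x69, ToHex returns "48-69", and FromHex("48-69") gives the same two bytes back.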

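How do I convert a string to a byte[] in .NET (C#) without manually specifying a specific encoding?

I'm going to encrypt the string. I can encrypt it without converting, but I'd still like to know why encoding comes into play here.

Also, why should encoding be taken into consideration at all? Can't I simply get the bytes the string is stored in? Why is there a dependency on character encodings?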


BinaryFormatter bf = new BinaryFormatter();
byte[] bytes;
MemoryStream ms = new MemoryStream();

string orig = "喂 Hello 谢谢 Thank You";
bf.Serialize(ms, orig);
ms.Seek(0, 0);
bytes = ms.ToArray();

MessageBox.Show("Original bytes Length: " + bytes.Length.ToString());

MessageBox.Show("Original string Length: " + orig.Length.ToString());

for (int i = 0; i < bytes.Length; ++i) bytes[i] ^= 168; // pseudo encrypt
for (int i = 0; i < bytes.Length; ++i) bytes[i] ^= 168; // pseudo decrypt

BinaryFormatter bfx = new BinaryFormatter();
MemoryStream msx = new MemoryStream();            
msx.Write(bytes, 0, bytes.Length);
msx.Seek(0, 0);
string sx = (string)bfx.Deserialize(msx);

MessageBox.Show("Still intact :" + sx);

MessageBox.Show("Deserialize string Length(still intact): " 
    + sx.Length.ToString());

BinaryFormatter bfy = new BinaryFormatter();
MemoryStream msy = new MemoryStream();
bfy.Serialize(msy, sx);
msy.Seek(0, 0);
byte[] bytesy = msy.ToArray();

MessageBox.Show("Deserialize bytes Length(still intact): " 
   + bytesy.Length.ToString());

byte[] strToByteArray(string str)
{
    System.Text.ASCIIEncoding enc = new System.Text.ASCIIEncoding();
    return enc.GetBytes(str);
}

It depends on the encoding of your string (ASCII, UTF-8, ...).

For example:

byte[] b1 = System.Text.Encoding.UTF8.GetBytes (myString);
byte[] b2 = System.Text.Encoding.ASCII.GetBytes (myString);

A small example of why encoding matters:

string pi = "\u03a0";
byte[] ascii = System.Text.Encoding.ASCII.GetBytes (pi);
byte[] utf8 = System.Text.Encoding.UTF8.GetBytes (pi);

Console.WriteLine (ascii.Length); //Will print 1
Console.WriteLine (utf8.Length); //Will print 2
Console.WriteLine (System.Text.Encoding.ASCII.GetString (ascii)); //Will print '?'

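ASCII simply isn't equipped to deal with special characters.

Internally, the .NET Framework uses UTF-16 to represent strings, so if you simply want the exact bytes that .NET uses, use System.Text.Encoding.Unicode.GetBytes(...).

For more information, see Character Encoding in the .NET Framework (MSDN).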
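
As a rough illustration of that last point, the sketch below encodes a string as UTF-16 and decodes it again. Utf16Demo and its method names are hypothetical; the only library members used are Encoding.Unicode.GetBytes and Encoding.Unicode.GetString.

```csharp
using System;
using System.Text;

public static class Utf16Demo
{
    // Encoding.Unicode is little-endian UTF-16; GetBytes emits no byte order mark.
    public static byte[] ToUtf16(string text) => Encoding.Unicode.GetBytes(text);

    // Decodes little-endian UTF-16 bytes back into a string.
    public static string FromUtf16(byte[] bytes) => Encoding.Unicode.GetString(bytes);
}
```

ToUtf16("\u03a0") returns the two bytes A0 03, and FromUtf16 turns them back into the original one-character string.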


Fastest way

public static byte[] GetBytes(string text)
{
    return System.Text.ASCIIEncoding.UTF8.GetBytes(text);
}

EDIT as Makotosan commented this is now the best way:

Encoding.UTF8.GetBytes(text)

Use:

    string text = "string";
    byte[] array = System.Text.Encoding.UTF8.GetBytes(text);

The result is:

[0] = 115
[1] = 116
[2] = 114
[3] = 105
[4] = 110
[5] = 103

Try this, it is a lot less code:

System.Text.Encoding.UTF8.GetBytes("TEST String");

byte[] buffer = System.Text.Encoding.UTF8.GetBytes(something); // encode the string as UTF-8, then get its bytes

byte[] buffer = System.Text.Encoding.ASCII.GetBytes(something); // encode the string as ASCII, then get its bytes

If you really want a copy of the underlying bytes of a string, you can use a function like the one that follows. However, you shouldn't; please read on to find out why.

[DllImport(
        "msvcrt.dll",
        EntryPoint = "memcpy",
        CallingConvention = CallingConvention.Cdecl,
        SetLastError = false)]
private static extern unsafe void* UnsafeMemoryCopy(
    void* destination,
    void* source,
    uint count);

public static byte[] GetUnderlyingBytes(string source)
{
    var length = source.Length * sizeof(char);
    var result = new byte[length];
    unsafe
    {
        fixed (char* firstSourceChar = source)
        fixed (byte* firstDestination = result)
        {
            var firstSource = (byte*)firstSourceChar;
            UnsafeMemoryCopy(
                firstDestination,
                firstSource,
                (uint)length);
        }
    }

    return result;
}

This function will get you a copy of the bytes underlying your string, pretty quickly. You'll get those bytes in whatever way they are encoded on your system. This encoding is almost certainly UTF-16LE, but that is an implementation detail you shouldn't have to care about.

It would be safer, simpler and more reliable to just call,

System.Text.Encoding.Unicode.GetBytes()

In all likelihood this will give the same result, is easier to type, and the bytes will always round-trip with a call to

System.Text.Encoding.Unicode.GetString()

With the advent of Span<T> released with C# 7.2, the canonical technique to capture the underlying memory representation of a string into a managed byte array is:

byte[] bytes = "rubbish_\u9999_string".AsSpan().AsBytes().ToArray();

Converting it back should be a non-starter because that means you are in fact interpreting the data somehow, but for the sake of completeness:

string s;
unsafe
{
    fixed (char* f = &bytes.AsSpan().NonPortableCast<byte, char>().DangerousGetPinnableReference())
    {
        s = new string(f);
    }
}

The names NonPortableCast and DangerousGetPinnableReference should further the argument that you probably shouldn't be doing this.

Note that working with Span<T> requires installing the System.Memory NuGet package.
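
The helper names used above come from a pre-release version of the package. As far as I know, the released System.Memory package exposes the same operations on System.Runtime.InteropServices.MemoryMarshal instead. A hedged sketch using those members (SpanBytesDemo and its method names are hypothetical):

```csharp
using System;
using System.Runtime.InteropServices;

public static class SpanBytesDemo
{
    // Copies the in-memory UTF-16 code units of the string into a byte array.
    public static byte[] GetRawBytes(string text)
    {
        ReadOnlySpan<byte> raw = MemoryMarshal.AsBytes(text.AsSpan());
        return raw.ToArray();
    }

    // Reinterprets the bytes as chars and builds a string from them.
    // A trailing odd byte, if any, is ignored by the cast.
    public static string FromRawBytes(byte[] bytes)
    {
        ReadOnlySpan<char> chars =
            MemoryMarshal.Cast<byte, char>(new ReadOnlySpan<byte>(bytes));
        return chars.ToString();
    }
}
```

This version needs no unsafe block. The byte order of the result still follows the machine's memory layout, so the earlier caveats about interpreting the data apply unchanged.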

Regardless, the actual original question and follow-up comments imply that the underlying memory is not being "interpreted" (which I assume means is not modified or read beyond the need to write it as-is), indicating that some implementation of the Stream class should be used instead of reasoning about the data as strings at all.


The accepted answer is very, very complicated. Use the included .NET classes for this:

const string data = "A string with international characters: Norwegian: ÆØÅæøå, Chinese: 喂 谢谢";
var bytes = System.Text.Encoding.UTF8.GetBytes(data);
var decoded = System.Text.Encoding.UTF8.GetString(bytes);

There is no need to reinvent the wheel if you don't have to...


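Also please explain why encoding should be taken into consideration. Can't I simply get the bytes the string is stored in? Why this dependency on encoding?

Because there is no such thing as "the bytes of the string".

A string (or more generally, a text) is composed of characters: letters, digits, and other symbols. That's all. Computers, however, know nothing about characters; they can only handle bytes. Therefore, if you want to store or transmit text by using a computer, you need to transform the characters into bytes. How do you do that? This is where encodings enter the scene.

An encoding is nothing but a convention for translating logical characters into physical bytes. The simplest and best-known encoding is ASCII, and it is all you need if you write in English. For other languages you will need more complete encodings, and any of the Unicode flavours is the safest choice nowadays.

So, in short, trying to "get the bytes of a string without using encodings" is as impossible as "writing a text without using any language".

By the way, I strongly recommend that you (and anyone, for that matter) read this small piece of wisdom: joelonsoftware.com/articles/Unicode.html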
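
To make this concrete, the same character maps to different byte sequences depending on the convention chosen. A small sketch, assuming only the standard System.Text encodings (EncodingComparison and Hex are illustrative names):

```csharp
using System;
using System.Text;

public static class EncodingComparison
{
    // Encodes the text with the given encoding and shows the bytes as hex.
    public static string Hex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }
}
```

For the single character "\u00e9" (é), Hex returns "C3-A9" with Encoding.UTF8, "E9-00" with Encoding.Unicode, "E9-00-00-00" with Encoding.UTF32, and "3F" (a question mark) with Encoding.ASCII, which cannot represent the character at all.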


C# to convert a string to a byte array:

public static byte[] StrToByteArray(string str)
{
   System.Text.UTF8Encoding  encoding=new System.Text.UTF8Encoding();
   return encoding.GetBytes(str);
}

A string can be converted to a byte array in a few different ways, due to the following fact: .NET supports Unicode, and Unicode standardizes several different encodings called UTFs. They produce byte representations of different lengths but are equivalent in the sense that an encoded string can always be decoded back to the original string. If the string is encoded with one UTF and decoded on the assumption of a different UTF, however, the result can be garbage.

Also, .NET supports non-Unicode encodings, but they are not valid in the general case (they are valid only if the actual string uses a limited subset of Unicode code points, such as ASCII). Internally, .NET uses UTF-16, but for stream representation UTF-8 is usually used. It is also the de facto standard for the Internet.
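
The mismatch case can be sketched as follows. MismatchDemo and EncodeThenDecode are hypothetical names; the encodings themselves are the standard System.Text ones.

```csharp
using System;
using System.Text;

public static class MismatchDemo
{
    // Encodes the text with one encoding and decodes the bytes with another.
    public static string EncodeThenDecode(string text, Encoding writer, Encoding reader)
    {
        byte[] bytes = writer.GetBytes(text);
        return reader.GetString(bytes);
    }
}
```

EncodeThenDecode("\u00e9", Encoding.UTF8, Encoding.UTF8) returns the original string, while EncodeThenDecode("\u00e9", Encoding.UTF8, Encoding.Unicode) returns "\ua9c3", because the UTF-8 bytes C3 A9 are read as one little-endian UTF-16 code unit.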

Not surprisingly, serialization of string into an array of byte and deserialization is supported by the class System.Text.Encoding , which is an abstract class; its derived classes support concrete encodings: ASCIIEncoding and four UTFs ( System.Text.UnicodeEncoding supports UTF-16)

Ref this link.

For serialization to an array of bytes, use System.Text.Encoding.GetBytes. For the inverse operation, use System.Text.Encoding.GetChars. This function returns an array of characters, so to get a string, use the string constructor System.String(char[]).
Ref this page.

Example:

string myString = "some string"; // any string

System.Text.Encoding encoding = System.Text.Encoding.UTF8; // or some other, but prefer some UTF if Unicode is used
byte[] bytes = encoding.GetBytes(myString);

// the next lines are written in response to follow-up questions:

myString = new string(encoding.GetChars(bytes));
bytes = encoding.GetBytes(myString);
myString = new string(encoding.GetChars(bytes));
bytes = encoding.GetBytes(myString);

// how many times shall I repeat it to show there is a round-trip? :-)

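Contrary to the answers here, you DON'T need to worry about encoding if the bytes don't need to be interpreted!

Like you mentioned, your goal is simply to "get what bytes the string has been stored in".
(And, of course, to be able to reconstruct the string from the bytes.)

For those goals, I honestly do not understand why people keep telling you that you need encodings. You certainly do NOT need to worry about encodings for this.

Just do this instead: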

static byte[] GetBytes(string str)
{
    byte[] bytes = new byte[str.Length * sizeof(char)];
    System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
    return bytes;
}

static string GetString(byte[] bytes)
{
    char[] chars = new char[bytes.Length / sizeof(char)];
    System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
    return new string(chars);
}

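As long as your program (or other programs) don't try to interpret the bytes somehow, which you obviously didn't mention you intend to do, there is nothing wrong with this approach! Worrying about encodings just makes your life more complicated for no real reason.

Additional benefit of this approach:

It doesn't matter if the string contains invalid characters, because you can still get the data and reconstruct the original string anyway!

It will be encoded and decoded just the same, because you are just looking at the bytes.

If you used a specific encoding, though, it would have given you trouble with encoding/decoding invalid characters.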


You can use following code to convert a string to a byte array in .NET

string s_unicode = "abcéabc";
byte[] utf8Bytes = System.Text.Encoding.UTF8.GetBytes(s_unicode);

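The first part of your question (how to get the bytes) was already answered by others: look at the System.Text.Encoding class.

I will address your follow-up question: why do you need to pick an encoding? Why can't you get that from the string class itself?

The answer is in two parts.

First of all, the bytes used internally by the string class don't matter, and whenever you assume they do you're likely introducing a bug.

If your program is entirely within the .Net world, then you don't need to worry about getting byte arrays for strings at all, even if you're sending data across a network. Instead, use .Net Serialization to take care of transmitting the data. You don't worry about the actual bytes any more: the Serialization formatter does it for you.

On the other hand, what if you are sending these bytes somewhere that you can't guarantee will pull in data from a .Net serialized stream? In this case you definitely do need to worry about encoding, because obviously this external system cares. So again, the internal bytes used by the string don't matter: you need to pick an encoding so you can be explicit about it on the receiving end, even if it's the same encoding used internally by .Net.

I understand that in this case you might prefer to use the actual bytes stored by the string variable in memory where possible, with the idea that it might save some work creating your byte stream. However, I put it to you that this is just not important compared to making sure that your output is understood at the other end, and to guarantee that, you must be explicit about your encoding. Additionally, if you really want to match your internal bytes, you can already just choose the Unicode encoding and get that performance saving.

Which brings me to the second part: picking the Unicode encoding is telling .Net to use the underlying bytes. You do need to pick this encoding, because when some new-fangled Unicode-Plus comes out, the .Net runtime needs to be free to use this newer, better encoding model without breaking your program. But for the moment (and the foreseeable future), just choosing the Unicode encoding gives you what you want.

It's also important to understand that your string has to be rewritten to the wire, and that involves at least some translation of the bit pattern even when you use a matching encoding. The computer needs to account for things like big vs. little endian, network byte order, packetization, session information, etc.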
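
Byte order is easy to see with the two UTF-16 encodings that ship with .NET. The sketch below is only an illustration (ByteOrderDemo and its method names are hypothetical); it shows that "the UTF-16 bytes" of a string are not unique until the byte order is agreed on.

```csharp
using System;
using System.Text;

public static class ByteOrderDemo
{
    // UTF-16 with the least significant byte first (Encoding.Unicode).
    public static string LittleEndianHex(string text)
    {
        return BitConverter.ToString(Encoding.Unicode.GetBytes(text));
    }

    // UTF-16 with the most significant byte first (Encoding.BigEndianUnicode).
    public static string BigEndianHex(string text)
    {
        return BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(text));
    }
}
```

For the string "A", LittleEndianHex returns "41-00" and BigEndianHex returns "00-41", so sender and receiver have to agree on which of the two is in use.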


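You need to take the encoding into account, because 1 character could be represented by 1 or more bytes (up to about 6), and different encodings will treat these bytes differently.

Joel has a posting on this: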

joelonsoftware.com/articles/Unicode.html


I'm not sure, but I think the string stores its info as an array of Chars, which is inefficient with bytes. Specifically, the definition of a Char is "Represents a Unicode character".

Take this example:

String str = "asdf éß";
String str2 = "asdf gh";
EncodingInfo[] info =  Encoding.GetEncodings();
foreach (EncodingInfo enc in info)
{
    System.Console.WriteLine(enc.Name + " - " 
      + enc.GetEncoding().GetByteCount(str) + " - " // separator keeps the two counts readable
      + enc.GetEncoding().GetByteCount(str2));
}

Take note that the Unicode answer is 14 bytes in both instances, whereas the UTF-8 answer is only 9 bytes for the first, and only 7 for the second.

So if you just want the bytes used by the string, simply use Encoding.Unicode , but it will be inefficient with storage space.


Simply use this:

byte[] myByte= System.Text.ASCIIEncoding.Default.GetBytes(myString);

simple code with LINQ

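string s = "abc";
byte[] b = s.Select(e => (byte)e).ToArray();

EDIT: as commented below, it is not a good way.

But you can still use it to understand LINQ with a more appropriate coding:

string s = "abc";
// s.Cast<byte>() compiles but throws InvalidCastException at run time, because a boxed char cannot be unboxed to byte.
// Convert.ToByte throws OverflowException for characters above U+00FF instead of silently truncating.
byte[] b = s.Select(c => Convert.ToByte(c)).ToArray();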

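To demonstrate that Mehrdrad's sound answer works, his approach can even persist unpaired surrogate characters (many have leveled this against my answer, but everyone is equally guilty of it, e.g. System.Text.Encoding.UTF8.GetBytes and System.Text.Encoding.Unicode.GetBytes; those encoding methods cannot persist the high surrogate character d800, for example, and merely replace it with the value fffd):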

using System;

class Program
{     
    static void Main(string[] args)
    {
        string t = "爱虫";            
        string s = "Test\ud800Test"; 

        byte[] dumpToBytes = GetBytes(s);
        string getItBack = GetString(dumpToBytes);

        foreach (char item in getItBack)
        {
            Console.WriteLine("{0} {1}", item, ((ushort)item).ToString("x"));
        }    
    }

    static byte[] GetBytes(string str)
    {
        byte[] bytes = new byte[str.Length * sizeof(char)];
        System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }

    static string GetString(byte[] bytes)
    {
        char[] chars = new char[bytes.Length / sizeof(char)];
        System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
        return new string(chars);
    }        
}

Output:

T 54
e 65
s 73
t 74
? d800
T 54
e 65
s 73
t 74

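Try that with System.Text.Encoding.UTF8.GetBytes or System.Text.Encoding.Unicode.GetBytes; they will merely replace high surrogate characters with the value fffd.

Every time there is movement on this question, I'm still thinking of a serializer (be it from Microsoft or from a third-party component) that can persist strings even if they contain unpaired surrogate characters; I google this every now and then: serialization unpaired surrogate character .NET. This doesn't make me lose any sleep, but it's kind of annoying when every now and then somebody comments on my answer that it's flawed, yet their answers are equally flawed when it comes to unpaired surrogate characters.

Darn, Microsoft should have just used System.Buffer.BlockCopy in its BinaryFormatter ツ

Thanks!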
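
For comparison, here is a hedged sketch of what the built-in encodings do with the same input. SurrogateDemo and its method names are hypothetical; the behaviour shown relies on the default replacement fallback of Encoding.Unicode and Encoding.UTF8.

```csharp
using System;
using System.Text;

public static class SurrogateDemo
{
    // Encodes to UTF-16 and back; an unpaired surrogate is replaced by U+FFFD.
    public static string RoundTripUtf16(string text)
    {
        return Encoding.Unicode.GetString(Encoding.Unicode.GetBytes(text));
    }

    // Encodes to UTF-8 and back; an unpaired surrogate is replaced by U+FFFD.
    public static string RoundTripUtf8(string text)
    {
        return Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(text));
    }
}
```

Both methods turn "Test\ud800Test" into "Test\ufffdTest", so the original string cannot be recovered from the encoded bytes.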


It depends on what you want the bytes FOR

This is because, as Tyler so aptly said , "Strings aren't pure data. They also have information ." In this case, the information is an encoding that was assumed when the string was created.

Assuming that you have binary data (rather than text) stored in a string

This is based off of OP's comment on his own question, and is the correct question if I understand OP's hints at the use-case.

Storing binary data in strings is probably the wrong approach because of the assumed encoding mentioned above! Whatever program or library stored that binary data in a string (instead of a byte[] array which would have been more appropriate) has already lost the battle before it has begun. If they are sending the bytes to you in a REST request/response or anything that must transmit strings, Base64 would be the right approach.
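
A minimal sketch of the Base64 route, using Convert.ToBase64String and Convert.FromBase64String from the base class library (the wrapper class Base64Demo is a hypothetical name):

```csharp
using System;

public static class Base64Demo
{
    // Turns arbitrary bytes into text that is safe to carry in any string field.
    public static string Encode(byte[] data) => Convert.ToBase64String(data);

    // Recovers the exact original bytes from the Base64 text.
    public static byte[] Decode(string text) => Convert.FromBase64String(text);
}
```

Encode(new byte[] { 0x00, 0xFF, 0x80 }) returns "AP+A", and Decode("AP+A") returns the same three bytes, including values that are not valid text in most encodings.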

If you have a text string with an unknown encoding

Everybody else answered this incorrect question incorrectly.

If the string looks good as-is, just pick an encoding (preferably one starting with UTF), use the corresponding System.Text.Encoding.???.GetBytes() function, and tell whoever you give the bytes to which encoding you picked.


Two ways:

public static byte[] StrToByteArray(this string s)
{
    List<byte> value = new List<byte>();
    foreach (char c in s.ToCharArray())
        value.Add(Convert.ToByte(c)); // char has no ToByte() method; Convert.ToByte throws OverflowException above U+00FF
    return value.ToArray();
}

And,

public static byte[] StrToByteArray(this string s)
{
    s = s.Replace(" ", string.Empty);
    byte[] buffer = new byte[s.Length / 2];
    for (int i = 0; i < s.Length; i += 2)
        buffer[i / 2] = (byte)Convert.ToByte(s.Substring(i, 2), 16);
    return buffer;
}

I tend to use the bottom one more often than the top, haven't benchmarked them for speed.




