array - string c#




How do I get a consistent byte representation of strings in C# without manually specifying an encoding?

How do I convert a string to a byte[] in .NET (C#) without manually specifying a specific encoding?

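I am going to encrypt the string. I can encrypt it without converting, but I'd still like to know why encoding comes into play here.

Also, why should encoding be taken into consideration? Can't I simply get the bytes the string is stored in? Why is there a dependency on character encodings?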


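Also, please explain why encoding should be taken into consideration. Can't I simply get the bytes the string is stored in? Why this dependency on encoding?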

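Because there is no such thing as "the bytes of the string".

A string (or more generally, text) is composed of characters: letters, digits, and other symbols. That's all. Computers, however, know nothing about characters; they can only handle bytes. Therefore, if you want to store or transmit text using a computer, you need to transform the characters into bytes. How do you do that? This is where encodings come onto the scene.

An encoding is nothing but a convention for translating logical characters into physical bytes. The simplest and best-known encoding is ASCII, and it is all you need if you write in English. For other languages you will need more complete encodings, and any of the Unicode encodings is the safest choice nowadays.

So, in short, trying to "get the bytes of a string without using encodings" is as impossible as "writing a text without using any language".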
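
To make the "convention" point concrete, here is a minimal sketch showing that one and the same character produces different bytes under different encodings. The class and method names (EncodingComparison, HexBytes) are made up for this illustration; only the System.Text.Encoding and BitConverter calls are real .NET APIs.

```csharp
using System;
using System.Text;

public static class EncodingComparison
{
    // Hypothetical helper: encodes `text` with the given encoding and
    // returns the resulting bytes as a hyphen-separated hex string.
    public static string HexBytes(string text, Encoding encoding)
    {
        byte[] bytes = encoding.GetBytes(text);
        return BitConverter.ToString(bytes);
    }
}
```

For the single character "\u00e9" (é), HexBytes gives C3-A9 with Encoding.UTF8, E9-00 with Encoding.Unicode (UTF-16LE) and E9-00-00-00 with Encoding.UTF32. None of these is "the" bytes of the string; each is the result of one convention.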

By the way, I strongly recommend that you (and anyone else, for that matter) read this small piece of wisdom: joelonsoftware.com/articles/Unicode.html


It depends on what you want the bytes FOR

This is because, as Tyler so aptly said, "Strings aren't pure data. They also have information." In this case, the information is an encoding that was assumed when the string was created.

Assuming that you have binary data (rather than text) stored in a string

This is based off of OP's comment on his own question, and is the correct question if I understand OP's hints at the use-case.

Storing binary data in strings is probably the wrong approach because of the assumed encoding mentioned above! Whatever program or library stored that binary data in a string (instead of a byte[] array which would have been more appropriate) has already lost the battle before it has begun. If they are sending the bytes to you in a REST request/response or anything that must transmit strings, Base64 would be the right approach.
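
As a minimal sketch of the Base64 approach mentioned above: arbitrary bytes are turned into a plain ASCII string for transport and recovered exactly on the other side. The wrapper names (BinaryAsText, ToText, FromText) are invented for this example; Convert.ToBase64String and Convert.FromBase64String are the standard .NET calls.

```csharp
using System;

public static class BinaryAsText
{
    // Encodes arbitrary binary data as a Base64 string that is safe to
    // carry through any channel that only transmits text.
    public static string ToText(byte[] data)
    {
        return Convert.ToBase64String(data);
    }

    // Recovers the exact original bytes from the Base64 string.
    public static byte[] FromText(string text)
    {
        return Convert.FromBase64String(text);
    }
}
```

Unlike stuffing raw bytes into a string, this round-trips every byte value, including ones that are not valid text in any encoding.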

If you have a text string with an unknown encoding

Everybody else answered this incorrect question incorrectly.

If the string looks good as-is, just pick an encoding (preferably one starting with UTF), use the corresponding System.Text.Encoding.???.GetBytes() function, and tell whoever you give the bytes to which encoding you picked.


Converting a string to a byte array in C#:

public static byte[] StrToByteArray(string str)
{
   System.Text.UTF8Encoding  encoding=new System.Text.UTF8Encoding();
   return encoding.GetBytes(str);
}

Fastest way

public static byte[] GetBytes(string text)
{
    return System.Text.ASCIIEncoding.UTF8.GetBytes(text);
}

EDIT as Makotosan commented this is now the best way:

Encoding.UTF8.GetBytes(text)

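From byte[] to string (note that BitConverter.ToString returns a hyphen-separated hex dump of the bytes, such as "74-65", not the decoded text; use Encoding.UTF8.GetString(bytes) to get the original text back):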

        return BitConverter.ToString(bytes);
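
To illustrate the difference between a hex dump and actual decoding, here is a small sketch. The class and method names (BytesToText, AsHex, AsText) are hypothetical, and UTF-8 is assumed to be the encoding that produced the bytes.

```csharp
using System;
using System.Text;

public static class BytesToText
{
    // Hex dump of the bytes, e.g. "68-69". Useful for debugging,
    // but it is not the text the bytes represent.
    public static string AsHex(byte[] bytes)
    {
        return BitConverter.ToString(bytes);
    }

    // Decodes the bytes back into text, assuming they were produced
    // with Encoding.UTF8.GetBytes.
    public static string AsText(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }
}
```

For the bytes of "hi", AsHex returns "68-69" while AsText returns "hi".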

Here is my unsafe implementation of String to Byte[] conversion:

public static unsafe Byte[] GetBytes(String s)
{
    Int32 length = s.Length * sizeof(Char);
    Byte[] bytes = new Byte[length];

    fixed (Char* pInput = s)
    fixed (Byte* pBytes = bytes)
    {
        Byte* source = (Byte*)pInput;
        Byte* destination = pBytes;

        if (length >= 16)
        {
            do
            {
                *((Int64*)destination) = *((Int64*)source);
                *((Int64*)(destination + 8)) = *((Int64*)(source + 8));

                source += 16;
                destination += 16;
            }
            while ((length -= 16) >= 16);
        }

        if (length > 0)
        {
            if ((length & 8) != 0)
            {
                *((Int64*)destination) = *((Int64*)source);

                source += 8;
                destination += 8;
            }

            if ((length & 4) != 0)
            {
                *((Int32*)destination) = *((Int32*)source);

                source += 4;
                destination += 4;
            }

            if ((length & 2) != 0)
            {
                *((Int16*)destination) = *((Int16*)source);

                source += 2;
                destination += 2;
            }

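            if ((length & 1) != 0)
            {
                // Copy the final byte before moving on (in practice this branch
                // is never taken, because length is always a multiple of 2).
                destination[0] = source[0];
            }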
        }
    }

    return bytes;
}

It's way faster than the one in the accepted answer, even if not as elegant. Here are my Stopwatch benchmarks over 10000000 iterations:

[First String: Length 20]
Buffer.BlockCopy: 746ms
Unsafe: 557ms

[Second String: Length 50]
Buffer.BlockCopy: 861ms
Unsafe: 753ms

[Third String: Length 100]
Buffer.BlockCopy: 1250ms
Unsafe: 1063ms

In order to use it, you have to tick "Allow Unsafe Code" in your project build properties. As of .NET Framework 3.5, this method can also be used as a String extension:

public static unsafe class StringExtensions
{
    public static Byte[] ToByteArray(this String s)
    {
        // Method Code
    }
}

If you really want a copy of the underlying bytes of a string, you can use a function like the one that follows. However, you shouldn't; please read on to find out why.

[DllImport(
        "msvcrt.dll",
        EntryPoint = "memcpy",
        CallingConvention = CallingConvention.Cdecl,
        SetLastError = false)]
private static extern unsafe void* UnsafeMemoryCopy(
    void* destination,
    void* source,
    uint count);

public static byte[] GetUnderlyingBytes(string source)
{
    var length = source.Length * sizeof(char);
    var result = new byte[length];
    unsafe
    {
        fixed (char* firstSourceChar = source)
        fixed (byte* firstDestination = result)
        {
            var firstSource = (byte*)firstSourceChar;
            UnsafeMemoryCopy(
                firstDestination,
                firstSource,
                (uint)length);
        }
    }

    return result;
}

This function will get you a copy of the bytes underlying your string, pretty quickly. You'll get those bytes in whatever way they are encoded on your system. This encoding is almost certainly UTF-16LE, but that is an implementation detail you shouldn't have to care about.

It would be safer, simpler and more reliable to just call,

System.Text.Encoding.Unicode.GetBytes()

In all likelihood this will give the same result, is easier to type, and the bytes will always round-trip with a call to

System.Text.Encoding.Unicode.GetString()
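
As a rough check of the "same result" claim, the sketch below compares a raw copy of the string's UTF-16 code units with the output of Encoding.Unicode. The names RawVersusUnicode, RawCopy and Matches are invented here, and the comparison is only expected to hold on little-endian machines and for strings without unpaired surrogates.

```csharp
using System;
using System.Linq;
using System.Text;

public static class RawVersusUnicode
{
    // Copies the string's UTF-16 code units byte for byte,
    // in whatever byte order the machine uses.
    public static byte[] RawCopy(string s)
    {
        byte[] bytes = new byte[s.Length * sizeof(char)];
        Buffer.BlockCopy(s.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }

    // True when the raw copy is identical to the UTF-16LE bytes
    // produced by Encoding.Unicode.
    public static bool Matches(string s)
    {
        return RawCopy(s).SequenceEqual(Encoding.Unicode.GetBytes(s));
    }
}
```

On a little-endian machine, Matches("abc\u00e9") should be true, which is why the encoding-based call is usually the simpler choice.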

Simply use this:

byte[] myByte= System.Text.ASCIIEncoding.Default.GetBytes(myString);

The key issue is that a glyph in a string takes 32 bits (16 bits for a character code) but a byte only has 8 bits to spare. A one-to-one mapping doesn't exist unless you restrict yourself to strings that only contain ASCII characters. System.Text.Encoding has lots of ways to map a string to byte[]; you need to pick one that avoids loss of information and that is easy for your client to use when she needs to map the byte[] back to a string.

UTF-8 is a popular encoding; it is compact and not lossy.


A string can be converted to a byte array in a few different ways, due to the following fact: .NET supports Unicode, and Unicode standardizes several different encodings called UTFs. They have different lengths of byte representation but are equivalent in the sense that when a string is encoded, it can be decoded back to the same string. However, if the string is encoded with one UTF and decoded on the assumption of a different UTF, it can get corrupted.
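
A minimal sketch of that failure mode, with hypothetical names (MismatchedDecoding, Decode): the text is written with one encoding and read with another.

```csharp
using System.Text;

public static class MismatchedDecoding
{
    // Encodes `text` with the writer's encoding, then decodes the
    // resulting bytes with the reader's encoding.
    public static string Decode(string text, Encoding writer, Encoding reader)
    {
        byte[] bytes = writer.GetBytes(text);
        return reader.GetString(bytes);
    }
}
```

Decode("h\u00e9llo", Encoding.UTF8, Encoding.UTF8) returns the original text, while Decode("h\u00e9llo", Encoding.UTF8, Encoding.Unicode) returns unrelated characters, because the UTF-8 bytes are reinterpreted as UTF-16 code units.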

Also, .NET supports non-Unicode encodings, but they are not valid in the general case (they are valid only if the actual string uses a limited subset of Unicode code points, such as ASCII). Internally, .NET uses UTF-16, but for stream representation UTF-8 is usually used. It is also the de facto standard for the Internet.

Not surprisingly, serialization of a string into an array of bytes and deserialization are supported by the class System.Text.Encoding, which is an abstract class; its derived classes support concrete encodings: ASCIIEncoding and four UTFs (System.Text.UnicodeEncoding supports UTF-16).

Ref this link.

For serialization to an array of bytes, use System.Text.Encoding.GetBytes. For the inverse operation, use System.Text.Encoding.GetChars. This function returns an array of characters, so to get a string, use the string constructor System.String(char[]).
Ref this page.
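
As a side note, Encoding.GetString decodes straight to a string, so the intermediate char[] is optional. The sketch below (the names DecodeHelpers, ViaGetChars and ViaGetString are invented for illustration) shows both routes, which should yield the same result.

```csharp
using System.Text;

public static class DecodeHelpers
{
    // Decodes via an intermediate char[] and the string(char[]) constructor.
    public static string ViaGetChars(Encoding encoding, byte[] bytes)
    {
        return new string(encoding.GetChars(bytes));
    }

    // Decodes directly to a string.
    public static string ViaGetString(Encoding encoding, byte[] bytes)
    {
        return encoding.GetString(bytes);
    }
}
```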

Example:

string myString = "some string";

System.Text.Encoding encoding = System.Text.Encoding.UTF8; // or some other, but prefer a UTF if Unicode is used
byte[] bytes = encoding.GetBytes(myString);

// the next lines were written in response to follow-up questions:

myString = new string(encoding.GetChars(bytes));
bytes = encoding.GetBytes(myString);
myString = new string(encoding.GetChars(bytes));
bytes = encoding.GetBytes(myString);

// how many times shall I repeat it to show there is a round-trip? :-)

With the advent of Span<T> released with C# 7.2, the canonical technique to capture the underlying memory representation of a string into a managed byte array is:

byte[] bytes = "rubbish_\u9999_string".AsSpan().AsBytes().ToArray();

Converting it back should be a non-starter because that means you are in fact interpreting the data somehow, but for the sake of completeness:

string s;
unsafe
{
    fixed (char* f = &bytes.AsSpan().NonPortableCast<byte, char>().DangerousGetPinnableReference())
    {
        s = new string(f);
    }
}

The names NonPortableCast and DangerousGetPinnableReference should further the argument that you probably shouldn't be doing this.

Note that working with Span<T> requires installing the System.Memory NuGet package.
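
The method names in the snippets above come from pre-release builds of System.Memory. As far as I know, the released library exposes the equivalent operations on System.Runtime.InteropServices.MemoryMarshal, so a sketch against the released API might look like the following. The wrapper names (SpanBytes, ToRawBytes, FromRawBytes) are made up, and the same caveat applies: this reinterprets memory and depends on the machine's byte order.

```csharp
using System;
using System.Runtime.InteropServices;

public static class SpanBytes
{
    // Views the string's UTF-16 code units as bytes and copies them
    // into a new managed array.
    public static byte[] ToRawBytes(string s)
    {
        return MemoryMarshal.AsBytes(s.AsSpan()).ToArray();
    }

    // Reinterprets the bytes as UTF-16 code units and builds a string.
    // bytes.Length is assumed to be even.
    public static string FromRawBytes(byte[] bytes)
    {
        return MemoryMarshal.Cast<byte, char>(bytes.AsSpan()).ToString();
    }
}
```

No unsafe block is needed in this form, although the cast is just as non-portable as before.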

Regardless, the actual original question and follow-up comments imply that the underlying memory is not being "interpreted" (which I assume means is not modified or read beyond the need to write it as-is), indicating that some implementation of the Stream class should be used instead of reasoning about the data as strings at all.


You can use following code to convert a string to a byte array in .NET

string s_unicode = "abcéabc";
byte[] utf8Bytes = System.Text.Encoding.UTF8.GetBytes(s_unicode);

Usage:

    string text = "string";
    byte[] array = System.Text.Encoding.UTF8.GetBytes(text);

The result is:

[0] = 115
[1] = 116
[2] = 114
[3] = 105
[4] = 110
[5] = 103

It depends on the encoding of your string (ASCII, UTF-8, ...).

For example:

byte[] b1 = System.Text.Encoding.UTF8.GetBytes (myString);
byte[] b2 = System.Text.Encoding.ASCII.GetBytes (myString);

A small example of why encoding matters:

string pi = "\u03a0";
byte[] ascii = System.Text.Encoding.ASCII.GetBytes (pi);
byte[] utf8 = System.Text.Encoding.UTF8.GetBytes (pi);

Console.WriteLine (ascii.Length); //Will print 1
Console.WriteLine (utf8.Length); //Will print 2
Console.WriteLine (System.Text.Encoding.ASCII.GetString (ascii)); //Will print '?'

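ASCII simply isn't equipped to deal with special characters.

Internally, the .NET Framework uses UTF-16 to represent strings, so if you simply want to get the exact bytes that .NET uses, use System.Text.Encoding.Unicode.GetBytes (...).

For more information, see Character Encoding in the .NET Framework (MSDN).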


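You need to take the encoding into account because one character can be represented by one or more bytes (up to about 6), and different encodings treat these bytes differently.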
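
A quick way to see the variable width is to count bytes per character. The sketch below uses hypothetical helper names (ByteCounts, Utf8Length, Utf16Length) around the real Encoding.GetByteCount method.

```csharp
using System.Text;

public static class ByteCounts
{
    // Number of bytes the text occupies when encoded as UTF-8.
    public static int Utf8Length(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    // Number of bytes the text occupies when encoded as UTF-16.
    public static int Utf16Length(string text)
    {
        return Encoding.Unicode.GetByteCount(text);
    }
}
```

In UTF-8, "A" takes 1 byte, "\u00e9" (é) takes 2, "\u20ac" (€) takes 3 and "\U0001F600" (an emoji) takes 4. In UTF-16, the first three take 2 bytes each and the emoji takes 4.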

Joel has a post on this:

joelonsoftware.com/articles/Unicode.html


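To demonstrate that Mehrdad's answer works: his approach can even preserve unpaired surrogate characters (many people raised this objection against my answer, but everyone is equally guilty of it, e.g. System.Text.Encoding.UTF8.GetBytes and System.Text.Encoding.Unicode.GetBytes; those encoding methods cannot preserve the high surrogate character d800, for example, and merely replace high surrogate characters with the value fffd):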

using System;

class Program
{     
    static void Main(string[] args)
    {
        string t = "爱虫";            
        string s = "Test\ud800Test"; 

        byte[] dumpToBytes = GetBytes(s);
        string getItBack = GetString(dumpToBytes);

        foreach (char item in getItBack)
        {
            Console.WriteLine("{0} {1}", item, ((ushort)item).ToString("x"));
        }    
    }

    static byte[] GetBytes(string str)
    {
        byte[] bytes = new byte[str.Length * sizeof(char)];
        System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }

    static string GetString(byte[] bytes)
    {
        char[] chars = new char[bytes.Length / sizeof(char)];
        System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
        return new string(chars);
    }        
}

Output:

T 54
e 65
s 73
t 74
? d800
T 54
e 65
s 73
t 74

Try the same with System.Text.Encoding.UTF8.GetBytes or System.Text.Encoding.Unicode.GetBytes; they will merely replace high surrogate characters with the value fffd.
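
The replacement behavior can be reproduced with a short sketch. The names SurrogateRoundTrip and ViaEncoding are invented for this example, and the default (replacement) fallback of the built-in encodings is assumed.

```csharp
using System.Text;

public static class SurrogateRoundTrip
{
    // Encodes and immediately decodes the text with the same encoding.
    public static string ViaEncoding(string text, Encoding encoding)
    {
        return encoding.GetString(encoding.GetBytes(text));
    }
}
```

With the input "Test\ud800Test", the character at index 4 comes back as U+FFFD for both Encoding.UTF8 and Encoding.Unicode, whereas the BlockCopy approach above returns d800 unchanged.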

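Every time there is activity on this question, I still think about a serializer (be it from Microsoft or from a third-party component) that can persist strings even if they contain unpaired surrogate characters; I google this every now and then: serialization unpaired surrogate character .NET. This doesn't make me lose any sleep, but every now and then somebody comments on my answer to say it is flawed, yet their answers are equally flawed when it comes to unpaired surrogate characters.

Darn, Microsoft should have just used System.Buffer.BlockCopy in its BinaryFormatter ツ

Thanks!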


Try this, a lot less code:

System.Text.Encoding.UTF8.GetBytes("TEST String");

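This is a popular question. It is important to understand what the question author is asking, and that it differs from what is probably the most common need. To discourage misuse of the code where it is not needed, I answer the common need first.

Common need

Every string has a character set and an encoding. When you convert a System.String object to an array of System.Byte, you still have a character set and an encoding. For most usages, you know which character set and encoding you need, and .NET makes it easy to "copy with conversion". Just choose the appropriate Encoding class.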

// using System.Text;
Encoding.UTF8.GetBytes(".NET String to byte array")

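The conversion may need to handle cases where the target character set or encoding doesn't support a character that is in the source. You have some choices: exception, substitution, or skipping. The default policy is to substitute a '?'.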

// using System.Text;
var text = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes("You win €100")); 
                                                      // -> "You win ?100"

Clearly, conversions are not necessarily lossless!
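
If silent substitution is not acceptable, the exception policy can be requested explicitly. The following is a minimal sketch, with invented names (StrictAscii, TryEncode), that builds an ASCII encoding whose fallbacks throw instead of substituting.

```csharp
using System;
using System.Text;

public static class StrictAscii
{
    // ASCII encoding configured to throw on characters it cannot
    // represent, instead of replacing them with '?'.
    private static readonly Encoding Strict = Encoding.GetEncoding(
        "us-ascii",
        EncoderFallback.ExceptionFallback,
        DecoderFallback.ExceptionFallback);

    // Returns true and the encoded bytes when every character is ASCII;
    // returns false (and an empty array) otherwise.
    public static bool TryEncode(string text, out byte[] bytes)
    {
        try
        {
            bytes = Strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            bytes = Array.Empty<byte>();
            return false;
        }
    }
}
```

With this setup, "You win 100" encodes normally, while "You win \u20ac100" is rejected rather than turned into "You win ?100".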

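Note: for System.String, the source character set is Unicode.

The only confusing thing is that .NET uses the name of a character set as the name of one particular encoding of that character set. Encoding.Unicode should be called Encoding.UTF16.

That's it for most usages. If that's what you need, stop reading here. If you don't understand what an encoding is, see the fun article at joelonsoftware.com/articles/Unicode.html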

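Special need

Now, the question author asks, "Every string is stored as an array of bytes, right? Why can't I simply have those bytes?"

He doesn't want any conversion.

From the C# specification:

Character and string processing in C# uses Unicode encoding. The char type represents a UTF-16 code unit, and the string type represents a sequence of UTF-16 code units.

So, we know that if we ask for the null conversion (i.e., from UTF-16 to UTF-16), we'll get the desired result: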

Encoding.Unicode.GetBytes(".NET String to byte array")

But to avoid mentioning encodings, we must do it another way. If an intermediate data type is acceptable, there is a conceptual shortcut:

".NET String to byte array".ToCharArray()

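That doesn't give us the desired data type, but Mehrdad's answer shows how to convert this Char array to a Byte array using BlockCopy. However, this copies the string twice! And it, too, explicitly uses encoding-specific code: the data type System.Char.

The only way to get at the actual bytes the String is stored in is to use a pointer. The fixed statement allows taking the address of values. From the C# specification:

[For] an expression of type string, ... the initializer computes the address of the first character in the string.

To do so, the compiler emits code that skips over the other parts of the string object using RuntimeHelpers.OffsetToStringData. So, to get the raw bytes, just create a pointer to the string and copy the number of bytes needed.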

// using System.Runtime.InteropServices
unsafe byte[] GetRawBytes(String s)
{
    if (s == null) return null;
    var codeunitCount = s.Length;
    /* We know that String is a sequence of UTF-16 codeunits 
       and such codeunits are 2 bytes */
    var byteCount = codeunitCount * 2; 
    var bytes = new byte[byteCount];
    fixed(void* pRaw = s)
    {
        Marshal.Copy((IntPtr)pRaw, bytes, 0, byteCount);
    }
    return bytes;
}

As @CodesInChaos pointed out, the result depends on the endianness of the machine. But the question author is not concerned with that.
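
For readers who do care about byte order, naming the encoding pins it down regardless of the machine. A small sketch, with hypothetical names (ByteOrderDemo, LittleEndian, BigEndian):

```csharp
using System.Text;

public static class ByteOrderDemo
{
    // UTF-16 with the low-order byte first, on every machine.
    public static byte[] LittleEndian(string text)
    {
        return Encoding.Unicode.GetBytes(text);
    }

    // UTF-16 with the high-order byte first, on every machine.
    public static byte[] BigEndian(string text)
    {
        return Encoding.BigEndianUnicode.GetBytes(text);
    }
}
```

For "A", LittleEndian yields the bytes 41 00 and BigEndian yields 00 41. The pointer-based copy above produces whichever of the two matches the machine it runs on.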


BinaryFormatter bf = new BinaryFormatter();
byte[] bytes;
MemoryStream ms = new MemoryStream();

string orig = "喂 Hello 谢谢 Thank You";
bf.Serialize(ms, orig);
ms.Seek(0, 0);
bytes = ms.ToArray();

MessageBox.Show("Original bytes Length: " + bytes.Length.ToString());

MessageBox.Show("Original string Length: " + orig.Length.ToString());

for (int i = 0; i < bytes.Length; ++i) bytes[i] ^= 168; // pseudo encrypt
for (int i = 0; i < bytes.Length; ++i) bytes[i] ^= 168; // pseudo decrypt

BinaryFormatter bfx = new BinaryFormatter();
MemoryStream msx = new MemoryStream();            
msx.Write(bytes, 0, bytes.Length);
msx.Seek(0, 0);
string sx = (string)bfx.Deserialize(msx);

MessageBox.Show("Still intact :" + sx);

MessageBox.Show("Deserialize string Length(still intact): " 
    + sx.Length.ToString());

BinaryFormatter bfy = new BinaryFormatter();
MemoryStream msy = new MemoryStream();
bfy.Serialize(msy, sx);
msy.Seek(0, 0);
byte[] bytesy = msy.ToArray();

MessageBox.Show("Deserialize bytes Length(still intact): " 
   + bytesy.Length.ToString());

byte[] strToByteArray(string str)
{
    System.Text.ASCIIEncoding enc = new System.Text.ASCIIEncoding();
    return enc.GetBytes(str);
}





