String class - C# strings




How do I get a consistent byte representation of strings in C# without manually specifying an encoding?

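Also, please explain why encoding has to be taken into consideration. Can't I simply get the bytes the string is stored in? Why is there a dependency on encodings?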

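Because there is no such thing as "the bytes of the string".

A string (or more generally, a text) is composed of characters: letters, digits, and other symbols. That's all. Computers, however, know nothing about characters; they can only handle bytes. Therefore, if you want to store or transmit text using a computer, you need to transform the characters into bytes. How do you do that? This is where encodings come onto the scene.

An encoding is nothing but a convention for translating logical characters into physical bytes. The simplest and best-known encoding is ASCII, and it is all you need if you write in English. For other languages you will need a more complete encoding, and any of the Unicode encodings is the safest choice nowadays.

So, in short, trying to "get the bytes of a string without using encodings" is as impossible as "writing a text without using any language".

By the way, I strongly recommend that you (and anyone, for that matter) read this small piece of wisdom: joelonsoftware.com/articles/Unicode.html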
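To make the point about conventions concrete, here is a minimal sketch that encodes the same two-character text under different encodings. The class and method names (EncodingComparison, ByteCount, HexDump) are made up for illustration; only the System.Text.Encoding and BitConverter calls come from .NET.

```csharp
using System;
using System.Text;

public static class EncodingComparison
{
    // Number of bytes the text occupies under the given encoding.
    public static int ByteCount(string text, Encoding encoding)
    {
        return encoding.GetBytes(text).Length;
    }

    // Hyphen-separated hex dump of the encoded text.
    public static string HexDump(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }
}
```

Calling EncodingComparison.HexDump("h\u00e9", Encoding.UTF8) and then the same with Encoding.Unicode yields two different byte sequences for the same text, which is why "the bytes of a string" only makes sense once an encoding has been chosen.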

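How do I convert a string to a byte[] in .NET (C#) without manually specifying a specific encoding?

I'm going to encrypt the string. I can encrypt it without converting, but I'd still like to know why encoding comes into play here.

Also, why should encoding be taken into account at all? Can't I simply get the bytes the string is stored in? Why is there a dependency on character encodings?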


It depends on what you want the bytes FOR

This is because, as Tyler so aptly said, "Strings aren't pure data. They also have information." In this case, the information is an encoding that was assumed when the string was created.

Assuming that you have binary data (rather than text) stored in a string

This is based off of OP's comment on his own question, and is the correct question if I understand OP's hints at the use-case.

Storing binary data in strings is probably the wrong approach because of the assumed encoding mentioned above! Whatever program or library stored that binary data in a string (instead of a byte[] array which would have been more appropriate) has already lost the battle before it has begun. If they are sending the bytes to you in a REST request/response or anything that must transmit strings, Base64 would be the right approach.
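A minimal sketch of the Base64 approach mentioned above, assuming the binary payload has to travel through a channel that only carries strings. The Base64Transport class and its method names are hypothetical; Convert.ToBase64String and Convert.FromBase64String are the standard .NET calls.

```csharp
using System;

public static class Base64Transport
{
    // Turns arbitrary bytes into text that survives any string-only channel.
    public static string ToText(byte[] data)
    {
        return Convert.ToBase64String(data);
    }

    // Recovers exactly the original bytes from the Base64 text.
    public static byte[] FromText(string text)
    {
        return Convert.FromBase64String(text);
    }
}
```

Unlike stuffing the bytes into a string directly, this round-trips every byte value, including sequences that would not be valid text in any encoding.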

If you have a text string with an unknown encoding

Everybody else answered this incorrect question incorrectly.

If the string looks good as-is, just pick an encoding (preferably one starting with UTF), use the corresponding System.Text.Encoding.???.GetBytes() function, and tell whoever you give the bytes to which encoding you picked.


Converting a string to a byte array in C#:

public static byte[] StrToByteArray(string str)
{
    System.Text.UTF8Encoding encoding = new System.Text.UTF8Encoding();
    return encoding.GetBytes(str);
}

Fastest way

public static byte[] GetBytes(string text)
{
    return System.Text.ASCIIEncoding.UTF8.GetBytes(text);
}

EDIT: as Makotosan commented, this is now the best way:

Encoding.UTF8.GetBytes(text)

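From byte[] to string (note that this returns a hyphen-separated hexadecimal dump such as "73-74-72", not the decoded text; to recover the text, call Encoding.UTF8.GetString(bytes) instead):

        return BitConverter.ToString(bytes);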
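The two ways of going from byte[] to string do very different things, so here is a small sketch contrasting them. The BytesToText class and its method names are illustrative only; it assumes the bytes were produced with UTF-8.

```csharp
using System;
using System.Text;

public static class BytesToText
{
    // Hex dump of the bytes, useful for logging or debugging.
    public static string AsHex(byte[] bytes)
    {
        return BitConverter.ToString(bytes);
    }

    // Decodes the bytes back into text, assuming they are UTF-8.
    public static string AsUtf8Text(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }
}
```

For the UTF-8 bytes of "string", AsHex gives "73-74-72-69-6E-67" while AsUtf8Text gives back "string".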

Here is my unsafe implementation of String to Byte[] conversion:

public static unsafe Byte[] GetBytes(String s)
{
    Int32 length = s.Length * sizeof(Char);
    Byte[] bytes = new Byte[length];

    fixed (Char* pInput = s)
    fixed (Byte* pBytes = bytes)
    {
        Byte* source = (Byte*)pInput;
        Byte* destination = pBytes;

        if (length >= 16)
        {
            do
            {
                *((Int64*)destination) = *((Int64*)source);
                *((Int64*)(destination + 8)) = *((Int64*)(source + 8));

                source += 16;
                destination += 16;
            }
            while ((length -= 16) >= 16);
        }

        if (length > 0)
        {
            if ((length & 8) != 0)
            {
                *((Int64*)destination) = *((Int64*)source);

                source += 8;
                destination += 8;
            }

            if ((length & 4) != 0)
            {
                *((Int32*)destination) = *((Int32*)source);

                source += 4;
                destination += 4;
            }

            if ((length & 2) != 0)
            {
                *((Int16*)destination) = *((Int16*)source);

                source += 2;
                destination += 2;
            }

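            if ((length & 1) != 0)
            {
                // Copy the final odd byte without advancing the pointers first.
                // (Unreachable in practice: length is s.Length * sizeof(Char), always even.)
                destination[0] = source[0];
            }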
        }
    }

    return bytes;
}

It's way faster than the accepted answer's version, even if it is not as elegant. Here are my Stopwatch benchmarks over 10000000 iterations:

[First String: Length 20]
Buffer.BlockCopy: 746ms
Unsafe: 557ms

[Second String: Length 50]
Buffer.BlockCopy: 861ms
Unsafe: 753ms

[Third String: Length 100]
Buffer.BlockCopy: 1250ms
Unsafe: 1063ms

In order to use it, you have to tick "Allow Unsafe Code" in your project build properties. As of .NET Framework 3.5, this method can also be used as a String extension:

public static unsafe class StringExtensions
{
    public static Byte[] ToByteArray(this String s)
    {
        // Method Code
    }
}

If you really want a copy of the underlying bytes of a string, you can use a function like the one that follows. However, you shouldn't; please read on to find out why.

// using System.Runtime.InteropServices;
[DllImport(
        "msvcrt.dll",
        EntryPoint = "memcpy",
        CallingConvention = CallingConvention.Cdecl,
        SetLastError = false)]
private static extern unsafe void* UnsafeMemoryCopy(
    void* destination,
    void* source,
    uint count);

public static byte[] GetUnderlyingBytes(string source)
{
    var length = source.Length * sizeof(char);
    var result = new byte[length];
    unsafe
    {
        fixed (char* firstSourceChar = source)
        fixed (byte* firstDestination = result)
        {
            var firstSource = (byte*)firstSourceChar;
            UnsafeMemoryCopy(
                firstDestination,
                firstSource,
                (uint)length);
        }
    }

    return result;
}

This function will get you a copy of the bytes underlying your string, pretty quickly. You'll get those bytes in whatever way they are encoded on your system. This encoding is almost certainly UTF-16LE, but that is an implementation detail you shouldn't have to care about.

It would be safer, simpler and more reliable to just call,

System.Text.Encoding.Unicode.GetBytes()

In all likelihood this will give the same result, is easier to type, and the bytes will always round-trip with a call to

System.Text.Encoding.Unicode.GetString()
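As a rough check of the "same result" claim, the sketch below compares a raw copy of the string's chars with Encoding.Unicode.GetBytes and round-trips through GetString. The Utf16RoundTrip class and its method names are hypothetical, and the comparison is only expected to hold on a little-endian machine with well-formed strings (no unpaired surrogates).

```csharp
using System;
using System.Linq;
using System.Text;

public static class Utf16RoundTrip
{
    // Copies the chars of the string byte-for-byte, with no encoding step.
    public static byte[] RawCopy(string s)
    {
        var bytes = new byte[s.Length * sizeof(char)];
        Buffer.BlockCopy(s.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }

    // True when the raw copy and the UTF-16LE encoding agree.
    public static bool MatchesUnicodeEncoding(string s)
    {
        return RawCopy(s).SequenceEqual(Encoding.Unicode.GetBytes(s));
    }

    // Encodes to UTF-16LE and decodes again.
    public static string RoundTrip(string s)
    {
        return Encoding.Unicode.GetString(Encoding.Unicode.GetBytes(s));
    }
}
```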

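Simply use this:

// Encoding.Default is the platform default encoding (the ANSI code page on .NET Framework, UTF-8 on .NET Core), not ASCII.
byte[] myByte = System.Text.Encoding.Default.GetBytes(myString);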

The key issue is that a glyph in a string can take up to 32 bits (each char is a 16-bit code unit), but a byte only has 8 bits to spare. A one-to-one mapping doesn't exist unless you restrict yourself to strings that only contain ASCII characters. System.Text.Encoding has lots of ways to map a string to byte[]; you need to pick one that avoids loss of information and that is easy for your client to use when she needs to map the byte[] back to a string.

UTF-8 is a popular encoding; it is compact and not lossy.


A string can be converted to a byte array in a few different ways, due to the following fact: .NET supports Unicode, and Unicode standardizes several different encodings called UTFs. They produce byte representations of different lengths but are equivalent in the sense that when a string is encoded, it can be decoded back to the same string. However, if the string is encoded with one UTF and decoded on the assumption of a different UTF, it can be garbled.
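A minimal sketch of what happens when the encoder and decoder disagree, assuming the bytes are produced as UTF-8 and then misread as UTF-16. The MismatchedDecoding class and its method name are made up for illustration.

```csharp
using System.Text;

public static class MismatchedDecoding
{
    // Encodes as UTF-8, then (wrongly) decodes the same bytes as UTF-16LE.
    public static string EncodeUtf8DecodeUtf16(string s)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(s);
        return Encoding.Unicode.GetString(utf8Bytes);
    }
}
```

For the input "ab" the two UTF-8 bytes 0x61 0x62 are read as a single UTF-16 code unit 0x6261, so the result is one CJK character instead of two Latin letters.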

Also, .NET supports non-Unicode encodings, but they are not valid in the general case (they are valid only if a limited subset of Unicode code points, such as ASCII, is used in the actual string). Internally, .NET uses UTF-16, but for stream representation UTF-8 is usually used. It is also the de facto standard for the Internet.

Not surprisingly, serialization of a string into an array of bytes and deserialization are supported by the class System.Text.Encoding, which is an abstract class; its derived classes support concrete encodings: ASCIIEncoding and four UTFs (System.Text.UnicodeEncoding supports UTF-16).


For serialization to an array of bytes, use System.Text.Encoding.GetBytes. For the inverse operation, use System.Text.Encoding.GetChars. This function returns an array of characters, so to get a string, use the string constructor System.String(char[]).

Example:

string myString = "some string"; // ... some string

System.Text.Encoding encoding = System.Text.Encoding.UTF8; // or some other, but prefer a UTF if Unicode is used
byte[] bytes = encoding.GetBytes(myString);

// the next lines are written in response to follow-up questions:

myString = new string(encoding.GetChars(bytes));
bytes = encoding.GetBytes(myString);
myString = new string(encoding.GetChars(bytes));
bytes = encoding.GetBytes(myString);

// how many times shall I repeat it to show there is a round-trip? :-)

With the advent of Span<T> released with C# 7.2, the canonical technique to capture the underlying memory representation of a string into a managed byte array is:

byte[] bytes = "rubbish_\u9999_string".AsSpan().AsBytes().ToArray();

Converting it back should be a non-starter because that means you are in fact interpreting the data somehow, but for the sake of completeness:

string s;
unsafe
{
    fixed (char* f = &bytes.AsSpan().NonPortableCast<byte, char>().DangerousGetPinnableReference())
    {
        s = new string(f);
    }
}

The names NonPortableCast and DangerousGetPinnableReference should further the argument that you probably shouldn't be doing this.

Note that working with Span<T> requires installing the System.Memory NuGet package.

Regardless, the actual original question and follow-up comments imply that the underlying memory is not being "interpreted" (which I assume means is not modified or read beyond the need to write it as-is), indicating that some implementation of the Stream class should be used instead of reasoning about the data as strings at all.
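The helper names used in the snippets above come from a pre-release version of the Span<T> API. As far as I know, the released API exposes the equivalent operations on System.Runtime.InteropServices.MemoryMarshal, so a rough sketch under that assumption looks like this (the SpanBytesSketch class and its method names are hypothetical):

```csharp
using System;
using System.Runtime.InteropServices;

public static class SpanBytesSketch
{
    // Reinterprets the string's UTF-16 code units as bytes and copies them out.
    public static byte[] ToRawBytes(string s)
    {
        ReadOnlySpan<char> chars = s.AsSpan();
        return MemoryMarshal.AsBytes(chars).ToArray();
    }

    // Reinterprets the bytes as UTF-16 code units and builds a string from them.
    public static string FromRawBytes(byte[] bytes)
    {
        ReadOnlySpan<byte> raw = new ReadOnlySpan<byte>(bytes);
        ReadOnlySpan<char> chars = MemoryMarshal.Cast<byte, char>(raw);
        return chars.ToString();
    }
}
```

This avoids the unsafe block and the reliance on a terminating null character, but it still depends on the in-memory layout of char, so the same caveats about interpreting raw memory apply.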


You can use the following code to convert a string to a byte array in .NET:

string s_unicode = "abcéabc";
byte[] utf8Bytes = System.Text.Encoding.UTF8.GetBytes(s_unicode);

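To prove that Mehrdad's answer works, his approach can even persist unpaired surrogate characters (an objection many have leveled against my answer, but one of which everyone is equally guilty; for example, System.Text.Encoding.UTF8.GetBytes and System.Text.Encoding.Unicode.GetBytes cannot persist the high surrogate character d800, and merely replace high surrogate characters with the value fffd):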

using System;

class Program
{     
    static void Main(string[] args)
    {
        string t = "爱虫";            
        string s = "Test\ud800Test"; 

        byte[] dumpToBytes = GetBytes(s);
        string getItBack = GetString(dumpToBytes);

        foreach (char item in getItBack)
        {
            Console.WriteLine("{0} {1}", item, ((ushort)item).ToString("x"));
        }    
    }

    static byte[] GetBytes(string str)
    {
        byte[] bytes = new byte[str.Length * sizeof(char)];
        System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }

    static string GetString(byte[] bytes)
    {
        char[] chars = new char[bytes.Length / sizeof(char)];
        System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
        return new string(chars);
    }        
}

Output:

T 54
e 65
s 73
t 74
? d800
T 54
e 65
s 73
t 74

Try that with System.Text.Encoding.UTF8.GetBytes or System.Text.Encoding.Unicode.GetBytes; they will merely replace high surrogate characters with the value fffd.
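To see the replacement behaviour for yourself, here is a small sketch that round-trips a string through the built-in encodings. The SurrogateLoss class and its method names are invented for the example; it assumes the default Encoding.UTF8 and Encoding.Unicode instances, which use replacement fallbacks.

```csharp
using System.Text;

public static class SurrogateLoss
{
    // Round-trips through UTF-8 using the default (replacement) fallback.
    public static string Utf8RoundTrip(string s)
    {
        return Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(s));
    }

    // Round-trips through UTF-16LE using the default (replacement) fallback.
    public static string Utf16RoundTrip(string s)
    {
        return Encoding.Unicode.GetString(Encoding.Unicode.GetBytes(s));
    }
}
```

Passing "Test\ud800Test" to either method returns a string whose fifth character is no longer d800, whereas the BlockCopy approach above hands back the original code units untouched.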

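Every time there is activity on this question, I still find myself thinking of a serializer (be it from Microsoft or from a third-party component) that can persist strings even if they contain unpaired surrogate characters; I google this every now and then: serialization unpaired surrogate character .NET. This doesn't make me lose any sleep, but it is a bit annoying when every now and then somebody comments that my answer is flawed, yet their answers are equally flawed when it comes to unpaired surrogate characters.

Darn, Microsoft should have just used System.Buffer.BlockCopy in its BinaryFormatter ツ

Thanks!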


Use:

    string text = "string";
    byte[] array = System.Text.Encoding.UTF8.GetBytes(text);

The result is:

[0] = 115
[1] = 116
[2] = 114
[3] = 105
[4] = 110
[5] = 103

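The first part of your question (how to get the bytes) was already answered by others: look at the System.Text.Encoding class in the System.Text namespace.

I will address your follow-up question: why do you need to pick an encoding? Why can't you get the bytes from the string class itself?

The answer is in two parts.

First of all, the bytes used internally by the string class don't matter, and whenever you assume they do you are likely introducing a bug.

If your program lives entirely within the .Net world, then you don't need to worry about getting byte arrays for strings at all, even if you're sending data across a network. Instead, use .Net Serialization to take care of transmitting the data. You no longer have to worry about the actual bytes: the serialization formatter does it for you.

On the other hand, what if you are sending these bytes somewhere that you can't guarantee will read the data from a .Net serialized stream? In this case you definitely do need to worry about encoding, because this external system obviously cares. So again, the internal bytes used by the string don't matter: you need to pick an encoding so that you can be explicit about it on the receiving end, even if it is the same encoding that .Net uses internally.

I understand that in this case you might prefer to use the actual bytes stored by the string variable in memory where possible, with the idea that it might save some work creating your byte stream. However, I put it to you that this is just not important compared to making sure that your output is understood at the other end, and to guarantee that, you must be explicit about your encoding. Additionally, if you really want to match your internal bytes, you can already just choose the Unicode encoding and get that performance saving.

Which brings me to the second part: picking the Unicode encoding is telling .Net to use the underlying bytes. You do need to pick this encoding, because when some new-fangled Unicode-Plus comes out, the .Net runtime needs to be free to use this newer, better encoding model without breaking your program. But for the moment (and the foreseeable future), just choosing the Unicode encoding gives you what you want.

It is also important to understand that your string has to be rewritten to the wire, and that involves at least some translation of the bit pattern even when you use a matching encoding. The computer needs to account for things like big vs. little endian, network byte order, packetization, session information, etc.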


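You need to take the encoding into account because one character can be represented by one or more bytes (up to 4 in UTF-8 as standardized today; the original design allowed up to 6), and different encodings treat these bytes differently.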
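As a quick illustration of the variable width, the sketch below asks UTF-8 how many bytes a few sample characters need. The Utf8Width class and BytesFor method are hypothetical names; Encoding.GetByteCount is the standard call.

```csharp
using System.Text;

public static class Utf8Width
{
    // Number of UTF-8 bytes needed for the given text (typically one character).
    public static int BytesFor(string character)
    {
        return Encoding.UTF8.GetByteCount(character);
    }
}
```

For example, BytesFor("A"), BytesFor("\u00e9"), BytesFor("\u20ac") and BytesFor("\U0001F600") report one, two, three and four bytes respectively, whereas UTF-16 would use two bytes for the first three and four for the last.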

Joel has a post on this:

joelonsoftware.com/articles/Unicode.html


Try this, it's a lot less code:

System.Text.Encoding.UTF8.GetBytes("TEST String");

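This is a popular question. It is important to understand what the question author is asking, and that it is different from what is likely the most common need. To discourage misuse of the code where it is not needed, I've answered the latter first.

Common Need

Every string has a character set and encoding. When you convert a System.String object to an array of System.Byte you still have a character set and encoding. For most usages, you'd know which character set and encoding you need, and .NET makes it simple to "copy with conversion." Just choose the appropriate Encoding class.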

// using System.Text;
Encoding.UTF8.GetBytes(".NET String to byte array")

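The conversion may need to handle cases where the target character set or encoding doesn't support a character that's in the source. You have some choices: exception, substitution, or skipping. The default policy is to substitute a '?'.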

// using System.Text;
var text = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes("You win €100")); 
                                                      // -> "You win ?100"

Clearly, conversions are not necessarily lossless!
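If silent substitution is not acceptable, the exception policy can be requested explicitly. The following is a minimal sketch assuming an ASCII target; the StrictAscii class and TryGetBytes method are made-up names, while Encoding.GetEncoding with EncoderFallback.ExceptionFallback is the standard .NET mechanism.

```csharp
using System.Text;

public static class StrictAscii
{
    // An ASCII encoding that throws instead of substituting '?'.
    private static readonly Encoding Strict = Encoding.GetEncoding(
        "us-ascii",
        EncoderFallback.ExceptionFallback,
        DecoderFallback.ExceptionFallback);

    // Returns false (and null bytes) when the text contains a non-ASCII character.
    public static bool TryGetBytes(string text, out byte[] bytes)
    {
        try
        {
            bytes = Strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            bytes = null;
            return false;
        }
    }
}
```

With this in place, a string containing the euro sign is rejected rather than quietly turned into a question mark, which makes lossy conversions visible at the point where they happen.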

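Note: for System.String the source character set is Unicode.

The only confusing thing is that .NET uses the name of a character set for the name of one particular encoding of that character set. Encoding.Unicode should be called Encoding.UTF16.

That's it for most usages. If that's what you need, stop reading here. See the fun joelonsoftware.com/articles/Unicode.html if you don't understand what an encoding is.

Specific Need

Now, the question author asks, "Every string is stored as an array of bytes, right? Why can't I simply have those bytes?"

He doesn't want any conversion.

From the C# spec:

Character and string processing in C# uses Unicode encoding. The char type represents a UTF-16 code unit, and the string type represents a sequence of UTF-16 code units.

So, we know that if we ask for the null conversion (i.e., from UTF-16 to UTF-16), we'll get the desired result: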

Encoding.Unicode.GetBytes(".NET String to byte array")

But to avoid mentioning encodings, we must do it another way. If an intermediate data type is acceptable, there is a conceptual shortcut for this:

".NET String to byte array".ToCharArray()

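That doesn't get us the desired data type, but Mehrdad's answer shows how to convert this Char array to a Byte array using BlockCopy. However, this copies the string twice! And it, too, explicitly uses encoding-specific code: the data type System.Char.

The only way to get to the actual bytes the String is stored in is to use a pointer. The fixed statement allows taking the address of values. From the C# spec:

[For] an expression of type string, ... the initializer computes the address of the first character in the string.

To do so, the compiler emits code that skips over the other parts of the string object using RuntimeHelpers.OffsetToStringData. So, to get the raw bytes, just create a pointer to the string and copy the number of bytes needed.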

// using System.Runtime.InteropServices
unsafe byte[] GetRawBytes(String s)
{
    if (s == null) return null;
    var codeunitCount = s.Length;
    /* We know that String is a sequence of UTF-16 codeunits 
       and such codeunits are 2 bytes */
    var byteCount = codeunitCount * 2; 
    var bytes = new byte[byteCount];
    fixed(void* pRaw = s)
    {
        Marshal.Copy((IntPtr)pRaw, bytes, 0, byteCount);
    }
    return bytes;
}

As @CodesInChaos pointed out, the result depends on the endianness of the machine. But the question author is not concerned with that.
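When byte order does matter, naming the encoding pins it down regardless of the machine. A minimal sketch, with ByteOrderDemo and its method names invented for illustration; Encoding.Unicode is UTF-16 little endian and Encoding.BigEndianUnicode is UTF-16 big endian.

```csharp
using System.Text;

public static class ByteOrderDemo
{
    // UTF-16 with the low byte of each code unit first.
    public static byte[] LittleEndian(string s)
    {
        return Encoding.Unicode.GetBytes(s);
    }

    // UTF-16 with the high byte of each code unit first.
    public static byte[] BigEndian(string s)
    {
        return Encoding.BigEndianUnicode.GetBytes(s);
    }
}
```

For "A" (code unit 0x0041) the first method yields the bytes 0x41 0x00 and the second 0x00 0x41, on any hardware, whereas a raw memory copy follows whatever order the CPU happens to use.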


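// using System.IO;
// using System.Runtime.Serialization.Formatters.Binary;
// using System.Windows.Forms; // for MessageBox
BinaryFormatter bf = new BinaryFormatter();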
byte[] bytes;
MemoryStream ms = new MemoryStream();

string orig = "喂 Hello 谢谢 Thank You";
bf.Serialize(ms, orig);
ms.Seek(0, 0);
bytes = ms.ToArray();

MessageBox.Show("Original bytes Length: " + bytes.Length.ToString());

MessageBox.Show("Original string Length: " + orig.Length.ToString());

for (int i = 0; i < bytes.Length; ++i) bytes[i] ^= 168; // pseudo encrypt
for (int i = 0; i < bytes.Length; ++i) bytes[i] ^= 168; // pseudo decrypt

BinaryFormatter bfx = new BinaryFormatter();
MemoryStream msx = new MemoryStream();            
msx.Write(bytes, 0, bytes.Length);
msx.Seek(0, 0);
string sx = (string)bfx.Deserialize(msx);

MessageBox.Show("Still intact :" + sx);

MessageBox.Show("Deserialize string Length(still intact): " 
    + sx.Length.ToString());

BinaryFormatter bfy = new BinaryFormatter();
MemoryStream msy = new MemoryStream();
bfy.Serialize(msy, sx);
msy.Seek(0, 0);
byte[] bytesy = msy.ToArray();

MessageBox.Show("Deserialize bytes Length(still intact): " 
   + bytesy.Length.ToString());

byte[] strToByteArray(string str)
{
    // Note: ASCIIEncoding replaces every non-ASCII character with '?'.
    System.Text.ASCIIEncoding enc = new System.Text.ASCIIEncoding();
    return enc.GetBytes(str);
}



