[c++] std::wstring VS std::string


I recommend avoiding std::wstring on Windows or elsewhere, except when required by the interface, or anywhere near Windows API calls and respective encoding conversions as a syntactic sugar.

My view is summarized in http://utf8everywhere.org of which I am a co-author.

Unless your application is API-call-centric, e.g. mainly UI application, the suggestion is to store Unicode strings in std::string and encoded in UTF-8, performing conversion near API calls. The benefits outlined in the article outweigh the apparent annoyance of conversion, especially in complex applications. This is doubly so for multi-platform and library development.

And now, answering your questions:

  1. A few weak reasons. It exists for historical reasons, where widechars were believed to be the proper way of supporting Unicode. It is now used to interface APIs that prefer UTF-16 strings. I use them only in direct vicinity of such API calls.
  2. This has nothing to do with std::string. It can hold whatever encoding you put in it. The only question is how You treat it's content. My recommendation is UTF-8, so it will be able to hold all unicode characters correctly. It's a common practice on Linux, but I think Windows programs should do it also.
  3. No.
  4. Wide character is a confusing name. In the early days of Unicode, there was a belief that character can be encoded in two bytes, hence the name. Today, it stands for "any part of the character that is two bytes long". UTF-16 is seen as a sequence of such byte pairs (aka Wide characters). A character in UTF-16 takes either one or two pairs.

I am not able to understand the differences between std::string and std::wstring. I know wstring supports wide characters such as Unicode characters. I have got the following questions:

  1. When should I use std::wstring over std::string?
  2. Can std::string hold the entire ASCII character set, including the special characters?
  3. Is std::wstring supported by all popular C++ compilers?
  4. What is exactly a "wide character"?

A good question! I think DATA ENCODING (sometime CHARSET also involved) is a MEMORY EXPRESSION MECHANISM in order to save data to file or transfer data via network, so I answer this question as:

1.When should I use std::wstring over std::string?

If the programming platform or API function is a single-byte one, and we want to process or parse some unicode datas, e.g read from Windows' .REG file or network 2-byte stream, we should declare std::wstring variable to easy process them. e.g.: wstring ws=L"中国a"(6 octets memory: 0x4E2D 0x56FD 0x0061), we can use ws[0] to get character '中' and ws[1] to get character '国' and ws[2] to get character 'a', etc.

2.Can std::string hold the entire ASCII character set, including the special characters?

Yes. But notice: American ASCII, means each 0x00~0xFF octet stand for one character ,including printable text such as "123abc&*_&" and you said special one, mostly print it as a '.' avoid confusing editors or terminals. And some other countries extend their own "ASCII" charset ,e.g. Chinese, use 2 octets to stand for one character.

3.Is std::wstring supported by all popular C++ compilers?

Maybe, or mostly. I have used: VC++6 and GCC 3.3, YES

4.What is exactly a "wide character"?

wide character mostly indicate using 2 octets or 4 octets to hold all countries's characters. 2 octets UCS2 is a representative sample, and further e.g. English 'a', its memory is 2 octet of 0x0061(vs in ASCII 'a's memory is 1 octet 0x61)

1) As mentioned by Greg, wstring is helpful for internationalization, that's when you will be releasing your product in languages other than english

4) Check this out for wide character http://en.wikipedia.org/wiki/Wide_character

  1. When you want to have wide characters stored in your string. wide depends on the implementation. Visual C++ defaults to 16 bit if i remember correctly, while GCC defaults depending on the target. It's 32 bits long here. Please note wchar_t (wide character type) has nothing to do with unicode. It's merely guaranteed that it can store all the members of the largest character set that the implementation supports by its locales, and at least as long as char. You can store unicode strings fine into std::string using the utf-8 encoding too. But it won't understand the meaning of unicode code points. So str.size() won't give you the amount of logical characters in your string, but merely the amount of char or wchar_t elements stored in that string/wstring. For that reason, the gtk/glib C++ wrapper folks have developed a Glib::ustring class that can handle utf-8.

    If your wchar_t is 32 bits long, then you can use utf-32 as an unicode encoding, and you can store and handle unicode strings using a fixed (utf-32 is fixed length) encoding. This means your wstring's s.size() function will then return the right amount of wchar_t elements and logical characters.

  2. Yes, char is always at least 8 bit long, which means it can store all ASCII values.
  3. Yes, all major compilers support it.