C++: std::wstring vs std::string





6 Answers

I recommend avoiding std::wstring, on Windows or elsewhere, except when an interface requires it, or in the immediate vicinity of Windows API calls and the corresponding encoding conversions, where it serves as syntactic sugar.

My view is summarized in http://utf8everywhere.org of which I am a co-author.

Unless your application is API-call-centric (e.g. mainly a UI application), the suggestion is to store Unicode strings in std::string, encoded in UTF-8, and to perform the conversion near API calls. The benefits outlined in the article outweigh the apparent annoyance of conversion, especially in complex applications. This is doubly so for multi-platform and library development.

And now, answering your questions:

  1. A few weak reasons. It exists for historical reasons, from a time when wide characters were believed to be the proper way of supporting Unicode. It is now used to interface with APIs that prefer UTF-16 strings. I use it only in the direct vicinity of such API calls.
  2. This has nothing to do with std::string itself, which can hold whatever encoding you put in it. The only question is how you treat its content. My recommendation is UTF-8, so that it can hold all Unicode characters correctly. This is common practice on Linux, but I think Windows programs should do it as well.
  3. No.
  4. Wide character is a confusing name. In the early days of Unicode, there was a belief that every character could be encoded in two bytes, hence the name. Today, it stands for "any part of the character that is two bytes long". UTF-16 is seen as a sequence of such byte pairs (aka wide characters). A character in UTF-16 takes either one or two pairs.
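
As a rough illustration of the conversion step and of item 4, here is a minimal sketch of a UTF-8 to UTF-16 converter in portable C++. The function name utf8_to_utf16 is hypothetical and not part of the original answer; on Windows one would more likely call MultiByteToWideChar with CP_UTF8. The sketch is simplified and does not reject overlong encodings or encoded surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (an assumption for illustration): decode UTF-8 held in
// a std::string and re-encode it as UTF-16 code units.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;
        // The lead byte tells us how many bytes the sequence occupies.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        // Each continuation byte (10xxxxxx) contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += len;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of Unicode range");
        if (cp < 0x10000) {
            // Fits in one 16-bit unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Needs two 16-bit units: a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

With this helper, the two-byte UTF-8 sequence for U+00E9 becomes a single 16-bit unit, while the four-byte sequence for U+1F600 becomes the surrogate pair 0xD83D, 0xDE00, i.e. two units for one character.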

I am not able to understand the differences between std::string and std::wstring. I know wstring supports wide characters such as Unicode characters. I have got the following questions:

  1. When should I use std::wstring over std::string?
  2. Can std::string hold the entire ASCII character set, including the special characters?
  3. Is std::wstring supported by all popular C++ compilers?
  4. What is exactly a "wide character"?



  1. When you want to have wide characters stored in your string. How wide depends on the implementation: Visual C++ defaults to 16 bits if I remember correctly, while GCC's default depends on the target (it is 32 bits here). Please note that wchar_t (the wide character type) has nothing to do with Unicode. It is merely guaranteed that it can store all the members of the largest character set that the implementation supports through its locales, and that it is at least as long as char. You can store Unicode strings in std::string just fine using the UTF-8 encoding too, but the string won't understand the meaning of Unicode code points. So str.size() won't give you the number of logical characters in your string, merely the number of char or wchar_t elements stored in that string/wstring. For that reason, the gtk/glib C++ wrapper folks have developed a Glib::ustring class that can handle UTF-8.

    If your wchar_t is 32 bits long, then you can use UTF-32 as the Unicode encoding, and you can store and handle Unicode strings using a fixed-length encoding (in UTF-32 every code point takes exactly one element). This means your wstring's s.size() function will then return both the number of wchar_t elements and the number of logical characters.

  2. Yes, char is always at least 8 bits long, which means it can store all ASCII values.
  3. Yes, all major compilers support it.
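
To make the point about size() in item 1 concrete, here is a small sketch that counts code points in a UTF-8 std::string. The helper name count_code_points is hypothetical, the input is assumed to be well-formed UTF-8, and code points are not quite the same thing as user-perceived characters (combining marks count separately).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count Unicode code points in a UTF-8 encoded string by
// counting every byte that is not a continuation byte (bit pattern 10xxxxxx).
// Assumes the input is well-formed UTF-8.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the UTF-8 bytes of "naïve" (where ï is encoded as 0xC3 0xAF), size() reports 6 elements while this helper reports 5 code points.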






  1. When you want to use Unicode strings and not just ASCII; helpful for internationalisation.
  2. Yes, but it doesn't play well with 0.
  3. Not aware of any that don't.
  4. A wide character is the compiler-specific way of handling the fixed-length representation of a Unicode character: for MSVC it is a 2-byte character, for GCC I understand it is 4 bytes. And a +1 for http://www.joelonsoftware.com/articles/Unicode.html
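
The difference in wchar_t width mentioned in item 4 can be observed with a character outside the Basic Multilingual Plane. This is only a sketch; the function name emoji_units is hypothetical.

```cpp
#include <cstddef>
#include <string>

// U+1F600 lies outside the Basic Multilingual Plane.
// With a 16-bit wchar_t (e.g. Visual C++) the literal is stored as a
// surrogate pair, so size() is 2; with a 32-bit wchar_t (e.g. GCC on Linux)
// it is stored as a single element, so size() is 1.
std::size_t emoji_units()
{
    const std::wstring s = L"\U0001F600";
    return s.size();
}
```

So a 2-byte wchar_t is not a fixed-length representation for every Unicode character, whereas a 4-byte wchar_t is.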



There are some very good answers here, but I think there are a couple of things I can add regarding Windows/Visual Studio. This is based on my experience with VS2015. On Linux, basically the answer is to use UTF-8 encoded std::string everywhere. On Windows/VS it gets more complex. Here is why. Windows expects strings stored using chars to be encoded using the locale codepage. This is almost always the ASCII character set followed by 128 other special characters depending on your location. Let me just state that this is not just when using the Windows API; there are three other major places where these strings interact with standard C++. These are string literals, output to std::cout using <<, and passing a filename to std::fstream.

I will be up front here that I am a programmer, not a language specialist. I appreciate that UCS2 and UTF-16 are not the same, but for my purposes they are close enough to be interchangeable and I use them as such here. I'm not actually sure which Windows uses, but I generally don't need to know either. I've stated UCS2 in this answer, so sorry in advance if I upset anyone with my ignorance of this matter, and I'm happy to change it if I have things wrong.

String literals

If you enter string literals that contain only characters that can be represented by your codepage then VS stores them in your file with 1 byte per character encoding based on your codepage. Note that if you change your codepage or give your source to another developer using a different code page then I think (but haven't tested) that the character will end up different. If you run your code on a computer using a different code page then I'm not sure if the character will change too.

If you enter any string literals that cannot be represented by your codepage then VS will ask you to save the file as Unicode. The file will then be encoded as UTF-8. This means that all non-ASCII characters (including those which are on your codepage) will be represented by 2 or more bytes. This means that if you give your source to someone else, the source will look the same. However, before passing the source to the compiler, VS converts the UTF-8 encoded text to codepage encoded text, and any characters missing from the codepage are replaced with ?.

The only way to guarantee correctly representing a Unicode string literal in VS is to precede the string literal with an L, making it a wide string literal. In this case VS will convert the UTF-8 encoded text from the file into UCS2. You then need to pass this string literal into a std::wstring constructor, or you need to convert it to UTF-8 and put it in a std::string. Or, if you want, you can use the Windows API functions to encode it using your codepage to put it in a std::string, but then you may as well not have used a wide string literal.
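
One way to sidestep the source-file encoding question, offered here as an assumption rather than something tested on VS2015, is to spell non-ASCII characters with escape sequences, so the literal does not depend on how the file is saved. The variable names below are hypothetical.

```cpp
#include <string>

// The character U+00E9 ("e" with an acute accent), written with escapes only.

// Narrow literal: the two raw bytes of the UTF-8 encoding, stored as-is.
const std::string narrow_utf8 = "\xC3\xA9";

// Wide literal: a universal character name, stored as a single wchar_t
// whether wchar_t is 16 or 32 bits wide.
const std::wstring wide = L"\u00E9";
```

The narrow string then holds two char elements and the wide string holds one wchar_t element, both representing the same character.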

std::cout

When outputting to the console using << you can only use std::string, not std::wstring and the text must be encoded using your locale codepage. If you have a std::wstring then you must convert it using one of the Windows API functions and any characters not on your codepage get replaced by ? (maybe you can change the character, I can't remember).

std::fstream filenames

Windows OS uses UCS2/UTF-16 for its filenames so whatever your codepage, you can have files with any Unicode character. But this means that to access or create files with characters not on your codepage you must use std::wstring. There is no other way. This is a Microsoft specific extension to std::fstream so probably won't compile on other systems. If you use std::string then you can only utilise filenames that only include characters on your codepage.

Your options

If you are just working on Linux then you probably didn't get this far. Just use UTF-8 std::string everywhere.

If you are just working on Windows, just use UCS2 std::wstring everywhere. Some purists may say use UTF-8 and then convert when needed, but why bother with the hassle?

If you are cross platform then it's a mess to be frank. If you try to use UTF-8 everywhere on Windows then you need to be really careful with your string literals and output to the console. You can easily corrupt your strings there. If you use std::wstring everywhere on Linux then you may not have access to the wide version of std::fstream, so you have to do the conversion, but there is no risk of corruption. So personally I think this is a better option. Many would disagree, but I'm not alone - it's the path taken by wxWidgets for example.

Another option could be to typedef unicodestring as std::string on Linux and std::wstring on Windows, and have a macro called UNI() which prefixes L on Windows and nothing on Linux, then the code

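#include <fstream>
#include <string>
#include <iostream>

#ifdef _WIN32
#include <Windows.h>   // only available (and only needed) on Windows
typedef std::wstring unicodestring;
#define UNI(text) L ## text
// Convert UTF-16 to the system ANSI codepage (CP_ACP) for narrow output.
// If the console uses a different codepage, GetConsoleOutputCP() reports it.
std::string formatForConsole(const unicodestring &str)
{
    std::string result;
    if (str.empty())
        return result;
    const int length = static_cast<int>(str.size());
    // The first call only reports how many bytes the converted text needs.
    const int bytes = WideCharToMultiByte(CP_ACP, 0, str.data(), length, NULL, 0, NULL, NULL);
    if (bytes <= 0)
        return result;
    result.resize(bytes);
    // The second call performs the conversion into the buffer.
    WideCharToMultiByte(CP_ACP, 0, str.data(), length, &result[0], bytes, NULL, NULL);
    return result;
}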
#else
typedef std::string unicodestring;
#define UNI(text) text
std::string formatForConsole(const unicodestring &str)
{
    return str;
}
#endif

int main()
{

    unicodestring fileName(UNI("fileName"));
    std::ofstream fout;
    fout.open(fileName);
    std::cout << formatForConsole(fileName) << std::endl;
    return 0;
}

would be fine on either platform I think.

Answers

So, to answer your questions:

1) If you are programming for Windows, then all the time. If cross platform, then maybe all the time, unless you want to deal with possible corruption issues on Windows or write some code with platform-specific #ifdefs to work around the differences. If just using Linux, then never.

2) Yes. In addition, on Linux you can use it for all Unicode too. On Windows you can only use it for all Unicode if you choose to manually encode using UTF-8. But the Windows API and standard C++ classes will expect the std::string to be encoded using the locale codepage. This includes all ASCII plus another 128 characters, which change depending on the codepage your computer is set up to use.

3) I believe so, but if not, it is just a simple typedef of std::basic_string using wchar_t instead of char.

4) A wide character is a character type which is bigger than the 1-byte standard char type. On Windows it is 2 bytes, on Linux it is 4 bytes.




When should you NOT use wide-characters?

When you're writing code before the year 1990.

Obviously, I'm being flip, but really, it's the 21st century now. The 128 characters of ASCII have long since ceased to be sufficient. Yes, you can use UTF-8, but why bother with the headaches?



