utf8 Is the u8 string literal necessary in C++11

string literal c (4)

From Wikipedia:

For the purpose of enhancing support for Unicode in C++ compilers, the definition of the type char has been modified to be at least the size necessary to store an eight-bit coding of UTF-8.

I'm wondering what exactly this means for writing portable applications. Is there any difference between writing this

const char[] str = "Test String";

or this?

const char[] str = u8"Test String";

Is there be any reason not to use the latter for every string literal in your code?

What happens when there are non-ASCII-Characters inside the TestString?

The encoding of "Test String" is the implementation-defined system encoding (the narrow, possibly multibyte one).

The encoding of u8"Test String" is always UTF-8.

The examples aren't terribly telling. If you included some Unicode literals (such as \U0010FFFF) into the string, then you would always get those (encoded as UTF-8), but whether they could be expressed in the system-encoded string, and if yes what their value would be, is implementation-defined.

If it helps, imagine you're authoring the source code on an EBCDIC machine. Then the literal "Test String" is always EBCDIC-encoded in the source file itself, but the u8-initialized array contains UTF-8 encoded values, whereas the first array contains EBCDIC-encoded values.

You quote Wikipedia:

For the purpose of enhancing support for Unicode in C++ compilers, the definition of the type char has been modified to be at least the size necessary to store an eight-bit coding of UTF-8.

Well, the “For the purpose of” is bullshit. char has always been guaranteed to be at least 8 bits, that is, CHAR_BIT has always been required to be ≥8, due to the range required for char in the C standard. Which is (quote C++11 § “incorporated” into the C++ standard.

If I should guess about the purpose of that change of wording, it would be to just clarify things for those readers unaware of the dependency on the C standard.

Regarding the effect of the u8 literal prefix, it

  • affects the encoding of the string in the executable, but

  • unfortunately it does not affect the type.

Thus, in both cases "tørrfisk" and u8"tørrfisk" you get a char const[n]. But in the former literal the encoding is whatever is selected for the compiler, e.g. with Latin 1 (or Windows ANSI Western) that would be 8 bytes for the characters plus a nullbyte, for array size 9. While in the latter literal the encoding is guaranteed to be UTF-8, where the “ø” will be encoded with 2 or 3 bytes (I don’t recall exactly), for a slightly larger array size.

If the execution character set of the compiler is set to UTF-8, it makes no difference if u8 is used or not, since the compiler converts the characters to UTF-8 in both cases.

However if the compilers execution character set is the system's non UTF8 codepage (default for e.g. Visual C++), then non ASCII characters might not properly handled when u8 is omitted. For example, the conversion to wide strings will crash e.g. in VS15:

std::string narrowJapanese("スタークラフト");
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convertWindows;
std::wstring wide = convertWindows.from_bytes(narrowJapanese); // Unhandled C++ exception in xlocbuf.

The compiler chooses a native encoding natural to the platform. On typical POSIX systems it will probably choose ASCII and something possibly depending on environment's setting for character values outside the ASCII range. On mainframes it will probably choose EBCDIC. Comparing strings received, e.g., from files or the command line will probably work best with the native character set. When processing files explicitly encoded using UTF-8 you are, however, probably best off using u8"..." strings.

That said, with the recent changes relating to character encodings a fundamental assumption of string processing in C and C++ got broken: each internal character object (char, wchar_t, etc.) used to represent one character. This is clearly not true anymore for a UTF-8 string whee each character object just represents a byte of some character. As a result all the string manipulation, character classification, etc. functions won't necessarily work on these strings. We don't have any good library lined up to deal with such strings for inclusion into the standard.