What's the difference between utf8_general_ci and utf8_unicode_ci
I've got two options for unicode that look promising for a mysql database.
- utf8_general_ci: unicode (multilingual), case-insensitive
- utf8_unicode_ci: unicode (multilingual), case-insensitive
Can you please explain what is the difference between utf8_general_ci and utf8_unicode_ci? What are the effects of choosing one over the other when designing a database?
For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters.
utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
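To make the idea of an expansion concrete, here is a small Python sketch. It is not how MySQL implements either collation; it only leans on Python's built-in str.casefold(), which happens to apply a one-to-many mapping for “ß”. The helper names equal_with_expansions and equal_one_to_one are made up for this illustration.

```python
def equal_with_expansions(a: str, b: str) -> bool:
    """Compare two strings after full Unicode case folding.

    Case folding may map one character to several (for example "ß" -> "ss"),
    which is the kind of mapping the documentation calls an expansion.
    Illustration only; this is not MySQL's collation logic.
    """
    return a.casefold() == b.casefold()


def equal_one_to_one(a: str, b: str) -> bool:
    """Compare strictly character by character, ignoring case.

    Because each character is matched against exactly one other character,
    strings of different lengths can never compare as equal.
    """
    if len(a) != len(b):
        return False
    return all(x.lower() == y.lower() for x, y in zip(a, b))
```

With these helpers, equal_with_expansions("Straße", "STRASSE") is true, while equal_one_to_one("Straße", "STRASSE") is false, because a strictly one-to-one comparison has no way to match one character against two.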
utf8_general_ci is a very simple — and on Unicode, very broken — collation, one that gives incorrect results on general Unicode text. What it does is:
- converts to Unicode normalization form D for canonical decomposition
- removes any combining characters
- converts to upper case
This does not work correctly on Unicode, because it does not understand Unicode casing. Unicode casing alone is much more complicated than an ASCII-minded approach can handle. For example:
- The lowercase of “ẞ” is “ß”, but the uppercase of “ß” is “SS”.
- There are two lowercase Greek sigmas, but only one uppercase one; consider “Σίσυφος”.
- Letters like “ø” do not decompose to an “o” plus a diacritic, meaning that they won’t sort correctly.
There are many other subtleties.
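The three steps listed above, and why they fall short, can be sketched in Python with the standard unicodedata module. This is only a model of the procedure as described in this answer, not MySQL's actual code. The function name naive_key is hypothetical, and restricting uppercasing to one-to-one mappings is an assumption made here to mimic a collation that cannot expand characters.

```python
import unicodedata


def naive_key(text: str) -> str:
    """Build a comparison key using the three simplistic steps."""
    # Step 1: canonical decomposition (NFD), e.g. "é" -> "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: drop combining characters, i.e. those with a nonzero
    # canonical combining class.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.combining(ch) == 0
    )
    # Step 3: uppercase one character at a time. Assumption: keep only
    # one-to-one mappings, so "ß" is NOT expanded to "SS".
    return "".join(
        ch.upper() if len(ch.upper()) == 1 else ch for ch in stripped
    )
```

Under this sketch, naive_key("résumé") equals naive_key("RESUME"), so plain accents are handled. However, naive_key("ø") differs from naive_key("o") because “ø” has no canonical decomposition, and naive_key("straße") differs from naive_key("strasse") because the key never expands “ß”. Python's own full case mapping, by contrast, gives "ß".upper() == "SS", and both "σ" and "ς" uppercase to "Σ".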
utf8_unicode_ci uses the standard Unicode Collation Algorithm and supports so-called expansions and ligatures. For example:
- German letter ß (U+00DF LATIN SMALL LETTER SHARP S) is sorted near "ss".
- Letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".
utf8_general_ci does not support expansions or ligatures; it sorts all these letters as single characters, and sometimes in the wrong order.
utf8_unicode_ci is generally more accurate for all scripts. For example, in the Cyrillic block, utf8_unicode_ci is fine for all of these languages: Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian, whereas utf8_general_ci is fine only for the Russian and Bulgarian subset of Cyrillic. The extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian are not sorted well.
The cost of utf8_unicode_ci is that it is a little bit slower than utf8_general_ci. But that’s the price you pay for correctness. Either you can have a fast answer that’s wrong, or a very slightly slower answer that’s right. Your choice.
It is very difficult to ever justify giving wrong answers, so it’s best to assume that utf8_general_ci doesn’t exist and to always use utf8_unicode_ci. Well, unless you want wrong answers.