Currently I'm using VARCHAR/TEXT with utf8_general_ci for all character columns in mysql. Now I want to improve database layout/performance.
What I figured out so far is to better use
CHAR instead of VARCHAR for fixed length columns as GUIDs or session idsCHAR for small columns having length of 1 or maybe 2?As I do not want to go as wide to save my GUIDs as BINARY(16) because of handling issues, I'd rather save them as CHAR(32) to especially improve keys. (I would even save 2/3 when switching from utf8 to some 1-byte-charset)
Is it good practice to mix up different character sets in same mysql (innodb) table? Or do I get better performance when all columns have same charset within same table? Or even database?
GUID/UUID/MD5/SHA1 are all hex and dash. For them
CHAR(..) CHARACTER SET ascii COLLATE ascii_general_ci
That will allow for A=a when comparing hex strings.
For Base64 things, use either of
CHAR(..) CHARACTER SET ascii COLLATE ascii_bin
BINARY(..)
since A is not semantically the same as a.
Further notes...
BINARY may be a tiny bit faster than any _bin collation, but not enough to notice.CHAR for columns that are truly fixed length; don't mislead the user by using it for other cases.%_bin is faster than %_general_ci, which is faster than other collations. Again, you would be hard-pressed to measure a difference.TINYTEXT or TINYBLOB.utf8mb4, use utf8mb4_unicode_520_ci (or utf8mb4_900_ci if using version 8.0). The 520 and 900 refer to Unicode standards; new collations are likely to come in the future.If you are entirely in Czech, then consider these charsets and collations. I list them in preferred order:
mysql> show collation like '%czech%';
+------------------+---------+-----+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+------------------+---------+-----+---------+----------+---------+
| utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | -- opens up the world
| utf8_czech_ci | utf8 | 202 | | Yes | 8 | -- opens up most of the world
| latin2_czech_cs | latin2 | 2 | | Yes | 4 | -- kinda like latin1
The rest are "useless":
| cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
| ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
| utf16_czech_ci | utf16 | 111 | | Yes | 8 |
| utf32_czech_ci | utf32 | 170 | | Yes | 8 |
+------------------+---------+-----+---------+----------+---------+
7 rows in set (0.00 sec)
More
ENUM is 1 byte, yet acts like a string. So you get the "best of both worlds". (There are drawbacks, and there is a 'religious war' among advocates for ENUM vs TINYINT vs VARCHAR.)country_code is always 2 letters, always ascii, always could benefit from case insensitive collation. So CHAR(2) CHARACTER SET ascii COLLATE ascii_general_ci is optimal. If you have something that is sometimes 1-char, sometimes 2, then flip a coin; whatever you do won't make much difference.VARCHAR (up to 255) has an extra 1-byte length attached to it. So, if your strings vary in length at all, VARCHAR is at least as good as CHAR. So simplify your brain processing: "variable length --> `VARCHAR".BIT, depending on version, may be implemented as a 1-byte TINYINT UNSIGNED. If you have only a few bits in your table, it is not worth worrying about.CREATE TABLE. I often see tables with BIGINT and DOUBLE (each 8 bytes) that could easily use smaller columns. Sometimes saving more than 50% (space).ENUM
ALTER TABLE, though it can be "inplace".'unknown' (or something like that) and making the column NOT NULL (versus NULL).A versus a).If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With