You are here

MariaDB and MySQL Character Set Conversion

Contents

Introduction

Recently we had a consulting engagement where we had to help the customer to migrate from latin1 Character Set to utf8mb4 Character Set. In the same MySQL consulting engagement we considered to upgrade from MySQL 5.6 to MySQL 5.7 as well [ Lit. ]. We decided to split the change in 2 parts: Upgrading to 5.7 in the first step and converting to uft8mb4 in the second step. There were various reasons for this decision:

  • 2 smaller changes are easier to control then one big shot.
  • We assume that in 5.7 we experience less problems with utf8mb4 because the trend given by MySQL was more towards utf8mb4 in 5.7 than in MySQL 5.6. So we hope to hit less problems and bugs.

For Upgrading see also MariaDB and MySQL Upgrade Problems

Remark: It makes possibly also sens to think about Collations before starting with the conversion!

Character Sets

Historically MariaDB and MySQL had the default Character Set latin1 (Latin-1 or ISO-8859-1) which was sufficient for most of the western hemisphere.

But as technology spreads and demands increase other cultures want to have their characters represented understandably as well. So Unicode standard was invented. And MariaDB and MySQL applied this standard as well.

The original MariaDB/MySQL utf8(mb3) implementation was not perfect or complete so they implemented utf8mb4 as a super set of utf8(mb3). So at least since MariaDB/MySQL version 5.5 latin1, utf8 and utf8mb4 are available. The current MySQL 5.7 utf8mb4 implementation should cover Unicode 9.0.0:

SQL> SELECT * FROM information_schema.character_sets
WHERE character_set_name LIKE 'utf8%' OR character_set_name = 'latin1';
+--------------------+----------------------+----------------------+--------+
| CHARACTER_SET_NAME | DEFAULT_COLLATE_NAME | DESCRIPTION          | MAXLEN |
+--------------------+----------------------+----------------------+--------+
| latin1             | latin1_swedish_ci    | cp1252 West European |      1 |
| utf8               | utf8_general_ci      | UTF-8 Unicode        |      3 |
| utf8mb4            | utf8mb4_general_ci   | UTF-8 Unicode        |      4 |
+--------------------+----------------------+----------------------+--------+

The default Character Set up to MariaDB 10.4 and MySQL 5.7 was latin1. In MySQL 8.0 the default Character Set has changed to utf8mb4. There are no signs so far that MariaDB will take the same step:

SQL> status
--------------
mysql  Ver 8.0.16 for linux-glibc2.12 on x86_64 (MySQL Community Server - GPL)

Connection id:          84
Current database:
Current user:           root@localhost
SSL:                    Not in use
Current pager:          stdout
Using outfile:          ''
Using delimiter:        ;
Server version:         8.0.16 MySQL Community Server - GPL
Protocol version:       10
Connection:             Localhost via UNIX socket
Server characterset:    utf8mb4
Db     characterset:    utf8mb4
Client characterset:    utf8mb4
Conn.  characterset:    utf8mb4
UNIX socket:            /var/run/mysqld/mysql-3332.sock
Uptime:                 3 days 47 min 13 sec

So we see a general trend from latin1 to utf8(mb3) to utf8mb4 technically and business wise (aka globalization).

For the DBA this means sooner or later we have to think about a conversion of all tables of the whole database instance (all tables of all schemata) to utf8mb4!

Steps to convert Character Set to utf8mb4

Analyzing the Server

First of all one should analyze the system (O/S, database instance and client/application). On the server we can run the following command to verify the actual used and supported Character Set:

# locale
LANG=en_GB.UTF-8
LANGUAGE=
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=

On the MariaDB/MySQL database instance we check the current server configuration and the session configuration with the following commands:

SQL> SHOW SESSION VARIABLES
WHERE Variable_name LIKE 'character_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+-------------------+
| Variable_name            | Value             |
+--------------------------+-------------------+
| character_set_client     | utf8              |
| character_set_connection | utf8              |
| character_set_database   | latin1            |
| character_set_filesystem | binary            |
| character_set_results    | utf8              |
| character_set_server     | latin1            |
| character_set_system     | utf8              |
| collation_connection     | utf8_general_ci   |
| collation_database       | latin1_swedish_ci |
| collation_server         | latin1_swedish_ci |
+--------------------------+-------------------+

SQL> SHOW GLOBAL VARIABLES
WHERE Variable_name LIKE 'character_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+-------------------+
| Variable_name            | Value             |
+--------------------------+-------------------+
| character_set_client     | latin1            |
| character_set_connection | latin1            |
| character_set_database   | latin1            |
| character_set_filesystem | binary            |
| character_set_results    | latin1            |
| character_set_server     | latin1            |
| character_set_system     | utf8              |
| collation_connection     | latin1_swedish_ci |
| collation_database       | latin1_swedish_ci |
| collation_server         | latin1_swedish_ci |
+--------------------------+-------------------+

These configuration variables are for Client/Server communication: character_set_client, character_set_connection and character_set_results. These for Server configuration: character_set_server and character_set_database (deprecated in MySQL 5.7). And these for System internals and File System access: character_set_system and character_set_filesystem.

Sometimes we see customers using the Logon Trigger init_connect to force clients for a specific Character Set:

SQL> SHOW GLOBAL VARIABLES LIKE 'init_connect';
+---------------+------------------+
| Variable_name | Value            |
+---------------+------------------+
| init_connect  | SET NAMES latin1 |
+---------------+------------------+

The SET NAMES command sets the character_set_client, character_set_connection and character_set_results session variables. [ Lit. ]

Analyzing the Application and the Clients

Similar steps to analyze the Application and Clients should be taken. We want to answer the following questions:

  • Support of utf8 of Application/Client O/S (Windows)?
  • Support of utf8 of Web Server (Apache (AddDefaultCharset utf-8), Nginx, IIS, ...)
  • Version of programming language (Java, PHP (5.4 and newer?), ...)
  • Version of MariaDB and MySQL Connectors (JDBC (5.1.47 and newer?), ODBC (5.3.11 and newer?), mysqli/mysqlnd (⋝7.0.19?, ⋝7.1.5?), ...)
  • Application code (header('Content-Type: text/html; charset=utf-8');, <?xml version="1.0" encoding="UTF-8"?>, <meta http-equiv="Content-Type" content="text/html; charset=utf-8">, <form accept-charset="utf-8">, htmlspecialchars($str, ENT_NOQUOTES, "UTF-8"), $mysqli->set_charset('utf8mb4');, mbstring [ Lit. ], etc.

See also: Configuring Application Character Set and Collation

If you do not have your Application under control your DBA can help you to find out with the General Query Log (SET GLOBAL general_log = 1;) what is going on:

190815 19:03:00    12 Connect   root@localhost on 
                   12 Query     select @@version_comment limit 1
                   12 Query     SET NAMES latin1
                   12 Query     SET NAMES utf8
190815 19:05:24    12 Quit

or with some queries on the PERFORMANCE_SCHEMA:

-- Works since MySQL 5.6/MariaDB 10.0
SQL> SELECT t.thread_id, t.processlist_id, t.processlist_user, t.processlist_host, t.processlist_db
     , sca.attr_name, sca.attr_value
  FROM performance_schema.threads AS t
  JOIN performance_schema.session_connect_attrs AS sca ON sca.processlist_id = t.processlist_id
 WHERE t.processlist_user IS NOT NULL
   AND t.thread_id = 103
;
+-----------+----------------+------------------+------------------+----------------+----------------------------------+---------------------+
| thread_id | processlist_id | processlist_user | processlist_host | processlist_db | attr_name                        | attr_value          |
+-----------+----------------+------------------+------------------+----------------+----------------------------------+---------------------+
|       103 |             78 | replication      | localhost        | NULL           | _os                              | linux-glibc2.12     |
|       103 |             78 | replication      | localhost        | NULL           | _client_name                     | libmysql            |
|       103 |             78 | replication      | localhost        | NULL           | _pid                             | 29269               |
|       103 |             78 | replication      | localhost        | NULL           | program_name                     | mysqld              |
|       103 |             78 | replication      | localhost        | NULL           | _platform                        | x86_64              |
|       103 |             78 | replication      | localhost        | NULL           | _client_replication_channel_name | NULL                |
|       103 |             78 | replication      | localhost        | NULL           | _client_role                     | binary_log_listener |
|       103 |             78 | replication      | localhost        | NULL           | _client_version                  | 5.7.26              |
+-----------+----------------+------------------+------------------+----------------+----------------------------------+---------------------+

-- Works since MySQL 5.7 only
SELECT t.thread_id, t.processlist_id, t.processlist_user, t.processlist_host, t.processlist_db
     , vbt.variable_name, vbt.variable_value
  FROM performance_schema.threads AS t
  JOIN performance_schema.variables_by_thread AS vbt ON vbt.thread_id = t.thread_id
 WHERE t.processlist_user IS NOT NULL
   AND (vbt.variable_name like 'charac%' OR vbt.variable_name LIKE 'coll%')
   AND t.thread_id = 103
;
+-----------+----------------+------------------+------------------+----------------+--------------------------+-------------------+
| thread_id | processlist_id | processlist_user | processlist_host | processlist_db | variable_name            | variable_value    |
+-----------+----------------+------------------+------------------+----------------+--------------------------+-------------------+
|       103 |             78 | replication      | localhost        | NULL           | character_set_client     | latin1            |
|       103 |             78 | replication      | localhost        | NULL           | character_set_connection | latin1            |
|       103 |             78 | replication      | localhost        | NULL           | character_set_database   | latin1            |
|       103 |             78 | replication      | localhost        | NULL           | character_set_filesystem | binary            |
|       103 |             78 | replication      | localhost        | NULL           | character_set_results    | latin1            |
|       103 |             78 | replication      | localhost        | NULL           | character_set_server     | latin1            |
|       103 |             78 | replication      | localhost        | NULL           | collation_connection     | latin1_swedish_ci |
|       103 |             78 | replication      | localhost        | NULL           | collation_database       | latin1_swedish_ci |
|       103 |             78 | replication      | localhost        | NULL           | collation_server         | latin1_swedish_ci |
+-----------+----------------+------------------+------------------+----------------+--------------------------+-------------------+

Preparation of the Server Settings and the Application

To have a better control of the impact of some changes we decided to do some changes on the Application first:

  • Application is setting the Character Set properly itself ($mysqli->set_charset('utf8mb4') [ Lit. ]). In the same step also sql_mode can be set by the application so we can use the defaults on server side in the future.
  • Apache and PHP are configured to support UTF-8.
  • After this step init_connect, character_set_server and character_set_database can be changed to utf8mb4 on the Server and --skip-character-set-client-handshake can be removed at the same time [ Lit. ]

Convert Tables to utf8mb4

First we checked and converted the default Character Set of the Schemata/Databases:

SQL> SELECT schema_name, default_character_set_name, default_collation_name
  FROM information_schema.schemata
 WHERE schema_name NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
;
+-------------+----------------------------+------------------------+
| schema_name | default_character_set_name | default_collation_name |
+-------------+----------------------------+------------------------+
| focmm       | latin1                     | latin1_swedish_ci      |
| test        | latin1                     | latin1_swedish_ci      |
| foodmart    | latin1                     | latin1_swedish_ci      |
| erp         | latin1                     | latin1_swedish_ci      |
| world       | latin1                     | latin1_swedish_ci      |
+-------------+----------------------------+------------------------+

Converting the Schemata is done as follows:

SQL> ALTER DATABASE ... DEFAULT CHARACTER SET utf8mb4;

which is a fast operation.

To convert the tables there are many different possibilities we considered:

  • The easy one: ALTER TABLE ... CONVERT TO CHARACTER SET utf8mb4;
  • The possibly faster one: dump/restore with sed 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8mb4/'
  • The possibly fastest one: drop Secondary Indexes, dump/restore with sed, create Secondary Indexes again (mysqlpump).
  • The automated one: With Perconas pt-online-schema-change [ Lit. ] or Facebooks OnlineSchemaChange OSC [ Lit. ]
  • The most elegant but not supported one: Master/Slave Replication.

Already with converting the first table we run into a problem:

SQL> ALTER TABLE test CONVERT TO CHARACTER SET utf8mb4;
ERROR 1709 (HY000): Index column size too large. The maximum column size is 767 bytes.

This table had a Primary Key of a length of more than 255 bytes and this is not possible with the old InnoDB Antelope File Format:

SQL> SHOW GLOBAL VARIABLES
 WHERE Variable_name LIKE 'innodb_file_format%' OR Variable_name LIKE 'innodb_large%';
+--------------------------+----------+
| Variable_name            | Value    |
+--------------------------+----------+
| innodb_file_format       | Antelope |
| innodb_file_format_check | ON       |
| innodb_file_format_max   | Antelope |
| innodb_large_prefix      | OFF      |
+--------------------------+----------+

So we have to find out first, which tables are still in old Antelope File Format:

SQL> SELECT table_schema
     , CASE WHEN row_format = 'Dynamic' THEN 'Barracuda'
            WHEN row_format = 'Compressed' THEN 'Barracuda'
            WHEN row_format = 'Compact' THEN 'Antelope'
            WHEN row_format = 'Redundant' THEN 'Antelope' END AS 'file_format'
     , COUNT(*)
  FROM information_schema.tables
 WHERE engine = 'InnoDB'
   AND table_schema NOT IN ('information_schema', 'sys', 'mysql')
 GROUP BY table_schema, file_format
;
+--------------+-------------+----------+
| table_schema | file_format | count(*) |
+--------------+-------------+----------+
| foodmart     | Barracuda   |       23 |
| test         | Barracuda   |        1 |
| world        | Antelope    |        2 |
| world        | Barracuda   |        1 |
+--------------+-------------+----------+

Then we could covert the table correctly forcing the Barracuda File format:

SQL> ALTER TABLE test CONVERT TO CHARACTER SET utf8mb4, ROW_FORMAT=dynamic;

Testing of new Character Set

The last but most important step is to test the changes. Here we recommend to do as many as possible different tests:

  • MySQL CLI: mysql
  • phpMyAdmin
  • MySQL Workbench
  • Other GUIs
  • Your Application

Especially test careful data with umlauts (öäüÄÖÜß), foreign Characters (Turkish, Cyrillic, CJK Characters) and Emojis. Good Candidates are: Lastname, City or free text fields like Comment.

Some aid you also can get from here:


MySQL Pump

mysqlpumpA Database Backup Program

This utility is currently not available for MariaDB. If works for MySQL 5.7 and newer and does NOT support MySQL 5.6. So we cannot use it for the Upgrade Process from MySQL 5.6 to 5.7. Newest MySQL Releases contain Bug fixes and even new Features in mysqlpump so we can assume it is still supported and actively maintained. Recent releases contain some fixes for trivial bugs so we can assume mysqlpump is not widely used yet and not as mature yet as mysqldump. An alternative product would be MyDumper from Domas@Facebook (Lauchpad, GitHub).

Interesting features are:

  • Parallel dumping of databases.
  • Secondary Index restore separated from Table Restore.

Missing features are:

  • Does not support MariaDB and MySQL 5.6 and earlier.
  • No binary log position is stored with backup (--master-data) but only GTID? So we have to use/enable GTID based Replication.

First test:

# mysqlpump --user=root --all-databases --single-transaction --triggers --routines --events --hex-blob > /tmp/mysqlpump_backup.sql
Dump progress: 1/4 tables, 0/87 rows
Dump progress: 41/46 tables, 789469/6956116 rows
Dump progress: 41/46 tables, 1608219/6956116 rows
Dump progress: 45/46 tables, 2358412/6956116 rows
Dump progress: 45/46 tables, 3437912/6956116 rows
Dump progress: 45/46 tables, 4493662/6956116 rows
Dump progress: 45/46 tables, 5177662/6956116 rows
Dump progress: 45/46 tables, 5568662/6956116 rows
Dump progress: 45/46 tables, 6216162/6956116 rows
Dump progress: 45/46 tables, 6716662/6956116 rows
Dump completed in 9860 milliseconds

ALTER TABLE `foodmart`.`account` ADD KEY `i_account_parent` (`account_parent_id`);
ALTER TABLE `foodmart`.`category` ADD KEY `i_category_parent` (`category_parent_id`);
ALTER TABLE `foodmart`.`employee` ADD KEY `i_employee_department_id` (`department_id`);
ALTER TABLE `foodmart`.`employee` ADD KEY `i_employee_store_id` (`store_id`);
ALTER TABLE `foodmart`.`employee` ADD KEY `i_employee_supervisor_id` (`supervisor_id`);
ALTER TABLE `foodmart`.`expense_fact` ADD KEY `i_expense_store_id` (`store_id`);
ALTER TABLE `foodmart`.`expense_fact` ADD KEY `i_expense_account_id` (`account_id`);
ALTER TABLE `foodmart`.`expense_fact` ADD KEY `i_expense_time_id` (`time_id`);

Test against MySQL 5.6:

Server version is not compatible. Server version should be 5.7.8 or above.

MySQL Master/Slave Replication for Character Set conversion

Replication of Columns Having Different Data Types should work since MySQL 5.1.21. General rule when Statement Based Replication is used: If the query would work executed directly on the Slave it should also work on the Slave with Replication! So utf8mb4 on the Slave should also be possible with SBR. In MySQL 5.5.3 Attribute Promotion and Attribute Demotion was introduced with LOSSY and NON_LOSSY slave_type_conversions. But: Replication between columns using different character sets is not supported!

Some tests:

MasterSQL> INSERT INTO test VALUES (NULL, 'öäüÄÖÜß', NULL);
MasterSQL> SELECT * FROM test;
+----+----------------+---------------------+
| id | data           | ts                  |
+----+----------------+---------------------+
|  1 | öäüÄÖÜß        | 2019-08-15 19:48:48 |
|  2 | öäüÄÖÜß        | 2019-08-15 19:50:00 |
|  3 | öäüÄÖÜß        | 2019-08-15 19:52:29 |
+----+----------------+---------------------+

SlaveSQL> SHOW SLAVE STATUS\G
...
Last_SQL_Errno: 1677
Last_SQL_Error: Column 1 of table 'test.test' cannot be converted from type 'varchar(255(bytes))' to type 'varchar(1020(bytes) utf8mb4)'
...
SlaveSQL> SELECT * FROM test;
+----+----------------+---------------------+
| id | data           | ts                  |
+----+----------------+---------------------+
|  1 | öäüÄÖÜß        | 2019-08-15 19:48:48 |
|  2 | öäüÄÖÜß        | 2019-08-15 19:50:00 |
+----+----------------+---------------------+

See also: Why is varchar(255) not varchar(255)? And:

MasterSQL> SET SESSION binlog_format=STATEMENT;
MasterSQL> INSERT INTO test VALUES (NULL, 'öäüÄÖÜß', NULL);

SlaveSQL> SELECT * FROM test;
+----+----------------+---------------------+
| id | data           | ts                  |
+----+----------------+---------------------+
|  1 | öäüÄÖÜß        | 2019-08-15 19:48:48 |
|  2 | öäüÄÖÜß        | 2019-08-15 19:50:00 |
|  4 | öäüÄÖÜß        | 2019-08-15 19:54:16 |
+----+----------------+---------------------+

MySQL Shell, mysqlsh Upgrade Checker Utility

Since MySQL 5.7 we have a new MySQL Utility, the MySQL Shell (mysqlsh). The Upgrade Checker Utility in MySQL Shell will replace mysql_upgrade in MySQL 8.0. So we wanted to know if it is of any use for converting Character Set to utf8mb4? You can use the Upgrade Checker Utility to check MySQL 5.7 server instances for compatibility errors and issues for upgrading. But: The Upgrade Checker Utility does not support checking MySQL Server instances 5.6 an earlier. MySQL Shell 8.0 is highly recommended for use with MySQL Server 8.0 and 5.7. Please upgrade to MySQL Shell 8.0. [ Lit. ]

# ./mysqlsh root@localhost:3319

JS > util.checkForServerUpgrade()
The MySQL server at localhost:3319, version 5.6.42-log - MySQL Community Server
(GPL), will now be checked for compatibility issues for upgrade to MySQL
8.0.17...
Util.checkForServerUpgrade: Detected MySQL server version is 5.6.42, but this tool requires server to be at least at version 5.7 (LogicError)

JS > util.checkForServerUpgrade('root@127.0.0.1:3309')
The MySQL server at localhost:3309, version 5.7.26-log - MySQL Community Server
(GPL), will now be checked for compatibility issues for upgrade to MySQL
8.0.17...

...

2) Usage of db objects with names conflicting with reserved keywords in 8.0
  No issues found

3) Usage of utf8mb3 charset
  Warning: The following objects use the utf8mb3 character set. It is
    recommended to convert them to use utf8mb4 instead, for improved Unicode
    support.
  More information:
    https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html

  test.test.data - column's default character set: utf8

4) Table names in the mysql schema conflicting with new tables in 8.0
  No issues found

...

Errors:   0
Warnings: 1
Notices:  0

No fatal errors were found that would prevent an upgrade, but some potential issues were detected. Please ensure that the reported issues are not significant before upgrading.

JS > util.checkForServerUpgrade('root@127.0.0.1:3309', {'targetVersion':'5.7.27'})
The MySQL server at 127.0.0.1:3309, version 5.7.26-log - MySQL Community Server
(GPL), will now be checked for compatibility issues for upgrade to MySQL
5.7.27...
Util.checkForServerUpgrade: This tool supports checking upgrade to MySQL server versions 8.0.11 to 8.0.17 (LogicError)

Because latin1 tables are still supported in MySQL 8.0 the Upgrade Checker Utility does not complain or do anything for latin1 tables. So it is not of any use for the utf8mb4 conversion task except you already have some old utf8(mb3) tables.