You are here
MariaDB and MySQL Character Set Conversion
Contents
- Introduction
- Character Sets
- Steps to convert Character Set to utf8mb4
- MySQL Pump
- MySQL Master/Slave Replication for Character Set conversion
- MySQL Shell,
mysqlsh
Upgrade Checker Utility
Introduction
Recently we had a consulting engagement where we had to help the customer to migrate from latin1 Character Set to utf8mb4 Character Set. In the same MySQL consulting engagement we considered to upgrade from MySQL 5.6 to MySQL 5.7 as well [ Lit. ]. We decided to split the change in 2 parts: Upgrading to 5.7 in the first step and converting to uft8mb4 in the second step. There were various reasons for this decision:
- 2 smaller changes are easier to control then one big shot.
- We assume that in 5.7 we experience less problems with utf8mb4 because the trend given by MySQL was more towards utf8mb4 in 5.7 than in MySQL 5.6. So we hope to hit less problems and bugs.
For Upgrading see also MariaDB and MySQL Upgrade Problems
Remark: It makes possibly also sens to think about Collations before starting with the conversion!
Character Sets
Historically MariaDB and MySQL had the default Character Set latin1 (Latin-1 or ISO-8859-1) which was sufficient for most of the western hemisphere.
But as technology spreads and demands increase other cultures want to have their characters represented understandably as well. So Unicode standard was invented. And MariaDB and MySQL applied this standard as well.
The original MariaDB/MySQL utf8(mb3) implementation was not perfect or complete so they implemented utf8mb4 as a super set of utf8(mb3). So at least since MariaDB/MySQL version 5.5 latin1, utf8 and utf8mb4 are available. The current MySQL 5.7 utf8mb4 implementation should cover Unicode 9.0.0:
SQL> SELECT * FROM information_schema.character_sets WHERE character_set_name LIKE 'utf8%' OR character_set_name = 'latin1'; +--------------------+----------------------+----------------------+--------+ | CHARACTER_SET_NAME | DEFAULT_COLLATE_NAME | DESCRIPTION | MAXLEN | +--------------------+----------------------+----------------------+--------+ | latin1 | latin1_swedish_ci | cp1252 West European | 1 | | utf8 | utf8_general_ci | UTF-8 Unicode | 3 | | utf8mb4 | utf8mb4_general_ci | UTF-8 Unicode | 4 | +--------------------+----------------------+----------------------+--------+
The default Character Set up to MariaDB 10.4 and MySQL 5.7 was latin1. In MySQL 8.0 the default Character Set has changed to utf8mb4. There are no signs so far that MariaDB will take the same step:
SQL> status -------------- mysql Ver 8.0.16 for linux-glibc2.12 on x86_64 (MySQL Community Server - GPL) Connection id: 84 Current database: Current user: root@localhost SSL: Not in use Current pager: stdout Using outfile: '' Using delimiter: ; Server version: 8.0.16 MySQL Community Server - GPL Protocol version: 10 Connection: Localhost via UNIX socket Server characterset: utf8mb4 Db characterset: utf8mb4 Client characterset: utf8mb4 Conn. characterset: utf8mb4 UNIX socket: /var/run/mysqld/mysql-3332.sock Uptime: 3 days 47 min 13 sec
So we see a general trend from latin1 to utf8(mb3) to utf8mb4 technically and business wise (aka globalization).
For the DBA this means sooner or later we have to think about a conversion of all tables of the whole database instance (all tables of all schemata) to utf8mb4!
Steps to convert Character Set to utf8mb4
Analyzing the Server
First of all one should analyze the system (O/S, database instance and client/application). On the server we can run the following command to verify the actual used and supported Character Set:
# locale LANG=en_GB.UTF-8 LANGUAGE= LC_CTYPE="en_GB.UTF-8" LC_NUMERIC="en_GB.UTF-8" LC_TIME="en_GB.UTF-8" LC_COLLATE="en_GB.UTF-8" LC_MONETARY="en_GB.UTF-8" LC_MESSAGES="en_GB.UTF-8" LC_PAPER="en_GB.UTF-8" LC_NAME="en_GB.UTF-8" LC_ADDRESS="en_GB.UTF-8" LC_TELEPHONE="en_GB.UTF-8" LC_MEASUREMENT="en_GB.UTF-8" LC_IDENTIFICATION="en_GB.UTF-8" LC_ALL=
On the MariaDB/MySQL database instance we check the current server configuration and the session configuration with the following commands:
SQL> SHOW SESSION VARIABLES WHERE Variable_name LIKE 'character_set\_%' OR Variable_name LIKE 'collation%'; +--------------------------+-------------------+ | Variable_name | Value | +--------------------------+-------------------+ | character_set_client | utf8 | | character_set_connection | utf8 | | character_set_database | latin1 | | character_set_filesystem | binary | | character_set_results | utf8 | | character_set_server | latin1 | | character_set_system | utf8 | | collation_connection | utf8_general_ci | | collation_database | latin1_swedish_ci | | collation_server | latin1_swedish_ci | +--------------------------+-------------------+ SQL> SHOW GLOBAL VARIABLES WHERE Variable_name LIKE 'character_set\_%' OR Variable_name LIKE 'collation%'; +--------------------------+-------------------+ | Variable_name | Value | +--------------------------+-------------------+ | character_set_client | latin1 | | character_set_connection | latin1 | | character_set_database | latin1 | | character_set_filesystem | binary | | character_set_results | latin1 | | character_set_server | latin1 | | character_set_system | utf8 | | collation_connection | latin1_swedish_ci | | collation_database | latin1_swedish_ci | | collation_server | latin1_swedish_ci | +--------------------------+-------------------+
These configuration variables are for Client/Server communication: character_set_client, character_set_connection and character_set_results. These for Server configuration: character_set_server and character_set_database (deprecated in MySQL 5.7). And these for System internals and File System access: character_set_system and character_set_filesystem.
Sometimes we see customers using the Logon Trigger init_connect
to force clients for a specific Character Set:
SQL> SHOW GLOBAL VARIABLES LIKE 'init_connect'; +---------------+------------------+ | Variable_name | Value | +---------------+------------------+ | init_connect | SET NAMES latin1 | +---------------+------------------+
The SET NAMES
command sets the character_set_client
, character_set_connection
and character_set_results
session variables. [ Lit. ]
Analyzing the Application and the Clients
Similar steps to analyze the Application and Clients should be taken. We want to answer the following questions:
- Support of utf8 of Application/Client O/S (Windows)?
- Support of utf8 of Web Server (Apache (
AddDefaultCharset utf-8
), Nginx, IIS, ...) - Version of programming language (Java, PHP (5.4 and newer?), ...)
- Version of MariaDB and MySQL Connectors (JDBC (5.1.47 and newer?), ODBC (5.3.11 and newer?), mysqli/mysqlnd (⋝7.0.19?, ⋝7.1.5?), ...)
- Application code (
header('Content-Type: text/html; charset=utf-8');
,<?xml version="1.0" encoding="UTF-8"?>
,<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
,<form accept-charset="utf-8">
,htmlspecialchars($str, ENT_NOQUOTES, "UTF-8")
,$mysqli->set_charset('utf8mb4');
, mbstring [ Lit. ], etc.
See also: Configuring Application Character Set and Collation
If you do not have your Application under control your DBA can help you to find out with the General Query Log (SET GLOBAL general_log = 1;
) what is going on:
190815 19:03:00 12 Connect root@localhost on 12 Query select @@version_comment limit 1 12 Query SET NAMES latin1 12 Query SET NAMES utf8 190815 19:05:24 12 Quit
or with some queries on the PERFORMANCE_SCHEMA
:
-- Works since MySQL 5.6/MariaDB 10.0 SQL> SELECT t.thread_id, t.processlist_id, t.processlist_user, t.processlist_host, t.processlist_db , sca.attr_name, sca.attr_value FROM performance_schema.threads AS t JOIN performance_schema.session_connect_attrs AS sca ON sca.processlist_id = t.processlist_id WHERE t.processlist_user IS NOT NULL AND t.thread_id = 103 ; +-----------+----------------+------------------+------------------+----------------+----------------------------------+---------------------+ | thread_id | processlist_id | processlist_user | processlist_host | processlist_db | attr_name | attr_value | +-----------+----------------+------------------+------------------+----------------+----------------------------------+---------------------+ | 103 | 78 | replication | localhost | NULL | _os | linux-glibc2.12 | | 103 | 78 | replication | localhost | NULL | _client_name | libmysql | | 103 | 78 | replication | localhost | NULL | _pid | 29269 | | 103 | 78 | replication | localhost | NULL | program_name | mysqld | | 103 | 78 | replication | localhost | NULL | _platform | x86_64 | | 103 | 78 | replication | localhost | NULL | _client_replication_channel_name | NULL | | 103 | 78 | replication | localhost | NULL | _client_role | binary_log_listener | | 103 | 78 | replication | localhost | NULL | _client_version | 5.7.26 | +-----------+----------------+------------------+------------------+----------------+----------------------------------+---------------------+ -- Works since MySQL 5.7 only SELECT t.thread_id, t.processlist_id, t.processlist_user, t.processlist_host, t.processlist_db , vbt.variable_name, vbt.variable_value FROM performance_schema.threads AS t JOIN performance_schema.variables_by_thread AS vbt ON vbt.thread_id = t.thread_id WHERE t.processlist_user IS NOT NULL AND (vbt.variable_name like 'charac%' OR vbt.variable_name LIKE 'coll%') AND t.thread_id = 103 ; +-----------+----------------+------------------+------------------+----------------+--------------------------+-------------------+ | thread_id | processlist_id | processlist_user | processlist_host | processlist_db | variable_name | variable_value | +-----------+----------------+------------------+------------------+----------------+--------------------------+-------------------+ | 103 | 78 | replication | localhost | NULL | character_set_client | latin1 | | 103 | 78 | replication | localhost | NULL | character_set_connection | latin1 | | 103 | 78 | replication | localhost | NULL | character_set_database | latin1 | | 103 | 78 | replication | localhost | NULL | character_set_filesystem | binary | | 103 | 78 | replication | localhost | NULL | character_set_results | latin1 | | 103 | 78 | replication | localhost | NULL | character_set_server | latin1 | | 103 | 78 | replication | localhost | NULL | collation_connection | latin1_swedish_ci | | 103 | 78 | replication | localhost | NULL | collation_database | latin1_swedish_ci | | 103 | 78 | replication | localhost | NULL | collation_server | latin1_swedish_ci | +-----------+----------------+------------------+------------------+----------------+--------------------------+-------------------+
Preparation of the Server Settings and the Application
To have a better control of the impact of some changes we decided to do some changes on the Application first:
- Application is setting the Character Set properly itself (
$mysqli->set_charset('utf8mb4')
[ Lit. ]). In the same step alsosql_mode
can be set by the application so we can use the defaults on server side in the future. - Apache and PHP are configured to support UTF-8.
- After this step
init_connect
,character_set_server
andcharacter_set_database
can be changed to utf8mb4 on the Server and--skip-character-set-client-handshake
can be removed at the same time [ Lit. ]
Convert Tables to utf8mb4
First we checked and converted the default Character Set of the Schemata/Databases:
SQL> SELECT schema_name, default_character_set_name, default_collation_name FROM information_schema.schemata WHERE schema_name NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys') ; +-------------+----------------------------+------------------------+ | schema_name | default_character_set_name | default_collation_name | +-------------+----------------------------+------------------------+ | focmm | latin1 | latin1_swedish_ci | | test | latin1 | latin1_swedish_ci | | foodmart | latin1 | latin1_swedish_ci | | erp | latin1 | latin1_swedish_ci | | world | latin1 | latin1_swedish_ci | +-------------+----------------------------+------------------------+
Converting the Schemata is done as follows:
SQL> ALTER DATABASE ... DEFAULT CHARACTER SET utf8mb4;
which is a fast operation.
To convert the tables there are many different possibilities we considered:
- The easy one:
ALTER TABLE ... CONVERT TO CHARACTER SET utf8mb4;
- The possibly faster one: dump/restore with
sed 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8mb4/'
- The possibly fastest one: drop Secondary Indexes, dump/restore with
sed
, create Secondary Indexes again (mysqlpump
). - The automated one: With Perconas
pt-online-schema-change
[ Lit. ] or Facebooks OnlineSchemaChange OSC [ Lit. ] - The most elegant but not supported one: Master/Slave Replication.
Already with converting the first table we run into a problem:
SQL> ALTER TABLE test CONVERT TO CHARACTER SET utf8mb4; ERROR 1709 (HY000): Index column size too large. The maximum column size is 767 bytes.
This table had a Primary Key of a length of more than 255 bytes and this is not possible with the old InnoDB Antelope File Format:
SQL> SHOW GLOBAL VARIABLES WHERE Variable_name LIKE 'innodb_file_format%' OR Variable_name LIKE 'innodb_large%'; +--------------------------+----------+ | Variable_name | Value | +--------------------------+----------+ | innodb_file_format | Antelope | | innodb_file_format_check | ON | | innodb_file_format_max | Antelope | | innodb_large_prefix | OFF | +--------------------------+----------+
So we have to find out first, which tables are still in old Antelope File Format:
SQL> SELECT table_schema , CASE WHEN row_format = 'Dynamic' THEN 'Barracuda' WHEN row_format = 'Compressed' THEN 'Barracuda' WHEN row_format = 'Compact' THEN 'Antelope' WHEN row_format = 'Redundant' THEN 'Antelope' END AS 'file_format' , COUNT(*) FROM information_schema.tables WHERE engine = 'InnoDB' AND table_schema NOT IN ('information_schema', 'sys', 'mysql') GROUP BY table_schema, file_format ; +--------------+-------------+----------+ | table_schema | file_format | count(*) | +--------------+-------------+----------+ | foodmart | Barracuda | 23 | | test | Barracuda | 1 | | world | Antelope | 2 | | world | Barracuda | 1 | +--------------+-------------+----------+
Then we could covert the table correctly forcing the Barracuda File format:
SQL> ALTER TABLE test CONVERT TO CHARACTER SET utf8mb4, ROW_FORMAT=dynamic;
Testing of new Character Set
The last but most important step is to test the changes. Here we recommend to do as many as possible different tests:
- MySQL CLI:
mysql
- phpMyAdmin
- MySQL Workbench
- Other GUIs
- Your Application
Especially test careful data with umlauts (öäüÄÖÜß), foreign Characters (Turkish, Cyrillic, CJK Characters) and Emojis. Good Candidates are: Lastname, City or free text fields like Comment.
Some aid you also can get from here:
MySQL Pump
mysqlpump
— A Database Backup Program
This utility is currently not available for MariaDB. If works for MySQL 5.7 and newer and does NOT support MySQL 5.6. So we cannot use it for the Upgrade Process from MySQL 5.6 to 5.7. Newest MySQL Releases contain Bug fixes and even new Features in mysqlpump
so we can assume it is still supported and actively maintained. Recent releases contain some fixes for trivial bugs so we can assume mysqlpump
is not widely used yet and not as mature yet as mysqldump
. An alternative product would be MyDumper from Domas@Facebook (Lauchpad, GitHub).
Interesting features are:
- Parallel dumping of databases.
- Secondary Index restore separated from Table Restore.
Missing features are:
- Does not support MariaDB and MySQL 5.6 and earlier.
- No binary log position is stored with backup (
--master-data
) but only GTID? So we have to use/enable GTID based Replication.
First test:
# mysqlpump --user=root --all-databases --single-transaction --triggers --routines --events --hex-blob > /tmp/mysqlpump_backup.sql Dump progress: 1/4 tables, 0/87 rows Dump progress: 41/46 tables, 789469/6956116 rows Dump progress: 41/46 tables, 1608219/6956116 rows Dump progress: 45/46 tables, 2358412/6956116 rows Dump progress: 45/46 tables, 3437912/6956116 rows Dump progress: 45/46 tables, 4493662/6956116 rows Dump progress: 45/46 tables, 5177662/6956116 rows Dump progress: 45/46 tables, 5568662/6956116 rows Dump progress: 45/46 tables, 6216162/6956116 rows Dump progress: 45/46 tables, 6716662/6956116 rows Dump completed in 9860 milliseconds ALTER TABLE `foodmart`.`account` ADD KEY `i_account_parent` (`account_parent_id`); ALTER TABLE `foodmart`.`category` ADD KEY `i_category_parent` (`category_parent_id`); ALTER TABLE `foodmart`.`employee` ADD KEY `i_employee_department_id` (`department_id`); ALTER TABLE `foodmart`.`employee` ADD KEY `i_employee_store_id` (`store_id`); ALTER TABLE `foodmart`.`employee` ADD KEY `i_employee_supervisor_id` (`supervisor_id`); ALTER TABLE `foodmart`.`expense_fact` ADD KEY `i_expense_store_id` (`store_id`); ALTER TABLE `foodmart`.`expense_fact` ADD KEY `i_expense_account_id` (`account_id`); ALTER TABLE `foodmart`.`expense_fact` ADD KEY `i_expense_time_id` (`time_id`);
Test against MySQL 5.6:
Server version is not compatible. Server version should be 5.7.8 or above.
MySQL Master/Slave Replication for Character Set conversion
Replication of Columns Having Different Data Types should work since MySQL 5.1.21. General rule when Statement Based Replication is used: If the query would work executed directly on the Slave it should also work on the Slave with Replication! So utf8mb4 on the Slave should also be possible with SBR. In MySQL 5.5.3 Attribute Promotion and Attribute Demotion was introduced with LOSSY
and NON_LOSSY
slave_type_conversions
. But: Replication between columns using different character sets is not supported!
Some tests:
MasterSQL> INSERT INTO test VALUES (NULL, 'öäüÄÖÜß', NULL); MasterSQL> SELECT * FROM test; +----+----------------+---------------------+ | id | data | ts | +----+----------------+---------------------+ | 1 | öäüÄÖÜß | 2019-08-15 19:48:48 | | 2 | öäüÄÖÜß | 2019-08-15 19:50:00 | | 3 | öäüÄÖÜß | 2019-08-15 19:52:29 | +----+----------------+---------------------+ SlaveSQL> SHOW SLAVE STATUS\G ... Last_SQL_Errno: 1677 Last_SQL_Error: Column 1 of table 'test.test' cannot be converted from type 'varchar(255(bytes))' to type 'varchar(1020(bytes) utf8mb4)' ... SlaveSQL> SELECT * FROM test; +----+----------------+---------------------+ | id | data | ts | +----+----------------+---------------------+ | 1 | öäüÄÖÜß | 2019-08-15 19:48:48 | | 2 | öäüÄÖÜß | 2019-08-15 19:50:00 | +----+----------------+---------------------+
See also: Why is varchar(255) not varchar(255)? And:
MasterSQL> SET SESSION binlog_format=STATEMENT; MasterSQL> INSERT INTO test VALUES (NULL, 'öäüÄÖÜß', NULL); SlaveSQL> SELECT * FROM test; +----+----------------+---------------------+ | id | data | ts | +----+----------------+---------------------+ | 1 | öäüÄÖÜß | 2019-08-15 19:48:48 | | 2 | öäüÄÖÜß | 2019-08-15 19:50:00 | | 4 | öäüÄÖÜß | 2019-08-15 19:54:16 | +----+----------------+---------------------+
MySQL Shell, mysqlsh
Upgrade Checker Utility
Since MySQL 5.7 we have a new MySQL Utility, the MySQL Shell (mysqlsh
). The Upgrade Checker Utility in MySQL Shell will replace mysql_upgrade
in MySQL 8.0. So we wanted to know if it is of any use for converting Character Set to utf8mb4? You can use the Upgrade Checker Utility to check MySQL 5.7 server instances for compatibility errors and issues for upgrading. But: The Upgrade Checker Utility does not support checking MySQL Server instances 5.6 an earlier. MySQL Shell 8.0 is highly recommended for use with MySQL Server 8.0 and 5.7. Please upgrade to MySQL Shell 8.0. [ Lit. ]
# ./mysqlsh root@localhost:3319 JS > util.checkForServerUpgrade() The MySQL server at localhost:3319, version 5.6.42-log - MySQL Community Server (GPL), will now be checked for compatibility issues for upgrade to MySQL 8.0.17... Util.checkForServerUpgrade: Detected MySQL server version is 5.6.42, but this tool requires server to be at least at version 5.7 (LogicError) JS > util.checkForServerUpgrade('root@127.0.0.1:3309') The MySQL server at localhost:3309, version 5.7.26-log - MySQL Community Server (GPL), will now be checked for compatibility issues for upgrade to MySQL 8.0.17... ... 2) Usage of db objects with names conflicting with reserved keywords in 8.0 No issues found 3) Usage of utf8mb3 charset Warning: The following objects use the utf8mb3 character set. It is recommended to convert them to use utf8mb4 instead, for improved Unicode support. More information: https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html test.test.data - column's default character set: utf8 4) Table names in the mysql schema conflicting with new tables in 8.0 No issues found ... Errors: 0 Warnings: 1 Notices: 0 No fatal errors were found that would prevent an upgrade, but some potential issues were detected. Please ensure that the reported issues are not significant before upgrading. JS > util.checkForServerUpgrade('root@127.0.0.1:3309', {'targetVersion':'5.7.27'}) The MySQL server at 127.0.0.1:3309, version 5.7.26-log - MySQL Community Server (GPL), will now be checked for compatibility issues for upgrade to MySQL 5.7.27... Util.checkForServerUpgrade: This tool supports checking upgrade to MySQL server versions 8.0.11 to 8.0.17 (LogicError)
Because latin1 tables are still supported in MySQL 8.0 the Upgrade Checker Utility does not complain or do anything for latin1 tables. So it is not of any use for the utf8mb4 conversion task except you already have some old utf8(mb3) tables.
- Shinguz's blog
- Log in or register to post comments
Comments
question on effect of conversion
Can altering the table character set to utf8mb4 cause error or special characters to appear in existing text or binary (blob) data. That is when ALTER TABLE test CONVERT TO CHARACTER SET utf8mb4 is executed will existing latin1 text be converted properly. Perhaps this is a mute point as I believe latin1 is part of UTF8MB4, but I am not sure.
Thanks for any insight!!
Hal Richman
Effect of conversion
Hello Hal
I do not really get your question. Do you have a good and reproducible example for your question?
The MySQL documentation states in Changing the Character Set:
So the conversion SHOULD happen without an error. Everything else I consider a bug and should be filed as such.
But it theoretically CAN result in errors... Do you have an example for such an error?
I would say, that latin1 is technically not an exact sub-set of utf8mb4 because of the following example:
So a conversion is needed!
Possibly you stumbled apon the following problem:
Example here: