Talk:Collation

From Freephile Wiki
Jump to navigation Jump to search

This page is all about Collation in MediaWiki - and should probably become a new article to supplement the general topic page.

See Also[edit source]

https://www.mediawiki.org/wiki/Manual:$wgCategoryCollation and https://github.com/wikimedia/mediawiki/blob/master/maintenance/updateCollation.php Other maintenance scripts such as https://github.com/wikimedia/mediawiki/blob/master/maintenance/uppercaseTitlesForUnicodeTransition.php or other extensions etc. may be helpful if you need to "fix" things.


In MWStake General, I said Short answer:

1. yes, you should change the $wgDBTableOptions (see below) to the new default. However, this setting is only used during new table creation, so it doesn't really affect your existing database(s) and the tables in them. 2. Do NOT CONVERT TO all your existing tables. You have to be more precise with what columns you change - if you change anything at all. You really only convert if you have some encoding problem (aka Mojibake).

Using

SHOW VARIABLES WHERE Variable_name LIKE 'character_set_%' OR Variable_name LIKE 'collation%';

will show you some info about how your database server, client, filesystem etc. is currently setup.

Here's a Stack Overflow question + answer that explains the different encodings (byte mappings) and collations (sorting rules) and why it _might_ matter to you. The newest collations only really matter if you're trying to chase the long tail of performance optimizations.

BTW, the answer is by Rick James (not the singer - the database expert)

From WMF, I see this puppet template for the galera configuration, updated on 2020-06-11 specifying

# character sets and collations
character-set-server = utf8mb4
collation-server     = utf8mb4_general_ci

As you point out, the setting for $wgDBTableOptions changed back in REL1_32 to use a "binary" character set as the default $wgDBTableOptions = 'ENGINE=InnoDB, DEFAULT CHARSET=binary';

Apparently the prior "option" to choose the charset (between utf8 and binary) was hidden from the UI of the installer in MediaWiki 1.31.0-rc.2, so the change in REL1_32 was to make the default config setting match what the installer was already doing (using the binary charset for NEW installations).

Keep in mind that MediaWiki outputs UTF8 to the browser: $response->header( 'Content-Type: text/html; charset=UTF-8' ); But that is a different aspect to the whole UTF-8 question.

In the database backend, MediaWiki switched over some years ago (I think it was REL1_35) to use the Doctrine Database Abstraction Layer (Doctrine DBAL); using JSON to represent the schemas for each table as well as to define "abstract" schema changes that work across multiple DB engines. There is a PHP script that turns that into generated SQL

You can see how the default Character Set for MySQL is 'binary' in the installer, but this does not mean that MediaWiki changes the character set of an existing installation.

My recent installation of MediaWiki 1.39 is actually latin1 -- I'm off to figure out why that is the case... possibly due to the default of the MariaDB instance I'm using.

Although it is old information (2012), User Daniel Renfro put together some nice notes on Character Encoding and how it plays out (or plays havoc) for you.

Tests[edit source]

select CONVERT(CONVERT(CONVERT('Mike’s' USING latin1) USING binary) USING utf8);
select CONVERT(CONVERT('Mike’s' USING latin1) USING binary);

produces

Mike’s
Mikeâ  s

MediaWiki[edit source]

This is one of the few places in the code where 'create database' is ever called

Conversion[edit source]

If you need to do conversion, check out the MySQL docs v8.3 and MySQL docs v5.7

checkStorage.php[edit source]

What does the CheckStorage class (Manual source) actually do? It is used in conjunction with the Manual:External_storage (Wikitech) capability of MediaWiki and therefore is irrelevant here because we're not using External Storage. But, here's the output of running the script for information's sake:

Errors:

Flag statistics:
                                     1575 15.53%
utf-8                                8564 84.47%

Local object statistics:

Tables[edit source]

SET @DATABASE_NAME = 'wiki_greg';
SELECT 
    CONCAT("ALTER TABLE `",
            table_name,
            "` CHARACTER SET = 'binary' COLLATE = 'BINARY';") AS sql_statements
FROM
    information_schema.tables AS tb
WHERE
    table_schema = @DATABASE_NAME
        AND `TABLE_COLLATION` = 'latin1_swedish_ci'
        AND `TABLE_TYPE` = 'BASE TABLE'
        AND `tb`.table_name NOT REGEXP '.*searchindex'
ORDER BY table_name DESC;

First thing you need to do is use wiki_greg;

Then you can alter all the tables by copy/pasting all the generated lines. Altering table defaults will NOT change definitions on columns which specify CHARACTER SET or collation.

ALTER TABLE `archive` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `bad_behavior` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `blobs` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `brokenlinks` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `category` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `categorylinks` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `change_tag` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `cur` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `dune_hitcounter` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `external_user` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `externallinks` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `filearchive` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `image` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `imagelinks` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `interwiki` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `ipblocks` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `ipblocks_old` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `job` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `l10n_cache` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `langlinks` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `links` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `linkscc` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `linktarget` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `log_search` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `logging` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `math` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `objectcache` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `oldimage` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `online` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `page` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `page_props` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `page_restrictions` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `pagelinks` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `protected_titles` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `querycache` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `querycache_info` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `querycachetwo` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `recentchanges` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `redirect` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `revision` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `site_identifiers` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `site_stats` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `sites` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `templatelinks` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `text` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `trackbacks` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `updatelog` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `user` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `user_autocreate_serial` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `user_groups` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `user_newtalk` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `user_openid` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `user_properties` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `user_rights` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `validate` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `watchlist` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `watchlist_expiry` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `wiretap` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `wiretap_counter_alltime` CHARACTER SET = 'binary' COLLATE = 'BINARY';

Columns[edit source]

Show the columns that are not 'binary' and have a specific character set

SET @DATABASE_NAME = 'wiki_greg';
SELECT
	TABLE_SCHEMA,
	TABLE_NAME,
	COLUMN_NAME,
	CHARACTER_SET_NAME,
	COLLATION_NAME
FROM
	information_schema.columns AS cols
WHERE
	table_schema = @DATABASE_NAME
	AND COLLATION_NAME NOT IN ('binary');
TABLE_SCHEMA|TABLE_NAME       |COLUMN_NAME             |CHARACTER_SET_NAME|COLLATION_NAME   |
------------+-----------------+------------------------+------------------+-----------------+
wiki_greg   |archive          |ar_timestamp            |latin1            |latin1_bin       |
wiki_greg   |bad_behavior     |ip                      |latin1            |latin1_swedish_ci|
wiki_greg   |bad_behavior     |request_method          |latin1            |latin1_swedish_ci|
wiki_greg   |bad_behavior     |request_uri             |latin1            |latin1_swedish_ci|
wiki_greg   |bad_behavior     |server_protocol         |latin1            |latin1_swedish_ci|
wiki_greg   |bad_behavior     |http_headers            |latin1            |latin1_swedish_ci|
wiki_greg   |bad_behavior     |user_agent              |latin1            |latin1_swedish_ci|
wiki_greg   |bad_behavior     |request_entity          |latin1            |latin1_swedish_ci|
wiki_greg   |bad_behavior     |key                     |latin1            |latin1_swedish_ci|
wiki_greg   |blobs            |blob_index              |latin1            |latin1_bin       |
wiki_greg   |brokenlinks      |bl_to                   |latin1            |latin1_bin       |
wiki_greg   |categorylinks    |cl_type                 |latin1            |latin1_swedish_ci|
wiki_greg   |cur              |cur_title               |latin1            |latin1_bin       |
wiki_greg   |cur              |cur_text                |latin1            |latin1_swedish_ci|
wiki_greg   |cur              |cur_user_text           |latin1            |latin1_bin       |
wiki_greg   |cur              |cur_timestamp           |latin1            |latin1_bin       |
wiki_greg   |cur              |cur_touched             |latin1            |latin1_bin       |
wiki_greg   |cur              |inverse_timestamp       |latin1            |latin1_bin       |
wiki_greg   |dune_searchindex |si_title                |latin1            |latin1_swedish_ci|
wiki_greg   |dune_searchindex |si_text                 |latin1            |latin1_swedish_ci|
wiki_greg   |external_user    |eu_external_id          |latin1            |latin1_bin       |
wiki_greg   |filearchive      |fa_storage_group        |latin1            |latin1_swedish_ci|
wiki_greg   |filearchive      |fa_storage_key          |latin1            |latin1_bin       |
wiki_greg   |filearchive      |fa_deleted_timestamp    |latin1            |latin1_bin       |
wiki_greg   |filearchive      |fa_media_type           |latin1            |latin1_swedish_ci|
wiki_greg   |filearchive      |fa_major_mime           |latin1            |latin1_swedish_ci|
wiki_greg   |filearchive      |fa_timestamp            |latin1            |latin1_bin       |
wiki_greg   |image            |img_media_type          |latin1            |latin1_swedish_ci|
wiki_greg   |image            |img_major_mime          |latin1            |latin1_swedish_ci|
wiki_greg   |interwiki        |iw_prefix               |latin1            |latin1_swedish_ci|
wiki_greg   |interwiki        |iw_url                  |latin1            |latin1_swedish_ci|
wiki_greg   |interwiki        |iw_wikiid               |latin1            |latin1_swedish_ci|
wiki_greg   |ipblocks         |ipb_timestamp           |latin1            |latin1_bin       |
wiki_greg   |ipblocks         |ipb_expiry              |latin1            |latin1_bin       |
wiki_greg   |ipblocks_old     |ipb_address             |latin1            |latin1_bin       |
wiki_greg   |ipblocks_old     |ipb_timestamp           |latin1            |latin1_bin       |
wiki_greg   |ipblocks_old     |ipb_expiry              |latin1            |latin1_bin       |
wiki_greg   |ipblocks_old     |ipb_range_start         |latin1            |latin1_swedish_ci|
wiki_greg   |ipblocks_old     |ipb_range_end           |latin1            |latin1_swedish_ci|
wiki_greg   |job              |job_cmd                 |latin1            |latin1_swedish_ci|
wiki_greg   |l10n_cache       |lc_key                  |latin1            |latin1_swedish_ci|
wiki_greg   |log_search       |ls_value                |latin1            |latin1_swedish_ci|
wiki_greg   |logging          |log_timestamp           |latin1            |latin1_swedish_ci|
wiki_greg   |math             |math_inputhash          |latin1            |latin1_swedish_ci|
wiki_greg   |math             |math_outputhash         |latin1            |latin1_swedish_ci|
wiki_greg   |math             |math_html               |latin1            |latin1_swedish_ci|
wiki_greg   |math             |math_mathml             |latin1            |latin1_swedish_ci|
wiki_greg   |objectcache      |keyname                 |latin1            |latin1_bin       |
wiki_greg   |objectcache      |modtoken                |latin1            |latin1_swedish_ci|
wiki_greg   |oldimage         |oi_timestamp            |latin1            |latin1_bin       |
wiki_greg   |oldimage         |oi_media_type           |latin1            |latin1_swedish_ci|
wiki_greg   |oldimage         |oi_major_mime           |latin1            |latin1_swedish_ci|
wiki_greg   |online           |username                |latin1            |latin1_swedish_ci|
wiki_greg   |online           |timestamp               |latin1            |latin1_swedish_ci|
wiki_greg   |online           |wikiid                  |latin1            |latin1_swedish_ci|
wiki_greg   |page             |page_touched            |latin1            |latin1_bin       |
wiki_greg   |page_restrictions|pr_type                 |latin1            |latin1_swedish_ci|
wiki_greg   |page_restrictions|pr_level                |latin1            |latin1_swedish_ci|
wiki_greg   |page_restrictions|pr_expiry               |latin1            |latin1_bin       |
wiki_greg   |querycache       |qc_type                 |latin1            |latin1_swedish_ci|
wiki_greg   |querycache_info  |qci_type                |latin1            |latin1_swedish_ci|
wiki_greg   |querycache_info  |qci_timestamp           |latin1            |latin1_swedish_ci|
wiki_greg   |querycachetwo    |qcc_type                |latin1            |latin1_swedish_ci|
wiki_greg   |recentchanges    |rc_log_type             |latin1            |latin1_bin       |
wiki_greg   |recentchanges    |rc_log_action           |latin1            |latin1_bin       |
wiki_greg   |redirect         |rd_interwiki            |latin1            |latin1_swedish_ci|
wiki_greg   |revision         |rev_timestamp           |latin1            |latin1_bin       |
wiki_greg   |searchindex      |si_title                |latin1            |latin1_swedish_ci|
wiki_greg   |searchindex      |si_text                 |latin1            |latin1_swedish_ci|
wiki_greg   |sites            |site_domain             |latin1            |latin1_swedish_ci|
wiki_greg   |smw_ft_search    |o_text                  |utf8              |utf8_general_ci  |
wiki_greg   |text             |old_title               |latin1            |latin1_bin       |
wiki_greg   |text             |old_text                |latin1            |latin1_swedish_ci|
wiki_greg   |text             |old_user_text           |latin1            |latin1_bin       |
wiki_greg   |text             |old_timestamp           |latin1            |latin1_bin       |
wiki_greg   |text             |inverse_timestamp       |latin1            |latin1_bin       |
wiki_greg   |trackbacks       |tb_title                |latin1            |latin1_swedish_ci|
wiki_greg   |trackbacks       |tb_url                  |latin1            |latin1_swedish_ci|
wiki_greg   |trackbacks       |tb_ex                   |latin1            |latin1_swedish_ci|
wiki_greg   |trackbacks       |tb_name                 |latin1            |latin1_swedish_ci|
wiki_greg   |updatelog        |ul_key                  |latin1            |latin1_swedish_ci|
wiki_greg   |user             |user_email              |latin1            |latin1_swedish_ci|
wiki_greg   |user             |user_token              |latin1            |latin1_bin       |
wiki_greg   |user             |user_email_authenticated|latin1            |latin1_bin       |
wiki_greg   |user             |user_email_token        |latin1            |latin1_bin       |
wiki_greg   |user             |user_email_token_expires|latin1            |latin1_bin       |
wiki_greg   |user             |user_registration       |latin1            |latin1_bin       |
wiki_greg   |user             |user_newpass_time       |latin1            |latin1_bin       |
wiki_greg   |user_newtalk     |user_ip                 |latin1            |latin1_swedish_ci|
wiki_greg   |user_openid      |uoi_openid              |latin1            |latin1_swedish_ci|
wiki_greg   |validate         |val_comment             |latin1            |latin1_swedish_ci|
wiki_greg   |validate         |val_ip                  |latin1            |latin1_swedish_ci|
wiki_greg   |wiretap          |page_name               |latin1            |latin1_swedish_ci|
wiki_greg   |wiretap          |user_name               |latin1            |latin1_swedish_ci|
wiki_greg   |wiretap          |hit_timestamp           |latin1            |latin1_swedish_ci|
wiki_greg   |wiretap          |hit_year                |latin1            |latin1_swedish_ci|
wiki_greg   |wiretap          |hit_month               |latin1            |latin1_swedish_ci|
wiki_greg   |wiretap          |hit_day                 |latin1            |latin1_swedish_ci|
wiki_greg   |wiretap          |hit_hour                |latin1            |latin1_swedish_ci|
wiki_greg   |wiretap          |page_action             |latin1            |latin1_swedish_ci|
wiki_greg   |wiretap          |diff                    |latin1            |latin1_swedish_ci|
wiki_greg   |wiretap          |referer_url             |latin1            |latin1_swedish_ci|
wiki_greg   |wiretap          |referer_title           |latin1            |latin1_swedish_ci|