MySQL encoding

I have run into the MySQL encoding issue detailed in this post, basically a transcoding issue when going from latin-1 to utf-8.

The solution proposed in the port is very elegant, though I have not had a chance to test it.

The solution I came up with was to create my own, perl based, data dumper and data loader scripts, which wound up addressing lots of other data dumping and loading issues that I had with mysqldump, and I ended up using these scripts a lot.

I would add two things to the post.

The first is to pay attention to the encoding used when setting up a mysql table. When a character based field is set to utf-8, its space allocation is three times the space allocated in the create statements, so a CHAR(20) will require 60 bytes if it is utf-8, rather than 20 if it is latin-1. VARCHARS are similarly affected. So if you know that your character data can be represented with latin-1 and if you are going to have a lot of it, then you should use latin-1 as the encoding for that field or that table, MySQL allows you to have a high degree of granularity.

The second is to preset the encoding used by the clients when connecting to the servers by setting the appropriate parameters in the my.cnf file used on the server as follows:


# Set the default character set to utf8
default_character_set = utf8

# Set the server character set
character_set_server = utf8

# Set the default collation to utf8_general_ci
#default_collation = utf8_general_ci

# Set the names to utf8 when a client connects
init_connect = 'SET NAMES utf8'

This allows you to deal with older mysql client libraries which you may have installed on your systems.

3 thoughts on “MySQL encoding

  1. aww man I was dealing with this very issue last year and ran across the same article. One thing I read though was it’s better to avoid char in UTF and just make everything varchar because varchar won’t use more bytes than neccessary on standard ascii/latin chars. Of course you lose the performance boost of setting all fields to char but the only place I ever saw anyone do that was at that one rss search engine company…

    Here’s the article I read:
    http://www.oreillynet.com/onlamp/blog/2006/01/turning_mysql_data_in_latin1_t.html

    It includes a few handy extras too like:

    httpd.conf:
    AddCharset UTF-8 .utf8
    AddDefaultCharset UTF-8

    php.ini
    default_charset = “utf-8”

    my.cnf
    character-set-server=utf8
    default-collation=utf8_unicode_ci

    Also if you don’t want everything connecting to the mysqld to get set to utf8, you can fire off a query immediately after connecting: “SET NAMES utf8” .

    This is good if you have some legacy crap you can’t/won’t/don’t want to change.

  2. Thanks for the information! I was looking for these assorted options (to force a development server to talk UTF-8) and found your page here.

    FYI, read about the space requirements for UTF-8 at:
    http://dev.mysql.com/doc/refman/5.0/en/charset-unicode.html

    Specifically, it states:
    “Tip: To save space with UTF-8, use VARCHAR instead of CHAR. Otherwise, MySQL must reserve three bytes for each character in a CHAR CHARACTER SET utf8 column because that is the maximum possible length. For example, MySQL must reserve 30 bytes for a CHAR(10) CHARACTER SET utf8 column.”

    In other words, you _are_ correct in stating that a CHAR(20) in UTF-8 will require 60 bytes, but a VARCHAR will continue to reserve only the number of bytes necessary to hold the string. In other words, a VARCHAR(20) _could_ use up to 60 bytes of storage, but any lower-ASCII characters will still only take up one byte.

    Or at least, so I understand it. YMMV. =)

Leave a comment