MySQL encoding

May 18, 2007 ~ François Schiettecatte

I have run into the MySQL encoding issue detailed in this post, which is essentially a transcoding problem when going from latin-1 to utf-8.
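The linked post is not reproduced here, so the following is only a sketch of the most common form this kind of problem takes, under the assumption that utf-8 bytes are being read back as latin-1 somewhere between the client and the table. The function names are hypothetical and nothing here talks to MySQL; it only shows what the garbling looks like and why it is reversible.

```python
def misread_as_latin1(text):
    """Encode text as UTF-8, then wrongly decode those bytes as latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled):
    """Undo the mistake: recover the original bytes, then decode as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


original = "François"
garbled = misread_as_latin1(original)

print(garbled)          # FranÃ§ois
print(repair(garbled))  # François
```

The round trip works because latin-1 maps every byte value to exactly one character, so no information is lost when the bytes are misread.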

The solution proposed in the post is very elegant, though I have not had a chance to test it.

The solution I came up with was to write my own Perl-based data dumper and data loader scripts. They wound up addressing lots of other dumping and loading issues that I had with mysqldump, and I ended up using them a lot.

I would add two things to the post.

The first is to pay attention to the encoding used when setting up a MySQL table. When a character-based field is set to utf-8, its space allocation is three times the length given in the CREATE statement, so a CHAR(20) will require 60 bytes if it is utf-8, rather than 20 if it is latin-1. VARCHARs are similarly affected in their maximum size. So if you know that your character data can be represented with latin-1, and you are going to have a lot of it, then you should use latin-1 as the encoding for that field or that table; MySQL lets you set the encoding at a fine level of granularity.
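The arithmetic can be sketched as follows. This is a rough illustration rather than a model of any particular storage engine: the function names are hypothetical, "utf8" stands for MySQL's three-byte utf8, and the second function only counts the bytes of the encoded string itself.

```python
# Maximum bytes per character for each character set.
# "utf8" here means MySQL's utf8, which uses at most three bytes.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3}

# Python codec names corresponding to the MySQL character set names.
PYTHON_CODEC = {"latin1": "latin-1", "utf8": "utf-8"}


def char_reserved_bytes(length, charset):
    """Bytes reserved for a fixed-width CHAR(length) column."""
    return length * MAX_BYTES_PER_CHAR[charset]


def stored_bytes(value, charset):
    """Bytes needed for the encoded value itself (no length prefix)."""
    return len(value.encode(PYTHON_CODEC[charset]))


print(char_reserved_bytes(20, "latin1"))  # 20
print(char_reserved_bytes(20, "utf8"))    # 60
print(stored_bytes("François", "latin1")) # 8
print(stored_bytes("François", "utf8"))   # 9
```

The gap between 60 reserved bytes and the 9 bytes the string actually needs is the cost of a fixed-width utf-8 field.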

The second is to preset the encoding used by the clients when connecting to the servers by setting the appropriate parameters in the my.cnf file used on the server as follows:

# Set the default character set to utf8
default_character_set = utf8

# Set the server character set
character_set_server = utf8

# Set the default collation to utf8_general_ci
#default_collation = utf8_general_ci

# Set the names to utf8 when a client connects
init_connect = 'SET NAMES utf8'

This allows you to deal with older mysql client libraries which you may have installed on your systems.
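As a quick sanity check, a my.cnf file can be parsed to see which section each of these options sits in. The sketch below uses Python's standard configparser on an inline sample. The section placement is an assumption, since the listing above does not show section headers: default_character_set is placed under [client] and the server options under [mysqld]. The function name is hypothetical.

```python
import configparser

# Hypothetical my.cnf contents; the section headers are an assumption.
MY_CNF = """
[client]
default_character_set = utf8

[mysqld]
character_set_server = utf8
init_connect = 'SET NAMES utf8'
"""


def charset_settings(text):
    """Return {section: {option: value}} for encoding-related options."""
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    parser.read_string(text)
    wanted = ("character_set", "collation", "init_connect")
    found = {}
    for section in parser.sections():
        for key, value in parser.items(section):
            # my.cnf accepts both dashes and underscores in option names.
            name = key.replace("-", "_")
            if any(w in name for w in wanted):
                found.setdefault(section, {})[name] = (value or "").strip("'\"")
    return found


for section, options in charset_settings(MY_CNF).items():
    print(section, options)
```

Note that init_connect is a server-side setting, and the server does not run it for accounts that hold the SUPER privilege, so connections made as such a user still need to issue SET NAMES themselves.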

Published by François Schiettecatte

Entrepreneur and independent consultant creating companies and building complex software systems for clients. Thirty years in software development/engineering/architecture with a strong emphasis on genetic research and information retrieval systems.

3 thoughts on “MySQL encoding”

noel says:

June 3, 2007 at 1:31 am

aww man I was dealing with this very issue last year and ran across the same article. One thing I read, though, was that it’s better to avoid CHAR in UTF-8 and just make everything VARCHAR, because VARCHAR won’t use more bytes than necessary on standard ASCII/latin characters. Of course you lose the performance boost of setting all fields to CHAR, but the only place I ever saw anyone do that was at that one rss search engine company…

Here’s the article I read:
http://www.oreillynet.com/onlamp/blog/2006/01/turning_mysql_data_in_latin1_t.html

It includes a few handy extras too like:

httpd.conf:
AddCharset UTF-8 .utf8
AddDefaultCharset UTF-8

php.ini:
default_charset = "utf-8"

my.cnf:
character-set-server=utf8
default-collation=utf8_unicode_ci

Also, if you don’t want everything connecting to the mysqld to get set to utf8, you can fire off a query immediately after connecting: "SET NAMES utf8".

This is good if you have some legacy crap you can’t/won’t/don’t want to change.

Joseph Riesen says:

August 28, 2007 at 3:29 am

Thanks for the information! I was looking for these assorted options (to force a development server to talk UTF-8) and found your page here.

FYI, read about the space requirements for UTF-8 at:
http://dev.mysql.com/doc/refman/5.0/en/charset-unicode.html

Specifically, it states:
“Tip: To save space with UTF-8, use VARCHAR instead of CHAR. Otherwise, MySQL must reserve three bytes for each character in a CHAR CHARACTER SET utf8 column because that is the maximum possible length. For example, MySQL must reserve 30 bytes for a CHAR(10) CHARACTER SET utf8 column.”

In other words, you _are_ correct in stating that a CHAR(20) in UTF-8 will require 60 bytes, but a VARCHAR will continue to reserve only the number of bytes necessary to hold the string. That is, a VARCHAR(20) _could_ use up to 60 bytes of storage, but any lower-ASCII characters will still only take up one byte each.

Or at least, so I understand it. YMMV. =)

François Schiettecatte says:

August 28, 2007 at 6:42 am

Joseph

Thanks, that would be my understanding too, but the documentation is silent on the issue. I would assume that they have done the “right thing” though.
