Friday, November 21, 2008

Character encoding in PHP

A character set is a group of characters without associated numerical values. An example of a character set is the Latin alphabet or the Cyrillic alphabet.

Coded character sets are character sets in which each character is associated with a scalar value: a code point. For example, in ASCII, the uppercase letter “A” has the value 65. Examples of coded character sets are ASCII and Unicode. A coded character set is meant to be encoded, i.e. converted into a digital representation so that the characters can be serialized in files, databases, or strings. This is done through a character encoding scheme, or encoding for short. The encoding maps each character value to a given sequence of bytes.

In many cases, the encoding is just a direct projection of the scalar values, and there is no real distinction between the coded character set and its serialized representation. For example, in ISO 8859-1 (Latin 1), the character “A” (code point 65) is encoded as the byte 0x41 (i.e. 65). In other cases, the encoding method is more complex. For example, in UTF-8, an encoding of Unicode, the character “á” (225) is encoded as two bytes: 0xC3 and 0xA1.
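
To see those bytes for yourself, here is a minimal sketch. It assumes PHP 7 or later for the "\u{...}" escapes, which I use only so the example does not depend on how the file itself is saved; the variable names are made up for illustration.

```php
<?php
// "\u{...}" produces the UTF-8 bytes for a code point, whatever the file encoding is.
$a      = "A";        // U+0041, code point 65
$aAcute = "\u{E1}";   // U+00E1, code point 225, the character "á"

// bin2hex() shows the raw bytes of a string in hexadecimal.
$hexA      = bin2hex($a);
$hexAAcute = bin2hex($aAcute);

echo $hexA, "\n";       // 41
echo $hexAAcute, "\n";  // c3a1
```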

For Unicode (also called Universal Character Set or UCS), a coded character set developed by the Unicode Consortium, there are several possible encodings: UTF-8, UTF-16, and UTF-32. Of these, UTF-8 is the most relevant for a web application.

UTF-8 is a multibyte 8-bit encoding in which each Unicode scalar value is mapped to a sequence of one to four bytes.
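
The "one to four bytes" rule follows fixed code point ranges, which can be sketched as below. utf8_byte_length() is a hypothetical helper name of mine, not a PHP built-in.

```php
<?php
/**
 * Number of bytes UTF-8 needs for one Unicode scalar value.
 * Hypothetical helper, for illustration only.
 */
function utf8_byte_length(int $codePoint): int
{
    // Scalar values run from 0 to 0x10FFFF, excluding the surrogate range.
    if ($codePoint < 0 || $codePoint > 0x10FFFF
        || ($codePoint >= 0xD800 && $codePoint <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }
    if ($codePoint <= 0x7F) {
        return 1;   // ASCII range
    }
    if ($codePoint <= 0x7FF) {
        return 2;   // e.g. "á" (0xE1)
    }
    if ($codePoint <= 0xFFFF) {
        return 3;   // e.g. the euro sign (0x20AC)
    }
    return 4;       // everything above the Basic Multilingual Plane
}
```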

So what is the issue in PHP?

Suppose you have the string "Iñtërnâtiônàlizætiøn". Counting by eye, you can see it contains 20 characters. But PHP counts differently .. ;)

PHP will report 27 characters. That's because the string, encoded as UTF-8, contains multi-byte characters, and PHP's strlen function counts bytes, so each multi-byte character is counted more than once.
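
Here is a small sketch of that count, assuming the mbstring extension is loaded. The string is written with "\u{...}" escapes (PHP 7+) so the result does not depend on the file's encoding; it is the same "Iñtërnâtiônàlizætiøn" as above.

```php
<?php
// "Iñtërnâtiônàlizætiøn" spelled with escapes for the 7 non-ASCII letters.
$s = "I\u{F1}t\u{EB}rn\u{E2}ti\u{F4}n\u{E0}liz\u{E6}ti\u{F8}n";

$bytes = strlen($s);              // counts bytes
$chars = mb_strlen($s, 'UTF-8');  // counts characters

echo $bytes, " bytes, ", $chars, " characters\n";  // 27 bytes, 20 characters
```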

We can get an idea of the characters that PHP's htmlentities function would translate like this (note that get_html_translation_table could not be told what charset to use when this was written; PHP 5.3.4 and later accept an encoding argument):

print_r(array_map('htmlspecialchars',get_html_translation_table(HTML_ENTITIES)));
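
Unlike the translation table, htmlentities itself does take a charset as its third argument. A minimal sketch, with a sample string of my own:

```php
<?php
$text = "caf\u{E9} & cr\u{E8}me";   // "café & crème" as UTF-8 bytes

// The third argument tells htmlentities() how the input bytes are encoded.
$escaped = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $escaped, "\n";   // caf&eacute; &amp; cr&egrave;me
```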

Here's a short summary for those of you who might one day have the same problem.

Introduction to multi-byte strings

My aim was to use UTF-8 encoding throughout my whole PHP application. UTF-8 is a multi-byte character encoding that supports the characters of almost any language on this planet (Latin, Greek, Hebrew, Arabic, Chinese, etc). The problem with UTF-8 is that, since one character can be encoded in up to four bytes, it doesn't work with the byte-oriented string functions that use a start position, end position or string length, such as substr. You have to use the multi-byte safe function instead: mb_substr!
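
To make the substr problem concrete, here is a sketch that cuts the first two "characters" both ways. It assumes mbstring is loaded and PHP 7+ for the escapes.

```php
<?php
$s = "I\u{F1}t\u{EB}rn\u{E2}ti\u{F4}n\u{E0}liz\u{E6}ti\u{F8}n";  // "Iñtërnâtiônàlizætiøn"

// substr() cuts after 2 bytes: "I" plus only the first byte of "ñ".
$byteCut = substr($s, 0, 2);

// mb_substr() cuts after 2 characters: "Iñ".
$charCut = mb_substr($s, 0, 2, 'UTF-8');

// The byte-based cut is no longer valid UTF-8.
$byteCutIsValid = mb_check_encoding($byteCut, 'UTF-8');

var_dump(bin2hex($byteCut), bin2hex($charCut), $byteCutIsValid);
```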

PHP components

First of all, each element may have its own encoding:
- PHP files
- PHP internal
- Output to client
- Client browser
- Input from client (get, post, cookies, files)
- MySQL storage
- MySQL link
- etc (other inputs/outputs or components)

PHP files

Make sure to have a text editor that lets you choose the encoding your files are saved in. Many editors fail to do that in a nice and easy way, but this is the first step: make sure you save all your files in UTF-8.

Note: I used to have BBEdit on Mac, which had a very annoying bug with UTF-8: it prepended an invisible character (most likely the UTF-8 byte order mark) at the beginning of the file, making the output start and sending headers even after the first
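
If you suspect such an invisible prefix, you can check for it yourself. The sketch below assumes the culprit is the UTF-8 byte order mark (the three bytes EF BB BF); has_utf8_bom() and strip_utf8_bom() are hypothetical helper names.

```php
<?php
// Hypothetical helpers; they assume the invisible prefix is the UTF-8 BOM.
function has_utf8_bom(string $contents): bool
{
    // Compare the first three bytes with EF BB BF.
    return strncmp($contents, "\xEF\xBB\xBF", 3) === 0;
}

function strip_utf8_bom(string $contents): string
{
    return has_utf8_bom($contents) ? substr($contents, 3) : $contents;
}
```

You could run has_utf8_bom(file_get_contents($path)) over your source files, where $path is whatever file you want to check.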

PHP internal

PHP's internal encoding should be the same as the one in which the PHP files are saved. To set the encoding, call mb_internal_encoding at the very beginning of your script:
mb_internal_encoding('UTF-8');
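
As a small sketch of what this changes (assuming mbstring is loaded): once the internal encoding is set, the mb_* functions use it whenever you leave out their encoding argument.

```php
<?php
mb_internal_encoding('UTF-8');

$s = "\u{E9}t\u{E9}";   // "été": 3 characters, 5 bytes

// No encoding argument: the internal encoding (UTF-8) is used.
$implicit = mb_strlen($s);

// Explicit encoding argument, for comparison.
$explicit = mb_strlen($s, 'UTF-8');

var_dump($implicit, $explicit, strlen($s));
```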

Output to client

You can get or set the encoding of the output with mb_http_output, but by default it is 'pass', which is good: that way nothing gets modified between what you send to the client and what the client receives.

Client browser

To make sure the client uses the correct encoding when displaying the generated HTML file, send the following HTTP header to the client. Careful: you have to call that function before any output is sent to the client!
header('Content-Type: text/html; charset=UTF-8');
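
One way to guard against the "output already sent" trap is to check headers_sent first. This is only a sketch; send_html_content_type() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: sends the charset header only if that is still possible.
function send_html_content_type(string $charset = 'UTF-8'): bool
{
    // headers_sent() reports whether output has already started, and where.
    if (headers_sent($file, $line)) {
        error_log("Cannot set charset: output started at $file:$line");
        return false;
    }
    header('Content-Type: text/html; charset=' . $charset);
    return true;
}
```

Call it at the top of the script, before any echo or any HTML outside the PHP tags.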

You may also specify the encoding in the HTML source, but I never saw a browser that chooses the encoding based on that.

Input from client

There are different ways the client can send data to your PHP application, such as GET, POST, cookies or files. Each of these inputs may have a different encoding, which in theory you should check. But you can also rely on the fact that a browser usually replies with the same encoding it used to display the page (the one you sent in the HTTP header). The exception is uploaded files, which the browser sends exactly as they are stored on the client's computer.
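
If you do want to check instead of trusting the browser, mb_check_encoding can tell you whether a value is valid UTF-8. A minimal sketch, assuming mbstring is loaded; utf8_or_null() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: returns the value if it is a valid UTF-8 string, null otherwise.
function utf8_or_null($value): ?string
{
    if (!is_string($value) || !mb_check_encoding($value, 'UTF-8')) {
        return null;
    }
    return $value;
}
```

For example, $name = utf8_or_null($_POST['name'] ?? null); where 'name' is just a made-up form field.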

MySQL storage

Make sure to set the collation of your tables, and especially of your text fields, to UTF-8 too. I usually use MySQL's utf8_general_ci.

Note: I once had the database in a different collation than my PHP encoding, yet the characters still displayed correctly in my final HTML file. This was because my strings were badly encoded when I saved them in the database, but decoded back correctly when retrieved. At first I didn't care that they weren't correctly saved in the database as long as they eventually displayed correctly. But this led to many problems when I added the search option to my application! Fulltext searches for "e" or "E" should have matched "é" or "è" for instance, but didn't, since these special characters reached the database in a corrupted form.
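
One plausible way this happens (an assumption on my part, not a record of what my setup actually did) is that UTF-8 bytes get treated as Latin-1 on the way in and converted back on the way out. The sketch below reproduces that round trip with mb_convert_encoding.

```php
<?php
$original = "\u{E9}";   // "é", bytes c3 a9 in UTF-8

// On save: the UTF-8 bytes are mistaken for Latin-1 and converted to UTF-8 again.
$stored = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// On read: the inverse conversion is applied, which hides the damage.
$displayed = mb_convert_encoding($stored, 'ISO-8859-1', 'UTF-8');

echo bin2hex($original), ' ', bin2hex($stored), ' ', bin2hex($displayed), "\n";
// c3a9 c383c2a9 c3a9
```

What sits in the database is the middle value, two characters that are not "é" at all, so a search cannot match it.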

MySQL link

Finally, the link your application uses to talk to MySQL has an encoding too. You can change it with the following MySQL statements, issued just after opening the connection to the database:
mysql_query("SET NAMES 'utf8';", $link);
mysql_query("SET CHARACTER SET 'utf8';", $link);
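
The mysql_* functions used above were removed in PHP 7. With the mysqli extension, a rough equivalent might look like the sketch below. It assumes $link is an already-open mysqli connection; use_utf8_connection() is a hypothetical helper name.

```php
<?php
// Hypothetical helper; $link is assumed to be an open mysqli connection.
function use_utf8_connection(mysqli $link): void
{
    // mysqli::set_charset() sets the connection character set and also
    // tells the client library about it, which a plain SET NAMES query does not.
    if (!$link->set_charset('utf8')) {
        throw new RuntimeException('Could not switch the connection to utf8: ' . $link->error);
    }
}
```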
