turrier.fr

Source : ct|04.05.09


The iso-8859, GB2312 and UTF-8 standards

Taken together, the writing systems of the world comprise more than 100,000 characters: European (Latin, Armenian, Cyrillic, Greek), African (Ethiopic), Indic (Bengali, Sinhala, Tamil...), East Asian (Chinese, Japanese, Korean...), Central Asian (Mongolian, Tibetan...), Middle Eastern (Arabic, Hebrew, Syriac...), Philippine, American, South East Asian (Khmer, Thai...) and ancient scripts (Greek, Gothic, Phoenician, Persian...).

The first 256 characters (the 128 ASCII characters, followed by the accented Latin letters and symbols of iso-8859-1) are the "classic" characters.

[Figure: ASCII characters]

The other characters are "extended" characters.

[Figure: extended characters]

The standards and the character sets

Various standards exist for encoding characters. ASCII and Unicode are the ones most frequently encountered.

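The ASCII standard stores every character in 1 byte, but it defines only 128 characters; 1-byte extensions such as iso-8859-1 are needed to cover all 256 classic characters. The Unicode standard is necessary to represent the extended characters. Unicode assigns a number (a code point) to each character, and the number of bytes used depends on the chosen encoding: 2 or 4 bytes per character in UTF-16, 1 to 4 bytes in UTF-8.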
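The byte counts per character can be checked directly. The following is a minimal sketch in Python, assuming its built-in codecs; the helper name `byte_length` and the three sample characters are illustrative choices, not part of the original tutorial.

```python
# Compare how many bytes one character needs under several encodings.
# byte_length is a hypothetical helper written for this illustration.

def byte_length(char, encoding):
    """Return the encoded size in bytes, or None if the encoding cannot represent char."""
    try:
        return len(char.encode(encoding))
    except UnicodeEncodeError:
        return None

samples = ["A", "é", "中"]  # an ASCII letter, an accented Latin letter, a Chinese ideogram
encodings = ("ascii", "iso-8859-1", "utf-16-be", "utf-8")

for ch in samples:
    sizes = {enc: byte_length(ch, enc) for enc in encodings}
    print(ch, sizes)
```

A value of `None` means the character simply does not exist in that standard, which is the case for "é" in ASCII and for "中" in both 1-byte standards.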

To allow Internet browsers to recognize and display the characters of the various languages, a web page declares the character set (charset) in which it is encoded. The following character sets are commonly used:
- The charset iso-8859-1;
- The charset GB2312;
- The charset UTF-8.
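Why the declared charset matters can be illustrated by decoding the same bytes in two ways. This is a small sketch in Python, assuming its built-in `gb2312` and `iso-8859-1` codecs; the variable names are illustrative only.

```python
# The same bytes read under two different charset declarations.
raw = "中文".encode("gb2312")             # bytes of a page saved in GB2312

as_gb2312 = raw.decode("gb2312")          # matching declaration: the ideograms come back
as_latin1 = raw.decode("iso-8859-1")      # wrong declaration: each byte becomes one Latin character

print(len(raw), repr(as_gb2312), repr(as_latin1))
```

With the wrong declaration nothing fails; the browser just shows four unrelated accented letters instead of two ideograms.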

The charset iso-8859-1

With the character set iso-8859-1, every character is represented with exactly 1 byte, which covers the 256 classic characters (the Latin letters of Western European languages). Extended characters have no byte representation in this character set; in a web page they must be written as numeric character references, for example &#20013; for the Chinese character 中. The character set iso-8859-1 allows the browsers to display Latin characters perfectly. Extended characters written in this way, in particular Chinese characters (ideograms), are also displayed by Internet Explorer, but sometimes with an uneven appearance.

The meta tag corresponding to this character set is:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
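iso-8859-1 is a 1-byte encoding, so a character outside its 256 positions is usually placed in such a page as a numeric character reference of the form &#NNNNN;. The sketch below, assuming Python's standard `xmlcharrefreplace` error handler and the `html` module, shows one way to produce and read back such a page body; the sample text is an arbitrary choice.

```python
import html

text = "français 中文"

# Characters that iso-8859-1 cannot hold are replaced by &#NNNNN; references.
page_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(page_bytes)

# A browser does the reverse: decode the bytes, then resolve the references.
restored = html.unescape(page_bytes.decode("iso-8859-1"))
print(restored == text)
```

Here "ç" still occupies a single byte, while each ideogram costs eight ASCII bytes, which is why this approach only suits pages with few extended characters.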

The charset GB2312

With the character set GB2312, a basic Latin (ASCII) character is represented with 1 byte; GB2312 also contains full-width variants of these characters, which are represented with 2 bytes. Chinese characters are represented with 2 bytes each.
The character set GB2312 allows the browsers to display Chinese characters perfectly. With this character set, the other characters, in particular Latin characters, are also displayed, but some accented characters (c with cedilla, for example) are displayed more or less correctly, because GB2312 contains only a few accented Latin letters.

The meta tag corresponding to this character set is:

<meta http-equiv="Content-Type" content="text/html; charset=GB2312" />
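The byte sizes used by GB2312 can be observed with a short sketch. It assumes Python's built-in `gb2312` codec; the helper name `gb2312_size` and the four sample characters are hypothetical choices made for this illustration.

```python
# Size in bytes of a few characters in GB2312 (None = not in the character set).

def gb2312_size(char):
    try:
        return len(char.encode("gb2312"))
    except UnicodeEncodeError:
        return None

samples = {
    "ASCII A": "A",
    "full-width A": "\uff21",   # the 2-byte Latin variant included in GB2312
    "ideogram": "中",
    "c with cedilla": "ç",
}

for label, ch in samples.items():
    print(label, gb2312_size(ch))
```

A `None` result for the c with cedilla means the letter has no code in GB2312 at all, so a page in this charset has to fall back on a character reference for it.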

The charset UTF-8

The character set UTF-8 (Unicode Transformation Format, 8-bit) allows the browsers to display all characters perfectly. With this character set, the 128 ASCII characters are represented with a single byte each, exactly as in ASCII. All other characters are represented with 2 to 4 bytes each: 2 bytes for accented Latin letters, 3 bytes for most Chinese characters. This character set is more complicated to use, given the variable size of the characters.
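The variable size is manageable because the first byte of each UTF-8 sequence announces its length. The following is a simplified sketch in Python that splits a byte string into characters on that basis; the function names are hypothetical, and the code assumes well-formed input (it does not validate continuation bytes).

```python
def utf8_sequence_length(first_byte):
    """Read the sequence length from the leading bits of the first byte."""
    if first_byte < 0x80:            # 0xxxxxxx: ASCII, 1 byte
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx: 2 bytes
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx: 3 bytes
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx: 4 bytes
        return 4
    raise ValueError("not a valid UTF-8 leading byte")

def split_utf8(data):
    """Cut UTF-8 bytes into one bytes object per character."""
    pieces = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        pieces.append(data[i:i + n])
        i += n
    return pieces

# ASCII letter, accented letter, ideogram, Gothic letter (U+10348)
encoded = "Aé中\U00010348".encode("utf-8")
print([len(p) for p in split_utf8(encoded)])
```

In practice a program rarely does this by hand, since calling `decode("utf-8")` handles the variable sizes automatically.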

Conclusion

The iso-8859-1 standard suits web pages containing mostly Latin characters, plus a few extended characters, provided that one has checked that the latter display correctly. The GB2312 standard suits web pages containing Latin characters together with numerous Chinese ideograms. The UTF-8 standard should be considered when no simpler solution exists.
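The choice described above can be approximated by trying each charset in order of simplicity. This is only a sketch in Python under stated assumptions: the function `pick_charset` is hypothetical, and it is stricter than the rule above because it rejects a charset as soon as one character cannot be encoded directly, without resorting to character references.

```python
def pick_charset(text, candidates=("iso-8859-1", "gb2312", "utf-8")):
    """Return the first candidate charset able to encode the whole text."""
    for charset in candidates:
        try:
            text.encode(charset)
            return charset
        except UnicodeEncodeError:
            continue
    raise ValueError("no candidate charset can encode this text")

print(pick_charset("français"))          # Latin text only
print(pick_charset("Hello 中文"))         # ASCII plus ideograms
print(pick_charset("français 中文"))      # accented Latin plus ideograms
```

The third case falls through to UTF-8 because neither of the simpler charsets contains both the c with cedilla and the ideograms.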



© http://turrier.fr (2007)