Far Eastern Encodings

The Far Eastern scripts, which include Chinese, Japanese, and Korean (collectively known as CJK), contain thousands of written symbols, far more than the 256 values available in an 8-bit (single-byte) code page. For this reason, an encoding solution that could handle the CJK scripts had to be developed.

The general encoding solution for the CJK scripts was developed in Japan. It uses a 16-bit (two-byte) code page for each of these scripts. A double-byte code page can represent up to 65,536 (2^16) written symbols, more than enough for most text processing applications. Different computer implementations were later developed to improve performance with double-byte codes. Today, three general schemes are used for CJK encodings.

The first scheme, modal encoding, uses a two-stage process. The first stage is mode switching (for example, switching between a sequence of single-byte Latin letters and double-byte Japanese kanji characters), which is signaled by an escape sequence of specific codes. The second stage is the handling of the actual bytes that represent the characters. Modal encoding methods typically use 7-bit codes. An example is Japanese JIS encoding.
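The mode switching described above can be observed with a short Python sketch. It assumes Python's built-in iso2022_jp codec as a stand-in for the JIS encoding named in the text; the sample string is purely illustrative.

```python
# Illustrative sketch: modal encoding as implemented by Python's
# built-in "iso2022_jp" codec (an assumption standing in for "JIS").
text = "A\u3042B"  # Latin "A", hiragana "a" (U+3042), Latin "B"

encoded = text.encode("iso2022_jp")

# Escape sequences that signal the mode switch.
TO_DOUBLE_BYTE = b"\x1b$B"  # ESC $ B: switch to two-byte JIS X 0208
TO_ASCII = b"\x1b(B"        # ESC ( B: switch back to single-byte ASCII

print(encoded)
print("switches to double-byte mode:", TO_DOUBLE_BYTE in encoded)
print("switches back to ASCII:", TO_ASCII in encoded)

# Modal encodings of this kind stay within 7-bit values.
print("all bytes are 7-bit:", all(b < 0x80 for b in encoded))

# Decoding reverses both stages and recovers the original text.
decoded = encoded.decode("iso2022_jp")
```

The two bytes between the escape sequences carry the kanji or kana itself; a decoder must remember which mode it is in to interpret them, which is why a modal stream cannot be read starting from an arbitrary position.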

The second scheme, non-modal encoding, uses the numeric byte value of a text stream to decide when to switch between one- and two-byte-per-character modes. For example, in both the Japanese Shift-JIS and traditional Chinese Big-5 encodings, a value in the range 0x20-0x7F represents the corresponding 7-bit ASCII (English) letters. However, a value in the range 0x80-0xFF generally indicates that the value is the first byte in a double-byte ideograph. (The exact lead-byte ranges depend on the encoding; in Shift-JIS, for instance, the values 0xA1-0xDF are single-byte half-width katakana rather than lead bytes.) Non-modal encoding methods typically use 8-bit codes.
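As a rough illustration of non-modal decoding, the following Python sketch splits a Shift-JIS byte stream into characters by inspecting each byte value. The helper names is_sjis_lead_byte and split_sjis are hypothetical, and the lead-byte ranges used (0x81-0x9F and 0xE0-0xFC) are the commonly documented Shift-JIS ranges, which are narrower than the full 0x80-0xFF span.

```python
# Illustrative sketch: byte-value-driven (non-modal) scanning of Shift-JIS.
# Helper names are hypothetical; lead-byte ranges are an assumption based
# on the commonly documented Shift-JIS layout.

def is_sjis_lead_byte(value: int) -> bool:
    """Return True if the byte starts a two-byte Shift-JIS character."""
    return 0x81 <= value <= 0x9F or 0xE0 <= value <= 0xFC

def split_sjis(data: bytes) -> list:
    """Split a Shift-JIS byte string into one- and two-byte characters."""
    chars = []
    i = 0
    while i < len(data):
        width = 2 if is_sjis_lead_byte(data[i]) else 1
        chars.append(data[i:i + width])
        i += width
    return chars

# Latin "A", hiragana "a" (U+3042), half-width katakana "a" (U+FF71).
sample = "A\u3042\uff71".encode("shift_jis")
pieces = split_sjis(sample)

for piece in pieces:
    print(piece.hex(), len(piece), piece.decode("shift_jis"))
```

No escape sequences are needed here: the value of each byte alone tells the scanner whether one more byte belongs to the same character.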

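The third scheme, fixed-width encoding, uses the same number of bytes to represent every character in a character set. There is no switching between one- and two-byte-per-character modes. This encoding method simplifies searching, indexing, and sorting of text, but can use a large amount of space, because even ASCII letters occupy the full width. UCS-2, the original 16-bit form of Unicode, is an example of a fixed-width encoding scheme. UTF-16 is fixed-width only for characters in the Basic Multilingual Plane; it represents characters outside that plane with a pair of 16-bit units (a surrogate pair).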
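The trade-off of fixed-width encoding can be sketched in Python using the built-in utf-16-le and utf-32-le codecs. The helper char_at is hypothetical, and it assumes every character lies in the Basic Multilingual Plane, where UTF-16 uses exactly two bytes per character; characters outside that plane take four bytes in UTF-16, as the last lines show.

```python
# Illustrative sketch: fixed-width indexing with two bytes per character.
# Assumes all characters are in the Basic Multilingual Plane (BMP).
text = "A\u3042\u6f22"  # Latin "A", hiragana U+3042, kanji U+6F22

data = text.encode("utf-16-le")
print(len(text), "characters ->", len(data), "bytes")

def char_at(buffer: bytes, index: int) -> str:
    """Fetch the character at a given index by simple arithmetic."""
    start = 2 * index
    return buffer[start:start + 2].decode("utf-16-le")

third = char_at(data, 2)
print("third character:", third)

# Caveat: a character outside the BMP needs a surrogate pair in UTF-16,
# so the two-bytes-per-character assumption no longer holds.
supplementary = "\U00020BB7"
utf16_len = len(supplementary.encode("utf-16-le"))
utf32_len = len(supplementary.encode("utf-32-le"))
print("UTF-16 bytes:", utf16_len, "UTF-32 bytes:", utf32_len)
```

Direct indexing is what makes searching and sorting simpler, while the space cost shows in the Latin letter, which takes two bytes here instead of one.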

These schemes are generally applied to actual code page implementation by an operating system-specific method. Some systems use both a single-byte code page (for Latin text or kana) and a double-byte code page (for ideographs), and switch between them. In addition, there are multiple code page solutions on the same platform for each of these languages, particularly Chinese.
