Page 1254

The UCS standard (ISO 10646) describes a 31-bit character-set architecture; however, today only the first 65534 code positions (0x0000 to 0xfffd), which are called the Basic Multilingual Plane (BMP), have been assigned characters, and it is expected that only very exotic characters (such as Hieroglyphics) for special scientific purposes will ever get a place outside this 16-bit BMP.

The UCS characters 0x0000 to 0x007f are identical to those of the classic US-ASCII character set and the characters in the range 0x0000 to 0x00ff are identical to those in the ISO 8859-1 Latin-1 character set.

COMBINING CHARACTERS

Some code points in UCS have been assigned to combining characters. These are similar to the non-spacing accent keys on a typewriter. A combining character just adds an accent to the previous character. The most important accented characters have codes of their own in UCS; however, the combining character mechanism allows you to add accents and other diacritical marks to any character. The combining characters always follow the character that they modify. For example, the German character Umlaut-A ("Latin capital letter A with diaeresis") can either be represented by the precomposed UCS code 0x00c4 or alternately as the combination of a normal "Latin capital letter A" followed by a "combining diaeresis": 0x0041 0x0308.

IMPLEMENTATION LEVELS

As not all systems are expected to support advanced mechanisms such as combining characters, ISO 10646 specifies the following three implementation levels of UCS:

Level 1	Combining characters and Hangul Jamo characters (a special, more complicated encoding of the Korean script, where Hangul syllables are coded as two or three subcharacters) are not supported.
Level 2	Like level 1, except in some scripts, some combining characters are now allowed (such as Hebrew, Arabic, Devangari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugo, Kannada, Malayalam, Thai, and Lao).
Level 3	All UCS characters are supported.

The Unicode 1.1 standard published by the Unicode Consortium contains exactly the UCS Basic Multilingual Plane at implementation Level 3, as described in ISO 10646. Unicode 1.1 also adds some semantical definitions for some characters to the definitions of ISO 10646.

UNICODE UNDER LINUX

Under Linux, only the BMP at implementation Level 1 should be used at the moment to keep the implementation complexity of combining characters low. The higher implementation levels are more suitable for special word processing formats but not as a generic system character set. The C type wchar_t is on Linux an unsigned 16-bit integer type and its values are interpreted as UCS Level 1 BMP codes.

The locale setting specifies whether the system character encoding is UTF-8 or ISO 8859-1, for example. Library functions such as wctomb, mbtowc, or wprintf can be used to transform the internal wchar_t characters and strings into the system character encoding and back.

PRIVATE AREA

In the BMP, the range 0xe000 to 0xf8ff will never be assigned any characters by the standard and is reserved for private usage. For the Linux community, this private area is subdivided further into the range 0xe000 to 0xefff, which can be used individually by any end user and the Linux zone in the range 0xf000 to 0xf8ff where extensions are coordinated among all Linux users. The registry of the characters assigned to the Linux zone is currently maintained by H. Peter Anvin (Peter.Anvin@linux.org), Yggdrasil Computing, Inc. It contains some DEC VT100 graphics characters missing in Unicode, gives direct access to the characters in the console font buffer, and contains the characters used by a few advanced scripts such as Klingon.

Page 1255

LITERATURE

Information technology—Universal Multiple-Octet Coded Character Set (UCS). Part 1: Architecture and Basic Multilingual Plane. International Standard ISO 10646-1, International Organization for Standardization, Geneva, 1993.

This is the official specification of UCS. Pretty official, pretty thick, and pretty expensive. For ordering information, check www.iso.ch.

The Unicode Standard—Worldwide Character Encoding Version 1.0. The Unicode Consortium, Addison-Wesley, Reading, MA, 1991.

There is already Unicode 1.1.4 available. The changes to the 1.0 book are available from ftp.unicode.org. Unicode 2.0 will be published again as a book in 1996.

S. Harbison, G. Steele. C, A Reference Manual. Fourth edition, Prentice Hall, Englewood Cliffs, 1995,
ISBN 0-13-326224-3.

A good reference book about the C programming language. The fourth edition now covers also the 1994 Amendment 1 to the ISO C standard (ISO/IEC 9899:1990), which adds a large number of new C library functions for handling wide character sets.

BUGS

At the time when this man page was written, the Linux libc support for UCS was far from complete.

AUTHOR

Markus Kuhn (mskuhn@cip.informatik.uni-erlangen.de)

SEE ALSO

utf-8(7)

Linux, 27 December 1995

UTF-8

UTF-8—An ASCII-compatible multibyte Unicode encoding.

DESCRIPTION

The Unicode character set occupies a 16-bit code space. The most obvious Unicode encoding (known as UCS-2) consists of a sequence of 16-bit words. Such strings can contain as parts of many 16-bit characters bytes such as \0 or /, which have a special meaning in filenames and other C library function parameters. In addition, the majority of UNIX tools expects ASCII files and can't read 16-bit words as characters without major modifications. For these reasons, UCS-2 is not a suitable external encoding of Unicode in filenames, text files, environment variables, and so on. The ISO 10646 Universal Character Set (UCS), a superset of Unicode, occupies even a 31-bit code space, and the obvious UCS-4 encoding for it (a sequence of 32-bit words) has the same problems.

The UTF-8 encoding of Unicode and UCS does not have these problems and is the way to go for using the Unicode character set under UNIX-style operating systems.

PROPERTIES

The UTF-8 encoding has the following nice properties:

UCS characters 0x00000000 to 0x0000007f (the classical U.S. ASCII characters) are encoded simply as bytes 0x00 to 0x7f (ASCII compatibility). This means that files and strings that contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.

All UCS characters greater than 0x7f are encoded as a multibyte sequence consisting only of bytes in the range 0x80 to 0xfd, so no ASCII byte can appear as part of another character and there are no problems with \0 or /.

Previous | Table of Contents | Next