UCS-2 character with two zero-valued octets. However, pairs of
UCS-2 values between D800 and DFFF (surrogate pairs in Unicode
parlance), being actually UCS-4 characters transformed through
UTF-16, need special treatment: the UTF-16 transformation must be
undone, yielding a UCS-4 character that is then transformed as
above.
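As an illustration of undoing the UTF-16 transformation, a minimal sketch follows. The function name combine_surrogates is hypothetical and does not come from this memo. The arithmetic is the standard UTF-16 relation: a high value in D800-DBFF and a low value in DC00-DFFF each carry 10 bits of the character, offset by 10000 hexadecimal.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Undo the UTF-16 transformation for one surrogate pair.

    Illustrative helper (name is an assumption, not part of the memo).
    Returns the UCS-4 value, which can then be encoded as UTF-8.
    """
    if not (0xD800 <= high <= 0xDBFF):
        raise ValueError("first value is not a high surrogate")
    if not (0xDC00 <= low <= 0xDFFF):
        raise ValueError("second value is not a low surrogate")
    # Each surrogate contributes 10 bits; the result starts at 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    print(hex(combine_surrogates(0xD834, 0xDD1E)))
```

The resulting value lies above FFFF and is therefore encoded as a UTF-8 sequence of four octets, not as two sequences of three octets each.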
Decoding from UTF-8 to UCS-4 proceeds as follows:
1) Initialize the 4 octets of the UCS-4 character with all bits set
to 0.
2) Determine which bits encode the character value from the number of
octets in the sequence and the second column of the table above
(the bits marked x).
3) Distribute the bits from the sequence to the UCS-4 character,
first the lower-order bits from the last octet of the sequence and
proceeding to the left until no x bits are left.
If the UTF-8 sequence is no more than three octets long, decoding
can proceed directly to UCS-2.
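The three decoding steps might be sketched as below. This is an illustration under stated assumptions, not a reference implementation: the names LEAD_PATTERNS and decode_sequence are hypothetical, the function handles exactly one sequence, and the bits are accumulated from the first octet rightwards, which yields the same value as filling in from the last octet leftwards. The sketch checks only the bit patterns of the octets, so it is "naive" in that it accepts sequences that are longer than necessary.

```python
# (mask, pattern, length) for the first octet of a sequence of 1 to 6 octets.
LEAD_PATTERNS = [
    (0x80, 0x00, 1),  # 0xxxxxxx
    (0xE0, 0xC0, 2),  # 110xxxxx
    (0xF0, 0xE0, 3),  # 1110xxxx
    (0xF8, 0xF0, 4),  # 11110xxx
    (0xFC, 0xF8, 5),  # 111110xx
    (0xFE, 0xFC, 6),  # 1111110x
]


def decode_sequence(octets: bytes) -> int:
    """Decode ONE UTF-8 sequence into a UCS-4 value (illustrative only)."""
    lead = octets[0]
    for mask, pattern, length in LEAD_PATTERNS:
        if lead & mask == pattern:
            break
    else:
        raise ValueError("octet cannot start a sequence")
    if len(octets) != length:
        raise ValueError("wrong number of octets for this first octet")

    # Step 1 is implicit: the value starts with all bits 0 except the
    # x bits taken from the first octet (step 2 selects them via the mask).
    value = lead & ~mask & 0xFF
    # Step 3: append the six x bits of every following 10xxxxxx octet.
    for octet in octets[1:]:
        if octet & 0xC0 != 0x80:
            raise ValueError("expected an octet of the form 10xxxxxx")
        value = (value << 6) | (octet & 0x3F)
    return value


if __name__ == "__main__":
    print(hex(decode_sequence(bytes([0xE2, 0x89, 0xA2]))))
```

For sequences of at most three octets the result never exceeds FFFF, which is why such sequences can be decoded directly to UCS-2.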
RFC 2279 UTF-8 January 1998
NOTE -- actual implementations of the decoding algorithm above
should protect against decoding invalid sequences. For
instance, a naive implementation may (wrongly) decode the
invalid UTF-8 sequence C0 80 into the character U+0000, which
may have security consequences and/or cause other problems. See
the Security Considerations section below.
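One possible form of such protection is to reject any decoded value that would have fit in a shorter sequence. The sketch below assumes this check is applied after the bits have been extracted; the names MIN_VALUE and is_overlong are hypothetical, and the check covers only this one class of invalid sequence.

```python
# Smallest character value that genuinely needs a sequence of each length.
MIN_VALUE = {
    1: 0x00000000,
    2: 0x00000080,
    3: 0x00000800,
    4: 0x00010000,
    5: 0x00200000,
    6: 0x04000000,
}


def is_overlong(value: int, length: int) -> bool:
    """True if `value` was carried by a longer sequence than necessary."""
    return value < MIN_VALUE[length]


# What a naive decoder extracts from the invalid sequence C0 80:
# five x bits from the first octet, six from the second.
naive = ((0xC0 & 0x1F) << 6) | (0x80 & 0x3F)

if __name__ == "__main__":
    print(naive, is_overlong(naive, 2))
```

Here the naive value is 0, which fits in a single octet, so a decoder applying the check would refuse C0 80 instead of producing U+0000.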
A more detailed algorithm and formulae can be found in [FSS_UTF],
[UNICODE] or Annex R to [ISO-10646].
3. Versions of the standards
ISO/IEC 10646 is updated from time to time by published amendments;
similarly, different versions of the Unicode standard exist: 1.0, 1.1
and 2.0 as of this writing. Each new version obsoletes and replaces
the previous one, but implementations, and more significantly data,
are not updated instantly.
In general, the changes amount to adding new characters, which does
not pose particular problems with old data. Amendment 5 to ISO/IEC
10646, however, has moved and expanded the Korean Hangul block,
thereby making any previous data containing Hangul characters invalid
under the new version. Unicode 2.0 has the same difference from
Unicode 1.1. The official justification for allowing such an
incompatible change was that no implementations and no data
containing Hangul existed, a statement that is likely to be true but
remains unprovable. The incident has been dubbed the "Korean mess",
and the relevant committees have pledged to never, ever again make
such an incompatible change.
New versions, and in particular any incompatible changes, have
consequences regarding MIME character encoding labels, to be discussed
in section 5.
4. Examples
The UCS-2 sequence "A<NOT IDENTICAL TO><ALPHA>." (0041, 2262, 0391,
002E) may be encoded in UTF-8 as follows:
41 E2 89 A2 CE 91 2E
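The octets above can be reproduced with a small encoder. The following is a sketch only, assuming input restricted to UCS-2 values outside the surrogate range D800-DFFF; the name encode_ucs2 is hypothetical.

```python
def encode_ucs2(values):
    """Encode an iterable of UCS-2 values as UTF-8 (illustrative only).

    Surrogates are rejected here; they would need the UTF-16
    transformation undone before encoding.
    """
    out = bytearray()
    for v in values:
        if not 0 <= v <= 0xFFFF:
            raise ValueError("not a UCS-2 value")
        if 0xD800 <= v <= 0xDFFF:
            raise ValueError("surrogate values need special treatment")
        if v < 0x80:
            out.append(v)                          # 0xxxxxxx
        elif v < 0x800:
            out.append(0xC0 | (v >> 6))            # 110xxxxx
            out.append(0x80 | (v & 0x3F))          # 10xxxxxx
        else:
            out.append(0xE0 | (v >> 12))           # 1110xxxx
            out.append(0x80 | ((v >> 6) & 0x3F))   # 10xxxxxx
            out.append(0x80 | (v & 0x3F))          # 10xxxxxx
    return bytes(out)


if __name__ == "__main__":
    encoded = encode_ucs2([0x0041, 0x2262, 0x0391, 0x002E])
    print(" ".join(f"{octet:02X}" for octet in encoded))
```

The four characters take one, three, two and one octets respectively, which accounts for the seven octets shown.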
The UCS-2 sequence representing the Hangul characters for the Korean
word "hangugo" (D55C, AD6D, C5B4) may be encoded as follows:
ED 95 9C EA B5 AD EC 96 B4
The UCS-2 sequence representing the Han characters for the Japanese
word "nihongo" (65E5, 672C, 8A9E) may be encoded as follows:
E6 97 A5 E6 9C AC E8 AA 9E
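The Hangul and Han examples can be cross-checked against an existing UTF-8 implementation. The sketch below uses the codec built into Python purely as a convenience; the dictionary name EXAMPLES is hypothetical, and the expected values are the ones listed in the examples above.

```python
# Octet sequences and UCS-2 values copied from the examples in this section.
EXAMPLES = {
    "hangugo": ("ED 95 9C EA B5 AD EC 96 B4", [0xD55C, 0xAD6D, 0xC5B4]),
    "nihongo": ("E6 97 A5 E6 9C AC E8 AA 9E", [0x65E5, 0x672C, 0x8A9E]),
}


def decode_hex(octets_hex: str):
    """Decode space-separated hex octets as UTF-8, returning the values."""
    text = bytes.fromhex(octets_hex).decode("utf-8")
    return [ord(ch) for ch in text]


if __name__ == "__main__":
    for word, (octets_hex, expected) in EXAMPLES.items():
        values = decode_hex(octets_hex)
        print(word, [f"{v:04X}" for v in values], values == expected)
```

Each of the six characters lies between 0800 and FFFF, so each occupies exactly three octets.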
5. MIME registration
This memo is meant to serve as the basis for registration of a MIME
character set parameter (charset) [CHARSET-REG]. The proposed
charset parameter value is "UTF-8". This string labels media types
containing text consisting of characters from the repertoire of
ISO/IEC 10646 including all amendments at least up to amendment 5
(Korean block), encoded to a sequence of octets using the encoding
scheme outlined above. UTF-8 is suitable for use in MIME content