
      UCS-2 character with two zero-valued octets.  However, pairs of
      UCS-2 values between D800 and DFFF (surrogate pairs in Unicode
      parlance), being actually UCS-4 characters transformed through
      UTF-16, need special treatment: the UTF-16 transformation must be
      undone, yielding a UCS-4 character that is then transformed as
      above.
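
      As an illustration only, undoing the UTF-16 transformation for
      one surrogate pair might be sketched as below.  The function name
      combine_surrogates is hypothetical and is not part of this memo;
      the arithmetic is the usual UTF-16 pairing rule.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Undo the UTF-16 transformation for one surrogate pair.

    `high` must lie in D800..DBFF and `low` in DC00..DFFF; the result
    is the UCS-4 value that the pair stands for.
    """
    if not (0xD800 <= high <= 0xDBFF):
        raise ValueError("first value is not a high surrogate")
    if not (0xDC00 <= low <= 0xDFFF):
        raise ValueError("second value is not a low surrogate")
    # Each surrogate carries 10 bits; the pair is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

      The resulting UCS-4 value is then encoded like any other
      character; an unpaired surrogate is rejected by this sketch.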

      Decoding from UTF-8 to UCS-4 proceeds as follows:

   1) Initialize the 4 octets of the UCS-4 character with all bits set
      to 0.

   2) Determine which bits encode the character value from the number of
      octets in the sequence and the second column of the table above
      (the bits marked x).

   3) Distribute the bits from the sequence to the UCS-4 character,
      first the lower-order bits from the last octet of the sequence and
      proceeding to the left until no x bits are left.

      If the UTF-8 sequence is no more than three octets long, decoding
      can proceed directly to UCS-2.
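
      The three decoding steps can be sketched as follows.  This is a
      minimal illustration, not a reference implementation: the names
      _LEAD_FORMS and decode_one are hypothetical, and the table simply
      restates the lead-octet patterns for sequences of one to six
      octets.  The sketch deliberately performs no check for overlong
      sequences, so it is "naive" in the sense of the NOTE that follows.

```python
# (mask, pattern, sequence length, number of x bits in the lead octet)
_LEAD_FORMS = [
    (0x80, 0x00, 1, 7),
    (0xE0, 0xC0, 2, 5),
    (0xF0, 0xE0, 3, 4),
    (0xF8, 0xF0, 4, 3),
    (0xFC, 0xF8, 5, 2),
    (0xFE, 0xFC, 6, 1),
]


def decode_one(seq: bytes) -> int:
    """Decode a single UTF-8 sequence into a UCS-4 value."""
    if not seq:
        raise ValueError("empty sequence")
    # Step 2: the lead octet tells how many octets and x bits to expect.
    for mask, pattern, length, lead_bits in _LEAD_FORMS:
        if seq[0] & mask == pattern:
            break
    else:
        raise ValueError("invalid lead octet")
    if len(seq) != length:
        raise ValueError("wrong number of octets")

    value = 0  # Step 1: all bits of the UCS-4 character set to 0.
    shift = 0
    # Step 3: take the low-order bits from the last octet first,
    # then proceed to the left.
    for octet in reversed(seq[1:]):
        if octet & 0xC0 != 0x80:
            raise ValueError("invalid continuation octet")
        value |= (octet & 0x3F) << shift
        shift += 6
    value |= (seq[0] & ((1 << lead_bits) - 1)) << shift
    return value
```

      For example, decode_one(bytes([0xE2, 0x89, 0xA2])) yields 0x2262.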




 
RFC 2279                         UTF-8                      January 1998


        NOTE -- actual implementations of the decoding algorithm above
        should protect against decoding invalid sequences.  For
        instance, a naive implementation may (wrongly) decode the
        invalid UTF-8 sequence C0 80 into the character U+0000, which
        may have security consequences and/or cause other problems.  See
        the Security Considerations section below.
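
        One possible form of such protection, sketched under the
        assumption that "invalid" here means a sequence longer than the
        value requires, is to compare the decoded value with the
        smallest value that needs that many octets.  The names
        _MIN_VALUE, sequence_length and decode_strict are hypothetical.

```python
# Smallest value that genuinely needs a sequence of each length.
_MIN_VALUE = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000,
              5: 0x200000, 6: 0x4000000}

# (mask, pattern) for lead octets of 2- to 6-octet sequences.
_MULTI_OCTET_LEADS = [(0xE0, 0xC0), (0xF0, 0xE0), (0xF8, 0xF0),
                      (0xFC, 0xF8), (0xFE, 0xFC)]


def sequence_length(lead: int) -> int:
    """Return the sequence length announced by a lead octet."""
    if lead < 0x80:
        return 1
    for length, (mask, pattern) in enumerate(_MULTI_OCTET_LEADS, start=2):
        if lead & mask == pattern:
            return length
    raise ValueError("invalid lead octet")


def decode_strict(seq: bytes) -> int:
    """Decode one UTF-8 sequence, rejecting overlong forms like C0 80."""
    if not seq:
        raise ValueError("empty sequence")
    length = sequence_length(seq[0])
    if len(seq) != length:
        raise ValueError("wrong number of octets")
    if length == 1:
        return seq[0]
    value = seq[0] & (0x7F >> length)  # x bits of the lead octet
    for octet in seq[1:]:
        if octet & 0xC0 != 0x80:
            raise ValueError("invalid continuation octet")
        value = (value << 6) | (octet & 0x3F)
    if value < _MIN_VALUE[length]:
        raise ValueError("overlong sequence")
    return value
```

        With this check, decode_strict(b"\xC0\x80") raises ValueError
        instead of returning U+0000.  Other validity checks may be
        needed; this sketch covers only the case named in the NOTE.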

   A more detailed algorithm and formulae can be found in [FSS_UTF],
   [UNICODE] or Annex R to [ISO-10646].

3.  Versions of the standards

   ISO/IEC 10646 is updated from time to time by published amendments;
   similarly, different versions of the Unicode standard exist: 1.0, 1.1
   and 2.0 as of this writing.  Each new version obsoletes and replaces
   the previous one, but implementations, and more significantly data,
   are not updated instantly.

   In general, the changes amount to adding new characters, which does
   not pose particular problems with old data.  Amendment 5 to ISO/IEC
   10646, however, has moved and expanded the Korean Hangul block,
   thereby making any previous data containing Hangul characters invalid
   under the new version.  Unicode 2.0 has the same difference from
   Unicode 1.1. The official justification for allowing such an
   incompatible change was that no implementations and no data
   containing Hangul existed, a statement that is likely to be true but
   remains unprovable.  The incident has been dubbed the "Korean mess",
   and the relevant committees have pledged to never, ever again make
   such an incompatible change.

   New versions, and in particular any incompatible changes, have
   consequences regarding MIME character encoding labels, to be discussed
   in section 5.

4.  Examples

   The UCS-2 sequence "A<NOT IDENTICAL TO><ALPHA>." (0041, 2262, 0391,
   002E) may be encoded in UTF-8 as follows:

   41 E2 89 A2 CE 91 2E

   The UCS-2 sequence representing the Hangul characters for the Korean
   word "hangugo" (D55C, AD6D, C5B4) may be encoded as follows:

   ED 95 9C EA B5 AD EC 96 B4







 
   The UCS-2 sequence representing the Han characters for the Japanese
   word "nihongo" (65E5, 672C, 8A9E) may be encoded as follows:

   E6 97 A5 E6 9C AC E8 AA 9E
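
   The three examples in this section can be checked mechanically.  The
   sketch below uses Python's built-in "utf-8" codec rather than the
   algorithm text of this memo; the helper utf8_hex and the dictionary
   keys are hypothetical names chosen for illustration.

```python
def utf8_hex(code_points):
    """Encode a list of UCS-2 code points and return spaced hex octets."""
    text = "".join(chr(cp) for cp in code_points)
    return " ".join(f"{octet:02X}" for octet in text.encode("utf-8"))


# Code points and expected octets copied from the examples in section 4.
EXAMPLES = {
    "mixed":   ([0x0041, 0x2262, 0x0391, 0x002E], "41 E2 89 A2 CE 91 2E"),
    "hangugo": ([0xD55C, 0xAD6D, 0xC5B4], "ED 95 9C EA B5 AD EC 96 B4"),
    "nihongo": ([0x65E5, 0x672C, 0x8A9E], "E6 97 A5 E6 9C AC E8 AA 9E"),
}


def check_examples():
    """Return True if every example encodes to the octets listed."""
    return all(utf8_hex(cps) == expected
               for cps, expected in EXAMPLES.values())
```

   Calling check_examples() compares each encoded sequence with the
   octets printed above.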

5.  MIME registration

   This memo is meant to serve as the basis for registration of a MIME
   character set parameter (charset) [CHARSET-REG].  The proposed
   charset parameter value is "UTF-8".  This string labels media types
   containing text consisting of characters from the repertoire of
   ISO/IEC 10646 including all amendments at least up to amendment 5
   (Korean block), encoded to a sequence of octets using the encoding
   scheme outlined above.  UTF-8 is suitable for use in MIME content