types under the "text" top-level type.
It is noteworthy that the label "UTF-8" does not contain a version
identification, referring generically to ISO/IEC 10646. This is
intentional; the rationale is as follows:
A MIME charset label is designed to give just the information needed
to interpret a sequence of bytes received on the wire into a sequence
of characters, nothing more (see RFC 2045, section 2.2, in [MIME]).
As long as a character set standard does not change incompatibly,
version numbers serve no purpose, because one gains nothing by
learning from the tag that newly assigned characters may be received
that one doesn't know about. The tag itself doesn't teach anything
about the new characters, which are going to be received anyway.
Hence, as long as the standards evolve compatibly, the apparent
advantage of having labels that identify the versions is only that,
apparent. But there is a disadvantage to such version-dependent
labels: when an older application receives data accompanied by a
newer, unknown label, it may fail to recognize the label and be
completely unable to deal with the data, whereas a generic, known
label would have triggered mostly correct processing of the data,
which may well not contain any new characters.
Now the "Korean mess" (ISO/IEC 10646 amendment 5) is an incompatible
change, in principle contradicting the appropriateness of a
version-independent MIME charset label as described above. But the
compatibility problem can only appear with data containing Korean
Hangul characters encoded according to Unicode 1.1 (or equivalently
ISO/IEC 10646 before amendment 5), and there is arguably no such data
to worry about, this being the very reason the incompatible change
was deemed acceptable.
RFC 2279 UTF-8 January 1998
In practice, then, a version-independent label is warranted, provided
the label is understood to refer to all versions after Amendment 5,
and provided no incompatible change actually occurs. Should
incompatible changes occur in a later version of ISO/IEC 10646, the
MIME charset label defined here will stay aligned with the previous
version until and unless the IETF specifically decides otherwise.
It is also proposed to register the charset parameter value
"UNICODE-1-1-UTF-8", for the exclusive purpose of labelling text data
containing Hangul syllables encoded to UTF-8 without taking into
account Amendment 5 of ISO/IEC 10646 (i.e. using the pre-amendment 5
code point assignments). Any other UTF-8 data SHOULD NOT use this
label, in particular data not containing any Hangul syllables, and it
is felt important to strongly recommend against creating any new
Hangul-containing data without taking Amendment 5 of ISO/IEC 10646
into account.
6. Security Considerations
Implementors of UTF-8 need to consider the security aspects of how
they handle illegal UTF-8 sequences. It is conceivable that in some
circumstances an attacker would be able to exploit an incautious
UTF-8 parser by sending it an octet sequence that is not permitted by
the UTF-8 syntax.
A particularly subtle form of this attack could be carried out
against a parser which performs security-critical validity checks
against the UTF-8 encoded form of its input, but interprets certain
illegal octet sequences as characters. For example, a parser might
prohibit the NUL character when encoded as the single-octet sequence
00, but allow the illegal two-octet sequence C0 80 and interpret it
as a NUL character. Another example might be a parser which
prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
illegal octet sequence 2F C0 AE 2E 2F.
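The two examples above can be made concrete with a minimal sketch of a
strict decoder, written here in Python. It assumes the one- to six-octet
forms described in this memo and rejects any sequence that encodes a
value in more octets than necessary. The names decode_strict,
_sequence_length and _MIN_VALUE are illustrative only and are not part
of this specification.

```python
# Smallest value that legitimately requires a sequence of each length.
_MIN_VALUE = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}


def _sequence_length(first: int) -> int:
    """Return the length announced by a lead octet, or 0 if it is illegal."""
    if first < 0x80:
        return 1
    if first < 0xC0:
        return 0  # a continuation octet cannot start a sequence
    if first < 0xE0:
        return 2
    if first < 0xF0:
        return 3
    if first < 0xF8:
        return 4
    if first < 0xFC:
        return 5
    if first < 0xFE:
        return 6
    return 0  # FE and FF never appear in UTF-8


def decode_strict(data: bytes) -> list:
    """Decode UTF-8 octets to character values; raise ValueError if illegal."""
    out = []
    i = 0
    while i < len(data):
        first = data[i]
        n = _sequence_length(first)
        if n == 0:
            raise ValueError(f"illegal lead octet {first:02X} at offset {i}")
        if i + n > len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        # Payload bits of the lead octet: all 7 for one octet, else 7 - n.
        value = first if n == 1 else first & (0x7F >> n)
        for octet in data[i + 1 : i + n]:
            if octet & 0xC0 != 0x80:
                raise ValueError(f"bad continuation octet {octet:02X}")
            value = (value << 6) | (octet & 0x3F)
        # Reject overlong forms such as C0 80 (NUL) or C0 AE (".").
        if value < _MIN_VALUE[n]:
            raise ValueError(f"overlong sequence at offset {i}")
        out.append(value)
        i += n
    return out
```

With such a decoder, the octets C0 80 and 2F C0 AE 2E 2F both raise
ValueError instead of yielding NUL or "/../". Security checks can then
be applied to the decoded characters rather than to the raw octets.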
Acknowledgments
The following have participated in the drafting and discussion of
this memo:
   James E. Agenbroad     Andries Brouwer
   Martin J. Dürst        Ned Freed
   David Goldsmith        Edwin F. Hart
   Kent Karlsson          Markus Kuhn
   Michael Kung           Alain LaBonte
   John Gardiner Myers    Murray Sargent
   Keld Simonsen          Arnold Winkler
Bibliography