A "name" is an identifier such as a person's name, a hostname, a
domainname, a filename or an E-mail address; it is often treated as
an identifier rather than as a piece of text, and is often used in
protocols as an identifier for entities, without surrounding text.
3.1. What charset to use
All protocols MUST identify, for all character data, which charset is
in use.
RFC 2277 Charset Policy January 1998
Protocols MUST be able to use the UTF-8 charset, which consists of
the ISO 10646 coded character set combined with the UTF-8 character
encoding scheme, as defined in [10646] Annex R (published in
Amendment 2), for all text.
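
The two parts of the UTF-8 charset named above can be seen separately
in a short sketch. The sample string is an arbitrary choice made for
illustration; it is not taken from this document.

```python
# Illustrative only: separates the coded character set step from the
# character encoding scheme step for a short sample string.
text = "na\u00efve \u20ac"  # "naive" with a diaeresis, then a euro sign

# Coded character set: each character maps to an ISO 10646 code point.
code_points = [ord(ch) for ch in text]

# Character encoding scheme: the code points are serialized as octets.
octets = text.encode("utf-8")

# Decoding with the same charset recovers the original text.
round_trip = octets.decode("utf-8")
```

Characters outside ASCII occupy more than one octet in the encoded
form, which is why the octet count can exceed the character count.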
Protocols MAY specify, in addition, how to use other charsets or
other character encoding schemes for ISO 10646, such as UTF-16, but
lack of an ability to use UTF-8 is a violation of this policy; such a
violation would need a variance procedure ([BCP9] section 9) with
clear and solid justification in the protocol specification document
before being entered into or advanced upon the standards track.
For existing protocols or protocols that move data from existing
datastores, support of other charsets, or even using a default other
than UTF-8, may be a requirement. This is acceptable, but UTF-8
support MUST be possible.
Charsets other than UTF-8 that a protocol uses MUST be registered in
the IANA charset registry, if necessary by registering them when the
protocol is published.
(Note: ISO 10646 calls the UTF-8 CES a "Transformation Format" rather
than a "character encoding scheme", but it fits the charset workshop
report definition of a character encoding scheme).
3.2. How to decide a charset
When the protocol allows a choice of multiple charsets, someone must
make a decision on which charset to use.
In some cases, like HTTP, there is direct or semi-direct
communication between the producer and the consumer of data
containing text. In such cases, it may make sense to negotiate a
charset before sending data.
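
A minimal sketch of such a negotiation follows. The function name
choose_charset, its parameters, and the fallback rule are hypothetical
and chosen for illustration; this is not the algorithm of HTTP or of
any other specific protocol, which may also weigh preferences with
quality values.

```python
def choose_charset(acceptable, supported, fallback="utf-8"):
    """Return the first charset in `acceptable` that is also in `supported`.

    Names are compared case-insensitively and returned in lower case.
    If nothing matches, the fallback is returned; defaulting it to UTF-8
    is an assumption based on the requirement that UTF-8 be usable.
    """
    supported_lower = {name.lower() for name in supported}
    for name in acceptable:
        if name.lower() in supported_lower:
            return name.lower()
    return fallback


# The consumer lists charsets in order of preference; the producer
# answers with the first one it can generate.
chosen = choose_charset(["ISO-8859-1", "UTF-8"], ["utf-8", "iso-8859-1"])
```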
In other cases, like E-mail or stored data, there is no such
communication, and the best one can do is to make sure the charset is
clearly identified with the stored data, and to choose a charset that
is as widely known as possible.
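
One way to keep the charset identified with stored data is to store the
label next to the octets. The record layout and the helper names
store_text and load_text below are assumptions made for illustration,
not a format defined by this document.

```python
def store_text(text, charset="utf-8"):
    """Encode `text` and keep the charset label together with the octets."""
    return {"charset": charset, "data": text.encode(charset)}


def load_text(record):
    """Decode the stored octets using the label recorded with them."""
    return record["data"].decode(record["charset"])


# The same text stored under two different charsets yields different
# octets, so the label is what lets a later reader decode either one.
utf8_record = store_text("Gr\u00fc\u00dfe")
latin1_record = store_text("Gr\u00fc\u00dfe", charset="iso-8859-1")
```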
Note that a charset is an absolute; text that is encoded in a charset
cannot be rendered comprehensibly without supporting that charset.
(This also applies to English texts; charsets like EBCDIC do NOT have
ASCII as a proper subset.)
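
The point about EBCDIC can be checked with a short sketch. It assumes
the cp037 codec from Python's standard codecs as one representative
EBCDIC variant; other EBCDIC code pages differ in detail.

```python
word = "Hello"

# The same English word produces entirely different octets in the two
# charsets.
ascii_octets = word.encode("ascii")
ebcdic_octets = word.encode("cp037")  # cp037: one EBCDIC code page

# Decoding the EBCDIC octets under the wrong charset gives nonsense;
# ISO 8859-1 is used here only because it accepts every octet value.
misread = ebcdic_octets.decode("latin-1")

# Decoding under the charset that was actually used recovers the text.
recovered = ebcdic_octets.decode("cp037")
```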
Negotiating a charset may be regarded as an interim mechanism that is
to be supported until support for interchange of UTF-8 is prevalent;
however, the timeframe of "interim" may be at least 50 years, so
there is every reason to think of it as permanent in practice.
4. Languages
4.1. The need for language information
All human-readable text has a language.
Many operations, including high quality formatting, text-to-speech
synthesis, searching, hyphenation, spellchecking and so on, benefit
greatly from access to information about the language of a piece of
text [WC 3.1.1.4].
Humans have some tolerance for foreign languages, but are generally
very unhappy with being presented text in a language they do not
understand; this is why negotiation of language is needed.
In most cases, machines will not be able to deduce the language of a
transmitted text by themselves; the protocol must specify how to
transfer the language information if it is to be available at all.
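
As a rough sketch of carrying language information with text, each
string can travel with an explicit language label. The TaggedText type,
the pick_for_reader helper, and the short tag values are hypothetical
and chosen for illustration; this document does not define them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedText:
    text: str
    lang: str  # language label such as "en"; the tag format is assumed


def pick_for_reader(candidates, preferred_langs):
    """Return the first candidate in a language the reader prefers.

    Languages are tried in the reader's order of preference. None is
    returned when no candidate matches any of them.
    """
    for lang in preferred_langs:
        for item in candidates:
            if item.lang.lower() == lang.lower():
                return item
    return None


greetings = [TaggedText("Hello", "en"), TaggedText("Hei", "no")]
best = pick_for_reader(greetings, ["no", "en"])
```

Without the label, the receiver would have to guess the language from
the text alone, which the paragraph above notes is usually not
possible.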
The interaction between language and processing is complex; for
instance, if I compare "name-of-thing(lang=en)" to "name-of-