While there may be other text attributes intimately associated with
the language of the document, such as desired font or text direction,
these should be specified with other identifiers rather than
overloading the language tag.
3.2: On the wire
Three components of the model are required to completely specify
the content of a transmitted text stream (with the occasional
exception of the Language component, mentioned above).
These components are:
1) Coded Character Set,
2) Character Encoding Scheme, and
3) Transfer Encoding Syntax.
Each of these abstract components must be explicitly specified by the
transmitter when the data is sent. In some instances the
specification is implicit in the protocol or standard being used
(e.g., ANSI/NISO Z39.50). In MIME, the Coded Character Set and
Character Encoding Scheme are specified by the Charset parameter to
the Content-Type header field, and Transfer Encoding Syntax is
specified by the Content-Transfer-Encoding header field.
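
The MIME labelling described above can be illustrated with a minimal
sketch that uses Python's standard email package. The message below is
made up for illustration and is not taken from this report; it only
shows where a receiver would find the CCS/CES label (the charset
parameter) and the TES label (Content-Transfer-Encoding).

```python
import base64
from email import message_from_string

# Hypothetical sample text and message, invented for this illustration.
text = "sm\u00f6rg\u00e5sbord"
body = base64.b64encode(text.encode("iso-8859-1")).decode("ascii")

raw = (
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=ISO-8859-1\n"
    "Content-Transfer-Encoding: base64\n"
    "\n" + body + "\n"
)

msg = message_from_string(raw)

# CCS and CES are named together by the charset parameter.
charset = msg.get_content_charset()

# TES is named by the Content-Transfer-Encoding header field.
tes = msg["Content-Transfer-Encoding"]

# Undo the TES to get octets, then undo the CES/CCS to get characters.
octets = msg.get_payload(decode=True)
decoded = octets.decode(charset)

print(charset, tes, decoded)
```

The receiver never has to guess here: both labels travel in the
header fields of the message.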
3.2.1: Coded Character Set
A Coded Character Set (CCS) is a mapping from a set of abstract
characters to a set of integers. Examples of coded character sets
are ISO 10646 [ISO-10646], US-ASCII [ASCII], and the ISO-8859 series
[ISO-8859].
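
As a rough illustration of the character-to-integer mapping, the
sketch below uses Python's built-in ord(), which returns the ISO 10646
(Unicode) value of a character. The sample characters are arbitrary
choices made for this example.

```python
# Arbitrary sample characters: LATIN CAPITAL LETTER A,
# LATIN SMALL LETTER E WITH ACUTE, and EURO SIGN.
samples = ["A", "\u00e9", "\u20ac"]

# The CCS view: each abstract character maps to one integer.
code_points = {ch: ord(ch) for ch in samples}

for ch, cp in code_points.items():
    print(f"{ch!r} -> {cp} (U+{cp:04X})")

# ISO-8859-1 is a single-octet set, so the octet produced for a
# character is simply its coded value. For e-acute it matches the
# ISO 10646 value.
latin1_value = "\u00e9".encode("iso-8859-1")[0]
```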
3.2.2: Character Encoding Scheme
A Character Encoding Scheme (CES) is a mapping from a Coded Character
Set or several coded character sets to a set of octets. Examples of
Character Encoding Schemes are ISO 2022 [ISO-2022] and UTF-8 [UTF-8].
A given CES is typically associated with a single CCS; for example,
UTF-8 applies only to ISO 10646.
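
The difference between the coded values and the octets produced by a
CES can be sketched as follows, using Python's built-in UTF-8 codec.
The sample string is an arbitrary choice for illustration.

```python
text = "A\u00e9\u20ac"  # arbitrary sample: A, e-acute, euro sign

# CCS step: abstract characters to integers (ISO 10646 values).
code_points = [ord(ch) for ch in text]

# CES step: those integers to a sequence of octets (UTF-8).
octets = text.encode("utf-8")

# One character may need one, two, or three octets in this sample.
per_char = {ch: ch.encode("utf-8") for ch in text}

print([hex(cp) for cp in code_points])
print(octets.hex(" "))
```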
RFC 2130 Character Set Workshop Report April 1997
3.2.3: Transfer Encoding Syntax
It is frequently necessary to transform encoded text into a format
which is transmissible by specific protocols. The Transfer Encoding
Syntax (TES) is a transformation applied to character data encoded
using a CCS and possibly a CES to allow it to be transmitted.
Examples of Transfer Encoding Syntaxes are Base64 Encoding [Base64],
gzip encoding, and so forth.
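
A minimal sketch of the TES layer, assuming Python's standard base64
and gzip modules as stand-ins for the two encodings named above. The
sample text and the choice of UTF-8 as the CES are assumptions made
for this example only.

```python
import base64
import gzip

text = "sm\u00f6rg\u00e5sbord"   # arbitrary sample text

# CCS + CES: characters to octets (UTF-8 chosen for illustration).
octets = text.encode("utf-8")

# TES, option 1: Base64 yields octets that are safe for 7-bit
# transports.
b64 = base64.b64encode(octets)

# TES, option 2: gzip compresses the octets for transfer.
gz = gzip.compress(octets)

# The receiver reverses the layers in the opposite order:
# first the TES, then the CES/CCS.
from_b64 = base64.b64decode(b64).decode("utf-8")
from_gz = gzip.decompress(gz).decode("utf-8")
```

Note that the TES operates on octets only; it carries no knowledge of
which characters those octets represent.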
3.3: Determining which values of CCS, CES, and TES are used
To completely specify which CCS, CES, and TES are used in a specific
text transmission, there needs to be a consistent set of labels for
each. Once the appropriate mechanisms have been selected, there are
six techniques for attaching these labels to the data.
The labels themselves are named and registered, either with IANA
[IANA] or with some other registry. Ideally, their definitions are
retrievable from some registration authority.
Labels may be determined in one of the following ways:
- Determined by guessing, where the receiver of the text has to
guess the values of the CCS, CES, and TES. For example: "I got
this from Sweden so it's probably ISO-8859-1." This is
obviously not a reliable way to decode text.
- Determined by the standard, where the protocol used to transmit
the data has made documented choices of CCS, CES, and TES in the
standard. Thus, the encodings used are known through the
access protocol; for example, HTTP [HTTP] uses (but is not
limited to) ISO-8859-1, and SMTP uses US-ASCII.
- Attached to the transfer envelope, where the descriptive labels are
attached to the wrapper placed around the text for transport.
MIME headers are a good example of this technique.
- Included in the data stream, where the data stream itself has
been encoded in such a way as to signal the character set used.
For example, ISO-2022 encodes the data with escape sequences to
provide information on the character subset currently being used.
- Agreed by prior bilateral agreement, where some out-of-band
negotiation has allowed the text transmitter and receiver to
determine the CCS, CES, and TES for the transmitted text.
- Agreed to by negotiation during some phase of the protocol,
typically initialization.
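
The "included in the data stream" technique from the list above can
be sketched with Python's iso2022_jp codec. ISO-2022-JP is one
profile of ISO 2022 and is used here as an assumption for
illustration; the sample text is arbitrary.

```python
# Arbitrary sample: ASCII, then three kanji, then ASCII again.
text = "abc \u65e5\u672c\u8a9e abc"

stream = text.encode("iso2022_jp")

ESC = b"\x1b"

# In this profile each designation is ESC plus two more octets.
# Collect them in the order they appear in the stream.
escapes = [
    stream[i:i + 3]
    for i in range(len(stream))
    if stream[i:i + 1] == ESC
]

# The first escape switches to a two-octet Japanese set; the second
# switches back to ASCII.
print(escapes)

# A receiver that understands the escapes recovers the text.
round_trip = stream.decode("iso2022_jp")
```

The labels here travel inside the octet stream itself, so no header
field or prior agreement is needed to know which character subset is
active at any point.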