Both 'encoding' and 'charset' names are case-independent. Thus the
charset name "ISO-8859-1" is equivalent to "iso-8859-1", and the
encoding named "Q" may be spelled either "Q" or "q".
An 'encoded-word' may not be more than 75 characters long, including
'charset', 'encoding', 'encoded-text', and delimiters. If it is
desirable to encode more text than will fit in an 'encoded-word' of
75 characters, multiple 'encoded-word's (separated by CRLF SPACE) may
be used.
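As an illustration of the 75-character limit, the following Python sketch splits a longer text into several 'encoded-word's joined by CRLF SPACE. It rests on assumptions not stated in this passage: that the "B" encoding is base64 (it is defined later in RFC 2047), and that the charset is a single-byte one such as ISO-8859-1, so a chunk boundary never falls inside a character. The name encode_b_words is hypothetical, and the sketch does not account for the header field name on the first line.

```python
import base64

MAX_ENCODED_WORD = 75  # limit stated above, delimiters included


def encode_b_words(text, charset="iso-8859-1"):
    """Split text into B-encoded 'encoded-word's of at most 75 characters.

    Assumes a single-byte charset; multi-byte charsets would need
    character-aware chunking.
    """
    raw = text.encode(charset)
    prefix = f"=?{charset}?B?"
    suffix = "?="
    room = MAX_ENCODED_WORD - len(prefix) - len(suffix)
    chunk = (room // 4) * 3  # base64 maps every 3 bytes to 4 characters
    if chunk <= 0:
        raise ValueError("charset name leaves no room for encoded-text")
    words = []
    for start in range(0, len(raw), chunk):
        payload = base64.b64encode(raw[start:start + chunk]).decode("ascii")
        words.append(prefix + payload + suffix)
    # Adjacent 'encoded-word's are separated by CRLF SPACE.
    return "\r\n ".join(words)


folded = encode_b_words("x" * 100)
for word in folded.split("\r\n "):
    print(len(word), word)
```

With the default charset each word carries at most 42 bytes of text, so every 'encoded-word' stays within the limit.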
While there is no limit to the length of a multiple-line header
field, each line of a header field that contains one or more
'encoded-word's is limited to 76 characters.
The length restrictions are included both to ease interoperability
through internetwork mail gateways, and to impose a limit on the
amount of lookahead a header parser must employ (while looking for a
final ?= delimiter) before it can decide whether a token is an
"encoded-word" or something else.
RFC 2047 Message Header Extensions November 1996
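The bounded-lookahead argument can be sketched in Python as follows. This is a simplified recognizer, not the full grammar: the regular expression accepts any printable ASCII other than "?" in each portion, which is looser than the 'token' rule for 'charset' and 'encoding'. The name parse_encoded_word is hypothetical. Case is normalized because both names are case-independent.

```python
import re

MAX_ENCODED_WORD = 75

# Printable ASCII except "?" and SPACE: "!" through ">" and "@" through "~".
_PART = r"[!->@-~]+"
_ENCODED_WORD = re.compile(rf"=\?({_PART})\?({_PART})\?({_PART})\?=")


def parse_encoded_word(token):
    """Return (charset, encoding, encoded_text) or None.

    A token longer than 75 characters is rejected without scanning for
    the closing "?=" delimiter, which bounds the lookahead.
    """
    if len(token) > MAX_ENCODED_WORD:
        return None
    match = _ENCODED_WORD.fullmatch(token)
    if match is None:
        return None
    charset, encoding, encoded_text = match.groups()
    # 'charset' and 'encoding' names are case-independent.
    return charset.lower(), encoding.upper(), encoded_text


print(parse_encoded_word("=?ISO-8859-1?q?this=20is=20some=20text?="))
print(parse_encoded_word("=?iso-8859-1?q?" + "a" * 70 + "?="))
```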
IMPORTANT: 'encoded-word's are designed to be recognized as 'atom's
by an RFC 822 parser. As a consequence, unencoded white space
characters (such as SPACE and HTAB) are FORBIDDEN within an
'encoded-word'. For example, the character sequence

     =?iso-8859-1?q?this is some text?=

would be parsed as four 'atom's, rather than as a single 'atom' (by
an RFC 822 parser) or 'encoded-word' (by a parser which understands
'encoded-word's). The correct way to encode the string "this is some
text" is to encode the SPACE characters as well, e.g.

     =?iso-8859-1?q?this=20is=20some=20text?=

The characters which may appear in 'encoded-text' are further
restricted by the rules in section 5.
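The effect of unencoded white space can be seen in a short Python sketch. Splitting on white space is only a stand-in for an RFC 822 lexer, and the helper escape_spaces (a hypothetical name) handles SPACE alone; a complete "Q" encoder must escape other characters as well.

```python
def escape_spaces(text):
    """Replace each SPACE with =20 (hex 20 is the ASCII code for SPACE)."""
    return text.replace(" ", "=20")


def wrap_q(text, charset="iso-8859-1"):
    """Wrap already-escaped text in 'encoded-word' delimiters."""
    return f"=?{charset}?q?{text}?="


bad = wrap_q("this is some text")
good = wrap_q(escape_spaces("this is some text"))

# Whitespace splitting approximates how a parser separates 'atom's.
print(bad.split())   # four pieces
print(good.split())  # one piece
```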

3. Character sets

The 'charset' portion of an 'encoded-word' specifies the character
set associated with the unencoded text. A 'charset' can be any of
the character set names allowed in a MIME "charset" parameter of a
"text/plain" body part, or any character set name registered with
IANA for use with the MIME text/plain content-type.
Some character sets use code-switching techniques to switch between
"ASCII mode" and other modes. If unencoded text in an 'encoded-word'
contains a sequence which causes the charset interpreter to switch
out of ASCII mode, it MUST contain additional control codes such that
ASCII mode is again selected at the end of the 'encoded-word'. (This
rule applies separately to each 'encoded-word', including adjacent
'encoded-word's within a single header field.)
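To make the code-switching rule concrete, the sketch below uses ISO-2022-JP, which is chosen here purely as an example of such a character set and is not named in the text above. It assumes that charset's three-byte escape sequences, with ESC ( B selecting ASCII; the function name ends_in_ascii_mode is hypothetical, and the check is not valid for other charsets.

```python
ESC = b"\x1b"
ESC_TO_ASCII = b"\x1b(B"  # ISO-2022-JP escape sequence that selects ASCII


def ends_in_ascii_mode(raw):
    """Report whether ISO-2022-JP bytes leave the interpreter in ASCII mode.

    The mode at the end is determined by the last escape sequence;
    text with no escape sequence never left ASCII mode.
    """
    last = raw.rfind(ESC)
    if last == -1:
        return True
    return raw[last:last + 3] == ESC_TO_ASCII


# Unencoded text for one 'encoded-word' (three Japanese characters).
raw = "\u65e5\u672c\u8a9e".encode("iso2022_jp")
print(raw)
print(ends_in_ascii_mode(raw))       # Python's codec appends the reset
print(ends_in_ascii_mode(raw[:-3]))  # reset stripped: rule violated
```

A sender that splits such text across several 'encoded-word's would need each piece to pass this check on its own.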
When there is a possibility of using more than one character set to
represent the text in an 'encoded-word', and in the absence of
private agreements between sender and recipients of a message, it is
recommended that members of the ISO-8859-* series be used in
preference to other character sets.

4. Encodings

Initially, the legal values for "encoding" are "Q" and "B". These
encodings are described below. The "Q" encoding is recommended for
use when most of the characters to be encoded are in the ASCII
character set; otherwise, the "B" encoding should be used.
Nevertheless, a mail reader which claims to recognize 'encoded-word's
MUST be able to accept either encoding for any character set which it
supports.
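A reader that accepts either encoding can be sketched with Python's standard email.header.decode_header, which handles both "Q" and "B". The "B" word is built with base64 on the assumption that "B" is base64, as defined later in RFC 2047; decode_word is a hypothetical helper that expects a string holding exactly one 'encoded-word'.

```python
import base64
from email.header import decode_header

text = "this is some text"

# The same text in both encodings.
q_word = "=?iso-8859-1?q?this=20is=20some=20text?="
b_payload = base64.b64encode(text.encode("iso-8859-1")).decode("ascii")
b_word = f"=?iso-8859-1?B?{b_payload}?="


def decode_word(word):
    """Decode a string consisting of a single 'encoded-word'."""
    (raw, charset), = decode_header(word)
    return raw.decode(charset)


print(decode_word(q_word))
print(decode_word(b_word))
```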
Only a subset of the printable ASCII characters may be used in
'encoded-text'. Space and tab characters are not allowed, so that
the beginning and end of an 'encoded-word' are obvious. The "?"
character is used within an 'encoded-word' to separate the various
portions of the 'encoded-word' from one another, and thus cannot
appear in the 'encoded-text' portion. Other characters are also
illegal in certain contexts. For example, an 'encoded-word' in a
'phrase' preceding an address in a From header field may not contain
any of the "specials" defined in RFC 822. Finally, certain other
characters are disallowed in some contexts, to ensure reliability for
messages that pass through internetwork mail gateways.
The "B" encoding automatically meets these requirements. The "Q"
encoding allows a wide range of printable characters to be used in