Network Working Group                                       D. Goldsmith
Request for Comments: 2152                          Apple Computer, Inc.
Obsoletes: RFC 1642                                             M. Davis
Category: Informational                                   Taligent, Inc.
                                                                May 1997
UTF-7
A Mail-Safe Transformation Format of Unicode
Status of this Memo
This memo provides information for the Internet community. This memo
does not specify an Internet standard of any kind. Distribution of
this memo is unlimited.
Abstract
The Unicode Standard, version 2.0, and ISO/IEC 10646-1:1993(E) (as
amended) jointly define a character set (hereafter referred to as
Unicode) which encompasses most of the world's writing systems.
However, Internet mail (STD 11, RFC 822) currently supports only 7-
bit US ASCII as a character set. MIME (RFC 2045 through 2049) extends
Internet mail to support different media types and character sets,
and thus could support Unicode in mail messages. MIME neither defines
Unicode as a permitted character set nor specifies how it would be
encoded, although it does provide for the registration of additional
character sets over time.
This document describes a transformation format of Unicode that
contains only 7-bit ASCII octets and is intended to be readable by
humans in the limiting case that the document consists of characters
from the US-ASCII repertoire. It also specifies how this
transformation format is used in the context of MIME and RFC 1641,
"Using Unicode with MIME".
Motivation
Although other transformation formats of Unicode exist and could
conceivably be used in this context (most notably UTF-8, also known
as UTF-2 or UTF-FSS), they suffer the disadvantage that they use
octets in the range decimal 128 through 255 to encode Unicode
characters outside the US-ASCII range. Thus, in the context of mail,
those octets must themselves be encoded. This requires putting text
through two successive encoding processes, and leads to a significant
expansion of characters outside the US-ASCII range, putting non-
English speakers at a disadvantage. For example, using UTF-8 together
with the Quoted-Printable content transfer encoding of MIME
represents US-ASCII characters in one octet, but other characters may
require up to nine octets.
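The nine-octet figure can be checked with a short calculation. The following is a minimal sketch, assuming Python's standard "quopri" module as the Quoted-Printable encoder; the helper name qp_utf8_length is hypothetical and is not part of this specification. A character in the range U+0800 through U+FFFF takes three octets in UTF-8, and each of those octets becomes a three-octet "=XX" escape.

```python
import quopri


def qp_utf8_length(ch: str) -> int:
    """Return the octet count of `ch` as UTF-8 wrapped in Quoted-Printable.

    Hypothetical helper for illustration only.
    """
    utf8_octets = ch.encode("utf-8")
    # Every octet above 127 is rewritten as a three-octet "=XX" escape.
    return len(quopri.encodestring(utf8_octets))


# US-ASCII letter: one UTF-8 octet, passed through unchanged.
ascii_cost = qp_utf8_length("A")
# U+00E9 (two UTF-8 octets) becomes two "=XX" escapes.
latin_cost = qp_utf8_length("\u00e9")
# U+4E2D (three UTF-8 octets) becomes three "=XX" escapes.
cjk_cost = qp_utf8_length("\u4e2d")

print(ascii_cost, latin_cost, cjk_cost)
```

Under these assumptions the three costs are 1, 6, and 9 octets respectively, which is the expansion the paragraph above describes.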
Overview
UTF-7 encodes Unicode characters as US-ASCII octets, together with
shift sequences to encode characters outside that range. For this
purpose, one of the characters in the US-ASCII repertoire is reserved
for use as a shift character.
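As an illustration of the behaviour described above, the following sketch uses the "utf-7" codec that ships with Python. It is one existing implementation, not a normative reference, and the helper names to_utf7 and from_utf7 are hypothetical. The sample string "Hi Mom -+Jjo--!" is the encoded form of "Hi Mom -<WHITE SMILING FACE>-!" (U+263A), where "+" opens the shifted sequence and "-" closes it.

```python
def to_utf7(text: str) -> bytes:
    """Encode text with Python's built-in UTF-7 codec (illustrative helper)."""
    return text.encode("utf-7")


def from_utf7(data: bytes) -> str:
    """Decode UTF-7 octets with Python's built-in codec (illustrative helper)."""
    return data.decode("utf-7")


# Text drawn only from US-ASCII letters and spaces stays human-readable.
plain = to_utf7("Hello world")

# A character outside US-ASCII is carried inside a shifted sequence,
# so the result still contains only 7-bit octets.
mixed = to_utf7("Hi Mom -\u263a-!")
is_seven_bit = all(octet < 0x80 for octet in mixed)

# Decoding the shifted form recovers U+263A.
decoded = from_utf7(b"Hi Mom -+Jjo--!")

print(plain, mixed, is_seven_bit, decoded == "Hi Mom -\u263a-!")
```

The round trip through to_utf7 and from_utf7 returns the original text, and every octet of the encoded form is below 128, which is the property that makes the format safe for 7-bit transports.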
Many mail gateways and systems cannot handle the entire US-ASCII
character set (those based on EBCDIC, for example), and so UTF-7
contains provisions for encoding characters within US-ASCII in a way
that all mail systems can accommodate.
UTF-7 should normally be used only in the context of 7-bit
transports, such as mail. In other contexts, straight Unicode or
UTF-8 is preferred.
See RFC 1641, "Using Unicode with MIME", for the overall specification
on usage of Unicode transformation formats with MIME.
Definitions
First, the definition of Unicode:
The 16-bit character set Unicode is defined by "The Unicode
Standard, Version 2.0". This character set is identical with the
character repertoire and coding of the international standard
ISO/IEC 10646-1:1993(E); Coded Representation Form=UCS-2;
Subset=300; Implementation Level=3, including the first 7
amendments to 10646 plus editorial corrections.
Note. Unicode 2.0 further specifies the use and interaction of
these character codes beyond the ISO standard. However, any valid
10646 sequence is a valid Unicode sequence, and vice versa;
Unicode supplies interpretations of sequences on which the ISO