from S. There are different principles for how this inevitable
difference should be handled. A choice between them should be made,
depending on the purpose and requirements of the conversion. Where
possible, the client application should be given mechanisms to
determine what has been done to the text.
3.5.2.1: Length-modifying conversion for human display
When the length of the target text T is allowed to differ from the
length of the source text S, one should use a conversion method in
which each source character is converted to one or several target
characters, using a best-resemblance criterion in the choice of the
target character(s).
Examples:
LATIN CAPITAL LETTER [*] -> AE
COPYRIGHT SIGN [*] -> (c)
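
A length-modifying conversion of this kind can be sketched as a
one-to-many lookup table. The following Python fragment is only an
illustration: the table EXPANSIONS, the function convert_expanding,
the choice of U+00C6 and U+00A9 as sample characters, and the "?"
fallback are all assumptions made for the sketch, not part of the
workshop's recommendations.

```python
# Hypothetical one-to-many fallback table (illustrative entries only).
EXPANSIONS = {
    "\u00c6": "AE",   # LATIN CAPITAL LETTER AE
    "\u00a9": "(c)",  # COPYRIGHT SIGN
}

# Stand-in for the target repertoire Rep(CCS(T)): plain ASCII.
ASCII = {chr(i) for i in range(128)}


def convert_expanding(source, target_repertoire, unknown="?"):
    """Convert each source character to one or several target characters.

    Characters already in the target repertoire pass through unchanged;
    others are replaced by their best-resemblance expansion, or by
    `unknown` when the table has no entry.
    """
    out = []
    for ch in source:
        if ch in target_repertoire:
            out.append(ch)
        else:
            out.append(EXPANSIONS.get(ch, unknown))
    return "".join(out)


converted = convert_expanding("\u00c6on \u00a9", ASCII)
print(converted, len(converted))
```

The result may be longer than the input, and the conversion is not
reversible: "AE" in the output could have been "AE" in the source.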
3.5.2.2: Length-preserving conversion for human display
Where the text T must be presented and the length of T cannot differ
from the length of S, one should use a conversion method where each
source character is converted to one target character, using some
kind of best-resemblance criterion in the choice of target character.
RFC 2130 Character Set Workshop Report April 1997
Examples:
LATIN CAPITAL LETTER [*] -> A
COPYRIGHT SIGN [*] -> C
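
A length-preserving conversion differs only in that every table entry
is exactly one character. The sketch below assumes that "length"
means the number of characters rather than octets; the names
SUBSTITUTES and convert_same_length, the sample characters, and the
"?" filler are hypothetical.

```python
# Hypothetical one-to-one fallback table (illustrative entries only).
SUBSTITUTES = {
    "\u00c6": "A",  # LATIN CAPITAL LETTER AE
    "\u00a9": "C",  # COPYRIGHT SIGN
}

# Stand-in for the target repertoire Rep(CCS(T)): plain ASCII.
ASCII = {chr(i) for i in range(128)}


def convert_same_length(source, target_repertoire, filler="?"):
    """Convert each source character to exactly one target character."""
    out = []
    for ch in source:
        if ch in target_repertoire:
            out.append(ch)
        else:
            # Every replacement must be a single character.
            replacement = SUBSTITUTES.get(ch, filler)
            if len(replacement) != 1:
                raise ValueError("replacement must be one character")
            out.append(replacement)
    return "".join(out)


converted = convert_same_length("\u00c6on \u00a9", ASCII)
print(converted, len(converted))
```

Fixed-width fields and column layouts survive this conversion, at
the cost of a cruder resemblance than the length-modifying method.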
3.5.2.3: Conversion without data loss
Where the conversion of the text S into T must be completely
reversible, apply a Character Encoding Syntax or other reversible
transformation method. This case is most frequently met in data
storage requirements.
Examples:
LATIN CAPITAL LETTER [*] -> &AE
COPYRIGHT SIGN [*] -> &(C
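
One way to obtain a reversible conversion is an escape syntax. The
sketch below does not reproduce the mnemonic notation of the examples
above; it assumes a simpler, hypothetical scheme in which "&" is
followed by six hexadecimal digits giving the character's code
position. The names ESC, encode and decode are illustrative, and the
scheme assumes that "&" and the hexadecimal digits are in the target
repertoire.

```python
ESC = "&"  # Assumed escape character; must itself be a member of T.

# Stand-in for the target repertoire Rep(CCS(T)): plain ASCII.
ASCII = {chr(i) for i in range(128)}


def encode(source, target_repertoire):
    """Escape every character outside T, and ESC itself, as &XXXXXX."""
    out = []
    for ch in source:
        if ch == ESC or ch not in target_repertoire:
            out.append("%s%06X" % (ESC, ord(ch)))
        else:
            out.append(ch)
    return "".join(out)


def decode(text):
    """Invert encode(): expand each &XXXXXX back to its character."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == ESC:
            out.append(chr(int(text[i + 1:i + 7], 16)))
            i += 7  # Skip the escape character and six hex digits.
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


original = "\u00c6 & \u00a9"
stored = encode(original, ASCII)
print(stored)
print(decode(stored) == original)
```

Because a literal "&" in the source is escaped as well, the escape
character is no longer available as ordinary data in T.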
An alternate method can be used if the size of Rep(CCS(T)) is greater
than or equal to the size of Rep(CCS(S)): for each character in
Rep(CCS(S)) which is not present in Rep(CCS(T)), define a mapping
into a character in Rep(CCS(T)) which is not present in Rep(CCS(S)).
Examples:
LATIN CAPITAL LETTER [*] -> CYRILLIC CAPITAL LETTER [*]
COPYRIGHT SIGN [*] -> PARTIAL DIFFERENTIAL SIGN [*]
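
The alternate method amounts to pairing each character found only in
the source repertoire with a distinct character found only in the
target repertoire. If T has at least as many characters as S, then T
has at least as many "spare" characters as S has "missing" ones, so
such a pairing always exists. The sketch below uses toy repertoires;
the names build_swap_mapping and apply_mapping, and the pairing by
sorted order, are assumptions of the sketch.

```python
def build_swap_mapping(source_rep, target_rep):
    """Pair characters only in S with distinct characters only in T."""
    missing = sorted(source_rep - target_rep)  # In S but not in T.
    spare = sorted(target_rep - source_rep)    # In T but not in S.
    if len(spare) < len(missing):
        raise ValueError("target repertoire has too few spare characters")
    forward = dict(zip(missing, spare))
    backward = {t: s for s, t in forward.items()}
    return forward, backward


def apply_mapping(text, mapping):
    """Replace mapped characters; leave all others unchanged."""
    return "".join(mapping.get(ch, ch) for ch in text)


# Toy repertoires, chosen only for illustration.
S = {"A", "B", " ", "\u00c6", "\u00a9"}
T = {"A", "B", " ", "\u0410", "\u2202"}

forward, backward = build_swap_mapping(S, T)
stored = apply_mapping("\u00c6 AB \u00a9", forward)
print(apply_mapping(stored, backward) == "\u00c6 AB \u00a9")
```

The round trip works only because a genuine source text never
contains the spare characters; the converted text is not meaningful
when displayed directly in T.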
Note that conversion without data loss requires redefining some
member of T to indicate "the introduction of character data outside
T". This effectively adds another level of CES on top of CES(T).
4: Presentation issues
There are a number of considerations to make in selecting the base
character set. One such consideration is the protocol's convenience
to users with limited equipment (for example only ISO 8859-1 or a
keyboard without the ability to enter all the characters in ISO
10646). Alternative representations should be considered for these
users, both for input and output. Possible options for the
representation of characters that cannot be displayed include
transliteration (a la CEN/TC304 or ISO TC46/SC2), RFC 1345
[RFC-1345] representative icons, or the WG2 short name (u+xxxx).
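
As an illustration of the last option, the sketch below replaces each
character that a device cannot display with a short name of the form
u+xxxx. The function name render_with_short_names, the use of four
lowercase hexadecimal digits, and the absence of any delimiter around
the short name are assumptions; the text above does not fix these
details.

```python
# Stand-in for the set of characters the device can display: ASCII.
DISPLAYABLE = {chr(i) for i in range(128)}


def render_with_short_names(text, displayable):
    """Show undisplayable characters as u+xxxx (assumed format)."""
    out = []
    for ch in text:
        if ch in displayable:
            out.append(ch)
        else:
            # At least four hex digits; more for code positions
            # above U+FFFF.
            out.append("u+%04x" % ord(ch))
    return "".join(out)


print(render_with_short_names("\u00a9 1997", DISPLAYABLE))
```

Without a delimiter the output can be ambiguous when hexadecimal
digits follow a short name, so a real protocol would need to define
one.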
5: Open issues
In addition to the issues declared out of scope and enumerated in
section 2.1, the following issues are still open and will need to be
addressed in other forums. These issues (language tags, public
identifiers such as URL names, and bi-directionality) are briefly
discussed below, as they repeatedly encroached on the discussion.
5.1: Language tags
Although the workshop decided not to explicitly address the so-called
"CJK issue", a few members felt it was necessary to have some
mechanism to address the problem of correct Han character display in
the ISO-10646 issue, and that saying that it was a "font issue" would
not suffice.
The "CJK issue" refers to the extended discussion about "Han
unification", the use of a single ISO-10646 codepoint to represent