.\" $XdotOrg: xc/doc/specs/CTEXT/ctext.tbl.ms,v 1.2 2004/04/23 18:42:15 eich Exp $
.\" Use tbl and -ms
.sp 8
.ce 5
\s+2\fBCompound Text Encoding\fP\s-2
.sp 6p
Version 1.1
X Consortium Standard
X Version 11, Release 6.8
Robert W. Scheifler
.sp 2
.LP
Copyright \(co 1989 by X Consortium
.LP
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the ``Software''), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
.LP
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
.LP
THE SOFTWARE IS PROVIDED ``AS IS'', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
X CONSORTIUM BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN
AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
.LP
Except as contained in this notice, the name of the X Consortium shall not be
used in advertising or otherwise to promote the sale, use or other dealings
in this Software without prior written authorization from the X Consortium.
.sp 2
.NH 1
Overview
.LP
Compound Text is a format for multiple character set data, such as
multi-lingual text. The format is based on ISO
standards for encoding and combining character sets. Compound Text is intended
to be used in three main contexts: inter-client communication using selections,
as defined in the \fIInter-Client Communication Conventions Manual\fP (ICCCM);
window properties (e.g., window manager hints as defined in the ICCCM);
and resources (e.g., as defined in Xlib and the Xt Intrinsics).
.LP
Compound Text is intended as an external representation, or interchange format,
not as an internal representation. It is expected (but not required) that
clients will convert Compound Text to some internal representation for
processing and rendering, and convert from that internal representation to
Compound Text when providing textual data to another client.
.NH 1
Values
.LP
The name of this encoding is ``COMPOUND_TEXT''. When text values are used in
the ICCCM-compliant selection mechanism or are stored as window properties in
the server, the type used should be the atom for ``COMPOUND_TEXT''.
.LP
Octet values are represented in this document as two decimal numbers in the
form col/row. This means the value (col * 16) + row. For example, 02/01 means
the value 33.
.LP
For our purposes, the octet encoding space is divided into four ranges:
.RS
.TS
l l.
C0	octets from 00/00 to 01/15
GL	octets from 02/00 to 07/15
C1	octets from 08/00 to 09/15
GR	octets from 10/00 to 15/15
.TE
.RE
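.LP
As an illustration only (not part of the standard), the col/row notation and
the four ranges above can be sketched in Python. The function names
``octet'' and ``octet_range'' are hypothetical helpers introduced for this
example.

```python
def octet(notation: str) -> int:
    """Convert col/row notation such as "02/01" to an integer octet value."""
    col, row = (int(part) for part in notation.split("/"))
    if not (0 <= col <= 15 and 0 <= row <= 15):
        raise ValueError(f"col and row must each be in 0..15: {notation!r}")
    return col * 16 + row


def octet_range(value: int) -> str:
    """Name the range (C0, GL, C1, or GR) that an octet value falls in."""
    if not 0 <= value <= 255:
        raise ValueError(f"not an octet: {value}")
    column = value >> 4  # the "col" part of col/row
    if column <= 1:
        return "C0"      # 00/00 to 01/15
    if column <= 7:
        return "GL"      # 02/00 to 07/15
    if column <= 9:
        return "C1"      # 08/00 to 09/15
    return "GR"          # 10/00 to 15/15


# 02/01 is the value 33, which lies in GL.
print(octet("02/01"), octet_range(octet("02/01")))
```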
.LP
C0 and C1 are ``control character'' sets, while GL and GR are ``graphic
character'' sets. Only a subset of C0 and C1 octets are used in the encoding,
and depending on the character set encoding defined as GL or GR, a subset of
GL and GR octets may be used; see below for details. All octets (00/00 to
15/15) may appear inside the text of extended segments (defined below).
.LP
[For those familiar with ISO 2022, we will use only an 8-bit environment, and
we will always use G0 for GL and G1 for GR.]
.NH 1
Control Characters
.LP
In C0, only the following values will be used:
.RS
.TS
l l l.
00/09	HT	HORIZONTAL TABULATION
00/10	NL	NEW LINE
01/11	ESC	(ESCAPE)
.TE
.RE
.LP
In C1, only the following value will be used:
.RS
.TS
l l l.
09/11	CSI	CONTROL SEQUENCE INTRODUCER
.TE
.RE
.LP
[The alternate 7-bit CSI encoding 01/11 05/11 is not used in Compound Text.]
.LP
No control sequences are defined in Compound Text for changing the C0 and C1
sets.
.LP
A horizontal tab can be represented with the octet 00/09. Specification of
tabulation width settings is not part of Compound Text and must be obtained
from context (in an unspecified manner).
.LP
[Inclusion of horizontal tab is for consistency with the STRING type currently
defined in the ICCCM.]
.LP
A newline (line separator/terminator) can be represented with the octet 00/10.
.LP
[Note that 00/10 is normally LINEFEED, but is being interpreted as NEWLINE.
This can be thought of as using the (deprecated) NEW LINE mode, E.1.3, in ISO
6429. Use of this value instead of 08/05 (NEL, NEXT LINE) is for consistency
with the STRING type currently defined in the ICCCM.]
.LP
The remaining C0 and C1 values (01/11 and 09/11) are only used in the control
sequences defined below.
.NH 1
Standard Character Set Encodings
.LP
The default GL and GR sets in Compound Text correspond to the left and right
halves of ISO 8859-1 (Latin 1). As such, any legal instance of a STRING type
(as defined in the ICCCM) is also a legal instance of type COMPOUND_TEXT.
.LP
.nf
[The implied initial state in ISO 2022 is defined with the sequence:
01/11 02/00 04/03	G0 and G1 in an 8-bit environment only. Designation also invokes.
01/11 02/00 04/07	In an 8-bit environment, C1 represented as 8-bits.
01/11 02/00 04/09	Graphic character sets can be 94 or 96.
01/11 02/00 04/11	8-bit code is used.
01/11 02/08 04/02	Designate ASCII into G0.
01/11 02/13 04/01	Designate right-hand part of ISO Latin-1 into G1.
]
.fi
.LP
To define one of the approved standard character set encodings to be
the GL set, one of the following control sequences is used:
.RS
.TS
l l.
01/11 02/08 {I} F	94 character set
01/11 02/04 02/08 {I} F	94\u\s-2N\s+2\d character set
.TE
.RE
.LP
To define one of the approved standard character set encodings to be
the GR set, one of the following control sequences is used:
.RS
.TS
l l.
01/11 02/09 {I} F	94 character set
01/11 02/13 {I} F	96 character set
01/11 02/04 02/09 {I} F	94\u\s-2N\s+2\d character set
.TE
.RE
.LP
The ``F'' in the control sequences above stands for ``Final character'', which
is always in the range 04/00 to 07/14. The ``{I}'' stands for zero or more
``intermediate characters'', which are always in the range 02/00 to 02/15, with
the first intermediate character always in the range 02/01 to 02/03. The
registration authority has defined an ``{I} F'' sequence for each registered
character set encoding.
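.LP
The designation sequences above can be recognized mechanically. The following
Python sketch is an illustration, not part of the standard; the function name
``parse_designation'' and its return shape are assumptions made for this
example. It takes exactly one complete sequence and checks the ranges given
above for the Final and intermediate characters.

```python
ESC = 0x1B  # 01/11

# Octet after ESC (or after ESC 02/04) -> (target set, size of set).
_KINDS = {
    0x28: ("GL", "94"),  # 02/08
    0x29: ("GR", "94"),  # 02/09
    0x2D: ("GR", "96"),  # 02/13
}


def parse_designation(seq: bytes):
    """Parse one designation sequence into (target, size, intermediates, final)."""
    if len(seq) < 3 or seq[0] != ESC:
        raise ValueError("not a designation sequence")
    pos = 1
    multi = seq[pos] == 0x24  # 02/04 introduces a 94^N set
    if multi:
        pos += 1
    kind = _KINDS.get(seq[pos])
    if kind is None or (multi and kind[1] == "96"):
        raise ValueError("not a Compound Text designation")
    target, size = kind
    if multi:
        size = "94^N"
    intermediates, final = seq[pos + 1:-1], seq[-1]
    if not 0x40 <= final <= 0x7E:  # 04/00 to 07/14; private finals rejected
        raise ValueError("Final character out of range")
    if any(not 0x20 <= b <= 0x2F for b in intermediates):
        raise ValueError("intermediate character out of range")
    if intermediates and not 0x21 <= intermediates[0] <= 0x23:
        raise ValueError("first intermediate must be 02/01 to 02/03")
    return target, size, bytes(intermediates), final


# 01/11 02/13 04/01 designates a 96-character set with Final 04/01 as GR.
print(parse_designation(b"\x1b-A"))
```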
.LP
[Final characters for private encodings (in the range 03/00 to 03/15) are not
permitted here in Compound Text.]
.LP
For GL, octet 02/00 is always defined as SPACE, and octet 07/15 (normally
DELETE) is never used. For a 94-character set defined as GR, octets 10/00 and
15/15 are never used.
.LP
[This is consistent with ISO 2022.]
.LP
A 94\u\s-2N\s+2\d character set uses N octets (N > 1) for each character.
The value of N is derived from the column value for F:
.RS
.TS
l l.
column 04 or 05	2 octets
column 06	3 octets
column 07	4 or more octets
.TE
.RE
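.LP
The column rule above amounts to a small lookup. In this illustrative Python
sketch the function name ``octets_per_character'' is hypothetical, and the
value returned for column 07 is only the minimum, since the table allows 4 or
more octets there.

```python
def octets_per_character(final: int) -> int:
    """Return N for a 94^N set, given its Final character F.

    For column 07 the result is a lower bound ("4 or more octets").
    """
    if not 0x40 <= final <= 0x7E:  # F must lie in 04/00 to 07/14
        raise ValueError("Final character out of range")
    column = final >> 4
    return {4: 2, 5: 2, 6: 3, 7: 4}[column]


# Final 04/02 is in column 04, so each character occupies 2 octets.
print(octets_per_character(0x42))
```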
.LP
In a 94\u\s-2N\s+2\d encoding, the octet values 02/00 and 07/15 (in GL) and
10/00 and 15/15 (in GR) are never used.
.LP
[The column definitions come from ISO 2022.]
.LP
Once a GL or GR set has been defined, all further octets in that range (except
within control sequences and extended segments) are interpreted with respect to
that character set encoding, until the GL or GR set is redefined. GL and GR
sets can be defined independently; they do not have to be defined in pairs.
.LP
Note that when actually using a character set encoding as the GR set, you must
force the most significant bit (08/00) of each octet to be a one, so that it
falls in the range 10/00 to 15/15.
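.LP
Forcing the most significant bit is a single bitwise OR per octet. The
following Python sketch is illustrative only; ``to_gr'' is a hypothetical
helper name, and it assumes its input holds character codes expressed in the
02/00 to 07/15 range.

```python
def to_gr(octets: bytes) -> bytes:
    """Set the most significant bit (08/00) of every octet."""
    if any(not 0x20 <= b <= 0x7F for b in octets):
        raise ValueError("expected octets in the range 02/00 to 07/15")
    return bytes(b | 0x80 for b in octets)


# A two-octet character coded as 04/06 07/12 becomes 12/06 15/12 in GR.
print(to_gr(bytes([0x46, 0x7C])).hex())
```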
.LP
[Control sequences to specify character set encoding revisions (as in section
6.3.13 of ISO 2022) are not used in Compound Text. Revision indicators do not
appear to provide useful information in the context of Compound Text. The most
recent revision can always be assumed, since revisions are upward compatible.]
.NH 1
Approved Standard Encodings
.LP
The following are the approved standard encodings to be used with Compound
Text. Note that none have Intermediate characters; however, a good parser will
still deal with Intermediate characters in the event that additional encodings
are later added to this list.
.RS
.TS
l l l.
_
.sp 4p
\fB{I} F\fP	\fB94/96\fP	\fBDescription\fP
.sp 4p
_
.sp 6p
04/02	94	7-bit ASCII graphics (ANSI X3.4-1968),
		Left half of ISO 8859 sets
04/09	94	Right half of JIS X0201-1976 (reaffirmed 1984),
		8-Bit Alphanumeric-Katakana Code
04/10	94	Left half of JIS X0201-1976 (reaffirmed 1984),
		8-Bit Alphanumeric-Katakana Code
.sp 6p
04/01	96	Right half of ISO 8859-1, Latin alphabet No. 1
04/02	96	Right half of ISO 8859-2, Latin alphabet No. 2
04/03	96	Right half of ISO 8859-3, Latin alphabet No. 3
04/04	96	Right half of ISO 8859-4, Latin alphabet No. 4
04/06	96	Right half of ISO 8859-7, Latin/Greek alphabet
04/07	96	Right half of ISO 8859-6, Latin/Arabic alphabet
04/08	96	Right half of ISO 8859-8, Latin/Hebrew alphabet
04/12	96	Right half of ISO 8859-5, Latin/Cyrillic alphabet
04/13	96	Right half of ISO 8859-9, Latin alphabet No. 5
.sp 6p
04/01	94\u\s-22\s+2\d	GB2312-1980, China (PRC) Hanzi
04/02	94\u\s-22\s+2\d	JIS X0208-1983, Japanese Graphic Character Set
04/03	94\u\s-22\s+2\d	KS C5601-1987, Korean Graphic Character Set
.sp 6p
_
.TE
.RE
.LP
The sets listed as ``Left half of ...'' should always be defined as GL. The
sets listed as ``Right half of ...'' should always be defined as GR. Other
sets can be defined either as GL or GR.
.NH 1
Non-Standard Character Set Encodings
.LP
Character set encodings that are not in the list of approved standard
encodings can be included using ``extended segments''. An extended segment
begins with one of the following sequences:
.RS
.TS
l l.
01/11 02/05 02/15 03/00 M L	variable number of octets per character
01/11 02/05 02/15 03/01 M L	1 octet per character
01/11 02/05 02/15 03/02 M L	2 octets per character
01/11 02/05 02/15 03/03 M L	3 octets per character
01/11 02/05 02/15 03/04 M L	4 octets per character
.TE
.RE
[This uses the ``other coding system'' of ISO 2022, using private Final
characters.]
.LP
The ``M'' and ``L'' octets represent a 14-bit unsigned value giving the number
of octets that appear in the remainder of the segment. The number is computed
as ((M - 128) * 128) + (L - 128). The most significant bits of M and L are
always set to one. The remainder of the segment consists of two parts, the
name of the character set encoding and the actual text. The name of the
encoding comes first and is separated from the text by the octet 00/02 (STX,
START OF TEXT). Note that the length defined by M and L includes the encoding
name and separator.
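.LP
As a non-normative illustration, the segment layout and the M/L length
arithmetic described above can be sketched in Python. All function names here
are hypothetical, and ``example-0'' is a made-up encoding name used only to
exercise the code.

```python
ESC, STX = 0x1B, 0x02  # 01/11 and 00/02


def encode_length(count: int) -> bytes:
    """Encode a 14-bit count as the two octets M and L (both with MSB set)."""
    if not 0 <= count < (1 << 14):
        raise ValueError("length does not fit in 14 bits")
    return bytes([(count >> 7) | 0x80, (count & 0x7F) | 0x80])


def decode_length(m: int, l: int) -> int:
    """Compute ((M - 128) * 128) + (L - 128)."""
    if m < 128 or l < 128:
        raise ValueError("M and L must have their most significant bit set")
    return (m - 128) * 128 + (l - 128)


def build_extended_segment(name: str, text: bytes, octets_per_char: int = 0) -> bytes:
    """Build a segment; octets_per_char 0 means a variable number of octets."""
    if not 0 <= octets_per_char <= 4:
        raise ValueError("octets_per_char must be 0..4")
    body = name.encode("latin-1") + bytes([STX]) + text  # the length covers all of this
    header = bytes([ESC, 0x25, 0x2F, 0x30 + octets_per_char])
    return header + encode_length(len(body)) + body


def parse_extended_segment(data: bytes):
    """Return (name, text, octets_per_char) for a segment at the start of data."""
    if len(data) < 6 or data[:3] != b"\x1b%/" or not 0x30 <= data[3] <= 0x34:
        raise ValueError("not an extended segment")
    length = decode_length(data[4], data[5])
    body = data[6:6 + length]
    if len(body) != length:
        raise ValueError("truncated extended segment")
    name, sep, text = body.partition(bytes([STX]))
    if not sep:
        raise ValueError("missing STX separator")
    return name.decode("latin-1"), text, data[3] - 0x30


segment = build_extended_segment("example-0", b"\x00\xff", octets_per_char=1)
print(parse_extended_segment(segment))
```

Because the length is carried explicitly, the text part may contain any octet
value, including 00/00 and 00/02; only the first STX ends the name.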
.LP
[The encoding of the length is chosen to avoid having zero octets in Compound
Text when possible, because embedded NUL values are problematic in many C
language routines. The use of zero octets cannot be ruled out entirely,
however, since some octets in the actual text of the extended segment may have
to be zero.]
.LP
The name of the encoding should be registered with the X Consortium to avoid
conflicts and should, when appropriate, match the CharSet Registry and Encoding
registration used in the X Logical Font Description. The name itself should be
encoded using ISO 8859-1 (Latin 1), should not use question mark (03/15) or
asterisk (02/10), and should use hyphen (02/13) only in accordance with the X
Logical Font Description.
.LP
Extended segments are not to be used for any character set encoding that can
be constructed from a GL/GR pair of approved standard encodings. For
example, it is incorrect to use an extended segment for any of the ISO 8859
family of encodings.
.LP
It should be noted that the contents of an extended segment are arbitrary;
for example, they may contain octets in the C0 and C1 ranges, including 00/00,
and octets comprising a given character may differ in their most significant
bit.
.LP
[ISO-registered ``other coding systems'' are not used in Compound Text;
extended segments are the only mechanism for non-2022 encodings.]
.NH 1
Directionality
.LP
If desired, horizontal text direction can be indicated using the following
control sequences:
.RS
.TS
l l.
09/11 03/01 05/13	begin left-to-right text
09/11 03/02 05/13	begin right-to-left text
09/11 05/13	end of string
.TE
.RE
.LP
[This is a subset of the SDS (START DIRECTED STRING) control in the Draft
Bidirectional Addendum to ISO 6429.]
.LP
Directionality can be nested. Logically, a stack of directions is maintained.
Each of the first two control sequences pushes a new direction on the stack,
and the third sequence (revert) pops a direction from the stack. The stack
starts out empty at the beginning of a Compound Text string. When the stack is
empty, the directionality of the text is unspecified.
.LP
Directionality applies to all subsequent text, whether in GL, GR, or an
extended segment. If the desired directionality of GL, GR, or extended
segments differs, then directionality control sequences must be inserted when
switching between them.
.LP
Note that definition of GL and GR sets is independent of directionality;
defining a new GL or GR set does not change the current directionality, and
pushing or popping a directionality does not change the current GL and GR
definitions.
.LP
Specification of directionality is entirely optional; text direction should be
clear from context in most cases. However, it must be the case that either
all characters in a Compound Text string have explicitly specified direction
or that all characters have unspecified direction. That is, if directionality
control sequences are used, the first such control sequence must precede the
first graphic character in a Compound Text string, and graphic characters are
not permitted whenever the directionality stack is empty.
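.LP
The stack discipline and the all-or-nothing rule above can be checked with a
short routine. This Python sketch is illustrative, not normative. It assumes
the string has already been split into tokens by some earlier pass: ``LTR'',
``RTL'', and ``END'' stand for the three control sequences, and ``TEXT''
stands for any run of graphic characters. The token names, the function name,
and the decision to reject an ``END'' on an empty stack are all assumptions
of this example.

```python
# Octet forms of the three control sequences, for reference.
SEQUENCES = {
    "LTR": bytes([0x9B, 0x31, 0x5D]),  # 09/11 03/01 05/13
    "RTL": bytes([0x9B, 0x32, 0x5D]),  # 09/11 03/02 05/13
    "END": bytes([0x9B, 0x5D]),        # 09/11 05/13
}


def check_directionality(tokens) -> bool:
    """Return True if the token stream obeys the directionality rules."""
    tokens = list(tokens)
    uses_controls = any(tok in SEQUENCES for tok in tokens)
    stack = []  # starts out empty: direction unspecified
    for tok in tokens:
        if tok in ("LTR", "RTL"):
            stack.append(tok)          # push a new direction
        elif tok == "END":
            if not stack:
                return False           # assumption: popping an empty stack is an error
            stack.pop()                # revert to the previous direction
        elif tok == "TEXT":
            # Once any control is used, text needs an explicit direction.
            if uses_controls and not stack:
                return False
        else:
            raise ValueError(f"unknown token: {tok!r}")
    return True


print(check_directionality(["LTR", "TEXT", "RTL", "TEXT", "END", "TEXT", "END"]))
print(check_directionality(["TEXT", "LTR", "TEXT", "END"]))
```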
.NH 1
Resources
.LP
To use Compound Text in a resource, you can simply treat all octets as if they
were ASCII/Latin-1 and just replace all ``\\'' octets (05/12) with the two
octets ``\\\\'', all newline octets (00/10) with the two octets ``\\n'', and
all zero octets with the four octets ``\\000''.
It is up to the client making use of the resource to interpret the data as
Compound Text; the policy by which this is ascertained is not constrained by
the Compound Text specification.
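.LP
The three replacements above can be applied in a single pass over the octets.
The following Python sketch is an illustration; the function name
``escape_for_resource'' is hypothetical. A single pass avoids escaping the
backslashes that the newline and zero replacements themselves introduce.

```python
def escape_for_resource(compound_text: bytes) -> bytes:
    """Escape backslash, newline, and zero octets for use in a resource value."""
    out = bytearray()
    for octet in compound_text:
        if octet == 0x5C:        # 05/12, backslash -> two backslashes
            out += b"\\\\"
        elif octet == 0x0A:      # 00/10, newline -> backslash, "n"
            out += b"\\n"
        elif octet == 0x00:      # 00/00 -> backslash, "000"
            out += b"\\000"
        else:
            out.append(octet)    # everything else passes through unchanged
    return bytes(out)


print(escape_for_resource(b"line one\nline two"))
```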
.NH 1
Font Names
.LP
The following CharSet names for the standard character set encodings are
registered for use in font names under the X Logical Font Description:
.RS
.TS
l l l.
_
.sp 6p
\fBName\fP	\fBEncoding Standard\fP	\fBDescription\fP
.sp 6p
_
.sp 6p
ISO8859-1	ISO 8859-1	Latin alphabet No. 1
ISO8859-2	ISO 8859-2	Latin alphabet No. 2
ISO8859-3	ISO 8859-3	Latin alphabet No. 3
ISO8859-4	ISO 8859-4	Latin alphabet No. 4
ISO8859-5	ISO 8859-5	Latin/Cyrillic alphabet
ISO8859-6	ISO 8859-6	Latin/Arabic alphabet
ISO8859-7	ISO 8859-7	Latin/Greek alphabet
ISO8859-8	ISO 8859-8	Latin/Hebrew alphabet
ISO8859-9	ISO 8859-9	Latin alphabet No. 5
JISX0201.1976-0	JIS X0201-1976 (reaffirmed 1984)	8-bit Alphanumeric-Katakana Code
GB2312.1980-0	GB2312-1980, GL encoding	China (PRC) Hanzi
JISX0208.1983-0	JIS X0208-1983, GL encoding	Japanese Graphic Character Set
KSC5601.1987-0	KS C5601-1987, GL encoding	Korean Graphic Character Set
.sp 6p
_
.TE
.RE
.LP
.NH 1
Extensions
.LP
There is no absolute requirement for a parser to deal with anything but the
particular encoding syntax defined in this specification. However, it is
possible that Compound Text may be extended in the future, and as such it may
be desirable to construct the parser to handle 2022/6429 syntax more generally.
.LP
There are two general formats covering all control sequences that are expected
to appear in extensions:
.LP
01/11 {I} F
.IP
For this format, I is always in the range 02/00 to 02/15, and F is always
in the range 03/00 to 07/14.
.LP
09/11 {P} {I} F
.IP
For this format, P is always in the range 03/00 to 03/15, I is always in
the range 02/00 to 02/15, and F is always in the range 04/00 to 07/14.
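.LP
A general parser can use these two formats to find where a control sequence
ends without knowing what it means. The Python sketch below is illustrative
only; ``control_sequence_length'' is a hypothetical name. It measures the
sequence itself and nothing more: for an extended segment it stops after the
Final character, so the caller must still consume M, L, and the octets they
count.

```python
ESC, CSI = 0x1B, 0x9B  # 01/11 and 09/11


def control_sequence_length(data: bytes, start: int = 0) -> int:
    """Return the number of octets in the ESC or CSI sequence at data[start]."""
    if start >= len(data):
        raise ValueError("start is past the end of the data")
    pos = start + 1
    if data[start] == ESC:
        final_low = 0x30                      # F in 03/00 to 07/14
    elif data[start] == CSI:
        final_low = 0x40                      # F in 04/00 to 07/14
        while pos < len(data) and 0x30 <= data[pos] <= 0x3F:
            pos += 1                          # {P}: 03/00 to 03/15
    else:
        raise ValueError("not a control sequence")
    while pos < len(data) and 0x20 <= data[pos] <= 0x2F:
        pos += 1                              # {I}: 02/00 to 02/15
    if pos >= len(data) or not final_low <= data[pos] <= 0x7E:
        raise ValueError("missing or invalid Final character")
    return pos + 1 - start


# 09/11 03/01 05/13 (begin left-to-right text) is three octets long.
print(control_sequence_length(bytes([0x9B, 0x31, 0x5D])))
```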
.LP
In addition, new (singleton) control characters (in the C0 and C1 ranges) might
be defined in the future.
.LP
Finally, new kinds of ``segments'' might be defined in the future using syntax
similar to extended segments:
.LP
01/11 02/05 02/15 F M L
.IP
For this format, F is in the range 03/05 to 03/15. M and L are as defined
in extended segments. Such a segment will always be followed by the number
of octets defined by M and L. These octets can have arbitrary values and
need not follow the internal structure defined for current extended
segments.
.LP
If extensions to this specification are defined in the future, then any string
incorporating instances of such extensions must start with one of the following
control sequences:
.RS
.TS
l l.
01/11 02/03 V 03/00	ignoring extensions is OK
01/11 02/03 V 03/01	ignoring extensions is not OK
.TE
.RE
.LP
In either case, V is in the range 02/00 to 02/15 and indicates the major
version minus one of the specification being used. These version control
sequences are for use by clients that implement earlier versions, but have
implemented a general parser. The first control sequence indicates that it is
acceptable to ignore all extension control sequences; no mandatory information
will be lost in the process. The second control sequence indicates that it is
unacceptable to ignore any extension control sequences; mandatory information
would be lost in the process. In general, it will be up to the client
generating the Compound Text to decide which control sequence to use.
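.LP
A client with a general parser might inspect the version control sequence as
sketched below in Python. This is an illustration under stated assumptions:
the function name ``parse_version_prefix'' is hypothetical, and the mapping
from V to a version number assumes that 02/00 stands for zero, so that V of
02/00 means major version 1.

```python
def parse_version_prefix(data: bytes):
    """Return (major_version, may_ignore_extensions), or None if absent."""
    if len(data) < 4 or data[0] != 0x1B or data[1] != 0x23:  # 01/11 02/03
        return None
    v, flag = data[2], data[3]
    if not 0x20 <= v <= 0x2F or flag not in (0x30, 0x31):
        return None
    # Assumption: V encodes (major version - 1) as an offset from 02/00.
    major_version = (v - 0x20) + 1
    return major_version, flag == 0x30  # 03/00 means ignoring is OK


# 01/11 02/03 02/01 03/01: major version 2, extensions must not be ignored.
print(parse_version_prefix(bytes([0x1B, 0x23, 0x21, 0x31])))
```

A string that returns None here carries no version control sequence and so,
by the rule above, should contain no extensions.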
.NH 1
Errors
.LP
If a Compound Text string does not match the specification here (e.g., uses
undefined control characters, or undefined control sequences, or incorrectly
formatted extended segments), it is best to treat the entire string as invalid,
except as indicated by a version control sequence.