Case-mapping, Unicode & internationalisation

Case-mapping, Unicode & internationalisation shivers@xxxxxx 24 Jan 2000 13:37 UTC
I would like SRFI-13 to take advantage of the opportunity to tackle
the issues arising from internationalisation and Unicode, and do a
proper job. My design criteria for SRFI-13 are these:

  - The SRFI-13 spec is independent of the implementation chosen for
    representing characters -- one should be able to use SRFI-13 procedures
    in Schemes that use ASCII, Latin-1, Unicode or other encodings for chars.

  - The spec *is* designed to allow string-processing code to be portable
    across different character encodings. This means that we include string
    primitives (such as string comparison, case mapping) which cannot be
    portably implemented using simple character primitives for Unicode
    Schemes. For example, lower-casing a string requires more than mapping
    CHAR-DOWNCASE over the string -- see below for the subtleties involved
    when dealing with the full spectrum of Unicode.

In other words, I don't want to put in Unicode-specific ops, but I want all
the ops to make sense in a Unicode world. This is similar to my design
criteria for shared-text substrings.

Ben Goettner has been advising me on the subtleties of Unicode and case. The
good news is that there is a whole tech report from the Unicode people on this
issue. The bad news is that the possibility of Unicode does have an impact on the
design of basic string operations.

The issues of case-mapping are laid out in Unicode Tech Report 21, which is
short, clear and available on the Web:
    http://www.unicode.org/unicode/reports/tr21/
(It can be easily read in a few minutes.)

The short summary is that we are dropping two procedures (STRING-UPCASE!
and STRING-DOWNCASE!) and reinstating WORD-CAPITALIZE with a new name
(STRING-TITLECASE) and new semantics.

Here are the issues and their impact on SRFI-13.

- Case-mapping requires surrounding context

  In Unicode, you can't actually do case-mapping on a single char in
  isolation.  In a few cases, it requires surrounding context info.  For
  example, Greek capital sigma downcases to two different chars depending upon
  whether it is the final character of a word or not.

  STRING-UPCASE & STRING-DOWNCASE use context in a Unicode Scheme. However,
  context does not extend beyond the limits of the start/end indices, when
  these are supplied.

  CHAR-UPCASE and CHAR-DOWNCASE are not in the purview of SRFI-13. However,
  this SRFI recommends that these two functions simply choose a reasonable
  default for these cases (e.g., the NON_FINAL mapping).
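
  The context rule for STRING-DOWNCASE can be sketched roughly as follows.
  This is an illustration under stated assumptions, not the SRFI-13
  definition: SIGMA-AWARE-DOWNCASE is a made-up name, the test for "final" is
  simplified to "a letter precedes and no letter follows", and the code
  assumes a Scheme whose chars cover Unicode and whose CHAR-ALPHABETIC?
  recognises Greek letters.

```scheme
;; Sketch only; SIGMA-AWARE-DOWNCASE is a hypothetical name.
;; In an R7RS program, first (import (scheme base) (scheme char)).
(define capital-sigma (integer->char #x03A3))
(define small-sigma   (integer->char #x03C3))
(define final-sigma   (integer->char #x03C2))

(define (sigma-aware-downcase s start end)
  (let ((out (make-string (- end start))))
    (do ((i start (+ i 1)))
        ((= i end) out)
      (let ((c (string-ref s i)))
        (string-set!
         out (- i start)
         (cond ((not (char=? c capital-sigma)) (char-downcase c))
               ;; Final form: a letter precedes and no letter follows,
               ;; looking only inside [start, end).
               ((and (> i start)
                     (char-alphabetic? (string-ref s (- i 1)))
                     (or (= (+ i 1) end)
                         (not (char-alphabetic? (string-ref s (+ i 1))))))
                final-sigma)
               (else small-sigma)))))))
```

  Given the three-character string capital omicron, capital sigma, capital
  alpha, downcasing over [0,2) turns the sigma into the final form, because
  the alpha lies outside the range; over [0,3) the same sigma gets the
  non-final form.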

- Titlecase <> uppercase
  Unicode defines three kinds of case mapping: lowercase, uppercase, and
  titlecase. The difference between uppercasing and titlecasing a character
  or character sequence can be seen in compound characters (that is,
  a single character that represents a compound of two characters).

  For example, in Unicode, character U+01F3 is LATIN SMALL LETTER DZ.  (Let us
  write this compound character using ASCII as "dz".) This character
  uppercases to character U+01F1, LATIN CAPITAL LETTER DZ.  (Which is
  basically "DZ".) But it titlecases to character U+01F2, LATIN CAPITAL
  LETTER D WITH SMALL LETTER Z. (Which we can write "Dz".)

      character uppercase titlecase
      --------- --------- ---------
      dz        DZ        Dz

  Scheme needs CHAR-TITLECASE and CHAR-TITLECASE? functions, but these are not
  in the purview of SRFI-13, which handles strings, not chars.

  STRING-CAPITALIZE is required to do the right thing with compound characters
  in a Unicode implementation.

  We also add STRING-TITLECASE, which uses the Unicode definition
  of titlecasing a text string: every character not preceded by a
  cased character is titlecased. All other characters are lowercased. E.g.
        (string-titlecase "olin g. sHIVERS") => "Olin G. Shivers"
        (string-titlecase "Laurence McCullough") => "Laurence Mccullough"
        (string-titlecase "3com mAkes ROUTERS.") => "3Com Makes Routers."

  (This is essentially the task handled by the old CAPITALIZE-WORDS function,
  which was dropped a few rounds ago.)  If the optional start index is given,
  it is treated as the beginning of the string. E.g.:

        (string-titlecase "jamie clark" 2) => "Mie Clark"

  To recap, STRING-CAPITALIZE titlecases the *initial* character of a string.
  STRING-TITLECASE processes the entire string.
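
  The titlecasing rule above can be sketched with ordinary char operations.
  This is a minimal sketch, not the SRFI-13 definition: TITLECASE-SKETCH is a
  made-up name, CHAR-ALPHABETIC? stands in for "cased", and CHAR-UPCASE
  stands in for the CHAR-TITLECASE that Scheme does not yet have, so compound
  characters such as "dz" are not handled correctly here.

```scheme
;; Sketch only; TITLECASE-SKETCH is a hypothetical name.
;; In an R7RS program, first (import (scheme base) (scheme char)).
(define (titlecase-sketch s . maybe-start)
  (let ((start (if (pair? maybe-start) (car maybe-start) 0)))
    (let loop ((i start) (prev-cased? #f) (acc '()))
      (if (= i (string-length s))
          (list->string (reverse acc))
          (let ((c (string-ref s i)))
            (loop (+ i 1)
                  (char-alphabetic? c)        ; is this char "cased"?
                  ;; A char not preceded by a cased char is titlecased
                  ;; (approximated by upcasing); all others are lowercased.
                  (cons (if prev-cased? (char-downcase c) (char-upcase c))
                        acc)))))))

(titlecase-sketch "3com mAkes ROUTERS.")   ; => "3Com Makes Routers."
(titlecase-sketch "jamie clark" 2)         ; => "Mie Clark"
```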

- A single lowercase char can upcase into multiple chars

  For example, German eszett upcases to "SS".

  This is a problem for CHAR-UPCASE; STRING-UPCASE and STRING-TITLECASE can
  handle it properly, and are required to do so in a Latin-1 or Unicode Scheme.

  STRING-UPCASE! and STRING-DOWNCASE! are being dropped, since they cannot
  guarantee to handle their arguments in-place. (Bummer.)
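
  To see why in-place operation cannot be guaranteed, here is a rough sketch
  of upcasing with one-to-many expansion. UPCASE-SKETCH and
  SPECIAL-UPCASE-TABLE are made-up names, the table holds a single
  illustrative entry rather than the full Unicode data, and the code assumes
  a Scheme whose chars cover at least Latin-1.

```scheme
;; Sketch only; UPCASE-SKETCH and SPECIAL-UPCASE-TABLE are hypothetical.
;; In an R7RS program, first (import (scheme base) (scheme char)).
(define eszett (integer->char #xDF))

;; Chars whose uppercase form is a multi-char string.
(define special-upcase-table
  (list (cons eszett "SS")))

(define (upcase-sketch s)
  (apply string-append
         (map (lambda (c)
                (let ((hit (assv c special-upcase-table)))
                  (if hit
                      (cdr hit)                   ; expands to several chars
                      (string (char-upcase c))))) ; ordinary one-to-one case
              (string->list s))))

(define strasse (string-append "Stra" (string eszett) "e"))
(string-length strasse)                  ; => 6
(string-length (upcase-sketch strasse))  ; => 7, i.e. "STRASSE"
```

  The result is one char longer than the argument, so it cannot be written
  back into the argument's storage.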

- Turkish has different case mappings.

  Case-mapping functions are sensitive to external environment settings
  in ways not defined by this SRFI. E.g., the current locale in Unix, as
  selected by the LC_* and LANG environment variables.

  Note that Turkish is the only language in the Unicode set with this problem.
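
  To make the Turkish difference concrete: Turkish treats dotted and dotless
  i as distinct letters, so "i" upcases to U+0130 (capital I with dot above)
  and U+0131 (dotless small i) upcases to "I". The sketch below passes the
  locale as an explicit argument purely for illustration; how an
  implementation actually learns the locale is, as said, not defined by this
  SRFI. CHAR-UPCASE/LOCALE and the locale symbol TR are made-up names.

```scheme
;; Sketch only; CHAR-UPCASE/LOCALE and the symbol TR are hypothetical.
;; In an R7RS program, first (import (scheme base) (scheme char)).
(define dotted-capital-i (integer->char #x0130))
(define dotless-small-i  (integer->char #x0131))

(define (char-upcase/locale c locale)
  (cond ((and (eq? locale 'tr) (char=? c #\i)) dotted-capital-i)
        ((and (eq? locale 'tr) (char=? c dotless-small-i)) #\I)
        (else (char-upcase c))))   ; every other case: the default mapping

(char-upcase/locale #\i 'en)   ; => #\I
(char-upcase/locale #\i 'tr)   ; => capital I with dot above, U+0130
```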

- CHAR-UPCASE and CHAR-DOWNCASE
  These functions are not in the purview of SRFI-13. However,
  this SRFI recommends that these two functions

  - pass through unchanged characters whose case-mapping expands them into
    multi-character sequences, such as when upcasing the Latin-1 German
    eszett to "SS". This will allow old code to continue to work, and is
    consistent with what modern Unicode OSes do (e.g., Windows 2000) --
    hence implementations can use the native OS case-mapping facilities,
    when possible.

  - return a reasonable default when asked to case-map a character
    that has multiple possible results depending upon context (such as
    downcasing the Greek capital sigma).

- This SRFI additionally recommends

  - the numeric codes used by the standard functions that map between
    characters and integers (CHAR->INTEGER and INTEGER->CHAR) be required to
    follow the Unicode/Latin-1/ASCII mapping. This allows programmers to
    write portable code.

  - CHAR-TITLECASE be added to CHAR-UPCASE and CHAR-DOWNCASE

  - CHAR-TITLECASE? be added to CHAR-UPCASE? and CHAR-DOWNCASE?

  - Title/up/down-case functions might be added to the character-processing
    suite which return immutable string values. Note that the context issue
    (e.g., properly downcasing Greek Sigma) is not resolved by these functions.

  These recommendations are not a part of the SRFI-13 spec. Note also that
  requiring a Unicode/Latin-1/ASCII interface to integer/char mapping
  functions does not imply anything about the actual underlying encodings of
  characters.
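
  As a small illustration of what the integer-mapping recommendation buys,
  the following helper is portable only if CHAR->INTEGER is guaranteed to
  return ASCII-compatible codes, whatever the internal representation of
  chars may be. ASCII-LETTER? is a made-up name, not a proposed procedure.

```scheme
;; Sketch only; ASCII-LETTER? is a hypothetical helper.
;; Assumes CHAR->INTEGER follows the recommended Unicode/Latin-1/ASCII codes.
(define (ascii-letter? c)
  (let ((n (char->integer c)))
    (or (<= 65 n 90)        ; A..Z
        (<= 97 n 122))))    ; a..z

(ascii-letter? #\q)   ; => #t
(ascii-letter? #\3)   ; => #f
```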

- Summary
  (string-upcase    string [start end]) -> string
  (string-downcase  string [start end]) -> string
  (string-titlecase string [start end]) -> string

- Char function recommendations:
  (char-upcase    char) -> char
  (char-downcase  char) -> char
  (char-titlecase char) -> char

  (char-upcase?    char) -> boolean
  (char-downcase?  char) -> boolean
  (char-titlecase? char) -> boolean

  (upcase-char->string    char) -> immutable-string
  (downcase-char->string  char) -> immutable-string
  (titlecase-char->string char) -> immutable-string

- Other internationalisation issues

  Case mapping is not the only tricky issue in a rich character world
  like Unicode. I'll deal with the following issues in later notes.
  - Procedures to find word boundaries and line-break opportunities portably.
  - String comparison: collation order, case-folding, normalisation.