Various opinions Jorgen Schaefer (15 Jul 2005 19:06 UTC)
Re: Various opinions bear (16 Jul 2005 02:59 UTC)

The topic of Unicode support in a language is very difficult. I'm not
sure the array of codepoints idea is the best solution, but I won't
question it in this mail (mainly because I don't have a better
proposal).

Integers and Codepoints
=======================
The SRFI defines the procedures CHAR->INTEGER and INTEGER->CHAR,
but also defines the integer involved to be a Unicode codepoint. So
it would be better to name them

  char->codepoint
  codepoint->char

instead.
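
To make the distinction concrete, here is a minimal sketch. It is
written in Python only because that makes it easy to run; the names
char_to_codepoint and codepoint_to_char are hypothetical stand-ins for
the proposed procedures, and the range check is my assumption about
what "codepoint" should imply, not something the SRFI states.

```python
MAX_CODEPOINT = 0x10FFFF  # highest codepoint Unicode defines


def char_to_codepoint(ch: str) -> int:
    """Return the Unicode codepoint of a single character."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    return ord(ch)


def codepoint_to_char(cp: int) -> str:
    """Return the character for a codepoint; reject other integers."""
    if not 0 <= cp <= MAX_CODEPOINT:
        raise ValueError(f"{cp:#x} is not a Unicode codepoint")
    return chr(cp)


print(hex(char_to_codepoint("A")))   # 0x41
print(codepoint_to_char(0x3BB))      # λ
```

The point of the name is that the domain is the set of codepoints, not
the integers in general: 0x110000 is an integer, but not a codepoint.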

The newline character
=====================
#\NEWLINE has always been a problem, because a new line is a
system-dependent sequence of octets. #\LINEFEED is the correct name
for the character itself.

We also have (newline), which is the right thing to do, so we can just
drop #\NEWLINE.

x, u and U
==========
The SRFI defines x, u and U for two-digit, four-digit and eight-digit
hexadecimal codepoints in character literals and in strings.

First of all, for character literals, this is unnecessary. It would be
much more elegant to have

   (char=? #\xA #\d10 #\o12 #\b1010)

analogous to

   (= #xA #d10 #o12 #b1010)
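
A reader for such literals would be simple. The following is a rough
sketch in Python, not a proposal for an implementation: the function
name parse_char_literal is made up for illustration, and it only
handles the radix-prefixed form suggested above (a bare #\x or #\d
would still have to mean the letter itself).

```python
# Radix prefixes, mirroring the number syntax #x, #d, #o, #b.
RADIX = {"x": 16, "d": 10, "o": 8, "b": 2}


def parse_char_literal(token: str) -> str:
    """Parse a hypothetical #\\<radix><digits> character literal."""
    if not token.startswith("#\\") or len(token) < 4:
        raise ValueError(f"not a radix character literal: {token!r}")
    radix = RADIX.get(token[2])
    digits = token[3:]
    if radix is None or not (digits.isascii() and digits.isalnum()):
        raise ValueError(f"not a radix character literal: {token!r}")
    # int() raises ValueError if a digit is invalid for the radix.
    return chr(int(digits, radix))


tokens = [r"#\xA", r"#\d10", r"#\o12", r"#\b1010"]
chars = [parse_char_literal(t) for t in tokens]
print(all(c == chars[0] for c in chars))  # True
```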

Introducing characters which mark fixed-width tokens in strings
strikes me as problematic as well. The obvious alternative is
using delimiters, as already proposed on this list. Delimiters
improve readability, and since using explicit codepoints in
strings is rare, the extra length is not a problem. It is also not
clear that hexadecimal encoding is always preferable. So I would
propose a delimiter which allows for different bases.

There are different approaches. The main goal is to make it readable.

    "A\(#x42)C"  - This is _very_ readable, though verbose
    "A\x42;C"
    "A\x42:C"    - This is also very readable
    "A\x42#C"
    "A\#x42#C"   - This provides some consistency

Analogous to the character syntax described above, the following could
be possible:

   (apply char=? (string->list "\xA:\d10:\o12:\b1010:")) => #t
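
To show that a delimited, multi-base escape is easy to decode, here
is a small sketch in Python. It assumes the "\x42:" variant from the
list above (a colon as the closing delimiter); the name
decode_escapes and the exact grammar are illustrative assumptions,
not anything the SRFI specifies.

```python
import re

RADIX = {"x": 16, "d": 10, "o": 8, "b": 2}

# Backslash, a radix letter, one or more digits, a closing colon.
_ESCAPE = re.compile(r"\\([xdob])([0-9A-Fa-f]+):")


def decode_escapes(text: str) -> str:
    """Replace each delimited codepoint escape with its character."""
    def replace(match: re.Match) -> str:
        radix = RADIX[match.group(1)]
        # int() raises ValueError if a digit is invalid for the radix.
        return chr(int(match.group(2), radix))

    return _ESCAPE.sub(replace, text)


print(decode_escapes(r"A\x42:C"))  # ABC
```

Because the escape ends at the delimiter, "A\x42:C" cannot be misread
as the codepoint 42C, which is the ambiguity that fixed-width tokens
were introduced to avoid.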

Quoted Strings
==============
I'm a bit confused as to why we need all those character shorthands.

The rationale "it's what is provided in other places" doesn't sound
right. Specifically, I have never seen \b or \v in use.

I also wonder why \? and \' got added there - those are equivalent to
the characters without the backslash, and neither the question mark
nor the quote is ever used in a context where it has to be quoted.

Newlines in strings
===================
I like the \<newline><intraline-whitespace> syntax, as it allows for
correctly-indented strings.

I would dislike being prevented from using newlines in a string,
though, and I don't see a reason to do so.
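
A sketch of how I read the two rules together, in Python for the sake
of having something runnable: a backslash followed by a newline and
any intraline whitespace disappears, while a bare newline stays in
the string. Treating "intraline whitespace" as spaces and tabs is my
assumption here, and join_continuations is a made-up name.

```python
import re

# Backslash, newline, then any run of spaces or tabs on the next line.
_CONTINUATION = re.compile(r"\\\n[ \t]*")


def join_continuations(raw: str) -> str:
    """Drop backslash-newline-indentation runs; keep bare newlines."""
    return _CONTINUATION.sub("", raw)


raw = "first part, \\\n        second part\nnext line"
print(join_continuations(raw))
```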

Here Strings
============
The introduction of here strings poses a few problems. Allowing any
character in the delimiter does not seem useful (except for the
Obfuscated Scheme Code Contest, of course :-)), so I think the right
choice is to restrict the set of allowed characters.

For consistency, a symbol could be used there. After the delimiter,
only whitespace may follow until the newline.

This allows for here-strings to be normal tokens (instead of being
possibly split up over several tokens), and doesn't lend itself to
hiding errors as easily.

Case-Insensitivity
==================
Unicode defines case folding for case-insensitive comparisons. This
works by mapping characters to specific case-folded characters - not
necessarily upper-case or lower-case, but a special case-folded
version.

This allows, for example, the two lower-case forms of the Greek sigma
to match each other, and the German eszett to match the double-s of
its upper-case form.

The procedures that deal with case insensitivity in this SRFI -
i.e. *-CI* - should use case folding, not downcasing.
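
The difference is easy to demonstrate. The sketch below uses Python,
whose str.casefold() applies Unicode case folding and whose
str.lower() is plain downcasing; ci_equal is a hypothetical helper
standing in for the *-CI* procedures.

```python
def ci_equal(a: str, b: str) -> bool:
    """Case-insensitive comparison via Unicode case folding."""
    return a.casefold() == b.casefold()


final_sigma, sigma = "\u03c2", "\u03c3"   # ς and σ

# Case folding maps both lower-case sigmas to the same character,
# and folds the eszett to "ss".
print(ci_equal(final_sigma, sigma))        # True
print(ci_equal("Straße", "STRASSE"))       # True

# Downcasing leaves both pairs distinct.
print(final_sigma.lower() == sigma.lower())      # False
print("Straße".lower() == "STRASSE".lower())     # False
```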

Normalization
=============
This SRFI lacks a notion of normalization, which is important for any
kind of string comparison. I don't see an easy way to integrate this
besides providing STRING-NORMALIZE-NF{C,D,KC,KD}, though.
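
For readers who have not run into the problem, here is a short
illustration using Python's unicodedata.normalize, which provides the
same four forms. The helper normalized_equal is hypothetical and only
shows what a comparison on top of such procedures might look like.

```python
import unicodedata

composed = "\u00e9"       # é as a single codepoint
decomposed = "e\u0301"    # e followed by a combining acute accent


def normalized_equal(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Same text to a reader, different codepoint sequences.
print(composed == decomposed)                     # False
print(normalized_equal(composed, decomposed))     # True

# The compatibility forms go further, e.g. they expand the fi ligature.
print(unicodedata.normalize("NFKC", "\ufb01"))    # fi
```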

It's What Others Do
===================
The discussions in this thread keep returning to one argument which I
find problematic. The argument is "This Is What Others Do", or even
"This Is What People Coming From Other Languages Expect".

Since when was that a good argument against a sensible solution in
Scheme?

I think it would be useful to stop pondering how to copy questionable
design decisions from other languages, and to try to find good
solutions instead - it wouldn't be the first time Scheme does
something no one else did, because it is the right thing to do.

Greetings,
        -- Jorgen

--
((email . "xxxxxx@forcix.cx") (www . "http://www.forcix.cx/")
 (gpg   . "1024D/028AF63C")   (irc . "nick forcer on IRCnet"))