Allowing ASCII only, string escapes, and normalization Jorgen Schaefer (28 Jul 2005 18:47 UTC)

Hi there!
Some more comments from my side.

Allowing ASCII only
===================
The current draft summarizes two objections to the SRFI raised on
this list: that it mandates too much for systems targeting small
devices, and that it mandates too little for more sophisticated
implementations. I think the SRFI is a good middle ground and
allows a transition from the old string processing to newer and
more sophisticated designs. So the latter problem can only be
addressed by SRFIs which specify the better interfaces.

To mitigate the former problem, I went over the draft again with
an eye for where it precludes an implementation from supporting
only ASCII. There's not much. If an implementation were allowed to
signal an error on unsupported code points, it would be trivial
for an implementation to support only ASCII (or Latin-1), as the
code points 0-127 (0-255) are identical in Unicode and ASCII
(Latin-1). This would open the specification to small devices.
(Even for other character sets, you only need a simple
translation table, signalling errors on all other code points.)
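
To make the idea concrete, here is a rough sketch of "signal an
error on unsupported code points". It is written in Python rather
than Scheme purely for illustration, and every name in it
(UnsupportedCodePoint, make_char_constructor, ascii_char,
latin1_char) is hypothetical and not taken from the draft.

```python
class UnsupportedCodePoint(ValueError):
    """Raised when a code point lies outside the supported repertoire."""


def make_char_constructor(max_code_point):
    """Return an integer->char function limited to 0..max_code_point.

    Hypothetical helper: it models an implementation that supports
    only a prefix of the Unicode code space.
    """
    def integer_to_char(code_point):
        if not 0 <= code_point <= max_code_point:
            raise UnsupportedCodePoint(
                f"code point U+{code_point:04X} is not supported")
        # Code points 0-127 (ASCII) and 0-255 (Latin-1) keep the
        # same numbering in Unicode, so no translation is needed.
        return chr(code_point)
    return integer_to_char


ascii_char = make_char_constructor(0x7F)    # ASCII-only implementation
latin1_char = make_char_constructor(0xFF)   # Latin-1 implementation
```

With these definitions, ascii_char(0x41) yields "A", while
ascii_char(0xE9) raises UnsupportedCodePoint and latin1_char(0xE9)
succeeds.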

This would mean that an implementation can support Unicode
code points fully or partially, just as implementations can support
the numeric tower fully or partially.

String Escapes
==============
My biggest problem with the current draft is still the x/u/U
escapes. More and more, I come to think that delimited escapes
are the way to go. Specifically, parenthesized escapes, i.e.
"Foo\x(0A)Bar".

This has a number of advantages. We don't need u and U anymore, as
there's no ambiguity on what is part of the escape and what is
not. It is easy to read. And it is even friendly to users from
other languages: If a \x escape is not followed by a parenthesis,
an appropriate syntax error can be signalled, even explaining the
correct syntax.
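
A reader for this syntax could look roughly like the sketch below.
It is in Python rather than Scheme, only to illustrate the idea;
decode_escapes and HEX_DIGITS are hypothetical names, and the
sketch handles only the \x(...) escape and no others.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def decode_escapes(source):
    """Expand \\x(HEX) escapes in the body of a string literal."""
    out = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if source[i + 1:i + 2] != "x":
            raise SyntaxError("only \\x(...) is handled in this sketch")
        if source[i + 2:i + 3] != "(":
            # The friendly error for users coming from other languages.
            raise SyntaxError("\\x must be followed by '(', as in \\x(0A)")
        close = source.find(")", i + 3)
        if close == -1:
            raise SyntaxError("unterminated \\x( escape")
        digits = source[i + 3:close]
        if not digits or any(d not in HEX_DIGITS for d in digits):
            raise SyntaxError(f"invalid hex digits in escape: {digits!r}")
        value = int(digits, 16)
        if value > 0x10FFFF:
            raise SyntaxError(f"escape out of range: {digits!r}")
        out.append(chr(value))
        i = close + 1   # the ')' ends the escape unambiguously
    return "".join(out)
```

Because the closing parenthesis ends the escape, an input such as
\x(A)20 decodes to a line feed followed by the characters "20",
whatever the number of hex digits.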

If the latter is deemed less important than being able to write
\x0A itself, the parentheses might be required only for hex
strings whose length is not exactly two.

That problem does not exist for character constants, as those are
delimited by other means anyway, so #\xA20 is always unambiguous.
Hence we can drop u and U from character constants as well.

This (type of) syntax even has precedent, in Perl 6 of all
languages. Apparently, they use \x{263A} in strings, and allow
\x[263A] and \x<263A> as well in regular expressions. All types of
bracketing are optional and only used for disambiguation. Cf.
http://www.perl.com/pub/a/2002/06/04/apo5.html?page=7 and
http://www.mail-archive.com/perl6xxxxxx@perl.org/msg00140.html

(I don't think we should adopt such a DWIM attitude - requiring
the parentheses, and using only a single kind, looks like the best
way to me.)

Normalization
=============
String comparison on code point vectors without normalization is
useless, since canonically equivalent strings can consist of
different code point sequences and then compare unequal. Hence,
normalization will often be implemented right away. Therefore, it
might be useful to provide STRING-NORMALIZE-NF{C,D} (maybe even
NFKC/NFKD).
Cf. http://www.unicode.org/faq/normalization.html#1
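
A small illustration of the problem, sketched in Python with its
standard unicodedata module (the function names merely mirror the
procedures proposed above and are hypothetical):

```python
import unicodedata

precomposed = "\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def string_normalize_nfc(s):
    """Return s in Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", s)


def string_normalize_nfd(s):
    """Return s in Normalization Form D (canonical decomposition)."""
    return unicodedata.normalize("NFD", s)


def code_points(s):
    """Return the code point vector of s as a list of integers."""
    return [ord(c) for c in s]
```

Both strings display as the same letter, yet their code point
vectors differ ([0xE9] versus [0x65, 0x301]), so a plain
comparison reports them as unequal. After passing both through
the same normalization form, they compare equal.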

If this is not included, a rationale should be added to the
document. At least it should mention normalization somewhere.

Greetings,
        -- Jorgen

--
((email . "xxxxxx@forcix.cx") (www . "http://www.forcix.cx/")
 (gpg   . "1024D/028AF63C")   (irc . "nick forcer on IRCnet"))