Allowing ASCII only, string escapes, and normalization Jorgen Schaefer (28 Jul 2005 18:47 UTC)
Hi there! Some more comments from my side.

Allowing ASCII only
===================

The current draft summarizes two problems with the SRFI as raised on this list: it mandates too much for systems targeted at small devices, and not enough for more sophisticated implementations. I think the SRFI is a good middle ground and allows a transition from old-style string processing to newer, more sophisticated designs. The latter problem can therefore only be addressed by SRFIs which specify the better interfaces.

To mitigate the former problem, I went over the draft again looking for places where it precludes an implementation from supporting just ASCII. There is not much. If an implementation were allowed to signal an error on unsupported code points, it would be trivial to support only ASCII (or Latin-1), as the code points 0-127 (0-255) are identical in Unicode and ASCII (Latin-1). This would open the specification to small devices. (Even for other character sets, an implementation would only need a simple translation table, signalling errors on all other code points.)

This would mean that an implementation can support Unicode code points fully or partially, just as implementations can support the numeric tower fully or partially.

String Escapes
==============

My biggest problem with the current draft is still the \x/\u/\U escape syntax. More and more, I have come to think that delimited escapes are the way to go; specifically, parenthesized escapes, i.e.

    "Foo\x(0A)Bar"

This has a number of advantages. We don't need \u and \U any more, as there is no ambiguity about what is part of the escape and what is not. It is easy to read. And it is even friendly to users coming from other languages: if a \x escape is not followed by a parenthesis, an appropriate syntax error can be signalled, even explaining the correct syntax. If the latter is deemed less important than being able to write \x0A directly, the parentheses might be required only for hex sequences whose length differs from two.
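To make the proposal concrete, here is a small sketch of how a reader might decode such a delimited escape and report the suggested syntax error. It is written in Python rather than Scheme only so it is easy to try out; the function name and exact behaviour are my own invention, not draft text:

```python
def parse_hex_escape(s, i):
    """Decode a delimited hex escape of the proposed form \\x(0A).

    `s` is the string being read and `i` is the index of the 'x'
    (i.e. the character right after the backslash).  Returns the
    decoded character and the index of the first character after
    the escape.  Hypothetical sketch, not part of any SRFI draft.
    """
    assert s[i] == 'x'
    # The proposal: if \x is not followed by '(', signal a syntax
    # error that explains the correct syntax.
    if i + 1 >= len(s) or s[i + 1] != '(':
        raise SyntaxError(
            'a \\x escape must be followed by a parenthesized '
            'hex code, e.g. "\\x(0A)"')
    close = s.index(')', i + 2)
    code_point = int(s[i + 2:close], 16)
    return chr(code_point), close + 1

# "Foo\x(0A)Bar" would decode the escape to a newline:
ch, nxt = parse_hex_escape("x(0A)Bar", 0)   # ch == "\n", nxt == 5
```

Because the closing parenthesis delimits the escape, the hex sequence can be any length (`\x(A)`, `\x(263A)`, ...) without ambiguity about where the following text begins.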
That problem does not exist for character constants, as those are delimited otherwise anyway, so #\xA20 is always unambiguous. Hence we can drop u and U from character constants as well.

This (type of) syntax even has precedent, in Perl 6 of all languages. Apparently, they use \x{263A} in strings, and additionally allow \x[263A] and \x<263A> in regular expressions. All types of bracketing are optional and only used for disambiguation. Cf.

    http://www.perl.com/pub/a/2002/06/04/apo5.html?page=7
    http://www.mail-archive.com/perl6xxxxxx@perl.org/msg00140.html

(I don't think we should adopt such a DWIM attitude; requiring the parentheses, and using only a single kind, looks like the best way to me.)

Normalization
=============

String comparison on code point vectors without normalization is useless, so normalization will often be implemented right away. It might therefore be useful to provide STRING-NORMALIZE-NF{C,D} (maybe even NFKC/NFKD). Cf.

    http://www.unicode.org/faq/normalization.html#1

If this is not included, a rationale should be added to the document; at the very least, the document should mention normalization somewhere.

Greetings,
-- Jorgen

-- 
((email . "xxxxxx@forcix.cx")
 (www . "http://www.forcix.cx/")
 (gpg . "1024D/028AF63C")
 (irc . "nick forcer on IRCnet"))
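As an illustration of the normalization point above, here is a Python example (Python's standard `unicodedata` module standing in for the proposed STRING-NORMALIZE-NF{C,D}, which of course exists nowhere yet):

```python
import unicodedata

# "é" can be written as one precomposed code point or as a base
# letter plus a combining mark.  A raw code-point comparison, as on
# code point vectors, treats these as different strings.
precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # "e" + COMBINING ACUTE ACCENT
assert precomposed != decomposed

# After normalizing both sides to the same form, they compare equal:
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```

This is exactly why code-point-wise comparison without a normalization step is of little use for real text, and why NFC/NFD procedures tend to get implemented right away.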