Matthew Flatt <xxxxxx@cs.utah.edu> writes:
> * `string-normalize-nfd', `string-normalize-nfkd',
> `string-normalize-nfc', and `string-normalize-nfkc', which each
> accept a string and produce its normalization according to normal
> form D, KD, C, or KC, respectively.
If the basic concept of the SRFI - a string being a sequence of
code points - does not change, I do think these procedures are
useful (contrary to what bear and Alex Shinn argue). An implementation can
still normalize internally in the "usual case", and if the
programmer enforces a different normalization, that's eir problem.
STRING=? and similar procedures need to define which kind of
normalization they work on (or just "the same normalization for
all arguments").
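To illustrate why the comparison has to say which normalization it
works on, here is a rough sketch in Python (not Scheme, and not a
proposal for the SRFI). The name string_eq_nfc is made up for this
example; it stands in for a STRING=? that compares under NFC.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

def string_eq_nfc(a, b):
    # Hypothetical STRING=? analogue: normalize both arguments to NFC
    # before comparing them code point by code point.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# A plain code point comparison sees two different sequences ...
raw_equal = composed == decomposed
# ... while the normalizing comparison treats them as the same string.
normalized_equal = string_eq_nfc(composed, decomposed)
```

Here raw_equal is false and normalized_equal is true, which is
exactly the difference the specification has to pin down.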
STRING-DOWNCASE, STRING-APPEND etc. need to define whether they
may normalize their arguments, and if so, which normalization they
return. If the normalization shouldn't be prescribed, another
procedure, STRING-NORMALIZE (or similar), needs to be added to
return the normalization the implementation prefers.
A higher-level string API can (and should) be built on top of the
strings defined in this SRFI.
> The #\newline character
> -----------------------
>
> It is likely that #\newline will be removed from Scheme leaving only
> #\linefeed. Since R6RS will pin down characters to Unicode scalar
> values, the right name for the character is #\linefeed.
I'm always in favor of breaking stuff to get a clean result.
> Another view is that #\newline can serve as an abstraction of the
> end-of-line character sequence which is returned by read-char
> when the end-of-line character sequence is read (be it
> #\linefeed, or #\return, or #\return followed by #\linefeed).
> So even though #\newline and #\linefeed are the same characters,
> Scheme programs might use #\newline to highlight that the
> character is being used to denote the end-of-line sequence. The
> name #\newline would also reinforce the link with the escape
> sequence "\n" in strings.
If #\newline is considered to be some kind of abstraction of the
end-of-line character sequence, please remember that Unicode
defines U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR as
unambiguous line and paragraph break code points, precisely to get
rid of all these distinctions.
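As a sketch of what such an abstraction would have to cover, the
following Python fragment maps every end-of-line sequence mentioned
above, plus the two Unicode separators, to a single linefeed. The
name normalize_eol is an assumption of mine, and collapsing
PARAGRAPH SEPARATOR into a plain line break is a simplification.

```python
import re

# CR LF is listed first so that it is replaced as one unit rather
# than as two separate line breaks.
_EOL = re.compile("\r\n|[\r\n\u2028\u2029]")

def normalize_eol(text):
    # Replace each end-of-line sequence with a single linefeed.
    return _EOL.sub("\n", text)
```

A reader built along these lines could hand the program one
character for every end of line, whatever the input used.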
> Escape sequences
> ----------------
> with semi-colon terminator    without terminator
>
> "A\x42;C" = "ABC"             "A\x42\x43" = "ABC"
> "\x41;\x42;\x43;" = "ABC"     "\x41\x42\x43" = "ABC"
> "\x03BB;x.x" = "λx.x"         "\x03BBx.x" = "λx.x"
I agree with bear that the semicolon is a bad choice - why not use
the colon?
"A\x42:C"
"\x41:\x42:\x43:"
"\x03BB:x.x"
> Using less-than and greater-than characters, which are not actual
> brackets, avoids this problem:
>
> #\x<03BB> = #\λ
Braces have been offered as an alternative:
#\x{03BB}
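Either delimited form, the colon terminator or the braces, can be
decoded without guessing where the hex digits end. A rough Python
sketch, purely for illustration; make_decoder and the two decoders
built from it are hypothetical names, not part of any proposal:

```python
import re

def make_decoder(opener, closer):
    # Build a decoder for escapes written as: backslash, "x", opener,
    # one or more hex digits, closer.
    pattern = re.compile(
        r"\\x" + re.escape(opener) + r"([0-9A-Fa-f]+)" + re.escape(closer)
    )

    def decode(literal):
        # Replace each escape with the character for its code point.
        return pattern.sub(lambda m: chr(int(m.group(1), 16)), literal)

    return decode

decode_colon = make_decoder("", ":")     # \x42:
decode_braces = make_decoder("{", "}")   # \x{42}

assert decode_colon(r"A\x42:C") == "ABC"
assert decode_braces(r"\x{03BB}x.x") == "\u03bbx.x"
```

Without any delimiter, an escape such as \x42 followed by the letter
C would be read as the single code point U+042C, since C is itself a
hex digit.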
> However, they become somewhat more difficult to read when multiple
> escapes appear in a string:
>
> "\x<41>\x<42>\x<43>" = "ABC"
"\x{41}\x{42}\x{43}"
> In either case, the trade-off is that Scheme strings are unlikely to be
> compatible with any other language's string syntax. A consequence is
> that there is additional burden on the programmer, who must learn yet
> another string and character syntax.
I do think it's good that we don't go with bad decisions made by
other languages just because the decision has been made by them.
> Symbol characters
> -----------------
> [...]
> Meanwhile, the symbol escapes are similar yet not identical to the
> escapes in strings and characters, so there is a potential for mistakes
> if the programmer is not careful. For example one might expect a\nb to
> be a valid symbol, but it is an error.
Why not allow the same escapes in symbols and in strings?
All in all I like the changes you propose (modulo the comments
above). Thanks for the good work!
Regards,
-- Jorgen
--
((email . "xxxxxx@forcix.cx") (www . "http://www.forcix.cx/")
(gpg . "1024D/028AF63C") (irc . "nick forcer on IRCnet"))