Re: Simplifying SRFI 109, part 1: entities
Per Bothner 26 Feb 2013 06:46 UTC
[Sorry for sitting on this one for a while. I didn't forget, but
I needed to get some other things out of the way first.]
On 02/10/2013 12:04 AM, John Cowan wrote:
> This is the first of two posts proposing simplifications (reductions in
> scope) for SRFI 109. The idea is that by removing variable elements,
> this SRFI (unlike SRFIs 107 and 108) becomes purely lexical in scope:
> a SRFI-109-capable reader returns the same thing for
> a SRFI-109 string literal and a regular string, viz. an immutable Scheme
> string object.
See my reply to part 2: Enclosed expressions are IMO a prime
feature of string *quasi*-literals. Thus in general the reader
can't return a literal string.
The reader *could* return a literal string in cases where
there are no enclosed expressions, but I feel uncomfortable with
that - it seems a bit hacky and inconsistent. For read/write
round-tripping we have the traditional string literals, so I
think it is cleaner to have the &{...} always return a ($string$ ...)
form.
> In this first post, I argue against the provision of user-defined
> entity names. Currently, when an entity reference appears in a SRFI
> 109 string literal, it is expanded into the identifier $entity:<name>$,
> where <name> is the entity referred to. Thus &{Rom&acirc;nia} expands
> to ($string$ "Rom" $entity:acirc$ "nia"). In principle, this permits a
> user to rebind $entity:acirc$ to something else. However, there seems no
> reason why this should be allowed; it is only productive of confusion.
> Such entity references should just expand directly to the character, so
> that &{Rom&acirc;nia} becomes ($string$ "România"), or just "România".
If we accept that we always get a ($string$ ...) form, that greatly reduces
the benefit of the reader expanding named characters. And there are
advantages to deferring it.
Deferring character name lookup allows user-defined character names
- or in general entity names (which can be longer strings).
Not hard-wiring in entity names is especially important for SRFI-107,
since the XML/SGML model does allow user-defined entity names.
Having these be hard-wired into the reader is not IMO in the spirit
of XML. Even if the reader uses a user-programmable table it would
be information-losing for the reader to expand the entity names.
Even then, using a programmable read-time lookup table is
clearly less "Schemey" than using regular expand-time name-lookup.
If we defer entity name lookup for SRFI-107 then I think we should
do the same for SRFI-108 and SRFI-109, for simplicity.
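To make the expand-time approach concrete, here is a minimal sketch in
plain Scheme. It assumes the reader turns &{Rom&acirc;nia} into the
($string$ ...) form quoted above. The definitions of $string$ and
$entity:acirc$ below, and the $entity:project$ name, are illustrative
assumptions, not the SRFI's reference implementation.

```scheme
;; Assumption: every part is already a string. A real $string$ would
;; presumably also have to handle non-string enclosed expressions.
(define ($string$ . parts)
  (apply string-append parts))

;; A standard character entity is just an ordinary binding.
(define $entity:acirc$ "â")

;; What the reader is assumed to produce for &{Rom&acirc;nia}:
($string$ "Rom" $entity:acirc$ "nia")        ; evaluates to "România"

;; A user-defined entity (hypothetical name) needs no reader support:
(define $entity:project$ "Kawa")
($string$ "Built with " $entity:project$)    ; evaluates to "Built with Kawa"

;; Ordinary lexical scoping applies, so a local rebinding is possible:
(let (($entity:acirc$ "a"))
  ($string$ "Rom" $entity:acirc$ "nia"))     ; evaluates to "Romania"
```

Since entity names are ordinary identifiers here, the usual binding
constructs decide what a name means, and the reader needs no lookup
table of its own.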
> Nor is it likely that anyone will need character entities past the 2237
> already provided by the standard W3C list. It is already a requirement
> that systems not add names that conflict with any of these. True, you
> cannot write (say) Hindi in the Devanagari script using character entity
> references only. But if you are going to do that, you will probably
> want to use a UTF-8 compatible editor with appropriate fonts.
>
> I therefore believe that character entities should be expanded directly
> into characters by the implementation. This eliminates one of the
> use cases for requiring SRFI 109 string literals to expand into calls
> on $string$.
> I would also strengthen, from a MAY to a SHOULD, the
> recommendation to implement the whole standard list.
I did that in the new draft. Let me know what you think of the change.
(I've also implemented this for Kawa.) I should probably also state
that an implementation MUST support the standard Scheme character names.
--
--Per Bothner
xxxxxx@bothner.com http://per.bothner.com/