Re: Simplifying SRFI 109, part 1: entities

Per Bothner 26 Feb 2013 06:46 UTC
[Sorry for sitting on this one for a while.  I didn't forget, but
I needed to get some other things out of the way first.]

On 02/10/2013 12:04 AM, John Cowan wrote:
> This is the first of two posts proposing simplifications (reductions in
> scope) for SRFI 109.  The idea is that by removing variable elements,
> this SRFI (unlike SRFIs 107 and 108) becomes purely lexical in scope:
> the output of a SRFI-109-capable reader returns the same thing for
> a SRFI-109 string literal and a regular string, viz. an immutable Scheme
> string object.

See my reply to part 2: Enclosed expressions are IMO a prime
feature of string *quasi*-literals.  Thus in general the reader
can't return a literal string.

The reader *could* return a literal string in cases where
there are no enclosed expressions, but I feel uncomfortable with
that - it seems a bit hacky and inconsistent.  For read/write
round-tripping we have the traditional string literals, so I
think it is cleaner to have the &{...} always return a ($string$ ...)
form.
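
To make the shape of that form concrete, here is a minimal sketch.
It assumes $string$ is an ordinary procedure that concatenates its
arguments; that is only an illustration, not the definition in the
draft, and it ignores any marker symbols an implementation might
put around enclosed expressions.  The variable `name` is made up
for the example.

```scheme
;; Sketch only: a stand-in for $string$ that concatenates its parts.
;; Strings pass through; characters and numbers are converted.
(define ($string$ . parts)
  (apply string-append
         (map (lambda (p)
                (cond ((string? p) p)
                      ((char? p) (string p))
                      ((number? p) (number->string p))
                      (else (error "unsupported part" p))))
              parts)))

;; Hand-written equivalent of what a reader might return for
;; &{Hello &[name]!}.  The enclosed expression stays an expression,
;; so the result cannot be a literal string known at read time.
(define name "World")
(define greeting ($string$ "Hello " name "!"))
```

Because `name` is only known at run time, the reader has nothing
better to return than the unevaluated form.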

> In this first post, I argue against the provision of user-defined
> entity names.  Currently, when an entity reference appears in a SRFI
> 109 string literal, it is expanded into the identifier $entity:<name>$,
> where <name> is the entity referred to.  Thus &{Rom&acirc;nia} expands
> to ($string$ "Rom" $entity:acirc$ "nia").  In principle, this permits a
> user to rebind $entity:acirc$ to something else.  However, there seems no
> reason why this should be allowed; it is only productive of confusion.
> Such entity references should just expand directly to the character, so
> that &{Rom&acirc;nia} becomes ($string$ "România"), or just "România".

If we accept that we always get a ($string$ ...) form, that greatly reduces
the benefit of the reader expanding named characters.  And there are
advantages to deferring it.

Deferring character name lookup allows user-defined character names
- or in general entity names (which can be longer strings).
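
As a rough sketch of what deferring buys us: if entity references
stay as identifiers, a user can introduce a new entity with a plain
definition.  The simplified $string$ below and the entity name
`company` are assumptions made up for this example.

```scheme
;; Sketch only: a simplified $string$ that expects string arguments.
(define ($string$ . parts)
  (apply string-append parts))

;; A standard character entity, bound like any other variable.
;; (integer->char #xE2) is a-circumflex, U+00E2.
(define $entity:acirc$ (string (integer->char #xE2)))

;; A hypothetical user-defined entity whose value is a longer string.
(define $entity:company$ "Example Widgets Inc.")

;; Hand-written equivalents of &{Rom&acirc;nia} and
;; &{Copyright &company;}.
(define country ($string$ "Rom" $entity:acirc$ "nia"))
(define notice  ($string$ "Copyright " $entity:company$))
```

No reader table is involved; the names are resolved by the normal
scoping rules, so they can also be shadowed locally with let.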

Not hard-wiring in entity names is especially important for SRFI-107,
since the XML/SGML model does allow user-defined entity names.
Having these be hard-wired into the reader is not IMO in the spirit
of XML.  Even if the reader uses a user-programmable table it would
be information-losing for the reader to expand the entity names.
Even then, using a programmable read-time lookup table is
clearly less "Schemey" than using regular expand-time name-lookup.
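
To illustrate the information-loss point, here is a small sketch
comparing the two datums a reader could produce for
&{Rom&acirc;nia}.  The helper names (entity-name, entity-names) are
made up for this example, and it assumes a Scheme where
$entity:acirc$ reads as a plain symbol.

```scheme
;; Datum if the reader defers entity lookup.
(define deferred '($string$ "Rom" $entity:acirc$ "nia"))

;; Datum if the reader expands the entity itself.
(define expanded
  (list '$string$
        (string-append "Rom" (string (integer->char #xE2)) "nia")))

(define prefix "$entity:")

;; Returns the entity name as a string if x is a symbol of the
;; shape $entity:NAME$, otherwise #f.
(define (entity-name x)
  (and (symbol? x)
       (let* ((s (symbol->string x))
              (n (string-length s))
              (k (string-length prefix)))
         (and (> n (+ k 1))
              (string=? (substring s 0 k) prefix)
              (char=? (string-ref s (- n 1)) #\$)
              (substring s k (- n 1))))))

;; Collects the entity names mentioned in a form, in order.
(define (entity-names form)
  (let loop ((rest form) (acc '()))
    (cond ((null? rest) (reverse acc))
          ((entity-name (car rest))
           => (lambda (name) (loop (cdr rest) (cons name acc))))
          (else (loop (cdr rest) acc)))))
```

With the deferred datum a tool can still see that the source used
the entity named acirc; with the expanded datum that fact is gone.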

If we defer entity name lookup for SRFI-107 then I think we should
do the same for SRFI-108 and SRFI-109, for simplicity.

> Nor is it likely that anyone will need character entities past the 2237
> already provided by the standard W3C list.  It is already a requirement
> that systems not add names that conflict with any of these.  True, you
> cannot write (say) Hindi in the Devanagari script using character entity
> references only.  But if you are going to do that, you will probably
> want to use a UTF-8 compatible editor with appropriate fonts.
>
> I therefore believe that character entities should be expanded directly
> into characters by the implementation.  This eliminates one of the
> use cases for requiring SRFI 109 string literals to expand into calls
> on $string$.

> I would also strengthen, from a MAY to a SHOULD, the
> recommendation to implement the whole standard list.

I did that in the new draft.  Let me know what you think of the change.
(I've also implemented this for Kawa.)  I should probably also state
that an implementation MUST support the standard Scheme character names.
--
	--Per Bothner
xxxxxx@bothner.com   http://per.bothner.com/