Re: Simplifying SRFI 109, part 1: entities Per Bothner 26 Feb 2013 06:46 UTC
[Sorry for sitting on this one for a while.  I didn't forget, but I
needed to get some other things out of the way first.]

On 02/10/2013 12:04 AM, John Cowan wrote:
> This is the first of two posts proposing simplifications (reductions in
> scope) for SRFI 109.  The idea is that by removing variable elements,
> this SRFI (unlike SRFIs 107 and 108) becomes purely lexical in scope:
> the output of a SRFI-109-capable reader returns the same thing for
> a SRFI-109 string literal and a regular string, viz. an immutable Scheme
> string object.

See my reply to part 2: Enclosed expressions are IMO a prime feature of
string *quasi*-literals.  Thus in general the reader can't return a
literal string.  The reader *could* return a literal string in cases
where there are no enclosed expressions, but I feel uncomfortable with
that - it seems a bit hacky and inconsistent.  For read/write
round-tripping we have the traditional string literals, so I think it
is cleaner to have the &{...} syntax always return a ($string$ ...) form.

> In this first post, I argue against the provision of user-defined
> entity names.  Currently, when an entity reference appears in a SRFI
> 109 string literal, it is expanded into the identifier $entity:<name>$,
> where <name> is the entity referred to.  Thus &{Rom&acirc;nia} expands
> to ($string$ "Rom" $entity:acirc$ "nia").  In principle, this permits a
> user to rebind $entity:acirc$ to something else.  However, there seems no
> reason why this should be allowed; it is only productive of confusion.
> Such entity references should just expand directly to the character, so
> that &{Rom&acirc;nia} becomes ($string$ "România"), or just "România".

If we accept that we always get a ($string$ ...) form, that greatly
reduces the benefit of having the reader expand named characters.  And
there are advantages to deferring it.  Deferring character name lookup
allows user-defined character names - or in general entity names (which
can expand to longer strings).

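To make the deferred-lookup point concrete, here is a minimal sketch in
plain Scheme of what the reader output amounts to once evaluated.  It
rests on assumptions: the $string$ defined here is a simplified
stand-in that only concatenates strings (not the definition from the
draft), the binding of $entity:acirc$ is supplied by hand rather than
by an implementation, and $entity:srfi-url$ is a hypothetical
user-defined entity made up for illustration.

```scheme
;; Simplified stand-in for $string$ (an assumption, not the draft's
;; definition): concatenate the arguments, which must all be strings.
(define ($string$ . parts)
  (apply string-append parts))

;; Binding for a standard entity name.  An implementation would supply
;; this; it is defined by hand here.  "\xE2;" is U+00E2, a-circumflex.
(define $entity:acirc$ "\xE2;")

;; Hypothetical user-defined entity.  This only works because the name
;; is looked up as an ordinary identifier at expand time, not by the
;; reader.
(define $entity:srfi-url$ "http://srfi.schemers.org/")

;; Reader output for &{Rom&acirc;nia} under the current draft:
(display ($string$ "Rom" $entity:acirc$ "nia"))
(newline)

;; Reader output for &{See &srfi-url;} with the user-defined entity:
(display ($string$ "See " $entity:srfi-url$))
(newline)
```

If the reader expanded entities itself, the second form could not be
read without first extending a reader-level table, which is the
information-losing step I would like to avoid.
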
Not hard-wiring entity names is especially important for SRFI-107,
since the XML/SGML model does allow user-defined entity names.  Having
these be hard-wired into the reader is not IMO in the spirit of XML.
Even if the reader uses a user-programmable table, it would be
information-losing for the reader to expand the entity names.  Even
then, using a programmable read-time lookup table is clearly less
"Schemey" than using regular expand-time name lookup.

If we defer entity name lookup for SRFI-107 then I think we should do
the same for SRFI-108 and SRFI-109, for simplicity.

> Nor is it likely that anyone will need character entities past the 2237
> already provided by the standard W3C list.  It is already a requirement
> that systems not add names that conflict with any of these.  True, you
> cannot write (say) Hindi in the Devanagari script using character entity
> references only.  But if you are going to do that, you will probably
> want to use a UTF-8 compatible editor with appropriate fonts.
>
> I therefore believe that character entities should be expanded directly
> into characters by the implementation.  This eliminates one of the
> use cases for requiring SRFI 109 string literals to expand into calls
> on $string$.

> I would also strengthen, from a MAY to a SHOULD, the
> recommendation to implement the whole standard list.

I did that in the new draft.  Let me know what you think of the change.
(I've also implemented this for Kawa.)

I should probably also state that an implementation MUST support the
standard Scheme character names.
-- 
	--Per Bothner
xxxxxx@bothner.com   http://per.bothner.com/