Re: More comments, and the ANTLR code is too complex

Re: More comments, and the ANTLR code is too complex David A. Wheeler 07 Jul 2013 23:09 UTC
On 29 May 2013 02:31:25 -0400, Mark H Weaver posted a long set of comments,
including a number of specific changes.

Below are his comments and my responses, *except* for two issues:
1. Rewriting the grammar so it's not LL(1).
2. Allowing "#;" to introduce an indented comment.
These two items are more complicated; I think it'll be better to discuss them separately.

Mark H Weaver went through the spec with a fine-toothed comb and found various things
that needed fixing.  THANK YOU!!!  I had fixed a few of them earlier, but by no means all;
I think I've now analyzed every item.  In a few cases I don't think a change is needed,
but I explain why.  If that argument isn't enough, maybe it *does* need changes...
so let's talk.

In general: If anyone has thoughts about the text below, please post!

--- David A. Wheeler

====================================================

> * "BLOCK_COMMENT : '#|' // This is #| ... #|"
> That should be "#| ... |#"

Fixed.

> * EOL_SEQUENCE is never used. EOL is used instead, even though it is
> not defined.

Fixed.  I didn't summarize the ANTLR grammar correctly at that point.

> * APOSW, QUASIQUOTEW, UNQUOTEW, and UNQUOTE_SPLICEW are not defined.

They're defined in the previous text; I've tried to clarify their
defining text.

> * Inconsistent syntax is used within {} in the ANTLR. In most places
> standard Scheme syntax is used, but in 'collecting_tail', the syntax
> is more like C.
> * Why are the action rules in 'n_expr' simply expressions that refer to
> values such as '$n1', but the action rules of 'collecting_tail' are
> instead assignment statements that refer to values such as '$more.v'?

As noted earlier, this was my fault, and I had already fixed most of these.
I think it's all fixed now; let me know of any others like that.

> * Why is there special handling of (FF | VT)+ EOL ?

I wanted to create a special limitation: lines with FF or VT must be
all by themselves, and not mixed with other text.
I think if they're mixed it's confusing.
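
To make that restriction concrete, here is a rough sketch in Python (not
the reference implementation).  The function name and the assumption that
the end-of-line sequence has already been stripped are mine, purely for
illustration:

```python
FF = "\f"  # form feed
VT = "\v"  # vertical tab

def check_ff_vt_line(line: str) -> bool:
    """Return True if `line` satisfies the FF/VT restriction.

    `line` is one input line with its end-of-line sequence removed
    (an assumption of this sketch).
    """
    if FF not in line and VT not in line:
        return True  # ordinary line; the restriction does not apply
    # A line containing FF or VT must consist of nothing else.
    return all(ch in (FF, VT) for ch in line)
```

So a line holding just FF (or FF followed by VT) passes, while a line
that mixes FF with other text fails the check.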

> * What does 'isperiodp' do exactly? What if the datum really is "." or
> the symbol whose name is a single period? (written #{.}# in Guile).

Good point, that needs defining.  Here's my try:

 The BNF body production uses an isperiodp(x) function,
 which returns true iff x is the datum "." and begins
 with a period.
 This is used so that "a . b" is recognized as the pair (a . b),
 while "a |.| b" is the 3-element list "(a |.| b)".
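
A tiny sketch of how I read that definition, in Python.  It works on the
raw token text rather than on real datums, and read_three_tokens is a
hypothetical helper that only handles the two example lines, so treat it
as an illustration and nothing more:

```python
def isperiodp(raw_text: str) -> bool:
    """True only for a bare period as written in the source text.

    An escaped spelling such as |.| names the same symbol but does not
    begin with a period, so it is not treated as the pair marker.
    """
    return raw_text == "."

def read_three_tokens(line: str):
    """Hypothetical helper: read 'x . y' as a pair, anything else as a list."""
    tokens = line.split()
    if len(tokens) == 3 and isperiodp(tokens[1]):
        return ("pair", tokens[0], tokens[2])
    return ("list", *tokens)
```

With this, "a . b" comes back as a pair and "a |.| b" as a 3-element list.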

> * The non-terminals 'body' and 'it_expr' use the symbol 'same' even
> though the text implies that no extra symbol is generated by the
> preprocessing step in that case. Where does 'same' come from?

The "same" nonterminal isn't generated by anything; it's an
empty nonterminal I use as a comment (and as an error checker).
I've rewritten the text about it, in the hopes that it's clearer:

 The non-terminal <code>same</code> is an empty sequence;
 it acts as a comment to emphasize where there is a new
 line with unchanged indentation.

And here are some comments about the tutorial:

> * I'd like to see a few more examples for improper lists, such as:

Okay, I added more text to the tutorial to clarify this:

"A single delimited period (.) still sets the value of the cdr field of a pair.
If the period is not the first datum on the line, the next datum on the
line is the cdr value.
If the period is the only datum on the line, then the next (sibling) line
is the cdr value.
A period at the beginning of a line, with one datum after it,
escapes that datum (just like neoteric-expressions do);
a period at the end of a line, with at least one datum before it,
is considered the same as |.|."

NOTE: The way "." is handled at the end of a line, after other text,
is controlled by the "rest" rule.  We could turn it into a
"line continuation" semantic, an error, or something else...
if that seems like a good idea.  The rationale for the current
situation is simply to make it easier to use the datum ".".
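
The four cases in that tutorial text can be summarized as a small
classification function.  This is only a sketch in Python; the role names
are made up for illustration, and any case the quoted text does not
mention is reported as such rather than guessed at:

```python
def period_role(datums, index):
    """Classify the delimited period at datums[index] on a single line.

    `datums` is the list of datum texts on the line (an assumption of
    this sketch; a real reader works on parsed datums).
    """
    if datums[index] != ".":
        raise ValueError("datums[index] is not a period")
    count = len(datums)
    if count == 1:
        return "cdr-is-next-sibling-line"   # period alone on the line
    if index == 0:
        # Only "period, then exactly one datum" is described in the text.
        return "escapes-next-datum" if count == 2 else "not-covered-by-text"
    if index == count - 1:
        return "same-as-|.|"                # trailing period after other data
    return "next-datum-is-cdr"              # e.g. "a . b"
```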

> * In the tutorial, I found the examples of $ (SUBLIST) a bit confusing:
> a b $ c d ==> (a b (c d))
>
> a b $ c d e f $ g ==> (a b (c d e f g))
> ; Not (a b (c d e f (g)))
>
> This leaves me uncertain of whether the second case is somehow
> caused by two $'s on one line, or because there's only one item
> after the $. I'd like to see an example like "a b $ c" or
> "a b $ c d e $ f g" to clarify.

Ah, good point.  I've separated the "one datum on the right"
example from the "multiple $" example; that should help.
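
For what it's worth, the single-line behavior of $ can be modeled in a
few lines of Python.  This is a simplified sketch, not the reference
implementation: it ignores child lines, it assumes the line is already
split into tokens, and it refuses lines where $ has nothing on one side
because those cases aren't discussed here.  The key point is that a line
with exactly one datum stands for that datum itself, not a list:

```python
def read_line(tokens):
    """Model one line of datums, with '$' as the SUBLIST marker."""
    if not tokens:
        raise ValueError("empty line is outside this sketch")
    if "$" not in tokens:
        # One datum is just that datum; several datums form a list.
        return tokens[0] if len(tokens) == 1 else list(tokens)
    split_at = tokens.index("$")
    head, rest = tokens[:split_at], tokens[split_at + 1:]
    if not head or not rest:
        raise ValueError("'$' with an empty side is outside this sketch")
    # Everything right of '$' is read as its own line and appended
    # to the head as a single item.
    return head + [read_line(rest)]
```

Under this model, read_line("a b $ c d".split()) gives the nested list
for (a b (c d)), and "a b $ c d e f $ g" gives (a b (c d e f g)).  As I
read the rule, "a b $ c" would give (a b c) and "a b $ c d e $ f g" would
give (a b (c d e (f g))), so the single datum on the right is what
matters, not the number of $ markers.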

> * "A sweet-expression reader MUST support three modes: indentation
> processing, enclosed, and initial indent."
> [...]
> "A marker MUST only have its special meaning when indentation
> processing is enabled,"
> This sounds as if "*>" MUST not be recognized, because the reader
> will be in "enclosed" mode at that point, no?

No.  As noted two sentences below,
"The reader temporarily switches to enclosed mode when it is reading inside
any unescaped pairs of parentheses, brackets, or curly braces."

This means that ( *> )  is a list of one datum element (named "*>");
the "*>" only has its special meaning outside any parens.
This helps with backwards compatibility; markers like "<*", "$", and "*>"
have no special meaning inside (...).

Now it *is* true that a "*>" without a preceding matching "<*" is an error,
just like ")" without a preceding matching "(" is an error.
The reference implementation will report this error if that happens:
  Error: Closing *> without preceding matching <*
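
Here is a rough Python sketch of that behavior, assuming the input has
already been split into tokens with each paren, bracket, and brace as its
own token.  The function and variable names are hypothetical, and the
sketch leaves out the other marker conditions (such as the surrounding
hspace requirements):

```python
OPENERS = {"(", "[", "{"}
CLOSERS = {")", "]", "}"}
MARKERS = {"<*", "*>", "$"}

def find_markers(tokens):
    """Return the indices of tokens that act as markers."""
    enclosure_depth = 0    # > 0 means the reader is in enclosed mode
    collecting_depth = 0   # number of currently open <* ... *> lists
    marker_positions = []
    for i, tok in enumerate(tokens):
        if tok in OPENERS:
            enclosure_depth += 1
        elif tok in CLOSERS:
            if enclosure_depth == 0:
                raise SyntaxError("Closing delimiter without matching opener")
            enclosure_depth -= 1
        elif tok in MARKERS and enclosure_depth == 0:
            if tok == "<*":
                collecting_depth += 1
            elif tok == "*>":
                if collecting_depth == 0:
                    raise SyntaxError("Closing *> without preceding matching <*")
                collecting_depth -= 1
            marker_positions.append(i)
    return marker_positions
```

For the tokens of "( *> )" this returns no marker positions, since the
"*>" is just a datum inside the parens; for "<* a *>" it reports the
first and last tokens as markers.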

> * "2. If top is the empty string and the indentation length is nonzero,
> symbol INITIAL_INDENT is generated and the reader changes to initial
> indent mode. When an end-of-line sequence is reached the mode changes
> back to indentation processing."
>
> If the reader was in "enclosed" mode, then presumably the mode
> should not change back to indentation processing, right?

That list doesn't apply if you're in enclosed mode.  Its intro says:
"Then, when a line is read and the current mode is not enclosed,
the line indentation is removed and possibly replaced by other generated
symbols as follows (where &#8220;top&#8221; is the
value of the top of the indentation stack):"

I think the problem is that the mode condition is in
the middle of a sentence, hiding it.
I'll tweak that sentence to (hopefully) make that clearer, mainly
by moving that mode condition to the beginning of the sentence
(to emphasize it).

How's this?:

"Then, when the current mode is <i>not</i> enclosed, the line indentation
(if any) is read, removed, and possibly replaced by other generated
symbols as follows (where &#8220;top&#8221; is the
value of the top of the indentation stack):"

> * "1. If an end-of-line sequence immediately follows the indentation and
> the indentation length is nonzero:
> a. If the indentation contains “!”, it is ignored; an
> implementation MUST consume the end-of-line sequence and start
> applying these rules again, from the beginning, with the next
> line.
> b. If the indentation does not contain “!”, it is considered a
> line with no characters (thus indentation has zero length) and
> the rest of these rules are applied."
>
> I vaguely recall that the distinction here was going to be removed
> as a simplification of the rules. Was that idea scrapped?

This was changed around May 1-4; see the discussion
"Why forbid ! in whitespace-only line?" circa May 1, e.g.:
http://srfi.schemers.org/srfi-110/mail-archive/msg00130.html
Basically, in more-deeply indented constructs, removing the distinction
is inconvenient and error-prone.
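
The distinction in rule 1 amounts to the following, sketched in Python.
The function name and the return labels are made up for illustration, and
I'm assuming the indentation characters are space, tab, and "!":

```python
INDENT_CHARS = {" ", "\t", "!"}  # assumed set of indentation characters

def classify_indent_only_line(indentation: str) -> str:
    """Classify a line whose nonzero-length indentation is immediately
    followed by an end-of-line sequence (rule 1 in the quoted text)."""
    if not indentation:
        raise ValueError("rule 1 only applies to nonzero-length indentation")
    if any(ch not in INDENT_CHARS for ch in indentation):
        raise ValueError("not an indentation-only line")
    if "!" in indentation:
        # Rule 1a: consume the end-of-line and restart with the next line.
        return "ignore-line"
    # Rule 1b: treat as a line with no characters (zero-length indentation).
    return "treat-as-empty-line"
```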

> * "A marker MUST only have its special meaning when indentation
> processing is enabled, it is preceded by indentation or hspace, it is
> followed by an hspace or end-of-line, and when it starts with the
> character shown (e.g., neither |$| nor '$ contains a marker)."
>
> The last clause here, "when it starts with the character shown", is
> poorly worded IMO, and redundant with the requirement that "it is
> preceded by indentation or hspace".

It may be poorly worded, but it's not redundant.
The point is that |$| does not have the same meaning as $.
I'm not sure how to reword that one; suggestions welcome.