Request for review of my binary encoding proposal
  John Cowan (17 Sep 2019 22:39 UTC)
Re: Request for review of my binary encoding proposal
  Lassi Kortela (18 Sep 2019 00:35 UTC)
> I'd like you (Lassi, but anyone else who wants to, of course) to closely
> review my ASN.1 LER pre-pre-SRFI. This should make it clear that I am
> *not* against binary encoding in all contexts. I'm cc-ing Schemepersist
> faute de mieux.

Thanks, I appreciate the invitation. Do you have a document up, or is
everything in that Google sheet at the moment?

> (The API is trivial: one procedure to write a Scheme object to a port,
> one to read a Scheme object from a port, and a predicate for "uncodable
> object".)

Sounds very good.

> LER is *almost* a superset of X.690 ASN.1 DER (Distinguished Encoding
> Rules). Each object is encoded as follows:
>
> 1) Type. A 1-2 byte code which identifies the type of the serialized
> object and whether its content is raw bytes or sub-objects. Some codes
> are X.690 standard, others are "private specification" codes (as
> distinct from "private use", which we don't define, leaving it to
> application programmers). There is only one 2-byte code (ISO 8601
> duration) and I wouldn't weep too hard if we left it out.
>
> 2) Length. A length of 1-127 inclusive is encoded in one byte. A greater
> length is encoded as one byte with value 128+k, where k is the number of
> bytes (between 1 and 126 inclusive) that represent the actual length.
> The next k bytes are the actual length as a big-endian base-256 value.
> Practical values of k are probably 1, 2, 4, 8.
>
> 3) The content itself, either raw bytes or contained objects. Note that
> all numeric content is big-endian.
>
> The Google Sheet at http://tinyurl.com/asn1-ler gives the proposed type
> codes and what they mean.
>
> The only deviation from DER is that sets do not have to be sorted into
> binary lexicographic order.

This all sounds like a very good idea, as does ASN.1 in general.
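The length rule in step 2 is easy to get subtly wrong, so here is a minimal sketch of it (Python for illustration only; the function names are my own, not part of the proposal):

```python
def encode_length(n: int) -> bytes:
    """DER-style length: 0-127 fits in one byte; a larger length is
    one byte with value 128+k followed by k big-endian length bytes."""
    if n < 0:
        raise ValueError("length must be non-negative")
    if n <= 127:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body

def decode_length(data: bytes) -> tuple[int, int]:
    """Return (length, number of header bytes consumed)."""
    first = data[0]
    if first <= 127:
        return first, 1
    k = first & 0x7F
    if len(data) < 1 + k:
        raise ValueError("truncated length header")
    return int.from_bytes(data[1:1 + k], "big"), 1 + k
```

For example, a 300-byte payload gets the three-byte header `0x82 0x01 0x2C` (128+2, then 300 in big-endian base 256).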
Unfortunately the devil is in the details: ASN.1 has a reputation for being badly over-engineered (even when limited to its binary encodings, never mind the XML one), and there have been numerous bugs, many of them with security implications, in parsers over the years. I tried to understand the format once, but it was so complex for what it did that I gave up. Basically, here's a Reddit thread full of people facepalming at the complexity of the format and the resulting pain: <https://www.reddit.com/r/programming/comments/1hf7ds/useful_old_technologies_asn1/>. Here's an equivalent Hacker News thread: <https://news.ycombinator.com/item?id=8871453>.

I hate to be against yet another proposal, a binary format even, but hackers all seem to be against it, and all for the same reason. Sorry about that. So a Scheme implementation may be a very good idea if we need it for a particular purpose (and someone volunteers to write all that code). But if we don't, then I'd recommend favoring simpler protocols. If ASN.1 is needed, it may be worth considering wrapping one of the many C implementations.

Your list of data types looks good if a comprehensive format is needed. But I would leave out all of the space-optimized ones unless someone has measured that some specific task is too slow. I'd go with varints for all numbers. For space savings without a performance penalty, the recent crop of fast compression algorithms (LZ4, Zstandard, Snappy) is amazing.

The f32vector etc. numeric arrays are a neat idea, but really large numeric arrays tend to come from some matrix package (R, Matlab, TensorFlow, etc.), and it may make sense to use a format native to those packages. There are unlikely to be many people who juggle huge matrices in Scheme without interfacing to one of those de facto standard matrix environments.

As for endianness, again I'd just use varints. I sound like a broken record, but few people get how amazing they are.
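To make the varint recommendation concrete, here is a minimal LEB128-style sketch (Python for illustration; this is the same base-128 continuation-bit scheme Protocol Buffers uses for unsigned integers):

```python
def encode_varint(n: int) -> bytes:
    """Unsigned LEB128: 7 data bits per byte, least significant group
    first; the high bit is set on every byte except the last."""
    if n < 0:
        raise ValueError("unsigned varint only")
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def decode_varint(data: bytes) -> tuple[int, int]:
    """Return (value, number of bytes consumed)."""
    n = shift = 0
    for i, b in enumerate(data):
        n |= (b & 0x7F) << shift
        if not b & 0x80:
            return n, i + 1
        shift += 7
    raise ValueError("truncated varint")
```

Small numbers take one byte, and there is no endianness question at all: `encode_varint(300)` yields the two bytes `0xAC 0x02`.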
They make short work of practically any data encoding problem that doesn't require extreme performance. For space savings, try wrapping it all in LZ4 or Zstandard first. A SRFI for one of those would be a fine idea IMHO :)

I don't have a strong opinion on the bag, mapping, range etc. types. My intuition favors simplicity, because the value of data lies in exchange, and the more intricate a format is, the fewer environments it can be exchanged with. To port S-expressions to a new environment, you have to implement lists, symbols, strings and integers. Every type you add to that means more porting work, which usually means people bother to port to fewer environments.

It may make sense for some applications to add more data types if they are really useful. But generally I'd bet against it (no concrete arguments here, just intuition). For anything subject to network effects, a bigger network adds value much faster than technical sophistication, as Lispers are all too aware.

This all brings to mind the old saying: "Recursive-descent" is computer-science jargon for "simple enough to write on a liter of Coke".
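As a footnote, the wrap-it-all-in-compression suggestion is just this pattern (sketched with Python's stdlib zlib as a stand-in, since LZ4 and Zstandard bindings are third-party; the wrap/unwrap shape is identical):

```python
import zlib  # stand-in for LZ4/Zstandard; same pattern either way

def pack(payload: bytes) -> bytes:
    """Compress an already-serialized payload before writing it out."""
    return zlib.compress(payload)

def unpack(blob: bytes) -> bytes:
    """Decompress a blob back into the serialized payload."""
    return zlib.decompress(blob)

# Repetitive data, like many small S-expressions, compresses well.
payload = b"(vector 1 2 3 4 5 6 7 8)\n" * 1000
blob = pack(payload)
assert unpack(blob) == payload
assert len(blob) < len(payload)
```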