Request for review of my binary encoding proposal
  John Cowan (17 Sep 2019 22:39 UTC)
Re: Request for review of my binary encoding proposal
  Lassi Kortela (18 Sep 2019 00:35 UTC)
> I'd like you (Lassi, but anyone else who wants to, of course) to closely
> review my ASN.1 LER pre-pre-SRFI. This should make it clear that I am
> *not* against binary encoding in all contexts. I'm cc-ing Schemepersist
> faute de mieux.

Thanks, I appreciate the invitation. Do you have a document up, or is
everything in that Google sheet at the moment?

> (The API is trivial: one procedure to write a Scheme object to a port,
> one to read a Scheme object from a port, and a predicate for "uncodable
> object".)

Sounds very good.

> LER is *almost* a superset of X.690 ASN.1 DER (Distinguished Encoding
> Rules). Each object is encoded as follows:
>
> 1) Type. A 1-2 byte code which identifies the type of the serialized
> object and whether its content is raw bytes or sub-objects. Some codes
> are X.690 standard, others are "private specification" codes (as
> distinct from "private use", which we don't define, leaving it to
> application programmers). There is only one 2-byte code (ISO 8601
> duration) and I wouldn't weep too hard if we left it out.
>
> 2) Length. A length of 1-127 inclusive is encoded in one byte. A greater
> length is encoded as one byte with value 128+k, where k is the number of
> bytes (between 1 and 126 inclusive) that represent the actual length.
> The next k bytes are the actual length as a big-endian base-256 value.
> Practical values of k are probably 1, 2, 4, 8.
>
> 3) The content itself, either raw bytes or contained objects. Note that
> all numeric content is big-endian.
>
> The Google Sheet at http://tinyurl.com/asn1-ler gives the proposed type
> codes and what they mean.
>
> The only deviation from DER is that sets do not have to be sorted into
> binary lexicographic order.

This all sounds like a very good idea, as does ASN.1 in general.
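The length rule in step 2 is easy to get subtly wrong, so here is a minimal sketch of it (Python for illustration only; the function names are my own, not part of the proposal):

```python
def encode_length(n: int) -> bytes:
    """DER-style length: 0-127 fits in one byte; a larger length is
    one byte with value 128+k followed by k big-endian length bytes."""
    if n < 0:
        raise ValueError("length must be non-negative")
    if n <= 127:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body

def decode_length(data: bytes) -> tuple[int, int]:
    """Return (length, number of header bytes consumed)."""
    first = data[0]
    if first <= 127:
        return first, 1
    k = first & 0x7F
    if len(data) < 1 + k:
        raise ValueError("truncated length header")
    return int.from_bytes(data[1:1 + k], "big"), 1 + k
```

For example, a 300-byte payload gets the three-byte header `0x82 0x01 0x2C` (128+2, then 300 in big-endian base 256).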
Unfortunately the devil is in the details: ASN.1 has a reputation for being badly over-engineered (even when limited to its binary encodings, never mind the XML one), and there have been numerous bugs, many of them with security implications, in parsers over the years. I tried to understand the format once, but it was so complex for what it did that I gave up. Basically, here's a Reddit thread full of people facepalming at the complexity of the format and the resulting pain: <https://www.reddit.com/r/programming/comments/1hf7ds/useful_old_technologies_asn1/>. Here's an equivalent Hacker News thread: <https://news.ycombinator.com/item?id=8871453>.

I hate to be against yet another proposal, a binary format even, but hackers all seem to be against it, and all for the same reason. Sorry about that. So a Scheme implementation may be a very good idea if we need it for a particular purpose (and someone volunteers to write all that code). But if we don't, then I'd recommend favoring simpler protocols. If ASN.1 is needed, it may be worth considering wrapping one of the many C implementations.

Your list of data types looks good if a comprehensive format is needed. But I would leave out all of the space-optimized ones unless someone has measured that some specific task is too slow. I'd go with varints for all numbers. For space savings without a performance penalty, the recent crop of fast compression algorithms (LZ4, Zstandard, Snappy) is amazing.

The f32vector etc. numeric arrays are a neat idea, but really large numeric arrays tend to come from some matrix package (R, Matlab, TensorFlow, etc.), and it may make sense to use a format native to those packages. There are unlikely to be many people who juggle huge matrices in Scheme without interfacing to one of those de facto standard matrix environments.

As for endianness, again I'd just use varints. I sound like a broken record, but few people get how amazing they are.
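To make the varint recommendation concrete, here is a minimal LEB128-style sketch (Python for illustration; this is the same base-128 continuation-bit scheme Protocol Buffers uses for unsigned integers):

```python
def encode_varint(n: int) -> bytes:
    """Unsigned LEB128: 7 data bits per byte, least significant group
    first; the high bit is set on every byte except the last."""
    if n < 0:
        raise ValueError("unsigned varint only")
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def decode_varint(data: bytes) -> tuple[int, int]:
    """Return (value, number of bytes consumed)."""
    n = shift = 0
    for i, b in enumerate(data):
        n |= (b & 0x7F) << shift
        if not b & 0x80:
            return n, i + 1
        shift += 7
    raise ValueError("truncated varint")
```

Small numbers take one byte, and there is no endianness question at all: `encode_varint(300)` yields the two bytes `0xAC 0x02`.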
They make short work of practically any data encoding problem that doesn't require extreme performance. For space savings, try wrapping it all in LZ4 or Zstandard first. A SRFI for one of those would be a fine idea IMHO :)

I don't have a strong opinion on the bag, mapping, range etc. types. My intuition favors simplicity, because the value of data lies in exchange, and the more intricate a format is, the fewer environments it can be exchanged with. To port S-expressions to a new environment, you have to implement lists, symbols, strings and integers. Every type you add to that means more porting work, which usually means people bother to port to fewer environments.

It may make sense for some applications to add more data types if they are really useful. But generally I'd bet against it (no concrete arguments here, just intuition). For anything subject to network effects, a bigger network adds value much faster than technical sophistication, as Lispers are all too aware.

This all brings to mind the old saying: "Recursive-descent" is computer-science jargon for "simple enough to write on a liter of Coke".
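As a footnote, the wrap-it-all-in-compression suggestion is just this pattern (sketched with Python's stdlib zlib as a stand-in, since LZ4 and Zstandard bindings are third-party; the wrap/unwrap shape is identical):

```python
import zlib  # stand-in for LZ4/Zstandard; same pattern either way

def pack(payload: bytes) -> bytes:
    """Compress an already-serialized payload before writing it out."""
    return zlib.compress(payload)

def unpack(blob: bytes) -> bytes:
    """Decompress a blob back into the serialized payload."""
    return zlib.decompress(blob)

# Repetitive data, like many small S-expressions, compresses well.
payload = b"(vector 1 2 3 4 5 6 7 8)\n" * 1000
blob = pack(payload)
assert unpack(blob) == payload
assert len(blob) < len(payload)
```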