Re: Unpaired surrogate handling Amirouche Boubekki 03 Feb 2020 11:12 UTC

On Sun, Jan 26, 2020 at 04:16, Shiro Kawai <xxxxxx@gmail.com> wrote:
>
> My choice is to raise an error, and I made Gauche's json library do so. However, I can think of a case for treating it as a non-Unicode codepoint: when you have to deal with existing JSON files that contain such strings. For example, if you read such JSON, add one field to it and write it out, using the replacement character is not an option, since it loses information.

> I don't know how much such 'sloppy' JSON is in the wild.

> I once worked on emails and there were so many broken emails that libraries that rejected them weren't usable.  I guess it depends on how bad the actual situation is.

> We can say "it is an error to have unpaired surrogates", and leave the interpretation of "error" to each implementation.

What do you think about making the tokenizer, currently called
json-tokens, an optional argument of json-fold?  Since it is easy to
implement json-read on top of json-fold, I would not propagate the
optional argument to json-read, to keep its signature simpler.  That
way, a caller who needs to handle unpaired surrogates in some
particular way could supply their own tokenizer.
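
To make the three policies mentioned in this thread concrete (raise an
error, substitute U+FFFD, or keep the code point so nothing is lost),
here is a rough sketch.  It is in Python rather than Scheme, only
because Python's json module happens to pass lone surrogates through
unchanged, which makes the behaviour easy to try out.  The names
fix_surrogates and read_json and the policy argument are made up for
this illustration; they are not part of the proposed API or of any
library.

```python
import json

def fix_surrogates(s, policy="error"):
    """Apply a policy to surrogate code points left in a decoded string.

    Python's json decoder already combines valid \\uD8xx\\uDCxx escape
    pairs into one code point, so any surrogate still present in the
    decoded string is an unpaired one.
    """
    out = []
    for ch in s:
        if 0xD800 <= ord(ch) <= 0xDFFF:
            if policy == "error":
                raise ValueError("unpaired surrogate U+%04X" % ord(ch))
            if policy == "replace":
                ch = "\ufffd"   # lossy: the original code point is gone
            # policy == "keep": leave the code point untouched
        out.append(ch)
    return "".join(out)

def read_json(text, policy="error"):
    """Hypothetical reader: parse, then apply the policy to every string."""
    def walk(v):
        if isinstance(v, str):
            return fix_surrogates(v, policy)
        if isinstance(v, list):
            return [walk(x) for x in v]
        if isinstance(v, dict):
            return {walk(k): walk(x) for k, x in v.items()}
        return v
    return walk(json.loads(text))

# Read, add one field, write back out: only "keep" preserves the input.
doc = read_json('{"name": "\\ud800"}', policy="keep")
doc["extra"] = 1
print(json.dumps(doc))
```

With "keep", the string survives the read/modify/write cycle; with
"replace" it comes back as U+FFFD; with "error" the read fails.  A
pluggable tokenizer would let each caller make that choice without
changing the reader's signature.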

>
> On Sat, Jan 25, 2020 at 4:25 PM John Cowan <xxxxxx@ccil.org> wrote:
>>
>>
>>
>> On Sat, Jan 25, 2020 at 8:00 PM Shiro Kawai <xxxxxx@gmail.com> wrote:
>>
>>>  It may include such a surrogate as a character with the codepoint of the unpaired surrogate,
>>
>>
>> I think it's a bad idea to even suggest that.  Isolated surrogates have no defined meaning and could only be treated as non-Unicode characters.
>>
>>>
>>> or may raise a JSON error.
>>
>>
>> I believe it's right to raise an error just as if the JSON syntax was invalid, even though isolated surrogates are not actually invalid syntax.  In a forgiving mode they could be replaced by U+FFFD.
>>
>>
>>
>> John Cowan          http://vrici.lojban.org/~cowan        xxxxxx@ccil.org
>> One time I called in to the central system and started working on a big
>> thick 'sed' and 'awk' heavy duty data bashing script.  One of the geologists
>> came by, looked over my shoulder and said 'Oh, that happens to me too.
>> Try hanging up and phoning in again.'  --Beverly Erlebacher
>>