Re: words, punctuation, and whitespace

words, punctuation, and whitespace Aubrey Jaffer (20 Jul 2005 02:50 UTC)

Re: words, punctuation, and whitespace Thomas Bushnell BSG (20 Jul 2005 03:09 UTC)

Re: words, punctuation, and whitespace Alex Shinn (20 Jul 2005 03:32 UTC)

Re: words, punctuation, and whitespace John.Cowan (20 Jul 2005 04:02 UTC)

Re: words, punctuation, and whitespace Alex Shinn 20 Jul 2005 03:31 UTC

On 7/20/05, Aubrey Jaffer <xxxxxx@alum.mit.edu> wrote:
>
> The first task in writing text-processing programs is to separate the
> input text into words, punctuation, and whitespace.  Could R6RS deal
> with Unicode text as words, punctuation, and whitespace?

Unfortunately, no.

>   Unicode-read port
>
> would return a word, punctuation, or whitespace object; or an
> eof-object.

In the general case this is an AI-complete problem.  Chinese, Japanese
and Thai (at least) don't use whitespace to separate words, so finding
word boundaries requires dictionary lookups and natural language
processing.
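
To make that concrete, here is a rough Python sketch of the naive
approach: classify each character as whitespace, punctuation, or
"word" from its Unicode properties, and merge adjacent characters of the
same kind.  The names classify and naive_tokens are made up for this
illustration; nothing like them is proposed in the SRFI.

```python
import unicodedata

def classify(ch):
    """Classify one character using only its Unicode properties."""
    if ch.isspace():
        return "whitespace"
    if unicodedata.category(ch).startswith("P"):
        return "punctuation"
    return "word"  # everything else is lumped together (an assumption)

def naive_tokens(text):
    """Split text into (kind, string) pairs with no dictionary at all."""
    tokens = []
    for ch in text:
        kind = classify(ch)
        # Extend the previous token when the kind repeats; each
        # punctuation character stays a token of its own.
        if tokens and tokens[-1][0] == kind and kind != "punctuation":
            tokens[-1] = (kind, tokens[-1][1] + ch)
        else:
            tokens.append((kind, ch))
    return tokens

print(naive_tokens("Hello, world"))
print(naive_tokens("私は学生です。"))
```

The English input comes out as two words, a comma and a space.  The
Japanese sentence comes out as a single "word" followed by the full
stop, although it contains four words (私 / は / 学生 / です).  Character
properties alone cannot find those boundaries.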

Emacs' forward-word and related procedures use a simple hack to be
useful in Japanese (though not actually breaking at word boundaries),
but are useless in Chinese and Thai.
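
One hack of this kind is to break wherever the script changes, for
example between kanji and hiragana.  The Python sketch below shows the
idea only and is not a description of the Emacs implementation.  The
helper names script_of and script_runs are hypothetical, and using the
first word of the Unicode character name as the "script" is a shortcut
assumed just for this example.

```python
import unicodedata
from itertools import groupby

def script_of(ch):
    """Crude script label: first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def script_runs(text):
    """Break text wherever the crude script label changes."""
    return ["".join(group) for _, group in groupby(text, key=script_of)]

print(script_runs("食べました"))  # kanji followed by hiragana
print(script_runs("我是学生"))    # Chinese: ideographs only
```

The Japanese input splits into 食 and べました, which is handy for cursor
movement but cuts through the verb 食べました.  The Chinese input never
changes script, so it comes back as one run and the hack gives no help
at all.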

So yes, full multilingual word segmentation is very difficult, but
fortunately you rarely need it.  Editors and translation software are
about the only applications I can think of that do, and they will use
specialized libraries anyway.  This SRFI just needs to specify enough
for those libraries to be portable.

--
Alex