words, punctuation, and whitespace

words, punctuation, and whitespace Aubrey Jaffer 20 Jul 2005 02:51 UTC

Having written many text processing applications in Scheme, I have
found plain R5RS poorly suited to "bespoke" parsers; so I use several
SLIB modules for string-level infrastructure:

  (require 'string-search)
  (require 'string-port)
  (require 'string-case)
  (require 'line-i/o)

These SRFI-75 discussions dealing with character attributes are
leading me to believe that, knowing only one language well, I will be
unable to write language-portable programs.  But why are we working at
the character or even the string level?

The first task in writing text-processing programs is to separate the
input text into words, punctuation, and whitespace.  Could R6RS deal
with Unicode text as words, punctuation, and whitespace?

  Unicode-read port

would return a word, punctuation, or whitespace object; or an
eof-object.
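
As a rough illustration of what such a reader might look like, here is
a minimal sketch restricted to what `char-alphabetic?', `char-numeric?',
and `char-whitespace?' can classify.  The name `token-read' and the
(kind . string) token representation are assumptions made for this
example, not the proposed `Unicode-read'; a real version would need
language-aware word-boundary rules.  The usage lines assume
`open-input-string' (SRFI 6 or R7RS), which plain R5RS lacks.

```scheme
;; Sketch only: `token-read' and its (kind . string) tokens are
;; hypothetical, not part of any SRFI or of SLIB.

;; Classify a character as whitespace, word, or punctuation.
(define (char-kind c)
  (cond ((char-whitespace? c) 'whitespace)
        ((or (char-alphabetic? c) (char-numeric? c)) 'word)
        (else 'punctuation)))

;; Read one maximal run of same-kind characters from PORT.
;; Returns (kind . string), or the eof-object at end of input.
(define (token-read port)
  (let ((c (peek-char port)))
    (if (eof-object? c)
        c
        (let ((kind (char-kind c)))
          (let loop ((chars (list (read-char port))))
            (let ((next (peek-char port)))
              (if (and (not (eof-object? next))
                       (eq? kind (char-kind next)))
                  (loop (cons (read-char port) chars))
                  (cons kind (list->string (reverse chars))))))))))

;; Usage, assuming `open-input-string' is available:
(define sample-port (open-input-string "Hello, world"))
(token-read sample-port)   ; => (word . "Hello")
(token-read sample-port)   ; => (punctuation . ",")
(token-read sample-port)   ; => (whitespace . " ")
```

Grouping runs of punctuation into one token is an arbitrary choice
here; returning each punctuation character separately would be equally
defensible.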

A procedure named `Unicode-write' or `Unicode-display' would write a
word, punctuation, or whitespace object to a port.  Perhaps `display'
can serve this purpose.

With case-sensitivity, symbols look like good candidates for word
objects.  Words as symbols would seem to make multilingual Scheme
programs possible.

Lists or vectors of these objects would represent multilingual text
compactly without character size or encoding issues.
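
To make the list representation concrete, here is a small sketch under
one possible reading of the idea: words are symbols, while punctuation
and whitespace are kept as strings.  The names `token->string',
`text->string', `text-display', and `count-word' are hypothetical
helpers invented for this example.  Symbols are built with
`string->symbol' so that case is preserved even where the reader folds
case.

```scheme
;; Sketch only: symbols for words, strings for punctuation and
;; whitespace.  This representation is an assumption for illustration.
(define sample-text
  (list (string->symbol "Hello") "," " " (string->symbol "world") "!"))

;; Convert one token back to its textual form.
(define (token->string tok)
  (if (symbol? tok) (symbol->string tok) tok))

;; Reassemble a list of tokens into a single string.
(define (text->string tokens)
  (apply string-append (map token->string tokens)))

;; Write a list of tokens to PORT, one token at a time.
(define (text-display tokens port)
  (for-each (lambda (tok) (display (token->string tok) port))
            tokens))

;; Count occurrences of a word; symbols compare with `eq?', so no
;; character-level inspection is needed.
(define (count-word word tokens)
  (let loop ((ts tokens) (n 0))
    (cond ((null? ts) n)
          ((eq? word (car ts)) (loop (cdr ts) (+ n 1)))
          (else (loop (cdr ts) n)))))

(text->string sample-text)                          ; => "Hello, world!"
(count-word (string->symbol "world") sample-text)   ; => 1
```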

As evidence that one can deal with multilanguage text at a high level,
consider http://swiss.csail.mit.edu/~jaffer/Scheme.html.jis.  Although
I know no Japanese, I cobbled together this Japanese and English page
by cutting and pasting from Japanese web pages.