It is certainly more general, but it feels like putting two distinct
responsibilities (segmenting the text and processing each segmented item)
into a single procedure, and it can be awkward to use.

In particular, the cluster splitter specified in UAX #29 requires lookahead,
and such a splitter could be implemented more efficiently if it were allowed
to keep state, but allowing only procedures of type Text -> (a, Text)
excludes such internal state.

Clustering algorithms can be provided as generators (e.g. Text -> (() -> Grapheme)
or Text -> (() -> Word), etc.). Such generators can be combined freely with
generator mapping operators, and they cover the use cases of the proposed
generalized text-map etc. (The only case they can't cover is when you want to
switch granularity as you consume input. I assume that's a rather rare case, though.)
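
To make the generator idea concrete, here is a minimal sketch in portable
R7RS, using strings in place of texts. The names string->word-generator,
gen-map and gen->list are hypothetical (they are not taken from any SRFI or
from Gauche), and splitting on spaces is only a stand-in for the real
UAX #29 word rules:

```scheme
(import (scheme base) (scheme char) (scheme write))

;; Hypothetical word generator: String -> (() -> String or eof-object).
;; The current position is kept as internal state between calls, which a
;; Text -> (a, Text) procedure has no place to store.
;; ASSUMPTION: words are runs of non-space characters, not UAX #29 words.
(define (string->word-generator str)
  (let ((pos 0) (len (string-length str)))
    (lambda ()
      ;; Skip any spaces before the next word.
      (let skip ()
        (when (and (< pos len) (char=? (string-ref str pos) #\space))
          (set! pos (+ pos 1))
          (skip)))
      (if (>= pos len)
          (eof-object)
          (let ((start pos))
            ;; Advance to the end of the word.
            (let scan ()
              (when (and (< pos len)
                         (not (char=? (string-ref str pos) #\space)))
                (set! pos (+ pos 1))
                (scan)))
            (substring str start pos))))))

;; Generator mapping operator; it does not care about granularity.
(define (gen-map proc gen)
  (lambda ()
    (let ((item (gen)))
      (if (eof-object? item) item (proc item)))))

;; Drain a generator into a list.
(define (gen->list gen)
  (let loop ((acc '()))
    (let ((item (gen)))
      (if (eof-object? item)
          (reverse acc)
          (loop (cons item acc))))))

(write (gen->list (gen-map string-upcase
                           (string->word-generator "foo bar baz"))))
;; writes ("FOO" "BAR" "BAZ")
```

Segmentation lives entirely in the generator, and the per-item processing
is an ordinary one-argument procedure, so the two responsibilities stay
separate.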

Cf. Gauche employs a generator-based approach for clustering:
http://practical-scheme.net/gauche/man/?p=Unicode+text+segmentation

On Thu, Jun 9, 2016 at 3:24 AM, John Cowan <xxxxxx@mercury.ccil.org> wrote:

I put this in briefer form into the bottom of another email, so I'm
repeating it here under its own subject line so it won't be lost sight of.

In a Unicode world, codepoint granularity for mapping functions will
often be too fine, but we cannot say for all processing tasks what the
correct granularity is. Sometimes it is code points, sometimes it is
grapheme clusters (legacy or extended), sometimes it is whole words or
larger textual units. See UAX #29 at <http://unicode.org/reports/tr29/>
for discussions of these terms, and note that this is an official part
of the Unicode Standard.

To generalize over all of these, I propose changing the procedures passed
to textual-map, textual-fold, etc. so that they accept a text as their
argument, namely what has yet to be processed, and return two texts: the
rest of the text as yet unprocessed and the processed result of this call.
Thus a procedure that wants to process codepoint-by-codepoint uses
(text-ref t 0) to examine the first codepoint of its argument t, and
(subtext t 1) to get the first value to return.
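
A rough sketch of the shape this would take, in portable R7RS with strings
standing in for texts (so string-ref and substring replace text-ref and
subtext). The driver segment-map and the two example procedures are
hypothetical names used only for illustration:

```scheme
(import (scheme base) (scheme char) (scheme write))

;; Hypothetical driver: PROC receives the unprocessed remainder and
;; returns two values, the new remainder and the processed piece.
;; ASSUMPTION: PROC always consumes at least one character.
(define (segment-map proc str)
  (let loop ((rest str) (pieces '()))
    (if (string=? rest "")
        (apply string-append (reverse pieces))
        (call-with-values
            (lambda () (proc rest))
          (lambda (new-rest piece)
            (loop new-rest (cons piece pieces)))))))

;; Codepoint granularity: consume exactly one character per call.
(define (upcase-one s)
  (values (substring s 1 (string-length s))
          (string (char-upcase (string-ref s 0)))))

;; Coarser granularity: consume either a single space or a whole run
;; of non-space characters, and return that run reversed.
(define (reverse-word s)
  (let ((len (string-length s)))
    (if (char=? (string-ref s 0) #\space)
        (values (substring s 1 len) " ")
        (let scan ((i 0))
          (if (and (< i len) (not (char=? (string-ref s i) #\space)))
              (scan (+ i 1))
              (values (substring s i len)
                      (list->string
                       (reverse (string->list (substring s 0 i))))))))))

(display (segment-map upcase-one "abc def"))   ; displays ABC DEF
(newline)
(display (segment-map reverse-word "abc def")) ; displays cba fed
(newline)
```

The same driver serves both granularities; only the procedure passed to
it decides how much of the remaining input each call consumes.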

In the case of textual-for-each, the second value could be required but
ignored, or could just not be returned; I'm uncertain which is better.
Because of the awkwardness of handling optional multiple values in Scheme,
one or the other should be chosen.

For the unfolds, the mapper argument should return a text rather than
a character.

--
John Cowan http://www.ccil.org/~cowan xxxxxx@ccil.org
May the hair on your toes never fall out! --Thorin Oakenshield (to Bilbo)