Re: TR29 word boundary use cases John Cowan 08 Dec 2013 17:18 UTC

Alex Shinn scripsit:

> I've been reviewing the TR29 word boundary algorithm for implementation,
> and it strikes me as a rather complicated way to do only part of
> the job.

Basically true.  As you know, the word "word" does not really have a
language-independent meaning.

> For example, it breaks sequences of hiragana on every codepoint, but
> chunks all consecutive Thai letters into a single word.

Japanese/Chinese and Thai/Lao are explicitly places where the algorithm
is not good enough, and needs to be supplemented by further information.
I think the fact that word breaks appear between every hiragana letter
is a reflection of the fact that each such place is a line break
opportunity; whereas in Thai, line break opportunities come only between
actual words, which you can only find (absent ZWSP characters) with a
Thai morphology engine.
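
To make the ZWSP case concrete, here is a minimal sketch in Python of
recovering Thai words when the author has already marked the boundaries
with U+200B ZERO WIDTH SPACE.  The function name and the sample string
are made up for illustration; this is not part of TR29, and text without
ZWSP would still need a dictionary or morphology engine.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def split_on_zwsp(text):
    """Return the non-empty pieces of text that are separated by ZWSP.

    Hypothetical helper: it only works when the boundaries were marked
    explicitly; it does no linguistic analysis of its own.
    """
    return [piece for piece in text.split(ZWSP) if piece]


# Hypothetical sample: two Thai words with a ZWSP inserted between them.
sample = "ภาษา" + ZWSP + "ไทย"
words = split_on_zwsp(sample)
print(len(words))  # number of ZWSP-delimited pieces
```

If the ZWSP is missing, the same call returns the whole run as a single
piece, which is exactly the situation that calls for a morphology engine.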

--
John Cowan  xxxxxx@ccil.org  http://ccil.org/~cowan
If he has seen farther than others,
        it is because he is standing on a stack of dwarves.
                --Mike Champion, describing Tim Berners-Lee (adapted)