Re: TR29 word boundary use cases
John Cowan 08 Dec 2013 17:18 UTC
Alex Shinn scripsit:
> I've been reviewing the TR29 word boundary algorithm for implementation,
> and it strikes me as a rather complicated way to do only part of
> the job.
Basically true. As you know, the word "word" does not really have a
language-independent meaning.
> For example, it breaks sequences of hiragana on every codepoint, but
> chunks all consecutive Thai letters into a single word.
Japanese/Chinese and Thai/Lao are explicitly called out as places where
the algorithm is not good enough and needs to be supplemented by further
information.
I think the word breaks between every hiragana letter reflect the fact
that each such place is a line break opportunity, whereas in Thai, line
break opportunities come only between actual words, which you can only
find (absent ZWSP characters) with a Thai morphology engine.
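To illustrate the ZWSP case only: when a Thai text does carry explicit
U+200B ZERO WIDTH SPACE marks, recovering the words is just a split. The
following is a minimal Python sketch under that assumption. The function
name split_on_zwsp and the sample string are made up for illustration, and
this is not part of TR29 or of any library's API.

```python
from typing import List

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def split_on_zwsp(text: str) -> List[str]:
    """Split text at explicit ZWSP marks, dropping empty pieces.

    Hypothetical helper: it only works when the author or an earlier
    tool has already inserted ZWSP between words.
    """
    return [piece for piece in text.split(ZWSP) if piece]


# Two Thai words joined by an explicit ZWSP (illustrative sample).
sample = "ภาษา" + ZWSP + "ไทย"
print(split_on_zwsp(sample))  # ['ภาษา', 'ไทย']
```

Without the ZWSP the same call returns the whole run as one piece, which
is exactly the situation where a dictionary or morphology engine is needed.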
--
John Cowan xxxxxx@ccil.org http://ccil.org/~cowan
If he has seen farther than others,
it is because he is standing on a stack of dwarves.
--Mike Champion, describing Tim Berners-Lee (adapted)