I've been reviewing the TR29 word boundary
algorithm for implementation, and it strikes me
as a rather complicated way to do only part of
the job. For example, it breaks sequences of
hiragana on every codepoint, but chunks all
consecutive Thai letters into a single word. It
seems more useful to consistently split
aggressively and then use a separate step to
recompose as needed, or to split conservatively
and then use a separate step to segment further.
But the TR29 algorithm does neither.
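To make the contrast concrete, here's a toy sketch in Python (my own illustration, not PyICU or any real TR29 implementation) of the two default outcomes described above, using crude codepoint-range script checks as a stand-in for real property lookups:

```python
# Toy illustration of the asymmetry described above -- NOT the full
# UAX #29 (TR29) algorithm, just its default outcome for two scripts:
#   - Hiragana has no chaining rule, so a break falls between every pair.
#   - Thai is left to dictionary-based segmentation, so the default
#     rules emit one long chunk per run of Thai letters.

def script_of(ch):
    """Crude script classifier by codepoint block (assumption for this sketch)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "Hiragana"
    if 0x0E00 <= cp <= 0x0E7F:
        return "Thai"
    return "Other"

def default_word_chunks(text):
    """Mimic the default TR29 result for these two scripts:
    split between hiragana codepoints, keep Thai runs whole."""
    chunks = []
    for ch in text:
        if chunks and script_of(ch) == "Thai" \
                and script_of(chunks[-1][-1]) == "Thai":
            chunks[-1] += ch   # consecutive Thai letters chunk together
        else:
            chunks.append(ch)  # hiragana (and everything else) breaks here
    return chunks

print(default_word_chunks("ひらがな"))  # one chunk per hiragana codepoint
print(default_word_chunks("ภาษาไทย"))   # all Thai letters in a single chunk
```

A real implementation such as ICU's word BreakIterator layers dictionary lookup on top of the Thai case, which is exactly the kind of separate post-segmentation step I'm describing.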
Indeed, my company does a lot of text
processing, and we split words in many ways,
ranging from simplistic splitters that require
post-processing to very sophisticated
natural-language-aware segmenters, but to my
knowledge we don't use the TR29 algorithm
anywhere. Does anyone have real-world uses
of the TR29 word boundary algorithm they
could share?
Thanks,
--
Alex