I've been reviewing the TR29 word boundary
algorithm for implementation, and it strikes me
as a rather complicated way to do only part of
the job. For example, it breaks sequences of
hiragana on every codepoint, but chunks all
consecutive Thai letters into a single word. It
seems more useful to consistently split
aggressively and then use a separate step to
recompose as needed, or to split conservatively
and then use a separate step to segment further.
But the TR29 algorithm does neither.
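To make the contrast concrete, here's a toy sketch in Python (my own illustration, not PyICU or any real TR29 implementation) of the two default outcomes described above, using crude codepoint-range script checks as a stand-in for real property lookups:

```python
# Toy illustration of the asymmetry described above -- NOT the full
# UAX #29 (TR29) algorithm, just its default outcome for two scripts:
#   - Hiragana has no chaining rule, so a break falls between every pair.
#   - Thai is left to dictionary-based segmentation, so the default
#     rules emit one long chunk per run of Thai letters.

def script_of(ch):
    """Crude script classifier by codepoint block (assumption for this sketch)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "Hiragana"
    if 0x0E00 <= cp <= 0x0E7F:
        return "Thai"
    return "Other"

def default_word_chunks(text):
    """Mimic the default TR29 result for these two scripts:
    split between hiragana codepoints, keep Thai runs whole."""
    chunks = []
    for ch in text:
        if chunks and script_of(ch) == "Thai" \
                and script_of(chunks[-1][-1]) == "Thai":
            chunks[-1] += ch   # consecutive Thai letters chunk together
        else:
            chunks.append(ch)  # hiragana (and everything else) breaks here
    return chunks

print(default_word_chunks("ひらがな"))  # one chunk per hiragana codepoint
print(default_word_chunks("ภาษาไทย"))   # all Thai letters in a single chunk
```

A real implementation such as ICU's word BreakIterator layers dictionary lookup on top of the Thai case, which is exactly the kind of separate post-segmentation step I'm describing.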
Indeed, my company does a lot of text
processing, and we split words in many ways,
ranging from simplistic splitters that require
post-processing to very sophisticated
natural-language-aware segmenters, but to my
knowledge we don't use the TR29 algorithm
anywhere. Does anyone have real-world uses
of the TR29 word boundary algorithm they
could share?
Thanks,
--
Alex