strings draft Tom Lord (22 Jan 2004 04:58 UTC)
Re: strings draft Shiro Kawai (22 Jan 2004 09:46 UTC)
Re: strings draft Tom Lord (22 Jan 2004 17:32 UTC)
Re: strings draft Shiro Kawai (23 Jan 2004 05:03 UTC)
Re: strings draft Tom Lord (24 Jan 2004 00:31 UTC)
Re: strings draft Matthew Dempsky (24 Jan 2004 03:00 UTC)
Re: strings draft Shiro Kawai (24 Jan 2004 03:27 UTC)
Re: strings draft Tom Lord (24 Jan 2004 04:18 UTC)
Re: strings draft Shiro Kawai (24 Jan 2004 04:49 UTC)
Re: strings draft Tom Lord (24 Jan 2004 18:47 UTC)
Re: strings draft Shiro Kawai (24 Jan 2004 22:16 UTC)
Octet vs Char (Re: strings draft) Shiro Kawai (26 Jan 2004 09:58 UTC)
Re: Octet vs Char (Re: strings draft) bear (26 Jan 2004 19:04 UTC)
Re: Octet vs Char (Re: strings draft) Matthew Dempsky (26 Jan 2004 20:12 UTC)
Re: Octet vs Char (Re: strings draft) Matthew Dempsky (26 Jan 2004 20:40 UTC)
Re: Octet vs Char (Re: strings draft) Ken Dickey (27 Jan 2004 04:33 UTC)
Re: Octet vs Char Shiro Kawai (27 Jan 2004 05:12 UTC)
Re: Octet vs Char Tom Lord (27 Jan 2004 05:23 UTC)
Re: Octet vs Char bear (27 Jan 2004 08:35 UTC)
Re: Octet vs Char (Re: strings draft) bear (27 Jan 2004 08:33 UTC)
Re: Octet vs Char (Re: strings draft) Ken Dickey (27 Jan 2004 15:43 UTC)
Re: Octet vs Char (Re: strings draft) bear (27 Jan 2004 19:06 UTC)
Re: Octet vs Char Shiro Kawai (26 Jan 2004 23:39 UTC)
Strings, one last detail. bear (30 Jan 2004 21:12 UTC)
Re: Strings, one last detail. Shiro Kawai (30 Jan 2004 21:43 UTC)
Re: Strings, one last detail. Tom Lord (31 Jan 2004 00:13 UTC)
Re: Strings, one last detail. bear (31 Jan 2004 20:26 UTC)
Re: Strings, one last detail. Tom Lord (31 Jan 2004 20:42 UTC)
Re: Strings, one last detail. bear (01 Feb 2004 02:29 UTC)
Re: Strings, one last detail. Tom Lord (01 Feb 2004 02:44 UTC)
Re: Strings, one last detail. bear (01 Feb 2004 07:53 UTC)
Re: strings draft bear (22 Jan 2004 19:05 UTC)
Re: strings draft Tom Lord (23 Jan 2004 01:53 UTC)
READ-OCTET (Re: strings draft) Shiro Kawai (23 Jan 2004 06:01 UTC)
Re: strings draft bear (23 Jan 2004 07:04 UTC)
Re: strings draft bear (23 Jan 2004 07:20 UTC)
Re: strings draft Tom Lord (24 Jan 2004 00:02 UTC)
Re: strings draft Alex Shinn (26 Jan 2004 01:59 UTC)
Re: strings draft Tom Lord (26 Jan 2004 02:22 UTC)
Re: strings draft bear (26 Jan 2004 02:35 UTC)
Re: strings draft Tom Lord (26 Jan 2004 02:48 UTC)
Re: strings draft Alex Shinn (26 Jan 2004 03:00 UTC)
Re: strings draft Tom Lord (26 Jan 2004 03:14 UTC)
Re: strings draft Shiro Kawai (26 Jan 2004 04:57 UTC)
Re: strings draft Alex Shinn (26 Jan 2004 04:58 UTC)
Re: strings draft tb@xxxxxx (23 Jan 2004 18:48 UTC)
Re: strings draft bear (24 Jan 2004 02:21 UTC)
Re: strings draft tb@xxxxxx (23 Jan 2004 02:10 UTC)
Re: strings draft Tom Lord (23 Jan 2004 02:29 UTC)
Re: strings draft tb@xxxxxx (23 Jan 2004 02:44 UTC)
Re: strings draft Tom Lord (23 Jan 2004 02:53 UTC)
Re: strings draft tb@xxxxxx (23 Jan 2004 03:04 UTC)
Re: strings draft Tom Lord (23 Jan 2004 03:16 UTC)
Re: strings draft tb@xxxxxx (23 Jan 2004 03:42 UTC)
Re: strings draft Alex Shinn (23 Jan 2004 02:35 UTC)
Re: strings draft tb@xxxxxx (23 Jan 2004 02:42 UTC)
Re: strings draft Tom Lord (23 Jan 2004 02:49 UTC)
Re: strings draft Alex Shinn (23 Jan 2004 02:58 UTC)
Re: strings draft tb@xxxxxx (23 Jan 2004 03:13 UTC)
Re: strings draft Alex Shinn (23 Jan 2004 03:19 UTC)
Re: strings draft Bradd W. Szonye (23 Jan 2004 19:31 UTC)
Re: strings draft Alex Shinn (26 Jan 2004 02:22 UTC)
Re: strings draft Bradd W. Szonye (06 Feb 2004 23:30 UTC)
Re: strings draft Bradd W. Szonye (06 Feb 2004 23:33 UTC)
Re: strings draft Alex Shinn (09 Feb 2004 01:45 UTC)
specifying source encoding (Re: strings draft) Shiro Kawai (09 Feb 2004 02:51 UTC)
Re: strings draft Bradd W. Szonye (09 Feb 2004 03:39 UTC)
Re: strings draft tb@xxxxxx (23 Jan 2004 03:12 UTC)
Re: strings draft Alex Shinn (23 Jan 2004 03:28 UTC)
Re: strings draft tb@xxxxxx (23 Jan 2004 03:44 UTC)
Parsing Scheme [was Re: strings draft] Ken Dickey (23 Jan 2004 17:02 UTC)
Re: Parsing Scheme [was Re: strings draft] bear (23 Jan 2004 17:56 UTC)
Re: Parsing Scheme [was Re: strings draft] tb@xxxxxx (23 Jan 2004 18:50 UTC)
Re: Parsing Scheme [was Re: strings draft] Per Bothner (23 Jan 2004 18:56 UTC)
Re: Parsing Scheme [was Re: strings draft] Tom Lord (23 Jan 2004 20:26 UTC)
Re: Parsing Scheme [was Re: strings draft] Per Bothner (23 Jan 2004 20:57 UTC)
Re: Parsing Scheme [was Re: strings draft] Tom Lord (23 Jan 2004 21:44 UTC)
Re: Parsing Scheme [was Re: strings draft] Ken Dickey (23 Jan 2004 21:47 UTC)
Re: Parsing Scheme [was Re: strings draft] Tom Lord (23 Jan 2004 23:22 UTC)
Re: Parsing Scheme [was Re: strings draft] Ken Dickey (25 Jan 2004 01:03 UTC)
Re: Parsing Scheme [was Re: strings draft] Tom Lord (25 Jan 2004 03:01 UTC)
Re: Parsing Scheme [was Re: strings draft] Tom Lord (23 Jan 2004 20:07 UTC)
Re: Parsing Scheme [was Re: strings draft] tb@xxxxxx (23 Jan 2004 21:22 UTC)
Re: Parsing Scheme [was Re: strings draft] Tom Lord (23 Jan 2004 22:38 UTC)
Re: Parsing Scheme [was Re: strings draft] tb@xxxxxx (24 Jan 2004 06:48 UTC)
Re: Parsing Scheme [was Re: strings draft] Tom Lord (24 Jan 2004 18:41 UTC)
Re: Parsing Scheme [was Re: strings draft] tb@xxxxxx (24 Jan 2004 19:34 UTC)
Re: Parsing Scheme [was Re: strings draft] Tom Lord (24 Jan 2004 21:48 UTC)
Re: strings draft Matthew Dempsky (25 Jan 2004 06:59 UTC)
Re: strings draft Tom Lord (25 Jan 2004 07:16 UTC)
Re: strings draft Matthew Dempsky (26 Jan 2004 23:52 UTC)
Re: strings draft Tom Lord (27 Jan 2004 00:30 UTC)

Re: strings draft Shiro Kawai 26 Jan 2004 04:57 UTC

>From: Tom Lord <xxxxxx@emf.net>
Subject: Re: strings draft
Date: Sun, 25 Jan 2004 19:27:58 -0800 (PST)

> I'm not aware enough about the details of Shiro's.  All I'm (sort of)
> aware of is that he's dealing with a EUCJP -- which sounds very
> challenging if you want to wind up with an implementation suitable for
> intensive string processing.   (Unicode is similarly challenging.)

To be precise, what I'm dealing with is to use a CES-independent
multibyte string representation, currently including utf-8, EUC-JP,
and Shift_JIS.  EUC-CN, EUC-TW and EUC-KR should be supported
easily.   I exclude stateful encodings, like ISO2022
with stateful escape sequences.  (EUC is subset of ISO2022,
but it only uses single shift escape, thus effectively it's stateless).

Tom, if you have specific instance that discourages mb string,
I'm curious about hearing it, either off-list or on-list.

The following is a discussion that why I think mb string is
feasible.  Those who aren't interested can skip it.

 * * *

"Intensive string processing" would vary for application domains.
The domain I'm looking at has these properties:

  * very frequent use of regexp.
  * strings are hardly mutated.
  * lots of data passing between external programs/libraries.

Regexp engine can be implemented on multibyte strings almost as
efficient as "uniform character array" string, by compiling regexp
into octet-stream NFA/DFA.  Currently the only penalty of my
implementation is when you use a character range including large
character set.  It can be optimized, I think.

Regexp is heavily used to extract a part of string.  Returning
substring directly is very efficient if you share the string body.
Using string indices can be actually less efficient, even if you
use uniform character array strings.

Multibyte representation doesn't necessarily put a penalty to
use large corpora; e.g. suffix array can be constructed and used
efficiently using byte index (actually, any kind of string reference).

Most external libraries and programs nowadays require strings
to be passed in some sort of multibyte format.  If you can use
the same multibyte format internally, sending and receiving
data have little overhead.   It may not help when you're writing
a program that will be used on wide variety of environments,
but it is an advantage if you're writing in-house tools where
you have knowledge of which encoding is used in the environment.

Of course I don't insist multibyte strings is generally superior.
Actually I'm not 100% sure multibyte strings doesn't have serious
problems.  But I was curious, so I started implementing it,
and haven't seen a serious problem yet, though there are
some unresolved issues (like how to tread illegal byte sequences).

--shiro