Re: Surrogates and character representation
Alan Watson 25 Jul 2005 17:23 UTC
> By the same token, random-access disks are a useless feature, for they
> can be replaced by sequential-access DECtapes that can be rewound and
> selectively rewritten. But at a price.
Files actually provide a fairly close analogy to the commonest means of
representing Unicode strings.
Imagine a file system that implements files as streams of bytes. Now
imagine that you want to read the Nth *line*. The only way to do this is
to read through the file until you have encountered N-1 newlines. This is
like finding the Nth character when using UTF-8 for strings.
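As a rough sketch of that sequential scan (the function name utf8_nth_char is made up for illustration, and the input is assumed to be well-formed UTF-8), finding a character means counting lead bytes from the start of the string:

```python
def utf8_nth_char(data: bytes, n: int) -> str:
    """Return the n-th (0-based) code point of well-formed UTF-8 bytes.

    Hypothetical helper: it must walk the bytes from the beginning,
    so the cost grows with n.
    """
    count = -1      # index of the code point most recently started
    start = None    # byte offset where the n-th code point begins
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:          # not a continuation byte: a new code point starts here
            count += 1
            if count == n:
                start = i
            elif count == n + 1:        # next code point begins, so the n-th one has ended
                return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError(n)
    return data[start:].decode("utf-8")  # the n-th code point was the last one
```

The loop plays the role of counting newlines in the byte-stream file.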
Now imagine a file system that implements files as enumerated
random-access records and uses exactly one record for each line. You can
directly read the Nth line. This is like finding the Nth character when
using UTF-32 for strings.
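With a fixed-width 32-bit encoding the lookup is plain arithmetic. A minimal sketch, assuming big-endian UTF-32 without a byte-order mark (utf32_nth_char is an illustrative name, not an existing API):

```python
def utf32_nth_char(data: bytes, n: int) -> str:
    """Return the n-th (0-based) code point of big-endian UTF-32 bytes.

    Hypothetical helper: every code point occupies exactly 4 bytes,
    so the offset is computed directly with no scanning.
    """
    offset = 4 * n
    if n < 0 or offset + 4 > len(data):
        raise IndexError(n)
    return data[offset:offset + 4].decode("utf-32-be")
```

Here the "record number" and the "line number" are the same thing.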
Now imagine a file system that implements files as enumerated
random-access records and uses one or more records for each line. You can
directly read the Nth record, but it need not be the Nth line. This is
like using UTF-16 for strings.
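To make the last case concrete, here is a sketch assuming well-formed big-endian UTF-16 (utf16_nth_char is again a made-up name). The 16-bit units can be indexed directly, but a character outside the Basic Multilingual Plane takes two of them, so finding a character still means walking the units:

```python
import struct

def utf16_nth_char(data: bytes, n: int) -> str:
    """Return the n-th (0-based) code point of well-formed big-endian UTF-16 bytes.

    Hypothetical helper: code units are fixed-size "records", but a
    code point may span two of them (a surrogate pair).
    """
    units = struct.unpack(">%dH" % (len(data) // 2), data)
    i = 0        # index into the 16-bit code units
    count = 0    # index of the code point starting at units[i]
    while i < len(units):
        # A high surrogate (0xD800-0xDBFF) starts a two-unit pair.
        width = 2 if 0xD800 <= units[i] <= 0xDBFF else 1
        if count == n:
            return data[2 * i:2 * (i + width)].decode("utf-16-be")
        i += width
        count += 1
    raise IndexError(n)
```

For a string such as "a", U+1D11E, "z", unit number 2 is the second half of a surrogate pair, while character number 2 is "z".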
Regards,
Alan
--
Dr Alan Watson
Centro de Radioastronomía y Astrofísica
Universidad Astronómico Nacional de México