Re: What to use for language/locale identifiers?
Lassi Kortela 28 Jul 2020 20:57 UTC
Thank you for a comprehensive and lucid explanation.
> Let's talk about languages rather than locales, since messages are in a
> language or a variety of a language, and these names are very well
> standardized (Posix, Java, HTTP, even to some extent Windows).
I assume dialects are folded into language subtags. A regional dialect
would perhaps be the only part of a locale besides the main language
that would affect error messages, so if that info is part of the
language tag we're all set.
> For Lisp purposes a list of lower-cased symbols would be fine.
Sounds perfect.
> The first tag is always the primary language. It is a 2-letter ISO
> 639-1 code if there is one, or if there isn't, a 3-letter ISO 639-3
> code. That covers the 7000-odd languages of the world.
>
> The remaining tags have a fixed order (a) before (b) before (c), but any
> or all of them can be omitted:
We could pass zero or more subtags after 'message, as lowercase symbols:
(foreign-error-ref ferr 'message)
(foreign-error-ref ferr 'message 'nv)
(foreign-error-ref ferr 'message 'nv 'fi)
(foreign-error-ref ferr 'message 'sr 'cyrl)
(foreign-error-ref ferr 'message 'zh 'latn)
(foreign-error-ref ferr 'message 'en 'uk)
(foreign-error-ref ferr 'message 'es 419)
(foreign-error-ref ferr 'message 'es 'us)
(foreign-error-ref ferr 'message 'sr 'latn 'mn)
Judging by your examples, 'message needs to be prepared to accept
nonnegative exact integers in addition to symbols -- just as R7RS
library names do: (import (srfi 189)).
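A minimal sketch of that argument check, assuming R7RS. The names
language-subtag? and language-subtags? are hypothetical and not part of
any SRFI 198 draft:

```scheme
(import (scheme base))

;; Hypothetical helper: a subtag is either a symbol or a
;; nonnegative exact integer (such as the region code 419).
(define (language-subtag? x)
  (or (symbol? x)
      (and (integer? x) (exact? x) (>= x 0))))

;; True if LST is a proper list whose elements are all subtags.
;; The empty list counts as valid ("zero or more" subtags).
(define (language-subtags? lst)
  (or (null? lst)
      (and (pair? lst)
           (language-subtag? (car lst))
           (language-subtags? (cdr lst)))))

(language-subtags? '(es 419))      ; => #t
(language-subtags? '(sr latn mn))  ; => #t
(language-subtags? '(es "419"))    ; => #f
```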
GNU software packages are translated to many languages.
translationproject.org is some kind of hub site for translators. Here is
the list of language tags they have:
<https://translationproject.org/PO-files/>.
Almost all of them only have a 2-letter main language.
The ones with a 3-letter main language are:
* ckb
* crh
* fur
The ones that disambiguate using an extra subtag are:
* bn_IN
* en_GB
* en_ZA
* pt_BR
* zh_CN
* zh_HK
* zh_TW
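Names in this underscore-separated style could be turned into the list
form discussed above. The following is only a sketch, assuming R7RS;
po-name->subtags and its helpers are hypothetical names, and only the
underscore separator is handled:

```scheme
(import (scheme base) (scheme char))

;; Split STR at every occurrence of the character SEP.
(define (string-split-char str sep)
  (let loop ((chars (string->list str)) (current '()) (parts '()))
    (cond ((null? chars)
           (reverse (cons (list->string (reverse current)) parts)))
          ((char=? (car chars) sep)
           (loop (cdr chars)
                 '()
                 (cons (list->string (reverse current)) parts)))
          (else
           (loop (cdr chars) (cons (car chars) current) parts)))))

;; True if S is non-empty and consists only of numeric characters.
(define (all-digits? s)
  (and (> (string-length s) 0)
       (let loop ((i 0))
         (or (= i (string-length s))
             (and (char-numeric? (string-ref s i))
                  (loop (+ i 1)))))))

;; All-digit pieces become integers; everything else becomes a
;; lowercase symbol.
(define (string->subtag piece)
  (or (and (all-digits? piece) (string->number piece))
      (string->symbol (string-downcase piece))))

;; Hypothetical converter from a name like "pt_BR" to (pt br).
(define (po-name->subtags name)
  (map string->subtag (string-split-char name #\_)))

(po-name->subtags "pt_BR")   ; => (pt br)
(po-name->subtags "es_419")  ; => (es 419)
(po-name->subtags "ckb")     ; => (ckb)
```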
> The two algorithms given in RFC 4647 are called filtering and lookup.
> Both of them work with a set of language-tagged texts and a particular
> language tag. Filtering will return a set of texts compatible with the
> particular tag. For example, given the tag en, all texts tagged with
> en, en-us, en-uk, ... will be returned. This comes up when you want to
> get English but don't care which of various English alternatives might
> exist, as when specifying a desired language to read texts in. The
> algorithm is to truncate the tags of the texts until they are the same
> length as the particular tag.
>
> Lookup is used when exactly one answer is required, and the algorithm is
> the opposite: if you say you want en-US-newyork (not a real variant
> subtag), and no such text exists, it will look for en-US and failing
> that for en. Here we truncate the particular tag until it matches some
> existing textual tag.
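The lookup algorithm described above can be sketched in a few lines of
R7RS Scheme. This is an illustration only: lookup-text and drop-last are
hypothetical names, the message table is made up, and tags are written
as lists of lowercase symbols. Unlike RFC 4647 lookup, this sketch has
no default value and simply returns #f when nothing matches.

```scheme
(import (scheme base))

;; Drop the last element of a non-empty list.
(define (drop-last lst)
  (if (null? (cdr lst))
      '()
      (cons (car lst) (drop-last (cdr lst)))))

;; TEXTS is an association list from subtag lists to strings.
;; Truncate WANTED from the right until some entry matches exactly.
(define (lookup-text texts wanted)
  (cond ((assoc wanted texts) => cdr)
        ((null? wanted) #f)
        (else (lookup-text texts (drop-last wanted)))))

;; Made-up example data.
(define texts
  '(((en) . "Unrecognized color")
    ((en gb) . "Unrecognised colour")))

(lookup-text texts '(en gb))          ; => "Unrecognised colour"
(lookup-text texts '(en us newyork))  ; => "Unrecognized color"
(lookup-text texts '(fi))             ; => #f
```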
For SRFI 198, we can probably coalesce both these algorithms into one.
- (foreign-error-ref ferr 'message) gives any language
- (foreign-error-ref ferr 'message 'en) gives any English
- (foreign-error-ref ferr 'message 'en 'us) gives (any) US English
In practice, it would be quite surprising if we came across a
three-level language tag for translated software, but if we allow "zero
or more" arguments for 'message then the message implementation can
match against three subtags just as well as two.
It's probably good enough to say that if the subtags given to
`foreign-error-ref` are a prefix of the language tag, then that language
tag matches. The fact that (foreign-error-ref ferr 'message) matches any
language falls out nicely from the general rule. Can you think of any
situations where this rule would cause problems?
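To make the prefix rule concrete, here is a rough sketch in R7RS
Scheme. The names subtag-prefix? and find-message are hypothetical, the
message table is made up, and find-message only stands in for what
(foreign-error-ref ferr 'message ...) might do internally:

```scheme
(import (scheme base))

;; True if PREFIX is a prefix of TAG; both are lists of subtags
;; (symbols or nonnegative exact integers).
(define (subtag-prefix? prefix tag)
  (cond ((null? prefix) #t)
        ((null? tag) #f)
        ((eqv? (car prefix) (car tag))
         (subtag-prefix? (cdr prefix) (cdr tag)))
        (else #f)))

;; MESSAGES is an association list from language tags to strings.
;; Return the first message whose tag has SUBTAGS as a prefix,
;; or #f if there is none.
(define (find-message messages . subtags)
  (let loop ((rest messages))
    (cond ((null? rest) #f)
          ((subtag-prefix? subtags (car (car rest)))
           (cdr (car rest)))
          (else (loop (cdr rest))))))

;; Made-up example data.
(define messages
  '(((es 419) . "mensaje")
    ((en us) . "Unrecognized color")
    ((en gb) . "Unrecognised colour")))

(find-message messages)              ; => "mensaje"
(find-message messages 'en)          ; => "Unrecognized color"
(find-message messages 'en 'gb)      ; => "Unrecognised colour"
(find-message messages 'en 'gb 'x)   ; => #f
```

In this sketch a request that is more specific than every stored tag
returns #f, because the rule never truncates the request the way RFC
4647 lookup does.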