ICU collator assumes "a" and "ą" are the same

Question

I use ICU with the Lithuanian (lt_LT) locale. The alphabet for this language is as follows: a ą b c č d e ę ė <...> v z ž

However, when sorting, the ICU collator treats, for example, a and ą (a with ogonek) as equivalent, so a list of Lithuanian words is sorted as follows:

a, ą, ab, aba, abadas, <...>, b, ba, <...>

whereas the expected result is:

a, ab, aba, abadas, <...>, ą, <...>, b, ba, <...>

The same thing happens with other "accented" letters (e, ę, ė; z, ž; etc.).

A more specific test case: running source/samples/coll/coll -locale lt_LT -source ą -target aa reports "source is less than target" when it is not (see coll.cpp if you need to).

Is this behavior expected? Is it a bug or a feature? If it is a feature, how can I prevent the ICU collator from treating similar letters as equivalent?

internationalization collation icu

Linas May 19, '12 at 20:21

1 answer

Steven R. Loomis · Accepted Answer · 2012-05-19T20:55:42+0000

The letters are listed as a secondary difference in the CLDR charts, which is why they sort this way. If that is incorrect, raise it with CLDR; it is not an ICU problem. Mimer agrees.
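
To illustrate what a "secondary difference" means for the ordering in the question, here is a minimal C++ sketch of a two-level comparison. It does not use ICU and does not reproduce CLDR's data: the Weights struct, the weightsFor table and the compareMultiLevel function are hypothetical names invented for this example, and the weights cover only a, ą and b. The point it shows is that primary weights (the base letters) are compared across the whole string first, and secondary weights (the accents) are consulted only when the primaries tie.

```cpp
#include <string>
#include <vector>

// Toy model, NOT ICU's real data: each letter gets a primary weight
// (base letter) and a secondary weight (accent).
struct Weights {
    int primary;
    int secondary;
};

// Hypothetical weight table covering only the letters used in this example.
Weights weightsFor(char32_t c) {
    switch (c) {
        case U'a':      return {1, 0};
        case U'\u0105': return {1, 1};   // ą: same primary as a, differs only at the secondary level
        case U'b':      return {2, 0};
        default:        return {100, 0}; // anything else sorts after the known letters
    }
}

// Returns a negative value, zero, or a positive value, like strcmp.
int compareMultiLevel(const std::u32string& x, const std::u32string& y) {
    std::vector<int> px, py, sx, sy;
    for (char32_t c : x) {
        px.push_back(weightsFor(c).primary);
        sx.push_back(weightsFor(c).secondary);
    }
    for (char32_t c : y) {
        py.push_back(weightsFor(c).primary);
        sy.push_back(weightsFor(c).secondary);
    }
    // Level 1: primary weights of the whole string decide first.
    // std::vector comparison is lexicographic, so a proper prefix sorts first.
    if (px != py) return px < py ? -1 : 1;
    // Level 2: accents matter only when the primary weights are identical.
    if (sx != sy) return sx < sy ? -1 : 1;
    return 0;
}
```

Under these assumed weights, compareMultiLevel(U"\u0105", U"aa") is negative: at the primary level "ą" looks like "a", which is a prefix of "aa", so the accent is never examined. That matches the "source is less than target" result from the coll sample. The expected order from the question would instead require ą to carry its own primary weight, distinct from a.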
