In a Java program I've written, I shell out to the Unix sort command to do the sorting. However, I am running into problems because Java's string comparison behaves differently from the comparisons done by sort.

From the [Java Doc][1]:

> Compares two strings lexicographically. The comparison is based on the Unicode value of each character in the strings.

From the sort man page:

> *** WARNING *** The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.

So my guess is that I need to sort with LC_ALL=C. However, I always thought this meant sorting by ASCII value, so I don't know what would happen with Unicode.


1 Answer


The LC_COLLATE locale category controls the sorting order. LC_ALL sets all categories.
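As a minimal sketch of how this could be applied when shelling out from Java: the environment of the child process can be adjusted through `ProcessBuilder.environment()` before it is started. The class name `CLocaleSort`, the helper `sortInCLocale`, and the file paths are hypothetical, and the sketch assumes a POSIX `sort` on the `PATH`.

```java
import java.util.Map;

class CLocaleSort {
    // Hypothetical helper: builds (but does not start) a `sort` process
    // whose environment forces byte-wise collation.
    static ProcessBuilder sortInCLocale(String inputPath, String outputPath) {
        // `sort -o OUT IN` writes the sorted lines of IN to OUT (POSIX option).
        ProcessBuilder pb = new ProcessBuilder("sort", "-o", outputPath, inputPath);
        Map<String, String> env = pb.environment();
        // LC_ALL overrides LC_COLLATE and LANG for the child process only.
        env.put("LC_ALL", "C");
        // Let the child share this JVM's stdout/stderr so errors are visible.
        pb.inheritIO();
        return pb;
    }
}
```

Calling `CLocaleSort.sortInCLocale("in.txt", "out.txt").start().waitFor()` would then run the sort; a non-zero return value indicates that sort failed.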

With LC_COLLATE=C, strings are sorted byte by byte. The bytes don't have to be ASCII characters (only byte values between 0 and 127 are ASCII). On a Unix system, Unicode is almost always encoded as UTF-8. UTF-8 has the property that encoding characters as byte sequences preserves their code point order, so sorting UTF-8 strings in byte lexicographic order is equivalent to sorting them in character lexicographic order. Therefore LC_COLLATE=C is suitable for sorting Unicode text encoded in UTF-8 lexicographically by character value.
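To check this from the Java side, here is a small sketch that compares two strings the way a byte-wise sort would, by their UTF-8 bytes treated as unsigned values. The class `Utf8Order` and method `compareUtf8Bytes` are hypothetical names. One caveat worth hedging: `String.compareTo` compares `char` values, which are UTF-16 code units, so the two orders should agree for text in the Basic Multilingual Plane but can differ once characters above U+FFFF (stored as surrogate pairs) are compared against characters in the range U+E000 to U+FFFF.

```java
import java.nio.charset.StandardCharsets;

class Utf8Order {
    // Compares two strings by their UTF-8 encodings, byte by byte,
    // treating each byte as an unsigned value (0..255).
    static int compareUtf8Bytes(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            // Mask with 0xFF because Java bytes are signed.
            int d = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        // A string that is a prefix of the other sorts first.
        return x.length - y.length;
    }
}
```

If the input may contain characters outside the Basic Multilingual Plane, using a comparator like this on the Java side, instead of `compareTo`, is one way to keep both orderings consistent.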
