How do I get Unix sort to sort in the same order as Java (by Unicode value)?

Question

I shell out sorting to the Unix sort command in a Java program I've written. However, I am having problems because Java's string comparison behaves differently from the comparisons done by sort.

From the [Java Doc][1]:

Compares two strings lexicographically. The comparison is based on the Unicode value of each character in the strings.

From the sort man page:

* WARNING * The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.

So my guess is that I need to sort with LC_ALL=C. However, I always thought this meant sorting by ASCII value, in which case who knows what could happen with Unicode.

Gilles · Answer 1 · 2012-02-17 08:37:41Z

The LC_COLLATE locale category controls the sorting order. LC_ALL sets all categories.
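Since the question shells out to sort from Java, one way to apply this is to set the variable only in the child process's environment, leaving the JVM's own locale alone. A minimal sketch, assuming a POSIX sort on the PATH and UTF-8 input; the class and method names (CLocaleSort, newSortProcess, sortLines) are made up for illustration and are not from the original post:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: runs the system `sort` with LC_ALL=C so that it
// compares raw bytes regardless of the locale the JVM was started in.
final class CLocaleSort {

    // Describes the child process. Only the child's environment is changed.
    static ProcessBuilder newSortProcess() {
        ProcessBuilder pb = new ProcessBuilder("sort");
        pb.environment().put("LC_ALL", "C");
        // Let any error messages from sort go straight to our stderr.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        return pb;
    }

    // Writes the lines to sort's stdin as UTF-8 and reads the sorted result.
    // sort cannot emit output before it has seen all of its input, so
    // writing everything first and reading afterwards does not deadlock.
    static List<String> sortLines(List<String> lines)
            throws IOException, InterruptedException {
        Process p = newSortProcess().start();
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(p.getOutputStream(), StandardCharsets.UTF_8))) {
            for (String line : lines) {
                out.write(line);
                out.write('\n');
            }
        }
        List<String> sorted = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                sorted.add(line);
            }
        }
        if (p.waitFor() != 0) {
            throw new IOException("sort exited with a non-zero status");
        }
        return sorted;
    }

    public static void main(String[] args) throws Exception {
        for (String s : sortLines(Arrays.asList("b", "\u00E9", "a", "Z"))) {
            System.out.println(s);
        }
    }
}
```

Setting LC_ALL rather than LC_COLLATE is a judgment call here: LC_ALL takes precedence over the individual categories, so it still works when the surrounding environment already has LC_ALL set to something else.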

With LC_COLLATE=C, strings are sorted byte by byte. The bytes don't have to be ASCII characters (only byte values from 0 to 127 are ASCII). On Unix systems, Unicode text is almost always encoded as UTF-8. UTF-8 has the property that the byte sequences encoding two characters compare in the same order as the characters' code points, so sorting UTF-8 strings in byte-lexicographic order is equivalent to sorting them in code point order. LC_COLLATE=C is therefore suitable for sorting UTF-8 text lexicographically by character value.
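To check the claim from the Java side, here is a small sketch that compares strings by their unsigned UTF-8 bytes (which is what a byte-wise sort does) and by code point, and confirms the two agree. The names Utf8Order, UTF8_BYTES and CODE_POINTS are invented for this example, and it assumes well-formed strings with no unpaired surrogates. One caveat worth checking against your data: String.compareTo compares UTF-16 code units rather than code points, so it can disagree with both of these orders when characters outside the Basic Multilingual Plane are involved.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class, not part of the original answer.
final class Utf8Order {

    // Byte-wise comparison of the UTF-8 encodings, treating bytes as
    // unsigned. A string that is a prefix of another sorts first.
    static final Comparator<String> UTF8_BYTES = (a, b) -> {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return x.length - y.length;
    };

    // Comparison by Unicode code point, stepping over surrogate pairs.
    static final Comparator<String> CODE_POINTS = (a, b) -> {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // Whichever string still has characters left sorts last.
        return Integer.compare(a.length() - i, b.length() - j);
    };

    public static void main(String[] args) {
        String bmp = "\uFF5E";                 // U+FF5E, inside the BMP
        String supplementary = "\uD83D\uDE00"; // U+1F600, a surrogate pair in Java

        List<String> byBytes = new ArrayList<>(
                Arrays.asList("z", "\u00E9", "a", supplementary, bmp, "Z"));
        List<String> byCodePoints = new ArrayList<>(byBytes);
        byBytes.sort(UTF8_BYTES);
        byCodePoints.sort(CODE_POINTS);

        System.out.println("byte order == code point order: "
                + byBytes.equals(byCodePoints));
        // compareTo sees the high surrogate 0xD83D, which is below 0xFF5E,
        // so it puts the supplementary character first.
        System.out.println("compareTo puts U+1F600 before U+FF5E: "
                + (supplementary.compareTo(bmp) < 0));
    }
}
```

If your data never leaves the BMP, the three orders coincide and sort under LC_COLLATE=C should match String.compareTo. If it does, sorting on the Java side with a code point comparator like the one above is one way to stay consistent with sort.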

