In Unicode, some character combinations have more than one representation.

For example, the character ä can be represented as

  • "ä", that is the codepoint U+00E4 (two bytes c3 a4 in UTF-8 encoding), or as
  • "ä", that is the two codepoints U+0061 U+0308 (three bytes 61 cc 88 in UTF-8).

According to the Unicode standard, the two representations are canonically equivalent but in different "normalization forms": the first is in the composed form (NFC), the second in the decomposed form (NFD). See UAX #15: Unicode Normalization Forms.
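To make the difference concrete, here is a small Python sketch that prints the code points and UTF-8 bytes of both forms and then compares them. The helper name `describe` is made up for this illustration; only the standard `unicodedata` module is assumed.

```python
import unicodedata

def describe(s):
    """Return (code points, UTF-8 bytes in hex) for a string."""
    codepoints = " ".join("U+%04X" % ord(ch) for ch in s)
    utf8_hex = " ".join("%02x" % b for b in s.encode("utf-8"))
    return codepoints, utf8_hex

composed = "\u00e4"      # precomposed LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # "a" followed by COMBINING DIAERESIS

print(describe(composed))      # ('U+00E4', 'c3 a4')
print(describe(decomposed))    # ('U+0061 U+0308', '61 cc 88')

# A plain comparison treats the two strings as different ...
print(composed == decomposed)  # False
# ... but they compare equal once both are brought to the same form.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

This is why text that looks identical on screen can fail a byte-wise comparison in tools such as grep or diff until it has been normalized.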

The Unix toolbox has all kinds of text transformation tools; sed, tr, iconv, and Perl come to mind. How can I do quick and easy normalization form (NF) conversion on the command line?

Looks like there is a "Unicode::Normalize" module for Perl which should do this kind of thing: search.cpan.org/~sadahiro/Unicode-Normalize-1.16/Normalize.pm –  goldilocks Sep 10 '13 at 19:36

2 Answers

Python has the unicodedata module in its standard library, which lets you convert between Unicode normalization forms with the unicodedata.normalize() function:

import unicodedata

s1 = 'Spicy Jalape\u00f1o'
s2 = 'Spicy Jalapen\u0303o'

t1 = unicodedata.normalize('NFC', s1)
t2 = unicodedata.normalize('NFC', s2)
print(t1 == t2) 
print(ascii(t1)) 

t3 = unicodedata.normalize('NFD', s1)
t4 = unicodedata.normalize('NFD', s2)
print(t3 == t4)
print(ascii(t3))

Running with Python 3.x:

$ python3 test.py
True
'Spicy Jalape\xf1o'
True
'Spicy Jalapen\u0303o'

Python isn't well suited to shell one-liners, but it can be done if you don't want to create an external script:

$ python3 -c $'import unicodedata\nprint(unicodedata.normalize("NFC", "ääääää"))'
ääääää

For Python 2.x you have to add an encoding line (# -*- coding: utf-8 -*-) and mark strings as Unicode with the u prefix:

$ python -c $'# -*- coding: utf-8 -*-\nimport unicodedata\nprint(unicodedata.normalize("NFC", u"ääääää"))'
ääääää
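If you would rather normalize piped text than a string literal, the same call can be wrapped in a small stream filter. The following is only a sketch: the function name `normalize_stream` is hypothetical, and it assumes the input is already decoded text. Working line by line should be safe because a newline is not a combining character, so normalization never needs to look across it.

```python
import io
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def normalize_stream(src, dst, form="NFC"):
    """Copy text from src to dst, normalizing each line to the given form."""
    if form not in FORMS:
        raise ValueError("unknown normalization form: %r" % form)
    for line in src:
        dst.write(unicodedata.normalize(form, line))

# Demo with in-memory streams; a real filter script would pass
# sys.stdin and sys.stdout instead.
out = io.StringIO()
normalize_stream(io.StringIO("a\u0308\n"), out, "NFC")
print(ascii(out.getvalue()))  # '\xe4\n'
```

Saved in a script that ends with a call like normalize_stream(sys.stdin, sys.stdout, sys.argv[1]), this could be used in a pipeline. Note that the encoding Python uses for stdin and stdout depends on your locale settings.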

You can use the uconv utility from ICU. Normalization is achieved through transliteration (-x).

$ uconv -x any-nfd <<<ä | hd
00000000  61 cc 88 0a                                       |a...|
00000004
$ uconv -x any-nfc <<<ä | hd
00000000  c3 a4 0a                                          |...|
00000003

On Debian, Ubuntu and other derivatives, uconv is in the libicu-dev package. On Fedora, Red Hat and other derivatives, and in BSD ports, it's in the icu package.

This works, thanks. You have to install a 30M dev library alongside it though. What's worse, I haven't been able to find proper documentation for uconv itself: where did you find any-nfd? It looks like development of this tool has been abandoned, last update was in 2005. –  glts Sep 14 '13 at 16:07
 
@glts I found any-nfd by browsing through the list displayed by uconv -L. –  Gilles Sep 14 '13 at 23:38
