Convert between Unicode Normalization Forms on the unix command-line

Question

In Unicode, some character combinations have more than one representation.

For example, the character ä can be represented as

"ä", that is the codepoint U+00E4 (two bytes c3 a4 in UTF-8 encoding), or as
"ä", that is the two codepoints U+0061 U+0308 (three bytes 61 cc 88 in UTF-8).

According to the Unicode standard, the two representations are equivalent but in different "normalization forms", see UAX #15: Unicode Normalization Forms.
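The byte sequences quoted above can be checked with a short Python 3 sketch. It uses only the standard library; the names `composed`, `decomposed` and `utf8_hex` are illustrative and not part of the question.

```python
import unicodedata

composed = "\u00e4"       # U+00E4, LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"    # U+0061 followed by U+0308, COMBINING DIAERESIS

def utf8_hex(text):
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join("%02x" % b for b in text.encode("utf-8"))

# The two strings render alike but are different code point sequences.
print(composed == decomposed)    # False
print(utf8_hex(composed))        # c3 a4
print(utf8_hex(decomposed))      # 61 cc 88

# Normalizing both to the same form makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)    # True
print(unicodedata.normalize("NFD", composed) == decomposed)    # True
```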

The unix toolbox has all kinds of text transformation tools; sed, tr, iconv, and Perl come to mind. How can I do quick and easy NF conversion on the command-line?

Looks like there is a "Unicode::Normalize" module for Perl which should do this kind of thing: search.cpan.org/~sadahiro/Unicode-Normalize-1.16/Normalize.pm – goldilocks, Sep 10 '13 at 19:36

Nykakin · Answer 1 · 2013-09-11 00:46:22Z

Python's standard library includes the unicodedata module, which converts between Unicode normalization forms through the unicodedata.normalize() function:

import unicodedata

s1 = 'Spicy Jalape\u00f1o'
s2 = 'Spicy Jalapen\u0303o'

t1 = unicodedata.normalize('NFC', s1)
t2 = unicodedata.normalize('NFC', s2)
print(t1 == t2) 
print(ascii(t1)) 

t3 = unicodedata.normalize('NFD', s1)
t4 = unicodedata.normalize('NFD', s2)
print(t3 == t4)
print(ascii(t3))

Running with Python 3.x:

$ python3 test.py
True
'Spicy Jalape\xf1o'
True
'Spicy Jalapen\u0303o'
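To find out which form a given string is already in, one option is to compare it with its own normalized copy. The following is a minimal sketch; `detect_forms` is a hypothetical helper name, not something provided by unicodedata.

```python
import unicodedata

def detect_forms(text):
    """Return the normalization forms that text already satisfies."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    # Comparing against the normalized copy works on any Python 3 version;
    # unicodedata.is_normalized() (Python 3.8+) is an alternative.
    return [f for f in forms if unicodedata.normalize(f, text) == text]

print(detect_forms("Spicy Jalape\u00f1o"))     # ['NFC', 'NFKC']
print(detect_forms("Spicy Jalapen\u0303o"))    # ['NFD', 'NFKD']
print(detect_forms("Spicy"))                   # plain ASCII satisfies all four
```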

Python isn't well suited to shell one-liners, but it can be done if you don't want to create a separate script:

$ python3 -c $'import unicodedata\nprint(unicodedata.normalize("NFC", "ääääää"))'
ääääää

For Python 2.x, you have to add an encoding line (# -*- coding: utf-8 -*-) and mark strings as Unicode with the u prefix:

$ python -c $'# -*- coding: utf-8 -*-\nimport unicodedata\nprint(unicodedata.normalize("NFC", u"ääääää"))'
ääääää
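For use in a pipeline, the same normalize() call can be wrapped in a function that copies one text stream to another. This is a sketch under stated assumptions rather than a finished tool: `normalize_stream` and `FORMS` are made-up names, and the demonstration runs on in-memory streams.

```python
import io
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def normalize_stream(form, infile, outfile):
    """Copy text from infile to outfile, normalizing each line to form."""
    if form not in FORMS:
        raise ValueError("unknown normalization form: %r" % (form,))
    for line in infile:
        # Working line by line is safe here: a newline never combines with
        # neighbouring characters, so lines normalize independently.
        outfile.write(unicodedata.normalize(form, line))

# Demonstration on in-memory streams; in a script, sys.stdin and sys.stdout
# would take their place.
src = io.StringIO("a\u0308 decomposed\n\u00e4 precomposed\n")
dst = io.StringIO()
normalize_stream("NFC", src, dst)
print(ascii(dst.getvalue()))    # '\xe4 decomposed\n\xe4 precomposed\n'
```

Saved as a script (say nfconv.py, a hypothetical name) that ends with normalize_stream(sys.argv[1], sys.stdin, sys.stdout), it could be run as python3 nfconv.py NFD < input.txt, assuming a UTF-8 locale so that sys.stdin decodes the input correctly.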

Gilles · Answer 2 · 2013-09-11 01:51:52Z


You can use the uconv utility from ICU. Normalization is achieved through transliteration (-x).

$ uconv -x any-nfd <<<ä | hd
00000000  61 cc 88 0a                                       |a...|
00000004
$ uconv -x any-nfc <<<ä | hd
00000000  c3 a4 0a                                          |...|
00000003

On Debian, Ubuntu and other derivatives, uconv is in the libicu-dev package (on newer releases, in icu-devtools). On Fedora, Red Hat and other derivatives, and in BSD ports, it's in the icu package.


This works, thanks. You have to install a 30M dev library alongside it though. What's worse, I haven't been able to find proper documentation for uconv itself: where did you find any-nfd? It looks like development of this tool has been abandoned, last update was in 2005. – glts Sep 14 '13 at 16:07

@glts I found any-nfd by browsing through the list displayed by uconv -L. – Gilles Sep 14 '13 at 23:38


