For completeness, with zsh, to split a string into:
its character constituents:
chars=( ${(s[])string} )
(if $string contains bytes not forming parts of valid characters, each of those will still be stored as separate elements)
its byte constituents:
you can do the same, but with the multibyte option unset, for instance locally in an anonymous function:
(){ set -o localoptions +o multibyte
  bytes=( ${(s[])string} )
}
its grapheme cluster constituents:
you can use PCRE's ability to match them with \X:
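zmodload zsh/pcre
(){
  graphemes=()
  local rest=$string match
  # \X matches one grapheme cluster; \K resets the match start, so the
  # matched text that -v stores in $rest is only what follows the cluster
  pcre_compile -s '(\X)\K.*'
  while pcre_match -v rest -- "$rest"; do
    graphemes+=($match[1])
  done
}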
(that one assumes the input contains text properly encoded in the locale's charmap).
With string=$'Ste\u0301phane', those give:
chars=( S t e ́ p h a n e )
bytes=( S t e $'\M-L' $'\M-\C-A' p h a n e )
graphemes=( S t é p h a n e )
That is because the e + U+0301 grapheme cluster (which display devices usually represent the same as the é U+00E9 precomposed equivalent) is made up of 2 characters (U+0065 and U+0301). In locales using UTF-8 as their charmap, the first one is encoded on one byte (0x65), and the second on two bytes (0xcc 0x81, also known as Meta-L and Meta-Ctrl-A).
For strings made up only of ASCII characters like your 11111001, all three will be equivalent.
Note that in zsh, like in all other shells except ksh/bash, array indices start at 1, not 0.
arr[1]=0 with a string of 11.... – Jeff Schaller♦ 20 hours ago