Split array byte string in Python

Question

I'm trying to split a string of bytes like this:

'\xf0\x9f\x98\x84 \xf0\x9f\x98\x83 \xf0\x9f\x98\x80 \xf0\x9f\x98\x8a \xe2\x98\xba \xf0\x9f\x98\x89 \xf0\x9f\x98\x8d \xf0\x9f\x98\x98 \xf0\x9f\x98\x9a \xf0\x9f\x98\x97 \xf0\x9f\x98\x99 \xf0\x9f\x98\x9c \xf0\x9f\x98\x9d \xf0\x9f\x98\x9b \xf0\x9f\x98\x81 \xf0\x9f\x98\x82 \xf0\x9f\x98\x85 \xf0\x9f\x98\x86 \xf0\x9f\x98\x8b \xf0\x9f\x98\x8e \xf0\x9f\x98\xac \xf0\x9f\x98\x87'

into something like this:

'\xf0\x9f\x98\x84', '\xf0\x9f\x98\x83', etc.

However, the split() method returns something like this:

'xf0', 'x9f', 'x98', etc.

I tried split(" "), but it does not seem to work. How can I get the output shown above?

What "split" method are you using? It looks like it doesn't understand the \x escape sequence and is thinking the backslash just escapes the next character. — Mike DeSimone, Aug 31 '14 at 18:03
What code did you use that got you individual characters? You cannot ever get 'xf0' from splitting that input; that's strings with 3 characters, an x followed by a 2-digit hexadecimal number. It sounds as if you treated the strings as sequences rather than splitting them, resulting in strings with just one character each (like '\xf0', note the backslash). — Martijn Pieters♦, Aug 31 '14 at 18:03
Use split(' ') instead of split(" ") to split by empty spaces. — Boop, Aug 31 '14 at 18:04
@Boop: there is no difference between those two method calls. Both split on spaces. — Martijn Pieters♦, Aug 31 '14 at 18:13
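
To illustrate the point made in the comments, here is a minimal sketch of one way the 'xf0'-style pieces could arise. It assumes (the question does not confirm this) that the input held literal backslash characters rather than real bytes. It is written for Python 3, where byte strings take a b prefix; the names literal, pieces, recovered and real are made up for the example.

```python
# Assumption: the data arrived as text containing literal backslashes,
# e.g. read from a file, instead of as actual byte values.
literal = r'\xf0\x9f\x98\x84 \xf0\x9f\x98\x83'   # 33 characters of plain text

# Splitting the first token on the backslash gives 'xf0'-style pieces.
first_token = literal.split(' ')[0]
pieces = [p for p in first_token.split('\\') if p]

# One way to turn the escaped text back into real bytes:
# interpret the \x escapes, then map each code point 0-255 to one byte.
recovered = literal.encode('ascii').decode('unicode_escape').encode('latin-1')

# With real bytes, splitting on a space behaves as expected.
real = b'\xf0\x9f\x98\x84 \xf0\x9f\x98\x83'
print(pieces)
print(recovered == real)
print(real.split(b' '))
```

If this is what happened, the fix belongs where the data is read in, not in the split() call.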

Martijn Pieters · Accepted Answer · 2014-08-31 18:11:15Z

str.split(' ') or even just str.split() (split on arbitrary-width whitespace) works just fine on your input:

sample = '\xf0\x9f\x98\x84 \xf0\x9f\x98\x83 \xf0\x9f\x98\x80 \xf0\x9f\x98\x8a \xe2\x98\xba \xf0\x9f\x98\x89 \xf0\x9f\x98\x8d \xf0\x9f\x98\x98 \xf0\x9f\x98\x9a \xf0\x9f\x98\x97 \xf0\x9f\x98\x99 \xf0\x9f\x98\x9c \xf0\x9f\x98\x9d \xf0\x9f\x98\x9b \xf0\x9f\x98\x81 \xf0\x9f\x98\x82 \xf0\x9f\x98\x85 \xf0\x9f\x98\x86 \xf0\x9f\x98\x8b \xf0\x9f\x98\x8e \xf0\x9f\x98\xac \xf0\x9f\x98\x87'
parts = sample.split()

Demo:

>>> sample = '\xf0\x9f\x98\x84 \xf0\x9f\x98\x83 \xf0\x9f\x98\x80 \xf0\x9f\x98\x8a \xe2\x98\xba \xf0\x9f\x98\x89 \xf0\x9f\x98\x8d \xf0\x9f\x98\x98 \xf0\x9f\x98\x9a \xf0\x9f\x98\x97 \xf0\x9f\x98\x99 \xf0\x9f\x98\x9c \xf0\x9f\x98\x9d \xf0\x9f\x98\x9b \xf0\x9f\x98\x81 \xf0\x9f\x98\x82 \xf0\x9f\x98\x85 \xf0\x9f\x98\x86 \xf0\x9f\x98\x8b \xf0\x9f\x98\x8e \xf0\x9f\x98\xac \xf0\x9f\x98\x87'
>>> sample.split()
['\xf0\x9f\x98\x84', '\xf0\x9f\x98\x83', '\xf0\x9f\x98\x80', '\xf0\x9f\x98\x8a', '\xe2\x98\xba', '\xf0\x9f\x98\x89', '\xf0\x9f\x98\x8d', '\xf0\x9f\x98\x98', '\xf0\x9f\x98\x9a', '\xf0\x9f\x98\x97', '\xf0\x9f\x98\x99', '\xf0\x9f\x98\x9c', '\xf0\x9f\x98\x9d', '\xf0\x9f\x98\x9b', '\xf0\x9f\x98\x81', '\xf0\x9f\x98\x82', '\xf0\x9f\x98\x85', '\xf0\x9f\x98\x86', '\xf0\x9f\x98\x8b', '\xf0\x9f\x98\x8e', '\xf0\x9f\x98\xac', '\xf0\x9f\x98\x87']
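
For readers on Python 3, the same split can be sketched as follows. This is an adaptation, not part of the original demo: it assumes the data is a bytes object holding UTF-8 text, and it uses a shortened three-value sample to keep the example small.

```python
# Shortened sample (assumption: UTF-8 encoded emoji separated by spaces).
sample = b'\xf0\x9f\x98\x84 \xf0\x9f\x98\x83 \xe2\x98\xba'

# bytes.split() with no argument splits on runs of ASCII whitespace.
parts = sample.split()

# Decode each piece to see the characters the bytes stand for.
emoji = [p.decode('utf-8') for p in parts]

print(parts)
print(emoji)
```

Each element of parts is still a bytes object; decoding is only needed if you want text.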

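However, if this is binary data rather than UTF-8 text, be careful that none of the values contains a \x20 byte, since split() would break the value at that point. For fixed-width records it may be safer to step through the data 5 bytes at a time and drop the last byte of each chunk. This assumes every value is exactly 4 bytes long; the sample above includes one 3-byte value ('\xe2\x98\xba'), so the offsets drift once the loop reaches it: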

for i in range(0, len(sample), 5):
    chunk = sample[i:i + 4]  # ignore the 5th byte, a space

Demo:

>>> for i in range(0, len(sample), 5):
...     chunk = sample[i:i + 4]  # ignore the 5th byte, a space
...     print chunk.decode('utf8')
...     if i == 15: break
... 
😄
😃
😀
😊
# On browsers that support it, those are various smiling emoji
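
The two approaches can be compared in a small Python 3 sketch. The helper fixed_chunks and its width and sep_len parameters are hypothetical names introduced here, not part of the original answer. The second half assumes the data is valid UTF-8, in which case a 0x20 byte can only ever be a real space, so decoding first and splitting afterwards is safe for values of any length.

```python
def fixed_chunks(data, width=4, sep_len=1):
    """Yield width-byte records, skipping sep_len separator bytes after each."""
    step = width + sep_len
    for i in range(0, len(data), step):
        yield data[i:i + width]

# Every value is 4 bytes: fixed-width chunking lines up.
records = b'\xf0\x9f\x98\x84 \xf0\x9f\x98\x83 \xf0\x9f\x98\x80'
chunks = list(fixed_chunks(records))

# One value is only 3 bytes: fixed-width chunking drifts out of alignment.
mixed = b'\xf0\x9f\x98\x8a \xe2\x98\xba \xf0\x9f\x98\x89'
drifted = list(fixed_chunks(mixed))

# If the data is UTF-8 text, decode first and split the resulting string.
text_parts = mixed.decode('utf-8').split(' ')

print(chunks)
print(drifted)
print(text_parts)
```

Fixed-width chunking suits true binary records of known size; decode-then-split suits text whose encoded characters vary in length.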

