str.split(' '), or even just str.split() (split on arbitrary-width whitespace), works just fine on your input:
sample = '\xf0\x9f\x98\x84 \xf0\x9f\x98\x83 \xf0\x9f\x98\x80 \xf0\x9f\x98\x8a \xe2\x98\xba \xf0\x9f\x98\x89 \xf0\x9f\x98\x8d \xf0\x9f\x98\x98 \xf0\x9f\x98\x9a \xf0\x9f\x98\x97 \xf0\x9f\x98\x99 \xf0\x9f\x98\x9c \xf0\x9f\x98\x9d \xf0\x9f\x98\x9b \xf0\x9f\x98\x81 \xf0\x9f\x98\x82 \xf0\x9f\x98\x85 \xf0\x9f\x98\x86 \xf0\x9f\x98\x8b \xf0\x9f\x98\x8e \xf0\x9f\x98\xac \xf0\x9f\x98\x87'
parts = sample.split()
Demo:
>>> sample = '\xf0\x9f\x98\x84 \xf0\x9f\x98\x83 \xf0\x9f\x98\x80 \xf0\x9f\x98\x8a \xe2\x98\xba \xf0\x9f\x98\x89 \xf0\x9f\x98\x8d \xf0\x9f\x98\x98 \xf0\x9f\x98\x9a \xf0\x9f\x98\x97 \xf0\x9f\x98\x99 \xf0\x9f\x98\x9c \xf0\x9f\x98\x9d \xf0\x9f\x98\x9b \xf0\x9f\x98\x81 \xf0\x9f\x98\x82 \xf0\x9f\x98\x85 \xf0\x9f\x98\x86 \xf0\x9f\x98\x8b \xf0\x9f\x98\x8e \xf0\x9f\x98\xac \xf0\x9f\x98\x87'
>>> sample.split()
['\xf0\x9f\x98\x84', '\xf0\x9f\x98\x83', '\xf0\x9f\x98\x80', '\xf0\x9f\x98\x8a', '\xe2\x98\xba', '\xf0\x9f\x98\x89', '\xf0\x9f\x98\x8d', '\xf0\x9f\x98\x98', '\xf0\x9f\x98\x9a', '\xf0\x9f\x98\x97', '\xf0\x9f\x98\x99', '\xf0\x9f\x98\x9c', '\xf0\x9f\x98\x9d', '\xf0\x9f\x98\x9b', '\xf0\x9f\x98\x81', '\xf0\x9f\x98\x82', '\xf0\x9f\x98\x85', '\xf0\x9f\x98\x86', '\xf0\x9f\x98\x8b', '\xf0\x9f\x98\x8e', '\xf0\x9f\x98\xac', '\xf0\x9f\x98\x87']
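The transcript above is Python 2, where a plain string literal holds bytes. As a minimal sketch, assuming Python 3 instead (the original does not use it), the same split works on a bytes literal, and each piece can be decoded afterwards. The sample is shortened to its first five values, and the name emoji is introduced here purely for illustration:

```python
# Python 3 sketch: the same data as a bytes literal (first five values only).
sample = b'\xf0\x9f\x98\x84 \xf0\x9f\x98\x83 \xf0\x9f\x98\x80 \xf0\x9f\x98\x8a \xe2\x98\xba'

# bytes.split() with no argument splits on runs of ASCII whitespace.
parts = sample.split()

# Decode each UTF-8 encoded piece into a text string.
emoji = [part.decode('utf8') for part in parts]
print(emoji)
```

Splitting before decoding keeps the pieces as raw bytes, which matches the approach in the answer; decoding the whole value first and splitting the resulting text should give the same pieces for this input.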
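However, if this is binary data, you need to be careful that there are no \x20 bytes in those 4-byte values. Note too that fixed-width chunking assumes every value is exactly 4 bytes wide, and the sample contains one 3-byte value (\xe2\x98\xba), after which 5-byte strides no longer line up. Where the width really is fixed, it might be better to produce chunks of 5 bytes, then drop the last byte of each: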
for i in range(0, len(sample), 5):
    chunk = sample[i:i + 4]  # ignore the 5th byte, a space
Demo:
>>> for i in range(0, len(sample), 5):
...     chunk = sample[i:i + 4]  # ignore the 5th byte, a space
...     print chunk.decode('utf8')
...     if i == 20: break
...
😄
😃
😀
😊
☺
# On browsers that support it, those are various smiling emoji
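One way to guard the fixed-width approach is to check that the byte after each value really is the separator. The following is a sketch only, written for Python 3 (an assumption; the answer uses Python 2), and split_fixed is a hypothetical helper name, not part of any library:

```python
def split_fixed(data, width=4, sep=b' '):
    """Yield width-byte values from data, verifying each separator.

    Hypothetical helper: raises ValueError as soon as a value is not
    exactly `width` bytes or is not followed by `sep`.
    """
    step = width + len(sep)
    for i in range(0, len(data), step):
        chunk = data[i:i + width]
        following = data[i + width:i + step]
        if len(chunk) != width:
            raise ValueError('short value at byte offset %d' % i)
        # The final value has no trailing separator, so an empty tail is fine.
        if following not in (sep, b''):
            raise ValueError('misaligned at byte offset %d' % i)
        yield chunk


# Three values that are all 4 bytes wide, so the strides line up.
aligned = b'\xf0\x9f\x98\x84 \xf0\x9f\x98\x83 \xf0\x9f\x98\x80'
print([chunk.decode('utf8') for chunk in split_fixed(aligned)])
```

Unlike splitting on whitespace, this never inspects the value bytes themselves, so a \x20 inside a value would not cause a split; the cost is that any value of a different width raises an error instead of being silently misread.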
… \x escape sequence and is thinking the backslash just escapes the next character. – Mike DeSimone Aug 31 '14 at 18:03
… 'xf0' from splitting that input; those are strings with 3 characters, an x followed by a 2-digit hexadecimal number. It sounds as if you treated the strings as sequences rather than splitting them, resulting in strings with just one character each (like '\xf0', note the backslash). – Martijn Pieters♦ Aug 31 '14 at 18:03