I'm trying to split a string of bytes like this:

'\xf0\x9f\x98\x84 \xf0\x9f\x98\x83 \xf0\x9f\x98\x80 \xf0\x9f\x98\x8a \xe2\x98\xba \xf0\x9f\x98\x89 \xf0\x9f\x98\x8d \xf0\x9f\x98\x98 \xf0\x9f\x98\x9a \xf0\x9f\x98\x97 \xf0\x9f\x98\x99 \xf0\x9f\x98\x9c \xf0\x9f\x98\x9d \xf0\x9f\x98\x9b \xf0\x9f\x98\x81 \xf0\x9f\x98\x82 \xf0\x9f\x98\x85 \xf0\x9f\x98\x86 \xf0\x9f\x98\x8b \xf0\x9f\x98\x8e \xf0\x9f\x98\xac \xf0\x9f\x98\x87'

into something like this:

'\xf0\x9f\x98\x84', '\xf0\x9f\x98\x83', etc.

However, the split() method returns something like this:

'xf0', 'x9f', 'x98', etc.

I tried split(" "), but it does not seem to work. How do I get the output shown above?

    
What "split" method are you using? It looks like it doesn't understand the \x escape sequence and is thinking the backslash just escapes the next character. –  Mike DeSimone Aug 31 '14 at 18:03
    
What code did you use that got you individual characters? You cannot ever get 'xf0' from splitting that input; that's strings with 3 characters, an x followed by a 2-digit hexadecimal number. It sounds as if you treated the strings as sequences rather than splitting them, resulting in strings with just one character each (like '\xf0', note the backslash). –  Martijn Pieters Aug 31 '14 at 18:03
    
... Is it splitting on `\`? –  Mike DeSimone Aug 31 '14 at 18:03
    
Use split(' ') instead of split(" ") to split by empty spaces. –  Boop Aug 31 '14 at 18:04
@Boop: there is no difference between those two method calls. Both split on spaces. –  Martijn Pieters Aug 31 '14 at 18:13

1 Answer (accepted)

str.split(' ') or even just str.split() (split on arbitrary-width whitespace) works just fine on your input:

sample = '\xf0\x9f\x98\x84 \xf0\x9f\x98\x83 \xf0\x9f\x98\x80 \xf0\x9f\x98\x8a \xe2\x98\xba \xf0\x9f\x98\x89 \xf0\x9f\x98\x8d \xf0\x9f\x98\x98 \xf0\x9f\x98\x9a \xf0\x9f\x98\x97 \xf0\x9f\x98\x99 \xf0\x9f\x98\x9c \xf0\x9f\x98\x9d \xf0\x9f\x98\x9b \xf0\x9f\x98\x81 \xf0\x9f\x98\x82 \xf0\x9f\x98\x85 \xf0\x9f\x98\x86 \xf0\x9f\x98\x8b \xf0\x9f\x98\x8e \xf0\x9f\x98\xac \xf0\x9f\x98\x87'
parts = sample.split()

Demo:

>>> sample = '\xf0\x9f\x98\x84 \xf0\x9f\x98\x83 \xf0\x9f\x98\x80 \xf0\x9f\x98\x8a \xe2\x98\xba \xf0\x9f\x98\x89 \xf0\x9f\x98\x8d \xf0\x9f\x98\x98 \xf0\x9f\x98\x9a \xf0\x9f\x98\x97 \xf0\x9f\x98\x99 \xf0\x9f\x98\x9c \xf0\x9f\x98\x9d \xf0\x9f\x98\x9b \xf0\x9f\x98\x81 \xf0\x9f\x98\x82 \xf0\x9f\x98\x85 \xf0\x9f\x98\x86 \xf0\x9f\x98\x8b \xf0\x9f\x98\x8e \xf0\x9f\x98\xac \xf0\x9f\x98\x87'
>>> sample.split()
['\xf0\x9f\x98\x84', '\xf0\x9f\x98\x83', '\xf0\x9f\x98\x80', '\xf0\x9f\x98\x8a', '\xe2\x98\xba', '\xf0\x9f\x98\x89', '\xf0\x9f\x98\x8d', '\xf0\x9f\x98\x98', '\xf0\x9f\x98\x9a', '\xf0\x9f\x98\x97', '\xf0\x9f\x98\x99', '\xf0\x9f\x98\x9c', '\xf0\x9f\x98\x9d', '\xf0\x9f\x98\x9b', '\xf0\x9f\x98\x81', '\xf0\x9f\x98\x82', '\xf0\x9f\x98\x85', '\xf0\x9f\x98\x86', '\xf0\x9f\x98\x8b', '\xf0\x9f\x98\x8e', '\xf0\x9f\x98\xac', '\xf0\x9f\x98\x87']
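For comparison, here is a minimal Python 3 sketch of the same split, assuming the data is held in a bytes object (the literal is shortened to the first five values). It also shows what happens when the value is treated as a sequence instead of being split, which is the kind of mix-up the comments suspect; the names parts and as_list are only illustrative.

```python
# Python 3 sketch: the sample is a bytes literal (first five values only).
sample = b'\xf0\x9f\x98\x84 \xf0\x9f\x98\x83 \xf0\x9f\x98\x80 \xf0\x9f\x98\x8a \xe2\x98\xba'

# Splitting with no argument breaks the data on runs of ASCII whitespace.
parts = sample.split()

# Iterating over the bytes object yields one integer per byte, not one value per emoji.
as_list = list(sample)

print(parts[0])      # b'\xf0\x9f\x98\x84'
print(as_list[:4])   # [240, 159, 152, 132]
```

Each element of parts is still raw bytes; calling .decode('utf-8') on an element turns it into text.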

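However, if this is binary data, you need to be careful that there are no \x20 bytes inside those values. If every value is exactly 4 bytes wide, it might be better to produce chunks of 5 bytes and ignore the last byte of each. Note that the sample also contains a 3-byte value, '\xe2\x98\xba', so fixed-width chunking only lines up for the first four entries: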

for i in range(0, len(sample), 5):
    chunk = sample[i:i + 4]  # ignore the 5th byte, a space

Demo:

>>> for i in range(0, len(sample), 5):
...     chunk = sample[i:i + 4]  # ignore the 5th byte, a space
...     print chunk.decode('utf8')
...     if i == 15: break  # stop after the first four 4-byte values
... 
😄
😃
😀
😊
# On browsers that support it, those are various smiling emoji
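If the bytes are known to be valid UTF-8, a space byte can never occur inside a multi-byte sequence (lead and continuation bytes are all 0x80 or higher), so decoding first and then splitting the text handles 3-byte and 4-byte characters alike. The following is a Python 3 sketch under that assumption; split_utf8 and fixed_chunks are hypothetical helper names, and the sample is shortened to four values, one of them 3 bytes wide.

```python
# Python 3 sketch; assumes the input is valid UTF-8.
sample = b'\xf0\x9f\x98\x84 \xf0\x9f\x98\x83 \xe2\x98\xba \xf0\x9f\x98\x89'


def split_utf8(data):
    """Decode UTF-8 bytes, then split the resulting text on whitespace."""
    return data.decode('utf-8').split()


def fixed_chunks(data, width=4):
    """Slice width-byte values that are each followed by one separator byte."""
    return [data[i:i + width] for i in range(0, len(data), width + 1)]


emoji = split_utf8(sample)
print([len(e.encode('utf-8')) for e in emoji])  # [4, 4, 3, 4]

# Fixed-width slicing drifts once a value is not exactly 4 bytes wide:
chunks = fixed_chunks(sample)
print(chunks[2])  # b'\xe2\x98\xba ' (the separator leaks into the chunk)
print(chunks[3])  # b'\x9f\x98\x89' (misaligned, no longer valid UTF-8)
```

For arbitrary binary data where the values are genuinely fixed-width, the slicing approach remains the safer of the two.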
