NumPy's string dtype seems to correspond to Python's str and thus to change between Python 2.x and 3.x:

In Python 2.7:

In [1]: import numpy as np

In [2]: np.dtype((np.str_, 1)).itemsize
Out[2]: 1

In [3]: np.dtype((np.unicode_, 1)).itemsize
Out[3]: 4

In Python 3.3:

In [2]: np.dtype((np.str_, 1)).itemsize
Out[2]: 4

The version of NumPy is 1.7.0 in both cases.

I'm writing some code that I want to work on both Python versions, and I want an array of ASCII strings (4x memory overhead is not acceptable). So the questions are:

  • How do I define a dtype for an ASCII string of certain length (with 1 byte per char) in Python 3?
  • How do I do it in a way that also works in Python 2?
  • Bonus question: Can I limit the alphabet even further, e.g. to ascii_uppercase, and save a bit or two per char?

One potential answer to the first question seems to be character arrays (i.e. have an array of character arrays instead of an array of strings). It looks like I can specify the item size when constructing one:

chararray(shape, itemsize=1, unicode=False, buffer=None, offset=0,
          strides=None, order=None)

Update: nah, the itemsize is actually the number of characters. But there's still unicode=False.

Is that the way to go?

Will it answer the last question, too?

And how do I actually use it as dtype?

 
I am pretty sure the answer to your very last question is a big NO. AFAIK, and I did look into it some time back, there is no way of packing data of less than 8 bits into less than 1 byte. Definitely not for 6 or 7 bit types, but unless you handle it yourself, you can't have 2 four bit values in a single 8 bit container either. Even bool arrays take a full 8 bits for every True/False value they store. –  Jaime Mar 4 at 8:30
 
@Jaime Wow, okay. I want 1 byte then :) –  Lev Levitsky Mar 4 at 8:53

1 Answer

Accepted answer (score: 1)

You can use the 'S' typestr:

>>> np.array(['Hello', 'World'], dtype='S')
array([b'Hello', b'World'], 
      dtype='|S5')
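
To make the memory difference concrete, here is a minimal sketch that fixes the width explicitly with 'S5' instead of letting NumPy infer it, then compares the per-item size with the equivalent unicode dtype. The variable names are only illustrative, and the 4-bytes-per-character figure assumes NumPy's UCS-4 unicode storage, as seen in the question.

```python
import numpy as np

# 'S5' is a fixed-width byte string: 5 bytes per element, 1 byte per character.
ascii_dtype = np.dtype('S5')
arr = np.array([b'Hello', b'World'], dtype=ascii_dtype)

print(ascii_dtype.itemsize)  # bytes per element
print(arr.nbytes)            # bytes for the whole array

# Casting to the unicode dtype of the same length shows the 4x overhead.
unicode_arr = arr.astype('U5')
print(unicode_arr.dtype.itemsize)
```

Here the byte-string array should take 5 bytes per element and the unicode one 20.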

Also, in Python 2.6/2.7 bytes is an alias for str, so bytes (or np.bytes_) gives the same dtype on both versions:

>>> np.dtype((bytes, 1)) # 2.7
dtype('|S1')
>>> np.dtype((bytes, 1)) # 3.2
dtype('|S1')

And b'' literals are supported:

>>> np.array([b'Hello', b'World']) # 2.7
array(['Hello', 'World'], 
      dtype='|S5')
>>> np.array([b'Hello', b'World']) # 3.2
array([b'Hello', b'World'], 
      dtype='|S5')
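
If the data starts out as regular Python 3 str objects, one way to get to the 1-byte-per-char representation and back is np.char.encode and np.char.decode. This is a sketch assuming Python 3 and pure-ASCII input; a non-ASCII character would raise a UnicodeEncodeError during encoding. The names words, packed and restored are just for illustration.

```python
import numpy as np

# On Python 3 this is a unicode array ('<U5'), 4 bytes per character.
words = np.array(['Hello', 'World'])

# Encode each element to ASCII bytes; the result is a byte-string array.
packed = np.char.encode(words, 'ascii')
print(packed.dtype, packed.itemsize)

# Decode back to str when text is needed again.
restored = np.char.decode(packed, 'ascii')
print(restored.tolist())
```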
