NumPy's string dtype seems to correspond to Python's str, and thus to change between Python 2.x and 3.x:
In Python 2.7:
In [1]: import numpy as np
In [2]: np.dtype((np.str_, 1)).itemsize
Out[2]: 1
In [3]: np.dtype((np.unicode_, 1)).itemsize
Out[3]: 4
In Python 3.3:
In [2]: np.dtype((np.str_, 1)).itemsize
Out[2]: 4
The version of NumPy is 1.7.0 in both cases.
I'm writing code that I want to work on both Python versions, and I need an array of ASCII strings (a 4x memory overhead is not acceptable). So the questions are:
- How do I define a dtype for an ASCII string of certain length (with 1 byte per char) in Python 3?
- How do I do it in a way that also works in Python 2?
- Bonus question: Can I limit the alphabet even further, e.g. to ascii_uppercase, and save a bit or two per char?
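For the first two questions, one approach (a sketch, not confirmed by the original post) is NumPy's bytes-string dtype 'S', which stores one byte per character on both Python 2 and Python 3:

```python
import numpy as np

# 'S<n>' is a fixed-width bytes string: 1 byte per character on
# both Python 2 and Python 3 (where it maps to np.bytes_).
dt = np.dtype('S10')
assert dt.itemsize == 10

arr = np.array([b'hello', b'world'], dtype='S10')
# Elements come back as bytes; decode to get text on Python 3.
print(arr[0].decode('ascii'))
```

The trade-off is that on Python 3 the elements are bytes, not str, so an explicit decode step is needed wherever text is expected.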
One potential answer to the first question is character arrays (i.e. an array of character arrays instead of an array of strings). It seems I can specify the item size when constructing one:
chararray(shape, itemsize=1, unicode=False, buffer=None, offset=0,
strides=None, order=None)
Update: nah, the itemsize is actually the number of characters. But there's still unicode=False. Is that the way to go? Will it answer the last question, too? And how do I actually use it as a dtype?
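As a sketch of the chararray route (hedged, based on the signature above): constructing a chararray with unicode=False appears to yield one byte per character, and the equivalent dtype can also be spelled directly without chararray:

```python
import numpy as np

# A bytes-based chararray: with unicode=False, itemsize counts
# characters, each stored in a single byte.
ca = np.chararray((3,), itemsize=5, unicode=False)
ca[:] = b'abc'
assert ca.dtype == np.dtype('S5')   # 5 bytes per element

# The same dtype can be used with a plain ndarray:
arr = np.zeros(3, dtype='S5')
assert arr.dtype.itemsize == 5
```

So the dtype underlying a bytes chararray is just 'S<n>', which can be passed anywhere a dtype is accepted.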
bool arrays take a full 8 bits for every True/False value they store. – Jaime Mar 4 at 8:30