ASCII string as dtype for numpy array of strings in Python 3

Question

NumPy's string dtype seems to correspond to Python's str and thus to change between Python 2.x and 3.x:

In Python 2.7:

In [1]: import numpy as np

In [2]: np.dtype((np.str_, 1)).itemsize
Out[2]: 1

In [3]: np.dtype((np.unicode_, 1)).itemsize
Out[3]: 4

In Python 3.3:

In [2]: np.dtype((np.str_, 1)).itemsize
Out[2]: 4

The version of NumPy is 1.7.0 in both cases.

I'm writing some code that I want to work on both Python versions, and I want an array of ASCII strings (4x memory overhead is not acceptable). So the questions are:

1. How do I define a dtype for an ASCII string of certain length (with 1 byte per char) in Python 3?
2. How do I do it in a way that also works in Python 2?
3. Bonus question: Can I limit the alphabet even further, e.g. to ascii_uppercase, and save a bit or two per char?

One potential answer I see for the first question is character arrays (i.e. an array of character arrays instead of an array of strings). It seems I can specify the item size when constructing one:

chararray(shape, itemsize=1, unicode=False, buffer=None, offset=0,
          strides=None, order=None)

Update: nah, the itemsize is actually the number of characters. But there's still unicode=False.

Is that the way to go?

Will it answer the last question, too?

And how do I actually use it as dtype?

I am pretty sure the answer to your very last question is a big NO. AFAIK, and I did look into it some time back, there is no way of packing data of less than 8 bits into less than 1 byte. Definitely not for 6 or 7 bit types, but unless you handle it yourself, you can't have 2 four bit values in a single 8 bit container either. Even bool arrays take a full 8 bits for every True/False value they store. — Jaime, Mar 4 at 8:30
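To illustrate the "handle it yourself" route from the comment above, here is a minimal sketch that packs uppercase ASCII letters into 5 bits each with np.packbits and np.unpackbits. The helper names pack_upper and unpack_upper are made up for this example and are not part of NumPy; the result is a flat uint8 buffer, not an array with a sub-byte dtype, so you lose normal string indexing.

```python
import numpy as np

BITS = 5  # 2**5 = 32 codes, enough for the 26 uppercase letters


def pack_upper(text):
    """Hypothetical helper: pack an A-Z string into 5 bits per character."""
    codes = np.frombuffer(text.encode('ascii'), dtype=np.uint8) - ord('A')
    # Expand each code to 8 bits (most significant first), keep the low 5.
    bits = np.unpackbits(codes[:, np.newaxis], axis=1)[:, -BITS:]
    # packbits zero-pads the tail up to a whole number of bytes.
    return np.packbits(bits.ravel()), len(text)


def unpack_upper(packed, n):
    """Hypothetical helper: invert pack_upper for a string of n characters."""
    bits = np.unpackbits(packed)[:n * BITS].reshape(n, BITS)
    # Pad each 5-bit code back to 8 bits before repacking into bytes.
    padded = np.zeros((n, 8), dtype=np.uint8)
    padded[:, -BITS:] = bits
    codes = np.packbits(padded, axis=1).ravel() + ord('A')
    return codes.astype(np.uint8).tobytes().decode('ascii')


packed, n = pack_upper("HELLOWORLD")
print(packed.nbytes)                       # 7 bytes instead of 10
print(unpack_upper(packed, n))             # HELLOWORLD
print(np.zeros(8, dtype=bool).nbytes)      # 8: one full byte per bool
```

Whether the saving is worth the extra packing and unpacking step depends on how often the data is accessed; for most uses a plain 'S' dtype is probably simpler.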

eryksun · Accepted Answer · 2013-03-05 08:29:15Z

You can use the 'S' typestr:

>>> np.array(['Hello', 'World'], dtype='S')
array([b'Hello', b'World'], 
      dtype='|S5')
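As a quick check of the memory concern in the question, this small sketch compares the 'S' (bytes) and 'U' (unicode) typestrs on the same data. The variable names are just for illustration.

```python
import numpy as np

words = ['Hello', 'World']

as_bytes = np.array(words, dtype='S')    # one byte per character
as_unicode = np.array(words, dtype='U')  # four bytes per character

print(as_bytes.itemsize, as_bytes.nbytes)      # 5 10
print(as_unicode.itemsize, as_unicode.nbytes)  # 20 40
```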

Also, in Python 2.6/2.7 bytes is an alias for str, so bytes (or np.bytes_) gives the same dtype on both versions:

>>> np.dtype((bytes, 1)) # 2.7
dtype('|S1')
>>> np.dtype((bytes, 1)) # 3.2
dtype('|S1')
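For a fixed length other than 1, the same tuple form (or the equivalent 'S8' typestr) can be used. A minimal sketch, assuming a length of 8 picked only for illustration; note that longer values are truncated silently on assignment:

```python
import numpy as np

dt = np.dtype((np.bytes_, 8))      # equivalent to np.dtype('S8')
arr = np.zeros(3, dtype=dt)        # three empty 8-byte strings

arr[0] = b'ACGT'                   # shorter values are null-padded internally
arr[1] = b'ACGTACGTACGT'           # stored as b'ACGTACGT' (cut to 8 bytes)

print(dt.itemsize, arr.nbytes)     # 8 24
```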

And b'' literals are supported:

>>> np.array([b'Hello', b'World']) # 2.7
array(['Hello', 'World'], 
      dtype='|S5')
>>> np.array([b'Hello', b'World']) # 3.2
array([b'Hello', b'World'], 
      dtype='|S5')
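One consequence in Python 3 is that the elements come back as bytes, not str. If you need text at the edges of your program, a sketch along these lines converts in both directions with np.char.decode and np.char.encode (the variable names are illustrative):

```python
import numpy as np

arr = np.array([b'Hello', b'World'])     # compact storage, dtype S5

text = np.char.decode(arr, 'ascii')      # unicode array for display or str operations
back = np.char.encode(text, 'ascii')     # bytes array again, 1 byte per character

print(text.tolist())                     # ['Hello', 'World']
print(back.itemsize)                     # 5
```

Keeping the array in the 'S' dtype and decoding only when needed preserves the 1-byte-per-character storage.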

