Size-Incremental Numpy Array in Python

Question

I recently needed an incrementally growing NumPy array in Python, and since I couldn't find an existing one I implemented it myself. I'm wondering whether my approach is the best one or whether you can suggest other ideas.

The problem is that I have a 2D array (the program handles nD arrays) whose size is not known in advance, and a variable amount of data needs to be concatenated to it along one direction (say I have to call np.vstack many times). Every time I concatenate data, I need to take the array, sort it along axis 0 and do other things, so I cannot build a long list of arrays and then np.vstack the list in one go. Since memory allocation is expensive, I turned to incremental arrays: whenever the array must grow, I grow it by more than the size I need (I use 50% increments), so that the number of allocations is minimized.
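To make the allocation argument concrete, here is a small sketch that counts how many reallocations different growth policies need to reach a given number of rows. The helper name `count_reallocations`, the target size, and the chunk size of 50 are illustrative assumptions only.

```python
def count_reallocations(target_rows, initial_capacity, grow):
    """Count how many times the capacity must grow to hold target_rows rows.

    `grow` maps the current capacity to a new, strictly larger capacity.
    """
    capacity = initial_capacity
    reallocations = 0
    while capacity < target_rows:
        capacity = grow(capacity)
        reallocations += 1
    return reallocations


# Fixed-step growth: the capacity increases by 50 rows each time.
fixed = count_reallocations(500000, 10, lambda c: c + 50)

# Geometric growth: the capacity increases by 50% (at least one row) each time.
geometric = count_reallocations(500000, 10, lambda c: c + max(1, c // 2))

print(fixed, geometric)
```

Fixed-step growth needs a number of reallocations proportional to the final size, while geometric growth needs only a logarithmic number. Note that capping the increment at a maximum value makes the growth linear again once the cap is reached.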

I coded this up and you can see it in the following code:

import numpy as np

class ExpandingArray:

    __DEFAULT_ALLOC_INIT_DIM = 10   # default initial dimension for all the axes if nothing is given by the user
    __DEFAULT_MAX_INCREMENT = 10    # default value in order to limit the increment of memory allocation

    __MAX_INCREMENT = []    # Max increment
    __ALLOC_DIMS = []       # Dimensions of the allocated np.array
    __DIMS = []             # Dimensions of the view with data on the allocated np.array (__DIMS <= __ALLOC_DIMS)

    __ARRAY = []            # Allocated array

    def __init__(self,initData,allocInitDim=None,dtype=np.float64,maxIncrement=None):
        self.__DIMS = np.array(initData.shape)

        self.__MAX_INCREMENT = maxIncrement
        if self.__MAX_INCREMENT is None:
            self.__MAX_INCREMENT = self.__DEFAULT_MAX_INCREMENT

        # Compute the allocation dimensions based on user's input
        if allocInitDim is None:
            allocInitDim = self.__DIMS.copy()

        while np.any( allocInitDim < self.__DIMS  ) or np.any(allocInitDim == 0):
            for i in range(len(self.__DIMS)):
                if allocInitDim[i] == 0:
                    allocInitDim[i] = self.__DEFAULT_ALLOC_INIT_DIM
                if allocInitDim[i] < self.__DIMS[i]:
                    # Integer growth of at least 1 so the loop always terminates
                    allocInitDim[i] += max(1, min(allocInitDim[i]//2, self.__MAX_INCREMENT))

        # Allocate memory 
        self.__ALLOC_DIMS = allocInitDim
        self.__ARRAY = np.zeros(self.__ALLOC_DIMS,dtype=dtype)

        # Set initData 
        sliceIdxs = tuple(slice(d) for d in self.__DIMS)   # NumPy requires a tuple of slices
        self.__ARRAY[sliceIdxs] = initData

    def shape(self):
        return tuple(self.__DIMS)

    def getAllocArray(self):
        return self.__ARRAY

    def getDataArray(self):
        """
        Get the view of the array with data
        """
        sliceIdxs = tuple(slice(d) for d in self.__DIMS)   # NumPy requires a tuple of slices
        return self.__ARRAY[sliceIdxs]

    def concatenate(self,X,axis=0):
        if axis >= len(self.__DIMS):
            print("Error: axis number exceeds the number of dimensions")
            return

        # Check dimensions for remaining axis 
        for i in range(len(self.__DIMS)):
            if i != axis:
                if X.shape[i] != self.shape()[i]:
                    print("Error: Dimensions of the input array are not consistent in the axis %d" % i)
                    return

        # Check whether allocated memory is enough 
        needAlloc = False
        while self.__ALLOC_DIMS[axis] < self.__DIMS[axis] + X.shape[axis]:
            needAlloc = True
            # Increase the __ALLOC_DIMS 
            # Integer growth of at least 1 so the loop always terminates
            self.__ALLOC_DIMS[axis] += max(1, min(self.__ALLOC_DIMS[axis]//2, self.__MAX_INCREMENT))

        # Reallocate memory and copy old data 
        if needAlloc:
            # Allocate 
            newArray = np.zeros(self.__ALLOC_DIMS, dtype=self.__ARRAY.dtype)   # keep the original dtype
            # Copy 
            sliceIdxs = tuple(slice(d) for d in self.__DIMS)
            newArray[sliceIdxs] = self.__ARRAY[sliceIdxs]
            self.__ARRAY = newArray

        # Concatenate new data 
        sliceIdxs = []
        for i in range(len(self.__DIMS)):
            if i != axis:
                sliceIdxs.append(slice(self.__DIMS[i]))
            else:
                sliceIdxs.append(slice(self.__DIMS[i],self.__DIMS[i]+X.shape[i]))

        self.__ARRAY[tuple(sliceIdxs)] = X
        self.__DIMS[axis] += X.shape[axis]

The code performs considerably better than vstack/hstack over many randomly sized concatenations.

What I'm wondering is: is this the best way? Is there anything in numpy that does this already?

Further, it would be nice to be able to overload the slice assignment operator of np.array, so that as soon as the user assigns anything outside the current dimensions, ExpandingArray.concatenate() is performed. How can such overloading be done?

Testing code: here is the code I used to compare vstack with my method. I add random chunks of data with a maximum length of 100.

import time
import numpy as np

N = 10000

def performEA(N):
    EA = ExpandingArray(np.zeros((0,2)),maxIncrement=1000)
    for i in range(N):
        nNew = np.random.randint(1, 101)   # random chunk length in [1, 100]
        X = np.random.rand(nNew,2)
        EA.concatenate(X,axis=0)
        # Perform operations on EA.getDataArray()
    return EA

def performVStack(N):
    A = np.zeros((0,2))
    for i in range(N):
        nNew = np.random.randint(1, 101)   # random chunk length in [1, 100]
        X = np.random.rand(nNew,2)
        A = np.vstack((A,X))
        # Perform operations on A
    return A

start_EA = time.perf_counter()
EA = performEA(N)
stop_EA = time.perf_counter()

start_VS = time.perf_counter()
VS = performVStack(N)
stop_VS = time.perf_counter()

print("Elapsed Time EA: %.2f" % (stop_EA-start_EA))
print("Elapsed Time VS: %.2f" % (stop_VS-start_VS))

Don't use triple quoted strings for comments ... That's not what they're for ... — mgilson, Feb 22 '13 at 13:51
@mgilson: hey, it's endorsed by Guido: link. And I do it myself, for the little that's worth. :^) — DSM, Feb 22 '13 at 15:04
@DSM -- that shocks me that Guido endorses it ... I still hold to my original statement. I wonder if they generate no code only for Cpython or for other versions as well. — mgilson, Feb 22 '13 at 15:13

seberg · Answer 1 · 2013-02-22 15:18:24Z

I think the most common design pattern for this is to collect the small arrays in a list. Sure, you could do things like dynamic resizing (if you want to do crazy things, you can try the array's resize method too). A typical approach is to double the size each time when you really don't know how large things will get. Of course, if you know how large the array will grow, allocating the full thing up front is simplest.

def performVStack_fromlist(N):
    l = []
    for i in range(N):
        nNew = np.random.randint(1, 101)   # random chunk length in [1, 100]
        X = np.random.rand(nNew,2)
        l.append(X)
    return np.vstack(l)

I am sure there are some use cases where an expanding array is useful (for example, when the appended arrays are all very small), but this loop seems better handled with the pattern above. The optimization is mostly about how often you need to copy everything around, and with a list like this (other than the list itself) that happens exactly once. So it is normally much faster.
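For the case where the array must stay usable after every append, the doubling strategy mentioned above could look roughly like the sketch below. The class name `GrowableRows` and its methods are hypothetical, and the sketch handles only 2-D data growing along axis 0.

```python
import numpy as np


class GrowableRows:
    """Hypothetical append-only 2-D buffer that doubles its row capacity."""

    def __init__(self, n_cols, dtype=np.float64, capacity=16):
        self._buf = np.empty((max(1, capacity), n_cols), dtype=dtype)
        self._n = 0  # number of rows that hold valid data

    def append(self, rows):
        rows = np.atleast_2d(rows)
        needed = self._n + rows.shape[0]
        if needed > self._buf.shape[0]:
            # Double until the new rows fit, then copy the old data once.
            capacity = self._buf.shape[0]
            while capacity < needed:
                capacity *= 2
            new_buf = np.empty((capacity, self._buf.shape[1]),
                               dtype=self._buf.dtype)
            new_buf[:self._n] = self._buf[:self._n]
            self._buf = new_buf
        self._buf[self._n:needed] = rows
        self._n = needed

    @property
    def data(self):
        """View (not a copy) of the rows filled so far."""
        return self._buf[:self._n]


g = GrowableRows(n_cols=2, capacity=4)
for start in range(0, 30, 3):
    chunk = np.arange(start * 2, (start + 3) * 2, dtype=np.float64).reshape(3, 2)
    g.append(chunk)
    g.data.sort(axis=0)  # per-step work can run in place on the view
```

Because `data` returns a view, any view obtained before a reallocation keeps pointing at the old buffer, so it should be fetched again after each `append`.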

I'm actually avoiding doing this list approach because each time that I concatenate something I also need to perform other operations on the array (like sorting and many other things). I edited the example with comments where I need to perform additional operations. — Daniele Bigoni, Feb 22 '13 at 15:40

shx2 · Answer 2 · 2013-02-22 15:36:13Z

When I faced a similar problem, I used ndarray.resize() (http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.resize.html#numpy.ndarray.resize). Most of the time, it will avoid reallocation+copying altogether. I can't guarantee it would prove to be faster (it probably would), but it's so much simpler.
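A minimal sketch of growing an array in place with `ndarray.resize` follows. It assumes a C-contiguous array that owns its data and grows along axis 0, so the existing rows keep their positions. `refcheck=False` is an assumption made for this example: it skips NumPy's reference-count check, which is only safe when no other view of the array is still alive.

```python
import numpy as np

# np.array(...) owns its data; a reshaped view would not be resizable.
a = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
n_used = a.shape[0]

# Grow by 50% (at least one row) along axis 0. In C order the existing
# rows stay where they are and the new rows are zero-filled.
new_rows = n_used + max(1, n_used // 2)
a.resize((new_rows, a.shape[1]), refcheck=False)

# Write one new row into the freshly added space.
a[n_used:new_rows] = [[6.0, 7.0]]
```

With the default `refcheck=True`, the call raises `ValueError` whenever NumPy detects that the array is referenced elsewhere.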

As for your second question, I think overriding slice assignment for extending purposes is not a good idea. That operator is meant for assigning to existing items/slices. If you want to change that, it's not immediately clear how you'd want it to behave in some cases, e.g.:

a = MyExtendableArray(np.arange(100))
a[200] = 6  # resize to 200? pad [100:200] with what?
a[90:110] = 7  # assign to existing items AND automagically-allocated items?
a[::-1][200] = 6 # ...

My suggestion is that slice-assignment and data appending should remain separate.
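One way to keep the two operations separate is sketched below: item assignment is forwarded to the underlying array, so an out-of-range integer index raises the usual `IndexError`, while growth happens only through an explicit method. The class `AppendOnlyArray` and its method names are hypothetical. For brevity it is 1-D and reallocates with `np.concatenate` on every append.

```python
import numpy as np


class AppendOnlyArray:
    """Hypothetical 1-D wrapper: indexing never grows, only append() does."""

    def __init__(self, data):
        self._data = np.array(data)  # private copy of the initial data

    def __len__(self):
        return self._data.shape[0]

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        # Delegates to NumPy, so an integer index outside the current
        # bounds raises IndexError instead of extending the array.
        self._data[key] = value

    def append(self, values):
        # Growth is explicit; a real implementation would over-allocate.
        self._data = np.concatenate([self._data, np.atleast_1d(values)])


a = AppendOnlyArray(np.arange(100))
a[90:100] = 7          # assignment to existing items works as usual
a.append([1, 2, 3])    # the only way to make the array longer

try:
    a[200] = 6
    grew_silently = True
except IndexError:
    grew_silently = False
```

Note that NumPy clips out-of-range slices rather than raising, so with this wrapper `a[90:110] = 7` on a 100-element array would write only to the existing items.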

+1 for the overriding suggestion. About the resize I like the suggestion but "Referencing an array prevents resizing..." and I might need to reference that outside. — Daniele Bigoni, Feb 22 '13 at 15:50
