Consecutive, Overlapping Subsets of Array (NumPy, Python)

Question

I have a NumPy array [1,2,3,4,5,6,7,8,9,10,11,12,13,14] and want to have an array structured like [[1,2,3,4], [2,3,4,5], [3,4,5,6], ..., [11,12,13,14]].

Sure this is possible by looping over the large array and adding arrays of length four to the new array, but I'm curious if there is some secret 'magic' Python method doing just this :)

user2357112 · Accepted Answer · 2017-11-25 07:35:20Z

35

You should use stride_tricks. When I first saw this, the word 'magic' did spring to mind. It's simple and is by far the fastest method.

>>> as_strided = numpy.lib.stride_tricks.as_strided
>>> a = numpy.arange(1,15)
>>> a
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
>>> b = as_strided(a, (11,4), a.strides*2)
>>> b
array([[ 1,  2,  3,  4],
       [ 2,  3,  4,  5],
       [ 3,  4,  5,  6],
       [ 4,  5,  6,  7],
       [ 5,  6,  7,  8],
       [ 6,  7,  8,  9],
       [ 7,  8,  9, 10],
       [ 8,  9, 10, 11],
       [ 9, 10, 11, 12],
       [10, 11, 12, 13],
       [11, 12, 13, 14]])

Be aware that the values in array b are those in a, just viewed differently. Do a .copy() on b if you plan to modify it.

I saw this at a SciPy conference. Here are the slides for more explanation.

edited Nov 25, 2017 at 7:35

user2357112

285k32 gold badges487 silver badges568 bronze badges

answered Mar 21, 2010 at 15:19

Paul

43.8k17 gold badges111 silver badges126 bronze badges

3

First of all I would like to say that this is a fantastic solution, thanx for pointing it out! Please note that on my machine the strides are 8 bit length (can be checked using a.strides, as suggested in the link you provided) - probably a 64-bit vs 32-bit architecture default (though I did not verify). That is, the command on my machine is: >>> b = as_strided(a,(11,4),(8,8))

eldad-a
– eldad-a

06/12/2012 08:16:12
Commented Jun 12, 2012 at 8:16
2

@eldad-a for concreteness, use b = as_strided(a,(11,4),(a.strides[0], a.strides[0]))

Max
– Max

04/08/2017 16:41:01
Commented Apr 8, 2017 at 16:41

Add a comment |

Peter Mortensen · Accepted Answer · 2010-12-12 15:32:40Z

The fastest way seems to be to preallocate the array, given as option 7 right at the bottom of this answer.

>>> import numpy as np
>>> A=np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14])
>>> A
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
>>> np.array(zip(A,A[1:],A[2:],A[3:]))
array([[ 1,  2,  3,  4],
       [ 2,  3,  4,  5],
       [ 3,  4,  5,  6],
       [ 4,  5,  6,  7],
       [ 5,  6,  7,  8],
       [ 6,  7,  8,  9],
       [ 7,  8,  9, 10],
       [ 8,  9, 10, 11],
       [ 9, 10, 11, 12],
       [10, 11, 12, 13],
       [11, 12, 13, 14]])
>>>

You can easily adapt this to do it for variable chunk size.

>>> n=5
>>> np.array(zip(*(A[i:] for i in range(n))))
array([[ 1,  2,  3,  4,  5],
       [ 2,  3,  4,  5,  6],
       [ 3,  4,  5,  6,  7],
       [ 4,  5,  6,  7,  8],
       [ 5,  6,  7,  8,  9],
       [ 6,  7,  8,  9, 10],
       [ 7,  8,  9, 10, 11],
       [ 8,  9, 10, 11, 12],
       [ 9, 10, 11, 12, 13],
       [10, 11, 12, 13, 14]])

You may wish to compare performance between this and using itertools.islice.

>>> from itertools import islice
>>> n=4
>>> np.array(zip(*[islice(A,i,None) for i in range(n)]))
array([[ 1,  2,  3,  4],
       [ 2,  3,  4,  5],
       [ 3,  4,  5,  6],
       [ 4,  5,  6,  7],
       [ 5,  6,  7,  8],
       [ 6,  7,  8,  9],
       [ 7,  8,  9, 10],
       [ 8,  9, 10, 11],
       [ 9, 10, 11, 12],
       [10, 11, 12, 13],
       [11, 12, 13, 14]])

My timing results:

1. timeit np.array(zip(A,A[1:],A[2:],A[3:]))
10000 loops, best of 3: 92.9 us per loop

2. timeit np.array(zip(*(A[i:] for i in range(4))))
10000 loops, best of 3: 101 us per loop

3. timeit np.array(zip(*[islice(A,i,None) for i in range(4)]))
10000 loops, best of 3: 101 us per loop

4. timeit numpy.array([ A[i:i+4] for i in range(len(A)-3) ])
10000 loops, best of 3: 37.8 us per loop

5. timeit numpy.array(list(chunks(A, 4)))
10000 loops, best of 3: 43.2 us per loop

6. timeit numpy.array(byN(A, 4))
10000 loops, best of 3: 100 us per loop

# Does preallocation of the array help? (11 is from len(A)+1-4)
7. timeit B=np.zeros(shape=(11, 4),dtype=np.int32)
100000 loops, best of 3: 2.19 us per loop
   timeit for i in range(4):B[:,i]=A[i:11+i]
10000 loops, best of 3: 20.9 us per loop
total 23.1us per loop

As len(A) increases (20000) 4 and 5 converge to be equivalent speed (44 ms). 1,2,3 and 6 all remain about 3 times slower (135 ms). 7 is much faster (1.36 ms).

Note that the islice version is probably silly in this particular use case. It is slower (islicing is O(n); slicing is O(1)). It may be somewhat more memory efficient (maybe and maybe not—numpy slices don't copy), but if we wanted to optimize for memory we would never form any lists at all. Of course, tests talk and guesses walk. — Mike Graham
– Mike Graham, Commented Mar 21, 2010 at 3:54
If I was really into benchmarking this (a silly but fun enough task), I'd probably try to shave off a couple more nanoseconds by using numpy.empty instead of numpy.zeros and using a column-major array so that I deal with contiguous memory when setting columns. — Mike Graham
– Mike Graham, Commented Mar 21, 2010 at 6:24
Very nice, but have a look at Paul's answer below, I tested it and it's about 5 times faster than the 7th method. — Max
– Max, Commented Apr 8, 2017 at 16:38
I tried above and kept getting TypeError: iteration over a 0-d array. I had to use np.array(list(zip(A,A[1:],A[2:],...))) — Zak Keirn
– Zak Keirn, Commented Mar 24, 2018 at 19:11

Federico A. Ramponi · Accepted Answer · 2010-03-21 02:56:43Z

4

Quick&dirty solution:

>>> a = numpy.arange(1,15)
>>> numpy.array([ a[i:i+4] for i in range(len(a)-3) ])
array([[ 1,  2,  3,  4],
       [ 2,  3,  4,  5],
       [ 3,  4,  5,  6],
       [ 4,  5,  6,  7],
       [ 5,  6,  7,  8],
       [ 6,  7,  8,  9],
       [ 7,  8,  9, 10],
       [ 8,  9, 10, 11],
       [ 9, 10, 11, 12],
       [10, 11, 12, 13],
       [11, 12, 13, 14]])

answered Mar 21, 2010 at 2:56

Federico A. Ramponi

47.2k31 gold badges112 silver badges134 bronze badges

Nothing really dirty about that.

Mike Graham
– Mike Graham

03/21/2010 03:02:44
Commented Mar 21, 2010 at 3:02
Certainly quick - see the timings on my answer. Also easily to generalise to different size chunks.

John La Rooy
– John La Rooy

03/21/2010 04:28:28
Commented Mar 21, 2010 at 4:28

Add a comment |

meduz · Accepted Answer · 2010-03-27 10:49:07Z

Broadcast!

from numpy import ogrid
def stretch(N=5,M=15):
    x, y = ogrid[0:M,0:N]
    return x+y+1

Note that ogrid does give things like:

>> ogrid[0:5,0:5]
>> 
[array([[0],
       [1],
       [2],
       [3],
       [4]]),
 array([[0, 1, 2, 3, 4]])]

Let's compare with another solution given here:

def zipping(N=5,M=15):
    A = numpy.arange(1, M+1)
    return numpy.array(zip(*(A[i:] for i in range(N))))

comparison (python 2.6, 32 bit, 1Go RAM) gives

>>> %timeit stretch(5,15)
10000 loops, best of 3: 61.2 us per loop

>>> %timeit zipping(5,15)
10000 loops, best of 3: 72.5 us per loop

>>> %timeit stretch(5,1e3)
10000 loops, best of 3: 128 us per loop

>>> %timeit zipping(5,1e3)
100 loops, best of 3: 4.25 ms per loop

The 40-fold speed-up is kind of consistant to scaling.

Alex Martelli · Accepted Answer · 2010-03-21 03:27:57Z

1

Using itertools, and assuming Python 2.6:

import itertools

def byN(iterable, N):
    itrs = itertools.tee(iter(iterable), N)
    for n in range(N):
        for i in range(n):
            next(itrs[n], None)
    return zip(*itrs)

aby4 = numpy.array(byN(thearray, 4))

edited Mar 21, 2010 at 3:27

answered Mar 21, 2010 at 3:04

Alex Martelli

886k175 gold badges1.3k silver badges1.4k bronze badges

This code doesn't work. numpy would interpret the izip object as a scalar and make a rank-1, length-1 array of dtype object out of it.

Mike Graham
– Mike Graham

03/21/2010 03:06:56
Commented Mar 21, 2010 at 3:06
Also, you need to from itertools import tee, izip or use the fully-qualified names.

Mike Graham
– Mike Graham

03/21/2010 03:08:01
Commented Mar 21, 2010 at 3:08
I thought numpy.array accepted any iterable -- editing to fix.

Alex Martelli
– Alex Martelli

03/21/2010 03:09:50
Commented Mar 21, 2010 at 3:09
Also, you have backwards logic in traversing the iterators you teed. The columns in the resulting array would be backwards if the code did work.

Mike Graham
– Mike Graham

03/21/2010 03:10:00
Commented Mar 21, 2010 at 3:10
@Alex Martelli, numpy.array doesn't. It can be frustrating, but I think the reason is that it needs to preallocate the array so it wants to know the size of the input ahead of time.

Mike Graham
– Mike Graham

03/21/2010 03:10:57
Commented Mar 21, 2010 at 3:10

| Show 2 more comments

Mike Graham · Accepted Answer · 2010-03-21 02:50:41Z

I know of no Python stdlib function that quite does that. It's easy enough to do. Here is a generator that basically does it:

def chunks(sequence, length):
    for index in xrange(0, len(sequence) - length + 1):
        yield sequence[index:index + length]

You could use it like so

>>> import numpy
>>> a = numpy.arange(1, 15)
>>> a
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
>>> numpy.array(list(chunks(a, 4)))
array([[ 1,  2,  3,  4],
       [ 2,  3,  4,  5],
       [ 3,  4,  5,  6],
       [ 4,  5,  6,  7],
       [ 5,  6,  7,  8],
       [ 6,  7,  8,  9],
       [ 7,  8,  9, 10],
       [ 8,  9, 10, 11],
       [ 9, 10, 11, 12],
       [10, 11, 12, 13],
       [11, 12, 13, 14]])

The only weird thing about this code is that I called list on the result of chunks(a, 4). This is since numpy.array does not accept an arbitrary iterable, such as the generator chunks returns. If you just want to iterate over these chunks, you don't need to bother. If you really need to put the result into an array you can do it this way or some more efficient ways.

Peter Mortensen · Accepted Answer · 2010-12-12 15:29:44Z

The efficient NumPy way to do this is given here, which is somewhat too long to reproduce here. It boils down to using some stride tricks, and is much faster than itertools for largish window sizes. For example, using a method essentially the same as that of of Alex Martelli's:

In [16]: def windowed(sequence, length):
    seqs = tee(sequence, length)
    [ seq.next() for i, seq in enumerate(seqs) for j in xrange(i) ]
    return zip(*seqs)

We get:

In [19]: data = numpy.random.randint(0, 2, 1000000)

In [20]: %timeit windowed(data, 2)
100000 loops, best of 3: 6.62 us per loop
In [21]: %timeit windowed(data, 10)
10000 loops, best of 3: 29.3 us per loop
In [22]: %timeit windowed(data, 100)
1000 loops, best of 3: 1.41 ms per loop
In [23]: %timeit segment_axis(data, 2, 1)
10000 loops, best of 3: 30.1 us per loop
In [24]: %timeit segment_axis(data, 10, 9)
10000 loops, best of 3: 30.2 us per loop
In [25]: %timeit segment_axis(data, 100, 99)
10000 loops, best of 3: 30.5 us per loop

Collectives™ on Stack Overflow

Consecutive, Overlapping Subsets of Array (NumPy, Python)

7 Answers 7

My timing results:

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

7 Answers 7

My timing results:

Your Answer

Sign up or log in

Post as a guest

Linked

Related