I tried to use a nested list to hold scraped data from HTML,
but after 50,000 list appends I got a MemoryError,
so I decided to change the lists to a NumPy array:
SapList = []
ListAll = np.array([])

def eachshop():  # fill the list with each shop's data
    global ListAll
    SapList.append(RowNum)
    SapList.extend([sap])  # from one to 10 values in one list: ["sap1", "sap2", "sap3", ..., "sap10"]
    SapList.extend([[strLink, ProdName], ProdCode, ProdH, NewPrice, OldPrice,
                    [FileName + '#Komp!A1', KompPrice], [FileName + '#Sav!A1', 'Sav']])
    SapList.extend([ss])  # from zero to 80 sublists with 3 values each: [["id1", "link", "address"], ..., ["id80", "link", "address"]]
    ListAll = np.append(np.array(SapList))
Then, when I do print(ListAll), I get an exception at C:\Python36\scrap.py, line 307 ("ListAll = np.append(np.array(SapList))"): setting an array element with a sequence.
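For context, here is a minimal sketch (with made-up data standing in for SapList) of why that ValueError appears: a ragged, mixed-depth list cannot be turned into a regular NumPy array. An object-dtype array works, but it only stores Python references, so it saves no memory over the original list:

```python
import numpy as np

# Hypothetical data shaped like SapList: scalars mixed with nested lists
sap_list = [1, "sap1", ["link", "name"], ["id1", "link", "addr"]]

# A ragged, mixed-depth list cannot become a regular array
# (recent NumPy raises ValueError: setting an array element with a sequence):
try:
    np.array(sap_list)
except ValueError as e:
    print("conversion failed:", e)

# dtype=object accepts ragged data, but each slot is just a pointer
# to the original Python object, so memory usage is not reduced:
arr = np.array(sap_list, dtype=object)
print(arr.shape)  # (4,)
```

Note also that np.append() takes two arguments, (arr, values); calling it with one argument is itself a TypeError.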
Now, to speed things up, I am using pool.map:
def makePool(cP, func, iters):
    try:
        pool = ThreadPool(cP)
        # iterate over the URLs
        pool.map_async(func, enumerate(iters, start=2)).get(99999)
        pool.close()
        pool.join()
    except:
        print('Pool Error')
        raise
    finally:
        pool.terminate()
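For reference, here is a self-contained sketch of that pool pattern, with a hypothetical worker standing in for eachshop(). Note that a ThreadPool only helps with I/O-bound work such as downloading pages; because of the GIL it does not parallelize CPU-bound parsing:

```python
from multiprocessing.pool import ThreadPool

def make_pool(num_workers, func, iters):
    """Run func over enumerate(iters, start=2) on a thread pool."""
    pool = ThreadPool(num_workers)
    try:
        # .get() with a timeout keeps the main thread interruptible (Ctrl-C)
        results = pool.map_async(func, enumerate(iters, start=2)).get(99999)
        pool.close()
        pool.join()
        return results
    finally:
        pool.terminate()

# Hypothetical worker: receives (row_number, url) tuples
def worker(item):
    row_num, url = item
    return (row_num, len(url))

print(make_pool(4, worker, ["http://a", "http://bb"]))
# [(2, 8), (3, 9)]
```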
So how can I use a NumPy array in my example to reduce memory usage and speed up the I/O operations?
What is

ListAll = np.append(np.array(SapList))

supposed to be doing? It's obviously not going to append anything to ListAll; it's going to call append on nothing but the temporary array created from SapList, then store the result in ListAll, replacing whatever used to be there. I'm pretty sure that's not what you want, but I'm not sure what you do want, so I can't tell you how to fix it.

ListAll = np.append(np.array(SapList))

is not the same as ListAll.append([SapList]). The latter would call an append method on ListAll. The former calls an append function on the np module, doesn't even pass ListAll to it, and then just assigns the result to ListAll.
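One way to avoid the trap entirely (a sketch, with made-up per-shop rows standing in for the real scraped data): np.append(arr, values) copies the whole array on every call, so calling it in a loop is O(n²). Collect the rows in a plain Python list and convert once at the end:

```python
import numpy as np

rows = []
for i in range(3):  # stand-in for the per-shop loop
    sap_list = [i, f"sap{i}", [f"id{i}", "link", "addr"]]  # hypothetical row
    rows.append(sap_list)  # cheap: no copying of earlier rows

# One conversion at the end; dtype=object because the cells are mixed types
list_all = np.array(rows, dtype=object)
print(list_all.shape)  # (3, 3)
```

If the rows stay ragged and mixed like SapList, though, a plain list of lists is usually the better container; NumPy only pays off for homogeneous numeric data.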