Merging NumPy arrays and finding columns in Python

Question

I am new to Python. I have two data files in CSV format. I loaded the data from the CSV files into two NumPy arrays:

matrix1 = numpy.genfromtxt(fileName1)
matrix2 = numpy.genfromtxt(fileName2)

The two matrices differ in both their number of rows and their number of columns.

>>> print(matrix1.shape)
(971, 4413)
>>> print(matrix2.shape)
(5504, 4431)

I want to combine matrix1 and matrix2 in such a way:

mergedMatrix = [ matrix1, matrix2 ]

where I can access matrix1 from mergedMatrix using index 0 and matrix2 using index 1.

I tried to use numpy.concatenate, but it does not work on these two matrices. So I tried the pandas merge function after converting matrix1 and matrix2 into pandas DataFrames. However, that took a long time, and all the matrices were merged into a single linear array like [1, 2, 3, 4, 5, ...], so I had no way to distinguish between matrix1 and matrix2 in mergedMatrix.

So I am using:

#mergedMatrix as a list
mergedMatrix = [matrix1, matrix2]

My data contains values like Inf. If a column in matrix1 contains the value Inf, then I want to delete that column as well as the corresponding column in matrix2, i.e. the column with the same column number.

Questions

Is there a better way than using a list for mergedMatrix?
How can I quickly find out whether a matrix1 column contains such values, and get its column number, without checking each element one by one?

Example:

matrix1 = [[1, 2, 3],
           [3, inf,0],
           [2 , inf, inf]]
matrix2 = [[0, 4, 2, 7],
           [0, 1, 0.5, 3],
           [1, 2, 3, 9]]

mergedMatrix = [[1, 2, 3],
           [3, inf,0],
           [2 , inf, inf],
           [0, 4, 2, 7],
           [0, 1, 0.5, 3],
           [1, 2, 3, 9]]

The result should be:

mergedMatrix = [[1],
                [3],
                [2],
                [0,7],
                [0,3],
                [1,9]]

removedMatrixCols = [[2, 3],
               [inf,0],
               [inf, inf],
               [4, 2],
               [1, 0.5],
               [2, 3]]

Then I want to split the matrices:

newMatrix1 = [[1],
              [3],
              [2]]
newMatrix2 = [[0,7],
              [0,3],
              [1,9]]

removedCols1 = [[2, 3],
                [inf,0],
                [inf, inf]]

removedCols2 = [[4, 2],
                [1, 0.5],
                [2, 3]]

so that I can store them into CSV files separately.

Add a minimal working example with some dummy data, including the steps you tried (e.g. using np.random.rand()). You could store your arrays in a list and access them by list[0] and list[1] — Moritz, Jul 5 '15 at 12:37
If you can make the two matrices equal size, you can use numpy.dstack([matrix1, matrix2]) and have a neat 3D matrix. — Evert, Jul 5 '15 at 12:39
With the way numpy stores its arrays, you'll have to make the dimensions of the two matrices equal. — Evert, Jul 5 '15 at 12:44
Are the second dimensions of your matrices indeed 4413 and 4431? — Evert, Jul 5 '15 at 12:48
@Moritz Added that. Yes I know that I can access the matrices using list[0] and list[1]. — Tehmas, Jul 5 '15 at 12:53

Matthew · Answer 1 · 2015-07-05 13:14:59Z


Answers in short: technically yes, but not really, no and yes.

1: You should use a list if you want a 3-D list, but I would also make it into an array (mergedMatrix = numpy.array([matrix1, matrix2])) so that you can still use element-by-element logic on the new matrix.
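A minimal sketch of holding both matrices in one NumPy container when their shapes differ, as they do in the question. With unequal shapes the two matrices cannot form a regular 3-D array, so this sketch pre-allocates a 1-D object array and assigns each matrix into its own slot. The small matrices are made-up stand-ins, not the question's data.

```python
import numpy as np

# Made-up stand-ins for the question's matrices (unequal shapes).
matrix1 = np.ones((3, 3))
matrix2 = np.zeros((3, 4))

# Pre-allocate a 1-D object array and fill it slot by slot, so NumPy
# does not try to combine the two shapes into one regular array.
merged = np.empty(2, dtype=object)
merged[0] = matrix1
merged[1] = matrix2

print(merged[0].shape)  # shape of the first stored matrix
print(merged[1].shape)  # shape of the second stored matrix
```

Element-wise operations on such an object array act on the stored matrices one at a time, so in practice it offers little beyond what a plain list does.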

2: (Note: these are pretty different questions, so, strictly speaking, they should be asked as 2 separate questions rather than merged into one, but I'll survive)

For this, you can remove a column using numpy.delete. To remove a column, pass axis=1, e.g.:

new_mat = numpy.delete(mergedMatrix, cols_to_delete, axis=1)

where mergedMatrix and cols_to_delete are both arrays.

Instead of looping through the array with nested for loops to find columns containing an Inf value, you can use numpy.isinf and reduce its result per column, which you can then substitute for cols_to_delete from above (note: cols_to_delete = numpy.where(numpy.isinf(matrix1).any(axis=0))[0]).
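As a hedged illustration of the numpy.delete approach, here is a small sketch using the example matrices from the question. The column indices are computed from matrix1 only and then applied to both matrices separately, since the two have different widths.

```python
import numpy as np

# Example data from the question.
matrix1 = np.array([[1, 2, 3],
                    [3, np.inf, 0],
                    [2, np.inf, np.inf]], dtype=float)
matrix2 = np.array([[0, 4, 2, 7],
                    [0, 1, 0.5, 3],
                    [1, 2, 3, 9]], dtype=float)

# Integer indices of the matrix1 columns that contain at least one inf.
cols_to_delete = np.where(np.isinf(matrix1).any(axis=0))[0]

# Delete those column numbers from each matrix (axis=1 means columns).
new_mat1 = np.delete(matrix1, cols_to_delete, axis=1)
new_mat2 = np.delete(matrix2, cols_to_delete, axis=1)

print(cols_to_delete)
print(new_mat1)
print(new_mat2)
```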

Anyhow, hope this helps out! Cheers

edited Jul 5 '15 at 13:14

answered Jul 5 '15 at 12:38


This line from the question, "where I can access matrix1 from mergedMatrix using index 0 and matrix2 using index 1", makes me think the OP wants a 3D matrix. – Evert Jul 5 '15 at 12:41


And you can't hstack/vstack 2 arrays (matrices) that have unequal 2D shapes; at least one shape will have to be equal. – Evert Jul 5 '15 at 12:43

Re-reading the question, I think you are correct..? As for unequal shapes, I mis-read "4413" and "4431" as the same.. oops. Corrected for updated part 1 – Matthew Jul 5 '15 at 12:44

The 4413 and 4431 could be a typo, since the OP asks about flagging columns in matrix 2 depending on values in matrix 1. In that case, it would be an example why it's always necessary to copy-paste things. – Evert Jul 5 '15 at 12:46

could well be. If not though, it is well possible that column 4425 of matrix2 will have an inf element which cannot be removed from matrix1, so @OP, watch for that if that wasn't a typo – Matthew Jul 5 '15 at 12:48


Moritz · Answer 2 · 2015-07-05 12:52:52Z

I can think of four solutions:

1. Use a list as you already did in your question. There is nothing wrong with that. And you can index your array by list[0][xx:yy].
2. Store your data in a dictionary like {1: matrix1, 2: matrix2}.
3. If you really want to use pandas, you would have to add an identifier column to each data set (data1, data2) before merging; later on you can either group your data using groupby or set an index with df.set_index('id_column'). But in my opinion that is just too much.
4. If you use np.vstack or np.hstack (depending on the axis along which the shapes are equal), you will lose the information about which matrix was which, unless you generate a mask with a boolean id, e.g.:

mask = np.ones(len(merged_matrix), dtype=bool)
mask[0:len(matrix1)] = False
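The mask idea from the last option can be sketched as follows. This assumes two made-up arrays with the same number of columns, which np.vstack requires and which the question's real matrices do not satisfy.

```python
import numpy as np

# Made-up arrays with equal column counts, as np.vstack requires.
matrix1 = np.arange(6.0).reshape(2, 3)
matrix2 = np.arange(9.0).reshape(3, 3)

merged_matrix = np.vstack([matrix1, matrix2])

# Boolean id per row: False marks rows from matrix1, True rows from matrix2.
mask = np.ones(len(merged_matrix), dtype=bool)
mask[0:len(matrix1)] = False

# The mask recovers each original block from the stacked array.
recovered1 = merged_matrix[~mask]
recovered2 = merged_matrix[mask]
```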

das-g · Answer 3 · 2015-07-05 23:20:03Z

Assuming you don't actually need mergedMatrix, here's how you can get to newMatrix1, newMatrix2, removedCols1 and removedCols2 without explicitly constructing mergedMatrix.

Locate interesting values

First, let's go find the inf entries:

import numpy as np
matrix1 = np.genfromtxt(fileName1)
matrix2 = np.genfromtxt(fileName2)

matrix1_infs = matrix1 == float('inf')

# or if you want to treat -inf the same as inf:
matrix1_infs = np.isinf(matrix1)

This gives you a boolean 2D NumPy array. For your small example arrays, it will be

array([[False, False, False],
       [False,  True, False],
       [False,  True,  True]], dtype=bool)

Boil it down to columns

You're not interested in individual elements, but in which columns have any inf values. A straightforward way to find out is to use

matrix1_inf_columns = matrix1_infs.any(axis=0)

A bit more obscure would be using a combination of linear algebra and boolean algebra to come up with the following vector-matrix product:

matrix1_inf_columns = np.dot(np.repeat(True, matrix1.shape[0]), matrix1_infs)

The result is the same:

array([False,  True,  True], dtype=bool)
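Since the question also asks for the column numbers, here is a short sketch that turns the per-column boolean result into integer indices with np.flatnonzero. It uses the question's small matrix1 example.

```python
import numpy as np

# matrix1 from the question's small example.
matrix1 = np.array([[1, 2, 3],
                    [3, np.inf, 0],
                    [2, np.inf, np.inf]], dtype=float)

# True for every column that holds at least one +inf or -inf entry.
matrix1_inf_columns = np.isinf(matrix1).any(axis=0)

# Integer column numbers of those columns.
inf_column_numbers = np.flatnonzero(matrix1_inf_columns)
print(inf_column_numbers)
```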

Use boolean index arrays for slicing

Something fun happens when you use boolean NumPy arrays as indices for other NumPy arrays:

>>> matrix1[:, matrix1_inf_columns] # First index is rows, second columns.
                                    # : means all. Thus here:
                                    # All rows, but only the selected columns.
array([[  2.,   3.],
       [ inf,   0.],
       [ inf,  inf]])

Nice. That's just what we wanted for removedCols1. But it gets crazier. What happens when you invert a boolean array with the ~ operator? (Older NumPy releases also accepted unary minus for this; current ones raise a TypeError for it.)

>>> ~matrix1_inf_columns
array([ True, False, False], dtype=bool)

NumPy negates its elements! This means we can get newMatrix1 as

newMatrix1 = matrix1[:, ~matrix1_inf_columns]
# array([[ 1.],
#        [ 3.],
#        [ 2.]])

Of course, the boolean index array doesn't know that it was originally constructed from matrix1, so in the NumPy version used here we can just as easily use it to index matrix2:

removedCols2 = matrix2[:, matrix1_inf_columns]
# array([[ 4. ,  2. ],
#        [ 1. ,  0.5],
#        [ 2. ,  3. ]])

But where the boolean index array is shorter than the indexed dimension, that older NumPy assumes False for the missing boolean indices. (Current NumPy releases instead raise an IndexError for the length mismatch, so there the index array must be padded as described in the next section before it can index matrix2 at all.)

>>> matrix2[:, ~matrix1_inf_columns]
array([[ 0.],
       [ 0.],
       [ 1.]])

That's not the complete newMatrix2 we want.

Size trouble

So we have to use a larger index array.

>>> matrix1_inf_columns.resize(matrix2.shape[1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: cannot resize an array references or is referenced
by another array in this way.  Use the resize function

Ow. The resize function? The documentation says that when the requested size is larger than the array, the numpy.resize function (unlike the ndarray.resize method I tried to use here) does not fill in zeros (False in the case of a boolean array) but instead repeats the array.
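The difference between the two can be seen in a small sketch with a made-up two-element mask. Passing refcheck=False here is an assumption made so the snippet runs regardless of which other references to the array exist; use it with care on arrays that are shared.

```python
import numpy as np

mask = np.array([True, False])

# The np.resize function returns a new array and repeats the input
# to fill the extra slots.
repeated = np.resize(mask, 4)

# The ndarray.resize method works in place and pads with zeros
# (False for a boolean array).
padded = mask.copy()
padded.resize(4, refcheck=False)

print(repeated)
print(padded)
```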

So let's see whether we can use the method on an independent copy instead:

>>> tmp = matrix1_inf_columns.copy()
>>> tmp.resize(matrix2.shape[1])
>>> tmp
array([False,  True,  True, False], dtype=bool)
>>> ~tmp
array([ True, False, False,  True], dtype=bool)

OK, that worked. Let's plug it in as the index for matrix2.

removedCols2 = matrix2[:, tmp]
# array([[ 4. ,  2. ],
#        [ 1. ,  0.5],
#        [ 2. ,  3. ]])

Great, so this still works.

newMatrix2 = matrix2[:, ~tmp]
# array([[ 0.,  7.],
#        [ 0.,  3.],
#        [ 1.,  9.]])

Yay!
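The steps above can be collected into one function. This is only a sketch: the name split_by_inf_columns is hypothetical, and instead of resizing it builds the padded mask with np.zeros, which avoids the reference-check problem and also handles a matrix2 that is narrower than matrix1.

```python
import numpy as np

def split_by_inf_columns(matrix1, matrix2):
    """Split both matrices by the columns of matrix1 that contain inf.

    Hypothetical helper; returns (new1, new2, removed1, removed2).
    """
    inf_cols1 = np.isinf(matrix1).any(axis=0)

    # Mask for matrix2: same flags for the shared column numbers,
    # False for any columns that matrix1 does not have.
    inf_cols2 = np.zeros(matrix2.shape[1], dtype=bool)
    shared = min(inf_cols1.size, inf_cols2.size)
    inf_cols2[:shared] = inf_cols1[:shared]

    return (matrix1[:, ~inf_cols1], matrix2[:, ~inf_cols2],
            matrix1[:, inf_cols1], matrix2[:, inf_cols2])

# Example data from the question.
matrix1 = np.array([[1, 2, 3],
                    [3, np.inf, 0],
                    [2, np.inf, np.inf]], dtype=float)
matrix2 = np.array([[0, 4, 2, 7],
                    [0, 1, 0.5, 3],
                    [1, 2, 3, 9]], dtype=float)

new1, new2, removed1, removed2 = split_by_inf_columns(matrix1, matrix2)
```

Each of the four results is an ordinary 2D array, so it can be written to its own CSV file with np.savetxt.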

To infinity... and beyond

It will get a bit more complicated if you also want to take infinite values in matrix2 into account for the filtering, or if your actual condition is even more complex. But you've now seen most of the concepts you'd need for that.
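One possible way to take infinite values in both matrices into account is sketched below. It assumes that a column number should be dropped from both matrices if either one has an inf in it, which is one reading of the requirement rather than something the question states. The data is made up for illustration.

```python
import numpy as np

# Made-up data: matrix1 has inf in column 1, matrix2 has inf in column 0.
matrix1 = np.array([[1, 2, 3],
                    [3, np.inf, 0],
                    [2, np.inf, 5]], dtype=float)
matrix2 = np.array([[0, 4, 2, 7],
                    [np.inf, 1, 0.5, 3],
                    [1, 2, 3, 9]], dtype=float)

inf1 = np.isinf(matrix1).any(axis=0)
inf2 = np.isinf(matrix2).any(axis=0)

# One flag per column number, as wide as the wider matrix; a column
# number is flagged if either matrix has an inf in that column.
combined = np.zeros(max(inf1.size, inf2.size), dtype=bool)
combined[:inf1.size] |= inf1
combined[:inf2.size] |= inf2

# Trim the combined mask to each matrix's own width before indexing.
new1 = matrix1[:, ~combined[:matrix1.shape[1]]]
new2 = matrix2[:, ~combined[:matrix2.shape[1]]]
```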

