Background

I have tons of very large pandas DataFrames that need to be normalized with the following operation: log2(data) - mean(log2(data)).
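
For concreteness, here is a minimal sketch of that operation on a single made-up row of values, assuming the mean is taken across each row (which matches the code further down):

import numpy as np

row = np.array([0.5, 0.25, 0.125])     # made-up example values
log_row = np.log2(row)                 # [-1., -2., -3.]
normalized = log_row - log_row.mean()  # subtract the row's mean log2 value
print(normalized)                      # [ 1.  0. -1.]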

Example Data

The example DataFrame my_df looks like this:

     iovrrx    nfinsu    mvdfjc    idjges    fubmrg    lvuhfv
0  0.987654  0.206104  0.802920  0.011157  0.860618  0.575871
1  0.706397  0.860083  0.939230  0.436194  0.557081  0.706964
2  0.043139  0.729435  0.597488  0.700998  0.974193  0.917758
3  0.316080  0.461547  0.844540  0.510143  0.908475  0.877330
4  0.828839  0.177670  0.610833  0.328238  0.327697  0.689756

Question

I have tried to perform the normalization operation noted above many different ways, but the following code snippet is the only one I have gotten to work:

import numpy as np
import pandas as pd

# log2-transform, transpose so that subtracting the vector of row means broadcasts correctly
log_div_ave = my_df.apply(np.log2).values.T - my_df.apply(np.log2).mean(axis=1).values

# transpose back and rebuild a DataFrame with the original column names
log_div_ave = pd.DataFrame(log_div_ave.T, columns=my_df.columns)

print(log_div_ave)

     iovrrx    nfinsu    mvdfjc    idjges    fubmrg    lvuhfv
0  1.667378 -0.593258  1.368628 -4.800610  1.468744  0.889117
1  0.056992  0.340988  0.467991 -0.638518 -0.285601  0.058149
2 -3.467018  0.612699  0.324830  0.555330  1.030127  0.944032
3 -0.941776 -0.395590  0.476099 -0.251165  0.581380  0.531053
4  0.933714 -1.288174  0.493400 -0.402633 -0.405015  0.668708

As you can see, I'm converting the DataFrame to a NumPy array and transposing it just so I can subtract the mean of the data; I then have to transpose the resulting array and reconstitute it as a DataFrame. Is there a simpler way to do all of this?

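For reference, one candidate simplification is sketched below. It assumes the same row-wise mean as the snippet above and that my_df is the DataFrame from the example; it relies on DataFrame.sub with axis=0 to align the Series of row means with the index, so no transposing or rebuilding is needed:

import numpy as np

# np.log2 applied to a DataFrame returns a DataFrame with the same index/columns
log_data = np.log2(my_df)

# subtract each row's mean log2 value; axis=0 aligns the row means by index
log_div_ave = log_data.sub(log_data.mean(axis=1), axis=0)

print(log_div_ave)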