\$\begingroup\$

I want to find a way to run the following code in a much faster fashion (ideally something like dataframe.apply(func) which has the fastest speed, with iterating over rows/cols next, and that is already 3x slower). The problem is twofold: how to set this up, and how to save results in other places (an embedded function might do that). I know the pandas function for rolling window regression is already optimized to its limit, but I was wondering how to get rid of the loop and any other \$O(N^k)\$ steps I might have missed.

Any help is greatly appreciated.

import pandas as pd
import numpy as np
periods = 1000
alt_pan_fondi_prices = pd.DataFrame(np.random.randn(periods, 4), index=pd.date_range('2011-1-1', periods=periods), columns=list('ABCD'))
indu = pd.DataFrame(np.random.randn(periods, 4), index=pd.date_range('2011-1-1', periods=periods), columns=list('ABCD'))


# some names to be used later
cols = ['fund'] + [("bench_" + str(i)) for i in list('ABCD')]
for item in alt_pan_fondi_prices.columns.values:

    to_infer = alt_pan_fondi_prices[item].dropna()
    # use a separate name so the benchmark frame is not overwritten on each iteration
    bench = indu.loc[to_infer.index[0]:, :].dropna()
    dfBothPrices = pd.concat([to_infer, bench], axis=1)
    dfBothPrices = dfBothPrices.bfill()
    dfBothReturns = dfBothPrices.pct_change()
    dfBothReturns.columns = cols
    mask = cols[1:]

    # execute the OLS model
    model = pd.ols(y=dfBothReturns['fund'], x=dfBothReturns[mask], window=20)

    # I then need to store a whole bunch of stuff (alphas / betas / rsquared / etc) but I have this part safely taken care of
\$\endgroup\$

1 Answer

\$\begingroup\$

Archaeology

ols isn't in the current version of Pandas, but it is in (at least) 0.10.

In version v0.19.0-415-g542c9166a6, we see

        warnings.warn("The pandas.stats.ols module is deprecated and will be "
                      "removed in a future version. We refer to external packages "
                      "like statsmodels, see some examples here: "
                      "http://www.statsmodels.org/stable/regression.html",
                      FutureWarning, stacklevel=4)

It survived until Thu Feb 9 11:42:15 2017.

This function created a generic linear model object. The only curious parameter is window, which requests a "moving window regression". So far as I can tell, this is not covered in current Pandas or scipy.stats, and certainly not in Numpy; but it is in statsmodels. A contemporary implementation should probably use that.

Performance

"a much faster fashion (ideally something like dataframe.apply(func) which has the fastest speed"

No. apply is essentially a loop, with an optional numba engine that is unlikely to help in your case. For the scale you demonstrated (1000 rows with an outer loop across four columns), individual calls to a rolling regressor are fine.
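To illustrate the "apply is a loop" point, here is a small sketch with a made-up function (spread is a hypothetical name, not part of pandas). It records each call, and the result matches an explicit loop over the columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.default_rng(0).standard_normal((1000, 4)),
    columns=list("ABCD"),
)

calls = []


def spread(col):
    # Hypothetical per-column function; it logs which column it was handed.
    calls.append(col.name)
    return col.max() - col.min()


# apply hands each column to the Python function in turn ...
via_apply = df.apply(spread)

# ... which is the same work as writing the loop yourself.
via_loop = pd.Series({name: df[name].max() - df[name].min() for name in df.columns})
```

The function is invoked at least once per column, so apply saves typing rather than time.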

If you wanted to get really tricky, you could use Numpy, sliding_window_view and one call to lstsq to regress across the outer column dimension, but I consider this premature optimisation and unlikely to be worth it until 1M+ rows.
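For the curious, a minimal sketch of that idea on synthetic returns. np.linalg.lstsq accepts a 2-D right-hand side, so one call can fit all four fund columns for a given window, but it does not batch over windows. To cover every window at once I have swapped in batched normal equations solved with np.linalg.solve; that substitution is my assumption about how one might do it, and lstsq is used only as a check on the last window. All variable names are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
n_obs, n_funds, n_bench, window = 1000, 4, 4, 20

y = rng.standard_normal((n_obs, n_funds))   # synthetic fund returns
x = rng.standard_normal((n_obs, n_bench))   # synthetic benchmark returns
X = np.column_stack([np.ones(n_obs), x])    # design matrix with intercept column

# sliding_window_view puts the window axis last: (n_windows, n_cols, window).
# Transpose each window to the usual (window, n_cols) layout.
Xw = sliding_window_view(X, window, axis=0).transpose(0, 2, 1)
Yw = sliding_window_view(y, window, axis=0).transpose(0, 2, 1)

# Batched normal equations: solve (X'X) B = X'Y for every window at once.
XtX = Xw.transpose(0, 2, 1) @ Xw            # (n_windows, 5, 5)
XtY = Xw.transpose(0, 2, 1) @ Yw            # (n_windows, 5, 4)
coef = np.linalg.solve(XtX, XtY)            # (n_windows, 5, 4)

# Sanity check on the last window: one lstsq call fits all four funds together.
ref, *_ = np.linalg.lstsq(X[-window:], y[-window:], rcond=None)
```

Here coef[i, :, j] holds the intercept and betas of fund j over window i. Normal equations are less numerically robust than lstsq when the regressors are nearly collinear, which is one more reason not to reach for this at small scale.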

\$\endgroup\$
