I'm attempting to solve for $\hat{k}$ clusters, such that the rate distortion is minimized, as described here, however, the answers that I am getting from my algorithm are not following the "Jump" behavior (as is needed to appropriately choose $k$).
My current process:
Given a $N$ by $d$ matrix of asset returns:
- Calculate the correlation of the assets, resulting in a $d \times d$ correlation matrix.
- For $k \in [1, d]$ calculate the $1 \times d$ vectors of cluster labels, using the correlation matrix as the input
- For $k \in [1, d]$ calculate $k, 1 \times d$ vectors of cluster centroids, using the correlation matrix as the input
Based on the cluster label (see code below), construct $\mathbf{A}$, a $d \times d$ matrix of cluster centroids based on the cluster label of each asset. So for instance, say $k=3$ and there are 4 assets total, and the results of the k-means cluster is:
- asset 1 is in cluster 1
- asset 2 is in cluster 1
- asset 3 is in cluster 2
asset 4 is in cluster 3
I would create a $4 \times 4$ matrix where $\textrm{row }1 = \textrm{centroid } 1$ (a 1, 4 row vector), $\textrm{row }2 = \textrm{centroid } 1$ $\textrm{row }3 = \textrm{centroid } 2$ (a 1, 4 row vector), $\textrm{row }4 = \textrm{centroid} 3$
Then calculate the rate distortion function, $\epsilon$, for each $k$, where:
$\epsilon_{k} \triangleq \frac{1}{p}\mathbf{A^\intercal}\mathbf{\Sigma^{-1}}\mathbf{A}$ where, $\mathbf{\Sigma} \triangleq$ The Covariance Matrix of the $N \times d$ asset returns NOTE: I'm not certain whether I take the covariance of the asset returns or the covariance of the covariance matrix (which didn't seem right to me).
This leaves me with a $d \times d$ matrix that I then sum all the values, and follow the remainder of the procedure listed here. This procedure is derived from the paper, Finding the number of clusters in a data set: An information theoretic approach
My Python Code is as follows:
import Pycluster def rate_distortion(X, cov_matrix, num_clusters): clusterid, error, nfound = Pycluster.kcluster(X, nclusters = num_clusters) cdata, cmask = Pycluster.clustercentroids(X, clusterid = clusterid) c_mat = pandas.DataFrame(numpy.empty(X.shape), columns = X.columns) for i, cluster in enumerate(clusterid): c_mat.ix[i, :] = cdata[cluster, :] diff_mat = numpy.subtract(X, c_mat) p = X.shape[1] Y = p/2. distortion = 1./p * numpy.sum(numpy.dot(numpy.dot(diff_mat.transpose(), numpy.linalg.inv(cov_matrix)), diff_mat)) return distortion**(-Y)
To execute a script that exactly illustrates to what I'm referring to, use data located here and the following script:
import pandas, numpy, Pycluster prices = pandas.DataFrame.from_csv('asset_prices.csv') returns = prices.apply(numpy.log).diff() k = numpy.arange(1,12) d = [] for i in k: d.append(rate_distortion(returns.corr(), returns.cov(), i) d = pandas.Series(d, index = k) In [22]: d Out[22]: 1 8.845139e-33 2 3.969062e-29 3 9.387323e-28 4 9.200729e-28 5 4.675902e-18 6 2.412458e-21 7 3.582043e-18 8 8.094695e-17 9 1.424341e-16 10 4.320064e-14 11 inf dtype: float64