Rate Distortion Minimization in a Python Clustering Algorithm

Question

I'm attempting to solve for $\hat{k}$ clusters, such that the rate distortion is minimized, as described here, however, the answers that I am getting from my algorithm are not following the "Jump" behavior (as is needed to appropriately choose $k$).

My current process:

Given a $N$ by $d$ matrix of asset returns:

Calculate the correlation of the assets, resulting in a $d \times d$ correlation matrix.
For $k \in [1, d]$ calculate the $1 \times d$ vectors of cluster labels, using the correlation matrix as the input
For $k \in [1, d]$ calculate $k, 1 \times d$ vectors of cluster centroids, using the correlation matrix as the input
Based on the cluster label (see code below), construct $\mathbf{A}$, a $d \times d$ matrix of cluster centroids based on the cluster label of each asset. So for instance, say $k=3$ and there are 4 assets total, and the results of the k-means cluster is:
- asset 1 is in cluster 1
- asset 2 is in cluster 1
- asset 3 is in cluster 2
- asset 4 is in cluster 3
  
  I would create a $4 \times 4$ matrix where $\textrm{row }1 = \textrm{centroid } 1$ (a 1, 4 row vector), $\textrm{row }2 = \textrm{centroid } 1$ $\textrm{row }3 = \textrm{centroid } 2$ (a 1, 4 row vector), $\textrm{row }4 = \textrm{centroid} 3$
Then calculate the rate distortion function, $\epsilon$, for each $k$, where:

$\epsilon_{k} \triangleq \frac{1}{p}\mathbf{A^\intercal}\mathbf{\Sigma^{-1}}\mathbf{A}$ where, $\mathbf{\Sigma} \triangleq$ The Covariance Matrix of the $N \times d$ asset returns NOTE: I'm not certain whether I take the covariance of the asset returns or the covariance of the covariance matrix (which didn't seem right to me).
This leaves me with a $d \times d$ matrix that I then sum all the values, and follow the remainder of the procedure listed here. This procedure is derived from the paper, Finding the number of clusters in a data set: An information theoretic approach

My Python Code is as follows:

  import Pycluster

  def rate_distortion(X, cov_matrix, num_clusters):
      clusterid, error, nfound = Pycluster.kcluster(X, nclusters = num_clusters)
      cdata, cmask = Pycluster.clustercentroids(X, clusterid = clusterid)
      c_mat = pandas.DataFrame(numpy.empty(X.shape), columns = X.columns)
      for i, cluster in enumerate(clusterid):
        c_mat.ix[i, :] = cdata[cluster, :]

      diff_mat = numpy.subtract(X, c_mat)
      p = X.shape[1]
      Y = p/2.
      distortion = 1./p * numpy.sum(numpy.dot(numpy.dot(diff_mat.transpose(),  numpy.linalg.inv(cov_matrix)), diff_mat))
      return distortion**(-Y)

To execute a script that exactly illustrates to what I'm referring to, use data located here and the following script:

  import pandas, numpy, Pycluster

  prices = pandas.DataFrame.from_csv('asset_prices.csv')
  returns = prices.apply(numpy.log).diff()
  k = numpy.arange(1,12)
  d = []
  for i in k:
        d.append(rate_distortion(returns.corr(), returns.cov(), i)
  d = pandas.Series(d, index = k)

  In [22]: d
  Out[22]: 
  1     8.845139e-33
  2     3.969062e-29
  3     9.387323e-28
  4     9.200729e-28
  5     4.675902e-18
  6     2.412458e-21
  7     3.582043e-18
  8     8.094695e-17
  9     1.424341e-16
  10    4.320064e-14
  11             inf
  dtype: float64

Just posted the data, as well as a small script to show you exactly what I'm seeing (as well as the output). Thanks for the suggestion @quasi
So I'm a little confused. You're using $\Sigma$ as your underlying data, with each row a data point. So you have $d$ points in $\mathbb{R}^d$. But in your dispersion calculation, you're not using the covariance matrix of $\Sigma$, you're using $\Sigma$ itself. This seems to not be the algorithm.
The naive application of the paper you cited seems to be: calculate $\Sigma$ as you did, then use the distortion measure to cluster the $d$-dimensional returns as a function of time.
@quasi, thanks for the reply. A little clarification, $\mathbf{\Sigma}$ as I've defined it, is the covariance matrix as an input to the distortion calculation. The clustering algorithm is being run on the correlation matrix of asset returns. So you're saying, I should set $\mathbf{\Sigma}$ equal to the Covariance of the correlation matrix (which is the input to my clustering algorithm), instead of the covariance of asset returns (which I'm currently doing), correct?

asked	27 days ago
viewed	145 times

Rate Distortion Minimization in a Python Clustering Algorithm

Know someone who can answer? Share a link to this question via email, Google+, Twitter, or Facebook.

Your Answer

Browse other questions tagged python financial-engineering cluster rate-distortion or ask your own question.

Community Bulletin

Rate Distortion Minimization in a Python Clustering Algorithm

Know someone who can answer? Share a link to this question via email, Google+, Twitter, or Facebook.

Your Answer

Sign up or log in

Post as a guest

Browse other questions tagged python financial-engineering cluster rate-distortion or ask your own question.

Community Bulletin

Related