Simple k-means implementation using Python3 and Pandas

Question

Is there anything I can improve? The distance function is Pearson correlation.

import os
import pandas as pd
import numpy as np
from pandas import Series, DataFrame


def corrpairs(df1, df2):
    """
    Pairwise correlation for columns of two data frames
    :param df1:
    :type df1:
    :param df2:
    :type df2:
    :return:
    :rtype: pandas.core.frame.DataFrame
    """
    return df1.apply(lambda x: df2.corrwith(x))



import pdb
def kcluster(cols, k=4):
    """
    K Means clustering algorithm, applied to columns of a data frame.
    Using Pearson correlation as the distance function.
    :param rows:
    :type rows: pandas.core.frame.DataFrame
    :param k:
    :type k: int
    :return:
    :rtype: list[int]
    """
    cols = cols.astype(float)
    nrow, ncol = cols.shape
    nuclear0 = cols.iloc[:, :k]
    nuclear0.columns = range(k)
    nuclear0 += np.random.randn(np.prod(nuclear0.shape)).reshape(nuclear0.shape)

    correlations = corrpairs(cols, nuclear0)
    groups = correlations.idxmax(axis=0)
    nuclear1 = []
    for i in range(k):
        sub_cols = cols.loc[:, groups == i]
        sub_mean = sub_cols.mean(axis=1)
        nuclear1.append(sub_mean)
    nuclear1 = pd.concat(nuclear1, axis=1)

    while ((nuclear0 - nuclear1).abs() > 0.00001).any().any():
        print(nuclear0)
        print(nuclear1)
        print((nuclear0 - nuclear1).abs())
        nuclear0 = nuclear1
        correlations = corrpairs(cols, nuclear0)
        groups = correlations.idxmax(axis=0)
        nuclear1 = []
        for i in range(k):
            sub_cols = cols.loc[:, groups == i]
            sub_mean = sub_cols.mean(axis=1)
            nuclear1.append(sub_mean)
        nuclear1 = pd.concat(nuclear1, axis=1)

    return groups

Are you sure that using Pearson correlation with K-means is a good idea? See here — Janne Karila, Jan 8 '15 at 6:33

SirPython · Answer 1 · 2015-07-01 23:49:40Z

Your documentation has a few problems.

First off, it is incomplete in a few places.

For example, in your corrpairs function, you wrote a one-line summary but left every field empty except rtype.

And, in your kcluster function, the only fields you filled in are type rows, type k, and rtype; the param and return descriptions are blank.

Finally, also in kcluster, you called the parameter "rows" in the documentation and called it "cols" in the function signature. Choose one and stick with it.

Documentation is a very important part of every function.
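To make the point concrete, here is a minimal sketch of what a filled-in docstring for corrpairs could look like. The function body is the one from the question; the field descriptions are my own reading of what the code does (which frame ends up as rows and which as columns), so treat them as assumptions rather than the author's intended wording.

```python
import pandas as pd


def corrpairs(df1, df2):
    """
    Pairwise Pearson correlation between the columns of two data frames.

    :param df1: frame whose columns become the columns of the result
    :type df1: pandas.DataFrame
    :param df2: frame whose columns become the rows of the result
    :type df2: pandas.DataFrame
    :return: correlation of every column of df2 with every column of df1
    :rtype: pandas.DataFrame
    """
    # For each column x of df1, correlate x with every column of df2.
    return df1.apply(lambda x: df2.corrwith(x))
```

The same treatment applies to kcluster: describe the frame once, under the name actually used in the signature (cols), and say what the returned groups contain.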

import pdb
def kcluster(cols, k=4):

You should not have an import in the middle of your code; all the importing should be done at the very top of your code like you were doing before.
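As a rough sketch of that layout, the top of the module could look like the following. It assumes pdb was only there for debugging (the posted code never calls it) and that os, Series and DataFrame are not needed, since nothing in the posted code appears to use them.

```python
"""K-means clustering on the columns of a data frame (layout sketch)."""
# All imports sit together at the top of the module.
import numpy as np  # used by kcluster, which is omitted from this sketch
import pandas as pd

# `import pdb` is left out on purpose: as a debugging aid it would be
# added temporarily and removed again before the code is shared.


def corrpairs(df1, df2):
    """Pairwise correlation for columns of two data frames."""
    return df1.apply(lambda x: df2.corrwith(x))
```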

This review was just meant to point out practices. I had trouble understanding the content of the code (better documentation probably would've helped).

asked	1 year ago
viewed	329 times
active	10 months ago
