Clustering data sets with multiple variables

Question

I am learning how to cluster large datasets. This may be a newbie question, but I haven’t found a suitable answer in the documentation or on this board.

My question is best illustrated by an example. Let’s import some data:

irisData = Import["http://aima.cs.berkeley.edu/data/iris.csv", "CSV"];
(* Column labels *)
pLabels = {"Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Type"};

How do I construct clusters based on only two columns, say “Sepal.Length” and “Sepal.Width”, while still retaining the rest of the data as part of the clusters?

Ideally I would like a function:

  findClusters[data,n,columnlist]

that would cluster the data into n clusters using only the columns in columnlist, such that each cluster has the same row format as the input data.

I can find the clusters and plot them as:

clusters = {cluster1, cluster2, cluster3} = FindClusters[irisData[[All, 1 ;; 2]], 3];
(* Frame -> True is required for FrameLabel to be displayed *)
ListPlot[{{#1, #2} & @@@ cluster1, {#1, #2} & @@@ cluster2, {#1, #2} & @@@ cluster3}, AspectRatio -> 1, ImageSize -> 200, Frame -> True, FrameLabel -> pLabels[[1 ;; 2]]]

But I’d like to use the clusters that I found using FindClusters above to explore, say, “Sepal.Length” vs “Petal.Width”, or to do some other data analysis on them.

mfvonh · Accepted Answer · 2014-07-15 17:36:24Z

You should use one of the syntax options for FindClusters involving rules. When clustering your dataset, transform it to {data to cluster} -> {data to return} format at the level of either individual elements or the whole list. The details are explained in the documentation.

For example, to cluster on columns 1 (sepal length) and 4 (petal width):

irisData = Import["http://aima.cs.berkeley.edu/data/iris.csv", "CSV"];
sample = irisData[[All, {1, 4}]] -> irisData;
sample[[All, 1]]

{5.1, 0.2} -> {5.1, 3.5, 1.4, 0.2, "setosa"}

Then of course

FindClusters[sample];
% // Length

3

%%[[1, 1]]

{5.1, 3.5, 1.4, 0.2, "setosa"}
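
The same rule syntax can be wrapped into the findClusters[data, n, columnlist] function asked for in the question. The following is only a minimal sketch: the argument patterns are my own choice, and the toy rows are made up purely so that the block runs without downloading anything.

```wolfram
(* Hypothetical helper, named after the function requested in the question:
   cluster the rows of data using only the columns listed in cols,
   and return the complete rows in each cluster *)
findClusters[data_List, n_Integer?Positive, cols : {__Integer}] :=
  FindClusters[data[[All, cols]] -> data, n]

(* Made-up toy rows, for illustration only *)
toy = {{1.0, 10.0, "a"}, {1.1, 20.0, "b"}, {0.9, 30.0, "c"},
   {5.0, 11.0, "d"}, {5.2, 21.0, "e"}, {4.9, 31.0, "f"}};

(* Cluster on column 1 only; every returned row keeps all three columns *)
toyClusters = findClusters[toy, 2, {1}];
```

With the iris data from the question, findClusters[irisData, 3, {1, 2}] would cluster on sepal length and sepal width while returning the full five-column rows.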

This works well. Thanks for taking time out to post! – Pam, 2 days ago

Pam · Answer 2 · 2014-07-15 18:49:29Z

This is based on mfvonh’s answer. I thought I would flesh out the full answer for posterity.

Let’s cluster the data based on Sepal Length and Sepal Width:

 sample = irisData[[All, {1, 2}]] -> irisData;
 clusters = {cluster1, cluster2, cluster3} = FindClusters[sample, 3];

Let’s plot the various characteristics of the clusters:

(* Frame -> True is required for FrameLabel to be displayed *)
{ListPlot[({#1, #2} & @@@ clusters[[#]]) & /@ Range[3],
   Frame -> True, FrameLabel -> pLabels[[1 ;; 2]]],
  ListPlot[({#2, #3} & @@@ clusters[[#]]) & /@ Range[3],
   Frame -> True, FrameLabel -> pLabels[[2 ;; 3]]]} // List // Grid
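
Because every cluster holds complete rows, any pair of columns can be pulled out with a single Part specification instead of writing a new pure function for each plot. Here is a rough sketch; clusterColumns and the demoClusters rows are hypothetical names and made-up values, used only so the block runs on its own.

```wolfram
(* Made-up clusters of full rows, for illustration only *)
demoClusters = {
   {{1.0, 10.0, 0.1, "a"}, {1.1, 20.0, 0.2, "b"}},
   {{5.0, 11.0, 0.9, "c"}}
   };

(* Hypothetical helper: keep only the columns in cols within each cluster *)
clusterColumns[clusters_List, cols : {__Integer}] :=
  clusters[[All, All, cols]]

(* Column 1 against column 3, one point series per cluster *)
xy = clusterColumns[demoClusters, {1, 3}];

ListPlot[xy, Frame -> True, FrameLabel -> {"column 1", "column 3"}]
```

Applied to the clusters computed above, clusterColumns[clusters, {1, 4}] together with FrameLabel -> pLabels[[{1, 4}]] would give the “Sepal.Length” vs “Petal.Width” view mentioned in the question.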

eldo · Answer 3 · 2014-07-15 17:04:17Z

Is this what you want?

pLabels = {"Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Type"};

IrisPlot[x_, y_] :=
 ListPlot[FindClusters[irisData[[All, {x, y}]], 3],
  AspectRatio -> 1, ImageSize -> 400, Frame -> True, 
  FrameLabel -> pLabels[[{x, y}]]]

IrisPlot[2, 3]

This is not what I want. I want to cluster using the Petal Length and Petal Width fields, but then, for the clusters that I’ve computed, plot Petal Length vs Sepal Width… – Pam, 2 days ago
