First off, it's cleaner to minimize the variances (squares of the standard deviations). Take the square root at the very end if you must.
Second, be aware that Mathematica's Variance (and StandardDeviation) are the so-called "unbiased" estimators used for statistical estimation. They may not be wanted here, because they penalize small clusters (pretty heavily), often driving the solution to contain one large cluster. This suggests an implementation in which the objective function can easily be modified.
A recursive solution naturally suggests itself for clustering a list of $n$ values: for each initial interval of length $j \ge k$ (here $k = 3$), find the optimal clustering of the remaining $n-j$ values (provided there are at least $k$ of them). But don't forget to consider a single cluster! Without any optimization, that leads directly to this:
cluster::usage =
  "cluster[x,k,t] returns an interval partition of x in which \
each interval has at least k >= 2 elements and \
('unbiased') variance not exceeding t (if possible).";
cluster[x_List, k_Integer, t_] := Module[{n = Length[x], options},
   (* Split off each initial interval of length k through n-k and recurse on the rest *)
   options = Prepend[cluster[Take[x, # - n], k, t], Take[x, #]] & /@ Range[k, n - k];
   (* Always consider leaving x as a single cluster *)
   options = Append[options, {x}];
   (* Return the partition with the smallest objective value *)
   options[[First[Ordering[value[#, t] & /@ options, 1]]]]
   ];
cluster[x_List, k_Integer, t_] /; Length[x] < 2 k := {x};
This coding style values clarity, which assists debugging, modification, and integration with other code. It also values flexibility: e.g., hard-coding $k=3$ would speed it up by about 25%, but it is convenient to allow other values of $k$ (if only for testing).
Oh yes, concerning flexibility: the preceding could drive almost any interval-based clustering algorithm whose objective function is local (that is, a sum over the clusters). To complete the present solution, we need to be specific about value, the objective function:
value::usage = "value[x,t] returns the total variance of a list of intervals \
(or Infinity if any interval has variance exceeding t).";
value[x_List, t_] := Sum[variance[y, t], {y, x}];
variance::usage = "variance[x,t] measures how 'clustered' a set of numbers x is.";
variance[x_List, t_] := With[{v = Variance[x]}, If[v > t, Infinity, v]];
It's fast and easy to experiment with different measures of "clusteredness"; for instance, try replacing variance with a calculation of the sum of squared residuals, Variance[x] (Length[x] - 1).
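For concreteness, a minimal sketch of that replacement (same threshold test as before; note that because this measure sums rather than averages, its value scales with cluster size, so the meaning of the threshold t changes accordingly):

variance[x_List, t_] := With[{v = Variance[x] (Length[x] - 1)},
   If[v > t, Infinity, v]];

Since cluster and value only interact with variance through this one definition, nothing else needs to change.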
For an example of usage:
solution = cluster[x = RandomReal[{0, 1}, 30], 3, Variance[x]]
This generates 30 random values and requests a clustering into groups of 3 or more, with the variance within each group not exceeding the variance of the entire dataset.
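One can inspect the quality of the result by applying value to it with the same threshold; a finite result confirms that every cluster meets the variance bound:

value[solution, Variance[x]]  (* Infinity would signal that no valid clustering exists *)
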
Several aspects of this solution seem worthy of comment:
If no solution is possible, it returns a single cluster. You need to check for that (by applying value to it).
The running time grows exponentially with the length of the list; expect it to get bad for lists of length 40 or more.
Running time could be improved with some pruning of the search tree (by tracking the value of the best solution so far).
RAM usage is kept low by this recursive approach (compared to generating all possible interval partitions of x and then checking them all).
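One cheap way to tame the exponential growth, trading RAM back for speed, is to memoize the recursion, since cluster is called repeatedly on the same terminal sublists of x. A sketch (clusterMemo is a hypothetical name for a memoized variant; it assumes value is defined as above, and cached results accumulate until you ClearAll the symbol):

ClearAll[clusterMemo];
clusterMemo[x_List, k_Integer, t_] /; Length[x] < 2 k := {x};
clusterMemo[x_List, k_Integer, t_] := clusterMemo[x, k, t] =
   Module[{n = Length[x], options},
    options = Prepend[clusterMemo[Take[x, # - n], k, t], Take[x, #]] & /@ Range[k, n - k];
    options = Append[options, {x}];
    options[[First[Ordering[value[#, t] & /@ options, 1]]]]
    ];

Because only the $n$ terminal sublists of x ever occur as arguments, this turns the search into dynamic programming over suffixes, making the running time polynomial in $n$ rather than exponential.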