Is anyone aware of Mathematica use/implementation of Random Forest algorithm?
Here I will attempt to provide a basic implementation of the random forest algorithm for classification. This is by no means fast and doesn't scale very well but otherwise is a nice classifier. I recommend reading Breiman and Cutler's page for information about random forests. The following are some helper functions that allow us to compute entropy and ultimately information gain easily.
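For example, helpers along these lines will do; the names entropy and informationGain and their exact signatures are only illustrative sketches here:

    entropy[labels_List] :=
      If[labels === {}, 0,
        -Total[# Log2[#] & /@ N[Tally[labels][[All, 2]]/Length[labels]]]]

    (* gain from splitting the labels into the given groups of labels *)
    informationGain[labels_List, groups : {___List}] :=
      entropy[labels] -
        Total[(Length[#]/Length[labels]) entropy[#] & /@ groups]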
Now for the decision trees that will be used in the forest. This is the biggest bottleneck in the whole process and I would love to see a much faster implementation in Mathematica! The trees I'm working with are called CART trees. The basic idea is to select at random a subset of the features at each split and keep the split that gives the greatest information gain.
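A rough sketch of such a tree builder, assuming the data is a list of {featureVector, label} pairs; the name cartTree, the median split rule, and the depth limit are simplifications of mine rather than a definitive implementation:

    cartTree[data_, nFeatures_Integer, depth_Integer: 10] :=
      Module[{labels = data[[All, 2]], candidates, scored, f, t, left, right},
        (* stop when the node is pure or the depth limit is reached *)
        If[Length[Union[labels]] == 1 || depth == 0,
          Return[First@Commonest[labels]]];
        (* score a random subset of the features, splitting each at its median *)
        candidates = RandomSample[Range[Length[data[[1, 1]]]], nFeatures];
        scored = Table[
          t = Median[data[[All, 1, i]]];
          {i, t, informationGain[labels,
             {Select[data, #[[1, i]] <= t &][[All, 2]],
              Select[data, #[[1, i]] > t &][[All, 2]]}]},
          {i, candidates}];
        {f, t} = Most@First@SortBy[scored, -Last[#] &];
        left = Select[data, #[[1, f]] <= t &];
        right = Select[data, #[[1, f]] > t &];
        If[left === {} || right === {}, Return[First@Commonest[labels]]];
        (* an internal node is {feature index, threshold, left subtree, right subtree} *)
        {f, t, cartTree[left, nFeatures, depth - 1],
          cartTree[right, nFeatures, depth - 1]}]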
To demo these things as I go, let's use the famous iris data built into Mathematica, selecting about 80% for training and 20% for testing.
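For instance, using the Fisher iris set from ExampleData and converting each row into a {features, species} pair:

    iris = {Most[#], Last[#]} & /@ ExampleData[{"Statistics", "FisherIris"}];
    {train, test} = TakeDrop[RandomSample[iris], Round[0.8 Length[iris]]];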
Now let's create a CART tree from this data, letting it choose from a small random subset of the features at each split.
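With the sketch above, allowing 2 of the 4 features to be considered at each split, that would be:

    tree = cartTree[train, 2];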
So far so good. Now we need a classifier that can take a new input and a tree to make a classification.
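A matching classifier sketch for the nested {feature, threshold, left, right} node structure used above (again, the name is mine):

    classifyTree[leaf_ /; ! ListQ[leaf], _] := leaf
    classifyTree[{f_, t_, left_, right_}, x_List] :=
      If[x[[f]] <= t, classifyTree[left, x], classifyTree[right, x]]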
Let's try it out with the first element from our training data. It correctly classifies the iris as species 2 (virginica).
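For example:

    classifyTree[tree, train[[1, 1]]]  (* predicted species for the first training instance *)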
A random forest is nothing but an ensemble of such trees, each created from a bootstrap sample of the data, with classification done by letting every tree vote. The function below builds the forest; classification then takes the majority vote of the trees.
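A sketch along those lines, where RandomChoice supplies the bootstrap resample and the names are illustrative:

    randomForest[data_, nTrees_Integer, nFeatures_Integer] :=
      Table[cartTree[RandomChoice[data, Length[data]], nFeatures], {nTrees}]

    (* majority vote over the individual trees *)
    forestClassify[forest_, x_List] :=
      First@Commonest[classifyTree[#, x] & /@ forest]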
Let's try it on the iris data. First we fit a forest with our training set and test it to make sure it works well on its own training data. In this case we get perfect classification of the training data, so it looks good.
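For example, with 50 trees (that tree count is just a guess at a reasonable value):

    forest = randomForest[train, 50, 2];
    N@Mean[Boole[forestClassify[forest, #[[1]]] === #[[2]]] & /@ train]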
Now let's try it out on the test data it has never seen before.
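That is, the accuracy on the held-out 20%:

    N@Mean[Boole[forestClassify[forest, #[[1]]] === #[[2]]] & /@ test]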
I'd say 97% correct classification isn't bad for a relatively small data set and forest. EDIT: It is worth showing how one might use
I'm going to be bold and attempt to edit the Ross code so that it is (a) a little easier to understand and (b) takes the same form of argument as LinearModelFit and other Mathematica prediction creators. I've also added some annotations to the critical code. My variable names are now far longer than the Ross names but perhaps more informative. So far in my testing this code works the same as Ross's, but it is possible I have messed something up in my rewrite. Part 1. Same as Ross but I add a function informationGain that determines the information gain one obtains about y from knowing the ith feature of x.
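A sketch of what that might look like, partitioning the instances on the distinct values of feature i (the signature and implementation here are illustrative, and it reuses the entropy helper from above):

    informationGain[x_, y_, i_Integer] :=
      entropy[y] - Total[
        (Length[#]/Length[y]) entropy[y[[#]]] & /@
          Values@PositionIndex[x[[All, i]]]]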
Part 2. Same idea as Ross but x now contains not just the features (independent variables in statistics lingo) but the entire data set. It thus has the same structure as the first argument to LinearModelFit and NonlinearModelFit.
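In other words, a row is simply the feature values followed by the class value, for example:

    exampleRow = {5.1, 3.5, 1.4, 0.2, "setosa"};  (* features ..., class *)
    features = Most[exampleRow]; label = Last[exampleRow];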
Part 3. The idea here is again to let the first argument just be the entire data set. What I have then done is to use a lot of local variables with hopefully evocative names and some annotations to better explain what is going on in the elegant but terse Ross code. The text below explains the annotations.
Part 3 Annotations.
Part 4. This code to classify an instance works exactly the same as in Ross. Notice that here, and now consistent with the way it is done above, x is a vector of attributes. It will also work when x is a list whose elements are mostly attributes, with the last element being the class value. Thus, one can now just map over the instances in some testing set to get the predictions.
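So, assuming the Part 4 classifier is called classifyInstance (an illustrative name), getting predictions for a whole test matrix is just:

    predictions = classifyInstance[tree, #] & /@ testSet;  (* tree from Part 3, testSet a matrix of rows *)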
Part 5. The random forest creation function is similar to Ross but, again, x is now a unified matrix containing the instances. I have added an optional argument subsetFactor that can speed up your code (at the risk of losing accuracy) by using just a subset of the instances to create the trees each time.
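A sketch of the idea, where growTree stands in for the Part 3 tree builder and subsetFactor controls how many instances each tree sees; the names here are mine:

    makeRandomForest[data_, nTrees_Integer, nFeatures_Integer, subsetFactor_: 1] :=
      Table[
        (* use RandomChoice instead of RandomSample for a true bootstrap *)
        growTree[RandomSample[data, Round[subsetFactor Length[data]]], nFeatures],
        {nTrees}]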
I very much enjoy Dan's approach in part because it is so simple both in concept and implementation. I'm taking the liberty here of suggesting a few arguable improvements to his terrific code. For makeForest (a) the data is in the same format as is used in functions such as LinearModelFit (a simple array instead of a list of rules of features onto class); (b) standard machine learning vocabulary in naming variables; (c) direct use of RandomSample on the data.
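In that spirit, a sketch of what such a makeForest might look like: each "tree" pairs a random choice of feature slots with a NearestFunction built from a random sample of the rows, where rows are {features..., class} as in LinearModelFit. All names here are illustrative.

    makeForest[data_, nTrees_Integer, nSlots_Integer, sampleSize_Integer] :=
      Table[
        With[{slots = Sort@RandomSample[Range[Length[First[data]] - 1], nSlots],
              sample = RandomSample[data, sampleSize]},
          {slots, Nearest[sample[[All, slots]] -> sample[[All, -1]]]}],
        {nTrees}]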
For classify, (a) clarify with named variables how one extracts the slots and the NearestFunction from the forest; (b) use the built-in Commonest instead of the Reverse, SortBy, Tally composition (this does lose the number of votes each prediction received); (c) create an optional argument k that effectively implements kNN on the data instead of Dan's required 1NN.
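And a matching classify sketch with the optional k:

    classify[forest_, x_List, k_Integer: 1] :=
      First@Commonest[
        Flatten[Function[{slots, nf}, nf[x[[slots]], k]] @@@ forest]]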
Also, a wacky idea that follows up on Andy Ross's comment: could one not pretty easily create an ensemble of forests and then let the forests that perform well on the training data have breeding rights? Breeding might consist of some sort of reshuffling of the slots. To do this, we might take more liberties with Dan's code by creating a polymorphic makeForest that permits the slots to be input directly.
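For that last point, a polymorphic makeForest might look something like this (again just a sketch, reusing the illustrative names above):

    makeForest[data_, slotSets : {{__Integer} ..}, sampleSize_Integer] :=
      Table[
        With[{sample = RandomSample[data, sampleSize]},
          {slots, Nearest[sample[[All, slots]] -> sample[[All, -1]]]}],
        {slots, slotSets}]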
Disclaimer: This is not an implementation of the Random Forest Algorithm. Also, while I have on occasion used random florists, until today I had not heard of the Random Forest Algorithm. I poked around a bit on the Net and learned that these take subsamples of data, subsampling the variables as well, and form decision trees for the subsetted subsamples. These can then be used to classify data points: one tests against each subsample (tree) and returns the modal value. Okay, there is more that can be done with these, and I probably don't have even the above part correct. Whatever. For some types of problem one can regard the "space" as having a vector of either continuous, discrete, or perhaps mixed values, with each datum then evaluating to some new value. That is to say, we pretend we have an unknown function mapping each vector to a value. I will assume the input data is of the form of vectors paired with their values.
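Concretely, take the data to be a list of {vector, value} pairs; this representation and the function names below are just one way to realize what I described, so treat it as a sketch. Each "tree" remembers a random choice of coordinate slots together with a Nearest function built from a random subsample of the data restricted to those slots, and a query tallies the nearest values over all the trees.

    buildTree[data_, nSlots_Integer, nSample_Integer] :=
      With[{slots = Sort@RandomSample[Range[Length[data[[1, 1]]]], nSlots],
            sample = RandomSample[data, nSample]},
        {slots, Nearest[sample[[All, 1, slots]] -> sample[[All, 2]]]}]

    buildForest[data_, nTrees_Integer, nSlots_Integer, nSample_Integer] :=
      Table[buildTree[data, nSlots, nSample], {nTrees}]

    (* candidate values ranked by how many trees voted for them *)
    queryForest[forest_, vec_] :=
      Reverse@SortBy[
        Tally[Flatten[Function[{slots, nf}, nf[vec[[slots]]]] @@@ forest]],
        Last]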
That's it. Here is an example. I'll take a million points in the 8-cube {-1,1}^8. We give them ordinal values based on orthant (use base 2...) but before evaluating we "pollute" them first with Gaussian noise.
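Here is one way to set that up; the noise scale 0.3 and the exact way the noise enters are guesses on my part, with the label taken to be the orthant, read as a base-2 number, of the polluted copy of each point:

    n = 10^6; dim = 8;
    pts = RandomReal[{-1, 1}, {n, dim}];
    noisy = pts + RandomVariate[NormalDistribution[0, 0.3], {n, dim}];
    vals = FromDigits[UnitStep[#], 2] & /@ noisy;
    data = Transpose[{pts, vals}];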
Now we'll create a "forest" (I really should use a different term since I doubt this is what an actual random forest is, but...). I use 300 trees, each taking 4 (of the 8) vector positions, and each using 1000 elements from the full set, chosen at random.
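With the sketch functions above, that is:

    forest = buildForest[data, 300, 4, 1000];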
Now I create a random 8-vector with coordinates in the range -1 to 1.
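For instance:

    vec = RandomReal[{-1, 1}, 8];
    FromDigits[UnitStep[vec], 2]  (* the "expected" orthant value; 117 in the run discussed below *)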
We run it through the forest and look at the tallied votes.
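That is:

    queryForest[forest, vec]
    (* a list of {value, votes} pairs, hopefully with the expected orthant on top *)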
While I doubt it will always give the "expected" result at the top of the list, it is encouraging to see that it might do so. Also the next ones do tend to share bits with 117. Of course it might be unrealistic to use half of the full set of dimensions in the trees, so I really do not know how well this might scale to "real world" problems of high dimension. But I thought it might be worth showing this approach since it is simple in terms of code and others might have ideas on tweaking it for better accuracy/performance.