Is anyone aware of Mathematica use/implementation of Random Forest algorithm?
Here I will attempt to provide a basic implementation of the random forest algorithm for classification. This is by no means fast and doesn't scale very well but otherwise is a nice classifier. I recommend reading Breiman and Cutler's page for information about random forests. The following are some helper functions that allow us to compute entropy and ultimately information gain easily.
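For example, helpers along these lines will do; the names entropy and informationGain and their exact signatures are only illustrative sketches here:

    entropy[labels_List] :=
      If[labels === {}, 0,
        -Total[# Log2[#] & /@ N[Tally[labels][[All, 2]]/Length[labels]]]]

    (* gain from splitting the labels into the given groups of labels *)
    informationGain[labels_List, groups : {___List}] :=
      entropy[labels] -
        Total[(Length[#]/Length[labels]) entropy[#] & /@ groups]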
Now for the decision trees that will be used in the forest. This is the biggest bottleneck in the whole process and I would love to see a much faster implementation in Mathematica! The trees I'm working with are called CART trees. The basic idea is to select at random a subset of the features at each split and keep the split that gives the greatest information gain.
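A rough sketch of such a tree builder, assuming the data is a list of {featureVector, label} pairs; the name cartTree, the median split rule, and the depth limit are simplifications of mine rather than a definitive implementation:

    cartTree[data_, nFeatures_Integer, depth_Integer: 10] :=
      Module[{labels = data[[All, 2]], candidates, scored, f, t, left, right},
        (* stop when the node is pure or the depth limit is reached *)
        If[Length[Union[labels]] == 1 || depth == 0,
          Return[First@Commonest[labels]]];
        (* score a random subset of the features, splitting each at its median *)
        candidates = RandomSample[Range[Length[data[[1, 1]]]], nFeatures];
        scored = Table[
          t = Median[data[[All, 1, i]]];
          {i, t, informationGain[labels,
             {Select[data, #[[1, i]] <= t &][[All, 2]],
              Select[data, #[[1, i]] > t &][[All, 2]]}]},
          {i, candidates}];
        {f, t} = Most@First@SortBy[scored, -Last[#] &];
        left = Select[data, #[[1, f]] <= t &];
        right = Select[data, #[[1, f]] > t &];
        If[left === {} || right === {}, Return[First@Commonest[labels]]];
        (* an internal node is {feature index, threshold, left subtree, right subtree} *)
        {f, t, cartTree[left, nFeatures, depth - 1],
          cartTree[right, nFeatures, depth - 1]}]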
To demo these things as I go, let's use the famous iris data built into Mathematica, selecting about 80% for training and 20% for testing.
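For instance, using the Fisher iris set from ExampleData and converting each row into a {features, species} pair:

    iris = {Most[#], Last[#]} & /@ ExampleData[{"Statistics", "FisherIris"}];
    {train, test} = TakeDrop[RandomSample[iris], Round[0.8 Length[iris]]];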
Now let's create a CART tree from this data, letting it choose from a small random subset of the features at each split.
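With the sketch above, allowing 2 of the 4 features to be considered at each split, that would be:

    tree = cartTree[train, 2];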
So far so good. Now we need a classifier that can take a new input and a tree to make a classification.
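A matching classifier sketch for the nested {feature, threshold, left, right} node structure used above (again, the name is mine):

    classifyTree[leaf_ /; ! ListQ[leaf], _] := leaf
    classifyTree[{f_, t_, left_, right_}, x_List] :=
      If[x[[f]] <= t, classifyTree[left, x], classifyTree[right, x]]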
Let's try it out with the first element from our training data. It correctly classifies the iris as species 2 (virginica).
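For example:

    classifyTree[tree, train[[1, 1]]]  (* predicted species for the first training instance *)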
A random forest is nothing but an ensemble of such trees, each created from a bootstrap sample of the data, with classification done by letting every tree vote. The function below builds the forest; classification then takes the majority vote of the trees.
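A sketch along those lines, where RandomChoice supplies the bootstrap resample and the names are illustrative:

    randomForest[data_, nTrees_Integer, nFeatures_Integer] :=
      Table[cartTree[RandomChoice[data, Length[data]], nFeatures], {nTrees}]

    (* majority vote over the individual trees *)
    forestClassify[forest_, x_List] :=
      First@Commonest[classifyTree[#, x] & /@ forest]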
Let's try it on the iris data. First we fit a forest with our training set and test it to make sure it works well on its own training data. In this case we get perfect classification of the training data, so it looks good.
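For example, with 50 trees (that tree count is just a guess at a reasonable value):

    forest = randomForest[train, 50, 2];
    N@Mean[Boole[forestClassify[forest, #[[1]]] === #[[2]]] & /@ train]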
Now let's try it out on the test data it has never seen before.
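That is, the accuracy on the held-out 20%:

    N@Mean[Boole[forestClassify[forest, #[[1]]] === #[[2]]] & /@ test]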
I'd say 97% correct classification isn't bad for a relatively small data set and forest. EDIT: It is worth showing how one might use
I'm going to be bold and attempt to edit the Ross code so that it is (a) a little easier to understand and (b) takes the same form of argument as LinearModelFit and other Mathematica prediction creators. I've also added some annotations to the critical code. My variable names are now far longer than the Ross names but perhaps more informative. So far in my testing this code works the same as Ross's, but it is possible I have messed something up in my rewrite. Part 1. Same as Ross but I add a function informationGain that determines the information gain one obtains about y from knowing the ith feature of x.
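A sketch of what that might look like, partitioning the instances on the distinct values of feature i (the signature and implementation here are illustrative, and it reuses the entropy helper from above):

    informationGain[x_, y_, i_Integer] :=
      entropy[y] - Total[
        (Length[#]/Length[y]) entropy[y[[#]]] & /@
          Values@PositionIndex[x[[All, i]]]]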
Part 2. Same idea as Ross but x now contains not just the features (independent variables in statistics lingo) but the entire data set. It thus has the same structure as the first argument to LinearModelFit and NonlinearModelFit.
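In other words, a row is simply the feature values followed by the class value, for example:

    exampleRow = {5.1, 3.5, 1.4, 0.2, "setosa"};  (* features ..., class *)
    features = Most[exampleRow]; label = Last[exampleRow];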
Part 3. The idea here is again to let the first argument just be the entire data set. What I have then done is to use a lot of local variables with hopefully evocative names and some annotations to better explain what is going on in the elegant but terse Ross code. The text below explains the annotations.
Part 3 Annotations.
Part 4. This code to classify an instance works exactly the same as in Ross. Notice that here, and now consistent with the way it is done above, x is a vector of attributes. It will also work when x is a list whose elements are mostly attributes, with the last element being the class value. Thus, one can now just map over the instances in some testing set to get the predictions.
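So, assuming the Part 4 classifier is called classifyInstance (an illustrative name), getting predictions for a whole test matrix is just:

    predictions = classifyInstance[tree, #] & /@ testSet;  (* tree from Part 3, testSet a matrix of rows *)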
Part 5. The random forest creation function is similar to Ross but, again, x is now a unified matrix containing the instances. I have added an optional argument subsetFactor that can speed up your code (at the risk of losing accuracy) by using just a subset of the instances to create the trees each time.
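A sketch of the idea, where growTree stands in for the Part 3 tree builder and subsetFactor controls how many instances each tree sees; the names here are mine:

    makeRandomForest[data_, nTrees_Integer, nFeatures_Integer, subsetFactor_: 1] :=
      Table[
        (* use RandomChoice instead of RandomSample for a true bootstrap *)
        growTree[RandomSample[data, Round[subsetFactor Length[data]]], nFeatures],
        {nTrees}]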
I very much enjoy Dan's approach in part because it is so simple both in concept and implementation. I'm taking the liberty here of suggesting a few arguable improvements to his terrific code. For makeForest (a) the data is in the same format as is used in functions such as LinearModelFit (a simple array instead of a list of rules of features onto class); (b) standard machine learning vocabulary in naming variables; (c) direct use of RandomSample on the data.
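In that spirit, a sketch of what such a makeForest might look like: each "tree" pairs a random choice of feature slots with a NearestFunction built from a random sample of the rows, where rows are {features..., class} as in LinearModelFit. All names here are illustrative.

    makeForest[data_, nTrees_Integer, nSlots_Integer, sampleSize_Integer] :=
      Table[
        With[{slots = Sort@RandomSample[Range[Length[First[data]] - 1], nSlots],
              sample = RandomSample[data, sampleSize]},
          {slots, Nearest[sample[[All, slots]] -> sample[[All, -1]]]}],
        {nTrees}]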
For classify, (a) clarify with named variables how one extracts the slots and the NearestFunction from the forest; (b) use the built-in Commonest instead of the Reverse, SortBy, Tally composition (this does lose the number of votes each prediction received); (c) create an optional argument k that effectively implements kNN on the data instead of Dan's required 1NN.
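And a matching classify sketch with the optional k:

    classify[forest_, x_List, k_Integer: 1] :=
      First@Commonest[
        Flatten[Function[{slots, nf}, nf[x[[slots]], k]] @@@ forest]]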
Also, a wacky idea that follows up on Andy Ross's comment: could one not pretty easily create an ensemble of forests and then let the forests that perform well on the training data have breeding rights? Breeding might consist of some sort of reshuffling of the slots. To do this, we might take more liberties with Dan's code by creating a polymorphic makeForest that permits the slots to be input directly.
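For that last point, a polymorphic makeForest might look something like this (again just a sketch, reusing the illustrative names above):

    makeForest[data_, slotSets : {{__Integer} ..}, sampleSize_Integer] :=
      Table[
        With[{sample = RandomSample[data, sampleSize]},
          {slots, Nearest[sample[[All, slots]] -> sample[[All, -1]]]}],
        {slots, slotSets}]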
Disclaimer: This is not an implementation of the Random Forest Algorithm. Also, while I have on occasion used random florists, until today I had not heard of the Random Forest Algorithm. I poked around a bit on the Net and learned that these take subsamples of data, subsampling the variables as well, and form decision trees for the subsetted subsamples. These can then be used to classify data points: one tests against each subsample (tree) and returns the modal value. Okay, there is more that can be done with these, and I probably don't have even the above part correct. Whatever. For some types of problem one can regard the "space" as having a vector of either continuous, discrete, or perhaps mixed values, with each datum then evaluating to some new value. That is to say, we pretend we have an unknown function mapping each vector to a value. I will assume the input data is of the form of vectors paired with their values.
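Concretely, take the data to be a list of {vector, value} pairs; this representation and the function names below are just one way to realize what I described, so treat it as a sketch. Each "tree" remembers a random choice of coordinate slots together with a Nearest function built from a random subsample of the data restricted to those slots, and a query tallies the nearest values over all the trees.

    buildTree[data_, nSlots_Integer, nSample_Integer] :=
      With[{slots = Sort@RandomSample[Range[Length[data[[1, 1]]]], nSlots],
            sample = RandomSample[data, nSample]},
        {slots, Nearest[sample[[All, 1, slots]] -> sample[[All, 2]]]}]

    buildForest[data_, nTrees_Integer, nSlots_Integer, nSample_Integer] :=
      Table[buildTree[data, nSlots, nSample], {nTrees}]

    (* candidate values ranked by how many trees voted for them *)
    queryForest[forest_, vec_] :=
      Reverse@SortBy[
        Tally[Flatten[Function[{slots, nf}, nf[vec[[slots]]]] @@@ forest]],
        Last]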
That's it. Here is an example. I'll take a million points in the 8-cube {-1,1}^8. We give them ordinal values based on orthant (use base 2...) but before evaluating we "pollute" them first with Gaussian noise.
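Here is one way to set that up; the noise scale 0.3 and the exact way the noise enters are guesses on my part, with the label taken to be the orthant, read as a base-2 number, of the polluted copy of each point:

    n = 10^6; dim = 8;
    pts = RandomReal[{-1, 1}, {n, dim}];
    noisy = pts + RandomVariate[NormalDistribution[0, 0.3], {n, dim}];
    vals = FromDigits[UnitStep[#], 2] & /@ noisy;
    data = Transpose[{pts, vals}];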
Now we'll create a "forest" (I really should use a different term since I doubt this is what an actual random forest is, but...). I use 300 trees, each taking 4 (of the 8) vector positions, and each using 1000 elements from the full set, chosen at random.
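With the sketch functions above, that is:

    forest = buildForest[data, 300, 4, 1000];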
Now I create a random 8-vector with coordinates in the range -1 to 1.
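For instance:

    vec = RandomReal[{-1, 1}, 8];
    FromDigits[UnitStep[vec], 2]  (* the "expected" orthant value; 117 in the run discussed below *)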
We run it through the forest and look at the tallied votes.
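That is:

    queryForest[forest, vec]
    (* a list of {value, votes} pairs, hopefully with the expected orthant on top *)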
While I doubt it will always give the "expected" result at the top of the list, it is encouraging to see that it might do so. Also the next ones do tend to share bits with 117. Of course it might be unrealistic to use half of the full set of dimensions in the trees, so I really do not know how well this might scale to "real world" problems of high dimension. But I thought it might be worth showing this approach since it is simple in terms of code and others might have ideas on tweaking it for better accuracy/performance.