Nested loops - Random Forest, multiple parameters

Question

I'm writing code whose task is to grow Random Forest models over multiple parameter combinations. In short:

First, I declare a data frame in which the model parameters and some stats will be saved.
Second, I declare the model parameters and the loop counter (it is printed after every iteration).
Next, I have nested loops that fit the model and call the prediction function.
Then the parameters and some stats from the confusion matrix are saved to the data frame.
After that, the iteration number is printed and incremented.
Last but not least, the garbage collector is called.

The code looks like this:

## data frame in which model parameters and some stats will be saved

model_eff <- data.frame("ntrees" = numeric(0),
                        "zeros" = numeric(0), 
                        "mvars"= numeric(0),
                        "eff" = numeric(0),
                        "0_0" = numeric(0),
                        "0_1" = numeric(0),
                        "1_0" = numeric(0),
                        "1_1" = numeric(0),
                        "predict_sum" = numeric(0),
                        "triangle" = numeric(0))


## parameters

ntrees <- c(300, 500)
zeros <- sum(train.target) * c(1, 2, 3, 4, 5)
mvars <- c(30, 50, 70, 90, 110, 130)

## loop counter

i = 1

## loop with model, prediction etc.

for (j in 1:length(ntrees)){
  for (k in 1:length(zeros)){
    for (l in 1:length(mvars)){

      ## i-th model

      model <- randomForest(train,
                            y = as.factor(train.target),
                            ntree = ntrees[j],
                            do.trace = T,
                            sampsize = c('0' = zeros[k], '1' = sum(train.target)),
                            mtry = mvars[l])

      ## prediction - my function, apart from a regular prediction
      ##              outputs additional info

      predict.model(model, val, val.target)

      ## inserting model parameters and stats to a data frame for further comparisons

      model_eff <- rbind(model_eff,
                         c("ntrees" = ntrees[j],
                           "zeros" = zeros[k],
                           "mvars"= mvars[l],
                           "eff" = eff_measures$eff,
                           "0_0" = eff_measures$c.m[1, 1],
                           "0_1" = eff_measures$c.m[1, 2],
                           "1_0" = eff_measures$c.m[2, 1],
                           "1_1" = eff_measures$c.m[2, 2],
                           "predict_sum" = sum(TARGET3),
                           "triangle" = eff_measures$triangle))

      ## printing the number of iteration

      cat("iteration =", i)
      i <- i+1

      ## calling garbage collector to assure free space in RAM

      gc()
    }
  }
}

I have already split the train/validation data sets and their target variables, knowing that Random Forest deals with such data more efficiently. I also tried to use the "foreach" package to parallelize the computations; however, growing even a single tree took 10-15% longer than without using all the cores.

I would like to know whether I can shorten the execution time of this code, and especially whether there is a way to avoid the nested loops, since I have heard that they are not the best way to program in R.

Coatless · Answer 1 · 2016-03-19 17:59:35Z

Reproducible Example

Unfortunately, the code snippet you gave is not reproducible, so the advice below is necessarily limited.

Caches are nice

When a value is known to stay constant across iterations, compute it once before the loop instead of recomputing it every time. In this particular case, sum(train.target) and sum(TARGET3) should be cached (the latter only if TARGET3 is not reassigned inside the loop, for example by predict.model). Say:

stt = sum(train.target)
st3 = sum(TARGET3)

Knowledge (of size) is Power!

One of the key issues you will face right away is that you call rbind 60 times, growing the data frame on every iteration. The number of iterations is known in advance, so preallocate the data.frame with numeric columns of that length:

## parameters

ntrees <- c(300, 500)
zeros <- sum(train.target) * c(1, 2, 3, 4, 5)
mvars <- c(30, 50, 70, 90, 110, 130)

nitr = length(ntrees)*length(zeros)*length(mvars)

model_eff <- data.frame("ntrees" = numeric(nitr),
                        "zeros"  = numeric(nitr), 
                        "mvars"  = numeric(nitr),
                        "eff"    = numeric(nitr),
                        "0_0"    = numeric(nitr),
                        "0_1"    = numeric(nitr),
                        "1_0"    = numeric(nitr),
                        "1_1"    = numeric(nitr),
                        "predict_sum" = numeric(nitr),
                        "triangle"    = numeric(nitr),
                        stringsAsFactors = F)

Declare count = 1 before the three nested for loops. Then save the results using:

model_eff[count,] = c("ntrees" = ntrees[j],
                      "zeros" = zeros[k],
                      "mvars"= mvars[l],
                      "eff" = eff_measures$eff,
                      "0_0" = eff_measures$c.m[1, 1],
                      "0_1" = eff_measures$c.m[1, 2],
                      "1_0" = eff_measures$c.m[2, 1],
                      "1_1" = eff_measures$c.m[2, 2],
                      "predict_sum" = st3,
                      "triangle" = eff_measures$triangle)
count = count + 1
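
Since the question also asks about avoiding the nested loops, here is a minimal sketch of one way to flatten them: build every parameter combination up front with expand.grid and fill a preallocated column by row index. The value of stt and the score_model function are hypothetical stand-ins (the real data and predict.model are not available), so treat this as an illustration of the pattern rather than a drop-in replacement.

```r
# Parameter values from the question; stt stands in for sum(train.target)
stt    <- 100                      # assumed value, for illustration only
ntrees <- c(300, 500)
zeros  <- stt * c(1, 2, 3, 4, 5)
mvars  <- c(30, 50, 70, 90, 110, 130)

# One row per parameter combination replaces the three nested loops
grid <- expand.grid(ntrees = ntrees, zeros = zeros, mvars = mvars)
nitr <- nrow(grid)

# Preallocate the result column once instead of growing it with rbind
grid$eff <- numeric(nitr)

# Hypothetical stand-in for fitting the forest and scoring it on validation data
score_model <- function(ntree, zero, mvar) {
  (ntree / 500) * (mvar / 130) * (stt / zero)
}

for (i in seq_len(nitr)) {
  grid$eff[i] <- score_model(grid$ntrees[i], grid$zeros[i], grid$mvars[i])
  cat("iteration =", i, "of", nitr, "\n")
}

# Parameter combination with the highest (dummy) score
best <- grid[which.max(grid$eff), ]
```

The single loop does the same amount of model fitting as the nested version; the gain is in readability and in writing into storage that already exists. Extra result columns (confusion matrix cells and so on) can be preallocated the same way.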

Parallel RandomForest via `caret`

The only other suggestion I have is to parallelize the build of the random forest via:

# caret modeling framework
library(caret)

# Parallel backend
library(doParallel)

# Register a cluster
registerDoParallel(cores = 5)

rf_model = train(train.target~.,data=train,method="rf",
                 prox=TRUE,allowParallel=TRUE)
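
A different option, not part of the suggestion above, is to parallelize across parameter combinations rather than inside a single forest. The sketch below uses the parallel package that ships with R. The fit_one worker and its dummy score are hypothetical placeholders for a real randomForest fit plus validation, and zeros_mult is an assumed stand-in for the multipliers behind zeros; whether this is actually faster depends on how much memory each worker needs for its copy of the data.

```r
library(parallel)  # ships with R, no extra installation needed

# Parameter grid (values from the question; zeros_mult is an assumed
# stand-in for the multipliers applied to sum(train.target))
grid <- expand.grid(ntrees = c(300, 500),
                    zeros_mult = 1:5,
                    mvars = c(30, 50, 70, 90, 110, 130))

# Hypothetical worker: a real version would fit the forest with the
# parameters in row i and return its validation statistics
fit_one <- function(i, grid) {
  p <- grid[i, ]
  data.frame(ntrees = p$ntrees, zeros_mult = p$zeros_mult, mvars = p$mvars,
             eff = p$ntrees / (p$mvars * p$zeros_mult))  # dummy score
}

cl <- makeCluster(2)                       # two worker processes
results <- parLapply(cl, seq_len(nrow(grid)), fit_one, grid = grid)
stopCluster(cl)

model_eff_par <- do.call(rbind, results)   # a single rbind at the end
```

Each worker returns a one-row data frame, so the results are combined once after all fits finish instead of once per iteration.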

