Tell me more ×
Stack Overflow is a question and answer site for professional and enthusiast programmers. It's 100% free, no registration required.

5 days and still no answer

  • As can be seen by Simon's comment, this is a reproducible and very strange issue. It seems that the issue only arises when a stepwise regression with very high predictive power is wrapped in a function.

I have been struggling with this for a while and any help would be much appreciated. I am trying to write a function that runs several stepwise regressions and outputs all of them to a list. However, R is having trouble reading the dataset that I specify in my function arguments. I found several similar errors on various boards (here, here, and here), however none of them seemed to ever get resolved. It all comes down to some weird issues with calling step() in a user-defined function. I am using the following script to test my code. Run the whole thing several times until an error arises (trust me, it will):

test.df <- data.frame(a = sample(0:1, 100, rep = T),
                      b = as.factor(sample(0:5, 100, rep = T)),
                      c = runif(100, 0, 100),
                      d = rnorm(100, 50, 50))
test.df$b[10:100] <- test.df$a[10:100] #making sure that at least one of the variables has some predictive power

stepModel <- function(modeling.formula, dataset, outfile = NULL) {
  if (is.null(outfile) == FALSE){
    sink(file = outfile,
         append = TRUE, type = "output")
    print("")
    print("Models run at:")
    print(Sys.time())
  }
  model.initial <- glm(modeling.formula,
                       family = binomial,
                       data = dataset)
  model.stepwise1 <- step(model.initial, direction = "backward")
  model.stepwise2 <- step(model.stepwise1, scope = ~.^2)
  output <- list(modInitial = model.initial, modStep1 = model.stepwise1, modStep2 = model.stepwise2)
  sink()
  return(output)
}

blah <- stepModel(a~., dataset = test.df)

This returns the following error message (if the error does not show up right away, keep re-running the test.df script as well as the call for stepModel(), it will show up eventually):

Error in is.data.frame(data) : object 'dataset' not found

I have determined that everything runs fine up until model.stepwise2 starts to get built. Somehow, the temporary object 'dataset' works just fine for the first stepwise regression, but fails to be recognized by the second. I found this by commenting out part of the function as can be seen below. This code will run fine, proving that the object 'dataset' was originally being recognized:

stepModel1 <- function(modeling.formula, dataset, outfile = NULL) {
  if (is.null(outfile) == FALSE){
    sink(file = outfile,
         append = TRUE, type = "output")
    print("")
    print("Models run at:")
    print(Sys.time())
  }
  model.initial <- glm(modeling.formula,
                       family = binomial,
                       data = dataset)
  model.stepwise1 <- step(model.initial, direction = "backward")
#   model.stepwise2 <- step(model.stepwise1, scope = ~.^2)
#   sink()
#   output <- list(modInitial = model.initial, modStep1 = model.stepwise1, modStep2 = model.stepwise2)
  return(model.stepwise1)
}

blah1 <- stepModel1(a~., dataset = test.df) 

EDIT - before anyone asks, all the summary() functions were there because the full function (i edited it so that you could focus in on the error) has another piece that defines a file to which you can output stepwise trace. I just got rid of them

EDIT 2 - session info

sessionInfo() R version 2.15.1 (2012-06-22) Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] tcltk     stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] sqldf_0.4-6.4         RSQLite.extfuns_0.0.1 RSQLite_0.11.3        chron_2.3-43         
 [5] gsubfn_0.6-5          proto_0.3-10          DBI_0.2-6             ggplot2_0.9.3.1      
 [9] caret_5.15-61         reshape2_1.2.2        lattice_0.20-6        foreach_1.4.0        
[13] cluster_1.14.2        plyr_1.8             

loaded via a namespace (and not attached):
 [1] codetools_0.2-8    colorspace_1.2-1   dichromat_2.0-0    digest_0.6.2       grid_2.15.1       
 [6] gtable_0.1.2       iterators_1.0.6    labeling_0.1       MASS_7.3-18        munsell_0.4       
[11] RColorBrewer_1.0-5 scales_0.2.3       stringr_0.6.2      tools_2.15

EDIT 3 - this performs all the same operations as the function, just without using a function. This will run fine every time, even when the algorithm doesn't converge:

modeling.formula <- a~.
dataset <- test.df
outfile <- NULL
if (is.null(outfile) == FALSE){
  sink(file = outfile,
       append = TRUE, type = "output")
  print("")
  print("Models run at:")
  print(Sys.time())
}
  model.initial <- glm(modeling.formula,
                       family = binomial,
                       data = dataset)
  model.stepwise1 <- step(model.initial, direction = "backward")
  model.stepwise2 <- step(model.stepwise1, scope = ~.^2)
  output <- list(modInitial = model.initial, modStep1 = model.stepwise1, modStep2 = model.stepwise2)
share|improve this question
I tried to reproduce this error and couldn't do so. I'm running R 2.15.3 64-bit on Windows 7 and the function ran satisfactorily until it reached the call to sink(), at which point it reported In sink() : no sink to remove. Could you say what version of R you are using? – Simon May 17 at 4:19
@Simon - i just put the outfile argument back into my function. this made the error reappear for me. I am adding system and session info to the question now. – zap2008 May 17 at 4:32
Now this is extremely strange... The error is popping up seemingly at random. If I run my code 10 times it pops up maybe twice... – zap2008 May 17 at 4:37
When I tried blah <- stepModel(a~., data = test.df,"test.txt"), I got no error at all and blah contained what I believe to be the expected output. – Simon May 17 at 4:39
1  
Fascinating. When I tried re-running the whole thing several times, including creating the random dataset test.df, I got the Error in is.data.frame(data) : object 'dataset' not found message after the third, sixth, seventh and tenth time that I ran it. Also, most of those times, I also got warning messages to say that the glm.fit algorithm did not converge. My suspicion is that the content of the frame seems to be part of the problem. – Simon May 17 at 5:04
show 5 more comments

1 Answer

up vote 3 down vote accepted

Using do.call to refer to the data set in the calling environment works for me. See http://stackoverflow.com/a/7668846/210673 for the original suggestion. Here's a version that works (with sink code removed).

stepModel2 <- function(modeling.formula, dataset) {
  model.initial <- do.call("glm", list(modeling.formula,
                       family = "binomial",
                       data = as.name(dataset)))
  model.stepwise1 <- step(model.initial, direction = "backward")
  model.stepwise2 <- step(model.stepwise1, scope = ~.^2)
  list(modInitial = model.initial, modStep1 = model.stepwise1, modStep2 = model.stepwise2)
}

blah <- stepModel2(a~., dataset = "test.df")

It fails for me consistently with set.seed(6) with the original code. The reason it fails is that the dataset variable is not present within the step function, and although it's not needed in making model.stepwise1, it is needed for model.stepwise2 when model.stepwise1 keeps a linear term. So that's the case when your version fails. Calling the dataset from the global environment as I do here fixes this issue.

share|improve this answer
so are you suggesting that I define an object called dataset in the global environment? That was my workaround originally, but it makes the function less usable. – zap2008 May 23 at 1:20
No, this function refers to "test.df" in the original environment rather than "dataset" within the function. My text about it is perhaps unclear; I'll edit. You should be able to just load this function and run the code example at the bottom. – Aaron May 23 at 1:31
This is great, thank you so much! I tried using do.call before I even posted this, as was suggested in the linked question, but I tried using get() instead of as.name(). How do the two differ? – zap2008 May 23 at 14:05
get("x") retrieves the object with name x, as.name("x") merely tells R that x is not a string but the name of an object. – Aaron May 23 at 19:08
Hmmm... I understand the difference, but I'm not sure I completely understand why those two would behave all that differently in a function. I'll look into it a little more on my own. Thanks! – zap2008 May 24 at 4:07

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.