
I have a dataset that includes variables about customer income levels. The income was collected in a binned fashion (Which range describes your income? 0-25k, 25k-50k, ...). My question is how best to use this for modeling with the glmnet and gbm packages in R.

I have looked at the grouped package in R, but it seems to do everything (coarsening and regression) for you. Is there a package that converts binned data back to continuous data for use with other algorithms?

EDIT: The current method I'm using is to convert each bin to the midpoint of its range (0-25k -> 12500), then use ifelse() statements to code a few flag variables that convey the ordering of the levels.

    incOver25k <- ifelse(df1$income >= 25000, 1, 0)
    incOver50k <- ifelse(df1$income >= 50000, 1, 0)

Then use these flags instead of using model.matrix().
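For reference, the midpoint step itself looks roughly like this (the column name income_bin and the bin labels are just illustrative, not my actual data):

    # map each bin label to the midpoint of its range
    midpoints <- c("0-25k" = 12500, "25k-50k" = 37500, "50k-75k" = 62500, "75k-100k" = 87500)
    df1$income <- unname(midpoints[as.character(df1$income_bin)])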

Was curious if there were any better ideas.

Midpoints are just one of several possible treatments. Depending on the shape of the distribution, midpoints would be too low in some cases and too high in others. So, using midpoints implies a need for some sensitivity analysis. You haven't told us the number of bins (or indeed the currency units!), but if your questionnaire is typical of many, there won't be more than about 10 bins. There's quantitative juice in your lemon, but it is hard to squeeze out. – Nick Cox May 1 at 17:13

2 Answers

I would keep treating these as categorical variables. Since glmnet requires a matrix of independent variables as input rather than the usual formula interface, you will need the model.matrix function to set this up.

One more gotcha here - by default glmnet also standardizes the independent variables, which doesn't make sense for the 0/1 dummy columns of a categorical variable. So you should pass in standardize = FALSE as an argument. If you have continuous predictors as well, you should standardize those manually before calling model.matrix.
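A minimal sketch of that setup, assuming df1 holds the binned income as a factor income_bin, a continuous predictor age, and a response y (all names illustrative):

    library(glmnet)

    df1$age <- as.numeric(scale(df1$age))                    # standardize continuous predictors by hand
    x <- model.matrix(~ income_bin + age, data = df1)[, -1]  # dummy-code the factor; drop the intercept column (glmnet adds its own)
    fit <- glmnet(x, df1$y, standardize = FALSE)             # keep the dummy columns as plain 0/1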


You can't restore detail that has been thrown away -- or in this case never supplied in the first place -- so the presumption that coarsening can be inverted or reversed is surprising.

You could see how far the cumulative probabilities at 25k, 50k, etc. are consistent with various distributions, e.g. lognormal. Odds are that (1) you have rather few bins or classes and (2) the important uppermost class is open-ended.
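A rough sketch of that check (the boundaries and counts here are invented, purely to show the mechanics):

    breaks  <- c(25000, 50000, 75000, 100000)                 # finite upper bin boundaries
    counts  <- c(120, 240, 180, 90)                           # respondents per bin (say 70 more in the open-ended top bin)
    cum_obs <- cumsum(counts) / 700                           # observed cumulative proportions
    cum_fit <- plnorm(breaks, meanlog = 10.6, sdlog = 0.8)    # a candidate lognormal
    round(cbind(boundary = breaks, observed = cum_obs, lognormal = cum_fit), 3)

For a more formal fit, interval-censored maximum likelihood (e.g. fitdistcens in the fitdistrplus package) is one route; the open-ended top class can then be treated as right-censored.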

As Affine says, you could fall back on treating income as categorical; at least it is ordinal or graded.
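For example (bin labels hypothetical):

    # keep the ordering information by declaring an ordered factor
    df1$income_bin <- factor(df1$income_bin,
                             levels  = c("0-25k", "25k-50k", "50k-75k", "75k+"),
                             ordered = TRUE)
    # gbm can take factor predictors directly; with glmnet, model.matrix will use
    # polynomial contrasts for an ordered factor unless you override the contrasts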

