
I have a dataset that includes variables about customer income levels. The income was collected in a binned fashion (Which range describes your income? 0-25k, 25k-50k, ...). My question is how best to use this for modeling with the glmnet and gbm packages in R.

I have looked at the grouped package in R, but it seems to do everything (coarsening and regression) for you. Is there a package that converts binned data back to continuous data for use with other algorithms?

EDIT: The current method I'm using is to convert each bin to the midpoint of its range (0-25k -> 12500), then use ifelse() statements to code a few flag variables that convey the ordering of the levels.

    incOver25k <- ifelse(df1$income >= 25000, 1, 0)
    incOver50k <- ifelse(df1$income >= 50000, 1, 0)

Then use these flags instead of using model.matrix().
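For reference, the midpoint step itself looks roughly like this (the column name income_bin and the bin labels are just illustrative, not my actual data):

    # map each bin label to the midpoint of its range
    midpoints <- c("0-25k" = 12500, "25k-50k" = 37500, "50k-75k" = 62500, "75k-100k" = 87500)
    df1$income <- unname(midpoints[as.character(df1$income_bin)])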

Was curious if there were any better ideas.

Midpoints are just one of several possible treatments. Depending on the shape of the distribution, midpoints would be too low in some cases and too high in others. So, using midpoints implies a need for some sensitivity analysis. You haven't told us the number of bins (or indeed the currency units!), but if your questionnaire is typical of many, there won't be more than about 10 bins. There's quantitative juice in your lemon, but it is hard to squeeze out. – Nick Cox May 1 at 17:13

2 Answers

I would keep treating these as categorical variables. Since glmnet requires a matrix of independent variables as input rather than the usual formula interface, you will need the model.matrix function to set this up.

One more gotcha here - by default glmnet also standardizes the independent variables, which doesn't make sense for the 0/1 dummy columns of a categorical variable. So you should pass in standardize = FALSE as an argument. If you have continuous predictors as well, you should standardize those manually before calling model.matrix.
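A minimal sketch of that setup, assuming df1 holds the binned income as a factor income_bin, a continuous predictor age, and a response y (all names illustrative):

    library(glmnet)

    df1$age <- as.numeric(scale(df1$age))                    # standardize continuous predictors by hand
    x <- model.matrix(~ income_bin + age, data = df1)[, -1]  # dummy-code the factor; drop the intercept column (glmnet adds its own)
    fit <- glmnet(x, df1$y, standardize = FALSE)             # keep the dummy columns as plain 0/1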


You can't restore detail that has been thrown away -- or in this case never supplied in the first place -- so the presumption that coarsening can be inverted or reversed is surprising.

You could see how far the cumulative probabilities at 25k, 50k, etc. are consistent with various distributions, e.g. lognormal. Odds are that (1) you have rather few bins or classes and (2) the important uppermost class is open-ended.
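A rough sketch of that check (the boundaries and counts here are invented, purely to show the mechanics):

    breaks  <- c(25000, 50000, 75000, 100000)                 # finite upper bin boundaries
    counts  <- c(120, 240, 180, 90)                           # respondents per bin (say 70 more in the open-ended top bin)
    cum_obs <- cumsum(counts) / 700                           # observed cumulative proportions
    cum_fit <- plnorm(breaks, meanlog = 10.6, sdlog = 0.8)    # a candidate lognormal
    round(cbind(boundary = breaks, observed = cum_obs, lognormal = cum_fit), 3)

For a more formal fit, interval-censored maximum likelihood (e.g. fitdistcens in the fitdistrplus package) is one route; the open-ended top class can then be treated as right-censored.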

As Affine says, you could fall back on treating income as categorical; at least it is ordinal or graded.
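For example (bin labels hypothetical):

    # keep the ordering information by declaring an ordered factor
    df1$income_bin <- factor(df1$income_bin,
                             levels  = c("0-25k", "25k-50k", "50k-75k", "75k+"),
                             ordered = TRUE)
    # gbm can take factor predictors directly; with glmnet, model.matrix will use
    # polynomial contrasts for an ordered factor unless you override the contrasts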

