
I want to convert every categorical column in an R data frame into several binary columns.

For example, a categorical column like

Company
-------
IBM
Microsoft
Google

will be converted to 3 columns:

Company_is_IBM   Company_is_Microsoft    Company_is_Google

 1                 0                       0
 0                 1                       0
 0                 0                       1

However, my code runs very slowly on my test data, which should produce a data frame with 900 columns and 367 rows.

I'm attaching the code so you can test it with my dataset: extract the source code archive, set the working directory (setwd) to the directory containing the source files, as shown in the picture in the archive, and run:

gpudataf = read.table("gpudataf.txt")
gpudataf_bin = ConvertCategoricalDataFrameToBinaryKeepingOriginals(gpudataf)

and you will see how slow it is.

Source code archive

source("IsCategorical.R")

# Take a data.frame, determine which columns are categorical, and convert
# each categorical column to several binary columns with values 0 and 1.

# Input: a categorical column and the name of that column (used as a prefix).
# Output: a data frame of multiple binary columns.
ConvertCategoricalColumnToBinaryColumns <- function(catecol,prefix = "")
{
  set_values = unique(catecol)
  returnvalue = data.frame("ru9uGEQu" = catecol) # temporary column with a name unlikely to clash
  returnvalue[["ru9uGEQu"]] <- NULL # dropped again; it only fixes the number of rows

  for(val in set_values){
    newcolName = paste(prefix, val, sep = "_is_")
    returnvalue[[newcolName]] <- Map(function(x){return (if(x==val) 1 else 0)}, catecol) 
  }

  return(returnvalue)
}

# Input: a data frame with categorical columns, plus the names of any extra
# columns that are categorical but encoded as numbers.
# Output: the data frame with its categorical columns replaced by binary columns.

ConvertCategoricalDataFrameToBinary <- function(dataframe_with_cat_column, categoricalColumns=list())
{
  colNames= colnames(dataframe_with_cat_column)
  returnvalue = data.frame("ru9uGEQu" = dataframe_with_cat_column[[colNames[1]]])
  returnvalue[["ru9uGEQu"]] <- NULL

  for(columnName in colnames(dataframe_with_cat_column)){
    # print(columnName)
    if(IsCategorical(dataframe_with_cat_column[[columnName]]) || (columnName %in% categoricalColumns)){
      binary_dataframe = ConvertCategoricalColumnToBinaryColumns(dataframe_with_cat_column[[columnName]],prefix = columnName)
      # attach/insert it to the returnvalue
      for(binColName in colnames(binary_dataframe)){
        returnvalue[[binColName]] = binary_dataframe[[binColName]]
      }
    }
    else{
      # that column is numerical, don't change it.
      returnvalue[[columnName]] = dataframe_with_cat_column[[columnName]]
    }
  }
  return(returnvalue)
}

# if you want to keep the original values in the returned dataframe.

ConvertCategoricalDataFrameToBinaryKeepingOriginals <- function(dataframe_with_cat_column, categoricalColumns=list())
{
  without = ConvertCategoricalDataFrameToBinary(dataframe_with_cat_column, categoricalColumns)
  returnvalue = dataframe_with_cat_column

  existing_colNames = colnames(returnvalue)

  for(coln in colnames(without)){
    if(!(coln %in% existing_colNames)){
      returnvalue[[coln]] = without[[coln]]
    }
  }
  return(returnvalue)
}

migrated from stackoverflow.com Feb 20 '12 at 6:05

This question came from our site for professional and enthusiast programmers.

5  
I suspect you under-estimate people's willingness to download anonymous executable code from persons they are unfamiliar with. –  DWin Feb 20 '12 at 1:05
1  
General strategies for speeding up R performance: stackoverflow.com/a/8474941/636656 –  gsk3 Feb 20 '12 at 1:06
    
Try to replace Map(function(x){return (if(x==val) 1 else 0)}, catecol) with (catecol==val)*1 –  Gregory Demin Feb 20 '12 at 5:10
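The suggestion above can be sketched as follows. This is only an illustration on made-up data (the vector `catecol` and the prefix `"Company"` are assumptions taken from the example in the question), not a drop-in replacement for the full function. Each indicator column is built with a single vectorized comparison instead of one function call per element, and the result is a plain numeric column rather than the list column that `Map` produces.

```r
# Hypothetical example data; the prefix "Company" mirrors the question's example.
catecol <- c("IBM", "Microsoft", "Google", "IBM")
prefix <- "Company"

# One comparison per distinct value, each evaluated over the whole vector at once.
# (catecol == val) is a logical vector; multiplying by 1 turns it into 0/1 numbers.
set_values <- unique(catecol)
binary_cols <- lapply(set_values, function(val) (catecol == val) * 1)
names(binary_cols) <- paste(prefix, set_values, sep = "_is_")

# Assemble all indicator columns in one step instead of growing a data frame.
result <- data.frame(binary_cols, check.names = FALSE)
print(result)
```

If the column may contain `NA`, the comparison yields `NA` in those rows, so missing values would need separate handling.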

2 Answers

This is a general do-it-yourself answer to "my code is slow, what can I do?": Use the profiler.

Rprof(tmp <- tempfile())

#############################
#### YOUR CODE GOES HERE ####
#############################

Rprof(NULL)  # stop profiling
summaryRprof(tmp)
unlink(tmp)

Yes, there is a bit of a learning curve in interpreting the output, but you can use it to find the bottleneck in your code and test alternatives of your own. If those don't work, ask again with a more specific question, such as "My code is slow doing this; what would be a better way?".
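As a rough illustration of what that looks like in practice, here is a minimal sketch that profiles a stand-in for the element-wise `Map` call from the question. The function `slow_indicator` and the synthetic vector are assumptions made up for this example; the vector is deliberately large so that the sampling profiler has something to record.

```r
# Hypothetical workload: a Map-based indicator column like the one in the
# question, applied to a synthetic vector large enough for the sampler to catch.
set.seed(1)
catecol <- sample(c("IBM", "Microsoft", "Google"), 2e5, replace = TRUE)

slow_indicator <- function(x, val) {
  Map(function(el) if (el == val) 1 else 0, x)
}

prof_file <- tempfile()
Rprof(prof_file)                 # start sampling the call stack
for (val in unique(catecol)) {
  invisible(slow_indicator(catecol, val))
}
Rprof(NULL)                      # stop profiling

prof <- summaryRprof(prof_file)
unlink(prof_file)

# by.self ranks functions by time spent in their own code;
# by.total also counts time spent in the functions they call.
print(head(prof$by.self))
print(head(prof$by.total))
```

The functions at the top of `by.total` show which calls the time is concentrated in, which is the place to try an alternative first.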


model.matrix already does most of what you want:

company <- c("IBM", "Microsoft", "Google");
model.matrix( ~ company - 1)
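To get closer to the layout asked for in the question, the same call can be applied to a column of a data frame and the result bound back onto the original. This is a sketch under assumptions: the data frame `df` and its `Revenue` column are invented for illustration, and the renaming step simply reproduces the `_is_` naming scheme from the question.

```r
# Hypothetical data frame mixing one categorical and one numeric column.
df <- data.frame(
  Company = factor(c("IBM", "Microsoft", "Google"),
                   levels = c("IBM", "Microsoft", "Google")),
  Revenue = c(10, 20, 30)
)

# "- 1" drops the intercept so every level of the factor gets its own column.
indicators <- model.matrix(~ Company - 1, data = df)

# model.matrix names the columns "CompanyIBM" etc., in the order of the factor
# levels; rename them to the scheme used in the question.
colnames(indicators) <- paste("Company", levels(df$Company), sep = "_is_")

# Keep the original columns and append the indicator columns.
result <- cbind(df, indicators)
print(result)
```

When a formula contains several factors, only the first one keeps a column for every level, so for a whole data frame it is probably simplest to call `model.matrix` once per categorical column.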
