Re-write custom feature encoding function python

Asked today

Viewed 1 times

I have a question about code quality and best practices. My task is to write a feature-encoding function that encodes categorical labels, one-hot encodes boolean labels, and converts timestamps into numeric columns for later use in ML training.

Input: a dictionary of several dataframes; each dataframe consists of several columns of different types.

The function should return a dictionary of correctly encoded dataframes and a dictionary of label encoders for categorical columns.
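
To make the input concrete, here is a minimal sketch of what such a dictionary could look like. The table names (`events`, `users`), the column names, and the helper `make_example_input` are hypothetical and only illustrate the column types described above; the real data may differ.

```python
import numpy as np
import pandas as pd


def make_example_input():
    """Build a small, made-up dictionary of dataframes (all names are hypothetical)."""
    events = pd.DataFrame({
        # boolean column
        "is_active": [True, False, True],
        # numeric column with a missing value
        "amount": [10.5, np.nan, 3.0],
        # timestamp column with a missing value (becomes NaT)
        "created_at": pd.to_datetime(
            ["2021-01-05 10:30:00", None, "2021-03-01 23:59:59"]
        ),
        # categorical column with exactly two distinct values
        "gender": pd.Categorical(["f", "m", "f"]),
        # categorical column with more than two distinct values
        "city": pd.Categorical(["Paris", "Lyon", "Nice"]),
    })
    users = pd.DataFrame({
        "age": [31, 45],
        "is_premium": [False, True],
    })
    return {"events": events, "users": users}


example = make_example_input()
for name, df in example.items():
    # Show the dtype of every column in each dataframe
    print(name, df.dtypes.astype(str).to_dict())
```

Each key is a dataframe name, and the expected result is a dictionary with the same keys holding the encoded dataframes, plus a second dictionary with the fitted label encoders per dataframe.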

Here is what I did:

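import pandas as pd
from pandas.api.types import (
    is_bool_dtype,
    is_categorical_dtype,
    is_datetime64_any_dtype,
    is_numeric_dtype,
)
from sklearn.preprocessing import LabelEncoder

# One-hot encode boolean columns, replace NaN in numeric columns with a constant,
# split timestamps into date/time component columns, and label-encode categoricals.
def df_encoder(data_dict):
    # constant used to replace all NA values in continuous data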
    NA_values = 0.001
    # dictionary to save dictionary of label encodings from LabelEncoder
    labels_codes = dict()

    for names_df in data_dict:
        # list, where to save label encodings from LabelEncoder from one dataframe
        labels_codes[names_df] = list()

        #take iteratively dataframe from the dictionary of dataframes
        df_additional = data_dict[names_df]

        for col in df_additional:

            if is_bool_dtype(df_additional[col]):
                loc_col = df_additional.columns.get_loc(col)
                df_additional_one_hot = pd.get_dummies(df_additional[col], prefix=col, dummy_na=True)
                df_additional = pd.concat([df_additional.iloc[:, :loc_col], df_additional_one_hot, df_additional.iloc[:, loc_col:]], axis=1).drop(col, axis=1)

            elif is_numeric_dtype(df_additional[col]):
                # fillna returns a new Series, so the result must be assigned back
                df_additional[col] = df_additional[col].fillna(NA_values)

            elif is_datetime64_any_dtype(df_additional[col]):
                loc_col = df_additional.columns.get_loc(col)
                date_df = pd.DataFrame()
                date_df[col+'_year'] = df_additional[col].dt.year.fillna(0)
                date_df[col+'_month'] = df_additional[col].dt.month.fillna(0)
                date_df[col+'_day'] = df_additional[col].dt.day.fillna(0)
                date_df[col+'_hour'] = df_additional[col].dt.hour.fillna(25)
                date_df[col+'_minute'] = df_additional[col].dt.minute.fillna(60)
                date_df[col+'_seconds'] = df_additional[col].dt.second.fillna(60)
                df_additional = pd.concat([df_additional.iloc[:, :loc_col], date_df, df_additional.iloc[:, loc_col:]], axis=1).drop(col, axis=1)

            elif is_categorical_dtype(df_additional[col]) and df_additional[col].nunique()== 2:
                loc_col = df_additional.columns.get_loc(col)
                df_additional_two_val_categ = pd.get_dummies(df_additional[col], prefix=col, dummy_na=True)
                df_additional = pd.concat([df_additional.iloc[:, :loc_col], df_additional_two_val_categ, df_additional.iloc[:, loc_col:]], axis=1).drop(col, axis=1)

            elif is_categorical_dtype(df_additional[col]) and df_additional[col].nunique()>2:
                #keep only alphanumeric and space, and ignore non-ASCII
                df_additional[col] = df_additional[col].astype(str).str.replace(r'[^A-Za-z0-9 ]+', '', regex=True)
                label_enc = LabelEncoder()
                df_additional[col] = label_enc.fit_transform(df_additional[col].astype(str))
                labels_codes[names_df].append({col: label_enc})

        # store the encoded dataframe once all of its columns have been processed
        data_dict[names_df] = df_additional
    return data_dict, labels_codes

The function works, but I'm not happy with its quality. I would like advice or examples on how to make it more efficient and closer to coding best practices. I will appreciate any insights and critique.

asked just now

PasDeSence

New contributor
