\$\begingroup\$

I have a question regarding code quality and best practices. My task is to write a feature-encoding function that encodes categorical labels, one-hot encodes boolean labels, and converts timestamps into numeric columns for later use in ML training.

The input is a dictionary of several dataframes; each dataframe consists of several columns of different types.

The function should return a dictionary of correctly encoded dataframes and a dictionary of label encoders for categorical columns.
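To make the contract concrete, here is a minimal sketch of what such an input might look like. The frame names and column names are made up for illustration only; the dtype checks at the end mirror the ones the function branches on.

```python
import numpy as np
import pandas as pd
from pandas.api.types import (is_bool_dtype, is_datetime64_any_dtype,
                              is_numeric_dtype)

# Hypothetical input: frame and column names are invented for this example.
data_dict = {
    "users": pd.DataFrame({
        "is_active": [True, False, True],                # boolean column
        "age": [31.0, np.nan, 45.0],                     # numeric column with a NaN
        "signup": pd.to_datetime(
            ["2021-01-05 10:30:00", None, "2022-07-19 23:59:59"]
        ),                                               # timestamps with a missing value
        "country": pd.Categorical(["DE", "FR", "US"]),   # more than two categories
    }),
    "orders": pd.DataFrame({
        "paid": [True, True, False],                     # boolean column
        "channel": pd.Categorical(["web", "app", "web"]),  # exactly two categories
    }),
}

# Each column falls into one of the branches the encoder distinguishes.
users = data_dict["users"]
column_kinds = {
    "is_active": is_bool_dtype(users["is_active"]),
    "age": is_numeric_dtype(users["age"]),
    "signup": is_datetime64_any_dtype(users["signup"]),
    "country": isinstance(users["country"].dtype, pd.CategoricalDtype),
}
```

For an input like this, the intended result is a dictionary with the same keys holding the encoded dataframes, plus a second dictionary that maps each frame name to a list of `{column: LabelEncoder}` entries for its label-encoded columns.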

Here is what I did:

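import pandas as pd
from pandas.api.types import (is_bool_dtype, is_categorical_dtype,
                              is_datetime64_any_dtype, is_numeric_dtype)
from sklearn.preprocessing import LabelEncoder

# One-hot encode bool columns, replace NaN in numeric columns with a constant,
# split timestamps into date/time component columns, and label-encode categoricals
def df_encoder(data_dict):
    # constant used to fill NA values in continuous data
    NA_values = 0.001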
    # dictionary to save dictionary of label encodings from LabelEncoder
    labels_codes = dict()

    for names_df in data_dict:
        # list, where to save label encodings from LabelEncoder from one dataframe
        labels_codes[names_df] = list()

        #take iteratively dataframe from the dictionary of dataframes
        df_additional = data_dict[names_df]

        for col in df_additional:

            if is_bool_dtype(df_additional[col]):
                loc_col = df_additional.columns.get_loc(col)
                df_additional_one_hot = pd.get_dummies(df_additional[col], prefix=col, dummy_na=True)
                df_additional = pd.concat([df_additional.iloc[:, :loc_col], df_additional_one_hot, df_additional.iloc[:, loc_col:]], axis=1).drop(col, axis=1)

            elif is_numeric_dtype(df_additional[col]):
                # fillna returns a new Series, so the result must be assigned back
                df_additional[col] = df_additional[col].fillna(NA_values)

            elif is_datetime64_any_dtype(df_additional[col]):
                loc_col = df_additional.columns.get_loc(col)
                date_df = pd.DataFrame()
                date_df[col+'_year'] = df_additional[col].dt.year.fillna(0)
                date_df[col+'_month'] = df_additional[col].dt.month.fillna(0)
                date_df[col+'_day'] = df_additional[col].dt.day.fillna(0)
                date_df[col+'_hour'] = df_additional[col].dt.hour.fillna(25)
                date_df[col+'_minute'] = df_additional[col].dt.minute.fillna(60)
                date_df[col+'_seconds'] = df_additional[col].dt.second.fillna(60)
                df_additional = pd.concat([df_additional.iloc[:, :loc_col], date_df, df_additional.iloc[:, loc_col:]], axis=1).drop(col, axis=1)

            elif is_categorical_dtype(df_additional[col]) and df_additional[col].nunique()== 2:
                loc_col = df_additional.columns.get_loc(col)
                df_additional_two_val_categ = pd.get_dummies(df_additional[col], prefix=col, dummy_na=True)
                df_additional = pd.concat([df_additional.iloc[:, :loc_col], df_additional_two_val_categ, df_additional.iloc[:, loc_col:]], axis=1).drop(col, axis=1)

            elif is_categorical_dtype(df_additional[col]) and df_additional[col].nunique()>2:
                # keep only ASCII letters, digits and spaces; assign the result back
                # instead of relying on inplace=True on a column selection
                df_additional[col] = df_additional[col].astype(str).str.replace(r'[^A-Za-z0-9 ]+', '', regex=True)
                label_enc = LabelEncoder()
                df_additional[col] = label_enc.fit_transform(df_additional[col].astype(str))
                labels_codes[names_df].append({col: label_enc})

        # store the encoded dataframe once, after all of its columns are processed
        data_dict[names_df] = df_additional
    return data_dict, labels_codes

The function works, but I'm not happy with its quality. I would appreciate advice or examples on how to make it more efficient and more in line with best coding practices. Any insights and critique are welcome.

\$\endgroup\$
