
How to handle categorical variables in machine learning

rishabhdwivedi062

Updated: Oct 6, 2022

Categorical variables are the non-numeric data in a dataset. When handled properly, they can improve the accuracy of a machine learning model.


In this post, we will go through various approaches for handling this type of data.


Introduction

A categorical variable takes only a limited number of values.


  • Consider a survey that asks people their height and divides the responses into three categories, i.e., "Small", "Medium", and "Large". In this case, the data is categorical, because responses fall into a fixed set of categories.

You will get an error if you feed this data directly into most machine learning models without preprocessing it first. Let us look at three different approaches to preprocessing categorical data.



Approaches

1) Drop categorical variables

One of the easiest approaches to dealing with categorical variables is to simply remove them from the dataset. This only works well if the columns do not contain useful information, so it is generally not recommended.


2) Ordinal Encoding

This method assigns different integer values to each unique category.



For example, "Small", "Medium", and "Large" could be mapped to the integers 0, 1, and 2. The integers can be assigned arbitrarily or according to some rule (such as the natural order of the categories). Ordinal encoding often works well with tree-based models, though whether it is appropriate also depends on the problem statement.
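As a quick illustration (a minimal sketch using a made-up "Size" column, not the dataset from the example further below), the mapping looks roughly like this:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column with three categories (hypothetical data, just for illustration)
df = pd.DataFrame({'Size': ['Small', 'Large', 'Medium', 'Small']})

encoder = OrdinalEncoder()
df['Size_encoded'] = encoder.fit_transform(df[['Size']])
print(df)
# 'Large' -> 0.0, 'Medium' -> 1.0, 'Small' -> 2.0 (sorted alphabetically by default)

Note that OrdinalEncoder assigns the integers in sorted order by default, which here does not match the natural order Small < Medium < Large; pass the categories argument if you need a specific order.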


3) One-Hot Encoding

This method creates a new column for each possible value in the original data, indicating that value's presence (1) or absence (0).



Unlike ordinal encoding, this method does not impose any ordering on the categories, so it is the better choice when the categories have no natural order.


One-hot encoding generally does not perform well if the categorical variable takes on a large number of values (i.e., you generally won't use it for variables taking more than 15 different values).
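A quick way to check whether one-hot encoding is a reasonable choice is to look at the cardinality (number of unique values) of each categorical column. A minimal sketch, assuming a pandas DataFrame X_train like the one used in the example below:

# Cardinality = number of unique values in a column
object_cols = [col for col in X_train.columns if X_train[col].dtype == 'object']
cardinality = {col: X_train[col].nunique() for col in object_cols}
print(cardinality)

# Keep one-hot encoding for low-cardinality columns only (the threshold is a rule of thumb)
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 15]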


Example

In this part we will not focus on data loading and preprocessing; instead, we will go straight to the code for each of the approaches above.


# Get list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)
print("Categorical variables:")
print(object_cols)
# we are doing this just for our reference

Categorical variables:
['Type', 'Method', 'Regionname']   

1) Drop categorical variables example

# Create drop_X_train by dropping the object-type (categorical) columns
# from X_train
drop_X_train = X_train.select_dtypes(exclude=['object'])

2) Ordinal Encoding example

Scikit-learn has an OrdinalEncoder class that can be used directly for this task. We select the columns containing categorical variables and apply the encoder to all of them at once.


from sklearn.preprocessing import OrdinalEncoder
# Make copy to avoid changing original data 
label_X_train = X_train.copy()
# Apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
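In a real workflow you would also transform the validation split with the same fitted encoder rather than fitting it again (a sketch, assuming a validation frame X_valid with the same columns; note that transform raises an error if X_valid contains a category that was not seen during fitting):

# Assumes X_valid is a validation split with the same categorical columns
label_X_valid = X_valid.copy()
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])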

3) One-Hot Encoding example

Scikit-learn has a OneHotEncoder class that we can use directly in our program.

(You can visit the sklearn docs to learn more about it)



import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)

OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
  • We set handle_unknown='ignore' to avoid errors when the validation data contains classes that aren't represented in the training data, and

  • we set sparse=False so that the encoded columns are returned as a NumPy array (instead of a sparse matrix).
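The same pattern applies to a validation split (a sketch, assuming X_valid exists); because of handle_unknown='ignore', any category not seen during fitting is simply encoded as all zeros:

# Assumes X_valid is a validation split with the same columns as X_train
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))
OH_cols_valid.index = X_valid.index

# Drop the original categorical columns and attach the one-hot columns
num_X_valid = X_valid.drop(object_cols, axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)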


Which approach is best?

Dropping the categorical columns (Approach 1) typically performs worst, since it throws information away. One-hot encoding (Approach 3) generally performs better, but which approach works best ultimately depends on the dataset and the model.
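One practical way to decide is to train the same model on each encoded version of the training data and compare validation error. A minimal sketch, assuming a regression target (y_train, y_valid), matching validation frames for each approach (e.g., drop_X_valid built the same way as drop_X_train), and a random forest as the model; these are assumptions, not part of the original example:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    # Fit a simple model and report validation MAE
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

print("Drop columns MAE:", score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))
print("Ordinal MAE:     ", score_dataset(label_X_train, label_X_valid, y_train, y_valid))
print("One-hot MAE:     ", score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))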


Hope you have learned something new :-)
