Handling Rare Labels in Categorical Variables in Machine Learning

Categorical variables in machine learning take their values from a predefined set of categories or labels. It’s common for some labels to appear frequently while others are rare. These rare labels, categories that show up in only a handful of observations, can hurt the performance of machine learning models, particularly tree-based methods.

This article explores the implications of rare labels in categorical variables, discusses why they can be problematic, and provides practical examples on how to address rare labels during preprocessing.

Rare Labels in Categorical Variables: Why Are They a Problem?

In business datasets, it’s common to encounter categorical variables with a few dominant labels and several rare ones. For example, consider a dataset containing information on loan applicants, where one variable represents the “city where the applicant lives.” Larger cities like ‘New York’ will likely appear frequently, whereas small towns with fewer residents may only show up in a handful of cases.

Overfitting and Noise from Rare Labels

Rare labels in categorical variables can cause several issues:

  1. Overfitting: Tree-based models like decision trees or random forests tend to overfit to rare labels. Since rare categories appear infrequently, the model may split on them to perfectly predict the few observations in the training set, but this introduces noise and reduces generalization to unseen data.

  2. Noisy Information: A large number of infrequent labels can add noise to the model without providing useful information. Instead of contributing meaningful signals, these labels make the model more complex and prone to overfitting.

  3. Labels Missing from the Test Set: Rare labels present in the training set may never appear in the test set. The model learns rules for these labels during training, but those rules can never be applied or validated on unseen data, so they only add to overfitting.

  4. New Rare Labels in Test Set: Conversely, new labels might appear in the test set that were not present during training. The model won’t know how to handle these unseen labels, which can hurt its predictive accuracy; a small sketch of one defense follows this list.
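
As a minimal sketch of the last point: one simple defense is to map any label unseen during training to a catch-all category at inference time. The DataFrames and city names below are made up for illustration.

import pandas as pd

# Toy training and test data; 'Smallville' never appears in training
train = pd.DataFrame({'city': ['New York', 'New York', 'Boston', 'Chicago']})
test = pd.DataFrame({'city': ['New York', 'Smallville']})

# Labels observed during training
known_labels = set(train['city'])

# Replace unseen labels with a catch-all category before encoding
test['city'] = test['city'].where(test['city'].isin(known_labels), 'Other')
print(test['city'].tolist())  # ['New York', 'Other']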

Rare labels can sometimes be informative. For instance, when predicting fraudulent applications, rare labels might be highly predictive of fraudulent behavior. In such cases, handling these rare values appropriately is essential for building robust models.
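
When rarity itself may carry signal, one option, sketched here with made-up names rather than prescribed, is to record a binary “is rare” flag before grouping or encoding, so the model can still see that a label was infrequent:

import pandas as pd

# Toy applicant data; the threshold of 2 is only for the example
applications = pd.DataFrame({'city': ['New York', 'New York', 'Boston', 'Smallville']})

counts = applications['city'].value_counts()
rare_labels = counts[counts < 2].index

# Binary flag marking observations that carry a rare label
applications['city_is_rare'] = applications['city'].isin(rare_labels).astype(int)
print(applications)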

Real-Life Example: The Mercedes-Benz Greener Manufacturing Challenge

Consider the Mercedes-Benz Greener Manufacturing Challenge, a regression problem where we need to predict the time it takes for a car with certain features to pass a testing system. The dataset contains a mix of car features, some of which may have rare labels that represent certain configurations seen in only a few cars. Ignoring or mishandling these rare labels could result in overfitting and poor performance on unseen data.

To follow along, you can download the dataset from Kaggle. Once downloaded, unzip the train.csv.zip file and save the extracted file as mercedesbenz.csv in your working directory. We’ll use this dataset to demonstrate how rare labels affect machine learning models.

Loading and Exploring the Dataset

import pandas as pd

# Load the dataset
data = pd.read_csv('mercedesbenz.csv')

# Display the shape and first few rows of the dataset
print(data.shape)
data.head()
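
Since the techniques below target categorical columns, it helps to list them explicitly first. In this dataset the categorical features are the object-typed columns, which we can pick out with select_dtypes.

# Identify the categorical columns as the object-typed ones
categorical_cols = data.select_dtypes(include='object').columns.tolist()
print(categorical_cols)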

Identifying Rare Labels

We can now check for rare labels in categorical variables by calculating the frequency of each category.

import matplotlib.pyplot as plt

# Frequency of each category in a sample categorical column
categorical_var = 'X1'  # 'X1' is one of the categorical variables in this dataset
data[categorical_var].value_counts().plot(kind='bar')
plt.show()

This will give you an idea of which categories are dominant and which ones are rare.
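
To make “rare” concrete, we can compute relative frequencies and list the labels that fall below a chosen cutoff, for example 5% of observations. The cutoff is a modeling choice, not a fixed rule.

# Relative frequency of each category
freqs = data[categorical_var].value_counts(normalize=True)

# Labels appearing in less than 5% of the observations
rare_labels = freqs[freqs < 0.05]
print(rare_labels)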

Handling Rare Labels in Categorical Variables

There are several techniques to handle rare labels effectively:

  1. Group Rare Labels into a New Category

One common approach is to group all rare labels into a single category, such as “Other.” This reduces the noise introduced by rare labels and helps prevent overfitting.

import numpy as np

# Define a threshold for rare categories (less than 5% of observations)
threshold = 0.05 * len(data)

# Count the observations per category and collect the rare labels
counts = data[categorical_var].value_counts()
rare_labels = counts[counts < threshold].index

# Group rare labels into 'Other'
data[categorical_var] = np.where(data[categorical_var].isin(rare_labels),
                                 'Other',
                                 data[categorical_var])

  2. Frequency Encoding

Another method is to replace each category with its relative frequency (the share of observations in which it appears). Rare labels are then represented by small numbers rather than as distinct categories, and labels with similar counts end up with similar values. In practice, the frequency map should be learned on the training set and then applied to the test set.

# Frequency encoding
freq_map = data[categorical_var].value_counts(normalize=True).to_dict()
data[categorical_var] = data[categorical_var].map(freq_map)

  3. Target Encoding

For regression problems, target encoding replaces each category with the mean of the target variable (here, y) for that category, putting rare and frequent categories on the same numeric scale. Note that computing the means over the full dataset, as below, leaks target information into the features; a leakage-aware variant is sketched after the snippet.

# Target encoding
mean_target = data.groupby(categorical_var)['y'].mean()
data[categorical_var] = data[categorical_var].map(mean_target)
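
A minimal sketch of the leakage-aware variant, assuming the original, un-encoded categorical column and a standard scikit-learn train/test split; categories unseen in training fall back to the global training mean:

from sklearn.model_selection import train_test_split

# Learn the category means on the training split only
train_set, test_set = train_test_split(data, test_size=0.2, random_state=0)
mean_target = train_set.groupby(categorical_var)['y'].mean()
global_mean = train_set['y'].mean()

# Apply to both splits; unseen categories get the global mean
train_encoded = train_set[categorical_var].map(mean_target)
test_encoded = test_set[categorical_var].map(mean_target).fillna(global_mean)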

  4. One-Hot Encoding with Threshold

You can also apply one-hot encoding while keeping only the most frequent categories. This reduces the dimensionality of the encoded dataset and prevents rare labels from cluttering the model.

from sklearn.preprocessing import OneHotEncoder

# Keep the top n most frequent categories and merge the rest into 'Other'
top_n = 10
top_categories = data[categorical_var].value_counts().index[:top_n]

data[categorical_var] = np.where(data[categorical_var].isin(top_categories),
                                 data[categorical_var],
                                 'Other')

# Apply one-hot encoding; sparse_output=False returns a dense array
# (use sparse=False on scikit-learn versions before 1.2). Setting
# handle_unknown='ignore' would additionally tolerate unseen labels at predict time.
encoder = OneHotEncoder(sparse_output=False)
encoded_vars = encoder.fit_transform(data[[categorical_var]])

# Attach the encoded columns back to the DataFrame
encoded_df = pd.DataFrame(encoded_vars,
                          columns=encoder.get_feature_names_out([categorical_var]),
                          index=data.index)
data = pd.concat([data.drop(columns=[categorical_var]), encoded_df], axis=1)

Conclusion

Rare labels in categorical variables can have a significant impact on machine learning model performance. If not handled properly, they can introduce noise, lead to overfitting, and decrease generalization. Depending on the context and the model, rare labels can be grouped, encoded, or transformed in ways that mitigate their impact. While they may sometimes provide valuable information—especially in cases like fraud detection—it’s essential to apply appropriate preprocessing techniques to ensure robust model performance.

In the Mercedes-Benz Greener Manufacturing Challenge, for example, properly handling rare labels could improve the accuracy of predicting testing times for various car configurations.