Label Encoding in Python

What is Label Encoding?

Label encoding is a preprocessing technique that converts categorical data, such as colors or animal types, into numerical representations. Each unique category is assigned an integer value.
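
To make the idea concrete before turning to a library, here is a minimal pure-Python sketch of the same mapping. The `colors` list and the variable names are hypothetical and chosen only for illustration. Sorting the categories first is an assumption made here so that the result is reproducible.

```python
# Hypothetical sample values, used only for illustration.
colors = ["red", "green", "blue", "green", "red"]

# Assign each unique category an integer; sorting makes the mapping reproducible.
category_to_int = {
    category: index for index, category in enumerate(sorted(set(colors)))
}

# Replace every value with its assigned integer.
encoded = [category_to_int[color] for color in colors]

print(category_to_int)  # {'blue': 0, 'green': 1, 'red': 2}
print(encoded)          # [2, 1, 0, 1, 2]
```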

Why is Label Encoding Necessary?

Many machine learning algorithms can only be applied to numerical data. For instance:

  • Algorithms such as Logistic Regression, Linear Regression, and Support Vector Machines operate on numeric inputs and cannot process raw string values such as 'Dog' directly.
  • Converting these strings into numbers allows the algorithms to process and use the data.

Without encoding, most models either raise an error on string-valued input or cannot make use of the categorical information.
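
The short sketch below illustrates the point with scikit-learn's LogisticRegression. The toy feature column `X_raw` and target `y` are hypothetical. The exact error message depends on the scikit-learn version, so the code only checks that fitting on raw strings is rejected.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one categorical feature column and a binary target.
X_raw = [["Dog"], ["Cat"], ["Mouse"], ["Dog"]]
y = [1, 0, 0, 1]

model = LogisticRegression()

try:
    # Raw strings cannot be converted to the numeric array the model expects.
    model.fit(X_raw, y)
    fit_failed = False
except ValueError as error:
    fit_failed = True
    print("Fitting failed:", error)

print(fit_failed)  # True
```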

Detailed Example of Label Encoding

Example Data

Let’s start with a simple dataset:

data = ['Dog', 'Cat', 'Mouse', 'Dog', 'Mouse', 'Cat']

Step 1: Import LabelEncoder from sklearn

from sklearn.preprocessing import LabelEncoder

The LabelEncoder class is designed to handle this transformation.

Step 2: Initialize the Label Encoder

encoder = LabelEncoder()

This creates a LabelEncoder object named encoder, which will map each unique category to an integer.

Step 3: Fit and Transform the Data

encoded_data = encoder.fit_transform(data)

The .fit_transform() method does two things:

  1. Fit: Analyzes the input data and determines the unique categories, which are stored in sorted order.
  2. Transform: Maps each distinct category to a distinct integer and replaces the original values with the mapped integers.

For the provided data ['Dog', 'Cat', 'Mouse', 'Dog', 'Mouse', 'Cat'], the transformation proceeds as follows:

  • Unique categories identified: ['Cat', 'Dog', 'Mouse']
  • Assign integers:
    • ‘Cat’ → 0
    • ‘Dog’ → 1
    • ‘Mouse’ → 2
  • Transformed data: [1, 0, 2, 1, 2, 0]
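
The two phases described above can also be run separately with .fit() and .transform(). The sketch below shows this, and also shows that a fitted encoder rejects a category it did not see during fitting. 'Horse' is a hypothetical unseen value added for illustration.

```python
from sklearn.preprocessing import LabelEncoder

data = ['Dog', 'Cat', 'Mouse', 'Dog', 'Mouse', 'Cat']

encoder = LabelEncoder()

# Fit: learn the unique categories (stored sorted in classes_).
encoder.fit(data)
learned_classes = encoder.classes_.tolist()
print(learned_classes)  # ['Cat', 'Dog', 'Mouse']

# Transform: replace each value with its index in classes_.
encoded = encoder.transform(data).tolist()
print(encoded)  # [1, 0, 2, 1, 2, 0]

# The fitted encoder can be reused on new data with known categories.
reused = encoder.transform(['Mouse', 'Cat']).tolist()
print(reused)  # [2, 0]

# A category that was absent during fit raises a ValueError.
try:
    encoder.transform(['Horse'])  # hypothetical unseen category
    unseen_rejected = False
except ValueError:
    unseen_rejected = True
print(unseen_rejected)  # True
```

Splitting the phases matters in practice because the encoder is usually fitted on training data and then applied unchanged to later data.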

Step 4: Display the Encoded Data

print("Original Data:", data)
print("Encoded Data:", encoded_data)

Output (encoded_data is a NumPy array, so it prints without commas):

Original Data: ['Dog', 'Cat', 'Mouse', 'Dog', 'Mouse', 'Cat']
Encoded Data: [1 0 2 1 2 0]

Step 5: Access the Mapping

You can view how the categories were mapped using:

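# .tolist() converts the NumPy strings in classes_ to plain Python str so the dict prints cleanly
print("Mapping:", dict(zip(encoder.classes_.tolist(), range(len(encoder.classes_)))))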

Explanation:

  • encoder.classes_ contains the unique categories in sorted order: ['Cat', 'Dog', 'Mouse'].
  • range(len(encoder.classes_)) generates integers: [0, 1, 2].
  • zip pairs categories with their respective integers.

Output:

Mapping: {'Cat': 0, 'Dog': 1, 'Mouse': 2}

Step 6: Inverse Transformation

If you want to convert the numbers back to their original categories:

original_data = encoder.inverse_transform(encoded_data)
print("Decoded Data:", original_data)

Explanation:

  • .inverse_transform() reverses the transformation, mapping integers back to their original categories.

Output (a NumPy array of strings, so it prints without commas):

Decoded Data: ['Dog' 'Cat' 'Mouse' 'Dog' 'Mouse' 'Cat']
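
A common use of the inverse transformation is turning integer predictions back into readable labels. The sketch below first checks the round trip and then decodes `predicted_codes`, a hypothetical list standing in for a model's output.

```python
from sklearn.preprocessing import LabelEncoder

data = ['Dog', 'Cat', 'Mouse', 'Dog', 'Mouse', 'Cat']

encoder = LabelEncoder()
encoded_data = encoder.fit_transform(data)

# Round trip: decoding the encoded values recovers the original list.
decoded = encoder.inverse_transform(encoded_data).tolist()
print(decoded == data)  # True

# Hypothetical model output: integer class predictions for three samples.
predicted_codes = [2, 0, 1]
predicted_labels = encoder.inverse_transform(predicted_codes).tolist()
print(predicted_labels)  # ['Mouse', 'Cat', 'Dog']
```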

Handling Larger Datasets

Consider a dataset with multiple columns, some of which are categorical. Let’s encode these categorical columns.

Sample Dataset

import pandas as pd

# Sample dataset
data = {
    'Animal': ['Dog', 'Cat', 'Mouse', 'Dog', 'Cat', 'Mouse'],
    'Size': ['Large', 'Small', 'Small', 'Medium', 'Large', 'Medium']
}

df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

Output:

Original DataFrame:
   Animal    Size
0    Dog   Large
1    Cat   Small
2  Mouse   Small
3    Dog  Medium
4    Cat   Large
5  Mouse  Medium

Encoding the Columns

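# Use one LabelEncoder per column so each fitted mapping is kept
animal_encoder = LabelEncoder()
size_encoder = LabelEncoder()

# Encode the 'Animal' column
df['Animal_Encoded'] = animal_encoder.fit_transform(df['Animal'])

# Encode the 'Size' column
df['Size_Encoded'] = size_encoder.fit_transform(df['Size'])

This step:

  1. Fits a separate encoder to each column, so both mappings remain available for .inverse_transform(). Reusing a single encoder would overwrite the 'Animal' mapping as soon as 'Size' is fitted.
  2. Transforms the column values into integers.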

Encoded DataFrame (categories are numbered in sorted order, so Large → 0, Medium → 1, Small → 2):

   Animal    Size  Animal_Encoded  Size_Encoded
0    Dog   Large               1             0
1    Cat   Small               0             2
2  Mouse   Small               2             2
3    Dog  Medium               1             1
4    Cat   Large               0             0
5  Mouse  Medium               2             1
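
When a DataFrame has many categorical columns, one possible pattern is to keep a dictionary with one fitted encoder per column, so that every column can be decoded later. The sketch below is one way to do this. The `encoders` dictionary and the `_Encoded` suffix are naming choices made here, not part of any library convention.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'Animal': ['Dog', 'Cat', 'Mouse', 'Dog', 'Cat', 'Mouse'],
    'Size': ['Large', 'Small', 'Small', 'Medium', 'Large', 'Medium'],
})

# Keep one fitted encoder per column so each mapping stays available.
encoders = {}
for column in ['Animal', 'Size']:
    encoders[column] = LabelEncoder()
    df[column + '_Encoded'] = encoders[column].fit_transform(df[column])

animal_codes = df['Animal_Encoded'].tolist()
size_codes = df['Size_Encoded'].tolist()
print(animal_codes)  # [1, 0, 2, 1, 0, 2]
print(size_codes)    # [0, 2, 2, 1, 0, 1]

# Each column can still be decoded with its own encoder.
decoded_animals = encoders['Animal'].inverse_transform(df['Animal_Encoded']).tolist()
print(decoded_animals)  # ['Dog', 'Cat', 'Mouse', 'Dog', 'Cat', 'Mouse']
```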

Limitations of Label Encoding

Although Label Encoding is simple and efficient, it has some limitations:

1. Ordinal Relationships:

  • Encoded integers can suggest an order or rank where there is none.
  • For example, the encoded values [0, 1, 2] imply that Cat < Dog < Mouse, an ordering with no real meaning. Models that treat inputs as magnitudes, such as Linear Regression, can be misled by it.

2. Effect on Distance-Based Models:

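  • Models such as k-Nearest Neighbors (k-NN) or Support Vector Machines (SVM) rely on distances or similarities between feature vectors. Integer codes distort these: Cat (0) appears closer to Dog (1) than to Mouse (2), although no such relationship exists between the categories.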
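
Both limitations can be seen with simple arithmetic on the integer codes. The sketch below is only an illustration of how a magnitude-based or distance-based model would read the encoded values. It does not train any model.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(['Cat', 'Dog', 'Mouse'])

# Integer codes assigned to each category: Cat=0, Dog=1, Mouse=2.
cat, dog, mouse = encoder.transform(['Cat', 'Dog', 'Mouse']).tolist()

# Distances between codes suggest Cat is "closer" to Dog than to Mouse.
cat_dog_distance = abs(cat - dog)
cat_mouse_distance = abs(cat - mouse)
print(cat_dog_distance)    # 1
print(cat_mouse_distance)  # 2

# Treated as magnitudes, the "average" of Cat and Mouse equals Dog.
average_is_dog = (cat + mouse) / 2 == dog
print(average_is_dog)  # True
```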

When to Use Label Encoding

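  • Use Label Encoding for ordinal data, i.e., data with a meaningful order, such as ['Low', 'Medium', 'High']. Keep in mind that LabelEncoder assigns integers in sorted (alphabetical) order, so it would produce High → 0, Low → 1, Medium → 2. To obtain the intended mapping, specify the order explicitly, for example with a dictionary or scikit-learn's OrdinalEncoder:
    • ['Low', 'Medium', 'High'] → [0, 1, 2]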
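  • It is generally workable with tree-based models (e.g., Decision Trees, Random Forests, XGBoost) because they split data on thresholds and depend only on the order of the values, not on their magnitude. The arbitrary order of the codes can still influence which splits are available.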
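
For ordinal data, the order of the integers matters, so it helps to check what the encoder actually assigns. The sketch below compares LabelEncoder's alphabetical ordering with two ways of supplying the order explicitly: a plain dictionary and scikit-learn's OrdinalEncoder with its `categories` parameter. The `levels` list is hypothetical sample data.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal values.
levels = ['Low', 'Medium', 'High', 'Medium', 'Low']

# LabelEncoder sorts alphabetically: High=0, Low=1, Medium=2.
alphabetical = LabelEncoder().fit_transform(levels).tolist()
print(alphabetical)  # [1, 2, 0, 2, 1]

# Option 1: an explicit dictionary that encodes the intended order.
order = {'Low': 0, 'Medium': 1, 'High': 2}
mapped = [order[level] for level in levels]
print(mapped)  # [0, 1, 2, 1, 0]

# Option 2: OrdinalEncoder with the category order given explicitly.
# It expects 2D input (one inner list per row) and returns floats by default.
ordinal_encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
column = [[level] for level in levels]
ordinal = ordinal_encoder.fit_transform(column).ravel().tolist()
print(ordinal)  # [0.0, 1.0, 2.0, 1.0, 0.0]
```

The dictionary is the simplest choice for a single column. OrdinalEncoder may be more convenient when several feature columns need consistent handling inside a scikit-learn workflow.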