Label Encoding in Python
What is Label Encoding?
Label encoding is a preprocessing technique that converts categorical data, such as colors or animal types, into numerical representations. Each unique category is assigned an integer value.
Why is Label Encoding Necessary?
Many machine learning algorithms can only be applied to numerical data. For instance:
- Algorithms such as Logistic Regression, Linear Regression, or Support Vector Machines cannot handle string labels directly.
- Converting these strings into numbers allows the algorithms to process and use the data.
Without encoding, such models typically cannot be fitted at all: passing string-valued features raises an error instead of producing a model.
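To make this concrete, here is a minimal sketch, assuming scikit-learn is installed. The toy features, the targets, and the helper name fits_without_error are invented for illustration. It tries to fit LogisticRegression once on raw strings and once on integer codes.

```python
from sklearn.linear_model import LogisticRegression

# Toy data, invented for illustration: one categorical feature, binary target.
X_raw = [['Dog'], ['Cat'], ['Mouse'], ['Dog'], ['Mouse'], ['Cat']]
y = [1, 0, 0, 1, 0, 0]

def fits_without_error(X, y):
    """Return True if LogisticRegression can be fitted on X, else False."""
    try:
        LogisticRegression().fit(X, y)
        return True
    except ValueError:
        # scikit-learn rejects features it cannot convert to numbers.
        return False

# The same feature with each string replaced by an integer code.
X_encoded = [[1], [0], [2], [1], [2], [0]]

print("Raw strings accepted:", fits_without_error(X_raw, y))
print("Integer codes accepted:", fits_without_error(X_encoded, y))
```

The first call should report False and the second True, which is the gap that encoding closes.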
Detailed Example of Label Encoding
Example Data
Let’s start with a simple dataset:
data = ['Dog', 'Cat', 'Mouse', 'Dog', 'Mouse', 'Cat']
Step 1: Import LabelEncoder from sklearn
from sklearn.preprocessing import LabelEncoder
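The LabelEncoder class handles this transformation. scikit-learn documents it as intended for encoding target labels (y); it is applied to plain lists and feature columns here to illustrate the mechanics.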
Step 2: Initialize the Label Encoder
encoder = LabelEncoder()
Here, the LabelEncoder object (encoder) is created, which will map unique categories to integers.
Step 3: Fit and Transform the Data
encoded_data = encoder.fit_transform(data)
The .fit_transform() method does two things:
- Fit: Analyzes the input data and determines the unique categories.
- Transform: Maps each distinct category to a distinct integer and replaces the original values with the mapped integers.
For the provided data ['Dog', 'Cat', 'Mouse', 'Dog', 'Mouse', 'Cat'], transformation goes like this:
- Unique categories identified: ['Cat', 'Dog', 'Mouse']
- Integers assigned:
  - ‘Cat’ → 0
  - ‘Dog’ → 1
  - ‘Mouse’ → 2
- Transformed data: [1, 0, 2, 1, 2, 0]
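The two phases can also be run separately, which is useful when an encoder fitted on one dataset must later encode another. The following is a minimal sketch; the 'Horse' value is an invented example of a category that was not present during fitting.

```python
from sklearn.preprocessing import LabelEncoder

data = ['Dog', 'Cat', 'Mouse', 'Dog', 'Mouse', 'Cat']

# Calling fit() and transform() separately is equivalent to fit_transform().
encoder = LabelEncoder()
encoder.fit(data)                       # learn the sorted unique categories
encoded_data = encoder.transform(data)  # map each value to its integer code

print(encoder.classes_.tolist())  # learned categories
print(encoded_data.tolist())      # integer codes as a plain list

# A fitted encoder can encode new data, but only categories seen during fit.
new_codes = encoder.transform(['Mouse', 'Cat'])

try:
    encoder.transform(['Horse'])  # 'Horse' was never seen during fit
    unseen_ok = True
except ValueError:
    unseen_ok = False
print("Unseen label accepted:", unseen_ok)
```

Because unseen categories are rejected, the encoder should be fitted on data that covers every category expected later.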
Step 4: Display the Encoded Data
print("Original Data:", data)
print("Encoded Data:", encoded_data)
Output:
Original Data: ['Dog', 'Cat', 'Mouse', 'Dog', 'Mouse', 'Cat']
Encoded Data: [1 0 2 1 2 0]
Step 5: Access the Mapping
You can view how the categories were mapped using:
print("Mapping:", dict(zip(encoder.classes_, range(len(encoder.classes_)))))
Explanation:
- encoder.classes_ contains the unique categories in sorted order: ['Cat', 'Dog', 'Mouse'].
- range(len(encoder.classes_)) generates the integers: [0, 1, 2].
- zip pairs the categories with their respective integers.
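Output (with NumPy 2.x the keys may instead print as np.str_('Cat') and so on, because classes_ is a NumPy array):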
Mapping: {'Cat': 0, 'Dog': 1, 'Mouse': 2}
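An alternative sketch builds the same mapping with enumerate() and converts each key to a plain Python string, so the printed dictionary looks the same regardless of the NumPy version. The variable name mapping is chosen here for illustration.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(['Dog', 'Cat', 'Mouse', 'Dog', 'Mouse', 'Cat'])

# classes_ is sorted, and each category's code is its position in that array,
# so enumerate() yields (code, category) pairs directly.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print("Mapping:", mapping)
```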
Step 6: Inverse Transformation
If you want to convert the numbers back to their original categories:
original_data = encoder.inverse_transform(encoded_data)
print("Decoded Data:", original_data)
Explanation:
.inverse_transform() reverses the transformation, mapping integers back to their original categories.
Output:
Decoded Data: ['Dog' 'Cat' 'Mouse' 'Dog' 'Mouse' 'Cat']
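As a quick check, the round trip can be verified directly. This sketch also decodes a short list of codes, as one might do with integer predictions from a model; the codes [2, 0] are invented for illustration.

```python
from sklearn.preprocessing import LabelEncoder

data = ['Dog', 'Cat', 'Mouse', 'Dog', 'Mouse', 'Cat']
encoder = LabelEncoder()
encoded_data = encoder.fit_transform(data)

# inverse_transform() returns a NumPy array; tolist() gives plain Python strings.
decoded = encoder.inverse_transform(encoded_data).tolist()
print("Round trip matches:", decoded == data)

# Decoding also works on any sequence of valid codes.
some_codes_decoded = encoder.inverse_transform([2, 0]).tolist()
print(some_codes_decoded)
```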
Handling Larger Datasets
Consider a dataset with multiple columns, some of which are categorical. Let’s encode these categorical columns.
Sample Dataset
import pandas as pd
# Sample dataset
data = {
    'Animal': ['Dog', 'Cat', 'Mouse', 'Dog', 'Cat', 'Mouse'],
    'Size': ['Large', 'Small', 'Small', 'Medium', 'Large', 'Medium'],
}
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)
Output:
Original DataFrame:
   Animal    Size
0     Dog   Large
1     Cat   Small
2   Mouse   Small
3     Dog  Medium
4     Cat   Large
5   Mouse  Medium
Encoding the Columns
# Initialize LabelEncoder
encoder = LabelEncoder()
# Encode the 'Animal' column
df['Animal_Encoded'] = encoder.fit_transform(df['Animal'])
# Encode the 'Size' column
df['Size_Encoded'] = encoder.fit_transform(df['Size'])
This step:
- Fits the encoder to each column separately.
- Transforms the column values into integers.
Encoded DataFrame:
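   Animal    Size  Animal_Encoded  Size_Encoded
0     Dog   Large               1             0
1     Cat   Small               0             2
2   Mouse   Small               2             2
3     Dog  Medium               1             1
4     Cat   Large               0             0
5   Mouse  Medium               2             1
Note that Size is encoded in alphabetical order (Large → 0, Medium → 1, Small → 2), not by physical size.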
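The code above reuses a single encoder, so after the second fit_transform() call it only remembers the Size categories and can no longer decode the Animal column. One way around this, sketched below, is to keep one fitted encoder per column. The encoders dictionary is a hypothetical helper introduced here, not part of scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'Animal': ['Dog', 'Cat', 'Mouse', 'Dog', 'Cat', 'Mouse'],
    'Size': ['Large', 'Small', 'Small', 'Medium', 'Large', 'Medium'],
})

# Keep one fitted encoder per column (hypothetical helper dict).
encoders = {}
for column in ['Animal', 'Size']:
    encoders[column] = LabelEncoder()
    df[column + '_Encoded'] = encoders[column].fit_transform(df[column])

# Each column can now be decoded with its own encoder.
animals_back = encoders['Animal'].inverse_transform(df['Animal_Encoded']).tolist()
sizes_back = encoders['Size'].inverse_transform(df['Size_Encoded']).tolist()
print(animals_back)
print(sizes_back)
```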
Limitations of Label Encoding
Although Label Encoding is simple and efficient, it has some limitations:
1. Ordinal Relationships:
- Encoded integers can suggest an order or rank where there is none.
- For example, the encoded values [0, 1, 2] could suggest that Cat < Dog < Mouse, which can mislead models such as Linear Regression that treat the input as a magnitude.
2. Effect on Distance-Based Models:
- Models such as k-Nearest Neighbors (k-NN) or Support Vector Machines (SVM) are based on distances. With the codes 0, 1, 2, Cat appears twice as far from Mouse as from Dog, although the data contains no such relationship.
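The distance effect can be seen with simple arithmetic on the codes. This is a minimal sketch using the three animal categories from the earlier example.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(['Cat', 'Dog', 'Mouse'])
cat, dog, mouse = encoder.transform(['Cat', 'Dog', 'Mouse']).tolist()

# Absolute differences between the integer codes.
cat_dog = abs(cat - dog)
cat_mouse = abs(cat - mouse)
print("Cat-Dog distance:", cat_dog)
print("Cat-Mouse distance:", cat_mouse)
```

The gap between the two distances comes purely from alphabetical order, not from anything in the data.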
When to Use Label Encoding
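- Use Label Encoding for ordinal data, i.e., data with a meaningful order, such as:
['Low', 'Medium', 'High'] → [0, 1, 2]
Note that LabelEncoder assigns codes in alphabetical order (High → 0, Low → 1, Medium → 2), so the intended order has to be supplied explicitly.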
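- It generally works well with tree-based models (e.g., Decision Trees, Random Forests, XGBoost) because they split data on thresholds and are insensitive to the scale of the values, although the arbitrary integer order still determines which categories a single split can group together.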
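For the ordinal case, the sketch below contrasts what LabelEncoder produces with an explicit mapping that preserves the intended rank. The levels list and the order dictionary are invented for illustration.

```python
from sklearn.preprocessing import LabelEncoder

levels = ['Low', 'Medium', 'High', 'Medium', 'Low']

# LabelEncoder sorts categories alphabetically: High=0, Low=1, Medium=2.
alphabetical = LabelEncoder().fit_transform(levels).tolist()
print(alphabetical)

# An explicit mapping (hypothetical name `order`) preserves the intended rank.
order = {'Low': 0, 'Medium': 1, 'High': 2}
ranked = [order[level] for level in levels]
print(ranked)
```

scikit-learn's OrdinalEncoder also accepts a categories argument for specifying such an order, which may be more convenient for feature columns.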