Python SimpleImputer class

The SimpleImputer class is part of the sklearn.impute module in the scikit-learn library. It handles missing data by imputation: each missing value is replaced, column by column, with a value chosen by a specified strategy. Here’s a detailed explanation:

What is SimpleImputer?

SimpleImputer is a preprocessing transformer that fills in missing values in a dataset. Missing values can be marked by NaN, None, or another placeholder that you specify. Imputing them prepares the data for machine learning models, most of which cannot handle missing values.

Key Features

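  • Replaces missing values with the column mean, median, or most frequent value, or with a constant. Recent scikit-learn releases also accept a callable as a custom strategy.
  • Supports both numerical and categorical data (categorical columns require the 'most_frequent' or 'constant' strategy).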
  • Easy to integrate into a machine learning pipeline.
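
To illustrate the categorical support listed above, here is a minimal sketch. The color labels and variable names are made up for illustration, and the column is assumed to be stored as an object-dtype array. It applies 'most_frequent' and then 'constant' with the default fill value.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical column stored as an object-dtype array
colors = np.array([["red"], [np.nan], ["blue"], ["red"]], dtype=object)

# 'most_frequent' fills the gap with the most common label
mode_imputer = SimpleImputer(strategy="most_frequent")
filled_with_mode = mode_imputer.fit_transform(colors)

# 'constant' without an explicit fill_value falls back to "missing_value"
constant_imputer = SimpleImputer(strategy="constant")
filled_with_constant = constant_imputer.fit_transform(colors)

print(filled_with_mode.ravel())
print(filled_with_constant.ravel())
```

The 'mean' and 'median' strategies would raise an error on this column, since they are defined only for numerical data.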

Basic Syntax

import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

Parameters:

  1. missing_values:
    • Placeholder for missing values in the dataset. Default is np.nan.
    • Can also be set to another marker, such as 0 or None.
  2. strategy: Specifies how missing values are imputed:
    • 'mean': Missing values are replaced with the mean of the column (only for numerical data).
    • 'median': Missing values are replaced with the median of the column (only for numerical data).
    • 'most_frequent': Replaces with the most frequent value in the column (for both numerical and categorical data).
    • 'constant': Missing values are replaced with a constant taken from the fill_value parameter.
  3. fill_value:
    • Used when strategy='constant'.
    • Default is 0 for numerical data and "missing_value" for strings or categorical data.
  4. add_indicator:
    • If True, appends binary indicator columns to the output, marking where values were missing.
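
As a rough sketch of how missing_values and strategy interact, the example below assumes a made-up array of readings in which -1 (rather than NaN) marks a missing measurement. After fitting, the learned per-column fill values can be inspected through the statistics_ attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical readings where -1 marks a missing measurement
readings = np.array([[1.0, -1.0],
                     [3.0, 4.0],
                     [5.0, 6.0],
                     [-1.0, 8.0]])

# Treat -1 as the missing marker and fill with the column median
imputer = SimpleImputer(missing_values=-1, strategy="median")
filled = imputer.fit_transform(readings)

# statistics_ holds the per-column fill values learned during fit
print(imputer.statistics_)
print(filled)
```

Choosing a sentinel such as -1 or 0 is only safe when that value cannot occur as a genuine observation in the column.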

Example Usage

1. Imputing Missing Values with the Mean

import numpy as np
from sklearn.impute import SimpleImputer

# Sample dataset with missing values
data = [[1, 2, np.nan],
        [3, np.nan, 6],
        [7, 8, 9]]

# Create SimpleImputer instance
imputer = SimpleImputer(strategy='mean')

# Fit and transform the data
imputed_data = imputer.fit_transform(data)

print("Imputed Data:\n", imputed_data)

Output:

Imputed Data:
 [[1.  2.  7.5]
  [3.  5.  6. ]
  [7.  8.  9. ]]

2. Using the Most Frequent Strategy

data = [[1, 2, np.nan],
        [1, np.nan, 6],
        [7, 2, 9]]

imputer = SimpleImputer(strategy='most_frequent')
imputed_data = imputer.fit_transform(data)

print("Imputed Data:\n", imputed_data)

Output:

Imputed Data:
 [[1. 2. 6.]
  [1. 2. 6.]
  [7. 2. 9.]]

In the third column, 6 and 9 each appear once. When several values tie for most frequent, SimpleImputer uses the smallest one, which is why the missing entry becomes 6.

3. Using a Constant Value

data = [[1, 2, np.nan],
        [3, np.nan, 6],
        [7, 8, 9]]

imputer = SimpleImputer(strategy='constant', fill_value=-1)
imputed_data = imputer.fit_transform(data)

print("Imputed Data:\n", imputed_data)

Output:

Imputed Data:
 [[ 1.  2. -1.]
  [ 3. -1.  6.]
  [ 7.  8.  9.]]

Using SimpleImputer with Pandas

SimpleImputer accepts Pandas DataFrames as well as NumPy arrays. By default, however, the transformation returns a NumPy array, so wrap the result in a DataFrame if you need the column labels back.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [4, np.nan, 6],
    'C': [np.nan, 8, 9]
})

# Initialize imputer
imputer = SimpleImputer(strategy='mean')

# Fit and transform, then restore the original column labels and index
imputed_df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)

print("Imputed DataFrame:\n", imputed_df)

Output:

Imputed DataFrame:
      A    B    C
0  1.0  4.0  8.5
1  2.0  5.0  8.0
2  1.5  6.0  9.0

Advanced Features

Add Binary Indicator for Missing Values

If you want to track which values were imputed, use add_indicator=True.

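# Reuses the list `data` from the constant-value example above
imputer = SimpleImputer(strategy='mean', add_indicator=True)
imputed_data = imputer.fit_transform(data)

This appends one binary indicator column for each feature that contained missing values when the imputer was fitted.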
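
To make the shape of the result concrete, here is a small self-contained sketch using the same 3x3 sample data as the earlier examples. It prints the output shape, the indicator columns, and the indices of the features that received an indicator.

```python
import numpy as np
from sklearn.impute import SimpleImputer

data = [[1, 2, np.nan],
        [3, np.nan, 6],
        [7, 8, 9]]

imputer = SimpleImputer(strategy="mean", add_indicator=True)
imputed_data = imputer.fit_transform(data)

# Three imputed columns followed by one indicator per column that had gaps
print(imputed_data.shape)
# Indicator columns only: 1 where a value was originally missing
print(imputed_data[:, 3:])
# Indices of the input columns that received an indicator
print(imputer.indicator_.features_)
```

The first column has no missing values, so it gets no indicator. The output therefore has five columns rather than six.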

Integration in Pipelines

SimpleImputer is often used as part of a machine learning pipeline:

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('model', RandomForestClassifier())
])

Because the imputer is a pipeline step, it is fitted on whatever data is passed to pipeline.fit, and its learned statistics are reused automatically at prediction time.
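
The following sketch fits such a pipeline end to end. The toy arrays, the small n_estimators value, and the fixed random_state are assumptions chosen only to keep the example short and repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Hypothetical toy data: two features, binary labels, some gaps
X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [8.0, 9.0]])
y_train = np.array([0, 0, 1, 1])
X_new = np.array([[np.nan, 2.5], [7.5, np.nan]])

pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", RandomForestClassifier(n_estimators=10, random_state=0)),
])

# fit() learns the column means and trains the model on the imputed data
pipeline.fit(X_train, y_train)

# predict() imputes X_new with the stored means before calling the model
predictions = pipeline.predict(X_new)
print(predictions)

# The fitted imputer stays accessible through named_steps
print(pipeline.named_steps["imputer"].statistics_)
```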

Key Points to Remember

  1. SimpleImputer works on both numerical and categorical data, but 'mean' and 'median' apply only to numerical columns.
  2. The choice of the strategy depends on the type of data and its distribution.
  3. Use add_indicator=True to track imputed values.
  4. Always fit the imputer on the training set only, then use it to transform both the training and testing datasets, so that no information from the test set leaks into the fill values.
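
The last point can be sketched as follows, assuming two small made-up arrays standing in for a training and a test split. The imputer learns its means from the training data and reuses them unchanged on the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical train/test split with gaps in both parts
X_train = np.array([[1.0, 10.0], [3.0, np.nan], [np.nan, 30.0]])
X_test = np.array([[np.nan, np.nan], [5.0, 50.0]])

imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)  # learn means from train only
X_test_filled = imputer.transform(X_test)        # reuse the train means

print(imputer.statistics_)
print(X_test_filled)
```

Calling fit_transform on the test set instead would recompute the means from test data, which is the leakage the rule is meant to prevent.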