
    Python SimpleImputer class

    The SimpleImputer class is part of the sklearn.impute module in the scikit-learn library. It handles missing data by imputation, that is, by replacing each missing value with a value chosen according to a specified strategy. Here’s a detailed explanation:

    What is SimpleImputer?

    SimpleImputer is a preprocessing transformer that fills in missing values in your dataset. Missing values may be marked by NaN, None, or another placeholder that you name through the missing_values parameter. Imputation matters because most scikit-learn estimators raise an error when their input contains missing values.
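
    To make the motivation concrete, the sketch below tries to fit an estimator on data that still contains NaN. LinearRegression is only an illustrative choice and the values are made up; the point is that the fit is rejected with a ValueError.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up dataset with one missing entry
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, 5.0]])
y = np.array([1.0, 2.0, 3.0])

try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError as exc:
    # scikit-learn validates the input and rejects NaN for this estimator
    fit_failed = True
    print("Fit failed:", exc)
```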

    Key Features

    • Replaces missing values with the column mean, median, or most frequent value, or with a constant that you supply.
    • Supports both numerical and categorical data (categorical columns require the 'most_frequent' or 'constant' strategy).
    • Easy to integrate into a machine learning pipeline.

    Basic Syntax

    import numpy as np
    from sklearn.impute import SimpleImputer
    
    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

    Parameters:

    1. missing_values:
      • Placeholder for missing values in the dataset. Default is np.nan.
      • Other markers, such as 0 or None, can be specified instead.
    2. strategy: Specifies how missing values are imputed:
      • 'mean': Missing values are replaced with the mean of the column (only for numerical data).
      • 'median': Missing values are replaced with the median of the column (only for numerical data).
      • 'most_frequent': Replaces with the most frequent value in the column (for both numerical and categorical data).
      • 'constant': Missing values are replaced with the value given by fill_value. If fill_value is not set, its default is used.
    3. fill_value:
      • Used when strategy='constant'.
      • Default is 0 for numerical data and "missing_value" for strings or categorical data.
    4. add_indicator:
      • If True, appends binary indicator columns (one for each feature that had missing values during fit), marking where values were missing.
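
    As a minimal sketch of how missing_values and strategy interact, the example below treats -1 as the missing marker and imputes with the median. The data and the choice of -1 as a sentinel are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data in which -1 is the sentinel for "missing"
X = np.array([[1.0, -1.0],
              [2.0, 4.0],
              [-1.0, 6.0],
              [9.0, 8.0]])

imputer = SimpleImputer(missing_values=-1, strategy="median")
X_imputed = imputer.fit_transform(X)

# statistics_ holds the per-column fill values learned during fit
print(imputer.statistics_)
print(X_imputed)
```

    The learned medians are 2.0 for the first column (from 1, 2, 9) and 6.0 for the second (from 4, 6, 8), and each -1 is replaced by the median of its own column.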

    Example Usage

    1. Imputing Missing Values with the Mean

    import numpy as np
    from sklearn.impute import SimpleImputer
    
    # Sample dataset with missing values
    data = [[1, 2, np.nan],
            [3, np.nan, 6],
            [7, 8, 9]]
    
    # Create SimpleImputer instance
    imputer = SimpleImputer(strategy='mean')
    
    # Fit and transform the data
    imputed_data = imputer.fit_transform(data)
    
    print("Imputed Data:\n", imputed_data)

    Output:

    Imputed Data:
     [[1. 2. 7.5]
      [3. 5. 6. ]
      [7. 8. 9. ]]

    2. Using the Most Frequent Strategy

    data = [[1, 2, np.nan],
            [1, np.nan, 6],
            [7, 2, 9]]
    
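    # When several values tie for most frequent (6 and 9 in the last column),
    # the smallest one is used
    imputer = SimpleImputer(strategy='most_frequent')
    imputed_data = imputer.fit_transform(data)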
    
    print("Imputed Data:\n", imputed_data)

    Output:

    Imputed Data:
     [[1. 2. 6.]
      [1. 2. 6.]
      [7. 2. 9.]]

    3. Using a Constant Value

    data = [[1, 2, np.nan],
            [3, np.nan, 6],
            [7, 8, 9]]
    
    imputer = SimpleImputer(strategy='constant', fill_value=-1)
    imputed_data = imputer.fit_transform(data)
    
    print("Imputed Data:\n", imputed_data)

    Output:

    Imputed Data:
     [[ 1.  2. -1.]
      [ 3. -1.  6.]
      [ 7.  8.  9.]]
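
    The 'most_frequent' and 'constant' strategies also apply to non-numeric columns. The following sketch uses made-up categorical data; it assumes an object-dtype array so that np.nan stays a real missing marker rather than being converted to the string "nan", and "unknown" is an arbitrary fill value.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative categorical data; dtype=object keeps np.nan as a missing marker
X = np.array([["red", "S"],
              [np.nan, "M"],
              ["red", "M"],
              ["blue", np.nan]], dtype=object)

# Fill each column with its most common category
mode_imputer = SimpleImputer(strategy="most_frequent")
X_mode = mode_imputer.fit_transform(X)

# Fill every missing entry with one fixed label
const_imputer = SimpleImputer(strategy="constant", fill_value="unknown")
X_const = const_imputer.fit_transform(X)

print(X_mode)
print(X_const)
```

    Here "red" and "M" are the most frequent categories in their columns, so they fill the gaps under 'most_frequent', while 'constant' writes "unknown" into both gaps.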

    Using SimpleImputer with Pandas

    SimpleImputer accepts Pandas DataFrames as well as numpy arrays. By default, however, it returns a numpy array after transformation, so the column labels are lost. Wrap the result in a DataFrame when you need them back.

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer
    
    # Create a DataFrame
    df = pd.DataFrame({
        'A': [1, 2, np.nan],
        'B': [4, np.nan, 6],
        'C': [np.nan, 8, 9]
    })
    
    # Initialize imputer
    imputer = SimpleImputer(strategy='mean')
    
    # Fit and transform, then restore the original column labels and index
    imputed_df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)
    
    print("Imputed DataFrame:\n", imputed_df)

    Output:

    Imputed DataFrame:
          A    B    C
    0  1.0  4.0  8.5
    1  2.0  5.0  8.0
    2  1.5  6.0  9.0

    Advanced Features

    Add Binary Indicator for Missing Values

    If you want to track which values were imputed, use add_indicator=True.

    # `data` is the list defined in the earlier examples
    imputer = SimpleImputer(strategy='mean', add_indicator=True)
    imputed_data = imputer.fit_transform(data)

    This appends one indicator column for each feature that contained missing values during fit, marking the rows in which a value was imputed.
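
    A small worked example may help show the shape of the result. It reuses the mean-imputation data from earlier; only the second and third columns contain missing values, so two indicator columns are appended.

```python
import numpy as np
from sklearn.impute import SimpleImputer

data = [[1, 2, np.nan],
        [3, np.nan, 6],
        [7, 8, 9]]

imputer = SimpleImputer(strategy="mean", add_indicator=True)
imputed = imputer.fit_transform(data)

# 3 original columns + 2 indicator columns (for the 2nd and 3rd features)
print(imputed.shape)
print(imputed)
```

    The first three columns hold the imputed values. The last two are 1.0 where the second or third feature was originally missing and 0.0 elsewhere.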

    Integration in Pipelines

    SimpleImputer is often used as part of a machine learning pipeline:

    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.ensemble import RandomForestClassifier
    
    pipeline = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('model', RandomForestClassifier())
    ])

    This ensures that imputation runs automatically before the model is fitted, and that the same fitted imputer is applied whenever the pipeline predicts on new data.
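
    The sketch below runs such a pipeline end to end. The training data, the labels, and the n_estimators and random_state settings are all made up for illustration; the point is only that fit and predict both accept inputs that still contain NaN.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Made-up training data with missing entries, plus binary labels
X_train = np.array([[1.0, np.nan],
                    [2.0, 1.0],
                    [np.nan, 0.5],
                    [8.0, 9.0],
                    [9.0, np.nan],
                    [np.nan, 8.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(X_train, y_train)

# New rows with gaps are imputed with the training means before prediction
X_new = np.array([[np.nan, 7.5],
                  [1.5, np.nan]])
predictions = pipeline.predict(X_new)

print(pipeline.named_steps["imputer"].statistics_)
print(predictions)
```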

    Key Points to Remember

    1. SimpleImputer works on both numerical and categorical data.
    2. The choice of the strategy depends on the type of data and its distribution.
    3. Use add_indicator=True to track imputed values.
    4. Always fit the imputer on the training set only, then use it to transform both the training and testing datasets, so that no information from the test data leaks into the fill values.
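
    The last point can be sketched as follows, with made-up train and test arrays. The imputer learns its column means from the training rows only, and those same means fill the gaps in the test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative split; the values are invented for the example
X_train = np.array([[1.0, 10.0],
                    [3.0, np.nan],
                    [5.0, 30.0]])
X_test = np.array([[np.nan, 20.0],
                   [7.0, np.nan]])

imputer = SimpleImputer(strategy="mean")
X_train_imputed = imputer.fit_transform(X_train)  # learn means from train only
X_test_imputed = imputer.transform(X_test)        # reuse the training means

print(imputer.statistics_)
print(X_test_imputed)
```

    The training means are 3.0 and 20.0, so the missing test entries become 3.0 and 20.0 regardless of what the test set itself contains.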