Pair Plot in Python

A Pair Plot, also known as a scatterplot matrix, visualizes the pairwise relationships between the numerical variables in a dataset. It is a staple of Exploratory Data Analysis (EDA) because it reveals patterns, correlations, and outliers at a glance.

Python makes Pair Plots easy to generate with the Seaborn library, which is built on top of Matplotlib. This guide covers when to use Pair Plots, how to create and customize them, and how to interpret the results.

1. What is a Pair Plot?

A Pair Plot is a matrix of plots that represents the pairwise relationships between numerical variables. Every off-diagonal cell is a scatter plot of one variable against another. The diagonal cells, where a variable would be paired with itself, instead show a histogram or a KDE (Kernel Density Estimation) plot of that variable's distribution.

If there are n numerical columns in a dataset, then the pair plot will create an n × n grid where:

  • The diagonal elements contain histograms or density plots for each variable.
  • The off-diagonal elements hold scatter plots between two different variables.

Example

Let’s take a dataset with three numerical columns: A, B, and C. A Pair Plot for that dataset will consist of a 3×3 grid such that:

  • The A vs. A, B vs. B, and C vs. C cells (the diagonal) will be histograms or KDE plots of A, B, and C, respectively.
  • The remaining six cells will be scatter plots of the pairs A–B, A–C, and B–C. Each pair appears twice, once above and once below the diagonal, with the axes swapped.

2. Why Use a Pair Plot?

Pair Plots are useful in various data analysis tasks, including:

2.1. Exploratory Data Analysis (EDA)

Pair Plots give a quick overview of how the numerical features of a dataset relate to one another. They are especially helpful for checking:

  • Whether the variables are positively or negatively correlated.
  • Patterns, clusters, or trends in the data.

2.2. Checking Multicollinearity

If two variables are highly correlated (that is, nearly linearly related), one of them may be redundant. Pair Plots make such relationships visible, so we can decide whether to remove or transform some variables before fitting a machine learning model.
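
A visual check like this can be backed up with numbers. The sketch below is one possible approach, not part of pairplot() itself: it uses the pandas DataFrame.corr() method on synthetic data and lists every pair whose absolute Pearson correlation exceeds a threshold. The column names and the 0.9 cutoff are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "x_scaled": 3 * x + rng.normal(scale=0.1, size=200),  # nearly a linear copy of x
    "noise": rng.normal(size=200),                         # unrelated to x
})

THRESHOLD = 0.9  # hypothetical cutoff; choose one that suits your problem
corr = df.corr().abs()

# Keep only the upper triangle so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = [
    (row, col, round(float(upper.loc[row, col]), 3))
    for row in upper.index
    for col in upper.columns
    if pd.notna(upper.loc[row, col]) and upper.loc[row, col] > THRESHOLD
]
print(high_pairs)
```

Pairs reported here are the ones that would show up as tight, nearly straight bands in the corresponding scatter plots.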

2.3. Detecting Outliers

Pair Plots help in detecting unusual data points which do not align with the overall trend of the data. The outliers may reflect errors in the data collection or interesting anomalies.

3. How to Create a Pair Plot in Python?

Python provides an easy way to create Pair Plots using the seaborn.pairplot() function.

3.1. Installing Seaborn

If you haven’t installed Seaborn yet, you can do so using:

pip install seaborn

3.2. Importing Required Libraries

To generate a Pair Plot, we need seaborn and matplotlib:

import seaborn as sns
import matplotlib.pyplot as plt

3.3. Loading a Sample Dataset

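Seaborn provides sample datasets for practice. One of the most commonly used is the Iris dataset, which contains sepal and petal measurements for 150 flowers from three iris species. Note that load_dataset() downloads the data from Seaborn's online repository, so it needs an internet connection the first time it runs.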

# Load the Iris dataset
iris = sns.load_dataset("iris")
print(iris.head())  # View first 5 rows

3.4. Creating a Basic Pair Plot

Now, let’s generate a Pair Plot using the pairplot() function.

# Create a Pair Plot
sns.pairplot(iris)
plt.show()

Explanation

  • sns.pairplot(iris): Generates a scatter plot matrix for all numerical columns in the dataset.
  • plt.show(): Displays the plot.

4. Customizing a Pair Plot

The pairplot() function offers a number of parameters for tailoring the visualization.

4.1. Coloring Data Points by Category (hue Parameter)

If the dataset contains a categorical column, we can color the scatter plots based on categories. In the Iris dataset, the "species" column categorizes the data into three flower species.

sns.pairplot(iris, hue="species")
plt.show()

Explanation

  • The hue="species" argument gives each species its own color and adds a legend.
  • This shows how well the categories separate from one another in each pair of variables.

4.2. Changing the Diagonal Plot Type (diag_kind Parameter)

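By default (diag_kind="auto"), the diagonal shows histograms when hue is not set and KDE curves when it is. Passing diag_kind explicitly ("hist" or "kde") fixes the choice. Kernel Density Estimation (KDE) plots provide a smoother view of distributions than histograms.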

sns.pairplot(iris, hue="species", diag_kind="kde")
plt.show()

Explanation

  • diag_kind="kde" draws density curves on the diagonal, one per species because hue is set.
  • Density curves give a smoother view of each variable's distribution than histogram bars.

4.3. Changing Marker Styles (markers Parameter)

If the scatter points overlap too much, different markers can be assigned to each category:

sns.pairplot(iris, hue="species", markers=["o", "s", "D"])
plt.show()

Explanation

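  • "o", "s", and "D" are Matplotlib marker codes for a circle, a square, and a diamond.
  • The list must contain one marker per category of the hue variable (three here, one for each species).
  • This improves clarity when categories overlap, and keeps them distinguishable in grayscale.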

4.4. Adjusting Plot Size (height Parameter)

We can control the size of the plots using the height parameter:

sns.pairplot(iris, hue="species", height=3)
plt.show()

Explanation

  • height=3 sets each subplot’s height to 3 inches (the default is 2.5).

4.5. Selecting Specific Variables (vars Parameter)

If the dataset has many numerical variables, we can restrict the Pair Plot to specific columns.

sns.pairplot(iris, hue="species", vars=["sepal_length", "sepal_width"])
plt.show()

Explanation

  • Only sepal_length and sepal_width are included in the plot.
  • This is useful when we want to focus on specific relationships.
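
A related option is to choose the rows and columns separately with the x_vars and y_vars parameters of pairplot(), which produces a non-square grid. The sketch below assumes a synthetic dataset with three made-up feature columns and a hypothetical "target" column, and plots each feature against the target in a single row.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(4)
df = pd.DataFrame(rng.normal(size=(60, 3)), columns=["a", "b", "c"])
df["target"] = 2 * df["a"] - df["b"] + rng.normal(scale=0.5, size=60)  # made-up relationship

# Columns of the grid come from x_vars, rows from y_vars
grid = sns.pairplot(df, x_vars=["a", "b", "c"], y_vars=["target"])
print(grid.axes.shape)  # (1, 3): one row, three columns
```

Because no variable is paired with itself in this layout, every cell is a scatter plot and there are no diagonal distribution plots.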

5. How to Read a Pair Plot?

5.1. Reading Correlations

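  • Strong Positive Correlation: If the points cluster tightly around a rising line, one variable rises as the other rises. The tightness of the band, not the steepness of the slope, indicates the strength.
  • Strong Negative Correlation: If the points cluster tightly around a falling line, one variable falls as the other rises.
  • No Linear Correlation: If the points form a shapeless cloud, the variables are not linearly related. A curved pattern can still indicate a non-linear relationship.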
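
Correlation strength depends on how tightly the points follow a line, not on how steep that line is. The sketch below illustrates this with NumPy on synthetic data; the variable names and the pearson() helper are made up for the example. Two series with very different slopes but the same relative scatter produce the same Pearson coefficient.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=500)
noise = rng.normal(scale=0.2, size=500)

shallow = 0.1 * (x + noise)       # gentle slope, tight around a line
steep = 10.0 * (x + noise)        # steep slope, same relative scatter
falling = -1.0 * (x + noise)      # downward trend
unrelated = rng.normal(size=500)  # no relationship with x


def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


for name, y in [("shallow", shallow), ("steep", steep),
                ("falling", falling), ("unrelated", unrelated)]:
    print(f"{name:>9}: r = {pearson(x, y):+.3f}")
```

The shallow and steep series print the same coefficient, the falling series prints its negative, and the unrelated series lands near zero.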

5.2. Reading for Clusters

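  • Points that bunch together in separate regions suggest that the data contains distinct groups.
  • If those groups line up with a known category (for example, when colored with hue), the variables involved are likely to be useful features for a classification task.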

5.3. Identifying Outliers

  • Data points that lie far from the main cloud of points may be outliers.
  • An outlier may be a data-entry or measurement error, or it may be a genuine and informative anomaly, so investigate before removing it.
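
Points that look suspicious in the plot can be followed up numerically. The sketch below applies the common 1.5 × IQR rule with pandas to a synthetic column that contains one planted extreme value. This rule is a swapped-in heuristic for illustration, not something pairplot() computes, and the 1.5 multiplier is a convention rather than a fixed requirement.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
# 200 ordinary values around 50, plus one planted extreme value of 120
values = pd.Series(np.append(rng.normal(loc=50, scale=5, size=200), [120.0]))

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag everything outside the [lower, upper] fences
outliers = values[(values < lower) | (values > upper)]
print(outliers)
```

The rule may also flag a few ordinary values from the tails of the distribution, which is why flagged points deserve a closer look rather than automatic removal.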

6. Conclusion

Pair Plots are one of the most useful tools for data visualization and exploratory data analysis. With Seaborn’s pairplot(), we can quickly spot associations and patterns among the numerical variables of a dataset.

Important Takeaways

  • Easy to create with sns.pairplot().
  • Customizable with the hue, diag_kind, markers, height, and vars parameters.
  • Useful for EDA, correlation analysis, and outlier detection.