Python Data Analytics

Python is widely used for data analytics because of its readable syntax and its mature ecosystem of libraries. This guide walks through data analytics in Python step by step, with code examples and expected outputs.

1. What is Data Analytics?

Data analytics is the process of examining, transforming, and interpreting data to extract meaningful insights. It helps businesses and researchers make informed decisions.

Types of Data Analytics

1. Descriptive Analytics – Summarizes past data to understand what happened.

  • Example: Sales reports, web traffic analysis.

2. Diagnostic Analytics – Investigates why something happened.

  • Example: Analyzing the drop in sales for a particular product.

3. Predictive Analytics – Uses statistical models to predict future trends.

  • Example: Forecasting stock prices.

4. Prescriptive Analytics – Provides recommendations based on data insights.

  • Example: Suggesting personalized discounts to customers.

2. Setting Up Python for Data Analytics

Installing Required Libraries

Before starting, install the necessary libraries:

pip install numpy pandas matplotlib seaborn scikit-learn

Key Libraries:

  • NumPy – Numerical computations and efficient arrays.
  • Pandas – Data manipulation and analysis.
  • Matplotlib – Data visualization.
  • Seaborn – Advanced statistical plots.
  • Scikit-learn – Machine learning for predictive analytics.
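To confirm the installation worked, one option is to ask Python which versions it can see. The sketch below uses only the standard library; the helper name `installed_versions` is an illustrative assumption, not part of any of the libraries listed above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as they appear in the pip command above
PACKAGES = ["numpy", "pandas", "matplotlib", "seaborn", "scikit-learn"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(PACKAGES)
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```

Any package reported as "not installed" can be added by rerunning the pip command for that package.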

3. Working with NumPy for Numerical Computation

NumPy is used for handling numerical data efficiently.

Creating NumPy Arrays

import numpy as np

# Creating a 1D array
arr = np.array([1, 2, 3, 4, 5])
print(arr)

# Creating a 2D array (Matrix)
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(matrix)

Output:

[1 2 3 4 5]
[[1 2 3]
 [4 5 6]]

Basic NumPy Operations

print(arr.mean())  # Mean
print(arr.sum())   # Sum
print(arr.std())   # Standard deviation
print(np.sqrt(arr))  # Square root

Output:

3.0
15
1.4142135623730951
[1.         1.41421356 1.73205081 2.         2.23606798]
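One detail worth knowing about the standard deviation above: `arr.std()` computes the population standard deviation (dividing by N) by default, while Pandas defaults to the sample standard deviation (dividing by N - 1). The sketch below shows both through the `ddof` parameter, along with element-wise arithmetic on the same array. The variable names are illustrative.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

population_std = arr.std()      # ddof=0 (default): divide by N
sample_std = arr.std(ddof=1)    # ddof=1: divide by N - 1

# Element-wise (vectorized) arithmetic, no explicit loop needed
doubled = arr * 2
centered = arr - arr.mean()

print(population_std)  # about 1.414
print(sample_std)      # about 1.581
print(doubled)
print(centered)
```

For small samples the two results differ noticeably, so it helps to state which one a report uses.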

4. Pandas for Data Manipulation

Pandas provides flexible data structures for working with datasets.

Creating a Pandas DataFrame

import pandas as pd

# Creating a simple DataFrame
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000]
}

df = pd.DataFrame(data)
print(df)

Output:

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
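Once a DataFrame exists, the most common next step is selecting parts of it. The following is a minimal sketch of column selection and row filtering on the same three-row data; the variable names are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000],
})

ages = df["Age"]                    # single column -> Series
subset = df[["Name", "Salary"]]     # list of columns -> DataFrame
over_28 = df[df["Age"] > 28]        # boolean mask keeps matching rows

# Look up one value: Salary for the row where Name is "Bob"
bob_salary = df.loc[df["Name"] == "Bob", "Salary"].iloc[0]

print(over_28)
print(bob_salary)
```

Boolean masks like `df["Age"] > 28` can be combined with `&` and `|`, with each condition wrapped in parentheses.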

Reading Data from CSV

df = pd.read_csv("data.csv")
print(df.head())  # Display first 5 rows

Output (Example):

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
3    David   40   80000
4      Eve   45   90000
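To try `read_csv` without a file on disk, the CSV text can be supplied in memory. The sketch below assumes `data.csv` has the three columns shown in the example output; the contents are hypothetical stand-ins.

```python
import io

import pandas as pd

# Hypothetical contents standing in for data.csv
csv_text = """Name,Age,Salary
Alice,25,50000
Bob,30,60000
Charlie,35,70000
David,40,80000
Eve,45,90000
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (rows, columns)
print(df.dtypes)   # inferred type of each column
print(df.head(3))  # first 3 rows instead of the default 5
```

Checking `shape` and `dtypes` right after loading is a quick way to catch columns that were parsed as text instead of numbers.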

Data Cleaning and Preprocessing

# Handle missing values with ONE of the following approaches
df = df.dropna()     # Option 1: remove rows that contain missing values
# df = df.fillna(0)  # Option 2: replace missing values with 0 instead

df["Salary"] = df["Salary"].astype(float)  # Convert column type
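Dropping and filling are alternatives: once rows with missing values have been dropped, there is nothing left to fill. The sketch below compares the two on a small made-up DataFrame with gaps. Filling Age with the median is one illustrative choice, not a general recommendation.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing Age and one missing Salary
raw = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, np.nan, 35],
    "Salary": [50000, 60000, np.nan],
})

print(raw.isna().sum())  # number of missing values per column

# Option 1: keep only rows with no missing values
dropped = raw.dropna()

# Option 2: fill each column with its own replacement value
filled = raw.fillna({"Age": raw["Age"].median(), "Salary": 0})
filled["Salary"] = filled["Salary"].astype(float)

print(dropped)
print(filled)
```

Here dropping keeps one row out of three, while filling keeps all three at the cost of introducing estimated values.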

5. Data Visualization with Matplotlib and Seaborn

Data visualization helps in understanding trends, patterns, and relationships in the data.

Matplotlib: Basic Plotting

import matplotlib.pyplot as plt

# Line Chart
x = [1, 2, 3, 4, 5]
y = [10, 20, 25, 30, 40]

plt.plot(x, y, marker="o", linestyle="--", color="r")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Basic Line Chart")
plt.show()

Output:

A red dashed line chart with a circular marker at each data point.
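In scripts or on servers without a display, `plt.show()` may not open a window. A possible alternative is to write the chart to an image file. The sketch below draws the same line chart using a non-interactive backend; the output path is a hypothetical location in the system's temporary directory.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render without a display
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [10, 20, 25, 30, 40]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, y, marker="o", linestyle="--", color="r")
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")
ax.set_title("Basic Line Chart")

# Hypothetical output location; any writable path works
out_path = os.path.join(tempfile.gettempdir(), "line_chart.png")
fig.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close(fig)  # release the figure's memory

print(out_path)
```

The `fig, ax` style used here is equivalent to the `plt.plot` calls above and tends to scale better when a figure holds several plots.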

Seaborn: Advanced Visualization

import seaborn as sns

# Histogram
sns.histplot(df["Salary"], bins=10, kde=True)
plt.show()

# Scatter Plot
sns.scatterplot(x="Age", y="Salary", data=df)
plt.show()

Output:

  1. A histogram showing the distribution of salaries.
  2. A scatter plot showing the relationship between age and salary.

6. Data Analysis with Pandas

Statistical Summary

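print(df.describe())  # Summary statistics for numeric columns
print(df.corr(numeric_only=True))  # Correlation between numeric columns

The numeric_only=True argument skips the text column Name. Without it, pandas 2.0 and later raise an error when a column cannot be converted to numbers.

Output (Example):

             Age        Salary
count   5.000000      5.000000
mean   35.000000  70000.000000
std     7.905694  15811.388301
min    25.000000  50000.000000
25%    30.000000  60000.000000
50%    35.000000  70000.000000
75%    40.000000  80000.000000
max    45.000000  90000.000000
        Age  Salary
Age     1.0     1.0
Salary  1.0     1.0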
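Individual statistics can also be computed directly, which is useful when only one number is needed. The sketch below rebuilds the five-row example data and computes the Pearson correlation between Age and Salary in two ways; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Salary": [50000, 60000, 70000, 80000, 90000],
})

# Pearson correlation between two columns (the pandas default method)
r_pandas = df["Age"].corr(df["Salary"])

# Same quantity from NumPy: take the off-diagonal entry of the 2x2 matrix
r_numpy = np.corrcoef(df["Age"], df["Salary"])[0, 1]

median_salary = df["Salary"].median()
salary_range = df["Salary"].max() - df["Salary"].min()

print(r_pandas, r_numpy)
print(median_salary, salary_range)
```

Both correlations come out at 1 (up to floating-point rounding) because Salary in this data rises by the same amount for every step in Age. Real data rarely correlates this cleanly.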

GroupBy and Aggregation

grouped = df.groupby("Age")["Salary"].mean()
print(grouped)

Output:

Age
25    50000.0
30    60000.0
35    70000.0
40    80000.0
45    90000.0
Name: Salary, dtype: float64
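Grouping becomes more informative when several rows share a group key. The sketch below assumes a hypothetical Department column, which is not part of the original data, and computes several aggregates at once with `agg`.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Department": ["Sales", "Sales", "IT", "IT", "IT"],  # hypothetical column
    "Salary": [50000, 60000, 70000, 80000, 90000],
})

# One row per department, one column per aggregate
summary = df.groupby("Department")["Salary"].agg(["count", "mean", "max"])
print(summary)
```

The result is a DataFrame indexed by Department, so a single value can be read with `summary.loc["IT", "mean"]`.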

7. Machine Learning for Predictive Analytics

Scikit-learn is used for building machine learning models.

Linear Regression Example

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Sample dataset
X = df[["Age"]]
y = df["Salary"]

# Splitting data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predicting salaries
predictions = model.predict(X_test)
print(predictions)

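Output (Example):

[60000.]

With the five-row example data, test_size=0.2 holds out a single row, so one prediction is printed. Which row is held out depends on the shuffle. Because Salary in this data is an exact linear function of Age, the prediction matches that row's actual salary up to floating-point rounding.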
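A fitted model is more useful once it has been inspected and evaluated. The sketch below rebuilds the five-row example data so it runs on its own, then reads the learned slope and intercept, measures the mean absolute error on the test set, and predicts for a new age. The value 50 and the variable names are illustrative assumptions.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Salary": [50000, 60000, 70000, 80000, 90000],
})
X = df[["Age"]]
y = df["Salary"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

slope = model.coef_[0]        # salary change per additional year of age
intercept = model.intercept_  # predicted salary at age 0
mae = mean_absolute_error(y_test, model.predict(X_test))

# Predict for an unseen age; the column name must match the training data
new_people = pd.DataFrame({"Age": [50]})
predicted = model.predict(new_people)[0]

print(slope, intercept, mae, predicted)
```

On this perfectly linear toy data the error is essentially zero. With only five rows the evaluation says little about how a model would perform on real data, where a larger dataset and cross-validation are the usual approach.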

8. Advanced Topics

  • Time Series Analysis – Analyzing trends over time (e.g., stock prices).
  • Natural Language Processing (NLP) – Analyzing text data.
  • Big Data Processing – Using Spark and Dask for large datasets.
  • Deep Learning – Using TensorFlow or PyTorch for AI applications.
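As a small taste of the first topic, time series analysis often starts with Pandas alone. The sketch below uses made-up daily sales figures to compute a 3-day rolling mean and the day-over-day change; the dates and values are invented for illustration.

```python
import pandas as pd

# Made-up daily sales indexed by date
dates = pd.date_range("2024-01-01", periods=6, freq="D")
sales = pd.Series([100, 120, 90, 130, 150, 110], index=dates, name="sales")

rolling_mean = sales.rolling(window=3).mean()  # smooths short-term noise
daily_change = sales.diff()                    # difference from previous day

print(rolling_mean)
print(daily_change)
```

The first two rolling values are NaN because a 3-day window needs three observations before it can produce a mean.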

Conclusion

Python provides an excellent ecosystem for data analytics. By mastering NumPy, Pandas, visualization tools, and machine learning techniques, you can extract valuable insights and make data-driven decisions.