Python Data Analytics
Python is widely used for data analytics because of its readable syntax and its mature libraries for numerical computing, data manipulation, visualization, and machine learning. This guide walks through data analytics in Python step by step, with code examples and expected outputs.
1. What is Data Analytics?
Data analytics is the process of examining, transforming, and interpreting data to extract meaningful insights. It helps businesses and researchers make informed decisions.
Types of Data Analytics
1. Descriptive Analytics – Summarizes past data to understand what happened.
- Example: Sales reports, web traffic analysis.
2. Diagnostic Analytics – Investigates why something happened.
- Example: Analyzing the drop in sales for a particular product.
3. Predictive Analytics – Uses statistical models to predict future trends.
- Example: Forecasting stock prices.
4. Prescriptive Analytics – Provides recommendations based on data insights.
- Example: Suggesting personalized discounts to customers.
2. Setting Up Python for Data Analytics
Installing Required Libraries
Before starting, install the necessary libraries:
pip install numpy pandas matplotlib seaborn scikit-learn
Key Libraries:
- NumPy – Numerical computations and efficient arrays.
- Pandas – Data manipulation and analysis.
- Matplotlib – Data visualization.
- Seaborn – Advanced statistical plots.
- Scikit-learn – Machine learning for predictive analytics.
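After installation, it can help to confirm which versions are actually available in the active environment. The following is a minimal sketch using only the standard library; the helper name installed_versions is made up for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as passed to pip (not necessarily the import names).
PACKAGES = ["numpy", "pandas", "matplotlib", "seaborn", "scikit-learn"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing.

    Hypothetical helper for this guide; it is not part of any library.
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

A package reported as NOT INSTALLED usually means pip ran against a different Python interpreter or virtual environment than the one running the script.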
3. Working with NumPy for Numerical Computation
NumPy provides the ndarray, a fixed-type, n-dimensional array that supports fast element-wise (vectorized) operations without explicit Python loops.
Creating NumPy Arrays
import numpy as np
# Creating a 1D array
arr = np.array([1, 2, 3, 4, 5])
print(arr)
# Creating a 2D array (Matrix)
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(matrix)
Output:
[1 2 3 4 5]
[[1 2 3]
 [4 5 6]]
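Once an array exists, its shape and individual elements are usually the first things to inspect. The short sketch below reuses the same 2D array and shows a few common attribute and indexing patterns.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])

print(matrix.shape)    # (rows, columns) → (2, 3)
print(matrix.ndim)     # number of dimensions → 2
print(matrix[0, 1])    # row 0, column 1 → 2
print(matrix[:, 0])    # every row, first column → [1 4]
print(matrix.T.shape)  # shape after transposing → (3, 2)
```

Indexing is zero-based, and a colon selects every position along that axis.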
Basic NumPy Operations
print(arr.mean()) # Mean
print(arr.sum()) # Sum
print(arr.std()) # Standard deviation
print(np.sqrt(arr)) # Square root
Output:
3.0
15
1.4142135623730951
[1.         1.41421356 1.73205081 2.         2.23606798]
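One detail behind the standard deviation above: NumPy computes the population standard deviation by default (dividing by n), whereas Pandas defaults to the sample standard deviation (dividing by n - 1). The sketch below shows how the ddof argument switches between the two, along with a simple vectorized operation.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

population_std = arr.std()      # ddof=0 (default): divide by n
sample_std = arr.std(ddof=1)    # ddof=1: divide by n - 1

print(population_std)           # square root of 2
print(sample_std)               # square root of 2.5

# Vectorized arithmetic applies to every element without a Python loop.
doubled = arr * 2
print(doubled)
```

Being explicit about ddof avoids confusion when comparing NumPy results with Pandas summaries of the same data.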
4. Pandas for Data Manipulation
Pandas provides two core data structures for tabular data: the Series (a labeled one-dimensional column) and the DataFrame (a table of labeled columns).
Creating a Pandas DataFrame
import pandas as pd
# Creating a simple DataFrame
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000]
}
df = pd.DataFrame(data)
print(df)
Output:
      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
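With a DataFrame in hand, the most common next steps are selecting columns, filtering rows, and deriving new columns. The sketch below illustrates these on the same three-row data; the MonthlySalary column is invented purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000],
})

ages = df["Age"]                   # a single column, returned as a Series
over_28 = df[df["Age"] > 28]       # boolean mask keeps Bob and Charlie
bob_salary = df.loc[df["Name"] == "Bob", "Salary"].iloc[0]

# Derived column (hypothetical, for illustration only).
df["MonthlySalary"] = df["Salary"] / 12

print(over_28)
print(bob_salary)
```

Boolean masks such as df["Age"] > 28 are evaluated for the whole column at once, which is both shorter and faster than looping over rows.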
Reading Data from CSV
df = pd.read_csv("data.csv")
print(df.head()) # Display first 5 rows
Output (Example):
      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
3    David   40   80000
4      Eve   45   90000
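To try the CSV example without a file on disk, the same data can be read from an in-memory string. This is a sketch that assumes data.csv has the three columns shown in the example output; the actual file layout is not given here.

```python
import io

import pandas as pd

# Stand-in for the contents of data.csv (assumed layout, matching the example output).
csv_text = """Name,Age,Salary
Alice,25,50000
Bob,30,60000
Charlie,35,70000
David,40,80000
Eve,45,90000
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (rows, columns) → (5, 3)
print(df.dtypes)   # the column types that Pandas inferred
```

Checking shape and dtypes right after loading is a quick way to catch a wrong delimiter or a numeric column that was read as text.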
Data Cleaning and Preprocessing
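# Choose ONE strategy for missing values: after dropna() there is nothing left for fillna() to fill.
df = df.dropna() # Option A: remove rows that contain missing values
# df = df.fillna(0) # Option B: replace missing values with 0 instead
df["Salary"] = df["Salary"].astype(float) # Convert column type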
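The effect of the cleaning calls is easier to see on a small frame that actually contains missing values. The sketch below uses invented data with two gaps and compares dropping rows against filling values; using the median for Age is one possible choice, not a rule.

```python
import numpy as np
import pandas as pd

# Invented data with two missing values, for illustration only.
raw = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, np.nan, 35],
    "Salary": [50000, 60000, np.nan],
})

print(raw.isna().sum())    # number of missing values per column

dropped = raw.dropna()     # keeps only fully populated rows (Alice)
filled = raw.fillna({"Age": raw["Age"].median(), "Salary": 0})

print(dropped)
print(filled)
```

Dropping rows is simple but discards data; filling keeps every row but introduces values that were never observed, so the choice depends on the analysis.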
5. Data Visualization with Matplotlib and Seaborn
Data visualization helps in understanding trends, patterns, and relationships in the data.
Matplotlib: Basic Plotting
import matplotlib.pyplot as plt
# Line Chart
x = [1, 2, 3, 4, 5]
y = [10, 20, 25, 30, 40]
plt.plot(x, y, marker="o", linestyle="--", color="r")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Basic Line Chart")
plt.show()
Output:
A dashed red line graph with a circular marker at each data point.
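In scripts and reports, a chart often needs to be written to a file instead of shown in a window. The sketch below draws the same line chart through Matplotlib's object-oriented interface and saves it as a PNG; the Agg backend and the file name line_chart.png are choices made for this example.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend: renders to files, opens no window
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [10, 20, 25, 30, 40]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, y, marker="o", linestyle="--", color="r", label="y values")
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")
ax.set_title("Basic Line Chart")
ax.legend()

# Hypothetical output location, chosen for this example.
out_path = os.path.join(tempfile.gettempdir(), "line_chart.png")
fig.savefig(out_path, dpi=150)
plt.close(fig)  # release the figure's memory
print(out_path)
```

Working with explicit fig and ax objects also makes it easier to place several plots in one figure later.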
Seaborn: Advanced Visualization
import seaborn as sns
# Histogram
sns.histplot(df["Salary"], bins=10, kde=True)
plt.show()
# Scatter Plot
sns.scatterplot(x="Age", y="Salary", data=df)
plt.show()
Output:
- A histogram showing the distribution of salaries.
- A scatter plot showing the relationship between age and salary.
6. Data Analysis with Pandas
Statistical Summary
print(df.describe()) # Summary statistics
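print(df.corr(numeric_only=True)) # Correlation between numeric columns (the text column "Name" is skipped)
Output (describe(), for the five-row example data):
             Age        Salary
count   5.000000      5.000000
mean   35.000000  70000.000000
std     7.905694  15811.388301
min    25.000000  50000.000000
25%    30.000000  60000.000000
50%    35.000000  70000.000000
75%    40.000000  80000.000000
max    45.000000  90000.000000
The correlation matrix shows a correlation of 1 between Age and Salary (up to floating-point rounding), because Salary rises by exactly 2000 for every year of Age in this data.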
GroupBy and Aggregation
grouped = df.groupby("Age")["Salary"].mean()
print(grouped)
Output:
Age
25    50000.0
30    60000.0
35    70000.0
40    80000.0
45    90000.0
Name: Salary, dtype: float64
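Because every age occurs only once in the sample data, each group above holds a single row. Grouping is more informative when a key repeats. The sketch below adds a hypothetical Department column (not part of the original data) and computes several aggregates per group.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Department": ["Sales", "Sales", "IT", "IT", "IT"],  # hypothetical column
    "Age": [25, 30, 35, 40, 45],
    "Salary": [50000, 60000, 70000, 80000, 90000],
})

# One row per department, one column per aggregate.
summary = df.groupby("Department")["Salary"].agg(["mean", "max", "count"])
print(summary)
```

Here the IT group averages 80000 across three people and the Sales group averages 55000 across two.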
7. Machine Learning for Predictive Analytics
Scikit-learn provides a consistent fit/predict interface for building machine learning models such as regression, classification, and clustering.
Linear Regression Example
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Features and target (double brackets keep X two-dimensional, as scikit-learn expects)
X = df[["Age"]]
y = df["Salary"]
# Splitting data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Training the model
model = LinearRegression()
model.fit(X_train, y_train)
# Predicting salaries
predictions = model.predict(X_test)
print(predictions)
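Output (Example, for the five-row data):
[60000.]
With five rows, test_size=0.2 holds out a single row, so the array contains one prediction. Which row is held out depends on the shuffle; because Salary is exactly 2000 times Age in this data, the prediction equals the held-out salary up to floating-point rounding.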
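Five rows are too few to judge a model, so the sketch below repeats the same workflow on a larger synthetic dataset and adds two common evaluation metrics. The data is generated with an assumed slope of 2000 plus random noise; it is invented for illustration and does not come from data.csv.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): salary = 2000 * age + noise.
rng = np.random.default_rng(0)
ages = rng.uniform(22, 60, size=200)
salaries = 2000 * ages + rng.normal(0, 3000, size=200)

X = ages.reshape(-1, 1)   # scikit-learn expects a 2D feature array
y = salaries

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

print("slope:", model.coef_[0])          # should land near 2000
print("intercept:", model.intercept_)
print("MAE:", mean_absolute_error(y_test, predictions))
print("R^2:", r2_score(y_test, predictions))
```

Evaluating on the held-out test set, not on the training rows, gives a more honest estimate of how the model behaves on unseen data.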
8. Advanced Topics
- Time Series Analysis – Analyzing trends over time (e.g., stock prices).
- Natural Language Processing (NLP) – Analyzing text data.
- Big Data Processing – Using Spark and Dask for large datasets.
- Deep Learning – Using TensorFlow or PyTorch for AI applications.
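As a small taste of the first topic, time series analysis often starts by smoothing a date-indexed series. The sketch below computes a 3-day rolling mean with Pandas; the sales figures are invented for illustration.

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=7, freq="D")
# Invented daily values, for illustration only.
sales = pd.Series([10, 12, 11, 15, 14, 18, 17], index=dates, name="sales")

# Each value is the mean of the current day and the two days before it.
rolling_mean = sales.rolling(window=3).mean()
print(rolling_mean)
```

The first two entries are NaN because a full 3-day window is not yet available at those dates.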
Conclusion
Python provides an excellent ecosystem for data analytics. By mastering NumPy, Pandas, visualization tools, and machine learning techniques, you can extract valuable insights and make data-driven decisions.