
    Python Data Analytics

    Python is widely used for data analytics because of its readable syntax and its mature ecosystem of libraries. This guide walks through data analytics in Python step by step, with code examples and their expected output.

    1. What is Data Analytics?

    Data analytics is the process of examining, transforming, and interpreting data to extract meaningful insights. It helps businesses and researchers make informed decisions.

    Types of Data Analytics

    1. Descriptive Analytics – Summarizes past data to understand what happened.

    • Example: Sales reports, web traffic analysis.

    2. Diagnostic Analytics – Investigates why something happened.

    • Example: Analyzing the drop in sales for a particular product.

    3. Predictive Analytics – Uses statistical models to predict future trends.

    • Example: Forecasting stock prices.

    4. Prescriptive Analytics – Provides recommendations based on data insights.

    • Example: Suggesting personalized discounts to customers.

    2. Setting Up Python for Data Analytics

    Installing Required Libraries

    Before starting, install the necessary libraries:

    pip install numpy pandas matplotlib seaborn scikit-learn

    Key Libraries:

    • NumPy – Numerical computations and efficient arrays.
    • Pandas – Data manipulation and analysis.
    • Matplotlib – Data visualization.
    • Seaborn – Advanced statistical plots.
    • Scikit-learn – Machine learning for predictive analytics.
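
    After installation, a quick check can confirm that each library imports correctly. The sketch below is one possible way to do this; the helper name installed_versions is hypothetical, and note that scikit-learn is imported under the name sklearn.

```python
import importlib

# Import names for the libraries above (scikit-learn is imported as "sklearn").
modules = ["numpy", "pandas", "matplotlib", "seaborn", "sklearn"]


def installed_versions(names):
    """Return {import name: version string, or None if the import fails}."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            # Fall back to "unknown" for modules that expose no __version__.
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = None
    return versions


for name, version in installed_versions(modules).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

    Any library reported as NOT INSTALLED can be added by rerunning the pip command above.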

    3. Working with NumPy for Numerical Computation

    NumPy provides fast, fixed-type arrays and vectorized operations for numerical data.

    Creating NumPy Arrays

    import numpy as np
    
    # Creating a 1D array
    arr = np.array([1, 2, 3, 4, 5])
    print(arr)
    
    # Creating a 2D array (Matrix)
    matrix = np.array([[1, 2, 3], [4, 5, 6]])
    print(matrix)

    Output:

    [1 2 3 4 5]
    [[1 2 3]
     [4 5 6]]

    Basic NumPy Operations

    print(arr.mean())  # Mean
    print(arr.sum())   # Sum
    print(arr.std())   # Standard deviation (population, ddof=0)
    print(np.sqrt(arr))  # Square root

    Output:

    3.0
    15
    1.4142135623730951
    [1.         1.41421356 1.73205081 2.         2.23606798]
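
    Much of NumPy's efficiency comes from applying operations to whole arrays at once instead of looping in Python. The following sketch reuses the arr and matrix values from above to illustrate element-wise arithmetic, boolean masking, and per-axis aggregation. It also shows ddof=1, which switches std() from the population formula to the sample formula.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2, 3], [4, 5, 6]])

doubled = arr * 2              # element-wise multiplication, no explicit loop
evens = arr[arr % 2 == 0]      # boolean mask keeps only the even values
col_sums = matrix.sum(axis=0)  # sum down each column
sample_std = arr.std(ddof=1)   # sample standard deviation (divides by n - 1)

print(doubled)                       # [ 2  4  6  8 10]
print(evens)                         # [2 4]
print(col_sums)                      # [5 7 9]
print(round(float(sample_std), 4))   # 1.5811
```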

    4. Pandas for Data Manipulation

    Pandas provides flexible data structures for working with datasets.

    Creating a Pandas DataFrame

    import pandas as pd
    
    # Creating a simple DataFrame
    data = {
        "Name": ["Alice", "Bob", "Charlie"],
        "Age": [25, 30, 35],
        "Salary": [50000, 60000, 70000]
    }
    
    df = pd.DataFrame(data)
    print(df)

    Output:

          Name  Age  Salary
    0    Alice   25   50000
    1      Bob   30   60000
    2  Charlie   35   70000

    Reading Data from CSV

    df = pd.read_csv("data.csv")
    print(df.head())  # Display first 5 rows

    Output (Example):

          Name  Age  Salary
    0    Alice   25   50000
    1      Bob   30   60000
    2  Charlie   35   70000
    3    David   40   80000
    4      Eve   45   90000
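
    To try read_csv without creating a file first, the CSV text can be supplied from memory. This is a minimal sketch: the csv_text contents are an assumed stand-in for data.csv, built from the example rows shown above.

```python
import io

import pandas as pd

# Assumed contents of data.csv, matching the example output above.
csv_text = """Name,Age,Salary
Alice,25,50000
Bob,30,60000
Charlie,35,70000
David,40,80000
Eve,45,90000
"""

# read_csv accepts any file-like object, so StringIO stands in for a real file.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (5, 3)
print(df.dtypes)  # Age and Salary are parsed as integer columns
```

    Checking shape and dtypes right after loading is a cheap way to catch columns that were parsed as text when numbers were expected.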

    Data Cleaning and Preprocessing

    # Choose one strategy for missing values:
    df = df.dropna()       # Option 1: drop rows that contain missing values
    # df = df.fillna(0)    # Option 2: keep all rows and replace missing values with 0
    df["Salary"] = df["Salary"].astype(float)  # Convert column type
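
    Dropping and filling are alternative strategies, and the right choice depends on the data. The sketch below compares them on a small made-up DataFrame (the missing values are placed there for illustration only). Filling Age with the median and Salary with 0 is just one possible policy.

```python
import numpy as np
import pandas as pd

# Illustrative data with two deliberately missing values.
raw = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, np.nan, 35],
    "Salary": [50000, 60000, np.nan],
})

# Count missing values per column before deciding how to handle them.
print(raw.isna().sum())

# Strategy 1: drop every row that has at least one missing value.
dropped = raw.dropna()

# Strategy 2: keep all rows and fill each column with its own default.
filled = raw.fillna({"Age": raw["Age"].median(), "Salary": 0})

print(len(dropped))  # 1
print(filled)
```

    Here dropna() keeps only one of three rows, which shows how quickly dropping can shrink a small dataset.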

    5. Data Visualization with Matplotlib and Seaborn

    Data visualization helps in understanding trends, patterns, and relationships in the data.

    Matplotlib: Basic Plotting

    import matplotlib.pyplot as plt
    
    # Line Chart
    x = [1, 2, 3, 4, 5]
    y = [10, 20, 25, 30, 40]
    
    plt.plot(x, y, marker="o", linestyle="--", color="r")
    plt.xlabel("X-axis")
    plt.ylabel("Y-axis")
    plt.title("Basic Line Chart")
    plt.show()

    Output:

    A dashed red line with a circular marker at each data point, with labeled axes and a title.

    Seaborn: Advanced Visualization

    import seaborn as sns
    
    # Histogram
    sns.histplot(df["Salary"], bins=10, kde=True)
    plt.show()
    
    # Scatter Plot
    sns.scatterplot(x="Age", y="Salary", data=df)
    plt.show()

    Output:

    1. A histogram showing the distribution of salaries.
    2. A scatter plot showing the relationship between age and salary.

    6. Data Analysis with Pandas

    Statistical Summary

    print(df.describe())  # Summary statistics for numeric columns
    print(df.corr(numeric_only=True))  # Correlation between numeric columns (skips the text column "Name")

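    Output of describe() (for the five-row example data):

                 Age        Salary
    count   5.000000      5.000000
    mean   35.000000  70000.000000
    std     7.905694  15811.388301
    min    25.000000  50000.000000
    25%    30.000000  60000.000000
    50%    35.000000  70000.000000
    75%    40.000000  80000.000000
    max    45.000000  90000.000000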
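
    Correlation is only defined for numeric columns, so a text column such as Name has to be excluded first. The sketch below rebuilds the five-row example and selects the numeric columns explicitly with select_dtypes, which is one way to make that step visible.

```python
import pandas as pd

# The five-row example data used throughout this guide.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [25, 30, 35, 40, 45],
    "Salary": [50000, 60000, 70000, 80000, 90000],
})

# Keep only numeric columns, then compute the Pearson correlation matrix.
numeric = df.select_dtypes(include="number")
corr = numeric.corr()

# Look up a single pair in the matrix by row and column label.
age_salary = corr.loc["Age", "Salary"]
print(corr)
print(round(float(age_salary), 6))
```

    In this sample Salary rises by a fixed amount with Age, so the correlation is 1 up to floating-point rounding. Real datasets will rarely be this clean.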

    GroupBy and Aggregation

    grouped = df.groupby("Age")["Salary"].mean()
    print(grouped)

    Output:

    Age
    25    50000.0
    30    60000.0
    35    70000.0
    40    80000.0
    45    90000.0
    Name: Salary, dtype: float64
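
    Grouping is most informative when several rows share a key. In the example above every Age value is unique, so each group holds one row. The sketch below adds a hypothetical Department column (not part of the original data) and computes several aggregates per group in one call.

```python
import pandas as pd

df = pd.DataFrame({
    "Department": ["Sales", "Sales", "IT", "IT", "IT"],  # hypothetical column
    "Age": [25, 30, 35, 40, 45],
    "Salary": [50000, 60000, 70000, 80000, 90000],
})

# One row per department, one column per aggregate.
summary = df.groupby("Department")["Salary"].agg(["count", "mean", "max"])
print(summary)
```

    With this data the IT group has three rows and a mean salary of 80000, and the Sales group has two rows and a mean salary of 55000.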

    7. Machine Learning for Predictive Analytics

    Scikit-learn is used for building machine learning models.

    Linear Regression Example

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    
    # Sample dataset
    X = df[["Age"]]
    y = df["Salary"]
    
    # Splitting data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Training the model
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Predicting salaries
    predictions = model.predict(X_test)
    print(predictions)

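    Output (Example):

    [60000.]

    With five rows and test_size=0.2, the test set contains a single row, so only one prediction is printed. The exact value depends on which row is held out. Because Salary is exactly 2000 × Age in the sample data, the prediction matches that row's actual salary.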
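
    A fitted linear model can also be inspected directly. The sketch below fits on all five example rows (no train/test split, purely for illustration) and reads the learned slope and intercept. The new age of 50 is an assumed input chosen for the demonstration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The five-row example data: Salary is exactly 2000 * Age.
df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Salary": [50000, 60000, 70000, 80000, 90000],
})

X = df[["Age"]]   # features must be 2D (a DataFrame, not a Series)
y = df["Salary"]

model = LinearRegression().fit(X, y)

slope = model.coef_[0]        # salary change per additional year of age
intercept = model.intercept_  # predicted salary at Age = 0
r2 = model.score(X, y)        # R^2 on the training data

# Predict for an unseen age, using the same column name as in training.
new_data = pd.DataFrame({"Age": [50]})
prediction = model.predict(new_data)[0]

print(f"slope={slope:.2f}, R^2={r2:.3f}, prediction={prediction:.2f}")
```

    A perfect fit like this only arises because the sample data is exactly linear. On real data, R^2 should be measured on a held-out test set rather than on the training rows.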

    8. Advanced Topics

    • Time Series Analysis – Analyzing trends over time (e.g., stock prices).
    • Natural Language Processing (NLP) – Analyzing text data.
    • Big Data Processing – Using Spark and Dask for large datasets.
    • Deep Learning – Using TensorFlow or PyTorch for AI applications.
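
    As a small taste of the first topic, Pandas already covers basic time series work. The sketch below uses made-up daily prices (not real market data) to compute a 3-day rolling mean and the day-over-day percentage change.

```python
import pandas as pd

# Made-up daily prices indexed by date, for illustration only.
dates = pd.date_range("2024-01-01", periods=6, freq="D")
prices = pd.Series([100.0, 102.0, 101.0, 105.0, 107.0, 110.0], index=dates)

# 3-day rolling mean: the first two entries are NaN (window not yet full).
rolling_mean = prices.rolling(window=3).mean()

# Fractional change from the previous day: the first entry is NaN.
daily_change = prices.pct_change()

print(rolling_mean)
print(daily_change)
```

    The rolling mean smooths short-term noise, which makes the underlying trend easier to see than in the raw series.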

    Conclusion

    Python provides an excellent ecosystem for data analytics. By mastering NumPy, Pandas, visualization tools, and machine learning techniques, you can extract valuable insights and make data-driven decisions.