Python Statistics Module

The statistics module provides functions for calculating mathematical statistics of numeric data. It has been part of the standard library since Python 3.4 and covers the basic operations of descriptive statistics. The key functions and features are described below:

Basic Overview

  • The statistics module works with real-valued numeric data: int, float, Decimal, and Fraction.
  • It contains functions for measures of central tendency (mean, median, mode) and measures of variability (variance, standard deviation, etc.).
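As a small illustration of the supported numeric types, the sketch below passes Fraction and Decimal values to mean(). The variable names are illustrative only; the point is that the result keeps the type of the input.

```python
from decimal import Decimal
from fractions import Fraction
import statistics

# Exact rational inputs give an exact rational mean.
fraction_mean = statistics.mean([Fraction(1, 2), Fraction(3, 4)])
print(fraction_mean)  # 5/8

# Decimal inputs give a Decimal result rather than a float.
decimal_mean = statistics.mean([Decimal("0.5"), Decimal("0.75")])
print(decimal_mean)  # 0.625
```

Mixing types in one dataset (for example Decimal with float) is best avoided, since the behaviour for mixed inputs is less predictable.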

Key Functions in the statistics Module

1. Measures of Central Tendency

These functions help identify the central value in a dataset.

  • mean(data)
    • Computes the arithmetic mean (average) of the data.
import statistics

data = [10, 20, 30, 40, 50]
print(statistics.mean(data))  # Output: 30
  • median(data)
    • Returns the middle value of the sorted dataset. If the dataset has an even number of elements, it returns the average of the two middle values.
data = [1, 3, 3, 6, 7, 8, 9]
print(statistics.median(data)) # Output: 6
  • median_low(data)
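    • Returns the low median: the middle value when the dataset has an odd number of elements, or the smaller of the two middle values when it has an even number. The result is always a member of the dataset.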
data = [1, 3, 3, 6, 7, 8, 9]
print(statistics.median_low(data)) # Output: 6
  • median_high(data)
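    • Returns the high median: the middle value when the dataset has an odd number of elements, or the larger of the two middle values when it has an even number. The result is always a member of the dataset.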
data = [1, 3, 3, 6, 7, 8, 9]
print(statistics.median_high(data)) # Output: 6
  • mode(data)
    • Returns the most common data point. If several values tie for most common, Python 3.8 and later return the first one encountered; earlier versions raise a StatisticsError.
data = [1, 1, 2, 3, 3, 3, 4]
print(statistics.mode(data)) # Output: 3
  • multimode(data)
    • Returns a list of the most common values (modes), in the order they first appear in the data. Added in Python 3.8.
data = [1, 1, 2, 3, 3, 4, 4]
print(statistics.multimode(data)) # Output: [1, 3, 4]
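The three median functions only differ when the dataset has an even number of elements, and mode() and multimode() only differ when values tie. The sketch below, which assumes Python 3.8 or later and uses illustrative variable names, shows both cases side by side.

```python
import statistics

even_data = [1, 3, 5, 7]

middle = statistics.median(even_data)      # average of the two middle values, 3 and 5
lower = statistics.median_low(even_data)   # smaller of the two middle values
upper = statistics.median_high(even_data)  # larger of the two middle values
print(middle, lower, upper)  # 4.0 3 5

tied = [1, 1, 2, 3, 3]
first_mode = statistics.mode(tied)      # Python 3.8+: first mode encountered
all_modes = statistics.multimode(tied)  # every value tied for most common
print(first_mode, all_modes)  # 1 [1, 3]
```

median_low() and median_high() are useful when the result must be an actual data point, for example with discrete data.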

2. Measures of Variability

These functions measure the spread of data.

  • variance(data, xbar=None)
    • Returns the sample variance: the sum of the squared deviations from the mean divided by n - 1, where n is the number of data points. Requires at least two data points.
data = [10, 20, 30, 40, 50]
print(statistics.variance(data)) # Output: 250
  • stdev(data, xbar=None)
    • Computes the sample standard deviation, the square root of the sample variance.
data = [10, 20, 30, 40, 50]
print(statistics.stdev(data)) # Output: 15.81 (approx)
  • pvariance(data, mu=None)
    • Computes the population variance. Unlike variance, which treats the data as a sample and divides by n - 1, pvariance treats the data as the entire population and divides by n.
data = [10, 20, 30, 40, 50]
print(statistics.pvariance(data)) # Output: 200
  • pstdev(data, mu=None)
    • Computes the population standard deviation.
data = [10, 20, 30, 40, 50]
print(statistics.pstdev(data)) # Output: 14.14 (approx)
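To make the difference between the sample and population versions concrete, the following sketch recomputes both variances by hand for the same data and compares them with the module's results. The intermediate names (sum_sq, sample_var, population_var) are illustrative, not part of the module.

```python
import math
import statistics

data = [10, 20, 30, 40, 50]
n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the mean.
sum_sq = sum((x - mean) ** 2 for x in data)

sample_var = sum_sq / (n - 1)  # divide by n - 1 for a sample
population_var = sum_sq / n    # divide by n for a whole population

print(sample_var, population_var)  # 250.0 200.0

# The hand-computed values agree with the library functions.
print(math.isclose(sample_var, statistics.variance(data)))       # True
print(math.isclose(population_var, statistics.pvariance(data)))  # True
```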

3. Measures of Quantiles

These functions help divide the dataset into intervals.

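  • quantiles(data, *, n=4, method='exclusive')
    • Divides the data into n intervals of equal probability and returns the n - 1 cut points between them. Both n and method are keyword-only arguments. Added in Python 3.8.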
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(statistics.quantiles(data, n=4))  # Output: [2.5, 5.0, 7.5]
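The method argument changes where the cut points fall. As a rough sketch (assuming Python 3.8 or later), 'exclusive' treats the data as a sample that may not contain the population's extremes, while 'inclusive' treats the smallest and largest values as the true minimum and maximum:

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

exclusive = statistics.quantiles(data, n=4)  # default method
inclusive = statistics.quantiles(data, n=4, method="inclusive")

print(exclusive)  # [2.5, 5.0, 7.5]
print(inclusive)  # [3.0, 5.0, 7.0]

# n=10 yields deciles: nine cut points for ten intervals.
deciles = statistics.quantiles(data, n=10)
print(len(deciles))  # 9
```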

Additional Features

  • Flexibility in Input: Functions accept sequences and other iterables (e.g., lists, tuples, generators).
  • Error Handling: The functions raise StatisticsError for invalid operations, such as calculating the mean of an empty dataset.
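Both points can be seen in a short sketch. The helper safe_mean below is hypothetical (it is not part of the statistics module); it accepts any iterable and turns the StatisticsError raised for empty data into None.

```python
import statistics


def safe_mean(values):
    """Hypothetical helper: return the mean, or None when there is no data."""
    try:
        return statistics.mean(values)
    except statistics.StatisticsError:
        return None


from_tuple = safe_mean((2, 4, 6))                    # tuple input
from_generator = safe_mean(x * x for x in range(4))  # generator yielding 0, 1, 4, 9
from_empty = safe_mean([])                           # empty input

print(from_tuple)      # 4
print(from_generator)  # 3.5
print(from_empty)      # None
```

Whether returning None is appropriate depends on the application; letting the exception propagate is often the safer default.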

Common Applications

  1. Basic Data Analysis: Quick calculation of summary statistics for small samples.
  2. Data Validation: Verifying assumptions about the distribution of the data.
  3. Education: Illustrating the application of statistical concepts.

Example: Using Multiple Functions Together

import statistics

data = [12, 15, 12, 15, 18, 20, 25]

print("Mean:", statistics.mean(data))
print("Median:", statistics.median(data))
print("Mode:", statistics.mode(data))
print("Variance:", statistics.variance(data))
print("Standard Deviation:", statistics.stdev(data))

Output:

Mean: 16.714285714285715
Median: 15
Mode: 12
Variance: 21.904761904761905
Standard Deviation: 4.680252 (approx)

Limitations

  • Small Dataset Handling: Variance and standard deviation estimated from very small samples are unreliable measures of variability.
  • Advanced Analysis: For more complex statistical analysis or large datasets, libraries such as NumPy, SciPy, or pandas are better suited.