Python Statistics Module
The statistics module in Python provides functions for calculating mathematical statistics of numeric data. It has been part of the standard library since Python 3.4 and is useful for basic statistical operations. Here’s a detailed explanation of its key functions and features:
Basic Overview
- The statistics module is primarily concerned with numeric data (integers or floats).
- It contains functions for measures of central tendency (mean, median, mode) and measures of variability (variance, standard deviation, etc.).
Key Functions in the statistics Module
1. Measures of Central Tendency
These functions help identify the central value in a dataset.
mean(data)
- Computes the arithmetic mean (average) of the data.
import statistics
data = [10, 20, 30, 40, 50]
print(statistics.mean(data)) # Output: 30
median(data)
- Returns the middle value of the sorted dataset. If the dataset has an even number of elements, it returns the average of the two middle values.
data = [1, 3, 3, 6, 7, 8, 9]
print(statistics.median(data)) # Output: 6
median_low(data)
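- Returns the low median: the middle value when the dataset has an odd number of elements, and the smaller of the two middle values when it has an even number. The result is always a member of the dataset.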
data = [1, 3, 3, 6, 7, 8, 9]
print(statistics.median_low(data)) # Output: 6
median_high(data)
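- Returns the high median: the middle value when the dataset has an odd number of elements, and the larger of the two middle values when it has an even number. The result is always a member of the dataset.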
data = [1, 3, 3, 6, 7, 8, 9]
print(statistics.median_high(data)) # Output: 6
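The examples above use an odd-length dataset, so median, median_low, and median_high all return the same value. A minimal sketch with an even-length list (the values are made up for illustration) shows where the three functions differ:

```python
import statistics

# Even number of values: the two middle values are 3 and 5.
even_data = [1, 3, 5, 7]

middle = statistics.median(even_data)      # average of the two middle values
low = statistics.median_low(even_data)     # smaller of the two middle values
high = statistics.median_high(even_data)   # larger of the two middle values

print(middle, low, high)  # Output: 4.0 3 5
```

Note that median can return a value that is not in the dataset (4.0 here), while median_low and median_high always return an actual data point.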
mode(data)
- Returns the most common data point. If several values tie for most common, it returns the first one encountered (Python 3.8 and later); earlier versions raised a StatisticsError instead. An empty dataset raises a StatisticsError.
data = [1, 1, 2, 3, 3, 3, 4]
print(statistics.mode(data)) # Output: 3
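As a small sketch of how mode behaves on ties and on non-numeric data, assuming Python 3.8 or later (the sample lists are invented for illustration):

```python
import statistics

# Two values tie for most common; on Python 3.8+ the first one seen is returned.
tied = [1, 1, 2, 2, 3]
first_mode = statistics.mode(tied)

# mode() also works on non-numeric (nominal) data such as strings.
colors = ["red", "blue", "blue", "green"]
common_color = statistics.mode(colors)

print(first_mode, common_color)  # Output: 1 blue
```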
multimode(data)
- Returns a list of the most common values (modes), in the order they first appear in the data. Available since Python 3.8.
data = [1, 1, 2, 3, 3, 4, 4]
print(statistics.multimode(data)) # Output: [1, 3, 4]
2. Measures of Variability
These functions measure the spread of data.
variance(data, xbar=None)
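- Returns the sample variance: the sum of the squared deviations from the mean, divided by n - 1. It requires at least two data points. The optional xbar argument accepts a precomputed mean.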
data = [10, 20, 30, 40, 50]
print(statistics.variance(data)) # Output: 250
stdev(data, xbar=None)
- Computes the sample standard deviation, the square root of the sample variance.
data = [10, 20, 30, 40, 50]
print(statistics.stdev(data)) # Output: 15.81 (approx)
pvariance(data, mu=None)
- Computes the population variance. The difference between variance and pvariance is that the latter treats the data as the entire population and divides by n, whereas variance treats the data as a sample and divides by n - 1.
data = [10, 20, 30, 40, 50]
print(statistics.pvariance(data)) # Output: 200
pstdev(data, mu=None)
- Computes the population standard deviation.
data = [10, 20, 30, 40, 50]
print(statistics.pstdev(data)) # Output: 14.14 (approx)
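To make the difference between the sample and population versions concrete, here is a minimal sketch that recomputes both variances by hand for the same data. The variable names are illustrative only and are not part of the module:

```python
import math
import statistics

data = [10, 20, 30, 40, 50]
n = len(data)
mean = statistics.mean(data)

# Sum of squared deviations from the mean, shared by both formulas.
squared_deviations = sum((x - mean) ** 2 for x in data)

sample_variance = squared_deviations / (n - 1)   # same value as statistics.variance
population_variance = squared_deviations / n     # same value as statistics.pvariance
sample_stdev = math.sqrt(sample_variance)        # same value as statistics.stdev

print(sample_variance, population_variance)  # Output: 250.0 200.0
```

The only difference is the divisor, which is why the sample variance is always the larger of the two for the same data.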
3. Measures of Quantiles
These functions help divide the dataset into intervals.
quantiles(data, *, n=4, method='exclusive')
- Divides the dataset into n intervals of equal probability and returns the n - 1 cut points. The default n=4 gives quartiles. Both n and method must be passed as keyword arguments.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(statistics.quantiles(data, n=4)) # Output: [2.5, 5.0, 7.5]
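The method parameter controls how the cut points are interpolated. As a rough sketch, 'exclusive' (the default) treats the data as a sample from a larger population, while 'inclusive' treats it as the whole population, with the minimum and maximum as the extremes:

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Default: the data is a sample that may not contain the true extremes.
exclusive = statistics.quantiles(data, n=4, method="exclusive")

# The data is the whole population; min and max are the 0th and 100th percentiles.
inclusive = statistics.quantiles(data, n=4, method="inclusive")

print(exclusive)  # Output: [2.5, 5.0, 7.5]
print(inclusive)  # Output: [3.0, 5.0, 7.0]
```

Passing n=10 would return nine cut points (deciles) instead of three.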
Additional Features
- Flexibility in Input: Functions accept any iterable of numbers (e.g., lists, tuples, etc.).
- Error Handling: The module raises a StatisticsError for invalid operations, such as calculating the mean of an empty dataset.
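Both points can be sketched in a few lines. The safe_mean helper below is hypothetical, written only for this illustration, and is not part of the statistics module:

```python
import statistics

# Any iterable of numbers works, including tuples and generators.
from_tuple = statistics.mean((2, 4, 6))
from_generator = statistics.mean(x for x in range(1, 6))  # 1, 2, 3, 4, 5


def safe_mean(values, default=None):
    """Hypothetical helper: return `default` instead of raising on empty input."""
    try:
        return statistics.mean(values)
    except statistics.StatisticsError:
        return default


print(from_tuple, from_generator)  # Output: 4 3
print(safe_mean([]))               # Output: None
```

Whether to catch the error or let it propagate depends on the application; an empty dataset is often a sign of a problem earlier in the pipeline.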
Common Applications
- Basic Data Analysis: Quick calculation of summary statistics for small samples.
- Data Validation: Verifying assumptions about the distribution of the data.
- Education: Illustrating the application of statistical concepts.
Example: Using Multiple Functions Together
import statistics
data = [12, 15, 12, 15, 18, 20, 25]
print("Mean:", statistics.mean(data))
print("Median:", statistics.median(data))
print("Mode:", statistics.mode(data))
print("Variance:", statistics.variance(data))
print("Standard Deviation:", statistics.stdev(data))
Output:
Mean: 16.714285714285715
Median: 15
Mode: 12
Variance: 21.90 (approx)
Standard Deviation: 4.68 (approx)
Limitations
- Small Dataset Handling: Variance and standard deviation computed from very small datasets are unreliable estimates of the true variability.
- Advanced Analysis: For more complex statistical analysis, consider NumPy, SciPy, or pandas.