Understanding Skewness and Kurtosis in Data Science
In data science, understanding the shape of a data distribution is crucial for choosing appropriate statistical models and interpreting results. Skewness and kurtosis are two key measures that describe the asymmetry and the 'tailedness' or 'peakedness' of a probability distribution, respectively. They provide insights beyond simple measures like mean and standard deviation.
What is Skewness?
Skewness measures the asymmetry of a probability distribution of a real-valued random variable about its mean. In simpler terms, it tells us whether the data is more concentrated on one side of the mean than the other. A distribution can be positively skewed, negatively skewed, or have zero skewness (symmetric).
Skewness quantifies the asymmetry of a distribution.
A symmetric distribution has zero skewness. Positive skew means the tail on the right side is longer or fatter, with the bulk of the data on the left. Negative skew means the tail on the left side is longer or fatter, with the bulk of the data on the right.
Mathematically, skewness is usually defined as the third standardized moment: the average cubed deviation from the mean, divided by the cube of the standard deviation. A perfectly symmetric distribution has a skewness of 0, although a skewness of 0 does not by itself guarantee symmetry. A positive value indicates a tail extending towards higher values, in which case the mean is usually greater than the median. A negative value indicates a tail extending towards lower values, in which case the mean is usually less than the median. As a common rule of thumb, values between -0.5 and 0.5 are treated as approximately symmetric, and values beyond ±0.5 as noticeably skewed.
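The third standardized moment can be computed by hand and checked against SciPy. The following is a minimal sketch: the helper name standardized_moment and the small sample are made up for illustration, and the comparison assumes SciPy's default biased (population) estimator.

```python
import numpy as np
from scipy import stats


def standardized_moment(x, k):
    """k-th standardized moment, using the population (ddof=0) standard deviation."""
    x = np.asarray(x, dtype=float)
    deviations = x - x.mean()
    return np.mean(deviations ** k) / np.std(x) ** k


# Hypothetical sample with a long right tail.
sample = np.array([1, 2, 2, 3, 3, 3, 4, 10, 15])

manual_skew = standardized_moment(sample, 3)
scipy_skew = stats.skew(sample)  # default bias=True matches the formula above

print(f"manual skewness: {manual_skew:.4f}")
print(f"scipy skewness:  {scipy_skew:.4f}")
print(f"mean = {sample.mean():.2f}, median = {np.median(sample):.2f}")
```

For this sample the skewness is positive and the mean sits above the median, which is the pattern described for right-skewed data.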
What is Kurtosis?
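Kurtosis measures the 'tailedness' of a probability distribution relative to a normal distribution, and is often loosely described as 'peakedness'. It indicates the likelihood of extreme values (outliers) in a dataset. High kurtosis means more probability mass in the tails, often accompanied by a sharper peak, while low kurtosis means less mass in the tails, often accompanied by a flatter peak. Because the measure is driven mainly by the tails, the description of the peak is a rule of thumb rather than a guarantee.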
Kurtosis describes the shape of a distribution's tails and peak.
Kurtosis is often compared to the normal distribution, which has a kurtosis of 3 (or an 'excess kurtosis' of 0). Distributions with kurtosis greater than 3 are leptokurtic (heavy tails, sharp peak), and those with kurtosis less than 3 are platykurtic (light tails, flat peak).
Kurtosis is calculated as the fourth standardized moment. The term 'excess kurtosis' is often used, which is kurtosis minus 3. A leptokurtic distribution (excess kurtosis > 0) has fatter tails and a sharper peak than a normal distribution, indicating a higher probability of extreme values. A platykurtic distribution (excess kurtosis < 0) has thinner tails and a flatter peak than a normal distribution, indicating a lower probability of extreme values. A mesokurtic distribution (excess kurtosis = 0) has tails and peak similar to a normal distribution.
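The three categories can be illustrated with simulated data. This is a sketch under stated assumptions: the choice of Laplace, normal, and uniform distributions, the seed, and the sample size are illustrative and not taken from the text above. scipy.stats.kurtosis returns excess kurtosis by default (fisher=True); passing fisher=False returns the plain fourth standardized moment.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n = 100_000

samples = {
    "laplace (leptokurtic)": rng.laplace(size=n),
    "normal (mesokurtic)": rng.normal(size=n),
    "uniform (platykurtic)": rng.uniform(-1.0, 1.0, size=n),
}

# Excess kurtosis (kurtosis minus 3) and plain kurtosis for each sample.
excess = {name: stats.kurtosis(x) for name, x in samples.items()}
plain = {name: stats.kurtosis(x, fisher=False) for name, x in samples.items()}

for name in samples:
    print(f"{name:24s} excess = {excess[name]:+.3f}   kurtosis = {plain[name]:.3f}")
```

The theoretical excess kurtosis values are 3 for the Laplace distribution, 0 for the normal, and -1.2 for the uniform, so the sample estimates should land near those figures, with some sampling noise.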
Visualizing the difference between leptokurtic, mesokurtic, and platykurtic distributions. A mesokurtic distribution (like the normal distribution) serves as the baseline. Leptokurtic distributions have higher peaks and fatter tails, suggesting more outliers. Platykurtic distributions have lower peaks and thinner tails, suggesting fewer outliers.
Why are Skewness and Kurtosis Important in Data Science?
Understanding skewness and kurtosis helps data scientists in several ways:
- Model Selection: Many statistical models assume normality. Deviations from normality (indicated by skewness and kurtosis) might suggest that these models are not appropriate or require data transformation.
- Outlier Detection: High kurtosis can signal the presence of outliers, which may need special handling.
- Data Interpretation: Skewness provides insight into the distribution of data, which can be critical for understanding phenomena like income distribution or test scores.
- Feature Engineering: Transforming skewed data can sometimes improve the performance of machine learning algorithms.
Skewness tells you about the asymmetry, while kurtosis tells you about the 'extremeness' of the tails.
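As an illustration of the feature engineering point, a log transform is one common way to reduce right skew in strictly positive data. The sketch below uses synthetic log-normal values as a stand-in for something like income; the variable names, parameters, and seed are assumptions made for the example, and a log transform is only one of several possible choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic, strictly positive, right-skewed data (hypothetical "income" values).
income = rng.lognormal(mean=10.0, sigma=1.0, size=10_000)

# The log of a log-normal variable is normally distributed, so the skew should shrink.
log_income = np.log(income)

skew_before = stats.skew(income)
skew_after = stats.skew(log_income)

print(f"skewness before log transform: {skew_before:.3f}")
print(f"skewness after log transform:  {skew_after:.3f}")
```

On real data the transform will rarely remove skew this cleanly, so it is worth re-checking skewness and kurtosis after any transformation rather than assuming the result is close to normal.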
Calculating Skewness and Kurtosis in Python
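Python's scipy.stats and pandas libraries both provide these measures: scipy.stats.skew() and scipy.stats.kurtosis() operate on arrays, while the .skew() and .kurt() methods are available on pandas objects.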
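A short sketch of both APIs on the same data follows. The sample values are made up for illustration. One detail worth knowing: by default the SciPy functions return biased (population) estimates, while the pandas methods return bias-corrected ones, so the two libraries agree only when bias=False is passed to SciPy. Both scipy.stats.kurtosis() (with its default fisher=True) and pandas .kurt() report excess kurtosis, so a normal distribution scores about 0 rather than 3.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical right-skewed sample.
values = np.array([2.0, 3.0, 3.0, 4.0, 5.0, 5.0, 6.0, 9.0, 14.0, 30.0])
series = pd.Series(values)

# SciPy defaults: biased estimators, excess kurtosis (fisher=True).
scipy_skew_biased = stats.skew(values)
scipy_kurt_biased = stats.kurtosis(values)

# pandas: bias-corrected estimators, excess kurtosis.
pandas_skew = series.skew()
pandas_kurt = series.kurt()

# SciPy with bias correction, for a like-for-like comparison with pandas.
scipy_skew_unbiased = stats.skew(values, bias=False)
scipy_kurt_unbiased = stats.kurtosis(values, bias=False)

print(f"skew: scipy default = {scipy_skew_biased:.4f}, pandas = {pandas_skew:.4f}")
print(f"kurt: scipy default = {scipy_kurt_biased:.4f}, pandas = {pandas_kurt:.4f}")
print(f"scipy with bias=False: skew = {scipy_skew_unbiased:.4f}, kurt = {scipy_kurt_unbiased:.4f}")
```

The gap between the biased and bias-corrected estimates is most visible on small samples like this one and shrinks as the sample grows.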
Measure | Description | Interpretation (vs. Normal) |
---|---|---|
Skewness | Asymmetry of the distribution | 0: Symmetric; >0: Positively skewed (right tail); <0: Negatively skewed (left tail) |
Kurtosis | Peakedness/tailedness of the distribution | 3 (Excess Kurtosis 0): Mesokurtic (normal); >3 (Excess Kurtosis >0): Leptokurtic (heavy tails, sharp peak); <3 (Excess Kurtosis <0): Platykurtic (light tails, flat peak) |