Understanding Skewness and Kurtosis in Data Science

In data science, understanding the shape of a data distribution is crucial for choosing appropriate statistical models and interpreting results. Skewness and kurtosis are two key measures that describe the asymmetry and the 'tailedness' or 'peakedness' of a probability distribution, respectively. They provide insights beyond simple measures like mean and standard deviation.

What is Skewness?

Skewness measures the asymmetry of a probability distribution of a real-valued random variable about its mean. In simpler terms, it tells us whether the data is more concentrated on one side of the mean than the other. A distribution can be positively skewed, negatively skewed, or have zero skewness (symmetric).

Skewness quantifies the asymmetry of a distribution.

A symmetric distribution has zero skewness. Positive skew means the tail on the right side is longer or fatter, with the bulk of the data on the left. Negative skew means the tail on the left side is longer or fatter, with the bulk of the data on the right.

Mathematically, skewness is usually defined as the third standardized moment, E[((X - μ)/σ)^3]. A skewness value of 0 is what a perfectly symmetric distribution produces. A positive value indicates a tail extending towards higher values, in which case the mean is usually greater than the median. A negative value indicates a tail extending towards lower values, in which case the mean is usually less than the median. As a common rule of thumb, values between -0.5 and 0.5 are treated as approximately symmetric, and values beyond ±0.5 as noticeably skewed.
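The definition above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name skewness and the three small datasets are made up for the example, and the population (biased) form of the moment is assumed.

```python
from statistics import fmean, median


def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mu = fmean(values)
    # Population standard deviation (divide by n, not n - 1).
    sigma = (sum((x - mu) ** 2 for x in values) / n) ** 0.5
    return sum(((x - mu) / sigma) ** 3 for x in values) / n


# Hypothetical datasets chosen only to illustrate each case.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]    # one large value stretches the right tail
left_skewed = [-x for x in right_skewed]    # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed),
                   ("left", left_skewed),
                   ("symmetric", symmetric)]:
    print(f"{name:>9}: skew={skewness(data):+.3f} "
          f"mean={fmean(data):.2f} median={median(data):.2f}")
```

In the right-skewed sample the mean sits above the median, and in the mirrored sample it sits below, matching the relationship described above.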

If a dataset's histogram shows a long tail extending to the right, what type of skewness is present, and how do the mean and median typically compare?

Positive skewness. The mean is typically greater than the median.

What is Kurtosis?

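Kurtosis measures the 'tailedness' of a probability distribution relative to a normal distribution, and is often loosely described as 'peakedness'. It indicates the likelihood of extreme values (outliers) in a dataset. High kurtosis means more probability mass in the tails, often accompanied by a sharper peak, while low kurtosis means lighter tails, often with a flatter peak. Strictly speaking, the measure is driven mainly by the tails rather than by the height of the peak.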

Kurtosis describes the shape of a distribution's tails and peak.

Kurtosis is often compared to the normal distribution, which has a kurtosis of 3 (or an 'excess kurtosis' of 0). Distributions with kurtosis greater than 3 are leptokurtic (heavy tails, sharp peak), and those with kurtosis less than 3 are platykurtic (light tails, flat peak).

Kurtosis is calculated as the fourth standardized moment. The term 'excess kurtosis' is often used, which is kurtosis minus 3. A leptokurtic distribution (excess kurtosis > 0) has fatter tails and a sharper peak than a normal distribution, indicating a higher probability of extreme values. A platykurtic distribution (excess kurtosis < 0) has thinner tails and a flatter peak than a normal distribution, indicating a lower probability of extreme values. A mesokurtic distribution (excess kurtosis = 0) has tails and peak similar to a normal distribution.
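As a rough sketch of these definitions, the fourth standardized moment and excess kurtosis can be computed directly. The function names and the two tiny datasets below are assumptions made for illustration, and the population (biased) form is used.

```python
from statistics import fmean


def kurtosis(values):
    """Population kurtosis: the fourth standardized moment (normal = 3)."""
    n = len(values)
    mu = fmean(values)
    variance = sum((x - mu) ** 2 for x in values) / n
    fourth_moment = sum((x - mu) ** 4 for x in values) / n
    return fourth_moment / variance ** 2


def excess_kurtosis(values):
    """Kurtosis minus 3, so a normal distribution scores 0."""
    return kurtosis(values) - 3.0


# Hypothetical data: evenly spread values have light tails (platykurtic).
flat = list(range(1, 11))
# Hypothetical data: mostly zeros with two extreme values (leptokurtic).
heavy_tailed = [0] * 18 + [-10, 10]

print("flat:        ", round(excess_kurtosis(flat), 3))
print("heavy_tailed:", round(excess_kurtosis(heavy_tailed), 3))
```

The evenly spread sample yields a negative excess kurtosis, while the sample with two extreme values yields a positive one.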

Comparing leptokurtic, mesokurtic, and platykurtic distributions: a mesokurtic distribution (like the normal distribution) serves as the baseline. Leptokurtic distributions have higher peaks and fatter tails, suggesting more outliers. Platykurtic distributions have lower peaks and thinner tails, suggesting fewer outliers.

What type of distribution has a higher peak and fatter tails than a normal distribution, and what is its excess kurtosis value?

Leptokurtic distribution. Its excess kurtosis is positive (greater than 0).

Why are Skewness and Kurtosis Important in Data Science?

Understanding skewness and kurtosis helps data scientists in several ways:

Model Selection: Many statistical models assume normality. Deviations from normality (indicated by skewness and kurtosis) might suggest that these models are not appropriate or require data transformation.
Outlier Detection: High kurtosis can signal the presence of outliers, which may need special handling.
Data Interpretation: Skewness provides insight into the distribution of data, which can be critical for understanding phenomena like income distribution or test scores.
Feature Engineering: Transforming skewed data can sometimes improve the performance of machine learning algorithms.
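To make the feature-engineering point concrete, here is a small sketch of how a log transform can reduce right skew. The income figures are invented for the example, and the log transform is only one of several possible choices (square root and Box-Cox are common alternatives); it also requires strictly positive values.

```python
import math
from statistics import fmean


def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mu = fmean(values)
    sigma = (sum((x - mu) ** 2 for x in values) / n) ** 0.5
    return sum(((x - mu) / sigma) ** 3 for x in values) / n


# Hypothetical incomes (in thousands); one very large value creates a long right tail.
incomes = [20, 22, 25, 27, 30, 32, 35, 40, 60, 250]

# The natural log compresses large values more than small ones.
log_incomes = [math.log(x) for x in incomes]

skew_before = skewness(incomes)
skew_after = skewness(log_incomes)

print(f"skewness before log transform: {skew_before:.2f}")
print(f"skewness after log transform:  {skew_after:.2f}")
```

With this sample the transformed values are still right-skewed, but less so than the raw values. Whether the reduction is enough depends on the model and the data.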

Skewness tells you about the asymmetry, while kurtosis tells you about the 'extremeness' of the tails.

Calculating Skewness and Kurtosis in Python

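Python's scipy.stats module and the pandas library provide convenient functions to calculate skewness and kurtosis. For instance, scipy.stats.skew() and scipy.stats.kurtosis() can be used, and pandas Series objects have .skew() and .kurt() methods. Note that scipy.stats.kurtosis() returns excess kurtosis by default (fisher=True), and Series.kurt() also reports excess kurtosis, so a normal distribution scores about 0 rather than 3 with either.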
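A short usage sketch of these functions follows. It assumes numpy, scipy, and pandas are installed. The exponential sample and the seed are arbitrary choices for illustration; an exponential distribution is used because it is right-skewed with heavy tails.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Arbitrary seeded sample from a right-skewed distribution.
rng = np.random.default_rng(seed=0)
sample = rng.exponential(scale=1.0, size=10_000)

# SciPy: skew() and kurtosis() use biased (population) estimators by default.
scipy_skew = stats.skew(sample)
excess_kurt = stats.kurtosis(sample)                 # fisher=True: normal -> 0
pearson_kurt = stats.kurtosis(sample, fisher=False)  # normal -> 3

# pandas: Series.skew() and Series.kurt() use bias-corrected estimators,
# so they can differ slightly from SciPy's default results.
series = pd.Series(sample)
pandas_skew = series.skew()
pandas_kurt = series.kurt()                          # excess kurtosis

print(f"scipy  skew={scipy_skew:.3f} excess kurtosis={excess_kurt:.3f}")
print(f"pandas skew={pandas_skew:.3f} excess kurtosis={pandas_kurt:.3f}")
print(f"Pearson kurtosis (fisher=False)={pearson_kurt:.3f}")
```

Passing fisher=False to scipy.stats.kurtosis() gives the "kurtosis of 3 for a normal distribution" convention used in the text, which is exactly 3 more than the default result.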

Measure	Description	Interpretation (vs. Normal)
Skewness	Asymmetry of the distribution	= 0: Symmetric; > 0: Positively skewed (right tail); < 0: Negatively skewed (left tail)
Kurtosis	Peakedness/tailedness of the distribution	= 3 (Excess Kurtosis = 0): Mesokurtic (normal); > 3 (Excess Kurtosis > 0): Leptokurtic (heavy tails, sharp peak); < 3 (Excess Kurtosis < 0): Platykurtic (light tails, flat peak)
