How to detect outliers in pandas
Quick answer
Use pandas with statistical methods like the interquartile range (IQR) or Z-score to detect outliers in your data. Compute the statistic for a DataFrame column, then filter the rows whose values fall outside the resulting bounds.
Prerequisites
Python 3.8+
pip install "pandas>=1.5"
Setup
Install pandas if not already installed and import it for data manipulation.
pip install pandas
Step by step
This example shows how to detect outliers in a pandas DataFrame using the IQR method and Z-score method.
import pandas as pd
import numpy as np
# Sample data
np.random.seed(0)
data = pd.DataFrame({
'value': np.append(np.random.normal(50, 5, 100), [100, 105, 110])
})
# IQR method
Q1 = data['value'].quantile(0.25)
Q3 = data['value'].quantile(0.75)
IQR = Q3 - Q1
outliers_iqr = data[(data['value'] < (Q1 - 1.5 * IQR)) | (data['value'] > (Q3 + 1.5 * IQR))]
print("Outliers detected by IQR method:")
print(outliers_iqr)
# Z-score method
mean_val = data['value'].mean()
std_val = data['value'].std()
z_scores = (data['value'] - mean_val) / std_val
outliers_z = data[(z_scores < -3) | (z_scores > 3)]
print("\nOutliers detected by Z-score method:")
print(outliers_z)
Output
Outliers detected by IQR method:
     value
100  100.0
101  105.0
102  110.0

Outliers detected by Z-score method:
     value
100  100.0
101  105.0
102  110.0
Common variations
You can change the IQR multiplier (conventionally 1.5) or the Z-score cutoff (conventionally 3) to make detection more or less sensitive. A larger value widens the bounds, so only more extreme points are flagged. For multivariate data, consider using Mahalanobis distance or clustering methods.
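For the multivariate case, a minimal sketch of a Mahalanobis-distance check is shown here. The helper name `mahalanobis_outliers`, the synthetic `points` data, and the cutoff of 3.0 are all illustrative assumptions, not part of pandas. The sketch expects two or more numeric columns.

```python
import numpy as np
import pandas as pd

def mahalanobis_outliers(df, columns, threshold=3.0):
    """Return rows whose Mahalanobis distance from the column means exceeds threshold.

    Hypothetical helper; threshold=3.0 is an illustrative choice.
    """
    X = df[columns].to_numpy(dtype=float)
    centered = X - X.mean(axis=0)
    # The pseudo-inverse tolerates a singular covariance matrix
    inv_cov = np.linalg.pinv(np.cov(X, rowvar=False))
    # Squared distance per row: c_i^T @ inv_cov @ c_i
    d2 = np.einsum('ij,jk,ik->i', centered, inv_cov, centered)
    return df[np.sqrt(d2) > threshold]

# Synthetic data (assumption): y is strongly correlated with x
rng = np.random.default_rng(0)
x = rng.normal(50, 5, 200)
y = 2 * x + rng.normal(0, 2, 200)
points = pd.DataFrame({'x': x, 'y': y})

# Add one point that is unremarkable in each column but breaks the correlation
points.loc[len(points)] = [55.0, 90.0]

flagged = mahalanobis_outliers(points, ['x', 'y'])
print(flagged)
```

The added point sits about one standard deviation from the mean in each column, so a per-column IQR or Z-score check would likely pass it. The distance accounts for the correlation between the columns, which is what exposes it. For a single column, the IQR check can instead be wrapped in a helper with a configurable multiplier, as in the function that follows.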
def detect_outliers_iqr(df, column, multiplier=1.5):
    # Return the rows of df whose value in column lies outside the IQR bounds
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    return df[(df[column] < (Q1 - multiplier * IQR)) | (df[column] > (Q3 + multiplier * IQR))]

outliers_custom = detect_outliers_iqr(data, 'value', multiplier=2)
print("Outliers with multiplier=2:")
print(outliers_custom)
Output
Outliers with multiplier=2:
     value
100  100.0
101  105.0
102  110.0
Troubleshooting
If no outliers are detected, verify your data distribution and thresholds. Extremely skewed data may require transformations like log or Box-Cox before outlier detection.
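As a rough illustration of the transformation step, the sketch below applies the same IQR rule to right-skewed data on the raw scale and on the log scale. The `iqr_mask` helper and the synthetic lognormal sample are assumptions made for this example, and the log transform only works for strictly positive values.

```python
import numpy as np
import pandas as pd

def iqr_mask(series, multiplier=1.5):
    """Boolean mask, True where a value lies outside the IQR bounds (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - multiplier * iqr) | (series > q3 + multiplier * iqr)

# Synthetic right-skewed data (assumption): lognormal values are strictly positive
rng = np.random.default_rng(0)
skewed = pd.Series(rng.lognormal(mean=3, sigma=1, size=500), name='value')

raw_flags = iqr_mask(skewed)           # bounds computed on the raw, skewed scale
log_flags = iqr_mask(np.log(skewed))   # bounds computed after a log transform

print("Flagged on raw scale:", int(raw_flags.sum()))
print("Flagged on log scale:", int(log_flags.sum()))
```

On the raw scale the long right tail tends to be flagged in bulk, while the lower bound can fall below zero so that no small value is ever flagged. After the transform the distribution is closer to symmetric and the bounds apply to both tails.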
Key Takeaways
- Use the IQR method to flag values more than 1.5 times the interquartile range below Q1 or above Q3.
- The Z-score method flags points more than 3 standard deviations from the mean as outliers.
- Adjust thresholds to tune sensitivity based on your dataset's characteristics.
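To see how the cutoff affects sensitivity, one option is to count the flagged rows at several thresholds. The sketch below rebuilds the sample data from the step-by-step example. The `zscore_outliers` helper and the thresholds tried are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def zscore_outliers(df, column, threshold=3.0):
    """Return rows whose absolute Z-score in column exceeds threshold (hypothetical helper)."""
    z = (df[column] - df[column].mean()) / df[column].std()
    return df[z.abs() > threshold]

# Same sample data as the step-by-step example
np.random.seed(0)
data = pd.DataFrame({
    'value': np.append(np.random.normal(50, 5, 100), [100, 105, 110])
})

# Count flagged rows at a few illustrative cutoffs
counts = {}
for threshold in (2.0, 3.0, 5.0):
    counts[threshold] = len(zscore_outliers(data, 'value', threshold))
    print(f"threshold={threshold}: {counts[threshold]} rows flagged")
```

The count can only stay the same or shrink as the threshold grows, so a sweep like this is a quick way to pick a cutoff that matches how many points you are prepared to review.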