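Descriptive statistics and quantiles: identifying outliers with the IQR rule or the 3σ rule
Uses the retail_orders dataset, which contains numeric columns such as price and quantity.
Use describe() to inspect the distribution of the data.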
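IQR = Q3 - Q1, where Q1 and Q3 are the 25th and 75th percentiles.
Outlier range (IQR rule): < Q1 - 1.5*IQR or > Q3 + 1.5*IQR.
Outlier range (3σ rule): < mean - 3*std or > mean + 3*std.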
Delete the outliers or replace them with the median.
import pandas as pd
df = pd.read_csv('retail_orders.csv')
print(df['price'].describe())
# First quartile (25th percentile) of price
Q1 = df['price'].quantile(0.25)
print(Q1)
# Interquartile range: distance between the 75th and 25th percentiles
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
# IQR rule: flag rows whose price falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df['price'] < lower) | (df['price'] > upper)]
print(outliers)
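The IQR rule can also be tried without retail_orders.csv. The following is a minimal sketch on a small made-up price series; the sample values and the helper name iqr_bounds are assumptions for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical sample standing in for the 'price' column (not real data).
prices = pd.Series([10, 12, 11, 13, 12, 11, 10, 14, 13, 100], name="price")

def iqr_bounds(s, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_bounds(prices)
# Keep only the values that fall outside the fences.
iqr_outliers = prices[(prices < lower) | (prices > upper)]

print(lower, upper)
print(iqr_outliers)
```

Here Q1 is 11 and Q3 is 13, so the fences are 8 and 16 and only the value 100 is flagged.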
# Mean and sample standard deviation of price
mean = df['price'].mean()
std = df['price'].std()
print(f"Mean: {mean}, standard deviation: {std}")
# 3σ rule: flag rows whose price lies more than 3 standard deviations from the mean
mean = df['price'].mean()
std = df['price'].std()
lower = mean - 3 * std
upper = mean + 3 * std
outliers = df[(df['price'] < lower) | (df['price'] > upper)]
print(outliers)
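Because the mean and standard deviation are themselves pulled by extreme values, the 3σ rule can behave differently on small and large samples. The sketch below illustrates this on two made-up series; the data and the helper names sigma_bounds and sigma_outliers are hypothetical, and std() is pandas' default sample standard deviation (ddof=1).

```python
import pandas as pd

def sigma_bounds(s, k=3):
    """Return (lower, upper) as mean - k*std and mean + k*std."""
    m = s.mean()
    sd = s.std()  # sample standard deviation (ddof=1)
    return m - k * sd, m + k * sd

def sigma_outliers(s, k=3):
    """Return the values of s that fall outside the k-sigma band."""
    lower, upper = sigma_bounds(s, k)
    return s[(s < lower) | (s > upper)]

# Hypothetical samples: 10 points and 31 points, each with one extreme value.
small = pd.Series([10, 12, 11, 13, 12, 11, 10, 14, 13, 100], name="price")
large = pd.Series([10, 11, 12, 13, 14] * 6 + [100], name="price")

flagged_small = sigma_outliers(small)
flagged_large = sigma_outliers(large)

print(len(flagged_small))      # number of values flagged in the small sample
print(flagged_large.tolist())  # values flagged in the larger sample
```

In the 10-point sample the value 100 inflates the standard deviation enough that it stays inside the band and nothing is flagged, while in the 31-point sample it is flagged. On small or heavily skewed data the IQR rule may therefore be the safer choice.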
# Replace 3σ outliers in price with the column median
median = df['price'].median()
mean = df['price'].mean()
std = df['price'].std()
lower = mean - 3 * std
upper = mean + 3 * std
df.loc[(df['price'] < lower) | (df['price'] > upper), 'price'] = median
print(df['price'])
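The snippet above shows replacement under the 3σ rule. The other handling option, deleting outlier rows, can be sketched side by side with replacement as follows. This example uses the IQR fences and a small made-up DataFrame; the values and the variable names dropped and replaced are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for retail_orders (not real data).
df = pd.DataFrame({
    "price": [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 14.0, 13.0, 100.0],
    "quantity": [1, 2, 1, 3, 2, 1, 1, 2, 3, 1],
})

# IQR fences for the price column.
q1 = df["price"].quantile(0.25)
q3 = df["price"].quantile(0.75)
iqr = q3 - q1
mask = (df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)

# Option 1: delete the outlier rows (row count shrinks).
dropped = df[~mask].copy()

# Option 2: keep every row, overwrite outlier prices with the column median.
replaced = df.copy()
replaced.loc[mask, "price"] = df["price"].median()

print(len(df), len(dropped), len(replaced))
print(replaced["price"].tolist())
```

Deleting removes the whole order, including its other columns, whereas replacing keeps the row and changes only the price. Both options work on copies here so that the original df stays available for comparison.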