Pandas Bootcamp
PROJECT 06 (Advanced)

Credit Anti-fraud Feature Engineering

Build a feature engineering pipeline for credit anti-fraud modeling, covering Target Encoding, feature crossing, and variance filtering.

Target Encoding · Feature Crossing · Variance Filtering · WOE Encoding · Information Value (IV)

Project Background

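A consumer finance company receives 50,000 loan applications per month, of which about 3% are fraudulent. Its traditional rule engine misses too many fraud cases, so feature engineering is needed to improve the machine learning model's ability to detect fraud.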

Simulated Dataset

app_id,age,income,employment_type,city_tier,loan_amount,dti_ratio,has_collateral,is_fraud
A001,32,85000,salaried,1,50000,0.35,True,False
A002,45,120000,salaried,1,200000,0.42,True,False
A003,22,0,unemployed,3,30000,0.00,False,True
A004,38,95000,self_employed,2,80000,0.28,False,False
A005,28,60000,salaried,2,150000,0.65,False,True
A006,55,200000,salaried,1,100000,0.20,True,False
A007,19,15000,student,3,20000,0.00,False,True
A008,41,110000,salaried,1,90000,0.33,True,False
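
The reference solution further down reads `credit_data.csv`, which is not shipped with this page. As a minimal sketch, assuming you only want to experiment with the eight sample rows, they can be loaded from an inline string instead (the name `CSV_TEXT` is just an illustration):

```python
import io

import pandas as pd

# The eight sample rows from the simulated dataset, inlined so the sketch is self-contained.
CSV_TEXT = """app_id,age,income,employment_type,city_tier,loan_amount,dti_ratio,has_collateral,is_fraud
A001,32,85000,salaried,1,50000,0.35,True,False
A002,45,120000,salaried,1,200000,0.42,True,False
A003,22,0,unemployed,3,30000,0.00,False,True
A004,38,95000,self_employed,2,80000,0.28,False,False
A005,28,60000,salaried,2,150000,0.65,False,True
A006,55,200000,salaried,1,100000,0.20,True,False
A007,19,15000,student,3,20000,0.00,False,True
A008,41,110000,salaried,1,90000,0.33,True,False
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# The True/False literals are parsed as boolean columns.
print(df.dtypes)
print(f"fraud share in the sample: {df['is_fraud'].mean():.3f}")
```

Note that 3 of the 8 sample rows are fraudulent, far above the roughly 3% rate described in the project background, so statistics computed on this sample are only useful for checking that the code runs.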

Code Exercise Area

Write your Pandas code in the editor below. You can also use it for notes or pseudocode; the reference solution is further down.


Reference Solution

reference_solution.py
import pandas as pd
import numpy as np
from sklearn.feature_selection import VarianceThreshold

df = pd.read_csv('credit_data.csv')

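# 1. Feature crossing
# Guard against division by zero: a zero loan amount becomes NaN instead of inf
df['income_to_loan'] = df['income'] / df['loan_amount'].replace(0, np.nan)
df['age_income_interaction'] = df['age'] * df['income']
# include_lowest=True keeps dti_ratio == 0.0 in the 'low' bin (otherwise it becomes NaN)
df['dti_category'] = pd.cut(df['dti_ratio'], bins=[0, 0.2, 0.4, 0.6, 1.0],
                            labels=['low', 'medium', 'high', 'extreme'],
                            include_lowest=True)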

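# 2. Target Encoding (smoothed toward the global mean to limit overfitting)
global_mean = df['is_fraud'].mean()
smooth = 10
for col in ['employment_type', 'city_tier']:
    agg = df.groupby(col)['is_fraud'].agg(['mean', 'count'])
    enc = (agg['mean'] * agg['count'] + global_mean * smooth) / (agg['count'] + smooth)
    # enc is indexed by category, so map it back onto the rows
    # (assigning it directly would align on the row index and give wrong values)
    df[f'{col}_target_enc'] = df[col].map(enc)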

# 3. WOE encoding: ln(share of good) - ln(share of bad) per bin
for col in ['dti_category']:
    total_good = (df['is_fraud'] == False).sum()
    total_bad = (df['is_fraud'] == True).sum()
    woe_dict = {}
    for cat in df[col].dropna().unique():
        mask = df[col] == cat
        # +0.5 keeps the logarithm finite when a bin has no good or no bad rows
        good = (df.loc[mask, 'is_fraud'] == False).sum() + 0.5
        bad = (df.loc[mask, 'is_fraud'] == True).sum() + 0.5
        woe_dict[cat] = np.log(good / total_good) - np.log(bad / total_bad)
    # Cast to float so the result is numeric rather than categorical
    df[f'{col}_woe'] = df[col].map(woe_dict).astype(float)

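# 4. Variance filtering (flag near-constant numeric features)
# Note: the threshold is scale-dependent, since features are not standardized here
numeric = df.select_dtypes(include=[np.number])
selector = VarianceThreshold(threshold=0.01)
df_filtered = selector.fit_transform(numeric)
low_var_features = numeric.columns[~selector.get_support()]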
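
Step 2 of the reference solution computes each row's encoding from statistics that include that row's own label, which can leak the target into the feature. One common mitigation, sketched here as an assumption rather than as part of the original solution, is to compute the smoothed encoding out-of-fold. The helper name `oof_target_encode` and its parameters are hypothetical; `KFold` is scikit-learn's standard splitter.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_target_encode(df, col, target, smooth=10, n_splits=4, seed=0):
    """Hypothetical helper: smoothed target encoding computed out-of-fold."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, apply_idx in kf.split(df):
        fit = df.iloc[fit_idx]
        prior = fit[target].mean()
        stats = fit.groupby(col)[target].agg(['mean', 'count'])
        enc = (stats['mean'] * stats['count'] + prior * smooth) / (stats['count'] + smooth)
        # Categories not seen in the fitting folds fall back to the prior.
        values = df.iloc[apply_idx][col].map(enc).fillna(prior)
        encoded.iloc[apply_idx] = values.to_numpy()
    return encoded


# Two columns taken from the simulated dataset.
demo = pd.DataFrame({
    'employment_type': ['salaried', 'salaried', 'unemployed', 'self_employed',
                        'salaried', 'salaried', 'student', 'salaried'],
    'is_fraud': [False, False, True, False, True, False, True, False],
})
demo['employment_type_oof'] = oof_target_encode(demo, 'employment_type', 'is_fraud')
print(demo)
```

Each row is encoded using only the labels of the other folds, so a category that appears once (such as `student`) receives the fold's prior instead of its own label.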

Business Interpretation

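The core of anti-fraud feature engineering is "making anomalies visible." Crossed features, such as a very low income-to-loan ratio or an unemployed applicant requesting a large loan, can expose fraud patterns. Target Encoding converts a categorical variable into a number tied to the target, but it needs smoothing to prevent overfitting. WOE encoding is widely used in financial risk control, and because it is expressed on a log-odds scale it pairs naturally with logistic regression.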
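
The topic list for this project mentions Information Value (IV), which the reference solution does not compute. A minimal sketch, assuming the usual definition IV = Σ (good share − bad share) × WOE over the bins, is shown below. The helper `woe_iv` is hypothetical, and unlike the reference solution it normalizes by the smoothed totals so that the shares sum to 1.

```python
import numpy as np
import pandas as pd


def woe_iv(feature, target, eps=0.5):
    """Hypothetical helper: per-bin WOE and the total IV of one binned feature."""
    is_bad = target.astype(bool)
    # eps keeps the logarithm finite for bins with no good or no bad rows.
    bad = is_bad.groupby(feature).sum() + eps
    good = (~is_bad).groupby(feature).sum() + eps
    good_share = good / good.sum()
    bad_share = bad / bad.sum()
    woe = np.log(good_share / bad_share)
    iv = float(((good_share - bad_share) * woe).sum())
    return woe, iv


# dti_ratio and is_fraud taken from the simulated dataset.
dti = pd.Series([0.35, 0.42, 0.00, 0.28, 0.65, 0.20, 0.00, 0.33])
fraud = pd.Series([False, False, True, False, True, False, True, False])
bins = pd.cut(dti, bins=[0, 0.2, 0.4, 0.6, 1.0],
              labels=['low', 'medium', 'high', 'extreme'],
              include_lowest=True).astype(str)

woe, iv = woe_iv(bins, fraud)
print(woe)
print(f"IV = {iv:.3f}")
```

With this sign convention, a positive WOE marks a bin where non-fraud cases are over-represented and a negative WOE marks a bin that skews toward fraud. On eight rows the IV value itself is not meaningful; the sketch only illustrates the calculation.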