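PROJECT 06 · Advanced
Credit Anti-fraud Feature Engineering
Build a credit anti-fraud feature engineering pipeline covering Target Encoding, feature crossing, and variance filtering.
Target Encoding · Feature Crossing · Variance Filtering · WOE Encoding · Information Value (IV)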
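Project Background
A consumer finance company receives 50,000 loan applications per month, of which about 3% are fraudulent. Its traditional rule engine misses too many fraud cases (a high false-negative rate), so feature engineering is needed to improve the machine learning model's ability to detect fraud.
Simulated Dataset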
app_id,age,income,employment_type,city_tier,loan_amount,dti_ratio,has_collateral,is_fraud
A001,32,85000,salaried,1,50000,0.35,True,False
A002,45,120000,salaried,1,200000,0.42,True,False
A003,22,0,unemployed,3,30000,0.00,False,True
A004,38,95000,self_employed,2,80000,0.28,False,False
A005,28,60000,salaried,2,150000,0.65,False,True
A006,55,200000,salaried,1,100000,0.20,True,False
A007,19,15000,student,3,20000,0.00,False,True
A008,41,110000,salaried,1,90000,0.33,True,False
Code Exercise Area
Write your Pandas code in the editor below. You can take notes or write pseudocode; the reference solution follows.
pandas_exercise.py
Reference Solution
reference_solution.py
import pandas as pd
import numpy as np
from sklearn.feature_selection import VarianceThreshold
df = pd.read_csv('credit_data.csv')

# 1. Feature crossing
df['income_to_loan'] = df['income'] / df['loan_amount']
df['age_income_interaction'] = df['age'] * df['income']
# include_lowest=True keeps dti_ratio == 0 in the 'low' bin instead of NaN
df['dti_category'] = pd.cut(df['dti_ratio'], bins=[0, 0.2, 0.4, 0.6, 1.0],
                            labels=['low', 'medium', 'high', 'extreme'],
                            include_lowest=True)
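
# 2. Target Encoding (smoothed to reduce overfitting)
# Note: on real data, fit the encoding on training folds only to avoid target leakage.
global_mean = df['is_fraud'].mean()
smooth = 10
for col in ['employment_type', 'city_tier']:
    agg = df.groupby(col)['is_fraud'].agg(['mean', 'count'])
    # Shrink each category's fraud rate toward the global rate
    enc = (agg['mean'] * agg['count'] + global_mean * smooth) / (agg['count'] + smooth)
    # enc is indexed by category, so map it back onto the rows
    df[f'{col}_target_enc'] = df[col].map(enc)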

# 3. WOE encoding
total_good = (df['is_fraud'] == False).sum()
total_bad = (df['is_fraud'] == True).sum()
for col in ['dti_category']:
    woe_dict = {}
    for cat in df[col].dropna().unique():
        mask = df[col] == cat
        # +0.5 smoothing avoids log(0) when a bin has no good or no bad cases
        good = (df.loc[mask, 'is_fraud'] == False).sum() + 0.5
        bad = (df.loc[mask, 'is_fraud'] == True).sum() + 0.5
        woe_dict[cat] = np.log(good / total_good) - np.log(bad / total_bad)
    # Cast to float so the WOE column is numeric rather than categorical
    df[f'{col}_woe'] = df[col].map(woe_dict).astype(float)

# 4. Variance filtering
numeric = df.select_dtypes(include=[np.number])
selector = VarianceThreshold(threshold=0.01)
df_filtered = selector.fit_transform(numeric)  # NumPy array of the retained columns
low_var_features = numeric.columns[~selector.get_support()]
Business Interpretation
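The core of anti-fraud feature engineering is "making anomalies visible". Feature crosses such as a very low income-to-loan ratio, or an unemployed applicant requesting a large loan, can expose fraud patterns. Target Encoding converts categorical variables into numeric values tied to the target, but it needs smoothing to prevent overfitting. WOE encoding is widely used in financial risk control; because WOE is expressed on the log-odds scale, it pairs naturally with logistic regression.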
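Information Value (IV) appears in the topic list but is not computed in the reference solution. The following is a minimal sketch of how WOE and IV could be derived for one categorical column, using the employment_type and is_fraud values from the eight sample rows. The helper name woe_iv and its eps parameter are illustrative assumptions, not part of the original solution. Unlike the reference code, the sketch also adds the smoothing constant to the denominators so that the good and bad shares each sum to 1.

```python
import numpy as np
import pandas as pd

# employment_type and is_fraud for the eight sample rows (A001..A008)
df = pd.DataFrame({
    "employment_type": ["salaried", "salaried", "unemployed", "self_employed",
                        "salaried", "salaried", "student", "salaried"],
    "is_fraud": [False, False, True, False, True, False, True, False],
})


def woe_iv(frame, col, target, eps=0.5):
    """Hypothetical helper: per-category WOE table plus total IV for one column.

    eps is an additive smoothing constant (assumed here, mirroring the +0.5
    used in the reference solution) that prevents log(0) for empty cells.
    """
    y = frame[target].astype(int)              # 1 = fraud (bad), 0 = good
    bad = y.groupby(frame[col]).sum()
    total = y.groupby(frame[col]).count()
    good = total - bad
    n_cats = len(total)
    # Smoothed share of all good / all bad applications in each category
    good_dist = (good + eps) / (good.sum() + eps * n_cats)
    bad_dist = (bad + eps) / (bad.sum() + eps * n_cats)
    woe = np.log(good_dist / bad_dist)         # > 0 means safer than average
    iv_part = (good_dist - bad_dist) * woe     # each term is non-negative
    table = pd.DataFrame({"good": good, "bad": bad, "woe": woe, "iv_part": iv_part})
    return table, float(iv_part.sum())


table, iv = woe_iv(df, "employment_type", "is_fraud")
print(table)
print("IV:", iv)
```

With this sign convention (good share over bad share), categories dominated by non-fraud cases get a positive WOE and fraud-heavy categories a negative one. An IV computed from eight rows is only illustrative; on real data it would be computed on the training set and compared across candidate features.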