Ensemble Learning

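📌 Core Definition (What)

One-sentence definition: ensemble learning combines multiple weak learners into a single strong learner. As the Chinese proverb goes, "three cobblers together can match Zhuge Liang": many modest judgments, pooled, can rival one expert.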

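Method	Core idea	Representative algorithms
Bagging	Train in parallel, then vote or average	Random Forest
Boosting	Train sequentially, each learner correcting its predecessor's errors	AdaBoost, XGBoost, LightGBM
Stacking	Use a meta-model to combine the base models' predictions	Common in competitions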
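
Stacking is the one method in the table that has no code elsewhere on this page, so here is a minimal sketch using scikit-learn's `StackingClassifier`. The choice of base models (a small Random Forest and k-nearest neighbors), the logistic-regression meta-model, and the synthetic dataset are all illustrative assumptions, not a recommended recipe.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, chosen only for illustration
X, y = make_classification(n_samples=600, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Base models produce predictions; the meta-model learns how to combine them.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    final_estimator=LogisticRegression(),  # the meta-model
    cv=5,  # meta-model is trained on out-of-fold base predictions
)
stack.fit(X_train, y_train)
stack_accuracy = stack.score(X_test, y_test)
print(f"Stacking accuracy: {stack_accuracy:.3f}")
```

The `cv=5` setting matters: training the meta-model on out-of-fold predictions keeps it from simply trusting whichever base model memorized the training set best.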

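🏠 Everyday Analogy (Analogy)

👥 "Group decision-making"

Method	Analogy
Single decision tree	One person makes the call and can easily be wrong
Random Forest	100 people vote and the majority wins
Boosting	A panel of experts: after the first one decides, the second focuses on correcting the first one's mistakes, and so on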

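🎯 "Target shooting"

High variance (Variance): the shots are scattered widely around the target → Bagging addresses this
High bias (Bias): the shots cluster tightly but all land left of center → Boosting addresses this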

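🎨 Interactive Demo (Interactive)

Observe how Bagging and Boosting work to see how ensemble learning improves performance.

🌲 Ensemble learning visualization (number of weak learners: 3)

📦 Sample data (+ = positive class, - = negative class)

🍎+  🍊+  🍋+  🥦-  🥕-  🍅-

🔀 Weak learners trained in parallel

Learner	Predictions	Accuracy
Tree 1	+++--+	83%
Tree 2	++----	83%
Tree 3	+-+-+-	67%

Each tree makes at least one mistake, but the mistakes fall on different samples, so a majority vote over the three trees classifies all six samples correctly.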
💡 Bagging reduces variance (Random Forest); Boosting reduces bias (XGBoost)
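
The voting step from the demo can be reproduced in a few lines of plain Python. This sketch uses the six sample labels and the three trees' prediction strings shown in the demo; the helper names `accuracy` and `majority_vote` are made up for illustration.

```python
from collections import Counter

labels = "+++---"  # true classes of the six samples in the demo
tree_predictions = ["+++--+", "++----", "+-+-+-"]  # trees 1-3 from the demo


def accuracy(pred, truth):
    """Fraction of positions where the prediction matches the true label."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def majority_vote(predictions):
    """For each sample, keep the class predicted by most of the learners."""
    return "".join(
        Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)
    )


ensemble = majority_vote(tree_predictions)
for i, pred in enumerate(tree_predictions, start=1):
    print(f"Tree {i}: {pred}  accuracy = {accuracy(pred, labels):.0%}")
print(f"Vote  : {ensemble}  accuracy = {accuracy(ensemble, labels):.0%}")
```

With an odd number of voters and two classes there are no ties, which is one reason small demos tend to use 3 or 5 learners.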

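📊 Math Principles (Math)

Bagging (Bootstrap Aggregating)

Bagging prediction

$$\hat{f}(x) = \frac{1}{B}\sum_{b=1}^{B} f_b(x)$$

$B$ : number of base learners (e.g., 100 trees)
$f_b$ : the $b$-th base learner, trained on a Bootstrap sample
Regression averages the predictions as above; classification takes a majority vote instead

Bootstrap sampling: draw $N$ samples with replacement from the $N$ original samples. Each original sample is missed with probability $(1 - 1/N)^N \approx e^{-1} \approx 36.8\%$, so about 63.2% of the original samples appear at least once.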
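
The 63.2% figure is easy to check numerically. The sketch below draws one Bootstrap sample with NumPy and compares the fraction of distinct originals against the closed form $1 - (1 - 1/N)^N$; the sample size and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 10_000  # arbitrary dataset size for the experiment

# One Bootstrap sample: N row indices drawn with replacement
indices = rng.integers(0, N, size=N)

# Fraction of the original rows that were drawn at least once
unique_fraction = np.unique(indices).size / N
theory = 1 - (1 - 1 / N) ** N

print(f"Observed fraction: {unique_fraction:.3f}")
print(f"Theoretical value: {theory:.3f}")
```

The rows that were never drawn are the "out-of-bag" samples; Random Forest implementations can use them as a built-in validation set.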

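Random Forest

On top of Bagging, each split considers only a random subset of the features:

Feature subsampling (common rule of thumb)

$$m = \sqrt{p} \text{ (classification)} \quad \text{or} \quad m = \frac{p}{3} \text{ (regression)}$$

$p$ : total number of features
$m$ : number of features considered at each split
This lowers the correlation between trees, which reduces variance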
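
In scikit-learn, the feature-subsampling rule is controlled by the `max_features` parameter. The following sketch assumes a synthetic dataset with $p = 20$ features and checks that `max_features="sqrt"` resolves to the $m = \sqrt{p}$ value from the formula.

```python
import math

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

p = 20  # total number of features (illustrative)
X, y = make_classification(n_samples=500, n_features=p, random_state=42)

m_classification = int(math.sqrt(p))  # sqrt(20) rounded down
m_regression = max(1, p // 3)         # p / 3 rounded down

rf = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=42)
rf.fit(X, y)

# Each fitted tree stores the resolved number of features tried per split
features_per_split = rf.estimators_[0].max_features_
print(f"m (classification rule): {m_classification}")
print(f"m (regression rule):     {m_regression}")
print(f"scikit-learn uses:       {features_per_split}")

# For regression, the p/3 rule could be requested as a fraction, e.g.
# RandomForestRegressor(max_features=1/3); it is not applied automatically.
```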

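Boosting (AdaBoost)

Boosting iteration

$$F_m(x) = F_{m-1}(x) + \alpha_m h_m(x)$$

$F_m$ : the combined model after round $m$
$h_m$ : the $m$-th weak learner
$\alpha_m$ : the weight of the $m$-th learner (larger for learners with lower weighted error)
Each round increases the weight on the "hard" samples that earlier learners misclassified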
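
To make the iteration concrete, here is a minimal from-scratch sketch of classic (discrete) AdaBoost with decision stumps as the weak learners. It assumes labels encoded as $\{-1, +1\}$ and uses the standard choice $\alpha_m = \frac{1}{2}\ln\frac{1 - \text{err}_m}{\text{err}_m}$; the dataset, the number of rounds, and the variable names are illustrative, and in practice one would normally reach for `sklearn.ensemble.AdaBoostClassifier` instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=500, n_features=10, random_state=42)
y = 2 * y01 - 1  # recode labels from {0, 1} to {-1, +1}

n_rounds = 20
w = np.full(len(y), 1 / len(y))  # start with uniform sample weights
F = np.zeros(len(y))             # F_0(x) = 0
alphas = []

for m in range(n_rounds):
    # Weak learner h_m: a depth-1 tree trained on the weighted samples
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=w)
    h = stump.predict(X)

    # Weighted error, clipped to avoid division by zero in alpha
    err = np.clip(np.sum(w * (h != y)), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)
    alphas.append(alpha)

    # F_m = F_{m-1} + alpha_m * h_m
    F += alpha * h

    # Up-weight misclassified ("hard") samples, then renormalize
    w *= np.exp(-alpha * y * h)
    w /= w.sum()

ensemble_accuracy = np.mean(np.sign(F) == y)
print(f"Training accuracy after {n_rounds} rounds: {ensemble_accuracy:.3f}")
```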

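Gradient Boosting (XGBoost/LightGBM)

Gradient boosting

$$F_m(x) = F_{m-1}(x) + \gamma h_m(x), \quad h_m = \arg\min_h \sum_i L(y_i, F_{m-1}(x_i) + h(x_i))$$

$\gamma$ : learning rate (shrinkage) applied to each new learner
$L$ : loss function (MSE, LogLoss, etc.)
In practice, each round fits $h_m$ to the negative gradient of $L$ with respect to the current prediction; for the squared-error loss $\frac{1}{2}(y - F)^2$ this is exactly the residual $y_i - F_{m-1}(x_i)$
XGBoost adds a regularization term to the objective to guard against overfitting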
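
For squared-error loss the update rule reduces to "fit a small tree to the current residuals, then add a shrunken copy of it". The sketch below implements just that loop with scikit-learn regression trees. It is a simplified illustration (no regularization term, no subsampling) and not how XGBoost or LightGBM are implemented internally; the data and hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

gamma = 0.1    # learning rate
n_rounds = 50  # number of boosting rounds

F = np.full(len(y), y.mean())  # F_0: constant prediction
mse_history = [np.mean((y - F) ** 2)]

for m in range(n_rounds):
    residual = y - F  # negative gradient of 1/2 * (y - F)^2
    h = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residual)
    F += gamma * h.predict(X)  # F_m = F_{m-1} + gamma * h_m
    mse_history.append(np.mean((y - F) ** 2))

print(f"Training MSE before boosting: {mse_history[0]:.1f}")
print(f"Training MSE after {n_rounds} rounds: {mse_history[-1]:.1f}")
```

Training error keeps falling as rounds are added, which is exactly why Boosting needs a validation set or early stopping to decide when to stop.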

💻 Code Examples

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
# Fix random_state so the split, and therefore the reported accuracy, is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Random Forest
rf = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_depth=10,          # maximum tree depth
    min_samples_split=5,   # minimum samples required to split a node
    random_state=42
)
rf.fit(X_train, y_train)
print(f"Random Forest accuracy: {rf.score(X_test, y_test):.3f}")

# Feature importances (impurity-based), indexed by feature column number
importance = pd.Series(rf.feature_importances_).sort_values(ascending=False)
print("Top 5 feature importances:\n", importance.head())

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# XGBoost
model = xgb.XGBClassifier(
    n_estimators=100,      # number of trees
    max_depth=6,           # maximum tree depth
    learning_rate=0.1,     # learning rate (shrinkage)
    subsample=0.8,         # fraction of rows sampled per tree
    colsample_bytree=0.8,  # fraction of features sampled per tree
    eval_metric='logloss',
    random_state=42
)
model.fit(X_train, y_train)
print(f"XGBoost accuracy: {model.score(X_test, y_test):.3f}")

import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# LightGBM (faster, well suited to large datasets)
model = lgb.LGBMClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    num_leaves=31,         # maximum leaves per tree (LightGBM grows trees leaf-wise)
    random_state=42,
    verbose=-1             # silence training logs
)
model.fit(X_train, y_train)
print(f"LightGBM accuracy: {model.score(X_test, y_test):.3f}")

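📈 Algorithm Comparison

Property	Random Forest	XGBoost	LightGBM	CatBoost
Training speed	Medium	Slower	✅ Fast	Medium
Memory usage	High	High	✅ Low	Medium
Categorical features	Needs encoding	Usually encoded	✅ Native support	✅ Native support
Overfitting risk	Low	Medium	Medium	Low
Interpretability	Medium (feature importances)	Medium (feature importances)	Medium (feature importances)	Medium (feature importances)
Kaggle popularity	⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐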

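🔑 Key Parameter Tuning

Parameter	Effect	If overfitting
`n_estimators`	Number of trees	Little effect for Random Forest; for Boosting, ⬇️ decrease or use early stopping
`max_depth`	Tree depth	⬇️ Decrease
`learning_rate`	Learning rate (Boosting)	⬇️ Decrease
`min_samples_split`	Minimum samples to split a node	⬆️ Increase
`subsample`	Fraction of rows sampled	⬇️ Decrease
`colsample_bytree`	Fraction of features sampled	⬇️ Decrease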
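
Rather than adjusting these parameters by hand, a cross-validated search can pick them. The sketch below runs scikit-learn's `GridSearchCV` over two of the parameters in the table for a Random Forest; the grid values, fold count, and dataset are small illustrative choices to keep the example fast, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Candidate values: shallower trees and larger split sizes regularize more
param_grid = {
    "max_depth": [3, 6, None],
    "min_samples_split": [2, 5, 10],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid,
    cv=3,                # 3-fold cross-validation per combination
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_
print(f"Best parameters: {best_params}")
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```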

🚀 Next Steps

Decision Trees - understand the base learner
SVM - another classic algorithm
Neural Networks - deep-learning methods