[耿直哥 Machine Learning] Project 2: House Price Prediction

Charlie482 · Published 2026-02-10 09:23:04

Kaggle page: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/

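I. Jupyter Notebook Version (with Detailed Comments)

The original heading structure is kept, and the code carries detailed line-by-line comments so that machine learning beginners can follow it.

Project 2: House Price Prediction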

House Prices - Advanced Regression Techniques

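1. Project Overview

Many factors influence house prices. In this data set, 79 features describe almost every aspect of residential homes in Ames, Iowa, and the task is to predict the final sale price of each house. The features include both discrete and continuous variables, and many values are missing. Contrary to common intuition, some factors influence the price far more than seemingly important ones such as the number of bedrooms.

Getting the competition goal right matters, because a misunderstanding at the start sends all later work in the wrong direction. The task is to predict a SalePrice for every row of the test set and submit it together with its Id. Note that the evaluation metric is root mean squared error (RMSE), covered in Chapter 5 on model evaluation. On Kaggle it is computed between the logarithm of the predicted price and the logarithm of the actual price, so errors on cheap and expensive houses count equally.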
$RMSE=\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_{real}-y_{predict})^2 }$
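As a rough illustration of the metric, the sketch below computes RMSE on log-transformed prices with NumPy, which is how the competition page describes its scoring. The function name `rmse_log` and the sample prices are made up for illustration; they are not part of the competition data.

```python
import numpy as np

def rmse_log(y_real, y_predict):
    """RMSE between the natural logs of the actual and predicted prices."""
    y_real = np.asarray(y_real, dtype=float)
    y_predict = np.asarray(y_predict, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_real) - np.log(y_predict)) ** 2)))

# Hypothetical prices, for illustration only
actual = [200000, 150000, 320000]
predicted = [210000, 140000, 300000]
score = rmse_log(actual, predicted)
print(round(score, 4))
```

Because the error is measured on logs, being 10% off on a cheap house costs about as much as being 10% off on an expensive one.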

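2. Data

train.csv - training set
test.csv - test set
data_description.txt - full description of each column
sample_submission.csv - a benchmark submission from a linear regression on year and month of sale, lot area, and number of bedrooms

The description in data_description is complete but not very intuitive, so the meaning of each column is summarized below for reference:
(The original feature description table is omitted here.)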

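3. Problem-Solving Workflow

(1) Data analysis;
(2) Data preprocessing;
(3) Feature engineering;
(4) Modeling and prediction;
(5) Model evaluation;
(6) Submitting the results.

4. Code Examples

(1) Load the data sets

# Import pandas: the core library for reading, cleaning, and analyzing tabular data
import pandas as pd

# Read the training set from train.csv in the house folder under the current directory
# Note: change the path to wherever the data set is stored on your machine
train_df = pd.read_csv('./house/train.csv')

# Read the test set from test.csv in the same folder
test_df = pd.read_csv('./house/test.csv')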

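(2) Preview the data

# Set pandas display options:
# display.max_columns=500: show up to 500 columns (by default wide tables are truncated)
# display.max_rows=1000: show up to 1000 rows (by default long tables are truncated)
pd.set_option('display.max_columns',500)
pd.set_option('display.max_rows',1000)

# Show the first 5 rows of the training set: a quick look at the format, feature names, and value ranges
train_df.head()

# Show the last 5 rows of the test set: check that its columns match the training set (apart from SalePrice)
test_df.tail()

# Show the shape (rows, columns) of both sets:
# train_df.shape: (1460, 81) means 1460 samples and 81 columns (Id, 79 features, SalePrice)
# test_df.shape: (1459, 80) means 1459 samples and 80 columns (Id, 79 features, no SalePrice)
train_df.shape, test_df.shape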

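(3) Statistical analysis

Basic information

# Show basic information about the training set: non-null count and data type of every column
# Main use: spot missing values (Non-Null Count below the row count) and tell numeric from categorical features
train_df.info()

Note: the output shows 38 numeric columns (int64/float64) and 43 categorical columns (object) in the training set, and several features have missing values (for example, LotFrontage is missing 259 values and Alley 1369).

# Show basic information about the test set, to compare missing values and data types with the training set
test_df.info()

Note: the output shows 37 numeric columns and 43 categorical columns in the test set. The missing-value pattern is similar to the training set, with small differences.

Feature names

# Get all column names of the training set as an array, handy for feature selection later
train_df.columns.values

Feature check

# Compare the columns of the training and test sets:
# Line 1: [x for x in train_df.columns if x not in test_df.columns] gives ['SalePrice'],
# so the training set has exactly one extra column, SalePrice (the prediction target)
# Line 2: [x for x in test_df.columns if x not in train_df.columns] gives [],
# so every test column also exists in the training set
print([x for x in train_df.columns if x not in test_df.columns])
print([x for x in test_df.columns if x not in train_df.columns])

Which features contain missing values

# Count missing values per column in the training set: isnull() marks missing entries as True, sum() counts them
# Main use: find features with many missing values (such as Alley, PoolQC, MiscFeature)
train_df.isnull().sum()

# Count missing values per column in the test set, for comparison with the training set
test_df.isnull().sum()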
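The raw output of isnull().sum() lists all 80-odd columns, most of them with 0. A small helper that keeps only the affected columns and sorts them can make the result easier to read. The sketch below is one possible way to do it; `missing_summary` and the tiny `demo` frame are hypothetical stand-ins, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return columns that contain missing values, sorted by missing count (descending)."""
    counts = df.isnull().sum()
    counts = counts[counts > 0].sort_values(ascending=False)
    return pd.DataFrame({"missing": counts, "ratio": counts / len(df)})

# Tiny made-up frame standing in for train_df
demo = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "Alley": [None, None, None, "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})
summary = missing_summary(demo)
print(summary)
```

Calling missing_summary(train_df) on the real data would put columns such as PoolQC and Alley at the top.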

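Distribution of numeric features

# Summary statistics for numeric features: count, mean, standard deviation, min, 25%/50%/75% quantiles, max
# Main use: see value ranges and spot outliers (for example, the LotArea maximum is far above its 75% quantile)
train_df.describe()

Distribution of categorical features

# Summary statistics for categorical features: count, number of unique values, most frequent value, and its frequency
# Main use: see how values are distributed (for example, the most frequent MSZoning value is RL)
train_df.describe(include=['O'])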

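(4) Visual data analysis

# Import the plotting libraries:
# numpy: numerical computing, used here to build the x positions of the bars
# matplotlib.pyplot: the basic plotting library
# seaborn: a higher-level plotting library built on matplotlib
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Missing-value distribution

# Create a 16x4 figure
plt.figure(figsize=(16, 4))

# First subplot: missing values in the training set
plt.subplot(1, 2, 1)  # 1 row, 2 columns, subplot 1
# x axis: column index (0 to 80); y axis: number of missing values in that column
plt.bar(np.arange(train_df.shape[1]), train_df.isnull().sum().values)
plt.title('Train Set Missing Values')  # subplot title
plt.xlabel('Feature Index')            # x-axis label
plt.ylabel('Missing Count')           # y-axis label

# Second subplot: missing values in the test set
plt.subplot(1, 2, 2)  # 1 row, 2 columns, subplot 2
plt.bar(np.arange(test_df.shape[1]), test_df.isnull().sum().values)
plt.title('Test Set Missing Values')
plt.xlabel('Feature Index')
plt.ylabel('Missing Count')

# Show the figure: compare missing values in the two sets at a glance and spot the worst features
plt.show()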

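SalePrice distribution

# Plot the distribution of SalePrice:
# sns.displot draws a histogram by default; pass kde=True to overlay a kernel density estimate (KDE) curve
sns.displot(train_df.SalePrice)
plt.title('SalePrice Distribution')  # figure title
plt.xlabel('SalePrice')              # x-axis label (price)
plt.ylabel('Count')                 # y-axis label (number of samples)
plt.show()
# Reading the plot: the distribution is right-skewed (long tail); most prices fall between 100,000 and 200,000, and a few expensive houses pull the mean up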

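SalePrice vs. year built

# Create a 16x6 figure
plt.figure(figsize=(16,6))
# Box plot: x axis = year built (YearBuilt), y axis = SalePrice
# A box plot shows the price distribution for each construction year, revealing how the year affects price
sns.boxplot(x = train_df.YearBuilt, y = train_df.SalePrice)
plt.title('SalePrice vs YearBuilt')
plt.xlabel('Year Built')
plt.ylabel('SalePrice')
# Rotate the x tick labels so the years do not overlap
plt.xticks(rotation=90)
plt.show()
# Reading the plot: newer houses tend to have a higher median price; older houses show more spread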

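SalePrice vs. overall quality

plt.figure(figsize=(16,6))
# Box plot: x axis = overall quality rating (OverallQual), y axis = SalePrice
sns.boxplot(x = train_df.OverallQual, y = train_df.SalePrice)
plt.title('SalePrice vs OverallQual')
plt.xlabel('Overall Quality (1-10)')
plt.ylabel('SalePrice')
plt.show()
# Reading the plot: the higher the overall quality rating, the higher the median price; houses rated 10 almost never sell at low prices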

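SalePrice vs. living area

# Scatter plot: x axis = above-ground living area (GrLivArea), y axis = SalePrice
# A scatter plot shows how two continuous variables relate and whether the relation is linear or nonlinear
# (sns.displot with both x and y would draw a 2D histogram instead, so sns.scatterplot is used here)
sns.scatterplot(x = train_df.GrLivArea, y = train_df.SalePrice)
plt.title('SalePrice vs GrLivArea')
plt.xlabel('Above Ground Living Area (sq ft)')
plt.ylabel('SalePrice')
plt.show()
# Reading the plot: price rises with living area, a clear positive correlation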

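Correlation among numeric features

# Compute the correlation matrix; numeric_only=True restricts it to numeric columns
corr = train_df.corr(numeric_only = True)

# Create a 12x9 figure
plt.figure(figsize=(12, 9))
# Draw the correlation matrix as a heatmap
# vmax=0.9: correlations of 0.9 or more map to the top of the color scale
# square=True: draw every cell as a square
sns.heatmap(corr, vmax = 0.9, square = True)
plt.title('Correlation Matrix of Numeric Features')
plt.show()
# Reading the plot: with seaborn's default colormap, lighter cells mean higher correlation; features strongly correlated with SalePrice (such as OverallQual and GrLivArea) stand out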

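Top-10 correlations

# Pick the 10 columns most correlated with SalePrice:
# nlargest(k, 'SalePrice') returns the k rows with the largest values in the SalePrice column
# (SalePrice itself ranks first, with a correlation of 1)
k = 10
cols = corr.nlargest(k, 'SalePrice')['SalePrice'].index

# Correlation matrix of these 10 columns:
cm = np.corrcoef(train_df[cols].values.T)

# Draw the heatmap:
# annot=True: print the coefficient in each cell
# fmt='.2f': two decimal places
# annot_kws={'size': 10}: font size of the annotations
# yticklabels/xticklabels: use the feature names as axis labels
sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f',
            annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.title('Top 10 Features Correlation with SalePrice')
plt.show()
# Reading the plot: OverallQual (overall quality) and GrLivArea (living area) correlate most strongly with SalePrice (above 0.7)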

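(5) Process the data

# Concatenate the features of the training and test sets:
# train_df.iloc[:, 1:-1]: drop column 0 (Id) and the last column (SalePrice), keeping 79 features
# test_df.iloc[:, 1:]: drop column 0 (Id), keeping 79 features
# pd.concat stacks them vertically so both sets get exactly the same preprocessing
combine_features = pd.concat((train_df.iloc[:, 1:-1], test_df.iloc[:, 1:]))

Feature types

# Show the dtypes of the combined features to separate numeric from categorical columns
combine_features.dtypes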

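Numeric features

# Select numeric column names: dtypes != 'object' means not categorical, that is, numeric
numeric_feature_index = combine_features.dtypes[combine_features.dtypes != 'object'].index

# Mean and standard deviation of the numeric features, computed on the training set only:
# the test set must be standardized with training-set statistics to avoid data leakage
numeric_feature_mean = train_df[numeric_feature_index].mean()
numeric_feature_std = train_df[numeric_feature_index].std()

# Standardize: (value - mean) / standard deviation, giving each feature mean 0 and variance 1 on the training set
# Purpose: remove the effect of different units (square feet for areas, integers for years), which helps distance-based models and speeds up convergence
combine_features[numeric_feature_index] = (combine_features[numeric_feature_index] - numeric_feature_mean) / numeric_feature_std

# Fill missing numeric values with 0: after standardization 0 is the training mean, so this amounts to mean imputation
combine_features[numeric_feature_index] = combine_features[numeric_feature_index].fillna(0)

# Inspect the processed numeric features to check standardization and filling
combine_features[numeric_feature_index]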
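To see why filling with 0 after standardization is the same as filling with the training mean beforehand, here is a minimal sketch on made-up numbers. The two small frames and the variable names `route1` and `route2` are purely illustrative.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"LotFrontage": [60.0, 80.0, np.nan, 100.0]})
test = pd.DataFrame({"LotFrontage": [np.nan, 120.0]})

mean = train["LotFrontage"].mean()   # NaN is skipped: (60 + 80 + 100) / 3 = 80
std = train["LotFrontage"].std()     # sample standard deviation (ddof=1) = 20

# Route 1 (as in the text): standardize, then fill missing values with 0
route1 = ((test["LotFrontage"] - mean) / std).fillna(0)

# Route 2: fill with the training mean first, then standardize
route2 = (test["LotFrontage"].fillna(mean) - mean) / std

print(route1.tolist(), route2.tolist())
```

Both routes give the same values, which is why the one-line fillna(0) is enough here.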

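String features

# Demo: what one-hot encoding does (for beginners)
# A small DataFrame with one categorical column that contains a missing value
samples = pd.DataFrame({'name': ['a', 'b', 'c', None]})
samples  # display the demo data

# One-hot encoding with pd.get_dummies
# dummy_na=True: treat missing values as a category of their own (adds a name_nan column)
# Purpose: turn categorical features into numeric ones, since the models only accept numbers
pd.get_dummies(samples, dummy_na = True)

# One-hot encode the combined features:
# every categorical column becomes a set of indicator columns, with missing values as their own category
# dtype=float makes the indicator columns 0.0/1.0 (recent pandas versions default to True/False),
# so that .values later yields a purely numeric array
combine_features = pd.get_dummies(combine_features, dummy_na = True, dtype = float)

# Inspect the result: each categorical feature is now several 0/1 columns (MSZoning becomes MSZoning_C (all), MSZoning_FV, and so on)
combine_features

Missing-value check

# Largest per-column missing count across all features
# 0 means every missing value has been filled and preprocessing is complete
combine_features.isnull().sum().max()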

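Split back into training and test sets

# Number of training samples: n_train = 1460
n_train = train_df.shape[0]

# Split the combined features:
# train_features: the first 1460 rows (training features), as an array
# test_features: the remaining 1459 rows (test features), as an array
train_features = combine_features[:n_train].values
test_features = combine_features[n_train:].values

# Training labels (sale prices):
train_labels = train_df.SalePrice.values

# Log-transform the labels: np.log10(train_labels + 1)
# Purpose: make the right-skewed price distribution closer to normal, which usually helps regression
# +1: keeps the logarithm defined even if a price were 0
train_labels = np.log10(train_labels + 1)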
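The transform has to be undone exactly before submission, so it is worth checking the round trip once. The sketch below does that on a few made-up prices and also compares the base-10 version used in this article with NumPy's natural-log pair np.log1p / np.expm1; the sample values are illustrative only.

```python
import numpy as np

prices = np.array([34900.0, 129975.0, 163000.0, 214000.0, 755000.0])  # made-up sample

log_prices = np.log10(prices + 1)          # forward transform used in this article
restored = np.power(10, log_prices) - 1    # inverse transform used before submission

# np.log1p / np.expm1 are the natural-log equivalents of the same idea
restored_ln = np.expm1(np.log1p(prices))

# The two log scales differ only by the constant factor ln(10)
ratio = np.log1p(prices) / log_prices
print(np.allclose(restored, prices), np.allclose(restored_ln, prices), ratio[0])
```

One consequence: an RMSE measured on log10 labels is smaller than the same error on natural-log labels by a factor of ln(10), about 2.303, so cross-validation scores on the log10 scale are not directly comparable with a leaderboard score computed on natural logs.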

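(6) PCA dimensionality reduction

# Import PCA (principal component analysis) from sklearn for dimensionality reduction
from sklearn.decomposition import PCA

# Create the PCA model: n_components=0.96 keeps enough components to explain 96% of the variance
# Purpose: fewer dimensions, less computation, lower risk of overfitting
pca = PCA(0.96)

# Fit PCA on the training features only (avoids leaking test-set information)
pca.fit(train_features)

# Number of components kept:
# the output is 88, so 88 principal components are needed for 96% of the variance (down from roughly 300+ features)
pca.n_components_

# Project both sets onto the principal components:
train_features_pca = pca.transform(train_features)
test_features_pca = pca.transform(test_features)

# Shapes after reduction:
# train_features_pca.shape: (1460, 88), 1460 samples with 88 features
# test_features_pca.shape: (1459, 88), 1459 samples with 88 features
train_features_pca.shape, test_features_pca.shape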
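To make the meaning of a fractional n_components more concrete, the following sketch fits PCA on synthetic data and looks at the cumulative explained variance. The data, the random seed, and the names `latent` and `mixing` are assumptions for the demo and unrelated to the housing features.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 10 features, driven mostly by 3 latent factors
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

pca = PCA(n_components=0.96)   # float in (0, 1): keep enough components for 96% of the variance
pca.fit(X)

# Running total of the variance share explained by the kept components
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(pca.n_components_, cumulative)
```

The last entry of `cumulative` is the first running total that reaches the 96% threshold, which is how PCA decides how many components to keep.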

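(7) Model building and evaluation

Cross-validated RMSE

# Import the cross-validation helper from sklearn
from sklearn.model_selection import cross_val_score

# RMSE under cross-validation:
# model: the model to evaluate
# x: features
# y: labels
# cv: number of folds (default 5)
def rmse_cv(model, x, y, cv = 5):
    # cross_val_score returns one score per fold
    # scoring="neg_mean_squared_error": negative MSE (sklearn scorers follow "higher is better", so errors are negated)
    # -cross_val_score(...): flip the sign to get the MSE
    # np.sqrt(...): take the square root to get the RMSE
    return np.sqrt(-cross_val_score(model, x, y, scoring="neg_mean_squared_error", cv = cv))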
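One caveat: in this notebook PCA is fitted once on the whole training set before cross-validation, so each validation fold has already influenced the projection. The effect is probably small for an unsupervised step, but a stricter variant wraps PCA and the model in a pipeline so that PCA is refitted inside every fold. The sketch below shows the idea on synthetic data; the data and the choice of LinearRegression are assumptions for the demo, not the author's setup.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)
X = rng.normal(size=(120, 8))                              # synthetic features
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=120)    # synthetic target

# PCA is re-fitted on the training part of every fold
model = make_pipeline(PCA(n_components=0.96), LinearRegression())
scores = np.sqrt(-cross_val_score(model, X, y,
                                  scoring="neg_mean_squared_error", cv=5))
print(scores.mean(), scores.std())
```

With the real data, the same pipeline could be passed to rmse_cv together with train_features instead of train_features_pca. scikit-learn also offers scoring="neg_root_mean_squared_error", which removes the need for the manual square root.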

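KNN

# Import the k-nearest-neighbors regressor
from sklearn.neighbors import KNeighborsRegressor

# 5-fold cross-validation of KNN:
# n_neighbors=10: use the 10 nearest neighbors
# weights='distance': weight neighbors by inverse distance (closer samples count more)
# p=2: Euclidean distance
cv_knn = rmse_cv(KNeighborsRegressor(n_neighbors=10, weights='distance', p=2), train_features_pca, train_labels)

# Mean and standard deviation of the RMSE:
# mean: average prediction error across folds
# standard deviation: stability across folds (smaller is more stable)
cv_knn.mean(), cv_knn.std()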

Linear regression

# Import the linear regression model
from sklearn.linear_model import LinearRegression

# 5-fold cross-validation of linear regression
cv_linear = rmse_cv(LinearRegression(), train_features_pca, train_labels)
cv_linear.mean(), cv_linear.std()

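Elastic net

# Import the elastic net regressor (linear regression with combined L1 and L2 regularization)
from sklearn.linear_model import ElasticNet

# 5-fold cross-validation of the elastic net:
# alpha=0.001: regularization strength (smaller means weaker regularization)
# l1_ratio=0.4: share of the L1 penalty (0 = pure L2, 1 = pure L1)
cv_ela = rmse_cv(ElasticNet(alpha=0.001, l1_ratio=0.4), train_features_pca, train_labels)
cv_ela.mean(), cv_ela.std()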

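SVM

# Import the support vector regressor
from sklearn.svm import SVR

# 5-fold cross-validation of SVR:
# C=0.3: regularization parameter; the regularization strength is inversely proportional to C
# epsilon=0.01: width of the epsilon-insensitive tube (errors below 0.01 incur no loss)
cv_svm = rmse_cv(SVR(C=0.3, epsilon=0.01), train_features_pca, train_labels)
cv_svm.mean(), cv_svm.std()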

Random forest

# Import the random forest regressor
from sklearn.ensemble import RandomForestRegressor

# 5-fold cross-validation of the random forest:
# n_estimators=100: number of trees
# max_samples=0.75: each tree is trained on a bootstrap sample of 75% of the rows (reduces overfitting)
cv_rf = rmse_cv(RandomForestRegressor(n_estimators=100, max_samples=0.75), train_features_pca, train_labels)
cv_rf.mean(), cv_rf.std()

Ensemble learning

# Import the voting regressor (an ensemble method)
from sklearn.ensemble import VotingRegressor

# Build the voting regressor: it averages the predictions of the base models (equal weights by default)
vr = VotingRegressor([
    ('knn', KNeighborsRegressor(n_neighbors=10, weights='distance', p=2)),
    ('linear', LinearRegression()),
    ('ela', ElasticNet(alpha=0.001, l1_ratio=0.4)),
    ('svm', SVR(C=0.3, epsilon=0.01)),
    ('rf', RandomForestRegressor(n_estimators=100, max_samples=0.75))
])

# 5-fold cross-validation of the voting regressor: an ensemble often, though not always, beats its individual members
cv_vr = rmse_cv(vr, train_features_pca, train_labels)
cv_vr.mean(), cv_vr.std()
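What the voting regressor does with equal weights can be checked directly: its prediction is the plain average of the base models' predictions. The sketch below verifies this on synthetic data with two base models; the data and the name `vr_demo` are made up for the illustration.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X[:, 0] * 2.0 - X[:, 1] + 0.1 * rng.normal(size=60)

vr_demo = VotingRegressor([
    ("linear", LinearRegression()),
    ("knn", KNeighborsRegressor(n_neighbors=5)),
])
vr_demo.fit(X, y)

# Predictions of the fitted clones, one column per base model
individual = np.column_stack([est.predict(X) for est in vr_demo.estimators_])
combined = vr_demo.predict(X)
print(np.allclose(combined, individual.mean(axis=1)))
```

Passing a weights list to VotingRegressor turns the plain mean into a weighted one, which can help when one base model is clearly stronger than the others.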

(8) Model evaluation

# Collect the mean RMSE of every model in a DataFrame:
models = pd.DataFrame({
    'Model': [ 'KNN', 'Linear Regression', 'ElasticNet',
              'SVM', 'Random Forest', 'Voting Regressor'],
    'Score': [cv_knn.mean(), cv_linear.mean(), cv_ela.mean(),
              cv_svm.mean(), cv_rf.mean(), cv_vr.mean()]})

# Sort by RMSE in ascending order: the smaller the RMSE, the better the model
models.sort_values(by='Score', ascending=True)

Note: in the output, the Voting Regressor has the smallest RMSE, making it the best model in this comparison.

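(9) Save the predictions for submission

Predictions

# Fit the best model (the voting regressor) on the full training set
vr.fit(train_features_pca, train_labels)

# Predict the test-set labels:
y_pred = vr.predict(test_features_pca)

# Undo the log transform: np.power(10, y_pred) - 1
# The labels were transformed with np.log10(train_labels + 1), so the inverse gives prices on the original scale
y_pred = np.power(10, y_pred) - 1

# Inspect the predictions:
y_pred

Save the results

# Build the submission: Id plus the predicted SalePrice
submission = pd.DataFrame({
    'Id': test_df.Id,  # test-set Id
    'SalePrice': y_pred  # predicted price
})

# Save as CSV: index = False leaves out the row index (as the Kaggle submission format requires)
submission.to_csv("./house/submission.csv", index = False)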

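(10) Further improvements

Handle outliers, such as extreme or anomalous values
Drop uninformative features, such as Utilities, which is nearly constant
Treat missing values more carefully, for example by filling with the mode or the median
Convert some string features to numbers: ratings such as basement height use Excellent, Good, and so on, which have a natural order
Combine features to create new ones, for example total basement area + first-floor area + second-floor area = total floor area
Try other ensemble methods, such as Stacking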
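Two of these ideas, ordinal encoding of a quality rating and a combined area feature, can be sketched as follows. The column names (BsmtQual, TotalBsmtSF, 1stFlrSF, 2ndFlrSF) exist in the data set, but the rows, the numeric scale in `quality_scale`, and the new column names `BsmtQual_ord` and `TotalSF` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows using real column names from the data set
demo = pd.DataFrame({
    "BsmtQual": ["Ex", "TA", np.nan, "Gd"],
    "TotalBsmtSF": [856, 1262, 0, 920],
    "1stFlrSF": [856, 1262, 920, 961],
    "2ndFlrSF": [854, 0, 866, 756],
})

# Assumed ordinal scale: higher is better, 0 stands for a missing rating (no basement)
quality_scale = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
demo["BsmtQual_ord"] = demo["BsmtQual"].map(quality_scale).fillna(0).astype(int)

# Combined feature: total floor area
demo["TotalSF"] = demo["TotalBsmtSF"] + demo["1stFlrSF"] + demo["2ndFlrSF"]
print(demo[["BsmtQual_ord", "TotalSF"]])
```

In the real pipeline these steps would run on combine_features before standardization and one-hot encoding, so that the new numeric columns are scaled like the others.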

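II. Key Concepts Explained

2.1 Key concepts

Concept	Definition	Typical use
Root mean squared error (RMSE)	$RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^n(y_{real}-y_{predict})^2}$ ; measures the prediction error of a regression model	Evaluating regression tasks (such as house price prediction)
Standardization	Rescales a numeric feature to mean 0 and variance 1 ( $x'=(x-\mu)/\sigma$ ), removing the effect of units	Distance-based models (KNN, SVM), linear models
One-hot encoding	Turns a categorical feature into a 0/1 matrix; missing values can be their own category	Making categorical features numeric (models only accept numbers)
PCA	Principal component analysis; reduces dimensionality while keeping most of the variance	High-dimensional features, less computation, lower overfitting risk
Cross-validation	Splits the data into k folds, then trains on k-1 folds and validates on the remaining one in turn, to estimate generalization	Model evaluation that does not depend on one lucky split
Ensemble learning (Voting)	Combines the predictions of several models (averaging/voting) for more stable and accurate predictions	When single models fall short and their strengths can be combined
Log transform	Takes the logarithm of a right-skewed label (such as price) so that it is closer to normal	Improving the label distribution in regression tasks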

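2.2 Core libraries and functions

Library/function	Purpose
`pd.read_csv()`	Read a CSV data set
`pd.set_option()`	Set pandas display options (such as how many columns/rows to show)
`df.info()`	Show basic information about a data set (non-null counts, data types)
`df.isnull().sum()`	Count missing values per column
`df.corr(numeric_only=True)`	Compute the correlation matrix of the numeric features
`sns.boxplot()`	Draw a box plot of a continuous label against a categorical feature
`sns.heatmap()`	Draw a heatmap, for example of a correlation matrix
`pd.get_dummies()`	One-hot encode categorical features
`PCA(n_components=0.96)`	Create a PCA model that keeps 96% of the variance
`cross_val_score()`	Cross-validated scoring with a chosen metric (such as neg_mean_squared_error)
`VotingRegressor()`	Voting regressor that combines the predictions of several regression models
`df.to_csv()`	Save a DataFrame as a CSV file (as the Kaggle submission format requires)

2.3 Core workflow of house price prediction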

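III. PyCharm Version (no `if __name__` guard, runs as a plain script)

Note: install the dependencies first with pip install pandas numpy matplotlib seaborn scikit-learn, and adjust the data set paths.

# Import the core libraries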
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, VotingRegressor

# -------------------------- (1) Load the data sets --------------------------
# Note: change these to your local data set paths (absolute paths recommended)
train_df = pd.read_csv("D:/house/train.csv")
test_df = pd.read_csv("D:/house/test.csv")

# -------------------------- (2) Preview the data --------------------------
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 1000)
print("First 5 rows of the training set:")
print(train_df.head())
print("\nLast 5 rows of the test set:")
print(test_df.tail())
print("\nShapes:", train_df.shape, test_df.shape)

# -------------------------- (3) Statistical analysis --------------------------
print("\nTraining set info:")
train_df.info()
print("\nTest set info:")
test_df.info()

print("\nMissing values in the training set:")
print(train_df.isnull().sum())
print("\nMissing values in the test set:")
print(test_df.isnull().sum())

print("\nNumeric feature statistics:")
print(train_df.describe())
print("\nCategorical feature statistics:")
print(train_df.describe(include=['O']))

# -------------------------- (4) Visual data analysis --------------------------
# Missing-value distribution
plt.figure(figsize=(16, 4))
plt.subplot(1, 2, 1)
plt.bar(np.arange(train_df.shape[1]), train_df.isnull().sum().values)
plt.title('Train Set Missing Values')
plt.xlabel('Feature Index')
plt.ylabel('Missing Count')

plt.subplot(1, 2, 2)
plt.bar(np.arange(test_df.shape[1]), test_df.isnull().sum().values)
plt.title('Test Set Missing Values')
plt.xlabel('Feature Index')
plt.ylabel('Missing Count')
plt.show()

# SalePrice distribution
sns.displot(train_df.SalePrice)
plt.title('SalePrice Distribution')
plt.xlabel('SalePrice')
plt.ylabel('Count')
plt.show()

# SalePrice vs. year built
plt.figure(figsize=(16, 6))
sns.boxplot(x=train_df.YearBuilt, y=train_df.SalePrice)
plt.title('SalePrice vs YearBuilt')
plt.xlabel('Year Built')
plt.ylabel('SalePrice')
plt.xticks(rotation=90)
plt.show()

# SalePrice vs. overall quality
plt.figure(figsize=(16, 6))
sns.boxplot(x=train_df.OverallQual, y=train_df.SalePrice)
plt.title('SalePrice vs OverallQual')
plt.xlabel('Overall Quality (1-10)')
plt.ylabel('SalePrice')
plt.show()

# SalePrice vs. living area (scatter plot; sns.displot with x and y would draw a 2D histogram)
sns.scatterplot(x=train_df.GrLivArea, y=train_df.SalePrice)
plt.title('SalePrice vs GrLivArea')
plt.xlabel('Above Ground Living Area (sq ft)')
plt.ylabel('SalePrice')
plt.show()

# Correlation matrix of numeric features
corr = train_df.corr(numeric_only=True)
plt.figure(figsize=(12, 9))
sns.heatmap(corr, vmax=0.9, square=True)
plt.title('Correlation Matrix of Numeric Features')
plt.show()

# Top-10 correlations
k = 10
cols = corr.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(train_df[cols].values.T)
sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f',
            annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.title('Top 10 Features Correlation with SalePrice')
plt.show()

# -------------------------- (5) Process the data --------------------------
# Concatenate the features
combine_features = pd.concat((train_df.iloc[:, 1:-1], test_df.iloc[:, 1:]))

# Numeric features
numeric_feature_index = combine_features.dtypes[combine_features.dtypes != 'object'].index
numeric_feature_mean = train_df[numeric_feature_index].mean()
numeric_feature_std = train_df[numeric_feature_index].std()
combine_features[numeric_feature_index] = (combine_features[numeric_feature_index] - numeric_feature_mean) / numeric_feature_std
combine_features[numeric_feature_index] = combine_features[numeric_feature_index].fillna(0)

# Categorical features (dtype=float keeps the indicator columns numeric instead of boolean)
combine_features = pd.get_dummies(combine_features, dummy_na=True, dtype=float)

# Check for remaining missing values
print("\nLargest per-column missing count after filling:", combine_features.isnull().sum().max())

# Split back into training and test sets
n_train = train_df.shape[0]
train_features = combine_features[:n_train].values
test_features = combine_features[n_train:].values
train_labels = train_df.SalePrice.values
train_labels = np.log10(train_labels + 1)

# -------------------------- (6) PCA dimensionality reduction --------------------------
pca = PCA(0.96)
pca.fit(train_features)
print("\nNumber of components kept by PCA:", pca.n_components_)
train_features_pca = pca.transform(train_features)
test_features_pca = pca.transform(test_features)
print("Shapes after PCA:", train_features_pca.shape, test_features_pca.shape)

# -------------------------- (7) Model building and evaluation --------------------------
# RMSE under cross-validation
def rmse_cv(model, x, y, cv=5):
    return np.sqrt(-cross_val_score(model, x, y, scoring="neg_mean_squared_error", cv=cv))

# KNN
cv_knn = rmse_cv(KNeighborsRegressor(n_neighbors=10, weights='distance', p=2), train_features_pca, train_labels)
print("\nKNN RMSE:", cv_knn.mean(), cv_knn.std())

# Linear regression
cv_linear = rmse_cv(LinearRegression(), train_features_pca, train_labels)
print("Linear regression RMSE:", cv_linear.mean(), cv_linear.std())

# Elastic net
cv_ela = rmse_cv(ElasticNet(alpha=0.001, l1_ratio=0.4), train_features_pca, train_labels)
print("Elastic net RMSE:", cv_ela.mean(), cv_ela.std())

# SVM
cv_svm = rmse_cv(SVR(C=0.3, epsilon=0.01), train_features_pca, train_labels)
print("SVM RMSE:", cv_svm.mean(), cv_svm.std())

# Random forest
cv_rf = rmse_cv(RandomForestRegressor(n_estimators=100, max_samples=0.75), train_features_pca, train_labels)
print("Random forest RMSE:", cv_rf.mean(), cv_rf.std())

# Ensemble learning
vr = VotingRegressor([
    ('knn', KNeighborsRegressor(n_neighbors=10, weights='distance', p=2)),
    ('linear', LinearRegression()),
    ('ela', ElasticNet(alpha=0.001, l1_ratio=0.4)),
    ('svm', SVR(C=0.3, epsilon=0.01)),
    ('rf', RandomForestRegressor(n_estimators=100, max_samples=0.75))
])
cv_vr = rmse_cv(vr, train_features_pca, train_labels)
print("Voting regressor RMSE:", cv_vr.mean(), cv_vr.std())

# -------------------------- (8) Model evaluation --------------------------
models = pd.DataFrame({
    'Model': ['KNN', 'Linear Regression', 'ElasticNet', 'SVM', 'Random Forest', 'Voting Regressor'],
    'Score': [cv_knn.mean(), cv_linear.mean(), cv_ela.mean(), cv_svm.mean(), cv_rf.mean(), cv_vr.mean()]
})
print("\nModel scores (ascending):")
print(models.sort_values(by='Score', ascending=True))

# -------------------------- (9) Save the predictions --------------------------
# Fit the best model
vr.fit(train_features_pca, train_labels)

# Predict
y_pred = vr.predict(test_features_pca)
y_pred = np.power(10, y_pred) - 1
print("\nFirst 5 test-set predictions:", y_pred[:5])

# Save the submission file
submission = pd.DataFrame({
    'Id': test_df.Id,
    'SalePrice': y_pred
})
submission.to_csv("D:/house/submission.csv", index=False)
print("\nSubmission file saved to: D:/house/submission.csv")

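Summary

Core workflow: the pipeline is "data preprocessing (standardization + one-hot encoding) → dimensionality reduction (PCA) → model ensembling (VotingRegressor)"; in this experiment the ensemble outperformed every single model;
Key preprocessing: standardization removes the effect of units on numeric features, one-hot encoding turns categorical features into numbers, and the log transform reduces the skew of the price distribution;
Evaluation: RMSE measures regression quality and cross-validation makes the estimate more reliable; ensembling lowered the RMSE here.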
