[Kaggle] [Machine Learning Project] [Decision Trees and Random Forests] Intro to Machine Learning: Basic Steps for Training and Validating a Model

唐唐无糖 · Published 2021-01-08 21:23:30

Notes from a machine learning beginner

Link to the Kaggle Intro to Machine Learning course:
https://www.kaggle.com/learn/intro-to-machine-learning
The most basic workflow:
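1. Extract the X and y data

	y is the target you want to predict (for example, the house price). X is the set of other features you believe are most useful for predicting y (for example, floor area, number of windows, number of bathrooms).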

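2. Use train_test_split to divide the dataset into a training set (train_X, train_y) and a test (validation) set (val_X, val_y)
3. Choose a machine learning model (a decision tree or a random forest) and tune its parameters to avoid underfitting and overfitting.

	3.1 A random forest lets many decision trees each make their own prediction and then combines them: a majority vote for classification, or the average of the predictions for regression
	3.2 Tuning includes adjusting settings such as the maximum number of leaf nodes of a decision tree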

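4. model.fit(train_X, train_y)
5. pre_y = model.predict(val_X) # the model's predictions for y
6. Compare the predicted y with val_y (which plays the role of the ground truth). The smaller the gap, the closer the model's predictions are to reality.

	This gap is measured by an error metric (a loss); common choices are MSE (mean squared error) and MAE (mean absolute error)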
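To make step 6 concrete, here is a minimal sketch that computes MAE and MSE by hand. The four prices and predictions are made-up toy numbers, and the two helper function names are my own, not part of scikit-learn.

```python
# Hypothetical toy numbers (not taken from the Iowa housing data):
# true prices and model predictions for four houses.
val_y = [200_000, 150_000, 320_000, 95_000]
pre_y = [210_000, 140_000, 300_000, 100_000]


def mean_absolute_error_manual(y_true, y_pred):
    """Average of |true - predicted|; expressed in the same units as y."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error_manual(y_true, y_pred):
    """Average of (true - predicted)^2; large misses are penalized more."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


mae = mean_absolute_error_manual(val_y, pre_y)
mse = mean_squared_error_manual(val_y, pre_y)
print(f"MAE: {mae:,.0f}")  # MAE: 11,250
print(f"MSE: {mse:,.0f}")  # MSE: 156,250,000
```

MAE reads directly as "off by about 11,250 on average", while MSE is in squared units and is dominated by the single 20,000 miss.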

Using a decision tree and a random forest:

# Code you have previously used to load data
# 1. Import pandas and the scikit-learn model and utility classes
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Set up code checking (Kaggle course notebook only: learntools and the
# ../input paths are provided by that environment)
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv")
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex7 import *

# Path of the file to read. We changed the directory structure to simplify submitting to a competition
# 2. Load the dataset and extract X and y
iowa_file_path = '../input/train.csv'

home_data = pd.read_csv(iowa_file_path)
# Create target object and call it y
y = home_data.SalePrice
# Create X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

# Split into validation and training data
# 3. Hold out a validation set (a single train/validation split, not k-fold cross-validation)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Specify Model
# 4. Specify the machine learning model and train it
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
iowa_model.fit(train_X, train_y)

# Make validation predictions and calculate mean absolute error
# 5. Predict on the validation set and evaluate (arguments are y_true, y_pred)
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE when not specifying max_leaf_nodes: {:,.0f}".format(val_mae))

# Using best value for max_leaf_nodes
# 4.1. Parameters can be tuned when building the model, to find the values that minimize the validation error
iowa_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=1)
iowa_model.fit(train_X, train_y)
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE for best value of max_leaf_nodes: {:,.0f}".format(val_mae))

# Define the model. Set random_state to 1
# 4.2. A different machine learning model can be used instead; here, a random forest
rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)
rf_val_predictions = rf_model.predict(val_X)
rf_val_mae = mean_absolute_error(val_y, rf_val_predictions)

print("Validation MAE for Random Forest Model: {:,.0f}".format(rf_val_mae))
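
The listing above uses max_leaf_nodes=100 without showing how such a value might be found. A rough sketch of one way to search for it follows, together with a quick check that a random forest regressor's prediction is the average of its trees. Everything here is an assumption for illustration: the data is synthetic (make_regression stands in for the Iowa CSV), and the helper name get_mae, the candidate sizes, and n_estimators=20 are my own choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: the post itself uses the Iowa housing CSV).
X, y = make_regression(n_samples=500, n_features=7, noise=10.0, random_state=1)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)


def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree with the given leaf limit and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))


# Hypothetical candidate sizes: too few leaves tends to underfit,
# too many tends to overfit.
candidates = [5, 25, 50, 100, 250, 500]
scores = {n: get_mae(n, train_X, val_X, train_y, val_y) for n in candidates}
best_size = min(scores, key=scores.get)
for n, mae in scores.items():
    print(f"max_leaf_nodes={n:>3}  validation MAE={mae:,.1f}")
print("Best max_leaf_nodes:", best_size)

# For regression, the forest's prediction is the mean of its trees' predictions.
rf_model = RandomForestRegressor(n_estimators=20, random_state=1)
rf_model.fit(train_X, train_y)
per_tree = np.stack([tree.predict(val_X) for tree in rf_model.estimators_])
forest_is_mean = np.allclose(per_tree.mean(axis=0), rf_model.predict(val_X))
print("Forest prediction equals mean of trees:", forest_is_mean)
```

The best size depends entirely on the dataset, so the value chosen on this synthetic data says nothing about the housing data; the point is only the pattern of scoring each candidate on the validation set and keeping the one with the lowest MAE.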