Machine Learning Beginner's Summary Notes

Kaggle Intro to Machine Learning course link:
https://www.kaggle.com/learn/intro-to-machine-learning
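The most basic workflow:
1. Extract the X and y data

	y is the target you want to predict (e.g. house price); X holds the other features you believe are most useful for predicting y (e.g. area, number of windows, number of bathrooms)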

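2. Use train_test_split to split the dataset into a training set (train_X, train_y) and a validation set (val_X, val_y)
3. Choose a machine learning model (e.g. a decision tree or a random forest) and tune its parameters to guard against underfitting and overfitting.

	3.1 A random forest lets many decision trees each make their own prediction and then combines them: a majority vote for classification, or the average of the predictions for regression
	3.2 Tuning includes adjusting, for example, the number of leaf nodes in a decision tree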

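4. model.fit(train_X, train_y)
5. pre_y = model.predict(val_X)  # the model's predicted y values
6. Compare the predicted y against val_y (which plays the role of the ground truth); the smaller the gap, the closer the model's predictions are to reality

	This gap is measured with an error metric (loss), commonly MSE (mean squared error) or MAE (mean absolute error)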
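To make step 6 concrete, here is a minimal sketch of the two metrics computed by hand in plain Python. The prices are made-up numbers for illustration only, and `mae` and `mse` are hypothetical helper names, not scikit-learn functions (for real work, `sklearn.metrics` provides `mean_absolute_error` and `mean_squared_error`).

```python
def mae(y_true, y_pred):
    """Mean absolute error: the average of |truth - prediction|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: the average of (truth - prediction) ** 2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up house prices, purely illustrative
val_y = [200000, 150000, 320000]   # "ground truth" from the validation set
pre_y = [210000, 140000, 300000]   # what a model might have predicted

print("MAE:", mae(val_y, pre_y))
print("MSE:", mse(val_y, pre_y))
```

MAE stays in the same unit as y (here, dollars), which makes it easy to read. MSE squares each error, so a few large misses weigh much more heavily than many small ones.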
  • Using a decision tree and a random forest:
# Code you have previously used to load data
# 1. Import the machine learning packages
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")  
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv") 
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex7 import *

# Path of the file to read. We changed the directory structure to simplify submitting to a competition
# 2. Load the dataset and extract X and y
iowa_file_path = '../input/train.csv'

home_data = pd.read_csv(iowa_file_path)
# Create target object and call it y
y = home_data.SalePrice
# Create X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

# Split into validation and training data
# 3. Build the hold-out validation set (a single train/validation split, not k-fold cross-validation)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Specify Model
# 4. Specify/build the machine learning model and train it
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
iowa_model.fit(train_X, train_y)

# Make validation predictions and calculate mean absolute error
# 5. Predict with the model and evaluate it
val_predictions = iowa_model.predict(val_X)
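# scikit-learn's signature is mean_absolute_error(y_true, y_pred); MAE is symmetric, so the value is unchanged
val_mae = mean_absolute_error(val_y, val_predictions)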
print("Validation MAE when not specifying max_leaf_nodes: {:,.0f}".format(val_mae))

# Using best value for max_leaf_nodes
# 4.1 When building the model you can tune its parameters to find the value that gives the smallest validation error
iowa_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=1)
iowa_model.fit(train_X, train_y)
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE for best value of max_leaf_nodes: {:,.0f}".format(val_mae))

# Define the model. Set random_state to 1
# 4.2 You can also specify/build a different machine learning model; here a random forest is used
rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)
rf_val_predictions = rf_model.predict(val_X)
rf_val_mae = mean_absolute_error(val_y, rf_val_predictions)

print("Validation MAE for Random Forest Model: {:,.0f}".format(rf_val_mae))
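The listing above hard-codes max_leaf_nodes=100 as the best value. One way such a value could be found is to try several candidates and keep the one with the lowest validation MAE. The sketch below does that on synthetic data, which is an assumption made so that it runs without the Kaggle CSV; the helper name `get_mae` and the candidate list are illustrative choices, not taken from the notes.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: the notes use the Iowa housing CSV instead)
rng = np.random.RandomState(1)
X = rng.uniform(0, 10, size=(500, 3))
y = 3 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.5, size=500)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)


def get_mae(max_leaf_nodes):
    """Fit a tree with the given leaf limit and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))


# Candidate values to try (an illustrative choice)
candidate_sizes = [5, 25, 50, 100, 250, 500]
scores = {size: get_mae(size) for size in candidate_sizes}

# Keep the candidate with the smallest validation error
best_size = min(scores, key=scores.get)
print("Best max_leaf_nodes:", best_size)
print("Validation MAE: {:,.3f}".format(scores[best_size]))
```

Too few leaves tends to underfit and too many tends to overfit, so the validation error is often lowest somewhere in between. Which value wins depends on the data, so the result on the real housing data may differ from what this synthetic example reports.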
