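1. Confusion Matrix

Concept: a table for evaluating a classification model that cross-tabulates predicted labels against true labels

Purpose

  • Quantifies how accurately the model classifies

  • Identifies the type of error (false positives vs. false negatives)

  • Serves as the basis for many evaluation metrics

Key terms

  • TP (True Positive): actual positives correctly predicted as positive

  • FP (False Positive): actual negatives incorrectly predicted as positive

  • FN (False Negative): actual positives incorrectly predicted as negative

  • TN (True Negative): actual negatives correctly predicted as negative

  • Function call (caret):

    confusionMatrix(predicted, actual, positive = "positive class label")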

Derived metrics

  • Accuracy = (TP+TN)/(TP+FP+FN+TN)

  • Precision = TP/(TP+FP)

  • Recall (Sensitivity) = TP/(TP+FN)

  • F1 score = 2×(Precision×Recall)/(Precision+Recall)
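
The formulas above can be checked with a few lines of base R. This is only a sketch: the counts (TP = 40, FP = 10, FN = 20, TN = 30) are made up for illustration, and the helper name `derived_metrics` is hypothetical, not part of caret.

```r
# Hypothetical helper: derive the four metrics from confusion-matrix counts.
derived_metrics <- function(TP, FP, FN, TN) {
  accuracy  <- (TP + TN) / (TP + FP + FN + TN)
  precision <- TP / (TP + FP)
  recall    <- TP / (TP + FN)
  f1        <- 2 * precision * recall / (precision + recall)
  list(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
}

# Made-up counts, for illustration only
metrics <- derived_metrics(TP = 40, FP = 10, FN = 20, TN = 30)
print(round(unlist(metrics), 3))
```

With these counts accuracy is 0.7 and precision is 0.8, while recall is only about 0.667, which shows why a single metric rarely tells the whole story.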


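2. ROC Curve and AUC

Concept

  • ROC curve: plots the true positive rate (TPR) against the false positive rate (FPR) across all classification thresholds

  • AUC: the area under the ROC curve, a measure of how well the model separates the two classes

Purpose

  • Evaluates the model's overall discrimination

  • Does not depend on any single classification threshold

  • Allows different models to be compared

Key functions (pROC)

roc(response, predictor)  # build the ROC object
auc(roc_object)          # compute the AUC
ci.auc(roc_object)       # compute the confidence interval of the AUC

Interpretation (common rules of thumb)

  • AUC=0.5: no discrimination (equivalent to random guessing)

  • 0.5<AUC<0.7: poor discrimination

  • 0.7≤AUC<0.8: acceptable discrimination

  • 0.8≤AUC<0.9: good discrimination

  • AUC≥0.9: excellent discrimination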
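
One way to see why the AUC does not depend on a threshold: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The base-R sketch below computes it from ranks. The helper name `auc_by_rank` and the toy data are assumptions for illustration; in practice the pROC functions listed above would be used.

```r
# Hypothetical helper: AUC as the probability that a random positive
# outranks a random negative (ties count as 1/2 through average ranks).
auc_by_rank <- function(labels, scores) {
  r  <- rank(scores)                 # average ranks handle ties
  n1 <- sum(labels == 1)             # number of positives
  n0 <- sum(labels == 0)             # number of negatives
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy data: 3 of the 4 positive/negative pairs are ordered correctly
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)
auc_toy <- auc_by_rank(labels, scores)
print(auc_toy)   # 0.75
```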


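3. Calibration Curve

Concept: a plot that assesses how well predicted probabilities agree with observed event rates

Purpose

  • Checks whether the predicted probabilities are reliable

  • Reveals over-confident or over-conservative predictions

  • Diagnoses the calibration quality of the model

Implementation

ggplot(data, aes(x = predicted_prob, y = observed_outcome)) +
  geom_smooth() +          # fitted curve
  geom_abline()            # ideal reference line (slope 1, intercept 0)

Interpretation

  • Curve close to the diagonal: well calibrated

  • Curve above the diagonal: predicted probabilities underestimate the actual risk

  • Curve below the diagonal: predicted probabilities overestimate the actual risk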
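
Besides a smoothed curve, calibration is often inspected by binning: group the predictions into intervals and compare the mean predicted probability with the observed event rate in each bin. The sketch below uses simulated data in which the outcome is drawn from the predicted probability itself, so the points should fall near the diagonal. All variable names are illustrative assumptions.

```r
set.seed(1)
n <- 1000
pred_prob <- runif(n)                               # simulated predicted probabilities
outcome   <- rbinom(n, size = 1, prob = pred_prob)  # outcomes drawn so the predictions are well calibrated

# Group the predictions into ten equal-width bins
bins <- cut(pred_prob, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)

calib <- data.frame(
  mean_pred = as.numeric(tapply(pred_prob, bins, mean)),  # x: mean predicted probability per bin
  obs_rate  = as.numeric(tapply(outcome, bins, mean)),    # y: observed event rate per bin
  n         = as.numeric(table(bins))                     # bin size
)
print(calib)
```

Plotting `obs_rate` against `mean_pred` gives the binned version of the calibration curve; bins whose observed rate sits above the mean prediction correspond to underestimated risk.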


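4. Decision Curve Analysis (DCA)

Concept: a method for assessing the value of a model for clinical decision making

Purpose

  • Quantifies the clinical usefulness of a model

  • Compares the net benefit of different intervention strategies

  • Identifies the range of thresholds over which the model is useful

Key functions (rmda)

decision_curve(formula, data)  # compute the decision curve
plot_decision_curve()          # plot the result

Core quantities

  • Threshold probability: the predicted risk at or above which an intervention is given

  • Net benefit = (TP - FP×weight)/N

  • Weight = threshold probability/(1 - threshold probability)

Interpretation

  • The higher the curve, the greater the clinical value

  • "All" line: the strategy of intervening on every patient

  • "None" line: the strategy of intervening on no patient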
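
The net benefit formula can be worked through by hand at a single threshold. The sketch below is a minimal base-R illustration with made-up outcomes and risks; the helper `net_benefit` is hypothetical and is not the rmda implementation.

```r
# Hypothetical helper: net benefit of "intervene when predicted risk >= pt".
net_benefit <- function(y, p, pt) {
  n  <- length(y)
  tp <- sum(p >= pt & y == 1)    # true positives at this threshold
  fp <- sum(p >= pt & y == 0)    # false positives at this threshold
  w  <- pt / (1 - pt)            # weight = odds of the threshold probability
  (tp - fp * w) / n
}

# Made-up outcomes (1 = event) and predicted risks
y <- c(1, 1, 0, 0, 1, 0, 0, 0, 1, 0)
p <- c(0.9, 0.8, 0.7, 0.2, 0.6, 0.1, 0.3, 0.4, 0.2, 0.5)

nb_model <- net_benefit(y, p, pt = 0.5)                   # model-based strategy
nb_all   <- net_benefit(y, rep(1, length(y)), pt = 0.5)   # "All": everyone is treated
nb_none  <- 0                                             # "None": nobody is treated
print(c(model = nb_model, all = nb_all, none = nb_none))
```

At a threshold of 0.5 the weight is 1, so the model gives (3 - 2)/10 = 0.1 while treating everyone gives (4 - 6)/10 = -0.2; the model is the better strategy at this threshold.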


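Overall evaluation framework

  1. Discrimination: ROC/AUC measures how well the model separates the classes

  2. Calibration: the calibration curve measures how accurate the predicted probabilities are

  3. Clinical utility: DCA measures the value of the model in actual clinical use

  4. Classification accuracy: the confusion matrix gives the detailed classification performance

5. Hands-on practice in R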

####################1. Load packages####################################
# Load all required R packages
library(caret)         # machine learning toolkit
library(rpart)         # decision trees
library(partykit)      # decision tree visualization
library(randomForest)  # random forest
library(xgboost)       # XGBoost
library(lightgbm)      # LightGBM
library(e1071)         # SVM
library(nnet)          # neural networks
library(neuralnet)     # neural networks (alternative implementation)
library(kknn)          # KNN
library(pROC)          # ROC curve analysis
library(ggplot2)       # data visualization
library(ggthemes)      # extra ggplot2 themes
library(rms)           # nomograms
library(rmda)          # decision curve analysis
library(dplyr)         # data manipulation
library(tidyr)         # data tidying
library(Matrix)        # sparse matrices
library(regplot)       # nomogram plots for regression models
####################2. Load data####################################
# Read the training set
tlog <- read.csv("tlog.csv", row.names = 1)

# Feature sets used for modelling (raw and standardized versions)
selected_vars <- c("exercise", "hyperlip", "pregnant", 
                   "age", "glucose", "bmi", "pedigree")
selected_vars_scaled <- c("exercise", "hyperlip", "pregnant", "age_scaled", 
                          "glucose_scaled", "bmi_scaled", "pedigree_scaled")

# Convert the outcome to a factor
tlog$diabetes <- factor(tlog$diabetes, levels = c(0, 1), labels = c("No", "Yes"))

# Read the validation set; its outcome is converted the same way
valdata <- read.csv("valdata.csv", row.names = 1)
valdata$diabetes <- factor(valdata$diabetes, levels = c(0, 1), labels = c("No", "Yes"))

####################3. Building binary classification models####################################

# All models are fitted on the training set

###### 3.1 Logistic regression #########

# Fit the model
lr_model <- glm(diabetes ~ exercise+hyperlip+pregnant+age+glucose+bmi+pedigree,
                data = tlog,
                family = "binomial"  # binomial family, appropriate for a binary outcome
) 

# Show the fitted model
print(lr_model)    

## Draw the nomogram
regplot(lr_model, 
        title = "Nomogram", 
        points = TRUE,                    # show the points contribution of each variable
        axis.text.size = 12,              # tick label font size
        title.text.size = 14)             # title font size

################3.2 Decision tree: classification and regression tree (CART)########################################

# Baseline CART model with default parameters
tree_model1 <- rpart(diabetes ~ exercise+hyperlip+pregnant+age+glucose+bmi+pedigree, 
                     data = tlog, 
                     method = "class")   # classification tree
tree_model1$cptable   # complexity-parameter table used for pruning

# Resampling control: 10-fold cross-validation
control <- trainControl(method="cv", number=10)  
# Tuning grid
param_grid <- expand.grid(cp = seq(0.001, 0.3, by = 0.002))  # cp: complexity parameter of CART, controls pruning

# Cross-validate and tune with train()
set.seed(111)
fit_cv_rpart <- train(diabetes ~ exercise+hyperlip+pregnant+age+glucose+bmi+pedigree, 
                      data = tlog, 
                      method = "rpart", # fit with the rpart algorithm
                      trControl = control, 
                      tuneGrid = param_grid)
fit_cv_rpart$bestTune  # best cp

# Refit the tree with the best cp
tree_model <- rpart(diabetes ~ exercise+hyperlip+pregnant+age+glucose+bmi+pedigree, 
                    data = tlog, 
                    method = "class", 
                    cp = fit_cv_rpart$bestTune$cp)  # bestTune is a data frame, so extract the cp value

# Inspect the control parameters
print(tree_model$control)

# Plot the tree
plot(as.party(tree_model))

# Show the fitted model
print(tree_model)


###################3.3 Random forest (RF)##########################

# Baseline random forest with default parameters
rf_model0 <- randomForest(diabetes ~ exercise+hyperlip+pregnant+age+glucose+bmi+pedigree, 
                          data = tlog,  
                          importance=TRUE) # default parameters
print(rf_model0)

## Hyperparameter tuning: grid search with cross-validation
# Resampling control
set.seed(123)  
ctrl <- trainControl(method = "cv", 
                     number = 10,    # 10-fold cross-validation
                     search = "grid")  # grid search

# Search range for mtry (number of features sampled at each split)
tuneGrid <- expand.grid(mtry = 1:floor(sqrt(7)))   # integers from 1 to the square root of the number of features (7), rounded down

# Tune mtry
rf_model1 <- train(diabetes ~ exercise+hyperlip+pregnant+age+glucose+bmi+pedigree, 
                   data = tlog, 
                   method = "rf",      # random forest
                   trControl = ctrl,   # resampling control defined above
                   tuneGrid = tuneGrid # grid of mtry values to try
)
# Show the tuning results and the best mtry
print(rf_model1)
rf_model1$bestTune

# Candidate values for the number of trees (ntree)
ntree_values <- seq(50, 1000, by = 50)  
# Vector holding the OOB error rate for each ntree value
oob_error_rates <- numeric(length(ntree_values))

# Fit one model per ntree value and record its OOB error rate
for (i in seq_along(ntree_values)) {
  rf_model2 <- randomForest(diabetes ~ exercise+hyperlip+pregnant+age+glucose+bmi+pedigree,
                            data = tlog, 
                            mtry = rf_model1$bestTune$mtry, 
                            ntree = ntree_values[i], 
                            importance = TRUE, 
                            oob.prox = TRUE)
  # OOB error rate after the last tree (row = ntree, column "OOB")
  oob_error_rates[i] <- rf_model2$err.rate[ntree_values[i], "OOB"]
}

# Number of trees with the lowest OOB error rate
best_ntree <- ntree_values[which.min(oob_error_rates)]
print(paste("Best number of trees:", best_ntree))

# Refit the model with the best parameters
rf_model <- randomForest(diabetes ~ exercise+hyperlip+pregnant+age+glucose+bmi+pedigree,
                         data = tlog, 
                         ntree = best_ntree, 
                         mtry = rf_model1$bestTune$mtry,
                         importance = TRUE)

# Show the fitted model
print(rf_model)


########3.4 XGBoost################################

## XGBoost needs a numeric outcome (0/1)
tlog$diabetes <- as.numeric(tlog$diabetes) - 1
valdata$diabetes <- as.numeric(valdata$diabetes) - 1

# Build the XGBoost training and validation matrices
train_matrix <- xgb.DMatrix(data = as.matrix(tlog[, selected_vars]), 
                            label = tlog$diabetes)  # label is the outcome variable

val_matrix <- xgb.DMatrix(data = as.matrix(valdata[, selected_vars]), 
                          label = valdata$diabetes)  

# Baseline XGBoost model; the objective is set explicitly because the default is a regression objective
xgb_model0 <- xgb.train(params = list(objective = "binary:logistic"),
                        data = train_matrix, nrounds = 100) # 100 boosting rounds


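# Hyperparameter tuning
param_grid <- expand.grid(
  objective = "binary:logistic",  # binary classification, outputs class probabilities
  max_depth = c(2, 3, 4, 5),      # maximum tree depth, controls model complexity
  eta = c(0.01, 0.1, 0.2),        # learning rate, the step size of each boosting update
  nrounds = c(50, 100, 150)       # number of boosting rounds (one tree is added per round)
)  

# Initialize the best AUC and parameters
best_auc <- 0  
best_params <- list()     # best hyperparameter combination found so far

# Tuning loop
for (i in seq_len(nrow(param_grid))) {  
  param <- list(  
    objective = "binary:logistic",  
    eval_metric = "auc",  
    max_depth = param_grid$max_depth[i],  
    eta = param_grid$eta[i]  
  )  
  
  xgb_model_0 <- xgb.train(params = param, data = train_matrix, 
                           nrounds = param_grid$nrounds[i])  
  
  # AUC of this candidate. Note: it is computed on the training data, so it is
  # optimistic and tends to favour the most complex setting; a cross-validated
  # AUC would be a fairer selection criterion.
  pred_probs <- predict(xgb_model_0, train_matrix)  
  roc_curve <- roc(tlog$diabetes, pred_probs)  
  auc_value <- roc_curve$auc  
  
  # Keep the best candidate
  if (auc_value > best_auc) {  
    best_auc <- auc_value  
    best_params <-  c(param, nrounds = param_grid$nrounds[i])  
  }  
} 

# Print the best parameters and AUC
print(best_params)  # best hyperparameter combination
cat("Best AUC: ", best_auc, "\n")  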

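# Train the final model with the best hyperparameters
# (nrounds is passed as its own argument, so it is dropped from the params list)
xgb_model <- xgb.train(params = best_params[setdiff(names(best_params), "nrounds")], 
                       data = train_matrix, 
                       nrounds = best_params$nrounds)  

# Show the fitted model
print(xgb_model)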


#######################3.5 LightGBM#############################

# Build the LightGBM training and validation datasets
lgbtlog <- lgb.Dataset(as.matrix(tlog[,selected_vars]),
                       label = tlog$diabetes) # training data in LightGBM format
lgbvaldata <- lgb.Dataset.create.valid(lgbtlog, 
                                       as.matrix(valdata[,selected_vars]), 
                                       label = valdata$diabetes) # validation data; the labels must come from valdata

# Baseline LightGBM model; the objective is set explicitly because the default is regression
lightgbm_model0 <- lgb.train(params = list(objective = "binary"), data = lgbtlog)  


# Hyperparameter search grid
param_grid <- expand.grid(  
  num_leaves = c(15, 31),    # number of leaves per tree
  max_depth = c(-1, 1, 3),   # maximum tree depth (-1 means no limit)
  learning_rate = c( 0.1, 0.2),  # learning rate, the step size of each boosting update
  n_estimators = c(50),          # number of boosting rounds (number of trees)
  min_data_in_leaf = c(30),      # minimum number of samples in a leaf
  lambda_l1 = c(0, 1),           # L1 regularization
  lambda_l2 = c(0, 1)            # L2 regularization
)  

# Empty data frame that collects the results
results <- data.frame()  

# Hyperparameter tuning with cross-validation
for (i in seq_len(nrow(param_grid))) {  
  
  # Current parameter combination (the regularization terms are now passed too)
  params <- list(  
    objective = "binary",  
    metric = "auc",  
    learning_rate = param_grid$learning_rate[i],  
    num_leaves = param_grid$num_leaves[i],  
    max_depth = param_grid$max_depth[i],  
    min_data_in_leaf = param_grid$min_data_in_leaf[i],  
    lambda_l1 = param_grid$lambda_l1[i],  
    lambda_l2 = param_grid$lambda_l2[i]  
  )  
  
  # 5-fold cross-validation; the number of rounds comes from the grid
  # instead of being set in two conflicting places
  cv_results <- lgb.cv(  
    params = params,  
    data = lgbtlog,  
    nrounds = param_grid$n_estimators[i],  
    nfold = 5,  
    early_stopping_rounds = 10,  
    verbose = -1    # -1 suppresses the training log
  )  
  
  # Store the parameters with their best cross-validated AUC
  # (record_evals keeps the per-round metric under $eval)
  results <- rbind(results, data.frame(param_grid[i, ], 
                                       auc = max(unlist(cv_results$record_evals$valid$auc$eval))))  
}  

# Pick the best parameters
best_params <- results[which.max(results$auc), ]  
print(best_params)  

# Build the final model with the best parameters
best_params_list <- list(  # hyperparameter list
  objective = "binary",  
  metric = "auc",  
  learning_rate = best_params$learning_rate,  
  num_leaves = best_params$num_leaves,  
  max_depth = best_params$max_depth,  
  min_data_in_leaf = best_params$min_data_in_leaf,  
  lambda_l1 = best_params$lambda_l1,  
  lambda_l2 = best_params$lambda_l2  
)  

lightgbm_model <- lgb.train(  
  params = best_params_list,  
  data = lgbtlog,  
  nrounds = best_params$n_estimators  
)  

# Show the fitted model
print(lightgbm_model)


########################3.6 KNN###########################

# Convert the outcome back to a factor
tlog$diabetes <- factor(tlog$diabetes,levels = c(0,1),labels = c('No','Yes'))

# Baseline KNN model with default parameters
knn_model0 <- train(diabetes ~ exercise + hyperlip + pregnant + age_scaled + 
                      glucose_scaled + bmi_scaled + pedigree_scaled,  
                    data = tlog, 
                    method = "kknn"    # weighted k-nearest neighbours
)


# Cross-validation control
train_control <- trainControl(method = "cv", number = 5)

# Tuning grid: kernel and k
tune_grid<-expand.grid(kmax = seq(1, 20, by = 2), # maximum number of neighbours k
                       distance = 2,             # Minkowski distance parameter; 2 is Euclidean distance
                       kernel=c("rectangular","triangular","gaussian")) # kernel used to weight the neighbours

# Train and tune the KNN model
set.seed(123)
kknn_model <- train(diabetes ~ exercise + hyperlip + pregnant + age_scaled + 
                      glucose_scaled + bmi_scaled + pedigree_scaled,  
                    data = tlog, 
                    method = "kknn", 
                    trControl = train_control, 
                    tuneGrid = tune_grid)

# Show the tuning results
print(kknn_model)

# Extract the best parameter combination
best_params <- kknn_model$bestTune
print(best_params)

# Plot the tuning results
ggplot(kknn_model) +
  theme_minimal() +
  ggtitle("KNN hyperparameter tuning results")

# Build the final model with the best parameters
knn_model <- train(diabetes ~ exercise + hyperlip + pregnant + age_scaled + 
                     glucose_scaled + bmi_scaled + pedigree_scaled, 
                   data = tlog,  
                   method = "kknn", 
                   trControl = train_control, 
                   tuneGrid = expand.grid(kmax = best_params$kmax, 
                                          distance = best_params$distance, 
                                          kernel = best_params$kernel))  

# Show the final model
print(knn_model)


#########################3.7 Support vector machine (SVM)#######################################################
## Fitted on the standardized features

# Parameter tuning
set.seed(11)
tune_result <- tune.svm(diabetes ~ exercise + hyperlip + pregnant + age_scaled + 
                          glucose_scaled + bmi_scaled + pedigree_scaled, 
                        data = tlog,   
                        kernel = "radial",   # radial basis function (RBF) kernel
                        cost = 10^(-1:3),  # cost: penalty for misclassification
                        gamma = 10^(-3:1), # gamma: kernel parameter, sets how far a single training sample's influence reaches
                        tunecontrol=tune.control(sampling = "cross",cross = 5), # 5-fold cross-validation
                        probability = TRUE) 

# Inspect the best parameters
best_model <- tune_result$best.model  
print(tune_result)  

# Fit the SVM with the best parameters
svm_model <- svm(diabetes ~ exercise + hyperlip + pregnant + age_scaled + 
                   glucose_scaled + bmi_scaled + pedigree_scaled, 
                 data = tlog,   
                 kernel = "radial",   
                 cost = best_model$cost,   
                 gamma = best_model$gamma,   
                 probability = TRUE)  # enable probability estimates
print(svm_model)


#########################3.8 Neural network (neuralnet)#######################################################

# Helper that builds a neural network for a given hidden-layer layout
build_nn_model <- function(hidden_layers) {  
  formula <- as.formula("diabetes ~ exercise + hyperlip + pregnant + age_scaled + 
                          glucose_scaled + bmi_scaled + pedigree_scaled")  
  model <- neuralnet(formula, data = tlog, hidden = hidden_layers, 
                     linear.output = FALSE)  # apply the activation to the output so it reads as a class probability
  return(model)  
}  

# Initialize the tracking variables
best_model_nnet <- NULL  
best_auc <- 0  
best_hidden_layers <- NULL  # best hidden-layer layout

# Candidate hidden-layer layouts
hidden_layer_combinations <- list(c(2),c(3),c(4), c(2, 1))  
# Grid search (as with XGBoost above, the AUC is computed on the training data and is therefore optimistic)
for (hidden in hidden_layer_combinations) {  # loop over the layouts
  set.seed(123)  # fix the random seed
  nn_model <- build_nn_model(hidden)  
  
  # Predict
  predictions_prob <- predict(nn_model, tlog)[,2]   # probability of "Yes" (second output column)
  predictions <- ifelse(predictions_prob > 0.5, "Yes", "No")  # turn probabilities into classes
  
  # Compute the AUC
  roc_obj <- roc(tlog$diabetes, predictions_prob)  
  auc_value <- roc_obj$auc  
  
  # Keep the best model
  if (auc_value > best_auc) {  
    best_auc <- auc_value  
    best_model_nnet <- nn_model
    best_hidden_layers <- hidden  # best hidden-layer layout
  }  
}  

# Print the best AUC and layout
cat("Best AUC:", best_auc, "\n")
cat("Best Hidden Layer Configuration:", paste(unlist(best_hidden_layers), collapse = ", "), "\n")  

nnet_model <- best_model_nnet

# Show the fitted model
summary(nnet_model )


#############4. Model evaluation on the training set##############################################

######4.1 Model predictions####

# Logistic regression
train_prob_lr <- predict(lr_model, newdata = tlog, 
                         type = 'response')     # predicted probability of "Yes"
train_prob_lr
train_pred_lr <- factor(ifelse(train_prob_lr > 0.5,'Yes','No'), 
                        levels = c('No','Yes'))  # predicted class; fixed levels keep both classes even if only one is predicted
train_pred_lr 

# Decision tree
train_pred_tree <- predict(tree_model, 
                           newdata = tlog, 
                           type = "class") # predicted class
train_pred_tree
train_prob_tree <- predict(tree_model, newdata = tlog, 
                           type = "prob")[, 2]  # predicted probability of "Yes"
train_prob_tree

# Random forest
train_pred_rf <- predict(rf_model, newdata = tlog)   # predicted class
train_pred_rf
train_prob_rf <- predict(rf_model, newdata = tlog, 
                         type = "prob")[, 2]  # predicted probability of "Yes"
train_prob_rf

# XGBoost
train_prob_xgb <- predict(xgb_model, train_matrix)   # predicted probability of "Yes"
train_prob_xgb
train_pred_xgb <- factor(ifelse(train_prob_xgb > 0.5,'Yes','No'), 
                         levels = c('No','Yes')) # predicted class
train_pred_xgb

# LightGBM
train_prob_lightgbm <- predict(lightgbm_model,
                               newdata = as.matrix(tlog[, selected_vars]),
                               type = 'response')   # predicted probability of "Yes"
train_prob_lightgbm
train_pred_lightgbm <- predict(lightgbm_model,
                               newdata = as.matrix(tlog[, selected_vars]),
                               type = 'class')  # predicted class (0/1)
train_pred_lightgbm <- factor(train_pred_lightgbm,levels = c(0,1),labels = c('No','Yes'))
train_pred_lightgbm

# KNN
train_pred_knn <- predict(knn_model, newdata = tlog)    # predicted class
train_pred_knn
train_prob_knn <- predict(knn_model, newdata = tlog, type = "prob")[,"Yes"] # predicted probability of "Yes"
train_prob_knn

# SVM
train_pred_svm  <- predict(svm_model, newdata = tlog)   # predicted class
train_pred_svm
train_prob_svm <- attr(predict(svm_model, newdata = tlog, probability = TRUE), 
                       "probabilities")[, "Yes"]    # predicted probability of "Yes"
train_prob_svm

# Neural network
train_prob_nnet <- predict(nnet_model, tlog)[,2]  # predicted probability of "Yes"
train_prob_nnet
train_pred_nnet <- factor(ifelse(train_prob_nnet > 0.5,'Yes','No'), 
                          levels = c('No','Yes'))  # predicted class
train_pred_nnet          


#########4.2 Confusion matrices####

# Logistic regression
confusion_matrix_lr <- caret::confusionMatrix(train_pred_lr, 
                                              tlog$diabetes, 
                                              positive = "Yes")   
print(confusion_matrix_lr) 

# Decision tree
confusion_matrix_tree <- caret::confusionMatrix(train_pred_tree, 
                                                tlog$diabetes, 
                                                positive = "Yes") # training set
print(confusion_matrix_tree) 

# Random forest
confusion_matrix_rf <- caret::confusionMatrix(train_pred_rf, 
                                              tlog$diabetes,
                                              positive = "Yes")  
print(confusion_matrix_rf) 

# XGBoost
confusion_matrix_xgb <- caret::confusionMatrix(train_pred_xgb, 
                                               tlog$diabetes, 
                                               positive = "Yes")  
print(confusion_matrix_xgb) 


# LightGBM
confusion_matrix_lightgbm <- caret::confusionMatrix(train_pred_lightgbm, 
                                                    tlog$diabetes, 
                                                    positive = "Yes") 
print(confusion_matrix_lightgbm) 

# KNN
confusion_matrix_knn <- caret::confusionMatrix(train_pred_knn, 
                                               tlog$diabetes, 
                                               positive = "Yes") 
print(confusion_matrix_knn) 

# SVM
confusion_matrix_svm <- caret::confusionMatrix(train_pred_svm, 
                                               tlog$diabetes, 
                                               positive = "Yes")
print(confusion_matrix_svm) 

# Neural network
confusion_matrix_nnet <- caret::confusionMatrix(train_pred_nnet, 
                                                tlog$diabetes, 
                                                positive = "Yes")  
print(confusion_matrix_nnet)           

#########4.3 ROC curves####

## AUC and its 95% CI for each model
# (1) lr
roc_lr <- roc(tlog$diabetes,    # true labels of the outcome
              as.numeric(train_prob_lr)  # predicted probabilities
)
auc_lr <- roc_lr$auc  # AUC
auc_lr    
ci.auc(roc_lr)        # 95% CI of the AUC

# (2) tree
roc_tree <- roc(tlog$diabetes, as.numeric(train_prob_tree))
auc_tree <- roc_tree$auc
auc_tree
ci.auc(auc_tree)

# (3) rf
roc_rf <- roc(tlog$diabetes, as.numeric(train_prob_rf))
auc_rf <- roc_rf$auc
auc_rf
ci.auc(auc_rf)

# (4) xgboost
roc_xgb <- roc(tlog$diabetes, as.numeric(train_prob_xgb))
auc_xgb <- roc_xgb$auc
auc_xgb
ci.auc(auc_xgb)

# (5) lightgbm
roc_lightgbm <- roc(tlog$diabetes, as.numeric(train_prob_lightgbm))
auc_lightgbm <- roc_lightgbm$auc
auc_lightgbm
ci.auc(auc_lightgbm)

# (6) knn
roc_knn <- roc(tlog$diabetes, as.numeric(train_prob_knn))
auc_knn <- roc_knn$auc
auc_knn
ci.auc(auc_knn)

# (7) svm
roc_svm <- roc(tlog$diabetes, as.numeric(train_prob_svm))
auc_svm <- roc_svm$auc
auc_svm
ci.auc(auc_svm)

# (8) nnet
roc_nnet <- roc(tlog$diabetes, as.numeric(train_prob_nnet))
auc_nnet <- roc_nnet$auc
auc_nnet
ci.auc(auc_nnet)

# Plot the ROC curves
plot(roc_lr, 
     col = "red", # line colour
     lwd = 2,     # line width
     main = "ROC Curves for Training dataset",  # plot title
     xlab = "1 - Specificity", ylab = "Sensitivity", # axis labels
     legacy.axes = TRUE,  # plot 1 - specificity on the x-axis, increasing from 0 to 1
     cex.main = 1.5,      # title font size
     cex.lab = 1.2, cex.axis = 1.2  # axis label and tick font sizes
)
lines(roc_tree, col = "blue", lwd = 2)
lines(roc_rf, col = "green", lwd = 2)
lines(roc_knn, col = "purple", lwd = 2)
lines(roc_svm, col = "orange", lwd = 2)
lines(roc_nnet, col = "brown", lwd = 2)
lines(roc_xgb, col = "pink", lwd = 2)
lines(roc_lightgbm, col = "cyan", lwd = 2)

# Add the legend
legend("bottomright", 
       legend = c(    # one label per model, including its AUC
         paste("Logistic Regression (AUC = ", round(auc_lr, 3), ")", sep = ""),
         paste("Decision Tree (AUC = ", round(auc_tree, 3), ")", sep = ""),
         paste("Random Forest (AUC = ", round(auc_rf, 3), ")", sep = ""),
         paste("KNN (AUC = ", round(auc_knn, 3), ")", sep = ""),
         paste("SVM (AUC = ", round(auc_svm, 3), ")", sep = ""),
         paste("Neural Network (AUC = ", round(auc_nnet, 3), ")", sep = ""),
         paste("XGBoost (AUC = ", round(auc_xgb, 3), ")", sep = ""),
         paste("LightGBM (AUC = ", round(auc_lightgbm, 3), ")", sep = "") ), 
       col = c("red", "blue", "green", "purple", "orange", 
               "brown", "pink", "cyan"), 
       lty = 1,   # solid lines in the legend
       lwd = 2,   # legend line width
       cex = 0.6) # legend font scaling


#############4.4 Calibration curves####

# Stack the predictions and true labels into one long data frame
calibration_data <- data.frame(
  Model = c(rep("Logistic Regression", length(train_prob_lr)),# rep(): repeat the model name once per observation
            rep("Decision Tree", length(train_prob_tree)),
            rep("Random Forest", length(train_prob_rf)),
            rep("KNN", length(train_prob_knn)),
            rep("SVM", length(train_prob_svm)),
            rep("Neural Network", length(train_prob_nnet)),
            rep("XGBoost", length(train_prob_xgb)),
            rep("LightGBM", length(train_prob_lightgbm))),
  Probability = c(train_prob_lr, 
                  train_prob_tree,
                  train_prob_rf,
                  train_prob_knn,
                  train_prob_svm,
                  train_prob_nnet,
                  train_prob_xgb,
                  train_prob_lightgbm),# predicted probabilities of all models, concatenated in the same order
  Actual = rep(as.numeric(tlog$diabetes) - 1, times = 8)  # 0/1 outcome, repeated once per model
)

# Plot smoothed calibration curves
ggplot(calibration_data, aes(x = Probability, y = Actual, color = Model)) +
  geom_smooth(method = "loess", se = FALSE, size = 1.5) +  # LOESS smoother
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "black", size = 1) +  # ideal reference line
  scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.1)) +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.1)) +
  labs(
    title = "Calibration Curves",
    x = "Predicted Probability",
    y = "Observed Proportion" ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12),
    legend.position = "bottom",
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10),
    axis.line = element_line(colour = "black")) +
  scale_color_brewer(palette = "Set1")  # high-contrast colour palette

############4.5 Decision curve analysis (DCA)######

# Combine the predictions and true labels into one data frame
dca_tlog <- data.frame(diabetes = as.numeric(tlog$diabetes)-1, 
                       train_prob_lr,
                       train_prob_tree, 
                       train_prob_rf,
                       train_prob_knn,
                       train_prob_svm,
                       train_prob_nnet,
                       train_prob_xgb,
                       train_prob_lightgbm)

# Note: by default decision_curve() refits a logistic model on the predictor;
# its fitted.risk argument is meant for predictors that are already risks between 0 and 1.

# Logistic regression
dca.result_lr <- decision_curve(diabetes ~ train_prob_lr, 
                                data = dca_tlog, 
                                bootstraps = 10)
# Decision tree
dca.result_tree <- decision_curve(diabetes ~ train_prob_tree, 
                                  data = dca_tlog,
                                  bootstraps = 10)
# Random forest
dca.result_rf <- decision_curve(diabetes ~ train_prob_rf, 
                                data = dca_tlog, 
                                bootstraps = 10)
# XGBoost 
dca.result_xgb <- decision_curve(diabetes ~ train_prob_xgb, 
                                 data = dca_tlog, 
                                 bootstraps = 10)
# LightGBM
dca.result_lightgbm <- decision_curve(diabetes ~ train_prob_lightgbm, 
                                      data = dca_tlog, 
                                      bootstraps = 10)
# KNN
dca.result_knn <- decision_curve(diabetes ~ train_prob_knn, 
                                 data = dca_tlog,
                                 bootstraps = 10)
# SVM (bootstraps set to 10 to match the other models)
dca.result_svm <- decision_curve(diabetes ~ train_prob_svm, 
                                 data = dca_tlog, 
                                 bootstraps = 10)
# Neural network
dca.result_nnet <- decision_curve(diabetes ~ train_prob_nnet, 
                                  data = dca_tlog, 
                                  bootstraps = 10)

plot_decision_curve(
  list(dca.result_lr, dca.result_tree, dca.result_rf, 
       dca.result_knn, dca.result_svm, dca.result_nnet,
       dca.result_xgb, dca.result_lightgbm),  # list of decision curve results
  curve.names = c("Logistic Regression", "Decision Tree", "Random Forest", 
                  "KNN", "SVM", "Neural Network", "XGBoost", "LightGBM"),
  col = c("red", "green", "blue", "purple", "orange", "brown", "cyan", "magenta"),
  lwd = 2,  # line width
  confidence.intervals = FALSE, # do not draw confidence intervals
  legend.position = ("none")  
)

# Add the legend manually
legend("bottomleft", 
       legend = c("Logistic Regression", "Decision Tree", "Random Forest", 
                  "KNN", "SVM", "Neural Network", "XGBoost", "LightGBM"),
       col = c("red", "green", "blue", "purple", "orange", "brown", "cyan", "magenta"),
       lwd = 2,
       cex = 0.6,  # font scaling
       bty = "n",   # no box around the legend
       y.intersp = 0.8,    # vertical spacing between entries
       x.intersp = 0.5,    # horizontal spacing between the line and its label
       text.width = 0.1)  # width reserved for the legend text, in user coordinates


##############################5. Model evaluation on the validation set##########################

######5.1 Model predictions####

# Convert the outcome back to a factor
valdata$diabetes <- factor(valdata$diabetes,levels = c(0,1),labels = c('No','Yes'))

# Logistic regression
val_prob_lr <- predict(lr_model, newdata = valdata, 
                       type = 'response')    # predicted probability of "Yes"
val_prob_lr
val_pred_lr <- factor(ifelse(val_prob_lr > 0.5,'Yes','No'), 
                      levels = c('No','Yes'))  # predicted class
val_pred_lr 

# Decision tree
val_pred_tree <- predict(tree_model, newdata = valdata, type = "class") # predicted class
val_pred_tree
val_prob_tree <- predict(tree_model, newdata = valdata, 
                         type = "prob")[, 2]  # predicted probability of "Yes"
val_prob_tree


# Random forest
val_pred_rf <- predict(rf_model, newdata = valdata)   # predicted class
val_pred_rf
val_prob_rf <- predict(rf_model, newdata = valdata, 
                       type = "prob")[, 2]  # predicted probability of "Yes"
val_prob_rf

# XGBoost
val_prob_xgb <- predict(xgb_model, val_matrix)   # predicted probability of "Yes"
val_prob_xgb
val_pred_xgb <- factor(ifelse(val_prob_xgb > 0.5,'Yes','No'), 
                       levels = c('No','Yes')) # predicted class
val_pred_xgb

# LightGBM
val_prob_lightgbm <- predict(lightgbm_model,
                             newdata = as.matrix(valdata[,selected_vars]),
                             type = 'response')   # predicted probability of "Yes"
val_prob_lightgbm
val_pred_lightgbm <- predict(lightgbm_model,
                             newdata = as.matrix(valdata[,selected_vars]),
                             type = 'class')  # predicted class (0/1)
val_pred_lightgbm <- factor(val_pred_lightgbm,
                            levels = c(0,1),labels = c('No','Yes'))
val_pred_lightgbm

# KNN
val_pred_knn <- predict(knn_model, newdata = valdata) # predicted class
val_pred_knn
val_prob_knn <- predict(knn_model, newdata = valdata, 
                        type = "prob")[,"Yes"]  # predicted probability of "Yes"
val_prob_knn

# SVM
val_pred_svm  <- predict(svm_model, newdata = valdata)   # predicted class
val_pred_svm
val_prob_svm <- attr(predict(svm_model, newdata = valdata, probability = TRUE), 
                     "probabilities")[, "Yes"]    # predicted probability of "Yes"
val_prob_svm

# Neural network
val_prob_nnet <- predict(nnet_model, valdata)[,2]  # predicted probability of "Yes"
val_prob_nnet
val_pred_nnet <- factor(ifelse(val_prob_nnet > 0.5,'Yes','No'), 
                        levels = c('No','Yes'))  # predicted class
val_pred_nnet          

#########5.2 Confusion matrices####

# Logistic regression
confusion_matrix_lr1 <- caret::confusionMatrix(
  val_pred_lr, valdata$diabetes, positive = "Yes")   
print(confusion_matrix_lr1) 

# Decision tree
confusion_matrix_tree1 <- caret::confusionMatrix(val_pred_tree, 
                                                 valdata$diabetes, 
                                                 positive = "Yes") # validation set
print(confusion_matrix_tree1) 

# Random forest
confusion_matrix_rf1 <- caret::confusionMatrix(val_pred_rf, 
                                               valdata$diabetes,
                                               positive = "Yes")  
print(confusion_matrix_rf1) 

# XGBoost
confusion_matrix_xgb1 <- caret::confusionMatrix(val_pred_xgb, 
                                                valdata$diabetes, 
                                                positive = "Yes")  
print(confusion_matrix_xgb1) 


# LightGBM
confusion_matrix_lightgbm1 <- caret::confusionMatrix(val_pred_lightgbm, 
                                                     valdata$diabetes, 
                                                     positive = "Yes") 
print(confusion_matrix_lightgbm1) 

# KNN
confusion_matrix_knn1 <- caret::confusionMatrix(val_pred_knn, 
                                                valdata$diabetes, 
                                                positive = "Yes") 
print(confusion_matrix_knn1) 

# SVM
confusion_matrix_svm1 <- caret::confusionMatrix(val_pred_svm, 
                                                valdata$diabetes, 
                                                positive = "Yes")
print(confusion_matrix_svm1) 

# Neural network
confusion_matrix_nnet1 <- caret::confusionMatrix(val_pred_nnet, 
                                                 valdata$diabetes, 
                                                 positive = "Yes")  
print(confusion_matrix_nnet1)

#########5.3 ROC curves####

# AUC and its 95% CI for each model

roc_lr_val <- roc(valdata$diabetes, as.numeric(val_prob_lr))
auc_lr_val <- roc_lr_val$auc  # AUC
auc_lr_val    
ci.auc(roc_lr_val)        # 95% CI of the AUC

roc_tree_val <- roc(valdata$diabetes, as.numeric(val_prob_tree))
auc_tree_val <- roc_tree_val $auc
auc_tree_val
ci.auc(auc_tree_val)

roc_rf_val <- roc(valdata$diabetes, as.numeric(val_prob_rf))
auc_rf_val <- roc_rf_val $auc
auc_rf_val
ci.auc(auc_rf_val)

roc_xgb_val <- roc(valdata$diabetes, as.numeric(val_prob_xgb))
auc_xgb_val <- roc_xgb_val $auc
auc_xgb_val
ci.auc(auc_xgb_val)

roc_lightgbm_val <- roc(valdata$diabetes, as.numeric(val_prob_lightgbm))
auc_lightgbm_val <- roc_lightgbm_val $auc
auc_lightgbm_val
ci.auc(auc_lightgbm_val)

roc_knn_val <- roc(valdata$diabetes, as.numeric(val_prob_knn))
auc_knn_val <- roc_knn_val $auc
auc_knn_val
ci.auc(auc_knn_val)

roc_svm_val <- roc(valdata$diabetes, as.numeric(val_prob_svm))
auc_svm_val <- roc_svm_val$auc
auc_svm_val
ci.auc(auc_svm_val)

roc_nnet_val <- roc(valdata$diabetes, as.numeric(val_prob_nnet))
auc_nnet_val <- roc_nnet_val $auc
auc_nnet_val
ci.auc(auc_nnet_val)

# Plot the ROC curves
plot(roc_lr_val, col = "red", lwd = 2, main = "ROC Curves for val dataset", 
     xlab = "1 - Specificity", ylab = "Sensitivity", legacy.axes = TRUE, 
     cex.main = 1.6, cex.lab = 1.3, cex.axis = 1.2)
lines(roc_tree_val, col = "blue", lwd = 2)
lines(roc_rf_val, col = "green", lwd = 2)
lines(roc_knn_val, col = "purple", lwd = 2)
lines(roc_svm_val, col = "orange", lwd = 2)
lines(roc_nnet_val, col = "brown", lwd = 2)
lines(roc_xgb_val, col = "pink", lwd = 2)
lines(roc_lightgbm_val, col = "cyan", lwd = 2)

# Add the legend
legend("bottomright", 
       legend = c(
         paste("Logistic Regression (AUC = ", round(auc_lr_val, 3), ")", sep = ""),
         paste("Decision Tree (AUC = ", round(auc_tree_val, 3), ")", sep = ""),
         paste("Random Forest (AUC = ", round(auc_rf_val, 3), ")", sep = ""),
         paste("KNN (AUC = ", round(auc_knn_val, 3), ")", sep = ""),
         paste("SVM (AUC = ", round(auc_svm_val, 3), ")", sep = ""),
         paste("Neural Network (AUC = ", round(auc_nnet_val, 3), ")", sep = ""),
         paste("XGBoost (AUC = ", round(auc_xgb_val, 3), ")", sep = ""),
         paste("LightGBM (AUC = ", round(auc_lightgbm_val, 3), ")", sep = "") ), 
       col = c("red", "blue", "green", "purple", "orange", 
               "brown", "pink", "cyan"), 
       lty = 1, lwd = 2, cex = 0.6)


#############5.4 Calibration curves####

# Combine the predicted probabilities and true labels into one data frame
calibration_data1 <- data.frame(
  Model = c(rep("Logistic Regression", length(val_prob_lr)),
            rep("Decision Tree", length(val_prob_tree)),
            rep("Random Forest", length(val_prob_rf)),
            rep("KNN", length(val_prob_knn)),
            rep("SVM", length(val_prob_svm)),
            rep("Neural Network", length(val_prob_nnet)),
            rep("XGBoost", length(val_prob_xgb)),
            rep("LightGBM", length(val_prob_lightgbm))),
  Probability = c(val_prob_lr, 
                  val_prob_tree,
                  val_prob_rf,
                  val_prob_knn,
                  val_prob_svm,
                  val_prob_nnet,
                  val_prob_xgb,
                  val_prob_lightgbm),
  Actual = rep(as.numeric(valdata$diabetes) - 1, times = 8) )  # factor -> 0/1 (second level = 1), repeated once per model



# Plot smoothed calibration curves
ggplot(calibration_data1, aes(x = Probability, y = Actual, color = Model)) +
  geom_smooth(method = "loess", se = FALSE, size = 1.5) +  # LOESS-smoothed curve per model
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "black", size = 1) +  # ideal reference line (perfect calibration)
  scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.1)) +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.1)) +
  labs(
    title = "Calibration Curves",
    x = "Predicted Probability",
    y = "Observed Proportion" ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12),
    legend.position = "bottom",
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10),
    axis.line = element_line(colour = "black")) +
  scale_color_brewer(palette = "Set1")  # high-contrast color palette
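
# Sketch (an addition, not the author's method): numeric companions to the
# smoothed calibration plot. brier_sketch and calibration_bins_sketch are
# hypothetical helper names. Both assume outcomes coded 0/1 and predicted
# probabilities in [0, 1].
brier_sketch <- function(actual01, prob) {
  mean((prob - actual01)^2)  # mean squared error of the probabilities; lower is better
}

calibration_bins_sketch <- function(actual01, prob, n_bins = 10) {
  # Equal-width probability bins; include.lowest keeps prob == 0 in the first bin
  bin <- cut(prob, breaks = seq(0, 1, length.out = n_bins + 1), include.lowest = TRUE)
  data.frame(
    bin            = levels(bin),
    n              = as.vector(table(bin)),
    mean_predicted = as.vector(tapply(prob, bin, mean)),     # NA for empty bins
    observed_rate  = as.vector(tapply(actual01, bin, mean))  # NA for empty bins
  )
}

# Possible use with the data frame built above (one Brier score per model):
# sapply(split(calibration_data1, calibration_data1$Model),
#        function(d) brier_sketch(d$Actual, d$Probability))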

############5.5 DCA curves######

dca_valdata <- data.frame(diabetes = as.numeric(valdata$diabetes)-1, 
                          val_prob_lr,
                          val_prob_tree, 
                          val_prob_rf,
                          val_prob_knn,
                          val_prob_svm,
                          val_prob_nnet,
                          val_prob_xgb,
                          val_prob_lightgbm)
# Logistic Regression 
dca.result_lr1 <- decision_curve(diabetes ~ val_prob_lr, 
                                 data = dca_valdata, 
                                 bootstraps = 10)
# Decision Tree
dca.result_tree1 <- decision_curve(diabetes ~ val_prob_tree, 
                                   data = dca_valdata, 
                                   bootstraps = 10)
# Random Forest
dca.result_rf1 <- decision_curve(diabetes ~ val_prob_rf, 
                                 data = dca_valdata, 
                                 bootstraps = 10)
# XGBoost
dca.result_xgb1 <- decision_curve(diabetes ~ val_prob_xgb, 
                                  data = dca_valdata, 
                                  bootstraps = 10)
# LightGBM
dca.result_lightgbm1 <- decision_curve(diabetes ~ val_prob_lightgbm, 
                                       data = dca_valdata, 
                                       bootstraps = 10)
# KNN
dca.result_knn1 <- decision_curve(diabetes ~ val_prob_knn, 
                                  data = dca_valdata, 
                                  bootstraps = 10)
# SVM
dca.result_svm1 <- decision_curve(diabetes ~ val_prob_svm, 
                                  data = dca_valdata, 
                                  bootstraps = 10)
# Neural Network
dca.result_nnet1 <- decision_curve(diabetes ~ val_prob_nnet, 
                                   data = dca_valdata, 
                                   bootstraps = 10)

plot_decision_curve(
  list(dca.result_lr1, dca.result_tree1, dca.result_rf1, 
       dca.result_knn1, dca.result_svm1, dca.result_nnet1,
       dca.result_xgb1, dca.result_lightgbm1),  # list of decision-curve results, one per model
  curve.names = c("Logistic Regression", "Decision Tree", "Random Forest", 
                  "KNN", "SVM", "Neural Network", "XGBoost", "LightGBM"),
  col = c("red", "green", "blue", "purple", "orange", "brown", "cyan", "magenta"),
  lwd = 2,  # line width
  confidence.intervals = FALSE, # do not draw confidence intervals
  legend.position = "none"  # suppress the built-in legend
)

# Add the legend manually
legend("bottomright", 
       legend = c("Logistic Regression", "Decision Tree", "Random Forest", 
                  "KNN", "SVM", "Neural Network", "XGBoost", "LightGBM"),
       col = c("red", "green", "blue", "purple", "orange", "brown", "cyan", "magenta"),
       lwd = 2,
       cex = 0.6,  # shrink the legend text
       bty = "n",   # no box around the legend
       y.intersp = 0.8,    # vertical spacing between entries
       x.intersp = 0.5,    # horizontal spacing between line samples and labels
       text.width = 0.15)  # fixed label width, in x-axis units
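
# Sketch (an addition, not the rmda implementation): the net-benefit value
# plotted on each decision curve, computed by hand for a single threshold.
# It follows the formula given earlier in this post,
#   Net Benefit = (TP - FP * pt / (1 - pt)) / N,
# where pt is the threshold probability. net_benefit_sketch is a hypothetical
# helper name; treating prob >= threshold as "intervene" is an assumption.
net_benefit_sketch <- function(actual01, prob, threshold) {
  stopifnot(length(actual01) == length(prob), threshold > 0, threshold < 1)
  treat <- prob >= threshold             # patients the model would intervene on
  n  <- length(actual01)
  tp <- sum(treat & actual01 == 1)       # true positives at this threshold
  fp <- sum(treat & actual01 == 0)       # false positives at this threshold
  tp / n - fp / n * threshold / (1 - threshold)
}

# Possible use with the data frame built above, e.g. at a 20% threshold:
# net_benefit_sketch(dca_valdata$diabetes, dca_valdata$val_prob_lr, 0.2)
# The "All" reference line corresponds to prob = 1 for every patient:
# net_benefit_sketch(dca_valdata$diabetes, rep(1, nrow(dca_valdata)), 0.2)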

#################### Save all figures as PNG files ####################

###### 1. Nomogram ######
png("nomogram.png", width=1000, height=800, res=150)
regplot(lr_model, 
        title = "Nomogram", 
        points = TRUE,                    
        axis.text.size = 12,              
        title.text.size = 14)             
dev.off()

###### 2. Decision tree plot ######
png("decision_tree.png", width=1200, height=800, res=150)
plot(as.party(tree_model))
dev.off()

###### 3. Training-set ROC curves ######
png("train_roc_curves.png", width=1000, height=800, res=150)
plot(roc_lr, 
     col = "red", lwd = 2, 
     main = "ROC Curves for Training dataset", 
     xlab = "1 - Specificity", ylab = "Sensitivity",
     legacy.axes = TRUE, cex.main = 1.5, cex.lab = 1.2, cex.axis = 1.2)
lines(roc_tree, col = "blue", lwd = 2)
lines(roc_rf, col = "green", lwd = 2)
lines(roc_knn, col = "purple", lwd = 2)
lines(roc_svm, col = "orange", lwd = 2)
lines(roc_nnet, col = "brown", lwd = 2)
lines(roc_xgb, col = "pink", lwd = 2)
lines(roc_lightgbm, col = "cyan", lwd = 2)
legend("bottomright", 
       legend = c(
         paste0("Logistic Regression (AUC = ", round(auc_lr, 3), ")"),
         paste0("Decision Tree (AUC = ", round(auc_tree, 3), ")"),
         paste0("Random Forest (AUC = ", round(auc_rf, 3), ")"),
         paste0("KNN (AUC = ", round(auc_knn, 3), ")"),
         paste0("SVM (AUC = ", round(auc_svm, 3), ")"),
         paste0("Neural Network (AUC = ", round(auc_nnet, 3), ")"),
         paste0("XGBoost (AUC = ", round(auc_xgb, 3), ")"),
         paste0("LightGBM (AUC = ", round(auc_lightgbm, 3), ")") ), 
       col = c("red", "blue", "green", "purple", "orange", 
               "brown", "pink", "cyan"), 
       lty = 1, lwd = 2, cex = 0.6)
dev.off()

###### 4. Training-set calibration curves ######
png("train_calibration_curve.png", width=1000, height=800, res=150)
# print() ensures the plot is drawn even when the script is run via source()
print(
  ggplot(calibration_data, aes(x = Probability, y = Actual, color = Model)) +
    geom_smooth(method = "loess", se = FALSE, size = 1.5) +
    geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "black", size = 1) +
    scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.1)) +
    scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.1)) +
    labs(title = "Calibration Curves (Training Set)",
         x = "Predicted Probability",
         y = "Observed Proportion") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
          axis.title = element_text(size = 14),
          axis.text = element_text(size = 12),
          legend.position = "bottom",
          legend.title = element_text(size = 12),
          legend.text = element_text(size = 10),
          axis.line = element_line(colour = "black")) +
    scale_color_brewer(palette = "Set1")
)
dev.off()

###### 5. Training-set DCA curves ######
png("train_dca_curve.png", width=1000, height=800, res=150)
plot_decision_curve(
  list(dca.result_lr, dca.result_tree, dca.result_rf, 
       dca.result_knn, dca.result_svm, dca.result_nnet,
       dca.result_xgb, dca.result_lightgbm),
  curve.names = c("Logistic Regression", "Decision Tree", "Random Forest", 
                  "KNN", "SVM", "Neural Network", "XGBoost", "LightGBM"),
  col = c("red", "green", "blue", "purple", "orange", "brown", "cyan", "magenta"),
  lwd = 2,
  confidence.intervals = FALSE,
  legend.position = "none"
)
legend("bottomleft", 
       legend = c("Logistic Regression", "Decision Tree", "Random Forest", 
                  "KNN", "SVM", "Neural Network", "XGBoost", "LightGBM"),
       col = c("red", "green", "blue", "purple", "orange", "brown", "cyan", "magenta"),
       lwd = 2, cex = 0.6, bty = "n", y.intersp = 0.8, x.intersp = 0.5, text.width = 0.1)
dev.off()

###### 6. Validation-set ROC curves ######
png("validation_roc_curves.png", width=1000, height=800, res=150)
plot(roc_lr_val, col = "red", lwd = 2, 
     main = "ROC Curves for Validation Set", 
     xlab = "1 - Specificity", ylab = "Sensitivity", 
     legacy.axes = TRUE, cex.main = 1.6, cex.lab = 1.3, cex.axis = 1.2)
lines(roc_tree_val, col = "blue", lwd = 2)
lines(roc_rf_val, col = "green", lwd = 2)
lines(roc_knn_val, col = "purple", lwd = 2)
lines(roc_svm_val, col = "orange", lwd = 2)
lines(roc_nnet_val, col = "brown", lwd = 2)
lines(roc_xgb_val, col = "pink", lwd = 2)
lines(roc_lightgbm_val, col = "cyan", lwd = 2)
legend("bottomright", 
       legend = c(
         paste0("Logistic Regression (AUC = ", round(auc_lr_val, 3), ")"),
         paste0("Decision Tree (AUC = ", round(auc_tree_val, 3), ")"),
         paste0("Random Forest (AUC = ", round(auc_rf_val, 3), ")"),
         paste0("KNN (AUC = ", round(auc_knn_val, 3), ")"),
         paste0("SVM (AUC = ", round(auc_svm_val, 3), ")"),
         paste0("Neural Network (AUC = ", round(auc_nnet_val, 3), ")"),
         paste0("XGBoost (AUC = ", round(auc_xgb_val, 3), ")"),
         paste0("LightGBM (AUC = ", round(auc_lightgbm_val, 3), ")") ), 
       col = c("red", "blue", "green", "purple", "orange", 
               "brown", "pink", "cyan"), 
       lty = 1, lwd = 2, cex = 0.6)
dev.off()

###### 7. Validation-set calibration curves ######
png("validation_calibration_curve.png", width=1000, height=800, res=150)
# print() ensures the plot is drawn even when the script is run via source()
print(
  ggplot(calibration_data1, aes(x = Probability, y = Actual, color = Model)) +
    geom_smooth(method = "loess", se = FALSE, size = 1.5) +
    geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "black", size = 1) +
    scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.1)) +
    scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.1)) +
    labs(title = "Calibration Curves (Validation Set)",
         x = "Predicted Probability",
         y = "Observed Proportion") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
          axis.title = element_text(size = 14),
          axis.text = element_text(size = 12),
          legend.position = "bottom",
          legend.title = element_text(size = 12),
          legend.text = element_text(size = 10),
          axis.line = element_line(colour = "black")) +
    scale_color_brewer(palette = "Set1")
)
dev.off()

###### 8. Validation-set DCA curves ######
png("validation_dca_curve.png", width=1000, height=800, res=150)
plot_decision_curve(
  list(dca.result_lr1, dca.result_tree1, dca.result_rf1, 
       dca.result_knn1, dca.result_svm1, dca.result_nnet1,
       dca.result_xgb1, dca.result_lightgbm1),
  curve.names = c("Logistic Regression", "Decision Tree", "Random Forest", 
                  "KNN", "SVM", "Neural Network", "XGBoost", "LightGBM"),
  col = c("red", "green", "blue", "purple", "orange", "brown", "cyan", "magenta"),
  lwd = 2,
  confidence.intervals = FALSE,
  legend.position = "none"
)
legend("bottomright", 
       legend = c("Logistic Regression", "Decision Tree", "Random Forest", 
                  "KNN", "SVM", "Neural Network", "XGBoost", "LightGBM"),
       col = c("red", "green", "blue", "purple", "orange", "brown", "cyan", "magenta"),
       lwd = 2, cex = 0.6, bty = "n", y.intersp = 0.8, x.intersp = 0.5, text.width = 0.15)
dev.off()

###### 9. KNN tuning results ######
png("knn_tuning_results.png", width=1000, height=800, res=150)
# print() ensures the plot is drawn even when the script is run via source()
print(
  ggplot(kknn_model) +
    theme_minimal() +
    ggtitle("KNN Hyperparameter Tuning Results")
)
dev.off()
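
# Sketch (an addition, not part of the original script): the png()/dev.off()
# pairs above can be wrapped in one helper so the device is closed even when
# the plotting code fails. save_png_sketch is a hypothetical name. It takes a
# function that draws the figure and prints the result if it is a ggplot
# object, since ggplot objects are only drawn when printed.
save_png_sketch <- function(file, draw, width = 1000, height = 800, res = 150) {
  png(file, width = width, height = height, res = res)
  on.exit(dev.off(), add = TRUE)            # always close the device
  out <- draw()
  if (inherits(out, "ggplot")) print(out)   # base graphics are already drawn by draw()
  invisible(file)
}

# Example with a base-graphics plot written to a temporary file:
demo_png <- tempfile(fileext = ".png")
save_png_sketch(demo_png, function() plot(1:10, main = "demo"))

# Possible use with the objects above:
# save_png_sketch("decision_tree.png", function() plot(as.party(tree_model)), width = 1200)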

6. Figures
