November 16, 2022

Python Machine Learning 11: Support Vector Machines

All of the code and data for this series can be downloaded from Professor Chen Qiang's personal homepage: Python数据程序

Reference: Chen Qiang. Machine Learning and Python Applications. Beijing: Higher Education Press, 2021.

This series largely skips the mathematical theory and focuses on letting readers implement machine learning methods with the most concise Python code possible.


The decision trees, random forests, and gradient boosting covered earlier are all tree models, whereas the support vector machine (SVM) belongs to the family of kernel methods: it relies on a kernel function to map the data into a high-dimensional space where the classes become separable. SVMs are well suited to problems with many features, so before the rise of neural networks they performed quite well on text and image data. Academics favor SVMs because of their rigorous and elegant mathematical derivation. SVMs can handle both classification and regression, but they generally work better for classification.
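To make the kernel-mapping idea concrete, here is a toy sketch (the data are invented for illustration): a 1-D dataset whose classes interleave cannot be separated by any single linear threshold, but an RBF kernel separates it by implicitly mapping the points into a higher-dimensional space.

```python
import numpy as np
from sklearn.svm import SVC

# Invented 1-D data: class 1 sits in the middle, class 0 on both sides,
# so no single linear threshold can separate them.
X = np.array([[-3.0], [-2.0], [-1.0], [0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1, 1, 0, 0])

# A linear SVM is just a threshold in 1-D and cannot fit this perfectly...
linear = SVC(kernel="linear").fit(X, y)

# ...but the RBF kernel implicitly maps the data to a space where the
# middle cluster becomes linearly separable.
rbf = SVC(kernel="rbf", gamma=1.0, C=10).fit(X, y)

print(linear.score(X, y))  # below 1.0: some points are misclassified
print(rbf.score(X, y))     # the RBF kernel separates the training data
```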


SVM Binary Classification: Python Case

We use a spam e-mail dataset. Import the packages and read the data:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import KFold, StratifiedKFold
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2
    from sklearn.svm import SVC
    from sklearn.svm import SVR
    #from sklearn.svm import LinearSVC
    from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; requires an older version
    from sklearn.datasets import load_digits
    from sklearn.datasets import make_blobs
    from mlxtend.plotting import plot_decision_regions

    spam = pd.read_csv('spam.csv')
    spam.shape
    spam.head(3)

The data look like this (screenshot of the first rows omitted).

Extract X and y, split into training and test sets, and standardize the data:

    X = spam.iloc[:, :-1]
    y = spam.iloc[:, -1]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000, stratify=y, random_state=0)
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_s = scaler.transform(X_train)
    X_test_s = scaler.transform(X_test)
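Standardization matters for SVMs because the RBF and polynomial kernels depend on distances and inner products, so a feature on a large scale would dominate the kernel. A quick sketch on synthetic data (a stand-in, since spam.csv is local to the book's materials) confirms what StandardScaler does:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the spam features
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(100, 4))

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit_transform = fit + transform in one call

# Each column now has zero mean and unit variance, so no single
# feature dominates the kernel's distance computation.
print(X_train_s.mean(axis=0).round(6))
print(X_train_s.std(axis=0).round(6))
```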

Fit the SVM with each of several kernel functions:

    # linear kernel
    model = SVC(kernel="linear", random_state=123)
    model.fit(X_train_s, y_train)
    model.score(X_test_s, y_test)

    # quadratic polynomial kernel
    model = SVC(kernel="poly", degree=2, random_state=123)
    model.fit(X_train_s, y_train)
    model.score(X_test_s, y_test)

    # cubic polynomial kernel
    model = SVC(kernel="poly", degree=3, random_state=123)
    model.fit(X_train_s, y_train)
    model.score(X_test_s, y_test)

    # radial (RBF) kernel
    model = SVC(kernel="rbf", random_state=123)
    model.fit(X_train_s, y_train)
    model.score(X_test_s, y_test)

    # sigmoid kernel
    model = SVC(kernel="sigmoid", random_state=123)
    model.fit(X_train_s, y_train)
    model.score(X_test_s, y_test)

In general, the radial kernel performs best. Grid-search for the optimal hyperparameters:

    param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    model = GridSearchCV(SVC(kernel="rbf", random_state=123), param_grid, cv=kfold)
    model.fit(X_train_s, y_train)
    model.best_params_
    model.score(X_test_s, y_test)
    pred = model.predict(X_test_s)  # predict on the standardized test set, matching the training data
    pd.crosstab(y_test, pred, rownames=['Actual'], colnames=['Predicted'])

The predictions give the confusion matrix.
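From a confusion matrix like this one, the usual classification metrics follow directly. A small sketch with invented labels (stand-ins for y_test and pred above):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# Invented binary labels for illustration
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 1, 0, 1, 0])

# ravel() flattens the 2x2 matrix into (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                   # 3 1 1 3

print(accuracy_score(y_true, y_pred))   # (tn + tp) / total = 0.75
print(precision_score(y_true, y_pred))  # tp / (tp + fp) = 0.75
print(recall_score(y_true, y_pred))     # tp / (tp + fn) = 0.75
```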

SVM Multiclass Classification: Python Case

Use the handwritten-digits dataset bundled with sklearn and classify it with an SVM:

    digits = load_digits()
    dir(digits)

    # image data (three-dimensional)
    digits.images.shape
    # flattened to two dimensions
    digits.data.shape
    # shape of y
    digits.target.shape

    # view the image at index 15
    plt.imshow(digits.images[15], cmap=plt.cm.gray_r)

It is a handwritten 5.

Display the images of the digit 8:

    images_8 = digits.images[digits.target == 8]
    for i in range(1, 10):
        plt.subplot(3, 3, i)
        plt.imshow(images_8[i-1], cmap=plt.cm.gray_r)
    plt.tight_layout()

Extract X and y and run the SVM classifiers:

    X = digits.data
    y = digits.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)

    # linear kernel
    model = SVC(kernel="linear", random_state=123)
    model.fit(X_train, y_train)
    model.score(X_test, y_test)
    # quadratic polynomial kernel
    model = SVC(kernel="poly", degree=2, random_state=123)
    model.fit(X_train, y_train)
    model.score(X_test, y_test)
    # cubic polynomial kernel
    model = SVC(kernel="poly", degree=3, random_state=123)
    model.fit(X_train, y_train)
    model.score(X_test, y_test)
    # radial (RBF) kernel
    model = SVC(kernel='rbf', random_state=123)
    model.fit(X_train, y_train)
    model.score(X_test, y_test)
    # sigmoid kernel
    model = SVC(kernel="sigmoid", random_state=123)
    model.fit(X_train, y_train)
    model.score(X_test, y_test)
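The five fits above repeat the same three lines for each kernel; the comparison can also be written more compactly as a loop (a sketch that collects the test accuracies in a dict):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0)

# Fit each kernel once and collect the test accuracies
scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    model = SVC(kernel=kernel, random_state=123).fit(X_train, y_train)
    scores[kernel] = model.score(X_test, y_test)

print(scores)  # the RBF and polynomial kernels typically come out on top
```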

Grid-search for the optimal hyperparameters and obtain the confusion matrix from the predictions:

    param_grid = {'C': [0.001, 0.01, 0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1, 1, 10]}
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    model = GridSearchCV(SVC(kernel='rbf', random_state=123), param_grid, cv=kfold)
    model.fit(X_train, y_train)
    model.best_params_
    model.score(X_test, y_test)
    pred = model.predict(X_test)
    pd.crosstab(y_test, pred, rownames=['Actual'], colnames=['Predicted'])

Plot the heatmap:

    # plot_confusion_matrix was removed in scikit-learn 1.2; use ConfusionMatrixDisplay instead
    ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, cmap='Blues')
    plt.tight_layout()


SVM Regression: Python Case

Again we use the Boston housing dataset for regression (note that load_boston was removed in scikit-learn 1.2, so this example requires an older version):

    # Support Vector Regression with Boston Housing Data
    X, y = load_boston(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_s = scaler.transform(X_train)
    X_test_s = scaler.transform(X_test)

    # Radial kernel
    model = SVR(kernel='rbf')
    model.fit(X_train_s, y_train)
    model.score(X_test_s, y_test)

    # Grid-search C, epsilon, and gamma
    param_grid = {'C': [0.01, 0.1, 1, 10, 50, 100, 150], 'epsilon': [0.01, 0.1, 1, 10], 'gamma': [0.01, 0.1, 1, 10]}
    kfold = KFold(n_splits=10, shuffle=True, random_state=1)
    model = GridSearchCV(SVR(), param_grid, cv=kfold)
    model.fit(X_train_s, y_train)
    model.best_params_

    # Inspect the support vectors of the best model
    model = model.best_estimator_
    len(model.support_)
    model.support_vectors_
    model.score(X_test_s, y_test)

    # Comparison with linear regression
    from sklearn.linear_model import LinearRegression
    model = LinearRegression()
    model.fit(X_train, y_train)
    model.score(X_test, y_test)
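Because load_boston is gone from recent scikit-learn releases, here is a sketch of the same SVR-versus-OLS comparison on the bundled diabetes dataset (a stand-in choice, not the dataset used in the book; the C value is likewise illustrative):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# The diabetes dataset ships with scikit-learn, so this runs on any version
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Standardize the features for the SVR (the kernel is distance-based)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

svr = SVR(kernel="rbf", C=100).fit(X_train_s, y_train)
ols = LinearRegression().fit(X_train, y_train)

print(svr.score(X_test_s, y_test))  # R-squared of the RBF SVR
print(ols.score(X_test, y_test))    # R-squared of plain OLS
```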