What if the Bayesian model performs worse than other models, but we don't want to switch models? If we tune parameters with accuracy as the metric, naive Bayes is beyond saving: unlike SVC and logistic regression, its underlying principle is simple, and there are essentially no parameters to tune. However, algorithms that output probabilities have their own way of being adjusted: calibrating the probabilities they produce. The better calibrated a model is, the more accurate its probability estimates are, the more confident it is when making decisions, and the more stable it becomes. If we require the predicted probabilities to be as close as possible to the true probabilities, we can use a reliability curve to examine and adjust the calibration.

```python
# Reliability curve
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification as mc  # utility for generating a synthetic dataset
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression as LR
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
# Create the dataset
X, y = mc(n_samples=100000
          , n_features=20       # 20 features in total
          , n_classes=2         # binary labels
          , n_informative=2     # 2 of the features carry most of the information
          , n_redundant=10      # 10 features are redundant
          , random_state=42)
# The sample size is large enough that we can use just 1% of it for training:
# as we saw earlier, naive Bayes performs reasonably well even on very small training sets.
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.99, random_state=42)
gnb = GaussianNB()
gnb.fit(Xtrain, ytrain)
y_pred = gnb.predict(Xtest)
prob_pos = gnb.predict_proba(Xtest)[:, 1]  # predicted probability of class 1
# Build a DataFrame from a dict
df = pd.DataFrame({"ytrue": ytest[:500], "probability": prob_pos[:500]})
df = df.sort_values(by="probability")
df.index = range(df.shape[0])  # sorting scrambled the index, so reset it
```
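Before plotting anything, we can already quantify calibration with the Brier score (brier_score_loss was imported above): it is the mean squared difference between the predicted probability and the true 0/1 label, so lower is better and 0 is perfect. A minimal sketch, reusing ytest and prob_pos from above:

```python
# Brier score: mean((prob_pos - ytest)^2); 0 means perfectly calibrated probabilities
bayes_brier = brier_score_loss(ytest, prob_pos, pos_label=1)  # class 1 is the positive class here
print("GaussianNB Brier score: %.4f" % bayes_brier)
```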
```python
fig = plt.figure()   # canvas
ax1 = plt.subplot()  # create one subplot
ax1.plot([0, 1], [0, 1], "k:", label="perfectly calibrated")
# Diagonal: x and y both run from 0 to 1.
# If predictions matched the truth exactly (x = y), the curve would lie on this diagonal.
# ax1.plot(df["probability"], df["ytrue"], "s-", label="%s(%1.3f)" % ("bayes", clf_score))
ax1.plot(df["probability"], df["ytrue"], "s-")
ax1.set_ylabel("true label")
ax1.set_xlabel("predicted probability")
ax1.set_ylim([-0.05, 1.05])
ax1.legend()
plt.show()
```
```python
fig = plt.figure()   # canvas
ax1 = plt.subplot()  # create one subplot
ax1.plot([0, 1], [0, 1], "k:", label="perfectly calibrated")
# Diagonal: if predictions matched the truth, points would lie on it.
# ax1.plot(df["probability"], df["ytrue"], "s-", label="%s(%1.3f)" % ("bayes", clf_score))
ax1.scatter(df["probability"], df["ytrue"], s=10)  # scatter instead of a connected line
ax1.set_ylabel("true label")
ax1.set_xlabel("predicted probability")
ax1.set_ylim([-0.05, 1.05])
ax1.legend()
plt.show()
```
```python
# The true probabilities are unobservable, but we can bin the samples and use the
# class proportion in each bin instead: that is exactly what calibration_curve computes.
from sklearn.calibration import calibration_curve

trueprob, preproba = calibration_curve(ytest, prob_pos, n_bins=10)
fig = plt.figure()
ax1 = plt.subplot()
ax1.plot([0, 1], [0, 1], "k:", label="perfectly calibrated")
ax1.plot(preproba, trueprob, "s-", label="bayes")
ax1.set_ylabel("true probability for class 1")
ax1.set_xlabel("mean predicted probability")
ax1.set_ylim([-0.05, 1.05])
ax1.legend()
plt.show()
```
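To make the two arrays returned by calibration_curve concrete, here is a minimal sketch reusing trueprob and preproba from above: the samples are grouped into at most n_bins bins by predicted probability, and for each bin we get the observed fraction of class-1 samples alongside the mean predicted probability.

```python
# Each pair describes one bin: how confident the model was on average (preproba)
# versus how often class 1 actually occurred in that bin (trueprob).
for t, p in zip(trueprob, preproba):
    print("mean predicted %.3f -> observed fraction of class 1: %.3f" % (p, t))
```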
An example of how the enumerate function works: enumerate(['a', 'b', 'c']) yields [(0, 'a'), (1, 'b'), (2, 'c')]. Each pair, e.g. (0, 'a'), has two items: the first is the index, the second is the element from the original list.

Another example: enumerate([1.5, 2.8, 3.14, 5.6]) yields [(0, 1.5), (1, 2.8), (2, 3.14), (3, 5.6)]. Again the first item of each pair, as in (0, 1.5) and (3, 5.6), is the index and the second is the element from the original list.
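A minimal runnable sketch of both examples:

```python
# enumerate pairs each element with its index, counting from 0
for idx, letter in enumerate(['a', 'b', 'c']):
    print(idx, letter)  # prints: 0 a, then 1 b, then 2 c

print(list(enumerate([1.5, 2.8, 3.14, 5.6])))
# [(0, 1.5), (1, 2.8), (2, 3.14), (3, 5.6)]
```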
fig, ax = plt.subplots() creates a figure and a set of axes in one call; you then draw by calling methods on ax (fig is the figure object, ax is the axes object). It is equivalent to:

```python
fig = plt.figure()             # fig is the figure object
ax = fig.add_subplot(1, 1, 1)  # ax is the axes object
```

fig, ax = plt.subplots(1, 3) takes the subplot grid's row and column counts as arguments, so this call produces 1x3 = 3 subplots; it returns the figure and an array of axes. Note that, unlike fig.add_subplot(1, 3, 1), where the last argument selects the first subplot, plt.subplots takes no such index argument: you pick a subplot by indexing into the returned array.
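A minimal sketch showing both forms side by side, plus indexing into a 1x3 grid (assuming matplotlib is already imported as plt, as above):

```python
# Form 1: figure and axes in one call
fig, ax = plt.subplots()

# Form 2: equivalent two-step version
fig2 = plt.figure()
ax2 = fig2.add_subplot(1, 1, 1)

# A 1x3 grid: plt.subplots returns the figure and an array of axes
fig3, axes = plt.subplots(1, 3)
axes[0].plot([0, 1], [0, 1])  # draw on the first subplot by indexing the array
plt.show()
```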
To set the width and height of the figure, pass a figsize argument. Be careful to unpack both return values: writing axes = plt.subplots(1, 3, figsize=(18, 4)) leaves axes holding the whole (figure, axes) tuple, so the correct form is fig, axes = plt.subplots(1, 3, figsize=(18, 4)). Now let's see how the reliability curve changes with different numbers of bins.
```python
# How the reliability curve changes with different numbers of bins
fig, axes = plt.subplots(1, 3, figsize=(18, 4))
for ind, i in enumerate([3, 10, 100]):
    ax = axes[ind]
    trueprob, preproba = calibration_curve(ytest, prob_pos, n_bins=i)
    ax.plot([0, 1], [0, 1], "k:", label="perfectly calibrated")
    ax.plot(preproba, trueprob, "s-", label="n_bins=%d" % i)
    ax.set_ylabel("true probability for class 1")
    ax.set_xlabel("mean predicted probability")
    ax.set_ylim([-0.05, 1.05])
    ax.legend()
plt.show()
```
```python
# Loop over several models and plot their reliability curves on one figure
name = ["GaussianBayes", "Logistic", "SVC"]
gnb = GaussianNB()
logi = LR(C=1., solver='lbfgs', max_iter=3000, multi_class="auto")
# solver selects the optimization algorithm; 'lbfgs' is a quasi-Newton method that uses the
# second-derivative (Hessian) matrix of the loss function to iteratively optimize it.
# multi_class selects how multiclass problems are handled.
# C controls the amount of regularization: it is the inverse of the regularization strength,
# a float > 0, default 1.0 (i.e. the regularization term and the loss are weighted 1:1).
# The smaller C is, the heavier the penalty on the loss, the stronger the regularization,
# and the more the coefficients are shrunk toward zero.
svc = SVC(kernel="linear", gamma=1.)
fig, ax1 = plt.subplots(figsize=(8, 6))
ax1.plot([0, 1], [0, 1], "k:", label="perfectly calibrated")  # diagonal
for clf, name_ in zip([gnb, logi, svc], name):
    clf.fit(Xtrain, ytrain)
    y_pred = clf.predict(Xtest)
    if hasattr(clf, "predict_proba"):
        prob_pos = clf.predict_proba(Xtest)[:, 1]
    else:  # SVC has no predict_proba by default; min-max scale decision_function to [0, 1]
        prob_pos = clf.decision_function(Xtest)
        prob_pos = (prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())
    clf_score = brier_score_loss(ytest, prob_pos, pos_label=y.max())
    trueprob, preproba = calibration_curve(ytest, prob_pos, n_bins=10)
    ax1.plot(preproba, trueprob, "s-", label="%s(%1.3f)" % (name_, clf_score))
ax1.set_ylabel("true probability for class 1")
ax1.set_xlabel("mean predicted probability")
ax1.set_ylim([-0.05, 1.05])
ax1.legend()
plt.show()
```
We can check the distribution of a model's predicted probabilities by plotting a histogram. The histogram puts the predicted probabilities, binned into intervals, on the x-axis and the number of samples in each bin on the y-axis. Note that this binning is different from the binning used for the reliability curve: here we simply split the range of predicted probabilities into equal-width intervals, which has nothing to do with the binning the reliability curve used for smoothing. Let's plot the histogram:

```python
name = ["GaussianBayes", "Logistic", "SVC"]
gnb = GaussianNB()
logi = LR(C=1., solver='lbfgs', max_iter=3000, multi_class="auto")
svc = SVC(kernel="linear", gamma=1.)
fig, ax2 = plt.subplots(figsize=(8, 6))
for clf, name_ in zip([gnb, logi, svc], name):
    clf.fit(Xtrain, ytrain)
    y_pred = clf.predict(Xtest)
    if hasattr(clf, "predict_proba"):  # hasattr(obj, name): True if obj has an attribute called name
        prob_pos = clf.predict_proba(Xtest)[:, 1]
    else:
        prob_pos = clf.decision_function(Xtest)
        prob_pos = (prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())
    ax2.hist(prob_pos  # only x values are needed; the y-axis counts samples in each bin
             , bins=10
             , label=name_
             , histtype="step"  # draw unfilled, outline-only histograms
             , lw=2             # line width of each outline
             )
ax2.set_ylabel("Distribution of probability")
ax2.set_xlabel("Mean predicted probability")
ax2.set_xlim([-0.05, 1.05])
ax2.set_xticks([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
ax2.legend(loc=9)
plt.show()
```
```python
# Wrap everything into a reusable function
def plot_calib(models, name, Xtrain, Xtest, ytrain, ytest, n_bins=10):
    import matplotlib.pyplot as plt
    from sklearn.metrics import brier_score_loss
    from sklearn.calibration import calibration_curve

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(30, 10))
    ax1.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
    for clf, name_ in zip(models, name):
        clf.fit(Xtrain, ytrain)
        y_pred = clf.predict(Xtest)
        if hasattr(clf, "predict_proba"):  # hasattr(obj, name): True if obj has an attribute called name
            prob_pos = clf.predict_proba(Xtest)[:, 1]
        else:
            prob_pos = clf.decision_function(Xtest)
            prob_pos = (prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())
        clf_score = brier_score_loss(ytest, prob_pos, pos_label=ytest.max())  # use ytest, not the global y
        trueprob, preproba = calibration_curve(ytest, prob_pos, n_bins=n_bins)
        ax1.plot(preproba, trueprob, "s-", label="%s(%1.3f)" % (name_, clf_score))
        ax2.hist(prob_pos  # only x values are needed; the y-axis counts samples in each bin
                 , range=(0, 1)
                 , bins=n_bins
                 , label=name_
                 , histtype="step"  # draw unfilled, outline-only histograms
                 , lw=2             # line width of each outline
                 )
    ax2.set_ylabel("Distribution of probability")
    ax2.set_xlabel("Mean predicted probability")
    ax2.set_xlim([-0.05, 1.05])
    ax2.set_xticks([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
    ax2.legend(loc=9)
    ax2.set_title("Distribution of probability")
    ax1.set_ylabel("true probability for class 1")
    ax1.set_xlabel("mean predicted probability")
    ax1.set_ylim([-0.05, 1.05])
    ax1.legend()
    ax1.set_title("Calibration plots (reliability curve)")
    plt.show()
```
sklearn provides CalibratedClassifierCV to calibrate a classifier's probabilities using cross-validation: method="sigmoid" fits Platt's logistic (sigmoid) scaling, while method="isotonic" fits a non-parametric isotonic regression. Let's calibrate the naive Bayes model with both methods and compare against logistic regression:

```python
from sklearn.calibration import CalibratedClassifierCV

name = ["GaussianBayes", "Logistic", "Bayes+isotonic", "Bayes+sigmoid"]
gnb = GaussianNB()
logi = LR(C=1., solver='lbfgs', max_iter=3000, multi_class="auto")
models = [gnb
          , logi
          , CalibratedClassifierCV(gnb, method="isotonic")
          , CalibratedClassifierCV(gnb, method="sigmoid")]
```
```python
plot_calib(models, name, Xtrain, Xtest, ytrain, ytest)
```
```python
# Calibrate SVC in the same way
from sklearn.calibration import CalibratedClassifierCV

name = ["SVC", "Logistic", "SVC+isotonic", "SVC+sigmoid"]
svc = SVC(kernel="linear", gamma=1)
logi = LR(C=1., solver='lbfgs', max_iter=3000, multi_class="auto")
models = [svc
          , logi
          , CalibratedClassifierCV(svc, method="isotonic")
          , CalibratedClassifierCV(svc, method="sigmoid")]
plot_calib(models, name, Xtrain, Xtest, ytrain, ytest)
```
name_svc=["SVC","SVC+isotonic","SVC+sigmoid"]svc=SVC(kernel="linear",gamma=1)models_svc=[svc,CalibratedClassifierCV(svc,cv=2,method="isotonic"), CalibratedClassifierCV(svc,cv=2,method="sigmoid")]for clf,name in zip(models_svc,name_svc): clf.fit(Xtrain,ytrain) y_pred=clf.predict(Xtest) if hasattr(clf,"predict_proba"): prob_pos=clf.predict_proba(Xtest)[:,1] else: prob_pos=clf.decision_function(Xtest) prob_pos=(prob_pos-prob_pos.min())/(prob_pos.max()-prob_pos.min()) clf_score=brier_score_loss(ytest,prob_pos,pos_label=prob_pos.max()) score=clf.score(Xtest,ytest) print("{}:".format(name)) print("\tBrier:{:.4f}".format(clf_score)) print("\tAccuracy:{:.4f}".format(score))