博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
R-Logistic回归
阅读量:5319 次
发布时间:2019-06-14

本文共 9126 字,大约阅读时间需要 30 分钟。

广义线性模型已经不再局限于各个变量之间得服从正态分布和连续性变量要求;

其主要采用最大似然估计法进行系数计算,可应用于非常多的特定场景。

常用的有两个:

  • logistics回归----解决通过一系列连续性或类别型变量来预测二值结果变量;
  • Posion回归-------解决通过一系列连续性或类别型变量来预测计数型结果变量;

我们将通过AER包的Affair数据集(国外婚外情调查数据)来探究是哪些具体的、重要的因素会让人产生婚外情,

以及婚姻评分对婚外情的影响。

 

1、数据探索

#affair=婚外情事件次数#gender=性别;female:男性;male:女性#yearmarried=结婚年份#religiousness=宗教信仰#occuption=职业#rating=婚姻评价 data(Affairs,package = "AER")summary(Affairs)par(mfrow=c(3,3))name<-names(Affairs)for (i in 1:9) {           plot(table(Affairs[,i]),type = "b",main=name[i])}  > summary(Affairs)    affairs          gender         age         yearsmarried    children  religiousness  Min.   : 0.000   female:315   Min.   :17.50   Min.   : 0.125   no :171   Min.   :1.000  1st Qu.: 0.000   male  :286   1st Qu.:27.00   1st Qu.: 4.000   yes:430   1st Qu.:2.000  Median : 0.000                Median :32.00   Median : 7.000             Median :3.000  Mean   : 1.456                Mean   :32.49   Mean   : 8.178             Mean   :3.116  3rd Qu.: 0.000                3rd Qu.:37.00   3rd Qu.:15.000             3rd Qu.:4.000  Max.   :12.000                Max.   :57.00   Max.   :15.000             Max.   :5.000     education       occupation        rating     Min.   : 9.00   Min.   :1.000   Min.   :1.000  1st Qu.:14.00   1st Qu.:3.000   1st Qu.:3.000  Median :16.00   Median :5.000   Median :4.000  Mean   :16.17   Mean   :4.195   Mean   :3.932  3rd Qu.:18.00   3rd Qu.:6.000   3rd Qu.:5.000  Max.   :20.00   Max.   :7.000   Max.   :5.000

  

通过数据探索,得到每个数的5个基本统计指标和频数可视化我们得到如下简要结论:
  1. affair的婚外偷腥次数最多达到了12次,最小值为0次,并且0次居多,也就是说大部分人(72%)是不进行婚外情的。
  2. gender中男女比例从数值和图示来看基本一致男(47%);
  3. age代表被调查者的年龄分布来看在区间22-37岁年龄段调查者居多,样本年龄中位数为32岁;
  4. yearsmarried表示结婚年限,15年的老夫妻竟然很多!其他年龄段相对较均匀;
  5. children结合具体数值来看有72%的人有孩子;
  6. ...
 

2、数据处理

Affairs$ynaffair[Affairs$affairs>0]<-1Affairs$ynaffair[Affairs$affairs==0]<-0Affairs$ynaffair<-factor(Affairs$ynaffair,                         levels = c(0,1),                         labels = c("NO","YES"))table(Affairs$ynaffair)head(Affairs)  > table(Affairs$ynaffair) NO YES451 150> head(Affairs)   affairs gender age yearsmarried children religiousness education occupation rating ynaffair4        0   male  37        10.00       no             3        18          7      4       NO5        0 female  27         4.00       no             4        14          6      4       NO11       0 female  32        15.00      yes             1        12          1      4       NO16       0   male  57        15.00      yes             5        18          6      5       NO23       0   male  22         0.75       no             2        17          6      3       NO29       0 female  32         1.50       no             2        17          5      5       NO

3、logisic回归

#logisic回归fit.full<-glm(ynaffair~gender+age+yearsmarried+children+                religiousness+education+occupation+rating,              data = Affairs,              family = binomial(link = "logit"))summary(fit.full)#逐步回归glm.full<-step(fit.full) summary(glm.full)#两者回归比较(变量选择评价)anova(fit.full,glm.full,test = "Chisq")#对逐步回归结果系数解释coef(glm.full)#对数优势比exp(coef(glm.full))#优势比  logisitc回归结果> summary(fit.full) Call:glm(formula = ynaffair ~ gender + age + yearsmarried + children +    religiousness + education + occupation + rating, family = binomial(link = "logit"),    data = Affairs) Deviance Residuals:    Min       1Q   Median       3Q      Max  -1.5713  -0.7499  -0.5690  -0.2539   2.5191   Coefficients:              Estimate Std. Error z value Pr(>|z|)    (Intercept)    1.37726    0.88776   1.551 0.120807    gendermale     0.28029    0.23909   1.172 0.241083    age           -0.04426    0.01825  -2.425 0.015301 *  yearsmarried   0.09477    0.03221   2.942 0.003262 **childrenyes    0.39767    0.29151   1.364 0.172508    religiousness -0.32472    0.08975  -3.618 0.000297 ***education      0.02105    0.05051   0.417 0.676851    occupation     0.03092    0.07178   0.431 0.666630    rating        -0.46845    0.09091  -5.153 2.56e-07 ***---Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1)     Null deviance: 675.38  on 600  degrees of freedomResidual deviance: 609.51  on 592  degrees of freedomAIC: 627.51 Number of Fisher Scoring iterations: 4 #逐步回归过程> glm.full<-step(fit.full)Start:  AIC=627.51ynaffair ~ gender + age + yearsmarried + children + religiousness +    education + occupation + rating                 Df Deviance    AIC- education      1   609.68 625.68- occupation     1   609.70 625.70- gender         1   610.89 626.89..............Step:  AIC=623.86ynaffair ~ gender + age + yearsmarried + religiousness + rating                 Df Deviance    AIC
611.86 623.86- gender 1 615.36 625.36- age 1 618.05 628.05- religiousness 1 625.57 635.57- yearsmarried 1 626.23 636.23- rating 1 639.93 649.93 #逐步回归结果> summary(glm.full)#逐步回归 Call:glm(formula = ynaffair ~ gender + age + yearsmarried + religiousness + rating, family = binomial(link = "logit"), data = Affairs) Deviance Residuals: Min 1Q Median 3Q Max -1.5623 -0.7495 -0.5664 -0.2671 2.3975 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 1.94760 0.61234 3.181 0.001470 **gendermale 0.38612 0.20703 1.865 0.062171 . age -0.04393 0.01806 -2.432 0.015011 * yearsmarried 0.11133 0.02983 3.732 0.000190 ***religiousness -0.32714 0.08947 -3.656 0.000256 ***rating -0.46721 0.08928 -5.233 1.67e-07 ***---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 675.38 on 600 degrees of freedomResidual deviance: 611.86 on 595 degrees of freedomAIC: 623.86 Number of Fisher Scoring iterations: 4 #两者回归方式进行检验> anova(fit.full,glm.full,test = "Chisq")Analysis of Deviance Table Model 1: ynaffair ~ gender + age + yearsmarried + children + religiousness + education + occupation + ratingModel 2: ynaffair ~ gender + age + yearsmarried + religiousness + rating Resid. Df Resid. Dev Df Deviance Pr(>Chi)1 592 609.51 2 595 611.86 -3 -2.3486 0.5033

4、增强系数解释(概率)与预测(新样本)

#由于采用书本知识原因,不采用逐步回归方法进行预测;(判断方法和之前基本一致)#采用第一次回归的系数最显著地4个指标进行logistics回归fit.reduce<-glm(ynaffair~rating+age+yearsmarried+religiousness,                data = Affairs,family = binomial())summary(fit.reduce)anova(fit.reduce,fit.full,test = "Chisq")  #新数据集(探究婚姻满意度对婚外情的影响)testdata<-data.frame(rating=c(1,2,3,4,5),                     age=mean(Affairs$age),                     yearsmarried=mean(Affairs$yearsmarried),                     religiousness=mean(Affairs$religiousness))testdatatestdata$prob<-predict(fit.reduce,newdata = testdata,type = "response")testdata #新数据集2(探究不同年龄对婚外情的影响)testdata2<-data.frame(rating=mean(Affairs$rating),                     age=seq(17,57,10),                     yearsmarried=mean(Affairs$yearsmarried),                     religiousness=mean(Affairs$religiousness))testdata2testdata2$prob2<-predict(fit.reduce,newdata = testdata,type = "response")testdata2  > testdata  rating      age yearsmarried religiousness      prob1      1 32.48752     8.177696      3.116473 0.53022962      2 32.48752     8.177696      3.116473 0.41573773      3 32.48752     8.177696      3.116473 0.30967124      4 32.48752     8.177696      3.116473 0.22045475      5 32.48752     8.177696      3.116473 0.1513079 > testdata2   rating age yearsmarried religiousness     prob21 3.93178  17     8.177696      3.116473 0.53022962 3.93178  27     8.177696      3.116473 0.41573773 3.93178  37     8.177696      3.116473 0.30967124 3.93178  47     8.177696      3.116473 0.22045475 3.93178  57     8.177696      3.116473 0.1513079

5、结果可视化

par(mfrow=c(1,2))plot(testdata2$prob,type = "b",     xlab = "rating",ylab = "prob",main = "婚姻满意度对婚外情影响",     cex=3,lty=13,lwd=3,col=rainbow(5),pch=19,     cex.lab=1.3,cex.axis=1.3,cex.main=1.5,)plot(testdata2$prob2,type = "b",     xlab = "age",ylab = "prob",main = "不同年龄段婚外情概率",     cex=3,lty=18,lwd=3,col=rainbow(5),pch=17,     cex.lab=1.3,cex.axis=1.3,cex.main=1.5,)

  

结论:控制其他因素影响,在相同条件下,rating(婚姻满意度)越高,婚外情的概率越低;
age(年龄)越高,婚外情的概率越低;
 

6、回归检验

过度离势会造成奇异的标准误检验和不精确的显著性检验,使得p值等失效,失去回归的可信度,因此要对回归进行过度离势检验。
> deviance(fit.reduce)/df.residual(fit.reduce)
[1] 1.03248
过度离势接近1,表明没有过度离势,P值和标准误检验有效。
 
 

 

附录

1、系数的优势比方面解释较为困难,可参考网址
(3)对一般Logistic模型系数的解释 - 简书
#由于p值没有通过,因此采用逐步回归结果进行系数解释
> coef(glm.full)#对数优势比  (Intercept)    gendermale           age  yearsmarried   1.94760307    0.38612217   -0.04392545    0.11132715religiousness        rating  -0.32714238   -0.46721157> exp(coef(glm.full))#优势比  (Intercept)    gendermale           age  yearsmarried    7.0118605     1.4712644     0.9570253     1.1177605religiousness        rating    0.7209811     0.6267475
2、chi-square test)或称卡方检验,是一种用途较广的假设检验方法。可以分为成组比较(不配对资料)和个别比较(配对,或同一对象两种处理的比较)两类。
此处用来进行同一结果的两种回归进行比较,如果卡方检验通过了,表明两者回归本质是一样的,如果没有通过,那么两个回归就是不一样的,也就是说,通过逐步回归得到的(较少变量拟合)结果方式完全不一样,取得的效果确实更好的,因此可以采用后者方法进行解释。
 
借用一下王老板的Q版表情包~--------------蔡磊---原创笔记  2018-11-25
 

转载于:https://www.cnblogs.com/Clactionxy/p/10016241.html

你可能感兴趣的文章
【UVA 1380】 A Scheduling Problem (树形DP)
查看>>
数学符号?
查看>>
山东省第四届蓝桥杯 ///题目标题:前缀判断//c/c++组
查看>>
sql server实现主从复制
查看>>
Aras 引入外部的dll
查看>>
走楼梯
查看>>
C# JSON字符串序列化与反序列化
查看>>
HTTPS、SPDY和HTTP/2的性能比较
查看>>
GRPC单向/双向流
查看>>
Ubuntu18.10下配置PHP7.2 + Apache2 + Mysql5.7
查看>>
Java变量类型,实例变量 与局部变量 静态变量
查看>>
Angular实践----理解数据绑定过程
查看>>
sublime快捷键
查看>>
mysql操作命令梳理(4)-中文乱码问题
查看>>
Hyper-V Centos7 网络设置 虚拟机固定IP
查看>>
Python环境搭建(安装、验证与卸载)
查看>>
android菜鸟学习笔记5----第一个android程序
查看>>
2.5 非透明PCI桥 分类: 浅谈PCI 2...
查看>>
POJ 2263 Heavy Cargo(Floyd + map)
查看>>
Macbook pro 下修改MySQL数据库密码
查看>>