[R] Logistic regression

2020. 1. 21. 14:42


BioinformaticsAndMe

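Logistic regression

: Logistic regression is a regression method used when the dependent (response) variable is categorical

: The dependent variable y is binary, taking one of two values: success (1) or failure (0)

*e.g., whether a patient died, whether an infectious disease broke out, whether a traffic accident occurred

: Logistic regression is a supervised learning method, used to classify and predict a specific outcome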
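
The model behind these statements can be written out as follows. This is the standard textbook form of logistic regression; the symbols p, x_1, ..., x_k and \beta_0, ..., \beta_k are notation introduced here for illustration and do not appear in the original post.

```latex
p = P(y = 1 \mid x_1, \dots, x_k), \qquad
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
```

Each \beta_j is the change in the log-odds of y = 1 for a one-unit increase in x_j, so e^{\beta_j} is the corresponding odds ratio. That is the quantity computed in step 4.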

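Generalized linear model

: The generalized linear model extends the linear model to dependent variables that do not follow a normal distribution; logistic regression and Poisson regression are both examples

: In R, generalized linear models are fit with the built-in glm() function

: For logistic regression, the 'family=binomial' argument must be passed to glm()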
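
As a minimal sketch of what the 'family=binomial' argument changes, the example below fits glm() on R's built-in mtcars data. That data set is only an illustrative stand-in chosen here (it is not used in the original post); its am column is a 0/1 variable.

```r
# Minimal sketch: logistic regression with glm() on the built-in mtcars data.
# mtcars is an illustrative stand-in; am (0 = automatic, 1 = manual) is binary.
fit <- glm(am ~ wt, family = binomial, data = mtcars)

# Coefficients are on the log-odds (logit) scale
print(coef(fit))

# Without family = binomial, glm() falls back to the gaussian family,
# i.e. an ordinary linear regression
fit_default <- glm(am ~ wt, data = mtcars)
print(family(fit_default)$family)
```

The binomial family uses the logit link by default, which is why no link needs to be given explicitly.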

1. Loading the colon cancer practice data

# colon data from the survival package: 1858 rows (each patient id appears twice)

install.packages("survival")
library(survival)
str(colon)

'data.frame':   1858 obs. of  16 variables:
 $ id      : num  1 1 2 2 3 3 4 4 5 5 ...
 $ study   : num  1 1 1 1 1 1 1 1 1 1 ...
 $ rx      : Factor w/ 3 levels "Obs","Lev","Lev+5FU": 3 3 3 3 1 1 3 3 1 1 ...
 $ sex     : num  1 1 1 1 0 0 0 0 1 1 ...
 $ age     : num  43 43 63 63 71 71 66 66 69 69 ...
 $ obstruct: num  0 0 0 0 0 0 1 1 0 0 ...
 $ perfor  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ adhere  : num  0 0 0 0 1 1 0 0 0 0 ...
 $ nodes   : num  5 5 1 1 7 7 6 6 22 22 ...
 $ status  : num  1 1 0 0 1 1 1 1 1 1 ...
 $ differ  : num  2 2 2 2 2 2 2 2 2 2 ...
 $ extent  : num  3 3 3 3 2 2 3 3 3 3 ...
 $ surg    : num  0 0 0 0 0 0 1 1 1 1 ...
 $ node4   : num  1 1 0 0 1 1 1 1 1 1 ...
 $ time    : num  1521 968 3087 3087 963 ...
 $ etype   : num  2 1 2 1 2 1 2 1 2 1 ...
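
The str() output shows each id twice with etype alternating between 2 and 1, so the 1858 rows are records rather than distinct patients. A quick check, assuming the survival package is installed, could look like this. According to the survival documentation, etype 1 marks the recurrence record and etype 2 the death record.

```r
library(survival)  # provides the colon data set

# Number of rows versus number of distinct patients
n_rows     <- nrow(colon)
n_patients <- length(unique(colon$id))
cat("rows:", n_rows, " patients:", n_patients, "\n")

# Records per patient, and how the rows split by event type
print(table(table(colon$id)))
print(table(colon$etype))
```

Because both records of a patient enter the model below, the rows are not independent observations. Filtering to a single etype first is one way to avoid that, although the original post does not do so.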

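2. Fitting the logistic regression

■ Response variable - status (1 if the colon cancer recurred or the patient died)

■ Predictor variables

- obstruct : obstruction of the colon by the tumour

- perfor : perforation of the colon

- adhere : adherence to nearby organs

- nodes : number of lymph nodes with detectable cancer

- differ : histological differentiation of the tumour (1=well, 2=moderate, 3=poor)

- extent : depth of local spread (1=submucosa, 2=muscle, 3=serosa, 4=contiguous structures)

- surg : time from surgery to registration (0=short, 1=long)

# Specify 'family=binomial' for logistic regression
# rx is included in the formula so that the code matches the Call shown in the output below

colon1<-na.omit(colon)
result<-glm(status ~ rx+sex+age+obstruct+perfor+adhere+nodes+differ+extent+surg, family=binomial, data=colon1)
summary(result)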

Call:
glm(formula = status ~ rx + sex + age + obstruct + perfor + adhere + 
    nodes + differ + extent + surg, family = binomial, data = colon1)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-2.575  -1.046  -0.584   1.119   2.070  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.430926   0.478301  -5.082 3.73e-07 ***
rxLev       -0.069553   0.122490  -0.568 0.570156    
rxLev+5FU   -0.585606   0.124579  -4.701 2.59e-06 ***
sex         -0.086161   0.101614  -0.848 0.396481    
age          0.001896   0.004322   0.439 0.660933    
obstruct     0.219995   0.128234   1.716 0.086240 .  
perfor       0.085831   0.298339   0.288 0.773578    
adhere       0.373527   0.147164   2.538 0.011144 *  
nodes        0.185245   0.018873   9.815  < 2e-16 ***
differ       0.031839   0.100757   0.316 0.752003    
extent       0.563617   0.116837   4.824 1.41e-06 ***
surg         0.388068   0.113840   3.409 0.000652 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2461.7  on 1775  degrees of freedom
Residual deviance: 2240.4  on 1764  degrees of freedom
AIC: 2264.4

Number of Fisher Scoring iterations: 4
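
Since the fitted model is meant for classification and prediction, the sketch below turns it into predicted probabilities and class labels. It refits the model so that it runs on its own. The 0.5 cutoff and the names prob and pred are assumptions made here, not part of the original post.

```r
library(survival)

# Refit the model from this step so the sketch is self-contained
colon1 <- na.omit(colon)
result <- glm(status ~ rx + sex + age + obstruct + perfor + adhere +
                nodes + differ + extent + surg,
              family = binomial, data = colon1)

# type = "response" returns probabilities; the default returns log-odds
prob <- predict(result, type = "response")

# Classify with a 0.5 cutoff (an assumed threshold)
pred <- ifelse(prob >= 0.5, 1, 0)

# Confusion table on the training data (optimistic: there is no held-out set)
print(table(observed = colon1$status, predicted = pred))
```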

3. Selecting significant variables

: Perform stepwise logistic regression by backward elimination

*For background on backward elimination, see https://bioinformaticsandme.tistory.com/290

# Drop variables that do not improve the fit (step() compares models by AIC) and refit the logistic regression model

reduced.model=step(result, direction = "backward")
summary(reduced.model)

Call:
glm(formula = status ~ rx + obstruct + adhere + nodes + extent + 
    surg, family = binomial, data = colon1)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5583  -1.0490  -0.5884   1.1213   2.0393  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.30406    0.35138  -6.557 5.49e-11 ***
rxLev       -0.07214    0.12221  -0.590 0.554978    
rxLev+5FU   -0.57807    0.12428  -4.651 3.30e-06 ***
obstruct     0.22148    0.12700   1.744 0.081179 .  
adhere       0.38929    0.14498   2.685 0.007251 ** 
nodes        0.18556    0.01850  10.030  < 2e-16 ***
extent       0.56510    0.11643   4.854 1.21e-06 ***
surg         0.38989    0.11371   3.429 0.000606 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2461.7  on 1775  degrees of freedom
Residual deviance: 2241.5  on 1768  degrees of freedom
AIC: 2257.5

Number of Fisher Scoring iterations: 4
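
Compared with the full model, the residual deviance barely rises (2240.4 to 2241.5) while four parameters are dropped, and the AIC falls from 2264.4 to 2257.5. One way to check this comparison directly is sketched below. The names full and reduced are chosen here for illustration, and the reduced formula is copied from the Call above rather than produced by step().

```r
library(survival)

colon1 <- na.omit(colon)
full <- glm(status ~ rx + sex + age + obstruct + perfor + adhere +
              nodes + differ + extent + surg,
            family = binomial, data = colon1)
# Reduced formula copied from the Call printed by summary(reduced.model)
reduced <- glm(status ~ rx + obstruct + adhere + nodes + extent + surg,
               family = binomial, data = colon1)

# Lower AIC is better
print(AIC(full, reduced))

# Likelihood ratio test: does the full model fit significantly better?
print(anova(reduced, full, test = "Chisq"))
```

A large p-value in the likelihood ratio test would suggest that the four dropped variables add little beyond the retained ones.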

4. Computing odds ratios for the predictors

: Compute the odds ratio of each predictor

# Define a function that prints the odds ratios
ORtable=function(x,digits=2){
    suppressMessages(a<-confint(x))
    result=data.frame(exp(coef(x)),exp(a))
    result=round(result,digits)
    result=cbind(result,round(summary(x)$coefficients[,4],3))
    colnames(result)=c("OR","2.5%","97.5%","p")
    result
}

ORtable(reduced.model)
              OR 2.5% 97.5%     p
(Intercept) 0.10 0.05  0.20 0.000
rxLev       0.93 0.73  1.18 0.555
rxLev+5FU   0.56 0.44  0.72 0.000
obstruct    1.25 0.97  1.60 0.081
adhere      1.48 1.11  1.96 0.007
nodes       1.20 1.16  1.25 0.000
extent      1.76 1.41  2.22 0.000
surg        1.48 1.18  1.85 0.001
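
To see where these numbers come from, the sketch below recomputes a few odds ratios by hand from the estimates and standard errors printed in step 3. It uses a Wald interval (estimate ± z × standard error), which is an approximation chosen here. For a glm, confint() computes profile-likelihood intervals, so the bounds can differ slightly in the second decimal place.

```r
# Estimates and standard errors copied from the reduced-model summary above
est <- c(nodes = 0.18556, extent = 0.56510, surg = 0.38989)
se  <- c(nodes = 0.01850, extent = 0.11643, surg = 0.11371)

# Wald interval on the log-odds scale, then exponentiate to the odds-ratio scale
z <- qnorm(0.975)
wald <- data.frame(OR    = exp(est),
                   lower = exp(est - z * se),
                   upper = exp(est + z * se))
print(round(wald, 2))
```

For example, the nodes odds ratio of about 1.20 means that each additional positive lymph node multiplies the odds of recurrence or death by roughly 1.2, holding the other predictors fixed.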

# Visualize the odds ratios
install.packages("moonBook")
library(moonBook)
odds_ratio = ORtable(reduced.model)
odds_ratio = odds_ratio[2:nrow(odds_ratio),]
HRplot(odds_ratio, type=2, show.CI=TRUE, cex=2)

