Data analysis

다중회귀 Multiple Regression (SAS)

keepgroovin' 2016. 4. 9. 22:25

​​1. 다중공선성을 피하기 위한 평균 중심화 방법

- 설명변수가 0을 중심으로 분포 되게끔, proc means 문으로 구한 평균을 각 값에서 빼준다

Urban.rate_c = urban.rate - 평균값 of urban.rate

- 예시
PROC GLM;
Model NDSymptoms=MajorDepLife urban.rate_c / solution ;
Run;

(해석) estimate가 MajorDepLife=1.342 urban.rate_c=0.036 --> MajorDepLife가 설명변수이고 urban.rate_c는 potential confounder

- 평균 중심화 관련 보다 자세한 설명은 하단 참조
http://www.statedu.com/term/105360

- ​Confidence Intervals ~ clmparm 옵션
어떤 변수가 95% 신뢰구간이 -0.2~0.7 나왔다
==> ​0이 포함 = no association 인 변수



2. Polynominal Regression



(해석) dgree=1 : 1차식, dgree=2 : 2차식



- Polynominal하려면 꼭 변수를 centering 해줘야

프로그램문 예
PROC GLM;
model female.employ.rate = urban.rate_c / solution clparm ;
Run;

PROC GLM;
model female.employ.rate = urban.rate_c*urban.rate_c / solution clparm ;
Run;

- centering significantly reduces thr correlation between the linear and quadratic variables in a polynominal model

- 유의할 사항 : the overfitted model may be biased toward the sample it was tested on and as a result may not fit as well in another sample. ~ bias-variance Tradeoff


3. Evaluating Model Fit

- evaluating the fit of model : ​significance of the regression coefficients and their confidence intervals.

- specification : the processing of developing regression model , 좋은 모델은 error terms are not correlated with 설명변수, residuals이 0 위주로 랜덤하게 분포되어 있어야

<-> ​Misspecification : 중요한 설명변수를 누락했거나 혹은 Linear Regression Assumptions 위배했거나



- 알아내기 ​violation of Assumptions by examining model residuals
-> 여기에서는 잔차들을 시각화 하는 방법을 알려줌

[방법1] PROC GLM 문

PROC GLM PLOTS (unpack) = all ;
Model female.employ.rate = urban.rate_c*urban.rate_c internet.use.rate. / solution clparm ;
Output ​residual=res student=stdres ​out=results ;
Run;

(해석)
Unpack : 각 변수별 산점도 별도로 그리기
residual=res : 잔차가 unstandardized
student=stdres : ask for a column with standardized residuals

Q-Q plot of residual for female.employ.rate
: 45도 직선에 가까울수록 잔차들이 normally distribited


[방법2] PROC GPLOT 문

standardized residual로 산점도 그리기

PROC GPLOT ;
label stdres ="standardized residual" country ="country" ;
Plot stdres * country / ​vref=0 ;

(해석) 평균값(잔차들의 평균은 0)에다가 수평선을 그려라




standardized residual의 standard deviation이 -2 와 +2 사이에 분포, 95% 범위 내 분포하나 2를 벗어나는 값이 조금 존재
-> fit 높이려면 설명변수 더 집어넣어야

​​ [방법3] PROC REG 문

PROC REG PLOTS = ​partial ;
Model female.employ.rate = urban.rate. urban.rate_c*urban.rate_c internet.use.rate. / partial ;
Run;




****************************************************************************************************

MULTIPLE REGRESSION

****************************************************************************************************;


* centering quantitative number of cigarettes smoked (also age for later regression);

* print mean;

PROC MEANS;

var numbercigsmoked age;

run;


* centering (subtract mean);

data new2;

set new;

numbercigsmoked_c = numbercigsmoked - 13.3642586;

age_c = age - 21.6053030;

run;


* check coding;

PROC MEANS;

var numbercigsmoked_c age_c;

run;



* multiple regression model with centered number cigarettes smoked;

PROC GLM; model NDSymptoms=MAJORDEPLIFE numbercigsmoked_c /solution;

run;



* regression model with dysthymia;

PROC GLM; model NDSymptoms=DYSLIFE /solution;

run;


* adding depression;

PROC GLM; model NDSymptoms=DYSLIFE MAJORDEPLIFE/solution;

run;


* adding more explanatory variables;

PROC  GLM; model NDSymptoms=dyslife majordeplife numbercigsmoked_c age_c SEX/solution;

run;


* confidence intervals for parameter estimates;

PROC  GLM; model NDSymptoms=dyslife majordeplife numbercigsmoked_c age_c SEX/solution clparm;

run;



=============================예제=============================

/*

MAJORDEPLIFE = MAJOR DEPRESSION - LIFETIME

bHV_PRBL = 부모형제 중 BEHAVIOR PROBLEM

DRINKNUMPERYEAR = 1년동안 마시는 주량

CIGARNUMPERYEAR= 1년동안 피는 담배량

*/

DATA NEW2 (keep=idnum S2AQ4B S2AQ21C S3AQ3B1 period FREQ_DRINK_YEAR CNT_DRINK DAY_CIGAR FREQ_CIGAR_YEAR bHV_PRBL MAJORDEPLIFE) 

; SET nesarc_pds;

if S2AQ4B='1' then period=1;

else if S2AQ4B='2' then period=1.5;

else if S2AQ4B='3' then period=2;

else if S2AQ4B='4' then period=3.5;

else if S2AQ4B='5' then period=7;

else if S2AQ4B='6' then period=15;

else if S2AQ4B='7' then period=30;

else if S2AQ4B='8' then period=40;

else if S2AQ4B='9' then period=120;

else if S2AQ4B='10' then period=300;

else if S2AQ4B NOT IN ('1','2','3','4','5','6','7','8','9','10') THEN period=0;

IF PERIOD^=0 THEN FREQ_DRINK_YEAR=360/PERIOD;  ELSE FREQ_DRINK_YEAR=0;

IF 1<=S2AQ21C <=98 THEN S2AQ21C =CNT_DRINK; ELSE CNT_DRINK=0;

/*S2AQ21C = LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS */

IF S3AQ3B1='1' THEN DAY_CIGAR=30;

ELSE IF S3AQ3B1='2' THEN DAY_CIGAR=22;

ELSE IF S3AQ3B1='3' THEN DAY_CIGAR=17;

ELSE IF S3AQ3B1='4' THEN DAY_CIGAR=6;

ELSE IF S3AQ3B1='5' THEN DAY_CIGAR=3;

ELSE IF S3AQ3B1='6' THEN DAY_CIGAR=1;

ELSE IF S3AQ3B1 NOT IN ('1','2','3','4','5','6')THEN DAY_CIGAR=0;

FREQ_CIGAR_YEAR =DAY_CIGAR*12;


IF S11BQ1='1' OR S11BQ2='1' OR S11BQ3C='1' OR S11BQ4C='1' THEN bHV_PRBL=1 ; 

ELSE bHV_PRBL=0;

RUN;


proc means data=new2; var FREQ_DRINK_YEAR ; run; 


PROC GLM data=new2 PLOT(UNPACK)=ALL ;

MODEL FREQ_CIGAR_YEAR = FREQ_DRINK_YEAR MAJORDEPLIFE bHV_PRBL/SOLUTION CLPARM;

OUTPUT RESIDUAL=RES STUDENT=STDRES OUT=NEW3;

RUN;