1. 다중공선성을 피하기 위한 평균 중심화 방법
- 설명변수가 0을 중심으로 분포 되게끔, proc means 문으로 구한 평균을 각 값에서 빼준다
Urban.rate_c = urban.rate - 평균값 of urban.rate
- 예시
PROC GLM;
Model NDSymptoms=MajorDepLife urban.rate_c / solution ;
Run;
(해석) estimate가 MajorDepLife=1.342 urban.rate_c=0.036 --> MajorDepLife가 설명변수이고 urban.rate_c는 potential confounder
- 평균 중심화 관련 보다 자세한 설명은 하단 참조
http://www.statedu.com/term/105360
- Confidence Intervals ~ clmparm 옵션
어떤 변수가 95% 신뢰구간이 -0.2~0.7 나왔다
==> 0이 포함 = no association 인 변수
2. Polynominal Regression
(해석) dgree=1 : 1차식, dgree=2 : 2차식
- Polynominal하려면 꼭 변수를 centering 해줘야
프로그램문 예
PROC GLM;
model female.employ.rate = urban.rate_c / solution clparm ;
Run;
PROC GLM;
model female.employ.rate = urban.rate_c*urban.rate_c / solution clparm ;
Run;
- centering significantly reduces thr correlation between the linear and quadratic variables in a polynominal model
- 유의할 사항 : the overfitted model may be biased toward the sample it was tested on and as a result may not fit as well in another sample. ~ bias-variance Tradeoff
3. Evaluating Model Fit
- evaluating the fit of model : significance of the regression coefficients and their confidence intervals.
- specification : the processing of developing regression model , 좋은 모델은 error terms are not correlated with 설명변수, residuals이 0 위주로 랜덤하게 분포되어 있어야
<-> Misspecification : 중요한 설명변수를 누락했거나 혹은 Linear Regression Assumptions 위배했거나
- 알아내기 violation of Assumptions by examining model residuals
-> 여기에서는 잔차들을 시각화 하는 방법을 알려줌
[방법1] PROC GLM 문
PROC GLM PLOTS (unpack) = all ;
Model female.employ.rate = urban.rate_c*urban.rate_c internet.use.rate. / solution clparm ;
Output residual=res student=stdres out=results ;
Run;
(해석)
Unpack : 각 변수별 산점도 별도로 그리기
residual=res : 잔차가 unstandardized
student=stdres : ask for a column with standardized residuals
Q-Q plot of residual for female.employ.rate
: 45도 직선에 가까울수록 잔차들이 normally distribited
[방법2] PROC GPLOT 문
standardized residual로 산점도 그리기
PROC GPLOT ;
label stdres ="standardized residual" country ="country" ;
Plot stdres * country / vref=0 ;
(해석) 평균값(잔차들의 평균은 0)에다가 수평선을 그려라
standardized residual의 standard deviation이 -2 와 +2 사이에 분포, 95% 범위 내 분포하나 2를 벗어나는 값이 조금 존재
-> fit 높이려면 설명변수 더 집어넣어야
[방법3] PROC REG 문
PROC REG PLOTS = partial ;
Model female.employ.rate = urban.rate. urban.rate_c*urban.rate_c internet.use.rate. / partial ;
Run;
****************************************************************************************************
MULTIPLE REGRESSION
****************************************************************************************************;
* centering quantitative number of cigarettes smoked (also age for later regression);
* print mean;
PROC MEANS;
var numbercigsmoked age;
run;
* centering (subtract mean);
data new2;
set new;
numbercigsmoked_c = numbercigsmoked - 13.3642586;
age_c = age - 21.6053030;
run;
* check coding;
PROC MEANS;
var numbercigsmoked_c age_c;
run;
* multiple regression model with centered number cigarettes smoked;
PROC GLM; model NDSymptoms=MAJORDEPLIFE numbercigsmoked_c /solution;
run;
* regression model with dysthymia;
PROC GLM; model NDSymptoms=DYSLIFE /solution;
run;
* adding depression;
PROC GLM; model NDSymptoms=DYSLIFE MAJORDEPLIFE/solution;
run;
* adding more explanatory variables;
PROC GLM; model NDSymptoms=dyslife majordeplife numbercigsmoked_c age_c SEX/solution;
run;
* confidence intervals for parameter estimates;
PROC GLM; model NDSymptoms=dyslife majordeplife numbercigsmoked_c age_c SEX/solution clparm;
run;
=============================예제=============================
/*
MAJORDEPLIFE = MAJOR DEPRESSION - LIFETIME
bHV_PRBL = 부모형제 중 BEHAVIOR PROBLEM
DRINKNUMPERYEAR = 1년동안 마시는 주량
CIGARNUMPERYEAR= 1년동안 피는 담배량
*/
DATA NEW2 (keep=idnum S2AQ4B S2AQ21C S3AQ3B1 period FREQ_DRINK_YEAR CNT_DRINK DAY_CIGAR FREQ_CIGAR_YEAR bHV_PRBL MAJORDEPLIFE)
; SET nesarc_pds;
if S2AQ4B='1' then period=1;
else if S2AQ4B='2' then period=1.5;
else if S2AQ4B='3' then period=2;
else if S2AQ4B='4' then period=3.5;
else if S2AQ4B='5' then period=7;
else if S2AQ4B='6' then period=15;
else if S2AQ4B='7' then period=30;
else if S2AQ4B='8' then period=40;
else if S2AQ4B='9' then period=120;
else if S2AQ4B='10' then period=300;
else if S2AQ4B NOT IN ('1','2','3','4','5','6','7','8','9','10') THEN period=0;
IF PERIOD^=0 THEN FREQ_DRINK_YEAR=360/PERIOD; ELSE FREQ_DRINK_YEAR=0;
IF 1<=S2AQ21C <=98 THEN S2AQ21C =CNT_DRINK; ELSE CNT_DRINK=0;
/*S2AQ21C = LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS */
IF S3AQ3B1='1' THEN DAY_CIGAR=30;
ELSE IF S3AQ3B1='2' THEN DAY_CIGAR=22;
ELSE IF S3AQ3B1='3' THEN DAY_CIGAR=17;
ELSE IF S3AQ3B1='4' THEN DAY_CIGAR=6;
ELSE IF S3AQ3B1='5' THEN DAY_CIGAR=3;
ELSE IF S3AQ3B1='6' THEN DAY_CIGAR=1;
ELSE IF S3AQ3B1 NOT IN ('1','2','3','4','5','6')THEN DAY_CIGAR=0;
FREQ_CIGAR_YEAR =DAY_CIGAR*12;
IF S11BQ1='1' OR S11BQ2='1' OR S11BQ3C='1' OR S11BQ4C='1' THEN bHV_PRBL=1 ;
ELSE bHV_PRBL=0;
RUN;
proc means data=new2; var FREQ_DRINK_YEAR ; run;
PROC GLM data=new2 PLOT(UNPACK)=ALL ;
MODEL FREQ_CIGAR_YEAR = FREQ_DRINK_YEAR MAJORDEPLIFE bHV_PRBL/SOLUTION CLPARM;
OUTPUT RESIDUAL=RES STUDENT=STDRES OUT=NEW3;
RUN;