R, Stata, and SAS code for implementing multiple imputation


Kara Rudolph, Elizabeth Stuart

Johns Hopkins Bloomberg School of Public Health

September 25, 2012

1 R: MICE using the mi package

This section uses the mi package in R. You will need to install it on your computer using the command install.packages("mi"). The imputation step further below uses the mice package, which can be installed the same way. See here for more details: http://www.stat.columbia.edu/~gelman/research/published/mipaper.pdf

First, I set up the dataset. I substituted numbers for ordinal responses and reclassified some variables (e.g., unordered categorical variables=factor, binary variables=factor, and ordered categorical variables=integer/numeric).
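As a rough illustration of this kind of recoding, here is a minimal sketch in base R. The toy data frame, its response labels, and the object names raw and dat are hypothetical and are not taken from the actual dataset; only the idea (factors for unordered and binary variables, integers for ordered ones) comes from the text above.

```r
# Hypothetical toy data; the response labels are made up for illustration.
raw <- data.frame(
  mrace  = c("Hispanic", "Non-Hispanic White", NA, "Hispanic"),
  vagdel = c("yes", "no", "yes", NA),
  strs   = c("never", "sometimes", "often", "sometimes"),
  stringsAsFactors = FALSE
)

dat <- raw
# Unordered categorical variable: store as a factor.
dat$mrace <- factor(raw$mrace)
# Binary variable: store as a two-level factor.
dat$vagdel <- factor(raw$vagdel, levels = c("no", "yes"))
# Ordered categorical variable: substitute numbers (1, 2, 3) for the ordinal responses.
dat$strs <- as.integer(factor(raw$strs, levels = c("never", "sometimes", "often")))

str(dat)
```

Missing values stay NA under each of these conversions, so they remain available for imputation later.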


Table 1. Number and percent missing data (percent.nas is reported as a proportion)

              number.nas  percent.nas
batch2                 0       0.0000
modeprt2               0       0.0000
mmhbp                 12       0.0084
vagdel                 3       0.0021
mrace                 56       0.0390
bornus                48       0.0334
plural2                0       0.0000
married2               0       0.0000
matdeg                 3       0.0021
pnc                   58       0.0404
lga2                 107       0.0745
sga                  107       0.0745
kotelchuck2           37       0.0258
prelb                  2       0.0014
gestwk                44       0.0306
bmi                    3       0.0021
wrknow                50       0.0348
income2              307       0.2138
medicaid2             13       0.0091
wicpreg               21       0.0146
pp_drink2             54       0.0376
diabetes              34       0.0237
preexer               36       0.0251
infert_tx2            45       0.0313
FEEL_PG               34       0.0237
smoke2                43       0.0299
strs                  66       0.0460
inficu                45       0.0313
los2                  52       0.0362
bfeed2                62       0.0432
back_sleep2           82       0.0571
cosleep2              92       0.0641
ppvchk                33       0.0230
inf_age2               0       0.0000
sad                  118       0.0822
nhope                161       0.1121
slow                 159       0.1107
matage                23       0.0160
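A table like this one can be produced with a few lines of base R. The sketch below is an assumption about how such a summary could be computed, not the code used for Table 1; the toy data frame and its values are made up, and only the column names number.nas and percent.nas follow the table.

```r
# Toy data frame standing in for the real dataset; the values are made up.
toy <- data.frame(
  bmi     = c(22.1, NA, 27.5, 30.2, 24.0),
  income2 = c(1, NA, NA, 3, 2),
  sga     = c(0, 1, 0, 0, 1)
)

# Count the NAs per column and express them as a proportion of rows.
miss <- data.frame(
  number.nas  = colSums(is.na(toy)),
  percent.nas = round(colMeans(is.na(toy)), 4)
)
print(miss)
```

Multiplying percent.nas by 100 would give a true percentage if that is preferred.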

The Figure below shows the missing data pattern. This was created with the mi package in R.

> library(mi)
> missing.pattern.plot(data, y.order=TRUE, gray.scale=TRUE,
+     main="Missing Pattern Plot", xlab="index")

Figure 1. Missing data pattern. (Plot title: Missing Pattern Plot; x-axis: index; y-axis: one row per variable.)

Next, I imputed the data with the mice package.

> library(mice)
> load("data.Rdata")
> pred <- quickpred(data, mincor=0.01)
> ## quickpred() makes the predictor matrix.
> ## I modified it slightly (mincor=0.01) to choose predictors
> ## whose correlation with the variable they are predicting
> ## is at least 0.01.

> pred

            batch2 modeprt2 wtanal stratumc mmhbp vagdel mrace bornus plural2
batch2           0        0      0        0     0      0     0      0       0
modeprt2         0        0      0        0     0      0     0      0       0
wtanal           0        0      0        0     0      0     0      0       0
stratumc         0        0      0        0     0      0     0      0       0
mmhbp            1        0      1        1     0      1     0      1       1
vagdel           1        1      1        1     1      0     1      1       1
mrace            1        1      1        1     1      1     0      1       1
bornus           1        1      1        1     1      1     1      0       1
plural2          0        0      0        0     0      0     0      0       0
married2         0        0      0        0     0      0     0      0       0
matdeg           1        1      1        1     1      1     1      1       1
pnc              0        1      1        1     1      1     1      1       1
lga2             1        1      1        1     1      1     1      1       1
sga              1        1      1        1     1      1     1      1       1
kotelchuck2      1        1      1        1     1      1     1      1       1
prelb            1        1      0        1     1      1     1      1       1
gestwk           1        1      1        1     1      1     1      1       1
bmi              0        1      1        1     1      1     1      1       1
wrknow           1        1      1        1     1      1     1      1       0
income2          1        1      1        0     1      1     1      1       1
medicaid2        0        1      1        1     1      1     1      1       1
wicpreg          1        1      1        1     1      1     1      1       1
pp_drink2        1        1      1        1     1      1     1      1       1
diabetes         1        1      1        1     1      1     1      1       1
preexer          1        1      1        1     1      1     1      1       1
infert_tx2       1        1      1        1     1      1     1      1       1
FEEL_PG          0        1      1        1     1      1     1      1       1
smoke2           1        1      1        1     1      1     1      1       1
strs             1        1      1        1     1      1     1      1       1
inficu           1        1      1        1     1      1     1      1       1
los2             1        1      1        1     1      1     1      1       1
bfeed2           1        1      1        1     1      1     1      1       1
back_sleep2      1        1      1        1     1      1     1      1       1
cosleep2         1        1      1        1     1      1     1      1       1
ppvchk           1        1      1        1     1      1     1      1       1
inf_age2         0        0      0        0     0      0     0      0       0
sad              1        1      1        1     1      1     1      1       1
nhope            1        1      1        1     1      1     1      1       1
slow             1        1      1        1     1      1     1      1       1
matage           1        1      1        1     1      1     1      1       1
totcnt           0        0      0        0     0      0     0      0       0

If you want to change anything in the predictor matrix, you can manually set variables to be used/not used for prediction and predicted/not predicted. For example, the following code would not allow the survey variables to be used for prediction, and would not allow sadness to be predicted.

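> ## Columns are predictors: zero out the survey design variables.
> pred[,"wtanal"] <- 0
> pred[,"stratumc"] <- 0
> pred[,"totcnt"] <- 0
> ## Rows are targets: no variable is used to predict sad.
> pred["sad",] <- 0
> ## A run with maxit=0 returns the default imputation methods.
> ini <- mice(data, maxit=0, printFlag=FALSE)
> meth <- ini$method
> ## An empty method string means sad is not imputed.
> meth["sad"] <- ""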

You can also double check that mice is using the correct method to impute the variables.

> ini$method

     batch2    modeprt2      wtanal    stratumc       mmhbp      vagdel
         ""          ""          ""          ""    "logreg"    "logreg"
      mrace      bornus     plural2    married2      matdeg         pnc
  "polyreg"    "logreg"          ""          ""       "pmm"       "pmm"
       lga2         sga kotelchuck2       prelb      gestwk         bmi
   "logreg"    "logreg"       "pmm"       "pmm"       "pmm"       "pmm"
     wrknow     income2   medicaid2     wicpreg   pp_drink2    diabetes
   "logreg"       "pmm"    "logreg"    "logreg"       "pmm"   "polyreg"
    preexer  infert_tx2     FEEL_PG      smoke2        strs      inficu
   "logreg"    "logreg"   "polyreg"       "pmm"       "pmm"    "logreg"
       los2      bfeed2 back_sleep2    cosleep2      ppvchk    inf_age2
      "pmm"       "pmm"    "logreg"    "logreg"    "logreg"          ""
        sad       nhope        slow      matage      totcnt
      "pmm"       "pmm"       "pmm"       "pmm"          ""

When you are happy with your predictor matrix and method for imputation of all the variables, you are ready to impute.

> imp <- mice(data, pred=pred, maxit=10, m=10, seed=92385)
> ## I set a seed for reproducibility.
> ## I am allowing for a maximum of 10 iterations
> ## for each of 10 imputed datasets.
> ## To use a modified method vector such as meth, also pass method=meth.

The figure below compares the values for each imputed dataset.

Figure 2. Imputation performance. (Density plots, y-axis: Density, with one panel each for matdeg, pnc, kotelchuck2, gestwk, bmi, income2, pp_drink2, smoke2, strs, los2, bfeed2, sad, nhope, slow, and matage.)

The next figure can be used to assess imputation convergence.

Figure 3. Imputation convergence. (Panels show the mean and sd of the imputed values against Iteration, x-axis from 2 to 10, for mmhbp, vagdel, mrace, bornus, matdeg, pnc, lga2, and sga.)

Also, we can look at how the means changed before and after imputation.

   Names                                             Pre     Post
1  FEEL_PGsooner                                     0.22    0.22
2  FEEL_PGlater                                      0.27    0.27
3  FEEL_PGthen                                       0.43    0.42
4  FEEL_PGDID NOT WANT                               0.079   0.080
5  mraceNon-Hispanic White                           0.24    0.25
6  mraceNon-Hispanic Black                           0.24    0.24
7  mraceHispanic                                     0.35    0.35
8  mraceNon-Hispanic American Indian/Alaska Native   0.15    0.15
9  mraceNon-Hispanic Asian                           0       0
10 mraceNon-Hispanic Hawaiian                        0.018   0.018
11 mraceNon-Hispanic Multiple Race                   0       0
12 vagdel1                                           0.58    0.58
13 married21                                         0.55    0.55
14 matdeg                                            3.05    3.05
15 mmhbp1                                            0.0681  0.0678
16 pnc                                               1.78    1.78
17 plural21                                          0.0703  0.0703
18 preexer1                                          0.33    0.33
20 inficu1                                           0.33    0.33
21 ppvchk1                                           0.90    0.90
22 sad                                               1.18    1.17
23 nhope                                             0.56    0.57
24 slow                                              1.23    1.21
25 wrknow1                                           0.38    0.38
26 los2                                              0.794   0.798
27 infert_tx21                                       0.10    0.10
28 back_sleep21                                      0.63    0.62
29 cosleep21                                         0.20    0.20
30 pp_drink2                                         0.66    0.66
31 smoke2                                            0.14    0.14
32 bmi                                               24.97   24.97

You can take your imputed datasets and run a design-based, survey-weighted regression.

> library(mitools)
> library(survey)
> load("imp.Rdata")
> ## Extract the 10 completed datasets from the mice output.
> implist <- lapply(1:10, function(i) mice::complete(imp, action=i, include=FALSE))
> ## Recode the sampling stratum as 1/2 in each completed dataset.
> for(i in 1:10){
+   implist[[i]]$stratum <- ifelse(implist[[i]]$stratumc=="YC: NORMAL BW", 1, 2)
+ }
> multimp <- imputationList(implist)
> desimp <- svydesign(id=~1, strata=~stratum, weights=~wtanal, fpc=~totcnt, data=multimp)
> fit <- with(desimp, svyglm(sga ~ strs + FEEL_PG + vagdel + mrace + income2 + medicaid2 + wicpreg
+     + matdeg + poly(bmi,2) + smoke2 + poly(matage,2) + infert_tx2, family=quasibinomial()))
> fitmi <- MIcombine(fit)
> summary(fitmi)

Multiple imputation results:
      with(desimp, svyglm(sga ~ strs + FEEL_PG + vagdel + mrace + income2 +
    medicaid2 + wicpreg + matdeg + poly(bmi, 2) + smoke2 + poly(matage,
    2) + infert_tx2, family = quasibinomial()))
      MIcombine.default(fit)

                                                      results         se
(Intercept)                                     -1.242883e+00 0.53669526
strs                                             1.259229e-01 0.13615997
FEEL_PG2                                        -1.798350e-01 0.30112655
FEEL_PG3                                        -2.239515e-02 0.25625769
FEEL_PG4                                        -2.921535e-01 0.34799446
vagdel2                                         -3.287655e-01 0.19468922
mraceNon-Hispanic Black                         -5.866356e-01 0.34399488
mraceHispanic                                   -7.632288e-02 0.30300550
mraceNon-Hispanic American Indian/Alaska Native -1.529159e-01 0.32040498
mraceNon-Hispanic Hawaiian                      -1.401858e+00 0.54702418
income2                                         -1.264901e-01 0.08682087
medicaid22                                      -4.124848e-01 0.29607676
wicpreg2                                        -8.333494e-02 0.27339916
matdeg                                          -7.676502e-04 0.07513768
poly(bmi, 2)1                                   -1.131922e+01 3.81129690
poly(bmi, 2)2                                    5.643566e+00 2.92501576
smoke2                                           3.432494e-02 0.24843572
poly(matage, 2)1                                -3.282042e+00 4.05131475
poly(matage, 2)2                                 3.044264e+00 3.43652006
infert_tx22                                      7.035499e-01 0.36707274

                                                      (lower       upper)
(Intercept)                                      -2.29580458  -0.18996138
strs                                             -0.14114838   0.39299410
FEEL_PG2                                         -0.77068948   0.41101956
FEEL_PG3                                         -0.52506024   0.48026994
FEEL_PG4                                         -0.97428734   0.38998029
vagdel2                                          -0.71045444   0.05292338
mraceNon-Hispanic Black                          -1.26219925   0.08892809
mraceHispanic                                    -0.67065748   0.51801171
mraceNon-Hispanic American Indian/Alaska Native  -0.78154701   0.47571515
mraceNon-Hispanic Hawaiian                       -2.47471450  -0.32900127
income2                                          -0.29768058   0.04470039
medicaid22                                       -0.99523068   0.17026115
wicpreg2                                         -0.62027420   0.45360432
matdeg                                           -0.14818529   0.14664999
poly(bmi, 2)1                                   -18.79196705  -3.84646424
poly(bmi, 2)2                                    -0.08957018  11.37670264
smoke2                                           -0.45325756   0.52190744
poly(matage, 2)1                                -11.22354828   4.65946456
poly(matage, 2)2                                 -3.69222922   9.78075626
infert_tx22                                      -0.02221297   1.42931279

                                                missInfo
(Intercept)                                          9 %
strs                                                 8 %
FEEL_PG2                                             9 %
FEEL_PG3                                             8 %
FEEL_PG4                                             3 %
vagdel2                                              5 %
mraceNon-Hispanic Black                             12 %
mraceHispanic                                        8 %
mraceNon-Hispanic American Indian/Alaska Native      9 %
mraceNon-Hispanic Hawaiian                           7 %
income2                                             22 %
medicaid22                                          18 %
wicpreg2                                            13 %
matdeg                                               9 %
poly(bmi, 2)1                                        5 %
poly(bmi, 2)2                                        2 %
smoke2                                              10 %
poly(matage, 2)1                                     3 %
poly(matage, 2)2                                     3 %
infert_tx22                                         26 %
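The results and se columns come from pooling the 10 per-dataset fits. MIcombine does this with Rubin's rules; the sketch below works through that arithmetic by hand for a single coefficient in base R. The five estimates and standard errors are made-up numbers, and the object names (est, se, qbar, W, B, Tvar) are illustrative only.

```r
# Hypothetical per-imputation results for one coefficient (made-up numbers).
est <- c(0.12, 0.15, 0.10, 0.14, 0.13)   # point estimate from each imputed dataset
se  <- c(0.13, 0.14, 0.13, 0.15, 0.14)   # standard error from each imputed dataset

m    <- length(est)          # number of imputed datasets
qbar <- mean(est)            # pooled point estimate
W    <- mean(se^2)           # within-imputation variance
B    <- var(est)             # between-imputation variance
Tvar <- W + (1 + 1/m) * B    # total variance under Rubin's rules
pooled_se <- sqrt(Tvar)

c(estimate = qbar, se = pooled_se)
```

The pooled standard error is always at least as large as the square root of the average within-imputation variance, because the between-imputation term adds the uncertainty due to the missing data.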

2 Stata: Ice

First, I set up the dataset. I read the dataset that I modified in R into Stata. I then had to change the NA's so that Stata would recognize the missing data. See the following links for more details:
http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mi_in_stata_pt1.htm
http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mi_in_stata_pt2.htm

Stata Code:

destring, replace ignore("NA")

encode mrace, generate(race2)
replace race2=. if race2==2

encode diabetes, generate(diabetes2)
replace diabetes2=. if diabetes2==2

encode feel_pg, generate(feelpg2)
replace feelpg2=. if feelpg2==2

This next bit of code uses the mvpatterns package to look at the missing data patterns.

mvpatterns modeprt2 batch2 wtanal stratumc mmhbp vagdel race2 bornus plural2 married2 ///
    matdeg pnc lga2 sga kotelchuck2 prelb gestwk bmi wrknow income2 medicaid2 wicpreg ///
    pp_drink diabetes2 preexer infert_tx feelpg2 smoke strs inficu los2 bfeed2 ///
    back_sleep2 cosleep2 ppvchk inf_age2 sad nhope slow matage totcnt

Before doing the actual imputation, it's recommended to do a dry run to ensure that the variables to be imputed are in the preferred format. The code is below. The i. prefix declares these variables to be categorical with no missing values. The m. prefix is for unordered categorical variables with missing values (m for multinomial). The o. prefix is for ordered categorical variables with missing values (o for ordinal).

ice modeprt2 batch2 wtanal i.stratumc mmhbp vagdel m.race2 bornus plural2 married2 ///
    o.matdeg o.pnc lga2 sga o.kotelchuck2 o.prelb o.gestwk bmi wrknow o.income2 medicaid2 ///
    wicpreg o.pp_drink m.diabetes2 preexer infert_tx m.feelpg2 o.smoke o.strs inficu ///
    o.los2 o.bfeed2 back_sleep2 cosleep2 ppvchk i.inf_age2 o.sad o.nhope o.slow matage ///
    totcnt, saving(dryrun) dryrun

Now, I'll do the imputation. The first part of the command is the same as in the dry run. m(10) tells it to impute 10 datasets and seed() sets the seed, so that it can be replicated. boot() is recommended for variables whose distributions do not approximate the normal distribution (most of the variables in this case!).

ice modeprt2 batch2 wtanal i.stratumc mmhbp vagdel m.race2 bornus plural2 married2 ///
    o.matdeg o.pnc lga2 sga o.kotelchuck2 o.prelb o.gestwk bmi wrknow o.income2 medicaid2 ///
    wicpreg o.pp_drink m.diabetes2 preexer infert_tx m.feelpg2 o.smoke o.strs inficu o.los2 ///
    o.bfeed2 back_sleep2 cosleep2 ppvchk i.inf_age2 o.sad o.nhope o.slow matage totcnt, ///
    saving(impute, replace) m(10) boot(mmhbp vagdel race2 bornus matdeg pnc lga2 sga ///
    kotelchuck2 prelb gestwk wrknow income2 medicaid2 wicpreg pp_drink diabetes2 preexer ///
    infert_tx feelpg2 smoke strs inficu los2 bfeed2 back_sleep2 cosleep2 ppvchk sad nhope ///
    slow matage) seed(1285964)

We can then use the imputed data by importing it as mi data.

mi import ice

The following table compares some variable means before and after imputation.

Names        Pre     Post
vagdel       0.5784  0.5787
married2     0.5487  .5501
matdeg       3.046   3.064
mmhbp        .0680   .0675
plural2      .0703   .0704
preexer      .3302   .3312
wicpreg      .5553   .5524
inficu       .3285   .3281
ppvchk       .8969   .8973
sad          1.169   1.166
nhope        .5616   .5604
slow         1.206   1.207
wrknow       .3754   .3763
los2         .7968   .7963
inferttx2    .1015   .1020
backsleep2   .6229   .6234
cosleep2     .2015   .2015
ppdrink2     .6561   .6615
smoke2       .1414   .1413
bmi          24.97   24.96

We can then use the imputed data to run a design-based, survey-weighted regression. Here, I ran the same regression that I ran in R. The first line sets the survey parameters within the multiply imputed dataset. The next two lines create quadratic versions of the bmi and maternal age variables in each imputed dataset, using the imputed variables. The last line runs the regression.

mi svyset [pweight=wtanal], strata(stratumc) fpc(totcnt)

mi xeq: generate bmisq = bmi^2
mi xeq: generate matagesq = matage^2

mi estimate: logit sga strs i.feelpg2 vagdel i.race2 income2 medicaid2 wicpreg matdeg bmi bmisq smoke2 matage matagesq infert_tx

sga         Coef     SE      95% CI
strs        .04865   .09377  -.1354   .2327
feelpg3     -.2786   .2515   -.7715   .2144
feelpg4     -.1222   .2663   -.6442   .3999
feelpg5     -.2294   .2481   -.7161   .2573
vagdel      -.4978   .1334   -.7593   -.2362
race3       .1844    .2103   -.2281   .5970
race4       -.0975   .1745   -.4397   .2448
race5       .0505    .5039   -.9375   1.038
race6       -.0053   .2015   -.4004   .3898
income2     -.1232   .0603   -.2437   -.0028
medicaid2   -.05313  .2255   -.4977   .3914
wicpreg     -.1886   .1740   -.5296   .1525
matdeg      .06623   .0476   -.0272   .1596
bmi         -.1370   .0655   -.2655   -.0086
bmisq       .0021    .0011   -.0001   .0042
smoke2      .1784    .1391   -.0942   .4511
matage      -.4299   .2441   -.9089   .0491
matagesq    .0559    .0325   -.0079   .1198
inferttx2   .7710    .2339   .3099    1.232
_cons       2.360    1.090   .2212    4.498
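When comparing these coefficients with the R output in Section 1, note that the R model entered bmi and matage through poly(bmi,2) and poly(matage,2), which by default builds orthogonal polynomial terms, whereas the Stata model uses the raw variables and their squares. The two parameterizations describe the same quadratic curve, so the polynomial coefficients are not expected to match across packages. The sketch below illustrates this in R with made-up data and an ordinary linear model rather than the survey-weighted logistic model; all object names are hypothetical.

```r
set.seed(1)
# Made-up predictor and outcome, for illustration only.
x <- runif(50, 18, 40)
y <- 1 - 0.14 * x + 0.002 * x^2 + rnorm(50, sd = 0.1)

fit_orth <- lm(y ~ poly(x, 2))    # orthogonal polynomial terms, as in poly(bmi, 2)
fit_raw  <- lm(y ~ x + I(x^2))    # raw x and x^2, as in bmi and bmisq

# The fitted values agree even though the coefficients differ.
same_fit <- isTRUE(all.equal(unname(fitted(fit_orth)), unname(fitted(fit_raw))))
print(same_fit)
print(coef(fit_orth))
print(coef(fit_raw))
```

Using poly(x, 2, raw=TRUE) in R would give coefficients on the same scale as the raw-squares version.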

3 SAS: IVEWare

IVEWare can be run as a stand-alone package or through SAS. http://www.isr.umich.edu/src/smp/ive/

Here is some sample code. Multiply imputed data created in IVEWare can then be imported into R or SAS to run analyses.

proc printto print='C:\myoutdir\summerinst.lst' NEW;
run;

LIBNAME MYLIB1 'C:\myindir';
LIBNAME MYLIB2 'C:\myoutdir';
options set = SRCLIB "C:\Program Files\SAS\srclib" sasautos=('!SRCLIB' sasautos) mautosource;
run;

%IMPUTE (NAME=MYSETUP, DIR=C:\myoutdir, SETUP=NEW);
DATAIN MYLIB1.sinst;
DATAOUT MYLIB2.impsinst1;

DEFAULT categorical;

COUNT totchild totadu susa5;

CONTINUOUS age bersraw ctotcomr ctotraw cintraw cextraw ytotraw yintraw yextraw
    ars a1_rs ar1_s a1_r1_s;

DROP cohort totrole;

TRANSFER childid axis_1a;

RESTRICT susa5(susa1=1, atleast11=1) susb13a(atleast11=1) susb13b(susb13a=1,atleast11=1)
    ds1(atleast11=1) ds2(atleast11=1) ds3(atleast11=1);

BOUNDS susa5(>=0,<=40) ctotraw(>=0) cintraw(>=0) cextraw(>=0), ytotraw(>=0) yintraw(>=0)
    yextraw(>=0) ctotcomr(>=0) bersraw(>=0);

MINRSQD .01;
ITERATIONS 10;
SEED 24578;
MULTIPLES 5;

print coef;
run;

proc printto;
run;