R, Stata, and SAS code for implementing multiple imputation
Kara Rudolph, Elizabeth Stuart
Johns Hopkins Bloomberg School of Public Health
September 25, 2012
1 R: MICE using the mi package
This section uses the mi package in R. You will need to install it on your computer using the command install.packages("mi"). See here for more details: http://www.stat.columbia.edu/~gelman/research/published/mipaper.pdf
First, I set up the dataset. I substituted numbers for ordinal responses and reclassified some variables (e.g., unordered categorical variables = factor, binary variables = factor, and ordered categorical variables = integer/numeric).
Table 1. Number and percent missing data

             number.nas  percent.nas
batch2                0       0.0000
modeprt2              0       0.0000
mmhbp                12       0.0084
vagdel                3       0.0021
mrace                56       0.0390
bornus               48       0.0334
plural2               0       0.0000
married2              0       0.0000
matdeg                3       0.0021
pnc                  58       0.0404
lga2                107       0.0745
sga                 107       0.0745
kotelchuck2          37       0.0258
prelb                 2       0.0014
gestwk               44       0.0306
bmi                   3       0.0021
wrknow               50       0.0348
income2             307       0.2138
medicaid2            13       0.0091
wicpreg              21       0.0146
pp_drink2            54       0.0376
diabetes             34       0.0237
preexer              36       0.0251
infert_tx2           45       0.0313
FEEL_PG              34       0.0237
smoke2               43       0.0299
strs                 66       0.0460
inficu               45       0.0313
los2                 52       0.0362
bfeed2               62       0.0432
back_sleep2          82       0.0571
cosleep2             92       0.0641
ppvchk               33       0.0230
inf_age2              0       0.0000
sad                 118       0.0822
nhope               161       0.1121
slow                159       0.1107
matage               23       0.0160
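A table like Table 1 can be produced with a few lines of base R. The sketch below is only an illustration: the data frame `toy` and its values are made up, and the column names `number.nas` and `percent.nas` are borrowed from the table header. As in Table 1, the second column is a proportion of rows rather than a percentage.

```r
# Toy data frame standing in for the survey data (values are made up).
toy <- data.frame(
  bmi     = c(22.1, NA, 30.4, 27.8, 24.0),
  income2 = c(3, NA, NA, 1, 2),
  sga     = c(0, 1, 0, NA, 0)
)

# Count missing values per variable, then express them as a proportion of rows.
number.nas  <- colSums(is.na(toy))
percent.nas <- round(number.nas / nrow(toy), 4)

# One row per variable, in the same layout as Table 1.
miss.table <- data.frame(number.nas, percent.nas)
print(miss.table)
```

Replacing `toy` with the real dataset should give the counts for every variable in one pass.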
The figure below shows the missing data pattern. This was created with the mi package in R.
> library(mi)
> missing.pattern.plot(data, y.order=TRUE, gray.scale=TRUE,
+     main="Missing Pattern Plot", xlab="index")
Figure 1. Missing data pattern.
[Plot titled "Missing Pattern Plot": one row per variable (income2, nhope, slow, sad, lga2, sga, and so on down to the fully observed variables), plotted against observation index on the horizontal axis.]
Next, I imputed the data with the mice package.
> library(mice)
> load("data.Rdata")
> pred <- quickpred(data, mincor=0.01)
> ## quickpred() makes the predictor matrix.
> ## I modified it slightly to choose predictors
> ## whose correlation with the variable they are
> ## predicting is at least 0.01.
> pred
            batch2 modeprt2 wtanal stratumc mmhbp vagdel mrace bornus plural2
batch2           0        0      0        0     0      0     0      0       0
modeprt2         0        0      0        0     0      0     0      0       0
wtanal           0        0      0        0     0      0     0      0       0
stratumc         0        0      0        0     0      0     0      0       0
mmhbp            1        0      1        1     0      1     0      1       1
vagdel           1        1      1        1     1      0     1      1       1
mrace            1        1      1        1     1      1     0      1       1
bornus           1        1      1        1     1      1     1      0       1
plural2          0        0      0        0     0      0     0      0       0
married2         0        0      0        0     0      0     0      0       0
matdeg           1        1      1        1     1      1     1      1       1
pnc              0        1      1        1     1      1     1      1       1
lga2             1        1      1        1     1      1     1      1       1
sga              1        1      1        1     1      1     1      1       1
kotelchuck2      1        1      1        1     1      1     1      1       1
prelb            1        1      0        1     1      1     1      1       1
gestwk           1        1      1        1     1      1     1      1       1
bmi              0        1      1        1     1      1     1      1       1
wrknow           1        1      1        1     1      1     1      1       0
income2          1        1      1        0     1      1     1      1       1
medicaid2        0        1      1        1     1      1     1      1       1
wicpreg          1        1      1        1     1      1     1      1       1
pp_drink2        1        1      1        1     1      1     1      1       1
diabetes         1        1      1        1     1      1     1      1       1
preexer          1        1      1        1     1      1     1      1       1
infert_tx2       1        1      1        1     1      1     1      1       1
FEEL_PG          0        1      1        1     1      1     1      1       1
smoke2           1        1      1        1     1      1     1      1       1
strs             1        1      1        1     1      1     1      1       1
inficu           1        1      1        1     1      1     1      1       1
los2             1        1      1        1     1      1     1      1       1
bfeed2           1        1      1        1     1      1     1      1       1
back_sleep2      1        1      1        1     1      1     1      1       1
cosleep2         1        1      1        1     1      1     1      1       1
ppvchk           1        1      1        1     1      1     1      1       1
inf_age2         0        0      0        0     0      0     0      0       0
sad              1        1      1        1     1      1     1      1       1
nhope            1        1      1        1     1      1     1      1       1
slow             1        1      1        1     1      1     1      1       1
matage           1        1      1        1     1      1     1      1       1
totcnt           0        0      0        0     0      0     0      0       0
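To read a matrix like this one, it helps to remember that each row is a variable being imputed and each column is a candidate predictor. The base R sketch below illustrates this on a small made-up matrix; `toy.pred` and its four variable names are assumptions chosen for the example and are not the matrix mice produced above.

```r
# Hypothetical 4 x 4 predictor matrix: rows = targets, columns = predictors.
vars <- c("wtanal", "bmi", "sad", "strs")
toy.pred <- matrix(1, nrow = 4, ncol = 4, dimnames = list(vars, vars))
diag(toy.pred) <- 0          # a variable is never its own predictor

toy.pred[, "wtanal"] <- 0    # zero a column: wtanal predicts nothing
toy.pred["sad", ] <- 0       # zero a row: nothing is used to predict sad

# List the predictors that remain for one target variable.
bmi.predictors <- names(which(toy.pred["bmi", ] == 1))
print(toy.pred)
print(bmi.predictors)
```

A row of all zeros, such as the batch2 or totcnt rows above, means that no predictors are assigned to that variable.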
If you want to change anything in the predictor matrix, you can manually set variables to be used or not used for prediction (columns) and predicted or not predicted (rows). For example, the following code would not allow the survey variables to be used for prediction, and would not allow sadness to be predicted.
> pred[,"wtanal"] <- 0
> pred[,"stratumc"] <- 0
> pred[,"totcnt"] <- 0
> pred["sad",] <- 0
> ini <- mice(data, maxit=0, pri=F)
> meth <- ini$meth
> meth["sad"] <- ""
You can also double check that mice is using the correct method to impute the variables.
> ini$method
     batch2    modeprt2      wtanal    stratumc       mmhbp      vagdel
         ""          ""          ""          ""    "logreg"    "logreg"
      mrace      bornus     plural2    married2      matdeg         pnc
  "polyreg"    "logreg"          ""          ""       "pmm"       "pmm"
       lga2         sga kotelchuck2       prelb      gestwk         bmi
   "logreg"    "logreg"       "pmm"       "pmm"       "pmm"       "pmm"
     wrknow     income2   medicaid2     wicpreg   pp_drink2    diabetes
   "logreg"       "pmm"    "logreg"    "logreg"       "pmm"   "polyreg"
    preexer  infert_tx2     FEEL_PG      smoke2        strs      inficu
   "logreg"    "logreg"   "polyreg"       "pmm"       "pmm"    "logreg"
       los2      bfeed2 back_sleep2    cosleep2      ppvchk    inf_age2
      "pmm"       "pmm"    "logreg"    "logreg"    "logreg"          ""
        sad       nhope        slow      matage      totcnt
      "pmm"       "pmm"       "pmm"       "pmm"          ""
When you are happy with your predictor matrix and method for imputation of all the variables, you are ready to impute.
> imp <- mice(data, pred=pred, maxit=10, m=10, seed=92385)
> ## I set a seed for reproducibility.
> ## I am allowing for a maximum of 10 iterations
> ## for each of 10 imputed datasets.
The figure below compares the values for each imputed dataset.
Figure 2. Imputation performance.
[Density plots, one panel per variable: matdeg, pnc, kotelchuck2, gestwk, bmi, income2, pp_drink2, smoke2, strs, los2, bfeed2, sad, nhope, slow, matage. Vertical axis: density.]
The next figure can be used to assess imputation convergence.
Figure 3. Imputation convergence.
[Trace plots of the mean and standard deviation of the imputed values against iteration (horizontal axis), with panels for mmhbp, vagdel, mrace, bornus, matdeg, pnc, lga2, and sga.]
Also, we can look at how the means changed before and after imputation.
    Names                                               Pre      Post
1   FEEL_PGsooner                                       0.22     0.22
2   FEEL_PGlater                                        0.27     0.27
3   FEEL_PGthen                                         0.43     0.42
4   FEEL_PGDID NOT WANT                                 0.079    0.080
5   mraceNon-Hispanic White                             0.24     0.25
6   mraceNon-Hispanic Black                             0.24     0.24
7   mraceHispanic                                       0.35     0.35
8   mraceNon-Hispanic American Indian/Alaska Native     0.15     0.15
9   mraceNon-Hispanic Asian                             0        0
10  mraceNon-Hispanic Hawaiian                          0.018    0.018
11  mraceNon-Hispanic Multiple Race                     0        0
12  vagdel1                                             0.58     0.58
13  married21                                           0.55     0.55
14  matdeg                                              3.05     3.05
15  mmhbp1                                              0.0681   0.0678
16  pnc                                                 1.78     1.78
17  plural21                                            0.0703   0.0703
18  preexer1                                            0.33     0.33
20  inficu1                                             0.33     0.33
21  ppvchk1                                             0.90     0.90
22  sad                                                 1.18     1.17
23  nhope                                               0.56     0.57
24  slow                                                1.23     1.21
25  wrknow1                                             0.38     0.38
26  los2                                                0.794    0.798
27  infert_tx21                                         0.10     0.10
28  back_sleep21                                        0.63     0.62
29  cosleep21                                           0.20     0.20
30  pp_drink2                                           0.66     0.66
31  smoke2                                              0.14     0.14
32  bmi                                                 24.97    24.97
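The text does not say how the Pre and Post columns were computed. One plausible reading, sketched below in base R with made-up numbers, is that Pre is the mean of the observed values and Post is the mean averaged over the completed datasets. The vector `observed` and the list `completed` are hypothetical stand-ins for a single variable and its imputed versions.

```r
# Toy incomplete variable and three hypothetical completed versions of it.
observed  <- c(1, 2, NA, 4, NA)
completed <- list(
  c(1, 2, 3, 4, 2),
  c(1, 2, 1, 4, 4),
  c(1, 2, 2, 4, 3)
)

pre  <- mean(observed, na.rm = TRUE)    # mean of the observed values only
post <- mean(sapply(completed, mean))   # mean averaged over completed datasets

comparison <- round(c(Pre = pre, Post = post), 2)
print(comparison)
```

Large gaps between the two columns would be a prompt to look more closely at the imputation model for that variable.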
You can take your imputed datasets and run a design-based, survey-weighted regression.
> library(mitools)
> library(survey)
> load("imp.Rdata")
> imp1 <- mice::complete(imp, action=1, include=FALSE)
> imp2 <- mice::complete(imp, action=2, include=FALSE)
> imp3 <- mice::complete(imp, action=3, include=FALSE)
> imp4 <- mice::complete(imp, action=4, include=FALSE)
> imp5 <- mice::complete(imp, action=5, include=FALSE)
> imp6 <- mice::complete(imp, action=6, include=FALSE)
> imp7 <- mice::complete(imp, action=7, include=FALSE)
> imp8 <- mice::complete(imp, action=8, include=FALSE)
> imp9 <- mice::complete(imp, action=9, include=FALSE)
> imp10 <- mice::complete(imp, action=10, include=FALSE)
> implist <- list(imp1, imp2, imp3, imp4, imp5, imp6, imp7, imp8, imp9, imp10)
> for(i in 1:10){
+   implist[[i]]$stratum <- ifelse(implist[[i]]$stratumc=="YC: NORMAL BW", 1, 2)
+ }
> multimp <- imputationList(implist)
> desimp <- svydesign(id=~1, strata=~stratum, weight=~wtanal, fpc=~totcnt, data=multimp)
> fit <- with(desimp, svyglm(sga ~ strs + FEEL_PG + vagdel + mrace + income2 + medicaid2 + wicpreg
+     + matdeg + poly(bmi,2) + smoke2 + poly(matage,2) + infert_tx2, family=quasibinomial()))
> fitmi <- MIcombine(fit)
> summary(fitmi)
Multiple imputation results:
      with(desimp, svyglm(sga ~ strs + FEEL_PG + vagdel + mrace + income2 +
          medicaid2 + wicpreg + matdeg + poly(bmi, 2) + smoke2 + poly(matage,
          2) + infert_tx2, family = quasibinomial()))
      MIcombine.default(fit)
                                                        results          se       (lower       upper)  missInfo
(Intercept)                                       -1.242883e+00  0.53669526  -2.29580458  -0.18996138       9 %
strs                                               1.259229e-01  0.13615997  -0.14114838   0.39299410       8 %
FEEL_PG2                                          -1.798350e-01  0.30112655  -0.77068948   0.41101956       9 %
FEEL_PG3                                          -2.239515e-02  0.25625769  -0.52506024   0.48026994       8 %
FEEL_PG4                                          -2.921535e-01  0.34799446  -0.97428734   0.38998029       3 %
vagdel2                                           -3.287655e-01  0.19468922  -0.71045444   0.05292338       5 %
mraceNon-Hispanic Black                           -5.866356e-01  0.34399488  -1.26219925   0.08892809      12 %
mraceHispanic                                     -7.632288e-02  0.30300550  -0.67065748   0.51801171       8 %
mraceNon-Hispanic American Indian/Alaska Native   -1.529159e-01  0.32040498  -0.78154701   0.47571515       9 %
mraceNon-Hispanic Hawaiian                        -1.401858e+00  0.54702418  -2.47471450  -0.32900127       7 %
income2                                           -1.264901e-01  0.08682087  -0.29768058   0.04470039      22 %
medicaid22                                        -4.124848e-01  0.29607676  -0.99523068   0.17026115      18 %
wicpreg2                                          -8.333494e-02  0.27339916  -0.62027420   0.45360432      13 %
matdeg                                            -7.676502e-04  0.07513768  -0.14818529   0.14664999       9 %
poly(bmi, 2)1                                     -1.131922e+01  3.81129690 -18.79196705  -3.84646424       5 %
poly(bmi, 2)2                                      5.643566e+00  2.92501576  -0.08957018  11.37670264       2 %
smoke2                                             3.432494e-02  0.24843572  -0.45325756   0.52190744      10 %
poly(matage, 2)1                                  -3.282042e+00  4.05131475 -11.22354828   4.65946456       3 %
poly(matage, 2)2                                   3.044264e+00  3.43652006  -3.69222922   9.78075626       3 %
infert_tx22                                        7.035499e-01  0.36707274  -0.02221297   1.42931279      26 %
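MIcombine() pools the per-imputation fits using Rubin's rules. As a rough illustration of that arithmetic for a single coefficient, the base R sketch below combines made-up estimates and standard errors; the numbers in `est` and `se` are hypothetical and are not taken from the model above, and the degrees of freedom use the classic large-sample formula.

```r
# Hypothetical estimates and standard errors of one coefficient,
# one value per imputed dataset (m = 5 here).
est <- c(0.12, 0.15, 0.10, 0.13, 0.14)
se  <- c(0.13, 0.14, 0.13, 0.14, 0.13)
m   <- length(est)

pooled.est <- mean(est)                       # pooled point estimate
within     <- mean(se^2)                      # average within-imputation variance
between    <- var(est)                        # between-imputation variance
total.var  <- within + (1 + 1/m) * between    # total variance
pooled.se  <- sqrt(total.var)

# Classic degrees of freedom and a 95% interval.
r  <- (1 + 1/m) * between / within            # relative increase in variance
df <- (m - 1) * (1 + 1/r)^2
ci <- pooled.est + c(-1, 1) * qt(0.975, df) * pooled.se

print(c(estimate = pooled.est, se = pooled.se, lower = ci[1], upper = ci[2]))
```

The pooled standard error is always at least as large as the average within-imputation standard error, because the between-imputation term adds the uncertainty due to the missing data.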
2 Stata: Ice
First, I set up the dataset. I read the dataset that I modified in R into Stata. I then had to change the NAs so that Stata would recognize the missing data. See the following links for more details:
http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mi_in_stata_pt1.htm
http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mi_in_stata_pt2.htm
Stata Code:
destring , replace ignore("NA")
encode mrace, generate(race2)
replace race2=. if race2==2
encode diabetes, generate(diabetes2)
replace diabetes2=. if diabetes2==2
encode feel_pg, generate(feelpg2)
replace feelpg2=. if feelpg2==2
This next bit of code uses the mvpatterns package to look at the missing data patterns.
mvpatterns modeprt2 batch2 wtanal stratumc mmhbp vagdel race2 bornus plural2 married2 ///
    matdeg pnc lga2 sga kotelchuck2 prelb gestwk bmi wrknow income2 medicaid2 wicpreg ///
    pp_drink diabetes2 preexer infert_tx feelpg2 smoke strs inficu los2 bfeed2 ///
    back_sleep2 cosleep2 ppvchk inf_age2 sad nhope slow matage totcnt
Before doing the actual imputation, it's recommended to do a dry run to ensure that the variables to be imputed are in the preferred format. The code is below. The i. prefix declares these variables to be categorical without missing values. The m. prefix is for unordered categorical variables with missing values (m for multinomial). The o. prefix is for ordered categorical variables with missing values (o for ordinal).
ice modeprt2 batch2 wtanal i.stratumc mmhbp vagdel m.race2 bornus plural2 married2 ///
    o.matdeg o.pnc lga2 sga o.kotelchuck2 o.prelb o.gestwk bmi wrknow o.income2 medicaid2 ///
    wicpreg o.pp_drink m.diabetes2 preexer infert_tx m.feelpg2 o.smoke o.strs inficu ///
    o.los2 o.bfeed2 back_sleep2 cosleep2 ppvchk i.inf_age2 o.sad o.nhope o.slow matage ///
    totcnt, saving(dryrun) dryrun
Now, I'll do the imputation. The code for the first bit is the same as the dry run. m(10) tells it to impute 10 datasets and seed() sets the seed, so that it can be replicated. boot() is recommended for variables whose distributions do not approximate the normal distribution (most of the variables in this case!).
ice modeprt2 batch2 wtanal i.stratumc mmhbp vagdel m.race2 bornus plural2 married2 ///
    o.matdeg o.pnc lga2 sga o.kotelchuck2 o.prelb o.gestwk bmi wrknow o.income2 medicaid2 ///
    wicpreg o.pp_drink m.diabetes2 preexer infert_tx m.feelpg2 o.smoke o.strs inficu o.los2 ///
    o.bfeed2 back_sleep2 cosleep2 ppvchk i.inf_age2 o.sad o.nhope o.slow matage totcnt, ///
    saving(impute, replace) m(10) boot(mmhbp vagdel race2 bornus matdeg pnc lga2 sga ///
    kotelchuck2 prelb gestwk wrknow income2 medicaid2 wicpreg pp_drink diabetes2 preexer ///
    infert_tx feelpg2 smoke strs inficu los2 bfeed2 back_sleep2 cosleep2 ppvchk sad nhope ///
    slow matage) seed(1285964)
We can then use the imputed data by importing it as mi data.
mi import ice
The following table compares some variable means before and after imputation.
Names         Pre      Post
vagdel        0.5784   0.5787
married2      0.5487   .5501
matdeg        3.046    3.064
mmhbp         .0680    .0675
plural2       .0703    .0704
preexer       .3302    .3312
wicpreg       .5553    .5524
inficu        .3285    .3281
ppvchk        .8969    .8973
sad           1.169    1.166
nhope         .5616    .5604
slow          1.206    1.207
wrknow        .3754    .3763
los2          .7968    .7963
inferttx2     .1015    .1020
backsleep2    .6229    .6234
cosleep2      .2015    .2015
ppdrink2      .6561    .6615
smoke2        .1414    .1413
bmi           24.97    24.96
We can then use the imputed data to run a design-based, survey-weighted regression. Here, I ran the same regression that I ran in R. The first line sets the survey parameters within the multiply imputed dataset. The next two lines create quadratic versions of the bmi and maternal age variables in each imputed dataset, using the imputed variables. The last line runs the regression.
mi svyset [pweight=wtanal], strata(stratumc) fpc(totcnt)
mi xeq: generate bmisq = bmi^2
mi xeq: generate matagesq = matage^2
mi estimate: logit sga strs i.feelpg2 vagdel i.race2 income2 medicaid2 wicpreg matdeg bmi bmisq smoke2 matage matagesq infert_tx
sga          Coef      SE       95% CI
strs         .04865    .09377   -.1354   .2327
feelpg3      -.2786    .2515    -.7715   .2144
feelpg4      -.1222    .2663    -.6442   .3999
feelpg5      -.2294    .2481    -.7161   .2573
vagdel       -.4978    .1334    -.7593   -.2362
race3        .1844     .2103    -.2281   .5970
race4        -.0975    .1745    -.4397   .2448
race5        .0505     .5039    -.9375   1.038
race6        -.0053    .2015    -.4004   .3898
income2      -.1232    .0603    -.2437   -.0028
medicaid2    -.05313   .2255    -.4977   .3914
wicpreg      -.1886    .1740    -.5296   .1525
matdeg       .06623    .0476    -.0272   .1596
bmi          -.1370    .0655    -.2655   -.0086
bmisq        .0021     .0011    -.0001   .0042
smoke2       .1784     .1391    -.0942   .4511
matage       -.4299    .2441    -.9089   .0491
matagesq     .0559     .0325    -.0079   .1198
inferttx2    .7710     .2339    .3099    1.232
cons         2.360     1.090    .2212    4.498
3 SAS: IVEWare
IVEWare can be run as a stand-alone package or through SAS. http://www.isr.umich.edu/src/smp/ive/
Here is some sample code. Multiply imputed data created in IVEWare can then be imported into R or SAS to run analyses.
proc printto print='C:\myoutdir\summerinst.lst' NEW;
run;

LIBNAME MYLIB1 'C:\myindir';
LIBNAME MYLIB2 'C:\myoutdir';
options set = SRCLIB "C:\Program Files\SAS\srclib" sasautos=('!SRCLIB' sasautos) mautosource;
run;

%IMPUTE (NAME=MYSETUP, DIR=C:\myoutdir, SETUP=NEW);
DATAIN MYLIB1.sinst;
DATAOUT MYLIB2.impsinst1;
DEFAULT categorical;
COUNT totchild totadu susa5;
CONTINUOUS age bersraw ctotcomr ctotraw cintraw cextraw ytotraw yintraw yextraw
    ars a1_rs ar1_s a1_r1_s;
DROP cohort totrole;
TRANSFER childid axis_1a;
RESTRICT susa5(susa1=1, atleast11=1) susb13a(atleast11=1) susb13b(susb13a=1, atleast11=1)
    ds1(atleast11=1) ds2(atleast11=1) ds3(atleast11=1);
BOUNDS susa5(>=0,<=40) ctotraw(>=0) cintraw(>=0) cextraw(>=0) ytotraw(>=0) yintraw(>=0)
    yextraw(>=0) ctotcomr(>=0) bersraw(>=0);
MINRSQD .01;
ITERATIONS 10;
SEED 24578;
MULTIPLES 5;
print coef;
run;

proc printto;
run;