Developing Imputation Methods for Crime Data
Report to the American Statistical Association Committee on Law and Justice Statistics
By
Michael D. Maltz, Clint Roberts, and Elizabeth A. Stasny
The Ohio State University
September 29, 2006
I. INTRODUCTION
The Uniform Crime Reporting (UCR) System, collected and published by the Federal
Bureau of Investigation (FBI) since 1930, is one of the largest and oldest sources of social data.
It consists, in part, of monthly counts of crimes for over 18,000 police agencies throughout the
country. Although it is a voluntary reporting system – no agency is required to report its data to
the FBI1 – most agencies do comply and transmit their crime data to the FBI. That does not
mean, however, that the reporting is complete: for various reasons (Maltz, 1999, pp. 5-7)
agencies may miss reporting occasional months, strings of months, or even whole years.
As a consequence of this missing data, when the FBI made some major changes to the UCR
in 1958, they also decided to impute data to fill in the gaps so that the data set would be
comparable from year to year. The FBI used a simple imputation method. For example, if an
agency neglected, for whatever reason, to report data for half the year, the FBI would double the
agency’s crime counts so that there would not appear to be a dip in the agency’s chronological
crime trajectory. Although this imputation method has drawbacks – for example, in an agency
with a great deal of monthly or seasonal variation – it does have the benefit of filling in gaps
with a reasonable method. A more appropriate imputation method might be to base the
imputation method on the agency’s long-term history, but the computing power needed to do this
did not exist in the late 1950s and early 1960s, when the FBI method was implemented. Nor did
the FBI change the method afterwards, when computing became cheaper and easier, probably
preferring to maintain consistency in its reporting.
In recent years, however, the UCR has been used for making policy decisions. The Local
Law Enforcement Block Grant Program, funded by the US Congress in 1994, allocated funds to
1 Some states have required the reporting of crime data: see Maltz (1999, p. 45).
local jurisdictions based on the number of violent crimes they experienced in the three most
recent years. In addition, studies have used county-level UCR data to test policies despite the
inability of the data (as they now stand) to support such analyses (Lott, 1998; Lott & Mustard,
1997; Maltz & Targonski, 2002, 2003). Because of these uses of the data, and because of the
prospect of using a complete UCR data set to further study the effect of policies and programs
over the last few decades, the National Institute of Justice (NIJ) funded projects to clean the data
and the American Statistical Association (ASA) was given funding by the Bureau of Justice
Statistics (BJS) to support a study to develop imputation methods to apply to the cleaned UCR
data. This report describes the progress we have made on developing imputation methods for
missing UCR data.
The goal of this project is to develop and test imputation methods for longitudinal UCR data
and to develop variance estimates for the imputed data. Although it should be possible
theoretically to develop algorithms and apply them to every agency for every (apparently)
missing data point, certain factors militate against doing so.
First, not all “missing” data is truly missing. For example, agencies may merge with or
report through other agencies, or report a month’s data at other times to compensate for missed
months; Section II provides a description of crime data missingness.
Missingness in UCR data became an issue when the FBI revised its reporting policies and
procedures after a 1958 consultant report (Lejins et al., 1958) recommended new ways to
estimate and publish national crime trends. This formed the basis of the imputation methodology
used by the FBI to estimate crime, as described in Section III.
While we recognize the limitations of the current FBI imputation methods, we also find that
a “one size fits all” approach cannot be used for all truly missing data. For agencies that have
high crime counts (such that a zero count for a month is very unlikely), we found that a
SARIMA (Seasonal AutoRegressive Integrated Moving Average) model was most useful. The
assumption of normal error terms required for such a model was also more reasonable when the
number of crimes was large. When agencies had low crime counts, we chose models appropriate
for discrete data. For agencies with very low crime counts, we imputed the mean value averaged
over all available data and assumed a Poisson distribution of crime counts. For agencies with
intermediate crime counts, we used a Poisson regression model. Section IV describes the
analyses we conducted that led to this approach.
One of the more important results of our initial research – which is possible because of our
model-based approach – is the development of variance estimates for the imputed data; measures
of uncertainty in imputed values are not available for the FBI-imputed data. Since the unit of
analysis we used in imputing data was the crime-month, we had to combine variances from the
individual crime series to estimate the variance for the UCR Crime Index (the sum of seven
crimes) or the Violent (four crimes) or Property Crime Index (three crimes). Section VI deals
with that aspect of the research.
II. CRIME DATA MISSINGNESS

One of the difficulties encountered in examining missingness in the UCR data is determining,
at the outset, whether a datum is missing (the data file does not use a specific
symbol to indicate when a datum is missing)2 or is merely zero. We have used exploratory data
analysis (EDA) techniques, as proposed by Tukey (1977), to help in that determination.
2 This is not always the case; in many of the cases of interest, there are indicators in the full data set of when a datum is missing (K. Candell, personal communication, April 17, 2006), but they were not available in the data set provided to us by the NACJD.
When a zero occurs in the data set, it may be that no crimes of that type were committed that
month. There are, however, other possible explanations. A zero might also mean that:
(1) the agency had not (yet) begun reporting data to the FBI because it did not exist (or did not have a crime-reporting unit) at that time;
(2) the agency ceased to exist (it may have merged with another agency);
(3) the agency existed, but reported its crime and arrest data through another agency (i.e., was “covered by” that agency);
(4) the agency existed and reported data for that month, but the data were aggregated so that, instead of reporting on a monthly basis, it reported on a quarterly, semiannual, or annual basis;
(5) the agency existed and reported monthly data in general, but missed reporting for one month and compensated for the omission by reporting, in the next month, aggregate data for both months; or
(6) the agency did not submit data for that month (a true missing datum).
Our goal has been to distinguish among these different types of missingness. This section
describes the characteristics of the data that are truly missing. In particular, we explore the
length of runs of missingness and how they vary by year and by size and type of agency.
Obviously, there are a number of other variables that might be investigated: state, population,
county urbanicity (Goodall et al., 1998), and crime rate. Our exploration of these candidate
variables did not suggest that they would be easily useful; this was true a fortiori for
sociodemographic variables (e.g., percent minority, percent aged 16-25). Therefore, this initial
analysis will focus only on year, state, and size and type of agency.
The typology of agencies we used is the one used by the FBI, and described in Table 1. As
can be seen, the typology is fairly simple, and agencies change their group designation as their
population changes.
Table 1. FBI Classification of Population Groups

  Population Group            Political Label   Population Range
  1                           City              250,000 and over
  2                           City              100,000 to 249,999
  3                           City              50,000 to 99,999
  4                           City              25,000 to 49,999
  5                           City              10,000 to 24,999
  6                           City (a)          Less than 10,000
  8 (Nonmetropolitan County)  County (b)        N/A
  9 (Metropolitan County)     County (b)        N/A

Note: Group 7, missing from this table, consists of cities with populations under 2,500 and universities and colleges to which no population is attributed. For compilation of CIUS, Group 7 is included in Group 6.
(a) Includes universities and colleges to which no population is attributed.
(b) Includes state police to which no population is attributed.
Missingness Run Lengths
Figure 1 shows the overall pattern of missingness for all states and all years. The horizontal axis
is scaled logarithmically to highlight the shorter runs. As can be seen, runs of length 1 are the
most common of the over 44,000 missingness runs, and 70 percent of runs are 10 months or less.
The second panel in Figure 1 uses a logarithmic scale for both axes to give a better indication
of the run-length distribution and patterning; the first two peaks are at 1 and 5, but the
remaining peaks are at multiples of 12 – i.e., they represent full years of missingness.3
3 An agency that reports semiannually would have missingness run lengths of 5. A check of such runs was made to see if that was the reason for the peak at 5; it turned out that this was not the case, since very few of these runs were from January-May or July-November.
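The run lengths summarized above can be tabulated directly from an agency's monthly series with base R's rle(). The sketch below is ours, for illustration only; it assumes NA marks a month already classified as truly missing, and the function name and toy series are not the project's code.

```r
# Sketch: tabulating missingness run lengths for one agency's monthly series.
# Assumes NA marks a month already classified as truly missing; the function
# name and toy data are ours, for illustration only.
gap_lengths <- function(series) {
  r <- rle(is.na(series))   # runs of missing (TRUE) and observed (FALSE) months
  r$lengths[r$values]       # keep the lengths of the missing runs only
}

x <- c(3, 5, NA, NA, NA, 4, NA, 2, 2, NA, NA)
gap_lengths(x)  # 3 1 2
```

Pooling these vectors over all agencies and months yields the run-length distributions plotted in Figures 1 and 2.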
[Figure: two panels plotting Number of cases against Gap size (mos.)]
Figure 1. Number of cases of different gap sizes from missing data, depicted on standard (upper panel) and logarithmic (lower panel) scales. The logarithmic plot clearly shows the periodicity of missingness patterns: most of the run length peaks occur at multiples of 12, indicating whole years of missing data.
Missingness in Different FBI Groups
As shown in Table 1, Groups 1-5 represent cities of different sizes, but Groups 6-9 represent
both cities and other types of jurisdictions, including university, county, and state police
agencies, as well as fish and game police, park police, and other public law enforcement
organizations. Most (but not all) of these “other types” of jurisdictions are called “zero-
population” agencies by the FBI, because no population is attributed to them. This is because, for
example, counting the population policed by a university police department or by the state police
would be tantamount to counting that population twice. Thus, the crime count for these agencies
is merely added to the crime count for the other agencies to get the total crime count: for the city,
in the case of the university police, and for the state, in the case of the state police.4
Figure 2 depicts the missingness trends for these different Groups. As can be seen, Groups 5-
9 have the most missingness, concentrated in long run lengths. It stands to reason that the most
populous agencies (Groups 1-4) have the least missingness. First, agencies that have more crime
probably have stronger statistical capability; and second, if agencies with populations of 100,000
or more are missing reports, they are contacted by FBI personnel and urged to complete their
reports. This second practice, of course, makes an assumption of data being missing completely
at random (MCAR) implausible.
4 Of course, if a university has branches in different cities, each city records the crime count for its respective branch; and if there is a separate state police barracks in each county, then the crime count is allocated by county.
[Figure: nine panels, one per FBI Group (1–9), plotting Number of cases against Gap size (mos.) on logarithmic scales]
Figure 2. Number of cases of different gap sizes from missing data for different FBI Groups, plotted on logarithmic scales. The agencies with higher populations have the least missingness.
III. PREVIOUS IMPUTATION METHODS

In 1958 a consultant committee to the FBI recommended that a simple imputation method be
used to account for missing data (Lejins et al., 1958, p. 46): “The number of reported offenses
should then be proportionately increased to take care of the unreported portions, if any, of these
same [Index] categories within each state.” The FBI implemented this imputation approach
using two different methods, both based on the current year’s reporting. If three or more
months’ crimes were reported during the year, the estimated crime count for the year would be
12C/M, where M is the number of months reported and C is the total number of crimes reported
during the M months (Table 2). If fewer than three months were reported in that year, the
estimated crime count is based on “similar” agencies in the same state. A “similar” agency is
one that meets two selection criteria: it must be in the same state and the same Group (as per
Table 1), and it must have reported data for all 12 months. The crime rate for these agencies is
then computed (the total crime count is divided by the total population), and this rate is
multiplied by the population of the agency needing imputation, to estimate its annual crime
count.
Table 2. FBI Imputation Procedure

  Number of months reported, M, by Agency A    Estimated annual crime count
  0 to 2                                       CS × PA / PS
  3 to 11                                      12 CA / M

CA, PA: the agency’s crime count and population for the year in question.
CS, PS: the crime and population count of “similar” agencies in the state, for the year in question.
As this imputation method makes clear, the FBI’s primary concern has been the annual
crime count. This imputation technique has been used since the early 1960s, and ignores
concerns of seasonality. (The FBI does conduct analyses of crime seasonality, but not of
individual agencies, and it only uses agencies that have reported all 12 months in its seasonality
analyses.)
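The two-branch rule in Table 2 amounts to the following calculation. The R function below is our own sketch of that rule (the function and argument names are ours, not the FBI's):

```r
# Sketch of the FBI annual imputation rule in Table 2; the function and
# argument names are ours, not the FBI's.
fbi_annual_estimate <- function(monthly_counts, pop_agency = NULL,
                                crimes_similar = NULL, pop_similar = NULL) {
  M <- sum(!is.na(monthly_counts))        # months reported, M
  C <- sum(monthly_counts, na.rm = TRUE)  # crimes reported in those months, C
  if (M >= 3) {
    12 * C / M    # 3 to 11 months: scale the partial year up to 12 months
  } else {
    # 0 to 2 months: apply the crime rate of "similar" agencies (same state,
    # same population group, all 12 months reported) to this agency's population
    (crimes_similar / pop_similar) * pop_agency
  }
}

# An agency reporting 50 crimes over 6 months is credited with 100 for the year
fbi_annual_estimate(c(10, 5, 10, 5, 10, 10, rep(NA, 6)))  # 100
```

Note how the first branch simply doubles a half-year of reports, the behavior described in the Introduction.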
IV. INITIAL APPROACHES

Our approach has been to impute for each month with missing data and to use as much
available data as possible in our estimates, which means that we often use data of more than a
year’s duration.
Many of the standard approaches to imputing missing data, taken from the survey sampling
literature, do not apply for the UCR data. First, there is surprisingly little research on missing
data methods for time series data in the literature. For example, Little and Rubin (2002) include
only two pages on the topic. Furthermore, the UCR time series vary considerably across
agencies: some agencies have a great deal of seasonal variation in crime while others have none,
some are in cities with high population growth while others are in cities that have experienced
population declines. More importantly from our standpoint, some have high monthly crime
counts (so that the probability of a zero count is vanishingly small) while others have sparse
crime counts. As mentioned earlier, because of the high degree of variation in crime from
jurisdiction to jurisdiction (as well as from crime to crime within a jurisdiction), we employed
different methods for different situations.
An additional complication is that some small agencies may have reported data only in
months in which they had data to report. That is, if an agency reported one crime in February
and one crime in December, and provided no crime reports for any of the other months, a
standard imputation might conclude that the agency averages one crime per month – and
therefore estimate that about 12 crimes occurred in that jurisdiction for that year. Only by
looking at the agency’s entire trajectory can one determine the extent to which this assumption is
valid – in most cases this would not be a valid assumption. Graphical methods, therefore, were
an important aspect of our analysis.
Our exploration of the data began with the Columbus NDX (i.e., Index crime) series. This
time series has large crime counts and no missing observations. Not only does the series exhibit
an increasing trend and seasonality, but the variance and seasonal effect also change over time.
Model selection started with decomposing the data into a stationary series. A time series is
stationary if its mean function and covariance function are independent of time. The time series
for agencies reporting in the UCR data are not generally stationary, mainly because of population
changes over time.5 We must obtain stationary residuals to specify an ARIMA model. One way
to get stationary residuals is to estimate the trend component and the seasonal component of the
series and subtract them from the data via the classical decomposition algorithm. Less tedious,
but still achieving the desired result, is to apply differencing to the data. Lag-1 differencing will
eliminate trend, but we will still need to adjust for seasonality later. To better ensure stationarity,
we make a log10 transformation of the data. We initially tried complicated models with many
terms in them. Ultimately, however, we chose much simpler models. The autocorrelation
function (ACF) and the partial ACF for a number of time series following this data preparation
suggested that we should use seasonal ARIMA models with parameters (1,1,1)×(1,0,1), where
the first triple of parameters refer to the AR, differencing, and MA terms in the model and the
second triple is the same set of terms but for the seasonality component of the model. This
model works well to predict observations based on the pattern of the data over time.
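As a sketch of the preparation and model just described, base R's arima() can fit a (1,1,1)×(1,0,1) seasonal model to a log10-transformed monthly series. The series below is simulated for illustration; it stands in for an actual agency's counts.

```r
# Sketch: fit the seasonal ARIMA (1,1,1)x(1,0,1) model described above to a
# log10-transformed monthly series. The data are simulated for illustration.
set.seed(1)
t <- 1:240
y <- 1000 + 2 * t + 80 * sin(2 * pi * t / 12) + rnorm(240, sd = 30)  # trend + seasonality
logy <- log10(ts(y, frequency = 12))

fit <- arima(logy,
             order = c(1, 1, 1),                                # AR, differencing, MA
             seasonal = list(order = c(1, 0, 1), period = 12))  # seasonal AR, D, MA

# One year of forecasts, transformed back to the count scale
counts <- 10^predict(fit, n.ahead = 12)$pred
```

The lag-1 differencing is handled by the middle term of `order`, so the raw series need not be differenced by hand before fitting.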
We illustrate our analyses with the Columbus NDX series. Figure 3A shows the raw data
plotted as a time series, Figure 3B shows this same series after correcting for the mean and
taking first differences, and Figure 3C shows the log transformation of the same data series after
mean correction and differencing.
5 Our imputation strategy was to impute crime counts, not crime rates. The crime rates might be stationary, but considering the natural variation in crime over the past few decades, this is highly unlikely.
[Figure: three time-series panels of NDX counts plotted against month number for Columbus, OH]
A. Columbus NDX series – raw data
B. Mean-corrected differenced Columbus NDX series
C. Mean-corrected differenced Columbus log NDX series
Figure 3. Columbus NDX time series data
One of the first questions we set out to answer concerned how we should use the NDX series,
the sum of the seven index crimes. Possible approaches are to make imputations for each of the
seven category crimes separately and then sum them to get an estimate for the NDX, or we could
make our imputation on the NDX series and then allocate this estimate to the seven crime series.
Using the Columbus and Cleveland time series, we checked whether the two methods produced
different prediction estimates for NDX. We found that the predictions from the two methods are
relatively close. Plots in Figure 4 show sample comparisons of these predictions.
Based on comparisons such as these, we decided to make our imputations for each of the seven
index crimes and use the sum as an estimate for NDX.6 Advantages of this method are that it
mimics the way the NDX is created, as the sum of the seven crime counts, and it avoids the
question of how an estimate of the composite NDX should be allocated to the seven individual
crimes.
For the most part, our goal was to develop interpolation rather than extrapolation models,
since most of the missing data were within the time series rather than at the beginning or ends of
series. Abraham (1981) takes a simple approach that combines forecasting and backcasting
estimates from time-series models, yielding estimates with lower standard errors than a
procedure based only on prior (or subsequent) data points.
To test our proposed models we simulated missingness by eliminating known data points
from the time series and then estimated the points we had deleted using various models. We used
both mean absolute and mean squared differences to assess the overall accuracy of our models’
predictions.
6 The same holds true for the Violent Crime Index (murder + rape + robbery + aggravated assault) and the Property Crime Index (burglary + larceny + vehicle theft).
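The simulation check described above reduces to comparing deleted (known) values with their imputations. A minimal sketch, with a function name and numbers of our own invention:

```r
# Sketch: after deleting known points and imputing them, score the model by
# mean absolute and mean squared difference (function name and numbers ours).
score_imputation <- function(actual, imputed) {
  c(mad = mean(abs(actual - imputed)),   # mean absolute difference
    mse = mean((actual - imputed)^2))    # mean squared difference
}

actual  <- c(120, 135, 128, 140)   # the deleted (known) values
imputed <- c(118, 130, 131, 138)   # the model's imputations
score_imputation(actual, imputed)  # mad = 3, mse = 10.5
```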
[Figure: upper panel, “Seven crime total versus Index,” a scatter plot of absolute relative prediction errors for predicting the index directly versus summing the individual crime predictions; lower panel, “Comparing predictions to actual values,” plotting the totaled prediction, the index prediction, and the true value by forecast month]
Figure 4. The top plot is for Columbus, OH; the bottom plot is for Cleveland, OH. Both show that the prediction based on the sum of seven estimates and the prediction based on the NDX series are very close to each other.
As noted earlier, we found that one size does not fit all. We decided to use three different
imputation methods, depending on the statistics of the agency and crime in question. Based on
our graphical analyses of agency crime trajectories, we chose mean monthly counts of 1 and 35
as our break points. For example, in Figure 5, Columbiana, OH has a mean monthly count of 28
and the SARIMA method of imputation in months 476-492 produced fairly constant estimates.
Galion, OH has a mean monthly count of 37 and the SARIMA method of imputation is seen to
exhibit the seasonality and the trend of the data preceding the missing months. Also, UCR
time series often have much smaller counts in the earlier years than in the later years, so it is
better to be conservative in employing the SARIMA method, avoiding it when data are
missing at the beginning of the series. We have seen throughout the course
of our research that average counts of 1 and 35 are reasonable break points for our three
imputation methods.
When an agency’s crime counts are included in another agency’s reports, the other agency is
said to be “covering” the initial agency, and the initial agency is said to be “covered by” the
other agency. Since such covered-by crimes have been counted and reported, we do not need to
impute them for the covered agency. Still, we cannot treat agencies with covered-by months or
covering months the same as other agencies. If a covering agency has periods of covering
surrounded by periods of non-covering, then we might have a problem when we try to model the
series. The months that are covering another agency will have a different model than the months
that are not covering. For agencies with covered-by months, making the usual imputations of
the missing values will not be a problem as long as the agency has a sufficient amount of
observed data. For each of the covering agencies, we would need to look at plots to see how
appropriate our regular model would be. Since the number of agencies affected by coverings is
large, it is not feasible to look at the plots of each one. We assume that most covering agencies
report large counts and most covered-by agencies contribute small counts, and therefore making
adjustments in the model for covering months will not significantly improve our predictions.
[Figure: two NDX time-series panels with imputed values marked]
A. Imputation based on SARIMA for Columbiana NDX series with average crime count of 28
B. Imputation based on SARIMA for Galion NDX series with average crime count of 37
Figure 5. Plots showing imputations from time series models for two NDX series with different levels of crime.
In addition, in many cases an agency did not provide monthly data for some years, but
instead sent in an annual (or semiannual or quarterly) count of crimes. In those cases we imputed
the data for the entire time period, as if we had no information for that period, and then adjusted
all of the imputed values so that they would sum to the (known) total.
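Adjusting the imputed months to a known annual (or semiannual or quarterly) total is a simple proportional rescaling; sketched below with a hypothetical function name and numbers:

```r
# Sketch: impute all months of an aggregated period as if unobserved, then
# rescale so the imputations sum to the known period total (names ours).
rescale_to_total <- function(imputed_months, known_total) {
  imputed_months * known_total / sum(imputed_months)
}

m   <- c(8, 10, 12, 14, 12, 10, 9, 11, 13, 12, 10, 9)  # model-based imputations
adj <- rescale_to_total(m, known_total = 156)  # agency's reported annual total
sum(adj)  # 156
```

The rescaling preserves the seasonal shape of the model-based imputations while honoring the reported total.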
All imputation algorithms were written in R, a statistical software language that has strong
graphical capabilities. Since we relied to a great extent on “eyeballing” to determine whether a
particular imputation algorithm seemed reasonable, this allowed us to inspect the results of
different imputation procedures. The R code for our procedures is provided in the Appendix.
V. OUR IMPUTATION PROCEDURE

We will now describe the general structure of the algorithms used to make imputations in our
data. After removing the outliers from the series, the program makes a search for the longest
complete segment in the series. Using this segment as a starting position, the program moves
outward to the ends of the series making imputations for each missing period. At each step,
imputations are based on forecasts and backcasts, provided that enough data is available
surrounding the missing period. Finally, the outliers are included back into the data. (There are
many outliers in the data that are suspected of being processing errors, and should be dealt with
outside of this project.)
Crime Count Less Than 1 per Month
When the average crime count was under 1 crime per month, we assumed that the data had a
Poisson distribution and, since with such sparse data seasonality is difficult or impossible to
identify, we estimated the crime count and the variance to be the mean monthly count. The
mean count was calculated by dividing the total count for the agency by the total number of
months the agency provided reports. An improvement over this method, which would attempt to
account for any time trends, might be to calculate monthly estimates using the average over a
window of months surrounding the missing month. Intuitively, this strategy makes sense, but the
actual gain may be insignificant, as we would be making a bias/variance trade-off. (The strategy
would have less stable estimates with larger variances.) In addition, the question of how many
months to include in the window would almost certainly require different answers for different
data series.
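A minimal sketch of this mean imputation for sparse series follows (the function name is ours); under the Poisson assumption, the mean monthly count serves as both the imputed value and its variance estimate:

```r
# Sketch of mean imputation for very sparse series (function name ours).
# Under the Poisson assumption the mean monthly count is both the imputed
# value and its variance estimate.
sparse_impute <- function(series) {
  m <- mean(series, na.rm = TRUE)  # total count / number of reported months
  list(estimate = m,
       variance = m,                         # Poisson: variance = mean
       upper95  = qpois(0.975, lambda = m))  # upper bound on a month's count
}

murders <- c(0, 1, 0, 0, NA, 0, 2, 0, 0, NA, 1, 0)  # NA = missing months
sparse_impute(murders)$estimate  # 0.4
```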
Figure 6 shows results of our mean imputation for East Liverpool, Ohio.
[Figure: East Liverpool, OH murder (MUR) series with imputed point estimates and confidence bounds marked]
Figure 6. Imputation based on the mean of East Liverpool murder series. Point estimates and upper 95% confidence bounds are plotted.
Crime Count Between 1 and 35 per Month
For agencies with between 1 and 35 crimes per month on average, we used a generalized
linear model, assuming Poisson error terms, that based its estimates when possible on the
trajectories of “similar” agencies. Predictors in the Poisson regression model for estimating a
month’s crime count included the previous month’s counts, a seasonal effect, and, when
possible, the crime count from a highly correlated “donor” series. One of the questions we dealt
with was how to select candidate donor agencies for the imputation algorithm. Our definition of
“similar” differed from the FBI’s: we chose a single “similar” agency from among a number of
candidates with the least missing data (and with no missingness in the imputation interval in
question), namely the one whose time series for the crime in question was most highly
correlated with the imputed agency’s time series. When possible we used data on both sides of
the missing period and averaged two predictions based on different estimated models.
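A sketch of such a Poisson regression, with lag-1 counts, a month (seasonal) factor, and a donor series as predictors, follows. The data are simulated and all variable names are ours, for illustration only:

```r
# Sketch of the Poisson GLM imputation: previous month's count, a seasonal
# (month) effect, and a correlated donor series as predictors. The data are
# simulated and all names are ours, for illustration only.
set.seed(2)
month  <- factor(rep(1:12, 10))                                    # seasonal effect
donor  <- rpois(120, lambda = 20 + 5 * sin(2 * pi * (1:120) / 12)) # donor agency
target <- rpois(120, lambda = 0.6 * donor + 3)                     # agency to impute
prev   <- c(NA, target[-120])                                      # lag-1 counts

fit <- glm(target ~ prev + month + donor, family = poisson)

# Impute one month from its predictors
newpt <- data.frame(prev = 15, month = factor(7, levels = levels(month)), donor = 22)
imputed <- predict(fit, newdata = newpt, type = "response")
```

In our procedure, the forecast fit (as above) and a corresponding backcast fit would then be averaged, as illustrated in Figures 8 and 9.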
[Figure: two burglary (BUR) time-series panels]
A. Saraland burglary series
B. Anniston burglary series
Figure 7. Time series plots of burglaries in two Alabama cities.
Figures 7-9 illustrate imputation results for burglary in Saraland, Alabama obtained using the
Poisson generalized linear model (GLM). Anniston, Alabama’s series of crime counts was used
as a donor series because both cities are in Alabama, both have populations between 10,000 and
30,000, and the correlation between the series was 0.756. The similarity in the series is shown in
the plots in Figure 7.
[Figure: three Saraland, AL burglary (BUR) panels with imputed values marked]
A. Forecasting based on GLM used to fill in gap in Saraland burglary series
B. Backcasting based on GLM used to fill in gap in Saraland burglary series
C. Combination of forecasting and backcasting used to fill in gap in Saraland burglary series
Figure 8. Poisson regression model imputation for burglary series in Saraland, AL.
[Figure: three Saraland, AL burglary (BUR) panels with imputed values marked]
A. Forecasting using Anniston information to fill in gap in Saraland burglary series
B. Backcasting using Anniston information to fill in gap in Saraland burglary series
C. Combination of forecasting and backcasting using Anniston information to fill in gap in Saraland burglary series
Figure 9. Poisson regression model imputation for burglary series in Saraland, AL using Anniston, AL as a donor series.
Figures 8 and 9 show examples of imputation for missing data when forecasting,
backcasting, and both are used to obtain the imputed values. The results in Figure 8 are
obtained using only Saraland’s crime series data. These may be compared to similar
imputations in Figure 9 obtained when Anniston’s data are used in the Poisson regression model
for prediction.
Crime Count Over 35 per Month
For a crime count averaging above 35 crimes per month, we used estimates from SARIMA
models with parameters (1,1,1)×(1,0,1). We first found the longest sequence of complete data
(consecutive months with no missing data). Starting with that sequence, we then imputed the
missing data to the right of that using the SARIMA model. If a sufficiently large sequence of
data7 existed to the right of the gap, we then went in the opposite direction and imputed the same
missing data sequence from the right. This algorithm provided us with two imputation sequences
for the gap, one based on the data preceding the period of missing data and one based on the
data following that period. We then averaged the two for our final estimate, reducing the
variance relative to that of a single estimate.
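The forecast/backcast averaging can be sketched as follows; the backcast is obtained by forecasting the time-reversed later segment and reading the result back in reverse. The data and function names are ours, and the fitted orders follow the (1,1,1)×(1,0,1) specification above.

```r
# Sketch: fill a gap by averaging a SARIMA forecast from the segment before
# the gap with a backcast from the segment after it. A backcast is a forecast
# of the time-reversed later segment, read back in reverse order.
sarima_fc <- function(x, h) {
  fit <- arima(ts(x, frequency = 12), order = c(1, 1, 1),
               seasonal = list(order = c(1, 0, 1), period = 12))
  as.numeric(predict(fit, n.ahead = h)$pred)
}

fill_gap <- function(left, right, gap_len) {
  f <- sarima_fc(left, gap_len)             # forecast from the left segment
  b <- rev(sarima_fc(rev(right), gap_len))  # backcast from the right segment
  (f + b) / 2                               # average the two estimates
}

set.seed(3)
y <- 500 + 40 * sin(2 * pi * (1:200) / 12) + rnorm(200, sd = 10)
imputed <- fill_gap(left = y[1:100], right = y[113:200], gap_len = 12)
```

Note that reversing a monthly series preserves its period-12 seasonality, which is what lets the same model form be used for both directions.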
Figure 10 shows results for the NDX series for Columbus, OH obtained using a SARIMA
time-series model as described above. The original series with a portion of the data deleted is
shown in Figure 10A. Examples of imputation for the missing data when forecasting,
backcasting, and both are used to obtain the imputed values are shown in Plots B, C, and D of
Figure 10, respectively.
7 “Sufficiently large” was defined as 25 consecutive months. Since one month is discarded, this meant that we had at least two data points for each month in the sequence.
[Figure: four Columbus, OH log(NDX) panels with imputed values and confidence bounds marked]
A. Columbus log(NDX) series with section to be imputed removed
B. Forecasting using SARIMA model to fill in the gap in the Columbus log(NDX) series
C. Backcasting using SARIMA model to fill in the gap in the Columbus log(NDX) series
D. Linear combination of forecasting and backcasting using SARIMA model to fill in gap in the Columbus log(NDX) series
Figure 10. SARIMA model imputation for NDX series in Columbus, OH.
VI. VARIANCE ESTIMATES

We used standard techniques to estimate the variance in the imputed data. When the average
crime count was under 1 crime per month, we assumed that the data followed a Poisson
distribution, so the mean monthly count served as both the imputed value and the variance
estimate. Figure 6 shows the upper confidence bound obtained using these variance estimates.
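For the under-1-crime-per-month case the whole method fits in a few lines. The sketch below is in Python rather than the report's R, and the monthly counts are made up for illustration:

```python
import math

# Observed monthly counts for a low-count agency (illustrative values);
# the reported mean is both the imputed value and the Poisson variance.
observed = [0, 1, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0]
m = len(observed)            # number of months actually reported
avg = sum(observed) / m      # imputed value for each missing month

# 95% bounds on the mean, using var(avg) = avg / m under the Poisson model
half_width = 1.96 * math.sqrt(avg / m)
lower = max(0.0, avg - half_width)   # a count cannot be negative
upper = avg + half_width
```

This mirrors the confidence-limit lines in the appendix's mean-imputation program, where the lower limit is likewise truncated at zero.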
When the average crime count for an agency was between 1 and 35 crimes per month, we
used a Poisson GLM and obtained variance estimates for the coefficients in the model. When
there are no consecutive missing months, the prediction variance can be computed easily in
R using its GLM capabilities. When there are consecutive missing months, however, we run into
the problem of making predictions from models estimated on data that include imputed values.
Our current variance estimates therefore tend to be somewhat too small, because they ignore the
variability carried over from all previous imputations. Deriving the correct variance is difficult,
but Bayesian tools8, which take a different approach to the problem, might be useful in handling
this difficulty.
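As a rough illustration of where a GLM prediction variance comes from, the sketch below fits a Poisson GLM with a log link by iteratively reweighted least squares and applies the delta method to obtain a standard error on the count scale. It is a simplified Python stand-in for R's GLM machinery; the simulated seasonal series, sample size, and seed are all invented for the example.

```python
import numpy as np

# Simulated 60-month series with a sinusoidal seasonal effect.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(60), np.sin(2 * np.pi * np.arange(60) / 12)])
y = rng.poisson(np.exp(1.0 + 0.3 * X[:, 1]))

# Fit the Poisson GLM (log link) by iteratively reweighted least squares.
beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ beta)
    W = np.diag(mu)
    z = X @ beta + (y - mu) / mu          # working response
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)

# Coefficient covariance, then the delta method for a new month: the
# standard error of x'beta maps to mu * se_eta on the count scale.
cov_beta = np.linalg.inv(X.T @ np.diag(np.exp(X @ beta)) @ X)
x_new = np.array([1.0, np.sin(2 * np.pi * 60 / 12)])
se_eta = np.sqrt(x_new @ cov_beta @ x_new)
mu_hat = np.exp(x_new @ beta)
pred_se = mu_hat * se_eta
```

This standard error is conditional on the regressors being fully observed; when earlier imputed values enter the estimation data, as described above, it understates the true uncertainty.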
For an average reported crime count above 35 crimes per month, where we used SARIMA
models, R provides variance estimates for the forecasts that increase as the forecasts move away
from the data. Confidence bounds based on these variance estimates for the imputed values are
shown in Figure 10. Note that the bounds widen noticeably as we move away from the data
used for the forecasts (Figure 10B) or backcasts (Figure 10C), but are more stable for the
estimates from the linear combination of the forecasts and backcasts (Figure 10D).
For the NDX series, each imputation is the sum of the imputations from the seven index
crime series. Because these seven crime series are not independent, we used Taylor series
expansions to derive variance estimates for the NDX imputations. The same technique is used to
calculate variance estimates for the imputed counts in the Violent Crime Index and the Property
Crime Index.
8 Developing Bayesian methods for imputation of the UCR data is an area of future research for our group.
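For a sum, the Taylor (delta-method) calculation reduces to pre- and post-multiplying the covariance matrix of the component imputations by a vector of ones, which picks up all the covariance terms that an independence assumption would ignore. The sketch below uses a made-up 3×3 covariance matrix for brevity; the report's case has seven series:

```python
import numpy as np

# Illustrative covariance matrix of three crime-series imputations
# (the off-diagonal terms capture their dependence).
sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.8],
                  [0.5, 0.8, 5.0]])

grad = np.ones(3)                     # gradient of g(x) = sum(x)
var_sum = grad @ sigma @ grad         # variance of the summed imputation
var_if_independent = np.trace(sigma)  # what ignoring correlation gives
```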
VII. CONCLUSION

This report has described our research in developing algorithms to impute missing UCR data,
and it includes the R code for applying them to the data. Going from this step to the next one –
actual imputation of the data – is a straightforward but nontrivial task. As we mentioned above,
dealing with covered-by situations may be complicated. Developing Bayesian methods and
improving our variance estimates based on our imputation models remain for future research. In
the end, however, these activities will provide the criminal justice research and policy
communities with some 18,000 agency-level time series for the seven Index crimes, a relatively
small price to pay for what may turn out to be a large payoff.
REFERENCES
Abraham, B. (1981). "Missing observations in time series." Communications in Statistics, A, 10, 1643–1653.
Goodall, C. R., Kafadar, K., and Tukey, J. W. (1998). "Computing and using rural versus urban measures in statistical applications." American Statistician, 52, 101–111.
Lejins, P. P., Chute, C. F., and Schrotel, S. F. (1958). Uniform Crime Reporting: Report of the Consultant Committee. Unpublished report to the FBI; available from the second author.
Little, R. J. A., and Rubin, D. B. (2002). Statistical Analysis with Missing Data. New York: Wiley.
Lott, J. R., Jr. (1998). More Guns, Less Crime: Understanding Crime and Gun Control Laws. Chicago, IL: University of Chicago Press.
Lott, J. R., Jr., and Mustard, D. B. (1997). "Crime, deterrence, and right-to-carry concealed handguns." Journal of Legal Studies, 26(1), 1–68.
Maltz, M. D. (1999). Bridging Gaps in Police Crime Data. NCJ 176365. Washington, D.C.: Bureau of Justice Statistics.
Maltz, M. D., and Targonski, J. (2002). "A note on the use of county-level crime data." Journal of Quantitative Criminology, 18(3), 297–318.
Maltz, M. D., and Targonski, J. (2003). "Measurement and other errors in county-level UCR data: A reply to Lott and Whitley." Journal of Quantitative Criminology, 19(2), 199–206.
Tukey, J. W. (1977). Exploratory Data Analysis. Reading, MA: Addison-Wesley.
APPENDIX: COMPUTER CODE
####################################################################
##                                                                ##
##  Author: Clint Roberts                                         ##
##  Title:  Data Channel                                          ##
##                                                                ##
##  Description: This program puts the data from Excel into R     ##
##                                                                ##
####################################################################

library(RODBC)

## One ODBC connection per state workbook.  Note that each assignment to
## `n` (the number of agencies in the state) overwrites the previous one,
## so leave in effect the value matching the channel used in the
## sqlFetch calls below.
channel  <- odbcConnectExcel("C:\\Documents and Settings\\Clint Roberts\\Desktop\\RA work\\AL.xls")
n <- 470
channel2 <- odbcConnectExcel("C:\\Documents and Settings\\Clint Roberts\\Desktop\\RA work\\AR.xls")
n <- 250
channel3 <- odbcConnectExcel("C:\\Documents and Settings\\Clint Roberts\\Desktop\\RA work\\IA.xls")
n <- 265
channel4 <- odbcConnectExcel("C:\\Documents and Settings\\Clint Roberts\\Desktop\\RA work\\OH.xls")
n <- 720

## Each crime series is split across three worksheets (e.g. NDX1-NDX3).
NDX1 <- sqlFetch(channel, "NDX1", max=n); NDX2 <- sqlFetch(channel, "NDX2", max=n); NDX3 <- sqlFetch(channel, "NDX3", max=n)
MUR1 <- sqlFetch(channel, "MUR1", max=n); MUR2 <- sqlFetch(channel, "MUR2", max=n); MUR3 <- sqlFetch(channel, "MUR3", max=n)
RPF1 <- sqlFetch(channel, "RPF1", max=n); RPF2 <- sqlFetch(channel, "RPF2", max=n); RPF3 <- sqlFetch(channel, "RPF3", max=n)
RBT1 <- sqlFetch(channel, "RBT1", max=n); RBT2 <- sqlFetch(channel, "RBT2", max=n); RBT3 <- sqlFetch(channel, "RBT3", max=n)
AGA1 <- sqlFetch(channel, "AGA1", max=n); AGA2 <- sqlFetch(channel, "AGA2", max=n); AGA3 <- sqlFetch(channel, "AGA3", max=n)
BUR1 <- sqlFetch(channel, "BUR1", max=n); BUR2 <- sqlFetch(channel, "BUR2", max=n); BUR3 <- sqlFetch(channel, "BUR3", max=n)
LAR1 <- sqlFetch(channel, "LAR1", max=n); LAR2 <- sqlFetch(channel, "LAR2", max=n); LAR3 <- sqlFetch(channel, "LAR3", max=n)
VTT1 <- sqlFetch(channel, "VTT1", max=n); VTT2 <- sqlFetch(channel, "VTT2", max=n); VTT3 <- sqlFetch(channel, "VTT3", max=n)
Group      <- sqlFetch(channel, "Group",      max=n)
Population <- sqlFetch(channel, "Population", max=n)
ORI <- NDX1[,1]

## Bind the three worksheets per crime and drop the repeated agency-ID
## (ORI) columns carried in from each worksheet.
NDX <- cbind(NDX1, NDX2, NDX3); NDX <- NDX[,-1]; NDX <- NDX[,-241]; NDX <- NDX[,-481]
MUR <- cbind(MUR1, MUR2, MUR3); MUR <- MUR[,-1]; MUR <- MUR[,-241]; MUR <- MUR[,-481]
RPF <- cbind(RPF1, RPF2, RPF3); RPF <- RPF[,-1]; RPF <- RPF[,-241]; RPF <- RPF[,-481]
AGA <- cbind(AGA1, AGA2, AGA3); AGA <- AGA[,-1]; AGA <- AGA[,-241]; AGA <- AGA[,-481]
BUR <- cbind(BUR1, BUR2, BUR3); BUR <- BUR[,-1]; BUR <- BUR[,-241]; BUR <- BUR[,-481]
LAR <- cbind(LAR1, LAR2, LAR3); LAR <- LAR[,-1]; LAR <- LAR[,-241]; LAR <- LAR[,-481]
RBT <- cbind(RBT1, RBT2, RBT3); RBT <- RBT[,-1]; RBT <- RBT[,-241]; RBT <- RBT[,-481]
VTT <- cbind(VTT1, VTT2, VTT3); VTT <- VTT[,-1]; VTT <- VTT[,-241]; VTT <- VTT[,-481]
####################################################################
##                                                                ##
##  Author: Clint Roberts                                         ##
##  Title:  Mean Imputation                                       ##
##                                                                ##
##  Description: This program makes imputations in time series    ##
##               with 1 or lower average crime counts             ##
##                                                                ##
####################################################################

## Section I: what crime do you want to work with?  Put it in for tc.
tc  <- RPF
tc2 <- t(tc)
tc3 <- tc2

## Find which agencies have aggregated months (flagged by codes < -100).
j <- 1
agg <- array()
for (i in 1:n) {
  tester <- tc2[,i]
  if (length(tester[tester < (-100)]) > 0) { agg[j] <- i; j <- j + 1 }
}

## Adjust aggregated months to 0 when the month the counts were rolled
## into reported zero crimes.
month <- rep(1:12, 43)   # month-of-year index for the 516-month series
for (j in agg) {
  for (i in 1:516) {
    if (tc3[i,j] < -100) {
      off      <- -100 - tc3[i,j]
      aggmonth <- i - month[i] + off
      if (tc3[aggmonth,j] == 0) { tc3[i,j] <- 0 }
    }
  }
}

## Find which agencies still have aggregated months.
j <- 1
agg <- array()
for (i in 1:n) {
  tester <- tc3[,i]
  if (length(tester[tester < (-100)]) > 0) { agg[j] <- i; j <- j + 1 }
}

## Section II: calculate the average and make the imputations.
tc[tc < (-100)] <- 0
tc[tc < (-3)]   <- NA
tc <- t(tc)
colnames(tc) <- ORI
avg <- apply(tc, 2, mean, na.rm=T)
mm  <- apply(tc, 2, is.na)
mm  <- 516 - apply(mm, 2, sum)   # number of reported months per agency
length(avg[avg < 1])
sum(is.nan(avg))

## Find which series are eligible for method 1 (mean under 1 crime/month).
j <- 1
means <- array()
for (i in 1:n) {
  if (avg[i] < 1 && is.nan(avg[i]) == FALSE) { means[j] <- i; j <- j + 1 }
}

## Make the imputations and the confidence limits.
tc2 <- t(MUR)   # note: the original transposes MUR here, not the crime chosen for tc
climits1 <- matrix(NA, ncol=n, nrow=516)
climits2 <- matrix(NA, ncol=n, nrow=516)
for (i in means) {
  for (m in 1:516) {
    if (tc2[m,i] < (-90) && tc2[m,i] > (-100)) {
      tc2[m,i]      <- avg[i]
      climits1[m,i] <- avg[i] - 1.96*sqrt(avg[i]/mm[i])
      climits2[m,i] <- avg[i] + 1.96*sqrt(avg[i]/mm[i])
    }
  }
}
climits1[climits1 < 0] <- NA

## Make the plots.
v <- 189
plot(tc2[,v], type="l", ylim=c(0,1))
points(1:516, climits1[,v], pch="-")
points(1:516, climits2[,v], pch="-")

## Adjust months for aggregation under method 1: spread the aggregated
## count evenly over the months it covers.
for (j in agg) {
  for (i in 1:516) {
    if (tc3[i,j] < (-100)) {
      off      <- -100 - tc3[i,j]
      aggmonth <- i - month[i] + off
      q <- 0
      s <- i
      while (tc3[s,j] < (-100)) { q <- q + 1; s <- s + 1 }
      tc3[i:(s-1),j]  <- (1/(q+1)) * tc3[aggmonth,j]
      tc3[aggmonth,j] <- (1/(q+1)) * tc3[aggmonth,j]
    }
  }
}
####################################################################
##                                                                ##
##  Author: Clint Roberts                                         ##
##  Title:  GLM Imputation                                        ##
##                                                                ##
##  Description: This program makes imputations in time series    ##
##               when average crime counts are between 1 and 35   ##
##                                                                ##
####################################################################

library(MASS)   # for negative.binomial(), used in the glm() calls below

## Section I: what crime do you want to work with?  Put it in for tc.
tc <- LAR
data3       <- tc
data3.copy  <- data3
data3.copy2 <- data3

## List all series for GLM imputation according to the decision tree.
tc[tc < (-3)] <- NA
tc <- t(tc)
tc <- as.numeric(tc)
tc <- matrix(tc, ncol=n)
colnames(tc) <- ORI
avg <- apply(tc, 2, mean, na.rm=T)
n.glms <- length(avg[avg < 35 & avg > 1]) - sum(is.nan(avg))  # vectorized &; the original's && uses only the first elements
n.glms

j2 <- 1
glms <- array()
for (u in 1:n) {
  if (avg[u] < 35 && is.nan(avg[u]) == FALSE && avg[u] > 1) { glms[j2] <- u; j2 <- j2 + 1 }
}

## What agency do you want to work with?  Put it in for i.
i <- 72

## Create a month factor to include seasonality in the model.
jan <- seq(1, 515, by=12)
feb <- jan + 1;  mar <- jan + 2;  apr <- jan + 3
may <- jan + 4;  jun <- jan + 5;  jul <- jan + 6
aug <- jan + 7;  sep <- jan + 8;  oct <- jan + 9
nov <- jan + 10; dec <- jan + 11
mon <- 1:516
mon[jan] <- "jan"; mon[feb] <- "feb"; mon[mar] <- "mar"; mon[apr] <- "apr"
mon[may] <- "may"; mon[jun] <- "jun"; mon[jul] <- "jul"; mon[aug] <- "aug"
mon[sep] <- "sep"; mon[oct] <- "oct"; mon[nov] <- "nov"; mon[dec] <- "dec"
mont <- factor(mon[1:516])

## Look at the correlation between two agencies' series; series 55
## supplies the auxiliary covariate used in the models below.
auxdata <- data3[,55]
cor(data3[,i], auxdata, use="pairwise.complete.obs")
plot(data3[,i], type="l")
## Section II: find the biggest observed (gap-free) section in the series.
data3 <- LAR
data3 <- t(data3)
#data3[,i] <- data3[516:1,i]
plot(data3[,i], type="l", ylim=c(0,50))

big <- matrix(0, nrow=516, ncol=1)
m <- 1
a <- 1

## Scan to the first observed month (missing months are coded < -3).
NOTDONE <- TRUE
while (NOTDONE) {
  if (data3[m,i] < (-3)) { m <- m + 1 } else { a <- m; NOTDONE <- FALSE }
}

## Scan to the end of that observed run.
m <- a + 1
b <- a + 1
NOTDONE <- TRUE
while (NOTDONE) {
  if (data3[m,i] > (-3)) { m <- m + 1 } else { b <- m - 1; NOTDONE <- FALSE }
}

len  <- (b - a)
biga <- a
bigb <- b
m  <- b + 1
a2 <- b + 1

## Walk through the remaining runs, keeping the longest one.
STOP <- FALSE
while (STOP == FALSE) {
  NOTDONE <- TRUE
  while (NOTDONE) {
    if (data3[m,i] < (-3)) {
      m <- m + 1
      if (m > 516) { STOP <- TRUE; NOTDONE <- FALSE; a2 <- m }
    } else { a2 <- m; NOTDONE <- FALSE }
  }
  m  <- a2 + 1
  b2 <- a2 + 1
  NOTDONE <- TRUE
  while (NOTDONE) {
    if (data3[m,i] > (-3)) {
      m <- m + 1
      if (m > 516) { STOP <- TRUE; NOTDONE <- FALSE; b2 <- m - 1 }
    } else { b2 <- m - 1; NOTDONE <- FALSE }
  }
  if ((b2 - a2) > len) {
    len <- (b2 - a2)
    big[a2:b2] <- data3[a2:b2,i]
    biga <- a2
    bigb <- b2
  }
}
## Section III: make imputations going from left to right, starting at
## the big section of the series.
section <- matrix(0, ncol=2000, nrow=516)
#climits <- matrix(NA, ncol=2, nrow=516)
nap <- 1
WORKING <- TRUE
while (WORKING) {
  ## Locate the observed run before the gap (a:b) and the run after it
  ## (a2:b2), clamping the scan at month 516.
  m <- biga
  a <- biga
  NOTDONE <- TRUE
  if (m > 516) { m <- 516 }
  while (NOTDONE) {
    if (data3[m,i] < (-3)) {
      m <- m + 1
      if (m > 516) { NOTDONE <- FALSE; a <- m }
    } else { a <- m; NOTDONE <- FALSE }
  }
  m <- a + 1
  b <- a + 1
  if (m > 516) { m <- 516 }
  NOTDONE <- TRUE
  while (NOTDONE) {
    if (data3[m,i] > (-3)) {
      m <- m + 1
      if (m > 516) { NOTDONE <- FALSE; b <- m - 1 }
    } else { b <- m - 1; NOTDONE <- FALSE }
  }
  m  <- b + 1
  a2 <- b + 1
  if (m > 516) { m <- 516 }
  NOTDONE <- TRUE
  while (NOTDONE) {
    if (data3[m,i] < (-3)) {
      m <- m + 1
      if (m > 516) { NOTDONE <- FALSE; a2 <- m }
    } else { a2 <- m; NOTDONE <- FALSE }
  }
  m  <- a2 + 1
  b2 <- a2 + 1
  if (m > 516) { m <- 516 }
  NOTDONE <- TRUE
  while (NOTDONE) {
    if (data3[m,i] > (-3)) {
      m <- m + 1
      if (m > 516) { NOTDONE <- FALSE; b2 <- m - 1 }
    } else { b2 <- m - 1; NOTDONE <- FALSE }
  }

  ## Case 1: at least 24 observed months on both sides of the gap --
  ## forecast from the left, backcast from the right, and average.
  if ((b-a) >= 24 && (b2-a2) >= 24) {
    section[1:(b-a+1), nap]       <- data3[a:b, i]
    section[(b2-a2+1):1, (nap+1)] <- data3[a2:b2, i]
    pred  <- array()
    predb <- array()
    preda <- array()
    fmonth <- 1
    for (y in (b+1):(a2-1)) {
      x    <- section[1:(b-a+fmonth-1), nap]
      w    <- section[1:(b2-a2+fmonth-1), (nap+1)]
      z    <- auxdata[(a+1):(b+fmonth-1)]
      z2   <- auxdata[(b2-1):(a2-fmonth+1)]
      mona <- mont[(a+1):(b+fmonth-1)]
      monb <- mont[(b2-1):(a2-fmonth+1)]
      modela <- glm(section[2:(b-a+1+fmonth-1), nap] ~ I(log(x+1)) + I(log(z+1)) + mona,
                    family=negative.binomial(theta=1))
      predglma <- predict(modela,
                          newdata=data.frame(x=section[(b-a+1+fmonth-1), nap],
                                             z=auxdata[b+fmonth], mona=mont[b+fmonth]),
                          se.fit=TRUE, type="response")
      #mse <- sum(((data[1:(length(data)-1)]) - (model$fitted.values))**2)/(length(data)-3)
      #lower[i] <- pred$fit - 1.96*(sqrt(pred$se.fit**2+mse))
      #upper[i] <- pred$fit + 1.96*(sqrt(pred$se.fit**2+mse))
      modelb <- glm(section[2:(b2-a2+1+fmonth-1), (nap+1)] ~ I(log(w+1)) + I(log(z2+1)) + monb,
                    family=negative.binomial(theta=1))
      predglmb <- predict(modelb,
                          newdata=data.frame(w=section[(b2-a2+1+fmonth-1), (nap+1)],
                                             z2=auxdata[a2-fmonth], monb=mont[a2-fmonth]),
                          se.fit=TRUE, type="response")
      preda[fmonth] <- predglma$fit
      predb[fmonth] <- predglmb$fit
      if (data3[y,i] < (-90)) {
        #data3[y,i] <- pred[fmonth]
        #data3o[q,i] <- pred[fmonth]
        section[(b-a+fmonth+1), nap]       <- preda[fmonth]
        section[(b2-a2+fmonth+1), (nap+1)] <- predb[fmonth]
        fmonth <- fmonth + 1
      }
    }
    fmonth <- 1
    for (y in (b+1):(a2-1)) {
      pred[fmonth] <- .5 * (predb[a2-b-fmonth] + preda[fmonth])
      if (data3[y,i] < (-90)) { data3[y,i] <- pred[fmonth]; fmonth <- fmonth + 1 }
    }
    nap <- nap + 1
  }

  ## Case 2: only the left side is long enough -- forecast forward.
  if ((b-a) >= 24 && (b2-a2) < 24 && b < 516) {
    section[1:(b-a+1), nap] <- data3[a:b, i]
    pred  <- array()
    preda <- array()
    fmonth <- 1
    for (y in (b+1):(a2-1)) {
      x    <- section[1:(b-a+fmonth-1), nap]
      z    <- auxdata[(a+1):(b+fmonth-1)]
      mona <- mont[(a+1):(b+fmonth-1)]
      modela <- glm(section[2:(b-a+1+fmonth-1), nap] ~ I(log(x+1)) + I(log(z+1)) + mona,
                    family=negative.binomial(theta=1))
      predglma <- predict(modela,
                          newdata=data.frame(x=section[(b-a+1+fmonth-1), nap],
                                             z=auxdata[b+fmonth], mona=mont[b+fmonth]),
                          se.fit=TRUE, type="response")
      preda[fmonth] <- predglma$fit
      if (data3[y,i] < (-90)) { section[(b-a+fmonth+1), nap] <- preda[fmonth]; fmonth <- fmonth + 1 }
    }
    fmonth <- 1
    for (y in (b+1):(a2-1)) {
      pred[fmonth] <- preda[fmonth]
      if (data3[y,i] < (-90)) { data3[y,i] <- pred[fmonth]; fmonth <- fmonth + 1 }
    }
    nap <- nap + 1
  }

  ## Case 3: only the right side is long enough -- backcast.
  if ((b-a) < 24 && (b2-a2) >= 24) {
    section[(b2-a2+1):1, (nap+1)] <- data3[a2:b2, i]
    pred  <- array()
    predb <- array()
    fmonth <- 1
    for (y in (b+1):(a2-1)) {
      w    <- section[1:(b2-a2+fmonth-1), (nap+1)]
      z2   <- auxdata[(b2-1):(a2-fmonth+1)]
      monb <- mont[(b2-1):(a2-fmonth+1)]
      modelb <- glm(section[2:(b2-a2+1+fmonth-1), (nap+1)] ~ monb + I(log(z2+1)) + I(log(w+1)),
                    family=negative.binomial(theta=1))
      predglmb <- predict(modelb,
                          newdata=data.frame(w=section[(b2-a2+1+fmonth-1), (nap+1)],
                                             z2=auxdata[a2-fmonth], monb=mont[a2-fmonth]),
                          se.fit=TRUE, type="response")
      predb[fmonth] <- predglmb$fit
      if (data3[y,i] < (-90)) { section[(b2-a2+fmonth+1), (nap+1)] <- predb[fmonth]; fmonth <- fmonth + 1 }
    }
    fmonth <- 1
    for (y in (b+1):(a2-1)) {
      pred[fmonth] <- predb[a2-b-fmonth]
      if (data3[y,i] < (-90)) { data3[y,i] <- pred[fmonth]; fmonth <- fmonth + 1 }
    }
    nap <- nap + 1
  }
  if (m > 515) { WORKING <- FALSE }
}

## Section IV: make imputations going from right to left, starting at
## the big section of the series.  The series is reversed, the same
## procedure is applied, and the result is reversed back.
data3b <- data3   # working copies (not created in the original listing)
data3c <- data3
WORKING <- TRUE
while (WORKING) {
  data3b[,i] <- data3[,i]
  data3b[,i] <- data3b[516:1, i]
  M <- 517 - bigb
  A <- 517 - bigb
  NOTDONE <- TRUE
  if (M > 516) { M <- 516 }
  while (NOTDONE) {
    if (data3b[M,i] < (-3)) {
      M <- M + 1
      if (M > 516) { NOTDONE <- FALSE; A <- M }
    } else { A <- M; NOTDONE <- FALSE }
  }
  M <- A + 1
  B <- A + 1
  if (M > 516) { M <- 516 }
  NOTDONE <- TRUE
  while (NOTDONE) {
    if (data3b[M,i] > (-3)) {
      M <- M + 1
      if (M > 516) { NOTDONE <- FALSE; B <- M - 1 }
    } else { B <- M - 1; NOTDONE <- FALSE }
  }
  M <- B + 1
  if (M > 516) { M <- 516 }
  A2 <- M
  NOTDONE <- TRUE
  while (NOTDONE) {
    if (data3b[M,i] < (-3)) {
      M <- M + 1
      if (M > 516) { NOTDONE <- FALSE; A2 <- M }
    } else { A2 <- M; NOTDONE <- FALSE }
  }
  M <- A2 + 1
  if (M > 516) { M <- 516 }
  B2 <- M
  NOTDONE <- TRUE
  while (NOTDONE) {
    if (data3b[M,i] > (-3)) {
      M <- M + 1
      if (M > 516) { NOTDONE <- FALSE; B2 <- M - 1 }
    } else { B2 <- M - 1; NOTDONE <- FALSE }
  }

  ## The same three cases as Section III, applied to the reversed series.
  if ((B-A) >= 24 && (B2-A2) >= 24) {
    section[1:(B-A+1), nap]       <- data3b[A:B, i]
    section[(B2-A2+1):1, (nap+1)] <- data3b[A2:B2, i]
    pred  <- array()
    predb <- array()
    preda <- array()
    fmonth <- 1
    for (y in (B+1):(A2-1)) {
      x    <- section[1:(B-A+fmonth-1), nap]
      w    <- section[1:(B2-A2+fmonth-1), (nap+1)]
      z    <- auxdata[(A+1):(B+fmonth-1)]
      z2   <- auxdata[(B2-1):(A2-fmonth+1)]
      mona <- mont[(A+1):(B+fmonth-1)]
      monb <- mont[(B2-1):(A2-fmonth+1)]
      modela <- glm(section[2:(B-A+1+fmonth-1), nap] ~ I(log(x+1)) + I(log(z+1)) + mona,
                    family=negative.binomial(theta=1))
      predglma <- predict(modela,
                          newdata=data.frame(x=section[(B-A+1+fmonth-1), nap],
                                             z=auxdata[B+fmonth], mona=mont[B+fmonth]),
                          se.fit=TRUE, type="response")
      modelb <- glm(section[2:(B2-A2+1+fmonth-1), (nap+1)] ~ I(log(w+1)) + I(log(z2+1)) + monb,
                    family=negative.binomial(theta=1))
      predglmb <- predict(modelb,
                          newdata=data.frame(w=section[(B2-A2+1+fmonth-1), (nap+1)],
                                             z2=auxdata[A2-fmonth], monb=mont[A2-fmonth]),
                          se.fit=TRUE, type="response")
      preda[fmonth] <- predglma$fit
      predb[fmonth] <- predglmb$fit
      if (data3b[y,i] < (-90)) {
        section[(B-A+fmonth+1), nap]       <- preda[fmonth]
        section[(B2-A2+fmonth+1), (nap+1)] <- predb[fmonth]
        fmonth <- fmonth + 1
      }
    }
    fmonth <- 1
    for (y in (B+1):(A2-1)) {
      pred[fmonth] <- .5 * (predb[A2-B-fmonth] + preda[fmonth])
      if (data3b[y,i] < (-90)) { data3b[y,i] <- pred[fmonth]; fmonth <- fmonth + 1 }
    }
    nap <- nap + 1
  }
  if ((B-A) >= 24 && (B2-A2) < 24 && B < 515) {
    section[1:(B-A+1), nap] <- data3b[A:B, i]
    pred  <- array()
    preda <- array()
    fmonth <- 1
    for (y in (B+1):(A2-1)) {
      x    <- section[1:(B-A+fmonth-1), nap]
      z    <- auxdata[(A+1):(B+fmonth-1)]
      mona <- mont[(A+1):(B+fmonth-1)]
      modela <- glm(section[2:(B-A+1+fmonth-1), nap] ~ I(log(x+1)) + I(log(z+1)) + mona,
                    family=negative.binomial(theta=1))
      predglma <- predict(modela,
                          newdata=data.frame(x=section[(B-A+1+fmonth-1), nap],
                                             z=auxdata[B+fmonth], mona=mont[B+fmonth]),
                          se.fit=TRUE, type="response")
      preda[fmonth] <- predglma$fit
      if (data3b[y,i] < (-90)) { section[(B-A+fmonth+1), nap] <- preda[fmonth]; fmonth <- fmonth + 1 }
    }
    fmonth <- 1
    for (y in (B+1):(A2-1)) {
      pred[fmonth] <- preda[fmonth]
      if (data3b[y,i] < (-90)) { data3b[y,i] <- pred[fmonth]; fmonth <- fmonth + 1 }
    }
    nap <- nap + 1
  }
  if ((B-A) < 24 && (B2-A2) >= 24) {
    section[(B2-A2+1):1, (nap+1)] <- data3b[A2:B2, i]
    pred  <- array()
    predb <- array()
    fmonth <- 1
    for (y in (B+1):(A2-1)) {
      w    <- section[1:(B2-A2+fmonth-1), (nap+1)]
      z2   <- auxdata[(B2-1):(A2-fmonth+1)]
      monb <- mont[(B2-1):(A2-fmonth+1)]
      modelb <- glm(section[2:(B2-A2+1+fmonth-1), (nap+1)] ~ monb + I(log(z2+1)) + I(log(w+1)),
                    family=negative.binomial(theta=1))
      predglmb <- predict(modelb,
                          newdata=data.frame(w=section[(B2-A2+1+fmonth-1), (nap+1)],
                                             z2=auxdata[A2-fmonth], monb=mont[A2-fmonth]),
                          se.fit=TRUE, type="response")
      predb[fmonth] <- predglmb$fit
      if (data3b[y,i] < (-90)) { section[(B2-A2+fmonth+1), (nap+1)] <- predb[fmonth]; fmonth <- fmonth + 1 }
    }
    fmonth <- 1
    for (y in (B+1):(A2-1)) {
      pred[fmonth] <- predb[A2-B-fmonth]
      if (data3b[y,i] < (-90)) { data3b[y,i] <- pred[fmonth]; fmonth <- fmonth + 1 }
    }
    nap <- nap + 1
  }
  ## Reverse the reversed series back into chronological order.
  data3c[,i] <- data3b[,i]
  data3c[,i] <- data3c[516:1, i]
  data3[,i]  <- data3c[,i]
  if (M > 515) { WORKING <- FALSE }
}
plot(data3[,i], type="l", ylim=c(0,50))
#plot(data3[516:1,i], type="l", ylim=c(0,50))
#################################################################### ## ## ## Author: Clint Roberts ## ## Title: Time Series Imputation ## ## ## ## Description: This program makes imputations in time series ## ## ## #################################################################### ################################ ## ## ## Section I ## ## ## #################################################################### ## ## ## What crime do you want to work with? Put in for tc ## ## ## #################################################################### tc <- LAR data3 <- tc data3.copy <- data3 #################################################################### ## ## ## Program ## ## Description: Lists all series for time series ## ## imputation according to the decision tree. ## ## ## #################################################################### tc[tc < (-3)] <- NA tc <- t(tc) tc <- as.numeric(tc) tc <- matrix(tc, ncol=n) colnames(tc) <- ORI avg <- apply(tc,2,mean, na.rm=T) n.times <- length(avg[avg > 35 ]) - sum(is.nan(avg)) n.times j <- 1 times <- array() for ( i in 1:n) { if (avg[i] > 35 && is.nan(avg[i])==FALSE) { times[j] <- i j <- j+1 } } times
49
#################################################################### ## ## ## What agency do you want to work with? Put in for i ## ## ## #################################################################### i <- 189 ################################ ## ## ## Section II ## ## ## #################################################################### ## ## ## Program ## ## Description: Detects outliers in time series and ## ## creates a dataset with outliers removed ## ## called data3o to be used in model ## ## ## #################################################################### data3 <- t(data3) data3o <- data3 data3bo <- data3 # Note: all plots in this entire program are commented out #plot(data3[,i], type="l") outlimit1 <- matrix(100000, 516) outlimit3 <- matrix(-10, 516) u <- 37 for( m in 37:479) { q1 <- quantile(window(tc[,i], m-36,m+36), .25, na.rm=T) q3 <- quantile(window(tc[,i], m-36,m+36), .75, na.rm=T) iqr <- q3-q1 if(is.na(q1) == FALSE) { outlimit1[u] <- q1 - 1.5 * iqr outlimit3[u] <- q3 + 1.5 * iqr } u <- u+1 } outlimit1[1:36] <- outlimit1[37] outlimit3[1:36] <- outlimit3[37] outlimit1[480:516] <- outlimit1[479] outlimit3[480:516] <- outlimit3[479] #points(1:516, outlimit3) #points(1:516, outlimit1) go <- matrix(FALSE,516)
50
for(h in 1:516) { if(data3[h,i] < outlimit1[h]) { go[h] <- TRUE } if(data3[h,i] > outlimit3[h]) { go[h] <- TRUE } if(go[h] == TRUE) { if( data3[h,i] > (-3) ) { data3[h,i] <- NA } } } data3o[,i] <- data3[,i] data3bo[,i] <- data3[,i] data3bo[,i] <- data3bo[516:1, i] ################################ ## ## ## Section III ## ## ## #################################################################### ## ## ## Program ## ## Description: Finds the biggest observed section ## ## in the time series ## ## ## #################################################################### data3 <- t(data3.copy) data3b <- data3 data3c <- data3 big <- matrix(0, nrow = 516, ncol=1) m <- 1 a <- 1 NOTDONE <- TRUE while(NOTDONE == TRUE) { if (data3[m,i]< (-3)) { m <- m +1 }
51
else { a <- m NOTDONE <- FALSE } } m <- a + 1 b <- a + 1 NOTDONE <- TRUE while(NOTDONE == TRUE) { if (data3[m,i] > (-3) ) { m <- m +1 } else { b <- m - 1 NOTDONE <- FALSE } } len <- (b-a) biga <- a bigb <- b m <- b + 1 a2 <- b + 1 STOP <- FALSE while( STOP == FALSE) { NOTDONE <- TRUE while(NOTDONE == TRUE) { if (data3[m,i] < (-3)) { m <- m +1 if( m > 516) { STOP <- TRUE NOTDONE <- FALSE a2 <- m } } else { a2 <- m NOTDONE <- FALSE } } m <- a2 + 1 if( m > 516) { m <- 516
52
} b2 <- m NOTDONE <- TRUE while(NOTDONE == TRUE) { if (data3[m,i] > (-3) ) { m <- m +1 if( m > 516) { STOP <- TRUE NOTDONE <- FALSE m <- 516 b2 <- m } } else { b2 <- m -1 NOTDONE <- FALSE } } if( (b2-a2) > len) { len <- (b2-a2) big[a2:b2] <- data3[a2:b2,i] biga<- a2 bigb<- b2 } } ################################ ## ## ## Section IV ## ## ## #################################################################### ## ## ## Program ## ## Description: Makes imputations going from left to right ## ## starting at the big section of the series ## ## ## #################################################################### section <- matrix(0,ncol= 2000, nrow = 516) #climits <- matrix(NA, ncol=2, nrow = 516) nap <- 1 WORKING <- TRUE while( WORKING == TRUE) {
53
m <- biga a <- biga NOTDONE <- TRUE if(m > 516) { m <- 516 } while(NOTDONE == TRUE) { if (data3[m,i]< (-3)) { m <- m +1 if( m > 516) { NOTDONE <- FALSE a <- m } } else { a <- m NOTDONE <- FALSE } } m <- a + 1 b <- a + 1 if( m > 516) { m <- 516 } NOTDONE <- TRUE while(NOTDONE == TRUE) { if (data3[m,i] > (-3) ) { m <- m +1 if( m > 516) { NOTDONE <- FALSE b <- m-1 } } else { b <- m - 1 NOTDONE <- FALSE } } m <- b + 1 a2 <- b + 1 if ( m > 516) { m <- 516 }
54
NOTDONE <- TRUE while(NOTDONE == TRUE) { if (data3[m,i] < (-3)) { m <- m +1 if( m > 516) { NOTDONE <- FALSE a2 <- m } } else { a2 <- m NOTDONE <- FALSE } } m <- a2 + 1 b2 <- a2 + 1 if( m > 516) { m <- 516 } NOTDONE <- TRUE while(NOTDONE == TRUE) { if (data3[m,i] > (-3) ) { m <- m +1 if( m > 516) { NOTDONE <- FALSE b2 <- m-1 } } else { b2 <- m -1 NOTDONE <- FALSE } } if ( (b-a) >= 24 && (b2-a2) >= 24) { section[1:(b - a + 1),nap] <- data3o[a:b, i] section[(b2-a2+1):1, (nap + 1)] <- data3o[a2:b2,i] data.sarima.a <- arima(section[1:(b-a+1),nap], order=c(1,1,1), seasonal=list(order=c(1,0,0),period=12),include.mean=F) data.sarima.a fmonths.a <- (b+1):(a2-1)
55
    predarima.a <- predict(data.sarima.a, (a2-b-1))   # forecast across the gap
    forecast.a <- predarima.a$pred
    flimits.a <- 1.96 * round(predarima.a$se, 3)
    data.sarima.b <- arima(section[1:(b2-a2+1), (nap+1)], order = c(1,1,1),
                           seasonal = list(order = c(1,0,0), period = 12),
                           include.mean = F)
    data.sarima.b
    fmonths.b <- (a2-1):(b+1)
    predarima.b <- predict(data.sarima.b, (a2-b-1))   # backcast across the gap
    forecast.b <- predarima.b$pred
    flimits.b <- 1.96 * round(predarima.b$se, 3)
    # Combine the two forecasts, weighting each inversely by its
    # standard error (each forecast is multiplied by the other's se)
    pred <- array()
    for ( j in 1:length(fmonths.a))
    {
      pred[j] <- (predarima.b$se[j] * forecast.a[j] +
                  predarima.a$se[j] * forecast.b[j]) /
                 (predarima.b$se[j] + predarima.a$se[j])
    }
    for (j in 1:length(fmonths.a))
    {
      #climits[(b+j), 1] <- pred[j] - 1.96*(sqrt(2)) * predarima.a$se[j] * predarima.b$se[j] * (1/(predarima.a$se[j]+predarima.b$se[j]))
      #climits[(b+j), 2] <- pred[j] + 1.96*(sqrt(2)) * predarima.a$se[j] * predarima.b$se[j] * (1/(predarima.a$se[j]+predarima.b$se[j]))
    }
    # Fill in the missing months (coded < -90) with the combined forecast
    for ( q in (b+1):(a2-1))
    {
      if ( data3[q,i] < (-90))
      {
        data3[q,i] <- pred[(q-b)]
        data3o[q,i] <- pred[(q-b)]
      }
    }
  }
  # Only section a is long enough: use the forward forecast alone
  if ( (b-a) >= 24 && (b2-a2) < 24 && b < 516)
  {
    section[1:(b-a+1), nap] <- data3o[a:b, i]
    data.sarima.a <- arima(section[1:(b-a+1), nap], order = c(1,1,1),
                           seasonal = list(order = c(1,0,0), period = 12),
                           include.mean = F)
    data.sarima.a
    fmonths.a <- (b+1):(a2-1)
    predarima.a <- predict(data.sarima.a, (a2-b-1))
    forecast.a <- predarima.a$pred
    pred <- array()
    for ( j in 1:length(fmonths.a))
    {
      pred[j] <- forecast.a[j]
    }
    for (j in 1:length(fmonths.a))
    {
      #climits[b+j,1] <- pred[j] - 1.96*predarima.a$se[j]
      #climits[b+j,2] <- pred[j] + 1.96*predarima.a$se[j]
    }
    for ( q in (b+1):(a2-1))
    {
      if ( data3[q,i] < (-90))
      {
        data3[q,i] <- pred[(q-b)]
        data3o[q,i] <- pred[(q-b)]
      }
    }
  }
  # Only section b is long enough: use the backward forecast alone
  if( (b-a) < 24 && (b2-a2) >= 24)
  {
    section[(b2-a2+1):1, (nap+1)] <- data3o[a2:b2, i]
    data.sarima.b <- arima(section[1:(b2-a2+1), (nap+1)], order = c(1,1,1),
                           seasonal = list(order = c(1,0,0), period = 12),
                           include.mean = F)
    data.sarima.b
    fmonths.b <- (a2-1):(b+1)
    predarima.b <- predict(data.sarima.b, (a2-b-1))
    forecast.b <- predarima.b$pred
    flimits.b <- 1.96 * round(predarima.b$se, 3)
    pred <- array()
    for ( j in 1:length(fmonths.b))
    {
      pred[j] <- forecast.b[j]
    }
    for (j in 1:length(fmonths.b))
    {
      #climits[b+j,1] <- pred[j] - 1.96*predarima.b$se[j]
      #climits[b+j,2] <- pred[j] + 1.96*predarima.b$se[j]
    }
    for( q in (b+1):(a2-1))
    {
      if ( data3[q,i] < (-90))
      {
        data3[q,i] <- pred[(q-b)]
        data3o[q,i] <- pred[(q-b)]
      }
    }
  }
  if(m > 515)
  {
    WORKING <- FALSE
  }
}

####################################################################
##                                                                ##
##                         Section V                              ##
##                                                                ##
####################################################################
##                                                                ##
##  Program                                                       ##
##  Description: Makes imputations going from right to left       ##
##               starting at the big section of the series        ##
##                                                                ##
####################################################################
WORKING <- TRUE
while( WORKING == TRUE)
{
  # Reverse the 516-month series so that right-to-left imputation can
  # reuse the same forward-scanning logic as Section IV
  data3b[,i] <- data3[,i]
  data3b[,i] <- data3b[516:1, i]
  M <- 516 - b2 + 1
  A <- 516 - b2 + 1
  if(M > 516) { M <- 516 }
  # This finds section a
  NOTDONE <- TRUE
  while(NOTDONE == TRUE)
  {
    if (data3b[M,i] < (-3))
    {
      M <- M + 1
      if(M > 516)
      {
        NOTDONE <- FALSE
        A <- M
      }
    }
    else
    {
      A <- M
      NOTDONE <- FALSE
    }
  }
  M <- A + 1
  B <- A + 1
  if(M > 516) { M <- 516 }
  NOTDONE <- TRUE
  while(NOTDONE == TRUE)
  {
    if (data3b[M,i] > (-3))
    {
      M <- M + 1
      if(M > 516)
      {
        NOTDONE <- FALSE
        B <- M - 1
      }
    }
    else
    {
      B <- M - 1
      NOTDONE <- FALSE
    }
  }
  M <- B + 1
  A2 <- B + 1
  if(M > 516) { M <- 516 }
  # This finds section b
  NOTDONE <- TRUE
  while(NOTDONE == TRUE)
  {
    if (data3b[M,i] < (-3))
    {
      M <- M + 1
      if( M > 516)
      {
        NOTDONE <- FALSE
        A2 <- M
      }
    }
    else
    {
      A2 <- M
      NOTDONE <- FALSE
    }
  }
  M <- A2 + 1
  B2 <- A2 + 1
  if(M > 516) { M <- 516 }
  NOTDONE <- TRUE
  while(NOTDONE == TRUE)
  {
    if (data3b[M,i] > (-3))
    {
      M <- M + 1
      if( M > 516)
      {
        NOTDONE <- FALSE
        B2 <- M - 1
      }
    }
    else
    {
      B2 <- M - 1
      NOTDONE <- FALSE
    }
  }
  nap <- nap + 2
  # Both sections long enough (>= 10 months in this pass): combine the
  # forward and backward SARIMA forecasts, weighted by standard error
  if( (B-A) >= 10 && (B2-A2) >= 10)
  {
    section[1:(B-A+1), nap] <- data3bo[A:B, i]
    section[(B2-A2+1):1, (nap+1)] <- data3bo[A2:B2, i]
    data.sarima.a2 <- arima(section[1:(B-A+1), nap], order = c(1,1,1),
                            seasonal = list(order = c(1,0,0), period = 12),
                            include.mean = F)
    data.sarima.a2
    fmonths.a2 <- (B+1):(A2-1)
    predarima.a2 <- predict(data.sarima.a2, (A2-B-1))
    forecast.a2 <- predarima.a2$pred
    flimits.a2 <- 1.96 * round(predarima.a2$se, 3)
    data.sarima.b2 <- arima(section[1:(B2-A2+1), (nap+1)], order = c(1,1,1),
                            seasonal = list(order = c(1,0,0), period = 12),
                            include.mean = F)
    data.sarima.b2
    fmonths.b2 <- (A2-1):(B+1)
    predarima.b2 <- predict(data.sarima.b2, (A2-B-1))
    forecast.b2 <- predarima.b2$pred
    flimits.b2 <- 1.96 * round(predarima.b2$se, 3)
    pred2 <- array()
    for ( j in 1:length(fmonths.a2))
    {
      pred2[j] <- (predarima.b2$se[j] * forecast.a2[j] +
                   predarima.a2$se[j] * forecast.b2[j]) /
                  (predarima.b2$se[j] + predarima.a2$se[j])
    }
    for (j in 1:length(fmonths.a2))
    {
      #climits[516-B-j+1,1] <- pred2[j] - 1.96*(sqrt(2)) * predarima.a2$se[j] * predarima.b2$se[j] * (1/(predarima.a2$se[j]+predarima.b2$se[j]))
      #climits[516-B-j+1,2] <- pred2[j] + 1.96*(sqrt(2)) * predarima.a2$se[j] * predarima.b2$se[j] * (1/(predarima.a2$se[j]+predarima.b2$se[j]))
    }
    for( q in (B+1):(A2-1))
    {
      if ( data3b[q,i] < (-90))
      {
        data3b[q,i] <- pred2[(q-B)]
        data3bo[q,i] <- pred2[(q-B)]
      }
    }
  }
  # Only section a is long enough: use the forward forecast alone
  if ( (B-A) >= 10 && (B2-A2) < 10 && B < 516)
  {
    section[1:(B-A+1), nap] <- data3bo[A:B, i]
    data.sarima.a2 <- arima(section[1:(B-A+1), nap], order = c(1,1,1),
                            seasonal = list(order = c(1,0,0), period = 12),
                            include.mean = F)
    data.sarima.a2
    fmonths.a2 <- (B+1):(A2-1)
    predarima.a2 <- predict(data.sarima.a2, (A2-B-1))
    forecast.a2 <- predarima.a2$pred
    pred2 <- array()
    for ( j in 1:length(fmonths.a2))
    {
      pred2[j] <- forecast.a2[j]
    }
    for (j in 1:length(fmonths.a2))
    {
      #climits[516-B-j+1,1] <- pred2[j] - 1.96*predarima.a2$se[j]
      #climits[516-B-j+1,2] <- pred2[j] + 1.96*predarima.a2$se[j]
    }
    for( q in (B+1):(A2-1))
    {
      if ( data3b[q,i] < (-90))
      {
        data3b[q,i] <- pred2[(q-B)]
        data3bo[q,i] <- pred2[(q-B)]
      }
    }
  }
  # Only section b is long enough: use the backward forecast alone
  if( (B-A) < 10 && (B2-A2) >= 10)
  {
    section[(B2-A2+1):1, (nap+1)] <- data3bo[A2:B2, i]
    data.sarima.b2 <- arima(section[1:(B2-A2+1), (nap+1)], order = c(1,1,1),
                            seasonal = list(order = c(1,0,0), period = 12),
                            include.mean = F)
    data.sarima.b2
    fmonths.b2 <- (A2-1):(B+1)
    predarima.b2 <- predict(data.sarima.b2, (A2-B-1))
    forecast.b2 <- predarima.b2$pred
    flimits.b2 <- 1.96 * round(predarima.b2$se, 3)
    pred2 <- array()
    for ( j in 1:length(fmonths.b2))   # corrected: original looped over fmonths.a2,
    {                                  # which is not set in this branch
      pred2[j] <- forecast.b2[j]
    }
    for (j in 1:length(fmonths.b2))
    {
      #climits[516-B-j+1,1] <- pred2[j] - 1.96*predarima.b2$se[j]
      #climits[516-B-j+1,2] <- pred2[j] + 1.96*predarima.b2$se[j]
    }
    for( q in (B+1):(A2-1))
    {
      if ( data3b[q,i] < (-90))
      {
        data3b[q,i] <- pred2[(q-B)]
        data3bo[q,i] <- pred2[(q-B)]
      }
    }
  }
  nap <- nap + 2
  # Reverse the imputed series back to its original left-to-right order
  data3c[,i] <- data3b[,i]
  data3c[,i] <- data3c[516:1, i]
  data3[,i] <- data3c[,i]
  if(M > 515)
  {
    WORKING <- FALSE
  }