Developing Imputation Methods for Crime Data
Report to the American Statistical Association Committee on Law and Justice Statistics
By
Michael D. Maltz, Clint Roberts, and Elizabeth A. Stasny
The Ohio State University
September 29, 2006
I. INTRODUCTION
The Uniform Crime Reporting (UCR) System, collected and published by the Federal
Bureau of Investigation (FBI) since 1930, is one of the largest and oldest sources of social data.
It consists, in part, of monthly counts of crimes for over 18,000 police agencies throughout the
country. Although it is a voluntary reporting system – no agency is required to report its data to
the FBI1 – most agencies do comply and transmit their crime data to the FBI. That does not
mean, however, that the reporting is complete: for various reasons (Maltz, 1999, pp. 5-7)
agencies may miss reporting occasional months, strings of months, or even whole years.
As a consequence of this missing data, when the FBI made some major changes to the UCR
in 1958, they also decided to impute data to fill in the gaps so that the data set would be
comparable from year to year. The FBI used a simple imputation method. For example, if an
agency neglected, for whatever reason, to report data for half the year, the FBI would double the
agency’s crime counts so that there would not appear to be a dip in the agency’s chronological
crime trajectory. Although this imputation method has drawbacks – for example, in an agency
with a great deal of monthly or seasonal variation – it does have the benefit of filling in gaps
with a reasonable method. A more appropriate imputation method might be to base the
imputation method on the agency’s long-term history, but the computing power needed to do this
did not exist in the late 1950s and early 1960s, when the FBI method was implemented. Nor did
the FBI change the method afterwards, when computing became cheaper and easier, probably
preferring to maintain consistency in its reporting.
In recent years, however, the UCR has been used for making policy decisions. The Local
Law Enforcement Block Grant Program, funded by the US Congress in 1994, allocated funds to
1 Some states have required the reporting of crime data: see Maltz (1999, p. 45).
local jurisdictions based on the number of violent crimes they experienced in the three most
recent years. In addition, studies have used county-level UCR data to test policies despite the
inability of the data (as they now stand) to support such analyses (Lott, 1998; Lott & Mustard,
1997; Maltz & Targonski, 2002, 2003). Because of these uses of the data, and because of the
prospect of using a complete UCR data set to further study the effect of policies and programs
over the last few decades, the National Institute of Justice (NIJ) funded projects to clean the data
and the American Statistical Association (ASA) was given funding by the Bureau of Justice
Statistics (BJS) to support a study to develop imputation methods to apply to the cleaned UCR
data. This report describes the progress we have made on developing imputation methods for
missing UCR data.
The goal of this project is to develop and test imputation methods for longitudinal UCR data
and to develop variance estimates for the imputed data. Although it should be possible
theoretically to develop algorithms and apply them to every agency for every (apparently)
missing data point, certain factors militate against doing so.
First, not all “missing” data is truly missing. For example, agencies may merge with or
report through other agencies, or report a month’s data at other times to compensate for missed
months; Section II provides a description of crime data missingness.
Missingness in UCR data became an issue when the FBI revised its reporting policies and
procedures after a 1958 consultant report (Lejins et al., 1958) recommended new ways to
estimate and publish national crime trends. This formed the basis of the imputation methodology
used by the FBI to estimate crime, as described in Section III.
While we recognize the limitations of the current FBI imputation methods, we also find that
a “one size fits all” approach cannot be used for all truly missing data. For agencies that have
high crime counts (such that a zero count for a month is very unlikely), we found that a
SARIMA (Seasonal AutoRegressive Integrated Moving Average) model was most useful. The
assumption of normal error terms required for such a model was also more reasonable when the
number of crimes was large. When agencies had low crime counts, we chose models appropriate
for discrete data. For agencies with very low crime counts, we imputed the mean value averaged
over all available data and assumed a Poisson distribution of crime counts. For agencies with
intermediate crime counts, we used a Poisson regression model. Section IV describes the
analyses we conducted that led to this approach.
One of the more important results of our initial research – which is possible because of our
model-based approach – is the development of variance estimates for the imputed data; measures
of uncertainty in imputed values are not available for the FBI-imputed data. Since the unit of
analysis we used in imputing data was the crime-month, we had to combine variances from the
individual crime series to estimate the variance for the UCR Crime Index (the sum of seven
crimes) or the Violent (four crimes) or Property Crime Index (three crimes). Section VI deals
with that aspect of the research.
II. CRIME DATA MISSINGNESS

One of the difficulties encountered in examining missingness in the UCR data is determining,
at the outset, whether a datum is missing (the data file does not use a specific
symbol to indicate when a datum is missing)2 or is merely zero. We have used exploratory data
analysis (EDA) techniques, as proposed by Tukey (1977), to help in that determination.
2 This is not always the case; in many of the cases of interest, there are indicators in the full data set of when a datum is missing (K. Candell, personal communication, April 17, 2006), but they were not available in the data set provided to us by the NACJD.
When a zero occurs in the data set, it may be that no crimes of that type were committed that
month. There are, however, other possible explanations. A zero might also mean that:
(1) the agency had not (yet) begun reporting data to the FBI because it did not exist (or did not have a crime-reporting unit) at that time;
(2) the agency ceased to exist (it may have merged with another agency);
(3) the agency existed, but reported its crime and arrest data through another agency (i.e., was “covered by” that agency);
(4) the agency existed and reported data for that month, but the data were aggregated so that, instead of reporting on a monthly basis, it reported on a quarterly, semiannual, or annual basis;
(5) the agency existed and reported monthly data in general, but missed reporting for one month and compensated for the omission by reporting, in the next month, aggregate data for both months; or
(6) the agency did not submit data for that month (a true missing datum).
Our goal has been to distinguish among these different types of missingness. This section
describes the characteristics of the data that are truly missing. In particular, we explore the
length of runs of missingness and how they vary by year and by size and type of agency.
Obviously, there are a number of other variables that might be investigated: state, population,
county urbanicity (Goodall et al., 1998), and crime rate. Our exploration of these candidate
variables did not suggest that they would be easily useful; this was true a fortiori for
sociodemographic variables (e.g., percent minority, percent aged 16-25). Therefore, this initial
analysis will focus only on year, state, and size and type of agency.
The typology of agencies we used is the one used by the FBI, and described in Table 1. As
can be seen, the typology is fairly simple, and agencies change their group designation as their
population changes.
Table 1. FBI Classification of Population Groups

  Population Group            Political Label   Population Range
  1                           City              250,000 and over
  2                           City              100,000 to 249,999
  3                           City              50,000 to 99,999
  4                           City              25,000 to 49,999
  5                           City              10,000 to 24,999
  6                           City (a)          Less than 10,000
  8 (Nonmetropolitan County)  County (b)        N/A
  9 (Metropolitan County)     County (b)        N/A

Note: Group 7, missing from this table, consists of cities with populations under 2,500 and universities and colleges to which no population is attributed. For compilation of CIUS, Group 7 is included in Group 6.
(a) Includes universities and colleges to which no population is attributed.
(b) Includes state police to which no population is attributed.
Missingness Run Lengths
Figure 1 shows the overall pattern of missingness for all states and all years. The horizontal axis
is scaled logarithmically to highlight the shorter runs. As can be seen, runs of length 1 are the
most common of the over 44,000 missingness runs, and 70 percent of runs are 10 months or less.
The second panel in Figure 1 uses a logarithmic scale for both axes to give a better indication
of the run-length distribution and patterning; the first two peaks are at 1 and 5, but the
remaining peaks are at multiples of 12 – i.e., they represent full years of missingness.3
3 An agency that reports semiannually would have missingness run lengths of 5. A check of such runs was made to see if that was the reason for the peak at 5; it turned out that this was not the case, since very few of these runs were from January-May or July-November.
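The run lengths summarized above can be tabulated directly from an agency's monthly series with base R's rle(). The sketch below is ours, for illustration only; it assumes NA marks a month already classified as truly missing, and the function name and toy series are not the project's code.

```r
# Sketch: tabulating missingness run lengths for one agency's monthly series.
# Assumes NA marks a month already classified as truly missing; the function
# name and toy data are ours, for illustration only.
gap_lengths <- function(series) {
  r <- rle(is.na(series))   # runs of missing (TRUE) and observed (FALSE) months
  r$lengths[r$values]       # keep the lengths of the missing runs only
}

x <- c(3, 5, NA, NA, NA, 4, NA, 2, 2, NA, NA)
gap_lengths(x)  # 3 1 2
```

Pooling these vectors over all agencies and months yields the run-length distributions plotted in Figures 1 and 2.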
[Figure: two panels plotting Number of cases against Gap size (mos.)]
Figure 1. Number of cases of different gap sizes from missing data, depicted on standard (upper panel) and logarithmic (lower panel) scales. The logarithmic plot clearly shows the periodicity of missingness patterns: most of the run length peaks occur at multiples of 12, indicating whole years of missing data.
Missingness in Different FBI Groups
As shown in Table 1, Groups 1-5 represent cities of different sizes, but Groups 6-9 represent
both cities and other types of jurisdictions, including university, county, and state police
agencies, as well as fish and game police, park police, and other public law enforcement
organizations. Most (but not all) of these “other types” of jurisdictions are called “zero-
population” agencies by the FBI, because no population is attributed to them. This is because, for
example, counting the population policed by a university police department or by the state police
would be tantamount to counting that population twice. Thus, the crime count for these agencies
is merely added to the crime count for the other agencies to get the total crime count: for the city,
in the case of the university police, and for the state, in the case of the state police.4
Figure 2 depicts the missingness trends for these different Groups. As can be seen, Groups 5-
9 have the most missingness, concentrated in long run lengths. It stands to reason that the most
populous agencies (Groups 1-4) have the least missingness. First, agencies that have more crime
probably have stronger statistical capability; and second, if agencies with populations of 100,000
or more are missing reports, they are contacted by FBI personnel and urged to complete their
reports. This second practice, of course, makes an assumption of data being missing completely
at random (MCAR) implausible.
4 Of course, if a university has branches in different cities, each city records the crime count for its respective branch; and if there is a separate state police barracks in each county, then the crime count is allocated by county.
[Figure: nine panels, one per FBI Group (1–9), plotting Number of cases against Gap size (mos.) on logarithmic scales]
Figure 2. Number of cases of different gap sizes from missing data for different FBI Groups, plotted on logarithmic scales. The agencies with higher populations have the least missingness.
III. PREVIOUS IMPUTATION METHODS

In 1958 a consultant committee to the FBI recommended that a simple imputation method be
used to account for missing data (Lejins et al., 1958, p. 46): “The number of reported offenses
should then be proportionately increased to take care of the unreported portions, if any, of these
same [Index] categories within each state.” The FBI implemented this imputation approach
using two different methods, both based on the current year’s reporting. If three or more
months’ crimes were reported during the year, the estimated crime count for the year would be
12C/M, where M is the number of months reported and C is the total number of crimes reported
during the M months (Table 2). If fewer than three months were reported in that year, the
estimated crime count is based on “similar” agencies in the same state. A “similar” agency is
one that meets two selection criteria: it must be in the same state and the same Group (as per
Table 1), and it must have reported data for all 12 months. The crime rate for these agencies is
then computed (the total crime count is divided by the total population), and this rate is
multiplied by the population of the agency needing imputation, to estimate its annual crime
count.
Table 2. FBI Imputation Procedure

  Number of months reported, M, by Agency A    Estimated annual crime count
  0 to 2                                       CS × PA / PS
  3 to 11                                      12 CA / M

CA, PA: the agency’s crime count and population for the year in question.
CS, PS: the crime and population count of “similar” agencies in the state, for the year in question.
As this imputation method makes clear, the FBI’s primary concern has been the annual
crime count. This imputation technique has been used since the early 1960s, and ignores
concerns of seasonality. (The FBI does conduct analyses of crime seasonality, but not of
individual agencies, and it only uses agencies that have reported all 12 months in its seasonality
analyses.)
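The two-branch rule in Table 2 amounts to the following calculation. The R function below is our own sketch of that rule (the function and argument names are ours, not the FBI's):

```r
# Sketch of the FBI annual imputation rule in Table 2; the function and
# argument names are ours, not the FBI's.
fbi_annual_estimate <- function(monthly_counts, pop_agency = NULL,
                                crimes_similar = NULL, pop_similar = NULL) {
  M <- sum(!is.na(monthly_counts))        # months reported, M
  C <- sum(monthly_counts, na.rm = TRUE)  # crimes reported in those months, C
  if (M >= 3) {
    12 * C / M    # 3 to 11 months: scale the partial year up to 12 months
  } else {
    # 0 to 2 months: apply the crime rate of "similar" agencies (same state,
    # same population group, all 12 months reported) to this agency's population
    (crimes_similar / pop_similar) * pop_agency
  }
}

# An agency reporting 50 crimes over 6 months is credited with 100 for the year
fbi_annual_estimate(c(10, 5, 10, 5, 10, 10, rep(NA, 6)))  # 100
```

Note how the first branch simply doubles a half-year of reports, the behavior described in the Introduction.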
IV. INITIAL APPROACHES

Our approach has been to impute for each month with missing data and to use as much
available data as possible in our estimates, which means that we often use data of more than a
year’s duration.
Many of the standard approaches to imputing missing data, taken from the survey sampling
literature, do not apply for the UCR data. First, there is surprisingly little research on missing
data methods for time series data in the literature. For example, Little and Rubin (2002) include
only two pages on the topic. Furthermore, the UCR time series vary considerably across
agencies: some agencies have a great deal of seasonal variation in crime while others have none,
some are in cities with high population growth while others are in cities that have experienced
population declines. More importantly from our standpoint, some have high monthly crime
counts (so that the probability of a zero count is vanishingly small) while others have sparse
crime counts. As mentioned earlier, because of the high degree of variation in crime from
jurisdiction to jurisdiction (as well as from crime to crime within a jurisdiction), we employed
different methods for different situations.
An additional complication is that some small agencies may have reported data only in
months in which they had data to report. That is, if an agency reported one crime in February
and one crime in December, and provided no crime reports for any of the other months, a
standard imputation might conclude that the agency averages one crime per month – and
therefore estimate that about 12 crimes occurred in that jurisdiction for that year. Only by
looking at the agency’s entire trajectory can one determine the extent to which this assumption is
valid – in most cases this would not be a valid assumption. Graphical methods, therefore, were
an important aspect of our analysis.
Our exploration of the data began with the Columbus NDX (i.e., Index crime) series. This
time series has large crime counts and no missing observations. Not only does the series exhibit
an increasing trend and seasonality, but the variance and seasonal effect also change over time.
Model selection started with decomposing the data into a stationary series. A time series is
stationary if its mean function and covariance function are independent of time. The time series
for agencies reporting in the UCR data are not generally stationary, mainly because of population
changes over time.5 We must obtain stationary residuals to specify an ARIMA model. One way
to get stationary residuals is to estimate the trend component and the seasonal component of the
series and subtract them from the data via the classical decomposition algorithm. Less tedious,
but still achieving the desired result, is to apply differencing to the data. Lag-1 differencing will
eliminate trend, but we will still need to adjust for seasonality later. To better ensure stationarity,
we make a log10 transformation of the data. We initially tried complicated models with many
terms in them. Ultimately, however, we chose much simpler models. The autocorrelation
function (ACF) and the partial ACF for a number of time series following this data preparation
suggested that we should use seasonal ARIMA models with parameters (1,1,1)×(1,0,1), where
the first triple of parameters refer to the AR, differencing, and MA terms in the model and the
second triple is the same set of terms but for the seasonality component of the model. This
model works well to predict observations based on the pattern of the data over time.
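As a sketch of the preparation and model just described, base R's arima() can fit a (1,1,1)×(1,0,1) seasonal model to a log10-transformed monthly series. The series below is simulated for illustration; it stands in for an actual agency's counts.

```r
# Sketch: fit the seasonal ARIMA (1,1,1)x(1,0,1) model described above to a
# log10-transformed monthly series. The data are simulated for illustration.
set.seed(1)
t <- 1:240
y <- 1000 + 2 * t + 80 * sin(2 * pi * t / 12) + rnorm(240, sd = 30)  # trend + seasonality
logy <- log10(ts(y, frequency = 12))

fit <- arima(logy,
             order = c(1, 1, 1),                                # AR, differencing, MA
             seasonal = list(order = c(1, 0, 1), period = 12))  # seasonal AR, D, MA

# One year of forecasts, transformed back to the count scale
counts <- 10^predict(fit, n.ahead = 12)$pred
```

The lag-1 differencing is handled by the middle term of `order`, so the raw series need not be differenced by hand before fitting.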
We illustrate our analyses with the Columbus NDX series. Figure 3A shows the raw data
plotted as a time series, Figure 3B shows this same series after correcting for the mean and
taking first differences, and Figure 3C shows the log transformation of the same data series after
mean correction and differencing.
5 Our imputation strategy was to impute crime counts, not crime rates. The crime rates might be stationary, but considering the natural variation in crime over the past few decades, this is highly unlikely.
[Figure: three time-series panels of NDX counts plotted against month number for Columbus, OH]
A. Columbus NDX series – raw data
B. Mean-corrected differenced Columbus NDX series
C. Mean-corrected differenced Columbus log NDX series
Figure 3. Columbus NDX time series data
One of the first questions we set out to answer concerned how we should use the NDX series,
the sum of the seven index crimes. Possible approaches are to make imputations for each of the
seven category crimes separately and then sum them to get an estimate for the NDX, or we could
make our imputation on the NDX series and then allocate this estimate to the seven crime series.
Using the Columbus and Cleveland time series, we checked whether the two methods produced
different prediction estimates for NDX. We found that the predictions from the two methods are
relatively close. Plots in Figure 4 show sample comparisons of these predictions.
Based on comparisons such as these, we decided to make our imputations for each of the seven
index crimes and use the sum as an estimate for NDX.6 Advantages of this method are that it
mimics the way the NDX is created, as the sum of the seven crime counts, and it avoids the
question of how an estimate of the composite NDX should be allocated to the seven individual
crimes.
For the most part, our goal was to develop interpolation rather than extrapolation models,
since most of the missing data were within the time series rather than at the beginning or ends of
series. Abraham (1981) takes a simple approach that combines forecasting and backcasting
estimates from time-series models, yielding estimates with lower standard errors than a
procedure based only on prior (or subsequent) data points.
To test our proposed models we simulated missingness by eliminating known data points
from the time series and then estimated the points we had deleted using various models. We used
both mean absolute and mean squared differences to assess the overall accuracy of our models’
predictions.
6 The same holds true for the Violent Crime Index (murder + rape + robbery + aggravated assault) and the Property Crime Index (burglary + larceny + vehicle theft).
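The simulation check described above reduces to comparing deleted (known) values with their imputations. A minimal sketch, with a function name and numbers of our own invention:

```r
# Sketch: after deleting known points and imputing them, score the model by
# mean absolute and mean squared difference (function name and numbers ours).
score_imputation <- function(actual, imputed) {
  c(mad = mean(abs(actual - imputed)),   # mean absolute difference
    mse = mean((actual - imputed)^2))    # mean squared difference
}

actual  <- c(120, 135, 128, 140)   # the deleted (known) values
imputed <- c(118, 130, 131, 138)   # the model's imputations
score_imputation(actual, imputed)  # mad = 3, mse = 10.5
```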
[Figure: upper panel, “Seven crime total versus Index,” a scatter plot of absolute relative prediction errors for predicting the index directly versus summing the individual crime predictions; lower panel, “Comparing predictions to actual values,” plotting the totaled prediction, the index prediction, and the true value by forecast month]
Figure 4. The top plot is for Columbus, OH; the bottom plot is for Cleveland, OH. Both show that the prediction based on the sum of seven estimates and the prediction based on the NDX series are very close to each other.
As noted earlier, we found that one size does not fit all. We decided to use three different
imputation methods, depending on the statistics of the agency and crime in question. Based on
our graphical analyses of agency crime trajectories, we chose mean monthly counts of 1 and 35
as our break points. For example, in Figure 5, Columbiana, OH has a mean monthly count of 28
and the SARIMA method of imputation in months 476-492 produced fairly constant estimates.
Galion, OH has a mean monthly count of 37 and the SARIMA method of imputation is seen to
exhibit the seasonality and the trend of the data preceding the missing months. Also, UCR
time series often have much smaller counts in the earlier years than in the later years, so it is
better to be conservative in employing the SARIMA method, avoiding it when data are
missing at the beginning of the series. We have seen throughout the course
of our research that average counts of 1 and 35 are reasonable break points for our three
imputation methods.
When an agency’s crime counts are included in another agency’s reports, the other agency is
said to be “covering” the initial agency, and the initial agency is said to be “covered by” the
other agency. Since such covered-by crimes have been counted and reported, we do not need to
impute them for the covered agency. Still, we cannot treat agencies with covered-by months or
covering months the same as other agencies. If a covering agency has periods of covering
surrounded by periods of non-covering, then we might have a problem when we try to model the
series. The months that are covering another agency will have a different model than the months
that are not covering. For agencies with covered-by months, making the usual imputations of
the missing values will not be a problem as long as the agency has a sufficient amount of
observed data. For each of the covering agencies, we would need to look at plots to see how
appropriate our regular model would be. Since the number of agencies affected by coverings is
large, it is not feasible to look at the plots of each one. We assume that most covering agencies
report large counts and most covered-by agencies contribute small counts, and therefore making
adjustments in the model for covering months will not significantly improve our predictions.
[Figure: two NDX time-series panels with imputed values marked]
A. Imputation based on SARIMA for Columbiana NDX series with average crime count of 28
B. Imputation based on SARIMA for Galion NDX series with average crime count of 37
Figure 5. Plots showing imputations from time series models for two NDX series with different levels of crime.
In addition, in many cases an agency did not provide monthly data for some years, but
instead sent in an annual (or semiannual or quarterly) count of crimes. In those cases we imputed
the data for the entire time period, as if we had no information for that period, and then adjusted
all of the imputed values so that they would sum to the (known) total.
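Adjusting the imputed months to a known annual (or semiannual or quarterly) total is a simple proportional rescaling; sketched below with a hypothetical function name and numbers:

```r
# Sketch: impute all months of an aggregated period as if unobserved, then
# rescale so the imputations sum to the known period total (names ours).
rescale_to_total <- function(imputed_months, known_total) {
  imputed_months * known_total / sum(imputed_months)
}

m   <- c(8, 10, 12, 14, 12, 10, 9, 11, 13, 12, 10, 9)  # model-based imputations
adj <- rescale_to_total(m, known_total = 156)  # agency's reported annual total
sum(adj)  # 156
```

The rescaling preserves the seasonal shape of the model-based imputations while honoring the reported total.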
All imputation algorithms were written in R, a statistical software language that has strong
graphical capabilities. Since we relied to a great extent on “eyeballing” to determine whether a
particular imputation algorithm seemed reasonable, this allowed us to inspect the results of
different imputation procedures. The R code for our procedures is provided in the Appendix.
V. OUR IMPUTATION PROCEDURE

We will now describe the general structure of the algorithms used to make imputations in our
data. After removing the outliers from the series, the program makes a search for the longest
complete segment in the series. Using this segment as a starting position, the program moves
outward to the ends of the series making imputations for each missing period. At each step,
imputations are based on forecasts and backcasts, provided that enough data is available
surrounding the missing period. Finally, the outliers are included back into the data. (There are
many outliers in the data that are suspected of being processing errors, and should be dealt with
outside of this project.)
Crime Count Less Than 1 per Month
When the average crime count was under 1 crime per month, we assumed that the data had a
Poisson distribution and, since with such sparse data seasonality is difficult or impossible to
identify, we estimated the crime count and the variance to be the mean monthly count. The
mean count was calculated by dividing the total count for the agency by the total number of
months the agency provided reports. An improvement over this method, which would attempt to
account for any time trends, might be to calculate monthly estimates using the average over a
window of months surrounding the missing month. Intuitively, this strategy makes sense, but the
actual gain may be insignificant, as we would be making a bias/variance trade-off. (The strategy
would have less stable estimates with larger variances.) In addition, the question of how many
months to include in the window would almost certainly require different answers for different
data series.
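A minimal sketch of this mean imputation for sparse series follows (the function name is ours); under the Poisson assumption, the mean monthly count serves as both the imputed value and its variance estimate:

```r
# Sketch of mean imputation for very sparse series (function name ours).
# Under the Poisson assumption the mean monthly count is both the imputed
# value and its variance estimate.
sparse_impute <- function(series) {
  m <- mean(series, na.rm = TRUE)  # total count / number of reported months
  list(estimate = m,
       variance = m,                         # Poisson: variance = mean
       upper95  = qpois(0.975, lambda = m))  # upper bound on a month's count
}

murders <- c(0, 1, 0, 0, NA, 0, 2, 0, 0, NA, 1, 0)  # NA = missing months
sparse_impute(murders)$estimate  # 0.4
```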
Figure 6 shows results of our mean imputation for East Liverpool, Ohio.
[Figure: East Liverpool, OH murder (MUR) series with imputed point estimates and confidence bounds marked]
Figure 6. Imputation based on the mean of East Liverpool murder series. Point estimates and upper 95% confidence bounds are plotted.
Crime Count Between 1 and 35 per Month
For agencies with between 1 and 35 crimes per month on average, we used a generalized
linear model, assuming Poisson error terms, that based its estimates when possible on the
trajectories of “similar” agencies. Predictors in the Poisson regression model for estimating a
month’s crime count included the previous month’s counts, a seasonal effect, and, when
possible, the crime count from a highly correlated “donor” series. One of the questions we dealt
with was how to select candidate donor agencies for the imputation algorithm. Our definition of
“similar” differed from the FBI’s: we chose a single “similar” agency from among a number of
candidates with the least missing data (and with no missingness in the imputation interval in
question), namely the one whose time series for the crime in question was most highly
correlated with the imputed agency’s time series. When possible we used data on both sides of
the missing period and averaged two predictions based on different estimated models.
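A sketch of such a Poisson regression, with lag-1 counts, a month (seasonal) factor, and a donor series as predictors, follows. The data are simulated and all variable names are ours, for illustration only:

```r
# Sketch of the Poisson GLM imputation: previous month's count, a seasonal
# (month) effect, and a correlated donor series as predictors. The data are
# simulated and all names are ours, for illustration only.
set.seed(2)
month  <- factor(rep(1:12, 10))                                    # seasonal effect
donor  <- rpois(120, lambda = 20 + 5 * sin(2 * pi * (1:120) / 12)) # donor agency
target <- rpois(120, lambda = 0.6 * donor + 3)                     # agency to impute
prev   <- c(NA, target[-120])                                      # lag-1 counts

fit <- glm(target ~ prev + month + donor, family = poisson)

# Impute one month from its predictors
newpt <- data.frame(prev = 15, month = factor(7, levels = levels(month)), donor = 22)
imputed <- predict(fit, newdata = newpt, type = "response")
```

In our procedure, the forecast fit (as above) and a corresponding backcast fit would then be averaged, as illustrated in Figures 8 and 9.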
[Figure: two burglary (BUR) time-series panels]
A. Saraland burglary series
B. Anniston burglary series
Figure 7. Time series plots of burglaries in two Alabama cities.
Figures 7-9 illustrate imputation results for burglary in Saraland, Alabama obtained using the
Poisson generalized linear model (GLM). Anniston, Alabama’s series of crime counts was used
as a donor series because both cities are in Alabama, both have populations between 10,000 and
30,000, and the correlation between the series was 0.756. The similarity in the series is shown in
the plots in Figure 7.
[Figure: three Saraland, AL burglary (BUR) panels with imputed values marked]
A. Forecasting based on GLM used to fill in gap in Saraland burglary series
B. Backcasting based on GLM used to fill in gap in Saraland burglary series
C. Combination of forecasting and backcasting used to fill in gap in Saraland burglary series
Figure 8. Poisson regression model imputation for burglary series in Saraland, AL.
[Figure: three Saraland, AL burglary (BUR) panels with imputed values marked]
A. Forecasting using Anniston information to fill in gap in Saraland burglary series
B. Backcasting using Anniston information to fill in gap in Saraland burglary series
C. Combination of forecasting and backcasting using Anniston information to fill in gap in Saraland burglary series
Figure 9. Poisson regression model imputation for burglary series in Saraland, AL using Anniston, AL as a donor series.
Figures 8 and 9 show examples of imputation for missing data when forecasting,
backcasting, and both are used to obtain the imputed values. The results in Figure 8 are
obtained using only Saraland’s crime series data. These may be compared to similar
imputations in Figure 9 obtained when Anniston’s data are used in the Poisson regression model
for prediction.
Crime Count Over 35 per Month
For a crime count averaging above 35 crimes per month, we used estimates from SARIMA
models with parameters (1,1,1)×(1,0,1). We first found the longest sequence of complete data
(consecutive months with no missing data). Starting with that sequence, we then imputed the
missing data to the right of that using the SARIMA model. If a sufficiently large sequence of
data7 existed to the right of the gap, we then went in the opposite direction and imputed the same
missing data sequence from the right. This algorithm provided us with two imputation sequences
for the gap, one based on the data preceding the period of missing data and one based on the
data following that period. We then averaged the two for our final estimate, reducing the
variance relative to that of a single estimate.
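The forecast/backcast averaging can be sketched as follows; the backcast is obtained by forecasting the time-reversed later segment and reading the result back in reverse. The data and function names are ours, and the fitted orders follow the (1,1,1)×(1,0,1) specification above.

```r
# Sketch: fill a gap by averaging a SARIMA forecast from the segment before
# the gap with a backcast from the segment after it. A backcast is a forecast
# of the time-reversed later segment, read back in reverse order.
sarima_fc <- function(x, h) {
  fit <- arima(ts(x, frequency = 12), order = c(1, 1, 1),
               seasonal = list(order = c(1, 0, 1), period = 12))
  as.numeric(predict(fit, n.ahead = h)$pred)
}

fill_gap <- function(left, right, gap_len) {
  f <- sarima_fc(left, gap_len)             # forecast from the left segment
  b <- rev(sarima_fc(rev(right), gap_len))  # backcast from the right segment
  (f + b) / 2                               # average the two estimates
}

set.seed(3)
y <- 500 + 40 * sin(2 * pi * (1:200) / 12) + rnorm(200, sd = 10)
imputed <- fill_gap(left = y[1:100], right = y[113:200], gap_len = 12)
```

Note that reversing a monthly series preserves its period-12 seasonality, which is what lets the same model form be used for both directions.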
Figure 10 shows results for the NDX series for Columbus, OH obtained using a SARIMA
time-series model as described above. The original series with a portion of the data deleted is
shown in Figure 10A. Examples of imputation for the missing data when forecasting,
backcasting, and both are used to obtain the imputed values are shown in Plots B, C, and D of
Figure 10, respectively.
7 “Sufficiently large” was defined as 25 consecutive months. Since one month is discarded, this meant that we had at least two data points for each month in the sequence.
[Figure: four Columbus, OH log(NDX) panels with imputed values and confidence bounds marked]
A. Columbus log(NDX) series with section to be imputed removed
B. Forecasting using SARIMA model to fill in the gap in the Columbus log(NDX) series
C. Backcasting using SARIMA model to fill in the gap in the Columbus log(NDX) series
D. Linear combination of forecasting and backcasting using SARIMA model to fill in gap in the Columbus log(NDX) series
Figure 10. SARIMA model imputation for NDX series in Columbus, OH.
VI. VARIANCE ESTIMATES

We used standard techniques to estimate the variance in the imputed data. When the average
crime count was under 1 crime per month, we assumed that the data followed a Poisson
distribution, so the mean monthly count served as both the imputed value and the variance
estimate. Figure 6 shows the upper confidence bound obtained using these variance estimates.
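For the under-1-crime-per-month case the whole method fits in a few lines. The sketch below is in Python rather than the report's R, and the monthly counts are made up for illustration:

```python
import math

# Observed monthly counts for a low-count agency (illustrative values);
# the reported mean is both the imputed value and the Poisson variance.
observed = [0, 1, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0]
m = len(observed)            # number of months actually reported
avg = sum(observed) / m      # imputed value for each missing month

# 95% bounds on the mean, using var(avg) = avg / m under the Poisson model
half_width = 1.96 * math.sqrt(avg / m)
lower = max(0.0, avg - half_width)   # a count cannot be negative
upper = avg + half_width
```

This mirrors the confidence-limit lines in the appendix's mean-imputation program, where the lower limit is likewise truncated at zero.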
When the average crime count for an agency was between 1 and 35 crimes per month, we
used a Poisson GLM and obtained variance estimates for the coefficients in the model. When
there are no consecutive missing months, the prediction variance can be computed easily in
R using its GLM capabilities. When there are consecutive missing months, however, we run into
the problem of making predictions from models estimated on data that include imputed values.
Our current variance estimates therefore tend to be somewhat too small, because they ignore the
variability carried over from all previous imputations. Deriving the correct variance is difficult,
but Bayesian tools8, which take a different approach to the problem, might be useful in handling
this difficulty.
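As a rough illustration of where a GLM prediction variance comes from, the sketch below fits a Poisson GLM with a log link by iteratively reweighted least squares and applies the delta method to obtain a standard error on the count scale. It is a simplified Python stand-in for R's GLM machinery; the simulated seasonal series, sample size, and seed are all invented for the example.

```python
import numpy as np

# Simulated 60-month series with a sinusoidal seasonal effect.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(60), np.sin(2 * np.pi * np.arange(60) / 12)])
y = rng.poisson(np.exp(1.0 + 0.3 * X[:, 1]))

# Fit the Poisson GLM (log link) by iteratively reweighted least squares.
beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ beta)
    W = np.diag(mu)
    z = X @ beta + (y - mu) / mu          # working response
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)

# Coefficient covariance, then the delta method for a new month: the
# standard error of x'beta maps to mu * se_eta on the count scale.
cov_beta = np.linalg.inv(X.T @ np.diag(np.exp(X @ beta)) @ X)
x_new = np.array([1.0, np.sin(2 * np.pi * 60 / 12)])
se_eta = np.sqrt(x_new @ cov_beta @ x_new)
mu_hat = np.exp(x_new @ beta)
pred_se = mu_hat * se_eta
```

This standard error is conditional on the regressors being fully observed; when earlier imputed values enter the estimation data, as described above, it understates the true uncertainty.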
For an average reported crime count above 35 crimes per month, where we used SARIMA
models, R provides variance estimates for the forecasts that increase as the forecasts move away
from the data. Confidence bounds based on these variance estimates for the imputed values are
shown in Figure 10. Note that the bounds widen noticeably as we move away from the data
used for the forecasts (Figure 10B) or backcasts (Figure 10C), but are more stable for the
estimates from the linear combination of the forecasts and backcasts (Figure 10D).
For the NDX series, each imputation is the sum of the imputations from the seven index
crime series. Because these seven crime series are not independent, we used Taylor series
expansions to derive variance estimates for the NDX imputations. The same technique is used to
calculate variance estimates for the imputed counts in the Violent Crime Index and the Property
Crime Index.
8 Developing Bayesian methods for imputation of the UCR data is an area of future research for our group.
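For a sum, the Taylor (delta-method) calculation reduces to pre- and post-multiplying the covariance matrix of the component imputations by a vector of ones, which picks up all the covariance terms that an independence assumption would ignore. The sketch below uses a made-up 3×3 covariance matrix for brevity; the report's case has seven series:

```python
import numpy as np

# Illustrative covariance matrix of three crime-series imputations
# (the off-diagonal terms capture their dependence).
sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.8],
                  [0.5, 0.8, 5.0]])

grad = np.ones(3)                     # gradient of g(x) = sum(x)
var_sum = grad @ sigma @ grad         # variance of the summed imputation
var_if_independent = np.trace(sigma)  # what ignoring correlation gives
```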
VII. CONCLUSION

This report has described our research in developing algorithms to impute missing UCR data,
and it includes the R code for applying them to the data. Going from this step to the next one –
actual imputation of the data – is a straightforward but nontrivial task. As we mentioned above,
dealing with covered-by situations may be complicated. Developing Bayesian methods and
improving our variance estimates based on our imputation models remain for future research. In
the end, however, these activities will provide the criminal justice research and policy
communities with some 18,000 agency-level time series for the seven Index crimes, a relatively
small price to pay for what may turn out to be a large payoff.
REFERENCES
Abraham, B. (1981). "Missing observations in time series." Communications in Statistics, A, 10, 1643–1653.
Goodall, C. R., Kafadar, K., and Tukey, J. W. (1998). "Computing and using rural versus urban measures in statistical applications." American Statistician, 52, 101–111.
Lejins, P. P., Chute, C. F., and Schrotel, S. F. (1958). Uniform Crime Reporting: Report of the Consultant Committee. Unpublished report to the FBI; available from the second author.
Little, R. J. A., and Rubin, D. B. (2002). Statistical Analysis with Missing Data. New York: Wiley.
Lott, J. R., Jr. (1998). More Guns, Less Crime: Understanding Crime and Gun Control Laws. Chicago, IL: University of Chicago Press.
Lott, J. R., Jr., and Mustard, D. B. (1997). "Crime, deterrence, and right-to-carry concealed handguns." Journal of Legal Studies, 26(1), 1–68.
Maltz, M. D. (1999). Bridging Gaps in Police Crime Data. NCJ 176365. Washington, D.C.: Bureau of Justice Statistics.
Maltz, M. D., and Targonski, J. (2002). "A note on the use of county-level crime data." Journal of Quantitative Criminology, 18(3), 297–318.
Maltz, M. D., and Targonski, J. (2003). "Measurement and other errors in county-level UCR data: A reply to Lott and Whitley." Journal of Quantitative Criminology, 19(2), 199–206.
Tukey, J. W. (1977). Exploratory Data Analysis. Reading, MA: Addison-Wesley.
APPENDIX: COMPUTER CODE
####################################################################
##                                                                ##
##  Author: Clint Roberts                                         ##
##  Title:  Data Channel                                          ##
##                                                                ##
##  Description: This program puts the data from Excel into R     ##
##                                                                ##
####################################################################

library(RODBC)

## One ODBC connection per state workbook.  Note that each assignment to
## `n` (the number of agencies in the state) overwrites the previous one,
## so leave in effect the value matching the channel used in the
## sqlFetch calls below.
channel  <- odbcConnectExcel("C:\\Documents and Settings\\Clint Roberts\\Desktop\\RA work\\AL.xls")
n <- 470
channel2 <- odbcConnectExcel("C:\\Documents and Settings\\Clint Roberts\\Desktop\\RA work\\AR.xls")
n <- 250
channel3 <- odbcConnectExcel("C:\\Documents and Settings\\Clint Roberts\\Desktop\\RA work\\IA.xls")
n <- 265
channel4 <- odbcConnectExcel("C:\\Documents and Settings\\Clint Roberts\\Desktop\\RA work\\OH.xls")
n <- 720

## Each crime series is split across three worksheets (e.g. NDX1-NDX3).
NDX1 <- sqlFetch(channel, "NDX1", max=n); NDX2 <- sqlFetch(channel, "NDX2", max=n); NDX3 <- sqlFetch(channel, "NDX3", max=n)
MUR1 <- sqlFetch(channel, "MUR1", max=n); MUR2 <- sqlFetch(channel, "MUR2", max=n); MUR3 <- sqlFetch(channel, "MUR3", max=n)
RPF1 <- sqlFetch(channel, "RPF1", max=n); RPF2 <- sqlFetch(channel, "RPF2", max=n); RPF3 <- sqlFetch(channel, "RPF3", max=n)
RBT1 <- sqlFetch(channel, "RBT1", max=n); RBT2 <- sqlFetch(channel, "RBT2", max=n); RBT3 <- sqlFetch(channel, "RBT3", max=n)
AGA1 <- sqlFetch(channel, "AGA1", max=n); AGA2 <- sqlFetch(channel, "AGA2", max=n); AGA3 <- sqlFetch(channel, "AGA3", max=n)
BUR1 <- sqlFetch(channel, "BUR1", max=n); BUR2 <- sqlFetch(channel, "BUR2", max=n); BUR3 <- sqlFetch(channel, "BUR3", max=n)
LAR1 <- sqlFetch(channel, "LAR1", max=n); LAR2 <- sqlFetch(channel, "LAR2", max=n); LAR3 <- sqlFetch(channel, "LAR3", max=n)
VTT1 <- sqlFetch(channel, "VTT1", max=n); VTT2 <- sqlFetch(channel, "VTT2", max=n); VTT3 <- sqlFetch(channel, "VTT3", max=n)
Group      <- sqlFetch(channel, "Group",      max=n)
Population <- sqlFetch(channel, "Population", max=n)
ORI <- NDX1[,1]

## Bind the three worksheets per crime and drop the repeated agency-ID
## (ORI) columns carried in from each worksheet.
NDX <- cbind(NDX1, NDX2, NDX3); NDX <- NDX[,-1]; NDX <- NDX[,-241]; NDX <- NDX[,-481]
MUR <- cbind(MUR1, MUR2, MUR3); MUR <- MUR[,-1]; MUR <- MUR[,-241]; MUR <- MUR[,-481]
RPF <- cbind(RPF1, RPF2, RPF3); RPF <- RPF[,-1]; RPF <- RPF[,-241]; RPF <- RPF[,-481]
AGA <- cbind(AGA1, AGA2, AGA3); AGA <- AGA[,-1]; AGA <- AGA[,-241]; AGA <- AGA[,-481]
BUR <- cbind(BUR1, BUR2, BUR3); BUR <- BUR[,-1]; BUR <- BUR[,-241]; BUR <- BUR[,-481]
LAR <- cbind(LAR1, LAR2, LAR3); LAR <- LAR[,-1]; LAR <- LAR[,-241]; LAR <- LAR[,-481]
RBT <- cbind(RBT1, RBT2, RBT3); RBT <- RBT[,-1]; RBT <- RBT[,-241]; RBT <- RBT[,-481]
VTT <- cbind(VTT1, VTT2, VTT3); VTT <- VTT[,-1]; VTT <- VTT[,-241]; VTT <- VTT[,-481]
####################################################################
##                                                                ##
##  Author: Clint Roberts                                         ##
##  Title:  Mean Imputation                                       ##
##                                                                ##
##  Description: This program makes imputations in time series    ##
##               with 1 or lower average crime counts             ##
##                                                                ##
####################################################################

## Section I: what crime do you want to work with?  Put it in for tc.
tc  <- RPF
tc2 <- t(tc)
tc3 <- tc2

## Find which agencies have aggregated months (flagged by codes < -100).
j <- 1
agg <- array()
for (i in 1:n) {
  tester <- tc2[,i]
  if (length(tester[tester < (-100)]) > 0) { agg[j] <- i; j <- j + 1 }
}

## Adjust aggregated months to 0 when the month the counts were rolled
## into reported zero crimes.
month <- rep(1:12, 43)   # month-of-year index for the 516-month series
for (j in agg) {
  for (i in 1:516) {
    if (tc3[i,j] < -100) {
      off      <- -100 - tc3[i,j]
      aggmonth <- i - month[i] + off
      if (tc3[aggmonth,j] == 0) { tc3[i,j] <- 0 }
    }
  }
}

## Find which agencies still have aggregated months.
j <- 1
agg <- array()
for (i in 1:n) {
  tester <- tc3[,i]
  if (length(tester[tester < (-100)]) > 0) { agg[j] <- i; j <- j + 1 }
}

## Section II: calculate the average and make the imputations.
tc[tc < (-100)] <- 0
tc[tc < (-3)]   <- NA
tc <- t(tc)
colnames(tc) <- ORI
avg <- apply(tc, 2, mean, na.rm=T)
mm  <- apply(tc, 2, is.na)
mm  <- 516 - apply(mm, 2, sum)   # number of reported months per agency
length(avg[avg < 1])
sum(is.nan(avg))

## Find which series are eligible for method 1 (mean under 1 crime/month).
j <- 1
means <- array()
for (i in 1:n) {
  if (avg[i] < 1 && is.nan(avg[i]) == FALSE) { means[j] <- i; j <- j + 1 }
}

## Make the imputations and the confidence limits.
tc2 <- t(MUR)   # note: the original transposes MUR here, not the crime chosen for tc
climits1 <- matrix(NA, ncol=n, nrow=516)
climits2 <- matrix(NA, ncol=n, nrow=516)
for (i in means) {
  for (m in 1:516) {
    if (tc2[m,i] < (-90) && tc2[m,i] > (-100)) {
      tc2[m,i]      <- avg[i]
      climits1[m,i] <- avg[i] - 1.96*sqrt(avg[i]/mm[i])
      climits2[m,i] <- avg[i] + 1.96*sqrt(avg[i]/mm[i])
    }
  }
}
climits1[climits1 < 0] <- NA

## Make the plots.
v <- 189
plot(tc2[,v], type="l", ylim=c(0,1))
points(1:516, climits1[,v], pch="-")
points(1:516, climits2[,v], pch="-")

## Adjust months for aggregation under method 1: spread the aggregated
## count evenly over the months it covers.
for (j in agg) {
  for (i in 1:516) {
    if (tc3[i,j] < (-100)) {
      off      <- -100 - tc3[i,j]
      aggmonth <- i - month[i] + off
      q <- 0
      s <- i
      while (tc3[s,j] < (-100)) { q <- q + 1; s <- s + 1 }
      tc3[i:(s-1),j]  <- (1/(q+1)) * tc3[aggmonth,j]
      tc3[aggmonth,j] <- (1/(q+1)) * tc3[aggmonth,j]
    }
  }
}
####################################################################
##                                                                ##
##  Author: Clint Roberts                                         ##
##  Title:  GLM Imputation                                        ##
##                                                                ##
##  Description: This program makes imputations in time series    ##
##               when average crime counts are between 1 and 35   ##
##                                                                ##
####################################################################

library(MASS)   # for negative.binomial(), used in the glm() calls below

## Section I: what crime do you want to work with?  Put it in for tc.
tc <- LAR
data3       <- tc
data3.copy  <- data3
data3.copy2 <- data3

## List all series for GLM imputation according to the decision tree.
tc[tc < (-3)] <- NA
tc <- t(tc)
tc <- as.numeric(tc)
tc <- matrix(tc, ncol=n)
colnames(tc) <- ORI
avg <- apply(tc, 2, mean, na.rm=T)
n.glms <- length(avg[avg < 35 & avg > 1]) - sum(is.nan(avg))  # vectorized &; the original's && uses only the first elements
n.glms

j2 <- 1
glms <- array()
for (u in 1:n) {
  if (avg[u] < 35 && is.nan(avg[u]) == FALSE && avg[u] > 1) { glms[j2] <- u; j2 <- j2 + 1 }
}

## What agency do you want to work with?  Put it in for i.
i <- 72

## Create a month factor to include seasonality in the model.
jan <- seq(1, 515, by=12)
feb <- jan + 1;  mar <- jan + 2;  apr <- jan + 3
may <- jan + 4;  jun <- jan + 5;  jul <- jan + 6
aug <- jan + 7;  sep <- jan + 8;  oct <- jan + 9
nov <- jan + 10; dec <- jan + 11
mon <- 1:516
mon[jan] <- "jan"; mon[feb] <- "feb"; mon[mar] <- "mar"; mon[apr] <- "apr"
mon[may] <- "may"; mon[jun] <- "jun"; mon[jul] <- "jul"; mon[aug] <- "aug"
mon[sep] <- "sep"; mon[oct] <- "oct"; mon[nov] <- "nov"; mon[dec] <- "dec"
mont <- factor(mon[1:516])

## Look at the correlation between two agencies' series; series 55
## supplies the auxiliary covariate used in the models below.
auxdata <- data3[,55]
cor(data3[,i], auxdata, use="pairwise.complete.obs")
plot(data3[,i], type="l")
## Section II: find the biggest observed (gap-free) section in the series.
data3 <- LAR
data3 <- t(data3)
#data3[,i] <- data3[516:1,i]
plot(data3[,i], type="l", ylim=c(0,50))

big <- matrix(0, nrow=516, ncol=1)
m <- 1
a <- 1

## Scan to the first observed month (missing months are coded < -3).
NOTDONE <- TRUE
while (NOTDONE) {
  if (data3[m,i] < (-3)) { m <- m + 1 } else { a <- m; NOTDONE <- FALSE }
}

## Scan to the end of that observed run.
m <- a + 1
b <- a + 1
NOTDONE <- TRUE
while (NOTDONE) {
  if (data3[m,i] > (-3)) { m <- m + 1 } else { b <- m - 1; NOTDONE <- FALSE }
}

len  <- (b - a)
biga <- a
bigb <- b
m  <- b + 1
a2 <- b + 1

## Walk through the remaining runs, keeping the longest one.
STOP <- FALSE
while (STOP == FALSE) {
  NOTDONE <- TRUE
  while (NOTDONE) {
    if (data3[m,i] < (-3)) {
      m <- m + 1
      if (m > 516) { STOP <- TRUE; NOTDONE <- FALSE; a2 <- m }
    } else { a2 <- m; NOTDONE <- FALSE }
  }
  m  <- a2 + 1
  b2 <- a2 + 1
  NOTDONE <- TRUE
  while (NOTDONE) {
    if (data3[m,i] > (-3)) {
      m <- m + 1
      if (m > 516) { STOP <- TRUE; NOTDONE <- FALSE; b2 <- m - 1 }
    } else { b2 <- m - 1; NOTDONE <- FALSE }
  }
  if ((b2 - a2) > len) {
    len <- (b2 - a2)
    big[a2:b2] <- data3[a2:b2,i]
    biga <- a2
    bigb <- b2
  }
}
## Section III: make imputations going from left to right, starting at
## the big section of the series.
section <- matrix(0, ncol=2000, nrow=516)
#climits <- matrix(NA, ncol=2, nrow=516)
nap <- 1
WORKING <- TRUE
while (WORKING) {
  ## Locate the observed run before the gap (a:b) and the run after it
  ## (a2:b2), clamping the scan at month 516.
  m <- biga
  a <- biga
  NOTDONE <- TRUE
  if (m > 516) { m <- 516 }
  while (NOTDONE) {
    if (data3[m,i] < (-3)) {
      m <- m + 1
      if (m > 516) { NOTDONE <- FALSE; a <- m }
    } else { a <- m; NOTDONE <- FALSE }
  }
  m <- a + 1
  b <- a + 1
  if (m > 516) { m <- 516 }
  NOTDONE <- TRUE
  while (NOTDONE) {
    if (data3[m,i] > (-3)) {
      m <- m + 1
      if (m > 516) { NOTDONE <- FALSE; b <- m - 1 }
    } else { b <- m - 1; NOTDONE <- FALSE }
  }
  m  <- b + 1
  a2 <- b + 1
  if (m > 516) { m <- 516 }
  NOTDONE <- TRUE
  while (NOTDONE) {
    if (data3[m,i] < (-3)) {
      m <- m + 1
      if (m > 516) { NOTDONE <- FALSE; a2 <- m }
    } else { a2 <- m; NOTDONE <- FALSE }
  }
  m  <- a2 + 1
  b2 <- a2 + 1
  if (m > 516) { m <- 516 }
  NOTDONE <- TRUE
  while (NOTDONE) {
    if (data3[m,i] > (-3)) {
      m <- m + 1
      if (m > 516) { NOTDONE <- FALSE; b2 <- m - 1 }
    } else { b2 <- m - 1; NOTDONE <- FALSE }
  }

  ## Case 1: at least 24 observed months on both sides of the gap --
  ## forecast from the left, backcast from the right, and average.
  if ((b-a) >= 24 && (b2-a2) >= 24) {
    section[1:(b-a+1), nap]       <- data3[a:b, i]
    section[(b2-a2+1):1, (nap+1)] <- data3[a2:b2, i]
    pred  <- array()
    predb <- array()
    preda <- array()
    fmonth <- 1
    for (y in (b+1):(a2-1)) {
      x    <- section[1:(b-a+fmonth-1), nap]
      w    <- section[1:(b2-a2+fmonth-1), (nap+1)]
      z    <- auxdata[(a+1):(b+fmonth-1)]
      z2   <- auxdata[(b2-1):(a2-fmonth+1)]
      mona <- mont[(a+1):(b+fmonth-1)]
      monb <- mont[(b2-1):(a2-fmonth+1)]
      modela <- glm(section[2:(b-a+1+fmonth-1), nap] ~ I(log(x+1)) + I(log(z+1)) + mona,
                    family=negative.binomial(theta=1))
      predglma <- predict(modela,
                          newdata=data.frame(x=section[(b-a+1+fmonth-1), nap],
                                             z=auxdata[b+fmonth], mona=mont[b+fmonth]),
                          se.fit=TRUE, type="response")
      #mse <- sum(((data[1:(length(data)-1)]) - (model$fitted.values))**2)/(length(data)-3)
      #lower[i] <- pred$fit - 1.96*(sqrt(pred$se.fit**2+mse))
      #upper[i] <- pred$fit + 1.96*(sqrt(pred$se.fit**2+mse))
      modelb <- glm(section[2:(b2-a2+1+fmonth-1), (nap+1)] ~ I(log(w+1)) + I(log(z2+1)) + monb,
                    family=negative.binomial(theta=1))
      predglmb <- predict(modelb,
                          newdata=data.frame(w=section[(b2-a2+1+fmonth-1), (nap+1)],
                                             z2=auxdata[a2-fmonth], monb=mont[a2-fmonth]),
                          se.fit=TRUE, type="response")
      preda[fmonth] <- predglma$fit
      predb[fmonth] <- predglmb$fit
      if (data3[y,i] < (-90)) {
        #data3[y,i] <- pred[fmonth]
        #data3o[q,i] <- pred[fmonth]
        section[(b-a+fmonth+1), nap]       <- preda[fmonth]
        section[(b2-a2+fmonth+1), (nap+1)] <- predb[fmonth]
        fmonth <- fmonth + 1
      }
    }
    fmonth <- 1
    for (y in (b+1):(a2-1)) {
      pred[fmonth] <- .5 * (predb[a2-b-fmonth] + preda[fmonth])
      if (data3[y,i] < (-90)) { data3[y,i] <- pred[fmonth]; fmonth <- fmonth + 1 }
    }
    nap <- nap + 1
  }

  ## Case 2: only the left side is long enough -- forecast forward.
  if ((b-a) >= 24 && (b2-a2) < 24 && b < 516) {
    section[1:(b-a+1), nap] <- data3[a:b, i]
    pred  <- array()
    preda <- array()
    fmonth <- 1
    for (y in (b+1):(a2-1)) {
      x    <- section[1:(b-a+fmonth-1), nap]
      z    <- auxdata[(a+1):(b+fmonth-1)]
      mona <- mont[(a+1):(b+fmonth-1)]
      modela <- glm(section[2:(b-a+1+fmonth-1), nap] ~ I(log(x+1)) + I(log(z+1)) + mona,
                    family=negative.binomial(theta=1))
      predglma <- predict(modela,
                          newdata=data.frame(x=section[(b-a+1+fmonth-1), nap],
                                             z=auxdata[b+fmonth], mona=mont[b+fmonth]),
                          se.fit=TRUE, type="response")
      preda[fmonth] <- predglma$fit
      if (data3[y,i] < (-90)) { section[(b-a+fmonth+1), nap] <- preda[fmonth]; fmonth <- fmonth + 1 }
    }
    fmonth <- 1
    for (y in (b+1):(a2-1)) {
      pred[fmonth] <- preda[fmonth]
      if (data3[y,i] < (-90)) { data3[y,i] <- pred[fmonth]; fmonth <- fmonth + 1 }
    }
    nap <- nap + 1
  }

  ## Case 3: only the right side is long enough -- backcast.
  if ((b-a) < 24 && (b2-a2) >= 24) {
    section[(b2-a2+1):1, (nap+1)] <- data3[a2:b2, i]
    pred  <- array()
    predb <- array()
    fmonth <- 1
    for (y in (b+1):(a2-1)) {
      w    <- section[1:(b2-a2+fmonth-1), (nap+1)]
      z2   <- auxdata[(b2-1):(a2-fmonth+1)]
      monb <- mont[(b2-1):(a2-fmonth+1)]
      modelb <- glm(section[2:(b2-a2+1+fmonth-1), (nap+1)] ~ monb + I(log(z2+1)) + I(log(w+1)),
                    family=negative.binomial(theta=1))
      predglmb <- predict(modelb,
                          newdata=data.frame(w=section[(b2-a2+1+fmonth-1), (nap+1)],
                                             z2=auxdata[a2-fmonth], monb=mont[a2-fmonth]),
                          se.fit=TRUE, type="response")
      predb[fmonth] <- predglmb$fit
      if (data3[y,i] < (-90)) { section[(b2-a2+fmonth+1), (nap+1)] <- predb[fmonth]; fmonth <- fmonth + 1 }
    }
    fmonth <- 1
    for (y in (b+1):(a2-1)) {
      pred[fmonth] <- predb[a2-b-fmonth]
      if (data3[y,i] < (-90)) { data3[y,i] <- pred[fmonth]; fmonth <- fmonth + 1 }
    }
    nap <- nap + 1
  }
  if (m > 515) { WORKING <- FALSE }
}

## Section IV: make imputations going from right to left, starting at
## the big section of the series.  The series is reversed, the same
## procedure is applied, and the result is reversed back.
data3b <- data3   # working copies (not created in the original listing)
data3c <- data3
WORKING <- TRUE
while (WORKING) {
  data3b[,i] <- data3[,i]
  data3b[,i] <- data3b[516:1, i]
  M <- 517 - bigb
  A <- 517 - bigb
  NOTDONE <- TRUE
  if (M > 516) { M <- 516 }
  while (NOTDONE) {
    if (data3b[M,i] < (-3)) {
      M <- M + 1
      if (M > 516) { NOTDONE <- FALSE; A <- M }
    } else { A <- M; NOTDONE <- FALSE }
  }
  M <- A + 1
  B <- A + 1
  if (M > 516) { M <- 516 }
  NOTDONE <- TRUE
  while (NOTDONE) {
    if (data3b[M,i] > (-3)) {
      M <- M + 1
      if (M > 516) { NOTDONE <- FALSE; B <- M - 1 }
    } else { B <- M - 1; NOTDONE <- FALSE }
  }
  M <- B + 1
  if (M > 516) { M <- 516 }
  A2 <- M
  NOTDONE <- TRUE
  while (NOTDONE) {
    if (data3b[M,i] < (-3)) {
      M <- M + 1
      if (M > 516) { NOTDONE <- FALSE; A2 <- M }
    } else { A2 <- M; NOTDONE <- FALSE }
  }
  M <- A2 + 1
  if (M > 516) { M <- 516 }
  B2 <- M
  NOTDONE <- TRUE
  while (NOTDONE) {
    if (data3b[M,i] > (-3)) {
      M <- M + 1
      if (M > 516) { NOTDONE <- FALSE; B2 <- M - 1 }
    } else { B2 <- M - 1; NOTDONE <- FALSE }
  }

  ## The same three cases as Section III, applied to the reversed series.
  if ((B-A) >= 24 && (B2-A2) >= 24) {
    section[1:(B-A+1), nap]       <- data3b[A:B, i]
    section[(B2-A2+1):1, (nap+1)] <- data3b[A2:B2, i]
    pred  <- array()
    predb <- array()
    preda <- array()
    fmonth <- 1
    for (y in (B+1):(A2-1)) {
      x    <- section[1:(B-A+fmonth-1), nap]
      w    <- section[1:(B2-A2+fmonth-1), (nap+1)]
      z    <- auxdata[(A+1):(B+fmonth-1)]
      z2   <- auxdata[(B2-1):(A2-fmonth+1)]
      mona <- mont[(A+1):(B+fmonth-1)]
      monb <- mont[(B2-1):(A2-fmonth+1)]
      modela <- glm(section[2:(B-A+1+fmonth-1), nap] ~ I(log(x+1)) + I(log(z+1)) + mona,
                    family=negative.binomial(theta=1))
      predglma <- predict(modela,
                          newdata=data.frame(x=section[(B-A+1+fmonth-1), nap],
                                             z=auxdata[B+fmonth], mona=mont[B+fmonth]),
                          se.fit=TRUE, type="response")
      modelb <- glm(section[2:(B2-A2+1+fmonth-1), (nap+1)] ~ I(log(w+1)) + I(log(z2+1)) + monb,
                    family=negative.binomial(theta=1))
      predglmb <- predict(modelb,
                          newdata=data.frame(w=section[(B2-A2+1+fmonth-1), (nap+1)],
                                             z2=auxdata[A2-fmonth], monb=mont[A2-fmonth]),
                          se.fit=TRUE, type="response")
      preda[fmonth] <- predglma$fit
      predb[fmonth] <- predglmb$fit
      if (data3b[y,i] < (-90)) {
        section[(B-A+fmonth+1), nap]       <- preda[fmonth]
        section[(B2-A2+fmonth+1), (nap+1)] <- predb[fmonth]
        fmonth <- fmonth + 1
      }
    }
    fmonth <- 1
    for (y in (B+1):(A2-1)) {
      pred[fmonth] <- .5 * (predb[A2-B-fmonth] + preda[fmonth])
      if (data3b[y,i] < (-90)) { data3b[y,i] <- pred[fmonth]; fmonth <- fmonth + 1 }
    }
    nap <- nap + 1
  }
  if ((B-A) >= 24 && (B2-A2) < 24 && B < 515) {
    section[1:(B-A+1), nap] <- data3b[A:B, i]
    pred  <- array()
    preda <- array()
    fmonth <- 1
    for (y in (B+1):(A2-1)) {
      x    <- section[1:(B-A+fmonth-1), nap]
      z    <- auxdata[(A+1):(B+fmonth-1)]
      mona <- mont[(A+1):(B+fmonth-1)]
      modela <- glm(section[2:(B-A+1+fmonth-1), nap] ~ I(log(x+1)) + I(log(z+1)) + mona,
                    family=negative.binomial(theta=1))
      predglma <- predict(modela,
                          newdata=data.frame(x=section[(B-A+1+fmonth-1), nap],
                                             z=auxdata[B+fmonth], mona=mont[B+fmonth]),
                          se.fit=TRUE, type="response")
      preda[fmonth] <- predglma$fit
      if (data3b[y,i] < (-90)) { section[(B-A+fmonth+1), nap] <- preda[fmonth]; fmonth <- fmonth + 1 }
    }
    fmonth <- 1
    for (y in (B+1):(A2-1)) {
      pred[fmonth] <- preda[fmonth]
      if (data3b[y,i] < (-90)) { data3b[y,i] <- pred[fmonth]; fmonth <- fmonth + 1 }
    }
    nap <- nap + 1
  }
  if ((B-A) < 24 && (B2-A2) >= 24) {
    section[(B2-A2+1):1, (nap+1)] <- data3b[A2:B2, i]
    pred  <- array()
    predb <- array()
    fmonth <- 1
    for (y in (B+1):(A2-1)) {
      w    <- section[1:(B2-A2+fmonth-1), (nap+1)]
      z2   <- auxdata[(B2-1):(A2-fmonth+1)]
      monb <- mont[(B2-1):(A2-fmonth+1)]
      modelb <- glm(section[2:(B2-A2+1+fmonth-1), (nap+1)] ~ monb + I(log(z2+1)) + I(log(w+1)),
                    family=negative.binomial(theta=1))
      predglmb <- predict(modelb,
                          newdata=data.frame(w=section[(B2-A2+1+fmonth-1), (nap+1)],
                                             z2=auxdata[A2-fmonth], monb=mont[A2-fmonth]),
                          se.fit=TRUE, type="response")
      predb[fmonth] <- predglmb$fit
      if (data3b[y,i] < (-90)) { section[(B2-A2+fmonth+1), (nap+1)] <- predb[fmonth]; fmonth <- fmonth + 1 }
    }
    fmonth <- 1
    for (y in (B+1):(A2-1)) {
      pred[fmonth] <- predb[A2-B-fmonth]
      if (data3b[y,i] < (-90)) { data3b[y,i] <- pred[fmonth]; fmonth <- fmonth + 1 }
    }
    nap <- nap + 1
  }
  ## Reverse the reversed series back into chronological order.
  data3c[,i] <- data3b[,i]
  data3c[,i] <- data3c[516:1, i]
  data3[,i]  <- data3c[,i]
  if (M > 515) { WORKING <- FALSE }
}
plot(data3[,i], type="l", ylim=c(0,50))
#plot(data3[516:1,i], type="l", ylim=c(0,50))
#################################################################### ## ## ## Author: Clint Roberts ## ## Title: Time Series Imputation ## ## ## ## Description: This program makes imputations in time series ## ## ## #################################################################### ################################ ## ## ## Section I ## ## ## #################################################################### ## ## ## What crime do you want to work with? Put in for tc ## ## ## #################################################################### tc <- LAR data3 <- tc data3.copy <- data3 #################################################################### ## ## ## Program ## ## Description: Lists all series for time series ## ## imputation according to the decision tree. ## ## ## #################################################################### tc[tc < (-3)] <- NA tc <- t(tc) tc <- as.numeric(tc) tc <- matrix(tc, ncol=n) colnames(tc) <- ORI avg <- apply(tc,2,mean, na.rm=T) n.times <- length(avg[avg > 35 ]) - sum(is.nan(avg)) n.times j <- 1 times <- array() for ( i in 1:n) { if (avg[i] > 35 && is.nan(avg[i])==FALSE) { times[j] <- i j <- j+1 } } times
49
#################################################################### ## ## ## What agency do you want to work with? Put in for i ## ## ## #################################################################### i <- 189 ################################ ## ## ## Section II ## ## ## #################################################################### ## ## ## Program ## ## Description: Detects outliers in time series and ## ## creates a dataset with outliers removed ## ## called data3o to be used in model ## ## ## #################################################################### data3 <- t(data3) data3o <- data3 data3bo <- data3 # Note: all plots in this entire program are commented out #plot(data3[,i], type="l") outlimit1 <- matrix(100000, 516) outlimit3 <- matrix(-10, 516) u <- 37 for( m in 37:479) { q1 <- quantile(window(tc[,i], m-36,m+36), .25, na.rm=T) q3 <- quantile(window(tc[,i], m-36,m+36), .75, na.rm=T) iqr <- q3-q1 if(is.na(q1) == FALSE) { outlimit1[u] <- q1 - 1.5 * iqr outlimit3[u] <- q3 + 1.5 * iqr } u <- u+1 } outlimit1[1:36] <- outlimit1[37] outlimit3[1:36] <- outlimit3[37] outlimit1[480:516] <- outlimit1[479] outlimit3[480:516] <- outlimit3[479] #points(1:516, outlimit3) #points(1:516, outlimit1) go <- matrix(FALSE,516)
50
for(h in 1:516) { if(data3[h,i] < outlimit1[h]) { go[h] <- TRUE } if(data3[h,i] > outlimit3[h]) { go[h] <- TRUE } if(go[h] == TRUE) { if( data3[h,i] > (-3) ) { data3[h,i] <- NA } } } data3o[,i] <- data3[,i] data3bo[,i] <- data3[,i] data3bo[,i] <- data3bo[516:1, i] ################################ ## ## ## Section III ## ## ## #################################################################### ## ## ## Program ## ## Description: Finds the biggest observed section ## ## in the time series ## ## ## #################################################################### data3 <- t(data3.copy) data3b <- data3 data3c <- data3 big <- matrix(0, nrow = 516, ncol=1) m <- 1 a <- 1 NOTDONE <- TRUE while(NOTDONE == TRUE) { if (data3[m,i]< (-3)) { m <- m +1 }
51
else { a <- m NOTDONE <- FALSE } } m <- a + 1 b <- a + 1 NOTDONE <- TRUE while(NOTDONE == TRUE) { if (data3[m,i] > (-3) ) { m <- m +1 } else { b <- m - 1 NOTDONE <- FALSE } } len <- (b-a) biga <- a bigb <- b m <- b + 1 a2 <- b + 1 STOP <- FALSE while( STOP == FALSE) { NOTDONE <- TRUE while(NOTDONE == TRUE) { if (data3[m,i] < (-3)) { m <- m +1 if( m > 516) { STOP <- TRUE NOTDONE <- FALSE a2 <- m } } else { a2 <- m NOTDONE <- FALSE } } m <- a2 + 1 if( m > 516) { m <- 516
52
} b2 <- m NOTDONE <- TRUE while(NOTDONE == TRUE) { if (data3[m,i] > (-3) ) { m <- m +1 if( m > 516) { STOP <- TRUE NOTDONE <- FALSE m <- 516 b2 <- m } } else { b2 <- m -1 NOTDONE <- FALSE } } if( (b2-a2) > len) { len <- (b2-a2) big[a2:b2] <- data3[a2:b2,i] biga<- a2 bigb<- b2 } } ################################ ## ## ## Section IV ## ## ## #################################################################### ## ## ## Program ## ## Description: Makes imputations going from left to right ## ## starting at the big section of the series ## ## ## #################################################################### section <- matrix(0,ncol= 2000, nrow = 516) #climits <- matrix(NA, ncol=2, nrow = 516) nap <- 1 WORKING <- TRUE while( WORKING == TRUE) {
53
m <- biga a <- biga NOTDONE <- TRUE if(m > 516) { m <- 516 } while(NOTDONE == TRUE) { if (data3[m,i]< (-3)) { m <- m +1 if( m > 516) { NOTDONE <- FALSE a <- m } } else { a <- m NOTDONE <- FALSE } } m <- a + 1 b <- a + 1 if( m > 516) { m <- 516 } NOTDONE <- TRUE while(NOTDONE == TRUE) { if (data3[m,i] > (-3) ) { m <- m +1 if( m > 516) { NOTDONE <- FALSE b <- m-1 } } else { b <- m - 1 NOTDONE <- FALSE } } m <- b + 1 a2 <- b + 1 if ( m > 516) { m <- 516 }
54
NOTDONE <- TRUE while(NOTDONE == TRUE) { if (data3[m,i] < (-3)) { m <- m +1 if( m > 516) { NOTDONE <- FALSE a2 <- m } } else { a2 <- m NOTDONE <- FALSE } } m <- a2 + 1 b2 <- a2 + 1 if( m > 516) { m <- 516 } NOTDONE <- TRUE while(NOTDONE == TRUE) { if (data3[m,i] > (-3) ) { m <- m +1 if( m > 516) { NOTDONE <- FALSE b2 <- m-1 } } else { b2 <- m -1 NOTDONE <- FALSE } } if ( (b-a) >= 24 && (b2-a2) >= 24) { section[1:(b - a + 1),nap] <- data3o[a:b, i] section[(b2-a2+1):1, (nap + 1)] <- data3o[a2:b2,i] data.sarima.a <- arima(section[1:(b-a+1),nap], order=c(1,1,1), seasonal=list(order=c(1,0,0),period=12),include.mean=F) data.sarima.a fmonths.a <- (b+1):(a2-1)
55
    predarima.a <- predict(data.sarima.a, (a2-b-1))   # forecast across the gap
    forecast.a <- predarima.a$pred
    flimits.a <- 1.96 * round(predarima.a$se, 3)
    data.sarima.b <- arima(section[1:(b2-a2+1), (nap+1)], order = c(1,1,1),
                           seasonal = list(order = c(1,0,0), period = 12),
                           include.mean = F)
    data.sarima.b
    fmonths.b <- (a2-1):(b+1)
    predarima.b <- predict(data.sarima.b, (a2-b-1))   # backcast across the gap
    forecast.b <- predarima.b$pred
    flimits.b <- 1.96 * round(predarima.b$se, 3)
    # Combine the two forecasts, weighting each inversely by its
    # standard error (each forecast is multiplied by the other's se)
    pred <- array()
    for ( j in 1:length(fmonths.a))
    {
      pred[j] <- (predarima.b$se[j] * forecast.a[j] +
                  predarima.a$se[j] * forecast.b[j]) /
                 (predarima.b$se[j] + predarima.a$se[j])
    }
    for (j in 1:length(fmonths.a))
    {
      #climits[(b+j), 1] <- pred[j] - 1.96*(sqrt(2)) * predarima.a$se[j] * predarima.b$se[j] * (1/(predarima.a$se[j]+predarima.b$se[j]))
      #climits[(b+j), 2] <- pred[j] + 1.96*(sqrt(2)) * predarima.a$se[j] * predarima.b$se[j] * (1/(predarima.a$se[j]+predarima.b$se[j]))
    }
    # Fill in the missing months (coded < -90) with the combined forecast
    for ( q in (b+1):(a2-1))
    {
      if ( data3[q,i] < (-90))
      {
        data3[q,i] <- pred[(q-b)]
        data3o[q,i] <- pred[(q-b)]
      }
    }
  }
  # Only section a is long enough: use the forward forecast alone
  if ( (b-a) >= 24 && (b2-a2) < 24 && b < 516)
  {
    section[1:(b-a+1), nap] <- data3o[a:b, i]
    data.sarima.a <- arima(section[1:(b-a+1), nap], order = c(1,1,1),
                           seasonal = list(order = c(1,0,0), period = 12),
                           include.mean = F)
    data.sarima.a
    fmonths.a <- (b+1):(a2-1)
    predarima.a <- predict(data.sarima.a, (a2-b-1))
    forecast.a <- predarima.a$pred
    pred <- array()
    for ( j in 1:length(fmonths.a))
    {
      pred[j] <- forecast.a[j]
    }
    for (j in 1:length(fmonths.a))
    {
      #climits[b+j,1] <- pred[j] - 1.96*predarima.a$se[j]
      #climits[b+j,2] <- pred[j] + 1.96*predarima.a$se[j]
    }
    for ( q in (b+1):(a2-1))
    {
      if ( data3[q,i] < (-90))
      {
        data3[q,i] <- pred[(q-b)]
        data3o[q,i] <- pred[(q-b)]
      }
    }
  }
  # Only section b is long enough: use the backward forecast alone
  if( (b-a) < 24 && (b2-a2) >= 24)
  {
    section[(b2-a2+1):1, (nap+1)] <- data3o[a2:b2, i]
    data.sarima.b <- arima(section[1:(b2-a2+1), (nap+1)], order = c(1,1,1),
                           seasonal = list(order = c(1,0,0), period = 12),
                           include.mean = F)
    data.sarima.b
    fmonths.b <- (a2-1):(b+1)
    predarima.b <- predict(data.sarima.b, (a2-b-1))
    forecast.b <- predarima.b$pred
    flimits.b <- 1.96 * round(predarima.b$se, 3)
    pred <- array()
    for ( j in 1:length(fmonths.b))
    {
      pred[j] <- forecast.b[j]
    }
    for (j in 1:length(fmonths.b))
    {
      #climits[b+j,1] <- pred[j] - 1.96*predarima.b$se[j]
      #climits[b+j,2] <- pred[j] + 1.96*predarima.b$se[j]
    }
    for( q in (b+1):(a2-1))
    {
      if ( data3[q,i] < (-90))
      {
        data3[q,i] <- pred[(q-b)]
        data3o[q,i] <- pred[(q-b)]
      }
    }
  }
  if(m > 515)
  {
    WORKING <- FALSE
  }
}

####################################################################
##                                                                ##
##                         Section V                              ##
##                                                                ##
####################################################################
##                                                                ##
##  Program                                                       ##
##  Description: Makes imputations going from right to left       ##
##               starting at the big section of the series        ##
##                                                                ##
####################################################################
WORKING <- TRUE
while( WORKING == TRUE)
{
  # Reverse the 516-month series so that right-to-left imputation can
  # reuse the same forward-scanning logic as Section IV
  data3b[,i] <- data3[,i]
  data3b[,i] <- data3b[516:1, i]
  M <- 516 - b2 + 1
  A <- 516 - b2 + 1
  if(M > 516) { M <- 516 }
  # This finds section a
  NOTDONE <- TRUE
  while(NOTDONE == TRUE)
  {
    if (data3b[M,i] < (-3))
    {
      M <- M + 1
      if(M > 516)
      {
        NOTDONE <- FALSE
        A <- M
      }
    }
    else
    {
      A <- M
      NOTDONE <- FALSE
    }
  }
  M <- A + 1
  B <- A + 1
  if(M > 516) { M <- 516 }
  NOTDONE <- TRUE
  while(NOTDONE == TRUE)
  {
    if (data3b[M,i] > (-3))
    {
      M <- M + 1
      if(M > 516)
      {
        NOTDONE <- FALSE
        B <- M - 1
      }
    }
    else
    {
      B <- M - 1
      NOTDONE <- FALSE
    }
  }
  M <- B + 1
  A2 <- B + 1
  if(M > 516) { M <- 516 }
  # This finds section b
  NOTDONE <- TRUE
  while(NOTDONE == TRUE)
  {
    if (data3b[M,i] < (-3))
    {
      M <- M + 1
      if( M > 516)
      {
        NOTDONE <- FALSE
        A2 <- M
      }
    }
    else
    {
      A2 <- M
      NOTDONE <- FALSE
    }
  }
  M <- A2 + 1
  B2 <- A2 + 1
  if(M > 516) { M <- 516 }
  NOTDONE <- TRUE
  while(NOTDONE == TRUE)
  {
    if (data3b[M,i] > (-3))
    {
      M <- M + 1
      if( M > 516)
      {
        NOTDONE <- FALSE
        B2 <- M - 1
      }
    }
    else
    {
      B2 <- M - 1
      NOTDONE <- FALSE
    }
  }
  nap <- nap + 2
  # Both sections long enough (>= 10 months in this pass): combine the
  # forward and backward SARIMA forecasts, weighted by standard error
  if( (B-A) >= 10 && (B2-A2) >= 10)
  {
    section[1:(B-A+1), nap] <- data3bo[A:B, i]
    section[(B2-A2+1):1, (nap+1)] <- data3bo[A2:B2, i]
    data.sarima.a2 <- arima(section[1:(B-A+1), nap], order = c(1,1,1),
                            seasonal = list(order = c(1,0,0), period = 12),
                            include.mean = F)
    data.sarima.a2
    fmonths.a2 <- (B+1):(A2-1)
    predarima.a2 <- predict(data.sarima.a2, (A2-B-1))
    forecast.a2 <- predarima.a2$pred
    flimits.a2 <- 1.96 * round(predarima.a2$se, 3)
    data.sarima.b2 <- arima(section[1:(B2-A2+1), (nap+1)], order = c(1,1,1),
                            seasonal = list(order = c(1,0,0), period = 12),
                            include.mean = F)
    data.sarima.b2
    fmonths.b2 <- (A2-1):(B+1)
    predarima.b2 <- predict(data.sarima.b2, (A2-B-1))
    forecast.b2 <- predarima.b2$pred
    flimits.b2 <- 1.96 * round(predarima.b2$se, 3)
    pred2 <- array()
    for ( j in 1:length(fmonths.a2))
    {
      pred2[j] <- (predarima.b2$se[j] * forecast.a2[j] +
                   predarima.a2$se[j] * forecast.b2[j]) /
                  (predarima.b2$se[j] + predarima.a2$se[j])
    }
    for (j in 1:length(fmonths.a2))
    {
      #climits[516-B-j+1,1] <- pred2[j] - 1.96*(sqrt(2)) * predarima.a2$se[j] * predarima.b2$se[j] * (1/(predarima.a2$se[j]+predarima.b2$se[j]))
      #climits[516-B-j+1,2] <- pred2[j] + 1.96*(sqrt(2)) * predarima.a2$se[j] * predarima.b2$se[j] * (1/(predarima.a2$se[j]+predarima.b2$se[j]))
    }
    for( q in (B+1):(A2-1))
    {
      if ( data3b[q,i] < (-90))
      {
        data3b[q,i] <- pred2[(q-B)]
        data3bo[q,i] <- pred2[(q-B)]
      }
    }
  }
  # Only section a is long enough: use the forward forecast alone
  if ( (B-A) >= 10 && (B2-A2) < 10 && B < 516)
  {
    section[1:(B-A+1), nap] <- data3bo[A:B, i]
    data.sarima.a2 <- arima(section[1:(B-A+1), nap], order = c(1,1,1),
                            seasonal = list(order = c(1,0,0), period = 12),
                            include.mean = F)
    data.sarima.a2
    fmonths.a2 <- (B+1):(A2-1)
    predarima.a2 <- predict(data.sarima.a2, (A2-B-1))
    forecast.a2 <- predarima.a2$pred
    pred2 <- array()
    for ( j in 1:length(fmonths.a2))
    {
      pred2[j] <- forecast.a2[j]
    }
    for (j in 1:length(fmonths.a2))
    {
      #climits[516-B-j+1,1] <- pred2[j] - 1.96*predarima.a2$se[j]
      #climits[516-B-j+1,2] <- pred2[j] + 1.96*predarima.a2$se[j]
    }
    for( q in (B+1):(A2-1))
    {
      if ( data3b[q,i] < (-90))
      {
        data3b[q,i] <- pred2[(q-B)]
        data3bo[q,i] <- pred2[(q-B)]
      }
    }
  }
  # Only section b is long enough: use the backward forecast alone
  if( (B-A) < 10 && (B2-A2) >= 10)
  {
    section[(B2-A2+1):1, (nap+1)] <- data3bo[A2:B2, i]
    data.sarima.b2 <- arima(section[1:(B2-A2+1), (nap+1)], order = c(1,1,1),
                            seasonal = list(order = c(1,0,0), period = 12),
                            include.mean = F)
    data.sarima.b2
    fmonths.b2 <- (A2-1):(B+1)
    predarima.b2 <- predict(data.sarima.b2, (A2-B-1))
    forecast.b2 <- predarima.b2$pred
    flimits.b2 <- 1.96 * round(predarima.b2$se, 3)
    pred2 <- array()
    for ( j in 1:length(fmonths.b2))   # corrected: original looped over fmonths.a2,
    {                                  # which is not set in this branch
      pred2[j] <- forecast.b2[j]
    }
    for (j in 1:length(fmonths.b2))
    {
      #climits[516-B-j+1,1] <- pred2[j] - 1.96*predarima.b2$se[j]
      #climits[516-B-j+1,2] <- pred2[j] + 1.96*predarima.b2$se[j]
    }
    for( q in (B+1):(A2-1))
    {
      if ( data3b[q,i] < (-90))
      {
        data3b[q,i] <- pred2[(q-B)]
        data3bo[q,i] <- pred2[(q-B)]
      }
    }
  }
  nap <- nap + 2
  # Reverse the imputed series back to its original left-to-right order
  data3c[,i] <- data3b[,i]
  data3c[,i] <- data3c[516:1, i]
  data3[,i] <- data3c[,i]
  if(M > 515)
  {
    WORKING <- FALSE
  }