

Public Health & Intelligence

Regression Modelling

Document Control

Version: 0.4
Date Issued: 28/11/2018
Author: David Carr
Comments to: [email protected]

Version | Date | Comment | Author
0.1 | 21/09/2017 | 1st draft | David Carr
0.2 | 21/12/2017 | 2nd draft with changes from Statistical Governance | David Carr
0.3 | 23/02/2018 | 3rd draft with changes from Statistical Advisory Group and addition of acknowledgements section | David Carr
0.4 | 28/11/2018 | Revision to add references to ‘Hypothesis Testing’ paper and change links to example datasets | David Carr


The author would like to acknowledge Prof. Chris Robertson and colleagues at the University

of Strathclyde for carrying out the original analysis of the data used for the examples in this

paper and for allowing its replication here.

The simulated HAI data sets used in the worked examples were originally created by the

Health Protection Scotland SHAIPI team in collaboration with the University of Strathclyde.


Table of Contents

1 Introduction
2 Linear Regression
  2.1 Model Formulation
  2.2 Assumptions of the Linear Model
  2.3 Data Distributions and Transformations
  2.4 Explanatory Variables and Multicollinearity
  2.5 Assessing Goodness-of-Fit
    2.5.1 Residual Analysis
    2.5.2 R² and AIC
  2.6 Example
3 Generalised Linear Models
  3.1 Model Formulation
  3.2 Model Diagnostics
4 Logistic Regression
  4.1 Model Formulation and Inference
  4.2 Example
  4.3 Further Examples
5 Poisson Regression
  5.1 Model Formulation and Inference
  5.2 Example
  5.3 Further Examples
Bibliography
Appendices
  A Model Fitting for Linear Models
  B Model Fitting for Generalised Linear Models
  C R Code for Analyses
  D SPSS Syntax for Analyses
  E SPSS Output from Analyses


1 Introduction

The purpose of this paper is to outline the statistical methodology behind basic regression

modelling and provide examples on how to fit such models in R and SPSS. The paper will

initially focus on linear regression before considering generalised linear models (specifically

logistic regression and Poisson regression). Some preliminary mathematical and statistical

knowledge is assumed.

Regression modelling is an important statistical analysis tool in many fields, including

healthcare. For example, if an analyst is looking at mortality in a particular cohort of

patients, it is possible to use statistical models to identify which variables (e.g. age, gender,

deprivation, etc.) are associated with mortality. Many other standard statistical tests such

as t-tests and chi-squared tests are ‘univariate’, meaning that they can assess the

relationship between an outcome (e.g. mortality) and one other variable only; however,

statistical modelling is ‘multivariate’ and can account for many variables at once.

Regression modelling is where the relationship between a response variable (sometimes

referred to as the ‘outcome’ or ‘dependent’ variable) and one or more explanatory variables

(sometimes referred to as ‘independent’ variables or ‘covariates’) is modelled

mathematically. There are many different types of regression models which can be used

and the choice is largely subjective. When choosing which method to implement, an analyst

must consider the purpose of the analysis (i.e. what questions are they attempting to

answer) and the nature of the data and variables, especially the response variable. Table 1

summarises the type of regression that is appropriate for different types of data in most,

but by no means all, cases.

Table 1: Summary of most common types of regression

Data Type | Regression Model
Continuous | Linear regression
Categorical (binary) | Binomial/logistic regression
Categorical (count/rate) | Poisson regression


The following statistical issues will not be addressed specifically in this paper but are of

great importance in regression modelling as with many other types of analysis: data

completeness, data quality, sample size, outliers etc.

An often-quoted phrase among statisticians is that “all models are wrong but some are

useful”, which is attributed to the statistician George E. P. Box. The idea here is that there is

no ‘correct’ model but some are more ‘correct’ than others. Regression models are a

reflection of the statistical patterns in the population being modelled and “while a model

can never be ‘truth’, a model might be ranked from very useful, to useful, to somewhat

useful to, finally, essentially useless” (Burnham and Anderson, 2002). A ‘useful’ model is

one which is based on data that are fit-for-purpose, adequately satisfies any statistical

assumptions and could be re-fit using another sample of data from the same target

population and achieve similar results.


2 Linear Regression

The simplest form of regression model is a ‘linear model’, sometimes referred to as a

‘normal linear model’, due to the fact that it is intended to model normally-distributed data.

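It can be expressed in words as:

response variable [1] = explanatory variable(s) + error

or mathematically as:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \qquad (1)$$

[1] Also known as the ‘dependent’ or ‘outcome’ variable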

The right hand side of this equation can essentially be thought of as ‘signal + noise’.

For linear regression, the response variable must be a continuous (numeric) variable. It is

not suitable for modelling counts, rates or binary data – these are modelled using

Generalised Linear Models.

When a response variable is modelled against one explanatory variable, it is referred to as

‘simple linear regression’ while, if there is more than one explanatory variable, it is called

‘multiple linear regression’. However, they are fitted and assessed in a very similar manner

and simple linear regression is merely a special case of multiple linear regression.

2.1 Model Formulation

A linear model can be expressed as a mathematical formula similar to Equation 1, which is analogous to the formula for a straight line (i.e. y = α + β𝑥, where α is the intercept and β is the gradient). Equation 1 can be expanded to become the general formula for a multiple linear regression model with k explanatory variables, indexed by i = 1, 2, ⋯ , k:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon \qquad (2)$$

where:


· y is the response variable that is being modelled

· β₀ is an unknown intercept term, which represents the average value of y when 𝑥₁, 𝑥₂, ⋯ , 𝑥ₖ are all zero (this is the point at which the straight line intercepts the y-axis) – this is estimated automatically as part of the model fitting

· 𝑥₁, 𝑥₂, ⋯ , 𝑥ₖ are explanatory variables, each with an associated unknown parameter βᵢ that is estimated as part of the model fitting

· ε is a random error term that is assumed to follow a normal distribution with a mean of zero and constant variance, i.e. ε ~ N(0, σ²). The reasoning behind this is that if the explanatory variables describe the response variable exactly, then the errors should essentially be nothing more than random noise in the data with no effect on the mean of y (rare in practice).

If this were a simple linear model, then Equation 2 would only include one explanatory variable 𝑥₁ and its associated parameter β₁.

The ultimate aim of regression is to “estimate the mean value of 𝑦 and/or predict particular

values of 𝑦 to be observed in the future” (Mendenhall and Beaver, 1994) by examining the

relationships between 𝑦 and the explanatory variables.
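To make Equation 2 concrete, the following minimal sketch simulates data from a model with two explanatory variables and then recovers the parameters with lm(). The variable names and the ‘true’ parameter values are hypothetical and are not taken from the HAI data used later in this paper.

```r
# Simulate data from Equation 2 with k = 2 explanatory variables.
# All names and "true" parameter values here are hypothetical.
set.seed(1)
n   <- 500
x1  <- rnorm(n, mean = 50, sd = 10)
x2  <- runif(n, min = 0, max = 5)
eps <- rnorm(n, mean = 0, sd = 2)        # error term, N(0, sigma^2)
y   <- 3 + 0.5 * x1 - 1.5 * x2 + eps     # beta0 = 3, beta1 = 0.5, beta2 = -1.5

sim_data  <- data.frame(y, x1, x2)
sim_model <- lm(y ~ x1 + x2, data = sim_data)

# Estimated intercept and slopes (should be close to 3, 0.5 and -1.5)
coef(sim_model)

# Estimated mean of y for a new observation
predict(sim_model, newdata = data.frame(x1 = 60, x2 = 2))
```

The estimates will not match the simulated values exactly because of the random error term, which is the ‘noise’ part of ‘signal + noise’.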

It is assumed that a change in each 𝑥 is associated with a change in 𝑦, and the results of the model fitting will indicate whether the data support this. If a variable 𝑥 has a positive linear relationship with 𝑦 (i.e. the correlation is greater than 0), then an increase in 𝑥 tends to be accompanied by an increase in 𝑦. A correlation further from 0 (closer to 1 or −1) indicates a stronger linear relationship. This association can be displayed graphically, as illustrated in Figure 1(a). Figure 1(b) is an example of a negative linear relationship (i.e. the correlation is less than 0, meaning an increase in 𝑥 tends to be accompanied by a decrease in 𝑦), and Figure 1(c) is an example of where no linear relationship is present (i.e. the correlation is 0 or very close to 0).


Linear relationships between variables can be described as ‘strong’ or ‘weak’. Although

there is no consensus on what level of correlation constitutes a ‘strong’ or ‘weak’ linear

relationship, one rule-of-thumb could be that a correlation with an absolute value of at least 0.7 could be described as ‘strong’, between 0.3 and 0.7 as ‘moderate’ and less than 0.3 as ‘weak’.

Figure 1: Examples of linear relationships between a hypothetical response variable y and

explanatory variable x – (a) strong, positive linear relationship; (b) moderate, negative linear relationship; (c) no linear relationship.
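The rule-of-thumb for describing correlations can be written as a small helper function. This is only an illustrative sketch: describe_correlation() is a hypothetical name, and the cut-offs of 0.3 and 0.7 are the informal ones suggested above rather than any agreed standard.

```r
# Hypothetical helper: label a correlation using the rule of thumb above
# (|r| >= 0.7 strong, 0.3 <= |r| < 0.7 moderate, |r| < 0.3 weak).
describe_correlation <- function(r) {
  stopifnot(is.numeric(r), all(abs(r) <= 1))
  ifelse(abs(r) >= 0.7, "strong",
         ifelse(abs(r) >= 0.3, "moderate", "weak"))
}

set.seed(2)
x <- rnorm(200)
y <- 2 * x + rnorm(200)     # built to have a strong positive relationship
r <- cor(x, y)              # Pearson correlation coefficient
describe_correlation(r)
describe_correlation(c(-0.5, 0.28))
```

The absolute value is used so that negative correlations, such as the one in Figure 1(b), are classified by their strength rather than their sign.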

Correlation is related to R² (the coefficient of determination), which can be used to describe how much of the variation in the data is explained by the fitted linear regression model (see Section 2.5.2 for further discussion about R²).

The mathematical detail on how a linear model is fitted is provided in Appendix A.

2.2 Assumptions of the Linear Model

The linear model makes a number of important assumptions about the data, which should

be checked:


1. Independent observations

· Each observation should be independent of all others

2. Linearity of the mean

· The mean of the response variable, given the explanatory variables, should be a linear function of those explanatory variables


· This can be assessed by creating a scatterplot of the response variable against

each continuous explanatory variable and checking whether it is reasonable that

a straight line could be fitted through the points

· If the response variable has a non-linear relationship with an explanatory

variable, then a linear model is unlikely to fit it sufficiently

3. Constant variance (homoscedasticity)

· The variance of the error term should be constant and the mean of the residuals

(see Section 2.5.1 for definition) should be as close to zero as possible

· This can be checked in a residual analysis (Section 2.5.1)

4. Normality

· The errors should be normally-distributed

· This can be checked in a residual analysis (Section 2.5.1)

These assumptions are critical in determining the so-called ‘goodness-of-fit’ of a model and

can be assessed by a combination of residual analysis and an understanding of the data and

their real-life context.

2.3 Data Distributions and Transformations

Prior to any regression modelling, some preliminary checks of the data should be

undertaken. A linear model is only appropriate when the response variable follows a

normal distribution (a bell shape), although, in reality, it would be rare for data to follow a

normal distribution exactly.


The simplest way to assess whether a variable is normally-distributed is to plot a histogram

of the data. If the shape of the histogram resembles Figure 2(a), then it would be fair to

assume that the data follow a normal distribution. Figure 2(b) is an example of a positively-

skewed distribution.

Figure 2: (a) Histogram of 10,000 random draws from the normal (N(5, 1)) distribution with normal

curve plotted in red; (b) Histogram of positively-skewed data with curve plotted in red

However, if a response variable is not normally-distributed, then it does not necessarily

mean that linear regression cannot be used. A response variable can be transformed (for

example, a log transformation or square root transformation) and then modelled on the

transformed scale. Log transformations are particularly useful when the response variable

is non-negative and highly positively-skewed. If the transformed data resembles a bell-

shaped curve, then the model can be fitted to the data on the transformed scale. The data

in Section 2.6 are an example of non-normal data which can be log-transformed to fit a

linear model. If none of the standard transformations are effective in producing a bell-

shaped curve, then linear regression should not be used.
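The effect of a log transformation on positively-skewed data can be illustrated with simulated values. In this sketch, the variable stay and the skewness() helper are hypothetical; the helper is a simple moment-based measure that is close to zero for symmetric data.

```r
set.seed(3)
# Hypothetical positively-skewed, non-negative response (log-normal draws)
stay <- rlnorm(5000, meanlog = 2, sdlog = 0.8)

# Simple moment-based sample skewness (about 0 for a symmetric distribution)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(stay)
skew_log <- skewness(log(stay))

par(mfrow = c(1, 2))
hist(stay, main = "(a) original scale", xlab = "stay", col = "lightblue")
hist(log(stay), main = "(b) log scale", xlab = "log(stay)", col = "lightblue")

c(raw = skew_raw, log = skew_log)
```

A histogram remains the primary check; the skewness values are only a numerical summary of what the two plots show.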

2.4 Explanatory Variables and Multicollinearity

Explanatory variables can be quantitative (continuous/numeric) or qualitative (categorical/factor), and the type of each variable should be checked before fitting a model. R, for

instance, will automatically fit each variable according to its designated variable class (e.g.

numeric, factor etc.) unless the user changes the class or specifies such in the model code.


For example, if a categorical variable (with levels 1, 2 and 3) is read into R incorrectly as a

numeric variable, then the analyst will have to either change the variable to a factor (which

stores the variable as categorical) before fitting the model or wrap the variable in the

function factor() in the model code itself. Fitting a categorical variable as a numeric variable

(or vice versa) will lead to incorrect inference.
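The difference between the two fits can be seen in a short sketch. The data here are simulated and the names group and y are hypothetical; the group means are deliberately chosen so that they do not follow a straight-line trend across the codes 1, 2 and 3.

```r
set.seed(4)
# Hypothetical categorical variable stored as the numbers 1, 2 and 3
group <- sample(1:3, size = 300, replace = TRUE)
group_means <- c(10, 14, 11)           # not a straight-line trend
y <- group_means[group] + rnorm(300)

fit_numeric <- lm(y ~ group)           # treats 1, 2, 3 as a continuous scale
fit_factor  <- lm(y ~ factor(group))   # one parameter per non-reference level

coef(fit_numeric)   # intercept and a single slope
coef(fit_factor)    # intercept (level 1) plus differences for levels 2 and 3
```

The numeric fit forces a single slope across the three codes, whereas the factor fit estimates a separate difference from the reference level for each of the other levels.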

Multicollinearity is where a continuous explanatory variable in a regression analysis is highly correlated with another. In these cases, the estimated regression coefficients will have very high standard errors, which increases the uncertainty surrounding any interpretation (Mendenhall and Beaver, 1994). In the extreme case where two variables have perfect correlation (i.e. correlation = ± 1), their separate effects cannot be estimated at all as the model is statistically unstable; the lm() function in R, for instance, reports NA for the coefficient of one of the two variables. Further explanation on why this is the case is given in Appendix A.

There is no consensus on when multicollinearity becomes an issue, but the simplest way to

account for it is to remove one of the two variables from the model completely if their

correlation is greater than, say, 0.8. In R, the correlation matrix for a data set can be calculated by using the function cor() with the name of the data set (which must contain only numeric variables) being provided as its argument. Alternative, albeit more complex, approaches to alleviate multicollinearity

without excluding variables are available but these are outwith the scope of this paper.
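The following sketch, using simulated data with hypothetical variable names, shows the correlation matrix check, the inflation of a standard error when a near-duplicate variable is included, and what lm() returns when two variables are perfectly correlated.

```r
set.seed(5)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # almost a copy of x1 (correlation near 1)
x3 <- rnorm(n)                   # unrelated to x1 and x2
y  <- 1 + 2 * x1 + 0.5 * x3 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)

# Correlation matrix of the explanatory variables (numeric columns only)
round(cor(dat[, c("x1", "x2", "x3")]), 2)

# Standard error of the x1 coefficient with and without the collinear x2
se_with    <- summary(lm(y ~ x1 + x2 + x3, data = dat))$coefficients["x1", "Std. Error"]
se_without <- summary(lm(y ~ x1 + x3, data = dat))$coefficients["x1", "Std. Error"]
c(with_x2 = se_with, without_x2 = se_without)

# Perfect correlation: lm() cannot separate the two effects and reports NA
dat$x1_copy <- 2 * dat$x1
coef(lm(y ~ x1 + x1_copy + x3, data = dat))
```

Dropping x2 in this example costs little, because it carries almost no information beyond x1; in real data the choice of which variable to remove should also reflect subject-matter knowledge.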

2.5 Assessing Goodness-of-Fit

Once a model has been fitted, it is essential to assess whether or not it is a good fit to the data and whether it satisfies the assumptions made in Section 2.2. This can be done through a combination of residual analysis and the calculation of various statistics such as R² and AIC.

2.5.1 Residual Analysis

The differences between the observed values of the response variable (i.e. the data used to

fit the model) and their predicted values from the model are known as ‘residuals’

(Mendenhall and Beaver, 1994). Residual analysis can be used to see whether a fitted


model has adequately captured the variation in the data and whether it has satisfied the

assumption of normally-distributed errors with constant variance (homoscedasticity).

There are standard residual plots that can be used to assess assumptions 3 (constant

variance of the errors) and 4 (normally-distributed errors). To test whether the errors have

constant variance, a scatterplot can be constructed with the model’s fitted values (predicted

values from the model) on the x-axis and the model residuals on the y-axis. What one

would like to see is that there is no discernible pattern in the plot. If an obvious pattern is

observed, then the model has not adequately captured the variation in the data. Along with

this, an analyst should also check that the mean of the residuals is zero or very close to zero.

Figure 3(a) shows an example of how a ‘fitted values vs. residuals’ plot should look in theory

but, in practice, such a plot would be uncommon.

Figure 3: (a) Theoretical fitted values versus residuals plot; (b) theoretical normal q-q plot

To test whether the errors follow a normal distribution, a so-called ‘normal q-q plot’ can be

constructed and, if the assumption is satisfied, the data points should form a straight line

with little variation, although it is common for there to be more variation at the lower and

upper ends of the distribution. Figure 3(b) shows how a ‘normal q-q plot’ should look in

theory where almost all of the points are captured by the straight line. If a linear model had

residual plots like those in Figure 3, then it would be considered a very good fit to the data

but most real-life residual plots are less straightforward than these.


Assessing goodness-of-fit from residual plots is a mostly subjective exercise. Residuals can

also be used for other purposes such as identifying outliers but this is not discussed here.
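The two residual plots described in this section can be produced as in the sketch below. The data are simulated so that the assumptions hold by construction, and the object names are hypothetical; real data will rarely look this tidy.

```r
set.seed(6)
x <- runif(300, 0, 10)
y <- 1 + 0.8 * x + rnorm(300)     # hypothetical data satisfying the assumptions
fit <- lm(y ~ x)

res <- residuals(fit)
mean(res)                          # effectively zero

par(mfrow = c(1, 2))
# Assumption 3: look for no pattern and a roughly constant vertical spread
plot(fitted(fit), res, xlab = "Fitted Values", ylab = "Residuals", main = "(a)")
abline(h = 0, lty = 2)
# Assumption 4: points should lie close to the reference line
qqnorm(res, main = "(b)")
qqline(res)
```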

2.5.2 R² and AIC

R² (or R-squared) is formally known as the ‘coefficient of determination’ and is defined as:

$$R^2 = \frac{\mathrm{SST} - \mathrm{SSE}}{\mathrm{SST}} \qquad (3)$$

where ‘SST’ is the sum of squares for the total and ‘SSE’ is the sum of squares for the error.

R² is closely related to the correlation coefficient (see Figure 1): in the case of simple linear regression, R² is equal to the square of the correlation between the response variable and the explanatory variable. R² can take values between 0 and 1, with higher values indicating that the explanatory variables explain more of the variation in the response variable. However, a small R² does not necessarily mean that the model is a poor fit, and a model having a higher R² than another does not imply that it is the better model. If more explanatory variables are added to a model, R² will never decrease, regardless of whether it makes sense to include them or not. It is best to think of R² as being a guide rather than a rule.

In multiple linear regression, it is better to use ‘adjusted R²’, which penalises the addition of further explanatory variables and is defined as:

$$R^2_{\mathrm{adj}} = 1 - \frac{\mathrm{SSE}/(n - p)}{\mathrm{SST}/(n - 1)} \qquad (4)$$

where ‘n’ is the number of cases in the data and ‘p’ is the number of parameters in the model.
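Equations 3 and 4 can be checked against the values that R reports. In this sketch the data are simulated, r2_by_hand() is a hypothetical helper, and p is taken to include the intercept, which is an assumption that agrees with the values returned by summary().

```r
set.seed(7)
n <- 100
x <- rnorm(n)
noise <- rnorm(n)                       # unrelated to y by construction
y <- 2 + 1.5 * x + rnorm(n)

fit1 <- lm(y ~ x)
fit2 <- lm(y ~ x + noise)

r2_by_hand <- function(fit) {
  obs <- fitted(fit) + residuals(fit)   # observed response values
  sse <- sum(residuals(fit)^2)
  sst <- sum((obs - mean(obs))^2)
  n   <- length(obs)
  p   <- length(coef(fit))              # parameters, including the intercept
  c(r2 = (sst - sse) / sst,
    adj_r2 = 1 - (sse / (n - p)) / (sst / (n - 1)))
}

r2_by_hand(fit1)
r2_by_hand(fit2)                        # R2 is not lower than for fit1
c(summary(fit2)$r.squared, summary(fit2)$adj.r.squared)  # same as line above
cor(x, y)^2                             # equals R2 of the simple model fit1
```

Adding the pure-noise variable cannot reduce R², which is why adjusted R² is the safer summary when comparing models with different numbers of explanatory variables.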


As mentioned above, it is not appropriate to compare models based on which one has the higher R². However, one can perform a basic comparison between models using an information criterion. The most common criterion is AIC (Akaike’s Information Criterion), defined below.

$$\mathrm{AIC} = (-2 \times \text{log-likelihood}) + (2 \times \text{number of parameters in model}) \qquad (5)$$

A model with a lower AIC is preferable to a model with a higher value. Again, an information criterion should be treated as a guide and is not as important as residual analysis. It is only useful for comparing models fitted to the same data, and an AIC value for one model in isolation is not particularly informative.

The concept of formal model comparison and selection is a large statistical topic and is out of scope for this paper. It is closely related to hypothesis testing and analysis of variance (ANOVA), and is addressed in Chapter 7 of the Hypothesis Testing paper.
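As an illustration of Equation 5, the sketch below computes AIC by hand and compares two nested models fitted to the same simulated data. The names are hypothetical. Note that, for models fitted with lm(), the parameter count that R uses includes the residual variance as well as the regression coefficients.

```r
set.seed(8)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + 1 * x2 + rnorm(n)

small <- lm(y ~ x1)
large <- lm(y ~ x1 + x2)

# Equation 5 by hand. For lm objects, R's parameter count (the "df" attribute
# of the log-likelihood) includes the residual variance as well as the betas.
aic_by_hand <- function(fit) {
  ll <- logLik(fit)
  -2 * as.numeric(ll) + 2 * attr(ll, "df")
}

c(by_hand = aic_by_hand(large), built_in = AIC(large))
AIC(small, large)     # the model with the lower AIC is preferred
```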

2.6 Example

For the examples in this paper, a simulated data set will be used, which was created by the

HPS SHAIPI team in collaboration with the University of Strathclyde. It consists of synthetic

healthcare associated infection (HAI) data along with other related variables. The original

analysis of these data was carried out by Prof. Chris Robertson and colleagues at the

University of Strathclyde. The dataset, as well as some documentation on the variables in

the dataset, can be found at this link (this can only be accessed when connected to the

internal NHS National Services Scotland network). The full R code and equivalent SPSS

syntax used for this example (derived from the original analysis) are displayed in Appendices

C and D, respectively. Some of the R code is included above specific plots in this section.

After loading the data and checking which of the variables are designated as

numeric/continuous or factor/categorical variables, patients with incomplete data on a

number of variables are excluded from the analysis. The aim of the analysis is to fit a linear

model to find which explanatory variables are associated with how long a patient stays in

hospital, regardless of whether or not they have an HAI. There are a total of 8,555 complete cases.



Figure 4 shows two histograms. Figure 4(a) is a histogram of the variable that is to be

modelled i.e. the length of stay for each patient. It is clear that these data are not normally-distributed as the data are strongly skewed to the right. However, since a patient cannot

have a negative length of stay, a log-transformation may be appropriate. Figure 4(b) shows

the data on the logarithmic scale and this histogram is much closer to the normal

distribution, such as the theoretical one shown in Figure 2(a). Since the number zero cannot

be log-transformed, each patient’s length of stay is shifted one whole number higher prior

to the transformation (i.e. log(Total.Stay + 1)).

> hist($Total.Stay, main = "(a)", xlab = "Total.Stay",
+      col = "lightblue")

> hist(log($Total.Stay + 1), main = "(b)",
+      xlab = "log(Total.Stay + 1)", col = "lightblue")

Figure 4: (a) Histogram of the original Total.Stay variable; (b) Histogram of the log-transformed

Total.Stay variable (with all stays shifted one unit higher)
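The reason for shifting each stay by one day before taking logs can be seen with a few made-up values. The vector stay below is hypothetical; log1p() is a base R function that computes log(1 + x) directly.

```r
stay <- c(0, 1, 2, 10, 45)     # hypothetical lengths of stay in days

log(stay)                      # the zero-day stay becomes -Inf
log(stay + 1)                  # the shifted transformation used in this example
log1p(stay)                    # base R shortcut for log(1 + x)

exp(log(stay + 1)) - 1         # back-transformation to the original scale
```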

At this stage, it would be appropriate to calculate the correlation between the different

continuous explanatory variables to check for multicollinearity, but the model being fitted

here will only have one numeric explanatory variable (Age), so it is not necessary to do this.

Prior to analysis, it is also good to see how the response variable varies according to the

different explanatory variables separately. This may give an indication of which explanatory

variables are likely to be associated with the response variable once the model has been

fitted. This can be done by constructing scatterplots for continuous explanatory variables

and boxplots for categorical explanatory variables. Two examples, Age and HAI status, are

shown below. Figure 5(a) displays a scatterplot of log(Total.Stay + 1) plotted against Age


and Figure 5(b) shows boxplots of log(Total.Stay + 1) by whether a patient had an HAI or not

during their admission.

> plot($Age, log($Total.Stay + 1),
+      xlab = "Age (years)", ylab = "log(Total_Stay + 1)", col = "lightblue",
+      pch = 20, main = "(a)")

> boxplot(log($Total.Stay + 1) ~ factor($HAI),
+         xlab = "HAI", ylab = "log(Total_Stay + 1)", col = "lightblue",
+         main = "(b)")

Figure 5: (a) Scatterplot of Age (years) against log(Total.Stay + 1); (b) Boxplots of log(Total.Stay + 1)

by HAI status

From Figure 5(a), it seems as if the length of stay increases somewhat as age increases but it does not appear to be a very strong relationship. The correlation between Age and the log-transformed length of stay is 0.28 (which can be calculated in R using the cor() function), suggesting a weak linear relationship according to the rule-of-thumb in Section 2.1. As shown in Figure 5(b), the median length of stay appears to be higher for those with an HAI than those without. There is quite a lot of variability, especially among those without an HAI, although it should be kept in mind that there were many more patients without an HAI than with one.

A linear model can then be fitted for the transformed length of stay variable with a number

of explanatory variables from the data. The standard R output is displayed below. It should

be noted that this model is an example and may not necessarily be the ‘best’ model.

The adjusted R² value is approximately 0.2, meaning that the model is explaining around

20% of the variability in the data. However, this does not necessarily mean that the model


is a poor fit and the residuals must be checked as well. AIC is not part of the default linear

model output in R but can be calculated using the AIC() function with the model’s name

provided as its argument. The AIC for this model is 25,589.9.

> summary(model)

Call: [2]
lm(formula = log(Total.Stay + 1) ~ Age + HAI + Sex + Hospital.Size +
    Surgery + Prognosis + centralcatheter + peripheralcath +
    urinarycatheter, data = )

Residuals:
    Min      1Q  Median      3Q     Max
-3.3331 -0.7384 -0.0704  0.7045  4.7907

Coefficients:
                               Estimate [3]  Std. Error [4]  t value [5]  Pr(>|t|) [6]
(Intercept)                       2.3309918       0.0808469       28.832     < 2e-16 ***
Age                               0.0149926       0.0007019       21.360     < 2e-16 ***
HAIYes                            0.7254588       0.0509391       14.242     < 2e-16 ***
SexMale                          -0.0205254       0.0236233       -0.869      0.3849
Hospital.SizeMedium              -0.0476521       0.0269000       -1.771      0.0765 .
Hospital.SizeSmall                0.0597580       0.0519643        1.150      0.2502
SurgeryNo Surgery                -0.1066112       0.0515086       -2.070      0.0385 *
SurgeryNot Known                 -0.1416653       0.1399424       -1.012      0.3114
SurgerySurgery (NHSN Surgery)     0.0324752       0.0571668        0.568      0.5700
PrognosisLife Limiting           -0.2558871       0.0433060       -5.909    3.58e-09 ***
PrognosisNone/Non-fatal          -0.4754648       0.0406985      -11.683     < 2e-16 ***
PrognosisNot Known               -0.1624948       0.1361281       -1.194      0.2326
centralcatheterYes                0.2783705       0.0608494        4.575    4.84e-06 ***
peripheralcathYes                -0.4672574       0.0250617      -18.644     < 2e-16 ***
urinarycatheterYes                0.6048488       0.0298676       20.251     < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.079 on 8540 degrees of freedom
Multiple R-squared: 0.201, Adjusted R-squared: 0.1997 [7]
F-statistic: 153.5 on 14 and 8540 DF, p-value: < 2.2e-16

[2] Formula for the model
[3] Parameter estimates (β) for each explanatory variable
[4] Standard error for each parameter, used to construct confidence intervals
[5] Used for calculating the p-value (t-value = Estimate / Std. Error)
[6] p-value for each explanatory variable
[7] Adjusted R² is more robust for multiple linear regression (it is more ‘conservative’ than multiple R²)

The interpretation of each explanatory variable depends on whether it is continuous or categorical. A continuous variable, for example Age, can be interpreted as being, ‘for every one unit increase in Age, log(Total.Stay + 1) increases by 0.015, holding the other explanatory variables constant’. Note that this change is on the log scale and is not a number of days. For categorical variables, one of the levels is chosen as a reference category and all other levels are interpreted in relation to it. The categorical HAI variable had two levels, ‘No’ and ‘Yes’, and ‘No’ has been automatically chosen as the reference level. The parameter estimate for ‘Yes’ can be interpreted as being, ‘on average, if a patient has an HAI, log(Total.Stay + 1) is 0.73 higher than if they did not have an HAI’.
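Because the response was modelled on the log scale, it can help to exponentiate the estimates, which turns each one into a multiplicative effect on (Total.Stay + 1). The sketch below uses the Age and HAIYes estimates from the model output; the object names are hypothetical and this back-transformation is offered as an aid to interpretation rather than as part of the original analysis.

```r
# Estimates copied from the model output above
est <- c(Age = 0.0149926, HAIYes = 0.7254588)

# On the log scale the effects are additive; exponentiating gives the
# multiplicative effect on (Total.Stay + 1)
ratio <- exp(est)
round(ratio, 3)

# Percentage change in (Total.Stay + 1) per one-year increase in Age
round(100 * (ratio[["Age"]] - 1), 1)
```

Under this reading, having an HAI is associated with (Total.Stay + 1) being roughly doubled, with the other variables held constant.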

Although analysis of variance (ANOVA) is a somewhat separate topic from linear regression,

it is possible to check the statistical significance of a categorical variable, taking account of

all its levels, by using the anova() function with the regression model name provided as its

argument. The results are shown below:

> anova(model)
Analysis of Variance Table

Response: log(Total.Stay + 1)
                  Df Sum Sq Mean Sq  F value    Pr(>F) [8]
Age                1  983.0  983.04 844.9599 < 2.2e-16 ***
HAI                1  311.2  311.22 267.5023 < 2.2e-16 ***
Sex                1    0.0    0.03   0.0267    0.8703
Hospital.Size      2    4.0    2.01   1.7307    0.1772
Surgery            3   40.0   13.34  11.4669 1.685e-07 ***
Prognosis          3  321.5  107.16  92.1096 < 2.2e-16 ***
centralcatheter    1   60.9   60.90  52.3477 5.052e-13 ***
peripheralcath     1  301.9  301.87 259.4686 < 2.2e-16 ***
urinarycatheter    1  477.1  477.12 410.1038 < 2.2e-16 ***
Residuals       8540 9935.6    1.16
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The results of the regression model and ANOVA should broadly agree. Both tables indicate that there is no evidence of a difference in log(Total.Stay + 1) with regard to Sex or Hospital.Size. It is typical to remove non-significant variables one-by-one and refit

the model as part of model selection. However, this process is outwith the scope of this

paper as it relies heavily on the theory of hypothesis testing and ANOVA (see Chapter 7 of

the Hypothesis Testing paper for an example of how this is carried out). Going back to the

linear regression model, 95% confidence intervals for each of the explanatory variables can

also be calculated and should be presented along with any p-values. The figures below have

been rounded to 2 decimal places.

The first column of this output is the ‘Estimate’ column (see footnote 3) from the model output. The second and third columns are the lower (2.5th percentile) and upper (97.5th percentile) bounds of the confidence interval, respectively. If the analysis were repeated on many samples from the same population, around 95 out of 100 of the resulting confidence intervals would contain the true value of the parameter.

[8] p-value for explanatory variable

> round(cbind(model$coefficients, confint(model)), 2)
                                       2.5% 97.5%
(Intercept)                     2.33   2.17  2.49
Age                             0.01   0.01  0.02
HAIYes                          0.73   0.63  0.83
SexMale                        -0.02  -0.07  0.03
Hospital.SizeMedium            -0.05  -0.10  0.01
Hospital.SizeSmall              0.06  -0.04  0.16
SurgeryNo Surgery              -0.11  -0.21 -0.01
SurgeryNot Known               -0.14  -0.42  0.13
SurgerySurgery (NHSN Surgery)   0.03  -0.08  0.14
PrognosisLife Limiting         -0.26  -0.34 -0.17
PrognosisNone/Non-fatal        -0.48  -0.56 -0.40
PrognosisNot Known             -0.16  -0.43  0.10
centralcatheterYes              0.28   0.16  0.40
peripheralcathYes              -0.47  -0.52 -0.42
urinarycatheterYes              0.60   0.55  0.66

For the Age variable (a continuous variable), the confidence interval is (0.01, 0.02), meaning that for every one unit increase in Age, the increase in log(Total.Stay + 1) is estimated, with 95% confidence, to lie between 0.01 and 0.02 on the log scale. The confidence interval is wholly positive and does not contain the value 0, meaning there is evidence that age has a statistically-significant effect on log(Total.Stay + 1). The confidence interval agrees with the p-value in the model output (which was < 0.05).

For the Sex variable (a categorical variable with ‘Female’ as the baseline), the confidence

interval is (-0.07, 0.03), meaning that a male patient’s log(Total.Stay + 1) could plausibly be up to

0.07 lower or up to 0.03 higher than a female’s (again on the log scale, not in days). Since the

confidence interval contains the value 0, there is no evidence of a statistically-significant

difference between males and females, agreeing with the p-value in the model output

(which was 0.38).
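To illustrate how these intervals relate to the estimates and standard errors in the model output, the sketch below rebuilds a 95% confidence interval by hand and compares it with confint(). The data frame `sim` and its variables are simulated purely for illustration and are not the HAI data.

```r
# Simulated (hypothetical) data: not the HAI data set used in this paper
set.seed(1)
sim <- data.frame(age = runif(200, 20, 90))
sim$log_stay <- 2 + 0.01 * sim$age + rnorm(200, sd = 0.5)

fit <- lm(log_stay ~ age, data = sim)

# Estimate +/- t-quantile x standard error
est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]
t_crit <- qt(0.975, df = df.residual(fit))
by_hand <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

# confint() should give the same bounds
ci <- confint(fit)

# A 95% interval excludes 0 exactly when the p-value is below 0.05
p_val    <- coef(summary(fit))[, "Pr(>|t|)"]
excludes <- by_hand[, "lower"] > 0 | by_hand[, "upper"] < 0

print(round(cbind(estimate = est, by_hand), 3))
```

The last two objects show why the confidence interval and the p-value lead to the same conclusion for each parameter.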

The residuals of the model can now be analysed in order to verify assumptions 3 and 4 in

Section 2.2. This can be done using one plot for each of the assumptions. Assumption 3

concerned whether the residuals had a mean of zero and constant variance. The mean of

the residuals for the above model was approximately -2.67 × 10⁻¹⁷, i.e. effectively zero. The


constant variance part of the assumption can be checked using a plot of model fitted values

against model residuals, shown in Figure 6(a).

> plot(fitted(model), residuals(model), xlab = "Fitted Values",

+ ylab = "Residuals", main = "(a)")

> qqnorm(residuals(model), main = "(b)"); qqline(residuals(model))

Figure 6: (a) Scatterplot of model fitted values against model residuals; (b) normal q-q plot of

model residuals

The ‘fitted values vs. residuals’ plot is usually messy for real-life data but it is difficult to see

any pattern in this scatterplot, although there appears to be some contraction in the points

towards the right hand side of the plot. Figure 6(b) shows a normal q-q plot of the model

residuals to check assumption 4, that the residuals follow a normal distribution. If the

residuals follow a normal distribution, then they should all more-or-less lie on the solid

straight line. It appears that the residuals here do, although there is some movement away

at the lower and upper ends.

If the scatterplot contains any noticeable pattern or if the normal q-q plot deviates

strongly away from the straight line, then the model has not fit the data well. This may be

because important explanatory variables have not been included in the model, the

explanatory variables are not appropriate or linear regression itself may not have been

appropriate. If there are issues with the distribution of the response variable (i.e. the data

not being normally-distributed, see Figure 2), then this will be apparent from the normal q-q

plot in particular. If either of the two plots highlights any major problems with the model,

then any inference drawn from it could be incorrect. Any anomalies with the residuals


should be acknowledged in the final interpretation of the model, along with any data quality

issues or other caveats.

The purpose of Section 2.6 was to demonstrate how to fit a basic linear regression model

and how to check its assumptions. The full R code and equivalent SPSS syntax are shown in

Appendices C and D, respectively, as well as the equivalent SPSS analysis output in Appendix

E. However, the topic of regression modelling and checking is vast and there are many

other statistics and plots that can be used to scrutinise a model.


3 Generalised Linear Models

The standard linear model is quite restrictive in the type of data for which it can be used,

particularly with regard to the response variable being required to be continuous and the

error term being assumed to follow a normal distribution.

A linear model is a special case of the wider family of ‘generalised linear models’ (GLMs),

where the distribution of the error term is allowed to follow any distribution that is a

member of the ‘exponential family’ (which includes the normal distribution).

After a discussion of GLM theory, the remainder of this paper will consider two types of

generalised linear models individually: logistic regression (used for categorical data such as

binary data) and Poisson regression (used for modelling counts or rates).

For example, logistic regression may be appropriate where the response variable is

mortality (either alive or deceased) or whether cancer is present or not in a patient. Poisson

regression may be suitable for modelling, for instance, the count of A&E attendances during

a twelve-hour period.

3.1 Model Formulation

Generalised linear models can be characterised by the distribution of their response variable

and a link function (Dalgaard, 2008), which can be expressed mathematically as:

$$g(E(y_i)) = g(\mu_i) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k \qquad (6)$$

where g(µi) is a monotonic (strictly increasing or strictly decreasing) link function. The

formulation is similar to a linear model (see Equation 2) except the function g transforms

the expected value of the response variable onto a different scale.

There are standard link functions for the most common probability distributions. For

instance, the binomial distribution typically uses the logit link and the Poisson distribution


uses the log link. In the case of linear regression, where the distribution of the response

variable is normal, the link is the identity link (i.e. the link function leaves the mean unchanged).

The mathematical detail on how GLMs are fitted is provided in Appendix B.
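The standard links named above can be inspected directly in base R, since each family object stores its default link. This is a minimal sketch; the probability value 0.2 is arbitrary.

```r
# Each family object in base R carries the name of its default link
link_names <- c(gaussian()$link, binomial()$link, poisson()$link)
print(link_names)

# The link and its inverse are stored as functions
p      <- 0.2
eta    <- binomial()$linkfun(p)      # logit: log(p / (1 - p))
p_back <- binomial()$linkinv(eta)    # inverse logit recovers p
```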

3.2 Model Diagnostics

With regard to the assumptions of linear models outlined in Section 2.2, the only one of

these assumptions that directly applies to GLMs is that of independent observations.

However, some other assumptions apply here: that the errors follow the distribution that is

being assumed, that an appropriate link function has been used, etc.

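In assessing the adequacy of a GLM, one approach is to compare the full (or saturated)

model, which has the maximum number of parameters that can be estimated, to the model

being fitted. The measure used for this comparison is known as the deviance (D) and is defined as:

$$D = 2 \log \lambda = 2 \left[ l(\hat{\beta}_{\max}; y) - l(\hat{\beta}; y) \right] \qquad (7)$$

where l is the log-likelihood, evaluated at the estimates for the saturated model and for the

model being fitted, respectively.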

There are two deviances given in the standard R and SPSS GLM outputs: the null deviance

and the residual deviance. The null deviance compares the saturated model to an intercept-

only model (i.e. no explanatory variables); the residual deviance compares the saturated

model to the model being fitted.
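As an illustration of Equation 7, the sketch below computes both deviances by hand for a Poisson GLM and compares them with the values that glm() reports. The data are simulated and the object names are hypothetical.

```r
# Simulated (hypothetical) count data
set.seed(2)
x <- runif(50)
y <- rpois(50, lambda = exp(0.5 + 1.2 * x))

fit  <- glm(y ~ x, family = poisson)   # model being fitted
null <- glm(y ~ 1, family = poisson)   # intercept-only model

# Poisson log-likelihood evaluated at a vector of fitted means
loglik <- function(mu) sum(dpois(y, lambda = mu, log = TRUE))

# In the saturated model each fitted mean equals its own observation
ll_sat <- loglik(y)

resid_dev <- 2 * (ll_sat - loglik(fitted(fit)))
null_dev  <- 2 * (ll_sat - loglik(fitted(null)))

print(c(by_hand = resid_dev, glm = deviance(fit)))
print(c(by_hand = null_dev,  glm = fit$null.deviance))
```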

If the residual deviance is much larger than its number of degrees of freedom, then this suggests

that the model is a poor fit to the data. The deviance is presented as part of the standard

GLM model outputs in R and SPSS and is more appropriate to use for GLMs than R².

Multicollinearity can also affect GLMs and this should be checked for if there is more than

one continuous explanatory variable in the model.


4 Logistic Regression

Logistic regression is a type of GLM used for modelling binary data, where the response

variable can only take one of two possible values e.g. yes/no, 0/1, alive/deceased etc.

Logistic regression can be extended for nominal variables (categorical variables with more

than two levels) or ordinal variables (categorical variables where the levels have a natural

ordering) but these will not be discussed here.

A well-known example of where logistic regression has been used in healthcare is the TARN

model for trauma patients. It uses

mortality data from past patients and allocates weightings to different variables (such as

age, gender, comorbidity, injury severity etc.) so that a probability of survival can be

calculated for a future trauma patient based on their demographic and clinical information.

4.1 Model Formulation and Inference

For binary logistic regression, the link function is the ‘logit’ link i.e.

$$\operatorname{logit}(p_i) = \log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k \qquad (8)$$

where pi is the probability of the outcome in question (e.g. the probability that cancer is

present). The model assumes that the observations are independent. The estimates of the

probabilities can be found by re-arranging Equation 8 i.e.

$$\hat{p}_i = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k)} \qquad (9)$$
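The re-arrangement above can be sketched in R as follows. The coefficients are hypothetical values chosen only for illustration; plogis() is the inverse-logit function provided by base R.

```r
# Hypothetical coefficients, chosen only for illustration
beta0 <- -2.5   # intercept (log-odds in the unexposed group)
beta1 <- 1.3    # change in log-odds for the exposed group
x <- c(0, 1)    # unexposed, exposed

# Linear predictor on the log-odds scale
eta <- beta0 + beta1 * x

# Inverse logit written out in full
p_manual <- exp(eta) / (1 + exp(eta))

# plogis() computes the same quantity
p_builtin <- plogis(eta)

print(round(p_manual, 3))
```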


An important statistic for inference in logistic regression is the ‘odds ratio’, which is a

measure of association between an outcome and an ‘exposure’. The odds ratio represents

the odds that an outcome will occur in the ‘exposed’ group compared to the odds of the

outcome in the ‘unexposed’ group (Szumilas, 2010). Using the example of an outcome

being that cancer is or is not present, the ‘exposure’ could be whether the person had been


given a vaccine earlier in life or not. In logistic regression, the parameters β are expressed in

terms of log-transformed odds (known as ‘log-odds’) in the model output i.e.

$$\phi = \log\left(\frac{p}{1 - p}\right) \qquad (10)$$

Szumilas (2010) states that “the regression coefficient βi is the estimated increase in the log-

odds of the outcome per unit increase in the value of the exposure”. It is necessary to

exponentially-transform the log-odds ratio to the odds ratio, so that it can be interpreted.

If we assume that ‘0 = alive’ and ‘1 = deceased’ for the response variable, then if the odds

ratio for an explanatory variable is greater than 1, then it is associated with higher odds of

outcome; if the odds ratio is less than 1, then it is associated with lower odds of outcome;

and if the odds ratio is 1, then it does not affect the outcome (Szumilas, 2010). For example,

if the response variable in question is mortality and the odds ratio for the explanatory

variable gender (with females as baseline) is 1.5, this means that the odds of mortality are

1.5 times greater for males than females.
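A minimal sketch of this interpretation is given below, using made-up counts for a single binary explanatory variable. With one binary variable, the exponentiated coefficient from a logistic regression matches the odds ratio calculated directly from the 2 x 2 table.

```r
# Hypothetical 2 x 2 counts (not from the HAI data)
#            deceased  alive
# male          30       70
# female        20       80
tab <- data.frame(
  male     = c(1, 1, 0, 0),
  deceased = c(1, 0, 1, 0),
  n        = c(30, 70, 20, 80)
)

# Odds ratio directly from the table
odds_male   <- 30 / 70
odds_female <- 20 / 80
or_table    <- odds_male / odds_female

# Same quantity from a logistic regression, using the counts as frequency weights
fit <- glm(deceased ~ male, family = binomial, weights = n, data = tab)
or_model <- unname(exp(coef(fit)["male"]))

print(c(table = or_table, model = or_model))
```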

4.2 Example

The same HAI data set used for linear regression in Section 2.6 will be used here (although

the incomplete cases will be kept in the analysis). The full R code and equivalent SPSS

syntax are displayed in Appendices C and D, respectively. Here, the response variable is the

HAI data themselves and the aim is to see which explanatory variables are associated with

whether a patient has an HAI or not. In R, a binary response variable is most simply coded as

0 or 1 (glm() also accepts a two-level factor, treating the first level as the baseline). In the

data set, HAI is coded “No” and “Yes”, so this is changed to 0

and 1, respectively, before fitting the model.
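The recoding step could look something like the sketch below. The data frame `hai` is a small hypothetical stand-in for the real data set, which is not reproduced here.

```r
# Hypothetical stand-in for the HAI column of the data set
hai <- data.frame(HAI = c("No", "Yes", "No", "No", "Yes"))

# Recode "No"/"Yes" as 0/1
hai$HAI <- ifelse(hai$HAI == "Yes", 1, 0)

# Check the result
print(table(hai$HAI))
```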

For this analysis, the Age variable will be changed from a continuous variable to a

categorical variable by splitting the data into different age groups (< 40, 40 – 59, 60 – 69, 70

– 79 and 80+). This means that there will be no continuous explanatory variables in this

particular model but continuous variables can be included if required. The standard logistic

regression model output is shown below. The output is similar to that for linear regression

but with a few differences, mostly at the bottom of the output.


> summary(model)

glm(formula = HAI ~ Age_Group + Sex + Hospital.category + Hospital.Size +
    Surgery + Prognosis + centralcatheter + peripheralcath +
    urinarycatheter + intubation, family = "binomial", data =  [9]

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.0397  -0.3438  -0.2735  -0.2324   3.0916

Coefficients:
                              Estimate [10]  Std. Error [11]  z value [12]  Pr(>|z|) [13]
(Intercept)                        -2.64084          0.26242       -10.064        < 2e-16 ***
Age_Group40-59                      0.29699          0.19387         1.532       0.125543
Age_Group60-69                      0.29789          0.20103         1.482       0.138393
Age_Group70-79                      0.46407          0.18703         2.481       0.013091 *
Age_Group80+                        0.40380          0.18929         2.133       0.032904 *
SexMale                             0.05050          0.09172         0.551       0.581908
SexUnknown                          0.13888          0.76285         0.182       0.855541
Hospital.categoryNon Acute         -1.37423          0.73396        -1.872       0.061158 .
Hospital.categoryObsetrics          0.14174          0.40773         0.348       0.728108
Hospital.categoryTeaching           0.05597          0.11042         0.507       0.612245
Hospital.SizeMedium                -0.01358          0.11877        -0.114       0.908937
Hospital.SizeSmall                  0.20218          0.19260         1.050       0.293845
SurgeryNo Surgery                  -0.59855          0.17334        -3.453       0.000554 ***
SurgeryNot Known                    0.50574          0.39303         1.287       0.198179
SurgerySurgery (NHSN Surgery)       0.12194          0.18473         0.660       0.509197
PrognosisLife Limiting             -0.55472          0.14319        -3.874       0.000107 ***
PrognosisNone/Non-fatal            -0.76285          0.13388        -5.698       1.21e-08 ***
PrognosisNot Known                 -1.82795          0.64798        -2.821       0.004787 **
centralcatheterUnknown             -0.48389          0.84573        -0.572       0.567219
centralcatheterYes                  1.29187          0.15402         8.388        < 2e-16 ***
peripheralcathUnknown               0.46336          0.61844         0.749       0.453714
peripheralcathYes                   0.48898          0.09321         5.246       1.55e-07 ***
urinarycatheterUnknown              0.81142          0.56718         1.431       0.152538
urinarycatheterYes                  0.24314          0.10584         2.297       0.021610 *
intubationUnknown                  -0.35468          0.71743        -0.494       0.621042
intubationYes                       0.09269          0.32045         0.289       0.772389
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4324.7  on 10646  degrees of freedom
Residual deviance: 4073.4  on 10621  degrees of freedom [14]
AIC: 4125.4 [15]

Number of Fisher Scoring iterations: 6 [16]

[9] Formula for the model
[10] Log-odds ratio for each parameter
[11] Standard error for log-odds ratio for each parameter
[12] Z-statistic for each variable (Estimate / Std. Error)
[13] p-value for each parameter
[14] Residual deviance for model – this is not particularly useful for logistic regression
[15] AIC value for model
[16] Number of iterations before algorithm converged (see Appendix B) – smaller number indicates easier fit


Since all of the explanatory variables were categorical variables, the model summary is quite

long, as information for each level is shown. As with factors in the linear regression

example, the first level of each categorical variable is used as a reference group. For the Age

Group variable, the under 40 group is used as the reference group. It appears that older

patients are more likely to have an HAI than younger patients, with both the 70 – 79 and 80+ groups

having significantly higher odds of an HAI than patients under 40.

However, it should be noted that the parameter estimates are log-odds ratios, not odds

ratios, and the log-odds ratio must be exponentially-transformed to be interpreted. The

odds ratio and confidence intervals for the age variable are shown in Table 2.

Table 2: Odds ratios and confidence intervals for age variable categories

Age Group    Odds Ratio    95% Confidence Interval
< 40         Reference/baseline category
40 – 59      1.35          (0.92, 1.97)
60 – 69      1.35          (0.91, 2.00)
70 – 79      1.59          (1.10, 2.30)
80+          1.50          (1.03, 2.17)

Since all odds ratios are greater than 1, each group has higher estimated odds of contracting an HAI

than the < 40 age group. Taking each group in turn:

· The 40 – 59 age group has 35% higher odds of having an HAI than the < 40 patient

group. This is not a statistically-significant difference as the confidence interval

contains the value 1, meaning that it is plausible that the odds ratio could be 1 (i.e.

no difference).

· The 60 – 69 age group also has 35% higher odds of contracting an HAI than the < 40

patient group. This is also not statistically-significant.

· The 70 – 79 cohort has 59% higher odds of having an HAI than the < 40 cohort (the

odds ratio is the exponential of the estimate 0.46407 in the model output). This

is statistically-significant as the confidence interval does not contain the value 1.

· The 80+ age group has 50% higher odds of having an HAI than the < 40 group (the

exponential of 0.40380). This is a statistically-significant difference.

The odds ratio and 95% confidence intervals for each parameter can be obtained using the

following syntax in R where the logistic regression model has been named ‘model’:

> exp(cbind(coef(model), confint(model)))

The residual plots used for model diagnostics in linear regression are not particularly useful

for binary data.

4.3 Further Examples

A routine example of where logistic regression is used for analysis in PHI is the Hospital

Standardised Mortality Ratio (HSMR). For full details on the methodology behind this

indicator, please see the latest HSMR technical report [17] or this more in-depth analytical

guide to calculating HSMR [18], which also includes further details on how to fit logistic

regression models in R and SPSS.

The HSMR is defined as the number of observed deaths divided by the number of predicted

deaths and gives an indication of the mortality rate in a particular hospital whilst taking

account of various patient risk factors such as age, sex, primary diagnosis and comorbidities.

Since the outcome is binary (either the patient is alive or deceased), then logistic regression

is the obvious method when a multivariate analysis is required.

Once a logistic regression model has been fitted to the data, the log-odds for the

parameters can be used to calculate the predicted probability of death. These probabilities



are then summed to estimate the predicted number of deaths for a particular hospital and

then the HSMR can be calculated. If HSMR > 1, then the number of deaths for the hospital is

greater than predicted; if HSMR < 1, then the number of deaths is fewer than predicted.
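The calculation described above can be sketched as follows. This is only an illustration on simulated patient-level data with hypothetical variable names; it is not the published HSMR methodology, which uses a much richer set of risk factors.

```r
# Simulated (hypothetical) patient-level data; not real hospital data
set.seed(3)
n <- 2000
pat <- data.frame(
  hospital = sample(c("A", "B", "C"), n, replace = TRUE),
  age      = round(runif(n, 18, 95)),
  sex      = sample(c("Female", "Male"), n, replace = TRUE)
)
pat$died <- rbinom(n, 1, plogis(-6 + 0.05 * pat$age))

# Risk model: logistic regression on patient risk factors
fit <- glm(died ~ age + sex, family = binomial, data = pat)

# Predicted probability of death for each patient
pat$p_death <- predict(fit, type = "response")

# Observed and predicted deaths per hospital, then their ratio
observed  <- tapply(pat$died, pat$hospital, sum)
predicted <- tapply(pat$p_death, pat$hospital, sum)
hsmr <- observed / predicted

print(round(hsmr, 2))
```

Across all hospitals combined, the predicted deaths sum to the observed deaths, so the ratios are scattered around 1.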

Another worked example of logistic regression (in SPSS only) can be found in this PHI

methodology paper on trend analysis [19]. This paper also shows how to fit a logistic

regression model in SPSS using the drop-down menus.



5 Poisson Regression

Data that are counts of events, particularly during a specified timeframe, often follow the

Poisson distribution. A Poisson regression model is a type of GLM that can be used to model

counts and rates, either by modelling the rate directly or by modelling the count with some

form of exposure (e.g. time, population differences) taken into account.

Poisson regression has been used in PHI to model incidence rates, particularly when

analysing change in incidence and mortality due to conditions such as cancer.


5.1 Model Formulation and Inference

A Poisson regression model can be expressed as:

$$\log(\lambda) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k \qquad (11)$$

where λ is the rate. The link function for the Poisson GLM is the log link. The model

assumes that the observations are independent.

In many cases, it is necessary to take account of different levels of ‘exposure’ between data

points. For example, in a study investigating the link between mortality and smoking, it may

be appropriate to take into account the length of time each subject has been a smoker. This

is known as the ‘offset’, which is usually included in the model on the logarithmic scale. The

offset must be specified in the model syntax.
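A minimal sketch of specifying an offset in R is shown below. The data are simulated and the variable names (`smoker`, `person_years`, `deaths`) are hypothetical. With a single binary explanatory variable, the fitted rates can be checked against the crude rates in each group.

```r
# Simulated (hypothetical) data: event counts with differing exposure
set.seed(4)
d <- data.frame(
  smoker       = rep(c(0, 1), each = 50),
  person_years = round(runif(100, 50, 500))
)
d$deaths <- rpois(100, lambda = d$person_years * exp(-4 + 0.7 * d$smoker))

# The offset enters on the log scale with its coefficient fixed at 1
fit <- glm(deaths ~ smoker + offset(log(person_years)),
           family = poisson, data = d)

# Exponentiated coefficients: baseline rate and rate ratio
rates <- exp(coef(fit))

# Crude rates per person-year in each group, for comparison
crude <- with(d, tapply(deaths, smoker, sum) / tapply(person_years, smoker, sum))

print(rates)
print(crude)
```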

One of the main considerations of Poisson regression is assessing ‘overdispersion’. If the

variance is much larger than the mean, then the data are deemed to be overdispersed, as

the Poisson distribution assumes that the mean and variance are the same. In such cases,

the standard errors for the parameter estimates will be wrong and, therefore, it cannot be

determined which explanatory variables are statistically-significant and those which are not.

If the data are overdispersed, then a negative binomial GLM (where the mean and variance


are allowed to differ) can be used or a different Poisson GLM could be fitted. The level of

overdispersion in the data can be gauged from the output for a Poisson regression model

in R by comparing the residual deviance with its degrees of freedom.
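Two rough checks for overdispersion are sketched below on simulated data. The quasi-Poisson family in base R is used here as one possible alternative; it is an assumption of this sketch rather than the approach taken in the analysis described in this paper.

```r
# Simulated (hypothetical) overdispersed counts
set.seed(5)
x <- runif(200)
y <- rnbinom(200, mu = exp(1 + x), size = 1.5)

fit <- glm(y ~ x, family = poisson)

# Values well above 1 for either ratio suggest overdispersion
dev_ratio     <- deviance(fit) / df.residual(fit)
pearson_ratio <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)

# A quasi-Poisson fit estimates the dispersion and rescales the standard errors
qfit <- glm(y ~ x, family = quasipoisson)
dispersion <- summary(qfit)$dispersion

print(c(deviance = dev_ratio, pearson = pearson_ratio, quasi = dispersion))
```

The parameter estimates are the same under both fits; only the standard errors (and hence the p-values and confidence intervals) change.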

5.2 Example

For this example, a different simulated HAI data set will be used, available at the same link

(this can only be accessed when connected to the internal NHS National Services Scotland

network). This data set contains information on the number and rate of MRSA and other

infections over a ten-year period as well as data on the number of nurses attending hand

hygiene courses and the number of cleanliness champions in each hospital.

After creating some new temporal variables to ease the model fitting and calculating the

rate of MRSA per 1,000 patients for each quarter, various plots can be constructed to see

how MRSA infection varies according to different explanatory variables, shown in Figure 7.

Figure 7: (a) Scatterplot of MRSA infection rates per quarter over time; (b) Scatterplot of

cumulative number of cleanliness champions against MRSA infection rates per quarter; (c) Scatterplot of cumulative number of nurses attending hand hygiene course against MRSA infection

rates per quarter


Figure 7(a) shows the rate of MRSA infection for each quarter over the decade. It appears

that the rate of infection stayed steady until 2006, when a long-term decrease began.

Figures 7(b) and 7(c) show that the rate of infection decreased as the cumulative numbers

of cleanliness champions and nurses attending hand hygiene courses increased.

A Poisson regression model with three explanatory variables can then be fitted. Note that

the response variable here is not the rate but the count of MRSA infections. The offset

term, the number of beds, is used to calculate the rate of MRSA infection within the model

itself. Alternatively, the rate could be modelled with no offset and the same results should

be achieved. As aforementioned, the offset is usually log-transformed. Including the

quarter of the year in the analysis may be useful as MRSA infection rates may exhibit a

seasonal pattern (i.e. more infections in the winter).

> summary(model)

glm(formula = MRSA ~ offset(log(beds)) + Cum.CCCC + Cum.SAHHMC + Qtr,
    family = "poisson", data =  [20]

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.0579  -0.8973  -0.1833   0.8453   2.9707

Coefficients:
             Estimate [21]  Std. Error [22]  z value [23]  Pr(>|z|) [24]
(Intercept)     -8.564e+00        2.647e-02      -323.518        < 2e-16 ***
Cum.CCCC         5.838e-05        1.426e-05         4.095       4.21e-05 ***
Cum.SAHHMC      -2.369e-04        2.238e-05       -10.586        < 2e-16 ***
Qtr2            -7.183e-02        3.351e-02        -2.144         0.0321 *
Qtr3            -1.575e-01        3.488e-02        -4.515       6.33e-06 ***
Qtr4            -2.188e-02        3.383e-02        -0.647         0.5178
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1537.786  on 39  degrees of freedom
Residual deviance:   72.403  on 34  degrees of freedom [25]
AIC: 356.13 [26]

Number of Fisher Scoring iterations: 4 [27]

[20] Formula for the model
[21] Parameter estimate for explanatory variable (on log-scale)
[22] Standard error for parameter estimate
[23] Z-statistic for variable (Estimate / Std. Error)
[24] p-value for parameter
[25] Residual deviance for model (should ideally be lower than the number of degrees of freedom)
[26] AIC value for model
[27] Number of iterations before algorithm converged (see Appendix B) – smaller number indicates easier fit


The GLM output is very similar to that for logistic regression. As was also the case for

logistic regression, the parameter coefficients are presented on the logarithmic scale and

must be transformed back for interpretation. These are shown below along with 95%

confidence intervals.

> exp(cbind(coef(model), confint(model)))
Waiting for profiling to be done...
                                2.5 %      97.5 %
(Intercept) 0.0001908267 0.0001811201 0.000200925
Cum.CCCC    1.0000583852 1.0000304296 1.000086316
Cum.SAHHMC  0.9997631308 0.9997192218 0.999806928
Qtr2        0.9306929696 0.8714915843 0.993829123
Qtr3        0.8542988172 0.7977607740 0.914652741
Qtr4        0.9783557322 0.9155275689 1.045376243

It appears that the time of year may be an important predictor of MRSA infection. Quarter

1 (January to March) is used as the reference group. There is no statistically-significant

difference between Quarter 1 and Quarter 4 (October to December). The confidence

interval for Quarter 2 (April to June) shows some evidence of a significant difference but not

particularly strong evidence, as the confidence interval is very close to containing the value

1. There does appear to be a highly-significant difference between Quarter 1 and Quarter 3

(July to September).

It seems that the cumulative number of cleanliness champions and the cumulative number

of nurses with hand hygiene training are statistically-significant predictors of MRSA

infection. However, notice that the estimate and confidence interval for cleanliness

champions is positive, meaning that the rate of MRSA has increased as the cumulative

number of cleanliness champions has increased. This seems counter-intuitive, especially as

Figure 7(b) clearly showed that the rate had decreased as the number of cleanliness

champions had increased.

The residual deviance for this model (72.403) is much higher than the number of degrees of

freedom (34), meaning that this model is a poor fit to these data, as the data are

overdispersed. Therefore, the p-values and confidence intervals for the parameters should

not be trusted, particularly those which are close to the borderline in terms of statistical significance.
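The comparison of the residual deviance with its degrees of freedom can be made explicit with a short calculation using the two figures quoted from the model output. The chi-squared p-value is only a rough guide, as it relies on a large-sample approximation.

```r
# Figures quoted from the Poisson model output
resid_dev <- 72.403
resid_df  <- 34

# Ratio of residual deviance to degrees of freedom
# (roughly 1 would be expected for a well-fitting Poisson model)
ratio <- resid_dev / resid_df

# Approximate chi-squared goodness-of-fit p-value for the residual deviance
p_value <- pchisq(resid_dev, df = resid_df, lower.tail = FALSE)

print(round(ratio, 2))
```

The ratio is a little over 2, i.e. the residual deviance is roughly twice what would be expected if the Poisson assumption held.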



In the original Strathclyde University analysis, a changepoint analysis (where the data were

split into pre-2006 and 2006-onwards and modelled separately) was then carried out, as it

was shown in Figure 7(a) that MRSA rates did not change much until 2006, despite the

cumulative numbers of cleanliness champions and nurses with hand hygiene training having

increased every quarter. There may be other variables (known as ‘confounding variables’)

at work here, as the model has not fit the data well and it has produced some strange

results (with the cleanliness champions, for instance). Also, the issue of multicollinearity

may have had an effect here, with the correlation matrix for the three explanatory variables

shown below.

> with(, cor(cbind(quarter, Cum.SAHHMC, Cum.CCCC)))
             quarter Cum.SAHHMC  Cum.CCCC
quarter    1.0000000  0.9509023 0.9821609
Cum.SAHHMC 0.9509023  1.0000000 0.9849226
Cum.CCCC   0.9821609  0.9849226 1.0000000

The correlations between the three variables are very high (all > 0.95), meaning that the p-

values and confidence intervals for the explanatory variables could be affected. (If one

refits the model by dropping the ‘nurses with hand hygiene training’ variable out, then the

‘cleanliness champions’ variable remains statistically-significant but this time the estimate is

negative, as expected, but overdispersion remains).

This has been an example of where a regression model has been unsuccessful in capturing

the variability in the response variable and has several key flaws, such as overdispersion and multicollinearity.


5.3 Further Examples

A peer-reviewed publication involving staff from PHI in which Poisson regression was used

was the analysis of “patterns of second primary cancers following first primary lung cancers”

(Chuang et al., 2010) [28], accounting for differences in lung cancer between men and women.

The analysis initially involved calculating standardised incidence ratios (SIRs) based on the

observed and expected number of patients who developed a second cancer following a first



primary lung cancer. However, the expected count was based on aggregated person-year

data and not on person-level data.

Poisson regression models were then fitted with the output being used to calculate the

‘absolute excess risk’ (AER), which is “a measure for estimating the absolute burden or

magnitude of a health problem” (Chuang et al., 2010) and is expressed as per 100,000

person-years. Chuang et al. (2010) defined the AER as “the difference between the

observed and expected number of second primary cancers divided by the total number of

person-years at risk”. In this example, the total number of person-years ‘at risk’ is the

exposure or the offset. Converting the output from the model into the AER statistic also

allowed the results to be expressed on a scale which provided easier, real-life interpretation

for an audience.

Poisson regression was also one of the techniques used in ISD’s high-profile cancer survival

analyses, in collaboration with Macmillan Cancer Support. The reports from the project,

focussing specifically on whether there is an association between cancer survival and

deprivation (Scottish Index of Multiple Deprivation quintile), can be viewed on the

Macmillan website [29].

A further example of Poisson regression (in SPSS only) can be found in this PHI methodology

paper on trend analysis [30]. This paper also shows how to fit a Poisson regression model in

SPSS using the drop-down menus.






Box, G.E.P. (1976). “Science and Statistics”. Journal of the American Statistical Association.

71: 791 – 799.

Burnham, K.P. and Anderson, D.R. (2002). Model Selection and Multimodel Inference: A

Practical Information-Theoretical Approach. Second Edition. Springer-Verlag.

Chuang, S.C. et al. (2010). “Risks of second primary cancer among patients with major

histological types of lung cancers in both men and women”. British Journal of Cancer. 102

(7): 1190 – 1195.

Dalgaard, P. (2008). Introductory Statistics with R. Second Edition. New York: Springer.

Mendenhall, W. and Beaver, R.J. (1994). Introduction to Probability and Statistics. Ninth

Edition. Belmont, California: Duxbury Press.

R Core Team (2015). R: A language and environment for statistical computing. R

Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL

Szumilas, M. (2010). “Explaining odds ratios”. Journal of the Canadian Academy of Child

and Adolescent Psychiatry. 19 (3): 227 – 229.

This paper has also made use of lecture notes from the School of Mathematics and

Statistics, University of Glasgow, for the descriptions of least squares and IRWLS in

Appendices A and B, respectively.



A Model Fitting for Linear Models

Regression models are typically fitted using the Method of Least Squares, which can be

defined formally as choosing “the ‘best-fitting’ line that minimises the sum of squares of the

deviations of the observed values of y from those predicted” (Mendenhall and Beaver,

1994). In other words, the Method of Least Squares tries to find the best-fitting line to the

data and estimates the parameters for the explanatory variables (β1, β2, ⋯ , βk).

To describe this process, Equation 2 (Page 4) will be converted into matrix algebra i.e.

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \qquad \text{(A1)}$$

where:


· y is a vector representing the response variable

· X is the so-called ‘design matrix’ representing the explanatory variables

· β is the vector of unknown parameters associated with the explanatory variables

· ε is the vector of error terms

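To estimate the model parameters, Equation A1 is reformulated to:

$$\boldsymbol{\varepsilon} = \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \qquad \text{(A2)}$$

and then each side is multiplied by its own transpose, which gives the sum of the squared errors:

$$\boldsymbol{\varepsilon}^T\boldsymbol{\varepsilon} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \qquad \text{(A3)}$$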

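where T denotes the transpose. The left hand side of Equation A3 is known as the

‘sum of squares for the error’ and typically denoted ‘SSE’ i.e.

$$\mathrm{SSE} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \qquad \text{(A4)}$$

The parameter estimates for β are then derived from Equation A4 as follows:

$$\mathrm{SSE} = \mathbf{y}^T\mathbf{y} - \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} \qquad \text{(A5)}$$

$$\mathrm{SSE} = \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} \qquad \text{(A6)}$$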

The latter line is true since the transpose of a scalar is the scalar. This expression is then

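differentiated with respect to β and set equal to zero.

$$\frac{\partial\, \mathrm{SSE}}{\partial \boldsymbol{\beta}} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{0} \qquad \text{(A7)}$$

Hence, the estimate for β, β̂, is:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \qquad \text{(A8)}$$

β̂ can only be estimated if the inverse of XᵀX exists and this is only possible if the design

matrix is of full rank (i.e. all columns are linearly independent of each other). This causes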

issues if some explanatory variables are highly correlated with each other (see Section 2.4

for a discussion on multicollinearity).
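The least squares estimate in Equation A8 can be reproduced with a few lines of matrix code and compared with lm(). The data below are simulated and the variable names are hypothetical.

```r
# Simulated (hypothetical) data
set.seed(6)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

# Design matrix with a leading column of ones for the intercept
X <- cbind(1, x1, x2)

# Solve (X'X) beta = X'y; solve(A, b) avoids forming the inverse explicitly
beta_hat <- solve(crossprod(X), crossprod(X, y))

# lm() should give the same estimates (it uses a QR decomposition internally)
fit <- lm(y ~ x1 + x2)
print(cbind(by_hand = drop(beta_hat), lm = coef(fit)))
```

If two columns of X were exact copies of each other, XᵀX would be singular and solve() would fail, which is the full-rank requirement described above.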

If an explanatory variable is categorical, then it is necessary to add constraints so that XᵀX

can be inverted. The most common method (and the default in R and SPSS) is to set one

level of these variables to be a reference group. For example, if categorical variable αi had

i = 1, 2, 3 levels, then α1 = 0. α2 and α3 are then interpreted in relation to α1. The idea of

constraining groups is a concept more closely related to analysis of variance (ANOVA) rather

than regression modelling but there is overlap between the two approaches.
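The reference-group constraint can be seen in the design matrix that R builds for a factor. The three-level factor below is a hypothetical example; relevel() is the base R function for choosing a different reference level.

```r
# Hypothetical three-level factor
size <- factor(c("Large", "Medium", "Small", "Medium", "Large"))

# The first level ("Large") is absorbed into the intercept,
# leaving indicator columns for the other two levels
X <- model.matrix(~ size)
print(colnames(X))

# relevel() changes which level acts as the reference group
size2 <- relevel(size, ref = "Small")
X2 <- model.matrix(~ size2)
print(colnames(X2))
```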

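Once β has been estimated, the ‘fitted values’ for the model are:

$$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} \qquad \text{(A9)}$$

and this result leads to an estimate of the error, known as the ‘residuals’.

$$\hat{\boldsymbol{\varepsilon}} = \mathbf{y} - \hat{\mathbf{y}} \qquad \text{(A10)}$$

The residuals can be thought of as the response variable minus the estimate of the response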

variable given the explanatory variables. In other words, this is summarising how well the

model reflects the observed data. The goodness-of-fit of the model can be assessed

partially through an analysis of the residuals, discussed in Section 2.5.1.


B Model Fitting for Generalised Linear Models

Although different GLMs depend on different probability distributions for the error, they are

all fitted in a similar manner.

In Appendix A, the parameter estimates β̂ could be obtained directly through least

squares, but this is not possible when the error distribution is not normal.

Instead, numerical optimisation must be used through a process called iteratively

reweighted least squares (IRWLS).

Following on from Equation 6 (Page 19) and using the two equations $\eta_i = g(\mu_i)$ and $\mu_i = E(Y_i)$, a first-order Taylor series expansion for $g(y_i)$ about the value $\mu_i$ can be expressed as:

$$g(y_i) \approx g(\mu_i) + (y_i - \mu_i)\,g'(\mu_i) \tag{B1}$$

$$= \eta_i + (y_i - \mu_i)\,\frac{\partial \eta_i}{\partial \mu_i} \tag{B2}$$

$$= z_i \tag{B3}$$

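and the weights are:

$$w_{ii} = \frac{1}{\mathrm{Var}(Y_i)\left(\dfrac{\partial \eta_i}{\partial \mu_i}\right)^{2}} \tag{B4}$$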
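As a worked illustration (not part of the original derivation), the adjusted dependent variable and the weights take a simple form for the two GLMs used in this paper. The lines below assume the usual IRWLS weight $w_{ii} = 1/[\mathrm{Var}(Y_i)\,(\partial\eta_i/\partial\mu_i)^2]$, a Poisson response with log link, and a binary (0/1) response with logit link.

```latex
\begin{align*}
\text{Poisson, log link:}\quad
  & \eta_i = \log \mu_i, \qquad
    \frac{\partial \eta_i}{\partial \mu_i} = \frac{1}{\mu_i}, \qquad
    \mathrm{Var}(Y_i) = \mu_i \\
  & z_i = \eta_i + \frac{y_i - \mu_i}{\mu_i}, \qquad
    w_{ii} = \frac{1}{\mu_i\,(1/\mu_i)^2} = \mu_i \\[1ex]
\text{Binary, logit link:}\quad
  & \eta_i = \log\frac{\mu_i}{1 - \mu_i}, \qquad
    \frac{\partial \eta_i}{\partial \mu_i} = \frac{1}{\mu_i(1 - \mu_i)}, \qquad
    \mathrm{Var}(Y_i) = \mu_i(1 - \mu_i) \\
  & z_i = \eta_i + \frac{y_i - \mu_i}{\mu_i(1 - \mu_i)}, \qquad
    w_{ii} = \mu_i(1 - \mu_i)
\end{align*}
```

In both cases the weight equals the variance function evaluated at the current fitted mean, which is a general property of canonical links.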






The IRWLS algorithm consists of the following steps:

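1. Set initial estimates $\hat{\eta}_i^{0}$ and $\hat{\mu}_i^{0}$.

2. Set the ‘adjusted dependent variable’ to be:

$$z_i^{0} = \hat{\eta}_i^{0} + (y_i - \hat{\mu}_i^{0})\,\left.\frac{\partial \eta_i}{\partial \mu_i}\right|_{\hat{\mu}_i^{0}} \tag{B5}$$

3. Form the weights $w_{ii}^{0}$.

4. Minimise the sum of squares $\lVert \sqrt{\mathbf{W}}\,(\mathbf{z} - \mathbf{X}\boldsymbol{\beta}) \rVert^{2}$ with respect to β, where W is an (n x n) diagonal matrix with elements $w_{ii}$.

5. Form $\boldsymbol{\eta}^{1} = \mathbf{X}\hat{\boldsymbol{\beta}}^{1}$ and $\boldsymbol{\mu}^{1}$.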

6. Repeat steps 1 to 5 until convergence, with the new estimates achieved at Step 5

forming the new ‘initial’ estimates for Step 1 at the next iteration.

Theoretically, any initial estimates for the two parameters at Step 1 in the first iteration can be chosen, but it is common to use $\mu_i^{0} = y_i$ and $\eta_i^{0} = g(\mu_i^{0})$. The number of iterations depends on the data and the parameters, but a stopping rule is often applied: iteration ends once the difference in the parameter estimates from one iteration to the next becomes small enough (i.e. the parameter estimates have converged).
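The steps of the algorithm can be sketched in R for one specific case, a Poisson model with log link, and checked against `glm()`. This is a minimal illustration under stated assumptions rather than the routine that `glm()` itself runs: the data are simulated, the function name `irwls_poisson` and its arguments are hypothetical, and the starting value adds 0.1 to `y` so that the logarithm is defined when a count is zero.

```r
# Simulated count data; names and values are illustrative assumptions
set.seed(3)
n <- 200
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.5 + 0.8 * x))
X <- cbind(1, x)

# Hypothetical helper: IRWLS for a Poisson model with log link
irwls_poisson <- function(X, y, tol = 1e-10, max_iter = 50) {
  # Step 1: initial estimates (0.1 is added so that log(0) is avoided)
  mu   <- y + 0.1
  eta  <- log(mu)
  beta <- rep(0, ncol(X))
  for (iter in seq_len(max_iter)) {
    # Step 2: adjusted dependent variable; for the log link d(eta)/d(mu) = 1/mu
    z <- eta + (y - mu) / mu
    # Step 3: weights; for the Poisson with log link these reduce to mu
    w <- mu
    # Step 4: weighted least squares, i.e. solve (X'WX) beta = X'Wz
    beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
    # Step 5: update the linear predictor and the fitted means
    eta <- drop(X %*% beta_new)
    mu  <- exp(eta)
    # Step 6: stop once the estimates change by less than the tolerance
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  list(coefficients = beta, iterations = iter)
}

fit_irwls <- irwls_poisson(X, y)
fit_glm   <- glm(y ~ x, family = poisson)
print(rbind(irwls = fit_irwls$coefficients, glm = unname(coef(fit_glm))))
```

The two sets of estimates should agree to several decimal places, with convergence typically reached in a handful of iterations.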

Minimising the sum of squares in Step 4 leads to:

$$(\mathbf{X}^{T}\mathbf{W}\mathbf{X})\,\boldsymbol{\beta} = \mathbf{X}^{T}\mathbf{W}\mathbf{z} \tag{B6}$$

which is similar to Equation A8 for linear regression except that the matrix of weights, W, is



C R Code for Analyses

Linear Regression (Section 2.6)

# Read in csv file with first row for variable names (available at this link – only accessible
# when connected to the internal NHS National Services Scotland network)
# Note: 'hai' is a placeholder name for the data frame
hai <- read.csv("//stats/cl-out/Datasets for SAG Statistical Papers/Simulated_HAI_Prevalence_Data.csv", header = TRUE)

# Shows the class (variable type) for each variable
sapply(hai, class)

# Subset the data to remove incomplete cases (spaces required due to how data have been
# coded)
hai <- subset(hai, Discharged == 1)
hai <- subset(hai, Sex != "Unknown")
hai <- subset(hai, Specialty != "Not Known ")
hai <- subset(hai, Hospital.category != "Non Acute ")
hai <- subset(hai, Hospital.category != "Obsetrics ")
hai <- subset(hai, centralcatheter != "Unknown")
hai <- subset(hai, peripheralcath != "Unknown")
hai <- subset(hai, urinarycatheter != "Unknown")

# Histograms of Total.Stay and log(Total.Stay + 1)
par(mfrow = c(1, 2))
hist(hai$Total.Stay, main = "(a)", xlab = "Total.Stay")
hist(log(hai$Total.Stay + 1), main = "(b)", xlab = "log(Total.Stay + 1)")

# Reset plot window
par(mfrow = c(1, 1))

# Scatterplot of Age vs. log(Total.Stay + 1)
plot(hai$Age, log(hai$Total.Stay + 1),
     main = "Length of Stay by Age", xlab = "Age (years)", ylab = "log(Total_Stay + 1)")

# Boxplot of HAI status vs. log(Total.Stay + 1)
boxplot(log(hai$Total.Stay + 1) ~ factor(hai$HAI),
        main = "Length of Stay by HAI Status", xlab = "HAI", ylab = "log(Total_Stay + 1)")

# Fit linear regression model for log(Total.Stay + 1)
# With the exception of 'HAI', other categorical variables are already designated as 'factors'
model <- lm(log(Total.Stay + 1) ~ Age + factor(HAI) + Sex + Hospital.Size + Surgery +
              Prognosis + centralcatheter + peripheralcath + urinarycatheter, data = hai)

# Model output
summary(model)

# Estimates and 95% confidence intervals for parameters
cbind(model$coefficients, confint(model))

# Analysis of variance
anova(model)

# Scatterplot of model fitted values against model residuals (assumption 3)
plot(fitted(model), residuals(model), xlab = "Fitted Values", ylab = "Residuals", main = "")

# Calculate mean of residuals (should be close to zero)
mean(residuals(model))

# Normal Q-Q plot of residuals (assumption 4)
qqnorm(residuals(model)); qqline(residuals(model))


Logistic Regression (Section 4.2)

# Read in csv file with first row for variable names (available at this link – only accessible
# when connected to the internal NHS National Services Scotland network)
# Note: 'hai' is a placeholder name for the data frame
hai <- read.csv("//stats/cl-out/Datasets for SAG Statistical Papers/Simulated_HAI_Prevalence_Data.csv", header = TRUE)

# Shows the class (variable type) for each variable
sapply(hai, class)

# Recode HAI variable from 'No' and 'Yes' to '0' and '1', respectively
hai$HAI <- ifelse(hai$HAI == "Yes", 1, 0)

# Create new variable Age_Group from Age variable
hai$Age_Group <- cut(hai$Age, breaks = c(14, 39, 59, 69, 79, 104),
                     include.lowest = TRUE, labels = c("<40", "40-59", "60-69", "70-79", "80+"))

# Fit logistic regression model with logit link for HAI status
model <- glm(HAI ~ Age_Group + Sex + Hospital.category + Hospital.Size + Surgery +
               Prognosis + centralcatheter + peripheralcath + urinarycatheter + intubation,
             data = hai, family = "binomial")

# Model output
summary(model)

# Estimates and 95% confidence intervals for parameters (profile likelihood)
exp(cbind(coef(model), confint(model)))

Poisson Regression (Section 5.2)

# Read in csv file with first row for variable names (available at this link – only accessible
# when connected to the internal NHS National Services Scotland network)
# Note: 'rates' is a placeholder name for the data frame
rates <- read.csv("//stats/cl-out/Datasets for SAG Statistical Papers/HAI_rates_MRSA_MSSA_CD_HH.csv", header = TRUE)

# Shows the class (variable type) for each variable
sapply(rates, class)

# Create 'Time' variable from 'Year' and 'Quarter' variables
rates$Time <- rates$Year + (rates$Quarter - 1)/4

# Create categorical 'Qtr' variable from 'Quarter' variable
rates$Qtr <- as.factor(rates$Quarter)

# Calculate MRSA rate per 1,000
rates$MRSA_Rate <- rates$MRSA/rates$beds*1000

# Set plot window
layout(matrix(c(1, 1, 2, 3), 2, 2, byrow = TRUE))

# Scatterplot of Time against MRSA rate
plot(rates$Time, rates$MRSA_Rate, main = "(a)",
     xlab = "Year (one data point per quarter)", ylab = "Rate of MRSA Infection per 1,000")

# Scatterplot of cumulative total of cleanliness champions against MRSA rate
plot(rates$Cum.CCCC, rates$MRSA_Rate, main = "(b)",
     xlab = "Cleanliness Champions", ylab = "Rate of MRSA Infection per 1,000")

# Scatterplot of cumulative total of nurses with hand hygiene training against MRSA rate
plot(rates$Cum.SAHHMC, rates$MRSA_Rate, main = "(c)",
     xlab = "Nurses Attending Hand Hygiene Course", ylab = "Rate of MRSA Infection per 1,000")

# Fit Poisson regression model for MRSA count with the number of beds as offset
model <- glm(MRSA ~ offset(log(beds)) + Cum.CCCC + Cum.SAHHMC + Qtr,
             data = rates, family = "poisson")

# Model output
summary(model)

# Estimates and 95% confidence intervals for parameters (profile likelihood)
exp(cbind(coef(model), confint(model)))

# Correlation matrix for three explanatory variables
with(rates, cor(cbind(quarter, Cum.SAHHMC, Cum.CCCC)))

# Additional code for fitting negative binomial regression model – requires MASS library
library(MASS)

model.nb <- glm.nb(MRSA ~ offset(log(beds)) + Cum.CCCC + Cum.SAHHMC + Qtr,
                   data = rates)



D SPSS Syntax for Analyses

Linear Regression (Section 2.6)

* Read in csv file (available at this link – only accessible when connected to the internal NHS

National Services Scotland network).

GET DATA
  /TYPE = TXT
  /FILE = '//conf/linkage/output/Datasets for SAG Statistical Papers/Simulated_HAI_Prevalence_Data.csv'
  /ENCODING = 'UTF8'
  /DELCASE = LINE
  /DELIMITERS = ","
  /ARRANGEMENT = DELIMITED
  /FIRSTCASE = 2
  /IMPORTCASE = ALL
  /VARIABLES =
    counter F5.0 Hospital.Type A50 Hospital.category A50 Hospital.size A50
    Sex A10 Specialty A50 Age F3.0 Surgery A50 Prognosis A50
    centralcatheter A50 peripheralcath A50 urinarycatheter A50 intubation A50
    antimicrobials_2 A50 HAI A3 Total.Stay F4.0 Discharged A1
    Time.To.Survey F10.0 timetohai F10.0.
CACHE.
EXECUTE.


* Subset the data to remove incomplete cases.

SELECT IF Discharged EQ '1'.
EXECUTE.
SELECT IF Sex <> 'Unknown'.
EXECUTE.
SELECT IF Specialty <> 'Not Known'.
EXECUTE.
SELECT IF Hospital.category <> 'Non Acute'.
EXECUTE.
SELECT IF Hospital.category <> 'Obsetrics'.
EXECUTE.
SELECT IF centralcatheter <> 'Unknown'.
EXECUTE.
SELECT IF peripheralcath <> 'Unknown'.
EXECUTE.
SELECT IF urinarycatheter <> 'Unknown'.
EXECUTE.

* Calculate log-transformed total length of stay variable.

COMPUTE Total.Stay.Transformed = LN(Total.Stay + 1).
EXECUTE.

* Histograms of Total.Stay and log(Total.Stay + 1).

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Total.Stay[name="Total_Stay"]
    MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Total_Stay=col(source(s), name("Total_Stay"))
  GUIDE: axis(dim(1), label("Total.Stay"))
  GUIDE: axis(dim(2), label("Frequency"))
  ELEMENT: interval(position(summary.count(bin.rect(Total_Stay))), shape.interior(shape.square))
END GPL.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Total.Stay.Transformed[name="Total_Stay_Transformed"]
    MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Total_Stay_Transformed=col(source(s), name("Total_Stay_Transformed"))
  GUIDE: axis(dim(1), label("Total.Stay.Transformed"))
  GUIDE: axis(dim(2), label("Frequency"))
  ELEMENT: interval(position(summary.count(bin.rect(Total_Stay_Transformed))), shape.interior(shape.square))
END GPL.

* Scatterplot of Age vs. log(Total.Stay + 1).

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Age Total.Stay.Transformed[name="Total_Stay_Transformed"]
    MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Age=col(source(s), name("Age"))
  DATA: Total_Stay_Transformed=col(source(s), name("Total_Stay_Transformed"))
  GUIDE: axis(dim(1), label("Age"))
  GUIDE: axis(dim(2), label("Total.Stay.Transformed"))
  ELEMENT: point(position(Age*Total_Stay_Transformed))
END GPL.

* Boxplot of HAI status vs. log(Total.Stay + 1).



  SOURCE: s=userSource(id("graphdataset"))
  DATA: HAI=col(source(s), name("HAI"), unit.category())
  DATA: Total_Stay_Transformed=col(source(s), name("Total_Stay_Transformed"))
  DATA: id=col(source(s), name("$CASENUM"), unit.category())
  GUIDE: axis(dim(1), label("HAI"))
  GUIDE: axis(dim(2), label("Total.Stay.Transformed"))
  SCALE: linear(dim(2), include(0))
  ELEMENT: schema(position(bin.quantile.letter(HAI*Total_Stay_Transformed)), label(id))
END GPL.

* Fit linear regression model for log(Total.Stay + 1).

GENLIN Total.Stay.Transformed BY HAI Sex Hospital.size Surgery Prognosis centralcatheter peripheralcath urinarycatheter (ORDER = DESCENDING)




  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: MeanPredicted=col(source(s), name("MeanPredicted"))
  DATA: Residual=col(source(s), name("Residual"))
  GUIDE: axis(dim(1), label("Predicted Value of Mean of Response"))
  GUIDE: axis(dim(2), label("Raw Residual"))
  ELEMENT: point(position(MeanPredicted*Residual))
END GPL.

* Normal Q-Q plot of residuals (assumption 4).


Logistic Regression (Section 4.2)

* Read in csv file (available at this link – only accessible when connected to the internal NHS

National Services Scotland network).

GET DATA
  /TYPE = TXT
  /FILE = '//conf/linkage/output/Datasets for SAG Statistical Papers/Simulated_HAI_Prevalence_Data.csv'
  /ENCODING = 'UTF8'
  /DELCASE = LINE
  /DELIMITERS = ","
  /ARRANGEMENT = DELIMITED
  /FIRSTCASE = 2
  /IMPORTCASE = ALL
  /VARIABLES =
    counter F5.0 Hospital.Type A50 Hospital.category A50 Hospital.size A50
    Sex A10 Specialty A50 Age F3.0 Surgery A50 Prognosis A50
    centralcatheter A50 peripheralcath A50 urinarycatheter A50 intubation A50
    antimicrobials_2 A50 HAI A3 Total.Stay F4.0 Discharged A1
    Time.To.Survey F10.0 timetohai F10.0.
CACHE.
EXECUTE.

* Create new variable for HAI with ‘No’ and ‘Yes’ recoded to ‘0’ and ‘1’, respectively.

COMPUTE HAI_Numeric = 0.
IF HAI = 'Yes' HAI_Numeric = 1.
EXECUTE.

* Create new variable Age_Group from Age variable.

STRING Age_Group (A10).
IF Age < 40 Age_Group = '< 40'.
IF Age >= 40 AND Age < 60 Age_Group = '40 - 59'.
IF Age >= 60 AND Age < 70 Age_Group = '60 - 69'.
IF Age >= 70 AND Age < 80 Age_Group = '70 - 79'.
IF Age >= 80 Age_Group = '80+'.
EXECUTE.

* Fit logistic regression model with logit link for HAI status.

GENLIN HAI (REFERENCE = FIRST) BY Age_Group Sex Hospital.category Hospital.size Surgery Prognosis
    centralcatheter peripheralcath urinarycatheter intubation (ORDER = DESCENDING)
  /MODEL Age_Group Sex Hospital.category Hospital.size Surgery Prognosis
    centralcatheter peripheralcath urinarycatheter intubation
    INTERCEPT = YES DISTRIBUTION = BINOMIAL LINK = LOGIT



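Poisson Regression (Section 5.2)

* Read in csv file (available at this link – only accessible when connected to the internal NHS

National Services Scotland network).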

GET DATA
  /TYPE = TXT
  /FILE = '//conf/linkage/output/Datasets for SAG Statistical Papers/HAI_rates_MRSA_MSSA_CD_HH.csv'
  /ENCODING = 'UTF8'
  /DELCASE = LINE
  /DELIMITERS = ","
  /ARRANGEMENT = DELIMITED
  /FIRSTCASE = 2
  /IMPORTCASE = ALL
  /VARIABLES =
    quarter_text A20 quarter A2 MRSA F3.0 beds F10.0 MRSA_rate F10.6
    MSSA F3.0 CD.65 F10.0 beds.65 F10.0 CD.Under65 F10.0 beds.Under65 F10.0
    Year F4.0 Quarter_2 F2.0 Total.CCCC F3.0 Cum.CCCC F5.0
    Total.SAHHMC F4.0 Cum.SAHHMC F5.0.
CACHE.
EXECUTE.

* Create log-transformed beds variable to use as offset in model.

COMPUTE log.beds = LN(beds).
EXECUTE.

* Fit Poisson regression model.



E SPSS Output from Analyses

Figure E1: SPSS equivalent of Figure 4

Figure E2: SPSS equivalent of Figure 5

Figure E3: SPSS equivalent of Figure 6


The SPSS output below is from the linear model in Section 2.6 and is equivalent to the R

output on Page 14.


The SPSS output below is from the logistic regression model in Section 4.2 and is equivalent

to the R output on Page 23.


The SPSS output below is from the Poisson regression model in Section 5.2 and is equivalent

to the R output on Page 29.