Q-learning: A data analysis method for constructing adaptive interventions


Technical Report Feb 2010

Inbal Nahum-Shani, Min Qian, William E. Pelham, Beth Gnagy, Greg Fabiano, Jim Waxmonsky, Jihnhee Yu, Susan Murphy.

Inbal Nahum-Shani: The Methodology Center, Pennsylvania State University, 204 E. Calder Way, Suite 400, State College, PA 16801

Min Qian and Susan Murphy: Department of Statistics and Quantitative Methodology Program, Institute for Social Research, University of Michigan 2068 Ann Arbor, MI 48106-1248.

William E. Pelham, Beth Gnagy, Greg Fabiano, Jim Waxmonsky, Jihnhee Yu: Center for Children and Families, State University of New York at Buffalo, 318 Diefendorf Hall 3435 Main Street, Building 20, Buffalo, NY 14214.


In recent years, research on treatment and intervention development has shifted from the

traditional “one size fits all” concept underlying the standard fixed intervention strategies, to

developing adaptive interventions, in which the dose or type of services that are offered to clients

are individualized based on clients’ characteristics or clinical presentation, and then readjusted in

response to their ongoing performance in treatment (Marlowe et al. 2008: 343). In the first approach, the composition and dosage of the intervention are not adjusted in response to the needs or characteristics of individual subjects. The latter approach is based on the notion that individuals differ in their responses to treatment, so that for a program to be effective, the intervention should vary over time in response to the needs of the individual. In this sense,

researchers are becoming increasingly interested in developing interventions that adapt to the

dynamics of the system of interest (e.g., individuals, social groups) via decision rules that

recommend when and how the intervention should be modified in order to maximize long term

outcomes. These recommendations are based not only on subjects’ characteristics but also on

outcomes collected during the intervention, such as subjects’ response and adherence. It follows that dynamically adaptive interventions are time-varying interventions that adapt to subjects’

intermediate outcomes and characteristics. These types of interventions are also known as

‘dynamic treatment regimes’ (Murphy et al., 2001; Robins, 1986), ‘adaptive treatment strategies’

(Lavori & Dawson, 2000; Murphy 2005), ‘multi-stage treatment strategies’ (Thall et al. 2002;

Thall & Wathen 2005), ‘treatment policies’ (Lunceford et al. 2002; Wahed & Tsiatis 2004, 2006)

or ‘individualized treatment rules’ (Petersen et al. 2007; van der Laan & Petersen 2007).

The conceptual advantages of adaptive interventions have long been recognized by

behavioral and social scientists. For example, in the area of learning and education, Brown’s

(1992) discussion of knowledge acquisition emphasizes the need to take into account the variable


responses of students when conducting clinical interviews and tests. Brown suggests a form of

“dynamic assessment” where the interviewer follows a process guided by decision rules in order

to measure students’ emergent competence and open the window of opportunity for learning. In

the area of organizational behavior, Martocchio and Webster (1990) studied the effect of

feedback on employee performance in microcomputer software training, stressing the need to

develop training programs in which training design characteristics are adapted to employee level

of cognitive playfulness (cognitive spontaneity in human-computer interactions). Recently, in

their study of career goal setting, Hirschi and Vondracek (2009) conceptualized the

development and adaptation of goals as a dynamic process, where individuals have to select

goals according to personal preferences and environmental opportunities and limitations,

optimize their behavior to achieve those goals, and compensate and adjust if goals become

unattainable or unattractive. In the area of psychotherapy, Laurenceau, Hayes and Feldman (2007) noted that growth and change in psychotherapy have been conceptualized as a dynamic process reflecting destabilization of a stable behavioral and emotional pattern, which thereby increases the need to develop more adaptive patterns. Still, despite the conceptual appeal of adaptive interventions to behavioral and social scientists, data analysis methods for informing their construction are still in their infancy (Murphy, Collins and Rush, 2007). Accordingly, in the current study we aim to introduce

a data analysis method useful in constructing adaptive interventions to researchers in the

behavioral and social sciences.

We begin by discussing adaptive interventions as a vehicle for operationalizing

sequential decision making in behavioral interventions. We then discuss a new, yet

straightforward, method for data analysis, called Q-learning. Q-learning is an analysis method


drawn from computer science that can be used to inform the construction of an adaptive

intervention. We illustrate how this method can be applied to formulate adaptive interventions

using the Adaptive Interventions for Children with ADHD study (Center for Children and

Families, SUNY at Buffalo, William E. Pelham as PI). Finally, we discuss directions for future

research for behavioral and social scientists aiming to develop adaptive interventions.

Adaptive Interventions

Interventions are defined as “planned actions intended to produce desired changes in

existing conditions of persons or environments, usually with the condition identified as a

problem to be ameliorated” (Adelman & Taylor, 1994: 638). Although interventions are widely

conceptualized as complex processes (e.g., Cuijpers, 2002; Wandersman & Florin, 2003;

Wampold, Lichtenberg, & Waehler, 2002), intervention scientists usually adopt the traditional

approach to intervention development, aiming to construct a “one size fits all” intervention in

which the composition and dosage are fixed. In this approach, all intervention participants are

offered the same single intervention composition and dosage. For example, in Brand, Lakey

and Berman’s (1995) study on improving perceived social support among community residents,

the same group-based intervention was delivered to all community residents reporting low levels

of perceived support. Every component of the intervention, which included a combination of

social skills training (e.g., positive assertions to self and others, conflict resolution strategies,

active listening) and cognitive restructuring (e.g., identifying and correcting dysfunctional

attitudes that can occur in relationships, positive self-statements and self-acceptance) was

delivered to all participants with the assumption that each one of these components is necessary

for any particular resident. Moreover, each participant was offered the same intervention dosage

(13 weekly group sessions). Using such a fixed approach to intervention development and


assessment does not take into account the varying intervention needs of individuals (Collins,

Murphy & Bierman, 2004; Connell et al., 2008). It limits the ability to identify the truly

efficacious aspects of each treatment, potentially leading to inclusion of unnecessary or even counterproductive components, and does not allow clinicians to match treatments with individual

recipients most likely to benefit from them.

In recent years, intervention development has been shifting from this traditional fixed research-based approach toward conceptualizing interventions in terms of sequential processes in which the

varying needs of subjects are taken into consideration (Collins et al., 2004). Notice that this

conceptualization has two components: (1) the intervention is time-varying and (2) the

intervention adapts and readapts to the specific needs of individuals. Weisz, Chu and Polo’s

(2004) discussion of dissemination and evidence-based practice in clinical psychology suggests

that evidence-based practice should ideally consist of much more than simply obtaining an initial diagnosis and choosing a matching treatment. Evidence-based practice “is not a specific treatment or a set of treatments, but rather an orientation or a value system that relies on evidence to guide the entire treatment process. Thus, a critical element of evidence-based care is periodic assessment to gauge whether the treatment selected initially is in fact proving helpful. If it is not, adjustments in procedures will be necessary, perhaps several times over the course of the treatment” (p. 303). In fact, many behavioral and cognitive therapies can be seen as adaptive

interventions (Bierman et al., 2006). For example, group therapy processes are adjusted over

time based on group dynamics and the developmental stage of the group (Cole, 2005; Yalom,

1995). Cognitive therapy is tailored to address the unique cognitive conceptualization of the

patient and in response to his/her progress or worsening during the process, with therapists basing the format, content, duration and intensity of the upcoming sessions on the results of the

prior sessions (Beck, Liese & Najavits, 2005).

One approach to operationalizing the conceptual idea of an adaptive intervention is to use

decision-rules (Bierman et al., 2006) that link subjects’ characteristics with specific levels and

types of intervention components. This approach is conceptually appealing because it mimics

decision processes in real life where individuals select their actions based on information

obtained from the environment and modify their actions based on this information with the

general aim to maximize long-term rewards. Take for example a typical classroom scenario

where the teacher selects teaching strategies that would fit with the needs of his/her students and

continuously modifies these strategies based on students’ responses in class and performance on

exams in order to optimize the learning process.

The assignment of a particular intervention component and level of dosage is based on

the subject’s values on tailoring variables. These variables are expected to moderate the effect of

certain intervention components, and the logic is that the level or type of intervention should be

tailored according to these moderators. Note that all tailoring variables are moderators, but not

all moderators are tailoring variables. For example, consider an intervention to which both men

and women respond better than to a control but the intervention exhibits stronger effects for

women. In this case, although gender is a moderator, both men and women should be offered the

intervention as both groups benefit. Thus, gender is not a tailoring variable (otherwise different

genders should be offered different interventions).

Although the list of candidate tailoring variables depends on the study, common types of

these variables include individual, group, or context characteristics representing risk or

protective factors that influence responsiveness to (or need for) various types or intensity of

intervention components. For example, community residents who are characterized by particular risk factors, such as low self-esteem, are likely to benefit from an intervention that places more emphasis on cognitive restructuring than on social skills, whereas those with high self-esteem may find other interventions more beneficial. An individual’s responsivity may also be an

important moderator. For example, it is possible that community residents who do not adequately

respond (report low levels of support) to the support intervention within a certain period of time

(say 13 weeks), may need a more intensive intervention or a different type of intervention. In this

case, community residents’ response to the intervention serves as a tailoring variable, allowing

the researcher to modify the support intervention in order to increase its efficiency. Intervention

decisions may also be tailored by previous intervention decisions. For example, the decision

whether to intensify the support intervention may vary as a function of the type of support

intervention given at the first stage. Assume there are two possible first stage interventions, one

that places more emphasis on cognitive restructuring, and one that places more emphasis on

developing social skills. It is possible that intensifying the intervention for non-responders may

be more effective for those residents who were initially assigned to the cognitive restructuring-

based intervention, while for those non-responders assigned to the social skills-based

intervention, it is better to add a cognitive component to the first stage intervention.

To demonstrate how decision rules link information obtained based on the tailoring

variables to intervention options, assume for simplicity that the supportive intervention is only

tailored according to an individual’s response to the first stage intervention. In that case, the

decision rule can be expressed as

First stage intervention = {social skill}

IF evaluation = {non-response}

THEN at Step t+1 apply decision {intensify first stage intervention}


ELSE IF evaluation = {response}

THEN at Step t+1 continue on present intervention

Notice that the IF and ELSE IF parts of the rule contain subjects’ tailoring variables; intervention

options are expressed in the THEN portion of this rule.
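The IF/THEN rule above can also be written as a small function. The following Python sketch is illustrative only: the function name `second_stage_decision` and the string labels are hypothetical and are not part of any study protocol.

```python
def second_stage_decision(first_stage: str, evaluation: str) -> str:
    """Return the step t+1 action for the simplified support-intervention rule.

    Hypothetical labels: `first_stage` is e.g. "social skills";
    `evaluation` is "response" or "non-response" (the tailoring variable).
    """
    if evaluation == "non-response":
        # IF branch: non-responders get an intensified first stage intervention.
        return f"intensify {first_stage}"
    if evaluation == "response":
        # ELSE IF branch: responders continue on the present intervention.
        return f"continue {first_stage}"
    raise ValueError(f"unknown evaluation: {evaluation!r}")
```

The tailoring variable enters only through the conditions, and the intervention options appear only in the returned values, mirroring the IF and THEN portions of the rule.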

To construct high quality adaptive interventions, we focus on selecting good decision

rules. Accordingly, unlike fixed interventions, adaptive interventions not only include the

intervention components and dosage, but also an entire system for assigning components and

dosage. In other words, “the choice of tailoring variables, the measures of the tailoring variables,

the decision rules linking tailoring variables to the assignment of components and dosage, and

the implementation of these rules are all an integral part of the intervention itself” (Collins et al.,

2004: 186).

There are quite a few studies in which researchers used decision rules to operationalize

dynamically adaptive interventions. The Fast Track study included a dynamically adaptive

intervention program for preventing conduct problems among high-risk children (Conduct

Problems Prevention Research Group, 1992). Families participating in this study were provided

with home visits at different levels of intensity depending on clinical judgments of parental

functioning and family needs (see Bierman et al., 2006). In the misdemeanor drug court adaptive

intervention developed by Marlowe and colleagues (Marlowe et al., 2008), the frequency of

court hearings and type of counseling session were adjusted according to prespecified criteria in

response to participants’ performance. In a study of alcohol dependent patients, McKay (2005)

evaluated the effectiveness of an adaptive intervention that was built around brief telephone

contacts, consisting of risk for relapse assessment and problem-focused counseling. When risk

levels increased, participants received stepped up care such as frequent telephone sessions and

several sessions of motivational interviewing. The Early Steps Multisite study (ES-M) involved the application of a family-centered program for reducing emotional and behavioral problems in

children. This intervention was tailored and adapted according to the specific needs of each

family (Connell et al., 2008).

Still, research methods and procedures for finding the best sequence of decision rules are

considered relatively new and complex (Connell et al., 2008). Accordingly, in the following

section we introduce Q-learning (Watkins, 1989) – a novel methodology that can be used for the

construction of adaptive interventions from data. We illustrate the application of this method

using data from a study aiming to develop an adaptive intervention for improving the school-

based performance of children with Attention Deficit Hyperactivity Disorder (ADHD).

Motivation for using Q-learning

When developing adaptive interventions, the aim of the researcher is to find the optimal

sequence of decision rules, namely the sequence of adaptive or individualized decisions for

providing the intervention. Finding the optimal sequence of decision rules belongs to the class of

sequential or multistage decision problems, where a decision which appears optimal in the short-

term may not be a component of the optimal sequence of decisions (Lavori & Dawson, 2000). To

clarify this, consider a sequence of decisions, with one decision point per stage of intervention.

At each stage there may be several possible intervention options. For simplicity, throughout this

manuscript we assume we have only two intervention stages. Denote the intervention at the first stage by $A_1$ and denote the intervention at the second stage by $A_2$. Also, for simplicity we assume there are only two intervention options at each stage ($A_1$ and $A_2$ are coded as -1/1). Let $Y$ denote the primary outcome (the response at the end of the second stage), for which high values are preferred.


To begin, first suppose that our goal is to find the sequence of non-adaptive decisions,

one per stage of intervention, that when implemented will lead to the maximal expected value of

the primary outcome. Consider a simple study in which individuals are randomized to

intervention options at each stage. A basic way to find an optimal sequence of non-adaptive

decisions, based on data collected from this study, would be to regress $Y$ on $A_1$, $A_2$ and the interaction between them (because the effect of the second stage decision may vary as a function of the first stage decision):

(1) $Y \sim \beta_0 + \beta_1 A_1 + \beta_2 A_2 + \beta_3 A_1 A_2$.

In order to find the best sequence of decisions, that is the sequence of intervention options that leads to the maximal primary outcome, we estimate the regression coefficients and simultaneously maximize over $A_1$ and $A_2$. More specifically, since

$$\hat\beta_0 + \hat\beta_1 A_1 + \hat\beta_2 A_2 + \hat\beta_3 A_1 A_2 = \begin{cases} \hat\beta_0 + \hat\beta_1 + \hat\beta_2 + \hat\beta_3 & \text{if } A_1 = 1,\ A_2 = 1 \\ \hat\beta_0 + \hat\beta_1 - \hat\beta_2 - \hat\beta_3 & \text{if } A_1 = 1,\ A_2 = -1 \\ \hat\beta_0 - \hat\beta_1 + \hat\beta_2 - \hat\beta_3 & \text{if } A_1 = -1,\ A_2 = 1 \\ \hat\beta_0 - \hat\beta_1 - \hat\beta_2 + \hat\beta_3 & \text{if } A_1 = -1,\ A_2 = -1 \end{cases} \qquad (*)$$

where $\hat\beta_0$, $\hat\beta_1$, $\hat\beta_2$ and $\hat\beta_3$ are the estimated regression coefficients, we choose the sequence of decisions that maximizes the right hand side of (*). For example, if $\hat\beta_0 > 0$, $\hat\beta_1 > 0$, $\hat\beta_2 < 0$ and $\hat\beta_3 < 0$, then the best sequence is to choose first stage intervention option 1, and then choose second stage intervention option -1.
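As a rough illustration of this regress-then-maximize step, the Python sketch below fits the interaction regression by least squares and evaluates the fitted mean at each of the four decision sequences. The function name `best_nonadaptive_sequence`, the simulated data, and the coefficient values used to generate them are all assumptions made for the example; they do not come from any study.

```python
import itertools
import numpy as np

def best_nonadaptive_sequence(a1, a2, y):
    """Fit Y ~ b0 + b1*A1 + b2*A2 + b3*A1*A2 by least squares and return
    the (A1, A2) pair in {-1, 1}^2 with the largest fitted mean."""
    a1, a2, y = (np.asarray(v, dtype=float) for v in (a1, a2, y))
    X = np.column_stack([np.ones_like(a1), a1, a2, a1 * a2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Fitted mean outcome for each of the four possible sequences.
    fitted = {
        (s1, s2): float(beta @ np.array([1.0, s1, s2, s1 * s2]))
        for s1, s2 in itertools.product((1, -1), repeat=2)
    }
    return max(fitted, key=fitted.get), beta

# Simulated two-stage randomized data (assumed coefficients: 1, 0.5, -0.3, -0.2).
rng = np.random.default_rng(0)
n = 2000
a1 = rng.choice([-1, 1], size=n)
a2 = rng.choice([-1, 1], size=n)
y = 1 + 0.5 * a1 - 0.3 * a2 - 0.2 * a1 * a2 + rng.normal(size=n)

best, beta_hat = best_nonadaptive_sequence(a1, a2, y)
```

With a positive first-stage coefficient and negative second-stage and interaction coefficients, the selected sequence is first stage option 1 followed by second stage option -1, matching the sign pattern in the example above.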

Notice that although the best sequence of decisions based on (1) can be used to construct

a time varying intervention, it cannot be used to construct an adaptive intervention since this

sequence of decisions is not adaptive, in that it is not tailored according to the changing status of

the subject. An example of such a time-varying (yet non-adaptive) intervention approach can be

seen in Raudenbush, Hong and Rowan’s (2002) study of the effects of time-varying mathematics


instructional “treatments”. However, adaptive interventions are not only time varying, but also

adaptive to the dynamics of the environment and hence an optimal adaptive intervention involves

an optimal sequence of decision rules, as opposed to an optimal sequence of decisions.

Consider a simple study in which individuals are randomized to interventions at each

stage and observations on the individual are collected prior to each randomization. Denote the

observations on the individual (possibly a vector) at the beginning of the first stage by $O_1$ and at the beginning of the second stage by $O_2$. Accordingly, the data record for each subject would be: $O_1, A_1, O_2, A_2, Y$. In general $O$ contains predictors of the primary outcome. $O_1$ and $O_2$ may condition (moderate) the effects of the decisions; additionally $O_2$ may be affected by both $O_1$ and $A_1$. Denote by $d_1$ the decision rule at the first stage, which takes the available information ($O_1$) as input and outputs a decision (i.e., intervention option) $A_1$. Denote by $d_2$ the decision rule at the second stage, which takes the available information ($O_1, A_1, O_2$) as input and outputs a decision (i.e., intervention option) $A_2$. Our goal is to find the optimal sequence of decision rules $(d_1^*, d_2^*)$, namely the sequence of decision rules that would lead to the maximal expected primary outcome if assigned to the entire study population.

In this case an intuitive way to find $(d_1^*, d_2^*)$ would be to extend the regression in (1) to include $O_1$ and $O_2$ as potential moderators. For example,

(2) $Y \sim \beta_0 + \beta_1 A_1 + \beta_2 O_1 + \beta_3 A_1 O_1 + \beta_4 O_2 + \beta_5 A_2 + \beta_6 A_1 A_2 + \beta_7 O_2 A_2$.

However, using estimates based on this equation to make an inference about the optimal sequence of adaptive decisions $(d_1^*, d_2^*)$ is problematic in two main respects. First, since $O_2$ may be an outcome of $A_1$ and a potential predictor of $Y$, including $O_2$ cuts off any portion of the effect of $A_1$ on $Y$ that occurs via $O_2$. To clarify this, $O_2$ can be conceptualized as a mediator in the relationship between $A_1$ and $Y$. Adding $O_2$ to a regression in which $A_1$ is used to predict $Y$ will reduce the estimated effect of $A_1$. In the presence of $O_2$, the coefficient for $A_1$ no longer expresses the total effect of the first stage intervention on the outcome, but rather what is left of the total effect (the direct effect) after cutting off the part of the effect that is mediated by $O_2$ (the indirect effect) (Baron & Kenny, 1986; MacKinnon, Warsi, & Dwyer, 1995). Note that ascertaining the total effect of the intervention (say $A_1$) is crucial to finding the best decision rule (say $d_1^*$), as it provides information concerning the overall effect of the intervention. Although the direct effect of the intervention may be helpful in identifying mechanisms or processes through which the intervention may affect the outcome, it does not help the researcher decide which intervention option is superior. Accordingly, any inference concerning the optimal adaptive decision at the first stage based on (2) is likely to be biased.

Second, unknown causes of both $O_2$ and $Y$ may introduce biases in the coefficients of the $A_1$ terms (main effects and interactions) such that $A_1$ may appear to be falsely less or more correlated with $Y$ because $A_1$ affects $O_2$ while $O_2$ and $Y$ are affected by the same unknown causes (see Figure 1).

[Figure 1 about here]

To demonstrate this, consider the numerical example discussed by Murphy and Bingham (2009). Let $U$, a Bernoulli random variable with success probability 1/2, represent an unknown cause of both $O_2$ and $Y$ (we assume there is no $O_1$ in this case, i.e., $O_1$ in (2) equals zero). Suppose that $Y = \delta_0 + \delta_1 U + \varepsilon$, where $\varepsilon$ (mean zero, finite variance) is independent of $(U, O_2, A_1, A_2)$. Thus, there is no effect of $A_1$ or $A_2$ on $Y$. $A_1$ and $A_2$ can each take the values -1/1; subjects are randomly assigned to these options at each stage with probability 1/2. Next, suppose that $O_2$ can take two values, 0 or 1, and

$$P(O_2 = 1 \mid U, A_1) = U\left[\frac{q_1 + q_2}{2} + \frac{q_1 - q_2}{2} A_1\right] + (1 - U)\left[\frac{q_3 + q_4}{2} + \frac{q_3 - q_4}{2} A_1\right],$$

where each $q_j \in [0, 1]$. It follows (see Appendix 1 for the proof) that

(3) $E[Y \mid A_1, O_2 = 1, A_2] = \delta_0 + \frac{\delta_1}{2}\left(\frac{q_1}{q_1 + q_3} + \frac{q_2}{q_2 + q_4}\right) + \frac{\delta_1}{2}\left(\frac{q_1}{q_1 + q_3} - \frac{q_2}{q_2 + q_4}\right) A_1$.

Since $\beta_1 = \frac{\delta_1}{2}\left(\frac{q_1}{q_1 + q_3} - \frac{q_2}{q_2 + q_4}\right)$ may be different from the true effect of zero, $\beta_1$ in (2) reflects bias that occurs because we are conditioning on $O_2$, which is both an outcome of $A_1$ and a predictor of $Y$.
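The bias described here can be checked numerically. The Python sketch below simulates the setup with assumed, purely illustrative values for the outcome-model coefficients and for the four conditional probabilities of the binary intermediate outcome (none of these numbers come from Murphy and Bingham, 2009). The second-stage intervention is left out of the simulation because it is randomized independently of everything else and has no effect on the outcome.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000
delta0, delta1 = 0.0, 1.0              # assumed outcome-model coefficients
q1, q2, q3, q4 = 0.9, 0.5, 0.3, 0.5    # assumed values of P(O2 = 1 | U, A1)

u = rng.integers(0, 2, size=n)         # unknown cause, Bernoulli(1/2)
a1 = rng.choice([-1, 1], size=n)       # first-stage option, randomized 1/2
p_o2 = np.where(u == 1,
                np.where(a1 == 1, q1, q2),
                np.where(a1 == 1, q3, q4))
o2 = rng.random(n) < p_o2              # binary intermediate outcome
y = delta0 + delta1 * u + rng.normal(size=n)   # A1 has no effect on Y

# Coefficient of A1 in E[Y | A1, O2 = 1]: half the difference in cell means.
empirical = 0.5 * (y[o2 & (a1 == 1)].mean() - y[o2 & (a1 == -1)].mean())
theoretical = 0.5 * delta1 * (q1 / (q1 + q3) - q2 / (q2 + q4))

# Without conditioning on O2, the same contrast is close to the true value 0.
marginal = 0.5 * (y[a1 == 1].mean() - y[a1 == -1].mean())
```

Under these assumed values the conditional contrast is clearly nonzero even though the first-stage intervention has no effect on the outcome, while the unconditional contrast stays near zero.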

Motivated by the inference problems noted above, we introduce Q-learning -- a method

for using data to estimate the optimal sequence of decision rules. Q-learning uses backwards

induction (Bellman & Dreyfus, 1962) to construct a sequence of decision rules that map or link

the observations of the environment (here captured by tailoring variables) to the actions the agent

(Decision Maker) ought to take in order to maximize the desired long-term primary outcome. In

terms of constructing an adaptive intervention, Q-learning can be used to find the sequence of

decision rules that link the subject’s observations (e.g., characteristics and responses to past

decisions) to the most efficient intervention component and dosage. The aim of Q-learning is to

evaluate the intervention components at each stage when the subsequent adaptive decision is

matched to the subjects as opposed to evaluating intervention components as stand-alone

components for each stage. In the following section we show how researchers can use Q-learning

to construct an optimal sequence of adaptive decision rules.

Q-learning

Our goal is to find the optimal sequence of decision rules $(d_1^*, d_2^*)$. In some applications (e.g., expert systems¹) an expert may provide the multivariate distribution of $O_1$, $O_2$ and $Y$ for every sequence of decisions $A_1$, $A_2$. In this case, we can obtain the optimal sequence of decision rules using backwards induction as follows:

$$d_2^*(O_1, A_1, O_2) = \arg\max_{a_2} Q_2(O_1, A_1, O_2, a_2),$$

where $Q_2(O_1, A_1, O_2, A_2) = E[Y \mid O_1, A_1, O_2, A_2]$ is the expectation of the primary outcome conditional on $O_1$, $O_2$, for interventions $A_1$, $A_2$. This conditional expectation provides the quality of the intervention option $A_2$, as it expresses the expected primary outcome of choosing intervention option $A_2$ now, given the information available (the history: $O_1$, $A_1$, $O_2$).

Then, we move backwards in time to find the optimal adaptive decision rule at stage 1, namely $d_1^*(O_1)$:

$$d_1^*(O_1) = \arg\max_{a_1} Q_1(O_1, a_1),$$

where $Q_1(O_1, A_1) = E\left[\max_{a_2} Q_2(O_1, A_1, O_2, a_2) \mid O_1, A_1\right]$ is the conditional expectation that provides the quality of choosing intervention option $A_1$ initially, assuming that we choose the best intervention option at the second stage. That is, it expresses the expected primary outcome of choosing option $A_1$ given the information available (the history $O_1$).

1 Expert systems (or knowledge-based systems) are defined broadly as computer programs that mimic the reasoning and problem solving of a human ‘expert’. These systems use pre-specified knowledge about the particular problem area. They are based on theoretical models, employing deep knowledge systems as a basis for their operation (Velicer, Prochaska & Redding, 2006).

In general "$Q$" denotes the Quality of the decision, given the history up to that decision point. $Q_1$ and $Q_2$ are often called Q-functions (Sutton & Barto, 1998). Note that the optimal decision rules $d_1^*$, $d_2^*$ output the intervention options that maximize $Q_1$, $Q_2$ respectively.

The focus here is on the use of data to construct adaptive decision rules; we do not know the true multivariate distribution of $O_1$, $O_2$ and $Y$. We represent the study data as $\{O_{1i}, A_{1i}, O_{2i}, A_{2i}, Y_i\}$, $i = 1, \dots, n$, where $n$ is the number of study participants. Throughout, for simplicity, we assume that participants are randomly assigned to the two intervention options at each of the two decision stages (i.e., $A_1$ and $A_2$ are randomized). When participants are randomly assigned to intervention options (randomization probabilities may depend on past information), the conditional distributions required to form optimal adaptive decisions are the same as the corresponding conditional distributions in the data (in Appendix 2 we provide the proof and relate these expectations to potential outcomes).

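We can use the following version of Q-learning (Murphy, 2005) to estimate (i.e., “learn”) the Q-functions, from which we construct the optimal sequence of decision rules as described above. Here, we use linear regressions. The second stage Q-function might be modeled as:

(4) $Q_2(O_1, A_1, O_2, A_2; \gamma_2, \alpha_2) = \gamma_{20} + \gamma_{21} O_1 + \gamma_{22} A_1 + \gamma_{23} O_1 A_1 + \gamma_{24} O_2 + (\alpha_{21} + \alpha_{22} A_1 + \alpha_{23} O_2) A_2$,

where $\gamma_2 = (\gamma_{20}, \gamma_{21}, \gamma_{22}, \gamma_{23}, \gamma_{24})$ and $\alpha_2 = (\alpha_{21}, \alpha_{22}, \alpha_{23})$. Notice that our main interest lies primarily in the parameters $\alpha_2$, as they contain information with respect to how the relevant decision should vary as a function of the candidate tailoring variables (here $A_1$ and $O_2$). Based on (4) one can see that the second decision ($A_2$) that maximizes $Q_2$ is the one that maximizes the term $(\alpha_{21} + \alpha_{22} A_1 + \alpha_{23} O_2) A_2$; that is, $A_2$ is 1 if $(\alpha_{21} + \alpha_{22} A_1 + \alpha_{23} O_2)$ is positive and $A_2$ is -1 if $(\alpha_{21} + \alpha_{22} A_1 + \alpha_{23} O_2)$ is negative. We estimate the vector parameters $\gamma_2$ and $\alpha_2$ by the following regression:

$Y \sim \gamma_{20} + \gamma_{21} O_1 + \gamma_{22} A_1 + \gamma_{23} O_1 A_1 + \gamma_{24} O_2 + (\alpha_{21} + \alpha_{22} A_1 + \alpha_{23} O_2) A_2$.

Next, we construct the estimated quality of the second stage intervention. This intermediate outcome is the expected value of the primary outcome, given that the optimal adaptive decision was taken at stage 2. That is:

$\tilde{Y}_i = \max_{a_2} Q_2(O_{1i}, A_{1i}, O_{2i}, a_2; \hat\gamma_2, \hat\alpha_2)$, $i = 1, \dots, n$,

or equivalently

$\tilde{Y} = \hat\gamma_{20} + \hat\gamma_{21} O_1 + \hat\gamma_{22} A_1 + \hat\gamma_{23} O_1 A_1 + \hat\gamma_{24} O_2 + |\hat\alpha_{21} + \hat\alpha_{22} A_1 + \hat\alpha_{23} O_2|$.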

We use a linear model for the first stage Q-function as well.

(5) $Q_1(O_1, A_1; \gamma_1, \alpha_1) = \gamma_{10} + \gamma_{11} O_1 + (\alpha_{11} + \alpha_{12} O_1) A_1$,

where $\gamma_1 = (\gamma_{10}, \gamma_{11})$ and $\alpha_1 = (\alpha_{11}, \alpha_{12})$. Based on (5) one can see that the first decision ($A_1$) that maximizes $Q_1$ is the one that maximizes the term $(\alpha_{11} + \alpha_{12} O_1) A_1$; that is, $A_1$ is 1 if $(\alpha_{11} + \alpha_{12} O_1)$ is positive and $A_1$ is -1 if $(\alpha_{11} + \alpha_{12} O_1)$ is negative. We again use regression to estimate $\gamma_1$ and $\alpha_1$ as follows:

$\tilde{Y} \sim \gamma_{10} + \gamma_{11} O_1 + (\alpha_{11} + \alpha_{12} O_1) A_1$.

Notice that this time we regress the estimated quality of the second stage intervention (the predicted primary outcome obtained by taking the best intervention option at the second stage) on $O_1$, $A_1$, and $O_1 A_1$.

Accordingly, the estimated optimal sequence of adaptive decisions (i.e., intervention options) would be:

$\hat{d}_2(O_1, A_1, O_2) = \arg\max_{a_2} Q_2(O_1, A_1, O_2, a_2; \hat\gamma_2, \hat\alpha_2) = \operatorname{sign}(\hat\alpha_{21} + \hat\alpha_{22} A_1 + \hat\alpha_{23} O_2)$

$\hat{d}_1(O_1) = \arg\max_{a_1} Q_1(O_1, a_1; \hat\gamma_1, \hat\alpha_1) = \operatorname{sign}(\hat\alpha_{11} + \hat\alpha_{12} O_1)$

where $\hat{d}_2(O_1, A_1, O_2)$ is the estimated best second stage intervention option ($A_2$), that is, the second stage intervention option that maximizes the mean of the primary outcome, given the history $O_1$, $A_1$, $O_2$, based on the estimated parameters $\hat\gamma_2$ and $\hat\alpha_2$. And $\hat{d}_1(O_1)$ is the estimated best initial decision ($A_1$) that maximizes the estimated quality of the second stage intervention, given the history $O_1$, based on the estimated parameters $\hat\gamma_1$ and $\hat\alpha_1$.
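The two regressions and the resulting decision rules can be sketched in a few lines of Python. This is a minimal illustration under the linear working models described above, not the authors' software: the function names (`fit_q_learning`, `d2_hat`, `d1_hat`), the variable names for the main-effect and tailoring coefficients, and the simulated data with their generating coefficients are all assumptions made for the example.

```python
import numpy as np

def fit_q_learning(o1, a1, o2, a2, y):
    """Two-stage Q-learning with linear working models, fitted by least squares."""
    o1, a1, o2, a2, y = (np.asarray(v, dtype=float) for v in (o1, a1, o2, a2, y))
    one = np.ones_like(y)
    # Stage 2: main-effect columns first, then the columns that interact with A2.
    X2 = np.column_stack([one, o1, a1, o1 * a1, o2, a2, a1 * a2, o2 * a2])
    theta2, *_ = np.linalg.lstsq(X2, y, rcond=None)
    main2, tailor2 = theta2[:5], theta2[5:]
    # Pseudo-outcome: fitted value when the best second-stage option is taken.
    y_tilde = X2[:, :5] @ main2 + np.abs(tailor2[0] + tailor2[1] * a1 + tailor2[2] * o2)
    # Stage 1: regress the pseudo-outcome on O1, A1 and O1*A1.
    X1 = np.column_stack([one, o1, a1, o1 * a1])
    theta1, *_ = np.linalg.lstsq(X1, y_tilde, rcond=None)
    main1, tailor1 = theta1[:2], theta1[2:]
    return main1, tailor1, main2, tailor2

def d2_hat(tailor2, a1, o2):
    """Estimated second-stage rule (np.sign returns 0 on an exact tie)."""
    return np.sign(tailor2[0] + tailor2[1] * a1 + tailor2[2] * o2)

def d1_hat(tailor1, o1):
    """Estimated first-stage rule (np.sign returns 0 on an exact tie)."""
    return np.sign(tailor1[0] + tailor1[1] * o1)

# Simulated two-stage randomized data; all generating coefficients are assumed.
rng = np.random.default_rng(3)
n = 5000
o1 = rng.normal(size=n)
a1 = rng.choice([-1.0, 1.0], size=n)
o2 = 0.5 * o1 + 0.5 * a1 + rng.normal(size=n)
a2 = rng.choice([-1.0, 1.0], size=n)
y = o1 + 0.3 * o2 + (0.2 + 0.5 * a1 - 0.8 * o2) * a2 + rng.normal(size=n)

main1, tailor1, main2, tailor2 = fit_q_learning(o1, a1, o2, a2, y)
```

The intermediate observation enters only the second-stage regression; the first-stage regression sees it only through the pseudo-outcome, which is the feature of the method that the surrounding discussion relies on.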

Q-Learning assists us in estimating an optimal sequence of decision rules in two

important ways. First, this approach reduces potential bias compared to the regression in (2)

since in the first stage analysis the regression model omits variables that may mediate the

relationship between the first stage intervention and the primary outcome (omitting these

variables prevents the elimination of the portion of the intervention effect that goes through the

mediator). For example, if $O_2$ mediates the relationship between $A_1$ and $Y$, inference concerning the effect of $A_1$ on $Y$ based on the regression equation (2), in which $O_2$ is present, will not reflect the total effect of $A_1$ on $Y$, but rather the part of this effect that does not go through the mediator $O_2$. However, taking the Q-learning approach, the inference concerning $A_1$ is based on a regression equation in which $O_2$ is not present and hence any bias to the estimated effect of $A_1$ due to the mediation of $O_2$ is reduced.

Second, Q-learning reduces the bias incurred by the use of (2), bias that is a result of

unmeasured causes of both the tailoring variables ($O_2$) and the primary outcome ($Y$). Note that this bias resulting from unmeasured causes is different from the bias discussed above, and may occur even if $O_2$ does not mediate the relationship between $A_1$ and $Y$. The following section further demonstrates the second feature.

Comparing the single regression approach and Q-learning

In order to demonstrate how Q-learning reduces the bias resulting from unmeasured

causes, consider the following example (Example 1): Say that the outcome Y is the level of

community residents’ perceived support 26 weeks after the beginning of the first stage

intervention, and $O_2$ is the level of perceived support after a 13-week period. For simplicity, we assume there are no baseline variables $O_1$. Say that $A_1$ (social skills-based intervention = 1 vs. cognitive-based intervention = -1) and $A_2$ (intensify first stage intervention = 1 vs. add the other intervention component = -1) are each randomly assigned to subjects with probability ½. We also assume $U \sim N(0, 1)$ is an unmeasured cause (say, a personality characteristic) that has an effect on both perceived support measures $Y$ and $O_2$. More specifically, $Y = 1 + 0.5U + \varepsilon_Y$, and $O_2 = 1 + 0.5U + 0.5A_1 + \varepsilon_O$. For both models we assume the $\varepsilon$’s (error terms) are independent and standard normally distributed. Notice that $O_2$ does not mediate the relationship between $A_1$ and $Y$, and $A_1$ has an effect on $O_2$, but neither $A_1$ nor $A_2$ has an effect on $Y$.

We generated 1,000 samples of size n = 500 each using the above example. On each data set we

used the single regression approach and the Q-learning approach.

The single regression model is:

$$Y \sim \gamma_0 + \gamma_1 A_1 + \gamma_2 O_2 + \gamma_3 A_2 + \gamma_4 A_1 A_2. \tag{6}$$

A natural approach to using (6) to construct the sequence of decision rules is as follows. We construct the optimal decision rule at the second stage by finding the value of A2 that maximizes (6), i.e., that maximizes the term (γ3 + γ4 A1)A2. That is, A2 = sign(γ3 + γ4 A1). Replacing A2 by sign(γ3 + γ4 A1), the estimated maximal expected outcome is γ0 + γ1 A1 + γ2 O2 + |γ3 + γ4 A1|. Now, we rewrite this maximal expected outcome as

$$\gamma_0 + \gamma_1 A_1 + \gamma_2 O_2 + |\gamma_3 + \gamma_4 A_1| = \gamma_0 + \gamma_1 A_1 + \gamma_2 O_2 + \frac{1 + A_1}{2}\,|\gamma_3 + \gamma_4| + \frac{1 - A_1}{2}\,|\gamma_3 - \gamma_4|$$

$$= \gamma_0 + \gamma_2 O_2 + \frac{1}{2}\big(|\gamma_3 + \gamma_4| + |\gamma_3 - \gamma_4|\big) + \Big[\gamma_1 + \frac{1}{2}\big(|\gamma_3 + \gamma_4| - |\gamma_3 - \gamma_4|\big)\Big] A_1.$$

Next we find the value of A1 that maximizes the above. Accordingly, if γ1 + ½(|γ3 + γ4| − |γ3 − γ4|) > 0, we conclude that A1 = 1 (social skills-based intervention) is the best intervention option at the first stage given that we chose the best second stage intervention option. If γ1 + ½(|γ3 + γ4| − |γ3 − γ4|) < 0, we conclude that A1 = -1 (cognitive-based intervention) is the best intervention option at the first stage given that we chose the best second stage intervention option.
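The rewriting step above relies only on A1 taking the values 1 and −1. A short case check, writing the coefficients of (6) as γ0, ..., γ4 (a notation assumed here for illustration), may make this explicit:

```latex
\begin{aligned}
A_1 = 1:\quad
 & \tfrac{1 + A_1}{2}\,|\gamma_3 + \gamma_4| + \tfrac{1 - A_1}{2}\,|\gamma_3 - \gamma_4|
   = |\gamma_3 + \gamma_4| = |\gamma_3 + \gamma_4 A_1|, \\
A_1 = -1:\quad
 & \tfrac{1 + A_1}{2}\,|\gamma_3 + \gamma_4| + \tfrac{1 - A_1}{2}\,|\gamma_3 - \gamma_4|
   = |\gamma_3 - \gamma_4| = |\gamma_3 + \gamma_4 A_1|.
\end{aligned}
```

In both cases one of the two weights is 1 and the other is 0, so the absolute value becomes linear in A1 and the coefficient of A1 can be read off directly.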

On the other hand, consider Q-learning. In analogy to (6) we use the models

$$Q_2(A_1, O_2, A_2; \beta_2, \psi_2) = \beta_{20} + \beta_{21} A_1 + \beta_{22} O_2 + (\psi_{21} + \psi_{22} A_1) A_2$$

and

$$Q_1(A_1; \beta_1, \psi_1) = \beta_{10} + \psi_{11} A_1.$$

Applying the Q-learning algorithm, we obtained estimates of the parameters (βt, ψt), t = 1, 2. We estimated the best second stage decision by choosing A2 = sign(ψ21 + ψ22 A1), and the best first stage decision by choosing A1 = sign(ψ11). Under this approach, ψ11 is the estimated effect of the first stage intervention given that we chose the best second stage intervention option.

In conclusion, the sign of γ1 + ½(|γ3 + γ4| − |γ3 − γ4|) determines which first stage intervention is selected as best in the single regression approach, whereas the sign of ψ11 determines which first stage intervention is selected as best in Q-learning. We compare the distributions of these two quantities across the 1,000 generated samples. Recall that in our example there is no effect of the initial decision A1; thus both distributions should be centered at zero. Figure 2 presents the distribution of γ1 + ½(|γ3 + γ4| − |γ3 − γ4|) and Figure 3 presents the distribution of ψ11. The distribution of the Q-learning-based estimate is centered around zero (SD = .06), while the distribution of the single regression-based estimate has a mean of -.10 (SD = .06). Thus, if there are unobserved causes of both O2 and Y, the single regression approach in (6) may lead to erroneous conclusions concerning the best sequence of adaptive intervention options, while using the Q-learning method improves our ability to find the optimal sequence of decision rules.
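The comparison described above can be sketched in a few lines of standard-library Python. This is an illustrative reconstruction under the stated data-generating model, not the authors' code: the function names (`ols`, `simulate_example`, `first_stage_effects`, `run`, `summarize`) and the seed are assumptions, and the least-squares fit uses plain normal equations rather than a statistics package.

```python
import random
import statistics


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    m = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    for i in range(p):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(i, p), key=lambda k: abs(m[k][i]))
        m[i], m[pivot] = m[pivot], m[i]
        m[i] = [v / m[i][i] for v in m[i]]
        for k in range(p):
            if k != i:
                m[k] = [vk - m[k][i] * vi for vk, vi in zip(m[k], m[i])]
    return [row[p] for row in m]


def simulate_example(n, rng):
    """One data set from Example 1: tuples (A1, O2, A2, Y)."""
    data = []
    for _ in range(n):
        u = rng.gauss(0, 1)                # unmeasured cause U ~ N(0, 1)
        a1 = rng.choice((-1, 1))           # randomized with probability 1/2
        a2 = rng.choice((-1, 1))           # randomized with probability 1/2
        o2 = 1 + 0.5 * u + 0.5 * a1 + rng.gauss(0, 1)
        y = 1 + 0.5 * u + rng.gauss(0, 1)  # neither A1 nor A2 affects Y
        data.append((a1, o2, a2, y))
    return data


def first_stage_effects(data):
    """Return (single-regression quantity, Q-learning first stage coefficient)."""
    y = [d[3] for d in data]
    # Regression (6); here it also serves as the second stage Q-function fit.
    g0, g1, g2, g3, g4 = ols(
        [(1, a1, o2, a2, a1 * a2) for a1, o2, a2, _ in data], y)
    single = g1 + 0.5 * (abs(g3 + g4) - abs(g3 - g4))
    # Q-learning: plug in the best A2, then regress on A1 alone (O2 omitted).
    y_tilde = [g0 + g1 * a1 + g2 * o2 + abs(g3 + g4 * a1)
               for a1, o2, _, _ in data]
    _, psi11 = ols([(1, a1) for a1, _, _, _ in data], y_tilde)
    return single, psi11


def run(n_reps=1000, n=500, seed=0):
    """Repeat the experiment; returns a list of (single, psi11) pairs."""
    rng = random.Random(seed)
    return [first_stage_effects(simulate_example(n, rng)) for _ in range(n_reps)]


def summarize(results):
    """(mean, SD) of each quantity across replications."""
    single, q = zip(*results)
    return ((statistics.mean(single), statistics.stdev(single)),
            (statistics.mean(q), statistics.stdev(q)))
```

Calling `summarize(run())` mirrors the setup of 1,000 samples of size 500. Under this model the coefficient of A1 in the single regression has expectation −0.10 even though A1 has no effect on Y, because conditioning on O2 opens a path through the unmeasured cause; the Q-learning coefficient has expectation zero.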

---------------------------------- Figures 2 & 3 about here

----------------------------------

Analysis based on the Adaptive Interventions for Children with ADHD study

To illustrate Q-learning, we use a simplified version of the Adaptive Interventions for

Children with ADHD study (a full analysis can be found in ADD CITATION). Attention-Deficit

Hyperactivity Disorder (ADHD) is a chronic disorder affecting 5-10% of school age children

that adversely impacts functioning at home, at school, and in social settings (Pliszka, 2007). In recent years there has been controversy concerning the relative effectiveness of behavioral vs. medication-based interventions for the treatment of ADHD (see Pelham & Fabiano, 2008; Pliszka, 2007). Accordingly, a Sequential Multiple Assignment Randomized Trial (SMART;

Murphy, 2005) was conducted (William E. Pelham as PI) with the general aim to find the

optimal sequence of treatments that reduces ADHD symptoms among children.

Design

In a SMART study, treatments are randomized at each stage, and the observable data for a subject are {O1, A1, O2, A2, Y}. Of course, the number of stages can be greater than 2, and the

observable data may also include baseline variables and/or measurements of potential

confounders. At the first stage, the intervention A1 is randomized with randomization distribution

allowed to depend on (O1) and at stage 2 the intervention A2 is randomized with randomization

distribution allowed to depend on (O1, A1, O2).


At the first stage of the ADHD SMART study (A1), children were randomly assigned (with probability ½) to a low dose of medication (coded as -1) or a low dose of behavioral intervention (coded as 1) at the beginning of a school year. After eight weeks, children’s response to the first stage intervention was evaluated monthly until the end of that school year. At each monthly assessment, if the child showed inadequate response to the first stage intervention, then he/she entered the second stage of the intervention (A2) and was re-randomized (with probability ½) to one of two second stage intervention options: either increasing the dose of the first stage intervention (coded as 1) or augmenting the first stage intervention with the other type of intervention, i.e., adding behavioral intervention for those who started with medication, or adding medication for those who started with behavioral intervention (coded as -1). Otherwise, if the child was classified as a responder, he/she remained in stage 1 and continued the first stage intervention. Note that there are only two key decisions in this trial: the first stage intervention decision (A1), and then the second stage intervention decision (A2) for those not responding satisfactorily to the first stage intervention. The structure of this SMART study is illustrated in Figure 4.

-------------------------- Figure 4 about here

--------------------------

Sample

149 children (75% boys) between the ages of 5 and 12 (mean 8.6 years) participated in the study. Due to drop-out and missing data (see footnote 2), the effective sample used in the current analysis was 131. At the first stage of the intervention (A1), 67 children were randomized to receive a low dose of medication, and 64 were randomized to receive a low dose of behavioral intervention. By the end of the school year, 77 children were classified as non-responders and re-randomized to one of the two second stage intervention options, with 37 children assigned to increasing the dose of the first stage intervention, and 38 children assigned to augmenting the first stage intervention with the other type of intervention.

Footnote 2: In a full analysis one would want to use a modern missing data method such as multiple imputation so as to avoid bias.

Measures

Primary outcome (Y): we consider the level of children’s classroom performance based

on the Impairment Rating Scale (IRS, Fabiano et al., 2006; available from

http://wings.buffalo.edu/adhd) after an 8-month period as our primary outcome. This outcome

ranges from 1 to 5, with higher values reflecting better classroom performance. Because the

current analysis is for illustrative, rather than for substantive purposes, we use this measure as an

outcome despite limitations relating to its distribution and reliability.

Baseline (B): we use the level of children’s classroom performance (based on the IRS

scale) measured during the first month of the school year (before the first stage intervention) as a

baseline measure.

Week of non-response (W): reflecting the week during the school year at which the child

showed inadequate response to the first stage intervention, and hence entered the second stage of

the intervention. This measure is relevant only for those who showed inadequate response during

the school year (i.e., classified as non-responders to the first stage intervention).

Medication prior to first-stage intervention (O1): This measure reflects whether (coded

as 1) or not (coded as 0) the child received medication at school during the previous school year

(i.e., prior to the first stage of the intervention).


Adherence to first stage intervention (O2): This measure reflects whether adherence to

the first stage intervention was high (coded as 1) or low (coded as 0). We constructed this

indicator based on two other measures that express (1) the percentage of days the child received

medication during the school year calculated based on pill counts (for those assigned to low dose

of medication as the first stage intervention), and (2) the percentage of days the child received

the behavioral intervention during the school year based on teacher report of behavioral

interventions used in the classroom (for those assigned to behavioral intervention as the first

stage treatment). The distributions of these two measures are presented in Figures 5 and 6. Based on these distributions, we constructed O2 such that for those assigned to behavioral intervention as the first stage treatment, low adherence (O2 = 0) means receiving less than 75% days of behavioral intervention, and for those assigned to medication as the first stage treatment, low adherence (O2 = 0) reflects receiving less than 100% days of medication (see footnote 3).

------------- Figures 5&6 about here

-------------

Data Analysis Procedure

Using the Q-learning approach, the optimal sequence of decision rules can be estimated based on two regressions, one for each intervention stage. We start from the second stage, aiming to find the best subsequent intervention for non-responders, given the history up to the second decision point (B, O1, A1, O2, W). Because children were classified as non-responders at different time points along the school year, we included the week of non-response (W) in this regression. We also included the child’s baseline measure (B) in the regression, due to potential confounding. We consider the first stage intervention (A1) as well as the level of adherence to the first stage intervention (O2) as candidate tailoring variables for the second stage intervention. Accordingly, Q2 for non-responders can be approximated by

$$Q_2(B, O_1, A_1, O_2, W, A_2; \beta_2, \psi_2) = \beta_{20} + \beta_{21} O_1 + \beta_{22} B + \beta_{23} A_1 + \beta_{24} O_1 A_1 + \beta_{25} W + \beta_{26} O_2 + (\psi_{21} + \psi_{22} A_1 + \psi_{23} O_2) A_2. \tag{7}$$

In general this regression might include further possible confounders or potential tailoring variables such as negative/ineffective parenting styles and medication side effects. We obtain estimates of (β2, ψ2) by regression. In this simple case, for a child who does not respond to the first stage treatment, the decision rule recommends increasing the dose of the first stage intervention (A2 = 1) if (ψ21 + ψ22 A1 + ψ23 O2) > 0 and adding the other intervention (A2 = −1) if (ψ21 + ψ22 A1 + ψ23 O2) < 0.

Footnote 3: Such relatively high adherence rates may result from obtaining adherence data only for the first 8 weeks of the school year. Moreover, study medication was to be taken only on school days, and was dispensed monthly to parents.

Now we move backwards in time, aiming to find the best first stage intervention option (A1) given that the best second stage intervention option was offered to non-responders. Based on (7), the estimated quality of the second stage intervention for non-responders would be

$$\tilde{Y} = \beta_{20} + \beta_{21} O_1 + \beta_{22} B + \beta_{23} A_1 + \beta_{24} O_1 A_1 + \beta_{25} W + \beta_{26} O_2 + |\psi_{21} + \psi_{22} A_1 + \psi_{23} O_2|,$$

evaluated at the second stage estimates. For responders, we use the measure of classroom performance after an 8-month period as an intermediate outcome.

Q1 can be approximated by

$$Q_1(B, O_1, A_1; \beta_1, \psi_1) = \beta_{10} + \beta_{11} B + \beta_{12} O_1 + (\psi_{11} + \psi_{12} O_1) A_1.$$

We regress Ỹ on the predictors to obtain estimates of β1 and ψ1. If (ψ11 + ψ12 O1) > 0, the best first stage intervention option would be to begin with a low dose of behavioral intervention (A1 = 1). If (ψ11 + ψ12 O1) < 0, the best first stage intervention option would be to begin with a low dose of medication (A1 = −1).
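The construction of the first stage response can be sketched as follows. This is a hedged illustration, not the study's code: the function name `stage2_pseudo_outcome`, the dictionary keys, and the ordering of the main-effect coefficients (intercept, O1, B, A1, O1*A1, W, O2) are assumptions, as is the choice to carry the observed outcome Y forward unchanged for responders.

```python
def stage2_pseudo_outcome(child, beta2, psi2):
    """Estimated quality of the second stage intervention for one child.

    child: dict with "responder" and "Y"; non-responders also need
           "B", "O1", "A1", "O2" and "W".
    beta2: seven main-effect estimates (intercept, O1, B, A1, O1*A1, W, O2).
    psi2:  three estimates multiplying A2 (intercept, A1, O2).
    """
    if child["responder"]:
        # Assumption: responders are never re-randomized, so their
        # observed outcome is used as is.
        return child["Y"]
    b0, b1, b2, b3, b4, b5, b6 = beta2
    p1, p2, p3 = psi2
    main = (b0 + b1 * child["O1"] + b2 * child["B"] + b3 * child["A1"]
            + b4 * child["O1"] * child["A1"] + b5 * child["W"]
            + b6 * child["O2"])
    # The maximum over A2 in {-1, 1} of (p1 + p2*A1 + p3*O2) * A2
    # equals the absolute value of the bracketed term.
    return main + abs(p1 + p2 * child["A1"] + p3 * child["O2"])
```

Applying this function to every child yields the response variable for the first stage regression on B, O1, A1 and O1*A1.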


Notice that the estimated quality of the second stage intervention is a non-smooth function of ψ2 (it is non-differentiable at ψ21 + ψ22 A1 + ψ23 O2 = 0), because of the maximization operation |ψ21 + ψ22 A1 + ψ23 O2|. Since the estimate of ψ1 is a function of the estimated quality of the second stage intervention, it is in turn a non-smooth function of the estimate of ψ2, and hence a non-regular estimator. Accordingly, the usual Wald-type significance tests for making inference concerning ψ1 tend to perform poorly (Chakraborty, Murphy, & Strecher, 2009; Robins, 2004). The issue of non-regularity and the inference problems associated with the first stage regression are discussed in detail in Chakraborty et al. (2009), and are beyond the scope of the current manuscript. Still, in order to overcome the inference problems noted above, we constructed confidence intervals for ψ1 using the soft thresholding operation recommended by Chakraborty et al. (2009).
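A small Monte Carlo sketch can illustrate why the absolute value causes trouble. It does not implement the soft thresholding method; it only shows, for a single normally distributed estimate (an assumed simplification), that plugging the estimate into |·| is biased upward when the true value is at the non-differentiable point zero and essentially unbiased far from it. The function name `abs_plugin_bias` is hypothetical.

```python
import math
import random


def abs_plugin_bias(psi, se, n_draws=100_000, seed=0):
    """Monte Carlo bias of |psi_hat| as an estimate of |psi|,
    where psi_hat ~ N(psi, se^2)."""
    rng = random.Random(seed)
    mean_abs = sum(abs(rng.gauss(psi, se)) for _ in range(n_draws)) / n_draws
    return mean_abs - abs(psi)


# Theoretical bias at psi = 0: the mean of a half-normal, se * sqrt(2 / pi).
BIAS_AT_ZERO_PER_SE = math.sqrt(2 / math.pi)
```

With `psi = 0` the simulated bias is close to `se * BIAS_AT_ZERO_PER_SE`, while with `psi` several standard errors from zero it is close to zero. The limiting behavior of the first stage estimate therefore depends on where the true second stage parameters lie, which is the setting in which standard normal-theory intervals can be unreliable.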

Results

Table 1 presents the results for the second stage regression. Based on these estimates, we estimated the term (ψ21 + ψ22 A1 + ψ23 O2) for every combination of A1 and O2 (see Table 2).

----------------------- Table 1 & Table 2 about here

-----------------------

The results in Table 1 show that the effect of the second stage intervention (A2) is negative and marginally significant (ψ21 = -.42, lower limit .10 CI = -.79, upper limit .10 CI = -.06). The interaction between the first stage intervention (A1) and the second stage intervention (A2) was not found to be statistically significant (ψ22 = -0.003, lower limit .10 CI = -.24, upper limit .10 CI = .23), and the interaction between adherence to first stage intervention (O2) and the second stage intervention (A2) was found to be statistically significant (ψ23 = .75, lower limit .10 CI = .27, upper limit .10 CI = 1.23).

The results in Table 2 indicate that when adherence to the first stage intervention is low (O2 = 0), the term (ψ21 + ψ22 A1 + ψ23 O2) is negative and marginally significant, regardless of whether the first stage intervention was medication (estimate = -.42, lower limit .10 CI = -.87, upper limit .10 CI = .02) or behavioral intervention (estimate = -.43, lower limit .10 CI = -.85, upper limit .10 CI = -.01). Accordingly, when adherence to the first stage intervention is low, the term (ψ21 + ψ22 A1 + ψ23 O2)A2 is maximized when A2 = −1 (augmenting the first stage intervention with the other type of intervention). However, when adherence to the first stage intervention is high (O2 = 1), the term (ψ21 + ψ22 A1 + ψ23 O2) was not found to be significantly different from zero, regardless of whether the first stage intervention was medication (estimate = .33, lower limit .10 CI = -.07, upper limit .10 CI = .73) or behavioral intervention (estimate = .33, lower limit .10 CI = -.04, upper limit .10 CI = .70).
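The entries of Table 2 are linear combinations of three coefficients from Table 1, which can be checked with a short sketch. The constants below are the rounded estimates reported in Table 1; the names `PSI21`, `PSI22`, `PSI23`, `stage2_term` and `best_a2` are introduced here for illustration only.

```python
# Rounded estimates from Table 1: A2, A1*A2 and O2*A2 coefficients.
PSI21, PSI22, PSI23 = -0.42, -0.003, 0.75


def stage2_term(a1, o2):
    """Estimated multiplier of A2 for first stage option a1 (-1 or 1)
    and adherence o2 (0 = low, 1 = high)."""
    return PSI21 + PSI22 * a1 + PSI23 * o2


def best_a2(a1, o2):
    """Value of A2 in {-1, 1} that maximizes stage2_term(a1, o2) * A2.
    Uses the point estimate only and ignores its uncertainty."""
    return 1 if stage2_term(a1, o2) > 0 else -1


if __name__ == "__main__":
    for a1 in (-1, 1):
        for o2 in (1, 0):
            print(a1, o2, round(stage2_term(a1, o2), 3), best_a2(a1, o2))
```

Because the inputs are rounded, the results agree with Table 2 only up to rounding in the second decimal place. The point-estimate rule also says nothing about significance, which is why the high-adherence cells are read in the text as providing no evidence for either option.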

Overall, the results of the second stage regression indicate that for non-responders to the first stage intervention (regardless of whether the first stage intervention was medication or behavioral intervention), if adherence to the first stage intervention is low, augmenting the first stage intervention with the other type of intervention (A2 = −1) leads to better classroom performance relative to intensifying the first stage intervention (A2 = 1). However, if adherence to the first stage intervention is high, there is no evidence to differentiate between the second stage intervention options. Figure 7 presents the predicted means for each of the second stage intervention options (A2), given the first stage intervention (A1) and adherence to first stage intervention (O2).

------------- Figure 7 about here

-------------

Table 3 presents the results for the first stage regression. Based on these estimates, we estimated the term (ψ11 + ψ12 O1) for each value of O1 (see Table 4).

----------------------- Table 3 & Table 4 about here

-----------------------


The results in Table 3 indicate that the effect of the first stage intervention (A1) is positive and marginally significant (ψ11 = .20, lower limit .10 CI = .002, upper limit .10 CI = .38), and the interaction between the first stage intervention (A1) and medication prior to first stage intervention (O1) is negative and marginally significant (ψ12 = -.24, lower limit .10 CI = -.49, upper limit .10 CI = -.01).

The results in Table 4 indicate that the term (ψ11 + ψ12 O1) is positive and marginally significant when O1 = 0 (estimate = .20, lower limit .10 CI = .002, upper limit .10 CI = .38). However, when O1 = 1 the term (ψ11 + ψ12 O1) is not significantly different from zero (estimate = -.04, lower limit .10 CI = -.33, upper limit .10 CI = .24). This means that, given that the best second stage intervention option was offered to non-responders, a low dose of behavioral intervention (A1 = 1) leads to better classroom performance relative to a low dose of medication (A1 = -1) for children who did not receive medication at school prior to the first stage intervention (O1 = 0). However, there is no evidence favoring either first stage intervention option for children who received medication at school prior to the first stage intervention. Figure 8 presents the predicted means for each of the first stage intervention options (A1), given whether or not the child received medication at school prior to first stage intervention (O1).

------------- Figure 8 about here

-------------

Overall, the optimal sequence of decision rules based on the second and the first stage

regressions is as follows:

IF the child received medication at school prior to the first stage intervention

    THEN offer a low dose of medication or a low dose of behavioral intervention.

ELSE IF the child did not receive medication at school prior to the first stage intervention

    THEN offer a low dose of behavioral intervention.

Then,

IF the child shows inadequate response to the first stage intervention

    THEN IF the child’s adherence to the first stage intervention is low,

        THEN augment the first stage intervention with the other type of intervention.

    ELSE IF the child’s adherence to the first stage intervention is high,

        THEN augment the first stage intervention with the other type of intervention or intensify the first stage intervention.

ELSE IF the child shows adequate response to the first stage intervention,

    THEN continue the first stage intervention.
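The rules above can be written as two small functions, which may make it easier to see that each rule returns a set of acceptable options rather than a single one when the data do not separate them. The function names and the option labels are illustrative assumptions.

```python
MEDICATION = "low dose of medication"
BEHAVIORAL = "low dose of behavioral intervention"
AUGMENT = "augment with the other type of intervention"
INTENSIFY = "intensify the first stage intervention"
CONTINUE = "continue the first stage intervention"


def first_stage_options(prior_medication):
    """Acceptable first stage options given medication at school
    during the previous school year (O1)."""
    if prior_medication:
        return [MEDICATION, BEHAVIORAL]  # no evidence favoring either
    return [BEHAVIORAL]


def second_stage_options(responder, high_adherence):
    """Acceptable second stage options given response status and
    adherence to the first stage intervention (O2)."""
    if responder:
        return [CONTINUE]
    if high_adherence:
        return [AUGMENT, INTENSIFY]      # no evidence favoring either
    return [AUGMENT]
```

For example, `second_stage_options(responder=False, high_adherence=False)` returns only the augmentation option, matching the low-adherence branch of the rules.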


References:

Adelman, H.S., & Taylor, L. (1994). On Understanding Intervention in Psychology and

Education. Westport CT: Praeger.

Beck, J.S., Liese, B.S., and Najavits, L.M. (2005). Cognitive therapy. In R.J. Frances, S.I. Miller,

and A. Mack (Eds.), Clinical Textbook of Addictive Disorders (3rd edition). NY: Guilford

Press.

Bierman, K.L., Nix, R.L., Maples, J.J., and Murphy, S.A. (2006). Examining clinical judgment

in an adaptive intervention design: The Fast Track Program. Journal of Consulting and

Clinical Psychology, 74, 468-481.

Brand, E., Lakey, B., & Berman, S. (1995). A preventive, psychoeducational approach to

increase perceived support. American Journal of Community Psychology, 23, 117–136.

Baron, R. M., & Kenny, D. A. (1986). The moderator–mediator variable distinction in social

psychological research: Conceptual, strategic, and statistical considerations. Journal of

Personality and Social Psychology, 51, 1173–1182.

Bellman, R.E. & Dreyfus, S.E. (1962). Applied dynamic programming. RAND Corporation.

Brown, A.L. (1992). Design Experiments: Theoretical and methodological challenges in creating

complex interventions in classroom settings. The Journal of the Learning Sciences, 2, 141-178.

Chakraborty, B., Murphy, S.A., & Strecher, V. (2009). Inference for non-regular parameters in

optimal dynamic treatment regimes. Statistical Methods in Medical Research, In Press.

Cole, M.B. (2005). Group Dynamics in Occupational Therapy: The Theoretical Basis and

Practice Application of Group Treatment (3rd edition). Thorofare, NJ: Slack Inc.


Collins, L.M., Murphy, S.A., & Bierman, K.L. (2004). A conceptual framework for adaptive

preventive interventions. Prevention Science, 5, 185-196.

Conduct Problems Prevention Research Group. (1992). A developmental and clinical model for

the prevention of conduct disorder: The Fast Track program. Development and

Psychopathology, 4, 509-527.

Connell, A., Bullock, B. M., Dishion, T. J., Shaw, D., Wilson, M., & Gardner, F. (2008). Family

intervention effects on co-occurring behavior and emotional problems in early childhood:

A latent transition analysis approach. Journal of Abnormal Child Psychology, 36, 1211-1225.

Cuijpers, P., Jonkers, R., de Weerdt, I., & de Jong, A. (2002). The effects of drug abuse

prevention at school: The ‘Healthy School and Drugs’ project. Addiction, 97, 67–73.

Hirschi, A., & Vondracek, F.W. (2009). Adaptation of career goals to self and opportunities in

early adolescence. Journal of Vocational Behavior, 75, 120-128.

Fabiano, G. A., Pelham, W. E., Waschbusch, D. A., Gnagy, E. M., Lahey, B. B., Chronis, A. M.,

et al. (2006). A practical measure of impairment: Psychometric properties of the

impairment rating scale in samples of children with Attention Deficit Hyperactivity

Disorder and two school-based samples. Journal of Clinical Child and Adolescent

Psychology, 35, 369–385.

Lavori, P.W. & Dawson, R. (2000). A design for testing clinical strategies: biased individually

tailored within-subject randomization. Journal of the Royal Statistical Society A, 163, 29-38.

Laurenceau, J-P, Hayes, A.M. , & Feldman, G.C. (2007). Statistical and methodological issues in

the study of change in psychotherapy. Clinical Psychology Review, 27, 682-695.


Lunceford, J. K., Davidian, M. & Tsiatis, A. A. (2002). Estimation of survival distributions of

treatment policies in two-stage randomization designs in clinical trials. Biometrics, 58,

48-57.

Marlowe, D.B., Festinger, D.S., Arabia, P.L., Dugosh, K.L., Benasutti, K.M., Croft, J.R., &

McKay, J.R. (2008). Adaptive interventions in drug court: A pilot experiment. Criminal

Justice Review, 33, 343-360.

Martocchio, J.J., & Webster, J. (1992). Effects of feedback and cognitive playfulness on

performance in microcomputer software training. Personnel Psychology, 45, 553-578.

McKay, J.R. (2005). Is there a case for extended interventions for alcohol and drug use

disorders? Addiction, 100, 1594-1610.

MacKinnon, D.P., Warsi, G., & Dwyer, J.H. (1995). A simulation study of mediated effect

measures. Multivariate Behavioral Research, 30, 41–62.

Murphy, S.A. (2005). An experimental design for the development of adaptive treatment

strategies. Statistics in Medicine, 24, 1455-1481.

Murphy, S.A., & Bingham, D. (2009). Screening experiments for developing dynamic treatment

regimes. Journal of the American Statistical Association, In Press.

Murphy, S.A, Collins, L.M., & Rush, A.J. (2007). Customizing treatment to the patient:

Adaptive treatment strategies (editorial). Drug and Alcohol Dependence, 88, S1-S72.

Murphy, S.A., Lynch, K.G., Oslin, D., Mckay, J.R. & TenHave, T. (2007). Developing adaptive

treatment strategies in substance abuse research. Drug and Alcohol Dependence, 88s,

s24-s30.

Murphy, S.A., van der Laan, M.J., Robins, J.M. & CPPR (2001). Marginal mean models for

dynamic regimes. Journal of American Statistical Association, 96, 1410-1423.


Neyman, J. (1923). On the application of probability theory to agricultural experiments.

Translated in Statistical Science, 5, 465-480 (1990).

Pelham, W.E., & Fabiano, G.A. (2008). Evidence-based psychosocial treatment for

attention-deficit/hyperactivity disorder. Journal of Clinical Child and Adolescent Psychology, 37, 184–214.

Petersen, M.L., Deeks, S.G. and van der Laan, M.J. (2007). Individualized treatment rules:

Generating candidate clinical trials. Statistics in Medicine, 26, 4578-4601.

Pliszka, S. (2007). Practice parameter for the assessment and treatment of children and

adolescents with attention-deficit/hyperactivity disorder. Journal of the American

Academy of Child & Adolescent Psychiatry, 46, 894-921.

Raudenbush, S. W., Hong, G., & Rowan, B. (2002). Studying the causal effects of instruction

with application to primary-school mathematics. Paper presented at the Research Seminar

II: Instructional and Performance Consequences of High-Poverty Schooling, National

Center for Educational Statistics, Washington, DC.

Robins, J.M. (1986). A new approach to causal inference in mortality studies with sustained

exposure periods - application to control of the healthy worker survivor effect. Mathematical

Modelling, 7, 1393-1512.

Robins, J.M. (1987). Addendum to “A new approach to causal inference in mortality studies

with sustained exposure periods -application to control of the healthy worker survivor

effect.” Computers and Mathematics with Applications, 14, 923-945.

Robins, J.M. (2004). Optimal structural nested models for optimal sequential decisions. In D.Y.

Lin, and P. Heagerty (Eds.), Proceedings of the Second Seattle Symposium on

Biostatistics (pp. 189-326). NY: Springer.


Rubin, D.B. (1978). Bayesian inference for causal effects: the role of randomization. The Annals

of Statistics, 6, 34-58.

Sutton, R.S. & Barto, A.G. (1998). Reinforcement Learning: An Introduction. Cambridge, Mass:

MIT Press.

Thall, P.F., Sung, H.G. & Estey, E.H. (2002). Selecting therapeutic strategies based on efficacy

and death in multicourse clinical trials. Journal of the American Statistical Association,

97, 29-39.

Thall, P. F. and Wathen, J. K (2005). Covariate-adjusted adaptive randomization in a sarcoma

trial with multi-stage treatments. Statistics in Medicine, 24, 1947-1964.

van der Laan, M. J. & Petersen, M. L. (2007). Statistical learning of origin-specific statically

optimal individualized treatment rules. The International Journal of Biostatistics, 3,

Article 3.

Wahed, A. S. & Tsiatis, A. A (2004). Optimal estimator for the survival distribution and related

quantities for treatment policies in two-stage randomization designs in clinical trials.

Biometrics, 60, 124-133.

Wahed, A. S. & Tsiatis, A. A (2006). Semiparametric efficient estimation of survival distribution

for treatment policies in two-stage randomization designs in clinical trials with censored

data. Biometrika, 93, 163-177.

Wampold, B. E., Lichtenberg, J. W., & Waehler, C. A. (2002). Principles of empirically

supported interventions in counseling psychology. The Counseling Psychologist, 30,

197–207.

Wandersman, A., & Florin, P. (2003). Community interventions and effective prevention.

American Psychologist, 58, 441–448.


Watkins, C.J.C.H. (1989). Learning from Delayed Rewards. PhD Thesis, University of

Cambridge, England.

Weisz, J.R., Chu, B.C., & Polo, A.J. (2004). Treatment dissemination and evidence-based

practice: Strengthening interventions through clinician-researcher collaboration. Clinical

Psychology: Science and Practice, 11, 300-307.

Yalom, I. D. (1995). The Theory and Practice of Group Psychotherapy (4th edition). NY: Basic

Books.


Table 1: Estimated coefficients for Q2 (N=77).

| Effect | Estimate | SE | Lower limit .10 CI | Upper limit .10 CI |
|---|---|---|---|---|
| Intercept | 1.76 | 0.37 | | |
| B (baseline) | 0.46 | 0.09 | | |
| W (week of non-response) | -0.01 | 0.02 | | |
| O1 (medication prior to first stage intervention) | 0.32 | 0.32 | | |
| O2 (adherence to first stage intervention) | -0.10 | 0.29 | | |
| A1 (first stage intervention) | 0.08 | 0.14 | | |
| A2 (second stage intervention) | -0.42 | 0.21 | -0.79 | -0.06 |
| O2*A2 (adherence to first stage intervention * second stage intervention) | 0.75 | 0.29 | 0.27 | 1.23 |
| A1*A2 (first stage intervention * second stage intervention) | -0.003 | 0.14 | -0.24 | 0.23 |

Table 2: Estimates of (ψ21 + ψ22 A1 + ψ23 O2) for every combination of A1 and O2 (N=77).

A1: 1 = behavioral intervention, -1 = medication. O2: 1 = high adherence, 0 = low adherence.

| A1 | O2 | Estimated (ψ21 + ψ22 A1 + ψ23 O2) | SE | Lower limit .10 CI | Upper limit .10 CI |
|---|---|---|---|---|---|
| -1 | 1 | 0.33 | 0.24 | -0.07 | 0.73 |
| -1 | 0 | -0.42 | 0.27 | -0.87 | 0.02 |
| 1 | 1 | 0.33 | 0.22 | -0.04 | 0.70 |
| 1 | 0 | -0.43 | 0.25 | -0.85 | -0.01 |

Table 3: Estimated coefficients and soft-threshold confidence intervals for Q1 (N=131).

| Effect | Estimate | SE | Lower limit .10 CI | Upper limit .10 CI |
|---|---|---|---|---|
| Intercept | 2.32 | 0.15 | | |
| B (baseline) | 0.43 | 0.04 | | |
| O1 (medication prior to first stage intervention) | 0.09 | 0.13 | | |
| A1 (first stage intervention) | 0.20 | 0.06 | 0.002 | 0.38 |
| O1*A1 (medication prior to first stage intervention * first stage intervention) | -0.24 | 0.13 | -0.49 | -0.01 |

Table 4: Estimates of (ψ11 + ψ12 O1) for each level of O1.

O1: 1 = medication prior to first stage intervention, 0 = no medication prior to first stage intervention.

| O1 | Estimated (ψ11 + ψ12 O1) | SE | Lower limit .10 CI | Upper limit .10 CI |
|---|---|---|---|---|
| 1 | -0.04 | 0.11 | -0.33 | 0.24 |
| 0 | 0.20 | 0.06 | 0.002 | 0.38 |

Figure 1: Illustration of unmeasured confounders affecting O2 and Y.

[Diagram omitted. Nodes: O1, A1, O2, A2, Y, and the unmeasured cause U.]

Figure 2: Distribution of estimated coefficient γ1 + ½(|γ3 + γ4| − |γ3 − γ4|)

Figure 3: Distribution of estimated coefficient ψ11

Figure 4: Sequential Multiple Assignment Randomized Trial for ADHD study

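[Diagram omitted. Structure of the trial, where R denotes a randomization:]

R (first stage):

    Medication
        Responders: Continue Medication
        Non-Responders: R (second stage): Increase Medication Dose, or Add Behavioral Intervention

    Behavioral Intervention
        Responders: Continue Behavioral Intervention
        Non-Responders: R (second stage): Increase Behavioral Intervention, or Add Medication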

Figure 5: Distribution for % days on behavioral intervention for those assigned to low dose of behavioral intervention as the first stage intervention.


Figure 6: Distribution for % days on medication for those assigned to low dose of medication as the first stage intervention.

Figure 7: Predicted mean of classroom performance for each of the stage 2 intervention options (A2), given the first stage intervention (A1) and adherence to first stage intervention (O2).

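[Plot omitted. Horizontal axis: Stage 2 intervention (A2), with options Add and Enhance. Vertical axis: predicted mean classroom performance, shown from 2 to 3.75. Lines: A1=BMOD, O2=High Adherence; A1=BMOD, O2=Low Adherence (Diff=0.85, P<.10); A1=MED, O2=High Adherence; A1=MED, O2=Low Adherence (Diff=.84, P<.10).]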

Figure 8: Predicted estimated quality of the second stage intervention for each of the first stage intervention options (A1), given whether or not the child received medication at school prior to first stage intervention (O1).

[Plot omitted. Horizontal axis: Stage 1 intervention (A1), with options Behavioral Intervention and Medication. Vertical axis: Predicted Estimated Quality of Stage 2 Intervention, shown from 3 to 4. Lines: O1=Medication at school prior to stage 1 intervention; O1=No medication at school prior to stage 1 intervention (Diff=0.40, P<.10).]

Appendix 1: Proof of (3)

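First note that for a1 = -1 or 1 and a2 = -1 or 1,

$$E[Y \mid A_1 = a_1, O_2 = 1, A_2 = a_2] = \delta_0 + \delta_1 E[U \mid A_1 = a_1, O_2 = 1, A_2 = a_2] = \delta_0 + \delta_1 P(U = 1 \mid A_1 = a_1, O_2 = 1, A_2 = a_2),$$

where the second equality follows since U is Bernoulli.

By the basic properties of conditional probability, we have

$$P(U = 1 \mid A_1 = a_1, O_2 = 1, A_2 = a_2) = \frac{P(U = 1, A_1 = a_1, O_2 = 1, A_2 = a_2)}{P(A_1 = a_1, O_2 = 1, A_2 = a_2)} = \frac{P(A_2 = a_2 \mid U = 1, A_1 = a_1, O_2 = 1)\, P(U = 1, A_1 = a_1, O_2 = 1)}{P(A_2 = a_2 \mid A_1 = a_1, O_2 = 1)\, P(A_1 = a_1, O_2 = 1)}.$$

Since the Time 2 intervention A2 is randomly assigned to 1 or −1 with probability ½ each, given O2 = 1, we have P(A2 = a2 | U = 1, A1 = a1, O2 = 1) = P(A2 = a2 | A1 = a1, O2 = 1) = ½ for a2 = 1 or −1. This implies

$$P(U = 1 \mid A_1 = a_1, O_2 = 1, A_2 = a_2) = \frac{P(U = 1, A_1 = a_1, O_2 = 1)}{P(A_1 = a_1, O_2 = 1)}.$$

In the following we derive the joint probability P(U = u, A1 = a1, O2 = 1) for u = 0 or 1, and a1 = -1 or 1. Note that A1 and U are independently distributed. Thus

$$P(U = u, A_1 = a_1) = P(U = u)\, P(A_1 = a_1) = \tfrac{1}{2} \times \tfrac{1}{2} = \tfrac{1}{4}.$$

It follows that

$$P(U = u, A_1 = a_1, O_2 = 1) = P(O_2 = 1 \mid U = u, A_1 = a_1)\, P(U = u, A_1 = a_1) = \frac{1}{4}\left[ u \left( \frac{q_1 + q_2}{2} + \frac{q_1 - q_2}{2} a_1 \right) + (1 - u) \left( \frac{q_3 + q_4}{2} + \frac{q_3 - q_4}{2} a_1 \right) \right].$$

Hence,

$$P(U = 1, A_1 = a_1, O_2 = 1) = \frac{1}{4}\left( \frac{q_1 + q_2}{2} + \frac{q_1 - q_2}{2} a_1 \right) = \begin{cases} q_1 / 4 & \text{if } a_1 = 1 \\ q_2 / 4 & \text{if } a_1 = -1 \end{cases}$$

and

$$P(A_1 = a_1, O_2 = 1) = P(U = 1, A_1 = a_1, O_2 = 1) + P(U = 0, A_1 = a_1, O_2 = 1)$$

$$= \frac{1}{4}\left( \frac{q_1 + q_2}{2} + \frac{q_1 - q_2}{2} a_1 \right) + \frac{1}{4}\left( \frac{q_3 + q_4}{2} + \frac{q_3 - q_4}{2} a_1 \right) = \begin{cases} (q_1 + q_3) / 4 & \text{if } a_1 = 1 \\ (q_2 + q_4) / 4 & \text{if } a_1 = -1. \end{cases}$$

Therefore

$$E[Y \mid A_1 = a_1, O_2 = 1, A_2 = a_2] = \delta_0 + \delta_1 \frac{P(U = 1, A_1 = a_1, O_2 = 1)}{P(A_1 = a_1, O_2 = 1)}$$

$$= \delta_0 + \delta_1 \left[ \frac{1 + a_1}{2} \times \frac{q_1}{q_1 + q_3} + \frac{1 - a_1}{2} \times \frac{q_2}{q_2 + q_4} \right]$$

$$= \delta_0 + \frac{\delta_1}{2} \left( \frac{q_1}{q_1 + q_3} + \frac{q_2}{q_2 + q_4} \right) + \frac{\delta_1}{2} \left( \frac{q_1}{q_1 + q_3} - \frac{q_2}{q_2 + q_4} \right) a_1.$$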
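The final expression can be checked numerically by enumerating the two values of U. In the sketch below the labels q1, ..., q4 are an assumed notation for P(O2 = 1 | U = u, A1 = a1) at (u, a1) = (1, 1), (1, −1), (0, 1), (0, −1), and the function names are hypothetical. Exact fractions are used so that the comparison involves no rounding.

```python
from fractions import Fraction as F


def p_u1_by_enumeration(a1, q):
    """P(U = 1 | A1 = a1, O2 = 1), with U Bernoulli(1/2) independent of a
    randomized A1; q[(u, a1)] is P(O2 = 1 | U = u, A1 = a1)."""
    joint = {u: F(1, 4) * q[(u, a1)] for u in (0, 1)}  # P(U=u, A1=a1, O2=1)
    return joint[1] / (joint[0] + joint[1])


def p_u1_closed_form(a1, q):
    """The same probability from the last line of the derivation
    (the bracketed factor that multiplies the coefficient of U)."""
    q1, q2, q3, q4 = q[(1, 1)], q[(1, -1)], q[(0, 1)], q[(0, -1)]
    r_plus, r_minus = q1 / (q1 + q3), q2 / (q2 + q4)
    return F(1, 2) * (r_plus + r_minus) + F(1, 2) * (r_plus - r_minus) * a1


# Arbitrary illustrative probabilities, not values from the study.
example_q = {(1, 1): F(3, 5), (1, -1): F(2, 5), (0, 1): F(1, 5), (0, -1): F(3, 10)}
```

For any choice of the four probabilities the two functions agree at a1 = 1 and a1 = −1. The coefficient of a1 vanishes only when q1/(q1 + q3) equals q2/(q2 + q4), so A1 generally appears to predict Y once O2 is conditioned on, even though A1 does not enter the model for Y.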


Appendix 2

Let $(O_1, A_1, O_2, A_2, Y)$ denote the observable data of an individual in a randomized trial. In this section, we show that when individuals are randomly assigned to intervention options (randomization probabilities may depend on past information) at each stage,

$$
P(Y \le y \mid O_1=o_1, A_1=a_1, O_2=o_2, A_2=a_2) = P(Y \le y \mid O_1=o_1, a_1, O_2=o_2, a_2) \tag{A.1}
$$

and

$$
P(O_2 \le o_2 \mid O_1=o_1, A_1=a_1) = P(O_2 \le o_2 \mid O_1=o_1, a_1) \tag{A.2}
$$

for all possible values of $(o_1, a_1, o_2, a_2)$. Note that the conditional probabilities on the left hand side are based on the multivariate distribution where the interventions $(A_1, A_2)$ may vary across individuals. For example, $P(Y \le y \mid O_1=o_1, A_1=a_1, O_2=o_2, A_2=a_2)$ is the distribution of $Y$ among the subpopulation of individuals with $(O_1=o_1, A_1=a_1, O_2=o_2, A_2=a_2)$. Data from the randomized trial provide information about these conditional probabilities. In contrast, the conditional probabilities on the right hand side are based on the multivariate distribution where all individuals have the same interventions $(a_1, a_2)$. For example, $P(Y \le y \mid O_1=o_1, a_1, O_2=o_2, a_2)$ is the distribution of $Y$ among the subpopulation of individuals with $(O_1=o_1, O_2=o_2)$ if all individuals were assigned $(a_1, a_2)$. We need information about these conditional probabilities to construct the optimal dynamically adaptive interventions. When interventions are sequentially randomized (randomization probabilities may depend on past information), the left hand side probabilities equal the right hand side probabilities. Thus data from the randomized trial can be used to develop the optimal dynamically adaptive interventions. Below we prove (A.1) and (A.2) using the potential outcome framework (Neyman, 1923; Rubin, 1978; Robins, 1986, 1987).


For each fixed sequence of interventions $(a_1, a_2)$, we conceptualize potential outcomes denoted by $O_2(a_1)$ and $Y(a_1, a_2)$, where $O_2(a_1)$ denotes the observations that an individual would have had at the second stage if he/she had followed $a_1$ at stage 1, and $Y(a_1, a_2)$ is the primary outcome that would have been observed had an individual followed the sequence $(a_1, a_2)$. Let $\mathcal{A}_1$ and $\mathcal{A}_2$ denote the sets of all possible interventions at stages 1 and 2, respectively. Then the set of all potential outcomes is $W = \{O_1, O_2(a_1), Y(a_1, a_2) : a_1 \in \mathcal{A}_1, a_2 \in \mathcal{A}_2\}$ ($O_1$ is included for completeness). Notice that the potential outcomes are only functions of the interventions $(a_1, a_2)$ since we will only manipulate interventions. By definition, the multivariate distribution of $(O_1, O_2(a_1), Y(a_1, a_2))$ is the multivariate distribution of $(O_1, O_2, Y)$ when the sequence of interventions is set at $(a_1, a_2)$ for all individuals. This is the distribution needed to construct the optimal adaptive interventions.

Assuming Robins’ consistency assumption holds (i.e. an individual’s intervention assignment does not affect other individuals’ outcomes; see Robins and Wasserman (1997)), the potential outcomes are connected to the individual’s data by $O_2 = O_2(A_1)$ and $Y = Y(A_1, A_2)$. In addition, since the randomization probabilities in the randomized trial only depend on past information, $A_2$ is independent of the set of all potential outcomes $W$ given $(O_1, A_1, O_2)$, and $A_1$ is independent of $W$ given $O_1$. Hence,

$$
\begin{aligned}
P(Y \le y \mid O_1=o_1, A_1=a_1, O_2=o_2, A_2=a_2)
&= P(Y(a_1, a_2) \le y \mid O_1=o_1, A_1=a_1, O_2=o_2, A_2=a_2) \\
&= P(Y(a_1, a_2) \le y \mid O_1=o_1, A_1=a_1, O_2=o_2) \\
&= P(Y(a_1, a_2) \le y \mid O_1=o_1, A_1=a_1, O_2(a_1)=o_2) \\
&= P(Y(a_1, a_2) \le y \mid O_1=o_1, O_2(a_1)=o_2) \\
&= P(Y \le y \mid O_1=o_1, a_1, O_2=o_2, a_2),
\end{aligned}
$$

where the first and third equalities follow from the consistency assumption, the second and fourth equalities follow from the fact that the randomization probabilities depend only on past information, and the last equality follows from the definition of potential outcomes.

Similarly, we can show that (A.2) holds.
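The equality in (A.1) can also be illustrated by exact enumeration in a small discrete example. The sketch below is illustrative only and makes several assumptions that are not in the original text: all variables are binary, an unobserved cause `u` affects O1, the potential outcomes O2(a1) and the potential outcomes Y(a1, a2), and the functions `p_o2`, `p_y`, `p_a1` and `p_a2` are hypothetical probability models. The randomization probabilities depend only on past observed information, as required.

```python
from fractions import Fraction as F
from itertools import product

ARMS = (-1, 1)
SEQS = list(product(ARMS, ARMS))  # all intervention sequences (a1, a2)


def bern(p, x):
    """P(X = x) for a Bernoulli(p) variable."""
    return p if x == 1 else 1 - p


# Hypothetical models; u is an unobserved cause of O1, O2 and Y.
def p_o2(u, o1, a1):    # P(O2(a1) = 1 | U, O1)
    return F(1 + u + o1 + (a1 == 1), 5)

def p_y(u, o2, a1, a2):  # P(Y(a1, a2) = 1 | U, O2(a1))
    return F(1 + 2 * u + o2 + (a1 == a2), 6)

def p_a1(o1):            # randomization: P(A1 = 1 | O1)
    return F(1, 3) if o1 else F(1, 2)

def p_a2(o1, a1, o2):    # randomization: P(A2 = 1 | O1, A1, O2)
    return F(1 + o1 + o2 + (a1 == 1), 5)


def worlds():
    """Yield (prob, o1, o2_pot, y_pot) over the full set of potential outcomes."""
    for u, o1 in product((0, 1), repeat=2):
        base = F(1, 2) * bern(F(1 + u, 3), o1)
        for o2_vals in product((0, 1), repeat=2):
            o2_pot = dict(zip(ARMS, o2_vals))
            pr_o2 = base
            for a1 in ARMS:
                pr_o2 *= bern(p_o2(u, o1, a1), o2_pot[a1])
            for y_vals in product((0, 1), repeat=4):
                y_pot = dict(zip(SEQS, y_vals))
                pr = pr_o2
                for a1, a2 in SEQS:
                    pr *= bern(p_y(u, o2_pot[a1], a1, a2), y_pot[(a1, a2)])
                yield pr, o1, o2_pot, y_pot


def observed(o1, a1, o2, a2):
    """Left side of (A.1): P(Y=1 | O1=o1, A1=a1, O2=o2, A2=a2) in the trial."""
    num = den = F(0)
    for pr, w_o1, o2_pot, y_pot in worlds():
        for t1, t2 in SEQS:              # randomly assigned arms
            obs_o2 = o2_pot[t1]          # consistency: O2 = O2(A1)
            w = (pr * bern(p_a1(w_o1), int(t1 == 1))
                    * bern(p_a2(w_o1, t1, obs_o2), int(t2 == 1)))
            if (w_o1, t1, obs_o2, t2) == (o1, a1, o2, a2):
                den += w
                num += w * y_pot[(t1, t2)]  # consistency: Y = Y(A1, A2)
    return num / den


def potential(o1, a1, o2, a2):
    """Right side of (A.1): P(Y(a1,a2)=1 | O1=o1, O2(a1)=o2)."""
    num = den = F(0)
    for pr, w_o1, o2_pot, y_pot in worlds():
        if w_o1 == o1 and o2_pot[a1] == o2:
            den += pr
            num += pr * y_pot[(a1, a2)]
    return num / den


cells = list(product((0, 1), ARMS, (0, 1), ARMS))  # (o1, a1, o2, a2)
all_equal = all(observed(*c) == potential(*c) for c in cells)
print(all_equal)
```

Because the outcome is binary here, comparing P(Y=1 | ·) on both sides is equivalent to comparing the full conditional distributions. The check relies on `p_a1` and `p_a2` taking only past observed quantities as arguments; if either were allowed to depend on `u`, the independence conditions used in the proof would no longer be guaranteed.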
