Q-Learning: A Data Analysis Method for Constructing Adaptive Interventions
Technical Report Feb 2010
Inbal Nahum-Shani, Min Qian, William E. Pelham, Beth Gnagy, Greg Fabiano, Jim Waxmonsky, Jihnhee Yu, Susan Murphy.
Inbal Nahum-Shani: The Methodology Center, Pennsylvania State University, 204 E. Calder Way, Suite 400, State College, PA 16801
Min Qian and Susan Murphy: Department of Statistics and Quantitative Methodology Program, Institute for Social Research, University of Michigan 2068 Ann Arbor, MI 48106-1248.
William E. Pelham, Beth Gnagy, Greg Fabiano, Jim Waxmonsky, Jihnhee Yu: Center for Children and Families, State University of New York at Buffalo, 318 Diefendorf Hall 3435 Main Street, Building 20, Buffalo, NY 14214.
In recent years, research on treatment and intervention development has shifted from the
traditional “one size fits all” concept underlying standard fixed intervention strategies to
developing adaptive interventions, in which the dose or type of services offered to clients
is individualized based on clients’ characteristics or clinical presentation, and then readjusted in
response to their ongoing performance in treatment (Marlowe et al. 2008: 343). In the first
approach, the composition and dosage of the intervention are not adjusted in response to the
needs or characteristics of individual subjects. The latter approach is based on the notion that
individuals differ in their responses to treatment, so that for a program to be effective,
the intervention should vary over time in response to the needs of the individual. In this sense,
researchers are becoming increasingly interested in developing interventions that adapt to the
dynamics of the system of interest (e.g., individuals, social groups) via decision rules that
recommend when and how the intervention should be modified in order to maximize long term
outcomes. These recommendations are based not only on subjects’ characteristics but also on
outcomes collected during the intervention, such as subjects’ response and adherence. It follows
that dynamically adaptive interventions are time-varying interventions that adapt to subjects’
intermediate outcomes and characteristics. These types of interventions are also known as
‘dynamic treatment regimes’ (Murphy et al., 2001; Robins, 1986), ‘adaptive treatment strategies’
(Lavori & Dawson, 2000; Murphy 2005), ‘multi-stage treatment strategies’ (Thall et al. 2002;
Thall & Wathen 2005), ‘treatment policies’ (Lunceford et al. 2002; Wahed & Tsiatis 2004, 2006)
or ‘individualized treatment rules’ (Petersen et al. 2007; van der Laan & Petersen 2007).
The conceptual advantages of adaptive interventions have long been recognized by
behavioral and social scientists. For example, in the area of learning and education, Brown’s
(1992) discussion of knowledge acquisition emphasizes the need to take into account the variable
responses of students when conducting clinical interviews and tests. Brown suggests a form of
“dynamic assessment” where the interviewer follows a process guided by decision rules in order
to measure students’ emergent competence and open the window of opportunity for learning. In
the area of organizational behavior, Martocchio and Webster (1990) studied the effect of
feedback on employee performance in microcomputer software training, stressing the need to
develop training programs in which training design characteristics are adapted to employee level
of cognitive playfulness (cognitive spontaneity in human-computer interactions). Recently, in
their study of career goal setting, Hirschi and Vondracek (2009) conceptualized the
development and adaptation of goals as a dynamic process, where individuals have to select
goals according to personal preferences and environmental opportunities and limitations,
optimize their behavior to achieve those goals, and compensate and adjust if goals become
unattainable or unattractive. In the area of psychotherapy, Laurenceau, Hayes and Feldman
(2007) noted that growth and change in psychotherapy have been conceptualized as a dynamic
process reflecting the destabilization of a stable behavioral and emotional pattern, which in turn
increases the need to develop more adaptive patterns. Yet despite the appeal of adaptive
interventions to behavioral and social scientists,
data analysis methods for informing the construction of adaptive interventions are still in their
infancy (Murphy, Collins and Rush, 2007). Accordingly, in the current study we aim to introduce
a data analysis method useful in constructing adaptive interventions to researchers in the
behavioral and social sciences.
We begin by discussing adaptive interventions as a vehicle for operationalizing
sequential decision making in behavioral interventions. We then discuss a new, yet
straightforward, method for data analysis, called Q-learning. Q-learning is an analysis method
drawn from computer science that can be used to inform the construction of an adaptive
intervention. We illustrate how this method can be applied to formulate adaptive interventions
using the Adaptive Interventions for Children with ADHD study (Center for Children and
Families, SUNY at Buffalo, William E. Pelham as PI). Finally, we discuss directions for future
research for behavioral and social scientists aiming to develop adaptive interventions.
Adaptive Interventions
Interventions are defined as “planned actions intended to produce desired changes in
existing conditions of persons or environments, usually with the condition identified as a
problem to be ameliorated” (Adelman & Taylor, 1994: 638). Although interventions are widely
conceptualized as complex processes (e.g., Cuijpers, 2002; Wandersman, & Florin, 2003;
Wampold, Lichtenberg, & Waehler, 2002), intervention scientists usually adopt the traditional
approach to intervention development, aiming to construct a “one size fits all” intervention in
which the composition and dosage are fixed. In this approach, all intervention participants are
offered the same intervention composition and dosage. For example, in Brand, Lakey
and Berman’s (1995) study on improving perceived social support among community residents,
the same group-based intervention was delivered to all community residents reporting low levels
of perceived support. Every component of the intervention, which included a combination of
social skills training (e.g., positive assertions to self and others, conflict resolution strategies,
active listening) and cognitive restructuring (e.g., identifying and correcting dysfunctional
attitudes that can occur in relationships, positive self-statements and self-acceptance) was
delivered to all participants with the assumption that each one of these components is necessary
for any particular resident. Moreover, each participant was offered the same intervention dosage
(13 weekly group sessions). Using such a fixed approach to intervention development and
assessment does not take into account the varying intervention needs of individuals (Collins,
Murphy & Bierman, 2004; Connell et al., 2008). It limits the ability to identify the truly
efficacious aspects of each treatment, potentially leading to the inclusion of unnecessary or even
counterproductive components, and does not allow clinicians to match treatments with the individual
recipients most likely to benefit from them.
In recent years, intervention development has been shifting from this traditional fixed
research-based approach toward conceptualizing interventions in terms of sequential processes in which the
varying needs of subjects are taken into consideration (Collins et al., 2004). Notice that this
conceptualization has two components: (1) the intervention is time-varying and (2) the
intervention adapts and readapts to the specific needs of individuals. Weisz, Chu and Polo’s
(2004) discussion of dissemination and evidence-based practice in clinical psychology suggests
that evidence-based practice should ideally consist of much more than simply obtaining an
initial diagnosis and choosing a matching treatment. Evidence-based practice “is not a specific
treatment or a set of treatments, but rather an orientation or a value system that relies on
evidence to guide the entire treatment process. Thus, a critical element of evidence-based care is
periodic assessment to gauge whether the treatment selected initially is in fact proving helpful. If
it is not, adjustments in procedures will be necessary, perhaps several times over the course of
the treatment” (p. 303). In fact, many behavioral and cognitive therapies can be seen as adaptive
interventions (Bierman et al., 2006). For example, group therapy processes are adjusted over
time based on group dynamics and the developmental stage of the group (Cole, 2005; Yalom,
1995). Cognitive therapy is tailored to address the unique cognitive conceptualization of the
patient and in response to his/her progress or worsening during the process, with therapists basing
the format, content, duration and intensity of the upcoming sessions on the results of the
prior sessions (Beck, Liese & Najavits, 2005).
One approach to operationalizing the conceptual idea of an adaptive intervention is to use
decision-rules (Bierman et al., 2006) that link subjects’ characteristics with specific levels and
types of intervention components. This approach is conceptually appealing because it mimics
decision processes in real life where individuals select their actions based on information
obtained from the environment and modify their actions based on this information with the
general aim of maximizing long-term rewards. Take for example a typical classroom scenario
where the teacher selects teaching strategies that fit the needs of his/her students and
continuously modifies these strategies based on students’ responses in class and performance on
exams in order to optimize the learning process.
The assignment of a particular intervention component and level of dosage is based on
the subject’s values on tailoring variables. These variables are expected to moderate the effect of
certain intervention components, and the logic is that the level or type of intervention should be
tailored according to these moderators. Note that all tailoring variables are moderators, but not
all moderators are tailoring variables. For example, consider an intervention to which both men
and women respond better than to a control but the intervention exhibits stronger effects for
women. In this case, although gender is a moderator, both men and women should be offered the
intervention as both groups benefit. Thus, gender is not a tailoring variable (otherwise different
genders should be offered different interventions).
Although the list of candidate tailoring variables depends on the study, common types of
these variables include individual, group, or context characteristics representing risk or
protective factors that influence responsiveness to (or need for) various types or intensity of
intervention components. For example, community residents who are characterized by particular
risk factors, such as low self esteem, are likely to benefit from an intervention that places
more emphasis on cognitive restructuring than on social skills, whereas those with high self
esteem may find other interventions more beneficial. An individual’s responsivity may also be an
important moderator. For example, it is possible that community residents who do not adequately
respond (report low levels of support) to the support intervention within a certain period of time
(say 13 weeks), may need a more intensive intervention or a different type of intervention. In this
case, community residents’ response to the intervention serves as a tailoring variable, allowing
the researcher to modify the support intervention in order to increase its efficiency. Intervention
decisions may also be tailored by previous intervention decisions. For example, the decision
whether to intensify the support intervention may vary as a function of the type of support
intervention given at the first stage. Assume there are two possible first stage interventions, one
that places more emphasis on cognitive restructuring, and one that places more emphasis on
developing social skills. It is possible that intensifying the intervention for non-responders may
be more effective for those residents who were initially assigned to the cognitive restructuring-
based intervention, while for those non-responders assigned to the social skills-based
intervention, it is better to add a cognitive component to the first stage intervention.
To demonstrate how decision rules link information obtained based on the tailoring
variables to intervention options, assume for simplicity that the supportive intervention is only
tailored according to individual’s response to the first stage intervention. In that case, the
decision rule can be expressed as
First stage intervention = {social skill}
IF evaluation = {non-response}
THEN at Step t+1 apply decision {intensify first stage intervention}
ELSE IF evaluation = {response}
THEN at Step t+1 continue on present intervention
Notice that the IF and ELSE IF parts of the rule contain subjects’ tailoring variables; intervention
options are expressed in the THEN portion of this rule.
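As a minimal sketch, the rule above can be written as a small Python function. The function name and the string labels are illustrative choices made for this example; they are not part of any study protocol.

```python
def second_stage_decision(first_stage: str, evaluation: str) -> str:
    """Return the action at step t+1 for the example rule in the text.

    The IF / ELSE IF conditions test the tailoring variable (the
    evaluation of response); the returned string is the THEN part.
    """
    if first_stage != "social skill":
        # The example rule is only stated for this first-stage intervention.
        raise ValueError("rule is defined only for the social-skill first stage")
    if evaluation == "non-response":
        return "intensify first stage intervention"
    if evaluation == "response":
        return "continue on present intervention"
    raise ValueError(f"unknown evaluation: {evaluation!r}")


action = second_stage_decision("social skill", "non-response")
```

Writing the rule as a function makes explicit that its inputs are tailoring variables and its output is an intervention option.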
To construct high quality adaptive interventions, we focus on selecting good decision
rules. Accordingly, unlike fixed interventions, adaptive interventions not only include the
intervention components and dosage, but also an entire system for assigning components and
dosage. In other words, “the choice of tailoring variables, the measures of the tailoring variables,
the decision rules linking tailoring variables to the assignment of components and dosage, and
the implementation of these rules are all an integral part of the intervention itself” (Collins et al.,
2004: 186).
There are quite a few studies in which researchers used decision rules to operationalize
dynamically adaptive interventions. The Fast Track study included a dynamically adaptive
intervention program for preventing conduct problems among high-risk children (Conduct
Problems Prevention Research Group, 1992). Families participating in this study were provided
with home visits at different levels of intensity depending on clinical judgments of parental
functioning and family needs (see Bierman et al., 2006). In the misdemeanor drug court adaptive
intervention developed by Marlowe and colleagues (Marlowe et al., 2008), the frequency of
court hearings and type of counseling session were adjusted according to prespecified criteria in
response to participants’ performance. In a study of alcohol dependent patients, McKay (2005)
evaluated the effectiveness of an adaptive intervention that was built around brief telephone
contacts, consisting of risk for relapse assessment and problem-focused counseling. When risk
levels increased, participants received stepped up care such as frequent telephone sessions and
several sessions of motivational interviewing. The Early Steps Multisite study (ES-M) involved
the application of a family-centered program for reducing emotional and behavioral problems in
children. This intervention was tailored and adapted according to the specific needs of each
family (Connell et al., 2008).
Still, research methods and procedures for finding the best sequence of decision rules are
considered relatively new and complex (Connell et al., 2008). Accordingly, in the following
section we introduce Q-learning (Watkins, 1989) – a novel methodology that can be used for the
construction of adaptive interventions from data. We illustrate the application of this method
using data from a study aiming to develop an adaptive intervention for improving the school-
based performance of children with Attention Deficit Hyperactivity Disorder (ADHD).
Motivation for using Q-learning
When developing adaptive interventions, the aim of the researcher is to find the optimal
sequence of decision rules, namely the sequence of adaptive or individualized decisions for
providing the intervention. Finding the optimal sequence of decision rules belongs to the class of
sequential or multistage decision problems, where a decision which appears optimal in the short-
term may not be a component of the optimal sequence of decisions (Lavori & Dawson, 2000). To
clarify this, consider a sequence of decisions, with one decision point per stage of intervention.
At each stage there may be several possible intervention options. For simplicity, throughout this
manuscript we assume we have only two intervention stages. Denote the intervention at the first
stage by $A_1$ and denote the intervention at the second stage by $A_2$. Also, for simplicity we
assume there are only two intervention options at each stage ($A_1$ and $A_2$ are coded as -1/1). Let
$Y$ denote the primary outcome (the response at the end of the second stage), for which high
values are preferred.
To begin, first suppose that our goal is to find the sequence of non-adaptive decisions,
one per stage of intervention, that when implemented will lead to the maximal expected value of
the primary outcome. Consider a simple study in which individuals are randomized to
intervention options at each stage. A basic way to find an optimal sequence of non-adaptive
decisions, based on data collected from this study, would be to regress $Y$ on $A_1$, $A_2$ and the
interaction between them (because the effect of the second stage decision may vary as a function
of the first stage decision):
(1) $Y \sim \beta_0 + \beta_1 A_1 + \beta_2 A_2 + \beta_3 A_1 A_2$.
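In order to find the best sequence of decisions, that is the sequence of intervention
options that leads to the maximal primary outcome, we estimate the regression coefficients and
simultaneously maximize over $A_1$ and $A_2$. More specifically, since
(*) $\hat{\beta}_0 + \hat{\beta}_1 A_1 + \hat{\beta}_2 A_2 + \hat{\beta}_3 A_1 A_2 = \begin{cases} \hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_2 + \hat{\beta}_3 & \text{if } A_1 = 1 \text{ and } A_2 = 1 \\ \hat{\beta}_0 + \hat{\beta}_1 - \hat{\beta}_2 - \hat{\beta}_3 & \text{if } A_1 = 1 \text{ and } A_2 = -1 \\ \hat{\beta}_0 - \hat{\beta}_1 + \hat{\beta}_2 - \hat{\beta}_3 & \text{if } A_1 = -1 \text{ and } A_2 = 1 \\ \hat{\beta}_0 - \hat{\beta}_1 - \hat{\beta}_2 + \hat{\beta}_3 & \text{if } A_1 = -1 \text{ and } A_2 = -1 \end{cases}$
where $\hat{\beta}_0$, $\hat{\beta}_1$, $\hat{\beta}_2$ and $\hat{\beta}_3$ are the estimated regression coefficients, we choose the sequence of
decisions that maximizes the right hand side of (*). For example, if $\hat{\beta}_0 > 0$, $\hat{\beta}_1 > 0$, $\hat{\beta}_2 < 0$ and
$\hat{\beta}_3 < 0$, then the best sequence is to choose first stage intervention option 1, and then choose
second stage intervention option -1.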
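The maximization over the four cells amounts to comparing four fitted values. A minimal Python sketch, assuming the four coefficient estimates from regression (1) are already available; the function name and the numeric values are hypothetical and chosen only to match the sign pattern in the example above:

```python
from itertools import product


def best_sequence(b0, b1, b2, b3):
    """Return the (a1, a2) pair, each coded -1/1, with the largest fitted value.

    b0..b3 are the estimated intercept, A1 effect, A2 effect and A1*A2
    interaction from regression (1).
    """
    def fitted(a1, a2):
        return b0 + b1 * a1 + b2 * a2 + b3 * a1 * a2

    # Enumerate the four possible non-adaptive sequences and keep the best.
    return max(product((-1, 1), repeat=2), key=lambda seq: fitted(*seq))


# Hypothetical estimates with b0 > 0, b1 > 0, b2 < 0, b3 < 0.
best = best_sequence(1.0, 0.5, -0.3, -0.2)
```

With this sign pattern the selected sequence is option 1 at the first stage followed by option -1 at the second stage, as stated in the text.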
Notice that although the best sequence of decisions based on (1) can be used to construct
a time varying intervention, it cannot be used to construct an adaptive intervention since this
sequence of decisions is not adaptive, in that it is not tailored according to the changing status of
the subject. An example of such a time-varying (yet non-adaptive) intervention approach can be
seen in Raudenbush, Hong and Rowan’s (2002) study of the effects of time-varying mathematics
instructional “treatments”. However, adaptive interventions are not only time varying, but also
adaptive to the dynamics of the environment and hence an optimal adaptive intervention involves
an optimal sequence of decision rules, as opposed to an optimal sequence of decisions.
Consider a simple study in which individuals are randomized to interventions at each
stage and observations on the individual are collected prior to each randomization. Denote the
observations on the individual (possibly a vector) at the beginning of the first stage by $O_1$ and at
the beginning of the second stage by $O_2$. Accordingly, the data record for each subject would be:
$O_1, A_1, O_2, A_2, Y$. In general $O$ contains predictors of the primary outcome. $O_1$ and $O_2$ may
condition (moderate) the effects of the decisions; additionally $O_2$ may be affected by both $O_1$
and $A_1$. Denote the decision rule at the first stage, which takes the available information ($O_1$) as
the input and outputs a decision (i.e., intervention option) $A_1$, by $d_1$. Denote the decision rule at
the second stage, which takes the available information ($O_1, A_1, O_2$) as the input and outputs a
decision (i.e., intervention option) $A_2$, by $d_2$. Our goal is to find the optimal sequence of
decision rules $(d_1^*, d_2^*)$, namely the sequence of decision rules that would lead to the maximal
expected primary outcome if assigned to the entire study population.
In this case an intuitive way to find $(d_1^*, d_2^*)$ would be to extend the regression in (1)
to include $O_1$ and $O_2$ as potential moderators. For example,
(2) $Y \sim \beta_0 + \beta_1 O_1 + \beta_2 A_1 + \beta_3 O_1 A_1 + \beta_4 O_2 + \beta_5 A_2 + \beta_6 A_1 A_2 + \beta_7 O_2 A_2$
However, using estimates based on this equation to make an inference about the optimal
sequence of adaptive decisions $(d_1^*, d_2^*)$ is problematic in two main respects. First, since $O_2$
may be an outcome of $A_1$ and a potential predictor of $Y$, including $O_2$ cuts off any portion of the effect of
$A_1$ on $Y$ that occurs via $O_2$. To clarify this, $O_2$ can be conceptualized as a mediator in the
relationship between $A_1$ and $Y$. Adding $O_2$ to a regression in which $A_1$ is used to predict $Y$ will
reduce the effect of $A_1$. In the presence of $O_2$, the coefficient for $A_1$ no longer expresses the total
effect of the first stage intervention on the outcome, but rather what is left of the total effect (the
direct effect) after cutting off the part of the effect that is mediated by $O_2$ (the indirect effect)
(Baron & Kenny, 1986; MacKinnon, Warsi, & Dwyer, 1995). Note that ascertaining the total
effect of the intervention (say $A_1$) is crucial to finding the best decision rule (say $d_1^*$), as it
provides information concerning the overall effect of the intervention. Although the direct effect
of the intervention may be helpful in identifying mechanisms or processes through which the
intervention may affect the outcome, it does not help the researcher decide which intervention
option is superior. Accordingly, any inference concerning the optimal adaptive decision at the
first stage based on (2) is likely to be biased.
Second, unknown causes of both $O_2$ and $Y$ may introduce biases in the coefficients of the $A_1$
terms (main effects and interactions), such that $A_1$ may falsely appear less or more
correlated with $Y$ because $A_1$ affects $O_2$ while $O_2$ and $Y$ are affected by the same unknown
causes (see Figure 1).
--------------------------- Figure 1 about here
-----------------------------
To demonstrate this, consider the numerical example discussed by Murphy and Bingham
(2009). Let $U$, a Bernoulli random variable with success probability $1/2$,
represent an unknown cause of both $O_2$ and $Y$ (we assume there is no $O_1$ in this case, i.e., $O_1$ in
(2) equals zero). Suppose that $Y = \delta_0 + \delta_1 U + \varepsilon$, where $\varepsilon$ (mean zero, finite variance) is
independent of $(U, O_2, A_1, A_2)$. Thus, there is no effect of $A_1$ or $A_2$ on $Y$. $A_1$, $A_2$ can each
obtain -1/1 values; subjects are randomly assigned to these options at each stage with probability
$1/2$. Next, suppose that $O_2$ can obtain two values, 0 or 1, and
$P[O_2 = 1 \mid U, A_1] = U \left( \frac{q_1 + q_2}{2} + \frac{q_1 - q_2}{2} A_1 \right) + (1 - U) \left( \frac{q_3 + q_4}{2} + \frac{q_3 - q_4}{2} A_1 \right)$
where each $q_j \in [0, 1]$. It follows (see Appendix 1 for the proof) that
(3) $E[Y \mid A_1, O_2 = 1, A_2] = \delta_0 + \frac{\delta_1}{2} \left( \frac{q_1}{q_1 + q_3} + \frac{q_2}{q_2 + q_4} \right) + \frac{\delta_1}{2} \left( \frac{q_1}{q_1 + q_3} - \frac{q_2}{q_2 + q_4} \right) A_1$
Since the coefficient of $A_1$ in (3), $\frac{\delta_1}{2} \left( \frac{q_1}{q_1 + q_3} - \frac{q_2}{q_2 + q_4} \right)$, may be different from the true effect of zero, the coefficient of $A_1$
in (2) reflects bias that occurs because we are conditioning on $O_2$, which is both an outcome of $A_1$
and a predictor of $Y$.
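The closed form in (3) can be checked by direct enumeration over the two values of the unknown cause. The sketch below is a hedged illustration: the tuple `q` holds the four conditional probabilities of a second-stage observation equal to 1 (for U = 1 with first-stage option 1 and -1, then for U = 0 with first-stage option 1 and -1), `d0` and `d1` are the intercept and slope of the outcome on U, and all numeric values are hypothetical. Exact fractions are used so that the comparison involves no rounding.

```python
from fractions import Fraction as F


def p_o2_given(u, a1, q):
    """P(O2 = 1 | U = u, A1 = a1) for probabilities q = (q1, q2, q3, q4)."""
    q1, q2, q3, q4 = q
    if u == 1:
        return (q1 + q2) / 2 + (q1 - q2) / 2 * a1
    return (q3 + q4) / 2 + (q3 - q4) / 2 * a1


def cond_mean(a1, q, d0, d1):
    """E[Y | A1 = a1, O2 = 1] by Bayes' rule, with P(U = 1) = 1/2."""
    joint_u1 = F(1, 2) * p_o2_given(1, a1, q)
    joint_u0 = F(1, 2) * p_o2_given(0, a1, q)
    return d0 + d1 * joint_u1 / (joint_u1 + joint_u0)


def closed_form(a1, q, d0, d1):
    """Right-hand side of (3)."""
    q1, q2, q3, q4 = q
    s, t = q1 / (q1 + q3), q2 / (q2 + q4)
    return d0 + d1 / 2 * (s + t) + d1 / 2 * (s - t) * a1


q = (F(4, 5), F(1, 5), F(2, 5), F(3, 5))  # hypothetical probabilities
d0, d1 = F(0), F(1)                       # outcome = d0 + d1 * U + error
# Apparent first-stage "effect" after conditioning on O2 = 1.
spurious_slope = (cond_mean(1, q, d0, d1) - cond_mean(-1, q, d0, d1)) / 2
```

With these hypothetical values the apparent slope on the first-stage intervention is 5/24, even though the intervention has no effect on the outcome in the generating model.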
Motivated by the inference problems noted above, we introduce Q-learning -- a method
for using data to estimate the optimal sequence of decision rules. Q-learning uses backwards
induction (Bellman & Dreyfus, 1962) to construct a sequence of decision rules that map or link
the observations of the environment (here captured by tailoring variables) to the actions the agent
(Decision Maker) ought to take in order to maximize the desired long-term primary outcome. In
terms of constructing an adaptive intervention, Q-learning can be used to find the sequence of
decision rules that link the subject’s observations (e.g., characteristics and responses to past
decisions) to the most efficient intervention component and dosage. The aim of Q-learning is to
evaluate the intervention components at each stage when the subsequent adaptive decision is
matched to the subjects as opposed to evaluating intervention components as stand-alone
components for each stage. In the following section we show how researchers can use Q-learning
to construct an optimal sequence of adaptive decision rules.
Q-learning
Our goal is to find the optimal sequence of decision rules $(d_1^*, d_2^*)$. In some
applications (e.g., an expert system; see footnote 1) an expert may provide the multivariate distribution of $O_1$, $O_2$
and $Y$, for every sequence of decisions $A_1$, $A_2$. In this case, we can obtain the optimal sequence
of decision rules using backwards induction as follows:
$d_2^*(O_1, A_1, O_2) = \arg\max_{a_2} Q_2(O_1, A_1, O_2, a_2)$,
where $Q_2(O_1, A_1, O_2, A_2) = E[Y \mid O_1, A_1, O_2, A_2]$ is the expectation of the primary outcome
conditioning on $O_1$, $O_2$, for interventions $A_1$, $A_2$. This conditional expectation provides the
quality of the intervention option $A_2$, as it expresses the expected primary outcome of choosing
intervention option $A_2$ now, given the information available (the history: $O_1, A_1, O_2$).
Then, we move backwards in time to find the optimal adaptive decision rule at stage 1,
namely $d_1^*(O_1)$:
$d_1^*(O_1) = \arg\max_{a_1} Q_1(O_1, a_1)$
where $Q_1(O_1, A_1) = E[\max_{a_2} Q_2(O_1, A_1, O_2, a_2) \mid O_1, A_1]$ is the conditional expectation that
provides the quality of choosing intervention option $A_1$ initially, assuming that we choose the
best intervention option at the second stage. That is, it expresses the expected primary outcome
of choosing option $A_1$ given the information available (the history $O_1$).
1 Expert systems (or knowledge-based systems) are defined broadly as computer programs that mimic the reasoning and problem solving of a human ‘expert’. These systems use pre-specified knowledge about the particular problem area. They are based on theoretical models, employing deep knowledge systems as a basis for their operation (Velicer, Prochaska & Redding, 2006).
In general "$Q$" denotes the Quality of the decision, given the history up to that decision
point. $Q_1$ and $Q_2$ are often called Q-functions (Sutton & Barto, 1998). Note that the optimal
decision rules $d_1^*$, $d_2^*$ output the intervention options that maximize $Q_1$, $Q_2$ respectively.
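To make the backwards induction concrete, the following sketch works through a toy problem in which the relevant conditional expectations are assumed known. All numbers are hypothetical: there is no baseline observation, the second-stage observation is binary, and both interventions are coded -1/1. The dictionary `q2` plays the role of the second-stage Q-function and `p_o2` gives the probability that the second-stage observation equals 1 after each first-stage option.

```python
# q2[(a1, o2, a2)] is an assumed value of E[Y | A1 = a1, O2 = o2, A2 = a2].
q2 = {
    (1, 0, 1): 4, (1, 0, -1): 6, (1, 1, 1): 8, (1, 1, -1): 5,
    (-1, 0, 1): 3, (-1, 0, -1): 2, (-1, 1, 1): 7, (-1, 1, -1): 9,
}
# p_o2[a1] is an assumed value of P(O2 = 1 | A1 = a1).
p_o2 = {1: 0.5, -1: 0.75}
OPTIONS = (-1, 1)


def d2_star(a1, o2):
    """Best second-stage option given the history (a1, o2)."""
    return max(OPTIONS, key=lambda a2: q2[(a1, o2, a2)])


def q1(a1):
    """Expected outcome of choosing a1 and then acting optimally at stage 2."""
    best = {o2: q2[(a1, o2, d2_star(a1, o2))] for o2 in (0, 1)}
    return (1 - p_o2[a1]) * best[0] + p_o2[a1] * best[1]


# Move backwards: the first-stage rule maximizes the stage-1 Q-function.
d1_star = max(OPTIONS, key=q1)
```

In this toy problem the first-stage Q-function equals 7.0 for option 1 and 7.5 for option -1, so the optimal first-stage option is -1, followed by whichever second-stage option `d2_star` returns for the observed history.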
The focus here is on the use of data to construct adaptive decision rules; we do not know
the true multivariate distribution of $O_1$, $O_2$ and $Y$. We represent the study data as
$(O_{1i}, A_{1i}, O_{2i}, A_{2i}, Y_i)$, $i = 1, \dots, n$, where $n$ is the number of study participants. Throughout,
for simplicity, we assume that participants are randomly assigned to the two intervention options
at each of the two decision stages (i.e., $A_1$ and $A_2$ are randomized). When participants are
randomly assigned to intervention options (randomization probabilities may depend on past
information), the conditional distributions required to form optimal adaptive decisions are the
same as the corresponding conditional distributions in the data (in Appendix 2 we provide the
proof and relate these expectations to potential outcomes).
We can use the following version of Q-learning (Murphy, 2005) to estimate (i.e.,
“learn”) the Q-functions, from which we construct the optimal sequence of decision rules as
described above. Here, we use linear regressions. The second stage Q-function might be modeled
as:
(4) $Q_2(O_1, A_1, O_2, A_2; \beta_2, \psi_2) = \beta_{20} + \beta_{21} O_1 + \beta_{22} A_1 + \beta_{23} O_1 A_1 + \beta_{24} O_2 + (\psi_{21} + \psi_{22} A_1 + \psi_{23} O_2) A_2$,
where $\beta_2 = (\beta_{20}, \beta_{21}, \beta_{22}, \beta_{23}, \beta_{24})$ and $\psi_2 = (\psi_{21}, \psi_{22}, \psi_{23})$. Notice that our main interest lies
primarily in the parameters $\psi_2$, as they contain information with respect to how the relevant
decision should vary as a function of the candidate tailoring variables (here $A_1$ and $O_2$). Based
on (4) one can see that the second decision ($A_2$) that maximizes $Q_2$ is the one that maximizes the
term $(\psi_{21} + \psi_{22} A_1 + \psi_{23} O_2) A_2$; that is, $A_2$ is 1 if $(\psi_{21} + \psi_{22} A_1 + \psi_{23} O_2)$ is positive and $A_2$ is -1
if $(\psi_{21} + \psi_{22} A_1 + \psi_{23} O_2)$ is negative. We estimate the vector parameters $\beta_2$ and $\psi_2$ by the
following regression:
$Y \sim \beta_{20} + \beta_{21} O_1 + \beta_{22} A_1 + \beta_{23} O_1 A_1 + \beta_{24} O_2 + (\psi_{21} + \psi_{22} A_1 + \psi_{23} O_2) A_2$.
Next, we construct the estimated quality of the second stage intervention. This
intermediate outcome is the expected value of the primary outcome, given that the optimal
adaptive decision was taken at stage 2. That is:
$\tilde{Y}_i = \max_{a_2} Q_2(O_{1i}, A_{1i}, O_{2i}, a_2; \hat{\beta}_2, \hat{\psi}_2)$, $i = 1, \dots, n$,
so that
$\tilde{Y}_i = \hat{\beta}_{20} + \hat{\beta}_{21} O_{1i} + \hat{\beta}_{22} A_{1i} + \hat{\beta}_{23} O_{1i} A_{1i} + \hat{\beta}_{24} O_{2i} + |\hat{\psi}_{21} + \hat{\psi}_{22} A_{1i} + \hat{\psi}_{23} O_{2i}|$.
We use a linear model for the first stage Q-function as well:
(5) $Q_1(O_1, A_1; \beta_1, \psi_1) = \beta_{10} + \beta_{11} O_1 + (\psi_{11} + \psi_{12} O_1) A_1$,
where $\beta_1 = (\beta_{10}, \beta_{11})$ and $\psi_1 = (\psi_{11}, \psi_{12})$. Based on (5) one can see that the first decision ($A_1$)
that maximizes $Q_1$ is the one that maximizes the term $(\psi_{11} + \psi_{12} O_1) A_1$; that is, $A_1$ is 1 if
$(\psi_{11} + \psi_{12} O_1)$ is positive and $A_1$ is -1 if $(\psi_{11} + \psi_{12} O_1)$ is negative. We again use regression to
estimate $\beta_1$ and $\psi_1$ as follows:
$\tilde{Y} \sim \beta_{10} + \beta_{11} O_1 + (\psi_{11} + \psi_{12} O_1) A_1$.
Notice that this time we regress the estimated quality of the second stage intervention (the
predictor of the primary outcome obtained by taking the best intervention option at the second
stage) on $O_1$, $A_1$, and $O_1 A_1$.
Accordingly, the estimated optimal sequence of adaptive decisions (i.e., intervention
options) would be:
$\hat{d}_2(O_1, A_1, O_2) = \arg\max_{a_2} Q_2(O_1, A_1, O_2, a_2; \hat{\beta}_2, \hat{\psi}_2) = \mathrm{sign}(\hat{\psi}_{21} + \hat{\psi}_{22} A_1 + \hat{\psi}_{23} O_2)$
$\hat{d}_1(O_1) = \arg\max_{a_1} Q_1(O_1, a_1; \hat{\beta}_1, \hat{\psi}_1) = \mathrm{sign}(\hat{\psi}_{11} + \hat{\psi}_{12} O_1)$
where $\hat{d}_2(O_1, A_1, O_2)$ is the estimated best second stage intervention option ($A_2$), that is the second
stage intervention option that maximizes the mean of the primary outcome, given the history
$O_1, A_1, O_2$, based on the estimated parameters $\hat{\beta}_2$ and $\hat{\psi}_2$. And, $\hat{d}_1(O_1)$ is the estimated best
initial decision ($A_1$) that maximizes the estimated quality of the second stage intervention, given
the history $O_1$, based on the estimated parameters $\hat{\beta}_1$ and $\hat{\psi}_1$.
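The two regressions and the resulting decision rules can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' software: the helper `ols` solves the normal equations by Gauss-Jordan elimination, the function and variable names are hypothetical, a scalar baseline observation is assumed, and the data are simulated without noise so that the fitted stage-2 coefficients can be compared with the generating ones.

```python
import random


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    M = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(p):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][p] / M[i][i] for i in range(p)]


def sign(x):
    """Return 1 for positive x and -1 otherwise (ties go to -1 arbitrarily)."""
    return 1 if x > 0 else -1


def q_learning(data):
    """data: list of (o1, a1, o2, a2, y) tuples with a1, a2 coded -1/1."""
    # Stage 2: regress Y on the main-effect terms and the A2 interaction terms.
    X2 = [[1, o1, a1, o1 * a1, o2, a2, a1 * a2, o2 * a2]
          for o1, a1, o2, a2, _ in data]
    y = [row[4] for row in data]
    *b2, psi21, psi22, psi23 = ols(X2, y)
    # Pseudo-outcome: fitted outcome under the best second-stage option.
    y_tilde = [b2[0] + b2[1] * o1 + b2[2] * a1 + b2[3] * o1 * a1 + b2[4] * o2
               + abs(psi21 + psi22 * a1 + psi23 * o2)
               for o1, a1, o2, _, _ in data]
    # Stage 1: regress the pseudo-outcome on O1, A1 and O1*A1 (O2 is omitted).
    X1 = [[1, o1, a1, o1 * a1] for o1, a1, *_ in data]
    _, _, psi11, psi12 = ols(X1, y_tilde)

    def d2(o1, a1, o2):
        return sign(psi21 + psi22 * a1 + psi23 * o2)

    def d1(o1):
        return sign(psi11 + psi12 * o1)

    return d1, d2, (psi21, psi22, psi23), (psi11, psi12)


# Hypothetical, noise-free data from a two-stage randomized design.
rng = random.Random(0)
data = []
for _ in range(300):
    o1 = rng.gauss(0, 1)
    a1 = rng.choice((-1, 1))
    o2 = 0.5 * o1 + 0.3 * a1 + rng.gauss(0, 1)
    a2 = rng.choice((-1, 1))
    y = 1 + 0.5 * o1 + 0.2 * a1 + 0.4 * o2 + (0.3 - 0.6 * o2) * a2
    data.append((o1, a1, o2, a2, y))

d1_hat, d2_hat, psi2_hat, psi1_hat = q_learning(data)
```

Because the simulated outcome contains no noise, `psi2_hat` recovers the generating values (0.3, 0, -0.6) up to rounding, and `d2_hat` recommends option 1 when the second-stage observation is below 0.5 and option -1 above it. With real data a standard regression routine would replace the hand-written `ols` helper.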
Q-learning assists us in estimating an optimal sequence of decision rules in two
important ways. First, this approach reduces potential bias compared to the regression in (2)
since in the first stage analysis the regression model omits variables that may mediate the
relationship between the first stage intervention and the primary outcome (omitting these
variables prevents the elimination of the portion of the intervention effect that goes through the
mediator). For example, if $O_2$ mediates the relationship between $A_1$ and $Y$, inference
concerning the effect of $A_1$ on $Y$, based on the regression equation (2) in which $O_2$ is present,
will not reflect the total effect of $A_1$ on $Y$, but rather the part of this effect that does not go
through the mediator $O_2$. However, taking the Q-learning approach, the inference concerning $A_1$
is based on a regression equation in which $O_2$ is not present and hence any bias to the estimated
effect of $A_1$ due to the mediation of $O_2$ is reduced.
Second, Q-learning reduces the bias incurred by the use of (2), bias that is a result of
unmeasured causes of both the tailoring variables ($O_2$) and the primary outcome ($Y$). Note that
this bias resulting from unmeasured causes is different from the bias discussed above, and may
occur even if $O_2$ does not mediate the relationship between $A_1$ and $Y$. The following section
further demonstrates the second feature.
Comparing the single regression approach and Q-learning
In order to demonstrate how Q-learning reduces the bias resulting from unmeasured
causes, consider the following example (Example 1): Say that the outcome $Y$ is the level of
community residents’ perceived support 26 weeks after the beginning of the first stage
intervention, and $O_2$ is the level of perceived support after a 13-week period. For simplicity, we
assume there are no baseline variables $O_1$. Say that $A_1$ (social skills-based intervention = 1 vs.
cognitive-based intervention = -1) and $A_2$ (intensify first stage intervention = 1 vs. add the other
intervention component = -1) are each randomly assigned to subjects with probability ½. We also
assume $U \sim N(0, 1)$ is an unmeasured cause (say a personality characteristic) that has an effect on
both perceived support measures $Y$ and $O_2$. More specifically, $Y = 1 + 0.5U + \varepsilon_Y$, and $O_2 = 1 +
0.5U + 0.5A_1 + \varepsilon_O$. For both models we assume the $\varepsilon$’s (error terms) are independent and
standard normally distributed. Notice that $O_2$ does not mediate the relationship between $A_1$ and
$Y$ and $A_1$ has an effect on $O_2$, but neither $A_1$ nor $A_2$ has an effect on $Y$.
We generated 1,000 samples of n=500 each under the above example. On each data set we
used the single regression approach and the Q-learning approach.
The single regression model is:
$$Y \sim \beta_0 + \beta_1 A_1 + \beta_2 O_2 + \beta_3 A_2 + \beta_4 A_1 A_2. \qquad (6)$$
A natural approach to using (6) to construct the sequence of decision rules is as follows.
We construct the optimal decision rule at the second stage by finding the value of A2 that
maximizes (6) (i.e., that maximizes the term $(\beta_3 + \beta_4 A_1)A_2$). That is, the best second stage
option given A1 is $\mathrm{sign}(\beta_3 + \beta_4 A_1)$. Replacing A2 by $\mathrm{sign}(\beta_3 + \beta_4 A_1)$, the estimated maximal expected outcome is
$\beta_0 + \beta_1 A_1 + \beta_2 O_2 + |\beta_3 + \beta_4 A_1|$. Because A1 takes only the values 1 and -1, we can rewrite this maximal expected outcome as
$$\beta_0 + \beta_1 A_1 + \beta_2 O_2 + |\beta_3 + \beta_4 A_1| = \beta_0 + \beta_1 A_1 + \beta_2 O_2 + \frac{A_1+1}{2}|\beta_3+\beta_4| + \frac{1-A_1}{2}|\beta_3-\beta_4|$$
$$= \beta_0 + \beta_2 O_2 + \frac{1}{2}\left(|\beta_3+\beta_4| + |\beta_3-\beta_4|\right) + \left(\beta_1 + \frac{1}{2}\left(|\beta_3+\beta_4| - |\beta_3-\beta_4|\right)\right)A_1.$$
Next we find the value of A1 that maximizes the above. Accordingly, if $\beta_1 + \frac{1}{2}(|\beta_3+\beta_4| - |\beta_3-\beta_4|) > 0$,
we can conclude that A1=1 (social skills-based intervention) is the best
intervention option at the first stage given that we chose the best second stage intervention
option. If $\beta_1 + \frac{1}{2}(|\beta_3+\beta_4| - |\beta_3-\beta_4|) < 0$, we conclude that A1=-1 (cognitive-based
intervention) is the best intervention option at the first stage given that we chose the best second
stage intervention option.
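The rewriting step above relies only on the first stage intervention being coded as 1 or -1. A short check of that step is sketched here; the coefficient names $\beta_3$ and $\beta_4$ are a notational assumption for the two second stage terms of model (6).

$$|\beta_3 + \beta_4 A_1| = \frac{1+A_1}{2}\,|\beta_3+\beta_4| + \frac{1-A_1}{2}\,|\beta_3-\beta_4| \quad \text{for } A_1 \in \{-1, 1\}.$$

At $A_1 = 1$ the weights are 1 and 0, so the right-hand side is $|\beta_3+\beta_4|$; at $A_1 = -1$ the weights are 0 and 1, so it is $|\beta_3-\beta_4|$. In both cases it matches the left-hand side, and collecting the terms that multiply $A_1$ gives the first stage coefficient $\beta_1 + \frac{1}{2}(|\beta_3+\beta_4| - |\beta_3-\beta_4|)$.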
On the other hand, consider Q-learning. In analogy to (6) we use the models:
$$Q_2(A_1, O_2, A_2; \gamma_2, \alpha_2) = \gamma_{20} + \gamma_{21}A_1 + \gamma_{22}O_2 + (\alpha_{21} + \alpha_{22}A_1)A_2$$
and
$$Q_1(A_1; \gamma_1, \alpha_1) = \gamma_{10} + \alpha_{11}A_1.$$
Applying the Q-learning algorithm, we obtained estimates $\hat\gamma_t, \hat\alpha_t$, t=1,2, of the parameters. We
estimated the best second stage decision by choosing $A_2 = \mathrm{sign}(\hat\alpha_{21} + \hat\alpha_{22}A_1)$, and the best first
stage decision by choosing $A_1 = \mathrm{sign}(\hat\alpha_{11})$. Using this approach, $\hat\alpha_{11}$ is the estimated effect of
the first stage intervention given that we chose the best second stage intervention option.
In conclusion, the sign of the estimated $\beta_1 + \frac{1}{2}(|\beta_3+\beta_4| - |\beta_3-\beta_4|)$ determines which
first stage intervention is selected as best in the single regression approach, whereas the sign
of $\hat\alpha_{11}$ determines which first stage intervention is selected as best in Q-learning. We compare
the distribution of these two quantities across the 1,000 generated samples. Recall that in our
example there is no effect of the initial decision A1; thus both distributions should be centered at
zero. Figure 2 presents the distribution of the estimated $\beta_1 + \frac{1}{2}(|\beta_3+\beta_4| - |\beta_3-\beta_4|)$ and Figure 3 presents the
distribution of $\hat\alpha_{11}$. The distribution of the Q-learning-based estimate is
centered around zero (SD = .06), while the distribution of the single regression-based estimate
has a mean of -.10 (SD = .06). Thus if there are unobserved causes of both O2 and Y, the single
regression approach in (6) may lead to erroneous conclusions concerning the best sequence of
adaptive intervention options, while using the Q-learning method improves our ability to find the
optimal sequence of decision rules.
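The comparison in Example 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the report: it assumes NumPy is available, uses ordinary least squares for both approaches, runs fewer replications than the 1,000 reported, and the function names (`ols`, `one_replication`, `simulate`) are hypothetical.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y ~ X (X already contains an intercept column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def one_replication(rng, n=500):
    # Data-generating model of Example 1: U is the unmeasured common cause.
    u = rng.standard_normal(n)
    a1 = rng.choice([-1.0, 1.0], size=n)
    a2 = rng.choice([-1.0, 1.0], size=n)
    o2 = 1 + 0.5 * u + 0.5 * a1 + rng.standard_normal(n)
    y = 1 + 0.5 * u + rng.standard_normal(n)

    # Fit Y ~ 1 + A1 + O2 + A2 + A1*A2. The single regression model and the
    # stage 2 Q-function have the same form in this example.
    X2 = np.column_stack([np.ones(n), a1, o2, a2, a1 * a2])
    b0, b1, b2, b3, b4 = ols(X2, y)

    # Single regression approach: coefficient of A1 after plugging in the best A2.
    single = b1 + 0.5 * (abs(b3 + b4) - abs(b3 - b4))

    # Q-learning: form the stage 2 fitted value under the best A2, then
    # regress it on A1 only (O2 is not in the stage 1 model).
    pseudo = b0 + b1 * a1 + b2 * o2 + np.abs(b3 + b4 * a1)
    X1 = np.column_stack([np.ones(n), a1])
    q_learning = ols(X1, pseudo)[1]
    return single, q_learning

def simulate(n_reps=200, n=500, seed=0):
    """Return (means, SDs) of the two first stage estimates across replications."""
    rng = np.random.default_rng(seed)
    draws = np.array([one_replication(rng, n) for _ in range(n_reps)])
    return draws.mean(axis=0), draws.std(axis=0, ddof=1)

if __name__ == "__main__":
    means, sds = simulate()
    print("single regression: mean %.3f, SD %.3f" % (means[0], sds[0]))
    print("Q-learning:        mean %.3f, SD %.3f" % (means[1], sds[1]))
```

Under this data-generating model the single regression estimate should be centered near -.10, because conditioning on O2 makes A1 informative about U, whereas the Q-learning estimate should be centered near zero.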
---------------------------------- Figures 2 & 3 about here
----------------------------------
Analysis based on the Adaptive Interventions for Children with ADHD study
To illustrate Q-learning, we use a simplified version of the Adaptive Interventions for
Children with ADHD study (a full analysis can be found in ADD CITATION). Attention-Deficit
Hyperactivity Disorder (ADHD) is a chronic disorder affecting 5-10% of school age children
that adversely impacts functioning at home, school and in social settings (Pliszka 2007). In
recent years there has been controversy concerning the relative effectiveness of behavioral- vs.
medication-based interventions for the treatment of ADHD (see Pelham & Fabiano, 2008;
Pliszka, 2007). Accordingly, a Sequential Multiple Assignment Randomized Trial (SMART;
Murphy, 2005) was conducted (William E. Pelham as PI) with the general aim to find the
optimal sequence of treatments that reduces ADHD symptoms among children.
Design
In a SMART study, treatments are randomized at each stage, where observable data from
a subject is {O1, A1, O2, A2, Y }. Of course, the number of stages can be greater than 2, and the
observable data may also include baseline variables and/or measurements of potential
confounders. At the first stage, the intervention A1 is randomized with randomization distribution
allowed to depend on (O1) and at stage 2 the intervention A2 is randomized with randomization
distribution allowed to depend on (O1, A1, O2).
At the first stage of the ADHD SMART study (A1), children were randomly assigned
(with probability ½) to a low dose of medication (coded as -1) or a low dose of behavioral
intervention (coded as 1) at the beginning of a school year. After eight weeks, children’s
response to the first stage intervention was evaluated monthly until the end of that school year.
At each monthly assessment, if the child showed inadequate response to the first stage
intervention, then he/she entered the second stage of the intervention (A2) and was re-randomized
(with probability ½) to one of two second stage intervention options, either to increasing the dose
of the first stage intervention (coded as 1) or to augmenting the first stage intervention with the
other type of intervention (i.e., adding behavioral intervention for those who started with
medication, or adding medication for those who started with behavioral intervention) (coded as -1).
Otherwise, if the child was classified as a responder, he/she remained in stage 1 and continued the
first stage intervention. Note that there are only two key decisions in this trial: the first stage
intervention decision (A1), and then the second stage intervention decision (A2) for those not
responding satisfactorily to the first stage intervention. The structure of this SMART study is
illustrated in Figure 4.
-------------------------- Figure 4 about here
--------------------------
Sample
A total of 149 children (75% boys) between the ages of 5 and 12 (mean 8.6 years) participated in the
study. Due to drop-out and missing data², the effective sample used in the current analysis was
131. At the first stage of the intervention (A1), 67 children were randomized to receive a low
dose of medication, and 64 were randomized to receive a low dose of behavioral intervention.
By the end of the school year, 77 children were classified as non-responders and re-randomized
to one of the two second stage intervention options, with 37 children assigned to increasing the
dose of the first stage intervention, and 38 children assigned to augmenting the first stage
intervention with the other type of intervention.
² In a full analysis one would want to use a modern missing data method such as multiple imputation so as to avoid bias.
Measures
Primary outcome (Y): we consider the level of children’s classroom performance based
on the Impairment Rating Scale (IRS, Fabiano et al., 2006; available from
http://wings.buffalo.edu/adhd) after an 8-month period as our primary outcome. This outcome
ranges from 1 to 5, with higher values reflecting better classroom performance. Because the
current analysis is for illustrative, rather than for substantive purposes, we use this measure as an
outcome despite limitations relating to its distribution and reliability.
Baseline (B): we use the level of children’s classroom performance (based on the IRS
scale) measured during the first month of the school year (before the first stage intervention) as a
baseline measure.
Week of non-response (W): This measure reflects the week during the school year at which the child
showed inadequate response to the first stage intervention, and hence entered the second stage of
the intervention. This measure is relevant only for those who showed inadequate response during
the school year (i.e., classified as non-responders to the first stage intervention).
Medication prior to first-stage intervention (O1): This measure reflects whether (coded
as 1) or not (coded as 0) the child received medication at school during the previous school year
(i.e., prior to the first stage of the intervention).
Adherence to first stage intervention (O2): This measure reflects whether adherence to
the first stage intervention was high (coded as 1) or low (coded as 0). We constructed this
indicator based on two other measures that express (1) the percentage of days the child received
medication during the school year, calculated based on pill counts (for those assigned to low dose
of medication as the first stage intervention), and (2) the percentage of days the child received
the behavioral intervention during the school year, based on teacher report of behavioral
interventions used in the classroom (for those assigned to behavioral intervention as the first
stage treatment). The distributions of these two measures are presented in Figures 5 and 6. Based
on these distributions, we constructed O2 such that for those assigned to behavioral intervention
as the first stage treatment, low adherence (O2=0) means receiving behavioral intervention on
fewer than 75% of days, and for those assigned to medication as the first stage treatment, low
adherence (O2=0) means receiving medication on fewer than 100% of days³.
------------- Figures 5&6 about here
-------------
Data Analysis Procedure
Using the Q-learning approach, the optimal sequence of decision rules can be estimated
based on two regressions, one for each intervention stage. We start from the second stage, aiming
to find the best subsequent intervention for non-responders, given the history up to the second
decision point (B, O1, A1, O2, W). Because children were classified as non-responders at different
time points along the school year, we included the week of non-response (W) in this regression.
We also included the child’s baseline measure (B) in the regression, due to potential
confounding. We consider the first stage intervention (A1) as well as the level of adherence to
the first stage intervention (O2) as candidate tailoring variables for the second stage
intervention. Accordingly, Q2 for non-responders can be approximated by
$$Q_2(B, O_1, A_1, O_2, W, A_2; \gamma_2, \alpha_2) = \gamma_{20} + \gamma_{21}O_1 + \gamma_{22}B + \gamma_{23}A_1 + \gamma_{24}O_1A_1 + \gamma_{25}W + \gamma_{26}O_2 + (\alpha_{21} + \alpha_{22}A_1 + \alpha_{23}O_2)A_2. \qquad (7)$$
In general this regression might include further possible confounders or potential
tailoring variables such as negative/ineffective parenting styles and medication side effects. We
obtain $(\hat\gamma_2, \hat\alpha_2)$ by using regression. In this simple case, the decision rule recommends
increasing the dose of the first stage intervention (A2=1) for a child who does not respond to the
first stage treatment if $\hat\alpha_{21} + \hat\alpha_{22}A_1 + \hat\alpha_{23}O_2 > 0$, and adding the other type of
intervention (A2=-1) if $\hat\alpha_{21} + \hat\alpha_{22}A_1 + \hat\alpha_{23}O_2 < 0$.
³ Such relatively high adherence rates may result from obtaining adherence data only for the first 8 weeks of the school year. Moreover, study medication was to be taken only on school days, and was dispensed monthly to parents.
Now we move backwards in time, aiming to find the best first stage intervention option
(A1) given that the best second stage intervention option was offered to non-responders. Based
on (7), the estimated quality of the second stage intervention for non-responders would be
$$\tilde{Y} = \hat\gamma_{20} + \hat\gamma_{21}O_1 + \hat\gamma_{22}B + \hat\gamma_{23}A_1 + \hat\gamma_{24}O_1A_1 + \hat\gamma_{25}W + \hat\gamma_{26}O_2 + |\hat\alpha_{21} + \hat\alpha_{22}A_1 + \hat\alpha_{23}O_2|.$$
For responders, we use the measure of classroom performance after an 8-month period as an
intermediate outcome (i.e., $\tilde{Y} = Y$).
Q1 can be approximated by
$$Q_1(B, O_1, A_1; \gamma_1, \alpha_1) = \gamma_{10} + \gamma_{11}B + \gamma_{12}O_1 + (\alpha_{11} + \alpha_{12}O_1)A_1.$$
We regress $\tilde{Y}$ on the predictors to obtain $\hat\gamma_1$ and $\hat\alpha_1$. If $\hat\alpha_{11} + \hat\alpha_{12}O_1 > 0$, the best first stage
intervention option would be to begin with a low dose of behavioral intervention (A1=1). If
$\hat\alpha_{11} + \hat\alpha_{12}O_1 < 0$, the best first stage intervention option would be to begin with a low dose of
medication (A1=−1).
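The two regressions described above can be sketched as follows. This is a minimal illustration under several assumptions, not the analysis code behind the reported results: it assumes NumPy, uses plain least squares with no soft-threshold inference, orders the stage 2 main-effect columns arbitrarily, and runs on synthetic data from a hypothetical generator (`synthetic_trial`) whose stage 2 effect is known.

```python
import numpy as np

def fit(X, y):
    """Least-squares coefficients for y ~ X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def q_learning(B, O1, A1, O2, W, A2, Y, nonresp):
    """Two-stage Q-learning where only non-responders have a second stage.

    All inputs are 1-D arrays of equal length; nonresp is a boolean mask.
    O2, W and A2 are used only where nonresp is True.
    """
    nr = nonresp
    ones = np.ones(nr.sum())
    # Stage 2 (non-responders only): main-effect columns and tailoring columns.
    H2 = np.column_stack([ones, O1[nr], B[nr], A1[nr], O1[nr] * A1[nr], W[nr], O2[nr]])
    T2 = np.column_stack([ones, A1[nr], O2[nr]])
    coef2 = fit(np.hstack([H2, T2 * A2[nr][:, None]]), Y[nr])
    gamma2, alpha2 = coef2[:7], coef2[7:]

    # Pseudo-outcome: fitted value under the best A2 for non-responders,
    # observed outcome for responders.
    y_tilde = Y.astype(float).copy()
    y_tilde[nr] = H2 @ gamma2 + np.abs(T2 @ alpha2)

    # Stage 1 (all children): regress the pseudo-outcome on B, O1 and A1 terms.
    n = len(Y)
    H1 = np.column_stack([np.ones(n), B, O1])
    T1 = np.column_stack([np.ones(n), O1])
    coef1 = fit(np.hstack([H1, T1 * A1[:, None]]), y_tilde)
    gamma1, alpha1 = coef1[:3], coef1[3:]
    return (gamma2, alpha2), (gamma1, alpha1)

def synthetic_trial(n=2000, seed=0):
    """Hypothetical data: stage 2 effect is (-0.5 + 0.8 * O2) * A2 among non-responders."""
    rng = np.random.default_rng(seed)
    B = rng.normal(3.0, 1.0, n)
    O1 = rng.integers(0, 2, n).astype(float)
    A1 = rng.choice([-1.0, 1.0], n)
    nonresp = rng.random(n) < 0.6
    O2 = rng.integers(0, 2, n).astype(float)
    W = rng.integers(8, 33, n).astype(float)
    A2 = rng.choice([-1.0, 1.0], n)
    Y = 2 + 0.4 * B + nonresp * (-0.5 + 0.8 * O2) * A2 + rng.normal(0, 0.5, n)
    return B, O1, A1, O2, W, A2, Y, nonresp

if __name__ == "__main__":
    (gamma2, alpha2), (gamma1, alpha1) = q_learning(*synthetic_trial())
    print("stage 2 tailoring coefficients:", np.round(alpha2, 2))
    print("stage 1 tailoring coefficients:", np.round(alpha1, 2))
```

The sign of `T2 @ alpha2` gives the recommended second stage code for each non-responder, and the sign of `alpha1[0] + alpha1[1] * O1` gives the recommended first stage code.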
Notice that the estimated quality of the second stage intervention is a non-smooth function
of $\hat\alpha_2$ (it is non-differentiable at $\hat\alpha_{21} + \hat\alpha_{22}A_1 + \hat\alpha_{23}O_2 = 0$), because of the maximization operation
$|\hat\alpha_{21} + \hat\alpha_{22}A_1 + \hat\alpha_{23}O_2|$. Since $\hat\alpha_1$ is a function of the estimated quality of the second stage
intervention, it is in turn a non-smooth function of $\hat\alpha_2$, and hence a non-regular estimator.
Accordingly, the usual Wald-type significance tests for inference concerning $\alpha_1$ tend to perform
poorly (Chakraborty, Murphy, & Strecher, 2009; Robins, 2004). The issue of non-regularity and the
associated inference problems in the first stage regression are discussed in detail in
Chakraborty et al. (2009), and are beyond the scope of the current manuscript. Still, in order to
overcome the inference problems noted above, we constructed confidence intervals for $\alpha_1$ using the
soft thresholding operation recommended by Chakraborty et al. (2009).
Results
Table 1 presents the results for the second stage regression. Based on these estimates, we
estimated the term $\hat\alpha_{21} + \hat\alpha_{22}A_1 + \hat\alpha_{23}O_2$ for every combination of A1 and O2 (see Table 2).
----------------------- Table 1 & Table 2 about here
-----------------------
The results in Table 1 show that the effect of the second stage intervention (A2) is negative
and marginally significant ($\hat\alpha_{21}$= -.42, lower limit .10 CI= -.79, upper limit .10 CI= -.06). The
interaction between the first stage intervention (A1) and the second stage intervention (A2) was not
found to be statistically significant ($\hat\alpha_{22}$= -0.003, lower limit .10 CI= -.24, upper limit .10 CI= .23),
and the interaction between adherence to first stage intervention (O2) and the second stage
intervention (A2) was found to be statistically significant ($\hat\alpha_{23}$=.75, lower limit .10 CI=.27, upper
limit .10 CI= 1.23).
The results in Table 2 indicate that when adherence to the first stage intervention is low
(O2=0), the term $\hat\alpha_{21} + \hat\alpha_{22}A_1 + \hat\alpha_{23}O_2$ is negative and marginally significant, regardless of
whether the first stage intervention was medication (estimate= -.42, lower limit .10 CI=
-.87, upper limit .10 CI= .02), or behavioral intervention (estimate= -.43, lower limit .10
CI= -.85, upper limit .10 CI= -.01). Accordingly, when adherence to the first stage intervention is
low, the term $(\hat\alpha_{21} + \hat\alpha_{22}A_1 + \hat\alpha_{23}O_2)A_2$ is maximized when A2=−1 (augmenting the first stage
intervention with the other type of intervention). However, when
adherence to the first stage intervention is high (O2=1), the term $\hat\alpha_{21} + \hat\alpha_{22}A_1 + \hat\alpha_{23}O_2$ was not
found to be significantly different from zero, regardless of whether the first stage intervention was
medication (estimate= .33, lower limit .10 CI= -.07, upper limit .10 CI= .73) or
behavioral intervention (estimate= .33, lower limit .10 CI= -.04, upper limit .10 CI=
.70).
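The point estimates in Table 2 follow from the three second stage interaction coefficients in Table 1. The arithmetic is sketched below; the constant and function names are hypothetical, and because the inputs are the rounded values printed in Table 1, the last digit can differ slightly from Table 2. The sketch also ignores the confidence intervals, which the text uses to decide whether a difference is supported by the data.

```python
# Rounded point estimates for the A2, A1*A2 and O2*A2 terms as printed in Table 1.
ALPHA21, ALPHA22, ALPHA23 = -0.42, -0.003, 0.75

def stage2_term(a1, o2):
    """Estimated coefficient of A2 for a child with first stage option a1 and adherence o2."""
    return ALPHA21 + ALPHA22 * a1 + ALPHA23 * o2

def best_a2(a1, o2):
    """Code of A2 (+1 or -1) that maximizes stage2_term(a1, o2) * A2."""
    return 1 if stage2_term(a1, o2) > 0 else -1

if __name__ == "__main__":
    for a1 in (-1, 1):        # -1 = medication, 1 = behavioral intervention
        for o2 in (1, 0):     # 1 = high adherence, 0 = low adherence
            print(a1, o2, round(stage2_term(a1, o2), 2), best_a2(a1, o2))
```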
Overall, the results of the second stage regression indicate that for non-responders to the first
stage intervention (regardless of whether the first stage intervention was medication or behavioral
intervention), if adherence to the first stage intervention is low, augmenting the first stage
intervention with the other type of intervention (A2=−1) leads to better classroom performance
relative to intensifying the first stage intervention (A2=1). However, if adherence to the first stage
intervention is high, there is no evidence to differentiate between the second stage intervention
options. Figure 7 presents the predicted means for each of the second stage intervention options
(A2), given the first stage intervention (A1) and adherence to first stage intervention (O2).
------------- Figure 7 about here
-------------
Table 3 presents the results for the first stage regression. Based on these estimates, we
estimated the term $\hat\alpha_{11} + \hat\alpha_{12}O_1$ for each value of O1 (see Table 4).
----------------------- Table 3 & Table 4 about here
-----------------------
The results in Table 3 indicate that the effect of the first stage intervention (A1) is positive
and marginally significant ($\hat\alpha_{11}$=.20, lower limit .10 CI=.002, upper limit .10 CI=.38), and the
interaction between the first stage intervention (A1) and medication prior to first stage intervention
(O1) is negative and marginally significant ($\hat\alpha_{12}$= -.24, lower limit .10 CI= -.49, upper limit .10 CI=
-.01).
The results in Table 4 indicate that the term $\hat\alpha_{11} + \hat\alpha_{12}O_1$ is positive and marginally
significant when O1=0 (Estimate=.20, lower limit .10 CI=.002, upper limit .10 CI=.38). However,
when O1=1 the term $\hat\alpha_{11} + \hat\alpha_{12}O_1$ is not significantly different from zero (Estimate= -.04, lower
limit .10 CI=-.33, upper limit .10 CI=.24). This means that given that the best second stage
intervention option was offered to non-responders, low dose of behavioral intervention (A1=1) leads
to better classroom performance relative to low dose of medication (A1= -1) for children who did
not receive medication at school prior to the first stage intervention (O1=0). However, there is no
evidence favoring either first stage intervention option for children who received medication at
school prior to the first stage intervention. Figure 8 presents the predicted means for each of the first
stage intervention options (A1), given whether or not the child received medication at school prior to
first stage intervention (O1).
------------- Figure 8 about here
-------------
Overall, the optimal sequence of decision rules based on the second and the first stage
regressions is as follows:
IF the child received medication at school prior to first stage intervention
    THEN offer low dose of medication or low dose of behavioral intervention.
ELSE IF the child did not receive medication at school prior to first stage intervention
    THEN offer low dose of behavioral intervention.
Then,
IF the child shows inadequate response to first stage intervention
    THEN IF child’s adherence to first stage intervention is low
        THEN augment the first stage intervention with the other
        type of intervention.
    ELSE IF child’s adherence to first stage intervention is high
        THEN augment the first stage intervention with the other
        type of intervention or intensify the first stage intervention.
ELSE IF the child shows adequate response to first stage intervention
    THEN continue first stage intervention.
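The IF/THEN rules above translate directly into two small functions. This is only a sketch of the stated rules: the function names and the option labels are hypothetical, and where the rules leave the choice open, both options are returned.

```python
from __future__ import annotations

BMOD = "low dose of behavioral intervention"
MED = "low dose of medication"
AUGMENT = "augment the first stage intervention with the other type of intervention"
INTENSIFY = "intensify the first stage intervention"
CONTINUE = "continue first stage intervention"

def first_stage_options(prior_medication: bool) -> list[str]:
    """Options supported by the estimated rule at the first decision point."""
    if prior_medication:
        # No evidence favoring either option for children medicated at school before.
        return [MED, BMOD]
    return [BMOD]

def second_stage_options(adequate_response: bool, high_adherence: bool) -> list[str]:
    """Options supported by the estimated rule at the second decision point."""
    if adequate_response:
        return [CONTINUE]
    if high_adherence:
        # No evidence to differentiate between the two options at high adherence.
        return [AUGMENT, INTENSIFY]
    return [AUGMENT]

if __name__ == "__main__":
    print(first_stage_options(prior_medication=False))
    print(second_stage_options(adequate_response=False, high_adherence=False))
```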
References:
Adelman, H.S., & Taylor, L. (1994). On Understanding Intervention in Psychology and
Education. Westport CT: Praeger.
Beck, J.S., Liese, B.S., and Najavits, L.M. (2005). Cognitive therapy. In R.J. Frances, S.I. Miller,
and A. Mack (Eds.), Clinical Textbook of Addictive Disorders (3rd edition). NY: Guilford
Press.
Bierman, K.L., Nix, R.L., Maples, J.J., and Murphy, S.A. (2006). Examining clinical judgment
in an adaptive intervention design: The Fast Track Program. Journal of Consulting and
Clinical Psychology, 74, 468-481.
Brand, E., Lakey, B., & Berman, S. (1995). A preventive, psychoeducational approach to
increase perceived support. American Journal of Community Psychology, 23, 117–136.
Baron, R. M., & Kenny, D. A. (1986). The moderator–mediator variable distinction in social
psychological research: Conceptual, strategic, and statistical considerations. Journal of
Personality and Social Psychology, 51, 1173–1182.
Bellman, R.E. & Dreyfus, S.E. (1962). Applied dynamic programming. RAND Corporation.
Brown, A.L. (1992). Design Experiments: Theoretical and methodological challenges in creating
complex interventions in classroom settings. The journal of Learning Sciences, 2, 141-
178.
Chakraborty, B., Murphy, S.A., & Strecher, V. (2009). Inference for non-regular parameters in
optimal dynamic treatment regimes. Statistical Methods in Medical Research, In Press.
Cole, M.B. (2005). Group Dynamics in Occupational Therapy: The Theoretical Basis and
Practice Application of Group Treatment (3rd edition). Thorofare, NJ: Slack Inc.
Collins, L.M., Murphy, S.A., & Bierman, K.A. (2004), A Conceptual Framework for Adaptive
Preventive Interventions, Prevention Science, 5, 185-196.
Conduct Problems Prevention Research Group. (1992). A developmental and clinical model for
the prevention of conduct disorder: The Fast Track program. Development and
Psychopathology, 4, 509-527.
Connell, A., Bullock, B. M., Dishion, T. J., Shaw, D., Wilson, M., & Gardner, F. (2008). Family
intervention effects on co-occurring behavior and emotional problems in early childhood:
A latent transition analysis approach. Journal of Abnormal Child Psychology, 36, 1211-
1225.
Cuijpers, P., Jonkers, R., de Weerdt, I., & de Jong, A. (2002). The effects of drug abuse
prevention at school: The ‘Healthy School and Drugs’ project. Addiction, 97, 67–73.
Hirschi, A., & Vondracek, F.W. (2009). Adaptation of career goals to self and opportunities in
early adolescence. Journal of Vocational Behavior, 75, 120-128.
Fabiano, G. A., Pelham, W. E., Waschbusch, D. A., Gnagy, E. M., Lahey, B. B., Chronis, A. M.,
et al. (2006). A practical measure of impairment: Psychometric properties of the
impairment rating scale in samples of children with Attention Deficit Hyperactivity
Disorder and two school-based samples. Journal of Clinical Child and Adolescent
Psychology, 35, 369–385.
Lavori, P.W. & Dawson, R. (2000). A design for testing clinical strategies: biased individually
tailored within-subject randomization. Journal of the Royal Statistical Society A, 163, 29-
38.
Laurenceau, J-P, Hayes, A.M. , & Feldman, G.C. (2007). Statistical and methodological issues in
the study of change in psychotherapy. Clinical Psychology Review, 27, 682-695.
Lunceford, J. K., Davidian, M. & Tsiatis, A. A. (2002). Estimation of survival distributions of
treatment policies in two-stage randomization designs in clinical trials. Biometrics, 58,
48-57.
Marlowe, D.B., Festinger, D.S., Arabia, P.L., Dugosh, K.L., Benasutti, K.M., Croft, J.R., &
McKay, J.R. (2008). Adaptive interventions in drug court: A pilot experiment. Criminal
Justice Review, 33, 343-360.
Martocchio, J.J., & Webster, J. (1992). Effects of feedback and cognitive playfulness on
performance in microcomputer software training. Personnel Psychology, 45, 553-578.
McKay, J.R. (2005). Is there a case for extended interventions for alcohol and drug use
disorders? Addiction, 100, 1594-1610.
MacKinnon, D.P., Warsi, G., & Dwyer, J.H. (1995). A simulation study of mediated effect
measures. Multivariate Behavioral Research, 30, 41–62.
Murphy, S.A. (2005). An experimental design for the development of adaptive treatment
strategies. Statistics in Medicine, 24, 1455-1481.
Murphy, S.A., & Bingham, D. (2009). Screening experiments for developing dynamic treatment
regimes. Journal of the American Statistical Association, In Press.
Murphy, S.A, Collins, L.M., & Rush, A.J. (2007). Customizing treatment to the patient:
Adaptive treatment strategies (editorial). Drug and Alcohol Dependence, 88, S1-S72.
Murphy, S.A., Lynch, K.G., Oslin, D., Mckay, J.R. & TenHave, T. (2007). Developing adaptive
treatment strategies in substance abuse research. Drug and Alcohol Dependence, 88s,
s24-s30.
Murphy, S.A., van der Laan, M.J., Robins, J.M. & CPPR (2001). Marginal mean models for
dynamic regimes. Journal of American Statistical Association, 96, 1410-1423.
Neyman, J. (1923). On the application of probability theory to agricultural experiments.
Translated in Statistical Science, 5, 465-480 (1990).
Pelham, W.E., & Fabiano, G.A. (2008). Evidence-based psychosocial treatment for attention-
deficit/hyperactivity disorder. Journal of Clinical Child and Adolescent Psychology, 37,
184–214.
Petersen, M.L., Deeks, S.G. and van der Laan, M.J. (2007). Individualized treatment rules:
Generating candidate clinical trials. Statistics in Medicine, 26, 4578-4601.
Pliszka S. (2007): Practice parameter for the assessment and treatment of children and
adolescents with attention-deficit/hyperactivity disorder. Journal of the American
Academy of Child & Adolescent Psychiatry, 46, 894-921.
Raudenbush, S. W., Hong, G., & Rowan, B. (2002). Studying the causal effects of instruction
with application to primary-school mathematics. Paper presented at the Research Seminar
II: Instructional and Performance Consequences of High-Poverty Schooling, National
Center for Educational Statistics, Washington, DC.
Robins, J.M. (1986). A new approach to causal inference in mortality studies with sustained
exposure periods - application to control of the healthy worker survivor effect. Mathematical
Modelling, 7, 1393-1512.
Robins, J.M. (1987). Addendum to “A new approach to causal inference in mortality studies
with sustained exposure periods -application to control of the healthy worker survivor
effect.” Computers and Mathematics with Applications, 14, 923-945.
Robins, J.M. (2004). Optimal structural nested models for optimal sequential decisions. In D.Y.
Lin, and P. Heagerty (Eds.), Proceedings of the Second Seattle Symposium on
Biostatistics (pp. 189-326). NY: Springer.
Rubin, D.B. (1978). Bayesian inference for causal effects: the role of randomization. The Annals
of Statistics, 6, 34-58.
Sutton, R.S. & Barto, A.G. (1998). Reinforcement Learning: An Introduction. Cambridge, Mass:
MIT Press.
Thall, P.F., Sung, H.G. & Estey, E.H. (2002). Selecting therapeutic strategies based on efficacy
and death in multicourse clinical trials. Journal of the American Statistical Association,
97, 29-39.
Thall, P. F. and Wathen, J. K (2005). Covariate-adjusted adaptive randomization in a sarcoma
trial with multi-stage treatments. Statistics in Medicine, 24:1947-1964.
van der Laan, M. J. & Petersen, M. L. (2007). Statistical learning of origin-specific statically
optimal individualized treatment rules. The International Journal of Biostatistics, 3,
Article 3.
Wahed, A. S. & Tsiatis, A. A (2004). Optimal estimator for the survival distribution and related
quantities for treatment policies in two-stage randomization designs in clinical trials.
Biometrics, 60, 124-133.
Wahed, A. S. & Tsiatis, A. A (2006). Semiparametric efficient estimation of survival distribution
for treatment policies in two-stage randomization designs in clinical trials with censored
data. Biometrika, 93, 163-177.
Wampold, B. E., Lichtenberg, J. W., & Waehler, C. A. (2002). Principles of empirically
supported interventions in counseling psychology. The Counseling Psychologist, 30,
197–207.
Wandersman, A., & Florin, P. (2003). Community interventions and effective prevention.
American Psychologist, 58, 441–448.
Watkins, C.J.C.H. (1989). Learning from Delayed Rewards. PhD Thesis, University of
Cambridge, England.
Weisz, J.R., Chu, B.C., & Polo, A.J. (2004). Treatment dissemination and evidence-based
practice: Strengthening interventions through clinician-researcher collaboration. Clinical
Psychology: Science and Practice, 11, 300-307.
Yalom, I. D. (1995). The Theory and Practice of Group Psychotherapy (4th edition). NY: Basic
Books.
Table 1: Estimated Coefficients for Q2 (N=77).

| Effect | Estimate | SE | Lower limit .10 CI | Upper limit .10 CI |
|---|---|---|---|---|
| Intercept | 1.76 | 0.37 | | |
| B (baseline) | 0.46 | 0.09 | | |
| W (week of non-response) | -0.01 | 0.02 | | |
| O1 (medication prior to first stage intervention) | 0.32 | 0.32 | | |
| O2 (adherence to first stage intervention) | -0.10 | 0.29 | | |
| A1 (first stage intervention) | 0.08 | 0.14 | | |
| A2 (second stage intervention) | -0.42 | 0.21 | -0.79 | -0.06 |
| O2*A2 (adherence to first stage intervention*second stage intervention) | 0.75 | 0.29 | 0.27 | 1.23 |
| A1*A2 (first stage intervention*second stage intervention) | -0.003 | 0.14 | -0.24 | 0.23 |

Table 2: Estimates of $\hat\alpha_{21} + \hat\alpha_{22}A_1 + \hat\alpha_{23}O_2$ (the coefficient of A2) for every combination of A1 and O2 (N=77).
A1: 1 = behavioral intervention, -1 = medication. O2: 1 = high adherence, 0 = low adherence.

| A1 | O2 | Estimate | SE | Lower limit .10 CI | Upper limit .10 CI |
|---|---|---|---|---|---|
| -1 | 1 | 0.33 | 0.24 | -0.07 | 0.73 |
| -1 | 0 | -0.42 | 0.27 | -0.87 | 0.02 |
| 1 | 1 | 0.33 | 0.22 | -0.04 | 0.70 |
| 1 | 0 | -0.43 | 0.25 | -0.85 | -0.01 |
Table 3: Estimated coefficients and soft-threshold Confidence Intervals for Q1 (N=131).

| Effect | Estimate | SE | Lower limit .10 CI | Upper limit .10 CI |
|---|---|---|---|---|
| Intercept | 2.32 | 0.15 | | |
| B (baseline) | 0.43 | 0.04 | | |
| O1 (medication prior to first stage intervention) | 0.09 | 0.13 | | |
| A1 (first stage intervention) | 0.20 | 0.06 | 0.002 | 0.38 |
| O1*A1 (medication prior to first stage intervention*first stage intervention) | -0.24 | 0.13 | -0.49 | -0.01 |

Table 4: Estimates of $\hat\alpha_{11} + \hat\alpha_{12}O_1$ (the coefficient of A1) for each level of O1.
O1: 1 = medication prior to first stage intervention, 0 = no medication prior to first stage intervention.

| O1 | Estimate | SE | Lower limit .10 CI | Upper limit .10 CI |
|---|---|---|---|---|
| 1 | -0.04 | 0.11 | -0.33 | 0.24 |
| 0 | 0.20 | 0.06 | 0.002 | 0.38 |
Figure 2: Distribution of the estimated coefficient $\beta_1 + \frac{1}{2}(|\beta_3+\beta_4| - |\beta_3-\beta_4|)$
Figure 3: Distribution of the estimated coefficient $\alpha_{11}$
Figure 4: Sequential Multiple Assignment Randomized Trial for ADHD study
[Figure 4 diagram: children are randomized (R) to medication or behavioral intervention. Responders continue the first stage intervention. Non-responders are re-randomized (R): those who started with medication either increase the medication dose or add behavioral intervention; those who started with behavioral intervention either increase the behavioral intervention or add medication.]
Figure 5: Distribution for % days on behavioral intervention for those assigned to low dose of behavioral intervention as the first stage intervention.
Figure 6: Distribution for % days on medication for those assigned to low dose of medication as the first stage intervention.
Figure 7: Predicted mean of classroom performance for each of the stage 2 intervention options (A2), given the first stage intervention (A1) and adherence to first stage intervention (O2).
[Figure 7 plot. X-axis: Stage 2 intervention (A2), with categories Add and Enhance. Y-axis: predicted mean of classroom performance, shown from 2 to 3.75. Lines: A1=BMOD, O2=High Adherence; A1=BMOD, O2=Low Adherence (Diff=0.85, P<.10); A1=MED, O2=High Adherence; A1=MED, O2=Low Adherence (Diff=.84, P<.10).]
Figure 8: Predicted estimated quality of the second stage intervention for each of the first stage intervention options (A1), given whether or not the child received medication at school prior to first stage intervention (O1).
[Figure 8 plot. X-axis: Stage 1 intervention (A1), with categories Behavioral Intervention and Medication. Y-axis: Predicted Estimated Quality of Stage 2 Intervention, shown from 3 to 4. Lines: O1=Medication at school prior to stage 1 intervention; O1=No medication at school prior to stage 1 intervention (Diff=0.40, P<.10).]
Appendix 1: Proof of (3)
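First note that for $a_1 = -1$ or $1$ and $a_2 = -1$ or $1$,
$$E[Y \mid A_1=a_1, O_2=1, A_2=a_2] = \theta_0 + \theta_1 E[U \mid A_1=a_1, O_2=1, A_2=a_2]$$
$$= \theta_0 + \theta_1 P(U=1 \mid A_1=a_1, O_2=1, A_2=a_2),$$
where the second equality follows since U is Bernoulli.
By the basic properties of conditional probability, we have
$$P(U=1 \mid A_1=a_1, O_2=1, A_2=a_2) = \frac{P(U=1, A_1=a_1, O_2=1, A_2=a_2)}{P(A_1=a_1, O_2=1, A_2=a_2)}$$
$$= \frac{P(A_2=a_2 \mid U=1, A_1=a_1, O_2=1)\,P(U=1, A_1=a_1, O_2=1)}{P(A_2=a_2 \mid A_1=a_1, O_2=1)\,P(A_1=a_1, O_2=1)}.$$
Since the Time 2 intervention A2 is randomly assigned to 1 or −1 with probability ½ each, given
O2=1, we have $P(A_2=a_2 \mid U=1, A_1=a_1, O_2=1) = P(A_2=a_2 \mid A_1=a_1, O_2=1) = \frac{1}{2}$ for $a_2 = 1$ or $-1$. This
implies
$$P(U=1 \mid A_1=a_1, O_2=1, A_2=a_2) = \frac{P(U=1, A_1=a_1, O_2=1)}{P(A_1=a_1, O_2=1)}.$$
In the following we derive the joint probability $P(U=u, A_1=a_1, O_2=1)$ for $u = 0$ or $1$, and $a_1 = -1$ or $1$.
Note that A1 and U are independently distributed. Thus
$$P(U=u, A_1=a_1) = P(U=u)\,P(A_1=a_1) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4}.$$
It follows that
$$P(U=u, A_1=a_1, O_2=1) = P(O_2=1 \mid U=u, A_1=a_1)\,P(U=u, A_1=a_1)$$
$$= \frac{1}{4}\left\{u\left[\frac{p_1+p_2}{2} + \frac{p_1-p_2}{2}a_1\right] + (1-u)\left[\frac{p_3+p_4}{2} + \frac{p_3-p_4}{2}a_1\right]\right\}.$$
Hence,
$$P(U=1, A_1=a_1, O_2=1) = \frac{1}{4}\left[\frac{p_1+p_2}{2} + \frac{p_1-p_2}{2}a_1\right] = \begin{cases} p_1/4 & \text{if } a_1 = 1 \\ p_2/4 & \text{if } a_1 = -1 \end{cases}$$
and
$$P(A_1=a_1, O_2=1) = P(U=1, A_1=a_1, O_2=1) + P(U=0, A_1=a_1, O_2=1)$$
$$= \frac{1}{4}\left[\frac{p_1+p_2}{2} + \frac{p_1-p_2}{2}a_1\right] + \frac{1}{4}\left[\frac{p_3+p_4}{2} + \frac{p_3-p_4}{2}a_1\right] = \begin{cases} (p_1+p_3)/4 & \text{if } a_1 = 1 \\ (p_2+p_4)/4 & \text{if } a_1 = -1. \end{cases}$$
Therefore
$$E[Y \mid A_1=a_1, O_2=1, A_2=a_2] = \theta_0 + \theta_1 \frac{P(U=1, A_1=a_1, O_2=1)}{P(A_1=a_1, O_2=1)}$$
$$= \theta_0 + \theta_1\left[\frac{1+a_1}{2} \times \frac{p_1}{p_1+p_3} + \frac{1-a_1}{2} \times \frac{p_2}{p_2+p_4}\right]$$
$$= \theta_0 + \frac{\theta_1}{2}\left(\frac{p_1}{p_1+p_3} + \frac{p_2}{p_2+p_4}\right) + \frac{\theta_1}{2}\left(\frac{p_1}{p_1+p_3} - \frac{p_2}{p_2+p_4}\right)a_1.$$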
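The closed form at the end of this proof can be checked numerically by direct enumeration. The sketch below assumes, as the derivation does, that U and A1 are independent with probability ½ each, and that the four probabilities of O2 = 1 are indexed as p1 (U=1, A1=1), p2 (U=1, A1=-1), p3 (U=0, A1=1) and p4 (U=0, A1=-1); the symbol names theta0, theta1 and p1 to p4 are notational assumptions, and the numeric values are arbitrary.

```python
from fractions import Fraction as F

def p_o2_given(u, a1, p):
    """P(O2 = 1 | U = u, A1 = a1) for the four cells p = (p1, p2, p3, p4)."""
    p1, p2, p3, p4 = p
    if u == 1:
        return p1 if a1 == 1 else p2
    return p3 if a1 == 1 else p4

def cond_mean_by_enumeration(a1, p, theta0, theta1):
    """theta0 + theta1 * P(U = 1 | A1 = a1, O2 = 1), computed from the joint distribution."""
    # P(U = u, A1 = a1, O2 = 1) = P(O2 = 1 | u, a1) * 1/4.
    joint = {u: p_o2_given(u, a1, p) * F(1, 4) for u in (0, 1)}
    return theta0 + theta1 * joint[1] / (joint[0] + joint[1])

def cond_mean_closed_form(a1, p, theta0, theta1):
    """Final expression of the proof, linear in a1."""
    p1, p2, p3, p4 = p
    r1, r2 = p1 / (p1 + p3), p2 / (p2 + p4)
    return theta0 + theta1 / 2 * (r1 + r2) + theta1 / 2 * (r1 - r2) * a1

if __name__ == "__main__":
    p = (F(3, 4), F(1, 2), F(1, 4), F(1, 5))
    for a1 in (-1, 1):
        print(a1, cond_mean_by_enumeration(a1, p, F(1), F(2)),
              cond_mean_closed_form(a1, p, F(1), F(2)))
```

Exact rational arithmetic is used so that the two expressions can be compared for equality rather than up to rounding error.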
Appendix 2
Let $(O_1, A_1, O_2, A_2, Y)$ denote the observable data of an individual in a randomized trial. In
this section, we show that when individuals are randomly assigned to intervention options
at each stage (the randomization probabilities may depend on past information),
\[
\text{(A.1)}\quad P(Y \le y \mid O_1=o_1, A_1=a_1, O_2=o_2, A_2=a_2) = P(Y \le y \mid O_1=o_1, a_1, O_2=o_2, a_2)
\]
and
\[
\text{(A.2)}\quad P(O_2 \le o_2 \mid O_1=o_1, A_1=a_1) = P(O_2 \le o_2 \mid O_1=o_1, a_1)
\]
for all possible values of $(o_1, a_1, o_2, a_2)$. The conditional probabilities on the left-hand
side are based on the multivariate distribution in which the interventions $(A_1, A_2)$ may vary across
individuals. For example, $P(Y \le y \mid O_1=o_1, A_1=a_1, O_2=o_2, A_2=a_2)$ is the distribution of $Y$ among
the subpopulation of individuals with $(O_1=o_1, A_1=a_1, O_2=o_2, A_2=a_2)$. Data from the
randomized trial provide information about these conditional probabilities. By contrast, the conditional
probabilities on the right-hand side are based on the multivariate distribution in which all
individuals receive the same interventions $(a_1, a_2)$. For example, $P(Y \le y \mid O_1=o_1, a_1, O_2=o_2, a_2)$
is the distribution of $Y$ among the subpopulation of individuals with $(O_1=o_1, O_2=o_2)$ if all
individuals were assigned $(a_1, a_2)$. We need information about these conditional probabilities
to construct the optimal dynamically adaptive interventions. When interventions are sequentially
randomized (randomization probabilities may depend on past information), the left-hand side
probabilities equal the right-hand side probabilities. Thus data from the randomized trial can be
used to develop the optimal dynamically adaptive interventions. Below we prove (A.1) and (A.2)
using the potential outcome framework (Neyman, 1923; Rubin, 1978; Robins, 1986, 1987).
For each fixed sequence of interventions $(a_1, a_2)$, we conceptualize potential outcomes
denoted by $O_2(a_1)$ and $Y(a_1, a_2)$, where $O_2(a_1)$ is the observation that an individual would
have had at the second stage if he/she had followed $a_1$ at stage 1, and $Y(a_1, a_2)$ is the primary
outcome that would have been observed had an individual followed the sequence $(a_1, a_2)$. Let $\mathcal{A}_1$
and $\mathcal{A}_2$ denote the sets of all possible interventions at stages 1 and 2, respectively. Then the set
of all potential outcomes is $W = \{O_1, O_2(a_1), Y(a_1, a_2) : a_1 \in \mathcal{A}_1, a_2 \in \mathcal{A}_2\}$ ($O_1$ is included for
completeness). Notice that the potential outcomes are functions only of the interventions $(a_1,
a_2)$, since we will only manipulate interventions. By definition, the multivariate distribution of
$(O_1, O_2(a_1), Y(a_1, a_2))$ is the multivariate distribution of $(O_1, O_2, Y)$ when the sequence of
interventions is set at $(a_1, a_2)$ for all individuals. This is the distribution needed to construct the
optimal adaptive interventions.
Assuming Robins’ consistency assumption holds (i.e. an individual’s intervention
assignment does not affect other individuals’ outcomes; see Robins and Wasserman (1997)), the
potential outcomes are connected to the individual’s data by $O_2 = O_2(A_1)$ and $Y = Y(A_1, A_2)$. In
addition, since the randomization probabilities in the randomized trial only depend on past
information, $A_2$ is independent of the set of all potential outcomes $W$ given $(O_1, A_1, O_2)$ and $A_1$ is
independent of $W$ given $O_1$. Hence,
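\[
\begin{aligned}
& P(Y \le y \mid O_1=o_1, A_1=a_1, O_2=o_2, A_2=a_2) \\
&= P(Y(a_1, a_2) \le y \mid O_1=o_1, A_1=a_1, O_2=o_2, A_2=a_2) \\
&= P(Y(a_1, a_2) \le y \mid O_1=o_1, A_1=a_1, O_2=o_2) \\
&= P(Y(a_1, a_2) \le y \mid O_1=o_1, A_1=a_1, O_2(a_1)=o_2) \\
&= P(Y(a_1, a_2) \le y \mid O_1=o_1, O_2(a_1)=o_2) \\
&= P(Y \le y \mid O_1=o_1, a_1, O_2=o_2, a_2)
\end{aligned}
\]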
where the first and the third equalities follow from the consistency assumption, the second and
the fourth equalities follow from the fact that the randomization probabilities depend only on
past information, and the last equality follows from the definition of potential outcomes.
Similarly, we can show that (A.2) holds.
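One way the argument for (A.2) might go, mirroring the steps used for (A.1), is sketched below. This is an illustrative reconstruction under the same assumptions (consistency, and $A_1$ independent of the potential outcomes $W$ given $O_1$), not a quotation of the authors' proof.

```latex
\begin{align*}
P(O_2 \le o_2 \mid O_1 = o_1, A_1 = a_1)
  &= P(O_2(a_1) \le o_2 \mid O_1 = o_1, A_1 = a_1)
     && \text{(consistency: } O_2 = O_2(A_1)\text{)} \\
  &= P(O_2(a_1) \le o_2 \mid O_1 = o_1)
     && \text{(} A_1 \text{ independent of } W \text{ given } O_1\text{)} \\
  &= P(O_2 \le o_2 \mid O_1 = o_1, a_1)
     && \text{(definition of potential outcomes)}
\end{align*}
```

Only one randomization step is involved here, so the chain needs a single independence argument rather than the two used for (A.1).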