Measuring Clinical Decision Making: Do Key Features Problems Measure Higher Level Cognitive Processes?
Gregory M. Hurtz¹, Roberta N. Chinn², Grady C. Barnhill³, and Norman R. Hertz⁴
¹ Department of Psychology, California State University, Sacramento, CA, USA
² PSI Services, LLC, Burbank, CA, USA
³ Commission on Dietetic Registration, Chicago, IL, USA
⁴ Progeny Systems Corporation, Manassas, VA, USA
Corresponding Author:
Gregory M. Hurtz, California State University, Sacramento, 6000 J Street, Sacramento, CA
95819-6007, USA
Email: ghurtz@csus.edu
Evaluation & the Health Professions 35(4) 396-415
© The Author(s) 2012
Reprints and permission: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0163278712446639
http://ehp.sagepub.com
Abstract
Reliable and objective assessment of clinical decision-making skills has been
a long-standing goal in occupational testing in the allied health professions.
With this goal in mind, the key features problem (KFP) format was developed,
which elicits targeted decisions about key features of clinical scenarios.
To build on a small body of empirical evidence evaluating their efficacy,
this study evaluates whether KFPs successfully assess higher order cognitive
processes. Analysis of objective data (item length and difficulty, item
performance, and response times) and subjective data (expert ratings of
cognitive complexity) supported the proposition that KFPs tend to be
more cognitively complex than conventional multiple-choice questions. Not
only were they rated as more complex, but this complexity accounted for
some of the increase in time spent responding to these items. Results support
the use of KFPs in standardized assessments for measuring higher order
cognitive processes such as clinical decision making.
Keywords
key features, clinical decision making, standardized assessment, objective scoring, cognitive levels, multiple choice
Decision making can be viewed as a stage in the problem-solving process,
where the problem solver must recognize and understand the nature of the
problem, generate possible solutions, evaluate the likely benefits, risks, and
consequences of those alternative solutions, and make one or more deci-
sions about the best course/courses of action. Clinical decision making
involves problem solving in clinical cases, and clinical decision-making
skills are important in the allied health field because patients are put at
potential risk if health care personnel cannot apply their knowledge and
training to real-world problems in an integrated and appropriate manner.
As such, developers of licensure and certification examinations for allied
health professions have expended considerable efforts and resources toward
measuring clinical decision-making skills.
Many allied health and medical organizations have focused their efforts
on “authentic” assessments of competencies with clinical scenarios. The
National Board of Medical Examiners has conducted continual research
on computer-based clinical scenarios for assessing clinical reasoning; this
work is now known as the computer-based case simulation (CCS) section of
the United States Medical Licensing Examination (USMLE, Step 3). Another
organization, the National Board Examination Committee for Veterinary
Medicine, used a series of patient management problems (the clinical
competency test) that required test takers to choose among a variety of
alternative actions for a clinical case scenario, some of which may be
appropriate and others either inappropriate or even contraindicated.
Other medical and some allied health organizations have attempted
to assess clinical reasoning through the use of oral examinations. All of
these complex item and scoring formats required considerable expenditures
of staff and subject matter expert (SME) time, as well as considerable
financial resources. The decision to make such an investment
demonstrates the importance of assessing clinical decision making in a
licensing/certification context.
Patient management problems and oral examinations have been largely
abandoned in certification and licensure testing. In the case of the patient
management problem format, there has been a great deal of divided psycho-
metric opinion as to its reliability and validity. Many have expressed the
opinion that the expense and time required to produce such items is not
repaid in terms of measurement reliability or validity. Oral examinations
were discontinued due to measurement issues, time constraints, and con-
cerns with bias and candidate anonymity.
Allied health and medical organizations have continued their efforts to
assess clinical decision-making skills, primarily through the use of the
multiple-choice item format. Efforts are made to improve fidelity, some-
times through the use of clinical photographs, video recordings, or other
means, but the multiple-choice question (MCQ) is still the format of
choice. Sometimes a “standardized patient” format will utilize interactions
with a patient/actor to assess patient communications or even diagnostic
skills, but this is a complicated and expensive approach. The same
can be said of the multistation, objective structured clinical examination
(OSCE) format, generally used to assess various clinical skills. In short,
the allied health certification and licensure testing landscape is still in
need of a better measurement strategy to assess clinical decision making.
What is needed is an item format that is easier and less expensive to
develop and maintain, more straightforward to administer and score, and
stable in its measurement attributes, all while assessing clinical decision
making at a higher cognitive level.
The key features problem (KFP) is a promising alternative measurement
strategy designed to improve upon the use of scenario-based assessments in
this context (Bordage & Page, 1987; Page & Bordage, 1995; Page, Bordage,
& Allen, 1995). KFPs present clinical scenarios and can elicit short-answer
constructed responses or can be objectively scored from a menu of response
options where test takers choose multiple responses and are scored against a
correct response profile derived by experts. A small body of research evi-
dence supports the proposition that KFPs measure decision making which
results from higher order cognitive processes and problem-solving strate-
gies (Schuwirth et al., 1999; Schuwirth, Verheggen, van der Vleuten, Boshui-
zen, & Dinant, 2001). Haladyna (2004) suggests that the KFP strategy is a
“strong contender among other approaches to modeling higher level thinking
that is sought in testing competence in every profession” (pp. 169–170) but
goes on to say that more research is needed to evaluate the degree to which
KFPs accomplish the “elusive goal” (p. 170) of assessing clinical problem
solving. The current study addresses this call for research by further
evaluating the proposition that KFPs assess higher order cognitive processes
consistent with clinical decision making.
The idea underlying KFPs is the notion of a “key feature,” or critical
step in the resolution of a clinical problem (Bordage & Page, 1987; Farmer
& Page, 2005). For example, the key feature of a solution may be a step
where a practitioner is most likely to make an error, or a decision point
where the next course of action could be harmful to the patient if the wrong
decision is made. KFPs can address common errors made by practitioners as
well as rare but critical conditions such as infrequent emergency procedures
(Kim, Amin, & Ng, 2007). The KFP format presents a description of the
clinical case scenario followed by questions that focus solely on the identi-
fied key features of the case. The questions typically provide several differ-
ent response options from which multiple critical elements can be selected.
Scoring can be dichotomous (i.e., credit for choosing all correct options and
no incorrect options, versus no credit for partial, incorrect, or too many
selections) or can allow partial credit to be assigned. In addition,
“automatic zero” options may be included, for example in a situation where one
particular decision kills the patient. Farmer and Page (2005) and Haladyna
(2004) provide brief guides to the development of KFPs.
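The scoring options described above can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the scoring procedure of any particular examination: the function name `score_kfp`, the option labels, and the specific partial-credit rule (correct selections minus incorrect selections, floored at zero and divided by the number of keyed options) are hypothetical.

```python
def score_kfp(selected, correct, automatic_zero=frozenset(), partial_credit=False):
    """Score one KFP question from the options a test taker selected.

    `correct` is the keyed response profile and must not be empty.
    """
    selected, correct = set(selected), set(correct)
    # An "automatic zero" option overrides everything else.
    if selected & set(automatic_zero):
        return 0.0
    if partial_credit:
        # Hypothetical partial-credit rule (an assumption, not from the source):
        # correct picks minus incorrect picks, floored at zero, as a share of the key.
        hits = len(selected & correct)
        false_picks = len(selected - correct)
        return max(hits - false_picks, 0) / len(correct)
    # Dichotomous rule: credit only for all correct options and nothing else.
    return 1.0 if selected == correct else 0.0
```

For a question keyed to options A and C, selecting only A earns no credit under the dichotomous rule and half credit under the assumed partial-credit rule; selecting a flagged option earns zero under either rule.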
KFPs are not intended to measure factual knowledge; they are intended
to measure application and synthesis of knowledge and perhaps experience,
coupled with reasoning and judgment, in the selection of appropriate
courses of action for scenarios that are actually encountered on the job.
Research has indicated that KFPs correlate only moderately with measures
of factual knowledge (Fischer, Kopp, Holzer, Ruderich, & Junger, 2005;
Hatala & Norman, 2002), and that the thought processes underlying
experts’ responses to KFPs are more elaborate and qualitatively different
when compared to those underlying more factual-based multiple-choice
items (Schuwirth et al., 2001). For instance, Schuwirth et al. (2001) found
that experts used less information and responded more quickly to KFPs than
did novices, whereas no such expert–novice differences emerged on the
more factual MCQs. In addition, experts engaged in more “nonsequential”
processing of information in KFPs, while novices accessed the information
in a more sequential manner. These findings support the notion that KFPs
elicit different reasoning and decision-making strategies than more factual
items. In another study, Schuwirth et al. (1999) found that KFPs signifi-
cantly differentiated students from a medical school that used an exclusive
problem-based learning curriculum from a similar school that used
primarily a lecture-based curriculum, further supporting the notion that KFP
performance relies on problem-based clinical decision-making skills.
While the existing research on KFPs is promising, more evidence is
needed to support the notion that these relatively elaborate items actually
do go beyond more conventional, brief, single-answer MCQs and tap into
the higher cognitive processes they purport to measure. Of the various
objectively scored multiple-choice item formats available to test developers
for designing their assessments (see Haladyna, 2004, for different options),
the question is “Do KFPs stand apart from more conventional, brief
single-answer MCQs in terms of the cognitive processes they elicit?” In other
words, to what degree do KFPs, which to date have appeared superior to
other attempts in measuring clinical decision making, actually accomplish
the “elusive goal” of measuring higher level cognitive processes? The
current study evaluated this question using a combination of SME ratings on
the cognitive complexity of KFP and MCQ items and the response time and
performance of actual test takers.
We proposed three research questions for this study. First, do experts
view KFPs as more cognitively complex than MCQs? Answering this ques-
tion provides new information on whether high-level cognitive processes
are involved in answering KFPs to a greater extent than in more conven-
tional MCQs. Second, does this complexity explain differences in
response times between KFPs and MCQs, after controlling for differences
in item length and difficulty? KFPs tend to be longer than MCQs which
adds reading time, and KFPs also tend to be more difficult than MCQs
which can affect response times. The question here is not only whether
more time is spent on KFPs relative to MCQs over and above these fac-
tors, but also the degree to which this additional time spent can be
accounted for by the KFPs eliciting higher level cognitive processes.
Answering this question helps build the pattern of evidence regarding
whether KFPs are tapping into clinical decision making.
Third, do these residual differences in response times (after controlling
for item length and difficulty) have a different pattern of relationship to item
performance for KFPs versus MCQs? To a certain degree, it can be argued
that experts either know the answers or they do not, so that delayed
responses may indicate that the test taker simply does not know the answer.
However, if KFPs require higher level cognitive processes, then we would
expect at least some of the extra time to translate into better item perfor-
mance, whereas delayed responses on the simpler MCQs may be a stronger
indication of a lack of knowledge. In other words, if KFPs involve higher
level processes then answering correctly should take a bit longer as test
takers work through their decision processes.
Method
Setting
This study was conducted in the context of examination development
for two dietetic specialty areas (gerontology and oncology). A total of six dif-
ferent forms were used. These examinations contain a combination of con-
ventional, relatively brief, four-option MCQs, and the more elaborate KFPs.
Item Development
The procedures for developing MCQs and KFPs were similar. Participants
were provided a formal orientation in the principles of good item construc-
tion, opportunities to familiarize themselves with the content specifications,
and opportunities to work with fellow participants to create the items. For
each item, considerable emphasis was placed on specifying the linkage of
item content to the content specifications and providing a citation from
an authoritative reference source. Therefore, each item was linked to a
specific task and knowledge from the test specifications and to a page or
section of an authoritative reference source. There were numerous opportu-
nities for individual assistance with item development as well as opportuni-
ties for review by other participants.
The primary distinction between the development process for MCQs and
KFPs was the degree of focus on the key features of the problems for clin-
ical decisions. MCQs were relatively brief and had four response options
with one correct answer and focused on either applied questions regarding
facts, concepts, and procedures or scenario-based questions regarding com-
monly encountered situations that require analysis, interpretations, and
decisions. KFPs differed from scenario-based multiple-choice items in the
degree of complexity of the scenarios and the degree to which they empha-
sized the candidate’s ability to triage among multiple correct answers and
multiple distractors, many of which could be correct, but were not the
actions or interventions that addressed the most urgent needs of the patient
in the scenario. The intent of the KFP was to assess the candidates’ abilities
to implement more than one step in their decision-making process, discri-
minate between relevant and irrelevant information presented in the sce-
nario, and prioritize correct responses in terms of urgency of the
situation. Test takers were instructed on how many responses to choose
when responding to each item (e.g., “Choose three”), and they were scored
in an all-or-none fashion. SMEs determined the minimum number of these
correct options needed to receive a point for each item and those selecting
fewer than the minimum received no credit for that item.
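The all-or-none rule described here might be sketched as follows. This is one hedged reading of the rule, since the text does not state how responses with more selections than instructed were handled; the function name `score_choose_n` and the decision to give no credit for over-selection are assumptions.

```python
def score_choose_n(selected, keyed_correct, n_to_choose, min_correct):
    """All-or-none score (1 or 0) for a 'Choose n' item."""
    selected = set(selected)
    # Assumption: selecting more options than instructed earns no credit.
    if len(selected) > n_to_choose:
        return 0
    n_hits = len(selected & set(keyed_correct))
    # Credit only when the SME-defined minimum number of correct options is reached.
    return 1 if n_hits >= min_correct else 0
```

With four keyed options, a “Choose three” instruction, and a minimum of three, a test taker who selects two keyed options and one distractor receives no credit for the item.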
Examination Administration
Examinations were delivered to certification candidates as computer-based
tests at testing centers where proctors provided standardized instructions.
The candidates were actual test takers who were completing the examina-
tions for certification and not for the purposes of this study. For this study,
the data were extracted from database archives without any identifying
information associated with the test takers.
Cognitive Complexity Surveys
Multiple survey forms were constructed using subsets of the KFPs from
each of the six operational examination forms used in this study inter-
spersed in a counterbalanced fashion with six corresponding subsets of
MCQs drawn randomly from the same forms. The correct answers were not
marked in order to force raters through the thought process of answering the
items. Raters were instructed to read each item, think about the correct
answer/answers, and decide on the level of cognitive processing required
to answer correctly. Ratings were given on a 4-point scale derived from the
language of common cognitive level taxonomies: 1 = Factual, 2 = Application,
3 = Interpretation, and 4 = Synthesis. Table 1 provides the definitions
that were developed using the terminology of the focal professions to
make them most meaningful to expert raters from these professions. The
MCQs and KFPs were counterbalanced on the survey and were not labeled
in any way for the raters as being one type of item or another.
Procedure and Participants
SMEs were contacted via e-mail and asked to serve as raters in a quality-
control research study designed to assess the cognitive processes elicited
by certification candidates when they take the examinations. All invited
SMEs were certified and in good standing in their respective specialty
areas. Most were drawn at random from a database of certificate holders,
while SMEs who had previously served as item writers were also invited
to participate. The surveys were put on a password-protected Internet site,
and when SMEs logged in they were randomly assigned one of the alternate
forms in their specialty area (gerontology vs. oncology). They were allowed
3 weeks to log in and complete their ratings, and on completion they were
sent an honorarium of $25. Each participant was assured that survey rat-
ings would be combined with those of other participants to determine
group trends. Individual ratings were kept confidential and had no real
or potential impact on any aspect of their employment or status as a reg-
istered dietitian. All participants and their data were treated in accordance
with the American Psychological Association (APA) ethical guidelines.
Table 1. Definitions of Cognitive Complexity Levels for the Rating Task
Level  Definition (a)
1      FACTUAL, e.g., identifies signs, symptoms, and nutrition-related side
       effects; identifies standardized tools for assessing cancer and
       treatment-related side effects; determines energy, protein, fluid
       needs; determines nutrition support recommendations; provides
       interventions in accordance with treatment goals
2      APPLICATION, e.g., recognizes treatment-related/nutrition-related
       causes of nutrition impact symptoms; develops nutrition care plans
       specific to phase of cancer care; develops nutrition care plans in
       accordance with intent/goal of cancer therapy; uses standardized
       tools for assessing cancer and treatment-related side effects
3      INTERPRETATION, e.g., differentiates between
       treatment-related/nutrition-related causes of nutrition impact
       symptoms; integrates nutrition interventions for managing side
       effects related to multimodal treatment; initiates nutrition care
       plans in accordance with intent/goal of therapy; prioritizes
       nutrition care services; adapts nutrition care plan to maximize
       benefit of diet and medications
4      SYNTHESIS, e.g., anticipates treatment-related/nutrition-related
       causes of nutrition impact symptoms; recommends modifications in
       supportive care therapies; anticipates fluctuations in relevant
       clinical data and its impact on treatment; monitors recovery from
       treatment and response to treatment; anticipates and recommends
       modifications in nutrition care plan
Note. (a) The details shown in this example after the “e.g.” are for the oncology examinations.
Variables and Data Collected
Comparisons of KFPs and MCQs were first made in terms of SME ratings
of cognitive complexity on the survey forms. In addition, test taker response
times (measured in seconds) and item performance (correct/incorrect) were
extracted from data archives for each of the items that were included in the
SME surveys. In addition, each test taker’s total examination score (i.e., their
sum correct across all items, on which certification decisions were made in
practice) was collected, and these were converted into within-form standard
scores. These four variables (cognitive complexity, response times, item
performance [both binary correct/incorrect values and aggregated percentage
correct p values], and overall examination performance) were the primary
focal variables of the study.
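The conversion of total scores into within-form standard scores can be illustrated with a short Python sketch. It assumes the usual z-score definition (score minus the form mean, divided by the form standard deviation); the function name `within_form_z`, the record layout, and the use of the population standard deviation are assumptions, since the text does not specify them.

```python
from collections import defaultdict
from statistics import mean, pstdev


def within_form_z(records):
    """Standardize total scores within each examination form.

    `records` is a list of (form_id, total_score) pairs; z-scores are
    returned in the same order. Each form needs some score variation,
    otherwise its standard deviation is zero and the division fails.
    """
    by_form = defaultdict(list)
    for form_id, score in records:
        by_form[form_id].append(score)
    # Assumption: population SD; a sample SD would be an equally plausible choice.
    stats = {f: (mean(s), pstdev(s)) for f, s in by_form.items()}
    return [(score - stats[f][0]) / stats[f][1] for f, score in records]
```

Standardizing within form places test takers from different forms on a common scale, so that a score of zero always means average performance on the form that was taken.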
Because MCQs and KFPs typically differ in length which may covary
with the primary variables (especially response times), two separate length
variables were coded to incorporate statistical controls. First, we coded item
stem length as a simple count of words within the “prompt” or stem of the
item. This accounted for each word the test taker would need to read from
the start of the item text up to the point where they begin reading and decid-
ing on the response options. Second, we coded the total word count of the
block of response options.
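The two length variables could be coded along the following lines. This is a minimal sketch that assumes words are separated by whitespace; the text does not say how hyphenated terms or numerals were counted, and the function names and the sample item are hypothetical.

```python
def word_count(text):
    """Count whitespace-separated words (an assumed tokenization rule)."""
    return len(text.split())


def item_lengths(stem, options):
    """Return (stem word count, total word count of the response options)."""
    return word_count(stem), sum(word_count(option) for option in options)


# Hypothetical item text, for illustration only.
stem = "A 72-year-old resident has lost weight."
options = ["Start oral supplements", "Reassess in one month"]
stem_words, option_words = item_lengths(stem, options)
```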
Data Analysis
Survey response rates were tallied, chi-square tests were used to evaluate
whether response rates for the different forms varied across demographic
variables, and rater reliability was evaluated using generalizability theory.
Ratings of cognitive complexity were compared between item types using
a graphical display and a 2-way mixed analysis of variance (ANOVA) with
item type as the repeated variable and survey form as the independent-group
variable. Item stem and option length, item difficulties, average cognitive
complexity ratings of items, and an item type dummy variable (MCQ = 0,
KFP = 1) were then used as predictors of response time in a hierarchical
multiple regression analysis. Finally, standardized examination scores, item
length, item type, item response time, and an interaction term between item
type and item response time were used as predictors of item performance
(correct/incorrect) in a multiple logistic regression analysis.
Results
Response Rate, Respondent Characteristics, and Reliability of Ratings
Participation requests were sent to 183 SMEs, and 101 (55.2%) agreed to
participate and provided their ratings of items within their specialties. All
participants were college educated, with 46 (45.5%) possessing bachelor’s
degrees, 52 (51.5%) possessing master’s degrees, and 3 (3.0%) possessing
doctoral degrees. Ninety-nine (98.0%) were employed in dietetic practice at
the time of the study and likewise 99 (98.0%) were actively providing
oncology nutrition services, with 81 (80.2%) providing these services for
more than 20 hours per week and 62 (61.0%) for more than 30 hours per
week. All 101 were credentialed as registered dietitians with the Commis-
sion on Dietetic Registration for at least 3 years, while 64 (63.4%) had been
for over 10 years.
Random assignment to groups resulted in 12–22 raters per form.
Affirming the success of the random assignment, chi-square tests of inde-
pendence indicated that rater group membership was unrelated (ps > .05)
to all demographic variable categories summarized above, supporting the
equivalence of the groups of raters along these relevant demographic char-
acteristics. Rater-by-item generalizability theory analyses for each form
resulted in generalizability coefficients of .76 to .95 across examination
forms, and .89 when pooled across forms (for both relative and absolute
error), indicating strong interrater reliability for the ratings.
Do SMEs View KFPs as Requiring More Complex Cognitive Processes than MCQs?
Figure 1 provides a simple tally of cognitive level ratings, split by item type
for visual comparison. A clear pattern is indicated where MCQs were more
often seen as factual and progressively less often seen as measuring the
application, interpretation, and synthesis levels. KFPs were least often seen
as factual and more often seen as measuring the application, interpretation,
and synthesis levels. Because this figure tallies ratings across items that
were provided by the same SMEs, data were aggregated to means across
raters and item types (MCQs vs. KFPs), and the mean ratings were used
as the unit of analysis for tests of statistical significance.
Figure 2 displays a plot of the average ratings for each item type across
the six test forms. A two-way mixed ANOVA revealed a statistically
significant and quite strong interaction between Item Type and Test Forms on
cognitive complexity ratings, F(5, 95) = 18.00, p = .000, partial η² = .49.
The main effect of item type was also significant with a stronger effect
size, F(1, 95) = 295.53, p = .000, partial η² = .76, and the main effect of
forms was also significant and moderately strong, F(5, 95) = 5.35, p = .000,
partial η² = .22. The main effect of item type, which was the strongest of
the three effects, is of particular interest, with Figure 2 showing clearly that
KFPs (M = 2.73; SD = .31) were rated higher, on average, than the MCQs
(M = 1.93; SD = .44). While the significant interaction suggests this effect
is moderated by form, simple effects with a Bonferroni-corrected α of
.05/6 = .008 suggest that the difference is nevertheless significant for all six
forms. As seen in Figure 2, the mean ratings for KFPs were always higher
than those for MCQs, and the nature of the interaction is such that the effect
was simply stronger for the gerontology test forms (η²s = .58 and .68) than
for the oncology test forms (η²s = .08–.23).
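The reported effect sizes can be checked against the F ratios with the standard identity partial η² = (F × df effect) / (F × df effect + df error). The Python sketch below applies it to the three ANOVA effects above and also reproduces the Bonferroni-corrected α; the function name is an assumption introduced for illustration.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    effect = f_value * df_effect
    return effect / (effect + df_error)


# F ratios and degrees of freedom as reported in the text.
interaction = partial_eta_squared(18.00, 5, 95)
item_type = partial_eta_squared(295.53, 1, 95)
forms = partial_eta_squared(5.35, 5, 95)

# Bonferroni correction for six simple-effects tests.
bonferroni_alpha = 0.05 / 6
```

Rounded to two decimals, the three values agree with the effect sizes reported above (.49, .76, and .22), and the corrected α rounds to .008.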
Figure 1. Frequency counts of cognitive level ratings by item type summed across items and raters.
Does Cognitive Complexity Help to Explain Differences Between KFPs and MCQs in Test Taker Response Times?
The mean response times computed from actual test takers for each of
the 166 items (Ns = 24 to 79 per item, depending on the test form)
were then regressed onto the dummy variable representing item type
(0 = MCQ, 1 = KFP) as well as the mean cognitive complexity rating
across SMEs, after controlling for response time differences due to
variability in item length and difficulty. Table 2 provides the descriptive
statistics for each variable in this analysis broken down by MCQ versus KFP.
For the regression analysis, item difficulties were converted into logits and
then multiplied by −1 so that higher numbers represent more difficulty. All
control variables were mean centered before entering them into the regression
analysis to facilitate interpretation of the multiple regression results.
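The two transformations described above (the sign-reversed logit of the item p value, and mean centering) can be sketched in Python as follows. The clamping constant `eps` is an assumption: the logit is undefined at p = 0 and p = 1, both of which appear in Table 2, and the text does not say how such items were handled. The function names are likewise hypothetical.

```python
import math


def difficulty_logit(p_value, eps=0.01):
    """Return -1 * logit(p), so that harder items get larger values."""
    # Assumption: clamp p away from 0 and 1, where the logit is undefined.
    p = min(max(p_value, eps), 1 - eps)
    return -math.log(p / (1 - p))


def mean_center(values):
    """Subtract the mean so that zero represents an average item."""
    center = sum(values) / len(values)
    return [v - center for v in values]
```

An item answered correctly by 75% of test takers gets a value of about −1.10, and one answered correctly by 25% gets about +1.10, so the harder item has the larger value.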
Figure 2. Means of cognitive level ratings for MCQs and KFPs across test forms. MCQs = multiple-choice questions; KFPs = key features problems.
Table 3 presents the results. Model 1 accounted for the effects of item
length and included the item stem word count and options word count
variables, which both significantly increased response times and together
explained 57% of the variance (R² = .57, p = .000). Model 2 added item
difficulty, which was found to account for further increases in response
times and produced an increment of 13% to the explained variance. The
total variance explained by these control variables was approximately
71%, and the β weights indicated that differences in stem word counts
contributed the most to differences in overall response times. Due to mean
centering, the raw equation for Model 2 reveals that for items of average
length and difficulty test takers spent approximately 85.26 seconds
responding; all else equal, each additional word in the stem adds approximately
0.45 seconds, each additional word in the response options adds
approximately 0.86 seconds, and each 1-logit increase in item difficulty adds
approximately 11.23 seconds of response time.
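The Model 2 equation described above can be written out as a small Python function. The coefficients are the rounded values reported in the text; the function and argument names are assumptions, and each argument is expressed as a deviation from the mean because the predictors were mean centered.

```python
def predicted_seconds(stem_dev=0.0, options_dev=0.0, difficulty_dev=0.0):
    """Predicted mean response time (seconds) from the Model 2 raw equation.

    Each argument is a deviation from the average item: extra words in the
    stem, extra words in the options, and extra logits of difficulty.
    """
    return 85.26 + 0.45 * stem_dev + 0.86 * options_dev + 11.23 * difficulty_dev


average_item = predicted_seconds()               # item of average length and difficulty
longer_stem = predicted_seconds(stem_dev=20)     # 20 more stem words than average
harder_item = predicted_seconds(difficulty_dev=1)  # one logit harder than average
```

Under these coefficients, a stem 20 words longer than average adds 9 seconds to the predicted response time, somewhat less than the roughly 11 seconds added by one logit of difficulty.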
In Model 3, the item type dummy variable (0 = MCQ, 1 = KFP) was
entered and found to explain an additional 1% of variance in response time,
which was statistically significant, ΔR² = .01, p = .045. The unstandardized
weight shows that of the 68.88-second raw difference in response times
between MCQs and KFPs (derived from Table 2), 15.46 seconds still
remain after accounting for differences in item length and difficulty.
However, when mean cognitive level ratings were added in Model 4, they
were found to be significant as well (ΔR² = .01, p = .006) and rendered the
coefficient for item type nonsignificant (p = .122). This pattern, where item
type was significant in the model excluding cognitive levels but became
nonsignificant when cognitive levels were added, is consistent with a
mediation effect where differences in response times between MCQs and KFPs
are largely explained by differences in the cognitive levels they measure.
Table 2. Descriptive Statistics for Control Variables (Item Length and Difficulty), Mean Cognitive Level Ratings, and Response Times
Variable                      M        SD      Min     Max
Stem length (word count)
  MCQ                         34.37    23.48   8       105
  KFP                         117.78   36.87   44      200
Options length (word count)
  MCQ                         21.77    19.09   4       88
  KFP                         35.13    21.61   6       121
Item difficulty (p value)
  MCQ                         0.75     0.22    0.19    1.00
  KFP                         0.54     0.28    0.00    1.00
Mean cognitive level ratings
  MCQ                         1.93     .62     1.00    3.73
  KFP                         2.74     .43     1.54    3.58
Response times (seconds)
  MCQ                         51.70    26.55   14.54   175.93
  KFP                         120.58   45.36   53.85   217.46
Notes. MCQs = multiple-choice questions; KFPs = key features problems.
Table 3. Results of Multiple Regression Analysis Predicting Mean Response Time Per Item From the Item’s Cognitive Complexity, Controlling for Item Length and Difficulty
Model  Variables entered             b       β     t       p      sr²
1 (a)  (Constant)                    85.10         32.84   .000
       Stem Word Count               .55     .57   10.75   .000   .31
       Options Word Count            .92     .39   7.40    .000   .15
2 (b)  (Constant)                    85.26         39.57   .000
       Stem Word Count               .45     .47   10.24   .000   .19
       Options Word Count            .86     .37   8.30    .000   .13
       Item Difficulty (−1*logit)    11.23   .38   8.49    .000   .13
3 (c)  (Constant)                    77.69         18.03   .000
       Stem Word Count               .34     .36   4.92    .000   .04
       Options Word Count            .81     .34   7.63    .000   .11
       Item Difficulty (−1*logit)    10.59   .36   7.86    .000   .11
       Item Type (0 = MCQ, 1 = KFP)  15.46   .15   2.02    .045   .01
4 (d)  (Constant)                    64.02         9.95    .000
       Stem Word Count               .28     .29   3.92    .000   .03
       Options Word Count            .78     .33   7.46    .000   .10
       Item Difficulty (−1*logit)    10.48   .36   7.94    .000   .11
       Item Type (0 = MCQ, 1 = KFP)  11.81   .12   1.56    .122   .00
       Mean Cognitive Level Rating   11.67   .16   2.81    .006   .01
Notes. N = 166 items, which was the unit of the analysis. To facilitate interpretation of the constant and b weights, the Stem Word Count, Options Word Count, and Item Difficulty variables were mean centered prior to the regression analysis, while the Mean Cognitive Level Rating variable was recoded such that the factual level equals 0. MCQs = multiple-choice questions; KFPs = key features problems.
(a) Model 1 R² = .573 (adjusted R² = .568), F(2, 159) = 106.75, p = .000.
(b) Model 2 R² = .707 (adjusted R² = .701), ΔR² = .13, F(1, 158) = 72.01, p = .000.
(c) Model 3 R² = .714 (adjusted R² = .707), ΔR² = .01, F(1, 157) = 4.09, p = .045.
(d) Model 4 R² = .728 (adjusted R² = .719), ΔR² = .01, F(1, 156) = 7.916, p = .006.
Does Time Spent Responding Relate Differently to Item Performance for KFPs versus MCQs?
Table 4 presents the results of a logistic regression analysis focusing on item
performance (incorrect = 0 and correct = 1) as the dependent variable.
Each test taker’s standardized total score on the examination form was
entered as a control variable in Model 1 to partial out differences in item
performances due to levels of proficiency. Item length (stem and option
word count) and item type were also entered in Model 1 to partial out the
differences in item difficulty between MCQs and KFPs. Model 1 was found
statistically significant, χ²(4) = 189.42, p = .000, with a Nagelkerke R² of
.07. Model 2, which added item response time, was significantly stronger
than Model 1, Δχ²(1) = 131.31, p = .000, and incremented the Nagelkerke
R² by .04 up to a total of .11. The nature of the effect was such that after
accounting for differences in test taker abilities and item lengths and
difficulties, more time spent on the items was associated with a lower likelihood
of answering correctly.
In Model 3, however, time spent was found to interact significantly with
item type, Δχ²(1) = 20.81, p = .000, which incremented the Nagelkerke R² by
.01 up to a total of .12. The nature of the interaction effect is plotted in Figure
3, which shows the predicted probabilities of success for MCQs and KFPs as
a function of the number of seconds spent on the item. Note that the control
variables (test taker proficiency, item stem word count, and options word
count) were held constant at their average values when plotting these prob-
abilities, and the time elapsed across the horizontal axis runs approximately
through the 99th percentile (351 seconds or 5.85 minutes) of time spent in the
observed data. Figure 3 shows that the rate of decline in predicted probabilities
of success over time for KFPs was not as steep as it was for MCQs.

Table 4. Results of Logistic Regression Analysis Predicting Item Performance From Time Spent on KFPs Versus MCQs, Controlling for Test Taker Proficiency Levels
Model  Variables entered                      b   exp(b)    Wald      p
1a     (Constant)                          1.06     2.89  194.97   .000
       Test-Taker Proficiency              0.06     1.06    2.31   .129
       Stem Word Count                     0.00     1.00     .72   .395
       Options Word Count                  0.00     1.00     .05   .820
       Item Type (0 = MCQ, 1 = KFP)       −1.04     0.36   62.24   .000
2b     (Constant)                          1.21     3.35  236.66   .000
       Test-Taker Proficiency              0.05     1.05    1.70   .193
       Stem Word Count                     0.00     1.00    6.95   .008
       Options Word Count                  0.01     1.01    9.77   .002
       Item Type (0 = MCQ, 1 = KFP)       −0.86     0.42   41.28   .000
       Time Spent (in seconds)            −0.01     0.99  114.76   .000
3c     (Constant)                          1.44     4.23  223.84   .000
       Test-Taker Proficiency              0.04     1.04    1.24   .266
       Stem Word Count                     0.00     1.00   10.39   .001
       Options Word Count                  0.01     1.01    9.46   .002
       Item Type (0 = MCQ, 1 = KFP)       −1.37     0.26   61.71   .000
       Time Spent (in seconds)            −0.01     0.99   80.64   .000
       Item Type × Time Spent              0.01     1.01   19.81   .000
Note. MCQ = multiple-choice question; KFP = key features problem.
a Model 1 Nagelkerke R² = .07, χ²(4) = 189.42, p = .000.
b Model 2 Nagelkerke R² = .11, Δχ²(1) = 131.31, p = .000.
c Model 3 Nagelkerke R² = .12, Δχ²(1) = 20.81, p = .000.
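As a rough illustration of how predicted probabilities of this kind are obtained, the sketch below plugs the rounded Model 3 coefficients from Table 4 into the logistic function. It is not the authors' computation. It assumes that the control variables add nothing to the logit at their average values (true only if they were mean centered, which the source does not state for this analysis), and the function and constant names are hypothetical. Because the table reports coefficients to two decimals only, the time slope for KFPs (−0.01 + 0.01) collapses to zero here, so the flat KFP curve is a rounding artifact. Only the qualitative pattern, a steeper decline for MCQs than for KFPs, should be read from it.

```python
import math

# Rounded coefficients reported for Model 3 in Table 4 (two decimals only).
B_CONSTANT = 1.44
B_ITEM_TYPE = -1.37      # 0 = MCQ, 1 = KFP
B_TIME = -0.01           # per second spent on the item
B_INTERACTION = 0.01     # Item Type x Time Spent


def predicted_probability(seconds, is_kfp):
    """Probability of a correct answer from the rounded Model 3 logit.

    Assumption: the control variables contribute nothing to the logit at
    their average values, which holds only if they were mean centered.
    """
    item_type = 1 if is_kfp else 0
    logit = (B_CONSTANT
             + B_ITEM_TYPE * item_type
             + B_TIME * seconds
             + B_INTERACTION * item_type * seconds)
    return 1.0 / (1.0 + math.exp(-logit))


# Tabulate both item types up to the 99th percentile of time spent (351 s).
for seconds in (0, 60, 120, 240, 351):
    p_mcq = predicted_probability(seconds, is_kfp=False)
    p_kfp = predicted_probability(seconds, is_kfp=True)
    print(f"{seconds:>3d} s  MCQ {p_mcq:.2f}  KFP {p_kfp:.2f}")
```

With unrounded coefficients the KFP curve would presumably show the shallow decline described in the text rather than a flat line.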
Discussion
The primary purpose of this study was to evaluate the degree to which
KFPs tend to measure cognitive processes that are more complex than those
measured by conventional MCQs, in order to expand the existing body of evidence
(e.g., Schuwirth et al., 1999, 2001) regarding their validity for assessing
clinical decision making. SME ratings of the cognitive complexity of KFP
and MCQ items were found to be reliable and, addressing our first research
question, consistently indicated that the SMEs judged the KFPs to
measure higher cognitive levels than the MCQs. Thus, in the judgment
of SMEs, the KFPs used in this study tended to achieve the goal of assessing
relatively more complex cognitive processes than MCQs in the
decision-making process required for selecting the correct answers to the items.
Figure 3. Interaction plot of the effects of time spent on item performance for MCQs versus KFPs. MCQs = multiple-choice questions; KFPs = key features problems.
With respect to our second research question, consistent with the evi-
dence that KFPs require more complex cognitive processes, our multiple
regression analysis focusing on response times revealed that even after con-
trolling for response time differences that were explained by the fact that the
KFPs tended to be longer and more difficult than MCQs, test takers still
spent more time responding to the KFPs. On average, test takers spent
approximately 69 more seconds on KFPs than MCQs, and our results sug-
gested that approximately 15 of these seconds could not be accounted for
by differences in item length and difficulty alone. Moreover, when cogni-
tive level ratings were added to the regression model, the differences in
response times between KFPs and MCQs were reduced to nonsignifi-
cance. This finding is consistent with a mediation effect where the differ-
ences in response times between item types appear to be largely attributed
to the higher level processes elicited by the KFPs. Thus, in response to our
second research question, it appears that cognitive complexity does in fact
help to explain longer response times on KFPs relative to MCQs. In other
words, it is not only reading time and item difficulty that produce longer
response times to KFPs but also more complex problem-solving and
decision-making processes. This again supports the validity of KFPs as
measures of higher level cognitive processes.
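The arithmetic behind this argument can be checked against the rounded values in Table 3. The sketch below applies the standard F-change formula for nested regression models and computes how far the Item Type coefficient shrank once cognitive level ratings were added. That the authors computed their F statistics in exactly this way is an assumption, the function name is hypothetical, and the shrinkage figure is descriptive only, not a formal test of mediation.

```python
def f_change(r2_reduced, r2_full, df_added, df_residual):
    """F statistic for the R-squared increment between nested OLS models."""
    return ((r2_full - r2_reduced) / df_added) / ((1.0 - r2_full) / df_residual)


# Values as reported in Table 3 (R-squared rounded to three decimals).
R2_MODEL3, R2_MODEL4 = 0.714, 0.728
B_ITEM_TYPE_MODEL3, B_ITEM_TYPE_MODEL4 = 15.46, 11.81

# Model 4 adds one predictor (Mean Cognitive Level Rating); the residual
# degrees of freedom (156) are taken from the reported F(1, 156).
f_model4 = f_change(R2_MODEL3, R2_MODEL4, df_added=1, df_residual=156)

# Shrinkage of the Item Type effect after the rating enters the model.
drop_seconds = B_ITEM_TYPE_MODEL3 - B_ITEM_TYPE_MODEL4
drop_share = drop_seconds / B_ITEM_TYPE_MODEL3

print(f"F change for Model 4: {f_model4:.2f} (reported: 7.916)")
print(f"Item Type coefficient drop: {drop_seconds:.2f} s ({drop_share:.0%})")
```

The recomputed F comes out near 8.0 rather than the reported 7.916, a gap that is consistent with rounding in the published R² values. The Item Type coefficient falls by about 3.65 seconds, roughly a quarter of its Model 3 value.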
While cognitive complexity helps explain some of the extra time test
takers spend responding to KFPs, the third research question went a little
further and asked whether the extra time actually translates into greater
gains in performance on KFPs versus MCQs. If so, this is again consistent
with the proposition that the extra time tends to be spent engaging in a more
complex decision-making process required to successfully answer the item.
When evaluating this question the logistic regression results suggested that,
after taking into account differences in test taker proficiency and differ-
ences in item length and difficulties, the more time a test taker spends on
an item, the less likely they are to answer correctly. However, the interaction
model suggested that this pattern holds more strongly for MCQs than for KFPs. This
is consistent with the SME ratings which identified many MCQs as factual
in nature, where a short factual question is likely to result in a quick
response for test takers who know the answer. On the other hand, SMEs felt
that more of the KFPs required complex cognitive processes, and the regres-
sion results described earlier suggested that some of the extra time spent on
KFPs was likely due to those complex processes. This may explain why
quickness of responding did not relate as strongly to performance for KFPs
as it did for MCQs. In other words, to a certain degree at least, it appears that
it legitimately takes more time on KFPs to work through a more complex
problem-solving process and provide a judgment and decision regarding the
correct answer/answers.
To summarize, the findings of this study provide evidence that KFPs tap
into higher level cognitive processes consistent with the intended problem-
solving and clinical decision-making process. As with any test develop-
ment program, some of the items in this study appeared to achieve this
goal better than others. In other words, as seen in Figure 1, some KFPs
were in fact rated at the factual level. On the other hand, some of the
MCQs were rated at higher cognitive levels. On balance, however, the ten-
dency was for MCQs to be rated at lower levels and KFPs to be rated at
higher levels, and these differences had substantive relations with item
response times and performance.
Through correlations between MCQ and KFP sections of tests (Fischer et
al., 2005; Hatala & Norman, 2002), comparisons of schools with different
degrees of problem-based learning programs (Schuwirth et al., 1999), and a
think-aloud test-taking session (Schuwirth et al., 2001), past research has
supported the notion that KFPs measure different, and higher order, cogni-
tive processes compared to more conventional MCQs. Because of the rel-
atively limited empirical research base in the published literature,
Haladyna (2004) suggested that additional types of validity evidence
should be sought to further evaluate KFPs. Following the principle of
methodological triangulation, the current study adds breadth to this
knowledge base supporting KFPs by introducing results based on SME
judgments of the thought processes that underlie KFPs, and the degree
to which those processes affect response times and ultimately item perfor-
mance in theoretically meaningful ways. Therefore, this study contributes
valuable new validity evidence in support of the notion that KFPs tap into
clinical decision-making skills as intended.
Additional research, replicating the methodologies used to date or
extending into other research strategies, is warranted to further expand the
research base on KFPs. For example, research into construct validity issues
is warranted to better understand the nature of the skills KFPs are attempt-
ing to measure. Schuwirth (2009) provided a thoughtful commentary on
issues in disentangling the complex notion of clinical reasoning. In addition,
there is a lack of extant research on criterion-related validity evidence
demonstrating whether KFPs help predict future performance on the
job. A substantial body of criterion-related validity evidence exists for
the conceptually similar situational judgment tests (Weekley & Ployhart,
2006; Whetzel & McDaniel, 2009), which might be tapped in search of
implications for KFPs. Research into more generic problem solving (e.g.,
Davidson & Sternberg, 2003) might also be explored in an attempt to better
understand psychological research into the processes involved in clinical
reasoning and correlates of KFP performance.
Certification and licensure organizations continue to administer “high
stakes” examinations to graduates of professional degree programs to determine
who is allowed to practice as a health care professional. In addressing
the issue of public protection, these examinations attempt to assess not only
factual knowledge but also the application of knowledge and clinical decision
making. In an effort to assess decision making at a higher cognitive level, the KFP
item format has been adopted; it offers an easily administered and objectively
scored assessment process that can draw on a variety of clinical cases and
cover a broad spectrum of content domains. Though this item format is not as
difficult to construct as other formats, such as the patient management prob-
lem format, it still requires more work and resources to produce than more
standard multiple-choice item types. Accordingly, it is appropriate to attempt
to verify that the effort is “paying off” in the context of assessing skills at a
higher cognitive level. The current study tends to support that assertion.
As measures of clinical decision making, KFPs play a key role in helping
to ensure that licensed and certified allied health practitioners have the
skills they need to make important decisions on the job and improve their
practice and quality of patient care. If evidence continues to favor the KFP
format in the allied health education and assessment field, it may be a for-
mat worth considering for examinations in other fields where decision-
making skills are relevant to the quality of work performance.
Acknowledgments
This research was carried out in conjunction with test development work at Comira.
The authors would like to thank the Commission on Dietetic Registration for
allowing us to obtain data from certified specialists in gerontological and oncology
nutrition.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research,
authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or
publication of this article.
References
Bordage, G., & Page, G. (1987). An alternate approach to PMPs: The key features
concept. In I. Hart & R. Harden (Eds.), Further developments in assessing
clinical competence (pp. 57–75). Montreal, Quebec, Canada: Can-Heal
Publications.
Davidson, J. E., & Sternberg, R. J. (2003). The psychology of problem solving. New
York, NY: Cambridge University Press.
Farmer, E. A., & Page, G. (2005). A practical guide to assessing clinical decision-
making skills using the key features approach. Medical Education, 39,
1188–1194.
Fischer, M. R., Kopp, V., Holzer, M., Ruderich, F., & Junger, J. (2005). A modified
electronic key feature examination for undergraduate medical students: Valida-
tion threats and opportunities. Medical Teacher, 27, 450–455.
Haladyna, T. M. (2004). Developing and validating multiple-choice test items
(3rd ed.). Mahwah, NJ: Lawrence Erlbaum.
Hatala, R., & Norman, G. R. (2002). Adapting the key features examination for a
clinical clerkship. Medical Education, 36, 160–165.
Kim, K. S., Amin, S. M., & Ng, A. C. (2007). Avoiding common errors in key
feature problems. Malaysian Family Physician, 2, 18–21.
Page, G., & Bordage, G. (1995). The Medical Council of Canada’s key features
project: A more valid written examination of clinical decision-making skills.
Academic Medicine, 70, 104–110.
Page, G., Bordage, G., & Allen, T. (1995). Developing key-feature problems and
examinations to assess clinical decision-making skills. Academic Medicine,
70, 194–201.
Schuwirth, L. (2009). Is assessment of clinical reasoning still the Holy Grail?
Medical Education, 43, 298–299.
Schuwirth, L. W. T., Verhoeven, B. H., Scherpbier, A. J. J. A., Mom, E. M. A.,
Cohen-Schotanus, J., van Rossum, H. J. M., & van der Vleuten, C. P. M.
(1999). An inter- and intra-university comparison with short case-based testing.
Advances in Health Sciences Education, 4, 233–244.
Schuwirth, L. W. T., Verheggen, M. M., van der Vleuten, C. P. M., Boshuizen, H. P.
A., & Dinant, G. F. (2001). Do short cases elicit different thinking processes than
factual knowledge questions do? Medical Education, 35, 348–356.
Weekley, J. A., & Ployhart, R. E. (2006). Situational judgment tests: Theory,
measurement, and application. In the Frontiers Series of the Society for Industrial
and Organizational Psychology. Mahwah, NJ: Lawrence Erlbaum Associates.
Whetzel, D. L., & McDaniel, M. A. (2009). Situational judgment tests: An overview
of current research. Human Resource Management Review, 19, 188–202.