
Measuring Clinical Decision Making: Do Key Features Problems Measure Higher Level Cognitive Processes?

Gregory M. Hurtz (1), Roberta N. Chinn (2), Grady C. Barnhill (3), and Norman R. Hertz (4)

Evaluation & the Health Professions, 35(4), 396-415. © The Author(s) 2012. DOI: 10.1177/0163278712446639. http://ehp.sagepub.com

Abstract

Reliable and objective assessment of clinical decision-making skills has been a long-standing goal in occupational testing in the allied health professions. With this goal in mind, the key features problem (KFP) format was developed, which elicits targeted decisions about key features of clinical scenarios. To build on a small body of empirical evidence evaluating their efficacy, this study evaluates whether KFPs successfully assess higher order cognitive processes. Analysis of objective data (item length and difficulty, item performance, and response times) and subjective data (expert ratings of cognitive complexity) supported the proposition that KFPs tend to be more cognitively complex than conventional multiple-choice questions. Not only were they rated as more complex, but this complexity accounted for some of the increase in time spent responding to these items. Results support the use of KFPs in standardized assessments for measuring higher order cognitive processes such as clinical decision making.

Keywords

key features, clinical decision making, standardized assessment, objective scoring, cognitive levels, multiple choice

Affiliations: (1) Department of Psychology, California State University, Sacramento, CA, USA; (2) PSI Services, LLC, Burbank, CA, USA; (3) Commission on Dietetic Registration, Chicago, IL, USA; (4) Progeny Systems Corporation, Manassas, VA, USA

Corresponding Author: Gregory M. Hurtz, California State University, Sacramento, 6000 J Street, Sacramento, CA 95819-6007, USA. Email: [email protected]

Decision making can be viewed as a stage in the problem-solving process, where the problem solver must recognize and understand the nature of the problem, generate possible solutions, evaluate the likely benefits, risks, and consequences of those alternative solutions, and make one or more decisions about the best course/courses of action. Clinical decision making involves problem solving in clinical cases, and clinical decision-making skills are important in the allied health field because patients are put at potential risk if health care personnel cannot apply their knowledge and training to real-world problems in an integrated and appropriate manner. As such, developers of licensure and certification examinations for allied health professions have expended considerable efforts and resources toward measuring clinical decision-making skills.

Many allied health and medical organizations have focused their efforts on "authentic" assessments of competencies with clinical scenarios. The National Board of Medical Examiners has conducted continual research with computer-based clinical scenarios to assess clinical reasoning, now known as the computer-based case simulation (CCS) section of the United States Medical Licensing Examination (USMLE, Step 3). Another organization, for veterinary medicine (the National Board Examination Committee for Veterinary Medicine), used a series of patient management problems (the clinical competency test) that required test takers to choose, for a clinical case scenario, among a variety of alternatives for action, some of which may be appropriate and others either not appropriate or even contraindicated. Other medical and some allied health organizations have attempted to assess clinical reasoning through the use of oral examinations. All of these complex item and scoring formats required considerable expenditures of staff and subject matter expert (SME) time, as well as considerable financial resources. The decision to make such an investment demonstrates the importance of assessing clinical decision making in a licensing/certification context.

Patient management problems and oral examinations have been largely abandoned in certification and licensure testing. In the case of the patient management problem format, there has been a great deal of divided psychometric opinion as to its reliability and validity. Many have expressed the opinion that the expense and time required to produce such items is not repaid in terms of measurement reliability or validity. Oral examinations were discontinued due to measurement issues, time constraints, and concerns with bias and candidate anonymity.

Allied health and medical organizations have continued their efforts to assess clinical decision-making skills, primarily through the use of the multiple-choice item format. Efforts are made to improve fidelity, sometimes through the use of clinical photographs, video recordings, or other means, but the multiple-choice question (MCQ) is still the format of choice. Sometimes a "standardized patient" format will utilize interactions with a patient/actor to assess patient communications or even diagnostic skills, but this is a complicated and expensive approach. The same can be said of the multistation, objective structured clinical examination (OSCE) format, generally used to assess various clinical skills. In short, the allied health certification and licensure testing landscape is still in need of a better measurement strategy to assess clinical decision making. What is needed is an item format that is easier and less expensive to develop and maintain, more straightforward to administer and score, and stable in its measurement attributes, all while assessing clinical decision making at a higher cognitive level.

The key features problem (KFP) is a promising alternative measurement strategy designed to improve upon the use of scenario-based assessments in this context (Bordage & Page, 1987; Page & Bordage, 1995; Page, Bordage, & Allen, 1995). KFPs present clinical scenarios and can elicit short-answer constructed responses or can be objectively scored from a menu of response options where test takers choose multiple responses and are scored against a correct response profile derived by experts. A small body of research evidence supports the proposition that KFPs measure decision making which results from higher order cognitive processes and problem-solving strategies (Schuwirth et al., 1999; Schuwirth, Verheggen, van der Vleuten, Boshuizen, & Dinant, 2001). Haladyna (2004) suggests that the KFP strategy is a "strong contender among other approaches to modeling higher level thinking that is sought in testing competence in every profession" (pp. 169-170) but goes on to say that more research is needed to evaluate the degree to which KFPs accomplish the "elusive goal" (p. 170) of assessing clinical problem solving. The current study addresses this call for research by further evaluating the proposition that KFPs assess higher order cognitive processes consistent with clinical decision making.

The idea underlying KFPs is the notion of a "key feature," or critical step in the resolution of a clinical problem (Bordage & Page, 1987; Farmer & Page, 2005). For example, the key feature of a solution may be a step where a practitioner is most likely to make an error, or a decision point where the next course of action could be harmful to the patient if the wrong decision is made. KFPs can address common errors made by practitioners as well as rare but critical conditions such as infrequent emergency procedures (Kim, Amin, & Ng, 2007). The KFP format presents a description of the clinical case scenario followed by questions that focus solely on the identified key features of the case. The questions typically provide several different response options from which multiple critical elements can be selected. Scoring can be dichotomous (i.e., credit for choosing all correct options and no incorrect options, versus no credit for partial, incorrect, or too many selections) or can allow partial credit to be assigned. In addition, "automatic zero" options may be included, for example in a situation where one particular decision kills the patient. Farmer and Page (2005) and Haladyna (2004) provide brief guides to the development of KFPs.
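These scoring rules lend themselves to a simple illustration. The sketch below is not the study's scoring code; the option labels, keys, threshold rule, and penalty rule are hypothetical, shown only to make the dichotomous, minimum-correct, partial-credit, and automatic-zero variants concrete.

```python
"""Illustrative (hypothetical) scoring schemes for a key features problem."""


def score_kfp(selected, correct, automatic_zero=frozenset(), min_correct=None):
    """Dichotomous/threshold scoring for one KFP response set.

    selected       -- options the test taker chose
    correct        -- options keyed as correct by SMEs
    automatic_zero -- options that forfeit all credit if chosen (e.g., a
                      decision that would seriously harm the patient)
    min_correct    -- minimum number of keyed options required for credit;
                      defaults to all of them (strict dichotomous scoring)
    """
    selected, correct = set(selected), set(correct)
    if selected & set(automatic_zero):
        return 0.0
    if min_correct is None:
        min_correct = len(correct)
    hits = len(selected & correct)
    # Full credit only if enough keyed options are chosen and no unkeyed
    # options are chosen; otherwise no credit.
    if hits >= min_correct and selected <= correct:
        return 1.0
    return 0.0


def score_kfp_partial(selected, correct):
    """Partial-credit variant: proportion of keyed options selected,
    penalized for each unkeyed selection (floored at zero)."""
    selected, correct = set(selected), set(correct)
    hits = len(selected & correct)
    misses = len(selected - correct)
    return max(0.0, (hits - misses) / len(correct))


if __name__ == "__main__":
    key = {"B", "D", "F"}                                  # hypothetical keyed options
    print(score_kfp({"B", "D", "F"}, key))                 # 1.0: all keyed, none extra
    print(score_kfp({"B", "D"}, key, min_correct=2))       # 1.0 under a 2-of-3 rule
    print(score_kfp({"B", "D", "G"}, key))                 # 0.0: unkeyed option chosen
    print(score_kfp({"B", "E"}, key, automatic_zero={"E"}))  # 0.0: automatic zero
    print(score_kfp_partial({"B", "D", "G"}, key))         # 0.33 partial credit
```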

KFPs are not intended to measure factual knowledge; they are intended to measure application and synthesis of knowledge and perhaps experience, coupled with reasoning and judgment, in the selection of appropriate courses of action for scenarios that are actually encountered on the job. Research has indicated that KFPs correlate only moderately with measures of factual knowledge (Fischer, Kopp, Holzer, Ruderich, & Junger, 2005; Hatala & Norman, 2002), and that the thought processes underlying experts' responses to KFPs are more elaborate and qualitatively different when compared to those underlying more factual-based multiple-choice items (Schuwirth et al., 2001). For instance, Schuwirth et al. (2001) found that experts used less information and responded more quickly to KFPs than did novices, whereas no such expert-novice differences emerged on the more factual MCQs. In addition, experts engaged in more "nonsequential" processing of information in KFPs, while novices accessed the information in a more sequential manner. These findings support the notion that KFPs elicit different reasoning and decision-making strategies than more factual items. In another study, Schuwirth et al. (1999) found that KFPs significantly differentiated students from a medical school that used an exclusively problem-based learning curriculum from a similar school that used primarily a lecture-based curriculum, further supporting the notion that KFP performance relies on problem-based clinical decision-making skills.

While the existing research on KFPs is promising, more evidence is needed to support the notion that these relatively elaborate items actually do go beyond more conventional, brief, single-answer MCQs and tap into the higher cognitive processes they purport to measure. Of the various objectively scored multiple-choice item formats available to test developers for designing their assessments (see Haladyna, 2004, for different options), the question is "Do KFPs stand apart from more conventional, brief single-answer MCQs in terms of the cognitive processes they elicit?" In other words, to what degree do KFPs, which to date have appeared superior to other attempts at measuring clinical decision making, actually accomplish the "elusive goal" of measuring higher level cognitive processes? The current study evaluated this question using a combination of SME ratings of the cognitive complexity of KFP and MCQ items and the response time and performance of actual test takers.

We proposed three research questions for this study. First, do experts view KFPs as more cognitively complex than MCQs? Answering this question provides new information on whether high-level cognitive processes are involved in answering KFPs to a greater extent than in more conventional MCQs. Second, does this complexity explain differences in response times between KFPs and MCQs, after controlling for differences in item length and difficulty? KFPs tend to be longer than MCQs, which adds reading time, and KFPs also tend to be more difficult than MCQs, which can affect response times. The question here is not only whether more time is spent on KFPs relative to MCQs over and above these factors, but also the degree to which this additional time spent can be accounted for by the KFPs eliciting higher level cognitive processes. Answering this question helps build the pattern of evidence regarding whether KFPs are tapping into clinical decision making.

Third, do these residual differences in response times (after controlling for item length and difficulty) have a different pattern of relationship to item performance for KFPs versus MCQs? To a certain degree, it can be argued that experts either know the answers or they do not, so that delayed responses may indicate that the test taker simply does not know the answer. However, if KFPs require higher level cognitive processes, then we would expect at least some of the extra time to translate into better item performance, whereas delayed responses on the simpler MCQs may be a stronger indication of a lack of knowledge. In other words, if KFPs involve higher level processes, then answering correctly should take a bit longer as test takers work through their decision processes.

Method

Setting

This study was conducted in the context of examination development for two dietetic specialty areas (gerontology and oncology). A total of six different forms were used. These examinations contain a combination of conventional, relatively brief, four-option MCQs and the more elaborate KFPs.

Item Development

The procedures for developing MCQs and KFPs were similar. Participants were provided a formal orientation in the principles of good item construction, opportunities to familiarize themselves with the content specifications, and opportunities to work with fellow participants to create the items. For each item, considerable emphasis was placed on specifying the linkage of item content to the content specifications and providing a citation from an authoritative reference source. Therefore, each item was linked to a specific task and knowledge from the test specifications and to a page or section of an authoritative reference source. There were numerous opportunities for individual assistance with item development as well as opportunities for review by other participants.

The primary distinction between the development process for MCQs and KFPs was the degree of focus on the key features of the problems for clinical decisions. MCQs were relatively brief, had four response options with one correct answer, and focused on either applied questions regarding facts, concepts, and procedures or scenario-based questions regarding commonly encountered situations that require analysis, interpretation, and decisions. KFPs differed from scenario-based multiple-choice items in the degree of complexity of the scenarios and the degree to which they emphasized the candidate's ability to triage among multiple correct answers and multiple distractors, many of which could be correct but were not the actions or interventions that addressed the most urgent needs of the patient in the scenario. The intent of the KFP was to assess the candidates' abilities to implement more than one step in their decision-making process, discriminate between relevant and irrelevant information presented in the scenario, and prioritize correct responses in terms of the urgency of the situation. Test takers were instructed on how many responses to choose when responding to each item (e.g., "Choose three"), and they were scored in an all-or-none fashion. SMEs determined the minimum number of correct options needed to receive a point for each item, and those selecting fewer than the minimum received no credit for that item.

Examination Administration

Examinations were delivered to certification candidates as computer-based tests at testing centers where proctors provided standardized instructions. The candidates were actual test takers who were completing the examinations for certification and not for the purposes of this study. For this study, the data were extracted from database archives without any identifying information associated with the test takers.

Cognitive Complexity Surveys

Multiple survey forms were constructed using subsets of the KFPs from each of the six operational examination forms used in this study, interspersed in a counterbalanced fashion with six corresponding subsets of MCQs drawn randomly from the same forms. The correct answers were not marked, in order to force raters through the thought process of answering the items. Raters were instructed to read each item, think about the correct answer/answers, and decide on the level of cognitive processing required to answer correctly. Ratings were given on a 4-point scale derived from the language of common cognitive level taxonomies: 1 = Factual, 2 = Application, 3 = Interpretation, and 4 = Synthesis. Table 1 provides the definitions, which were developed using the terminology of the focal professions to make them most meaningful to expert raters from these professions. The MCQs and KFPs were counterbalanced on the survey and were not labeled in any way for the raters as being one type of item or another.

Procedure and Participants

SMEs were contacted via e-mail and asked to serve as raters in a quality-control research study designed to assess the cognitive processes elicited by certification candidates when they take the examinations. All invited SMEs were certified and in good standing in their respective specialty areas. Most were drawn at random from a database of certificate holders, while SMEs who had previously served as item writers were also invited to participate. The surveys were put on a password-protected Internet site, and when SMEs logged in they were randomly assigned one of the alternate forms in their specialty area (gerontology vs. oncology). They were allowed 3 weeks to log in and complete their ratings, and on completion they were sent an honorarium of $25. Each participant was assured that survey ratings would be combined with those of other participants to determine group trends. Individual ratings were kept confidential and had no real or potential impact on any aspect of their employment or status as a registered dietitian. All participants and their data were treated in accordance with the American Psychological Association (APA) ethical guidelines.

Variables and Data Collected

Comparisons of KFPs and MCQs were first made in terms of SME ratings of cognitive complexity on the survey forms. In addition, test taker response times (measured in seconds) and item performance (correct/incorrect) were extracted from data archives for each of the items that were included in the SME surveys. In addition, each test taker's total examination score (i.e., their sum correct across all items, on which certification decisions were made in practice) was collected, and these were converted into within-form standard scores. These four variables were the primary focal variables of the study: cognitive complexity, response times, item performance (both binary correct/incorrect values and aggregated percentage-correct p values), and overall examination performance.

Table 1. Definitions of Cognitive Complexity Levels for the Rating Task

Level 1: FACTUAL, e.g., identifies signs, symptoms, and nutrition-related side effects; identifies standardized tools for assessing cancer and treatment-related side effects; determines energy, protein, fluid needs; determines nutrition support recommendations; provides interventions in accordance with treatment goals

Level 2: APPLICATION, e.g., recognizes treatment-related/nutrition-related causes of nutrition impact symptoms; develops nutrition care plans specific to phase of cancer care; develops nutrition care plans in accordance with intent/goal of cancer therapy; uses standardized tools for assessing cancer and treatment-related side effects

Level 3: INTERPRETATION, e.g., differentiates between treatment-related/nutrition-related causes of nutrition impact symptoms; integrates nutrition interventions for managing side effects related to multimodal treatment; initiates nutrition care plans in accordance with intent/goal of therapy; prioritizes nutrition care services; adapts nutrition care plan to maximize benefit of diet and medications

Level 4: SYNTHESIS, e.g., anticipates treatment-related/nutrition-related causes of nutrition impact symptoms; recommends modifications in supportive care therapies; anticipates fluctuations in relevant clinical data and its impact on treatment; monitors recovery from treatment and response to treatment; anticipates and recommends modifications in nutrition care plan

Note. The details shown in this example after the "e.g." are for the oncology examinations.

Because MCQs and KFPs typically differ in length, which may covary with the primary variables (especially response times), two separate length variables were coded to incorporate statistical controls. First, we coded item stem length as a simple count of words within the "prompt" or stem of the item. This accounted for each word the test taker would need to read from the start of the item text up to the point where they begin reading and deciding on the response options. Second, we coded the total word count of the block of response options.
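As a minimal sketch of this variable coding, the snippet below counts stem and option words and computes within-form standard scores of total examination performance. The data frames, column names, and values are invented for illustration and are not the study's data.

```python
import pandas as pd

# Hypothetical item text; in the study these were the actual stems and
# option blocks of the MCQs and KFPs.
items = pd.DataFrame({
    "item_id": [1, 2, 3],
    "stem_text": [
        "A 72-year-old patient presents with ...",
        "Which laboratory value ...",
        "A patient receiving chemotherapy reports ...",
    ],
    "options_text": [
        "Option A ... Option B ... Option C ... Option D ...",
        "A ... B ... C ... D ...",
        "A ... B ... C ... D ... E ... F ...",
    ],
})

# Simple word counts for the stem and for the block of response options.
items["stem_words"] = items["stem_text"].str.split().str.len()
items["options_words"] = items["options_text"].str.split().str.len()

# Within-form standard scores for each test taker's total examination score.
takers = pd.DataFrame({
    "form": ["A", "A", "A", "B", "B", "B"],
    "total_correct": [112, 98, 104, 105, 121, 110],
})
takers["z_total"] = (
    takers.groupby("form")["total_correct"]
          .transform(lambda s: (s - s.mean()) / s.std(ddof=1))
)

print(items[["item_id", "stem_words", "options_words"]])
print(takers)
```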

Data Analysis

Survey response rates were tallied, chi-square tests were used to evaluate whether response rates for the different forms varied across demographic variables, and rater reliability was evaluated using generalizability theory. Ratings of cognitive complexity were compared between item types using a graphical display and a two-way mixed analysis of variance (ANOVA) with item type as the repeated variable and survey form as the independent-group variable. Item stem and option length, item difficulties, average cognitive complexity ratings of items, and an item type dummy variable (MCQ = 0, KFP = 1) were then used as predictors of response time in a hierarchical multiple regression analysis. Finally, standardized examination scores, item length, item type, item response time, and an interaction term between item type and item response time were used as predictors of item performance (correct/incorrect) in a multiple logistic regression analysis.

Results

Response Rate, Respondent Characteristics, and Reliability of Ratings

Participation requests were sent to 183 SMEs, and 101 (55.2%) agreed to participate and provided their ratings of items within their specialties. All participants were college educated, with 46 (45.5%) possessing bachelor's degrees, 52 (51.5%) possessing master's degrees, and 3 (3.0%) possessing doctoral degrees. Ninety-nine (98.0%) were employed in dietetic practice at the time of the study and likewise 99 (98.0%) were actively providing oncology nutrition services, with 81 (80.2%) providing these services for more than 20 hours per week and 62 (61.0%) for more than 30 hours per week. All 101 were credentialed as registered dietitians with the Commission on Dietetic Registration for at least 3 years, while 64 (63.4%) had been for over 10 years.

Random assignment to groups resulted in 12-22 raters per form. Affirming the success of the random assignment, chi-square tests of independence indicated that rater group membership was unrelated (ps > .05) to all demographic variable categories summarized above, supporting the equivalence of the groups of raters along these relevant demographic characteristics. Rater-by-item generalizability theory analyses for each form resulted in generalizability coefficients of .76 to .95 across examination forms, and .89 when pooled across forms (for both relative and absolute error), indicating strong interrater reliability for the ratings.
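For readers who want to reproduce this kind of reliability analysis, the sketch below estimates variance components for a fully crossed items-by-raters design from the two-way ANOVA mean squares and returns generalizability (relative-error) and dependability (absolute-error) coefficients for the mean over raters. It is a generic G-theory calculation run on synthetic data, not the study's analysis code, and the column names are assumptions.

```python
import numpy as np
import pandas as pd


def g_coefficients(long_df, obj="item", facet="rater", score="rating"):
    """Relative (E-rho^2) and absolute (Phi) coefficients for the mean of
    n_r raters in a fully crossed item-by-rater design (one rating per cell)."""
    table = long_df.pivot(index=obj, columns=facet, values=score).to_numpy()
    n_i, n_r = table.shape
    grand = table.mean()
    ss_item = n_r * ((table.mean(axis=1) - grand) ** 2).sum()
    ss_rater = n_i * ((table.mean(axis=0) - grand) ** 2).sum()
    ss_resid = ((table - grand) ** 2).sum() - ss_item - ss_rater
    ms_item = ss_item / (n_i - 1)
    ms_rater = ss_rater / (n_r - 1)
    ms_resid = ss_resid / ((n_i - 1) * (n_r - 1))
    # Expected-mean-square solutions for the crossed i x r design.
    var_item = max((ms_item - ms_resid) / n_r, 0.0)
    var_rater = max((ms_rater - ms_resid) / n_i, 0.0)
    var_resid = ms_resid
    e_rho2 = var_item / (var_item + var_resid / n_r)              # relative error
    phi = var_item / (var_item + (var_rater + var_resid) / n_r)   # absolute error
    return e_rho2, phi


# Tiny synthetic demonstration: 20 items rated by 15 raters.
rng = np.random.default_rng(0)
true_level = rng.uniform(1, 4, 20)
long = pd.DataFrame(
    [{"item": i, "rater": r, "rating": true_level[i] + rng.normal(0, 0.5)}
     for i in range(20) for r in range(15)]
)
print(g_coefficients(long))
```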

Do SMEs View KFPs as Requiring More Complex Cognitive Processes than MCQs?

Figure 1 provides a simple tally of cognitive level ratings, split by item type for visual comparison. A clear pattern is indicated where MCQs were more often seen as factual and progressively less often seen as measuring the application, interpretation, and synthesis levels. KFPs were least often seen as factual and more often seen as measuring the application, interpretation, and synthesis levels. Because this figure tallies ratings across items that were provided by the same SMEs, data were aggregated to means across raters and item types (MCQs vs. KFPs), and the mean ratings were used as the unit of analysis for tests of statistical significance.

Figure 1. Frequency counts of cognitive level ratings by item type, summed across items and raters.

Figure 2 displays a plot of the average ratings for each item type across the six test forms. A two-way mixed ANOVA revealed a statistically significant and quite strong interaction between item type and test form on cognitive complexity ratings, F(5, 95) = 18.00, p = .000, partial η² = .49. The main effect of item type was also significant with a stronger effect size, F(1, 95) = 295.53, p = .000, partial η² = .76, and the main effect of forms was also significant and moderately strong, F(5, 95) = 5.35, p = .000, partial η² = .22. The main effect of item type, which was the strongest of the three effects, is of particular interest, with Figure 2 showing clearly that KFPs (M = 2.73; SD = .31) were rated higher, on average, than the MCQs (M = 1.93; SD = .44). While the significant interaction suggests this effect is moderated by form, simple effects with a Bonferroni-corrected α of .05/6 = .008 suggest that the difference is nevertheless significant for all six forms. As seen in Figure 2, the mean ratings for KFPs were always higher than those for MCQs, and the nature of the interaction is such that the effect was simply stronger for the gerontology test forms (η²s = .58 and .68) than for the oncology test forms (η²s = .08-.23).

Figure 2. Means of cognitive level ratings for MCQs and KFPs across test forms. MCQs = multiple-choice questions; KFPs = key features problems.
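A sketch of this kind of two-way mixed ANOVA appears below, using the pingouin package on synthetic rater-level means whose values only loosely mimic the reported pattern. The data, column names, and choice of pingouin are assumptions made for illustration; any split-plot ANOVA routine would serve the same purpose.

```python
import numpy as np
import pandas as pd
import pingouin as pg

# Synthetic rater-level means: 24 raters spread over 6 forms, each rating
# both item types (values loosely mimic the reported means, nothing more).
rng = np.random.default_rng(1)
rows = []
for rater in range(24):
    form = f"form_{rater % 6 + 1}"
    rows.append({"rater": rater, "form": form, "item_type": "MCQ",
                 "mean_rating": rng.normal(1.9, 0.4)})
    rows.append({"rater": rater, "form": form, "item_type": "KFP",
                 "mean_rating": rng.normal(2.7, 0.3)})
ratings = pd.DataFrame(rows)

# Two-way mixed ANOVA: item type is the repeated factor, form the
# between-groups factor; np2 is the partial eta-squared effect size column.
aov = pg.mixed_anova(data=ratings, dv="mean_rating", within="item_type",
                     subject="rater", between="form")
print(aov[["Source", "F", "p-unc", "np2"]])

# Bonferroni-corrected simple effects of item type within each form
# (alpha = .05 / 6): one paired t test per form.
for form, sub in ratings.groupby("form"):
    wide = sub.pivot(index="rater", columns="item_type", values="mean_rating")
    print(form, pg.ttest(wide["KFP"], wide["MCQ"], paired=True)["p-val"].round(4))
```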

Does Cognitive Complexity Help to Explain Differences Between KFPs and MCQs in Test Taker Response Times?

The mean response times computed from actual test takers for each of the 166 items (Ns = 24 to 79 per item, depending on the test form) were then regressed onto the dummy variable representing item type (0 = MCQ, 1 = KFP) as well as the mean cognitive complexity rating across SMEs, after controlling for response time differences due to variability in item length and difficulty. Table 2 provides the descriptive statistics for each variable in this analysis, broken down by MCQ versus KFP. For the regression analysis, item difficulties were converted into logits and then multiplied by −1 so that higher numbers represent more difficulty. All control variables were mean centered before entering them into the regression analysis to facilitate interpretation of the multiple regression results.
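The variable preparation and hierarchical regression described above can be sketched as follows. The item-level data here are synthetic and the column names are invented; the point is only to show the negative-logit difficulty transformation, mean centering of the controls, and the comparison of nested OLS models on R².

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic item-level data standing in for the 166 items; names and values
# are invented for illustration only.
rng = np.random.default_rng(2)
n = 166
items = pd.DataFrame({
    "stem_words": rng.integers(8, 200, n),
    "options_words": rng.integers(4, 121, n),
    "p_value": rng.uniform(0.05, 0.98, n),
    "is_kfp": rng.integers(0, 2, n),
    "mean_cog_level": rng.uniform(0, 3, n),   # recoded so the factual level = 0
})
items["mean_rt"] = (60 + 0.5 * items["stem_words"] + 0.8 * items["options_words"]
                    + 10 * items["mean_cog_level"] + rng.normal(0, 20, n))

# Difficulty in logits, multiplied by -1 so larger values mean harder items;
# clipping keeps p values of 0 or 1 finite.
p = items["p_value"].clip(0.005, 0.995)
items["difficulty"] = -np.log(p / (1 - p))

# Mean-center the controls so the intercept is the expected response time
# for an item of average length and difficulty.
for col in ["stem_words", "options_words", "difficulty"]:
    items[col + "_c"] = items[col] - items[col].mean()

models = [
    "mean_rt ~ stem_words_c + options_words_c",                          # Model 1
    "mean_rt ~ stem_words_c + options_words_c + difficulty_c",           # Model 2
    "mean_rt ~ stem_words_c + options_words_c + difficulty_c + is_kfp",  # Model 3
    "mean_rt ~ stem_words_c + options_words_c + difficulty_c + is_kfp"
    " + mean_cog_level",                                                 # Model 4
]
fits = [smf.ols(f, data=items).fit() for f in models]
prev_r2 = 0.0
for i, fit in enumerate(fits, start=1):
    print(f"Model {i}: R2 = {fit.rsquared:.3f} (increment {fit.rsquared - prev_r2:.3f})")
    prev_r2 = fit.rsquared
```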

Table 3 presents the results. Model 1 accounted for the effects of item length and included the item stem word count and options word count variables, which both significantly increased response times and together explained 57% of the variance (R² = .57, p = .000). Model 2 added item difficulty, which was found to account for further increases in response times and produced an increment of 13% to the explained variance. The total variance explained by these control variables was approximately 71%, and the β weights indicated that differences in stem word counts contributed the most to differences in overall response times. Due to mean centering, the raw equation for Model 2 reveals that for items of average length and difficulty test takers spent approximately 85.26 seconds responding; all else equal, each additional word in the stem adds approximately 0.45 seconds, each additional word in the response options adds approximately 0.86 seconds, and each 1-logit increase in item difficulty adds approximately 11.23 seconds of response time.

In Model 3, the item type dummy variable (0 = MCQ, 1 = KFP) was entered and found to explain an additional 1% of variance in response time, which was statistically significant, ΔR² = .01, p = .045. The unstandardized weight shows that of the 68.88-second raw difference in response times between MCQs and KFPs (derived from Table 2), 15.46 seconds still remain after accounting for differences in item length and difficulty. However, when mean cognitive level ratings were added in Model 4, they were found to be significant as well (ΔR² = .01, p = .006) and rendered the coefficient for item type nonsignificant (p = .122). This pattern, where item type was significant in the model excluding cognitive levels but became nonsignificant when cognitive levels were added, is consistent with a mediation effect where differences in response times between MCQs and KFPs are largely explained by differences in the cognitive levels they measure.

Table 2. Descriptive Statistics for Control Variables (Item Length and Difficulty), Mean Cognitive Level Ratings, and Response Times

Variable                            Item type    M        SD      Min     Max
Stem length (word count)            MCQ          34.37    23.48   8       105
                                    KFP          117.78   36.87   44      200
Options length (word count)         MCQ          21.77    19.09   4       88
                                    KFP          35.13    21.61   6       121
Item difficulty (p value)           MCQ          0.75     0.22    0.19    1.00
                                    KFP          0.54     0.28    0.00    1.00
Mean cognitive level ratings        MCQ          1.93     .62     1.00    3.73
                                    KFP          2.74     .43     1.54    3.58
Response times (in seconds)         MCQ          51.70    26.55   14.54   175.93
                                    KFP          120.58   45.36   53.85   217.46

Note. MCQs = multiple-choice questions; KFPs = key features problems.


Does Time Spent Responding Relate Differently to Item Performance for KFPs versus MCQs?

Table 4 presents the results of a logistic regression analysis focusing on item performance (incorrect = 0 and correct = 1) as the dependent variable. Each test taker's standardized total score on the examination form was entered as a control variable in Model 1 to partial out differences in item performance due to levels of proficiency. Item length (stem and option word count) and item type were also entered in Model 1 to partial out the differences in item difficulty between MCQs and KFPs. Model 1 was found statistically significant, χ²(4) = 189.42, p = .000, with a Nagelkerke R² of .07. Model 2, which added item response time, was significantly stronger than Model 1, Δχ²(1) = 131.31, p = .000, and incremented the Nagelkerke R² by .04 up to a total of .11. The nature of the effect was such that after accounting for differences in test taker abilities and item lengths and difficulties, more time spent on the items was associated with a lower likelihood of answering correctly.

Table 3. Results of Multiple Regression Analysis Predicting Mean Response Time per Item From the Item's Cognitive Complexity, Controlling for Item Length and Difficulty

Model   Variables entered                 b       β      t       p      sr²
1 (a)   (Constant)                        85.10          32.84   .000
        Stem Word Count                   .55     .57    10.75   .000   .31
        Options Word Count                .92     .39    7.40    .000   .15
2 (b)   (Constant)                        85.26          39.57   .000
        Stem Word Count                   .45     .47    10.24   .000   .19
        Options Word Count                .86     .37    8.30    .000   .13
        Item Difficulty (−1 * logit)      11.23   .38    8.49    .000   .13
3 (c)   (Constant)                        77.69          18.03   .000
        Stem Word Count                   .34     .36    4.92    .000   .04
        Options Word Count                .81     .34    7.63    .000   .11
        Item Difficulty (−1 * logit)      10.59   .36    7.86    .000   .11
        Item Type (0 = MCQ, 1 = KFP)      15.46   .15    2.02    .045   .01
4 (d)   (Constant)                        64.02          9.95    .000
        Stem Word Count                   .28     .29    3.92    .000   .03
        Options Word Count                .78     .33    7.46    .000   .10
        Item Difficulty (−1 * logit)      10.48   .36    7.94    .000   .11
        Item Type (0 = MCQ, 1 = KFP)      11.81   .12    1.56    .122   .00
        Mean Cognitive Level Rating       11.67   .16    2.81    .006   .01

Notes. N = 166 items, which was the unit of analysis. To facilitate interpretation of the constant and b weights, the Stem Word Count, Options Word Count, and Item Difficulty variables were mean centered prior to the regression analysis, while the Mean Cognitive Level Rating variable was recoded such that the factual level equals 0. MCQs = multiple-choice questions; KFPs = key features problems.
(a) Model 1 R² = .573 (adjusted R² = .568), F(2, 159) = 106.75, p = .000. (b) Model 2 R² = .707 (adjusted R² = .701), ΔR² = .13, F(1, 158) = 72.01, p = .000. (c) Model 3 R² = .714 (adjusted R² = .707), ΔR² = .01, F(1, 157) = 4.09, p = .045. (d) Model 4 R² = .728 (adjusted R² = .719), ΔR² = .01, F(1, 156) = 7.916, p = .006.

In Model 3, however, time spent was found to significantly interact with item type, Δχ²(1) = 20.81, p = .000, and incremented the Nagelkerke R² by .01 up to a total of .12. The nature of the interaction effect is plotted in Figure 3, which shows the predicted probabilities of success for MCQs and KFPs as a function of the number of seconds spent on the item. Note that the control variables (test taker proficiency, item stem word count, and options word count) were held constant at their average values when plotting these probabilities, and the time elapsed across the horizontal axis runs approximately through the 99th percentile (351 seconds or 5.85 minutes) of time spent in the observed data. Figure 3 shows that the rate of decline in predicted probabilities of success over time for KFPs was not as steep as it was for MCQs.

Figure 3. Interaction plot of the effects of time spent on item performance for MCQs versus KFPs. MCQs = multiple-choice questions; KFPs = key features problems.

Table 4. Results of Logistic Regression Analysis Predicting Item Performance From Time Spent on KFPs Versus MCQs, Controlling for Test Taker Proficiency Levels

Model   Variables entered                 b       exp(b)  Wald     p
1 (a)   (Constant)                        1.06    2.89    194.97   .000
        Test-Taker Proficiency            0.06    1.06    2.31     .129
        Stem Word Count                   0.00    1.00    .72      .395
        Options Word Count                0.00    1.00    .05      .820
        Item Type (0 = MCQ, 1 = KFP)      −1.04   0.36    62.24    .000
2 (b)   (Constant)                        1.21    3.35    236.66   .000
        Test-Taker Proficiency            0.05    1.05    1.70     .193
        Stem Word Count                   0.00    1.00    6.95     .008
        Options Word Count                0.01    1.01    9.77     .002
        Item Type (0 = MCQ, 1 = KFP)      −0.86   0.42    41.28    .000
        Time Spent (in seconds)           −0.01   0.99    114.76   .000
3 (c)   (Constant)                        1.44    4.23    223.84   .000
        Test-Taker Proficiency            0.04    1.04    1.24     .266
        Stem Word Count                   0.00    1.00    10.39    .001
        Options Word Count                0.01    1.01    9.46     .002
        Item Type (0 = MCQ, 1 = KFP)      −1.37   0.26    61.71    .000
        Time Spent (in seconds)           −0.01   0.99    80.64    .000
        Item Type × Time Spent            0.01    1.01    19.81    .000

Note. MCQ = multiple-choice question; KFP = key features problem.
(a) Model 1 Nagelkerke R² = .07, χ²(4) = 189.42, p = .000. (b) Model 2 Nagelkerke R² = .11, Δχ²(1) = 131.31, p = .000. (c) Model 3 Nagelkerke R² = .12, Δχ²(1) = 20.81, p = .000.
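A sketch of this kind of interaction model is shown below on synthetic person-by-item records (all names and coefficients are illustrative, not the study's data). It fits a logistic regression with an item type by time-spent interaction and then traces predicted probabilities over a grid of response times while holding the control variables at their means, mirroring how the Figure 3 probabilities are described as being computed.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic person-by-item records for illustration only.
rng = np.random.default_rng(3)
n = 5000
d = pd.DataFrame({
    "z_total": rng.normal(0, 1, n),
    "stem_words": rng.integers(8, 200, n),
    "options_words": rng.integers(4, 121, n),
    "is_kfp": rng.integers(0, 2, n),
    "time": rng.uniform(10, 350, n),
})
# Built-in interaction: the time penalty is weaker for KFPs than for MCQs.
logit_p = (1.2 - 0.9 * d["is_kfp"] - 0.012 * d["time"]
           + 0.008 * d["is_kfp"] * d["time"])
d["correct"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

fit = smf.logit(
    "correct ~ z_total + stem_words + options_words + is_kfp + time"
    " + is_kfp:time", data=d).fit(disp=0)
print(fit.summary())

# Predicted probability of a correct response over a time grid,
# controls held at their means, separately for MCQs and KFPs.
grid = pd.DataFrame({"time": np.tile(np.arange(0, 351, 10), 2)})
grid["is_kfp"] = np.repeat([0, 1], len(grid) // 2)
for col in ["z_total", "stem_words", "options_words"]:
    grid[col] = d[col].mean()
grid["p_correct"] = fit.predict(grid)
print(grid.pivot(index="time", columns="is_kfp", values="p_correct").round(3))
```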

Discussion

The primary purpose of this study was to evaluate the degree to which KFPs tend to measure cognitive processes that are more complex than conventional MCQs, in order to expand the existing body of evidence (e.g., Schuwirth et al., 1999, 2001) regarding their validity for assessing clinical decision making. SME ratings of the cognitive complexity of KFP and MCQ items were found to be reliable and, addressing our first research question, consistently revealed that SMEs believed the KFPs measured higher cognitive levels than the MCQs. Thus, in the judgment of SMEs, the KFPs used in this study tended to achieve the goal of assessing relatively more complex cognitive processes than MCQs in the decision-making process required for selecting the correct answers to the items.

With respect to our second research question, consistent with the evidence that KFPs require more complex cognitive processes, our multiple regression analysis focusing on response times revealed that even after controlling for response time differences that were explained by the fact that the KFPs tended to be longer and more difficult than MCQs, test takers still spent more time responding to the KFPs. On average, test takers spent approximately 69 more seconds on KFPs than MCQs, and our results suggested that approximately 15 of these seconds could not be accounted for by differences in item length and difficulty alone. Moreover, when cognitive level ratings were added to the regression model, the differences in response times between KFPs and MCQs were reduced to nonsignificance. This finding is consistent with a mediation effect where the differences in response times between item types appear to be largely attributed to the higher level processes elicited by the KFPs. Thus, in response to our second research question, it appears that cognitive complexity does in fact help to explain the longer response times on KFPs relative to MCQs. In other words, it is not only reading time and item difficulty that produce longer response times on KFPs but also more complex problem-solving and decision-making processes. This again supports the validity of KFPs as measures of higher level cognitive processes.

While cognitive complexity helps explain some of the extra time test takers spend responding to KFPs, the third research question went a little further and asked whether the extra time actually translates into greater gains in performance on KFPs versus MCQs. If so, this is again consistent with the proposition that the extra time tends to be spent engaging in a more complex decision-making process required to successfully answer the item. When evaluating this question, the logistic regression results suggested that, after taking into account differences in test taker proficiency and differences in item length and difficulties, the more time a test taker spends on an item the less likely they are to answer correctly; however, the interaction model suggested that this pattern is more true of MCQs than KFPs. This is consistent with the SME ratings, which identified many MCQs as factual in nature, where a short factual question is likely to result in a quick response for test takers who know the answer. On the other hand, SMEs felt that more of the KFPs required complex cognitive processes, and the regression results described earlier suggested that some of the extra time spent on KFPs was likely due to those complex processes. This may explain why quickness of responding did not relate as strongly to performance for KFPs as it did for MCQs. In other words, to a certain degree at least, it appears that it legitimately takes more time on KFPs to work through a more complex problem-solving process and provide a judgment and decision regarding the correct answer/answers.

To summarize, the findings of this study provide evidence that KFPs tap into higher level cognitive processes consistent with the intended problem-solving and clinical decision-making process. As with any test development program, some of the items in this study appeared to achieve this goal better than others. In other words, as seen in Figure 1, some KFPs were in fact rated at the factual level. On the other hand, some of the MCQs were rated at higher cognitive levels. On balance, however, the tendency was for MCQs to be rated at lower levels and KFPs to be rated at higher levels, and these differences had substantive relations with item response times and performance.

Through correlations between MCQ and KFP sections of tests (Fischer et al., 2005; Hatala & Norman, 2002), comparisons of schools with different degrees of problem-based learning programs (Schuwirth et al., 1999), and a think-aloud test-taking session (Schuwirth et al., 2001), past research has supported the notion that KFPs measure different, and higher order, cognitive processes compared to more conventional MCQs. Because of the relatively limited empirical research base in the published literature, Haladyna (2004) suggested that additional types of validity evidence should be sought to further evaluate KFPs. Following the principle of methodological triangulation, the current study adds breadth to this knowledge base supporting KFPs by introducing results based on SME judgments of the thought processes that underlie KFPs, and the degree to which those processes affect response times and ultimately item performance in theoretically meaningful ways. Therefore, this study contributes valuable new validity evidence in support of the notion that KFPs tap into clinical decision-making skills as intended.

Additional research, replicating the methodologies used to date or extending into other research strategies, is warranted to further expand the research base on KFPs. For example, research into construct validity issues is warranted to better understand the nature of the skills KFPs are attempting to measure. Schuwirth (2009) provided a thoughtful commentary on issues in disentangling the complex notion of clinical reasoning. In addition, there is a lack of extant research on criterion-related validity evidence demonstrating whether KFPs help predict future performance on the job. A substantial body of criterion-related validity evidence exists for the conceptually similar situational judgment tests (Weekley & Ployhart, 2006; Whetzel & McDaniel, 2009), which might be tapped in search of implications for KFPs. Research into more generic problem solving (e.g., Davidson & Sternberg, 2003) might also be explored in an attempt to better understand the psychological processes involved in clinical reasoning and the correlates of KFP performance.

Certification and licensure organizations continue to administer "high stakes" examinations to graduates of professional degree programs to determine who is allowed to practice as a health care professional. In addressing the issue of public protection, these examinations attempt to assess not only factual knowledge but also application of knowledge and clinical decision making. In an effort to assess decision making at a higher cognitive level, the KFP item format has been utilized; it offers an easily administered and objectively scored assessment process that can utilize a variety of clinical cases and cover a broad spectrum of content domains. Though this item format is not as difficult to construct as other formats, such as the patient management problem format, it still requires more work and resources to produce than more standard multiple-choice item types. Accordingly, it is appropriate to attempt to verify that the effort is "paying off" in the context of assessing skills at a higher cognitive level. The current study tends to support that assertion.

As measures of clinical decision making, KFPs play a key role in helping to ensure that licensed and certified allied health practitioners have the skills they need to make important decisions on the job and improve their practice and quality of patient care. If evidence continues to favor the KFP format in the allied health education and assessment field, it may be a format worth considering for examinations in other fields where decision-making skills are relevant to the quality of work performance.

Acknowledgments

This research was carried out in conjunction with test development work at Comira. The authors would like to thank the Commission on Dietetic Registration for allowing us to obtain data from certified specialists in gerontological and oncology nutrition.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.


References

Bordage, G., & Page, G. (1987). An alternate approach to PMPs: The key features concept. In I. Hart & R. Harden (Eds.), Further developments in assessing clinical competence (pp. 57-75). Montreal, Quebec, Canada: Can-Heal Publications.

Davidson, J. E., & Sternberg, R. J. (2003). The psychology of problem solving. New York, NY: Cambridge University Press.

Farmer, E. A., & Page, G. (2005). A practical guide to assessing clinical decision-making skills using the key features approach. Medical Education, 39, 1188-1194.

Fischer, M. R., Kopp, V., Holzer, M., Ruderich, F., & Junger, J. (2005). A modified electronic key feature examination for undergraduate medical students: Validation threats and opportunities. Medical Teacher, 27, 450-455.

Haladyna, T. M. (2004). Developing and validating multiple-choice test items (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.

Hatala, R., & Norman, G. R. (2002). Adapting the key features examination for a clinical clerkship. Medical Education, 36, 160-165.

Kim, K. S., Amin, S. M., & Ng, A. C. (2007). Avoiding common errors in key feature problems. Malaysian Family Physician, 2, 18-21.

Page, G., & Bordage, G. (1995). The Medical Council of Canada's key features project: A more valid written examination of clinical decision-making skills. Academic Medicine, 70, 104-110.

Page, G., Bordage, G., & Allen, T. (1995). Developing key-feature problems and examinations to assess clinical decision-making skills. Academic Medicine, 70, 194-201.

Schuwirth, L. (2009). Is assessment of clinical reasoning still the Holy Grail? Medical Education, 43, 298-299.

Schuwirth, L. W. T., Verhoeven, B. H., Scherpbier, A. J. J. A., Mom, E. M. A., Cohen-Schotanus, J., van Rossum, H. J. M., & van der Vleuten, C. P. M. (1999). An inter- and intra-university comparison with short case-based testing. Advances in Health Sciences Education, 4, 233-244.

Schuwirth, L. W. T., Verheggen, M. M., van der Vleuten, C. P. M., Boshuizen, H. P. A., & Dinant, G. F. (2001). Do short cases elicit different thinking processes than factual knowledge questions do? Medical Education, 35, 348-356.

Weekley, J. A., & Ployhart, R. E. (2006). Situational judgment tests: Theory, measurement, and application (Frontiers Series of the Society for Industrial and Organizational Psychology). Mahwah, NJ: Lawrence Erlbaum Associates.

Whetzel, D. L., & McDaniel, M. A. (2009). Situational judgment tests: An overview of current research. Human Resource Management Review, 19, 188-202.
