This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution and sharing with colleagues. Other uses, including reproduction and distribution, or selling or licensing copies, or posting to personal, institutional or third party websites are prohibited. In most cases authors are permitted to post their version of the article (e.g. in Word or Tex form) to their personal website or institutional repository. Authors requiring further information regarding Elsevier’s archiving and manuscript policies are encouraged to visit: http://www.elsevier.com/authorsrights


Author's personal copy

The methodological quality of diagnostic test accuracy studies for musculoskeletal conditions can be improved

Nicholas Henschke a,b,*, Julia Keuerleber a, Manuela Ferreira a, Christopher G. Maher a, Arianne P. Verhagen c

a Musculoskeletal Division, The George Institute for Global Health, Level 13, 321 Kent Street, Sydney NSW 2000, Australia
b Institute of Public Health, University of Heidelberg, Im Neuenheimer Feld 324, 69120 Heidelberg, Germany
c Department of General Practice, Erasmus MC, Dr. Molewaterplein 50, 3015 GJ Rotterdam, The Netherlands

Accepted 18 November 2013

Abstract

Objectives: To provide an overview of reporting and methodological quality in diagnostic test accuracy (DTA) studies in the musculoskeletal field and evaluate the use of the QUality Assessment of Diagnostic Accuracy Studies (QUADAS) checklist.

Study Design and Setting: A literature review identified all systematic reviews that evaluated the accuracy of clinical tests to diagnose musculoskeletal conditions and used the QUADAS checklist. Two authors screened all identified reviews and extracted data on the target condition, index tests, reference standard, included studies, and QUADAS items. A descriptive analysis of the QUADAS checklist was performed, along with Rasch analysis to examine the construct validity and internal reliability.

Results: A total of 19 systematic reviews were included, which provided data on individual items of the QUADAS checklist for 392 DTA studies. In the musculoskeletal field, uninterpretable or intermediate test results are commonly not reported, with 175 (45%) studies scoring "no" to this item. The proportion of studies fulfilling certain items varied from 22% (item 11) to 91% (item 3). The interrater reliability of the QUADAS checklist was good and Rasch analysis showed excellent construct validity and internal consistency.

Conclusion: This overview identified areas where the reporting and performance of diagnostic studies within the musculoskeletal field can be improved. © 2014 Elsevier Inc. All rights reserved.

Keywords: Musculoskeletal; Diagnosis; QUADAS; Study quality; Rasch analysis; Systematic review

1. Introduction

Making an accurate diagnosis is essential to guide management and determine prognosis [1]. A common diagnostic research study is the test accuracy study, which generates a comparison of measurements made by an index test against those of an accepted "gold standard" or reference test [2]. Within the musculoskeletal field, arriving at a specific diagnosis is seen to be particularly challenging [3]. For various conditions, this difficulty has been attributed to a lack of adequate reference tests [4], poor reliability of index tests [5], or methodological weaknesses in test accuracy studies [6]. An increasing amount of published literature on the accuracy of index tests for musculoskeletal conditions is now available and systematic reviews of diagnostic test accuracy (DTA) have become more common. These reviews draw together the available evidence on test accuracy in light of potential methodological biases of the primary studies to provide clinicians with recommendations on diagnosis of specific conditions [7].

When incorporated into systematic reviews, assessment of methodological quality is necessary to allow potential biases and sources of variation that might lead to heterogeneity to be identified [8]. The QUality Assessment of Diagnostic Accuracy Studies (QUADAS) initiative provides a tool to assess the quality of test accuracy studies [9]. The tool combines features derived from empirical evidence and expert opinion into a checklist of 14 items assessing risk of bias, applicability, and reporting quality. The QUADAS checklist is designed to be tailored to specific review questions by authors of DTA reviews, allowing flexibility across numerous fields of research and study designs [9]. Since its development, the QUADAS checklist has been used in more than 200 DTA reviews [10]. The frequent use of this checklist provides an opportunity to evaluate the typical biases within the musculoskeletal field and provide recommendations on how to avoid these when designing test accuracy studies.

Conflict of interest: The authors declare no conflict of interest.

* Corresponding author. Tel.: +49-6221-56-5215; fax: +49-6221-56-5948.

E-mail address: [email protected] (N. Henschke).

0895-4356/$ - see front matter © 2014 Elsevier Inc. All rights reserved.

http://dx.doi.org/10.1016/j.jclinepi.2013.11.008

Journal of Clinical Epidemiology 67 (2014) 416–424


What is new?

• Less than half of all primary diagnostic studies in the musculoskeletal field fulfilled item 11 (blinding of reference standard), item 4 (time between tests), item 13 (intermediate test results), or item 14 (study withdrawals) of the QUality Assessment of Diagnostic Accuracy Studies (QUADAS) checklist.

• Rasch analysis showed that the QUADAS checklist has excellent construct validity and internal consistency.

• The reporting and performance of diagnostic test accuracy (DTA) studies in the musculoskeletal field can be improved, and authors should be encouraged to follow recommended guidelines to do so.

• These findings provide empirical evidence supporting the use of a summary score for the QUADAS checklist.

• The development of a summary score for the reporting and methodological quality of DTA studies should be considered and may be a useful feature for comparability across studies within the same field and to facilitate interpretation.

The evaluation of quality checklists such as QUADAS is a continuous process as new methodological developments arise. Recently, in light of feedback from review authors and methodological research, the QUADAS checklist has been updated to QUADAS-2, which offers additional features [10]. Nevertheless, the available information in the original QUADAS tool provides a unique opportunity to evaluate the checklist and describe methodological quality in specific fields. Previous studies have evaluated both the inter- and intrarater reliability [11] of QUADAS and found a strong relationship between fulfillment of the items and test accuracy results [12,13]. Although the checklist appears to be reliable and useful, the use of an overall quality score derived from the QUADAS checklist has been discouraged [14]. However, several DTA reviews have calculated and used an overall quality score as this presumably eases the process of drawing conclusions. One method to evaluate the overall utility of QUADAS that has not yet been applied is Rasch analysis. An advantage of Rasch analysis is that it offers a sophisticated method for assessing whether an instrument measures a single construct, in this case, the methodological quality of DTA studies [15]. Accordingly, Rasch analysis will provide an empirical basis for judging whether or not it is meaningful to sum item scores to create a summary score.

The present study aims to: (1) describe the application of QUADAS in systematic DTA reviews for common musculoskeletal conditions, (2) identify the proportion of test accuracy studies that fulfill each item of the QUADAS, and (3) evaluate the construct validity and internal reliability of the QUADAS checklist using Rasch analysis.

2. Methods

2.1. Data sources

A comprehensive search of the literature was performed to identify all published systematic reviews that evaluated diagnostic accuracy of clinical tests to diagnose common musculoskeletal conditions. As this study aimed to describe the use of the QUADAS checklist in DTA reviews, an electronic forward citation search was performed using the Web of Science database to identify all studies citing the original publication of the QUADAS checklist by Whiting et al. [9]. This was supplemented with a search of MEDLINE and the Cochrane Database of Systematic Reviews to identify more recent reviews that used the QUADAS checklist. Finally, a search was performed on the MEDION database of diagnostic studies and diagnostic systematic reviews (www.mediondatabase.nl).

2.2. Study selection

Two authors (N.H. and J.K.) independently screened the titles and abstracts of all reviews identified by the search to exclude those that were clearly outside the scope of the study. To determine eligibility for the analysis, reviews were included if they satisfied the following criteria: (1) were systematic DTA reviews for musculoskeletal conditions; (2) reported on the accuracy of diagnostic imaging or index tests from the clinical history and physical examination; (3) used the QUADAS checklist to evaluate quality of the original test accuracy studies; and (4) provided data on each item of the QUADAS fulfilled by the original studies. Where data were not reported in the systematic reviews, attempts were made to contact the review authors for further information.

2.3. Data extraction and analysis

Two authors (N.H. and J.K.) independently extracted the following data from each eligible review: target condition, index tests evaluated, reference standard, number of included studies, and individual ratings for each item (yes/no/unclear) of the QUADAS checklist for each included study. All disagreements were resolved via discussion and consensus.

The QUADAS checklist from each review was tabulated; and for each item, the proportion of included studies in each category (yes, no, or unclear) was determined. From each included review, descriptions of how each item was scored and modifications made to the original checklist were recorded. Where primary test accuracy studies were assessed by more than one review, the scores for each item were extracted to a separate checklist for assessment of interrater reliability by calculating the kappa statistic. For these studies, a summary score (ie, number of items scored as "yes") was calculated from each review and the agreement measured using the intraclass correlation coefficient (ICC). Where there was disagreement in the assessment of items by the two reviews, the authors (N.H. and J.K.) referred to the primary studies and made a decision on these items for data extraction.
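The kappa statistic described above can be illustrated with a short sketch. This is a minimal, self-contained implementation of unweighted Cohen's kappa; the item ratings shown are hypothetical, not data from the included reviews:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters' categorical ratings."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    # Observed proportion of agreement
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each rater's marginal category frequencies
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical yes/no/unclear ratings of one QUADAS item across 10 studies
review_1 = ["yes", "yes", "no", "unclear", "yes", "no", "yes", "yes", "unclear", "no"]
review_2 = ["yes", "no", "no", "unclear", "yes", "no", "yes", "unclear", "unclear", "no"]
print(round(cohens_kappa(review_1, review_2), 2))
```

For the summary scores themselves, the study uses the ICC rather than kappa, because a summary score is a (near-)continuous quantity rather than a category.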

2.4. Rasch analysis

Further analysis included examination of construct validity and internal reliability of the QUADAS using Rasch modeling with the Winsteps 3.68.2 program (Chicago, IL) [15]. Rasch modeling transforms raw scores into scaled scores and builds a hypothetical model that places item difficulty and (in this case) study quality along the same hierarchy. This process assumes that measurement of the construct is made up of items more frequently fulfilled, which will be at the bottom of the hierarchy, and items less frequently fulfilled, which will be at the top of the hierarchy.

If the bulk of items and studies are evenly distributed in relation to one another, it means that the construct is well targeted for the sample (ie, test accuracy studies) [15].

Construct validity is a measure of how well a checklist measures one overall domain, in this case, methodological quality of test accuracy studies. To assess construct validity, Winsteps generates two sets of fit statistics, termed infit and outfit, to indicate how well the observed data from each item and study fit along the hierarchy. Fit statistics are expressed in mean square (MnSq) and standardized values. The MnSq expected value is 1.0; values greater than 0.5 and less than 1.5 are acceptable. The expected standardized value is 0 and values between −2 and +2 are acceptable. The items outside these acceptable ranges on the fit statistics are referred to as "misfitting" items and reflect erratic scoring. Overall, 95% fit of data from items and participants is desired to provide evidence of construct validity (unidimensionality) [15].
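The acceptance ranges above translate into a simple screening rule. A small sketch (the helper function name and dict layout are my own; the three infit value pairs are taken from Table 4 of this article):

```python
def is_misfitting(mnsq, zstd):
    """Flag an item whose mean-square falls outside 0.5-1.5 or whose
    standardized fit statistic falls outside -2..+2 (the ranges in the text)."""
    return not (0.5 < mnsq < 1.5) or not (-2.0 <= zstd <= 2.0)

# Infit (MnSq, Z-standardized) values for three items, as reported in Table 4
items = {
    "item 3 (correct reference standard)": (0.99, 0.0),
    "item 12 (clinical data)": (1.14, 3.3),
    "item 2 (selection criteria)": (0.87, -3.1),
}
misfits = [name for name, (m, z) in items.items() if is_misfitting(m, z)]
print(misfits)
```

Items 12 and 2 are flagged on their standardized values, matching the misfit reported in the Results.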

Rasch modeling also provides two reliability estimates, for both participants (ie, individual test accuracy studies) and items. A participant reliability index (PRI) is the equivalent of Cronbach α and should be 0.8 or better to demonstrate good internal consistency. A separation statistic (SS) is also provided, which gives information on the ability of the instrument to identify levels of quality among studies. The SS should be 2.00 or greater and indicates that differences in the measure are because of real differences in quality and not because of error of measurement.

[Fig. 1 flowchart: Electronic database search (Medline, Embase, PEDro, Cochrane, Medion), n = 8,577 → abstract screening, n = 158 (excluded studies, n = 75) → full text screened, n = 84 (excluded studies: n = 11 were not systematic reviews; n = 11 were not DTA reviews; n = 21 did not assess methodological quality; n = 13 used different quality scores/checklists) → eligible reviews, N = 28 (excluded studies: n = 9 did not provide detailed QUADAS data) → included reviews, N = 19]

Fig. 1. Flowchart of search results and inclusion of DTA reviews. DTA, diagnostic test accuracy; QUADAS, QUality Assessment of Diagnostic Accuracy Studies.

3. Results

The search strategy resulted in a total of 8,577 titles, which were screened for eligibility. After removal of duplicates and titles obviously outside the scope of the review, 158 abstracts were screened by two authors. The eligibility criteria were applied to 84 full-text articles and 28 were deemed to be eligible for this review (Fig. 1).

Of the 28 eligible DTA reviews, 9 were excluded because they did not provide detailed QUADAS data for the primary test accuracy studies, and these data were not made available despite attempts to contact corresponding authors. Therefore, a total of 19 DTA reviews on musculoskeletal conditions were included, which reviewed 392 primary test accuracy studies (Table 1). The included reviews presented results of diagnostic tests for spine [4,16–23], wrist [24], shoulder [25–28], knee [29,30], and hip [31] conditions. Two reviews reported the diagnostic accuracy of tests for juvenile idiopathic arthritis [32,33].

The QUADAS checklist (see Appendix at www.jclinepi.com) includes 14 standard items, all of which were assessed by 14 (74%) of the included reviews. One included review [24] assessed only 13 of these items, whereas one review [19] assessed only 12 items. Three reviews [21–23] assessed only 11 of the 14 standard items but added up to 9 additional items (Table 1). The most common additional items were: (1) Did the study provide a clear definition of what was considered to be a "positive" result of the index test?; (2) Was treatment withheld until both index test and reference standard were performed?; and (3) Were data on observer variation reported and within acceptable range?

Table 1. Characteristics of included diagnostic test accuracy reviews

| Review | Target condition | Index test(s) | Reference standard | Included studies | QUADAS standard items | Additional items |
| Alqarni et al. [16] | Lumbar segmental instability | Clinical tests | Radiography | 4 | 14 | 0 |
| Dessaur and Magarey [25] | Shoulder (SLAP) lesions | Clinical tests | Arthroscopy | 17 | 14 | 0 |
| Hancock et al. [4] | Spinal pain | Clinical tests/diagnostic imaging | Discography/anesthetic blocks | 41 | 14 | 0 |
| Hegedus et al. [29] | Meniscus injury | Physical examination | Surgery/MRI | 18 | 14 | 0 |
| Henschke et al. [17] | Malignancy in LBP | History and physical examination | Radiography/followup | 6 | 14 | 0 |
| Henschke et al. [18] | Fracture in LBP | History and physical examination | Diagnostic imaging | 12 | 14 | 0 |
| Luime et al. [26] | Shoulder instability or labral lesion | Clinical tests | Surgery/arthroscopy | 17 | 14 | 0 |
| Miller et al. [32] | Juvenile idiopathic arthritis | MRI | Clinical diagnosis/arthroscopy/MRI | 18 | 14 | 0 |
| Miller et al. [33] | Juvenile idiopathic arthritis (peripheral joints) | MRI | Clinical diagnosis/arthroscopy/MRI | 18 | 14 | 0 |
| Ottenheijm et al. [27] | Subacromial disorders | Ultrasound | MRI/surgery/radiography | 23 | 14 | 0 |
| Rubinstein et al. [19] | Cervical radiculopathy | Clinical tests | Electromyography/diagnostic imaging | 6 | 12 | 0 |
| Smith et al. [31] | Acetabular labral tears | MRI/MRA | Surgery/arthroscopy | 19 | 14 | 0 |
| Smith et al. [28] | Rotator cuff tears | Ultrasound | Surgery/arthroscopy | 62 | 14 | 0 |
| Smith et al. [30] | Anterior cruciate ligament rupture | MRI | Surgery/arthroscopy | 53 | 14 | 0 |
| Szadec et al. [20] | Sacroiliac joint (SIJ) pain | SIJ infiltration and clinical tests | Diagnostic criteria of sacroiliac pain | 17 | 14 | 0 |
| van der Windt et al. [21] | Lumbar radiculopathy | Physical examination | Diagnostic imaging/surgery | 20 | 11 | 3 a |
| van Rijn et al. [22] | Lumbar pathology | CT | Surgery/MRI | 7 | 11 | 9 b |
| Wassenaar et al. [23] | Lumbar pathology | MRI | Surgery/expert opinion | 8 | 11 | 9 b |
| Yin et al. [24] | Scaphoid fractures | Bone scintigraphy, MRI, and CT | Diagnostic imaging/followup | 26 | 13 | 0 |

Abbreviations: QUADAS, QUality Assessment of Diagnostic Accuracy Studies; SLAP, superior labrum anterior and posterior; LBP, low back pain; MRI, magnetic resonance imaging; MRA, magnetic resonance angiography; CT, computed tomography.
a (1) Did the study provide a clear definition of what was considered to be a "positive" result of the index test?; (2) Was treatment withheld until both index test and reference standard were performed?; and (3) Were data on observer variation reported and within acceptable range?
b (1) If a cutoff value has been used, was it established before the study was started (prespecified cutoff value)?; (2) Is the technology of the index test likely to have changed since the study was carried out?; (3) Did the study provide a clear definition of what was considered to be a "positive" result?; (4) Were data on observer variation reported and within acceptable range?; (5) Were data on instrument variation reported?; (6) Were data presented for appropriate patient subgroups?; (7) Was an appropriate sample size included?; (8) Were objectives prespecified?; and (9) Was treatment withheld until both index test and reference standard were performed?

The number of primary test accuracy studies fulfilling each of the 14 standard items is presented in Table 2. Item 3 (adequate reference standard) was the most fulfilled item on the checklist, with 358 (91%) primary studies scoring "yes." Item 13 (uninterpretable or intermediate test results) was the most common item not fulfilled, with 175 (45%) studies scoring "no." Items 4 (time period between reference standard and index test) and 11 (reference standard blinding) were most commonly scored "unclear."

Of the 392 primary test accuracy studies, 57 were included in two DTA reviews. The interrater reliability of the individual items in this subsample was "moderate" or "substantial" for all items [34] except for item 7, which had "almost perfect" agreement (κ = 0.85), and items 2 (κ = 0.40) and 13 (κ = 0.22), which had "fair" agreement (Table 3). The ICC (2,1) for the summary score was 0.91 (95% CI: 0.85, 0.95).

3.1. Rasch analysis

The variable map (Fig. 2) shows the relationship between the QUADAS items and the logit ratings of the studies. The left hand side of the map indicates study quality (based on the number of QUADAS items fulfilled) plotted on a logit scale (−3 to 4) with 0 indicating the item with average difficulty, in this case item 10. On the right hand side of the map, items that were more frequently fulfilled are shown at the bottom of the hierarchy and items less frequently fulfilled are at the top. The 14 items were well spread and aligned with 96% of the studies, indicating that the QUADAS is well targeted to assess the methodological quality of test accuracy studies in the musculoskeletal field. A clear finding from the variable map is that the top items, and therefore the most likely to discriminate between studies, were item 11 (Were the reference standard results interpreted without knowledge of the results of the index test?) followed by item 4 (Is the time period between reference standard and index test short enough to be reasonably sure that the target condition did not change between the two tests?). Item 2 (Were selection criteria clearly described?) and item 12 (Were the same clinical data available when test results were interpreted as would be available when the test is used in practice?) were aligned and thus seem to be redundant.

3.2. Construct validity

The fit statistics presented in Table 4 indicated that there was almost complete fit of data to the model, which shows excellent construct validity. Based on the standardized values, item 12 (Were the same clinical data available when test results were interpreted as would be available when the test is used in practice?) and item 2 (Were selection criteria clearly described?) were slightly misfitting, which reflects erratic scoring. These items could potentially be removed from the checklist without any loss in validity.

Table 2. Assessment of primary test accuracy studies using the 14-item QUADAS checklist

| Assessment | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
| Yes | 256 | 220 | 358 | 118 | 283 | 314 | 331 | 276 | 203 | 244 | 89 | 232 | 164 | 196 |
| No | 81 | 107 | 30 | 43 | 85 | 54 | 26 | 67 | 134 | 40 | 125 | 27 | 175 | 104 |
| Unclear | 55 | 30 | 4 | 205 | 18 | 24 | 29 | 14 | 20 | 108 | 178 | 133 | 53 | 92 |
| Missing b | 0 | 35 | 0 | 26 | 6 | 0 | 6 | 35 | 35 | 0 | 0 | 0 | 0 | 0 |
| Total | 392 | 392 | 392 | 392 | 392 | 392 | 392 | 392 | 392 | 392 | 392 | 392 | 392 | 392 |

Abbreviation: QUADAS, QUality Assessment of Diagnostic Accuracy Studies. Columns are QUADAS items 1–14. a
a Extra items assessed by the reviews are not included in this table.
b A number of reviews did not assess certain items of the QUADAS.

Table 3. Interrater reliability of the QUADAS checklist

| QUADAS items | Kappa a | 95% CI | Interpretation [35] |
| 1. Representative spectrum | 0.57 | 0.38, 0.75 | Moderate |
| 2. Selection criteria | 0.40 | 0.19, 0.61 | Fair |
| 3. Correct reference standard | 0.71 | 0.39, 1.02 | Substantial |
| 4. Time between tests | 0.61 | 0.44, 0.78 | Substantial |
| 5. Verification bias | 0.62 | 0.43, 0.81 | Substantial |
| 6. Clinical review bias | 0.42 | 0.13, 0.72 | Moderate |
| 7. Independent reference standard | 0.85 | 0.68, 1.01 | Almost perfect |
| 8. Index test details | 0.54 | 0.26, 0.81 | Moderate |
| 9. Reference standard details | 0.54 | 0.34, 0.74 | Moderate |
| 10. Blind index test | 0.42 | 0.23, 0.62 | Moderate |
| 11. Blind reference standard | 0.47 | 0.29, 0.66 | Moderate |
| 12. Clinical data | 0.46 | 0.25, 0.68 | Moderate |
| 13. Intermediate test results | 0.22 | 0.07, 0.37 | Fair |
| 14. Study withdrawals | 0.51 | 0.35, 0.67 | Moderate |

Abbreviations: QUADAS, QUality Assessment of Diagnostic Accuracy Studies; CI, confidence interval.
a All results are statistically significant, P < 0.05.

3.3. Internal consistency

The PRI (a Cronbach α equivalent) was 0.96, indicating excellent internal consistency, and an SS of 8.33. This means that the QUADAS is measuring one domain, that is, the methodological quality of a DTA study. The PRI indicates that the QUADAS scale discriminates among three to four "levels" of methodological quality and the favorable SS means that this is because of real differences in study quality and not measurement error.
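These reported indices can be checked against the thresholds given in the Methods (PRI of 0.8 or better; SS of 2.00 or greater). A minimal sketch, where the function name and dict keys are illustrative, not part of the Winsteps output:

```python
def reliability_verdict(pri, ss):
    """Compare Rasch reliability indices with the thresholds from the
    Methods: PRI >= 0.8 (good internal consistency), SS >= 2.00 (real
    quality differences rather than measurement error)."""
    return {"internal_consistency_ok": pri >= 0.8, "separation_ok": ss >= 2.00}

# PRI and SS reported for the QUADAS checklist in this study
print(reliability_verdict(pri=0.96, ss=8.33))
```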

4. Discussion

This review has identified 19 systematic DTA reviews on musculoskeletal conditions, which provided data on individual items of the QUADAS checklist. The reviews were mostly similar in their application of the checklist, allowing a comparison of the methodological quality of 392 primary DTA studies. The proportion of studies fulfilling certain items (ie, "yes") was quite variable, with a range of 22% (item 11) to 91% (item 3). Similarly, the decision to rate items as "unclear" ranged from 1% (item 3) to 52% (item 4). Overall, the interrater reliability of the QUADAS checklist across the systematic reviews was good, and Rasch analysis showed excellent construct validity and internal consistency. Contrary to previous findings [14], these results suggest that the use of a summary score from the QUADAS checklist could provide a valid and reliable way to compare the methodological quality of DTA studies in the musculoskeletal field.

A main strength of the present study is the use of multiple electronic database searches and clearly defined inclusion criteria to identify all relevant systematic DTA reviews for musculoskeletal conditions. This overview of methodological quality, as shown by the number of studies fulfilling items of the QUADAS checklist, therefore provides an insight into a large proportion of published diagnostic musculoskeletal research. One major insight from this is the potential to improve the reporting and performance of studies within the field, which could lead to more reliable estimates of diagnostic accuracy. Less than half of all primary DTA studies fulfilled item 11 (blinding of reference standard), item 4 (time between tests), item 13 (intermediate test results), or item 14 (study withdrawals). In addition, it is assumed that a large proportion of "unclear" ratings were because of inadequate reporting of the primary studies. As there is empirical evidence that a lack of blinding when interpreting reference standard results (item 11) can lead to biased results [35], specific attention should be paid to incorporating and reporting this feature in future DTA studies.

Fig. 2. Rasch variable map of item agreement and study quality as measured by the QUADAS checklist. M, mean; S, one standard deviation; T, two standard deviations; QUADAS, QUality Assessment of Diagnostic Accuracy Studies. Each "#" represents five studies. Each "." represents one study.

The electronic database search and the forward citation search of the original QUADAS article [9] resulted in a sample of included reviews, which were published with increasing frequency over recent years. Although this can be seen as an artifact of the rate of adoption of the QUADAS checklist by review authors, the number of DTA reviews is likely to continue to increase in the future [36]. Sufficient guidance should therefore be available to review authors on the use and interpretation of methodological quality checklists such as QUADAS. In areas such as musculoskeletal research where diagnosis is difficult, specific recommendations on the interpretation of studies based on methodological quality findings are needed. The use of summary scores has been discouraged because of the lack of a gold standard for methodological quality and uncertainty about appropriate weighting of items [14]. However, our findings suggest that in the musculoskeletal field the QUADAS checklist is a valid, unidimensional measure of methodological quality and the summary scores have high reliability. These findings provide empirical evidence supporting the use of a summary score for the QUADAS checklist. A summary score may be a useful feature for comparability across studies within the same field and to facilitate interpretation.

Although user feedback and expert consensus has led to the recent development of QUADAS-2, there is little to suggest that the application of this new checklist will result in markedly different results (ie, the proportion of studies fulfilling specific items) to the present study. The major differences between the versions of the checklist are the flexibility and grouping of items into domains, which are related to study quality [10]. Many of the items from the original checklist remain in QUADAS-2. What is of concern is that this second version of the checklist has been recommended for use without comprehensive evaluation of the original version. To our knowledge, this is the first study to subject the QUADAS scale to Rasch analysis and, despite its adequate performance in this field, the same results may not hold for diagnostic research in disciplines beyond the musculoskeletal field.

One limitation to the current analysis is that it does not allow for comparison across fields of research. It is likely that within the individual systematic DTA reviews included in the present study, authors interpreted some QUADAS items with respect to the target (musculoskeletal) condition in question. As there is generally very little reported on the application of the checklist within the reviews, a comparison of the methodological quality across fields of research is problematic. Despite the good interrater reliability across reviews, it is possible that differences in the interpretation and application of the items can account for the observed variations in ratings. Although the flexibility of the QUADAS (and QUADAS-2) checklist allows for addition and removal of items as deemed necessary, few of the included systematic reviews utilized this. Recommendations specific to a field of research on which items are most relevant and necessary could provide researchers with guidance in performing not only systematic DTA reviews but also designing primary DTA studies.

Table 4. Item-fit statistics for the 14-item QUADAS checklist

| QUADAS items | Infit mean square | Infit Z-standardized statistic | Outfit mean square | Outfit Z-standardized statistic |
| 3. Correct reference standard | 0.99 | 0.0 | 1.27 | 1.1 |
| 12. Clinical data | 1.14 | 3.3 a | 1.20 | 3.0 a |
| 8. Index test details | 1.08 | 1.1 | 1.13 | 1.1 |
| 13. Intermediate test results | 1.08 | 1.8 | 1.04 | 0.7 |
| 1. Representative spectrum | 1.05 | 1.0 | 1.02 | 0.4 |
| 9. Reference standard details | 1.03 | 0.7 | 1.03 | 0.5 |
| 6. Clinical review bias | 1.00 | 0 | 1.02 | 0.2 |
| 4. Time between tests | 1.00 | 0.1 | 1.00 | 0.0 |
| 14. Study withdrawals | 0.97 | −0.7 | 0.94 | −1.0 |
| 11. Blind reference standard | 0.96 | −0.5 | 0.90 | −0.8 |
| 10. Blind index test | 0.96 | −0.8 | 0.91 | −1.3 |
| 7. Independent reference standard | 0.93 | −0.7 | 0.90 | −0.5 |
| 5. Verification bias | 0.89 | −1.9 | 0.91 | −0.9 |
| 2. Selection criteria | 0.87 | −3.1 a | 0.78 | −3.2 a |

Abbreviation: QUADAS, QUality Assessment of Diagnostic Accuracy Studies.
a Mean square values beyond cutoffs of 0.5–1.5 indicate that items misfit the model (items below may be too predictable/redundant and items above represent unmodeled noise). The expected value for the Z-standardized statistic is between −2 and +2. Therefore, items 12 and 2 are slightly misfitting.

Musculoskeletal conditions can affect all regions of the body and it is through correct diagnosis that effective treatment and accurate patient information regarding prognosis can be provided. The lack of systematic review evidence available for the multitude of musculoskeletal conditions that are commonly diagnosed is of concern. This is presumably the result of a lack of primary DTA studies, deriving from the difficulties involved in performing such studies in this field. The absence of an adequate reference standard for most musculoskeletal conditions has resulted in a heavy reliance on clinical diagnoses (expert opinion) or diagnoses of exclusion. Yet the clinical signs and symptoms that contribute to making many musculoskeletal diagnoses are not often subject to DTA studies [37]. Using reference standards such as long-term followup, expert consensus, or response to treatment may be a potential solution to this problem. Improved imaging standards and procedures are also likely to lead to further improvements in musculoskeletal diagnosis.

As diagnostic accuracy in musculoskeletal research becomes subject to further and more complex evaluation, the methodology for evaluating study quality and dealing with biases also needs to progress. This study has provided an overview of important shortcomings in the design, performance, and reporting of DTA studies for musculoskeletal conditions. Further steps are needed to provide researchers and clinicians with recommendations for improving and interpreting studies of this nature.

Appendix

Supplementary data

Supplementary data related to this article can be found at http://dx.doi.org/10.1016/j.jclinepi.2013.11.008.

References

[1] Knottnerus JA. The evidence base of clinical diagnosis. London, UK: BMJ Books; 2002.

[2] Sackett DL, Haynes RB. The architecture of diagnostic research. BMJ 2002;324:539–41.

[3] Deyo RA. Diagnostic evaluation of LBP: reaching a specific diagnosis is often impossible. Arch Intern Med 2002;162:1444–7.

[4] Hancock MJ, Maher CG, Latimer J, Spindler MF, McAuley JH, Laslett M, et al. Systematic review of tests to identify the disc, SIJ or facet joint as the source of low back pain. Eur Spine J 2007;16(10):1539–50.

[5] Lucas N, Macaskill P, Irwig L, Moran R, Bogduk N. Reliability of physical examination for diagnosis of myofascial trigger points: a systematic review of the literature. Clin J Pain 2009;25(1):80–9.

[6] Hollerwoger D. Methodological quality and outcomes of studies addressing manual cervical spine examinations: a review. Man Ther 2006;11:93–8.

[7] Deeks JJ. Systematic reviews of evaluations of diagnostic and screening tests. BMJ 2001;323:157–62.

[8] Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM, et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Ann Intern Med 2003;138:40–4.

[9] Whiting P, Rutjes A, Reitsma J, Bossuyt P, Kleijnen J. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med Res Methodol 2003;3:25.

[10] Whiting PF, Rutjes AWS, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, et al. QUADAS-2: a revised tool for the Quality Assessment of Diagnostic Accuracy Studies. Ann Intern Med 2011;155:529–36.

[11] Whiting P, Weswood M, Rutjes A, Reitsma J, Bossuyt P, Kleijnen J. Evaluation of QUADAS, a tool for the quality assessment of diagnostic accuracy studies. BMC Med Res Methodol 2006;6:9.

[12] Westwood M, Whiting P, Kleijnen J. How does study quality affect the results of a diagnostic meta-analysis? BMC Med Res Methodol 2005;5:20.

[13] Whiting P, Rutjes AW, Reitsma JB, Glas AS, Bossuyt PM, Kleijnen J. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 2004;140:189–202.

[14] Whiting P, Harbord R, Kleijnen J. No role for quality scores in systematic reviews of diagnostic accuracy studies. BMC Med Res Methodol 2005;5:19.

[15] Linacre J. A user's guide to WINSTEPS and Rasch-modelling computer programs. Chicago, IL: John M Linacre; 2006.

[16] Alqarni AM, Schneiders AG, Hendrick PA. Clinical tests to diagnose lumbar segmental instability: a systematic review. J Orthop Sports Phys Ther 2011;41:130–40.

[17] Henschke N, Maher CG, Refshauge KM. Screening for malignancy in low back pain patients: a systematic review. Eur Spine J 2007;16(10):1673–9.

[18] Henschke N, Maher CG, Refshauge KM. A systematic review identifies five "red flags" to screen for vertebral fracture in patients with low back pain. J Clin Epidemiol 2008;61:110–8.

[19] Rubinstein SM, Pool JJM, van Tulder MW, Riphagen II, de Vet HCW. A systematic review of the diagnostic accuracy of provocative tests of the neck for diagnosing cervical radiculopathy. Eur Spine J 2007;16:307–19.

[20] Szadec KM, van der Wurff P, van Tulder MW, Zuurmond WW, Perez RSGM. Diagnostic validity of criteria for sacroiliac joint pain: a systematic review. J Pain 2009;10:354–68.

[21] van der Windt DAWM, Simons E, Riphagen II, Ammendolia C, Verhagen AP, Laslett M, et al. Physical examination for lumbar radiculopathy due to disc herniation in patients with low-back pain. Cochrane Database Syst Rev 2010;(2):CD007431. http://dx.doi.org/10.1002/14651858.CD007431.pub2.

[22] van Rijn RM, Wassenaar M, Verhagen AP, Ostelo RWJG, Ginai AZ, de Boer MR, et al. Computed tomography for the diagnosis of lumbar spinal pathology in adult patients with low back pain or sciatica: a diagnostic systematic review. Eur Spine J 2012;21:228–39.

[23] Wassenaar M, van Rijn RM, van Tulder MW, Verhagen AP, van der Windt DAWM, Koes BW, et al. Magnetic resonance imaging for diagnosing lumbar spinal pathology in adult patients with low back pain or sciatica: a diagnostic systematic review. Eur Spine J 2012;21:220–7.

[24] Yin Z-G, Zhang J-B, Kan S-L, Wang X-G. Diagnosing suspected scaphoid fractures: a systematic review and meta-analysis. Clin Orthop Relat Res 2010;468:723–34.

[25] Dessaur WA, Magarey ME. Diagnostic accuracy of clinical tests for superior labral anterior posterior lesions: a systematic review. J Orthop Sports Phys Ther 2008;38:341–52.

[26] Luime JJ, Verhagen AP, Miedema HS, Kuiper JI, Burdorf A, Verhaar JAN, et al. Does this patient have an instability of the shoulder or a labrum lesion? JAMA 2004;292:1989–99.

[27] Ottenheijm RP, Jansen MJ, Staal JB, van den Bruel A, Weijers RE, De Bie RA, et al. Accuracy of diagnostic ultrasound in patients with suspected subacromial disorders: a systematic review and meta-analysis. Arch Phys Med Rehabil 2010;91:1616–25.

[28] Smith TO, Back T, Toms AP, Hing CB. Diagnostic accuracy of ultrasound for rotator cuff tears in adults: a systematic review and meta-analysis. Clin Radiol 2011;66:1036–48.

[29] Hegedus EJ, Cook C, Hasselblad V, Goode A, McCrory DC. Physical examination tests for assessing a torn meniscus in the knee: a systematic review with meta-analysis. J Orthop Sports Phys Ther 2007;39:541–50.

[30] Smith TO, Lewis M, Song F, Toms AP, Donell ST, Hing CB. The diagnostic accuracy of anterior cruciate ligament rupture using magnetic resonance imaging: a meta-analysis. Eur J Orthop Surg Traumatol 2012;22:315–26.

[31] Smith TO, Hilton G, Toms AP, Donell ST, Hing CB. The diagnostic accuracy of acetabular labral tears using magnetic resonance imaging and magnetic resonance arthrography: a meta-analysis. Eur Radiol 2011;21:863–74.

[32] Miller E, Roposch A, Uleryk E, Doria AS. Juvenile idiopathic arthritis of peripheral joints: quality of reporting of diagnostic accuracy of conventional MRI. Acad Radiol 2009;16:739–57.

[33] Miller E, Uleryk E, Doria AS. Evidence-based outcomes of studies addressing diagnostic accuracy of MRI of juvenile idiopathic arthritis. Am J Roentgenol 2009;192:1209–18.

[34] Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–74.

[35] Leeflang M, Reitsma J, Scholten R, Rutjes A, Di NM, Deeks J, et al. Impact of adjustment for quality on results of meta-analyses of diagnostic accuracy. Clin Chem 2007;53:164–72.

[36] Deville W, Buntinx F, Bouter L, Montori V, de Vet H, van der Windt D, et al. Conducting systematic reviews of diagnostic studies: didactic guidelines. BMC Med Res Methodol 2002;2:9.

[37] Henschke N, Maher C. Red flags need more evaluation. Rheumatology (Oxford) 2006;45(7):920–1.
