CONTENT VALIDITY AND INTER-RATER RELIABILITY OF THE HALLIWICK-CONCEPT-BASED INSTRUMENT "SWIMMING WITH INDEPENDENT MEASURE"

Katja Groleger Sršen 1, Gaj Vidmar 1, Maša Pikl 2, Irena Vrečar 1, Cirila Burja 3, Klavdija Krušec 4

1 University Rehabilitation Institute, Republic of Slovenia, Ljubljana
2 Elementary School Gradec, Litija, Slovenia
3 CIRIUS Kamnik, Slovenia
4 CIRIUS Vipava, Slovenia

Corresponding author:
Assist Prof Gaj Vidmar, PhD
University Rehabilitation Institute, Republic of Slovenia
Linhartova 51, SI-1000 Ljubljana, Slovenia
E-mail: [email protected]

Conflicts of interest: none declared
Source of funding: none

Abstract

Objective: The Halliwick concept is widely used in different settings to promote joyful

movement in water and swimming. To assess the swimming skills and progression of an individual

swimmer, one should use a valid and reliable measure. The Halliwick-concept-based Swimming

with Independent Measure (SWIM) was introduced for this purpose. We wanted to determine

its content validity and inter-rater reliability.

Methods: 54 healthy children, 3.5 to 11 years old, from a mainstream swimming program

participated in a content validity study. They were evaluated with SWIM and the national

evaluation system of swimming abilities (classifying children into seven categories). For

studying inter-rater reliability of SWIM, we included 37 children and youth from a Halliwick

swimming program, aged 7-22 years, who were evaluated by two Halliwick instructors

independently.

Results: Average SWIM score differed between national evaluation system categories and

followed the expected order (p<0.001), whereby a ceiling effect was observed in the higher

categories. High inter-rater reliability was found for all 11 SWIM items. The lowest reliability

was observed for item G (sagittal rotation), though the estimates were still above 0.9. As

expected, the highest reliability was observed for total score (intraclass correlation 0.996).

Conclusions: Validity of SWIM with respect to the national evaluation system of swimming

abilities is high until the point where a swimmer is well adapted to water and already able to

learn some swimming techniques. Inter-rater reliability of SWIM is very high, so we believe

that SWIM can be used in further research and practice to follow the progress of swimmers.

Key words: swimming, children, Halliwick, evaluation of progress, SWIM

Introduction

The Halliwick concept of teaching swimming is widely used across the world. In short, it is a

well-developed concept of teaching people with physical and/or learning difficulties to move

independently in water and also to swim, if possible. It comprises knowledge on

hydromechanics, hydrostatics, biomechanics and teaching, aiming to help the swimmer develop

water confidence. Its development began in the 1950s, when Phyl McMillan, James McMillan and

Joan Martin wanted to develop a special swimming program for children at the Halliwick

School for Crippled Girls in London. Eventually, all the knowledge and experience that they

had acquired led to the development of a ten-point program: mental adjustment to water,

disengagement of the swimmer (building independence), transversal rotation control, sagittal,

longitudinal and combined rotation control, upthrust, balance in stillness, turbulent gliding,

simple progression and basic swimming stroke (McMillan, 2002).

Within the Halliwick system, swimmers are actively engaged in the learning process through

different activities and play. Their abilities are traditionally assessed through a system of four

Halliwick badges: red, yellow, green and blue. The tests are based on the ten-point program.

Swimmers have to pass several items, which are scored either "passed" or "not passed". In order

to pass the items for the red or yellow badge, the swimmer has to perform several activities while

the instructor provides some physical support. Only for the green badge does the swimmer have

to perform the items unaided. The blue badge is aimed at providing a wide range of water skills

for advanced swimmers (McMillan, 2006).

To our knowledge, nobody has reported on content validity and inter-rater reliability of this

system yet, even though the system's fundamental validity appears to be self-evident. On the

other hand, it is also obvious that the four-badge system is too coarse to precisely evaluate a

wide range of swimmer abilities and cannot be sensitive to small changes. Taking into account

that there are quite a few swimmers with profound physical difficulties who are not expected to

progress much, or at least not in a reasonably short time, another test is needed.

Based on extensive experience gained while working in a Halliwick swimming club, Kim

Peacock (1993) developed a new test called "Swimming with Independent Measure" (SWIM).

SWIM is based on the ten-point Halliwick program. It is aimed at evaluating functional abilities

within any swimming pool setting and can be applied to any diagnostic group and to all ages.

The results of a small recent study suggest that it is sensitive enough to evaluate, follow-up and

plan the individual or group program (Groleger Sršen et al., 2008, 2010).

There are some other tests to evaluate swimming skills for children with physical or learning

disabilities: the Aquatic Independence Measure – AIM (Chacham and Hutzler, 2001), the Water

Orientation Test of Alyn – WOTA (Tirosh et al., 2008) and Humphries’ Assessment of Aquatic

Readiness – HAAR (Humphries, 2008). They are all related to the Halliwick concept to some

extent. A comparison of the items of those tests suggests that SWIM follows the Halliwick ten-

point program more closely. However, WOTA and AIM have already been tested for content

validity and inter-rater reliability (Tirosh et al., 2008). WOTA consists of two versions: one for

those who are capable of fulfilling instructions, and one for those who are not. The second one

is appropriate for the evaluation of children at the age of approximately 3 years and for older

children with limited cognitive abilities (Tirosh et al., 2008). For practical reasons, it may be

preferable to apply a single version of a test, as is the case with SWIM, which is useful for all

ages and diagnoses. SWIM is also less time-consuming than WOTA. In our experience, it can

be performed in 15 minutes, while WOTA is reported to be performed in 30 minutes (Tirosh et

al., 2008). SWIM testing is easy to perform and no additional training is needed, while for

WOTA there is a special training incorporated into the curricula of Aquatic Therapy courses in

Israel through a one-day workshop (Tirosh et al., 2008).

Based on these considerations, we decided to use SWIM in clinical practice. We could not find

any data on content validity or inter-rater reliability of SWIM, so we wanted to assess both. We

hypothesized that SWIM is a valid and highly reliable measure and could therefore be used in

different settings for evaluation of the relevant functional abilities in water according to the

Halliwick concept.

Methods

Participants

Fifty-five healthy children from a mainstream swimming program were invited for the content

validity study. All of them had been in the program for several months or longer because the

testing was performed in late spring, i.e., at the end of the school-year. Thirty-seven children

were invited to participate in the study on inter-rater reliability. They were all residents at one of

the two school-centres for children with special needs in Slovenia (CIRIUS Kamnik, Vipava)

and engaged in the Halliwick program for several years. A detailed description of this group is

provided in Table 1. Parents of all the children were informed about the protocol and signed the

informed consent. Ethical approval for this study was obtained from the Research Ethics

Committee of the University Rehabilitation Institute of the Republic of Slovenia.

Study design

Content validity study: Children were tested by two Halliwick instructors. One was reading

instructions from the SWIM manual and instructing each child about the SWIM items he or

she should perform. When needed, the child was given a practical demonstration by the second

instructor. Physical support was offered when needed. Each child was scored based on the best

performance out of several trials. The instructors then discussed and decided on the assigned

score for each item. Afterwards, all children were tested using the National Evaluation System

of Swimming Abilities – NESSA (Kapus et al., 2002) and assigned to one of the first seven

categories (Table 2).

Inter-rater reliability study: The children were tested by two pairs of Halliwick instructors.

Since swimmers with learning and physical disabilities need to be very confident with a person

who is working with them in water, we decided that only one instructor would perform the

practical part of testing. The other instructor was instructing the child about the SWIM items he or

she should perform. Physical support was offered when needed. As in the content validity

study, the child was given a practical demonstration of a particular item if needed. Each child was

scored based on the best performance out of several trials. The second pair of Halliwick instructors

was sitting at the opposite side of the pool, so they could clearly see and evaluate the

performance of the child but they were not able to hear any possible discussion on SWIM items

or a decision made on scoring a particular item by the first pair.

Swimming with Independent Measure

SWIM comprises 11 items (Table 3) that are evaluated on a 7-point scale (1 to 7). Detailed

information on items is available in the manual (Peacock, 1993). Score 1 means that the

swimmer is unable to perform the activity, it is not safe to test or the item is not measured.

Score 7 is assigned to a swimmer who is able to perform the activity without any support and in

an appropriate way. The maximum possible score is 77 points. There is no need to pass a formal

training to use SWIM, but it is obvious that a person would need knowledge of the Halliwick

concept and some practical experience. SWIM was translated into Slovenian and has been used

in clinical practice for the last five years (Groleger Sršen et al., 2008).
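
The scoring rule described above (11 items, each scored 1 to 7, total between 11 and 77) can be illustrated with a minimal sketch. The function name `total_swim_score` and the example score sheet are hypothetical and are not part of the SWIM manual; the sketch only assumes that every item receives an integer score and that unmeasured items are recorded as 1, as stated above.

```python
SWIM_ITEMS = "ABCDEFGHIJK"   # the 11 pool skills listed in Table 3
MIN_SCORE, MAX_SCORE = 1, 7  # 7-point scale; 1 also covers "not measured"


def total_swim_score(item_scores: dict[str, int]) -> int:
    """Validate one score sheet and return the total SWIM score (11 to 77)."""
    missing = set(SWIM_ITEMS) - set(item_scores)
    unknown = set(item_scores) - set(SWIM_ITEMS)
    if missing or unknown:
        raise ValueError(
            f"missing items {sorted(missing)}, unknown items {sorted(unknown)}"
        )
    for item, score in item_scores.items():
        if not (isinstance(score, int) and MIN_SCORE <= score <= MAX_SCORE):
            raise ValueError(f"item {item}: score {score!r} is outside 1..7")
    return sum(item_scores.values())


# Hypothetical swimmer: full marks except sagittal (G) and longitudinal (H) rotation.
example_sheet = {item: 7 for item in SWIM_ITEMS}
example_sheet["G"] = 5
example_sheet["H"] = 4
print(total_swim_score(example_sheet))  # prints 72
```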

Statistical analysis

To eschew the controversies related to measurement level outlined at the beginning of the

Discussion, we conducted all statistical analyses of individual SWIM items, sum of selected

SWIM items and total SWIM scores first using methods assuming interval-level properties and

then using methods assuming only ordinal measurement level. The difference in mean SWIM

score between NESSA categories was first tested using one-way analysis of variance (ANOVA,

including a trend analysis with contrasts), and then the ordered trend of SWIM scores with

respect to NESSA categories was tested using the exact Jonckheere-Terpstra test. Mean scores

on two subtotals of SWIM items were compared using paired-samples t-test and then using

exact Wilcoxon matched-pairs signed-rank test (EMP). Likewise, mean scores of all SWIM

items and total SWIM score were compared between the two pairs of raters using paired-

samples t-test and EMP (without adjustment for multiple testing). Agreement between the two

rater pairs regarding SWIM (each item and total score) was assessed with intraclass correlation

(two-way random model for a single measure – ICC (2,1)), and also with weighted Cohen's

kappa coefficient (using quadratic weights). Additionally, for purely exploratory and illustrative

purposes, agreement regarding total score was depicted using the Bland and Altman (1986)

limits-of-agreement approach. All statistical analyses were performed using SPSS 15.0 for

Windows (SPSS Inc., Chicago, IL, 2007), whereby a macro program adapted from published

and verified code was used for kappa calculations (Valiquette et al., 1994; García-Granero,

2007) and Bland-Altman plots (García-Granero, 2009).
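
For readers who wish to reproduce the two agreement statistics outside SPSS, the following is a minimal sketch in plain Python. It is not the macro code used in this study: `icc_2_1` follows the usual two-way ANOVA mean-squares formula for ICC(2,1), `quadratic_weighted_kappa` uses squared score differences as disagreement weights, and the two score lists are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random model, absolute agreement, single measure.

    ratings: list of rows, one row per subject, each holding k rater scores.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def quadratic_weighted_kappa(x, y, categories):
    """Cohen's kappa for two raters with quadratic disagreement weights."""
    n = len(x)
    # Observed mean squared disagreement between the two raters.
    observed = sum((a - b) ** 2 for a, b in zip(x, y)) / n
    # Expected mean squared disagreement if the ratings were independent.
    px = {c: x.count(c) / n for c in categories}
    py = {c: y.count(c) / n for c in categories}
    expected = sum(
        px[a] * py[b] * (a - b) ** 2 for a in categories for b in categories
    )
    return 1.0 - observed / expected


# Hypothetical item scores (1 to 7) given to eight swimmers by two rater pairs.
pair_1 = [7, 6, 5, 3, 2, 7, 4, 1]
pair_2 = [7, 6, 4, 3, 2, 7, 5, 1]

icc = icc_2_1([list(scores) for scores in zip(pair_1, pair_2)])
kappa = quadratic_weighted_kappa(pair_1, pair_2, range(1, 8))
print(f"ICC(2,1) = {icc:.3f}, weighted kappa = {kappa:.3f}")
```

In this made-up example the two rater means are equal, so the two coefficients should differ only slightly.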

Results

Content validity study

Of the 55 healthy children invited to participate, 54 were evaluated because one boy was

reluctant to co-operate; thus, 28 boys and 26 girls were tested. Mean age of the group was 5.9

years (SD 1.9 years, range 3.5-11 years). Nineteen children were not able to swim, 4 were able

to glide through water; the rest were able to swim from eight meters up to 10 minutes without

touching the pool floor (Table 4). As expected, differences in mean SWIM scores among

different NESSA categories (Table 4) were significant (p<0.001). Because of the evident ceiling

effect, we also tested if the means rose with a quadratic trend rather than linearly, which also

proved to be significant (p<0.001). The data did not significantly deviate from the quadratic

trend (p=0.528). The Jonckheere-Terpstra test also showed a significant rise of SWIM scores

with NESSA level (exact p<0.001). Average scores for SWIM items of adjustment to water,

breathing and balance were statistically significantly higher than average scores for the items on

rotations (by about 1 point), whether estimated as means and tested parametrically or estimated

as medians and tested nonparametrically (Table 5).
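
As a rough plausibility check of the one-way ANOVA, the F statistic can be approximated from the group sizes, means and standard deviations reported in Table 4. The sketch below is an illustration only: it works from rounded summary values rather than the raw data, so it will not exactly match the SPSS output, and the function name `anova_from_summary` is hypothetical.

```python
# Rounded values copied from Table 4: (NESSA level, N, mean, SD)
TABLE_4 = [
    (0, 19, 34.5, 4.1),
    (1, 4, 43.3, 7.4),
    (2, 9, 59.8, 12.5),
    (3, 8, 68.9, 4.2),
    (4, 5, 72.8, 1.3),
    (5, 7, 74.0, 3.1),
    (6, 2, 76.0, 0.0),
]


def anova_from_summary(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA from summaries."""
    n_total = sum(n for _, n, _, _ in groups)
    grand_mean = sum(n * mean for _, n, mean, _ in groups) / n_total
    # Between-groups sum of squares from the group means.
    ss_between = sum(n * (mean - grand_mean) ** 2 for _, n, mean, _ in groups)
    # Within-groups sum of squares recovered from the group SDs.
    ss_within = sum((n - 1) * sd ** 2 for _, n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


f_stat, df_between, df_within = anova_from_summary(TABLE_4)
print(f"F({df_between}, {df_within}) is approximately {f_stat:.1f}")
```

An F value of this size with 6 and 47 degrees of freedom is consistent with the reported p<0.001.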

Inter-rater reliability study

All the 37 invited children and youth were willing to co-operate. Mean and median scores for all

SWIM items assessed by both pairs of raters are presented in Table 6. Paired-samples

comparisons for detecting possible bias showed no statistically significant difference between raters. The largest

difference was observed for item G (sagittal rotation development), but it was still not

statistically significant at the 5% alpha level even though no correction for multiple testing was

applied.

Intraclass correlation and weighted kappa estimates demonstrated that agreement between the

two (pairs of) raters was very high (Table 7). The lowest agreement was found for item G, but it

was still above 0.9. As expected, the agreement was the highest regarding total score (where it

was practically perfect). All the weighted kappa values were virtually identical to the ICC

values (they were identical to two decimals, and equal or lower by up to 0.002 to three

decimals). Given the negligible differences between rater means this should come as no surprise

because in the absence of rater mean differences, ICC(2,1) and kappa with quadratic weights are

identical (Schuster, 2004). On a related note concerning also the content validity study results, it

was also only natural to observe all the medians being very close to the means (because no

distribution was extremely skewed), and all the interquartile ranges being (roughly speaking)

larger by about a half than the standard deviations (because it is known from basic probability

that for a normal distribution the ratio of IQR to SD is about 1.35).
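
The normal-distribution fact referred to here is that the interquartile range is about 1.35 times the standard deviation. A short check using the Python standard library, offered only as an illustration, is:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# IQR = 75th percentile minus 25th percentile of the distribution.
iqr = standard_normal.inv_cdf(0.75) - standard_normal.inv_cdf(0.25)
ratio = iqr / standard_normal.stdev

print(round(ratio, 3))  # prints 1.349
```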

The agreement regarding total score is depicted in Figure 1, where the vast majority of points

(i.e., score pairs) lie very close to the main diagonal that represents perfect agreement. In

addition, we visualised agreement regarding total score using the limits-of-agreement approach.

The Bland-Altman plot (Figure 2) showed no systematic trend of differences between the two

(pairs of) raters, and the limits of agreement included zero. The case with the largest

disagreement was a 12-year-old girl with CP, GMFCS level IV, for whom the difference of five

points was a result of disagreement regarding items B (by 2 points), C (by 2 points) and J (by 1

point).
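
The limits-of-agreement calculation behind a Bland-Altman plot can be sketched as follows. This is not the SPSS macro used in this study; the function name `limits_of_agreement` and the total scores are hypothetical, and the conventional multiplier of 1.96 for approximate 95% limits is assumed.

```python
from statistics import mean, stdev


def limits_of_agreement(scores_a, scores_b, z=1.96):
    """Return (bias, lower limit, upper limit) for paired scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    bias = mean(diffs)       # mean difference between the two raters
    spread = stdev(diffs)    # sample SD of the differences
    return bias, bias - z * spread, bias + z * spread


# Hypothetical total SWIM scores (11 to 77) from two rater pairs.
pair_1 = [77, 63, 45, 30, 71, 58, 66, 22]
pair_2 = [77, 64, 45, 35, 70, 58, 67, 22]

bias, lower, upper = limits_of_agreement(pair_1, pair_2)

# Points of the plot: x = mean of the two scores, y = their difference.
points = [((a + b) / 2, a - b) for a, b in zip(pair_1, pair_2)]

print(f"bias = {bias:.2f}, limits of agreement = ({lower:.2f}, {upper:.2f})")
```

When zero lies between the two limits, as in this made-up example, the plot gives no indication of systematic bias between the raters.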

Discussion

The aim of the presented study was to explore content validity and inter-rater reliability of

SWIM. The former was addressed through a study involving 54 healthy children, and the latter

through a study involving 37 children and youth with special needs assessed by two pairs of

raters.

We found clear association of SWIM scores with the categories of the National Evaluation

System of Swimming Abilities. Based on that, we can conclude that SWIM has good content

validity. Since this part of the study was performed with healthy children, it would be

interesting to perform a similar study with a group of children with physical or learning

disabilities and test content validity against the Halliwick system of four badges.

As expected, we found a ceiling effect with healthy children, since SWIM items are meant to

evaluate pre-swimming and early swimming abilities. This means that SWIM is not useful for

advanced swimmers. We also expected that SWIM scores of children would follow the logical

order of development of pre-swimming and early swimming skills. Again, our expectation

proved to be justified, because the children gained higher scores on the items of adjustment to

water and breathing control, which are among the first skills to be mastered, than on the items of

rotations. Only when a child is able to control breathing while under water is he or she

prepared to learn and perform more demanding items of full transversal, longitudinal and

combined rotation (holding face immersed into water and being able to blow out in a controlled

manner). It can be added that the same was observed in the group of children and youth with

disabilities.

It was somewhat surprising to see that there were quite a few children who were able to swim

(according to the national evaluation system for healthy children) but still did not gain the

maximum total SWIM score of 77 points. We found some of those children not to be fully

adapted to water and not able to submerge to the pool floor and blow bubbles in a controlled

manner. This could lead us to the conclusion that the mainstream swimming program of a

particular school should have spent more time on teaching skills of water adjustment and

breathing control while children are trained in swimming skills. We could not make any

conclusion on other mainstream programs on the national level, but it would be very interesting

to explore this in more detail in the future. After all, the national guidelines for teaching swimming

include adjustment to water, breathing control and gliding through water as

early skills to be taught (Kapus et al., 2002).

The results of the second part of the study demonstrated very high inter-rater reliability of

SWIM. At the time of developing the study protocol, we had thought that we should test each

child twice in a row by two pairs of testers. However, we subsequently learned that a child is

able to perform skills at his/her best only when he/she feels comfortable and trusts the instructor

in water. Hence, we adapted the protocol and evaluated children while performing the test only

on one occasion. In this way, we were able to observe only the differences caused by different

decisions of testers, which were in fact just minor. Based on this experience, we can recommend

that the person who is performing the SWIM test should be one who knows the child well and

that the child should be in a confident relationship with that person (concerning water

activities). Following this recommendation should lead to a more reliable evaluation of the child's performance.

Based on the high inter-rater reliability, we can conclude that no special training is needed for using

SWIM. Nevertheless, we recommend that the tester be a Halliwick instructor at the level of

group leader or instructor with long-term experience.

The observed inter-rater reliability estimates were all very high. We cannot assign any special

meaning to the fact that the lowest agreement was found regarding item G on sagittal rotation. If

anywhere, we might have expected to find disagreement regarding item D on balance

development. A child can be scored with two points when able to balance in vertical position

with support from helper at trunk, three points when able to balance in vertical position without

support from helper, four points when able to balance in back float position with support from

helper at trunk, and five points when able to balance in back float position without support from

helper. In our experience, there are quite a few children (those with higher GMFCS levels,

myelomeningocele or other conditions with poor functional ability of the legs) who are not able to stand in

water but are very confident in lying position without support. In those cases we agreed in

advance to score them with the higher score of five points. This is not addressed in the manual

(Peacock, 1993), so we recommend that it is noted and applied in the future.

We also found no systematic trend of differences between raters. In the case with the largest

disagreement (a 12-year-old girl with CP, GMFCS level IV), the two pairs of raters disagreed in

the scoring of items B and C by two points each. During the evaluation, she needed full support from a helper

facing her and was able to blow out with her lips at water level. The pair of raters who

scored her performance higher (engagement of helper from behind and being able to blow

bubbles in water) was the pair who work with her regularly and know her usual performance.

Hence, disagreement was influenced by previous experience with the girl's performance and

was not a result of a fundamental disagreement on scoring rules.

Before concluding, some methodological issues regarding our statistical analyses must be

addressed. By performing "parametric" and "nonparametric" analyses (to put it in widely used,

albeit often misused and technically inappropriate terms) in parallel, we sought to sidestep the

controversies regarding the relation between measurement theory (measurement levels) and

statistical theory (statistical methods). Our aim was to avoid both extreme views, namely that of

"permissible" analyses in terms of Stevens' (1946, 1975) taxonomy and the opposing one

epitomised by the saying "the numbers do not know where they came from" (Lord, 1953; Gaito,

1980; Velleman and Wilkinson, 1993). We tried to follow the principle indirectly acknowledged

by both "opposing sides" – by the former primarily through the work in mathematical

psychology ranging from Luce et al. (1990) to Zand Scholten and Borsboom (2009), and the

latter in Lord's own sequel (1954) – and championed universally and brilliantly by Tukey (1961),

namely that scientists have to apply mathematics to real-life data with care and understanding.

Furthermore, our results justify our approach and demonstrate that adhering to strictly "ordinal-

level analyses" by discarding the "parametric" part of each of our analyses would have

unnecessarily sacrificed not only familiarity, but also useful information. We can also

confidently speculate that approaching the data within the framework of "ordinal

psychometrics" (Cliff, 1989, 1996; Cliff and Keats, 2003) would have led to the same conclusions. This holds for the

individual SWIM items (which are unquestionably further from interval measurement level) as

well as for the total SWIM score (which might comfortably be assumed to have interval-level

properties in the tradition of classical test theory). It has also long been known and empirically

demonstrated that for the types of analyses we conducted, the decision between ordinal or

interval level of measurement is of no great importance (Baker et al., 1966). Nevertheless,

further research on metric characteristics of SWIM is needed, starting with internal validity

examination that would shed light on the measurement level issues. A much larger sample will

be needed for such analysis, which would entail item-response modelling (e.g., through the

graded response model, or – as widely preferred and advocated in the rehabilitation research

literature – the rating scale extension of the Rasch model).

On a final methodological note, we used the Bland-Altman plot even though the SWIM score is

inherently discrete rather than continuous as assumed by the method, because – as already

stressed – we applied it for purely exploratory and illustrative purposes. It served them well by

exposing no systematic trend of differences between the two raters and by clearly identifying

the case with the largest disagreement, which merited additional explanation. The limits of

agreement were therefore calculated and depicted for providing useful visual context rather than

for drawing conclusions. Like all the descriptive and inferential methods, data visualisation was

thus also used in the spirit that should, in our belief, pervade any scientific research and use of

any scientific instruments or methods, i.e., cum grano salis (Vidmar, 2010).

Conclusion

The results showed that the validity of SWIM compared to the National Evaluation System of

Swimming Abilities is high up to the point where a swimmer is well adapted to water and

already able to learn some swimming techniques. Inter-rater reliability of SWIM is very high, so

we believe that SWIM could be used reliably in different practical settings to follow the

progress of swimmers, as well as for research purposes. The findings are also valuable for

planning future studies on efficacy of different programs (impact of different functional abilities

within different pathologies, length of programs, and intensity of programs). However, before

application of SWIM for scientific research purposes, further studies on its sensitivity and

internal validity are recommended.

References

1. Baker BO, Hardyck C, Petrinovich LF (1966). Weak measurement vs. strong statistics: an empirical critique of S.S. Stevens' proscriptions on statistics. Educ Psychol Meas 26, 291-309.
2. Bland JM, Altman DG (1986). Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1(8476), 307-310.
3. Chacham A, Hutzler Y (2001). Reliability and validity of the aquatic adjustment test for children with disabilities. Movement 6, 160-89.
4. Cliff N (1989). Ordinal consistency and ordinal true scores. Psychometrika 54, 75-91.
5. Cliff N (1996). Ordinal methods for behavioral data analysis. Mahwah, NJ: Lawrence Erlbaum.
6. Cliff N, Keats JA (2003). Ordinal measurement in the behavioral sciences. Mahwah, NJ: Erlbaum.
7. Gaito J (1980). Measurement scales and statistics: resurgence of an old misconception. Psychol Bull 87(3), 564-567.
8. García-Granero M (2007). KAPPAPLUS. http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0706&L=spssx-l&D=1&P=53665
9. García-Granero M (2009). Bland & Altman LOA (Limits Of Agreement) analysis (MACRO). http://gjyp.nl/marta/BALOA.sps
10. Groleger Sršen K, Vrečar I, Korelc S (2008). Swimming program based on Halliwick concept: evaluation of swimming skill progress in a group of children with motor disabilities. Neurologia Croatica 57(S3), 4.
11. Groleger Sršen K, Vrečar I, Vidmar G (2010). The Halliwick concept of teaching swimming and assessment of swimming skills. Rehabilitation (Ljubljana) 9(1), 32-39.
12. Humphries KM (2008). Humphries’ Assessment of Aquatic Readiness. Denton: Texas Woman’s University, Department of Kinesiology, Adapted Physical Education and Activity.
13. Kapus V, Štrumbelj B, Kapus J, Jurak G, Šajber Pincolič D, Vute R, Bednarik J, Kapus M, Čermak V (2002). Plavanje, učenje: slovenska šola plavanja za novo tisočletje. Ljubljana: Faculty of Sport, Institute of Sport.
14. McMillan P (2002). The Halliwick Story. London: Halliwick Association of Swimming Therapy. www.halliwick.org.uk/html/history.htm
15. Lord FM (1953). On the statistical treatment of football numbers. Am Psychol 8(12), 750-751.
16. Lord FM (1954). Further comment on "Football Numbers". Am Psychol 9(6), 264-265.
17. Luce RD, Krantz DH, Suppes P, Tversky A (1990). Foundations of measurement. (Vol. III: Representation, axiomatization, and invariance). New York: Academic Press.
18. McMillan J, McMillan P (2006). Halliwick Association of Swimming Therapy: Foundation Course handbook (14th ed.). London: Halliwick Association of Swimming Therapy.
19. Palisano R, Rosenbaum P, Walter S, Russell D, Wood E, Galuppi B (1997). Gross Motor Function Classification System, Expanded and Revised. Dev Med Child Neurol 39, 214-223.
20. Peacock K (1993). Swimming with independent measurement: manual for evaluation. London: Halliwick Association of Swimming Therapy.
21. Schuster C (2004). A note on the interpretation of weighted kappa and its relations to other rater agreement statistics for metric scales. Educ Psychol Meas 64, 243-253.
22. Stevens SS (1946). On the theory of scales of measurement. Science 103(2684), 677-680.
23. Stevens SS (1975). Psychophysics. New York: Wiley.
24. Tirosh R, Katz-Leurer M, Getz M (2008). Halliwick-based aquatic assessments: reliability and validity. Int J Aquat Res Educ 2, 224-236.
25. Tukey JW (1961). Data Analysis and Behavioral Science or Learning to Bear the Quantitative Man's Burden by Shunning Badmandments. In The collected works of John W. Tukey vol. III. Belmont, CA: Wadsworth (1986); pp. 391-484.
26. Valiquette CAM, Lesage AD, Cyr M, Toupin J (1994). Computing Cohen's kappa coefficients using SPSS MATRIX. Behav Res Methods Instrum Comput 26, 60-61.
27. Vidmar G (2010). Evidence in medicine. Rehabilitation (Ljubljana) 10(S1), 4-11.
28. Velleman P, Wilkinson L (1993). Nominal, ordinal, interval, and ratio typologies are misleading. Am Stat 47, 65-72.
29. Zand Scholten A, Borsboom D (2009). A reanalysis of Lord's statistical treatment of football numbers. J Math Psychol 53(2), 69-75.

Table 1: Characteristics of the children included in the inter-rater reliability study.

Characteristic   Category                      Value
Gender           Male                          15
                 Female                        22
Age              Mean                          14 years
                 Range                         7-22 years
Diagnosis        Autistic spectrum disorder    1
                 Cerebral vascular insult      1
                 Chromosomopathy               1
                 Down syndrome                 4
                 Myelomeningocele              1
                 Mental retardation            1
                 Cerebral palsy                28
                   GMFCS level I               3
                   GMFCS level II              5
                   GMFCS level III             5
                   GMFCS level IV              7
                   GMFCS level V               8

GMFCS, Gross Motor Function Classification System level (Palisano et al., 1997)

Table 2: Slovenian National Evaluation System of Swimming Abilities (short version).

Level  Description of swimming ability
0      Not able to swim
1      Able to glide on the water, arms forward, face in water for 5 seconds
2      Swimming in free style for 8 m
3      Swimming in free style for 25 m, starting in water
4      Swimming in free style for 35 m, starting from the edge of pool
5      Swimming in free style for 50 m, starting from the edge of pool;
       Able to change body position from prone lying to horizontal position and back to supine position
6      Able to swim for 10 minutes;
       Norms for 50 or 100 m of freestyle by age and gender
7      Able to swim breast stroke, back stroke and freestyle, each for 50 m;
       Able to jump into water head forward;
       Norms by age and gender
8      Able to swim 200 m in 5 minutes or less;
       Able to swim 15 m under the water;
       Mastered rescue-from-water techniques

Table 3: SWIM items.

Pool-skill  Short description
A           Water entry development: the extent of support needed for a swimmer to enter the water at any pool setting
B           Water adjustment development: the extent of support needed for a swimmer to be in the water
C           Breath control development: from being able to blow above the water to being able to submerge and hum safely
D           Balance development: being able to control body position in vertical and back float position
E           Backwards transversal rotation development: being able to control movement from chair (or curled) position to back float position
F           Forwards transversal rotation development: being able to control movement from back float to chair or prone float position
G           Sagittal rotation development: being able to control body while moving sideways by changing position of head and reaching with arm
H           Longitudinal rotation development: being able to do longitudinal roll from back to back float position
I           Combined rotation development: being able to control combination of rotations
J           Water stroke development: support needed and distance
K           Exit development: the extent of support needed for a swimmer to exit the water

Table 4: Descriptive statistics of SWIM score for each level of the National Evaluation System of Swimming Abilities (NESSA).

                SWIM score
NESSA   N    Mean   SD     Min   Max   Me   IQR
0       19   34.5   4.1    27    43    34   7
1       4    43.3   7.4    37    54    41   9
2       9    59.8   12.5   45    74    61   26
3       8    68.9   4.2    60    73    70   4
4       5    72.8   1.3    71    74    73   2
5       7    74.0   3.1    69    77    75   4
6       2    76.0   0.0    76    76    76   0

N, number of children; SD, standard deviation; Min, minimum; Max, maximum; Me, median; IQR, interquartile range

Table 5: Difference in average score between the items on adjustment to water and breathing control compared to the items on rotations (with and without the item on balance).

SWIM items   Mean   SD     p(t)     Median   IQR    p(EMP)
B + C        5.35   1.41            5.75     2.13
vs. D to I   4.50   1.94   <0.001   4.50     3.75   <0.001
vs. E to I   4.60   2.11   <0.001   4.60     4.40   <0.001

B, adjustment to water; C, breath control; D, balance; E, backwards transversal rotation; F, forwards transversal rotation; G, sagittal rotation; H, longitudinal rotation; I, combined rotation; SD, standard deviation; t, matched-pairs t-test; IQR, interquartile range; EMP, exact Wilcoxon matched-pairs signed-rank test

Table 6: Average scores of SWIM items and mean total score for both pairs of instructors.

SWIM item   Mean 1   Mean 2   p(t)    Median 1   Median 2   p(EMP)
A           6.00     6.05     0.324   7          7          0.625
B           5.89     5.84     0.600   7          7          0.813
C           5.49     5.57     0.324   6          6          0.531
D           5.19     5.27     0.262   5          5          0.500
E           5.05     5.16     0.103   5          5          0.219
F           5.22     5.19     0.768   6          6          1.000
G           5.00     5.22     0.058   6          6          0.094
H           4.41     4.51     0.160   6          6          0.289
I           4.57     4.54     0.768   5          5          1.000
J           5.00     4.97     0.744   7          7          1.000
K           5.27     5.22     0.571   7          7          1.000
Total       57.08    57.54    0.104   63         64         0.079

t, matched-pairs t-test; IQR, interquartile range; EMP, exact Wilcoxon matched-pairs signed-rank test; p-values are not adjusted for multiple tests

Table 7: Inter-rater reliability estimates for each SWIM item and total SWIM score.

SWIM item   ICC     κWQ
A           0.975   0.974
B           0.922   0.920
C           0.954   0.953
D           0.953   0.952
E           0.975   0.974
F           0.958   0.957
G           0.905   0.902
H           0.983   0.982
I           0.973   0.972
J           0.979   0.978
K           0.968   0.967
Total       0.996   0.996

ICC, intraclass correlation; κWQ, weighted Cohen's kappa with quadratic weights

Figure captions

Figure 1: Agreement on total SWIM scores between the two (pairs of) raters.

Figure 2: Bland-Altman plot for total SWIM score.