INCONSISTENCY IN THE ANALYSIS OF MORPHOLOGICAL DEFORMITIES IN CHIRONOMIDAE (INSECTA: DIPTERA) LARVAE

Environmental Toxicology

JOHANNA SALMELIN, KARI-MATTI VUORI, and HEIKKI HÄMÄLÄINEN

Environ Toxicol Chem., Accepted Article, DOI: 10.1002/etc.3010

This article is protected by copyright. All rights reserved

Accepted Article: "Accepted Articles" are peer-reviewed, accepted manuscripts that have not been edited, formatted, or in any way altered by the authors since acceptance. They are citable by the Digital Object Identifier (DOI). After the manuscript is edited and formatted, it will be removed from the “Accepted Articles” Web site and published as an Early View article. Note that editing may introduce changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. SETAC cannot be held responsible for errors or consequences arising from the use of information contained in these manuscripts.

Environmental Toxicology Environmental Toxicology and Chemistry DOI 10.1002/etc.3010

Running title: Inconsistency in chironomid deformity assessment

JOHANNA SALMELIN,*† KARI-MATTI VUORI,‡§ and HEIKKI HÄMÄLÄINEN†

†Department of Biological and Environmental Science, University of Jyväskylä,

Jyväskylä, Finland

‡Laboratory Centre/Ecotoxicology and Risk Assessment, Finnish Environment Institute,

Jyväskylä, Finland

§South Karelia Institute, Lappeenranta University of Technology, Lappeenranta, Finland

* Address correspondence to [email protected]

This article is protected by copyright. All rights reserved

Submitted 13 October 2014; Returned for Revision 2 February 2015; Accepted 3 April 2015

Abstract: The incidence of morphological deformities of chironomid larvae as an indicator of sediment

toxicity has been studied for decades. However, standards for deformity analysis are lacking. We

evaluated whether twenty-five experts diagnosed larval deformities in a similar manner. Based on

high-quality digital images, the experts rated 211 menta of Chironomus spp. larvae as normal or deformed.

The larvae were from a site with polluted sediments or from a reference site. We revealed the site of origin to a

random half of the raters, whilst the rest conducted the assessment blind. We quantified the inter-rater

agreement by kappa coefficient, tested if open and blind assessments differed in deformity incidence

(DI) and in differentiation between the sites, and identified those deformity types rated most

consistently/inconsistently. The total DI varied greatly among experts, from 10.9 to 66.4%. Kappa

coefficient across rater pairs averaged 0.52, indicating insufficient agreement. The deformity types rated

most consistently were missing teeth and extra teeth. The open and blind assessments did not

differ, but differentiation between sites was clearest for raters who counted primarily absolute

deformities like missing and extra teeth and excluded apparent mechanical aberrations or deviations in

tooth size or symmetry. The highly differing criteria in deformity assignment have likely led to

inconsistent results in midge larval deformity studies and indicate an urgent need for standardization of

the analysis.

Keywords: Benthic macroinvertebrates, Biomarkers, Sediment toxicity, Morphological deformities,

Inter-rater agreement

INTRODUCTION

Biological indicators are increasingly used in environmental impact assessment [1]. For

instance, chemicals as environmental stressors can affect organisms by inducing morphological

abnormalities or deformities, which can be used as biomarkers of exposure, effect or both. One sign of

sublethal stress caused by environmental contamination is the increased incidence of structural

deformities in insects, but exposure to toxic agents also increases deformity incidence in gastropods

and fishes [2-3]. Ontogenetic deformities in the head capsule structures of Chironomidae (Insecta:

Diptera) larvae are used as an indicator of sediment toxicity in aquatic ecosystems. Chironomids, or

non-biting midges, have an immature aquatic larval phase with four larval instars. They typically

constitute a significant part of the benthic invertebrate fauna in both lotic and lentic habitats. One of the

most widely distributed and species-rich chironomid genera is Chironomus [4]. Chironomus spp. spend

their larval stage in tubes they construct within soft sediments of lakes and slowly flowing parts of

rivers, where they feed on detritus particles with associated microorganisms and deposited algae [5].

Hence they are readily exposed to the substances bound to the sediment or dissolved in the pore water.

A deformity in chironomid larvae has been defined as “any morphological feature that departs

from the normal configuration” [6,7]. Deformities in chironomid larvae may occur in various

morphological structures including antennae, mandibles, premandibles, epipharyngeal pecten and labral

lamellae [8], but usually and most easily deformities are examined from the mentum. The mentum is a

structure on the ventral side of the fully sclerotized head capsule, posterior to the mouthparts. The

mentum of Chironomus spp. consists of a double-walled labial plate with 13 sclerotized teeth with

dorsal and ventral walls, dorsomentum and ventromentum [9]. The mentum possesses six lateral teeth

on both sides of the trifid median tooth (Figure 1). Lateral teeth can be further subdivided into two

larger inner lateral teeth and four smaller outer lateral teeth [8]. Typical aberrations referred to as

deformities include missing, additional, split and asymmetrical teeth, distinct deep smooth-edged

indentations called Köhn gaps or a broad deviation from the normal tooth configuration [8].

Chironomid deformities have received attention as a potential early warning indicator of pollution since

the study of Hamilton and Sæther in 1971 [10], followed by numerous laboratory and field studies.

Xenobiotic substances are suspected to induce morphological deformities during larval ontogeny by

disrupting developmental processes between moults. An elevated incidence of mentum deformities has

been considered a response to metal contamination [11-13], some endocrine-disrupting chemicals [14],

or complex mixtures of xenobiotic substances in sediments [15,16]. However, some studies, mainly

laboratory bioassays, have indicated contradictory results with no effects of metals on Chironomus

mentum deformity rate [17,18].

It is evident from the literature that variable evaluation criteria of deformities are applied by

different researchers. For instance, a split median tooth or mechanical aberrations, such as breakage

and wear, are counted as deformities in some studies, but not in others [18-25]. Moreover, some

studies do not report the criteria used at all, or which aberrations are counted as deformities.

Commonly cited published criteria are by Dickman et al. [26], Warwick and Tisdale [8] and Janssens

de Bisthoven et al. [27]. Clearly if different researchers use divergent definitions of deformity, their

results are not comparable. Hence, subjectivity and inconsistency in deformity analysis might also

partially explain the contrasting results regarding deformity response.

Reliability in the context of inter-rater agreement studies refers to the degree to which

different raters agree in their assessments [28]. Reliable assessment or diagnosis results are obtained

when different experts independently make similar evaluation decisions on an identical set of data.

Inter-rater reliability studies are common in medical sciences, including psychiatric diagnostics,

clinical trials and interpretation of radiographs or cell samples, and are also widely used in social

sciences and linguistics [29,30]. In contrast, studies quantifying inter-rater agreement on the deformity

assessment or other morphological biomarkers of aquatic organisms are lacking. An expert as a rater is

an important factor in classifying items, and the criteria used influence the results of an evaluation and

a whole study. One goal of inter-rater agreement studies is to reduce subjectivity associated with any

assessment requiring some degree of interpretation.

We therefore studied inter-rater agreement on Chironomus spp. deformity analysis among

experts. We aimed: 1) to quantify inter-rater agreement among experts in deformity assessment (i.e.,

whether experts diagnose deformities in a similar manner, sharing the same definition of deformity

and normality); 2) to study whether preconception in the form of background information about site

contamination biases deformity evaluation results; and 3) to improve consistency of deformity analysis

by identifying those types of anomalies with greatest agreement/disagreement among raters. In this

study we focused on Chironomus spp. mentum deformities which are most commonly used in

deformity assessments.

MATERIALS AND METHODS

Test set up

The test material consisted of microscopic EDF (Extended Depth of Focus) images of menta

of 211 fourth instar Chironomus spp. larvae collected from the field. The larvae were fixed on

microscope slides with polyvinyl-lactophenol. Both C. anthracinus- and C. plumosus-type larvae were

included in the samples (Figure 1). Larvae were sampled in 1996 from Lake Saimaa, southeastern

Finland. Most of the larvae (n = 154) were from a reference site upstream from a local pollution source,

a paper and pulp mill. The rest of the larvae (n = 57) were collected from a site immediately

downstream from the factory. The sediment of this impacted site was heavily contaminated with

mercury (Hg) and organic compounds like organohalogens and chlorophenolics (site A1 in Soimasuo

et al. [31]). Thus the test represents a realistic case, and of the original 272 mentum samples only those

that were dried out and damaged were excluded from the test material. The EDF technique combines

pictures taken at different focus levels (Z-series) into one image with highly improved

clarity. A digital camera (Nikon DS) connected to an optical microscope (Olympus BX41 System

Microscope) was used to take the images. Several (3-13) images from different focus levels of each

mentum were combined into one using the laboratory image analysis software NIS-Elements D 4.11.01

64-bit, and were saved as TIF images. The test material was shared with participating experts via Funet

FileSender, which is intended for distributing large data files. However, due to data transfer problems,

for some participants the material was copied onto a USB flash drive and sent by post.

The participants in the evaluation were invited based on their expertise in Chironomus spp.

deformity analysis, as verified by their publication record on this subject. Four participants were

included based on a suggestion from a colleague or mentor who was not able to participate. Altogether

25 experts (raters hereafter) from 13 different countries (16 raters from Europe, 3 from South America,

2 from North America, 2 from Asia, 1 from Africa and 1 from Australia) agreed to participate and

analyze the test material. Only two invited experts declined or did not answer. The test design was

fully-crossed; hence all experts rated every case. Raters were asked to perform their analysis

independently. Independence was reinforced by anonymity of the raters, who did not know the

identities of the other participants. Raters were required to explicitly mark each mentum (identified by

the number of the image) either as normal or deformed with the number 0 (zero) or 1, respectively, in

the manner they would do in their routine investigation. The raters were also able to fill in some

additional information about deformity type, location or severity, or some other comments. All

assessment results were handled anonymously. After all results were received, we provided all raters

with a summary of the results showing their own personal ratings compared with the others.

In order to evaluate a possible preconception bias, information about the origins of the larvae

was revealed to a randomly selected half of the raters when the test material was delivered. This

group of raters hence performed an open assessment, whereas the other half who lacked the

background information made the assessment blind. In addition, to evaluate intra-rater consistency one

duplicate image within the test material was included in both the reference and impacted sets but with

the image rotated so that the rater would not recognize the duplication. This duplicate image was the

only artefact in the data set, and was not counted in the calculation of deformity incidence.

Data analysis

The frequency of deformed individuals within a chironomid population (deformity incidence,

DI) [32], usually expressed as a percentage (%), was calculated as

DI = (d/n) × 100

where d is the number of deformed larvae and n the number of larvae examined. DI% was calculated

for all individual assessment results across all larvae and also separately for larvae from the reference

and impacted sites.
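
The DI calculation above can be illustrated with a short sketch. The function names, the site labels and the toy ratings are assumptions made for illustration only; they are not the study data or the authors' code.

```python
def deformity_incidence(ratings):
    """Return DI (%) for a sequence of 0/1 ratings (0 = normal, 1 = deformed)."""
    n = len(ratings)  # number of larvae examined
    if n == 0:
        raise ValueError("no larvae examined")
    d = sum(ratings)  # number of larvae rated deformed
    return 100.0 * d / n


def di_by_site(ratings, sites):
    """Return overall DI and DI per site label for one rater's assessment."""
    result = {"all": deformity_incidence(ratings)}
    for site in sorted(set(sites)):
        site_ratings = [r for r, s in zip(ratings, sites) if s == site]
        result[site] = deformity_incidence(site_ratings)
    return result


# Hypothetical toy data: 8 larvae from a reference site, 4 from an impacted site.
ratings = [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1]
sites = ["reference"] * 8 + ["impacted"] * 4
print(di_by_site(ratings, sites))
```

Applied to one rater's 0/1 column, the same function would give that rater's overall DI and the site-specific DI values.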

The estimate of raw agreement, the proportion of cases for which two raters agree, was

calculated and averaged among multiple raters. The inter-rater agreement was quantified using chance-

corrected kappa coefficient (κ) for each pair of raters [33]. An overall estimate of kappa was calculated

as the arithmetic mean of pairwise κ-values (n = 300) to define consistency among all raters as

suggested by Light [34]. Possible values for kappa statistics range from −1 to 1, with 1 indicating

perfect agreement, 0 indicating agreement no better than chance, and −1 indicating complete

disagreement. Landis and Koch [35] provide guidelines for interpreting kappa values, with values from

0.0 to 0.2 indicating slight agreement, 0.21 to 0.40 fair agreement, 0.41 to 0.60 moderate agreement,

0.61 to 0.80 substantial agreement, and 0.81 to 1.0 almost perfect or perfect agreement. An acceptable

kappa indicating a sufficient level of agreement has been suggested to be > 0.60–0.75 [35,36].
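
As a minimal sketch of the agreement measures described above, the following computes Cohen's kappa for each rater pair, averages the pairwise values as suggested by Light, and attaches the Landis and Koch label. All function names and the three toy rating vectors are hypothetical; the study itself used IBM SPSS Statistics 20, not this code.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two raters' 0/1 ratings of the same cases."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("ratings must be non-empty and of equal length")
    p_o = sum(x == y for x, y in zip(a, b)) / n  # raw (observed) agreement
    a_def, b_def = sum(a) / n, sum(b) / n  # share each rater calls deformed
    p_e = a_def * b_def + (1 - a_def) * (1 - b_def)  # agreement expected by chance
    if p_e == 1.0:
        raise ValueError("kappa undefined: both raters used a single category")
    return (p_o - p_e) / (1 - p_e)


def light_kappa(all_ratings):
    """Arithmetic mean of kappa over all rater pairs."""
    pairs = list(combinations(all_ratings, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


def landis_koch(kappa):
    """Verbal label for a kappa value using the Landis and Koch bands."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical ratings of 10 menta by three raters (0 = normal, 1 = deformed).
r1 = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
r2 = [1, 0, 0, 0, 0, 0, 1, 0, 0, 1]
r3 = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0]
mean_kappa = light_kappa([r1, r2, r3])
print(round(mean_kappa, 2), landis_koch(mean_kappa))
```

With 25 raters, `combinations` yields the 300 rater pairs mentioned in the text.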

We used the Mann-Whitney U test to examine differences in DI between blind and open

assessments for both reference and impacted sites. We used the χ² test separately for each assessment (n =

25) to study whether the distribution of the categorical variable (normal/deformed) differed between

the reference site and the impacted site.
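
The two tests named above can be sketched with the Python standard library. This is an illustration under stated assumptions, not the authors' SPSS procedure: the chi-square function is the Pearson statistic without continuity correction (the source does not say which variant was used), the Mann-Whitney function returns only the U statistic without a p-value, and all names and counts are hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Mann-Whitney U statistic (smaller of the two U values), midranks for ties."""
    pooled = sorted(list(x) + list(y))
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    rank_sum_x = sum(rank_of[v] for v in x)
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    return min(u_x, len(x) * len(y) - u_x)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]]; returns (statistic, p)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    stat = n * (a * d - b * c) ** 2 / denom
    p = math.erfc(math.sqrt(stat / 2))  # chi-square survival function, 1 df
    return stat, p


# Hypothetical DI values (%) from blind and open raters for one site.
blind = [4.5, 12.3, 20.1, 33.8]
opened = [6.5, 15.6, 18.2, 40.9]
print(mann_whitney_u(blind, opened))

# Hypothetical counts for one rater: (deformed, normal) at reference and impacted site.
print(chi_square_2x2(12, 142, 20, 37))
```

Running `chi_square_2x2` once per rater mirrors the per-assessment design described in the text.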

From the initial data analysis it became apparent that the raters had different propensities to

designate deformities. Conservative raters seemed to judge deformities cautiously, include only

obvious abnormalities and exclude any wear or mechanical damage, which were more often counted by

the non-conservative raters. Therefore we decided to evaluate the practical implications of these

contrasting strategies for assessments by evaluating their sensitivity to distinguish the reference site

from the impacted site using odds ratio as an effect size measure. We also estimated the total sample

size required to detect the difference in DI between the reference and the impacted site for each

assessment according to Fleiss et al. [37] by interpolation from tabulated values and adjusted for

unequal sample sizes. All statistical analyses were performed using IBM SPSS Statistics 20.
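
The odds ratio used as an effect size can be sketched as follows. The confidence interval (Woolf's log-based method) and the 0.5 correction for empty cells are additions for illustration and are not described in the source; the function name and the counts for the two hypothetical raters are likewise assumptions.

```python
import math


def odds_ratio(def_imp, norm_imp, def_ref, norm_ref, z=1.96):
    """Odds of a 'deformed' rating at the impacted vs the reference site.

    Returns (odds_ratio, (lower, upper)) with an approximate 95% interval.
    """
    cells = [def_imp, norm_imp, def_ref, norm_ref]
    if any(c == 0 for c in cells):
        # Haldane-Anscombe correction: add 0.5 to every cell if any cell is empty.
        cells = [c + 0.5 for c in cells]
    a, b, c, d = cells
    ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    lower = math.exp(math.log(ratio) - z * se)
    upper = math.exp(math.log(ratio) + z * se)
    return ratio, (lower, upper)


# Hypothetical conservative rater: few deformities overall, clear site contrast.
print(odds_ratio(20, 37, 12, 142))
# Hypothetical non-conservative rater: many deformities at both sites.
print(odds_ratio(38, 19, 95, 59))
```

An odds ratio near 1 means the rater's assessment barely separates the two sites, whereas larger values indicate a stronger contrast.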

RESULTS

No raters produced completely identical assessments. Occasionally the total number of

deformities, and hence DI, was the same between two raters, but even then the menta rated normal and

deformed were never exactly the same. Of the 211 menta, 48 were rated normal (35 and 13 from the

reference and impacted site, respectively) and 5 deformed (1 and 4) by all 25 raters resulting in

altogether 53 cases with complete agreement (Figure 2). All five menta that were consistently rated as

deformed had missing or extra teeth, sometimes together with a median tooth deformation (Figure

3A–C). However, occasionally even some quite obvious deformities, like a missing tooth, remained

unnoticed. A simple coding error might be suspected in some cases when a mentum having a striking

deformity like a Köhn gap (Figure 3D) was recorded as normal.

The most controversial cases, which approximately half of the participants (11–14, or 44–56%)

considered normal and the other half deformed, included missing teeth that could represent either

mechanical damage or an actual developmental abnormality (Figure 4A). Other cases rated

inconsistently included teeth with some deviations from normal symmetry (Figure 4B–F). Most

participants generally did not consider mechanical damage a deformity (Figure 5A), whereas

some rated apparently worn teeth (Figure 4C) more or less systematically as deformed.

The number of cases classified as normal and deformed ranged from 71 to 188 and from 23 to 140,

respectively (Figure 6A). Correspondingly, the overall deformity incidence varied from 10.9% to

66.4% depending on the rater. DI calculated separately for the reference site was 3.9–64.7%, and for

the impacted site 26.8–69.6% (Figure 6B). The average raw agreement was 0.80 (±sd 0.14) with a

range of 0.43–0.99. Pairwise kappa ranged from 0.09 to 0.96, indicating large variation in the

inter-rater agreement. Some rater pairs agreed almost perfectly, while some pairs had very divergent

deformity interpretations. The average kappa across rater pairs was 0.52 (±sd 0.23). Average kappa

excluding the results of four raters suggested by peers (numbers 2, 10, 21 and 23 in Figure 6A) was

0.54, indicating that the inclusion of these originally uninvited raters did not materially affect the observed

agreement. Of pairwise kappa-values, 37.7% were above 0.60 and 25.3% above 0.75. Hence the

majority, 62.3% and 74.7%, of pairwise kappa-values fell below these two suggested cut-off values for

sufficient agreement.

The blind and open assessments for the reference site resulted in an average DI of 17.0% and

19.1%, respectively, with no statistically significant difference (Mann-Whitney U = 61.0, p = 0.786).

The same was true for the impacted site, where the blind (DI = 35.0%) and open (DI = 39.7%) assessments did

not differ (Mann-Whitney U = 62.0, p = 0.833).

The duplicate mentum used to evaluate intra-rater consistency had a split median tooth (Figure 4B),

and was controversially considered either deformed (15–16 raters) or normal (9–10 raters). However,

most raters (88%) showed intra-rater consistency and evaluated the duplicate mentum similarly. All

three experts who rated the duplicate mentum inconsistently conducted the assessment blind.

In 76% of all assessments the distributions of the dichotomous response (normal/deformed)

differed between the reference and impacted sites according to the χ² test. The estimated difference

was clearly negatively related to the propensity of raters to designate deformities (Figure 7); hence

conservative deformity assessments were more likely to distinguish the reference and impacted sites

than were non-conservative assessments. The six assessments which did not differentiate between

sites had the highest overall DI (Figures 6, 7). This high overall DI resulted mainly from interpreting

worn teeth as deformed. Conservative deformity assessments showed greater effect sizes (Figure 8A),

which was reflected in considerably smaller sample sizes required to detect a difference in DI between

the impacted and the reference site (Figure 8B).

DISCUSSION

We studied inter-rater agreement among experts in chironomid deformity analysis, and found that

experts evaluate deformities rather subjectively and inconsistently. Whereas some rater pairs agreed

closely, sharing the same interpretation of deformities, others had very low agreement. According to

Madden et al. [38], mentum deformities may be categorized as absolute, quantitative deformities, such

as missing or extra teeth, or as relative, qualitative deformities, such as changes in shape or size of

teeth. In our study, the most obvious, ‘absolute’ deformities were generally rated quite consistently,

and the inconsistency was mostly in interpretation of relative, qualitative deformities, as well as in

separation of ontogenetic anomalies from mechanical wear or damage. The cases with the highest

disagreement included teeth slightly deviating from the normal shape. These deviations were

interpreted either as wear or as developmental deformities. When a tooth breaks, the two edges

(dorsal and ventral) with rough and uneven outlines become visible (Figure 5), but sometimes (Figure

4A) this cannot be seen clearly. A minor source of inconsistency might have resulted from careless

errors, for example when a mentum clearly lacking a tooth was rated as normal. Coding errors might

also have occurred. However, there were few apparent examples of these kinds of errors, and most of

the inconsistency was clearly due to different interpretations of deformity and normality.

The low value of the chance-corrected agreement measure, kappa, indicated only moderate

inter-rater agreement according to Landis and Koch [35] and Altman [39], and hence also demonstrated

insufficient consistency in deformity analysis of chironomid larval mentum. Interpretation of kappa

values is controversial, and cut-off values indicating strength of agreement have been considered

arbitrary. This was acknowledged by Landis and Koch [35], but they nevertheless considered boundary

values to provide useful benchmarks. Kappa coefficient and kappa type measures have also been

criticized for positive bias towards marginal homogeneity and negative bias towards trait prevalence

[40], giving higher values when raters use the rating categories unequally and lower values if one of the

categories is more common. In our study some rater pairs used rating categories unequally. Trait

prevalence potentially also affected kappa values, as the category ‘normal’ is expected to be much

more common than the category ‘deformed’. Despite the kappa-critique and the ongoing debate about

the most appropriate measure for inter-rater agreement on nominal data [40], our raw data alone

(Figure 6) clearly indicated a severe problem with large variation in deformity assessments among

experts.
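
The dependence of kappa on trait prevalence discussed above can be illustrated with two hypothetical 2×2 tables that share the same raw agreement. The function name and all counts are assumptions chosen for illustration; they are not taken from the study data.

```python
def kappa_from_table(both_def, only_a, only_b, both_norm):
    """Raw agreement and Cohen's kappa from a 2x2 table of two raters' decisions."""
    n = both_def + only_a + only_b + both_norm
    p_o = (both_def + both_norm) / n  # raw agreement
    a_def = (both_def + only_a) / n  # share of cases rater A calls deformed
    b_def = (both_def + only_b) / n  # share of cases rater B calls deformed
    p_e = a_def * b_def + (1 - a_def) * (1 - b_def)  # chance agreement
    return p_o, (p_o - p_e) / (1 - p_e)


# Both tables: 100 cases, raters agree on 80 of them.
balanced = kappa_from_table(40, 10, 10, 40)  # 'deformed' used in half the cases
skewed = kappa_from_table(5, 10, 10, 75)  # 'deformed' is rare

print(balanced)  # raw agreement 0.80, kappa about 0.60
print(skewed)  # raw agreement 0.80, kappa about 0.22
```

In this toy case the raw agreement is identical, yet kappa drops sharply when one category dominates, which is why raw agreement and kappa are best read together.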

The reference site for our study was upstream from a local pollution source, a paper and pulp

mill in southern Lake Saimaa. The sediments of the impacted site immediately downstream from this

point source of pollution were heavily contaminated with Hg and organochlorine compounds. Contrary

to our expectation, providing information about site status did not affect the deformity assessments, and

the results were not biased by any prejudice of the raters. However, this result does not conclusively

rule out the possibility of this type of bias, as in the test situation the participants knew that their

evaluations would be compared with others and therefore they might have been more objective in their

ratings than otherwise. It has been previously documented that raters provide more reliable data when

they know that their performance is monitored [41].

Consistent with the background DI of 2.8–3.1% and < 8%, reported by Dickman et al. [26]

and Vermeulen [7], respectively, a tentative background DI in our study region, based on data from 7

sites in lakes with no significant local pollution, is 5% (H. Hämäläinen, University of Jyväskylä,

Jyväskylä, Finland, unpublished data). In the present study, eleven raters estimated a DI < 8% for the

reference site. All raters, except one, estimated higher DI in the impacted site than in the reference site,

irrespective of their assessment strategies, suggesting that non-conservative assessments could still be

diagnostic. However, the required sample sizes for the most non-conservative assessments to detect any

difference in DI between the sites were huge (> 1000 larvae) and undoubtedly unfeasible in

environmental monitoring or ecotoxicological studies.
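
Why a small difference in DI between sites demands such large samples can be sketched with the standard normal-approximation formula for comparing two proportions. This is not the authors' procedure, which interpolated from the tabulated values of Fleiss et al. [37] and adjusted for unequal sample sizes. The significance level, power, equal group sizes, function name and DI values below are all assumptions.

```python
import math
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Larvae needed per site to detect p1 vs p2 (two-sided test, equal groups)."""
    if p1 == p2:
        raise ValueError("proportions must differ")
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_b = NormalDist().inv_cdf(power)  # quantile for the desired power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_a * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


# Hypothetical conservative assessment: DI of 5% (reference) vs 30% (impacted).
print(n_per_group(0.05, 0.30))
# Hypothetical non-conservative assessment: DI of 60% vs 65%.
print(n_per_group(0.60, 0.65))
```

Under these assumptions the first contrast needs a few dozen larvae per site, whereas the second needs more than 1000 per site.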

Potential biases, or sources of systematic error, in animal toxicology studies include the lack of

randomization, blinding or specification of inclusion/exclusion criteria [42]. All these deficiencies

should also be taken into account to reduce subjectivity and inconsistency in chironomid deformity

analysis. Although the deformity assessments in our study did not appear to be subject to any

systematic error due to the influence of information about site status, blinding is recommended in all

studies in which results might be influenced by a researcher’s expectations or preferences. Blind

assessment generally produces more consistent results than open assessment [43], and has long been a

requirement in medical studies and clinical trials.

A stronger and more consistent separation in DI between the reference and impacted sites

resulted from conservative assessment of deformities than from non-conservative

assessment practice. In conservative assessment it was mainly ‘absolute’ abnormalities (see above) that

were judged as deformities, whereas abnormalities which might result from wear or mechanical

damage (worn, broken, different-sized, or skewed teeth) were excluded. This suggests that conservative

assessment, in which only absolute deformities are counted, reflects a potential toxicity response more

consistently and appears to be more sensitive in detecting an effect. This higher sensitivity also means

greater cost-efficiency, as smaller sample sizes are needed to detect an effect.

The inclusive definition of a deformity as a deviation from the normal configuration certainly

invites subjectivity and variability in assessments. Inter-rater agreement might be substantially

increased by clear, mutually agreed guidelines for assessment criteria, which are currently lacking

from chironomid deformity assessments. However, defining such criteria is problematic, as the

deformity status of a larva cannot be confirmed or validated in the absence of an independent ‘truth’.

Nevertheless, the criteria might be based on a consensus opinion of experts, preferably supported by

empirical studies. As a starting point towards this end the data obtained in our trial can be used to form

an expert consensus. Moreover, the results suggest that conservative assessment considering only

absolute deformities sensu Madden et al. [38] gives more consistent results. Furthermore, to avoid subjective

bias and to give the results greater credibility, we strongly recommend conducting any deformity

assessments blind. In environmental toxicity and risk assessment this principle should be adopted more

widely and extended to deformity analysis of other organisms.

CONCLUSIONS

Chironomid larval deformities are used in bioindication of exposure to chemical stressors with

likely effects also on populations and communities. Our study of inter-rater agreement on deformity

assessment indicated that experts evaluate deformities subjectively and inconsistently, which weakens

the indicator value of chironomid deformity analysis. Given that most of these experts have published

scientific research articles on the incidence of deformities, we can conclude that the comparability and

reliability of results in the literature are highly questionable. Hence, the equivocal and partly contrasting

results of studies concerning chironomid deformity incidence as a stress response to contamination

might partly result from inconsistent interpretation of deformities. Even though we focused only on the

mentum of Chironomus, we expect that the results should be broadly generalizable to other

morphological structures and taxa used in chironomid deformity studies. Reliability and consistency

could be increased by developing guidelines to be applied in the chironomid deformity assessment.

Moreover, assessments should be conducted blind. Our results also suggest that more sensitive

detection of effects of sediment toxicity can be obtained by taking into account only absolute

deformities like missing and extra teeth, and Köhn gaps. Inconsistency might also be reduced in the

future by developing automated recognition methods for deformity identification using computer-based

image analysis.

Acknowledgment—We warmly thank all participating experts for their indispensable efforts in making

this study possible, K. Meissner for technical support with microscope photography, and R. Jones for

proofreading the manuscript. We also thank two anonymous reviewers for valuable comments that

improved the quality of the manuscript. The study was organized by the University of Jyväskylä,

Finland, in collaboration with the Finnish Environment Institute, and funded by the Finnish Funding

Agency for Innovation.

Data availability—All data are available upon request from the corresponding author

([email protected]).

REFERENCES

1. Adams SM. 2002. Biological indicators of aquatic ecosystem stress: introduction and overview.

In Adams SM, ed, Biological indicators of aquatic ecosystem stress. American Fisheries

Society, Bethesda, Maryland. pp. 1-11.

2. Conroy PT, Hunt JW, Anderson BS. 1996. Validation of a short-term toxicity test endpoint by

comparison with longer-term effects on larval red abalone Haliotis rufescens. Environmental

Toxicology and Chemistry 15:1245-1250.

3. Carls MG, Rice SD, Hose JE. 1999. Sensitivity of fish embryos to weathered crude oil: Part I.

Low-level exposure during incubations causes malformations, genetic damage, and mortality in

larval Pacific herring (Clupea pallasi). Environmental Toxicology and Chemistry 18:481–493.

4. Wiederholm T. 1983. Chironomidae of the Holarctic region. Keys and diagnosis. Part I Larvae.

Ent. scand. Suppl. 19:1-457.

5. Johnson RK. 1987. Seasonal variation in diet of Chironomus plumosus (L.) and C. anthracinus

Zett. (Diptera: Chironomidae) in mesotrophic Lake Erken. Freshwater Biology 17:525-532.

6. Warwick WF. 1988. Morphological deformities in Chironomidae (Diptera) larvae as biological

indicators of toxic stress. In Evans MS, ed, Toxic contaminants and ecosystem health: A

Great Lakes focus. John Wiley & Sons, New York, pp. 281-320.

7. Vermeulen AC. 1995. Elaborating chironomid deformities as bioindicators of toxic sediment

stress: the potential application of mixture toxicity concepts. Ann. Zool. Fennici 32:265-285.

8. Warwick WF, Tisdale NA. 1988. Morphological deformities in Chironomus,

Cryptochironomus, and Procladius larvae (Diptera: Chironomidae) from two differentially

stressed sites in Tobin Lake, Saskatchewan. Can. J. Fish. Aquat. Sci. 45:1123-1144.

9. Sæther OA. 1971. Notes on general morphology and terminology of the Chironomidae

(Diptera). The Canadian Entomologist 103:1237-1260.

10. Hamilton AL, Saether OA. 1971. The occurrence of characteristic deformities in the chironomid

larvae of several Canadian lakes. The Canadian Entomologist 103:363-368.

11. Ilyashuk B, Ilyashuk E, Dauvalter V. 2003. Chironomid responses to long-term metal

contamination: a paleolimnological study in two bays of Lake Imandra, Kola Peninsula,

northern Russia. Journal of Paleolimnology 30:217–230.

12. Martinez EA, Wold L, Moore BC, Schaumloffel J, Dasgupta N. 2006. Morphologic and growth

responses in Chironomus tentans to arsenic exposure. Arch. Environ. Contam. Toxicol. 51:529–

536.

13. Di Veroli A, Goretti E, Paumen ML, Kraak MHS, Admiraal W. 2012. Induction of mouthpart

deformities in chironomid larvae exposed to contaminated sediments. Environmental Pollution

166:212-217.

14. Meregalli G, Pluymers L, Ollevier F. 2001. Induction of mouthpart deformities in Chironomus

riparius larvae exposed to 4-n-nonylphenol. Environmental Pollution 111:241-246.

15. Hudson LA, Ciborowski JH. 1996. Spatial and taxonomic variation in incidence of mouthpart

deformities in midge larvae (Diptera: Chironomidae: Chironomini). Can. J. Fish. Aquat. Sci. 53:297-

304.

16. Planelló R, Servia MJ, Gómez-Sande P, Herrero O, Cobo F, Morcillo G. 2015. Transcriptional

responses, metabolic activity and mouthpart deformities in natural populations of Chironomus

riparius larvae exposed to environmental pollutants. Environ Toxicol 30:383-395.

17. Langer-Jaesrich M, Köhler H-R, Gerhardt A. 2010. Can mouth part deformities of Chironomus

riparius serve as indicators for water and sediment pollution? A laboratory approach. J Soils

Sediments 10:414–422.

18. Arambourou H, Gismondi E, Branchu P, Beisel J-N. 2013. Biochemical and morphological

responses in Chironomus riparius (Diptera: Chironomidae) larvae exposed to lead-spiked

sediment. Environmental Toxicology and Chemistry 32:2558-2564.

19. Nazarova LB, Riss HW, Kahlheber A. 2004. Some observations of buccal deformities in

chironomid larvae (Diptera: Chironomidae) from the Ciénaga Grande de Santa Marta,

Colombia. Caldasia 26:275-290.

20. Dias V, Vasseur C, Bonzom J-M. 2008. Exposure of Chironomus riparius larvae to uranium:

Effects on survival, development time, growth, and mouthpart deformities. Chemosphere

71:574–581.

21. Ochieng H, de Ruyter van Steveninck ED, Wanda FM. 2008. Mouthpart deformities in

Chironomidae (Diptera) as indicators of heavy metal pollution in northern Lake Victoria,

Uganda. African Journal of Aquatic Science 33:135-142.

22. Al-Shami S, Rawi CSM, Nor SAM, Ahmad AH, Ali A. 2010. Morphological deformities in

Chironomus spp. (Diptera: Chironomidae) larvae as a tool for impact assessment of

anthropogenic and environmental stresses on three rivers in the Juru River System, Penang,

Malaysia. Environmental Entomology 39:210-222.

23. Morais SS, Molozzi J, Viana AL, Viana TH, Callisto M. 2010. Diversity of larvae of littoral

Chironomidae (Diptera: Insecta) and their role as bioindicators in urban reservoirs of different

trophic levels. Brazilian Journal of Biology 70:995-1004.

24. Gagliardi B, Pettigrove V. 2013. Removal of intensive agriculture from the landscape improves

aquatic ecosystem health. Agriculture, Ecosystems and Environment 176:1-8.

25. Saha D, Mazumdar A. 2013. Deformities of Chironomus sp. larvae (Diptera: Chironomidae) as

indicator of pollution stress in rice fields of Hooghly District, West Bengal. Journal of Today’s

Biological Sciences 2:44-54.

26. Dickman M, Brindle I, Benson M. 1992. Evidence of teratogens in sediments of the Niagara

River watershed as reflected by chironomid (Diptera, Chironomidae) deformities. Journal of

Great Lakes Research 18:467-480.

27. Janssens de Bisthoven L, Huysmans C, Ollevier F. 1995. The in situ relationships between

sediment concentrations of micropollutants and morphological deformities in Chironomus gr.

thummi larvae (Diptera, Chironomidae) from lowland rivers (Belgium): a spatial comparison. In

Cranston P, ed, Chironomids: From Genes to Ecosystems. CSIRO Publisher, East Melbourne,

Australia, pp 63–80.

28. Uebersax JS. 1988. Validity inferences from interobserver agreement. Psychological Bulletin

104:405-416.

29. Krippendorff K. 2004. Reliability in content analysis. Some common misconceptions and

recommendations. Human Communication Research 30:411–433.

30. Parchi P, de Boni L, Saverioni D, Cohen ML, Ferrer I, Gambetti P, Gelpi E, Giaccone G, Hauw

J-J, Höftberger R, Ironside JW, Jansen C, Kovacs GG, Rozemuller A, Seilhean D, Tagliavini F,

Giese A, Kretzschmar HA. 2012. Consensus classification of human prion disease histotypes

allows reliable identification of molecular subtypes: an inter-rater study among surveillance

centres in Europe and USA. Acta neuropathologica 124:517-529.

31. Soimasuo MR, Karels AE, Leppänen H, Santti R, Oikari AOJ. 1998. Biomarker Responses in

Whitefish (Coregonus lavaretus L. s.l.) Experimentally Exposed in a Large Lake Receiving

Effluents from Pulp and Paper Industry. Arch. Environ. Contam. Toxicol. 34:69–80.

32. Hämäläinen H. 1999. Critical appraisal of the indexes of chironomid larval deformities and their

use in bioindication. Ann. Zool. Fennici 36:179-186.

33. Cohen J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological

Measurement 20:37–46.

34. Light RJ. 1971. Measures of response agreement for qualitative data: Some generalizations and

alternatives. Psychological Bulletin 76:365–377.

35. Landis JR, Koch GG. 1977. The measurement of observer agreement for categorical data.

Biometrics 33:159–174.

36. Fletcher I, Mazzi M, Nuebling M. 2011. When coders are reliable: The application of three

measures to assess inter-rater reliability/agreement with doctor–patient communication data

coded with the VR-CoDES. Patient Education and Counseling 82:341-345.

37. Fleiss JL, Levin B, Paik MC. 2003. Statistical methods for rates and proportions. John Wiley &

Sons, Hoboken, New Jersey, 760 p.

38. Madden CP, Austin AD, Suter PJ. 1995. Pollution monitoring using Chironomid larvae: what is

a deformity? In Cranston P, ed, Chironomids: From Genes to Ecosystems. CSIRO, Melbourne, pp 89-101.

39. Altman DG. 1991. Practical Statistics for Medical Research. Chapman & Hall, London, pp.

403-407.

40. Petersen JH, Larsen K, Kreiner S. 2010. Assessing and quantifying inter-rater variation for

dichotomous ratings using a Rasch model. Statistical Methods in Medical Research 21:635–

652.

41. Weinrott MR, Jones RR. 1984. Overt versus Covert Assessment of Observer Reliability. Child

Dev. 55:1125-1137.

42. Krauth D, Woodruff TJ, Bero L. 2013. Instruments for assessing risk of bias and other

methodological criteria of published animal studies: a systematic review. Environ Health

Perspect 121:985–992.

43. Jadad AR, Moore RA, Carroll D, Jenkinson C, Reynolds DJM, Gavaghan DJ, McQuay HJ.

1996. Assessing the quality of reports of randomized clinical trials: Is blinding necessary?

Control. Clin. Trials 17:1-12.

Figure 1. EDF (Extended Depth of Focus) photographs of a normal Chironomus anthracinus-type

mentum with reduced 2nd outer lateral teeth (A) and a normal Chironomus plumosus-type mentum

with all outer lateral teeth decreasing steadily towards the lateral margins (B) showing a trifid median

tooth, two inner lateral and four outer lateral teeth.

Figure 2. Case-wise agreement in deformity ratings among the 25 experts. Presentation adopted from

Petersen et al. [40]. On the x-axis are the cases (EDF images of Chironomus spp. menta, n = 211), on

the y-axis the raters. White and black squares indicate menta rated as normal and deformed,

respectively. Vertical bands, black or white, represent rater agreement on a certain case while

horizontal bands indicate disagreement among raters.

Figure 3. Menta consistently classified as deformed: missing median and left lateral teeth (A), an extra

(or split) inner lateral tooth (B), several deformities in the same individual (C) and a Köhn gap on the

right side of the mentum covering part of the median teeth and inner lateral teeth (D).

Figure 4. The most controversial cases (approximately half of the experts considered these menta

deformed and the other half normal) in the test material included apparent mechanical damage (A), a

split median tooth in the duplicate image within the data set used to evaluate intra-rater consistency

(B), skewed teeth (C-E), or small deviations from normal symmetry, e.g. in size of teeth (C,F).

Figure 5. Examples of mechanical damage to a mentum. The 1st left outer lateral tooth is broken, and the 2nd right

inner lateral tooth is about to break off (A). Inner and outer left lateral teeth (on the right in the figure)

have been broken off during slide preparation showing clearly the double-walled structure of the

mentum (B). The breakage usually also has rough outlines. The fracture on the dorsal side of the

mentum is usually more difficult to detect and may be mistaken for deformity. This mentum was not

among the ring test data.

Figure 6. The total number of deformities identified independently by 25 raters in the identical

material, separately for the reference site (total number of menta rated = 154) and the impacted site (n =

57) (A). Assessments that did not differentiate the impacted site from the reference site are marked

with an asterisk. Deformity incidence (%) calculated from all 25 assessments for the reference and the

impacted site (B).

Figure 7. Chi-square (χ²) test statistic against the sum of deformities in raters’ assessments. Dotted line

represents the critical chi-square value with one degree of freedom (df). Black dots and open circles

denote blind and open assessments, respectively.
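
To illustrate the kind of statistic plotted in Figure 7, the following Python sketch computes a Pearson chi-square statistic for one rater's 2 × 2 table (site by deformity status) and compares it with the critical value for one degree of freedom. The site totals (154 and 57 menta) are those given in the Figure 6 caption. The deformity counts, the function name, and the omission of a continuity correction are assumptions made purely for illustration; they do not reproduce any rater's actual result or necessarily the exact test variant used in the study.

```python
def chi_square_2x2(def_ref, n_ref, def_imp, n_imp):
    """Pearson chi-square statistic (no continuity correction) for a 2 x 2 table."""
    a, b = def_ref, n_ref - def_ref   # reference site: deformed, normal
    c, d = def_imp, n_imp - def_imp   # impacted site: deformed, normal
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denominator


# Critical chi-square value at the 0.05 significance level with 1 df
CRITICAL_CHI2_DF1 = 3.841

# Hypothetical counts for a single rater (not taken from the study)
chi2 = chi_square_2x2(def_ref=10, n_ref=154, def_imp=12, n_imp=57)
print(f"chi-square = {chi2:.2f}, exceeds critical value: {chi2 > CRITICAL_CHI2_DF1}")
```

An assessment whose statistic falls below the dotted line in Figure 7 corresponds to the case where the comparison in the last line is false.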

Figure 8. Effect size (odds ratio, OR, with 95% confidence interval) against the sum of deformities for

each deformity evaluation. The dotted line represents OR of 1 with even probability of deformity in

both the reference and impacted site. When this is within the 95% confidence interval of the estimate,

the difference in deformity incidence is not statistically significant (at the 0.05 significance level) (A).

Total sample size required to detect difference in deformity incidence between the reference and

impacted site (given the estimated proportions), against the sum of deformities in each assessment

(two-tailed test with a significance level 0.05 and power 0.80) (B). Note the logarithmic scale of y-axis.
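
The two quantities shown in Figure 8 can be sketched in Python as follows. The odds ratio uses a Wald confidence interval on the log scale, and the sample size uses the standard normal-approximation formula for comparing two independent proportions with equal group sizes. Both choices, the function names, and the deformity counts are assumptions for illustration; the caption does not state which interval or sample size method was used, and only the site totals (154 and 57) come from the Figure 6 caption.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(def_ref, n_ref, def_imp, n_imp, level=0.95):
    """Odds ratio (impacted vs. reference site) with a Wald interval on the log scale."""
    a, b = def_imp, n_imp - def_imp   # impacted site: deformed, normal
    c, d = def_ref, n_ref - def_ref   # reference site: deformed, normal
    if min(a, b, c, d) == 0:
        raise ValueError("Wald interval is undefined when a cell count is zero")
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


def total_sample_size(p_ref, p_imp, alpha=0.05, power=0.80):
    """Total n (two equal groups) needed to detect p_ref vs. p_imp, two-tailed test."""
    if p_ref == p_imp:
        raise ValueError("proportions must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_ref + p_imp) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_ref * (1 - p_ref) + p_imp * (1 - p_imp))
    ) ** 2
    n_per_group = math.ceil(numerator / (p_ref - p_imp) ** 2)
    return 2 * n_per_group


# Hypothetical counts for a single rater (not taken from the study)
odds_ratio, lower, upper = odds_ratio_ci(def_ref=10, n_ref=154, def_imp=12, n_imp=57)
total = total_sample_size(p_ref=10 / 154, p_imp=12 / 57)
print(f"OR = {odds_ratio:.2f} (95% CI {lower:.2f} to {upper:.2f}), total n = {total}")
```

When the interval returned by `odds_ratio_ci` includes 1, the assessment does not separate the sites at the 0.05 level, which matches the reading of the dotted line in panel A. The required sample size grows quickly as the two estimated proportions approach each other, which is why panel B uses a logarithmic y-axis.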
