Post on 30-Apr-2023
Accepted Preprint
This article is protected by copyright. All rights reserved
Environmental Toxicology
INCONSISTENCY IN THE ANALYSIS OF MORPHOLOGICAL DEFORMITIES IN
CHIRONOMIDAE (INSECTA: DIPTERA) LARVAE
JOHANNA SALMELIN, KARI-MATTI VUORI, and HEIKKI HÄMÄLÄINEN
Environ Toxicol Chem., Accepted Article • DOI: 10.1002/etc.3010
Accepted Article "Accepted Articles" are peer-reviewed, accepted manuscripts that have not been edited, formatted, or in any way altered by the authors since acceptance. They are citable by the Digital Object Identifier (DOI). After the manuscript is edited and formatted, it will be removed from the “Accepted Articles” Web site and published as an Early View article. Note that editing may introduce changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. SETAC cannot be held responsible for errors or consequences arising from the use of information contained in these manuscripts.
Environmental Toxicology and Chemistry DOI 10.1002/etc.3010
Running title: Inconsistency in chironomid deformity assessment
JOHANNA SALMELIN,*† KARI-MATTI VUORI, ‡§ and HEIKKI HÄMÄLÄINEN†
†Department of Biological and Environmental Science, University of Jyväskylä,
Jyväskylä, Finland
‡Laboratory Centre/Ecotoxicology and Risk Assessment, Finnish Environment Institute,
Jyväskylä, Finland
§South Karelia Institute, Lappeenranta University of Technology, Lappeenranta, Finland
* Address correspondence to johanna.k.salmelin@jyu.fi
Submitted 13 October 2014; Returned for Revision 2 February 2015; Accepted 3 April 2015
Abstract: The incidence of morphological deformities of chironomid larvae as an indicator of sediment
toxicity has been studied for decades. However, standards for deformity analysis are lacking. We
evaluated whether twenty-five experts diagnosed larval deformities in a similar manner. Based on high-
quality digital images, the experts rated 211 menta of Chironomus spp. larvae as normal or deformed.
The larvae were from a site with polluted sediments or from a reference site. The site of origin was revealed to a
randomly selected half of the raters, whereas the rest conducted the assessment blind. We quantified the inter-rater
agreement by kappa coefficient, tested if open and blind assessments differed in deformity incidence
(DI) and in differentiation between the sites, and identified those deformity types rated most
consistently/inconsistently. The total DI varied greatly among experts, from 10.9% to 66.4%. The kappa
coefficient across rater pairs averaged 0.52, indicating insufficient agreement. The deformity types rated
most consistently were missing teeth and extra teeth. The open and blind assessments did not
differ, but differentiation between sites was clearest for raters who counted primarily absolute
deformities like missing and extra teeth and excluded apparent mechanical aberrations, or deviations in
tooth size or symmetry. The highly differing criteria in deformity assignment have likely led to
inconsistent results in midge larval deformity studies and indicate an urgent need for standardization of
the analysis.
Keywords: Benthic macroinvertebrates, Biomarkers, Sediment toxicity, Morphological deformities,
Inter-rater agreement
INTRODUCTION
Biological indicators are increasingly used in environmental impact assessment [1]. For
instance, chemicals as environmental stressors can affect organisms by inducing morphological
abnormalities or deformities, which can be used as biomarkers of exposure, effect or both. One sign of
sublethal stress caused by environmental contamination is the increased incidence of structural
deformities in insects, but exposure to toxic agents also increases deformity incidence in gastropods
and fishes [2-3]. Ontogenetic deformities in the head capsule structures of Chironomidae (Insecta:
Diptera) larvae are used as an indicator of sediment toxicity in aquatic ecosystems. Chironomids, or
non-biting midges, have an immature aquatic larval phase with four larval instars. They typically
constitute a significant part of the benthic invertebrate fauna in both lotic and lentic habitats. One of the
most widely distributed and species-rich chironomid genera is Chironomus [4]. Chironomus spp. spend
their larval stage in tubes they construct within soft sediments of lakes and slowly flowing parts of
rivers, where they feed on detritus particles with associated microorganisms and deposited algae [5].
Hence they are readily exposed to the substances bound to the sediment or dissolved in the pore water.
A deformity in chironomid larvae has been defined as “any morphological feature that departs
from the normal configuration” [6,7]. Deformities in chironomid larvae may occur in various
morphological structures including antennae, mandibles, premandibles, epipharyngeal pecten and labral
lamellae [8], but usually and most easily deformities are examined from the mentum. The mentum is a
structure on the ventral side of the fully sclerotized head capsule, posterior to the mouthparts. The
mentum of Chironomus spp. consists of a double-walled labial plate with 13 sclerotized teeth with
dorsal and ventral walls, dorsomentum and ventromentum [9]. The mentum possesses six lateral teeth
on both sides of the trifid median tooth (Figure 1). Lateral teeth can be further subdivided into two
larger inner lateral teeth and four smaller outer lateral teeth [8]. Typical aberrations referred to as
deformities include missing, additional, split and asymmetrical teeth, distinct deep smooth-edged
indentations called Köhn gaps or a broad deviation from the normal tooth configuration [8].
Chironomid deformities have received attention as a potential early warning indicator of pollution since
the study of Hamilton and Sæther in 1971 [10], followed by numerous laboratory and field studies.
Xenobiotic substances are suspected to induce morphological deformities during larval ontogeny by
disrupting developmental processes between moults. An elevated incidence of mentum deformities has
been considered a response to metal contamination [11-13], some endocrine-disrupting chemicals [14],
or complex mixtures of xenobiotic substances in sediments [15,16]. However, some studies, mainly
laboratory bioassays, have indicated contradictory results with no effects of metals on Chironomus
mentum deformity rate [17,18].
It is evident from the literature that variable evaluation criteria of deformities are applied by
different researchers. For instance, a split median tooth or mechanical aberrations, such as breakage
and wear, are counted as deformities in some studies but not in others [18-25]. Moreover, some
studies do not report the criteria used at all, or which aberrations are counted as deformities.
Commonly cited published criteria are by Dickman et al. [26], Warwick and Tisdale [8] and Janssens
de Bisthoven et al. [27]. Clearly if different researchers use divergent definitions of deformity, their
results are not comparable. Hence, subjectivity and inconsistency in deformity analysis might also
partially explain the contrasting results regarding deformity response.
Reliability in the context of inter-rater agreement studies refers to the degree to which
different raters agree in their assessments [28]. Reliable assessment or diagnosis results are obtained
when different experts independently make similar evaluation decisions on an identical set of data.
Inter-rater reliability studies are common in medical sciences, including psychiatric diagnostics,
clinical trials and interpretation of radiographs or cell samples, and are also widely used in social
sciences and linguistics [29,30]. In contrast, studies quantifying inter-rater agreement in deformity
assessment or in the assessment of other morphological biomarkers of aquatic organisms are lacking. An expert as a rater is
an important factor in classifying items, and the criteria used influence the results of an evaluation and
a whole study. One goal of inter-rater agreement studies is to reduce subjectivity associated with any
assessment requiring some degree of interpretation.
We therefore studied inter-rater agreement on Chironomus spp. deformity analysis among
experts. We aimed 1) to quantify inter-rater agreement among experts in deformity assessment (i.e.,
whether experts diagnose deformities in a similar manner, sharing the same definition of deformity
and normality); 2) to study whether preconception in the form of background information about site
contamination biases deformity evaluation results; and 3) to improve consistency of deformity analysis
by identifying those types of anomalies with greatest agreement/disagreement among raters. In this
study we focused on Chironomus spp. mentum deformities which are most commonly used in
deformity assessments.
MATERIALS AND METHODS
Test set up
The test material consisted of microscopic EDF (Extended Depth of Focus) images of menta
of 211 fourth instar Chironomus spp. larvae collected from the field. The larvae were fixed on
microscope slides with polyvinyl-lactophenol. Both C. anthracinus- and C. plumosus-type larvae were
included in the samples (Figure 1). Larvae were sampled in 1996 from Lake Saimaa, southeastern
Finland. Most of the larvae (n = 154) were from a reference site upstream from a local pollution source,
a paper and pulp mill. The rest of the larvae (n = 57) were collected from a site immediately
downstream from the factory. The sediment of this impacted site was heavily contaminated with
mercury (Hg) and organic compounds like organohalogens and chlorophenolics (site A1 in Soimasuo
et al. [31]). Thus the test represents a realistic case, and of the original 272 mentum samples only those
that were dried out and damaged were excluded from the test material. The EDF technique combines
pictures taken at different focus levels (Z-series) into one image with highly improved
clarity. A digital camera (Nikon DS) connected to an optical microscope (Olympus BX41 System
Microscope) was used to take the images. Several (3-13) images from different focus levels of each
mentum were combined into one using the laboratory image analysis software NIS-Elements D 4.11.01
64-bit, and were saved as TIF images. The test material was shared with participating experts via Funet
FileSender, which is intended for distributing large data files. However, because of data transfer problems,
the material was copied onto a USB flash drive and sent by post to some participants.
The participants in the evaluation were invited based on their expertise in Chironomus spp.
deformity analysis, as verified by their publication record on this subject. Four participants were
included based on a suggestion from a colleague or mentor who was not able to participate. Altogether
25 experts (raters hereafter) from 13 different countries (16 raters from Europe, 3 from South America,
2 from North America, 2 from Asia, 1 from Africa and 1 from Australia) agreed to participate and
analyze the test material. Only two invited experts declined or did not answer. The test design was
fully-crossed; hence all experts rated every case. Raters were asked to perform their analysis
independently. Independence was reinforced by anonymity of the raters, who did not know the
identities of the other participants. Raters were required to explicitly mark each mentum (identified by
the number of the image) either as normal or deformed with the number 0 (zero) or 1, respectively, in
the manner they would do in their routine investigation. The raters were also able to fill in some
additional information about deformity type, location or severity, or some other comments. All
assessment results were handled anonymously. After all results were received, we provided all raters
with a summary of the results showing their own personal ratings compared with the others.
To evaluate possible preconception bias, information about the origin of the larvae was revealed
to a randomly selected half of the raters when the test material was delivered. This
group of raters hence performed an open assessment, whereas the other half who lacked the
background information made the assessment blind. In addition, to evaluate intra-rater consistency one
duplicate image within the test material was included in both the reference and impacted sets but with
the image rotated so that the rater would not recognize the duplication. This duplicate image was the
only artefact in the data set, and was not counted in the calculation of deformity incidence.
Data analysis
The frequency of deformed individuals within a chironomid population (deformity incidence,
DI) [32], usually expressed as a percentage (%), was calculated as
DI = (d/n) × 100
where d is the number of deformed larvae and n the number of larvae examined. DI% was calculated
for all individual assessment results across all larvae and also separately for larvae from the reference
and impacted sites.
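The DI formula above can be illustrated with a minimal Python sketch. It assumes each mentum is coded 0 (normal) or 1 (deformed), as the raters were asked to do; the function name is hypothetical, and the sketch is not part of the original analysis, which was run in SPSS. The example counts correspond to the lowest overall number of deformed menta reported in the Results (23 of 211).

```python
def deformity_incidence(ratings):
    """Return DI (%) = (d / n) * 100 for a sequence of 0/1 ratings."""
    n = len(ratings)  # number of larvae examined
    if n == 0:
        raise ValueError("no larvae examined")
    d = sum(ratings)  # number of menta rated deformed (coded 1)
    return d / n * 100


# 23 menta rated deformed and 188 rated normal out of 211.
example = [1] * 23 + [0] * 188
print(round(deformity_incidence(example), 1))
```

The same function can be applied to the reference and impacted subsets separately to obtain site-specific DI values.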
The estimate of raw agreement, the proportion of cases for which two raters agree, was
calculated and averaged across rater pairs. The inter-rater agreement was quantified using the
chance-corrected kappa coefficient (κ) for each pair of raters [33]. An overall estimate of kappa was calculated
as the arithmetic mean of pairwise κ-values (n = 300) to define consistency among all raters as
suggested by Light [34]. Possible values for kappa statistics range from −1 to 1, with 1 indicating
perfect agreement, 0 indicating completely random agreement, and −1 indicating complete
disagreement. Landis and Koch [35] provide guidelines for interpreting kappa values, with values from
0.0 to 0.2 indicating slight agreement, 0.21 to 0.40 fair agreement, 0.41 to 0.60 moderate agreement,
0.61 to 0.80 substantial agreement, and 0.81 to 1.0 almost perfect or perfect agreement. An acceptable
kappa indicating a sufficient level of agreement has been suggested to be > 0.60–0.75 [35,36].
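As a hedged illustration of the agreement measures described above, the following Python sketch computes kappa for a pair of raters as (p_o − p_e)/(1 − p_e), where p_o is the raw agreement and p_e the agreement expected by chance, averages it over all rater pairs, and attaches the verbal label for the bands quoted in the text. All function names and the example ratings are hypothetical, and the label "poor" for negative values is taken from the Landis and Koch scheme rather than from the text above. This is a sketch, not the SPSS procedure used in the study.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Chance-corrected agreement between two raters' 0/1 ratings."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n  # raw agreement
    # Agreement expected if the two raters rated independently.
    pa1, pb1 = sum(a) / n, sum(b) / n
    p_exp = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    if p_exp == 1:
        return 1.0  # convention: both raters used one identical category
    return (p_obs - p_exp) / (1 - p_exp)


def mean_pairwise_kappa(ratings_by_rater):
    """Arithmetic mean of kappa over all rater pairs."""
    pairs = list(combinations(ratings_by_rater, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


def agreement_label(kappa):
    """Verbal label for a kappa value, using the bands quoted in the text."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"),
             (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical ratings of ten menta by three raters (1 = deformed).
raters = [
    [1, 1, 0, 0, 0, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
]
overall = mean_pairwise_kappa(raters)
print(round(overall, 2), agreement_label(overall))
```

With 25 raters, `combinations` yields 25 × 24 / 2 = 300 pairs, which matches the number of pairwise κ-values stated above.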
We used the Mann-Whitney U test to examine differences in DI between blind and open
assessments for both the reference and impacted sites. We used the χ² test separately for each assessment (n =
25) to study whether the distribution of the categorical variable (normal/deformed) differed between
the reference site and the impacted site.
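The two tests named above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the SPSS procedure used in the study: the χ² statistic is the uncorrected Pearson statistic for a 2 × 2 table (the text does not state whether a continuity correction was applied), and only the Mann-Whitney U statistic is computed, without a p-value. Function names and example numbers are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table
    [[a, b], [c, d]]; returns (statistic, p-value) with 1 df."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("a margin is zero; the test is undefined")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 df the chi-square tail equals a two-sided normal tail.
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p


def mann_whitney_u(x, y):
    """U statistic (smaller of U_x and U_y); ties count one half."""
    u_x = sum((xi > yi) + 0.5 * (xi == yi) for xi in x for yi in y)
    u_y = len(x) * len(y) - u_x
    return min(u_x, u_y)


# Hypothetical single rater: deformed/normal counts at each site.
stat, p = chi_square_2x2(10, 143, 20, 36)  # reference row, impacted row
print(round(stat, 2), p < 0.05)

# Hypothetical DI values (%) from blind and open raters at one site.
blind = [12.0, 15.5, 20.1, 33.0]
open_ = [14.2, 18.9, 25.4, 30.7]
print(mann_whitney_u(blind, open_))
```

Each rater contributes one 2 × 2 table (site by normal/deformed), so the χ² function would be called once per assessment, whereas the Mann-Whitney comparison uses one DI value per rater in each group.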
From the initial data analysis it became apparent that the raters had different propensities to
designate deformities. Conservative raters seemed to judge deformities cautiously, include only
obvious abnormalities and exclude any wear or mechanical damage, which were more often counted by
the non-conservative raters. Therefore we decided to evaluate the practical implications of these
contrasting strategies for assessments by evaluating their sensitivity to distinguish the reference site
from the impacted site using odds ratio as an effect size measure. We also estimated the total sample
size required to detect the difference in DI between the reference and the impacted site for each
assessment according to Fleiss et al. [37] by interpolation from tabulated values and adjusted for
unequal sample sizes. All statistical analyses were performed using IBM SPSS Statistics 20.
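The odds ratio used as an effect size can be illustrated with a short Python sketch. The function name and all counts are hypothetical and chosen only to contrast a cautious rater with a liberal one; the sample-size estimation from tabulated values is not reproduced here.

```python
def odds_ratio(def_impacted, norm_impacted, def_reference, norm_reference):
    """Odds of a 'deformed' rating at the impacted site divided by
    the odds of a 'deformed' rating at the reference site."""
    if norm_impacted == 0 or def_reference == 0:
        raise ValueError("odds ratio undefined: zero in the denominator")
    return (def_impacted * norm_reference) / (norm_impacted * def_reference)


# Hypothetical conservative rater: few deformities at the reference site.
conservative = odds_ratio(15, 41, 6, 147)
# Hypothetical non-conservative rater: many deformities at both sites.
non_conservative = odds_ratio(39, 17, 99, 54)
print(round(conservative, 2), round(non_conservative, 2))
```

An odds ratio close to 1 means the rater's assessment barely separates the two sites, whereas larger values indicate clearer separation.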
RESULTS
No raters produced completely identical assessments. Occasionally the total number of
deformities, and hence DI, was the same between two raters, but even then the menta rated normal and
deformed were never exactly the same. Of the 211 menta, 48 were rated normal (35 and 13 from the
reference and impacted site, respectively) and 5 deformed (1 and 4) by all 25 raters resulting in
altogether 53 cases with complete agreement (Figure 2). All five menta that were consistently rated as
deformed had missing or extra teeth, sometimes together with a deformed median tooth (Figure
3A–C). However, occasionally even some quite obvious deformities, like a missing tooth, remained
unnoticed. A simple coding error might be suspected in some cases when a mentum having a striking
deformity like a Köhn gap (Figure 3D) was recorded as normal.
The most controversial cases, which approximately half of the participants (11–14, or 44–56%)
considered normal and the other half deformed, included missing teeth that could represent either
mechanical damage or an actual developmental abnormality (Figure 4A). Other cases rated
inconsistently included teeth with some deviations from normal symmetry (Figure 4B–F). Most
participants generally did not consider mechanical damage a deformity (Figure 5A), whereas
some rated apparently worn teeth (Figure 4C) more or less systematically as deformed.
The number of cases classified as normal and deformed ranged from 71 to 188 and from 23 to 140,
respectively (Figure 6A). Correspondingly, the overall deformity incidence varied from 10.9% to
66.4% depending on the rater. DI calculated separately for the reference site was 3.9–64.7%, and for
the impacted site 26.8–69.6% (Figure 6B). The average raw agreement was 0.80 (±sd 0.14) with a
range of 0.43–0.99. Pairwise kappa ranged from 0.09 to 0.96, indicating large variation in
inter-rater agreement. Some rater pairs agreed almost perfectly, whereas others had very divergent
deformity interpretations. The average kappa across rater pairs was 0.52 (±sd 0.23). The average kappa
excluding the results of the four raters suggested by peers (numbers 2, 10, 21 and 23 in Figure 6A) was
0.54, indicating that the inclusion of these originally uninvited raters did not affect the observed
agreement. Of the pairwise kappa values, 37.7% were above 0.60 and 25.3% above 0.75. Hence the
majority, 62.3% and 74.7%, of pairwise kappa values fell below these two suggested cut-off values for
sufficient agreement.
The blind and open assessments for the reference site resulted in an average DI of 17.0% and
19.1%, respectively, with no statistically significant difference (Mann-Whitney U = 61.0, p = 0.786).
The same was true for the impacted site, where the blind (DI = 35.0%) and open (39.7%) assessments did
not differ (Mann-Whitney U = 62.0, p = 0.833).
The duplicate mentum included to evaluate intra-rater consistency had a split median tooth (Figure 4B)
and was rated either deformed (15–16 raters) or normal (9–10 raters). However,
most raters (88%) showed intra-rater consistency and evaluated the duplicate mentum similarly. All
three experts who rated the duplicate mentum inconsistently conducted the assessment blind.
In 76% of all assessments the distributions of the dichotomous response (normal/deformed)
differed between the reference and impacted sites according to the χ² test. The estimated difference
was clearly negatively related to the propensity of raters to designate deformities (Figure 7); hence
conservative deformity assessments were more likely to distinguish the reference and impacted sites
than were non-conservative assessments. The six assessments which did not differentiate between
sites had the highest overall DI (Figures 6, 7). This high overall DI resulted mainly from interpreting
worn teeth as deformed. Conservative deformity assessments showed greater effect sizes (Figure 8A),
which was reflected in considerably smaller sample sizes required to detect a difference in DI between
the impacted and the reference site (Figure 8B).
DISCUSSION
We studied inter-rater agreement among experts on chironomid deformities and found that
experts evaluate deformities rather subjectively and inconsistently. Whereas some rater pairs agreed
closely, sharing the same interpretation of deformities, others had very low agreement. According to
Madden et al. [38], mentum deformities may be categorized as absolute, quantitative deformities, such
as missing or extra teeth, or as relative, qualitative deformities, such as changes in shape or size of
teeth. In our study, the most obvious, ‘absolute’ deformities were generally rated quite consistently,
and the inconsistency was mostly in interpretation of relative, qualitative deformities, as well as in
separation of ontogenetic anomalies from mechanical wear or damage. The cases with the highest
disagreement included teeth slightly deviating from the normal shape. These deviations were
interpreted either as wear or as developmental deformities. When the tooth breaks, the two edges
(dorsal and ventral) with rough and uneven outlines become visible (Figure 5), but sometimes (Figure
4A) this cannot be seen clearly. A minor source of inconsistency might have resulted from careless
errors, for example when a mentum clearly lacking a tooth was rated as normal. Coding errors might
also have occurred. However, there were few apparent examples of these kinds of errors, and most of
the inconsistency was clearly due to different interpretations of deformity and normality.
The low value of the chance-corrected agreement measure, kappa, indicated only moderate
inter-rater agreement according to Landis and Koch [35] and Altman [39], and hence also demonstrated
insufficient consistency in deformity analysis of chironomid larval mentum. Interpretation of kappa
values is controversial, and cut-off values indicating strength of agreement have been considered
arbitrary. This was acknowledged by Landis and Koch [35], but they nevertheless considered boundary
values to provide useful benchmarks. Kappa coefficient and kappa type measures have also been
criticized for positive bias towards marginal homogeneity and negative bias towards trait prevalence
[40], giving higher values when raters use the rating categories unequally and lower values if one of the
categories is more common. In our study some rater pairs used rating categories unequally. Trait
prevalence potentially also affected kappa values, as the category ‘normal’ is expected to be much
more common than the category ‘deformed’. Despite the kappa-critique and the ongoing debate about
the most appropriate measure for inter-rater agreement on nominal data [40], our raw data alone
(Figure 6) clearly indicated a severe problem with large variation in deformity assessments among
experts.
The reference site for our study was upstream from a local pollution source, a paper and pulp
mill in southern Lake Saimaa. The sediments of the impacted site immediately downstream from this
point source of pollution were heavily contaminated with Hg and organochlorine compounds. Contrary
to our expectation, providing information about site status did not affect the deformity assessments, and
the results were not biased by any prejudice of the raters. However, this result does not conclusively
rule out the possibility of this type of bias, as in the test situation the participants knew that their
evaluations would be compared with others and therefore they might have been more objective in their
ratings than otherwise. It has been previously documented that raters provide more reliable data when
they know that their performance is monitored [41].
Consistent with the background DI of 2.8–3.1% and < 8% reported by Dickman et al. [26]
and Vermeulen [7], respectively, a tentative background DI in our study region, based on data from 7
sites in lakes with no significant local pollution, is 5% (H. Hämäläinen, University of Jyväskylä,
Jyväskylä, Finland, unpublished data). In the present study, eleven raters estimated a DI < 8% for the
reference site. All raters except one estimated a higher DI at the impacted site than at the reference site,
irrespective of their assessment strategies, suggesting that non-conservative assessments could still be
diagnostic. However, the required sample sizes for the most non-conservative assessments to detect any
difference in DI between the sites were huge (> 1000 larvae) and undoubtedly unfeasible in
environmental monitoring or ecotoxicological studies.
Potential biases, or sources of systematic error, in animal toxicology studies include the lack of
randomization, blinding or specification of inclusion/exclusion criteria [42]. All these deficiencies
should also be taken into account to reduce subjectivity and inconsistency in chironomid deformity
analysis. Although the deformity assessments in our study did not appear to be subject to any
systematic error due to the influence of information about site status, blinding is recommended in all
studies in which results might be influenced by a researcher’s expectations or preferences. Blind
assessment generally produces more consistent results than open assessment [43], and has long been a
requirement in medical studies and clinical trials.
A stronger and more consistent separation in DI between the reference and impacted sites
resulted from conservative assessment of deformities than from less conservative
assessment practice. In conservative assessment it was mainly ‘absolute’ abnormalities (see above) that
were judged as deformities, whereas abnormalities which might result from wear or mechanical
damage (worn, broken, different-sized, or skewed teeth) were excluded. This suggests that conservative
assessment, in which only absolute deformities are counted, reflects a potential toxicity response more
consistently and appears to be more sensitive in detecting an effect. This higher sensitivity also means
greater cost-efficiency, as smaller sample sizes are needed to detect an effect.
The inclusive definition of a deformity as a deviation from the normal configuration certainly
invites subjectivity and variability in assessments. Inter-rater agreement might be substantially
increased by clear, mutually agreed guidelines for assessment criteria, which are currently lacking
in chironomid deformity assessment. However, defining such criteria is problematic, as the
deformity status of a larva cannot be confirmed or validated in the absence of an independent ‘truth’.
Nevertheless, the criteria might be based on a consensus opinion of experts, preferably supported by
empirical studies. As a starting point towards this end the data obtained in our trial can be used to form
an expert consensus. Moreover, the results suggest that conservative assessment considering only
absolute deformities sensu Madden et al. [38] gives more consistent results. Furthermore, to avoid subjective
bias and to give the results greater credibility, we strongly recommend conducting any deformity
assessments blind. In environmental toxicity and risk assessment this principle should be adopted more
widely and extended to deformity analysis of other organisms.
CONCLUSIONS
Chironomid larval deformities are used in bioindication of exposure to chemical stressors with
likely effects also on populations and communities. Our study of inter-rater agreement on deformity
assessment indicated that experts evaluate deformities subjectively and inconsistently, which weakens
the indicator value of chironomid deformity analysis. Given that most of these experts have published
scientific research articles on the incidence of deformities, we can conclude that the comparability and
reliability of results in the literature is highly questionable. Hence, the equivocal and partly contrasting
results of studies concerning chironomid deformity incidence as a stress response to contamination
might partly result from inconsistent interpretation of deformities. Even though we focused only on the
mentum of Chironomus, we expect that the results should be broadly generalizable to other
morphological structures and taxa used in chironomid deformity studies. Reliability and consistency
could be increased by developing guidelines to be applied in the chironomid deformity assessment.
Moreover, assessments should be conducted blind. Our results also suggest that more sensitive
detection of effects of sediment toxicity can be obtained by taking into account only absolute
deformities like missing and extra teeth, and Köhn gaps. Inconsistency might also be reduced in the
future by developing automated recognition methods for deformity identification using computer-based
image analysis.
Acknowledgment—We warmly thank all participating experts for their indispensable efforts, which made
this study possible; K. Meissner for technical support with microscope photography; and R. Jones for
proofreading the manuscript. We also thank two anonymous reviewers for valuable comments that
improved the quality of the manuscript. The study was organized by the University of Jyväskylä,
Finland, in collaboration with the Finnish Environment Institute, and funded by the Finnish Funding
Agency for Innovation.
Data availability—All data are available upon request from the corresponding author
(johanna.k.salmelin@jyu.fi).
REFERENCES
1. Adams SM. 2002. Biological indicators of aquatic ecosystem stress: introduction and overview.
In Adams SM, ed, Biological indicators of aquatic ecosystem stress. American Fisheries
Society, Bethesda, Maryland. pp. 1-11.
2. Conroy PT, Hunt JW, Anderson BS. 1996. Validation of a short-term toxicity test endpoint by
comparison with longer-term effects on larval red abalone Haliotis rufescens. Environmental
Toxicology and Chemistry 15:1245-1250.
3. Carls MG, Rice SD, Hose JE. 1999. Sensitivity of fish embryos to weathered crude oil: Part I.
Low-level exposure during incubations causes malformations, genetic damage, and mortality in
larval Pacific herring (Clupea pallasi). Environmental Toxicology and Chemistry 18:481-493.
4. Wiederholm T. 1983. Chironomidae of the Holarctic region. Keys and diagnosis. Part I Larvae.
Ent. scand. Suppl. 19: 1-457.
5. Johnson RK. 1987. Seasonal variation in diet of Chironomus plumosus (L.) and C. anthracinus
Zett. (Diptera: Chironomidae) in mesotrophic Lake Erken. Freshwater Biology 17:525-532.
6. Warwick WF. 1988. Morphological deformities in Chironomidae (Diptera) larvae as biological
indicators of toxic stress. In Evans MS, ed, Toxic contaminants and ecosystem health: A
Great Lakes focus. John Wiley & Sons, New York. pp. 281-320.
7. Vermeulen AC. 1995. Elaborating chironomid deformities as bioindicators of toxic sediment
stress: the potential application of mixture toxicity concepts. Ann. Zool. Fennici 32:265-285.
8. Warwick WF, Tisdale NA. 1988. Morphological deformities in Chironomus,
Cryptochironomus, and Procladius larvae (Diptera: Chironomidae) from two differentially
stressed sites in Tobin Lake, Saskatchewan. Can. J. Fish. Aquat. Sci. 45:1123-1144.
9. Sæther OA. 1971. Notes on general morphology and terminology of the Chironomidae
(Diptera). The Canadian Entomologist 103:1237-1260.
10. Hamilton AL, Sæther OA. 1971. The occurrence of characteristic deformities in the chironomid
larvae of several Canadian lakes. The Canadian Entomologist 103:363-368.
11. Ilyashuk B, Ilyashuk E, Dauvalter V. 2003. Chironomid responses to long-term metal
contamination: a paleolimnological study in two bays of Lake Imandra, Kola Peninsula,
northern Russia. Journal of Paleolimnology 30:217–230.
12. Martinez EA, Wold L, Moore BC, Schaumloffel J, Dasgupta N. 2006. Morphologic and growth
responses in Chironomus tentans to arsenic exposure. Arch. Environ. Contam. Toxicol. 51:529–536.
13. Di Veroli A, Goretti E, Paumen ML, Kraak MHS, Admiraal W. 2012. Induction of mouthpart
deformities in chironomid larvae exposed to contaminated sediments. Environmental Pollution
166:212-217.
14. Meregalli G, Pluymers L, Ollevier F. 2001. Induction of mouthpart deformities in Chironomus
riparius larvae exposed to 4-n-nonylphenol. Environmental Pollution 111:241-246.
15. Hudson LA, Ciborowski JH. 1996. Spatial and taxonomic variation in incidence of mouthpart
deformities in midge larvae (Diptera: Chironomidae: Chironomini). Can. J. Fish. Aquat. Sci. 53:297-304.
16. Planelló R, Servia MJ, Gómez-Sande P, Herrero O, Cobo F, Morcillo G. 2015. Transcriptional
responses, metabolic activity and mouthpart deformities in natural populations of Chironomus
riparius larvae exposed to environmental pollutants. Environ Toxicol 30:383-395.
17. Langer-Jaesrich M, Köhler H-R, Gerhardt A. 2010. Can mouth part deformities of Chironomus
riparius serve as indicators for water and sediment pollution? A laboratory approach. J Soils
Sediments 10:414–422.
18. Arambourou H, Gismondi E, Branchu P, Beisel J-N. 2013. Biochemical and morphological
responses in Chironomus riparius (Diptera: Chironomidae) larvae exposed to lead-spiked
sediment. Environmental Toxicology and Chemistry 32:2558-2564.
19. Nazarova LB, Riss HW, Kahlheber A. 2004. Some observations of buccal deformities in
chironomid larvae (Diptera: Chironomidae) from the Ciénaga Grande de Santa Marta,
Colombia. Caldasia 26:275-290.
20. Dias V, Vasseur C, Bonzom J-M. 2008. Exposure of Chironomus riparius larvae to uranium:
Effects on survival, development time, growth, and mouthpart deformities. Chemosphere
71:574–581.
21. Ochieng H, de Ruyter van Steveninck ED, Wanda FM. 2008. Mouthpart deformities in
Chironomidae (Diptera) as indicators of heavy metal pollution in northern Lake Victoria,
Uganda. African Journal of Aquatic Science 33:135-142.
22. Al-Shami S, Rawi CSM, Nor SAM, Ahmad AH, Ali A. 2010. Morphological deformities in
Chironomus spp. (Diptera: Chironomidae) larvae as a tool for impact assessment of
anthropogenic and environmental stresses on three rivers in the Juru River System, Penang,
Malaysia. Environmental Entomology 39:210-222.
23. Morais SS, Molozzi J, Viana AL, Viana TH, Callisto M. 2010. Diversity of larvae of littoral
Chironomidae (Diptera: Insecta) and their role as bioindicators in urban reservoirs of different
trophic levels. Brazilian Journal of Biology 70:995-1004.
24. Gagliardi B, Pettigrove V. 2013. Removal of intensive agriculture from the landscape improves
aquatic ecosystem health. Agriculture, Ecosystems and Environment 176:1-8.
25. Saha D, Mazumdar A. 2013. Deformities of Chironomus sp. larvae (Diptera: Chironomidae) as
indicator of pollution stress in rice fields of Hooghly District, West Bengal. Journal of Today’s
Biological Sciences 2:44-54.
26. Dickman M, Brindle I, Benson M. 1992. Evidence of teratogens in sediments of the Niagara
River watershed as reflected by chironomid (Diptera, Chironomidae) deformities. Journal of
Great Lakes Research 18:467-480.
27. Janssens de Bisthoven L, Huysmans C, Ollevier F. 1995. The in situ relationships between
sediment concentrations of micropollutants and morphological deformities in Chironomus gr
thummi larvae (Diptera, Chironomidae) from lowland rivers (Belgium): a spatial comparison. In
Cranston P, ed, Chironomids: From Genes to Ecosystems. CSIRO Publisher, East Melbourne,
Australia, pp 63–80.
28. Uebersax JS. 1988. Validity inferences from interobserver agreement. Psychological Bulletin
104:405-416.
29. Krippendorff K. 2004. Reliability in content analysis. Some common misconceptions and
recommendations. Human Communication Research 30:411–433.
30. Parchi P, de Boni L, Saverioni D, Cohen ML, Ferrer I, Gambetti P, Gelpi E, Giaccone G, Hauw
J-J, Höftberger R, Ironside JW, Jansen C, Kovacs GG, Rozemuller A, Seilhean D, Tagliavini F,
Giese A, Kretzschmar HA. 2012. Consensus classification of human prion disease histotypes
allows reliable identification of molecular subtypes: an inter-rater study among surveillance
centres in Europe and USA. Acta neuropathologica 124:517-529.
31. Soimasuo MR, Karels AE, Leppänen H, Santti R, Oikari AOJ. 1998. Biomarker Responses in
Whitefish (Coregonus lavaretus L. s.l.) Experimentally Exposed in a Large Lake Receiving
Effluents from Pulp and Paper Industry. Arch. Environ. Contam. Toxicol. 34:69–80.
32. Hämäläinen H. 1999. Critical appraisal of the indexes of chironomid larval deformities and their
use in bioindication. Ann. Zool. Fennici 36:179-186.
33. Cohen J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological
Measurement 20:37–46.
34. Light RJ. 1971. Measures of response agreement for qualitative data: Some generalizations and
alternatives. Psychological Bulletin 76:365–377.
35. Landis JR, Koch GG. 1977. The measurement of observer agreement for categorical data.
Biometrics 33:159–174.
36. Fletcher I, Mazzi M, Nuebling M. 2011. When coders are reliable: The application of three
measures to assess inter-rater reliability/agreement with doctor–patient communication data
coded with the VR-CoDES. Patient Education and Counseling 82:341-345.
37. Fleiss JL, Levin B, Paik MC. 2003. Statistical methods for rates and proportions. John Wiley &
Sons, Hoboken, New Jersey, 760 p.
38. Madden CP, Austin AD, Suter PJ. 1995. Pollution monitoring using Chironomid larvae: what is
a deformity? In Cranston P, ed, Chironomids: From Genes to Ecosystems. CSIRO, Melbourne,
pp 89-101.
39. Altman DG. 1991. Practical Statistics for Medical Research. Chapman & Hall, London, pp.
403-407.
40. Petersen JH, Larsen K, Kreiner S. 2010. Assessing and quantifying inter-rater variation for
dichotomous ratings using a Rasch model. Statistical Methods in Medical Research 21:635–652.
41. Weinrott MR, Jones RR. 1984. Overt versus Covert Assessment of Observer Reliability. Child
Dev. 55:1125-1137.
42. Krauth D, Woodruff TJ, Bero L. 2013. Instruments for assessing risk of bias and other
methodological criteria of published animal studies: a systematic review. Environ Health
Perspect 121:985–992.
43. Jadad AR, Moore RA, Carroll D, Jenkinson C, Reynolds DJM, Gavaghan DJ, McQuay HJ.
1996. Assessing the quality of reports of randomized clinical trials: Is blinding necessary?
Control. Clin. Trials 17:1-12.
Figure 1. EDF (Extended Focus of Depth) photographs of a normal Chironomus anthracinus-type
mentum with reduced 2nd outer lateral teeth (A) and a normal Chironomus plumosus-type mentum
with all outer lateral teeth decreasing steadily towards the lateral margins (B), showing a trifid median
tooth, two inner lateral and four outer lateral teeth.
Figure 2. Case-wise agreement in deformity ratings among the 25 experts. Presentation adapted from
Petersen et al. [40]. On the x-axis are the cases (EDF-images of Chironomus spp. menta, n = 211), on
the y-axis the raters. White and black squares indicate menta rated as normal and deformed,
respectively. Vertical bands, black or white, represent rater agreement on a certain case while
horizontal bands indicate disagreement among raters.
Figure 3. Menta consistently classified as deformed: missing median and left lateral teeth (A), an extra
(or split) inner lateral tooth (B), several deformities in the same individual (C) and a Köhn gap on the
right side of the mentum covering part of the median teeth and inner lateral teeth (D).
Figure 4. The most controversial cases (approximately half of the experts considered these menta
deformed and the other half normal) in the test material included apparent mechanical damage (A), a
split median tooth in the duplicate image within the data set used to evaluate intra-rater consistency
(B), skewed teeth (C-E), or small deviations from normal symmetry, e.g. in size of teeth (C,F).
Figure 5. Examples of mechanical damage to a mentum. Broken 1st left outer lateral tooth, and 2nd right
inner lateral tooth is about to break off (A). Inner and outer left lateral teeth (on the right in the figure)
have been broken off during slide preparation showing clearly the double-walled structure of the
mentum (B). The breakage usually also has rough outlines. The fracture on the dorsal side of the
mentum is usually more difficult to detect and may be mistaken for deformity. This mentum was not
among the ring test data.
Figure 6. The total number of deformities identified independently by 25 raters in the identical
material, separately for the reference site (total number of menta rated = 154) and the impacted site (n =
57) (A). Assessments that did not differentiate the impacted site from the reference site are marked
with an asterisk. Deformity incidence (%) calculated from all 25 assessments for the reference and the
impacted site (B).
Figure 7. Chi-square (χ²) test statistic against the sum of deformities in raters’ assessments. Dotted line
represents the critical chi-square value with one degree of freedom (df). Black dots and open circles
denote blind and open assessments, respectively.
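The comparison plotted in Figure 7 can be illustrated with a minimal sketch, assuming a Pearson chi-square test without continuity correction on a 2 × 2 table of deformed and normal menta at the two sites. The site totals (154 and 57) are those given in the Figure 6 caption; the deformity counts and the function name are hypothetical, chosen only for illustration, and the caption does not state whether a continuity correction was applied.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the
    2 x 2 table [[a, b], [c, d]]; the statistic has 1 degree of freedom."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denom


# Critical chi-square value for 1 df at the 0.05 significance level.
CRITICAL_DF1_ALPHA05 = 3.841

# Hypothetical counts from one rater's assessment (illustration only).
ref_total, ref_deformed = 154, 8    # reference site
imp_total, imp_deformed = 57, 10    # impacted site

stat = chi_square_2x2(
    imp_deformed, imp_total - imp_deformed,
    ref_deformed, ref_total - ref_deformed,
)
sites_differ = stat > CRITICAL_DF1_ALPHA05
print(f"chi-square = {stat:.2f}, sites differ at 0.05: {sites_differ}")
```

An assessment falling below the dotted line in the figure corresponds to `sites_differ` being false in this sketch.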
Figure 8. Effect size (odds ratio, OR, with 95% confidence interval) against the sum of deformities for
each deformity evaluation. The dotted line represents OR of 1 with even probability of deformity in
both the reference and impacted site. When this is within the 95% confidence interval of the estimate,
the difference in deformity incidence is not statistically significant (at the 0.05 significance level) (A).
Total sample size required to detect difference in deformity incidence between the reference and
impacted site (given the estimated proportions), against the sum of deformities in each assessment
(two-tailed test with a significance level 0.05 and power 0.80) (B). Note the logarithmic scale of the y-axis.
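The two quantities in Figure 8 can be sketched as follows, under stated assumptions: a Wald-type confidence interval on the log odds ratio, and the standard normal-approximation sample size formula for comparing two proportions with equal group sizes. Neither choice is confirmed by the caption, and the function names and counts are hypothetical.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(a, b, c, d, level=0.95):
    """Odds ratio and Wald confidence interval for the table [[a, b], [c, d]]:
    a, b = deformed, normal at the impacted site; c, d = same at the reference site."""
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    z = NormalDist().inv_cdf(0.5 + level / 2)
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)


def total_sample_size(p1, p2, alpha=0.05, power=0.80):
    """Total sample size (two equal groups, an assumption) needed to detect
    a difference between proportions p1 and p2 with a two-tailed test."""
    if p1 == p2:
        raise ValueError("proportions must differ")
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_a * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return 2 * math.ceil(numerator / (p1 - p2) ** 2)


# Hypothetical counts (illustration only): 10 of 57 deformed at the impacted
# site and 8 of 154 at the reference site.
or_est, ci_low, ci_high = odds_ratio_ci(10, 47, 8, 146)
significant = ci_low > 1 or ci_high < 1   # OR of 1 lies outside the interval
n_total = total_sample_size(10 / 57, 8 / 154)
print(f"OR = {or_est:.2f} ({ci_low:.2f}-{ci_high:.2f}), total n = {n_total}")
```

The smaller the difference between the two estimated proportions, the larger the required total becomes, which is why a logarithmic y-axis is convenient for panel B.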