Post on 30-Apr-2023
A CRITICAL REVIEW OF THE MEAN
MEASURE OF DIVERGENCE AND
MAHALANOBIS DISTANCES USING
ARTIFICIAL DATA AND NEW
APPROACHES TO THE ESTIMATION OF
BIODISTANCES EMPLOYING
NONMETRIC TRAITS
Dr. E. Nikita
Fitch Laboratory, British School at Athens,
52 Souidias Street, 106 76 Athens, GREECE
AIM
Review the two most common distance measures for the calculation of biodistances based on nonmetric traits:
- Mean measure of divergence (MMD)
- Tetrachoric Mahalanobis D2 distance (TMD)
Propose two new approaches for the estimation of biodistances from nonmetric traits:
- Direct application of the Mahalanobis distance to ordinally recorded data (OMD)
- Approximation of the covariances of the Mahalanobis distance by Pearson correlation coefficients (RMD)
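As a rough illustration of the OMD idea, the sketch below applies the ordinary squared Mahalanobis distance directly to ordinal scores: the group mean vectors and a pooled covariance matrix are computed from the ranks themselves. The function name `omd_squared`, the pooling by degrees of freedom, and the requirement of complete data (no missing values) are assumptions made here for illustration, not details taken from the slides.

```python
import numpy as np

def omd_squared(x1, x2):
    """Squared Mahalanobis distance between two samples of ordinal scores.

    x1, x2: arrays of shape (n_individuals, n_traits) holding ordinal ranks.
    Assumes complete data and a non-singular pooled covariance matrix.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    # Difference between the group mean vectors of the ordinal scores
    diff = x1.mean(axis=0) - x2.mean(axis=0)
    # Pooled within-group covariance, weighted by degrees of freedom (assumption)
    c1 = np.atleast_2d(np.cov(x1, rowvar=False))
    c2 = np.atleast_2d(np.cov(x2, rowvar=False))
    pooled = ((n1 - 1) * c1 + (n2 - 1) * c2) / (n1 + n2 - 2)
    # Solve the linear system rather than inverting: diff^T C^-1 diff
    return float(diff @ np.linalg.solve(pooled, diff))
```

Solving the linear system instead of forming the inverse explicitly is a numerical convenience only; the result is the same quadratic form.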
DENTAL NONMETRIC TRAITS
Upper incisors – shovel-shaped
Lower molars – tuberculum sextum
Upper molars – Carabelli’s trait
Lower premolars – Tomes’ root
BIODISTANCE MEASURES IN SKELETAL
ANALYSIS OF NONMETRIC TRAITS
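\[ \mathrm{MMD} = \frac{1}{r}\sum_{i=1}^{r}\left\{(\theta_{1i}-\theta_{2i})^{2}-\frac{1}{n_{1i}+0.5}-\frac{1}{n_{2i}+0.5}\right\} \]
\[ D^{2}_{\mathrm{TMD}} = (\mathbf{z}_{1}-\mathbf{z}_{2})^{T}\,\mathbf{T}^{-1}_{\mathrm{pooled}}\,(\mathbf{z}_{1}-\mathbf{z}_{2}) \]
\[ D^{2}_{\mathrm{RMD}} = (\mathbf{z}_{1}-\mathbf{z}_{2})^{T}\,\mathbf{R}^{-1}_{\mathrm{pooled}}\,(\mathbf{z}_{1}-\mathbf{z}_{2}) \]
\[ D^{2}_{\mathrm{OMD}} = (\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})^{T}\,\mathbf{C}^{-1}_{\mathrm{pooled}}\,(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}) \]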
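To make the MMD expression concrete, here is a minimal sketch that averages, over r traits, the squared difference of angularly transformed trait frequencies minus the small-sample correction terms 1/(n + 0.5). The slides do not say which angular transformation was used; the Anscombe form below is one common choice and is an assumption, as are the function names `anscombe_theta` and `mmd`.

```python
import math

def anscombe_theta(k, n):
    """Anscombe angular transformation of a trait frequency k/n (radians).

    Assumption: the slides do not specify the transformation used.
    """
    return math.asin(1.0 - 2.0 * (k + 3.0 / 8.0) / (n + 3.0 / 4.0))

def mmd(k1, n1, k2, n2):
    """Mean measure of divergence between two samples.

    k1, k2: per-trait counts of individuals expressing the trait.
    n1, n2: per-trait counts of individuals scored for the trait.
    """
    r = len(k1)
    total = 0.0
    for a, na, b, nb in zip(k1, n1, k2, n2):
        d_theta = anscombe_theta(a, na) - anscombe_theta(b, nb)
        # Squared difference minus the correction term for each sample
        total += d_theta ** 2 - 1.0 / (na + 0.5) - 1.0 / (nb + 0.5)
    return total / r
```

Because the correction terms are subtracted, the value can fall below zero when two samples have nearly identical trait frequencies.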
LIMITATIONS
- Highly correlated and nondiagnostic traits should not be analysed using the MMD, which seriously reduces the data being analysed
- TMD should be used cautiously when the amount of missing data is large, which is problematic in archaeological samples
ADDITIONAL CONSIDERATIONS
How effective are these measures, from a statistical point of view, given the information lost when binary dichotomies are used to code an underlying continuous phenomenon, namely the genetic variation that gives rise to the expression of nonmetric traits?
MATERIALS
Artificial datasets that follow a multivariate normal distribution with a known variance-covariance matrix. Two variance-covariance matrices were used: one (CT) derived from cranial data from North African populations, and the other (DT) from dental data recorded on Cretan Early Bronze Age (EBA) samples.
For the transformation of each continuous dataset to a
binary one, the threshold value was the mean value. For the
transformation to four-rank ordinal data, the threshold
values were the first quartile, the median and the third
quartile.
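The two transformations described above can be sketched as follows. The thresholds here are computed column by column from the simulated sample itself, and the covariance matrix is made up for the example; both are assumptions, since the slides do not state whether sample or population thresholds were used and the CT and DT matrices are not reproduced here.

```python
import numpy as np

def to_binary(x):
    """Dichotomize each trait (column) at its mean: 1 if above the mean, else 0."""
    x = np.asarray(x, dtype=float)
    return (x > x.mean(axis=0)).astype(int)

def to_four_ranks(x):
    """Code each trait (column) into ranks 1-4 using its quartiles as thresholds."""
    x = np.asarray(x, dtype=float)
    q = np.quantile(x, [0.25, 0.5, 0.75], axis=0)  # shape (3, n_traits)
    # Rank = 1 + number of thresholds strictly below the value
    return 1 + (x[None, :, :] > q[:, None, :]).sum(axis=0)

# Hypothetical example: 200 individuals, 3 traits, made-up covariance matrix
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.3, 0.1],
                [0.3, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
continuous = rng.multivariate_normal(np.zeros(3), cov, size=200)
binary = to_binary(continuous)
ordinal = to_four_ranks(continuous)
```

With quartile thresholds each rank receives about a quarter of the individuals, so the ordinal coding keeps more of the underlying variation than the single cut at the mean.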
The effect of missing values was examined by introducing 20% missing values at random across all traits.
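One simple way to introduce missing values of this kind is sketched below. Each entry is masked independently with probability 0.2, so the proportion of missing values is 20% on average rather than exactly; that choice, and the helper name `add_missing`, are assumptions for illustration.

```python
import numpy as np

def add_missing(x, fraction=0.2, seed=None):
    """Return a float copy of x with roughly `fraction` of entries set to NaN.

    Each entry is masked independently, so all traits are affected at random.
    """
    rng = np.random.default_rng(seed)
    out = np.array(x, dtype=float)  # copy so the input stays untouched
    mask = rng.random(out.shape) < fraction
    out[mask] = np.nan
    return out
```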
BIODISTANCE COMPARISONS –
POPULATION DATA
DT matrix, (o) MD, (+) TMD, (x) RMD, () OMD,
and () MMD
• TMD tends to coincide with the MD
• RMD and OMD differ slightly from the MD
• MMD tends to exhibit the greatest divergence from the remaining measures
BIODISTANCE COMPARISONS –
SAMPLE DATA
• All distances perform satisfactorily, although the OMD appears to have the best performance
• TMD seems to deteriorate when the number of traits is large
Correlation coefficients between MD and
OMD (), RMD (o), TMD (+), and MMD (x) for
different datasets without missing values and
with r = 30 (A) and r = 10 (B) traits.
BIODISTANCE COMPARISONS –
FIXATION INDEX
All examined biodistances recover correct phenotypic Fst values when populations are compared.
When samples are analysed, all biodistances except RMD and TMD give accurate Fst values, provided that a correction for small sample sizes has been adopted.
• Matrix CT, r = 10 (A) and 30 (B)
• The distances have been calculated in the absence (+), (x), (Δ) and in the presence (o) of missing values
• Scaled Fst values are shown by (x) and (Δ)
P-VALUE COMPARISONS
To assess the statistical significance of the various biodistances, p-values can be calculated either using permutations or using test statistics, whenever these are available in the literature.
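The permutation route to a p-value can be sketched as follows: individuals are repeatedly reassigned to the two groups at random, the distance is recomputed each time, and the p-value is the share of permuted distances at least as large as the observed one. The `distance` argument is a hypothetical placeholder for any of the biodistances discussed here, and the number of permutations and the add-one convention in the final ratio are assumptions, not details taken from the slides.

```python
import numpy as np

def permutation_p_value(x1, x2, distance, n_perm=999, seed=None):
    """Permutation p-value for a distance between two samples.

    distance: any function taking two (n_individuals, n_traits) arrays
              and returning a number (hypothetical placeholder).
    """
    rng = np.random.default_rng(seed)
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    observed = distance(x1, x2)
    pooled = np.vstack([x1, x2])
    n1 = len(x1)
    count = 0
    for _ in range(n_perm):
        # Reassign individuals to the two groups at random
        idx = rng.permutation(len(pooled))
        d = distance(pooled[idx[:n1]], pooled[idx[n1:]])
        if d >= observed:
            count += 1
    # Include the observed arrangement itself (add-one convention, an assumption)
    return (count + 1) / (n_perm + 1)
```

The smallest p-value this can return is 1 / (n_perm + 1), so the number of permutations sets the resolution of the test.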
Continuous data – CMD/MD
• Sjøvold’s test statistic gives identical results to the permutation method
• Rao’s test statistic gives results divergent from the permutation method
• The chi-squared test statistic gives results convergent to permutations when the number of traits is small
Matrix CT, r = 30 and 10 (inset). p-values calculated from permutations (o) and the test statistics: Sjøvold (+), Rao (), and chi-squared (x)
Ordinal data – COMD/OMD
• Sjøvold’s test statistic gives results identical to the permutation method provided that there are no missing values
• In the absence of missing values, Rao’s test statistic and the chi-squared test statistic yield reliable results when the number of traits is small
• The presence of missing values affects the performance of all test statistics
Matrix DT, r = 30 and 10 (inset) without missing values (A), and with 20% missing values (B). p-values calculated from permutations (o) and the test statistics: Sjøvold (+), Rao (), and chi-squared (x)
Binary data – MMD
Satisfactory convergence between the p-values estimated from the test statistic proposed for MMD and those of the permutation method, regardless of the number of traits considered or whether or not missing values are present
P-values using permutations (o) and the test statistic proposed for MMD () on matrix CT with r = 10 (A) and r = 30 (B). Insets correspond to the same datasets with 20% missing values
Binary data – TMD/RMD
The p-values obtained from all test statistics used for Mahalanobis-type distances never converge with those obtained from permutations, regardless of the number of traits, sample sizes, presence or absence of missing values, or whether the traits are strongly correlated or not.
P-values for the (A) TMD and (B) RMD using matrix CP, r = 10. p-values calculated from permutations (o) and the test statistics: Sjøvold (+), Rao (), and chi-squared (x)
CONCLUSIONS
All biodistances give fairly accurate results when the number of traits is small.
When the number of traits is large, problems arise with the TMD and RMD, while MMD and COMD are not affected.
COMD exhibits the best performance under all conditions.
The presence of missing values does not significantly affect OMD and MMD, but it may affect TMD when the number of traits is large.