Posted on 11 May 2023
This is a preprint version of the paper accepted for publication in Language Learning.
Calculating the relative importance of multiple regression predictor variables using dominance
analysis and random forests
Atsushi Mizumoto (Kansai University, Osaka, Japan)
Abstract:
Researchers often make claims regarding the importance of predictor variables in multiple regression
analysis by comparing standardized regression coefficients (standardized beta coefficients). This
practice has been criticized as a misuse of multiple regression analysis. As a remedy, in this method
showcase article I highlight the use of dominance analysis and random forest, a machine learning
technique, to accurately determine predictor importance in multiple regression analysis. To demonstrate
the utility of dominance analysis and random forest, I reproduced the results of an empirical study
and applied these analytical procedures. The results reconfirmed that multiple regression analysis
should always be accompanied by dominance analysis and random forest to identify the unique
contribution of individual predictors while considering correlations among predictors. A web
application to facilitate the use of dominance analysis and random forest among second language
researchers is also introduced.
Keywords: multiple regression analysis; predictor importance; relative importance; dominance
analysis; random forest
Author Note:
I would like to thank Dr. Kristopher Kyle and Dr. Scott Crossley, Associate Editors of Language
Learning, and the four anonymous reviewers for their very helpful suggestions in improving the
quality of this paper. Writing this paper was supported by JSPS KAKENHI Grant Numbers
21H00553 and 20H01284. On a personal note, I would like to dedicate this paper to Dr. Hiroaki
“Keiroh” Maeda, who pointed out the misuse of standardized beta in interpreting the causal
relationship between predictor variables and the criterion variable in the early 2000s (Maeda, 2004).
Regrettably, he passed away on October 10, 2014, at the age of 40. Because I would not be who I am
today without his guidance, let me write this message to him here. “Keiroh-sensei, I’m sorry it took
so long to deliver your message. I hope you like it.”
1. The Problem
Multiple regression analysis has been used extensively in second language (L2) studies as an
intermediate-level statistical technique (Khany & Tazik, 2019). This correlation-based technique is
used to examine the relationship between multiple predictor variables (PVs, or the independent
variables) and a single criterion variable (CV, or the dependent variable) (Jeon, 2015). Its primary
purpose is to develop the best model, often with a linear equation, to predict the CV using a set of
PVs. In the process of generating the best fitting model with the highest explanatory power (i.e., the
amount of variance in the CV explained by the set of PVs: R2), it is also possible to assess the
relative importance or contribution of each PV to the overall effect (R2) by comparing the
standardized beta (β) coefficients. L2 researchers often make theoretical and pedagogical claims
based on the magnitude of the standardized beta coefficients, although their use as an index of
predictor importance is now considered a misuse of multiple regression analysis (Karpen, 2017).
This paper addresses this pervasive problem in L2 studies and highlights the use of dominance
analysis, one method of relative importance analysis, along with random forests as alternatives to
standardized beta coefficients to facilitate better interpretation of predictor importance.
The crux of this problem in L2 studies can be portrayed in the following example. Table 1
presents the correlation and standardized beta coefficients from a mock study (adapted from Maeda,
2004). In this hypothetical study (N = 100), we are interested in investigating the contribution of
specific skills to students’ speaking ability. Thus, the CV is “Speaking,” and the PVs are
“Vocabulary,” “Grammar,” “Writing,” and “Reading.” The overall variance explained (R2) is .47
(95% CI [.31, .57]). This means that 47% of the variance of the CV can be predicted from the PVs
(the predictive power of the model is 47%), and the remaining variance, or 53%, is unexplained by
the model.
We first examine the simple correlation coefficients, which are also called zero-order or
bivariate correlations, between the CV (i.e., Speaking) and each of the PVs. We can evaluate the
strength of the effect of each PV on the CV using simple correlation/regression because it is a
one-to-one comparison (see the left gray column in Table 1). It is also the starting point of any analysis.
However, this method does not consider the correlations among PVs that are likely present in reality.
Therefore, multiple regression is more appropriate and necessary to fully describe the relationship
between the CV and PVs while accounting for the shared variance among the PVs (Plonsky &
Oswald, 2017).
Table 1
Correlations and Standardized Beta Coefficients in a Mock Study (Adapted from Maeda, 2004)
Variables     Speaking  Vocabulary  Grammar  Writing  Reading      β        p
Speaking         ―
Vocabulary      .62         ―                                     .47    < .001
Grammar         .43        .67        ―                          -.19      .10
Writing         .56        .55       .67        ―                 .31      .02
Reading         .61        .72       .65       .79        ―       .15      .30

Note. N = 100, R2 = .47 (95% CI [.31, .57]).
The standardized beta coefficients (β) obtained from a multiple regression analysis (see the
right gray column in Table 1) describe the direct relationship between the CV and each PV while
controlling for the indirect effects of the other PVs. As the standardized beta coefficient of a PV
represents the mean change in the CV given a one-standard-deviation shift in the PV, it is often used
to compare the relative strengths of the PVs in the regression model. In the example above,
Vocabulary has the largest standardized beta coefficient (β = .47). Thus, a one standard deviation
increase in Vocabulary leads to an expected change of 0.47 standard deviations in Speaking (the CV)
while the effects of the other PVs are controlled (i.e., as if all intercorrelations among the PVs were
zero).
However, we notice that the magnitude of the standardized beta coefficients of Reading (β
= .15) and Grammar (β = -.19) differs in comparison to their correlation coefficients (r = .61 and .43,
respectively). This phenomenon is counterintuitive, but it is known as a suppression (or suppressor)
effect and often occurs in multiple regression analysis (Hair et al., 2019). A suppression effect arises
when one of the PVs (i.e., Grammar) is more strongly correlated with the other PVs (e.g.,
Vocabulary) than with the CV (Speaking).
Mathematically, the suppression effect is relatively easy to explain. In a situation where the CV
(Y) is explained by the two PVs (X1 and X2), the equation below shows how the standardized beta
coefficient for X1 can be derived from the correlations among them. The denominator of the equation
being the same (as there are only two predictors in this specific example), whenever the product of
the correlation between Y and X2 and the correlation between X1 and X2 exceeds the correlation
between Y and X1, the standardized beta coefficient (β1) will be negative despite an initially positive
association in the simple correlation with Y.
β1 = (r_YX1 − r_YX2 · r_X1X2) / (1 − r_X1X2²)
The same situation occurs with the regression result in Table 1: the negative standardized
beta coefficient for Grammar (β = -.19) suggests that Grammar has a negative effect on Speaking,
but this coefficient can be considered a statistical artifact. Put another way, when considering the effect of Grammar
on Speaking, much of the variance shared by these variables can be accounted for by the other PVs
that highly or moderately correlate with Grammar and Speaking. If the indirect effects of these PVs
are removed, what is left (i.e., correlations of the residuals) does not accurately show the variable
importance of Grammar. Without realizing such a statistical artifact, unwary researchers may ignore
PVs that have lower or negative standardized beta coefficients; this may result in a belief that the
highly weighted PVs that remain in the regression model are more important than the other PVs.
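The sign flip just described can be sketched numerically with the two-predictor formula. The correlation values below are hypothetical, chosen only to trigger suppression:

```python
# Two-predictor suppression: beta_1 = (r_YX1 - r_YX2 * r_X1X2) / (1 - r_X1X2^2).
# Hypothetical correlations: X1 relates positively to Y, but X2 correlates
# more strongly with both Y and X1.
r_y_x1 = 0.40   # simple correlation between Y and X1 (positive)
r_y_x2 = 0.60   # simple correlation between Y and X2
r_x1_x2 = 0.80  # correlation between the two predictors

beta_1 = (r_y_x1 - r_y_x2 * r_x1_x2) / (1 - r_x1_x2 ** 2)
print(round(beta_1, 3))  # -0.222: negative despite the positive r of .40
```

Even though X1 correlates at .40 with Y on its own, its standardized beta turns negative once the stronger X2 enters the model, which is exactly the artifact discussed above.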
Figure 1 further illustrates the relationship between the simple correlation coefficients and the
standardized beta coefficients in a path model (Oswald, 2021). As described in Figure 1, the
suppression effect persists and cannot be solved using standardized parameter estimates in structural
equation modeling (SEM), an advanced statistical technique that is well suited to modeling and
handling measurement errors. The same applies to SEM with latent variables (Maassen & Bakker,
2001).
Figure 1
The Relationship Between Correlations and Standardized Beta Coefficients in a Path Model
(Oswald, 2021). CC BY-SA 4.0.
Figure 1 also shows that the standardized beta coefficients are based on (a) all PV correlations
and (b) all PV-CV correlations. Taking the Table 1 data as an example, the correlation coefficient between
Speaking (CV) and Vocabulary (PV), which is r = .62, can be calculated using (a) and (b) as follows:
0.47 + (-0.19 * 0.67) + (0.31 * 0.55) + (0.15 * 0.72). If (a) all PV correlations are zero, then we only
have (b) all PV-CV correlations, a case in which standardized beta coefficients are literally
correlations, and when squared, they add up to R2 (Braun et al., 2019). However, PVs are almost
always correlated (i.e., non-zero), and thus the PVs overlap, yielding somewhat counterintuitive
standardized beta coefficients.
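This decomposition can be checked numerically. The sketch below (Python with NumPy, using only the rounded correlations reported in Table 1) recovers the standardized betas from the correlation matrices and reconstructs the Speaking-Vocabulary correlation from the beta-weighted sum above:

```python
import numpy as np

# Correlations among the PVs (Vocabulary, Grammar, Writing, Reading), Table 1.
R_xx = np.array([
    [1.00, 0.67, 0.55, 0.72],
    [0.67, 1.00, 0.67, 0.65],
    [0.55, 0.67, 1.00, 0.79],
    [0.72, 0.65, 0.79, 1.00],
])
# Correlations of each PV with the CV (Speaking).
r_xy = np.array([0.62, 0.43, 0.56, 0.61])

# Standardized betas are the solution of R_xx @ beta = r_xy.
beta = np.linalg.solve(R_xx, r_xy)
print(beta.round(2))  # [ 0.47 -0.19  0.31  0.15]

# Each PV-CV correlation is the beta-weighted sum of (a) and (b):
print(round(float(R_xx[0] @ beta), 2))  # 0.62 (Speaking-Vocabulary)

# R2 of the full model is beta' r_xy:
print(round(float(beta @ r_xy), 3))  # 0.474
```

Because the rounded correlations reproduce the reported betas and R2 almost exactly, this confirms that standardized betas are fully determined by the two sets of correlations.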
As described above, although the standardized beta coefficient can be interpreted in such a way
that it is expressed in a standardized unit, it is calculated based on the unrealistic assumption that the
variance of the other PVs is held constant for each combination (Jaccard & Daniloski, 2012).
Likewise, detecting multicollinearity (a state of very high correlations among the PVs in a multiple
regression model) with a diagnostic index such as the variance inflation factor (VIF) does not help,
because detection does not nullify the existence of correlations among PVs. As a result, in the process of
estimating the multiple regression equation that best predicts the CV, PVs that contribute less to the
best model (or equation) result in having lower standardized beta coefficient values even if they are
important variables (i.e., those with higher correlations with the CV). That is, one could easily
underestimate or overestimate the importance of variables if the standardized beta coefficient is used
for importance comparison. Note that suppression effects are also problematic for almost any other
form of relative importance indices. Partial or semi-partial correlations and changes in R2 also suffer
from the same limitation because they treat PVs as uncorrelated; as such, these indices do not
accurately reflect predictor importance (Karpen, 2017).
In the same line of discussion, standardized beta coefficients obtained from regularized
(penalized) regression, which uses alternative fitting procedures such as ridge regression (Hoerl &
Kennard, 1970), also do not make up an adequate predictor importance index. Regularized regression
is known to be effective for dealing with (a) correlated PVs and (b) a case where the number of
variables exceeds that of observations (cases) in the sample (Hair et al., 2019). In such a situation,
overfitting may occur. Overfitting (or overtraining) is a phenomenon in which a model fits too
closely the training data on which it was built and fails to replicate the same performance on test or
unseen data. In contrast to the ordinary least squares (OLS) estimates, on which a normal regression
model is based and which tend to suffer overfitting, regularized regression (also known as shrinkage
or constrained) methods shrink the estimated coefficients (e.g., standardized beta) toward zero, based
on a tuning parameter λ (lambda), relative to the OLS estimates. As a result, regularized regression
can yield better prediction accuracy and model interpretability than can the OLS estimates (James et
al., 2013). However, standardized beta coefficients obtained by applying regularized regression do
not dramatically differ from those calculated using the OLS estimates. In fact, standardized beta
coefficients applying ridge regression to the data presented in Table 1 are regularized, but they are
still similar to those using the OLS estimates (regularized betas for Vocabulary = .37; Grammar =
-.10; Writing = .24; Reading = .19; see the online supplementary material for the results of the ridge
regression). It can thus be concluded that regularized regression such as ridge regression may
increase prediction accuracy, but interpreting regularized standardized beta coefficients as an index
of predictor importance remains part of the problem, not a solution.
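This point can be sketched in code. The snippet below (Python with NumPy, using the rounded Table 1 correlations) computes ridge estimates in closed form on the correlation scale; λ = 0.5 is an arbitrary illustrative value, not the tuned penalty behind the supplementary result:

```python
import numpy as np

# PV correlations and PV-CV correlations from Table 1 (rounded values).
R_xx = np.array([
    [1.00, 0.67, 0.55, 0.72],
    [0.67, 1.00, 0.67, 0.65],
    [0.55, 0.67, 1.00, 0.79],
    [0.72, 0.65, 0.79, 1.00],
])
r_xy = np.array([0.62, 0.43, 0.56, 0.61])

beta_ols = np.linalg.solve(R_xx, r_xy)

# Ridge on standardized data: beta_ridge = (R_xx + lambda * I)^-1 r_xy.
lam = 0.5  # arbitrary penalty chosen for illustration
beta_ridge = np.linalg.solve(R_xx + lam * np.eye(4), r_xy)

print(beta_ols.round(2))    # OLS standardized betas
print(beta_ridge.round(2))  # shrunken toward zero

# Ridge always shrinks the coefficient vector as a whole toward zero:
print(np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols))  # True
```

The shrinkage changes the magnitudes but not the underlying entanglement of correlated PVs, which is why the regularized betas remain unsuitable as importance indices.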
The examples in this section clearly indicate that one cannot rely on standardized beta
coefficients to determine predictor importance. Unfortunately, the misuse of standardized beta
coefficients in multiple regression analysis to determine important predictors has continued to this
day in the L2 research community. Therefore, the purpose of this showcase article is to call for
greater attention to L2 researchers’ misuse of multiple regression analysis and to draw readers’
attention to simple alternatives, dominance analysis and random forest, each of which has been
suggested and shown to be effective in other fields. By using such alternatives, L2 researchers are
expected to be able to better interpret variable importance in multiple regression analysis.
In the following sections, first, the solution is described. I demonstrate the effectiveness of
applying the solution to real-world problems by presenting an empirical example. Next, I introduce
an R-based web application that will enable readers to trace the results and explore the analysis in a
hands-on manner.
R version 4.0.3 (R Core Team, 2021) was used in the analyses. To ensure reproducibility and
transparency in the data analysis (Larson-Hall & Plonsky, 2015), all data and R code used in this
study are available in the IRIS repository
(https://www.iris-database.org/iris/app/home/detail?id=york:940279) and the Open Science Framework
(https://osf.io/uxdwh/). Furthermore, the R code can be accessed on Google Colaboratory
(https://bit.ly/3kV45pf), an online cloud-based environment that requires no setup to use. Readers
can run and examine the code and obtain the same results without even installing R and related
packages on their computer.
2. Methods for Determining Predictor Importance
2.1. Relative Importance Metrics
As demonstrated in the previous section, the standardized beta coefficient is not a proper
measure of predictor importance. In the past decade in other research fields, researchers who
recognized this problem have increasingly turned to relative importance analysis as a supplement for
regression analysis (Tonidandel & LeBreton, 2011). Relative importance analysis, also known as
“key driver analysis” in marketing research (Garver & Williams, 2020), is an umbrella term for any
technique to uncover the contributions of correlated predictors in a regression model and estimate
their importance. Relative importance analysis “produces superior estimates of correlated predictors’
importance in both simulation studies and primary studies” (Karpen, 2017, p. 84).
Several researchers have compared a range of relative importance metrics. In addition to
standardized beta coefficients (beta weight) and zero-order correlation coefficients, Table 2
summarizes relative importance measures reported in a guidebook of variable importance (Nathans
et al., 2012). As these metrics represent different perspectives on the data structure in a regression
model, no single relative importance metric is sufficient to fully describe the relative importance of
variables. For this reason, researchers need to select and report an appropriate variable importance
metric that suits the purpose of their research.
2.2. Dominance Analysis
Among the relative importance metrics in Table 2, dominance analysis (Budescu, 1993) and
relative weight analysis (Johnson, 2000) are the most commonly used and reported as the
recommended procedures for identifying the relative importance of PVs (Stadler et al., 2017). In the
field of L2 research, Larson-Hall (2016) introduced a method to compute the relative importance
metric (i.e., dominance analysis) using the relaimpo package of R (Grömping, 2006). However,
relative importance analysis has yet to be utilized in L2 research except for a few recent reports (e.g.,
Kyle et al., 2021), primarily because researchers are unaware of it as a viable alternative to
appropriately estimate predictor importance. As discussed earlier, correlations among PVs cause the
problem of inflating or deflating the standardized beta coefficients. That is, in an unrealistic situation
in which PVs are not correlated at all, standardized beta coefficients are a precise measure of
predictor importance. However, for real-world data, dominance analysis or relative weight analysis
could be utilized to effectively address correlations among PVs.
Table 2
Metrics of Relative Importance, Their Purpose, and Other Characteristics
Metric: Zero-order correlation
Purpose: Identifies the magnitude and direction of the relationship between a PV and the CV without considering any other PVs
Effect: Direct
Other characteristics: See Figure 1

Metric: Standardized beta
Purpose: Identifies the contribution of each PV to the CV while holding all other PVs constant
Effect: Total
Other characteristics: Equivalent to the correlation if PVs are uncorrelated

Metric: Product measure (Pratt measure)
Purpose: Partitions the regression effect into nonoverlapping partitions by multiplying the beta of each PV with its respective correlation with the CV
Effect: Direct, Total
Other characteristics: Sums to total R2

Metric: Structure coefficient
Purpose: Identifies how much variance in the predicted scores of the CV can be attributed to each PV
Effect: Direct
Other characteristics: Can identify a suppressor

Metric: Commonality analysis
Purpose: Identifies how much variance in the CV is uniquely explained by each possible PV set; yields unique and common effects that, respectively, identify variance unique to one PV and variance shared by multiple PVs
Effect: Total
Other characteristics: Sums to total R2; can identify a suppressor or multicollinearity

Metric: Dominance analysis
Purpose: Indicates whether one PV contributes more variance than another PV, either (a) across all possible submodels (i.e., complete dominance) or (b) on average across models of all-possible-subset sizes (i.e., conditional dominance); averaging conditional dominance weights yields general dominance weights
Effect: Direct, Total, Partial
Other characteristics: Sums to total R2; can identify a suppressor

Metric: Relative weight
Purpose: Identifies variable importance based on a method that addresses multicollinearity by creating uncorrelated “counterparts” of the variables
Effect: Total
Other characteristics: Sums to total R2

Note. Adapted from Hair et al. (2019), Nathans et al. (2012), and Nimon and Oswald (2013). PV = predictor variable (independent variable); CV = criterion variable (dependent variable); Direct (effect) = variable importance when measured in isolation from other PVs; Total (effect) = variable importance when contributions of all other PVs have been accounted for; Partial (effect) = variable importance when contributions to regression models of a specific subset or subsets of other PVs have been accounted for. All-possible-subsets regression is not included because it is not strictly a measure of relative importance; instead, it serves as a base for commonality and dominance analysis (Nimon & Oswald, 2013).
Dominance analysis (DA), also known as Shapley value regression, estimates the relative
importance of predictors by examining the change in R2 of the regression model from adding one
predictor to all possible combinations of the other predictors. For example, if the model has three
predictors (A, B, and C), the possible combinations are: (1) No predictors other than the intercept (R2
= 0), (2) A only, (3) B only, (4) C only, (5) A and B, (6) A and C, (7) B and C, and (8) A, B, and C.
By calculating the weighted average of the increments in R2 across those combinations, the general
dominance weight of each predictor can be obtained. The general dominance weight thus directly reflects the predictor’s importance by itself and
in combination with the other predictors in the model. DA is computationally intensive, especially
when more than 10 predictors are involved, because the computational demand increases
exponentially as the number of predictors increases (Tonidandel & LeBreton, 2011).
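The subset enumeration above can be sketched directly in code. This Python sketch mirrors the logic of general dominance weights (not the relaimpo/yhat implementation used later in this article) and needs only the rounded Table 1 correlations, since the R2 of any predictor subset can be derived from them; the resulting weights land close to the dominance weights reported in Table 3:

```python
import itertools
from math import factorial

import numpy as np

# General dominance (Shapley) weights from the Table 1 correlations.
# The R2 of any predictor subset S is r_S' R_SS^-1 r_S, so raw data
# are not needed for this sketch.
names = ["Vocabulary", "Grammar", "Writing", "Reading"]
R_xx = np.array([
    [1.00, 0.67, 0.55, 0.72],
    [0.67, 1.00, 0.67, 0.65],
    [0.55, 0.67, 1.00, 0.79],
    [0.72, 0.65, 0.79, 1.00],
])
r_xy = np.array([0.62, 0.43, 0.56, 0.61])
p = len(names)

def r2(subset):
    """R2 of the regression of the CV on the given predictor subset."""
    if not subset:
        return 0.0
    idx = list(subset)
    return float(r_xy[idx] @ np.linalg.solve(R_xx[np.ix_(idx, idx)], r_xy[idx]))

weights = {}
for j in range(p):
    others = [k for k in range(p) if k != j]
    total = 0.0
    for size in range(p):
        for s in itertools.combinations(others, size):
            # Shapley weighting: average the R2 increment over all subset sizes.
            w = factorial(size) * factorial(p - size - 1) / factorial(p)
            total += w * (r2(s + (j,)) - r2(s))
    weights[names[j]] = total

full_r2 = r2(tuple(range(p)))
for name in names:
    print(f"{name}: {weights[name]:.3f} ({weights[name] / full_r2:.2%})")
print(f"Sum = {sum(weights.values()):.3f} (equals R2 = {full_r2:.3f})")
```

A convenient property visible here is efficiency: the weights sum exactly to the full-model R2, which is what makes them interpretable as a partition of explained variance.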
To overcome this computational expense, relative weight analysis (RWA) was developed as an
alternative to DA. RWA and DA have been reported to produce extremely similar results when
applied to the same data (Lebreton et al., 2004). RWA transforms the correlated PVs into new
orthogonal (uncorrelated) variables that are, at the same time, maximally correlated with the original
PVs using a principal components approach (Johnson, 2000). These transformed counterparts of PVs
are then used for regression to predict the CV. Finally, relative weights are computed by rescaling
the standardized beta coefficients calculated in the final step to the original variables. As the new
orthogonal variables are not correlated with each other, the standardized beta coefficients (i.e.,
relative weights) obtained in this way are directly comparable to assess predictor importance.
Of the two metrics, DA and RWA, I will focus on DA as a recommended relative importance
method in this showcase article, as suggested by Braun et al. (2019) and Stadler et al. (2017). This is
because RWA was considered a viable alternative to DA but has now been heavily criticized on
mathematical grounds (Thomas et al., 2014). In addition, considering the fact that RWA was
developed to approximate the results of DA, if there is no problem with computational expense, DA
should be the first choice for researchers. In fact, there is enough computational power today to
perform DA in a typical regression analysis, and DA can also be applied to determine the relative
importance of predictors in hierarchical linear models (HLM), a more advanced form of traditional
linear regression models (Luo & Azen, 2013).
To provide a non-mathematical description of DA, I will explain it using data from the mock
study in Table 1. Table 3 shows the results of applying DA to the data presented in Table 1. The
correlations (r) and standardized beta coefficients (β) are copied, so they are the same as those in
Table 1. The next column shows “Dominance Weight.” Dominance weights can be compared to one
another as R2 to examine which variable contributes more than the others to the entire regression
model. The sum of dominance weights is equal to R2 (R2 = .474). This feature is informative because
it can be interpreted as an estimate of relative effect size (Tonidandel & LeBreton, 2015), which is
not the case for standardized beta coefficients. The dominance weights also have rescaled weights
calculated from the metric of the proportion of explained variance (e.g., Dominance weight of
Vocabulary: 0.176 / 0.474 × 100 = 37.13). As these weights are expressed as percentages, they can
be easily interpreted to investigate which variable contributes more/less to the entire regression
model. The next column, “95% CI,” reports the lower and upper bounds of confidence intervals (CI)
for the dominance weights of all PVs. CIs are calculated based on bootstrapping procedures using the
boot package (Canty & Ripley, 2021) and the yhat package (Nimon et al., 2021). The 95% CI values
indicate that if samples are extracted from the population 100 times and, for each time, the
confidence interval is computed, approximately 95 of the 100 confidence intervals would include the
population parameter (in this case, the dominance weight). The next column “Rank” shows the rank
order of PVs in accordance with the magnitude of their dominance weight. It should be noted that,
since dominance weights can never be zero or negative because they are scaled according to the
variance explained (R2), the confidence intervals tend to be positively skewed (Tonidandel &
LeBreton, 2015).
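To illustrate how such bootstrapped intervals arise, the sketch below simulates raw data (N = 100) from a population whose correlations mirror Table 1 and computes percentile intervals in Python; this is an illustration of the idea only, whereas the analysis in this article relies on the boot and yhat packages in R:

```python
import itertools
from math import factorial

import numpy as np

rng = np.random.default_rng(1)

# Population correlation matrix: Speaking, Vocabulary, Grammar, Writing, Reading.
R_full = np.array([
    [1.00, 0.62, 0.43, 0.56, 0.61],
    [0.62, 1.00, 0.67, 0.55, 0.72],
    [0.43, 0.67, 1.00, 0.67, 0.65],
    [0.56, 0.55, 0.67, 1.00, 0.79],
    [0.61, 0.72, 0.65, 0.79, 1.00],
])
data = rng.multivariate_normal(np.zeros(5), R_full, size=100)  # simulated N = 100

def dominance_weights(sample):
    """General dominance weights of the four PVs for one (re)sample."""
    corr = np.corrcoef(sample, rowvar=False)
    r_xy, R_xx = corr[0, 1:], corr[1:, 1:]
    p = len(r_xy)
    def r2(idx):
        if not idx:
            return 0.0
        idx = list(idx)
        return float(r_xy[idx] @ np.linalg.solve(R_xx[np.ix_(idx, idx)], r_xy[idx]))
    out = np.zeros(p)
    for j in range(p):
        others = [k for k in range(p) if k != j]
        for size in range(p):
            for s in itertools.combinations(others, size):
                w = factorial(size) * factorial(p - size - 1) / factorial(p)
                out[j] += w * (r2(s + (j,)) - r2(s))
    return out

# Percentile bootstrap: resample rows with replacement, recompute the weights.
boot = np.array([dominance_weights(data[rng.integers(0, len(data), len(data))])
                 for _ in range(1000)])
lower, upper = np.percentile(boot, [2.5, 97.5], axis=0)
for name, lo, hi in zip(["Vocabulary", "Grammar", "Writing", "Reading"], lower, upper):
    print(f"{name}: 95% CI [{lo:.3f}, {hi:.3f}]")
```

Because every R2 increment is nonnegative, the resampled weights are bounded below by zero, which is why the resulting intervals tend to be positively skewed, as noted above.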
The R code used to obtain the results presented here is based on Nimon and Oswald (2013),
and it enables users to compute and compare a range of relative importance indices to one another
(such as those shown in Table 2) along with their associated bootstrapped confidence intervals. The
online supplementary material of this article lists all such results.
Table 3
Correlations, Standardized Beta Coefficients, and Corresponding Dominance Weights
Variables     r      β     Dominance Weight (%)   95% CI Lower   95% CI Upper   Rank
Vocabulary   .62    .47*   .176 (37.13%)              .090           .267         1
Grammar      .43   -.19    .052 (10.97%)              .029           .104         4
Writing      .56    .31*   .114 (24.05%)              .056           .184         3
Reading      .61    .15    .132 (27.85%)              .063           .214         2
Total                      .474 (100%)

Note. The criterion variable is Speaking. N = 100, R2 = .47 (95% CI [.31, .57]), *p < .05 (see Table 1 for exact p values). The dominance weights are shown to three decimal places so that they sum accurately to the total R2.
As for dominance weight rankings, the rank order of weights contains sampling error variance
in the same way that the dominance weights themselves contain those errors. This causes the rank
order to change easily, thereby leading to unstable estimates of the magnitudes and rank orderings of
the dominance weights (Braun et al., 2019). As researchers often rank order PVs by their dominance
weights and treat PVs with higher ranks as more important, the existence of sampling error is a
caveat to remember when presenting and interpreting the rank order of weights. Thus, Braun et al.
recommend reporting the average rank order obtained across bootstrapped replications along with
associated 95% confidence intervals (see Braun et al., 2019 for more details).
By inspecting the dominance weights, it can be seen that the dominance weight of the PV
“Vocabulary” (.176 / 37.13%) shows a balanced importance that partitions the whole variance
explained (R2) while considering the correlations among PVs. The inconsistency in the order of
importance when interpreting importance using correlations and standardized beta coefficients is
solved with dominance weight; in this case, Reading (.132 / 27.85%) contributes a little more than
Writing (.114 / 24.05%) to Speaking. Grammar, which was given a negative standardized beta
coefficient (β = -.19), can be safely interpreted as the variable that contributes the least to the model,
given that it has the lowest dominance weight (.052 / 10.97%) among all the PVs. It is worth noting
here that the fact that Grammar has the lowest dominance weight is unrelated to the fact that its
corresponding β is negative.
Figure 2 shows the dominance weights and their corresponding 95% CIs in descending order
from the PV with the largest dominance weight (i.e., Vocabulary) to that with the smallest
dominance weight (i.e., Grammar).
Figure 2
Dominance Weights and Corresponding 95% Confidence Intervals
Note. Horizontal error bars show 95% confidence intervals computed from 10,000 bootstrapped
replications. * indicates that the confidence interval does not contain 0 (p < .05).
As CIs for all pairs of PVs are comparable to each other, these data can be used to assess statistical
difference (the alpha level was set at 0.05). That is, it is possible to investigate which variable, of the
two PVs, has a statistically larger dominance weight. In this case, Vocabulary and Reading have
larger dominance weights than does Grammar. For other combinations, a statistically significant
difference was not found, indicating that the dominance weights of pairs of Vocabulary-Reading,
Vocabulary-Writing, Reading-Writing, and Writing-Grammar do not differ substantially and that
their CIs overlap to the extent that their weights are similar in magnitude.
As can be observed in this example, researchers can make a balanced judgment of the
importance of predictors by employing DA. Compared to standardized beta coefficients, which are a
“flawed” measure of predictor importance, DA allows for a more intuitive and substantive
interpretation of predictor importance.
Although DA can provide researchers with additional information on predictor importance,
which is otherwise inaccessible in traditional multiple regression analysis, it is not a cure for issues
inherent in multiple regression analysis and, thus, has some limitations (Johnson, 2000). First, DA
cannot account for sampling and measurement errors, as is true of other analyses (Braun et al., 2019;
Tonidandel & LeBreton, 2011). Second, although DA was invented to deal with correlations among
PVs, it cannot rectify multicollinearity when it is caused by two or more variables measuring the
same construct (i.e., construct redundancy) (Stadler et al., 2017). Third, DA is not a replacement for
multiple regression analysis for selecting the best set of PVs for the regression formula. That is, DA
is an indispensable supplement to multiple regression analysis for determining predictor importance,
but not the other way around. Finally, DA cannot be used as a tool to provide evidence of causation
(Garver & Williams, 2020). With these limitations in mind, DA should be used as a complement to
the standardized beta coefficients in multiple regression analysis.
2.3. Feature Selection with Machine Learning
Another approach to determining predictor importance is to utilize feature selection methods
from machine learning. Machine learning is a subfield of artificial intelligence (AI) that uses data
and algorithms to imitate the way humans learn with the goal of gradually improving the accuracy of
automated systems. Feature selection is a critical task in machine learning because an explicit model
equation, such as that available in multiple regression analysis (i.e., Y = B0 + B1X1 + B2X2 + B3X3
+ . . . + BnXn + e), is not available in this case. Thus, variable importance is the only means of
investigating what drives the model and what is inside the “black box” (Grömping, 2015).
Variable importance metrics from machine learning methods can be computed using the
function “varImp” of the caret package in R (Kuhn, 2021). The name “caret” is an acronym for
“classification and regression training.” The caret package implements over 200 machine learning
models using other R packages. It streamlines and automates standard processes of machine learning
tasks such as classification and regression. Another popular and newer R package focusing on
machine learning is mlr3 (Lang et al., 2019). This package is known as an “ecosystem” because it is
a unified, object-oriented framework designed to accommodate numerous machine learning tasks,
more so than is the caret package, with a variety of learners (i.e., algorithms), feature and model
selection tools, and model assessment capabilities, all of which are supported by advanced
visualization tools.
Of the hundreds of machine learning models available, I describe random forest (Breiman,
2001) in this article. This is because random forest has been suggested as an approach that produces
more accurate estimates of predictor importance than do standardized beta coefficients (Karpen,
2017). Indeed, random forest is the machine learning method for which variable importance is best
researched (Grömping, 2015). Furthermore, random forests have been reported to outperform other
competing data classification models in terms of accuracy (Fernandez-Delgado et al., 2014) and in
selecting critical features (Chen et al., 2020) in large-scale simulation studies.
While the use of dominance analysis alone for the purpose of improving the interpretation of
predictor importance, as demonstrated in the previous section, may be sufficient to obtain valuable
information, additional use of random forest enables researchers to gain further information on
predictor importance. First, by using random forest, researchers can double-check the result of
predictor importance obtained through dominance analysis. If the results (or predictor rankings) of
the dominance analysis and random forest differ, it is plausible that the data being analyzed do not
meet the assumptions of multiple regression analysis. Since random forest is a non-parametric
machine learning model, if assumptions for multiple regression (i.e., linear model) are not met, this
method will provide more accurate results (Liakhovitski et al., 2010). Second, by using random
forest, researchers can obtain another piece of information for judging whether a variable is indeed
important or not from a different perspective. Dominance analysis and random forest are derived
from two completely different analytical paradigms, and as such, their algorithms are unrelated in
every respect (e.g., whereas explained variance is the key determinant of predictor importance in
regression, random forest derives importance from repeated random splits across an ensemble of
decision trees). That being the case, random forest
can be quite useful as an adjunct to dominance analysis, as the results in most cases agree, and the
two approaches can support each other when closely examining predictor importance. Furthermore,
with the simulation approach adopted in this study (see the later part of this section), it is possible to
determine which predictor is useful and meaningful for predicting the outcome. For these reasons, in
this article, I suggest using random forest in addition to dominance analysis.
Random forest is a non-parametric machine learning method for both classification (i.e.,
categorical outcome variable) and regression (i.e., continuous outcome variable). It is an evolved
version of a decision tree. As its name suggests, random forest builds many decision trees by
repeatedly drawing random bootstrap samples of the observations and then aggregating the results in
a procedure known as bagging (bootstrap aggregating), a form of ensemble learning (see Figure 3).
Because it uses aggregations of data, the random forest method outperforms a single decision tree,
and it is robust to overfitting. Nevertheless, if the data set is too small or contains too much noise,
overfitting can be an issue. Random forest also offers methods for handling missing
data and for balancing errors in datasets in which classes are imbalanced. In short, random forest models
tend to be more accurate and generalizable than classic decision trees.
One feature that makes random forest unique and distinctive as a tool for detecting
variable importance is split-variable randomization: every time a split is to be performed, only a
limited subset of predictors is considered at random, and the best predictor in that subset is chosen. In
this way, the importance of each predictor is assessed in all steps (i.e., nodes and splits) in the
process of building trees and forests. Then, estimates of the relative importance of each predictor
variable can be obtained through averaging the values across all trees for each predictor. For this
reason, variable importance calculated from random forest is accurate, and correlations among
predictors do not greatly influence the estimate of variable importance in random forest, in contrast
to multiple regression analysis. However, very highly correlated predictors may still distort the
accuracy and estimates of variable importance. After all, variable importance is conditional on all the
variables in the model and the nature of the model itself.
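The mechanics just described, namely bootstrap sampling of the observations, split-variable randomization at each split, and the averaging of split gains across trees, can be sketched in a short program. The following Python sketch uses single-split trees (stumps) for brevity; it illustrates the idea only, it is not the randomForest implementation used in the analyses reported here, and all function names are invented for this example.

```python
import random
from statistics import mean

def sse(ys):
    """Sum of squared errors around the mean (the split criterion)."""
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)

def fit_stump(X, y, mtry):
    """Fit a single split. Split-variable randomization: only a random
    subset of mtry predictors is considered, and the best split among
    them is chosen."""
    best = None  # (gain, feature, threshold, left_mean, right_mean)
    for j in random.sample(range(len(X[0])), mtry):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            gain = sse(y) - sse(left) - sse(right)
            if best is None or gain > best[0]:
                best = (gain, j, t, mean(left), mean(right))
    return best

def random_forest(X, y, n_trees=50, mtry=1, seed=0):
    """Bagging: each stump is fit on a bootstrap sample of the rows;
    per-predictor split gains are accumulated across all trees as a
    crude importance estimate."""
    random.seed(seed)
    stumps, importance = [], [0.0] * len(X[0])
    n = len(X)
    for _ in range(n_trees):
        idx = [random.randrange(n) for _ in range(n)]  # bootstrap rows
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx], mtry)
        if stump:
            importance[stump[1]] += stump[0]
            stumps.append(stump)

    def predict(row):
        # Regression: average the predictions of all trees.
        return mean(lm if row[j] <= t else rm
                    for _, j, t, lm, rm in stumps)

    return predict, importance
```

Because only a random subset of predictors competes at each split, correlated predictors each still receive splits (and thus importance credit) in some trees, which is one reason correlations among predictors distort random forest importance less than they distort standardized beta coefficients.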
Figure 3
How Random Forest Works
[Figure: the training data are bootstrapped into random samples 1 through m; each sample is
modeled as a decision tree (tree1, tree2, …, treem); the trees’ results are combined by voting in
classification or by averaging in regression to produce the predicted data.]
The most important thing to consider in a machine learning method such as random forest is
how much prediction accuracy will be achieved. To achieve a higher level of prediction accuracy in
machine learning, hyperparameter tuning is vital. A hyperparameter is a value, or a model argument,
to be set before the machine learning process begins. In random forest, hyperparameters include
values such as tree depth, the number of trees, and the number of predictors sampled at each split.
Improper hyperparameter tuning can lead to overfitting. If overfitting occurs, the prediction accuracy
of unknown data will decrease. Therefore, a method called cross-validation is always necessary to
avoid overfitting and to evaluate the prediction accuracy.
Cross-validation is a method of randomly dividing sample data into training data and test data,
creating a model with the training data, and then using the test data to evaluate the prediction
accuracy of the model. The most basic method is called the hold-out (or validation set) method. This
is a simple method that first splits the training and test data in a ratio of 8:2 or 7:3, typically with
more training data than the test data, and then validates the results. The drawback of this method is
that, depending on how the data are split, the results may differ, and the accuracy may decrease if the
sample size is small. Leave-one-out cross-validation (LOOCV) addresses the hold-out method’s
limitation. In the LOOCV method, one sample is used as the test data, and the rest of the sample is
used as the training data to build the model. This process is then repeated as many times as the
number of items in the sample, and the average of all the results is calculated. LOOCV can be used
when the sample size is small, but it could be computationally expensive when the sample size is
large (although it is not too much of a problem with modern computer performance). An alternative
to LOOCV is the k-fold cross-validation method. In the k-fold method, the sample is divided into k
groups of approximately equal size, and the first group is used as the test data with the remaining k-1
groups being used as training data. This method is repeated k times, and the estimates are cross-validated by
averaging all the results. Compared to LOOCV, k-fold requires less computation and is reported to
be as accurate as LOOCV (James et al., 2013; Jarvis & Crossley, 2012).
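The cross-validation schemes above differ only in how the sample is partitioned; a minimal Python sketch of k-fold partitioning, with LOOCV as the special case k = n, might look as follows (the function name is invented for illustration).

```python
import random

def k_fold_splits(n, k, seed=None):
    """Divide indices 0..n-1 into k groups of approximately equal size.
    Each group serves once as the test set, with the remaining k - 1
    groups as training data. Setting k = n gives LOOCV."""
    idx = list(range(n))
    if seed is not None:
        random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [([j for g, fold in enumerate(folds) if g != i for j in fold],
             folds[i])
            for i in range(k)]
```

The hold-out method corresponds to drawing a single split, typically 8:2 or 7:3, just once.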
Random forest evaluates prediction accuracy (or error) using an approach called OOB (out-of-
bag) error estimation, which is different from the abovementioned cross-validation methods. In
random forest, each decision tree is created by resampling all the data with some overlap of the data
being allowed (as shown in Figure 3), and the extracted data are used as training data for creating a
decision tree. The tree-building process is repeated with the aforementioned resampling method,
bootstrapping (LaFlair et al., 2015). Every time bootstrapping is executed, unused data (about 1/3 of
the sample) always remain in the original data. This is called the out-of-bag (OOB) data. The OOB
data can then be used to check the accuracy of the prediction performance. It has been reported that
the OOB estimate produces a similar error estimate to other cross-validation methods with less
computation (Breiman, 2001). Cross-validation is, in a sense, built into random forest, and there is no
need for further cross-validation, except when comparing prediction performance with other machine
learning algorithms.
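The "about 1/3" figure follows from the bootstrap itself: the probability that a given observation is never drawn in n draws with replacement is (1 - 1/n)^n, which approaches 1/e ≈ .368 as n grows. A small Python sketch (function name invented) makes this concrete:

```python
import random

def oob_fraction(n, seed=0):
    """Draw one bootstrap sample of size n (sampling rows with
    replacement) and return the fraction of original observations that
    were never drawn, i.e., the out-of-bag (OOB) data."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n) for _ in range(n)}
    return 1 - len(in_bag) / n
```

For any reasonably large n, the returned fraction hovers around 1/e ≈ .368, the "about 1/3" of the sample left out of each tree's training data.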
This is why, to ensure the appropriateness of random forest in the next empirical illustration, I
examined MAE (mean absolute error), RMSE (root mean squared error), and R2 for the empirical
example to confirm that the prediction of the random forest model was better than or almost equal to
that of other equally powerful machine learning models that are often compared in the literature (e.g.,
neural networks, support vector machines [SVM], and eXtreme Gradient Boosting [XGBoost]) by
performing cross-validation (see online supplementary material for details).
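For reference, the three accuracy metrics used in this comparison are straightforward to define; a minimal Python sketch:

```python
from math import sqrt
from statistics import mean

def mae(actual, predicted):
    """Mean absolute error: average magnitude of prediction errors."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))

def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more heavily."""
    return sqrt(mean((a - p) ** 2 for a, p in zip(actual, predicted)))

def r_squared(actual, predicted):
    """Proportion of variance in the observed values that the
    predictions explain."""
    m = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - m) ** 2 for a in actual)
    return 1 - ss_res / ss_tot
```

Lower MAE and RMSE and higher R2 indicate better predictive performance, which is how competing models are compared under cross-validation.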
One caveat should be mentioned regarding the use of the estimates of variable importance from
random forest as a companion to the standardized beta coefficients in multiple regression analysis.
Researchers should verify whether the use of random forest is valid. Although random forest is
reported to be one of the best classifiers in machine learning (Fernandez-Delgado et al., 2014), this
was not the case in Jarvis (2011), in which linear discriminant analysis outperformed random forest
in terms of accuracy. Depending on the data used for machine learning, there may be better
algorithms. For this reason, the accuracy of the prediction model should always be cross-validated,
and the variable importance metric specific to the selected model should be reported.
Figure 4 displays the variable importance plot of random forest using data from the mock study
(Table 1). The ordinary random forest variable importance does not show which predictor is useful
and meaningful for predicting the outcome. Thus, in this article, I used the Boruta package in R
(Kursa & Rudnicki, 2010) to conduct a more accurate estimation of variable importance. Boruta is
a novel feature ranking and selection algorithm based on random forest. It runs random forest many
times (maximum 100 times by default), which is why the results are represented by the boxplots in
Figure 4. In the same way that random forest is much more accurate than a single decision tree, the
Boruta algorithm in general yields more precise estimates of predictor importance than does an
ordinary random forest procedure. Boruta has rarely been used in L2 studies, but it can be utilized to
select important variables to be included in a regression model, as is reported in Crosthwaite et al.
(2020).
Figure 4
Variable Importance Plot Obtained from Random Forest (Using the Boruta Algorithm)
A comparison of Figure 4 with Table 3 (and Figure 2) shows that the result of random forest
(Boruta) corroborates that of dominance analysis. That is, Vocabulary is the most important predictor,
followed by Speaking and Writing, which are very similar in importance. Grammar is the least
important predictor, with the lower limit of the 95% confidence interval being close to zero in Figure
4. Note that “shadowMax,” “shadowMean,” and “shadowMin” (boxplots in blue) are created by
randomly shuffling the original predictor values; thus, they are “nonsense” variables. If any predictor
variable importance is below that of one of those shadow variables, the predictor is judged as
unimportant. Predictors (“Attributes” in Figure 4) that are significantly better than the shadow
variables by a binomial statistical hypothesis test are marked as “Confirmed” and can be regarded as
important predictors. In this case, all variables are “Confirmed” (box plots in light green). Some
predictors may not be subject to a clear decision and are marked as “Tentative.” In this way, the
Boruta algorithm makes the decision regarding predictor importance much easier with reference to
the unique shadow variables.
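The shadow-variable idea at the heart of Boruta is simple to sketch: each predictor column is copied and randomly shuffled, so the copy retains the predictor's distribution but loses any genuine relation to the outcome. A minimal Python illustration (function name invented; the actual algorithm is implemented in the Boruta R package):

```python
import random

def add_shadow_features(X, seed=0):
    """Append a shuffled ('shadow') copy of every predictor column.
    A real predictor is flagged as important only if its importance
    reliably exceeds that of the best-performing shadow."""
    rng = random.Random(seed)
    p = len(X[0])
    shadows = []
    for j in range(p):
        col = [row[j] for row in X]
        rng.shuffle(col)  # destroys the column's link to the outcome
        shadows.append(col)
    return [list(row) + [shadows[j][i] for j in range(p)]
            for i, row in enumerate(X)]
```

Boruta repeats this construction over many random forest runs, which is why the importance of each predictor (and each shadow) is summarized as a boxplot rather than a single value.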
3. Empirical Illustration
In the previous section, I presented dominance analysis and random forest and described their
superiority over standardized beta coefficients for determining the relative importance or
contribution made by each of the PVs with respect to the CV. Since the example provided was based
on a mock study, how useful dominance analysis and random forests are for real-world data was not
verified; therefore, in this section, I have applied dominance analysis and random forest to an already
published study through a reproduction of data analysis. By doing so, I demonstrate how dominance
analysis and random forest can be utilized in addition to conventional multiple regression analysis
to augment the interpretation of predictor importance.
3.1. Selection of the Primary Study
After many failed attempts to reproduce the results of multiple regression analysis (see the
reason below), the recent study by Goh et al. (2020) was selected. These analyses could be
reproduced because the authors of this primary study properly reported the necessary details to
reproduce the results of multiple regression analysis (Plonsky & Ghanbar, 2018); namely, they
reported the following information:
(1) Correlation matrix (for all correlations between the PVs and CV)
(2) Sample size
(3) Means
(4) Standard deviations
A correlation matrix is always necessary to retrieve the results of the multiple regression
analysis. Some studies have reported only part of the whole correlation matrix (e.g., only the
correlations between the PVs and not those with the CV) or non-parametric correlations (e.g., Hessel,
2015); this partial reporting makes it impossible to retrieve the results reported with parametric
procedures. The sample size has a direct impact on the accuracy of the estimation. For example,
confidence intervals become narrower as the sample size increases and wider as the sample size
decreases. This should always be reported to make statistical inferences possible. If the means and
standard deviations are reported, the unstandardized coefficient (B) can be computed; for this reason,
reporting them is good practice (Norris et al., 2015; Plonsky & Ghanbar, 2018).
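Concretely, the unstandardized coefficient for predictor j is B_j = beta_j * (SD_Y / SD_Xj), and the intercept then follows from the means. A minimal Python sketch (function name invented):

```python
def unstandardize(betas, sd_y, sds_x, mean_y, means_x):
    """Recover unstandardized coefficients B_j = beta_j * (SD_Y / SD_Xj)
    and the intercept B_0 = mean_Y - sum(B_j * mean_Xj) from the
    standardized betas, standard deviations, and means."""
    bs = [beta * sd_y / sd_x for beta, sd_x in zip(betas, sds_x)]
    intercept = mean_y - sum(b * mx for b, mx in zip(bs, means_x))
    return intercept, bs
```

This is why reporting the means and standard deviations alongside the correlation matrix makes the full regression equation recoverable.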
In the process of reproducing published study results from reported descriptive statistics, I
realized that the problem of low reproducibility stemmed from the fact that, too often, raw data
and analysis code are not published in a public repository such as IRIS (Marsden et al., 2016).
Although provision of data and code has been encouraged in the L2 research field in recent years to
promote transparency (Larson-Hall & Plonsky, 2015; Marsden et al., 2019), many L2 researchers
may be reluctant to share their raw data (Plonsky et al., 2015). For instance, a recent study by Nicklin
and Plonsky (2020) once again highlighted the need for and importance of data sharing for the
purpose of reanalysis.
Data sharing is one of the open science practices that more researchers should be aware of.
Open science practices, which increase the transparency and openness of the research process, have
been promoted and increasingly implemented in response to the low reproducibility of research in
the field of psychology (Open Science Collaboration, 2012, 2015). Likewise, open science practices
(e.g., preregistration, registered reports, open materials, raw data and code sharing, and open access
and preprints) have gained attention in the field of L2 research (Marsden & Plonsky, 2018; Plonsky,
in press). Although the sharing of raw data is recommended as one open science practice, as it
stands, this choice is up to the individual researcher. Because robust, credible, transparent, and high-
quality research is only possible when reproducibility is guaranteed (Gass et al., 2020, p. 252), high-
impact journals should make the sharing of raw data mandatory. In fact, Applied Psycholinguistics,
beginning January 2022, requires all authors to make the analysis code and data openly available via
a trusted repository (https://www.cambridge.org/core/journals/applied-psycholinguistics/information/instructions-contributors). Other journals should follow suit to pursue
research transparency and open science in the L2 field.
Another reason why data sharing is important is due to the nature of multiple regression
analysis. In multiple regression, the variance explained (R2) and the model itself are susceptible to
change depending on the PVs used. This makes it difficult or impossible to compare results obtained
from primary studies using different PVs. Since many L2 studies using multiple regression interpret
the variable importance with standardized beta coefficients, as shown in the following example,
theories and pedagogical implications drawn from the results of those studies may be subject to
reanalysis. With the spread of open science practices, including raw data sharing, such secondary
analyses will be facilitated and conducted to develop more precise theory and practice.
3.2. Selected Primary Study (Goh et al., 2020)
Goh et al. (2020) investigated the extent to which Chinese EFL students’ SAT essay writing
scores could be predicted from eight linguistic features (or microfeatures) such as basic text features
(e.g., word count), readability, cohesion, and lexical diversity, which were extracted from students’
essays. They divided all 268 essays collected from the participants into a 200-essay training set and a
68-essay test set. They used stepwise multiple regression analysis to search for the best prediction
model for the SAT essay writing score, using eight linguistic features as PVs, and examined the
relative importance of PVs using standardized beta coefficients. Accordingly, it is worth comparing
the results of Goh et al. (2020) with those obtained using dominance analysis (DA).
Two of the eight PVs, both of which were lexical diversity indices (frequency of SAT words
and academic words in the essay), were not included in the final model because they had
insignificant R2 change in stepwise regression analysis (Goh et al., 2020, p. 470). The result of the
stepwise regression model using the 200-essay training set showed that the six PVs explained 62.6%
of the variance in the SAT essay scores. Using the 68-essay test set for cross-validation, Goh et al.
reported that the six predictors explained 53.5% of the total variance in SAT essay scores and argued
that the model they produced using stepwise regression analysis was valid.
Table 4 provides the correlations and standardized beta coefficients reported by Goh et al.
(2020). Based on these results, Goh et al. argued that three length-related features (i.e., word count,
the Coleman-Liau readability index, and the number of words per sentence) were the top three most
important variables for predicting the SAT writing score. They attributed this result to the tendency
that “human markers favor better readability and a longer essay with a more complex sentence
structure to deliver the argument and opinion” (p. 471).
The next three features (see Entry column in Table 4) in order of reported importance were (4)
the number of commas per sentence (Commas), (5) the normalized linking word frequency (Linking
words), and (6) the normalized stop word frequency (Stop words). By referring to the standardized
beta coefficients for each PV, Goh et al. offered an explanation. Regarding (4) Commas (β = -.21),
they assumed that comma splices (e.g., combining two independent clauses with a comma instead of
conjunctions), which are a common error among Chinese EFL learners, were the cause of this negative
standardized beta coefficient. As a pedagogical implication, they suggested that teachers should
instruct learners on how to eliminate the unnecessary use of commas, including comma splices.
Concerning (5) linking words (β = .12), Goh et al. reemphasized the importance of using more
linking words appropriately to construct a longer sentence. Being able to write a longer sentence
leads to the first three length-related features: a higher word count, possibly higher readability, and a
higher number of words per sentence. The final linguistic feature in the regression model was (6)
stop words (β = -.10). As English stop words are commonly used simple function words (e.g., the, to,
he), they do not carry much information. Therefore, Goh et al. suggested that students should limit
their use of stop words to the minimum level required.
Table 4
Correlations and Standardized Beta Coefficients as Reported in Goh et al. (2020)
Variables          1      2      3      4      5      6      7      β    Entry
1 Essay score       ―
2 Word count       .67     ―                                        .49    1
3 CLI              .41    .18     ―                                 .20    2
4 Commas           .02    .05    .21     ―                         -.21    4
5 Stop words      -.35   -.28   -.32   -.03     ―                  -.10    6
6 Linking words    .35    .24    .34    .22   -.13     ―            .12    5
7 Words/sentence   .45    .31    .43    .47   -.22    .33     ―     .27    3

Note. N = 267 for correlations, n = 200 for standardized beta coefficients. Stepwise multiple regression analysis: R2 = .62. Entry is the order in which variables entered the stepwise regression model. CLI = Coleman-Liau readability index; Commas = number of commas per sentence; Stop words = stop word frequency normalized to word count; Linking words = linking word frequency normalized to word count; Words/sentence = number of words per sentence.
Goh et al. (2020) employed stepwise multiple regression analysis because their primary
purpose was to build the best regression model with the highest predictive power possible. This
method is also known as “statistical” regression analysis. As the name implies, the selection of PVs
in such stepwise regression analysis is purely based on statistical assessment, which often causes
problems and should be used with prudence (see Plonsky & Ghanbar, 2018 for details). Jeon (2015)
contended that when stepwise multiple regression is chosen over other multiple regression
procedures, “the observed relative importance of a PV should be considered with caution” (p. 143).
Considering the several significant problems inherent in the use of stepwise regression analysis,
Nathans et al. (2012) advised against its use for assessing variable importance. Accordingly, standard
multiple regression was used to reproduce the analysis of Goh et al. (2020); then, DA was applied to
these data.
Table 5 presents the results of the reproduction and DA conducted using the data from Goh et
al. (2020). The same interpretation of the original results of Goh et al. can be made for the three
length-related features (i.e., word count, CLI, and the number of words per sentence), accounting for
81.10% of the total R2 (.60), with rescaled dominance weights of 52.67%, 12.21%, and 16.22%, respectively. Of the
other three PVs (i.e., Commas, Stop words, and Linking words), the number of linking words was of
relative importance, with a rescaled dominance weight of 8.03%. Similarly, the number of stop words was
assigned a value of 7.86%. Along with their correlations (r = .35, -.35) with the CV (SAT Essay
Score), it is reasonable to consider that the number of linking words and stop words were important
predictors contributing to the CV, as suggested by Goh et al.
Meanwhile, the number of commas per sentence (Commas) had a much lower dominance
weight (.018 / 3.01%) than the other PVs, although this PV was the fourth most important variable (out of six)
in the stepwise multiple regression reported by Goh et al. (2020). Thus, the reason for the
disagreement in predictor importance deserves close inspection. As Goh et al. noted, the number of
commas per sentence was a “suppressor variable in the regression model” (p. 472). As such, its
standardized beta coefficient was inflated and negative (β = -.21). Nevertheless, Goh et al. treated this
variable (i.e., Commas) as meaningful because it appeared to be the fourth most important variable, after the
three most important length-related features. It was this misinterpretation of standardized beta
coefficients that led Goh et al. to overestimate the importance of this specific PV and suggest that
students should use fewer commas in their writing. However, as can be seen in the DA result, it is
clear that the magnitude of this PV (Commas) was simply a statistical artifact resulting from the
process of calculating standardized beta coefficients.
Table 5
Dominance Analysis Applied to the Data from Goh et al. (2020)
Variables          r    Stepwise β     β      p      Dominance Weight (%)   95% CI [Lower, Upper]   Rank
Word count        .67      .49        .52   < .001   .315 (52.67%)          [.231, .392]              1
CLI               .41      .20        .19   < .001   .073 (12.21%)          [.034, .127]              3
Commas            .02     -.21       -.19   < .001   .018 (3.01%)           [.007, .049]              6
Stop words       -.35     -.10       -.08     .10    .047 (7.86%)           [.017, .090]              5
Linking words     .35      .12        .11     .03    .048 (8.03%)           [.015, .103]              4
Words/sentence    .45      .27        .24   < .001   .097 (16.22%)          [.050, .158]              2
Total                                                .598 (100%)

Note. The criterion variable was SAT Essay Score. N = 200, R2 = .60 (95% CI [.50, .65]). β values obtained from stepwise regression were reported in the original study (Goh et al., 2020). R2 and β were recalculated using standard multiple regression.
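For the two-predictor case, the dominance computation underlying Table 5 can be written in closed form from the correlation matrix alone: each predictor's incremental R2 is averaged over the subsets it can enter (alone, or after the other predictor). A Python sketch (function names invented; the actual analysis involved six predictors and was conducted in R):

```python
def r_squared_from_correlations(r_y1, r_y2, r_12):
    """R^2 of a two-predictor regression, from correlations only."""
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)

def general_dominance_weights(r_y1, r_y2, r_12):
    """General dominance weight of each predictor: its incremental R^2
    averaged over all subset sizes (entering the model alone vs.
    entering after the other predictor)."""
    r2_full = r_squared_from_correlations(r_y1, r_y2, r_12)
    gd1 = 0.5 * r_y1**2 + 0.5 * (r2_full - r_y2**2)
    gd2 = 0.5 * r_y2**2 + 0.5 * (r2_full - r_y1**2)
    return gd1, gd2
```

The two weights sum exactly to the full-model R2, the additive decomposition that standardized beta coefficients cannot provide.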
Figure 5 displays the dominance weights and their corresponding 95% CIs in descending order
from the PV with the largest dominance weight (i.e., Word count) to that with the smallest
dominance weight (i.e., Commas). The comparisons of PVs suggest that the dominance weights of
the three length-related features (i.e., Word count, Words/sentence, and CLI) have statistically larger
weights than Commas. This result corroborates the finding that the predictor importance of Commas
is negligible.
Figure 5
Dominance Weights and Corresponding 95% Confidence Intervals
Note. Horizontal error bars show 95% confidence intervals computed from 10,000 bootstrapped
replications. * indicates that the confidence interval does not contain 0 (p < .05).
Figure 6 presents the variable importance plot obtained from applying random forest with the
Boruta algorithm. Again, this supports the DA result; the variable importance of the number of
commas per sentence (Commas) was “tentative” (boxplot in yellow), indicating that the predictor
Commas was not important according to the random forest approach. It is worth mentioning that the
predictor importance rankings from DA and random forest generally agree, suggesting the robustness
of this combined approach.
Figure 6
Variable Importance Plot for the Data from Goh et al. (2020)
The dominance weight of Commas was close to zero in magnitude, and its importance as a
predictor was trivial. The problem of Commas being overrated as an important predictor was simply
because Goh et al. (2020) interpreted its importance using its standardized beta coefficient. As
witnessed in this empirical example, one cannot trust the face value of the standardized beta
coefficients as a measure of predictor importance. Therefore, such coefficients should always be
supplemented by dominance analysis and random forest, which enable more informed judgments and
provide further insight concerning the true contribution of predictors to the multiple regression
model.
Why is the misinterpretation of the importance of the predictor a critical issue? This is because
erroneous theories and pedagogical implications may have evolved as a result of the misuse of
multiple regression analysis. The misinterpretation of predictor importance occurred for only one
variable (i.e., Commas) in Goh et al. (2020). In other cases, the entire multiple regression model may
be called into question if researchers fall into the trap of using only standardized beta coefficients
without utilizing dominance analysis and random forest.
It should be noted that the authors, reviewers, and editors of the original study are not to
blame. As long as the use of multiple regression analysis for determining important predictors is
warranted in the L2 research field, there is nothing wrong with this practice. Rather, the authors
should be commended for their good reporting practices, which made the current reproduction of
their data analysis possible. As mentioned earlier, most L2 studies fail to report the information
needed to reproduce their results.
The misuse of standardized beta coefficients in multiple regression analysis for identifying
important predictors has penetrated deep into L2 research. Even when researchers obtain
misleadingly small (i.e., attenuated) standardized beta coefficients for PVs that have moderate
correlations with the CV, they may not question this fact. Therefore, the current status is alarming
because theories and pedagogical implications may be presented based on the misuse of multiple
regression analysis. The empirical example in this section has demonstrated that interpreting
standardized beta coefficients to determine predictor importance does more harm than good and that
standardized beta coefficients should always be reported along with dominance and random forest
analyses. To reiterate, it is the dominance analysis that can correctly partition the predicted variance
(R2) to each PV and assist in better understanding how each PV uniquely, and in combination with
other PVs, contributes to the CV, which is impossible to determine using standardized beta
coefficients. For this reason, dominance analysis accompanied by random forest can help in the
development of sound theories and pedagogical implications.
4. An R-based Web Application
To facilitate further the use of dominance analysis (DA) and random forest methods in L2
studies, I developed an R-based web application (Figure 7). It is accessible online for free
(https://langtest.jp/shiny/relimp/). In this R-based web application, which has an intuitive interface,
users can experience the procedures described in this paper, which will help develop the user’s initial
understanding of dominance analysis and random forest methods. With DA, users can compute
bootstrapped confidence intervals around the individual dominance weights, and statistical
significance tests comparing predictors are also available. Random forest with the Boruta algorithm
can also be implemented in the web application, although running the bootstrap iterations can be
time-consuming.
Users can supply their own datasets by copying and pasting the data from a spreadsheet (see
Mizumoto & Plonsky, 2016 for an introduction to this web application). The web application can
handle both raw data and a correlation matrix as input. When a correlation matrix is used as the
input, the program automatically generates raw data with the same correlation coefficients because
computing confidence intervals or statistical significance tests requires such data. As long as the
original data meet the assumption of a multivariate normal distribution, theoretically, the simulated
and actual datasets should produce the same results when used for statistical analysis. Users can thus
conduct secondary analysis such as that performed in this study.
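For the bivariate case, this simulation step amounts to the Cholesky construction y = r * z1 + sqrt(1 - r^2) * z2 for independent standard normal deviates z1 and z2, which yields a pair with population correlation r. A Python sketch (function name invented; the web application handles full correlation matrices):

```python
import random
from math import sqrt

def simulate_correlated_pair(r, n, seed=0):
    """Generate n (x, y) pairs from a bivariate standard normal with
    target correlation r, via y = r * z1 + sqrt(1 - r^2) * z2."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        pairs.append((z1, r * z1 + sqrt(1 - r * r) * z2))
    return pairs
```

With a sufficiently large n, the sample correlation of the generated data closely matches the target r, which is what allows a reported correlation matrix to stand in for the raw data in secondary analyses.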
The R code used to develop this web application is the same as the code used in the analyses
reported in this showcase article. If users are interested in finding out more about the R code used in
the web application, they can refer to the online supplementary material of this paper.
5. Conclusion
Researchers have repeatedly noted that standardized beta coefficients should be interpreted
with care. Although there are a few researchers (e.g., Kyle et al., 2021) who understand the problem
and appropriately use relative importance metrics to provide a valid interpretation of predictor
importance, it appears that most researchers in the field are not explicitly aware of the issue.
Consequently, incorrect interpretation of standardized beta coefficients continues to this day. In this
showcase article, I highlighted the use of dominance analysis and random forest as solutions to the
misuse of standardized beta coefficients and illustrated that such methods can enable researchers to
more accurately understand the roles that predictors play in multiple regression models, in contrast to
using standardized beta coefficients.
I demonstrated how dominance analysis and random forest can be combined, instead of
interpreting standardized beta coefficients, to identify predictor importance by applying such
methods to a recently published study in the L2 research field. In addition, to make dominance
analysis and random forest available to more researchers, I developed an R-based web application
that is accessible to anyone.
I hope that this showcase article will help reduce the misuse of multiple regression analysis to
identify important predictors in L2 research. Reviewers and editors can refer to this paper and guide
authors to the appropriate method (i.e., dominance analysis and random forest) when their
interpretations may be erroneous. We can now stop ignoring this persistent problem in L2
research and keep moving forward in methodological reform (Marsden & Plonsky, 2018) with a
better understanding of predictor importance.
References
Braun, M. T., Converse, P. D., & Oswald, F. L. (2019). The accuracy of dominance analysis as a metric to assess relative importance: The joint impact of sampling error variance and measurement unreliability. Journal of Applied Psychology, 104(4), 593–602. https://doi.org/10.1037/apl0000361
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
Budescu, D. V. (1993). Dominance analysis: A new approach to the problem of relative importance of predictors in multiple regression. Psychological Bulletin, 114(3), 542–551. https://doi.org/10.1037/0033-2909.114.3.542
Canty, A., & Ripley, B. (2021). boot: Bootstrap R (S-PLUS) functions (R package version 1.3-28) [Computer software]. https://CRAN.R-project.org/package=boot
Chen, R.-C., Dewi, C., Huang, S.-W., & Caraka, R. E. (2020). Selecting critical features for data classification based on machine learning methods. Journal of Big Data, 7(1), 52. https://doi.org/10.1186/s40537-020-00327-4
Crosthwaite, P., Storch, N., & Schweinberger, M. (2020). Less is more? The impact of written corrective feedback on corpus-assisted L2 error resolution. Journal of Second Language Writing, 49, 100729. https://doi.org/10.1016/j.jslw.2020.100729
Fernández-Delgado, M., Cernadas, E., Barro, S., & Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15, 3133–3181. https://jmlr.org/papers/volume15/delgado14a/delgado14a.pdf
Garver, M. S., & Williams, Z. (2020). Utilizing relative weight analysis in customer satisfaction research. International Journal of Market Research, 62(2), 158–175. https://doi.org/10.1177/1470785319859794
Gass, S., Loewen, S., & Plonsky, L. (2020). Coming of age: The past, present, and future of quantitative SLA research. Language Teaching, 54(2), 1–14. https://doi.org/10.1017/S0261444819000430
Goh, T.-T., Sun, H., & Yang, B. (2020). Microfeatures influencing writing quality: The case of Chinese students’ SAT essays. Computer Assisted Language Learning, 33(4), 455–481. https://doi.org/10.1080/09588221.2019.1572017
Grömping, U. (2006). Relative importance for linear regression in R: The package relaimpo. Journal of Statistical Software, 17(1). https://doi.org/10.18637/jss.v017.i01
Grömping, U. (2015). Variable importance in regression models. Wiley Interdisciplinary Reviews: Computational Statistics, 7(2), 137–152. https://doi.org/10.1002/wics.1346
Hair, J. F., Babin, B. J., Anderson, R. E., & Black, W. C. (2019). Multivariate data analysis (8th ed.). Cengage.
Hessel, G. (2015). From vision to action: Inquiring into the conditions for the motivational capacity of ideal second language selves. System, 52, 103–114. https://doi.org/10.1016/j.system.2015.05.008
Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67. https://doi.org/10.1080/00401706.1970.10488634
Jaccard, J., & Daniloski, K. (2012). Analysis of variance and the general linear model. In H. Cooper, P. M. Camic, D. L. Long, A. T. Panter, D. Rindskopf, & K. J. Sher (Eds.), APA handbook of research methods in psychology, Vol 3: Data analysis and research publication (pp. 163–190). American Psychological Association. https://doi.org/10.1037/13621-008
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning: With applications in R. Springer New York. https://doi.org/10.1007/978-1-4614-7138-7
Jarvis, S. (2011). Data mining with learner corpora: Choosing classifiers for L1 detection. In F. Meunier, S. De Cock, G. Gilquin, & M. Paquot (Eds.), Studies in Corpus Linguistics (Vol. 45, pp. 127–154). John Benjamins. https://doi.org/10.1075/scl.45.10jar
Jarvis, S., & Crossley, S. A. (Eds.). (2012). Approaching language transfer through text classification: Explorations in the detection-based approach. Multilingual Matters. https://doi.org/10.21832/9781847696991
Jeon, E. H. (2015). Multiple regression. In L. Plonsky (Ed.), Advancing quantitative methods in second language research (pp. 131–158). Routledge.
Johnson, J. W. (2000). A heuristic method for estimating the relative weight of predictor variables in multiple regression. Multivariate Behavioral Research, 35(1), 1–19. https://doi.org/10.1207/S15327906MBR3501_1
Karpen, S. C. (2017). Misuses of regression and ANCOVA in educational research. American Journal of Pharmaceutical Education, 81(8), 6501. https://doi.org/10.5688/ajpe6501
Khany, R., & Tazik, K. (2019). Levels of statistical use in applied linguistics research articles: From 1986 to 2015. Journal of Quantitative Linguistics, 26(1), 48–65. https://doi.org/10.1080/09296174.2017.1421498
Kuhn, M. (2021). caret: Classification and regression training (R package version 6.0-88) [Computer software]. https://github.com/topepo/caret/
Kursa, M. B., & Rudnicki, W. R. (2010). Feature selection with the Boruta package. Journal of Statistical Software, 36(11). https://doi.org/10.18637/jss.v036.i11
Kyle, K., Crossley, S. A., & Jarvis, S. (2021). Assessing the validity of lexical diversity indices using direct judgements. Language Assessment Quarterly, 18(2), 154–170. https://doi.org/10.1080/15434303.2020.1844205
LaFlair, G. T., Egbert, J., & Plonsky, L. (2015). A practical guide to bootstrapping descriptive statistics, correlations, t tests, and ANOVAs. In L. Plonsky (Ed.), Advancing quantitative methods in second language research (pp. 46–77). Routledge.
Lang, M., Binder, M., Richter, J., Schratz, P., Pfisterer, F., Coors, S., Au, Q., Casalicchio, G., Kotthoff, L., & Bischl, B. (2019). mlr3: A modern object-oriented machine learning framework in R. Journal of Open Source Software, 4(44), 1903. https://doi.org/10.21105/joss.01903
Larson-Hall, J. (2016). A guide to doing statistics in second language research using SPSS and R (2nd ed.). Routledge.
Larson-Hall, J., & Plonsky, L. (2015). Reporting and interpreting quantitative research findings: What gets reported and recommendations for the field. Language Learning, 65(S1), 127–159. https://doi.org/10.1111/lang.12115
LeBreton, J. M., Ployhart, R. E., & Ladd, R. T. (2004). A Monte Carlo comparison of relative importance methodologies. Organizational Research Methods, 7(3), 258–282. https://doi.org/10.1177/1094428104266017
Liakhovitski, D., Bryukhov, Y., & Conklin, M. (2010). Relative importance of predictors: Comparison of random forests with Johnson’s relative weights. Model Assisted Statistics and Applications, 5(4), 235–249. https://doi.org/10.3233/MAS-2010-0172
Luo, W., & Azen, R. (2013). Determining predictor importance in hierarchical linear models using dominance analysis. Journal of Educational and Behavioral Statistics, 38(1), 3–31. https://doi.org/10.3102/1076998612458319
Maassen, G. H., & Bakker, A. B. (2001). Suppressor variables in path models: Definitions and interpretations. Sociological Methods & Research, 30(2), 241–270. https://doi.org/10.1177/0049124101030002004
Maeda, H. (2004). Test kessekisha no mikomiten no yosoku: Kaiki bunseki [Predicting the expected score of absent test takers: Regression analysis]. In H. Maeda & K. Yamamori (Eds.), Eigo kyoshi no tameno kyoiku data bunseki nyumon [Introduction to educational data analysis for English teachers] (pp. 73–81). Taishukan Shoten.
Marsden, E., Crossley, S., Ellis, N., Kormos, J., Morgan-Short, K., & Thierry, G. (2019). Inclusion of research materials when submitting an article to Language Learning. Language Learning, 69(4), 795–801. https://doi.org/10.1111/lang.12378
Marsden, E., Mackey, A., & Plonsky, L. (2016). The IRIS repository: Advancing research practice and methodology. In A. Mackey & E. Marsden (Eds.), Advancing methodology and practice: The IRIS repository of instruments for research into second languages (pp. 1–21). Routledge.
Marsden, E., & Plonsky, L. (2018). Data, open science, and methodological reform in second language acquisition research. In A. Gudmestad & A. Edmonds (Eds.), Critical reflections on data in second language acquisition (pp. 219–228). John Benjamins. https://doi.org/10.1075/lllt.51.10mar
Mizumoto, A., & Plonsky, L. (2016). R as a lingua franca: Advantages of using R for quantitative research in applied linguistics. Applied Linguistics, 37(2), 284–291. https://doi.org/10.1093/applin/amv025
Nathans, L. L., Oswald, F. L., & Nimon, K. F. (2012). Interpreting multiple linear regression: A guidebook of variable importance. Practical Assessment, Research, and Evaluation, 17(9), 1–19. https://doi.org/10.7275/5FEX-B874
Nicklin, C., & Plonsky, L. (2020). Outliers in L2 research in applied linguistics: A synthesis and data re-analysis. Annual Review of Applied Linguistics, 40, 26–55. https://doi.org/10.1017/S0267190520000057
Nimon, K. F., & Oswald, F. L. (2013). Understanding the results of multiple linear regression: Beyond standardized regression coefficients. Organizational Research Methods, 16(4), 650–674. https://doi.org/10.1177/1094428113493929
Nimon, K. F., Oswald, F. L., & Roberts, K. J. (2021). yhat: Interpreting regression effects (R package version 2.0-3) [Computer software]. https://CRAN.R-project.org/package=yhat
Norris, J. M., Plonsky, L., Ross, S. J., & Schoonen, R. (2015). Guidelines for reporting quantitative methods and results in primary research: Guidelines for reporting quantitative methods. Language Learning, 65(2), 470–476. https://doi.org/10.1111/lang.12104
Open Science Collaboration. (2012). An open, large-scale, collaborative effort to estimate the reproducibility of psychological science. Perspectives on Psychological Science, 7(6), 657–660. https://doi.org/10.1177/1745691612462588
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
Oswald, F. L. (2021, June 19). Regression illustrated.jpg. Fred Oswald’s Quick Files. https://osf.io/adnj2
Plonsky, L. (in press). Open science in applied linguistics. John Benjamins.
Plonsky, L., Egbert, J., & LaFlair, G. T. (2015). Bootstrapping in applied linguistics: Assessing its potential using shared data. Applied Linguistics, 36(5), 591–610. https://doi.org/10.1093/applin/amu001
Plonsky, L., & Ghanbar, H. (2018). Multiple regression in L2 research: A methodological synthesis and guide to interpreting R2 values. The Modern Language Journal, 102(4), 713–731. https://doi.org/10.1111/modl.12509
R Core Team. (2021). R: A language and environment for statistical computing (4.1.2) [Computer software]. https://www.r-project.org/
Stadler, M., Cooper-Thomas, H. D., & Greiff, S. (2017). A primer on relative importance analysis: Illustrations of its utility for psychological research. Psychological Test and Assessment Modeling, 59(4), 381–403.
Thomas, D. R., Zumbo, B. D., Kwan, E., & Schweitzer, L. (2014). On Johnson’s (2000) relative weights method for assessing variable importance: A reanalysis. Multivariate Behavioral Research, 49(4), 329–338. https://doi.org/10.1080/00273171.2014.905766
Tonidandel, S., & LeBreton, J. M. (2011). Relative importance analysis: A useful supplement to regression analysis. Journal of Business and Psychology, 26(1), 1–9. https://doi.org/10.1007/s10869-010-9204-3
Tonidandel, S., & LeBreton, J. M. (2015). RWA Web: A free, comprehensive, web-based, and user-friendly tool for relative weight analyses. Journal of Business and Psychology, 30(2), 207–216. https://doi.org/10.1007/s10869-014-9351-z