
Individual Differences in Rule-Based Grammaticality

Judgment Behavior: A Bayesian Modeling Approach

Kunihiro KUSANAGI

Hiroshima University

Shusaku KIDA

Hiroshima University

Abstract

The aim of the present study is to examine the source of individual differences in the

rule-based grammaticality judgment behavior of foreign language learners. It has been recognized

that learners with certain specific traits benefit relatively more from rule-based knowledge

representations in their grammaticality judgment performance, while another type of learner tends

to rely on their unconscious and non-symbolic type of knowledge, which is often called intuition.

Generally, the former type of learner exhibits a relatively higher discriminability (d') in a

grammaticality judgment task than the latter type. However, there remain individual differences

unexplained in such rule-based grammaticality judgment behavior. The present study focused on

grammatical carefulness (GC) as a key factor behind this variance, as GC is considered one of the latent variables associated with the speed-accuracy tradeoff and meta-linguistic

behavior. Using the type II signal detection paradigm and Bayesian modeling, the present study,

which targeted Japanese university students (N = 40) who engaged in a grammaticality judgment

task (K = 48), found that learners with higher GC scores are more likely to benefit from rule-based knowledge representations than those with lower GC scores, who benefit more from their intuition. This finding indicates that individuals’ psychological, behavioral, and meta-cognitive traits moderate

grammaticality judgment behavior in a foreign language. The practical implications concerning

testing and assessing grammatical performance in a foreign language are also discussed.

1. Background

Analyzing decision, judgment, and choice behavior plays a central role in various academic

fields, such as cognitive psychology and behavioral sciences. Foreign language teaching research

is no exception; Grammaticality Judgment Tasks (GJTs) are regarded as one of the most

commonly used sets of tools to examine one’s grammatical performance and to construct latent

variables such as grammatical knowledge, although they have also given rise to critical

methodological debates (e.g., Ellis, 1991; Mandell, 1999). This toolbox usually consists of three


variables: (a) accuracy; (b) reaction time; and (c) subjective measures, including confidence,

certainty, and other attributes. The latent structure among the three variables itself has been an

important subject of inquiry in mathematical psychology (e.g., Pleskac & Busemeyer, 2010;

Ratcliff & Starns, 2013). However, in foreign language teaching research, very little is known

about the dynamics among the three variables taken from GJTs, especially those involving the third one

(Kusanagi, 2018; Tamura, Harada, Kato, Hara, & Kusanagi, 2016).

Generally, a subject’s accuracy of judgment behavior is defined as a correct response

ratio or a probability of answering correctly, and subjective measures are the result of another type of post-decisional task, namely retrospection regarding one’s mental states related to the given

judgment and its processes. One of the typical cases of subjective measures may be binary

confidence rating. After judging the grammaticality of a stimulus, a subject is then asked to

choose between two levels of confidence about the given judgment, such as "confident" or "not confident".

This procedure is sometimes called a type II task, while the judgment of the property of a given

stimulus is referred to as a type I task. Type II tasks have various applications in cognitive

psychology, especially research about artificial grammar learning, recognition memory, and

subliminal perception (e.g., Galvin, Podd, Drga, & Whitmore, 2003; Kunimoto, Miller, & Pashler,

2001; Maniscalco & Lau, 2012). Other common applications of type II tasks may include: (a)

non-binary confidence ratings, such as the percentile scale (Tunney & Shanks, 2003); (b) the

Remember-Know (R/K) procedure (Yonelinas, 2002); (c) multinomial levels called source

attributions, such as "Rule", "Recollection", "Intuition", "Familiarity", and "Guess" (e.g., Scott &

Dienes, 2008); and (d) binomial source attribution, such as "Rule" vs. "Intuition" (e.g., Kusanagi,

2018; Tamura et al., 2016). We have adopted the last one, which is commonly used in previous

foreign language teaching research (e.g., Ellis, 2005; Tamura et al., 2016).

It is important to note that the responses regarding the subjective measures sometimes

supply us with additional information for predicting the probability that a given judgment is

correct. For instance, assuming that a subject engaged in a GJT (K = 100) with a binary

confidence rating, we can cross-tabulate the observed responses, as presented in Table 1 below.

Table 1.

Hypothetical Cross Tabulation of Type I and Type II Responses

               Confident   Not Confident   Total
Correct            60            10          70
Incorrect          10            20          30
Total              70            30         100

In the column of “confident” responses, 60 out of 70 responses were correct (86 percent),

while in the column of “not confident” responses, 10 out of 30 responses were correct (33 percent).

In this hypothetical case, when a response is confident, the associated judgment is more than twice


as likely to be correct. This probabilistic relation can be formally called conditional dependence.
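To make this calculation concrete, the following R sketch reproduces the two conditional probabilities from the hypothetical counts in Table 1; the object names are ours and purely illustrative.

# Hypothetical counts from Table 1 (K = 100 trials)
counts <- matrix(c(60, 10, 10, 20), nrow = 2, byrow = TRUE,
                 dimnames = list(c("Correct", "Incorrect"),
                                 c("Confident", "NotConfident")))

p_correct_given_confident     <- counts["Correct", "Confident"] / sum(counts[, "Confident"])        # 60/70 = .86
p_correct_given_not_confident <- counts["Correct", "NotConfident"] / sum(counts[, "NotConfident"])  # 10/30 = .33

# The two conditional probabilities differ, so type I accuracy and the
# type II (confidence) response are conditionally dependent.
c(p_correct_given_confident, p_correct_given_not_confident)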

When such conditional dependence between type I and type II responses is present, it is theoretically taken as a type of evidence for conscious knowledge, explicit

knowledge, explicit memory, meta-cognitive sensitivity, or more simply, awareness,

consciousness, and meta-cognition (e.g., Galvin et al., 2003; Kunimoto et al., 2001; Maniscalco &

Lau, 2012; Scott & Dienes, 2008). Conditional dependence can be formulated in various ways;

some previous research has employed a guessing criterion, a zero-correlation criterion (Scott &

Dienes, 2008), and the type II signal detection paradigm (e.g., Galvin et al., 2003; Maniscalco &

Lau, 2012). Second language acquisition researchers also pay attention to conditional dependence

between accuracy and subjective measures, in order to inspect explicit and implicit grammatical

knowledge of second language learners (e.g., Ellis, 2005; Rebuschat, 2013).

In foreign language teaching research in Japan, Kusanagi (2018) examined the dynamics

among accuracy, reaction time, and subjective measures taken from university-level English

learners’ grammaticality judgment data, whose stimuli were assumed to include only grammatical structures that had already been learned. The study reported a general tendency that a shorter response time predicts higher accuracy and a higher probability of a “rule” response:

namely, fast, rule-based, and correct responses vs. slow, intuition-based, and incorrect responses.

The results fit well with common findings in mathematical psychology, rather than with the

distinction of explicit and implicit knowledge in second language acquisition theory (e.g., Ellis,

2005, 2006).

Even if conditional dependence in grammaticality judgment behavior of foreign language

learners is true as a general tendency, a problem arises when we focus on its variance, or

individual differences, which are a very important aspect of foreign language teaching in practice.

Namely, some learners tend to respond correctly when they rely on the “rule” and incorrectly when they rely on their “intuition,” whereas other learners behave in the opposite manner, responding correctly on the basis of “intuition” and incorrectly on the basis of the “rule.” The

source of this inconsistency remains quite obscure, and that is what we have examined in the

present study.

2. Theoretical and Methodological Rationale

2.1 Type II Signal Detection Paradigm

To formalize the conditional dependence between accuracy and the subjective measures

described above, the present study employed the type II signal detection paradigm, which is

simply an application of signal detection theory or SDT (e.g., Macmillan & Creelman, 2005;

Stanislaw & Todorov, 1999) to type II tasks. We will first look at a standard SDT-based analysis


for GJT data, which starts with categorization of four types of possible responses, as detailed in

Table 2.

Table 2.

Response Categories in Signal Detection Theory

                             Stimulus = Grammatical   Stimulus = Ungrammatical
Response = "Grammatical"     Hit                      False Alarm
Response = "Ungrammatical"   Miss                     Correct Rejection

The Hit Ratio (HR1) and False Alarm Ratio (FAR1) were then calculated. HR1 is the probability of

giving a “grammatical” response to grammatical items, and FAR1 is the probability of giving a

“grammatical” response to ungrammatical items.

= Response = "Grammatical" | Stimulus = "Grammatical")

(1)

= Response = "Grammatical" | Stimulus = "Ungrammatical")

(2)

The discriminability index, d1' or type I d' is defined as

d1' = z(HR1) − z(FAR1)    (3)

where z denotes z-transformation.

Type I d' can take both negative and positive values; larger positive values represent how well one discriminates the given stimuli, independent of response bias. A typical index for response bias is

criterion, c1, which is defined as

c1 = −(1/2)[z(HR1) + z(FAR1)]    (4)

When the index c1 is 0, there is no response bias. Given equation (4), negative values represent a bias toward “grammatical” responses and positive values a bias toward “ungrammatical” ones. This equal variance Gaussian signal

detection model is graphically represented in Figure (1).


Figure 1. A simplified graphical representation of the equal variance Gaussian signal detection

model. In (a), the Gaussian distribution located on the left side represents the assumed psychological

quantity of a receiver to ungrammatical stimuli, while the right represents grammatical stimuli. When

the psychological quantity to a given stimulus exceeds the criterion or a fixed point k, the receiver

responds “grammatical” to the stimulus. Discriminability d' can be understood as a standardized mean

difference between the two distributions. Thus, when d' takes a greater value, the distributions part from

each other as in (b), in comparison to (a). Graphically, the hit ratio is equal to the area under the

probability density function of grammatical stimuli over k, as in (c). Likewise (d), (e), and (f) represent

the ratios of false alarm, correct rejection, and miss, respectively.
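As an illustration of equations (1) to (4), the R sketch below computes type I d' and c1 from raw response counts. The counts and variable names are hypothetical, and in practice hit or false alarm ratios of exactly 0 or 1 would need a standard correction before the z-transformation.

# Hypothetical type I counts from a 48-item GJT (24 grammatical, 24 ungrammatical)
hits         <- 18   # "grammatical" responses to grammatical items
misses       <- 6
false_alarms <- 9    # "grammatical" responses to ungrammatical items
corr_rej     <- 15

HR1  <- hits / (hits + misses)                    # equation (1)
FAR1 <- false_alarms / (false_alarms + corr_rej)  # equation (2)

d1_prime <- qnorm(HR1) - qnorm(FAR1)              # equation (3): z(HR1) - z(FAR1)
c1       <- -0.5 * (qnorm(HR1) + qnorm(FAR1))     # equation (4)
c(d1_prime = d1_prime, c1 = c1)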

In the same manner, the type II signal detection paradigm cross-tabulates SDT response

categories and the responses on the subjective measure, as was done in the previous section. In

this case, however, we focus on the two binomial source attributions, “rule” and “intuition,” as in Table 3. For convenience, we coined the terms (a) rule-based benefit, corresponding to a type II hit; (b) false rule, corresponding to a type II false alarm; (c) intuition-based benefit, corresponding to a type II miss; and (d) false intuition, corresponding to a type II correct rejection. Thus, HR2, or

rule-based benefit ratio, and FAR2, or false rule ratio, can be calculated as

HR2 = P(Response = "Rule" | Response = Hit ∨ Correct Rejection)    (5)

FAR2 = P(Response = "Rule" | Response = Miss ∨ False Alarm)    (6)


Table 3.

Response Categories in the Type II Signal Detection Paradigm

                            Response =                      Response =
                            Hit ∨ Correct Rejection         Miss ∨ False Alarm
Response = "Rule"           Type II Hit                     Type II False Alarm
                            (Rule-Based Benefit)            (False Rule)
Response = "Intuition"      Type II Miss                    Type II Correct Rejection
                            (Intuition-Based Benefit)       (False Intuition)

Thus, type II d' (d2') and the type II criterion (c2) can be written as

d2' = z(HR2) − z(FAR2)    (7)

c2 = −(1/2)[z(HR2) + z(FAR2)]    (8)

Type II d' can be an estimator to evaluate how much a subject benefits from rule-based

knowledge rather than intuition in grammaticality judgment behavior without biases, relative to

overall discriminability. When the value is 0, rule-based and intuition-based responses are equally accurate; greater positive values indicate a greater benefit from rule-based knowledge.

Following previous studies, we expect that the population mean, μ, of this index takes a positive

rather than a negative value.
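A per-participant computation of the type II indices might look like the following sketch; the column names (correct for type I accuracy, source for the "rule"/"intuition" attribution) and the simulated trial data are hypothetical placeholders rather than the actual materials of the present study.

# 'correct' is TRUE for type I hits and correct rejections;
# 'source' is the type II attribution ("Rule" or "Intuition")
type2_indices <- function(correct, source) {
  HR2  <- mean(source[correct]  == "Rule")          # equation (5): rule-based benefit ratio
  FAR2 <- mean(source[!correct] == "Rule")          # equation (6): false rule ratio
  c(d2_prime = qnorm(HR2) - qnorm(FAR2),            # equation (7)
    c2       = -0.5 * (qnorm(HR2) + qnorm(FAR2)))   # equation (8)
}

# example with simulated attributions for 48 trials; extreme ratios of 0 or 1
# would again require a correction before the z-transformation
set.seed(1)
correct <- runif(48) < .7
source  <- ifelse(runif(48) < ifelse(correct, .7, .4), "Rule", "Intuition")
type2_indices(correct, source)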

2.2 Grammatical Carefulness

Grammatical Carefulness (GC) is a psychological, behavioral, and meta-cognitive trait

that is expected to moderate foreign language learners’ speed-accuracy tradeoff and

meta-linguistic behavior (Kusanagi et al., 2015). The Foreign Language Grammatical Carefulness Scale (FLGCS) is a statistically sound inventory written in Japanese (K = 14). The FLGCS yields three subscales and is modeled with a three-factor structure. Kusanagi et al. reported that the factorial structure of this inventory is statistically robust and that each of its subscales showed a relatively high reliability coefficient. The scale was also reported to show measurement invariance across school levels and genders.

Theoretically, GC is expected to be a cause of highly cautious, careful, deliberate, and

intentional language use (Kusanagi et al., 2015, p. 79). Based on this hypothetical property, the

present study assumes that GC is one of the potential traits that explain the individual differences

in rule-based grammaticality judgment behavior, or more precisely, variance in type II d' of

foreign language learners. We expect that learners with relatively higher GC scores will exhibit

higher values in type II d' than learners with lower GC scores. The present study further assumes a


bi-directional relationship between the two. That is, GC as a psychological, behavioral, and

meta-cognitive trait, is incrementally enhanced by learners’ past subjective successes in cautious,

careful, and deliberate types of language use that include rule-based grammaticality judgment

behavior.

2.3 Models and Research Questions

The present study adopted the framework of Bayesian modeling (see, e.g., Kruschke,

2013, and Lee & Wagenmakers, 2014, for an introduction) to examine the individual differences

described above, because it has been recognized that Null Hypothesis Significance Testing (NHST) quite often seriously misleads foreign language teaching researchers, whose research practice is, by any standard, not free from (a) uncontrolled research designs with undoubtedly small sample sizes, (b) less informative statistical treatments such as the t-test and ANOVA, (c) informally described hypotheses and research questions, and (d) a wealth of misinterpretations of p-values. Unlike orthodox statistics, the ultimate goal of the present study is to obtain the posterior distributions of the parameters in our mathematical models rather than to judge statistical significance.

In Bayesian modeling, researchers first construct mathematical models, which should be

formally presented. In the case of the present study, we examine four general regression models

below:

M1: d2'i = β0 + εi    (9)

M2: d2'i = β0 + β1 GCi + εi    (10)

M3: d2'i = β0 + β2 d1'i + εi    (11)

M4: d2'i = β0 + β1 GCi + β2 d1'i + εi    (12)

where M denotes the models, βj (j = 0, 1, 2) are the standardized regression coefficients, and εi represents the error of the i-th participant. d2'i is the response variable, that is, the type II d' of the i-th participant. The predictors in the models include GCi, which is the summated scale score of the i-th participant, and d1'i, which denotes type I d'. Additionally, β0 represents the intercept.
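Expressed as R model formulas, the four models in (9) to (12) correspond to the following specifications; the column names d2, GC, and d1 are illustrative assumptions for a participant-level data frame.

# Models 1-4 as R formulas (variables standardized beforehand)
m1_formula <- d2 ~ 1          # intercept-only model, equation (9)
m2_formula <- d2 ~ GC         # GC as the sole predictor, equation (10)
m3_formula <- d2 ~ d1         # type I d' as the sole predictor, equation (11)
m4_formula <- d2 ~ GC + d1    # both predictors, equation (12)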

In Model 1, we expect that β0 will take a positive value. This is equivalent to the statement that the population mean, μ, of type II d' takes a positive value, as in (13).

μ > 0    (13)


The inequality in (13) is the first research hypothesis of the present study. Namely, as a general

tendency, learners benefit more from rule-based knowledge than from intuition in their grammaticality judgment behavior.

We can now inspect the inter-learner variance, which is the main purpose of the present

study. Using Bayesian model comparison, we can select the model that best fits the observations from among Models 2 to 4. Bayesian model comparison is one of the model selection methods used in

Bayesian analyses. For instance, if we compare Models 1 and 2, the Bayes Factor (BF), which is

sometimes called strength of evidence, can be written as

BF = p(D | M1) / p(D | M2)    (14)

where D represents the given data. When BF > 1, it suggests that Model 1 is preferred over Model

2. In the study, we expected that Models 2 and 4 would be more strongly supported by the given

data than Model 3. That is to say, GC is a valid predictor of type II d'. Note that type I d' can be

naturally correlated with GC and type II d'. Therefore, it is necessary to examine the indirect

correlation between GC and type II d' via type I d'. Moreover, we predict that β1 is not equal to 0

in Models 2 or 4.

¬(β1 = 0)    (15)

The statement in (15) is the second hypothesis of the present study. It means that GC explains the

individual differences of rule-based grammaticality judgment behavior.

In Bayesian statistics, it is assumed that all the parameters are random variables. Researchers, therefore, should specify the prior distributions of the parameters in question. These distributions represent the researchers’ subjective probability, which must be specified prior to the observations. In the present study, the prior distributions are specified in (16) to (18) below:

βj ~ Normal(0, ∞)    (16)

εi ~ Normal(0, σ²)    (17)

1/σ² ~ Gamma(1/2000, 1/2000)    (18)

where σ² is the hyperparameter for εi. This setting of the prior distributions followed the method used by Martin, Quinn, and Park (2011), and is graphically represented in Figure 2. The setting is not derived from any type of target-specific information that the authors have.
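A minimal sketch of how such models could be fitted and compared with the MCMCpack package is given below. The data frame gjt and its columns are hypothetical, and the prior arguments are our reading of the setting described above rather than the authors’ actual script: a nearly flat normal prior on the coefficients (a small but proper precision B0 is used here so that the Laplace marginal likelihood is defined) and c0 = d0 = 0.001, which corresponds to a Gamma(1/2000, 1/2000) prior on the error precision.

library(MCMCpack)   # provides MCMCregress(), BayesFactor(), and coda utilities

# hypothetical participant-level data frame with standardized columns
set.seed(123)
gjt <- data.frame(d2 = as.numeric(scale(rnorm(40))),
                  GC = as.numeric(scale(rnorm(40))),
                  d1 = as.numeric(scale(rnorm(40))))

fit2 <- MCMCregress(d2 ~ GC,      data = gjt, burnin = 1000, mcmc = 10000, thin = 1,
                    b0 = 0, B0 = 1e-4, c0 = 0.001, d0 = 0.001,
                    marginal.likelihood = "Laplace")
fit3 <- MCMCregress(d2 ~ d1,      data = gjt, burnin = 1000, mcmc = 10000, thin = 1,
                    b0 = 0, B0 = 1e-4, c0 = 0.001, d0 = 0.001,
                    marginal.likelihood = "Laplace")
fit4 <- MCMCregress(d2 ~ GC + d1, data = gjt, burnin = 1000, mcmc = 10000, thin = 1,
                    b0 = 0, B0 = 1e-4, c0 = 0.001, d0 = 0.001,
                    marginal.likelihood = "Laplace")

BayesFactor(fit2, fit3, fit4)   # Bayes factor matrix from the Laplace marginal likelihoods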


Figure 2. A graphical representation of the priors in the present study. Plot (a) represents the

normal distributions on the βs, which are virtually uninformative. Plot (b) represents the Gamma

distribution for the hyperparameters of errors. Plot (c) represents the distribution on errors; note that

the distribution is not fixed until the hyperparameter in (b) is given. The plot represents an arbitrary

shape of the prior, where the hyperparameter is equal to 1.00, for instance.

The most important step in Bayesian modeling is to update the prior distributions on the

parameters of the given model with the observations. The updated distributions are called the

posterior distributions, which are quite often substituted with a sample of random numbers from

the posterior distributions; this is called a Markov Chain Monte Carlo (MCMC) sample. The

posterior probability, which takes the form of a distribution, can be represented as

Posterior ∝ Likelihood × Prior (19)

Therefore, it can be understood that the posterior distribution is proportional to the product of the prior distribution that the researchers specified and the likelihood function, which represents the probability of observing the data under a certain model. The present study uses standard Gibbs sampling, which is a common MCMC algorithm, in order to construct the samples from the

posterior distributions. We directly evaluated our hypotheses by computing BFs, calculating Expected A Posteriori (EAP) estimates, and constructing Bayesian credible intervals by means of the highest posterior density interval (HDI) rather than equal-tailed intervals.
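For example, given a vector of posterior draws for one parameter, the EAP and the 95% HDI can be obtained as follows; the simulated draws below merely stand in for an MCMC sample.

library(HDInterval)

set.seed(123)
draws <- rnorm(10000, mean = 0.38, sd = 0.15)   # stand-in for an MCMC sample of one parameter

eap <- mean(draws)                  # Expected A Posteriori estimate
ci  <- hdi(draws, credMass = 0.95)  # 95% highest posterior density interval
c(EAP = eap, ci)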

In sum, following the Bayesian analysis process briefly described above, the present study formally examines the following two research hypotheses:

(a) The population mean of type II d' takes a positive value.

(b) The standard regression coefficient of GC to type II d' is not equal to zero.

If these hypotheses are judged true, we can conclude that individual differences in rule-based

grammaticality judgment behavior can be explained by psychological, behavioral, and

meta-cognitive traits of learners, such as GC.


3. Method

3.1 Participants

Forty graduate students who speak Japanese as a first language and are learning English

for academic and occupational purposes participated in a grammaticality judgment task. All of the

participants majored in engineering. Thirty-eight out of 40 were male. Their self-reported TOEIC

scores ranged from 245 to 750 (M = 446.63, SD = 98.41). All the participants signed a consent

form before the experiment, and the ethical issues concerning the present study were thoroughly

explained. No outliers were excluded.

3.2 Instruments

Computer-based grammaticality judgment tasks with both type I and type II paradigms were used. In the type I task, the participants were asked to judge the grammaticality of a given English

sentence by pressing a response key: "grammatical" or "ungrammatical". The participants were

then asked about their mental states during the judgment, to which they responded by pressing the

“rule” or “intuition” key. A practice session with 10 trials was provided before the experiment.

The computer program was run on Windows OS, and administered on desktop PCs with 21-inch

monitors for all participants. The stimuli set consisted of 48 sentences, including structures such as

(a) adverb placement (k = 6), (b) auxiliary verbs (k = 6), (c) possessives (k = 6), (d) verb

complements (k = 6), (e) embedded questions (k = 6), (f) conditionals (k = 6), (g) infinitives (k =

6), and (h) third person singular phrases (k = 6). Half of the stimuli (k = 24) were grammatical, and the rest were ungrammatical (k = 24), crossed with the structures. The stimuli were presented using block randomization. The task imposed neither a time limit nor an accelerated pace.

After the GJTs, the participants filled out a paper copy of the questionnaire form. The

questionnaire included items about their demographic information and GC items (k = 14). The GC

items were answered using a seven-point Likert scale, following Kusanagi et al. (2015).

3.3 Analysis

Firstly, we calculated the descriptive statistics of (a) accuracy, (b) the “rule” response

ratio, (c) the GC score, (d) type I d', (e) type II d', (f) the type I criterion, and (g) the type II

criterion. Of these variables, only (c), (d), and (e) were of primary interest; the other secondary

variables are for reference. For (a), (b), and (c), the reliability coefficients were calculated.

Although GC has a multi-factor model, the present study did not estimate the factor scores of the subscales. Rather, we used the summated score of all the items, as the factors are known to have extremely high covariances (Kusanagi et al., 2015).

The series of Bayesian analyses described in Section 2.3 was conducted using R (R Core

Team, 2016) with the psych (Revelle, 2016), MCMCpack (Martin, Quinn, & Park, 2011), GGally

(Schloerke et al., 2017), and HDInterval (Meredith & Kruschke, 2016) packages.
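A rough sketch of this pipeline, with hypothetical item-level matrices standing in for the actual data, might look as follows.

library(psych)    # alpha() for reliability, describe() for descriptive statistics

set.seed(123)
# hypothetical item-level matrices: 40 participants x 48 GJT items (0/1) and 40 x 14 GC items (1-7)
type1_items <- matrix(rbinom(40 * 48, 1, .68), nrow = 40)
gc_items    <- matrix(sample(1:7, 40 * 14, replace = TRUE), nrow = 40)

alpha(as.data.frame(type1_items))$total$raw_alpha   # Cronbach's alpha for type I responses
alpha(as.data.frame(gc_items))$total$raw_alpha      # Cronbach's alpha for the GC scale

describe(data.frame(Accuracy = rowMeans(type1_items),
                    GC       = rowMeans(gc_items)))  # M, SD, skewness, kurtosis, and so on
# GGally::ggpairs() on the full variable set would produce a plot matrix like Figure 3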


4. Results

4.1 Data Description

The descriptive statistics of all the variables are summarized in Table 4. The correlation

matrix among the variables is shown in Table 5. Figure 3 graphically represents the distributions

and correlations of the variables. We found that assuming normality for types I and II d' and the

GC scores is a statistically valid procedure. Accuracy and the "rule" response ratio showed neither a ceiling nor a floor effect.

The reliability coefficients of the type I responses (k = 48), type II responses (k = 48), and

GC scores were then calculated using Cronbach’s α. The results were α = .61 for type I responses,

α = .91 for type II responses, and α = .90 for the GC score. Type I responses showed a slightly

lower level of reliability than the others. However, we judged the reliability sufficient for

subsequent analyses.

Table 4.

Descriptive Statistics of All the Variables

                          M      SD   Skewness   Kurtosis   Minimum   Maximum
Accuracy                0.68    0.09    -0.35      -0.04      0.42      0.86
“Rule” Response Ratio   0.55    0.22    -0.35      -0.48      0.04      0.92
GC Score                3.25    0.89     0.27       0.24      1.36      5.71
Type I d'               0.98    0.53    -0.17      -0.13     -0.43      2.12
Type II d'              0.72    0.97    -1.01       2.96     -2.87      2.53
Type I Criterion        0.12    0.30     0.33       0.92     -0.63      0.97
Type II Criterion      -0.73    0.25    -1.67       3.70     -1.64     -0.38

Note. N = 40.

Table 5.

Correlation Matrix of All the Variables

                             1      2      3      4      5      6      7
1. Accuracy                1.00
2. “Rule” Response Ratio  -0.13   1.00
3. GC Score                0.22   0.31   1.00
4. Type I d'               0.99  -0.11   0.21   1.00
5. Type II d'              0.26   0.90   0.38   0.27   1.00
6. Type I Criterion       -0.31   0.10   0.01  -0.25  -0.05   1.00
7. Type II Criterion       0.09   0.39   0.16   0.07   0.45  -0.32   1.00


Figure 3. Plot matrix representing the univariate and bivariate kernel density and scatter plots among all

the variables (N = 40). The numbers represent: 1 = Accuracy, 2 = Rule response ratio, 3 = GC score, 4 =

Type I d', 5 = Type II d', 6 = Type I criterion, and 7 = Type II criterion.

4.2 The General Tendency of Rule-Based Grammaticality Judgment Behavior

In order to verify research hypothesis 1, we examined the sample from the posterior distribution of β0 in Model 1 using standard Gibbs sampling with 10,000 iterations and a single chain, without thinning. Using Geweke's diagnostic, we confirmed that the sampling had converged successfully. The MCMC sample from the posterior distribution is summarized in Table 6 below: the EAP of β0 was .73, and the 95% Bayesian credible interval constructed by the HDI ranged from 0.40 to 1.04. Figure 4 below shows the density of the MCMC sample on β0 in Model 1. These results clearly supported

research hypothesis 1.

Table 6.

The Summary of the MCMC Sample in Model 1

        EAP     SD    2.5%   25.0%   50.0%   75.0%   97.5%
β0      0.73   0.16   0.41    0.62    0.73    0.83    1.04
σ²      1.00   0.24   0.65    0.83    0.97    1.13    1.56

Note. The percentages here represent percentile points rather than the boundaries of the HDI.
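For reference, the convergence check and the summary reported here could be produced along the following lines; fit1 is a hypothetical intercept-only MCMCregress fit standing in for the Model 1 sampler, with simulated type II d' values.

library(MCMCpack)   # loads coda, which supplies geweke.diag()

set.seed(123)
d2 <- rnorm(40, mean = 0.7, sd = 1)          # hypothetical type II d' values for 40 learners
fit1 <- MCMCregress(d2 ~ 1, burnin = 1000, mcmc = 10000, thin = 1)

geweke.diag(fit1)                 # |z| values well below about 2 suggest convergence
summary(fit1)                     # EAPs, SDs, and percentile points, as in Table 6
mean(fit1[, "(Intercept)"] > 0)   # posterior probability that the population mean is positive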


Figure 4. Density plot representing the general tendency of

rule-based grammaticality judgment behavior

4.3 Grammatical Carefulness Explains the Individual Differences

For research hypothesis 2, the MCMC samples for Models 2, 3, and 4 were generated using Gibbs sampling with the same settings as for Model 1. All of the samplings were confirmed to have converged successfully. The BF matrix of the three models was then calculated from the marginal likelihoods using the Laplace approximation. The matrix is shown in Table 7 below. As for the results of the model selection, Model 2 was found to be the most favorable in terms of BF.

Table 7.

Bayes Factor Matrix

              Log Marginal                Denominator
Numerator     Likelihood        Model 2     Model 3     Model 4
Model 2         -65.19            1.00        5.01        1.22
Model 3         -66.80            0.20        1.00        0.24
Model 4         -65.38            0.82        4.12        1.00
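Incidentally, each entry in Table 7 can be reproduced from the log marginal likelihoods in the second column: a Bayes factor is the exponentiated difference between two log marginal likelihoods. For instance, the Bayes factor of Model 2 over Model 3 is exp(−65.19 − (−66.80)) = exp(1.61) ≈ 5.0, which agrees with the tabulated value of 5.01 up to rounding; in R, this is simply exp(-65.19 - (-66.80)).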

We therefore examined Model 2 in greater detail. The MCMC samples from the

posterior distributions in Model 2 are summarized in Table 8 below. Figure 5 below graphically

represents the posterior distributions and the traces of the terms. The Bayesian credible interval of

β1, the standardized regression coefficient of GC scores, ranged from 0.06 to 0.67, and the EAP

was 0.38. This interval clearly falls within the positive range. Thus, the results suggest that research hypothesis 2 is plausible.

Table 8.

The Summary of the MCMC Sample in Model 2

        EAP     SD    2.5%   25.0%   50.0%   75.0%   97.5%
β0      0.00   0.15  -0.29   -0.09    0.00    0.10    0.30
β1      0.38   0.15   0.07    0.27    0.38    0.48    0.68
σ²      0.92   0.23   0.58    0.77    0.80    1.05    1.46


Figure 5. Trace and density plots representing the MCMC samples from the posterior distributions in Model 2. Plots (a), (b), and (c) represent the traces of β0, β1, and σ², respectively, and plots (d), (e), and (f) show the kernel density of each.

5. Conclusion and Implications

Our study clearly revealed that (a) foreign language learners tend to benefit more from rule-based knowledge representations and less from intuition, and (b) variations in this tendency can be explained, at least partially, by GC, one of the psychological, behavioral, and meta-cognitive traits of foreign language learners.

There remain issues that are unexplained. Firstly, it can be stated that judgment behavior is a window for examining one’s knowledge and cognitive operations, but it does not fully

explain all real-time language use. Therefore, the results in the present study do not necessarily

mean that all types of language use basically rely on rule-based knowledge representations that are moderated by learners’ traits such as GC. Secondly, we abstracted away a very important aspect of judgment behavior, namely reaction time. As we mentioned earlier, reaction time

must exhibit a complex relationship with rule-based grammaticality judgment behavior. Future

studies should examine the relationship between the two. Thirdly, the present study addressed

only one of learners’ psychological, behavioral, and meta-cognitive traits. Other more common

traits such as motivation, or epistemological factors such as learner belief and learning attitudes,

may be alternative predictors of individual differences in rule-based grammaticality judgment behavior. Lastly, although we interpreted the obtained results carefully within the framework of Bayesian modeling, this does not guarantee that the results of the present study are generalizable to any

populations.


Although such limitations should be taken seriously, our findings may shed light on the

field of assessing and testing grammatical performance. For instance, type II responses themselves can provide very important additional information, not only for understanding cognitive processes but also for making educational decisions, such as curriculum design, student placement, and

material development.

Suppose that there are two learners who perform very similarly in the type I task, but one benefits more from rules while the other benefits more from intuition; rationally, the pedagogical treatments and materials that the two learners need may differ. For instance, the teaching material needed by the learner who tends to rely on intuition may be a textbook that gives precise information about grammatical rules, rather than one of the pattern-practice type. However, such optimization of treatments and materials cannot be realized without understanding the individual differences underlying the latent factors behind the performance. Needless to say, in order to understand such hidden individual differences and successfully apply them to everyday teaching practices, formal approaches such as Bayesian modeling, which we have implemented in the present study, will be very helpful and promising.

References

Ellis, R. (1991). Grammaticality judgments and second language acquisition. Studies in Second

Language Acquisition, 13, 161–186.

Ellis, R. (2005). Measuring implicit and explicit knowledge of a second language: A psychometric

study. Studies in Second Language Acquisition, 27, 141–172.

Ellis, R. (2006). Modeling learning difficulty and second language proficiency: The differential

contributions of implicit and explicit knowledge. Applied Linguistics, 27, 431–463.

Galvin, S. J., Podd, J. V., Drga, V., & Whitmore, J. (2003). Type 2 tasks in the theory of signal

detectability: Discrimination between correct and incorrect decisions. Psychonomic Bulletin

and Review, 10, 843–876.

Kruschke, J. K. (2010). Doing Bayesian data analysis: A tutorial with R and BUGS. New York,

NY: Academic Press.

Kunimoto, C., Miller, J., & Pashler, H. (2001). Confidence and accuracy of near-threshold

discrimination responses. Consciousness and Cognition, 10, 294–340.

Kusanagi, K., Fukuta, J., Kawaguchi, Y., Tamura, Y., Goto, A., Kurita, A., & Murota, D. (2015).

Foreign Language Grammatical Carefulness Scale: Scale development and its initial

validation. Annual Review of English Language Education in Japan, 26, 77–92.

Kusanagi, K. (2018). Gaikokugo no bunpouchishiki ni okeru ichigensei no kenshou: Bunpousei

handan no seitouritsu, hannouzikan, shukantekihensuu o taishou ni (doctoral dissertation) [A

unitary view of grammatical knowledge in a foreign language.]. Nagoya University, Nagoya,

Japan.


Lee M. D., & Wagenmakers E. J. (2014). Bayesian cognitive modeling: A practical course.

Cambridge University Press.

Macmillan, N. A., & Creelman, C. D. (2005). Detection theory: A userʼs guide. Mahwah, NJ:

Lawrence Erlbaum Associates.

Mandell, P. B. (1999). On the reliability of grammaticality judgment tests in second language

acquisition research. Second Language Research, 15, 73–99.

Maniscalco, B., & Lau, H. (2012). A signal detection theoretic approach for estimating

metacognitive sensitivity from confidence ratings. Consciousness and Cognition, 21,

422–430.

Martin, A. D., Quinn, K. M., & Park, J. H. (2011). MCMCpack: Markov Chain Monte Carlo in R. Journal of Statistical Software, 42(9), 1–21.

Meredith, M., & Kruschke, J. (2016). HDInterval: Highest (Posterior) Density Intervals. R

package version 0.1.3. https://CRAN.R-project.org/package=HDInterval

Pleskac, T. J., & Busemeyer, J. R. (2010). Two-stage dynamic signal detection: A theory of

choice, decision time, and confidence. Psychological Review, 117, 864–901.

Ratcliff, R., & Starns, J. J. (2013). Modeling confidence judgments, response times, and multiple

choices in decision making: Recognition memory and motion discrimination. Psychological

Review, 120, 697–719.

Rebuschat, P. (2013). Measuring implicit and explicit knowledge in second language research.

Language Learning, 63, 595–626.

Revelle, W. (2016). psych: Procedures for personality and psychological research. R package

version 1.6. http://CRAN.R-project.org/package=psych

Schloerke, R., Crowley, J., Cook, D., Briatte, F., Marbach, M., Thoen, E., Elberg, A., &

Larmarange, J. (2017). GGally: Extension to 'ggplot2'. R package version 1.3.2.

https://CRAN.R-project.org/package=GGally

Scott, R. B., & Dienes, Z. (2008). The conscious, the unconscious, and familiarity. Journal of

Experimental Psychology: Learning, Memory, and Cognition, 34, 1264–1288.

Stanislaw, H., & Todorov, N. (1999). Calculation of signal detection theory measures. Behavior

Research Methods, Instruments, and Computers, 31, 137–149.

Tamura, Y., Harada, Y., Kato, D., Hara, K., & Kusanagi, K. (2016). Unconscious but slowly

activated grammatical knowledge of Japanese EFL learners: A case of tough movement.

Annual Review of English Language Education in Japan, 27, 169–184.

Tunney, R. J., & Shanks, D. R. (2003). Subjective measures of awareness and implicit cognition.

Memory and Cognition, 31, 1060–1071.

Yonelinas, A. P. (2002). The nature of recollection and familiarity: A review of 30 years of

research. Journal of Memory and Language, 46, 441–517.
