Testing the limits of orthogonality: A study of the interaction between lexical frequency and independent variables
Anna E. Wilson Indiana University
Abstract
Spanish is a “pro-drop” language, meaning that subject personal pronouns (SPPs)
may be phonetically null. Due in large part to the well-understood
dialectal differences and universal constraints operating on SPP expression,
Bayley, Cárdenas, Schouten, and Vélez Salas (2012) have highlighted it as a
“classic” or “showcase” variationist variable.
The present study follows the current trend in using SPP expression as a
starting point in order to investigate the effects of lexical frequency on
morphosyntactic independent variables and furthers discussion by examining its
behavior in a comparative study (Bayley, Holland & Ware, 2013; Erker & Guy,
2012). This study compares SPP expression of 6 Mexican immigrants living in
the US with that of 6 speakers living in Mexico City (Butragueño & Lastra,
2000; Wilson, 2013). The data are coded for previous realization (perseverance),
continuity of reference, tense-mood-aspect, genre, lexical frequency (based on
1% of non-lemmatized verbs from Bayley et al. (2013) and Erker and Guy
(2012)), morphological regularity, and verbal semantic content. Lexical
frequency is found to behave non-orthogonally, supporting the conclusions of
Erker and Guy (2012).
Keywords: Spanish, sociolinguistics, pronoun expression, lexical frequency,
language contact
2 IULC Working Papers
1. Introduction
In Spanish, subject personal pronouns (SPP) may be expressed (i.e. explicit) or
phonetically null (i.e. not expressed), such as in example (1):
(1) Yo canto ‘I sing’
Ø canto ‘I sing’
In variationist sociolinguistics, SPP expression in Spanish is one of the most
widely investigated variables (Bayley & Pease-Álvarez, 1997; Cameron, 1993,
1995; Lapidus & Otheguy 2005a, 2005b; Silva-Corvalán 1997a, 1997b; Torres-
Cacoullos & Travis 2010; among many others). Due in large part to the dialectal
differences and universal constraints operating on SPP expression, Bayley,
Cárdenas, Schouten, and Vélez Salas (2012, pp. 49-50) have highlighted it as a
“classic” or “showcase” variationist variable. The fact that the processes
influencing SPP expression are well documented is precisely why researchers
continue to conduct studies incorporating it. The past several years have witnessed
a transition away from defining the constraints acting upon SPP expression to
using SPPs as a “jumping-off” point for addressing more complex
(morpho)syntactic issues, such as the role of lexical frequency in conditioning
variation (Bayley, Holland & Ware, 2013; Erker & Guy, 2012), how the strength
of constraints affects child first language (L1) acquisition (Shin & Erker,
forthcoming), and how typologically diverse languages share constraints with
regard to subject expression (Torres-Cacoullos & Travis, 2013).
The present study follows the current trend in using SPP expression as a
starting point in order to further investigate the effects of lexical frequency on
morphosyntactic dependent and independent variables (Bayley et al., 2013; Erker
& Guy, 2012). The questions motivating this study are as follows: (1) Does lexical
frequency behave nonorthogonally, as in Erker and Guy (2012), or as an
independent variable, as in Bayley et al. (2013)? (2) Do Mexican speakers who
have embraced a community within the US pattern differently in lexical frequency
than speakers living in Mexico City? In order to answer these questions, the
effects on internal independent variables are explored in two corpora of speakers:
six speakers from a corpus of Mexican immigrants living in Roswell, Georgia,
United States (Wilson, 2013) are compared with six speakers from the PRESEEA
corpus of speakers from Mexico City, Mexico (Butragueño & Lastra, 2000).1
This study is organized as follows: I first discuss the theoretical
standpoint of previous and current variationist sociolinguistic work regarding
subject personal pronoun (SPP) expression in §2 before elaborating on the specific
approach and theoretical underpinnings of this study in §3, including the envelope
of variation and the dependent and independent variables. Afterwards follows a
presentation of results in §4 and a discussion of their implications in §5.
1 An earlier version of this article included an investigation into use of self-referential
‘uno’. This element has been removed from the current version due to inconclusive
findings from low frequency.
2. Theoretical Background
The review of previous and current perspectives in SPP research is divided into
two principal sections: (1) constraints operating on SPP expression and (2) trends
in recent investigations regarding variable SPP expression.
2.1 Main Constraints of SPP Expression
A key component of variationist work which is seen in SPP studies is the goal of
describing highly generalizable patterns which adequately describe trends in
multiple dialects of a given language. Of the main constraints that variationist
researchers agree upon, tense, mood, and aspect (abbreviated TMA); switch reference;
perseverance/priming; semantic class of verb; genre; and place of origin are the
most salient (Barrenechea & Alonso, 1977; Bentivoglio, 1987; Cameron, 1993,
1995; Cameron & Flores-Ferrán, 2004; Enríquez, 1984; Flores-Ferrán, 2002,
2007; Miyajima, 2000; Morales, 1997; Otheguy et al., 2007; Otheguy & Zentella,
2012; Silva-Corvalán, 2003; Travis, 2005, 2007; Toribio, 2000, among others). In
Spanish, verbs are typically morphologically distinct for subject. Verb endings are
encoded with subject information, rendering SPP expression unnecessary for
distinguishing the subject; however, in certain cases, first, third, and formal
second person are indistinguishable. For example, conditional, past imperfect, and
subjunctive forms fall into this category (e.g. comería ‘He/she/Usted/I
would eat’, cantaba ‘He/she/Usted/I used to sing’, venga ‘He/she/Usted/I
subjunctive come(s)’). The expectation is that speakers will express more SPPs in
contexts of morphological ambiguity, a perspective which began with the work of
Gili Gaya (1970) and Hochberg (1986). The more contemporary works of
Cameron (1993, 1994) discuss this constraint in-depth, finding that present tense
ser ‘to be,’ an unambiguous verb, actually has the highest rate of expressed SPPs,
followed by two- and three-way ambiguous verbs2. Silva-Corvalán (2003)
proposes a “semantic-pragmatic” hypothesis to account for the variable expression
of SPPs under the TMA constraint. She argues that the more ambiguous imperfect
tense serves to present “background” information in discourse, while the more
unique preterit forms present “foreground” information, giving us the saliency we
see in the TMA constraint (Silva-Corvalán, 2003).
Also according to Silva-Corvalán (2003, p. 2), another constraint, that of
co-referentiality (commonly called switch reference), is perhaps the “most
statistically significant factor in all the completed studies so far” (translation
mine). Several studies that established this position for switch reference early on
are Silva-Corvalán (1982), Bentivoglio (1987), and Cameron (1993, 1995).
According to Cameron (1993), same reference holds when the “target” NP, expressed
SPP, or null subject refers to the same subject as the “trigger” NP, expressed SPP, or
2 Three-way ambiguous verbs are those in which first, third, and formal second person are
indistinguishable. For example, conditional, past imperfect, and subjunctive forms fall
into this category (e.g. comería ‘He/she/Usted/I would eat’, cantaba ‘He/she/Usted/I used to
sing’, venga ‘He/she/Usted/I subjunctive come(s)’). A two-way ambiguous verb is one such
as canta ‘He/she/Usted sings’ (present indicative), since the ambiguity exists only between
second person (formal) and third person.
null subject. Across the board, these researchers, among many others since these
publications, have found that SPPs are disfavored in environments of same
reference, while the converse is also true—SPPs are favored in contexts of switch
reference. Perhaps one of the most well-known works is Cameron (1995), which
provides a thorough exploration of the process of switch reference. He defines a
“reference chain” as a sequence of two or three NPs which maintain the same
referent in ongoing discourse (Cameron, 1995).
Recently, there has been more work detailing how perseverance affects
SPP expression. While switch reference deals with the referent of the trigger,
perseverance addresses the form of the trigger. When interpreting studies, it is
helpful to consider that switch reference concerns who is being referred to, while
perseverance concerns how the speaker refers to that subject. Cameron’s (1995)
reference chains also pertain to perseverance, as the author uses them to argue that
expressed yo will lead to more expressed yo’s, while null SPPs will lead to more
null expressions, a position also supported by Flores-Ferrán (2005). Travis (2005)
calls this process the “yo-yo effect” and elaborates that its effects are only felt at
low degrees of distance.
Another constraint that researchers agree impacts SPP expression is
semantic class of the verb. Since sociolinguistic interviews are centered on the
perspective of the speaker, higher SPP expression is generally observed with verbs
which index the speaker’s opinion or mental activity (Bentivoglio, 1987;
Enríquez, 1984; Miyajima, 2000; Morales, 1997; among others). In terms of genre
effects on SPP expression, both Travis (2007) and Flores-Ferrán (2005) reveal that
narrative environments are less likely to have expressed SPPs. Related to aspects
of Travis (2007), place of origin has been observed to differentiate the SPP
expression of “Mainlander” and Caribbean varieties of Spanish, to use the
terminology of Otheguy, Zentella and Livert (2007) and Otheguy and Zentella
(2012). As a whole, speakers of Caribbean Spanish varieties have more expressed
SPPs, while speakers of Mainland Spanish varieties have more null SPPs
(Cameron, 1993; Otheguy et al., 2007; Otheguy & Zentella, 2012; Toribio, 2000;
among others).
In terms of purely social factors, a high level of variability has been
identified in different Spanish varieties. Nearly every study that includes social
variables attests to a different social constraint ranking and conditioning effect on
SPP expression (Ávila-Jiménez, 1995; Otheguy & Zentella, 2012; Silva-Corvalán,
2001; among others). An interesting complement to this component of SPP
expression is the lack of agreement on the effects of language contact, with
English as well as among Spanish varieties, on SPP expression. There is a general
trend that Spanish in contact with English fails to display more expressed SPPs,
but that view is not corroborated by everyone (Lapidus & Otheguy, 2005a, 2005b;
Flores-Ferrán, 2004).
2.2 Trends in Recent Work
In recent studies of SPP expression, scholars have been moving beyond defining
the constraints acting on SPP use. For example, Erker and Guy (2012) describe
the effects of lexical frequency on a traditional SPP expression study. Instead of
operating like a typical independent variable which either favors or disfavors the
expression of SPPs (or is not significant), they find that lexical frequency is a
“nonorthogonal” variable that enhances the qualities of the independent variables,
but does not directly affect the dependent variable. Instead of directly constraining
the expression of SPPs (the dependent variable), a “nonorthogonal” variable (in
the case of Erker and Guy (2012), lexical frequency) affects the independent
variables (such as TMA, switch reference, etc.), which in turn influence the
dependent variable. The whole process is akin to a cascade of influences, starting
with the nonorthogonal variable and ending with the dependent variable.
Following this train of thought, a disfavoring constraint will disfavor more
strongly in frequent verbs, defined in their study as a verb which comprises 1% or
more of non-lemmatized3 verbs in their corpus. Conversely, a favoring constraint
will favor SPP expression more strongly in frequent verbs. In infrequent verbs,
that is, those which make up under 1% of non-lemmatized verbs in their
corpus, no effect on constraints is observed (Erker & Guy, 2012). In response,
Bayley et al. (2013) conducted a similar study where the authors attempt to
replicate the methodology used in Erker and Guy (2012). Bayley et al. (2013)
reveal that lexical frequency does not behave in a nonorthogonal manner; rather,
they hold that the effect lexical frequency has on SPP expression is that of a
traditional independent variable. They find that frequent verbs disfavor SPP
expression, and that infrequent verbs favor expression (Bayley et al., 2013).
Outside of Bybee (2001, 2002), Erker and Guy (2012) and Bayley et al. (2013) are
the only two studies which address lexical frequency from a variationist
perspective. The present study furthers the discussion of frequency by
investigating its behavior in a comparative study. The well-understood base of
SPP expression has proven in practice to be a viable starting point for
distinguishing deeper connections in (morpho)syntactic behavior (Bayley et al.,
2013; Erker & Guy, 2012; Shin & Erker, forthcoming; Torres-Cacoullos & Travis,
2013), and it is this mentality which motivates the present study.
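The frequency threshold described above for Erker and Guy (2012) can be made concrete with a short sketch. The Python fragment below is illustrative only: the function name, the toy token list, and the treatment of the boundary case (a form at exactly 1% counts as frequent, following "1% or more") are assumptions made for the example, not the coding procedure applied to the corpora of the present study.

```python
from collections import Counter

def classify_by_frequency(verb_tokens, threshold=0.01):
    """Label each non-lemmatized verb form as 'frequent' or 'infrequent'.

    A form is 'frequent' when it accounts for at least `threshold`
    (1% by default) of all verb tokens in the sample.
    """
    counts = Counter(verb_tokens)
    total = sum(counts.values())
    return {
        form: "frequent" if count / total >= threshold else "infrequent"
        for form, count in counts.items()
    }

# Hypothetical toy sample of 200 first person verb tokens (not corpus data).
tokens = ["tengo"] * 120 + ["digo"] * 60 + ["canto"] * 19 + ["bailo"] * 1
labels = classify_by_frequency(tokens)

for form, label in labels.items():
    print(form, label)
```

Because the forms are not lemmatized, tengo, tuve, and tenía would each be counted separately in such a tally rather than pooled under tener.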
3. Methodology
3.1 Speakers
The data used in this study come from transcripts of sociolinguistic interviews
with 12 Mexican speakers. The twelve speakers are pulled from two different
corpora, which are utilized in order to compare differences in patterning of lexical
frequency across speakers from the same cultural heritage living in the US and
Mexico. Half of the speakers included in the present study are first generation
immigrants living in the exurb Roswell, Georgia, United States at the time of the
interview. These data were collected by the author for a quantitative study of
narrative structure as part of the civic-academic partnership the Roswell Voices
3 A “lemma” is a verb stem that has restricted semantic content. Also called a “lexeme.”
Lemmatization refers to the grouping of inflected forms under a single lemma
(e.g. tengo ‘I have,’ tuve ‘I had,’ and tenía ‘I used to have’ with tener).
Project (Wilson, 2013). The interviews follow a guided conversational protocol
covering aspects of local community life and intracommunity relationships as well
as personal history. The average interview length is 5,805 words. The other half (6)
of the speakers come from the PRESEEA corpus of Mexican speakers from
Mexico City (Butragueño & Lastra, 2000). The topic of discussion in the
PRESEEA interviews ranges from personal history, to education, employment,
and community life. The average length of the interviews by words transcribed is
5,657 words. Having similar topics in the interviews from the two corpora is
essential for being able to compare them, as speech dealing with similar topics
helps to control, to an extent, the types of structures employed by the speakers.
3.2 Independent Variables
For this study, seven internal factors and five external factors are coded, providing
2,063 tokens (354 verb types) within the envelope of variation. Independent
variables were chosen based on significance in previous studies on SPP
expression, and include the following internal factors:
1) Perseverance (null, yo, uno)
2) Switch reference (maintained, switched)
3) TMA (ambiguous, unambiguous)
4) Genre (narrative, opinion, other)
5) Verbal semantic content (mental activity, stative, external activity)
6) Morphological regularity (regular, irregular)
7) Lexical frequency (frequent, infrequent).
The external factors are biological sex (male, female), socioeconomic class
(middle, working), age at time of interview (under/over 40), education (primary,
secondary, university or trade school), and for the Roswell speakers, years in the
US (below/above 15 years). The definition of independent variables for this study
is based primarily on Erker and Guy (2012) and Travis (2007).
3.3 Dependent Variable
The dependent variable is a binomial one based on the variation between null
subject expression (Ø) and overt first person yo. In defining the envelope of
variation, exclusions include the following: Experiencer-subject constructions
such as me gusta or me hace (due to verbal agreement with object rather than
subject), false starts, and elided verbs, such as no, yo tampoco [voy] ‘No, I’m not
[going] either’ (due to invariability). Only finite verbs with first person reference
were coded, according to the precedent established by Silva-Corvalán (1982). In
contrast to Silva-Corvalán (1994), but in line with Amaral and Schwenter (2005),
contrastive situations were regarded as within the envelope of variation because of
the acceptability of adverbial phrases carrying the contrastive emphasis.
Additionally, quoted speech is included in the envelope of variation, as there
is no theoretical reason it would not exhibit effects of perseverance or other
independent variables.
3.4 Analysis
The anticipated findings in terms of the dependent variable are as follows: Subject
expression is expected to be predominantly null due to the historically low SPP
expression among Mexicans (Bayley & Pease-Álvarez, 1997; Otheguy &
Zentella, 2012; Silva-Corvalán, 1994; among others). The data are analyzed using
Goldvarb X, a statistical package designed for logistic regression analysis
targeting language data. In the present study, results either favor null SPPs with a
factor weight (fw) of 0.5 or above or disfavor null SPPs with a factor weight
below 0.5. Three separate runs were conducted on the data. Like Bayley et al.
(2013), the frequent and infrequent verbs were each run separately and a
combined run with all verbs was conducted after. In the runs with verbs separated
by frequency, only the internal factors are analyzed for a more fine-grained
comparison with the results of Bayley et al. (2013) and Erker and Guy (2012).
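As an aid to reading the factor weights reported in §4, the following sketch shows how a weight relates to favoring or disfavoring, how a range is obtained, and how weights might combine with the corrected mean. It assumes the usual Varbrul-style formulation, in which factor weights and the corrected mean (treated here as the input probability) combine additively on the log-odds scale; this formulation is not spelled out in the present study, and the function names are hypothetical. The numerical values are those reported in Table 2.

```python
import math

def weight_to_log_odds(fw):
    """Convert a factor weight (0 < fw < 1) to log-odds; 0.5 maps to 0."""
    return math.log(fw / (1 - fw))

def log_odds_to_probability(log_odds):
    """Inverse of the log-odds transformation."""
    return 1 / (1 + math.exp(-log_odds))

def effect_on_null(fw):
    """Apply the study's reading rule: 0.5 or above favors null SPPs."""
    return "favors null" if fw >= 0.5 else "disfavors null"

def factor_range(weights):
    """Range of a factor group: highest minus lowest weight, times 100."""
    return round((max(weights) - min(weights)) * 100)

# Values as reported in Table 2 (combined run).
corrected_mean = 0.58
tma = {"unambiguous": 0.54, "ambiguous": 0.29}
perseverance = {"null": 0.57, "yo": 0.41}

tma_range = factor_range(tma.values())
perseverance_range = factor_range(perseverance.values())

# Assumed additive combination on the log-odds scale for one context:
# unambiguous TMA preceded by a null subject.
cell_log_odds = (
    weight_to_log_odds(corrected_mean)
    + weight_to_log_odds(tma["unambiguous"])
    + weight_to_log_odds(perseverance["null"])
)
predicted_null = log_odds_to_probability(cell_log_odds)

print(effect_on_null(tma["ambiguous"]), tma_range, perseverance_range)
print(round(predicted_null, 2))
```

Under these assumptions, a weight of exactly 0.5 contributes nothing on the log-odds scale, which is why 0.5 serves as the dividing line between favoring and disfavoring.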
4. Results
4.1 Internal Factors
Out of 2,063 total tokens, frequent verbs account for 1,460, or roughly 70% of the
observations. The remaining 603 tokens are designated as infrequent. Based on
the typically Zipfian distribution of language data, a division of
frequent/infrequent verbs along the lines of 70%-30% is within the normal range
(Zipf, 1935, 1945). Forty-three (43) verbs are coded as frequent, and 311 verbs are
coded as infrequent. Figure A below shows percent SPP expression for each of the
43 frequent verbs.
Figure A. Percent SPP expression by frequent verbs.
[Figure A not reproduced. x-axis: frequent verbs (labels include llevo, vi, pongo,
voy, dije, conozco, tuve, acuerdo, empecé, estoy, quiero, vengo, llegué, hablo,
decía, soy, vine, quedo, pienso, trabajo); y-axis: percent SPP present, 0% to 100%.]
The percent of expressed SPPs ranges from 10% with llevo ‘I carry/wear’ to 90%
with sabía ‘I used to know.’ Of these verbs, 30 are external activity, eight are
stative, and five are mental activity. Overall, frequent verbs have a rate of
expressed SPPs equivalent to that of infrequent verbs (see Figure B).
Figure B. Percent SPP expression by infrequent and frequent verbs.
[Figure B not reproduced. x-axis: verb frequency (infrequent, frequent); y-axis:
percent SPP present, 0% to 100%.]
Unlike the results for Erker and Guy (2012), the present study finds that SPP
expression does not differ significantly between infrequent and frequent verbs
(t-test with unequal variances, p=0.981). At first glance, Figure B seems to
confirm Erker and Guy’s (2012, p. 550) statement that a lower threshold for
defining frequency is a number “below which pronoun rates are not well
differentiated.” However, the log distribution of the data does not show a clear
“inflection point,” as do the data for Erker and Guy (2012) (see Figure C).
Figure C. Verbs by percent SPP expression and log frequency.
Figure C shows percent SPP expression by log frequency. Darker circles represent
points of overlapping measurement. As in Figure B, verbs above 1.0 log
(frequent)4 have approximately the same rate of SPP expression as verbs below
1.0 log (infrequent), as indicated by the non-significant regression line. By plotting
log frequency against SPP expression, we can visualize how lexical frequency
does not directly affect rates of pronoun expression. If lexical frequency did alter
SPP expression, Figure C would show a change in pattern across the frequency
cutoff point (for this study 1.0 log), as in Figure D (below) from Erker and Guy
(2012).
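The comparison that Figure C visualizes can be sketched numerically. The fragment below computes a log frequency and a percent SPP value per verb form and fits an ordinary least-squares slope; a slope near zero corresponds to the flat pattern described above. The per-verb counts are hypothetical, the base of the logarithm is assumed to be 10 (the base is not stated in the text), and the sketch does not include the significance test of the regression line.

```python
import math

def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical counts per verb form: (total tokens, tokens with overt yo).
verbs = {
    "tengo": (100, 40),
    "digo": (50, 21),
    "canto": (10, 4),
    "bailo": (5, 2),
}

log_frequency = [math.log10(total) for total, _ in verbs.values()]
percent_spp = [100 * overt / total for total, overt in verbs.values()]

slope = least_squares_slope(log_frequency, percent_spp)
print(round(slope, 2))  # percentage points of SPP per unit of log frequency
```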
Figure D. From Erker & Guy (2012), p. 538: “Log frequency and percent SPPs
present”
In Figure D, we see a difference in distribution of percent pronouns present before
and after the approximate location of the cutoff (1.5-1.72 log) (Erker & Guy,
2012). Table 1 below shows Goldvarb results for two separate runs of frequent
and infrequent verbs. The objective of Table 1 is to identify differences between
the models for frequent and infrequent verbs. Of the six independent variables
(perseverance, switch reference, TMA, genre, morphological regularity, and
semantic content), switch reference, genre, and semantic content are excluded
from the run for frequent verbs due to interaction. Based on observations of the
data’s statistical relationships, I suspect genre and semantic content interact with
each other. The relationship is one where a verb with a given semantic
content, say external activity, occurs predominantly in one classification of genre,
like narrative. Due to the notoriously poor distribution of natural language data, a
much larger corpus than that of the present study would be needed to overcome
this interaction. Additionally, switch reference interacts with perseverance. For
example, as is anticipated from the literature on switch reference and
perseverance, contexts of null perseverance are often instances of maintained
reference. For the run of infrequent verbs, switch reference, genre, and
4 Rounded from 0.9 log
morphological regularity are excluded because they interact with other factors.
Switch reference and morphological regularity interact with each other, and
switch reference interacts with perseverance and TMA. In illustration, the two-
way interaction of morphological regularity and switch reference means that
irregular verbs co-occur primarily with same reference, or the inverse. As in
frequent verbs, genre interacts with semantic content. By removing switch
reference, morphological regularity, and genre, the remaining factors are
significant in the run of infrequent verbs.
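The kind of co-occurrence described above, in which one factor occurs predominantly with one level of another factor group, can be inspected with a simple cross-tabulation. The sketch below is a hypothetical illustration: the token records, field names, and counts are invented for the example, and it does not reproduce how interactions were identified in the Goldvarb runs.

```python
from collections import Counter

def cross_tabulate(tokens, group_a, group_b):
    """Count tokens for every combination of two factor groups."""
    return Counter((token[group_a], token[group_b]) for token in tokens)

def row_shares(table):
    """Share of each cell within its row (first factor group)."""
    row_totals = Counter()
    for (a, _), n in table.items():
        row_totals[a] += n
    return {(a, b): n / row_totals[a] for (a, b), n in table.items()}

# Hypothetical coded tokens (not corpus data).
coded_tokens = (
    [{"semantic": "external", "genre": "narrative"}] * 18
    + [{"semantic": "external", "genre": "opinion"}] * 2
    + [{"semantic": "mental", "genre": "opinion"}] * 9
    + [{"semantic": "mental", "genre": "narrative"}] * 1
)

table = cross_tabulate(coded_tokens, "semantic", "genre")
shares = row_shares(table)

for cell, share in sorted(shares.items()):
    print(cell, round(share, 2))
```

When most of a row falls into a single cell, as in this toy sample, the two factor groups are poorly distributed against one another, and a larger corpus would be needed to estimate their effects separately.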
Table 1. Internal factors by frequent and infrequent verbs in probability SPP
expressed as Ø.

Corrected mean (frequent / infrequent): .61 / .53
Log likelihood (frequent / infrequent): -916.576 / -376.393

                                 Frequent              Infrequent
Factor group      Factor         N      %     fw       N     %     fw
Morph. reg.       Irregular      869    59.5  .60      --    --    --
                  Regular        591    40.5  .36      --    --    --
                  range                       24                   --
TMA               Unambiguous    1336   91.5  .52      425   72.4  .56
                  Ambiguous      124    8.5   .34      162   27.6  .34
                  range                       18                   22
Perseverance      Null           764    55.8  .56      331   58.0  .58
                  Yo             604    44.2  .42      240   42.0  .39
                  range                       14                   19
Genre             Other          --     --    --       --    --    --
                  Narrative      --     --    --       --    --    --
                  Opinion        --     --    --       --    --    --
                  range                       --                   --
Semantic content  External       --     --    --       452   77.0  .53
                  Stative        --     --    --       81    13.8  .44
                  Mental         --     --    --       54    9.2   .31
                  range                       --                   22
Switch reference  Maintained     --     --    --       --    --    --
                  Switched       --     --    --       --    --    --
                  range                       --                   --
Total N                          1460                  587
Note: p<0.000 (frequent), p=0.007 (infrequent)
In the following section, the relationship between lexical frequency and each of the
independent variables is discussed. Afterwards, Goldvarb output of a run with
lexical frequency treated as an independent variable is presented, and finally the
results of the external factors are detailed. In both frequencies of verbs, null forms
are most likely to be followed by a null form, supporting Travis’ (2005) “yo-yo
effect” for perseverance, holding true to expectations. Switch reference was not
included in either run of frequent/infrequent verbs due to its interaction with other
factors. In frequent and infrequent verbs, unambiguous verb forms (e.g. preterit,
simple present) favor null pronoun expression. Given that the weight of ambiguous
TMA does not change and that TMA patterns similarly to perseverance in the magnitude
of its shift, it is possible there may be frequency effects on the independent variables
similar to those observed in Erker and Guy (2012).
In order to test the effect of frequency on the separate factors, graphs like
those in Erker and Guy (2012) are made for four types of independent variables:
1) those which are significant for both frequent and infrequent verbs (TMA—
Figure E), 2) those which are excluded from both runs (Genre—Figure G), 3)
those which are included only in the run for frequent verbs (Morphological
Regularity—Figure I), and 4) those which are included only in the run for
infrequent verbs (Semantic Content—Figure K). On each graph, Pearson chi
square tests are run in order to compare the change in significance between
infrequent and frequent verbs for each factor in the factor group.
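A comparison of this kind can be sketched as a Pearson chi-square test on a 2×2 table, crossing verb frequency (infrequent, frequent) with SPP realization (expressed, null) within a single factor. The counts below are hypothetical, the function name is an assumption, and no continuity correction is applied; whether the tests reported here used such a correction is not stated. For one degree of freedom the p-value can be obtained from the complementary error function, so only the standard library is needed.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]].

    Returns the statistic and its p-value for one degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [
        row1 * col1 / n, row1 * col2 / n,
        row2 * col1 / n, row2 * col2 / n,
    ]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts for one factor (not corpus data):
# rows = infrequent, frequent; columns = SPP expressed, SPP null.
chi2, p_value = chi_square_2x2(20, 80, 40, 60)
print(round(chi2, 3), p_value < 0.05)
```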
Figure E. TMA by verb frequency and percent SPP present
Here, we can see the maintenance of the factor ranking for TMA between the
infrequent and frequent verbs. The difference between infrequent and frequent
verbs is not significant for either ambiguous (p=0.575) or unambiguous verbs
(p=0.112). In Erker and Guy (2012), there is a similar effect of lexical frequency
on TMA (Figure F).
[Figure E not reproduced. x-axis: verb frequency (infrequent, frequent); y-axis:
percent SPP present, 0% to 100%; series: unambiguous, ambiguous.]
Figure F. From Erker & Guy (2012), p. 544: “TMA: frequent vs. infrequent
forms”
Figure F shows the closest equivalent factor group in Erker and Guy (2012) to the
TMA factor group of the present study. While the differences between infrequent
and frequent verbs for the factors in Erker and Guy (2012) are significant
(p<0.001), the parallel behavior of present and imperfect in their graph is the same
relationship we see for ambiguous and unambiguous in Figure E. Based on this
similarity, it is possible that lexical frequency will prove to be a non-orthogonal
variable. If lexical frequency were not non-orthogonal there would not be any
patterning in SPP use for factors. Additional independent variables are visualized
and tested in subsequent sections in order to confirm the behavior of lexical
frequency. Genre is excluded from both frequent and infrequent runs due to
interaction. Since it is not possible to describe what is happening with this factor
via statistics, it is visualized below in Figure G.
Figure G. Genre by verb frequency and percent SPP present
[Figure G not reproduced. x-axis: verb frequency (infrequent, frequent); y-axis:
percent SPP present, 0% to 100%; series: other, opinion, narrative.]
While the difference across infrequent and frequent verbs for opinion is not
significant (p=0.737), for narrative (p=0.002) and other (p<0.000), it is significant.
As can be seen from Figure G, genre falls under the pattern “activation via
interaction” predicted by Erker and Guy (2012), which Figure H illustrates below
for semantic activity.
Figure H. From Erker & Guy (2012), p. 543: “Activation via interaction: effects
only appear among frequent forms”
As can be seen, the pattern of dispersion in frequent forms is the same, albeit
without as drastic a range in percent SPPs present. The important comparison
between Figures G and H is that in infrequent verbs, there is close to no difference in
SPP expression rate across the factors, but in frequent verbs, the percent SPP
expressed disperses drastically across factors. In Figure H, there are no factors
which are not significant. The lack of significance in the factor “opinion” in the
present study is attributed to having a smaller data set than that of Erker and Guy
(2012) (2,063 tokens versus 4,916 tokens), in addition to dealing with different
populations than those included in Erker and Guy (2012). The effect of social
factors is discussed further in §4.2. In genre, the differences across frequency
become significant, confirming that lexical frequency does in fact behave non-
orthogonally, as established in Erker and Guy (2012). Morphological regularity is
a factor included in the model for frequent verbs, but not infrequent ones, due to
interaction with switch reference. For this reason Figure I demonstrates the
behavior of morphological regularity.
Figure I. Morphological regularity by verb frequency and percent SPP present.
[Figure I not reproduced. x-axis: verb frequency (infrequent, frequent); y-axis:
percent SPP present, 0% to 100%; series: regular, irregular.]
In Figure I, the difference across verb frequency for the factor “irregular” is
significant (p=0.001), while for regular it is near significant (p=0.072). As can be
seen from Figure J below showing morphological regularity in the data of Erker
and Guy (2012), the data for morphological regularity in the present study are nearly
identical in terms of patterning, strengthening the case for non-orthogonality.
Figure J. From Erker & Guy (2012), p. 542: “Morphological regularity: frequent
vs. infrequent forms”
Semantic content of verb is selected as significant in infrequent verbs only.
Semantic content is excluded in frequent verbs because of interaction with genre.
Figure K shows frequency effects for semantic content.
Figure K. Semantic content by verb frequency and percent SPP present.
Interestingly, semantic content resembles the frequency patterning for TMA
(significant across both infrequent and frequent verbs) more so than for
morphological regularity (significant only in frequent verbs). The main difference
in Figure K is that both declines in SPP rate from infrequent to frequent verbs for
external (p<0.000) and mental (p=0.002) are significant, providing additional
evidence for lexical frequency as a nonorthogonal variable. Returning to Figure H,
the graph of semantic content in Erker and Guy (2012), we can see the pattern is
different. In their article, Erker and Guy (2012) do not find certain factors fail to
be affected by frequency. In Figure K, stative semantic content has no significant
change across frequency (p=0.453). According to the Oxford concise dictionary of
linguistics, “stative” describes “a persisting state or situation” (Matthews, 2007).
The very nature of stative verbs is resistant to change. I hypothesize the lack of
difference of SPP expression for stative verbs across infrequent and frequent verbs
is tied in part to the opposition to dynamicity encoded in stativity.
Furthermore, the corpus of the present study is comprised of only first
person singular tokens, whereas the corpus of Erker and Guy (2012) incorporates
all grammatical persons. The combination of using a corpus based on speaker-
centered sociolinguistic interviews, examining only first person singular tokens,
and the nature of stative verbs is hypothesized to account for the unique behavior
of stative verbs in the present study.
The results from the run including lexical frequency as a binomial factor
group are presented below (Table 2). When including frequency as an independent
variable in the model, switch reference, genre, morphological regularity, and
semantic content are removed due to interactions. Based on observations of data
patterning in statistical models not reported, switch reference and genre interact.
According to Travis (2007), these results are expected. She finds that narrative
contexts have significantly higher instances of null pronouns due to continuity of
referent across longer spans compared to other genres of speech. Additionally,
morphological regularity and semantic content interact. Much like the behavior of
stative verbs in Figure K, the fact that the corpus of the present study is comprised
solely of first person singular tokens may spark the interaction of morphological
regularity and semantic content in the combined run. This would mean, for
example, that irregular verbs like digo ‘I say’ are always coded as external
activity. Without these four interacting factors, TMA and perseverance are
significant in the combined run (Table 2).
Table 2. Combined run of lexical frequency & internal factors in probability SPP
expressed as Ø.
Corrected mean: .58
Log likelihood: -1336.959
Total N: 2047
Factor group        Factor         N      %      Factor weight
TMA                 Unambiguous    1761   86.0   .54
                    Ambiguous       286   14.0   .29
                    Range                        25
Perseverance        Null           1095   56.5   .57
                    Yo              844   43.5   .41
                    Range                        16
Lexical frequency   Frequent       1460   71.3   [.51]
                    Infrequent      587   28.7   [.47]
                    Range                        --
Note: Factors not selected as significant in [brackets]. p<0.001
In Table 2, frequency behaves as expected from the perspective of Erker and Guy
(2012), but against predictions from the viewpoint of Bayley et al. (2013). Lexical
frequency is not selected as significant in the logistic regression model. Erker
and Guy (2012) attribute this outcome to the interaction of frequency with other
independent variables, which is seen most clearly in the graphs of each factor
group in the current study. In this sense, lexical frequency in the present study can
be confirmed as non-orthogonal, as it does not directly condition the dependent
variable.
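As a reading aid for Table 2, the following minimal sketch shows how factor
weights are conventionally combined with the corrected mean under the logistic
model that underlies Goldvarb-style analyses: the log-odds of the input
probability and of one factor weight per factor group are summed and converted
back to a probability. The function and variable names are illustrative
assumptions, not part of the analysis reported here, and only the two selected
factor groups (TMA and perseverance) are included.

```python
import math


def logit(p):
    """Convert a probability (0 < p < 1) to log-odds."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def predicted_probability(input_prob, factor_weights):
    """Combine an input probability with one factor weight per factor group.

    Assumes the standard logistic combination: log-odds are additive.
    """
    total = logit(input_prob) + sum(logit(w) for w in factor_weights)
    return inv_logit(total)


# Values transcribed from Table 2 (application value: null SPP).
CORRECTED_MEAN = 0.58
TMA = {"unambiguous": 0.54, "ambiguous": 0.29}
PERSEVERANCE = {"null": 0.57, "yo": 0.41}

# Most favorable context for a null SPP among the selected factor groups.
favoring = predicted_probability(
    CORRECTED_MEAN, [TMA["unambiguous"], PERSEVERANCE["null"]]
)
# Least favorable context for a null SPP among the selected factor groups.
disfavoring = predicted_probability(
    CORRECTED_MEAN, [TMA["ambiguous"], PERSEVERANCE["yo"]]
)

print(f"favoring context:    {favoring:.2f}")
print(f"disfavoring context: {disfavoring:.2f}")
```

Under this model a weight of .50 leaves the input probability unchanged, which
is why weights above .50 are read as favoring null SPPs and weights below .50
as disfavoring them.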
4.2 External factors
The primary motivation for conducting a comparative study with speakers living
in the US and Mexico was to compare the behavior of lexical frequency across
populations. Speakers are incorporated into the combined run of all internal
factors and lexical frequency to test whether they have the same factors
conditioning SPP expression. Separate runs were conducted for US speakers and for
speakers living in Mexico because of interaction between the two groups. A non-contact
hypothesis holds that speakers living in the US will pattern identically to speakers
living in Mexico, given their common cultural heritage. Goldvarb results for these
runs are presented below in Tables 3 and 4.
As in the combined run of all internal factors with lexical frequency
(Table 2), certain factors are excluded due to interactions. These exclusions are
switch reference, genre, morphological regularity, and semantic content, for the
same interactions mentioned in Table 1.
Table 3. US speakers and internal factors in probability SPP expressed as Ø.
Corrected mean: .51
Log likelihood: -793.150
Total N: 1147
Factor group        Factor         N      %      Factor weight
TMA                 Unambiguous    1008   87.9   .51
                    Ambiguous       139   12.1   .40
                    Range                        11
Lexical frequency   Frequent        856   74.6   [.51]
                    Infrequent      291   25.4   [.47]
                    Range                        --
Note: Factors not selected as significant in [brackets]. p=0.012
Table 4. Speakers living in Mexico and internal factors in probability SPP
expressed as Ø.
Corrected mean: .68
Log likelihood: -512.692
Total N: 900
Factor group        Factor         N      %      Factor weight
TMA                 Unambiguous     753   83.7   .57
                    Ambiguous       147   16.3   .20
                    Range                        37
Perseverance        Null            546   65.5   .58
                    Yo              287   34.5   .36
                    Range                        22
Lexical frequency   Frequent        604   67.1   .53
                    Infrequent      296   32.9   .45
                    Range                        8
Note: p=0.048
As is seen in Table 3, perseverance is excluded for US speakers. This is due to
interaction, although the factor with which it is interacting remains unclear. For
US speakers, then, only TMA conditions null SPPs, as lexical frequency was not
selected as significant. This finding upholds the non-orthogonal behavior we have
seen with lexical frequency up to this point. Examining Table 4, we see that all
non-interacting factors were selected as significant, including perseverance and
lexical frequency. For speakers living in Mexico, lexical frequency behaves as an
independent variable, directly conditioning null SPPs, and there is no interaction
affecting perseverance. Based on these results, we must reject the non-contact
hypothesis. The data show that different factors condition the SPP expression of
Mexican speakers living in the US and Mexico. This relationship is supported by
Figure L, which depicts the similar, but separate, patterning of SPP expression
across verb frequencies for speaker country.
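The range values in Tables 3 and 4 are the difference between the highest and
lowest factor weight in a factor group, multiplied by 100. The sketch below
assumes only the weights transcribed from the two tables; the data layout and
function names are illustrative. It recomputes the ranges and orders the
selected factor groups by strength for each speaker group, which makes the
difference between the two systems easy to inspect.

```python
# Factor weights transcribed from Tables 3 and 4 (application value: null SPP).
# "selected" records whether the factor group was selected as significant.
TABLE_3_US = {
    "TMA": {
        "selected": True,
        "weights": {"Unambiguous": 0.51, "Ambiguous": 0.40},
    },
    "Lexical frequency": {
        "selected": False,
        "weights": {"Frequent": 0.51, "Infrequent": 0.47},
    },
}

TABLE_4_MEXICO = {
    "TMA": {
        "selected": True,
        "weights": {"Unambiguous": 0.57, "Ambiguous": 0.20},
    },
    "Perseverance": {
        "selected": True,
        "weights": {"Null": 0.58, "Yo": 0.36},
    },
    "Lexical frequency": {
        "selected": True,
        "weights": {"Frequent": 0.53, "Infrequent": 0.45},
    },
}


def factor_range(weights):
    """Range = (highest weight - lowest weight) * 100, rounded to an integer."""
    return round((max(weights.values()) - min(weights.values())) * 100)


def constraint_hierarchy(table):
    """Return the selected factor groups ordered from largest to smallest range."""
    selected = [
        (name, factor_range(group["weights"]))
        for name, group in table.items()
        if group["selected"]
    ]
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


print("US:    ", constraint_hierarchy(TABLE_3_US))
# US:     [('TMA', 11)]
print("Mexico:", constraint_hierarchy(TABLE_4_MEXICO))
# Mexico: [('TMA', 37), ('Perseverance', 22), ('Lexical frequency', 8)]
```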
Figure L. Mexico vs. US speakers by verb frequency and percent SPP present.
The difference between infrequent and frequent verbs is significant for both US
and Mexico speakers (p=0.021 and p<0.001, respectively). These results confirm
Erker and Guy’s (2012) claim that “The systematic nature of the potentiation
effect of frequency is further confirmed” by the “parallel behavior” the two
nationalities in their corpus demonstrate. However, the authors continue to specify
that “when speakers are separated according to their time of residence in New
York City […] frequency affects pronoun use in a nearly identical way for
subgroups in the sample” (Erker & Guy, 2012, p. 546). As is shown in Figure M
(below), the subgroups of above and below 15 years in the US do not follow the
same pattern as that of Figure L. The difference for speakers who have spent more
than 15 years in the US is significant across verb frequencies (p=0.014), while for
the under 15 years group it is not (p=0.199). For the subgroups, no clear tendency
emerges, but the external factor data from the present study support, at least in
part, the generalizations of Erker and Guy (2012). It is interesting to note that,
of the two subgroups, frequency changes the SPP expression rate most for speakers
who have spent more than 15 years in the US, even though lexical frequency was
selected as a significant independent variable only for those speakers living in
Mexico.
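The comparisons above rest on whether the SPP rate differs between infrequent
and frequent verbs within a group. The text does not state which test produced
the reported p-values; a Pearson chi-square test on a 2x2 table (expressed vs.
null by infrequent vs. frequent) is one common choice and is sketched here. The
token counts are HYPOTHETICAL and serve only to show the arithmetic; they are
not the counts of the present study.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table

        [[a, b],
         [c, d]]

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must contain at least one token")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# HYPOTHETICAL counts for one speaker group (not taken from the study).
expressed_infrequent, null_infrequent = 60, 90
expressed_frequent, null_frequent = 130, 320

statistic, p_value = chi_square_2x2(
    expressed_infrequent, null_infrequent, expressed_frequent, null_frequent
)
print(f"chi-square = {statistic:.3f}, p = {p_value:.4f}")
```

Running the same function once per subgroup (for example, above and below 15
years in the US) would yield one p-value per subgroup, comparable in form to
those reported above.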
Figure M. Above 15 years in the US vs. below 15 years in the US by verb
frequency and percent SPP present.
5. Discussion & Conclusions
The results of this study present support for the theory that lexical frequency
behaves non-orthogonally. For lexical frequency itself, it has been shown that
infrequent and frequent verbs have equivalent rates of expressed SPPs,
contrary to the findings of Erker and Guy (2012) (Figures C and D). Where to set
the frequency threshold remains one of the unanswered questions of studies
dealing with lexical frequency. Since each data set has a unique distribution of
percent SPP expression by log frequency (as in Figures C and D), it appears that
the cutoff for distinguishing frequent from infrequent verbs is relative and
arbitrary.
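To make the threshold question concrete, the following sketch classifies verb
forms as frequent or infrequent according to the share of corpus tokens each
form accounts for. It assumes that the cutoff is a proportion of all
non-lemmatized verb tokens, with 1% as the default following the threshold
mentioned in the abstract; the toy token list and the function name are
hypothetical. Moving the cutoff reclassifies forms, which is the sense in which
the threshold is relative.

```python
from collections import Counter


def classify_by_frequency(verb_forms, threshold=0.01):
    """Label each verb form 'frequent' if its share of tokens meets the threshold.

    verb_forms: iterable of non-lemmatized verb tokens (e.g. 'digo', 'creo').
    threshold:  minimum proportion of all tokens for a form to count as frequent.
    """
    counts = Counter(verb_forms)
    total = sum(counts.values())
    return {
        form: "frequent" if count / total >= threshold else "infrequent"
        for form, count in counts.items()
    }


# HYPOTHETICAL toy corpus of 200 first person singular verb tokens.
tokens = (
    ["digo"] * 120 + ["creo"] * 50 + ["tengo"] * 20 + ["canto"] * 9 + ["bailo"] * 1
)

at_one_percent = classify_by_frequency(tokens, threshold=0.01)
at_five_percent = classify_by_frequency(tokens, threshold=0.05)

print(at_one_percent["canto"])   # frequent   (9/200 = 4.5%)
print(at_five_percent["canto"])  # infrequent (4.5% is below 5%)
```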
Unlike morphological regularity and genre, which show a divergent pattern
when moving from infrequent to frequent verbs, TMA shows a parallel pattern,
revisiting an interesting phenomenon observed by Erker and Guy (2012) for the
same constraint. They note, “To the extent that the tense and mood properties
of individual verb forms vary as a function of the temporal aspects of a discourse,
it may be worth considering switch reference and TMA as qualitatively different
kinds of linguistic factors” (Erker & Guy, 2012, p. 552). They make this claim
because “switch reference and TMA variables are
predictive of SPP use for both frequent and infrequent verbs,” while
simultaneously exhibiting “weaker” interaction (i.e. a parallel rather than
divergent pattern) with frequency than the other variables included in their study
(Erker & Guy, 2012, p. 552). Although the difference in SPP expression across
verb frequencies for the TMA factor group in the present study did not reach
significance, switch reference, shown below in Figure N, does.
Figure N. Switch reference by verb frequency and percent SPP present.
Both factors are significant in switch reference (maintained: p=0.007, switch:
p=0.002).
Figure O. From Erker & Guy (2012), p.544: “Switch reference: frequent vs.
infrequent forms”
As described in Erker and Guy (2012), switch reference patterns with TMA.
Although semantic content exhibits similar parallel patterns, for purposes of
this discussion it is not grouped with switch reference and TMA because one factor
in the factor group (stative) does not reach significance. TMA is included in this
generalization due to consistency in terms of significance across the factor group.
In the present study, then, TMA and switch reference display “weaker” effects of
frequency because they hold a parallel pattern rather than diverging with
frequency.
Thus, Erker and Guy’s (2012) proposal that TMA and switch reference
fulfill higher discourse-level functions is borne out in the systematic fluctuations of
these constraints in the present study. This perspective is supported by Torres-
Cacoullos and Travis (2013), who find that “subject continuity,” or switch
reference, is a constraint that conditions SPP expression in both English and
Spanish, despite these two languages being typologically different. The
combination of these findings reveals that TMA and switch reference are
functionally distinct from other constraints operating on SPP expression and merit
deeper investigation.
As far as the external factors of this study are concerned, it has been
shown that speakers living in the US and Mexico have parallel usage of SPP
expression with regard to frequency. Even though speakers living in the US share
a common national and cultural heritage with those living in Mexico, the former
group has a higher rate of expressed SPPs overall. Despite the differences between
the two groups, no distinct transition is apparent in the SPP rate across frequency
between speakers who have lived in the US for more than 15 years and those who
have lived in the US for less than 15 years. The fact that Erker and Guy’s (2012)
prediction is not borne out in the same way in the present study with Mexican
speakers living in Georgia demonstrates a need for future studies on how much
time it takes for groups of speakers to become differentiated
from one another. Also, Mexican speakers living in the US and Mexico have been
shown to have different systems for SPP expression. For US speakers, TMA
constrains null SPPs. In the system of these speakers, lexical frequency is a non-
orthogonal variable (i.e. interacts with independent variables instead of directly
conditioning the dependent variable). In the SPP system of speakers living in
Mexico, perseverance, TMA, and lexical frequency constrain null SPP use. The
only time lexical frequency is found to behave more like what is found in Bayley
et al. (2013), that is to say as an independent variable, is with speakers living in
Mexico. Based on these findings, a non-contact hypothesis is rejected.
Importantly, more focused inquiry taking into account lexical frequency and
different varieties of Spanish is needed. Despite both groups included in the
present study being Mexican, their linguistic systems are distinct, leading us to
revisit the notion of language contact. In the future, the effect of individual
speaker variation on this finding should be investigated.
Lexical frequency has been shown to be a viable avenue of sociolinguistic
research, as it affects traditionally defined factor groups conditioning variation.
Nevertheless, frequency still merits exploration, as little is understood as to
why it affects certain constraints and not others. Specifically, increased attention
must be given to the “threshold” or “inflection point,” and what consequences
manipulating this mark has upon the significance of constraints.
Abbreviations
SPP: Subject Personal Pronoun
TMA: Tense, Mood, Aspect
Acknowledgements
I would like to thank Manuel Díaz-Campos, the staff of the Indiana University
Linguistics Club Working Papers, Sean McKinnon, and two anonymous reviewers
for their honest feedback on earlier versions of this article. I am also grateful to
Stephanie Dickinson, Huizi Xu, and Zifei Hu at the Indiana Statistical Consulting
Center for their invaluable help in deciding on the optimal analysis. All errors
remain my own.
References
Ávila-Jiménez, B. (1995). A sociolinguistic analysis of a change in progress:
Pronominal overtness in Puerto Rican Spanish. Cornell Working Papers
in Linguistics, 13, 25-47.
Amaral, P. & Schwenter, S. (2005). Contrast and the (non) occurrence of subject
pronouns. In D. Eddington (Ed.), Selected Proceedings of the 7th
Hispanic Linguistics Symposium (pp. 116-127). Somerville, MA:
Cascadilla Proceedings Project.
Bayley, R. (2013). The quantitative paradigm. In J. K. Chambers & N. Schilling-
Estes (Eds.), The handbook of language variation and change (2nd ed.)
(pp. 85-107). West Sussex, UK: Wiley-Blackwell Publishing.
Bayley, R., Cárdenas, N.L., Treviño Schouten, B., & Vélez Salas, C.M. (2012). In
K. Geeslin & M. Díaz-Campos (Eds.), Selected Proceedings of the 14th
Hispanic Linguistics Symposium, (pp. 48-60). Somerville, MA:
Cascadilla Proceedings Project.
Bayley, R., Holland, C., & Ware, K. (2013). Lexical frequency and syntactic
variation: A test of a linguistic hypothesis. University of Pennsylvania
Working Papers in Linguistics, 19(2), 21-30.
Bayley, R., & Pease-Álvarez, L. (1997). Null pronoun variation in Mexican-
descent children’s narrative discourse. Language Variation and Change,
9, 349-371.
Bentivoglio, P. (1987). Los sujetos pronominales de primera persona en el habla
de Caracas. Caracas: Universidad Central de Venezuela.
Butragueño, P. M. & Lastra, Y. (2000). Corpus sociolingüístico de la ciudad de
México (CSCM).
http://lef.colmex.mx/Sociolinguistica/CSCM/Corpus.htm
Bybee, J. (2001). Phonological evidence for exemplar storage of multiword
sequences. Studies in Second Language Acquisition, 24, 215-221.
Bybee, J. (2002). Word frequency and context of use in the lexical diffusion of
phonetically-conditioned sound change. Language Variation and
Change, 14, 261-290.
Cameron, R. (1993). Ambiguous agreement, functional compensation, and non-
specific tú in the Spanish of San Juan, Puerto Rico, and Madrid, Spain.
Language Variation and Change, 5, 305–334.
Cameron, R. (1994). Switch reference, verb class, and priming in a variable
syntax. In K. Beals, J. Denton, R. Knippen, L. Melnar, H. Suzuki, & E.
Zeinfeld (Eds.), Papers from the 30th Regional Meeting of the Chicago
Linguistic Society: Volume 2: The Parasession on Variation in Linguistic
Theory (pp. 27-45). Chicago: Chicago Linguistic Society.
Cameron, R. (1995). The scope and limits of switch reference as a constraint on
pronominal subject expression. Hispanic Linguistics, 607, 1-27.
Enríquez, E.V. (1984). El pronombre personal sujeto en la lengua española
hablada en Madrid. Madrid: Consejo Superior de Investigaciones
Científicas.
Erker, D. & Guy, G. (2012). The role of lexical frequency in syntactic variability:
Variable subject personal pronoun expression in Spanish. Language,
88(3), 526-557.
Flores-Ferrán, N. (2005). La expresión del pronombre personal sujeto en
narrativas orales de puertorriqueños en Nueva York. In L. A. Ortiz & M.
L. López (Eds.), Contactos y contextos lingüísticos: El español en los
Estados Unidos y en contacto con otras lenguas (pp. 119-129). Madrid:
Iberoamericana.
Gili Gaya, S. (1970). VOX curso superior de sintaxis española (9th ed.).
Barcelona: Bibliograf.
Hochberg, J. (1986). Functional compensation for /s/ deletion in Puerto Rican
Spanish. Language, 62(3), 609-621.
Lapidus, N. & Otheguy, R. (2005a). Contact induced change? Overt nonspecific
ellos in Spanish in New York. In L. Sayahi & M. Westmoreland (Eds.),
Selected proceedings of the 2nd Workshop on Spanish Sociolinguistics
(pp. 67-75). Somerville, MA: Cascadilla Press.
Lapidus, N. & Otheguy, R. (2005b). Overt nonspecific ellos in Spanish in New
York. Spanish in Context, 2, 157-74.
Miyajima, A. (2000). Spanish subject pronoun expression and verb semantics.
Sophia Lingüística, 46-47, 73-88.
Morales, A. (1997). La hipótesis funcional y la aparición de sujeto no nominal: El
español de Puerto Rico. Hispania, 80, 153-165.
Otheguy, R. & Zentella, A.C. (2012). Spanish in New York: Language contact,
dialectal leveling, and structural continuity. New York: Oxford
University Press.
Otheguy, R., Zentella, A.C., & Livert, D. (2007). Language and dialect contact in
Spanish in New York: Towards the formation of a speech community.
Language, 83, 770-802.
Sankoff, D., Tagliamonte, S.A., & Smith, E. (2012). Goldvarb Lion: A
multivariate analysis application. Department of Linguistics, University
of Toronto & Department of Mathematics, University of Ottawa.
Silva-Corvalán, C. (1982). Subject expression and placement in Mexican-
American Spanish. In J. Amastae & E. Olivares (Eds.), Spanish in the
United States: Sociolinguistic aspects (pp. 93-120). New York:
Cambridge University Press.
Silva-Corvalán, C. (1994). Language contact and change: Spanish in Los Angeles.
New York: Oxford University Press.
Silva-Corvalán, C. (1997a). Avances en el estudio de la variación sintáctica: La
expresión del sujeto. Cuadernos del Sur, 27, 35-49.
Silva-Corvalán, C. (1997b). Referent tracking in oral Spanish. In J. H. Hill, P. J.
Mistry, & L. Campbell (Eds.), The life of language: Papers in linguistics
in honor of William Bright (pp. 341-354). Berlin: Walter de Gruyter.
Silva-Corvalán, C. (2001). Sociolingüística y pragmática del español.
Washington, DC: Georgetown University Press.
Silva-Corvalán, C. (2003). Otra mirada a la expresión del sujeto como variable
sintáctica. In F. Moreno Fernández, F. Gimeno, J. A. Samper, M. L.
Gutiérrez, M. Vaquero & C. Hernández (Eds.), Lengua, variación y
contexto, Volumen en honor a Humberto López Morales (pp. 849-860).
Madrid: Arco/Libros.
Shin, N. L. & Erker, D. (forthcoming). The emergence of structured variability in
morphosyntax: Childhood acquisition of Spanish subject pronouns. In A.
M. Carvalho, R. Orozco, & N. L. Shin (Eds.), Subject pronoun
expression in Spanish: A cross-dialectal perspective.
Torres-Cacoullos, R. & Travis, C. (2010). Variable yo expression in New
Mexico: English influence? In S. Rivera-Mills and D. Villa Crésap
(Eds.), Spanish of the US southwest: A language in transition (pp. 189-
210). Madrid: Iberoamericana/Vervuert.
Torres-Cacoullos, R. & Travis, C. (2013). Subject pronouns in Spanish and
English: Measuring (dis)similarity. Paper presented at the meeting of
New Ways of Analyzing Variation 42, Carnegie Mellon University,
Pittsburgh, PA.
Toribio, A.J. (2000). Setting parametric limits on dialectal variation in Spanish.
Lingua, 10, 315-341.
Travis, C. (2005). The yo-yo effect: Priming in subject expression in Colombian
Spanish. In R. Gess & E. J. Rubin (Eds.), Selected papers from the 34th
Linguistic Symposium on Romance Languages (LSRL) Salt Lake City,
2004 (pp. 329-349). Amsterdam: Benjamins.
Travis, C. (2007). Genre effects on subject expression in Spanish: Priming in
narrative and conversation. Language Variation and Change, 19, 101-
135.
Wilson, A. (2013). Stories of Roswell, Georgia: A sociolinguistic study of
narrative structure. Athens: University of Georgia Press.
Zipf, G. K. (1935). The psycho-biology of language. Boston: Houghton Mifflin.
Zipf, G. K. (1945). Human behavior and the principle of the least effort. An
introduction to human ecology. New York: Hafner.