Testing the limits of orthogonality: A study of the interaction between lexical frequency and independent variables

Anna E. Wilson Indiana University

Abstract

Spanish is a “pro-drop” language, meaning a subject personal pronoun (SPP)

may be expressed or phonetically null. Due in large part to the well-understood

dialectal differences and universal constraints operating on SPP expression,

Bayley, Cárdenas, Schouten, and Vélez Salas (2012) have highlighted it as a

“classic” or “showcase” variationist variable.

The present study follows the current trend in using SPP expression as a

starting point in order to investigate the effects of lexical frequency on

morphosyntactic independent variables and furthers discussion by examining its

behavior in a comparative study (Bayley, Holland & Ware, 2013; Erker & Guy,

2012). This study compares SPP expression of 6 Mexican immigrants living in

the US with that of 6 speakers living in Mexico City (Butragueño & Lastra,

2000; Wilson, 2013). The data are coded for previous realization (perseverance),

continuity of reference, tense-mood-aspect, genre, lexical frequency (based on

1% of non-lemmatized verbs from Bayley et al. (2013) and Erker and Guy

(2012)), morphological regularity, and verbal semantic content. Lexical

frequency is proven to behave non-orthogonally, supporting the conclusions of

Erker and Guy (2012).

Keywords: Spanish, sociolinguistics, pronoun expression, lexical frequency,

language contact

2 IULC Working Papers

1. Introduction

In Spanish, subject personal pronouns (SPP) may be expressed (i.e. explicit) or

phonetically null (i.e. not expressed), such as in example (1):

(1) Yo canto ‘I sing’

Ø canto ‘I sing’

In variationist sociolinguistics, SPP expression in Spanish is one of the most

widely investigated variables (Bayley & Pease-Álvarez, 1997; Cameron, 1993,

1995; Lapidus & Otheguy 2005a, 2005b; Silva-Corvalán 1997a, 1997b; Torres-

Cacoullos & Travis 2010; among many others). Due in large part to the dialectal

differences and universal constraints operating on SPP expression, Bayley,

Cárdenas, Schouten, and Vélez Salas (2012, pp. 49-50) have highlighted it as a

“classic” or “showcase” variationist variable. The fact that the processes

influencing SPP expression are well documented is precisely why researchers

continue to conduct studies incorporating it. The past several years have witnessed

a transition away from defining the constraints acting upon SPP expression to

using SPPs as a “jumping-off” point for addressing more complex

(morpho)syntactic issues, such as the role of lexical frequency in conditioning

variation (Bayley, Holland & Ware, 2013; Erker & Guy, 2012), how the strength

of constraints affects child first language (L1) acquisition (Shin & Erker,

forthcoming), and how typologically diverse languages share constraints with

regard to subject expression (Torres-Cacoullos & Travis, 2013).

The present study follows the current trend in using SPP expression as a

starting point in order to further investigate the effects of lexical frequency on

morphosyntactic dependent and independent variables (Bayley et al., 2013; Erker

& Guy, 2012). The questions motivating this study are as follows: (1) Does lexical

frequency behave nonorthogonally, as in Erker and Guy (2012), or as an

independent variable, as in Bayley et al. (2013)? (2) Do Mexican speakers who

have embraced a community within the US pattern differently in lexical frequency

than speakers living in Mexico City? In order to answer these questions, the

effects on internal independent variables are explored in two corpora of speakers:

six speakers from a corpus of Mexican immigrants living in Roswell, Georgia,

United States (Wilson, 2013) are compared with six speakers from the PRESEEA

corpus of speakers from Mexico City, Mexico (Butragueño & Lastra, 2000).1

This study is organized as follows: I first discuss the theoretical

standpoint of previous and current variationist sociolinguistic work regarding

subject personal pronoun (SPP) expression in §2 before elaborating on the specific

approach and theoretical underpinnings of this study in §3, including the envelope

of variation and the dependent and independent variables. Afterwards follows a

presentation of results in §4 and a discussion of their implications in §5.

1 An earlier version of this article included an investigation into use of self-referential

‘uno’. This element has been removed from the current version due to inconclusive

findings from low frequency.

2. Theoretical Background

The review of previous and current perspectives in SPP research is divided into

two principal sections: (1) constraints operating on SPP expression and (2) trends

in recent investigations regarding variable SPP expression.

2.1 Main Constraints of SPP Expression

A key component of variationist work which is seen in SPP studies is the goal of

describing highly generalizable patterns which adequately describe trends in

multiple dialects of a given language. Of the main constraints that variationist

researchers agree upon, tense, mood, and aspect (abbreviated TMA); switch reference;

perseverance/priming; semantic class of verb; genre; and place of origin are the

most salient (Barrenechea & Alonso, 1977; Bentivoglio, 1987; Cameron, 1993,

1995; Cameron & Flores-Ferrán, 2004; Enríquez, 1984; Flores-Ferrán, 2002,

2007; Miyajima, 2000; Morales, 1997; Otheguy et al., 2007; Otheguy & Zentella,

2012; Silva-Corvalán, 2003; Travis, 2005, 2007; Toribio, 2000, among others). In

Spanish, verbs are typically morphologically distinct for subject. Verb endings are

encoded with subject information, rendering SPP expression unnecessary for

distinguishing the subject; however, in certain cases, first, third, and formal

second person are indistinguishable. For example, conditional, past imperfect, as

well as subjunctive cases fall into this category (e.g. comería ‘He/she/Usted/I

would eat’, cantaba ‘He/she/Usted/I used to sing’, venga ‘He/she/Usted/I

subjunctive come(s)’). The expectation is that speakers will express more SPPs in

contexts of morphological ambiguity, a perspective which began with the work of

Gili Gaya (1970) and Hochberg (1986). The more contemporary works of

Cameron (1993, 1994) discuss this constraint in-depth, finding that present tense

ser ‘to be,’ an unambiguous verb, actually has the highest rate of expressed SPPs,

followed by two- and three-way ambiguous verbs2. Silva-Corvalán (2003)

proposes a “semantic-pragmatic” hypothesis to account for the variable expression

of SPPs under the TMA constraint. She argues that the more ambiguous imperfect

tense serves to present “background” information in discourse, while the more

unique preterit forms present “foreground” information, giving us the saliency we

see in the TMA constraint (Silva-Corvalán, 2003).

Also according to Silva-Corvalán (2003, p. 2), another constraint, that of

co-referentiality (commonly called switch reference), is perhaps the “most

statistically significant factor in all the completed studies so far” (translation

mine). Several studies which solidified this position for switch reference early on

are Silva-Corvalán (1982), Bentivoglio (1987), and Cameron (1993, 1995).

According to Cameron (1993), same reference is when the “target” NP, expressed

SPP, or null subject refers to the same subject as the “trigger” NP, expressed SPP, or

2 Three-way ambiguous verbs are when first, third, and formal second person are

indistinguishable. For example, conditional, past imperfect, as well as subjunctive cases fall

into this category (e.g. comería ‘He/she/Usted/I would eat’, cantaba ‘He/she/Usted/I used to

sing’, venga ‘He/she/Usted/I subjunctive come(s)’). A two-way ambiguous verb is one such

as canta ‘He/she/Usted sings’ (present indicative), since the ambiguity exists only between

second person (formal) and third person.

null subject. Across the board, these researchers, among many others since these

publications, have found that SPPs are disfavored in environments of same

reference, while the converse is also true—SPPs are favored in contexts of switch

reference. Perhaps one of the most well-known works is Cameron (1995), which

provides a thorough exploration of the process of switch reference. He defines a

“reference chain” as a sequence of two or three NPs which maintain the same

referent in ongoing discourse (Cameron, 1995).

Recently, there has been more work detailing how perseverance affects

SPP expression. While switch reference deals with the referent of the trigger,

perseverance addresses the form of the trigger. When interpreting studies, it is

helpful to consider switch reference as concerning who is being referred to, while

perseverance concerns how the speaker refers to that subject. Cameron’s (1995)

reference chains also pertain to perseverance, as the author uses them to argue that

expressed yo will lead to more expressed yo’s, while null SPPs will lead to more

null expressions, a position also supported by Flores-Ferrán (2005). Travis (2005)

calls this process the “yo-yo effect” and elaborates that its effects are only felt at

low degrees of distance.

Another constraint that researchers agree impacts SPP expression is

semantic class of the verb. Since sociolinguistic interviews are centered on the

perspective of the speaker, higher SPP expression is generally observed with verbs

which index the speaker’s opinion or mental activity (Bentivoglio, 1987;

Enríquez, 1984; Miyajima, 2000; Morales, 1997; among others). In terms of genre

effects on SPP expression, both Travis (2007) and Flores-Ferrán (2005) reveal that

narrative environments are less likely to have expressed SPPs. Related to aspects

of Travis (2007), place of origin has been an observed difference between the SPP

expression of “Mainlander” and Caribbean varieties of Spanish, to use the

terminology of Otheguy, Zentella and Livert (2007) and Otheguy and Zentella

(2012). As a whole, speakers of Caribbean Spanish varieties have more expressed

SPPs, while speakers of Mainland Spanish varieties have more null SPPs

(Cameron, 1993; Otheguy et al., 2007; Otheguy & Zentella, 2012; Toribio, 2000;

among others).

In terms of purely social factors, a high level of variability has been

identified in different Spanish varieties. Nearly every study that includes social

variables attests to a different social constraint ranking and conditioning effect on

SPP expression (Ávila-Jiménez, 1995; Otheguy & Zentella, 2012; Silva-Corvalán,

2001; among others). An interesting complement to this component of SPP

expression is the lack of agreement on the effects of language contact, with

English as well as among Spanish varieties, on SPP expression. There is a general

trend that Spanish in contact with English fails to display more expressed SPPs,

but that view is not corroborated by everyone (Lapidus & Otheguy, 2005a, 2005b;

Flores-Ferrán, 2004).

2.2 Trends in Recent Work

In recent studies of SPP expression, scholars have been moving beyond defining

the constraints acting on SPP use. For example, Erker and Guy (2012) describe

the effects of lexical frequency on a traditional SPP expression study. Instead of

operating like a typical independent variable which either favors or disfavors the

expression of SPPs (or is not significant), they find that lexical frequency is a

“nonorthogonal” variable that enhances the qualities of the independent variables,

but does not directly affect the dependent variable. Instead of directly constraining

the expression of SPPs (the dependent variable), a “nonorthogonal” variable (in

the case of Erker and Guy (2012), lexical frequency) affects the independent

variables (such as TMA, switch reference, etc.), which in turn influence the

dependent variable. The whole process is akin to a cascade of influences, starting

with the nonorthogonal variable and ending with the dependent variable.

Following this train of thought, a disfavoring constraint will disfavor more

strongly in frequent verbs, defined in their study as a verb which comprises 1% or

more of non-lemmatized3 verbs in their corpus. Conversely, a favoring constraint

will favor SPP expression more strongly in frequent verbs. In infrequent verbs,

which is to say those which each make up under 1% of non-lemmatized verbs in their

corpus, no effect on constraints is observed (Erker & Guy, 2012). In response,

Bayley et al. (2013) conducted a similar study where the authors attempt to

replicate the methodology used in Erker and Guy (2012). Bayley et al. (2013)

reveal that lexical frequency does not behave in a nonorthogonal manner; rather,

they hold that the effect lexical frequency has on SPP expression is that of a

traditional independent variable. They find that frequent verbs disfavor SPP

expression, and that infrequent verbs favor expression (Bayley et al., 2013).

Outside of Bybee (2001, 2002), Erker and Guy (2012) and Bayley et al. (2013) are

the only two studies which address lexical frequency from a variationist

perspective. The present study furthers the discussion of frequency by

investigating its behavior in a comparative study. The well-understood base of

SPP expression has been shown through praxis to be a viable starting point for

distinguishing deeper connections in (morpho)syntactic behavior (Bayley et al.,

2013; Erker & Guy, 2012; Shin & Erker, forthcoming; Torres-Cacoullos & Travis,

2013), and it is this mentality which motivates the present study.
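The 1% threshold described above can be made concrete with a short sketch. The Python below is an illustration only, not the procedure used by Erker and Guy (2012) or Bayley et al. (2013). The function name `classify_by_frequency` and the toy token list are hypothetical, and counting a form at exactly 1% as frequent is an assumption based on the wording “1% or more.”

```python
from collections import Counter

def classify_by_frequency(verb_forms, threshold=0.01):
    """Label each non-lemmatized verb form as 'frequent' or 'infrequent'.

    A form is 'frequent' when its tokens make up `threshold` (1% by default)
    or more of all verb tokens in the list; otherwise it is 'infrequent'.
    """
    counts = Counter(verb_forms)
    total = sum(counts.values())
    return {
        form: "frequent" if n / total >= threshold else "infrequent"
        for form, n in counts.items()
    }

# Hypothetical toy corpus of 200 verb tokens. Forms are NOT grouped by lemma,
# so 'tengo' and 'tuve' are counted separately even though both belong to tener.
tokens = ["tengo"] * 120 + ["tuve"] * 78 + ["canto"] * 1 + ["vine"] * 1
labels = classify_by_frequency(tokens)
```

In this toy list, canto and vine each account for 0.5% of the tokens and so fall below the threshold, while tengo and tuve are classed as frequent.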

3. Methodology

3.1 Speakers

The data used in this study come from transcripts of sociolinguistic interviews

with 12 Mexican speakers. The twelve speakers are pulled from two different

corpora, which are utilized in order to compare differences in patterning of lexical

frequency across speakers from the same cultural heritage living in the US and

Mexico. Half of the speakers included in the present study are first generation

immigrants living in the exurb Roswell, Georgia, United States at the time of the

interview. These data were collected by the author for a quantitative study of

narrative structure as part of the civic-academic partnership the Roswell Voices

3 A “lemma” (also called a “lexeme”) is the base form under which the inflected forms of a

verb are grouped. Lemmatization refers to the grouping of inflected forms under their lemma

(e.g. tengo ‘I have,’ tuve ‘I had,’ and tenía ‘I used to have’ under tener).

Project (Wilson, 2013). The interviews follow a guided conversational protocol

covering aspects of local community life and intracommunity relationships as well

as personal history. The interviews average 5,805 words in length. The other half (6)

of the speakers come from the PRESEEA corpus of Mexican speakers from

Mexico City (Butragueño & Lastra, 2000). The topic of discussion in the

PRESEEA interviews ranges from personal history, to education, employment,

and community life. The average length of the interviews by words transcribed is

5,657 words. Having similar topics in the interviews from the two corpora is

essential for being able to compare them, as speech dealing with similar topics

helps to control, to an extent, the types of structures employed by the speakers.

3.2 Independent Variables

For this study, seven internal factors and five external factors are coded, providing

2,063 tokens (354 verb types) within the envelope of variation. Independent

variables were chosen based on significance in previous studies on SPP

expression, and include the following internal factors:

1) Perseverance (null, yo, uno)

2) Switch reference (maintained, switched)

3) TMA (ambiguous, unambiguous)

4) Genre (narrative, opinion, other)

5) Verbal semantic content (mental activity, stative, external activity)

6) Morphological regularity (regular, irregular)

7) Lexical frequency (frequent, infrequent).

The external factors are biological sex (male, female), socioeconomic class

(middle, working), age at time of interview (under/over 40), education (primary,

secondary, university or trade school), and for the Roswell speakers, years in the

US (below/above 15 years). The definition of independent variables for this study

is based primarily on Erker & Guy (2012) and Travis (2007).
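The coding scheme above can be pictured as one record per token, with one level chosen from each internal factor group. The sketch below is a hypothetical representation, not the author's coding sheet: the names `INTERNAL_FACTORS`, `SPP_VALUES`, and `CodedToken` are assumptions introduced for illustration, and the example token is invented.

```python
from dataclasses import dataclass

# Internal factor groups and their levels, as listed in the methodology.
INTERNAL_FACTORS = {
    "perseverance": {"null", "yo", "uno"},
    "switch_reference": {"maintained", "switched"},
    "tma": {"ambiguous", "unambiguous"},
    "genre": {"narrative", "opinion", "other"},
    "semantic_content": {"mental activity", "stative", "external activity"},
    "morphological_regularity": {"regular", "irregular"},
    "lexical_frequency": {"frequent", "infrequent"},
}

# Values of the dependent variable: null subject or overt first person yo.
SPP_VALUES = {"null", "yo"}

@dataclass
class CodedToken:
    verb_form: str  # non-lemmatized finite verb, e.g. "canto"
    spp: str        # dependent variable
    factors: dict   # one level per internal factor group

    def __post_init__(self):
        # Reject tokens coded with a value outside the scheme.
        if self.spp not in SPP_VALUES:
            raise ValueError(f"unknown SPP value: {self.spp!r}")
        for group, levels in INTERNAL_FACTORS.items():
            if self.factors.get(group) not in levels:
                raise ValueError(f"bad or missing level for {group!r}")

# Invented example token for "Ø canto".
example = CodedToken(
    verb_form="canto",
    spp="null",
    factors={
        "perseverance": "null",
        "switch_reference": "maintained",
        "tma": "unambiguous",
        "genre": "narrative",
        "semantic_content": "external activity",
        "morphological_regularity": "regular",
        "lexical_frequency": "infrequent",
    },
)
```

Validating at construction time keeps miscoded tokens out of the data before any statistical run.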

3.3 Dependent Variable

The dependent variable is a binomial one based on the variation between null

subject expression (Ø) and overt first person yo. In defining the envelope of

variation, exclusions include the following: Experiencer-subject constructions

such as me gusta or me hace (due to verbal agreement with object rather than

subject), false starts, and elided verbs, such as no, yo tampoco [voy] ‘No, I’m not

[going] either’ (due to invariability). Only finite verbs with first person reference

were coded, according to the precedent established by Silva-Corvalán (1982). In

contrast to Silva-Corvalán (1994), but in line with Amaral and Schwenter (2005),

contrastive situations were regarded as within the envelope of variation because of

the acceptability of adverbial phrases carrying the contrastive emphasis.

Additionally, quoted speech is included in the envelope of variation, as there

is no theoretical reason it would not exhibit effects of perseverance or other

independent variables.

3.4 Analysis

The anticipated findings in terms of the dependent variable are as follows: Subject

expression is expected to be predominantly null due to the historically low SPP

expression among Mexicans (Bayley & Pease-Álvarez, 1997; Otheguy &

Zentella, 2012; Silva-Corvalán, 1994; among others). The data are analyzed using

Goldvarb X, a statistical package designed for logistic regression analysis

targeting language data. In the present study, results either favor null SPPs with a

factor weight (fw) of 0.5 or above or disfavor null SPPs with a factor weight

below 0.5. Three separate runs were conducted on the data. Like Bayley et al.

(2013), the frequent and infrequent verbs were each run separately and a

combined run with all verbs was conducted after. In the runs with verbs separated

by frequency, only the internal factors are analyzed for a more fine-grained

comparison with the results of Bayley et al. (2013) and Erker and Guy (2012).
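A brief sketch may help with reading factor weights. It assumes the common description of factor weights as centered log-odds mapped onto a 0 to 1 scale by the inverse logit. Goldvarb X itself is not reproduced here, and the function names are hypothetical. The `factor_range` helper assumes that the “range” reported for a factor group is the difference between its highest and lowest weights, multiplied by 100, which is consistent with the values in Table 1.

```python
import math

def factor_weight(log_odds):
    """Inverse logit: map a centered log-odds value to a weight between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def favors_null(fw):
    """Reading used in this study: a weight of 0.5 or above favors null SPPs."""
    return fw >= 0.5

def factor_range(weights):
    """Spread of a factor group: (highest weight - lowest weight) * 100, rounded."""
    return round((max(weights) - min(weights)) * 100)

# Weights reported in Table 1 for morphological regularity among frequent verbs.
morph_reg = {"irregular": 0.60, "regular": 0.36}
spread = factor_range(morph_reg.values())
```

Under these assumptions, a log-odds value of 0 corresponds to a neutral weight of 0.5, and the two morphological regularity weights give the range of 24 shown in Table 1.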

4. Results

4.1 Internal Factors

Out of 2,063 total tokens, frequent verbs account for 1,460, or roughly 70% of the

observations. The remaining 603 tokens are designated as infrequent. Based on

the typically Zipfian distribution of language data, a division of

frequent/infrequent verbs along the lines of 70%-30% is within the normal range

(Zipf, 1935, 1945). Forty-three (43) verbs are coded as frequent, and 311 verbs are

coded as infrequent. Figure A below shows percent SPP expression for each of the

43 frequent verbs.

Figure A. Percent SPP expression by frequent verbs.

[Chart not reproduced. x-axis: frequent verbs (labels recovered: llevo, vi, pongo, voy, dije, conozco, tuve, acuerdo, empecé, estoy, quiero, he ido, vengo, llegué, hablo, decía, soy, vine, quedo, pienso, trabajo); y-axis: percent SPP present, 0%-100%.]

The percent of expressed SPPs ranges from 10% with llevo ‘I carry/wear’ to 90%

with sabía ‘I used to know.’ Of these verbs, 30 are external activity, eight are

stative, and five are mental activity. Overall, frequent verbs have a rate of

expressed SPPs equivalent to that of infrequent verbs (see Figure B).

Figure B. Percent SPP expression by infrequent and frequent verbs.

Unlike the results for Erker and Guy (2012), the present study finds SPP

expression is not significantly different between infrequent and frequent verbs via

a t-test with unequal variances (p=0.981). At first glance, Figure B seems to

confirm Erker and Guy’s (2012, p. 550) statement that a lower threshold for

defining frequency is a number “below which pronoun rates are not well

differentiated.” However, the log distribution of the data does not show a clear

“inflection point,” as does the data for Erker and Guy (2012) (see Figure C).
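The comparison above uses a t-test with unequal variances (Welch's test). The sketch below shows how the test statistic and its approximate degrees of freedom are computed. It is an illustration under assumptions: the unit of analysis (one SPP rate per verb) and the sample values are hypothetical, not the study's data, and the function name `welch_t` is introduced here. Converting the statistic to a p-value requires the t distribution, which is left to a statistics package.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(sample_a) / na
    se2_b = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df

# Hypothetical per-verb SPP rates, for illustration only.
frequent_rates = [0.10, 0.45, 0.90, 0.50]
infrequent_rates = [0.40, 0.55, 0.50]
t_stat, dof = welch_t(frequent_rates, infrequent_rates)
```

A t statistic close to zero, as in a result like p=0.981, indicates that the two group means are nearly identical relative to their variability.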

Figure C. Verbs by percent SPP expression and log frequency.

Figure C shows percent SPP expression by log frequency. Darker circles represent

points of overlapping measurement. As in Figure B, verbs above 1.0 log

(frequent)4 have approximately the same rate of SPP expression as verbs below

1.0 log (infrequent), as indicated by the non-significant regression line. By plotting

log frequency against SPP expression, we can visualize how lexical frequency

does not directly affect rates of pronoun expression. If lexical frequency did alter

SPP expression, Figure C would show a change in pattern across the frequency

cutoff point (for this study 1.0 log), as in Figure D (below) from Erker and Guy

(2012).

Figure D. From Erker & Guy (2012), p. 538: “Log frequency and percent SPPs

present”

In Figure D, we see a difference in distribution of percent pronouns present before

and after the approximate location of the cutoff (1.5-1.72 log) (Erker & Guy,

2012). Table 1 below shows Goldvarb results for two separate runs of frequent

and infrequent verbs. The objective of Table 1 is to identify differences between

the models for frequent and infrequent verbs. Of the six independent variables

(perseverance, switch reference, TMA, genre, morphological regularity, and

semantic content), switch reference, genre, and semantic content are excluded

from the run for frequent verbs due to interaction. Based on observations of the

data’s statistical relationships, I suspect genre and semantic content interact with

each other. The relationship is one where a verb with a given semantic

content, say external activity, occurs predominantly in one classification of genre,

like narrative. Due to the notoriously poor distribution of natural language data, a

much larger corpus than that of the present study would be needed to overcome

this interaction. Additionally, switch reference interacts with perseverance. For

example, as is anticipated from the literature on switch reference and

perseverance, contexts of null perseverance are often instances of maintained

reference. For the run of infrequent verbs, switch reference, genre, and

4 Rounded from 0.9 log

morphological regularity are excluded because they interact with other factors.

Switch reference and morphological regularity interact with each other, and

switch reference interacts with perseverance and TMA. In illustration, the two-

way interaction of morphological regularity and switch reference means that

irregular verbs co-occur primarily with same reference, or the inverse. As in

frequent verbs, genre interacts with semantic content. By removing switch

reference, morphological regularity, and genre, the remaining factors are

significant in the run of infrequent verbs.
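The kind of interaction described above, where the levels of one factor group co-occur mostly with one level of another, can be inspected with a simple cross-tabulation. The sketch below is not Goldvarb's diagnostic and not the author's procedure. The helper names and the toy tokens are hypothetical, and a high “dominant share” is only a rough signal that two factor groups are not independently distributed.

```python
from collections import Counter

def crosstab(tokens, group_a, group_b):
    """Count tokens for every (level of group_a, level of group_b) pair."""
    return Counter((tok[group_a], tok[group_b]) for tok in tokens)

def dominant_share(table, level_a):
    """Share of `level_a` tokens that fall in its most common `group_b` level."""
    cells = [n for (a, _b), n in table.items() if a == level_a]
    return max(cells) / sum(cells)

# Hypothetical toy data: external activity verbs occur mostly in narrative.
toy_tokens = (
    [{"semantic_content": "external activity", "genre": "narrative"}] * 9
    + [{"semantic_content": "external activity", "genre": "opinion"}] * 1
    + [{"semantic_content": "mental activity", "genre": "opinion"}] * 4
    + [{"semantic_content": "mental activity", "genre": "narrative"}] * 1
)
table = crosstab(toy_tokens, "semantic_content", "genre")
share = dominant_share(table, "external activity")
```

In the toy data, nine of the ten external activity tokens are narrative, which is the sort of skew that makes it hard to separate the effects of the two factor groups in a small corpus.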

Table 1. Internal factors by frequent and infrequent verbs in probability SPP expressed as Ø.

Corrected mean (frequent / infrequent): .61 / .53
Log likelihood (frequent / infrequent): -916.576 / -376.393

                                  Frequent             Infrequent
Factor group       Factor         N      %      fw     N      %      fw
Morph. reg.        Irregular      869    59.5   .60    --     --     --
                   Regular        591    40.5   .36    --     --     --
                   range                        24                   --
TMA                Unambiguous    1336   91.5   .52    425    72.4   .56
                   Ambiguous      124    8.5    .34    162    27.6   .34
                   range                        18                   22
Perseverance       Null           764    55.8   .56    331    58.0   .58
                   Yo             604    44.2   .42    240    42.0   .39
                   range                        14                   19
Genre              Other          --     --     --     --     --     --
                   Narrative      --     --     --     --     --     --
                   Opinion        --     --     --     --     --     --
                   range                        --                   --
Semantic content   External       --     --     --     452    77.0   .53
                   Stative        --     --     --     81     13.8   .44
                   Mental         --     --     --     54     9.2    .31
                   range                        --                   22
Switch reference   Maintained     --     --     --     --     --     --
                   Switched       --     --     --     --     --     --
                   range                        --                   --
Total N                           1460                 587

Note: p<0.001 (frequent), p=0.007 (infrequent)

In the following section, the relationship between lexical frequency and each of the

independent variables is discussed. Afterwards, Goldvarb output of a run with

lexical frequency treated as an independent variable is presented, and finally the

results of the external factors are detailed. In both frequency classes of verbs, null forms

are most likely to be followed by a null form, supporting Travis’ (2005) “yo-yo

effect” for perseverance, holding true to expectations. Switch reference was not

included in either run of frequent/infrequent verbs due to its interaction with other

factors. In frequent and infrequent verbs, unambiguous verb forms (e.g. preterit,

simple present) favor null pronoun expression. Given that the weight of ambiguous

TMA does not change and that TMA patterns similarly to perseverance in magnitude

shift, it is possible there may be frequency effects on the independent variables

similar to those observed in Erker and Guy (2012).

In order to test the effect of frequency on the separate factors, graphs like

those in Erker and Guy (2012) are made for four types of independent variables:

1) those which are significant for both frequent and infrequent verbs (TMA—

Figure E), 2) those which are excluded from both runs (Genre—Figure G), 3)

those which are included only in the run for frequent verbs (Morphological

Regularity—Figure I), and 4) those which are included only in the run for

infrequent verbs (Semantic Content—Figure K). On each graph, Pearson chi

square tests are run in order to compare the change in significance between

infrequent and frequent verbs for each factor in the factor group.
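For a single factor, the comparison described above reduces to a 2x2 table: verb frequency (infrequent, frequent) by SPP outcome (present, null). The sketch below computes the Pearson chi-square statistic for such a table, with the p-value for one degree of freedom obtained from the complementary error function. It is an illustration under assumptions: the counts are invented, no continuity correction is applied (the source does not say whether one was used), and every expected cell count is assumed to be above zero.

```python
import math

def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square for a 2x2 table, without continuity correction.

    Rows: infrequent (a, b) and frequent (c, d).
    Columns: SPP present, SPP null.
    Returns the statistic and its p-value for 1 degree of freedom.
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Invented counts for one factor: (present, null) among infrequent verbs,
# then (present, null) among frequent verbs.
chi2, p_value = pearson_chi2_2x2(60, 102, 40, 84)
```

A small p-value for a factor would indicate that its SPP rate differs between infrequent and frequent verbs, which is the pattern the graphs are meant to reveal.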

Figure E. TMA by verb frequency and percent SPP present

[Chart not reproduced. x-axis: verb frequency (infrequent, frequent); y-axis: percent SPP present, 0%-100%; series: unambiguous, ambiguous.]

Here, we can see the maintenance of the factor ranking for TMA between the

infrequent and frequent verbs. The difference between infrequent and frequent

verbs is not significant for either ambiguous (p=0.575) or unambiguous verbs

(p=0.112). In Erker and Guy (2012), there is a similar effect of lexical frequency

on TMA (Figure F).

Figure F. From Erker & Guy (2012), p. 544: “TMA: frequent vs. infrequent

forms”

Figure F shows the closest equivalent factor group in Erker and Guy (2012) to the

TMA factor group of the present study. While the differences between infrequent

and frequent verbs for the factors in Erker and Guy (2012) are significant

(p<0.001), the parallel behavior of present and imperfect in their graph is the same

relationship we see for ambiguous and unambiguous in Figure E. Based on this

similarity, it is possible that lexical frequency will prove to be a non-orthogonal

variable. If lexical frequency were not non-orthogonal, there would not be any

patterning in SPP use for factors. Additional independent variables are visualized

and tested in subsequent sections in order to confirm the behavior of lexical

frequency. Genre is excluded from both frequent and infrequent runs due to

interaction. Since it is not possible to describe what is happening with this factor

via statistics, it is visualized below in Figure G.

Figure G. Genre by verb frequency and percent SPP present

[Chart not reproduced. x-axis: verb frequency (infrequent, frequent); y-axis: percent SPP present, 0%-100%; series: other, opinion, narrative.]

While the difference across infrequent and frequent verbs for opinion is not

significant (p=0.737), for narrative (p=0.002) and other (p<0.001), it is significant.

As can be seen from Figure G, genre falls under the pattern “activation via

interaction” predicted by Erker and Guy (2012), which Figure H illustrates below

for semantic activity.

Figure H. From Erker & Guy (2012), p. 543: “Activation via interaction: effects

only appear among frequent forms”

As can be seen, the pattern of dispersion in frequent forms is the same, albeit

without as drastic a range in percent SPPs present. The important comparison

between Figures G and H is that, in infrequent verbs, there is close to no difference in

SPP expression rate across the factors, but in frequent verbs, the percent SPP

expressed disperses drastically across factors. In Figure H, there are no factors

which are not significant. The lack of significance in the factor “opinion” in the

present study is attributed to having a smaller data set than that of Erker and Guy

(2012) (2,063 tokens versus 4,916 tokens), in addition to dealing with different

populations than those included in Erker and Guy (2012). The effect of social

factors is discussed further in §4.2. In genre, the differences across frequency

become significant, confirming that lexical frequency does in fact behave non-

orthogonally, as established in Erker and Guy (2012). Morphological regularity is

a factor included in the model for frequent verbs, but not infrequent ones, due to

interaction with switch reference. For this reason Figure I demonstrates the

behavior of morphological regularity.

Figure I. Morphological regularity by verb frequency and percent SPP present.

In Figure I, the difference across verb frequency for the factor “irregular” is

significant (p=0.001), while for regular it is near significant (p=0.072). As can be

seen from Figure J below showing morphological regularity in the data of Erker

and Guy (2012), the data for morphological regularity in the present study are near

identical in terms of patterning, strengthening the case for non-orthogonality.

Figure J. From Erker & Guy (2012), p. 542: “Morphological regularity: frequent

vs. infrequent forms”

Semantic content of verb is selected as significant in infrequent verbs only.

Semantic content is excluded in frequent verbs because of interaction with genre.

Figure K shows frequency effects for semantic content.

Figure K. Semantic content by verb frequency and percent SPP present.

Interestingly, semantic content resembles the frequency patterning of TMA
(significant across both infrequent and frequent verbs) more than that of
morphological regularity (significant only in frequent verbs). The main
difference in Figure K is that the declines in SPP rate from infrequent to
frequent verbs are significant for both external (p<0.001) and mental (p=0.002)
verbs, providing additional evidence for lexical frequency as a non-orthogonal
variable. Returning to Figure H, the graph of semantic content in Erker and Guy
(2012), we can see that the pattern is different: Erker and Guy (2012) do not
report any factor that fails to be affected by frequency. In Figure K, by
contrast, stative semantic content shows no significant change across frequency
(p=0.453). According to The Concise Oxford Dictionary of Linguistics, “stative”
describes “a persisting state or situation” (Matthews, 2007). Stative verbs are,
by their very nature, resistant to change. I hypothesize that the lack of
difference in SPP expression for stative verbs across infrequent and frequent
verbs is tied in part to the opposition to dynamicity encoded in stativity.

Furthermore, the corpus of the present study consists only of first-person
singular tokens, whereas the corpus of Erker and Guy (2012) incorporates all
grammatical persons. The combination of a corpus based on speaker-centered
sociolinguistic interviews, the restriction to first-person singular tokens, and
the nature of stative verbs is hypothesized to account for the unique behavior of
stative verbs in the present study.

The results from the run including lexical frequency as a binomial factor

group are presented below (Table 2). When including frequency as an independent

variable in the model, switch reference, genre, morphological regularity, and

semantic content are removed due to interactions. Based on observations of data

patterning in statistical models not reported, switch reference and genre interact.

According to Travis (2007), these results are expected. She finds that narrative

contexts have significantly higher instances of null pronouns due to continuity of

referent across longer spans compared to other genres of speech. Additionally,

morphological regularity and semantic content interact. As with the behavior of
stative verbs in Figure K, the fact that the corpus of the present study consists
solely of first-person singular tokens may give rise to the interaction of
morphological regularity and semantic content in the combined run. This would
mean, for example, that irregular verbs like digo ‘I say’ are always coded as
external activity. Without these four interacting factors, TMA and perseverance
are significant in the combined run (Table 2).

[Figure K plot. Y-axis: Percent SPP Present, 0% to 100%. X-axis: Infrequent, Frequent. Series: External, Stative, Mental.]
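The kind of overlap described for digo can be checked by cross-tabulating two factor groups and looking for combinations that never occur. The following is a minimal sketch of that check, not the procedure used in the present study; the helper names `crosstab` and `empty_cells`, the dictionary keys, and the coded tokens are all hypothetical.

```python
from collections import Counter

def crosstab(tokens, group_a, group_b):
    """Count tokens for every observed combination of two factor groups."""
    return Counter((tok[group_a], tok[group_b]) for tok in tokens)

def empty_cells(tokens, group_a, group_b):
    """List factor combinations that never occur in the data."""
    table = crosstab(tokens, group_a, group_b)
    levels_a = sorted({tok[group_a] for tok in tokens})
    levels_b = sorted({tok[group_b] for tok in tokens})
    return [(a, b) for a in levels_a for b in levels_b if table[(a, b)] == 0]

# Hypothetical coded tokens (NOT the study's data): every irregular
# verb happens to be coded as external activity.
tokens = (
    [{"regularity": "irregular", "semantics": "external"}] * 3
    + [{"regularity": "regular", "semantics": "external"}] * 2
    + [{"regularity": "regular", "semantics": "mental"}] * 2
    + [{"regularity": "regular", "semantics": "stative"}] * 1
)

gaps = empty_cells(tokens, "regularity", "semantics")
print(gaps)
```

When one level of a factor group co-occurs with only one level of another, as in this toy data, the two groups cannot be estimated independently in the same run, which is one reason a group may have to be excluded.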

Table 2. Combined run of lexical frequency & internal factors in probability SPP
expressed as Ø.

Corrected Mean    .58
Log likelihood    -1336.959
Total N           2047

Factor group         Factor         N       %       fw
TMA                  Unambiguous    1761    86.0    .54
                     Ambiguous      286     14.0    .29
                     range                          25
Perseverance         Null           1095    56.5    .57
                     Yo             844     43.5    .41
                     range                          16
Lexical frequency    Frequent       1460    71.3    [.51]
                     Infrequent     587     28.7    [.47]
                     range                          --

Note: Factors not selected as significant in [brackets]. p<0.001
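The range and percentage columns of Table 2 can be recomputed from the factor weights and token counts. The sketch below assumes the usual variationist convention that the range is the difference between the highest and lowest factor weight in a group, multiplied by 100; the helper names `factor_range` and `share` are illustrative and not part of Goldvarb.

```python
def factor_range(weights):
    """Range of a factor group: (highest weight - lowest weight) x 100."""
    return round((max(weights) - min(weights)) * 100)

def share(n, group_total):
    """Percentage of a factor group's tokens, to one decimal place."""
    return round(100 * n / group_total, 1)

# Factor weights and token counts as reported in Table 2.
table2_weights = {
    "TMA": [0.54, 0.29],
    "Perseverance": [0.57, 0.41],
}
ranges = {group: factor_range(fws) for group, fws in table2_weights.items()}
print(ranges)

tma_unambiguous_pct = share(1761, 1761 + 286)
perseverance_null_pct = share(1095, 1095 + 844)
print(tma_unambiguous_pct, perseverance_null_pct)
```

Recomputing the percentages in this way indicates that they are calculated within each factor group: the perseverance tokens sum to 1,939 rather than 2,047, and 1095 out of 1,939 gives the reported 56.5.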

In Table 2, frequency behaves as expected from the perspective of Erker and Guy
(2012), but against predictions from the viewpoint of Bayley et al. (2013).
Lexical frequency is not selected as significant in the logistic regression
model. Erker and Guy (2012) attribute this outcome to the interaction of
frequency with other independent variables, which is seen most clearly in the
graphs of each factor group in the current study. In this sense, lexical
frequency in the present study can be confirmed as non-orthogonal, as it does not
directly condition the dependent variable.

4.2 External factors

The primary motivation for conducting a comparative study with speakers living in
the US and Mexico was to compare the behavior of lexical frequency across
populations. Speakers are incorporated into the combined run of all internal
factors and lexical frequency to test whether the same factors condition their
SPP expression. Separate runs were conducted for US speakers and for speakers
living in Mexico because of interaction between the two groups. A non-contact
hypothesis holds that speakers living in the US will pattern identically to
speakers living in Mexico, given their common cultural heritage. Goldvarb results
for these runs are presented below in Tables 3 and 4.

As in the combined run of all internal factors with lexical frequency

(Table 2), certain factors are excluded due to interactions. These exclusions are


switch reference, genre, morphological regularity, and semantic content, for the

same interactions mentioned in Table 1.

Table 3. US speakers and internal factors in probability SPP expressed as Ø.

Corrected Mean    .51
Log likelihood    -793.150
Total N           1147

Factor group         Factor         N       %       fw
TMA                  Unambiguous    1008    87.9    .51
                     Ambiguous      139     12.1    .40
                     range                          11
Lexical frequency    Frequent       856     74.6    [.51]
                     Infrequent     291     25.4    [.47]
                     range                          --

Note: Factors not selected as significant in [brackets]. p=0.012

Table 4. Speakers living in Mexico and internal factors in probability SPP
expressed as Ø.

Corrected Mean    .68
Log likelihood    -512.692
Total N           900

Factor group         Factor         N       %       fw
TMA                  Unambiguous    753     83.7    .57
                     Ambiguous      147     16.3    .20
                     range                          37
Perseverance         Null           546     65.5    .58
                     Yo             287     34.5    .36
                     range                          22
Lexical frequency    Frequent       604     67.1    .53
                     Infrequent     296     32.9    .45
                     range                          8

Note: p=0.048
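To make the factor weights in Table 4 more concrete, they can be combined into a predicted probability of a null SPP for a given context. The sketch below assumes the logistic model standardly associated with variable rule analysis, in which the odds for a context are the product of the input odds and the odds of each applicable factor weight, and it further assumes that the Corrected Mean plays the role of the input probability. The function names are illustrative, and the resulting figures are not reported in the present study.

```python
def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)

def combine(input_prob, factor_weights):
    """Combine an input probability with factor weights on the odds scale."""
    combined_odds = odds(input_prob)
    for weight in factor_weights:
        combined_odds *= odds(weight)
    return combined_odds / (1 + combined_odds)

# Values from Table 4 (speakers living in Mexico); corrected mean = .68.
# Favoring context: unambiguous TMA, previous null, frequent verb.
p_favoring = combine(0.68, [0.57, 0.58, 0.53])
# Disfavoring context: ambiguous TMA, previous yo, infrequent verb.
p_disfavoring = combine(0.68, [0.20, 0.36, 0.45])

print(f"favoring context:    {p_favoring:.2f}")
print(f"disfavoring context: {p_disfavoring:.2f}")
```

Because a weight of .50 has odds of 1, it leaves the prediction unchanged, which is why weights above .50 are read as favoring the null variant and weights below .50 as disfavoring it.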

As is seen in Table 3, perseverance is excluded for US speakers. This is due to

interaction, although the factor with which it is interacting remains unclear. For

US speakers, then, only TMA conditions null SPPs, as lexical frequency was not

selected as significant. This finding upholds the non-orthogonal behavior we have

seen with lexical frequency up to this point. Examining Table 4, we see that all

non-interacting factors were selected as significant, including perseverance and

lexical frequency. For speakers living in Mexico, lexical frequency behaves as an

independent variable, directly conditioning null SPPs, and there is no interaction

affecting perseverance. Based on these results, we must reject the non-contact

hypothesis. The data show that different factors condition the SPP expression of

Mexican speakers living in the US and Mexico. This relationship is supported by


Figure L, which depicts the similar, but separate, patterning of SPP expression
across verb frequencies by speaker country.

Figure L. Mexico vs. US speakers by verb frequency and percent SPP present.

The difference between infrequent and frequent verbs is significant for both US
and Mexico speakers (p=0.021 and p<0.001, respectively). These results support
Erker and Guy’s (2012) claim that “The systematic nature of the potentiation
effect of frequency is further confirmed” by the “parallel behavior” that the two
nationalities in their corpus demonstrate. However, the authors go on to specify

that “when speakers are separated according to their time of residence in New

York City […] frequency affects pronoun use in a nearly identical way for

subgroups in the sample” (Erker & Guy, 2012, p. 546). As is shown in Figure M

(below), the subgroups of above and below 15 years in the US do not follow the

same pattern as that of Figure L. The difference for speakers who have spent more

than 15 years in the US is significant across verb frequencies (p=0.014), while for

the under 15 years group it is not (p=0.199). For the subgroups, no clear
tendency emerges, but the external factor data from the present study support, at
least in part, the generalizations of Erker and Guy (2012). It is interesting to
note that, of the two groups, frequency changes the SPP expression rate most for
speakers who have spent more than 15 years in the US, even though lexical
frequency was selected as a significant independent variable only for those
speakers living in Mexico.

[Figure L plot. Y-axis: Percent SPP Present, 0% to 100%. X-axis: Infrequent, Frequent. Series: US, Mexico.]


Figure M. Above 15 years in the US vs. below 15 years in the US by verb

frequency and percent SPP present.

5. Discussion & Conclusions

The results of this study support the view that lexical frequency behaves
non-orthogonally. For lexical frequency itself, it has been shown that infrequent
and frequent verbs have comparable rates of expressed SPPs, contrary to the
findings of Erker and Guy (2012) (Figures C and D). Where to set the frequency
threshold remains one of the unanswered questions of studies dealing with lexical
frequency. Since each data set has a unique distribution of percent SPP
expression by log frequency (as in Figures C and D), it appears that the cutoff
for distinguishing frequent from infrequent verbs is relative and arbitrary.
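The sensitivity of the frequent/infrequent split to the chosen cutoff can be illustrated with a small sketch. It assumes that a verb form counts as frequent when it accounts for at least a given share of all verb tokens (1% by default, following the criterion named in the abstract); the function name `classify_by_frequency` and the toy token counts are hypothetical and do not come from the corpus.

```python
from collections import Counter

def classify_by_frequency(verb_forms, threshold=0.01):
    """Label each non-lemmatized verb form as frequent or infrequent.

    A form is 'frequent' if its share of all tokens is at least `threshold`.
    """
    counts = Counter(verb_forms)
    total = len(verb_forms)
    return {
        form: ("frequent" if n / total >= threshold else "infrequent")
        for form, n in counts.items()
    }

# Hypothetical toy corpus of 200 first-person singular tokens.
corpus = (
    ["tengo"] * 120 + ["digo"] * 60 + ["creo"] * 15
    + ["camino"] * 4 + ["cocino"] * 1
)

for cutoff in (0.01, 0.05, 0.10):
    labels = classify_by_frequency(corpus, threshold=cutoff)
    frequent = sorted(f for f, label in labels.items() if label == "frequent")
    print(f"cutoff {cutoff:.0%}: frequent forms = {frequent}")
```

In this toy corpus, raising the cutoff from 1% to 10% moves camino and then creo from the frequent to the infrequent class, so any constraint whose effect differs for those forms would be assigned to a different frequency class depending on the threshold.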

Unlike the divergent patterns of morphological regularity and genre when moving
from infrequent to frequent verbs, the pattern of TMA is a parallel one, which
revisits an interesting phenomenon observed by Erker and Guy (2012) for the same
constraint. They write, “To the extent that the tense and mood properties of
individual verb forms vary as a function of the temporal aspects of a discourse,
it may be worth considering switch reference and TMA as qualitatively different
kinds of linguistic factors” (Erker & Guy, 2012, p. 552). They make this claim
because “switch reference and TMA variables are predictive of SPP use for both
frequent and infrequent verbs,” while simultaneously exhibiting “weaker”
interaction (i.e. a parallel rather than divergent pattern) with frequency than
the other variables included in their study (Erker & Guy, 2012, p. 552). Although
the difference in SPP expression across verb frequencies for the TMA factor group
in the present study did not reach significance, that for switch reference, shown
below in Figure N, does.

[Figure M plot. Y-axis: Percent SPP Present, 0% to 100%. X-axis: Infrequent, Frequent. Series: Under 15 years, Above 15 years.]


Figure N. Switch reference by verb frequency and percent SPP present.

Both factors are significant in switch reference (maintained: p=0.007, switch:

p=0.002).

Figure O. From Erker & Guy (2012), p.544: “Switch reference: frequent vs.

infrequent forms”

As described in Erker and Guy (2012), switch reference patterns with TMA. For
purposes of this discussion, semantic content is not grouped with switch
reference and TMA, although it exhibits similar parallel patterns, because one
factor in the factor group (stative) is not significant. TMA is included in this
generalization due to consistency in terms of significance across the factor
group. In the present study, then, TMA and switch reference display “weaker”
effects of frequency because they hold a parallel pattern rather than diverging
with frequency.

Thus, Erker and Guy’s (2012) proposal that TMA and switch reference fulfill
higher discourse-level functions is borne out in the systematic fluctuations of
these constraints in the present study. This perspective is supported by
Torres-Cacoullos and Travis (2013), who find that “subject continuity,” or switch
reference, is a constraint that conditions SPP expression in both English and
Spanish, despite these two languages being typologically distinct. Taken
together, these findings indicate that TMA and switch reference are functionally
distinct from other constraints operating on SPP expression and merit further
investigation.

[Figure N plot. Y-axis: Percent SPP Present, 0% to 100%. X-axis: Infrequent, Frequent. Series: Maintained, Switch.]

As far as the external factors of this study are concerned, it has been shown
that speakers living in the US and Mexico have parallel usage of SPP expression
with regard to frequency. Even though speakers living in the US share a common
national and cultural heritage with those living in Mexico, the former group has
a higher rate of expressed SPPs overall. Despite the differences between the two
groups, no distinct transition is apparent in the SPP rate across frequency
between speakers who have lived in the US for more than 15 years and those who
have lived there for fewer than 15 years. The fact that Erker and Guy’s (2012)
prediction is not borne out in the same way in the present study of Mexican
speakers living in Georgia demonstrates a need for future studies on how much
time it takes for groups of speakers to become differentiated from one another.
In addition, Mexican speakers living in the US and in Mexico have been shown to
have different systems for SPP expression. For US speakers, TMA constrains null
SPPs. In the system of these speakers, lexical frequency is a non-orthogonal
variable (i.e. it interacts with independent variables instead of directly
conditioning the dependent variable). In the SPP system of speakers living in
Mexico, perseverance, TMA, and lexical frequency constrain null SPP use. The only
group for which lexical frequency behaves as it does in Bayley et al. (2013),
that is to say as an independent variable, is speakers living in Mexico. Based on
these findings, a non-contact hypothesis is rejected. Importantly, more focused
inquiry taking into account lexical frequency and different varieties of Spanish
is needed. Although both groups included in the present study are Mexican, their
linguistic systems are distinct, leading us to revisit the notion of language
contact. In the future, the effect of individual speaker variation on this
finding should be investigated.

Lexical frequency has been shown to be a viable avenue of sociolinguistic
research, as it affects traditionally defined factor groups conditioning
variation. Even so, frequency still merits exploration, as little is understood
about why it affects certain constraints and not others. Specifically, increased
attention must be given to the “threshold” or “inflection point,” and to the
consequences that manipulating this mark has for the significance of constraints.

Abbreviations

SPP: Subject Personal Pronoun

TMA: Tense, Mood, Aspect


Acknowledgements

I would like to thank Manuel Díaz-Campos, the staff of the Indiana University

Linguistics Club Working Papers, Sean McKinnon, and two anonymous reviewers

for their honest feedback on earlier versions of this article. I am also grateful to

Stephanie Dickinson, Huizi Xu, and Zifei Hu at the Indiana Statistical Consulting

Center for their invaluable help in deciding on the optimal analysis. All errors

remain my own.

References

Ávila-Jiménez, B. (1995). A sociolinguistic analysis of a change in progress:

Pronominal overtness in Puerto Rican Spanish. Cornell Working Papers

in Linguistics, 13, 25-47.

Amaral, P. & Schwenter, S. (2005). Contrast and the (non) occurrence of subject

pronouns. In D. Eddington (Ed.), Selected Proceedings of the 7th

Hispanic Linguistics Symposium (pp. 116-127). Somerville, MA:

Cascadilla Proceedings Project.

Bayley, R. (2013). The quantitative paradigm. In J. K. Chambers & N. Schilling-

Estes (Eds.), The handbook of language variation and change (2nd ed.)

(pp. 85-107). West Sussex, UK: Wiley-Blackwell Publishing.

Bayley, R., Cárdenas, N.L., Treviño Schouten, B., & Vélez Salas, C.M. (2012). In

K. Geeslin & M. Díaz-Campos (Eds.), Selected Proceedings of the 14th

Hispanic Linguistics Symposium, (pp. 48-60). Somerville, MA:

Cascadilla Proceedings Project.

Bayley, R., Holland, C., & Ware, K. (2013). Lexical frequency and syntactic

variation: A test of a linguistic hypothesis. University of Pennsylvania

Working Papers in Linguistics, 19(2), 21-30.

Bayley, R., & Pease-Álvarez, L. (1997). Null pronoun variation in Mexican-

descent children’s narrative discourse. Language Variation and Change,

9, 349-371.

Bentivoglio, P. (1987). Los sujetos pronominales de primera persona en el habla

de Caracas. Caracas: Universidad Central de Venezuela.

Butragueño, P. M. & Lastra, Y. (2000). Corpus sociolingüístico de la ciudad de

México (CSCM).

http://lef.colmex.mx/Sociolinguistica/CSCM/Corpus.htm

Bybee, J. (2001). Phonological evidence for exemplar storage of multiword

sequences. Studies in Second Language Acquisition, 24, 215-221.

Bybee, J. (2002). Word frequency and context of use in the lexical diffusion of

phonetically-conditioned sound change. Language Variation and

Change, 14, 261-290.


Cameron, R. (1993). Ambiguous agreement, functional compensation, and non-

specific tú in the Spanish of San Juan, Puerto Rico, and Madrid, Spain.

Language Variation and Change, 5, 305–334.

Cameron, R. (1994). Switch reference, verb class, and priming in a variable

syntax. In K. Beals, J. Denton, R. Knippen, L. Melnar, H. Suzuki, & E.

Zeinfeld (Eds.), Papers from the 30th Regional Meeting of the Chicago

Linguistic Society: Volume 2: The Parasession on Variation in Linguistic

Theory (pp. 27-45). Chicago: Chicago Linguistic Society.

Cameron, R. (1995). The scope and limits of switch reference as a constraint on

pronominal subject expression. Hispanic Linguistics, 6/7, 1-27.

Enríquez, E.V. (1984). El pronombre personal sujeto en la lengua española

hablada en Madrid. Madrid: Consejo Superior de Investigaciones

Científicas.

Erker, D. & Guy, G. (2012). The role of lexical frequency in syntactic variability:

Variable subject personal pronoun expression in Spanish. Language,

88(3), 526-557.

Flores-Ferrán, N. (2005). La expresión del pronombre personal sujeto en

narrativas orales de puertorriqueños en Nueva York. In L. A. Ortiz & M.

L. López (Eds.), Contactos y contextos lingüísticos: El español en los

Estados Unidos y en contacto con otras lenguas (pp. 119-129). Madrid:

Iberoamericana.

Gili Gaya, S. (1970). VOX curso superior de sintaxis española (9th ed.).

Barcelona: Bibliograf.

Hochberg, J. (1986). Functional compensation for /s/ deletion in Puerto Rican

Spanish. Language, 62(3), 609-621.

Lapidus, N. & Otheguy, R. (2005a). Contact induced change? Overt nonspecific

ellos in Spanish in New York. In L. Sayahi & M. Westmoreland (Eds.),

Selected proceedings of the 2nd Workshop on Spanish Sociolinguistics

(pp. 67-75). Somerville, MA: Cascadilla Press.

Lapidus, N. & Otheguy, R. (2005b). Overt nonspecific ellos in Spanish in New

York. Spanish in Context, 2, 157-74.

Miyajima, A. (2000). Spanish subject pronoun expression and verb semantics.

Sophia Lingüística, 46-47, 73-88.

Morales, A. (1997). La hipótesis funcional y la aparición de sujeto no nominal: El

español de Puerto Rico. Hispania, 80, 153-165.

Otheguy, R. & Zentella, A.C. (2012). Spanish in New York: Language contact,

dialectal leveling, and structural continuity. New York: Oxford

University Press.


Otheguy, R., Zentella, A.C., & Livert, D. (2007). Language and dialect contact in

Spanish in New York: Towards the formation of a speech community.

Language, 83, 770-802.

Sankoff, D., Tagliamonte, S.A., & Smith, E. (2012). Goldvarb Lion: A

multivariate analysis application. Department of Linguistics, University

of Toronto & Department of Mathematics, University of Ottawa.

Silva-Corvalán, C. (1982). Subject expression and placement in Mexican-

American Spanish. In J. Amastae & E. Olivares (Eds.), Spanish in the

United States: Sociolinguistic aspects (pp. 93-120). New York:

Cambridge University Press.

Silva-Corvalán, C. (1994). Language contact and change: Spanish in Los Angeles.

New York: Oxford University Press.

Silva-Corvalán, C. (1997a). Avances en el estudio de la variación sintáctica: La

expresión del sujeto. Cuadernos del Sur, 27, 35-49.

Silva-Corvalán, C. (1997b). Referent tracking in oral Spanish. In J. H. Hill, P. J.

Mistry, & L. Campbell (Eds.), The life of language: Papers in linguistics

in honor of William Bright (pp. 341-354). Berlin: Walter de Gruyter.

Silva-Corvalán, C. (2001). Sociolingüística y pragmática del español.

Washington, DC: Georgetown University Press.

Silva-Corvalán, C. (2003). Otra mirada a la expresión del sujeto como variable

sintáctica. In F. Moreno Fernández, F. Gimeno, J. A. Samper, M. L.

Gutiérrez, M. Vaquero & C. Hernández (Eds.), Lengua, variación y

contexto, Volumen en honor a Humberto López Morales (pp. 849-860).

Madrid: Arco/Libros.

Shin, N. L. & Erker, D. (forthcoming). The emergence of structured variability in

morphosyntax: Childhood acquisition of Spanish subject pronouns. In A.

M. Carvalho, R. Orozco, & N. L. Shin (Eds.), Subject pronoun

expression in Spanish: A cross-dialectal perspective.

Torres-Cacoullos, R. & Travis, C. (2010). Variable yo expression in New

Mexico: English influence? In S. Rivera-Mills and D. Villa Crésap

(Eds.), Spanish of the US southwest: A language in transition (pp. 189-

210). Madrid: Iberoamericana/Vervuert.

Torres-Cacoullos, R. & Travis, C. (2013). Subject pronouns in Spanish and

English: Measuring (dis)similarity. Paper presented at the meeting of

New Ways of Analyzing Variation 42, Carnegie Mellon University,

Pittsburgh, PA.

Toribio, A.J. (2000). Setting parametric limits on dialectal variation in Spanish.

Lingua, 110, 315-341.


Travis, C. (2005). The yo-yo effect: Priming in subject expression in Colombian

Spanish. In R. Gess & E. J. Rubin (Eds.), Selected papers from the 34th

Linguistic Symposium on Romance Languages (LSRL) Salt Lake City,

2004 (pp. 329-349). Amsterdam: Benjamins.

Travis, C. (2007). Genre effects on subject expression in Spanish: Priming in

narrative and conversation. Language Variation and Change, 19, 101-

135.

Wilson, A. (2013). Stories of Roswell, Georgia: A sociolinguistic study of

narrative structure. Athens: University of Georgia Press.

Zipf, G. K. (1935). The psycho-biology of language. Boston: Houghton Mifflin.

Zipf, G. K. (1945). Human behavior and the principle of the least effort. An

introduction to human ecology. New York: Hafner