
Describing corpora, comparing corpora

Felix Bildhauer* and Roland Schäfer**

* IDS Mannheim, **Freie Universität Berlin

CL tutorial @ DGfS 41st annual meeting, 5 March 2019, Bremen

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 1 / 133

Schedule

11:00 – 12:30 Session 1: Describing corpora
14:00 – 15:30 Session 2: Comparing corpora
16:00 – 17:30 Session 3: Modelling

Data packages: https://www.webcorpora.org/dgfs19/

R-Studio: https://www.rstudio.com


Why describe / compare corpora?

Choose an appropriate resource for a particular purpose

Is corpus A suitable from a technical viewpoint (quality of post-processing and annotations)?

Using a different corpus, would linguistic findings differ significantly?

Does the performance of tool X vary with different corpora?

How does a tool trained on corpus A perform on data from corpus B?

How broad a claim can be made based on my findings?

Can corpus A be used as a substitute for corpus B (especially if corpus B is unavailable / unaffordable and corpus A is free)?


Different corpora, different findings

Proportion of genitive complements after selected prepositions, by corpus (Bildhauer & Schäfer, in prep.)

[Bar chart omitted: proportions (0.0–1.0) for the prepositions außer, gegenüber, nebst, samt, gemäß, entgegen, mitsamt, mangels, wegen, dank, zuzüglich, trotz, einschließlich, mittels, bezüglich, abzüglich, während, vorbehaltlich, hinsichtlich, angesichts, seitens, anlässlich, betreffs; corpora: decow, dereko]

Domain adaptation

                          CMC                   WEB
Tagger                STTS_IBK  STTS 1.0    STTS_IBK  STTS 1.0
Prange et al. (2016)  87.33     90.28       93.55     94.62
COW                   77.89     81.51       91.82     92.96
TreeTagger            73.21     76.81       91.75     92.89
Stanford              70.60     75.83       89.42     92.52

Figure: EmpiriST shared task results (Beißwenger et al., 2016)


Corpora used in this tutorial

DeReKo / KoGra

≈ 7bn tokens

subset of Deutsches Referenzkorpus (DeReKo), Kupietz et al., 2010

defined and used in the IDS project “Korpusgrammatik”

stratification: Bubenhofer, Konopka & Schneider, 2014

mostly newspaper texts

rich linguistic annotation

not (yet) available to the public

This tutorial’s color code: red


Corpora used in this tutorial (II)

DECOW16B (COW initiative, Schäfer & Bildhauer, 2012)

≈ 20.5bn tokens

web corpus

created with 2016 technology of the COW initiative

breadth-first web crawl

rich linguistic annotation

publicly available at https://www.webcorpora.org/

This tutorial’s color code: green


Corpora used in this tutorial (III)

RanDECOW-1m (COW initiative, Schäfer, 2016)

≈ 1m tokens

web corpus

data collected through random walks

corrected for host bias

rich linguistic annotation

publicly available at https://www.webcorpora.org/

This tutorial’s color code: blue


The structure of the web

[Bow-tie diagram of the web omitted: IN, SCC, OUT, TUBE, and TENDRIL components]

Manning, Raghavan & Schütze, 2009, p. 427

Broder et al. (2000): IN, OUT, SCC, and TENDRIL components are not extremely different in size.

A more detailed report on the sizes: Ángeles Serrano et al. (2007).


Part I: Describing corpora


Kinds of meta data


Dimensions of corpus description

Typical corpus-level meta data:

size (measured in documents, tokens, words, utterances, turns …)
sampling scheme: composition in terms of criterion X, balance …

Typical document-level meta data:

creation date
socio-demographic info about speaker/writer (gender, age, education, dialect, …)
medium / mode (written, spoken)
text topic / content area
addressee, purpose
text type / register / genre
…

These are external criteria which describe the communicative setting (Atkins, Clear & Ostler, 1992; Biber, 1993).



Text type / register / genre

intuitively, register / text type is an important dimension of variation, but …

long tradition of investigation into registers, text types, genres

large body of research within different research traditions

no coherent use / widely accepted definitions of these terms (could be used interchangeably or encode important theoretical distinctions)

with any given taxonomy: operationalization often problematic (more or less prototypical cases)

different taxonomies often not compatible with each other (or mapping of categories is unclear)


Common kinds of meta data in popular corpora (I)

DWDS Kernkorpus (Geyken, 2007), ≈ 100m words, balanced

Stratified by decade and text type:

Novels            Newspaper    Scientific        Other
(“Belletristik”)  (“Zeitung”)  (“Wissenschaft”)  (“Gebrauchsliteratur”)
28.42%            27.36%       23.15%            21.05%

1990s “Wissenschaft”: mostly encyclopedias (Islam, Buddhism, idioms, Nazi, opera, pedagogy)

1990s “Gebrauchsliteratur”: 1953 of 2319 texts from aktuelles Lexikon (Süddeutsche Zeitung), some political communication



Common kinds of meta data in popular corpora (II)

Deutsches Referenzkorpus (DeReKo, Kupietz et al., 2010), ≈ 42bn words

Selection of attributes; not all documents are annotated:

Category     Example value
author       eigene Bearbeitung; Antonia Langsdorf
docTitle     Hamburger Morgenpost, Januar 2006
pubDate      2006-01-04
pubPlace     Hamburg
publisher    Morgenpost Verlag
reference    MOPO, 04.01.2006, S. 19; Jahreshoroskop 2006
textClass    staat-gesellschaft familie-geschlecht
textColumn   Serie
textType     Zeitung: Tageszeitung, Boulevardzeitung
textTypeArt  Serie


Common kinds of meta data in popular corpora (III)

DeReKo, some text types:

Aphorismus, Autobiografie, Bericht, Biografie, Brief, Denkschrift, Erlass, Erzählung, Essay, Fußnote, Forschungsbericht, Gebet, Gebrauchsanweisung, Gedicht, Hörspiel, Interview, Klappentext, Kommentar, Leserbrief, Leitartikel, Märchen, Nachruf, Nachwort, Parteiprogramm, Petition, Presseerklärung, Produktbeschreibung, Protokoll, Ratgeber, Rede, Reportage, Rezension, Roman, Schauspiel, Tagebuch, Werbung

But also: Nachrichten, Abhandlung, Aufsatz, Flugblatt, Handzettel, Vorspann, Bericht, Feuilleton, Tipps Service, Lokales, Essen und Trinken, Beilage, Serie, Bericht/Reportage, Porträt:Stadtporträt, Porträt:Länderporträt, Bericht:Wetterbericht, Bericht:Sportbericht, Bericht:Schicksalsbericht, Bericht:Erfahrungsbericht, Fall:Kurioser Fall, Fall:Spektakulärer Fall



Common kinds of meta data in popular corpora (IV)

Typical Web corpora

mostly (breadth-first) crawled

HTML documents don’t provide much (linguistically relevant) metadata

usually no elaborate, detailed sampling scheme possible

such meta data must usually be generated post hoc

Example: manual annotation of (samples of) COW corpora.

Usually mediocre inter-rater agreement for “high-level” genre/register categories

Alternative: specify relevant dimensions of genre/register and classify documents along several axes.

e. g., Sinclair, 1996; Sharoff, 2006



The COWCat taxonomy

based on Sinclair, 1996; Sharoff, 2006

multiple dimensions, no genres

only categories with a potential influence on grammatical features

Aim, Audience, Authorship, Domain, Mode


COWCat Aim and Audience

Aim has 5 distinct categories:

1 Recommendation (Re)
2 Instruction (Is)
3 Information (If)
4 Discussion (Di)
5 Fiction (Fi)

Audience has 3 distinct categories:

1 General (Ge)
2 Informed or Restricted (In)
3 Professional (Pr)


Experiment for German

4 raters

800 documents

training phase: 100 documents, 2 meetings


Results

Agreement and 𝜅 (Fleiss) for 4 raters:

        Aim (5)  Aud (3)  Auth (5)  Mode (4)
Agree   0.67     0.59     0.53      0.82
𝜅       0.50     0.42     0.63      0.78

Agreement and 𝜅 (Cohen) for best pairwise raters:

        Aim (5)  Aud (3)  Auth (5)  Mode (4)
Agree   0.84     0.86     0.78      0.91
𝜅       0.49     0.53     0.71      0.82
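For reference, Cohen's 𝜅 for a pair of raters can be computed directly from their two label sequences; a minimal sketch in Python (the ten-document Aim ratings below are invented for illustration, not the tutorial's actual data):

```python
from collections import Counter

def cohen_kappa(rater1, rater2):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e the chance agreement from the raters' marginals."""
    n = len(rater1)
    # observed agreement: share of identically labelled items
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # expected chance agreement from each rater's category distribution
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum(c1[cat] * c2[cat] for cat in set(c1) | set(c2)) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# invented Aim ratings of ten documents by two raters:
r1 = ["If", "If", "If", "If", "If", "Di", "Di", "Di", "Di", "Di"]
r2 = ["If", "If", "If", "If", "Di", "Di", "Di", "Di", "Di", "If"]
k = cohen_kappa(r1, r2)  # p_o = 0.8, p_e = 0.5, so kappa ≈ 0.6
```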



Aim by top-level domain


Audience by top-level domain

[Bar chart omitted: Comparison of corpus composition: Audience, by top-level domain (De, Es, Uk), in percent (0–100)]

Text-internal criteria

Corpus level: summary statistics, e. g. word or sentence lengths

Document level: counts of linguistic features

Provide useful information for describing / comparing corpora

Can high-level categories (e. g., register, genre) be inferred from such data? (to be addressed later)

Seminal work: Biber, 1988:

Document-level counts of dozens of linguistic features

Purpose: not as meta data to be included in the corpus, but for studying register variation



COReX

Feature extractor
Over 60 normalised feature counts at the document level

morphological
lexical
syntactic
stylistic markers
some non-linguistic features (TTR, number of sentences, etc.)

Requires pre-processed text, uses information from POS tags, morphological analyses, NE recognition, topological parse, custom word lists

Implemented in Python, open source, extendible to cover more features


COReX features (selection)

Feature    Explanation
cn         common nouns per 1,000 words
adj        adjectives per 1,000 words
cmpnd      compounds per 1,000 common nouns
pper_2nd   2nd person pronouns per 1,000 words
gen        genitives per 1,000 nouns
clitindef  clitic indefinite articles per 1,000 indef. articles
imp        imperatives per 1,000 words
neper      person names per 1,000 words
clausevf   clausal Vf per 1,000 Vf
pass       passive constructions per clause
perf       perfect constructions per clause
vpast      number of past verbs per 1,000 words
cnloan     loan nouns with recognizable suffix (‘-ik’, ‘-um’) per 1,000 nouns
qsvoc      short/contracted forms (‘nich’, ‘schomma’) per 1,000 words
short      non-standard contracted forms (‘gehts’, ‘aufm’) per 1,000 words
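The per-1,000 normalisation behind these features is easy to reproduce; a sketch in Python (COReX itself is a separate tool — the toy tagged sentence and the way the `cn` count is derived here are illustrative assumptions only):

```python
def per_thousand(count, base):
    """Normalised feature value: occurrences per 1,000 base units
    (words, nouns, clauses, ...)."""
    return 1000 * count / base

# toy document as (token, STTS tag) pairs;
# 'cn' in COReX terms: common nouns (NN) per 1,000 words
tagged = [("Die", "ART"), ("Katze", "NN"), ("schläft", "VVFIN"), (".", "$.")]
cn = per_thousand(sum(1 for _, pos in tagged if pos == "NN"), len(tagged))
# one NN among four tokens -> 250.0
```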


Exercise: read COReX meta data into an R data frame

1 Download the archive XYZ from https://www.webcorpora.org/dfgs19/data.tar.gz and unpack it.

2 It contains 3 .tsv files with meta data from
a sample from DECOW16B (70,000 docs)
a sample from DeReKo/KoGra (70,000 docs)
a sample from RanDECOW (70,000 docs)

3 Read the .tsv files into separate data frames, e. g.:

decow.corex <- read.table("/path/to/random_decow_70k.tsv", sep="\t", header=TRUE)



Aggregated data


Data aggregation

Two examples:

1 Factor analysis
2 Register classification


Factor analysis

Purpose: uncover a number of underlying / unobserved variables on the basis of (a larger number of) observed variables

Assumption: underlying/unobserved variables cause the variability in observed variables

FA groups variables together that vary together.

Observed: COReX data

Reduces dimensionality: originally each document is described by a 61-dimensional vector (61 COReX features)

After FA, each document is described by an n-dimensional vector (n factors extracted from the data).

Number of factors is determined beforehand by the researcher.


Biber, 1988

In linguistics, pioneering work by Biber (1988 and subsequent).

7 factors / dimensions of variation in a varied corpus of English. Factors are interpreted linguistically / functionally,

1 by examining feature loadings in factors
2 by examining documents with particularly high or low scores on a factor


Factor analysis: hands-on exercise

library(psych)  # provides the fa function

# make one large data frame:
random_combined_210k <- rbind.data.frame(random_dereko_70k, random_decow_70k, random_randy_70k)

# scale numerical values (column 1 is ID, columns 63 and 64 are CORPUS and TEXTSIGLE):
random_combined_210k[2:62] <- lapply(random_combined_210k[, 2:62], scale)

# run factor analysis, extract 7 factors using the principal factor method and promax rotation:
fa.n7 <- fa(random_combined_210k[, 2:62], nfactors=7, rotate="promax", fm="pa")

# print results; plot results:
print(fa.n7, cut=0.3)
fa.diagram(fa.n7, cut=0.3)


Factor analysis: results

[Factor diagram omitted: the 61 COReX features grouped under factors F1–F7 with their loadings (|loading| ≥ 0.3); e.g. F5 loads slen, vlast, rsimpx, simpx, vflen]

Factor scores

For each document, calculate a “score” on each of the 7 factors.

Biber’s method:

For a given factor, add up a document’s value of each variable that is prominently “loaded” for that factor.

Example:

Features w/ high loadings on factor 5: slen, vlast, rsimpx, simpx, vflen

Document XY: slen = 1.48, vlast = −0.16, rsimpx = 1.24, simpx = 0.27, vflen = −0.23
Then document XY’s factor score on factor 5 is:
1.48 − 0.16 + 1.24 + 0.27 − 0.23 = 2.60
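Biber's additive scoring amounts to summing the document's (standardised) feature values; a sketch in Python using the slide's numbers for document XY:

```python
def factor_score(doc, loaded_features):
    """Biber-style factor score: sum of the document's standardised values
    for the features loading prominently on the factor."""
    return sum(doc[f] for f in loaded_features)

# document XY from the slide; features with high loadings on factor 5:
doc_xy = {"slen": 1.48, "vlast": -0.16, "rsimpx": 1.24, "simpx": 0.27, "vflen": -0.23}
score = factor_score(doc_xy, ["slen", "vlast", "rsimpx", "simpx", "vflen"])
# 1.48 - 0.16 + 1.24 + 0.27 - 0.23 = 2.60
```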



Distribution of factor scores

[Density plot omitted: Factor 1: distribution of factor scores (0–30) by corpus (dereko, decow, decow_ran)]

Distribution of factor scores (II)

[Density plot omitted: Factor 1: distribution of scores by document type (forum: 0/1); x-axis: score on Factor 1 (short, clitindef, itj, emo, qsvoc, pper_2nd, …)]

Interpretation of factors

Some documents (DECOW16B) with high factor scores on factor 1:

http://www.sto-center.de/forum/archive/index.php/t-4555.html

http://www.qdsl-support.de/archive/index.php/t-4165.html

Some dimensions are readily interpretable, others unclear.



Linguistic features for register / text type / genre classification

Factor analysis describes each document along a number of dimensions.

These are not register / text type categories.

What about automatic document classification for high-level categories (e. g., register / text type)?


Automatic document classification

Starting in the early 1960s. Applications:

“newswire filtering”

patent classification

“web page classification”

spam filtering

“authorship attribution”

“author gender detection”

affective rating / sentiment analysis

genre classification

(Sebastiani, 2005)

Especially topic / thematic classification is well established.


Automatic Classification

The document classification problem:

Given a set of classes: determine which class(es) a given object belongs to.

one-of problems (single-label task) vs. any-of problems (multi-label task)

Supervised classification: requires a manually annotated training set.

Fixed set of classes: C = {𝑐1, 𝑐2, 𝑐3, … , 𝑐𝑗}
Document space: X
Description 𝑑 ∈ X of a document
Training set D of labeled documents ⟨𝑑, 𝑐⟩, where ⟨𝑑, 𝑐⟩ ∈ X × C

Classification function 𝛾 maps documents to classes: 𝛾 ∶ X ↦ C

Learning method: Γ(D) = 𝛾
(i. e., Γ takes the training set as input and returns the classification function 𝛾)

(Manning, Raghavan & Schütze, 2009, ch. 13)
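The scheme Γ(D) = 𝛾 can be made concrete with any learner; a minimal sketch in Python using a nearest-centroid classifier (an arbitrary choice for illustration, not the method of the works cited; the two-dimensional feature vectors and class labels below are invented):

```python
def train(D):
    """Learning method Γ: takes a training set D of (d, c) pairs, where each
    document description d is a feature vector, and returns a classification
    function γ — here a simple nearest-centroid classifier."""
    sums, counts = {}, {}
    for d, c in D:
        counts[c] = counts.get(c, 0) + 1
        sums[c] = [s + x for s, x in zip(sums.get(c, [0.0] * len(d)), d)]
    # one centroid (mean vector) per class
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def gamma(d):
        # assign the class whose centroid is closest (squared Euclidean distance)
        return min(centroids,
                   key=lambda c: sum((x - m) ** 2 for x, m in zip(d, centroids[c])))

    return gamma

D = [([0.9, 0.1], "forum"), ([0.8, 0.2], "forum"),
     ([0.1, 0.9], "news"), ([0.2, 0.8], "news")]
gamma = train(D)       # Γ(D) = γ
label = gamma([0.85, 0.15])  # closest centroid is "forum"
```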



Figure: Supervised classification: Training and prediction (from Bird, Klein & Loper, 2009)


Features

Numerous attempts at automatic genre identification: Karlgren & Cutting, 1994; Kessler, Nunberg & Schütze, 1997; Lee & Myaeng, 2002; Freund, Clarke & Toms, 2006; Kanaris & Stamatatos, 2009; Mehler, Sharoff & Santini, 2010; Biber & Egbert, 2015

Using linguistic and / or non-linguistic information (markup etc.)


Automatic register identification using grammatical features

State-of-the-art classification results (Biber & Egbert, 2015):

web documents

32 register categories

42.1% accuracy

.27 precision, .29 recall
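Accuracy and macro-averaged precision/recall of this kind can be recomputed from gold and predicted labels; a sketch in Python (the tiny label lists are invented toy data, not Biber & Egbert's):

```python
from collections import Counter

def evaluate(gold, pred):
    """Accuracy plus macro-averaged precision and recall over all classes."""
    labels = set(gold) | set(pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    # average per-class precision/recall; classes never predicted/present count as 0
    precision = sum(tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
                    for c in labels) / len(labels)
    recall = sum(tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
                 for c in labels) / len(labels)
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return accuracy, precision, recall

gold = ["news", "news", "forum", "forum"]
pred = ["news", "forum", "forum", "forum"]
acc, prec, rec = evaluate(gold, pred)  # 0.75 accuracy, ≈0.83 precision, 0.75 recall
```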



More problems

Problems with automatic annotation for large “general purpose” corpora:

1 Conceptual: typically many grammatical features are used: risk of circularity if the resulting categories are used for controlling register variation

2 Technical: clustering / classification reduces the dimensions of the original input features: loss of information.

We will explore the consequences of such data aggregation later in a modelling exercise.



Thematic description


The purpose(s) of thematic description / classification

large, unstructured document collections (any kind of database, also web corpora)

knowing what the documents are about

highly relevant in IR: find documents about something (genes, diseases, companies, products, …)

(corpus) linguistics: themes of documents can be correlated with other categories (genres, etc.)

distribution of linguistic features varying with topical structure


What a document is about I

often: things denoted by words contained in the document

primary subject referenced in title/headline (if any)

but direct mention of the subject is not necessary

depends on granularity of “subject” or “topic”

“broad” topics, e. g.:

foreign policy, national affairs, sports

more fine-grained topics, e. g.:

Chinese foreign policy, the Middle East conflict, the U.S.’s relationship with Russia

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 46 / 133


Approaches to thematic description of documents /corpora

Keyword analysis

Supervised approaches (document classification)

Unsupervised approaches (e. g. topic modelling)

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 47 / 133

Keyword analysis with tf.idf

Term frequency by inverse document frequency

Goal: Find terms that are characteristic of a document.

A term is characteristic if it is frequent in that document (“term frequency”, 𝑡𝑓)

A term is characteristic if it does not occur in many other documents (“document frequency”, 𝑑𝑓)

Several normalisations/weightings improve results (see Manning, Raghavan & Schütze, 2009, Ch. 6.2)

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 48 / 133

Keyword analysis with tf.idf (II)

tf(t, d): frequency of term 𝑡 in document 𝑑

df(t): document frequency of term 𝑡 (number of documents that contain term 𝑡)

idf(t) = N / df(t)

inverse document frequency: total number of documents divided by the number of documents that contain 𝑡

tf.idf(t, d) = tf(t, d) ⋅ log(N / df(t))

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 49 / 133
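The formulas above translate directly into code. A minimal Python sketch (the toy documents and terms below are invented for illustration):

```python
import math

def tf_idf(term, doc, docs):
    """tf.idf(t, d) = tf(t, d) * log(N / df(t)), as defined above."""
    tf = doc.count(term)                    # raw frequency of t in d
    df = sum(1 for d in docs if term in d)  # number of docs containing t
    if df == 0:
        return 0.0                          # t occurs nowhere
    return tf * math.log(len(docs) / df)

# toy corpus: three tokenised documents
docs = [
    ["goal", "match", "goal", "team"],
    ["market", "shares", "team"],
    ["match", "team", "shares"],
]

# "goal" is frequent in doc 0 and occurs in no other document: high score
# "team" occurs in every document: N/df = 1, so log(1) = 0
goal_score = tf_idf("goal", docs[0], docs)  # 2 * log(3) ≈ 2.197
team_score = tf_idf("team", docs[0], docs)  # 0.0
```

Note how the idf factor zeroes out terms that occur everywhere, regardless of how frequent they are in the document itself.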

The logarithm compresses large 𝑖𝑑𝑓 values more than small ones.

[Figure: 𝑖𝑑𝑓 (left) and log 𝑖𝑑𝑓 (right) as a function of 𝑑𝑓, for N = 1000 documents]

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 50 / 133


Thematic document classification

Manually classify (several hundred) documents according to some annotation schema.

e. g., the COWCat schema (version with 13 categories for topic domains)

870 docs from DECOW14

886 docs from DeReKo 2014-II

[Figure: proportions of the 13 COWCat topic-domain categories in each sample: Science, History, FineArts, Technology, PublicLifeAndInfrastructure, LifeAndLeisure, Philosophy, Business, Individual, Law, Medical, PoliticsSociety, Beliefs]

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 51 / 133


Use as training data for classifier

Figure: Supervised classification: Training and prediction (Bird, Klein & Loper, 2009)

Classifiers: Naive Bayes, Support Vector Machines, Artificial Neural Networks, and many, many more. A good starting point: WEKA (Frank, Hall & Witten, 2016), https://www.cs.waikato.ac.nz/ml/weka/

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 52 / 133
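The training-and-prediction loop in the figure can be sketched with a tiny multinomial Naive Bayes classifier. A pure-Python illustration; the token lists and the COWCat-style labels below are invented training data, not from the study:

```python
import math
from collections import Counter, defaultdict

def train_nb(labelled_docs):
    """Collect the counts a multinomial Naive Bayes model needs."""
    class_counts = Counter(label for _, label in labelled_docs)
    word_counts = defaultdict(Counter)      # per-class word frequencies
    vocab = set()
    for tokens, label in labelled_docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(tokens, model):
    """Return the most probable class (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for c in class_counts:
        lp = math.log(class_counts[c] / n_docs)  # log prior
        total = sum(word_counts[c].values())
        for t in tokens:
            lp += math.log((word_counts[c][t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# hypothetical manually classified documents: (tokens, topic domain)
training = [
    (["match", "goal", "team"], "LifeAndLeisure"),
    (["team", "season", "goal"], "LifeAndLeisure"),
    (["market", "shares", "profit"], "Business"),
    (["company", "profit", "market"], "Business"),
]
model = train_nb(training)
```

In practice one would use a mature toolkit such as WEKA or scikit-learn rather than a hand-rolled classifier; the sketch only shows the shape of the train/predict pipeline.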

Topic modeling

What is probabilistic topic modeling?

Statistical methods for discovering and annotating large archives of documents with thematic information. Analyze the words of the original texts to discover:

the themes that run through the texts

how those themes are connected to each other

how they change over time

Can be applied to massive amounts of data.

Can be adapted to many kinds of data (text documents, genetic data, images, social networks, …)

Does not require any prior annotations or labeling of the documents.

(from Blei, 2012)

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 53 / 133

Latent Dirichlet Allocation (LDA; Blei, Ng & Jordan, 2003)

Intuition: Documents exhibit multiple topics.

Documents may blend topics in different proportions.

e. g., a document may be primarily about sports, plus business plus white-collar crime

knowing how a document blends various topics helps situate it in a collection of documents

LDA: a statistical model of document collections designed to capture this intuition

LDA: all the documents in the collection share the same set of topics, but each document exhibits those topics in different proportions

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 54 / 133

LDA: History

LSA / LSI: Latent Semantic Analysis / Indexing (Deerwester et al., 1990)

↓

pLSI: probabilistic LSI (Hofmann, 1999)

↓

LDA: Latent Dirichlet Allocation (Blei, Ng & Jordan, 2003)

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 55 / 133

LDA (II)

LDA assumes that documents arise from a generative process.

Topic: a distribution over a fixed vocabulary

e. g., a sports topic has words about sports with high probability

e. g., a business topic has words about business with high probability

Generating a document is a two-stage process:

1 Randomly choose a distribution T over topics.
2 For each word in the document:
  2a Randomly choose a topic from T.
  2b Randomly choose a word from the corresponding distribution over the vocabulary.

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 56 / 133
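The two-stage generative process can be sketched in a few lines of Python. A simplification for illustration: the per-document topic distribution is fixed by hand rather than drawn from a Dirichlet, and the two toy topics below are invented:

```python
import random

def generate_document(topic_dist, topics, n_words, seed=0):
    """Two-stage LDA-style generation: for each word, (2a) draw a topic
    from the document's topic distribution, then (2b) draw a word from
    that topic's distribution over the vocabulary."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(topic_dist), weights=topic_dist.values())[0]
        word_dist = topics[topic]           # distribution over the vocabulary
        words.append(rng.choices(list(word_dist), weights=word_dist.values())[0])
    return words

# two toy topics, each a distribution over a fixed vocabulary
topics = {
    "sports":   {"goal": 0.50, "team": 0.40, "profit": 0.10},
    "business": {"goal": 0.05, "market": 0.45, "profit": 0.50},
}

# this document blends the topics in proportion 70% / 30% (step 1)
doc = generate_document({"sports": 0.7, "business": 0.3}, topics, n_words=50)
```

Fitting LDA then amounts to inverting this process: given only documents like `doc`, infer the topics and the per-document proportions.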


LDA (III)

1 Randomly choose a distribution T over topics.
2 For each word in the document:
  2a Randomly choose a topic from T.
  2b Randomly choose a word from the corresponding distribution over the vocabulary.

Each document exhibits the topics in different proportions (step 1).

Each word in each document is drawn from one of the topics (step 2b),

where the selected topic is chosen from the per-document distribution over topics (step 2a).

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 57 / 133

LDA (IV)

Topic structure: the topics, the per-document topic distributions, and the per-document per-word topic assignments

documents themselves are observed

topic structure is hidden structure

computational problem: use the observed documents to infer the hidden topic structure

or: What is the hidden structure that likely generated the observed collection?

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 58 / 133

Generative probabilistic modeling

Generative probabilistic modeling in general:

Assumption: data arises from a generative process that includes hidden variables

Generative process: defines a joint probability distribution over observed and hidden random variables

Data analysis: use the joint distribution to compute the conditional distribution of the hidden variables given the observed variables

Conditional distribution: posterior distribution

LDA:

observed variables: the words of the documents

the hidden variables: topic structure

computational problem: computing the posterior distribution (the conditional distribution of the hidden variables given the documents)

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 59 / 133


The posterior distribution

The true posterior distribution is intractable to compute.

it involves an exponentially large number of possible instantiations of the hidden topic structure

instead, topic modeling algorithms approximate the true posterior distribution

1 sampling-based algorithms (usually Gibbs sampling)
2 variational algorithms (a deterministic alternative)

Which approach is better is a matter of debate.

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 60 / 133

Example: LDA topic, weighted terms

[Figure series, slides 61–80: example LDA topics, each visualised as a list of weighted terms; the images are not reproduced in this transcript]

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 61 / 133


Part II: Comparing corpora

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 81 / 133

Corpus comparison

Questions researchers might ask:

Are corpus X and corpus Y similar to each other?

How similar are they to each other?

Are they significantly different from each other?

Given corpus X and corpus Y, where is corpus Z located between these?

Big question: Similarity with respect to which criterion?

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 82 / 133


Corpus similarity

What does it mean for two corpora to be “similar” to each other?

no such thing as a single measure of corpus similarity

no simple, single answer to this question

Corpora can be similar by one criterion and different by another, e. g.:

corpus A: docs about sports and politics, from forum discussions

corpus B: docs about sports and politics, from newspaper articles

Many different approaches, e. g.:

compare corpus composition (text types, topics, authorship etc.)

compare distribution of linguistic entities (words, other linguistic features)

compare corpora with respect to a specific task (collocation extraction, word similarity tasks)

(See Kilgarriff, 2001 for discussion.)

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 83 / 133



Comparing keywords

Idea: compare lists of keywords extracted from two corpora

Keywords: words that are characteristic of a corpus

Key-ness: relative notion

Compute a statistic from the frequency of each word in corpus A and the frequency of the same word in corpus B

Sometimes, calculate p-values and dispersion

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 85 / 133

Keyword extraction statistics

Ratio of relative frequencies (e. g., Edmundson & Wyllys, 1961; Damerau, 1993; Kilgarriff, 2012)

Yule’s (1944) difference coefficient (e. g., Hofland & Johansson, 1982)

𝜒2 (e. g., Scott, 1997)

-2 Log-Likelihood (e. g., Scott, 2001)

Mann-Whitney U (Kilgarriff, 2001)

tf.idf (Spärck Jones, 1972)

Different statistics may result in quite different keyword lists.

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 86 / 133


Keyword extraction with 𝜒2

Compute a 𝜒2 value from a 2×2 table for each word type in the joined corpus (Corpus A ∪ Corpus B)

Multiply the value by the sign of the difference between observed and expected frequency in the first table cell (positive if a word is over-represented in Corpus A, negative otherwise):

𝜒2 × 𝑠𝑔𝑛(𝑂1,1 − 𝐸1,1)

Sort the list on the signed 𝜒2 value

Illustration:

Keyword extraction with 𝜒2 from two samples of DECOW12 (1.2 G tokens each).

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 87 / 133
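The signed-𝜒2 procedure can be sketched as follows (pure Python, no continuity correction; the frequency lists and the two 1,000-token corpus sizes are invented, not the DECOW12 samples):

```python
def signed_chi2(freq_a, freq_b, size_a, size_b):
    """Signed chi-squared keyword score for one word: the chi-squared
    statistic of the 2x2 table (word vs. not-word, Corpus A vs. B),
    multiplied by sgn(O_11 - E_11)."""
    a, b = freq_a, freq_b                   # O_11, O_12
    c, d = size_a - freq_a, size_b - freq_b
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    expected_a = (a + b) * (a + c) / n      # E_11
    return chi2 if a >= expected_a else -chi2

# toy frequencies from two corpora of 1,000 tokens each
freqs_a = {"goal": 50, "profit": 5, "the": 60}
freqs_b = {"goal": 10, "profit": 40, "the": 58}

# sort all word types on the signed chi-squared score
keywords = sorted(
    set(freqs_a) | set(freqs_b),
    key=lambda w: signed_chi2(freqs_a.get(w, 0), freqs_b.get(w, 0), 1000, 1000),
    reverse=True,
)
# "goal" tops the list (over-represented in A); "profit" ends up at the bottom
```

Words over-represented in Corpus A rise to the top of the list, those over-represented in Corpus B sink to the bottom, and evenly distributed words cluster around zero.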


Example: keyword extraction with 𝜒2

[Table: top 30 keywords of Corpus A by signed 𝜒2 score (word, gloss, f Corpus A, f Corpus B; the second frequency column is mislabelled “f Corpus A” in the original). Top-ranked items mix topical words such as Selbstbewusstsein ‘self-consciousness’, Stärken ‘strengths’, Niederösterreich ‘Lower Austria’, Steiermark ‘Styria’, Selbstbewusstseinstraining ‘self-consciousness training’ and Coaching ‘coaching’ with high-frequency function words (der, des, und, …)]

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 88 / 133

Example: keyword extraction with 𝜒2

[Table: bottom of the same list, i.e. words over-represented in Corpus B (negative signed 𝜒2 scores): informal, interaction-oriented items such as ich ‘I’, du ‘you’, hab ‘have’, ja ‘yes’, mal ‘once’, emoticons (:zustimm:, :-D) and punctuation (!, ?, …)]

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 89 / 133

Keyword extraction with the ratio of relative frequencies

Ratio of relative frequencies (RRF): early research on automatic document summarization and indexing (Edmundson & Wyllys, 1961)

For every word occurring in Corpus A or Corpus B: divide the word’s relative frequency in Corpus A by its relative frequency in Corpus B (after adding a smoothing constant)

Sort on the resulting score (between 0 and ∞).

Kilgarriff, 2012: obtain keywords from different frequency bands by varying the smoothing constant.

Illustration:

Keyword extraction with RRF from the same two samples of DECOW12 (1.2 G tokens each).

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 90 / 133
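A minimal RRF sketch, including the effect of the smoothing constant (the frequencies and corpus sizes below are invented; relative frequencies are computed per million tokens, which is one common convention, not necessarily the one used here):

```python
def rrf(freq_a, freq_b, size_a, size_b, smooth=1.0):
    """Ratio of relative frequencies with a smoothing constant added
    to both sides (cf. Kilgarriff 2012)."""
    fpm_a = 1_000_000 * freq_a / size_a     # frequency per million tokens
    fpm_b = 1_000_000 * freq_b / size_b
    return (fpm_a + smooth) / (fpm_b + smooth)

size = 1_000_000  # two hypothetical corpora of 1M tokens each

# a rare word occurring only in Corpus A vs. a frequent word that is
# twice as frequent in Corpus A as in Corpus B:
rare = rrf(20, 0, size, size)                        # (20+1)/(0+1) = 21.0
frequent = rrf(5000, 2500, size, size)               # ≈ 2.0

# a larger smoothing constant shifts the list towards the higher band:
rare_smoothed = rrf(20, 0, size, size, smooth=100)          # (120)/(100) = 1.2
frequent_smoothed = rrf(5000, 2500, size, size, smooth=100)  # ≈ 1.96
```

With a small constant the list is dominated by rare, corpus-exclusive items (as in the tables that follow); a large constant lets frequent words outrank them.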


Example: keyword extraction with RRF

[Table: top 30 keywords of Corpus A by RRF score, dominated by very rare words occurring (almost) only in Corpus A: brand terms such as IntSel®-Selbstbewusstseinstraining ‘IntSel® self-consciousness training’, proper names (moviac, Schwehm), URLs (www.theaterstuebchen.de), emoticons (:futsch:) and hyphenation artefacts (Titelvertei-digung ‘title defense’)]

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 91 / 133

Example: keyword extraction with RRF

[Table: bottom of the RRF list, i.e. words occurring (almost) only in Corpus B: forum emoticons (:five:, :aargh:, :menno:, :traeum:), Twitter handles and hashtags (@ProSieben, #HighSchoolMusical3) and site names (kostenlose-urteile.de, berlin.business-on.de)]

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 92 / 133

Keyword extraction: properties of 𝜒2 versus RRF

            Corpus 1              Corpus 2
f absolute  f relative  f absolute  f relative  Ratio of rel. freqs.  𝜒2
 50         .05         100         .1          0.5                   17.3
100         .1          200         .2          0.5                   38.4
200         .2          400         .4          0.5                   94.3

Figure: 𝜒2 vs. ratio of relative frequencies in keyword extraction, illustrated by two corpora of 1,000 tokens each.

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 93 / 133

Compare association of documents with induced topics

Figure: Log ratio of relative frequencies: proportion of documents with topic X among their 3 most strongly associated topics (DeReKo vs. DECOW; scale −1.0 to 0.5)

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 94 / 133

Measuring thematic corpus balance

Ciaramita & Baroni (2006): ensure that no topic is heavily over-represented in a corpus. Method:

Create a number of corpora that are deliberately biased towards some topic.

Calculate the mean distance of each one of these corpora to all other corpora.

Distance: based on word frequencies, measured as relative entropy (or Kullback–Leibler distance; Kullback & Leibler, 1951).

Measure the distance of a target corpus to all other corpora.

Expectation: if the target corpus is unbiased, it is “in between” all the biased corpora:

its mean distance should be smaller than the mean distance of the biased corpora.

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 95 / 133
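The distance measure used here, relative entropy between word frequency lists, can be sketched as follows (pure Python; the counts are invented, and add-half smoothing is an assumption made so that no probability is zero):

```python
import math

def kl_divergence(p_counts, q_counts, smooth=0.5):
    """Relative entropy D(P || Q) between two word frequency lists,
    with add-half smoothing over the joint vocabulary. Note that
    this measure is asymmetric: D(P||Q) != D(Q||P) in general."""
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + smooth * len(vocab)
    q_total = sum(q_counts.values()) + smooth * len(vocab)
    d = 0.0
    for w in vocab:
        p = (p_counts.get(w, 0) + smooth) / p_total
        q = (q_counts.get(w, 0) + smooth) / q_total
        d += p * math.log(p / q)
    return d

# toy word frequency lists
target = {"the": 100, "goal": 20, "market": 20}
sports_biased = {"the": 100, "goal": 40, "market": 2}
balanced = {"the": 100, "goal": 22, "market": 18}
# the balanced list is closer to the target than the biased one
```

In the Ciaramita & Baroni setup, such pairwise distances are averaged per corpus; a balanced target corpus should show a smaller mean distance than any deliberately biased corpus.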

Comparing word frequency lists

Top 20 common nouns in FRCOW2011 and frWaC (Baroni et al., 2009):

Rank  FRCOW2011 (few seed URLs)    frWaC (many seed URLs)
1     année ‘year’                 site
2     travail ‘work’               an
3     temps ‘time’                 travail
4     an ‘year’                    jour
5     jour ‘day’                   année
6     pays ‘country’               service
7     monde ‘world’                temps
8     vie ‘life’                   article
9     personne ‘person’            personne
10    homme ‘man’                  projet
11    service                      information
12    cas ‘case’                   entreprise ‘company’
13    droit ‘right’                recherche ‘(re-)search’
14    effet ‘effect’               vie
15    projet ‘project’             droit
16    question                     page
17    enfant ‘child’               formation ‘education’
18    fois ‘time (occasion)’       commentaire ‘comment’
19    place                        cas
20    site                         fois

Fairly good overlap

But the method is impressionistic: no “measure” of the difference

Are these corpora “significantly” (dis-)similar to each other?

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 96 / 133

Testing for differences: 𝜒2-test

𝐻0: The corpora are samples from the same population (the frequency of a word and the variable “corpus” are not correlated)

The 𝜒2-test compares the frequencies of a word type in the two corpora

          Corpus 1   Corpus 2
word X    freq(X)    freq(X)
¬ word X  freq(¬X)   freq(¬X)

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 97 / 133
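The per-word 2×2 test can be sketched as follows (pure Python, no continuity correction, df = 1; the frequencies and corpus sizes are invented, not those of the Spanish corpora below):

```python
import math

def chi2_word_test(freq1, freq2, size1, size2):
    """2x2 chi-squared test for one word type across two corpora.
    Returns (chi2, p); for 1 df, the chi-squared survival function
    equals erfc(sqrt(x / 2)), so no external stats library is needed."""
    a, b = freq1, freq2
    c, d = size1 - freq1, size2 - freq2
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# 150 vs. 100 occurrences in two 100k-token corpora
chi2_small, p_small = chi2_word_test(150, 100, 100_000, 100_000)   # ≈ 10.0

# same relative difference, but both corpora doubled in size:
# the chi-squared value roughly doubles as well
chi2_large, p_large = chi2_word_test(300, 200, 200_000, 200_000)   # ≈ 20.0
```

The second call already hints at the problem discussed on the next slide: the statistic grows with sample size, so at web-corpus scale even trivial differences become “significant”.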

Corpus comparison using the 𝜒2-test

Word   f Corpus 1   f Corpus 2   𝜒2      p
de     6781719      6802262      32.99   <.001
,      5627749      5633555       3.12    .077
la     3613946      3614049       0.001   .975
.      3574395      3579032       3.08    .079
que    2963992      2956662       9.36   <.010
y      2642241      2653365      23.88   <.001
en     2562028      2564809       1.53    .217
el     2450353      2446328       3.40    .065
a      1885112      1882813       1.44    .230
los    1597103      1603537      13.09   <.001
del    1173860      1172623       0.67    .415
se     1139311      1143202       6.68   <.010
las    1054729      1054924       0.02    .896
un     1001556      1000106       1.07    .302

Figure: 𝜒2 values for the 14 most frequent word types in two Spanish web corpora (110m tokens each)

Overall score?

𝜒2 statistic for the 14 × 2 table (13 df): 𝜒2 = 76.87, 𝑝 < .001

But: these are in fact random subcorpora from ESCOW2012.

Felix Bildhauer & Roland Schäfer Describing corpora, comparing corpora DGfS-CL 98 / 133


Why the 𝜒2-test is not suitable

𝜒2 statistic grows as a function of sample size

Huge sample size: minor differences lead to large 𝜒2 values

Form   Corpus 3   Corpus 4     𝜒2      p
de     13579307   13563617    9.65   <.010
,      11264561   11274225    4.38    .036
la      7227386    7236002    5.31    .021
.       7157262    7150075    3.72    .054
que     5924861    5932089    4.53    .033
y       5303604    5292011   12.98   <.001
en      5117964    5123526    3.10    .078
el      4885947    4900224   21.31   <.001
a       3766747    3773854    6.82   <.010
los     3203514    3193313   16.49   <.001
del     2340110    2338707    0.42    .515
se      2277149    2284887   13.27   <.001
las     2105694    2109117    2.81    .094
un      1998736    2002337    3.27    .070

Figure: Two random subcorpora from ESCOW2012, 220m tokens each

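The sample-size dependence is easy to demonstrate: keeping the relative frequencies fixed and multiplying all counts by a constant multiplies the 𝜒2 statistic by the same constant. The counts below are invented for illustration:

```python
# Chi-squared grows linearly with sample size when proportions stay constant.

def chi2_2x2(f1, f2, n1, n2):
    """Pearson chi-squared for the 2x2 table [[f1, n1 - f1], [f2, n2 - f2]]."""
    total = n1 + n2
    rest = (n1 - f1) + (n2 - f2)
    chi2 = 0.0
    for obs, row, col in [
        (f1, n1, f1 + f2),
        (n1 - f1, n1, rest),
        (f2, n2, f1 + f2),
        (n2 - f2, n2, rest),
    ]:
        exp = row * col / total
        chi2 += (obs - exp) ** 2 / exp
    return chi2

small = chi2_2x2(510, 490, 10_000, 10_000)                # tiny difference
large = chi2_2x2(51_000, 49_000, 1_000_000, 1_000_000)    # same proportions, 100x data
print(small, large)  # large = 100 * small
```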

Using test statistics but no p-values / hypothesis tests

Hypothesis testing: 𝜒2 (and other tests) fail to capture intuitions about similarity

But: test statistics can be used for ranking a set of candidates, without any hypothesis testing.

Kilgarriff, 2001: do away with hypothesis testing in comparing corpora

Use test statistics as a measure of relative similarity between corpora

No measure of similarity in absolute terms

Which test statistic should be used?


“Known similarity corpora” (Kilgarriff, 2001)

Which test statistic best captures a relevant notion of “similarity”?

Find out experimentally. Requires a number of corpora with known similarity properties.

Kilgarriff, 2001: a set of different corpora, each with a different composition in terms of text types, e. g.:

20% sports, 80% politics
40% sports, 60% politics

Gold standard for similarity ranking: “40_sports_60_politics is more similar to 100_sports than 20_sports_80_politics”, etc.

Compute the test statistic for the 𝑛 most frequent tokens, sum up, and evaluate how many gold-standard rankings are predicted correctly.

Result: 𝜒2 outperforms all others (including the Spearman rank correlation coefficient & variants of cross-entropy measures)

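Kilgarriff's procedure can be sketched on toy data: sum the per-word 𝜒2 contributions over the most frequent words and use the sum as a dissimilarity score for ranking. The word counts below are invented; the names only echo the sports/politics example:

```python
# Rank corpora by a chi-squared-based dissimilarity score (Kilgarriff-style).
# The three toy "corpora" are hypothetical frequency lists, 100 tokens each.

def chi2_term(f1, f2, n1, n2):
    """Chi-squared for one word's 2x2 table (word vs. all other tokens)."""
    total = n1 + n2
    rest = (n1 - f1) + (n2 - f2)
    chi2 = 0.0
    for obs, row in [(f1, n1), (f2, n2)]:
        exp_word = row * (f1 + f2) / total
        exp_rest = row * rest / total
        chi2 += (obs - exp_word) ** 2 / exp_word
        chi2 += ((row - obs) - exp_rest) ** 2 / exp_rest
    return chi2

def dissimilarity(c1, c2):
    n1, n2 = sum(c1.values()), sum(c2.values())
    words = set(c1) | set(c2)
    return sum(chi2_term(c1.get(w, 0), c2.get(w, 0), n1, n2) for w in words)

sports_100 = {"the": 50, "goal": 40, "law": 10}
sports_60  = {"the": 50, "goal": 25, "law": 25}
sports_20  = {"the": 50, "goal": 12, "law": 38}

d_60 = dissimilarity(sports_100, sports_60)
d_20 = dissimilarity(sports_100, sports_20)
print(d_60 < d_20)  # the 60%-sports corpus is closer to the 100%-sports one
```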

Other features, other statistics

Teich & Fankhauser, 2009: comparing F-LOB (Freiburg–LOB Corpus of British English) with a corpus of scientific writing (≈ 1M tokens each)

Features:

Standardized type-token ratio (STTR) as a potential indicator of technical language

Relative number of nouns, lexical verbs, and adverbs as potential indicators of abstract language

Lexical density (avg. number of lexical words per clause) as a measure of informational density

Statistic: information gain (measures how well a feature distinguishes between classes)

STTR as the best-performing feature

“Shallow features such as a low type-token ratio clearly characterize the meta-register of scientific writing.”

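STTR can be computed as the mean type-token ratio over consecutive same-sized windows of the token stream (window sizes of 1,000 or 2,000 tokens are common; the tiny window below is only for illustration):

```python
# Standardized type-token ratio: mean TTR over fixed-size windows.
# An incomplete final window is dropped, as is common practice.

def sttr(tokens, window=1000):
    ttrs = []
    for start in range(0, len(tokens) - window + 1, window):
        chunk = tokens[start:start + window]
        ttrs.append(len(set(chunk)) / window)
    if not ttrs:
        raise ValueError("text shorter than one window")
    return sum(ttrs) / len(ttrs)

toy = ["a", "b", "a", "b", "c", "a", "a", "a", "a", "a"]
print(sttr(toy, window=5))  # (3/5 + 1/5) / 2 = 0.4
```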

Other features, other statistics (II)

Denoual (2006): Locate a target corpus X between two reference corpora Y and Z.

Features:

Character 𝑛-grams (1 < 𝑛 < 17); does not require tokenization or lemmatization.

Statistic: cross-entropy

Construct several character 𝑛-gram models from the reference corpora

Use these language models to estimate the cross-entropy, in terms of bits per character needed to encode a test corpus

Two corpora are similar when one of them can be completely predicted given knowledge of the other one

Evaluation: on Japanese, with known similarity corpora

Cross-entropy & 𝑛-grams perform as well as words & 𝜒2

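The core idea can be sketched with a character bigram model and add-one smoothing; Denoual uses higher-order 𝑛-grams and proper language-model estimation, so this is only an illustration of bits-per-character cross-entropy, on made-up strings:

```python
import math
from collections import Counter

# Bits per character of a test string under a character bigram model
# estimated from a training string, with add-one (Laplace) smoothing.

def cross_entropy(train, test):
    pair_counts = Counter(zip(train, train[1:]))
    prev_counts = Counter(train[:-1])
    vocab = set(train) | set(test)
    bits = 0.0
    n = 0
    for prev, cur in zip(test, test[1:]):
        p = (pair_counts[(prev, cur)] + 1) / (prev_counts[prev] + len(vocab))
        bits += -math.log2(p)
        n += 1
    return bits / n

test_text = "abababab"
h_close = cross_entropy("ab" * 20, test_text)  # model trained on a similar "corpus"
h_far = cross_entropy("cd" * 20, test_text)    # model trained on a dissimilar one
print(h_close < h_far)  # fewer bits per character under the closer model
```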

Extrinsic evaluation: collocation extraction (Biemann et al., 2013)

Stefan Evert’s work reported in (Biemann et al., 2013)

compares different corpora with a gold standard

name              size (tokens)   corpus type        POS tagged   lemmatized   basic unit
BNC                    0.1 G      reference corpus       +            +        text
WP500                  0.2 G      Wikipedia              +            +        fragment
Wackypedia             1.0 G      Wikipedia              +            +        article
ukWaC                  2.1 G      web corpus             +            +        web page
WebBase                3.3 G      web corpus             +            +        paragraph
UKCOW                  4.0 G      web corpus             +            +        sentence
LCC                    0.9 G      web corpus             +            −        sentence
LCC (𝑓 ≥ 𝑘)            0.9 G      web n-grams            +            −        n-gram
Web1T5 (𝑓 ≥ 40)     1000.0 G      web n-grams            −            −        n-gram


English verb-particle combinations (I)

gold standard: 3,078 verb-particle combinations, non-compositional (carry on, knock out) or compositional (bring together, peer out)

extracted: co-occurrences of word pairs (3-word window to the left / right of the verb)


English verb-particle combinations (II)

Average precision per association measure:

corpus            POS filter   size (tokens)    𝐺2      𝑡     MI    Dice   MI2    𝑋2
BNC                   +            0.1 G      31.06  29.15  22.58  30.13  30.97  32.12
WP500                 +            0.2 G      28.01  25.73  27.81  30.29  29.98  31.56
Wackypedia            +            1.0 G      28.03  25.70  27.39  30.35  30.10  31.58
ukWaC                 +            2.2 G      30.01  27.82  25.76  30.54  30.98  32.66
WebBase               +            3.3 G      30.34  27.80  27.95  31.74  32.02  33.95
UKCOW                 +            4.0 G      32.31  30.00  26.43  32.00  32.96  34.71
LCC                   +            0.9 G      25.61  24.83  22.14  26.82  25.09  26.38
LCC (𝑓 ≥ 5)           +            0.9 G      26.95  26.45  25.54  27.78  25.96  27.66
LCC (𝑓 ≥ 10)          +            0.9 G      27.34  26.81  27.13  27.85  25.95  28.09
LCC                   −            0.9 G      24.67  23.63  21.41  25.36  23.88  25.63
LCC (𝑓 ≥ 5)           −            0.9 G      25.45  24.79  23.54  26.30  24.55  26.21
LCC (𝑓 ≥ 10)          −            0.9 G      25.84  25.16  25.28  26.49  24.71  26.63
Web1T5 (𝑓 ≥ 40)       −         1000.0 G      26.61  26.12  21.67  27.82  25.72  27.14


English verb-particle combinations (III)

Clean, annotated web corpora can replace more traditional corpora like the BNC.

Diversity might play a role: web corpora need to be 1 GT in size to rival the BNC.

Web1T5: size alone is not sufficient; good filtering and linguistic annotation are necessary.


Homogeneity

How similar to each other are subparts of a corpus?

Subparts are usually sets of texts / documents

Common method:

Choose a feature (in a broad sense: could also be a lexical item or a syntactic construction; see also Stefan Gries, 2006)

Assess occurrence in subparts

Compute summary statistic

Homogeneity is an intra-corpus measure, but it makes sense to interpret it with respect to other corpora.


Homogeneity: Illustration

Assess sampling distribution of the mean of some feature X.

Draw 𝑚 samples of 𝑛 documents from the corpus.

For each sample, get the mean value of feature X.

Calculate the standard error of the mean.

Plot density of distribution of sample means.

Repeat this 𝑡 times.

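The resampling steps above can be sketched in a few lines; the per-document feature values below are simulated, whereas the tutorial script works on real COReX annotations:

```python
import random
import statistics

# Sampling distribution of the mean of a per-document feature:
# draw m samples of n documents each and record each sample's mean.

random.seed(1)

# hypothetical per-document feature values (e.g. past-tense verbs per 100 tokens)
corpus = [random.gauss(20, 6) for _ in range(70_000)]

m, n = 1000, 50
sample_means = []
for _ in range(m):
    sample = random.sample(corpus, n)
    sample_means.append(statistics.mean(sample))

# standard error of the mean, estimated as the spread of the sample means
se = statistics.stdev(sample_means)
print(round(se, 3))  # roughly sigma / sqrt(n) = 6 / sqrt(50)
```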

Plotting exercise

Run the R script tutorial-resampling.R and reproduce the plot for past-tense verbs.

source("tutorial-resampling.R")

Open the script file and modify it: select one or two other COReX features and plot the sampling distribution of their mean.

# which other COReX features are there?
colnames(random_dereko_70k)

# edit this line (select new feature):
feature <- "vpast"

# edit this line (remove ylim and xlim):
plot(density(m), ylim=c(0,1), xlim=c(5,35), ... )

# edit this line (adapt position of the legend):
legend(15, 1, ... )


Part III: Modelling


Example GLM… or rather logistic regression

Alternation of genitive and case identity in German measure NPs; the real model is much more complex, cf. Schäfer (accepted with revisions, September 2017), http://rolandschaefer.net/?p=1290

Wir trinken eine Flasche guten Wein. (Agree=1)

Wir trinken eine Flasche guten Weines. (Agree=0)

Which factors influence Agree? Some of the possible regressors:

case
definiteness
cardinal determiner


Why not a Linear Model?

(Figure: a linear model fitted to a dichotomous outcome 𝑦 (0/1) plotted against 𝑥.)

LM predicts a continuous variable

not plausible for a dichotomous outcome

also not usable as a probability (outside [0, 1])

normality assumption for errors violated


GLM: Logits

For a GLM, we pretend a linear combination is possible:

predicting probabilities of Agree=1

…from a linear combination of regressors

the linear term gives us logits:

𝑧 = 𝛽1𝑥1 + 𝛽2𝑥2 + ⋯ + 𝛽𝑛𝑥𝑛 + 𝛽0


GLM: Link functions

Logits are transformed into probabilities via a sigmoid CDF (cumulative distribution function) such as the logistic or the normal (probit) CDF

𝑝(𝑦 = 1) = 1 / (1 + 𝑒^(−𝑧))

For binary predictions:

𝑦 = 0 if 𝑝(𝑦 = 1) ≤ 0.5, and 𝑦 = 1 if 𝑝(𝑦 = 1) > 0.5

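The two steps, inverse-logit transform and thresholding at 0.5, look like this; the coefficient values are made up:

```python
import math

# Logistic link: map a logit z to p(y = 1), then threshold for a prediction.

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(z, threshold=0.5):
    return 1 if inv_logit(z) > threshold else 0

# hypothetical linear predictor: z = b1 * x1 + b0
b0, b1 = -1.0, 0.8
for x1 in (0.0, 1.0, 3.0):
    z = b1 * x1 + b0
    print(x1, round(inv_logit(z), 3), predict(z))
```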

Logit transform

Transformed logits as 𝑝(𝑦 = 1):

(Figure: the logistic curve, mapping logits on the horizontal axis to 𝑝(𝑦 = 1) ∈ [0, 1] on the vertical axis.)


Interpretation of the coefficients

only indirect interpretation of the coefficients

𝛽𝑖 positive ⇒ positive influence on 𝑝(𝑦 = 1)
𝛽𝑖 negative ⇒ negative influence on 𝑝(𝑦 = 1)

non-linear influence on probabilities!

linear influence on logits only

divide-by-4 (Gelman & Hill, 2006, p. 82): a coefficient divided by 4 gives an upper bound on its effect on the probability scale (the bound is attained near 𝑝 = 0.5)

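The divide-by-4 rule follows from the derivative of the logistic curve, 𝛽 ⋅ 𝑝 ⋅ (1 − 𝑝), which peaks at 𝑝 = 0.5. A quick numerical check with an invented coefficient:

```python
import math

# The slope of p(y = 1) with respect to x is beta * p * (1 - p),
# which is at most beta / 4 (attained where p = 0.5, i.e. z = 0).

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

beta = 0.8
slopes = []
for i in range(-500, 501):
    z = i / 50.0                        # logits from -10 to 10
    p = inv_logit(z)
    slopes.append(beta * p * (1 - p))   # exact slope dp/dx at this point

print(max(slopes), beta / 4)  # the maximum slope equals beta / 4
```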

Skipping

Maximum Likelihood Estimation

model selection (LLR test, bootstrap)

AIC (also model selection)

prediction accuracy and PRE, cross-validation

overdispersion in count GLMs

collinearity diagnostics

homogeneity of variance/graphical model validation

other links and error distributions

variance structures (gls)

power calculations for GLM(M)s

highly important depending on your statistical philosophy


Contrast coding

Categorical regressors (factors) have to be coded (automatically by R) as a series of dichotomous variables.

Example of a three-level variable A and dummy regressors 𝑥1..3:

        A = 1   A = 2   A = 3
𝑥1        1       0       0
𝑥2        0       1       0
𝑥3        0       0       1

But! De facto there are only 𝑘 − 1 dummies for a factor with 𝑘 levels. The remaining level is used as a reference for the others and is encoded by the intercept.


Dummy-coded factors in model equations

Example of a dummy-coded regressor 𝐴 as dummies 𝑥1..3 plus one numerical regressor 𝑥4:

𝑝(𝑦 = 1) = 1 / (1 + 𝑒^(−𝑧))

with 𝑧 = 𝛽1𝑥1 + 𝛽2𝑥2 + 𝛽3𝑥3 + 𝛽4𝑥4 + 𝛽0

With the specific values:

𝑥1..3: 0 or 1; if 𝑥1 = 1 then 𝑥2 = 0 and 𝑥3 = 0, etc.

𝑥4 ∈ ℝ


Intercepts

the intercept (𝛽0) in GLMs cannot be interpreted as the point of intersection with the ordinate axis

for numerical regressors: in the simplest possible binomial GLM, logit(𝑝(𝑦 = 1)) = 𝛽1𝑥1 + 𝛽0; if 𝑥1 = 0, it predicts 𝛽0 (a logit)

for a dummy-coded variable, one level is chosen as the reference to which the others are compared

GLM with three dummies: logit(𝑝(𝑦 = 1)) = 𝛽𝐴𝑘𝑘 ⋅ 𝑥𝐴𝑘𝑘 + 𝛽𝐷𝑎𝑡 ⋅ 𝑥𝐷𝑎𝑡 + 𝛽𝑁𝑜𝑚. If all regressors are 0, the observation shows Nom case. The dummies thus model the difference between the reference category (Nom) and the other cases.


𝜆 and pseudo-𝑅2

𝜆: proportional reduction of the prediction error

𝜆 = (prediction error without model − prediction error with model) / (prediction error without model)

𝑅2: proportion of variance explained by the regressors; a measure that can (with limitations) be used to compare models

Cox & Snell: 𝑅2𝐶 = 1 − (𝐿0 / 𝐿𝑓)^(2/𝑛)

Problem: never reaches 1!

Nagelkerke: 𝑅2𝑁 = 𝑅2𝐶 / 𝑅2𝑚𝑎𝑥 with 𝑅2𝑚𝑎𝑥 = 1 − (𝐿0)^(2/𝑛)

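Given the likelihoods of the intercept-only model (𝐿0) and the fitted model (𝐿𝑓), both measures are one-liners; the likelihood values below are invented for illustration:

```python
# Cox & Snell and Nagelkerke pseudo-R-squared from model likelihoods.
# L0: likelihood of the intercept-only model, Lf: likelihood of the
# fitted model, n: number of observations.

def cox_snell(L0, Lf, n):
    return 1.0 - (L0 / Lf) ** (2.0 / n)

def nagelkerke(L0, Lf, n):
    r2_max = 1.0 - L0 ** (2.0 / n)  # the maximum Cox & Snell can reach
    return cox_snell(L0, Lf, n) / r2_max

L0, Lf, n = 0.01, 0.5, 10  # purely illustrative numbers
print(round(cox_snell(L0, Lf, n), 3), round(nagelkerke(L0, Lf, n), 3))
# Nagelkerke rescales Cox & Snell so that 1 becomes attainable
```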

Mixed models: GLMMs

“always add speaker and genre as a random effect”

idea: account for between-group variance

possible grouping factors in linguistics:

participants
lexemes in corpus studies
text types/genres

alternative: one model per group (no pooling)

alternative: ignore (complete pooling)

alternative: contrast-coded groups as fixed effects

mixed modeling as a cult in corpus linguistics, ignoring basic facts about the (non-)sense of random effects


Situation for nested random effect

Exemplar   Speaker   Region
1          Daryl     Tyneside
2          Daryl     Tyneside
3          Riley     Tyneside
4          Riley     Tyneside
5          Dale      Greater London
6          Dale      Greater London
7          Reed      Greater London
8          Reed      Greater London

Table: Illustration of nested factors


Situation for crossed random effects

Exemplar   Speaker   Mode
1          Daryl     Spoken
2          Daryl     Written
3          Riley     Spoken
4          Riley     Spoken
5          Dale      Written
6          Dale      Written
7          Reed      Spoken
8          Reed      Written

Table: Illustration of crossed factors


Model specification: fixed vs. random

Adding a grouping factor as a fixed effect = dummy coding:

𝑃𝑟(𝑦𝑖 = 1) = logit⁻¹(𝛼0 + 𝛽𝑑 ⋅ 𝑥𝑖𝑑 + 𝛽𝑙1 ⋅ 𝑥𝑖𝑙1 + 𝛽𝑙2 ⋅ 𝑥𝑖𝑙2 + ⋯ + 𝛽𝑙𝑚−1 ⋅ 𝑥𝑖𝑙𝑚−1)   (1)

Adding a modeled random effect:

𝑃(𝑦𝑖 = 1) = logit⁻¹(𝛼𝑗[𝑖]𝑙 + 𝛽𝑑 ⋅ 𝑥𝑖𝑑)   (2)

𝛼𝑗𝑙 ∼ 𝑁(𝜇𝑙, 𝜎𝑙²)   (3)


What are modeled effects, conceptually

nothing very specific; see Gelman & Hill (2006, pp. 245–247)

rather a technical question:

many levels: modeled effect; few levels: non-modeled
few data points in some levels: modeled effect (shrinkage = partial pooling)
testing needed (really?): non-modeled fixed parameter estimate
modeled: conditional means/modes (BLUPs) are predictions
tests on predictions: never

An illustrative simulation of:

𝑃(𝑦𝑖 = 1) = logit⁻¹(𝛼𝑗[𝑖] + 𝛽1 ⋅ 𝑥𝑖1 + 𝛽2 ⋅ 𝑥𝑖2)   (4)

𝛼𝑗 ∼ 𝑁(𝜇, 𝜎)   (5)

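Data of the kind described in (4)–(5) can be generated directly, a random intercept per group plus two fixed effects; all parameter values below are invented:

```python
import math
import random

# Simulate binary outcomes with a random group intercept:
# alpha_j ~ N(mu, sigma), P(y_i = 1) = logit^-1(alpha_j[i] + b1*x1 + b2*x2).

random.seed(7)

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

mu, sigma = 0.0, 1.5      # random-intercept distribution (hypothetical)
b1, b2 = 0.8, -0.5        # fixed-effect coefficients (hypothetical)
groups = 5
alphas = [random.gauss(mu, sigma) for _ in range(groups)]

data = []
for i in range(1000):
    g = random.randrange(groups)          # group index j[i]
    x1, x2 = random.uniform(-2, 2), random.uniform(-2, 2)
    p = inv_logit(alphas[g] + b1 * x1 + b2 * x2)
    y = 1 if random.random() < p else 0
    data.append((g, x1, x2, y))

print(len(data))  # 1000 simulated observations
```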

Fixed parameter estimates vs. conditional modes I

(Figure panels: densities of the estimates for group0002–group0005, each annotated with mean, median, interval bounds, and the true value.)

Figure: LEFT: Group levels in sample GLMM based on predicted random effect (conditional mode); 5 groups; 20 observations per group; 1,000 simulations; the horizontal line marks the true value; RIGHT: Estimated fixed effects for the grouping factor in sample GLM; 5 groups; 20 observations per level of the grouping factor; 1,000 simulations; the horizontal line marks the true value


Pseudo-𝑅2 for these simulations I

[Figure omitted: density plots of the 𝑅² estimates; LEFT panel: Nakagawa & Schielzeth's 𝑅² (GLMM), marginal vs. conditional; RIGHT panel: Nagelkerke's 𝑅² (GLM), ignore random vs. random as fixed]

Figure: LEFT: Distribution of Nakagawa & Schielzeth's 𝑅² in the simulations; RIGHT: Nagelkerke's 𝑅² in the simulations for a GLM that ignores the grouping factor and a model that includes it as a fixed effect.
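Nagelkerke's 𝑅² can be computed directly from the log-likelihoods of the fitted model and the intercept-only (null) model; it rescales Cox & Snell's 𝑅² by its maximum attainable value. A minimal Python sketch using the standard formula (the log-likelihood values are made up):

```python
import math

def nagelkerke_r2(ll_model, ll_null, n):
    """Cox & Snell's R^2, rescaled by its maximum attainable value
    so that the index can reach 1 (Nagelkerke's R^2)."""
    r2_cs = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    r2_max = 1.0 - math.exp(2.0 * ll_null / n)
    return r2_cs / r2_max

# hypothetical log-likelihoods for 100 binary observations
print(nagelkerke_r2(-50.0, -69.3, 100))
```

A model that fits no better than the null model yields 0, and better fits (higher model log-likelihood) yield higher values.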


Gries

What I call Gries’ Modelling Everything Approach (MEA) is an extremist version of multifactorial modelling

Stefan Th. Gries (2015): by default, thou shalt add random effects for:
text types,
speakers,
lemmas,
etc.

Stefan Th. Gries (2017): by default, thou shalt model:
lengths of words and constituents,
information-structural predictors,
NP types,
animacy,
priming/persistency predictors (encoding numerous effects of words and constructions which occurred before the target item),
etc.


Schäfer (2018)

MEA leads to highly complex models

huge amounts of data required

massive annotation work (Stefan Th. Gries, 2017, p. 24)

overparametrisation (Schäfer, 2018)

…means the data don’t support the complex model structure

not a problem of a suboptimal optimiser but a problem of the data and the demands of the model (Bates et al., 2015; Matuschek et al., 2017)

Data dredging leads to spurious “significances”!


Why, oh why?

An example (p. c.) from cancer drug development…

target: extend the life expectancy or quality of life

hundreds of factors are known to influence those targets

individual studies look at the effect of one drug

…for which the working mechanism is (ideally) known

other factors?
potentially control for age, sex, preconditions…
if they have a known effect on the working mechanism of the drug

all other effects: randomisation (Fisher, 1935)


Informed minimal models

Severe modelling (Schäfer, n.d., based on Mayo, 2018)

model only effects relevant to your proposed causal mechanism

use statistical tests with care (Fisherian interpretation or Severe Testing)

don’t do model selection (= tweaking of potentially spurious “significances”)

model or control known relevant nuisance factors

apart from that: randomisation

better: find a significant influential causal factor controlling speakers’ behaviour

worse: try to model everything and work with overparametrised models and muddy inferences (many of them will just be wrong)


References

Ángeles Serrano, M., Ana Maguitman, Marián Boguñá, Santo Fortunato & Alessandro Vespignani. 2007. Decoding the structure of the WWW: A comparative analysis of web crawls. ACM Transactions on the Web 1(2). Article 10.

Atkins, Sue, Jeremy Clear & Nicholas Ostler. 1992. Corpus design criteria. Literary and Linguistic Computing 7(1). 1–16.

Baroni, Marco, Silvia Bernardini, Adriano Ferraresi & Eros Zanchetta. 2009. The WaCky Wide Web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation 43(3). 209–226.

Bates, Douglas, Martin Mächler, Ben Bolker & Steve Walker. 2015. Fitting linear mixed-effects models using lme4. Journal of Statistical Software 67(1). 1–48.

Beißwenger, Michael, Sabine Bartsch, Stefan Evert & Kay-Michael Würzner. 2016. EmpiriST 2015: A shared task on the automatic linguistic annotation of computer-mediated communication and web corpora. Berlin. http://aclanthology.coli.uni-saarland.de/pdf/W/W16/W16-2606.pdf.

Biber, Douglas. 1988. Variation across speech and writing. Cambridge: Cambridge University Press. http://www.loc.gov/catdir/toc/cam023/87038213.html.

Biber, Douglas. 1993. Representativeness in corpus design. Literary and Linguistic Computing 8(4). 243–257.

Biber, Douglas & Jesse Egbert. 2015. Using grammatical features for automatic register identification in an unrestricted corpus of documents from the open web. Journal of Research Design and Statistics in Linguistics and Communication Science 2(1). 3–36.


Biemann, Chris, Felix Bildhauer, Stefan Evert, Dirk Goldhahn, Uwe Quasthoff, Roland Schäfer, Johannes Simon, Leonard Swiezinski & Torsten Zesch. 2013. Scalable construction of high-quality web corpora. Journal for Language Technology and Computational Linguistics 28(2). 23–60.

Bird, Steven, Ewan Klein & Edward Loper. 2009. Natural language processing with Python: Analyzing text with the Natural Language Toolkit. O’Reilly.

Blei, David M. 2012. Probabilistic topic models. Communications of the ACM 55(4). 77–84. http://doi.acm.org/10.1145/2133806.2133826.

Blei, David M., Andrew Y. Ng & Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3. 993–1022.

Broder, Andrei, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins & Janet Wiener. 2000. Graph structure in the web. Computer Networks 33. 309–320.

Bubenhofer, Noah, Marek Konopka & Roman Schneider. 2014. Präliminarien einer Korpusgrammatik. Vol. 4 (Korpuslinguistik und interdisziplinäre Perspektiven auf Sprache). In collaboration with Caren Brinckmann. Tübingen: Narr.

Ciaramita, Massimiliano & Marco Baroni. 2006. Measuring web-corpus randomness: A progress report. In Marco Baroni & Silvia Bernardini (eds.), Wacky! Working papers on the web as corpus. Bologna: GEDIT.

Damerau, Fred J. 1993. Generating and evaluating domain-oriented multi-word terms from texts. Information Processing & Management 29(4). 433–447.


Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer & Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6). 391–407. http://www.cob.unt.edu/itds/faculty/evangelopoulos/dsci5910/LSA_Deerwester1990.pdf.

Denoual, Etienne. 2006. A method to quantify corpus similarity and its application to quantifying the degree of literality in a document. International Journal of Technology and Human Interaction 2(1). 51–66.

Edmundson, H. P. & R. E. Wyllys. 1961. Automatic abstracting and indexing: Survey and recommendations. Communications of the ACM 4(5). 226–234.

Fisher, Ronald A. 1935. The design of experiments. London: Macmillan.

Frank, Eibe, Mark A. Hall & Ian H. Witten. 2016. The WEKA workbench. Online appendix for “Data mining: Practical machine learning tools and techniques”, 4th edn. Morgan Kaufmann.

Freund, Luanne, Charles L. A. Clarke & Elaine G. Toms. 2006. Towards genre classification for IR in the workplace. In Proceedings of the 1st international conference on information interaction in context (IIiX), 30–36. Copenhagen: ACM. http://doi.acm.org/10.1145/1164820.1164829.

Gelman, Andrew & Jennifer Hill. 2006. Data analysis using regression and multilevel/hierarchical models. Cambridge: Cambridge University Press.

Geyken, Alexander. 2007. The DWDS corpus: A reference corpus for the German language of the 20th century. In Christiane Fellbaum (ed.), Collocations and idioms: Linguistic, lexicographic, and computational aspects, 23–41. London: Continuum Press.

Gries, Stefan Th. 2006. Exploring variability within and between corpora: Some methodological considerations. Corpora 1. 109–151.


Gries, Stefan Th. 2015. The most underused statistical method in corpus linguistics: Multi-level (and mixed-effects) models. Corpora 10(1). 95–126.

Gries, Stefan Th. 2017. Syntactic alternation research: Taking stock and some suggestions for the future. In Ludovic De Cuypere, Clara Vanderschueren & Gert De Sutter (eds.), Current trends in analyzing syntactic variation, vol. 31 (Belgian Journal of Linguistics), 7–27. Amsterdam: Benjamins.

Hofland, Knut & Stig Johansson. 1982. Word frequencies in British and American English. Bergen: The Norwegian Computing Centre for the Humanities.

Hofmann, Thomas. 1999. Probabilistic latent semantic analysis. In Proceedings of the fifteenth conference on uncertainty in artificial intelligence (UAI’99), 289–296. Stockholm: Morgan Kaufmann Publishers Inc. http://dl.acm.org/citation.cfm?id=2073796.2073829.

Kanaris, Ioannis & Efstathios Stamatatos. 2009. Learning to recognize webpage genres. Information Processing and Management 45(5). 499–512.

Karlgren, Jussi & Douglass Cutting. 1994. Recognizing text genres with simple metrics using discriminant analysis. In Proceedings of COLING 94, 1071–1075.

Kessler, Brett, Geoffrey Nunberg & Hinrich Schütze. 1997. Automatic detection of text genre. In Proceedings of the 35th annual meeting of the Association for Computational Linguistics and eighth conference of the European chapter of the Association for Computational Linguistics (ACL ’98), 32–38. Madrid: Association for Computational Linguistics. http://dx.doi.org/10.3115/976909.979622.

Kilgarriff, Adam. 2001. Comparing corpora. International Journal of Corpus Linguistics 6(1). 97–133.


Kilgarriff, Adam. 2012. Getting to know your corpus. In Petr Sojka, Aleš Horák, Ivan Kopeček & Karel Pala (eds.), Text, speech and dialogue: 15th international conference, TSD 2012, Brno, Czech Republic, September 3–7, 2012, proceedings, 3–15. Heidelberg: Springer.

Kupietz, Marc, Cyril Belica, Holger Keibel & Andreas Witt. 2010. The German reference corpus DeReKo: A primordial sample for linguistic research. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner & Daniel Tapias (eds.), Proceedings of the seventh international conference on language resources and evaluation (LREC ’10), 1848–1854. Valletta: European Language Resources Association (ELRA).

Lee, Yong-Bae & Sung Hyon Myaeng. 2002. Text genre classification with genre-revealing and subject-revealing features. In Proceedings of the 25th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR ’02), 145–150. Tampere: ACM. http://doi.acm.org/10.1145/564376.564403.

Manning, Christopher D., Prabhakar Raghavan & Hinrich Schütze. 2009. An introduction to information retrieval. Cambridge: Cambridge University Press.

Matuschek, Hannes, Reinhold Kliegl, Shravan Vasishth, Harald Baayen & Douglas M. Bates. 2017. Balancing type I error and power in linear mixed models. Journal of Memory and Language 94. 305–315.

Mayo, Deborah G. 2018. Statistical inference as severe testing: How to get beyond the statistics wars. Cambridge: Cambridge University Press.

Mehler, Alexander, Serge Sharoff & Marina Santini (eds.). 2010. Genres on the web: Computational models and empirical studies. Vol. 42 (Text, Speech and Language Technology). New York: Springer.


Schäfer, Roland. 2016. On bias-free crawling and representative web corpora. In Proceedings of the 10th Web as Corpus workshop, 99–105. Berlin: Association for Computational Linguistics. http://aclweb.org/anthology/W16-2612.

Schäfer, Roland. 2018. Probabilistic German morphosyntax. Humboldt-Universität zu Berlin habilitation thesis. http://rolandschaefer.net/?p=1410.

Schäfer, Roland. Accepted with revisions, September 2017. Competing constructions for German measure noun phrases: From usage data to experiment. Cognitive Linguistics.

Schäfer, Roland. N.d. Statistische Inferenz in der Linguistik [Statistical inference in linguistics]. In preparation.

Schäfer, Roland & Felix Bildhauer. 2012. Building large corpora from the web using a new efficient tool chain. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Ugur Dogan, Bente Maegaard, Joseph Mariani, Jan Odijk & Stelios Piperidis (eds.), Proceedings of the eighth international conference on language resources and evaluation (LREC ’12), 486–493. Istanbul: ELRA.

Scott, Mike. 1997. PC analysis of key words - and key key words. System 25(2). 233–245.

Scott, Mike. 2001. Comparing corpora and identifying keywords, collocations, frequency distributions through the WordSmith Tools suite of computer programs. In Mohsen Ghadessy, Alex Henry & Robert L. Roseberry (eds.), Small corpus studies and ELT, 47–67. Amsterdam & Philadelphia: Benjamins.

Sebastiani, Fabrizio. 2005. Text categorization. In Laura C. Rivero, Jorge H. Doorn & Viviana E. Ferraggine (eds.), The encyclopedia of database technologies and applications, 683–687. Hershey, PA: Idea Group Publishing.

Sharoff, Serge. 2006. Creating general-purpose corpora using automated search engine queries. In Marco Baroni & Silvia Bernardini (eds.), Wacky! Working papers on the web as corpus. Bologna: GEDIT.


Sinclair, John. 1996. Preliminary recommendations on corpus typology. Tech. rep. EAG–TCWG–CTYP/P. Expert Advisory Group on Language Engineering Standards document.

Spärck Jones, Karen. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28(1). 11–21.

Teich, Elke & Peter Fankhauser. 2009. Exploring a corpus of scientific texts using data mining. Language and Computers 71. 233–247.

Yule, G. Udny. 1944. The statistical study of literary vocabulary. Cambridge: Cambridge University Press.
