CORPUS LINGUISTICS

What is corpus linguistics?

• Definition

• Description

• Aims

Advantages of technology

1. The collection of ever larger language samples

2. The ability for much faster and more efficient processing and access

3. The availability of easy-to-learn computer resources for linguistic analysis

Characteristics of corpus-based analyses of language

• It is empirical

• It utilizes a corpus

• It makes extensive use of computers

• It depends on both quantitative and qualitative analytical techniques
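The last characteristic can be illustrated with a minimal sketch. It assumes a two-sentence toy corpus invented for this example and a deliberately simple tokenizer; the function names are hypothetical. Counting and normalizing word frequencies stands in for the quantitative side, and a keyword-in-context (KWIC) listing stands in for the qualitative side, where the researcher reads each hit in context.

```python
import re
from collections import Counter

# Toy corpus invented for illustration; a real study would load many texts.
TEXTS = [
    "The corpus is large. The corpus is balanced.",
    "A balanced corpus supports comparative studies.",
]

def tokenize(text):
    """Lowercase a text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def frequencies(texts):
    """Quantitative step: count how often each word form occurs."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

def per_thousand(count, total):
    """Normalize a raw count to a rate per 1,000 words, so corpora of different sizes compare."""
    return 1000 * count / total

def concordance(texts, keyword, width=2):
    """Qualitative step: list each hit of keyword with `width` words of context (KWIC)."""
    lines = []
    for text in texts:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append(f"{left} [{tok}] {right}".strip())
    return lines

counts = frequencies(TEXTS)
total = sum(counts.values())
print(counts["corpus"], round(per_thousand(counts["corpus"], total), 1))
for line in concordance(TEXTS, "corpus"):
    print(line)
```

Normalizing to a common base matters because raw counts from corpora of different sizes are not directly comparable.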

Corpus Design and Compilation

- Definition of corpus

- Features of a corpus:

1. No minimum size for a text collection

2. Made available to other researchers, which

a. encourages a higher degree of accountability in data analysis

b. permits collaborative work and follow-up studies by different researchers

Types of Corpora

General and specialized

I. General Corpora

Aim to:

a. represent language in its broadest sense

b. serve as a widely available resource for baseline or comparative studies

Designed to be:

a. quite large

b. balanced

e.g. the Brown Corpus, the LOB Corpus, the BNC and the American National Corpus

II. Specialized Corpora

- Register-specific descriptions and investigations of language

- Spoken and written

e.g.

- National varieties of English (ICE)

- Academic language (the TOEFL 2000 Spoken and Written Academic Language Corpus)

- Historical (the Helsinki Corpus, ARCHER)

- Academic speech (MICASE)

- Teenage language (COLT)

- Child language (CHILDES)

- Learner language (ICLE)

Issues in Corpus Linguistics

The design of the corpus should:

1. Reflect the anticipated goals

2. Be representative of the type of language under study, in terms of:

a. registers, discourse modes and topics

b. the demographics of speakers or writers

c. whether it is based on production or reception

3. Take into account the intended distribution of the corpus

* Consider funding & time
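One way to keep a design on track during collection is to compare the planned composition with what has been gathered so far. The sketch below assumes invented target shares and word counts for three registers; the names `TARGET`, `COLLECTED` and `shortfalls` are hypothetical, and the 5% tolerance is an arbitrary choice for illustration.

```python
# Hypothetical design targets: share of the corpus planned for each register.
TARGET = {"conversation": 0.4, "news": 0.3, "academic": 0.3}

# Hypothetical word counts collected so far.
COLLECTED = {"conversation": 50_000, "news": 30_000, "academic": 20_000}

def composition(word_counts):
    """Return each register's share of the total word count."""
    total = sum(word_counts.values())
    return {reg: n / total for reg, n in word_counts.items()}

def shortfalls(target, word_counts, tolerance=0.05):
    """Return registers whose actual share is below target by more than `tolerance`."""
    actual = composition(word_counts)
    return sorted(
        reg for reg, share in target.items()
        if share - actual.get(reg, 0.0) > tolerance
    )

print(shortfalls(TARGET, COLLECTED))
```

A check like this only covers the register dimension; speaker demographics or topics would need their own targets.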

Corpus Compilation

Data collection involves:

1. Creating electronic versions

2. Storing

3. Organizing
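The storing and organizing steps can be sketched as follows, assuming a simple layout of one plain-text file per sample plus a JSON manifest of metadata. The sample texts, field names and directory layout are invented for illustration and do not follow any particular standard.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample documents; the field names are illustrative only.
SAMPLES = [
    {"id": "w001", "mode": "written", "register": "news",
     "text": "Prices rose again today."},
    {"id": "s001", "mode": "spoken", "register": "conversation",
     "text": "yeah I think so"},
]

def store_corpus(samples, root):
    """Write each text to <root>/<mode>/<id>.txt and record its metadata in a manifest."""
    root = Path(root)
    manifest = []
    for sample in samples:
        folder = root / sample["mode"]
        folder.mkdir(parents=True, exist_ok=True)
        path = folder / f"{sample['id']}.txt"
        path.write_text(sample["text"], encoding="utf-8")
        # Keep everything except the text itself as searchable metadata.
        entry = {k: v for k, v in sample.items() if k != "text"}
        entry["path"] = path.relative_to(root).as_posix()
        manifest.append(entry)
    (root / "manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest

def select(manifest, **criteria):
    """Return manifest entries whose metadata match all given criteria."""
    return [e for e in manifest if all(e.get(k) == v for k, v in criteria.items())]

with tempfile.TemporaryDirectory() as tmp:
    manifest = store_corpus(SAMPLES, tmp)
    spoken = select(manifest, mode="spoken")
    print([e["id"] for e in spoken])
```

Keeping metadata apart from the text files makes it possible to select a sub-corpus, for example all spoken samples, without opening every file.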

How to collect written vs. spoken corpora?

Collecting written corpora involves:

- Using a scanner and optical character recognition (OCR) software

- Some corpora are keyboarded manually

- Some corpora are produced in both print and electronic versions

Collecting spoken corpora involves:

1. Deciding on a transcription system that can:

a. capture prosodic details or phonetic variation

b. represent interactional characteristics of speech

2. Obtaining permission to use the data for the corpus

Markup and Annotation

What is markup?

What is it used for?

Two types of markup:

1. Structural markup

2. Header markup
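The two types can be illustrated with a small sketch, assuming the usual sense of the terms: header markup records information about the text as a whole, and structural markup marks units inside the text. The XML sample and its tag names are invented for this example and are not taken from any specific corpus; parsing uses Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Invented sample; tag and attribute names are illustrative only.
SAMPLE = """<text>
  <header>
    <title>Seminar recording 1</title>
    <mode>spoken</mode>
    <register>academic</register>
  </header>
  <body>
    <u who="S1">so what is a corpus</u>
    <u who="S2">a principled collection of texts</u>
    <u who="S1">right</u>
  </body>
</text>"""

root = ET.fromstring(SAMPLE)

# Header markup: information about the text as a whole.
header = {child.tag: child.text for child in root.find("header")}

# Structural markup: units inside the text, here speaker turns.
turns = [(u.get("who"), u.text) for u in root.iter("u")]

# With turns marked, words can be counted per speaker.
words_by_speaker = {}
for who, utterance in turns:
    words_by_speaker[who] = words_by_speaker.get(who, 0) + len(utterance.split())

print(header["mode"], words_by_speaker)
```

Because the turns are marked explicitly, a per-speaker count needs no guesswork about where one speaker stops and the next begins.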

What is annotation?

Linguistic tagging such as:

- Part of speech

- Levels of specificity, e.g. functional information or case

- Prosodic and phonetic annotation

Advantages of a tagged corpus:

1. Allows researchers to explore and answer different types of questions

2. Allows researchers to see which grammatical structures co-occur

3. Addresses the problem of words that have multiple meanings or functions
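Points 2 and 3 can be sketched with a few hand-tagged sentences. The example assumes a `word_TAG` text format and a simplified tagset made up for illustration, not any published scheme; the function names are hypothetical.

```python
from collections import Counter

# Hand-tagged toy sentences in word_TAG format; the tagset is a simplified,
# illustrative one rather than any published scheme.
TAGGED = [
    "I_PRON can_MODAL open_VERB the_DET can_NOUN",
    "the_DET can_NOUN is_VERB empty_ADJ",
    "you_PRON can_MODAL leave_VERB",
]

def parse(line):
    """Split a word_TAG line into (word, tag) pairs."""
    return [tuple(item.rsplit("_", 1)) for item in line.split()]

sentences = [parse(line) for line in TAGGED]

def tags_for(word):
    """Count the tags assigned to a word form, separating its different functions."""
    return Counter(
        tag for sent in sentences for w, tag in sent if w.lower() == word
    )

def tag_bigrams():
    """Count adjacent tag pairs to see which grammatical categories co-occur."""
    pairs = Counter()
    for sent in sentences:
        tags = [tag for _, tag in sent]
        pairs.update(zip(tags, tags[1:]))
    return pairs

print(tags_for("can"))
print(tag_bigrams()[("MODAL", "VERB")])
```

An untagged search for "can" would return all four hits together; the tags are what separate the modal uses from the noun uses.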