Advantages of
technology
1. the collection of ever larger
language samples
2. the ability for much faster and more
efficient processing & access
3. the availability of easy-to-learn
computer resources for linguistic
analysis
Characteristics of corpus-based
analyses of language
• It is empirical
• It utilizes a corpus
• It makes extensive use of
computers
• It depends on both quantitative
& qualitative analytical
techniques.
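To illustrate how quantitative and qualitative techniques can work together, here is a minimal Python sketch. The sample text, the function names (tokenize, normalized_frequency, kwic) and the per-1,000-words basis are assumptions made for illustration only. The frequency count is the quantitative side; the key-word-in-context (KWIC) lines are what an analyst would read and interpret on the qualitative side.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def normalized_frequency(tokens, word, per=1000):
    """Quantitative: occurrences of `word` per `per` tokens."""
    if not tokens:
        return 0.0
    return tokens.count(word) / len(tokens) * per

def kwic(tokens, word, window=3):
    """Qualitative: key-word-in-context lines for manual inspection."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines

# Hypothetical two-sentence "corpus", far too small for real analysis.
sample = "The corpus is large. A corpus is a collection of texts."
tokens = tokenize(sample)
freq = normalized_frequency(tokens, "corpus")
concordance = kwic(tokens, "corpus")
```

Normalizing the count makes frequencies comparable across corpora of different sizes, while the concordance lines show each hit in its context.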
Corpus Design and Compilation
- Def. of Corpus
- Features of Corpus:
1. no minimum size for a text collection
2. made available to other researchers
a- encourages a higher degree of accountability
in data analysis
b- permits collaborative work and follow-up
studies by different researchers
Types of Corpora
general vs. specialized
I. General Corpora
Aim to:
a. represent lang. in its broadest
sense
b. serve as a widely available resource
for baseline or comparative studies
Designed to be
a. quite large b. balanced
e.g.
the Brown, the LOB, the BNC and
the American National Corpus
II. Specialized Corpora
- Register-specific descriptions and
investigations of language
- Spoken & written
e.g.
the ICE = national varieties of Eng.
the TOEFL 2000 Spoken & Written
Academic Language Corpus
- Historical (the Helsinki, the ARCHER)
Issues in Corpus Linguistics
The design of the corpus should:
1. reflect the anticipated goals
2. be representative of the type of lang.
a. registers, discourse modes and
topics
b. the demographics of speakers or
writers
c. whether it is based on production
or reception
Corpus Compilation
Data collection involves:
1. Creating electronic versions
2. Storing
3. Organizing
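The three steps above can be sketched in Python as follows. This is only one possible layout, not a standard: the folder scheme (mode/register), the index.csv file, the field names and the two sample texts are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical texts already converted to electronic form (step 1).
texts = [
    {"id": "w001", "mode": "written", "register": "news",
     "text": "Prices rose again today."},
    {"id": "s001", "mode": "spoken", "register": "conversation",
     "text": "well I mean it was fine"},
]

def store_corpus(texts, root):
    """Store each text as a file (step 2), organized by mode and register (step 3)."""
    root = Path(root)
    rows = []
    for item in texts:
        folder = root / item["mode"] / item["register"]
        folder.mkdir(parents=True, exist_ok=True)
        path = folder / f"{item['id']}.txt"
        path.write_text(item["text"], encoding="utf-8")
        rows.append({
            "id": item["id"],
            "mode": item["mode"],
            "register": item["register"],
            "path": path.relative_to(root).as_posix(),
            "words": len(item["text"].split()),
        })
    # A metadata index makes it easier to select sub-corpora later.
    fields = ["id", "mode", "register", "path", "words"]
    with open(root / "index.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
    return rows

corpus_root = tempfile.mkdtemp()  # temporary folder, for demonstration only
index = store_corpus(texts, corpus_root)
```

Keeping one file per text plus a separate index is a common and simple choice; a database or a single XML file per corpus would serve the same purpose.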
How to collect written vs. spoken
corpora?
Written corpora involve:
- Using a scanner and optical
character recognition (OCR)
software
- Some corpora are keyboarded
manually
- Some corpora are produced in
both print & electronic versions
Spoken corpora involve:
1. deciding on a transcription system
a. capture prosodic details or
phonetic variation
b. represent interactional
characteristics of speech
2. obtaining permission to use the
data for the corpus
Markup and Annotation
What is markup?
What is the use of it?
Two types of markup:
1. Structural markup
2. Header markup
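As a rough illustration of the two types, the sketch below assumes that header markup records information about the text as a whole and that structural markup marks units inside the text, such as speaker turns. The XML element names (header, mode, register, year, body, turn) and the sample content are hypothetical and do not follow any particular corpus standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up transcript; element names are illustrative only.
doc = """<text id="s001">
  <header>
    <mode>spoken</mode>
    <register>conversation</register>
    <year>2001</year>
  </header>
  <body>
    <turn speaker="A">did you see it</turn>
    <turn speaker="B">yeah I did</turn>
    <turn speaker="A">okay</turn>
  </body>
</text>"""

root = ET.fromstring(doc)

# Header markup: information about the text as a whole.
header = {child.tag: child.text for child in root.find("header")}

# Structural markup: units inside the text, here speaker turns.
words_per_speaker = {}
for turn in root.iter("turn"):
    speaker = turn.get("speaker")
    count = len(turn.text.split())
    words_per_speaker[speaker] = words_per_speaker.get(speaker, 0) + count
```

With markup like this, a tool can select texts by header values (for example, only spoken conversation) and count or search within structural units (for example, words per speaker).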
What is annotation?
Linguistic tagging such as
- Part of speech
- Levels of specificity, e.g.
functional information or case
- Prosodic and phonetic
annotation
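Part-of-speech annotation can be illustrated with a small Python sketch. The word_TAG format, the tag labels (DET, NOUN, VERB, ADJ, PUNCT) and the sample sentence are assumptions for illustration; real corpora use their own tagsets and formats.

```python
from collections import Counter

# Hypothetical hand-tagged sentence in word_TAG format.
tagged = "The_DET corpus_NOUN contains_VERB spoken_ADJ texts_NOUN ._PUNCT"

def parse_tagged(line):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the last underscore, so words that
        # themselves contain underscores keep them.
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs

pairs = parse_tagged(tagged)
tag_counts = Counter(tag for _, tag in pairs)
nouns = [word for word, tag in pairs if tag == "NOUN"]
```

Once tags are attached to words, searches can target grammatical categories (all nouns, all verbs) instead of individual word forms.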