History, Features, and Typology of Language Corpora


Niladri Sekhar Dash • S. Arulmozi


Niladri Sekhar Dash
Linguistic Research Unit
Indian Statistical Institute
Kolkata, West Bengal
India

S. Arulmozi
Centre for Applied Linguistics and Translation Studies
University of Hyderabad
Hyderabad, Telangana
India

ISBN 978-981-10-7457-8
ISBN 978-981-10-7458-5 (eBook)
https://doi.org/10.1007/978-981-10-7458-5

Library of Congress Control Number: 2017962060

© Springer Nature Singapore Pte Ltd. 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature
The registered company is Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Dedicated to the people of Shabra and Sathyamangalam

Preface

The purpose of this introductory book is to confirm the importance of speech and text corpora in the modern age of linguistic studies. We consider corpus linguistics to be one of the fundamental domains of applied linguistics within the main research and development activities of man–machine interaction in language understanding. Keeping this observation in mind, we have tried to convey some of the general ideas and issues related to corpus linguistics and corpus-based studies of languages. Over the last few decades, the development and use of speech corpora in speech and language technology have created unprecedented expectations among scholars. Since we want to keep this expectation alive, we have tried to add a further dimension to the field of corpus application so that corpora can meet the great challenges we have been facing in understanding natural languages in all their intricacies.

The present book is the result of our intensive research in the area of corpus linguistics for more than 25 years. In this book, we have tried to address some of the basic issues of corpus linguistics with reference to corpora of English and other languages. We have focussed on the revival and rejuvenation of the empirical approach to language study to show how language corpora of various types are developed and used in various works of mainstream linguistics, applied linguistics and language technology. We have shown how new findings obtained from language corpora are becoming useful to refute or substantiate previous observations about languages. We have provided working definitions of the corpus, identified the general features of the corpus, and focussed on the application potentials of the corpus. We have drawn lines of distinction between different types of corpora; discussed the form and content of parallel translation corpora; addressed issues involved in the generation of web text corpora; presented a short history of pre-digital corpora; described some digital text and speech corpora; and finally, highlighted some limitations of language corpora. In this course-cum-reference book, we have given emphasis to English and Indian languages since no book previously existed in this area that has adequately highlighted the issues linked with Indian languages.

The topics discussed in this book have a strong theoretical as well as practical significance. Over the years, corpus-based language study has remarkably changed the trends of language research and application across the globe. However, it has failed to create an impact on Indian and South Asian languages, in spite of the fact that language corpora have contributed on a large scale to new growth and to the advancement of linguistics in most of the advanced countries. This initial apathy is gradually ebbing away and, in fact, some Indian universities, as well as the universities of some neighboring countries like Bangladesh, Bhutan, Nepal, Maldives, Pakistan and Sri Lanka, are planning to introduce a fully fledged course on corpus linguistics at the university level. This book will be highly useful in this context, since it possesses the information necessary to address the requirements of students enrolled in such university-level courses.

The present book contains short but highly valuable and relevant discussions on the forgotten past of corpus-based linguistic research and applications that have been carried out over a few centuries across the languages of the world. The historical narrative results from our intensive investigation into the terrains of language corpus use in earlier centuries. This is perhaps the first book of its kind that aims to encompass the history of language description and application with close reference to corpora developed manually by the masters of the craft.

Over the decades, the basic methods of corpus making have undergone changes with the advent of new tools and techniques for text collection and access. In this book, we have made an attempt to show how, in the earlier centuries, the process of language corpora generation was practised long before the introduction of the computer and how earlier scholars designed, developed and used handmade corpora in their language-based activities relating to dictionary making, the study of dialects, language teaching, understanding word meanings, defining usages of words and terms, exploring the nature and manners of language acquisition, writing grammar books, preparing text materials, exploring specific stylistic traits of some literary masters and so on.

In all such works, the earlier scholars utilized handmade corpora of selected text samples to gather and extract relevant linguistic information and examples to enhance the quality and reliability of their works. With full reference to this history, this book is expected to create awareness among the scholars about this area in order to encourage interest in using corpora in research, development and application in linguistics, as well as in sister disciplines. The information presented in this book categorically underlines that analysis of corpora of actual language use can yield new information and insights to describe a language in a more faithful manner, as well as to deal with the problems of linguistics with certified authenticity.

Our experience in dealing with language corpora, along with the experience of some other scholars in India and abroad, has helped us to realize that a book of this kind is long overdue for those interested in knowing the utility of language corpora for linguistic research and applications. This inspired us to assemble relevant information from various fields of linguistics and sister disciplines to write a book that would provide the necessary philosophical perspectives about this new field of language research and application. This book will provide scholars with a panoramic exposure to this new area of language study, as well as inspire them to explore this area with enthusiasm.

This book also presents primary information about corpora and their typologies. It presents a colorful picture of the present state of corpus-based language study with a clear focus on the future course of activities relating to corpus generation and usage. The book intends to emphasize the compilation, analysis and investigation of actual language data from both qualitative and functional perspectives in order to address some theoretical and methodological issues and principles relating to descriptive linguistics, applied linguistics and language technology. The topics discussed and referred to in the book have strong referential and academic relevance in the global context. We have come across many queries made by scholars across the world about the history and the present state of corpus linguistics in general, and since no such book has previously been written in this area, this book is highly suitable for addressing these queries.

Deeper investigations into languages have shown many unique aspects of languages that are not only interesting but also quite useful. We have observed that within a natural setting, a language, in speech as well as in writing, is used as a versatile tool of communication. In this context, the goal of a language investigator is to understand the language in minute detail so that (s)he can develop computer systems that can perform like normal human beings in terms of exerting the regular functions of hearing and understanding a language. With regard to the present state of research in corpus linguistics across the world, there is a need for more effort focused towards developing natural, spontaneous and unconstrained language corpora for better man–machine interaction. In addition, there is an urgent need for utilization of information obtained from the analysis of language data of various text types collected empirically and compiled in corpora for developing domain-free and workable commercial systems for speech and language technology. Only then can we think of weaving a realistic linguistic fabric for the benefit of the common people. This enterprise, however, requires more basic and intensive research on a large amount of empirical language data that are compiled as corpora and processed.

In this book, we have attempted to trace the trends and perspectives of language research with a focus on development and use of corpora for the activities relating to linguistics and allied disciplines. We have noted that, in most of these works, the primary importance of spoken text, compared with written text, is apparent from its exclusive use by speech communities (Sasaki 2003). The spoken form of a language has certain characteristic features that differ from the written form. These features of the spoken form contribute greatly to shaping the thought processes and thinking capabilities of speech communities (Tannen 1982). Despite so many complexities, speech provides the highest amount of information among all the output modes available to human beings. Therefore, although there are many differences between speech and writing, it is absolutely necessary to understand the inherent cognitive interface between the two in terms of realizing the interdependence of the processes used for generating speech and language corpora.

Within the wider spectrum of speech and language, the functional and referential value of language corpora is well understood. Success of any kind in each of these domains requires a huge amount of language data for experimentation, analysis, implementation and verification. Naturally, work in these areas will be far more reliable if databases are directly obtained, in the form of a corpus, from the actual contexts of the language used by people in their regular linguistic interactions. This signifies that proper, as well as faithful, representation of real-life language data may bring about reliability, dependability and authenticity to systems and devices intended to be developed for speech and language technology. This may inspire the younger generation to work in the area of corpus linguistics, for the betterment of linguistics and languages at large.

The basic objective of the book is to show how corpus-based language study, as has been noted in English and other languages, has opened up many new areas of linguistic research and applications. To bring home this argument, we have dealt with the typology of corpora and argued for using typology-based corpora in linguistic studies. In different chapters of the book, we have shown how systematic analysis of corpora can produce new data, information, and insights that are useful for all kinds of understanding of our languages.

Although researchers have felt the need to use corpora in linguistic and extralinguistic studies during the last few decades, they did not have adequate exposure and knowledge about the generation, processing and utilization of corpora in an organized manner. To address this problem, in this book, we have presented the necessary guidance to target readers with regard to the processes of designing and using language corpora for specific needs. Moreover, we have provided categorized information about the classification of corpora, as well as the types of corpora that may help general corpora users to determine how they can select and use a particular corpus for their specific research, education and application problems. The present book can contribute to ‘corpus linguistics’ in the following important ways to address the requirements of readers:

[1] It can make people aware of a moderately new method of language research and application with reference to data and information of actual language use.

[2] It can exhibit how linguistic data and information of various types can be generated from corpora for works relating to every domain of human knowledge.

[3] It can show how findings from corpora can help people attest or challenge the relevance and validity of earlier observations relating to a language or its properties.

[4] It can open up many new avenues and areas of language studies for the betterment of languages and their users.


The issues discussed in this book have both academic importance and functional relevance in the general domain of corpus linguistics and language technology. Over the last 70 years, the trend of corpus-based language study has tried to find suitable answers to the questions relating to form, content and function of language corpora in the advancement of human knowledge. We have tried to find answers to some of the questions within the mainframe of language description and language use. Thus, this book confirms its academic relevance and intellectual significance in the studies and application of language data in linguistics and allied disciplines.

The book has been written with some specific goals in mind. One of them is to deal with the issues of using empirical language data in the different domains of linguistic research and development. Keeping this challenge in mind, the book covers some of the major issues of corpus linguistics from descriptive and applied perspectives in order to enrich linguistics and related disciplines with new findings. It also proposes using the empirical linguistic information to verify some earlier claims and observations made in linguistics, sociolinguistics, demography, psychology, anthropology and cognitive science. Furthermore, this book emphasizes research into naturally occurring language data complemented with qualitative results and functional interpretation of new findings so that theoretical and introspective issues are addressed with due reference to actual usage. We believe that the discussions presented in this book will be useful for the new generations of language scientists for devising new methods for empirical language research that are pragmatic and sensible.

The content and quality of this book stand out as a relevant and important contribution to the field of corpus linguistics. As an output of our long and intensive research in this area, this book has the potential to open up new directions for research and application in mainstream linguistics and language technology. The relevance of the book may be measured in terms of the theoretical and practical value of corpora in language research, which it categorically highlights. The book draws our attention towards the potential future course of activities, both in general and applied linguistics. In essence, it is a course-cum-reference book that will benefit students and teachers of corpus and computational linguistics, both at the undergraduate and postgraduate level.

The book is referential in its approach and empirical in its analysis. It has enough data and information to be considered as a course-cum-reference book for university students, teachers and researchers working in this area. Although the book is written primarily for postgraduate students and researchers, people working in empirical linguistics, language technology, computational linguistics, language processing, descriptive linguistics, historical linguistics, sociolinguistics, language teaching, dialectology, lexicography, lexicology, semantics, discourse, stylistics and so on can find this book equally relevant and useful for new information and insights.

Kolkata, India      Niladri Sekhar Dash
Hyderabad, India    S. Arulmozi
August 2017

Acknowledgements

We humbly thank our seniors, peers and juniors who have helped us in different capacities to shape our observations into the form of this book. We also acknowledge those known and unknown scholars from whom we have tried to assimilate insights and information to formulate our ideas and the concepts furnished in this book. We humbly thank those unknown reviewers who suggested the necessary corrections and modifications for the improvement of content and quality of the book. We sincerely appreciate their wise and insightful comments for the betterment of the work.

We sincerely thank Prof. Probal Dasgupta, Prof. Panchanan Mohanty, Prof. Udaya Narayana Singh, Prof. Anvita Abbi, Prof. Pramod Pandey, Prof. Girish Nath Jha, Prof. Mazhar Mehdi Hussain, Prof. Rizwanur Rahman, Prof. R.C. Sharma, Prof. Tista Bagchi, Dr. Tanmoy Bhattacharya, Prof. Pradip Kumar Das, Dr. Parteek Kumar Bhatia, Dr. Suman Preet Virk, Prof. Aadil Amin Kak, Prof. Pushpak Bhattacharyya, Dr. Ramdas Karmali, Dr. Jyoti D. Pawar, Prof. S. Rajendran, Prof. K. P. Soman, Dr. M.C. Kesava Murty, Prof. Malhar Kulkarni, Prof. Imtiaz Hasnain, Prof. A.R. Fatihi, Prof. Vijay Kaul, Prof. Omkar Koul, Prof. Raj Nath Bhat, Dr. Abhinav Kumar Mishra, Dr. Anil Thakur, Dr. Sanjukta Ghosh, Prof. Rajeev Sangal, Dr. Manoj Jain, Dr. Swarn Lata, Prof. Niladri Chatterjee, Prof. Yashawanta Singh, Dr. Surmangol Sharma, Prof. Madhumita Barbora, Prof. Gautam Borah, Dr. Arup Kumar Saha, Dr. Amalesh Gope, Ms. Bipasa Patgiri, Dr. Priyankoo Sharmah, Prof. Mahidas Bhattacharya, Dr. Samir Karmakar, Dr. Atanu Saha, Dr. Indranil Acharya, Prof. Tirthankar Purakayastha, Prof. Debashis Bandyopadhyay, Prof. Mina Dan, Dr. Aditi Ghosh, Dr. Sunandan Kumar Sen, Prof. Manton Kumar Singh, Dr. Rizwan Ahmed, Dr. Sudip Naskar, Prof. Anupam Basu, Prof. Usha Devi, Prof. Renuga Devi, Prof. Gautam Sengupta, Prof. K. Rajyarama, Prof. G. Uma Maheshwar Rao, Dr. Sriparna Das, Prof. Bhubaneswar Chilikuri, Dr. L. Ramamoorthy, Prof. Umarani Pappuswami, Prof. G. Balasubramanian, Prof. Perumalsamy, Dr. Tariq Khan, Dr. Kakali Mukherjee, Dr. Sibasis Mukhopadhyay, Prof. N. Deivasundaram, Dr. Lalitha Raja, Dr. S. Shanavas, Dr. S. Kunjamma, Dr. Rose Mary, Dr. L. Darwin, Dr. S. Prema and many others for their constructive critical comments on our works presented at various seminars, workshops and conferences. Their views and opinions have helped us to revise and upgrade the content of the book.

We acknowledge the support and encouragement we have received from our parents, teachers, colleagues, friends and students for writing this book. Particularly, we would like to mention the name of Ms. Shinjini Chatterjee, who for years has been persistently encouraging us to collect our thoughts in order to produce this book. This book would not have been possible without her continuous encouragement. Ms. Priya Vyas also deserves our thanks for her elegant management of the manuscript at its formative stage.

We happily express our sincere thanks to Soma, Shrotriya, Somaditya, Visalakshi, Aravindh, and Anirudh for their perennial emotional support and encouragement extended during the course of writing this book. They have always been with us to boost our morale during odd circumstances and adverse situations.

We shall consider our efforts amply rewarded if people who are interested in corpus linguistics find this book useful for their academic and non-academic endeavors.

August 2017
Niladri Sekhar Dash
S. Arulmozi

Contents

1 Definition of ‘Corpus’
  1.1 Introduction
  1.2 Some Popular Definitions of ‘Corpus’
  1.3 What Is a Corpus?
  1.4 The Acronym
  1.5 Corpus, Dataset and Database
  1.6 Formational Principles
  1.7 The Benefits of a Corpus
  1.8 Advantages of a Corpus
  1.9 Conclusion
  References

2 Features of a Corpus
  2.1 Introduction
  2.2 Quantity
  2.3 Quality
  2.4 Representation
  2.5 Simplicity
  2.6 Equality
  2.7 Retrievability
  2.8 Verifiability
  2.9 Augmentation
  2.10 Documentation
  2.11 Management
  2.12 Conclusion
  References

3 Genre of Text
  3.1 Introduction
  3.2 Why Classify Corpora?
  3.3 Genre of Text
  3.4 Text Corpus
  3.5 Speech Corpus
  3.6 Spoken Corpus
  3.7 Conclusion
  References

4 Nature of Data
  4.1 Introduction
  4.2 General Corpus
  4.3 Special Corpus
  4.4 Sample Corpus
  4.5 Literary Corpus
  4.6 Monitor Corpus
  4.7 Multimodal Corpus
  4.8 Sublanguage Corpus
  4.9 Controlled Language Corpus
  4.10 Conclusion
  References

5 Type and Purpose of Text
  5.1 Introduction
  5.2 Type of Text
    5.2.1 Monolingual Corpus
    5.2.2 Bilingual Corpus
    5.2.3 Multilingual Corpus
  5.3 Purpose of Design
    5.3.1 Unannotated Corpus
    5.3.2 Annotated Corpus
  5.4 Maxims of Corpus Annotation
  5.5 Issues Involved in Annotation
  5.6 The Challenges
  5.7 The State of the Art
  5.8 Conclusion
  References

6 Nature of Text Application
  6.1 Introduction
  6.2 Parallel Corpus
  6.3 Translation Corpus
  6.4 Aligned Corpus
  6.5 Comparable Corpus
  6.6 Reference Corpus
  6.7 Learner Corpus
  6.8 Opportunistic Corpus
  6.9 Conclusion
  References

7 Parallel Translation Corpus
  7.1 Introduction
  7.2 Definition of a Parallel Translation Corpus (PTC)
  7.3 Construction of a PTC
  7.4 Features of a PTC
    7.4.1 Large Quantity of Data
    7.4.2 Quality of Text
    7.4.3 Text Representation
    7.4.4 Simplicity
    7.4.5 Equality
    7.4.6 Retrievability
    7.4.7 Verifiability
    7.4.8 Augmentation
    7.4.9 Documentation
  7.5 Alignment of Texts in PTC
  7.6 Analysis of Text in PTC
  7.7 Restructuring Translation Units in PTC
  7.8 Extraction of Translational Equivalent Units
  7.9 Bilingual Lexical Database
  7.10 Bilingual Terminology Databank
  7.11 Conclusion
  References

8 Web Text Corpus
  8.1 Introduction
  8.2 Defining a Web Text Corpus
  8.3 Theoretical Frame
  8.4 Purpose Behind a Web Text Corpus
  8.5 Early Attempts for Web Text Corpus Generation
  8.6 Methodologies Applied
    8.6.1 Overall Design of the Web Text Corpus
    8.6.2 Domains and Sub-domains of Texts
    8.6.3 Data Collection
  8.7 Metadata Information
    8.7.1 Computerizing the Data
    8.7.2 Validation of Web Corpus
  8.8 Problems in Generation of Web Text Corpus
    8.8.1 Technical Problems
    8.8.2 Linguistic Problems
  8.9 Conclusion
  References

9 Pre-digital Corpora (Part 1)
  9.1 Introduction
  9.2 The Questions of Relevance
  9.3 Word Collection from Corpora for Dictionary Compilation
    9.3.1 Johnson’s Dictionary (1755)
    9.3.2 The Oxford English Dictionary (1882)
    9.3.3 Supplementary Volumes of the Oxford English Dictionary
    9.3.4 Dictionary of American English
  9.4 Collecting Quotations for Dictionary
  9.5 Corpora in Lexical Study
  9.6 Corpora for Writing Grammars
  9.7 Conclusion
  References

10 Pre-digital Corpora (Part 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16710.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16710.2 Corpora in Dialect Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16810.3 Corpora in Speech Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17610.4 Corpora in Language Pedagogy . . . . . . . . . . . . . . . . . . . . . . . 17910.5 Corpora in Language Acquisition . . . . . . . . . . . . . . . . . . . . . . 18110.6 Corpora in Stylistic Studies . . . . . . . . . . . . . . . . . . . . . . . . . . 18210.7 Corpora in Other Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18310.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

11 Digital Text Corpora (Part 1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18711.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18711.2 The Brown Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18811.3 The LOB Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19111.4 The Australian Corpus of English . . . . . . . . . . . . . . . . . . . . . 19411.5 The Corpus of New Zealand English . . . . . . . . . . . . . . . . . . . 19511.6 The Freiburg–LOB Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . 19711.7 The International Corpus of English . . . . . . . . . . . . . . . . . . . . 19811.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201

12 Digital Text Corpora (Part 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20312.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20312.2 British National Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20412.3 BNC-Baby . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205

12.4 American National Corpus
12.5 Bank of English
12.6 Croatian National Corpus
12.7 English–Norwegian Parallel Corpus
12.8 Some Small-Sized Text Corpora
12.9 Conclusion
References

13 Digital Speech Corpora
13.1 Introduction
13.2 The Hurdles
13.3 Relevance of the Survey
13.4 Speech Part of Survey of English Usage
13.5 London–Lund Corpus of Spoken English
13.6 Machine-Readable Corpus of Spoken English
13.7 Corpus of Spoken New Zealand English
13.8 Michigan Corpus of Academic Speech
13.9 Corpus of London Teenage Language
13.10 Some Small-Sized Speech Corpora
13.11 Conclusion
References

14 Utilization of Language Corpora
14.1 Introduction
14.2 Utility of a Corpus
14.3 The Revival Story
14.4 Use of a Corpus
14.5 Corpus Users
14.5.1 Language Specialists
14.5.2 Content Specialists
14.5.3 Media Specialists
14.6 Corpora in Language Technology
14.7 Mutual Dependency Interface
14.8 Conclusion
References

15 Limitations of Language Corpora
15.1 Introduction
15.2 Criticism from Generative Linguistics
15.3 Paucity in Balanced Text Representation
15.4 Limitations in Technical Efficiency
15.5 Supremacy of Text Over Speech
15.6 Scarcity of Dialogic Texts
15.7 Lack of Pictorial Elements in Corpus
15.8 Lack of Poetic Text

15.9 Other Limitations
15.10 Conclusion
References

Author Index

Subject Index

About the Authors

Niladri Sekhar Dash is Associate Professor in the Linguistic Research Unit of the Indian Statistical Institute, Kolkata. He has been working on Corpus Linguistics, Language Technology, Natural Language Processing, Language Documentation and Digitization, Computational Lexicography, Computer Assisted Language Teaching, and Manual and Machine Translation for over two decades. He is credited with 15 research monographs and 225 research papers in peer-reviewed international and national journals, anthologies and conference proceedings. He has delivered lectures and taught courses as an invited scholar at more than 30 universities and institutes in India and abroad. He has acted as a consultant for several organizations working on Language Technology and Natural Language Processing. Dr. Dash is the Principal Investigator for five language technology projects funded by the Government of India and the Indian Statistical Institute, Kolkata. He is the Editor-in-Chief of the Journal of Advanced Linguistic Studies, a peer-reviewed international journal of linguistics, and an Editorial Board Member of five international journals. He is a member of several linguistics associations across the world and a regular Ph.D. thesis adjudicator for several Indian universities. At present, Dr. Dash is working on a Digital Pronunciation Dictionary for Bangla, Hindi–Bangla Parallel Translation Corpus Generation, Endangered Language Documentation and Digitization, POS Tagging and Chunking, Word Sense Disambiguation, Manual and Machine Translation, and Computer Assisted Language Teaching, as well as other projects. Details of Dr. Dash are available at: https://sites.google.com/site/nsdashisi/home/.

S. Arulmozi is Assistant Professor at the Centre for Applied Linguistics and Translation Studies (CALTS), University of Hyderabad, India. He has previously taught at the Dravidian University, Kuppam; acted as Guest Faculty at CALTS, University of Hyderabad; worked as Research Staff at the Anna University, Chennai; as Project Fellow at the Tamil University, Thanjavur; and as Language Assistant-Tamil at the Central Institute of Indian Languages, Mysore. Dr. Arulmozi has been working on Corpus Linguistics for some years and has been trained professionally in WordNet. He has successfully carried out projects on Corpus Linguistics and WordNet funded by the Government of India and has also conducted a workshop on language technology at the University of Malaya, Kuala Lumpur, Malaysia. He is credited with one research monograph and 15 research papers in peer-reviewed international and national journals, edited volumes and conference proceedings.

Abbreviations

ACE Australian Corpus of English
ANC American National Corpus
ASCII American Standard Code for Information Interchange
ASE Actual Sense Extraction
BADIP Bancadati Dell Italiano Parlato
BCET Birmingham Collection of English Text
BLD Bilingual Lexical Database
BNC British National Corpus
BoE Bank of English
BOK Body of Knowledge
BoS Bank of Swedish
BRP British Representative Pronunciation
CFE Caterpillar Fundamental English
CHILDES Child Language Data Exchange System
CLAWS Constituent Likelihood Automatic Word Tagging System
CLC Controlled Language Corpus
CNC Croatian National Corpus
CoCA Corpus of Contemporary American English
COLT Corpus of London Teenage Language
CSE Corpus Spoken English
CSPA Corpus of Spoken and Professional American English
DOS Disk Operating System
EAP English for Academic Purpose
ELT English Language Teaching
ENPC English-Norwegian Parallel Corpus
ESL English as a Second Language
EUSTACE Edinburgh University Speech Timing Archive and Corpus of English
FLOB Freiburg-LOB Corpus of British English
HTML Hypertext Markup Language
ICAME International Computer Archive of Modern and Medieval English

ICE International Corpus of English
IDS Institut für Deutsche Sprache
ILCI Indian Languages Corpora Initiative
ISCII Indian Standard Code for Information Interchange
ISI Indian Statistical Institute
KBCS Knowledge-based Computer System
KCIE Kolhapur Corpus of Indian English
LCEMET Lampeter Corpus of Early Modern English Tracts
LDC Linguistic Data Consortium
LLC Lancaster-Lund Corpus
LLSC London-Lund Speech Corpus
LOB Lancaster-Oslo/Bergen
LSI Linguistic Survey of India
LT Language Technology
MIC MEANING Italian Corpus
MICASE Michigan Corpus of Academic Spoken English
NIST National Institute of Standards and Technology
NLP Natural Language Processing
OCR Optical Character Recognition
OED Oxford English Dictionary
POS Part-of-Speech
PPCME Penn-Helsinki Parsed Corpus of Middle English
PTC Parallel Translation Corpus
SCB Standard Colloquial Bangla
SEC Simplified English Checker/Corrector
SEU Survey of English Usage
SGML Standard Generalised Markup Language
SSE Survey of Spoken English
STT Scientific and Technical Term
TASA Texas Association of School Administrators
TDIL Technology Development for Indian Languages
TEI Text Encoding Initiative
TEU Translation Equivalent Unit
WCNZE Wellington Corpus New Zealand English
WSD Word Sense Disambiguation
WTC Web Text Corpus
WWW World Wide Web

List of Figures

Fig. 1.1 The definition of ‘corpus’ embedded within CORPUS
Fig. 1.2 Example of a dataset where columns and rows carry different variables
Fig. 1.3 Picture of database produced from large sets of data and information
Fig. 1.4 Utilization of a corpus by man and machine
Fig. 2.1 Growth of language corpora over the years
Fig. 2.2 Quality of a corpus
Fig. 2.3 Text representation in a corpus
Fig. 2.4 Text in corpus in simple plain format
Fig. 2.5 Equality in data from all text types in corpus
Fig. 2.6 Retrievability of language data from corpus
Fig. 2.7 Verifiability of corpus by man and machine
Fig. 2.8 Augmentation of corpus data over the years
Fig. 2.9 Documentation of information included in a corpus
Fig. 3.1 Classification of language corpora based on different criteria
Fig. 3.2 Classification of corpora based on genre of text
Fig. 3.3 Example of a text from Kolhapur Corpus of Indian English (KCIE)
Fig. 3.4 Composition and content of a speech corpus
Fig. 3.5 Speech corpus
Fig. 3.6 Example of a spoken corpus (LLC)
Fig. 3.7 Lancaster/IBM spoken tagged English corpus
Fig. 4.1 Classification of corpus based on the nature of data
Fig. 4.2 Birth of special corpus from a general corpus
Fig. 4.3 Composition of a special corpus
Fig. 4.4 Bank of English: a screen shot
Fig. 4.5 Corpus of contemporary American English (COCA)
Fig. 4.6 Structure and composition of a multimodal corpus
Fig. 5.1 Classification of corpus based on type of text

Fig. 5.2 Sample of the ISI Bangla text corpus
Fig. 5.3 Structure of a bilingual corpus
Fig. 5.4 Structure of a multilingual corpus
Fig. 5.5 Classification of corpus based on purpose of design
Fig. 5.6 Annotated London–Lund Speech Corpus
Fig. 6.1 Classification of corpus based on nature of application
Fig. 6.2 Conceptual frame of a parallel corpus
Fig. 6.3 Model of an ideal translation corpus
Fig. 6.4 Sentences aligned in the Hindi–Bangla translation corpus
Fig. 6.5 Comparable corpus with different texts from a single language
Fig. 6.6 Comparable corpus with the same texts from different languages
Fig. 7.1 Hindi as a source language and other Indian languages as target languages
Fig. 7.2 Schematic representation of a Parallel Translation Corpus
Fig. 7.3 Construction and composition of a PTC
Fig. 7.4 Sample of Hindi–Bangla parallel translation corpus
Fig. 7.5 Layers of translation unit alignment in a PTC
Fig. 7.6 Sentences aligned in a Hindi–Bangla PTC
Fig. 7.7 Lexical mapping between Hindi and Bangla
Fig. 7.8 Extraction of TEUs from a parallel translation corpus
Fig. 7.9 Verification of TEUs with a monolingual corpus
Fig. 8.1 Major domains of text samples of the Bangla web text corpus (WTC)
Fig. 8.2 Stages involved in web text corpus (WTC) compilation
Fig. 8.3 Metadata information for the texts taken from magazines
Fig. 8.4 Metadata information for the texts taken from books
Fig. 8.5 Metadata information for the texts taken from newspapers
Fig. 8.6 Metadata information for the texts taken from websites
Fig. 9.1 Utilization of handmade language corpora in various areas
Fig. 9.2 Picture of the Plan of a Dictionary of the English Language
Fig. 9.3 Picture of A Dictionary of the English Language (1755)
Fig. 9.4 The first edition of the Oxford English Dictionary
Fig. 9.5 Cover page of The Teacher’s Word Book (1921)
Fig. 9.6 Cover page of The Teacher’s Word Book of 30,000 Words (1944)
Fig. 10.1 Utilization of handmade language corpora in applied linguistics
Fig. 10.2 The introductory page of Linguistic Survey of India
Fig. 10.3 First alphabetical page of The Century Dictionary and Cyclopedia

Fig. 14.1 Major domains of use of language corpora
Fig. 14.2 Growth of corpora after the introduction of the Brown corpus in 1961
Fig. 14.3 Mutual dependency between corpus linguistics and language technology
Fig. 15.1 Different types of the limitations of a corpus

List of Tables

Table 3.1 Method used for developing a speech corpus
Table 5.1 Present state of corpus annotation in English and Indian languages
Table 7.1 Restructuring Hindi and Bengali sentences
Table 7.2 Similar vocabulary of Bengali and Odia
Table 7.3 English–Bangla parallel translation corpus
Table 8.1 Domains and sub-domains of the Bangla Web Corpus
Table 11.1 Text samples in the Brown Corpus (1961)
Table 11.2 Text samples in the LOB Corpus (1978)
Table 11.3 Composition of the Brown Corpus and the LOB Corpus
Table 11.4 Categories of spoken text samples in ICE
Table 11.5 Categories of written text samples in ICE
Table 12.1 Components and total words of the first part of the ANC
Table 13.1 Speech part of the survey of English usage
Table 13.2 Composition of the Corpus of Spoken English
Table 13.3 Words in Wellington Corpus of Spoken New Zealand English
Table 13.4 Speaker and word counts in the MICASE
Table 14.1 A tentative scale on corpus generation over the years in languages
Table 14.2 People and the type of corpus they require
