Cambridge Business English Corpus (v. 1.0)


The Cambridge English Corpus contains a variety of British and American, spoken and written business English. This language, however, is not currently stored in one business English subcorpus (this feature is currently under development). When you begin working with the business texts in the Cambridge Sketch Engine, you will need to create a business subcorpus (or several business subcorpora) containing the texts that you wish to search.

Please note that the spoken Business English (CANBEC) is found:

• In the Cambridge International Corpus under “British Spoken” (not “Br Spoken”)
• In the Cambridge Spoken Corpus under “Br Canbec”

Contents

Getting started
1) What Business English is in the CEC and where can it be found?
2) Creating your own business subcorpus
3) Using the business subject codes

Looking at Business English in different contexts
4) Using the text types

Business English in the Learner Corpus
5) Business English exams in the CLC
6) Learner Corpus exam question papers

Creating, uploading and sharing new Business English corpora
7) Using Web BootCaT
8) Uploading your own text files
9) Sharing your corpora with others

Finding keywords in Business English
10) Extracting keywords from your uploaded corpora
11) Using Word List
12) Using Keyword List

1. What Business English is in the Cambridge English Corpus and where can it be found?

There is a range of specific business subcorpora in the Cambridge English Corpus. These are highlighted in blue in the table below. They are recommended for creating your business subcorpus.

There are also other subcorpora which contain some Business or Financial English, but which are not solely Business English subcorpora. The documents in these subcorpora have been subject coded according to topic. This means that by using the business subject codes it is possible to pull out the business-related English in these subcorpora. It is wise, however, to sample the data before analysing it, as some documents have multiple subject codes applied to them and may not primarily be concerned with business.

The main business subject codes are:

• F (Finance and business)
• BZ (Business and commerce)
• EC (Economics and finance)

Key: BSC = Business subject codes
For more on subject codes, please see “Using the business subject codes”

Written

British:
• BNC academic + BSC
• BNC written + BSC
• Economist newspapers
• Financial Times newspapers
• Guardian newspapers + BSC
• Independent newspapers + BSC
• Mail newspapers + BSC
• Mirror newspapers + BSC
• Scottish newspapers + BSC
• Telegraph + BSC
• Times + BSC
• Wolverhampton Business Corpus
• CUP Business Books

American:
• Am business magazines
• Am newspapers + BSC
• Am written + BSC

International / unspecified / other:
• CUP Journals + BSC

Spoken

British:
• CANBEC (found in the CIC and the CSC)
• BNC Spoken + BSC

American:
• Am News + BSC
• Am interview (TV) + BSC
• Am analysis (radio) + BSC

Learner written

• BEC Preliminary, Vantage, Higher, (1, 2, 3)
• ICFE (Financial)

Web data

• Web corpus Finance
• Web corpus Employment
• Web corpus Commerce


2. Creating your own business subcorpus

• I want to look at the same Business English sources every time I log into Sketch – can I save my selected sources?
• I want to make sure I use the same corpora as my co-authors / editors / researchers – how do I set up my subcorpus?
• I only want to look at written Business English – can I create a subcorpus of just written Business English?

Subcorpora can only be created inside one corpus. That is, they can’t be formed from data from more than one of the main corpora on the homepage, such as the CIC and the CSC, or the CLC and the CIC (although this is under development). For details of how to compare across corpora, see “Using Keyword List”.

To create your subcorpus:

1) Select the corpus that you wish to work in from the box on the homepage
2) On the concordance query page, select Text Types from the left menu (under Expert Options)
3) Next to the subcorpus box, click “Create new”

This brings up a list of all the different text types and subcorpora that you can choose to include in your new subcorpus. This page may take a little while to load. (See “Creating a subcorpus” below for further explanations of the following stages.)

4) Enter the name of your new subcorpus in the box at the top, e.g. Business written
5) Select the texts you wish to include in your subcorpus

Creating a subcorpus

• Name: e.g. Business written
• You can choose to see the number of tokens (words, punctuation marks, etc.) or the number of documents in each category
• Leaving these boxes blank will include both written and spoken English in your subcorpus, unless all the sources you have chosen only include one type of English
• Select the variety of English, e.g. American, British. If you want to include all types of English, leave all boxes blank
• If using non-business-specific subcorpora (e.g. Telegraph newspapers), you can filter by using the business subject codes
• Select the sources that you want to include (see the table in section 1 for details)

6) Once you have selected what you want to include, click “Create subcorpus”. This will then bring up the details of your new subcorpus, including how many tokens and/or words it contains
7) Now when you go to the concordance query page, Word List, or Word Sketch (for the latter, select advanced options), you can choose to search in your new subcorpus by highlighting it in the drop-down list

Your subcorpus will be saved, so every time you log into Sketch you can access it.

Tips:

• You can create multiple subcorpora (up to 6 subcorpora per main corpus)
• If you have any questions about what to include in your subcorpus, check with your editor and/or co-authors to make sure that you use the same data
• For further information on what Business English is available where, and what might be most suited to your needs, please contact a member of the Corpus team
• If you need to know how many words or tokens your subcorpus contains, click on “info”
• For a full breakdown of the number of words per text type in your subcorpus, go to Word List (see the instructions in section 4, “Using the Text Types”)

3. Using the business subject codes

• How do I search for a word / phrase in texts relating to a specific area of business, e.g. labour relations?
• Which subject codes should I use, and what do they mean?
• How can I search for results in all texts relating to business?

Many of the documents included in the CEC have been labelled with one or more subject codes. These form a hierarchical system of 1-, 2-, 3- or 4-letter codes which indicate that the document deals with a certain topic. The Text Types option on the concordance query page allows you to include documents associated with a specific subject area in your search results.

While most of the documents in the CEC are subject coded, please note that CANBEC is not. Because the filters are set up to exclude rather than add, if you search for a subject code and tick the box for CANBEC (or British Spoken in the CIC), your results will not include any CANBEC files. This means that it is not possible to create a subcorpus of all business written and spoken English in the CIC using the subject codes (i.e. you would have to exclude CANBEC). Likewise, in the CSC, you can choose either CANBEC or the other sources plus the business subject codes.

The hierarchical nature of the subject codes means that if an article is given a Level 4 code (one with four letters, e.g. MERG: “Mergers, monopolies, takeovers and joint ventures”, see diagram below), it will automatically be included in searches for its parent category F (Finance and business). An article about a takeover would therefore only be coded MERG; there is no need to add the code F, as coding an article with MERG automatically places it under F as well. E.g.

Level 1: F (FINANCE AND BUSINESS)
  Level 2: BZ (BUSINESS AND COMMERCE)
    Level 3: ACC (ACCOUNTING AND BOOKKEEPING)
    Level 3: MAR (MARKETING AND MERCHANDISING)
      Level 4: MERG (MERGERS, MONOPOLIES, TAKEOVERS, JOINT VENTURES)

Coding for MERG therefore also automatically includes F – BZ – MAR.

For a full list of the business subject codes in hierarchy, see Appendix 1. The codes (without their full descriptions) can be found in the subject area field of the text types. These can be expanded to view the child codes under each parent code. In the picture below, all the business codes (found under the parent code F) are shown. The hierarchy path is shown in the right-hand column:

Code  Description                                                     Hierarchy path
F     Finance and business                                            F
BZ    Business and commerce                                           F::BZ
ACC   Accounting and book-keeping                                     F::BZ::ACC
LAB   Staff and the workforce (incl. labour relations)                F::BZ::LAB
MAR   Marketing and merchandising                                     F::BZ::MAR
MERG  Mergers, monopolies, takeovers, joint ventures                  F::BZ::MAR::MERG
PUBL  Public relations                                                F::BZ::MAR::PUBL
RETA  Retailing                                                       F::BZ::MAR::RETA
SPON  Sponsorship                                                     F::BZ::MAR::SPON
OFF   Office practices and equipment                                  F::BZ::OFF
EC    Economics and finance                                           F::EC
CUR   Currencies, modern coins and monetary units                     F::EC::CUR
DEV   Economic development and growth                                 F::EC::DEV
ECY   Economic conditions and forecasts                               F::EC::ECY
FIN   Company finance (e.g. borrowing, shares issues, bankruptcies)   F::EC::FIN
INF   Inflation, prices, costs and earnings                           F::EC::INF
PEN   Pensions                                                        F::EC::PEN
PFE   Banking and personal finance                                    F::EC::PFE
TRA   International trade (imports and exports)                       F::EC::TRA
IN    Insurance                                                       F::IN
IV    Investment and stock exchange                                   F::IV
MOR   Mortgages and real estate                                       F::IV::MOR
MF    Manufacturing                                                   F::MF
BAR   Barrel-making                                                   F::MF::BAR
DYE   Paints, dyes and pigments                                       F::MF::DYE
FOU   Foundry (casting)                                               F::MF::FOU
GLA   Glass                                                           F::MF::GLA
HOR   Watches and clocks                                              F::MF::HOR
PLC   Plastics                                                        F::MF::PLC
PLT   Plating                                                         F::MF::PLT
RUB   Rubber                                                          F::MF::RUB
SME   Smelting                                                        F::MF::SME
TAN   Tanning and leather                                             F::MF::TAN
TEX   Textiles                                                        F::MF::TEX
TA    Taxation                                                        F::TA
ITX   Income tax                                                      F::TA::ITX
PTA   Local government taxation (e.g. Poll tax, council tax)          F::TA::PTA
VAT   Value added tax, purchase taxes, excise duties                  F::TA::VAT
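The way a child code implies its parents can be sketched in a few lines of Python. This is only an illustration of the hierarchy described above, not how Sketch Engine stores or filters subject codes; the dictionary holds a subset of the paths listed above, and the function names `ancestors` and `covered_by` are hypothetical.

```python
# A subset of the hierarchy paths listed above, keyed by subject code.
PATHS = {
    "F": "F",
    "BZ": "F::BZ",
    "ACC": "F::BZ::ACC",
    "LAB": "F::BZ::LAB",
    "MAR": "F::BZ::MAR",
    "MERG": "F::BZ::MAR::MERG",
    "EC": "F::EC",
    "FIN": "F::EC::FIN",
}


def ancestors(code):
    """Return the parent codes implied by `code`, top level first."""
    return PATHS[code].split("::")[:-1]


def covered_by(parent):
    """Return every listed code that a search for `parent` would take in."""
    return sorted(
        code for code, path in PATHS.items() if parent in path.split("::")
    )


print(ancestors("MERG"))   # ['F', 'BZ', 'MAR']
print(covered_by("MAR"))   # ['MAR', 'MERG']
```

Under this reading, a search for F takes in every code in the dictionary, which is why a document coded MERG does not also need the code F.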

Tips:

• Only use the subject codes (alone) if you are happy to exclude CANBEC from your results
• The more specific you can be about the topic you wish to focus on, the more relevant your results are likely to be
• To check a document source in your concordance, click on the document id in the left-hand column

Looking at Business English in different contexts

4. Using the Text Types

• How do I find out whether “facilitate” is more common in British or American Business English?
• How can I search for instances of “synergy” in only business magazines?
• How do I find out whether the phrase “to call a meeting” in Business English occurs more in newspapers or in non-fiction business books?

When searching in your business subcorpus for a word or structure, you might want to find out what types of document it occurs most frequently in, and to refine your results by looking only at such documents. To do this:

1) After you have made a concordance, go to Frequency – Text Types on the lower half of the left hand menu.

This gives you a list of the number of hits for your search term in each of the different document types.

2) You can then view examples from one category only, e.g. newspapers, by clicking on the “p” of “p” / “n” in the left-hand column next to that text type.

This takes you to a concordance of only these results. You can then continue to filter and analyse the data based on just these results.

For example, (depending on the make-up of your subcorpus) you can find out whether a word occurs more commonly in British or American Business English, or in newspapers compared to other documents. For this, you will also need to calculate the normalised frequency for your search item in the different text types in your subcorpus. To do this, you require the word count for the different text types in the subcorpus:

3) Go to Word List and scroll down to the Text Types box
4) Select your subcorpus, then “Variety of English” and click “Show Header Fields”.

This will return a breakdown of the types of English in your subcorpus. Remember to make sure that the list shows tokens (or words) rather than document count.

5) Now you can calculate the normalised frequency per million by dividing the number of hits in that text type by the total token count for that type of English, and multiplying the result by 1,000,000

6) You can then directly compare the frequency of your search term in the different text types

Tips:

• Normalised frequency per million = (No. of hits ÷ Total word count) × 1,000,000
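As a worked example of the calculation, here is a minimal Python sketch. The hit counts and token totals are invented for illustration and do not come from any Cambridge corpus; the function name `per_million` is hypothetical.

```python
def per_million(hits, total_tokens):
    """Normalised frequency: number of hits per million tokens."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return hits * 1_000_000 / total_tokens


# Hypothetical figures, for illustration only.
british = per_million(hits=120, total_tokens=4_000_000)
american = per_million(hits=90, total_tokens=1_500_000)

print(f"British:  {british:.1f} per million")   # 30.0
print(f"American: {american:.1f} per million")  # 60.0
```

In this invented case the British data has more raw hits, but the search term is twice as frequent in the American data once the different sizes of the two text types are taken into account.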


Business English and the Learner Corpus

• How can I search for words, phrases or errors in Business English exams only?
• I want to see what words Business English students make mistakes with, and at which levels
• How do I search for mistakes made by Business English students from a particular L1 / market?
• How can I find examples of candidates’ business reports, formal emails etc.?

5. Business English exams in the CLC

The Learner Corpus contains:

• BEC Preliminary
• BEC Vantage
• BEC Higher
• The older BEC 1, 2 and 3
• ICFE scripts (financial English)

To search only in these exams, you can use the text types to either

a) Create a subcorpus for these exams
b) Select the exams you want each time you perform a search

You can also choose to narrow your searches to look at candidates of a particular first language, nationality, age, or any of the other parameters found in the text types. For example, here a subcorpus of all BEC passes has been created:


From the text types you can also choose to look only at candidate answers which are of a certain:

• Style, e.g. business
• Format, e.g. where the question paper asks the candidate to write a report
• Register, e.g. where the question is designed to elicit a formal response

Tips:

• To search in more than one style, use the vertical bar to separate the styles, e.g. Business|Informative/news. (This is also the case with searches involving multiple first languages and nationalities, e.g. French|Spanish, China|Korea|Japan.)
• By combining the style and format options, you can search for business reports, business emails, business proposals, business letters etc.
• For more information on using the coded and uncoded Learner Corpus, please see the separate guide Using the Learner Corpus, available on the Cambridge Help pages on Sketch Engine
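The vertical bar in the tips above works like the “or” operator in a regular expression. The Python sketch below shows that behaviour, on the assumption that a text type field is matched as a whole against one of the alternatives; the list of first languages and the function name `matches_field` are hypothetical.

```python
import re


def matches_field(pattern, value):
    """True if `value` is exactly one of the alternatives in `pattern`."""
    return re.fullmatch(pattern, value) is not None


# Hypothetical first-language values, for illustration only.
first_languages = ["French", "Spanish", "German", "Korean"]

selected = [l1 for l1 in first_languages if matches_field("French|Spanish", l1)]
print(selected)  # ['French', 'Spanish']
```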


6. Learner Corpus Exam Question Papers

• How do I search for the types of questions business students have to answer?
• How can I look at exam papers which ask candidates to produce a report, formal business letter, etc.?

The Cambridge English Corpus also contains a corpus of exam questions from almost the full range of exams in the CLC. You can search for business exam question papers by entering the name(s) of the exam(s) in the Exam box in the Text Types. (NB: to search for multiple exams, use the vertical bar | to separate the names, e.g. BECH|BECP.)

You can also choose from the Text Types to look at questions of a particular format, style or register, e.g. Report, Business, Formal.

To view all question papers that fit into your selected categories:

1. Choose a CQL query and enter [] into the query box
2. Once you’ve made your concordance, press Frequency - Doc ID to see a list of exam papers

You can then view these individually by clicking on the “p” of “p” / “n” in the left column.

Alternatively, you can search for a specific word, structure or phrase in these question papers.

Creating, uploading and sharing new Business English corpora

You can also use Sketch Engine’s analysis tools for other data: you can create and analyse your own specialised Business English corpora by uploading text files or using Sketch’s Web BootCaT function to pull in samples of Business English from the Internet.

Permissions

If you are using Web BootCaT, the permissions status for the corpus you create is the same as for other corpora in the Cambridge English Corpus (i.e. unidentifiable quotes of no more than 20 words can be used; for longer quotes, please contact the Permissions Unit). When uploading your own text files, please ensure that you have the consent of speakers / contributors. Please make sure any quotes used are unidentifiable.

Token limits

There are some limits on the size of corpora you can create:

• You can create / upload corpora of up to 1,000,000 tokens in total
• If you are given access rights to another user’s corpus (i.e. it appears under “Other users’ corpora” on the homepage), then this does not count towards your token limit. Sharing corpora is therefore a good way of getting access to more data without having to delete corpora you have already created
• If you need to create a corpus of more than 1,000,000 tokens, please contact a member of the Corpus team
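Before uploading, it can help to estimate how close a set of text files is to the limit. The Python sketch below is a rough approximation only: it counts each word and each punctuation mark as one token, which may not match the count Sketch reports after compiling. The function names and the folder name in the comment are hypothetical.

```python
import re
from pathlib import Path

TOKEN_LIMIT = 1_000_000  # total limit stated in this guide


def rough_token_count(text):
    """Count words and punctuation marks as separate tokens (approximation)."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def folder_token_count(folder):
    """Sum the rough token counts of every .txt file in `folder`."""
    return sum(
        rough_token_count(path.read_text(encoding="utf-8", errors="replace"))
        for path in Path(folder).glob("*.txt")
    )


print(rough_token_count("Please find the agenda attached."))  # 6

# Example use with a hypothetical folder of plain text files:
# total = folder_token_count("my_business_emails")
# print(total, "tokens;", "over" if total > TOKEN_LIMIT else "within", "the limit")
```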


7. Using Web BootCaT

• Can I create a corpus of web Business English and analyse it in Sketch?
• How can I create a corpus that focuses on a particular business domain?
• There’s a lot of Business English available on the Internet, but how can I collect it and analyse it?

You can create your own corpus of Business English by capturing data from the Internet using the Web BootCaT function. It can be used to compile a corpus on:

• A particular topic or subject not currently found within the Corpus
• A specific area of business that you would like to investigate further

You can either choose to input a list of seed words, or of URLs that relate to your topic. Seed words are terms that are expected to be typical of the domain of interest. For example, to generate a business meeting corpus, you might use seed words such as meeting, agenda, minutes, conference call, etc. BootCaT then generates a corpus based on searches for these seed words.

Using Web BootCaT, you can look for examples of agendas, minutes, reports, presentations and other business document types not included in the current Text Types. By clicking on a document’s id in your concordance, you can find the link to the URL from which the document comes, so that you can see the text in its original context.

Tips:

• In the CIC, there are a number of pre-existing corpora available (see the sources field of the Text Types) which have been built from the web. These include separate corpora for Commerce, Finance and Employment. The seed words for these have been taken from the DANTE project (http://www.webdante.net/index.html)
• For more information and full instructions on using Web BootCaT to compile and share web corpora, please see Cambridge Sketch Engine – Advanced Help.

8. Uploading your own text files

• Can I analyse my own collection of business texts using Sketch?

If you have your own corpus of text files (for example, if you have permission to use a set of business emails that you have collected), then you can upload these into Sketch to create your own personal corpus. To do this:

1) Save your documents as plain text files (.txt)
2) From the homepage, go to Create Corpus on the left-hand menu (see diagram below)

3) Give your corpus a suitable name (containing no spaces) as its ID and name e.g. Business_emails

4) Enter a short description into the Info box (this text will be displayed alongside the corpus name on the homepage)

5) Select English from the drop-down list of languages. Then click “Next”

6) Choose Tree Tagger for English from the list of Sketch grammars

7) Click “Finish”

Now you can begin to upload your files. Click “Add new file”. You can upload files from your computer, use the FTP site, or download from the web. To upload files from your computer:

8) Select “Upload from disk” and browse to find the file you wish to upload
9) Open the file and press “Next”. The text file will then appear in a preview pane. If it displays without problems, click “Finish”

To upload files via the FTP site:

10) Launch Filezilla on your computer
11) In Sketch, on the “Add new file” page, click on the link to the Sketch FTP
12) Enter the host name that appears on the Sketch FTP into Filezilla, along with your Sketch user name and password
13) You can now transfer files across from your computer to the Sketch FTP directories. For more information on using Filezilla, please see the separate guide Using the FTP site
14) On the “Add new file” page, your files should now appear in the drop-down box. You can now select these and add them to your corpus

Compiling your corpus:

15) Once you’ve finished adding all your files, click “Compile corpus”, then “Compile”
16) Press OK. Then press “Open in Sketch” to search in your corpus

Your new corpus will now appear on the homepage under My Corpora. You can open it in Sketch from the homepage by clicking on the magnifying glass icon next to it.

9. Sharing your corpora with others

• How do I share my own corpus of text files with my co-authors / other Sketch users?

If you want to share the files you’ve uploaded with another person on Sketch so that you can both access them:

1) Go to My Corpora on the homepage and then click on the pen symbol next to your corpus (labelled “Edit corpus configuration” when you hover the cursor over it)

2) Select “Access privileges” from the left hand menu

3) You can then nominate other Sketch users to have read-only, file uploading, or full access rights to your corpus

Your corpus will then appear under “Other users’ corpora” on these users’ homepages.


Finding keywords in Business English

10. Extracting keywords from your uploaded corpora

• Which words are more frequent in my new corpus compared to general English?
• How does my uploaded corpus differ from another corpus?

To see which words are most common in your new, uploaded corpus compared to general English, you can get Sketch to extract the keywords from the corpus you’ve compiled. For example, this can be used to compare the most common words in business emails with general English.

1) Click on “Edit corpus configuration” in the table beside your corpus
2) Choose “Extract keywords” from the left-hand menu
3) Choose a reference corpus to compare your corpus to from the drop-down list (e.g. the CIC for general English, see diagram below)
4) You can choose to search for key words, lemmas, or lower case words (lc)
5) Depending on the size of your corpus, you may need to alter the minimum frequency of keywords. (The default is 50, but if your corpus is comparatively small then you may need to lower this slightly in order to include enough results.)

The default is to exclude very common / short function words. Click on the link to see the list of excluded words.

6) Click OK. Sketch will then return a list of the words which are significantly frequent in your corpus compared to the reference corpus

NB: If you are using text files with headers, you will need to discount the words in the header fields from the list of keywords, e.g. DATE, TITLE, February

11. Using Word List

• How can I find which words are more frequent in spoken English?
• How can I find all the most frequent words ending in e.g. “-ing”, or beginning with “pre-”, in my subcorpus?

• Can I compare an existing word list I have with a word list for my [written] Business English subcorpus?

• How do I find out what the most common words in written Business English are?

Word List is a quick way of finding the most common words in your subcorpus.

1) After selecting the corpus in which your subcorpus is located from the homepage, go to Word List on the top left-hand menu
2) Choose your subcorpus from the drop-down list on the Word List entry form
3) You can search for a word, a tag, a lemma etc.
4) Click “Make Word List” to see a list of the most frequent words / tags / lemmas etc. The raw frequency is given alongside the frequency per million for each word

Tips:

• If you want to search for a particular set of words, e.g. all words ending in “ing”, in your business subcorpus, use the Pattern field and search for .*ing
• To find all words in your business subcorpus with the prefix “pre”, search for pre.*
• For example, you could search for the most frequent words beginning with “bus” (e.g. business, businessman, business-focused) by searching for bus.*

If you already have a word list from another source and you wish to compare the order of frequency of the words on the list with your subcorpus, you can do this using the “List from file” option:

1. Save your word list as a plain text file (.txt) or paste the list into Notepad and save it
2. Load it by clicking on “Browse” next to “List from file”, finding your file and opening it
3. Click “Make word list”
4. Compare the new word list with your original. (Click on the heading “Frequency” to see the results in order of frequency.)
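The Pattern field examples in the tips above (.*ing, pre.* and bus.*) behave like regular expressions matched against the whole word. The following Python sketch shows what each pattern would pick out from a small invented word list; it illustrates the pattern syntax only, and the word list and the function name `pattern_matches` are hypothetical.

```python
import re

# Invented word list, for illustration only.
words = ["business", "businessman", "meeting", "pre-tax", "prepare", "bus", "rebus"]


def pattern_matches(pattern, items):
    """Return the items that the pattern matches from start to end."""
    return [word for word in items if re.fullmatch(pattern, word)]


print(pattern_matches(r".*ing", words))  # ['meeting']
print(pattern_matches(r"pre.*", words))  # ['pre-tax', 'prepare']
print(pattern_matches(r"bus.*", words))  # ['business', 'businessman', 'bus']
```

Because the whole word has to match, bus.* does not pick out “rebus”, where “bus” is not at the start.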


12. Using Keyword List

• What are key words for Business English, compared to general English?
• How can I compare results across corpora?

• How can I find out which words are particularly associated with Business English?

The Keyword List is a useful way of viewing how English in business contexts differs from general English (e.g. English in the Cambridge International Corpus). It can be used to see which words are used significantly more in Business English compared to general English. You can then explore how these keywords are used in Business English.

You can also use the Keyword List to compare spoken and written Business English, or American and British English (or any other subcorpora that you have created).

The Keyword List is currently the only way to compare across corpora. It can compare subcorpus to subcorpus, or subcorpus to corpus (e.g. Business English to the whole of the Cambridge International Corpus).

1) Select the corpus from which you wish to create a subcorpus (e.g. Cambridge Spoken Corpus)
2) Select Word List from the left menu bar
3) Go to the Keywords box
4) Choose the subcorpus you want to find keywords in from the drop-down menu, or create a new subcorpus by clicking on “create new subcorpus”
5) Select the corpus you want to compare keywords with
6) Choose either to compare keywords with the whole of this corpus, or with a subcorpus (select from the drop-down menu)
7) Choose which attribute you want to search for (e.g. words, tags, lemmas, lower case words)
8) Click “Find Keywords”. Sketch will now process the data. This may take a minute or two. Sketch will return a list of words which are significantly more common in your chosen subcorpus than in the corpus to which you are comparing it.
9) You can then investigate the uses of these words in business contexts by clicking on the frequency score. This brings up a concordance of uses of this word from your business subcorpus. From this stage you can look at frequency, collocations etc. as you would a normal concordance.
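To make the idea of a keyword concrete, here is a minimal Python sketch that ranks words by the ratio of their normalised frequency in a focus text to that in a reference text. This guide does not say which statistic Sketch uses, so the scoring formula, the `smoothing` constant and the function name `keyness` are all assumptions made for illustration, and the two token lists are invented.

```python
from collections import Counter


def keyness(focus_tokens, reference_tokens, smoothing=1.0, min_freq=1):
    """Rank focus words by smoothed per-million frequency ratio (illustrative)."""
    if not focus_tokens or not reference_tokens:
        raise ValueError("both token lists must be non-empty")
    focus, ref = Counter(focus_tokens), Counter(reference_tokens)
    scores = {}
    for word, count in focus.items():
        if count < min_freq:
            continue  # skip words below the minimum frequency
        fpm_focus = count * 1_000_000 / len(focus_tokens)
        fpm_ref = ref[word] * 1_000_000 / len(reference_tokens)
        # Smoothing avoids division by zero for words absent from the reference.
        scores[word] = (fpm_focus + smoothing) / (fpm_ref + smoothing)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented token lists, for illustration only.
business = "the merger agenda the merger meeting".split()
general = "the cat sat on the mat the meeting".split()

for word, score in keyness(business, general):
    print(word, round(score, 2))
```

Words that are frequent in the business text but rare or absent in the reference text (here “merger” and “agenda”) come out at the top, while a word that is common in both (“the”) comes out at the bottom.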

For further help in using the Business corpora available in Sketch, contact: [email protected]