SILVIA BERNARDINI, ADRIANO FERRARESI, FEDERICO GASPARI

Institutional academic English in the European context: a web-as-corpus approach to comparing native and non-native language

1. Introduction and overview

In this contribution we present a corpus-based analysis of institu-

tional English as used in the Italian academic context. In order to

make this multi-faceted object of study more manageable and fo-

cused, the investigation is limited to academic websites. These are

viewed as particularly relevant inasmuch as they provide a powerful

means of making contents available to a vast audience, including,

crucially, international students. Producing appropriate and effective

web texts of an institutional nature in English is a must for institu-

tions in non-English speaking countries in order to favour EU-wide

student mobility and to attract prospective students from outside the

EHEA. From a descriptive / theoretical point of view, studies of aca-

demic discourse conducted so far have mainly focused on discipli-

nary academic English, and especially on scientific writing. Institu-

tional English produced within academia has received much less

attention, with the exception of a few landmark publications (notably

Fairclough 1993 and Biber 2006). Hence the relevance of the present

investigation.

The contribution has a double focus. In the first part the poten-

tial and limitations of the “web-as-corpus” methodology for special-

ised comparable corpus construction are illustrated. We describe the

semi-automatic process through which we collected English-language

texts published on the web by Italian universities; a similar approach

was adopted to build a matching sub-corpus of UK and Irish websites,

which afford examples of native English standards within the EU.

The corpus-building procedure as well as the corpus structure and

contents are described in section 2.2. Since the procedure is semi-

automatic and allows limited control over the corpus contents, we

carried out a preliminary analysis to ensure that the resulting data

sets are appropriate for the purposes of our study. In line with current

work on web-as-corpus methods (Sharoff 2006), we present two ways

of assessing to what extent the two corpora may be regarded as com-

parable in terms of topics covered and (broadly speaking) text types

included. This methodologically-oriented exploration is reported in

section 2.3.

The second part of the contribution has a more descriptive focus,

seeking to shed light on the characteristics of institutional academic

English in Italian websites. As part of the efforts to achieve the

demanding strategic objectives of the “Bologna Process”, universities

need to disseminate information on the web in English. On the whole,

Italian universities have implemented this requirement to different

degrees, and preliminary investigations reveal a rather disappointing

situation. Interventions aimed at supporting multilingual communi-

cative strategies are therefore needed to strengthen internationalisation

policies. To date, however, no in-depth studies have been devoted to

the discursive features of institutional English as it is used on the

websites of Italian universities, nor has this “lingua franca” variety

of English been compared to native varieties within the EU context.

The crucial importance of English as a lingua franca, especially in

scientific and academic international settings, is nowadays widely

recognised and has stimulated a number of comparative studies (some

corpus-based) analysing non-native varieties against the background

of standard “native” varieties (Seidlhofer 2001, Mauranen 2003,

Jenkins 2007).

In section 3 we analyse institutional academic language in its

native and lingua-franca varieties, taking as a starting point the analysis

of institutional university registers offered in Biber (2006). A com-

parative analysis of characteristic lexical bundles and of ways of ex-

pressing stance (more direct / indirect forms of obligation and neces-

sity) is carried out. Our findings suggest that the Italian institutions

make lighter use of set phrases assisting navigation and positively

evaluating themselves, and show a dispreference for personal style

and for the more indirect stance expressions. As a result, these texts

come across as more directive, and the institutions who published

them as arguably more remote than is the case in the native corpus.

The study is part of a larger project which in the longer term

seeks to provide resources for Italian professional writers and translators

working with institutional academic English, and which aims

to set a standard for an EU-wide pool of corpora representing non-

native varieties of English used by universities in a number of Euro-

pean countries.

2. Web-as-corpus methods for specialised corpus construction

2.1 Previous work

In the last decade the use of web data has become common practice

in corpus linguistics research. Witness to this is a large and growing

literature on web-as-corpus approaches to corpus building and us-

age, which are adopted for purposes as diverse as terminology ex-

traction (Castagnoli 2006) and register analysis (Biber / Kurjian 2007).

The term web-as-corpus is normally used for two main methodologi-

cal paradigms. The first consists in considering the web per se as a

corpus. Within this paradigm, web data are accessed either through

existing commercial search engines or through post-processors of

search engines’ output, such as WebCorp,1 which are intended to

present data in a “linguist-friendly” format. This approach is particu-

larly apt for investigations on pre-defined linguistic phenomena, e. g.

re-lexicalisation of specific terms (Brekke 2000) or infrequent bigram

identification (Keller / Lapata 2003), but is less well suited when considerations

regarding text types or domains are central to the research

questions. For this reason, this approach will not be considered here;

for a discussion of its advantages and limitations see, e. g., Lüdeling

et al. (2007) and Kilgarriff (2007).

1 <http://www.webcorp.org.uk/>.

The second paradigm within web-as-corpus research consists

in drawing on the enormous amounts of textual material available on

the web to compile off-line corpora, usually relying on (semi-)auto-

matic text selection and download procedures. These involve either

performing a customised crawl of specific websites, which are con-

sidered as representative of the specific text type / topic of interest,

e. g., the CNN transcripts as instances of broadcast language (Hoffman

2007), or adopting ad hoc strategies relying on search engines to

retrieve potentially relevant web pages, which are then downloaded

for corpus compilation (Baroni / Bernardini 2004, Leturia et al. 2008).

Of course, this approach is not devoid of problems. First, web data

tend to be very noisy, i. e. they contain duplicate pages, pages pro-

duced automatically by servers, and “boilerplate” – a term first intro-

duced by Fletcher (2004) to indicate portions of text which are re-

peated across the pages of a site, e. g. navigation bars, copyright

notices, etc., which tend to distort statistics about corpus composition.

Secondly, and perhaps more crucially, automatic procedures of cor-

pus construction, while making it possible to build relatively large

corpora in little time, allow for limited control over corpus contents.

As Baroni / Ueyama (2006: 2) point out, however, these are typical

problems of “quick and dirty” corpora, rather than web corpora per

se, and it is ultimately the trade-off between costs (in terms of time,

funding, etc.) and quality / usefulness of the resource that should be

considered when deciding what methods are to be adopted for its

construction.

In fact, several papers have been devoted to highlighting the

advantages of using web data (see, e. g., Kilgarriff / Grefenstette 2003,

Fletcher 2004). Web texts are in machine readable format, thus fa-

cilitating their retrieval and processing for inclusion in a corpus; the

web is constantly updated, and has been demonstrated to be a valid

resource for investigating contemporary language usage (Lüdeling et

al. 2007); and it makes available linguistic materials which can hardly

be found elsewhere, i. e., samples of specialised languages and web-

based genres. Often, corpus resources simply do not exist or do not

provide enough samples of specific language varieties, such as, e. g.,

medical English (Gatto 2009: 101 ff.). In these circumstances, the

web, with its extreme heterogeneity of text types and topics and its

up-to-dateness, seems to be the obvious place to turn to. As for web-

based genres (e. g. blogs, chat rooms), it has been suggested that they

display peculiar communicative structures which set them apart from

“traditional” written texts (Santini 2007), and thus deserve dedicated

analyses.

An example of research following the second paradigm, and

focusing on institutional academic language is Thelwall (2005). Us-

ing a customised crawler, the author builds a corpus of pages from

university websites of three English-speaking countries (Australia,

New Zealand and the UK), and then carries out a preliminary evalu-

ation of its contents based on a frequency analysis of the most com-

mon words in its three components and in the written component of

the BNC. This is intended to pinpoint differences both across the

national provenances sampled in the corpus, and of academic web

English when compared to “general” English.

2.2 Building acWaC

For the purposes of this study, we needed a relatively large and up-

to-date corpus which would represent contemporary English in the

websites of Italian universities, focusing in particular on “institutional”

communication (Biber 2006). In order to highlight features of this

variety, we also needed a benchmark for comparison, and we settled

on texts published on the websites of British and Irish universities,

taken as a native standard within the European Union. Following

Thelwall (2005) and adopting web-as-corpus semi-automatic con-

struction procedures seemed the most obvious choice to make (cf.

section 2.1). Unlike Thelwall, who performed a customised crawl of

university websites, we used the BootCaT toolkit (Baroni / Bernardini

2004), a set of Perl scripts for rapidly building specialised, ad

hoc corpora. Using BootCaT is less labour intensive and technically

demanding than setting up a personal crawler. The resulting mono-

lingual comparable corpus is called acWaC, i. e., academic Web-As-

Corpus.

The first step of the BootCaT procedure consists in manually

identifying relevant “seeds”, i. e. words or word combinations that

are assumed to be characteristic of the language variety of interest.

For research focusing on domain-specific varieties of a language,

seeds are usually key terms of that domain. In our case, however, the

guiding criterion in corpus construction was not gathering texts ac-

cording to their domain. Rather, they had to share particular text prin-

cipals (to borrow Goffman’s (1981) term), i. e. the texts had to be

made available over the web by universities, considered as the enti-

ties responsible for their production (regardless of the actual draft-

ers). A preliminary browsing of the websites of Italian universities

was therefore carried out, in order to assess which of them feature

substantial amounts of English language contents (based on our

checks, slightly more than half do). For the sake of sub-corpus com-

parability, the number of UK institutions had to be narrowed down

with respect to the totality of the available websites. We decided to

include the 20 universities of the Russell Group for the UK,2 and all

Irish universities. The seeds for our search consisted therefore in a

set of URLs corresponding to (English language) websites of the

universities which were identified during this preliminary phase, plus

a few common English words like “the”, “of” and “and”.

In the second step of the procedure, the URLs (used as argu-

ments of the site: operator) and the seed words are submitted as query

terms to a search engine (either Google or Yahoo!). Further heuristics

can be adopted to maximise the precision of the results: we employed

the language filter provided by Google, and excluded pdf files (through

the filetype: operator) so as to limit the impact of disciplinary writing

(e. g. research papers), often published in this format on the web. The

search engine produced a list of URLs for each pre-determined

website, and a maximum of 300 documents per site were downloaded.

Notice that since the procedure relies on search engines’ ranking al-

gorithms, and since we download the first 300 pages, results depend

to a large extent on the way a particular website is indexed by the

search engine itself. It is likely that the pages that will end up in the

corpus are skewed towards the more “popular” ones (Gatto 2009:

51–52), but this is not considered a problem here, since a) we are

interested in analysing the documents that a student is likely to encounter

on a university website, and b) the same retrieval procedure

is used for the Italian and UK / Irish sub-corpora (henceforth, respectively,

IT-UNI and EN-UNI).

2 The Russell Group is an association of top-level universities in the United Kingdom (<http://www.russellgroup.ac.uk/>).
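The query step described above can be sketched as follows. This is only an illustration, not the BootCaT code itself: the function names, the example site and the `-filetype:pdf` exclusion syntax are assumptions made for the sake of the example, and the actual submission of queries to a search engine is left out.

```python
from itertools import product


def build_queries(sites, seed_words, exclude_filetype="pdf"):
    """Return one query string per (site, seed word) pair.

    Each query restricts results to a single university website and
    excludes a file type (assumed syntax: "-filetype:pdf").
    """
    return [
        f"{word} site:{site} -filetype:{exclude_filetype}"
        for site, word in product(sites, seed_words)
    ]


def cap_per_site(urls_by_site, max_docs=300):
    """Keep at most max_docs URLs per site, preserving the ranking order."""
    return {site: urls[:max_docs] for site, urls in urls_by_site.items()}


# Hypothetical site name, used only for illustration.
queries = build_queries(["www.example-uni.it"], ["the", "of", "and"])
for query in queries:
    print(query)
```

Because the cap is applied to the ranked list returned by the engine, the sketch inherits the bias towards “popular” pages discussed above.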

In the final phase, documents were POS-tagged and lemmatised

using the TreeTagger3 and indexed for corpus consultation with the

CorpusWorkBench.4 Table 1 provides size information about the

acWaC corpus.

acWaC                         IT-UNI      EN-UNI
Tokens                        4,228,841   5,435,855
Types                         165,037     125,089
Documents                     6,745       7,721
University websites sampled   55          28

Table 1. Size information of the acWaC corpus.
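Figures of the kind reported in table 1 can be derived directly from tagger output. The sketch below assumes the tab-separated “word, part of speech, lemma” format that TreeTagger typically produces, and it assumes that types are counted over lower-cased word forms; the source does not state how types were counted, so this choice is illustrative only.

```python
def corpus_size(tagged_lines):
    """Count tokens and distinct word forms (types) in tagger output.

    Each valid line is expected to hold three tab-separated fields:
    word form, part-of-speech tag, lemma. Other lines are skipped.
    """
    tokens = 0
    types = set()
    for line in tagged_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip markup lines or malformed rows
        word, _pos, _lemma = parts
        tokens += 1
        types.add(word.lower())  # assumption: case-insensitive types
    return tokens, len(types)


# Toy input, not taken from acWaC.
sample = ["The\tDT\tthe", "courses\tNNS\tcourse", "the\tDT\tthe", "<s>"]
print(corpus_size(sample))  # (3, 2)
```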

Notice that, despite the semi-automatic procedure that was employed,

the size of the two sub-corpora is roughly comparable. A point that

should also be stressed is that one of the major strengths of this pro-

cedure is that it is easily replicable, and can be taken as a model to

rapidly build similar corpora for universities based in other coun-

tries, as well as making it possible to track diachronic changes within

the same websites.

There remains one open issue: how certain are we that the procedure

worked, i. e. that it retrieved documents matching our expectations?

When building a corpus from the web using automated queries to a

search engine, one has to always bear in mind that the retrieved docu-

ments may simply not match the targeted contents. While this has

never been a problem in traditional corpus construction, where texts

for the corpus are manually selected, checking for it is a crucial step in web-as-corpus

projects. Ideally, one would want to quickly read through all

the documents in the corpus, or at least a large enough number of

them. However, this is hardly feasible for a corpus of about 14,000

documents, as well as being at odds with the rationale behind the

web-as-corpus methodology, which is meant to speed up and ease the

procedure of corpus construction. The next section looks at the methods

we employed to evaluate contents and corpus comparability.

3 <http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/>.

4 <http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/>.

2.3 Evaluating acWaC

In this project we adopted two methods of corpus evaluation / com-

parison previously used in the literature. First of all, as an initial quali-

tative step, we randomly selected 200 documents from each sub-cor-

pus, read through them and classified them in terms of (broad) topic /

genre classes. This was a cost-effective procedure that partly followed

Sharoff (2006), without attempting to bring in the burgeoning

literature on genre classification (see e. g. Lee 2001; Santini 2007).

Secondly, we retrieved lists of lemmas and of part of speech sequences

from the two sub-corpora, cleaned them, and compared them using a

statistical measure that reveals the words and lexico-grammatical

structures that are more characteristic of one sub-corpus compared to

the other (Sharoff 2006, Ferraresi et al. 2008).

These exploratory analyses confirmed the overall comparabil-

ity of the two sub-corpora, as well as providing initial evidence of

some differences that were further investigated in the analysis proper

(section 3). In turn, the latter confirmed that the corpus construction

procedure had been successful in retrieving comparable texts repre-

senting the language varieties under investigation, following the con-

struction-evaluation-use virtuous circle advocated by Atkins et al.

(1992) in their classic work on corpus design.

2.3.1 Comparing samples of texts from the sub-corpora

Two random samples of 200 texts were extracted from EN-UNI and

IT-UNI and classified in broadly functional terms. The categories

were developed bottom-up and refined through several rounds of

analysis until all the texts under scrutiny had been accounted for. The

results are reported in table 2.

Text category                                              IT-UNI   EN-UNI
Description of research centres, departments, committees   40       62
Description of courses, degrees, modules                   48       34
News, events, life                                         26       47
Regulations                                                17       11
Personal pages                                             23       14
Disciplinary writing                                       26       21
Web navigation                                             3        3
Mixed or unclassifiable                                    17       8

Table 2. Random samples (200 texts each) from IT-UNI and EN-UNI compared.
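To make the differences in table 2 easier to read, the counts can be converted into percentage-point gaps between the two samples. The sketch below simply re-expresses the figures in the table; the function name and the sign convention (IT-UNI minus EN-UNI) are our own choices for illustration.

```python
# Counts copied from table 2: (IT-UNI, EN-UNI).
TABLE_2 = {
    "Description of research centres, departments, committees": (40, 62),
    "Description of courses, degrees, modules": (48, 34),
    "News, events, life": (26, 47),
    "Regulations": (17, 11),
    "Personal pages": (23, 14),
    "Disciplinary writing": (26, 21),
    "Web navigation": (3, 3),
    "Mixed or unclassifiable": (17, 8),
}


def percentage_gaps(table):
    """Return {category: IT-UNI share minus EN-UNI share}, in percentage points."""
    it_total = sum(it for it, _ in table.values())
    en_total = sum(en for _, en in table.values())
    return {
        category: round(100 * it / it_total - 100 * en / en_total, 1)
        for category, (it, en) in table.items()
    }


gaps = percentage_gaps(TABLE_2)
# Negative values: more frequent in EN-UNI; positive values: more frequent in IT-UNI.
for category, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{gap:+6.1f}  {category}")
```

With samples of 200 texts each, such gaps are indicative only and are not meant as a significance test.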

The distribution of texts in the different categories is not identical in

the two sub-corpora. The British / Irish part contains more descrip-

tions of bodies within a given institution (faculties, research centres,

committees, etc.) and more pages referring to current events and stu-

dents’ life, while the Italian counterpart contains more pages describ-

ing the courses on offer, regulatory texts (e. g. agreements), personal

pages and disciplinary writing (e. g. academic papers).

This distribution suggests that the English language contents

provided by Italian websites focus on the institution’s educational

offer and on exchange projects (the regulatory texts); that there is an

attempt on the part of individuals to make their research / teaching

activity known internationally, and possibly that more disciplinary

writing gets published on the web in html format than in pdf (remem-

ber that the pdf files were excluded from the crawls). The under-

represented categories also tell us that institutions in Italy seem more

concerned with the “formal” aspects of students’ exchange projects

than with their daily and social life, and, more interestingly perhaps,

that they do not feel as compelled as the UK / Irish institutions to

describe and ultimately advertise themselves (their research and teach-

ing credentials, their facilities, etc.), as opposed to the courses they

offer. Lastly, the higher number of mixed or unclassifiable texts in

IT-UNI testifies to the greater difficulty of harvesting texts in one



language from websites that are written mainly in another, as op-

posed to single-language websites. This is an unavoidable consequence

of the methodology and research design, and possibly of the object

of study itself, but would not seem to impact on the comparison to an

extent that could distort the general picture.

Though based on relatively few texts, this comparison has high-

lighted possible differences between the two language varieties rep-

resented in the corpus, to be further investigated in the analysis proper.

In terms of evaluation of the corpus construction methodology, there

is no indication that the automatic procedure has gone awry, yielding

non-comparable sub-corpora.

2.3.2 Comparing word- and ngram-lists

As a second step in the evaluation of corpus contents, we adopted a

methodology that is widely used to assess the composition and char-

acteristic linguistic features of corpora, both web-derived (Sharoff

2006) and “traditional” (Rayson / Garside 2000). This consists in com-

paring a frequency list obtained from the corpus of interest with a

benchmark, using log-likelihood as a statistical association measure,

which, unlike Mutual Information or t-score, has been proved to be

independent of corpus size (Dunning 1993).
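A minimal sketch of this comparison is given below. It assumes the two-corpus form of the log-likelihood statistic commonly used for keyword analysis (cf. Rayson / Garside 2000); the function names and the toy frequencies are hypothetical, and the sketch is not the script actually used in the study.

```python
import math


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Log-likelihood score for one item observed in corpora A and B."""
    total_freq = freq_a + freq_b
    # Expected frequencies if the item were equally likely in both corpora.
    expected_a = size_a * total_freq / (size_a + size_b)
    expected_b = size_b * total_freq / (size_a + size_b)
    score = 0.0
    if freq_a > 0:
        score += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        score += freq_b * math.log(freq_b / expected_b)
    return 2 * score


def keyness_ranking(counts_a, counts_b, size_a, size_b):
    """Rank the items that are relatively more frequent in A than in benchmark B."""
    scored = []
    for item, freq_a in counts_a.items():
        freq_b = counts_b.get(item, 0)
        if freq_a / size_a > freq_b / size_b:
            scored.append((item, log_likelihood(freq_a, freq_b, size_a, size_b)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Token counts from table 1; the lemma frequencies are invented for illustration.
IT_SIZE, EN_SIZE = 4_228_841, 5_435_855
it_counts = {"credit": 3000, "research": 9000}
en_counts = {"credit": 1200, "research": 20000}
print(keyness_ranking(it_counts, en_counts, IT_SIZE, EN_SIZE))
```

Swapping the two corpora in the call yields the list for the other sub-corpus, which corresponds to using each list in turn as a benchmark for the other.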

We extracted lists of lemmas and of sequences of 3 parts of

speech (henceforth 3-grams) from both sub-corpora, and compared

them using each list in turn as a benchmark for the other. Taken to-

gether, these lists can give us an idea of lexico-grammatical regulari-

ties in the two sub-corpora, and, crucially, reveal salient differences

between them. As a pre-processing step, which was aimed at reduc-

ing the amount of noise in the lists, lemmas unknown to the tagger

were filtered out, along with proper nouns and words containing non-

alphabetical characters. For each (ranked) list pertaining to IT-UNI

and EN-UNI, we generated and analysed concordances for the top 50

entries in the lemma lists and for the top 5 ones in the POS 3-gram

lists, taking them as clues to salient corpus differences in terms of

(broad) topic categories and functional linguistic features.
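The pre-processing step can be illustrated with the following sketch. It assumes that unknown lemmas are marked as `<unknown>` and that proper nouns carry the tags `NP` or `NPS`, as is typical of TreeTagger output for English; both values are parameters and should be adjusted to the tagset actually in use.

```python
from collections import Counter


def clean_lemma_counts(tagged_tokens, unknown="<unknown>", proper_tags=("NP", "NPS")):
    """Count lemmas after dropping unknown lemmas, proper nouns and
    lemmas containing non-alphabetical characters.

    tagged_tokens is an iterable of (word, pos, lemma) triples.
    """
    counts = Counter()
    for _word, pos, lemma in tagged_tokens:
        if lemma == unknown or pos in proper_tags:
            continue
        if not lemma.isalpha():
            continue  # e.g. "e-mail", "2008", "c/o"
        counts[lemma.lower()] += 1
    return counts


# Toy input, not taken from acWaC.
tokens = [
    ("credits", "NNS", "credit"),
    ("Bologna", "NP", "Bologna"),
    ("xyzzy", "NN", "<unknown>"),
    ("e-mail", "NN", "e-mail"),
    ("Credit", "NN", "credit"),
]
print(clean_lemma_counts(tokens))  # Counter({'credit': 2})
```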

2.3.2.1 Comparing lemma lists

The 50 lemmas most typical of IT-UNI when compared to EN-UNI

can be grouped into three broad topic / function categories (see table

3 for examples). Notice that in this analysis we do not take into account

words appearing in boilerplate sections of the web pages, which ac-

count for nearly 50% of the key lemmas found both for IT-UNI (e. g.

“fax”) and EN-UNI (e. g. “accessibility”). These words typically appear

in portions of text which are repeated identically across different pages

of the same website. While boilerplate text might be an interesting

object of study (for a discussion, cf. section 3.1), the very fact that it

is “repeated text” is most likely to distort frequency data, and hence

blur the analysis of its “typicality” in one corpus (Fletcher 2004).

IT-UNI

1. Institutional activities 2. Relations with other institutions 3. Academic / disciplinary

Credit agreement scientific

Exam cooperation model

Professor company analysis

Cycle field

Table 3. The most typical lemmas of IT-UNI when compared to EN-UNI, split by

topic / function.

The first category of non-boilerplate words in the IT-UNI wordlist is

that of words that are related to what we might call “institutional

activities”, and include, e. g., “credit”, “exam” and “cycle”. These

are mainly found in pages describing the educational offer of univer-

sities, such as degree or module descriptions, i. e. informative / regula-

tory texts aimed at providing (foreign and / or exchange) students with

information about the available programmes. Also notice the presence

within this category of “context-bound” terms, such as “professor” (a

term referring to a more common position in Italy than in the UK /

Ireland), and “cycle” (referring to each stage of an educational path,

leading to e. g. a Bachelor’s or Master’s-level degree). The second

category includes words which are mainly used to refer to relations

with other universities, institutions and private companies. The rela-

tively high frequency of these words in IT-UNI can be accounted for



on the basis of a larger presence of texts that regulate, e. g., student

exchanges or internship contracts (cf. section 2.3.1). As an example of

the use of these words, randomly selected concordance lines for the

lemma “agreement” are provided in figure 1. The third category of

typical lemmas in IT-UNI groups words which are characteristic of

academic disciplinary texts, e. g., “scientific”, “model” and “analy-

sis”. These words mainly appear in research articles, suggesting that

the exclusion of pdf files is not a good enough heuristic for guarding

against disciplinary writing, at least for the non-native component (cf.

section 2.2).

in organisations which have an <agreement> with the University

ersity of Cagliari * Bilateral <agreements> with european

erican and african countries * <Agreements> with eastern count

rd Lyon I” , France.An <agreement> for co-operation an

peration * Signing Cooperation <Agreements> * Community and

countries and eastern europe * <Agreements> with mediterranean

s Programme , or * an exchange <agreement> . If your home

ersity has signed one of these <agreements> , just take part ,

vities * Health Care * Service <Agreements> * Video and Movie

( Erasmus , Tempus , bilateral <agreement> , etc. ) , the leng

Figure 1. Ten random concordance lines for the lemma “agreement” in IT-UNI.
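Concordance lines of the kind shown in figure 1 can be produced with a few lines of code. The sketch below is a simplified stand-in for the corpus query tool used in the study: it works on parallel lists of word forms and lemmas, and the function name and the character-based context width are assumptions.

```python
def kwic(tokens, lemmas, target, width=30):
    """Return one keyword-in-context line per token whose lemma equals target.

    tokens and lemmas are parallel lists; the context on each side is cut
    to at most width characters, as in a fixed-width concordance display.
    """
    lines = []
    for i, lemma in enumerate(lemmas):
        if lemma != target:
            continue
        left = " ".join(tokens[:i])[-width:]
        right = " ".join(tokens[i + 1:])[:width]
        lines.append(f"{left:>{width}} <{tokens[i]}> {right}")
    return lines


# Toy input, not taken from acWaC.
tokens = ["an", "exchange", "agreement", "."]
lemmas = ["an", "exchange", "agreement", "."]
for line in kwic(tokens, lemmas, "agreement", width=11):
    print(line)  # an exchange <agreement> .
```

A random selection of the resulting lines (for instance with `random.sample`) would then give a display comparable to the figure.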

Moving on to the analysis of key lemmas in EN-UNI, these seem to

be more varied than the ones which were found for IT-UNI, and can

be grouped into four categories (see table 4). At first glance, the

first category would seem to coincide with that of “institutional ac-

tivities” also found in the Italian sub-corpus. However, a closer analy-

sis reveals that these key lemmas in EN-UNI refer mainly to research

activities (e. g. “postgraduate”, “research”), rather than to the uni-

versities’ educational offer, and are mainly found in informative texts

describing the institution and promoting, e.g., its research achieve-

ments. The “Services / support” category includes words related to

ways in which universities assist students. An example is “funding”,

which is found in pages in which institutions specify the funding

opportunities for both undergraduate and postgraduate study. This is

in line with the results presented in section 2.3.1, according to which

UK / Irish universities seem to be more concerned with students’ life

and welfare than their Italian counterparts.

EN-UNI
1. Institutional activities   2. Services / support   3. Evaluative language   4. Function words
Postgraduate                  Support                 range                    your
Research                      Funding                 include                  our
MA                            Disability

Table 4. The most typical lemmas of EN-UNI when compared to IT-UNI, split by topic / function.

While the categories just described seem to point at differences be-

tween the two corpora in terms of domain and topic representation

(in turn probably reflecting different communicative priorities of Ital-

ian and UK / Irish institutions), categories three and four seem to point

at more strictly linguistic differences. The presence of words like

“range” and “include” among the most typical of EN-UNI requires

an analysis of concordance lines. As can be seen in figure 2, these

words typically occur as part of expressions which depict the oppor-

tunities offered by universities as particularly vast, and thus convey

positive evaluation. Notice that the presence of self-promotional lan-

guage in university websites was also observed by Thelwall (2005:

537; cf. also section 3.1).

and work . It supports a whole <range> of nationalities and cultu

----- The School offers a wide <range> of subjects at undergradua

ost two million books , a wide <range> of periodicals and IT

duate students an unparalleled <range> of expertise and the

oduction Drawing upon the wide <range> of skills and knowledge of

AA ) has an exceptionally wide <range> of chronological interests

taff cover between them a wide <range> of fields and expertise in

area centred on the city . The <range> of community participation

oject staff are constructing a <range> of computer-based , multi-

e period of the course , and a <range> of work placements and pro

Figure 2. Ten random concordance lines for the lemma “range” in EN-UNI.

Lastly, closed-class words such as “your” and “our” turned out to be among

the most typical of EN-UNI. This was somewhat surprising, insofar

as the proportion of function words tends to remain more stable than

that of lexical words across corpora (Manning / Schütze 1999: 20–

21). The presence of these possessives, as the analysis presented in



section 3.3 seems to confirm, can be taken as a clue to the use of a

more “personal” style on the part of UK / Irish universities, which

identify themselves as “we”, and address students as “you”, thus

trying to establish a more “involved” relationship with them.

2.3.2.2 Comparing lists of part-of-speech 3-grams

The comparison of 3-gram lists was carried out as a further means of

evaluating corpus contents. Given the limited space, in this contribution

the method is presented mainly for illustrative purposes, to highlight

the possible types of linguistic information that this analysis can yield

when applied to corpus evaluation. Results are presented in table 5.

POS 3-gram              Examples
IT-UNI
To have DET             to have the (boilerplate)
Have DET ADJ            have the right (boilerplate)
PREP DET ADJ            on the other; at the same
PREP ADJ NOUN           for foreign students; of foreign languages
DET NOUN NOUN           the degree course; the research group
EN-UNI
PERS_PRON be ADV        you are here (boilerplate)
PREP POSS_ADJ NOUN      of our research; of its kind
’s ADJ NOUN             ’s inaugural lecture; ’s leading research
ADV to V                back to top,5 here to get (boilerplate)
NOUN PREP POSS_ADJ      participation in our; springboard for your

Table 5. The 5 most typical POS 3-grams of IT-UNI compared to EN-UNI and vice versa.
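The extraction of part-of-speech 3-grams underlying table 5 can be sketched as follows. The sketch assumes that sequences are counted within sentences only; the source does not specify whether sentence boundaries were respected, so this is an illustrative choice, as are the function name and the toy tag sequences.

```python
from collections import Counter


def pos_trigrams(sentences):
    """Count sequences of three consecutive POS tags.

    sentences is an iterable of tag lists, one list per sentence, so that
    no 3-gram spans a sentence boundary.
    """
    counts = Counter()
    for tags in sentences:
        for i in range(len(tags) - 2):
            counts[tuple(tags[i:i + 3])] += 1
    return counts


# Toy input using the tag labels of table 5.
tagged = [
    ["PREP", "DET", "ADJ", "NOUN"],  # e.g. "at the same time"
    ["DET", "NOUN"],                 # too short to yield a 3-gram
]
print(pos_trigrams(tagged).most_common())
```

The resulting frequency lists can then be compared across the two sub-corpora with the same log-likelihood procedure used for the lemma lists.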

After factoring out boilerplate portions of text,6 one of the most striking

differences in the two lists was once again the relative typicality

in the EN-UNI corpus of possessive adjectives, like “our” and “your”,

which are absent from the IT-UNI top 5 list (the first 3-gram including

a possessive adjective in the IT-UNI ranked list appears at position 75).

On the other hand, IT-UNI seems to display a more prominent use of

noun phrases, in the form of either adjective-noun or noun-noun sequences.

Notice that, apart from the “’s + ADJ + NOUN” sequence,

ranked in third position,7 the first sequences including adjective-noun

or noun-noun pairs are both ranked below the 50th position in EN-UNI.

5 Notice that the presence of this 3-gram here is due to a tagger error: “top” was not correctly recognised as a noun.

6 E. g. around 80% of the sequences “to + have + DET” and “have + DET + ADJ” correspond to the phrase “to have the right”, the near totality of which occurs in a single site.

These findings seem to corroborate those presented in the two

previous sections, pointing to a more personal and involved style in

native English vs. greater formality in non-native English. Interest-

ingly, this finding confirms results obtained in studies of translational

language (e. g. Olohan 2002). These issues are taken up in section 3.

For the present purposes it is important to note that this second cor-

pus evaluation phase has confirmed results obtained in the first, i. e.

no evidence has emerged of obvious imbalances due to faulty corpus

construction procedures.

7 In this case we suspect that the high ranking is due, rather than to the typicality of the adjective-noun sequence, to its use in combination with the genitive “’s”, a structure which in absolute terms is infrequent in the IT-UNI corpus (fewer than 500 occurrences).

3. Institutional academic English in Italy: a preliminary investigation with acWaC

3.1 Lexical bundles

As a first step in the identification of typical features of institutional

academic English used in Italy, we focused on phraseology, in line

with a well-established tradition in corpus linguistics, investigating

both methodological issues and native language (see e. g. Sinclair

1991, Granger / Meunier 2008) and non-native / learner language (Nesselhauf

2004, Meunier / Granger 2008). Given the complexity of this

subject, reflected by the terminological confusion surrounding it (see

e. g. Moon 1998 and Wray 2002 for surveys of theoretical constructs

and terms), for the present purposes we did not attempt to extract

theoretically-defined phraseological units. Rather, we adopted the

bottom-up, “corpus-driven” approach (Tognini-Bonelli 2001) pro-

posed by Biber et al. (1999) and subsequently used in several studies

(e. g. Biber 2006, Cortes 2004). This consists in selecting word com-

binations based solely on their frequency in one or more corpora,

regardless of any other parameters (e. g. lexico-grammatical struc-

ture, well-formedness, salience, idiomaticity). These sequences, called

lexical bundles, are then classified in terms of their structure and the

function they play in discourse. Comparing lexical bundles across

the two varieties of English represented in the EN-UNI and IT-UNI

sub-corpora of acWaC gives us first of all an idea of the relative role

of the idiom-principle and of the open-choice principle (Sinclair 1991)

in the discourse production of native and non-native authors / transla-

tors. In other words, we can find out which of these two language

varieties is more formulaic. Since formulaic language has been sug-

gested to play a role in making texts sound more “native-like” (fa-

mously by Pawley / Syder 1983), the lower number of lexical bundles

in IT-UNI would be an indication of less-than-ideal writing strate-

gies at work. Apart from the mere quantitative datum, we can also

compare the lexical bundles present in the two sub-corpora in terms

of their structure types and functions, to get an idea not only of how

many bundles there are, but also of their function. In line with previ-

ous work adopting this notion, we define as a lexical bundle any

uninterrupted sequence of 4 word forms occurring at least 200 times

(i. e., approximately 40 times per million words, henceforth pmw) in

either sub-corpus. In order to reduce the amount of noise resulting

from the automatic corpus building procedure, words whose assigned

lemmas are unknown to the tagger, proper nouns and sequences

containing non-alphabetical characters are filtered out from the search.
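
The frequency-based selection just described can be illustrated with a minimal sketch. It assumes the corpus is available as a flat list of word forms; the function name `extract_bundles` and its parameters are hypothetical and not part of the acWaC tools. Only the filter on non-alphabetical characters is reproduced here (using a plain ASCII letter test, which is an assumption), since filtering unknown lemmas and proper nouns would require the tagger output.

```python
from collections import Counter
import re

ALPHABETIC = re.compile(r"^[A-Za-z]+$")  # assumption: ASCII letters only


def extract_bundles(tokens, n=4, min_freq=200):
    """Return {n-gram: (absolute fq, fq pmw)} for n-grams reaching min_freq."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        # Skip sequences containing non-alphabetical characters
        if all(ALPHABETIC.match(t) for t in gram):
            counts[gram] += 1
    # Normalisation factor for frequencies per million words (pmw)
    per_million = 1_000_000 / len(tokens) if tokens else 0.0
    return {g: (c, c * per_million) for g, c in counts.items() if c >= min_freq}


# Toy usage with a lowered threshold (the study uses min_freq=200)
toy = ("the university is one of the largest in the country and "
       "the department is one of the leading centres").split()
bundles = extract_bundles(toy, n=4, min_freq=2)
for gram, (fq, pmw) in bundles.items():
    print(" ".join(gram), fq, round(pmw))
```

Because selection rests on frequency alone, the output of such a procedure includes navigation-bar sequences alongside genuine set phrases, which is why a further manual screening step is needed.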

The resulting lists contain 224 (EN-UNI) and 184 (IT-UNI)

bundles respectively. Approximately 90% of these are boilerplate

sequences such as “accessible to any browser” and “all material is

copyright”. While several appear to be intuitively plausible lexicalised

phrases, their relevance to our purposes is unclear (see section 2.3.2).

Indeed, several bundles thus identified are simply the result of the


juxtaposition of unrelated words in menus and navigation bars (e. g.

“Symposia Concerts Music Lessons”, “Instruments Public Relations

Associations”). A cursory browsing of the two lists suggests that the

EN-UNI boilerplate is much more structurally complex than that found

in IT-UNI, with several matches corresponding to intuitions about

lexical bundles, while IT-UNI boilerplate sequences mainly consist of

casual noun sequences from menus and navigation bars. This would

suggest greater attention to navigation issues in EN-UNI, an impression

confirmed by the analysis of personal style discussed in section 3.3.

Yet, lacking an objective way of telling “interesting” instances of boiler-

plate apart from “uninteresting” ones, all boilerplate sequences are

discarded from the current analysis. Further investigations shedding

light on the role of set phrases and boilerplate in native and non-native

web writing would be an interesting development of the present work.

Once boilerplate is removed, we are left with 22 bundles in

EN-UNI and 11 in IT-UNI (see table 6). While numbers are very

small, they do suggest that EN-UNI texts make greater use of com-

mon set phrases than IT-UNI texts. Moving on to a structural classi-

fication of these lexical bundles, EN-UNI and IT-UNI similarly fea-

ture almost exclusively phrase-level (rather than clause-level) units

headed by a preposition or a noun. This is unsurprising perhaps, since

phrase-level lexical bundles, and in particular noun and prepositional

phrases, are typical of written expository prose (Biber 2006), distin-

guishing it from casual conversation, which is much richer in clause-

level bundles and verb-phrases.

A functional analysis along the lines of Biber (2006) points at

some more similarities but also at differences. Both sub-corpora fea-

ture no stance bundles (such as “are accountable for all” and “it is

important to”) and quite a few referential bundles (IT-UNI: “at the

University of”, “on the basis of”, “the beginning of the”; EN-UNI:

“at the University of”, “the end of the”, “a wide range of”). However,

EN-UNI is noticeably richer in discourse organising bundles

(6 out of 22 vs. 1 out of 11 in IT-UNI). Most of these (5 out of 6) have

a focusing function in discourse; this function is almost absent from

the IT-UNI bundles (1 occurrence of “one of the most”). Furthermore,

3 out of 5 of these focusing bundles express positive evaluation. “One

of the most”, “is one of the” and “one of the largest” are typically used



for singling out some features for which the institution outperforms

competitors. The referential (quantity) bundle “a wide range of” found

in EN-UNI also plays an evaluative role, being used to enumerate the

opportunities offered by a given institution (see figures 2 (above) and

3 for typical examples of use of these bundles in EN-UNI). This pro-

motional aspect is virtually absent from the IT-UNI bundle list.

                           IT-UNI  Examples                    EN-UNI  Examples

Lexical bundles            184                                 224

Boilerplate bundles        173     where to find us            202     Skip to this section
                                   Site Map Advanced Search            Go back to top
                                   Deutsch Home Info Site              to content The University

Selected lexical bundles   11      at the University of        22      the end of the
(no boilerplate)                   as well as the                      a member of the
                                   the beginning of the                one of the largest

Table 6. Examples of lexical bundles from the acWaC corpus.

ality of education , ours <is one of the> leading departments of A

- The Department of Drama <is one of the> leading centres for rese

ocal region . The Faculty <is one of the> UK ’s top social science

------- The School of Law <is one of the> leading centres for rese

science and technology , <is one of the> academics chosen to take

The University of Bristol <is one of the> leading research univers

f Science and Engineering <is one of the> largest and highest-rate

f Psychology at Edinburgh <is one of the> longest-established cour

rgh Nuclear Physics Group <is one of the> most diverse in the coun

The University of Glasgow <is one of the> world ’s top 100 univers

Figure 3. Selected examples of “is one of the” from EN-UNI.

3.2 Stance expressions

The analysis of stance expressions used to express obligation, necessity

and volition follows the approach proposed in chapter 5 of

Biber (2006). Clearly, there are innumerable ways of conveying these

meanings, which refer to the expectations of speakers / writers (insti-

tutions in our case) concerning actions to be undertaken by recipients


(typically prospective and current students). Here we limit the analy-

sis to those structures which were found by Biber (2006) to charac-

terise non-disciplinary writing in English (institutional writing, syl-

labi, course packs, etc.).

Starting off with modal verbs, “by far the most common grammatical

device used to mark stance in university registers” (Biber 2006:

95), searches were conducted in the acWaC corpus for the modals

“must”, “should”, “will / ’ll” and the semi-modal “have to” preceded

by the subject pronoun “you” or the noun “students”. The passive

construction “students / you are verb-ed to” (as in “students / you

are expected to [do x]”) was also included to provide a wider spec-

trum of indirect means of expressing obligation / necessity / volition.

Results (pmw) are given in figure 4. The various structures are or-

dered according to their modal strength, going from the most direct,

i. e. “must” to the most indirect, i. e. “be verb-ed to”.

Figure 4. Distribution of stance expressions conveying obligation / necessity / volition

in acWaC.

While we cannot be sure that all the matches, particularly those ob-

tained for the more indirect constructions, are used to express the

modal meanings we are focusing on, there seems to be a clear pattern

emerging from the comparison of the two sub-corpora. IT-UNI fa-

vours the more direct stance expressions “must” and “have to”, while

EN-UNI makes greater use of the more indirect means of expressing

obligation, i. e. “should”, “will” and the passive.
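
As an illustration of the searches described above (the subject “you” or “students” followed by a modal, a semi-modal or a passive), the following sketch counts each structure in raw text and normalises the counts per million words. The regular expressions and the names `STANCE_PATTERNS` and `stance_frequencies` are assumptions made for this example, not the queries actually run on acWaC. In particular, the passive is approximated by any word ending in “-ed”, which misses irregular participles and may match adjectives.

```python
import re

# Hypothetical approximations of the five structures, ordered from the
# most direct ("must") to the most indirect ("are verb-ed to")
STANCE_PATTERNS = {
    "must": r"\b(?:students|you)\s+must\b",
    "have to": r"\b(?:students|you)\s+have\s+to\b",
    "should": r"\b(?:students|you)\s+should\b",
    "will / 'll": r"\b(?:students|you)(?:\s+will\b|\s?['’]ll\b)",
    "are verb-ed to": r"\b(?:students|you)\s+are\s+\w+ed\s+to\b",
}


def stance_frequencies(text, n_tokens):
    """Return {structure: (absolute fq, fq pmw)} for the given text."""
    result = {}
    for label, pattern in STANCE_PATTERNS.items():
        hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
        result[label] = (hits, hits * 1_000_000 / n_tokens)
    return result


sample = ("Students must register by May . You should contact the office . "
          "You 'll receive a reply . Students are expected to attend . "
          "You have to pay the fee .")
freqs = stance_frequencies(sample, n_tokens=len(sample.split()))
for label, (fq, pmw) in freqs.items():
    print(label, fq, round(pmw))
```

Normalising to pmw is what makes the two sub-corpora comparable, since they differ in size.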

[Figure 4: bar chart of frequency pmw (y-axis, 0 to 500) in EN-UNI and IT-UNI for “Students / you must”, “Students / you have to”, “Students / you should”, “Students / you will | ’ll” and “Students / you are verb-ed to”.]



Further searches for even more indirect stance expressions such

as the extraposed construction “it is adjective to verb” confirm the

trend just observed. This search, targeting expressions like “it is neces-

sary / possible / important to [x]” shows that, while IT-UNI makes more

frequent use of this (indirect) pattern than EN-UNI (136 pmw vs. 94),

it also selects more direct lexical “fillers”. A comparison (table 7) of

the top three adjectives found in this pattern in the two sub-corpora

shows that in IT-UNI “necessary” is employed more frequently in this

pattern than “important”, while the reverse is true of EN-UNI. Verb

collocates of “it is necessary to verb” also provide interesting insights

into the divergent attitude of institutions sampled in the two sub-

corpora (see table 8). While the texts in EN-UNI use this construction

mainly to convey needs and requirements of a more intellectual / acade-

mic nature (“examine”, “undertake”, “question”, “know”), IT-UNI

texts use it preferably for administrative / normative requirements

(“accumulate”, “possess”, “prove”, “obtain”, etc.).

EN-UNI       absolute fq (fq pmw)     IT-UNI       absolute fq (fq pmw)
possible     107 (19.6)               possible     211 (49.8)
important     94 (17.2)               necessary     77 (18.2)
necessary     37 (6.8)                important     43 (10.1)

Table 7. Top 3 adjectives found in the pattern “it is adjective to verb”.

Pattern: “it is necessary to” + verb

EN-UNI                                  absolute fq   IT-UNI                                     absolute fq
examine, undertake, have                3             have                                       12
maintain, carry, question, take, know   2             accumulate                                 6
                                                      be                                         4
                                                      possess, submit, send                      3
                                                      read, prove, obtain, specify, determine,   2
                                                      demonstrate, book, attend, apply, use,
                                                      present

Table 8. Verb collocates of “it is necessary to verb” (fq >1).
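
The kind of query underlying tables 7 and 8 can be sketched as follows, assuming the corpus is available as a list of (word, part-of-speech tag, lemma) triples. The Penn Treebank style tags (“JJ” for adjectives, “VB” for verb base forms) and the function name `it_is_adj_to_verb` are assumptions made for this illustration; the tagset and query tool actually used may differ.

```python
from collections import Counter


def it_is_adj_to_verb(tagged):
    """Yield (adjective, verb lemma) for each "it is ADJ to VERB" match.

    `tagged` is a list of (word, pos, lemma) triples; tags are assumed
    to be Penn Treebank style (JJ = adjective, VB = verb base form).
    """
    for i in range(len(tagged) - 4):
        it, is_, adj, to, verb = tagged[i:i + 5]
        if (it[0].lower() == "it" and is_[0].lower() == "is"
                and adj[1] == "JJ" and to[0].lower() == "to"
                and verb[1] == "VB"):
            yield adj[0].lower(), verb[2]


toy = [
    ("It", "PRP", "it"), ("is", "VBZ", "be"), ("necessary", "JJ", "necessary"),
    ("to", "TO", "to"), ("obtain", "VB", "obtain"), ("a", "DT", "a"),
    ("visa", "NN", "visa"), (".", "SENT", "."),
    ("It", "PRP", "it"), ("is", "VBZ", "be"), ("important", "JJ", "important"),
    ("to", "TO", "to"), ("know", "VB", "know"), ("this", "DT", "this"),
    (".", "SENT", "."),
]
matches = list(it_is_adj_to_verb(toy))
# Adjective ranking (cf. table 7) and verb collocates of "necessary" (cf. table 8)
adjectives = Counter(adj for adj, _ in matches)
verbs_after_necessary = Counter(v for adj, v in matches if adj == "necessary")
print(adjectives.most_common(3), verbs_after_necessary.most_common())
```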


Finally, and even more indirectly, obligation, volition and necessity

can be expressed by means of impersonal passive constructions with

“will”, which explicitly identify neither the authority enforcing a

rule nor the persons expected to comply (Biber 2006: 125). The results

of a search for “noun will be verb”, from which all animate

subjects have been manually filtered out, show (table 9) that EN-

UNI texts often use this construction to express obligation in the most

indirect way (5/10), while IT-UNI texts use it preferably for referring

to future time (8/10), and only infrequently with its stance-express-

ing modal meaning (2/10) – underlining in table 9 indicates such

stance-marking instances.

EN-UNI                         absolute fq   IT-UNI                           absolute fq
election will be held          18            attention will be paid           22
conference will be held        13            attention will be given          17
mark will be applied           12            message will be sent             14
points will be deducted        12            course will be held              13
attention will be paid         11            lessons will be held             12
preference will be given       11            preference will be given         12
course will be taught          10            priority will be given           12
interviews will be conducted   10            scholarship will be reimbursed   12
emphasis will be placed        9             workshop will be held            12
essay will be carried          9             agreement will be signed         10

Table 9. Top 10 phrases matching the pattern “noun will be verb” (animate subjects

removed, lemmatised)

3.3 Personal style

The third part of our analysis investigates personal style, i. e. those

cases in which the institution refers to itself as “we”, addresses

students as “you”, and / or uses imperative verb forms (Biber 2006:

129–130). First, a search for “we verb [that] you” was made in acWaC.

Results suggest that this pattern is used more frequently in the native

component of the corpus (87 occurrences in EN-UNI vs. 62 in IT-

UNI). The distribution of verbs in this pattern is even more revealing



of the institutions’ attitude. If we focus on the top collocate in EN-

UNI, “hope” (26 occurrences vs. 2 in IT-UNI), this is often used to

express commitment by the institution to ensure students’ satisfac-

tion (see figure 5). On the other hand, the top collocate in IT-UNI,

“advise” (0 occurrences in EN-UNI in this pattern), is typically used

to give instructions (see figure 6), and would not seem to express any

of the concern and involvement displayed in EN-UNI.

llege Admission Offices . <We hope you> find it useful . More pu

tmosphere here in Leeds . <We hope you> enjoy your visit ! Profe

context of health SWAps . <We hope you> find the resource pack e

of further information . <We hope you> will take pleasure in se

s to support your claim . <We hope you> enjoy the module ! The S

eving your true potential <we hope you> will be inspired to make

to refresh our website . <We hope you> will find the content in

ge in scientific debate . <We hope you> will join the society ,

eryone who came along and <we hope you> enjoyed the experience .

es our winter programme <we hope you> will enjoy the talks and

our returning students , <we hope you> have had a good Summer .

Figure 5. Selected examples of “we hope you” in EN-UNI.

vered . For this reason , <we advise you> to explore the neighbo

y yourself . In this case <we advise you> to come to Rome at lea

ted firms . In addition , <we advise you> to have a look at our

licant will be rejected . <We advise you> to send the package us

e cost of all the bills . <We advise you> not to contact private

e health insurance NOTE : <We advise you> to contact the Italian

regards projects 6 and 7 <we advise you> to look up the page \x

Italian admittance exam . <We advise you> to ask the Welcome Off

For abstract preparation <we advise you> to download the abstra

validity of your diploma <we advise you> to contact the Italian

Figure 6. Selected examples of “we advise you” in IT-UNI.
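
The “we verb [that] you” search discussed above can be approximated on untagged text with a regular expression, as in the sketch below. The names `WE_VERB_YOU` and `we_verb_you_collocates` are hypothetical, and since no part-of-speech information is used here, the slot between “we” and “you” is not guaranteed to be filled by a verb; a tagged corpus would allow a stricter query.

```python
import re
from collections import Counter

# One word between "we" and "you", with an optional "that"
WE_VERB_YOU = re.compile(r"\bwe\s+(\w+)\s+(?:that\s+)?you\b", re.IGNORECASE)


def we_verb_you_collocates(text):
    """Count the word filling the slot in "we ___ [that] you"."""
    return Counter(m.group(1).lower() for m in WE_VERB_YOU.finditer(text))


sample = ("We hope you enjoy your visit . We advise you to contact the office . "
          "We hope that you will join us .")
collocates = we_verb_you_collocates(sample)
print(collocates.most_common())
```

Ranking the collocates in each sub-corpus separately is what brings out contrasts such as “hope” versus “advise”.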

Lastly, the search for a verb base form following a sentence break,

which is used to retrieve verbs in the imperative mood, returns about twice

as many matches per million words in EN-UNI as in IT-UNI (3,690, or 679 pmw, in

EN-UNI vs. 509, or 357 pmw, in IT-UNI). This is partly due to a

greater attention paid to web navigation in the native sub-corpus,

confirming the results obtained when analysing lexical bundles (sec-

tion 3.1 above). This finding also supports, however, the impression


that non-native authors and translators (or possibly the institutions

they represent) shun the more personal, involved, informal style that

is so often used in the native corpus.

4. Conclusion and ways forward

This contribution has presented a methodology for constructing spe-

cialised corpora from the web semi-automatically, applied it to the

construction of a monolingual comparable corpus of websites repre-

senting native (British and Irish) and non-native (Italian) varieties of

institutional academic English, and exemplified its use for shedding

light on the attitudinal and stylistic preferences distinguishing a native

variety from a lingua franca variety.

In today’s context of increasing competition, collaboration and

exchange among institutes of higher education on a global scale, the

importance of the efforts brought about by the “Bologna Process” is

undeniable both for individual universities and for the national aca-

demic systems of all participating countries. Similarly to what hap-

pens in other non-English speaking countries, Italian universities are

under increasing pressure to make their courses accessible to an in-

ternational public, using English as the medium of instruction and as

the means to conduct overseas student recruitment campaigns. Strat-

egies for academic internationalisation are aggressively pursued at

government and ministerial level, and feature high on the agenda of

individual institutions.

Judging from our analysis, however, the communication strategies

put into place, as well as the selection of English language contents

made available to an international audience, may be improved

and enhanced. While we do not claim that Italian university websites

should conform to foreign standards that may not be appropriate for

their specific context, a comparative corpus analysis like the one presented

here would seem to provide a repository of underused writing

strategies and alternative turns of phrase that might be of help to non-native

authors and translators, depending on the circumstances.


In the near future we plan to add other non-native components

to our comparable corpus, so as to account for institutional academic

English in European countries other than Italy, in which English is

used (more or less extensively) as a lingua franca. This would allow

us to gain a better idea of the full range of variation in this specialised

language variety. Secondly we hope to provide a more thorough de-

scription of the lingua franca variety of English used on Italian aca-

demic websites, and to use these insights to develop a corpus-based

writing aid for translators and non-native authors working in this field.

References

Atkins, Sue / Clear, Jeremy / Ostler, Nicholas (1992). ‘Corpus de-

sign criteria’. Literary and linguistic computing. 7(2): 1–16.

Baroni, Marco / Bernardini, Silvia (eds.) (2006). WaCky! Working

Papers on the Web as Corpus. Bologna: GEDIT.

Baroni, Marco / Bernardini, Silvia (2004). ‘BootCaT: Bootstrapping

corpora and terms from the web’. Proceedings of LREC 2004.

1313–1316.

Baroni, Marco / Ueyama, Motoko (2006). ‘Building general- and spe-

cial-purpose corpora by Web crawling’. Proceedings of the 13th

NIJL International Symposium. 31–40.

Biber, Douglas (2006). University language. A corpus-based study of

spoken and written registers. Amsterdam: Benjamins.

Biber, Douglas / Kurjian, Jerry (2007). ‘Towards a taxonomy of Web

registers and text types: A multidimensional analysis’. In Hundt

et al. 109–131.

Biber, Douglas / Johansson, Stig / Leech, Geoffrey / Conrad, Susan /

Finegan, Edward (1999). Longman grammar of spoken and

written English. Harlow: Longman. 109–132.

Brekke, Magnar (2000). ‘From the BNC towards the cybercorpus: A

quantum leap into chaos?’. In Kirk, John (ed.) Corpora ga-

lore: Analyses and techniques in describing English. Papers

from the 19th International Conference on English Language


Research on Computerised Corpora. Amsterdam: Rodopi. 227–

247.

Castagnoli, Sara (2006). ‘Using the web as a source of LSP corpora

in the terminology classroom’. In Baroni, Marco / Bernardini,

Silvia (eds.) 159–172.

Cortes, Viviana (2004). ‘Lexical bundles in published and student

disciplinary writing: Examples from history and biology’. Eng-

lish for specific purposes. 23: 397–423.

Dunning, Ted (1993). ‘Accurate methods for the statistics of surprise

and coincidence’. Computational linguistics. 19(1): 61–74.

Evert, Stefan / Kilgarriff, Adam / Sharoff, Serge (eds.) (2008). Pro-

ceedings of the 4th Web as Corpus Workshop – Can we beat

Google? Marrakech, 1 June 2008.

Fairclough, Norman (1993). ‘Critical discourse analysis and the

marketisation of public discourse: The universities’. Discourse

and society. 4(2): 133–168.

Ferraresi, Adriano / Zanchetta, Eros / Baroni, Marco / Bernardini,

Silvia (2008). ‘Introducing and evaluating ukWaC, a very large

web-derived corpus of English’. In Evert et al. 47–54.

Fletcher, William (2004). ‘Making the web more useful as a source

for linguistic corpora’. In Connor, Ulla / Upton, Thomas (eds.)

Corpus Linguistics in North America 2002. 191–205.

Gatto, Maristella (2009). From body to web. An introduction to the

web as corpus. Bari: Laterza.

Goffman, Erving (1981). Forms of talk. Philadelphia: University of

Pennsylvania Press.

Granger, Sylviane / Meunier, Fanny (eds.) (2008). Phraseology: An

interdisciplinary perspective. Amsterdam: Benjamins.

Hoffmann, Sebastian (2007). ‘From web-page to mega-corpus: The

CNN transcripts’. In Hundt et al. 69–85.

Hundt, Marianne / Nesselhauf, Nadja / Biewer, Carolin (eds.) (2007).

Corpus linguistics and the web. Amsterdam: Rodopi.

Jenkins, Jennifer (2007). English as a lingua franca: Attitude and

identity. Oxford: Oxford University Press.

Keller, Frank / Lapata, Mirella (2003). ‘Using the Web to obtain fre-

quencies for unseen bigrams’. Computational linguistics. 29(3):

459–484.


Kilgarriff, Adam (2007). ‘Googleology is bad science’. Computa-

tional linguistics. 33(1): 147–151.

Kilgarriff, Adam / Grefenstette, Gregory (2003). ‘Introduction to the

special issue on the web as corpus’. Computational linguistics.

29(3): 333–347.

Lee, David Y. W. (2001). ‘Genres, registers, text types, domains, and

styles: Clarifying the concepts and navigating a path through

the BNC jungle’. Language, learning and technology. 5(3):

37–72.

Leturia, Igor / San Vicente, Inaki / Saralegi, Xabier / Lopez de Lacalle,

Maddalen (2008). ‘Collecting Basque specialized corpora from

the web: Language-specific performance tweaks and improving

topic precision’. In Evert et al. 40–46.

Lüdeling, Anke / Evert, Stefan / Baroni, Marco (2007). ‘Using Web

data for linguistic purposes’. In Hundt et al. 7–24.

Manning, Chris / Schütze, Hinrich (1999). Foundations of statistical

natural language processing. Cambridge, MA: MIT Press.

Mauranen, Anna (2003). ‘The corpus of English as Lingua Franca in

academic settings’. TESOL Quarterly. 37 (3): 513–527.

Meunier, Fanny / Granger, Sylviane (eds.) (2008). Phraseology in

foreign language learning and teaching. Amsterdam: Benja-

mins.

Moon, Rosamund (1998). Fixed expressions and idioms in English.

Oxford: Oxford University Press.

Nesselhauf, Nadja (2004). Collocations in a learner corpus. Amster-

dam: Benjamins.

Olohan, Maeve (2002). ‘Leave it out! Using a comparable corpus to

investigate aspects of explicitation in translation’. Cadernos

de Tradução. 9: 153–169.

Pawley, Andrew / Syder, Frances (1983). ‘Two puzzles for linguistic

theory: Nativelike selection and nativelike fluency’. In Richards,

Jack / Schmidt, Richard (eds.) Language and communication.

New York: Longman. 191–226.

Rayson, Paul / Garside, Roger (2000). ‘Comparing corpora using fre-

quency profiling’. In Proceedings of the workshop on compar-

ing corpora, Hong Kong, October 2000. 1–6.


Santini, Marina (2007). ‘Characterizing genres of web pages: Genre

hybridism and individualization’. In Proceedings of the 40th

Hawaii International Conference on System Sciences. 1–10.

Seidlhofer, Barbara (2001). ‘Closing a conceptual gap: The case for a

description of English as a lingua franca’. International jour-

nal of applied linguistics. 11: 133–158.

Sharoff, Serge (2006). ‘Creating general-purpose corpora using au-

tomated search engine queries’. In Baroni, Marco / Bernardini,

Silvia (eds.). 63–98.

Sinclair, John McHardy (1991). Corpus, concordance, collocation.

Oxford: Oxford University Press.

Thelwall, Mike (2005). ‘Creating and using web corpora’. International

journal of corpus linguistics. 10(4): 517–541.

Tognini-Bonelli, Elena (2001). Corpus linguistics at work. Amster-

dam: Benjamins.

Wray, Alison (2002). Formulaic language and the lexicon. Cambridge:

Cambridge University Press.
