Creating Tizen Native Apps with the Native UI & Graphics ...
Institutional academic English in the European context: a web-as-corpus approach to comparing native...
Transcript of Institutional academic English in the European context: a web-as-corpus approach to comparing native...
SILVIA BERNARDINI, ADRIANO FERRARESI, FEDERICO GASPARI
Institutional academic English in the
European context: a web-as-corpus approach
to comparing native and non-native language
1. Introduction and overview
In this contribution we present a corpus-based analysis of institu-
tional English as used in the Italian academic context. In order to
make this multi-faceted object of study more manageable and fo-
cused, the investigation is limited to academic websites. These are
viewed as particularly relevant inasmuch as they provide a powerful
means of making contents available to a vast audience, including,
crucially, international students. Producing appropriate and effective
web texts of an institutional nature in English is a must for institu-
tions in non-English speaking countries in order to favour EU-wide
student mobility and to attract prospective students from outside the
EHEA. From a descriptive / theoretical point of view, studies of aca-
demic discourse conducted so far have mainly focused on discipli-
nary academic English, and especially on scientific writing. Institu-
tional English produced within academia has received much less
attention, with the exception of a few landmark publications (notably
Fairclough 1993 and Biber 2006). Hence the relevance of the present
investigation.
The contribution has a double focus. In the first part the poten-
tial and limitations of the “web-as-corpus” methodology for special-
ised comparable corpus construction are illustrated. We describe the
semi-automatic process through which we collected English-language
texts published on the web by Italian universities; a similar approach
was adopted to build a matching sub-corpus of UK and Irish websites,
which afford examples of native English standards within the EU.
28 Silvia Bernardini, Adriano Ferraresi, Federico Gaspari
The corpus-building procedure as well as the corpus structure and
contents are described in section 2.2. Since the procedure is semi-
automatic and allows limited control over the corpus contents, we
carried out a preliminary analysis to ensure that the resulting data
sets are appropriate for the purposes of our study. In line with current
work on web-as-corpus methods (Sharoff 2006), we present two ways
of assessing to what extent the two corpora may be regarded as com-
parable in terms of topics covered and (broadly speaking) text types
included. This methodologically-oriented exploration is reported in
section 2.3.
The second part of the contribution has a more descriptive fo-
cus, seeking to shed light on the characteristics of institutional aca-
demic English in Italian websites. As part of the efforts to achieve the
demanding strategic objectives of the “Bologna Process” universities
need to disseminate information on the web in English. On the whole,
Italian universities have implemented this requirement to different
degrees, and preliminary investigations reveal a rather disappointing
situation. Interventions aimed at supporting multilingual communi-
cative strategies are therefore needed to strengthen internationalisation
policies. To date, however, no in-depth studies have been devoted to
the discursive features of institutional English as it is used on the
websites of Italian universities, nor has this “lingua franca” variety
of English been compared to native varieties within the EU context.
The crucial importance of English as a lingua franca, especially in
scientific and academic international settings, is nowadays widely
recognised and has stimulated a number of comparative studies (some
corpus-based) analysing non-native varieties against the background
of standard “native” varieties (Seidlhofer 2001, Mauranen 2003,
Jenkins 2007).
In section 3 we analyse institutional academic language in its
native and lingua-franca varieties, taking as a starting point the analysis
of institutional university registers offered in Biber (2006). A com-
parative analysis of characteristic lexical bundles and of ways of ex-
pressing stance (more direct / indirect forms of obligation and neces-
sity) is carried out. Our findings suggest that the Italian institutions
make lighter use of set phrases assisting navigation and positively
evaluating themselves, and show a dispreference for personal style
29Institutional academic English in the European context
and for the more indirect stance expressions. As a result, these texts
come across as more directive, and the institutions who published
them as arguably more remote than is the case in the native corpus.
The study is part of a larger project which in the longer-term
seeks to provide resources for Italian professional writers and trans-
lators working with institutional academic English, and which aims
to set a standard for an EU-wide pool of corpora representing non-
native varieties of English used by universities in a number of Euro-
pean countries.
2. Web-as-corpus methods for specialised
corpus construction
2.1 Previous work
In the last decade the use of web data has become common practice
in corpus linguistics research. Witness to this is a large and growing
literature on web-as-corpus approaches to corpus building and us-
age, which are adopted for purposes as diverse as terminology ex-
traction (Castagnoli 2006) and register analysis (Biber / Kurjian 2007).
The term web-as-corpus is normally used for two main methodologi-
cal paradigms. The first consists in considering the web per se as a
corpus. Within this paradigm, web data are accessed either through
existing commercial search engines or through post-processors of
search engines’ output, such as WebCorp,1 which are intended to
present data in a “linguist-friendly” format. This approach is particu-
larly apt for investigations on pre-defined linguistic phenomena, e. g.
re-lexicalisation of specific terms (Brekke 2000) or infrequent bigram
identification (Keller/Lapata 2003), but is worse suited when consid-
erations regarding text types or domains are central to the research
questions. For this reason, this approach will not be considered here;
1 <http://www.webcorp.org.uk/>.
30 Silvia Bernardini, Adriano Ferraresi, Federico Gaspari
for a discussion of its advantages and limitations see, e. g., Lüdeling
et al. (2007) and Kilgarriff (2007).
The second paradigm within web-as-corpus research consists
in drawing on the enormous amounts of textual material available on
the web to compile off-line corpora, usually relying on (semi-)auto-
matic text selection and download procedures. These involve either
performing a customised crawl of specific websites, which are con-
sidered as representative of the specific text type / topic of interest,
e. g., the CNN transcripts as instances of broadcast language (Hoffman
2007), or adopting ad hoc strategies relying on search engines to
retrieve potentially relevant web pages, which are then downloaded
for corpus compilation (Baroni / Bernardini 2004, Leturia et al. 2008).
Of course, this approach is not devoid of problems. First, web data
tend to be very noisy, i. e. they contain duplicate pages, pages pro-
duced automatically by servers, and “boilerplate” – a term first intro-
duced by Fletcher (2004) to indicate portions of text which are re-
peated across the pages of a site, e. g. navigation bars, copyright
notices, etc., which tend to distort statistics about corpus composition.
Secondly, and perhaps more crucially, automatic procedures of cor-
pus construction, while making it possible to build relatively large
corpora in little time, allow for limited control over corpus contents.
As Baroni / Ueyama (2006: 2) point out, however, these are typical
problems of “quick and dirty” corpora, rather than web corpora per
se, and it is ultimately the trade-off between costs (in terms of time,
funding, etc.) and quality / usefulness of the resource that should be
considered when deciding what methods are to be adopted for its
construction.
In fact, several papers have been devoted to highlighting the
advantages of using web data (see, e. g., Kilgarriff / Grefenstette 2003,
Fletcher 2004). Web texts are in machine readable format, thus fa-
cilitating their retrieval and processing for inclusion in a corpus; the
web is constantly updated, and has been demonstrated to be a valid
resource for investigating contemporary language usage (Lüdeling et
al. 2007); and it makes available linguistic materials which can hardly
be found elsewhere, i. e., samples of specialised languages and web-
based genres. Often, corpus resources simply do not exist or do not
provide enough samples of specific language varieties, such as, e. g.,
31
medical English (Gatto 2009: 101 ff.). In these circumstances, the
web, with its extreme heterogeneity of text types and topics and its
up-to-dateness, seems to be the obvious place to turn to. As for web-
based genres (e. g. blogs, chat rooms), it has been suggested that they
display peculiar communicative structures which set them apart from
“traditional” written texts (Santini 2007), and thus deserve dedicated
analyses.
An example of research following the second paradigm, and
focusing on institutional academic language is Thelwall (2005). Us-
ing a customised crawler, the author builds a corpus of pages from
university websites of three English-speaking countries (Australia,
New Zealand and the UK), and then carries out a preliminary evalu-
ation of its contents based on a frequency analysis of the most com-
mon words in its three components and in the written component of
the BNC. This is intended to pinpoint differences both across the
national provenances sampled in the corpus, and of academic web
English when compared to “general” English.
2.2 Building acWaC
For the purposes of this study, we needed a relatively large and up-
to-date corpus which would represent contemporary English in the
websites of Italian universities, focusing in particular on “institutional”
communication (Biber 2006). In order to highlight features of this
variety, we also needed a benchmark for comparison, and we settled
on texts published on the websites of British and Irish universities,
taken as a native standard within the European Union. Following
Thelwall (2005) and adopting web-as-corpus semi-automatic con-
struction procedures seemed the most obvious choice to make (cf.
section 2.1). Unlike Thelwall, who performed a customised crawl of
university websites, we used the BootCaT toolkit (Baroni and Bernar-
dini 2004), a set of Perl scripts for rapidly building specialised, ad
hoc corpora. Using BootCaT is less labour intensive and technically
demanding than setting up a personal crawler. The resulting mono-
lingual comparable corpus is called acWaC, i. e., academic Web-As-
Corpus.
Institutional academic English in the European context
32 Silvia Bernardini, Adriano Ferraresi, Federico Gaspari
The first step of the BootCaT procedure consists in manually
identifying relevant “seeds”, i. e. words or word combinations that
are assumed to be characteristic of the language variety of interest.
For research focusing on domain-specific varieties of a language,
seeds are usually key terms of that domain. In our case, however, the
guiding criterion in corpus construction was not gathering texts ac-
cording to their domain. Rather, they had to share particular text prin-
cipals (to borrow Goffman’s (1981) term), i. e. the texts had to be
made available over the web by universities, considered as the enti-
ties responsible for their production (regardless of the actual draft-
ers). A preliminary browsing of the websites of Italian universities
was therefore carried out, in order to assess which of them feature
substantial amounts of English language contents (based on our
checks, slightly more than half do). For the sake of sub-corpus com-
parability, the number of UK institutions had to be narrowed down
with respect to the totality of the available websites. We decided to
include the 20 universities of the Russell Group for the UK,2 and all
Irish universities. The seeds for our search consisted therefore in a
set of URLs corresponding to (English language) websites of the
universities which were identified during this preliminary phase, plus
a few common English words like “the”, “of” and “and”.
In the second step of the procedure, the URLs (used as argu-
ments of the site: operator) and the seed words are submitted as query
terms to a search engine (either Google or Yahoo!). Further heuristics
can be adopted to maximise the precision of the results: we employed
the language filter provided by Google, and excluded pdf files (through
the filetype: operator) so as to limit the impact of disciplinary writing
(e. g. research papers), often published in this format on the web. The
search engine produced a list of URLs for each pre-determined
website, and a maximum of 300 documents per site were downloaded.
Notice that since the procedure relies on search engines’ ranking al-
gorithms, and since we download the first 300 pages, results depend
to a large extent on the way a particular website is indexed by the
search engine itself. It is likely that the pages that will end up in the
2 The Russell Group is an association of top-level universities in the United
Kingdom (<http://www.russellgroup.ac.uk/>).
33
corpus are skewed towards the more “popular” ones (Gatto 2009:
51–52), but this is not considered a problem here, since a) we are
interested in analysing the documents that a student is likely to en-
counter on a university website, and b) the same retrieving procedure
is used for the Italian and UK / Irish sub-corpora (henceforth, respec-
tively, IT-UNI and EN-UNI).
In the final phase, documents were POS-tagged and lemmatised
using the TreeTagger3 and indexed for corpus consultation with the
CorpusWorkBench.4 Table 1 provides size information about the
acWaC corpus.
acWaC
IT-UNI EN-UNI
Tokens 4,228,841 5,435,855
Types 165,037 125,089
Documents 6,745 7,721
University websites sampled 55 28
Table 1. Size information of the acWaC corpus.
Notice that, despite the semi-automatic procedure that was employed,
the size of the two sub-corpora is roughly comparable. A point that
should also be stressed is that one of the major strengths of this pro-
cedure is that it is easily replicable, and can be taken as a model to
rapidly build similar corpora for universities based in other coun-
tries, as well as making it possible to track diachronic changes within
the same websites.
There remains one open issue: how certain are we that the pro-
cedure worked, i. e. it retrieved documents matching our expectations?
When building a corpus from the web using automated queries to a
search engine, one has to always bear in mind that the retrieved docu-
ments may simply not match the targeted contents. While this has
never been a problem in traditional corpus construction, where texts
for the corpus are manually selected, it is a crucial step in web-as-
3 <http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/>.
4 <http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/>.
Institutional academic English in the European context
34 Silvia Bernardini, Adriano Ferraresi, Federico Gaspari
corpus projects. Ideally, one would want to quickly read through all
the documents in the corpus, or at least a large enough number of
them. However, this is hardly feasible for a corpus of about 14,000
documents, as well as being at odds with the rationale behind the
web-as-corpus methodology, which is meant to speed up and ease the
procedure of corpus construction. The next section looks at the method
we employed to evaluate contents and corpus comparability.
2.3 Evaluating acWaC
In this project we adopted two methods of corpus evaluation / com-
parison previously used in the literature. First of all, as an initial quali-
tative step, we randomly selected 200 documents from each sub-cor-
pus, read through them and classified them in terms of (broad) topic /
genre classes. This was a cost-effective procedure that partly followed
Sharoff (2006), yet making no attempt to bring in the burgeoning
literature on genre classification (see e. g. Lee 2001; Santini 2007).
Secondly, we retrieved lists of lemmas and of part of speech sequences
from the two sub-corpora, cleaned them, and compared them using a
statistical measure that reveals the words and lexico-grammatical
structures that are more characteristic of one sub-corpus compared to
the other (Sharoff 2006, Ferraresi et al. 2008).
These exploratory analyses confirmed the overall comparabil-
ity of the two sub-corpora, as well as providing initial evidence of
some differences that were further investigated in the analysis proper
(section 3). In turn, the latter confirmed that the corpus construction
procedure had been successful in retrieving comparable texts repre-
senting the language varieties under investigation, following the con-
struction-evaluation-use virtuous circle advocated by Atkins et al.
(1992) in their classic work on corpus design.
2.3.1 Comparing samples of texts from the sub-corpora
Two random samples of 200 texts were extracted from EN-UNI and
IT-UNI and classified in broadly functional terms. The categories
were developed bottom-up and refined through several rounds of
35
analysis until all the texts under scrutiny had been accounted for. The
results are reported in table 2.
Text category IT-UNI EN-UNI
Description of research centres, departments, committees 40 62
Description of courses, degrees, modules 48 34
News, events, life 26 47
Regulations 17 11
Personal pages 23 14
Disciplinary writing 26 21
Web navigation 03 03
Mixed or unclassifiable 17 8
Table 2. Random samples (200 texts each) from IT-UNI and EN-UNI compared.
The distribution of texts in the different categories is not identical in
the two sub-corpora. The British / Irish part contains more descrip-
tions of bodies within a given institution (faculties, research centres,
committees, etc.) and more pages referring to current events and stu-
dents’ life, while the Italian counterpart contains more pages describ-
ing the courses on offer, regulatory texts (e. g. agreements), personal
pages and disciplinary writing (e. g. academic papers).
This distribution suggests that the English language contents
provided by Italian websites focus on the institution’s educational
offer and on exchange projects (the regulatory texts); that there is an
attempt on the part of individuals to make their research / teaching
activity known internationally, and possibly that more disciplinary
writing gets published on the web in html format than in pdf (remem-
ber that the pdf files were excluded from the crawls). The under-
represented categories also tell us that institutions in Italy seem more
concerned with the “formal” aspects of students’ exchange projects
than with their daily and social life, and, more interestingly perhaps,
that they do not feel as compelled as the UK / Irish institutions to
describe and ultimately advertise themselves (their research and teach-
ing credentials, their facilities, etc.), as opposed to the courses they
offer. Lastly, the higher number of mixed or unclassifiable texts in
IT-UNI testifies to the greater difficulty of harvesting texts in one
Institutional academic English in the European context
36 Silvia Bernardini, Adriano Ferraresi, Federico Gaspari
language from websites that are written mainly in another, as op-
posed to single-language websites. This is an unavoidable consequence
of the methodology and research design, and possibly of the object
of study itself, but would not seem to impact on the comparison to an
extent that could distort the general picture.
Though based on relatively few texts, this comparison has high-
lighted possible differences between the two language varieties rep-
resented in the corpus, to be further investigated in the analysis proper.
In terms of evaluation of the corpus construction methodology, there
is no indication that the automatic procedure has gone awry, yielding
non-comparable sub-corpora.
2.3.2 Comparing word- and ngram-lists
As a second step in the evaluation of corpus contents, we adopted a
methodology that is widely used to assess the composition and char-
acteristic linguistic features of corpora, both web-derived (Sharoff
2006) and “traditional” (Rayson / Garside 2000). This consists in com-
paring a frequency list obtained from the corpus of interest with a
benchmark, using log-likelihood as a statistical association measure,
which, unlike Mutual Information or t-score, has been proved to be
independent of corpus size (Dunning 1993).
We extracted lists of lemmas and of sequences of 3 parts of
speech (henceforth 3-grams) from both sub-corpora, and compared
them using each list in turn as a benchmark for the other. Taken to-
gether, these lists can give us an idea of lexico-grammatical regulari-
ties in the two sub-corpora, and, crucially, reveal salient differences
between them. As a pre-processing step, which was aimed at reduc-
ing the amount of noise in the lists, lemmas unknown to the tagger
were filtered out, along with proper nouns and words containing non-
alphabetical characters. For each (ranked) list pertaining to IT-UNI
and EN-UNI, we generated and analysed concordances for the top 50
entries in the lemma lists and for the top 5 ones in the POS 3-gram
lists, taking them as clues to salient corpus differences in terms of
(broad) topic categories and functional linguistic features.
37
2.3.2.1 Comparing lemma lists
The 50 lemmas more typical of IT-UNI when compared to EN-UNI
can be grouped into three broad topic / function categories (see table
3 for examples). Notice that in this analysis we do not take into account
words appearing in boilerplate sections of the web pages, which ac-
count for nearly 50% of the key lemmas found both for IT-UNI (e. g.
“fax”) and EN-UNI (e. g. “accessibility”). These words typically appear
in portions of text which are repeated identically across different pages
of the same website. While boilerplate text might be an interesting
object of study (for a discussion, cf. section 3.1), the very fact that it
is “repeated text” is most likely to distort frequency data, and hence
blur the analysis of its “typicality” in one corpus (Fletcher 2004).
IT-UNI
1. Institutional activities 2. Relations with other institutions 3. Academic / diciplinary
Credit agreement scientific
Exam cooperation model
Professor company analysis
Cycle field
Table 3. The most typical lemmas of IT-UNI when compared to EN-UNI, split by
topic / function.
The first category of non-boilerplate words in the IT-UNI wordlist is
that of words that are related to what we might call “institutional
activities”, and include, e. g., “credit”, “exam” and “cycle”. These
are mainly found in pages describing the educational offer of univer-
sities, such as degree or module descriptions, i. e. informative / regula-
tory texts aimed at providing (foreign and / or exchange) students with
information about the available programmes. Also notice the presence
within this category of “context-bound” terms, such as “professor” (a
term referring to a more common position in Italy than in the UK /
Ireland), and “cycle” (referring to each stage of an educational path,
leading to e. g. a Bachelor’s or Master’s-level degree). The second
category includes words which are mainly used to refer to relations
with other universities, institutions and private companies. The rela-
tively high frequency of these words in IT-UNI can be accounted for
Institutional academic English in the European context
38 Silvia Bernardini, Adriano Ferraresi, Federico Gaspari
on the basis of a larger presence of texts that regulate, e. g., student
exchanges or internship contracts (cf. section 2.3.1). As an example of
the use of these words, randomly selected concordance lines for the
lemma “agreement” are provided in figure 1. The third category of
typical lemmas in IT-UNI groups words which are characteristic of
academic disciplinary texts, e. g., “scientific”, “model” and “analy-
sis”. These words mainly appear in research articles, suggesting that
the exclusion of pdf files is not a good enough heuristics for guarding
against disciplinary writing, at least for the non-native component (cf.
section 2.2).
in organisations which have an <agreement> with the University
ersity of Cagliari * Bilateral <agreements> with european
erican and african countries * <Agreements> with eastern count
rd Lyon I\x{201D } , France.An <agreement> for co-operation an
peration * Signing Cooperation <Agreements> * Community and
countries and eastern europe * <Agreements> with mediterranean
s Programme , or * an exchange <agreement> . If your home
ersity has signed one of these <agreements> , just take part ,
vities * Health Care * Service <Agreements> * Video and Movie
( Erasmus , Tempus , bilateral <agreement> , etc. ) , the leng
Figure 1. Ten random concordance lines for the lemma “agreement” in IT-UNI.
Moving on to the analysis of key lemmas in EN-UNI, these seem to
be more varied than the ones which were found for IT-UNI, and can
be grouped into four categories (see table 4). At a first glance, the
first category would seem to coincide with that of “institutional ac-
tivities” also found in the Italian sub-corpus. However, a closer analy-
sis reveals that these key lemmas in EN-UNI refer mainly to research
activities (e. g. “postgraduate”, “research”), rather than to the uni-
versities’ educational offer, and are mainly found in informative texts
describing the institution and promoting, e.g., its research achieve-
ments. The “Services / support” category includes words related to
ways in which universities assist students. An example is “funding”,
which is found in pages in which institutions specify the funding
opportunities for both undergraduate and postgraduate study. This is
in line with the results presented in section 2.3.1, according to which
UK / Irish universities seem to be more concerned with students’ life
and welfare than their Italian counterparts.
39
EN-UNI
1. Institutional activities 2. Services / support 3. Evaluative language 4. Function words
Postgraduate Support range your
Research Funding include our
MA Disability
Table 4. The most typical lemmas of EN-UNI when compared to IT-UNI, split by
topic / function
While the categories just described seem to point at differences be-
tween the two corpora in terms of domain and topic representation
(in turn probably reflecting different communicative priorities of Ital-
ian and UK / Irish institutions), categories three and four seem to point
at more strictly linguistic differences. The presence of words like
“range” and “include” among the most typical of EN-UNI requires
an analysis of concordance lines. As can be seen in figure 2, these
words typically occur as part of expressions which depict the oppor-
tunities offered by universities as particularly vast, and thus convey
positive evaluation. Notice that the presence of self-promotional lan-
guage in university websites was also observed by Thelwall (2005:
537; cf. also section 3.1).
and work . It supports a whole <range> of nationalities and cultu
----- The School offers a wide <range> of subjects at undergradua
ost two million books , a wide <range> of periodicals and IT
duate students an unparalleled <range> of expertise and the
oduction Drawing upon the wide <range> of skills and knowledge of
AA ) has an exceptionally wide <range> of chronological interests
taff cover between them a wide <range> of fields and expertise in
area centred on the city . The <range> of community participation
oject staff are constructing a <range> of computer-based , multi-
e period of the course , and a <range> of work placements and pro
Figure 2. Ten random concordance lines for the lemma “range” in EN-UNI.
Lastly, closed-class words such as “your” and “our” resulted among
the most typical of EN-UNI. This was somewhat surprising, insofar
as the proportion of function words tends to remain more stable than
that of lexical words across corpora (Manning / Schütze 1999: 20–
21). The presence of these possessives, as the analysis presented in
Institutional academic English in the European context
40 Silvia Bernardini, Adriano Ferraresi, Federico Gaspari
section 3.3 seems to confirm, can be taken as a clue to the use of a
more “personal” style on the part of UK / Irish universities, which
identify themselves as “we”, and address students as “you”, thus
trying to establish a more “involved” relationship with them.
2.3.2.2 Comparing lists of part-of-speech 3-grams
The comparison of 3-gram lists was carried out as a further means of
evaluating corpus contents. Given the limited space, in this contribution
the method is presented mainly for illustrative purposes, to highlight
the possible types of linguistic information that this analysis can yield
when applied to corpus evaluation. Results are presented in table 5.
POS 3-gram Examples
IT-UNI
To have DET to have the (boilerplate)
Have DET ADJ have the right (boilerplate)
PREP DET ADJ on the other; at the same
PREP ADJ NOUN for foreign students; of foreign languages
DET NOUN NOUN the degree course; the research group
EN-UNI
PERS_PRON be ADV you are here (boilerplate)
PREP POSS_ADJ NOUN of our research, of its kind
’s ADJ NOUN ’s inaugural lecture; ’s leading research
ADV to V back to top,5 here to get (boilerplate)
NOUN PREP POSS_ADJ participation in our; springboard for your
Table 5. The 5 most typical POS 3-grams of IT-UNI compared to EN-UNI and vice-
versa.
After factoring out boilerplate portions of text,6 one of the most strik-
ing differences in the two lists was once again the relative typicality
in the EN-UNI corpus of possessive adjectives, like “our” and “your”,
5 Notice that the presence of this 3-gram here is due to a tagger’s error, which
did not correctly recognise “top” as a noun.
6 E. g. around 80% of the sequences “to + have + DET” and “have + DET +
ADJ” correspond to the phrase “to have the right”, the near totality of which
occurs in a single site.
41
which are absent from the IT-UNI top 5 list (the first 3-gram including
a possessive adjective in the IT-UNI ranked list appears at position 75).
On the other hand, IT-UNI seems to display a more prominent use of
noun phrases, in the form of either adjective-noun or noun-noun se-
quences. Notice that, apart from the “’s + ADJ + NOUN” sequence,
ranked in third position,7 the first sequences including adjective-noun
or noun-noun pairs are both ranked below the 50th position in EN-UNI.
These findings seem to corroborate those presented in the two
previous sections, pointing at more personal and involved style in
native English vs. greater formality in non-native English. Interest-
ingly, this finding confirms results obtained in studies of translational
language (e. g. Olohan 2002). These issues are taken up in section 3.
For the present purposes it is important to note that this second cor-
pus evaluation phase has confirmed results obtained in the first, i. e.
no evidence has emerged of obvious imbalances due to faulty corpus
construction procedures.
3. Institutional academic English in Italy:
a preliminary investigation with acWaC
3.1 Lexical bundles
As a first step in the identification of typical features of institutional
academic English used in Italy, we focused on phraseology, in line
with a well-established tradition in corpus linguistics, investigating
both methodological issues and native language (see e. g. Sinclair
1991, Granger / Meunier 2008) and non-native / learner language (Nes-
selhauf 2004, Meunier / Granger 2008). Given the complexity of this
subject, reflected by the terminological confusion surrounding it (see
7 In this case we suspect that the high ranking is due, rather than to the tipicality
of the adjective-noun sequence, to its use in combination with the genitive
“’s”, a structure which in absolute terms is infrequent in the IT-UNI corpus
(less than 500 occurrences).
Institutional academic English in the European context
42 Silvia Bernardini, Adriano Ferraresi, Federico Gaspari
e. g. Moon 1998 and Wray 2002 for surveys of theoretical constructs
and terms), for the present purposes we did not attempt to extract
theoretically-defined phraseological units. Rather, we adopted the
bottom-up, “corpus-driven” approach (Tognini-Bonelli 2001) pro-
posed by Biber et al. (1999) and subsequently used in several studies
(e. g. Biber 2006, Cortes 2004). This consists in selecting word com-
binations based solely on their frequency in one or more corpora,
regardless of any other parameters (e. g. lexico-grammatical struc-
ture, well-formedness, salience, idiomaticity). These sequences, called
lexical bundles, are then classified in terms of their structure and the
function they play in discourse. Comparing lexical bundles across
the two varieties of English represented in the EN-UNI and IT-UNI
sub-corpora of acWaC gives us first of all an idea of the relative role
of the idiom-principle and of the open-choice principle (Sinclair 1991)
in the discourse production of native and non-native authors / transla-
tors. In other words, we can find out which of these two language
varieties is more formulaic. Since formulaic language has been sug-
gested to play a role in making texts sound more “native-like” (fa-
mously by Pawley / Syder 1983), the lower number of lexical bundles
in IT-UNI would be an indication of less-than-ideal writing strate-
gies at work. Apart from the mere quantitative datum, we can also
compare the lexical bundles present in the two sub-corpora in terms
of their structure types and functions, to get an idea not only of how
many bundles there are, but also of their function. In line with previ-
ous work adopting this notion, we define as a lexical bundle any
uninterrupted sequence of 4 word forms occurring at least 200 times
(i. e., approximately 40 times per million words, henceforth pmw) in
either sub-corpus. In order to reduce the amount of noise resulting
from the automatic corpus building procedure, lemmas whose as-
signed lemmas are unknown to the tagger, proper nouns and sequences
containing non-alphabetical characters are filtered out from the search.
The resulting lists contain 224 (EN-UNI) and 184 (IT-UNI)
bundles respectively. Approximately 90% of these are boilerplate
sequences such as “accessible to any browser” and “all material is
copyright”. While several appear to be intuitively plausible lexicalised
phrases, their relevance to our purposes is unclear (see section 2.3.2).
Indeed, several bundles thus identified are simply the result of the
43
juxtaposition of unrelated words in menus and navigation bars (e. g.
“Symposia Concerts Music Lessons”, “Instruments Public Relations
Associations”). A cursory browsing of the two lists suggests that the
EN-UNI boilerplate is much more structurally complex than that found
in IT-UNI, with several matches corresponding to intuitions about
lexical bundles, while IT-UNI boilerplate sequences mainly consist of
casual noun sequences from menus and navigation bars. This would
suggest greater attention to navigation issues in EN-UNI, an impression
confirmed by the analysis of personal style discussed in section 3.3.
Yet, lacking an objective way of telling “interesting” instances of boiler-
plate apart from “uninteresting” ones, all boilerplate sequences are
discarded from the current analysis. Further investigations shedding
light on the role of set phrases and boilerplate in native and non-native
web writing would be an interesting development of the present work.
Once boilerplate is removed, we are left with 22 bundles in
EN-UNI and 11 in IT-UNI (see table 6). While numbers are very
small, they do suggest that EN-UNI texts make greater use of com-
mon set phrases than IT-UNI texts. Moving on to a structural classi-
fication of these lexical bundles, EN-UNI and IT-UNI similarly fea-
ture almost exclusively phrase-level (rather than clause-level) units
headed by a preposition or a noun. This is unsurprising perhaps, since
phrase-level lexical bundles, and in particular noun and prepositional
phrases, are typical of written expository prose (Biber 2006), distin-
guishing it from casual conversation, which is much richer in clause-
level bundles and verb-phrases.
A functional analysis along the lines of Biber (2006) points at
some more similarities but also at differences. Both sub-corpora fea-
ture no stance bundles (such as “are accountable for all” and “it is
important to”) and quite a few referential bundles (IT-UNI: “at the
University of”, “on the basis of”, “the beginning of the”; EN-UNI:
“at the University of”, “the end of the”, “a wide range of”). However,
EN-UNI is noticeably richer in discourse organising bundles
(6 out of 22 vs. 1 out of 11 in IT-UNI). Most of these (5 out of 6) have
a focusing function in discourse; this function is almost absent from
the IT-UNI bundles (1 occurrence of “one of the most”). Furthermore,
3 out of 5 of these focusing bundles express positive evaluation. “One
of the most”, “is one of the” and “one of the largest” are typically used
Institutional academic English in the European context
44 Silvia Bernardini, Adriano Ferraresi, Federico Gaspari
for singling out some features for which the institution outperforms
competitors. The referential (quantity) bundle “a wide range of” found
in EN-UNI also plays an evaluative role, being used to enumerate the
opportunities offered by a given institution (see figures 2 (above) and
3 for typical examples of use of these bundles in EN-UNI). This pro-
motional aspect is virtually absent from the IT-UNI bundle list.
IT-UNIExamples
EN-UNIExamples
Lexical bundles 184 224
Boilerplate bundles 173 where to find us 202 Skip to this section
Site Map Advanced Go back to top
Search to content The University
Deutsch Home Info
Site
Selected lexical bundles 11 at the University of 22 the end of the
(no boilerplate) as well as the a member of the
the beginning of the one of the largest
Table 6. Examples of lexical bundles from the acWaC corpus.
ality of education , ours <is one of the> leading departments of A
- The Department of Drama <is one of the> leading centres for rese
ocal region . The Faculty <is one of the> UK ’s top social science
------- The School of Law <is one of the> leading centres for rese
science and technology , <is one of the> academics chosen to take
The University of Bristol <is one of the> leading research univers
f Science and Engineering <is one of the> largest and highest-rate
f Psychology at Edinburgh <is one of the> longest-established cour
rgh Nuclear Physics Group <is one of the> most diverse in the coun
The University of Glasgow <is one of the> world ’s top 100 univers
Figure 3. Selected examples of “is one of the” from EN-UNI.
3.2 Stance expressions
The analysis of stance expressions used to express obligation, neces-
sity and volition follows the approach proposed in contribution 5 of
Biber (2006). Clearly, there are innumerable ways of conveying these
meanings, which refer to the expectations of speakers / writers (insti-
tutions in our case) concerning actions to be undertaken by recipients
45
(typically prospective and current students). Here we limit the analy-
sis to those structures which were found by Biber (2006) to charac-
terise non-disciplinary writing in English (institutional writing, syl-
labi, course packs, etc.).
Starting off with modal verbs, “by far the most common gram-
matical device use to mark stance in university registers” (Biber 2006:
95), searches were conducted in the acWaC corpus for the modals
“must”, “should”, “will / ’ll” and the semi-modal “have to” preceded
by the subject pronoun “you” or the noun “students”. The passive
construction “students / you are verb-ed to” (as in “students / you
are expected to [do x]”) was also included to provide a wider spec-
trum of indirect means of expressing obligation / necessity / volition.
Results (pmw) are given in figure 4. The various structures are or-
dered according to their modal strength, going from the most direct,
i. e. “must” to the most indirect, i. e. “be verb-ed to”.
Figure 4. Distribution of stance expressions conveying obligation / necessity / volition
in acWaC.
While we cannot be sure that all the matches, particularly those ob-
tained for the more indirect constructions, are used to express the
modal meanings we are focusing on, there seems to be a clear pattern
emerging from the comparison of the two sub-corpora. IT-UNI fa-
vours the more direct stance expressions “must” and “have to”, while
EN-UNI makes greater use of the more indirect means of expressing
obligation, i. e. “should”, “will” and the passive.
050
100150200250300350400450500
Stude
nts / y
ou m
ust
Stude
nts / y
ou ha
ve to
Stude
nts / y
ou sh
ould
Stude
nts / y
ou will
| 'll
Stude
nts / y
ou ar
e Ver
b-ed
to
fq p
mw
EN-UNI
IT-UNI
Institutional academic English in the European context
46 Silvia Bernardini, Adriano Ferraresi, Federico Gaspari
Further searches for even more indirect stance expressions such
as the extraposed construction “it is adjective to verb” confirm the
trend just observed. This search, targeting expressions like “it is neces-
sary / possible / important to [x]” shows that, while IT-UNI makes more
frequent use of this (indirect) pattern than EN-UNI (136 pmw vs. 94),
it also selects more direct lexical “fillers”. A comparison (table 7) of
the top three adjectives found in this pattern in the two sub-corpora
shows that in IT-UNI “necessary” is employed more frequently in this
pattern than “important”, while the reverse is true of EN-UNI. Verb
collocates of “it is necessary to verb” also provide interesting insights
into the divergent attitude of institutions sampled in the two sub-
corpora (see table 8). While the texts in EN-UNI use this construction
mainly to convey needs and requirements of a more intellectual / acade-
mic nature (“examine”, “undertake”, “question”, “know”), IT-UNI
texts use it preferably for administrative / normative requirements
(“accumulate”, “possess”, “prove”, “obtain”, etc.).
EN-UNI absolute fq fq pmw IT-UNI absolute fq fq pmw
Possible 107 (19.6) possible 211 (49.8)
important 94 (17.2) necessary 77 (18.2)
necessary 37 (6.8) important 43 (10.1)
Table 7. Top 3 adjectives found in the pattern “it is adjective to verb”.
EN-UNI absolute fq IT-UNI absolute fq
examine, 3 Have 12
undertake, Accumulate 6
have
maintain, 2 Be 4
it is necessary to carry, possess, 3
question, submit, send
take, know read, prove, 2
obtain, specify,
determine,
demonstrate,
book, attend,
apply, use, present
Table 8. Verb collocates of “it is necessary to verb” (fq >1).
47
Finally, and even more indirectly, obligation, volition and necessity
can be expressed by means of impersonal passive constructions with
will, that do not explicitly identify either the authority enforcing a
rule nor the persons expected to comply (Biber 2006: 125). The re-
sults of a search for “noun will be verb”, from which all animate
subjects have been manually filtered out, show (table 9) that EN-
UNI texts often use this construction to express obligation in the most
indirect way (5/10), while IT-UNI texts use it preferably for referring
to future time (8/10), and only infrequently with its stance-express-
ing modal meaning (2/10) – underlining in table 9 indicates such
stance-marking instances.
EN-UNI absolute fq IT-UNI absolute fq
election will be held 18 attention will be paid 22
conference will be held 13 attention will be given 17
mark will be applied 12 message will be sent 14
points will be deducted 12 course will be held 13
attention will be paid 11 lessons will be held 12
preference will be given 11 preference will be given 12
course will be taught 10 priority will be given 12
interviews will be conducted 10 scholarship will be reimbursed 12
emphasis will be placed 9 workshop will be held 12
essay will be carried 9 agreement will be signed 10
Table 9. Top 10 phrases matching the pattern “noun will be verb” (animate subjects
removed, lemmatised)
3.3 Personal style
The third part of our analysis investigates personal style, i. e. those
cases in which the institution addresses itself as “we”, refers to the
students as “you”, and / or uses imperative verb forms (Biber 2006:
129–130). First, a search for “we verb [that] you” was made in acWaC.
Results suggest that this pattern is used more frequently in the native
component of the corpus (87 occurrences in EN-UNI vs. 62 in IT-
UNI). The distribution of verbs in this pattern is even more revealing
Institutional academic English in the European context
48 Silvia Bernardini, Adriano Ferraresi, Federico Gaspari
of the institutions’ attitude. If we focus on the top collocate in EN-
UNI, “hope” (26 occurrences vs. 2 in IT-UNI), this is often used to
express commitment by the institution to ensure students’ satisfac-
tion (see figure 5). On the other hand, the top collocate in IT-UNI,
“advise” (0 occurrences in EN-UNI in this pattern), is typically used
to give instructions (see figure 6), and would not seem to express any
of the concern and involvement displayed in EN-UNI.
llege Admission Offices . <We hope you> find it useful . More pu
tmosphere here in Leeds . <We hope you> enjoy your visit ! Profe
context of health SWAps . <We hope you> find the resource pack e
of further information . <We hope you> will take pleasure in se
s to support your claim . <We hope you> enjoy the module ! The S
eving your true potential <we hope you> will be inspired to make
to refresh our website . <We hope you> will find the content in
ge in scientific debate . <We hope you> will join the society ,
eryone who came along and <we hope you> enjoyed the experience .
es our winter programme <we hope you> will enjoy the talks and
our returning students , <we hope you> have had a good Summer .
Figure 5. Selected examples of “we hope you” in EN-UNI.
vered . For this reason , <we advise you> to explore the neighbo
y yourself . In this case <we advise you> to come to Rome at lea
ted firms . In addition , <we advise you> to have a look at our
licant will be rejected . <We advise you> to send the package us
e cost of all the bills . <We advise you> not to contact private
e health insurance NOTE : <We advise you> to contact the Italian
regards projects 6 and 7 <we advise you> to look up the page \x
Italian admittance exam . <We advise you> to ask the Welcome Off
For abstract preparation <we advise you> to download the abstra
validity of your diploma <we advise you> to contact the Italian
Figure 6. Selected examples of “we advise you” in IT-UNI.
Lastly, the search for a verb base form following a sentence break,
which is used to retrieve verbs in the imperative mode, returns twice
as many matches in EN-UNI than in IT-UNI (3,690 (or 679 pmw) in
EN-UNI vs. 509 (or 357 pmw) in IT-UNI). This is partly due to a
greater attention paid to web navigation in the native sub-corpus,
confirming the results obtained when analysing lexical bundles (sec-
tion 3.1 above). This finding also supports, however, the impression
49Institutional academic English in the European context
that non-native authors and translators (or possibly the institutions
they represent) shun the more personal, involved, informal style that
is so often used in the native corpus.
4. Conclusion and ways forward
This contribution has presented a methodology for constructing spe-
cialised corpora from the web semi-automatically, applied it to the
construction of a monolingual comparable corpus of websites repre-
senting native (British and Irish) and non-native (Italian) varieties of
institutional academic English, and exemplified its use for shedding
light on the attitudinal and stylistic preferences opposing a native
variety from a lingua franca variety.
In today’s context of increasing competition, collaboration and
exchange among institutes of higher education on a global scale, the
importance of the efforts brought about by the “Bologna Process” is
undeniable both for individual universities and for the national aca-
demic systems of all participating countries. Similarly to what hap-
pens in other non-English speaking countries, Italian universities are
under increasing pressure to make their courses accessible to an in-
ternational public, using English as the medium of instruction and as
the means to conduct overseas student recruitment campaigns. Strat-
egies for academic internationalisation are aggressively pursued at
government and ministerial level, and feature high on the agenda of
individual institutions.
Judging from our analysis, however, the communication strat-
egies put into place, as well as the selection of English language con-
tents made available to an international audience may be improved
and enhanced. While not claiming that Italian university websites
should conform to foreign standards that may not be appropriate for
their specific context, a comparative corpus analysis like the one pre-
sented here would seem to provide a repository of underused writing
strategies and alternative turns of phrase that might be of help to non-
native authors and translators, depending on the circumstances.
50 Silvia Bernardini, Adriano Ferraresi, Federico Gaspari
In the near future we plan to add other non-native components
to our comparable corpus, so as to account for institutional academic
English in European countries other than Italy, in which English is
used (more or less extensively) as a lingua franca. This would allow
us to gain a better idea of the full range of variation in this specialised
language variety. Secondly we hope to provide a more thorough de-
scription of the lingua franca variety of English used on Italian aca-
demic websites, and to use these insights to develop a corpus-based
writing aid for translators and non-native authors working in this field.
References
Atkins, Sue / Clear, Jeremy / Ostler, Nicholas (1992). ‘Corpus de-
sign criteria’. Literary and linguistic computing. 7(2): 1–16.
Baroni, Marco / Bernardini, Silvia (eds.) (2006). WaCky! Working
Papers on the Web as Corpus. Bologna: GEDIT.
Baroni, Marco / Bernardini, Silvia (2004). ‘BootCaT: Bootstrapping
corpora and terms from the web’. Proceedings of LREC 2004.
1313–1316.
Baroni, Marco / Ueyama, Motoko (2006). ‘Building general- and spe-
cial-purpose corpora by Web crawling’. Proceedings of the 13th
NIJL International Symposium. 31–40.
Biber, Douglas (2006). University language. A corpus-based study of
spoken and written registers. Amsterdam: Benjamins.
Biber, Douglas / Kurjian, Jerry (2007). ‘Towards a taxonomy of Web
registers and text types: A multidimensional analysis’. In Hundt
et al. 109–131.
Biber, Douglas / Johansson, Stig / Leech, Geoffrey / Conrad, Susan /
Finegan, Edward (1999). Longman grammar of spoken and
written English. Harlow: Longman. 109–132.
Brekke, Magnar (2000). ‘From the BNC towards the cybercorpus: A
quantum leap into chaos?’. In Kirk, John (ed.) Corpora ga-
lore: Analyses and techniques in describing English. Papers
from the 19th International Conference on English Language
51Institutional academic English in the European context
Research on Computerised Corpora. Amsterdam: Rodopi. 227–
247.
Castagnoli, Sara (2006). ‘Using the web as a source of LSP corpora
in the terminology classroom’. In Baroni, Marco / Bernardini,
Silvia (eds.) 159–172.
Cortes, Viviana (2004). ‘Lexical bundles in published and student
disciplinary writing: Examples from history and biology’. Eng-
lish for specific purposes. 23: 397–423.
Dunning, Ted (1993). Accurate methods for the statistics of surprise
and coincidence. Computational linguistics. 19(1): 61–74.
Evert, Stefan / Kilgarriff, Adam / Sharoff, Serge (eds.) (2008). Pro-
ceedings of the 4th Web as Corpus Workshop – Can we beat
Google? Marrakech, 1 June 2008.
Fairclough, Norman (1993). ‘Critical discourse analysis and the
marketisation of public discourse: The universities’. Discourse
and society. 4(2): 133–168.
Ferraresi, Adriano / Zanchetta, Eros / Baroni, Marco / Bernardini,
Silvia (2008). ‘Introducing and evaluating ukWaC, a very large
web-derived corpus of English’. In Evert et al. 47–54.
Fletcher, William (2004). ‘Making the web more useful as a source
for linguistic corpora’. In Connor, Ulla / Upton, Thomas (eds.)
Corpus Linguistics in North America 2002. 191–205.
Gatto, Maristella (2009). From body to web. An introduction to the
web as corpus. Bari: Laterza.
Goffman, Erving (1981). Forms of talk. Philadelphia: University of
Pennsylvania Press.
Granger, Sylviane / Meunier, Fanny (eds.) (2008). Phraseology: An
interdisciplinary perspective. Amsterdam: Benjamins.
Hoffman, Sebastian (2007). ‘From web-page to mega-corpus: The
CNN transcripts.’. In Hundt et al. 69–85.
Hundt, Marianne / Nesselhauf, Nadja / Biewer, Carolin (eds.) (2007).
Corpus linguistics and the web. Amsterdam: Rodopi.
Jenkins, Jennifer (2007). English as a lingua franca: Attitude and
identity. Oxford: Oxford University Press.
Keller, Frank / Lapata, Mirella (2003). ‘Using the Web to obtain fre-
quencies for unseen bigrams’. Computational linguistics. 29(3):
459–484.
52 Silvia Bernardini, Adriano Ferraresi, Federico Gaspari
Kilgarriff, Adam (2007). ‘Googleology is bad science’. Computa-
tional linguistics. 33(1): 147–151.
Kilgarriff, Adam / Grefenstette, Gregory (2003). ‘Introduction to the
special issue on the web as corpus’. Computational linguistics.
29(3): 333–347.
Lee, David Y. W. (2001). ‘Genres, registers, text types, domains, and
styles: Clarifying the concepts and navigating a path through
the BNC jungle’. Language, learning and technology. 5(3):
37–72.
Leturia, Igor / San Vicente, Inaki / Saralegi, Xabier / Lopez de Lacalle,
Maddalen (2008). ‘Collecting Basque specialized corpora from
the web: Language-specific performance tweaks and improving
topic precision’. In Evert et al. 40–46.
Lüdeling, Anke / Evert, Stefan / Baroni, Marco (2007). ‘Using Web
data for linguistic purposes’. In Hundt et al. 7–24.
Manning, Chris / Schütze, Hinrich (1999). Foundations of statistical
natural language processing. Cambridge, MA: MIT Press.
Mauranen, Anna (2003). ‘The corpus of English as Lingua Franca in
academic settings’. TESOL Quarterly. 37 (3): 513–527.
Meunier, Fanny / Granger, Sylviane (eds.) (2008). Phraseology in
foreign language learning and teaching. Amsterdam: Benja-
mins.
Moon, Rosamund (1998). Fixed expressions and idioms in English.
Oxford: Oxford University Press.
Nesselhauf, Nadja (2004). Collocations in a learner corpus. Amster-
dam: Benjamins.
Olohan, Maeve (2002). ‘Leave it out! Using a comparable corpus to
investigate aspects of explicitation in translation’. Cadernos
de Tradução. 9: 153–169.
Pawley, Andrew / Syder, Frances (1983). ‘Two puzzles for linguistic
theory: Nativelike selection and nativelike fluency’. In Richards,
Jack / Schmidt, Richard (eds.) Language and communication.
New York: Longman. 191–226.
Rayson, Paul / Garside, Roger (2000). ‘Comparing corpora using fre-
quency profiling’. In Proceedings of the workshop on compar-
ing corpora, Hong Kong, October 2000. 1–6.
53Institutional academic English in the European context
Santini, Marina (2007). ‘Characterizing genres of web pages: Genre
hybridism and individualization’. In Proceedings of the 40th
Hawaii International Conference on System Sciences. 1–10.
Seidlhofer, Barbara (2001). ‘Closing a conceptual gap: The case for a
description of English as a lingua franca’. International jour-
nal of applied linguistics. 11: 133–158.
Sharoff, Serge (2006). ‘Creating general-purpose corpora using au-
tomated search engine queries’. In Baroni, Marco / Bernardini,
Silvia (eds.). 63–98.
Sinclair, John McHardy (1991). Corpus, concordance, collocation.
Oxford: Oxford University Press.
Thelwall, Mike (2005). ‘Creating and using web corpora’ International
journal of corpus linguistics. 10(4): 517–541.
Tognini-Bonelli, Elena (2001). Corpus linguistics at work. Amster-
dam: Benjamins.
Wray, Alison (2002). Formulaic language and the lexicon. Cambridge:
Cambridge University Press.