
Moving Beyond Words: Challenges in Automated Corpus Analysis and Some Solutions

A Brief Prepared for the Discussants

Jeff Elmore, W. Jill Fitzgerald, & Michael F. Graves

Paper presented at the annual meeting of the American Educational Research Association, April 2015.


Many theoretical and operational challenges exist when conducting corpus research using computational methods. We focus on challenges related to the use of words as the lexical unit of analysis. We describe three significant challenges and discuss how they might be addressed: 1) semantic relationships between words are at least as important as the meaning evoked by individual word forms in isolation; 2) meaning is often carried by multi-word units, not just single words, particularly for content-area terms; and 3) most words have multiple meanings, sometimes related, sometimes totally distinct. To address the first challenge, we present a new type of vector-space language model, describe its properties with some examples, and present an application of it to improve the selection of vocabulary words. For challenges two and three, we provide some background and a brief description of our proposed solutions in each case. Solutions are based on a hybrid approach of combining manually constructed lexical resources with automated computational procedures. Note: due to limited time, only the first of these challenges will be presented at the conference.

Perspective

The notion of treating words as the primary carriers of meaning in language seems fairly straightforward, but defining just what a word is and how it relates to meaning, both in language and in the real world, is incredibly complex (Paradis, 2012). Additionally, while previous corpus-based vocabulary work could benefit from direct human involvement, with limited sampling of texts and sometimes significant manual coding (Carroll, 1971; Zeno, 1995; Marzano, 2004), larger corpora require researchers to rely on automated approaches. Consequently, incorporating more of the complexity and nuance of language use into corpus analyses requires that we do so in a computationally achievable manner. Note that incorporating human expertise is still possible and very important; in many cases this can be done by combining automated procedures with manually constructed lexical resources (Manning & Schütze, 1999).

Challenge 1: Semantic Interrelatedness

Background

Because words are inherently relational, analyzing the frequency of individual lexical units is inherently limited. John Rupert Firth famously said, "you shall know a word by the company it keeps" (Firth, 1957). This insight has been further developed by many scholars (Harris, 1954; Weaver, 1955; Furnas et al., 1983; Deerwester et al., 1990) and has achieved considerable attention in the education field in the form of Latent Semantic Analysis (LSA) (Landauer & Dumais, 1997). Recently, artificial intelligence researchers have developed a new class of methods called neural probabilistic language models (Bengio, Schwenk, Senécal, Morin, & Gauvain, 2006) for computing vector-space word representations, like those used in LSA. As in LSA, neural language models are developed using large unstructured corpora. Whereas vector representations of words in LSA are calculated by performing a mathematical transformation of a term-document matrix, in neural language models vector representations of words are estimated by optimizing weights in a multi-layer neural network to predict each word in the corpus using its immediate context. Probabilistic neural language models have been shown to outperform similar statistical vector-space models like LSA and PLSA on various semantic tasks such as word similarity and analogies (Baroni et al., 2014). One particularly successful neural language model has been Word2Vec (Mikolov, Chen, Corrado, & Dean, 2013a). To provide some more tangible evidence of the capacity of neural language models to capture meaningful semantic information, we provide two examples below demonstrating the model's capability to capture semantic similarity between words and to model semantic relationships between pairs of words.
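To make the prediction-based training concrete, the following is a deliberately tiny sketch of skip-gram training with negative sampling in plain NumPy. The corpus, dimensionality, learning rate, and epoch count are invented for illustration; no claim is made that this matches the configuration of Word2Vec or the models used in our analyses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus and vocabulary (illustrative only; real models train on billions of tokens).
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}

V, D = len(vocab), 8                 # vocabulary size, embedding dimensionality
W_in = rng.normal(0, 0.1, (V, D))    # "input" vectors (the learned word embeddings)
W_out = rng.normal(0, 0.1, (V, D))   # "output" (context) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Skip-gram with one negative sample per positive pair: push a word's vector
# toward the vectors of observed context words and away from random words.
lr, window = 0.1, 2
for epoch in range(100):
    for pos, word in enumerate(corpus):
        w = index[word]
        for off in range(-window, window + 1):
            if off == 0 or not 0 <= pos + off < len(corpus):
                continue
            targets = [(index[corpus[pos + off]], 1.0),  # observed context word
                       (int(rng.integers(V)), 0.0)]      # random negative sample
            for ctx, label in targets:
                grad = sigmoid(W_in[w] @ W_out[ctx]) - label  # logistic-loss gradient
                g_in = grad * W_out[ctx]                      # saved before W_out changes
                W_out[ctx] -= lr * grad * W_in[w]
                W_in[w] -= lr * g_in

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Words sharing contexts ("cat"/"dog" both precede "sat on the ...") should
# drift toward each other relative to unrelated pairs, though a corpus this
# small gives no guarantees.
print(cosine(W_in[index["cat"]], W_in[index["dog"]]))
```

The two-matrix setup (separate input and context vectors) mirrors the structure described above: each word is predicted from its immediate context, and the optimized input vectors become the word representations.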

Example 1: How a Neural Language Model Can Address Semantic Similarity

As with LSA, the intuition behind a neural language model is that similar words will share similar locations in a semantic space. A common practice for visualizing words in the semantic space is to use Principal Components Analysis (PCA) to reduce the dimensionality of the semantic space from hundreds of dimensions down to two. Figure 1 shows some examples of words related to one another in various degrees.

Figure 1: Example of Word2Vec Semantic Space for Similar Words

The language model has captured some apparently meaningful spatial relationships. Marine life is grouped into the top right corner, animals associated with domestication are grouped to the left, and extinct animals are grouped in the bottom right.
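The PCA projection used for figures like Figure 1 can be sketched in a few lines of NumPy via the singular value decomposition. The "embeddings" below are random stand-ins for illustration; in practice the rows would be vectors from a trained model such as Word2Vec.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in embeddings: 10 hypothetical "words" with 300-dimensional vectors.
words = ["whale", "shark", "dolphin", "dog", "cat", "horse",
         "mammoth", "dodo", "trilobite", "squid"]
embeddings = rng.normal(size=(len(words), 300))

def pca_2d(X):
    """Project rows of X onto their first two principal components via SVD."""
    X_centered = X - X.mean(axis=0)      # PCA requires mean-centered data
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T         # coordinates in the top-2 PC plane

coords = pca_2d(embeddings)
for word, (x, y) in zip(words, coords):
    print(f"{word:>10}: ({x:+.2f}, {y:+.2f})")
```

The two resulting coordinates per word are what get plotted; by construction, the first axis captures the largest share of variance in the vectors.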

Example 2: How a Neural Language Model Can Address Semantic Relationships

In addition to modeling the similarity of words, which is a specific type of semantic relationship, Word2Vec seems capable of modeling a variety of semantic relationships between pairs of words. We demonstrate the capacity of Word2Vec to capture semantic relationships by presenting a collection of word pairs that share the same semantic relationship and showing that a consistent spatial relationship exists between the pairs. Figure 2 shows a set of countries and their capital cities with lines connecting cities to their countries.

Figure 2: Spatial Relationship between Countries and Capital Cities from Word2Vec

The spatial relationship represented by the roughly parallel lines can be described by simple algebraic operations on word vectors. For example, in the expression Moscow - Russia + England = X, X represents a point in semantic space very near London. This is equivalent to a verbal analogy question of the form, Russia is to Moscow as England is to what? Word2Vec shows improvements in performance for analogy tasks over previous vector-space language models (Baroni et al., 2014).

It is also notable that Principal Component 1 seems to represent a geographical dimension, with countries spanning roughly from South Asia, through Russia, into Europe.
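The vector-arithmetic analogy above can be sketched with hand-built toy vectors. These two-dimensional vectors are invented purely to make the geometry visible and are not real Word2Vec output; the nearest-neighbor step (maximizing cosine similarity to b - a + c, excluding the query words) is the standard way such analogies are evaluated.

```python
import numpy as np

# Hand-built toy vectors: dimension 0 loosely encodes "country vs. capital" and
# dimension 1 a geographic position, so each capital = its country + a shared offset.
vectors = {
    "Russia":  np.array([1.0, 4.0]),
    "Moscow":  np.array([3.0, 4.1]),
    "England": np.array([1.0, 1.0]),
    "London":  np.array([3.0, 1.2]),
    "France":  np.array([1.0, 2.0]),
    "Paris":   np.array([3.0, 2.1]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c, exclude):
    """Solve 'a is to b as c is to ?' by nearest neighbor to b - a + c."""
    x = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(vectors[w], x))

# Russia : Moscow :: England : ?
answer = analogy("Russia", "Moscow", "England",
                 exclude={"Russia", "Moscow", "England"})
print(answer)  # → London
```

Because each capital sits at (roughly) its country plus the same offset, the difference vector Moscow - Russia transfers cleanly to England, landing near London.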

Semantic Interrelatedness: Proposed Solution

Incorporating semantic relationships between words is too broad a challenge to propose a single specific solution. In general we are optimistic that Word2Vec or a similar language model will have numerous applications in our work, from addressing issues of polysemy to helping identify domain-specific vocabulary. We present one example below showing the potential for exploiting knowledge about semantic similarities between words to assist in selecting words for vocabulary instruction.

A substantial portion of words are learned incidentally while reading, and attributes of the words and aspects of the contexts in which they appear can impact the likelihood of incidental word learning (Nagy, 1985). Contexts in which target words are surrounded by familiar related words are more supportive of incidental word learning (Sternberg, 1987). An indication of which words tend to appear in supportive contexts and which do not could be useful in identifying specific words to target for direct instruction. We present a proof-of-concept demonstration of such an approach.

Application of Language Model: Quantifying Contextual Supportiveness

We employ 300-dimensional vector representations of words calculated by the Word2Vec framework (Mikolov et al., 2013a), developed using 100 billion words of news articles from Google, and age-of-acquisition ratings for ~50,000 words (Kuperman, 2012) to quantify the semantic supportiveness of a given context for a particular target word. Semantic supportiveness simply means the degree to which a certain context would improve, even slightly, implicit learning of concepts associated with a target word within that context.

Operationally, supportiveness is defined as the sum of cosine similarities (a commonly employed spatial distance metric) between the target word and all words in the context around the target word that satisfy two requirements: a) the word must have an age-of-acquisition rating below that of the target word, and b) the word must have a cosine similarity to the target word of at least 0.3.
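The operational definition above can be sketched directly. The vectors and age-of-acquisition values below are invented toy stand-ins; the real analysis uses the 300-dimensional Word2Vec vectors and the Kuperman ratings.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy stand-ins for the real resources (values invented for illustration).
dims = 50
vectors = {w: rng.normal(size=dims) for w in
           ["democracy", "government", "citizens", "laws", "force", "country"]}
# Make "government" and "laws" artificially similar to "democracy".
vectors["government"] = vectors["democracy"] + 0.3 * rng.normal(size=dims)
vectors["laws"] = vectors["democracy"] + 0.5 * rng.normal(size=dims)

aoa = {"democracy": 10.9, "government": 8.5, "citizens": 8.0,
       "laws": 6.9, "force": 6.0, "country": 4.9}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def supportiveness(target, context_words, min_sim=0.3):
    """Sum cosine similarities of context words that are (a) acquired earlier
    than the target and (b) at least min_sim similar to it."""
    total = 0.0
    for w in context_words:
        if w == target or w not in vectors or w not in aoa:
            continue
        if aoa[w] >= aoa[target]:
            continue                      # requirement (a): earlier-acquired only
        sim = cosine(vectors[target], vectors[w])
        if sim >= min_sim:                # requirement (b): similarity threshold
            total += sim
    return total

print(supportiveness("democracy", ["government", "citizens", "laws", "force"]))
```

Only context words that are both familiar (earlier-acquired) and semantically related contribute, so unrelated or harder words leave the score unchanged, mirroring the contrast among the three example sentences below.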

Consider the target word democracy in several different contexts:

● Example 1: Democracy is a system of government whereby citizens are given some representative voice in which laws are passed.

● Example 2: Democracy cannot be installed by an external force. It must grow organically within a country.

● Example 3: Democracy cannot be installed by an external force. It must grow organically.

The supportiveness values for the three example sentences are 0.78, 0.36, and 0 for examples 1, 2, and 3 respectively. It seems plausible that example 1 would be more helpful in improving a reader's knowledge of the target word than examples 2 and 3.


Challenge 2: Multi-Word Units

Background

Multi-word units range from proper nouns and technical terms to common phrases and idioms to any statistically improbable combination of words (Sag et al., 2001), and there is wide variety in the terminology employed and methods used in their analysis (Moon, 1998; Cowie, 1998; Sag et al., 2001; Danielsson, 2007; Baldwin & Kim, 2010). A point of consensus among researchers is that multi-word units of any kind are typically underrepresented in lexical resources (Pavel, 1993). For now, we are focused on multi-word units corresponding to domain-specific academic vocabulary, but other aspects of the analysis of multi-word units are potentially interesting for future work.

Multi-Word Units: Proposed Solution

Work has already been done to identify content-area terminology (Marzano, 2004). Additionally, many techniques exist for extracting general multi-word units (Mikolov, 2013c) and domain-specific terms from large corpora (Hartmann, 2012). Manually constructed lists are unlikely to capture all of the terms in a given corpus, while automated approaches are likely to identify erroneous multi-word units. Through a combination of lists of specific terms and results from automated analyses, we hope to generate a comprehensive list of relevant content-specific terms.
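One widely used automated extraction technique, in the spirit of the phrase-scoring heuristic in Mikolov et al. (2013c), scores each bigram as (count(a b) - delta) / (count(a) * count(b)), where delta discounts rare accidental co-occurrences. The corpus, stopword list, and threshold below are invented for illustration.

```python
from collections import Counter

# Tiny invented corpus in which "supreme court" recurs as a unit.
corpus = ("the supreme court ruled that the supreme court can review "
          "state court decisions and the supreme court did").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
stopwords = {"the", "that", "and", "can", "did"}   # function words rarely form terms

def phrase_score(a, b, delta=1):
    """Discounted co-occurrence score: high when a and b occur together
    more often than their individual frequencies would suggest."""
    return (bigrams[(a, b)] - delta) / (unigrams[a] * unigrams[b])

threshold = 0.05
candidates = {(a, b): phrase_score(a, b)
              for (a, b) in bigrams
              if a not in stopwords and b not in stopwords}
phrases = sorted((p for p, s in candidates.items() if s > threshold),
                 key=lambda p: -candidates[p])
print(phrases)  # → [('supreme', 'court')]
```

Bigrams that occur only once are zeroed out by the delta discount, which is exactly the kind of automated filter that, combined with manually constructed term lists, helps suppress erroneous multi-word units.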

Challenge 3: Polysemy

Background

It is well understood that words have many different meanings, perhaps infinitely many, depending on the context in which they are used. Many word senses (meanings) are closely related; for example, literal and figurative meanings exist for many words. However, words also have multiple meanings that are totally distinct, for example mean as in unkind, mean as in what some word represents, and mean as in average. In these cases in particular it would be valuable to differentiate word senses in our analyses.

The process of automatically identifying the intended meaning of a particular word in a text based on its context is called Word Sense Disambiguation (WSD). Most WSD solutions involve two parts: a database of word senses (meanings), usually called a sense inventory, and an algorithm for selecting a particular sense for a word based on context.
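A minimal sketch of the second component is a simplified Lesk-style algorithm, which picks the sense whose gloss shares the most words with the context. The two-sense inventory for "bank" below is a toy stand-in for a real sense inventory such as WordNet; real WSD systems are considerably more sophisticated.

```python
# Toy sense inventory: sense ids mapped to short glosses (definitions).
SENSE_INVENTORY = {
    "bank": {
        "bank#finance": "a financial institution that accepts deposits and lends money",
        "bank#river": "sloping land beside a body of water such as a river",
    }
}

def disambiguate(word, context):
    """Return the sense id whose gloss shares the most words with the context
    (simplified Lesk: gloss/context word overlap)."""
    context_words = set(context.lower().split())
    senses = SENSE_INVENTORY[word]
    def overlap(sense_id):
        return len(set(senses[sense_id].split()) & context_words)
    return max(senses, key=overlap)

print(disambiguate("bank", "they walked along the bank of the river"))  # → bank#river
```

Gloss-overlap methods like this are one reason the granularity of the sense inventory matters so much: with eighteen finely split senses, the overlap signal is spread thin, whereas a coarser inventory gives each sense a more distinctive gloss.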

WordNet is a large lexical database of English word senses that is commonly used in WSD systems as a sense inventory. In WordNet, nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations (Fellbaum, 1998).

Unfortunately, state-of-the-art performance in automatic word sense disambiguation is relatively poor (Navigli, 2009). One reason for the poor performance is the fine-grained distinctions made by popular word sense inventories like WordNet. For example, the word 'bank' has eighteen senses in WordNet; for practical purposes, four or five is probably more reasonable. Although polysemy presents many challenges to consider, we are choosing to focus on developing a more appropriate sense inventory as a first step to addressing polysemy in our analyses. Specifically, we want to group related meanings in WordNet into larger categories; for example, these two senses for the word 'bank' from WordNet would be combined: "sloping land (especially the slope beside a body of water)" and "a long ridge or pile."

Several automated approaches for combining semantically related WordNet senses were evaluated and found to be inadequate for our purposes (Navigli, 2006; Snow, 2007).

Polysemy: Proposed Solution

Instead of a fully automated approach, we have developed a crowdsourcing application for combining similar senses within WordNet. Users of the application are presented with a list of word sense "glosses" (short definitions) and are asked to combine senses by dragging and dropping more specific senses under more general related senses. We will then analyze the judgments on relatedness of word senses for two purposes: 1) to establish consensus groupings of word senses, with the top level representing the most general and distinct senses of a word, and 2) to assess the degree of variability in users' groupings of word senses. Figure 3 shows an example of the word-sense clustering application for the word-form bass.


Figure 3: Example of Word-Sense Clustering Application for the Word-Form bass

Using WordNet as a starting point for a coarser-grained word-sense inventory has the advantage that we can leverage all of the resources within WordNet and other work based on WordNet, while producing a version that is more suited to our purposes.

Conclusion

The challenges discussed in this paper are all a result of the fact that, at present, computers are unable to read and understand text. There are obvious limits to the insights we can gain from analyzing corpora with machines that lack comprehension. Of course, corpus analysis is still an incredibly powerful tool for understanding the world. Even something as simple as word frequency has had profound effects on education research and practice. Recently, however, much more sophisticated techniques have become available due to significant advances in the fields of computational linguistics and artificial intelligence over the past few decades. This paper described an on-going effort to apply state-of-the-art techniques in computational linguistics to advance corpus-based analyses in education, however slowly, towards text understanding. Each step has the potential to produce more meaningful and relevant results to inform research and practice.


References

Baldwin, T., & Kim, S. N. (2010). Multiword expressions. In N. Indurkhya & F. J. Damerau (Eds.), Handbook of natural language processing (2nd ed.). Boca Raton, FL: CRC Press, Taylor and Francis Group. ISBN 978-1420085921.

Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 238–247). Baltimore, MD: Association for Computational Linguistics.

Bengio, Y., Schwenk, H., Senécal, J.-S., Morin, F., & Gauvain, J.-L. (2006). Neural probabilistic language models. In Innovations in machine learning (pp. 137–186).

Carroll, J. B., Davies, P., & Richman, B. (1971). The American Heritage word frequency book. New York: Houghton Mifflin.

Cowie, A. P. (1998). Phraseology: Theory, analysis, and applications. Oxford: Clarendon Press.

Danielsson, P. (2007). What constitutes a unit of analysis in language? Linguistik online, 31(2), 17–24.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407. doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9

Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.

Firth, J. R. (1957). A synopsis of linguistic theory 1930–1955. In Studies in linguistic analysis (pp. 1–32). Oxford: Philological Society. Reprinted in F. R. Palmer (Ed.). (1968). Selected papers of J. R. Firth 1952–1959. London: Longman.

Furnas, G. W., Landauer, T. K., Gomez, L. M., & Dumais, S. T. (1983). Statistical semantics: Analysis of the potential performance of keyword information systems. Bell System Technical Journal, 62(6), 1753–1806.

Harris, Z. (1954). Distributional structure. Word, 10(23), 146–162.

Hartmann, S., Szarvas, G., & Gurevych, I. (2012). Mining multiword terms from Wikipedia. In M. Pazienza & A. Stellato (Eds.), Semi-automatic ontology development: Processes and resources (pp. 226–258). Hershey, PA: Information Science Reference. doi:10.4018/978-1-4666-0188-8.ch009

Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104(2), 211–240.

Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.

Marzano, R. J. (2004). Building background knowledge for academic achievement. Alexandria, VA: Association for Supervision and Curriculum Development.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013a). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013c). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems.

Moon, R. (1998). Fixed expressions and idioms in English: A corpus-based approach. Oxford: Oxford University Press.

Navigli, R. (2006). Meaningful clustering of senses helps boost word sense disambiguation performance. In Proceedings of COLING-ACL 2006 (pp. 105–112). Sydney, Australia.

Navigli, R. (2009). Word sense disambiguation: A survey. ACM Computing Surveys. New York: ACM Press.

Paradis, C. (2012). Lexical semantics. In C. A. Chapelle (Ed.), The encyclopedia of applied linguistics (pp. 3357–3356). Oxford, UK: Wiley-Blackwell.

Pavel, S. (1993). Neology and phraseology as terminology-in-the-making. In Terminology: Applications in interdisciplinary communication (pp. 21–34). Amsterdam: John Benjamins.

Sag, I. A., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2001). Multiword expressions: A pain in the neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002) (pp. 1–15).

Snow, R., Prakash, S., Jurafsky, D., & Ng, A. Y. (2007). Learning to merge word senses. In S. Bird (Ed.), Proceedings of Empirical Methods in Natural Language Processing. Cambridge, MA: MIT Press.

Sternberg, R. J. (1987). Most vocabulary is learned from context. In M. G. McKeown & M. E. Curtis (Eds.), The nature of vocabulary acquisition (pp. 89–105). Hillsdale, NJ: Lawrence Erlbaum Associates.

Weaver, W. (1955). Translation. In W. N. Locke & A. D. Booth (Eds.), Machine translation of languages: Fourteen essays (pp. 15–23). Cambridge, MA: MIT Press.

Zeno, S. M., Ivens, S. H., Millard, R. T., & Duvvuri, R. (1995). The educator's word frequency guide. Brewster, NY: Touchstone Applied Science Associates.