Moving Beyond Words: Challenges in Automated Corpus Analyses
-
Upload
independent -
Category
Documents
-
view
1 -
download
0
Transcript of Moving Beyond Words: Challenges in Automated Corpus Analyses
Moving Beyond Words: Challenges in Automated Corpus Analysis
and Some Solutions
A Brief Prepared for the Discussants
Jeff Elmore, W. Jill Fitzgerald, & Michael F. Graves
Paper presented at the annual meeting of the American Educational
Research Association,
Moving Beyond Words: Challenges in Automated CorpusAnalysis and Some Solutions
A Brief Prepared for the DiscussantsMany theoretical and operational challenges exist when
conducting corpus research using computational methods. We focus
on challenges related to the use of words as the lexical unit of
analysis. We describe three significant challenges and discuss
how they might be addressed: 1) semantic relationships between
words are at least as important as the meaning evoked by
individual word forms in isolation; 2) meaning is often carried
by multi-word units, not just single words, particularly for
content-area terms; and 3) most words have multiple meanings,
sometimes related, sometimes totally distinct. To address the
first challenge, we present a new type of vector-space language
model, describe its properties with some examples, and present an
application of it to improve the selection of vocabulary words.
For challenges two and three, we provide some background and a
brief description of our proposed solutions in each case.
Solutions are based on a hybrid approach of combining manually
2
constructed lexical resources with automated computational
procedures. Note: due to limited time, only the first of these
challenges will be presented at the conference.
PerspectiveThe notion of treating words as the primary carriers of
meaning in language seems fairly straightforward, but defining
just what a word is and how it relates to meaning both in
language and in the real world is incredibly complex (Paradis,
2012). Additionally, while previous corpus-based vocabulary work
may have been able to benefit more from direct human involvement
with limited sampling of texts and sometimes significant manual
coding (Carroll, 1971; Zeno, 1995; Marzano 2004), larger corpora
require researchers to rely on automated approaches.
Consequently, incorporating more of the complexity and nuance of
language use into corpus analyses requires that we do so in a
computationally achievable manner. Note, incorporating human
expertise is still possible and very important. In many cases
this can be done by combining automated procedures with manually
constructed lexical resources (Manning & Schütze, 1999).
3
Challenge 1: Semantic InterrelatednessBackground
Because words are inherently relational, analyzing the frequency of individual lexical units is inherently limited. JohnRupert Firth famously said, “you shall know a word by the companyit keeps,” (Firth, 1957). This insight has been further developedby many scholars (Harris, 1954; Weaver, 1955; Furnas et al., 1983; Deerwester et al., 1990) and has achieved considerable attention in the education field in the form of Latent Semantic Analysis (LSA) (Landauer & Dumais, 1997). Recently, artificial intelligence researchers have developed a new class of methods called neural probabilistic language models (Bengio, Schwenk, Senécal, Morin, & Gauvain, 2006) for computing vector-space word representations, like those used in LSA. As in LSA, neural language models are developed using large unstructured corpora. Whereas vector representations of words in LSA are calculated by performing a mathematical transformation of a term-document matrix, in neural language models vector representations of wordsare estimated by optimizing weights in a multi-layer neural network to predict each word in the corpus using its immediate context. Probabilistic neural language models been shown to outperform similar statistical vector-space models like LSA and PLSA on various semantic tasks such as word similarity and analogies (Baroni et al., 2014). One particularly successful neural language model has been Word2Vec (Mikolov, Chen, Corrado, & Dean, 2013a). To provide some more tangible evidence of the capacity of neural language models to capture meaningful semanticinformation, we provide two examples below to demonstrate the model’s capability to capture semantic similarity between words and to model semantic relationships between pairs of words.
Example 1: How a Neural Language Model Can Address Semantic Similarity
As with LSA, the intuition behind a neural language model is
that similar words will share similar locations in a semantic
space. A common practice for visualizing words in the semantic
4
space is to use Principal Components Analysis (PCA) to reduce the
dimensionality of the semantic space from hundreds of dimensions
down to two. Figure 1 shows some examples of words related to one
another in various degrees.
Figure 1: Example of Word2Vec Semantic Space for Similar Words
The language model has captured some apparently meaningful
spatial relationships. Marine life is grouped into the top right
5
corner, animals associated with domestication are grouped to the
left and extinct animals are grouped in the bottom right.
Example 2: How a Neural Language Model Can Address Semantic Relationships
In addition to modeling the similarity of words, which is a
specific type of semantic relationship, Word2Vec seems capable of
modeling a variety of semantic relationships between pairs of
words. We demonstrate the capacity of Word2Vec to capture
semantic relationships by presenting a collection of word pairs
that share the same semantic relationship and showing that a
consistent spatial relationship exists between the pairs. Figure
2 shows a set of countries and their capital cities with lines
connecting cities to their countries.
6
Figure 3: Spatial Relationship between Countries and Capital
Cities from Word2Vec
The spatial relationship represented by the roughly parallel
lines can be described by simple algebraic operations on word
vectors. For example, the expression Moscow - Russia + England =
X where X represents a point in semantic space very near London.
This is equivalent to a verbal analogy question of the form,
7
Russia is to Moscow as England is to what? Word2Vec shows
improvements in performance for analogy tasks over previous
vector-space language models (Baroni et al., 2014).
It is also notable that Principal Component 1 seems to
represent a geographical dimension with countries spanning
roughly from South Asia, through Russia, into Europe.
Semantic Interrelatedness: Proposed SolutionIncorporating semantic relationships between words is too
broad a challenge to propose a single specific solution. In
general we are optimistic about Word2Vec or a similar language
model having numerous applications in our work, from addressing
issues of polysemy to helping identify domain-specific
vocabulary. We present one example below showing the potential
for exploiting knowledge about semantic similarities between
words to assist in selecting words for vocabulary instruction.
A substantial portion of words are learned incidentally while reading, and attributes of the words and aspects of the contexts in which they appear can impact the likelihood of incidental word learning (Nagy 1985). Contexts in which target words are surrounded by familiar related words are more supportive of incidental word learning (Sternberg, 1987). An indication of which words tended to appear in supportive context and which did not could potentially be useful in identifying
8
specific words to target for direct instruction. We present a proof-of-concept demonstration of such an approach.
Application of Language Model: Quantifying Contextual Supportiveness
We employ 300-dimensional vector representations of words calculated by the Word2Vec framework (Mikolov, 2013a) developed using 100 billion words of news articles from Google and age-of-acquisition ratings for ~50,000 words (Kuperman, 2012) to quantify the semantic supportiveness of a given context for a particular target word. Semantic supportiveness simply means the degree to which a certain context would improve, even slightly, implicit learning of concepts associated with a target word within that context.
Operationally, supportiveness is defined as the sum of cosine similarities, a commonly employed spatial distance metric,between the target word and all words in the context around the target word that satisfy two requirements: a) the word must have an age-of-acquisition rating below that of the target word and b)the word must have a cosine similarity to the target word of at least 0.3.
Consider the target word democracy in several different contexts:
● Example 1: Democracy is a system of government whereby citizens are given some representative voice in which laws are passed.
● Example 2: Democracy cannot be installed by an external force. It must grow organically within a country.
● Example 3: Democracy cannot be installed by an external force. It must grow organically.
The supportiveness values for the three example sentences are 0.78, 0.36, and 0 for examples 1, 2, and 3 respectively. It seemsplausible that example 1 would be more helpful in improving a reader’s knowledge of the target word than examples 2 and 3.
9
Challenge 2: Multi-Word UnitsBackground
Multi-word units range from proper nouns and technical terms
to common phrases and idioms to any statistically improbable
combination of words (Sag et al., 2001) and there is a wide
variety in the terminology employed and methods used in their
analysis (Moon, 1998; Cowie, 1998; Sag et al., 2001; Danielsson,
2007; Baldwin & Kim, 2010). A point of consensus among
researchers is that multi-word units of any kind are typically
underrepresented in lexical resources (Pavel, 1993). For now, we
are focused on multi-word units corresponding to domain-specific
academic vocabulary, but other aspects of the analysis multi-word
units are potentially interesting in the future.
Multi-Word Units: Proposed SolutionWork has already been done to identify content-area
terminology (Marzano, 2004). Additionally many techniques exist
for extracting general multi-word units (Mikolov, 2013c) and
domain-specific terms from large corpora (Hartmann, 2012).
Manually constructed lists are unlikely to capture all of the
terms in a given corpus while automated approaches are likely to
10
identify erroneous multiword units. Through a combination of
lists of specific terms and results from automated analyses, we
hope to generate a comprehensive list of relevant content-
specific terms.
Challenge 3: PolysemyBackground
It is well understood that words have many different
meanings, even infinitely many meanings depending on the context
in which they are used. Many word senses (meanings) are closely
related in meaning, for example literal and figurative meanings
exist for many words. However, words also have multiple meanings
that are totally distinct, for example mean as in unkind, mean
as in what some word represents, and mean as in average. In
these cases in particular it would be valuable to differentiate
word senses in our analyses.
The process of automatically identifying the intended
meaning of a particular word in a text based on its context is
called Word Sense Disambiguation (WSD). Most WSD solutions
involve two parts: a database of word senses (meanings), usually
11
called a sense inventory, and an algorithm for selecting a
particular sense for a word based on context.
WordNet is a large lexical database of English word senses
that is commonly used in WSD systems as a sense inventory. In
WordNet, nouns, verbs, adjectives and adverbs are grouped into
sets of cognitive synonyms (synsets), each expressing a distinct
concept. Synsets are interlinked by means of conceptual-semantic
and lexical relations (Felbaum, 1998).
Unfortunately, state-of-the-art performance in automatic
word sense disambiguation is relatively poor (Navigli, 2009). One
reason for the poor performance is the fine distinctions of
popular word sense inventories like WordNet. For example, the
word ‘bank’ has eighteen senses in WordNet. For practical
purposes four or five is probably more reasonable. Although
polysemy presents many challenges to consider, we are choosing to
focus on developing a more appropriate sense inventory as a first
step to addressing polysemy in our analyses. Specifically we want
to group related meanings in WordNet into larger categories, for
example these two senses for the word ‘bank’ from WordNet:
12
“sloping land (especially the slope beside a body of water)” and
“a long ridge or pile” would be combined.
Several automated approaches for combining semantically
related WordNet senses were evaluated and found to be inadequate
for our purposes (Navigli, 2006; Snow, 2007).
Polysemy: Proposed SolutionInstead of a fully automated approach, we have developed a
crowdsourcing application for combining similar senses within
WordNet. Users of the application are presented with a list of
word sense “glosses” (short definitions) and are asked to combine
senses by dragging and dropping more specific senses under more
general related senses. We will then analyze the judgments on
relatedness of word senses for two purposes: 1) to establish
consensus groupings of word senses with the top level
representing the most general and distinct senses of a word and
2) to assess the degree of variability in users’ groupings of
word senses. Figure 3 shows an example the word-sense clustering
application for the word-form bass.
13
Figure 3: Example of Word-Sense Clustering Application for
the Word-Form bass
Using WordNet as a starting point for a coarser grain word-
sense inventory has the advantage that we can leverage all of the
resources within WordNet and other work based on WordNet but we
will have a version of it that is more suited to our purposes.
Conclusion The challenges discussed in this paper are all a result of
the fact that, at present, computers are unable to read and
understand text. There are obvious limits to the insights we can
gain from analyzing corpora with machines that lack
14
comprehension. Of course, corpus analysis is still an incredibly
powerful tool for understanding the world. Even something as
simple as word frequency has had profound effects on education
research and practice. Recently however, much more sophisticated
techniques have become available due to significant advances in
the fields of computational linguistics and artificial
intelligence over the past few decades. This paper described an
on-going effort to apply state-of-the-art techniques in
computational linguistics to advance corpus-based analyses in
education, however slowly, towards text understanding. Each step
has the potential to produce more meaningful and relevant results
to inform research and practice.
15
References
Baldwin, T. and Km, S. N. (2010). Multiword Expressions. In
Indurkhya, N. and Damerau, F. J., editors, Handbook of
Natural Language Processing, Second Edition. CRC Press,
Taylor and Francis Group, Boca Raton, FL. ISBN 978-
1420085921.
Baroni, M. Dinu, G. and Kruszewski, G. Don’t count, predict! a
systematic comparison of contextcounting vs. context-
predicting semantic vectors. (2014). In Proceedings of the
52nd Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 238–247,
Baltimore, Maryland, June 2014. Association for
Computational Linguistics.
Bengio, Y, Schwenk, H., Sen´ecal, J.S., Morin, F., and Gauvain,
J.L. (2006). Neural probabilistic language models. In
Innovations in Machine Learning, pp. 137–186.
Carroll, J. B, Davies, P., & Richman, B. (1971). The American
Heritage word frequency book. New York: Houghton Mifflin.
16
Cowie, A. P. (1998). Phraseology: Theory, Analysis, and
Applications: Theory, Analysis, and Applications. Clarendon
Press.
Danielsson, P. 2007. “What constitutes a unit of analysis in
language”? Linguistik online 31, 2/2007, 17–24.
Deerwester, S., S. T. Dumais, G. W. Furnas, T. K. Landauer & R.
Harshman. 1990. Indexing by latent semantic analysis.
Journal of the American Society for Information Science
41(6). 391–407. doi:10.1002/(SICI)1097-4571(199009)41:
6<391::AID-ASI1>3.0.CO;2-9.
Fellbaum, C. (1998, ed.) WordNet: An Electronic Lexical
Database. Cambridge, MA: MIT Press.
Firth, J.R. (1957). "A synopsis of linguistic theory 1930-1955".
Studies in Linguistic Analysis (Oxford: Philological
Society): 1–32. Reprinted in F.R. Palmer, ed. (1968).
Selected Papers of J.R. Firth 1952-1959. London: Longman.
Furnas, G. W., Thomas K. Landauer, L. M Gomez & S. T. Dumais.
1983. Statistical semantics: Analysis of the potential
performance of keyword information systems. Bell System
Technical Journal 62(6). 1753–1806.
17
Harris, Z. (1954). "Distributional structure". Word 10 (23):
146–162.
Hartmann, S., Szarvas, G., & Gurevych, I. (2012). Mining
Multiword Terms from Wikipedia. In M. Pazienza, & A.
Stellato (Eds.) Semi-Automatic Ontology Development:
Processes and Resources (pp. 226-258). Hershey, PA:
Information Science Reference. doi:10.4018/978-1-4666-0188-
8.ch009
Landauer, T.K. and Dumais, S.T. (1997). A solution to Plato’s
problem: The Latent Semantic Analysis theory of acquisition,
induction and representation of knowledge. Psychological
Review, 104(2):211–240.
Manning, C.D., & Schütze, H. (1999). Foundations of Statistical
Natural Language Processing. Cambridge MA: MIT Press.
Marzano, R. J. (2004). Building background knowledge for
academic achievement. Alexandria, VA: Association for
Supervision and Curriculum Development.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013a).
Efficient estimation of word representations in vector
space. CoRR, abs/1301.3781.
18
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J.
(2013c). Distributed representations of phrases and their
compositionality. In Advances in Neural Information
Processing Systems
Moon, R. (1998). Fixed Expressions and Idioms in English: A
Corpus-Based Approach. Oxford University Press.
Navigli R. (2006). Meaningful Clustering of Senses Helps Boost
Word Sense Disambiguation Performance. Proc. of COLING-ACL
2006, Sydney, Australia, July 17-21, 2006, pp. 105-112.
Navigli, R. (2009). Word Sense Disambiguation: a Survey. ACM
Computing Surveys. New York: ACM Press.
Paradis, C. (2012). Lexical Semantics. The Encyclopedia of
Applied Linguistics, ed. Chapelle, C.A. Oxford, UK: Wiley
Blackwell, 2012, (pp 3357–3356).
Pavel, S. (1993). Neology and phraseology as terminology-in-the-
making. Terminology: applications in interdisciplinary
communication, pp. 21–34. John Benjamins, Amsterdam.
Sag, I. A., Baldwin, T., Bond, F., Copestake, A., and
Flickinger, D. (2001). Multiword Expressions: A Pain in the
Neck for NLP. In In Proc. of the 3rd International
19
Conference on Intelligent Text Processing and Computational
Linguistics (CICLing-2002), pp. 1–15.
Snow, R., Prakash, S., Jurafsky, D., and Ng, A. Y. (2007).
"Learning to merge word senses",In Bird, S (Ed.) Proceedings
of Empirical Methods in Natural Language Processing.
Cambridge, MA: MIT Press.
Sternberg, R. J. (1987). Most vocabulary is learned from
context. In McKeown, M. G. and M. E. Curtis. (Eds.). The
Nature of Vocabulary Acquisition. Lawrence Erlbaum
Associates. Hillsdale: New Jersey. 89-105.
Weaver, W. (1955). Translation. In William N. Locke & A. Donald
Booth (eds.), Machine translation of languages: Fourteen
essays, 15–23. Cambridge, MA: MIT Press.
Zeno, S. M., Ivens, S. H., Millard, R. T., & Duvvuri, R. (1995).
The Educator's word frequency guide. Brewster, NY: Touchstone
Applied Science Associates.
20