
The Complexity of Natural Language
A Dissertation Proposal

Matt Mahoney

Introduction

The goal of my dissertation will be to estimate the memory requirement of a human language model for natural language processing tasks. Currently, no model is known, but it should be possible to estimate the size (and hence the cost) of a model without building one. Arguments from psychology, neurophysiology, and attempts at AI suggest a size between 10^8 and 10^14 bits, and probably around 10^9 bits. My approach will be to compare human text prediction with text compression as model size is varied.

A language model is a probability distribution over the set of all possible human-generated messages and dialogs, a prediction of how people respond to arbitrary input during communication. It describes the desired behavior of natural language processing (NLP) tasks such as information retrieval and natural language interfaces to databases. In language translation, speech recognition, optical character recognition, handwriting recognition, and proofreading, a language model would select the most likely or sensible interpretation of noisy or ambiguous input. No natural language model is known; consequently no NLP system can perform comparably (in error rate) to a human expert given the same task.

One way to estimate the size of the natural language model would be to find the relationship between model size and system error rates, and extrapolate to human error rates. The standard benchmark of AI, the Turing test, is qualitative, and therefore poorly suited for this purpose. Text prediction, a component of data compression algorithms, does not have this difficulty. Error rate is given by compression ratio, and since compressors learn a language model from their input, the model size is the amount of compressed text. For many programs, compression appears to improve as a logarithmic function of input file size, then level off when the program’s memory limit is reached. Humans outperform all known compression programs in text prediction for even the largest models tested.

A language model has at least three types of rules: lexical, which constrain letter order; semantic, which constrain word associations; and syntactic, which constrain word order. For instance, we have qu but not uq, needle and thread but not needle and cloud, the cat but not cat the. In addition, people learn categories at each level, for instance, grouping q with Q, needle with sewing, and the with a. Thus, we generalize to the novel QU, thread and sewing, a cat. Most compression programs learn only a lexical model without generalization, because it is fast, applicable to many types of data other than text, and adequate for small files.

A potentially large source of error in extrapolating compression ratio (using a full language model and unlimited memory) is measuring the goal: the entropy of the input text using human prediction. Current methods involve either inferring a probability distribution from a sequence of guesses in a letter guessing game, resulting in a large uncertainty in how the results are interpreted, or directly assigning a probability distribution to each successive character in a gambling game, which is time consuming and error prone.

My plan to estimate the space complexity of natural language has four steps.

1. Obtain a corpus of text from diverse fields of knowledge.
2. Develop a text compression algorithm that uses lexical, syntactic and semantic learning, including generalization on all levels.
3. Establish the relationship between compression and model size at each level of the language model and overall.
4. Develop an improved method of entropy measure on the corpus to establish the compression goal.

Why estimate language model size?

A natural language model is a probability distribution over the set of possible messages or dialogs in human communication. This model is a critical component of all natural language processing systems, such as information retrieval, language translation, speech recognition, and proofreading. In its most general form, it describes a hypothetical machine that would pass the Turing test for artificial intelligence. It is currently unknown how to construct such a model, and it is unknown how much effort would be involved if one could be constructed using any of the approaches that have been tried.

It is deceptively easy to build a toy model, with a restricted vocabulary, grammar, and knowledge domain, demonstrate it with a few examples, and declare that the work is almost done. For example, Baseball, an expert on 1959 baseball statistics, could intelligently answer questions such as “How many games did the Yankees win in July?” (Green et al. 1961). Uhr (1963) developed a system that recognized single handwritten characters with 96% accuracy (compared to 97% for humans), and another that segmented continuous speech by multiple speakers with 100% accuracy for a single 4-word example. A 1959 U.S. government sponsored project to translate Russian to English produced rough quality but understandable translations (Borko 1967). Early successes such as these led to predictions that the solution to the AI problem was just around the corner. We now know, of course, that these estimates were premature, and that problems of this type were vastly more difficult than they first appeared to be.

The problem is that we still do not know just how difficult the AI problem is. It is easy, in hindsight, to criticize the Russian-English project (abandoned in 1962) as overly ambitious, but history continues to repeat itself. A 1988 English-Japanese translator developed by Toshiba (Newsbytes 1988) had a 130,000 word vocabulary and used 100,000 language rules, but translators have advanced little since the ill-fated Russian-English project. Modern translators such as Alta-Vista (1998), Comprende (1998), and Tsunami-Typhoon (Williams, 1996) produce “draft quality” output, understandable but requiring post editing by a native speaker. For example, Alta-Vista translates the English sentence, “The spirit is willing, but the flesh is weak” to Spanish and back as “The alcohol is arranged, but the meat is weak.”

The Cyc project (Guha 1994; Whitten 1994) is probably the most ambitious attempt at a language model. Begun in 1984, Cyc was to encode common sense and “break the software brittleness bottleneck”. Cyc first used a frame-slot-attribute representation, and when this proved inadequate, switched to a “sea of assertions”, using an extended first order logic notation, all hand-coded by teams of knowledge engineers. By 1990, they had surpassed one million rules. In 1994, the developers declared that they were finished, and hoped that Cyc would be running on every computer in five years. Obviously this has not happened. Now a commercial company (Cycorp 1997), Cycorp has yet to deliver a product, although they are under contract to DARPA to develop a system that will involve coding over 70,000 new rules.

Estimates of Language Complexity

Turing (1950) is well known for having proposed the imitation game as a test for artificial intelligence. In what is now known as the Turing test, a machine is said to have artificial intelligence if it cannot be distinguished from a human in a game where it cannot be seen and only written communication is allowed. Turing predicted that by 2000, a machine with 10^9 bits of memory would be able to win the imitation game against a human opponent 30% of the time after 5 minutes of conversation. He offered no explanation for his memory estimate, although he was remarkably accurate in his estimate of the memory size of modern computers, especially considering that his estimate predates Moore’s law (which says that hardware capacity doubles every 18 months).

Turing suggested a machine learning approach to solving the AI problem: program a computer to learn as a human would, then educate it as one would a child. One could speculate that Turing’s estimate is based on the amount of text input that this machine would receive. If we assume 150 words per minute, several hours per day until adulthood, and a compressed representation with no forgetting, then we arrive at roughly 10^9 bits.
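To make this speculation concrete, here is a rough back-of-envelope version of the calculation. The specific rates (3 hours of language input per day, 20 years, about 10 bits per word after compression) are my assumptions for illustration, not figures from Turing.

```python
# Rough estimate of the text input a person receives by adulthood.
# All constants below are illustrative assumptions, not measured values.
words_per_minute = 150
minutes_per_day = 3 * 60          # "several hours per day": assume 3 hours
days = 20 * 365                   # roughly birth to adulthood
bits_per_word = 10                # assume ~10 bits per word in compressed form

total_words = words_per_minute * minutes_per_day * days
total_bits = total_words * bits_per_word
print(f"{total_words:.1e} words, {total_bits:.1e} bits")   # ~2e8 words, ~2e9 bits
```

The result is on the order of 10^9 bits, consistent with Turing’s memory estimate.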

Estimates from attempts at AI

We can estimate a lower bound on language model size from the largest NLP systems that have been built so far. If we assume that each of Cyc’s assertions could be encoded in a few hundred bits, then the model size is about 10^9 bits. By a similar argument, the Toshiba Japanese-English translator would be about 10^8 bits.

WordNet (Miller 1993) is an electronic dictionary developed over a 5 year period. It has 96,500 words linked by synonyms (hot, warm), antonyms (hot, cold), hypernyms (oak, tree), and meronyms (tree, branch). Like most lexicons, it is a lexical and semantic model of English, though it lacks a syntactic model (grammar and word forms) and common sense knowledge of the type found in Cyc (such as the facts that plants grow, are green, need sunlight and water, etc.). The compressed download size is 37 MB, or 3 × 10^8 bits, although this is probably not the most compact representation.

Estimates from human long term memory capacity

Humans, by definition, possess a natural language model. Landauer (1986) estimated the capacity of human long term memory to be 10^9 bits, based on known rates of learning and forgetting. It would seem that a language model would have to be much smaller than this because some of this memory is used to store images and other sensory information, and images ought to require many orders of magnitude more storage space than words.

This is not so. Standing (1973) tested long term memory capacity for pictures and words and found a somewhat greater capacity to remember pictures, but still within the same order of magnitude. He tested subjects with 20 to 10,000 pictures (for instance, a dog or an airplane), 20 to 1000 vivid or unusual pictures (for instance, a dog with a pipe in its mouth, an airplane crash), and 20 to 1000 written words (from the most common 25,000 words in a dictionary). After 2 days, subjects were tested for recognition with two or more images or words, and asked to pick the one which was in the previous set. Allowing for random guessing, Standing found that the log10 memory capacity for n items, with correlation 0.99, was 0.93 log10 n + 0.08 for pictures, 0.97 log10 n + 0.04 for vivid pictures, and 0.92 log10 n − 0.01 for words. Thus, recall for 10,000 pictures would be 10^(0.93 log10 10,000 + 0.08) = 6300 (actually it was 6600), and for 10,000 words would be 4700. Training occurred at the rate of one item per 5.6 seconds, up to 2000 per day, but similar results were obtained with 2 seconds exposure per item. Nickerson (1968) obtained, for a test with 612 pictures, immediate recognition rates of 98%, and 90% after 2 weeks (compared with Standing’s formula: 77% for normal pictures and 90% for vivid pictures after 2 days).

At a rate of one word or picture per second, 12 hours per day, we would be exposed to a total of 3 × 10^8 items in 20 years. Most forgetting occurs in the first few days. If we project using Standing’s formula and assume the worst case, no memory loss after 2 days, then we would remember 6.1 × 10^7 words (about 20%) or 9.2 × 10^7 normal pictures (about 31%). Assuming log2 25,000 = 14.6 bits per word, verbal capacity would be 9 × 10^8 bits.

Picture memory is not much greater, since we only need to store enough features of the picture to distinguish it from others in the training set. If each picture was stored as a 30-bit feature vector, then there is less than a 3 × 10^8 × 2^−30 = 30% chance that a picture would be confused with another in the set. Allowing 30 bits per picture sets the memory capacity at 2.8 × 10^9 bits, about three times the verbal capacity.
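The projections in the last two paragraphs can be reproduced directly from Standing’s regression lines. The sketch below is just that arithmetic; the 3 × 10^8 items, the 30-bit feature vector, and the 14.6 bits per word are the assumptions already stated above.

```python
import math

items = 3e8                       # one item per second, 12 h/day, 20 years (from the text)
log_n = math.log10(items)

words_recalled    = 10 ** (0.92 * log_n - 0.01)   # Standing's fit for words
pictures_recalled = 10 ** (0.93 * log_n + 0.08)   # Standing's fit for pictures

verbal_bits  = words_recalled * math.log2(25000)  # ~14.6 bits per word
picture_bits = pictures_recalled * 30             # assumed 30-bit feature vector per picture

print(f"words:    {words_recalled:.1e} recalled, {verbal_bits:.1e} bits")    # ~6.1e7, ~9e8
print(f"pictures: {pictures_recalled:.1e} recalled, {picture_bits:.1e} bits") # ~9.2e7, ~2.8e9
```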

In another test by Standing, subjects were asked to recall pictures using a verbal description. The average description was 6 words, sufficient to identify a single picture 87% of the time from a set of 200. Using Shannon’s (1951) estimate of the entropy of English of 0.6 to 1.3 bits per character (appropriate because the words are not uniformly distributed), and 4.5 letters per word, we have 20 to 43 bits per picture.

Estimates from neurophysiology

The cerebral cortex has about 10^10 neurons with 10^3 to 10^4 synapses (connections) each (Crick and Asanuma 1986; Kuffler, Nicholls, and Martin 1984; Sejnowski 1986). Hebb (1949) first proposed that memory is stored in the synapse as a function of the activity levels of the two neurons it connects. Nearly all neural network models accept this view, including perceptrons (Rosenblatt 1958), associative memory (Hopfield 1982), Boltzmann machines (Ackley, Hinton, and Sejnowski 1985), and back propagation networks (Rumelhart, Hinton, and Williams 1986). Based on this evidence alone, it would appear that the brain has about 10^14 bits of memory, allowing a few bits per synapse.

We must be careful, however. We do not completely understand how the brain works. Contrary to most neural network models, no neuron both stimulates some cells and inhibits others (Crick and Asanuma 1986). About 80% of cells are of the stimulating type. We do not know what percentage, if any, are involved in learning. In fact, Hebb’s hypothesis has still not been confirmed experimentally in the brains of vertebrates. However, it has been shown that learning results in the growth of new neurons in the hippocampus of rats (Bower 1998).

The upper bound of 10^14 bits seems quite high in light of the connectionist model (Feldman and Ballard 1982). The model is based on the associationist model of thought, described by James (1890). In this model, the firing of a neuron or group of neurons (a unit) represents a momentary thought or concept. Connections between units represent associations. In a linguistic model, each unit would represent a letter, phoneme, syllable, word, morpheme, part of speech, or easily expressed concept, i.e., a thing that could be thought about. Hopfield (1982) found that a fully connected network of n neurons and n^2 synapses has a memory capacity of 0.15n^2 bits. If we add up all of the linguistic elements in a language, we have about 10^5, mostly words and morphemes. The capacity of this model is 1.5 × 10^9 bits of memory.

Language modeling in NLP systems

A language model describes the behavior of an interactive system, human or machine. Formally, a language is a set of strings, where a string is a sequence of characters from some alphabet. However, for natural languages such as English, it seems more appropriate to use a fuzzy set or a probability distribution, in recognition of the fact that some strings are more of a member than others. For instance, the two strings roses are red and roses are blue are both English sentences, and therefore members of the set English, but the former is more likely to appear in a text sample than the latter. Humans, of course, have no difficulty agreeing on this point, based on their knowledge of the colors of flowers, but it is exceedingly difficult to program a computer to make decisions of this type, simply because of the sheer number of rules required.

We can define a language model as a probability distribution over the set of human-generated messages. Let L(x) denote the probability of generating string x in language L. Thus, if L is English, then L(roses are red) > L(roses are blue). More generally, we define L as a distribution over dialogs, which are sequences of one or more alternating messages between two people. Solving this problem requires the same type of high level knowledge. For instance, we readily recognize that L(What color are roses? Red.) > L(What color are roses? Blue.). A language model L is a completely general way to describe any interactive system, probabilistic or deterministic, human or machine. If we let x be an input message, and y be the response, then L(xy)/L(x) is the probability that the machine with model L will respond to x with output y.

The following examples show how a human language model L would be used in some ideal NLP systems.

• Question answering machine. Given input x, the output is the y that maximizes L(xy)/L(x), which (since L(x) does not depend on y) is the y that maximizes L(xy). In other words, the output y is the answer that a human would be most likely to give to question x. The machine models a human that doesn’t make mistakes (if we define “correct” by majority vote). (A short code sketch of this selection rule follows the list below.)

• Information retrieval. Given query x, the output is a list of documents ranked by L(xy)P(y), where P(y) is the probability of selecting document y independent of x, which is 0 if y is not a document. In the simplest model, P(y) = 1/n in a set of n documents. In other words, the documents are ranked according to how likely a human would be to choose document y as the most relevant to question x.

• Language translation (Knight 1997). Given input x in language X, the output y in language Y is the y that maximizes P(y|x) = P(x|y)P(y)/P(x), or equivalently the y that maximizes P(x|y)P(y) = P(x|y)L(y), where P(y|x) is the probability that x translates to y, P(x|y) is the probability that y translates to x, and P(y) = L(y) is just the language model for Y. The equation follows by Bayes law and because P(x) is fixed. The idea is to use a good monolingual model L(y) to improve a poor bilingual model P(x, y). x may be ambiguous and have several possible translations, so L(y) is used to select the best one. We can think of L as the human native speaker of Y cleaning up the translation by P.

• Speech recognition (Stolcke 1997). As above, except that x is a speech signal and P(x|y) is a model of speech generation, the probability that speech signal x would result, given that text string y is read aloud. Since x may be noisy or ambiguous, L is used to select the best possible y from among, say, recognize speech and wreck a nice beach.

• Optical character recognition. As above, except that x is a raster scanned image and P(x|y) is the probability of producing image x from text y. L is used to disambiguate, say, mail and rnai1 (MAIL and RNAI1).

• Handwriting recognition. As above, except that x is a sequence of pen movements and y is the intended text.

• Proofreading. As above, except that x is a document containing spelling and grammatical errors, and P(x|y) is a model of common spelling, typing, and grammatical errors, the probability of producing x from an error free document y. It is possible that more than one correct document could be mistyped as x. L would choose the most likely one.
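All of the applications above reduce to ranking candidate outputs y by L(xy) (or by P(x|y)L(y)). The sketch below shows the bare selection rule with a stand-in scoring function; the toy smoothed character-bigram model inside L is my own placeholder, purely to make the rule runnable, and is not a proposed implementation.

```python
def L(text: str) -> float:
    """Hypothetical language model L: probability of a string. Here it is a
    toy smoothed character-bigram model trained on one made-up sentence."""
    corpus = "what color are roses? red. roses are red. violets are blue."
    p = 1.0
    for a, b in zip(text, text[1:]):
        pair = sum(1 for x, y in zip(corpus, corpus[1:]) if (x, y) == (a, b))
        single = corpus.count(a)
        p *= (pair + 1) / (single + 27)      # add-one smoothing, ~27 symbols
    return p

def respond(x: str, candidates: list[str]) -> str:
    """Question answering: return the y maximizing L(xy)/L(x), i.e. maximizing L(xy)."""
    return max(candidates, key=lambda y: L(x + " " + y))

print(respond("what color are roses?", ["red.", "blue."]))   # prefers "red."
```

The same ranking skeleton covers the other tasks by swapping in the appropriate P(x|y) factor.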

Testing a language model

How are we to know if a language model L is a good representation of a human language such as English? Turing (1950) proposed, as a test for artificial intelligence, the imitation game, in which a machine and a human compete to convince an interrogator that they are the human. All communication is written, so that the interrogator cannot see the competitors. If the interrogator cannot tell who is the machine, then the machine passes the test. This test is now known as the Turing test, and is widely accepted as the definition of artificial intelligence (Rich and Knight 1991).

The Loebner Prize has been offered in a contest, held annually since 1991, to any machine able to pass a variation of the Turing test (Loebner 1998). In 1998, five programs and four human confederates competed for three hours in a contest judged by ten interrogators (Flinders University, 1998). These numbers vary each year, but are typical. English speaking judges and confederates with no special expertise in computers were selected from newspaper ads. Each judge ranked the competitors from most human to least human. A $2000 prize is awarded each year to the machine with the highest median ranking. There is a $25,000 prize offered to the first machine with a median rank higher than that of the lowest ranking human in any contest, but that prize has yet to be claimed. (There is also a $100,000 prize for a machine meeting certain audiovisual requirements outside the scope of the Turing test.)

Although the Turing test is widely accepted, it is not very practical for evaluating prototypes. It simply says that a machine is intelligent or not, and even that result depends on the background and education level of the interrogator and the machine’s human opponent. The Loebner contest offers a somewhat quantitative measure by sampling a population of judges and confederates, but it is expensive to administer. I have proposed text compression as an inexpensive and quantitative alternative measure of AI (Mahoney 1999). Given a large sample of human-generated text, a compressor using a better language model will produce a smaller compressed file. The criterion for passing the test is to be able to compress (or equivalently, predict) text as well as the average human, as measured by entropy. Shannon (1950) estimated the entropy or uncertainty of English prose to be between 0.6 and 1.3 bits per character (bpc), restricting the character set to the 26 single-case letters and the space. I found that the best known compression programs as of Sept. 1998 (Gilchrist 1998) compress to 1.75 to 1.85 bpc on similar text. (Popular programs such as compress, pkzip, and gzip get 2.5 to 3.1 bpc.)

Both the Turing test and the compression test can be described in terms of language models. In the Turing test, the machine approximates the language L of its human opponent and the interrogator with a model M. In response to a question x, the machine responds with y with probability M(xy)/M(x). The interrogator then evaluates the appropriateness of the response using L(xy)/L(x). The goal of AI is M = L, an exact model, in which case the machine is indistinguishable from human. (However, it turns out that M = L is not optimal for passing the Turing test. First, the language of the interrogator and human opponent will not be identical. The machine should model the interrogator in this case. Second, if the interrogator makes no presumptions about the capabilities of machines (i.e., does not assume that machines never make spelling or typing errors), then the optimal response is deterministic: the y that maximizes L(xy)/L(x). In other words, the machine should not deliberately inject errors in order to appear more human (Mahoney 1999). In practice, some machines do inject random errors (Mauldin 1994), because interrogators do assume that machines don’t make spelling and typing mistakes.)

Compression programs work by encoding common messages with short encodings and less common messages with longer encodings. Let L be the probability distribution of possible input files, and M be the program’s model or estimate of the unknown distribution L. Shannon (Shannon and Weaver 1949) showed that for a message x with probability P(x), the optimal encoding of x will have length −log2 P(x) bits. Using this encoding, the expected length of a compressed message x will be the entropy of the probability distribution,

H = −Σ_x P(x) log2 P(x) bits.

But since the compressor uses M as an estimate of the true but unknown probability L, the expected size of the output for a large sample x drawn with probability L(x) is actually

−Σ_x L(x) log2 M(x),

which will be larger than H unless M = L.
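A small numerical illustration of this point, using made-up distributions L and M over a four-message alphabet:

```python
import math

# Toy true distribution L and model estimate M over four messages (values invented).
L = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
M = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}

entropy       = -sum(p * math.log2(p) for p in L.values())   # H = 1.75 bits
expected_size = -sum(L[x] * math.log2(M[x]) for x in L)      # 2.00 bits using M

print(entropy, expected_size)   # the expected code length exceeds H because M != L
```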

In comparing the Turing test with compression, we must be careful to use the same language L for both tests. In the Turing test, L is the distribution of possible dialogs between interrogator and human opponent, such as What color are roses? Red. We could compress dialogs such as these, but we are more likely to select ordinary prose, such as Roses are red. To be a legitimate test, the text to be compressed should reflect high-level human knowledge of the type useful to the system being tested, and its entropy with respect to human prediction should be known.

Types of language constraints

We can describe three levels of language constraints: lexical, semantic, and syntactic. (I present them in this order because that’s the order in which children learn them. See Papalia and Wendkos-Olds, 1990.) Lexical rules constrain the selection of letters and groups of letters (or phonemes in speech). For instance, q is always followed by u, and e is more common than x. Semantic rules constrain co-occurrence of related words within a text window. For instance, flower and garden are likely to appear together, but not flower and hammer. Syntactic rules constrain word order. For instance, the is often followed by an adjective or noun, but never a verb (except in this sentence).

At each level, people learn rules and generalize them to new situations. At the lexical level, people classify A with a, B with b, space with newline, etc. Thus, after learning a new word, such as limulus, people have no trouble guessing the next character in the sequence LIMULU_. (Most data compression programs would not make this generalization.)

Semantic generalization is the application of the transitive rule to word association. Thus, if flower and garden co-occur in one paragraph, and garden and rose in another, then we would expect flower and rose to be likely to co-occur in a third.

Syntactic generalization involves learning parts of speech for new words. For instance, consider,

“The porgs demiled a barse”, he said.

“Why did they do that?”, she asked.

What does they refer to? What does that refer to? We can answer such questions because we have assigned syntactic categories to the meaningless words porg and barse (nouns), and demile (a verb), based on the surrounding terms the, -s, -ed, and a. More generally, after observing the sequences A-C, A-D, and B-C, we predict B-D by assigning A and B to one category, and C and D to another.

Is this all there is to a language model? Is it possible to represent all human knowledge as just a set of local constraints on letter order, word frequency, and word order? Of course not. People learn many things nonverbally, such as what a particular person’s face looks like, how to ride a bicycle, or what a banana smells like. But this type of knowledge would not help a machine pass the Turing test; there is simply no way to communicate it to an interrogator. In fact, it is hard to think of any type of knowledge that could be tested verbally but not learned verbally. Just as a blind man can learn that the sky is blue, so can a computer. It does not have to understand the words, just know how to use them.

Lexical models

Lexical models constrain the selection of characters based on the immediately surrounding characters. These models are used mostly by data compression algorithms. Lexical constraints force some substrings to occur more frequently than others, allowing a compression program to choose shorter codes for the more frequent sequences. The non-uniform distribution means that two randomly chosen sequences are more likely to match each other than without such constraints. This is shown below for a large sample of text, Far from the Madding Crowd by Thomas Hardy (book1 from the Calgary corpus, a widely used data compression benchmark).

[Figure: Probability (in parts per million) that two randomly chosen strings of length 1 to 8 will match, as a function of separation (1 to 256K characters) in Far from the Madding Crowd (book1 from the Calgary corpus). The y axis is P(match) in ppm (1 to 100,000); the x axis is separation between strings (1 to 256K); one curve is plotted for each string length 1 through 8.]

We make the following observations.

• Characters tend not to repeat at adjacent locations, indicated by low P for separations of 1.
• Word-length strings (3-6 characters) are most likely to repeat after 50-100 characters, indicated by humps in these curves. In other words, words tend to repeat within a paragraph but not within a sentence.
• Language becomes more redundant in increasingly large contexts, as indicated by the narrowing of the spacing between the curves as string length increases.
• Natural language appears to be nearly ergodic over long distances, as indicated by the leveling of the curves. In an ergodic source, the probability of observing any string is independent of its position in the text.
• The point at which text becomes ergodic increases exponentially with context length.

The most frequent strings of length 1 through 6 and their frequencies for book1 are given below. Spaces are shown as underscores.

.1633 _ .0266 e_ .0146 _th .0104 _the .0076 _the_ .0022 ,_and_

.0942 e .0227 he .0125 the .0083 the_ .0042 _and_ .0015 _that_

.0651 t .0223 _t .0117 he_ .0049 and_ .0025 ,_and .0014 n_the_

.0622 a .0208 th .0062 ing .0047 _and .0021 _was_ .0013 _of_th

.0583 o .0178 _a .0061 and .0044 ing_ .0018 _that .0012 of_the

.0532 n .0172 d_ .0059 nd_ .0041 _of_ .0018 n_the .0010 f_the_

.0489 h .0145 in .0059 _an .0039 _to_ .0016 that_ .0010 _the_s

.0481 i .0142 t_ .0059 ed_ .0025 ,_an .0014 d_the .0010 _with_

.0428 r .0130 er .0049 ng_ .0024 _in_ .0014 _with .0010 d_the_

The most frequent strings of length 1 to 6 in book1.

Semantic models

Semantic models constrain word frequency within the region of text surrounding other words with related meanings. By word, we mean a term (written) or morpheme (spoken), the smallest lexical unit which can stand alone and convey meaningful information. Thus, New York is one term, streets is two terms (street and -s).

Word identification. Most NLP programs use hand-coded rules to parse text into words. The simplest rule is to break words on spaces and punctuation, but this misses prefixes and suffixes such as -s, -ed, -able, un-, etc. Hand-coded stemming algorithms, such as (Porter 1980), can be complex and do not work in every case. Hutchens and Alder (1997, 1998) found that word divisions occur at points of high entropy, i.e., the first character of a word is less predictable than the other characters in a text stream. This is true even when spaces and punctuation are removed, although the technique is not as reliable as hand-coded parsing and stemming rules.
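A minimal sketch of this idea, under my own assumptions of an order-2 letter model and a hand-picked surprisal threshold (the model order and threshold are arbitrary choices, not Hutchens and Alder’s parameters):

```python
import math
from collections import Counter

def mark_boundaries(text: str, threshold: float = 3.0) -> str:
    """Insert '|' where the next letter is unusually hard to predict from the
    preceding two letters -- a rough proxy for word boundaries."""
    ctx_counts = Counter(text[i:i+2] for i in range(len(text) - 2))   # contexts with a follower
    tri_counts = Counter(text[i:i+3] for i in range(len(text) - 2))
    out = text[:2]
    for i in range(2, len(text)):
        ctx, tri = text[i-2:i], text[i-2:i+1]
        p = (tri_counts[tri] + 1) / (ctx_counts[ctx] + 27)            # smoothed P(letter | context)
        if -math.log2(p) > threshold:                                 # high surprisal: likely boundary
            out += "|"
        out += text[i]
    return out

print(mark_boundaries("theboywenttotheparkandtheboysawthedog" * 3))
```

On such a tiny sample the segmentation is only approximate, which matches the observation above that the technique is less reliable than hand-coded rules.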

Word frequency. Zipf (1935) found in many languages that if the words are sorted by frequency of occurrence, then the n’th ranked word occurs with frequency approximately c/n, where c is a constant, about 0.1 in English. Thus, the most common English word, the, occurs with frequency 1/10 in running text. This is followed by of, 1/20; and, 1/30; to, 1/40; and so on. About half of the vocabulary of any large text sample occurs only once.

This approximation cannot hold for an infinitely large vocabulary, because c Σ_{n=1..∞} 1/n = ∞ ≠ 1. The summation equals 1 for a vocabulary of about 12,000 words. Average adult vocabulary is actually about 40,000 common words and an equal number of proper nouns. Mandelbrot (1953) generalized Zipf’s law to the form c/(n + a)^b, where b is about 1.1, a is small (near 0), and all of the constants are chosen to make the summation equal to 1. Nevertheless, Zipf’s law is a good approximation.
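A quick numerical check of these claims; the vocabulary sizes below are the figures quoted above, and the normalization of the Mandelbrot form is computed rather than assumed.

```python
# Zipf: frequency of the n'th ranked word is approximately c/n, with c = 0.1 in English.
c = 0.1
print(sum(c / n for n in range(1, 12001)))     # ~1.0: the approximation is used up at ~12,000 words

# Mandelbrot's form c/(n + a)**b with b ~ 1.1 and small a can cover a larger vocabulary;
# here we solve for the c that makes the frequencies sum to 1 over ~80,000 words.
a, b, vocab = 0.0, 1.1, 80000                  # ~40,000 common words + ~40,000 proper nouns
weights = [1.0 / (n + a) ** b for n in range(1, vocab + 1)]
print(1.0 / sum(weights))                      # normalizing constant c for this vocabulary
```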

Word association. Traditional language models represent related words, such as tree, branch, plant, and grow, using some type of discrete data structure. For instance, WordNet (Miller) uses a graph with edges labeled to indicate synonym, antonym, hypernym (is-a), and meronym (has-a) relationships. Connectionist systems (Feldman and Ballard 1982) use a graph with weighted but unlabeled edges (i.e. a neural network). Cyc (Guha and Lenat 1994) uses an extended first order logic. For instance, ∀x(tree(x)→plant(x)) expresses the rule that all trees are plants.

When words are related, they are likely to appear in close proximity. This idea is exploited by information retrieval systems (search engines) in the form of relevance feedback (Grossman and Frieder 1998), allowing, for instance, a search for doctor to retrieve documents containing the words medical or hospital. The system learns relations by observing a higher than chance co-occurrence of pairs of words in the same document. Resnik (1992) also observed a relationship between meaning and proximity for verbs and their objects, such as open and door.

Word association is transitive. If word A is related to B, and B is related to C, then A is related to C. This property allows a system to reason. For instance: trees are plants and plants grow, therefore trees grow. (This model is not perfect. For instance: trees are plants and trees have bark, therefore plants have bark. However, people do sometimes reason this way. Some plants have bark.)

Words influence word frequency independently of one another. If A and B each affect the likelihood of word C in their vicinity, then their combined effect is additive. In information retrieval systems (search engines), the presence of each query term in a document independently affects the probability of relevance, even though in general the terms are not independent of each other (Grossman and Frieder 1998).

Syntactic models

A syntactic model constrains word order. Such models include context free grammars, probabilistic context free grammars, and hidden Markov models. Semantic graphs with directed edges (all trees are plants) also contain syntactic information (to distinguish from all plants are trees).

Syntactic models classify words by their part of speech: noun, verb, preposition, etc. The rules constrain their order. For instance, in English, articles (the, a) may be followed by nouns or adjectives, but never verbs. Adjectives precede nouns but do not follow them.

Context free grammars group sequences such as article-adjective-noun into higher level groups such as noun-phrase, sentence, etc., and constrain their order as well. This hierarchical structure is useful in artificial programming languages (Fischer and LeBlanc 1991), but all attempts to apply them to natural language have failed. Natural language grammars tend to be ambiguous, to have more than one possible derivation (Lindsay 1963). Probabilistic grammars (Charniak 1997), which assign a probability to each rule and thus to each parse, alleviate this problem somewhat, but do not solve it. Attempts to add parsing to speech recognition systems (Stolcke 1997) and information retrieval systems (Grossman and Frieder 1998) have had no effect on error rates.

The hidden Markov model is not hierarchical, and has been applied successfully in part of speech tagging (Charniak 1997), speech recognition (Stolcke 1997), and information extraction (Cardie 1997). The model consists of two sets of probabilities. One is the probability of the next word category, given the previous one or two categories, as in P(noun | article, adjective). The other is the probability of the current word, given the current category, as in P(house | noun).
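A toy illustration of the two probability tables, scored on a three-word phrase; all numbers are invented for illustration, and a real tagger would search over all tag sequences rather than scoring just one.

```python
# Hidden Markov model sketch: transitions over categories, emissions of words
# given categories. All probability values below are made up.
transition = {("article",): {"adjective": 0.4, "noun": 0.6},
              ("article", "adjective"): {"noun": 0.9, "adjective": 0.1}}
emission = {"the": {"article": 0.7}, "big": {"adjective": 0.2}, "house": {"noun": 0.1}}

def score(words, tags):
    p = emission[words[0]][tags[0]]
    for i in range(1, len(words)):
        context = tuple(tags[max(0, i - 2):i])          # previous one or two categories
        p *= transition[context][tags[i]] * emission[words[i]][tags[i]]
    return p

print(score(["the", "big", "house"], ["article", "adjective", "noun"]))
```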

Data Compression Models

Data compression algorithms use a language model to select a coding. If a string x occurs with probability P(x) = L(x), then an optimal code has length log2 1/L(x) bits (Shannon and Weaver 1949). The hard part is finding L.

Compression can be expressed as a prediction problem. If x = x1x2...xn is a string of n characters, then

L(x) = Π_{i=1..n} L(xi | x1x2...xi−1)
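Under this formulation, the compressed size of a string is simply the sum of the code lengths −log2 L(xi | x1x2...xi−1). A sketch with a hypothetical predictor (the rule favoring e after th is invented just to keep the example concrete):

```python
import math

def predict(context: str, symbol: str) -> float:
    """Hypothetical conditional model L(symbol | context): after 'th' it expects 'e',
    otherwise it is uniform over a 27-symbol alphabet (letters plus space)."""
    if context.endswith("th"):
        return 0.5 if symbol == "e" else 0.5 / 26
    return 1.0 / 27

def compressed_bits(text: str) -> float:
    # Code length = sum over characters of -log2 L(x_i | x_1 ... x_{i-1})
    return sum(-math.log2(predict(text[:i], text[i])) for i in range(len(text)))

print(compressed_bits("the cat"))   # fewer bits than 7 * log2(27) ~ 33.3, thanks to the 'e'
```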

Data compression programs are designed to work on a wide variety of formats, not just natural language text. For many files, a good predictor of xi is the few characters immediately preceding it, its context. The algorithm examines previous occurrences of the context, and guesses whatever characters followed them. For instance, if the last two characters are th, and there are several occurrences of the in the previous input, then the next character is likely to be e, which would receive a short code.

This is a purely lexical model, although it takes advantage of semantic and syntactic constraints if the context happens to be a whole word or a sequence of words, respectively. However, it does no generalization at any level:

• Lexical. After seeing the, it will not predict E after TH.
• Semantic. After seeing rose garden, it will not predict gardening roses. After seeing rose and flower, it will not predict flower garden.
• Syntactic. After seeing in the, in a, to the, it will not predict to a.

Compression Algorithms

Shannon proved that an optimal encoding of a string x with probability P(x) into an alphabet of size k has length −log_k P(x) symbols. In practice, the number of symbols (bits if k = 2, or bytes if k = 256) must be integral, so a slightly shorter or longer than optimal code must be chosen. A Huffman code is optimal in this respect, but since it has exponential time complexity in the length of x, it is normally applied only to single characters (Storer 1988). A more practical code for long strings is an arithmetic code (Bell, Witten, and Cleary 1989). Given a succession of characters xi and their probabilities conditioned on the previous text, P(xi|x1x2...xi−1), the real range [0, 1] is repeatedly subdivided in proportion to these probabilities on each character. The output is the shortest number, expressed as a base k decimal fraction, within the resulting subrange. This code is never more than one symbol longer than optimal.
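A minimal sketch of the interval subdivision step, with fixed, made-up symbol probabilities; a real arithmetic coder also emits the fraction incrementally and manages finite precision.

```python
# Arithmetic coding sketch: repeatedly narrow [low, high) in proportion to the
# probability of each symbol. The probabilities here are fixed and invented.
probs = {"a": 0.5, "b": 0.3, "c": 0.2}

def encode(message: str) -> float:
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        cum = 0.0
        for s, p in probs.items():           # locate this symbol's cumulative range
            if s == sym:
                low, high = low + cum * width, low + (cum + p) * width
                break
            cum += p
    return (low + high) / 2                  # any number in [low, high) identifies the message

print(encode("abac"))   # the message needs about -log2(0.5*0.3*0.5*0.2) ~ 6.1 bits to specify
```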

Predictive arithmetic encoders are among the best known compression algorithms for text files of moderate size (under 1 MB). These encoders assign character probabilities by examining the context of the preceding few characters and collecting statistics for previous occurrences of the same context, a technique that works for many types of data, not just text. For instance, given a text stream ending in ...chai_, a predictor would look for previous occurrences of chai (an order-4 context) and observe which letter comes next. Perhaps it finds two occurrences, chair and chain. Then it might assign P(r) = P(n) = P(all others) = 1/3. Then it drops the oldest character c and searches for hai to assign probabilities to the remaining characters. For instance, if it finds hail and hair, it would assign P(l) = P(all others) = 1/6, since P(r) is already assigned. This process continues until the context is shortened to 0.

The above algorithm is called PPMA, prediction by partial match, type A. Other PPM algorithms differ in how the different context orders are blended, or in the assignment of escape probabilities, P(all others). For instance, PPMC, which gives better compression in practice, would have assigned P(r) = P(n) = 1/4, P(all others) = 1/2, that is, an escape probability proportional to the number of different letters encountered in that context. The Data Compression FAQ (1998) listed an order-5 PPMC compressor, ha 0.98 (Hirvola 1998), as the best known as of 1993 on the Calgary corpus, a widely used benchmark (Calgary Corpus 1993).
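The difference between the two escape rules can be reproduced from the chai example above; the two observed followers, r and n, are the counts already assumed in the text.

```python
from collections import Counter

# Order-4 context "chai" has been seen twice, followed once each by 'r' and 'n'.
followers = Counter({"r": 1, "n": 1})

def ppm_a(counts):
    # PPMA: the escape symbol gets a count of 1.
    total = sum(counts.values()) + 1
    return {c: n / total for c, n in counts.items()}, 1 / total

def ppm_c(counts):
    # PPMC: the escape count equals the number of distinct followers seen.
    distinct = len(counts)
    total = sum(counts.values()) + distinct
    return {c: n / total for c, n in counts.items()}, distinct / total

print(ppm_a(followers))   # P(r) = P(n) = 1/3, escape = 1/3
print(ppm_c(followers))   # P(r) = P(n) = 1/4, escape = 1/2
```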

There is no single best way to assign escape probabilities or otherwise blend models of different order. The PPMZ algorithm (Bloom 1998) collected context-sensitive statistical data on which methods worked best and used them adaptively. Results at the time were the best known. Since then, the variants boa (Sutton 1998, and personal communication) and rkive (Taylor 1998) have achieved slightly better compression (Gilchrist 1998).

Popular compression programs such as UNIX compress (compress 1990; Gilly 1993), pkzip (1993), and gzip (Gailly 1993) use LZ or Ziv-Lempel compression. The idea is to replace repeated substrings with pointers to previous occurrences. There are many variations in how pointers are encoded, the size of the text window (maximum pointer range), and so on. Bell, Witten, and Cleary (1989) showed that LZ compression is equivalent to predictive methods. In practice, compression is sacrificed for speed and memory. Unlike PPM, LZ decompression is very fast.

The Burrows-Wheeler transform (Burrows and Wheeler 1994; Nelson 1996) is fast and achieves good compression on text. The idea is to sort the n rotations of a string of length n, producing a new string consisting of the last character of each rotation, i.e. the character preceding the sort column. Given only the new string and the position of the original (unrotated) string after sorting, it is possible to convert the new string back to the original. More importantly, the new string is sorted by context (reading backwards), so it can be compressed efficiently using order-0 (context insensitive) methods. The szip implementation (Schindler 1997, 1998) achieves the best known compression on the large (2-4 MB) text file portion of the Canterbury corpus (Bell 1998), but is slightly inferior to the best PPM algorithms on the smaller files of the Calgary corpus.
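A minimal forward transform along these lines; the inverse transform and the order-0 coding stage that follows it are omitted.

```python
def bwt(s: str) -> tuple[str, int]:
    """Burrows-Wheeler transform: sort all rotations of s and return the last
    column plus the row index of the original (unrotated) string."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    last_column = "".join(rot[-1] for rot in rotations)
    return last_column, rotations.index(s)

print(bwt("the cat ate the rat"))
# Similar contexts sort together, so the last column tends to contain runs of
# the same character, which an order-0 coder can then compress well.
```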

Entropy of English

If an information source produces messages x with probability P(x), then the entropy of the source, given by

H = −Σ_x P(x) log2 P(x),

is the theoretical limit, in bits, to which these messages can be compressed. For a string of n characters, P(x) is the product of the conditional probabilities of each character, given the previous text:

P(x) = P(x1x2...xn) = Π_{i=1..n} P(xi | x1x2...xi−1)

Shannon was interested in finding the entropy (in bits per character) for English text for various n. To estimate P, he drew random samples of text of length n from various sources and had human subjects guess the n’th character given the preceding n − 1 characters. In one test, he drew samples from Jefferson the Virginian by Dumas Malone, and had subjects guess from a 27 character alphabet (the letters A through Z and space), counting the number of guesses until correct. In one test with n = 100 and 100 trials, the subject guessed correctly on the first try 80% of the time, on the second try 7% of the time, and needed 3 or more tries on the remaining 13%. Depending on how the data is interpreted, Shannon determined that English has an entropy between 0.6 and 1.3 bits per character. He obtained slightly higher figures for newspaper articles, poetry, and technical literature.

The uncertainty in Shannon’s measurements stems from how the distribution of guesses is converted to a probability, which differs from letter to letter. In the worst case (1.3 bpc), the distribution is the same for every letter in the test. Actually, subjects guessed poorly on the first letter of each word, and with greater confidence on subsequent letters. In the best case (0.6 bpc), the distribution of guesses would be a sum of uniform probabilities over various ranges, i.e., 60% of the letters can be picked with certainty, 14% can be picked with 50% probability from 2 choices, another set has 3 choices with probability 1/3 each, and so on. Shannon proved that these two interpretations bound the entropy.

Cover and King (1978) overcame the uncertainty in Shannon’s measurement by having subjects assign probabilities directly in a betting game. Using the same text as Shannon and the same 27 character alphabet, they determined an upper bound of 1.3 to 1.7 bits per character for each subject, and 1.3 when the results were combined. Tan (1981) obtained a similar measurement for Malay text using this technique.

A problem with Cover and King’s technique is that people have difficulty estimating probabilities correctly. When asked to estimate the probability of an event, given several pieces of evidence of varying relevance, people tend to incorrectly weight the evidence equally (Schwartz and Reisberg 1991, p. 569). Also, when given a choice of n possible outcomes to an event, people tend to assign probabilities of 1/n to each outcome, regardless of evidence favoring one outcome over another (McDonald 1998). For instance, if the n = 2 outcomes of a lottery are win and lose, and P(win) < P(lose), then people will tend to overestimate their chances of winning. The unfortunate effect of this in an entropy test would be to estimate a more uniform character frequency distribution than actually exists, which would increase the entropy measurement.

Compression of English Text

I (Mahoney 1999) compared text compression with the entropy measurements described previously. To achieve comparable results, two large text files, Alice in Wonderland by Lewis Carroll (actually the last 152,141 bytes of alice30.txt from Project Gutenberg, after stripping off the legal header) and Far from the Madding Crowd by Thomas Hardy (actually book1 from the Calgary corpus), were used. The files were reduced to a 27 character alphabet by converting upper case characters to lower case and all other sequences of one or more nonalphabetic characters to a single space. These were then compressed with the programs listed below.

Program    Options    Version   Type   Reference
compress              4.3d      LZ     compress, 1990
pkzip                 2.04e     LZ     pkzip, 1993
gzip       -9         1.2.4     LZ     Gailly, 1993
ha         a2         0.98      PPM    Hirvola, 1993
szip       -b41 -o0   1.05xf    BW     Schindler, 1998
ppmz                  9.1       PPM    Bloom, 1998
boa        -m15       0.58      PPM    Sutton, 1998
rkive      -mt3       1.92b1    PPM    Taylor, 1998

Compression programs.

Program    alice     alice-diff   book1     book1-diff
size       135,059   135,059      731,361   731,361
compress   3.030     2.892        3.151     3.120
pkzip      2.616     2.565        2.958     2.950
gzip       2.592     2.526        2.926     2.912
ha         1.989     1.900        2.141     2.086
szip       1.938     1.915        2.102     1.993
ppmz       1.876     1.761
boa        1.871     1.754        1.962     1.872
rkive      1.856     1.750        1.944     1.856

Compression results (bits per character).

The average compression for the best algorithm, rkive, is about 1.9 bits per character. To get a better idea of what each program would achieve on a larger text sample, the differential compression (alice-diff and book1-diff) was computed by splitting each file in half and calculating the compression on each half after the other was compressed, averaged over both halves. The compression is better, about 1.8 bits per character, because the availability of statistics from the first half makes the second half more predictable. This effect is shown most clearly by Burrows and Wheeler (1994) for book1 (the original text with mixed case and punctuation) and the 103 megabyte Hector corpus of English text from various sources. Compression improves with block size in a Burrows-Wheeler compressor, which effectively compresses each block independently.

[Fig. 10. Compression vs. size for English text. book1 and hector were compressed using a Burrows-Wheeler block sorting algorithm for various block sizes (Burrows and Wheeler 1994). For comparison, compression for book1 is shown for szip, the best known Burrows-Wheeler compressor, and rkive, the best known of any type, as of Sept. 1998. The y axis is bits per character (0 to 4.5); the x axis is log (base 10) block size (0 to 9).]

Where book1 and hector overlap (1K to 750K), the Burrows and Wheeler data agree within 0.04 bits/character. Compression for book1 is 2.49 bits/character, compared with 2.35 for szip and 2.19 for rkive, the latter an improvement of 0.30. The 27-character version of book1 compressed to 2.10 for szip and 1.94 for rkive, a decrease of 0.25 for either program over the full text. The hector corpus compressed to 2.01. Therefore, the most optimistic possible projection of the compression for hector on a 27 character alphabet would be 1.46 bits/character. This is still larger than the 1.3 upper bound first estimated by Shannon. Nevertheless, the data suggests that this result might be achieved on even larger text files, perhaps on the order of gigabytes.

Proposal

I propose to estimate the memory requirements for a model of English by finding the relationship between compression and model size (amount of compressed text) and comparing to the entropy of English based on human prediction tests such as Shannon’s. There are four major areas of work to be done.

1. Obtain a corpus of English text.
2. Develop a compression algorithm that generalizes at the lexical, semantic, and syntactic levels.
3. Determine the relationship between compression and model size.
4. Develop an improved measure of entropy.

1. Obtaining a corpus

An ideal corpus for comparison between compression and the Turing test would be the transcripts between the judges and confederates in the Loebner contests. These are available, but not large enough to build a language model. Large amounts of text are available by randomly sampling web pages and USENET posts. To make sure that the corpus is a valid sample, its entropy should be similar to the Loebner transcripts, whether measured by compression in part 2, or by human tests in part 4.

2. Developing a multilevel, generalizing compressor

We have seen that all of the elements of a natural language model can be extracted from text samples without any a priori knowledge. Words and phrases can be identified by high entropy boundaries. Generalization at each level is a matrix operation, as follows.

Lexical level. Let A be a character adjacency matrix for a text sample, such that aij is the number of times that character i is followed by character j. Let ci be the number of occurrences of character i. Then aij/ci is the estimated probability that i is followed by j, and aij/cj is the probability that j is preceded by i. This is an order-1 context model, but the idea can be extended to higher orders.

To generalize the lexical model, let B = AA^t + A^tA. Then bij/cicj is the probability that i and j are preceded or followed by the same character. The extent to which a context containing character xi should be generalized to contexts containing xj remains to be determined experimentally.

Generalization can be iterated by repeated squaring. The elements of B^2 indicate how often the categories in which i and j are placed are preceded or followed by characters in the same category.
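A sketch of this computation with numpy on an arbitrary text sample; the normalization by character counts follows the description above, and the choice of sample and of which pair of characters to inspect is mine.

```python
import numpy as np

text = "the cat sat on the mat. The Cat ate."
chars = sorted(set(text))
index = {c: i for i, c in enumerate(chars)}

# A[i, j] = number of times character i is immediately followed by character j.
A = np.zeros((len(chars), len(chars)))
for a, b in zip(text, text[1:]):
    A[index[a], index[b]] += 1
counts = np.array([text.count(c) for c in chars], dtype=float)

# B[i, j] / (c_i * c_j): how often i and j are preceded or followed by the same character.
B = A @ A.T + A.T @ A
similarity = B / np.outer(counts, counts)

print(similarity[index["t"], index["T"]])   # upper/lower case letters share contexts,
                                            # so this entry should be relatively large
```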

Semantic level. Let A be a word association matrix, such that each element

aij = (1/(cicj)) Σ_{xi} Σ_{xj ≠ xi} 1/(d(xi) − d(xj))^2,

where xi and xj are the i’th and j’th words in the vocabulary, ci and cj are the number of occurrences of xi and xj in the text, d(x) is the position of word x in the text, and the sums run over the occurrences of each word. The element aij is thus the average inverse squared distance between occurrences of xi and xj. Large values indicate that the words often appear close together, and are therefore related. The elements of A^2 indicate how often two words appear close to the same third word. Again, the transitive property of related words can be extended by repeated squaring.
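A sketch of this association matrix on a made-up word sequence, following the formula as reconstructed above; the tokenization and the example sentence are mine, and a real implementation would restrict the sums to nearby occurrences for efficiency.

```python
import numpy as np
from collections import defaultdict

text = "roses grow in the garden . the garden has a flower bed . a rose is a flower".split()
positions = defaultdict(list)
for pos, word in enumerate(text):
    positions[word].append(pos)
vocab = sorted(positions)
idx = {w: i for i, w in enumerate(vocab)}

# a_ij = (1/(c_i c_j)) * sum over occurrence pairs of 1/(d(x_i) - d(x_j))^2
A = np.zeros((len(vocab), len(vocab)))
for wi in vocab:
    for wj in vocab:
        if wi == wj:
            continue
        s = sum(1.0 / (di - dj) ** 2 for di in positions[wi] for dj in positions[wj])
        A[idx[wi], idx[wj]] = s / (len(positions[wi]) * len(positions[wj]))

A2 = A @ A   # transitive associations: words that appear close to a common third word
print(A[idx["garden"], idx["flower"]], A2[idx["roses"], idx["flower"]])
```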

Syntactic level. This is like the lexical level, except that the elements are words instead of letters. If two words are preceded or followed by the same word, then they have a similar grammatical role (i.e. the cat ate, the dog ate; both nouns).

3. Relating compression to model size

In a compression program, both compression ratio and model size are easy to measure. Model size is the amount of compressed text after compression.

For most compression programs, compression improves as we increase file size until the memory limit is reached, and then it levels off. Poor (but fast) compressors such as compress reach this limit quickly. For very large files, the Burrows-Wheeler transform (szip) appears to be the best, but there is no reason in principle why an LZ or predictive arithmetic encoder should not do as well, given unlimited memory.

I believe the Burrows-Wheeler data on the Hector corpus does not show any signs of leveling off up to 103 MB (though the authors conclude differently). However, this, like all the compressors tested, uses a lexical model without generalization. I believe that as the input size increases, semantics and syntax will become more important. To discern an accurate trend, it will be necessary to characterize the effects of lexical, semantic, and syntactic generalization individually, as well as when taken together. It is important to understand how these different levels affect compression. Otherwise it would be easy to claim that my compressor would level off at some point short of the goal.

4. Improving entropy measurements

There is uncertainty in the entropy measurements of English that results from variations in text source, variations in human skill at prediction, and variations in the interpretation of the results. In Shannon’s case, the uncertainty in interpretation was 0.6 to 1.3 bpc. Cover and King used a method to remove that uncertainty, but made the testing process much more difficult for the subjects, arriving at 1.3 to 1.7 bpc for individual subjects depending on their ability, and 1.3 as an upper bound when the results were combined.

I believe that the actual entropy of English is closer to 1 bpc. I propose an alternative test that reduces (but does not eliminate) Shannon’s uncertainty, but is easier to administer than Cover and King’s test. Rather than guessing characters sequentially or assigning odds to them all at once, subjects would be given a choice of two sets of characters, A or B, and would also assign a confidence level, such as very certain, certain, probably, or don’t know. Guessing would proceed until the set is narrowed down to a single character, then repeat with the next character.

The guessing would be computer assisted. The two sets in each choice would be chosen to be of roughly equal probability, according to the program’s language model. Often it would be necessary to guess more than one character at a time, if a single character had a very high probability.

In each category of guesses (certainly A, probably B), the number of right and wrong guesses would be counted, and the probability p of guessing correctly would be estimated. The entropy for each category is n(p log 1/p + (1 − p) log 1/(1 − p)), where n is the number of guesses in the category. The entropy of the text is the sum of the entropies of the categories. This method overcomes the difficulty in Shannon’s technique by grouping together guesses of roughly equal probability.
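A sketch of the bookkeeping, with invented counts of right and wrong guesses in each confidence category:

```python
import math

# Hypothetical tallies: (right, wrong) binary guesses in each confidence category.
categories = {"very certain": (480, 5), "certain": (210, 20),
              "probably": (95, 35), "don't know": (40, 38)}

total_entropy = 0.0
for name, (right, wrong) in categories.items():
    n = right + wrong
    p = right / n                                   # estimated probability of a correct guess
    h = 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    total_entropy += n * h                          # n * (p log 1/p + (1 - p) log 1/(1 - p))
    print(f"{name}: p = {p:.2f}, contributes {n * h:.1f} bits")

print(f"total entropy of the guesses: {total_entropy:.1f} bits")
```

Dividing the total by the number of characters guessed would give the bits-per-character estimate.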

The program could be written in JavaScript and placed on a web page, with results recorded via CGI on a web server.

Evaluating the results

I do not believe that it is possible to positively prove that a certain language model size will pass the Turing test, short of actually building a model and winning the Loebner prize. I don’t plan to do that. What I hope to do is show a general trend that points to a number, which I believe will be around 10^9 bits. If this doesn’t happen, then I may need to look for another level of language constraints than the three I considered here. These are easy to find in other file formats, for instance, cyclical constraints in raster scanned image files due to the correlation of adjacent vertical pixels.

A fairly convincing demonstration, at least of the correctness of the language model, would be to improve on the best known compression programs. There is probably little room for improvement on the widely used Calgary corpus, due to the intense level of competition (see Gilchrist 1998) and the relatively small file sizes (under 1 MB each). An entropy measurement would settle the question of how much improvement is possible. There is no reason that the technique I described could not be used on binary files, when presented appropriately. However, I believe that the greatest room for improvement is on very large files.

If the estimate of 10^9 bits is correct, it ought to be possible to build a compressor and demonstrate it in other AI tasks. For instance, decompressing random data should produce English-like text. Decompressing a compressed question followed by random data should produce a reasonable answer. In a search engine, documents could be ranked according to how well the query is compressed after each document. In a proofreader, regions that compress poorly should be flagged as errors.
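
As a rough sketch of the search-engine idea (using zlib as a crude stand-in for the proposed language model, with a made-up document collection), documents can be ranked by how few extra bytes the query costs when compressed after each one:

    import zlib

    def rank_documents(query, documents):
        # Rank documents by the extra compressed bytes the query adds when it is
        # compressed after the document (smaller is a better match).
        def extra_bytes(doc):
            doc_b = doc.encode()
            both = doc_b + b" " + query.encode()
            return len(zlib.compress(both, 9)) - len(zlib.compress(doc_b, 9))
        return sorted(documents, key=extra_bytes)

    # Hypothetical toy collection:
    docs = ["the cat sat on the mat",
            "stock prices fell sharply today",
            "a needle and thread are used for sewing"]
    for d in rank_documents("sewing with needle and thread", docs):
        print(d)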

Schedule
The work should take 1 to 2 years. If necessary, the last part, measuring the entropy of English, could be skipped in favor of published results, in which case the estimate of language complexity would be less accurate.

Conclusion
AI is one of the most difficult problems in computer science. Before we give up and say the goal is unobtainable, we should at least try to see where the goal lies. Evidence from psychology suggests that it is not so remote.

The usual approach in AI has been to attempt to build systems that imitate something that people do well. We don't normally think of data compression as an AI task, even though, for text, it is really the same problem. People are not good at compression because an exactly identical language model is needed to decompress the data, and we cannot make a copy of the human brain as we can with a program.

Compression is a difficult problem, like AI, so just trying to solve it is not the answer. However, compression can be measured precisely and objectively on standard benchmarks, so it is ideal for studying the dependence on model size and projecting the results to a known goal.

References
Ackley, David H., Geoffrey E. Hinton, and Terrence J. Sejnowski (1985), "A learning algorithm for Boltzmann machines", Cognitive Science 9:147-169.

Alta-Vista (1998), http://www.altavista.digital.com (May 28, 1998).

Anderson, James A. (1983), "Cognitive and Psychological Computation with Neural Models", IEEE Transactions on Systems, Man, and Cybernetics 13(5):799-815.

Bell, Timothy, Ian H. Witten, John G. Cleary (1989), "Modeling for Text Compression", ACM Computing Surveys 21(4):557-591.

Bell, Timothy (1998), Canterbury Corpus, http://corpus.canterbury.ac.nz/

Bloom, Charles (1997), ppmz v9.1, http://www.cco.caltech.edu/~bloom/src/ppmz.html (Sept 28, 1998).

Bloom, Charles (1998), "Solving the Problems of Context Modeling", http://www.cco.caltech.edu/~bloom/papers/ppmz.zip (Sept 28, 1998).

Borko, Harold (1967), Automated Language Processing, The State of the Art, New York: Wiley.

Bower, B. (1998), “Learning to make, keep adult neurons”, Science News 155(11):170.

Burrows, M., and D. J. Wheeler (1994), A Block-sorting Lossless Data Compression Algorithm, Digital Systems Research Center, http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/src-rr-124.html (Oct 30, 1998).

Calgary Corpus (1993), http://www.kiarchive.ru/pub/msdos/compress/calgarycorpus.zip (Oct. 9, 1998).

Cardie, Claire (1997), “Empirical Methods of Information Extraction”, AI Magazine 18(4):65-79.

Carroll, Lewis (1865), Alice in Wonderland, Gutenberg Press, ftp://sunsite.unc.edu/pub/docs/books/gutenberg/etext97/alice30h.zip (Oct. 5, 1998).

Charniak, Eugene (1997), "Statistical Techniques for Natural Language Parsing", AI Magazine 18(4):33-43.

Comprende (1998), http://comprende.globalink.com/main.html (May 28, 1998).

compress 4.3d for MSDOS (1990), ftp://ctan.tug.org/tex-archive/tools/compress/msdos.zip (Nov. 3, 1998).

Cover, T. M., and R. C. King (1978), "A Convergent Gambling Estimate of the Entropy of English", IEEE Transactions on Information Theory 24(4):413-421.

Crick, F. H. C., and C. Asanuma (1986), Certain aspects of the anatomy and physiology of the cerebral cortex, in Rumelhart, David E., James L. McClelland, and the PDP Research Group (1986), Parallel Distributed Processing, vol. 2, Cambridge MA: MIT Press, pp. 333-371.

Cycorp Inc. (1997), http://www.cyc.com (Oct. 15, 1998).

Data Compression FAQ (July 25, 1998), http://www.cs.ruu.nl/wais/html/na-dir/compression-faq/.html (Oct. 5, 1998).

Dietterich, Thomas G. (1997), "Machine Learning Research, Four Current Directions", AI Magazine 18(4):97-136.

Feldman, J. A., and D. H. Ballard (1982), "Connectionist Models and their Properties", Cognitive Science 6:205-254.

Fischer, Charles N., and Richard J. LeBlanc Jr. (1991), Crafting a Compiler with C, Redwood City, Calif.: Benjamin/Cummings Publishing Co., Inc.

Flinders University (1998), The Flinders University of South Australia presents the 1999 Loebner Prize Competition, http://www.cs.flinders.edu.au/Research/AI/LoebnerPrize/ (Oct 15, 1998).

Gilchrist, Jeff (Sept. 1998), Archive Comparison Test, http://www.geocities.com/SiliconValley/Park/4264/act-mcal.html (Oct. 5, 1998).

Gilly, Daniel (1992), Unix in a Nutshell, Sebastopol CA: O'Reilly.

Green, Bert F. Jr., Alice K. Wolf, Carol Chomsky, and Kenneth Laughery (1961), Baseball: An Automatic Question Answerer, Proceedings of the Western Joint Computer Conference, 19:219-224, reprinted in Computers and Thought, E. A. Feigenbaum and J. Feldman eds., New York: McGraw-Hill, 1963.

Grossman, David A., and Ophir Frieder (1998), Information Retrieval: Algorithms and Heuristics, Boston: Kluwer Academic Publishers.

Guha, R. V., and D. B. Lenat (1994), "Comparing CYC to Other AI Systems", http://www.cyc.com/tech-reports/act-cyc-406-91/act-cyc-406-91.html (May 29, 1998).

gzip 1.2.4 (1993), Jean-loup Gailly, http://www.kiarchive.ru/pub/msdos/compress/gzip124.exe (Oct. 9, 1998).

Hebb, D. O. (1949), The Organization of Behavior, New York: Wiley.

Hirvola, H. (1993), ha 0.98, http://www.webwaves.com/arcers/msdos/ha098.zip (Jan 12, 1999).

Hopfield, J. J. (1982), "Neural networks and physical systems with emergent collective computational abilities", Proceedings of the National Academy of Sciences 79:2554-2558.

Hutchens, Jason L., and Michael D. Alder (1997), "Language Acquisition and Data Compression", 1997 Australasian Natural Language Processing Summer Workshop Notes (Feb.), 39-49, and http://ciips.ee.uwa.edu/au/~hutch/research/papers (Oct. 5, 1998).

Hutchens, Jason L., and Michael D. Alder (1998), "Finding Structure via Compression", Proceedings of the International Conference on Computational Natural Language Learning, pp. 79-82, http://ciips.ee.uwa.edu/au/~hutch/research/papers (Oct. 5, 1998).

James, William (1890), Psychology (Briefer Course), New York: Holt, ch. XVI "Association", 253-279.

Knight, Kevin (1997), "Automatic Knowledge Acquisition for Machine Translation", AI Magazine 18(4):81-96.

Kuffler, Stephen W., John G. Nicholls, and Robert A. Martin (1984), From Neuron to Brain, 2nd ed., Sunderland MA: Sinauer Associates Inc.

Landauer, Tom (1986), "How much do people remember? Some estimates of the quantity of learned information in long term memory", Cognitive Science 10:477-493.

Lindsay, Robert K. (1963), "Inferential Memory as the Basis of Machines which Understand Natural Language", Computers and Thought, E. A. Feigenbaum and J. Feldman eds., New York: McGraw-Hill, 217-233.

Loebner, Hugh (1998), Home Page of The Loebner Prize, "The First Turing Test", http://www.loebner.net/Prizef/loebner-prize.html (Nov. 4, 1998).

Mahoney, Matthew V. (1999), "Text Compression as a Test for Artificial Intelligence", http://www.he.net/~mmahoney/paper4.ps.Z

Mandelbrot (1953), reference in [Borko 1967] p. 96 given as: Mandelbrot, B., "An informational theory of the structure of language based upon the theory of the statistical matching of messages and coding", Proceedings of a Symposium on Applications of Communication Theory, W. Jackson (ed.), London: Butterworths, 1953, 486-502.

Mauldin, Michael L. (1994), Chatterbots, Tinymuds, And The Turing Test: Entering The Loebner Prize Competition, AAAI-94, http://www.fuzine.com/mlm/aaai94.html (Nov. 4, 1998).

McDonald, John (1998), 200% Probability and Beyond: The Compelling Nature of Extraordinary Claims in the Absence of Alternative Explanations, Skeptical Inquirer 22(1):45-49, 64.

Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K. (1993), Introduction to WordNet: An On-line Lexical Database, ftp://ftp.cogsci.princeton.edu/pub/wordnet/5papers.ps (Feb. 5, 1999).

Minsky, Marvin, and Seymour Papert (1969), Perceptrons, Cambridge MA: MIT Press, Introduction, pp. 1-20.

Mitchell, Tom M. (1997), Machine Learning, New York: McGraw-Hill.

Newsbytes (1988), "Automatic Language Translation System for Telecommunications", Mar. 22, 1988, http://www.nbnn.com/pubNews/88/48342.html (Mar. 29, 1998).

Nickerson, R. S. (1968), "A note on long term recognition for picture material", Psychonomic Science 11:58.

Papalia, Diane E., and Sally Wendkos-Olds (1990), A Child's World, Infancy through Adolescence, New York: McGraw-Hill.

PKZIP (1993), version 2.04e, PKWARE Inc.

Porter, M. F. (1980), "An algorithm for suffix stripping", Program 14:130-137.

Resnik, Philip (1992), "WordNet and Distributional Analysis: A Class-based Approach to Lexical Discovery", in Statistically-Based Natural Language Programming Techniques, Papers from the 1992 AAAI Workshop, Technical Report W-92-01, Menlo Park CA: AAAI Press, 48-56.

Rich, Elaine, and Kevin Knight (1991), Artificial Intelligence, 2nd Ed., New York: McGraw-Hill.

Rosenblatt, F. (1958), "The perceptron: a probabilistic model for information storage and organization in the brain", Psychological Review 65:386-408.

Schindler, Michael (1998), szip homepage, http://www.compressconsult.com/szip/ (Oct 30, 1998).

Schindler, Michael (1997), A Fast Block-sorting Algorithm for Lossless Data Compression, 1997 Data Compression Conference, http://www.compressconsult.com/szip/ (Oct 30, 1998).

Schwartz, Barry, and Daniel Reisberg (1991), Learning and Memory, New York: W. W. Nortonand Company.

Sejnowski, T. J. (1986), Open questions about computation in cerebral cortex, in Rumelhart, David E., James L. McClelland, and the PDP Research Group (1986), Parallel Distributed Processing, vol. 2, Cambridge MA: MIT Press, 372-389.

Shannon, Claude, and Warren Weaver (1949), The Mathematical Theory of Communication, Urbana: University of Illinois Press.

Shannon, Claude E. (1950), "Prediction and Entropy of Printed English", Bell System Technical Journal 30:50-64.

Standing, L. (1973), "Learning 10,000 Pictures", Quarterly Journal of Experimental Psychology 25:207-222.

Stolcke, Andreas (1997), "Linguistic Knowledge and Empirical Methods in Speech Recognition", AI Magazine 18(4):25-31.

Storer, James A. (1988), Data Compression, methods and theory, Rockville MD: Computer Science Press.

Sutton, Ian (1998), boa 0.58 beta, http://webhome.idirect.com/~isutton/ (Oct 5, 1998).

Tan, C. P. (1981), "On the Entropy of the Malay Language", IEEE Transactions on Information Theory 27(3):383-384.

Taylor, Malcolm (1998), RKIVE v1.91 beta 1, http://www.geocities.com/SiliconValley/Peaks/9463/rkive.html (Nov. 10, 1998).

Turing, A. M. (1950), "Computing Machinery and Intelligence", Mind 59:433-460, reprinted in Computers and Thought, E. A. Feigenbaum and J. Feldman eds., New York: McGraw-Hill, 1963. (See also http://www.loebner.net/Prizef/TuringArticle.html (Dec. 8, 1998).)

Uhr, Leonard, and Charles Vossler (1963), "A Pattern-Recognition Program that Generates, Evaluates, and Adjusts its own Operators", Computers and Thought, E. A. Feigenbaum and J. Feldman eds., New York: McGraw-Hill, 251-268.

Whitten, David (1994), "The Unofficial, Unauthorized CYC Frequently Asked Questions Information Sheet", http://www.mcs.net/~jorn/html/ai/cycfaq.html (May 29, 1998).

Williams, Martyn (1996), “Review - Tsunami, Typhoon for Windows”, http://www.nb-pacifica.com/headline/reviewtsunamityphoonf_608.shtml (May 29, 1998).

WordNet, http://www.cogsci.princeton.edu/~wn/ (Feb 5, 1999).

Zipf, George Kingsley (1935), The Psycho-Biology of Language, an Introduction to Dynamic Philology, M.I.T. Press.