
Progressing the State-of-the-art in Grammatical Inference by Competition: The Omphalos Context-Free Language Learning Competition

Bradford Starkie a,∗, François Coste b and Menno van Zaanen c

a Telstra Research Laboratories

770 Blackburn Rd Clayton

Melbourne, Victoria

Australia 3127

E-mail: [email protected]

b IRISA

Campus de Beaulieu

35042 Rennes

France

E-mail: [email protected]

c Tilburg University

Postbus 90153, 5000 LE Tilburg

the Netherlands

E-mail: [email protected]

This paper describes the Omphalos Context-Free Language Learning Competition held as part of the International Colloquium on Grammatical Inference 2004. After the success of the Abbadingo Competition on the better known task of learning regular languages, the competition was created in an effort to promote the development of new and better grammatical inference algorithms for context-free languages, to provide a forum for the comparison of different grammatical inference algorithms, and to gain insight into the current state-of-the-art of context-free grammatical inference algorithms.

This paper discusses design issues and decisions made when creating the competition, leading to the introduction of a new complexity measure developed to estimate the difficulty of learning a context-free grammar. It also presents the results of the competition and the lessons learned.

Keywords: Grammatical Inference, Context-Free Grammars, Competition

*Corresponding author.

1. Introduction

Research performed in the field of Grammatical Inference (GI) is, as the name suggests, directed towards learning grammars from “sequential” data (i.e. sequences, words or strings, but also more complicated structures such as trees, terms or even graphs). The induced, or inferred, grammars can then be used to classify unseen data, compress this data or provide some suitable model for this data. The Omphalos competition focussed upon the inference of Context-Free Grammars (CFGs).

Grammatical inference is a well-established research field in Artificial Intelligence, dating back to the 1960s. Gold [1] originated this study, motivated by the construction of a formal model of human language acquisition. Since this seminal paper, grammatical inference has been investigated, more or less independently, within many research fields, including machine learning, computational learning theory, structural and syntactic pattern recognition, computational linguistics and computational biology. A good survey of the field is presented in [2], while [3] provides a recent guide to its bibliography.

Formal grammatical inference is predominantly concerned with the learnability of specific types of grammars. Learnability in this type of work can be defined in different ways. For example, the well-known work by [1] defines learnability in a game-theoretic way. The idea is that, in the limit, a learner has to find the same grammar used by the teacher to generate examples of the language described by the grammar. The setting of active learning, with the possibility of asking queries of the teacher, has also been studied [4]. A different definition of learnability is called probably approximately correct (PAC) learning [5], where the goal is to minimize the probability of learning an incorrect grammar. Within this field, proofs show that certain classes of grammars can be learnt according to a set of learnability criteria.

AI Communications. ISSN 0921-7126, IOS Press. All rights reserved.

When considering efficiency, results are mainly negative, even for the simplest class of the Chomsky hierarchy, if the presentation of the examples is arbitrary. Gold [6] proved that the problem of finding the smallest deterministic finite automaton (DFA) consistent with a given finite sample of positive and negative examples is NP-hard. Moreover, Pitt and Warmuth [7] showed that no polynomial algorithm can be guaranteed to find an approximation (in size) of this target automaton. In the PAC setting, they also showed that the problem of polynomially approximate predictability of the class of DFAs is hard [8]; indeed, this problem can be linked with solving cryptographic problems believed to be intractable [9]. To allow efficient learnability, one has to assume that “good” examples are provided to the learner: simple distributions in the PACS setting [10], or the presence of samples containing characteristic sets, formalized by de la Higuera [11] in his model of identification in the limit from polynomial time and data. This latter model enables all languages that can be described using a DFA to be learnt, but only when the training set contains a characteristic set of sentences. Note that even in this setting, not all context-free languages can be identified.

Due to the hardness of the task, the main algorithms developed for the grammatical inference of context-free languages are either algorithms devoted to specific subclasses of grammars [12,13,14,15,16] or heuristic approaches [17,18,19], which are both difficult to compare from a theoretical or computational point of view. One of the more objective ways of comparing these various approaches is to test their generalization performance on given data. Several benchmarks have been constructed by the authors to test their own algorithms1, but to really compare these various approaches, “neutral” benchmarks are needed. This has been one of the achievements of grammatical inference competitions.

Here, we will describe Omphalos, a context-free language learning competition held in conjunction with the International Colloquium on Grammatical Inference 2004.2 Omphalos is the latest competition in a line of competitions within the GI field. Whereas Abbadingo3 and its follow-on Gowachin4, or the GECCO contest from noisy data5, were directed towards learning deterministic finite automata, which can describe the class of regular languages, Omphalos is the first GI competition aimed at learning context-free languages.

1See for instance this GI benchmark repository: http://www.irisa.fr/symbiose/people/coste/gi benchs.html.

2The competition data will continue to be available after the completion of the competition and can be accessed via http://www.irisa.fr/Omphalos/.

Motivated by the progress permitted by the Abbadingo competition [20], the competition was set up with the following aims:

– to promote the development of new and better grammatical inference (GI) algorithms just beyond the current state-of-the-art in grammatical inference,

– to provide a forum in which the performance of different grammatical inference algorithms can be compared on a given task, and

– to provide an indicative measure of the complexity of grammatical inference problems that can be solved with the current state-of-the-art techniques.

The rest of the article is organized as follows. We will start with a description of context-free grammars for those unfamiliar with the concept. In that section we will also define the notation used in this paper. We will then describe the competition task of Omphalos, in a section where we treat design issues, the actual setup of the competition and different evaluation methods.

In the next section we propose a new complexity measure to estimate the difficulty of learning a context-free grammar. The article ends with the results of the competition, including the contribution of the competition to advancing the state-of-the-art, and a discussion of the lessons that have been learnt from the competition.

1.1. Context-Free Grammars

A context-free grammar can be represented by a 4-tuple G = (N, Σ, Π, S), where N is the alphabet of non-terminal symbols; Σ is the alphabet of terminal symbols such that N ∩ Σ = ∅; Π is a finite set of productions P of the form “L → s1 s2 . . . sn” where L ∈ N and each si ∈ (N ∪ Σ); and S is a special non-terminal called the start symbol that represents all of the sentences generated by G. For the sake of brevity, a context-free grammar is sometimes described purely in terms of its production rules. This is because, as long as the start symbol is known, the values of N and Σ can be extracted from the production rules. All example grammars in this paper will use the start symbol S.

3See http://abbadingo.cs.unm.edu/.

4See http://www.irisa.fr/Gowachin/.

5See http://cswww.essex.ac.uk/staff/sml/gecco/NoisyDFA.html.

For example, consider the following grammar.

Grammar 1.

S → i’d like to fly X
X → to LOC
X → from LOC to LOC
LOC → melbourne
LOC → sydney

Valid sentences of a language described by a grammar can be created by selecting one of the rules that has the start symbol on its left-hand side. Each non-terminal Ni on the right-hand side of a rule is then expanded by replacing Ni with the right-hand side of a rule with Ni on its left-hand side. The process is repeated until no further non-terminals can be expanded.

The notation “A ⇒∗ w” denotes that A can be expanded via zero or more applications of a production rule to become w.

For instance, according to Grammar 1:

S ⇒∗ i’d like to fly from melbourne to sydney

The sequence of rules used to generate a sentence can be represented by a structure known as a derivation tree, which can be represented either by a tree diagram (Figure 1) or by a series of bracketings (Figure 2).
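The expansion procedure described above can be sketched in a few lines of Python. This is our own illustration, not part of the competition materials; the dictionary below encodes Grammar 1, with uppercase strings standing for non-terminals and everything else treated as a terminal word.

```python
import random

# Grammar 1: non-terminals are the dictionary keys; each maps to a list
# of alternative right-hand sides.
grammar = {
    'S':   [["i'd", 'like', 'to', 'fly', 'X']],
    'X':   [['to', 'LOC'], ['from', 'LOC', 'to', 'LOC']],
    'LOC': [['melbourne'], ['sydney']],
}

def generate(symbol='S'):
    """Expand `symbol` by repeatedly replacing each non-terminal with the
    right-hand side of one of its rules, until only terminals remain."""
    if symbol not in grammar:          # terminal: emit as-is
        return [symbol]
    rhs = random.choice(grammar[symbol])
    out = []
    for s in rhs:
        out.extend(generate(s))
    return out

random.seed(1)
print(' '.join(generate()))  # e.g. "i'd like to fly to sydney"
```

Every sentence produced this way is, by construction, a member of the language L(Grammar 1).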

In this document we will use the following notation.

1. Lowercase letters represent terminal symbols, e.g., “a”, “b”.

2. Uppercase letters represent non-terminal symbols, e.g., “A”, “B”. The symbol S is reserved to represent the start symbol.

3. Bold lowercase letters represent sequences of one or more terminal symbols, e.g., “a”, “b”.

4. Italicized lowercase letters represent sequences of zero or more terminal symbols, e.g., “a”, “b”.

5. The symbol ε is used to represent the empty string.

6. Bold uppercase letters represent sequences of one or more terminal or non-terminal symbols, e.g., “A”, “B”.

7. The notation L(G) will be used to denote the language defined by the grammar G.

A special subclass of context-free grammar is the right linear grammar. In a right linear grammar, all production rules are of one of the forms “A → bC”, “D → e”, or “F → ε”.

Any language that can be described using a context-free grammar is known as a context-free language. Similarly, any language that can be described using a right linear grammar is known as a regular language. The class of all regular languages is a proper subset of the class of all context-free languages.

2. Design issues

Apart from the class of grammars to be learnt, several design issues had to be decided upon. The main issues were:

Format of training data For the participants of the competition to learn an unknown language, example sentences from that language need to be made available to them. There are many design choices to be made about the format of these example sentences, for instance, the amount of data, the type of data, completeness, the amount of noise in the data, etc.

Method of evaluation To be able to determine a winner of the competition, a means of evaluating the results needed to be decided upon. There are several ways of doing this, with certain problems associated with each possibility. We could, for instance, judge entries by measuring the difference between the inferred language and the target language, or, alternatively, by measuring the difference between derivation trees assigned to certain sentences by the inferred grammar and the corresponding derivation trees of the target grammar.

Complexity of tasks One of the goals of the competition was to determine and advance the state-of-the-art in grammatical inference. Therefore the competition tasks should be selected so as to be neither too simple, nor too difficult to solve. To be able to rank the competition problems, the complexity of the learning task should be quantifiable. This design issue is, of course, closely related to several aspects of the held-out data.


Fig. 1. Example parse tree using Grammar 1. [Tree diagram: S expands to “i’d like to fly X”; X expands to “from LOC to LOC”; the two LOC nodes expand to “melbourne” and “sydney”.]

(S i’d like to fly (X from (LOC melbourne) to (LOC sydney)))

Fig. 2. Alternative parse tree notation.

2.1. Format of Training Data

On a high level, one needs to decide the type of information that is given to the learning algorithm. There are several options, among others:

positive examples only a set of sentences that are part of the language,

positive and negative examples a set of sentences where each is labelled as a member of the language or not,

(partially) structured examples a set of tree structures denoting the (partial) analysis of the sentence according to the underlying grammar.

It is known that (within the framework of “identification in the limit”) many interesting classes of grammars are not learnable from positive data only. Context-free grammars are one of them [1]. However, [21] gives an enumerative algorithm that identifies Stochastic Context-Free Grammars (SCFGs) in the limit with probability one from stochastic data, i.e. data generated by an SCFG. The type and amount of queries needed to learn context-free grammars are still an open problem [4].

The goal of the competition is not to see whether context-free grammars are learnable or not, but to investigate the interaction between the learning algorithms, the complexity of the grammars, and the amount and type of data available. Many parameters determine whether or not context-free grammars are learnable.

In this competition, it was decided to provide either positive only, or positive and negative examples. Even though learning a grammar from structured examples is not trivial, (partially) structured examples are too closely related to an underlying grammar. Using positive and negative examples is more in line with the previous (regular language) competitions, and it is also possible for systems to ignore the negative examples if they want to claim to learn the grammar using positive examples only.

Apart from the type of training data, there are still a number of aspects of the data that are important for the setting of the problems. These aspects will be described below in Section 3, where the complexity of specific tasks is discussed.

2.2. Method of Evaluation

In order to find out when a system is successful in learning a grammar, a means of evaluation needs to be decided upon. We must have a way of knowing whether the grammar that underlies the training sentences has been learnt. There are several ways of tackling this problem. The actual method should be automatic, objective, and easy to use.

Before we can decide upon an evaluation method, we must first decide what exactly we want to be able to learn. A specific grammar defines a fixed language that possibly contains an infinite number of sentences. The sentences can be generated by combining the grammar rules, which can be seen as a tree structure on top of the language. The string language is the set of plain sentences that are part of the language, and the tree language is the set of sentences together with their derivation trees. There are often different ways to generate the same sentence using a grammar, which means that one sentence can have different trees. This is an important aspect in evaluation, because we need to know exactly what the goal, or the output, of the GI system should be.

There are many possible approaches to the evaluation of grammatical inference systems. Here we will give an overview of several ways to evaluate GI systems, together with a discussion of the advantages and disadvantages of each [22].

2.2.1. Rebuilding known grammars

Starting from a pre-defined grammar, a set of training sentences is generated. Using this set of sentences, the GI system should recreate the original grammar. The output grammar of the GI system is then compared against the original grammar.

There are several advantages of this approach.

– The grammar can be created by hand, having specific properties that might be difficult to learn by grammatical inference systems.

– The complexity of the grammar can be varied if needed by choosing a different grammar.

– It is easy to vary the amount of training data, because if necessary, more data can be generated.

– The generation process can be adjusted, allowing, for example, an ordering in the training data or a preference for sentences with certain aspects.

However, there are also problems with this approach.

– For most languages, there exists more than one grammar that generates the same string language. Although all regular grammars can be transformed into a canonical form, no such canonical form exists for context-free grammars. This implies that it is impossible to compare context-free grammars directly, and the only way to know for sure if two grammars are the same is to do manual investigation or to compare the sets of sentences that can be generated by the grammars. Unfortunately, in the general case a context-free grammar can generate an infinite number of sentences, and comparing two grammars may not be possible. It is definitely the case that this cannot be performed automatically using the current state-of-the-art.

– The flexibility of sentence generation can also be seen as a disadvantage. A decision has to be made on how exactly the sentences are generated. Not only may this influence the applicability of certain GI systems, it also introduces a new design choice: should information on the generation method be available to the GI systems?

2.2.2. Comparison of treebanks

This evaluation approach is based on a treebank, an annotated set of trees. A tree in this context is a sentence in the language combined with its derivation according to the grammar. The GI system may use the plain sentences from the treebank and try to structure these sentences. The inferred trees (which indirectly describe the grammar) are compared against the structure of the original treebank. A distance measure is then used to rank the similarity between the derivation trees assigned to sentences by the GI system and the derivation trees in the original treebank.

Compared to the previous approach, this comparison has certain advantages.

– An automated comparison of the output of the GI system against the original treebank is possible. Comparing tree structures can be done automatically.

– Even partial grammatical equivalence can be measured by the distance metric. On the other hand, because for most languages there is more than one grammar describing that language, a grammar submitted by a competitor may describe the target language exactly, but still rank poorly according to the distance measure.

Again, there are certain problems with this approach.

– When ambiguity occurs in either the target or inferred grammar, multiple solutions are possible. The GI system must then disambiguate between the multiple structurings (even though the string language of both grammars is the same).


– Related to the previous point, in addition to learning a grammar, this approach also needs a way to create structured versions of the input sentences.

2.2.3. Classification of unseen examples

The previous evaluation methods concentrate on measuring the structure that is learnt by the GI system. This can be a grammar, as in the first method, or the structure of sentences, as in the second method. Because these methods both have problems related to comparing the structure, we also describe two methods that do not necessarily depend on the actual structure of the grammar or the sentences.

In the classification of unseen examples approach, the GI systems receive unstructured training data. This training data can be labelled as positive only (i.e. part of the language), or as positive and negative (i.e. not part of the language). The sentences in the language can be generated from a grammar describing the language; negative sentences need to be generated in another way. Next, the GI system should assign language membership information to unseen (unlabelled) test data. In other words, the system must say for each test sentence whether or not it is contained in the language described by the underlying grammar.

There are several advantages of this approach.

– The evaluation is unaffected by the actual structure of the grammar. This removes the problems with comparing grammars. Additionally, a wider variety of inference techniques can be used by competitors, which enables the competition to compare the results of a wider range of techniques.

– The actual evaluation process is clear. The labels can be correct or incorrect, and not partially correct (which may occur when tree structures are compared).

Of course, this approach has some problems as well.

– Since this task is essentially a classification task, no real grammatical information has to be learnt as such. In fact, generic machine learning classifiers can be used.

– If too little test data is used, it is hard to get a good idea of the actual performance of the GI systems.

– The way in which negative data is constructed for the purposes of testing has a major impact upon the ability of the test to measure the distance between the inferred grammar and the target grammar. For instance, if the negative data varies greatly from the positive data, then it may be possible to correctly classify negative and positive data with very little actual knowledge of the target language. In contrast, if the positive and negative examples are very similar, then a much more accurate model of the target language is required to accurately classify the test sentences.

2.2.4. Precision and recall

Each GI system receives unstructured training data generated according to a target grammar. From this training data a grammar is inferred. Next, previously unseen sentences that are in the target language are provided to the system. The percentage of these sentences that are in the inferred grammar is then measured. This measure is known as the recall. Next, sentences are generated according to the inferred grammar. The percentage of these sentences that exist in the target language is then measured. This measure is known as the precision. Recall and precision can then be merged into a single measure known as the F-score. The F-score is a measure of the similarity between the target and inferred languages.
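The recall, precision and F-score computation just described can be illustrated with a small Python sketch. The two sample sets below are invented stand-ins for the unseen target-language sentences and the sentences generated from the inferred grammar; any real evaluation would replace the membership tests with actual parsing.

```python
# Hypothetical finite approximations of the two languages (our own examples).
target_sample = {'ab', 'aabb', 'aaabbb'}    # unseen sentences from the target language
inferred_sample = {'ab', 'aabb', 'abab'}    # sentences generated by the inferred grammar

def in_inferred(s):        # stand-in for parsing s with the inferred grammar
    return s in inferred_sample

def in_target(s):          # stand-in for membership of s in the target language
    return s in target_sample

# Recall: fraction of unseen target sentences accepted by the inferred grammar.
recall = sum(in_inferred(s) for s in target_sample) / len(target_sample)
# Precision: fraction of generated sentences that are in the target language.
precision = sum(in_target(s) for s in inferred_sample) / len(inferred_sample)
# F-score: harmonic mean of precision and recall.
f_score = 2 * precision * recall / (precision + recall)

print(precision, recall, f_score)  # here all three equal 2/3
```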

Advantages of this approach are as follows.

– This approach is quite similar to the classification of unseen examples. The main difference is that it is more flexible; systems that do not perform a perfect classification can still be measured against each other.

– The precision and recall measures give a good indication of the correctness and completeness of the systems. The F-score combines these measures into one number, which allows for an easy and direct comparison of systems.

Of course, this approach has disadvantages aswell.

– Once a test sentence has been used to measure recall, it cannot be used to measure recall a second time. This is because that test sentence can also be used as an additional training example.

– This technique requires that all grammatical inference systems be capable of generating example sentences.


2.2.5. Design choice

All these evaluation methods have problems when applied to a generic GI competition as described above. It was decided, however, that for the Omphalos competition, the technique of “classification of unseen examples” was to be used to identify the winner of the competition. There are several reasons for this. Firstly, this is the same evaluation method that was used for the Abbadingo [20] DFA (regular language) learning competition; therefore this method already has some acceptance in the GI community. Secondly, it allows many GI systems to participate (no structure needs to be generated) and, lastly, automatic evaluation is possible.

For Omphalos, a competitor was considered to have solved a problem when the test sentences were classified correctly with 100% accuracy, compared to 99% accuracy for the Abbadingo competition. This stricter requirement was used in an effort to encourage the development of new, truly context-free learning algorithms.

The main benefit of this technique is that it places few restrictions upon the techniques used to classify the data. In addition, if the inferred grammar describes the target language exactly, it will classify the test examples exactly. The main disadvantage of this technique is that it is possible for a (non-GI) classifier to classify all of the test examples exactly without the classifier having an accurate model of the target language.

One way to overcome the problem that generic classifiers can be used is to only consider the problem to be solved when the precision of the inferred grammar is greater than a threshold value. We decided not to implement this additional constraint, since it is believed that if the test sets contain negative examples that are sufficiently close to positive examples, then classification accuracy is a suitable measure of how close the inferred grammar is to the target grammar.

3. Measuring the complexity of a learning task

Ideally, when constructing the learning tasks in the Omphalos competition it should be known that it is possible for a competitor to solve the problems. In addition, it is desirable to know how difficult it would be for a competitor to solve those problems. To do this, a measure of the complexity of a grammatical inference task was needed. Such a measure can be used to rank grammatical inference tasks not only within the competition, but can also be used to rank the complexity of GI problems that can be solved using state-of-the-art techniques at any point in time.

To determine the complexity of a grammatical inference task, a measure was derived by modelling the learning task as a brute-force search. To do this, a hypothetical grammatical inference algorithm called the BruteForceLearner algorithm was created. This algorithm can identify any context-free grammar in the limit in exponential time, using positive and negative data. The choice of modelling a grammatical inference algorithm that uses both positive and negative data was made because it is known that context-free languages cannot be identified without negative examples, using the identification in the limit learning model. The size of the hypothesis space considered by the BruteForceLearner algorithm was used as a measure of the complexity of the learning task. The model was also used to determine if the training data was sufficient to identify the target grammar, that is, whether or not it was possible to solve the competition problems. The model also gave insight into the way in which the training data should be constructed. As far as is known, this work is the first to have quantified the size of a minimum hypothesis space for grammatical inference of all context-free languages, as well as being the first test that can be applied to determine if a set of training examples is sufficient for at least one algorithm to infer the target language.

The definition of the BruteForceLearner is given in Algorithm 3. This function makes use of the PruneGrammars and ConstructAllGrammars functions given in Algorithms 1 and 2 respectively.

Definition 1. A context-free grammar is in Chomsky Normal Form (CNF) if all rules are of the form “S → ε”, “A → BC” or “A → b”, where S is the start symbol, A, B, C are non-terminals, ε is the empty string and b is a terminal symbol.
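Definition 1 can be checked mechanically. The following Python sketch is our own illustration, with representation conventions noted in the comments; it simply tests each production against the three allowed rule shapes.

```python
def is_cnf(productions, start='S'):
    """Check Definition 1: every rule is S -> epsilon, A -> B C, or A -> b.
    productions: list of (lhs, rhs) pairs, where rhs is a tuple of symbols.
    Convention: non-terminals are the left-hand-side symbols; terminals are
    lowercase strings; the empty tuple () represents epsilon."""
    nonterms = {lhs for lhs, _ in productions}
    for lhs, rhs in productions:
        if rhs == ():                      # A -> epsilon: only allowed for S
            if lhs != start:
                return False
        elif len(rhs) == 1:                # A -> b: a single terminal
            if rhs[0] in nonterms or not rhs[0].islower():
                return False
        elif len(rhs) == 2:                # A -> B C: two non-terminals
            if not all(s in nonterms for s in rhs):
                return False
        else:                              # longer right-hand sides are not CNF
            return False
    return True

# A CNF grammar for {a^n b^n | n >= 1}: S -> A X | A B, X -> S B, A -> a, B -> b
cnf = [('S', ('A', 'X')), ('S', ('A', 'B')), ('X', ('S', 'B')),
       ('A', ('a',)), ('B', ('b',))]
print(is_cnf(cnf))                   # True
print(is_cnf([('S', ('a', 'S'))]))   # False: mixes a terminal with a non-terminal
```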

It will now be shown that the BruteForceLearner algorithm can identify any context-free language in the limit.

Lemma 1 (from [23]). All context-free languages can be described by a grammar in CNF.


Algorithm 1 ConstructAllGrammars(O)

1: {O = a set of positive training examples.}
2: Let Σ = {b | abc ∈ O} {i.e. the set of all terminal symbols in O.}
3: Let N = {Ni | 1 ≤ i ≤ (∑j(2 × |Oj| − 2)) + 1} ∪ {S} {i.e. a set of non-terminals.}
4: Let Π = {S → ε} ∪ {A → BC | A, B, C ∈ N} ∪ {A → a | A ∈ N, a ∈ Σ} {i.e. the set of all possible CNF rules using Σ and N.}
5: Let G = (N, Σ, Π, S) {i.e. a CNF grammar with all possible CNF rules using Σ and N.}
6: Let H = {(N, Σ, Πi, S) | Πi ⊆ Π} {i.e. all possible subgrammars of G.}
7: return H

Algorithm 2 PruneGrammars(O+, O−)

1: {O+ = a set of positive training examples and O− = a set of negative training examples.}
2: Let H = ConstructAllGrammars(O+)
3: Eliminate all grammars in H that are inconsistent with O+ and O−
4: return H

Algorithm 3 BruteForceLearner

Let H = {({S}, ∅, {S → ε}, S)}.
loop
  Randomly select one of the remaining grammars in H as the hypothesis grammar.
  Get the next sample O_n.
  if O_n is a positive example then
    Let O+ = O+ ∪ {O_n}.
  end if
  if O_n is a negative example then
    Let O− = O− ∪ {O_n}.
  end if
  Eliminate all grammars in H that are inconsistent with O+ and O−.
  if |H| = 0 then
    Let H = PruneGrammars(O+, O−).
  end if
end loop
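To make the search concrete, the following self-contained sketch (our own illustration, not competition code) enumerates every sub-grammar of the candidate CNF rules over two non-terminals and the alphabet {a}, and prunes those inconsistent with a labelled sample of the language a⁺, in the spirit of ConstructAllGrammars and PruneGrammars:

```python
from itertools import product

def parses(w, rules, start="S"):
    """CYK membership test; rules are (lhs, rhs) pairs in CNF,
    with (start, ()) encoding the rule S -> epsilon."""
    if w == "":
        return (start, ()) in rules
    n = len(w)
    table = {}                                    # table[(i, j)]: NTs deriving w[i:j]
    for i in range(n):
        table[(i, i + 1)] = {A for (A, rhs) in rules if rhs == (w[i],)}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = set()
            for k in range(i + 1, j):
                for (A, rhs) in rules:
                    if len(rhs) == 2 and rhs[0] in table[(i, k)] and rhs[1] in table[(k, j)]:
                        cell.add(A)
            table[(i, j)] = cell
    return start in table[(0, n)]

# All candidate CNF rules over non-terminals {S, A} and terminal alphabet {a}:
# 1 epsilon rule + 8 binary rules + 2 terminal rules = 11 candidates,
# hence 2^11 = 2048 candidate grammars (rule subsets).
nts, terms = ["S", "A"], ["a"]
candidates = [("S", ())]
candidates += [(x, (y, z)) for x, y, z in product(nts, repeat=3)]
candidates += [(x, (t,)) for x in nts for t in terms]

pos = ["a", "aa", "aaa"]      # positive samples from the target language a+
neg = [""]                    # the empty string is a negative sample

# Keep every grammar consistent with the labelled data (the PruneGrammars step).
consistent = []
for mask in range(2 ** len(candidates)):
    rules = {r for i, r in enumerate(candidates) if mask >> i & 1}
    if all(parses(w, rules) for w in pos) and not any(parses(w, rules) for w in neg):
        consistent.append(rules)

target = {("S", ("a",)), ("A", ("a",)), ("S", ("A", "S"))}
print(len(consistent) > 0, target in consistent)   # True True
```

Even for this toy alphabet the learner must sift 2048 grammars; the exponential growth of this hypothesis space with the rule count is exactly what the complexity measure below quantifies.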

Definition 2 (redundant rules). A grammar G = (N, Σ, Π, S) contains redundant rules if any rule can be removed without affecting the language generated by G.

Lemma 2. Let G = (N, Σ, Π, S) be a CNF grammar without redundant rules. CalcDerivations(G) (as shown in Algorithm 5) returns a finite set of derivations D, with each unique derivation d_i = "S ⇒* w" ∈ D deriving a unique sentence w, such that each rule of G is used at least once in D.

Proof. As noted in the comments of Algorithm 4, for each rule R a sentence w exists whose every derivation uses R, because if no such sentence existed then R could be removed from G and all sentences could still be derived, implying that R was redundant. It can be seen that |O| ≤ |Π| and |D| = |O|; therefore both D and O are finite. Each sentence in O has a unique derivation in D, and every rule in Π is used at least once in D.

Algorithm 4 CalcObservations(G)

Require: G = (N, Σ, Π, S) is a CNF grammar with no redundant rules.
Let O = ∅
for all R ∈ Π do
  Construct a sentence w such that every derivation tree of w uses R. {We know that w exists because if it did not then we could remove the rule R from G and all sentences could still be derived. This would imply that R was a redundant rule.}
  Let O = O ∪ {w}.
end for
return O

Algorithm 5 CalcDerivations(G)

Require: G = (N, Σ, Π, S) is a CNF grammar with no redundant rules.
Let D = ∅
Let O = CalcObservations(G)
for all w ∈ O do
  Add a single derivation of the form "S ⇒* w" to D
end for
return D

Lemma 3. Let G = (N, Σ, Π, S) be a CNF grammar with no redundant rules. The maximum number of unique non-terminals that can appear in CalcDerivations(G) is given by the following equation, where |O_j| is the length of the j-th sentence in O = CalcObservations(G):

    ((Σ_j (2 × |O_j| − 2)) + 1)    (1)


Proof. According to [24], for any CNF grammar, all derivation trees of strings of length n have 2 × n − 1 interior nodes. The non-terminal attached to the root is known to be the start symbol (S). The maximum number of unique non-terminals occurs when all other non-terminals differ, which corresponds to 2 × n − 2 non-terminals other than the start symbol for each derivation of a sentence of length n. Summing over all sentences and adding one for the start symbol gives Equation 1.
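As a quick illustration of Equation 1 (a helper of our own, not from the paper's tooling), the bound can be computed directly from the observed sentence lengths:

```python
def max_nonterminals(sentence_lengths):
    """Equation 1: an upper bound on the number of unique non-terminals in the
    derivations of the observed sentences. Each CNF derivation of a sentence
    of length n has 2*n - 1 interior nodes, of which the root is always the
    start symbol S, leaving at most 2*n - 2 fresh non-terminals per sentence."""
    return sum(2 * n - 2 for n in sentence_lengths) + 1

print(max_nonterminals([1, 2, 3]))   # (0 + 2 + 4) + 1 = 7
```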

Lemma 4. Let L be a context-free language. Let O = CalcObservationsGivenLang(L) (described in Algorithm 6).
Claims.

1. |O| is finite.
2. O ⊆ L.
3. When O is presented to the BruteForceLearner algorithm (labelled as positive examples), the algorithm constructs a finite set of hypothesis languages H such that L ∈ H.
4. After receiving O, the BruteForceLearner algorithm will not construct an alternative set of hypothesis languages.

Proof. According to Lemma 2, CalcObservations(G) is finite, so CalcObservationsGivenLang(L) is finite. It can be seen from the definition of CalcObservations that O ⊆ L. According to Lemma 3, the maximum number of non-terminals required to derive CalcObservations(G) is given by Equation 1. By definition, the function ConstructAllGrammars constructs all possible CNF grammars using the terminal alphabet of CalcObservations(G) and a set of unique non-terminals N where |N| is given by Equation 1. Therefore there is at least one grammar G′ in the set H constructed by ConstructAllGrammars that is identical to G up to the renaming of non-terminals, and hence at least one language in the set of hypothesis languages that is identical to L. All subsequent positive examples observed by the algorithm will be consistent with G′, and all subsequent negative examples will be consistent with G′; therefore the set H will always include L, and the BruteForceLearner algorithm will not construct an alternative set of hypothesis languages.

Algorithm 6 CalcObservationsGivenLang(L)

Require: L is a context-free language.
Let G = (N, Σ, Π, S) be a CNF grammar such that L(G) = L. {According to Lemma 1 we know that G exists.}
Remove all redundant rules from G.
return CalcObservations(G).

Lemma 5. Let L be a context-free language. Consider the scenario where the BruteForceLearner algorithm is presented with sentences marked as positive or negative according to L. Let H be a finite set of hypothesis languages such that L ∈ H. Let the BruteForceLearner algorithm contain H as its set of hypothesis languages. Let O = CalcAdditional(H, L) as defined in Algorithm 7.
Claims.

1. |O| is finite.
2. Items in O are labelled as positive or negative according to L.
3. If the BruteForceLearner algorithm subsequently receives O, it eliminates all incorrect hypothesis languages from its set of hypothesis languages.
4. After receiving O, the BruteForceLearner algorithm will not construct an alternative set of hypothesis languages.

Proof. From the definition of CalcAdditional it can be seen that the number of samples added to O is at most the number of grammars in H. |H| is finite, therefore |O| is finite. From the definition of CalcAdditional it can be seen that all samples added to O are labelled as positive or negative according to L. For all h_i ∈ H one of the two following cases applies:

– L(h_i) = L, or
– L(h_i) ≠ L.

Case 1. L(h_i) = L.
In this case all subsequent positive examples observed by the algorithm will be consistent with h_i, and all subsequent negative examples observed by the algorithm will be consistent with h_i; therefore the grammar h_i will not be deleted from the set H.

Case 2. L(h_i) ≠ L.
In this case one of the two following sub-cases applies.


– L(h_i) ⊂ L, or
– L(h_i) ⊄ L.

Case 2.1. L(h_i) ⊂ L.
In this case, from the definition of CalcAdditional, it is known that O contains a sentence w such that w ∈ L but w ∉ L(h_i), marked as a positive example. On receipt of this sentence the BruteForceLearner algorithm will eliminate h_i from the set of hypothesis grammars.

Case 2.2. L(h_i) ⊄ L.
In this case, from the definition of CalcAdditional, it is known that O contains a sentence w such that w ∈ L(h_i) but w ∉ L, marked as a negative example. On receipt of this sentence the BruteForceLearner algorithm will eliminate h_i from the set of hypothesis grammars.

The cases above cover all possibilities, and in every case where L(h_i) ≠ L there exists a sentence in O that causes the BruteForceLearner algorithm to eliminate h_i from the set of hypothesis grammars. In all other cases L(h_i) = L and the BruteForceLearner algorithm will never eliminate h_i from the set of hypothesis grammars.

Algorithm 7 CalcAdditional(H, L)

Require: H is a finite set of context-free grammars, L is a context-free language.
Let O = ∅
for all h_i ∈ H do
  if L(h_i) ≠ L then
    if L(h_i) ⊂ L then
      Let w be a sentence such that w ∈ L but w ∉ L(h_i). {Due to the definition of ⊂ it is known that such a sentence exists.}
      Add w to the set O marked as a positive example.
    else if L(h_i) ⊄ L then
      Let w be a sentence such that w ∈ L(h_i) but w ∉ L. {Due to the definition of ⊄ it is known that such a sentence exists.}
      Add w to the set O marked as a negative example.
    end if
  end if
end for
return O

Theorem 6. Let L be a context-free language. Let O = CalcCharacteristic(L) as defined in Algorithm 8.
Claims.

1. |O| is finite.
2. Items in O are labelled as positive or negative according to L.
3. When O is presented to the BruteForceLearner algorithm, the algorithm identifies L exactly.
4. After receiving O, the BruteForceLearner algorithm will not construct an alternative set of hypothesis languages.

Proof. According to Lemma 4, after the BruteForceLearner algorithm receives the sentences in O constructed by CalcObservationsGivenLang, it constructs a set of hypothesis languages H such that L ∈ H. This is the same set of hypotheses passed to the function CalcAdditional within CalcCharacteristic. Therefore, according to Lemma 5, after receiving the remaining sentences in O the BruteForceLearner algorithm eliminates all incorrect hypothesis languages from its set of hypothesis languages, and will not construct an alternative set of hypothesis languages. That |O| is finite, and that items in O are labelled according to L, can be derived from the definition of CalcCharacteristic, Lemma 4 and Lemma 5.

Algorithm 8 CalcCharacteristic(L)

Require: L is a context-free language.
Let O+ = CalcObservationsGivenLang(L)
Let O− = { }
Let H = PruneGrammars(O+, O−)
Let C = O+ followed by CalcAdditional(H, L) {Note C is an ordered list.}
return C

Theorem 7. Let L be a context-free language. Let O = CalcObservationsGivenLang(L). Let C = CalcCharacteristic(L).

Claim: On receipt of C the total number of hypothesis grammars considered by the algorithm is given by the following equation, where T(O) is the set of terminal symbols in O:

    2^(1 + ((Σ_j (2×|O_j|−2)) + 1)^3 + ((Σ_j (2×|O_j|−2)) + 1) × |T(O)|)    (2)


Proof. From the definition of ConstructAllGrammars it can be seen that the number of hypothesis grammars constructed by the function when presented with O is given by Equation 2. It can also be seen that if Q1 ⊆ Q2 then ConstructAllGrammars(Q1) ⊆ ConstructAllGrammars(Q2). From the definition of the BruteForceLearner algorithm it can be seen that although the function ConstructAllGrammars may be called numerous times, each call is made with a superset of the sentences of the previous call. According to Theorem 6, the last call to ConstructAllGrammars is made with O. Therefore Equation 2 gives the total number of hypothesis grammars considered by the BruteForceLearner algorithm.
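Equation 2 grows so quickly that it is only practical to work with its base-2 logarithm; the following sketch (an illustration of our own, not the competition's scripts) computes log2 of Equation 2 from the observation lengths and alphabet size:

```python
def log2_eq2(sentence_lengths, n_terminals):
    """log2 of Equation 2: the hypothesis space is 2^(1 + m^3 + m*|T(O)|),
    where m = (sum_j (2*|O_j| - 2)) + 1 is the bound of Equation 1."""
    m = sum(2 * n - 2 for n in sentence_lengths) + 1
    return 1 + m ** 3 + m * n_terminals

# Even three sentences of lengths 1-3 over two terminals already give a
# hypothesis space of 2^358 grammars.
print(log2_eq2([1, 2, 3], 2))   # 358
```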

Lemma 8. If the BruteForceLearner algorithm correctly identifies a context-free language, it does so in time exponential in the size of the target language.

Proof. Let S1 be the number of rules of the smallest CNF context-free grammar that can be constructed to represent the target language, and let S1 be a measure of the size of the target language. Let T be the time taken by the BruteForceLearner algorithm to correctly identify the target language. Let S2 be the number of candidate rules that the function ConstructAllGrammars considers when constructing its set of hypothesis grammars. If the algorithm has correctly identified the target language then S2 ≥ S1. The number of grammars created by ConstructAllGrammars is 2^S2. If the time taken to construct each of these grammars is t2, then T > t2 × 2^S2 ≥ t2 × 2^S1.

At this point we have a method of constructing a set of positive and negative examples from any context-free language, such that we know it is possible to infer the language exactly from those training examples (although not necessarily in polynomial time). We also know the size of the hypothesis space that would be considered by at least one grammatical inference algorithm that can do this, and the size of this hypothesis space can be used as a metric of the complexity of the grammatical inference task.

If the target language is described using a deterministic grammar in CNF, we know that it does not contain any redundant rules, and we know that all derivations of any sentence generated using a given rule R must use R. This is known because each sentence has only one derivation. Therefore, given any deterministic context-free grammar G in CNF, we can construct a set of sentences equivalent to CalcObservations(G) simply by generating, for each rule in G, one sentence derived using that rule. It can be shown by constructive proof that this technique can be used even if G is not in CNF. Therefore we have a way of constructing CalcObservations(G) in time polynomial in the number of rules in the target grammar. This technique was used in the Omphalos competition to construct a portion of the training sentences from each deterministic target grammar. A similar technique was also used to construct a set equivalent to CalcObservations(G) from the non-deterministic grammars used in the competition. This was possible because the non-deterministic grammars were created by adding non-deterministic constructs to deterministic grammars.
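For an unambiguous grammar this construction is straightforward. The sketch below (with helper names of our own, not the competition's generator) computes a shortest terminal yield for every symbol and a shortest sentential context for every non-terminal, then emits one sentence per rule:

```python
def shortest_yields(rules, terminals):
    """Shortest terminal string derivable from each grammar symbol."""
    best = {t: t for t in terminals}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if all(s in best for s in rhs):
                cand = "".join(best[s] for s in rhs)
                if lhs not in best or len(cand) < len(best[lhs]):
                    best[lhs] = cand
                    changed = True
    return best

def shortest_contexts(rules, start):
    """For each reachable non-terminal A, a short pair of terminal strings
    (left, right) such that left + A + right is derivable from the start."""
    nts = {lhs for lhs, _ in rules}
    symbols = {s for _, rhs in rules for s in rhs}
    best = shortest_yields(rules, symbols - nts)
    ctx, changed = {start: ("", "")}, True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs not in ctx or not all(s in best for s in rhs):
                continue
            for i, sym in enumerate(rhs):
                if sym in nts and sym not in ctx:
                    left = ctx[lhs][0] + "".join(best[s] for s in rhs[:i])
                    right = "".join(best[s] for s in rhs[i + 1:]) + ctx[lhs][1]
                    ctx[sym] = (left, right)
                    changed = True
    return best, ctx

def one_sentence_per_rule(rules, start):
    """One sentence per rule; for an unambiguous grammar, the (unique)
    derivation of each sentence uses the rule it was generated from."""
    best, ctx = shortest_contexts(rules, start)
    return [ctx[lhs][0] + "".join(best[s] for s in rhs) + ctx[lhs][1]
            for lhs, rhs in rules]

# S -> ( S ) | a  generates a, (a), ((a)), ...
rules = [("S", ("(", "S", ")")), ("S", ("a",))]
print(one_sentence_per_rule(rules, "S"))   # ['(a)', 'a']
```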

The technique for constructing the set of sentences CalcCharacteristic(L) described in the proofs above, however, cannot be executed in polynomial time, because it uses the function PruneGrammars, which cannot be executed in polynomial time. Therefore, if the complexity of the learning tasks were small enough to enable the set CalcCharacteristic(L) to be constructed, the learning tasks would be too easy. Instead we made use of the following theorem, which can be informally interpreted to mean that on receipt of a sufficiently large set of unique sentences tagged according to a target language L, the BruteForceLearner will infer L exactly. This theorem can also be formally stated as "the BruteForceLearner can identify any context-free language in the limit from positive and negative data" [1].

Theorem 9. On receipt of an infinite stream of unique sentences correctly labelled as being either an element or not an element of a target context-free language L, at some finite time the BruteForceLearner algorithm will contain a set of hypothesis grammars H such that for all h_i ∈ H: L(h_i) = L, and after that point the algorithm will not change its hypotheses.

Proof. According to Lemma 4 there exists at least one finite set of sentences O ⊆ L, labelled as positive examples, that when presented to the BruteForceLearner algorithm will cause it to construct a set of hypothesis languages H such that L ∈ H, and upon receipt of O the BruteForceLearner algorithm will not construct an alternative set of hypothesis languages. Therefore at some finite point in time the algorithm will receive a set of sentences O4 such that O ⊆ O4. At that point the algorithm will construct a set of hypothesis languages H such that L ∈ H, and upon receipt of these sentences the algorithm will not construct an alternative set of hypothesis grammars.

According to Lemma 5 there exists at least one additional set of positive and negative training sentences O2 that, when presented to the BruteForceLearner algorithm after the presentation of O4, causes it to eliminate all incorrect hypothesis languages from its set of hypothesis languages. Therefore at some finite point in time the algorithm will observe a set of positive and negative sentences O5 such that O2 ⊆ O5; after that time the algorithm will have eliminated all incorrect hypotheses from its set of hypothesis grammars, and it will not change its set of hypotheses.

3.1. Updated Theory

Theorem 9 would be more useful if it were known how many additional samples need to be created so that it can be guaranteed that the problems can be solved by at least one grammatical inference algorithm. At the time of the creation of the training sets for the Omphalos competition, there was no known procedure for determining how many samples, in addition to the set CalcObservations, should be produced. When setting the competition tasks, a decision was made to make the number of additional samples between 10 and 20 times the number of sentences in the set CalcObservations(G). This number was somewhat arbitrary, based upon past experience and instinct. In hindsight it appears to have been sufficient.

Since the completion of the competition, however, some improvements have been made to the theory, and we can now generate a conservative estimate of the number of sentences that should be created for each competition task, using PAC learning theory. After receipt of the set CalcObservations(G) the BruteForceLearner algorithm has a finite hypothesis space of known size. Therefore we can use the following equation, which provides an upper bound on the number of additional labelled samples m sufficient for any consistent learner (and therefore the BruteForceLearner) to probably (with probability 1 − δ) approximately (within error ε) infer the correct language [25]:

    m ≥ (1/ε)(ln |H| + ln(1/δ))    (3)

The number of samples returned by Equation 3, using Equation 2 to calculate |H|, is very large indeed. For instance, if Equation 3 is used with |H| defined by Equation 2 for the simplest grammar in the competition, with |H| = 2^(1.1×10^9) and ε = δ = 0.01, at least 7.62 × 10^10 examples would be required in addition to the 16 examples in O. (The target grammar had 23 rules.) In actuality the problem was solved by three different competitors with only 266 positive and 802 negative examples.
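The 7.62 × 10^10 figure can be reproduced directly from Equation 3; in this sketch of our own, `log2_H` is the base-2 logarithm of |H|:

```python
import math

def pac_samples(log2_H, eps, delta):
    """Equation 3: m >= (1/eps) * (ln|H| + ln(1/delta)), with ln|H|
    computed from the base-2 logarithm of the hypothesis-space size."""
    return (log2_H * math.log(2) + math.log(1 / delta)) / eps

# Simplest competition grammar: |H| = 2^(1.1e9), eps = delta = 0.01.
m = pac_samples(1.1e9, 0.01, 0.01)
print(f"{m:.3g}")   # about 7.62e+10, the figure quoted above
```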

As a result, the BruteForceLearner was examined again to see whether the size of the hypothesis space could be drastically reduced while preserving the identification-in-the-limit properties. The intuition that led to the definition of the BruteForceLearner algorithm was that, given a set of sentences like the one generated by the function CalcObservations, there are a finite number of unlabelled derivation trees that can be assigned to those sentences, and a finite maximum number of non-terminals that could be used by those derivations. On receipt of these sentences, however, the BruteForceLearner algorithm would not know when it had received a superset of CalcObservations; rather, it would take a guess at the number of non-terminals required, and only construct a larger hypothesis set when no grammar in the current hypothesis set was consistent with the training data. An alternative strategy is to use a variable n to denote a guess of the number of non-terminals in the target grammar. Whenever a new hypothesis set needs to be constructed, the algorithm could simply increment n and construct all CNF grammars with exactly n non-terminals. With a slight modification to the proofs listed earlier in this document, it can be seen that the algorithm will at some point create a hypothesis space that is sufficient, while considering the following number of hypothesis grammars:

    2^(1 + n^3 + n × |T(O)|)    (4)


where n is the number of non-terminals in the smallest CNF grammar that can represent the target language. The set of additional sentences could then be created using the function CalcAdditional with this smaller set of hypothesis grammars. It should be noted that in this case not only would creating this number of grammars be sufficient, but the target language could not be identified with a hypothesis of fewer than n non-terminals. For this reason Equation 4 is a better measure of the complexity of the grammatical inference task than Equation 2. For the sake of consistency, however, we will continue to use Equation 2 in this paper as the measure of the complexity of a grammatical inference task, although in future competitions Equation 4 may be used.
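For comparison, log2 of Equation 4 depends only on the number n of non-terminals in the smallest CNF grammar and the alphabet size (again an illustrative helper, not competition code):

```python
def log2_eq4(n, n_terminals):
    """log2 of Equation 4: 2^(1 + n^3 + n*|T(O)|) grammars, where n is the
    number of non-terminals in the smallest CNF grammar for the target."""
    return 1 + n ** 3 + n * n_terminals

# A hypothetical target with 10 non-terminals over 5 terminals: 2^1051
# hypotheses, far smaller than the count Equation 2 yields when the
# non-terminal bound is taken from sentence lengths instead.
print(log2_eq4(10, 5))   # 1051
```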

It is still possible for the BruteForceLearner to consider an even smaller set of hypothesis grammars and still identify all context-free grammars in the limit. Constructing this smaller set would most likely result in even longer execution times, but considering this smaller hypothesis set is useful for determining the number of additional samples that should be passed to the BruteForceLearner algorithm in addition to those constructed using CalcObservations. The BruteForceLearner algorithm can consider a smaller set of grammars because Chomsky Normal Form is not a unique normal form; many of the grammars in the set of hypothesis grammars therefore describe the same language. This can lead to problems even when using Equations 3 and 4 to estimate the number of additional samples required, because the value returned by Equation 3 is close to the lower bound on the number of additional labelled samples only when the number of incorrect hypotheses in H is close to |H|. That is only the case when every grammar in H describes a unique language, which is not so here. For instance, consider any grammar in H that has r non-terminals (including the start symbol) out of a possible n non-terminals (including the start symbol). The number of other grammars in H that are isomorphic to it (identical up to the renaming of non-terminals) is the number of permutations of (r − 1) symbols chosen from (n − 1), minus one, i.e. ((n−1)!/(n−1−(r−1))!) − 1 = ((n−1)!/(n−r)!) − 1. In addition, because CNF is not a unique normal form, many of the grammars will describe the same language even if they are not isomorphic. For instance, all grammars that contain useless rules describe the same language as the grammar with those useless rules removed (which also exists in H). Many instances of left recursion can also be replaced by an instance of right recursion without changing the language described by the grammar. The hypothesis space will even include some grammars that cannot generate the training examples from which the hypothesis space was created.

As an alternative to using the ConstructAllGrammars function described earlier, the BruteForceLearner algorithm could construct a set of hypothesis grammars in which no grammar is isomorphic to another, the start symbol exists in every grammar, and for every terminal symbol a there is at least one rule of the form N → a. It can be shown that the size of this set of grammars is

    Σ_{n=1}^{N} Σ_{r=1}^{n} (r × 2^(r^2−1) × T × 2^(r−1) × 2) / (r−1)!

This can be approximated using the following equation:

    2^(log2 N + N^2 + log2 T + N + 1 − Σ_{x=1}^{N−1} log2 x)    (5)

Table 1 gives the estimated number of additional samples required using Equations 3 and 4, and Equations 3 and 5. The table also gives an indication of the reduction in |H| using the different techniques described in this section (i.e. columns 2 to 5).

4. Constructing the Competition Tasks

This section of the document describes the mechanism by which the competition tasks were constructed. First we discuss the way in which the complexity of the competition tasks was chosen. Then the method by which the grammars were constructed is discussed, followed by the way in which the training and testing examples were created.

4.1. Setting the complexity of the competition tasks

As one of the major motivations of the Omphalos competition was to promote the development of new grammatical inference algorithms, it was desirable that the learning tasks in the competition were sufficiently hard so as to be just outside the state-of-the-art. Ideally, the best of the current systems should be able to solve the simplest problems, but not the more difficult ones. Omphalos should give an idea of where the current state-of-the-art is, and try to push the boundary from there.

Table 1
Estimates of the required number of additional training sentences using different equations

Problem | log2 complexity (Eq 2 | Eq 4 | Eq 5) | Size of training set | Theoretical size required (Eq 4 & 3 | Eq 5 & 3)
2 & 6.2 | 7.12 × 10^8 | 2.54 × 10^5 | 3.76 × 10^3 | 280 | 7.66 × 10^6 | 113,239
4 & 6.4 | 1.13 × 10^10 | 5.68 × 10^6 | 3.08 × 10^4 | 418 | 1.71 × 10^8 | 927,348

As discussed earlier in this paper, Equation 2 was considered to be a suitable measure of the complexity of a grammatical inference task. To apply this equation, one begins with a grammar defining the target language. A set of positive examples is then generated from the grammar using a function that approximates CalcObservations. The number of terminal symbols in this set is calculated, and finally Equation 2 is applied.

In order to estimate the complexity of problems that could be solved by state-of-the-art algorithms, the literature was reviewed to identify pre-existing benchmark grammatical inference problems. The work in [26] and [27] identified some benchmark problems, i.e. grammars that can be used as a standard to test the effectiveness of a GI system by trying to learn them. The grammars were taken from [28] and [24]. Using Equation 2, the complexities of these grammars were calculated. Descriptions of these grammars and their complexity measures are listed in Table 2. Using these results, the complexities of the target grammars of the competition problems were selected. Once again based upon instinct, it was decided to set the complexity of the easiest competition problem at around 2^(10^9), despite the fact that the complexity of the most difficult problem listed in Table 2 was only around 2^(10^5).

Table 3 lists the complexities of the Omphalos competition problems. It can be seen that the competition included several grammatical inference tasks that were ranked according to difficulty. The ranking of these problems was based upon

three axes of difficulty in grammatical inference, specifically:

1. the complexity of the underlying grammar (as defined by Equation 2),
2. whether or not negative data is available, and
3. how similar the negative examples in the test set are to positive examples of the language; for instance, whether or not the test set includes sentences that can be generated by regular approximations to the target language but not by the target language itself.

At the close of the competition, the competitor that had solved the highest-ranking problem was declared the overall winner of the competition.

The reason why the complexity measure defined using Equation 2 was not used as the sole metric for ranking the problems was to encourage participation from as many researchers as possible. For instance, the ranking methodology enabled algorithms that use positive data only to compete with algorithms that use both positive and negative data. Specifically, the successful inference of a target language from positive data only was ranked higher than the successful inference of a language from positive and negative data, all other things being equal. For instance, problem 2 is ranked higher than problem 1.

During the lifetime of the competition, the third axis of difficulty was introduced when it became clear that some of the original test data did not accurately determine whether or not a competitor had inferred the target language or merely a regular approximation to it. Correctly labelling a test set in which the negative sentences closely resembled the positive sentences was ranked higher than correctly labelling a test set whose negative examples differed greatly from the positive examples. For instance, problem 6.1 is ranked higher than problem 1 and even problem 6.


Table 2
Benchmark grammatical inference problems.

#  | Description                        | Example phrase                                          | log2 compl.  | Properties
1  | (aa)*                              | aa, aaaa, aaaaaa                                        | 1.34 × 10^3  | Regular
2  | (ab)*                              | ab, abab, ababab                                        | 2.22 × 10^3  | Regular
3  | Operator precedence (small)        | (a+(b+a))                                               | 9.37 × 10^3  | Not regular
4  | Parentheses                        | (), (()), ()(), ()(())                                  | 2.22 × 10^3  | Not regular
5  | English verb agreement (small)     | that is a woman, i am there                             | 5.33 × 10^5  | Finite
6  | English lzero grammar              | a circle touches a square, a square is below a triangle | 4.17 × 10^6  | Finite
7  | English with clauses (small)       | a cat saw a mouse that saw a cat                        | 6.59 × 10^5  | Not regular
8  | English conjugations (small)       | the big old cat heard the mouse                         | 9.13 × 10^5  | Regular
9  | Regular expressions                | ab*(a)*                                                 | 9.39 × 10^3  | Not regular
10 | {ω = ω^R, ω ∈ {a,b}{a,b}+}         | aaa, ba                                                 | 6.90 × 10^4  | Not regular
11 | Number of a's = number of b's      | aabbaa                                                  | 4.29 × 10^4  | Not regular
12 | Number of a's = 2 × number of b's  | aab, babaaa                                             | 3.01 × 10^5  | Not regular
13 | {ωω : ω ∈ {a,b}+}                  | aba, aa                                                 | 9.12 × 10^4  | Regular
14 | Palindrome with end delimiter      | aabb$, ab$, baab$                                       | 1.18 × 10^5  | Not regular
15 | Palindrome with center mark        | aca, abcba                                              | 4.96 × 10^3  | Not regular
16 | Even-length palindrome             | aa, abba                                                | 9.30 × 10^3  | Not regular
17 | Shape grammar                      | da, bada                                                | 2.45 × 10^4  | Not regular

All other things being equal, solving a problem with a higher complexity measure was ranked higher than solving one with a lower complexity measure. For instance, problem 3 is ranked higher than problem 1.

In addition, it was decided to make the highest-ranking problems non-deterministic languages, while the lowest-ranking problems were deterministic languages. There were two reasons for this. Firstly, it was noted by [27] that the inference of non-deterministic languages is a more difficult task than the inference of deterministic languages; therefore solving the problems that involved non-deterministic languages was ranked higher than solving those that involved deterministic languages. Secondly, deterministic languages form a useful subset of context-free languages that is a superset of the regular languages. There is a large amount of background knowledge on deterministic languages that might be useful for their inference but that, at the time of setting the competition, was not being applied to the task of grammatical inference. It was hoped that the focus on the inference of deterministic context-free languages might be a catalyst for advances in grammatical inference.

Lastly, it should be noted that the Omphalos competition employed an interactive technique in setting the complexity of the grammatical inference tasks. Although an initial set of six problems was created, the competition was arranged so that if these problems were solved early in the life of the competition, additional, more difficult problems could be added. Similarly, if after an extended amount of time the competition tasks were not solved, easier problems could be added to the list of competition tasks. By setting the difficulty of the competition tasks in this way, we could get a feel for the complexity of grammatical inference problems that can be solved by state-of-the-art algorithms.

4.2. Creation of the Target Grammars

After examining the complexities of the languages listed in Table 2, it was decided that the complexity of the easiest tasks (tasks 1 and 2) should be in the order of 2^(10^9). This complexity measure was chosen for three reasons. Firstly, it was a few orders of magnitude higher than that of the tasks listed in Table 2. Secondly, problems of this complexity could not be learnt using a brute-force technique (such as the BruteForceLearner algorithm) in time to be entered into the Omphalos competition. Thirdly, based upon published execution times of existing algorithms, it was estimated that these problems could not be solved using existing published techniques in time to be entered into the competition. This decision was made to encourage the development of new grammatical inference techniques.

Table 3
Complexities of benchmark inference problems in the Omphalos competition.

Problem | Training data         | Properties                      | log2 compl.
1       | Positive and negative | Not regular, deterministic      | 1.10 × 10^9
2       | Positive only         | Not regular, deterministic      | 7.12 × 10^8
3       | Positive and negative | Not regular, deterministic      | 1.65 × 10^10
4       | Positive only         | Not regular, deterministic      | 1.13 × 10^10
5       | Positive and negative | Not regular, non-deterministic  | 5.46 × 10^10
6       | Positive only         | Not regular, non-deterministic  | 6.55 × 10^10
6.1     | Positive and negative | Not regular, deterministic      | 1.10 × 10^9
6.2     | Positive only         | Not regular, deterministic      | 7.12 × 10^8
6.3     | Positive and negative | Not regular, deterministic      | 1.65 × 10^10
6.4     | Positive only         | Not regular, deterministic      | 1.13 × 10^10
6.5     | Positive and negative | Not regular, non-deterministic  | 5.46 × 10^10
6.6     | Positive only         | Not regular, non-deterministic  | 6.55 × 10^10
7       | Positive and negative | Not regular, deterministic      | 5.88 × 10^11
8       | Positive only         | Not regular, deterministic      | 1.63 × 10^11
9       | Positive and negative | Not regular, non-deterministic  | 1.08 × 10^12
10      | Positive only         | Not regular, non-deterministic  | 9.92 × 10^11

It was decided that higher-ranking problems should have a complexity measure at least an order of magnitude greater than similar tasks lower in the ordering, to ensure that they truly were more difficult tasks. Once the form and complexities of the target grammars were decided upon, the grammars were created as follows:

1. The number of non-terminals, terminals and rules was estimated based upon the number of non-terminals, terminals and rules in the benchmark problems (Table 2).

2. A set of terminals and non-terminals was created. Rules were then created by randomly selecting terminals and non-terminals. A fixed number of rules were created to contain only terminal strings.

3. Useless rules were then identified. If a non-terminal could not generate a terminal string, a terminal rule was added to it. If a non-terminal was not reachable from the start symbol, rules were added to make it reachable. For instance, if the non-terminal N was unreachable from the start symbol, a rule was created with the start symbol on the left-hand side of the rule and N on the right-hand side of the rule.

4. Additional rules were added to ensure that the grammar did not represent a regular language. Specifically, rules containing center recursion were added.

5. The complexity of the grammar was calculated, and if it was not in the desired range the grammar was deleted.
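The five-step procedure above can be sketched as follows. This is an illustrative reconstruction, not the organizers' actual implementation; the function name, symbol naming scheme, and the choice to give the first few rules purely terminal right-hand sides are assumptions.

```python
import random

def generate_grammar(n_nts, n_ts, n_rules, rng, max_rhs=4):
    """Sketch of steps 1-4: random rules, then repairs for useless symbols."""
    nts = [f"N{i}" for i in range(n_nts)]          # non-terminals; N0 is the start
    ts = [chr(ord("a") + i) for i in range(n_ts)]  # terminals
    start = nts[0]
    rules = []
    # Step 2: random rules; the first few have purely terminal right-hand sides.
    for i in range(n_rules):
        pool = ts if i < n_nts else ts + nts
        rhs = [rng.choice(pool) for _ in range(rng.randint(1, max_rhs))]
        rules.append((rng.choice(nts), rhs))
    # Step 3a: if a non-terminal cannot generate a terminal string, give it a terminal rule.
    generating, changed = set(), True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs not in generating and all(s in ts or s in generating for s in rhs):
                generating.add(lhs)
                changed = True
    rules += [(nt, [rng.choice(ts)]) for nt in nts if nt not in generating]
    # Step 3b: if a non-terminal is unreachable, add a rule start -> nt.
    reachable, changed = {start}, True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs in reachable:
                for s in rhs:
                    if s in nts and s not in reachable:
                        reachable.add(s)
                        changed = True
    rules += [(start, [nt]) for nt in nts if nt not in reachable]
    # Step 4: add center recursion (start -> a start b) so the language is not regular.
    a, b = rng.sample(ts, 2)
    rules.append((start, [a, start, b]))
    return start, rules
```

Step 5 (computing the complexity and discarding grammars outside the desired range) would wrap a call like this in a rejection loop.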

Tests were then undertaken to ensure that grammars 1–4 represented deterministic languages. Specifically, LR(1) parse tables were constructed from the grammars using Bison [29]. To ensure that grammars 5 and 6 represented non-deterministic languages, rules were added to the target grammars. After problem 6 was solved, problems 7 to 10 were added to the competition. Problems 6.1 to 6.6 were added later in the competition, when it appeared that problems 7 to 10 would not be solved during the life of the competition.


4.3. Construction of training and testing sets.

Once the target grammars were constructed, sets equivalent to CalcObservations were constructed for each grammar. Sets of positive examples were then created using the GenRGenS software [30].

For the first six problems, additional examples were then created by randomly generating sentences of length up to five symbols more than the length of the longest sentence in CalcObservations(G). These sentences were then parsed using the target grammar and were labelled as being either in or out of the target language. This set of sentences was then randomized and split into testing and training sets, but in such a way as to ensure that the training set contained a superset of CalcObservations(G). For those problems that were to be learnt from positive data only, the training sets had all negative examples removed.

For problems 6.1 to 10, a more rigorous method of constructing negative data was used, as follows:

– For each context-free grammar, a regular grammar was constructed using the superset approximation method based on Recursive Transition Networks (RTNs) described in [31]. Sentences that could be generated from this regular approximation to the target language, but not by the target grammar itself, were included as negative data. These sentences were included to distinguish between competitors who had created regular approximations to the underlying context-free languages and competitors who had identified a non-regular language.

– A context-free grammar that defined a superset of the target language was constructed by treating the target grammar as a string-rewriting system and normalizing the right-hand sides of rules using the normalization algorithm described in [32]. That is, a context-free grammar was constructed that described a language that was a superset of the target language, but in which the right-hand side of each rule could not be parsed by the right-hand side of any other rule. Negative examples were then created from this approximation to the target language.

– Each target grammar in the competition included some deliberate constructs designed to trick grammatical inference algorithms. For instance, most included sequences that were identical to center recursion expanded to a finite depth. An example is a^n b^n where n < m, and m is an integer > 1. To ensure that the training and testing examples tested the ability of the inferred grammars to capture these nuances, the target grammars were copied and hand-modified, changing the a^n b^n where n < m to become a^n b^n where n > 1. In addition, where center recursion of the form a^n b^n existed in the target grammar, the regular approximation a*b* was included in the "tricky" approximation to the target grammar. Negative examples were then created from these approximations to the target grammar and added to the test set.

– There were an equal number of positive and negative examples in the test sets.

In addition, for problems 6.1 to 10:

– The longest training example was shorter than the longest test example.

– The grammar rules for problems 7 to 10 were shorter than for problems 1 to 6.6 and had more recursion. Some non-LL(1) constructs were also added to the target grammars for problems 7 to 10.
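The role of the "tricky" center-recursion constructs described above can be illustrated with a toy version of the a^n b^n language. The function names below are illustrative only; the point is that strings in the regular approximation but outside the target language make good negative test examples, because a learner that only found a*b* will mislabel them.

```python
import re

def in_target(s):
    """Membership in the centre-recursive language a^n b^n, n >= 1."""
    n = len(s) // 2
    return n >= 1 and s == "a" * n + "b" * n

def in_regular_approx(s):
    """Membership in the regular superset approximation a*b*."""
    return re.fullmatch(r"a*b*", s) is not None

# "aab" is in the regular approximation but not in the target language,
# so it exposes a learner that has only inferred a*b*.
print(in_target("aaabbb"))       # True
print(in_target("aab"))          # False
print(in_regular_approx("aab"))  # True
```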

4.4. Competition Infrastructure

The Omphalos competition reused a number of pieces of software from the Abbadingo competition [20] and the Gowachin server, including the Oracle that was consulted by competitors to determine whether or not they had successfully solved the competition tasks. To enter the competition, competitors needed to:

1. Download the training data and test datafrom http://www.irisa.fr/Omphalos/.

2. Train their system using the training data.

3. Apply their system to the test data, resulting in a list of 1's and 0's, indicating that each sentence in the test set is either in or not in the language (respectively).

4. Submit the list of 1's and 0's to the Omphalos "Oracle" using an HTML form on the website.

5. The Oracle would then notify the competitor that they were either successful or unsuccessful.


The decision to provide only a pass or fail to competitors was a deliberate design choice, to prevent that information from being used by competitors for hill climbing (test-set tuning). In addition, any competitor who had deliberately consulted the Oracle more than 25 times in one day would have been disqualified.

5. Results of Omphalos

The Omphalos competition ran for eight months in 2004, from February the 5th until October 11th. During that time the competition data was downloaded from seventy different internet domains, on almost all continents.

The web-site itself received over 1000 visits from over 270 domains.6 There were 139 label submissions by 8 contestants to the Oracle. Of the 16 competition problems, 8 were solved during the life of the competition, and one other was solved after the competition had closed. Most importantly, the winner of the competition developed some new techniques that have pushed the state-of-the-art in grammatical inference. Table 4 contains details of key dates in the competition. Alexander Clark was declared the winner of the competition on October 1st. A detailed journal article on his winning techniques has been submitted [33].

If the number of different domains from which the competition data was downloaded is indicative of the number of competitors, then the number of competitors (70) is almost 50% more than the number of people who attended ICGI 2004, and greater than the estimated number of individuals who downloaded the Abbadingo competition data. The number of individuals who actually submitted labellings to the Oracle, however, was extremely low. From discussions during the Omphalos session at ICGI 2004, email correspondence and published comments [34], the reason for this rate of submission was twofold. Predominantly, many competitors stated that they ran their existing grammatical inference systems on the competition data and, using cross-validation, decided that the performance of their algorithms on this data was too poor to warrant the submission of a test set to the Oracle. Other researchers reported that due to other commitments

6 Attempts were made to discard crawlers and bot hits.

they were unable to spend enough time on the tasks. It should be noted that all machine learning algorithms contain learning bias, as does the method by which the target grammars were generated. Nevertheless, because of the high participation rate and the complexity of the problems solved, we consider it can be concluded that the winning algorithm is world class and that the complexity of the solved problems represents the limit of problems that can be solved using state-of-the-art algorithms.

5.1. Problems 1, 3, 4, 5, and 6

Problem 1 was solved by Joan Andreu Sanchez from the Departament de Sistemes Informatics i Computacio, Universitat Politecnica de Valencia. Although Sanchez originally tried to solve the problem using the Inside-Outside algorithm, he actually solved it manually. After discovering the regularities in the positive examples, he used a regular expression constructed by hand to classify the test examples as being either positive or negative. Although the target language was not regular, the test sets did not include a single non-regular example. This, in addition to the speed with which the problem was solved, suggests that the first task was overly simple, and that the negative examples were too different from the positive examples to be an accurate test of whether or not the language had been successfully learnt.

Problems 3, 4, 5, and 6 were solved by Erik Tjong Kim Sang from the CNTS - Language Technology Group at the University of Antwerp in Belgium. Tjong Kim Sang used a pattern-matching system that classifies strings based on n-grams of characters that appear either only in the positive examples or only in the negative examples. With the exception of problem 1, this technique alone was not sufficient to solve these problems, so Tjong Kim Sang generated his own negative data, using the principle that the majority of randomly generated strings would not be contained within the language. His software behaved as follows:

1. Firstly, it loaded the positive examples from the training file.

2. It then generated an equal number of unseen random strings, and added these to the training data as negative examples.


Table 4
Competition time line.

Date                 Event
February 15th 2004   Competition begins
February 20th 2004   Problem 1 solved by Joan Andreu Sanchez
March 22nd 2004      Problems 3, 4, 5 and 6 solved by Erik Tjong Kim Sang
April 14th 2004      Problems 7, 8, 9 and 10 added
June 7th 2004        New larger testing sets added for problems 1 to 6
September 6th 2004   Problems 2 and 6.2 solved by Alexander Clark
September 7th 2004   Problem 6.4 solved by Alexander Clark
October 1st 2004     Competition closed
October 1st 2004     Alexander Clark declared the winner of the competition
October 11th 2004    Omphalos session at ICGI 2004

3. An n-gram classifier was then created as follows: a count was made of n-grams of length 2 to 10 that appeared uniquely in the positive examples or uniquely in the negative examples. A weighted (frequency) count of such n-grams in the test strings was then made. For each sentence in the test set, if the positive count was larger than the negative count then the string was classified as positive; otherwise it was classified as negative. If both counts were zero then that sentence was classified as unknown.

4. Steps 2 and 3 were repeated thirty times.

5. Strings that were classified as positive in all thirty tests were then assumed to be positive examples. Strings that were classified as negative one or more times were classified as negative. Other strings were classified as unknown.
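The core of Tjong Kim Sang's procedure (steps 1 to 3) can be sketched as follows. The token-level n-grams, helper names and length parameters here are assumptions made for illustration, not details of his actual software.

```python
import random
from collections import Counter

def ngrams(sentence, nmin=2, nmax=10):
    """Yield token n-grams of length nmin..nmax from a space-separated sentence."""
    toks = sentence.split()
    for n in range(nmin, min(nmax, len(toks)) + 1):
        for i in range(len(toks) - n + 1):
            yield tuple(toks[i : i + n])

def train(positives, alphabet, rng):
    # Step 2: random unseen strings stand in for negative examples, on the
    # assumption that most randomly generated strings fall outside the language.
    negatives, seen = set(), set(positives)
    while len(negatives) < len(positives):
        s = " ".join(rng.choice(alphabet) for _ in range(rng.randint(2, 10)))
        if s not in seen:
            negatives.add(s)
    pos = Counter(g for s in positives for g in ngrams(s))
    neg = Counter(g for s in negatives for g in ngrams(s))
    # Step 3: keep only n-grams unique to one class.
    only_pos = {g: c for g, c in pos.items() if g not in neg}
    only_neg = {g: c for g, c in neg.items() if g not in pos}
    return only_pos, only_neg

def classify(sentence, only_pos, only_neg):
    """Weighted counts of class-unique n-grams decide the label."""
    p = sum(only_pos.get(g, 0) for g in ngrams(sentence))
    n = sum(only_neg.get(g, 0) for g in ngrams(sentence))
    if p == 0 and n == 0:
        return "unknown"
    return "positive" if p > n else "negative"
```

Steps 4 and 5 would repeat train/classify thirty times with fresh random negatives, labelling a string positive only when every run agrees.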

5.2. Problems 6.2 and 6.4

Problems 6.2 and 6.4 were solved by Alexander Clark from the University of Geneva. Clark's approach was based on a simplified version of the algorithm described in [35] that also made use of some of the properties of deterministic context-free grammars. Clark also used some Alignment-Based Learning techniques [36]. It should be noted that not only did Clark correctly label a large number of test sentences (12,004 and 17,230) after training his algorithm on an extremely small number of positive sentences (281 and 419); at the completion of the competition his grammars were compared to the target grammars and it was found that he had identified the target languages exactly.

The target grammars had 24 and 38 rules respectively.

Clark’s algorithm can best be described as amodel merging technique, whereby constituentsare firstly identified and assigned arbitrary non-terminals. The constituents and sentences are thenreduced using other constituents. This reductionprocess adds structure to the sentences, and re-duces the size of the hypothesis grammar. Non-terminals are then merged, which generalizes thelanguage described by the grammar.

There are two features of Clark's technique that make it distinct from other model-merging techniques. The first of these is the use of mutual information to filter candidate constituents [35]. The second is the use of substitution graphs to identify whether or not constituents can be used interchangeably.

Mutual information is an information-theoretically motivated measure for discovering interesting collocations between different events, in this case the collocations of words before and after constituents.

Consider the alphabet Σ augmented with a distinguished symbol #. For a substring w that occurs in the sentence a w b with a, b ∈ Σ*, we can define l to be the final symbol of a if there is one, or to be # if a = ε. Similarly we define r to be the first symbol of b, or # if b = ε.

Let a(l, w, r) be the number of times that w occurs in the context (l, r).

Let b(l, w) be the number of times l appears before the substring w, i.e. Σ_r a(l, w, r).

Let c(w, r) be the number of times r appears after the substring w, i.e. Σ_l a(l, w, r).

20 B. Starkie et al. / Omphalos

Let d(w) be the number of times that the substring w appears in the training sample, i.e. Σ_{l,r} a(l, w, r).

Clark uses these counts in the following equation to calculate mutual information:

    MI(w) = Σ_{l ∈ Σ∪{#}} Σ_{r ∈ Σ∪{#}} [a(l,w,r) / d(w)] log[ a(l,w,r) d(w) / ( b(l,w) c(w,r) ) ]

Clark claims that substrings that are constituents will tend to have a high value of this quantity, whereas substrings that are not constituents will have a low value of this quantity.

As an example, the counts of the local contexts of the string "q g m t" in the training set of problem 6.4 are given in Table 5. According to Table 5, for instance, the sequence "k q g m t h" appeared 69 times in the training set. From these counts, mutual information is calculated as shown in Table 6. It can be seen that the mutual information of the sequence "q g m t" is 1.00.

Table 5
Counts of all contexts of the sequence "q g m t".

l \ r    h    k    r
k       69   14    0
n        0    0   75

Table 6
Mutual information of the sequence "q g m t".

l   r   a(l, w, r)   b(l, w)   c(w, r)   d(w)    MI
k   h       69          83        69      158   0.41
k   k       14          83        14      158   0.08
n   r       75          75        75      158   0.51
                                        Total   1.00
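The calculation shown in Tables 5 and 6 can be reproduced directly from the counts. The use of base-2 logarithms is inferred from the published values rather than stated in the text; the function name is illustrative.

```python
from math import log2

# a(l, w, r): the non-zero context counts of w = "q g m t" from Table 5
a = {("k", "h"): 69, ("k", "k"): 14, ("n", "r"): 75}

def mutual_information(a):
    d = sum(a.values())  # d(w): total occurrences of w in the training sample
    b, c = {}, {}        # b(l, w) = sum_r a(l, w, r); c(w, r) = sum_l a(l, w, r)
    for (l, r), count in a.items():
        b[l] = b.get(l, 0) + count
        c[r] = c.get(r, 0) + count
    # MI(w) = sum over contexts of (a/d) log2( a d / (b c) )
    return sum(
        (count / d) * log2(count * d / (b[l] * c[r]))
        for (l, r), count in a.items()
    )

print(round(mutual_information(a), 2))  # → 1.0, matching the total in Table 6
```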

Whether or not the mutual information is sufficiently high depends upon the setting of a threshold value in the algorithm. In this case the high value of the calculated mutual information (1.00) suggests that "q g m t" is a true constituent. Thus Clark's algorithm assigned the non-terminal NT16 to it, i.e. NT16 ⇒* q g m t. As a result, the grammar inferred by Clark contains the following three rules.

NT16 → q g m t
NT14 → n NT16 r q
NT4 → k NT16 NT4 o w NT3

These three rules map onto the following three rules in the grammar from which the training examples were generated.

Y3 → q g m t
Y11 → n Y3 r q Y5 x e
Y6 → k Y3 Y6 o w Y9

The identification of constituents enables the introduction of non-terminals into the hypothesis grammar, but for the grammatical inference algorithm to generalise from the training examples, some assumptions need to be made about the substitutability of constituents. A common guessing strategy used by grammatical inference algorithms is that if A →* u, A →* v and xuy ∈ L, then it is assumed that xvy ∈ L. While this holds true for some context-free languages, in general it cannot be assumed. To identify the instances when this guessing strategy should be applied, and when it should not, Clark constructed a data structure that he calls a substitution graph. For each substring u observed in the training sample, a node is constructed in an undirected graph. An arc is then created between two nodes u and v (representing the substrings u and v) if there are strings l and r such that both l u r and l v r are in the training sample. For instance, Figure 3 shows part of such a graph inferred by Clark's algorithm for problem 4. Here we can see that the constituents "q t b u r d b o v", "v k g i o o" and "p i p l d s k" are observed to be fully interchangeable with one another in the training sample. If, in contrast, the constituents "q t b u r d b o v" and "v k g i o o" were interchangeable with one another, and similarly "v k g i o o" and "p i p l d s k" were interchangeable, but "q t b u r d b o v" and "p i p l d s k" were not, then with a sufficiently large training sample a substitution graph similar to that shown in Figure 4 would be constructed.

Clark’s technique can be summarized as follows.

– The algorithm calculated all substrings observed in the training sample using a suffix array. All substrings that occurred more than a predefined number (fmin) of times were considered to be candidate constituents.

– The algorithm then discarded all candidate constituents whose mutual information was less than a predefined number (Mmin).


Fig. 3. Part of a substitution graph showing the observed substitutability of constituents (from [33])

Fig. 4. An alternative substitution graph showing an incomplete subgraph (limited substitutability)

– A substitution graph was then constructed to identify cliques, i.e. which constituents were substitutable with one another.

– Sentences were then reduced using the constituents.
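A minimal version of the substitution graph can be sketched as follows. The representation and function name are assumptions for illustration; Clark's implementation also used suffix arrays and the frequency and mutual-information filters described above.

```python
from collections import defaultdict
from itertools import combinations

def substitution_graph(constituents, sample):
    """Connect two constituents u, v when they share a context (l, r),
    i.e. both 'l u r' and 'l v r' occur as sentences in the sample."""
    contexts = defaultdict(set)  # constituent -> set of (left, right) contexts
    for sentence in sample:
        toks = tuple(sentence.split())
        for u in constituents:
            u_toks = tuple(u.split())
            for i in range(len(toks) - len(u_toks) + 1):
                if toks[i : i + len(u_toks)] == u_toks:
                    contexts[u].add((toks[:i], toks[i + len(u_toks):]))
    # An arc exists wherever two constituents share at least one context.
    return {
        frozenset((u, v))
        for u, v in combinations(constituents, 2)
        if contexts[u] & contexts[v]
    }

# "a" and "b" both occur in the context ("x", "y"), so they are connected;
# "c" occurs only in an unrelated context and remains isolated.
edges = substitution_graph(["a", "b", "c"], ["x a y", "x b y", "z c w"])
```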

Clark made multiple attempts at those problems that he solved. For instance, he made seven attempts at problem 6.2 before solving it. After inferring a grammar from the training data that did not give a correct answer on the test data, Clark assumed that his inferred grammar was too specific, and so he generalized it. He did this by implementing a series of heuristic modifications to the hypothesised grammars. Given two rules in the hypothesis grammar of the form A → UWV and B → W, he sometimes generalised the first rule by replacing it with A → UBV. Conversely, if a rule existed of the form A → UBV such that for every string in the training sample the occurrence of B in this rule is expanded by a particular rule B → w, then he sometimes made the grammar more specific by replacing the first rule with A → UwV. In addition, there were a number of free parameters that could be tweaked. Therefore there appears to still be a certain amount of "gut feel" and "tweaking" involved in solving these problems. This would appear to be a sound strategy for solving complicated tasks like the Omphalos competition, or real-world applications. Clark's advice for a successful strategy is to exploit useful properties that randomly generated grammars are likely to have.
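The first heuristic (replacing an embedded W by B wherever a rule B → W exists) can be sketched as a single rewriting pass. This is an illustrative reconstruction, not Clark's code.

```python
def generalise(rules):
    """For each rule A -> U W V, if another rule B -> W exists, replace the
    embedded occurrence of W by B, turning the rule into A -> U B V."""
    result = []
    for lhs, rhs in rules:
        out = list(rhs)
        for b, w in rules:
            if (b, w) == (lhs, rhs):
                continue  # never reduce a rule by its own right-hand side
            i = 0
            while i + len(w) <= len(out):
                if out[i : i + len(w)] == list(w):
                    out[i : i + len(w)] = [b]
                i += 1
        result.append((lhs, out))
    return result

# A -> u w1 w2 v and B -> w1 w2 yield the generalised rule A -> u B v.
rules = [("A", ["u", "w1", "w2", "v"]), ("B", ["w1", "w2"])]
generalised = generalise(rules)
```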

Although it is known that the class of context-free languages cannot be learnt using positive examples only,7 Clark's algorithm used a range of techniques to prevent it from overgeneralizing, as summarized below.

– The context distributions needed to be similar under a local measure of similarity (MI).

– There needed to exist two strings c and d such that c a d and c b d appeared in the training set (as identified in the substitution graph).

– There should not be any substring of a or b in the set of constituents, except when a is a substring of b.

These techniques enabled Clark to identify the target grammars exactly, despite the lack of any explicit negative examples.

As a result of his involvement in the Omphalos competition and the success of his techniques, Clark believes that with further research he will be able to prove that strict deterministic languages are PAC-learnable when the distributions of the training examples are generated by suitable probabilistic deterministic push-down automata.

Additionally, he envisions that his techniques could be extended into more powerful algorithms, with (string) sample complexity, that could work left-to-right and have provable convergence properties.

6. Conclusions

The main conclusion that can be drawn from the competition is that it has achieved all of the goals that were set for it. The competition has

7 In the identification-in-the-limit scenario.


provided a forum for the comparison of different GI algorithms on a given task. It has brought some existing techniques to the foreground [35] and has resulted in the refinement of these and other techniques. It has also stimulated research into the field in general. For instance, the number of competitors who downloaded the competition data suggests that many non-GI researchers attempted to solve the problems. In particular, four of the eight problems solved during the lifetime of the competition were solved by a competitor (Tjong Kim Sang) who had not published any prior research on grammatical inference. The competition has also defined a method of calculating the complexity of grammatical inference tasks, and because of the participation rate we can conclude that the complexity of the highest-ranked competition problem is indicative of the complexity of GI problems that can be solved using state-of-the-art techniques.

As with all small-scale experiments, when we consider the statistical significance of the results we know that we cannot state these conclusions with certainty. For instance, being able to identify two target grammars exactly does not imply that all deterministic context-free languages can be PAC-learnt using these techniques. Similarly, the fact that some algorithms performed better on these tasks than others does not imply that this will be the case on all target languages, or even the majority of other languages. However, we believe that due to the participation rates and the repeated success of the winning technique, the competition gives valuable insights into the state-of-the-art of the field.

6.1. Lessons Learnt

One important lesson learnt from the competition with respect to the setting of grammatical inference competitions is that great care needs to be taken in the generation of negative examples for training, and particularly for testing, purposes. This probably also applies to other classification and modelling tasks where the majority of randomly selected examples are negative examples.

Both Sanchez and Tjong Kim Sang solved the easier problems using techniques that approximated the target language with a regular language. It was therefore known that although they had correctly classified the test examples, they did not have an accurate model of the target languages. As organizers of the competition, it was a little disappointing that the competitors did not use context-free GI techniques to solve the problems, as this does not progress the state-of-the-art in context-free grammatical inference. This was, however, a possibility that we were prepared for. We had deliberately designed the competition so that not only would differing GI techniques be pitted against one another, but GI techniques would also be pitted against non-GI techniques, such as generic classification techniques and even solving the problems by hand.

After problems 1, 3, 4, 5 and 6 had been solved, a great deal of effort was put into constructing test data sets that were more accurate indications of whether or not the competitor had successfully constructed an accurate model of the target language. Specifically, new problems were added to the competition on April 14th, and on June 7th additional test sets for problems 1 to 6 were added to the competition. These test sets became problems 6.1 to 6.6. At the completion of the competition it was discovered that when Clark had solved problems 6.2 and 6.4 he had identified the target languages exactly. Similarly, the techniques used by Sanchez and Tjong Kim Sang could not be used to solve these harder problems. This suggests that the strategy of careful selection of test data, in particular negative data, was successful.

6.2. Recommendations for future competitions

Over the last decade, much of the research in machine learning has focused on problems like classification and regression, where the prediction is a single univariate variable. A challenge for machine learning in the next decade is to transfer successes in these areas into the area of inferring complex objects like grammars, trees, sequences, or orderings. Perhaps this is one reason why grammatical inference is becoming a more active area of research. In this respect, regular grammars are ideal for modelling sequences. Context-free grammars are also well suited to modelling sequences, but in addition can be used to model hierarchical structures within sequences (such as the phrase structure of language). Future grammatical inference competitions like the Omphalos competition can continue to play a part in the development of the field, by providing common tasks for researchers to focus on, thus providing a good means of comparing techniques.

At the end of the Omphalos session at ICGI 2004, a discussion was held to provide the opportunity for those present at the conference to give feedback to the Omphalos committee members. Topics of discussion included "was the competition a success", "how could the competition be improved for subsequent years" and "should the competition be continued for subsequent ICGI conferences". There was general agreement that the shared task was a good opportunity to compare algorithms and promote the development of new techniques, and that the competition should be continued. In the remainder of this section we therefore couch the received comments in terms of how subsequent GI competitions could be held.

There was a request for the competition tasks to have a practical application focus, and for the application to dictate the metric used to evaluate the task. In this case the practical application would also dictate the form that the training and testing data should take. For instance, if the task is to infer a grammar for use with a speech recognition system, then speech recognition accuracy would be a suitable metric. Similarly, if the application task is to infer the meaning of sentences, then the phrase structure of sentences is more useful than classifying whether or not a sentence is grammatical. This opinion was expressed by more than one member present, and is therefore believed to be the majority opinion.

There was also a range of specific suggestions and comments made by various individuals present. These suggestions were not universally accepted, and some suggestions were in conflict with suggestions made by other people. The specific suggestions were as follows.

– There was a request for more than one type of data, e.g. partially structured data, in the belief that with a greater variety of data more people would compete.

– Some people expressed the opinion that if more than one type of data were available, then the number of competitors per task would be low.

– There was a request for the task to be an approximation (rather than an identification) task.

– There was a request for the winner of the task to be the competitor that obtained the best accuracy.

– There was a request for the task to be related to identifying the structure of language, that is, identification of a grammar rather than of a language. In this case the target languages could be regular or finite acyclic languages.

– There was a complaint that the alphabet size was too small to be comparable to real-world tasks.

– There was a request for the task to be one in which the training and test sets contained noise.

As a result of these discussions, it is planned to continue the GI competition tradition and include another GI competition at the next ICGI, to be held in 2006. At this point it is planned, as discussed at ICGI, to model the competition tasks on real-world applications. There will most likely be two competition tasks, one in the area of natural language and the other in computational biology (DNA or protein sequences). Ideally we would like to use real-world data in the competition, but this approach suffers from several problems. The first is that competitors may be able to bring external knowledge to the task at hand in a way that does not encourage the development of the state-of-the-art. For instance, native English speakers would have no problem labelling grammatical and ungrammatical English sentences. Another major problem is that "real" data, like DNA sequences and natural language, has an inherent bias that is unknown beforehand. Also, it becomes difficult to scale the competition problems to a size that is just outside the state-of-the-art when there is no control over the underlying target model.

Acknowledgments

The contribution of Alexander Clark in reviewing the section on his winning technique is hereby acknowledged.

References

[1] E. Mark Gold. Language identification in the limit. Information and Control, 10:447–474, 1967.


[2] Yasubumi Sakakibara. Recent advances of grammatical inference. Theoretical Computer Science, 185(1):15–45, 1997.

[3] C. de la Higuera. A bibliographical study of grammatical inference. Pattern Recognition, 2005. Special Issue on Grammatical Inference Techniques and Applications.

[4] D. Angluin. Queries revisited. Theoretical Computer Science, 313(2):175–194, 2004.

[5] L. G. Valiant. A theory of the learnable. Communications of the Association for Computing Machinery, 27(11):1134–1142, 1984.

[6] E. M. Gold. Complexity of automaton identification from given data. Information and Control, 37:302–320, 1978.

[7] L. Pitt and M. Warmuth. The minimum consistent DFA problem cannot be approximated within any polynomial. In Johnson [39], pages 421–432.

[8] L. Pitt and M. Warmuth. Reductions among prediction problems: On the difficulty of predicting automata. In Third Conference on Structure in Complexity Theory, pages 60–69, 1988.

[9] M. Kearns and L. G. Valiant. Cryptographic limitations on learning Boolean formulae and finite automata. In Johnson [39], pages 433–444.

[10] R. Parekh and V. Honavar. Learning DFA fromsimple examples. In Workshop on Automata Induc-tion, Grammatical Inference, and Language Acquisi-tion, ICML-97, 1997.

[11] C de la Higuera. Characteristic sets for polynomialgrammatical inference. Machine Learning, 27:125–138,1997.

[12] C. de la Higuera and J. Oncina. Inferring determinis-tic linear languages. In Jyrki Kivinen and Robert H.Sloan, editors, Fifteenth Annual Conference on Com-putational Learning Theory (COLT 2002), pages 185–200. 2002.

[13] B. Starkie. Left aligned grammars—identifying a classof context-free grammar in the limit from positive data.In C. de la Higuera, P. W. Adriaans, M. van Zaa-nen, and J. Oncina, editors, Proceedings of the Work-shop and Tutorial on Learning Context-Free Gram-mars held at the 14th European Conference on Ma-chine Learning (ECML) and the 7th European Confer-ence on Principles and Practice of Knowledge Discov-ery in Databases (PKDD); Dubrovnik, Croatia, pages89–101, 2003.

[14] B. Starkie and Henning Fernau. The boisdalealgorithm—an induction method for a subclass of uni-fication grammar from positive data. In G. Paliourasand Y. Sakakibara, editors, Grammatical Inference:Algorithms and Applications; Seventh InternationalColloquium, ICGI 2004, volume 3264, pages 235–247.Berlin, Germany: Springer, Athens, Greece, 2004.

[15] J. M. Sempere and P. Garcıa. Learning locally testableeven linear languages from positive data. In Adriaanset al. [38], pages 225–237.

[16] T Yokomori. Polynomial-time identification of verysimple grammars from positive data. Theoretical Com-puter Science, 298:179–206, 2003.

[17] P. W. Adriaans. Language Learning from a Categorial Perspective. PhD thesis, University of Amsterdam, Amsterdam, the Netherlands, November 1992.

[18] A. Clark. Unsupervised Language Acquisition: Theory and Practice. PhD thesis, University of Sussex, Brighton, UK, 2001.

[19] D. Klein and C. D. Manning. Distributional phrase structure induction. In CoNLL [37], pages 113–120.

[20] K. J. Lang, B. A. Pearlmutter, and R. A. Price. Results of the Abbadingo One DFA learning competition and a new evidence-driven state merging algorithm. In V. Honavar and G. Slutzki, editors, Proceedings of the Fourth International Conference on Grammar Inference, volume 1433 of Lecture Notes in AI, pages 1–12, Berlin Heidelberg, Germany, 1998. Springer-Verlag.

[21] J. J. Horning. A study of grammatical inference. PhD thesis, Stanford University, Stanford, CA, USA, 1969.

[22] M. van Zaanen, A. Roberts, and E. Atwell. A multilingual parallel parsed corpus as gold standard for grammatical inference evaluation. In Lambros Kranias, Nicoletta Calzolari, Gregor Thurmair, Yorick Wilks, Eduard Hovy, Gudrun Magnusdottir, Anna Samiotou, and Khalid Choukri, editors, Proceedings of the Workshop: The Amazing Utility of Parallel and Comparable Corpora; Lisbon, Portugal, pages 58–61, May 2004.

[23] N. Chomsky. Three models for the description of language. IRE Transactions on Information Theory, 2:113–124, 1956.

[24] J. E. Hopcroft, R. Motwani, and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Publishing Company, Reading, MA, USA, 2001.

[25] T. M. Mitchell. Machine Learning. McGraw-Hill, New York, NY, USA, 1997.

[26] A. Stolcke. Boogie. ftp://ftp.icsi.berkeley.edu/pub/ai/stolcke/software/boogie.shar.Z, 2003.

[27] K. Nakamura and M. Matsumoto. Incremental learning of context-free grammars. In Adriaans et al. [38], pages 174–184.

[28] C. M. Cook, A. Rosenfeld, and A. R. Aronson. Grammatical inference by hill climbing. Information Sciences, 10:59–80, 1976.

[29] C. Donnelly and R. M. Stallman. Bison Manual: Using the YACC-compatible Parser Generator, for Version 1.875. GNU Press, Boston, MA, USA, 2003.

[30] A. Denise, F. Sarron, and Y. Ponty. GenRGenS. http://www.lri.fr/denise/GenRGenS/, 2001. Laboratoire de Recherche en Informatique.

[31] M. J. Nederhof. Practical experiments with regular approximation of context-free languages. Computational Linguistics, 26(1):17–44, 2000.

[32] F. Otto. On deciding the confluence of a finite string-rewriting system on a given congruence class. Journal of Computer and System Sciences, 35:285–310, 1987.

[33] A. Clark. Learning deterministic context-free grammars: The Omphalos competition. Submitted to Machine Learning, 2005.

[34] R. Eyraud, C. de la Higuera, and J. Janodet. Representing languages by learnable rewrite systems. In G. Paliouras and Y. Sakakibara, editors, Grammatical Inference: Algorithms and Applications; 7th International Colloquium, ICGI 2004, volume 3264, pages 139–150, Athens, Greece, 2004. Berlin, Germany: Springer.

[35] A. Clark. Unsupervised induction of stochastic context-free grammars using distributional clustering. In CoNLL [37], pages 105–112.

[36] M. van Zaanen. ABL: Alignment-Based Learning. In Proceedings of the 18th International Conference on Computational Linguistics (COLING); Saarbrücken, Germany, pages 961–967. Association for Computational Linguistics (ACL), July 31–August 4 2000.

[37] Association for Computational Linguistics (ACL). Proceedings of the Workshop on Computational Natural Language Learning held at the 39th Annual Meeting of the Association for Computational Linguistics (ACL) and the 10th Meeting of the European Chapter of the Association for Computational Linguistics (EACL); Toulouse, France, July 2001.

[38] P. W. Adriaans, H. Fernau, and M. van Zaanen, editors. Grammatical Inference: Algorithms and Applications (ICGI); Amsterdam, the Netherlands, volume 2482 of Lecture Notes in AI, Berlin Heidelberg, Germany, September 23–25 2002. Springer-Verlag.

[39] D. S. Johnson, editor. STOC ’89: Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing, New York, NY, USA, 1989. ACM Press.