About reformulation in full-text IRS


Information Processing & Management, Vol. 25, No. 6, pp. 647-657, 1989. 0306-4573/89 $3.00 + .00. Printed in Great Britain. Copyright © 1989 Pergamon Press plc

ABOUT REFORMULATION IN FULL-TEXT IRS

FATHI DEBILI UA 962 du CNRS, Conseil d’Etat, Palais Royal, 75001 Paris, France

CHRISTIAN FLUHR* Université PARIS-XI, INSTN, CEN-Saclay, 91191 Gif sur Yvette, France

and

PIERRE RADASOA SYSTEX, Ferme du Moulon, 91190 Gif sur Yvette, France

(Received 8 June 1988; accepted in final form 17 February 1989)

*Author to whom correspondence should be addressed.

Abstract - In this paper, we analyze different kinds of reformulation used in Information Retrieval systems where full-text databases are accessed through natural language queries. These reformulations are tested separately on large actual full-text databases managed by the SPIRIT system (Syntactic and Probabilistic Indexing and Retrieval of Information in Texts). Some modifications of the weighted comparison are being tested to take reformulation into account. Finally, an expert system solution is proposed as a reformulation strategy manager.

1. INTRODUCTION

The problems that we deal with concern textual information retrieval in very large bases. A request is stated in unrestricted natural language. The aim is to provide relevant documents as a function of this query, and to locate as precisely as possible the most informational parts of these documents. The request is stated as a description of what is wanted, rather than as a simple question whose answer would have to be constructed.

Here are the key ideas that underlie our approach:

1. First of all, we process free text, i.e., text without constraints on vocabulary and syntactic forms.

2. Linguistic processing is privileged. It is multilingual and modular (the modules are executed sequentially without backtracking; each provides an intermediate solution which may hold local ambiguities, and the next modules try to resolve these ambiguities if any), and it presents a well-structured separation between programs and linguistic knowledge. It uses learning methods, and automatic construction and simulation of linguistic information.

3. Statistical processing follows linguistic processing. It takes into account problems which are not yet completely resolved by the linguistic processing, which nonetheless remains primordial in order to best identify the entities that the statistical methods treat.

The SPIRIT system, operational since 1981 [1], is built on these principles. In this paper, we present different reformulation mechanisms that we are testing and validating with the help of this system.

Reformulation is the answer to the general problem posed by paraphrase. Often, the query is expressed in terms which differ from those found in the documents that should be retrieved. Generally, the problem is to find a way to match query terms against document terms. The matching can be done by a transformation of the texts in the base, knowing the nature of the query language; by a query transformation, knowing the links between the user vocabulary and the text vocabulary; or by a joint transformation of the texts and the query, converging to a normalized representation which covers the whole language.

Our approach corresponds to the third case: we reformulate both the texts and the query. The rest of this paper is organized as follows. We first describe the different reformulation techniques that we have experimented with or are currently experimenting with. We then examine the modifications needed in the query/document comparison for evaluating answer relevance and ranking the answers. Finally, as the reformulation techniques are numerous, we pose the problem of automatically piloting the reformulation system with an expert system.

2. THE DIFFERENT REFORMULATION TECHNIQUES

2.1 Lemmatization

Lemmatization consists in representing each identified word of the text or the query by a canonical form (usually, the same form as in a common dictionary of the language). Sometimes, the canonical form is arbitrarily chosen because of polysemy problems.

This transformation concerns verbs, nouns and adjectives, changing them into the infinitive, singular and masculine singular forms, or into nominatives for languages with declension. But it also concerns abbreviations, which are associated with their complete forms; acronyms, or their developed forms, with which the lemmatized forms are associated; multiple spellings; and so on. Some examples are:

indexed → to-index            Mr → Mister
wrote → to-write              ref. → reference
books → book                  United Nations → U.N.
mice → mouse                  UN → U.N.
pupil's → pupil               U.N → U.N.
tonite → tonight              enduser → end-user
shorter → short               end user → end-user

The simplicity of these examples must not hide the fact that, in many cases, lemmatization is confronted with an ambiguity problem that needs morpho-syntactical treatment, or even a higher level of treatment. Some other examples are:

can, noun → can
can, auxiliary → could
can, verb → to-can

To resolve these problems, we use in the SPIRIT system morpho-syntactical algorithms using big dictionaries (450,000 entries for the French dictionary, 300,000 for the Arabic one, 100,000 for the English one), and positional rules built by a learning method [1-6]. These rules contain the valid sequences of grammatical categories, for example: article-adjective-noun, or verb-article-adjective.

The correct recognition and the right lemmatization are not always obtained. Some cases need a semantical analysis, or even a pragmatical one [7]. After syntactic analysis, the remaining grammatical ambiguities represent, in the French documentary corpus, less than 5% of all processed words. This ratio is due to the great number of grammatical categories managed by the system (more than 150). Most of these ambiguities do not affect the lemmatization process. The only cases that could create problems are ambiguities giving different lemmatizations, and ambiguities between full and empty words (full words are informational words; empty words are non-informational words like articles, prepositions, etc.). But SPIRIT can manage several lemmatizations for the same word occurrence. In case of ambiguity between a full and an empty word, the word is considered a full one. This last choice avoids decreasing recall, but can decrease precision.

The lemmatization provides the basic terms from which all internal representations of the documents and the query are built. As these representations are the same, the lemmatization establishes a first link level between words (of the documents and the query) that are morphologically transformed but syntactically unchanged. For example:

“and what is the use of a book,” thought Alice, “without pictures or conversations?” [L. CARROLL]

After lemmatization, we obtain:

“and what to-be the use of a book,” to-think Alice, “without picture or conversation?”
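
To make the mechanism concrete, here is a minimal sketch in Python of dictionary-based lemmatization that keeps several candidate lemmatizations for an ambiguous word, as SPIRIT allows. The toy dictionary and category tags are purely illustrative, not SPIRIT's actual resources:

```python
# Minimal sketch of dictionary-based lemmatization (illustrative only; SPIRIT's
# actual dictionaries hold 100,000-450,000 entries plus positional rules).

# Hypothetical toy dictionary: surface form -> list of (lemma, category) pairs.
# A word may keep several candidate lemmatizations for later disambiguation.
LEMMA_DICT = {
    "is": [("to-be", "verb")],
    "thought": [("to-think", "verb"), ("thought", "noun")],  # ambiguous
    "pictures": [("picture", "noun")],
    "conversations": [("conversation", "noun")],
    "books": [("book", "noun")],
}

def lemmatize(tokens):
    """Return, for each token, every candidate (lemma, category).
    Unknown words are kept as-is so later stages can handle them."""
    return [LEMMA_DICT.get(t.lower(), [(t.lower(), "unknown")]) for t in tokens]

print(lemmatize(["books", "thought"]))
# [[('book', 'noun')], [('to-think', 'verb'), ('thought', 'noun')]]
```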

2.2 Detecting errors and handling proper names

Misprints or spelling errors can decrease recall in full-text retrieval systems. Some of these errors can be detected, and eventually corrected, by the use of linguistic rules. Among these correctable errors, the one with the most influence on the recall rate is the transformation of a word belonging to the language into one that does not; this can easily be detected by the morphological analysis. Other kinds of errors are those related to gender, number, case or tense agreement. These errors have no influence on recall, because the lemmatization reformulation automatically eliminates them.

Another group of errors cannot be detected by linguistic processing. We must include in this group the case of proper nouns. There is not, in general, a one-to-one correspondence between the spelling and the pronunciation of a proper noun. And when foreign nouns are transcribed from another alphabet, transcriptions can vary, giving many different forms of the same original word. For these reasons, when processing the query, proper nouns that are not in the database index must be submitted to a reformulation processing.

Notably, for the first kind of errors (nonvalid words), the user can use his own knowledge of the language to correct them (automatic correction is not obligatory); for the second group, the system must help the user find the right way to write the word.

The reformulation processing consists in searching the database index for all proper nouns close (phonetically and morphologically) to the proper noun of the query. The reformulation rules are obtained dynamically by the use of a grapheme-to-phoneme translation on both texts and queries, and a search for similar pronunciations based on an alphacode access [8,9].
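
The paper does not give the grapheme-to-phoneme rules or the alphacode itself, so the sketch below substitutes a Soundex-like key purely to illustrate the principle: proper nouns in the index are grouped by an approximate pronunciation code, and a query spelling absent from the index is reformulated into the index forms sharing its key:

```python
# Illustrative stand-in for SPIRIT's phonetic access. The real system uses
# grapheme-to-phoneme translation and an alphacode, not Soundex.
from collections import defaultdict

SOUNDEX_CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}

def phonetic_key(name):
    """Compress a name into a short code approximating its pronunciation."""
    name = name.lower()
    key = name[0]
    last = SOUNDEX_CODES.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":               # h and w neither code nor separate sounds
            continue
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != last:
            key += code
        last = code                  # vowels reset, allowing repeated codes
    return (key + "000")[:4]

# Build a phonetic index over the proper nouns of the database.
index = defaultdict(list)
for noun in ["Khrushchev", "Gorbachev", "Brezhnev"]:
    index[phonetic_key(noun)].append(noun)

# A query spelling absent from the index is reformulated into close index forms.
print(index[phonetic_key("Kruschev")])  # ['Khrushchev']
```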

2.3 "Explicitation"

Here is an example concerning tax legal texts:

Query: "What is the definition of a book in tax regulation?"

The answer to such a query will give many wrong locations, due to the great number of occurrences of the word "book" and to the absence of the word "definition" in the texts, even if its sense is mentioned implicitly. Words like "date, delay, definition, list, measure, . . ." can be found in the texts in the shape of phraseological frames [10]:

Original text: “We call a book . . .”

Explicitation frame: “We to-call ARTICLE NOUN ARTICLE”

is the morpho-syntactical frame which permits the recognition of the “definition” notion concerning the word “book”.

Another example is:

Text: DOS command: -COPY -DIR -MD


This frame hides a "list" notion which can be identified and related to the word "command." This permits correct answers to queries like:

“DOS command list?”

The idea in "explicitation" is to recognize such morpho-syntactical frames and to add, to the internal representation of the document, the terms which describe these implicit concepts, and eventually the relation between these terms and these frames. The "explicitation" is carried out either on the texts or on the query. In the following examples:

Text: "Which are . . . prior to 12/25/85 . . ."
Text: "The diameter of the wire must be more than 2 mm."

It is obvious that the kind of measure used (date or distance in the present case) must be determined for the whole text, so that the comparison can process terms of identical nature. This reformulation technique is also called UP POSTING: the replacement of a specific form by a generic one.
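
As a rough illustration of the idea, the following sketch scans lemmatized, category-tagged text for one hypothetical frame and returns the implicit concept term to be added to the document's internal representation. The frame inventory, tag names and function names are invented for the example:

```python
# Minimal sketch of "explicitation": scan lemmatized, tagged text for a
# phraseological frame and emit the implicit concept it hides.
# Each frame: (sequence of lemma-or-category slots, implicit concept,
#              index of the slot whose word the concept is related to).
FRAMES = [
    (["we", "to-call", "ARTICLE", "NOUN"], "definition", 3),
]

def explicitate(tagged):
    """tagged: list of (lemma, category). Returns (concept, related_word)
    pairs to be added to the document's internal representation."""
    added = []
    for pattern, concept, slot in FRAMES:
        for i in range(len(tagged) - len(pattern) + 1):
            window = tagged[i:i + len(pattern)]
            if all(lemma == p or cat == p
                   for (lemma, cat), p in zip(window, pattern)):
                added.append((concept, window[slot][0]))
    return added

text = [("we", "PRONOUN"), ("to-call", "VERB"),
        ("a", "ARTICLE"), ("book", "NOUN")]
print(explicitate(text))  # [('definition', 'book')]
```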

2.4 Stemmatization

Farther along in paraphrase recognition, we can consider paraphrases which are built by synonymies based on morphological proximity, obtained by derivation or composition. In fact, these transformations generally preserve the meaning even if the syntactical status is changed. For example:

“Error detection is useful in . . .” “It is useful to detect errors in . . .”

To perform these transformations, another level is added to the paradigmatical description, but only for the query. This description consists of associating, with each lemmatized query term, family terms and a relation between the respective terms. For the moment, we only distinguish two kinds of relations: synonym (S) and related term (RT):

Paradigmatical description:

Query:           Error            detection        is       useful
Lemmatization:   error            detection        to-be    useful
Stemmatization:  to-err (RT),     to-detect (S),            to-use (RT),
                 erroneous (RT)   detector (RT)             usefulness (RT)

The first problem with stemmatization is its construction; the next one is its use (the selection problems are discussed in the comparison section). To realize these constructions, we create dictionaries of word families of this form:

W, R → W1, W2, . . . , Wn

where R represents the relation type. So, we have:

error, RT → to-err, erroneous
detection, S → to-detect
detection, RT → detector

This linguistic knowledge is general and covers the whole language; it is not restricted to a given domain. For some languages, its construction can be done automatically [11,12]. In French, for instance, with the help of morphological features, we have built the stemmatization on the compatibility and non-compatibility of suffixes [13].
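
A minimal sketch of how such rewriting rules might be stored and applied to a query is given below. The family dictionary reuses the paper's own examples; the function name and data layout are illustrative:

```python
# Sketch of stemmatization as rewriting rules "W, R -> W1, W2, ..., Wn".
# The families below are the paper's examples; real dictionaries are built
# automatically, e.g. from suffix (in)compatibilities [13].
FAMILIES = {
    ("error", "RT"): ["to-err", "erroneous"],
    ("detection", "S"): ["to-detect"],
    ("detection", "RT"): ["detector"],
    ("useful", "RT"): ["to-use", "usefulness"],
}

def stemmatize(query_lemmas, relations=("S", "RT")):
    """Complete each query lemma with its family terms, keeping the relation
    type so the comparer can weight inferred terms differently."""
    expansion = {}
    for lemma in query_lemmas:
        expansion[lemma] = [(w, rel) for rel in relations
                            for w in FAMILIES.get((lemma, rel), [])]
    return expansion

print(stemmatize(["error", "detection", "to-be", "useful"]))
# {'error': [('to-err', 'RT'), ('erroneous', 'RT')],
#  'detection': [('to-detect', 'S'), ('detector', 'RT')],
#  'to-be': [], 'useful': [('to-use', 'RT'), ('usefulness', 'RT')]}
```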

2.5 Thesaurus


Further along in this recognition of paraphrases, we would now like to recognize paraphrases between words that are semantic rather than just morphological neighbors. These semantic family relations are constructed manually for the moment. They are stored in the same way as word families, that is, as rewriting rules:

W, R → W1, W2, . . . , Wn

where R is the semantic relation. Such rules can be obtained from existing thesauri. For example:

vehicle, NT → car, truck, van        (NT: narrower term)
car, BT → vehicle                    (BT: broader term)
Europe, RT → E.E.C.                  (RT: related term)
treatment, S → processing            (S: synonym)
nuclear reactor, NT → pressurized water reactor

Some of these rules are general (geographical relations, for example); others are linked to a specific domain (synonymy relations, for example). With this representation, the user can easily add a new kind of relation and define its use, as is classically done in thesaurus construction.

In addition, a relation can be made between phrases (not only between single words). Some of these relations can be controlled by a condition which depends on the word environment. The reformulation consists of completing the paradigmatical description of the query by adding terms obtained by these rewriting rules. As with stemmatization, this description is only done on the query, not on the texts.

This kind of reformulation, which is used to find different expressions of the query in a given language, can also be used to reformulate the query into another language, using in this case translation reformulation rules. Other thesaurus manipulations are presented in [11,14].

2.6 Learning of reformulation rules

Access to textual databases by the end-user in natural language can be tracked by the database manager. In the case of general public access, the queries are often incomplete, or the wrong word is used. This prevents the user from accessing the information. It is easy to ask the user whether he succeeded or failed in his attempt to get information, and to ask him to explain what he was looking for.

A first experiment to exploit this information is being conducted with SPIRIT used as the HELP system for one of the biggest computer centers in France (CIRCE). Catalogs of programs, full-text technical documents, and technical bulletins of the computer center have been automatically indexed and can be accessed in natural language. Anyone who accesses the system can comment on the result of his interrogation and specify what he wanted and did or did not get. This information goes to the database manager along with a trace of the session.

The first evident result is that users give less information in their queries to the system than in their explanatory messages to the database manager, and that the text of such a message, expressed in "genuine" natural language, would have given a good answer. For example:

Query: Print format

Commentary message to the database manager: I was searching different output classes on printer and especially archiving format.

Study of the queries that failed leads us to believe that users got the wrong answer because they used the wrong term (this seems to be particularly true for non-computer-scientists on this base). Many of these "erroneous" queries treat the same subjects (e.g., the price of disk storage, login problems). The database manager, who knows his database very well, can establish a pair (WRONG QUERY, PART OF TEXT WHICH IS AN ANSWER) for these frequent queries.

With this information he can establish reformulation rules as in a thesaurus, the only difference being that the semantic relation represents the semantic model of the naive general-public end-user for this base. For an experienced user this relation may be false, but for the naive class of users, the link between a bad term and the right term is valid. An example of such a pair is:

"I want to write in Russian" → "Cyrillic keyboard".

We are constructing a system that can automatically build such relations from the pairs manually established by the database manager. These reformulation rules establish a user-class model that can adapt the linguistic habits of that population to the linguistic environment of the documents' authors.
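
The paper does not describe this learning algorithm, so the sketch below shows only the simplest conceivable version: every full word of a wrong query is linked to the informative terms of the manually paired answer text, under an assumed relation tag USER (a hypothetical name marking user-class relations as distinct from true thesaurus links):

```python
# Sketch of turning (WRONG QUERY, ANSWER TEXT) pairs, collected by the
# database manager, into thesaurus-like reformulation rules for a user class.
# Term extraction is reduced here to lowercasing and stop-word removal.
STOP = {"i", "to", "in", "a", "the", "want"}

def learn_rules(pairs):
    """Link each full word of a wrong query to the answer terms, under the
    relation tag 'USER' (a user-class model, not a true semantic relation)."""
    rules = {}
    for wrong_query, answer_text in pairs:
        answer_terms = [w for w in answer_text.lower().split()
                        if w not in STOP]
        for term in wrong_query.lower().split():
            if term not in STOP:
                rules.setdefault((term, "USER"), set()).update(answer_terms)
    return rules

pairs = [("I want to write in Russian", "Cyrillic keyboard")]
print(learn_rules(pairs))
# {('write', 'USER'): {'cyrillic', 'keyboard'},
#  ('russian', 'USER'): {'cyrillic', 'keyboard'}}
```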

2.7 Syntagmatic extensions from documents

Just as the librarian uses information from pertinent documents to reformulate the query in order to increase recall, it is possible to use the whole or part of a document as a new natural language query [1].

J.C. Bassano [15] showed that in a scientific abstract database this kind of reformulation can be started automatically on the most relevant documents retrieved. According to our experience, we think that this cannot be generalized to full text, because in most cases we cannot be sure that even the document deemed most relevant by the system is relevant to the user. S.E. Robertson [16] suggests a query expansion by adding terms occurring in the relevant documents, with an appropriate weight.

In the SPIRIT system [1], the user selects which document must be used for a syntagmatic reformulation. Parts of text relevant to the query are "automatically" used to compose a new query. This establishes a dynamic link between texts or parts of text that is close to a notion of dynamic Hypertext [17]. In the same way, and without any reformulation of the original query, dynamic semantic links can be established from any tagged part of text to semantically related parts of text.

3. WEIGHTED COMPARISON

Automatic addition of terms from a thesaurus into the queries was proposed very early in classical boolean systems [18]. This function is not widely used at present because, even if it increases recall, it strongly decreases precision.

The idea that the problem could be solved by the use of a weighted comparison was proposed, but the first experiments were not a great success. The reason is that, to increase precision in such a comparison, it is necessary to establish a non-trivial comparison mechanism. For example, our experience leads us to agree with the results of Donna Harman's tests [12]: simply adding terms to the query and processing such an inflated query with a weighted comparer using the same algorithm as for the original query does not give good results. In the same paper, some solutions are proposed, such as decreasing the weights of inferred words, or using only short queries. While the former idea seems a good subject for further research, the latter seems to contradict the flexibility offered by the use of weighted comparison for natural language queries. Indeed, the main advantage of using a weighted comparison is that the user can give many details about what he is searching for in one large query, without running the risk of getting zero documents, as is often the case with long boolean queries.

We shall now present some information on the comparer currently used by SPIRIT, before presenting its modifications for the reformulation experiments [19].

SPIRIT's linguistic processing produces, for each document (a summary, a chapter of a book, or a complete book in full text), a list of normalized words and pairs of normalized words. This corresponds to the lemmatization reformulation. The pairs of normalized words are words in dependence relations, such as noun-noun or adjective-noun in the noun phrase, subject-verb, verb-direct object, and so on. Pairs of normalized words are built for the moment by adjacency (a dependency relation analysis will be used in the next version). A word appearing in a pair also appears as a single word. For each normalized term and relation, the system gives information about occurrences in the text (i.e., address and length). This information is necessary because the normalized form and the actual occurrence in text can be completely different.

For each normalized term or relation (in the following, we will say search elements), a weight is computed according to a probabilistic model. This weight measures the information brought by the search element in the comparison: "A search element which appears in only one document has a maximum weight; a search element which appears in all the documents has a minimum weight" [4].

The comparison mechanism begins with the more informative search elements. This permits optimization of the comparison: it can be stopped before total completion without losing pertinent information. The system builds up a description of the intersection between the query and the documents proposed by the inverted list of search elements. When the comparison stops, the documents are gathered into classes that have the same intersection. This intersection is in fact the best boolean query that could have been made to get this class. The boolean operators permitted are AND and AND WITH A DEPENDENCY LINK [20], and of course the second one is "better" than the first one (this problem is also mentioned in [21]). Then a weight is computed for each boolean query representative of a document class:

W(doc. class) = W(sing1) + W(sing2) + . . . + 2 * (W(pair1) + W(pair2) + . . . )

Pairs' weights are multiplied by 2 because a pair replaces two single words.

The classes are ranked according to the previously computed weight. Inside a class, no other ranking is made based on the content of the documents (but rankings can be made on seniority, or other criteria).

As documents can be complete books, a second level of ranking is made in a similar way on parts of texts. This ranking gives an ordered list of the most informational screen pages, which can be displayed by direct access or sequentially, according to their relevance degree. This last function gives the document browser a dynamic Hypertext capacity.
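
The exact element-weighting formula is not given in the paper; the sketch below uses -log(df/N), one classical choice consistent with the stated property (maximum weight for an element appearing in one document, minimum for an element appearing in all), and scores classes with the formula above:

```python
# Sketch of SPIRIT-style weighted comparison: search-element weights decrease
# with document frequency, and a class of documents sharing the same
# query/document intersection is scored as
#   W(class) = sum W(single) + 2 * sum W(pair).
import math

def element_weight(df, n_docs):
    """Rare search elements are maximally informative; ubiquitous ones
    minimally. -log(df/N) is an assumed stand-in for SPIRIT's model."""
    return -math.log(df / n_docs)

def class_weight(intersection, df, n_docs):
    """intersection: set of single lemmas (str) and dependency pairs (tuple)."""
    w = 0.0
    for elem in intersection:
        we = element_weight(df[elem], n_docs)
        w += 2 * we if isinstance(elem, tuple) else we  # a pair replaces 2 words
    return w

# Document frequencies loosely inspired by the Wharton example of section 4.
df = {"europe": 7, "economy": 32, ("europe", "economy"): 2}
n_docs = 93
classes = [{"europe", "economy", ("europe", "economy")},  # best class
           {"europe", "economy"}, {"economy"}]
for c in sorted(classes, key=lambda c: class_weight(c, df, n_docs), reverse=True):
    print(round(class_weight(c, df, n_docs), 2), c)
```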

Now, let us see what kind of modification must be made to this comparer to process reformulation efficiently. It seems that paradigmatic and syntagmatic reformulation must not be managed in the same way.

3.1 Paradigmatic reformulation

The basic ideas for the paradigmatic reformulation are:

- The influence on the class weight of all terms inferred by a term of the query cannot be superior to the weight of the original term.
- The weight of an inferred term depends on the type of relation.
- The influence of an inferred term decreases, because there is no certainty that the inference is good for this specific query. If transitivity is applied (as for the generic-specific relation), the decreasing processing must be used at each level.
- The co-occurrence, in the document, of terms inferred from the same original query term can reinforce the weight, but not beyond the bound given in the first point above.

Now we will describe the treatment of weights according to the principal relations.

Synonyms. They are taken either from stemmatization or from thesaurus-type relations. According to the probabilistic model, this kind of relation poses some problems. The weight of all the synonyms should be the same. But we do not have the actual weight of the synonym class without recomputing the frequencies, because the actual weight is the weight of each member of the synonym class. This recomputing does not necessarily mean a new processing of the whole database for the weighting system of SPIRIT. A similar processing occurs during database updating: a new weight is computed for each search element appearing both in the old database and in the new documents.

If recomputing the weights is considered too expensive, it is possible to use the following property to give an approximation of the weight:

W(syn. class) ≤ W(query term), W(Syn1), W(Syn2), . . .

A good approximation for W(syn. class) is Min(W(query term), W(Syn1), W(Syn2), . . . ). Whatever the number or type of synonyms of the query term, the weight for this occurrence is W(syn. class).

Related terms. They are taken either from stemmatization or from thesaurus-type relations. The main problem here is the uncertainty of the inference with respect to the query. The weight must be an attenuation of the query term weight (and not an attenuation of the inferred term weight). We are now experimenting to find a good attenuation coefficient. The other problem subject to experimentation is: must the co-occurrence of related terms limit the attenuation coefficient?

These questions are very difficult to answer because the related-term relation is not very well defined. A solution could be to divide the RT relation into more precise ones, like actor of an action, tool of an action, object of an action, and so on. In this case it will be necessary to use a deeper analysis of texts and queries (a dependency analysis with tagged relations; for now, only the existence of a link is used by SPIRIT, not the kind of link given by the linguistic analysis).

Specific terms. It is clear in this case that the occurrence of only one inferred term gives a decreased weight, and also that the more inferred terms co-occur, the more fully the concept given by the query term is represented. If all specific terms are in the document but not the original term, this means that the original term could be inferred from the text.

It could be possible to take, for each NTi (i = 1 to n), a weight W(NTi) = 1/n * W(query term). The total weight could then be the inferred term weight multiplied by the number of different inferred terms in the document; this is in agreement with our principle that:

W(query term or inferred terms) ≤ W(query term)

This solution can disadvantage terms that have many specific terms. Another point is that it is perhaps not a good idea to have a weight proportional to the number of different specific terms: the co-occurrence of a few specific terms is more significant than the co-occurrence of many. This suggests that the attenuation must not be directly linked to the number of specific terms, and that the total weight for an occurrence in the query must not be linear in the number of specific terms in the intersection. Several weighting functions are currently being tested.

Generic terms. If the corpus is representative, it can be said that there is a relation between the information brought by a term and its degree of semantic precision; that is to say, a generic term can be found in more documents than a more specific term. For this reason, if an inferred term is a generic term, its weight can be its own weight, provided W(BT term) ≤ W(query term). If not, the corpus is not representative of the domain, and a new weight must be computed using an attenuation coefficient. A good solution could be to use the knowledge of how many specific terms the BT term has inferred. For generic terms, co-occurrences of inferred terms have no reason to reinforce the weight, because generic terms of the same term belong to separate sub-domains.
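
The following sketch gathers the weighting rules of this subsection into one function. The attenuation coefficient and the treatment of co-occurring specific terms are explicitly left open by the paper, so the values below are placeholders, not the authors' settings:

```python
# Sketch of the weighting rules for inferred terms, per relation type.
RT_ATTENUATION = 0.5  # placeholder; the paper leaves this to experimentation

def inferred_weight(relation, w_query, w_inferred, n_specific=1, n_present=1):
    if relation == "S":    # synonyms: class weight, approximated by Min
        return min(w_query, w_inferred)
    if relation == "RT":   # related term: attenuated *query term* weight
        return RT_ATTENUATION * w_query
    if relation == "NT":   # n specific terms, n_present of them co-occurring;
                           # the paper questions this linear form, so other
                           # functions are being tested
        return min(n_present * (w_query / n_specific), w_query)
    if relation == "BT":   # generic term: its own weight, capped by the query's
        return min(w_inferred, w_query)
    raise ValueError(relation)

# Two of four specific terms co-occur: 2 * (3.0 / 4), still below W(query term).
print(inferred_weight("NT", 3.0, 2.0, n_specific=4, n_present=2))  # 1.5
```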

Comparison and weighting. For the moment we do not use different kinds of relations at the same time; our aim is to study the results of each one separately.

For each occurrence of normalized terms, we make an expansion according to the relation being tested, and a weight is computed according to the above rules. But our comparer can manage pairs of normalized terms linked by dependency relations (similar problems are approached by using the incompatibility of dependency relations to eliminate bad pairs [22]). How might this fact be managed within an expansion mechanism? An example is:

in the query:
  text ↔ indexing (dependency relation)

in the document:
  ". . . has indexed documents on . . ."
  result of linguistic processing: to-index ↔ document

It can easily be seen in this example that the weight associated with the document must not be computed according to the local reformulation of the two words "text" and "indexing"; rather, the weight of "to-index ↔ document" must be computed using the weight of "text ↔ indexing." The co-presence of a dependency link in the query and the document increases the validity of the inferred relation as a paraphrase of the original query. In this case a very small attenuation coefficient must be used, and perhaps none. This must be confirmed by experimentation.

3.2 Syntagmatic reformulation

The syntagmatic reformulation is a reformulation that is not only a paraphrase but also adds information to the original query. One part of this reformulation is an automatic and dynamic construction of thesaurus relations; so that it can be used with the same processing as other thesaurus relations, this reformulation produces relations tagged only with the relation type Related Term (as a first approximation). Another part of the reformulation uses new information not contained in the original query: by using the rest of a document responding to the query, one can find other documents on the same subject in that particular base.

This type of reformulation is difficult to use precisely. For this reason, we submit the document as a new query to the standard comparer. Our main effort in this field is to reduce the part of the text used to reformulate to terms and pairs of terms directly linked to the original query, to prevent divergence.

4. EXPERIMENTAL CONDITIONS

Our experiments are carried out on actual full-text databases; for example, our computer science technical full-text database contains 60,000,000 characters. Other experiments will be carried out on legal texts. Each kind of reformulation is tested separately, to measure its efficiency and to gain expertise on its use. It is too early to give significant measures; we will publish them soon.

Nevertheless, here are some examples of query reformulation applied to the Wharton economical base (number of documents: 93; last update: October 29, 1985).

Query: Europe’s economy?

Query without reformulation:

Result:
- 2 documents containing the terms Europe and economy (not in dependency relation);
- 1 document containing the term Europe;
- 32 documents containing the term economy.

Query with stem reformulation:

Result:
- 3 documents containing the terms Europe and economy or their stems (not in dependency relation);
- 6 documents containing the term Europe or its stems;
- 50 documents containing the term economy or its stems.


The top ranked text:

TOPIC: WORLDWIDE
TITLE: Italy: Clouds on the Economic Horizon
AUTHOR: Massimo Tivegna; Deputy Director, Research Development, Confederation of Italian Industry (Confindustria)
DATE: AUGUST 19, 1985
TEXT:

. . . But this is only part of the story, as the economy has shown symptoms of creeping deterioration in the last few months. The Italian economy is still expected to show a satisfactory performance this year with GDP growth of 2.3%, moderately lower than last year’s rate but still one of the highest in Western Europe. Some clouds are on the horizon, however. . . .

Query with thesaurus reformulation:

Result:
- 2 documents containing the terms Europe and economy or their deduced terms (narrower terms for Europe); these terms are in dependency relation, that is to say these documents are considered to be the best answers;
- 3 documents containing the terms Europe and economy or their deduced terms (not in dependency relation);
- 7 documents containing the term Europe and its deduced terms;
- 29 documents containing the term economy and its deduced terms.

The top ranked text:

TOPIC: EXECUTIVE SUMMARY
DATE: SEPTEMBER 16, 1985
TEXT:

. . . ON THE INTERNATIONAL SCENE: After disappointing growth in the first quarter, Germany’s economy rebounded with a 2% jump in the second. A Wharton study reveals that proposed protectionist legislation would hurt the U.S. economy in the long run.

Query with thesaurus reformulation and stem reformulation combined:

Result:
- 3 documents containing the terms Europe and economy or their deduced terms (by thesaurus and stemmatization);
- 5 documents containing the terms Europe and economy or their deduced terms (not in dependency relation);
- 17 documents containing the term Europe and its deduced terms;
- 45 documents containing the term economy and its deduced terms.

5. TOWARDS AN EXPERT REFORMULATION SYSTEM

The multiplicity of reformulation techniques is very difficult to handle; it thus seems a good idea to manage them with an expert system. Most of our reformulation is done by rewriting rules conditioned by a relation type.

A subset of reformulation rules can be activated by value assignment to the relation type, via the use of meta-rules. These meta-rules are activated by the result of the first level of comparison.

Example:

Query: Tax on whiskey?

Text: Taxes on alcohol are . . .

Meta-rule activated: IF query term not in database index THEN relation type = BT

Another example of a meta-rule: IF the user belongs to the "non expert" class THEN relation type = relation built by learning on user habits
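
Here is a minimal sketch of such a meta-rule layer, with the two meta-rules above hard-coded. The relation tag USER for relations learned from user habits is an assumption carried over from the sketch in section 2.6; the function name is illustrative:

```python
# Sketch of the proposed meta-rule layer: the result of a first comparison
# pass selects which relation types the rewriting rules may use.
def select_relation_types(query_terms, db_index, user_class):
    relation_types = set()
    if any(t not in db_index for t in query_terms):
        relation_types.add("BT")      # IF query term not in index THEN BT
    if user_class == "non expert":
        relation_types.add("USER")    # relations learned from user habits
    return relation_types

# "whiskey" is absent from the index, so the BT rules (whiskey -> alcohol)
# are activated; a non-expert user also activates the learned USER rules.
print(select_relation_types(["tax", "whiskey"], {"tax", "alcohol"},
                            "non expert"))  # {'BT', 'USER'}
```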

6. CONCLUSION

For the moment, we have only qualitative results about reformulation efficiency. More experiments are needed to obtain quantitative results for each reformulation technique presented in this paper; and of course, depending on the language, there will be notable variations between these techniques (lemmatization, for instance, is more useful in French than in English).

So, before going farther in combining reformulation techniques, we are testing each technique separately. The results will help us find the optimal combination giving the best retrieval result. For now, our work is carried out in French; later it will be extended to English and Arabic (we already have the tools to do so, but do not yet have large actual databases in these languages).


REFERENCES

1. Andreewsky, A.; Binquet, J.P.; Debili, F.; Fluhr, C.; Pouderoux, B. Linguistic and statistical processing of texts and its application in the field of legal documentation. 6th Symposium on Legal Data Processing in Europe (Council of Europe), Thessaloniki, July 1981.
2. Andreewsky, A.; Fluhr, C. A learning method for natural language processing and application to information retrieval. IFIP Congress, Stockholm, August 1974, pp. 924-927.
3. Andreewsky, A.; Debili, F.; Fluhr, C. Computational learning of semantic lexical relations for the generation and automatic analysis of content. IFIP Congress, Toronto, August 1977, pp. 667-673.
4. Fluhr, C. Algorithme à apprentissage et traitement automatique des langues. Thèse d'Etat, Université Paris XI, 1977.
5. Debili, F. Analyse syntaxico-sémantique fondée sur une acquisition automatique de relations lexicales-sémantiques. Thèse d'Etat, Université Paris XI, 1982.
6. Debili, F. Morphological analysis of written Arabic, with or without vowels, based on an automated construction of an Arabic dictionary. COGNITIVA 85, CESTA, Paris, June 1985.
7. Grefenstette, G. Traitements linguistiques appliqués à la documentation automatique. Thèse de troisième cycle, Université Paris XI, 1983.
8. Andreewsky, A.; Debili, F.; Fluhr, C. Une propriété remarquable du lexique des langues naturelles et son utilisation dans la correction automatique des erreurs typographiques. Note CEA-N-2067, December 1978.
9. Deloche, G.; Debili, F.; Andreewsky, E. Order information redundancy of verbal codes in French and English: Neurolinguistic implications. Journal of Verbal Learning and Verbal Behavior, 19, 1980.
10. Bourcier, D. Information et signification en droit. Expérience d'une explicitation automatique de concepts. Langages n° 53, March 1979, Didier-Larousse.
11. Salton, G.; McGill, M.J. Introduction to modern information retrieval. New York: McGraw-Hill; 1983.
12. Harman, D. A failure analysis on the limitations of suffixing in an online environment. Tenth annual international ACM SIGIR Conference, New Orleans, June 1987.
13. Debili, F. A method of automatic word family building. In: Deontic logic, computational linguistics and legal information systems, Volume II, A.A. Martino (ed.), pp. 305-325. North-Holland Publishing Company, 1982; International Study Congress on Logic, Informatics, Law, Florence, Italy, April 1981.
14. Bruandet, M.F. Outline of a knowledge base model for an intelligent information retrieval system. Tenth annual international ACM SIGIR Conference, New Orleans, June 1987.
15. Bassano, J.C. Dialect: an expert assistant for Information Retrieval. Canadian AI Conference, Montreal, May 1986.
16. Robertson, S.E. On relevance weight estimation and query expansion. Journal of Documentation, Vol. 42, No. 3, September 1986.
17. Conklin, J. Hypertext: An introduction and survey. Microelectronics and Computer Technology Corp.; Computer, September 1987.
18. MISTRAL. Manuel d'utilisation, la Documentation Française.
19. Fluhr, C. SPIRIT: a linguistic and probabilistic information storage and retrieval system. First international workshop on natural communication with computers, Warsaw, September 1980.
20. Fluhr, C. Le traitement et l'interrogation des bases de données textuelles. Informatique et Droit en Europe, Université Libre de Bruxelles, Editions Bruylant, 1984.
21. Kwok, K.L. Some considerations for approximate optimal queries. Tenth annual international ACM SIGIR Conference, New Orleans, June 1987.
22. Fagan, J.L. Automatic phrase indexing for document retrieval: An examination of syntactic and non-syntactic methods. Tenth annual international ACM SIGIR Conference, New Orleans, June 1987.