Cross-Lingual Information Retrieval Problems: Methods and findings for three language pairs



Ari Pirkola & Turid Hedlund & Heikki Keskustalo & Kalervo Järvelin
University of Tampere
Department of Information Studies
Finland

Email: [email protected]

Abstract

In this paper we discuss dictionary-based cross-language information retrieval (CLIR) methods, and report recent findings and problems. We consider three language pairs for CLIR: Finnish to English, English to Finnish, and Swedish to English. We show that Finnish and Swedish have special features, e.g., a high frequency of homographs and of compound words, that affect retrieval effectiveness. Correct word form normalization and compound splitting are especially important. We report findings concerning the effectiveness of various query translation methods, query structures and linguistic tools used for CLIR. We also point out some problems and deficiencies in such tools.

1. Introduction

There is an increasing amount of full text material in various languages available through the Internet and other information suppliers. Therefore, cross-language information retrieval (CLIR) has become an important new research area (Oard & Dorr, 1996; Pirkola, 1999). CLIR is the process of selecting and ranking documents in a language different from the query language. One of the main approaches to CLIR is based on bilingual translation dictionaries. For an overview of the approaches, see Hull & Grefenstette (1996), Oard & Dorr (1996) and Pirkola (1999).

The main problems associated with dictionary-based CLIR are (1) phrase identification and translation, (2) source language ambiguity, (3) translation ambiguity, (4) the coverage of dictionaries, (5) the processing of inflected words, and (6) untranslatable keys, in particular proper names spelled differently in different languages. Translation ambiguity refers to the proportional increase of bad keys due to translation. Research has developed many effective methods to handle these problems. They involve the use of special dictionaries for the dictionary coverage problem (Pirkola, 1998, 1999), POS tagging for phrase translation (Ballesteros and Croft, 1997) and for removing bad translation equivalents (Ballesteros and Croft, 1998; Davis, 1997), stemming and morphological analysis to handle inflected words (Hull, 1996; Krovetz, 1993; Porter, 1980), corpus-based query expansion (Ballesteros and Croft, 1998; Nie et al., 1999; Sheridan et al., 1997), and query structuring for the ambiguity problem (Pirkola, 1998, 1999; Sperer and Oard, 2000).

Because English has been the main language for IR system development, much research on IR involves English. However, IR systems for small languages like Finnish and Swedish, and for other languages differing from English in morphology (inflection, derivation, gender and compound words) or in semantic features (e.g., the frequency of homonymy, polysemy and hyponymy), cannot be developed properly without studying their special features. Although Spanish and Chinese have been given special tracks in the TREC Conferences (footnote 1), the results cannot be applied directly to languages that are linguistically quite different.

1 The fourth, fifth and sixth Text REtrieval Conferences, 1998-1998. URL: http://trec.nist.gov/


In this paper we discuss appropriate CLIR methods, report recent findings, and describe some problems that remain to be solved in CLIR. We concentrate on three different CLIR tasks, namely Finnish to English, English to Finnish, and Swedish to English query translation. We report on the use of natural language processing (NLP) and query structuring (the Pirkola Method; Pirkola, 1998) for CLIR. The structuring of queries refers to the grouping of search keys and the use of proper query operators. Publicly available NLP tools have some pitfalls for CLIR, which are also discussed.

Some 8-9 million people, mainly in Sweden, speak Swedish as a native language. There is also a Swedish-speaking minority population in Finland. However, due to the close relationships between the Nordic countries and between the Scandinavian languages, the number of people who can speak and understand Swedish is much larger (Teleman, Hellberg & Andersson 1999). Approximately 20 million people have a basic knowledge of Swedish. Thus, the results of this study can, with care, be generalized to some other languages.

Some 5 million people speak Finnish. Both Finnish and Swedish have characteristics quite different from English, e.g., the frequency of compound words, and share these features with some other languages, e.g., German. Finnish is exceptional due to its rich inflectional morphology. Knowledge of the differences between the languages, and of the appropriate techniques, may be useful in broadening the scope of CLIR to new languages that do not resemble English in their features.

The rest of this paper is organized as follows. Section 2 considers natural language processing for CLIR. Sections 3 to 5 report findings in CLIR with the language pairs Finnish to English (Pirkola, 1998; Pirkola, Keskustalo & Järvelin, 1999), English to Finnish (Puolamäki, Pirkola & Järvelin, 2000), and Swedish to English (Hedlund, Pirkola & Järvelin, 2000). We report findings concerning the effectiveness of various query translation methods (e.g., the use of dictionaries) and query structures (in particular structured queries based on the Pirkola Method), and discuss how linguistic tools (e.g., dictionaries, word form normalizers) should be used for CLIR. Section 6 presents concluding remarks.

2 Natural language processing for CLIR

Natural language processing involves linguistic methods, and the analysis can take place on different levels of a language, i.e., the morphological, syntactic and semantic levels. On the morphological level the structure of words is analyzed. Recognizing different word forms as variants of the same basic word affects both indexing (word weights) and retrieval (matching). Syntactic analysis determines the structure of phrases and sentences, while semantic analysis investigates the meaning or sense of words and sentences.

Commonly used methods in document indexing are word form normalization (Koskenniemi 1983; Pirkola 1999) and stemming (Harman 1991). A stemmer removes affixes from word forms, and its output is a common root that is not necessarily a real word. Normalization is a similar process, but its output is the base form, which is a real word. Stemming and normalization may yield three kinds of benefits (Harman 1991; Alkula & Honkela 1992). (1) A user does not need to worry about truncation and inflection, because different forms of a key are automatically conflated into the same form. (2) Stemming and normalization result in storage savings. (3) Stemming and normalization may improve retrieval performance, especially recall, since a larger number of potentially relevant documents are retrieved. However, Harman (1991) found no significant improvement in performance in her experiment with simple stemmers for the English language. For inflectionally more complex languages the results are not necessarily the same.

Finnish and Swedish are rich in compound words, and are thus confronted with the problem of embedded search keys. Splitting the compounds into their components allows the use of the component words as separate search keys. For instance, the decomposition of the Swedish compound hustak (roof of the house) gives the expansion keys hus (house) and tak (roof). If the compound hustak were truncated and used as a key, documents including the word tak would not be found. Finnish has a particularly rich inflectional morphology. Each noun may have, theoretically, 2200 forms (Karlsson, 1987), and adjectives and verbs even more. Therefore word form normalization appears very important for Finnish: it is a prerequisite for query translation, as translation dictionaries contain their entry words in basic word forms.
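The decomposition of hustak into hus and tak can be illustrated with a small dictionary-based splitter. This is only a minimal sketch: the toy lexicon and the function names are assumptions made for illustration, and the studies reported here used the morphological analyzers FINTWOL and SWETWOL, not this procedure.

```python
# Hypothetical toy lexicon of base forms; a real system would rely on a
# morphological analyzer such as SWETWOL or FINTWOL instead.
LEXICON = {"hus", "tak", "skog", "industri"}


def split_compound(word, lexicon=LEXICON):
    """Return the first split of `word` into lexicon entries, or [word]."""
    if word in lexicon:
        return [word]
    # Try every split point: the head must be a known word and the
    # remainder must itself split completely into known words.
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = split_compound(tail, lexicon)
            if all(part in lexicon for part in rest):
                return [head] + rest
    return [word]


def expansion_keys(word, lexicon=LEXICON):
    """Keep the compound itself and add its components as extra search keys."""
    parts = split_compound(word, lexicon)
    return [word] if parts == [word] else [word] + parts


print(expansion_keys("hustak"))
print(expansion_keys("skogsindustri"))
```

In this sketch skogsindustri is left unsplit, because the joining "s" between skog and industri is not covered by plain lexicon lookup. Joining morphemes of this kind are taken up in Section 5.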

In IR, part-of-speech (POS) tagging may be used to identify the central words (word classes, especially nouns) and phrases of a sentence. In CLIR, part-of-speech tagging is useful in matching the source language keys with the correct translation dictionary entry words.

A syntactic parser (program) determines the structure of a sentence according to a particular grammar (Grishman 1986). The parsing procedure may involve the assignment of a tree structure to the input sentence. Linguistic transformations, such as transforming active sentences into passive ones, can be of potential value for IR. Syntactic analysis can be used as a basis for further analysis, e.g., anaphor resolution. In CLIR, syntactic parsing may also be useful for matching source language keys with the correct translation dictionary entry words.

Word sense disambiguation is an NLP method that aims at finding the correct senses for word occurrences. It has been studied intensively in IR and other fields. The methods used in the studies include dictionaries (Dagan et al. 1991; Guthrie et al. 1991; Krovetz and Croft 1989), knowledge bases (Hirst 1987), statistical methods (Brown et al. 1991; Schütze & Pedersen 1995), multiple knowledge sources (McRoy 1992), thesauri (Voorhees 1993) and pseudo-words (Sanderson 1994; Sanderson 1997). Most IR studies have reported no or only slight improvements in retrieval performance due to word sense disambiguation (Krovetz & Croft 1992; Sanderson 1994; Sanderson 1997; Voorhees 1993). In CLIR, word sense disambiguation helps in matching the source language keys with the correct sense among the translation dictionary entry words.

3 Findings on Finnish to English CLIR

Methods and data

The test collection was a subset of the TREC collection, consisting of AP Newswire, DOE Abstract, and Federal Register documents. The test collection contained 514,825 English documents. As test requests we used 34 health related TREC topics.

We used the FINTWOL morphological analyzer for word form normalization and compound splitting. Inflected word forms of Finnish natural language/sentence queries were turned into base forms, because the dictionary entry words are in base forms. Finnish compounds were split, because sometimes they are found in dictionaries only as their components. Both compounds and their components were translated.

As test dictionaries we used two Finnish - English - (Finnish) translation dictionaries, a general dictionary and a medical dictionary. The general dictionary contained 65,000 Finnish and 100,000 English entry words. The medical dictionary contained 67,000 Finnish and English entry words. The commercial versions of the dictionaries were converted automatically to CLIR versions by removing from them all material other than the actual dictionary words.

The retrieval system was the InQuery information retrieval system, which is a probabilistic system based on the Bayesian inference net model (Broglio et al., 1994). Queries can be formulated as bag-of-words queries or can be structured by a variety of operators provided by the system. The Kstem morphological stemmer, which produces real English words as its stemming output, is part of InQuery. It was used for stemming the words of the documents. Thus, the database index included stemmed words.

Constructing and translating queries

Figure 1 gives an overall picture of the basic test processes. To get test queries that are comparable to the original English queries, the English queries were translated into Finnish by a human translator (the author), and the Finnish queries were translated back into English by means of dictionaries. This approach is often used in dictionary-based CLIR studies.


[Figure missing]

Figure 1. The basic test processes

In TREC topics, important words are found in the title, description, and narrative fields (some topics do not have narrative fields). Test requests were constructed on the basis of these fields. The test requests were shortened versions of the TREC topics, consisting of 1-2 natural English sentences. Hence, the test requests represented requests that could be used by real users.

There were two main query types. The requests as such were the first type. This type is called natural language/sentence, and is abbreviated to NL/S. The second type was formulated on the basis of the requests by selecting from them the most important words and phrases. It is called natural language/word and phrase, and is abbreviated to NL/WP.

The English NL/S and NL/WP queries were translated into Finnish by the author. As a translation aid the author used printed dictionaries. The test dictionaries were not used in this phase. The English NL/S and NL/WP queries that provided the basis for the Finnish queries were also used as baselines for the CLIR queries (see Figure 1). The term CLIR queries refers to the final queries, i.e., queries translated by means of dictionaries.

Both NL/S and NL/WP queries were divided into two subtypes, structured and unstructured queries. The structured queries had dictionary-based facets, i.e., the words that were derived from the same Finnish word were grouped together by the syn-operator of InQuery. Figure 1 illustrates the structuring method applied in the study, showing how the original English query osteoporosis prevent reduce research is transformed into a structured CLIR query. It should be noted that unstructured and structured NL/WP queries, as well as unstructured and structured NL/S queries, are comparable, as they are derived from the same Finnish queries. They also have the same baseline. The NL/WP and NL/S queries are not comparable with each other, since they do not have identical search key sets.

Compound words are common in Finnish, whereas noun phrases, except for proper name phrases, are relatively rare. A Finnish compound word is often translated as a noun phrase in English (e.g., yhdyssana and compound word). This is the main reason why phrases were identified in NL/WP queries (both in the original English queries, i.e., the baseline, and in the Finnish queries). In this way a precise correspondence was obtained between the baseline queries with their many phrases and the Finnish queries with their many compound words. Phrase identification probably favored the baseline queries. The effect on CLIR queries was small, as the Finnish queries did not have many phrases.

The query operators for NL/S and NL/WP queries were the sum-, syn-, and uwn-operators. Search keys contained in the sum-operator have equal influence on search results. The syn-operator was used in structured CLIR queries (see Figure 1). The syn-operator treats its operand search keys as instances of the same word. The uwn-operator (unordered window n) is a proximity operator. It was used, with n=3, to combine phrase components and the English equivalent words derived from the same Finnish compound (see Section 4).
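The difference between unstructured and structured queries can be sketched as string construction over the translation equivalents of each source word. The Finnish keys and their English equivalents below are hypothetical illustrations, not entries from the test dictionaries, and the function names are assumptions; only the #sum and #syn operator names come from the text.

```python
def unstructured_query(translations):
    """Flat #sum query: every translation equivalent is an independent key."""
    keys = [eq for equivalents in translations.values() for eq in equivalents]
    return "#sum(" + " ".join(keys) + ")"


def structured_query(translations):
    """#sum over facets: equivalents of one source word share a #syn group."""
    facets = []
    for equivalents in translations.values():
        if len(equivalents) == 1:
            facets.append(equivalents[0])
        else:
            facets.append("#syn(" + " ".join(equivalents) + ")")
    return "#sum(" + " ".join(facets) + ")"


# Hypothetical dictionary output: Finnish base form -> English equivalents.
translations = {
    "osteoporoosi": ["osteoporosis"],
    "ehkäistä": ["prevent", "avert", "obviate"],
    "tutkimus": ["research", "study", "investigation"],
}

print(unstructured_query(translations))
print(structured_query(translations))
```

In the unstructured form a source word with many equivalents contributes many independent keys, whereas in the structured form each source word contributes exactly one facet regardless of how many equivalents the dictionary returns.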

The translation methods were the following:

• gd translation: Finnish search keys were translated by means of the general dictionary.

• sd -> gd translation: Finnish search keys were translated by means of the medical dictionary and the general dictionary, in this order. General dictionary translation was applied after medical dictionary translation only if the latter did not translate a word.

• sd and gd translation: Finnish search keys were translated by means of both the medical dictionary and the general dictionary. Duplicate words were removed.


If a word or a phrase was not found as an entry word in the dictionaries, it was sent unchanged to the final query. These kinds of expressions were English proper names, acronyms, and Finnish words not found in the dictionaries.
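The three translation methods and the pass-through rule for untranslatable keys can be summarized in a short sketch. The two toy dictionaries and the function name are assumptions made for illustration; the real test dictionaries are far larger.

```python
def translate_key(key, method, general, special):
    """Translate one source key with the gd, sd->gd or 'sd and gd' method."""
    gd = list(general.get(key, []))
    sd = list(special.get(key, []))
    if method == "gd":
        equivalents = gd
    elif method == "sd->gd":
        # The general dictionary is consulted only if the special
        # dictionary gives no translation.
        equivalents = sd if sd else gd
    elif method == "sd and gd":
        # Union of both dictionaries, duplicates removed, order preserved.
        equivalents = list(dict.fromkeys(sd + gd))
    else:
        raise ValueError("unknown translation method: " + method)
    # Keys found in no dictionary are sent unchanged to the final query.
    return equivalents or [key]


# Hypothetical toy dictionaries (Finnish -> English).
general = {"tutkimus": ["research", "study"], "luu": ["bone"]}
special = {"tutkimus": ["examination", "study"], "osteoporoosi": ["osteoporosis"]}

for method in ("gd", "sd->gd", "sd and gd"):
    print(method, translate_key("tutkimus", method, general, special))
```

Under these assumptions, sd and gd translation yields the largest set of equivalents per key, which is why grouping them (structuring) matters most for that method.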

Findings

The performance of the test queries was evaluated as precision at 10% recall, as average precision over the 10%-100% recall levels, and as precision-recall graphs. The results are presented in Tables 1-2 and Figures 2-5.
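For readers unfamiliar with these measures, the following sketch computes precision at ten recall levels for a single ranked result list and averages them. It assumes the common convention of interpolated precision (the highest precision observed at any recall at or above the given level); the text does not state which interpolation, if any, was used in the experiments, and the function name is hypothetical.

```python
def precision_at_recall_levels(ranking, n_relevant, levels=None):
    """Interpolated precision at the given recall levels.

    ranking: relevance flags (True/False) of the retrieved documents, best first.
    n_relevant: total number of relevant documents for the request.
    """
    if levels is None:
        levels = [i / 10 for i in range(1, 11)]  # 10%, 20%, ..., 100%
    points = []  # (recall, precision) observed at each relevant document
    hits = 0
    for rank, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / n_relevant, hits / rank))
    result = []
    for level in levels:
        # Small tolerance guards against floating-point noise in the levels.
        candidates = [p for r, p in points if r >= level - 1e-9]
        result.append(max(candidates, default=0.0))
    return result


# Toy ranking: relevant documents at ranks 1 and 3, two relevant in total.
values = precision_at_recall_levels([True, False, True, False, False], 2)
average_precision = sum(values) / len(values)
print(values[0], average_precision)
```

The first element of the returned list corresponds to the 10%-recall precision, and the mean of all ten corresponds to the average precision, for one request. Figures for a whole test set would be obtained by averaging over requests.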

As shown, there is a significant gap between the baseline (BL) and the unstructured NL/S queries (Table 1 and Figure 2). At 10% recall, the precision of the baseline is 37.9%, but only 15.4% for gd queries. The effect of the special dictionary is clear, but baseline queries still perform much better than sd -> gd and sd and gd queries; at 10% recall the precision of sd -> gd and sd and gd queries is, roughly, only half of the precision of the baseline. When average precision is considered, the gap in performance is even greater in favor of the baseline.

Structuring NL/S queries by means of dictionaries results in a significant improvement in performance (Table 1 and Figure 3). At 10% recall the best CLIR queries, sd and gd, give a precision of 35.9%, which is only 2.0 percentage points below the precision of the baseline queries (37.9%). The average precision of sd and gd queries is 12.9% and that of the baseline queries 16.8%. The translation method used is of great importance. At the high precision level (10%-50% recall), gd and sd -> gd queries perform much worse than sd and gd queries.

As shown in Table 2 and Figure 4, the performance of unstructured NL/WP queries is significantly below that of the baseline. Figure 4 is very much like Figure 2, which demonstrates the behavior of the unstructured NL/S queries. As in NL/S queries, structuring improves the performance of NL/WP queries significantly (Table 2 and Figure 5). The best structured cross-language queries, sd and gd, do almost as well as the baseline. For the former, precision at 10% recall is 31.1%, and for the latter 31.8%. The average precision is practically the same, 12.4% for sd and gd queries and 12.5% for the baseline. At three recall levels, 50%, 60%, and 70%, sd and gd queries give better precision figures than the baseline. The figures are, respectively, 13.2% and 12.8%, 9.5% and 8.8%, and 6.3% and 6.1% (these figures are not given in the tables of the paper).

Table 1. The performance of NL/S queries

Query type / Translation type            10%-recall P   Average P
Structured, dictionary-based facets
  GD                                     30.9           10.5
  SD --> GD                              30.4           11.3
  SD and GD                              35.9           12.9
Unstructured
  GD                                     15.4            5.1
  SD --> GD                              19.2            5.8
  SD and GD                              20.4            6.3
Structured and unstructured, baseline    37.9           16.8

[Figure missing: precision (y-axis, 0-40) plotted against recall (x-axis, 10-100) for GD, SD --> GD, SD and GD, and BL queries]

Figure 2. Precision-recall curves for unstructured NL/S queries

[Figure missing: precision (y-axis, 0-40) plotted against recall (x-axis, 10-100) for GD, SD --> GD, SD and GD, and BL queries]

Figure 3. Precision-recall curves for structured NL/S queries

Table 2. The performance of NL/WP queries

Query type / Translation type            10%-recall P   Average P
Structured, dictionary-based facets
  GD                                     24.9            9.8
  SD --> GD                              26.1           10.5
  SD and GD                              31.1           12.4
Unstructured
  GD                                     16.5            5.7
  SD --> GD                              14.6            5.0
  SD and GD                              19.3            6.5
Structured and unstructured, baseline    31.8           12.5

[Figure missing: precision (y-axis, 0-35) plotted against recall (x-axis, 10-100) for GD, SD --> GD, SD and GD, and BL queries]

Figure 4. Precision-recall curves for unstructured NL/WP queries

[Figure missing: precision (y-axis, 0-35) plotted against recall (x-axis, 10-100) for GD, SD --> GD, SD and GD, and BL queries]

Figure 5. Precision-recall curves for structured NL/WP queries

4 Findings on English to Finnish CLIR

It is possible that the specific linguistic features of Finnish as a source language or English as a target language contributed to the good performance of the structured queries in Fin-Eng CLIR (Section 3). Thus, the effectiveness of the query structuring method may depend on the languages of a CLIR system (or the direction of translation). We explored whether the method is also useful in Eng-Fin text retrieval, i.e., the case where translations are done in the opposite direction to those in the experiments presented in Section 3.

English and Finnish are different types of languages, particularly in morphology. In English, grammatical relations are indicated mainly by prepositions, while Finnish typically uses grammatical case. In Finnish, there are 14 features in the category of case. Therefore, the number of word forms that a given Finnish lexeme may take is very high, theoretically 2200 forms for nouns (Karlsson, 1987). Inflection has a depressing effect on CLIR effectiveness, but it is hard to estimate in which direction, Fin-Eng or Eng-Fin, the problem is more severe. In Fin-Eng retrieval, inflection causes difficulties especially in query processing, whereas in Eng-Fin retrieval the difficulties occur in indexing.


In Finnish, multiword expressions are typically compound words; in English they are often phrases. From the IR and CLIR perspectives, a compound word is a more convenient type of expression than a phrase, because compound decomposition is easier than phrase identification. In this respect Fin-Eng retrieval is easier than Eng-Fin retrieval. Finnish compounds can be split effectively into component words by a dictionary-based morphological analyzer. The English equivalents of the components can be combined by a proximity operator in CLIR queries (Section 3). The application of the technique in Eng-Fin retrieval requires that phrases are identified. Correct phrase identification is difficult, however.

We also studied phrase identification and a structuring technique utilizing a proximity operator. If phrases are not identified in CLIR, phrase components instead of full phrases are translated, and the senses of multi-word keys may be lost. This causes loss of retrieval effectiveness (Hull & Grefenstette, 1996). Automatic phrase identification methods involve the use of collocation statistics (Buckley et al., 1996), part-of-speech tagging (Ballesteros & Croft, 1997), and shallow syntactic analysis (Strzalkowski, 1995; Zhai et al., 1997). In cross-language retrieval where the target language is a compound language, i.e., a language where multiword expressions are compounds rather than phrases, it would be possible to recognize as phrases the adjacent request words that correspond to a compound word in the target language. Compound languages include such languages as German, Dutch, Swedish, and Finnish. A phrase identification system could be based on the translation dictionary or the database index of a retrieval system, or it may be constructed as an independent system. In the present study, we marked as phrases in the English requests the adjacent words, as well as the words separated by the preposition of, that corresponded to compound words in the Finnish requests. The Finnish equivalents of an English phrase were combined by a proximity operator (uw3) in CLIR queries. The effectiveness of phrase-based queries was compared to that of word-based queries.

Methods and data

The test collection contained around 55,000 articles published in three Finnish newspapers in 1988-1992. The average article length was 233 words. Our test environment provides 35 test requests (in Finnish) for which the relevance of 16,000 articles is known (Kekäläinen, 1999; Kekäläinen & Järvelin, 1999). Of the 35 requests, 20 were used as test requests in this study.

The requests were natural sentences. For this study, the inflected words of the requests were normalized into their base forms, and compound words were decomposed into their component words by the FINTWOL morphological analyzer. The normalized Finnish words were (1) used as search keys in the baseline queries, and (2) translated into English by the author. The translations were checked by a colleague whose native language is English. Human translation was done to get test queries that are comparable to the original Finnish queries. The English words were translated back into Finnish by an English - Finnish electronic dictionary (Section 3). As a test system we used the InQuery retrieval system.

The query structuring technique was the same as in Fin-Eng CLIR, i.e., the translation equivalents of a source language word were combined by the syn-operator of the InQuery retrieval system. In addition, in phrase-based structured queries the uw3-operator was applied to the Finnish equivalents of the English phrase components; those equivalents that corresponded to the first part of the phrase were joined by the operator to those equivalents that corresponded to the second part. All the combinations were generated. For example, the English equivalent of the Finnish compound tuotantomäärä is volume of production. This was translated back into Finnish using the electronic dictionary, and the components were combined by the uw3-operator in a phrase-based query:

#syn(#uw3(esitys erä) #uw3(esitys joukko) #uw3(esitys kvantiteetti) #uw3(esitys määrä) #uw3(esitys paljous) #uw3(esitys suure)
#uw3(produktio erä) #uw3(produktio joukko) #uw3(produktio kvantiteetti) #uw3(produktio määrä) #uw3(produktio paljous) #uw3(produktio suure)
#uw3(tuotanto erä) #uw3(tuotanto joukko) #uw3(tuotanto kvantiteetti) #uw3(tuotanto määrä) #uw3(tuotanto paljous) #uw3(tuotanto suure))
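The query above is the Cartesian product of the two sets of Finnish equivalents, with each pair wrapped in a uw3 window and all pairs grouped under one syn-operator. A minimal sketch of that construction follows; the function name is hypothetical, and the two equivalent lists are read off the query shown above rather than taken from the dictionary itself.

```python
from itertools import product


def phrase_query(first_equivalents, second_equivalents, window=3):
    """Join every pair of equivalents with #uwN and group the pairs by #syn."""
    windows = [
        "#uw%d(%s %s)" % (window, a, b)
        for a, b in product(first_equivalents, second_equivalents)
    ]
    return "#syn(" + " ".join(windows) + ")"


# Equivalents as they appear in the query above.
production = ["esitys", "produktio", "tuotanto"]
volume = ["erä", "joukko", "kvantiteetti", "määrä", "paljous", "suure"]

query = phrase_query(production, volume)
print(query)
print(query.count("#uw3("))
```

With 3 and 6 equivalents the query contains 3 x 6 = 18 proximity expressions, of which only #uw3(tuotanto määrä) corresponds to the original compound tuotantomäärä.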

The query types were as follows:

1. Finnish word-based queries, i.e., the baseline for the CLIR queries of types 2-4

2. Word-based unstructured CLIR queries

3. Word-based structured CLIR queries

4. Phrase-based structured CLIR queries

Findings

The results were evaluated as (1) average precision over ten recall points (10-100%), and as (2) precision-recall graphs. The results are presented in Table 3 and Figure 6.

As shown in Table 3, the average precision of word-based structured queries is 27.4%, while unstructured queries give a precision figure of 18.8%. The relative improvement due to structuring is 45.7% (column 3). Phrase-based structured queries perform slightly better than word-based structured queries, with the relative improvement due to structuring and phrase identification being 54.3%. As shown in column 4 of Table 3, the relative performance of CLIR queries with respect to baseline queries is 77.0% (word-based structured queries), 81.5% (phrase-based structured queries), and 52.8% (word-based unstructured queries).

Figure 6 shows precision-recall curves for CLIR and baseline queries. As can be seen, structured queries perform markedly better than unstructured queries but fall below baseline queries. Phrase-based queries perform well particularly at the 10%-recall level.

Table 3. The performance of CLIR and baseline queries

Query type                   Avg. precision (%)   Change, str. vs. unstr. (%)   Precision in relation to baseline (%)
Word-based structured        27.4                 45.7                          77.0
Word-based unstructured      18.8                 -                             52.8
Phrase-based structured      29.0                 54.3                          81.5
Baseline (Finnish) queries   35.6                 -                             100.0
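The derived columns of Table 3 follow directly from the average precision figures. The short check below reproduces them; the dictionary keys and function names are hypothetical labels introduced for this sketch.

```python
# Average precision figures from Table 3 (per cent).
avg_precision = {
    "baseline": 35.6,
    "word_unstructured": 18.8,
    "word_structured": 27.4,
    "phrase_structured": 29.0,
}


def relative_change(new, old):
    """Relative improvement of `new` over `old`, in per cent."""
    return 100 * (new - old) / old


def relative_to_baseline(value, baseline):
    """Performance of `value` as a percentage of the baseline."""
    return 100 * value / baseline


unstructured = avg_precision["word_unstructured"]
baseline = avg_precision["baseline"]
for name in ("word_structured", "phrase_structured"):
    print(
        name,
        round(relative_change(avg_precision[name], unstructured), 1),
        round(relative_to_baseline(avg_precision[name], baseline), 1),
    )
```

For example, (27.4 - 18.8) / 18.8 gives the 45.7% improvement due to structuring, and 27.4 / 35.6 gives the 77.0% figure relative to the baseline.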

[Figure missing: precision (y-axis, 0-80) plotted against recall (x-axis, 10-100) for Baseline, Unstructured, Structured, and Phrase queries]

Figure 6. Precision-recall curves for CLIR and baseline queries

5 Findings on Swedish to English CLIR

Swedish has linguistic features that affect IR performance, for example the use of “fogemorphemes” in compound words and a high frequency of homographs (Hedlund et al. 2000). When decomposing compound words, morphological analysis programs have pitfalls that affect retrieval results and query translation. Here we shall focus on the morphological decomposition of compounds and on homographs.

Compound splitting. The constituents of a compound are needed for dictionary translation, in particular when the whole compound is not in the dictionary. On the other hand, there are common compounds that are lexicalized, so that their meaning can no longer be determined on the basis of the constituents, for example jordgubbe (strawberry). Swedish, Finnish and German compounds are spelled as one word. In Swedish the components are often joined by a joining morpheme, a “fogemorpheme”. This is not the case in, for example, English or French. In Finnish the non-last components often are in the genitive case. In German the non-last component is also often in an inflected form.

Table 4 presents an example of compound splitting by SWETWOL. Here the components skog and industrin are joined with the fogemorpheme “s”. There are other fogemorphemes in Swedish as well. The latter component industrin (the industry) is normalized to the base form industri, which is a hyperonym of skogsindustri and therefore often a valuable search key. The former component skogs has retained the “s”, and is not normalized to the base form skog.

Table 4. Morphological analysis and compound splitting

Input word:           skogsindustrin (the forest industry)
Compound splitting:   skogs#industri (forest # industry)
SWETWOL analysis:     <N> # N UTR DEF SG NOM (noun # noun, common gender (utrum), definite, sg. nominative)


We have developed an algorithm for recognizing and handling fogemorphemes. It seeks to recognize the base forms of all constituents, and thereby to allow the translation of all constituents through a bilingual dictionary. The algorithm appears to work well in the query formulation process and substantially reduces the number of non-translated words in several topics. However, since we deal with constituents of compounds, the actual effect on the search result also depends on other factors, such as the extent to which the constituents carry important search keys.
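The details of the algorithm are not given here, so the following is only an illustrative sketch of the general idea, not the algorithm used in the study. It assumes a lexicon of known base forms (a hypothetical toy set below) and handles only the joining morpheme “s”, although Swedish has others.

```python
# Hypothetical toy lexicon of Swedish base forms.
BASE_FORMS = {"skog", "industri", "flyg", "plan", "olycka", "hus", "tak"}

# Only "s" is handled in this sketch; the list is deliberately incomplete.
JOINING_MORPHEMES = ("s",)


def normalize_constituents(analysis, base_forms=BASE_FORMS):
    """Strip a joining morpheme from non-final constituents of a split compound.

    analysis: analyzer-style output with '#' between constituents,
              e.g. 'skogs#industri'.
    """
    parts = analysis.split("#")
    normalized = []
    for index, part in enumerate(parts):
        is_last = index == len(parts) - 1
        # Known base forms (e.g. 'hus') are left alone even if they end in 's'.
        if not is_last and part not in base_forms:
            for morpheme in JOINING_MORPHEMES:
                stem = part[: -len(morpheme)]
                if part.endswith(morpheme) and stem in base_forms:
                    part = stem
                    break
        normalized.append(part)
    return "#".join(normalized)


print(normalize_constituents("skogs#industri"))
print(normalize_constituents("flyg#plans#olycka"))
```

Checking the stripped form against the lexicon is the design choice that matters in this sketch: without it, a constituent such as hus would be wrongly reduced to hu.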

Nevertheless, the lesson is that (CL)IR requires morphological processing of compounds that yields correct base forms. Mere component separation is not sufficient. Compound words with inflected-form components may be a frequent feature among natural languages.

Homographs. Swedish is rich in homographs with many senses. Frequent words in a language tend to have many senses, and they also tend to appear as constituents in compound words. When automating the query formulation process in CLIR we have to deal with compound words that can be morphologically split into three or four constituents. In the translation process every constituent is translated separately and then combined as a phrase with the other translated constituents. Thus the number of alternative combinations may grow very rapidly.

The following example illustrates compound splitting, homographs and the fogemorpheme algorithm. In an automated query formulation process, the Swedish word flygplansolycka (aeroplane accident) is handled as follows:

• the word plan is a homograph having the two senses “plan” and “plane”

• the morphological analysis: flyg#plans#olycka

• the fogemorpheme algorithm output: flyg#plan#olycka

• the translation process output, where every combination of constituents is a phrase and all combinations are treated as translation alternatives and synonyms, is as follows:

#SYN(#OD4(aviation plane accident)#OD4(aviation plane disaster)#OD4(aviation plane misfortune)#OD4(aviation plane calamity)#OD4(aviation flat accident)#OD4(aviation flat disaster)#OD4(aviation flat misfortune)#OD4(aviation flat calamity)#OD4(aviation level accident)#OD4(aviation level disaster)#OD4(aviation level misfortune)#OD4(aviation level calamity)#OD4(aviation ground accident)#OD4(aviation ground disaster)#OD4(aviation ground misfortune)#OD4(aviation ground calamity)#OD4(aviation plan accident)#OD4(aviation plan disaster)#OD4(aviation plan misfortune)#OD4(aviation plan calamity)

#OD4(plane plane accident)#OD4(plane plane disaster)#OD4(plane plane misfortune)#OD4(plane plane calamity)#OD4(plane flat accident)#OD4(plane flat disaster)#OD4(plane flat misfortune)#OD4(plane flat calamity)#OD4(plane level accident)#OD4(plane level disaster)#OD4(plane level misfortune)#OD4(plane level calamity)#OD4(plane ground accident)#OD4(plane ground disaster)#OD4(plane ground misfortune)#OD4(plane ground calamity)#OD4(plane plan accident)#OD4(plane plan disaster)#OD4(plane plan misfortune)#OD4(plane plan calamity))

Without the fogemorpheme algorithm we would not have been able to translate the word plans, since it is not in base form, and the translation would be as follows:

#SYN(#OD4(aviation plans accident)#OD4(aviation plans disaster)#OD4(aviation plans misfortune)#OD4(aviation plans calamity)#OD4(plane plans accident)#OD4(plane plans disaster)#OD4(plane plans misfortune)#OD4(plane plans calamity))

The non-translated constituent would thus ruin the translation of the compound. We are currently evaluating how such NLP processing affects search effectiveness in large test collections.
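The construction of the structured query in this example can be sketched as follows. The sketch is illustrative rather than the implementation used in the study. The toy dictionary TOY_DICTIONARY and the function name build_syn_query are assumptions; the translation alternatives are those appearing in the example queries above. A constituent that is missing from the dictionary is passed through untranslated.

```python
from itertools import product
from typing import Dict, List, Sequence

# Hypothetical toy dictionary holding only the alternatives from the example.
TOY_DICTIONARY: Dict[str, List[str]] = {
    "flyg": ["aviation", "plane"],
    "plan": ["plane", "flat", "level", "ground", "plan"],
    "olycka": ["accident", "disaster", "misfortune", "calamity"],
}


def build_syn_query(
    constituents: Sequence[str],
    dictionary: Dict[str, List[str]],
    window: int = 4,
) -> str:
    """Combine every translation alternative of every constituent.

    Each combination becomes an ordered-window phrase, and all phrases
    are wrapped in one synonym operator.
    """
    # An untranslatable constituent is kept as it is.
    alternatives = [dictionary.get(c, [c]) for c in constituents]
    phrases = [
        f"#OD{window}({' '.join(combo)})" for combo in product(*alternatives)
    ]
    return "#SYN(" + "".join(phrases) + ")"


normalized = build_syn_query(["flyg", "plan", "olycka"], TOY_DICTIONARY)
unnormalized = build_syn_query(["flyg", "plans", "olycka"], TOY_DICTIONARY)

print(normalized.count("#OD4("))    # 2 * 5 * 4 = 40 phrases
print(unnormalized.count("#OD4("))  # 2 * 1 * 4 = 8 phrases
```

The number of phrases is the product of the numbers of translation alternatives of the constituents, which is why the combinations grow rapidly when the constituents are homographs.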

6. Discussion and Conclusions

We have discussed dictionary-based cross-language information retrieval (CLIR) methods and reported recent findings and the problems encountered. We considered three language pairs for CLIR: Finnish to English, English to Finnish, and Swedish to English. Finnish and Swedish are rather different from English: both are rich in compounds, Finnish has a rich inflectional morphology, and Swedish text very frequently contains homographs. In summary, our findings suggest that:

• query structuring through synonym sets is a simple and essential tool for dictionary-based CLIR effectiveness; query structuring performs disambiguation indirectly;

• the parallel use of general and special dictionaries improves effectiveness; in different types of collections, e.g., domain specific collections, the sequential application of the special dictionary followed by the general dictionary (sd -> gd) may perform better;

• word by word translation of natural language request sentences yields performance comparable to (or better than) that achieved by selecting source keys and phrases; languages rich in compounds have the additional advantage that source language compounds trivially suggest target language phrases;

• proper names are generally not translatable and may pose matching problems due to differing spelling (transliteration) and inflection; proper names thus require other translation techniques, such as those based on n-grams;

• when compound word components may inflect or are joined together by special morphemes (fogemorphemes), compound splitting must recognize the correct base form of the components; otherwise component translation is endangered.

Linguistic features of both the source language (for retrieval) and the target language (for both indexing and retrieval) must be observed for successful CLIR. Given the variety of natural languages that may be used for CLIR, many problems remain to be studied in this research area.

References

Alkula, R. & Honkela, T. 1992. Tekstin tallennus- ja hakumenetelmien kehittäminen suomen kielen tulkintaohjelmien avulla. FULLTEXT-projektin loppuraportti. [Linguistic processing and retrieval techniques in Finnish fulltext databases. Final report of the FULLTEXT project.] VTT julkaisuja - publikationer 765. Espoo: VTT.


Ballesteros, L. & Croft, W.B. 1997. Phrasal translation and query expansion techniques for cross-language information retrieval. In: Proceedings of the 20th ACM SIGIR Conference: 84-91.

Ballesteros, L. & Croft, W.B. 1998. Resolving ambiguity for cross-language retrieval. In: Proceedings of the 21st Annual International ACM SIGIR Conference: 64-71.

Broglio, J., Callan, J. & Croft, W.B. 1994. Inquery system overview. In: Proceedings of the TIPSTER Text Program (Phase I): 47-67.

Brown, P.F., Della Pietra, S.A., Della Pietra, V.J. & Mercer, R.L. 1991. Word-sense disambiguation using statistical methods. In: Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics: 264-270.

Buckley, C., Singhal, A., Mitra, M. & Salton, G. 1996. New retrieval approaches using SMART: TREC-4. In: The Fourth Text REtrieval Conference (TREC-4), Gaithersburg, MD. Available at: http://trec.nist.gov/pubs/trec4/t4_proceedings.html

Dagan, I., Itai, A. & Schwall, U. 1991. Two languages are more informative than one. In: Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics: 130-137.

Davis, M. 1997. New experiments in cross-language text retrieval at NMSU's Computing Research Lab. In: The Fifth Text REtrieval Conference (TREC-5), Gaithersburg, MD. Available at: http://trec.nist.gov/pubs/trec5/t5_proceedings.htm

Grishman, R. 1986. Computational linguistics: an introduction. Cambridge: Cambridge University Press.

Guthrie, J.A., Guthrie, L., Wilks, Y. & Aidinejad, H. 1991. Subject-dependent co-occurrence and word sense disambiguation. In: Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics: 146-152.

Harman, D. 1991. How effective is suffixing? Journal of the American Society for Information Science 42(1): 7-15.

Hedlund, T., Pirkola, A. & Järvelin, K. 2000. Aspects of Swedish morphology and semantics from the perspective of mono- and cross-language information retrieval. Information Processing & Management 37, to appear.

Hirst, G. 1987. Semantic interpretation and the resolution of ambiguity. Cambridge: Cambridge University Press.


Hull, D. 1996. Stemming algorithms: a case study for detailed evaluation. Journal of the American Society for Information Science 47(1): 70-84.

Hull, D. & Grefenstette, G. 1996. Querying across languages: A dictionary-based approach to multilingual information retrieval. In: Proceedings of the 19th ACM SIGIR Conference: 49-57.

Karlsson, F. 1987. A Finnish grammar. Porvoo: WSOY.

Kekäläinen, J. 1999. The effects of query complexity, expansion and structure on retrieval performance in probabilistic text retrieval. Ph.D. Thesis, University of Tampere.

Kekäläinen, J. & Järvelin, K. 1999. The co-effects of query structure and expansion on retrieval performance in probabilistic text retrieval. Information Retrieval 1(4): 329-344.

Koskenniemi, K. 1983. Two-Level Morphology: A General Computational Model for Word-Form Recognition and Production. Ph.D. Thesis, University of Helsinki.

Krovetz, R. 1993. Viewing morphology as an inference process. In: Proceedings of the 16th ACM SIGIR Conference: 191-202.

Krovetz, R. & Croft, W.B. 1989. Word sense disambiguation using machine-readable dictionaries. In: Proceedings of the 12th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval: 127-136.

Krovetz, R. & Croft, W.B. 1992. Lexical ambiguity and information retrieval. ACM Transactions on Information Systems 10(2): 115-141.

McRoy, S.W. 1992. Using multiple knowledge sources for word sense disambiguation. Computational Linguistics 18(1): 1-30.

Nie, J.-Y., Simard, M., Isabelle, P. & Durand, R. 1999. Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the Web. In: Proceedings of the 22nd ACM SIGIR Conference: 74-81.

Oard, D. & Dorr, B. 1996. A survey of multilingual text retrieval. Technical Report UMIACS-TR-96-19. University of Maryland, Institute for Advanced Computer Studies.

Pirkola, A. 1998. The effects of query structure and dictionary setups in dictionary-based cross-language information retrieval. In: Proceedings of the 21st Annual International ACM SIGIR Conference: 55-63.


Pirkola, A. 1999. Studies on linguistic problems and methods in text retrieval. Ph.D. Thesis, University of Tampere.

Pirkola, A. 1999. Homonymy in cross-language retrieval. University of Tampere, Department of Information Studies. Unpublished manuscript.

Pirkola, A., Keskustalo, H. & Järvelin, K. 1999. The effects of translation method, conjunction, and facet structure on concept-based cross-language retrieval. Information Retrieval 1: 217-250.

Porter, M.F. 1980. An algorithm for suffix stripping. Program 14: 130-137.

Puolamäki, D., Pirkola, A. & Järvelin, K. 2000. Applying Query Structuring in Cross-Language Retrieval. Manuscript.

Sanderson, M. 1994. Word sense disambiguation and information retrieval. In: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval: 142-151.

Sanderson, M. 1997. Word sense disambiguation and information retrieval. Ph.D. Thesis, University of Glasgow, Department of Computing Science.

Sheridan, P., Braschler, M. & Schäuble, P. 1997. Cross-language information retrieval in a multilingual legal domain. In: Peters, C. & Thanos, C., eds. Research and Advanced Technology for Digital Libraries. First European Conference, ECDL '97. Lecture Notes in Computer Science, 1324: 253-268.

Schütze, H. & Pedersen, J.O. 1995. Information retrieval based on word senses. In: Proceedings of the Symposium on Document Analysis and Information Retrieval: 161-175.

Sperer, R. & Oard, D.W. 2000. Structured translation for cross-language IR. In: Proceedings of the 23rd Annual International ACM SIGIR Conference: .

Strzalkowski, T. 1995. Natural language information retrieval. Information Processing & Management 31(3): 397-417.

Teleman, Hellberg, & Andersson, E. 1999. Svenska Akademiens grammatik 1-4 [Grammar of the Swedish Academy 1-4]. Stockholm: Svenska Akademien.

Voorhees, E.M. 1993. Using WordNet to disambiguate word senses for text retrieval. In: Proceedings of the 16th Annual International ACM SIGIR Conference: 171-180.


Zhai, C., Tong, X., Milic-Frayling, N. & Evans, D.A. 1997. Evaluation of syntactic phrase indexing - CLARIT NLP track report. In: The Fifth Text REtrieval Conference (TREC-5), Gaithersburg, MD. Available at: http://trec.nist.gov/pubs/trec5/t5_proceedings.html