Cross-language information retrieval with the UMLS metathesaurus


David Eichmann, School of Library and Information Science, The University of Iowa, david-eichmann@uiowa.edu

Miguel E. Ruiz, School of Library and Information Science, The University of Iowa, mruiz@cs.uiowa.edu

Padmini Srinivasan, School of Library and Information Science, The University of Iowa, [email protected]

Abstract

We investigate an automatic method for Cross Language Information Retrieval (CLIR) that utilizes the multilingual UMLS Metathesaurus to translate Spanish and French natural language queries into English. Two experiments are presented using OHSUMED, a subset of MEDLINE. Both experiments examine retrieval effectiveness of the translated queries. However, in the second experiment, the query translation procedure is augmented with digram based vocabulary normalization procedures. In this comparative study of retrieval effectiveness the measures used are: 11-point-average precision score (11-AvgP); average interpolated precision at recall of 0.1; and noninterpolated (i.e., exact) precision after 10 retrieved documents. Our results indicate that for Spanish the UMLS Metathesaurus based CLIR method appears equivalent to multilingual dictionary based approaches investigated in the current literature. French yields less favorable results and our analysis suggests that linguistic differences may have caused the performance differences.

1 Introduction

Cross Language Information Retrieval (CLIR) refers to retrieval when the query and the database are in different languages. This form of retrieval is increasingly relevant as network-based resources become commonplace. There are several ways of handling CLIR. One approach that has received significant attention is to translate the query, thereby transforming the CLIR problem into a monolingual information retrieval (MLIR) problem for which there are standard solutions [2, 8]. A second approach is to translate the document [19]. A third approach receiving increasing attention is to automatically establish associations between queries and documents independent of language difference [6, 10, 21]. CLIR methods involving machine translation systems, bilingual dictionaries, parallel and comparable collections are currently being explored. Multilingual thesauri (or controlled vocabularies), however, are an underrepresented class of CLIR resources. We present here an investigation of the UMLS (Unified Medical Language System) [20] Metathesaurus, a product of the National Library of Medicine, as a resource for free-text retrieval against a MEDLINE test database (English) given Spanish and French queries. In Oard’s hierarchical classification scheme of the CLIR methods [17], our work falls under the thesaurus based free-text CLIR category.

Permission to make digital/hard copy of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or fee. SIGIR’98, Melbourne, Australia © 1998 ACM 1-58113-015-5 8/98 $5.00.

In pure thesaurus based retrieval, documents and queries are matched through their thesaurus based representations, with document representations derived by an indexer and query representations provided by users. Extending this to CLIR is straightforward given a multilingual thesaurus. However, there are at least two problems: it can be difficult for users to think in terms of a controlled vocabulary [17]; and this retrieval method ignores the free-text portions of documents during retrieval. Our thesaurus based CLIR approach seeks to overcome both problems, allowing free-text user queries and considering the free-text portions of documents during retrieval. More generally, this research is motivated by the fact that, relative to dictionaries and collection based strategies, thesauri remain unexplored in the recent CLIR context.

2 Background and Related Work

Major approaches for CLIR include bilingual dictionaries [3, 7, 14], parallel collections [4, 7, 10, 6] and comparable collections [26] or some combination of these. Documents of a comparable collection may be aligned at the document, sentence or even word level. Comparable collections raise interesting research questions, such as alignment strategies and the measurement of ‘domain shift’ as explored, for example, by Oard [17].

Methods based on dictionaries typically begin by deriving a transfer dictionary specifying term equivalences across languages, which is then applied to query translation. Hull and Grefenstette use an online English-French dictionary to translate 50 queries from the TIPSTER collection [14]. Query words are first morphologically reduced to their root forms and then substituted by dictionary equivalents, yielding an average precision at 5, 10, 15 and 20 retrieved documents of 0.235, compared to an MLIR baseline of 0.393. As the authors and other researchers point out, this method is troubled by incomplete dictionaries and ambiguities in translation. Disambiguation strategies are typically employed to reduce translation errors. Davis [7] explores three different disambiguation strategies while using the Collins English-Spanish dictionary to translate 25 Spanish queries into English. The first uses part of speech (POS) information to constrain translation. The second uses a parallel corpus aligned at the sentence and double-sentence levels to select Spanish terms that have the most in common with the English query. The third combines POS with a corpus based refinement strategy, yielding the best performance, 73.5% of MLIR (non interpolated average precision), with the corpus based strategy adding the last 6%. In an earlier paper Davis and Dunning apply an evolutionary approach, optimizing for translated query performance using parallel collections [9].

The dictionary based CLIR work by Ballesteros and Croft [3] takes a different slant to reducing translation ambiguity, exploring the value of pre- and post-translation query expansion strategies. Their evaluation with 25 queries of a TREC database using the Collins English-Spanish dictionary indicates that expanding the query both before and after translation reduces errors and yields 68% of the corresponding MLIR baseline. Their research also supports the findings of Hull and Grefenstette [14] that phrase translations are important for CLIR.

Corpus based methods have also been investigated independent of dictionaries. Davis and Dunning replace English query terms by the 100 most frequent terms in the top 100 documents retrieved from the Spanish side of an English-Spanish parallel collection. Similarly, the CL-LSI approach [10] uses a parallel training corpus to compute a mapping from sparse term-based vectors to short and dense conceptual vectors, suppressing cross language variations. More recently the generalized vector space model has shown good potential for CLIR [6].

Thirteen groups participated in the CLIR track introduced in TREC-6, with documents and queries in German, English, French and queries in Dutch and Spanish as well. Dictionary based CLIR was explored by several groups including New Mexico State University [8], University of Massachusetts [1], and the Xerox Research Center Europe [11]. Groups such as ETH [15], and a collaboration between the University of Colorado, Duke University and Microsoft [21] investigated corpus based methods. Of particular interest to us is the ETH approach using similarity thesauri constructed from comparable documents of the SDA collection in French and German. The construction process used derives from term similarity as described in [25]. Unique angles in TREC-6 include document translation based CLIR [19] explored by the University of Maryland using the LOGOS system. According to the authors, it appears that document translation performs at least as well as query translation. As reported in [24], another interesting angle in the CLIR track is the approach taken by Cornell University wherein they exploit the fact that there are many similar looking words between French and English, i.e., near cognates. Interestingly, this assumption yielded good results in the English-French CLIR runs. As summarized by Schäuble and Sheridan [24] the TREC-6 CLIR results appear consistent with previous results in that the performances typically range between 50 and 75% of the corresponding monolingual baselines. From our perspective, it is evident that given the nature of the TREC collections, CLIR approaches based upon multilingual thesauri remain difficult to explore. Our approach to CLIR in MEDLINE is to exploit the UMLS Metathesaurus and its multilingual components.

Soergel describes a general framework for the use of multilingual thesauri in CLIR [27], noting that a number of operational European systems employ multilingual thesauri (such as UDC and LCSH) for indexing and searching. However, except for very early work with small databases [22], there has been little empirical evaluation of multilingual thesauri (controlled vocabularies) in the context of free-text based CLIR, particularly when compared to dictionary and corpus-based methods. This may be due to the expense of constructing multilingual thesauri, but this expense is unlikely to be any more than that of creating bilingual dictionaries or even realistic parallel collections. In fact, ongoing efforts such as the EC-funded EuroWordNet project indicate that such resources can be built collaboratively and semi-automatically [12]. In EuroWordNet, the goal is to extend the WordNet thesaurus [16] to include Dutch, Italian, Spanish and English words. Multilingual thesauri can be built quite effectively by merging existing monolingual thesauri [27]; the UMLS Metathesaurus is an excellent current example. Combining the UMLS Metathesaurus with a MEDLINE test database enables an empirical investigation of a high quality multilingual thesaurus as a resource for free-text based CLIR using two broad approaches: document translation and query translation. We investigate query translation based CLIR here.

Our approach is independent of stemmers, part of speech taggers and parsers. (We wish to get baseline results first before involving these additional techniques.) Comparable approaches include those conducted using bilingual dictionaries and similarity thesauri. In general these strategies yield performance scores in the range of 50 to 75% of the corresponding monolingual baselines. Our goal is to assess the UMLS Metathesaurus based CLIR approach within this context. The reader is referred to the technical report by Oard and Dorr for an excellent review of the CLIR literature [18].

3 Methods

3.1 OHSUMED Test Set.

We utilize the OHSUMED test database, a subset of the MEDLINE database, extracted for retrieval research [13]. This database is accompanied by a collection of 106 English language queries¹. For our cross language experiments, these 106 queries are first translated into Spanish by a native Spanish speaker and into French by the Translation Laboratory at the University of Iowa. The Spanish/French versions are then translated back into English by our automatic method. The original English queries provide our baseline performance.

3.2 Retrieval System

We use SMART [23] to identify appropriate Spanish/French UMLS phrases for each query and to run the retrieval experiments.

¹ We use the corrected versions of these queries. For all but 5 queries, relevant document subsets are known. Please see ftp://medir.ohsu.edu/pub/ohsumed for details. We use the 233,445 document subset that contains abstracts and MeSH phrases for each document.


3.3 Unified Medical Language System (UMLS)

The UMLS, a vocabulary system produced by the National Library of Medicine, has four components: the Metathesaurus, Semantic Network, Information Sources Map and the SPECIALIST Lexicon [20]. We use only the Metathesaurus, an integration of more than 40 independent vocabularies in the health care domain.

The Metathesaurus model involves the notions of ‘concept,’ ‘term’ and ‘string.’ Lexical variants are linked under the same term, while variations (such as case) only define independent strings, with certain strings designated as preferred forms for each concept. The 1997 Metathesaurus contains 331,756 concepts, 571,768 terms and 739,439 strings. The Metathesaurus is multilingual. French, Spanish, Portuguese and German translations of the MeSH subset of the UMLS are linked to their Concept hierarchies. There are 23,198, 23,093, 18,429 and 18,277 MeSH concepts with Spanish, Portuguese, German and French strings respectively. We investigate both Spanish (the most represented) and French (the least represented) languages in the UMLS. This pair will allow us to examine the effect of representation level on CLIR performance.

3.4 Transfer Dictionaries Derived from the UMLS

The remaining resource used is a transfer dictionary, which we derive from the multilingual subset of the Metathesaurus. A transfer dictionary specifies phrase equivalences tied to common concepts². The Spanish information (23,198 concepts, 32,282 unique strings and 22,891 unique words) and French information (18,277 concepts, 25,932 unique strings, and 18,179 unique words) form the foundation of our approach. Each language has an index file (mrwx.spa for Spanish and mrwx.fre for French), provided as part of the UMLS, which contains the unique Spanish/French words found in the Metathesaurus and links them to their associated Concept numbers. The indexes hence also serve as indexes for our transfer dictionaries.

The Spanish or French query arrives as a (potentially ill-formed) sentence. The Spanish/French MeSH entries of the transfer dictionaries derived from the Metathesaurus contain phrases. Hence we must first identify appropriate Spanish/French MeSH phrases for a query before translating these into English using the dictionaries. The effectiveness of the CLIR process depends upon this first non-trivial categorization step.

The simplest selection strategy is to use the word based indexes (mrwx.spa and mrwx.fre) for the Spanish/French MeSH phrases to pull out all phrases that contain at least one of the query words³. However, we would like to identify the ‘set’ of Spanish/French MeSH phrases for the query as a ‘whole.’ For example, we would like more important query terms to have a greater role in the selection of MeSH terms than less important terms. We would like to weight both the query words and the MeSH phrases by their statistical features (IDF, DF etc.) and consider these weights during phrase selection. A standard index lookup procedure ignores such weights.

Finally, comparable to the disambiguation strategies in dictionary based research, MeSH phrases selected in this SMART based procedure go through a pruning phase to remove irrelevant entries as described in Section 3.6.

² Researchers have remarked upon the non-trivial effort required in deriving a transfer dictionary [14] from bilingual dictionaries. In contrast, our phrase equivalences are created using straightforward UNIX shell commands such as grep. Recently Brown [4] tested a relatively straightforward method for deriving a transfer dictionary from a sentence aligned parallel corpus.

³ This method is similar to those used to determine word-by-word translations from dictionaries [2, 14].
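The simplest selection strategy of Section 3.4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the concept identifiers, the in-memory layout of the word index and of the transfer dictionary, and the function name candidate_phrases are all hypothetical, and the toy entries merely stand in for what would be extracted from mrwx.spa and the Metathesaurus files.

```python
# Hypothetical miniature data; real entries would be derived from the UMLS files.
# Word index: Spanish word -> concept IDs whose Spanish strings contain that word
# (the role played by mrwx.spa). The concept IDs are made up for illustration.
WORD_INDEX = {
    "terapia": {"C001", "C002"},
    "reemplazo": {"C001"},
    "estrogenos": {"C003"},
}

# Transfer dictionary: concept ID -> (Spanish phrase, English phrase).
TRANSFER = {
    "C001": ("terapia de reemplazo de estrogeno", "estrogen replacement"),
    "C002": ("terapia", "therapy"),
    "C003": ("estrogenos", "estrogens"),
}


def candidate_phrases(query_words):
    """Return (concept, Spanish phrase, English phrase) for every concept
    that shares at least one word with the query. No weighting is applied."""
    concepts = set()
    for word in query_words:
        concepts |= WORD_INDEX.get(word, set())
    return sorted((c, *TRANSFER[c]) for c in concepts)


for row in candidate_phrases(["terapia", "de", "reemplazo"]):
    print(row)
```

Because every matching concept is returned regardless of how informative the matching word is, a lookup of this kind tends to over-generate, which is the limitation that motivates weighted selection.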

3.5 Selecting Spanish/French MeSH Phrases for Queries

We first create a SMART database (see Table 1) for each language using its UMLS index file. There are a total of 22,891 unique Spanish word entries in the mrwx.spa index and hence in the database, and 18,179 records in the French index database. These are indexed by SMART without stemming following the removal of stopwords⁴, using the atc weighting scheme. Two separate index vectors are created, one for the .W field and the other for the .C field. We then retrieve database records for the free-text Spanish/French queries, indexed using the atn scheme. We retrieve by comparing the words in the Spanish/French query with the .W field of the database records and analyzing the top N records to identify the M most important concepts. It is essentially in this step that SMART offers the advantage of weights to distinguish between the concepts. We then temporarily assign the Spanish/French phrases corresponding to the selected M concepts to the query⁵. Readers familiar with previous query expansion work with MEDLINE [28] and TREC [5] may recognize the ‘retrieval feedback’ or ‘nearest neighbor’ flavor in this approach.

A sample query and the Spanish concepts identified in this step appear at the top of Table 2. (Since the procedures used are identical for both Spanish and French, we limit our examples and tables to Spanish for simplicity.) Mask numbers (i.e., word positions in the query, ignoring trivial words) appear next to non-trivial Spanish query words. For each concept, the table shows the concept#, string#, Spanish phrase, mask# values and English phrase.
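The top-N records, top-M concepts selection can be approximated as follows. This is a sketch under loud assumptions: it does not reproduce SMART or its atc/atn weights, but substitutes a simple IDF-like weight (words linked to fewer concepts count more), and the function name select_concepts and its parameters are hypothetical. The defaults of 10 records and 15 concepts mirror the values of N and M reported in the text.

```python
import math
from collections import Counter


def select_concepts(query_words, word_index, n_records=10, m_concepts=15):
    """Rank concepts by the summed weight of the query-word records listing them.

    word_index maps a word to the collection of concept IDs in its record.
    The weight is a stand-in for SMART's term weighting, not a copy of it.
    """
    total = len(word_index)
    scored = []
    for word in set(query_words):
        concepts = word_index.get(word)
        if concepts:
            # Assumed IDF-like weight: rarer words contribute more.
            weight = math.log(1 + total / len(concepts))
            scored.append((weight, word))
    scored.sort(reverse=True)

    concept_scores = Counter()
    for weight, word in scored[:n_records]:  # keep the top N records
        for concept in word_index[word]:
            concept_scores[concept] += weight
    # Keep the M highest scoring concepts.
    return [c for c, _ in concept_scores.most_common(m_concepts)]


index = {"cancer": ["C1"], "de": ["C1", "C2", "C3"], "pecho": ["C2"]}
print(select_concepts(["cancer", "de"], index, m_concepts=1))
```

In this toy index the concept reached through both query words outranks those reached only through the uninformative word, which is the behaviour the weighted selection is meant to provide.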

3.6 Refining the Selected Set of Spanish/French MeSH Phrases

Many of the phrases are irrelevant to the query, as shown in Table 2, so we next refine this set, selecting from a number of strategies. In combination strategies, each refinement step acts only upon the Spanish query that remains after the previous refinement step. We use the example of Table 2 to explain these strategies.

• Full Matches (FM):

Only MeSH phrases composed entirely of query words are retained. We always carry out this refinement procedure. (The remaining optional refinement strategies are tested only following this full match criterion.)

Selected Spanish Phrases: causa, cancer, pecho, estrogenos
Final English Query: causation, cancer, thorax, estrogens

⁴ We use a 351 word Spanish stoplist and a 355 word French stoplist.

⁵ Given the database schema, each query word is going to retrieve at most one database record. Thus the best value for N is the number of informative query words. After examining the test queries, this was set to 10. Since we have follow-up refinement steps in our CLIR approach, we set M, the number of concepts identified for each query, to 15.


Database Schema

Field   Explanation
.I      Record ID.
.W      Spanish word in index (single word).
.C      List of UMLS Concept numbers in which the Spanish word occurs.

Example Database Record

Field   Value
.I      600
.W      agudo
.C      C0000727 COO322Ql C0036436 C0242934

Table 1: SMART Database: Schema and Example Record

[Table 2 body not recoverable from the source. Header rows: English Query, Spanish Query. Columns: Concept#, String#, Spanish Phrase, Mask#, English Phrase.]

Table 2: Example to Illustrate Refinement Procedures.

Strategy          Spanish         French

Baseline (0.2431)
FM                0.1559 (64%)    0.1117 (46%)
FM+PM             0.1597 (66%)    0.1040 (43%)
FM+D              0.1610 (66%)    0.1276 (52%)
FM+A              0.1673 (69%)    0.1329 (55%)
FM+PM+D           0.1682 (69%)    0.1028 (42%)
FM+PM+A           0.1707 (70%)    0.1090 (45%)
FM+D+A            0.1737 (71%)    0.1493 (61%)
FM+PM+D+A         0.1728 (71%)    0.1084 (45%)

Table 3: Experiment 1: 11-AvgP scores.


• Partial Matches (PM):

Partial matching phrases that best cover the remaining portion of the query are retained. We sort phrases by mask combination and retain the shortest phrase corresponding to each unique combination, ignoring stopwords when calculating phrase length. If more than one phrase qualifies, we choose the one with the smallest String # (given by the UMLS producers). Applying this strategy after the FM strategy yields:

Remaining Spanish Query: de la terapia (4) de reemplazo (5) de
Selected Spanish Phrases: terapia de reemplazo de estrogeno, artroplastia de reemplazo
English Translations: estrogen replacement, joint prosthesis
Final English Query: causation, cancer, thorax, estrogens, estrogen replacement, joint prosthesis

• Word Based Translation (D):

A special word based translation dictionary limited to the (remaining) query words is built as follows. For a given word, identify the Spanish MeSH phrases containing it and extract the corresponding English MeSH phrases. Then list the words in these phrases and select the most frequent word as the translation. Applying this strategy after the FM strategy yields:

Remaining Spanish Query: de la terapia (4) de reemplazo (5) de
Selected English Phrases: therapy, replacement
Final English Query: causation, cancer, thorax, estrogens, therapy, replacement

Thus the remaining query words ‘terapia’ and ‘reemplazo’ are correctly translated.

• Addition of Spanish query words (A):

Any remaining Spanish query words are simply added to the final query. Applying this strategy after the FM strategy yields:

Remaining Spanish Query: de la terapia (4) de reemplazo (5) de
Final English Query: causation, cancer, thorax, estrogens, terapia, reemplazo

We tested the above refinement steps in several combinations, with FM included as each combination’s initial step. Note that the same refinement procedures are used for the French collection.
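The FM+D+A combination, which performs best in the experiments that follow, can be sketched as below. This is a simplified illustration and not our implementation: the phrase pairs, the tiny stoplist and the function names are hypothetical, stopwords inside a phrase are ignored when testing for a full match (an assumption), and the PM strategy with its mask bookkeeping is omitted.

```python
from collections import Counter

STOPWORDS = {"de", "la", "del", "el"}  # tiny illustrative stoplist

# Hypothetical (Spanish phrase, English phrase) pairs selected for one query.
PHRASES = [
    ("causa", "causation"),
    ("cancer", "cancer"),
    ("terapia de reemplazo de estrogeno", "estrogen replacement therapy"),
    ("terapia de reemplazo de hormona", "hormone replacement therapy"),
    ("terapia respiratoria", "respiratory therapy"),
    ("reemplazo de cadera", "hip replacement"),
]


def full_matches(query_words, phrases):
    """FM: keep phrases whose non-stopword words all occur in the query."""
    query = set(query_words)
    kept = []
    for es, en in phrases:
        content = [w for w in es.split() if w not in STOPWORDS]
        if content and all(w in query for w in content):
            kept.append((es, en))
    return kept


def word_translation(word, phrases):
    """D: most frequent English word among phrases containing the Spanish word."""
    counts = Counter()
    for es, en in phrases:
        if word in es.split():
            counts.update(en.split())
    return counts.most_common(1)[0][0] if counts else None


def refine(query_words, phrases):
    """Apply FM, then D to the remaining words, then A to what is still left."""
    kept = full_matches(query_words, phrases)
    covered = {w for es, _ in kept for w in es.split()}
    english = [en for _, en in kept]
    for word in query_words:
        if word in covered or word in STOPWORDS:
            continue
        # A: fall back to the untranslated Spanish word.
        english.append(word_translation(word, phrases) or word)
    return english


print(refine(["causa", "cancer", "terapia", "reemplazo"], PHRASES))
```

Each step only sees the query words left over by the previous one, matching the rule that refinement acts on the remaining Spanish query.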

4 Retrieval Experiment 1.

The goal is to evaluate the retrieval effectiveness of the UMLS Metathesaurus based query translation strategies. Searches are conducted against the free-text, i.e., title and abstracts of the OHSUMED documents. For each run the baseline is defined by retrieval using the original English query that came with the OHSUMED database⁶. Indexing is done using stemming and after eliminating stopwords. Based on previous experimentation, ann weights are used on documents and atn on queries⁷.

⁶ We recognize that this baseline ignores any effects of the translation of the original English query into Spanish and French by our native speaker.

Recall        Baseline    Spanish        French

0.00          0.5454      [illegible]    0.3682
0.10          0.4600      0.3270         0.2698
0.20          0.3598      0.2578         0.2196
0.30          0.2980      0.2117         0.1735
0.40          0.2476      0.1683         0.1485
0.50          0.2182      0.1487         0.1310
0.60          0.1722      0.1184         0.0993
0.70          0.1722      0.0967         0.0873
0.80          0.1075      0.0773         0.0653
0.90          0.0777      0.0535         0.0484
1.00          0.0492      0.0341         0.0316
11-AvgP       0.2431      0.1737         0.1493
% baseline    100%        71%            61%

Table 4: Experiment 1: Precision Scores at 11 Standard Recall Points.

4.1 Performance Measures

We use three performance measures. The first is 11-AvgP or the average of precision at 11 standard recall points (0.0, 0.1, 0.2, ..., 1.0). In CLIR, given the expense of translation, a user is likely to be interested in the top few retrieved documents. Thus, our second measure is average interpolated precision at 0.10 recall. However, since the actual numbers behind this level of recall can vary considerably across queries, we also compute the noninterpolated (i.e., exact) precision scores for the top ranking documents, and focus particularly on the top 10 documents.
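For concreteness, the three measures can be computed for a single query as in the Python sketch below. The function names are ours, the interpolation rule is the usual one (the highest precision observed at any recall at or above the given level), and the reported scores would additionally be averaged over all queries; none of this is taken from the evaluation code actually used.

```python
def precision_at_k(ranked, relevant, k=10):
    """Exact (noninterpolated) precision over the top k retrieved documents."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def interpolated_precision(ranked, relevant, levels=None):
    """Interpolated precision at each recall level for one query."""
    if levels is None:
        levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    points = []  # (recall, precision) observed at each relevant document
    hits = 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    # Best precision at any recall >= level; 0.0 if that recall is never reached.
    return [
        max((p for r, p in points if r >= level - 1e-12), default=0.0)
        for level in levels
    ]


def eleven_point_average(ranked, relevant):
    """11-AvgP: mean interpolated precision over the 11 standard recall points."""
    values = interpolated_precision(ranked, relevant)
    return sum(values) / len(values)


ranking = ["d1", "d2", "d3", "d4"]
relevant_docs = {"d1", "d3"}
print(eleven_point_average(ranking, relevant_docs))
print(interpolated_precision(ranking, relevant_docs)[1])  # precision at 0.10 recall
print(precision_at_k(ranking, relevant_docs, k=2))
```

The second value printed corresponds to the interpolated precision at 0.10 recall, and the third to exact precision after a fixed number of retrieved documents.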

4.2 Results and Analysis

Table 3 presents the 11-AvgP scores. Abbreviations refer to the particular refinement strategies (FM: full match; PM: partial match; D: dictionary based and A: simple addition of Spanish/French query words). The FM+D+A refinement strategy is the best, achieving 71% and 61% of baseline for Spanish and French respectively⁸. It is not surprising that strategy A improves performance since medical terms in Spanish, French and English often have the same Latin roots.

Table 4 indicates that for the FM+D+A runs, the average interpolated precision at 0.10 recall, i.e., when 10% of the relevant documents have been retrieved, is 0.3270 (71% of baseline) and 0.2698 (59% of baseline) for Spanish and French respectively. Table 5 shows that the performance achieved within the top 10 ranks is 79% and 51% of baseline respectively. Across all measures the performance range achieved is 71% to 79% (Spanish) and 51% to 61% (French) of monolingual performance.

⁷ Since document and query lengths do not vary significantly there is no need to normalize the weights.

⁸ The ‘A’ indicates that any remaining Spanish/French words were simply added to the final query. For example, there are 98 untranslated Spanish query words (81 unique words) out of a total of 538 query words (356 unique words) for the runs represented in Table 4 and Table 5.


5.2 Digram Based Results

The previous CLIR runs were repeated with the differ- ence of including digram based vocabulary normalization procedures into the query translation process. All three performance scores for the two alternative query formats (Q’ and Q”) were computed. Similar to the previous ex- periment, the best runs are obtained with the FM+D+A combination of refinement strategies for both query types and languages. The digram approach, without the addi- tional length constraints, i.e., Q’, consistently gives bet- ter results comnared to Q”. The best result obtained for Spanish is 0.1832 ll-AvgP (75% baseline); 0.3493 aver- age interpolated precision at 0.1 recall (76% baseline) and 0.2179 exact precision at 10 retrieved documents (81% baseline). For French, the corresponding figures are 0.1647 (68%); 0.2935 (64% baseline) and 0.1547 (58% baseline) respectively. In comparison with the results of the first experiment, the best digram run offers improve- ments in the range of 2 to 5% for Spanish and 5 to 7% for French depending upon the measure used. Other nor- malization methods, such as stemming, will be explored in the future.

Table 5: Experiment. 1: Exact Precision Scores.

5 Further Exploration

The previous experiment did not involve any morphological normalization with stemmers in the query translation process. When we select Spanish/French MeSH terms from the Metathesaurus for the queries, exact match criteria are employed (see Section 3.5). Unfortunately, there are a number of instances where some vocabulary normalization may help. For example, the Spanish queries #25 and #30 have the word ‘aislado’. Although the Metathesaurus does not contain this word, it contains the morphological variants ‘aislada’, ‘aisladores’ and ‘aislados’. As an alternative to stemming we explore digram based vocabulary normalization methods.

Digrams are determined prior to the SMART based matching process described in Section 3.5. The Spanish/French query is first modified using the digram based method and then sent into the SMART procedure to identify appropriate MeSH concept phrases. Thus query words that do not occur in the Spanish/French Metathesaurus are substituted (if possible) by the closest matching words based on the digram method.
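The substitution step can be illustrated with a short Python sketch. It is not the implementation used in the experiments, and it rests on several assumptions: digrams are taken to be adjacent character pairs without word-boundary padding; each word's digrams are treated as a set, so the similarity here cannot exceed one; Dice's coefficient with a 0.8 threshold is used as in the digram based matching; and a query word is retained when the length-constrained variant finds no match. The function names and the three-word vocabulary are hypothetical.

```python
def digrams(word):
    """Return the set of adjacent character pairs in a word (no padding)."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(a, b):
    """Dice's coefficient 2N / (P + Q) computed over the two digram sets."""
    da, db = digrams(a), digrams(b)
    if not da or not db:
        return 0.0
    return 2 * len(da & db) / (len(da) + len(db))


def select_single(query_word, vocabulary, threshold=0.8):
    """Q'-style selection: the single best match, else the original word."""
    if query_word in vocabulary:
        return query_word  # exact matches are always kept
    # sorted() makes tie-breaking deterministic (an assumption of this sketch)
    best = max(sorted(vocabulary), key=lambda w: dice(query_word, w), default=None)
    if best is not None and dice(query_word, best) >= threshold:
        return best
    return query_word


def select_with_length(query_word, vocabulary, threshold=0.8):
    """Q''-style selection: all matches whose digram count differs by at most 1."""
    if query_word in vocabulary:
        return [query_word]
    n = len(digrams(query_word))
    hits = sorted(
        w for w in vocabulary
        if dice(query_word, w) >= threshold and abs(len(digrams(w)) - n) <= 1
    )
    # Assumption: fall back to the original word when nothing qualifies.
    return hits or [query_word]


vocabulary = {"aislada", "aisladores", "aislados"}
print(select_single("aislado", vocabulary))
print(select_with_length("aislado", vocabulary))
```

Under these assumptions the length-constrained variant keeps ‘aislada’ and ‘aislados’ and rejects ‘aisladores’, whose digram count differs from that of ‘aislado’ by three.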

5.1 Digram Based Matching.

Each Metathesaurus Spanish/French word and each query word is represented by its set of digrams. Similarity is computed using Dice’s Coefficient:

Sim(Query-word, Meta-word) = 2 * N / (P + Q)    (1)

where P and Q are the numbers of digrams in each word and N is the number in common. (Note that the computed similarity can be greater than one when a digram occurs more than once in either word.) We test two selection strategies, with exact matches selected in both.

• Q’: Select a single Metathesaurus word with similarity >= 0.8. If this fails, retain the original query word.

• Q”: Apply an additional length constraint where word length equals the number of unique digrams. Select all words with similarity >= 0.8 within a difference of 1 in word length. This will select ‘aislada’ and ‘aislados’ for the query word ‘aislado’ but reject ‘aisladores’.

Table 6 shows examples of queries transformed through both alternatives. Note that stopwords are removed and the query words alphabetized in the transformed queries⁹. Words in the original query which do not appear in the Spanish/French Metathesaurus, and are therefore candidates for digram based substitution, are highlighted, as are their substitutions.

5.3 Comparison of Spanish and French Results

In general it is clear that the French results are inferior to the Spanish results. The question asked at this point is whether the performance differences observed are due to their different levels of representation in the UMLS, or due to important differences in the languages that we are not considering, or perhaps due to both?¹⁰ We know that of the MeSH concepts, 23,198 yield a total of 32,282 Spanish strings and 18,277 MeSH concepts yield 25,932 French strings. Interestingly, except for a single concept, all concepts with French strings also have Spanish strings¹¹. However, 4,922 of the MeSH concepts with Spanish strings do not have corresponding French strings. Thus for all practical purposes we may consider the French concepts to be a proper subset of the Spanish concepts. In order to study the effect of the difference in representation further we first reduced the Spanish concepts to those 18,276 concepts which were also available in French. We refer to this set as ‘Spanish-reduced’ and use ‘Spanish’ for the original set of 23,198 Spanish concepts. To our surprise, the differences between Spanish-reduced and Spanish from the viewpoint of our collection of 106 Spanish queries are minimal. The query set has 538 words (after excluding stopwords) out of which 82 do not occur in the Spanish concepts and only an additional 5 do not occur in Spanish-reduced. Thus we do not expect to see any differences in retrieval performance by moving from the set of Spanish concepts to the subset that also has French translations¹². Thus we may conclude that the difference in the level of representation in the UMLS across the two languages does not cause the difference in CLIR performance observed for this query set.
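The coverage comparison between ‘Spanish’ and ‘Spanish-reduced’ amounts to a set difference between the query words and each vocabulary. The following Python sketch illustrates the idea on toy data; the vocabularies and the query are invented for illustration (only the word ‘adversos’ is taken from the text as an example of a term missing from Spanish-reduced), the helper name `uncovered` is hypothetical, and unique words rather than word occurrences are counted. The final lines check that the concept counts reported above are mutually consistent.

```python
def uncovered(query_words, vocabulary):
    """Return the sorted unique query words that do not occur in the vocabulary."""
    return sorted(set(query_words) - set(vocabulary))


# Toy vocabularies (hypothetical): 'Spanish-reduced' drops one word.
spanish = {"adversos", "lipidos", "terapia", "coagulacion"}
spanish_reduced = spanish - {"adversos"}

# Toy query after stopword removal (hypothetical).
query = ["adversos", "lipidos", "progesterona", "terapia"]

missing_full = uncovered(query, spanish)
missing_reduced = uncovered(query, spanish_reduced)
# Words lost only because of the move to the reduced vocabulary.
additional = sorted(set(missing_reduced) - set(missing_full))

print(missing_full)
print(missing_reduced)
print(additional)

# Consistency of the reported concept counts:
# Spanish concepts minus those lacking French strings, and
# French concepts minus the single French-only concept.
assert 23198 - 4922 == 18276
assert 18277 - 1 == 18276
```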

⁹ Since our retrieval tests employ a “word” based approach, alphabetization has no effect.

¹⁰ Of course the assumption behind this question is that the translations in both languages are of equal quality.

¹¹ The one exception is concept C0005403 with the French string ‘Reflux Biliare’.

¹² This expectation was supported when we repeated the FM+D+A run of Table 3 on Spanish-reduced. The performance increased slightly to 0.1745. The slight increase may be explained by the fact that the 5 additional terms without representation in Spanish-reduced are of low frequency and are not very informative, such as ‘adversos’.


Spanish

Q#   Version      Query
1    Original     existen adversos en los lipidos cuando la progesterona es administrada con terapia de reemplazante hormonal estrogenos
     Q’ version   administrados adversos estrogenos hormona lipidos progesterona reemplazante terapia
     Q” version   administrados adversos estrogenos hormona lipidos progesterona reemplazante terapia
2    Original     patofisiologia y tratamiento de coagulacion intravascular diseminada
     Q’ version   coagulacion diseminada fisiologia intravascular tratamiento
     Q” version   coagulacion diseminada intravascular psicofisiologia tratamiento

French

Q#   Version      Query
2    Original     physiopathologie et traitement de la coagulation intravasculaire disseminee
     Q’ version   coagulation disseminee intravasculaire phytopathologie traitement
     Q” version   coagulation dissemine disseminee intravasculaire physiopathologie traitement
44   Original     [illegible] subdural chez les personnes agees
     Q’ version   agees personnes revue subdural
     Q” version   agees personne personnes revue subdural

Table 6: Sample Query Transformation through Digram Based Strategies

           Better than baseline          Equivalent to baseline            Worse than baseline
           Class 1        Class 2        Class 3                           Class 4     Class 5     Class 6
           at least 50%   10 to 30%      absolute difference               10 to 30%   30 to 70%   more than 70%
                                         between 0 and 10%
Spanish    7              7              40                                10          18          24
French     12             8              22                                11          18          35

Table 7: Distribution of query-by-query performance.

Table 8: Examples of query terms and performance

If we assume that the translations are of equal quality then our conclusion is that there are important differences in the two languages that our CLIR algorithm has yet to consider. This is also indicated by the fact that the addition of the heuristic ‘PM’ always degrades the French results but not the Spanish results. Future work is planned to examine these aspects in further detail.

6 Query-by-Query Analysis.

For this analysis we took the best performance of the Spanish and French queries and compared them against the baseline. We obtained the precision of each of the 106 queries and computed the percentage difference with respect to the baseline. The queries were grouped in six classes, which are presented in Table 7. Each cell presents the number of queries in that class. We observe that the translation process improves several queries. Surprisingly, French queries generate 20 significantly improved translated queries, in contrast to 14 generated by the Spanish queries. We also observe that the Spanish translation generates 40 queries that perform equivalently to the baseline. Table 8 shows four of the queries from classes 1 and 6. We observe that in query 87 both translations perform significantly better than the baseline. In query 105 the Spanish translation is better than the baseline but the French translation performs very poorly. The contrary happens in query 44. In those cases where the translation performs better than the baseline, the process has introduced a new important term that was not present in the original English query. Query 61 is an example of a case where both translations perform worse than the baseline. In those cases, the translation process failed to translate an important term of the query.

This type of analysis will allow us to refine our methods in future research.
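The per-query comparison can be sketched in Python as follows. This is only an illustration: the precision values are invented, the function names are hypothetical, and the sketch distinguishes only three outcomes (better, equivalent, worse), using the 10% band of Class 3 as the equivalence threshold rather than the full six-class scheme of Table 7.

```python
from collections import Counter


def percent_difference(translated, baseline):
    """Percentage difference of a translated query's precision from the baseline."""
    if baseline == 0:
        raise ValueError("baseline precision must be non-zero")
    return 100.0 * (translated - baseline) / baseline


def outcome(translated, baseline, band=10.0):
    """Label a query as better, equivalent, or worse than the baseline."""
    diff = percent_difference(translated, baseline)
    if abs(diff) <= band:
        return "equivalent"
    return "better" if diff > 0 else "worse"


# Hypothetical (translated, baseline) precision pairs for three queries.
scores = {"q_a": (0.30, 0.20), "q_b": (0.21, 0.20), "q_c": (0.05, 0.20)}

labels = {q: outcome(t, b) for q, (t, b) in scores.items()}
tally = Counter(labels.values())

print(labels)
print(tally)
```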

7 Conclusions

We have explored a CLIR method for MEDLINE using only the multilingual Metathesaurus for query translation. No tools such as part of speech taggers, stemmers and separate corpora are involved. The approach begins by first selecting an initial set of Spanish/French MeSH phrases that is appropriate for the query as a whole. This set is then refined using alternative strategies. The best performance is achieved by: first selecting phrases that contain only query words; then translating any remaining query words on a word-by-word basis; and finally retaining the remaining Spanish/French query words. We tested the translated queries on OHSUMED with the monolingual retrieval results as the baseline. Three versions of translated queries were tested over two retrieval experiments. The second and third query versions (Q’ and Q”) involved digram based vocabulary normalization procedures.

In general, the best free-text based retrieval performance is at 75% of baseline MLIR performance in 11-AvgP; 76% of baseline in average precision at 0.1 recall; and 81% of baseline in exact precision at 10 retrieved documents for Spanish. French yields less favorable results, with best scores of 68%, 65% and 58% respectively. We show that these differences in performance are not caused by differences in the level of representation in the UMLS. The most likely cause is a difference in linguistic features.

When compared with previous results we see that Spanish CLIR using the Metathesaurus for query translation is on the high end of the performance range (of 50-75% of baseline scores) observed with approaches based on dictionaries with or without information extracted from corpora [2, 3, 7, 14]. As anticipated, performance is still behind dictionary independent methods using parallel corpora [10]. It remains to be seen if the addition of tools such as stemmers and relevant parallel or comparable corpora improves performance.

In addition to the use of the UMLS Metathesaurus (an excellent example of a collaboratively built vocabulary system), this study has a number of unique features. First, it involves SMART in selecting MeSH phrases for queries, which allows us to consider weights in this phase. Another unique feature is the exploration of a new and automatic method for deriving word based transfer dictionaries from phrase based transfer dictionaries. We also show that such dictionaries contribute to CLIR performance. This is also one of very few recent studies to empirically explore the value of multilingual thesauri or controlled vocabularies for CLIR. Moreover, we investigate how a controlled vocabulary can be used to conduct free-text based CLIR. Lastly, this research contributes to the strengthening of the international impact of MEDLINE. Future work will build on these results by exploring alternate refinement strategies and appropriate second language stemmers. We also look forward to exploring the other languages of the UMLS: Portuguese and German.

Acknowledgements The authors thank Professor Bill Hersh for generously providing the OHSUMED test database. We also thank Professor Roccio Guillen for translating the OHSUMED queries into Spanish and Dr. Gertrud Champe of the UI Translation Laboratory for translating the queries into French. Finally we thank the reviewers of this paper for their recommendations.

References

[1] J. Allan, J. Callan, W.B. Croft, L. Ballesteros, D. Byrd, R. Swan, and J. Xu. INQUERY does battle with TREC-6. In Proceedings of the Sixth Text Retrieval Conference (TREC-6). Gaithersburg, MD: National Institute of Standards and Technology (NIST), November 1998.

[2] L. Ballesteros and W.B. Croft. Dictionary methods for cross-lingual information retrieval. In Proceedings of the 7th International DEXA Conference on Database and Expert Systems, pages 791-801, 1996. http://ciir.cs.umass.edu/info/psfiles/irpubs/ir.html.

[3] L. Ballesteros and W.B. Croft. Phrasal translation and query expansion techniques for cross-language information retrieval. In Proceedings of the 20th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 84-91, July 1997.

[4] R.D. Brown. Automated dictionary extraction for “knowledge-free” example-based translation. In Proceedings of the 7th International Conference on Theoretical and Methodological Issues in Machine Translation, July 1997.

[5] C. Buckley, G. Salton, J. Allan, and A. Singhal. Automatic query expansion using SMART: TREC 3. In D.K. Harman, editor, The Third Text Retrieval Conference (TREC-3), pages 69-80. NIST, November 1994.

[6] J.G. Carbonell, Y. Yang, R.E. Frederking, R.D. Brown, Y. Geng, and D. Lee. Translingual information retrieval: A comparative evaluation. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, August 1997.

[7] M. Davis. New experiments in cross-language text retrieval at NMSU’s computing research lab. In The Fifth Text Retrieval Conference (TREC-5), November 1996.

[8] M. Davis. Free resources and advanced alignment for cross-language text retrieval. In Proceedings of The Sixth Text Retrieval Conference (TREC-6). Gaithersburg, MD: National Institute of Standards and Technology (NIST), November 1998.

[9] M. Davis and T. Dunning. Query translation using evolutionary programming for multilingual information retrieval. In Fourth Annual Conference on Evolutionary Programming, August 1995. http://crl.nmsu.edu/users/madavis/Site/Book2/evolmltrl.ps.gz.

[10] S.T. Dumais, T.A. Letsche, M.L. Littman, and T.K. Landauer. Automatic cross-language retrieval using latent semantic indexing. In D Hull and D Oard, editors, 1997 AAAI Symposium on Cross-Language Text and Speech Retrieval. American Association for Artificial Intelligence, March 1997. http://www.clis.umd.edu/dlrg/filter/sss/papers/dumais.ps.

[11] E. Gaussier, G. Grefenstette, D.A. Hull, and B. M. Schulze. Xerox TREC-6 site report: Cross language text retrieval. In Proceedings of The Sixth Text Retrieval Conference (TREC-6). Gaithersburg, MD: National Institute of Standards and Technology (NIST), November 1998.

[12] J. Gilarranz, J. Gonzalo, and F. Verdejo. An approach to conceptual text retrieval using the Eurowordnet multi-lingual semantic database. In AAAI Symposium on Cross-Language Text and Speech Retrieval, March 1997.

[13] W. Hersh, C. Buckley, T. Leone, and D. Hickam. Ohsumed: An interactive retrieval evaluation and new large test collection for research. In B Croft and C van Rijsbergen, editors, Proceedings of the 17th International Conference on Research and Development in Information Retrieval, pages 192-200. New York: ACM, August 1994.

[14] D.A. Hull and G. Grefenstette. Querying across languages: A dictionary-based approach to multilingual information retrieval. In H-P Frei, D Harman, P Schauble, and R Wilkinson, editors, Proceedings of the 19th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 49-57. ACM, July 1996.

[15] B. Mateev, E. Munteanu, P. Sheridan, M. Wechsler, and P. Schauble. ETH TREC-6: Routing, Chinese, cross-language and spoken document retrieval. In Proceedings of The Sixth Text Retrieval Conference (TREC-6). Gaithersburg, MD: National Institute of Standards and Technology (NIST), November 1998.

[16] G.A. Miller. WordNet: an on-line lexical database. International Journal of Lexicography, 3(4), 1990.

[17] D.W. Oard. Alternative approaches for cross-language text retrieval. In D Hull and D Oard, editors, AAAI Symposium on Cross-Language Text and Speech Retrieval. American Association for Artificial Intelligence, March 1997.

[18] D.W. Oard and B.J. Dorr. A survey of multilingual text retrieval. Technical Report UMIACS-TR-96-19 CS-TR-3615, University of Maryland, April 1996.

[19] D.W. Oard and P. Hackett. Document translation for cross-language text retrieval at the University of Maryland. In Proceedings of The Sixth Text Retrieval Conference (TREC-6). Gaithersburg, MD: National Institute of Standards and Technology (NIST), November 1998.

[20] National Library of Medicine. Unified Medical Language System (UMLS) Knowledge Sources, 6th experimental edition. Bethesda, MD: NLM, 1997.

[21] B. Rehder, M.L. Littman, S. Dumais, and T.K. Landauer. Automatic 3-language cross-language information retrieval with latent semantic indexing. In Proceedings of The Sixth Text Retrieval Conference (TREC-6). Gaithersburg, MD: National Institute of Standards and Technology (NIST), November 1998.

[22] G. Salton. Automatic processing of foreign language documents. Journal of the American Society for Information Science, 21(3):187-194, May 1970.

[23] G. Salton, editor. The SMART Retrieval System: Experiments in Automatic Document Processing. NJ: Prentice Hall, 1971.

[24] P. Schauble and P. Sheridan. Cross-language information retrieval (CLIR) track overview. In Proceedings of the Sixth Text Retrieval Conference (TREC-6). Gaithersburg, MD: National Institute of Standards and Technology (NIST), November 1998.

[25] P. Sheridan and J.P. Ballerini. Experiments in multilingual information retrieval using the SPIDER system. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 58-65, August 1996.

[26] P. Sheridan, M. Wechsler, and P. Schauble. Cross-language speech retrieval. In NJ Belkin, AD Narasimhalu, and P Willett, editors, Proceedings of the 20th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 99-109. New York: ACM, July 1997.

[27] D. Soergel. Multilingual thesauri in cross-language text and speech retrieval. In D Hull and D Oard, editors, AAAI Symposium on Cross-Language Text and Speech Retrieval. American Association for Artificial Intelligence, March 1997.

[28] P. Srinivasan. Retrieval feedback in medline. Journal of the American Society for Information Science, 3(2):157-167, 1996.
