Arabic machine translation: a survey



Artif Intell Rev
DOI 10.1007/s10462-012-9351-1

Arabic machine translation: a survey

Arwa Alqudsi · Nazlia Omar · Khalid Shaker

© Springer Science+Business Media B.V. 2012

Abstract Although there is no machine learning technique that fully meets human requirements, finding a quick and efficient translation mechanism has become an urgent necessity, due to the differences between the languages spoken in the world’s communities and the vast development that has occurred worldwide, as each technique demonstrates its own advantages and disadvantages. Thus, the purpose of this paper is to shed light on some of the techniques that employ machine translation available in literature, to encourage researchers to study these techniques. We discuss some of the linguistic characteristics of the Arabic language. Features of Arabic that are related to machine translation are discussed in detail, along with possible difficulties that they might present. This paper summarizes the major techniques used in machine translation from Arabic into English, and discusses their strengths and weaknesses.

Keywords Arabic machine translation · Arabic language morphology

1 Introduction

Machine translation is a computer application that translates texts or speech from one natural language to another. Machine translation receives a source sentence,

A. Alqudsi (B) · N. Omar
Knowledge Technology Research Group (KT), School of Computer Science, Faculty of Information Science and Technology, University Kebangsaan Malaysia, 43600 UKM Bangi, Selangor, Malaysia

K. Shaker
Department of Software Engineering, Faculty of Computer Science and Information Technology, University of Malaya, 50603 Lembah Pantai, Kuala Lumpur, Malaysia


$$S = [s_1, s_2, \ldots, s_i]$$

and generates a target sentence,

$$T = [t_1, t_2, \ldots, t_j]$$

by translating the source sentence and giving its meaning in the target language. Useful references on text translation include Hutchins (2007), who provides a good overview, and Dorr et al. (1999), who give a comprehensive survey of machine translation. The question is “why should researchers be interested in developing new translation systems?” The first and most important reason is that there has been a need for machine translation since the advent of computers, as there is an increasing demand for online communication between people worldwide who speak different languages. Another reason is that no machine translation system fully satisfies people’s requirements in terms of translation quality and retrieval time. Furthermore, the use of computer translation tools can increase translation throughput, with immediate results and lower translation costs. Finally, machine translation is a major activity in natural language processing for different fields. Thus, efficient techniques that work with special rules should be available to generate a useful machine translation system, in order to improve the translation of natural language texts into other natural languages, and remove anomalies. Arnold et al. (1994) suggested that machine translation should focus on moderate translation that involves human interaction.

Recently, machine translation has achieved better results for almost all natural languages (Attia 2008). Many online translation systems are available. Google Translator¹ is a free online text translation service that is based on statistical machine translation paradigms and supports more than 55 different languages. Microsoft Translator² is a free online translation service that is based on example-based machine translation and several statistical machine translation technologies, and supports 32 languages. Systran³ uses a rule-based machine translation paradigm and can translate a certain number of languages, like English, Arabic, French, Dutch, Chinese, and others. Many of its language pairs translate to or from English or French.

The goals of this paper are to provide details of the Arabic language, to characterize the main ideas of Arabic to English machine translation, and to provide a classification of the various approaches applied to translate Arabic text into English text.

1.1 Arabic language

Arabic is one of the six major world languages. It originated in the area currently known as the Arabian Peninsula. Arabic has been used since the 2nd millennium Before the Common Era. Spoken Arabic is presently more divergent than written Arabic, due to dialectal interference. In morphological analysis, Arabic words are often ambiguous (Al-Sughaiyer and Al-Kharashi 2004).

Arabic is the joint official language in Middle Eastern and African states. Large communities of Arabic speakers have existed outside of the Middle East since the end of the last century, particularly in the United States and Europe. The motivation for this paper is to shed light on Arabic language features and investigate several existing translation systems within the literature related to Arabic to English translation, in terms of the strengths and weaknesses of translation.

¹ http://translate.google.com
² http://www.microsofttranslator.com
³ http://www.systransoft.com/


Fig. 1 Arabic sentence and the equivalent sentence in English (English sentence shown: “A new British investigation of Iraqis torture”)

Arabic has a flexible word order, which poses a significant challenge to MT because the same sentence can be expressed in several ways. In Arabic, three elements make up a sentence, namely subject, verb, and object. Through these elements, Arabic sentences can be classified into four types according to their word order, i.e., SVO, VSO, VOS, and OVS.

Put simply, the task is to take a string of words (“sentence”) in the source language, with its vocabulary, and transform it into another string of words (“sentence”) in the target language, with its vocabulary. Some languages, such as German or Chinese, may require special pre-processing, as there are no clearly marked word boundaries.

There is often no special treatment of morphological variants. Arabic is rich and complex in morphological and syntactic structures. Therefore, it is possible for the size of vocabularies to reach into the tens or hundreds of thousands, or even millions.

Soudi et al. (2007) presented a review of the salient issues in Arabic computational morphology, provided a broad coverage of the computational techniques for the processing of Arabic morphology, and gave a detailed discussion of the linguistic approaches on which each computational treatment is based. They also introduced the transliteration scheme, which is used to represent Arabic words for readers who cannot read Arabic script, as well as guidelines for pronouncing Arabic, given this transliteration.

The goal of a translation system, when presented with an input sequence, is to find a target sequence that matches the corresponding translation. An example of translationally corresponding sequences is shown in Fig. 1. We draw a line between the words in the two sentences that are translations of each other. For instance, we can see that an Arabic sentence is translated into an English sentence, and that the words can be aligned by drawing a line between each pair of corresponding words.

Therefore, any new system will need some kind of mechanism to choose between various possible options for each translation decision. The system will also need a mechanism to correctly reorder words, as words with their equivalent meanings do not always appear in the same order in both source and target sentences. Reordering typically depends on the syntactic structure of the target language.

The foremost challenge for Natural Language Processing (NLP) in Arabic is overcoming ambiguity (Kamir et al. 2002; Albared et al. 2009). It is not uncommon for the different possible translations of a word to have very different meanings, and because of its rich and complex morphology, Arabic is notorious for its morphological ambiguity (Attia 2006).

According to Daimi (2001), Fehri (1993), and Chalabi (2000), there are many complexities in Arabic. The following lists the major issues involving Arabic:

• Arabic writing direction is from right to left in a horizontal form. For example:

The translation for this sentence is Khalid read the book.


Table 1 Arabic free word order

Sentence order   Arabic sentence   English translation
VSO              [lost]            Ate Adam the apple
OVS              [lost]            The apple ate Adam
SVO              [lost]            Adam ate the apple
VOS              [lost]            Ate the apple Adam

Fig. 2 Derivation of words from a three-letter-root (English glosses of the derived forms: wrote, writer, writers, book, books, it was written, office, offices, library, libraries)

• There are no capital letters in Arabic.
• Punctuation in Arabic is similar to English, except for commas, which sit ‘on’ the line instead of ‘under’ the line.
• Arabic uses gender for all known nouns (none are neutral).
• Space is left between words in sentences.
• Some letters change shape depending on their location within a word, whether they are at the start, middle, or end of the word.

For example: the shape of the letter ( ) at the start of a word is ( ), in the middle of a word it is ( ), and at the end of a word it is ( or ).

• Arabic has a comparatively free word order. See the example shown in Table 1.

In the examples above, all four sentences have the same English translation, which is Adam ate the apple, but they use different word orders. Reading from the left, the order of the first sentence is VSO, the second is OVS, the third is SVO, and the last is VOS.
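The point that all four orders in Table 1 map to one English sentence can be illustrated with a minimal sketch. It assumes the role of each constituent (S, V, or O) is already known, which in practice would require a parser; the function names and the hyphenated English glosses are hypothetical stand-ins for the Arabic words.

```python
from typing import Dict, List

def split_by_order(tokens: List[str], order: str) -> Dict[str, str]:
    """Pair each token with the role letter at the same position in `order`."""
    if len(tokens) != len(order):
        raise ValueError("one token per role letter is expected")
    return dict(zip(order, tokens))

def to_svo(constituents: Dict[str, str]) -> str:
    """Linearise the constituents in English subject-verb-object order."""
    return " ".join(constituents[role] for role in ("S", "V", "O"))

# English glosses from Table 1, one list per Arabic word order (assumed tokenisation).
glosses = {
    "VSO": ["ate", "Adam", "the-apple"],
    "OVS": ["the-apple", "ate", "Adam"],
    "SVO": ["Adam", "ate", "the-apple"],
    "VOS": ["ate", "the-apple", "Adam"],
}

normalized = {order: to_svo(split_by_order(toks, order))
              for order, toks in glosses.items()}
for order, sentence in normalized.items():
    print(order, "->", sentence)
```

Every entry is linearised to the same SVO string, which is why a translation system needs role information, not just word positions, to reorder Arabic input.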

• Arabic is a pro-drop language. According to Chalabi (2004), the subject can be omitted, leaving any syntactic parser with the challenge to decide, first, whether or not there is an omitted pronoun in the subject position, and second, what the antecedent of the omitted pronoun is. For example: (He writes the lesson)

• Arabic is a language rich in clitics and affixes (Abbès et al. 2004). There are some words in Arabic that hold the meaning of a full sentence. For example: (We will play)

• Arabic words can often be ambiguous, because of the three-letter root system. This system allows Arabic to develop to cover a wide choice of meanings. One or more of the root letters is dropped in some derivations, and this leads to possible ambiguity. Figure 2 shows the derivation of words from a three-letter-root.

• Arabic does not have the copula verb ‘to be’ or the verb ‘to have’. Examples are shown in Tables 2 and 3.


Table 2 Verb ‘to be’

Arabic sentence   Arabic reading     English sentence (to be)
[lost]            The door opening   The door is opening
[lost]            He clever          He is clever

Table 3 Verb ‘to have’

Arabic sentence   Arabic reading   English sentence (to have)
[lost]            To her a bag     She has a bag
[lost]            To him a book    He has a book

Table 4 Feminine nouns are derived from masculine nouns

Arabic nouns      English translation
[lost] (male)     Engineer
[lost] (female)   Engineer
[lost] (male)     Doctor
[lost] (female)   Doctor

Table 5 Feminine nouns are different from masculine

Arabic nouns      English translation
[lost] (male)     Boy
[lost] (female)   Girl
[lost] (male)     Man
[lost] (female)   Woman
[lost] (male)     Cock
[lost] (female)   Chicken

In the above examples, rather than saying “the door is opening,” the Arabic sentence reads like “the door opening,” and instead of saying “He is clever,” the Arabic sentence reads “He clever.”

In English, the verb ‘to have’ usually means ‘to own’. Rather than saying “she has a bag,” the Arabic equivalent is “to her a bag.” In another example shown in Table 3, rather than saying “he has a book,” the Arabic sentence reads like “to him a book.”

• Nouns in Arabic must either be masculine or feminine. Usually feminine nouns are derived from masculine nouns, which are considered as the stem (see Table 4).

However, sometimes feminine nouns are different from masculine nouns (i.e., the feminine noun is not derived from a masculine noun), as shown in Table 5.

• The number system in Arabic includes the dual form, whereas English moves from a singular to a plural form directly. To form the dual in Arabic, a suffix morpheme is added to the singular (stem); the form of the suffix depends on whether the case is nominative or accusative and genitive (as shown in Table 6).

• The plural form of Arabic masculine nouns is created by adding a suffix morpheme to the singular noun; the form of the suffix depends on whether the word case is nominative or accusative and genitive (see Table 7).


Table 6 Arabic dual and plural forms

Arabic singular   Arabic dual (nominative)   Arabic dual (accusative and genitive)   English translation
[lost] (male)     [lost] (male)              [lost] (male)                           Two engineers
[lost] (female)   [lost] (female)            [lost] (female)                         Two engineers

Table 7 Plural form of Arabic masculine nouns

Arabic singular   Arabic plural (nominative)   Arabic plural (accusative and genitive)   English translation
[lost]            [lost]                       [lost]                                    Teachers
[lost]            [lost]                       [lost]                                    Visitors

Table 8 Plural form of Arabic feminine nouns

Arabic singular   Arabic plural (nominative)   Arabic plural (accusative and genitive)   English translation
[lost]            [lost]                       [lost]                                    Teachers
[lost]            [lost]                       [lost]                                    Visitors

Table 9 Broken plural

Arabic singular   English translation   Arabic plural   English translation
[lost]            Door                  [lost]          Doors
[lost]            Pen                   [lost]          Pens
[lost]            Planet                [lost]          Planets

• The plural form of Arabic feminine nouns is created by adding a suffix morpheme to the stem word; the form of the suffix depends on whether the word case is nominative or accusative and genitive (see Table 8).

• In Arabic, some words have no fixed rule for their plural form. Their plurals are formed by changing the vowels, or by adding or deleting letters of the original word; this type of plural is called a broken plural (as shown in Table 9).

1.2 Some Arabic affixes matter

Arabic has a large number of suffixes and prefixes that can change a stem to form words. This leads to a large vocabulary in the lexicon, and hence, a potential increase in word error rate. To tackle this problem, many earlier results indicate that a simple morphological analysis of Arabic words is helpful, and has shown good potential for machine translation (Afify et al. 2006). Prefixes and/or suffixes are merged with the word stems to produce new words.


Table 10 Arabic prefixes, suffixes and their meanings

Prefixes           Suffixes
[lost] (and)       [lost] (your (singular))
[lost] (the)       [lost] (your (plural))
[lost] (then)      [lost] (his)
[lost] (to)        [lost] (her)
[lost] (and the)   [lost] (their)
[lost] (like)      [lost] (my)

The stem word can be derived by applying some predefined patterns to the roots. Table 10 shows several Arabic prefixes, suffixes, and their meanings.
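The two mechanisms described here, deriving a stem from a root and a pattern and then attaching affixes to the stem, can be sketched as follows. This is only an illustration under stated assumptions: the words are written in an informal Latin romanization rather than Arabic script, the digits 1 to 3 in a pattern stand for the three root consonants, case vowels are ignored, and the function names are hypothetical.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Replace the digits 1-3 in `pattern` with the matching root consonant."""
    return "".join(root[int(ch) - 1] if ch in "123" else ch for ch in pattern)

# A few patterns for the root k-t-b, with English glosses as in Fig. 2.
PATTERNS = {
    "1a2a3a": "he wrote",
    "1aa2i3": "writer",
    "1i2aa3": "book",
    "1u2u3": "books",
    "ma12a3": "office",
    "ma12a3a": "library",
}

root = "ktb"
stems = {apply_pattern(root, pattern): gloss for pattern, gloss in PATTERNS.items()}
for stem, gloss in stems.items():
    print(stem, "=", gloss)

def attach(stem: str, prefix: str = "", suffix: str = "") -> str:
    """Concatenate an optional prefix and suffix onto a stem (no case vowels)."""
    return prefix + stem + suffix

# Assumed romanized affixes: "wa" for the prefix glossed (and), "hu" for the suffix glossed (his).
word = attach(apply_pattern(root, "1i2aa3"), prefix="wa", suffix="hu")
print(word)  # simplified form for "and his book"
```

Because one root yields many stems and each stem combines with several affixes, the number of distinct surface words grows quickly, which is the vocabulary growth discussed above.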

2 Translation approaches

There are many different approaches to carrying out machine translation. In this work, we give a brief explanation of the main approaches that have been used in previous works, as over the years, many techniques have been used to enhance the performance of machine translation.

2.1 Rule-based

In the field of machine translation, the rule-based approach was the first technique used by researchers. Rules are written by humans according to their linguistic knowledge. The strength of this approach is that it can deeply analyse both the syntactic and semantic levels. A key design element of any rule-based MT system is its lexical resources. In practice, rule-based machine translation systems often have diverse dictionaries, where some contain main entries, and others contain specialized vocabulary. Brill and Resnik (1994) described a new rule-based approach to prepositional phrase attachment disambiguation, and compared the results of this algorithm with other rule-based approaches to the same problem. Salem et al. (2008) introduced a system called UniArab to support the development of a rule-based lexical framework for Arabic language processing using a Role and Reference Grammar (RRG) linguistic model. The weakness of a rule-based approach is that it is impossible to write rules that cover all languages, as this requires great linguistic knowledge (Charoenpornsawat et al. 2002). Hutchins and Harold (1992) state that rule-based machine translation approaches can be classified by their architectures into the following categories: Direct approach, Transfer based approach, and Interlingual approach.

2.1.1 Direct approach

The direct approach was used by most first generation MT systems. When an MT system uses direct translation, the source language text is usually not analysed structurally beyond morphology, as the translation is based on many dictionaries. Translation occurs as follows:

• Translation is word-by-word.
• Very little analysis of the source text (e.g., no syntactic or semantic analysis).
• Relies on a large bilingual dictionary. For each word in the source language, the dictionary specifies a set of rules for translating that word.
• After the words are translated, simple reordering rules are applied.
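The steps listed above can be sketched in a few lines. The example is a toy under stated assumptions: the source words are informal romanizations, the three-entry lexicon and its part-of-speech tags are invented for illustration, lowercasing stands in for morphological analysis, and a single local rule moves a sentence-initial verb behind the following noun.

```python
from typing import List, Tuple

# Hypothetical bilingual dictionary: source word -> (target word, coarse tag).
LEXICON = {
    "akala": ("ate", "VERB"),
    "adam": ("Adam", "NOUN"),
    "al-tuffaha": ("the apple", "NOUN"),
}

def lookup(tokens: List[str]) -> List[Tuple[str, str]]:
    """Word-by-word dictionary lookup; unknown words are passed through."""
    return [LEXICON.get(tok.lower(), (tok, "UNK")) for tok in tokens]

def reorder_verb_initial(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Local rule: a sentence-initial VERB followed by a NOUN becomes NOUN VERB."""
    if len(pairs) >= 2 and pairs[0][1] == "VERB" and pairs[1][1] == "NOUN":
        return [pairs[1], pairs[0]] + pairs[2:]
    return pairs

def direct_translate(sentence: str) -> str:
    pairs = reorder_verb_initial(lookup(sentence.split()))
    return " ".join(word for word, _ in pairs)

result = direct_translate("akala Adam al-tuffaha")
print(result)
```

The rule fires on word positions and tags only; no syntactic structure is built, which is exactly the limitation of the direct approach.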


Fig. 3 Direct MT approach (source language input → morphological analysis → bilingual dictionary lookup → local reordering → target language output)

Joshan and Lehal (2007) showed that the direct approach can translate Hindi text into Punjabi text with tolerably good accuracy.

Traces of the direct approach can even be found in indirect systems. However, the direct MT system model has a more rudimentary software design. Figure 3 shows some of the steps of the translation process. The limitation of direct machine translation is that it lacks analysis of the source language. This may cause several problems, as words are translated without disambiguation of their syntactic role.

In response to the apparent weakness of the direct approach, many types of indirect approach were developed (as shown in Fig. 3). Systems of this nature are sometimes referred to as second generation systems.

2.1.2 Interlingua approach

Interlingua machine translation is a classic approach and is the most attractive approach for a multilingual system. The Interlingua approach is carried out in two stages. In the first stage, the source language sentence is analysed into an Interlingua representation, and in the second stage, the meaning of that representation is converted into an output sentence in the target language.

Within the rule-based machine translation paradigm, the Interlingua approach is an alternative to the direct and transfer approaches. Hutchins (1986) mentioned that Interlingua was the first indirect method. However, Hutchins and Harold (1992) mentioned that “In the past, the intention or hope was to develop an interlingual representation that was truly ‘universal’ and could thus be intermediary between any natural languages. At present, interlingual systems are less ambitious.” The development of Interlingua systems depends on translating the source text into an intermediate language, or symbolic representation form, which can then be translated into any other language. Eric and Teruko (1992) described Knowledge-Based Natural Language Translation (KANT), which is the first knowledge-based interlingual machine translation system that combines a principled source language design, semi-automated knowledge acquisition, and knowledge compilation techniques to produce fast, high-quality translation into multiple languages. In addition, Bonnie et al. (2004) defined the elements of an Interlingua, the main issues faced by researchers and builders of Interlingua, and the improvement of Interlingua MT systems.

The Interlingua approach has both advantages and disadvantages. One advantage for multilingual machine translation is that no transfer component has to be created for each language pair. An obvious disadvantage is that the definition of an Interlingua is difficult.


Fig. 4 Interlingua MT with 4 languages (English, French, German, and Spanish on the source side, connected through the Interlingua to English, French, German, and Spanish on the target side)

Fig. 5 Transfer approach (levels shown between the source language text and the target language text: Direct, Transfer, Interlingua)

It is easier to add new language pairs to the system than it is in the direct method, as the addition of a new language to the system entails the creation of just two new modules: an analysis grammar and a generation grammar. For example, in a system that has four languages (i.e., English, French, German, and Spanish) there are 12 language pairs (Hutchins and Harold 1992). Figure 4 illustrates an Interlingua system.
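The arithmetic behind this comparison can be made explicit. The sketch below assumes one translation component per ordered language pair for a pair-wise design, and one analysis module plus one generation module per language for an Interlingua design; the function names are hypothetical.

```python
def pairwise_components(n_languages: int) -> int:
    """Ordered language pairs: each language translated into every other one."""
    return n_languages * (n_languages - 1)

def interlingua_modules(n_languages: int) -> int:
    """One analysis grammar and one generation grammar per language."""
    return 2 * n_languages

for n in (4, 5):
    print(n, "languages:",
          pairwise_components(n), "pair-wise components,",
          interlingua_modules(n), "Interlingua modules")
```

With four languages this gives the 12 language pairs cited above against 8 Interlingua modules, and adding a fifth language adds 8 pair-wise components but only 2 modules.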

2.1.3 Transfer approach

Transfer based machine translation is based on the idea of Interlingua, and is currently one of the most widely used methods of machine translation. The transfer approach is carried out in three phases. The first phase analyses the source language sentence and builds a syntactic analysis using the Source Language (SL) dictionary. The second phase is the transfer process, which changes the results of the analysis phase and produces the linguistic and structural equivalents between two languages. The third phase is the generation phase, which produces the Target Language (TL) text, based on the linguistic data of the source language, using a target language dictionary.

Figure 5 represents the transfer strategy. This strategy uses three phases to represent the document linguistically, using a source language dictionary. The transfer rules may look quite similar to the rules for direct translation systems, but they can operate on syntactic structures. This approach easily handles long-distance reordering.
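The three phases can be sketched as three small functions. This is a toy under stated assumptions, not the design of any system cited here: the source sentence is an informal romanization, both dictionaries are invented, and the grammar covers only a verb-subject-object clause.

```python
from typing import Dict

# Hypothetical source-language dictionary (word -> tag) and transfer lexicon.
SL_DICT = {"akala": "VERB", "adam": "NOUN", "al-tuffaha": "NOUN"}
TRANSFER_LEXICON = {"akala": "ate", "adam": "Adam", "al-tuffaha": "the apple"}

def analyse(sentence: str) -> Dict[str, str]:
    """Phase 1: build a shallow syntactic structure for a VSO clause."""
    tokens = sentence.lower().split()
    tags = [SL_DICT[tok] for tok in tokens]
    if tags != ["VERB", "NOUN", "NOUN"]:
        raise ValueError("the toy grammar only covers VSO clauses")
    return {"verb": tokens[0], "subject": tokens[1], "object": tokens[2]}

def transfer(structure: Dict[str, str]) -> Dict[str, str]:
    """Phase 2: map source lexical items onto target ones, keeping the roles."""
    return {role: TRANSFER_LEXICON[word] for role, word in structure.items()}

def generate(structure: Dict[str, str]) -> str:
    """Phase 3: linearise the target structure in English SVO order."""
    return f"{structure['subject']} {structure['verb']} {structure['object']}"

translation = generate(transfer(analyse("akala Adam al-tuffaha")))
print(translation)
```

Unlike a direct system, reordering here falls out of the roles in the intermediate structure, so it does not depend on how far apart the words were in the source sentence.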

Toma (1977) improved the SYSTRAN system, which was regarded as the first system using this approach. This system has proven capabilities in an operational environment. The SYSTRAN system has an inherent capability to translate from one language to any number of different languages. Lavie et al. (2004) developed a basic Hindi-to-English MT system, by enhancing the performance of a syntactically transfer-based approach, using strong statistical methods. Hatem and Omar (2010) proposed a transfer-based approach in Arabic to English


MT, in order to solve the word ordering problem. Their approach was tested on 100 titles from the Aljazeera news website.

2.2 Statistical

The ideas behind statistical machine translation came from information theory, as text is translated using a probabilistic process. The statistical approach does not require linguistic knowledge, but it does need a large bilingual corpus. A statistical approach uses the statistics of a bilingual corpus and a language model. The advantage of this approach is the ability to produce suitable translations, even if a given sentence is not similar to any sentence in the training corpus.
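One common way to combine the two sources of statistics is the noisy-channel formulation, in which the chosen translation T maximises P(S | T) · P(T) for a source sentence S. The sketch below illustrates that scoring on three candidate outputs. All probabilities are invented for illustration, and in a real system the translation model would be estimated from a bilingual corpus and the language model from target-language text.

```python
import math
from typing import Dict

# Invented translation-model scores P(source | candidate) for one source sentence.
TRANSLATION_MODEL = {
    "Adam ate the apple": 0.4,
    "ate Adam the apple": 0.4,
    "the apple ate Adam": 0.2,
}

# Invented bigram language model for the target language; unseen bigrams get a floor.
BIGRAMS = {
    ("<s>", "Adam"): 0.2, ("Adam", "ate"): 0.3, ("ate", "the"): 0.4,
    ("the", "apple"): 0.5, ("apple", "</s>"): 0.6,
}
FLOOR = 1e-4

def lm_logprob(sentence: str) -> float:
    """Log-probability of a sentence under the toy bigram model."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log(BIGRAMS.get(pair, FLOOR)) for pair in zip(words, words[1:]))

def best_translation(candidates: Dict[str, float]) -> str:
    """Pick the candidate maximising log P(S|T) + log P(T)."""
    return max(candidates, key=lambda t: math.log(candidates[t]) + lm_logprob(t))

best = best_translation(TRANSLATION_MODEL)
print(best)
```

The first two candidates tie on the translation model, and the language model breaks the tie in favour of the fluent word order.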

A classic example of this approach is IBM’s work on French-English translation, using the Canadian Hansards. A good survey of Statistical Machine Translation (SMT) was produced by Lopez (2008). Josef and Ney (2000) discussed five IBM alignment models and presented different single-word based alignment models for statistical machine translation. Marcu (2001) presented an algorithm to translate natural language sentences by exploiting both a statistical-based translation model and Translation Memory (TM). The results show that an automatically derived translation memory can often be used within a statistical framework to find translations of higher probability than those found only using a statistical model. Zavrel et al. (1997) used a statistical method, Memory-Based Learning, to address prepositional phrase attachment resolution. In recent years, statistical machine translation has contributed to the significant resurgence of interest in machine translation. It is now the most widely studied machine translation method, but we believe that statistical machine translation has not yet met people’s requirements in terms of quality.

2.3 Example-based (EBMT)

Similar to the statistical approach, this approach does not require linguistic knowledge, but it does need a large bilingual corpus. Example-based MT can produce suitable translations when a given sentence is similar to sentences in the training data.
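The retrieval step that this dependence on similar sentences implies can be sketched with the Python standard library. The example store, the romanized source sentences, and the token-level similarity measure are assumptions made for illustration; the adaptation of the retrieved translation to the new input is not shown.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# Hypothetical example base of (source, target) sentence pairs.
EXAMPLES: List[Tuple[str, str]] = [
    ("akala Adam al-tuffaha", "Adam ate the apple"),
    ("qara'a Khalid al-kitab", "Khalid read the book"),
]

def similarity(a: str, b: str) -> float:
    """Token-level similarity between two sentences, between 0 and 1."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()

def retrieve(source: str) -> Tuple[str, str]:
    """Return the stored example whose source side is closest to the input."""
    return max(EXAMPLES, key=lambda example: similarity(source, example[0]))

match = retrieve("qara'a Adam al-kitab")
print(match)
```

The input shares two of three tokens with the second example and one with the first, so the second pair is retrieved. An input that resembles no stored example would still return some pair, only with a low score, which is why coverage of the example base matters.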

Over the last decade, Example-Based Machine Translation (EBMT) has shown great progress (Furuse and Iida 1992; Nirenburg et al. 1994; Brown and Ralf 1996; Nagao 1997). Furuse and Iida (1992) presented a method called Transfer-Driven Machine Translation (TDMT), which utilized an example-based framework for various processes and combined multi-level knowledge. Meanwhile, Richardson et al. (2001) described Microsoft Research Machine Translation (MSR-MT) as a large-scale example-based Machine Translation system (with some statistical components) for several language pairs. The system was tested on English-Spanish. The evaluation results showed that MSR-MT’s integration of rule-based parsers, statistical techniques, and example-based processing produced translations of high quality, which exceeded those of un-customized commercial MT systems.

2.4 Knowledge-based (KBMT)

Knowledge-Based Machine Translation (KBMT) systems are based on the point that “high quality translation requires in-depth understanding of the text” (Arnold et al. 1994). This approach requires reference to real-world knowledge, as well as knowledge of the “differences in cultural backgrounds and differences in conceptual divisions” (Hutchins and Harold 1992) between diverse languages.


Fig. 6 Knowledge-based MT approach (components shown: source language text, batch system, semantic knowledge base, representation of meaning, target language text)

Knowledge Based Machine Translation (KBMT) (as shown in Fig. 6) was implemented in a pilot system called SAM (Script Applying Mechanism). It was a multifaceted project to explore the role of stereotypical domain knowledge in automated text understanding. The syntactic analysis, which was required to build a language-free meaning representation, has both advantages and disadvantages over earlier approaches. KBMT creates the possibility of a true multilingual translation by the abandonment of transfer grammars in favour of more principled parsing and generation techniques. The KBMT approach requires a parser to map the source language into semantic symbols and a generator to map those symbols into the target language. As a result, “KBMT systems rely on an augmenter” (Trujillo 1999). Furthermore, a source text can be translated into many languages, as it only needs to be parsed once, and the resulting meaning representation can be generated in each target language. Generation is a simpler, less computationally demanding process. Thus, KBMT makes the process of multilingual translation far more computationally tractable. It also greatly reduces the amount of development work required to reach eventual closure in the number of grammars needed to translate between all commonly spoken human languages.

Usually, KBMT systems require huge amounts of knowledge. Mitamura et al. (1991) improved a Knowledge-based, Accurate Natural-language Translation (KANT) system that reduces this requirement, to produce handy, scalable, and perfect KBMT applications. Richardson et al. (2001) described Microsoft Research Machine Translation (MSR-MT) as an EBMT system that generates output with high quality in a specific domain, which exceeds commercial machine translation systems. The system was applied to both Spanish to English and English to Spanish translation. Tahir et al. (2010) proposed and designed a new knowledge-based machine translation system to overcome problems such as syntactic and structural ambiguity, lexical ambiguity, polysemy, discourses, anaphoric ambiguity, and different shades of meanings, by using data mining and text mining techniques. The system fulfilled all of the requirements of computational linguistic natural language processing. The system was designed for Urdu, but it could be used for many other languages.

2.5 Hybrid method

Hybrid methods are used to incorporate higher level abstract syntax rules to arrive at the final translation. Hybrid methods have been explored in the research community without any real success, due to the difficulty of merging fundamentally different approaches. New algorithms were developed that draw on knowledge of how words, phrases, and patterns should be translated, knowledge of how syntax-based translation rules should be applied, and knowledge of syntactically based target structures.


Groves and Way (2005) incorporated marker chunks with statistical machine translation sub-sentential alignments. They discovered that it was capable of outperforming both baseline translation models. After that, Groves and Way (2006) continued their research by developing new hybrid Statistical (SMT) and Example-based (EBMT) systems. The hybrid system outperformed both SMT and EBMT baseline systems. Langlais and Simard (2002) attempted to use the hybrid system by merging Example-Based and Statistical Machine Translation, but to no avail.

Paul et al. (2005a) presented a multi-engine hybrid approach to MT. The two main strategies used in corpus-based translation are, firstly, EBMT, which retrieves the translation examples that are best matched to an input and then adjusts them to obtain the translation; and secondly, Statistical Machine Translation, which learns translation models from corpora and dictionaries and then searches for the best translation. The system was applied to Japanese-to-English and Chinese-to-English translation. Paul et al. (2005b) proposed an approach to integrate example-based and rule-based machine translation systems with statistical methods. The source language input is paired with an initial translation hypothesis. The outputs that are generated by multiple translation engines, such as rule-based and example-based systems, are used as the initial translation hypotheses. This approach was applied to the Japanese-to-English translation of conversation in the travel domain.
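The general idea of treating the outputs of several engines as hypotheses and letting a statistical component choose among them can be sketched as follows. This is not the method of Paul et al.; it is a simplified stand-in in which two stub engines return fixed strings and a toy fluency score, the fraction of known target-language bigrams, selects the output. All names and data are hypothetical.

```python
from typing import Callable, Dict, Tuple

def rule_based_engine(source: str) -> str:
    """Stub engine that keeps the source word order."""
    return "ate Adam the apple"

def example_based_engine(source: str) -> str:
    """Stub engine that returns a retrieved example translation."""
    return "Adam ate the apple"

# Toy set of target-language bigrams assumed to come from monolingual text.
KNOWN_BIGRAMS = {
    ("<s>", "Adam"), ("Adam", "ate"), ("ate", "the"),
    ("the", "apple"), ("apple", "</s>"),
}

def fluency(sentence: str) -> float:
    """Fraction of the sentence's bigrams that appear in KNOWN_BIGRAMS."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    pairs = list(zip(words, words[1:]))
    return sum(pair in KNOWN_BIGRAMS for pair in pairs) / len(pairs)

def select(source: str, engines: Dict[str, Callable[[str], str]]) -> Tuple[str, str]:
    """Run every engine and return (engine name, hypothesis) with the best score."""
    hypotheses = {name: engine(source) for name, engine in engines.items()}
    best = max(hypotheses, key=lambda name: fluency(hypotheses[name]))
    return best, hypotheses[best]

ENGINES = {"rule_based": rule_based_engine, "example_based": example_based_engine}
chosen = select("akala Adam al-tuffaha", ENGINES)
print(chosen)
```

Selection is the simplest form of combination; the approaches cited above go further by using the hypotheses as starting points for statistical processing.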

3 Related work

3.1 General Arabic machine translation

Arabic is the Qur’anic language, and there are millions of people that need to understand the Quran (the Muslims’ holy book). Therefore, efficient techniques that work with special rules should be available to generate a useful computer system to produce high quality translations. The issue of enhancing the quality of machine translations has been gaining interest amongst researchers in recent years.

Salem et al. (2008) reported that computerised translations, rather than manual translations, can save a lot of effort and cost. Arabic is notorious for its complex morphology (McCarthy 1979; Azmi 1988; Beesley 1998; Ratcliffe 1998; Ibrahim 2002). Arabic has always been a challenge in computational morphology and a difficult testing ground for morphological analysis technologies. When translating from a morphologically rich language, the text first passes through several pre-processing steps, known collectively as tokenization. Habash and Sadat (2006) and Lee (2004) stated that tokenization is helpful when translating Arabic. Arabic text is commonly segmented by simple punctuation tokenization, but this level of tokenization is not enough for syntactic analysis (Hatem and Omar 2010).

Attia (2007) implemented a rule-based tokenizer that handles tokenization as a pre-processing stage in MT. The advantage of this implementation is that it is more manageable and deterministic to debug. Its drawback is a lack of robustness, which limits its applicability, as no single morphological transducer can claim to cover every word in a language. Different models of tokenization are applied at different levels of linguistic depth, while the tokenizer interacts with other components. According to Beesley and Karttunen (2003), based on the level of analysis, there are three strategies to develop Arabic morphologies:

1. One-level rules: analysing Arabic at the stem level and using regular concatenation.
2. Two-level rules: analysing Arabic words as being composed of roots and patterns in addition to concatenations.
3. Three-level rules: analysing Arabic words as being composed of roots, templates, and vocalization, besides concatenations.
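The root-and-pattern view behind the two-level strategy can be sketched in a few lines. This is only an illustration, not the finite-state machinery of Beesley and Karttunen: the simplified Latin transliteration, the `C` slot notation, and the tiny root and pattern inventories are assumptions.

```python
def apply_pattern(root, pattern):
    """Fill each 'C' slot of the pattern with the next consonant of the root."""
    slots = pattern.count("C")
    if slots != len(root):
        raise ValueError(f"pattern has {slots} slots but root has {len(root)} consonants")
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)


def analyse(word, roots, patterns):
    """Brute-force analysis: every (root, pattern) pair that generates the word."""
    return [
        (root, pattern)
        for root in roots
        for pattern in patterns
        if pattern.count("C") == len(root) and apply_pattern(root, pattern) == word
    ]


# Illustrative inventories in a simplified transliteration (assumed, not exhaustive).
ROOTS = ("ktb", "drs")                        # 'write', 'study'
PATTERNS = ("CaCaCa", "CaaCiC", "maCCuuC")    # perfective verb, active and passive participle

print(apply_pattern("ktb", "CaCaCa"))   # kataba
print(analyse("maktuub", ROOTS, PATTERNS))
```

A stem-based (one-level) analyser would instead store forms such as `kataba` and `maktuub` directly as lexicon entries and handle only affix concatenation.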

Attia (2005) developed a morphological analyser that uses a one-level rules approach, which considers stems as the base forms of Arabic words and handles spelling differences through alteration rules. Alansary et al. (2009) highlighted the four axes of analysis, which are morphological analysis, lexical analysis, syntactic analysis, and semantic analysis. They presented a test roadmap for Arabic corpus analysis, following a stem-based approach to be used in analysing the International Corpus of Arabic. They also discussed general considerations to bear in mind when starting the process of analysing this corpus. Köpr and Miller (2009) presented a powerful Arabic morphological analyser and generator. Their system was used as a component in both rule-based and statistical machine translation systems. The authors gave implementation details on nominal morphology, verbal morphology analysis, lexicon, derivational morphology, and morphological generation. The overall accuracy rate for their system was 91.4%. Some of the errors originated from words that did not exist in the lexicon, for which there was therefore no analysis. For the remaining errors, an incorrect analysis was produced for the source language.

Žabokrtský and Smrž (2003) developed a dependency grammar for Arabic, with a focus on the automatic transformation of the phrase-structure syntactic trees of Arabic into dependency-driven analytical ones. Meanwhile, Ditters (2001) wrote a grammar for Arabic using the AGFL formalism (Affix Grammars over a Finite Lattice).

Usually, any sentence that has two or more structural representations is said to be syntactically ambiguous. Sometimes, Arabic sentences with only one structural representation may be ambiguous. Daimi (2001) described a technique for identifying syntactic ambiguity in single-parse Arabic sentences, using the Definite Clause Grammar formalism. His work analysed each sentence and validated the conditions that govern the survival of certain types of syntactic ambiguities in Arabic sentences.

Attia (2008) described the main syntactic structures in Arabic within the LFG framework. He built the first Arabic parser using the Xerox Linguistics Environment, which allowed the writing of grammar rules and notations that follow the LFG formalisms. This parser was only tested on short sentences in the news domain. Spence and Christopher (2010) created higher parsing baselines and showed that Arabic parsing performance is not as poor as previously thought. They described the grammar state splits that significantly improve parsing performance, catalogued parsing errors, and quantified the effect of segmentation errors.

Othman et al. (2003) reported an attempt to create an efficient chart parser for analysing Modern Standard Arabic (MSA) sentences. Their parser was able to satisfy syntactic constraints, thus reducing parsing ambiguity. Lexical semantic features were used to disambiguate sentence structure. The authors explained that their Arabic morphological analyser depends on an Augmented Transition Network (ATN) technique. They used Prolog to implement both the Arabic parser and the Arabic morphological analyser. Linguistic rules were only obtained from sentences in the agriculture domain.

Habash (2010) discussed Modern Standard Arabic. The author focused on Arabic script, phonology, orthography, morphology, syntax and semantics, and on machine translation issues specific to Arabic, such as morphology and Arabic script. Larkey et al. (2002) described a large-scale system that performs morphological analysis and generation of on-line Arabic words, represented in the standard orthography, whether wholly vowelled, partially vowelled, or un-vowelled. The analysis shows the root, pattern, and all other affixes, together with characteristic tags indicating part-of-speech, person, number, voice, mood, aspect, etc. The system depended on the lexicons and rules from a two-level morphological system, reworked extensively using Xerox Finite-State Morphology tools.

Beesley (1996) described a finite-state morphological analyser of written Modern Standard Arabic words (available on the internet at http://www.xrce.xerox.com/research/mltt/arabic). The system is composed of the analyser, which runs on a network server, and Java applets that run on the user’s machine. It handles words in standard Arabic orthography for both input and output. Darwish (2002) presented a way to rapidly develop a shallow Arabic morphological analyser, based on automatically derived rules and statistics. The system was limited to a fixed set of roots, and it could not handle some rare Arabic words that constitute complete sentences and do not appear in the training set. The analyser also lacked the ability to decipher affix combinations.

Abu Shugier and Sembok (2007) asserted that “Arabic differs tremendously in terms of its characters, morphology, and diacritic, from other languages; and to claim otherwise would be a mistake.” Attia (2008) mentioned that the traditional classification of Arabic parts-of-speech into nouns, verbs and particles is not sufficient for a complete computational grammar. This was confirmed by Farghaly and Senellart (2003), Chalabi (2001), and Alsalman (2004), who described and evaluated Arabic machine translation and pointed out the rules that must be followed in Arabic translation.

3.2 Arabic-to-English MT

Arabic has attracted attention in the Natural Language Processing (NLP) community because of its political importance and the linguistic differences between it and other languages. These linguistic features (particularly its complex morphology) present motivating challenges for Arabic language researchers. Significant work has been done in Arabic natural language processing in applications such as machine translation (Farghaly and Senellart 2003; Shaalan et al. 2004), entity extraction (Shaalan and Raza 2009), and sentiment analysis (Almas and Ahmed 2007).

Farghaly and Shaalan (2009) presented several solutions that would guide current and future practitioners in the field of Arabic natural language processing, such as general features and specific properties of Arabic, and they also highlighted the significance of Arabic. Furthermore, the paper presented solutions that have already been adopted by some pioneering researchers in the field of Arabic natural language processing.

Lee et al. (2003) presented a robust word segmentation algorithm, which segments a word into a sequence of prefixes, a stem, and suffixes. Their method is seeded with a small manually segmented Arabic corpus, which is used to bootstrap an unsupervised algorithm that builds a segmenter for Arabic words from a large un-segmented Arabic corpus. The algorithm can identify any number of suffixes and prefixes of a given token. It can generally be applied to different language families. The algorithm achieves about 97% segmentation accuracy on a development test corpus containing 28,449 word tokens.
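To make the prefix-stem-suffix decomposition concrete, the following sketch strips affixes greedily from a word written in Buckwalter-style transliteration. It is not the language-model-based algorithm of Lee et al.: the affix lists, the minimum stem length, and the greedy strategy are assumptions chosen only to show the shape of the output.

```python
# Illustrative affix inventories (assumed, far from complete).
PREFIXES = ("w", "Al", "b", "l")   # conjunction 'and', definite article, prepositions
SUFFIXES = ("At", "hm", "h")       # feminine plural, pronominal clitics


def _strip(word, affixes, min_stem, from_start):
    """Repeatedly remove the longest matching affix while the stem stays long enough."""
    removed = []
    changed = True
    while changed:
        changed = False
        for affix in sorted(affixes, key=len, reverse=True):
            matches = word.startswith(affix) if from_start else word.endswith(affix)
            if matches and len(word) - len(affix) >= min_stem:
                if from_start:
                    removed.append(affix)
                    word = word[len(affix):]
                else:
                    removed.insert(0, affix)
                    word = word[:-len(affix)]
                changed = True
                break
    return removed, word


def segment(word, min_stem=3):
    """Split a token into (prefixes, stem, suffixes)."""
    prefixes, rest = _strip(word, PREFIXES, min_stem, from_start=True)
    suffixes, stem = _strip(rest, SUFFIXES, min_stem, from_start=False)
    return prefixes, stem, suffixes


print(segment("wAlktAb"))   # 'and the book'
print(segment("ktAbhm"))    # 'their book'
```

Greedy stripping over-segments stems that happen to begin or end with an affix string, which is one reason a statistical segmenter trained on corpus data is preferable.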

Bisazza and Federico (2010) proposed a chunk-based reordering technique to automatically identify and move clause-initial verbs in the Arabic side of a word-aligned parallel corpus. The method is applied to reprocess the training data and to collect statistics about verb movements. From this analysis, verb reordering patterns are identified and applied to the test sentences before decoding them. This technique handled the most important cases of verb reordering in Arabic-English, focusing only on the problem of VSO sentences. Carpuat et al. (2010) presented a method for improving overall SMT quality using a syntactic parser to reorder VS constructions into SV for Arabic-to-English word alignment. The authors did not totally solve the problem, because many verb re-orderings were missed, even though the resulting system surpassed a strong baseline in terms of BLEU scores and produced more globally readable translations.
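A chunk-level VS-to-SV rule of the kind discussed above can be sketched as follows. The chunk labels, the transliterated example, and the single hard-coded rule are assumptions; the cited systems derive their reordering decisions from parsers, alignments, and corpus statistics rather than from one fixed rule.

```python
def reorder_vs_to_sv(chunks):
    """Move a clause-initial verb chunk after the subject noun chunk that follows it.

    `chunks` is a list of (label, tokens) pairs produced by some shallow parser.
    """
    if len(chunks) >= 2 and chunks[0][0] == "VP" and chunks[1][0] == "NP":
        return [chunks[1], chunks[0]] + list(chunks[2:])
    return list(chunks)


# 'ktb Al wld Al drs' (wrote the-boy the-lesson), chunked by hand for illustration.
vso_clause = [("VP", ["ktb"]), ("NP", ["Al", "wld"]), ("NP", ["Al", "drs"])]
svo_clause = reorder_vs_to_sv(vso_clause)
print(" ".join(tok for _, toks in svo_clause for tok in toks))
```

After reordering, the Arabic side follows the subject-verb-object order of English, which tends to make word alignment more monotone.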

Nguyen and Vogel (2008) presented a context-dependent morphology pre-processing technique for Arabic-English translation. The authors used Arabic morphology-English alignment to learn a model that removes non-aligned Arabic morphemes. They discussed the relation between the size of the reordering window and morphology processing. The model was only applied to a travel-domain system and a news-domain system. Abraham and Salim (2005) presented a word alignment algorithm for Arabic-English based on supervised alignment data, and contrasted its performance with human annotation performance. Shirko et al. (2010) developed a machine translation system called Npae-Rbmt that translated Arabic noun phrases into English using a transfer-based approach. The method was tested on 88 thesis titles and journals from the computer science domain. The accuracy of this system’s results was 94.6%.

Salem et al. (2008) presented an Arabic to English machine translation system called UniArab, which was based on the Role and Reference Grammar model, and detailed the system’s design and how it accommodated the particulars of Arabic to generate English. Because the system was implemented with a limited lexicon, it failed to translate many words whose structure did not exist in the system. The system also did not deal with ambiguity.

Chafia and Ali (1995) presented an attempt to perform MT from Arabic to English and French. They showed that Arabic must be analysed and reordered according to Arabic rules to achieve good results. Yassine et al. (2010) investigated the possibility of building a high performance Arabic NER (automated Named Entity Recognition) system by using lexical, syntactic, and morphological features, and augmenting the model with deeper lexical features and more syntagmatic features. These features were extracted from noisy data obtained via projection from an Arabic-English parallel corpus. The results showed that the system achieved a significantly high performance for almost all data-sets, which were obtained from broadcast news only. Larkey et al. (2002) used the spelling feature to measure the string kernel distance between Romanised Arabic and English words.

As the above shows, machine translation from Arabic to English is a difficult task, and efficient techniques working with special rules should be available to generate a useful system. The rule-based approach is the most popular method for machine translation from Arabic to English. This is due to the complexity of Arabic morphological and syntactic structures, which require many processes, such as word segmentation (segmenting a word into a sequence of prefixes, a stem, and suffixes) and word analysis (analysing and reordering Arabic according to Arabic rules to achieve good results).

3.3 English-to-Arabic MT

There are many orthographical differences between Arabic and English, which have to be taken into consideration by MT developers. Italics are used in English to give emphasis to a word, but in Arabic, emphasis is shown by a change in word order or the introduction of an emphatic word.

Badr et al. (2009) applied syntactic phrase reordering in English-to-Arabic statistical machine translation, and introduced reordering rules that were motivated linguistically. The work also studied the effect of combining reordering with Arabic morphological segmentation, a pre-processing technique to improve Arabic to English and English to Arabic translation. Although this phrase-based statistical machine translation proved to be a robust and effective approach to machine translation, it had a limited capacity to deal with long distance phenomena, because it relied on local alignments.

Elming and Habash (2009) studied syntactic reordering and the effect of the alignment method on learning reordering rules within English to Arabic translation tasks. They achieved significant improvements in translation quality. Toutanova et al. (2008) improved the quality of Statistical Machine Translation (SMT) by applying models that predicted word forms from their stems using extensive morphological and syntactic information from both the source and the target languages. They applied the inflection generation models in translating English into two morphologically complex languages, namely Russian and Arabic, and showed that an independent model of morphology generation can be successfully integrated with an SMT system, making improvements in both phrasal and syntax-based MT. Their model achieved an accuracy of over 91%, which suggests that the model was effective when its input was clear in its stem choice and order.

Attia (2003) discussed the translation of English into Arabic using a transfer approach. The study focused on the analysis of English as a source language, problems related to the transfer of English into Arabic, and the generation of Arabic as a target language, and it dealt with agreement as one of the characteristics that greatly affects the output of MT. The study was limited to electronic texts, i.e., texts which are written in a machine readable format. Abu Shugier (2009) presented a rule-based approach to English to Arabic MT, with emphasis on the handling of word agreement and ordering. A major design goal of this system was that it would be used as a tool and integrated with a general machine translation system. The total score for this system was 96.1%.
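The two issues named above, agreement and ordering, can be shown on the smallest possible case: an English adjective-noun pair transferred into Arabic, where the adjective follows the noun and agrees with it in gender. The sketch below is a toy illustration, not the rule set of any cited system; the lexicon, the simplified transliteration, and the function name are assumptions.

```python
# Toy bilingual lexicon in simplified transliteration (assumed for illustration).
NOUNS = {
    "house": {"ar": "bayt", "gender": "m"},
    "city": {"ar": "madina", "gender": "f"},
}
ADJECTIVES = {
    "big": {"m": "kabir", "f": "kabira"},
    "new": {"m": "jadid", "f": "jadida"},
}


def translate_noun_phrase(adjective, noun):
    """Transfer English 'ADJ NOUN' into Arabic 'NOUN ADJ' with gender agreement."""
    entry = NOUNS[noun]
    # Ordering rule: the Arabic adjective follows the noun it modifies.
    # Agreement rule: the adjective takes the gender of the noun.
    return f"{entry['ar']} {ADJECTIVES[adjective][entry['gender']]}"


print(translate_noun_phrase("big", "city"))    # madina kabira
print(translate_noun_phrase("new", "house"))   # bayt jadid
```

A full system would also have to handle number, definiteness, and case, which are omitted here.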

Modern Arabic has agreement asymmetries that are sensitive to word order effects. Aoun et al. (1994) proposed an analysis of first conjunct agreement in verb sentences in Lebanese Arabic and Moroccan Arabic. They argued that the agreement of number-sensitive items is caused by clausal coordination. Guessoum and Zantout (2005) presented a methodology for evaluating Arabic Machine Translation (MT) systems. Their evaluation methodology was applied to four English-Arabic commercial MT systems, and the results of the evaluation of these systems were presented for the domains of the Internet and Arabization.

Hatem and Nassar (2008) introduced a modified Dijkstra’s shortest path algorithm to identify target language phrases. The method lists the indexes of the source sentence’s words that were found in the target language corpus, and constructs a directed graph to identify the phrases that form a shortest path walk in the graph. The method was used in a hybrid English to Arabic MT system, which combines rule-based and example-based machine translation techniques. Shaalan et al. (2004) implemented a transfer-based MT system to translate a fairly complex English NP into Arabic. The system was applied to 66 real thesis and journal titles from the computer science domain. The accuracy of the results was 94.6%. This significant improvement was attributed to the use of specific rules of noun phrases.
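For reference, the textbook form of Dijkstra's algorithm on a small directed graph is sketched below. The modification introduced by Hatem and Nassar is not reproduced here. The interpretation of nodes as positions between source words and of edges as candidate phrases with integer costs is an assumption made for the example.

```python
import heapq


def shortest_path(graph, source, target):
    """Textbook Dijkstra on a directed graph given as {node: {neighbour: cost}}."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue  # stale queue entry
        visited.add(node)
        if node == target:
            break
        for neighbour, cost in graph.get(node, {}).items():
            candidate = d + cost
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                prev[neighbour] = node
                heapq.heappush(heap, (candidate, neighbour))
    if target not in dist:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]


# Hypothetical phrase lattice: an edge i -> j is a phrase covering source words i..j-1.
lattice = {0: {1: 2, 2: 3}, 1: {2: 2, 3: 2}, 2: {3: 2}, 3: {}}
path, cost = shortest_path(lattice, 0, 3)
print(path, cost)
```

With costs chosen so that longer matched phrases are cheaper, the shortest path corresponds to a segmentation of the sentence into few, well-attested phrases.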

Finally, machine translation from English to Arabic uses methods slightly different from those discussed for Arabic to English. Most experiments dealt with agreement as one of the characteristics that greatly affects the output of machine translation from English to Arabic, as well as making reordering improvements.

3.4 From Arabic to other languages and vice versa

Several papers discussed the major approaches to machine translation from and to Arabic. For example, machine translation approaches from French to Arabic are rare. According to Alsharaf et al. (2004), French cannot be translated into Arabic using existing approaches (i.e., direct, transfer, pivot, and statistical). This may be because the two languages are linguistically distant, which requires that certain linguistic phenomena specific to the pair be analysed. Such phenomena do not necessarily occur in other language pairs.

Alsharaf et al. (2004) presented an approach to the machine translation of French to Arabic. They incorporated certain aspects of existing approaches (i.e., direct, transfer, pivot, and statistical) in some of their system’s steps. They introduced new functions to treat language pairs whose members are linguistically distant. French morphology is very different from that of Arabic, because French has no method for constructing nouns, adjectives, adverbs, and agent nouns according to a verb pattern, as Arabic does.

Hasan et al. (2006) first presented a statistically driven machine translation system for Arabic to French and applied the system to the medical domain. They also described the steps needed to create a system for corpus acquisition, pre-processing (such as an Arabic tokenizer), training the models, and generating translations. Debili (1992) handled the problem of the automatic alignment of sentences belonging to bilingual text pairs. The experiments were applied to French-English and French-Arabic text pairs. Meanwhile, Mostefa et al. (2009) presented a semi-automatic annotation with Named Entities of a multilingual corpus for Arabic, English, and French. The text corpus was made of comparable newswires from Agence France-Presse, covering the period 2004 to 2006. The method used for producing the corpus was iterative, and the annotations were checked manually and corrected if necessary. Statistics of the corpus were presented and compared with the annotation results for the three languages.

Besançon et al. (2009) presented the InFile evaluation paradigm (INformation, FILtering, and Evaluation) in general and focused on a study of the Arabic part of the corpus in particular. They observed a coverage mismatch between profiles and Arabic documents, and also discussed the problems that may arise when trying to transfer information from English and French to Arabic. Guidere (2002) applied a form of corpus-based machine translation that depended on a bilingual corpus of French and Arabic texts and translation part alignment. The author used alignment for combining linguistic and statistical information. He also proposed procedures to construct a machine translation system based on parallel translated corpora. Moghrabi (1998) described a machine translation system between French and Arabic in the sub-world of cooking recipes. He described the design of the generation component and how this design allows a variety of outputs, all expressing the same conceptual meaning. This system was of the knowledge-based family of Interlingua translation systems. It focused on the importance of the meaning of the text being processed and articulated all of its available knowledge-bases in order to achieve a flexible meaningful wording.

Another language that attracted the attention of researchers was Chinese. In common use, Chinese uses a complex orthography that contains about 10,000 characters, which express semantic rather than phonological information. Chinese is written either from left-to-right or top-down, and words are written without spaces. The major challenge for Chinese processing is word segmentation. There is no morphology in Chinese, but it does have limited nominal and verbal aspects. The first work in Arabic-Chinese MT was by Habash and Jun Hu (2009). They presented a comparison of two approaches for Arabic-Chinese machine translation using English as a pivot language. Their system handled many complex Arabic-Chinese syntactic variations. The results showed that using English as a pivot language was better than a direct translation from Arabic to Chinese.
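Structurally, pivot translation is the composition of two translation steps. The sketch below shows only that composition, using toy word-for-word dictionaries in transliteration and pinyin; the dictionaries and function names are assumptions, and the systems compared by Habash and Jun Hu are statistical rather than dictionary-based.

```python
def make_translator(table):
    """Build a word-for-word translator from a dictionary; unknown words pass through."""
    def translate(sentence):
        return " ".join(table.get(word, word) for word in sentence.split())
    return translate


def pivot(source_to_pivot, pivot_to_target):
    """Compose two translators so that the pivot language is only an intermediate form."""
    return lambda sentence: pivot_to_target(source_to_pivot(sentence))


# Toy dictionaries (assumed): transliterated Arabic -> English -> pinyin Chinese.
arabic_to_english = make_translator({"ktAb": "book", "jdyd": "new"})
english_to_chinese = make_translator({"book": "shu", "new": "xin"})
arabic_to_chinese = pivot(arabic_to_english, english_to_chinese)

print(arabic_to_chinese("ktAb jdyd"))
```

The practical appeal of a pivot is data availability: parallel corpora for Arabic-English and English-Chinese are far larger than those for Arabic-Chinese.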

Japanese is the closest in form to Chinese, but less common in machine translation. The word order of Japanese is Subject Object Verb (SOV). Japanese nouns have no grammatical number, gender, or features, and some words are usually translated as pronouns. Bouillon et al. (2008) described an interlingua-based medical speech translation system between Japanese and Arabic and vice versa. They also described a simple generic tool for debugging Interlingua translation rules, and a method for improving speech understanding performance by re-scoring N-best speech hypothesis lists. They used statistical tuning methods to increase efficiency.

Spanish is also attracting researchers’ attention. Spanish is a language with a two-gender system. As for syntax, the sentence word order is Subject Verb Object, though variations are possible. Doaa and Ana (2008) focused on discourse markers as key elements in guiding the inferences of statements and as an aid to natural language processing. This was achieved with a rule-based approach that addresses aspects of discourse structure and coherence through the automatic identification, classification, and annotation of the discourse markers in a multilingual parallel corpus (i.e., Arabic-Spanish-English). The research provided an important resource for the community, as it presented a multilingual computational processing of different kinds of discourse markers. Furthermore, the research addressed Arabic from a computational pragmatic perspective, where the classification, identification, and annotation processes were implemented using the information provided by the tagging of Spanish discourse markers and alignments.

Doaa et al. (2006) developed a multilingual parallel corpus (Arabic-Spanish-English) aligned at the sentence level and tagged at the POS level, which is a valuable resource for this translation. The corpus covers the Arabic-Spanish, English-Arabic, and English-Spanish pairs. The results of this method were over 90% when evaluated against a gold standard, although the percentages differed from one language pair to another.

Another language gaining interest among researchers is Hindi. The word order of Hindi is Subject-Object-Verb, and nouns are either masculine or feminine. The verb must agree with its subject in both number and gender, and adjectives must agree with their nouns. Some words in Hindi can be translated into different forms whose meaning is approximately the same, and their translation depends upon grammatical context.

Mark et al. (2004) presented a comparative analysis of relative clauses in machine translation of Hindi and Arabic, in the tradition of the Paninian Grammar Framework, which leads to deriving a common logical form for equivalent sentences. Parallels are drawn between the Hindi co-relative construction and resumptive pronouns in Arabic. More details about relative clauses in Hindi and Arabic can be found in Mark et al. (2004).

In conclusion, approaches to machine translation between Arabic and other languages differ considerably, owing to the different features of each language, such as morpho-syntactic and agreement features; each language also has its own challenges and ambiguities. Figure 7 summarises the available methods from the literature review that were employed to develop machine translations from/to Arabic.

4 Discussion

Concluding this review, it is clear that machine translation can be, and in fact has been, addressed using a variety of different approaches. In this review, we have focused our discussions on the application of these approaches to Arabic machine translation problems, and taken special note of which of these approaches might be suitable for dealing with Arabic features. There are many common elements in the systems, although there is also a growing diversity. The transfer approach has made swift progress and there is great optimism for its future success. It is more suited to certain types of Arabic challenges, because the level of abstractness it requires is easy to reach, and its level of analysis is attainable and easy to implement.

[Figure 7: tree diagram classifying the surveyed work. General machine translation branches into rule-based (direct, interlingua, transfer), statistical, example-based, knowledge-based, and hybrid methods. Arabic machine translation branches into Arabic to English, English to Arabic, and Arabic to other languages and vice versa, each subdivided into rule-based, statistical, and other approaches, with the cited studies listed under each branch.]

Fig. 7 Summary of machine translation approaches

Developing a transfer-based MT system requires less time and effort than developing an Interlingua system. This is why most commercial systems apply the transfer approach. One noticeable trait from our review is that most approaches proposed for machine translation systems have only been tested on limited domains. This situation is understandable from a practical perspective. However, from a research point of view, it often makes it very difficult to assess how a system performs in comparison to other systems. There seems to be confusion in the field about how different machine translation systems should be formally tested and evaluated against one another. As we have seen in this survey, different approaches and different challenges seek to achieve different sets of objectives, making it difficult to perform comparisons in many cases. Comparison is also difficult across the individual studies, which often report different performance measures and run different numbers of trials when analysing and testing their systems.

Most of the community evaluations focused on the translation of news and government texts. There is very little work on translation in other domains, particularly those that account for much of the information found on the internet, where translation is in demand. Usually, the authors tested their machine translation systems using BLEU, an automatic evaluation metric used in the development and research cycle of machine translation technology. BLEU measures the n-gram similarity of a candidate translation to a set of reference translations. It has a wider applicability than just MT. It could also be extended to evaluate natural language generation and summarization systems.
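The n-gram similarity at the heart of BLEU can be sketched as a sentence-level score: clipped (modified) n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. The sketch below is a simplified illustration under those assumptions; it omits smoothing and corpus-level aggregation, so published scores should be computed with an established implementation instead.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each n-gram counts at most as often as in any one reference."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter one).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


reference = "the cat is on the mat".split()
print(bleu(reference, [reference]))
print(modified_precision(["the"] * 7, [reference], 1))
```

Clipping is what prevents a degenerate candidate such as seven repetitions of "the" from receiving a high unigram precision.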

5 Conclusion and future work

Arabic has a flexible word order, which makes it a significant challenge for MT, because a sentence in Arabic can be expressed in various subject-verb-object combinations with the same meaning. In Arabic, three elements make up a sentence, namely subject, verb and object. Using all of these elements, Arabic sentences can be classified into four types according to their word order, i.e., SVO, VSO, VOS, and SOV. Thus, it is a difficult task to find a machine translation that meets human requirements. It is not yet clear whether machine translation can satisfy people’s requirements in terms of translation quality and retrieval time. We assume that many kinds of phenomena exist, some of which are suitable for MT.
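The four word orders listed above can be made concrete with a small sketch that reads the order off a role-labelled clause and normalises it to SVO. The role labels, the one-token-per-constituent simplification, and the transliterated example are assumptions; real clauses need parsing before any such labels are available.

```python
RANK = {"S": 0, "V": 1, "O": 2}


def word_order(clause):
    """Return the order of subject, verb, and object in a role-labelled clause."""
    return "".join(role for _, role in clause if role in RANK)


def to_svo(clause):
    """Reorder the subject, verb, and object of a clause into SVO order."""
    core = [item for item in clause if item[1] in RANK]
    return [word for word, _ in sorted(core, key=lambda item: RANK[item[1]])]


# 'wrote the-boy the-lesson' in transliteration, labelled by hand for illustration.
vso = [("ktb", "V"), ("Alwld", "S"), ("Aldrs", "O")]
vos = [("ktb", "V"), ("Aldrs", "O"), ("Alwld", "S")]

print(word_order(vso), to_svo(vso))
print(word_order(vos), to_svo(vos))
```

Both clauses normalise to the same SVO sequence, which reflects the point that the different orders can carry the same meaning.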

This paper has given an account of, and reasons for, the widespread use of machine translation. The features of Arabic and major MT methods were also discussed. Some of these methods are commonly used. Most machine translation systems focus on the translation of news and official texts, whilst few focus on open-domain translation, particularly of the informal genres that make up much of the information on the internet (for which translation is in great demand). Statistical machine translation has grown quickly, and transfer machine translation will surely follow. Nonetheless, these MT systems still do not meet human requirements. This paper has investigated current MT techniques employed to translate from/to Arabic. In the future, we plan to develop a new Arabic to English machine translation system, taking reordering challenges into consideration. We hope to extend our system to cover Arabic and other languages with even more features.

References

Abbès R, Dichy J, Hassoun M (2004) The architecture of a standard Arabic lexical database: some figures, ratios and categories from the DIINAR.1 source program. In: The workshop on computational approaches to Arabic script-based languages, COLING 2004. Geneva, Switzerland, pp 15–22


Abraham I, Salim R (2005) A maximum entropy word aligner for Arabic-English machine translation. In: Proceedings of human language technology conference and conference on empirical methods in natural language processing (HLT/EMNLP). pp 89–96 (Vancouver)

Abu Shugier M, Sembok T (2007) Handling agreement in machine translation from English to Arabic. In: 1st International conference on digital communications and computer applications (DCCA2007). JUST. pp 385–379

Abu Shugier M (2009) Word agreement and ordering in English-Arabic machine translation: a rule-based approach. PhD thesis, FTSM, University Kebangsaan Malaysia, p 175

Afify M, Sarikaya R, HKJ Kuo LB, Gao Y (2006) On the use of morphological analysis for dialectal Arabic speech recognition. In: 9th International conference on spoken language processing (Interspeech-ICSLP), Pittsburgh. pp 277–280

Alansary S, Nagi M, Adly N (2009) Towards analysing the international corpus of Arabic (ICA). In: International conference on language engineering. Progress of Morphological Stage, Egypt, pp 241–245

Albared M, Nazlia O, Mohd J, Ab Aziz (2009) Classifiers combination to Arabic morphoSyntactic disambiguation. In: International conference on electrical engineering and informatics, Malaysia. 978-1-4244-4913-2/09 (IEEE)

Almas Y, Ahmed K (2007) A note on extracting “sentiments” in financial news in English, Arabic, and Urdu. In: Proceedings of the 2nd workshop on computational approaches to Arabic script-based languages (CAASL’07). pp 1–12

Alsalman S (2004) The effectiveness of machine translation. Int J Arab Engl Stud 5:145–160

Alsharaf H, Sylviane C, Peter G (2004) French to Arabic machine translation. In: The specificity of language couples 9th EAMT workshop, “Broadening horizons of machine translation and its applications”, pp 26–27 April 2004, Malta, pp 11–17

Al-Sughaiyer I, Al-Kharashi IA (2004) Arabic morphological analysis techniques: a comprehensive survey. JASIST 55(3):189–213

Aoun J, Elabbas B, Dominique S (1994) Agreement, word order, and conjunction in some varieties of Arabic. Linguist Inq 25:195–220

Arnold D, Balkan L, Lee H, Meijer S, Sadler L (1994) Machine translation: an introductory guide. Blackwell, Manchester

Attia M (2006) An ambiguity-controlled morphological analyser for modern standard Arabic modelling finite state networks. In: Challenge of Arabic for NLP/MT conference. The British Computer Society, London, pp 48–67

Attia M (2007) Arabic tokenization system. In: ACL-Workshop on computational approaches to semitic languages, Prague

Attia M (2005) Developing a robust Arabic morphological transducer using finite state technology. In: The 8th annual CLUK research colloquium. Manchester

Attia M (2008) Handling Arabic morphological and syntactic ambiguity within the LFG framework with a view to machine translation. Ph.D. Thesis. The University of Manchester, Manchester, p 61

Attia M (2003) Implications of the agreement features in machine translation. M.A. Thesis. University of Manchester

Azmi M (1988) Arabic morphology: a study in the system of conjugation. Hasan Publishers, HyderabadBadr I, Zbib R, Glass J (2009) Syntactic phrase reordering for English-to-Arabic statistical machine transla-

tion. In: The 12th conference of the European chapter of the association for computational linguistics.Athens, pp 86–93

Beesley K (1996) Arabic finite-state morphological analysis and generation. In: Proceedings of the 16thconference on association for computational linguistics. pp 89–94

Beesley KR (1998) Arabic morphology using only finite-state operations. In: Computational approaches tosemitic languages: proceedings of the workshop. Montreal, pp 50–57

Beesley KR, Karttunen L (2003) Finite state morphology. CSLI Publications, Palo Alto, CABesançon R, Mostefa D, Timimi I, Chaudiron S, Laïb M (2009) Arabic, English and French: three languages

in a filtering systems evaluation project. In: MEDAR 2009: 2nd international conference on Arabiclanguage resources & Tools, 22–23 April 2009, Cairo, pp 163–167

Bisazza A, Federico M (2010) Chunk-based verb reordering in VSO sentences for Arabic-English statisticalmachine translation. In: ACL 2010: joint fifth workshop on statistical machine translation and Metric-sMATR. Proceedings of the workshop, 15–16 July 2010, Uppsala University, Uppsala, pp 235–243

Bonnie J, Dorr E, Hovy H, Lori S (2004) Machine translation: interlingual methods. In: Brown K (ed) Ency-clopaedia of language and linguistics, 2nd edn, ms. 939

Bouillon P, Sonia H, Yukie N, Kyoko K, Hitoshi I, Nikos T, Marianne S, Beth AH, Manny R (2008) Devel-oping non-European translation pairs in a medium-vocabulary medical speech translation system. In:

123

A. Alqudsi et al.

LREC 2008: 6th Language resources and evaluation conference, Marrakech, Morocco, 26–30 May,pp 1741–1748

Brill E, Resnik P (1994) A rule-based approach to prepositional phrase attachment. In: Proceedings of the15th conference on 1994, acl.ldc.upenn.edu

Brown D, Ralf B (1996) Example-based machine translation in the Pangloss system. In: Proceedings of theCOLING-96, vol 1, pp 169–174 (Copenhagen)

Carpuat M, Yuval M, Nizar H (2010) Improving Arabic-to-English statistical machine translation by reor-dering post-verbal subjects for alignment. In: ACL 2010: the 48th annual meeting of the associationfor computational linguistics, Uppsala, July 11–16, 2010: Proceedings of the Conference Short Papers,pp 178–183

Chafia M, Ali Mili (1995) Machine translation from Arabic to English and French information sciences3(2):91–109

Chalabi A (2004) Elliptic personal pronoun and MT in Arabic. In: JEP-2004-TALN 2004 special sessionon Arabic language processing-text and speech. http://www.lpl.univ-aix.fr/jep-taln04/proceed/actes/arabe2004/TAAC17.pdf

Chalabi A (2000) MT-based transparent Arabization of the internet TARJIM.COM. In: White JS (ed) AMTA2000, LNAI 1934. Springer, Berlin pp 189–191

Chalabi A (2001) Sakhr web-based Arabic/English MT engine. Downloaded from www.elsnet.org/arabic2001/chalabi.pdf on 25 Aug

Charoenpornsawat P, Sornlertlamvanich V, Charoenporn, T (2002) Improving translation quality of rule-basedmachine translation. In: Proceedings of COLING-02 on machine translation in Asia. Morristown, pp 1–6

Daimi K (2001) Identifying syntactic ambiguities in single-parse Arabic sentence. Comput Hum 35:333–349Darwish K (2002) Building a shallow Arabic morphological analyser in one day. In: Proceedings of the

ACL workshop on natural language processing in the biomedical domain, PA, USA. Association forComputational Linguistics

Debili F (1992) Aligning sentences in bilingual texts French–English and French–Arabic. In: COLING,pp 517–525 (Nantes)

Ditters E (2001) A formal grammar for the description of sentence structure in modern standard Arabic. In:Workshop on Arabic processing: status and prospects at ACL/EACL, Toulouse

Doaa S, Ana GL (2008) Pragmatic annotation of discourse markers in a multilingual parallel corpus (Arabic-Spanish-English). In: LREC 2008: 6th language resources and evaluation conference, Marrakech, 26–30May 2008

Doaa S, Antonio M, Sandoval J, Guirao M, Enrique A (2006) Building a parallel multilingual corpus (Arabic-Spanish-English). In: LREC-2006: fifth international conference on language resources and evaluation.Proceedings, Genoa, Italy, 22–28 May 2006, pp 2176–2181 (increase)

Dorr BJ, Jordan PW, Benoit JW (1999) A survey of current paradigms in machine translation. In: ZelkowitzM (ed) Advances in computers, vol 49. Academic Press, London pp 1–68

Elming J, Habash N (2009) Syntactic reordering for English-Arabic phrase-based machine translation. In:Proceedings of the EACL 2009 workshop on computational approaches to semitic languages, Athens,pp 69–77

Eric HN, Teruko M (1992) The KANT system: fast, accurate, high-quality translation in practical domains.In: International conference on computational linguistics proceedings of the 14th conference on compu-tational linguistics, vol 3. pp 1069–1073

Farghaly A, Shaalan K (2009) Arabic natural language processing: challenges and solutions. ACM TransAsian Lang Inform Process Assoc Comput Mach 8:1–22. doi:10.1145/1644879.1644881

Farghaly A, Senellart J (2003) Intuitive coding of the Arabic lexicon. In: Proceedings of the MT Summit IX,the association for machine translation in the Americas (AMTA’03)

Fehri AF (1993) Issues in structure of Arabic clauses and works. Kulwer, DordrechtFuruse O, Iida H (1992) An example-based method for transfer-driven machine translation. In: The third

international conference on theoretical and methodological issues, Empiristic vs. Rationalist methods inMT. Montréal, pp 139–150

Groves D, Way A (2006) Hybrid data-driven models of machine translation. Springer Science & BusinessMedia B.V., Berlin 301–323

Groves D, Way A (2005) Hybrid example-based SMT: the best of both worlds? In: Proceedings of the ACL2005 workshop on building and using parallel texts: data-driven machine translation and beyond, AnnArbor, pp 183–190

Guessoum A, Zantout R (2005) A methodology for evaluating Arabic machine translation systems. MachTrans 18:299–335 doi:10.1007/s10590-005-2412-3 (Springer)

Guidere M (2002) Toward Corpus-Based Machine Translation for Standard Arabic Translation Journal 6.1.http://accurapid.com/journal/19mt.htm, visited September

123

Arabic machine translation

Habash N (2010) Introduction to Arabic natural language processing. In: Graeme H (ed) Synthesis lectureson human language technologies. Morgan & Claypool Publishers, San Rafael p 187

Habash N, Jun Hu (2009) Improving Arabic-Chinese statistical machine translation using English as pivot lan-guage. In: Proceedings of the fourth workshop on statistical machine translation, Athens, 30 March–31March, pp 173–181

Habash N, Sadat F (2006) Arabic pre-processing schemes for statistical machine translation. In: Proceedingsof the 7th meeting of the North American chapter of the association for computational linguistics/humanlanguage technologies conference (HLT-NAACL06). New York, pp 49–52

Hasan S, Isbihani A El I, Hermann N (2006) Creating a large-scale Arabic to French statistical machinetranslation system. In: LREC-2006: fifth international conference on language resources and evaluation.Proceedings, Genoa, Italy, 22–28 May

Hatem A, Nassar A (2008) Modified Dijstra-like search algorithm for English to Arabic machine translationsystem. In: Hutchins J, Hahn Walther v (eds) Proceedings EAMT 2008: 12th annual conference of theEuropean association for machine translation, September 22–23, 2008. Hamburg, pp 66–71

Hatem A, Omar N (2010) Syntactic reordering for Arabic-English phrase-based machine translation. In:Database theory and application, bio-science and bio-technology. Springer Lecture Notes in ComputerScience, vol 118. Verlag, Berlin, pp 198–206

Hutchins J (2007) Machine translation: a concise history. In: Wai CS (ed) Computer aided translation: theoryand practice. Chinese University of Hong Kong, Hong Kong

Hutchins WJ, Harold LS (1992) An introduction to machine translation. Academic Press, LondonHutchins WJ (1986) Machine translation: past, present, future. Ellis Horwood Limited, West SussexIbrahim K (2002) Al-Murshid fi Qawa’id Al-Nahw wa Al-Sarf [The Guide in Syntax and Morphology Rules].

Amman, Jordan, Al-Ahliyyah for Publishing and DistributionJosef FO, Ney H (2000) Improved statistical alignment models. In: ACL00: Proceedings of the 38th annual

meeting of the association for computational linguistics., Hongkong, pp 440–447Joshan GS, Lehal GS (2007) Evaluation of direct machine translation system from Punjabi to Hindi. Int

J Systemics Cybern Inform, 76–83Kamir D, Soreq N, Neeman Y (2002) A comprehensive NLP system for modern standard Arabic and modern

hebrew. In: Proceedings of the workshop on computational approaches to semitic languages in the 40thannual meeting of the association for computational linguistics (ACL-02). Philadelphia

Köpr S, Miller J (2009) A unification based approach to the morphological analysis and generation of Arabic.In: CAASL-3—third workshop on computational approaches to Arabic script-based languages [at] MTSummit XII, August 26 2009

Langlais P, Simard M (2002) Merging example-based and statistical machine translation. In: Richardson SD(ed) Machine translation: from research to real users, 5th conference of the association for machinetranslation in the Americas (AMTA-2002), Tiburon, October 2002. proceedings, Springer, Berlin,pp 104–113

Larkey L, Ballesteros L, Connell M (2002) Improving stemming for Arabic information retrieval: light stem-ming and co-occurrence analysis. pp 275–282

Lavie A, Probst K, Peterson E, Vogel S, Levin L, Font-Llitjos A, Carbonell J (2004) A Trainable transfer-basedmachine translation approach for languages with limited resources. In: Proceedings of workshop of theEuropean association for machine translation (EAMT-2004), Valletta, Malta, pp 116–123

Lee Y (2004) Morphological analysis for satistical machine translation. In: Proceedings of the joint confer-ence on human language technologies and the annual meeting of the North American chapter of theassociation of computational linguistics (HLT-NAACL)

Lee Y, Suk L, Kishore P, Salim R (2003) Language model based Arabic word segmentation. In: 41st annualmeeting of the association for computational linguistics. Sapporo, pp 399–406

Lopez A (2008) Statistical machine translation. ACM Comput Surv 40(3):1–49Marcu D (2001) Towards a unified approach to memory- and statistical-based machine translation. In: Asso-

ciation for computational linguistics: 39th annual meeting and 10th conference of the European chapter,Toulouse, pp 378–385

Mark P, Domenyk E, Samir K, Lakshmi P (2004) Relative clauses in Hindi and Arabic: a Paninian depen-dency grammar analysis. In: Coling’04 workshop: proceedings recent advances in dependency grammar,August 28, pp 9–16

McCarthy J (1979) Formal problems in semitic phonology and morphology. Ph.D. dissertation, MIT, Cam-bridge

Mitamura T, Nyberg E, Carbonell J (1991) An efficient interlingua translation system for multi-lingual docu-ment production. In: Proceedings of machine translation Summit III, Washington, DC, July 2–4

123

A. Alqudsi et al.

Moghrabi C (1998) On parametering the choice of words in text generation and its usefulness in machine trans-lation. In: International conference “Machine translation: ten years on” proceedings held at CranfieldUniversity, England, 12–14 November (Cranfield University Press, pp 1–9

Mostefa D, Laïb M, Chaudiron S, Choukri K, Chalendar G (2009) A multilingual named entity corpus for Ara-bic, English and French. In: MEDAR 2009: 2nd international conference on Arabic language resources& tools, April 2009, Cairo

Nagao M (1997) Machine translation through language understanding. In: Proceedings of MT Summit VI,San Diego, pp 41–49

Nguyen T, Vogel S (2008) Context-based Arabic morphological analysis for machine translation In: Proceed-ings of the 12th conference on computational natural language learning, Manchester, pp 135–142

Nirenburg S, Beale S, Domashnev C (1994) A full text experiment in example based machine translation.In: Proceedings of the international conference on new methods in language processing, Manchester,pp 78–87

Othman E, Shaalan K, Rafea A (2003) A chart parser for analysing modern standard Arabic sentence. In: TheMT Summit IX workshop on machine translation for semitic languages: Issues and Approaches, NewOrleans

Paul M, Doi T, Hwang Y, Imamura K, Sumita E (2005a) Nobody is perfect: ATR’s hybrid approach to spo-ken language translation. In: Proceedings of the international workshop on spoken language translation(IWSLT 2005), Pittsburgh, pp 55–62

Paul M, Sumita E, Yamamoto S (2005b) A machine learning approach to hypothesis selection of greedydecoding for SMT. In: MT Summit X workshop: second workshop on example-based machine transla-tion, Phuket, pp 117–124

Ratcliffe R (1998) The broken plural problem in Arabic and comparative semitic: allomorphy and analogy innon-concatenative morphology. J. Benjamins, Amsterdam

Richardson S, Dolan W, Menezes A, Pinkham J (2001) Achieving commercial-quality translation with exam-ple-based methods. In: Proceedings of MT summit VIII, Santiago De Compostela, Spain

Salem Y, Arnold H, Brian N (2008) Implementing Arabic to English machine translation using the role andreference grammar linguistic model. In: Proceedings of the eighth annual international conference oninformation technology and telecommunication (ITT 2008), Galway, Ireland, October 2008 (Runner-upfor Best Paper Award)

Shaalan K, Rafea A, Abdel Monem A, Baraka H (2004) Machine translation of English noun phrases intoArabic. Int J Comput Process Orient Lang. World Scientific Publishing Company 17(2):121–134

Shaalan K, Raza H (2009) NERA: named entity recognition for Arabic. J Am Soc Inf Sci Technol. John Wiley& Sons, Inc., NJ, 60(8):1652–1663

Shirko O, Omar N, Arshad H, Albared M (2010) Machine translation of noun phrases from Arabic to Englishusing transfer-based approach. J Comput Sci 6(3):350–356 (ISSN 1549-3636)

Soudi A, Bosch A, Neumann G (2007) Arabic computational morphology: knowledge-based and empiricalmethods. Springer, Berlin

Spence G, Christopher D (2010) Better Arabic parsing: baselines, evaluations, and analysis. In: Coling 2010:23rd international conference on computational linguistics. Proceedings of the conference, 23–27 August2010, Beijing International Convention Centre, Beijing

Tahir GR, Asghar S, Masood N (2010) Knowledge based machine translation. In: Proceedings of internationalconference on information and emerging technologies (ICIET). Karachi, Pakistan pp 1–5

Toma P (1977) SYSTRAN as a multilingual machine translation systemmt-archive. In: The third Europeancongress on information systems, pp 569–581

Toutanova K, Suzuki H, Ruopp A (2008) Applying morphology generation models to machine translation.In: ACL-08: HLT. 46th annual meeting of the association for computational linguistics: human languagetechnologies. Proceedings of the conference, June 15–20, 2008, The Ohio State University, Columbus„pp 514–522

Trujillo A (1999) Translation engines: techniques for machine translation. Springer, LondonYassine B, Imed Z, Mona D, Paolo R (2010) Arabic named entity recognition: using features extracted from

noisy data. In: Proceedings of the ACL 2010 conference short papers, pp 281–285, Uppsala, 11–16 July2010. c 2010 Association for Computational Linguistics

Žabokrtský Z, Smrž O Arabic syntactic trees: from constituency to dependency. In: The 10th conference ofthe European chapter of the association for computational linguistics, Budapest, pp 183–186

Zavrel J, Daelemans W, Veenstra J (1997) Resolving PP. attachment Ambiguities with memory-based learning.In: The workshop on computational natural language learning (CoNLL’97). Madrid, pp 136–144

123