Arab Academy for Science, Technology and Maritime Transport
College of Computing and Information Technology
Department of Computer Science
Answer Selection and Validation
for Arabic Questions
By
Ahmed Magdy Ezzeldin
Egypt
Submitted to Arab Academy for Science, Technology and Maritime
Transport in partial fulfillment of the requirements for the degree of
Master of Computer Science
Supervised by
Prof. Dr. Yasser El-Sonbaty
College of Computing & Information Technology
Arab Academy for Science, Technology and Maritime
Transport
Dr. Mohamed Hamed Kholief
College of Computing & Information Technology
Arab Academy for Science, Technology and Maritime
Transport
2014
DECLARATION
I certify that all the material in this thesis that is not my own work has been identified, and that no material
is included for which a degree has previously been conferred on me. The contents of this thesis reflect my own
personal views, and are not necessarily endorsed by the University.
Signature:
Date: 23/9/2014
Abstract
Arabic is the sixth most widespread natural language in the world, with more than 350 million native speakers. Arabic question answering systems are gaining great significance due to the increasing amount of unstructured Arabic content on the Internet and the increasing demand for information that regular information retrieval techniques do not satisfy. Question answering systems in general, and Arabic systems are no exception, hit an upper bound of performance due to the propagation of error through their pipelines. This increases the significance of answer selection and validation systems, as they enhance the certainty and accuracy of question answering systems. Very few works have tackled the Arabic answer selection and validation problem, and those that did reused the standard question answering pipeline without any changes to satisfy the requirements of answer selection and validation, which is why they did not perform adequately on this task. In this dissertation, a new approach to Arabic answer selection and validation is presented through "ALQASIM", a QA4MRE (Question Answering for Machine Reading Evaluation) system. ALQASIM analyzes the reading test documents instead of the questions, and utilizes sentence splitting, root expansion, and semantic expansion using an ontology built from the CLEF 2012 background collections. Our experiments were conducted on the test-set provided by the CLEF 2012 QA4MRE task. This approach led to a promising performance of 0.36 accuracy and 0.42 C@1, which is double the performance of the best performing Arabic QA4MRE system.
Acknowledgements
In the name of Allah, Who helped and guided me through the journey of my studies and my life (الحمد لله).
Special thanks to my supervisors, Prof. Dr. Yasser El-sonbaty and Dr. Mohamed Kholief, who guided me with
their experience and sense of perfection throughout the thesis, which improved the results of our research and the
papers dramatically.
I also thank all the doctors who taught me and helped me throughout my studies at AASTMT. Through their hard work, I learned a lot that helped me choose my field of research and develop a scientific way of thinking. Special thanks to Dr. Mohamed Shaheen, who taught me a lot about academic writing and guided me through writing my first paper. I would also like to thank Prof. Dr. Mohamed Ismail, who honored me by examining and accepting my thesis.
I would never have accomplished this work without the support of my family: my mother and father, who taught me everything they could in my early years and who gave me the sense of perfection that helped me throughout the years of my study and life. Thank you, my dear beloved wife Shimaa, who was there for me in every moment and provided me with all the support to help me rise against all the hardships of this journey. May Allah bless you, Shimaa, and bless our beloved child Mariam.
Table of Contents
Abstract................................................................................................................................................................ 4
List of Figures.......................................................................................................................................................8
List of Abbreviations............................................................................................................................................ 9
Chapter One Introduction…................................................................................................................................10
1.1. Background and Context............................................................................................................................11
1.1.1. Arabic Specific Difficulties.................................................................................................................11
1.1.2. Question Answering and its significance............................................................................................13
1.1.3. Pipeline of QA.....................................................................................................................................14
1.1.4. Question Answering for Machine Reading Evaluation (QA4MRE)...................................................14
1.2. Scope and Objectives.................................................................................................................................15
1.3. Achievements.............................................................................................................................................15
1.4. Overview of Dissertation...........................................................................................................................16
Chapter Two Related Works................................................................................................................................17
2.1. Arabic QA systems....................................................................................................................................18
2.1.1. Question Analysis...............................................................................................................................22
2.1.2. Passage Retrieval................................................................................................................................23
2.1.3. Answer Extraction and Validation.....................................................................................................25
2.2. The Best Performing English QA System................................................................................................27
2.3. Answer Selection and QA4MRE systems................................................................................................27
Chapter Three Tools, Test-set and Evaluation Metrics…...................................................................................31
3.1. Tools and Resources..................................................................................................................................32
3.1.1. MADA+TOKAN PoS tagger..............................................................................................................32
3.1.2. Root Stemmers...................................................................................................................................34
3.1.2.1. Khoja Root Stemmer...................................................................................................................34
3.1.2.2. ISRI Root Stemmer.....................................................................................................................34
3.1.2.3. Tashaphyne Root Stemmer..........................................................................................................34
3.1.3. Arabic WordNet.................................................................................................................................35
3.2. Test-set......................................................................................................................................................35
3.3. Evaluation metrics....................................................................................................................................36
3.3.1. Accuracy.............................................................................................................................................36
3.3.2. C@1...................................................................................................................................................37
Chapter Four System Architecture......................................................................................................................38
4.1. ALQASIM 1.0 Initial Architecture............................................................................................................39
4.1.1. Document Analysis.............................................................................................................................40
4.1.2. Locating Questions & Answers..........................................................................................................41
4.1.3. Answer Selection................................................................................................................................41
4.1.4. Evaluation of ALQASIM 1.0..............................................................................................................41
4.2. ALQASIM 2.0 Architecture......................................................................................................................43
4.2.1. Document Analysis.............................................................................................................................43
4.2.1.1. Inverted Index...............................................................................................................................44
4.2.1.2. Morphological analysis module...................................................................................................45
4.2.1.3. Sentence splitting module............................................................................................................45
4.2.1.4. Root Expansion module...............................................................................................................46
4.2.1.5. Numeric Expansion module........................................................................................................46
4.2.1.6. Ontology-based Semantic Expansion module.............................................................................47
4.2.2. Question Analysis..............................................................................................................................48
4.2.3. Answer Selection...............................................................................................................................49
Chapter Five Results and Analysis.....................................................................................................................50
5.1. Results.......................................................................................................................................................51
5.2. Analysis....................................................................................................................................................55
Chapter Six Conclusion and Future Work..........................................................................................................59
6.1. Summary...................................................................................................................................................60
6.2. Evaluation.................................................................................................................................................61
6.3. Future Work..............................................................................................................................................61
6.3.1. Rule-based techniques........................................................................................................................62
6.3.2. Anaphora Resolution.........................................................................................................................62
6.3.3. Semantic Parsing and Semantic Role Labeling.................................................................................62
6.3.4. More Automatically Generated Ontologies.......................................................................................63
References............................................................................................................................................................64
Appendix A..........................................................................................................................................................68
List of Figures
Figure 1. Example Arabic Derivation...................................................................................................................... 12
Figure 2. Example Arabic Inflection....................................................................................................................... 13
Figure 3. Arabic Question Answering Subtasks...................................................................................................... 14
Figure 4. The architecture of ArabiQA (Benajiba & Rosso, 2007)......................................................................... 19
Figure 5. QASAL architectural components (Brini et al., 2009)............................................................... 20
Figure 6. JIRS Passage Retrieval System Architecture (Benajiba et al., 2007)....................................................... 24
Figure 7. Example of the Answer Extraction module's performance steps (Benajiba et al., 2007)...........................26
Figure 8. Performance of QA4MRE systems @ CLEF 2012.................................................................................. 29
Figure 9. Overview of ALQASIM architecture ..................................................................................................... 39
Figure 10. Detailed Architecture of ALQASIM...................................................................................................... 40
Figure 11. The performance of ALQASIM 1.0 versus the other two Arabic QA4MRE systems...........................42
Figure 12. Example for failure of ALQASIM 1.0 while trying to locate the question snippet...............................42
Figure 13. ALQASIM 2.0 Architecture.................................................................................................................. 43
Figure 14. Document Analysis module architecture............................................................................................... 44
Figure 15. The intuition behind extracting the background collection ontology..................................................... 47
Figure 16. Question Analysis module architecture.................................................................................................. 48
Figure 17. Answer Selection module architecture................................................................................................... 49
Figure 18. Comparison between QA4MRE systems............................................................................................... 53
Figure 19. Performance in terms of answered questions counts.............................................................................. 53
Figure 20. Ratio between correctly answered, wrongly answered and unanswered questions in run 4..................54
Figure 21. An example showing the effect of sentence splitting in ALQASIM 2.0................................................ 55
Figure 22. An example of the effect of ontology expansion in ALQASIM 2.0...................................................... 56
Figure 23. An example showing the effect of root expansion in ALQASIM 2.0.................................................... 57
ALQASIM platform main screen showing results.................................................................................................. 68
Screen showing a correctly answered question with the question answer colored.................................................68
Screen showing a wrongly answered question........................................................................................................ 69
Schema of the input questions after applying morphological analysis................................................................... 69
List of Abbreviations
• AVE: Answer Validation Exercise
• AWN: Arabic WordNet
• CLEF: Conference and Labs of the Evaluation Forum
• IR: Information Retrieval
• JIRS: Java Information Retrieval System
• ML: Machine Learning
• MRE: Machine Reading Evaluation
• MRR: Mean Reciprocal Rank
• MSA: Modern Standard Arabic
• MT: Machine Translation
• NE: Named Entity
• NER: Named Entity Recognition
• NLP: Natural Language Processing
• PoS: Part-of-Speech
• QA: Question Answering
• QA4MRE: Question Answering for Machine Reading Evaluation
• TREC: Text REtrieval Conference
Chapter 1
Introduction
Arabic is the sixth most widespread natural language in the world, with more than 350 million native speakers. It is a highly inflectional, derivational, and morphologically rich language that requires special handling in the different Natural Language Processing tools and tasks. Arabic question answering systems are gaining great importance due to the increase of Arabic content on the Internet and the increasing need for information that traditional information retrieval systems cannot satisfy.
In this chapter, Question Answering (QA) is defined as a task and its significance is highlighted. The Arabic-specific difficulties that emerge from the language's rich morphology are also demonstrated, and the QA subtasks and pipeline are explained. Finally, the scope of the dissertation is laid out.
1.1. Background and Context
In this section, the Arabic-specific difficulties are explained, QA is defined and its significance demonstrated, and the contribution of this dissertation to Arabic QA is highlighted.
1.1.1. Arabic Specific Difficulties
Arabic is a very rich language; however, this richness requires special handling, which regular Natural Language Processing (NLP) systems designed for other languages cannot provide. In the field of QA, English and other Latin-based languages have benefited greatly from the advancement of NLP, while Arabic Question Answering systems lag behind their English and Latin-based counterparts due to the Arabic-specific difficulties.
One of the Arabic-specific difficulties is the lack of diacritics in Modern Standard Arabic (MSA), which adds to the ambiguity of both the question and the searched documents. For example, the word "علم" can take on different meanings depending on the diacritics applied, as shown in Table 1. However, much interest has been given to diacritizing MSA to resolve this ambiguity.
Table 1: Different meanings of the word "علم" due to different diacritics

Undiacritized word | Diacritized word | English Pronunciation | Meaning
علم | عَلَمْ | Alam | Flag
علم | عِلْمْ | Elm | Science
علم | عَلَّمَ | Allama | Taught
علم | عَلِمَ | Alema | Knew
Arabic is a highly derivational language: the Arabic vocabulary is essentially built from about 10,000 three- or four-letter roots, and derivations of these roots are created by adding affixes (prefix, infix, or suffix) to each root according to about 120 patterns. As illustrated in Figure 1, derivations in Arabic almost always follow the form Lemma = Root + Pattern (Abdelbaki & Shaheen, 2011). This derivational nature increases the size of the Arabic vocabulary dramatically and makes building a high-coverage semantic Language Resource (LR) very challenging.
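The Lemma = Root + Pattern scheme can be illustrated with a short sketch. The substitution below, using the conventional placeholder radicals ف, ع, ل for the root's three consonants, is a simplification for illustration; real patterns also interact with diacritics and weak radicals.

```python
# Illustrative sketch of Arabic derivation: Lemma = Root + Pattern.
# Patterns are written with the placeholder radicals ف, ع, ل (C1, C2, C3);
# applying a pattern substitutes the root's consonants into those slots.

def apply_pattern(root: str, pattern: str) -> str:
    """Substitute the three root radicals into the pattern's slots."""
    slots = {"ف": root[0], "ع": root[1], "ل": root[2]}
    return "".join(slots.get(ch, ch) for ch in pattern)

root = "كتب"  # k-t-b, the root associated with "writing"
print(apply_pattern(root, "فاعل"))   # كاتب  (writer)
print(apply_pattern(root, "مفعول"))  # مكتوب (written)
print(apply_pattern(root, "فعالة"))  # كتابة (writing)
```

Each of the roughly 120 patterns applied to a single root can yield a distinct lemma, which is why the derived vocabulary grows so quickly.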
Arabic morphology is challenging compared to English and other Latin-based languages because Arabic is a highly inflectional language in which a single word token can consist of multiple morphemes. As illustrated in Figure 2, an Arabic word may take the form "Word = Lemma + affixes (prefix, infix, and suffix)". The prefixes can be articles, prepositions, or conjunctions, which causes a lot of sparseness in document indexes and makes query expansion harder. This inflectional nature requires special handling in different Arabic NLP tasks such as stemming, lemmatization, morphological analysis, PoS tagging, and even tokenization. Various tools were developed to address this need, as described shortly.
Unlike English and most Latin-based languages, Arabic does not have capital letters, which makes Named Entity Recognition (NER) harder. The different approaches to the NER task are reviewed among the related works in chapter two.
Figure 1. Example Arabic Derivation
Figure 2. Example Arabic Inflection
1.1.2. Question Answering and its significance
In Information Retrieval (IR) and Natural Language Processing (NLP), Question Answering (QA) is the task of automatically providing an answer to a question posed by a human in natural language. QA can be divided into three main distinct subtasks: Question Analysis, Passage Retrieval, and Answer Extraction. Most QA systems follow these three subtasks; however, they may differ in how they implement each one.
QA deals with many types of questions. Factoid questions are one type, concerned mainly with Named Entities (NEs): questions using the words When, Where, How much/many, Who, and What ask about a date/time, a place, a quantity, a person, and an organization, respectively.
QA systems are more capable of handling natural language queries than regular IR systems. On the other hand, regular IR systems like search engines yield better results when the query is a Boolean formula (Laurent et al., 2006). QA systems are also easier to use and have higher recall than ordinary IR systems: a QA system returns an answer if one exists, unlike a regular IR system, which sometimes returns irrelevant documents that may not contain the answer (Smucker et al., 2008).
1.1.3. Pipeline of QA
Arabic Question Answering is made up of three distinct subtasks: Question Analysis, Document / Passage Retrieval, and Answer Extraction. In this section, we show how each subtask is implemented and how the implementations vary among different Arabic QA systems.
Figure 3. Arabic Question Answering Subtasks
As illustrated in Figure 3, QA consists of three main phases: question analysis, passage retrieval, and answer extraction. In the question analysis phase, the expected answer type is determined from the question words, and the question is formulated into a query ready for passage retrieval. In the passage retrieval phase, documents are separated into passages, and the query formulated in the question analysis phase is used to search for the most relevant passages and rank them by relevance. In the answer extraction phase, the retrieved passages are reranked according to the expected answer type detected in the question analysis phase, and the system responds with the top-ranked answer. In chapter two, the different Arabic QA systems are demonstrated, highlighting the achievements of each system in these three subtasks.
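The three phases above can be sketched as a minimal skeleton. The toy scoring and the tiny answer-type table below are placeholder assumptions standing in for whatever retrieval and ranking models a concrete system uses; they are not any particular system's implementation.

```python
# Minimal skeleton of the classic QA pipeline: question analysis ->
# passage retrieval -> answer extraction. The keyword-overlap scoring
# is a toy placeholder, not any specific system's model.

def analyze_question(question: str) -> dict:
    # Detect the expected answer type from the question word and
    # turn the remaining keywords into a query.
    words = question.lower().split()
    answer_types = {"when": "DATE", "where": "LOCATION", "who": "PERSON"}
    return {"type": answer_types.get(words[0], "UNKNOWN"),
            "query": set(words[1:])}

def retrieve_passages(query: set, passages: list) -> list:
    # Rank passages by simple keyword overlap with the query.
    scored = [(len(query & set(p.lower().split())), p) for p in passages]
    return [p for score, p in sorted(scored, reverse=True) if score > 0]

def extract_answer(q: dict, ranked: list) -> str:
    # A real system reranks by the expected answer type (NER etc.);
    # here the top-ranked passage stands in for the answer.
    return ranked[0] if ranked else "NO ANSWER"

passages = ["Cairo is the capital of Egypt.", "Rain fell in 1990."]
q = analyze_question("where is the capital of Egypt located")
print(extract_answer(q, retrieve_passages(q["query"], passages)))
```

Error introduced in any of the three stages propagates downstream, which is the upper-bound problem discussed later in this chapter.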
1.1.4. Question Answering for Machine Reading Evaluation (QA4MRE)
Question Answering for Machine Reading Evaluation (QA4MRE) is another type of QA that evaluates how well a computer understands a reading comprehension passage by posing a list of multiple-choice questions that can be answered only by understanding that passage. It was introduced in an initiative to give more attention to answer selection and validation over the IR-based task of passage retrieval.
1.2. Scope and Objectives
In 2005, it was noticed in CLEF (Conference and Labs of the Evaluation Forum) that the greatest accuracy reached by the different Question Answering (QA) systems was about 60%, whereas 80% of the questions were answered by at least one participating system. This is due to the error that propagates through the QA pipeline layers: (i) Question Analysis, (ii) Passage Retrieval, and (iii) Answer Extraction. This led to the introduction of the Answer Validation Exercise (AVE) pilot task, in which systems were required to focus on answer selection and validation, leaving answer generation aside. However, all the QA systems from CLEF 2006 to
2010 used the traditional IR-based techniques and hit the same accuracy upper bound of 60%. By 2011, it became necessary to devise a new approach to QA evaluation that would force the participating systems to focus on answer selection and validation instead of passage retrieval. This was achieved by answering questions from one
document only. This new approach, which was named Question Answering for Machine Reading Evaluation
(QA4MRE), skips the answer generation tasks of QA, and focuses only on the answer selection and validation
subtasks (Penas et al., 2011). The questions used in this evaluation are provided with five answer choices each,
and the role of the participating systems is to select the most appropriate answer choice. The QA4MRE systems
may leave some questions unanswered when they are not certain of the answer. Arabic QA4MRE was introduced for the first time in CLEF 2012, with two participating Arabic systems. A new test-set and a metric named "C@1" were introduced for this purpose. The evaluation test-set and metrics are explained thoroughly in chapter
three.
We introduce ALQASIM, an Answer Selection and Validation system built on the QA4MRE test-set.
The main purpose of our research is to provide a reliable answer selection and validation module which will im-
prove the performance and reliability of any Arabic QA system. Throughout this dissertation, the term
“QA4MRE” will be used interchangeably with “Answer Selection and Validation” as they serve the same pur-
pose.
1.3. Achievements
We provided two versions of our answer selection and validation system. The first version depends mainly on the proximity of answer keywords to question keywords in the test document. The second version uses sentence splitting as a natural boundary for searching for answers in the test document, and introduces root expansion and semantic expansion using an automatically generated ontology created from the background collection documents provided with the test-set.
The first version achieved an accuracy of 0.31 and a C@1 of 0.36. The second version achieved an accuracy of 0.36 and a C@1 of 0.42, which is double the performance of the best performing Arabic QA4MRE system.
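The two measures relate as follows: accuracy is simply n_correct / n, while C@1, the measure used in QA4MRE, gives partial credit for leaving uncertain questions unanswered: C@1 = (n_correct + n_unanswered * n_correct / n) / n. A quick sketch with illustrative (hypothetical) counts:

```python
# Evaluation metrics used in QA4MRE. The counts below are hypothetical,
# chosen only to illustrate how abstention affects the scores.
#   C@1 = (n_correct + n_unanswered * (n_correct / n_total)) / n_total

def accuracy(n_correct: int, n_total: int) -> float:
    return n_correct / n_total

def c_at_1(n_correct: int, n_unanswered: int, n_total: int) -> float:
    return (n_correct + n_unanswered * (n_correct / n_total)) / n_total

n_total, n_correct, n_unanswered = 100, 36, 20   # hypothetical run
print(round(accuracy(n_correct, n_total), 2))             # 0.36
print(round(c_at_1(n_correct, n_unanswered, n_total), 2)) # 0.43
```

With these hypothetical counts, abstaining on 20 uncertain questions lifts the score from 0.36 to 0.43, which is exactly the behavior the task is designed to reward.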
1.4. Overview of Dissertation
In the next chapter, the Arabic QA related works are reviewed. The best performing English QA system is
also demonstrated. Then the best performing Arabic QA4MRE systems are explained.
In chapter three, the tools and language resources used in ALQASIM are demonstrated: the morphological analysis toolkit, the root stemmers, and the Arabic WordNet (AWN). The test-set and evaluation metrics used to evaluate ALQASIM are also demonstrated in the same chapter.
In chapter four, the architectures of the two versions of our answer selection and validation system (ALQASIM) are explained in detail. The results are then demonstrated and discussed in chapter five. In chapter six, the research is concluded and future work is highlighted.
Chapter 2
Related Works
2.1. Arabic QA systems
In this section, the Arabic QA systems are reviewed and compared. The first known Arabic QA system is AQAS, created by Mohammed et al., 1993 to answer questions in the radiation domain. It handles questions and declarative sentences posed by humans in Arabic natural language and uses a knowledge base in the form of frames to answer these questions. It analyzes user questions, formulates a query from them, and searches a structured set of data. However, the performance of this system was not reported, and it has been criticized for being just a natural language interface to an ordinary database.
The second Arabic system, QARAB, was created by Hammo et al., 2002 and 2004. It searches a corpus of news articles extracted from the Al-Raya newspaper using a Passage Retrieval (PR) module based on Salton's vector space model, and treats each document as a "bag of words". The system removes stop words, uses a lexicon-based stemmer to stem question and document words, and implements PoS tagging and NER using the system created by Abuleil & Evens, 1998. It also determines the expected answer type from the question words and extracts Named Entities (NEs) as the answer. QARAB reports a precision of 97.3%; however, this figure is suspiciously high, as it far exceeds the best performing English QA system (Lymba's PowerAnswer 4), created in 2007, which achieved a precision of 70.6% (Moldovan et al., 2007). It is also worth mentioning that the test-set was composed of only 113 questions, all of them factoid, that these questions were posed by the system creators themselves, and that the corpus and questions used are not publicly available.
Rosso et al., 2005 experimented with cross-language IR to answer Arabic questions from English documents. They translated the questions and then produced five different reformulations of each by verb and noun movement. Their cross-language QA system has a precision of 10.7% and an MRR (Mean Reciprocal Rank) of 0.08; this poor performance is due to the ambiguity introduced by translation. They found that the best results came from verb reformulation of the translated question.
Awadallah & Rauber, 2006 created a QA system that ranks passages according to Answer and Question words Count (AQC) and Answer and Question words Association (AQA). AQC is the number of question keywords found in the passage, and AQA is the co-occurrence of question and answer choice keywords within the same result snippet's context. They conducted their experiments on the questions of the Arabic TV show "Who's gonna be a millionaire?" and the TREC-2002 QA track questions. Their experiments revealed an average accuracy of 55% to 62%. The AQA strategy performed better on the Arabic language questions while AQC was better for the English language tasks, which may be due to the morphological complexity of Arabic, which resulted in retrieving only precise phrases if they exist, rather than retrieving split segments (Awadallah & Rauber, 2006).
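The two ranking signals can be sketched as follows. This is an assumed reading of AQC (question-keyword count) and AQA (question/answer-keyword co-occurrence in the snippet), written for illustration; it is not Awadallah & Rauber's actual implementation.

```python
# Sketch of the two ranking signals (an assumed reading of the paper):
# AQC counts how many question keywords appear in a snippet; AQA counts
# co-occurrences of question and answer-choice keywords in that snippet.

def aqc(question_kw: set, snippet: str) -> int:
    words = set(snippet.split())
    return len(question_kw & words)

def aqa(question_kw: set, answer_kw: set, snippet: str) -> int:
    words = set(snippet.split())
    # Co-occurrence: a question keyword and an answer keyword must
    # both appear within the same snippet's context.
    return len(question_kw & words) * len(answer_kw & words)

snippet = "the nile river flows through egypt and sudan"
print(aqc({"nile", "river", "egypt"}, snippet))    # 3
print(aqa({"nile", "river"}, {"egypt"}, snippet))  # 2
```

AQA only scores a snippet when question and answer evidence co-occur, which matches its reported advantage on morphologically precise Arabic matches.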
Rosso et al., 2006 created an Arabic QA system under the name of ArabiQA, which was completed by Benajiba et al., 2007, Benajiba & Rosso, 2007 and Benajiba et al., 2007. They created their own corpus and questions following CLEF guidelines, removed stop words from the questions and documents, and extracted the named entities. They also classified the questions into Name, Date, Quantity, and Definition questions according to the question words. ArabiQA uses the Distance Density N-gram model, which assigns a higher rank to passages with smaller distances between their matched keywords. Its answer extraction module tags NEs in retrieved passages, selects the answers with the expected NE type, and applies pattern matching to select the final list of answers. The architecture of ArabiQA is illustrated in Figure 4.
Figure 4. The architecture of ArabiQA (Benajiba & Rosso, 2007)
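The intuition of the Distance Density N-gram model, that passages whose matched question keywords lie closer together should rank higher, can be sketched with a toy density score. The scoring function below is an illustrative simplification, not JIRS's actual formula.

```python
# Toy distance-density score (illustrative, not JIRS's exact model):
# a passage scores higher when the question keywords it contains
# appear close to one another.

def density_score(question_kw: set, passage: str) -> float:
    tokens = passage.split()
    positions = [i for i, t in enumerate(tokens) if t in question_kw]
    if len(positions) < 2:
        return float(len(positions))
    span = positions[-1] - positions[0] + 1  # window covering all hits
    return len(positions) ** 2 / span        # more hits, tighter window

near = "the treaty was signed in cairo in 1979"
far = "cairo hosted talks while elsewhere a treaty was later signed"
kw = {"treaty", "signed", "cairo"}
print(density_score(kw, near) > density_score(kw, far))  # True
```

Both passages contain all three keywords, but the first packs them into a tighter window and therefore scores higher, which is the ranking behavior the model aims for.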
Brini et al., 2009 created QASAL, an Arabic QA system that answers definition questions. QASAL uses
NooJ local grammars to formulate the user question into a query. It extracts the expected answer type, question
focus and important question keywords and uses Google as a PR system. It achieved a recall of 100% and a preci-
sion of 94%. However, the size of the test-set is too small as it is only 100 Factoid questions and 43 Definition
questions. The architecture of QASAL is illustrated in Figure 5.
Figure 5. QASAL architectural components (Brini et al., 2009)
Another QA system was created by Kanaan et al., 2009. They analyzed the question by tokenizing
the question text and the question focus, and extracting the root of each non-stop word in the question. They also
determined the question type using the question words as shown in Table 2. They also used a passage retrieval
module based on Salton’s vector space model and achieved a precision of 43%. However, the test-set that they
used is very small as it contains only 25 documents gathered from the Internet, and 12 questions.
Table 2: Determining the question type according to question words as
suggested by Kanaan et al., 2009
Question words in Arabic Question Words in English Question Type
من Who, Whose Person, Group
متى When Date, Time
ماذا What Organization, Product, Event
أين Where Location
كم How much, How many Number, Quantity
DefArabicQA is another Arabic QA system created by Trigui et al., 2010 that tackles the definition type of
questions. They identified the candidate definitions using manual lexical patterns of sequences of words, letters
and punctuation symbols. They also used heuristic rules that they deduced from observing the form of some cor-
rect and incorrect definitions to enhance their algorithm, and ranked the candidate definitions according to the
weight of the definition pattern, snippet position and the sum of word frequencies in the candidate definition.
DefArabicQA has an MRR of 0.7, and 54% of the questions were answered by the first candidate answer re-
turned. Yet its main criticism is that the corpus size is too small as it contains only 50 organization definition
questions and the answers were assessed by only one Arabic native speaker.
Abouenour et al., 2009, 2010, 2011 created an Arabic QA system that answers the translated CLEF and TREC
questions. Their system uses the Yahoo search engine and JIRS (Java Information Retrieval System) for passage re-
trieval. The system applies morphological and semantic query expansion using the Arabic WordNet. Abouenour
et al. enriched AWN to help the query expansion of their system. They ranked the passages based on distance
density n-gram model, and used Amine Platform to score and rank the retrieved passages semantically using con-
cept graphs. Their QA system achieved a poor accuracy of 20.20%, which may have occurred because the number
of passages in JIRS was fewer than 1,000, which did not enable structure-based techniques to have a great effect on
the results.
AQuASys is an Arabic QA system created by Bekhti et al., 2011. It segmented the question into interrogative
noun, question’s verb and keywords, and it did not use a Named Entity Recognition (NER) system. It performed
at a recall of 97.5% and at a precision of 66.25%. Yet its fatal drawback is that it cannot be used as-is on an
untagged raw text corpus as it uses the 316 documents provided by ANERcorp (Arabic NER corpus) with its
150,000 tagged words and posed 80 questions on that corpus.
Abdelbaki & Shaheen, 2011 also analyzed the question in their Arabic QA system by applying tokenization,
normalization and NER. They also determined the expected answer type and question focus and applied keyword
extraction and expansion. They applied stemming using Khoja’s Stemmer, then used semantic similarity between
the question’s focus and the candidate answer and made matching using N-grams. Finally, they validated the an-
swers using accuracy scoring and ranking. The test-set that they used is the 316 documents
provided by ANERCorp, and they posed 240 questions on this test-set. The accuracy of their system is 86.25%,
and it also has an MRR of 0.87. However, the main criticism against this system is that it used ANER-
Corp for both training the NER module and as a document corpus which makes the results biased due to over-fit-
ting.
The best performing published system in English QA is Lymba’s Power Answer 4 created by Moldovan et al.,
2007. They used TREC 2007 questions, 175 GB collection of blog entries and 2.5 GB newswire articles. Power
Answer 4 integrated semantic relations, advanced inferencing abilities, syntactically constrained lexical chains,
and temporal contexts. It used strategies to answer each class of questions, where each strategy has the three
components of (i) Question Processing, (ii) Passage Retrieval and (iii) Answer Processing. It also resolved fuzzy
temporal expressions and integrated with a syntactic parser, an NER, a semantic parser, ontologies, and a logic
prover for textual inference in answer selection. It could also detect event-event relations as it used a concept tag-
ger. Lymba’s Power Answer 4 performed at an accuracy of 70.6% in factoid questions and at an accuracy of
47.9% in list questions.
2.1.1. Question Analysis
In an attempt to perform a better Question Analysis, Hammo et al., 2004 parsed the question to extract its cat-
egory and the type of answer required whether it is a name, a place, a quantity or a date, which makes it easier
later in the Answer Extraction phase to select the right answer.
Rosso et al. 2005 experimented with cross-language IR to answer Arabic questions from English documents.
To analyze the question, they translated it and then made 5 different formulations of the question by verb and noun
movement. They found out that the best results came out from verb reformulation in the translated question.
However, the results were not promising as the precision decreased by about 20% due to the ambiguity that trans-
lation adds to the question.
Rosso et al. 2006 analyzed Arabic questions by eliminating stop words, extracting named entities and classify-
ing the questions into Name, Date, Quantity, and Definition questions according to the question word used.
Brini et al., 2009 made some query formulation and extracted the expected answer type, question focus and
important question keywords. The question focus is the main noun phrase of the question that the user wants to
ask about. For example, if the user's query is "What is the capital of Tunis?" then the question focus is "Tunis",
the keyword is "capital" and the expected answer type is a named entity for a location. Unfortunately, this
work had only 100 questions which made it biased and unable to generalize (Brini et al., 2009).
Kanaan et al., 2009 made four steps to analyze the question. They tokenized the question, then determined its
type, then determined its focus which is the proper noun phrase and extracted the root of each non-stop word in
the question.
Abdelbaki and Shaheen 2011 analyzed the question by:
a. Tokenization & Normalization
• Replacing initial إ, آ, أ by ا and the letter ئ by the sequence ىء
• Replacing final ى by ي and final ة by ه
b. Determining answer type by question words (who, when...)
c. Named Entity Recognition (gazetteer, maxent model)
d. Focus determination by extracting the main NE
e. Keywords Extraction by removing stop words using the Khoja stop list, which contains 168 words, plus 1,131
stop words translated from English.
f. Keywords Expansion using the Arabic dictionary of synonyms. NEs are not expanded to avoid ambiguity.
g. Stemming by Khoja’s Stemmer and NEs are not stemmed
h. Query generation of keywords into a Boolean formula (Abdelbaki & Shaheen, 2011).
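For illustration, the tokenization and normalization rules in step (a) can be sketched as follows. This is a sketch of the listed rules only, not the authors' implementation; the function name and the rule ordering are assumptions:

```python
import re

def normalize(token: str) -> str:
    """Sketch of the normalization rules in step (a) above."""
    token = re.sub(r"^[إآأ]", "ا", token)   # initial alef variants -> bare alef
    token = token.replace("ئ", "ىء")        # hamza on yaa -> yaa + hamza
    token = re.sub(r"ى$", "ي", token)       # final alef maqsura -> yaa
    token = re.sub(r"ة$", "ه", token)       # final taa marbuta -> haa
    return token
```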
Bekhti et al., 2011 segmented the question into interrogative noun, question’s verb and question’s keywords.
2.1.2. Passage Retrieval
Awadallah and Rauber 2006 experimented with Arabic and English QA and introduced two techniques to
rank retrieved passages to select the best answer. The first technique is Answer and Question words Count (AQC)
which is based on the number of questions and/or answer choice keywords occurring in result snippets. The sec-
ond technique is Answer and Question words Association (AQA) which is the co-occurrence of question and an-
swer choice keywords within the same result snippet’s context. In other words, if there is a question with 5 candi-
date answers, then each candidate answer is joined with the question and passed to the passage retrieval module.
A retrieved passage is then assigned a higher ranking if it contains more question and candidate answer keywords
(AQC). If the candidate answer and question keywords appear nearer to each other in the retrieved passage it is
also assigned a higher ranking (AQA). They held their experiments on the question of the famous Arabic TV
show “Who's gonna be a millionaire?” and TREC-2002 QA track questions. Their experiments revealed an aver-
age performance of 55% to 62%. The AQA strategy had better performance on the Arabic language questions
while AQC was better for English language tasks. This may be due to the morphological complexity of Arabic
that resulted in retrieving only precise phrases if they exist, rather than retrieving split segments (Awadallah &
Rauber, 2006).
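The two ranking signals described above can be illustrated with a small sketch. The whitespace tokenization and the pair-count form of AQA are simplifications for illustration, not the authors' exact formulas:

```python
def aqc_score(snippet: str, q_keywords, a_keywords) -> int:
    """AQC: count how many question/answer keywords occur in the snippet."""
    tokens = set(snippet.split())
    return sum(1 for kw in set(q_keywords) | set(a_keywords) if kw in tokens)

def aqa_score(snippet: str, q_keywords, a_keywords) -> int:
    """AQA (simplified): co-occurrence of question and answer keywords in
    the same snippet, counted here as the number of co-occurring pairs."""
    tokens = set(snippet.split())
    q_hits = sum(1 for kw in set(q_keywords) if kw in tokens)
    a_hits = sum(1 for kw in set(a_keywords) if kw in tokens)
    return q_hits * a_hits
```

A snippet that contains both question keywords and the candidate answer thus scores on both signals, while a snippet missing the candidate answer gets an AQA of zero.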
Benajiba et al., 2007 ranked the retrieved passages according to the relevant question terms appearing in the
passage, assigning a higher rank to passages that have a smaller distance between keywords, which is
called the Distance Density model, as shown in Figure 6.
Figure 6. JIRS Passage Retrieval System Architecture (Benajiba et al., 2007)
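The intuition of the Distance Density model can be sketched as follows; the score used here (distinct matched terms divided by the window that spans them) is an illustration of the idea, not the actual JIRS n-gram formula:

```python
def distance_density(passage_tokens, question_terms):
    """Simplified distance-density score: passages whose question terms
    appear closer together get a higher score."""
    terms = set(question_terms)
    positions = [i for i, tok in enumerate(passage_tokens) if tok in terms]
    if not positions:
        return 0.0
    matched = len(terms & set(passage_tokens))  # distinct question terms found
    span = positions[-1] - positions[0] + 1     # window covering all matches
    return matched / span
```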
Kanaan et al., 2009 used a passage retrieval system following Salton’s vector space model, using query word
weights and the cosine similarity between document words and question words. Their system tokenized every docu-
ment, removed the stop words, and carried out root extraction and term weighting. However, their test-set was
only 25 documents gathered from the Internet, 12 queries (questions) and some relevant documents provided by
themselves (Kanaan et al., 2009).
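The vector space scoring described above can be sketched with raw term-frequency vectors; this minimal version uses tf weighting only and omits the idf component and root extraction:

```python
import math
from collections import Counter

def cosine_sim(query_terms, doc_terms):
    """Cosine similarity between raw term-frequency vectors, in the
    spirit of Salton's vector space model (tf weighting only)."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in set(q) & set(d))
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0
```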
Abouenour et al. described an enhanced passage retrieval approach built on the JIRS passage retrieval system,
following a three-level approach (Abouenour et al., 2009, 2010, 2011):
a- Keyword-based level: morphological and semantic query expansion using the Arabic WordNet including
the concept hypernyms, hyponyms, synonyms and definition.
b- Structure-based level: ranking the passages based on Distance Density n-gram Model giving higher rank to
passages that have the question words appear nearer one another.
c- Semantic Reasoning level: where he used Amine Platform to score and rerank the retrieved passages se-
mantically using concept graphs to find the most relevant answer passage.
However, the number of processed passages in JIRS was fewer than 1,000, which did not enable structure-based
techniques to have a great effect.
The CLEF 2012 campaign had two Arabic QA attempts. The first attempt is IDRAAQ by Abouenour et al. Its
NER is achieved by mapping the YAGO1 ontology to Arabic WordNet. The Passage Retrieval module of
IDRAAQ is based on 2 levels:
1. Keyword-based level: based on Query Expansion process relying on Arabic WordNet semantic relations
2. Structure-based level: based on a Distance Density N-gram Model passage retrieval system which is JIRS
2.1.3. Answer Extraction and Validation
Trigui et al., 2010 tackled the definition type of questions. They first identified the candidate definitions using
manual lexical patterns of sequences of words, letters and punctuation symbols. Then they used some heuristic
rules that they deduced from observing the form of some correct and incorrect definitions. After they extracted
the candidate definitions, they ranked them according to three criteria which are (i) Pattern weight of the pattern
that matched the candidate definition, (ii) Snippet position of the snippet that contains the candidate definition in
the snippets collection and (iii) the sum of word frequencies in the candidate definition. However, their evalua-
tion was limited, as they tested on only 50 organization definition questions and the answers were as-
sessed by only one Arabic native speaker.
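The three ranking criteria can be combined in a toy scorer as follows; the field names and the combination weights are illustrative assumptions, not values from Trigui et al., 2010:

```python
def rank_definitions(candidates):
    """Rank candidate definitions by (i) weight of the matched pattern,
    (ii) position of the source snippet (earlier is better), and
    (iii) the sum of word frequencies in the candidate."""
    def score(c):
        return (c["pattern_weight"]
                + 1.0 / (1 + c["snippet_pos"])         # earlier snippets score higher
                + 0.1 * sum(c["word_freq"].values()))  # frequency evidence
    return sorted(candidates, key=score, reverse=True)
```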
Abdelbaki & Shaheen, 2011 used semantic similarity between the question’s focus and the candidate answer
and made matching using n-grams. After that they validated the answers using accuracy scoring and ranking.
The results they achieved were an accuracy of 86.25% and a Mean Reciprocal Rank (MRR) of 0.87. They also pro-
vided the average response time which was 2262 ms on a machine with low specs (CPU: Intel® 1.60 GHz,
RAM: 512 MB) (Abdelbaki & Shaheen, 2011). However, this work used the 316 ANERCorp articles as a QA
corpus and posed 240 questions on this small corpus, which does not provide enough passage redundancy to test
the passage retrieval and answer extraction modules. It is also notable that using ANERCorp both to train the
Arabic Named Entity Recognition (NER) classifier and as the QA corpus over-fits the system to this corpus, so it
may not reach the same results on other unseen texts.
1 Yet Another Great Ontology: http://www.mpi-inf.mpg.de/YAGO-naga/YAGO/downloads.html
Figure 7. Example of the Answer Extraction module's performance steps (Benajiba et al., 2007)
Benajiba et al., 2007, in their system named ArabiQA, approached the Answer Extraction task in three steps
as shown in Figure 7:
a- Using an NER system to tag all NEs in the retrieved passages.
b- Selecting only the candidate answer NEs that have the expected answer type.
c- Applying a set of patterns to select the final list of answers.
Moreover, they created a test-set solely to evaluate their Answer Extraction module separately from the rest
of the system. This test-set was made up of four lists:
a- List of the questions
b- List containing the type of each question
c- List of manually selected passages that contain the right answers for the questions
d- List of correct answers
Their AE (Answer Extraction) module performed at a precision of 83.3%, where precision here is calculated
by dividing the number of correct answers by the number of questions (Benajiba et al., 2007).
2.2. The Best Performing English QA System
On the other hand, the best performing English QA system Lymba’s Power Answer 4, created by Moldovan et
al., 2007 performed at an accuracy of 70.6% in factoid questions and 47.9% in list questions. Lymba’s Power An-
swer 4 used the test-set of TREC 2007 and integrated semantic relations, advanced inference abilities, syntacti-
cally constrained lexical chains, and temporal contexts. It used strategies to answer each class of questions: each
strategy has the 3 components of (i) Question Processing (ii) Passage Retrieval (iii) Answer Processing. It also
resolved fuzzy temporal expressions, and it was integrated with a syntactic parser, an NER, a semantic parser,
ontologies, and a logic prover for textual inference in answer selection, and used a concept tagger to detect
event-event relations.
2.3. Answer Selection and QA4MRE systems
Answer selection is concerned with selecting the best answer choice from multiple choices suggested by the
answer generation tasks which are question analysis and passage retrieval. A few works tackled the answer selec-
tion task in QA and most of these works used redundancy IR based approaches. Most of them also depended on
external sources like Wikipedia or different public Internet search engines. For example, Ko et al., 2007 used
Google search engine and Wikipedia to score answer choices according to redundancy, using a corpus of 1760
factoid questions from Text Retrieval Conference (TREC). Another similar attempt was carried out by Mendes &
Coheur, 2011 who explored the effects of different semantic relations between answer choices and the perfor-
mance of a redundancy-based QA system, also using the corpus of factoid questions provided by TREC. How-
ever, these approaches proved to be more effective in factoid questions. This is because factoid questions ask
about date/time, number or named entities, which could be searched easily on the Internet and aligned with the
answer choices. These kinds of answers are also repeated multiple times in almost the same format in different
documents. On the other hand, QA4MRE supports the answer choices of each question with a document and de-
pends on language understanding to select the correct answer choice, which can generalize for other kinds of
questions like causal, method, list and purpose questions.
In CLEF 2012, Arabic QA4MRE was introduced for the first time. Two Arabic systems participated in this
campaign. The first system, IDRAAQ, created by Abouenour et al., 2012, achieved an accuracy of 0.13 and
a c@1 of 0.21. It used JIRS for passage retrieval, which uses the Distance Density N-gram Model, and semantic ex-
pansion using Arabic WordNet (AWN). IDRAAQ did not use the CLEF background collections. However, its de-
pendence on the traditional QA pipeline, to tackle QA4MRE, is the main reason behind its poor performance.
This is because the system depends mainly on the passage retrieval module, when the focus should have been on
the answer selection module as the right passages were already provided by the test-set. The use of AWN may
have also reduced performance due to its general purpose nature, which adds ambiguity by blurring the differ-
ences between the answer choices.
The second system, by Trigui et al., 2012, achieved an accuracy and a c@1 of 0.19. Since the two scores are
equal, their system did not mark any questions as unanswered and attempted to answer all the test-set questions. Their sys-
tem uses an IR based approach. It collects the passages that have the question keywords and aligns them with the
answer choices, then searches for the best answer choice in the retrieved passages. It then employs semantic ex-
pansion, using some inference rules on the background collection, to expand the answer choices that could not be
found in the retrieved passages. The main reason behind the poor performance of that approach is that it depends
on the background collection as it offers enough redundancy for the passage retrieval module. This makes the
system very similar to traditional question answering systems, and does not attempt to analyze the reading test
document.
However, these two Arabic systems do not compare to the best performing system in the same campaign,
created by Bhaskar et al., 2012, which used the English test-set and performed at an accuracy of 0.53
and c@1 of 0.65. This system combined each answer choice with the question in a hypothesis and searched for
the hypothesis keywords in the document, then ranked the retrieved passages according to textual entailment, which
shows that analyzing the reading test document yields much better performance. A comparison of the two
Arabic QA4MRE systems and the best performing English system is illustrated in Table 3. Then the implemented
techniques of the best performing Arabic and English QA4MRE systems are compared in Table 4. The results of
the three QA4MRE systems are compared in the chart in figure 8.
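The hypothesis-based idea can be sketched as follows, with keyword overlap standing in as a crude proxy for the textual entailment scoring that Bhaskar et al., 2012 actually used:

```python
def rank_by_hypothesis(question: str, answer_choice: str, passages,
                       stop_words=frozenset()):
    """Join the question and one answer choice into a 'hypothesis', then
    rank passages by how many hypothesis keywords they contain."""
    hypothesis = {w for w in (question + " " + answer_choice).lower().split()
                  if w not in stop_words}
    return sorted(passages,
                  key=lambda p: len(hypothesis & set(p.lower().split())),
                  reverse=True)
```

Each answer choice yields its own hypothesis, so the choice whose top-ranked passage scores highest can be selected as the answer.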
Figure 8. Performance of QA4MRE systems @ CLEF 2012
Table 3: Comparison between Arabic QA4MRE @ CLEF 2012 and the best performing English System
QA4MRE System Deployed Components Performance
IDRAAQ Abouenour et
al., 2012
- NER by mapping the YAGO ontology and Arabic
WordNet.
- Did not use CLEF background collections
- PR based on Query Expansion using AWN semantic re-
lations, and Distance Density N-gram Model of JIRS
C@1 : 0.21
Accuracy : 0.13
Trigui et al., 2012 - Determine question focus
- PR retrieved passages are aligned with the multiple answer
choices of the question.
- Semantic expansion using inference rules on the back-
ground collection.
C@1 : 0.19
Accuracy : 0.19
English QA4MRE @
CLEF 2012
Bhaskar et al., 2012
- Combined each answer choice with the question in a hy-
pothesis
- PR searched for hypothesis keywords.
- Ranked passages according to textual entailment.
C@1 : 0.65
Accuracy : 0.53
Table 4: Techniques of the best performing QA4MRE Arabic and English Systems @ CLEF 2012
Criterion IDRAAQ: Abouenour
et al., 2012
English System
Bhaskar et al., 2012
Question Analysis and Linguistic Processing Methods
Automatically Acquired Patterns Yes Yes
PoS Tagging Yes Yes
n-grams Yes Yes
Chunking Yes Yes
Dependency Analysis Yes Yes
NER Yes Yes
Temporal Expressions - Yes
Numerical Expressions - Yes
Syntactic Transformations - Yes
Grammatical Functions (subject, object, etc.) Yes Yes
Semantic Parsing Yes Yes
Semantic Role Labeling Yes Yes
Predefined sets of relations Yes -
Frames Yes -
Conceptual Graphs Yes -
Similarity Scoring Yes -
Answer Validation Techniques
Redundancies in Collection Yes -
Lexical Similarity (Term Overlapping) Yes Yes
Syntactic Similarity Yes Yes
Semantic Similarity Yes Yes
Chapter 3
Tools, Test-set and Evaluation Metrics
In this chapter, the tools that were used in ALQASIM 1.0 and 2.0 are demonstrated. These tools include a
morphological analysis toolkit, an Arabic ontology (Arabic WordNet) and three different root stemmers. The
test-set that is used in this research is also explained in detail, which is a list of documents, questions and answer
choices for these questions that cover various domains and types of questions. Last but not least, the evaluation
metrics used to evaluate ALQASIM are explained to set the grounds for the upcoming chapters that will explain
the architecture, results and discussion of ALQASIM 1.0 and 2.0.
3.1. Tools and Resources
Natural Language Processing tools and resources are very crucial for question answering systems. These tools
should be language specific due to the fact that every language has its own set of rules that do not generalize for
other languages. Among the important tools used by any question answering system are the morphological analy-
sis toolkits and stemmers. Other language specific resources like lexicons and ontologies are very important as
they provide a semantic dimension to the question answering process.
3.1.1. MADA+TOKAN PoS tagger
Habash et al., 2009 created MADA+TOKAN, which is one of the best performing Arabic morphological analy-
sis toolkits. Its main advantages are that it is freely available and performs with high accuracy.
MADA+TOKAN offers various Arabic NLP services:
– Tokenization: splitting the stream of Arabic text into separate tokens. Separating tokens requires classi-
fication to separate different clitics in the same word token.
– Diacritization: adding diacritics to MSA words to disambiguate their meaning and pronunciation.
Diacritics are very important in Arabic as they play a role similar to that of vowels in English.
– Part-of-speech (PoS) tagging: finding the PoS of each word, whether it is a verb, a noun, a preposition,
etc.
– Light stemming: removing prefixes and suffixes from Arabic words and returning them to their stems.
This is different from root stemming because it changes only the inflectional form of the word and not
the derivation form of it.
– Word number identification: finding whether a word is a singular, a plural or a dual word.
– Word gender identification: finding the word gender whether it is a masculine or a feminine word.
MADA examines all possible analyses for each word, and then selects the analysis that matches the current
context using Support Vector Machine (SVM) models classification for 19 distinct, weighted morphological fea-
tures as shown in Table 5. TOKAN then takes the output of MADA and generates tokenized output in a custom-
izable format. MADA has over 86% accuracy in predicting full diacritization.
Table 5. The 19 features used by MADA
Feature AKA Description Predicted With
pos POS Part-of-Speech (e.g., noun, verb, preposition, etc.) SVM
conj CNJ Presence of a conjunction (w+ or f+) SVM
part PRT Presence of a particle clitic (b+, k+, l+) SVM
clitic PRO Presence of a pronominal clitic (object or possessive) SVM
art DET Presence of definite article (Al+) SVM
gen GEN Gender (feminine or masculine) SVM
num NUM Number (singular, dual, plural) SVM
per PER Person (first, second, or third person) SVM
voice VOX Voice (passive or active) SVM
aspect ASP Aspect (perfective, imperfective) SVM
mood MOD Mood (imperative, nominative, etc ...) SVM
def NUN Presence of nunation (Definite or Indefinite) SVM
idafa CON Construct state (Possessive or Non-possessive) SVM
case CAS Case (Nominative, Accusative, Genitive) SVM
unigramlex Lexeme predicted by a unigram model of lexemes N-gram
unigramdiac Diacritic form predicted by a unigram model of diacritic forms N-gram
ngramlex Lexeme predicted by an N-gram model of lexemes N-gram
isdefault Boolean: whether the analysis is a default BAMA (Buckwalter Arabic Morphological Analyzer) output Deterministic
spellmatch Boolean: whether the diacritic form is a valid spelling match Deterministic
3.1.2. Root Stemmers
Root stemmers are used to extract the three- or four-letter root of an Arabic word. Thus, if the word assumes
any inflectional or derivational form other than its root, this form is changed back to its original root form. This
is done by removing any prefixes, suffixes or infixes and checking the derivational pattern of a word to return it
back to its root. Three root stemmers have been tested in ALQASIM 2.0, which are (i) Khoja, (ii) ISRI, and (iii)
Tashaphyne root stemmers.
3.1.2.1. Khoja Root Stemmer
The Khoja2 root stemmer is a Java Arabic root stemmer, created by Khoja & Garside, 1999. It removes the
longest suffix and prefix, and then matches the retrieved stem with the verb and noun patterns, to extract the root.
Khoja Arabic stemmer handles weak letters, which may change their form in an Arabic word root (i.e. alif, waw
or yah). It also identifies words that do not have roots like the Arabic words for “we”, “under”, “after”, and so on.
Khoja stemmer also produces the right root if a letter is deleted from the root during derivation due to duplicate
letters (i.e. the last two letters are the same) (Khoja & Garside, 1999). Taghva et al., 2005 reported that Khoja
stemmer has an Average Precision of 46.3%. Khoja Stemmer was also used by Larkey & Connell, 2006 in an
Arabic information retrieval system, which helped them to improve the Average Precision of their system by 49%
over the non-stemmed technique.
3.1.2.2. ISRI Root Stemmer
ISRI (Information Science Research Institute) root stemmer is a Python Arabic root stemmer, created by
Taghva et al., 2005. The ISRI stemmer has many features in common with the Khoja stemmer; however, it does
not use a root dictionary, which makes the ISRI stemmer more capable of stemming rare and new words. It nor-
malizes unstemmed words, has more stemming patterns than Khoja stemmer, and more than 60 stop words. ISRI
stemmer has an Average Precision of 48% (Taghva et al., 2005).
3.1.2.3. Tashaphyne Root Stemmer
The Tashaphyne Light Arabic Stemmer works by first normalizing words in preparation for the “search and
index” tasks required for stemming, including removing diacritics and elongation from input words. Next, seg-
mentation and stemming of the input is performed using a default Arabic affix lookup list, allowing for various
levels of stemming and rooting (Oraby et al., 2012).
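To make the idea of affix stripping concrete, the following toy function removes one common prefix and one common suffix. Real root stemmers such as Khoja and ISRI go much further, matching verb and noun patterns and handling weak letters, none of which is attempted here; the affix lists are illustrative, not the actual Khoja lists:

```python
# Illustrative affix lists only; the real stemmers use far richer ones.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["ات", "ون", "ين", "ها", "ة"]

def light_strip(word: str) -> str:
    """Strip one matching prefix and one matching suffix, keeping at
    least three letters (a stand-in for the minimum root length)."""
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word
```

Note that this stays at the stem level (e.g. مدرس), while a true root stemmer would continue to the pattern-matching step to reach the root (درس).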
2 Shereen Khoja Research: http://zeus.cs.pacificu.edu/shereen/research.htm
3.1.3. Arabic WordNet
Elkateb et al., 2006 introduced the Arabic WordNet and described the challenges they faced to create it. Ara-
bic WordNet is a lexical resource for MSA based on the widely used Princeton WordNet for English. Arabic
WordNet was also enriched by Abouenour et al., 2009, 2010, 2011 by adding new named entities, new verbs and
new nouns which enriched the hyponymy relation between concepts.
3.2. Test-set
Using a test-set that is well developed and used by several other systems provides an objective basis for com-
paring the developed system with its alternatives. In this section, the test-set that was used to vali-
date ALQASIM 1.0 and 2.0 is explained in detail to highlight its form and coverage for different kinds of ques-
tions.
The test-set used to evaluate ALQASIM 1.0 and 2.0 is provided by the QA4MRE task at CLEF 2012.
QA4MRE at CLEF 2012 is the fourth campaign of its kind (Penas et al., 2012); however, the Arabic QA4MRE
test-set was introduced for the first time in 2012. It is composed of four topics: (i) "AIDS", (ii) "Climate change",
(iii) "Music and Society", and (iv) "Alzheimer". Each topic consists of four reading tests, and every reading test
has ten questions, where each question has five answer options:
• 16 test documents (4 documents for each of the 4 topics)
• 160 questions (10 questions for each document)
• 800 answer choices/options (5 for each question)
QA4MRE questions are designed to test the comprehension of only one document. The questions test the rea-
soning capabilities of the participating QA systems which may include inference, relative clauses, elliptic expres-
sions, meronymy, metonymy, temporal and spatial reasoning, and reasoning on quantities. The questions may
also need some background knowledge that is not present in the test document. That is why “Background Collec-
tions” are provided by CLEF to fill this need. The question types are:
(i) Factoid: (where, when, by-whom)
(ii) Causal: (what was the cause/result of event X?)
(iii) Method: (how did X do Y? or in what way did X come about?)
(iv) Purpose: (why was X brought about? or what was the reason for doing X?)
(v) Which is true: (what can a 14 year old girl do?)
Questions are also classified according to their information requirements as follows:
(i) 75 questions do not need extra knowledge (from background collections).
(ii) 46 questions need background knowledge.
(iii) 21 questions need inference.
(iv) 20 questions need information to be gathered from different sentences or paragraphs.
3.3. Evaluation metrics
Using the appropriate evaluation metrics makes it easy for other researchers to compare the developed system
with its alternatives and sets a solid ground for comparing different approaches to solving the question answer se-
lection and validation problem. In this section, the metrics used to evaluate ALQASIM 2.0 are explained. Accu-
racy and c@1 are the metrics used to evaluate the performance of QA4MRE systems.
3.3.1. Accuracy
Accuracy is used by many information retrieval systems to evaluate their performance in terms of the ability
of these systems to retrieve relevant data items and ignore irrelevant ones. It is calculated by dividing the number
of retrieved relevant items plus the number of irrelevant items that are not retrieved by the number of all items
(Manning et al., 2008). Table 6 introduces a contingency matrix that shows the relation between retrieval and rel-
evance of data items, introducing the notions of true positives (tp), true negatives (tn), false positives (fp) and
false negatives (fn), then equation 1 uses these notions to show how accuracy is calculated.
Table 6. Information Retrieval contingency table
Retrieved Not Retrieved
Relevant true positives (tp) false negative (fn)
Not Relevant false positives (fp) true negative (tn)
Accuracy = (tp + tn) / (tp + fp + tn + fn)    (1)
Where:
• tp: True Positives
• tn: True Negatives
• fp: False Positives
• fn: False Negatives
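Equation 1 translates directly into code; a minimal sketch:

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Equation 1: the fraction of all items that are handled correctly,
    i.e. relevant items retrieved plus irrelevant items not retrieved."""
    return (tp + tn) / (tp + fp + tn + fn)
```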
3.3.2. C@1
There was a need for a metric that gives partial credit to systems that leave some questions unanswered in cases of uncertainty, to encourage researchers in the field of question answer selection and validation to improve the quality of their systems. To meet this need, C@1 was introduced in the QA4MRE task at CLEF 2011 by Penas et al., 2011. C@1 encourages systems to leave questions unanswered rather than answer them incorrectly: it reduces the amount of incorrect answers by granting partial credit for unanswered questions (Penas et al., 2011), as shown in equation 2.
C@1 = (1/n) × (nR + nU × (nR / n))    (2)
Where:
• nR: number of correctly answered questions
• nU: number of unanswered questions
• n: total number of questions
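Equation 2 can be sketched as a short function. The example figures are the counts later reported for the best run of ALQASIM 2.0 (57 correct, 29 unanswered, out of 160 questions), showing how unanswered questions earn partial credit at the system's own accuracy rate:

```python
def c_at_1(n_correct: int, n_unanswered: int, n_total: int) -> float:
    """Equation 2: nR/n plus partial credit at rate nR/n for each of
    the nU unanswered questions."""
    return (n_correct + n_unanswered * (n_correct / n_total)) / n_total

# 57 correct and 29 unanswered out of 160 questions scores higher than
# the raw accuracy of 57/160 = 0.356:
print(round(c_at_1(57, 29, 160), 2))  # 0.42
```

Note that when no questions are left unanswered (nU = 0), C@1 reduces to plain accuracy nR/n.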
Chapter 4
System Architecture
In CLEF 2013, we introduced ALQASIM 1.0, a QA4MRE system that analyzes the reading-test documents and scores answer choices according to the occurrence of their keywords, the keywords' weights, and the keywords' distance from the question keywords. The first section of this chapter explains the architecture of ALQASIM 1.0 (the first version); the second section explains the architecture of ALQASIM 2.0 in detail.
4.1. ALQASIM 1.0 initial architecture
Most Question Answering systems are composed of three main phases, which are: Question Analysis, Passage
Retrieval and Answer Extraction. However, these systems are mainly targeted at searching for answers in a large
collection of documents or on the Internet, which makes passage retrieval efficient (Ezzeldin & Shaheen, 2012).
QA4MRE is different in that respect because the answer to a question is found in only one document, so there is not enough information redundancy to help the statistical IR approaches to passage retrieval. Thus, we argue that the ordinary QA pipeline is not the best approach to QA4MRE; the best approach is the one a human being uses in a reading test. A person would normally read and understand a document thoroughly, and then begin to tackle the questions one by one. So, we divided the QA4MRE process into three phases: (i) Document Analysis, (ii) Locating Questions & Answers, and (iii) Answer Selection. See figures 9 and 10.
Figure 9. Overview of ALQASIM architecture
4.1.1. Document Analysis
In the Document Analysis phase, the reading-test documents are analyzed using the MADA+TOKAN (Habash et al., 2009) morphological analyzer to stem each word in the documents and get its part-of-speech (PoS). The stop words are then removed, and an inverted index is created for the remaining word stems, containing the locations of each stem and its weight. Arabic WordNet (AWN) is also used to expand the words semantically by adding the synonyms of each word to the inverted index of that document. The weight of each word in the inverted index is assigned according to its PoS and repetition, so that nouns, verbs, adjectives, adverbs, proper nouns and the other parts of speech are assigned different weights. The weight of a word is then divided by its count in the document; thus, the more a word is repeated, the less its weight will be. These weights mark the importance of keywords: the higher the weight of a word, the more important it is in the document. We carried out the morphological analysis in an offline step to increase the speed of the system while testing the questions.
Figure 10. Detailed Architecture of ALQASIM
4.1.2. Locating Questions & Answers
In the second phase, every question and answer choice is handled as follows. Keywords are identified by stemming and removing stop words. The inverted index is then searched to find the three best-scoring locations for the keywords of each question and answer choice. This score is calculated according to: (i) the number of keywords found within a distance threshold, (ii) the weights of all found keywords, and (iii) the distance between these keywords. The impact of keyword count and weights is positive, while the impact of distance is negative, which means that location scores are penalized for greater distances among their keywords.
4.1.3. Answer Selection
By now, the question and each of its five answer choices have three scored locations. In this phase, answer-choice locations are scored with respect to the question locations. This score is generated by summing the scores of one question location and one answer-choice location and subtracting the distance between them. The maximum of these scores is selected as the answer-choice score. The highest-scoring of the five answer choices is then selected as the question answer. If more than one answer choice has the best score, the answer is not certain and the question is marked as unanswered.
4.1.4. Evaluation of ALQASIM 1.0
ALQASIM 1.0 searches for the answer choices within proximity to the question snippets. Keyword weights are calculated according to their repetition in the document and their PoS tags, where keyword weights decrease the more they are repeated in the document. This approach leads to a promising performance of 0.31 accuracy and 0.36 C@1, without using the CLEF background collections (Ezzeldin et al., 2013). It compares favorably with the other two Arabic QA4MRE systems (Abouenour et al., 2012; Trigui et al., 2012), as shown in figure 11, especially considering that it did not use the CLEF background collections.
However, it does not take into consideration the natural boundary of sentences. Thus, if a weak answer choice
(in terms of keywords count and weight) is nearer to the question snippet, it may be selected as the correct an-
swer choice; even if there is a stronger answer choice that is a little bit further but still in the same sentence or the
sentence next to it. It also contains many manually adjusted weights that may make the system over-fit for the
test-set. Another disadvantage of ALQASIM 1.0 is that it finds the best three question snippets and three answer
snippets, then searches for the nearest and best scoring pair of question and answer choice snippets to choose the
correct answer choice. Thus, it may mistake an answer choice as the correct one if a sub-optimum question snip-
pet is chosen due to its high aggregate score with its associated answer choice snippet, which may happen with
answer choices that have many keywords. As an example, figure 12 shows how ALQASIM 1.0 fails to mark all the question keywords (marked in cyan) as the question snippet keywords because they are slightly too far apart; in fact, they are only farther apart than the threshold that is set to identify related words. On the other hand, if this threshold is increased, it affects the results negatively, as words from different snippets get marked, which produces more false positives.
Figure 11. The performance of ALQASIM 1.0 versus the other two Arabic QA4MRE systems
Figure 12. Example for failure of ALQASIM 1.0 while trying to locate the question snippet.
4.2. ALQASIM 2.0 Architecture
In ALQASIM 2.0, sentence splitting is utilized as a natural boundary to separate related units of meaning. So,
the question and answer choice snippets are considered related if they co-occur in the same sentence or share the
same sentence boundary. Thus, it searches for the answer choice in three sentences only: the sentence before the
question snippet, the sentence after it and the question snippet sentence itself.
In this section, ALQASIM 2.0 components and their integration will be explained in detail. As illustrated in
figure 13, the system consists of three main modules: (i) Document Analysis, (ii) Question Analysis, and (iii) An-
swer Selection.
Figure 13. ALQASIM 2.0 Architecture
4.2.1. Document Analysis
The document analysis module is crucial for the answer selection process. It gives context for morphological analysis and makes relating the question snippet to the answer-choice snippet easier. In this module, the document is split into sentences, and the sentence indexes are saved in an inverted index for future reference. The text words are then tokenized, stemmed, and part-of-speech (PoS) tagged using the MADA+TOKAN morphological analysis toolkit. MADA+TOKAN performs light stemming and PoS tagging, and determines the number and gender of each word. The ISRI Arabic root stemmer (Taghva et al., 2005) is then used to retrieve the root of each word. Textual numeric expressions are rewritten in digits. The word, light stem, root, and numeric expansion are
all saved in an inverted index that marks the position of each word occurrence. The words are also expanded us-
ing an ontology of hypernyms and their hyponyms extracted from the background documents collection provided
by CLEF 2012. See figure 14.
Figure 14. Document Analysis module architecture
In the previous paragraph, we mentioned some modules and components that we have not explained yet:
i. Inverted Index
ii. Morphological analysis
iii. Sentence splitting
iv. Root expansion
v. Numeric expansion
vi. Ontology-based semantic expansion
These modules will be explained in detail in the following six subsections.
4.2.1.1. Inverted Index
The inverted index is an in-memory hash map data structure, created for each document. The key of this hash
map is a string token that could be a word stem, root, semantic expansion, or digit representation of a numeric
expression. The value of the hash map consists of a set of locations (a location is the index number of a word in
the text) where this token occurs. The inverted index also has another hash map that holds the weight of each token, which is determined according to its part-of-speech (PoS) tag and whether it is a named entity (NE). See table 7. The intuition behind the token weights is to give higher weights to more informative tokens. Named Entities (NEs) are considered keywords that can mark a question or an answer snippet easily. That is why
NEs hold the highest weight. Nouns convey most of the meaning in Arabic sentences, because most Arabic sentences are nominal. Adjectives and adverbs come third in weight, after NEs and nouns, because they qualify nouns and verbs and help convey meaning. Verbs are weighted below adjectives and adverbs, as they are less informative. The lowest weight is assigned to any other PoS tag. Prepositions, conjunctions and interjections were already removed in the stop-word removal process before the words were saved to the inverted index.
Table 7: Weights determined according to PoS tags and NEs

PoS Tag / Named Entity                                      Weight
Named Entity (person, location, organization)               300
Nouns                                                       100
Adjectives and Adverbs                                      60
Verbs                                                       40
Other PoS tags                                              20
Prepositions, conjunctions, interjections and stop words    N/A (not saved in the inverted index)
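The structure described above can be sketched as follows. The class and the PoS tag names are illustrative assumptions; the weights follow table 7:

```python
from collections import defaultdict

# Weights from table 7; the short tag names themselves are assumptions.
POS_WEIGHTS = {"NE": 300, "noun": 100, "adj": 60, "adv": 60,
               "verb": 40, "other": 20}
STOP_POS = {"prep", "conj", "interj"}

class InvertedIndex:
    """Per-document index: token -> set of word positions, plus a
    second map holding each token's PoS/NE-based weight."""
    def __init__(self):
        self.positions = defaultdict(set)   # token -> {word positions}
        self.weights = {}                   # token -> weight

    def add(self, token: str, position: int, pos_tag: str) -> None:
        if pos_tag in STOP_POS:             # stop PoS never enters the index
            return
        self.positions[token].add(position)
        self.weights[token] = POS_WEIGHTS.get(pos_tag, POS_WEIGHTS["other"])

idx = InvertedIndex()
idx.add("مرض", 3, "noun")   # "disease" indexed at position 3
idx.add("في", 4, "prep")    # "in" is dropped as a stop PoS
print(idx.weights["مرض"])   # 100
```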
4.2.1.2. Morphological analysis module
The morphological analysis module is the most important module in the system as it provides the required in-
formation for the rest of the modules to work with. MADA+TOKAN has been used in this module to provide to-
kenization, part-of-speech (PoS) tagging, light stemming, and number determination of each word.
MADA+TOKAN is an open source morphological analysis toolkit, created by Habash et al., 2009. It provides
many Arabic NLP services like tokenization, diacritization, morphological disambiguation, gender and number
determination, part-of-speech (PoS) tagging, stemming and lemmatization. MADA also takes the current word
context into consideration, as it examines all possible analyses for each word, and selects the analysis that
matches the current context using Support Vector Machine (SVM) classification for 19 weighted morphological
features. TOKAN generates a customizable tokenized output from MADA output. MADA+TOKAN has an accu-
racy of 86% in predicting full diacritization (Habash et al., 2009). The morphological analysis module also saves
the stem of each word in the search inverted index, with a weight that is determined by the word PoS tag as men-
tioned in the previous subsection.
4.2.1.3. Sentence splitting module
The sentence splitting module marks the end of each sentence by the period punctuation mark that is recog-
nized by MADA PoS tagger. MADA helps the sentence splitting module by differentiating between the punctua-
tion period mark and other similar marks, like the decimal point. The sentence splitting process is very important
for locating question and answer snippets.
4.2.1.4. Root Expansion module
Arabic is a highly derivational language. This results in term sparseness that cannot be solved by only using a
light stemmer. Thus, using a root stemmer to expand words proved to be effective in many natural language pro-
cessing tasks on morphologically rich and complex languages like Arabic (Oraby et al., 2012).
The root expansion module uses an Arabic root stemmer to extract each word's root and save it to the inverted index as an expansion of the word itself. Three root stemmers were tested with ALQASIM 2.0: (i) Khoja, (ii) ISRI and (iii) Tashaphyne.
The Khoja root stemmer is a Java Arabic root stemmer, created by Khoja & Garside, 1999. It removes the
longest suffix and prefix, and then matches the retrieved stem with the verb and noun patterns, to extract the root.
Khoja Arabic stemmer handles weak letters, which may change their form in an Arabic word root (i.e. alif, waw
or yah). It also identifies words that do not have roots like the Arabic words for “we”, “under”, “after”, and so on.
Khoja stemmer also produces the right root if a letter is deleted from the root during derivation due to duplicate
letters (i.e. the last two letters are the same) (Khoja & Garside, 1999). Taghva et al., 2005 reported that Khoja
stemmer has an Average Precision of 46.3%. Khoja Stemmer was also used by Larkey & Connell, 2006 in an
Arabic information retrieval system, which helped them to improve the Average Precision of their system by 49%
over the non-stemmed technique.
ISRI (Information Science Research Institute) root stemmer is a Python Arabic root stemmer, created by
Taghva et al., 2005. The ISRI stemmer has many features in common with the Khoja stemmer; however, it does
not use a root dictionary, which makes the ISRI stemmer more capable of stemming rare and new words. It nor-
malizes unstemmed words, has more stemming patterns than Khoja stemmer, and more than 60 stop words. ISRI
stemmer has an Average Precision of 48% (Taghva et al., 2005).
The Tashaphyne Light Arabic Stemmer works by first normalizing words in preparation for the “search and
index” tasks required for stemming, including removing diacritics and elongation from input words. Next, seg-
mentation and stemming of the input is performed using a default Arabic affix lookup list, allowing for various
levels of stemming and rooting (Oraby et al., 2012).
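The idea of root expansion can be sketched with a toy root table standing in for a real root stemmer such as ISRI; the two words and their shared root are the example discussed later with figure 23:

```python
# Toy stand-in for a root stemmer; a real system would call ISRI here.
ROOTS = {"المزيد": "زيد", "زيادة": "زيد"}

def index_with_root(index: dict, word: str, position: int) -> None:
    """Save the word and, when a root is known, the root itself at the
    same position, so different derivations match as one token."""
    index.setdefault(word, set()).add(position)
    root = ROOTS.get(word)
    if root is not None:
        index.setdefault(root, set()).add(position)

index = {}
index_with_root(index, "المزيد", 10)   # "more" (in an answer choice)
index_with_root(index, "زيادة", 42)    # "increase" (in the document)
print(index["زيد"])   # both positions now share the root token
```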
4.2.1.5. Numeric Expansion module
The numeric expansion module extracts numeric textual expressions, and saves their digit representation in
the inverted index. It also takes the word number tags provided by the morphological analysis module and adds the number 2 for each dual word. For example, the word "شهادتين", which means "two certificates", is saved to the inverted index as two tokens in the same location: "2" and "شهادة" / "certificate".
4.2.1.6. Ontology-based Semantic Expansion module
ALQASIM 2.0 uses the background collection documents provided with the QA4MRE corpus from CLEF
2012, which is a group of documents related to the main topic of the reading test document. The QA4MRE cor-
pus will be explained in detail in chapter three. The ontology-based semantic expansion module expands any hypernym in the document with its hyponyms extracted from the background collection ontology. A hyponym has a "type-of" relationship with its hypernym. For example, in the bigram "مرض الإيدز" ("the AIDS disease"), the word "مرض" ("disease") is the hypernym and the word "الإيدز" ("AIDS") is the hyponym, because AIDS is a type of disease. The ontology is extracted from the background documents collection using bigrams collected according to two rules:
1. Collect all hypernyms from all documents by collecting all the demonstrative pronouns (e.g. هذا, هذه, ذلك, تلك, هؤلاء, أولئك) followed by a noun that starts with the "Al" determiner. This noun is marked as a hypernym.
2. Collect any hypernym of the extracted list followed by a proper noun, and consider the proper noun the hyponym of the respective hypernym. This relation in Arabic is called the "Modaf"-"Modaf Elaih" relationship, which often denotes a hypernym-hyponym relationship, where the "Modaf" (the indeterminate noun) is the hypernym and the "Modaf Elaih" (the proper noun) is the hyponym. See figure 15.
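The two rules above can be sketched over a tagged token stream; the 'noun'/'propn' tag names and the exact determiner matching are assumptions, not the thesis's implementation:

```python
from collections import defaultdict

DEMONSTRATIVES = {"هذا", "هذه", "ذلك", "تلك", "هؤلاء", "أولئك"}

def extract_ontology(tagged):
    """Sketch of the two bigram rules over (word, pos) pairs."""
    hypernyms = set()
    # Rule 1: demonstrative + noun with the "Al" determiner -> hypernym
    for (w1, _), (w2, t2) in zip(tagged, tagged[1:]):
        if w1 in DEMONSTRATIVES and t2 == "noun" and w2.startswith("ال"):
            hypernyms.add(w2)
    ontology = defaultdict(set)
    # Rule 2: a known hypernym (with or without "Al") followed by a
    # proper noun -> that proper noun is its hyponym ("Modaf Elaih")
    for (w1, _), (w2, t2) in zip(tagged, tagged[1:]):
        if t2 == "propn" and (w1 in hypernyms or "ال" + w1 in hypernyms):
            ontology[w1].add(w2)
    return ontology

# "... هذا المرض ... مرض الإيدز ..." yields: AIDS is a kind of disease
tagged = [("هذا", "other"), ("المرض", "noun"),
          ("مرض", "noun"), ("الإيدز", "propn")]
print(dict(extract_ontology(tagged)))
```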
Figure 15. The intuition behind extracting the background collection ontology
4.2.2. Question Analysis
The main purpose of the question analysis module is to extract the question keywords, search for them in the inverted index of the test document, and return the found question snippets with their scores and sentence indexes. It starts by tokenizing, light-stemming, and PoS tagging the question text using the MADA+TOKAN morphological analysis toolkit. Stop words are removed from the question words, and the roots of the remaining keywords are then extracted using the ISRI root stemmer and added to the question keywords. Some question patterns are also identified, and a boost factor is applied to some keywords in them. For example, the pattern "ما الـ" ("What is the") is most of the time followed by a hypernym that is the focus of the question, so a boost multiplier is applied to the weight of that hypernym (focus) to mark its importance. The inverted index is then searched for the question keywords to find the scored question snippets and sentence indexes, which will be used in the answer selection module. See figure 16. Question snippets are scored according to the number and weights of the keywords found in each snippet. See equation 3.
Figure 16. Question Analysis module architecture
Score = (∑_{i=1}^{N} w_i) + (N − K)    (3)

Where:
• Score: score of a question or answer snippet
• N: number of found keywords
• w_i: weight of each found keyword
• K: number of all keywords
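Equation 3 translates to a short function: the sum of the found keywords' weights, minus one point per missing keyword. The weights in the example are illustrative:

```python
def snippet_score(found_weights, total_keywords: int) -> float:
    """Equation 3: sum of found keyword weights plus (N - K), i.e. a
    penalty of one point for each keyword not found in the snippet."""
    n_found = len(found_weights)
    return sum(found_weights) + (n_found - total_keywords)

# Two of three keywords found, with weights 100 and 40: 140 - 1 = 139.
print(snippet_score([100, 40], total_keywords=3))  # 139
```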
4.2.3. Answer Selection
In the question analysis module, the best question snippets and their sentence indexes are retrieved according to score; only the snippets with the best score are retrieved. In this module, each answer choice is analyzed exactly like the question, and the answer-choice keywords are located in the inverted index. The search for answer snippets is constrained to the sentence of the question snippet and the two sentences around it. In other words, if the answer-choice snippet is found in the question sentence or in the sentence before or after it, then it is taken into consideration for further answer selection processing. The best-scoring "found" answer choice is selected as the question answer. If there are no "found" answer choices, or the scores of the two or more highest answer choices are equal, then the system is not sure which answer choice is the most appropriate, and the question is marked as unanswered. See figure 17.
Figure 17. Answer Selection module architecture
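The selection logic above can be sketched as follows; the candidate triples (choice, sentence index, score) are hypothetical inputs standing in for the module's upstream output:

```python
def select_answer(question_sentence: int, candidates):
    """Keep only answer-choice snippets within one sentence of the
    question snippet; answer only when a unique best score exists."""
    in_window = [(c, s, sc) for c, s, sc in candidates
                 if abs(s - question_sentence) <= 1]
    if not in_window:
        return None                        # nothing found -> unanswered
    best = max(sc for _, _, sc in in_window)
    top = [c for c, _, sc in in_window if sc == best]
    return top[0] if len(top) == 1 else None   # tie -> unanswered

# Choice C scores highest overall but lies far from the question
# sentence, so the in-window choice A wins.
cands = [("A", 5, 120.0), ("B", 6, 90.0), ("C", 12, 500.0)]
print(select_answer(question_sentence=5, candidates=cands))  # A
```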
Chapter 5
Results and Analysis
This chapter explains the different experiments that were carried out on the ALQASIM 2.0 platform and their results. The results of these experiments are then analyzed to highlight our findings and show how significant they are.
5.1. Results
The speed of ALQASIM 2.0 has been tested on a PC with 8 CPUs and 6 GB of RAM, and it could answer the 160 questions in 3 seconds. However, some preprocessing has been carried out offline, such as the morphological analysis of the documents and the background collection, ISRI and Tashaphyne root extraction, and the generation of the background collection ontology.
ALQASIM 2.0 achieved a promising performance, more than double that of the best-performing Arabic QA4MRE system, created by Abouenour et al. Four different experiments have been carried out on the ALQASIM 2.0 platform. These four experiments highlight the effect of sentence splitting, AWN semantic expansion, custom background-ontology semantic expansion, and root expansion. The results also highlight the difference between the effects of the three root stemmers: (i) Khoja, (ii) ISRI, and (iii) Tashaphyne.
The first run uses the baseline approach without any type of root or semantic expansion. In other words, it
uses sentence splitting, light stemming, numeric expansion, weights according to PoS tagging, and boost multipliers for some question patterns. This run is mainly meant to highlight the effect of sentence splitting on answer selection and validation. The results of this run demonstrate how effective sentence splitting is in selecting the correct answer choice and, more importantly, in ignoring the wrong ones.
The second run applies the baseline approach mentioned above, in addition to Arabic WordNet semantic expansion. A dictionary of synonyms is generated from AWN to expand each word by adding its synonyms to the inverted index, marking these synonyms with the same location as the original word. When this run is compared with the first run, the effect of AWN semantic expansion using synonyms becomes evident.
The third run applies semantic expansion using the background collection ontology on top of the baseline approach. The stem of each word in the document is expanded with its hyponyms if it is a hypernym, following the explanation of the hypernym/hyponym relationship in section 4.2.1.6. The effect of semantic expansion from the ontology created from the background collection documents will be clear in this run. Using the baseline approach with each run sets the ground that enables us to compare the different applied techniques.
The fourth run introduces root expansion using the ISRI Arabic root stemmer (as explained in section 4.2.1.4), in addition to the baseline approach and the background collection ontology semantic expansion. Table 8 compares the results of ALQASIM 1.0 with the four runs of ALQASIM 2.0.
Table 8. Performance of ALQASIM 1.0 and the 4 runs of ALQASIM 2.0
Description Correct Unanswered Wrong Accuracy C@1
ALQASIM 1.0 49 30 81 0.31 0.36
Run (1) Baseline 46 51 63 0.29 0.38
Run (2) Baseline + AWN Semantic Expansion 45 46 69 0.28 0.36
Run (3) Baseline + Bg Ontology Expansion 51 41 68 0.32 0.40
Run (4) Baseline + Bg Ontology Expansion + Root Expansion 57 29 74 0.36 0.42
As shown in Table 8, ALQASIM 1.0 and the four runs of ALQASIM 2.0 are compared in terms of accuracy, c@1, and the number of correct, wrong, and unanswered questions. It is noticeable in Run (2) that AWN semantic expansion did not improve performance; in fact, it degraded it by 1% to 2%. It is also clear that, unlike AWN semantic expansion, background-ontology semantic expansion has a positive effect on the results of the system. The best performance is achieved in Run (4), which uses the baseline approach, the background collection ontology semantic expansion, and root expansion using the ISRI root stemmer. The background collection ontology semantic expansion improves accuracy in Run (3) by about 3%, whereas ISRI root expansion improves accuracy by about 4% in Run (4). In the next section, these results are discussed to highlight the significance and effectiveness of our findings.
In figure 18, the performance of ALQASIM 2.0 is compared with the two Arabic QA4MRE systems mentioned in the related works in chapter 2. The figure shows that ALQASIM 2.0 outperforms the other Arabic QA4MRE systems, which highlights the effectiveness of the techniques used: sentence splitting, root expansion, and background-ontology semantic expansion. The reasons behind the superior performance of ALQASIM 2.0 in comparison to the other QA4MRE systems, and the poor performance of those systems, will be discussed thoroughly in the next section.
Figure 18. Comparison between QA4MRE systems
Figure 19. Performance in terms of answered questions counts
Figure 19 illustrates the results of ALQASIM 1.0 and the four runs of ALQASIM 2.0 in terms of the counts of correctly answered, wrongly answered, and unanswered questions. From this figure, it is easy to elicit the effect of each approach on selecting the correct answer choice and on ignoring the wrong answer choices.
The pie chart in figure 20 shows the ratio between correctly answered, wrongly answered, and unanswered questions in the fourth run of ALQASIM 2.0, which is the best-performing run. The greatest number of correctly answered questions is in the fourth run, which uses the baseline approach in addition to background-ontology semantic expansion and root expansion using the ISRI root stemmer. It is also noticeable that the highest number of unanswered questions is in the first run, which uses the baseline approach only.
Figure 20. Ratio between correctly answered, wrongly answered and unanswered questions in run 4
As illustrated in Table 9, comparing the performance of the three root stemmers (Khoja, ISRI and Tashaphyne) shows that ISRI achieves the best performance in terms of accuracy and c@1. It is also noticeable that the Khoja stemmer does not improve performance, and the Tashaphyne root stemmer even degrades accuracy by 1% and c@1 by 4%. In other words, the effectiveness of ISRI in selecting the correct answer choice and ignoring the wrong ones exceeds that of the other stemmers. The reasons behind these results are discussed in detail in the next section.
Table 9. The effect of Khoja, ISRI and Tashaphyne root stemmers on ALQASIM 2.0

Root Stemmer                            Accuracy    C@1
ALQASIM 2.0 without root expansion      0.32        0.40
ISRI                                    0.36        0.42
Khoja                                   0.32        0.40
Tashaphyne                              0.31        0.36
5.2. Analysis
From the results illustrated in the previous section, it is evident that using sentence splitting as a natural boundary for answer validation has a significant effect, which appears in the results of the baseline system in Run (1). An accuracy of 0.29 and a c@1 of 0.38 demonstrate that the approach of sentence splitting and searching for the answer within proximity to the question sentence is effective in pinpointing the correct answer choice, and even more effective in ignoring wrong answer choices. This is because most of the questions are answered either in the same sentence as the question or within a distance of one sentence from it. When the behavior of ALQASIM 2.0, depicted in figure 21, is compared with the behavior of ALQASIM 1.0, illustrated in figure 12, for the same question, it is clear that ALQASIM 2.0 can identify the correct answer easily, as it uses sentence boundaries as natural boundaries that separate different units of meaning. In this question, ALQASIM 1.0 could not identify that the question snippet keywords are related to each other because they were a little too far apart, although they are all in the same sentence. ALQASIM 2.0, on the other hand, is capable of marking the question snippet, which helps the system identify the correct answer.
Figure 21. An example showing the effect of sentence splitting in ALQASIM 2.0
On the other hand, AWN semantic expansion does not improve performance; in fact, performance is degraded by about 1% to 2%. This is due to the generic nature of AWN, which does not provide a specific context for any of the topics of the test-set. Thus, question and answer-choice keywords that may differ in meaning are expanded to become semantically similar, which makes it harder for the system to pinpoint the correct answer among all the false positives introduced by the semantic expansion process.
Figure 22. An example of the effect of ontology expansion in ALQASIM 2.0
AWN also expands only well-known words like "سياسة" / "policy or politics" and "نظام" / "system or arrangement", without any sense disambiguation according to context, and fails to expand context-specific words like "الإيدز" / "AIDS" and "الزهايمر" / "Alzheimer". On the other hand, the automatically generated background collection ontology improves performance by about 3% and proves to be more effective in the semantic expansion of document-specific terms. For example, the ontology states that "Alzheimer" is a "disease", and that "AIDS" is a "disease", an "epidemic" and a "virus".
Building an ontology from the background collection to define the hypernym/hyponym semantic relationship of document-related terms thus proved efficient. The approach of automatically generating the ontology from the background collection, as explained above, is most of the time capable of identifying the right hypernym/hyponym pair. This semantic relationship helps to disambiguate many questions that ask about the hyponym and have the hypernym among the question keywords. It increases the overlap between the question and answer terms, and makes it possible to boost the weight of the hypernym in some question patterns to imply its importance. For example, figure 22 shows a question snippet, marked in green, for the question "What is the biggest obstacle against economic development in the black continent?"; the answer snippet, marked in yellow, is a sentence that contains the words "this epidemic", which denotes "AIDS". Thus, ALQASIM 2.0 is able to identify that the hypernym "epidemic" in this context stands for the hyponym "AIDS" and expands "epidemic" with "AIDS" in the inverted index.
Figure 23. An example showing the effect of root expansion in ALQASIM 2.0
The last run shows the effect of root expansion on answer selection and validation. An improvement in performance of about 4% was achieved by using the ISRI root stemmer. This run made it clear that root expansion makes it easier to pinpoint the correct answer, especially when light stemming is not enough. This happens when the question or answer and the document have keywords of the same root but in different derivational forms. For example, the question may ask about a noun, while the question snippet in the document contains a verb with a different derivation. This is evident in figure 23, which shows that the answer choice has the word "المزيد" / "more" and the answer sentence has the word "زيادة" / "increase", which enables ALQASIM 2.0 to align the answer choice with its sentence in the document because these two words have the same root "زيد".
Table 10 illustrates how over-stemming by the Tashaphyne stemmer actually degrades performance, as it gives the same root to terms that should be semantically different. On the other hand, table 11 explains why the Khoja stemmer does not affect performance: it is not capable of stemming new words that are not found in its dictionary. Thus, in most cases, it returns either the original word or its light stem, which means no expansion is being carried out, given that the light stem (provided by MADA+TOKAN) and the original word are already saved in the inverted index.
Table 10: Sample output of the Tashaphyne root stemmer

Tashaphyne Root    Original Words                       English Translation
ء                  "ماء"، "سوءا"، "سواء"، "وباء"        "water", "worse", "alike", "epidemic"
آا                 "آلت"، "آفات"                        "machinery", "pests"
أ                  "أي"، "أتى"                          "any", "came"
Table 11: Comparison between the performance of Khoja and ISRI root stemmers

Stemmer Output                                              Khoja             ISRI
Possible root (a token not equal to the light stem
or the original word)                                       36 (0.23%)        11324 (74.68%)
Light stem (already generated by MADA)                      3647 (24.05%)     3204 (21.13%)
Original word                                               11480 (75.71%)    635 (4.18%)
Chapter 6
Conclusion and Future Work
In this chapter, a summary of the whole dissertation is provided. The research is then evaluated and the future
work is highlighted.
6.1. Summary
The dissertation tackled the Arabic answer selection and validation problem. Answer selection and validation
is very important for question answering, as it enhances the quality of QA systems. QA systems generally generate
answer choices for every question, and the main task of answer selection and validation is to select the best
answer choice, if any, and to mark the question as unanswered when there is no answer or the system is not certain
of the best answer. Question Answering for Machine Reading Evaluation (QA4MRE) is an organized approach to
creating and testing answer selection and validation systems. It works by selecting the best answer choice among
a list of provided choices and supports the selection with one document. This is exactly like reading comprehension,
where a document is provided to a human together with a group of questions, each with answer choices to choose
from. Thus, the answer selection and validation task can be compared and evaluated with QA4MRE, as it provides
the documents, questions and answer choices that are the traditional output of any QA generative process (question
analysis and passage retrieval).
In this dissertation, we introduced ALQASIM, a QA4MRE system that mimics human behaviour by analyzing the
reading test document and selecting the best answer choice for every question using sentence splitting,
background-ontology semantic expansion and root expansion. The Arabic QA4MRE test-set used by ALQASIM was
introduced at CLEF 2012, and two Arabic systems attempted this task on the same test-set: IDRAAQ and the system
created by Trigui et al. In this dissertation, we showed that the performance of ALQASIM is about double that of
the best performing Arabic system, IDRAAQ. This performance increase is mainly due to the approach of analyzing
the reading test documents instead of following the ordinary QA pipeline used in the other two systems.
ALQASIM 1.0 was introduced at CLEF 2013, and it mainly used the notion of keyword weights and distances
to pinpoint the question and answer choices in the reading test documents. In the next version, ALQASIM 2.0,
three improved techniques were used that raised performance in a sounder way: sentence splitting, background-ontology
semantic expansion and root expansion. The intuition behind sentence splitting is that the words of a sentence are
connected to each other to form a stream of meaning; it is therefore better to connect the words of a single
sentence together and consider them related, instead of relying on the notion of distance, which may not be
accurate. Using background-ontology semantic expansion also proved effective because it expands the document words
with words from the same domain, and this is much better than using a general ontology like Arabic WordNet, as it
provides context that gives a better understanding of the document and the questions. Last but not least, root
expansion provided a significant improvement in performance because Arabic is a morphologically rich, inflectional
and derivational language, and its derivational nature can be challenging in information retrieval tasks, as a
single word can appear in different derivational forms in the question, the answer or the document.
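The sentence-splitting intuition can be sketched as scoring whole sentences by keyword overlap instead of sliding a distance window over the document; the splitter below is a naive placeholder for a real sentence boundary detector:

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive split on Arabic/Latin sentence punctuation; a real system
    # would use a proper sentence boundary detector.
    return [s.strip() for s in re.split(r"[.!؟?]+", text) if s.strip()]

def best_sentence(document: str, keywords: set[str]) -> str:
    # Words inside one sentence are treated as related; the overlap is
    # counted per sentence and never crosses a sentence boundary.
    def overlap(sentence: str) -> int:
        return len(keywords & set(sentence.split()))
    return max(split_sentences(document), key=overlap)

doc = "الطقس جميل اليوم. المرض ينتشر بسرعة في المدينة."
print(best_sentence(doc, {"المرض", "ينتشر"}))  # the second sentence wins
```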
6.2. Evaluation
The results of ALQASIM 1.0 illustrate the effect of keyword count, weight and distance on answer selection
and validation. ALQASIM 2.0 improves on the notion of keyword distance by constraining the search for question
and answer snippets within sentence boundaries, which makes use of natural sentence boundaries and their relation
to meaning.
ALQASIM 2.0 also makes use of root expansion to overcome derivationally challenging snippet search, for
example when the system needs to locate the question or answer snippets in the reading test document and some of
the question keywords are in a different derivational form from their corresponding snippets in the test document.
An ontology created from the CLEF background document collection helped ALQASIM 2.0 to reach better
performance, as it augments some document hypernyms with their hyponyms; this enables the system to answer
questions that ask about a hyponym when only the hypernym is mentioned.
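This expansion can be sketched as follows; the mapping shown is a toy stand-in for the ontology mined from the CLEF background collection (compare the frequency table in Appendix A.5), and the tokens are assumed to be light stems:

```python
# Toy hypernym -> hyponyms mapping; the real mapping is generated
# automatically from the CLEF background collection.
HYPONYMS = {
    "مرض": ["أيدز", "ألزهايمر"],   # disease -> AIDS, Alzheimer's
    "فيروس": ["أيدز"],             # virus   -> AIDS
}

def expand(tokens: list[str]) -> set[str]:
    """Augment document tokens: every known hypernym also contributes
    its hyponyms, so a question about the hyponym still matches a
    document sentence that only mentions the hypernym."""
    expanded = set(tokens)
    for tok in tokens:
        expanded.update(HYPONYMS.get(tok, []))
    return expanded

print("أيدز" in expand(["انتشر", "مرض"]))  # the hyponym now matches
```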
These techniques make the performance of ALQASIM better than its best performing predecessor on the same
test-set by about 21% to 23%. This is very promising and encourages other works to take these techniques as a
baseline and build on top of them to reach better performance. Better answer selection and validation systems
will eventually yield smarter QA systems that can serve the demanding market for fast and precise information.
In the next section, the future work that can be carried out to improve the performance of ALQASIM is highlighted.
6.3. Future Work
This section points out some opportunities for future work on ALQASIM. These opportunities include applying
rule-based techniques that depend on Arabic-specific syntactic and semantic rules, and making use of textual
entailment. They also include using other natural language processing tools such as semantic parsing, semantic
role labeling, and anaphora resolution.
6.3.1. Rule-based techniques
Arabic-specific rules are very rich and convey not only syntactic features but also semantic ones. Depending
on the derivational pattern applied to an Arabic three- or four-letter root, a different meaning is conveyed, and
these semantic differences can be inferred by native Arabic speakers from the pattern and the meaning of the root
alone. For example, applying the pattern "فاعول" to a three-letter root gives the meaning of an instrument that
performs the action of the root, while applying the pattern "استفعال" to a three-letter root gives the meaning of
requesting the action of the root. In our future research, we will focus on applying rule-based techniques that
depend on Arabic-specific syntactic and semantic rules, which have proved effective in other natural language
processing tasks such as Arabic sentiment analysis (Oraby et al., 2013).
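As an illustration, applying a pattern template to a root can be sketched by substituting the root letters for the placeholder letters ف، ع، ل of the pattern; this is a simplification that ignores the phonological adjustments real patterns sometimes require:

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Substitute a three-letter root into an Arabic pattern whose
    placeholder letters are ف (1st radical), ع (2nd) and ل (3rd).
    Substitution is simultaneous so roots containing ف/ع/ل are safe."""
    assert len(root) == 3
    table = {"ف": root[0], "ع": root[1], "ل": root[2]}
    return "".join(table.get(ch, ch) for ch in pattern)

# "فاعول" conveys an instrument performing the root's action:
print(apply_pattern("حسب", "فاعول"))    # root "to compute" -> "حاسوب" (computer)
# "استفعال" conveys requesting the action of the root:
print(apply_pattern("علم", "استفعال"))  # root "to know" -> "استعلام" (inquiry)
```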
6.3.2. Anaphora Resolution
Good writing style in any language requires authors to refrain from repeating a word and to use anaphora
instead. Thus, most Arabic documents contain many anaphora, which increase the ambiguity of Arabic sentences.
Applying anaphora resolution in Arabic answer selection and validation should increase the overlap between the
question or answer keywords and the document keywords, which would improve performance and make it easier to
pinpoint the right question and answer snippets.
6.3.3. Semantic Parsing and Semantic Role Labeling
The more the computer understands human language, the better it can answer the questions posed to it.
Semantic parsers make it easier to add a semantic dimension to the task of answer selection and validation, as
they help the machine draw semantic graphs for the document, question and answer text. They can also help match
the semantic graph of a question or answer choice against partial graphs in the document, which will help to
answer more complex questions. Semantic role labeling can also be used to further constrain the area to search
for answers, making it easier and more accurate to align the answer choices with the reading test document
snippets.
6.3.4. More Automatically Generated Ontologies
Given the positive effect observed when using an automatically generated ontology, we will also pay more
attention to automating the generation of domain-specific ontologies and knowledge bases, which may help to enrich
the semantic aspect of language understanding as a whole and of question answering specifically. Adding more
relations to the ontology and improving the accuracy of these relations should improve the performance of any
answer selection and validation system, as it will provide better organized domain knowledge.
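A first approximation of that automatic generation, in the spirit of the pair frequencies listed in Appendix A.5, is to count adjacent token pairs and keep the frequent ones as hypernym/hyponym candidates; a real pipeline would add POS filtering and stemming, which are omitted here:

```python
from collections import Counter

def candidate_pairs(corpus: list[list[str]], min_freq: int = 2):
    """Count adjacent token pairs across a tokenized corpus; pairs that
    recur at least min_freq times are kept as candidate
    (hypernym, hyponym) relations, like the مرض/أيدز pairs in A.5."""
    counts = Counter()
    for sentence in corpus:
        counts.update(zip(sentence, sentence[1:]))
    return [(pair, n) for pair, n in counts.most_common() if n >= min_freq]

corpus = [
    ["مرض", "أيدز", "ينتشر"],      # "the AIDS disease spreads"
    ["انتشار", "مرض", "أيدز"],     # "the spread of the AIDS disease"
    ["فيروس", "أيدز", "خطير"],     # "the AIDS virus is dangerous"
]
print(candidate_pairs(corpus))     # only the recurring pair survives
```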
References
(Abdelbaki & Shaheen, 2011) Abdelbaki, H., Shaheen, M., & Badawy, O. (2011, October). ARQA High-Performance
(Abouenour et al., 2009) Abouenour, L., Bouzoubaa, K., & Rosso, P. (2009, May). Three-level Approach for Pas-
sage Retrieval in Arabic Question/Answering Systems. In Proc. Of the 3rd International Conference on Arabic
Language Processing CITALA2009, Rabat, Morocco.
(Abouenour et al., 2010) Abouenour, L., Bouzouba, K., & Rosso, P. (2010). An Evaluated Semantic Query Ex-
pansion and Structure-Based Approach for Enhancing Arabic Question/Answering.
(Abouenour et al., 2011) Abouenour, L. (2011). On the Improvement of Passage Retrieval in Arabic Question/An-
swering (Q/A) Systems. Natural Language Processing and Information Systems, 336-341.
(Abouenour et al., 2012) Abouenour, L., Bouzoubaa, K., & Rosso, P. (2012, September). IDRAAQ: New Arabic
Question Answering System Based on Query Expansion and Passage Retrieval. In CLEF 2012 Workshop on
Question Answering For Machine Reading Evaluation (QA4MRE).
(Abuleil & Evens, 1998) Abuleil, S., and Evens, M., (1998). Discovering Lexical Information by Tagging Arabic
Newspaper Text. Workshop on Semantic Language Processing. COLING-ACL '98, University of Montreal, Mon-
treal, PQ, Canada, Aug. 16 1998, pp. 1--7.
(Awadallah & Rauber, 2006) Awadallah, R., & Rauber, A. (2006). Web-based multiple choice question answer-
ing for English and Arabic questions. Advances in Information Retrieval, 515-518.
(Bekhti et al., 2011) Bekhti, S., Rehman, A., Al-Harbi, M., & Saba, T. (2011). AQuASys an Arabic Question-An-
swering System Based on Extensive Question Analysis and Answer Relevance Scoring. Inf Comput Int J Acad
Res, 3(4), 45-54.
(Benajiba et al., 2007) Benajiba, Y., Rosso, P., & Gómez Soriano, J. (2007). Adapting the JIRS Passage Retrieval
System to the Arabic Language. Computational Linguistics and Intelligent Text Processing, 530-541.
(Benajiba et al., 2007) Benajiba, Y., Rosso, P., & Lyhyaoui, A. (2007, April). Implementation of the ArabiQA
Question Answering System's components. In Proc. Workshop on Arabic Natural Language Processing, 2nd In-
formation Communication Technologies Int. Symposium, ICTIS-2007, Fez, Morroco, April (pp. 3-5).
(Benajiba & Rosso, 2007) Benajiba, Y., & Rosso, P. (2007). Arabic Question Answering. Diploma of advanced
studies. Technical University of Valencia, Spain.
(Bhaskar et al., 2012) Bhaskar, P., Pakray, P., Banerjee, S., Banerjee, S., Bandyopadhyay, S., & Gelbukh, A.
(2012, September). Question Answering System for QA4MRE@CLEF 2012. In CLEF 2012 Workshop on Ques-
tion Answering For Machine Reading Evaluation (QA4MRE).
(Brini et al., 2009) Brini, W., Ellouze, M., Mesfar, S., & Belguith, L. H. (2009, September). An Arabic Question-
Answering System for Factoid Questions. In Natural Language Processing and Knowledge Engineering, 2009.
NLP-KE 2009. International Conference on (pp. 1-7). IEEE.
(Brini et al., 2009) Brini, W., Ellouze, M., Trigui, O., Mesfar, S., Belguith, H. L., & Rosso, P. Factoid and Defini-
tional Arabic Question Answering System. Post-Proc. NOOJ-2009, Tozeur, Tunisia, June, 8-10. (2009).
(Elkateb et al., 2006) Elkateb, S., Black, W., Vossen, P., Farwell, D., Rodríguez, H., Pease, A., & Alkhalifa, M.
(2006, October). Arabic WordNet and the Challenges of Arabic. In Proceedings of Arabic NLP/MT Conference,
London, UK.
(Ezzeldin et al., 2013) Ezzeldin, A. M., Kholief, M. H., & El-Sonbaty, Y. (2013). ALQASIM: Arabic language
question answer selection in machines. In Information Access Evaluation. Multilinguality, Multimodality, and
Visualization (pp. 100-103). Springer Berlin Heidelberg.
(Ezzeldin & Shaheen, 2012) Ezzeldin, A. M., & Shaheen, M. (2012, December). A Survey Of Arabic Question
Answering: Challenges, Tasks, Approaches, Tools, And Future Trends. In Proceedings of the 13th International
Arab Conference on Information Technology ACIT'2012. ISSN: 1812-0857.
(Habash et al., 2009) Habash, N., Rambow, O., & Roth, R. (2009). MADA+TOKAN: A Toolkit for Arabic Tok-
enization, Diacritization, Morphological Disambiguation, PoS Tagging, Stemming and Lemmatization. In Pro-
ceedings of the 2nd International Conference on Arabic Language Resources and Tools (MEDAR), Cairo, Egypt
(pp. 102-109).
(Hammo et al., 2002) Hammo, B., Abu-Salem, H., & Lytinen, S. (2002, July). QARAB: a Question Answering
System to Support the Arabic Language. In Proceedings of the ACL-02 workshop on Computational approaches
to semitic languages (pp. 1-11). Association for Computational Linguistics.
(Hammo et al., 2004) Hammo, B., Abuleil, S., Lytinen, S., & Evens, M. (2004). Experimenting with a Question
Answering System for the Arabic Language. Computers and the Humanities, 38(4), 397-415.
(Kanaan et al., 2009) Kanaan, G., Hammouri, A., Al-Shalabi, R., & Swalha, M. (2009). A New Question An-
swering System for the Arabic Language. American Journal of Applied Sciences, 6(4), 797-805.
(Khoja & Garside, 1999) Khoja, S., & Garside, R. (1999). Stemming Arabic Text. Lancaster, UK, Computing
Department, Lancaster University.
(Ko et al., 2007) Ko, J., Si, L., & Nyberg, E. (2007). A Probabilistic Framework for Answer Selection in Ques-
tion Answering. In HLT-NAACL (pp. 524-531).
(Larkey & Connell, 2006) Larkey, L. S., & Connell, M. E. (2006). Arabic Information Retrieval at UMass in
TREC-10. Massachusetts Univ Amherst Center for Intelligent Information Retrieval.
(Laurent et al., 2006) Laurent, D., Séguéla, P., & Nègre, S. (2006, April). QA Better than IR?. In Proceedings of
the Workshop on Multilingual Question Answering (pp. 1-8). Association for Computational Linguistics.
(Manning et al., 2008) Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval
(Vol. 1). Cambridge: Cambridge University Press.
(Mendes & Coheur, 2011) Mendes, A. C., & Coheur, L. (2011, June). An Approach to Answer Selection in Ques-
tion-Answering Based on Semantic Relations. In IJCAI (pp. 1852-1857).
(Mohammed et al., 1993) Mohammed, F. A., Nasser, K., & Harb, H. M. (1993). A Knowledge Based Arabic
Question Answering System (AQAS). ACM SIGART Bulletin, 4(4), 21-30.
(Moldovan et al., 2007) Moldovan, D., Clark, C., & Bowden, M. (2007, November). Lymba’s PowerAnswer 4 in
TREC 2007. In Proceedings of the Sixteenth Text REtrieval Conference (TREC 2007). Gaithersburg, MD.
(Oraby et al., 2012) Oraby, S. M., El-Sonbaty, Y., & El-Nasr, M. A. (2012). Exploring the Effects of Word Roots
for Arabic Sentiment Analysis.
(Oraby et al., 2013) Oraby, S., El-Sonbaty, Y., & El-Nasr, M. A. (2013). Finding Opinion Strength Using Rule-
Based Parsing for Arabic Sentiment Analysis. In Advances in Soft Computing and Its Applications (pp. 509-
520). Springer Berlin Heidelberg.
(Penas et al., 2011) Penas, A., Hovy, E. H., Forner, P., Rodrigo, Á., Sutcliffe, R. F., Forascu, C., & Sporleder, C.
(2011). Overview of QA4MRE at CLEF 2011: Question Answering for Machine Reading Evaluation. In CLEF
(Notebook Papers/Labs/Workshop).
(Penas et al., 2011) Penas, A., Rodrigo, A., & del Rosal, J. (2011, June). A simple Measure to Assess Non-re-
sponse. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human
Language Technologies (Vol. 1, pp. 1415-1424).
(Penas et al., 2012) Penas, A., Hovy, E., Forner, P., Rodrigo, A., Sutcliffe, R., Sporleder, C., Forascu, C., Bena-
jiba, Y., & Osenova, P. (2012, September). Overview of QA4MRE at CLEF 2012: Question Answering for Ma-
chine Reading Evaluation. In CLEF 2012 Workshop on Question Answering For Machine Reading Evaluation
(QA4MRE).
(Rosso et al., 2005) Rosso, P., Lyhyaoui, A., Peñarrubia, J., y Gómez, M. M., Benajiba, Y., & Raissouni, N.
(2005, June). Arabic-English Question Answering. In Proc. Symposium on Information Communication Tech-
nologies Int., Tetuan, Morocco.
(Rosso et al., 2006) Rosso, P., Benajiba, Y., & Lyhyaoui, A. (2006, December). Towards an Arabic Question An-
swering System. In Proc. 4th Conf. on Scientific Research Outlook & Technology Development in the Arab
world, SROIV, Damascus, Syria (pp. 11-14).
(Smucker et al., 2008) Smucker, M. D., Allan, J., & Dachev, B. (2008). Human Question Answering Performance
Using an Interactive Information Retrieval System. Center for Intelligent Information Retrieval Technical Report
IR-655, University of Massachusetts.
(Taghva et al., 2005) Taghva, K., Elkhoury, R., & Coombs, J. (2005, April). Arabic Stemming without a Root
Dictionary. In Information Technology: Coding and Computing, 2005. ITCC 2005. International Conference on
(Vol. 1, pp. 152-157). IEEE.
(Trigui et al., 2010) Trigui, O., Belguith, H. L., & Rosso, P. (2010). DefArabicQA: Arabic Definition Question
Answering System. In Workshop on Language Resources and Human Language Technologies for Semitic Lan-
guages, 7th LREC, Valletta, Malta (pp. 40-45).
(Trigui et al., 2012) Trigui, O., Belguith, L. H., Rosso, P., Amor, H. B., Gafsaoui, B. (2012, September). Arabic
QA4MRE at CLEF 2012: Arabic Question Answering for Machine Reading Evaluation. In CLEF 2012 Work-
shop on Question Answering For Machine Reading Evaluation (QA4MRE).
Appendix A
A.1. ALQASIM platform main screen showing results
A.2. Screen showing a correctly answered question with the question answer colored
A.3. Screen showing a wrongly answered question with the gold answer to identify the reason for failure
A.4. Schema of the input questions after applying morphological analysis
A.5. Sample hypernym / hyponym pairs automatically generated from the CLEF background collection of the
AIDS topic
Frequency Possible Hypernym Possible Hyponym
3322 مرض أيدز
2127 فيروس أيدز
1168 مريض أيدز
426 وباء أيدز
361 جمهوري صين
351 شبكة إنترنت
338 منطقة شرق
317 قطاع غزة
275 جمهوري كونغو
260 إذن الله
256 ضد أيدز
208 دولة رياض
203 دولة كويت
194 شكل أكبر
190 مرض ألزهايمر
187 شمال أفريقيا
174 برنامج أمم
166 قدر أكبر
165 جائزة نوبل
157 علاج أيدز
154 انتشار أيدز
153 إقليم شرق
152 تنظيم قاعدة
151 حرب خليج
147 يوم اثنين
146 رسول الله
142 يوم جمعة
140 سبب أيدز
133 عدد أكبر
133 حاكم شارقة
Summary (Arabic abstract, translated)

Most question answering systems, whether for Arabic or other languages, suffer from the accumulation of errors
across their pipeline stages, which prevents them from exceeding a certain level of performance. A method for
validating correct answers is therefore needed, so that more than one system can be combined, their answers
compared, and the most correct answer confirmed. Hence the importance of this research, which addresses the
selection and validation of answers to Arabic questions.

An experimental research methodology was adopted, and the experiments were run on a test set prepared
specifically for this purpose by the CLEF evaluation campaign (Conference and Labs of the Evaluation Forum),
consisting of a set of reading passages with multiple-choice questions. This makes our work comparable with
similar research, since the test set was built systematically to represent most common question types. The
performance of the systems we built was measured with the metrics used by the same campaign: accuracy, and a
newer measure called C@1 that gives a better score to systems that decline to answer questions they are not
certain about.

The Arabic morphological analyzer MADA+TOKAN was used to extract the stem, number and gender of each word,
which helps locate the question and the answer more effectively. The Arabic WordNet ontology was used to extract
synonyms, in addition to three tools for extracting the roots of Arabic words, Tashaphyne, ISRI and Khoja, all of
which were tried and their results compared.

Chapter 1 is an introduction that explains the background of the research topic: it defines the terminology
used, covers the challenges facing natural language processing systems in general and Arabic question answering
in particular, describes the general architecture of a question answering system and its importance to users, and
introduces the answer selection research point, its origins and its importance to question answering, together
with the research objectives, contributions and an overview of the dissertation. Chapter 2 surveys the most
important works on Arabic question answering, organized by question answering component, along with the best
works in other languages and the works that previously addressed Arabic answer selection. Chapter 3 lists the
linguistic tools and resources used in the research, including morphological and syntactic analyzers and lexicons,
and also covers the metrics we used, accuracy and C@1, and the test set on which the system was evaluated.
Chapter 4 explains the initial system published in our first paper, its components and its results compared with
previous works, then describes the components of the new system after the modifications we made, with a detailed
explanation of how this system and its parts work. Chapter 5 describes the experiments we carried out, their
results, and a discussion of the reasons for success or failure in those experiments. The final chapter is the
conclusion, which summarizes what the dissertation covered, evaluates it, and presents what can be done in the
future to continue this work.

This research reached an accuracy of 36% and a C@1 of 42%, about double what comparable works in Arabic
answer selection have achieved. Through this research we identified the best morphological analyzers for question
answering, established the importance of extracting ontologies directly from the test collection and their
superiority over general lexicons unrelated to the tested questions, and showed the importance of splitting the
tested passages into sentences, searching within them for the question and the answer, and searching for answers
close to the question sentence.
Arabic Title Page (translated)

Arab Academy for Science, Technology and Maritime Transport
College of Computing and Information Technology
Department of Computer Science

Answer Selection and Validation for Arabic Questions

By
Ahmed Magdy Mohamed Ezzeldin
Egypt

A thesis submitted to the Arab Academy for Science, Technology and Maritime Transport in partial fulfillment of
the requirements for the degree of Master of Computer Science

Supervised by
Dr. Mohamed Hamed Kholief
College of Computing and Information Technology
Arab Academy for Science, Technology and Maritime Transport

Prof. Dr. Yasser El-Sonbaty
College of Computing and Information Technology
Arab Academy for Science, Technology and Maritime Transport

2014