Arabic Information Retrieval Literature Review

Omar Kashef

The University of North Carolina at Chapel Hill

Contents

Introduction
Guiding Articles
Characteristics of Arabic
Diacritic Marks
Broken Plurals
Dialects
Transliteration and Arabizi
Stemming
Morphological Approach
Light Stemmers
N-Gram Stemmers
Hybrid Stemmer Models
Stemming based on Latent Semantic Analysis Model
Stopwords
Evaluation
Conclusion
Bibliography

Introduction

Arabic is spoken by over 200 million people in more than twenty countries spanning North Africa and Southwest Asia.1 This makes Arabic the fifth most common language in the world, which underscores the importance of developing robust Arabic information retrieval (IR) systems. On the internet, there are sixty-five million Arabic users, making Arabic the seventh most common internet language.2 Arabic IR is particularly challenging because of diacritical marks (vowels), broken (irregular) plurals, and varying dialects. The many dialects of this Semitic language have large implications for the field of Arabic IR. Additionally, vowels can be omitted when writing or typing Arabic; however, omitted diacritical marks, or phonetic stresses, can change the meaning of words if the reader is not aware of the context in which a word is used.3

Arabic words are typically derived through a robust system of Arabic roots. According to Sakhr Software Company, there are over 10,000 potential roots, but far fewer are used regularly.4 This root system also has large implications for Arabic IR and may present information retrieval challenges. This literature review addresses two main issues that are consistently raised in the field of Arabic IR: stemming and stopwords. There are still no standardized methods of stemming or stopword elimination, which highlights the infancy of the field of Arabic information retrieval.

1 A Guide to Arabic – 10 facts about the Arabic language. BBC.

2 Top 10 Languages Used on the Internet. Accredited Language

3 A Guide to Arabic – The Arabic alphabet. BBC.

4 About Sakhr. Sakhr Software – Arabic language technology.

Both of these issues raise further questions regarding varying Arabic dialects, the proliferation of transliteration, and diacritical marks. I first examine Arabic’s unique characteristics and then discuss stemming and stopwords. I then explain evaluation techniques and finish with the needs of Arabic IR and projections for future work by information retrieval and computer scientists.

Guiding Articles

The articles I analyzed for this survey of Arabic information retrieval come from various peer-reviewed sources. The selected authors and papers are consistently cited by one another, which adds to their credibility. The selected articles focus on the following areas:

surveys of Arabic IR (Abu El-Khair, 2007; Darwish & Magdy, 2014)

stemming techniques (Froud, Lachkar, & Ouatik, 2013; Kreaa, Ahmad, & Kabalan, 2014;

Larkey, Ballesteros, & Connell, 2004)

microblog retrieval (Darwish, Magdy, & Mourad, 2012)

diacritical marks in Arabic IR (Aloufi, 2011)

Arabic morphology analysis (Abu-Errub, Odeh, Shambour, & Hassan, 2014)

Together, these papers allow for a thorough examination of Arabic information retrieval. The selected articles range from 2002 to 2014; thus, the information surveyed is up to date. The only articles written before 2010 are one survey article and a foundational paper outlining a then-new light stemming process. To my knowledge, there are currently no peer-reviewed articles published in 2015 that specifically address stemming and stopword elimination techniques for Arabic text.

Characteristics of Arabic

The Arabic language is distinct from English in many ways, meaning that English IR practices are not directly transferable to Arabic IR. This section introduces characteristics of Arabic that present challenges in the stemming and stopword elimination processes. The first major distinctions are that Arabic is a cursive script and is written from right to left. There are twenty-eight letters, eight forms of the diacritic hamza, and another eight diacritic marks. The twenty-eight letters take different forms based on each letter’s position in a word. Hamza functions as both a letter and a diacritic mark; it is mainly used as a glottal stop in the pronunciation of a long vowel. Nouns also have gender in Arabic.

A word in Arabic can be extended quickly with prefixes and suffixes, which are commonplace in the language. Arabic also differs from English in its use of kashidas, or physically elongated letters. Kashidas are often treated like white spaces in the retrieval process (Darwish & Magdy, Arabic information retrieval, 2014). Despite these complications, Arabic is a phonetic language and most words can be broken down into prefixes, roots, and suffixes. Roots are between two and five characters long, and about 5,000 are currently in modern use (Abu El-Khair, 2007). In addition to roots, there are Arabic word patterns, twelve in total, that change the meaning of Arabic roots. These patterns are consistent for most Arabic words and are considered in the Arabic IR process.

Diacritic Marks

Diacritic marks are simply vowels in Arabic used to inflect letters in words. These marks can change the pronunciation and even the meaning of a word. Thus, diacritic marks play a large role in the stemming process for Arabic IR. Information retrieval and computer scientists must consider whether diacritic marks will be removed in the stemming process. What complicates matters further is that diacritic marks are usually not used. Arabic used in news articles, forums, and social media does not incorporate diacritical marks. These marks can usually be found in Arabic laws, cases, academic papers, and religious texts and commentary (Darwish & Magdy, Arabic information retrieval, 2014).
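The decision to remove diacritic marks can be prototyped in a few lines. The sketch below is only one possible normalization step and is not taken from the surveyed papers. It assumes that the marks of interest are the Unicode combining marks U+064B through U+0652 plus the superscript alef (U+0670), and it also drops the kashida (tatweel, U+0640). The name `strip_marks` is hypothetical.

```python
import re

# Assumed mark set: fathatan..sukun (U+064B-U+0652) and superscript alef (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
KASHIDA = "\u0640"  # tatweel, the elongation character


def strip_marks(text: str) -> str:
    """Return text with diacritic marks and kashidas removed."""
    return DIACRITICS.sub("", text).replace(KASHIDA, "")


# "kataba" written with three fatha marks, and the same letters unvowelled.
vowelled = "\u0643\u064E\u062A\u064E\u0628\u064E"
plain = "\u0643\u062A\u0628"
print(strip_marks(vowelled) == plain)
```

Applying the same function to both documents and queries keeps vowelled and unvowelled spellings comparable, at the cost of merging words that differ only in their vowels.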

Broken Plurals

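Another key aspect of the Arabic language is its irregular, or broken, plurals. While plurals in English are relatively straightforward, usually requiring just an additional suffix, Arabic plurals are more likely to deviate from the typical Arabic plural word structure. For example, the plural of kitab (book) is kutub, which changes the vowels inside the word instead of adding a suffix.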

Dialects

Arabic has millions of speakers across North Africa and Southwest Asia, but there are at least six distinct dialects that present additional challenges for Arabic IR. The six dialects are Egyptian, 85 million speakers; Maghrebi, 75 million; Levantine, 35 million; Iraqi, 25 million; Gulf, 25 million; and Yemeni, 20 million (Darwish & Magdy, Arabic information retrieval, 2014). Dialectal words are morphologically different from words in modern standard Arabic (MSA), meaning that stemming and stopword elimination techniques are hindered because dialectal words do not conform to the morphological standards of MSA. Modern standard Arabic is typically used for news and more formal purposes, unlike the various dialects, which are restricted to informal online content and speech.

Transliteration and Arabizi

Another key feature of Arabic information retrieval is the proliferation of transliterated web content. Much Arabic content on the internet is written in a transliterated form of Arabic commonly known as Arabizi, an encoded, dialectal, and phonetic form of Arabic text found only online. Arabic users type on an English keyboard, using additional characters and numbers as letters for sounds unfamiliar to English users. While retrieving Arabizi results is not particularly challenging, the problems lie in standardization and the incorporation of additional languages. Because Arabizi varies with dialect, it lacks standardization, and its vocabulary is ever expanding as Arabic users commonly incorporate English and French words (Darwish & Magdy, Arabic information retrieval, 2014). This transliterated and slang-infused type of Arabic presents additional challenges in the stemming and stopword elimination processes.
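To make the idea of numbers standing in for letters concrete, the following sketch flags tokens that look like Arabizi. The digit-to-letter table is an assumption based on commonly cited conventions and is not drawn from the surveyed articles; actual usage varies by region and writer, which is exactly the standardization problem described above. The names `ARABIZI_DIGITS`, `looks_like_arabizi`, and `digit_letters` are hypothetical.

```python
# Hypothetical digit-to-letter table; conventions vary by region and writer.
ARABIZI_DIGITS = {
    "2": "\u0621",  # hamza
    "3": "\u0639",  # ain
    "5": "\u062E",  # kha
    "6": "\u0637",  # emphatic ta
    "7": "\u062D",  # emphatic ha
    "9": "\u0635",  # sad
}


def looks_like_arabizi(token: str) -> bool:
    """Heuristic: a Latin-script token that also uses a digit as a letter."""
    has_latin = any(ch.isascii() and ch.isalpha() for ch in token)
    has_digit_letter = any(ch in ARABIZI_DIGITS for ch in token)
    return has_latin and has_digit_letter


def digit_letters(token: str) -> list:
    """List the Arabic letters that the digits in a token may stand for."""
    return [ARABIZI_DIGITS[ch] for ch in token if ch in ARABIZI_DIGITS]


print(looks_like_arabizi("7abibi"), looks_like_arabizi("2015"))
```

A heuristic like this only detects candidate tokens; it does not resolve the many competing spellings of the same word.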

Stemming

Stemming is a text process by which different variations of a word are conflated. The stemming process incorporates both inflections and derivations of words from a common stem. Stemming has provided noticeable improvements in retrieval results in general, and it is critical to improving search results for Arabic content. The two main types of stemmers are algorithmic and dictionary-based.5 Both of these techniques have been tested in Arabic text processing. This section will differentiate between morphological stemmers, light stemmers, n-gram stemmers, hybrid stemmers that use WordNet processing, and a model based on latent semantic analysis. Most proposed or tested stemmers combine various approaches.

5 Croft, W. B., Metzler, D., & Strohman, T. (2010). Search Engines: Information Retrieval in Practice. Upper Saddle River: Pearson Education, Inc.

Morphological Approach

Morphological stemmers are designed to analyze queries and texts using Arabic roots. Specifically, morphology is the linguistic field that analyzes the structure and formation of words. The morpheme is the smallest meaningful unit of language (Abu-Errub, Odeh, Shambour, & Hassan, 2014). Morphological stemmers are dictionary-based in that they typically funnel the query or text through a dictionary of roots and/or patterns to extract roots.

The Khoja stemmer, considered a morphological analyzer because it checks for patterns and roots, is one of the first stemmers introduced in the field of Arabic IR (Larkey, Ballesteros, & Connell, 2004). It is primarily a dictionary-based stemmer. As mentioned earlier, there are many roots in modern Arabic usage, but roots and stems are not synonymous in Arabic IR. The root pertains to Arabic root morphology, while the stem pertains to simplified word structures that do not necessarily match Arabic root structures. The Khoja stemmer is considered an aggressive stemmer. It removes diacritic marks, punctuation, numbers, definite articles, suffixes, and prefixes. After this process, the results are matched against a list of the twelve Arabic patterns. If there is a match, the pattern is removed in an attempt to extract a root. Once a root is extracted, it is matched against a root dictionary.
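The strip, match, and look-up sequence described above can be sketched as follows. This is a toy illustration under stated assumptions and not the Khoja implementation: the affix lists, the three patterns, and the two-entry root dictionary are all hypothetical, and diacritic and punctuation removal are omitted. Patterns are written with the letters fa, ain, and lam as placeholders for the three root consonants.

```python
# Hypothetical, tiny resources for illustration only.
PREFIXES = ["ال", "و"]            # definite article "al", conjunction "waw"
SUFFIXES = ["ات", "ون", "ة"]
PATTERNS = ["فاعل", "مفعول", "فعال"]  # fa, ain, lam mark the root slots
ROOTS = {"كتب", "درس"}            # k-t-b (write), d-r-s (study)
SLOTS = "فعل"


def strip_affixes(word):
    """Remove at most one prefix and one suffix, keeping three letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word


def match_pattern(word, pattern):
    """Return the letters in the root slots if word fits pattern, else None."""
    if len(word) != len(pattern):
        return None
    root = []
    for letter, slot in zip(word, pattern):
        if slot in SLOTS:
            root.append(letter)
        elif letter != slot:
            return None
    return "".join(root)


def extract_root(word):
    """Strip affixes, try each pattern, and accept only dictionary roots."""
    stem = strip_affixes(word)
    if stem in ROOTS:
        return stem
    for pattern in PATTERNS:
        root = match_pattern(stem, pattern)
        if root in ROOTS:
            return root
    return None


# "the writer", "written", and "book" should share one root.
print({extract_root(w) for w in ["الكاتب", "مكتوب", "كتاب"]})
```

Because every candidate must appear in the root dictionary, the quality of a stemmer built this way depends heavily on the coverage of that dictionary.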

Altantawy introduced a morphological analyzer specifically for modern standard Arabic that achieved high precision and recall scores. However, it was applied only to the Qur’an, which deviates less structurally and morphologically than Arabic dialects written without diacritic marks. Abu Hawas introduced a morphological stemmer that functions similarly to Altantawy’s but predicts roots without referencing a dictionary (Abu-Errub, Odeh, Shambour, & Hassan, 2014). This raises the issue of testing and evaluation of Arabic stemmers, which will be explained further in the Evaluation section of this literature review.

Light Stemmers

In contrast to the aggressive Khoja stemmer and its substitutes, light stemmers were introduced in the early 2000s, around the time TREC introduced its Arabic cross-language track (Darwish & Magdy, 2014). The biggest distinction between light and aggressive stemmers is the type of error each makes. Light stemmers sometimes fail to conflate forms of words that belong together, while aggressive stemmers conflate words that should not have been conflated (Larkey, Ballesteros, & Connell, 2004).

Larkey’s light-10 stemmer removes the most frequent prefixes and suffixes, all definite articles, and the letter “waw,” which usually means “and.” This approach is significantly different from the morphological approach because there is no consideration for Arabic roots. Algorithmically, Larkey’s light stemmer is simpler because fewer computations are used to process texts or queries (Kreaa, Ahmad, & Kabalan, 2014). A number of light stemmers have been tested over the past decade, particularly Al-Stem, light-8, and light-10. Darwish’s light stemmer, Al-Stem, is more aggressive than Larkey’s light-10 stemmer, which is itself similar to light-8. Darwish developed Al-Stem after light-10 was developed by a group of students at the University of Massachusetts for the TREC 2001-02 cross-language tracks for Arabic (Abu El-Khair, 2007).
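A light stemmer of the kind described above can be sketched in a few lines. The affix lists and length thresholds below are illustrative assumptions and should not be read as the exact light-10 specification; the function name `light_stem` is hypothetical.

```python
# Illustrative affix lists (assumed); longer article forms are tried first.
ARTICLES = ["وال", "بال", "كال", "فال", "ال", "لل"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]


def light_stem(word):
    """Strip a definite article or leading waw, then common suffixes."""
    for article in ARTICLES:
        if word.startswith(article) and len(word) - len(article) >= 2:
            word = word[len(article):]
            break
    else:
        # No article was removed: drop a leading waw ("and") if at
        # least three letters remain.
        if word.startswith("و") and len(word) - 1 >= 3:
            word = word[1:]
    # One pass over the suffix list, keeping at least two letters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            word = word[:-len(suffix)]
    return word


# "the libraries" and "library" conflate; "writer" is left untouched.
print(light_stem("المكتبات"), light_stem("مكتبة"), light_stem("كاتب"))
```

Note that no root or pattern is consulted at any point, so related words such as "writer" and "book" stay separate. This is the under-conflation error attributed to light stemmers earlier in this section.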

Prior to the light-8 and light-10 stemmers, researchers created even lighter stemmers. Light-1 and light-2 did not even remove suffix strings; they removed only prefixes, and fewer of them than light-8 and light-10. As expected, the light-1, light-2, and light-3 stemmers all had lower precision and recall scores than light-10 and light-8. The difference in average precision between light-8 and light-10, which had an additional co-occurrence step, was not statistically significant. However, the differences in average precision between light-8 and light-10 on the one hand and the earlier light stemmers 1, 2, and 3 on the other are statistically significant (Larkey, Ballesteros, & Connell, 2004).

Compared with the Khoja stemmer, light-10 also proved more robust with regard to precision and recall measures. The distinction between light-8 and light-10 is an additional co-occurrence measure, which re-clusters stems that are likely to occur together. Adding this co-occurrence step did not make a statistically significant difference in average precision between light-8 and light-10, but average precision did increase (Larkey, Ballesteros, & Connell, 2004). Larkey’s article was the only one surveyed that mentioned a co-occurrence technique to cluster stems that are likely to appear close to each other.

N-Gram Stemmers

In 1987, Ali Hegazi used n-grams to determine character sequences in Arabic text, making him one of the first information retrieval scientists to do so. He estimated nth-order entropy and redundancy using letter frequency and rank-frequency distributions, like TF*IDF scores. Hegazi found that Arabic entropy and redundancy are higher than English entropy and redundancy (Abu El-Khair, 2007). Nth-order entropy refers to uncertainty given a set of probabilities, and redundancy occurs when sets of characters occur together frequently.6 So not only does Arabic present more uncertainty in terms of which characters follow one another, but its character sets also recur more frequently than in English. This results in a high compression score for the Arabic language, which makes the use of n-grams in the text preprocessing phase more challenging (Abu El-Khair, 2007).
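As a rough illustration of how such figures can be estimated from letter frequencies, the sketch below computes the Shannon entropy of character n-grams and a first-order redundancy. It is a simplification and not Hegazi's procedure: it measures the block entropy of n-grams rather than a conditional nth-order entropy, and it assumes the standard Shannon definition of redundancy, one minus the ratio of observed entropy to the maximum possible for the observed alphabet. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(text, n=1):
    """Shannon entropy, in bits per n-gram, of the character n-grams in text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    counts = Counter(grams)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def redundancy(text):
    """1 - H / H_max, where H_max is log2 of the number of distinct characters."""
    alphabet_size = len(set(text))
    if alphabet_size < 2:
        return 0.0  # a one-symbol text carries no measurable uncertainty
    return 1 - entropy(text) / math.log2(alphabet_size)


# A uniform two-letter text has maximal entropy and zero redundancy;
# a skewed one has lower entropy and positive redundancy.
print(entropy("abab"), redundancy("abab"), redundancy("aaab"))
```

Running the same functions over comparable Arabic and English corpora would be one way to reproduce the kind of comparison reported above, although the outcome depends on the corpora and on preprocessing choices such as diacritic removal.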

Abu El-Khair cites a few more attempts at processing Arabic text for potential stemming algorithms; however, most were ineffective regardless of whether they used bigrams, trigrams, or other variations. The author argues that using n-grams does not address the ambiguity of Arabic words without diacritic marks (Abu El-Khair, 2007). If n-grams are used, additional processing, such as removing diacritic marks and kashidas, should be incorporated to increase efficiency. However, this additional processing results in increased storage requirements (Darwish & Magdy, Arabic information retrieval, 2014). There have been many attempts to create stemming models focusing on roots, stems, or statistical methods, but some of the more effective stemmers combine multiple approaches.
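A character n-gram comparison can be sketched as below. The Dice coefficient is used here as an assumed similarity measure for illustration; the surveyed articles do not prescribe it. The function names `char_ngrams` and `dice` are hypothetical.

```python
def char_ngrams(word, n=2):
    """Return the overlapping character n-grams of a word, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def dice(a, b, n=2):
    """Dice coefficient over the sets of character n-grams of two words."""
    x, y = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


# "book" vs "writing" share most bigrams; "book" vs "writer" share none,
# even though all three derive from the same root.
print(dice("كتاب", "كتابة"), dice("كتاب", "كاتب"))
```

The second comparison shows why n-grams alone struggle with Arabic: words formed from one root by changing internal letters can share no contiguous character sequences at all.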

Hybrid Stemmer Models

Many stemmers take hybrid approaches, combining morphological processing, suffix/prefix algorithms, and further reductions such as removing kashidas and diacritics. Root-based, stem-based, and statistical (n-gram) stemmers all have their advantages and disadvantages; thus, many stemming models incorporate multiple approaches. Even the Khoja stemmer, which is primarily a root-based stemmer, incorporates stem-based methods. In full, the Khoja stemmer removes diacritics, stopwords, punctuation, definite articles, suffixes, and prefixes. What is left is then matched against a list of the twelve morphological patterns. Once the pattern is removed, the root is left over and matched against a root dictionary. This process has led to the classification of the Khoja stemmer as aggressive. Light stemmers, on the other hand, typically do not use morphological processing (Kreaa, Ahmad, & Kabalan, 2014).

6 Claude Shannon & Information Theory -- Harvard School of Engineering and Applied Sciences

One of the most cited hybrid stemming approaches incorporates WordNet tables. The first WordNet table was designed by a group at Princeton. The table they created is a lexical database consisting of nouns, verbs, adjectives, and adverbs that are grouped into synsets, or sets of cognitive synonyms. The synsets are semantically and conceptually related.7 Kreaa, Ahmad, and Kabalan built an Arabic WordNet using the same conceptual model as the Princeton group in order to capitalize on the richness of the Arabic language. After each algorithmic step, such as removing diacritics, the word is filtered through a table to determine whether the algorithm should apply. In this study, the WordNet stemmer performed better in terms of average precision than both the light and Khoja stemmers (Kreaa, Ahmad, & Kabalan, 2014).

7 What is WordNet? – Princeton University

Stemming based on Latent Semantic Analysis Model

Froud, Lachkar, and Ouatik also attempted to create a model to tackle discrepancies between terms. These researchers argue that document clustering is currently too challenging because it requires large amounts of storage. They applied a latent semantic analysis (LSA) model to characterize similarity between Arabic words, using measures such as cosine similarity and the correlation coefficient. This work builds on some of the co-occurrence work that Larkey had employed. The LSA model is a vector space model that transforms a corpus of text “into a vector space of several hundred dimensions” (Froud, Lachkar, & Ouatik, 2013, p. 2). Vector models are built on vector representations of documents and queries. Document, query, and term vectors are compared using the cosine similarity measure, which measures the angles between them.8 The LSA model was evaluated via cosine similarity, Euclidean distance, the Jaccard coefficient, and the Pearson correlation coefficient, and it was tested with and without stemming. The model actually produced better similarity scores without stemming. The authors contend that this occurs because the same root can be used to produce a variety of meanings through the twelve distinct morphological patterns (Froud, Lachkar, & Ouatik, 2013). This information retrieval system also incorporated stopword elimination, another issue in Arabic information retrieval that lacks standardization, as researchers have yet to agree on how best to produce a stoplist.
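The four measures named above can be written out directly for two equal-length vectors. The sketch below uses standard textbook definitions and plain Python lists; it assumes the real-valued (Tanimoto) form of the Jaccard coefficient, which may differ from the exact variant used by Froud, Lachkar, and Ouatik.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def jaccard(u, v):
    """Extended (Tanimoto) Jaccard coefficient for real-valued vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    denom = sum(a * a for a in u) + sum(b * b for b in v) - dot
    return dot / denom if denom else 0.0


def pearson(u, v):
    """Pearson correlation coefficient (0.0 if either vector is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    spread = math.sqrt(sum((a - mean_u) ** 2 for a in u)
                       * sum((b - mean_v) ** 2 for b in v))
    return cov / spread if spread else 0.0


# Two hypothetical term-weight vectors pointing in the same direction.
doc_a, doc_b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(cosine(doc_a, doc_b), euclidean(doc_a, doc_b), pearson(doc_a, doc_b))
```

Cosine and Pearson ignore vector length, so the two example vectors count as maximally similar, while Euclidean distance still separates them. This is one reason studies report several measures side by side.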

Stopwords

Another key step when processing text or queries to retrieve results is stopping. Stopwords are function words that have little to no meaning when separated from other words.9 Throwing out stopwords in the retrieval process eliminates those unnecessary search terms and reduces the processing power and storage space needed. Lists of stopwords, or stoplists, are usually created in one of two ways: through human input or through statistical methods. The general stoplist is human generated. Currently, the Lemur Toolkit stoplist for Arabic contains 168 words (Abu El-Khair, 2007). Abu El-Khair describes the list as “only” 168 words, suggesting that he considers it short.

8 Croft, W. B., Metzler, D., & Strohman, T. (2010). Search Engines: Information Retrieval in Practice. Upper Saddle River: Pearson Education, Inc.

9 Ibid

Another way to generate a list of stopwords is through statistics. A list of stopwords can be generated by determining word frequencies and cutting out the words with the highest frequency. This is an arbitrary method and may cut words that would be useful when entering a query. As of now, there is no standard list of Arabic stopwords in systematic use; thus, more work is needed in this area (Abu El-Khair, 2007). Since there is no standardized stoplist, neither Google nor Bing uses stopword lists for Arabic web searches (Darwish & Magdy, Arabic information retrieval, 2014).
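The statistical approach can be sketched as a simple frequency cut-off. The whitespace tokenization and the `top_k` threshold below are assumptions made for illustration; choosing the threshold is precisely the arbitrary step noted above.

```python
from collections import Counter


def frequency_stoplist(documents, top_k=50):
    """Return the top_k most frequent whitespace-separated tokens."""
    counts = Counter(token for doc in documents for token in doc.split())
    return [word for word, _ in counts.most_common(top_k)]


# Three tiny hypothetical documents: "in the house", "in the school",
# "from the house in".
docs = ["في البيت", "في المدرسة", "من البيت في"]
print(frequency_stoplist(docs, top_k=2))
```

In this toy corpus the second most frequent token is a content word ("the house"), which shows how a frequency cut-off can discard terms a searcher might actually want.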

Evaluation

In addition to evaluating their own information retrieval systems or stemming techniques, information retrieval and computer scientists test many of their methods at information retrieval conferences. One of the most reputable and widely known is the Text Retrieval Conference (TREC), which is sponsored by the National Institute of Standards and Technology and the US Department of Defense.10 There are three main components of a test collection at these information retrieval conferences: queries and documents, topics, and relevance judgments. There are usually around fifty topics and a wide range in the number of documents. Earlier conferences had far fewer total documents because computing power was more limited than it is today and there were fewer sources of documents. Following the internet boom and the exponential growth of information, the number of documents in test collections rose dramatically.11

10 Slide 11 – March 17th INLS 509 PPT

11 Slide 10 – March 17th INLS 509 PPT and Class

Test collections are typically formed in four major steps. First, the topic creator makes relevance assessments. Then, different sites are given the corpus and topics to create lists for each topic. Usually about fifty lists are made, one per topic, each consisting of about 1,000 documents. Finally, the lists are pooled and the documents within them are assessed for relevance. While test collections can be useful because they are reusable and provide a standard for comparing information retrieval systems, they are hard to replicate, and some feature binary relevance scores that are assessed by only one person. Additionally, previous lists may have been created with now-outdated technology that did not process many relevant documents.12
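The pooling step can be sketched as taking the union of the top-ranked documents from each site's list for a topic. The `depth` parameter and the function name `pool` are hypothetical; the source does not state how deep the pools are.

```python
def pool(ranked_lists, depth=100):
    """Union of the top `depth` document IDs from each ranked list."""
    pooled = set()
    for ranked in ranked_lists:
        pooled.update(ranked[:depth])
    return pooled


# Two hypothetical runs for one topic, pooled to depth 2.
run_a = ["d1", "d2", "d3"]
run_b = ["d2", "d4", "d5"]
print(sorted(pool([run_a, run_b], depth=2)))
```

Only pooled documents are judged, so a relevant document that no participating system ranked highly is never assessed. This is the limitation of older collections noted above.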

Arabic became a popular topic in information retrieval following the inclusion of the cross-language track at TREC in 2001-02. The test collection from this conference is still widely used for the news domain. Following TREC 2001-02, many other information retrieval conferences started to pick up Arabic. The Topic Detection and Tracking evaluation (TDT3), which is designed to follow a specific event across various media, created an Arabic track constructed from radio, television, broadcast news, and news articles; it featured almost 16,000 Arabic documents. In 2005, TRECvid included an Arabic video track capturing 82.6 hours from Arabic news sources. At the Question Answering for Machine Reading track in 2012-13 at the Cross Language Evaluation Forum, there were Arabic test collections that featured documents related to Alzheimer’s, AIDS, climate change, and music and society. There have been additional conferences focusing on Arabic text, speech, and information retrieval. There has clearly been significant growth in the world of Arabic information retrieval (Darwish & Magdy, Arabic information retrieval, 2014).

12 Slide 2 – March 24th INLS 509 PPT

The main metrics used to evaluate and compare stemming techniques in Arabic information retrieval are average precision, recall, and precision. Throughout the literature, the Khoja stemmer and other morphological stemmers did not perform as well as light stemmers. The light-10 stemmer scored highest at TREC 2001-02, but Abu El-Khair argues that more work needs to be done (Abu El-Khair, 2007). He wrote his article in 2007, and since then there have been many attempts to improve stemming techniques. The articles surveyed also contain many assumptions or issues in their evaluation processes. One set of authors describes systems tested only on Qur’anic or religious texts, whose language is vastly different from, for example, the Egyptian dialect (Abu-Errub, Odeh, Shambour, & Hassan, 2014). Another author mentions researchers who tested their analysis only on modern standard Arabic, which again does not represent the vast diversity of Arabic dialects (Abu El-Khair, 2007).
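For reference, the three metrics can be computed for a single topic as sketched below. The definitions are the standard ones for a ranked result list and a set of judged-relevant documents; the function names and the example identifiers are hypothetical.

```python
def precision_recall(ranked, relevant):
    """Precision and recall of a retrieved list against the relevant set."""
    hits = sum(1 for doc in ranked if doc in relevant)
    precision = hits / len(ranked) if ranked else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears,
    divided by the total number of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Four retrieved documents; d1 and d3 are relevant, d5 was never retrieved.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5"}
print(precision_recall(ranked, relevant), average_precision(ranked, relevant))
```

Averaging `average_precision` over all topics in a test collection gives the mean figure that stemmer comparisons typically report.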

Conclusion

Arabic information retrieval has not been as heavily studied as English IR; thus, there are many areas of IR that are still progressing. The key issues that need more work are standardized stemming techniques and cross-language information retrieval efforts, because much of the Arabic content on the internet is transliterated. Additionally, since much of the Arabic content on the web is dialectal, many considerations are needed in deciding how best to create an information retrieval system.

Although this is purely anecdotal, when I visited Egypt I saw that the way my cousins interacted with the internet was vastly different from my experiences in the US. For one, they were incredibly invested in social media, especially Myspace, which had already fallen out of favor among many of my peers at the time. The articles I read note that Arab internet users post in forums more frequently, something I had seen but not understood. It makes sense, however, if much of Arabic internet content is found in forums, which has not been my experience when searching in English.

Additional research in the field of Arabic IR will dramatically change the lives of Arabic internet users. Arabic use on the internet grew thirty-fold from 2000 to 2012. Over forty percent of internet users in Saudi Arabia are tweeting, the highest proportion in the world. Yet only one percent of web pages are in Arabic. Often, Arabic internet users find themselves on forums after entering queries because much Arabic content is found on forums. Only last year was an Arabic address, “dot shabaka” (transliterated Arabic), made available as an alternative to .com.13 As Arabic users continue to publish additional content on the internet at an increasing rate, there will be a continued need to research and create better stemming techniques to improve information retrieval systems.

Another area that Arabic information retrieval is focusing on is microblogs, or items like tweets. Darwish and Magdy teamed up with Mourad to discuss microblog search; writing in 2012, they were the first researchers to contribute to the microblog field of Arabic IR. From September 2011 to February 2012, there was a ten percent increase in Arab Facebook users. Arabs make up almost fifteen percent of the Facebook population. Darwish, Magdy, and Mourad recognize that, following the social media movements that highlighted revolutions from Tunisia and Egypt to Syria, and even Qatar, there is a need to retrieve microblog content better than current information retrieval technology can (Darwish, Magdy, & Mourad, Language processing for Arabic microblog retrieval, 2012).

13 Surfing the shabaka – The Economist

There was a microblog track at TREC 2011. Microblog retrieval presents the same issues as text and query retrieval in Arabic IR, such as stemming and stopword elimination. Additionally, dialects present a large challenge in microblog retrieval because different transliterated dialects are not standardized and people spell pronunciations differently. Not only are dialects used pervasively, but spelling errors are also ubiquitous in microblog retrieval, especially in Arabic with the numerous dialects employed (Darwish, Magdy, & Mourad, Language processing for Arabic microblog retrieval, 2012).

Moving forward, information retrieval and computer scientists in Arabic IR have a lot of work ahead. One thing that could ground future work is the creation of dictionaries for transliterated dialects. That would be incredibly challenging, because new words are added and transliterated whenever no Arabic word is available, but this standardization might need to occur to create breakthroughs in information retrieval. The lack of it has hurt researchers’ attempts to design effective stemming and stopword elimination algorithms.

With the recent growth in Arabic content on the web following the revolutions in North Africa and Southwest Asia, there is a greater world focus on this area. More students are taking Arabic courses and traveling to these areas to learn the dialects and cultures of various peoples across the Middle East. With this interest in the Middle East, Arabic content on the internet will continue to grow and increase pressure to standardize information retrieval techniques so that Arabic content on the web is easily searchable. Arabic IR is still in its infancy, and with time, I believe that Arabic IR researchers will develop more robust stemming and stopword elimination techniques.

Bibliography

Abu El-Khair, I. (2007). Arabic information retrieval. Annual Review of Information Science and Technology, 505-533.

Abu-Errub, A., Odeh, A., Shambour, Q., & Hassan, O. (2014). Arabic roots extraction using morphological analysis. International Journal of Computer Science Issues, 11(2), 128-134.

Aloufi, K. (2011). Diacritic oriented Arabic information retrieval system. International Journal of Computer Science and Security, 143-155.

Darwish, K., & Magdy, W. (2014). Arabic information retrieval. Foundations and Trends in Information Retrieval, 239-342.

Darwish, K., Magdy, W., & Mourad, A. (2012). Language processing for Arabic microblog retrieval. Maui: Conference on Information and Knowledge Management.

Froud, H., Lachkar, A., & Ouatik, S. A. (2013). Arabic text summarization based on latent semantic analysis to enhance Arabic documents clustering. International Journal of Data Mining & Knowledge Management Process, 3(1), 79-95.

Kreaa, A. H., Ahmad, A. S., & Kabalan, K. (2014). Arabic words stemming approach using Arabic WordNet. International Journal of Data Mining & Knowledge Management Process, 4(6), 1-14.

Larkey, L. S., Ballesteros, L., & Connell, M. E. (2004). Light stemming for Arabic information retrieval. Amherst: University of Massachusetts.