Arabic Information Retrieval Literature Review
Omar Kashef
The University of North Carolina at Chapel Hill
ARABIC INFORMATION RETRIEVAL LITERATURE REVIEW 2
Contents
Introduction
Guiding Articles
Characteristics of Arabic
Diacritic Marks
Broken Plurals
Dialects
Transliteration and Arabizi
Stemming
Morphological Approach
Light Stemmers
N-Gram Stemmers
Hybrid Stemmer Models
Stemming based on Latent Semantic Analysis Model
Stopwords
Evaluation
Conclusion
Bibliography
Introduction
Arabic is spoken by over 200 million people in more than twenty countries across North
Africa and Southwest Asia.1 This makes Arabic the fifth most common language in the world,
which underscores the importance of developing robust Arabic information retrieval (IR) systems. On
the internet, there are sixty-five million Arabic users, making Arabic the seventh most common
internet language.2 What makes Arabic IR particularly challenging are diacritical marks (vowels),
broken (irregular) plurals, and varying dialects. The many dialects of this
Semitic language have large implications for the field of Arabic IR. Additionally, when
writing or typing Arabic, vowels can be omitted; however, omitted diacritical marks, or phonetic
stresses, can change the meaning of a word if the reader is not aware of the context in which
the word is used.3
Arabic words are typically derived through a robust system of Arabic roots. According to Sakhr
Software Company, there are over 10,000 potential roots, but far fewer that are used
regularly.4 This robust root system also has large implications for Arabic IR and may present
information retrieval challenges. This literature review will address two main issues that are
consistently raised in the field of Arabic IR: stemming and stopwords. There are still no
standardized methods of stemming or stopword elimination, which highlights the infancy of the
field of Arabic information retrieval.
1 A Guide to Arabic – 10 facts about the Arabic language. BBC.
2 Top 10 Languages Used on the Internet. Accredited Language
3 A Guide to Arabic – The Arabic alphabet. BBC.
4 About Sakhr. Sakhr Software – Arabic language technology.
Both of these issues raise further questions regarding varying Arabic dialects, the proliferation of
transliteration, and diacritical marks. Before discussing stemming and stopwords in
Arabic information retrieval, I examine Arabic’s unique characteristics. I then discuss
stemming and stopwords, explain evaluation
techniques, and finish with the needs of Arabic IR and projections for future
information retrieval and computer scientists.
Guiding Articles
The articles I analyzed for this survey of Arabic information retrieval come from various
peer-reviewed sources. The selected authors and papers consistently cite one another,
which adds to their credibility. The selected articles are
focused in the following areas:
surveys of Arabic IR (Abu El-Khair, 2007; Darwish & Magdy, 2014)
stemming techniques (Froud, Lachkar, & Ouatik, 2013; Kreaa, Ahmad, & Kabalan, 2014;
Larkey, Ballesteros, & Connell, 2004)
microblog retrieval (Darwish, Magdy, & Mourad, 2012)
diacritical marks in Arabic IR (Aloufi, 2011)
Arabic morphology analysis (Abu-Errub, Odeh, Shambour, & Hassan, 2014)
Together, these papers allow for a thorough examination of Arabic information retrieval. The
selected articles range from 2002 to 2014; thus, the information surveyed is up to date. The only
articles written before 2010 are one survey article and a foundational paper outlining a
then-new light stemming process. To my knowledge, there are currently no peer-reviewed
articles published in 2015 that specifically address stemming and stopword elimination
techniques for Arabic text.
Characteristics of Arabic
The Arabic language is distinct from English in many ways, meaning that English IR practices
are not transferable to Arabic IR. This section introduces characteristics of
Arabic that present challenges in the stemming and stopword elimination processes. The first
major distinctions are that Arabic is a cursive script and is written from right to left. There are
twenty-eight letters, eight forms of the diacritic hamza, and another eight diacritic marks. The
twenty-eight letters all have different forms based on the letter’s position in a word. Hamza
functions as both a letter and a diacritic mark; it is mainly used as a glottal stop in the
pronunciation of a long vowel. Nouns also have gender in Arabic.
A word in Arabic can be extended quickly with prefixes or suffixes, which are commonplace in the
Arabic language. Arabic also differs from English in its use of
kashidas, or physically elongated letters. Kashidas are often treated like white spaces in the
retrieval process (Darwish & Magdy, Arabic information retrieval, 2014). Despite these
complications, Arabic is a phonetic language and most words can be broken down into prefixes,
roots, and suffixes. Roots are between two and five characters long, and there are currently about
5,000 in modern use (Abu El-Khair, 2007). Not only are there Arabic roots, but there are also Arabic word
patterns, twelve in total, that change the meaning of Arabic roots. These patterns are consistent
for most Arabic words and are considered in the Arabic IR process.
Diacritic Marks
Diacritic marks are simply vowels in Arabic used to inflect letters in words. These marks can
change the pronunciation and even the meaning of a word. Thus, diacritic marks play a
large role in the stemming process for Arabic IR. Information retrieval and computer scientists
must consider whether or not diacritic marks will be removed in the stemming process. What
complicates diacritic marks even further is that they are usually not used. Arabic used in news
articles, forums, and social media does not incorporate diacritical marks. These marks are
usually found in Arabic laws, cases, academic papers, and religious texts and commentary
(Darwish & Magdy, Arabic information retrieval, 2014).
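
The removal option can be sketched in a few lines. This is a minimal illustration, not code from any of the surveyed systems: it assumes the diacritics are encoded as the Unicode combining marks U+064B through U+0652, and the function name strip_diacritics is hypothetical.

```python
import re

# Assumption: the diacritics of interest are the Unicode combining marks
# U+064B (fathatan) through U+0652 (sukun).
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Return the text with Arabic diacritic marks removed."""
    return DIACRITICS.sub("", text)


# The letters kaf, ta, ba, each followed by a fatha mark (U+064E).
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
bare = "\u0643\u062A\u0628"
print(strip_diacritics(vocalized) == bare)
```

Stripping the marks lets vocalized and unvocalized spellings match each other, at the cost of merging words that differ only in their vowels.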
Broken Plurals
Another key aspect of the Arabic language is irregular, or broken, plurals. While plurals in
English are relatively straightforward, usually requiring just an additional suffix, Arabic plurals are more
likely to deviate from the typical Arabic plural word structure.
Dialects
Arabic has millions of speakers across North Africa and Southwest Asia, but there are at least
six distinct dialects that present additional challenges for Arabic IR. The six dialects are
Egyptian, 85 million speakers; Maghrebi, 75 million; Levantine, 35 million; Iraqi, 25 million;
Gulf, 25 million; and Yemeni, 20 million (Darwish & Magdy, Arabic information retrieval, 2014).
Dialectal words are morphologically different from words in modern standard Arabic (MSA),
meaning that stemming and stopword elimination techniques are hindered because dialectal
words do not conform to the morphological standards of MSA. Modern standard Arabic is
typically used for news and more formal purposes, unlike the various dialects, which are
restricted to informal online content and speech.
Transliteration and Arabizi
Another key feature of Arabic information retrieval is the proliferation of
transliterated web content. A lot of Arabic content on the internet is a
transliterated form of Arabic, commonly known as Arabizi. Arabizi is an encoded dialectal
and phonetic Arabic text found only online. Arabic users use the English keyboard to write
in Arabic, incorporating additional characters and numbers as letters for sounds unfamiliar to
English users. While retrieving Arabizi results is not particularly challenging, the problems lie
in standardization and the incorporation of additional languages. Because of the varying
dialects, there is a lack of standardization and an ever-expanding
vocabulary as Arabic users commonly incorporate English and French words (Darwish &
Magdy, Arabic information retrieval, 2014). This transliterated and slang-infused type of Arabic
presents additional challenges in the stemming and stopword elimination processes.
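
To make the use of numbers as letters concrete, here is a rough heuristic sketch for flagging tokens that look like Arabizi. It is an illustration only: the digit set (2, 3, 5, 7) reflects common informal conventions rather than any standard, the function name looks_like_arabizi is hypothetical, and the heuristic produces false positives on tokens such as "mp3".

```python
import re

# Assumption: these digits are commonly used in Arabizi to stand in for
# Arabic sounds that have no Latin letter; conventions vary by dialect.
ARABIZI_DIGITS = set("2357")
LATIN_TOKEN = re.compile(r"[A-Za-z0-9']+")


def looks_like_arabizi(token: str) -> bool:
    """Flag a token that mixes Latin letters with Arabizi-style digits."""
    if not LATIN_TOKEN.fullmatch(token):
        return False
    has_letter = any(ch.isalpha() for ch in token)
    has_digit = any(ch in ARABIZI_DIGITS for ch in token)
    return has_letter and has_digit


for token in ["mar7aba", "3arabi", "hello", "2015"]:
    print(token, looks_like_arabizi(token))
```

A flag like this could route a query to a transliteration step before stemming, although the lack of standardized spelling means that any fixed rule set will miss variants.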
Stemming
Stemming is a text process by which different variations of a word are conflated. The stemming
process covers both inflections and derivations of words from a common stem. While
stemming provides noticeable improvements in retrieval results generally, it is critical to
improving search results for Arabic content. The two main types of stemmers are algorithmic
and dictionary-based.5 Both of these techniques have been tested in Arabic text processing.
This section will differentiate between morphological stemmers, light stemmers, WordNet
processing, and hybrid stemmers. Most proposed or tested stemmers combine various
approaches.
5 Croft, W. B., Metzler, D., & Strohman, T. (2010). Search Engines: Information Retrieval in Practice. Upper Saddle River: Pearson Education, Inc.
Morphological Approach
Morphological stemmers are designed to analyze queries and texts using Arabic roots.
Morphology is the linguistic field that analyzes the structure and formation of words. The
morpheme is the smallest meaningful unit of language (Abu-Errub, Odeh, Shambour, &
Hassan, 2014). Morphological stemmers are dictionary-based in that they typically funnel the
query or text through a dictionary of roots and/or patterns to extract roots.
The Khoja stemmer, considered a morphological analyzer because it checks for patterns and
roots, is one of the first stemmers introduced in the field of Arabic IR (Larkey, Ballesteros, &
Connell, 2004). This stemmer is primarily dictionary-based. As mentioned
earlier, there are many roots in modern Arabic usage, but roots and stems are not
synonymous in Arabic IR. The root pertains to Arabic root morphology, while the stem pertains to
simplified word structures that do not necessarily match Arabic root structures. The Khoja stemmer is
considered aggressive. It removes diacritic marks, punctuation,
numbers, definite articles, suffixes, and prefixes. After this process, the result is matched
against a list of the twelve Arabic patterns. If there is a match, the pattern is removed in order to
extract a root. Once a root is extracted, it is matched against a root dictionary.
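
The pattern-then-dictionary idea can be sketched as follows. This is a toy illustration and not the Khoja implementation: the two patterns, the one-entry root dictionary, and the function name extract_root are all assumptions made for the example, and only the definite article is stripped beforehand.

```python
# Pattern letters fa, ayn, lam mark the slots filled by root consonants.
ROOT_SLOTS = "\u0641\u0639\u0644"
# Assumption: two illustrative patterns (fa-alif-ayn-lam and
# mim-fa-ayn-waw-lam), not the full list of twelve.
PATTERNS = ["\u0641\u0627\u0639\u0644", "\u0645\u0641\u0639\u0648\u0644"]
# Assumption: a toy dictionary holding the single root kaf-ta-ba.
ROOT_DICTIONARY = {"\u0643\u062A\u0628"}
DEFINITE_ARTICLE = "\u0627\u0644"


def extract_root(word):
    """Return a dictionary root for the word, or None if no pattern fits."""
    if word.startswith(DEFINITE_ARTICLE) and len(word) > 4:
        word = word[len(DEFINITE_ARTICLE):]
    for pattern in PATTERNS:
        if len(pattern) != len(word):
            continue
        root = []
        for p_char, w_char in zip(pattern, word):
            if p_char in ROOT_SLOTS:
                root.append(w_char)   # slot position: take the word's letter
            elif p_char != w_char:
                break                 # fixed pattern letter does not match
        else:
            candidate = "".join(root)
            if candidate in ROOT_DICTIONARY:
                return candidate
    return None


# kaf-alif-ta-ba and mim-kaf-ta-waw-ba both reduce to the root kaf-ta-ba.
print(extract_root("\u0643\u0627\u062A\u0628") == "\u0643\u062A\u0628")
print(extract_root("\u0645\u0643\u062A\u0648\u0628") == "\u0643\u062A\u0628")
```

Because two words with quite different meanings can share a root, a pipeline of this kind conflates them, which is one reason root-based stemming is described as aggressive.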
Altantawy introduced a morphological analyzer specifically for modern standard Arabic that
achieved high precision and recall scores. However, it was only applied to
the Qur’an, which deviates less structurally and morphologically than Arabic dialects written without
diacritic marks. Abu Hawas introduced a morphological stemmer that functions similarly to
Altantawy’s but predicts roots without referencing a dictionary (Abu-Errub, Odeh,
Shambour, & Hassan, 2014). This raises the issue of testing and evaluation of Arabic
stemmers, which is explained further in the Evaluation section of this literature review.
Light Stemmers
In contrast to the aggressive Khoja stemmer and its substitutes, light stemmers were
introduced in the early 2000s, around the time TREC introduced its Arabic
cross-language track (Darwish & Magdy, 2014). The biggest distinction
between light and aggressive stemmers is the type of error. Light stemmers sometimes fail to
conflate forms of a word that belong together, while aggressive stemmers conflate
words that should not have been conflated (Larkey, Ballesteros, & Connell, 2004).
Larkey’s light-10 stemmer removes the most frequent prefixes and suffixes, all definite articles, and
the letter “waw”, which usually means “and”. This approach differs significantly from the
morphological approach because it does not consider Arabic roots. Algorithmically,
Larkey’s light stemmer is simpler because fewer computations are needed to process texts
or queries (Kreaa, Ahmad, & Kabalan, 2014). A number of light stemmers have been
tested over the past decade, particularly Al-Stem, light-8, and light-10. The Darwish light
stemmer Al-Stem is more aggressive than Larkey’s light-10 stemmer, which is similar to
light-8. Darwish developed Al-Stem after light-10 was developed by a group of students at the
University of Massachusetts for the TREC 2001-02 cross-language tracks for Arabic
(Abu El-Khair, 2007).
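
A light stemmer of this kind can be sketched as plain affix stripping. The code below is a simplified illustration under stated assumptions, not the light-10 algorithm itself: the affix lists are a small illustrative subset, the minimum remaining length of three letters is an assumed safeguard, and the name light_stem is hypothetical.

```python
# Assumption: illustrative prefixes, longest first (waw+al, ba+al, al, waw).
PREFIXES = ["\u0648\u0627\u0644", "\u0628\u0627\u0644", "\u0627\u0644", "\u0648"]
# Assumption: illustrative suffixes (at, un, in, ha, ta marbuta, ya).
SUFFIXES = ["\u0627\u062A", "\u0648\u0646", "\u064A\u0646",
            "\u0647\u0627", "\u0629", "\u064A"]


def light_stem(word, min_len=3):
    """Strip one prefix and any trailing suffixes, with no root lookup."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
                word = word[:-len(suffix)]
                stripped = True
                break
    return word


# "al" + mim-ayn-lam-mim + "un" is reduced to mim-ayn-lam-mim.
plural = "\u0627\u0644\u0645\u0639\u0644\u0645\u0648\u0646"
print(light_stem(plural) == "\u0645\u0639\u0644\u0645")
```

Note that the result is a stem rather than a root: no pattern or dictionary is consulted, so the stemmer is cheap but may leave related forms unconflated.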
Prior to the light-8 and light-10 stemmers, researchers created even lighter stemmers. Light-1 and
light-2 did not remove suffix strings at all, only prefixes, and fewer prefixes than either light-8
or light-10. As expected, the light-1, light-2, and light-3 stemmers all had lower precision and
recall scores than light-10 and light-8. The difference in average precision between light-8 and
light-10, which had an additional co-occurrence step, was not statistically significant. However,
the differences in average precision between light-8 and light-10 on the one hand and the earlier light
stemmers 1, 2, and 3 on the other are statistically significant (Larkey, Ballesteros, & Connell, 2004).
In comparison to the Khoja stemmer, light-10 also proved more robust with regard to precision
and recall measures. The distinction between light-8 and light-10 is an additional co-occurrence
measure. Co-occurrence re-clusters stems that are likely to occur together. Adding this
co-occurrence step did not make a significant difference in average precision between light-8 and
light-10, but average precision did increase (Larkey, Ballesteros, & Connell, 2004). Larkey’s
article was the only one that mentioned a co-occurrence technique to cluster stems that are
likely to be close to each other.
N-Gram Stemmers
In 1987, Ali Hegazi used n-grams to determine character sequences in Arabic text, making him one of the
first information retrieval scientists to do so. He estimated nth-order entropy and redundancy
using letter frequency and rank-frequency distributions, like TF*IDF scores. Hegazi found
that Arabic entropy and redundancy are higher than English entropy and redundancy
(Abu El-Khair, 2007). Nth-order entropy refers to uncertainty given a set of probabilities, and
redundancy occurs when sets of characters occur together frequently.6 So not only does Arabic
present more uncertainty in terms
of characters following one another, but character sets also occur together more frequently
than in English. This results in a high compression score for the Arabic language, which makes
the use of n-grams in the text preprocessing phase more challenging (Abu El-Khair, 2007).
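
For readers unfamiliar with these quantities, the sketch below computes the Shannon entropy of character n-grams and a simple redundancy figure. It is an illustration under stated assumptions rather than Hegazi's procedure: redundancy is taken to be 1 - H / H_max, with H_max computed from the number of distinct n-grams observed in the sample, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngram_entropy(text, n=1):
    """Shannon entropy, in bits, of the character n-grams in text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    counts = Counter(grams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def redundancy(text, n=1):
    """Assumed definition: 1 - H / H_max over the observed n-grams."""
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    if len(grams) < 2:
        return 0.0  # undefined for fewer than two symbols; 0.0 by convention here
    return 1 - ngram_entropy(text, n) / math.log2(len(grams))


print(ngram_entropy("abab"))       # two equally likely letters
print(redundancy("aaab"))          # skewed distribution, so some redundancy
```

On real corpora the estimates depend heavily on sample size and on preprocessing choices such as whether diacritics are kept.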
Abu El-Khair cites a few more attempts at processing Arabic text for potential stemming
algorithms; however, most were ineffective regardless of whether they used bigrams, trigrams, or other
variations. The author argues that using n-grams does not address the ambiguity of Arabic
words without diacritic marks (Abu El-Khair, 2007). If n-grams are used, additional processing,
such as removing diacritic marks and kashidas, should be incorporated to increase efficiency.
However, the n-gram method requires additional processing and results in increased storage
requirements (Darwish & Magdy, Arabic information retrieval, 2014). There have been many
attempts to create stemming models focusing on roots, stems, or statistical methods, but some
of the more effective stemmers combine multiple approaches.
Hybrid Stemmer Models
Many stemmers take hybrid approaches that combine morphological processing,
suffix/prefix algorithms, and further reductions such as removing kashidas and diacritics. Root-based,
stem-based, and statistical (n-gram) stemmers all have their advantages and disadvantages;
thus, many stemming models incorporate multiple approaches. Even the Khoja stemmer, which
is primarily a root-based stemmer, incorporates stem-based methods. In full, the Khoja stemmer
removes diacritics, stopwords, punctuation, definite articles, suffixes, and prefixes. What is
left is then matched against a list of the twelve morphological patterns. Once the
pattern is removed, the remaining root is matched against a root dictionary. This process has
led to the classification of the Khoja stemmer as aggressive. Light stemmers, on the other hand,
typically do not use morphological processing (Kreaa, Ahmad, & Kabalan, 2014).
6 Claude Shannon & Information Theory -- Harvard School of Engineering and Applied Sciences
One of the most cited hybrid stemming approaches incorporates WordNet tables. The
first WordNet table was designed by a group at Princeton. The table they created is a lexical
database consisting of nouns, verbs, adjectives, and adverbs that are grouped in synsets, or
sets of cognitive synonyms. The synsets are semantically and conceptually related.7 Kreaa,
Ahmad, and Kabalan built an Arabic WordNet using the same conceptual model as the
Princeton group in order to capitalize on the richness of the Arabic
language. After each algorithmic step, such as removing diacritics, the word is filtered through the
table to determine whether the algorithm should continue to apply. In this study, the WordNet stemmer
performed better in terms of average precision than both the light and Khoja stemmers (Kreaa,
Ahmad, & Kabalan, 2014).
Stemming based on Latent Semantic Analysis Model
Froud, Lachkar, and Ouatik also attempted to create a model to tackle discrepancies between
terms. These researchers argue that document clustering is currently too challenging because the
method requires large amounts of storage. They built a latent semantic analysis (LSA) model to
characterize similarity between Arabic words, similar to cosine similarity and the correlation
coefficient. This work builds on some of the co-occurrence work that Larkey had employed.
7 What is WordNet? – Princeton University
The LSA model is a vector space model that transforms a corpus of text “into a vector space of
several hundred dimensions” (Froud, Lachkar, & Ouatik, 2013, p. 2). Vector space models are built on
vector representations of documents and queries. Document, query, and term vectors are
compared using the cosine similarity measure, which measures the angle between two
vectors.8 The LSA model was evaluated using cosine similarity, Euclidean distance,
the Jaccard coefficient, and the Pearson correlation coefficient. The LSA model was tested with and
without stemming. The model actually produced better similarity scores without
stemming. The authors contend that this occurs because the same root can be used to produce a
variety of meanings through the twelve distinct morphological patterns (Froud, Lachkar, &
Ouatik, 2013). This information retrieval system also incorporated stopword elimination, which
is another area of Arabic information retrieval that lacks standardization, as researchers
have yet to agree on how best to produce a stoplist.
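
The four measures named above can be written out directly. The sketch below is a plain-Python illustration on small hand-made vectors, not the authors' LSA pipeline; the Jaccard coefficient is shown in its set-based form, which is an assumption about the variant used, and all function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard_coefficient(terms_a, terms_b):
    """Set-based Jaccard: shared terms divided by all distinct terms."""
    union = set(terms_a) | set(terms_b)
    if not union:
        return 0.0
    return len(set(terms_a) & set(terms_b)) / len(union)


def pearson_correlation(a, b):
    """Pearson correlation coefficient of two equal-length vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


doc, query = [1, 0, 1], [1, 1, 0]
print(cosine_similarity(doc, query), euclidean_distance(doc, query))
```

Cosine and Pearson ignore vector length, while Euclidean distance does not, so the measures can rank the same pair of documents differently.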
Stopwords
Another key step when processing text or queries to retrieve results is stopping. Stopwords
are function words that have little to no meaning when separated from other words.9 Throwing
out stopwords in the retrieval process eliminates unnecessary search terms and reduces
processing cost and storage requirements. Stoplists are usually created
in one of two ways: through human input or through statistical methods. The general
stoplist is human generated. Currently, the Lemur Toolkit stoplist for Arabic contains 168 words
(Abu El-Khair, 2007). Abu El-Khair describes the list as “only” 168 words, as if it were a
short list.
8 Croft, W. B., Metzler, D., & Strohman, T. (2010). Search Engines: Information Retrieval in Practice. Upper Saddle River: Pearson Education, Inc.
9 Ibid.
Another way to generate a list of stopwords is through statistics. A stoplist can be
generated by computing term frequencies and cutting out the words with the highest frequency. The
cutoff is arbitrary, and the method may cut words that would be useful in a query. As of now,
there is no standard list of Arabic stopwords in systematic use. Thus, more
work is needed in this area (Abu El-Khair, 2007). Since there is no standardized stoplist,
neither Google nor Bing uses stopword lists for Arabic web searches (Darwish & Magdy,
Arabic information retrieval, 2014).
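
The statistical approach amounts to counting terms and keeping the most frequent ones. The sketch below is a minimal illustration, assuming whitespace tokenization and a hand-picked cutoff top_k; the function name and the romanized placeholder tokens are hypothetical.

```python
from collections import Counter


def frequency_stoplist(documents, top_k):
    """Return the top_k most frequent whitespace-separated tokens."""
    counts = Counter(token for doc in documents for token in doc.split())
    return [token for token, _ in counts.most_common(top_k)]


# Romanized placeholder tokens stand in for Arabic words.
docs = ["fi al bayt", "fi al madrasa", "fi kitab"]
print(frequency_stoplist(docs, 2))
```

The choice of top_k is exactly the arbitrary step noted above: a larger cutoff saves more index space but risks discarding terms that carry meaning in some queries.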
Evaluation
Besides information retrieval and computer scientists evaluating their own information
retrieval systems or stemming techniques, many of their methods are tested at information
retrieval conferences. One of the most reputable and widely known is the Text Retrieval
Conference (TREC), which is sponsored by the National Institute of Standards and Technology and the
US Department of Defense.10 There are three main components of a test collection at these
information retrieval conferences: queries and documents, topics, and relevance judgments.
There are usually around fifty topics and a wide range in the number of documents. Earlier
conferences had far fewer total documents because computing power was more limited
and there were fewer sources of documents. Following the internet boom and the exponential
growth of information, the number of documents in test collections rose dramatically.11
10 Slide 11 – March 17th INLS 509 PPT
11 Slide 10 – March 17th INLS 509 PPT and Class
Test collections are typically formed in four major steps. First, the topic creator makes
relevance assessments. Then, different sites are provided the corpus and topics to create lists
for each topic. Usually about fifty lists are made, one per topic, each consisting of about 1,000
documents. Finally, the lists are pooled and the documents within them are assessed for relevance.
While test collections are useful because they are reusable and provide a standard
for comparing information retrieval systems, they are hard to replicate, and some feature binary
relevance scores that are assessed by only one person. Additionally, previous lists may have
been created with now outdated technology that did not process many relevant documents.12
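
The pooling step can be sketched as taking the union of the top-ranked documents from each submitted list. This is an illustration under assumptions: the lists are treated as ranked, the pool depth is a free parameter, and the name build_pool and the document identifiers are hypothetical.

```python
def build_pool(runs, depth):
    """Union of the top `depth` documents from each submitted list."""
    pool = set()
    for ranked_list in runs:
        pool.update(ranked_list[:depth])
    return pool


# Two hypothetical submissions for the same topic.
runs = [["d1", "d2", "d3"], ["d2", "d4", "d5"]]
print(sorted(build_pool(runs, depth=2)))
```

Only documents in the pool are judged, which is why relevant documents that no participating system retrieved can go unassessed.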
Arabic became a popular topic in information retrieval following the inclusion of the
cross-language track at TREC in 2001-02. The test collection from this conference is still widely used
for the news domain. Following TREC 2001-02, many other information retrieval conferences
started to pick up Arabic. Topic Detection and Tracking Evaluation (TDT3) created an Arabic
track constructed from radio, television, broadcast news, and news articles, featuring almost
16,000 Arabic documents. This conference is
designed to focus on specific events across those various media. In 2005, TRECvid included an
Arabic video track capturing 82.6 hours from Arabic news sources. At the Question Answering
for Machine Reading track in 2012-13 at the Cross Language Evaluation Forum, there were Arabic
test collections that featured documents related to Alzheimer’s disease, AIDS, climate change, and music
and society. There have been additional conferences focusing on Arabic text, speech, and
information retrieval. There has clearly been significant growth in the world of Arabic
information retrieval (Darwish & Magdy, Arabic information retrieval, 2014).
12 Slide 2 – March 24th INLS 509 PPT
The main metrics used to evaluate stemming techniques in Arabic information retrieval are
average precision, recall, and precision. These metrics are used to compare stemming
techniques. Throughout the literature, the Khoja stemmer and other morphological stemmers
did not perform as well as light stemmers. The light-10 stemmer scored highest at
TREC 2001-02, but Abu El-Khair argues that more work needs to be done (Abu El-Khair, 2007). He
wrote his article in 2007, and since then there have been many attempts to improve stemming
techniques. In the articles surveyed, there are also many assumptions or issues with the
evaluation process. One group of authors tested their system only on Qur’anic or religious texts, which are
vastly different from the Egyptian dialect, for example (Abu-Errub, Odeh, Shambour, & Hassan,
2014). Another author mentions researchers who tested their analysis using only modern
standard Arabic, which again does not represent the vast diversity of Arabic dialects
(Abu El-Khair, 2007).
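
For reference, the three metrics can be computed as follows for a single topic. This is a minimal sketch using the standard definitions with binary relevance; the function names and document identifiers are hypothetical.

```python
def precision_recall(ranked, relevant):
    """Precision and recall of a retrieved list against a relevant set."""
    hits = sum(1 for doc in ranked if doc in relevant)
    precision = hits / len(ranked) if ranked else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5"}
print(precision_recall(ranked, relevant))
print(average_precision(ranked, relevant))
```

Average precision rewards systems that place relevant documents early in the list, which is why two stemmers with the same precision and recall can still differ on it.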
Conclusion
Arabic information retrieval has not been as heavily studied as English IR; thus, there are many
areas of IR that are still progressing. The key issues that need more work are standardized
stemming techniques and cross-language information retrieval efforts, because much of the Arabic
content on the internet is transliterated. Additionally, since much of the Arabic content on the
web is dialectal, many considerations go into how best to create an information
retrieval system.
Although this is extremely anecdotal, when I visited Egypt I saw that the way my cousins interacted
with the internet was vastly different from my experiences in the US. For one, they were
incredibly invested in social media, especially Myspace, which had already fallen out of favor
among many of my peers at the time. In reading through various articles, I learned that Arab
internet users post in forums more frequently, something I had seen but had not
understood. This makes sense if much of Arabic internet content is found in forums,
which has not been my experience with searching in English.
Additional research in the field of Arabic IR will dramatically change the lives of Arabic internet
users. Arabic use on the internet grew thirty-fold from 2000 to 2012. Over forty percent of
internet users in Saudi Arabia are tweeting, the highest proportion in the world. Yet only one
percent of web pages are in Arabic. Often, Arabic internet users find themselves on forums
after entering queries because much Arabic content is found on forums. Only last year was
an Arabic address, “dot shabaka” (transliterated Arabic), made available as an alternative to
.com.13 As Arabic users continue to publish additional content on the internet at
an increasing rate, there will be a continued need to research and create better
stemming techniques to improve information retrieval systems.
Another area that Arabic information retrieval is focusing on is microblogs, or items like
tweets. Darwish and Magdy team up with Mourad to discuss microblog search. They
were the first researchers to contribute to the microblog field of Arabic IR, writing their article
in 2012. From September 2011 to February 2012, there was a ten percent increase in Arab
Facebook users. Arabs make up almost fifteen percent of the Facebook population. Darwish,
Magdy, and Mourad recognize that, following the social media movements highlighting
revolutions from Tunisia and Egypt to Syria, and even Qatar, there is a need to retrieve
microblog content better than current information retrieval technology can (Darwish,
Magdy, & Mourad, Language processing for Arabic microblog retrieval, 2012).
13 Surfing the shabaka – The Economist
There was in fact a microblog track at TREC 2011. Microblog retrieval presents the same
issues as text and query Arabic IR, such as stemming and stopword elimination. Additionally,
dialects present a large challenge in microblog retrieval because transliterated
dialects are not standardized and people spell pronunciations differently. Not only are dialects
used pervasively, but spelling errors are also ubiquitous in microblogs, especially
in Arabic with its numerous dialects (Darwish, Magdy, & Mourad, Language
processing for Arabic microblog retrieval, 2012).
Moving forward, information retrieval and computer scientists in Arabic IR have a lot of work ahead.
One thing that could ground future work is the creation of dictionaries for transliterated dialects.
Although that would be incredibly challenging, since new words are added and transliterated when
no Arabic word is available, this standardization might need to occur to create
breakthroughs in information retrieval. The lack of such standardization has hurt researchers’ attempts to design
effective stemming and stopword elimination algorithms.
With the recent growth in Arabic content on the web following the revolutions in North Africa
and Southwest Asia, there is a greater world focus in this area. More students are taking Arabic
courses and traveling to these areas to learn the dialects and cultures of various peoples across
the Middle East. With this interest in the Middle East, Arabic content on the internet will
continue to grow and increase pressure to standardize information retrieval techniques so that
Arabic content on the web is easily searchable. Arabic IR is still in its infancy and with time, I
believe that Arabic IR researchers will develop more robust stemming and stopword
elimination techniques.
Bibliography
Abu El-Khair, I. (2007). Arabic information retrieval. Annual Review of Information Science and Technology, 505-533.
Abu-Errub, A., Odeh, A., Shambour, Q., & Hassan, O. (2014). Arabic roots extraction using morphological analysis. International Journal of Computer Science Issues, 11(2), 128-134.
Aloufi, K. (2011). Diacritic oriented Arabic information retrieval system. International Journal of Computer Science and Security, 143-155.
Darwish, K., & Magdy, W. (2014). Arabic information retrieval. Foundations and Trends in Information Retrieval, 239-342.
Darwish, K., Magdy, W., & Mourad, A. (2012). Language processing for Arabic microblog retrieval. Maui: Conference on Information and Knowledge Management.
Froud, H., Lachkar, A., & Ouatik, S. A. (2013). Arabic text summarization based on latent semantic analysis to enhance Arabic documents clustering. International Journal of Data Mining & Knowledge Management Process, 3(1), 79-95.
Kreaa, A. H., Ahmad, A. S., & Kabalan, K. (2014). Arabic words stemming approach using Arabic WordNet. International Journal of Data Mining & Knowledge Management Process, 4(6), 1-14.
Larkey, L. S., Ballesteros, L., & Connell, M. E. (2004). Light stemming for Arabic information retrieval. Amherst: University of Massachusetts.