Enhancing Information Retrieval System by using AGA


Acknowledgments

First of all, I would like to thank Allah the Almighty, who gave me the strength to finish this thesis. My sincere thanks go to my supervisor, Dr. Mamoun Al-Rababaa, for his support and guidance during the research.

I would also like to express my appreciation and deep gratitude to Dr. Essam Hanandeh and Dr. Mofleh Aldyabat for accepting to come and discuss my thesis.

I would like to thank, from the bottom of my heart, my grandparents and my parents, who supported me morally and encouraged me throughout my studies. They lit my way with their words and their smiles.

Finally, I hope this work may contribute to developing the knowledge in this field. And I pray to Allah to dedicate me to serving my country under the governance of His Majesty King Abdullah II Bin Al-Hussein.

All praises to Allah

Laith aboaliqa

Dedication

To the Soul of my grandfather

To my beloved father, mother, brothers and sisters

Laith aboaliqa

Table of Contents

List of Tables

List of Figures

List of abbreviations

Abstract

______________________________________________________________________________

Chapter 1: Introduction

1.1 Background

1.2 Information Retrieval Systems (IRS)

1.3 Genetic Algorithm

1.4 The Statement of the Problem

1.5 The Objectives of the Study

1.6 Organization of the Thesis

______________________________________________________________________________

Chapter 2: Theoretical concept

2.1 Information Retrieval

2.2 Basic Processes of Information Retrieval

2.3 Information Retrieval (IR) Models

2.4 Document and Query Representation

2.5 Example Cosine Similarity Measures

2.6 Evaluation of Information Retrieval Systems

2.7 Overview of Genetic Algorithm

2.7.1 Genetic Algorithm Approach

2.7.2 Adaptive genetic algorithm

Chapter 3: Literature Review

______________________________________________________________________________

Chapter 4: Methodology

4.1 Introduction

4.2 Data Collection

4.3 Preprocessing Steps

4.4 Proposed Algorithm

4.5 Pseudocode of Proposed Adaptive Genetic Algorithm

______________________________________________________________________________

Chapter 5: Experimental results

5.1 Results

5.2 Average Precision using VSM with Cosine

5.3 Average Precision using VSM with Jaccard's

5.4 Average Precision using EBM with Cosine

5.5 Average Precision using EBM with Jaccard's

5.6 Result of Precision for (VSM)

5.7 Result of Precision for (EBM)

5.8 Comparison between Proportions Improvement

______________________________________________________________________________

Chapter 6: Conclusion, recommendations and future work

6.1 Introduction

6.2 Conclusion

6.3 Future work

______________________________________________________________________________

References

Abstract in Arabic

List of Tables

Table (2.1) Document and term query frequency

Table (2.2) Max frequency and document ID

Table (2.3) Inverse document frequency

Table (2.4) Compute frequency and weights

Table (2.5) Weights of term

Table (4.1) Representation chromosome

Table (5.1) Value of precision and recall for query number 1 by using (VSM) with (AGA) using cosine fitness function

Table (5.2) Value of precision and recall for query number 1 by using (VSM) with (AGA) using proposed cosine fitness function

Table (5.3) Value of precision and recall for query number 1 by using (VSM) with (AGA) using Jaccard's fitness function

Table (5.4) Value of precision and recall for query number 1 by using (VSM) with (AGA) using proposed Jaccard's fitness function

Table (5.5) Value of precision and recall for query number 1 by using (EBM) with (AGA) using cosine fitness function

Table (5.6) Value of precision and recall for query number 1 by using (EBM) with (AGA) using proposed cosine fitness function

Table (5.7) Value of precision and recall for query number 1 by using (EBM) with (AGA) using Jaccard's fitness function

Table (5.8) Value of precision and recall for query number 1 by using (EBM) with (AGA) using proposed Jaccard's fitness function

Table (5.9) Average value of precision for all queries using VSM with cosine

Table (5.10) Average value of precision for all queries using VSM with Jaccard's

Table (5.11) Average value of precision for all queries using EBM with cosine

Table (5.12) Average value of precision for all queries using EBM with Jaccard's

Table (5.13) Using vector space model with fitness function option

Table (5.14) Using extended Boolean model with fitness function option

Table (5.15) Compare between improvements

List of Figures

Figure (2.1) Data flow for the traditional genetic algorithm

Figure (4.1) Data flow for proposed algorithms

Figure (5.1) Average value of precision for all queries using VSM-cosine

Figure (5.2) Average value of precision for all queries using VSM-Jaccard's

Figure (5.3) Average value of precision for all queries using EBM-cosine

Figure (5.4) Average value of precision for all queries using EBM-Jaccard's

Figure (5.5) Using VSM with fitness function option

Figure (5.6) Using EBM with fitness function

List of abbreviations

IRS: information retrieval system

AGA: adaptive genetic algorithm

GA: genetic algorithm

IR: information retrieval

P: precision

R: recall

Doc Id: document identifier

F: fitness value

Q: query

TF: term frequency

TF-IDF: term frequency-inverse document frequency

VSM: vector space model

EBM: extended Boolean model

Pm: mutation probability

Pc: crossover probability

Abstract:

Keywords: Information Retrieval, Adaptive Genetic Algorithm, Extended Boolean Model, Vector Space Model.

Chapter one

Introduction

1.1 Background

With the growth of the Internet, a huge number of documents, created by millions of authors and organizations, has become available. Users reach this information through information retrieval systems, yet they may still encounter several problems in the retrieval process. Many studies therefore tackle the problems that affect the accuracy of such systems in order to find the best solutions. In this study, the researcher attempts to increase the efficiency and accuracy of the system [21].

The results of this study show some improvement in information retrieval system performance when adaptive genetic algorithms are used. The improvement was observed by running a set of queries with several methods and obtaining the relevant documents through sorting and ranking based on the similarity measure [21].

The objectives of the study were identified in order to address the problems of information retrieval systems and to make it easier for the user to obtain exactly the information he needs, by implementing adaptive genetic algorithms in an information retrieval system. The best solution was the one that gave the information retrieval system higher accuracy after the genetic algorithm improvements were applied [1] [21].

Genetic algorithms have been used in many areas: in information retrieval systems, to meet the user need expressed by a query; in neural networks, to get the best solution through the sorting of nodes; and in artificial intelligence more broadly. Developing the genetic algorithm by proposing an adaptive fitness function and an adaptive mutation operator has recently become very common [1] [10].

As should be clear by now, the study investigates the information retrieval models that are used to find the similarity between the query and documents. The researcher used two models, the vector space model and the extended Boolean model, to compute this similarity [5].

The researcher also proposed two fitness functions, cosine and Jaccard's, and used variable ratios for the mutation operators in order to obtain better results. The first fitness function (cosine) with variable mutation and crossover probabilities was compared with the second fitness function (Jaccard's) with variable mutation and crossover probabilities [5].

The corpus of the study consists of 1400 English documents in mathematics and 255 queries, which are used to evaluate the effectiveness of the results according to the precision and recall measures [4] [7].

1.2 Information Retrieval Systems (IRS)

Information retrieval systems help the user search a collection of available documents. The user enters a query that expresses an information need, and the system ranks the documents, based on cosine similarity, and identifies those that are relevant to that need [1].

1.3 Genetic Algorithm

A genetic algorithm represents each candidate solution as a chromosome. A chromosome can be encoded in several ways, such as decimal numbers, groups, or the binary system, and conversion can be carried out on the basis of the fitness function. When the genetic algorithm is applied to information retrieval, each chromosome is divided into several genes, and each gene represents a term of the query entered [7].

Genetic algorithms can be adapted in several ways, such as an adaptive fitness function and adaptive crossover and mutation operators, to improve the performance of the genetic algorithm in information retrieval and to produce a better generation of the population [17].

A genetic algorithm starts its evaluation process from a population. In this study, the population is taken from the results that the information retrieval system returns when it finishes processing a query [2].

Information retrieval systems rely on the degree of similarity between the query and the documents in the corpus, and their efficiency and accuracy are evaluated using the precision and recall measures [1][4].

Several models have been used in information retrieval to represent the document and the query and to compute the similarity between them, such as the vector space model, the Boolean model, the language model and the fuzzy model. In this study, the researcher used only the vector space model and the extended Boolean model [10].

The researcher used an adaptive genetic algorithm in order to obtain the best solution from the results of the information retrieval system [1].

Genetic algorithms rely on several design choices, such as the method for selecting the initial population. In this study, the researcher used the top-ranked documents retrieved for a single query as the initial population [2].

Genetic algorithms can also use several probability ratios for mutation and crossover, each taken from a given range. With a given mutation ratio a chromosome may or may not be altered, and likewise with a given crossover ratio, in the search for the best solution [5].

Mutation is an important part of the optimization process in a genetic algorithm: it helps the search reach better results and find the best solution when retrieving information. There are several types of mutation operator, and each type uses a certain probability within a specific range to produce a number of mutations [3].

1.4 The Statement of the Problem

A large number of documents retrieved by the users

,

the

1.5 The Objectives of the Study

The study aims to achieve the following goals:

1- Proposing an adaptive fitness function based on two equations (cosine and Jaccard) that have been used recently, in order to obtain accurate results and improve GA performance.

2- Proposing adaptive crossover and mutation operators that use more points within the domain of the crossover equations that have been used recently, in order to obtain accurate results and improve GA performance.

3- Finding the right number of generations, in order to save the user time when running genetic algorithms.

1.6 Organization of the Thesis

1. Chapter one: introduction

2. Chapter two: theoretical concept

3. Chapter three: literature review

4. Chapter four: methodology and data collection

5. Chapter five: experimental results

6. Chapter six: conclusions, recommendations and future work

Chapter two

Theoretical concept

2.1 Information Retrieval

Information retrieval systems rely on a few fundamental processes to find the information that meets the needs of the user by measuring the similarity between the query and the existing documents. This chapter presents the basic processes of matching the query with documents [9], based on the Vector Space Model (VSM), in which both documents and queries are represented as vectors, and on the extended Boolean model (EBM) [10].

2.2 Basic Processes of Information Retrieval

The basic processes of an information retrieval system are as follows:

1- Removing the punctuation marks from each document.

2- Removing stop words from each document, such as prepositions, articles and other words that appear frequently in the document without adding any meaning to it [7] [8].

3- Stemming the words using the Porter stemmer, which is the most commonly used stemmer [8].

4- Building the inverted index: each document is given a unique serial number, known as the document identifier (doc ID), and each term is linked to the documents in which it appears [7].
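The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical sample, and `naive_stem` is a crude suffix stripper standing in for the Porter stemmer, not an implementation of it. The two sample documents are invented.

```python
import string
from collections import defaultdict

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "is"}


def naive_stem(word):
    """Crude suffix stripping; a stand-in for the Porter stemmer, not the real algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Steps 1-3: remove punctuation, drop stop words, stem the remaining words."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [naive_stem(w) for w in no_punct.split() if w not in STOP_WORDS]


def build_inverted_index(docs):
    """Step 4: map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Invented sample documents keyed by document identifier (doc ID).
docs = {1: "The flow of heated air.", 2: "Heating and cooling flows, measured."}
index = build_inverted_index(docs)
print(index)
```

Looking up a query term in `index` then gives the candidate documents directly, without scanning every document.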

2.3 Information Retrieval (IR) Models

A range of different models has been proposed in the information retrieval literature, based upon different notions of what it means for a document to be relevant to a query. While some models, such as the Boolean model, have been important historically, the most common form of information retrieval today is ranked retrieval. A query is treated as an unordered set of keywords.

The information retrieval models used in this study are as follows:

1- Extended Boolean Model (EBM): The extended Boolean model was described in a Communications of the ACM article published in 1983 by Gerard Salton, Edward A. Fox, and Harry Wu. The goal of the extended Boolean model is to overcome the drawbacks of the Boolean model that has been used in information retrieval. The Boolean model doesn't consider term weights in queries, and the result set of a Boolean query is often either too small or too big.

The idea of the extended model is to make use of partial matching and term weights as in the vector space model. It combines the characteristics of the vector space model with the properties of Boolean algebra and ranks the similarity between queries and documents. This way a document may be somewhat relevant if it matches some of the queried terms and will be returned as a result, whereas in the standard Boolean model it would not be [From Wikipedia].

The EBM is a simple model that works on the basis of the binary system [20]. It is a logical model for calculating the similarity between the query and the documents, in which the logical operators (or, and, not) are evaluated according to the given equations; using (or) retrieves all documents that contain any of the query terms [10] [18]. We used equation (1) to rank documents in the EBM.

2- Vector Space Model (VSM): This model goes beyond binary weights and supports partial matching. Weights are assigned to the index terms in queries and documents, and these weights are used to calculate the degree of similarity between each document in the system and the user query. The documents are then arranged in descending order of similarity [7] [10].

In the vector space model, the researcher used the cosine measure to compute the similarity between the query and a document, because cosine similarity has given good results in recent research. The document and the query are represented as term vectors; each query term is compared with each document term, and the weights of the matching terms are used to compute the similarity as follows [10].

Each dimension corresponds to a separate term. If a term occurs in the

document, its value in the vector is non-zero. Several different ways of

computing these values, also known as (term) weights, have been developed.

One of the best known schemes is TF-IDF weighting (see the example below).

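For example, documents and queries are represented as vectors:

$Q = (w_{1,q}, w_{2,q}, \ldots, w_{x,q})$, where $x$ is the number of index terms and $Q$ is the query

$D_j = (w_{1,j}, w_{2,j}, \ldots, w_{x,j})$, where $j$ ranges over the documents and $D_j$ is a document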

The definition of term depends on the application. Typically terms are single

words, keywords, or longer phrases. If the words are chosen to be the terms,

the dimensionality of the vector is the number of words in the vocabulary (the

number of distinct words occurring in the corpus) [From Wikipedia].
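As a rough illustration of how the two models score a document, the sketch below computes cosine similarity between two weight vectors and the p-norm similarity commonly associated with the extended Boolean model of Salton, Fox and Wu. The thesis's own equation (1) is not reproduced in this section, so the p-norm form, the default `p = 2`, and all function names here are assumptions rather than the author's exact method.

```python
import math


def cosine_similarity(q, d):
    """Cosine of the angle between a query vector and a document vector of term weights."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm_q = math.sqrt(sum(w * w for w in q))
    norm_d = math.sqrt(sum(w * w for w in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_q * norm_d)


def pnorm_or(weights, p=2.0):
    """p-norm similarity for an OR query; weights are document term weights in [0, 1]."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)


def pnorm_and(weights, p=2.0):
    """p-norm similarity for an AND query over the same kind of weights."""
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)


# A document that contains only the first of two query terms.
print(cosine_similarity([1, 1], [1, 0]))
print(pnorm_or([1.0, 0.0]), pnorm_and([1.0, 0.0]))
```

With partial matching, the document that contains one of the two terms gets a score strictly between 0 and 1 under both OR and AND, which is the behavior the extended model is meant to provide.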

Table (2.2) shows the document identifier (ID) and the maximum term frequency for each document.

Table (2.2) Max frequency and document ID

Document ID    Max frequency
1              15
2              17
3              12
4              11

Table (2.3) shows, for each index term, the number of documents in which it appears (ni) and its inverse document frequency (Idfi), computed using equation (4).

Table (2.3) Inverse document frequency

Term ID    ni    Idfi = log10(N/ni)
1          4     log10(4/4) = 0
2          4     log10(4/4) = 0
3          2     log10(4/2) = 0.301
4          3     log10(4/3) = 0.125
5          2     log10(4/2) = 0.301
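The inverse document frequency values in Table (2.3) can be checked with a short calculation. The sketch below assumes N = 4 documents, as in the example, and uses the ni values from the table; the variable names are illustrative only.

```python
import math

N = 4  # total number of documents in the example

# Term ID -> ni, the number of documents in which the term appears (from Table 2.3).
doc_freq = {1: 4, 2: 4, 3: 2, 4: 3, 5: 2}

# Idf_i = log10(N / n_i)
idf = {term: math.log10(N / ni) for term, ni in doc_freq.items()}

for term, value in idf.items():
    print(f"term {term}: idf = {value:.3f}")
```

A term that occurs in every document gets an Idf of 0, so it contributes nothing to the similarity score, while rarer terms receive larger weights.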

2.7 Overview of Genetic Algorithm

The genetic algorithm is an important randomized search technique for finding the best solution, or the best among a group of solutions, in the available data [8]. In the field of information retrieval within computer science, a genetic algorithm (GA) is a search heuristic that mimics the process of natural selection. This heuristic (also sometimes called a metaheuristic) is routinely used to generate useful solutions to optimization and search problems. Genetic algorithms belong to the larger class of evolutionary algorithms (EA), which generate solutions to optimization problems using techniques inspired by natural evolution, such as mutation, selection, and crossover.

In a genetic algorithm, a population of candidate solutions to an optimization problem

is evolved toward better solutions. Each candidate solution has a set of properties (its

chromosomes) which can be mutated and altered; traditionally, solutions are

represented in binary as strings of 0s and 1s, but other encodings are also possible.[2]

The evolution usually starts from a population of randomly generated individuals, and

is an iterative process, with the population in each iteration called a generation. In

each generation, the fitness of every individual in the population is evaluated; the

fitness is usually the value of the objective function in the optimization problem being

solved. The more fit individuals are stochastically selected from the current

population, and each individual's gene is modified (recombined and possibly

randomly mutated) to form a new generation. The new generation of candidate

solutions is then used in the next iteration of the algorithm. Commonly, the algorithm

terminates when either a maximum number of generations has been produced, or a

satisfactory fitness level has been reached for the population.
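The generational loop described above can be sketched as follows. This is a generic illustration, not the algorithm proposed in this thesis: binary tournament selection, single-point crossover, per-bit mutation, and all parameter values are assumptions chosen for brevity, and the fitness function used in the example (counting 1 bits) is a toy objective.

```python
import random


def run_ga(fitness, length=10, pop_size=20, generations=30, pc=0.8, pm=0.05, seed=0):
    """Evolve binary chromosomes and return the best individual seen."""
    rng = random.Random(seed)
    # Start from a population of randomly generated individuals.
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best = max(population, key=fitness)

    def pick():
        # Binary tournament: the fitter of two randomly chosen individuals.
        a, b = rng.choice(population), rng.choice(population)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        next_gen = []
        while len(next_gen) < pop_size:
            p1, p2 = pick()[:], pick()[:]  # copies, so parents are never modified
            if rng.random() < pc:
                # Single-point crossover: swap the tails after a random cut.
                cut = rng.randint(1, length - 1)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (p1, p2):
                for i in range(length):
                    if rng.random() < pm:
                        child[i] = 1 - child[i]  # flip the bit
                next_gen.append(child)
        population = next_gen[:pop_size]
        # Remember the best individual found in any generation.
        best = max(population + [best], key=fitness)
    return best


best = run_ga(sum)  # toy objective: maximize the number of 1 bits
print(best, sum(best))
```

The loop stops after a fixed number of generations; stopping once a satisfactory fitness level is reached would be an equally valid termination rule.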

The genetic algorithm operates on an encoded representation of the solutions, equivalent to the chromosomes of individuals in nature. It is assumed that a potential solution to a problem can be encoded as a chromosome. In this study, each document chosen from the IRS is converted to a chromosome represented in the binary system so that it can be processed by the genetic algorithm, as described in the following section [7]. The researcher compares the performance of the system using the adaptive genetic algorithm with that of a traditional genetic algorithm.

2.7.1 Genetic Algorithm Approach

The data flow for the traditional genetic algorithm is as follows:

[Figure (2.1): flowchart with the boxes Representation of chromosome, Initial population, Fitness evaluation, Got better result (decision with Yes and No branches), Result, Selection method, Crossover and Mutation.]

Figure (2.1) Data flow for the traditional genetic algorithm.

The process of the genetic algorithms

Representation of the chromosomes: The chromosomes are represented using the binary system. Each chromosome is a group of genes that depends on the match between the query and a document.

The researcher selected 30 documents from the information retrieval system as the initial population. This number of documents was chosen because it is adequate for an average search, and each document was transformed into a binary chromosome.

The researcher compares the query terms with the document terms. If the first query term exists in the document, the first bit is set to 1; otherwise it is set to 0. The same is done for the remaining bits of the chromosome. The length of the chromosome is ten.
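The encoding rule described above is simple enough to state in code. The sketch below is a minimal illustration: the ten query terms and the document terms are invented (already stemmed) examples, and `encode_chromosome` is a hypothetical helper name.

```python
def encode_chromosome(query_terms, doc_terms):
    """Bit i is 1 when the i-th query term occurs in the document, otherwise 0."""
    doc_set = set(doc_terms)
    return [1 if term in doc_set else 0 for term in query_terms]


# Invented example: a query of ten terms gives a chromosome of length ten.
query = ["flow", "heat", "wing", "pressur", "layer",
         "boundari", "speed", "shock", "wave", "plate"]
doc = ["heat", "transfer", "boundari", "layer", "flat", "plate"]

chromosome = encode_chromosome(query, doc)
print(chromosome)  # [0, 1, 0, 0, 1, 1, 0, 0, 0, 1]
```

Applying the same function to each of the 30 retrieved documents would yield the initial population of binary chromosomes.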

Fitness function: an approach used to compute system performance. It evaluates whether a solution is effective or not. The adaptive genetic algorithm in this thesis uses two proposed fitness functions, cosine and Jaccard's, with some modification (an additional equation is added to the cosine and Jaccard's fitness functions) to find the similarity between the chromosomes and identify the chromosomes that have a better value.

The researcher used two fitness functions to get better results during measurement, and then changed the fitness equation in search of the best results. Cosine and Jaccard's were chosen as fitness functions because they have been used recently in many studies with good results; the two measures are used unmodified in one case and with the modification in another, in order to find the best case.
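For binary chromosomes, the plain cosine and Jaccard measures reduce to simple counts of shared 1 bits. The sketch below shows only these standard, unmodified forms; the additional equation that the thesis adds to each measure is not given in this section and is therefore not reproduced. The function names and the sample vectors are illustrative assumptions.

```python
import math


def cosine_fitness(x, y):
    """Cosine similarity of two binary vectors: |x AND y| / sqrt(|x| * |y|)."""
    both = sum(a & b for a, b in zip(x, y))
    denom = math.sqrt(sum(x) * sum(y))
    return both / denom if denom else 0.0


def jaccard_fitness(x, y):
    """Jaccard coefficient of two binary vectors: |x AND y| / |x OR y|."""
    both = sum(a & b for a, b in zip(x, y))
    either = sum(a | b for a, b in zip(x, y))
    return both / either if either else 0.0


query_bits = [1, 1, 1, 1]   # every query term is present in the query itself
chromosome = [1, 0, 1, 0]   # the document contains two of the four terms

print(cosine_fitness(query_bits, chromosome))
print(jaccard_fitness(query_bits, chromosome))
```

Both measures lie between 0 and 1 and grow as the chromosome shares more terms with the query, so either can rank chromosomes from worse to better.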

Selection method: The selection mechanism used in the genetic algorithm is simple random sampling: the chromosomes to which the operators are applied are selected at random. Random selection was chosen because it gives every chromosome a chance to undergo the genetic operators, so that the best result can be obtained from all chromosomes.

After choosing two chromosomes, the researcher applies the genetic operators in order to obtain two new chromosomes that are better than the two that were selected. When better chromosomes are obtained, they are swapped with the original chromosomes, so that the better chromosomes replace the worse ones, as evaluated by the fitness function [7].
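One possible reading of this selection-and-replacement step is sketched below: two chromosomes are drawn uniformly at random, the operators are applied, and each offspring replaces its parent only when it has a higher fitness. The replacement rule is an interpretation of the description above, and `steady_state_step` and the `operate` callback are hypothetical names introduced for illustration.

```python
import random


def steady_state_step(population, fitness, operate, rng):
    """Pick two chromosomes at random, apply `operate`, keep offspring that improve."""
    i, j = rng.sample(range(len(population)), 2)  # simple random sampling
    child_i, child_j = operate(population[i], population[j], rng)
    # Replace a parent only when its offspring is strictly better.
    if fitness(child_i) > fitness(population[i]):
        population[i] = child_i
    if fitness(child_j) > fitness(population[j]):
        population[j] = child_j
    return population


def demo_operator(a, b, rng):
    """Toy operator for the demo: returns one strong and one weak offspring."""
    return [1, 1], [0, 0]


population = [[0, 0], [0, 1], [1, 0]]
steady_state_step(population, sum, demo_operator, random.Random(1))
print(population)
```

In the demo, exactly one chromosome is replaced by the stronger offspring, while the weaker offspring is discarded because it does not beat its parent.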

Operators:

In the genetic algorithm approach, the researcher uses two GA operators to produce offspring chromosomes:

Crossover: the genetic operator that mixes two chromosomes together to form new offspring. With adaptive probability, the researcher applies two-point crossover using several crossover probability ratios in order to get a better value. Two-point crossover was chosen because it gives the chromosomes more chance to change [7] [10].

Mutation: the second operator used in our genetic algorithm system. Mutation modifies the value of a gene, making a slight change to the chromosome so that different variants appear; its goal is to restore lost genetic material and to explore a variety of data. The researcher tried many probability ratios adaptively to find a better mutation ratio, applying mutation at one randomly chosen point [6] [10].
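The two operators can be sketched for binary chromosomes as follows. This is a minimal illustration under assumptions: `pc` and `pm` are passed in as plain probabilities, the cut points and the mutated position are chosen uniformly at random, and the function names are hypothetical.

```python
import random


def two_point_crossover(a, b, pc, rng):
    """With probability pc, swap the segment between two random cut points."""
    if rng.random() >= pc:
        return a[:], b[:]  # no crossover: return unchanged copies
    lo, hi = sorted(rng.sample(range(1, len(a)), 2))  # two distinct cut points
    return a[:lo] + b[lo:hi] + a[hi:], b[:lo] + a[lo:hi] + b[hi:]


def one_point_mutation(chrom, pm, rng):
    """With probability pm, flip the bit at one randomly chosen position."""
    child = chrom[:]
    if rng.random() < pm:
        pos = rng.randrange(len(child))
        child[pos] = 1 - child[pos]
    return child


rng = random.Random(42)
parent_a, parent_b = [0] * 10, [1] * 10
child_a, child_b = two_point_crossover(parent_a, parent_b, pc=1.0, rng=rng)
mutant = one_point_mutation(parent_a, pm=1.0, rng=rng)
print(child_a, child_b, mutant)
```

Because the parents are copied before being changed, the original chromosomes stay available for comparison with their offspring.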

2.7.2 Adaptive genetic algorithm

Genetic algorithms with adaptive parameters (adaptive genetic algorithms, AGAs) are

another significant and promising variant of genetic algorithms. The probabilities of

crossover (pc) and mutation (pm) greatly determine the degree of solution accuracy

and the convergence speed that genetic algorithms can obtain. Instead of using fixed

values of pc and pm, AGAs utilize the population information in each generation and

adaptively adjust the pc and pm in order to maintain the population diversity as well

as to sustain the convergence capacity.

In an adaptive genetic algorithm, the adjustment of pc and pm depends on the fitness values of the solutions. In CAGA (clustering-based adaptive genetic algorithm), clustering analysis is used to judge the optimization states of the population, and the adjustment of pc and pm depends on these optimization states. It can be quite effective to combine GA with other optimization methods. GA tends to be quite good at finding generally good global solutions, but quite inefficient at finding the last few mutations to find the absolute optimum.
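One widely cited way to make pc and pm depend on fitness values is the scheme usually attributed to Srinivas and Patnaik, in which above-average solutions receive lower rates so that they are protected, and below-average solutions receive the maximum rates. The sketch below follows that scheme as an assumption; it is not necessarily the adjustment rule used in this thesis, and the constants k1 to k4 are conventional illustrative defaults.

```python
def adaptive_rates(f_parent, f_individual, f_max, f_avg,
                   k1=1.0, k2=0.5, k3=1.0, k4=0.5):
    """Return (pc, pm) from fitness values.

    f_parent is the larger fitness of the two parents chosen for crossover,
    f_individual is the fitness of the chromosome considered for mutation,
    f_max and f_avg are the best and average fitness of the population.
    """
    spread = f_max - f_avg
    if spread <= 0:
        # The population has converged; fall back to the maximum rates.
        return k3, k4
    pc = k1 * (f_max - f_parent) / spread if f_parent >= f_avg else k3
    pm = k2 * (f_max - f_individual) / spread if f_individual >= f_avg else k4
    return pc, pm


print(adaptive_rates(1.0, 1.0, f_max=1.0, f_avg=0.5))    # best solution
print(adaptive_rates(0.75, 0.75, f_max=1.0, f_avg=0.5))  # above average
print(adaptive_rates(0.2, 0.2, f_max=1.0, f_avg=0.5))    # below average
```

Under this rule the best chromosome is passed on untouched, while weak chromosomes are disrupted as much as possible, which helps maintain diversity without losing good solutions.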

Chapter three

Literature Review

The researcher has studied a range of scientific papers published in scientific journals, in order to benefit from them, to learn where research in this field stands, to enhance our work on genetic algorithms, and to find better solutions through this study.

Many scholars have studied and worked in this field. The researcher mentions some of the studies on genetic algorithms in this section.

(2013), Wafa Zaal: This thesis worked on adaptation in the genetic algorithm using several models: the vector space model, the logical model and the language model. She used variable crossover and mutation probabilities instead of the fixed values used in traditional genetic algorithms, and expanded both crossover and mutation to get the best results. The thesis used an Arabic corpus. Her study showed that using the vector space model with cosine similarity was the best solution, with an improvement rate of 13.0% [10].

(2013), Mohammad Thangamani: In this paper, the focus was on the retrieval of the most relevant information. The author used the genetic algorithm to improve information retrieval systems and studied a fitness function based on the frequency of words in the document. He concluded that a genetic algorithm combined with information retrieval gives better results than conventional systems, in addition to improving the efficiency of the documents retrieved in the search [16].

(2013), Priya Borkar and Leena Patil: In this paper, they worked on a hybrid model that combines a genetic algorithm with a particle-based method. They also worked on keyword expansion to produce new keywords related to the needs of the user. They developed the algorithm to improve the relevance of the search results to the user's needs, and used a fitness function, expressed as an equation, that gives more refined results when measuring the similarity between the user's needs and the feedback documents. They also briefly described the working principle of an adaptive genetic algorithm that can be applied to query expansion [1].

(2013), Korejo and Khuhro: In this paper, the authors studied adaptive mutation and proposed four mutation operators for the genetic algorithm, addressing the problem of choosing a mutation operator, which is difficult in practice. They proposed a solution that adapts the rate at which each mutation operator is applied, choosing each operator based on the behavior of the population in each generation. The study showed that adaptive mutation gave the best results [2].

(2012), Firas Alabsi and Reyadh Naoum: In this paper, multiple systems were used in the process of detecting the sequence, and several types of values were used to assess the population size and encoding. The authors described several types of crossover in order to determine the best one, and described several types of mutation and compared them to find the best one. They also studied the behavior of a genetic algorithm with these different types, comparing the genetic operators to determine the best one to use [17].

(2012), Ammar Sami: In this paper, he proposed a search method based on a genetic algorithm to improve information retrieval from online websites. The work was divided into two units: the first unit performs document indexing, and the second unit is the genetic algorithm, which uses crossover and mutation operators and a specialized fitness function. An improvement in the results of up to 90% was obtained [21].

(2011), Mohammad Nassar, Feras AL Mshagba and Eman AL Mshagba: In this paper, a genetic algorithm was used to improve data retrieval by finding a better query through optimization. Arabic texts were used in this study. The query was improved with a genetic algorithm using different fitness functions and different mutation strategies, in order to get the best result with a logical model on an Arabic data collection. The study showed that the best strategy is to use a genetic algorithm with the logical model (M2, Precision) method [4].

(2009), Noha Marwan: In this work, she used four different islands in data retrieval, with fitness functions based on Jaccard's and Ochiai's measures. She applied each measure to the islands independently. She compared the results of the four islands with and without query expansion. She points out that query expansion improved the search results with respect to the user's needs, and showed that Jaccard's is better than Ochiai's when using random selection, whereas Ochiai's is better than Jaccard's when using unbiased tournament selection [14].

(2008), Ahmed Radwan, Bahgat Abdel Latef, Abdel Ali and Osman Sadek: In this paper, the authors studied the use of genetic algorithms in information retrieval. They used three corpora. The retrieved documents were adapted to users through genetic modification. New fitness functions were proposed for approximate information. The result was very fast and very flexible retrieval through the use of a cosine similarity fitness function. Finally, the results showed that the new fitness function gives more refined results, and the authors pointed out that it is more accurate than the other two approaches [7].

(2007), Detelin Luchev: In this paper, the author studied the use of genetic algorithms to improve user queries in information retrieval systems. The author used genetic modification to make retrieval more accurate by improving the efficiency of the system. The aim of the study was to retrieve the relevant documents while returning a smaller number of documents for the query entered by the user. The results showed that, for the selected groups, a genetic algorithm gives more refined and more accurate results than a classic IR system, since the genetic algorithm adapts the weights of the query terms to obtain high-precision results [8].

(2004), Sanem Sariel: This paper focused on retrieving the most relevant information. The author used genetic algorithms to improve information retrieval systems, and studied how adaptive mutation, driven by feedback from the population through changes in the fitness function during the run of the genetic algorithm, can be used to obtain the best solutions [11].

Chapter Four

Methodology

4.1 Introduction

Because of the large number of interrelated documents and pieces of information now available, particularly on the Internet, users may encounter problems when retrieving information.

Building an information retrieval system involves several key stages, such as tokenizing, stemming and building the inverted index. In this study, the researcher applied an adaptive genetic algorithm (AGA) to an information retrieval system (IRs) in order to get the best solution among all candidate solutions in the retrieval process.

The researcher seeks to improve and enhance the genetic algorithm as applied to information retrieval systems, to get more accurate results than previous work. The researcher also adapts the fitness function by modifying the two fitness equations used in this research (Cosine, Jaccard's) to find the best solutions.
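For reference, the two similarity measures named above can be written for term-weight vectors as in the sketch below. It shows only the standard, unmodified cosine and Jaccard's coefficients; the modified equations proposed in this thesis are given in chapter two and are not reproduced here. The function names and the sample vectors are illustrative assumptions.

```python
import math


def cosine_similarity(x, y):
    """Standard cosine coefficient between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


def jaccard_similarity(x, y):
    """Standard Jaccard's coefficient between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    denominator = sum(a * a for a in x) + sum(b * b for b in y) - dot
    return dot / denominator if denominator else 0.0


if __name__ == "__main__":
    query = [1, 1, 1, 1]      # hypothetical query with four terms
    document = [1, 1, 0, 0]   # the document contains the first two terms
    print(cosine_similarity(query, document))
    print(jaccard_similarity(query, document))
```

For binary vectors, Jaccard's coefficient reduces to the number of shared terms divided by the number of terms present in either vector, which is why it scores the same pair lower than cosine does.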

The researcher seeks to obtain good and accurate results by applying the proposed modifications to the fitness function. The significance of this work stems from proposing an original measure, compared against the adaptive fitness function and adapted through a functional modification.

In the first part of the work, the researcher aims to get the best retrieval results using Cosine as the fitness function, based on previous work [14]. In the second part, the researcher aims to get the best retrieval results using Jaccard's as the fitness function, for comparison with previous research [10]. The researcher then compares the adjusted equations with the common equations to obtain the best results [...] that the researcher used in this study.

This research focuses on the vector space model (VSM), because it is considered the best model, and on the extended Boolean model (EBM), as shown by previous studies [8] [10]. The researcher used two equations to calculate the fitness function, and worked on adapting the mutation to find a better probability, alongside the crossover adjustment [19]. Finally, the researcher worked on finding the best number of generations for the genetic algorithm.

4.2 Data collection

To achieve the purpose of the study, the researcher chose the Cranfield English corpus test collection, which contains information on mathematics and was compiled by Cyril Cleverdon and used at the University of Cranfield in 1960. It contains a collection of 1400 documents and a collection of 225 queries for simulation purposes. The collection has also been used for evaluation in work published by Cambridge University Press (2009), by Karin Friberg Heppin at Göteborgs University (2002), and at Wright State University (2014). Precision and recall measures are used to evaluate the system.

4.3 Preprocessing steps

Before implementing the IRs, several steps are needed to prepare the corpus, as follows:

1- Obtaining the corpus: the Cranfield English corpus, which contains information on mathematics and was compiled by Cyril Cleverdon and his colleagues at the University of Cranfield.

2- Processing the texts to extract the terms of the documents (phase one, preparing the documents) as follows:

Tokenizing: finding the terms of the documents.

Elimination of all the stop-words [7].

Stemming the remaining words: removing the prefixes and suffixes of each word to get its root, which makes it easier to compare words and to match similar ones by root, using the Porter stemmer [8].

3- Building the information retrieval system as follows:

Inverted index: lists each term that occurs in the documents, together with the number of times it occurs in each document, as in table (2-4) in chapter two.

Assigning a weight to each term taken from the documents using IR models [7] [9].

Calculating the weight of each term in the query and the documents using the information retrieval models (VSM, EBM), then calculating the degree of similarity with the cosine similarity equation, based on the term weights, to give a degree of similarity between the query and the documents [10].

Using the vector space model (VSM): a strategy built on the inverted index file. Each query is compared with the documents in the system to produce the degree of similarity between the query and each document. The documents are then ordered by degree of similarity so that the closest matches rank first, and the ranking is evaluated with recall and precision [10][12].

Extended Boolean model (EBM): it extends the simple Boolean model used in the past in information retrieval systems to represent terms, which works on a binary basis.

When computing the similarity between documents and queries, the EBM takes the query terms one at a time. If a query term exists in the document, its weight is taken, and the result is divided by the number of terms in the document, document by document [20].

The logical model calculates the similarity between the query and the documents, where the logical operators (or, and, not) are calculated according to the equations given in chapter two. To retrieve all documents containing any of the terms, the researcher used the logical OR when applying the extended Boolean model to get the similarity between the query terms and the documents. Using OR in this strategy means taking the weight of each query term found in the query or in the documents [10] [18].

4- Building Genetic Algorithm

Representation of the chromosomes: the initial population in the genetic algorithm is made up of chromosomes. Each chromosome consists of several genes, and its length is based on the number of query terms, which the researcher fixes at 10 bits, the maximum number of terms in a query. If a query term exists in the document, it is represented as 1, and if not, as 0.

An example of the chromosome representation is as follows:

Table (4.1) Chromosome representation

Chromosome number (from 1 to 30)    Chromosome representation (10 bits)
Document1 - Chromosome 1            0101101000
Document2 - Chromosome 2            1010111010
Document3 - Chromosome 3            1010011010

In the above example, document1 = chromosome 1 = 0101101000.

D1 = made, answer, arrive, ask, become, break, believe, build, buy, call, welcome, tomorrow, today, yesterday, man, accident, improve, car, bus.

Q1 = dad, answer, special, call, ask, differ, call, mum.

To represent chromosome 1 in binary, each query term is matched against the document. The first query term is not found in document1, so gene 1 = 0. The second query term is found in document1, so gene 2 = 1. The third query term is not found in document1, so gene 3 = 0, and so on. When a query has fewer than 10 terms, the remaining genes are filled with zeros, treating the missing query terms as not found in the document.

The system takes the first 30 documents retrieved by the IR system and uses them as the initial population of the adaptive genetic algorithm.

Evaluation of the fitness function: evaluating each chromosome in the population. The researcher used two adaptive fitness functions (Cosine and Jaccard's) with the modifications given in chapter two, equations (7) and (8).

Selection method: the selection process selects two parent chromosomes at random on the basis of the fitness function, so that chromosomes with better fitness have a greater chance of being chosen.

5- Genetic operations:

In the GA approach, the researcher uses two GA operators to produce offspring chromosomes:

Crossover: the process of mixing two chromosomes to get two new offspring chromosomes. The researcher uses two-point crossover to get better chromosomes; over repeated trials, this gives accurate results in the adaptive genetic algorithm [19]. The proposed algorithm applies the two-point crossover operator with a probability in the usual range (0.5, 0.9), which is used as the initial value in equation (9) in chapter two. A random number is drawn and compared with the crossover probability to determine the bits to which crossover is applied.

Example of crossover

In this example the researcher applies two-point crossover, with the first cut after the second gene and a crossover probability of 0.6.

1010101011 parent1 (chromosome 1 chosen by the selection method)
1001100101 parent2 (chromosome 2 chosen by the selection method)
----------------------- The researcher swaps genes 3 to 8 (shown in bold in the original)
1001100111 child1 (chromosome 1 produced by crossover)
1010101001 child2 (chromosome 2 produced by crossover)

Mutation: the process of flipping one randomly selected bit to get a better chromosome, so that bit 1 becomes 0 and bit 0 becomes 1 [19], in order to reach more accurate results in the adaptive genetic algorithm. The mutation probability usually lies in the range (0.001, 0.2); the researcher ran many trials to find the best ratio.

The researcher uses a random one-point mutation operator. The mutation probability, usually in the range (0.001, 0.2), is adapted based on equation (10) in chapter two. The researcher draws a random number and compares it with the mutation probability: if the number is smaller than or equal to the probability, the mutation operator is applied; if it is greater, the chromosome is left unchanged.

Example of the mutation operator, choosing the 3rd gene

In this example the researcher applies one-point mutation to the third gene with a mutation probability of 0.01.

1001100111 child1 (chromosome 1 produced by crossover)
1010101001 child2 (chromosome 2 produced by crossover)
------------------------- The researcher flips the third bit (shown in bold in the original)
1011100111 new chromosome 1 (produced by mutation)
1000101001 new chromosome 2 (produced by mutation)

6- Evaluation of the fitness function: the fitness-based method ends the genetic algorithm after 70 generations. The researcher ran many experiments to arrive at this number as the one that yields the best solution.

7- Evaluating the retrieved documents using average recall and precision [20]. The most commonly used measurements of retrieval performance are precision and recall. Precision measures the ability of the system to retrieve only the documents that are relevant to a query [7] [13].

8- Comparing the efficiency of the traditional cosine and Jaccard's fitness functions in the genetic algorithm with that of the proposed adaptive fitness functions.

9- Comparing the efficiency of the crossover and mutation probabilities in the adaptive genetic algorithm.

10- Comparing the two models (VSM, EBM).
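The chromosome representation of Table (4.1) and the two genetic operators can be reproduced with the short sketch below. It is an illustration only: the function names are hypothetical, and the second crossover cut (after gene 8) is an assumption inferred from the children shown in the crossover example, since only the first cut (after the second gene) is stated.

```python
CHROMOSOME_LENGTH = 10  # maximum number of query terms, as stated in the text


def encode_chromosome(query_terms, document_terms, length=CHROMOSOME_LENGTH):
    """Gene i is 1 if the i-th query term occurs in the document, else 0.

    Queries shorter than `length` are padded with zeros.
    """
    doc = {term.lower() for term in document_terms}
    genes = ["1" if term.lower() in doc else "0" for term in query_terms[:length]]
    genes += ["0"] * (length - len(genes))
    return "".join(genes)


def two_point_crossover(parent1, parent2, cut1, cut2):
    """Swap the segment between the two cut points (0-based slice indices)."""
    child1 = parent1[:cut1] + parent2[cut1:cut2] + parent1[cut2:]
    child2 = parent2[:cut1] + parent1[cut1:cut2] + parent2[cut2:]
    return child1, child2


def flip_gene(chromosome, position):
    """One-point mutation: flip the gene at a 1-based position."""
    i = position - 1
    flipped = "1" if chromosome[i] == "0" else "0"
    return chromosome[:i] + flipped + chromosome[i + 1:]


if __name__ == "__main__":
    d1 = ["made", "Answer", "arrive", "ask", "become", "break", "believe",
          "build", "buy", "call", "welcome", "tomorrow", "today", "yesterday",
          "man", "accident", "improve", "car", "bus"]
    q1 = ["dad", "answer", "special", "call", "ask", "differ", "call", "mum"]

    print(encode_chromosome(q1, d1))                       # chromosome 1
    children = two_point_crossover("1010101011", "1001100101", 2, 8)
    print(children)                                        # crossover example
    print([flip_gene(child, 3) for child in children])     # mutation example
```

Run on D1 and Q1, the encoding gives 0101101000, and the operators give the same children and mutated chromosomes as the crossover and mutation examples above.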

4.4 Proposed Algorithm

The overall steps of the proposed algorithm are as follows:

Step 1: Tokenizing (to get the words from the documents).

Step 2: Elimination of all the stop-words.

Step 3: Stemming (to get the root of each word, based on the Porter stemmer).

Step 4: Inverted index (to represent the frequency of each term and where it can be found).

Step 5: Assigning a weight to each term taken from the documents (to give each term a weight for computing a similarity measure using equation (8)).

Step 6: Using two models: the vector space model (VSM) and the extended Boolean model (EBM) (used to describe the document contents in the corpus and to compute the similarity between query and documents).

Step 7: Computing the similarity between documents and query (using the cosine measure).

Step 8: The system takes the first 30 documents retrieved by the IR system (to be the initial population).

Step 9: Representation of the chromosomes (to represent each individual of the population in binary form).

Step 10: Evaluation of the fitness (computing the fitness function to get a better population, using the adaptive cosine fitness and the adaptive Jaccard's fitness).

Step 11: Selection method (used to choose two chromosomes at random, to which the operators are applied).

Step 12: Crossover (used to obtain a new, better solution, based on two points, the first after the second gene).

Step 13: Mutation operator (used to obtain a new, better solution, based on one random point).

Step 14: Evaluation (comparing the two chromosomes produced by mutation with the two chromosomes chosen by the selection method, to keep the better two among them).

The data flow of the proposed algorithm is shown in Figure (4.1) on the following pages.
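Steps 1 to 4 of the list above can be sketched as a small pipeline. This is a simplified illustration under stated assumptions: the stop-word list is a tiny sample, and `simple_stem` is a crude suffix stripper standing in for the Porter stemmer used in the thesis. All names and the two sample documents are hypothetical.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (assumption; a real list is much longer).
STOP_WORDS = {"the", "of", "a", "an", "in", "is", "and", "to"}


def tokenize(text):
    """Step 1: split the text into lower-case alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def simple_stem(word):
    """Step 3 stand-in: strip a few common suffixes (NOT the Porter stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Steps 1-3: tokenize, drop stop-words, stem."""
    return [simple_stem(token) for token in tokenize(text) if token not in STOP_WORDS]


def build_inverted_index(documents):
    """Step 4: map each term to {document id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term in preprocess(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


if __name__ == "__main__":
    docs = {
        1: "The flow of the wings",
        2: "Wing flows and boundary layers",
    }
    for term, postings in sorted(build_inverted_index(docs).items()):
        print(term, postings)
```

The term frequencies stored in the postings are the raw input for the term weighting in Step 5.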

4.5 Pseudocode of Proposed Adaptive Genetic Algorithm

The pseudocode of the adaptive genetic algorithm process is expressed as follows:

1. [Start] Generate the initial population from the top 30 chromosomes as a sample

2. [Fitness] Evaluate the fitness (cosine or Jaccard's) of each individual in the population

3. [New population] Create a new population by repeating the following steps until the end condition is met

A. [Selection] Select two parents at random from the population according to their fitness

B. [Crossover rate] Calculate the crossover rate to find the best rate for the individual according to its fitness, using two-point crossover

C. [Crossover] Cross the two individuals, using the parameter of the individual with the best fitness, to form new offspring

D. [Mutation rate] Calculate the mutation rate to find the best rate for the individual according to its fitness

E. [Mutation] Mutate the new offspring at a random position in the chromosome, using the best parameter

F. [Accepting] Place the new offspring in the new population

G. [Parameter evaluation] Create a new set of parameters for each individual in the population

a. [Selection] Select two parameter settings from the population according to their reinforcement position

b. [Crossover] Cross the two parameter sets, using the parameter of the individual

c. [Mutation] Mutate the new offspring, using the parameter of the individual

d. [Accepting] Place the new offspring in the new parameter-set population

4. [Replace] Use the newly generated population for a further run of the algorithm

5. [Test] If the end condition is satisfied, stop and return the optimized query

6. [Loop] Go to step 2
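The main loop of the pseudocode (steps 1 to 6, without the parameter-evolution step G) can be sketched as below. This is a simplified illustration, not the implementation used in this thesis: the crossover and mutation probabilities are kept fixed instead of being adapted, the fitness is the standard Jaccard's coefficient against a hypothetical target bit string, and all function names are assumptions.

```python
import random


def jaccard_fitness(chromosome, target):
    """Standard Jaccard's coefficient between two equal-length bit strings."""
    both = sum(c == "1" and t == "1" for c, t in zip(chromosome, target))
    either = sum(c == "1" or t == "1" for c, t in zip(chromosome, target))
    return both / either if either else 0.0


def two_point_crossover(p1, p2, rng):
    """Swap the segment between two random cut points."""
    cut1, cut2 = sorted(rng.sample(range(1, len(p1)), 2))
    return (p1[:cut1] + p2[cut1:cut2] + p1[cut2:],
            p2[:cut1] + p1[cut1:cut2] + p2[cut2:])


def mutate(chromosome, rng):
    """Flip one randomly chosen bit."""
    i = rng.randrange(len(chromosome))
    flipped = "0" if chromosome[i] == "1" else "1"
    return chromosome[:i] + flipped + chromosome[i + 1:]


def run_ga(population, target, generations=70, pc=0.6, pm=0.01, seed=0):
    """Return the fittest chromosome after a fixed number of generations."""
    rng = random.Random(seed)
    population = list(population)

    def fitness(chromosome):
        return jaccard_fitness(chromosome, target)

    for _ in range(generations):                       # end condition
        # Small constant keeps every selection weight positive.
        weights = [fitness(c) + 1e-9 for c in population]
        new_population = []
        while len(new_population) < len(population):
            # Fitness-proportional random selection of two parents.
            p1, p2 = rng.choices(population, weights=weights, k=2)
            c1, c2 = p1, p2
            if rng.random() <= pc:
                c1, c2 = two_point_crossover(p1, p2, rng)
            if rng.random() <= pm:
                c1 = mutate(c1, rng)
            if rng.random() <= pm:
                c2 = mutate(c2, rng)
            # Keep the better two among parents and offspring.
            best_two = sorted([p1, p2, c1, c2], key=fitness, reverse=True)[:2]
            new_population.extend(best_two)
        population = new_population[:len(population)]  # replace
    return max(population, key=fitness)


if __name__ == "__main__":
    initial = ["0101101000", "1010111010", "1010011010", "0000000001"]
    print(run_ga(initial, target="1111111111"))
```

The default of 70 generations follows the number of generations reported in the text, and the defaults of 0.6 and 0.01 follow the probabilities used in the crossover and mutation examples.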

Figure (4.1) Data flow for proposed algorithms. [Flowchart: tokenizing, stemming, inverted index, term weights for each document, vector space model / extended Boolean model, document ranking from the IRS, cosine and proposed cosine, initial population, chromosome representation, fitness evaluation, selection method, crossover, mutation, evaluation, stop.]

Chapter five

Experimental results

5.1 Results

The researcher used ten queries in this study, and every query has 8 cases. In other words, the researcher had 80 results. After obtaining the results, the researcher analyzed them using the evaluation criteria. The results of query number one are shown here.
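The evaluation criteria used for these results are precision and recall. As a minimal sketch of the standard definitions (precision is the fraction of retrieved documents that are relevant, recall is the fraction of relevant documents that are retrieved), with a hypothetical function name and made-up document identifiers:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers of the documents returned by the system
    relevant:  identifiers of the documents judged relevant to the query
    """
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical example: 4 documents retrieved, 5 relevant, 2 in common.
    print(precision_recall([1, 2, 3, 4], [2, 4, 6, 8, 10]))
```

Retrieving more documents can only keep recall the same or raise it, but it usually adds non-relevant documents as well, which is why precision tends to fall as recall rises in the tables that follow.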

Table (5.1) shows the results when the adaptive genetic algorithm (AGA) is used in the information retrieval system (IRs) with the vector space model (VSM) and the cosine fitness function. The relevant documents are retrieved in order, but this is a weaker case: the proposed cosine with VSM, shown in table (5.2), gives better results. In all cases, precision decreases as recall increases, because a larger sample of documents is considered.

Table (5.1) Values of precision and recall for query number 1 using (VSM) with (AGA) and the cosine fitness function.

Recall    Precision (%)
0.1       85
0.2       75
0.3       72
0.4       64
0.5       50
0.6       38
0.7       32
0.8       22
0.9       20

Table (5.2) shows the results when the adaptive genetic algorithm (AGA) is used in the information retrieval system (IRs) with the vector space model (VSM) and the proposed cosine fitness function. The relevant documents are retrieved in order; the researcher got better evaluation results and a better degree of similarity, and the average precision is greater than with the traditional cosine. This case has the top precision result, 91% at recall 0.1, because the AGA gives more weight to the query terms and obtains better crossover and mutation probabilities.

Table (5.2) Values of precision and recall for query number 1 using (VSM) with (AGA) and the proposed cosine fitness function.

Recall    Precision (%)
0.1       91
0.2       85
0.3       79
0.4       65
0.5       54
0.6       45
0.7       39
0.8       28
0.9       22

Table (5.3) shows the results when the adaptive genetic algorithm (AGA) is used in the information retrieval system (IRs) with the vector space model (VSM) and Jaccard's fitness function. The researcher gets a better result in the next table (5.4), when the proposed Jaccard's is used. This table also shows that precision decreases as recall increases in all cases. This case gives a worse result, 80% at recall 0.1, compared with the next case.

Table (5.3) Values of precision and recall for query number 1 using (VSM) with (AGA) and Jaccard's fitness function.

Recall    Precision (%)
0.1       80
0.2       70
0.3       60
0.4       58
0.5       43
0.6       33
0.7       29
0.8       21
0.9       19

Table (5.4) shows the results when the adaptive genetic algorithm (AGA) is used in the information retrieval system (IRs) with the vector space model (VSM) and the proposed Jaccard's fitness function. With the proposed Jaccard's fitness function, the researcher got better evaluation results and a better degree of similarity, with a higher average precision. Nevertheless, the best case among the four VSM cases is the proposed cosine.

Table (5.4) Values of precision and recall for query number 1 using (VSM) with (AGA) and the proposed Jaccard's fitness function.

Recall    Precision (%)
0.1       87
0.2       76
0.3       71
0.4       64
0.5       50
0.6       35
0.7       30
0.8       25
0.9       21

Table (5.5) shows the results when the adaptive genetic algorithm (AGA) is used in the information retrieval system (IRs) with the extended Boolean model (EBM) and the cosine fitness function. The result in this case is worse than the next result in table (5.6), where the proposed cosine is used: with the adaptive genetic algorithm, the researcher got better crossover and mutation probabilities and a better fitness function by giving the query terms more weight.

Table (5.5) Values of precision and recall for query number 1 using (EBM) with (AGA) and the cosine fitness function.

Recall    Precision (%)
0.1       82
0.2       72
0.3       62
0.4       60
0.5       48
0.6       35
0.7       30
0.8       20
0.9       19

Table (5.6) shows the results when the adaptive genetic algorithm (AGA) is used in the information retrieval system (IRs) with the extended Boolean model (EBM) and the proposed cosine fitness function. The proposed cosine with VSM has better results in all cases; using VSM means giving the query terms more weight in the fitness function, which effectively yields better results than before.

Table (5.6) Values of precision and recall for query number 1 using (EBM) with (AGA) and the proposed cosine fitness function.

Recall    Precision (%)
0.1       89
0.2       79
0.3       75
0.4       65
0.5       52
0.6       41
0.7       35
0.8       25
0.9       20

Table (5.7) shows the results when the adaptive genetic algorithm (AGA) is used in the information retrieval system (IRs) with the extended Boolean model (EBM) and Jaccard's fitness function. In this case the result is worse than with the proposed Jaccard's in the next table (5.8). Precision decreases as recall increases in all cases, because increasing recall means taking a larger sample of the documents.

Table (5.7) Values of precision and recall for query number 1 using (EBM) with (AGA) and Jaccard's fitness function.

Recall    Precision (%)
0.1       75
0.2       68
0.3       60
0.4       55
0.5       40
0.6       30
0.7       25
0.8       21
0.9       18

Table (5.8) shows the results when the adaptive genetic algorithm (AGA) is used in the information retrieval system (IRs) with the extended Boolean model (EBM) and the proposed Jaccard's fitness function. With the proposed Jaccard's fitness function in EBM, the researcher does not get the best results in evaluation and in degree of similarity. The proposed cosine obtained the best result among the four EBM cases, because the modification of the fitness function together with adaptive crossover and mutation gives a good result compared with previous research. Finally, using VSM with the proposed cosine is shown to be the best case among the eight cases (VSM or EBM with each fitness function).

Table (5.8) Values of precision and recall for query number 1 using (EBM) with (AGA) and the proposed Jaccard's fitness function.

Recall    Precision (%)
0.1       80
0.2       72
0.3       61
0.4       59
0.5       45
0.6       35
0.7       30
0.8       22
0.9       20

5.2 Average Precision using VSM with Cosine

In this case, the researcher compares the use of the vector space model with cosine and with the proposed cosine.

Table (5.9) Average value of precision for all queries using VSM with cosine

Recall    Average precision,    Average precision,      AGA improvement
          cosine (%)            proposed cosine (%)     (%)
0.1       85                    92                      7
0.2       76                    85                      9
0.3       72                    80                      8
0.4       65                    68                      3
0.5       51                    55                      4
0.6       39                    45                      6
0.7       33                    38                      5
0.8       23                    28                      5
0.9       20                    23                      3
average   51.5                  57.1                    5.6
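The average row of Table (5.9) can be checked with a few lines of arithmetic. The sketch below uses only the values in the table; the variable names are illustrative. The improvement at each recall level is the proposed-cosine precision minus the cosine precision, and each average is the sum over the nine recall levels divided by 9.

```python
# Values copied from Table (5.9), one entry per recall level 0.1 ... 0.9.
cosine = [85, 76, 72, 65, 51, 39, 33, 23, 20]
proposed_cosine = [92, 85, 80, 68, 55, 45, 38, 28, 23]


def mean(values):
    """Arithmetic mean over the nine recall levels."""
    return sum(values) / len(values)


# Per-level improvement: proposed minus traditional.
improvement = [p - c for p, c in zip(proposed_cosine, cosine)]

avg_cosine = mean(cosine)               # 464 / 9
avg_proposed = mean(proposed_cosine)    # 514 / 9
avg_improvement = mean(improvement)     # 50 / 9

if __name__ == "__main__":
    print(improvement)
    print(round(avg_cosine, 2), round(avg_proposed, 2), round(avg_improvement, 2))
```

The exact means are approximately 51.56, 57.11 and 5.56, which agree with the 51.5, 57.1 and 5.6 reported in the table to within rounding.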

Figure (5.1): Average value of precision for all queries using VSM-cosine

In this case, the vector space model is used with cosine and with the proposed cosine. As can be seen, the average precision with the proposed cosine is greater than with cosine, and precision decreases as recall increases. Moreover, the degree of improvement is good at the top of the ranking, where a small sample is taken: the average precision at recall 0.1 is good for all queries, but at recall 0.9 it is bad. This good result is obtained because the modification of the fitness function gives more weight to each query term, and because adaptive crossover and mutation give better probabilities. However, as shown in figure (5.1), when using VSM with cosine the researcher has [...]

5.3 Average Precision using VSM with Jaccard's

In this case, the researcher compares the use of the vector space model with Jaccard's and with the proposed Jaccard's.

Table (5.10) Average value of precision for all queries using VSM with Jaccard's

Recall    Average precision,    Average precision,        AGA improvement
          Jaccard's (%)         proposed Jaccard's (%)    (%)
0.1       80                    87                        7
0.2       71                    76                        5
0.3       62                    70                        8
0.4       57                    65                        8
0.5       42                    47                        5
0.6       32                    35                        3
0.7       30                    31                        1
0.8       21                    25                        4
0.9       19                    20                        1
average   46                    50.6                      4.6

Figure (5.2): Average value of precision for all queries using VSM-jaccard's

In this case, the vector space model is used with Jaccard's and with the proposed Jaccard's. As can be seen, the average precision with the proposed Jaccard's is greater than with Jaccard's, and precision decreases as recall increases. The degree of improvement is good at the top of the ranking: the average precision at recall 0.1 is good for all queries, whereas at recall 0.9 it is bad. This result is good because of the modification of the fitness function and the use of adaptive crossover and mutation. But when using VSM with cosine, the researcher has [...]

5.4 Average Precision using EBM with Cosine

In this case, the researcher compares the use of the extended Boolean model with cosine and with the proposed cosine.

Table (5.11) Average value of precision for all queries using EBM with cosine

Recall    Average precision,    Average precision,      AGA improvement
          cosine (%)            proposed cosine (%)     (%)
0.1       79                    89                      10
0.2       70                    75                      5
0.3       61                    70                      9
0.4       59                    65                      6
0.5       44                    49                      5
0.6       33                    36                      3
0.7       30                    32                      2
0.8       21                    25                      4
0.9       19                    21                      2
average   46.2                  51.3                    5.1

Figure (5.3): Average value of precision for all queries using EBM-cosine

In this case, the extended Boolean model is used with cosine and with the proposed cosine. As can be seen, the average precision with the proposed cosine is greater than with cosine. Precision decreases as recall increases, and the degree of improvement is good at the top of the ranking: the average precision at recall 0.1 is good for all queries, but at recall 0.9 it is bad. This good result is due to the modification of the fitness function and the use of adaptive crossover and mutation, but the degree of improvement with EBM is lower than with VSM.

5.5 Average Precision using EBM with Jaccard's

In this case, the researcher compares the use of the extended Boolean model with Jaccard's and with the proposed Jaccard's.

Table (5.12) Average value of precision for all queries using EBM with Jaccard's

Recall    Average precision,    Average precision,        AGA improvement
          Jaccard's (%)         proposed Jaccard's (%)    (%)
0.1       82                    89                        7
0.2       73                    77                        4
0.3       64                    75                        9
0.4       59                    64                        5
0.5       47                    53                        6
0.6       36                    41                        5
0.7       31                    36                        5
0.8       21                    23                        2
0.9       19                    20                        1
average   48                    52.8                      4.8

Figure (5.4): Average value of precision for all queries using EBM-jaccard's

In this case, the extended Boolean model is used with Jaccard's and with the proposed Jaccard's. As can be seen, the average precision with the proposed Jaccard's is greater than with Jaccard's. Precision decreases as recall increases, and the degree of improvement is good at the top of the ranking. The average precision at recall 0.1 is good for all queries, but at recall 0.9 it is bad. This good result is due to the modification of the fitness function and the use of adaptive crossover and mutation.

5.6 Result of precision for (VSM)

The average precision is calculated by dividing the sum of the precision values over all queries by 9 (the number of recall values), using the vector space model.

Table (5.13) Using the vector space model with each fitness function option.

Option                   Cosine    Proposed cosine    Jaccard's    Proposed Jaccard's
Average precision (%)    51.5      57.1               46           50.6

Figure (5.5): Using VSM with function fitness option

Here, the researcher can see that the proposed cosine has the highest score in the chart, followed by cosine, then the proposed Jaccard's, and finally Jaccard's. This means the researcher got the best solution by modifying the fitness functions and using adaptive crossover and mutation probabilities.

5.7 Result of precision for (EBM).

The arithmetic mean is calculated by dividing the sum of the precision values over all queries by 9 (the number of recall values), using the extended Boolean model.

Table (5.14) Using the extended Boolean model with each fitness function option.

Option                   Cosine    Proposed cosine    Jaccard's    Proposed Jaccard's
Average precision (%)    46.2      51.3               48           52.8

Figure (5.6): Using EBM with function fitness

Here, the researcher can see that proposed Jaccard's has the highest score in the chart; proposed cosine is second, Jaccard's is third and cosine is fourth. When comparing all cases, however, proposed cosine with VSM (VSM-proposed cosine) is the best solution and gives the researcher the best result.

This means the researcher got the best solution when modifying the fitness function used to compute similarity and using the crossover and mutation probability operators. By evaluating these results with precision, the researcher can identify the best solution.

5.8 Comparison between improvement percentages

Table (5.15) compares the improvement strategies and shows the degree of improvement for each IR model with each fitness function.

Table (5.15) Comparison between improvements.

Model   Average improvement,    Average improvement,
        proposed cosine (%)     proposed Jaccard's (%)
VSM     5.6                     4.6
EBM     5.1                     4.8

The degree of improvement of proposed cosine over cosine, with both VSM and EBM, shows that the adaptive genetic algorithm gives more accuracy in a working system.

From these results, the best improvement is obtained when using VSM with proposed cosine, because the adaptive genetic algorithm is used. The degree of improvement in all cases, for each information retrieval model (VSM, EBM) with each fitness function, is as follows:

A. VSM with proposed Cosine (5.6%)
B. EBM with proposed Cosine (5.1%)
C. EBM with proposed Jaccard's (4.8%)
D. VSM with proposed Jaccard's (4.6%)

Ranking of the results based on average precision: the best-ranked result is obtained when using VSM with proposed cosine, because the adaptive genetic algorithm is used. The average precision for all cases, for each information retrieval model (VSM, EBM) with each fitness function, is as follows:

A. (VSM)- proposed Cosine (57.1%).
B. (EBM)- proposed Jaccard's (52.8%).
C. (VSM)- Cosine (51.5%).
D. (EBM)- proposed Cosine (51.3%).
E. (VSM)- proposed Jaccard's (50.6%).
F. (EBM)- Jaccard's (48%).
G. (EBM)- Cosine (46.2%).
H. (VSM)- Jaccard's (46%).
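The improvement figures and the ranking above can be re-derived from the average precision values in Tables (5.13) and (5.14). The following is a minimal sketch; the dictionary layout and the helper name `improvement` are illustrative assumptions.

```python
# Average precision (%) per model and fitness function, from Tables (5.13) and (5.14).
avg_precision = {
    ("VSM", "cosine"): 51.5,
    ("VSM", "proposed cosine"): 57.1,
    ("VSM", "Jaccard's"): 46.0,
    ("VSM", "proposed Jaccard's"): 50.6,
    ("EBM", "cosine"): 46.2,
    ("EBM", "proposed cosine"): 51.3,
    ("EBM", "Jaccard's"): 48.0,
    ("EBM", "proposed Jaccard's"): 52.8,
}


def improvement(model, base):
    """Percentage-point gain of the proposed variant over its baseline."""
    gain = avg_precision[(model, "proposed " + base)] - avg_precision[(model, base)]
    return round(gain, 1)  # round away floating-point noise


improvements = {
    (model, base): improvement(model, base)
    for model in ("VSM", "EBM")
    for base in ("cosine", "Jaccard's")
}
ranking = sorted(avg_precision.items(), key=lambda kv: kv[1], reverse=True)

# Degree of improvement, largest first.
for (model, base), gain in sorted(improvements.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model} with proposed {base}: {gain}%")

# Rank of all eight cases by average precision.
for rank, ((model, fitness), value) in enumerate(ranking, start=1):
    print(f"{rank}. ({model}) {fitness}: {value}%")
```

Computed this way, the improvements agree with Table (5.15): 5.6 and 4.6 for VSM, 5.1 and 4.8 for EBM.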

Chapter six

Conclusion, recommendations and future work

6.1 Introduction

In this thesis, the researcher proposed an adaptive genetic algorithm (AGA) to enhance information retrieval systems (IRS) using several fitness functions (cosine, proposed cosine, Jaccard's, proposed Jaccard's) on the vector space model (VSM) and the extended Boolean model (EBM).
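For orientation, the two baseline fitness functions can be sketched in their standard textbook form for weighted term vectors. This is a minimal sketch under that assumption; the proposed (modified) variants evaluated in this thesis are not reproduced here, and the function names and example vectors are hypothetical.

```python
import math


def cosine_similarity(doc, query):
    """Standard cosine measure between two equal-length term-weight vectors."""
    dot = sum(d * q for d, q in zip(doc, query))
    norm = math.sqrt(sum(d * d for d in doc)) * math.sqrt(sum(q * q for q in query))
    return dot / norm if norm else 0.0


def jaccard_similarity(doc, query):
    """Standard (extended) Jaccard coefficient for term-weight vectors."""
    dot = sum(d * q for d, q in zip(doc, query))
    denom = sum(d * d for d in doc) + sum(q * q for q in query) - dot
    return dot / denom if denom else 0.0


# Hypothetical binary term vectors for one document and one query.
doc = [1, 0, 1]
query = [1, 1, 0]
print(cosine_similarity(doc, query))
print(jaccard_similarity(doc, query))
```

In a genetic algorithm for retrieval, such a similarity score between a candidate query (chromosome) and the relevant documents typically serves as the fitness value.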

6.2 Conclusion

In this thesis, the researcher reached the following conclusions:

1. When using adaptive crossover and mutation operators with VSM-proposed cosine to retrieve relevant documents, the researcher obtained the best result, with an average precision of 57.1%.
2. Generating a better initial population gives the best result when retrieving relevant documents.
3. When using VSM-proposed cosine, the researcher got a high degree of similarity for each relevant document.
4. Using adaptive crossover and mutation with EBM-proposed cosine to retrieve relevant documents gives the second-best improvement (5.1%), with an average precision of 51.3%.
5. The best result when using VSM is obtained with proposed cosine, with an average precision of 57.1%.
6. The best result when using EBM is obtained with proposed Jaccard's, with an average precision of 52.8%.
7. When comparing cosine (average precision 51.5%) with proposed cosine (average precision 57.1%) on VSM, the best result is obtained with proposed cosine, an improvement of 5.6%.
8. When comparing cosine (average precision 46.2%) with proposed cosine (average precision 51.3%) on EBM, the best result is obtained with proposed cosine, an improvement of 5.1%.
9. When comparing Jaccard's (average precision 46%) with proposed Jaccard's (average precision 50.6%) on VSM, the best result is obtained with proposed Jaccard's, an improvement of 4.6%.
10. When comparing Jaccard's (average precision 48%) with proposed Jaccard's (average precision 52.8%) on EBM, the best result is obtained with proposed Jaccard's, an improvement of 4.8%.
11. The highest rate of improvement is obtained when using VSM with proposed cosine (5.6%).

6.3 Future work

After finishing this study, which mainly aims at improving information retrieval using genetic and adaptive genetic algorithms, the researcher has the following suggestions:

1. In this study, the researcher used two models: the Vector Space Model and the Extended Boolean Model. In future work, the Probabilistic Model, the Language Model and the Fuzzy Model may be used.

2. In this study, the researcher used two fitness functions: the cosine fitness function and Jaccard's fitness function. In future work, the Horng & Yeh formula and Ochiai's fitness function may be used.

3. In this study, the researcher used adaptive crossover and adaptive mutation. In future work, another adaptation method can be used.

4. In this study, the researcher used an adaptive cosine fitness function and an adaptive Jaccard's fitness function. In future work, another way of adapting the fitness function can be used.

5. In this study, the researcher used random selection, two-point crossover and random mutation. In future work, other types of selection, such as Roulette Wheel Selection, Elitism Selection and Rank Selection, can be used. Furthermore, other types of crossover, such as single-point and uniform crossover, can be used. Finally, other types of mutation, such as uniform mutation, can be used.
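As an illustration of the operators named in suggestion 5, the following is a minimal sketch of two-point crossover (the type used in this study) and roulette wheel selection (one of the suggested alternatives). The list-based chromosome representation and the function names are assumptions, not the implementation of this thesis.

```python
import random


def two_point_crossover(parent_a, parent_b, rng=random):
    """Swap the segment between two random cut points of two equal-length parents."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    i, j = sorted(rng.sample(range(len(parent_a) + 1), 2))
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b


def roulette_wheel_select(population, fitnesses, rng=random):
    """Pick one individual with probability proportional to its fitness."""
    return rng.choices(population, weights=fitnesses, k=1)[0]


# Hypothetical binary chromosomes (query term present or absent).
rng = random.Random(0)
parent_a = [0, 0, 0, 0, 0, 0]
parent_b = [1, 1, 1, 1, 1, 1]
child_a, child_b = two_point_crossover(parent_a, parent_b, rng)
print(child_a, child_b)
print(roulette_wheel_select([parent_a, parent_b], [0.2, 0.8], rng))
```

Unlike random selection, roulette wheel selection biases the choice of parents toward fitter individuals while still giving weaker ones a chance of being chosen.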
