Analysis Fraud

AAAAAAAAAAAA iiiiiiiiiiii gggggggggggg llllllllllll iiiiiiiiiiii RRRRRRRRRRRR iiiiiiiiiiii zzzzzzzzzzzz oooooooooooo uuuuuuuuuuuu

SSSSSSSSUUUUUUUUPPPPPPPPEEEEEEEERRRRRRRRVVVVVVVVIIIIIIIISSSSSSSSOOOOOOOORRRRRRRRSSSSSSSS

PPPPPPPPRRRRRRRROOOOOOOOFFFFFFFF........ JJJJJJJJ ........ SSSSSSSSOOOOOOOOLLLLLLLLDDDDDDDDAAAAAAAATTTTTTTTOOOOOOOOSSSSSSSS

PPPPPPPPRRRRRRRROOOOOOOOFFFFFFFF........ IIIIIIII ........ CCCCCCCCHHHHHHHHRRRRRRRRIIIIIIIISSSSSSSSTTTTTTTTOOOOOOOOUUUUUUUU

MMMMAS TE R O F A S T E R O F A S T E R O F A S T E R O F SSSS C I ENC E I N C I ENC E I N C I ENC E I N C I ENC E I N IIII N FORMAT I ON N FORMAT I ON N FORMAT I ON N FORMAT I ON

&&&& TTTT E L E COMMUN ICA T I ON S E L E COMMUN ICA T I ON S E L E COMMUN ICA T I ON S E L E COMMUN ICA T I ON S TTTT E CHNO LOG I E SE CHNO LOG I E SE CHNO LOG I E SE CHNO LOG I E S

ATHENSATHENSATHENSATHENS ,,,, OCTOBER OCTOBER OCTOBER OCTOBER 2010201020102010

Analysis of Fraud Detection

Aigli Rizou – MSITT 2010 2

ABSTRACTABSTRACTABSTRACTABSTRACT

Fraud plays a leading role in most aspects of social and economical life

worldwide. The expansion of modern technology resulted in the facilitation of

our daily activities, but rendered our lives more vulnerable to fraud attacks.

Intense financial pressure during the economic crisis has, also, led to a bulge

in fraud and increasing number of victims. Banking, telecommunications

insurance, internet and enterprises are the sectors which suffer significant

losses and they are the basic parts the current thesis disserts. The

economical impact of fraud made fraud prevention and detection a dire

necessity. Fraud detection theoretical background stems from scientific fields,

such as data mining, machine learning, artificial intelligence and statistics.

Supervised and unsupervised learning algorithms have contributed to the

evolvement of fraud detection. Moreover, several academic studies have

indulged in the research of various promising fraud techniques, with a view to

encounter the evolving nature of fraud. The presentation of a real-life fraud

detection system of a Greek Bank aims at giving a more practical view of the

problem. A short description of the system reveals the way the Bank deals

with fraud cases. Based on a real data set labeled by the system, another

machine learning tool is used in order to check the reliability of the running

supervised algorithms. Finally, the proposed fraud detection solution aims at

offering a robust fraud detection system with improved performance,

constituting a subject for future work.

Keywords: fraud detection, fraud losses, classification, clustering, confusion

matrix, false alarm rate, classifier ensembles, class label



ACKNOWLEDGMENTSACKNOWLEDGMENTSACKNOWLEDGMENTSACKNOWLEDGMENTS

I w o u l d l i ke t o e x p r e s s m y g r a t i t u d e a n d a p p r e c i a t i o n t o

m y s u p e r v i s o r s , P r o f . J . S o l d a t o s a n d P r o f . I . C h r i s t o u ,

w h o s e v a l u a b l e g u i d a n c e a n d h a r mo n i o u s c o l l a b o r a t i o n

e n c o u r a g e d m e t o d i s c o v e r u n kn o w n a s p e c t s o f t h e

s u b j e c t .



DECLARATIONDECLARATIONDECLARATIONDECLARATIONSSSS

I , A i g l i R i z o u , d e c l a r e t h a t t h e w o r k p r e s e n t e d i n t h i s

t h e s i s w a s c a r r i e d o u t i n a c c o r d a n c e w i t h t h e r e g u l a t i o n s

o f t h e A t h e n s I n f o r m a t i o n T e c h n o l o g y I n s t i t u t e . A n y

v i e w s e x p r e s s e d i n t h i s d i s s e r t a t i o n a r e t h o s e o f t h e

a u t h o r a n d i n n o w a y r e p r e s e n t t h o s e o f t h e A t h e n s

In f o r m a t i o n Te c h n o l o g y ( A I T ) I n s t i t u t e .

Aigli Rizou

22/10/2010

Th e w o r k c o n t a i n e d i n t h i s t h e s i s “ A n a l ys i s o f F r a u d

D e t e c t i o n ” b y A i g l i R i z o u h a s b e e n c a r r i e d o u t u n d e r m y

s u p e r v i s i o n .

J. Soldatos I. Christou

22/10/2010 22 /10/2010

Athens Information Technology Athens Information Technology



TABLE OF CONTENTSTABLE OF CONTENTSTABLE OF CONTENTSTABLE OF CONTENTS

1 . I N T R O D UC T I O N .............................................................................11

1 . 1 . H i s t o r y ............................................................................................11

1 . 2 . O b j e c t i v e .......................................................................................12

1 . 3 . S t r u c tu r e .......................................................................................13

2 FRAUD DETECTION (FD) OVERVIEW .................................................15

22 .. 11 .. FF rr aa uu dd DD ee ff ii nn ii tt ii oo nn .......................................................................15

2 . 2 . F r a u d D e t e c t i o n & P r e v e n t i o n .........................................15

2 . 3 . F r a u d T y p e s ................................................................................16

2 . 3 . 1 . B a n k i n g ...............................................................................16

2 . 3 . 2 . In s u r a n c e ...........................................................................18

2 . 3 . 3 . In t e r n e t ................................................................................18

2 . 3 . 4 . T e l e c o m m u n i c a t i o n s ...................................................19

2 . 3 . 5 . E n t e r p r i s e s .......................................................................19

2 . 3 . 6 . G e n e r a l ................................................................................19

2 . 4 . F r a u d T e c h n i q u e s ....................................................................20

2 . 4 . 1 . B a n k i n g ...............................................................................20

2 . 4 . 2 . In s u r a n c e ...........................................................................22

2 . 4 . 3 . In t e r n e t ................................................................................23


2 . 4 . 5 . E n t e r p r i s e s .......................................................................25

2 . 5 . F r a u d s t e r s T y p e ........................................................................26

2 . 6 . E c o n o m i c a l I m p a c t o f F r a u d .............................................27

2 . 6 . 1 . G e n e r a l ................................................................................28

2 . 6 . 2 . B a n k i n g ...............................................................................30

2 . 6 . 2 . 1 . C r e d i t / D e b i t c a r d s ...................................................30

2 . 6 . 2 . 2 . I d e n t i t y T h e f t ..............................................................34

2 . 6 . 3 . E n t e r p r i s e s .......................................................................34

2 . 6 . 4 . In s u r a n c e ...........................................................................36

2 . 6 . 5 . In t e r n e t ................................................................................36

2 . 6 . 5 . 1 . A d v a n c e F e e F r a u d – 4 1 9 S c a m ......................37




2 . 7 . D i f f i c u l t i e s i n F D .....................................................................39

2 . 8 . F D S y s t e m R e q u i r e m e n t s ....................................................41

2 . 8 . 1 . B u s i n e s s R e q u i r e m e n t s ............................................41

2 . 8 . 2 . T e c h n i c a l R e q u i r e m e n t s ...........................................41

2 . 8 . 3 . F u n c t i o n a l R e q u i r e m e n t s .........................................42

2.9. P e r f o r m a n c e M e t r i c s ..............................................................42

2 . 9 . 1 . R e c e i v e r O p e r a t i n g C h a r a c t e r i s t i c ( R O C ) .....44

3 THEORETICAL PERSPECTIVE ............................................................46

3.1. F D M e t h o d s ..................................................................................46

3 . 2 . S u p e r v i s e d & U n s u p e r v i s e d L e a r n i n g M e t h o d s ....46

3 . 2 . 1 . C l a s s i f i c a t i o n ..................................................................47

3 . 2 . 1 . 1 . D e c i s i o n T r e e ( D T ) ..................................................48

3 . 2 . 1 . 1 . 1 . C 4 . 5 ( J 4 8 ) ..............................................................49

3 . 2 . 1 . 2 . A r t i f i c i a l N e u r a l N e t w o r k s ( A N N ) ...................50

3 . 2 . 1 . 3 . F u z z y L o g i c ( F L ) .......................................................53

3 . 2 . 1 . 4 . F u z z y N e u r a l Ne t w o r k ( F N N) .............................54

3 . 2 . 1 . 5 . N a ï v e B a y e s ( N B ) .....................................................55

3 . 2 . 1 . 6 . S u p p o r t V e c t o r M a c h i n e s ( S V M ) .....................55

3 . 2 . 2 . L i n e a r a n d L o g i s t i c R e g r e s s i o n ..........................57

3 . 2 . 3 . C l u s t e r i n g ..........................................................................57

3 . 2 . 3 . 1 . O u t l i e r D e t e c t i o n ......................................................57

3 . 2 . 4 . Me t a - l e a r n i n g ..................................................................58

3 . 2 . 4 . 1 . B a g g i n g ( B o o t s t r a p A g g r e g a t i n g ) ..................59

3 . 2 . 4 . 2 . S t a c k i n g ( S t a c k e d G e n e r a l i z a t i o n ) ...............60

3 . 2 . 4 . 3 . B o o s t i n g .......................................................................60

4 ACADEMIC PERSPECTIVE ..................................................................61

4 . 1 . S c i e n t i f i c R e s e a r c h ................................................................61

4 . 1 . 1 . C a r d F D ...............................................................................61

4 . 1 . 1 . 1 . E x p e r i m e n t 1 - D e s c r i p t i o n ................................61

4 . 1 . 1 . 2 . E x p e r i m e n t 1 – R e s u l t s ........................................63

4 . 1 . 2 . In s u r a n c e F D ....................................................................66




4 . 1 . 2 . 2 . E x p e r i m e n t 1 - R e s u l t s .........................................68

4 . 1 . 3 . T e l e c o m m u n i c a t i o n s F D ............................................69


4 . 1 . 3 . 2 . E x p e r i m e n t 1 - R e s u l t s .........................................72

5 PRACTICAL PERSPECTIVE .................................................................76

5 . 1 . B a n k An t i - F r a u d S y s t e m ......................................................76

5 . 1 . 1 . T h e S o f t w a r e ....................................................................78

5 . 1 . 1 . 1 . S e r v i c e C o n s u m e r T i e r .........................................78

5 . 1 . 1 . 2 . S e r v i c e P r o v i d e r T i e r ............................................79

5 . 1 . 1 . 3 . C l i e n t T i e r .....................................................................79

5 . 1 . 1 . 4 . C o m m u n i c a t i o n ..........................................................79

5 . 1 . 2 . F L S o f t w a r e ......................................................................80

5 . 1 . 2 . 1 . R i s k S h i e l d P r o j e c t ..................................................82

5 . 1 . 2 . 1 . 1 . I n p u t V a r i a b l e s ...................................................82

5 . 1 . 2 . 1 . 2 . F i n g e r p r i n t s .........................................................83

5 . 1 . 2 . 1 . 3 . C a l c u l a t i o n U n i t s ..............................................85

5 . 1 . 2 . 1 . 4 . O u t p u t V a r i a b l e s ...............................................85

5 . 1 . 2 . 1 . 5 . D e c i s i o n V a r i a b l e .............................................85

5 . 1 . 2 . 1 . 6 . C a s e Ma n a g e m e n t .............................................86

5 . 2 . W a i k a t o E n vi r o n m e n t f o r K n o w l e d g e An a l y s i s ( W E K A) ......................................................................................................86

5 . 2 . 1 . P r e p r o c e s s ........................................................................87

5 . 2 . 1 . 1 . D a t a s e t ..........................................................................88

5 . 2 . 2 . C l a s s i f i c a t i o n ..................................................................89

5 . 2 . 2 . 1 . P e r f o r m a n c e M e t r i c s ..............................................90

5 . 2 . 3 . E x p e r i m e n t s .....................................................................93

5 . 3 . R e s u l t s ...........................................................................................94

5 . 4 . C o n c l u s i o n s & F u t u r e W o r k ..............................................99

R E F E R E N C E S ......................................................................................103

A P P E N D I X A .........................................................................................109

A P P E N D I X B .........................................................................................124



LIST OF FIGURESLIST OF FIGURESLIST OF FIGURESLIST OF FIGURES

Figure 1: Bar chart of fraud types from 51 unique and published FD papers [76].................................................................................................................20

Figure 2: Hierarchy chart of white-collar crime perpetrators from both firm-level and community-level perspectives [76]..................................................26

Figure 3: Breakdown of fraud losses in UK for 2008 according to NFA [48] ..30

Figure 4: Breakdown of card losses in UK during 2008 according to FFA [48].......................................................................................................................31

Figure 5: UK Fraud loss ratios by card type according to Visa estimates for 2008 [16]. .......................................................................................................31

Figure 6: Comparative overview of European countries based on Visa estimates in 2008 [16]....................................................................................32

Figure 7: Fraud loss ratios of 2008 according to Visa [16]. ............................33

Figure 8: Card fraud losses in the US for 2006..............................................33

Figure 9: Distribution of occupational fraud losses worldwide according to ACFE [25]. .....................................................................................................34

Figure 10: Proportion analysis per fraud category according to ACFE [25]....35

Figure 11: Median occupational fraud losses by category for 106 nations according to ACFE [25]..................................................................................35

Figure 12: Annual dollar loss of referred complaints according to IC3 (in millions) [46]...................................................................................................36

Figure 13:Advance fee fraud losses for Greece (419 Unit of Ultrascan AGI) [54].................................................................................................................37

Figure 14: Advance fee fraud losses worldwide in 2009 in million $ (Ultrascan AGI) [54].........................................................................................................38

Figure 15: Receiver Operating Characteristic curve & the Area Under the Curve [84]. .....................................................................................................45

Figure 16: A Simple Linear Classification Boundary for the Loan Data Set, where the shaped region denotes class no loan [85]. ....................................48

Figure 17: An indicative DT example [61]. .....................................................49

Figure 18: A human neuron forming a chemical synapse [66]. .....................50

Figure 19: Artificial Neuron (Perceptron) [66].................................................51

Figure 20: Three-Layer feedforward ANN......................................................52

Figure 21: A multilayer perceptron .................................................................53



Figure 22: Membership function of the linguistic variable “amount” in FL ......54

Figure 23: Margins and Support Vectors in a two-dimensional example [65].56

Figure 24: Example of clustering [85].............................................................58

Figure 25: Graphical representation of bagging. ............................................59

Figure 26: System architecture of combined discriminant analysis and ANN approach [41]. ................................................................................................67

Figure 27: The vector of comparison [30].......................................................70

Figure 28: An example of similarity vectors of 3 user profile – test sets (Group 1) [30].............................................................................................................72

Figure 29: Plot of similarity probability between accounts against the data used in the test set [30]..................................................................................74

Figure 30: RiskShield architecture [79]. .........................................................78

Figure 31: A simplified FL environment – fuzzyTECH software .....................81

Figure 32: RiskShield-Client project...............................................................83

Figure 33: Fingerprint.....................................................................................84

Figure 34: Preprocess tab of WEKA ..............................................................87

Figure 35: arff file ...........................................................................................88

Figure 36: Classify tab of WEKA – The results of the application ZeroR classifier are shown on the right part .............................................................90

Figure 37: LOF proposed solution [84].........................................................100

Figure 38: The general framework for combining outlier detection techniques [84]...............................................................................................................101

LIS T O F TABLES

Table 1: Annual Fraud losses per fraud sector and country...........................29

Table 2: Cost model assuming a fixed overhead [17]. ...................................63

Table 3:Cost and savings in the credit card fraud domain using class-combiner (cost ± 95% confidence interval) [17]. ............................................64

Table 4: Results on knowledge sharing and pruning [17]...............................65

Table 5: Confusion matrix for binary problems...............................................91



1.1.1.1. INTRODUCTIONINTRODUCTIONINTRODUCTIONINTRODUCTION

1 .1.1.1.1.1.1.1. H is toryHis toryHis toryHis tory

The phenomenon of fraud dates back centuries ago and it’s believed that its

presence coincides with the dawn of commerce. A true history of fraud has

been recorded in 300 B.C. in Greece, when a merchant named Hegestratus

took out a large insurance policy known as bottomry. In essence, the

merchant borrowed money and agreed to pay back with interest as soon as

the cargo (i.e. corn) was delivered. If the loan was not paid back, then the

lender had the right to acquire the boat and the cargo as well. Hegestratus

decided to sink his empty boat, keep the loan and sell the corn. However, he

didn’t manage to deceive the lender, as he drowned when trying to escape his

crew passengers when they caught him in the act. This is regarded as the first

recorded incident of insurance fraud worldwide [9].

Fraud, undoubtedly, keeps up with the social, economical and technological

evolution and thus it appears with different intensity and form, depending on

the epoch. The expansion of modern technology and the global

superhighways of communications in combination with the fraudsters’

“professionalism” have led to a dramatic increase on fraud incidents and fraud

losses.

The modern trend is the appearance of new fraud forms on the scene, as an

attempt to establish the financial crime as a part of the organized crime.

Financial crime is a global phenomenon and poses a threat not only to

organizations and businesses, but also to individuals, through international

organized bands of criminals, who take advantage of the sophisticated means

of technology and of course worldwide web. As a result, fraudsters are no

INTRODUCTION



more naïve and entrepreneurs but more cautious and intelligent, developing

new adaptive ways to deceive the potential victims.

The fraud loss of billion of dollars worldwide each year has sparked a search

for effective countermeasures to those who exploit security vulnerabilities to

commit any kind of fraud. Under this scope, fraud prevention and detection

technologies have become an imperative need. However, the effectiveness of

these techniques is based on their flexibility towards fraudsters’ evolving

behavior.

1.2.1.2.1.2.1.2. Object iveObject iveObject iveObject ive

The present study addresses the fraud detection issue, focusing in particular

fraud sectors, such as banking, insurance, internet and telecommunications.

The citation of some statistical figures worldwide indicates the recent aspect

of the problem. Irrespective of the fraud type and the obstacles during fraud

detection, the requirements of each fraud detection system are common.

Various scientific areas offer the means for developing a number of fraud

detection methods, which are applicable in real life scenarios. In addition,

some indicative scientific experiments are presented in order to provide the

results of the utilization of the previous methods and to give motivations for

further research.

The final part of the thesis includes the analysis of a real life fraud detection

scenario of application fraud and the description of the implemented software

in order to explain the operation and the requirements of a real fraud detection

system. In the context of this thesis, a number of algorithms have been

applied as part of another kind of open source machine learning tool.

Providing that the real fraud detection system exhibits a high degree of

accuracy, a comparative evaluation of the algorithms’ performance is

accomplished. The ultimate goal is to propose an optimized solution, which

will yield improved results in real life fraud detection systems.



1.3 .1.3 .1.3 .1.3 . S t ruc tureSt ruc tureSt ruc tureSt ruc ture

The present thesis has been organized in the following way.

Chapter 2 introduces the fraud concept through various definitions. Banking,

telecommunications, insurance, internet and enterprises are the sectors under

consideration. Depending on fraud type, different techniques have been

developed so far by the perpetrators. This chapter, also, describes the typical

characteristics of fraudsters’ profile. In order to highlight the magnitude of the

problem, figures concerning fraud losses of Europe and the U.S. for the last

years are quoted. In addition, chapter 2 refers to the obstacles during FD so

that it brings out the complexity of the problem. The last part includes the

requirements of a reliable FD system as well as the metrics used for the

evaluation of its performance.

Chapter 3 provides a theoretical background of FD, introducing fundamental

concepts, necessary for the next chapters. The existing FD methods, used in

modern FD tools, proceed from various scientific fields and they are divided

into two categories, supervised and unsupervised methods. A special

reference to metalearning algorithms is given, since they have proven to be

very effective means of FD.

Chapter 4 contains a number of experiments carried out in the scientific field,

divided per fraud sector as in chapter 2. The experiments exploit the

algorithms of chapter 3, providing useful results for future considerations.

Chapter 5 constitutes the practical part of the thesis and describes a real FD

system, implemented by a Greek bank, which detects fraud behaviors among

loan applications. In the first place, the real application fraud data, provided by

the bank, are loaded in the system and the results are recorded. Next, the

same data set is loaded in an open source machine learning tool in order to

record the results of the running algorithms. The comparison of both result

sets helps to draw conclusions for the effectiveness of the algorithms as a



standalone tool. Finally, a potential ideal FD solution is proposed as an object

for future work.



2222 FRAUD DFRAUD DFRAUD DFRAUD DETECTION OVERVIEWETECTION OVERVIEWETECTION OVERVIEWETECTION OVERVIEW

22222222 ........ 11111111 ........ FFFFFFFF rrrrrrrr aaaaaaaa uuuuuuuu dddddddd DDDDDDDD eeeeeeee ffffffff iiiiiiii nnnnnnnn iiiiiiii tttttttt iiiiiiii oooooooo nnnnnnnn

Fraud is the crime of obtaining money by deceiving people (Cambridge

Advanced Learner’s Dictionary).

Fraud is a criminal deception; the use of false representations to gain an

unjust advantage (Concise Oxford Dictionary).

Fraud is an intentional deception made for personal gain or to damage

another individual and undoubtedly it’s considered to be a crime and a civil

law violation [2].

Fraud is an intentional act meant to induce another person to part with

something of value, or to surrender a legal right. It is a deliberate

misrepresentation or concealment of information in order to deceive or

mislead [31].

Fraud occurs in most of the areas of human endeavour, causing significant

financial losses not only to the individuals but also to various enterprises. No

matter in which domain fraudsters commit fraud, their primary motivation is

money and secondarily power, peer regard, appreciation and greed.

2.2.2.2.2.2.2.2. Fraud Detect ionFraud Detect ionFraud Detect ionFraud Detect ion & Prevent ion& Prevent ion& Prevent ion& Prevent ion

As fraud increases dramatically with the expansion of modern technologies,

there is an urgent need that sophisticated technologies and fraud experts’

knowledge should be combined in order to ensure against fraud attacks.

FRAUD

DETECTION (FD)

OVERVIEW



Nowadays, individuals, organizations or companies apply various fraud

prevention and detection methods, aiming at minimizing their losses as soon

as possible.

In particular, fraud prevention involves measures to inhibit fraud at an early

stage, such as personal identification number for bank cards, chip-based EMV

payment cards, Internet security systems for credit card transactions,

Subscriber Identity Module (SIM) cards for mobile phones, laminated metal

strips and holographs on banknotes etc. However, none of these measures

acts as panacea in practice. What is more, there should be a trade-off

between expense and inconvenience (e.g. to a customer) on the one hand

and effectiveness on the other.

Unlike prevention, fraud detection implies identifying fraud as soon as

possible once it has been perpetrated. FD comes into effect, after fraud

prevention has failed. Hence, FD must be applied constantly, since failure of

fraud prevention is not always verified. For example, although individuals

guard their cards against fraudsters very meticulously, card’s data can be

stolen and then it’s crucial to be able to detect as fast as possible that fraud is

being committed.

2.3.2.3.2.3.2.3. Fraud TypesFraud TypesFraud TypesFraud Types

There are at least as many types of fraud as there are types of people who

commit it, but in every case the deception is their common denominator. The

common fraud types per sector are the following.

2 .3 .1 .2 .3 .1 .2 .3 .1 .2 .3 .1 . Bank ing Bank ing Bank ing Bank ing

Bank fraud is an attempt to deceptively earn money, assets or property owned

or held by a financial institution [2]. In this case, not only banks but also

millions of people fall victim to monetary damages caused by bank fraud.

There are countless ways that bank fraud can occur, but only two main



categories can be distinguished, insider and outsider bank fraud.

Insider Bank Fraud is perpetrated by people who work inside or have access

to restricted areas of information inside of the financial institution [4]. This type

of fraud is difficult for banks to combat, since the number of people who hold

positions with responsibilities for handling large amount of money is

significantly high. Hence, there is an urgent need for banks to constantly

update security measures. Here are some of the common forms of insider

fraud.

Illegal Insider Trading: When someone has the authority to make

investments on behalf of the bank without the bank being aware of it. This

type of fraud may lead to an irreparable damage for the bank.

Identity theft: It is often the case that a bank employee uses customers’

personal information with a view to selling this information or making

fraudulent purchases.

Fraudulent loans: When a loan officer within a bank forges documents,

creates false entities or lies about the ability of the applicant to repay in order

to “borrow” a sum of money from the bank that they never intend to repay.

Wire Fraud: There are cases where insiders attempt to use fraudulent or

forged documents which claim to request a bank depositor’s money be wired

to another bank often an offshore account in some distant foreign country. It

may take a bank months or even longer to notice the missing funds.

Outsider bank fraud or fraud perpetrated by outside parties is not limited to

persons working inside the financial institutions. Some of the common ways to

accomplish this form of fraud are the following [4].

Debit/Credit card fraud: It’s described as the unauthorized use of a

debit/credit card to obtain goods of value. It includes counterfeiting cards,

using lost or stolen cards and fraudulently acquiring credit cards through the

mail. In all of these cases, the fraudster uses a physical card, but physical



possession is not essential to perpetrate credit card fraud. Typical case is the

“cardholder-not- present” fraud, where only the card’s details are given (e.g.

over the phone) [44]. Apart from internet, card fraud occurs in ATM and POS

transactions.

Identity theft: It’s considered to be one of the most popular schemes today

and occurs when someone steals the identity of another person to perform an

illegal action. Fraudsters obtain useful information from a variety of sources,

such as the victim’s wallet, trash, fake websites of internet, fake documents

etc [6]. Identity theft is strongly connected with other types of fraud, such as

application fraud, which is described subsequently.

Application fraud: It refers to the theft of an individual’s personal data such

as name, address, telephone and mobile numbers, id number, passports,

social security number and their use in financial credit products applications in

someone’s name. Credit cards, bank accounts, loans are examples of such

applications which are recorded fraudulently in victim’s name, leaving her/him

liable for any resulting charges and fees [5].

Money Laundering: It involves the investment or transfer of money from

racketeering, drug transactions or other embezzlement schemes so that the

original source either is concealed or it appears to be legitimate [6].

Purchasing and selling securities, using the funds as collateral on the loans,

and even writing off the money as business expenses are all common forms

of money laundering.

2 .3 .2 .2 .3 .2 .2 .3 .2 .2 .3 .2 . I ns ur anceIns ur anceIns ur anceIns ur ance

It’s any act perpetrated with the fraudulent intent to obtain payment from an

insurer agent. Although insurance fraud is not a highly visible crime, it costs

insurance companies great deal of money annually [2, 3].

2 .3 .3 .2 .3 .3 .2 .3 .3 .2 .3 .3 . I n t er ne tIn t er ne tIn t er ne tIn t er ne t

This type of fraud varies and it’s intended to intercept, view or redirect



confidential information about the client and the client’s financial information in

order to compromise accounts and commit fraud. A common practice is to

create fake websites which deceive clients, extorting them great amounts of

money [6].

2 .3 .4 .2 .3 .4 .2 .3 .4 .2 .3 .4 . Te l ecommun ica t i onsTe l ecommun ica t i onsTe l ecommun ica t i onsTe l ecommun ica t i ons

Fraudsters steal or use telecommunication service (telephones, cell, phones,

computers etc.) to commit other types of fraud, deceiving consumers,

businesses and communication service providers. This type of fraud can only

be detected once it has occurred [7, 30].

2 .3 .5 .2 .3 .5 .2 .3 .5 .2 .3 .5 . Ente rpr i sesEn te r pr i sesEn te r pr i sesEn te r pr i ses

The occupational fraud is described as the abuse of one’s occupation for

personal enrichment through the deliberate misuse of the employing

enterprises’s resources or assets [24].

2 .3 .6 .2 .3 .6 .2 .3 .6 .2 .3 .6 . GeneralGener alGener alGener al

Referring to the above fraud types, Figure 1 [76] displays the most popular

subgroups of occupational, insurance, credit card and telecommunications

fraud, studied in published FD papers. Occupational FD is concerned with

determining fraudulent financial reporting by management and abnormal retail

transactions by employees. Referring to insurance fraud, four groups exist: a)

home insurance, b) crop insurance, c) automobile insurance and d) medical

insurance. Credit FD involves screening credit applications and/or logged

credit card transactions. In telecommunications fraud, subscription data and/or

wire-line and wireless phone calls are monitored [76].



Figure 1: Bar chart of fraud types from 51 unique and published FD papers [76].

2.4.2.4.2.4.2.4. Fraud TechnFraud TechnFraud TechnFraud Techn iquesiquesiquesiques

Depending on the sector and the type, fraud is committed in various ways

which are described in the following paragraphs. Note that there are also

combinations of these types.

2 .4 .1 .2 .4 .1 .2 .4 .1 .2 .4 .1 . Bank ingBank ingBank ingBank ing

There are numerous ways to impose credit or debit card fraud is committed

and the following are the most typical.

- Phishing: Phishing attacks are considered to be one of the fastest growing

fraud trends and potential victims are customers of both large and small

financial institutions. It is a criminal scam whereby Internet perpetrators try to

steal cardholder’s pertinent and sensitive data through e-mail. This will result

in committing identity theft fraud and possible account hijacking. The e-mails

appear to come from a well-known organization –with which victim does not

even have an account- and ask for victims’ personal information, such as card

number, social security number, account number or password. The fraudster

leads cardholders to a website so that he/she will be able to “phish” their

personal information. Phishing e-mails almost always urge victims to click a

link, which results in a site for entering their personal information. However,

legitimate organizations would never request personal information of



cardholders via email.

- Skimming: Card skimming is the most traditional method for defrauding

cardholders, which takes place in public areas with internet access, such as

airports, gas stations, supermarkets, gas stations and Internet cafes. It, also,

takes place in ATMs and POS and involves illegal copying of information from

the magnetic strip of a credit or debit card. It is a more direct version of a

phishing scam. Scammers use a “wedge”, that is a device that captures and

stores the full magnetic stripe tract data, to steal the account number

information. Some wedges can store large volumes of track information

versus some that are wireless and send data to the scammer in the parking lot

or outside the merchant establishment. Once criminals have skimmed the

card, they are able to create a fake or ‘cloned’ card with victim’s details on it.

Then, they run up charges on victim’s account.

Card skimming is an alternative way for fraudsters to steal cardholder’s

identity and use it to commit identity fraud, i.e. to borrow money or take out

loans in victim’s name [19, 21].

Money Laundering: Main precondition is the physical disposal of cash. The

next step is known as layering and involves carrying out complex layers of

financial transactions to separate the illicit proceeds from their source and

disguise the audit trail. Finally, the perpetrator makes the wealth derived from

the illicit proceeds appear legitimate.

Identity theft is strongly connected with other types of fraud, such as credit

card or application fraud, and the common techniques used are the following

[2, 37].

- Shoulder surfing: Perpetrators observe directly victims from a nearby

location, such as looking over someone’s shoulder to extract valuable

information. It is especially effective in crowded places and it’s relatively easy

for fraudsters to observe victims as they fill out a form, enter their PIN at an

ATM or a POS terminal, enter passwords at an internet café, public and



university libraries or airport kiosks or use a calling card a public pay phone.

Shoulder surfing is accomplished at a distance using binoculars or other

vision-enhancing devices. Inexpensive, miniature closed-circuit television

cameras can be concealed in ceilings, walls or fixtures to observe data entry.

- Dumpster diving: When criminals go through victims’ garbage cans or a

communal dumpster or trash bin to obtain copies of their checks, credit card

or bank statements or other records that typically bear their name, address,

and even their telephone number. These types of records make it easier for

criminals to get control over accounts in victim's name and assume his/her

identity.

The way to commit application fraud is divided into 2 categories: a) when the

fraudster assumes another person’s identity, solely for the purpose of

receiving another individual’ s credit cards or loans and b) when the fraudster

applies for a loan or a credit card, but gives false personal details on purpose.


Fraudsters use four kinds of techniques in order to perpetrate insurance fraud

[22]:

- Exploited accidents: They refer to actual accidents which did occur and

they’re exploited in order to get reimbursed for pre-existing damage or the

damage increased on purpose at fraudster’s interest.

- Fabricated accidents: In this case, an accident either did not take place or

at least not as stated and fraudster merely pretends it did occur in order to

proceed to a legitimate claim.

- Provoked accidents: One driver intentionally involves another innocent

driver in an accident, which is crafted cleverly to make the latter appear as the

one at fault. A typical case is when the fraudster accelerates before a yellow

traffic light and brakes hard or perhaps reverse in front of a red light. Potential

locations for these accidents are blind corners, where accomplices are always



on hand to coordinate the accident and act as witnesses.

- Staged accidents: An accident did occur but, if one strictly applies the laws

of coincidence, an accident did not really take place. It’s common practice that

rental vehicles are involved and sometimes more than once. The damage

incurred is either not repaired or only to the extent absolutely necessary.

2 .4 .3 .2 .4 .3 .2 .4 .3 .2 .4 .3 . I n ter ne t I n ter ne t I n ter ne t I n ter ne t

Apart from the Phishing, Skimming, techniques already mentioned, internet

fraud can be committed through the following ways as well [26]:

- Trojan horse: It appears to be a useful, legitimate software program or file,

but once installed, it causes havoc with a computer by damaging or deleting

files. Such scam may claim to have pornographic element or to have a

program which removes computer viruses. When the unsuspecting user

opens the file or downloads the software, then the damage has happened.

Unlike viruses or worms, Trojan horse is not designed to replicate itself. Some

Trojan horse programs open a backdoor into the computer, allowing

unscrupulous users to steal sensitive financial and identity information [19,

20].

- Advance Fee: An incident involving communications that would have people

believe that to receive something, they must first pay money to cover some

sort of incidental cost or expense. Among the variations on this type of scam

is the Nigerian letter or 419 scam [46].

- Nigerian Letter or 419 scam: The fraudster sends spam e-mails to

numerous recipients and narrates a fake story about a money transfer which

is not able to make. Usually, these e-mails contain the famous subject line

“Your assistance is needed”. The potential victim answers this e-mail and the

perpetrator either steals money from his/her bank account or steals sensitive

card data [28]. The majority of Nigerian Advance fee fraud is still organized by

Nigerians, but no longer initially from Nigeria [46].



- Lottery scam: Scammers send e-mails/letters/faxes which claim that the

potential victim has already won a great deal of money in an international

lottery, even though he/she has never taken part in. They also claim that

victim’s address has been randomly chosen out of a large pool of addresses

as a ‘winning entry’. In some cases, the emails claim to be endorsed by well-

known companies such as Microsoft or include links to legitimate lottery

organization websites. Any relationships implied by these endorsements and

links will be completely bogus.

2 .4 .4 .2 .4 .4 .2 .4 .4 .2 .4 .4 . Te l ecommun ica t i onsTe l ecommun ica t i onsTe l ecommun ica t i onsTe l ecommun ica t i ons

Methods of telecommunications fraud are grouped into four categories [44]:

- Contractual fraud: In this case, perpetrators generate revenue through the

normal use of a service, whilst having no intention of paying for use.

Subscription and Premium Rate fraud are some of the examples of

contractual fraud. In Subscription fraud, the fraudster subscribes to the mobile

network using a false identity and then sells the use of his phone to

unscrupulous customers (typically for international calls to distant foreign

countries) at a rate lower than the regular tariff. A large number of expensive

calls is accumulated, but the fraudster disappears before the bill can be

collected [40].

- Technical fraud: It is connected with attacks against weaknesses in the

technology of the mobile system. The perpetrator should have technical skills

and abilities, but once a weakness is discovered then this information is often

quickly distributed in a form that non-technical people can use.

- Hacking fraud: The fraudster generates revenue by breaking into insecure

systems and exploiting or selling on any available functionality.

- Procedural fraud: It involves attacks against the procedures followed to

minimize the exposure to fraud. The perpetrator often attacks the weaknesses

in the business procedures used to grant access to the system.



Apart from the above, there are combinations of these techniques. For

example, there are cases where fraudsters obtain the ability to place

international and mobile calls, by gaining a legitimate PIN to use with the

private PABX1 of an organization as employees of the organization, but have

no intention of paying for these services (contractual fraud). Additionally,

fraudsters give the PIN to others (hacking fraud) who also used the service,

without paying. There is often the case where an employee of the

organization with special technical knowledge manages to deceive the system

and obtain a PIN that belongs to another person. The fraudster then starts

using this PIN, pretending to be the legitimate user and burdens the legitimate

user’s account [30].


According to the Association of Certified Fraud Examiners (ACFE),

occupational fraud is committed mainly through the following ways [23]:

- Asset misappropriation: In this case, the perpetrator steals or misuses an

organization’s resources, like false invoicing, payroll fraud and skimming.

- Corruption: Fraudsters use the influence in business transactions so that

their duty to their employer is violated aiming at obtaining for themselves.

Employees might receive or offer bribes, extort funds from third parties or

engage in conflicts of interest.

- Financial statement fraud: It involves the intentional misstatement or

omission of material information from the organization’s financial reports.

These are the cases of “cooking the books” that often make front page

headlines. Financial statement fraud cases often involve the reporting of

fictitious revenues or the concealment of expenses or liabilities in order to

make an organization appear more profitable than it really is.

1 Private Automated Branch Exchange (PABX), this telephone network is commonly used

by call centres and other organizations. PABX allows a single access number to offer multiple lines to outside callers while providing a range of external lines to internal callers or staff [75]



2.5.2.5.2.5.2.5. FFFFrauds tersrauds tersrauds tersrauds ters T T T Typeypeypeype

Figure 2 illustrates the types of profit-motivated fraudsters and affected

industries [76]. It stands to reason that, each business is susceptible to

internal fraud or corruption not only from the high level employees

(managers), but also the low level employees.

Fraudsters can be an external party (or parties) or can perpetrate fraud in the

form of prospective/existing customer or supplier. The external fraudster has

three basic profiles: the average offender, the criminal offender and the

organized crime offender.

Figure 2: Hierarchy chart of white-collar crime perpetrators from both firm-level and

community-level perspectives [76]

Average offenders display random and/or occasional dishonest behaviour

when there is opportunity, sudden temptation, or when suffering from financial

problems. In contrast, the more risky external fraudsters are individual criminal

offenders and organised/group crime offenders (professional/career

fraudsters) because they repeatedly disguise their true identities and/or evolve

their modus operandi over time to approximate legal forms and to counter FD

systems. Hence, it’s very important that business should take effective

countermeasure concerning their FD systems and algorithms according to

professional fraudsters’ modus operandi. Occupational and insurance fraud is

mainly committed by average offenders, while credit and telecommunications



fraud is more vulnerable to professional fraudsters.

2.6.2.6.2.6.2.6. Economical Impac t o f F raudEconomical Impac t o f F raudEconomical Impac t o f F raudEconomical Impac t o f F raud

Fraud is a considerable and increasing financial risk which threatens the

profitability and status of enterprises and causes great inconvenience to

individuals and merchants worldwide. The financial and economic result of

fraud is obviously the worst aspect of the problem.

In contrast with fraud costs, business costs, such as utility, accommodation,

salaries or procurement costs are usually known and predictable. The attitude

of denying the existence of fraud or reacting after the losses have been

occurred, surely, doesn’t help to mitigate the problem. It’s often the case that

the necessary protection measures against fraud are taken after the fraud

losses have occurred or after the resources have been diverted from where

they were intended and of course after the economic damage has happened.

Furthermore, fraud losses affect individuals not only in a direct but also in an

indirect way. For instance, when banks lose money because of credit card

fraud, cardholders pay for all of that loss through higher interest rates, higher

fees, and reduced benefits. In case of insurance companies, policyholders

pay fraud losses through high premiums.

The key to successful loss reduction is measurement methodologies, which

have been developed and implemented over the last decade by various

associations and organizations. Measuring fraud costs contributes to draw

useful conclusions about the investment to be made in moderating them and

the financial benefits from their reduction [89].

Of course, fraud scientific observation or measurement is not an easy task

because of its complicated nature. However, in the cost of fraud the following

parameters should be taken into account [11]:

− immediate direct loss due to fraud



− cost of fraud prevention and detection

− cost of lost business (when replacing card)

− opportunity cost of fraud prevention/detection

− deterrent effect on spread of e-commerce

Table 1 contains the size of annual fraud losses of various countries and time

periods based on fraud sector, interpreted in actual figures. Some comments

on the losses are given in the following paragraphs.

It’s important to mention that if detected fraud losses increase, this doesn’t

necessarily mean that there is more fraud or the FD systems improved;

similarly, if detected fraud losses drop, this doesn’t mean that there is less

fraud or worse detection [89].

2 .6 .1 .2 .6 .1 .2 .6 .1 .2 .6 .1 . Genera lGenera lGenera lGenera l

Using fraud figures that are currently available, the National Fraud Authority

(NFA) estimates that fraud cost the UK economy £30.5 billion during 2008

[48]. However, these estimations suggest that public sector losses accounted

for 58% of all fraud loss, with estimated fraud losses of £17.6 billion for the

public sector alone (Figure 3). Next, the private sector losses, which

accounted for 30% of, total loss or £9.3 billion. The individual and charity

sector represents the rest 12% of the total loss, which is translated to £3.5

billion and £32 million respectively.



F raud T y peF raud T y peF raud T y peF raud T y pe C oun t r yCoun t r yCoun t r yCoun t r y Y e a rYea rYea rYea r A nnua l l o s sAnnua l l o s sAnnua l l o s sAnnua l l o s s

General Europe 2008 €700 million

General UK 2008 £30 billion

Occupational worldwide 2009 $2.9 trillion

Occupational US 2008 $994 billion

Insurance US 2008 $80 billion

Insurance UK 2008 £2.08

Internet Greece 2008 €9 million

Internet US 2009 $559.7 million

Internet worldwide 2009 €2 billion

Advance fee US 2009 $2110million

Advance fee UK 2009 $1230 million

Advance fee Greece 2009 $108 million

Telecom worldwide 2008 $72-$80 billion

Telecom UK 2008 £948 million

Mobile worldwide 2008 $25 billion

Identity theft US 2003 $50 billion

Cards worldwide 2007 $5.5 billion

Debit/Credit Card US 2006 $3.718 billion

Cards Europe 2008 €700million

Cards (ATM) Europe 2008 €312 million

Credit Card France 2008 €249.2 million

Credit Card Greece 2006 €4 million

Cards UK 2008 £610 million

Credit Card Netherlands 2008 €68.4 million

Table 1: Annual Fraud losses per fraud sector and country



Figure 3: Breakdown of fraud losses in UK for 2008 according to NFA [48]

2 .6 .2 .2 .6 .2 .2 .6 .2 .2 .6 .2 . Bank ingBank ingBank ingBank ing

2 .6 .2 .1 .2 .6 .2 .1 .2 .6 .2 .1 .2 .6 .2 .1 . Cred i t /Deb i t ca rdsC red i t /Deb i t ca rdsC red i t /Deb i t ca rdsC red i t /Deb i t ca rds

In 2007, card fraud globally took in an estimated $5.5 billion, based on a

worldwide survey conducted by Kroll Consulting Services in collaboration with

the Economist Intelligence Unit [51].

Concerning Europe figures, the European ATM Security Team (EAST)

estimates that losses of card fraud referring to ATM transactions fell from

€485 million to €312 million during 2008, despite a rise in attacks [15]. Based

on EAST’s estimates international losses due to skimming attacks fell by 43%

from €393 million to €226 million, continuing a downward trend from 2007

[15]. Furthermore, according to Visa, 0.055% of the cards transactions is

considered to be fraudulent and the card fraud turnover is estimated at

€700million [53].



Figure 4: Breakdown of card losses in UK during 2008 according to FFA [48]

Figure 5: UK Fraud loss ratios by card type according to Visa estimates for 2008 [16].

Financial Fraud Action (FFA) has reported that over 10.5 billion UK card

transactions have taken place in 2008, with spending amount £397 billion and

card fraud loss up to £610 million, up 14% from 2007 [48]. The majority of

card losses resulted from card-not-present scheme and accounted for over a



half of all card losses. However, this should be considered along with changes

in card usage, i.e. many more transactions are made online, by phone or

through mail order than 5 years ago. Figure 4 illustrates that card-not-present

fraud appears the highest losses (£328.4 million) whilst the application fraud

appears the lowest losses (£11 million) [48]. Additionally, Figure 5 shows the

UK fraud loss ratios by type of card, where credit cards stand out significantly

[16].

Based on Visa Europe's figures, card fraud losses in UK are comparatively

higher than that of France or Netherlands (Figure 6) [16].

Figure 6: Comparative overview of European countries based on Visa estimates in 2008 [16].

Figure 7 represents the fraud loss ratio for 2008 based on losses on both

purchases and cash withdrawals, and on both domestic and international

transactions, where UK losses still outnumber the rest of the regions. Fraud

loss ratio is a comprehensive overall measure of fraud losses and it expresses

fraud losses as a proportion of total payment card turnover. Yet, the size of

losses and their trends needs to be seen in the context of the importance of

cards as a means of payment in each country.

In 2006, total card fraud losses in the US are estimated at approximately

$3.718 billion, from which $1.24 billion are the credit card losses, $762 million

is the debit and ATM card losses, $0.829 billion are the POS merchant losses

and $0.9 billion are the Internet, mail order and telephone merchants’ losses

(Figure 8) [52].



Figure 7: Fraud loss ratios of 2008 according to Visa [16].

Greek banks and financial institutions as well as Visa and MasterCard

estimate that card fraud loss in Greece is half than the corresponding

European losses. For every, €1000 transaction, €0.35 is the result of fraud,

while in Europe the corresponding amount is €0.75 and the total turnover is

estimated to €1,2-€1,5 billion. This implies that debit and credit card fraud

turnover is calculated around €4 million and concerns 2.500 cardholders, with

average loss €110 each. The most frequent fraud types in Greece are: 23%

counterfeiting, 24% stolen cards, 26% lost cards and 27% other types. The

corresponding percentages in Europe are 35%, 19%, 17% and 29% [28].

Card fraud losses in billions $ (US, 2006)

0,7620,829

0,9

1,24

0

0,2

0,4

0,6

0,8

1

1,2

1,4

credit card debit and ATM POS merchant losses Internet, mail order and telephone merchants’

Figure 8: Card fraud losses in the US for 2006



2 .6 .2 .2 .2 .6 .2 .2 .2 .6 .2 .2 .2 .6 .2 .2 . I d en t i t y TI den t i t y TI den t i t y TI den t i t y T he f the f th e f th e f t

During 2007, nearly 10 million victims of identity theft fraud in the US have

been recorded and the total loss to individuals and businesses has risen to

$50 billion, according to Federal Trade Commission survey. In addition, the

average loss to a business is $4.800. Total business losses from identity theft

exceeded $47 billion during 2008 [5, 11].


According to survey of the Association of Certified Fraud Experts (ACFE),

including more than 106 nations – with more than 40% of the cases occurring

in countries outside the US – total global occupation fraud loss is estimated

more than $2.9 trillion between January 2008 and December 2009 [25].

Furthermore, the US companies lose 7% of their annual revenues due to

occupational fraud, which is translated to $994 billion.

Figure 9 illustrates the distribution of occupational fraud losses, as CFEs

recorded for 1822 fraud cases. The median loss for these cases was

$160.000. Nearly one-third of the fraud schemes caused a loss to the victim

organization of more than $500.000 and almost one-quarter of all reported

cases topped the $1 million threshold [25].

Figure 9: Distribution of occupational fraud losses worldwide according to ACFE [25].



In addition, Figure 10 shows the proportion of the total losses based on fraud

category. Referring to cases which cost more than $18 billion, 21% were

caused by asset misappropriation, 11% by corruption and 68% by fraudulent

financial statements.

Figure 10: Proportion analysis per fraud category according to ACFE

[25].

Analyzing the median losses of occupational fraud per scheme worldwide

(Figure 11), the asset misappropriation appears to be the least costly, despite

its high frequency. In contrast, financial statement fraud caused a loss of more

than $4 million and corruption schemes fell in the middle category, creating a

median loss of $250.000.

Figure 11: Median occupational fraud losses by category for 106 nations according to ACFE [25].




The annual losses of fraudulent insurance claims are calculated nearly $80

billion in the US, according to the estimates of the Coalition Against Insurance

Fraud. This figure includes all lines of insurance. It’s also a conservative figure

because much insurance fraud goes undetected and unreported. As it’s

mentioned in §2.6, fraud contributes to higher insurance premiums because

insurance companies generally must pass the costs of bogus claims — and of

fighting fraud — onto policyholders [48].

According to Association of British Insurers (ABI), losses occurred by both

detected and undetected insurance fraud in the UK during 2008 reach £2.08

billion. It’s worth mentioning that the UK insurance industry is the largest in

Europe and the third largest in the world accounting for 11% of total worldwide

premium income [47].

2 .6 .5 .2 .6 .5 .2 .6 .5 .2 .6 .5 . I n t er ne tIn t er ne tIn t er ne tIn t er ne t

Internet fraud losses in the USA referred to law enforcement amounts to

$559.7 million in 2009 according to the Internet Crime Complaint Center (IC3)

[46]. Figure 12 shows the increasing internet fraud losses of referred

complaints from 2001 until 2009 for the US.

Figure 12: Annual dollar loss of referred complaints according to IC3 (in millions) [46].

On the other hand, the total turnover per year in Greece reached from €3,2

million in 2007 to €9 million in 2008. At the same time, the global turnover is



estimated more than €2 billion.

2 .6 .5 .1 .2 .6 .5 .1 .2 .6 .5 .1 .2 .6 .5 .1 . Advance Advance Advance Advance Fe e F raud Fe e F raud Fe e F raud Fe e F raud –––– 41 9 S 41 9 S 41 9 S 41 9 S camcamcamcam

According to Ultrascan Advanced Global Investigations (AGI), currently there

is no country even actively encourages the reporting of these criminal

attempts to defraud to the authorities [54]. Reporting is only limited only to

cases in which there has been a financial loss. There are, of course, only a

very small percentage of the total criminal attempts to defraud by the 419ers

(though the number of loss cases is still huge both in numbers of victims and

amounts lost). When only these loss cases are considered in statistics on 419,

the true massive magnitude of 419 Advance fee fraud criminal activities is

obscured- as only the tip of the iceberg of the actual numbers of 419 crimes is

being included in the statistics [54].

Figure 13 and Figure 14 illustrate the size of Advance fee 419 fraud losses

during 2009 for the top 25 countries. The US comes first with $2110 million

losses; it follows the UK with $1230 million. Since 2005, Greece's fraud losses

exhibit an extreme increase, which resulted in $108 million losses for 2009.

Figure 13:Advance fee fraud losses for Greece (419 Unit of Ultrascan AGI) [54]



Figure 14: Advance fee fraud losses worldwide in 2009 in million $ (Ultrascan AGI) [54]

2 .6 .6 .2 .6 .6 .2 .6 .6 .2 .6 .6 . Te l ecommun ica t i onTe l ecommun ica t i onTe l ecommun ica t i onTe l ecommun ica t i on ssss

The global telecom fraud loss has increased from $54-$60 in 2005 to $72-$80

billion in 2008, which corresponds approximately to 4.3% of telecom revenues

according to Communications Fraud Control Association (CFCA) [44]. At the

same time, worldwide mobile fraud costs $25 billion per annum.

The Telecommunications UK Fraud Forum (TUFF) estimates that, on

average, telecommunications companies lose 2.4% of their annual turnover to

fraud. Applying this average to industry turnover of £39.5 billion, it is estimated

that £948 million was lost during 2008 to telecommunications fraud [48].



2.7.2.7.2.7.2.7. D i f f i cu l t ies in FDD i f f i cu l t ies in FDD i f f i cu l t ies in FDD i f f i cu l t ies in FD

Fraud is a constantly evolving discipline and a hard task to deal with, so it’s

not surprising that many FD systems exhibit serious limitations. Depending on

the fraud type, different systems with different parameters, database

interfaces, procedures and case management tools should be developed.

Hence, nowadays FD is considered to be a great challenge for numerous

reasons.

Whenever it becomes known that FD method is in place, criminals adapt their

strategies rapidly. To avoid information leaks to fraudsters, FD methods must

be kept secret. New criminals that will enter the field may not be aware of

these FD methods and adopt strategies which lead to identifiable frauds [32].

It’s often the case that there is a subtle distinction between a fraudulent and a

legitimate behaviour, since legitimate account users may gradually change

their behaviour over a long period of time and it’s important to avoid spurious

alarms [32].

Another fundamental problem of FD is the unwillingness of financial

institutions, organizations or companies to admit being defrauded in order to

preserve a good reputation in the market. Due to the severely limited

exchange of ideas in FD, data sets do not become available and the results

are often censored, encumbering the measurement of fraud losses [32].

Beyond these limitations, FD requires the analysis of massive amounts of

transactions data. For example, the credit card company Barclaycard carries

approximately 350 million transactions a year in the United Kingdom alone

(Hand, Blunt, Kelly and Adams, 2000), the Royal Bank of Scotland - which

has the largest credit card merchant acquiring business in Europe - carries

over a billion transactions a year and AT&T carries around 275 million calls

each weekday (Cortes and Pregibon, 1998) [32]. As a consequence,

assuming that the fraudulent transaction represent the 0,1% out of 100 million

transactions and for each fraud case the company loses €10, this implies that



the fraud cost or alternatively the potential value of FD amounts to €1 million.

Processing huge data sets in a search for fraudulent transactions in a timely

manner is an important problem [32]. Experienced and well-trained employees

are capable of effective manual classification of transactions, comparing with

historical data. Yet, time and cost requirements render this aspect prohibitive

[18].

High dimensionality of the input, i.e. the number of attributes, is another point

to be considered. This implies that the search space also increases in an

exponential manner and thus the processing time is affected [57].

There is no doubt that the correct choice of data attributes is often a tricky

task. The existence of both irrelative variables and mixed attribute data-sets

(i.e. data-sets containing both nominal and continuous attributes) or even

complex data types such as text, signals, images is a crucial factor during FD

[11].

Moreover, the FD task exhibits technical problems because the available

training data are highly skewed, i.e. legitimate transactions outnumber

fraudulent ones [11, 17]. It's estimated that 1 out of 1000 transactions is

fraudulent. This percentage is lower in case of debit card fraud and even

lower in case of web-based banking transactions and money laundering [18].

An additional difficulty in FD procedure lies in the typical validity of

transactions for classification. In particular, almost all transactions concerning

electronic payments are typical valid, since fraudsters do not commit fraud

with an expired card [18].

Because of this the typical validity, it is possible that some transaction records

contain original and fake subsets at the same time (class overlapping).

Consequently, the finding of suitable business rules for the discrimination

between original and fraud cases becomes a hard task [18].

Finally, it's noteworthy that the variable misclassification cost per error type



burdens significantly the FD process. For example, credit card transactions

may be labelled incorrectly: a fraudulent transaction may remain unobserved

and thus be labelled legitimate (and the extent of this may remain unknown)

or a legitimate transaction may be misreported as fraudulent [32].

2.8.2.8.2.8.2.8. FD SFD SFD SFD Sys temys temys temys tem Requi rements Requi rements Requi rements Requi rements

All the aforementioned difficulties generate the need of a number of business,

technical and functional requirements for the development of a robust FD

system.

2 .8 .1 .2 .8 .1 .2 .8 .1 .2 .8 .1 . Bus iness Bus iness Bus iness Bus iness RRRRequ i r emen tsequ i r emen tsequ i r emen tsequ i r emen ts

As it’s already mentioned, fraud losses may imperil the good name and the

profitability of the businesses. In this case, there is a dual impact, which

involves not only the lost amounts but also the internal cost, generated due to

the settlement of the fraud case. So, the reference point for money saving is

the following: spare as less money as possible for fraud cases and their

settlement [18].

Obviously, every time a fraud case appears, the relationship between

customer and the particular organization is put on a risk. So, the point is that a

reliable FD system should produce a minimum number of false alarms for

preserving the customer’s satisfaction [18].

Another key issue in FD is the interception of authorization request in real

time, since fraudsters constitute a serious threat as long as they act

inconspicuously [18], especially in cases such as card fraud.

2 .8 .2 .2 .8 .2 .2 .8 .2 .2 .8 .2 . Techn ica l Techn ica l Techn ica l Techn ica l RRRRequi r emen tsequi r emen tsequi r emen tsequi r emen ts

The connection and the integration of new FD solutions in an existent

business environment cause many problems due to the high cost. Hence, the

FD solution should be flexible and available for the majority of technological



platforms and should allow the easy integration and interconnection. Thus, the

implementation and maintenance cost remains low [18].

2 .8 .3 .2 .8 .3 .2 .8 .3 .2 .8 .3 . Funct ional Requi rements

The percentage of false alarms is slightly connected with the percentage of

response. This means that when there is high percentage of responses, then

several false alarms are produced, which leads to customers’ inconvenience.

Consequently, the number of the accepted false alarms helps to define how

many cases will be investigated. This suggests that there should be a

balance between the number of false alarms and responses [18].

Usually, each service provider knows better the fraud issues that encounters.

For this reason, the internal design of an FD system is a secure approach. In

addition, fraud experts should be very precise and studious during the system

design and the goal is to create an FD system totally transparent to them [18].

Some fraud types are global and other appear in specific areas or in specific

service providers. These are the rarest ones, but they can cause great losses.

Given the rampant change of fraud types, it’s very important to use FD safety

measures as fast as possible. Hence, the system processor should be

capable of preserving the decision logic in an independent way [18].

Last but not least, fraud systems should not be awkward to use. The goal is to

facilitate fraud experts during FD, so as to avoid wasting time on simple tasks,

such as retrieving the necessary analytical data of the transaction from

several disparate databases.

2.9.2.9.2.9.2.9. PerformanPerformanPerformanPerformance Mce Mce Mce Metr icsetr icsetr icsetr ics

The performance of a FD system is a subtle matter with many pitfalls and

ambiguous opinions. The performance is usually defined by each service

provider’s needs and requirements and it’s strongly connected with the losses

a service provider is able to prevent. Because measuring averted losses is not



a feasible task, service providers use metrics as detection rate and false

alarm rate [39]. Additional information for classifiers’ performance metrics is

given in §5.2.2.1.

An ideal FD system would have 0% false alarms and 100% hits with

instantaneous detection. Though, the successful detection of all fraud cases

as soon as fraud starts implies that many legitimate cases will be mislabelled

as fraudulent at least once. In fact, in a real FD system there is a trade-off of

the above performance criteria.

False alarm rate refers to the percentage of legitimate instances mislabelled

as fraud. In case of 1000000 legitimate instances in the total population out of

which 100 cases are mislabelled as fraud, this gives a false alarm rate of

0.01%. This measure is considered to be important especially in the flagging

phase, where fraud experts aim at reducing the number of cases that have to

be investigated for fraud to just those that involve actual fraud [39].

As it is mentioned in §5.1, when there is no clear evidence of fraud, there

should be a further analysis by the fraud analysts, before the interception of a

transaction, the restriction of an account or the denial of an insurance claim. In

this case, the flagged instances with the highest priority in the queue are

investigated first, whenever a fraud analyst is available. A queue may

prioritize instances by the number of fraudulent minutes accumulated to date

or by the time of the most recent high scoring call, for example. Performance

can then be evaluated after flagging or after prioritization. For example, the

flagging detection rate is the fraction of compromised accounts in the

population that are flagged. In addition, the system detection rate is the

fraction of compromised accounts in the population that are investigated by a

fraud analyst. The system and flagging detection rates are equal only when

fraud analysts or investigators investigate all flagged instances. Otherwise,

the system detection rate is smaller than the flagging detection rate because

both detection rates are computed relative to the number of accounts with

fraud in the population [39].



Another key issue is that investigators should focus on fraud cases and not

spend time on investigating legitimate instances. This implies that there

should be a precise definition of the percentage of investigated cases that

involve potential fraud. The flagging hit rate is the fraction of flagged instances

that have fraud, and the system hit rate is the fraction of investigated cases

that have fraud. The (1- system hit rate) is often a good measure of the

service provider's perception of the “real false alarm rate”, especially since

this is the only error rate that the service provider can evaluate easily from

experience. That is, only the cases that are investigated may be of service

provider’s concern and not the legitimate cases in the population that were

never classified as suspicious. If 20 cases of fraud are investigated and only 8

turn out to be fraud, then a service provider may feel that the “real false alarm

rate" is 60%, even if only .01% of the legitimate accounts in the population are

flagged as fraud [39].

Moreover, the difference between the fraction of fraud in the population and

the fraction of fraud in flagged instances is used as a measure of the

efficiency of the FD algorithm. Similarly, the system hit rate should be larger

than the flagging hit rate, or else the analyst can find as much fraud by

randomly selecting one of the flagged accounts to investigate [39].

Despite the aforementioned metrics, a key point during FD performance

measurement is the uncertain misclassification costs which are equal to false

positive and false negative error costs. These costs differ from example to

example and may change over time. A common practice is that a false

negative error costs more than a false positive error.

The following paragraph includes an additional performance measure of FD,

which is a result of the aforementioned detection and false alarm rate.

2 .9 .1 .2 .9 .1 .2 .9 .1 .2 .9 .1 . Rece iv er Oper a t i ng Char ac ter i s t i c Rece iv er Oper a t i ng Char ac ter i s t i c Rece iv er Oper a t i ng Char ac ter i s t i c Rece iv er Oper a t i ng Char ac ter i s t i c

(ROC)(ROC)(ROC)(ROC)

The striking feature of FD is finding the right balance between detection of



actual fraudulent users and the production of false alarms. For instance,

telecommunications service providers are very cautious about unnecessary

bothering of good customers. This implies that false alarm rate are not

common for all FD applications, since the number of users and thus the size

or processed record varies significantly [40].

The Receiver Operating Characteristic plots the percentage of correct

detection fraud cases versus the percentage of false alarms for non-fraudulent

users for varying values of the threshold. In other words, ROC curve is a

graphical plot of the sensitivity (true positive rate) versus the (1-specificity) or

false positive rate for a binary classification problem as the discrimination

threshold is varied [2]. The ROC curve is typically shown on a 2-D graph,

where false alarm rate and detection rate are plotted on x-axis and y-axis

respectively (Figure 15).

The ideal ROC curve has 0% false alarm rate and 100% detection rate, but

this is not a real case scenario. Hence, researchers compute detection rate for

different false alarm rates and present the results on ROC curves [84].

Figure 15: Receiver Operating Characteristic curve & the Area Under the Curve [84].

Furthermore, the Area Under the Curve (AUC) is often used to gauge the

classification performance of a FD system. The AUC is defined as the surface

area under the ROC curve (Figure 15 – shaded area) and it’s set to be 1 for

the case of ideal scenario. In practice, the AUC is the index of performance

needs to be maximized.



3333 THEORETICAL PERSPECTIVETHEORETICAL PERSPECTIVETHEORETICAL PERSPECTIVETHEORETICAL PERSPECTIVE

3.1.3.1.3.1.3.1. FD MethodsFD MethodsFD MethodsFD Methods

Fraud is an adaptive crime, so it requires special methods of intelligent data

analysis to detect and prevent it. FD techniques aim at the automation of the

procedure of pattern recognition and come from the fields of Knowledge

Discovery in Databases (KDD), Data Mining, Machine Learning and Statistics,

which offer applicable and successful solutions in different areas of fraud

crimes [2].

KDD is viewed as the overall process of discovering useful knowledge from

data, while data mining is the application of specific algorithms for extracting

patterns (models) from data. Sometimes, KDD and data mining are used

interchangeably [56].

Machine learning is a scientific discipline refers to the design and

development of algorithms that allow computers to evolve behaviours based

on empirical data, such as from sensor data or databases. A major focus of

machine learning research is to automatically learn to recognize complex

patterns and make intelligent decisions based on data.

3.2.3.2.3.2.3.2. Supervi sed Supervi sed Supervi sed Supervi sed &&&& Unsuperv ised Unsuperv ised Unsuperv ised Unsuperv ised LLLLearningearningearningearning

MMMMethodsethodsethodsethods

Supervised learning methods attempt to discover the relationships between

input attributes (independent variables) and a target attribute (dependent

variable) [57]. The relationship discovered is represented in a structure

THEORETICAL

PERSPECTIVE



referred to as a Model, used to describe and explain phenomena, which are

hidden in the dataset. It can be, also, used for predicting the value of the

target attribute knowing the values of the input attributes.

In supervised methods, samples of both fraudulent and non-fraudulent

records are necessary to construct models which allow one to assign new

observations into one of the two classes. Of course, this requires one to be

confident about the true classes of the original data used to build the models.

It, also, requires that one has examples of both classes. Furthermore, it can

only be used to detect frauds of a type which have previously occurred [32].

The supervised models are distinguished into two main categories: a)

Classification models known as classifiers (§3.2.1) and b) Regression models

(§3.2.2)

In contrast, the unsupervised learning refers to modelling the distribution of

instances in a typical, high-dimensional input space. According to [Kohavi and

Provost (1998)] the term "Unsupervised learning" refers to “learning

techniques that group instances without a pre-specified dependent attribute"

[57].

Unsupervised methods simply seek those accounts, customers, transactions

and so forth which are most dissimilar from the norm. Typical characteristic of

unsupervised methods is the fact that there are no prior set of legitimate and

fraudulent observations. Fraud experts model a baseline distribution that

represents normal behaviour and then attempt to detect observations that

show the greatest departure from this norm [32].

3 .2 .1 .3 .2 .1 .3 .2 .1 .3 .2 .1 . Cl ass i f i ca t i onC l ass i f i ca t i onC l ass i f i ca t i onC l ass i f i ca t i on

Classification is learning a function that maps (classifies) a data item into one

of several predefined classes (Weiss and Kulikowski 1991; Hand 1981).

Figure 16 shows a simple partitioning of loan data into two class regions. The

bank may use the classification regions to automatically decide whether future



loan applicants will be given a loan or not [85]. The most widely known

classifiers are analyzed in the next paragraphs.

Figure 16: A Simple Linear Classification Boundary for the Loan Data Set, where the shaped region denotes class no loan [85].

3 .2 .1 .1 .3 .2 .1 .1 .3 .2 .1 .1 .3 .2 .1 .1 . Dec is i on T re eDec is i on T re eDec is i on T re eDec is i on T re e (DT ) (DT ) (DT ) (DT )

DTs are decision making prediction models with a simple representational

form. They generate rules to classify a data set, where each tree represents a

set of decisions. DT is based on “divide and conquer” technique, which means

that the problem is broken down into two or more sub-problems of the same

(or related) type, until these become simple enough to be solved directly. The

solutions to the sub-problems are then combined to give a solution to the

original problem.

To be more specific, a DT is a classifier expressed as a recursive partition of

the instance space. It consists of nodes that form a rooted tree, meaning that

it is a directed tree with a node called root that has no incoming edges. All

other nodes have exactly one incoming edge. A node with outgoing edges is

called internal node or test nodes and all other nodes are called leaves [57].

In a DT, each internal node splits the instance space into two or more

subspaces according to a certain discrete function of the input attributes

values. In the simplest and most frequent case, each test considers a single

attribute, such that the instance space is partitioned according to the

attribute's value. In the case of numeric attributes -with continuous values- the



condition refers to a range and it refers to regression (§3.2.2) [57].

Figure 17 represents a DT example, where given this classifier all transactions

with a probability of >=0.70 (defined threshold) will be indicated and alerted as

fraud [61].

Figure 17: An indicative DT example [61].

3 .2 .1 .1 . 1 .3 .2 .1 .1 . 1 .3 .2 .1 .1 . 1 .3 .2 .1 .1 . 1 . C4. 5 ( J48 )C4. 5 ( J48 )C4. 5 ( J48 )C4. 5 ( J48 )

The DTs generated by C4.5 algorithm as following: Initially, the most

characteristic attribute is selected in order to become a tree root. This

appropriate selection constitutes the key to successful DT, due to the effective

division of problem space. For each different value, a root descendant is

generated and all the training instances which bear this value are mapped

with the descendant. The whole process is repeated retrospectively for each

descendant of the DT root, limiting the examined trained subset to the

instances that have been mapped to this node. The termination of this

process happens when one of the following conditions are satisfied: a) all the

instances of the current node belong to the same class and b) all the

attributes have been used.

One of the most popular mechanisms for instance space partitioning is that of



Information Entropy, which selects that independent variable that leads to the

most compact tree. Let S be the training set at the point of partitioning (node),

the entropy measures the existent incongruity in S, referring to the dependent

variable under consideration. In practice, the Information Gain represents the

reduction in entropy of a training set S, as a result of the usage of a specific

attribute, let A. In other words, it is a measure for the attribute evaluation.

3 .2 .1 .2 .3 .2 .1 .2 .3 .2 .1 .2 .3 .2 .1 .2 . AAAA r t i f i c i a l Neu ra l Ne two rks (ANN )r t i f i c i a l Neu ra l Ne two rks (ANN )r t i f i c i a l Neu ra l Ne two rks (ANN )r t i f i c i a l Neu ra l Ne two rks (ANN )

ANNs resemble the central processing unit (CPU) of a biological neural

network, the human brain, in the following two ways: a) an ANN acquires

knowledge through learning and b) an ANN's knowledge is stored within inter-

neuron connection strengths known as synaptic weights [66].

A human brain is highly complex, nonlinear, and parallel computer, made up

of about 100 billion tiny units called neurons (Figure 18) and has the capability

of organizing neurons so as to perform certain computations many times

faster than the fastest digital computer in existence today [19].

Each neuron is connected to thousands of other neurons and communicates

with them via electrochemical signals. Signals coming into the neuron are

received via junctions called synapses; these in turn are located at the end of

branches of the neuron cell called dendrites.

Figure 18: A human neuron forming a chemical synapse [66].

The neuron continuously receives signals from these inputs and sums up



them to itself in some way and then, if the end result is greater than some

threshold value, the neuron fires. It generates a voltage and outputs a signal

along something called an axon.

Each input into the neuron has its own weight, which is adjusted for the

network training (Figure 19). When each input enters the nucleus (blue circle),

it's multiplied by its weight. The nucleus, then, sums all these new input

values which gives us the activation. If the activation is greater than a

threshold value the neuron outputs a signal; otherwise the neuron outputs

zero.

Figure 19: Artificial Neuron (Perceptron) [66].

Given that a neuron has n number of inputs, x1 ,x2, x3… xn and their

corresponding weights (synaptic strength) are w1, w2, w3… wn, then the

weighted sum becomes:

a = x1w1+x2w2+x3w3... +xnwn or

To express a background activation level of the neuron, an offset (bias) Θ is

added to the weighted sum and this becomes the propagation function. The

bias is a constant term that doesn't depend on any input value (Figure 19).

The activation function computes the output signal Y of the neuron from the

activation level f and it’s of sigmoid type as plotted in the same figure.

There are many different ways of connecting artificial neurons together to

create a neural network but the most common is called a feedforward network

+ Θ



(Figure 19). It gets its name from the way the neurons in each layer feed their

output forward to the next layer until we get the final output from the neural

network. The simplest kind of a feedforward network is the perceptron (Figure

20), which is a single layer neural network whose weights and biases could be

trained to produce a correct target output when presented with the

corresponding input [67].

Once the ANN has been created, it needs to be trained. One way of doing this

is initialize the neural net with random weights and then feed it a training set.

There are many different ways of adjusting the weights of ANN, but the most

common is called backpropagation (BP). This method efficiently propagates

values of the evaluation function backward from the output of the network,

which then allows the network to be adapted so as to obtain a better

evaluation score. In other words, a BP network learns by example, which

implies that it needs as feedback some input examples and the known-correct

output for each case and this will result in network adaptation.

Figure 20: Three-Layer feedforward ANN

The most common feedforward ANN model is the multilayer perceptron

(MLP), which consists of multiple layers of nodes in a directed graph. The goal

of this type of network is to create a model that correctly maps the input to the

output using historical data so that the model can then be used to produce the

output when the desired output is unknown. MLP utilizes BP technique for

training the network. A graphical representation of an MLP is shown Figure

21.



Figure 21: A multilayer perceptron

3 .2 .1 .3 .3 .2 .1 .3 .3 .2 .1 .3 .3 .2 .1 .3 . Fu zzy Log i c (F L )Fu zzy Log i c (F L )Fu zzy Log i c (F L )Fu zzy Log i c (F L )

The theory of fuzzy sets, introduced by Zadeh (1965) captures the

uncertainties associated with human cognitive processes, such as thinking

and reasoning. FL introduces the concept of “vagueness” rather than the crisp

logic (YES-NO).

FL uses the concept of the degrees of truth, whose the extreme values {0,1}

represent absolute falsity and absolute truth respectively, while the values in

between represent intermediate truth degrees [73]. For example, if the truth

value that “a loan application is fraud” is 0, it means that the “application is

legitimate” or else that the “application has a zero possibility to be fraud”. In

Figure 22, the degrees of truth are plotted in y-axis.

The primary building block of any FL system is the so-called linguistic

variable, which translates real values into linguistic values. For example,

Figure 22 illustrates the three levels of variable “amount”, which show that

when the amount of an application is less than €300, between €500 and

€1000 and over €1100 is characterized as low, medium and high respectively.

These are called linguistic terms and represent the possible values of a

linguistic variable.

The degree to which the value of a technical figure satisfies the linguistic

concept of the term of a linguistic variable is called degree of membership. For



a continuous attribute, this degree is expressed by a function called

membership function. The membership functions map each value of the

technical figure to the membership degree to the linguistic terms. Figure 22

plots the membership functions of all terms of the linguistic variable “amount”

into the same graph.

Figure 22: Membership function of the linguistic variable “amount” in FL

FL models consist of a number of conditional "IF-THEN" rules [74]. Fuzzy

systems often glean their rules from experts. These rules are a transparent

way of imitating human decision processes and they are basically made up of

linguistic variables, associated linguistic terms, and connecting FL operators,

allowing more direct modeling [19].

3 .2 .1 .4 .3 .2 .1 .4 .3 .2 .1 .4 .3 .2 .1 .4 . Fu zzy Neu ra l Netwo rk ( FNN )Fu zzy Neu ra l Netwo rk ( FNN )Fu zzy Neu ra l Netwo rk ( FNN )Fu zzy Neu ra l Netwo rk ( FNN )

The fusion of ANN and FL in neuro-fuzzy models provides learning as well as

readability. FNN systems combine the human-like reasoning style of FL

systems with the learning and connectionist structure of ANN [2].

However, the simplicity of an FL system constitutes an important

disadvantage, since the “IF-THEN” rules must derive from the huge data sets,

which is not an easy task. At this point, ANNs play a significant role due to

their power of training a system with the available data sets. ANN can learn

from data sets while FL solutions are easy to verify and optimize [19].



3 .2 .1 .5 .3 .2 .1 .5 .3 .2 .1 .5 .3 .2 .1 .5 . Naï v e BayeNaï v e BayeNaï v e BayeNaï v e Baye s (NB )s (NB )s (NB )s (NB )

The NB classifier algorithm is based on the so-called Bayes theorem and is

particularly suited when the dimensionality of the inputs is high [55].

Suppose the given data consist of card transactions, described by their time

and amount. Bayesian classifiers operate by saying "If a transaction of €1000

that took place at night, in which of the two classes (legitimate or fraud) is it

likely to belong, based on the observed data sample? In future, classify these

transactions as that type."

A difficulty arises when there are more than a few variables and classes,

because it’s required that there should be an estimation of these probabilities

for an enormous number of observations (records).

NB classification gets around this problem by not requiring that there are lots

of observations for each possible combination of the variables. Rather, the

variables are assumed to be independent of one another. Therefore the

probability that a transaction of €1000, took place at night, with 123 as a

Processing code with cardholder’s name Smith etc. will be fraud can be

calculated from the independent probabilities that a transaction had the

following characteristics: amount= €1000, time=night, Processing code=123

and cardholder’s name= Smith, etc.

In other words, NB classifiers assume that the effect of a variable value on a

given class is independent of the values of other variable. This assumption is

called class conditional independence and is often not applicable. It is made

to simplify the computation and in this sense considered to be naïve.

However, it is the order of the probabilities, not their exact values that

determine the classifications [64].

3 .2 .1 .6 .3 .2 .1 .6 .3 .2 .1 .6 .3 .2 .1 .6 . Suppo r t Vec to r MacSuppo r t Vec to r MacSuppo r t Vec to r MacSuppo r t Vec to r Mach i nes ( SVM)h i nes ( SVM)h i nes ( SVM)h i nes ( SVM)

A SVM performs classification by constructing an N-dimensional hyperplane

that optimally separates the data into two categories. The goal of SVM



modelling is to find the optimal hyperplane that separates clusters of vector in

such a way that cases with one category of the target variable are on one side

of the plane and cases with the other category are on the other size of the

plane. The vectors near the hyperplane are the support vectors [65].

Figure 23 shows a simple two-dimensional example. In this case, the data has

a categorical target variable with two categories. One category of the target

variable is represented by rectangles while the other category is represented

by ovals and they are completely separated. The SVM analysis attempts to

find a one-dimensional hyperplane (i.e. a line) that separates the cases based

on their target categories. There are an infinite number of possible lines and

two candidate lines are shown in Figure 23. The question is which line is

better, and how do we define the optimal line.

The dashed lines drawn parallel to the separating line mark the distance

between the dividing line and the closest vectors to the line. The distance

between the dashed lines is called the margin. The vectors (points) that

constrain the width of the margin are the support vectors (Figure 23).

However, in real life scenarios SVM deals with: (a) more than two predictor

variables, (b) separating the points with non-linear curves, (c) handling the

cases where clusters cannot be completely separated, and (d) handling

classifications with more than two categories.

Figure 23: Margins and Support Vectors in a two-dimensional example [65]



3 .2 .2 .3 .2 .2 .3 .2 .2 .3 .2 .2 . L inear and Logi s t i c Regr ess i onL inear and Logi s t i c Regr ess i onL inear and Logi s t i c Regr ess i onL inear and Logi s t i c Regr ess i on

Regression is learning a function that maps a data item to a real-valued

prediction variable [85].

Linear regression models the relationship between a dependent variable y

and one or more independent variables X, using linear functions. It can be

used to fit a predictive model to an observed data set of y and X values. After

developing such model, if an additional value of X is then given without its

accompanying value of y, the fitted model can be used to make a prediction of

the value y [2].

On the contrary, logistic regression is a variation of ordinary regression which

is used when the dependent (output) variable is a dichotomous variable (i. e. it

takes only two values, which usually represent the occurrence or non-

occurrence of some outcome event, usually coded as 0 or 1) and the

independent (input) variables are continuous, categorical, or both [64]. Unlike

ordinary linear regression, logistic regression does not assume that the

relationship between the independent variables and the dependent variable is

a linear one.

3 .2 .3 .3 .2 .3 .3 .2 .3 .3 .2 .3 . Clus ter i ngC lus ter i ngC lus ter i ngC lus ter i ng

Clustering is a common descriptive task where one seeks to identify a finite

set of categories or clusters to describe the data (Jain and Dubes 1988;

Titterington, Smith, and Makov 1985). Figure 24 shows a possible clustering

of the loan data set into three clusters, where the clusters overlap, and

allowing data points to belong to more than one cluster. The original class

labels (denoted by x’s and o’s in the previous figures) have been replaced by

a '+' to indicate that the class membership is no longer assumed known.

3 .2 .3 .1 .3 .2 .3 .1 .3 .2 .3 .1 .3 .2 .3 .1 . Ou t l i e r Ou t l i e r Ou t l i e r Ou t l i e r DDDDe tec t i o ne tec t i o ne tec t i o ne tec t i o n

An outlier is an observation of the data that deviates from other observations

so much that it arouses suspicions that it was generated by a different



mechanism from the most part of data [68]. In FD, outlier detection helps to

recognize fraudulent behaviour through an exception in the amount of money

spent, type of items purchased, time and location.

Outliers may be erroneous or real in the following sense. Real outliers are

observations whose actual values are very different than those observed for

the rest of the data and violate plausible relationships among variables.

Erroneous outliers are observations that are distorted during data collection.

Many data-mining algorithms find outliers as a side-product of clustering

algorithms. However, these techniques define outliers as points, which do not

lie in clusters. Thus, the techniques implicitly define outliers as the

background noise in which the clusters are embedded. Another class of

techniques defines outliers as points, which are neither a part of a cluster nor

a part of the background noise; rather they are specifically points which

behave very differently from the norm.

Figure 24: Example of clustering [85].

3 .2 .4 .3 .2 .4 .3 .2 .4 .3 .2 .4 . MetaMe taMe taMe ta ---- l e arn inglearn inglearn inglearn ing

The efficiency of a model generated by machine learning algorithms is based

not only on the size and the quality of training set, but also on the

appropriateness of the selected algorithm. There is often the case to utilize

the experience of more than one model-expert, whose combination leads to

the final output for a specific data set. At this point, the various classifiers stem

from the training of a single algorithm applied on different subsets of the



available data set.

Meta-learning methods take advantage of the lack of stability of some learning

algorithms or else the oversensitivity towards the small changes of input data.

The aim is the successive creation of models, capable of complementing each

other. This implies that one model will outperform others in a specific subset

of the training set, where the others will have disadvantages. As it is expected,

meta-learning methods show better results for unstable algorithms, i.e.

algorithms which generate classifiers quite different for only a small change of

the training set.

3 .2 .4 .1 .3 .2 .4 .1 .3 .2 .4 .1 .3 .2 .4 .1 . Bagg i ng (Boo ts t rap Agg rega t i ng )Bagg i ng (Boo ts t rap Agg rega t i ng )Bagg i ng (Boo ts t rap Agg rega t i ng )Bagg i ng (Boo ts t rap Agg rega t i ng )

Bagging involves having each model in the ensemble2 vote with equal weight.

It’s the simplest meta-learning method and it is based on the production of a

number of models (sub-classifiers), which come from a common learning

algorithm. The point is that in each case the sampling of the training set

differentiates. Decision is taken following the voting method, which means that

the final decision of the system coincides with the decision of the majority. In

the case of cross-validated committees the subsets of the training data are

defined through the cross-validation (§5.2.3) method [2].

Figure 25: Graphical representation of bagging.

In Figure 25, D represents the training set, D1,..., Dt represent the various data 2 Ensemble methods use multiple models to obtain better predictive performance that could

be obtained from any of the constituent models [2]. These methods centre around producing classifiers that disagree as much as possible on their predictions [70].



sets (1st step), C1,...,Ct represent the various classifiers per data set (2nd step)

and C represent the final classifier as a result of the combination of C1,...,Ct

(3rd step).

3 .2 .4 .2 .3 .2 .4 .2 .3 .2 .4 .2 .3 .2 .4 .2 . S tack ing ( S t acked Gen era l i z a tS t ack ing ( S t acked Gen era l i z a tS t ack ing ( S t acked Gen era l i z a tS t ack ing ( S t acked Gen era l i z a t io n )io n )io n )io n )

It is a type of ensemble learning, where the set of the models used come from

different learning algorithms, in contrast with most of the approaches.

Additionally, the final decision-making doesn't presuppose the majority voting

or the weighted estimation of individual decisions. Conversely, stacking uses

a model-leader that judges which of the set of the competitors learning

algorithms are the best [2]. The choice among a set of models is done with

cross-validation (§5.2.3) as well.

3 .2 .4 .3 .3 .2 .4 .3 .3 .2 .4 .3 .3 .2 .4 .3 . Boos t i ngBoos t i ngBoos t i ngBoos t i ng

Boosting involves incrementally building an ensemble by training each new

model instance to emphasize the training instances that previous models

misclassified. It is a general method which attempts to boost the accuracy of

any given learning algorithm. The focus of boosting methods is to produce

iteratively learning weak classifiers3 with respect to a distribution and then

adding them to produce a more powerful combination or else a strong

classifier. Thus, boosting attempts to produce new classifiers that are better

able to predict examples for which the current ensemble's performance is

poor [2, 88].

When the weak classifiers are added, they are typically weighted in a way that

is usually related to the weak learner's accuracy. Next, the data is reweighed,

which implies that misclassified examples gain weight and examples correctly

classified lose weight. Unlike bagging (§3.2.4.1), future weak classifiers focus

mostly on the examples that previous weak learners misclassified.

3 A weak classifier is a learner that has misclassification error rate only slightly better than

random guessing.



4444 ACADEMIC PERSPECTIVEACADEMIC PERSPECTIVEACADEMIC PERSPECTIVEACADEMIC PERSPECTIVE 4 .1.4.1.4.1.4.1. Sc ien t i f ic Sc ien t i f ic Sc ien t i f ic Sc ien t i f ic RRRResearchesearchesearchesearch

FD has been widely studied in terms of scientific research for the last 10

years. The following paragraphs describe experiments carried out in the

academic field. These experiments utilize the aforementioned data mining and

machine learning methods (§3) and draw conclusions for future consideration.

The following material is based on scientific publications and it’s grouped

according to fraud types. Within the present thesis research, it has been

noticed that credit card FD has received the most attention from academic

point of view.

4 .1 .1 .4 .1 .1 .4 .1 .1 .4 .1 .1 . Card Card Card Card FDFDFDFD

The following experiment addresses the desirable data distribution, the

pruning4 effects, the use of cost models and the meta-learning advantages.

The research was conducted by Philip K. Chan from Florida Institute of

Technology and Wei Fan, Andreas L. Prodromidis and Salvatore J. Stolfo

from Columbia University [17].

4 .1 .1 .1 .4 .1 .1 .1 .4 .1 .1 .1 .4 .1 .1 .1 . Expe r iment 1Expe r imen t 1Expe r imen t 1Expe r imen t 1 ---- Desc r i p t i on Desc r i p t i on Desc r i p t i on Desc r i p t i on

The proposed methods of combining various learned fraud detectors under a

cost model using distributed data mining appear to be useful for reducing card

fraud losses. For the experiment’s purposes, two American banks, the Chase

Manhattan Bank and the First Union Bank, provided 500.000 real and labeled

4 Pruning a DT leads to the reduction of classification errors, caused by specialization in the

training set. The tree becomes more general, by removing those sections that provide limited power to classify instances. The result is low complexity and better predictive accuracy.

ACADEMIC

PERSPECTIVE



credit card data. The initial data distribution was 20:80 and 15:85 respectively.

In order to test the effectiveness of the developing techniques under extreme

conditions, more skewed distributions were used. The provided data included

approximately 30 attributes, both numerical and nominal.

Instead of training or generalization error, a different metric was used for

performance estimation. This metric is based on a cost model, which relies on

the sum and average of loss caused by fraud. Thus, the following quantities

have been defined:

or

,where Cost(i) is the cost associated with the transactions i and n is the total

number of transactions. The cost model has been designed with the

contribution of bank employees and constitutes a real life scenario. The model

introduces the concept of overhead, which involves the cost of investigation,

operational resources etc. Under this assumption, if the overhead is greater

than the amount of the transaction, then investigating this case is not of

bank’s interest in any case. Table 2 shows the cost model, where the tranamt

is the amount of the credit card transaction. For privacy reasons, the overhead

threshold has not been disclosed by the bank. The evaluation of the present

studies is based on this cost model.

During the experiment, the desired distribution of 50:50 was used in the

training set. It has been proved through previous experiments that this

distribution leads to a performance improvement. For this purpose, the

majority instances of the data were divided into four partitions and four data

subsets were formed through merging the minority instances with each of the

four partitions containing majority instances. This means that the minority

instances were copied four times to create the desired distribution.



Table 2: Cost model assuming a fixed overhead [17].

The training, validation and test data set contained transactions from first eight

months (10/95–5/96), the ninth month (6/96) and the twelfth month (9/96)

respectively. To each subset of the training set, a learning algorithm(s) was

applied (C4.5, CART, Ripper and Bayes) and thus 128 classifiers were

produced. This process was made in parallel processors in order to save time.

After the creation of the classifiers, they are combined by meta-learning

through stacking (§3.2.4.2). Results are shown in the following paragraph

(Table 3).

Another important issue is that of bridging different database schemata. In

particular, banks often need to exchange their classifiers with specific data

attributes. Though, there is often the case of attributes incompatibility, which

renders the exchanged classifiers useless. Hence, two methods have been

proposed, referring to features with different semantics as well as missing

values.

Furthermore, pretraining and posttraining pruning has been applied in order to

improve system’s accuracy and efficiency. Pretraining pruning filters the

classifiers before they are combined. Finally, based on predefined metrics, the

most competent classifiers are selected for the final meta-classifier. Through,

posttraining pruning, the evaluation and pruning of the initial base classifiers is

performed, after the construction of a complete meta-classifier.

4 .1 .1 .2 .4 .1 .1 .2 .4 .1 .1 .2 .4 .1 .1 .2 . Expe r iment 1 E xpe r imen t 1 E xpe r imen t 1 E xpe r imen t 1 –––– Resu l ts Resu l ts Resu l ts Resu l ts

Table 3 contains the cost and savings from stacking algorithm using 50:50



distribution, the average of individual CART classifiers generated using the

desired distribution (10 classifiers), class combiner using the given distribution

(32 base classifiers—8 months x 4 learning algorithms), and the average of

individual classifiers using the given distribution (the average of 32 classifiers).

COTS5 refers to the current FD bank system.

As it’s concluded, class combiner with a desirable 50:50 distribution

succeeded an important increase in savings. It’s noticeable that when the

overhead is $50, more than half losses were prevented. In addition, when the

overhead is $50, a classifier (Single CART) trained from one month’s data

with the desired distribution achieved significantly more savings than

combining classifiers trained from all eight months’ data with the given

distribution. This reaffirms the importance of employing the appropriate

training class distribution in this domain. Class-combiner also contributed to

the performance improvement.

Table 3:Cost and savings in the credit card fraud domain using class-combiner (cost ± 95%

confidence interval) [17].

Making a comparison between COTS and the aforementioned method might

not be accurate, since the bank adopts a different cost model and maintains

much more training data. Though, COTS provides some information about

how the existing FD system operates in real world. It also appears that with

10:90 distributions, the proposed method reduced the cost significantly more

than COTS, whilst with 1:99 distributions, the method did not outperform

COTS. Both methods did not achieve any savings with 1:999 distributions.

5 Commercial Off-The-Shelf system is a ready-made and available for sale to the general

public system, without the need of customization.



Moreover, the ratio of the overhead amount to the average cost,

R=Overhead/Average cost, indicates whether the above techniques are

effective. As a consequence, the described approach yields better results in

comparison with the COTS for R<6, while both techniques are not effective for

R>24. This implies that under a reasonable cost model with a fixed overhead

cost in challenging transactions as potentially fraudulent, when the number of

fraudulent transactions is a very small percentage of the total, it is financially

undesirable to detect fraud. The loss due to this fraud is yet another cost of

conducting business. In addition, filtering out low-risk transactions –using

fraud detectors based on available customer profile- may reduce a high

overhead-to-loss ratio.

Table 4: Results on knowledge sharing and pruning [17].

Table 4 contains the results on knowledge bridging and pruning for the two

American banks. The size column represents the number of base classifiers

used in the ensemble classification system. The first row indicates the best

possible performance of Chase’ s COTS FD system on this data set. The next

two rows present the performance of the best base classifiers over the entire

set and over a single month’s data, while the last four rows detail the

performance of the unpruned (size of 50 and 110) and pruned metaclassifiers

(size of 32 and 63). The first two of these metaclassifiers combine only

internal (from Chase) base classifiers, while the last two combine both internal

and external (from Chase and First Union) base classifiers. The First Union’s

COTS FD system performance is not available for the experiment.

Table 4 indicates that meta-learning outperforms not only the single-model



approaches, but also the traditional FD systems, at least for the given data

sets. One step further, it appears that the database bridging led to the

improvement of metalearning system performance. What is more, pruning

achieved satisfactory results in computing meta-classifiers.

As a conclusion, this survey indicates that distributed data mining techniques

that combine multiple models result in effective FD. An additional advantage

of the present approach is that these multi-classifiers allow adaptation over

time and out-of-date knowledge is removed. Nevertheless, an important

disadvantage of this experiment is the definition of the desired distribution

according to the cost model, which presupposes the running of laborious

preliminary experiments.

4 .1 .2 .4 .1 .2 .4 .1 .2 .4 .1 .2 . I ns ur ance FDIns ur ance FDIns ur ance FDIns ur ance FD

The particular experiment in insurance domain concerns a Complex Event

Processing (CEP) engine (Figure 26), which applies a combination of ANN

(§3.2.1.2) and discriminant analysis6 techniques. CEP is an advanced

technology for detecting already seen patterns of events and aggregating

them as complex events at a higher level of analysis in real time. Taking into

account that the common practice of insurance fraud experts is the manual

FD, it’s clear that the automation of FD techniques will contribute significantly

to the saving costs.

The following description is based on two scientific papers published by A.

Widder, R. v. Ammon, G. Hagemann, D. Scoenfeld, P. Scaeffer and C. Wolff

[22, 41].

4 .1 .2 .1 .4 .1 .2 .1 .4 .1 .2 .1 .4 .1 .2 .1 . Expe r iment 1 E xpe r imen t 1 E xpe r imen t 1 E xpe r imen t 1 ---- Desc Desc Desc Desc r i p t i onr i p t i onr i p t i onr i p t i on

In insurance companies, fraud experts investigate the various insurance

6 It is a technique for classifying a set of observations (training set) into predefined classes, based on a set of predictors (input variables). It constructs a set of linear functions of the predictors, known as discriminant functions, such that L = b1x1 + b2x2 + .. + bnxn + c , where the b's are discriminant coefficients, the x's are the input variables or predictors & c is a constant. These discriminant functions are used to predict the class of a new observation with unknown class [64].



claims in order to catch fraud cases. From these claim-events, the selection of

the relevant attributes is made for the experiment’s purposes. Examples of

attributes are: the estimated total loss, incident time and loss location, the

personal data of the causer of the loss, the personal data of other ones

involved like claimant and witnesses, the description of the succession of the

incident, the policy period, and the total of previous claim losses due to the

insurant, weather reports at incident time.

As it is mentioned, this FD approach is based on discriminant analysis and

ANN. For the CEP engine (Figure 26), the concept of event represents an

input value of the ANN. This engine creates clusters of events based on

already known historical legitimate and fraud events for specific training

customers.

Figure 26: System architecture of combined discriminant analysis and ANN approach [41].

An event may be fraud or non-fraud according to the relevant attributes, which

help to define to which cluster the event belongs. The values of these

attributes are necessary for the computation of discriminant coefficient and

discriminant function, which is used for allocating a new occurring event into a

specific group of events. Every time a new event occurs, its attribute values

are inserted in the discriminant function. The value of the discriminant function

is compared with a critical value, defined by the historic event clusters. Note



that, the discriminant function is updated every time new discriminant groups

are created and, thus, it has a dynamic form.

At the beginning of the experiment, the CEP engine scans the global event of

clouds of an organization. After the insertion of attribute values into the

discriminant function and the comparison with the critical discriminant value,

the events are allocated into a specific cluster. For every discriminant cluster,

an ANN is produced, where the weights are defined through training with

discriminant values of known legitimate and fraud event patterns. Finally, the

ANN classifies the events to fraud and non-fraud.

The ANN runs for an occurring combination of event discriminant values and

the output is evaluated for distinguishing whether the input events are fraud or

not. During network training with BP algorithm (§3.2.1.2), the known fraud

and non-fraud combinations have the output values 1 and 0 respectively. For

the unknown combinations, a threshold is determined through the training

results. This means that when the output value of an input combination is

greater than the threshold, the system classifies it as fraud and reacts with a

predefined action, e.g. sends an alert. Then, the values of the detected fraud

pattern are used to retrain the ANN, after the expiration of predefined time

interval, depending on the system performance.

4 .1 .2 .2 .4 .1 .2 .2 .4 .1 .2 .2 .4 .1 .2 .2 . Expe r iment 1 E xpe r imen t 1 E xpe r imen t 1 E xpe r imen t 1 ---- R esu l Resu l Resu l Resu l t st st st s

The combination of discriminant analysis and ANN appears to run

successfully for a small set of events with two relevant attributes. In this

experiment, the ANN consists of two input nodes, two hidden nodes and one

output node, which is a rather simple structure. Future experiments should be

carried out involving a more complicated environment in order to confirm the

proper operation of this system based on real time requirements. So, the next

step is the use of a higher number of test and training data sets and historic

events as well as the use of a more advanced ANN environment.



4 .1 .3 .4 .1 .3 .4 .1 .3 .4 .1 .3 . Te l ecommun ica t i onTe l ecommun ica t i onTe l ecommun ica t i onTe l ecommun ica t i on ssss FD FD FD FD

The current experiment addresses fraud cases, which are a combination of

contractual, hacking, technical and procedural fraud (§2.4.4). The applied

technique is the user profiling, where the past behavior of a user is

accumulated in order to create a profile. If future behavior deviates from this

profile, this may imply the existence of fraud and may trigger an alarm to the

Network Operations Centre (NOC).

More than five thousand users for one year are collected through the Call

Detail Record (CDR) of a University PBX. In addition, only outgoing calls and

periods users are not on a leave are taken into account during the experiment.

The caller id, the date and the time of the call, the chargeable duration of the

call, the called party id are some indicative attributes of the experiment. The

accumulated user behavior per day is used as a differentiator measure,

because it doesn’t disclose any private information of the user, such as the

caller id.

The scientific research was performed by Constantinos S. Hilas from

Technological Educational Institute of Serres and John N. Sahalos from

Aristotle University of Thessaloniki [30].

4 .1 .3 .1 .4 .1 .3 .1 .4 .1 .3 .1 .4 .1 .3 .1 . Expe r iment 1 E xpe r imen t 1 E xpe r imen t 1 E xpe r imen t 1 ---- Desc r i p t i on Desc r i p t i on Desc r i p t i on Desc r i p t i on

The precondition of the experiment is to construct a reliable user profile that

will represent the normal behavior. The assumption is that any behavior that

doesn’t exist in historical data is at least suspicious or another user. In order

to compare the user profile with a different or fraudulent user, experts use a

similarity measure. They create an eight-element vector , which contains the

following data: number of calls made to local destinations (loc), the duration of

local calls (locd), the number of calls to mobile destinations (mob), the

duration of mobile calls (mobd), the number of calls to national (nat) and

international (int) destination and their corresponding durations (natd, intd)

(Figure 27). Sequence of more than one vector can be used, but the length of



the sequence must be the same for a single run.

For the similarity measure, the rule of “r-contiguous bits” is used. For example,

if a seq1 has k equal points with two other sequences seq2 and seq3, but the

common points with seq2 are in neighbouring positions then the similarity

(seq1, seq2) is greater than the similarity (seq1, seq3).

Figure 27: The vector of comparison [30].

There are two levels of comparison within the experiment: a) the equality of

the number of the calls of the same category and b) the total call duration per

category.

The first level of similarity can give a score from 0 to m*4, where m is the

number of vector sequences. The second level of similarity exists only if the

corresponding number of calls yields equality and, thus, has possible values

from 0 to m*8. The disadvantage of this measure is that there is often the

case where users make no international calls (nat=0 and natd=0 of the

vector), which increases unduly the similarity measure. For this reason, the

zero number calls were excluded from the computation of similarity.

Furthermore, it’s very rare that the call duration will be equal during

comparison. Hence, according to “equality interval”, two call durations are

considered the same only if the first differs by a ±X percentage with the

second. This practice indicates the fuzziness of the system. Here is the

algorithm followed:

1. Start with k profiles and k test sets,

2. Select the length m of the sequence (seq)

(seq = m*unit vector)

3. For each profile - test pair



Select the first sequence from the test set

Set similarity=0

Compare this sequence with the profile set FOR each position, i, in the sequence length

IF position i holds Number-Of-Calls info THEN

IF seqtest(i)=seqprof(i) AND seqtest(i) ≠0

THEN similarity=similarity+1

record the position, i, of equality

ELSE IF position i holds Duration info THEN

IF current position is next to the previous position

of equality THEN

IF seq(i) ≠0 AND seqtest(i)<=(1+X)*seqprof(i) AND

seqtest(i)>=(1-X)*seqprof(i)

THEN similarity = similarity +1

4. After all positions have been examined return the measure values and

store the maximum value as the highest similarity measure between the first

test sequence and the profile under comparison.

5. Store the vector containing the maximum values resulting after the

repeated comparisons between all sequences from a test set with a profile

test.

6. Repeat for all profile – test set combinations (k2 vectors). In this sense the

similarity of a single sequence, i, drawn from the test set, seqi test, with all the

sequences in the profile set K, seqjЄK, is defined as:

and is the similarity of that sequence with the most similar sequence in the

profile set. Once all the k2 similarity vectors are computed, one can compare

them to make decisions about the similarities between users’ behavior.



4 .1 .3 .2 .4 .1 .3 .2 .4 .1 .3 .2 .4 .1 .3 .2 . Expe r iment 1 E xpe r imen t 1 E xpe r imen t 1 E xpe r imen t 1 ---- R esu l t s Resu l t s Resu l t s Resu l t s

During the experiment, twelve telephone terminals were used and four groups

were formed with three terminals each. For each terminal, a similarity vector

for the last year was created. The terminal’s activity was divided into a training

set (user profile) and a test set at split 2/3 and 1/3 respectively. Finally, each

group of terminals had their similarity vectors computed (Figure 28).

It’s observed that when a user profile is compared with the same user then it

appears to be the same, but when it is compared with different users, it differs

significantly. For the comparison of means of each vector, the Analysis of

Variance (ANOVA)7 test was used. The mean of each vector and the

probability that the corresponding mean value is equal to the mean similarity

of the same user is given in the tables III, IV, V, VI.

Figure 28: An example of similarity vectors of 3 user profile – test sets (Group 1) [30].

7 In statistics, ANOVA is a collection of statistical models and their associated procedures, in

which the observed variance in a particular variable is partitioned into components due to different sources of variation. ANOVAs are useful in comparing three or more means [2].



It is apparent, that the diagonal values of the tables III, IV are the maximum

values of the rows (red rectangles). For the table III, this means that e.g.

comparing user profile 1 with test set 1 gives that both sets refer to the same

user. In contrast, the comparison between tests set 1 and user profile 2

results in lower values of mean similarity vectors, indicating a case of different

users.

As a consequence, this is reflected to the respective probability of equality

between means in table IV, where the diagonal values are 1. This implies that



the higher the probability value, the most probable is that the user profile set

will be the same as the test set, i.e. both sets refer to the same user.

Concerning the Profile 6 of tables V, VI, it appears that all test sets fit equally

well with this user profile (column 3). After further examination, it’s concluded

that this was a public phone being used by many different people and for this

reason the terminal’s behavior fit with anybody’s behavior.

Figure 29: Plot of similarity probability between accounts against the data used in the test set [30]

Moreover, six pairs of completely different accounts were formed in order to

define the minimum size of test data that are necessary to distinguish two

different accounts (Figure 29). At each step of comparison, the size of test

data was incremented by one sequence. In Figure 29, the mean probability for

each step (pdif) is also plotted. The probability of similarity of one user with his

own profile (psim), as the number of sequences is incremented is shown as

well.

Additionally, Figure 29 illustrates that high differentiation between accounts

(<20%, lower dotted line) is achieved after of 8 sequences or else 8*3=24



days. One step further, it appears that 13 sequences (13*3=39 days) are

necessary for identifying on one user, where similarity probability is greater

than 80% (upper dotted line).

Among the advantages of the aforementioned approach is the simplicity, the

protection of private data, the application in different and various fields

(cellular phones, web usage, and intrusion detection) and the transferability to

an ANN (§3.2.1.2) of fuzzy network (§3.2.1.3) implementation. Also, the

differentiation measure between accounts gives a motivation for further

research. However, the approach can be used only for midterm decisions and

not for an online account comparison.



5555 PRACTICAL PERSPECTIVEPRACTICAL PERSPECTIVEPRACTICAL PERSPECTIVEPRACTICAL PERSPECTIVE

The aim of the final chapter is to examine a real FD scenario, using a real

anti-fraud system and an open source machine learning tool as well. After the

analysis of the results, a potential FD solution is suggested as a final

conclusion.

The first section of the present chapter describes the anti-fraud system

operating in a Greek Bank. An available labeled data set based on real data is

given by the Bank in order to perform a number of tests for the thesis

purposes. The second section of this chapter describes a machine learning

tool which applies a number of supervised algorithms to the same labeled

data set. The results of the machine learning tool are compared with the

results of the existing anti-fraud system, given that the latter constitutes a fool-

proof mechanism, which provides a reliable classification. In the end, the aim

is to propose a viable FD, combining the existing FD system of the Bank and

an effective method proceeded from the scientific field.

The type of fraud under consideration is the application fraud (§2.3.1, §2.4.1).

The application fraud refers to loan applications, which contain fake data,

such as identity card number, tax identification number, employer name,

telephone numbers etc. Fraudsters use these data in order to achieve a loan

approval quickly, without having the intention of paying back the installments.

5.1.5.1.5.1.5.1. BankBankBankBank Ant i Ant i Ant i Ant i ---- Fraud SystemFraud SystemFraud SystemFraud System

The Bank has implemented an integrated solution, implemented by a Greek

Company. This anti-fraud solution comprises a German software (§5.1.1),

PRACTICAL

PERSPECTIVE



which is parameterized to each customer, and a web-based case investigation

tool suitable for monitoring and managing the produced alarms in real time,

developed by the Greek company.

Before the description of the system operation, it’s important to analyze the

two basic roles of the Bank, i.e. fraud analysts and investigators. Fraud

analysts belong to the Fraud Department of Bank and their experience in

fraud qualifies them to undertake the decision making issues. They are

responsible for rule analysis, rule design and system maintenance. After the

system configuration and fine-tuning by the Company, analysts perform all the

necessary tasks, such as data analysis and decision model optimization, in

order to adapt to new emerging fraud patterns and to improve system

performance. Fraud analysts are authorized to make final decisions about

whether an application will be rejected or not. Conversely, fraud investigators

monitor and investigate the incoming loan applications, and they are not

allowed to decide about the loan disbursement.

However, during investigation there are two possible scenarios: a) the loan

application contains wrong data due to users' errors and b) the loan

application contains actual fraud data. In the first case, fraud investigators

notify the corresponding user, who corrects the application data and proceeds

with the typical procedures. Finally, the loan application is filed in the bank

system as legitimate. In the second case, after the completion of initial

investigation, fraud investigators ask for analysts' contribution, only if they

have concluded high probability of fraud. Fraud analysts investigate further

the application and they decide about the final outcome.

The anti-fraud solution has been incorporated in the Bank’s system and gives

a real-time indication of fraud. According to the Bank’s demand, it acts at an

early stage of the workflow, even if the application contains missing data. Yet,

authorized users (fraud analysts and investigators) may call the system at any

time they need an updated fraud indication throughout the workflow.

The route followed by an incoming loan application in the Bank system is



described hereupon. A loan application is inserted into the Bank’s workflow

and then the anti-fraud system evaluates the data and sends a real-time

indication to the investigators about whether the application involves fraud or

not. If the evaluation indicates fraud, then the investigator investigates the

case and if necessary, he/she contacts fraud analysts, who take the final

decision. Otherwise, the application is forwarded for approval automatically by

the Bank's workflow. However, in both cases the Bank’s system is updated

with the final result.

The following paragraphs contain an overview of the fraud prevention and

detection logic, implemented by the Greek Company, using the German

software.

5 .1 .1 .5 .1 .1 .5 .1 .1 .5 .1 .1 . The The The The SSSSo f twareo f twareo f twareo f tware

RiskShield is a product family for automated decision-making in financial

engineering and it follows client-server architecture.

Figure 30: RiskShield architecture [79].

The RiskShield Server is usually used as the middle ("service provider") tier of

a multi-tiered architecture [79]. In many RiskShield applications, the following

three tiers are used (Figure 30).

5 .1 .1 .1 .5 .1 .1 .1 .5 .1 .1 .1 .5 .1 .1 .1 . Serv i ce Consumer T i e rSe rv i ce Consumer T i e rSe rv i ce Consumer T i e rSe rv i ce Consumer T i e r

This is where the "consumers" of the decision "service" resides. For payment

systems, the service consumers are the authorisation and card management



systems. For insurance claim processing systems, this is the claims

processing system itself. There can be multiple service consumers in a

RiskShield installation. Communication between the service provider and the

service consumer is provided in XML or CSV format using IP or other

transport layers.

5 .1 .1 .2 .5 .1 .1 .2 .5 .1 .1 .2 .5 .1 .1 .2 . Serv i ce Se rv i ce Se rv i ce Se rv i ce P ro v i de rP rov i de rP rov i de rP rov i de r T i e r T i e r T i e r T i e r

The RiskShield Server has two interfaces. The one to the left is the real time

interface for the actual decisions. The one to the right is a maintenance

interface (non real time) for the RiskShield Client. The maintenance interface

is used only during the maintenance. Connection between the RiskShield

Server and the RiskShield Client is not necessary during operation.

The RiskShield-Server is coded in Java. This allows operating it on virtually

any commercial computer platform, such as IBM-AIX, SUN-Solaris, Linux and

Microsoft Windows Server based environments.

5 .1 .1 .3 .5 .1 .1 .3 .5 .1 .1 .3 .5 .1 .1 .3 . Cl i en tC l i en tC l i en tC l i en t T ie r T ie r T ie r T ie r

The RiskShield Client is a Microsoft Foundation Class based software product

that runs on a MS Windows PC computer. It is used by decision project

designers and analysts to create and adapt decision projects, and to verify the

performance of an operational decision making system. The RiskShield Client

can be used both as a stand-alone software tool and as client tier within a

multi-tiered architecture. The RiskShield Client connects to the RiskShield

Server via IP over the Bank’s Intranet.

The RiskShield Client employs FL (§3.2.1.3, §5.1.2) that allows for the

transparent implementation of complex decision patterns. Furthermore, FL

facilitates a rapid reaction on new fraud schemes and allows the exact

addressing of specific fraud patterns as no complete retraining is necessary.

5 .1 .1 .4 .5 .1 .1 .4 .5 .1 .1 .4 .5 .1 .1 .4 . Commun i ca t io nCommun i ca t io nCommun i ca t io nCommun i ca t io n



The RiskShield Client produces a decision project (§5.1.2.1), which is

uploaded to the Server in order to become operational. The RiskShield Server

then loads the decision project and provides it to service consumers. To

obtain analysis data, the RiskShield-Server can be configured to capture

production data within csv8 files and/or within its embedded database. The

analysis data is downloaded by the RiskShield Client where the actual

analyses are performed offline [80].

During operational usage, the RiskShield-Server is independent of the

RiskShield Client. When started on the Server computer, the RiskShield

Server behaves very much like typical server programs, e.g. an http daemon.

It reads its initialisation file with its configuration, then it loads the decision

projects to be served, initializes the IP ports it shall provide its services at, and

writes its actions into a log file. Once initialized, other software systems, either

located on the same computer or on other computers, can access the

decision computation services via XML messages.

5 .1 .2 .5 .1 .2 .5 .1 .2 .5 .1 .2 . FLFLFLFL So f tware So f tware So f tware So f tware

FL technology (§3.2.1.4) has proven to be a very effective means of modelling

human expertise and is thus an important part of the RiskShield Client. The

actual development of a FL system itself involves a number of design and

verification steps. Hence, this development is not performed within the

RiskShield Client, but uses the separate fuzzyTECH software package, which

is a product of the same German company as well [80].

fuzzyTECH software allows the design of rules providing a user-friendly

interface. It applies all the fundamental concepts of FL, such as linguistics

variables, fuzzification and defuzzification, etc. A simplified FL environment is

given in Figure 31, where an indicative rule block and a variable’s

membership function is given.

8 csv stands for character separated values



The development of a completed fuzzyTECH project presupposes the

creation of the input and output linguistic variables (red rectangles), including

their membership functions, and the creation of rule blocks (blue rectangle),

which contain the “IF-THEN” rules and their weights.

Figure 31: A simplified FL environment – fuzzyTECH software

The Bank’s fuzzyTECH project consists of 327 numerical and categorical

(input and output) variables, which are used in order to build 540 rules totally.

The output variables include, apart from the final decision (§5.1.2.1.5), a

number of intermediate scores for the application FD purposes.

Each fuzzyTECH project is incorporated in the RiskShield Client project

(§5.1.2.1) and the cooperation of both softwares results in an alarm for each

Input Linguistic Variables

Output Linguistic Variables



incoming loan application. This alarm is the product of the computations

among input and output variables taking place in both fuzzyTECH and

RiskShield Client.

5 .1 .2 .1 .5 .1 .2 .1 .5 .1 .2 .1 .5 .1 .2 .1 . Ris kSh i e ld P ro j ec tR is kSh i e ld P ro j ec tR is kSh i e ld P ro j ec tR is kSh i e ld P ro j ec t

A typical RiskShield project carries the entire fraud prevention and detection

logic, implemented by the Company and the fraud analysts of the Bank

(Figure 32).

RiskShield Client environment is divided into three sections. The first section

(red rectangle) contains all the input and output variables-attributes of the

decision logic (e.g. customer name, identification number, age, tax

identification number, final decision etc.). The columns of the second section

(blue rectangle) are computational modules which constitute the decision

logic. These are the so-called calculation units (§5.1.2.1.3). Because most

advanced decision logic applications require the clever combination of

different decision modeling techniques, each such technique is encapsulated

in a different type of calculation unit [80]. One of these columns comprises

the set of the FL rules (§5.1.2). The third section (green rectangle) consists of

the transaction data, i.e. the data of each customer’s incoming loan

application.

The most fundamental concepts of the development of the RiskShield Client

project follow in the next paragraphs.

5 . 1 . 2 . 1 . 1 .5 . 1 . 2 . 1 . 1 .5 . 1 . 2 . 1 . 1 .5 . 1 . 2 . 1 . 1 . I n pu t Va r i ab l esIn pu t Va r i ab l esIn pu t Va r i ab l esIn pu t Va r i ab l es

At the first stage of the project development, the Bank selected all the

necessary data attributes-variables contained in its database. In particular,

fraud analysts suggested to the Company a set of variables, which were used

for the creation of calculation units and thus for the design of the fuzzy rules.

The final RiskShield project contains four input variable types: a) continuous,

such as loan amount, b) text, such as customer name, c) categorical, such as



old or current types of identification card numbers and d) fingerprints.

Fingerprint plays an important role in the decision logic and thus is described

in more detail.

5 . 1 . 2 . 1 . 2 .5 . 1 . 2 . 1 . 2 .5 . 1 . 2 . 1 . 2 .5 . 1 . 2 . 1 . 2 . F i ng e rp r i n tsF i ng e rp r i n tsF i ng e rp r i n tsF i ng e rp r i n ts

Fingerprint is a special variable type that the RiskShield Server stores in-

between requests representing for instance sequences of previous

transactions of the same entity, events or profiles (Figure 33). Each fingerprint

has a single (or a combination of) key variable(s), which declare the specific

entity. RiskShield Server stores these fingerprints in its embedded database

[79].

Figure 32: RiskShield-Client project

The concept of fingerprint becomes more clear with the following example:

Assuming that FingerprintTAXid has the tax identification number as a key

variable and that on day 1, customer A with tax identification number

123123123 makes a loan application in the Bank. This application is stored in

Variables Calculation Units Loan Application Data



the FingerprintTAXid. Afterwards, on day 2, customer B makes another

application and uses the same tax identification number (123123132).

Similarly, the second application will be stored in the FingerprintTAXid, which

already contains the first application of customer A. This will lead to a conflict

between data and will probably indicate fraud.

Of course, there are as many fingerprints as the numbers of the specific

entities, e.g. Tax identification numbers. It's important to mention, that the

choice of the fingerprint and thus its keys is not made at random. Instead,

fraud experts choose attributes which characterize uniquely each customer.

As a consequence, a fingerprint holds the historical data of a transaction and

creates a more complete behavioral profile of each customer as time goes by.

In a simplified point of view, the comparison of the aforementioned

accumulated history with the current transaction data leads to the final

decision for the incoming loan application.

Figure 33 shows a typical fingerprint which contains two past loan applications

(with application code 44091 and 29916), bearing the same tax identification

number.

Figure 33: Fingerprint



5 . 1 . 2 . 1 . 3 .5 . 1 . 2 . 1 . 3 .5 . 1 . 2 . 1 . 3 .5 . 1 . 2 . 1 . 3 . Cal cu l a t i o n Un i tsCa l cu l a t i o n Un i tsCa l cu l a t i o n Un i tsCa l cu l a t i o n Un i ts

Consequently, RiskShield Client renders decision logics as sequences of so-

called calculation units. Within this sequence, each calculation unit constitutes

a “universe of its own”. Each calculation unit contains its own definition of

variables and variable types, and its own configuration of settings [80].

In other words, calculation units are independent “plug-ins” that compute new

data (output variables) from the existing data (input variables). Most of the

RiskShield Client projects use multiple and different types of calculation units

[80].

In the RiskShield project spreadsheet, each calculation unit is visualized as a

column, which shows the input and output variables of the calculation unit,

and how they are connected to the RiskShield Client variables and the other

calculation unit’s variables. fuzzyTECH project constitutes one of these

columns.

The processing order is strict from left to right in the project spreadsheet. This

means that a pre-processing calculation unit must be arranged left to a post-

processing calculation unit [80].

5 . 1 . 2 . 1 . 4 .5 . 1 . 2 . 1 . 4 .5 . 1 . 2 . 1 . 4 .5 . 1 . 2 . 1 . 4 . Ou tpu t V a r i ab l esOu tpu t V a r i ab l esOu tpu t V a r i ab l esOu tpu t V a r i ab l es

The output variables constitute the result of each calculation unit. Apart from

the Decision variable (§5.1.2.1.5), there is a number of continuous, text,

categorical or fingerprint as in the case of output variables. A special case of

output variables is the variables which are used for the rule design in the

fuzzyTECH, i.e. rule variables.

5 . 1 . 2 . 1 . 5 .5 . 1 . 2 . 1 . 5 .5 . 1 . 2 . 1 . 5 .5 . 1 . 2 . 1 . 5 . Dec is i on Va r i ab l eDec is i on Va r i ab l eDec is i on Va r i ab l eDec is i on Va r i ab l e

This is the final output variable of the RiskShield project and it’s the outcome

of the combination of both RiskShield project and fuzzyTECH project (Figure

32 – yellow rectangle). According to the Bank's indications, the Decision of the



project is not a crispy variable (fraud-legitimate), but instead it is separated in

different levels of alerts, such as: accepted, suspicious, low risky and

extremely risky.

The Decision is a numeric variable and its possible values are 0, 1, 2, 3 for

‘accepted’, ‘low_risky’, ‘suspicious’, and ‘extr_risky’ respectively. Note that the

Decision values are discrete and characterize each loan application

individually.

5 . 1 . 2 . 1 . 6 .5 . 1 . 2 . 1 . 6 .5 . 1 . 2 . 1 . 6 .5 . 1 . 2 . 1 . 6 . Case Managemen tCase Managemen tCase Managemen tCase Managemen t

Depending on the different levels of alerts, the Bank can manage each

application in multiple ways. As it’s mentioned in the beginning of this chapter,

the case management is performed through a case investigation tool,

developed by the Company. In specific, loan applications with the ‘accepted’

flag are forwarded for approval by the Bank’s workflow. Applications with the

'suspicious' or ‘low-risky’ flag are examined by the investigators and if there is

a need, they are forwarded for further examination to the Fraud Department.

Applications with the ‘extr_risky’ flag are rejected by the fraud analysts after

thorough investigation. Next, the approval authority evaluates the application

and the usual approval process can be resumed right after fraud investigator's

declassification.

The next paragraph introduces a machine learning tool used for the thesis

experiments.

5.2.5.2.5.2.5.2. WWWWaikato a ikato a ikato a ikato EEEEnv ironment for nv ironment for nv ironment for nv ironment for KKKKnowledge nowledge nowledge nowledge

AAAAnalysis (WEKA)nalysis (WEKA)nalysis (WEKA)nalysis (WEKA)

WEKA is a popular suite of machine learning free software written in Java,

developed at University of Waikato in New Zealand. Its advantages include

the great variety of machine learning algorithms and the user-friendly

environment.



In the following paragraphs, only the tools of WEKA used for the thesis

experiments are described. These tools are comprised in the “Explorer”

selection of WEKA GUI Chooser. The experiments were carried out with the

version 3.6.2 of WEKA.

5 .2 .1 .5 .2 .1 .5 .2 .1 .5 .2 .1 . PreprocessPreprocessPreprocessPreprocess

This is the first tab of the “Explorer” environment. The Preprocess tool is

necessary for loading the available data sets in order to run the appropriate

algorithms. There are, also, filters which are algorithms to transform the

datasets, by removing or adding attributes, resampling the dataset, removing

examples and so on [81].

Figure 34: Preprocess tab of WEKA

As it’s shown from Figure 34, the “Current relation” box is the currently loaded

data, which can be interpreted as a single relational table in database



terminology. The name of the loaded file, the total number of records and the

total number of attributes (variables or features) in the data is given by the

Relation, Instances and Attributes entries respectively. The “Attributes” box

lists all the attributes of the loaded data sample. There is also the choice of

removing them through the Remove button. Finally, the “Selected attribute”

box contains the aggregate results for every selected data attribute of

“Attributes” section. In addition, the user must choose the variable which will

be the class label for the supervised algorithms, i.e. Decision, which histogram

is given in the following figure. With Visualize all button, the histograms of all

attributes appear.

5 .2 .1 .1 .5 .2 .1 .1 .5 .2 .1 .1 .5 .2 .1 .1 . Da ta s e tDa ta s e tDa ta s e tDa ta s e t

The data set is a very basic concept of the current machine learning tool. A

dataset is roughly equivalent to a two-dimensional spreadsheet or database

table, which consists of a number of numerical (a real or integer number),

nominal (one of a predefined list of values) or string (an arbitrary long list of

characters) attributes. Date/time attributes types are also supported.

Figure 35: arff file

WEKA data sets have a special structure, which is shown in Figure 35. This is

a typical Attribute-Relation File Format (arff file). Apart from arff files, WEKA

accepts alternative formats such as csv, c4.5, binary or data from databases



through JDBC.

The arff files consist of two parts: the first part is the header and the second

part is the actual data. The header part contains information about the name

of the dataset (@relation application_fraud), as well as a list of data attributes

and their data types (e.g. @attribute tax_id numeric). The data part has the

form {<class label>,<value1>,…<value n>} and {<value1>,…,<value n>} for

supervised and unsupervised machine learning respectively. So, the

sequence {suspicious, LPR013470, DU0093840, 7327Α∆ΠΑΡΑ, 1257320, 26,

60172089} indicates a data record or a loan application which is labeled as

‘suspicious’ and has ‘LPR013470’ as application code, ‘DU0093840’ as

customer number, ‘7327Α∆ΠΑΡΑ’ as identification card number, ‘1257320’ as

tax_id, ‘26’ as customer age and ‘60172089’ as customer phone number.

5 .2 .2 .5 .2 .2 .5 .2 .2 .5 .2 .2 . C l ass i f i ca t io nC l ass i f i ca t io nC l ass i f i ca t io nC l ass i f i ca t io n

The “Classify” tab of the “Explorer” selection is used for the training of a

machine learning algorithm (classifier), based on the available data sets, so

that it can be used for the classification of additional data samples.

As it’s shown from Figure 36, the “Classifier” box contains the name of the

currently selected classifier and its options. The result of running the selected

classifier will be tested according to the options that are set in the “Test

Options” box. Furthermore, the classifiers in WEKA are designed to be trained

to predict a single ‘class’ attribute, which is the target for prediction. Some

classifiers can only learn nominal classes; others can only learn numeric

classes (regression problems); still others can learn both. The “Classifier

output” box comprises all the aggregate results of a specific classifier

application. The “Result list” box contains several entries, after the training of

various classifiers.

Some of the WEKA classifiers are the: DT (§3.2.1.1), NB networks (§3.2.1.5),

logistic regression (§3.2.2), C4.5 (§3.2.1.1.1), SVM (§3.2.1.6) etc. Apart from

these, WEKA implements some classifiers ensembles such as Bagging



(§3.2.4.1), Stacking (§3.2.4.2), Boosting (§3.2.4.3) etc.

Figure 36: Classify tab of WEKA – The results of the application ZeroR classifier are shown on the

right part

5 .2 .2 .1 .5 .2 .2 .1 .5 .2 .2 .1 .5 .2 .2 .1 . Per f o rmance Pe r f o rmance Pe r f o rmance Pe r f o rmance MMMMet r i c set r i c set r i c set r i c s

Similar to §2.9, the performance metrics of WEKA supervised algorithms are

described hereupon. These metrics are calculated for each applied classifier

and they constitute a measure of comparison.

Each classification algorithm of WEKA results in a confusion matrix

(contingency table). In a typical binary classification problem (fraud-

legitimate), the confusion matrix is 2x2 and has the following form of Table 5.

This table indicates how many instances have been assigned to each class.

Elements show the number of test examples whose actual class is the row

and whose predicted class is the column [81].

The TP rate is the proportion of examples which were classified as class x,

among all examples which truly have class x, i.e. how much part of the class

was captured [81]. Otherwise, these examples are FN.



Fraud Legitimate

# class fraud True Positive (TP) False Negative (FN)

# class legitimate False Positive (FP) True Negative (TN)

Table 5: Confusion matrix for binary problems

The FP rate is the proportion of examples which were classified as class x,

but belong to a different class, among all examples which are not of class x. In

the matrix, this is the column sum of class x minus the diagonal element,

divided by the rows sums of all other classes [81]. Otherwise, these examples

are TN.

Nevertheless, samples A, B, C, D of the experiments (§5.2.3) contain a four-

level class label, which implies that the produced matrices will be 4x4 for each

of the algorithm.

Based on the aforementioned concepts, after the implementation of the

classifiers for the thesis purposes, the following metrics are recorded

(Appendix A):

� Precision=TP/(TP+FP): It is the proportion of the examples which truly have

class x among all those which were classified as class x.

� Recall=TP/(TP+FN): In the confusion matrix, this is the diagonal element

divided by the sum over the relevant row.

� Accuracy=(TP+TN)/(TP+FN+TN+FP): It represents the percentage of

correctly classified instances.

� Error=100%-Accuracy

� F-Measure=2*Precision*Recall/(Precision+Recall): It is a combined

measure for precision and recall.

Prediction values

Actual values



� ROC area: This measure (§2.9.1) can be interpreted as the probability

that when we randomly pick one positive and one negative example, the

classifier will assign a higher score to the positive example than to the

negative [2].

� Correlation coefficient: It is a measure of the interdependence of two

random variables that ranges in value from −1 to +1, indicating perfect

negative correlation at −1, absence of correlation at zero, and perfect

positive correlation at +1.

� Mean absolute error: It is a quantity used to measure how close

predictions are to the eventual outcomes. As the name suggests, the

mean absolute error is an average of the absolute errors ei = fi − yi,

where fi is the prediction and yi the true value [2].

� Root mean squared error: It is a good measure of precision and reflects

the differences between values predicted by a model and the values

actually observed from the thing being modelled or estimated. The root

mean squared error Ei of an individual program i is evaluated by the

equation where P(ij) is the value predicted by

the individual program i for sample case j (out of n sample cases); and Tj

is the target value for sample case j. For a perfect fit, P(ij) = Tj and Ei = 0.

So, the Ei index ranges from 0 to infinity, with 0 corresponding to the ideal

[82].

� Root relative squared error: It is relative to what it would have been if a

simple predictor had been used. This simple predictor is just the average

of the actual values. Thus, the relative squared error takes the total

squared error and normalizes it by dividing by the total squared error of

the simple predictor. By taking the square root of the relative squared

error one reduces the error to the same dimensions as the quantity being

predicted [82].

� Relative absoluter error: The relative absolute error Ei of an individual



program i is evaluated by the following equation:

where P(ij) is the value predicted by the individual

program i for sample case j (out of n sample cases), Tj is the target value

for sample case j and is given by the formula:

For a perfect fit, the numerator is equal to 0 and Ei = 0. So, the Ei index

ranges from 0 to infinity, with 0 corresponding to the ideal [92]. It gives an

idea of scale of error compared to how variable the actual values are.

5 .2 .3 .5 .2 .3 .5 .2 .3 .5 .2 .3 . Exper imen tsExper imen tsExper imen tsExper imen ts

Bank provided 1264 data records (loan applications) processed from the test

and production phase of the anti-fraud system. These records were loaded to

the corresponding RiskShield project and, as a result of the calculation units

plus the rules operation, a decision label (alarm) was generated for each

single record. Based on the alert levels, the data distribution was the

following: 92% accepted, 1% low_risky, 6,4% suspicious and 0.8% extr_risky.

Although the number of data instances is not adequate, the problem of

skewed distribution becomes apparent.

Next, these labeled records were exported from RiskShield in order to be

used in WEKA tool. The selected variables, during export, concerned input

and output variables (continuous, date or categorical data type), which affect

the rules and contribute to an effective classification. The exported records

were loaded to the WEKA tool for running a number of classification

algorithms. After the loading of data samples and the selection of Decision

attribute as a class label in the “Preprocess” tab, the histogram shows the

same distribution as in RiskShield Client case.

Filters were used in order to convert the numerical values of Decision attribute

to nominal values, as well as to remove some data records, in order to

improve the distribution.

Finally, four types of samples were formed with the exported records for the



experiments’ purposes:

Sample A contains 1264 data records with 145 attributes (input variables of

RiskShield project) with nominal class label, i.e. Decision values are

‘accepted’, ‘low_risky’, ‘suspicious’, ‘extr_risky’.

Sample B contains 1264 data records with 145 attributes (input variables of

RiskShield project) with numeric class label, i.e. Decision values are 0, 1, 2, 3.

Sample C contains 1264 data records with 439 attributes (input variables plus

the rule variables of RiskShield project) with nominal class label.

Sample D contains 1264 data records with 439 attributes (input variables plus

the rule variables of RiskShield project) with numeric class label.

For each of the above sample, a stratified cross-validation of 10 folds is

selected for the training, in the “Classify” tool. This means, that data are

randomly broken into 10 record sets of size 1264/10, then the training is

performed on 9 sets and the test on 1 set and this procedure is repeated 10

times. At the end, an average performance from the individual experiments

was calculated. Thus, every record took part once in the test set and nine

times in the training set. The reason for choosing 10 partitions is that this

yields the same error rate as if the entire data set had been used for training.

5.3.5.3.5.3.5.3. Resul tsResul tsResul tsResul ts

Upon the completion of the previous settings, a set of supervised algorithms

was applied to each of the samples A, B, C and D. The aforementioned

metrics (§5.2.2.1) of indicative algorithms are given in detail in Appendix A.

Moreover, the tables of Appendix B contain the aggregate results for all

samples. They show the running time, accuracy and relative absolute error for

samples A, B, C and D for 10-fold cross validation for all classification

algorithms, which run during the thesis experiments.

Additionally, the following comparative diagrams illustrate the accuracy and



relative absolute error of some indicative algorithms, such as J48, SMO,

AdaBoostM1, Naïve Bayes, Decision Tables, EnsembleSelection and Bagging

algorithm for samples A,B,C and D. In the end of this paragraph, the overall

comparative diagram includes the accuracies of the previous algorithms. As

it’s shown, only the EnsembleSelection, Bagging and DecisionTable run for all

samples, since they accept both nominal and numerical class label.

According to the results of Appendix A, it’s apparent that the increase in

feature space from sample A (or B) to sample C (or D) affects significantly the

time taken to build the model for the same number of instances. The

comparative diagram indicates that the LogitBoost and DecisionTable, which

exhibit a better accuracy in comparison with the rest of the algorithms, show

also a normal running time.

SMO proved to be very time-consuming algorithm and for this reason the data

records reduced to 764 for both samples. Yet, the time taken to build the

model was still long and the accuracy low in comparison with the rest of the

algorithms for both samples. Conversely, NaiveBayes run in a very short time,

but it resulted in very low accuracy for both samples as well.

In case of EnsembleSelection, Bagging and J48, it’s concluded that there is

no difference in accuracy between samples A and C for the same number of

instances.

Furthermore, taking the confusion matrices of the aforementioned algorithms

(Appendix A) into consideration, it seems that all algorithms have the

tendency to classify almost all instances as ‘accepted’, which is not a safe

practice in real-life anti-fraud systems. Referring to the Bank’s strategy and

the confusion matrix of J48 algorithm, all 1264 loan applications would be

approved, which would pose a serious threat for the institution. However, the

classifiers’ behaviour was presumable, given the limited number of available

instances (only 1264) and the skewed distribution of all data samples.

The following paragraph describes an alternative solution which would lead to



a more reliable FD, as a final conclusion for the present paper.

91,693% 91,693%

98,6723% 98,6723%

80%

83%

86%

89%

92%

95%

98%

Accu

racy/E

rro

r (%

)

Accuracy (%) Relative Absolute Error (%)

J48J48J48J48

sample A

sample C

86,6317% 87,156%

213,8027%

214,5457%

80%

83%

86%

89%

92%

95%

98%

Ac

cu

rac

y/E

rro

r (%

)


SMOSMOSMOSMO

sample A

sample C

91,693%92,959%

97,5671%

104%

80%

83%

86%

89%

92%

95%

98%

Accu

racy/E

rro

r (%

)

Accuracy (%) Relative Absolute Error

(%)

AdaBoostM1AdaBoostM1AdaBoostM1AdaBoostM1

sample A

sample C



91,5348%

98,0222%

89,3212%

25,685%

80%

83%

86%

89%

92%

95%

98%

Accu

racy/E

rro

r (%

)


LogitBoostLogitBoostLogitBoostLogitBoost

sample A

sample C

84,8892%

87,1835%

97,5098%

82,6355%

80%

83%

86%

89%

92%

95%

98%

Accura

cy/E

rror (%

)


NaiveBayesNaiveBayesNaiveBayesNaiveBayes

sample A

sample C

91,6139%

97,3892%

112,4373%

32,7%

93,5832%

9,7337%

80%

83%

86%

89%

92%

95%

98%

Accura

cy/E

rror

(%)


DecisionTableDecisionTableDecisionTableDecisionTable

sample A

sample C

sample B

sample D



91,93%91,93%

95,6%95,6%

93,5%

93,5%

80%

83%

86%

89%

92%

95%

98%A

ccura

cy/E

rror

(%)


EnsembleSelectionEnsembleSelectionEnsembleSelectionEnsembleSelection

sample A

sample C

sample B

sample D

92,17%92,17%92,7%92,7%

92,464%

92,464%

80%

83%

86%

89%

92%

95%

98%

Accura

cy/E

rror

(%)


BaggingBaggingBaggingBagging

sample A

sample C

sample B

sample D

91,69%

SMO LogitBoost NaiveBayes Bagging

80%

83%

86%

89%

92%

95%

98%

Ac

cu

rac

y(%

)

J48 AdaBoostM1 DecisionTable EnsembleSelection

Comparative DiagramComparative DiagramComparative DiagramComparative Diagram

sample A sample C

91,69%

86,63% 87,16%

91,69%92,96%

91,69% 91,535%

98,02%97,39%

84,89%87,18%

91,93% 91,93%92,17% 92,17%



5.4.5.4.5.4.5.4. ConclusionsConc lusionsConc lusionsConc lusions & Future Work & Future Work & Future Work & Future Work

As it is mentioned, the above results indicate that there is no classification

algorithm with the sufficient accuracy, which can be used independently in the

case of FD of the Bank. Moreover, WEKA treats all types of classification

errors equally, which is not a desirable approach for the real application fraud

scenario. Classifying a loan application incorrectly as ‘accepted’ may damage

the Bank’s credibility; especially when this application involves a high amount

of money.

Hence, using WEKA as a stand-alone tool is not an effective solution for the

for the loan application scenario of the Bank. Instead, cost-sensitive classifiers

in combination with the existing anti-fraud system should be employed in

order to detect fraud effectively.

At this point, a novel and promising technique for combining outlier detection

algorithms is presented. This feature bagging (§3.2.4.1) approach is

developed by Aleksandar Lazarevic and Vipin Kumar from University of

Minnesota [84].

The outlier detection algorithms used for these experiments are based on

computing the full dimensional distances of the points from one another as

well as on computing the densities of local neighborhoods. The Density Based

Local Outlier Factor Detection has been finally used, due to its satisfactory

prediction performance.

In this method, each data example acquires a degree of being outlier

(§3.2.3.1), which is called Local Outlier Factor (LOF). Thus, data with high

LOF are more possible to be outliers. Referring to Figure 37, despite the

different densities of clusters C1 and C2, LOF approach recognizes both p2

and p3 as outliers, because it considers the density around the points.



Figure 37: LOF proposed solution [84].

The procedure for combining different outlier detection algorithm takes place

in a series of T rounds. The outlier detection algorithm runs in every round t

with different set of features Ft, used for distance computation. The number of

selected features (Nt) is randomly chosen and ranges from d/2 to d-1, where d

is the number of features in original data set. When the number of features Nt

in Ft is selected, Nt features are randomly selected without replacement from

the original feature set.

Thus, various outlier score vectors ASt are produced by each outlier detection

algorithm. These vectors indicate the probability of each data record from the

data set S being an outlier. Finally, the outcome of T rounds is the generation

of T outlier score vectors for each outlier detection algorithm. Using the

COMBINE function, the various outlier score vectors are combined into a

uniquely anomaly score vector ASfinal. This vector assigns a final probability of

being an outlier to every data records of the original data set. Figure 38

displays this general framework of combining the outlier detection algorithms.

The experiments have been carried out on synthetic and real life data sets

with different percentage of outliers, different sizes and different number of

features for providing a diverse test bed. Throughout the experiments, the

single LOF compared with the method of combining outlier detection

algorithms.



Figure 38: The general framework for combining outlier detection techniques [84].

Feature bagging methods in real life data sets outperformed single LOF outlier

detection. In case of synthetic data, these combining methods are able to

alleviate the effect of noisy features and outperform the single LOF as well,

but only till a certain level. Generally, these methods decrease the influence of

irrelevant features in the data sets and thus improve the detection

performance. This decrease is rather small if the number of irrelevant features

is much greater than the number of relevant features in the data set. When all

features are relevant, the detection performance of combining methods

deteriorates.

The main advantage of the proposed feature bagging methods is that they

exploit benefits from combining multiple outputs of separate individual

predictions through focusing on smaller feature projections. In addition, the

proposed framework allows various combinations of any outlier detection

algorithms and this indicates its usefulness in real-life scenarios. However,

future work should be done in order to experiment with high dimensional

databases, new combining algorithms and not only distance-based outlier



detection approaches.

However, in real life scenarios there is no ideal FD system, which produces

zero false alarm rates. Problems such as skewed distributions, noisy data,

inadequate training, non-uniform cost per error, evolving fraud patterns and

unknown misclassification costs are encountered on a daily basis.

The current paper can be further extended to propose a novel FD system, as

a product of fusion of the Bank anti-fraud software (§5.1.1) or similar

commercial product and the aforementioned technique for combining outlier

detection algorithms, which would operate in parallel for improved

performance. As no modern FD method is panacea, the proposed system

aims at combining the strengths and alleviating the weaknesses of each

individual method, resulting in a viable FD solution and reinforcing the prestige

of the particular institution.



REFERENCES

1] National Check Fraud Center

2] Wikipedia, the Free Encyclopedia

3] Spamlaws community

4] BustThief.com, “Protecting you and your finances”

5] Identity theft detection

6]J.P. Morgan's Treasury Services

7]http://www.wantagh.li/spin/telecommunications_fraud.pdf

8] Veridian Credit Union

9] Investopedia, “The Pioneers of financial fraud”

10] Parilman & Associates, A national law firm, resource4quitam.com

11] David J. Hand, “Statistical techniques for fraud detection, prevention and evaluation”, Imperial College London, September 2007

12]Bank Systems and Technology

13]SC magazine for IT security professionals

14] Subex global provider of Operations Support Systems,”GlobalFraud Loss Survey”, December 2009

15] Finextra, “European ATM fraud losses tumble – East“, April 2010,

16]Payments Cards and Mobiles magazine, “Inside Fraud-Sponsored by VISA”

17]Philip K. Chan, Florida Institute of Technology, Wei Fan, Andreas L. Prodromidis, and Salvatore J. Stolfo, Columbia University, “Distributed Data Mining in Credit Card Fraud Detection”, November/December 1999

18]Panos Sarafidis, DIENEKIS S.A., “Εισαγωγή στο ΙRIS - Πρόληψη της απάτης στα συστήµατα επεξεργασίας ηλεκτρονικών πληρωµών INFORM GZS “, 2004



19] Panos Sarafidis, DIENEKIS SA, “Exploring the NEURO-FUZZY from Theory to Practice”, January 2009

20]Online Cyber Safety website

21]Australian Competition & Consumer Commission, Scam Watch

22] Alexander Widder, Rainer v. Ammon, Gerit Hagemann, Dirk Schönfeld, “An Approach for Automatic Fraud Detection in the Insurance Domain”, Association for the Advancement of Artificial Intelligence, 2009

23] Association of Certified Fraud Examiners (ACFE), “2008 Report to the Nation on Occupational Fraud & Abuse”, 2008

24]Association of Certified Fraud Examiners, “2006 Report to the Nation on Occupational Fraud & Abuse“, 2006

25] Association of Certified Fraud Examiners, “Report to the Nation on Occupational Fraud & Abuse”, 2010 Global Fraud Study

26]Hoax Slayer website

27] Oliver Sylvester, University of Exeter, “Transactional Credit Card Fraud”

28]“Kathimerini” newspaper, “Η τεχνη της απάτης και ο τζίρος της”, 2006 http://portal.kathimerini.gr

29]CNN Money.com, “Health care: A 'goldmine' for fraudsters”, January 2010

30]Constantinos S. Hilas, John N. Sahalos, “User Profiling for Fraud Detection in Telecommunication Networks”, 5th International Conference of Technology and Autmation (ICTA) 2005

31] Thomas J. Winn Jr., State Auditor’s Office, Austin, Texas, “Fraud Detection – A Primer for SAS Programmers”, http://www.sas.com/

32]Richard J. Bolton, David J. Hand, “Statistical Fraud Detection: A Review”, Statistical Science 2002, Vol. 17, No. 3, 235–255

33] Philip K. Chan, Salvatore J. Stolfo, “Toward Scalable Learning with Non-Uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection”, March 1998

34]Parilman & Assiciates website, Phillips National Injury Group

35]Fraud Aid, Fraud Victim Advocacy website

36]Jörn Dinkla, Dipl.-Inform., “Artificial Intelligence and Fraud Detection/ Fraud Management”



37]The United States Department of Justice website

38]V.Dheepa1, Dr. R.Dhanapal, “Analysis of Credit Card Fraud Detection Methods”, International Journal of Recent Trends in Engineering, Vol 2, No. 3, November 2009

39] Michael H. Cahill, Diane Lambert, Jose C. Pinheiro, Don X. Sun, “Detecting Fraud in the Real World”, Computational Cybersecurity in Compromised Environments C3E

40] Herman Verrelst, Ellen Lerouge, Yves Moreau, Joos Vandewalle, Christof Störmann, Peter Burge, “A rule based and neural network system for fraud detection in mobile communications”, Advanced Security for Personal Communications Technologies (ASPeCT)

41]Alexander Widder, Rainer v. Ammon, Philippe Schaeffer, Christian Wolff, “Combining Discriminant Analysis and Neural Networks for Fraud Detection on the Base of Complex Event Processing”, The second International Conference on Distributed Event-Based Systems, Rome-Italy, July 1st-4th 2008

42]Noara Foiatto, Christine Tessele Nodari, João Miguel Lac Roehe, Marcus Vinicius Viegas Pinto, “AUTOMATIZATION OF TAMPERING IDENTIFICATION IN INDUCTION ELECTRICAL POWER METERS”, XIX IMEKO World Congress Fundamental and Applied Metrology, Lisbon, Portugal, September 6−11, 2009

43] C. Muniz, M. Vellasco, R. Tanscheit, K. Figueiredo, “A Neuro-fuzzy System for Fraud Detection in Electricity Distribution”, Computational Intelligence Lab., Department of Electrical Engineering, Pontifical Catholic University of Rio de Janeiro Rio de Janeiro, Brazil, IFSA-EUSFLAT, 2009

44] Phil Gosset, Mark Hyland, “Classification, Detection and Prosecution of Fraud on Mobile Networks”, Katholikie Universiteit Leuven

45] Nathan Kurtz, “Securing A Mobile Telecommunications Network From Internal Fraud”, SANS Institute InfoSec Reading Room, 2002

46] Internet Crime Complaint Center (IC3), “2009 Internet Crime Report”,

47] Coalition Against Insurance Fraud website

48] National Fraud Authority (NFA), “Annual fraud indicator”, January 2010

49] Identity Theft Protection, “Identity Theft Statistics”, 2009

50] “Simerini” newspaper, “Έξαρση στις απάτες στα ΑΤΜ και στο διαδίκτυο”, 2009

51] Kroll consulting company, “Global Fraud Report”, Annual Edition



2009/2010,

52] Richard J. Sullivan, “THE CHANGING NATURE OF U.S. CARD PAYMENT FRAUD: ISSUES FOR INDUSTRY AND PUBLIC POLICY”, Presentation for Workshop on the Economics of Information Security Harvard University. May 21, 2010

53] “The New INKA” website

54] 419 Unit of Ultrascan Advanced Global Investigations (AGI), “419 Advance Fee Fraud The World’s Most Successful Scam”, January 2007

55]Statsoft, Electronic Statistics Textbook, website

56] Usama Fayyad, Gregory Piatetsky – Shapiro, Padhraic Smyth “The KDD Process for Extracting Useful Knowledge from Volumes of Data”, COMMUNICATIONS OF THE ACM November 1996, Vol. 39, No. 11 27

57] World Scientific Books, “Knowledge Discovery and Data Mining: Concepts and Fundamental Aspects”,Book: “Decomposition Methodology for Knowledge Discovery and Data Mining: -Chapter 1

58] www.kddnuggets.com

59] Nikos Pelekis, Yannis Theothoridis, “Data Warehousing & Data Mining”, University of Peraeus, Information Systems Lab

60] Frank Keller, “Evaluation Connectionist and Statistical Language Processing”, Computerlinguistik Universitat des Saarlandes

61] SmartSoft, Banking Risk Solutions, website

62] University of Regina, Department of Computer Science website

63] Thales Sehn Korting, “C4.5 algorithm and Multivariate Decision Trees”, Image Processing Division, National Institute for Space Research – INPE S˜ao Jos´e dos Campos – SP, Brazil

64] Resampling Stats website

65]DTREG Software For Predictive Modeling and Forecasting

66] NeuroDimension company website

67] Computer Science Ben Gurion University of Negev website

68] Svetlana Cherednichenko, “Outlier Detection in Clustering”, University of Joensuu Department of Computer Science Master’s Thesis

69] Christophe Giraud-Carrier, “Metalearning - A Tutorial”, The Seventh



International Conference on Machine Learning and Applications (ICMLA'08), December 2008

70] School of Computer Science Carnegie Mellon website, “Classifier Ensembles”

71] Online Guards, Identity Protection Company website

72] Universite Libre de Bruxelles, Department d' Informatique, Presentation: “Boosting Methods”

73]Toby Ord, “Degrees of Truth, Degrees of Falsity”, British Academy Postdoctoral Fellow, Department of Philosophy, University of Oxford

74] “FUZZY LOGIC and ITS USES, ARTICLE 2 Fuzzy Logic Introduction”, Imperial College of London

75] The free online dictionary for words, Webopedia

76] Clifton Phua, Vincent Lee, Kate Smith, Ross Gayler, “A Comprehensive Survey of Data Mining-based Fraud Detection Research”

77] Artificial Intelligence Junkie website

78] Risk Shield website

79] RiskShield-Server Manual, “RiskShield-Turn Risk into Profit”, Server Version 3.73b, Manual Revision of 2010-06-15

80] RiskShield-Client Manual, “RiskShield-Turn Risk into Profit”, RiskShield-Client Software Release 1.44, Manual Release of 2009-11-30

81]Remco R. Bouckaert, Eibe Frank, Mark Hall, Richard Kirkby, Peter Reutemann, Alex Seewald, David Scuse, “WEKA Manual for Version 3-6-2“, January 11, 2010

82]GeneXproTools company website

83] WEKA documents website

84] Aleksandar Lazarevic, Vipin Kumar, “Feature Bagging for Outlier Detection”, Research Track Paper

85] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth, “From Data Mining to Knowledge Discovery in Databases”, American Association for Artificial Intelligence, FALL 1996

86] Pete McCollum, “An Introduction to Back-Propagation Neural Networks”, Encoder, The Newsletter of the Seattle Robotics Society



87] Robert Fuller, “Neural Fuzzy Systems”, Donner Visiting professor Abo Akademi University, 1995

88] Pasquale Malacaria & Fabrizio Smeraldi, “A Simplification of Adaboost and its Relation to Betting Strategies”, Queen Mary University of London, January 2007

89] Report of Maclntyre Hudson LLP & Center for Counter Fraud Studies, University of Portsmouth, “Counter Fraud - The financial cost of Healthcare fraud”, 2006



APPENDIX A

The following tables refer to §5 and include the results of the application of

some indicative running supervised machine learning algorithms, grouped by

data samples. The uploaded real data sample contains 1264 records.

SAMPLE A

CONFUSION MATRIX – Bayes - BayesNet

accepted low_risky suspicious extr_risky

accepted 929 7 217 6

low_risky 12 0 2 0

suspicious 55 1 25 0

extr_risky 6 2 2 0

Time taken to build the model: 0.23 seconds

Correctly Classified Instances: 75.4747 % (954)

TP Rate

FP Rate

Precision Recall F-measure ROC Area

accepted 0.802 0.695 0.927 0.802 0.86 0.627

low_risky 0 0.008 0 0 0 0.56

suspicious 0.309 0.187 0.102 0.309 0.153 0.638

extr_risky 0 0.005 0 0 0 0.721

CONFUSION MATRIX – Bayes - NaiveBayes


accepted 1058 3 78 20

low_risky 12 0 1 1


extr_risky 6 0 1 3



Predicted values Target

values

Predicted values

Target values

Measures

Class



TP Rate FP Rate Precision Recall F-measure ROC Area

accepted 0.913 0.81 0.926 0.913 0.919 0.571 low_risky 0 0.003 0 0 0 0.499

suspicious 0.148 0.068 0.13 0.148 0.139 0.595 extr_risky 0.3 0.018 0.12 0.3 0.171 0.626

CONFUSION MATRIX – Functions - RBFNetwork


accepted 1159 0 0 0

low_risky 14 0 0 0

suspicious 81 0 0 0

extr_risky 10 0 0 0


Correctly Classified Instances: 91.693% (1159)


accepted 1 1 0.917 1 0.957 0.525 low_risky 0 0 0 0 0 0.5

suspicious 0 0 0 0 0 0.535 extr_risky 0 0 0 0 0 0.458

CONFUSION MATRIX – Lazy IB1


accepted 1071 25 48 15

low_risky 9 4 1 0


extr_risky 8 2 0 0




accepted 0.924 0.743 0.932 0.924 0.928 0.591 low_risky 0.286 0.022 0.125 0.286 0.174 0.632

suspicious 0.235 0.041 0.279 0.235 0.255 0.597 extr_risky 0 0.012 0 0 0 0.494

CONFUSION MATRIX – Lazy - IBk


accepted 1071 25 48 15

low_risky 9 4 1 0


extr_risky 8 2 0 0

Measures

Class


values

Measures Class


values

Measures Class


values



Time taken to build the model: 0 seconds





CONFUSION MATRIX – Meta – AdaBoostM1


accepted 1159 0 0 0

low_risky 14 0 0 0

suspicious 81 0 0 0

extr_risky 10 0 0 0






CONFUSION MATRIX – Meta - LogitBoost


accepted 1155 1 3 0

low_risky 14 0 0 0

suspicious 79 0 2 0

extr_risky 10 0 0 0




accepted 0.997 0.981 0.918 0.997 0.956 0.663 low_risky 0 0.001 0 0 0 0.772

suspicious 0.025 0.003 0.4 0.025 0.047 0.65 extr_risky 0 0 0 0 0 0.81

CONFUSION MATRIX – Meta - Bagging


accepted 1155 0 4 0

low_risky 10 4 0 0

Measures Class


values

Measures Class


values


values

Measures Class



suspicious 75 0 6 0

extr_risky 10 0 0 0




accepted 0.997 0.905 0.924 0.997 0.959 0.637 low_risky 0.286 0 1 0.286 0.444 0.704


CONFUSION MATRIX – Meta - EnsembleSelection


accepted 1155 0 4 0

low_risky 12 2 0 0

suspicious 76 0 5 0

extr_risky 10 0 0 0






CONFUSION MATRIX – Meta - Stacking


accepted 1159 0 0 0

low_risky 14 0 0 0

suspicious 81 0 0 0

extr_risky 10 0 0 0






CONFUSION MATRIX – Rules - DecisionTable


accepted 1158 0 0 1

Measures Class


values

Measures Class


values

Measures Class


values



low_risky 14 0 0 0

suspicious 81 0 0 0

extr_risky 10 0 0 0




accepted 0.999 1 0.917 0.999 0.956 0.494 low_risky 0 0 0 0 0 0.494


CONFUSION MATRIX – Rules - JRip


accepted 1159 0 0 0

low_risky 14 0 0 0

suspicious 81 0 0 0

extr_risky 10 0 0 0






CONFUSION MATRIX – Trees - DecisionStump


accepted 1159 0 0 0

low_risky 14 0 0 0

suspicious 81 0 0 0

extr_risky 10 0 0 0






CONFUSION MATRIX – Trees – J48


accepted 1159 0 0 0

low_risky 14 0 0 0

suspicious 81 0 0 0

Measures Class


values

Measures

Class


values

Measures

Class


values



extr_risky 10 0 0 0






CONFUSION MATRIX – Trees - RandomForest


accepted 1156 0 3 0

low_risky 10 4 0 0


extr_risky 10 0 0 0






CONFUSION MATRIX – Functions - SMO


accepted 646 1 10 1

low_risky 10 4 0 0


extr_risky 10 0 0 0

Time taken to build the model: 263.89seconds

Correctly Classified Instances: 86.6317 % (661 out of 764)




Measures Class


values

Measures

Class


values

Measures

Class



SAMPLE B

Functions – RBFNetwork - Time taken to build the model: 1.33 seconds

Correlation coefficient -0.0468

Mean absolute error 0.202

Root mean squared error 0.4108

Relative absolute error 100.1464%

Root relative squared error 100.3364 %

Lazy – IBk - Time taken to build the model: 0 seconds

Correlation coefficient 0.1084



Relative absolute error 100.2708 %


Lazy – KStar - Time taken to build the model: 0 seconds






Lazy – LWL - Time taken to build the model: 0 seconds






Meta – Bagging - Time taken to build the model: 22.66 seconds






Meta – Stacking - Time taken to build the model: 0.02 seconds




Relative absolute error 100 %

Root relative squared error 100%



Meta – EnsmbleSelection -Time taken to build the model: 43.95 seconds






Rules – ConjunctiveRule - Time taken to build the model: 0.28 seconds






Rules – DecisionTable - Time taken to build the model: 25.2 seconds






Trees – DecisionStump - Time taken to build the model: 0.39 seconds








SAMPLE C

CONFUSION MATRIX – Bayes - BayesNet



low_risky 10 3 1 0


extr_risky 5 2 3 0






CONFUSION MATRIX – Bayes - NaiveBayes


accepted 1060 4 73 22

low_risky 1 3 9 1


extr_risky 5 7 1 2






CONFUSION MATRIX – Functions - RBFNetwork


accepted 1159 0 0 0

low_risky 14 0 0 1

suspicious 81 0 0 0

extr_risky 10 0 0 0






values

Measures Class


values

Measures Class


values

Measures

Class




CONFUSION MATRIX – Lazy – IB1



low_risky 10 4 0 0


extr_risky 6 2 0 2






CONFUSION MATRIX – Meta – AdaBoostM1


accepted 1159 0 0 0

low_risky 10 0 1 3


extr_risky 6 2 0 2




accepted 1 0.79 0.933 1 0.965 0.792 low_risky 0 0.002 0 0 0 0.833


CONFUSION MATRIX – Meta - Bagging


accepted 1155 0 4 0

low_risky 10 4 0 0

suspicious 75 0 6 0

extr_risky 10 0 0 0






values

Measures Class


values

Measures

Class


values

Measures

Class



suspicious 0.074 0.003 0.6 0.074 0.132 0.595 extr_risky 0 0 0.4 0 0 0.674

CONFUSION MATRIX – Meta - LogitBoost


accepted 1159 0 0 0

low_risky 5 8 1 0


extr_risky 0 1 0 9




accepted 1 0.219 0.981 1 0.99 0.945 low_risky 0.571 0.001 0.889 0.571 0.696 0.912

suspicious 0.778 0.001 0.984 0.778 0.869 0.949 extr_risky 0.9 0 1 0.9 0.947 0.998

CONFUSION MATRIX – Rules - DecisionTable


accepted 1159 0 0 0

low_risky 6 6 2 0


extr_risky 5 0 0 5




accepted 1 0.295 0.974 1 0.987 0.926 low_risky 0.429 0 1 0.429 0.6 0.988

suspicious 0.753 0.002 0.968 0.753 0.847 0.903 extr_risky 0.5 0 1 0.5 0.667 0.898

CONFUSION MATRIX – Rules - ConjunctiveRule


accepted 1159 0 0 0

low_risky 6 6 2 0


extr_risky 5 0 0 5




accepted 1 0.876 0.926 1 0.962 0.552 low_risky 0 0 0 0 0 0.38


Predicted values

Target values

Measures Class


values

Measures Class


values

Measures

Class



CONFUSION MATRIX – Trees – J48


accepted 1159 0 0 0

low_risky 14 0 0 0

suspicious 81 0 0 0

extr_risky 10 0 0 0






CONFUSION MATRIX – Trees - RandomForest


accepted 1159 0 3 0

low_risky 10 4 0 0

suspicious 72 0 9 0

extr_risky 10 0 0 0






CONFUSION MATRIX – Trees - RandomTree


accepted 1150 0 9 0

low_risky 10 3 0 0


extr_risky 10 0 0 0






Predicted values

Target values

Measures Class


values

Measures Class


values

Measures Class



CONFUSION MATRIX – Functions - SMO


accepted 4 0 10 0

low_risky 0 10 70 0


extr_risky 0 0 10 0

Time taken to build the model: 278.38seconds

Correctly Classified Instances: 87.156% (665 out of 764)


accepted 0.286 0 1 0.286 0.444 0.666 low_risky 0.125 0.01 0.588 0.125 0.206 0.557



values

measures Class



SAMPLE D

Functions – RBFNetwork - Time taken to build the model: 4.09 seconds






Lazy – IBk - Time taken to build the model: 0 seconds






Meta – Bagging - Time taken to build the model: 70.28 seconds






Meta – EnsembleSelection - Time taken to build the model: 87.27seconds






Meta – Stacking- Time taken to build the model: 0.02 seconds




Relative absolute error 100 %

Root relative squared error 100 %

Rule – DecisionTable - Time taken to build the model: 117.3seconds








Rule – ConjunctiveRule - Time taken to build the model: 0.56 seconds






Trees – DecisionStump - Time taken to build the model: 0.36 seconds








APPENDIX B

The following tables include the WEKA results in an aggregated form after the

application of all supervised algorithms in samples A and B and samples C

and D.

SAMPLES A & B

Algorithm Type

Algorithm Running

Time (sec) Accuracy

(%)

Relative Absolute Error (%)

Class Label

#Records

ConjunctiveRule 0,31 92 99 Nominal 1264

ConjunctiveRule 0,28 98 Numerical 1264

DecisionTable 13,77 91,6139 112,4373 Nominal 1264

DecisionTable 25,2 93,5832 Numerical 1264

Jrip 7 91,693 96,5272 Nominal 1264

OneR 0,08 92 Nominal 1264

PART 0,49 91,693 98,7697 Nominal 1264

Ridor 1,55 89,1614 69,0206 Nominal 1264

ZeroR 0,02 92 100 Nominal 1264

RULES

ZeroR 0 100 Numerical 1264

DecisionStump 0,11 92 98 Nominal 1264

DecisionStump 0,39 99 Numerical 1264

J48 0,38 91,693 98,6723 Nominal 1264

J48graft 1,16 91,693 98,6723 Nominal 1264

REPTree 1,98 92,0886 93,3101 Nominal 1264

REPTree 2,02 90,2275 Numerical 1264

M5P 35,23 119,7694 Numerical 764

RandomForest 0,56 92,5633 89,4978 Nominal 1264

RandomTree 0,06 91,7722 89,0786 Nominal 1264

REES

UserClassifier 58,08 100 Numerical 1264

HyperPipes 0,02 92,4051 465 Nominal 1264 MISC

VFI 0,08 85,9177 88,5161 Nominal 1264

AdaBoostM1 0,3 91,693 98 Nominal 1264

AttributeSelectedClassifier 9,66 91,693 98,6723 Nominal 1264

AdditiveRegression 1,38 86,5033 Numerical 1264 ClassificationViaClustering 1,56 64,6361 225,1986 Nominal 1264

CVParameterSelection 0,02 91,693 100 Nominal 1264

CVParameterSelection 0 100 Numerical 1264

Bagging 41,02 92,1677 92,6654 Nominal 1264

Bagging 22,66 92,464 Numerical 1264

Decorate 15,11 91,693 108,9727 Nominal 1264

END 5,19 91,693 98,6833 Nominal 1264

FilteredClassifier 0,22 91,693 986.723 Nominal 1264

META

Grading 0,03 91,693 52,899 Nominal 1264



LogitBoost 4,3 91,5348 89,3212 Nominal 1264


MultiBoostAB 0,23 91,693 97,5671 Nominal 1264

MultiScheme 0 91,693 100 Nominal 1264

MultiScheme 0,02 100 Numerical 1264

ClassBalancedND 0,63 91,693 98,7113 Nominal 1264

ND 0,78 91,693 98,7305 Nominal 1264 DataNearBalancedND 0,42 91,693 98,7562 Nominal 1264 OrdinalClassClassifier 1 91,693 100,6065 Nominal 1264

RegressionByDiscretization 0,47 100 Numerical 1264

RacedIncremental LogitBoost 0,03 91,693 100,6065 Nominal 1264

RandomCommittee 0,52 93 85 Nominal 1264 EnsembleSelection 49,75 92 96 Nominal 1264

EnsembleSelection 43,95 94 Numerical 1264

RandomSubSpace 26,2 92,1677 93,5194 Nominal 1264

Stacking 0 91,693 100 Nominal 1264

Stacking 0,02 100 Numerical 1264

StackingC 0,09 91,693 99,9962 Nominal 1264

Vote 0 91,693 100 Nominal 1264

Vote 0 100 Numerical 1264

IB1 0,02 86,5506 85,646 Nominal 1264

IBk 0 86,5506 86,8425 Nominal 1264

IBk 0 100,2708 Numerical 1264

KStar 0,31 91,693 98,5211 Nominal 1264

KStar 0 52,9587 Numerical 1264

LWL 0 91,8513 93,9239 Nominal 1264

LAZY

LWL 0 95,3114 Numerical 1264

RBFNetwork 2,16 92 99 Nominal 1264

RBFNetwork 1,33 100 Numerical 1264

SMO 263,89 87 214 Nominal 764 FUNCTIONS

SMOreg 186,45 120 Numerical 764

BayesNet 0,23 75,4747 156,4513 Nominal 1264

NaiveBayes 0,13 84,8892 97,5098 Nominal 1264 BAYES

NaiveBayesUpdateable 0,09 84,8892 97,5098 Nominal 1264 SAMPLES C & D

Algorithm Type

Algorithm Running

Time (sec) Accuracy

(%)

Relative Absolute Error (%)

Class Label

#Records

ConjunctiveRule 0,95 93 86 Nominal 1264

ConjunctiveRule 0,56 81 Numerical 1264

DecisionTable 62,78 97,3892 32,7153 Nominal 1264

DecisionTable 117,3 9,7337 Numerical 1264

JRip 19,67 98,7342 12,3703 Nominal 1264

OneR 0,5 92 48 Nominal 1264

PART 1,63 91,693 98,3046 Nominal 1264

Ridor 18,75 91,5348 54 Nominal 1264

ZeroR 0 92 100 Nominal 1264

RULES

ZeroR 0 100 Numerical 1264

DecisionStump 0,52 93 79 Nominal 1264 TREES

DecisionStump 0,36 73 Numerical 1264



J48 1,73 91,693 98,6723 Nominal 1264

J48 1 86 99 Nominal 764

J48graft 3,73 91,693 98,6723 Nominal 1264

REPTree 8,5 92,0886 93,3101 Nominal 1264

REPTree 7,75 90 Numerical 1264

M5P 38,39 158,0417 Numerical 764

RandomForest 1,17 92,4842 90,1185 Nominal 1264

RandomTree 0,03 92,1677 84,5656 Nominal 1264

UserClassifier 17,42 100 Numerical 1264

HyperPipes 0,05 92,7215 473,1011 Nominal 1264 MISC

VFI 0,14 86,6297 82,1092 Nominal 1264

AdaBoostM1 2,05 93 104 Nominal 1264 AttributeSelectedCla

ssifier 7,69 92 99 Nominal 1264

AdditiveRegression 4,31 24,6164 Numerical 1264 ClassificationViaCluster

ing 7,05 59,4146 258,4494 Nominal 1264

CVParameterSelection 0 92 100 Nominal 1264

CVParameterSelection 0 100 Numerical 1264

Bagging 99,72 92,1677 92,6654 Nominal 1264

Bagging 70,28 92,464 Numerical 1264

Decorate 77,63 91,693 107,7443 Nominal 1264

END 22,41 91,693 98,5855 Nominal 1264

FilteredClassifier 0,7 92,0095 94,147 Nominal 1264

Grading 0,03 92 53 Nominal 1264



MultiBoostAB 3,53 92,8797 45,3391 Nominal 1264

MultiScheme 0,02 91,693 100 Nominal 1264

MultiScheme 0,03 100 Numerical 1264

ClassBalancedND 4,92 91,693 98,573 Nominal 1264 ND 1,8 91,693 98,5283 Nominal 1264

DataNearBalancedND 4,88 91,693 98,4153 Nominal 1264

OrdinalClassClassifier 4,36 91,693 98,6723 Nominal 1264 RegressionByDiscret

ization 1,74 99,6557 Numerical 1264

RacedIncrementalLogitBoost

0,02 91,693 100,6065 Nominal 1264

RandomCommittee 0,75 93 88 Nominal 1264

EnsembleSelection 86,39 92 96 Nominal 1264

EnsembleSelection 87,27 94 Numerical 1264

RandomSubSpace 35,06 91,9304 93,4288 Nominal 1264

RandomSubSpace 32,84 92,3796 Numerical 1264

Stacking 0,02 91,693 100 Nominal 1264

Stacking 0,02 100 Numerical 1264

StackingC 0,17 91,693 99,9962 Nominal 1264

Vote 0,02 92 100 Nominal 1264

META

Vote 0 100 Numerical 1264

IB1 0,03 87,3418 80,608 Nominal 1264

Ibk 0 87,3418 81,9896 Nominal 1264

IBk 0 76,4778 Numerical 1264

LAZY

KStar 0 91,693 477,6024 Nominal 1264



KStar 0 54,5149 Numerical 1264

LWL 0 94,3829 69,2469 Nominal 1264

LWL 0 70,1878 Numerical 1264

RBFNetwork 6,77 92 99 Nominal 1264

RBFNetwork 4,09 100 Numerical 1264

SMO 278,38 87 215 Nominal 764 FUNCTION

S

SMOreg 195,31 97 Numerical 764

BayesNet 0,67 75,9494 152,8978 Nominal 1264

NaiveBayes 0,31 87,1835 82,6355 Nominal 1264 BAYES

NaiveBayesUpdateable

0,31 87,1835 82,6355 Nominal 1264

Analysis Fraud

Documents

Transcript of Analysis Fraud