CHAPTER 1

Chapter one Introduction

CHAPTER 1. INTRODUCTION

1.1 Introduction

The presence of outliers in data is an important problem that has been investigated across diverse disciplines and application domains. Doubtful or anomalous values have been recognized as a problem for a very long time, certainly since the middle of the eighteenth century; see, for example, Bernoulli (1777) on the combination of astronomical observations (Barnett and Lewis, 1994).

From that period until the middle of the nineteenth century, the main point of discussion in the literature with regard to outlying values was whether their rejection could be justified. The first published objective test for anomalous observations was due to the American astronomer Peirce (1852).

As Barnett and Lewis (1994) noted, outlying observations do not inevitably ‘perplex’ or ‘mislead’; they are not necessarily ‘bad’ or ‘erroneous’, and in some situations the experimenter may be tempted not to reject an outlier but to welcome it as an indication of some unexpectedly useful industrial treatment or surprisingly successful agricultural variety.

The outlier problem is one of the earliest statistical interests, and since nearly all data sets contain outliers in varying proportions, it remains one of the most important. Sometimes outliers grossly distort the statistical analysis; at other times their influence is less noticeable. Statisticians have accordingly developed numerous algorithms for the detection and treatment of outliers, but most of these methods were developed for univariate data sets. This thesis focuses instead on multivariate outlier detection.

Outliers are especially harmful when common summary statistics such as the sample mean and variance are used: they can lead the analyst to a conclusion opposite to the one that would be reached if the outliers were not present. For example, a hypothesis might or might not be declared significant because of a handful of extreme outliers. In fitting a regression line via least squares, outliers can alter the slope enough to induce a sign change.
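The sign change in a least-squares slope can be illustrated with a small sketch. The data are invented for illustration and are not taken from this thesis, and the helper name `ols_slope` is hypothetical. Five points lie exactly on a rising line, and a single extreme observation is then appended.

```python
from statistics import mean


def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line fitted to the points (xs, ys)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Illustrative (made-up) data: y increases with x along the line y = x.
xs = [1, 2, 3, 4, 5]
ys = [1.0, 2.0, 3.0, 4.0, 5.0]
clean_slope = ols_slope(xs, ys)  # slope of the clean data is 1

# Append one extreme observation far below the trend.
xs_out = xs + [6]
ys_out = ys + [-20.0]
contaminated_slope = ols_slope(xs_out, ys_out)  # slope becomes negative

print(clean_slope, contaminated_slope, mean(ys), mean(ys_out))
```

In this example one contaminated observation out of six is enough to turn a positive slope into a negative one, and it also pulls the sample mean of y below zero.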

Sometimes outliers can draw attention to

important facts, or lead to new discoveries. Our

goal in this thesis is to identify the outliers

in a data set as successfully as possible, after

which the analyst can decide what to do with

them.

Outliers have two opposing properties. They

can be noise that disturbs regression and

classification tasks. On the other hand, they can

provide valuable information about rare

phenomena, which can lead to knowledge discovery.

At this stage, we must make clear what we mean by an ‘outlier’. Different definitions of an outlier exist in the literature; the most commonly cited are the following:

- “An outlier is an observation which is suspected of being partially or wholly irrelevant because it is not generated by the stochastic model assumed” (Anscombe and Guttman, 1960).

- “An outlier is a piece of data that is suspected to be incorrect due to the remote probability that it is in fact correct” (Ferguson, 1961).

- “An outlying observation, or ‘outlier’, is one that appears to deviate markedly from other members of the sample in which it occurs” (Grubbs, 1969).

- “An outlier is an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism” (Hawkins, 1980).

- “An outlier is defined as a case that does not follow the same model as the rest of the data” (Weisberg, 1985).

- “An outlier is an observation (or subset of observations) which appears to be inconsistent with the remainder of the data set” (Barnett and Lewis, 1994).

- “An outlier is an observation that lies outside the overall pattern of a distribution” (Moore and McCabe, 1999).

- “An outlier in a set of data is an observation or a point that is considerably dissimilar or inconsistent with the remainder of the data” (Ramaswamy et al., 2000).

- “An outlier is defined as a data point which is very different from the rest of the data based on some measure. Such a point often contains useful information on abnormal behavior of the system described by the data” (Aggarwal and Yu, 2001).

- “Outliers are those data records that do not follow any pattern in an application” (Chen et al., 2002).

- “Outlier is one that appears to deviate markedly from other members of the sample in which it occurs or as an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data” (Hodge and Austin, 2004).

- “Outliers are the set of objects that are considerably dissimilar from the remainder of the data” (Han and Kamber, 2000).

- “Outliers are defined as those measurements that significantly deviate from the normal pattern of sensed data” (Chandola et al., 2007).

1.2 Types of outliers

An important input to an outlier detection technique is the definition of the outlier that the technique is expected to detect. Outliers can be classified into three categories based on their composition and their relation to the rest of the data (Chandola et al., 2007).

(1) Type I Outliers.

In a given set of data instances, an individual outlying instance is termed a Type I outlier. This is the simplest type of outlier and is the focus of the majority of existing outlier detection schemes. A data instance is an outlier because of its attribute values, which are inconsistent with the values taken by normal instances. Techniques that detect Type I outliers analyze the relation of an individual instance to the rest of the data instances (either in the training data or in the test data).
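A minimal sketch of a Type I check on a single attribute is given here, assuming a robust rule based on the median and the median absolute deviation (a commonly used modified z-score). The rule, the cut-off of 3.5, the function name `point_outliers`, and the sample values are illustrative assumptions; they are neither a scheme described by Chandola et al. (2007) nor the method proposed in this thesis.

```python
from statistics import median


def point_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread around the median, so this simple rule cannot score
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Illustrative (made-up) measurements of one attribute.
measurements = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 25.0]
flagged = point_outliers(measurements)
print(flagged)
```

Each instance is judged only by how far its own value lies from the bulk of the data, which is the defining feature of a Type I outlier.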

(2) Type II Outliers.

These outliers arise from the occurrence of an individual data instance in a specific context in the given data. Like Type I outliers, they are individual data instances; the difference is that a Type II outlier might not be an outlier in a different context. Thus, Type II outliers are defined with respect to a context. The notion of a context is induced by the structure in the data set and has to be specified as part of the problem formulation. A context defines the neighborhood of a particular data instance.

Type II outliers satisfy two properties:

a) The underlying data has a spatial or sequential nature: each data instance is defined using two sets of attributes, viz. contextual attributes and behavioral attributes. The contextual attributes define the position of an instance and are used to determine the context (or neighborhood) for that instance. For example, in spatial data sets, the longitude and latitude of a location are the contextual attributes, and in time-series data, time is a contextual attribute that determines the position of an instance in the entire sequence. The behavioral attributes define the non-contextual characteristics of an instance. For example, in a spatial data set describing the average rainfall of the entire world, the amount of rainfall at any location is a behavioral attribute.

b) The outlying behavior is determined using the values of the behavioral attributes within a specific context. A data instance might be a Type II outlier in a given context, but an identical data instance (in terms of behavioral attributes) could be considered normal in a different context.

Type II outliers have been most widely explored in time-series data.
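The role of the context can be sketched as follows. The month is treated as the contextual attribute and the temperature as the behavioral attribute; the readings, the function name `contextual_outliers`, and the median-based rule with a cut-off of 3.5 are illustrative assumptions, not part of the cited work.

```python
from collections import defaultdict
from statistics import median


def contextual_outliers(records, threshold=3.5):
    """Flag (context, value) pairs whose value is unusual within its own context."""
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)

    flagged = []
    for context, value in records:
        group = by_context[context]
        med = median(group)
        mad = median([abs(v - med) for v in group])
        if mad == 0:
            continue  # no spread in this context; the simple rule cannot score it
        if 0.6745 * abs(value - med) / mad > threshold:
            flagged.append((context, value))
    return flagged


# Illustrative (made-up) temperature readings in degrees Celsius.
readings = [
    ("jan", 2.0), ("jan", 3.0), ("jan", 1.0), ("jan", 4.0), ("jan", 30.0),
    ("jul", 29.0), ("jul", 31.0), ("jul", 30.0), ("jul", 28.0), ("jul", 32.0),
]
flagged = contextual_outliers(readings)
print(flagged)
```

The same behavioral value, 30.0, is flagged in the January context and accepted in the July context, which is exactly the distinction between a Type II outlier and a Type I outlier.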

(3) Type III Outliers.

These outliers occur because a subset of data instances is outlying with respect to the entire data set. The individual data instances in a Type III outlier are not outliers by themselves, but their occurrence together as a substructure is anomalous. Type III outliers are meaningful only when the data has a spatial or sequential nature. These outliers are either anomalous subgraphs or anomalous subsequences occurring in the data.
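An anomalous subsequence can be sketched with a simple sliding-window count. The sequence, the window width, the rarity threshold, and the function name `rare_windows` are all illustrative assumptions. The sequence alternates between 0 and 1, except for one run of repeated 1s; each 1 is ordinary on its own, but the run is not.

```python
from collections import Counter


def rare_windows(sequence, width=4, max_fraction=0.1):
    """Return start indices of windows whose pattern is rare in the sequence."""
    windows = [tuple(sequence[i:i + width])
               for i in range(len(sequence) - width + 1)]
    counts = Counter(windows)
    limit = max_fraction * len(windows)  # a pattern is rare if seen at most this often
    return [i for i, w in enumerate(windows) if counts[w] <= limit]


# Illustrative (made-up) sequence: regular alternation with a run of 1s inserted.
signal = [0, 1] * 6 + [1, 1, 1, 1] + [0, 1] * 6
flagged_starts = rare_windows(signal)
print(flagged_starts)
```

Every flagged window overlaps the inserted run, although no single value in that run would be flagged by a check on individual instances.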

Almost all the studies that consider outlier

identification as their primary objective are in

the field of statistics. A comprehensive

treatment of outliers appears in Barnett and

Lewis (1994). They provide a list of about 100

discordancy tests for detecting outliers in data

following well-known distributions. The choice

of an appropriate discordancy test depends on:

a) the distribution,

b) the knowledge of the distribution parameters,

c) the number of expected outliers, and

d) the type of expected outliers.

These methods have two main drawbacks. First, almost all of them are for univariate data, making them unsuitable for multidimensional data sets. Second, all of them are distribution-based, whereas in most cases the data distribution is unknown. Real-world data are usually multivariate with an unknown distribution.

Detecting outliers, that is, instances in a database with unusual properties, is an important data-mining task. The data mining community became interested in outliers after Knorr and Ng (1998) proposed a non-parametric approach to outlier detection based on the distance of an instance to its nearest neighbors.
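The idea of scoring an instance by its distance to its nearest neighbors can be sketched as below. This is a simplified illustration that uses the distance to the k-th nearest neighbor as the score; it is not claimed to be the exact formulation of Knorr and Ng (1998). The points, the value of k, and the function name `knn_distance_scores` are assumptions made for the example.

```python
import math


def knn_distance_scores(points, k=2):
    """Score each point by the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Illustrative (made-up) two-dimensional data: a tight cluster and one remote point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (8.0, 8.0)]
scores = knn_distance_scores(points, k=2)
most_outlying = max(range(len(points)), key=scores.__getitem__)
print(most_outlying, scores)
```

No distributional assumption is needed: the remote point receives by far the largest score simply because its neighbors are far away. The brute-force loop compares every pair of points, so it is suitable only for small data sets.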

Outlier detection in large data sets is an active research field and has become an important branch of data mining. It has many applications in domains where illegal or abnormal behavior can occur, such as fraud detection, network intrusion detection, insurance fraud, medical diagnosis, marketing, and customer segmentation.

Frequently, outliers are removed to improve the accuracy of the estimators. However, this practice is not advisable, because outliers sometimes carry very useful information. The presence of outliers can indicate individuals or groups whose behavior is very different from the normal situation.

Outlier detection refers to the problem of finding patterns in data that do not conform to expected normal behavior. Its importance is due to the fact that outliers in data translate to significant (and often critical) information in a wide variety of application domains.

Outlier detection finds applications in credit card fraud detection, network intrusion detection, financial applications, and marketing. It was studied in the statistics community as early as the nineteenth century (Edgeworth, 1887).

Outlier detection has been found to be

directly applicable in a large number of domains.

These applications can be listed as follows:

- Fraud detection: detecting fraudulent applications for credit cards or state benefits, or detecting fraudulent usage of credit cards or mobile phones.

- Loan application processing: detecting fraudulent applications or potentially problematic customers.

- Intrusion detection: detecting unauthorized access in computer networks.

- Activity monitoring: detecting mobile phone fraud by monitoring phone activity, or detecting suspicious trades in the equity markets.

- Network performance: monitoring the performance of computer networks, for example to detect network bottlenecks.

- Fault diagnosis: monitoring processes to detect faults in, for example, motors, generators, pipelines, or space instruments on space shuttles.

- Structural defect detection: monitoring manufacturing lines to detect faulty production runs, for example cracked beams.

- Satellite image analysis: identifying novel features or misclassified features.

- Motion segmentation: detecting image features moving independently of the background.

- Time-series monitoring: monitoring safety-critical applications such as drilling or high-speed milling.

- Medical condition monitoring: such as heart-rate monitors.

- Pharmaceutical research: identifying novel molecular structures.

- Detecting novelty in text: detecting the onset of news stories, for topic detection and tracking, or for traders to pinpoint equity, commodities, and FX trading stories, or outperforming or underperforming commodities.

- Detecting unexpected entries in databases: for data mining to detect errors, frauds, or valid but unexpected entries.

- Detecting mislabeled data in a training data set.

1.3 Problem definition

According to Hodge and Austin (2004), an outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs, or an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data. Many terms have been used for the task, including novelty detection, anomaly detection, noise detection, deviation detection, and exception mining, but all involve a similar process or goal. In addition to the many different names, there are many more algorithms for solving the problem.

Outlier detection can be treated as a part of data preprocessing or as the objective of data mining itself. Many cases exhibit the characteristics of outlying samples, such as cheating in commercial data, web intrusion, process faults, sudden variation of operating conditions, and high-intensity noise in industrial processes.

A numerical outlier is characterized by an abnormal relationship among its attribute values or by an abnormal location in multi-dimensional space. The commonly used methods of outlier detection are:

I. Classical techniques include:

1. Statistical approach

2. Clustering approach

3. Density approach

4. Depth approach

II. Intelligent techniques include:

1. Classification approach

2. Distance approach

3. Information theory approach

4. Spectral decomposition approach

5. Visualisation approach

6. Wavelet approach

Most of these techniques deal with univariate data, but most of the problems statisticians face involve multivariate (multi-dimensional) data. Generally speaking, none of the previously reported works in the literature can guarantee sufficient confidence and reliability in multi-dimensional outlier detection. Thus, multi-dimensional outlier detection deserves further study. We therefore propose a new technique, Support Vector Machine-K Nearest Neighbor (SVM-KNN), which aims at higher accuracy and a higher detection rate, and we compare its results with those of two popular methods, KNN and SVM.

1.4 Historical review

Outlier detection is a critical task in many

safety-critical environments as the outlier

indicates abnormal running conditions from which

significant performance degradation may well

result, such as an aircraft engine rotation

defect or a flow problem in a pipeline. An

outlier can denote an anomalous object in an

image such as a land mine. An outlier may

pinpoint an intruder inside a system with

malicious intentions so rapid detection is

essential.

Outlier detection can detect a fault on a

factory production line by constantly monitoring

specific features of the products and comparing

the real-time data with either the features of

normal products or those for faults. It is

imperative in tasks such as credit card usage

monitoring or mobile phone monitoring to detect a

sudden change in the usage pattern, which may

indicate fraudulent usage such as stolen card or

stolen phone airtime. Outlier detection

accomplishes this by analyzing and comparing the

time series of usage statistics.

For application processing, such as loan

application processing or social security benefit

payments, an outlier detection system can detect

any anomalies in the application before approval

or payment. Outlier detection can additionally

monitor the circumstances of a benefit claimant

over time to ensure the payment has not slipped

into fraud. Equity or commodity traders can use

outlier detection methods to monitor individual

shares or markets and detect novel trends, which

may indicate buying or selling opportunities.

A news delivery system can detect changing

news stories and ensure the supplier is first

with the breaking news. In a database, outliers

may indicate fraudulent cases or they may just

denote an error by the entry clerk or a

misinterpretation of a missing value code; either

way, detection of the anomaly is vital for database

consistency and integrity.

Outlier detection has been a topic of a

number of surveys and review articles, as well as

books. Hodge and Austin (2004) provided an

extensive survey of outlier detection techniques

developed in machine learning and statistical

domains.

Petrovskiy (2003) presented a brief review

of outlier detection techniques using data mining

algorithms. An extensive review of novelty

detection techniques using neural networks and

statistical approaches has been presented in

Markou and Singh (2003a) and Markou and Singh

(2003b), respectively.

Patcha and Park (2007) have provided a broad

survey of outlier detection techniques used for

intrusion detection in computer networks.

Lazarevic et al. (2003) presented an evaluation

of selected outlier detection techniques, used

for network intrusion detection.

Outlier detection techniques developed

specifically for system call intrusion detection

have been reviewed by Forrest et al. (1999), and

later by Snyder (2001) and Dasgupta and Nino

(2000).

Hido et al. (2011) use the ratio of training and

test data densities as an outlier score for

outlier detection.

A substantial amount of research on outlier detection has been done in statistics and has been reviewed in several books (Rousseeuw and Leroy, 1987; Barnett and Lewis, 1994; Hawkins, 1980).

1.5 The aim of the work

The aim of this study is to address the problem of detecting outliers in multivariate data. To this end, we propose a technique that detects outliers in multivariate data.

To achieve this aim, the thesis is organized as follows:

Chapter 2 presents the techniques of outlier detection, which are divided into classical and intelligent techniques. First, the classical techniques are described: the statistical, clustering, density, and depth approaches. Second, the intelligent techniques are described: the classification, distance, information theory, spectral decomposition, visualisation, and wavelet approaches.

Chapter 3 concentrates on some intelligent techniques for outlier detection, namely KNN and SVM, and presents the algorithms of both methods.

Chapter 4 describes the proposed SVM-KNN method, its algorithm, and how to use it in outlier detection. In addition, this chapter compares the most important methods, namely KNN, SVM, and the proposed SVM-KNN method.

Finally, we introduce some suggestions for further studies in the same field.
