CHAPTER 1
Chapter one Introduction
CHAPTER 1. INTRODUCTION
1.1 Introduction
The existence of outliers in data is an important
problem that has been investigated across diverse
disciplines and application domains. The problem
of doubtful or anomalous values has been
recognized for a very long time, certainly since
the middle of the eighteenth century; see, for
example, Bernoulli (1777), writing about the
combination of astronomical observations
(Barnett and Lewis, 1994).
From that period until the middle of the
nineteenth century, the main point of discussion
in the literature with regard to outlying values
was whether their rejection could be justified.
The first published objective test for anomalous
observations was due to the American astronomer
Peirce (1852).
As Barnett and Lewis (1994) note,
outlying observations do not inevitably ‘perplex’
or ‘mislead’; they are not necessarily ‘bad’ or
‘erroneous’, and the experimenter may be tempted
in some situations not to reject an outlier but
to welcome it as an indication of some
unexpectedly useful industrial treatment or
surprisingly successful agricultural variety.
The outlier problem is one of the earliest
statistical concerns, and since nearly all
data sets contain outliers in varying
proportions, it remains one of the most
important. Sometimes outliers grossly distort
the statistical analysis; at other times their
influence is less noticeable. Statisticians
have accordingly developed numerous algorithms
for the detection and treatment of outliers, but
most of these methods were developed for
univariate data sets. This thesis focuses instead
on multivariate outlier detection.
When common summary statistics such as the
sample mean and variance are used, outliers can
lead the analyst to a conclusion opposite to the
one that would be reached if the outliers were
absent. For example, a
hypothesis test might or might not be declared
significant because of a handful of extreme
outliers. In fitting a regression line by least
squares, outliers can alter the slope enough to
change its sign.
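The sign change described above can be illustrated with a minimal sketch in Python. The data are hypothetical and chosen only for illustration, and the helper name ols_slope is our own; it computes the ordinary least-squares slope from the usual closed-form expression.

```python
def ols_slope(xs, ys):
    """Slope b of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical clean data: five points lying exactly on y = x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 2.0, 3.0, 4.0, 5.0]
slope_clean = ols_slope(xs, ys)

# The same data with one extreme observation added at (10, -20).
slope_contaminated = ols_slope(xs + [10.0], ys + [-20.0])

print(slope_clean, slope_contaminated)
```

With these assumed values the clean slope is 1, while the single added point pulls the fitted slope to roughly -2.44, so one observation out of six is enough to reverse the apparent relationship.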
Sometimes outliers can draw attention to
important facts, or lead to new discoveries. Our
goal in this thesis is to identify the outliers
in a data set as successfully as possible, after
which the analyst can decide what to do with
them.
Outliers have two opposing properties. They
can be noise that disturbs regression and
classification tasks. On the other hand, they can
provide valuable information about rare
phenomena, which can lead to knowledge discovery.
At this stage, we must make clear what we
mean by an "outlier". In the literature, different
definitions of an outlier exist; the most commonly
cited are the following:
- “An outlier is an observation which is suspected
of being partially or wholly irrelevant because
it is not generated by the stochastic model
assumed” (Anscombe and Guttman, 1960).
- “An outlier is a piece of data that is suspected
to be incorrect due to the remote probability
that it is in fact correct” (Ferguson, 1961).
- “An outlying observation, or ‘outlier’, is one
that appears to deviate markedly from other
members of the sample in which it occurs”
(Grubbs, 1969).
- “An outlier is an observation that deviates so
much from other observations as to arouse
suspicions that it was generated by a different
mechanism” (Hawkins, 1980).
- “An outlier is defined as a case that does not
follow the same model as the rest of the data”
(Weisberg, 1985).
- “An outlier is an observation (or subset of
observations) which appears to be inconsistent
with the remainder of the data set” (Barnett and
Lewis, 1994).
- “An outlier is an observation that lies outside
the overall pattern of a distribution” (Moore and
McCabe, 1999).
- “An outlier in a set of data is an observation
or a point that is considerably dissimilar or
inconsistent with the remainder of the data”
(Ramaswamy et al., 2000).
- “An outlier is defined as a data point which is
very different from the rest of the data based on
some measure. Such a point often contains useful
information on abnormal behavior of the system
described by the data” (Aggarwal and Yu, 2001).
- “Outliers are those data records that do not
follow any pattern in an application” (Chen et
al., 2002).
- “Outlier is one that appears to deviate
markedly from other members of the sample in
which it occurs or as an observation (or subset
of observations) which appears to be inconsistent
with the remainder of that set of data” (Hodge
and Austin, 2004).
- “Outliers are the set of objects that are
considerably dissimilar from the remainder of the
data” (Han and Kamber, 2000).
- Outliers are defined as “those measurements that
significantly deviate from the normal pattern of
sensed data” (Chandola et al., 2007).
1.2 Types of outliers
An important input to an outlier detection
technique is the definition of the kind of
outlier that the technique is expected to
detect. Outliers can be classified into three
categories based on their composition and their
relation to the rest of the data (Chandola et al.,
2007).
(1) Type I Outliers.
In a given set of data instances, an
individual outlying instance is termed a Type
I outlier. This is the simplest type of outlier
and is the focus of the majority of existing
outlier detection schemes. A data instance is an
outlier because of its attribute values, which are
inconsistent with the values taken by normal
instances. Techniques that detect Type I outliers
analyze the relation of an individual instance
to the rest of the data instances
(either in the training data or in the test
data).
(2) Type II Outliers.
These outliers arise from the
occurrence of an individual data instance in a
specific context in the given data. Like Type I
outliers, these outliers are also individual data
instances. The difference is that a Type II
outlier might not be an outlier in a different
context. Thus, Type II outliers are defined with
respect to a context. The notion of a context is
induced by the structure in the data set and has
to be specified as a part of the problem
formulation. A context defines the neighborhood
of a particular data instance.
Type II outliers satisfy two properties:
a) The underlying data has a spatial/sequential
nature: each data instance is defined using
two sets of attributes, viz. contextual
attributes and behavioral attributes. The
contextual attributes define the position of
an instance and are used to determine the
context (or neighborhood) for that instance.
For example, in spatial data sets, the
longitude and latitude of a location are the
contextual attributes. In time-series data,
time is a contextual attribute that
determines the position of an instance in
the entire sequence. The behavioral
attributes define the non-contextual
characteristics of an instance. For example,
in a spatial data set describing the average
rainfall of the entire world, the amount of
rainfall at any location is a behavioral
attribute.
b) The outlying behavior is determined using
the values of the behavioral attributes
within a specific context. A data instance
might be a Type II outlier in a given
context, but an identical data instance (in
terms of behavioral attributes) could be
considered normal in a different context.
Type II outliers have been most commonly
explored in time-series data.
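As an illustration of a Type II outlier, the following Python sketch flags values that are unusual within their own context. It is only one simple possibility and not the procedure of Chandola et al. (2007): the function name contextual_outliers, the use of the median and the median absolute deviation (MAD), the threshold of 5, and the temperature readings are all assumptions made for this example. The month plays the role of the contextual attribute and the temperature is the behavioral attribute.

```python
from collections import defaultdict
from statistics import median


def contextual_outliers(records, threshold=5.0):
    """Return the indices of records whose value is unusual for its context.

    records is a list of (context, value) pairs. A value is flagged when
    its distance from the median of its own context exceeds `threshold`
    times that context's median absolute deviation (MAD).
    """
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)

    flagged = []
    for i, (context, value) in enumerate(records):
        values = by_context[context]
        med = median(values)
        mad = median(abs(v - med) for v in values)
        # Skip contexts with zero spread to avoid dividing by zero.
        if mad > 0 and abs(value - med) / mad > threshold:
            flagged.append(i)
    return flagged


# Hypothetical temperature readings (degrees Celsius) tagged by month.
records = [
    ("jan", 2.0), ("jan", 3.0), ("jan", 1.0), ("jan", 4.0), ("jan", 30.0),
    ("jul", 29.0), ("jul", 30.0), ("jul", 31.0), ("jul", 28.0), ("jul", 32.0),
]
flagged = contextual_outliers(records)
print(flagged)
```

In this assumed data set the reading of 30.0 is flagged in January (index 4), while the identical reading in July (index 6) is not, which is exactly the behavior that distinguishes a Type II outlier from a Type I outlier.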
(3) Type III Outliers.
These outliers occur because a subset of data
instances is outlying with respect to the entire
data set. The individual data instances in a Type
III outlier are not outliers by themselves, but
their occurrence together as a substructure is
anomalous. Type III outliers are meaningful only
when the data has a spatial or sequential nature.
These outliers are either anomalous subgraphs or
subsequences occurring in the data.
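A Type III outlier in sequential data can be sketched as follows. This is a deliberately simple heuristic of our own and not a method taken from the cited literature: the function name flat_windows, the window width of 3, the fraction 0.25, and the series itself are assumptions. The sketch slides a window over the series and flags windows whose spread (maximum minus minimum) is far below the typical spread.

```python
from statistics import median


def flat_windows(series, width=3, fraction=0.25):
    """Return start indices of windows that are unusually flat.

    A window is flagged when its spread (max - min) is smaller than
    `fraction` times the median spread over all windows.
    """
    spreads = []
    for i in range(len(series) - width + 1):
        window = series[i:i + width]
        spreads.append(max(window) - min(window))
    typical = median(spreads)
    return [i for i, s in enumerate(spreads) if s < fraction * typical]


# Hypothetical oscillating signal with a flat stretch in the middle.
series = [1, 5, 1, 5, 1, 5, 3, 3, 3, 3, 1, 5, 1, 5, 1, 5]
flagged = flat_windows(series)
print(flagged)
```

Every value 3 in the flat stretch lies well inside the range of the series, so no single instance would be flagged on its own; it is only the subsequence, starting at positions 6 and 7 for a window of width 3, that is anomalous.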
Almost all the studies that consider outlier
identification as their primary objective are in
the field of statistics. A comprehensive
treatment of outliers appears in Barnett and
Lewis (1994). They provide a list of about 100
discordancy tests for detecting outliers in data
following well-known distributions. The choice
of an appropriate discordancy test depends on:
a) the distribution,
b) the knowledge of the distribution parameters,
c) the number of expected outliers, and
d) the type of expected outliers.
These methods have two main drawbacks. First,
almost all of them are designed for univariate
data, which makes them unsuitable for
multidimensional data sets.
Second, all of them are distribution-based, and
in most cases the data distribution is
unknown. Real-world data are usually
multivariate with an unknown distribution.
Detecting outliers, that is, instances in a
database with unusual properties, is an important
data mining task. The data mining community
became interested in outliers after Knorr and Ng
(1998) proposed a non-parametric approach to
outlier detection based on the distance of an
instance to its nearest neighbors.
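A distance-based score in this spirit can be sketched in a few lines of Python. The sketch is not the exact formulation of Knorr and Ng (1998); it scores each instance by the distance to its k-th nearest neighbor, and the function name knn_distance_scores, the value k = 2, and the five two-dimensional points are assumptions made for illustration. No distributional assumption is needed, and the same code works for any number of dimensions.

```python
import math


def knn_distance_scores(points, k=2):
    """Score each point by the Euclidean distance to its k-th nearest neighbor.

    Larger scores indicate points that lie far from their neighbors.
    Requires Python 3.8 or later for math.dist.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Hypothetical data: four points on a unit square and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (8.0, 8.0)]
scores = knn_distance_scores(points, k=2)
most_outlying = max(range(len(points)), key=scores.__getitem__)
print(scores, most_outlying)
```

For this assumed data the four corner points each receive a score of 1.0, while the distant point at (8, 8) receives by far the largest score and is ranked as the most outlying instance. The brute-force search is quadratic in the number of instances, which is acceptable for a sketch but not for large data sets.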
Outlier detection in large data sets is an
active research field in data mining. It has many
applications in domains where illegal or abnormal
behavior may occur, such as fraud
detection, network intrusion detection, insurance
fraud, medical diagnosis, marketing, and customer
segmentation. Outlier detection has thus become
an important branch of data mining.
Frequently, outliers are removed to improve
the accuracy of the estimators. However, this
practice is not recommended, because outliers
can sometimes carry very useful information. The
presence of outliers can indicate individuals or
groups whose behavior is very different from the
normal situation.
Outlier detection is important because
outliers in data translate into
significant (and often critical) information in a
wide variety of application domains.
Outlier detection refers to the problem of
finding patterns in data that do not conform to
expected normal behavior.
Outlier detection techniques find
applications in credit card fraud detection,
network intrusion detection, financial
applications, and marketing. Outlier detection
was studied in the statistics community as early
as the nineteenth century (Edgeworth, 1887).
Outlier detection has been found to be
directly applicable in a large number of domains.
These applications can be listed as follows:
- Fraud detection: detecting fraudulent
applications for credit cards, state
benefits or detecting fraudulent usage of
credit cards or mobile phones.
- Loan application processing: to detect
fraudulent applications or potentially
problematical customers.
- Intrusion detection: detecting unauthorized
access in computer networks.
- Activity monitoring: detecting mobile phone
fraud by monitoring phone activity or
suspicious trades in the equity markets.
- Network performance: monitoring the
performance of computer networks, for
example to detect network bottlenecks.
- Fault diagnosis: monitoring processes to
detect faults in motors, generators,
pipelines or space instruments on space
shuttles for example.
- Structural defect detection: monitoring
manufacturing lines to detect faulty
production runs, for example cracked beams.
- Satellite image analysis: identifying novel
features or misclassified features.
- Motion segmentation: detecting image
features moving independently of the
background.
- Time-series monitoring: monitoring safety
critical applications such as drilling or
high-speed milling.
- Medical condition monitoring: such as
heart-rate monitors.
- Pharmaceutical research: identifying novel
molecular structures.
- Detecting novelty in text: to detect the
onset of news stories, for topic detection
and tracking or for traders to pinpoint
equity, commodities, FX trading stories,
outperforming or underperforming
commodities.
- Detecting unexpected entries in databases:
for data mining to detect errors, frauds,
or valid but unexpected entries.
- Detecting mislabeled data in a training
data set.
1.3 Problem definition
According to Hodge and Austin (2004), an
outlying observation, or outlier, is one that
appears to deviate markedly from other members
of the sample in which it occurs, or an
observation (or subset of
observations) which appears to be inconsistent
with the remainder of that set of data. Many
terms have been used for the task, including
novelty detection, anomaly detection, noise
detection, deviation detection, and exception
mining, but all involve a similar process or
goal. In addition to the many different names,
there are many more algorithms for solving the
problem.
Outlier detection can be treated as part of
data preprocessing or as the objective of data
mining itself. Many cases exhibit the
characteristics of outlying samples, such as
cheating in commercial data, web
intrusion, process faults, sudden variation of
operating conditions, and high-intensity sound in
industrial processes.
A numerical outlier is characterized by an
abnormal relationship among its values or by an
abnormal location in multidimensional space. The
commonly used methods of outlier detection are:
I. Classical techniques include:
1. Statistical approach
2. Clustering approach
3. Density approach
4. Depth approach
II. Intelligent techniques include:
1. Classification approach
2. Distance approach
3. Information theory approach
4. Spectral decomposition approach
5. Visualisation approach
6. Wavelet approach
Most of these techniques deal with univariate
data, but most of the problems statisticians
face involve multivariate (multidimensional)
data. Generally speaking, none of the
previously reported works in the literature can
guarantee sufficient confidence and reliability
in multidimensional outlier detection. Thus,
multidimensional outlier detection deserves
further study. We therefore propose a new
technique, Support Vector Machine-K Nearest
Neighbor (SVM-KNN), which is intended to achieve
higher accuracy and a higher detection rate. Its
results are compared with those of two popular
methods, KNN and SVM.
1.4 Historical review
Outlier detection is a critical task in many
safety-critical environments, as the outlier
indicates abnormal running conditions from which
significant performance degradation may well
result, such as an aircraft engine rotation
defect or a flow problem in a pipeline. An
outlier can denote an anomalous object in an
image, such as a land mine. An outlier may
pinpoint an intruder with malicious intentions
inside a system, so rapid detection is
essential.
Outlier detection can detect a fault on a
factory production line by constantly monitoring
specific features of the products and comparing
the real-time data with either the features of
normal products or those for faults. It is
imperative in tasks such as credit card usage
monitoring or mobile phone monitoring to detect a
sudden change in the usage pattern, which may
indicate fraudulent usage, such as a stolen card
or stolen phone airtime. Outlier detection
accomplishes this by analyzing and comparing the
time series of usage statistics.
For application processing, such as loan
application processing or social security benefit
payments, an outlier detection system can detect
any anomalies in the application before approval
or payment. Outlier detection can additionally
monitor the circumstances of a benefit claimant
over time to ensure the payment has not slipped
into fraud. Equity or commodity traders can use
outlier detection methods to monitor individual
shares or markets and detect novel trends, which
may indicate buying or selling opportunities.
A news delivery system can detect changing
news stories and ensure the supplier is first
with the breaking news. In a database, outliers
may indicate fraudulent cases or they may just
denote an error by the entry clerk or a
misinterpretation of a missing-value code. Either
way, detection of the anomaly is vital for
database consistency and integrity.
Outlier detection has been a topic of a
number of surveys and review articles, as well as
books. Hodge and Austin (2004) provided an
extensive survey of outlier detection techniques
developed in machine learning and statistical
domains.
Petrovskiy (2003) presented a brief review
of outlier detection techniques using data mining
algorithms. An extensive review of novelty
detection techniques using neural networks and
statistical approaches has been presented in
Markou and Singh (2003a) and Markou and Singh
(2003b), respectively.
Patcha and Park (2007) have provided a broad
survey of outlier detection techniques used for
intrusion detection in computer networks.
Lazarevic et al. (2003) presented an evaluation
of selected outlier detection techniques, used
for network intrusion detection.
Outlier detection techniques developed
specifically for system call intrusion detection
have been reviewed by Forrest et al. (1999), and
later by Snyder (2001) and Dasgupta and Nino
(2000).
Hido et al. (2011) used the ratio of training and
test data densities as an outlier score for
outlier detection.
A substantial amount of research on outlier
detection has been done in statistics and has
been reviewed in several books (Rousseeuw and
Leroy, 1987; Barnett and Lewis, 1994; Hawkins,
1980).
1.5 The aim of the work
The aim of this study is to address the
problem of detecting outliers in
multivariate data. To this end, we propose a
technique that detects outliers in multivariate
data.
To meet this aim, the thesis is organized
as follows:
Chapter 2: presents the techniques of
outlier detection, which are divided into
classical and intelligent techniques. First, the
classical techniques are illustrated, namely the
statistical, clustering, density, and depth
approaches. Second, the intelligent techniques
are illustrated, namely the classification,
distance, information theory, spectral
decomposition, visualisation, and wavelet
approaches.
Chapter 3: concentrates on some intelligent techniques for outlier detection, such as KNN and SVM. In addition, we present the algorithms of the KNN and SVM methods.
Chapter 4: illustrates the proposed method, SVM-KNN, with its algorithm and how to use it in outlier detection. In addition, this chapter presents a comparison of the most important methods: KNN, SVM, and the proposed SVM-KNN.
Finally, we introduce some suggestions for further studies in the same field.