Data Warehousing and Mining - PTU (Punjab Technical University)
Self Learning Material
Data Warehousing and Mining
(MSIT- 404)
Course: Masters of Science [IT]
Semester- IV
Distance Education Programme
I. K. Gujral Punjab Technical University
Jalandhar
Syllabus
I. K. G. Punjab Technical University
MSIT404 Data Warehousing and Data Mining
Section A: Review of Data Warehouse: Need for data warehouse, Big data, Data Pre-
Processing, Three tier architecture; MDDM and its schemas, Introduction to Spatial Data
warehouse, Architecture of Spatial Systems, Spatial: Objects, data types, reference
systems; Topological Relationships, Conceptual Models for Spatial Data, Implementation
Models for Spatial Data, Spatial Levels, Hierarchies and Measures, Spatial Fact
Relationships.
Section B: Introduction to temporal Data warehouse: General Concepts, Temporality Data
Types, Synchronization and Relationships, Temporal Extension of the Multi Dimensional
Model, Temporal Support for Levels, Temporal Hierarchies, Fact Relationships, Measures,
Conceptual Models for Temporal Data Warehouses: Logical Representation and Temporal
Granularity
Section C: Introduction to Data Mining functionalities, Mining different kind of data,
Pattern/Context based Data Mining, Bayesian Classification: Bayes theorem, Bayesian
belief networks Naive Bayesian classification, Introduction to classification by Back
propagation and its algorithm, Other classification methods: k-Nearest Neighbor, case
based reasoning, Genetic algorithms, rough set approach, Fuzzy set approach
Section- D: Introduction to prediction: linear and multiple regression, Clustering: types of
data in cluster analysis: interval scaled variables, Binary variables, Nominal, ordinal, and
Ratio-scaled variables; Major Clustering Methods: Partitioning Methods: K-Means and K-
Medoids, Hierarchical methods: Agglomerative, Density-based methods: DBSCAN
References:
1. Data Mining: Concepts and Techniques by J. Han and M. Kamber, Morgan Kaufmann Publishers
2. Advanced Data Warehouse Design (From Conventional to Spatial and Temporal Applications) by Elzbieta Malinowski and Esteban Zimányi, Springer
3. Modern Data Warehousing, Mining and Visualization by George M. Marakas, Pearson
Table of Contents
Chapter No. Title Written By
1 Data Warehouse: An overview - Ms. Rajinder Vir Kaur, DAVIET, Jalandhar
2 Data warehouse: Three tier architecture - Ms. Rajinder Vir Kaur, DAVIET, Jalandhar
3 Multidimensional data models - Ms. Rajinder Vir Kaur, DAVIET, Jalandhar
4 Spatial Data Warehouse - Ms. Rajinder Vir Kaur, DAVIET, Jalandhar
5 Temporal Data Warehouses- 1 - Ms. Seema Gupta, AP, Mayur College, Kapurthala
6 Temporal Data Warehouses- 2 - Ms. Seema Gupta, AP, Mayur College, Kapurthala
7 Introduction to data mining - Ms. Seema Gupta, AP, Mayur College, Kapurthala
8 Classification Techniques- 1 - Ms. Seema Gupta, AP, Mayur College, Kapurthala
9 Classification Techniques- 2 - Mr. Tarun Kumar, Lecturer, St. Joseph School, Barnala
10 Prediction - Mr. Tarun Kumar, Lecturer, St. Joseph School, Barnala
11 Introduction to clustering - Mr. Tarun Kumar, Lecturer, St. Joseph School, Barnala
12 Clustering Methods - Mr. Tarun Kumar, Lecturer, St. Joseph School, Barnala
Reviewed By:
Mr. Gagan Kumar
DAVIET, Kabir Nagar, Jalandhar,
Punjab, 144001
©I K Gujral Punjab Technical University Jalandhar
All rights reserved with I K Gujral Punjab
Lesson- 1 Data Warehouse: An overview
Structure
1.0 Objective
1.1 Introduction
1.2 Data Warehouse
1.2.1 Need of data warehouse
1.2.2 Difference between operational and Informational data stores
1.3 Big data
1.4 Data preprocessing
1.4.1 Steps in Data Pre-processing
1.5 Summary
1.6 Glossary
1.7 Answers to check your progress/self assessment questions
1.8 References/ Suggested Readings
1.9 Model Questions
1.0 Objective
After Studying this lesson, students will be able to:
1. Define data warehouse.
2. Discuss the need of data warehouse.
3. Describe the notion of big data.
4. Explain the need of data preprocessing.
1.1 Introduction
Every enterprise manages large volumes of data for its applications, and it is difficult for any business
to survive without database management systems. Initially, enterprises were interested only in
managing transactional data, i.e. in recording all day-to-day transactions. But in this competitive age,
companies need quick access to strategic information for improved decision making, and transactional
data stores fail to provide this support. Extracting interesting data from the transactional stores
according to end-user requirements, and aggregating it, are the keys to building strategic
information.
Data warehousing is a solution to this problem, and it has been around for more than two
decades now. A data warehouse is an integrated central repository of data extracted from the
heterogeneous data sources of an enterprise. It is important to get to the roots of why an enterprise
really needs a data warehouse: it is the lack of this understanding that kills motivation and leads to
the failure of so many data warehousing projects.
1.2 Data Warehouse
Data warehouse systems are probably the most popular among all decision support systems (DSSs). A
data warehouse may be defined as a collection of data that supports decision-making processes, and it
provides the following features:
It is subject-oriented.
It is integrated and consistent.
It is time-variant.
It is non-volatile.
Data warehouses are subject-oriented as they pivot on enterprise-specific concepts. Transactional
databases, on the other hand, pivot on enterprise-specific applications like payroll, inventory, and
invoicing. Data warehouses extract data from a variety of sources and should provide an
integrated view of that data. Data warehouse systems also add some degree of new information, but are
predominantly used for rearranging existing information. Operational data covers transactions
involving the latest data, i.e. data for a very short period of time. A data warehouse records all historical
data and lets you analyze past data as well. Data is kept in the warehouse indefinitely, and regular,
periodical updates are made to it from operational data stores.
The concept of data warehousing is simple: the need for strategic information gave birth to it. A data
warehouse does not generate new data; already existing data is transformed into forms suitable for
providing strategic information. The data warehouse facilitates direct access to data for business users,
a single unified version of the performance indicators, accurate historical records, and the ability to
analyze data from many different perspectives.
Figure 1.1: Mapping between operational and informational data stores
1.2.1 Need of data warehouse
You need a data warehouse to fulfill the following requirements:
1. Data Integration: Managers are always keen to find answers to key performance indicators, and a
manager may wish to analyze them across all products by location, time and channel. Data in different
operational data stores is not integrated and hence cannot be used directly for analysis tasks.
2. Advanced Reporting & Analysis: The data warehouse supports viewing of data from multiple
dimensions and supports querying, reporting and analysis tasks. Multidimensional models such as data
cubes are used to facilitate viewing of data from multiple dimensions.
3. Knowledge Discovery and Decision Support: Data in a warehouse is maintained at different
levels of abstraction using appropriate data structures, and hence it supports knowledge discovery and
helps in decision making.
4. Performance: Optimizing query response time makes the case for a data warehouse. Transactional
systems are meant to perform transactions efficiently and are designed to optimize frequent database
read and write operations. The data warehouse is designed to optimize frequent complex querying and
analysis. There is a need to separate the operational database from the informational database: ad-hoc
queries and interactive analysis take a heavy toll on transactional systems and drag their performance
down. Querying can be performed on the data warehouse without interrupting the transactional
database. A data warehouse can also hold on to historical data generated by transactional systems for
longer periods of time, relieving the transactional database of the historical data so that it can focus on
the current data.
Check your progress/ Self assessment questions- 1
Q1. List 4 features of data warehouse.
___________________________________________________________________________
__________________________________________________________________________
____________________________________________________________________________
Q2. Multidimensional models as _____________ are used to facilitate viewing of data from multiple
dimensions.
1.2.2 Difference between operational and Informational data stores
Operational data stores | Informational data stores
Large number of users. | Relatively small number of users.
Read, update and delete operations are performed. | Only read operations are performed.
The objective is to record and manage day-to-day transactions. | The objective is to provide decision-making support.
The model used is application-based. | The model used is subject-based.
Only current data is stored. | Both current and historical data are stored.
The data is updated continuously. | The data is updated periodically.
The database is highly normalized. | De-normalized, multidimensional data structure for optimization of complex queries.
Response time is minimal. | Response time can range from seconds to minutes or hours.
Involves predicted and repetitive queries. | Involves ad-hoc, random or heuristic queries.
1.3 Big data
Data over the last decade has grown exponentially. It became impossible for the current database
management systems to handle this humongous data. What is big data or how much data is
considered to big data? No rule of thumb is defined for it. Big data may be defined as the data that
exceeds the processing capacity of conventional database systems. The data gets so large that it
becomes impossible to process it, migrate it, or even store it. It becomes necessary to deploy advanced
tools to process big data.
Big data may be characterized using volume, velocity and variability of massive data. Within this data
lie valuable patterns and information. Today’s commodity hardware, cloud architectures and open
source software bring big data processing into the reach of the less well-resourced. Big data enables
an enterprise to conduct effective data analysisor even develop new products. Ability to process each
and every item in big data over a reasonable time promotes an investigative approach to data. There is
no need to create sampling of the large data.
Figure 1.2: Big Data
Source: www.shineinfotect.com.au
Big data became the best option for new start-ups, especially in the field of web services. Facebook is
a big example of big data: it has successfully built a highly personalized user experience and created a
new kind of advertising business. Some of the prominent users of big data are Google, Yahoo, Amazon
and Facebook. Big data can be very vague. Inputs to big data systems are banking transactions, social
networks, web server logs, satellite imagery, the content of web pages, etc. How do we characterize big
data, or differentiate between big data and manageable data? Three Vs (volume, velocity and
variety) are used to characterize big data.
Volume: The ability to process large amounts of information led to big data analytics. The ability to
forecast considering 100 factors rather than 5 is surely going to result in better prediction of demand.
Volume presents the most immediate challenge to conventional IT database systems. A large volume of
data needs highly scalable storage and a distributed approach to querying. Companies over the years
have stored historical data in the form of archives, maintained as logs that cannot be processed
directly; this choice relates to the variety feature of big data. The data warehousing approach involves
use of predetermined schemas, whereas a platform such as Apache Hadoop does not place any
conditions on the structure of the data it can process.
Hadoop is a platform for distributing computing problems across a number of servers. It implements
the MapReduce approach pioneered by Google in compiling its search indexes. Hadoop's MapReduce
involves distributing a dataset among multiple servers, operating on each part of the data (the map
stage), and then recombining the partial results (the reduce stage).
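The map-then-reduce data flow described above can be sketched in plain Python. This is only an illustration of the programming model with the classic word-count example, not actual Hadoop code; in a real cluster, the map calls run on many servers and the framework shuffles the pairs by key before reducing.

```python
from collections import defaultdict

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in a document."""
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

# Here both phases run locally, in sequence, to show the data flow.
documents = ["big data big insight", "data drives insight"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(pairs)  # e.g. word_counts["big"] == 2
```

The same two functions, unchanged, could count words across billions of documents once the framework distributes the map calls and groups the emitted pairs by key.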
Velocity: It refers to the rate at which data flows into an organization. An organization may not
currently have massive data volumes, but its velocity may be so high that it will eventually result in
massive volumes of data. The finance industry, with the help of big data tools, is able to take advantage
of this fast-moving data. The internet and mobile era has changed the way products and services are
delivered and consumed. Online retailers are able to compile large histories of customers with every
click and interaction; even if no transaction happens, data is generated from the browsing done by
customers. This helps online retailers to recommend additional purchases or make discounted offers.
Fast-moving data can be streamed into bulk storage for later batch processing; the key lies in the speed
of the feedback loop, i.e. taking data from input through to decision. Velocity also applies to a
system's output. The tighter or shorter the feedback loop, the better the competitive advantage.
Variety: Input data is highly unstructured and not ready for processing in its current state. Source
data is diverse and obtained from a variety of sources such as operational databases, external sources,
etc. Not all input data is in the form of relational structures; it may also be text from various social
networks or a raw feed directly from a sensor source. The input data from the sources lacks
integration.
Transformation of unstructured data into structured data is one application of big data. Processing is
done on the unstructured data to extract ordered meaning, for consumption either by humans or as
structured input to an application. Extraction involves picking up only the useful information and
throwing away the rest. Certain data types suit certain classes of database better: documents encoded
as XML are best stored in an XML-aware database such as MarkLogic, while graph databases such as
Neo4j are best suited to storing the relationships found in social networks.
Relational databases are not suitable for an agile environment in which the computations evolve with
the detection and extraction of more signals. Semi-structured NoSQL databases meet this need for
flexibility: they provide enough structure to organize the data, but do not require the exact schema of
the data before storing it.
Check your progress/ Self assessment questions- 2
Q3. _______________ data stores involves Ad-hoc, random or heuristic queries.
Q4. Big data may be defined as the data that exceeds the ___________ capacity of conventional
database systems.
Q5. Big data may be characterized using _________, ________ and ___________ of massive data.
1.4 Data preprocessing
The rules and standards followed to maintain transactional databases vary from region to region.
Transactional databases also suffer from various anomalies which make them unsuitable for analysis
tasks. It is important that all such problems, like inconsistency, are resolved before the
data is loaded into a data warehouse. Analysis performed on data that is inconsistent and not
integrated is going to produce inaccurate analysis results, which in turn will result in bad
decisions. Following are the key factors that make data preprocessing a must:
Incomplete Data: Often an analyst asks why the information related to some attribute was not
recorded. The most common answer is that it was not considered to be important. The vision of the
people involved in designing the relational model for transactional data stores is entirely different from
that of the people involved in analysis. Transactional data stores are optimized to record all live
transactions, even if they have to compromise on some optional attributes.
Inconsistent Data: There are many possible reasons for it, like use of faulty data entry hardware,
human errors during data entry, or data transmission errors while backing up. Incorrect data
may also be the result of inconsistencies in naming conventions, or inconsistent formats for attributes.
Descriptive data summarization helps us to know the general characteristics of the data. These
characteristics then help to identify the presence of noise or outliers in the data. The presence of
outliers and noise in data should be handled at the very first stage.
1.4.1 Steps in Data Pre-processing
You need to perform data preprocessing before the data from operational stores is archived in a data
warehouse.
1. Data Cleaning
Following are some of the data cleansing operations.
Missing Values: In case some values are missing, you can take the following actions:
a. Ignore the tuple.
b. Manually fill in the missing value.
c. Use a global constant to fill the missing value.
d. Fill the missing value with the attribute mean.
e. Predict the most probable value.
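Two of the simpler strategies above can be sketched in a few lines of Python. The attribute values here are hypothetical, chosen only to make the arithmetic easy to follow; missing readings are marked as None.

```python
from statistics import mean

# Hypothetical "age" attribute with two missing readings.
ages = [25, None, 31, None, 28]

# Strategy (d): fill each missing value with the attribute mean.
known = [v for v in ages if v is not None]
attribute_mean = mean(known)                      # (25 + 31 + 28) / 3 = 28
filled_mean = [v if v is not None else attribute_mean for v in ages]

# Strategy (c): fill each missing value with a global constant such as -1.
filled_constant = [v if v is not None else -1 for v in ages]
```

Filling with the mean preserves the attribute's average, while a sentinel constant like -1 keeps the missing values recognizable downstream; which strategy is appropriate depends on how the attribute will be analyzed.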
2. Handling Noisy Data
Noise is an error or random variance in a measured variable, and noise handling is mostly applied to
numeric variables. Some of the techniques are:
a. Binning
b. Regression
c. Clustering
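Binning, the first of these techniques, can be sketched as follows: the values are sorted, partitioned into equal-frequency bins, and each value is replaced by the mean of its bin, smoothing out local noise. The price values are the small illustrative example commonly used for this technique.

```python
def smooth_by_bin_means(values, bin_size):
    """Equal-frequency binning: sort the values, split them into bins
    of bin_size consecutive values, and replace each value by the
    mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        bin_values = ordered[i:i + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

# Noisy sorted prices smoothed by bin means (3 values per bin):
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
result = smooth_by_bin_means(prices, bin_size=3)
# Bins [4, 8, 15], [21, 21, 24], [25, 28, 34] become 9.0, 22.0, 29.0.
```

Variants of the same idea smooth by bin medians or by bin boundaries instead of bin means.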
3. Data Integration
Merging of data from multiple heterogeneous data stores is referred to as data integration. The data is
then transformed into formats appropriate for end-user analysis. The data analysis task in a data
warehouse involves data integration, which means keeping the data in the warehouse in a consistent
state. Schema integration must address the entity identification problem; the best example is the use of
different attribute names to store the same information on different platforms.
Redundancy is another important issue: if one attribute can be derived from another attribute, it is
considered a redundant attribute.
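One common way to detect such redundancy between two numeric attributes is correlation analysis: a Pearson correlation coefficient near +1 or -1 suggests one attribute is derivable from the other. The sketch below uses hypothetical price columns where one is an exact multiple of the other.

```python
from statistics import mean

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two numeric attributes;
    a value near +1 or -1 suggests the attributes are redundant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x ** 0.5 * var_y ** 0.5)

# price_inr is exactly 83 * price_usd, so one column is derivable
# from the other and can be dropped during integration.
price_usd = [10, 20, 30, 40]
price_inr = [830, 1660, 2490, 3320]
r = pearson_correlation(price_usd, price_inr)   # close to 1.0
```

For categorical attributes, a chi-square test plays the analogous role; the threshold at which an attribute is dropped as redundant is a design decision.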
4. Data Transformation
The data needs to be transformed or consolidated into formats appropriate for end-user analysis. Data
transformation involves:
Smoothing: Techniques such as binning, regression, and clustering are used to remove noise from the
data.
Aggregation: Aggregation operations are applied to the data. For example, performing an aggregate
operation on daily sales to get monthly, quarterly or annual sales.
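The sales aggregation just mentioned can be sketched as a simple roll-up from daily facts to monthly totals; the records and amounts below are hypothetical.

```python
from collections import defaultdict

# Hypothetical daily sales facts: (date, amount).
daily_sales = [
    ("2024-01-05", 100.0),
    ("2024-01-20", 150.0),
    ("2024-02-03", 200.0),
    ("2024-02-17", 50.0),
]

# Roll the daily facts up to monthly totals, as a data cube's
# time dimension would when moving from day to month granularity.
monthly_sales = defaultdict(float)
for date, amount in daily_sales:
    month = date[:7]                  # keep the "YYYY-MM" part
    monthly_sales[month] += amount
```

Quarterly or annual totals follow the same pattern with a coarser grouping key.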
5. Data Reduction
If you are preparing for data analysis, the data set from a data warehouse is likely to be huge. Data
reduction helps to reduce the data size and hence reduce the mining time, without affecting the
integrity of the original data. A reduced data set allows efficient mining on a smaller volume of data.
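One simple form of data reduction is numerosity reduction by random sampling: mining runs on a small sample drawn without replacement from the full fact set. The sizes below are illustrative.

```python
import random

# Reduce a hypothetical fact set of one million rows to a 1% sample
# drawn without replacement; mining on the sample is far faster while
# the sample approximately preserves the overall statistics.
random.seed(42)                        # fixed seed for repeatability
full_data = list(range(1_000_000))
sample = random.sample(full_data, k=10_000)
reduction_ratio = len(sample) / len(full_data)   # 0.01
```

Other reduction strategies mentioned in the literature include dimensionality reduction (dropping attributes) and data compression; sampling is simply the easiest to illustrate.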
Check your progress/ Self assessment questions- 3
Q6. List three actions that can be taken to overcome the problem of missing values during data
cleaning process.
___________________________________________________________________________
__________________________________________________________________________
____________________________________________________________________________
Q7. Binning is an example of handling __________ data problem.
___________________________________________________________________________
__________________________________________________________________________
____________________________________________________________________________
Q8. Merging of data from multiple heterogeneous data stores is referred to as data___________.
Q9. Which of the following is not a feature of data warehouse?
a. Subject oriented
b. Integrated
c. Volatile
d. Time variant
Q10. Which of the following are features of big data?
a. Volume
b. Velocity
c. Variety
d. All the above.
Q11. Which of the following is not a feature of data pre-processing?
a. Data cleansing.
b. Data production.
c. Data reduction.
d. Data transformation.
1.5 Summary
A data warehouse may be defined as a collection of data that supports decision-making processes.
Operational data stores are used to record and manage day-to-day transactions; their model is
application-based; only current data is stored; and the database is highly normalized. Informational
data stores, in contrast, are used to provide decision-making support; their model is subject-based;
they store both current and historical data; and the database is de-normalized. Data over the last
decade has grown exponentially, and it became impossible for conventional database management
systems to handle this humongous data. Big data may be defined as data that exceeds the processing
capacity of conventional database systems. Volume presents the most immediate challenge to
conventional IT database systems: a large volume of data needs highly scalable storage and a
distributed approach to querying. Velocity refers to the rate at which data flows into an organization;
an organization may not currently have massive data volumes, but its velocity may be so high that it
will eventually result in massive volumes of data. Transformation of unstructured data into structured
data is one application of big data: processing is done on the unstructured data to extract ordered
meaning, for consumption either by humans or as structured input to an application. It is important
that all problems, like inconsistency and other issues in the various sources, are resolved before the
data is loaded into a data warehouse.
1.6 Glossary
Data warehouse- Data warehouse is a collection of data that supports decision-making processes.
Operational store- Used to record day to day transactions.
Normalization- Model that is used to optimize the transactional databases.
De-Normalization- Model that is used to optimize the database query response time.
Data granularity- It refers to the level of detail at which each subject or fact is stored.
Big Data- Big data may be defined as the data that exceeds the processing capacity of conventional
database systems. The data gets so large that it becomes impossible to process it, migrate it, or even
store it.
1.7 Answers to check your progress/self assessment questions
1. Four features of data warehouse are subject-oriented, integrated, time variant and non-volatile.
2. Data cubes.
3. Informational
4. Processing.
5. Volume, velocity, variety.
6. In case some values are missing, you can take the following actions:
a. Ignore the tuple.
b. Manually fill in the missing value.
c. Use a global constant to fill the missing value.
7. Noisy.
8. Integration
9. c.
10. d.
11. b.
1.8 References/ Suggested Readings
1. Data Mining: Concepts and Techniques by J. Han and M. Kamber, Morgan Kaufmann Publishers
2. Advanced Data Warehouse Design (From Conventional to Spatial and Temporal Applications) by Elzbieta Malinowski and Esteban Zimányi, Springer
3. Modern Data Warehousing, Mining and Visualization by George M. Marakas, Pearson.
1.9 Model Questions
1. List various needs of data warehouse.
2. Define data warehouse.
3. Differentiate between operational and informational data stores.
4. What is big data? Explain the 3 V's of big data.
5. What is the need of data pre-processing?
6. List various steps in data pre-processing.
Lesson- 2 Data warehouse: Three tier architecture
Structure
2.0 Objective
2.1 Introduction
2.2 Data warehouse three-tier architecture
2.2.1 Data Sources
2.2.2 ETL
2.2.3 Bottom Tier
2.2.3.1 Data Mart
2.2.4 Middle tier
2.2.4.1 OLAP servers
2.2.5 Top Tier
2.2.5.1 Front-End Reporting Tool
2.3 Summary
2.4 Glossary
2.5 Answers to check your progress/self assessment questions
2.6 References/ Suggested Readings
2.7 Model Questions
2.0 Objective
After studying this lesson, students will be able to:
1. Explain the three-tier architecture of a data warehouse.
2. Identify the various sources of data for a data warehouse.
3. Describe the ETL process.
4. Discuss the components of each tier, including data marts.
2.1 Introduction
Now that you are familiar with the data warehouse, in this lesson you will learn the three-tier
architecture of a data warehouse. Data goes through a lot of phases before it is ready for analysis: it
must be processed before it is considered fit for analysis, and there is also a need to represent the data
using a model fit for quick response to queries. Overall, the architecture of a data warehouse is
divided into three tiers, and in this lesson you will learn about each tier in detail.
2.2 Data warehouse three-tier architecture
A data warehouse can also be implemented as a single-tier or two-tier architecture, but both fail to
distinguish between the activities performed in a data warehouse architecture. A single-tier
architecture is like having no data warehouse: the actual data source acts as the only layer, all data
warehouse activities are performed at the data source site, and the data warehouse has only a virtual
existence. The main drawback of the single-tier architecture is that it fails to separate analytical and
transactional processing. End-user queries are submitted to the source database only, which also
affects the performance of the transactional database, i.e. the source database.
The two-tier architecture provides one additional layer over the single-tier architecture, namely
the actual physical data warehouse, so there is a clear separation between the source data
and the data for analysis. The source data is placed into a staging area where it is extracted,
cleansed to remove inconsistencies, and integrated into one common schema using advanced
ETL tools. The information is then stored in one logically centralized repository called the data
warehouse, and this central repository interacts with end-users through the analysis layer. Still,
the two-tier architecture does not provide a multidimensional server to speed up query
processing.
Following is the structure of the three-tier architecture. It provides an additional layer to store the data
using a model fit for carrying out analysis tasks in an effective manner.
Figure 2.1: Three-tier architecture of data warehouse
Source: http://slideplayer.com/slide/2493383/
2.2.1 Data Sources
Before you are introduced to the three tiers of the data warehouse architecture, you need to understand
the sources from which data is brought into the data warehouse.
Production Data: Operational or transactional data stores result in production data. Not all data is
imported from the operational data stores; only the data useful for analysis is extracted from them.
Internal Data: Internal data refers to data stored in private spreadsheets, documents, customer
profiles, etc. Internal data of an enterprise has nothing to do with the operational data stores.
Archived Data: Operational systems focus only on current business data requirements. Historical
snapshots of data can be obtained from archived files, where the data is stored in the form of logs.
External Data: A high percentage of the information used by executives depends on data from external
sources. External data includes market share data of competitors, data released by various government
agencies, and various standard financial indicators used to check on performance.
2.2.2 ETL
Once you have identified the various source components, it is time to prepare the data before it is
stored in the data warehouse. Data from several dissimilar sources is full of inconsistencies, and it also
lacks integration. There is a need to transform the data into a format suitable for querying and analysis.
Data extracted from the source components is transformed and then loaded into the data staging
component of the data warehouse. Data staging comprises a temporary storage area and a set of
functions to clean, change, combine, convert, and prepare source data for storage and use in the data
warehouse. The overall ETL activity can be expressed as follows:
Data Extraction: Data extraction deals with numerous data sources, and each data source requires use
of a relevant and appropriate technique. Data sources employ different models to store data: part of
the data may be stored using relational database systems, legacy network and hierarchical data
models, flat files, private spreadsheets, local departmental data sets, etc.
Data Transformation: Transformation is an important function considering the heterogeneous nature
of the source data components. When you implement a database system in an enterprise for the first
time, data is entered manually from the prior system's records, or extracted from a file system and
saved to the relational database system. In either case, there is a need to transform the data format
from the prior systems. Similarly, the data extracted from the various dissimilar data store components
must be transformed into a centralized format acceptable for querying and analyzing the data.
Data Loading: When the data warehouse goes live for the first time, the initial load moves large
volumes of data, using up substantial amounts of time. Once the data warehouse is running and the
initial loading has been done, continuous extraction and transformation of the changes to the source
data are fed in as incremental data revisions on an ongoing basis. All operations that lead to loading
of data into the data warehouse are performed by the load manager.
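The three ETL steps just described can be sketched end to end in a few functions. This is a minimal illustration, not a real warehouse API: the source records, field layout, and cleansing rules (trimming names, unifying gender codes, dropping incomplete rows) are all hypothetical.

```python
def extract(source_rows):
    """Extract: pull raw records from a source system."""
    return list(source_rows)

def transform(rows):
    """Transform: remove inconsistencies and unify the format."""
    gender_codes = {"m": "M", "male": "M", "f": "F", "female": "F"}
    cleaned = []
    for name, gender, amount in rows:
        if amount is None:
            continue                            # cleansing: drop incomplete row
        cleaned.append((
            name.strip().title(),               # unify name format
            gender_codes[gender.lower()],       # unify gender coding
            float(amount),                      # unify amount type
        ))
    return cleaned

def load(rows, warehouse):
    """Load: append the prepared rows to the warehouse store."""
    warehouse.extend(rows)

# One pass over a small, inconsistent source:
warehouse = []
source = [(" alice ", "female", "120.5"), ("BOB", "m", None), ("carol", "F", "80")]
load(transform(extract(source)), warehouse)
```

After the run, the warehouse holds only the cleansed, consistently formatted rows; an incremental load would call the same pipeline on just the changed source records.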
Check your progress/ Self assessment questions- 1
Q1. Main drawback of single tier architecture is that it fails to separate ____________ and
________________ processing.
Q2. List difference sources of data.
__________________________________________________________________________
__________________________________________________________________________
Q3. What is the need of data transformation?
__________________________________________________________________________
__________________________________________________________________________
2.2.3 Bottom Tier
The bottom tier of the data warehouse architecture stores the data after adequate transformation has
been performed. The model used to store data in the bottom tier is mostly the relational data model.
Data from various sources is inserted into the bottom tier using back-end tools, and is initially loaded
into the staging area. Data may either be stored in a single central repository or maintained in smaller
sub-sets of the data warehouse called data marts. Data at the bottom tier is free of inconsistency and
integration problems.
Components/Roles of bottom tier of Data warehouse architecture:
1. Warehouse Monitoring
The warehouse manager is responsible for the monitoring of data at the bottom tier, and may also be
assigned responsibility for the overall management of the data warehouse. Responsibilities of the
warehouse manager include creating indexes, deciding on the level of denormalization, generating
pre-computed summaries or aggregations, and archiving data. The warehouse manager also does
query profiling.
2. Detailed Data
It is important to maintain detailed data in the data warehouse. All of it may not be useful, and it is not
directly used for analysis purposes; it is basically used to generate aggregated data.
3. Lightly and Highly Summarized Data
Summarized data is the key to providing fast responses to ad-hoc queries. Summaries are saved
separately in the bottom tier. This data keeps on changing depending upon the nature of the query
profiles or changes in the demands of end users. Query profiling is done to ascertain the general
nature of queries in the past.
4. Archive and back-up data
Back-up of the detailed data is maintained in the form of archives in the bottom tier itself. Archives
are saved as logs and hence take much less space than the actual data.
5. Meta Data
It refers to data about data, and is generally maintained for each activity performed in the data
warehouse. It helps to understand the flow of data in the warehouse. Meta data is maintained for the
extraction, loading and transformation processes; it helps to understand the type of cleansing and
transformation performed on the data, and gives an insight into the type of inconsistencies that existed
in the source data. Meta data is also useful in automating the generation of summaries from detailed
data.
2.2.3.1 Data Mart
A data mart is a department-level data warehouse. It includes information relevant to a particular
business area, department, or category of users. Data marts are designed to satisfy the decision
support needs of a specific department or functional unit. For example, the Sales, Marketing, and
Accounts departments may all have their own data marts. Some data marts depend on
other data marts for their information. The data marts populated (using the top-down approach) from a
primary data warehouse are mostly dependent. Data marts are very useful for data warehouse systems in
large enterprises, as they allow data warehouses to be developed incrementally. Data marts are used
to mark out the information required by a category of users to solve queries, and they can deliver better
performance since they are smaller than primary data warehouses.
Data marts have emerged as key concept along with the rapid growth of data warehouses. Data marts
are similar to data warehouses, but the scope is much smaller (department level). Rather than focusing
on all business activities of an enterprise, data mart focuses on only a single subject.
Data mart can extract data either from the centralized data warehouse or it can extract data directly
from the operational and other sources. Data marts are preferred when the size of the data warehouse
grows to an unmanageable proportion.
Check your progress/ Self assessment questions- 2
Q4. Summarized data is key to providing fast response to _______ queries.
Q5. Query __________ are done to ascertain the general nature of queries in the past.
Q6._______________ refers to data about data.
Q7. Data mart refers to a department level data warehouse. ( TRUE / FALSE ).
_____________________________________________________________
2.2.4 Middle tier
The middle tier of the 3-tier architecture is an extension of the relational model. The relational model as such is
not fit for analysis. In this layer the data is transformed into a model that is fit for analysis.
2.2.4.1 OLAP servers
OLAP is a relatively new technology, and it comes with several varieties. OLAP servers provide
multidimensional view of data to the managers. Following are the different OLAP servers based on
their implementation:
1. Relational OLAP (ROLAP) servers (Star Schema based): These are the intermediate servers
that provide an interface between a relational back-end server and front-end client tools. ROLAP uses an
extended relational DBMS to store data. ROLAP servers include optimization for the back-end DBMS,
implementation of aggregation logic, and additional tools and services. ROLAP technology tends to
provide greater scalability than some of the other OLAP servers.
2. Multidimensional OLAP (MOLAP) servers (Cube based): These support a multidimensional view of
data using array-based multidimensional storage engines. Data cubes are used to map the
multidimensional view. The data cube allows fast indexing to pre-computed summarized data. With
multidimensional data stores, storage utilization can be significantly low if the data set is sparse. In
such cases, sparse matrix compression techniques should be used. MOLAP servers often adopt a
two-level storage representation to handle sparse and dense data sets: the dense sub-cubes are
identified and stored as array structures, while the sparse sub-cubes use compression technology for
efficient storage utilization.
3. Hybrid OLAP (HOLAP) servers: Both ROLAP and MOLAP come with their own sets of
advantages and disadvantages. The hybrid OLAP approach is a combination of both the ROLAP
and MOLAP technology. It inherits the high scalability of ROLAP and the faster computation of
MOLAP. HOLAP server allows large volumes of detailed data to be stored in a ROLAP relational
database, while aggregations are kept in a separate MOLAP store.
4. Specialized SQL servers: In order to meet the ever-growing demand for OLAP processing in
relational databases, many relational and data warehousing firms implement specialized SQL servers
that offer advanced query language and query processing support for queries over star and snowflake
schemas in a read-only environment.
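The two-level storage representation that MOLAP servers use (item 2 above) can be illustrated with a toy sketch. All figures are hypothetical, and a real server uses chunked arrays and far more sophisticated compression; a coordinate-to-value map is just one simple form of sparse-matrix compression.

```python
# Sketch of MOLAP's two-level storage: a dense sub-cube kept as a plain
# array, a sparse sub-cube kept as a coordinate -> value map.
# Data is illustrative.

# Dense sub-cube: sales for 2 products x 3 months, nearly every cell filled
dense = [
    [10, 12, 11],
    [ 7,  9,  8],
]

# Sparse sub-cube: 1000 products x 12 months, but only 3 non-empty cells.
# An array would waste 12,000 cells; the dict stores only 3 entries.
sparse = {(17, 0): 5, (233, 6): 2, (941, 11): 9}

def cell(product, month):
    """Look up a sparse cell; absent cells default to zero."""
    return sparse.get((product, month), 0)

total_dense = sum(sum(row) for row in dense)    # aggregate the dense part
total_sparse = sum(sparse.values())             # aggregate the sparse part
print(total_dense, total_sparse)   # 57 16
```

The point of the split is that each sub-cube is stored in the representation that suits its density, while queries see one logical cube.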
Check your progress/ Self assessment questions- 3
Q8. ROLAP provides higher scalability than MOLAP.
a. TRUE
b. FALSE
Q9._____________ allows fast indexing to pre-computed summarized data.
Q10. Data mart is:
a. Enterprise wide data warehouse.
b. Department wide data warehouse.
c. Not a data warehouse.
d. Replication of data warehouse.
Q11. R in ROLAP stands for:
a. Reduced
b. Rational
c. Relational
d. Re
2.2.5 Top Tier
The top tier acts as a front-end reporting tool for the analysts. The analysts submit their queries using
this front-end reporting tool, which can be an interactive, navigation-based GUI. End users with little
or no knowledge of query languages can also operate it. Analysts who are well versed in query
languages can type their own queries using a command-based reporting tool. The results are displayed
using a variety of visualization tools. The top tier takes the services of a Query Manager. End-user
queries are managed by the query manager through front-end tools, which can also be third-party
software.
2.2.5.1 Front-End Reporting Tool
Ultimately, everything rests on the worth of the reporting tools provided to the end user. Depending upon the
extent of features to be used, a proper investigation of the alternatives of building or purchasing a reporting
tool must be done. It involves comparing the cost of building a custom reporting (and OLAP) tool
with the purchase price of a third-party tool. Reporting tools also have drill-down capabilities; many
of the services can be realized through basic software services like the Pivot Table Service of
Microsoft Excel 2000. If the reporting requires more services than what Excel can offer, there is a
need to develop or buy a full-fledged OLAP tool.
It is sometimes advisable to buy a third-party reporting tool like Microsoft Data Analyzer before
jumping into the process of developing your own software, because reinventing the wheel is not
always beneficial or affordable. Building OLAP tools is not a trivial exercise by any means.
Check your progress/ Self assessment questions- 4
Q12. The top tier acts as front end ____________ tool to the analysts.
Q13. _____________ is an example of a third party reporting tool.
Q14. Analyst must have complete knowledge of query language in order to use front end reporting
tools. ( TRUE / FALSE )
_____________________________________________________
2.3 Summary
Single-tier architecture is like having no data warehouse: all data warehouse activities are performed at the
data source site, and end-user queries are submitted to the source database itself. This also affects the
performance of the transactional database, i.e. the source database. Two-tier architecture provides
one additional layer over the single-tier architecture, namely the actual physical data warehouse.
Data extracted from source components is transformed and then loaded into the data staging component
of the data warehouse. Data from various sources is inserted into the bottom tier using back-end tools.
Data marts are very useful for data warehouse systems in large enterprises, as they allow data warehouses
to be developed incrementally. The middle tier of the 3-tier architecture is an extension of the relational
model. OLAP servers provide a multidimensional view of data to the managers. The top tier acts as a
front-end reporting tool for the analysts, who submit their queries using it. End-user queries are managed
by the query manager. Microsoft Data Analyzer is a popular third-party front-end reporting tool.
2.4 Glossary
Data warehouse- Data warehouse is a collection of data that supports decision-making processes.
Data mart- A data mart is a department-level data warehouse. Data marts are designed to satisfy the
decision support needs of a specific department or functional unit.
ETL tool- A tool that facilitates extraction, transformation and loading of source data into a data warehouse.
Staging area- Temporary storage area where the data is initially loaded before it is fed into data
warehouse.
ROLAP- It refers to intermediate servers that provide interface between a relational back-end
server and front-end client tools. ROLAP servers include optimization for back-end DBMS,
implementation of aggregation logic, and additional tools and services.
MOLAP- It supports a multidimensional view of data using array-based multidimensional storage
engines. Data cubes are used to map the multidimensional view.
2.5 Answers to check your progress/self assessment questions
1. Analytical, transactional.
2. Different sources of data are:
Production data
Internal data
External data
Archived data
3. Data sources do employ different models to store data. Transformation is an important function
considering the heterogeneous nature of source data components.
4. ad-hoc
5. Profiles.
6. Meta data.
7. TRUE.
8. a.
9. Data cube.
10. b.
11. c.
12. Reporting
13. Microsoft Data Analyzer
14. FALSE.
2.6 References/ Suggested Readings
1. Data Mining: Concepts and Techniques by J. Han and M. Kamber, Publisher: Morgan Kaufmann
Publishers
2. Advanced Data Warehouse Design (From Conventional to Spatial and Temporal Applications) by
Elzbieta Malinowski and Esteban Zimányi, Publisher: Springer
3. Modern Data Warehousing, Mining and Visualization by George M Marakas, Publisher Pearson.
2.7 Model Questions
1. Explain in detail the use of ETL tools.
2. What is a data mart? List various advantages of creating a data mart over data warehouse.
3. List various sources of data.
4. What is Meta data? What is the advantage of maintaining meta data?
5. Explain ROLAP and MOLAP.
6. What is a front end reporting tool?
Lesson- 3 Multidimensional data models
Structure
3.0 Objective
3.1 Introduction
3.2 Data model for OLTP
3.3 Multidimensional data model
3.3.1 Schemas for multi-dimensional data
3.3.2 Designing a dimensional model
3.3.3 Dimension Table
3.3.4 Fact Table
3.3.5 Star schema
3.3.5.1 Additivity of facts
3.3.5.2 Surrogate Keys
3.3.6 Snowflake Schema
3.3.7 Difference between Star schema and Snow-flake schema
3.3.8 Fact Constellation
3.4 Summary
3.5 Glossary
3.6 Answers to check your progress/self assessment questions
3.7 References/ Suggested Readings
3.8 Model Questions
3.0 Objective
After Studying this lesson, students will be able to:
1. Define denormalization.
2. Describe the fact table and dimension table used in multidimensional model.
3. Explain the various schemas used in multidimensional data model.
4. Differentiate between star and snowflake schemas.
3.1 Introduction
Objective of creating a data warehouse is entirely different from creating a transactional data store.
Hence, the data model needed to maintain the data warehouse is also different from transactional data
stores. You need to design a data model that support faster retrieval of data. In this lesson you will
learn the basic data models used in OLAP and other terminologies used with it.
3.2 Data model for OLTP
OLTP systems are based on normalized relational database models and are used to manage basic
transactional operations. Transactional operations include insertion, deletion and updation. Selection or
retrieval is also performed, but the queries used are predictable in nature and involve small amounts of
data. OLTP is optimized to perform a large number of transactions per second, and the frequency of
transactions is very high. The response time is very low.
Figure 3.1: Data model for OLTP (https://functionalmetrics.wordpress.com/tag/relational-model/)
The OLTP systems are based on the ER model. An ER model is an abstract way of relating to
a database. Data stored in one table of a relational database often points to data stored in other
tables. The ER model describes each attribute or table as an entity, and the relationships between them.
The ER model is based on the concept of normalized databases; in other words, the database model
used for representing the ER model is called a normalized database. A normalized database model is a
way of organizing the attributes and tables of a relational database to minimize redundancy. Data
initially is in un-normalized form, i.e. all related attributes of a database are stored within a single table.
A normalized data model typically involves breaking large tables into smaller, less redundant tables and
defining relationships between them. This type of data isolation helps in speeding up
transactional processing. For example, additions, deletions, and modifications to a field can
be made in a single table and then propagated through the rest of the database using the defined
relationships.
Let us consider an example of un-normalized database and how it can be normalized to various levels
of database normalization that supports OLTP systems.
Name   | Address | Books issued              | Stream
Jitesh | CHD     | OS, OB                    | IT, Management
Sachin | JAL     | Communication Skills, POM | English, Management
Kunal  | PHG     | ACA                       | IT
Table 3.1: Un-normalized database.
The "Books issued" and "Stream" columns have multiple values. To overcome this problem, we move to
First Normal Form. In 1st normal form, each table cell must contain a single value and each record must
be unique.
Name   | Address | Book issued          | Stream
Jitesh | CHD     | OS                   | IT
Jitesh | CHD     | OB                   | Management
Sachin | JAL     | Communication Skills | English
Sachin | JAL     | POM                  | Management
Kunal  | PHG     | ACA                  | IT
Table 3.2: First normal form
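The move from Table 3.1 to Table 3.2 can be expressed as a small transformation sketch. Normalization is really a schema-design activity, not a runtime one; the code below merely illustrates the reshaping, using the same rows as the tables.

```python
# Flatten the multi-valued "Books issued" and "Stream" columns of
# Table 3.1 into one row per (book, stream) pair, as in Table 3.2 (1NF).

unnormalized = [
    {"name": "Jitesh", "address": "CHD",
     "books": ["OS", "OB"], "streams": ["IT", "Management"]},
    {"name": "Sachin", "address": "JAL",
     "books": ["Communication Skills", "POM"],
     "streams": ["English", "Management"]},
    {"name": "Kunal", "address": "PHG",
     "books": ["ACA"], "streams": ["IT"]},
]

first_normal_form = [
    {"name": r["name"], "address": r["address"], "book": b, "stream": s}
    for r in unnormalized
    for b, s in zip(r["books"], r["streams"])   # pair each book with its stream
]

for row in first_normal_form:
    print(row)   # one atomic value per cell, each record unique
```

The result has five rows, matching Table 3.2: every cell holds a single value, at the cost of repeating the name and address.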
Discussing all normalizations is beyond the scope of this lesson. Next you will learn the core topic of
this lesson and that is multidimensional model used to maintain data in data warehouse.
3.3 Multidimensional data model
Multidimensional data models are best suited for data analysis. The multidimensional data model
is entirely different from the relational model: it lets you view data from multiple dimensions. A data cube
structure is used to view data along three dimensions.
Figure 3.2: Multidimensional cube
Data in a data cube is defined by dimensions and facts. Each dimension in the data cube is assigned
a dimension table (discussed later in this lesson). Each dimension table is connected
to one central fact table (also discussed later in this lesson). A number of operations can be performed
on the data cube in order to provide better views of the data and views from specified
dimensions.
It is easy and fast to perform pre-computation using data cubes. A number of multidimensional
schemas based on this MDDM can be generated; these schemas are different from the schemas used
to represent the relational model. It is easy for analysts to identify interesting measures, dimensions
and attributes, which makes it easy and effective to organize data into levels and hierarchies. MDDM is
based on de-normalization, which is the process of adding back a small degree of redundancy to a
normalized database. A normalized database may be useful for speeding up the recording of transactions,
but it certainly limits the speed of responding to various ad-hoc queries. The tables of a normalized
database may be stored on different disks and occasionally even on different sites; trying
to fetch data from all of them in response to a join query can result in a large response time.
Data warehouses are based on providing faster responses to queries. De-normalization results in the need
for a large repository. De-normalization optimises the query responsiveness of a database by adding
redundant data back to the normalized database. Not all attributes of the normalized database are joined
together; only the attributes that are part of join queries are added back.
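The pay-off of de-normalization can be sketched with a toy example. The table contents and attribute names below are illustrative, not from the text.

```python
# Denormalization sketch: join two normalized tables once, store the
# redundant result, and answer ad-hoc queries without runtime joins.

customers = {1: {"name": "Jitesh", "city": "CHD"},
             2: {"name": "Sachin", "city": "JAL"}}

orders = [{"cust_id": 1, "amount": 500},
          {"cust_id": 2, "amount": 300},
          {"cust_id": 1, "amount": 200}]

# Add the frequently-queried customer attributes back into each order row
denormalized = [{**o, **customers[o["cust_id"]]} for o in orders]

# The ad-hoc query "total sales per city" now scans one table, no join needed
totals = {}
for row in denormalized:
    totals[row["city"]] = totals.get(row["city"], 0) + row["amount"]
print(totals)   # {'CHD': 700, 'JAL': 300}
```

The customer name and city are now stored redundantly in every order row, which is exactly the storage-for-speed trade described above.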
Check your progress/ Self assessment questions- 1
Q1. Define ER model.
___________________________________________________________________________
__________________________________________________________________________
____________________________________________________________________________
Q2. Data using multidimensional model can be viewed as _________________.
Q3. What is the objective of denormalization?
___________________________________________________________________________
__________________________________________________________________________
____________________________________________________________________________
3.3.1 Schemas for multi-dimensional data
You already know that the ER model is used to implement the relational model in OLTP systems, and
that the multidimensional model is the most popular model for data warehouses. The goal of a dimensional
modelling schema is to represent a set of business measurements using a framework that is easy for
end users to understand. A fact table in a dimensional model contains the measurements of the business.
A fact table consists of foreign keys, or dimensions, that join to their respective dimension tables. A fact
depends upon its dimensions stored in the dimension tables. A dimension table has a primary key that
provides referential integrity with the foreign key of the fact table.
3.3.2 Designing a dimensional model
Following factors must be kept in mind while designing a dimensional model.
1. Selection of Business Process
It is important to identify the business process that needs to be modelled.
2. Granularity
It is key to the future analysis process. Results of data analysis are based on the level of granularity you
choose. Granularity refers to the level of detail in a fact table. A high level of granularity helps to analyze
the data better. It will surely increase the storage overhead, but if you do not store detailed data to begin
with, there is no way to generate it in the future.
3. Choice of Dimensions
It is directly linked to the granularity. Dimensions must be carefully selected to begin with and no
dimension should be left out. Addition of dimensions at later stage can be of little or no use.
4. Identification of the Facts
It is linked to selection of the business process. The central fact table is a direct representation of
business activity. Identifying fact tables involve examining of the business to identify the transactions
of interest.
3.3.3 Dimension Table
A dimension table is used to represent one dimension of the MDDM. Dimension table model is used
to represent the business dimensions. Each dimension table consists of a key attribute that is used to
connect with the central fact table. Generally a dimension table consists of a large number of
attributes. Depending on the schema in use, all attributes can be kept in a single dimensions table or
the same can be normalized and broken into number of dimension tables. All dimension tables are
connected to the central fact table, and no two dimensions represented using different key attributes
can be joined together.
3.3.4 Fact Table
There is only a single fact table for a business activity. This single fact table is connected to all
dimension tables in the MDDM. The central fact table does not have a key attribute of its own; all keys
in the fact table are foreign keys connected to the key attributes of the surrounding dimension tables. One
must keep a high level of granularity for the fact table, i.e. more and more attributes should be saved for a
fact. The additivity of a fact table is a key feature. The measures of a fact table can be fully
additive, semi-additive or non-additive. Additivity of a fact defines the ability of the
fact to be aggregated across all dimensions and their hierarchies without changing the original meaning
of the fact.
3.3.5 Star schema
Star schema is the basic MDDM schema. In a star schema, a central fact table is directly connected to
each dimension table in the MDDM. No dimension is normalized or split to form multiple dimension
tables. Each dimension table consists of a large number of attributes. The pictorial representation of the
star schema takes the form of a star. A cube or hypercube can be used to represent a star schema.
Figure 3.3: Star schema.
3.3.5.1 Additivity of facts
Facts in a star schema can be categorized, based on additivity, into the following three types.
Additive: Additive facts are ones that can be summed up across all of the dimensions in the fact
table.
Semi-Additive: Semi-additive facts are ones that can be summed up across some of the dimensions in the
fact table, but not the others.
Non-Additive: Non-additive facts are ones that cannot be summed up across any of the dimensions
present in the fact table.
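The three categories can be illustrated with a small sketch; the measures and figures below are hypothetical.

```python
# Additivity sketch: the same fact rows, three kinds of measures.
facts = [
    {"month": "Jan", "store": "A", "sales": 100, "stock": 40, "margin_pct": 0.20},
    {"month": "Jan", "store": "B", "sales": 150, "stock": 60, "margin_pct": 0.30},
    {"month": "Feb", "store": "A", "sales": 120, "stock": 45, "margin_pct": 0.25},
]

# Additive: 'sales' may be summed across every dimension
total_sales = sum(f["sales"] for f in facts)                       # meaningful

# Semi-additive: 'stock' (a balance) sums across stores within one month...
jan_stock = sum(f["stock"] for f in facts if f["month"] == "Jan")  # OK
# ...but summing it across months (40 + 60 + 45 = 145) is meaningless.

# Non-additive: a ratio like 'margin_pct' cannot be summed at all; it must
# be recomputed from its underlying additive components at each level.
print(total_sales, jan_stock)   # 370 100
```

Balances and ratios are the classic semi-additive and non-additive cases: the arithmetic always runs, but only the additive sum keeps its business meaning across every dimension.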
Check your progress/ Self assessment questions- 2
Q4. Define star schema.
___________________________________________________________________________
__________________________________________________________________________
____________________________________________________________________________
Q5. __________facts are the ones that can be summed up through all of the dimensions in the fact
table.
3.3.5.2 Surrogate Keys
Dimension tables can be connected to the fact table using surrogate keys. It is possible that a single key
value is being used by different instances of the same entity across different OLTP systems. Surrogate
keys help to handle such keys inside a dimension table.
cust_id | customer_name
1       | Jitesh
2       | Ravi
3       | Sachin
Table 3.3: Relational database 1
cust_id | customer_name
1       | Ram
2       | Harry
3       | Karan
Table 3.4: Relational database 2
It is clearly visible that the cust_id value "1" is being used for two different customers
across the two operational systems. This is a major problem faced by data warehouse designers. Some of
the scenarios where DW designers face such a problem are as follows:
1. When consolidating information from various source systems.
2. When a company acquires some other company and is trying to create/modify data warehouses of
two companies.
3. Systems developed independently might not be using the same keys.
4. When the value of the key in the source system gets changed in the middle of a year.
Due to this problem, it is not guaranteed that a production key is unique. Sometimes entities become
obsolete and their keys are reassigned to new entities in the operational systems; if we use such keys as
the primary keys for dimension tables, we face the problem of a single key relating both to the data of the
newer entity and to the data of the old entity. The use of production-system keys as primary keys for
dimension tables should therefore be avoided.
A surrogate key is capable of uniquely identifying each entity in the dimension table, irrespective of
its original source key. A surrogate key is a simple integer value or sequence number generated for every
new entity. Surrogate keys do not have any built-in meaning and are used to map to the production-system
keys of the source systems.
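The mechanism just described can be sketched as follows. The generator and the mapping structure are one simple possibility, not a standard API.

```python
# Surrogate-key sketch: assign a warehouse-wide integer to each
# (source system, production key) pair, so that clashing production keys
# from different OLTP systems map to distinct dimension rows.

next_key = 1
key_map = {}   # (source, production_key) -> surrogate key

def surrogate(source, production_key):
    """Return the surrogate key for a source key, minting one if new."""
    global next_key
    k = (source, production_key)
    if k not in key_map:          # first sighting: mint a new sequence number
        key_map[k] = next_key
        next_key += 1
    return key_map[k]

# cust_id 1 means Jitesh in system 1 but Ram in system 2 (Tables 3.3/3.4)
a = surrogate("db1", 1)   # new entity -> 1
b = surrogate("db2", 1)   # same production key, different entity -> 2
c = surrogate("db1", 1)   # repeat lookup is stable -> 1 again
print(a, b, c)   # 1 2 1
```

The surrogate values carry no meaning of their own; the map is what ties each one back to its production key.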
3.3.6 Snowflake Schema
With a star schema, it can get very difficult to manage dimension tables with an extremely large
number of rows. To manage large dimension tables, it is necessary to break them down into a number of
smaller tables. Snowflake schema is an extension of the star schema: it normalizes the
dimension tables of the star schema. Normalizing the dimension tables helps to reduce the size of each
dimension table and also helps in reducing the disk space it requires. Snowflaking removes
low-cardinality attributes from dimension tables and shifts them into secondary or next-level dimension
tables. Snowflaking comes with its own disadvantages: normalization leads to a high number of complex
joins between the dimensions.
Figure 3.4: Snowflake schema
The snowflake model stores the dimensions in normalized form to reduce redundancies. Such dimension
tables are easy to maintain and save a lot of storage space. However, the snowflake structure results in
slower execution of queries due to the large number of joins. Snowflake schema is not a popular option
with data warehouse experts, but due to bad data warehouse design or an inability to handle large
dimension tables, we may be left with no other option than to convert the star schema into a snowflake
schema.
The snowflake schema has the most complex structure and consists of far more tables than
the star schema representation. It requires multi-table joins to satisfy queries and is often more time
consuming than a star schema. The starflake schema has a slightly more complex structure than the
star. However, while it has redundancy within each table, redundancy between the dimensions is
eliminated.
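The act of snowflaking a dimension can be sketched with a toy example: a low-cardinality attribute is moved out of the dimension table into a second-level table, trading storage for an extra join. The tables and values below are illustrative.

```python
# Snowflaking sketch: move the low-cardinality 'category' attribute out
# of the product dimension into a second-level table.

star_dim_product = [
    {"product_key": 1, "name": "OS book",  "category": "IT"},
    {"product_key": 2, "name": "OB book",  "category": "Management"},
    {"product_key": 3, "name": "ACA book", "category": "IT"},
]

# Snowflake form: each category string is stored once, referenced by key
dim_category = {1: "IT", 2: "Management"}
cat_key = {v: k for k, v in dim_category.items()}
snow_dim_product = [
    {"product_key": p["product_key"], "name": p["name"],
     "category_key": cat_key[p["category"]]}
    for p in star_dim_product
]

# Resolving a product's category now needs the extra lookup (the "join")
name_to_cat = {p["name"]: dim_category[p["category_key"]]
               for p in snow_dim_product}
print(name_to_cat["ACA book"])   # IT
```

With three products this saves almost nothing, but with millions of rows the repeated category strings become the redundancy that snowflaking removes, at the cost of the extra join on every query.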
Check your progress/ Self assessment questions- 3
Q6. Define surrogate keys.
___________________________________________________________________________
__________________________________________________________________________
____________________________________________________________________________
Q7. Define snowflake schema.
___________________________________________________________________________
__________________________________________________________________________
____________________________________________________________________________
3.3.7 Difference between Star schema and Snow-flake schema
Star Schema | Snow-flake Schema
The fact table is at the center and is connected to all dimension tables. | It is an extension of the star schema in which the dimension tables are further connected to one or more dimension tables; the fact table is connected only to the first level of dimension tables.
The dimension tables are completely denormalized. | The dimension tables are partially denormalized.
Performance of SQL queries is good, as fewer joins are involved. | Performance of SQL queries is not as good as that of the star schema, as more joins are involved.
Data redundancy is high. | Data redundancy is low.
Preferred when the dimension tables are relatively small and contain fewer rows. | Preferred when the dimension tables are big and contain a high number of rows.
3.3.8 Fact Constellation
As its name implies, the fact constellation schema is shaped like a collection of stars (i.e., star schemas).
Unlike the star schema, it consists of more than one fact table, and dimension tables are shared among the
multiple fact tables. This schema is mainly used to aggregate fact tables, or where you want to split a fact
table for better understanding.
Figure 3.5: Fact constellation schema
Check your progress/ Self assessment questions- 4
Q8. Which of the following is an example of MDDM?
a. Star schema
b. Snow-flake schema
c. Fact-constellation schema.
d. All the above
Q9. Fact constellation schema comes with
a. No fact table
b. 1 fact table
c. Multiple fact tables
d. No dimension table
3.4 Summary
OLTP systems are based on normalized relational database model or ER model. An ER model is an
abstract way of relating to a database. Multidimensional data model is based on dimension relations.
Single dimension table is associated to each dimension in the data cube. There is a central fact table in
multidimensional data model connected to each dimension table or dimension. Denormalization is the
process of attempting to optimise the query responsiveness of a database by adding some redundant
data back to the normalized database. Star schema consists of a single fact table and all dimension
tables are connected directly to it and no 2 dimension tables are connected to each other directly. It
forms the shape of a star. A surrogate key is capable of uniquely identifying each entity in the
dimension table, irrespective of its original source key. Snowflake schema normalizes the dimension
tables of Star schema removing low cardinality attributes from dimension tables and shifting them in
secondary or next level dimension tables. Fact constellation schema has more than one fact table and
dimension tables are shared between the fact tables.
3.5 Glossary
Dimension table- A dimension table is associated to each dimension in the data cube.
Fact table- It is connected to each dimension table or dimension of the data cube and it represents a
subject.
Star schema- It consists of a single fact table and all dimension tables are connected directly to it.
Snowflake schema- It normalizes the dimension tables of Star schema removing low cardinality
attributes from dimension tables and shifting them in secondary or next level dimension tables.
Fact constellation schema- Extension of star schema having more than one fact table and dimension
tables are shared between the fact tables.
3.6 Answers to check your progress/self assessment questions
1. An ER model is an abstract way of relating to a database. The ER model describes each attribute or
table as an entity, and the relationship between them. The ER-model is based on the concept of
normalized databases.
2. Data cubes.
3. Denormalization is the process of attempting to optimise the query responsiveness of a database by
adding some redundant data back to the normalized database.
4. Star schema consists of a central fact table surrounded by dimension tables. It consists of a single fact
table and all dimension tables are connected directly to it, and no 2 dimension tables are connected to
each other directly. It forms the shape of a star.
5. Additive.
6. A surrogate key is capable of uniquely identifying each entity in the dimension table, irrespective
of its original source key. Surrogate key generates a simple integer value or sequence number for
every new entity.
7. Snowflake schema normalizes the dimension tables of the star schema, reducing the size of the
dimension tables. Snowflaking removes low-cardinality attributes from dimension tables and shifts them
into secondary or next-level dimension tables.
8. d.
9. c.
3.7 References/ Suggested Readings
1. Data Mining: Concepts and Techniques by J. Han and M. Kamber, Publisher: Morgan Kaufmann
Publishers
2. Advanced Data Warehouse Design (From Conventional to Spatial and Temporal Applications) by
Elzbieta Malinowski and Esteban Zimányi, Publisher: Springer
3. Modern Data Warehousing, Mining and Visualization by George M Marakas, Publisher Pearson.
3.8 Model Questions
1. Explain star schema with the help of an example.
2. Differentiate between the star schema and snowflake schema.
3. Define denormalization and how it helps to speed up data retrieval.
4. Define data cube.
5. List various properties of fact table and dimension table.
Lesson- 4 Spatial Data Warehouse
Structure
4.0 Objective
4.1 Introduction
4.2 Spatial Objects
4.3 Spatial Data Types
4.4 Reference Systems
4.5 Topological Relationships
4.6 Conceptual Models for Spatial Data
4.7 Implementation Models for Spatial Data
4.8 Architecture of Spatial Systems
4.9 Spatial Levels
4.10 Spatial Hierarchies
4.11 Spatial Fact Relationships
4.12 Spatial Measures
4.13 Summary
4.14 Glossary
4.15 Answers to check your progress/self assessment questions
4.16 References/ Suggested Readings
4.17 Model Questions
4.0 Objective
After Studying this lesson, students will be able to:
1. Define various spatial objects and data types.
2. Discuss the concept of topological relationships in spatial data.
3. Describe various spatial levels and hierarchies.
4. Explain spatial fact relationships and measures.
5. Explain different types of architectures for spatial systems.
4.1 Introduction
Spatial data warehouse is a combination of the spatial database and data warehouse technologies. Data
warehouses provide OLAP capabilities for analyzing data using different perspectives. Whereas,
spatial databases provide sophisticated management of spatial data, including spatial index structures,
storage management, and dynamic query formulation. Spatial data warehouses let you exploit the
capabilities of both types of systems for improving data analysis, visualization, and manipulation.
4.2 Spatial Objects
A spatial object is used by an application to store the spatial characteristics corresponding to a real-
world entity. Spatial objects consist of both conventional and spatial components. Basic data types
like integer, date and string are used to represent the conventional components of the spatial object.
Conventional components contain the general characteristics of the spatial object; for example, an
employee object is described by components like name, designation, department, date of joining, etc.
The spatial component includes the geometry, which can be of various spatial data types, such as point,
line, or surface.
4.3 Spatial Data Types
Spatial data types are used to represent the spatial extent of real-world objects. The conceptual
spatiotemporal model MADS defines a number of spatial data types, each with an associated icon.
Following are some of the spatial data types defined by MADS:
Figure 4.1 Spatial data types
Reference: " Advanced Data Warehouse Design: From Conventional to Spatial and Temporal
Applications"
Point- It is used to represent a zero-dimensional geometry that denotes a single location in space,
such as a school in a city.
Line- It is used to represent a one-dimensional geometry that denotes a series of connected points;
each segment of the line can be defined by a linear equation. A route from one city to another is an
example of a line.
OrientedLine- It is used to represent a line with the semantics of a start point and an end point, i.e.,
a directed line from start to end. A river can be represented using an OrientedLine.
Surface- It is used to represent a two-dimensional geometry denoting a set of connected points
that lie inside a boundary formed by one or more disjoint closed lines.
SimpleSurface- It is used to represent surfaces without holes, such as a river without any island.
SimpleGeo- It is a generalization of the spatial types Point, Line, and Surface.
SimpleGeo can be instantiated by specifying which of its subtypes characterizes the new element.
Following are some of the spatial data types used to describe spatially homogeneous sets:
PointSet- It is used to represent sets of points, such as houses in a colony.
LineSet- It is used to represent sets of lines, such as a road network.
OrientedLineSet- It is used to represent a set of oriented lines, such as a river and its branches.
SurfaceSet- It is used to represent sets of surfaces with holes.
SimpleSurfaceSet- It is used to represent sets of surfaces without holes.
ComplexGeo- It is used to represent any heterogeneous set of geometries that may include sets of
points, sets of lines, and sets of surfaces, such as a water system consisting of rivers, lakes, and
reservoirs. The subtypes of ComplexGeo are PointSet, LineSet, OrientedLineSet, SurfaceSet, and
SimpleSurfaceSet.
Geo- It is the most generic spatial data type. It is the generalization of the spatial types SimpleGeo and
ComplexGeo. Geo can be used, for example, to represent a region that may be either a Surface or a SurfaceSet.
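To make the type hierarchy concrete, the following is a minimal Python sketch of the Point, Line, and Surface abstractions; the class layout and coordinate encoding are illustrative assumptions for this lesson, not part of the MADS definition.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Coord = Tuple[float, float]

@dataclass
class Point:
    """Zero-dimensional geometry: a single location, e.g. a school in a city."""
    position: Coord

@dataclass
class Line:
    """One-dimensional geometry: a series of connected points, e.g. a route."""
    points: List[Coord]

@dataclass
class Surface:
    """Two-dimensional geometry bounded by a closed line, with optional holes."""
    exterior: List[Coord]                          # closed boundary ring
    holes: List[List[Coord]] = field(default_factory=list)

school = Point((31.3, 75.6))
route = Line([(31.3, 75.6), (31.6, 74.9)])
district = Surface([(0, 0), (4, 0), (4, 3), (0, 3), (0, 0)])
```

Set types such as PointSet or SurfaceSet would then simply be collections of these objects.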
Check your progress/ Self assessment questions- 1
Q1. Spatial data warehouse is a combination of the ___________ database and
________________________ technologies.
Q2. Spatial objects consist of both the ___________ and spatial components.
Q3. ______________ is used to represent surfaces without holes
4.4 Reference Systems
A spatial reference system defines the coordinates that locate a geometry in a plane. It is a function
that associates real locations in space with coordinates defined in a mathematical space. For instance,
projected coordinate systems give Cartesian coordinates that result from mapping a point on the Earth's
surface to a plane. A number of spatial reference systems are available for use in practice.
4.5 Topological Relationships
Relationship between the two spatial values is represented using topological relationships.
Topological relationships are extremely used in practical spatial applications. For example,
topological relationships can be used to find if the two countries share a common border, or to find if
a national highway crosses a state, or to find if a city is located within a state or not . Definitions of
the boundaries, the interior, and the exterior of spatial values are key to the definition of the
topological relationships.
Exterior of a spatial value is composed of all the points of the underlying space that do not belong to
the spatial value.
Interior of a spatial value is composed of all its points that do not belong to the boundary.
Definition of the boundary depends on the spatial data type.
Interior refers to a single point has an empty or no boundary.
The boundary of a line refers to set of all successive points given by its extreme points.
The boundary of a surface is given by the enclosing closed line and the closed lines defining the holes.
The boundary of a ComplexGeo is defined using a recursive function for the spatial union of:
The boundaries of its components that do not intersect other components.
The intersecting boundaries that do not lie in the interior of their union.
Figure 4.2: Topological relationship Icons
Reference: " Advanced Data Warehouse Design: From Conventional to Spatial and Temporal
Applications"
Following is the list of some of the topological relationships given in figure above:
1. meets: It refers to a topological relationship in which two geometries intersect but their interiors do
not. Note that two geometries may intersect in a point and yet not meet, e.g., when two lines cross so
that their interiors intersect.
2. contains/inside: Consider the predicate: X contains Y if and only if Y inside X. These are converse
predicates. A geometry contains another one if the interior of the latter is contained in the interior of
the former and their boundaries do not intersect.
3. equals: Two geometries are considered to be equal only if they share exactly the same set of points.
4. crosses: One geometry crosses another if they intersect and the dimension of this intersection is
less than the greatest dimension of the geometries.
5. disjoint/intersects: These are inverse predicates, i.e., when one applies, the other does not. Two
geometries are disjoint if the interior and the boundary of one object intersect only the exterior of
the other object; otherwise, they intersect.
6. covers/coveredBy: Again consider the predicate: X covers Y if and only if Y coveredBy X. These
are also converse predicates. A geometry covers another one if it includes all points of the
other.
7. disjoint: As stated above, two geometries are disjoint if they neither intersect nor meet.
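A simplified way to see these predicates is to implement them for one-dimensional intervals, where the boundary is the pair of end points and the interior is everything strictly between them. The sketch below is a hedged analogue of the definitions above, not a full implementation for arbitrary geometries:

```python
# An interval is a tuple (a, b) with a < b; its boundary is {a, b} and its
# interior is the open interval (a, b).

def disjoint(x, y):
    """No shared points at all."""
    (a1, b1), (a2, b2) = x, y
    return b1 < a2 or b2 < a1

def intersects(x, y):
    """Inverse of disjoint: when one applies, the other does not."""
    return not disjoint(x, y)

def meets(x, y):
    """Intervals touch in a boundary point but interiors do not intersect."""
    (a1, b1), (a2, b2) = x, y
    return b1 == a2 or b2 == a1

def inside(x, y):
    """x inside y: the interior of x lies within the interior of y."""
    (a1, b1), (a2, b2) = x, y
    return a2 < a1 and b1 < b2

def contains(x, y):
    """Converse of inside: x contains y iff y inside x."""
    return inside(y, x)

def covers(x, y):
    """x covers y: x includes all points of y (boundary contact allowed)."""
    (a1, b1), (a2, b2) = x, y
    return a1 <= a2 and b2 <= b1

def equals(x, y):
    """Exactly the same set of points."""
    return x == y
```

For example, the intervals (1, 2) and (2, 3) meet, while (0, 1) and (2, 3) are disjoint.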
Check your progress/ Self assessment questions- 2
Q4. Spatial reference system is used to represent some co-ordinates of a plane that define the locations
in a given geometry. (TRUE / FALSE)
___________________________________________________________________
Q5. Relationship between the two _____________ values is represented using topological
relationships.
Q6. The topological relationship in which two geometries intersect but their interiors do not, is called
________.
4.6 Conceptual Models for Spatial Data
A number of conceptual models have been proposed in the literature for representing spatial and
spatiotemporal data. These models are extensions of conventional conceptual models, such as the ER
and UML models, to meet the requirements of spatial data. However, these extensions vary
significantly, extending a conceptual model for spatial data is not easy, and even when it succeeds the
cost can be considerable. None of these conceptual models has yet been widely adopted in practice or
by the research community.
4.7 Implementation Models for Spatial Data
Spatial data at an abstract level can be represented using object-based and field-based data models.
Raster and vector data models are used to represent these abstractions of space at the implementation
level. The raster data model is structured as an array of cells representing the value of an attribute for
a real-world location. A cell is addressed or indexed by its position in the array. Usually cells
represent square areas of the grounds. The raster data model can be used to represent spatial objects
like, point for a single cell, line as a sequence of adjoining cells, surface as a collection of contiguous
cells. However, storage of spatial data using raster model is very inefficient for large uniform area.
For a vector data model, objects are created using points and lines as primitives. A point is used to
represent a pair of coordinates, whereas more complex linear and surface objects uses lists, sets, or
arrays, based on the point representation. The vector data representation is inherently more efficient in
its use of computer storage than the raster data representation. However, vector model fails to
represent phenomena for which clear boundaries do not necessarily exist. One such example is
temperature.
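The storage trade-off can be illustrated with a toy example: the same diagonal line stored as two coordinate pairs (vector) and as a grid of cells (raster). The grid size and encoding below are arbitrary assumptions made for illustration.

```python
# Vector representation: the line as an ordered list of coordinate pairs.
vector_line = [(0, 0), (3, 3)]

# Raster representation: the same line as cells set to 1 in a grid of
# square cells; every cell must be stored, even the empty ones.
SIZE = 4
raster = [[0] * SIZE for _ in range(SIZE)]
for step in range(SIZE):          # diagonal cells approximate the line
    raster[step][step] = 1

filled_cells = sum(map(sum, raster))
stored_cells = SIZE * SIZE        # 16 cells stored vs. 2 coordinate pairs
```

Only 4 of the 16 stored cells carry the line, which is why raster storage becomes wasteful for large, mostly uniform areas.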
4.8 Architecture of Spatial Systems
A collection of spatial objects (using the vector representation) can be stored using the following models:
1. Spaghetti model,
2. Network model, and
3. Topological model.
Irrespective of which structure is selected for storing the collection of spatial objects, two different
computer architectures can be used for spatial systems, called dual and integrated. The dual architecture
is based on separate management systems for spatial and non-spatial data, whereas the integrated
architecture is an extension of an existing database management system with spatial data types and
functions.
Geographic information systems (GISs) make use of the dual architecture. A GIS requires
heterogeneous data models to represent spatial and non-spatial data. Spatial data is generally
represented using proprietary data structures, which implies difficulties in modelling, use, and
integration.
In contrast, integrated architectures, or extended DBMSs, provide support for storing, retrieving,
querying, and updating spatial objects while preserving other DBMS functionalities, such as recovery
techniques and query optimization. The integrated architecture lets you define an attribute of a table as
being of a spatial data type. It can speed up spatial queries considerably by using spatial indexes and by
retrieving topological relationships between spatial objects through spatial operators and spatial
functions.
Oracle Spatial and IBM DB2 Spatial Extender are two examples of widely used DBMS extensions that
support the management of spatial data.
4.9 Spatial Levels
A spatial level is a level for which the application needs to store spatial characteristics. It is captured
by its geometry, represented using one of the spatial data types defined earlier in this lesson, such as
Point, Line, OrientedLine, or Surface. A spatial attribute is an attribute that has a spatial data type as
its domain. The MultiDimensional model represents a spatial level by placing the icon of its associated
spatial type next to the level name. Consider the figure below:
Figure 4.3 Spatial levels
Reference: " Advanced Data Warehouse Design: From Conventional to Spatial and Temporal
Applications"
The SurfaceSet icon represents the geometry of State members. A level may be spatial independently
of whether it has spatial attributes. For example, a level such as State may be spatial and also have
spatial attributes such as Capital location.
Check your progress/ Self assessment questions- 3
Q7. The spaghetti model is used to store collections of spatial objects. (TRUE / FALSE)
Q8. ________ and ___________ computer architectures can be used for spatial systems
Q9. Dual architecture is an extension of existing database management systems with spatial data
types and functions. (TRUE / FALSE)
____________________________________________________________________________
4.10 Spatial Hierarchies
Hierarchy Classification
A spatial hierarchy is composed of several related levels, of which at least one is spatial. If two related
levels in a hierarchy are spatial, a pictogram indicating the topological relationship between them
should be placed on the link between the levels. If this symbol is omitted, the coveredBy topological
relationship is assumed by default. Following are the different types of spatial hierarchies.
Simple Spatial Hierarchies
Simple spatial hierarchies are hierarchies in which all component parent-child relationships are
one-to-many, so the relationship between their members can be represented as a tree. Simple spatial
hierarchies can be further categorized as:
Balanced spatial hierarchies have, at the schema level, only one path in which all levels are mandatory.
At the instance level, the members form a tree in which all branches have the same length.
Unbalanced spatial hierarchies have only one path at the schema level but, as implied by the
cardinalities, some lower levels of the hierarchy are not mandatory. At the instance level, the members
form an unbalanced tree, with branches of different lengths.
Generalized spatial hierarchies contain multiple exclusive paths sharing some levels. All these paths
represent one hierarchy and account for the same analysis criterion. At the instance level, each
member of the hierarchy belongs to only one path. The symbol ⊗ is used to indicate that, for every
member, the paths are exclusive.
Non-Strict Spatial Hierarchies
A simple spatial hierarchy represents one-to-many parent-child relationships, i.e., a child member can
be related to at most one parent member, but a parent member can be related to a number of child
members. In practice, however, there may exist many-to-many relationships between parent and child
members. A non-strict spatial hierarchy has at least one many-to-many relationship, whereas a strict
spatial hierarchy has only one-to-many relationships. A graph is used to represent the members of a
non-strict hierarchy.
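One way to check strictness is to record, for each child member, its parent members: the hierarchy is strict exactly when no child has more than one parent. A small sketch, with invented member names, might look like this:

```python
# Child member -> list of parent members.
# Strict: every city belongs to exactly one state.
strict_links = {
    "Jalandhar": ["Punjab"],
    "Amritsar": ["Punjab"],
}

# Non-strict: a mountain road that crosses two counties has two parents,
# i.e., a many-to-many parent-child relationship.
nonstrict_links = {
    "MountainRoad": ["CountyA", "CountyB"],
}

def is_strict(links):
    """A hierarchy is strict if every child has at most one parent."""
    return all(len(parents) <= 1 for parents in links.values())
```

Because a child may have several parents, the members of a non-strict hierarchy form a graph rather than a tree, as noted above.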
Alternative Spatial Hierarchies
Alternative spatial hierarchies consist of several nonexclusive simple spatial hierarchies sharing some
levels, where all these hierarchies account for the same analysis criterion. A graph is used to represent
such hierarchies at the instance level, because a child member can be associated with more than one
parent member belonging to different levels. They are called alternative spatial hierarchies because it
is not semantically correct to traverse different component hierarchies simultaneously; one of the
alternative aggregation paths must be chosen for analysis.
Parallel Spatial Hierarchies
Parallel spatial hierarchies arise when a dimension is associated with several spatial hierarchies
accounting for different analysis criteria. Such hierarchies can be independent or dependent. The
component hierarchies in parallel independent spatial hierarchies do not share levels, i.e., they form
non-overlapping sets of hierarchies, whereas the component hierarchies in parallel dependent spatial
hierarchies share some levels.
4.11 Spatial Fact Relationships
A fact relationship relates the leaf members from all of its participating dimensions. For non-spatial
dimensions, this relationship corresponds to a relational join operator. For spatial dimensions, a spatial
join based on a topological relationship is needed to represent the relationship. Spatial data warehouses
thus include a feature called n-ary topological relationships: topological relationships in spatial
databases are generally binary, whereas topological relationships in spatial data warehouses may relate
more than two spatial dimensions.
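A spatial join of this kind can be sketched as a nested loop that pairs leaf members whenever a topological predicate holds between their geometries. The store and state data below are invented, and axis-aligned rectangles stand in for real geometries:

```python
# Hypothetical leaf members: store locations as points, state extents as
# axis-aligned rectangles (xmin, ymin, xmax, ymax).
stores = {"S1": (1, 1), "S2": (5, 5)}
states = {"Punjab": (0, 0, 2, 2), "Haryana": (4, 4, 6, 6)}

def point_inside(point, rect):
    """Simplified 'inside' topological predicate for a point and a rectangle."""
    x, y = point
    xmin, ymin, xmax, ymax = rect
    return xmin <= x <= xmax and ymin <= y <= ymax

# Nested-loop spatial join: pair each store with every state it lies inside.
spatial_join = [(store, state)
                for store, pt in stores.items()
                for state, rect in states.items()
                if point_inside(pt, rect)]
```

A real spatial DBMS would evaluate such joins with spatial indexes rather than a nested loop, but the predicate-based pairing is the same idea.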
4.12 Spatial Measures
Spatial measures can be represented by a geometry. Current OLAP systems require aggregation
functions for numeric measures during roll-up and drill-down operations; by default, the sum is
applied. Distributive functions reuse aggregates computed for a lower level of a hierarchy in order to
calculate aggregates for a higher level. Algebraic functions require additional manipulation to reuse
values. Holistic functions, such as the median and the rank, require complete recalculation using data
from the leaf level.
Spatial measures also require the specification of a spatial aggregation function. Several aggregation
functions for spatial data have been defined, such as:
1. Spatial distributive functions, including the convex hull, spatial union, and spatial intersection.
2. Spatial algebraic functions, including the center of n points and the center of gravity.
3. Spatial holistic functions, including the equipartition and the nearest-neighbor index.
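As an illustration, the center of n points (a spatial algebraic function) and a bounding-box aggregate (distributive in style, since partial extents can be combined into larger ones) might be sketched as follows; the function names are ours, not a standard API:

```python
def center_of_points(points):
    """Spatial algebraic aggregation: the center of n points."""
    n = len(points)
    return (sum(x for x, _ in points) / n,
            sum(y for _, y in points) / n)

def bounding_box(points):
    """Distributive-style aggregate: per-partition extents can be merged
    into a higher-level extent without revisiting the leaf points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

corners = [(0, 0), (2, 0), (2, 2), (0, 2)]
center = center_of_points(corners)      # center of the square
box = bounding_box(corners)             # overall extent
```

Holistic functions such as the equipartition have no such shortcut and must recompute from the leaf geometries.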
Check your progress/ Self assessment questions- 4
Q10.Non-strict spatial hierarchy has at least one many-to-many relationship. (TRUE / FALSE)
____________________________________________________________________________
Q11. _________ is used to represent the alternate spatial hierarchies at instance level.
Q12. Topological relationships in spatial data warehouses are binary relationships. (TRUE / FALSE)
____________________________________________________________________________
4.13 Summary
Spatial objects consist of both the conventional and spatial components. Conceptual spatiotemporal
model MADS defined number of spatial data types that were discussed in this lesson. Spatial
reference system is used to represent some co-ordinates of a plane that define the locations in a given
geometry. Relationship between the two spatial values is represented using topological relationships.
Conceptualmodels for spatial data are extensions of conceptual models to meet the requirements of
spatial data. Spatial objects can be stored using the spaghetti model, network model, and topological
models. Two different computer architectures are used for spatial systems, called dual and integrated.
A separate management system is created for dual architecture to manage spatial and non-spatialdata,
whereas, integrated architecture is an extension of existing database management systems with spatial
data types and functions. Multi Dimensional model represents spatial level using the icon of its
associated spatial type with the level name. Several related levels are used to represent spatial
hierarchies. Spatial measures can be represented by a geometry. Current OLAP systems require
aggregation functions for numeric attributes during the roll-up and drill-down operations. Spatial
measures also require the specification of a spatial aggregation function.
4.14 Glossary
Spatial data warehouse- It refers to the combination of both the data warehouse and spatial
database technologies.
Spatial object- A spatial object stores the spatial characteristics corresponding to a real-world entity
and consists of both conventional and spatial components.
Spatial reference system- It is used to represent some co-ordinates of a plane that define the locations
in a given geometry.
Topological relationship- The relationship between two spatial values, such as meets, contains, or
disjoint.
Spatial measure- It can be represented by a geometry. Spatial measures require specification of a
spatial aggregation function during the roll-up and drill-down operations.
4.15 Answers to check your progress/self assessment questions
1. Spatial, data warehouse.
2. Conventional.
3. SimpleSurface.
4. TRUE.
5. Spatial.
6. Meets.
7. TRUE.
8. Dual, integrated.
9. FALSE.
10. TRUE.
11. Graph.
12. FALSE.
4.16 References/ Suggested Readings
"1. Data Mining: Concepts and Techniques by J. Han and M. Kamber Publisher
Morgan Kaufmann Publishers
2. Advanced Data warehouse Design (from conventional to spatial and temporal applications) by
Elzbieta Malinowski and Esteban Zimányi Publisher Springer
3. Modern Data Warehousing, Mining and Visualization by George M Marakas,
Publisher Pearson."
4.17 Model Questions
1. List different spatial data types used in spatial databases.
2. List some of the topological relationships used in spatial databases.
3. Define interior, exterior and boundary in topological relationships.
4. What is the difference between dual and integrated architectures of spatial systems?
5. Explain different types of spatial hierarchies.
6. What do you mean by spatial data warehouse?
Lesson- 5 Temporal Data Warehouses- 1
Structure
5.0 Objective
5.1 Introduction
5.2 Temporal Databases: General Concepts
5.2.1 Temporality Types
5.2.2 Temporal Data Types
5.2.3 Synchronization Relationships
5.2.4 Conceptual and Logical Models for Temporal Databases
5.3 Temporal Extension of the MultiDimensional Model
5.3.1 Support for temporality types
5.3.2 Overview of the Model
5.4 Summary
5.5 Glossary
5.6 Answers to check your progress/self assessment questions
5.7 References/ Suggested Readings
5.8 Model Questions
5.0 Objective
After studying this lesson, students will be able to:
1. Describe the need of temporal database.
2. Discuss the basic concepts of temporal databases.
3. Explain the challenges associated with extension of a MultiDim model to temporal model.
4. List the support provided by source systems for various temporality types.
5.1 Introduction
It is very important to be able to represent information that varies with time. Conventional databases do
not keep historical data; temporal databases are a solution for storing information that varies with time.
The MultiDim model also fails to provide support for data that varies with time. This lesson focuses on
the basic concepts of temporal databases and on how the MultiDim model can be extended to support
temporal data.
5.2 Temporal Databases: General Concepts
Temporal databases are used to represent information that varies over time. Conventional databases
generally store only current data, whereas temporal databases store historical and future data along
with the times at which changes have happened or are expected to happen. A discrete model is used to
represent time in a temporal database. The timeline is represented as a sequence of consecutive time
intervals of the same duration, called chronons. A group of consecutive chronons is called a granule;
granules may represent time in units such as seconds, minutes, or hours.
5.2.1 Temporality Types
Following are some of the temporality types:
1. Valid time (VT): It specifies the time period during which a fact is true in the modelled reality. For
example, it can be used to determine how many tickets were booked by a customer in a given period of
time; the valid time must be supplied by the user or the application.
2. Transaction time (TT): It specifies the time period during which a fact is current in the database; it
starts when the fact is inserted or updated, and ends when the fact is deleted or updated.
3. Bitemporal time (BT): It is the combination of VT and TT, and specifies the time period during
which the fact is both true in reality and current in the database.
4. Lifespan (LS): It specifies the time period during which an object exists. For example, a lifespan can
specify the period during which an individual was a member of an association.
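A record carrying both valid time and transaction time can be sketched as below; the field names, the ISO date strings, and the 'UC' (until changed) marker are illustrative conventions of this sketch, not part of the model:

```python
from dataclasses import dataclass

@dataclass
class BitemporalFact:
    """A fact with valid time (when it is true in reality) and
    transaction time (when it is current in the database)."""
    value: str
    vt_start: str   # valid-time period, as ISO date strings for simplicity
    vt_end: str
    tt_start: str   # transaction-time period
    tt_end: str     # 'UC' means the fact is still current in the database

# The membership was true in reality during 2020, but was only recorded
# in the database on 2020-01-05 and has not been changed since.
membership = BitemporalFact("gold member",
                            "2020-01-01", "2021-01-01",
                            "2020-01-05", "UC")
```

The gap between `vt_start` and `tt_start` shows why the two temporality types must be kept separately: reality and the database change at different moments.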
5.2.2 Temporal Data Types
Temporal data types are used to specify the temporal extent of real-world phenomena. The following
temporal data types are defined in the spatiotemporal conceptual model MADS:
Figure 5.1 Temporal data types
Reference: " Advanced Data Warehouse Design: From Conventional to Spatial and Temporal
Applications"
1. Instant: It does not represent a time period; instead, it represents a single point in time at some
specified granularity.
2. Interval: It may be thought of as a time period. It refers to the set of all successive instants between
two instants.
3. SimpleTime: It is the generalization of Instant and Interval. When a SimpleTime value is created, it
must be specified which of the two subtypes characterizes it.
4. InstantSet: It is used to represent a set of single points in time. For example, an InstantSet can
represent the instants at which goals were scored during a football match.
5. IntervalSet: Also known as a temporal element, it is used to represent a set of simple intervals, which
can express a discontinuous duration, such as the duration of the matches played in a single tournament.
6. ComplexTime: It is used to represent a heterogeneous set of temporal values, i.e., a set that may
contain both Instant and Interval values.
7. Time: It is the most generic temporal data type, the generalization of SimpleTime and ComplexTime.
It can be used, for example, to represent the lifespan of a tournament.
5.2.3 Synchronization Relationships
The relationship between two temporal extents can be represented using synchronization relationships.
Synchronization relationships help to identify whether two events occurred simultaneously or one
after the other. Synchronization relationships for temporal data correspond to the topological
relationships for spatial data, and are likewise defined on the basis of the boundary, interior, and
exterior.
The exterior of a temporal value consists of all instants that do not belong to the temporal value.
The interior of a temporal value consists of all its instants that do not belong to the boundary.
The boundary differs between temporal data types. Because an Instant is a single point in time, it has
no boundary. Since an Interval is the set of all successive instants between two instants, its first and
last instants form its boundary. The boundary of a ComplexTime value is defined as the union of the
boundaries of its components that do not intersect with other components.
Figure 5.2 Icons for Synchronization relationships
Reference: " Advanced Data Warehouse Design: From Conventional to Spatial and Temporal
Applications"
Following are some of the commonly used synchronization relationships
1. meets: Two temporal values meet if they intersect in an instant but their interiors do not.
2. overlaps: Two temporal values overlap if their interiors intersect and their intersection is not equal
to either of them.
3. contains/inside: Consider the predicate: X contains Y if and only if Y inside X. These are converse
predicates. A temporal value contains another one if the interior of the former contains all instants of
the latter.
4. covers/coveredBy: Again consider the predicate: X covers Y if and only if Y coveredBy X. These
are also converse predicates. A temporal value covers another one if the former includes all instants of
the latter.
5. disjoint/intersects: These are inverse predicates, i.e., when one applies, the other does not. Two
temporal values are disjoint if they do not share any instant.
6. equals: Two temporal values are equal if and only if every instant of one belongs to the other, and
conversely.
7. starts/finishes: A temporal value starts/finishes another if the first/last instants of the two temporal
values are equal, respectively.
8. precedes/succeeds: A temporal value precedes/succeeds another if the last/first instant of the former
is before/after the first/last instant of the latter, respectively.
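Treating temporal values as closed intervals of instants, several of these predicates can be sketched directly; the encoding of an interval as a (start, end) tuple is an assumption of this sketch:

```python
def precedes(x, y):
    """The last instant of x is before the first instant of y."""
    return x[1] < y[0]

def succeeds(x, y):
    """Converse of precedes."""
    return precedes(y, x)

def meets(x, y):
    """The intervals touch in a single instant; interiors do not intersect."""
    return x[1] == y[0] or y[1] == x[0]

def overlaps(x, y):
    """Interiors intersect and the intersection equals neither interval."""
    (a1, b1), (a2, b2) = x, y
    inter = (max(a1, a2), min(b1, b2))
    return inter[0] < inter[1] and inter != x and inter != y

def starts(x, y):
    """First instants of the two intervals are equal."""
    return x[0] == y[0]

def finishes(x, y):
    """Last instants of the two intervals are equal."""
    return x[1] == y[1]
```

For instance, (1, 4) overlaps (3, 6), but (1, 4) does not overlap (2, 3), because their intersection equals the second interval (a containment, not an overlap).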
Check your progress/ Self assessment questions- 1
Q1. Temporal databases are used to represent information that _______ over time.
Q2. List various temporality types.
____________________________________________________________________________
____________________________________________________________________________
Q3. Define Instant.
____________________________________________________________________________
____________________________________________________________________________
Q4. Define SimpleTime.
____________________________________________________________________________
____________________________________________________________________________
Q5. Synchronization relationships help to identify whether two events have occurred simultaneously.
(TRUE / FALSE)
____________________________________________________________________________
5.2.4 Conceptual and Logical Models for Temporal Databases
A number of conceptual models, such as the ER and UML models, have been extended to support
time-varying data. Extension means introducing new constructs or modifying existing constructs of a
conceptual model. Easy as it may sound, this is not easy to implement and may not be cost-effective. A
temporal database design must eventually be translated into a logical schema for implementation in a
DBMS, and to date SQL provides little support for time-varying data.
The lack of support for time-varying data in conceptual models eventually led to research in the field
of temporal databases. Effective mapping of temporal conceptual models into logical models, such as
the relational model, still leaves a lot to be desired. Another approach to logical-level design for
temporal databases is to use temporal normal forms, which again is difficult to achieve; as a database
student, you already know that normalizing even a conventional database is itself a difficult task.
The relational representation of temporal data leads to a large number of tables, which causes
performance problems due to the multiple join operations needed for retrieving the information. The
model also suffers from the many integrity constraints that encode the underlying semantics of
time-varying data. The object-relational model can group related temporal data together into a single
table, providing a partial solution to the first problem; still, integrity constraints must be added to the
object-relational schema, and the object-relational model also suffers from performance issues when
managing time-varying data.
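The contrast between the two representations can be sketched as follows: one row per change in the relational style versus a single record grouping the whole salary history in the object-relational style. The employee data and the helper function are invented for illustration:

```python
# Relational style: one row per change of the time-varying attribute,
# (employee id, value, valid-time start, valid-time end). Reassembling a
# history later requires joins across such tables.
salary_rows = [
    ("E1", 30000, "2019-01-01", "2020-01-01"),
    ("E1", 35000, "2020-01-01", "2021-01-01"),
]

# Object-relational style: related temporal values grouped into a single
# record, avoiding some joins at query time.
employee = {
    "id": "E1",
    "salary_history": [(30000, "2019-01-01", "2020-01-01"),
                       (35000, "2020-01-01", "2021-01-01")],
}

def salary_at(emp, date):
    """Look up the salary valid at a given date (ISO strings compare correctly)."""
    for value, start, end in emp["salary_history"]:
        if start <= date < end:
            return value
    return None
```

Either way, integrity constraints (e.g., that the valid-time periods of one employee do not overlap) must still be enforced separately.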
5.3 Temporal Extension of the MultiDimensional Model
In this section, you will learn the fundamental concepts related to the temporal extension of Multi
Dimensional Model.
5.3.1 Support for temporality types
The MultiDimensional model provides support for the following temporality types discussed earlier in
this lesson:
1. Valid time (VT),
2. Transaction time (TT),
3. Lifespan (LS).
However, these temporality types must exist in the source systems; they cannot be introduced by users
or generated by the DBMS. Following is the temporal support provided by different types of source
systems:
1. Snapshot: Data is obtained by dumping the entire source system, and changes are found by
comparing the current data with previous snapshots.
Support is provided for the (VT) and (LS) temporality types.
2. Queryable: Data is extracted using a query interface provided by the source system, which gives
direct access to the source data. Changes are found by periodic polling of the data.
Support is provided for the (VT) and (LS) temporality types.
3. Logged: Log files are used to record every data modification. Periodic polling of the logs is done to
find any data changes. Transaction time can be retrieved from the log files, and valid time and/or
lifespan may be included in the system.
Support is provided for the TT, (VT), and (LS) temporality types.
4. Callback and internal actions: Triggers or a programming environment provided by the source
system detect changes and notify them to the user. Changes in the data, along with the time of change,
are detected without delay.
Support is provided for the TT, (VT), and (LS) temporality types.
5. Replicated: Changes are detected by analyzing the messages sent by the replication system. This
analysis can happen periodically, or even manually.
Support is provided for the (TT), (VT), and (LS) temporality types.
6. Bitemporal: The source systems are themselves temporal databases that include valid time and/or
lifespan, as well as transaction time.
Support is provided for the TT, VT, and LS temporality types.
Some of the temporality types above are enclosed within parentheses, which indicates that they may or
may not exist in the source system. Support for temporality types is useful for the following reasons:
1. Temporality types help in developing procedures for correct measure aggregation during roll-up
operations. Roll-up is one of the basic operations that can be performed on OLAP data.
2. Transaction time is useful for traceability applications, such as fraud detection.
In addition to the temporality types discussed above, loading time (LT) was proposed by the authors of
"Advanced Data Warehouse Design: From Conventional to Spatial and Temporal Applications". LT
is used to specify the time since the data has been current in a data warehouse. It is not necessarily the
same as the transaction time, since there may be a delay in integrating a change into a temporal data
warehouse. Loading time helps to identify the time since a data item has been available in a data
warehouse for analysis purposes.
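The change detection used by snapshot source systems (item 1 above) can be sketched as a diff between two dumps keyed by primary key. A minimal Python illustration, with an invented product table:

```python
# Compare two snapshots of a source table (keyed by primary key) to
# detect inserts, deletes, and updates -- the basic change-detection
# step for snapshot source systems.

def diff_snapshots(previous, current):
    """Return (inserted, deleted, updated) key sets between two snapshots.

    Each snapshot is a dict mapping a primary key to a row.
    """
    prev_keys, curr_keys = set(previous), set(current)
    inserted = curr_keys - prev_keys
    deleted = prev_keys - curr_keys
    updated = {k for k in prev_keys & curr_keys if previous[k] != current[k]}
    return inserted, deleted, updated

# Hypothetical product table dumped yesterday and today
yesterday = {1: ("Milk", 2.50), 2: ("Bread", 1.20)}
today     = {1: ("Milk", 2.75), 3: ("Eggs", 3.10)}

ins, dele, upd = diff_snapshots(yesterday, today)
```

Note that this approach only reveals *that* a row changed between two dumps, not *when*; this is why snapshot sources can support valid time or lifespan only if the users supply them.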
Check your progress/ Self assessment questions- 2
Q6. It is not possible to extend a conceptual model to support time-varying data in databases.
(TRUE / FALSE)
____________________________________________________________________________
Q7. What do you understand by Queryable support?
____________________________________________________________________________
____________________________________________________________________________
Q8. Loading time can be different from the transaction time. (TRUE / FALSE)
____________________________________________________________________________
5.3.2 Overview of the Model
Not all data related to an application needs to be stored over time. Symbols corresponding to the
temporality types are maintained in the schema to indicate which temporal data might be needed over
time.
Figure 5.3 Conceptual schema (Temporal Data Warehouse)
Reference: " Advanced Data Warehouse Design: From Conventional to Spatial and Temporal
Applications"
Changes in the values of measures for data related to products and stores are important for analysis
purposes, and hence temporality types are included in the schema for them. Depending on the type of
requirement, the appropriate temporality type is mentioned.
The Multi Dimensional model allows and maintains both temporal and non-temporal attributes, levels,
parent-child relationships, hierarchies, and dimensions. A temporal level is a level for which the
application needs to store the time frame associated with its members, along with the times at which
changes took place. The schema in the figure above includes four temporal levels. Non-temporal levels
are called conventional levels; in the figure above, Client is an example of a conventional level. Valid
time support for the Size and Distributor attributes in the Product level indicates that the history of
changes in these two attributes will be kept.
A temporal parent-child relationship is used to keep track of the time frame associated with the links.
The LS type in the relationship linking Product and Category is used to store the evolution in time of
the assignments of products to categories. The cardinality for temporal support in parent-child
relationships can be interpreted as follows:
1. Instant cardinality: It is valid at every time instant. The symbol for the temporality type (for
example, LS) is used to represent the instant cardinality.
2. Lifespan cardinality: It is valid over the entire member's lifespan. The symbol of the LS temporality
type surrounded by an ellipse is used to represent the lifespan cardinality.
In the figure above, the instant cardinality between the Store and Sales District levels is one-to-many,
while the lifespan cardinality is many-to-many. This specifies that at any time instant a store can
belong to only one sales district, but over its lifespan it may belong to many sales districts. Take
another example, where both the instant and lifespan cardinalities between Product and Category are
one-to-many. This specifies that products belong to a single category over their entire lifespan, and
hence at every instant as well.
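The distinction between the instant and lifespan cardinalities of the Store / Sales District example can be sketched in Python; the district names and the month-based lifespans are invented for illustration:

```python
# Assignments of one store to sales districts, each with a lifespan
# [start, end) expressed as month numbers. Instant cardinality is
# one-to-many: at any instant the store belongs to at most one district.
# Lifespan cardinality is many-to-many: over its whole lifespan the
# store may belong to several districts.
assignments = [
    ("DistrictA", 0, 6),   # store belonged to DistrictA for months 0-5
    ("DistrictB", 6, 12),  # then to DistrictB for months 6-11
]

def districts_at(instant, assignments):
    """Districts the store belongs to at a given instant."""
    return [d for d, start, end in assignments if start <= instant < end]

def districts_over_lifespan(assignments):
    """All districts the store ever belonged to."""
    return {d for d, _, _ in assignments}
```

At every single instant the store is in one district, yet its lifespan covers two, which is exactly what the ellipse-enclosed LS symbol expresses in the schema.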
A temporal hierarchy includes at least one temporal level, and a temporal dimension has at least one
temporal hierarchy. Non-temporal dimensions and hierarchies are called conventional dimensions and
hierarchies. A synchronization relationship relates two temporal levels in a hierarchy. Product and
Category is an example of an overlaps synchronization relationship, used to indicate that the lifespan
of each product overlaps the lifespan of its corresponding category. In other words, each valid product
belongs to a valid category.
A temporal join between two or more temporal levels is used to represent a temporal fact relationship.
In the figure above, the temporal fact relationship Sales relates two temporal levels: Product and
Store. The overlaps synchronization icon in the relationship indicates that users focus their analysis on
products whose lifespan overlaps the lifespan of their related store. For instance, a user may wish to
analyze whether the exclusion of some products from stores will affect sales.
A temporal multidimensional model must provide temporal support for the different elements of the
model, such as levels, hierarchies, and measures. Measures and attributes in fact relationships are
considered to be similar, and hence support for measures must be provided in the same way as support
for attributes. Temporality types of the Multi Dimensional model may apply to levels, attributes,
measures, and parent-child relationships. In the conceptual model given in the figure above, lifespan
support is provided for levels and parent-child relationships, while valid time support is provided for
attributes and measures.
5.4 Summary
Temporal databases are used to represent information that varies over time. Conventional databases
generally store current data, whereas temporal databases store the historical and future data along with
the times at which the changes have happened and are expected to happen. Discrete model is used to
represent the time in a temporal database. The relationship between two temporal extents can be
represented using synchronization relationships, which help to identify whether two events have
occurred simultaneously. Common synchronization relationships are meets, overlaps, contains/inside,
covers/coveredBy, disjoint/intersects, equals, starts/finishes, and precedes/succeeds. Some of the
conceptual models for which this extension can be provided are the ER model and the UML model.
Extension means introducing a new construct, or changing or modifying an existing construct of a
conceptual model. It is possible to extend the MultiDim model to support temporal data.
5.5 Glossary
Temporal databases- Databases that are used to represent information that varies over time.
Instant-It represents a single point of time based on some specified granularity.
Interval- It refers to the set of all successive instants between two instants.
SimpleTime- It refers to the generalization of both the Instant and the Interval.
InstantSet- It is used to represent a set of single points of time (instants).
IntervalSet- It is used to represent a set of intervals, capable of representing discontinuous durations.
ComplexTime- It is used to represent a heterogeneous set of temporal values.
Synchronization relationships- It is used to identify if two events have occurred simultaneously or one
after the other.
5.6 Answers to check your progress/self assessment questions
1. Varies.
2. Following are the temporality types:
Valid time (VT)
Transaction time (TT)
Bitemporaltime (BT)
Lifespan (LS)
3. Instantrepresents a single point of time based on some specified granularity.
4. SimpleTime refers to the generalization of both the Instant and the Interval. It must be specified
whether the value is an Instant or an Interval every time a SimpleTime value is created.
5. TRUE.
6. FALSE.
7. It provides a query interface to extract data from the source system, i.e. it provide direct access to
source data. Support is provided for (VT), (LS)temporality types.
8. TRUE.
5.7 References/ Suggested Readings
"1. Data Mining: Concepts and Techniques by J. Han and M. Kamber Publisher
Morgan Kaufmann Publishers
2. Advanced Data warehouse Design (from conventional to spatial and temporal applications) by
Elzbieta Malinowski and Esteban Zimányi Publisher Springer
3. Modern Data Warehousing, Mining and Visualization by George M Marakas,
Publisher Pearson."
5.8 Model Questions
1. Explain various temporality types used in temporal databases.
2. Explain various temporal data types.
3. What do you mean by synchronization relationships? Explain with the help of an example.
4. Explain the different types of temporal support provided by different types of source systems.
5. What do you mean by instant cardinality?
6. What do you mean by load time?
Lesson- 6 Temporal Data Warehouses- 2
Structure
6.0 Objective
6.1 Introduction
6.2 Temporal Support for Levels
6.3 Temporal Hierarchies
6.3.1 Non-temporal Relationships between Temporal Levels
6.3.2 Temporal Relationships between Non-temporal Levels
6.3.3 Temporal Relationships between Temporal Levels
6.3.4 Instant and Lifespan Cardinalities
6.4 Temporal Fact Relationships
6.5 Temporal Measures
6.5.1 Temporal Support for Measures
6.6 Temporal Granularity
6.7 Logical Representation of Temporal Data Warehouses
6.7.1 Temporality Types
6.7.2 Levels with Temporal Support
6.7.3 Parent-Child Relationships
6.7.4 Fact Relationships and Temporal Measures
6.8 Summary
6.9 Glossary
6.10 Answers to check your progress/self assessment questions
6.11 References/ Suggested Readings
6.12 Model Questions
6.0 Objective
After studying this lesson, students will be able to:
1. Describe the concept of temporal support in temporal data warehouse.
2. Discuss the representation of hierarchies in Multi Dimensional model.
3. Explain the representation of temporal facts and temporal measures.
4. Define the notion of temporal granularity.
5. List various rules for logical representation of Temporal Data Warehouse.
6.1 Introduction
In the last lesson you learned the basic concepts related to the temporal databases and how the Multi
Dimensional models can be extended to provide temporal support. In this lesson you will study
various rules that should be followed when creating a conceptual temporal data warehouse. Also, the
lesson discusses in detail the technicalities associated with the mapping of conventional model to a
temporal model.
6.2 Temporal Support for Levels
Two types of changes can happen in a level:
1. It can either occur at the member level, i.e. inserting or deleting an entire row.
2. Or, it can occur at the level of attribute values, i.e. changing the value of an attribute.
A temporal data warehouse must represent these changes for analysis purposes. For example, you may
want to know the effect of a change in the MRP of a product on the sales of that product. It is possible
to associate time frames with the members of a level only if it provides temporal support.
Representing the temporal support for a level is easy, and it can be represented using a symbol for the
temporality type next to the level name. Lifespan support is used to specify the time of existence of
the members in the modelled reality. Transaction time and loading time are used to specify the time
since the members are current in a source system and in a temporal data warehouse, respectively. It is
possible that the transaction time is not the same as loading time. It can happen due to delay in
recording the changes in a temporal data warehouse. Temporal support for attributes is used to specify
the changes in their values and the times when these changes occurred. Temporal support for
attributes can be represented by including the symbol for the corresponding temporality type next to
the attribute name. Transaction time, valid time, loading time, or combination of these can be used as
temporal support for attributes. Some of the classical temporal models impose certain constraints on
temporal attributes and the lifespan of their corresponding entity types.
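Valid time support for an attribute, as described above, amounts to keeping a history of values with their periods of validity instead of a single current value. A minimal Python sketch, with invented product data (years stand in for the granularity):

```python
# Valid-time support for an attribute: the Size attribute of a product
# keeps its full history as (value, start, end) periods, where end=None
# means "valid until now".
size_history = [
    ("500 g", 2018, 2021),
    ("450 g", 2021, None),   # package size changed in 2021
]

def size_at(year, history):
    """Value of the Size attribute valid in the given year, or None."""
    for value, start, end in history:
        if start <= year and (end is None or year < end):
            return value
    return None
```

A conventional (non-temporal) attribute would store only the latest value, losing the ability to ask what the size was in any earlier period.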
6.3 Temporal Hierarchies
The Multi Dimensional model is capable of representing hierarchies that contain several related
levels. Given two related levels in a hierarchy, following three situations can be encountered:
6.3.1 Non-temporal Relationships between Temporal Levels
It is possible to associate temporal levels with non-temporal relationships. Consider the following
example:
Figure 6.1 Temporal levels with non-temporal relationships
Reference: " Advanced Data Warehouse Design: From Conventional to Spatial and Temporal
Applications"
The relationship stores only the current links between products and categories. It means that a product
can only be related to a category if the two are currently valid. Also, the lifespans of a child member
and its associated parent member must overlap, which can be specified using the icon of the
synchronization relationship with the link.
6.3.2 Temporal Relationships between Non-temporal Levels
Temporal relationships also allow you to keep track of the evolution in time of links between parent
and child members. This type of temporal relationship can be represented by inserting the temporality
symbol on the link between the hierarchy levels as shown in the figure below:
Figure 6.2 Temporal Relationships between Non-temporal Levels
Reference: " Advanced Data Warehouse Design: From Conventional to Spatial and Temporal
Applications"
The Multi Dimensional model can be used to assign transaction time, lifespan, loading time, or a
combination of these for representing temporal relationships between levels. Deleting a member can
sometimes result in dangling references. To overcome this problem, all links of the member to the
related levels must also be deleted. In the example in the figure above, deleting a section requires
deleting the history of assignments of employees to that section. Temporal relationships between
non-temporal levels thus keep track of only the history of links between current members.
6.3.3 Temporal Relationships between Temporal Levels
Temporal relationships between temporal levels overcome the problems observed in both the
scenarios discussed above. They result in better analysis scenarios and help to avoid partial loss of
history. For instance, consider the following example:
example:
Figure 6.3 Temporal Relationships between Temporal Levels
Reference: " Advanced Data Warehouse Design: From Conventional to Spatial and Temporal
Applications"
Suppose that the company wants to change the sales districts for a better organizational structure. It is
important to store the lifespans of districts in order to analyze how the changes in the organizational
structure affected the sales. It is also vital that the lifespans of the stores are stored, in order to analyze
the impact of opening a new store or closing an existing store. Also, it will be possible to keep track of
the evolution in time of the assignment of stores to sales districts. This means that the store and the
sales district exist throughout the lifespan of the relationship linking them.
6.3.4 Instant and Lifespan Cardinalities
In a conventional, or non-temporal, model, cardinality is used to define the number of members in a
level related to the number of members in another level. In a temporal model, cardinality may be
defined in terms of an instant (instant cardinality) or a lifespan (lifespan cardinality).
Figure 6.4 Instant and lifespan cardinalities.
Reference: " Advanced Data Warehouse Design: From Conventional to Spatial and Temporal
Applications"
Generally it is assumed that the instant cardinality is equal to the lifespan cardinality. In case the two
are different, the lifespan cardinality is represented using an additional line with the LS symbol
surrounded by an ellipse.
For the example in the figure above, the instant and lifespan cardinalities are the same for the Work
hierarchy, but different for the Affiliation hierarchy. Both cardinalities for the Work hierarchy are
many-to-many, which means that an employee can work in different sections at any given instant as
well as during his or her lifespan. In contrast, the instant cardinality for the Affiliation hierarchy is
one-to-many and the lifespan cardinality is many-to-many: an employee can be affiliated to only one
section at any given instant, but may be affiliated to multiple sections during his or her lifespan.
Check your progress/ Self assessment questions- 1
Q1. _________ relationships between temporal levels result in better analysis scenarios.
Q2. It is possible to associate times frames with the members of a level. (TRUE / FALSE)
____________________________________________________________________________
____________________________________________________________________________
Q3. In a conventional model, ____________ is used to define the number of members in a level related
to the number of members in another level.
6.4 Temporal Fact Relationships
A fact relationship instance relates the leaf members from all its participating dimensions. If some of
these members are temporal, they have an associated lifespan. Covering the valid time of measures by
the intersection of the lifespans of the related temporal members helps in ensuring correct aggregation.
This type of constraint ensures that the lifespan of an instance of a parent-child relationship is covered
by the intersection of the lifespans of the participating objects, and it is similar to the constraint
imposed on temporal relationships between temporal levels.
For instance, consider the following example:
Figure 6.5 Schema for Insurance Company
Reference: " Advanced Data Warehouse Design: From Conventional to Spatial and Temporal
Applications"
The schema above is useful for an insurance company that wants to analyze the amount of
compensation paid against different types of risks covered. The Amount measure in the fact
relationship determines the indemnity paid. The constraint for the schema above indicates that, for
each instance of the fact relationship, the valid time of the Amount measure is covered by the
lifespans of the related members: Insurance Policy and Repair Work. A temporal join based on a
synchronization relationship is needed when two or more temporal levels participate in a fact
relationship. For the schema above, the synchronization relationship in the fact relationship states that
the lifespans of the three members (Insurance Policy, Repair Work, and Event) must have a nonempty
intersection in order to relate them.
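The nonempty-intersection constraint on the three lifespans can be sketched as a simple interval intersection; the years used below are invented for illustration:

```python
# Synchronization constraint on a fact relationship: the lifespans of
# all participating members must have a nonempty intersection before
# they can be related. Lifespans are [start, end) intervals.

def lifespan_intersection(lifespans):
    """Intersection of a list of [start, end) intervals, or None if empty."""
    start = max(s for s, _ in lifespans)
    end = min(e for _, e in lifespans)
    return (start, end) if start < end else None

policy = (2015, 2025)   # lifespan of the insurance policy
repair = (2019, 2021)   # lifespan of the repair work
event  = (2020, 2022)   # lifespan of the event
overlap = lifespan_intersection([policy, repair, event])
```

If `overlap` is None, the three members may not be related in the fact relationship; otherwise the valid time of the Amount measure must fall within `overlap`.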
6.5 Temporal Measures
6.5.1 Temporal Support for Measures
Current Multi Dimensional models provide support only for the valid time of measures. In this section
you will see a few cases where loading time and transaction time support can also be provided for
measures. Let us consider some of the cases that show the benefits of providing different kinds of
temporal support for measures; it may not be possible to discuss all the situations in this section.
Non-Temporal Sources, Data Warehouse with LT
Generally the source systems do not provide temporal support, or it may be provided in an ad hoc
manner, which is not sufficient and is also difficult to obtain. Also, the integration of temporal support
from the source systems into a data warehouse is extremely costly; checking the time consistency
between different source systems is a perfect example of this. In order to obtain the history of how the
source data evolved in time, measure values can be timestamped with the loading time to indicate the
time at which the data was loaded into the warehouse.
Source Systems and Data Warehouse with VT
If valid time support is provided by the source systems, it is needed in the temporal data warehouse as
well. Valid time in source systems is used to represent events or states. For example, you can design
an event model to analyze banking transactions and a state model to analyze the salaries of
employees. Generally the difference between an event model and a state model is not explicit in the
graphical notation, but it can be stated in the textual representation. Different types of queries are
designed for such schemas. The event model for the banking transaction schema can be used to
analyze the total amount withdrawn from ATMs, the maximum or minimum withdrawal, the
frequency with which clients use ATMs during holidays and working days, etc. The state model for
employees' salaries can be used to analyze the evolution in time of the salaries paid to employees
according to different criteria, such as changes in professional skills or participation in various
training programs.
A number of other situations exist that show the benefits of providing different kinds of temporal
support for measures.
Check your progress/ Self assessment questions- 2
Q4. Integration of temporal support from the source systems into a data warehouse is extremely
costly. (TRUE / FALSE)
____________________________________________________________________________
Q5. ___________________ time in source systems is used to represent events or states.
6.6 Temporal Granularity
Temporal data warehouses must deal with different temporal granularities for measures and for
dimensions. The temporal granularity of measures in the source systems is usually much finer than in
the temporal data warehouse.
Regular and irregular mappings are used for conversion between different temporal granularities. In a
regular mapping, one granularity is a partitioning of another, so a granule can be converted using a
simple divide or multiply strategy. Conversion between seconds and minutes, or minutes and hours, is
a perfect example of regular mapping. Granules under an irregular mapping cannot be converted using
simple operations like divide or multiply. For example, conversion between days and months is not
easy, as the number of days differs from month to month.
Also, temporal databases do not allow mappings between certain granularities. For example, they do
not allow mapping between weeks and months, as the days of a week may be spread over two months.
However, a data warehouse may support forced mappings between granularities that are not easy to
map, or are not mapped at all, in temporal databases.
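The two kinds of mapping can be illustrated in Python: a regular mapping is a fixed division, while an irregular one needs a calendar lookup:

```python
import calendar

# Regular mapping: one granularity partitions another, so conversion
# is a simple divide (seconds -> minutes).
def seconds_to_minutes(seconds):
    return seconds // 60

# Irregular mapping: days -> months cannot use a fixed divisor,
# because months have different lengths; a calendar lookup is needed.
def days_in_month(year, month):
    return calendar.monthrange(year, month)[1]
```

The second function makes the irregularity concrete: February has 28 or 29 days depending on the year, so no single constant converts days to months.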
6.7 Logical Representation of Temporal Data Warehouses
6.7.1 Temporality Types
Mapping the Multi Dimensional model into an ER model requires additional attributes to keep track
of the temporal support in the Multi Dimensional model. The mapping of temporal elements also
depends on whether they represent events or states: an instant or a set of instants is used to represent
events in an ER model, whereas an interval or a set of intervals is used to represent states.
The Multi Dimensional model provides support for various temporality types, such as transaction
time, valid time, loading time, and lifespan. Events and states can both be represented using the valid
time and lifespan temporality types. An interval or a set of intervals is used to represent the
transaction time, whereas an instant is used to represent the loading time, as it indicates the point in
time at which data was loaded into a temporal data warehouse. Mapping of temporality types from the
Multi Dimensional model to the ER model can be achieved using the following rule:
Rule 1:
A monovalued attribute is used to represent an instant.
A multivalued attribute is used to represent a set of instants.
A composite attribute consisting of two attributes is used to specify the start and end of an interval.
A multivalued composite attribute is used to represent a set of intervals.
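Rule 1 can be illustrated by the Python values below, showing one plausible shape for each kind of attribute; the dates are invented and encoded as YYYYMMDD integers:

```python
# Rule 1 as Python structures: how each temporal value kind could be
# represented once mapped to ER attributes (values are illustrative).
loading_time = 20240105                  # instant -> monovalued attribute
event_instants = {20240105, 20240212}    # set of instants -> multivalued attribute
valid_time = (20240101, 20240131)        # interval -> composite (start, end)
lifespan = {(20200101, 20211231),        # set of intervals -> multivalued
            (20230101, 20241231)}        #   composite attribute
```

The set-of-intervals shape is what allows a lifespan to represent a discontinuous existence, e.g. a product withdrawn and later reintroduced.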
6.7.2 Levels with Temporal Support
Following rules are used to provide transformation of levels and their attributes into the ER model:
Rule 2: An entity type in the ER model is used to represent a non-temporal level.
Rule 3: An entity type in the ER model with an additional attribute for each of its associated
temporality types is used to represent a temporal level. Mapping can then be achieved using Rule 1.
Rule 4: A monovalued attribute in the ER model is used to represent a non-temporal attribute.
Rule 5: A multivalued composite attribute in the ER model is used to represent a temporal attribute.
Mapping can then be achieved using Rule 1.
6.7.3 Parent-Child Relationships
Mapping of parent-child relationships can be categorized into following:
Non-Temporal Relationships
The transformation of non-temporal relationships between levels into the ER model is based on the
following rule:
Rule 6: A binary relationship without attributes in the ER model is used to represent a non-temporal
parent-child relationship.
Temporal Relationships
The following rule is used for mapping temporal relationships:
Rule 7: A binary relationship with an additional attribute in the ER model is used to represent a
temporal parent-child relationship. The additional attribute keeps track of each of the associated
temporality types of the temporal parent-child relationship. Mapping can then be achieved using Rule 1.
6.7.4 Fact Relationships and Temporal Measures
Following rules are used for mapping fact relationship and temporal measures:
Rule 8: An n-ary relationship in the ER model is used to represent a fact relationship.
Rule 9: A multivalued composite attribute in the ER model is used to represent a measure of a fact.
Mapping can then be achieved using Rule 1.
Check your progress/ Self assessment questions- 3
Q6. Conversion between seconds and minutes is an example of _____________ mapping.
Q7. _____________ relationship in the ER model is used to represent a fact relationship.
Q8. It is not possible to map a Multi Dimensional model into an ER model. (TRUE / FALSE).
__________________________________________________________________________
6.8 Summary
Changes in a level can occur either at the member level, i.e. inserting or deleting an entire row, or at
the level of attribute values, i.e. changing the value of an attribute. A temporal data warehouse must
represent these changes for analysis purposes. Temporal relationships between temporal levels result
in better analysis scenarios and help to avoid partial loss of history. Cardinality is used to define the
number of members in a level related to the number of members in another level; in a temporal
model, cardinality may be defined in terms of an instant (instant cardinality) or a lifespan (lifespan
cardinality). Current Multi Dimensional models provide support only for the valid time of measures.
The integration of temporal support from the source systems into a data warehouse is extremely
costly. Conversion between seconds and minutes, or minutes and hours, is an example of regular
mapping, whereas conversion between days and months is an example of irregular mapping, as the
number of days differs from month to month. Temporal databases do not allow mapping between
weeks and months, as the days of a week may be spread over two months.
6.9 Glossary
Temporal databases- Databases that are used to represent information that varies over time.
Instant- It represents a single point of time based on some specified granularity.
Interval- It refers to set of all successive instants between two points of instants.
InstantSet- It is used to represent a set of single points of time or Intervals.
IntervalSet- It is used to represent a set of intervals, capable of representing discontinuous
durations.
Synchronization relationships- It is used to identify if two events have occurred simultaneously or one
after the other.
6.10 Answers to check your progress/self assessment questions
1. Temporal
2. TRUE.
3. Cardinality
4. TRUE.
5. Valid.
6. Regular.
7. N-ary.
8. FALSE.
6.11 References/ Suggested Readings
"1. Data Mining: Concepts and Techniques by J. Han and M. Kamber Publisher
Morgan Kaufmann Publishers
2. Advanced Data warehouse Design (from conventional to spatial and temporal applications) by
Elzbieta Malinowski and Esteban Zimányi Publisher Springer
3. Modern Data Warehousing, Mining and Visualization by George M Marakas,
Publisher Pearson."
6.12 Model Questions
1. Explain in detail the concept of representing temporal hierarchies in Multi Dimensional model.
2. Define cardinality for non-temporal models.
3. Explain the concept of temporal support for measures with the help of an example.
4. What do you mean by regular mapping and irregular mapping in context of temporal granularity?
5. Write the rule using which the mapping of temporality types from Multi Dimensional model to ER
model can be achieved.
Lesson- 7 Introduction to data mining
Structure
7.0 Objective
7.1 Introduction
7.2 Data Mining
7.3 Steps in Data Mining
7.4 Types of Data mining
7.5 Mining various Data types
7.6 Data mining issues
7.7 Pattern/context based mining
7.8 Summary
7.9 Glossary
7.10 Answers to check your progress/self assessment questions
7.11 References/ Suggested Readings
7.12 Model questions
7.0 Objective
After studying this lesson, students will be able to:
1. Define data mining.
2. List the steps involved in data mining process.
3. Explain different data mining techniques.
4. Discuss various data types on which data mining can be performed.
5. Describe pattern/context based data mining.
7.1 Introduction
Information Technology has grown by leaps and bounds in the field of databases and their
functionalities. The database life cycle goes through various operations and phases, like data
collection, data creation, data management, data analysis, and data understanding. Businesses like to
study the behavior of their customers over time and predict the items that they are most likely to buy.
Small business houses enjoy personal contact with their customers, whereas the bigger business
houses do not have this privilege and need to apply various tools to study the behavior of their
customers. In this lesson you will learn the concept of data mining and its various types.
7.2 Data Mining
Data Mining, or Knowledge Discovery in Databases (KDD), is the nontrivial extraction of implicit,
previously unknown, and potentially valuable information from data. Although many data mining
techniques are quite new, data mining itself is not a new concept, and people have been analyzing
data on computers since the first computers were invented. Over the years, data mining has gone by
terms like knowledge discovery, business intelligence, predictive modelling, predictive analytics, and
so on.
According to Gordon S. Linoff and Michael J. A. Berry, "Data mining is a business process for
exploring large amounts of data to discover meaningful patterns and rules".
Data mining may thus be defined as a business process that interacts with other business processes to
explore massive data, which grows with every passing day, for the discovery of knowledge or
meaningful patterns and rules that help the business in forming strategies.
7.3 Steps in Data Mining
Data mining is a scientific process or arrangement of processes in a scientific manner one followed by
the other. You cannot start with a data mining sub-task before the earlier sub-task has been finished.
Following are the generally followed steps in data mining:
Figure 7.1: Steps in data mining
1. Data Cleaning and Integration- Integrating the data from multiple heterogeneous data sources and
removing error-prone and inconsistent data.
2. Data Selection and Preprocessing- Extracting the relevant data from the previous stage and making
it consistent for analysis.
3. Data Transformation- Transforming the data from the previous stage into forms appropriate for
mining.
4. Data Mining- Discovering intelligent patterns and rules from the data.
5. Pattern Evaluation- Identifying the patterns and rules that are of interest, depending upon the
problem in hand.
Once the patterns have been identified, visualization techniques are used to present the mined
knowledge to the end user or client (which is mainly top level management).
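The five steps can be sketched end to end on a toy dataset; the item names and the frequency threshold are invented for illustration:

```python
from collections import Counter

# A toy run of the data mining steps: clean, select, transform,
# mine (a simple frequent-item count), and evaluate.
raw = [" milk ", "bread", None, "MILK", "eggs", "bread", ""]

# 1-2. Cleaning and selection: drop missing/empty records
cleaned = [r for r in raw if r]

# 3. Transformation: normalize to a canonical form for mining
transformed = [r.strip().lower() for r in cleaned]

# 4. Mining: count item frequencies
counts = Counter(transformed)

# 5. Pattern evaluation: keep only items appearing at least twice
frequent = {item for item, n in counts.items() if n >= 2}
```

Real pipelines replace each step with far heavier machinery (ETL tools, mining algorithms, interestingness measures), but the staged, one-after-the-other structure is the same.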
7.4 Types of Data mining
Different business houses perform data mining for their special requirements. For example, a retail
chain might do it to identify which products are in demand and which products are bought together. A
production house may want to know which model of a particular product is in demand, or which
products sold by competitors they do not produce, etc. Depending on the nature of use, data mining
can be classified as follows:
Figure 7.2: Types of data mining
Predictive
1. Regression is a statistical tool that helps to predict the value of a dependent variable from independent
variables. It uses the relation between the numeric variables (both dependent and independent).
A regression model can be simple linear, multiple linear, curvilinear, or multiple curvilinear.
2. Prediction deals with a class that is continuous. A prediction model is used to find the numerical value of
the target attribute for objects drawn from real life scenarios and live data.
3. Classification is based on pre-defined classes. It is used to assign a newly presented object to a set
of predefined classes. The classification task is characterized by a well-defined definition of the
classes, and a model set consisting of pre-classified examples. The class label is a discrete qualitative
identifier; for example, large, medium, or small.
4. Time series analysis is used to store the time related data. It represents sequences of values
changing over time. Data is recorded at regular intervals of time. Some of the applications of time
series analysis are financial applications, scientific applications, etc.
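The regression technique above can be sketched as an ordinary least-squares fit with one independent variable. The data points below are made up for illustration; they roughly follow y = 2x.

```python
# Hypothetical observations of an independent variable x and a
# dependent variable y
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope: sum((x - mx)(y - my)) / sum((x - mx)^2)
num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
den = sum((x - mean_x) ** 2 for x in xs)
slope = num / den
intercept = mean_y - slope * mean_x

def predict(x):
    """Predict the dependent variable from the independent one."""
    return intercept + slope * x
```

With these numbers the fit gives a slope close to 2, so `predict(6)` returns a value close to 12. Multiple and curvilinear regression generalize the same idea to several independent variables or non-linear terms.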
Descriptive
1. Clustering is used to organize information about variables into homogeneous groups or clusters.
These clusters are based on similarities between variables.
2. Association Rules are statements about the relationships between attributes of a known set of
entities. They enable the system to predict aspects of other entities that are not in the group but
possess the same attributes.
3. Data Summarization is done at the early stages and is used as initial exploratory data analysis. It is
used to find potential hypotheses by understanding the behavior of the data and exploring hidden
information in it.
Check your progress/ Self assessment questions- 1
Q1. Define data mining.
___________________________________________________________________________
___________________________________________________________________________
___________________________________________________________________________
Q2. The extraction of data from previous stage and making it consistent for analysis is called
_____________________ .
Q3. Which data mining step comes after preprocessing?
________________________________________________________________________
Q4. Data mining may be broadly classified into _______________ and ______________ data mining.
Q5. _________________ are statements about the relationships between attributes of a known set of
entities.
7.5 Mining various Data types
Data mining is not specific to one type of data and is applicable to any kind of data repository. Mining
algorithms vary depending on the types of data to be mined. Here are some examples of data types
that can be mined:
1. Flat files: Data is also available in the form of files. These files contain data in the form of text or
binary format with a structure already known by the data mining algorithm to be applied.
2. Relational Databases: Relational database is a set of tables containing values of entity attributes.
Table is represented as a 2-D matrix in which columns are used to represent the attributes and rows
are used to represent tuples. Data mining algorithms for relational databases are more versatile since
they can take advantage of the structure inherent to relational databases.
3. Data Warehouses: A data warehouse is a repository of data collected from multiple heterogeneous
data sources. Data warehouse is set up to generate a central integrated repository of data. Data from
multiple heterogeneous sources is preprocessed to remove all kinds of inconsistencies and integrity
problems from it. Multi-dimensional data models are used to maintain data in Data warehouses to
support decision making.
4. Transaction Databases: A transaction database represents transactional records each with a time
stamp, an identifier and a set of items. Transactions are stored in flat files or in normalized transaction
tables.
5. Multimedia Databases: Multimedia databases include various media types like video, images,
audio, etc. stored on extended object-relational or object-oriented databases.
6. Spatial Databases: Spatial databases store geographical information like maps, and global or
regional positioning along with normal data.
7. Time-Series Databases: Time-series databases contain time related data such as stock market data
or logged activities. It contains data stored at regular intervals. Data mining in such databases
includes the study of trends and correlations between evolutions of different variables.
8. World Wide Web: Data in the World Wide Web is organized in inter-connected documents that
have text, audio, video, and even applications. The World Wide Web is comprised of content of the
Web (Documents), the structure of the Web (Hyperlinks), and the usage of the web (Resource
accessibility).
Check your progress/ Self assessment questions- 2
Q6. A data warehouse is a repository of data collected from multiple heterogeneous data sources.
a. TRUE
b. FALSE
Q7. Data mining cannot be performed on multimedia databases.
a. TRUE
b. FALSE
Q8. Which of the following is not a type of predictive data mining technique?
a. Regression
b. Classification
c. Clustering
d. Prediction
Q9. _____________ databases store geographical information like maps, and global or regional
positioning along with normal data.
7.6 Data mining issues
1. Security issues: Data meant for decision taking purpose must be secured from unauthorized access
and manipulation. Data collected for customer profiling includes sensitive and private information
about individuals or companies.
2. End-user interface issues: The knowledge discovered should be interesting and understandable by
the end-user. Good data visualization helps to achieve this objective. End-user should be able to
visualize data from different dimensions in order to discover meaningful knowledge from it.
3. Methodology issues: Data for mining is extracted from variety of heterogeneous sources. Ensuring
the consistency and integrity of data is vital. Data mining techniques should be able to handle noise in
data or incomplete information. The size of the search space is crucial. The search grows
exponentially with the increase in dimensions. All methodology related issues must be handled
promptly.
4. Performance issues: Data is growing exponentially. Scalability and efficiency of data mining
methods is a challenge when it comes to processing such massive amount of data. To overcome these
issues, sampling of data sets was used for data mining. Nowadays big data tools are used to manage
massive amounts of data and they help to process the data in a much more effective manner.
7.7 Pattern/context based mining
Pattern mining consists of using/developing data mining algorithms to discover interesting,
unexpected and useful patterns in databases. Pattern mining algorithms can be designed to discover
various types of patterns: subgraphs, associations, indirect associations, trends, periodic patterns,
rules, lattices, sequential patterns, etc. The Apriori algorithm is a popular frequent itemset mining
algorithm, and a number of variations of it have also been designed. Frequent itemset mining is a
fast growing mining technique. It leads to the discovery of interesting patterns
such as association rules, correlations, sequences, classifiers and clusters among distinct items in large
point-of-sale or transactional data sets. The mining of association rules is one of the most critical and
widely studied problems. Frequent patterns may be defined as patterns that appear frequently in a
data set. For example, if items like a watch, sun-glasses and shoes frequently appear together in
transactions, they may be called a frequent itemset. And if the three items are bought one after the other,
the same may be called a frequent sequential pattern.
Measuring the strength of association rule: The association rules technique should be capable of
separating the strong patterns from the weak patterns. The three important methods for measuring the
strength of an association rule are support, confidence, and lift. Support measures the proportion of
transactions that contain all the items in the rule. Confidence is used to measure the predictive
strength of the rule. Confidence is a measure that tells you how good a rule is at predicting the right
side of the rule. Sometimes the rules may be so obvious that confidence is of little or no use. Lift is
used to measure the power of the rule by comparing the confidence of the full rule to randomly guessing the right side.
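The three measures can be sketched in a few lines for a hypothetical rule {watch} → {sun-glasses} over made-up transactions:

```python
# Hypothetical point-of-sale transactions, each a set of items
transactions = [
    {"watch", "sun-glasses", "shoes"},
    {"watch", "sun-glasses"},
    {"watch", "shoes"},
    {"shoes"},
    {"sun-glasses", "shoes"},
]
n = len(transactions)

def support(itemset):
    """Proportion of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / n

lhs, rhs = {"watch"}, {"sun-glasses"}

sup_rule = support(lhs | rhs)    # support of the whole rule
conf = sup_rule / support(lhs)   # predictive strength of the rule
lift = conf / support(rhs)       # improvement over random guessing

print(sup_rule, conf, lift)
```

Here the rule holds in 2 of 5 transactions (support 0.4), fires in 2 of the 3 transactions containing a watch (confidence about 0.67), and since sun-glasses appear in 60% of all transactions the lift is about 1.11, i.e. slightly better than guessing.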
Association rule mining somehow fails to generate the rules that represent true correlation relationship
between the objects. Association rules may or may not be interesting from mining perspective.
Increasing the value for support threshold is not an efficient solution, as it might lead to missing of
many important association rules. So we need a mechanism using which we can separate important
association rules from unimportant association rules. Correlation analysis is a method capable of
revealing which association rules not only satisfy the minimum threshold criteria, but are also
interesting and useful. Statistics-based judgment drawn from the data can be much more effective in
leaving uninteresting rules out of the total rules discovered.
Classification model is also known as classifier. The model is then used to classify new objects or
unseen objects in the real world. For example, after starting a credit policy, an enterprise may wish to
classify employees' performance and label them according to pre-classified labels such as
"excellent", "good", "ok" and "need improvement". A classifier begins by describing a predetermined
set of data classes. This is also called the learning or training step, where an algorithm is used to build
the classifier. It does so by learning from a training set made up of database tuples and their associated
class labels. X = (x1, x2, …, xn) represents a tuple X. Each tuple implicitly belongs to a predefined
class as determined by the class label attribute. The class label attribute is categorical and not continuous,
for example, outstanding, good, average and bad. Each tuple of the training set is referred to as a training
tuple and is selected from the database under analysis.
Clustering is also used to find similar patterns, more popularly known as pattern recognition.
Being able to identify patterns and provide scientific reasoning for them helps in better decision
making. Time series analysis is used to store time related data. It represents sequences of values
changing over time. Data is recorded at regular intervals of time. Some of the applications of time
series analysis are financial applications, scientific applications, etc.
Check your progress/ Self assessment questions- 3
Q10. What are frequent patterns?
___________________________________________________________________________
___________________________________________________________________________
___________________________________________________________________________
Q11. ______________ analysis is used to find which association rules satisfy the minimum threshold
criteria and are also interesting and useful.
Q12. Define classifier.
___________________________________________________________________________
___________________________________________________________________________
___________________________________________________________________________
7.8 Summary
According to Gordon S. Linoff and Michael J. A. Berry, “Data mining is a business process for
exploring large amounts of data to discover meaningful patterns and rules”. Data mining process goes
through following 5 steps:
1. Data Cleaning and Integration
2. Data Selection and Preprocessing
3. Data Transformation
4. Data Mining
5. Pattern Evaluation
Data mining is broadly classified into predictive and descriptive data mining. Predictive data mining
may be further classified as regression, prediction, classification and time series analysis. Descriptive
analysis may be further classified as clustering, association rules, and data summarization. Data
mining can be applied on flat files, relational databases, data warehouses, operational databases,
multimedia databases, spatial databases, time-series databases and the WWW. Frequent itemset mining
algorithms and a number of variations of them have been designed. Frequent itemset mining
leads to the discovery of interesting patterns such as association rules, correlations,
sequences, classifiers and clusters among distinct items in large point-of-sale or transactional data
sets. Correlation analysis is a method capable of revealing which association rules not only
satisfy the minimum threshold criteria, but are also interesting and useful. Classification model is
also known as classifier. The model is then used to classify new objects or unseen objects in the real
world. Clustering is also used to find similar patterns, more popularly known as pattern
recognition.
7.9 Glossary
Data mining- Data mining is a business process for exploring large amounts of data to discover
meaningful patterns and rules.
Regression- It is a statistical tool that helps to predict the value of a dependent variable from independent
variables.
Classification- It is used to assign a newly presented object to a set of predefined classes.
Clustering- It is used to organize information about variables to form homogeneous groups or clusters.
Association Rules- These are statements about the relationships between attributes of a known set of
entities.
Relational databases- Relational database is a set of tables containing values of entity attributes.
Tables have columns and rows, where columns represent attributes and rows represent tuples.
Data warehouse- A data warehouse is a central integrated repository of data collected from multiple
heterogeneous data sources.
7.10 Answers to check your progress/self assessment questions
1. Data mining may be defined as exploration of massive data for the discovery of knowledge or
meaningful patterns/rules to help the business in forming strategies.
2. Data selection and preprocessing.
3. Transformation.
4. Predictive, descriptive.
5. Association Rules.
6. a.
7. b.
8. c.
9. Spatial.
10. Frequent patterns may be defined as patterns that appear frequently in a data set.
11. Correlation.
12. A classifier model is used to classify new objects or unseen objects in the real world into a set
of predefined classes.
7.11 References/ Suggested Readings
1. Data Mining: Concepts and Techniques by J. Han and M. Kamber Publisher Morgan Kaufmann
Publishers
2. Advanced Data warehouse Design (from conventional to spatial and temporal applications) by
Elzbieta Malinowski and Esteban Zimányi Publisher Springer
3. Modern Data Warehousing, Mining and Visualization by George M Marakas, Publisher Pearson.
4. Data Warehousing, Data Mining, & Olap by Alex Berson and Stephen J smith, Tata McGraw-Hill
Education.
5. Data Mining and Data Warehousing by Bharat Bhushan Agarwal and Sumit Prakash Tayal,
University Science Press.
6. Data Mining: Technologies, Techniques, Tools and Trends by Bhavani Thuraisingham.
7.12 Model questions
1. Explain the process of data mining.
2. Explain different types of data mining.
3. List various types of data types on which data mining can be performed.
4. What do you mean by pattern based mining?
5. What is the benefit of correlation analysis?
Lesson- 8 Classification Techniques- 1
Structure
8.0 Objective
8.1 Introduction
8.2 Classification
8.3 Bayesian Classifier
8.4 Bayes’ Theorem
8.5 Naive Bayesian Classification
8.6 Bayesian Belief Networks
8.7 Summary
8.8 Glossary
8.9 Answers to check your progress/self assessment questions
8.10 References/ Suggested Readings
8.11 Model Questions
8.0 Objective
After studying this lesson, students will be able to:
1. Define the concept of classification in data mining.
2. List different techniques used for implementing classification.
3. Describe the use of Bayesian classifier.
4. State the Bayes' Theorem.
5. Differentiate between Bayesian classifier and Bayesian belief networks.
8.1 Introduction
Classification is a key data mining technique that is implemented in various applications. Nowadays
you need to identify the characteristics of, or label, your customers based on available attributes. Mobile
service providers, banks, insurance companies and ecommerce companies all use classification for better
outcomes. In this lesson you will learn a basic statistical classification tool called the Bayesian
classifier.
8.2 Classification
Classification is used to predict categorical labels. Classification analysis helps to organize the data in
predefined classes. Classification is also known as supervised learning. For example, depending on
the yearly sales figure or increase in sales, the performance of a salesman can be classified into predefined
classes such as outstanding, good, average and poor. All objects of the training set are associated with
pre-classified class labels. The classification algorithm builds a model by learning from the training
set. Classification model is also known as classifier. The model is then used to classify new objects or
unseen objects in the real world. For example, after starting a credit policy, an enterprise may wish to
classify customers and label them according to pre-classified labels such as "safe", "risky" and "very
risky".
Following techniques can be used with classification model of data mining:
1- Decision Trees.
2- Artificial Neural Networks.
3- Genetics Algorithm.
4- K-Nearest Neighbour.
5- Memory Based Reasoning.
6- Naive Bayesian classifier.
As already discussed earlier, classifier describes a predetermined set of data classes. This is also
called the learning or training step, where an algorithm is used to build the classifier. It does so by
learning from a training set made up of database tuples and their associated class labels. X = (x1, x2,
…, xn) represents a tuple X. Each tuple implicitly belongs to a predefined class as determined by the class
label attribute. The class label attribute is categorical and not continuous, for example, outstanding,
good, average and bad. Each tuple of the training set is referred to as a training tuple and is selected from
the database under analysis.
Then it is time to estimate the predictive accuracy of the classifier. You can use either the training set or a test
set to measure the classifier's accuracy. The estimated accuracy tends to be overly optimistic when it is
measured on the training set, because the classifier tends to overfit the data.
Instead, a test set randomly selected from the general data set, along with its associated class labels, should
be used. The test set is mostly free from such bias. Accuracy of a classifier is the percentage of
correctly classified test set or training set tuples. Accuracy measured on a test set is expected
to be a more reliable estimate than accuracy measured on the training set.
Figure 8.1: Classification process
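Computing accuracy as the percentage of correctly classified tuples can be sketched in a couple of lines; the labels below are hypothetical test-set labels from the credit-policy example.

```python
# True class labels of the test set and the labels the classifier predicted
true_labels      = ["safe", "risky", "safe", "very risky", "risky"]
predicted_labels = ["safe", "risky", "risky", "very risky", "risky"]

# Accuracy: percentage of tuples whose prediction matches the true label
correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
accuracy = correct / len(true_labels) * 100

print(accuracy)
```

Here 4 of the 5 test tuples are classified correctly, giving 80% accuracy.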
Check your progress/ Self assessment questions- 1
Q1. Classification is an example of supervised learning. (TRUE / FALSE).
____________________________________________________________________________
Q2. Classification analysis helps to organize the data in _______________ classes.
Q3. List some of the classification techniques used in data mining.
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
8.3 Bayesian Classifier
Bayesian classifiers are popular statistical classifiers. Bayesian classifier is used to predict the
probability that a given tuple belongs to a particular class. It is a classification tool. It is based on
Bayes’ theorem, discussed later in this lesson. A simple Bayesian classifier called the naive Bayesian
classifier is very good in terms of performance when compared with other classification algorithms.
Studies have also revealed that Bayesian classifiers result in high accuracy and speed when applied to
large databases.
Naive Bayesian classifier is based on an assumption that the effect of an attribute value on a given
class is independent of the values of the other attributes known as class conditional independence.
Bayesian belief networks are popular graphical models that represent dependencies among subsets
of attributes. Bayesian belief networks can also be used for classification and are discussed later in
this lesson.
8.4 Bayes’ Theorem
What is Bayes’ theorem all about? A data tuple X is considered as evidence in Bayesian terms. Let
H be some hypothesis stating that data tuple X belongs to a pre-defined class C. The classification
problem is used to determine the probability P(H | X), stating that the hypothesis H holds given the
evidence X.
P(H | X) is called the posterior probability of H conditioned on X. For instance, suppose that the data
tuples in our database are confined to customers described by the attributes age and income. Let the
evidence X be a 25 year old customer with an income of Rs. 55,000/-. Let H be the hypothesis that
the customer will purchase a smartphone. It means that P(H | X) reflects the probability that a
customer will purchase a smartphone when the customer’s age and income are 25 and 55,000
respectively. Other attributes such as gender, occupation and marital status do not matter.
Also, P(H) is called prior probability of H. No evidence is considered in prior probability and it is
assumed that a given customer will purchase smartphone irrespective of any attribute like age,
income, occupation, gender, or any other information. The probability P(H | X) is the posterior
probability based on evidence X or additional information, whereas the probability P(H) is
independent of any evidence X.
On the other hand, the probability P(X | H) is the posterior probability of X conditioned on H. It is the
probability that a customer who purchases a smartphone has an income of Rs. 55,000/- and is of age 25.
Whereas, P(X) is the prior probability of X. It is the probability that a customer from a given set of
customers earns Rs. 55,000/- and is 25 years old.
It can be easily ascertained that these probabilities are valuable. So how do you compute these
probabilities? You can compute the posterior probability, P(H | X), from P(H), P(X | H), and P(X)
using Bayes’ theorem as follows:

P(H | X) = P(X | H) P(H) / P(X)        (8.1)
Bayes’ theorem can also be used in the naive Bayesian classifier as discussed in the next section.
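As a numeric illustration of equation 8.1 applied to the smartphone example, the probability values below are assumed for the sake of the example, not taken from any real data.

```python
# Assumed (hypothetical) probabilities for the smartphone example:
p_h = 0.3          # prior P(H): a customer purchases a smartphone
p_x_given_h = 0.2  # P(X|H): a buyer is 25 years old earning Rs. 55,000
p_x = 0.1          # P(X): any customer is 25 years old earning Rs. 55,000

# Bayes' theorem (equation 8.1): P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x

print(p_h_given_x)
```

With these numbers, the posterior P(H | X) works out to 0.6: knowing the customer's age and income doubles our belief that a purchase will occur, compared with the prior of 0.3.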
Check your progress/ Self assessment questions- 2
Q4. Bayesian classifier _________________ is very good in terms of performance when
compared with other classification algorithms.
____________________________________________________________________________
Q5. Define class conditional independence.
Q6. P(H | X) is called the _______________ probability of H conditioned on X.
Q7. P(H) is called __________ probability of H.
8.5 Naive Bayesian Classification
Working of Naive Bayesian classifier, also known as simple Bayesian classifier, can be described as
follows:
1. Suppose that you are given a training set D and its associated class labels. Each tuple in the
training set D is represented using an n-dimensional attribute vector, X = (x1, x2, …, xn), depicting n
measurements, A1, A2, …, An.
2. Let us suppose there are m predefined classes represented as C1, C2, …, Cm. For a given
tuple X, the classifier will predict that X belongs to the class having the highest posterior probability,
conditioned on X. In other words, the naive Bayesian classifier predicts that tuple X belongs to the class
Ci if and only if

P(Ci | X) > P(Cj | X)  for 1 <= j <= m, j != i

The idea is to maximize P(Ci | X). The class Ci for which the probability P(Ci | X) is maximized is called
the maximum posteriori hypothesis. By Bayes’ theorem,
P(Ci | X) = P(X | Ci) P(Ci) / P(X)        (8.2)
3. P(X) in equation 8.2 is constant for all classes. Hence, we need to maximize P(X |
Ci)P(Ci) in order to maximize P(Ci | X). The class prior probabilities are assumed to be equally likely
when they are not known, i.e. P(C1) = P(C2) = … = P(Cm). In that case, we need to maximize only
P(X | Ci). You can compute the class prior probabilities as P(Ci) = |Ci,D| / |D|, where |Ci,D| represents
the number of training tuples of class Ci in training set D.
4. Computing P(X | Ci) for data sets with many attributes can be very expensive. In order to reduce
the computation cost, the naive assumption of class conditional independence is made. Hence, it is
assumed that the attribute values are conditionally independent of one another for a given class label
of the tuple. Thus,
P(X | Ci) = P(x1 | Ci) × P(x2 | Ci) × … × P(xn | Ci)        (8.3)
It is rather easy to estimate the probabilities P(x1 | Ci), P(x2 | Ci), …, P(xn | Ci) from the training tuples,
as xk refers to the value of attribute Ak for tuple X. P(X | Ci) can be computed for both categorical and
continuous-valued attributes as follows:
(a) For a categorical attribute, P(xk | Ci) is the number of tuples of class Ci in D having
value xk for Ak, divided by the number of tuples of class Ci in D, represented by |Ci,D|.
(b) For continuous-valued attributes, calculations are easy, but a little more costly than the
calculations for categorical attributes. A continuous-valued attribute is assumed to follow a Gaussian
distribution with mean µ and standard deviation σ, defined by:

g(x, µ, σ) = (1 / (√(2π) σ)) exp( −(x − µ)² / (2σ²) )        (8.4)

such that

P(xk | Ci) = g(xk, µCi, σCi)        (8.5)
µCi and σCi refer to the mean and standard deviation, respectively, of the
values of attribute Ak for training tuples of class Ci. These two values are then used with the attribute
value xk in equation 8.4 to estimate P(xk | Ci).
5. P(X | Ci)P(Ci) is computed for each class Ci to predict the class label of X. The classifier predicts
that the class label of tuple X is the class Ci if and only if:

P(X | Ci) P(Ci) > P(X | Cj) P(Cj)  for 1 <= j <= m, j != i

It means that the predicted class label is the class Ci for which P(X | Ci)P(Ci) is the maximum.
In theory, Bayesian classifiers have the minimum error rate in comparison to other classifiers such as
decision tree and neural network classifiers. In practice, however, this is not always the case, due to
inaccuracies in the assumptions made for their use, such as class conditional independence, and due to
missing probability data.
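The five steps above, restricted to categorical attributes, can be sketched as follows. The training tuples, attribute values and class labels are all hypothetical; continuous-valued attributes would instead use the Gaussian density of equation 8.4.

```python
from collections import Counter

# Step 1: training set D; each tuple is (attribute vector X, class label)
train = [
    (("young", "high"),  "buys"),
    (("young", "low"),   "no"),
    (("middle", "high"), "buys"),
    (("senior", "low"),  "no"),
    (("senior", "high"), "buys"),
    (("middle", "low"),  "buys"),
]

classes = Counter(label for _, label in train)  # |Ci,D| for each class
n = len(train)

def p_x_given_c(x, c):
    """P(X|Ci) under class conditional independence (equation 8.3)."""
    tuples_c = [xv for xv, label in train if label == c]
    prob = 1.0
    for k, xk in enumerate(x):
        # (a) categorical estimate: count of matches / |Ci,D|
        matches = sum(1 for xv in tuples_c if xv[k] == xk)
        prob *= matches / len(tuples_c)
    return prob

def classify(x):
    # Steps 2-5: pick the class Ci maximizing P(X|Ci) * P(Ci)
    return max(classes, key=lambda c: p_x_given_c(x, c) * classes[c] / n)

print(classify(("young", "high")))  # prints buys
```

Note that a zero count makes the whole product zero; practical implementations usually apply a smoothing correction (e.g. Laplace smoothing), which this sketch omits.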
8.6 Bayesian Belief Networks
Computation cost for the naive Bayesian classifier can be reduced by making the assumption that
there exists class conditional independence. Class conditional independence means that the values of
the attributes for a given class are independent of the values of the other attributes. Theoretically the
naive Bayesian classifier may be the most accurate compared with other classifiers; however, in practice
dependencies do exist between variables. Bayesian belief networks specify joint conditional
probability distributions. They allow class conditional independencies to be defined between subsets of
variables. They provide a graphical model of causal relationships, on which learning can be performed.
A Bayesian belief network or belief network is defined by two components—a directed acyclic graph
and a set of conditional probability tables. A directed acyclic graph is one that does not involve
cycles, i.e. there is no path from one node to itself. Each node in the directed acyclic graph for belief
network represents a random variable. The variables can be discrete or continuous-valued. The
attributes used to represent nodes can be actual attributes or “hidden variables” believed to form a
relationship. Each directed arc in the graph represents a probabilistic dependence between the
variables. An arc from a node A to node B means that Ais immediate predecessor of B, and B is a
descendant of Y. Each variable is conditionally independent of its non-descendants in the graph.
Figure 8.2: Belief Network
The belief network in the figure above represents a causal model using a directed acyclic graph and a
conditional probability table for the variable BloodPressure over all combinations of its parent nodes
FamilyHistory and SpicyOilyFood. The belief network involves 6 Boolean variables. Each arc is used to
represent causal knowledge. For instance, a blood pressure problem is influenced by both the family
history of the patient and whether the patient likes to eat spicy and oily food. Once the value for a
variable is known, the values of its parent variables do not provide any additional information
regarding the descendant variables of that variable. For instance, once it is established that the patient is
suffering from blood pressure, the variables FamilyHistory and SpicyOilyFood do not provide additional
information regarding Hypertension.
Also, the directed arcs of the acyclic graph for the belief network represented in figure 8.2 show that the
variable BloodPressure is conditionally independent of Cholesterol, given its parents, FamilyHistory and
SpicyOilyFood.
One conditional probability table is included in the belief network for each variable, which means you
can create 6 CPTs for the belief network represented in figure 8.2. The CPT for a variable X indicates the
conditional distribution P( X | Parents( X )), where Parents( X ) represents the immediate predecessors of
variable X. Figure 8.2(b) shows only a single CPT, the one for the variable BloodPressure. The
conditional probability for each known value of the variable BloodPressure is given for each possible
combination of values of its parents.
P(BloodPressure = yes | FamilyHistory = yes, SpicyOilyFood = yes) = 0.8
P(BloodPressure = no | FamilyHistory = yes, SpicyOilyFood = yes) = 0.2
P(BloodPressure = yes | FamilyHistory = yes, SpicyOilyFood = no) = 0.5
P(BloodPressure = no | FamilyHistory = yes, SpicyOilyFood = no) = 0.5
P(BloodPressure = yes | FamilyHistory = no, SpicyOilyFood = yes) = 0.7
P(BloodPressure = no | FamilyHistory = no, SpicyOilyFood = yes) = 0.3
P(BloodPressure = yes | FamilyHistory = no, SpicyOilyFood = no) = 0.1
P(BloodPressure = no | FamilyHistory = no, SpicyOilyFood = no) = 0.9
Total of each column is 1.
Let X = (x1, …, xn) be a data tuple described by the variables or attributes Y1, …, Yn, respectively.
The network provides a complete representation of the existing joint probability distribution with the
following equation:

P(x1, …, xn) = P(x1 | Parents(Y1)) × P(x2 | Parents(Y2)) × … × P(xn | Parents(Yn))        (8.6)

where P(x1, …, xn) refers to the probability of a particular combination of values of X, and the values
for P(xi | Parents( Yi )) correspond to the entries in the CPT for Yi.
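Equation 8.6 can be illustrated on a fragment of the network in figure 8.2, where FamilyHistory and SpicyOilyFood are the parents of BloodPressure. The CPT entries come from the table above; the prior probabilities for the two parent nodes are assumed values chosen only for this sketch.

```python
# Assumed priors for the parentless nodes (hypothetical values)
p_fh = {"yes": 0.4, "no": 0.6}   # P(FamilyHistory)
p_sf = {"yes": 0.5, "no": 0.5}   # P(SpicyOilyFood)

# CPT for BloodPressure, taken from the table above:
# keys are (FamilyHistory, SpicyOilyFood) combinations
p_bp = {
    ("yes", "yes"): {"yes": 0.8, "no": 0.2},
    ("yes", "no"):  {"yes": 0.5, "no": 0.5},
    ("no", "yes"):  {"yes": 0.7, "no": 0.3},
    ("no", "no"):   {"yes": 0.1, "no": 0.9},
}

def joint(fh, sf, bp):
    # Equation 8.6: P(fh, sf, bp) = P(fh) * P(sf) * P(bp | fh, sf)
    return p_fh[fh] * p_sf[sf] * p_bp[(fh, sf)][bp]

print(joint("yes", "yes", "yes"))  # 0.4 * 0.5 * 0.8
```

Summing `joint` over all eight value combinations gives 1, confirming that the factored product is a valid joint distribution.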
A class label attribute can be represented by selecting any node within the network as an “output”
node. It is possible for the belief network to have more than one output node. It is also possible to
apply various learning algorithms to the network. In that case, the classifier can return a probability
distribution that gives the probability of each class.
Check your progress/ Self assessment questions- 3
Q8. The naive Bayesian classifier is used to predict if a ______________ belongs to a class.
Q9. What happens when the class prior probabilities are absent or missing.
___________________________________________________________________________
Q10. Why the naive Bayesian classifier is difficult to use practically?
___________________________________________________________________________
___________________________________________________________________________
8.7 Summary
Classification is used to predict categorical labels. The classification algorithm builds a model by
learning from the training set. Some of the examples of classification techniques are Decision Trees,
Artificial Neural Networks, Genetics Algorithm, K-Nearest Neighbour, Memory Based Reasoning,
Naive Bayesian classifier, etc. A simple Bayesian classifier called naive Bayesian classifier is very
good in terms of performance when compared with other classification algorithms. Naive Bayesian
classifier is based on the assumption that there exists class conditional independence. In the
classification problem, Bayes' theorem is used to determine the probability P(H | X), i.e. that the
hypothesis H holds given the evidence or tuple X. The probability P(H | X) is the posterior
probability based on the evidence X or additional information, whereas the probability P(H) is the
prior probability, independent of any evidence X. The Bayesian classifier is based on the assumption
that there exists class conditional independence; in practice, however, dependencies do exist between variables.
Bayesian belief networks try to overcome this problem by specifying joint conditional probability
distributions. A Bayesian belief network consists of a directed acyclic graph, i.e. a graph with no
path from any node back to itself, and a set of conditional probability tables (one for each
variable).
8.8 Glossary
Classifier- It is used to predict categorical labels for classes. Classifier helps to organize the data in
predefined classes.
Directed acyclic graph- A directed acyclic graph is one that does not involve cycles, i.e. there is no
path from one node to itself.
Conditional Probability table- CPT for a variable X indicates the conditional distribution P( X |
Parents( X )), where Parents( X ) represents the immediate predecessors of variable X.
Probability- A numerical measure, between 0 and 1, of how likely an event is to occur.
Bayesian belief network- It is a popular graphical model that represents dependencies among
subsets of attributes.
Naive Bayesian classification- It is used to determine the probability P(H | X), stating that the
hypothesis H holds given the evidence X.
8.9 Answers to check your progress/self assessment questions
1. TRUE.
2. Predefined
3. Following are some of the classification techniques used in data mining:
Decision Trees.
Artificial Neural Networks.
Genetic Algorithms.
K-Nearest Neighbour.
Memory Based Reasoning.
Naive Bayesian classifier.
4. Naive Bayesian classifier.
5. It means that the attribute values are conditionally independent of one another.
6. Posterior,
7. Prior.
8. Tuple.
9. In that case, class prior probabilities are assumed to be equally likely, i.e. P(C1) = P(C2) = … =
P(Cm).
10. It is based on the assumption that there exists class conditional independence; in practice,
however, dependencies do exist between variables.
8.10 References/ Suggested Readings
1. Data Mining: Concepts and Techniques by J. Han and M. Kamber, Morgan Kaufmann Publishers.
2. Advanced Data Warehouse Design (From Conventional to Spatial and Temporal Applications) by
Elzbieta Malinowski and Esteban Zimányi, Springer.
3. Modern Data Warehousing, Mining and Visualization by George M. Marakas, Pearson.
8.11 Model Questions
1. Define classification model of data mining. Give an example.
2. Explain the Bayes' theorem.
3. Explain the working of Naïve Bayesian Classification.
4. Draw a Bayesian Belief Network by taking any example.
Lesson- 9 Classification Techniques- 2
Structure
9.0 Objective
9.1 Introduction
9.2 k-Nearest-Neighbor Classifiers
9.3 Case-Based Reasoning
9.4 Genetic Algorithms
9.5 Rough Set Approach
9.6 Fuzzy Set Approaches
9.7 Classification by Backpropagation
9.8 Summary
9.9 Glossary
9.10 Answers to check your progress/self assessment questions
9.11 References/ Suggested Readings
9.12 Model Questions
9.0 Objective
After studying this lesson, students will be able to:
1. Explain the working of k-nearest-neighbor classifier.
2. Discuss the usage of case-based reasoning.
3. Describe the steps involved in genetic algorithms.
4. State the benefits of using Rough set approach to classification.
5. Justify how fuzzy set approach is better than rule based classification.
6. Write the algorithm for backpropagation.
9.1 Introduction
In the last lesson you got an idea of what classification in data mining is. You also learned about a
probability-theory-based classification method called Bayesian classification. A number of other
classification techniques exist that can be used for predicting class labels with a lower error rate.
This lesson discusses some of these methods.
9.2 k-Nearest-Neighbor Classifiers
Pattern recognition is one example where k-nearest neighbor classifier is used. Nearest-neighbor
classifiers are designed to learn by comparing a given test tuple with training tuples similar to it. The
training tuples are described by n attributes. Each tuple represents a point in an n-dimensional space
and all training tuples are stored in an n-dimensional pattern space. The k-nearest-neighbor classifier
works by searching the entire pattern space for the k training tuples that are closest to a given
unknown tuple. These k closest training tuples are called the k "nearest neighbors" of the unknown
tuple.
How do you define closeness? It can be defined in terms of a distance metric, such as the Euclidean
distance. The Euclidean distance between two points or tuples, say X1 = (x11, x12, … , x1n) and
X2 = (x21, x22, … , x2n), is
dist(X1, X2) = √( (x11 - x21)² + (x12 - x22)² + … + (x1n - x2n)² ) (9.1)
For each numeric attribute, the difference between the corresponding values for that attribute in tuple
X1 and X2 is computed, then the difference is squared, and accumulated. Finally the square root for the
total accumulated distance is evaluated. It is useful to normalize the attributes to prevent attributes
with large ranges (e.g., income) from outweighing attributes with smaller ranges (e.g., a binary
attribute such as Makes_Investment). Min-max normalization can be used to convert a value v of a
numeric attribute A to v' in the range [0, 1]:
v' = (v - minA) / (maxA - minA) (9.2)
where minA and maxA refer to the minimum and maximum values of attribute A.
It is also possible to compute the distance for categorical values. If the corresponding values for an
attribute in tuples X1 and X2 are identical, the difference between the two is taken as 0. If the
corresponding values for an attribute in tuples X1 and X2 are not identical, the difference is considered
to be 1.
It may also be possible that the value of a given attribute A is missing in tuple X1 and/or in
tuple X2. In such a case, the maximum possible difference is assumed. Suppose that each of the attributes
has been mapped to the range [0, 1].
For categorical attributes, the difference is assumed to be 1 if the corresponding value is
missing in one or both of the tuples. For numeric attributes, the difference is assumed to be 1 if the
value is missing in both tuples. If the value is missing in only one tuple and the other, say v', is
present, the difference is taken as |1 - v'| or |0 - v'|, whichever is greater.
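The per-attribute difference rules above can be sketched as follows; attr_diff and distance are hypothetical helper names, and numeric values are assumed to be min-max normalized already:

```python
import math

def attr_diff(v1, v2, numeric):
    """Per-attribute difference following the rules above; numeric values
    are assumed already min-max normalized into [0, 1]."""
    if v1 is None and v2 is None:
        return 1.0                      # both missing: maximum difference
    if v1 is None or v2 is None:
        if not numeric:
            return 1.0                  # categorical with a missing value
        v = v1 if v1 is not None else v2
        return max(abs(1 - v), abs(0 - v))   # numeric worst-case difference
    if numeric:
        return abs(v1 - v2)
    return 0.0 if v1 == v2 else 1.0     # categorical: identical or not

def distance(x1, x2, numeric_flags):
    """Euclidean distance (Equation 9.1) over mixed attributes."""
    return math.sqrt(sum(attr_diff(a, b, f) ** 2
                         for a, b, f in zip(x1, x2, numeric_flags)))

# Example: one normalized numeric attribute and one categorical attribute.
d = distance((0.25, "yes"), (0.75, "no"), (True, False))
print(round(d, 4))  # sqrt(0.5**2 + 1**2) = 1.118
```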
What should be the value of k for the k-nearest-neighbor classifier? It can be determined iteratively.
Compute the error rate of the classifier for a test set with k = 1. Increment the value of k and compute
the error rate of the classifier for a test set again. Repeat the process for a large value of k. Select the
value of k that gives the minimum error rate. In general, the value of k depends on the size of the
training tuples.
For training database D, of |D| tuples and k = 1, O(|D|) comparisons are needed to classify a given test
tuple. It is possible to reduce the number of comparisons to O(log |D|) by pre-sorting and arranging
the stored tuples into search trees. Also, the parallel implementation can be used to reduce the running
time to almost a constant O(1).
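Putting the pieces together, a minimal k-nearest-neighbor classifier might look like this; the training data and the knn_classify helper name are made up for illustration:

```python
import math
from collections import Counter

def euclidean(x1, x2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(train, query, k):
    """Return the majority class among the k training tuples closest to
    `query`. `train` holds (attribute_tuple, class_label) pairs with
    attributes already min-max normalized to [0, 1]."""
    neighbors = sorted(train, key=lambda t: euclidean(t[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy training set: two normalized attributes, two classes.
train = [((0.1, 0.2), "A"), ((0.2, 0.1), "A"),
         ((0.8, 0.9), "B"), ((0.9, 0.8), "B")]
print(knn_classify(train, (0.15, 0.15), k=3))  # "A"
```

To pick k as the text suggests, one would wrap knn_classify in a loop over increasing k values, measure the error rate on a test set at each value, and keep the k with the minimum error.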
9.3 Case-Based Reasoning (CBR)
A database containing solutions to various problems is used by a CBR classifier to solve new problems.
The tuples or “cases” for problem solving using CBR are saved as complex symbolic descriptions.
Customer service help desks and answers to frequently asked questions are popular examples of CBR
classifiers. CBR is also used in medical education.
For every new case, the CBR classifier first searches for a matching training case. The solution of a
training case is returned if it is found to be identical to the new case. In case no identical match is found, CBR
searches for training cases with components similar to those of the new case. These training cases
may be considered as neighbors of the new case. In case incompatibilities arise with the individual
solutions, you can backtrack to search for other solutions. Graph is the best data structure that can be
used to logically maintain training cases for CBR classifier.
The success of CBR depends on finding a good similarity metric. Indexing of training cases and the
development of efficient indexing techniques are key to the success of a CBR classifier.
Check your progress/ Self assessment questions- 1
Q1. The k-nearest neighbor classifier is not used for Pattern recognition.
a. TRUE
b. FALSE.
____________________________________________________________________________
Q2. ____________________ containing solutions to various problems is used by a CBR classifier
to solve new problems.
Q3. C in CBR stands for
a. Categorical
b. Class
c. Case
d. Classifier
9.4 Genetic Algorithms
Genetic algorithms are based on natural evolution. They work as follows:
You need to create an initial population consisting of randomly generated rules. A string of bits is
used to represent each rule. Suppose that samples in a given training set are described using Boolean
attributes Is_Consistent and Not_Consistent, and there are two classes, Likely_to_pass and
Not_Likely_to_pass. The rule "IF Is_Consistent THEN Likely_to_pass" can be encoded as the bit
string "11", where the leftmost bit represents the attribute Is_Consistent and the rightmost bit
represents the class Likely_to_pass. If an attribute has k > 2 values, k bits are needed to encode the attribute's values.
Then a new population is formed that consists of the fittest rules in the current population along with
the offspring of these rules. This follows the principle of "survival of the fittest". Two genetic
operators, crossover and mutation, are used to create the offspring. In crossover, substrings from a
given pair of rules are swapped to form a new pair of rules; in mutation, randomly selected bits in a
rule's string are inverted.
The process of forming new population and offspring based on prior populations of rules continues
until a population, P, evolves where each rule in P satisfies a pre-specified fitness threshold.
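The loop described above can be sketched as follows. The fitness function here simply counts bits matching a fixed target string, standing in for the classification accuracy a real system would measure on the training set; all names, sizes, and parameters are illustrative:

```python
import random

random.seed(42)

# Hypothetical fitness: fraction of bits matching an "ideal" rule string.
TARGET = "1101"

def fitness(rule):
    return sum(a == b for a, b in zip(rule, TARGET)) / len(TARGET)

def crossover(r1, r2):
    """Swap substrings after a random cut point to form two offspring."""
    cut = random.randrange(1, len(r1))
    return r1[:cut] + r2[cut:], r2[:cut] + r1[cut:]

def mutate(rule, rate=0.1):
    """Invert randomly selected bits in the rule's string."""
    return "".join(b if random.random() > rate else str(1 - int(b))
                   for b in rule)

# Initial population of randomly generated 4-bit rules.
population = ["".join(random.choice("01") for _ in range(4))
              for _ in range(6)]

threshold = 1.0  # pre-specified fitness threshold
for generation in range(100):
    if max(fitness(r) for r in population) >= threshold:
        break
    # Keep the fittest rules and breed offspring from them.
    parents = sorted(population, key=fitness, reverse=True)[:2]
    child1, child2 = crossover(*parents)
    population = parents + [mutate(child1), mutate(child2),
                            mutate(parents[0]), mutate(parents[1])]

best = max(population, key=fitness)
print(best, fitness(best))
```

Because the two fittest rules survive unchanged into each new population, the best fitness never decreases across generations.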
9.5 Rough Set Approach
Discovery of structural relationships within noisy data is the main objective of the rough set theory
of classification. It can only be applied to discrete-valued attributes. You can apply the rough set
approach to continuous-valued attributes by first converting them into discrete values.
The idea of rough set approach is to establish equivalence classes within the given training data. All
data tuples or cases forming an equivalence class are identical with respect to the attributes describing
the data. Rough set approach is used to approximately define classes that cannot be distinguished in
terms of the available attributes. The lower approximation of a class C refers to the data tuples that
with certainty belong to C (based on some attribute knowledge), without ambiguity. The upper
approximation of C refers to the data tuples that cannot be described as not belonging to C (based on
some attribute knowledge). A decision table can be used to represent the decision rules generated for each class.
Rough sets can also be used for attribute reduction. Attribute reduction refers to identification and
removal of attributes that do not contribute toward the classification and relevance analysis.
A discernibility matrix is used to store the differences between attribute values for each pair of data
tuples. The matrix is then searched to detect redundant attributes, rather than having to search the
entire training set.
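A small sketch of lower and upper approximations, using a hypothetical table of discrete-valued tuples; the last two tuples are indiscernible by their attributes yet carry different class labels, so their equivalence class falls in the upper but not the lower approximation of class C:

```python
from collections import defaultdict

# (condition attributes) -> class label; hypothetical data. Note the last
# two tuples share attribute values but have different labels.
data = [
    (("high", "yes"), "C"),
    (("high", "no"),  "C"),
    (("low",  "no"),  "notC"),
    (("low",  "yes"), "C"),
    (("low",  "yes"), "notC"),
]

# Group tuples into equivalence classes by their attribute values.
equiv = defaultdict(list)
for attrs, label in data:
    equiv[attrs].append(label)

# Lower approximation: equivalence classes that certainly belong to C.
lower = {a for a, labels in equiv.items() if all(l == "C" for l in labels)}
# Upper approximation: classes that cannot be ruled out of C.
upper = {a for a, labels in equiv.items() if any(l == "C" for l in labels)}

print(sorted(lower))  # [('high', 'no'), ('high', 'yes')]
print(sorted(upper))  # lower plus the ambiguous ('low', 'yes') class
```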
Check your progress/ Self assessment questions- 2
Q4. Genetic algorithms create an initial population consisting of randomly generated ______.
Q5.Discovery of structural relationships within _________ is the main objective of rough set
theory of classification.
Q6. You cannot apply rough set approach to continuous-valued attributes. (TRUE / FALSE).
__________________________________________________________________________
9.6 Fuzzy Set Approaches
Rule-based systems for classification are not always effective for continuous attributes. For instance,
consider the following rule for loan approval.
Loan applications from applicants who have been in service for 2 or more years and have a monthly
income of 50,000/- or more are approved:
IF (years_in_current_service >= 2) AND (salary >= 50000) THEN loan = approved
If the rule is implemented strictly, an applicant whose salary is 49,000/- per month will not be
sanctioned the loan, whereas an applicant with a salary of 50,000/- per month will be sanctioned
the loan, even if his/her years in service are far fewer than the first applicant's. This is unfair
and lacks sensibility.
It will be better to discretize salary into categories such as {low_salary, medium_salary, high_salary},
and then let "fuzzy" thresholds or boundaries be defined for each category. Fuzzy logic uses truth
values between 0.0 and 1.0, instead of an exact cutoff between categories, to represent the degree of
membership a given value has in a given category. A fuzzy set is then used to represent each category.
How does fuzzy logic, then, treat a salary of 49,000/- per month? Fuzzy logic states that a salary
of 49,000/- per month is more or less high, even though it is not as high as 50,000/-. Fuzzy logic
also provides a graphical tool that helps in converting attribute values to fuzzy truth values.
Fuzzy set theory is also called possibility theory. It lets you work at a high level of abstraction and
offers a means for dealing with inexact measurement of data. What is inexact? For example, if
50,000/- is high, what about 49,000/-? It surely cannot be merely medium. In fuzzy set theory,
elements may belong to more than one fuzzy set. Taking the same example again, 49,000/- may belong
to both the medium and high fuzzy sets. Consider the following fuzzy set notation:
m_medium_salary(49000) = 0.20 and m_high_salary(49000) = 0.90,
where m is the membership function operating on the fuzzy sets medium_salary and high_salary,
respectively. In fuzzy set theory, the membership values of an element in different fuzzy sets need not sum to 1.
Fuzzy set theory is particularly useful for rule-based classification tasks. It is possible to provide
operations for combining fuzzy measurements. For example, fuzzy sets for salary can be combined
with the fuzzy sets junior_employee and senior_employee for the attribute years_in_current_service.
More than one fuzzy rule may apply to a given tuple. Each applicable rule
contributes a vote for membership in the categories. A number of methods exist for translating the
resulting fuzzy output into a defuzzified value that is returned by the system.
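The salary example can be sketched with simple piecewise-linear membership functions. The breakpoints (30,000 to 50,000) are hypothetical choices, so the resulting membership degrees differ slightly from the 0.20/0.90 figures quoted above:

```python
# Hypothetical fuzzy membership functions for the salary categories; real
# fuzzy boundaries would be chosen by a domain expert.

def rising(x, lo, hi):
    """Membership ramps linearly from 0 at `lo` to 1 at `hi`."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)

def m_high_salary(salary):
    return rising(salary, 40000, 50000)

def m_medium_salary(salary):
    # Triangular: peaks at 40000, fades out toward 30000 and 50000.
    return min(rising(salary, 30000, 40000),
               1 - rising(salary, 40000, 50000))

# A salary of 49,000 is "more or less high": it belongs to both the
# medium and high fuzzy sets at once, with different degrees.
print(m_high_salary(49000))    # 0.9
print(m_medium_salary(49000))  # 0.1
```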
9.7 Classification by Backpropagation
Backpropagation is an example of a neural network learning algorithm. A neural network is a set of
connected input/output units, and each connection has a weight associated with it. A neural network
learns by adjusting the values of the weights so as to predict the correct class label of the input
tuples. Neural networks are highly tolerant to noisy data. They are also good at classifying patterns
on which they have not been trained, known as unseen patterns. Neural networks are best for
continuous-valued inputs and outputs. Some of the applications of neural networks are character
recognition, robot design, etc. Amongst the many neural network algorithms, backpropagation is the
most popular. The backpropagation algorithm can be implemented on multilayer feed-forward
networks.
A multilayer feed-forward neural network consists of an input layer, hidden layers, and an output
layer.
Figure 9.1 Multilayer feed-forward neural network
A multilayer feed-forward neural network may consist of multiple hidden layers. Each layer in a
multilayer feed-forward neural network is made up of units. The attributes of each training tuple
form the inputs to the network. The inputs are weighted and fed to the first hidden layer. The outputs
of the first hidden layer's units are passed to the second hidden layer, and so on. The last hidden
layer passes its weighted outputs to the output layer, which produces the prediction of the class label
for the given tuple. Units of the input layer are called input units; units of the hidden and output layers are called neurodes.
Figure 9.2: A multilayer feed-forward neural network with weights.
A multilayer feed-forward network is also fully connected: the output of each unit in one layer
provides input to each unit in the next forward layer.
Each unit accepts as input a weighted sum of the outputs of all units in the previous layer. A nonlinear
(activation) function is then applied to the weighted input. Given enough hidden units and a large
enough training sample, a multilayer feed-forward network can closely approximate any function.
Backpropagation processes a data set of training tuples iteratively, comparing the prediction made by
the network for each tuple with the actual known target value. The target value can either be a known
class label or continuous value. The weights for each training tuple are modified to minimize the
mean squared error between the network’s prediction and the actual target value. These modifications
are made in the “backwards” direction. The steps in the algorithm are expressed in terms of inputs,
outputs, and errors.
Check your progress/ Self assessment questions- 3
Q7. Fuzzy set approach to classification is better than rule-based classification. (TRUE/
FALSE).
__________________________________________________________________________
Q8. Neural networks are highly tolerant to noisy data. (TRUE / FALSE).
__________________________________________________________________________
Algorithm: Backpropagation.
1. Initialize the weighted links. Typically the weights are initialized to small random numbers.
2. Then, for each training example in the training set:
Input the training data to the input nodes, then calculate Ok, the output of node k. This is
done for each node in the hidden layer(s) and the output layer.
3. Then calculate δk for each output node, where tk is the target output of the node:
δk ← Ok(1 – Ok)(tk – Ok)
4. Now calculate δh for each hidden node h, summing over the nodes k that h feeds into:
δh ← Oh(1 – Oh) Σk wh,k · δk
5. Finally, adjust the weights of all the links, where xi is the activation (input) and η is the learning rate:
wi,j ← wi,j + η δj xi
A neural network is trained over many iterations of the training set to find an acceptable
approximation of the function it is being trained on.
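The five steps can be sketched on a tiny 2-2-1 network trained on a single hypothetical tuple; the layer sizes, learning rate, and data are illustrative, and bias terms are omitted for brevity:

```python
import math
import random

random.seed(1)
sigmoid = lambda x: 1 / (1 + math.exp(-x))

# Step 1: initialize weights to small random numbers.
w_ih = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(2)]
w_ho = [random.uniform(-0.5, 0.5) for _ in range(2)]
eta = 0.5  # learning rate

def forward(x):
    # Step 2: compute each node's output from the weighted sum of inputs.
    h = [sigmoid(sum(w_ih[j][i] * x[i] for i in range(2))) for j in range(2)]
    o = sigmoid(sum(w_ho[j] * h[j] for j in range(2)))
    return h, o

def train_step(x, t):
    h, o = forward(x)
    # Step 3: delta for the output node.
    delta_o = o * (1 - o) * (t - o)
    # Step 4: delta for each hidden node.
    delta_h = [h[j] * (1 - h[j]) * w_ho[j] * delta_o for j in range(2)]
    # Step 5: adjust all weights, working backwards.
    for j in range(2):
        w_ho[j] += eta * delta_o * h[j]
        for i in range(2):
            w_ih[j][i] += eta * delta_h[j] * x[i]
    return (t - o) ** 2  # squared error before this update

x, t = (1.0, 0.0), 1.0   # hypothetical training tuple and target
first = train_step(x, t)
for _ in range(200):
    last = train_step(x, t)
print(first, last)  # the squared error shrinks over the iterations
```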
9.8 Summary
The k-nearest-neighbor classifier works by searching the entire pattern space for the k training tuples
that are closest to a given unknown tuple. These k closest training tuples are called the k "nearest
neighbors" of the unknown tuple. A database containing solutions to various problems is used by a
CBR classifier to solve new problems. For every new case, the CBR classifier first searches for a
matching training case. The solution of a training case is returned if it is found to be identical to
the new case. Customer service help desks and answers to frequently asked questions are popular
examples of CBR classifiers. CBR is also used in medical education. Genetic algorithms are based on
natural evolution. You need to create an initial population consisting of randomly generated rules,
with a string of bits used to represent each rule. Then a new population is formed that consists of the
fittest rules along with the offspring of these rules. Two genetic operators, crossover and mutation,
are used to create the offspring. In crossover, substrings from a given pair of rules are swapped to
form a new pair of rules; in mutation, randomly selected bits in a rule's string are inverted.
Discovery of structural relationships within noisy data is the main objective of the rough set theory
of classification. It can only be applied to discrete-valued attributes; continuous-valued attributes
must first be converted into discrete values. The idea of the rough set approach is to establish
equivalence classes within the given training data. The rough set approach is used to approximately
define classes that cannot be distinguished in terms of the available attributes. Neural networks are
highly tolerant to noisy data, and they are also good at classifying patterns on which they have not
been trained (unseen patterns). Neural networks are best for continuous-valued inputs and outputs.
Some of the applications of neural networks are character recognition, robot design, etc.
9.9 Glossary
Classification- Classification is also known as supervised learning. Classification is used to predict
categorical labels. The classification algorithm builds a model by learning from the training set.
Rough set approach- Rough set approach is used to approximately define classes that cannot be
distinguished in terms of the available attributes.
Multilayer feed-forward neural network- In it the output from a previous layer is fed as input to the
next layer and output from one layer is not connected back to itself or some previous layer.
Fuzzy set- A fuzzy set represents a category with "fuzzy" thresholds or boundaries, allowing elements
degrees of membership between 0.0 and 1.0. For example, salary can be categorized into the fuzzy sets
{low_salary, medium_salary, high_salary}.
9.10 Answers to check your progress/self assessment questions
1. b.
2. Database
3. c.
4. Rules
5. Noisy data
6. FALSE.
7. TRUE.
8. TRUE.
9.11 References/ Suggested Readings
1. Data Mining: Concepts and Techniques by J. Han and M. Kamber, Morgan Kaufmann Publishers.
2. Advanced Data Warehouse Design (From Conventional to Spatial and Temporal Applications) by
Elzbieta Malinowski and Esteban Zimányi, Springer.
3. Modern Data Warehousing, Mining and Visualization by George M. Marakas, Pearson.
9.12 Model Questions
1. What is a multilayer feed-forward neural network?
2. Write the backpropagation algorithm for classification.
3. Explain the process of genetic algorithm.
4. How is the fuzzy set approach better than rule-based classification? Give an example.
5. Explain k-nearest-neighbor approach to classification.
Lesson- 10 Prediction
Structure
10.0 Objective
10.1 Introduction
10.2 Prediction Model
10.3 Regression analysis
10.3.1 Linear regression
10.3.2 Nonlinear Regression
10.4 Summary
10.5 Glossary
10.6 Answers to check your progress/self assessment questions
10.7 References/ Suggested Readings
10.8 Model questions
10.0 Objective
After studying this lesson, students will be able to:
1. Define the prediction model of data mining
2. Discuss the use of regression analysis in prediction model.
3. Describe the two types of regression analysis.
4. Explain the need to study the predictors or classifiers accuracy
10.1 Introduction
Classification is used to classify the data into predefined classes based on some categorical data. In
this lesson you will study another data mining technique called prediction used to predict continuous
values for a given input. Prediction is a very popular tool of data mining and is used mainly in retail
chains to predict the buying patterns of customers over a period of time. In this lesson regression
analysis as a tool of prediction is discussed. Linear and Non-linear regression techniques are discussed
with the help of suitable examples. In the end, the lesson discusses the need to study the accuracy of
these prediction tools. Ultimately the benefit of using any prediction tools greatly depends upon the
accuracy of such tools.
10.2 Prediction Model
In earlier lessons you studied the classification task of data mining and the decision tree as a tool
to classify data. Classification using a decision tree assigns data to categorical classes defined over data ranges.
Predicting continuous values for a given set of inputs is called numeric prediction. Suppose that the
marketing manager of a retail shop wants to predict the purchases that a particular customer will make
during a sale at his/her shop. This is a perfect example of numeric prediction. Such a model is
called a predictor. The most popular and widely used approach to numeric prediction is a statistical
methodology called regression. Some classification techniques can also be customized for numeric
prediction such as, back propagation, k-nearest-neighbor classifiers, or support vector machines.
In prediction, the attribute being predicted is referred to as the predicted attribute rather than the
class label attribute. For example, instead of predicting whether it would be safe to sanction a loan
to a particular customer, the numeric prediction model is used to predict the amount that is
considered safe for the bank to advance to that customer. The classification model, in contrast, is
concerned with labeling tuples with pre-defined categorical classes. Simply replace the class label
attribute (loan_decision), used to classify whether it is safe to advance the loan, with a
continuous-valued attribute (loan_amount), used to predict the amount considered safe to advance.
Prediction may be defined as a function, y = f(X), where X refers to the input and y refers to the
output, which is a continuous-valued attribute. In other words, you predict the value of y with
respect to the input value X. For example, the details of a loan applicant can be the input to a
prediction model, and the loan amount can be the output of that model. As with a classifier, the
accuracy of a predictor can be evaluated on a separate test set; here the error is easy to compute
directly, as the difference between the value predicted by the model and the known value of the output variable y.
10.3 Regression analysis
Regression analysis is a statistical technique used for predicting relationships among variables for
each tuple. The variables under consideration are called independent and dependent variables.
Regression analysis is used to study the relationship between the dependent and independent
variables. Regression analysis shows the change in the value of dependent variable in respect to
change in the value of independent variable keeping value of all other variables fixed. The prediction
target is a function of independent variables called the regression function.
The output variable is also known as response variable. Predictor variables (class labels in classifiers)
are the attributes of interest that describe the tuple. Prediction for the value of response variable is
done in respect to the value of known predictor variables.
Regression analysis does well when all predictor variables are continuous-valued or ordered.
Many data mining problems can be solved using linear regression, and many nonlinear problems
can be transformed into linear ones.
Check your progress/ Self assessment questions- 1
Q1. What is the difference between classification and prediction?
___________________________________________________________________________
__________________________________________________________________________
____________________________________________________________________________
Q2. Define regression analysis.
___________________________________________________________________________
__________________________________________________________________________
____________________________________________________________________________
Q3. Regression analysis does well when all predictor variables are _________________ or
_________.
10.3.1 Linear regression
Linear regression develops a linear equation that explains the relationship between two variables for
some data set D. Of the two variables, one is called the predictor variable, whose value is known, and
the other is called the response variable, whose value is to be predicted. You can construct a regression
model for fitness, i.e. to predict the weight of an individual based on his/her height. You often find
such data in hospitals or gyms. Height in this example is the predictor variable and weight is the response
variable, and both are continuous ordered variables. It is extremely important to determine whether there
exists a relationship between two variables of interest before you attempt to fit a linear model to
observed data set D. A tool called a scatterplot is used to determine the strength of the relationship between
two variables.
A linear regression line can be specified as Y = a + bX,
where X refers to the independent variable and Y to the dependent variable. Both a and b are called
regression coefficients: b refers to the slope of the regression line and a refers to the intercept,
i.e. the value of Y when X = 0.
Figure 10.1: Regression line
Least-squares method is a simple technique for fitting a regression line. It minimizes the sum of
squares of vertical deviations from each data point to the regression line. Vertical deviations are also
called errors. The vertical deviation for a point that lies exactly on the fitted line is 0.
Figure 10.2: Regression line using Least-Squares method.
The least-squares method tries to best fit a data set D of n data pairs (xi, yi), for i = 1, 2, ..., n,
by adjusting the regression coefficients a and b of a model function. In the general case, the model
function has the form f(x, β), where the m adjustable parameters are held in the vector β, and the
method finds the parameter values that "best" fit the data. The least-squares method results in the
best fit when the sum S of squared deviations is minimum:
S = Σi=1..n ri²
The deviation or residual ri is defined as the error, i.e. the difference between the actual value of
the dependent (response) variable and the predicted value for the same:
ri = yi - f(xi, β)
You can compute the values of the regression coefficients (a and b) to find the line of best fit
"graphically" by using a curve-fitting program, e.g. Excel's Trendline. Excel in essence calculates
b and a using these formulae:
b = Σi (xi - x̄)(yi - ȳ) / Σi (xi - x̄)²
a = ȳ - b·x̄
where x̄ refers to the mean of the x values and ȳ refers to the mean of the y values, and the point
(x̄, ȳ) always lies on the line of best fit. In other words, ȳ = a + b·x̄, where ȳ is the average of the yi's and x̄ is the average of the xi's.
Let us consider the following example,
Compute the best fit equation:
X 8 2 11 6 5 4 12 9 6 1
Y 3 10 3 6 8 12 1 4 9 14
Calculate the means of the x-values and the y-values.
x̄ = ( 8 + 2 + 11 + 6 + 5 + 4 + 12 + 9 + 6 + 1 ) / 10 = 6.4
ȳ = ( 3 + 10 + 3 + 6 + 8 + 12 + 1 + 4 + 9 + 14 ) / 10 = 7
Now calculate xi - x̄, yi - ȳ, (xi - x̄)(yi - ȳ), and (xi - x̄)² for each i.
i | xi | yi | xi - x̄ | yi - ȳ | (xi - x̄)(yi - ȳ) | (xi - x̄)²
1 | 8 | 3 | 1.6 | -4 | -6.4 | 2.56
2 | 2 | 10 | -4.4 | 3 | -13.2 | 19.36
3 | 11 | 3 | 4.6 | -4 | -18.4 | 21.16
4 | 6 | 6 | -0.4 | -1 | 0.4 | 0.16
5 | 5 | 8 | -1.4 | 1 | -1.4 | 1.96
6 | 4 | 12 | -2.4 | 5 | -12 | 5.76
7 | 12 | 1 | 5.6 | -6 | -33.6 | 31.36
8 | 9 | 4 | 2.6 | -3 | -7.8 | 6.76
9 | 6 | 9 | -0.4 | 2 | -0.8 | 0.16
10 | 1 | 14 | -5.4 | 7 | -37.8 | 29.16
Calculate the slope.
b = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² = -131 / 118.4 ≈ -1.1
Calculate the y-intercept.
Use the formula a = ȳ - b·x̄ to compute the y-intercept:
a = 7 - (-1.1)(6.4) = 7 + 7.04 ≈ 14.0
Use the slope and y-intercept to form the equation of the line of best fit.
The slope of the line is -1.1 and the y-intercept is 14.0.
Therefore, the equation is y = -1.1x + 14.0.
Draw the line on the scatter plot.
Figure 10.3: Regression line. Source: http://hotmath.com/hotmath_help/topics/line-of-best-fit.html
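The hand computation above can be checked with a short script implementing the least-squares formulas directly on the ten data pairs:

```python
# Least-squares slope and intercept for the worked example above.
xs = [8, 2, 11, 6, 5, 4, 12, 9, 6, 1]
ys = [3, 10, 3, 6, 8, 12, 1, 4, 9, 14]

n = len(xs)
x_bar = sum(xs) / n   # 6.4
y_bar = sum(ys) / n   # 7.0

# Slope: sum of cross-deviations over sum of squared x-deviations.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
# Intercept: the line passes through the point of means.
a = y_bar - b * x_bar

print(round(b, 1), round(a, 1))  # -1.1 14.1
```

The exact intercept is about 14.08; the text's 14.0 comes from rounding the slope to -1.1 before computing the intercept.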
Check your progress/ Self assessment questions- 2
Q4.Define linear regression.
___________________________________________________________________________
__________________________________________________________________________
____________________________________________________________________________
Q5. A tool called ______________ is used to determine the relationship strength between two
variables.
Q6.Define linear regression line.
___________________________________________________________________________
__________________________________________________________________________
____________________________________________________________________________
Q7. Classification is used to classify:
a. Continuous valued variables.
b. Categorical valued variables.
c. Both a and b.
d. None of the above.
Q8. Which method is widely used for linear regression?
a. Best square
b. Least square
c. Most square.
d. None of the above.
10.3.2 Nonlinear Regression
Linear regression fits a straight line to a set of data points: the response variable y is modeled as a linear
function of a single predictor variable x. The best or most accurate relationship, however, is not always straight
but curved. To represent a curved relationship between two variables, you can use nonlinear regression. For
example, if the growth in the dependent variable is exponential in the independent variable, the relationship
between the two is a curve. Nonlinear regression is best suited when the relationship between the response
variable and the predictor variable can be modeled using a polynomial function. The linear least-squares method
can be adapted to this case: it creates new variables that are nonlinear functions of the variables in your data,
so that a model linear in the new variables describes a curved function of the original variables.
Figure 10.4 Non-linear regression line
Polynomial regression is typically of interest when there is only one predictor variable. It can be
modeled by adding polynomial terms to the basic linear model. As already discussed,
transformations can be applied to the original variables to convert the nonlinear model into a linear
one that can then be solved by the method of least squares discussed in the last section.
Consider the following polynomial relationship:
y = a + bx + cx² + dx³         (10.1)
We can convert this polynomial equation to a linear one by defining new variables for the original
variables:
x1 = x, x2 = x², and x3 = x³
Equation 10.1 can now be converted to linear form by transforming the equation using the above
assignments:
y = a + bx1 + cx2 + dx3.
Now this equation can be easily solved using the method of least-squares for linear model using any
software for regression analysis, or manually as explained in the previous section. Polynomial
regression is a special case of multiple or multi-level regression. For instance, the addition of high-
order variables x2, x3, and so on, are simple functions of the single variable x, and can be considered
equivalent to adding new independent variables.
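The substitution can be checked with a small sketch. The cubic coefficients below are purely hypothetical; the point is only that, once x1 = x, x2 = x² and x3 = x³ are defined, the cubic model is linear in the new variables:

```python
# Hypothetical cubic coefficients, chosen only for illustration.
a, b, c, d = 2.0, 1.0, -0.5, 0.25

def cubic(x):
    return a + b * x + c * x ** 2 + d * x ** 3

def linear_in_new_vars(x1, x2, x3):
    # After substituting x1 = x, x2 = x^2, x3 = x^3 the model
    # y = a + b*x1 + c*x2 + d*x3 is linear in x1, x2 and x3.
    return a + b * x1 + c * x2 + d * x3

for x in [-2.0, 0.0, 1.5, 3.0]:
    assert abs(cubic(x) - linear_in_new_vars(x, x ** 2, x ** 3)) < 1e-12
print("the transformed linear model matches the cubic")
```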
Some nonlinear models are inflexible and cannot be converted into an equivalent linear form. For such models,
least-squares estimates can be obtained only through extensive calculations on more complex formulae.
Typically, a regression analysis is extremely accurate for prediction, except when the data contain
outliers. Having outliers is a problem with the data, not with the analysis technique. Outliers
are data points that are highly inconsistent as compared to the remaining data.
Figure 10.5: Outliers in a data set.
Outlier detection is a topic that needs separate and special mention. When applying such techniques to
detect and remove outliers, we must take care not to remove any meaningful data point from the
data set.
Check your progress/ Self assessment questions- 3
Q9. When do you use nonlinear regression analysis? Give an example.
___________________________________________________________________________
__________________________________________________________________________
____________________________________________________________________________
Q10. Some nonlinear models are __________________ and cannot be converted into equivalent
linear form.
Q11. What is an outlier?
___________________________________________________________________________
___________________________________________________________________________
Q12. Prediction accuracy using regression analysis suffers when the data contain____________.
10.4 Summary
Classification is used to classify the data into predefined classes based on some categorical
data.Classification using decision tree is used to classify categorical data that defines data ranges.
Predicting continuous values for a given set of input is called numeric prediction. The most popular
and widely used approach to numeric prediction is a statistical methodology called regression.
Prediction may be defined as a function, y= f (X), where X refers to the input and y refers to the output
which is a continuous-valued attribute. Regression analysis is a statistical technique used for predicting
relationships among variables for each tuple. Regression analysis does well when all predictor
variables are also continuous-valued or ordered. Linear regression develops a linear equation that
explains the relationship between 2 variables for some data set D. A linear regression line can be
specified as Y = a + bX, where X refers to an independent variable and Y refers to a dependent
variable. Both b and a are called regression coefficients: b refers to the slope of the regression line and
a refers to intercept or value of y when x = 0. To represent a curved relationship between two
variables, you can use nonlinear regression. Nonlinear regression is best suited when the relationship
between the response variable and the predictor variable may be modeled using a polynomial
function. Polynomial regression is a special case of multiple or multi-level regression. Some nonlinear
models are inflexible and cannot be converted into equivalent linear form. Typically, a regression
analysis is extremely accurate for prediction, except when the data contain outliers. Outliers are data
points that are highly inconsistent as compared to the remaining data.
10.5 Glossary
Prediction- Model used to predict continuous or ordered values for a given set of input.
Regression analysis - Statistical technique used for predicting relationships between the dependent
and independent variables.
Scatterplot- A tool used to determine the relationship strength between two variables.
Outlier- Outlier is a data point that is highly inconsistent as compared to remaining data points.
Classification- Classification is used to classify the data into predefined classes based on some
categorical data.
Least square method- It is a simple technique for fitting a regression line. It minimizes the sum of
squares of vertical deviations from each data point to the regression line.
10.6 Answers to check your progress/self assessment questions
1. Classification is used to classify data into predefined categorical classes, whereas numeric
prediction model is used to predict continuous values for a given set of input.
2. Regression analysis is a statistical technique used for predicting relationships between the
dependent and independent variables. Regression analysis shows the change in the value of the dependent
variable in respect to change in the value of independent variable keeping value of all other variables
fixed.
3. continuous-valued, ordered.
4. Linear regression develops a linear equation that explains the relationship between 2 variables for
some data set D. Of the two variables, one is called the predictor variable whose value is known and
the other is called response variable whose value is to be predicted.
5. Scatterplot.
6. A linear regression line can be specified as Y = a + bX, where X refers to an independent variable
and Y refers to a dependent variable. Both b and a are called regression coefficients, b refers to the
slope of the regression line and a refers to the intercept, or the value of y when x = 0.
7. c.
8. b.
9. Sometimes the best line fit is not the straight line, but a curved one. A curved relationship between
two variables can be represented using nonlinear regression. For example, if the growth in dependent
variable is exponential to independent variable, the relationship between the two is a curve.
10. Inflexible.
11. Outliers are data points that are highly inconsistent as compared to remaining data.
12. Outliers.
10.7 References/ Suggested Readings
1. Data Mining: Concepts and Techniques by J. Han and M. Kamber PublisherMorgan Kaufmann
Publishers
2. Advanced Data warehouse Design (from conventional to spatial and temporal applications) by
Elzbieta Malinowski and Esteban Zimányi Publisher Springer
3. Modern Data Warehousing, Mining and Visualization by George M Marakas, Publisher Pearson.
4. Data Warehousing, Data Mining, & Olap by Alex Berson and Stephen J smith, Tata McGraw-Hill
Education.
5. Data Mining and Data Warehousing by Bharat Bhushan Agarwal and Sumit Prakash Tayal,
University Science Press.
6. Data Mining: Technologies, Techniques, tools and Trends by Bhavani Thuraisingham.
10.8 Model questions
1. Define numeric prediction model.
2. Define linear regression.
3. Why do we need nonlinear regression?
4. Give an example of outlier.
5. Can you always convert a nonlinear model into equivalent linear form?
6. Explain least square method.
Lesson- 11 Introduction to clustering
Structure
11.0 Objective
11.1 Introduction
11.2 Clustering
11.3 Cluster Types
11.4 Types of Data in Cluster Analysis
11.4.1 Interval-Scaled Variables
11.4.2 Binary Variables
11.4.3 Categorical Variables
11.4.4 Ordinal Variables
11.4.5 Ratio-Scaled Variables
11.5 Summary
11.6 Glossary
11.7 Answers to check your progress/self assessment questions
11.8 References/ Suggested Readings
11.9 Model Questions
11.0 Objective
After Studying this lesson, students will be able to:
1. Define the concept of clustering.
2. Differentiate between clustering and classification.
3. Describe various cluster types.
4. Explain different type of data used in cluster analysis.
11.1 Introduction
Clustering is another important type of analysis method used in data mining. It is widely used in data
mining applications, including market research, pattern recognition, data analysis, image processing,
biology, processing of documents on the web, etc. The next two lessons focus on what is meant by
clustering and on various clustering techniques.
11.2 Clustering
Cluster analysis is used to group together objects based on their relationships. It results in a clear
separation of objects that are not similar to each other. Clustering comes under undirected data
mining. The process of undirected data mining is both different from and similar to the process for
directed data mining. Both directed and undirected mining work with applications that require
exploration and understanding of the data. Both techniques are improved by including intelligent
variables in the data that identify different aspects of the business. Undirected data mining differs
from directed data mining in not having a target variable, and this poses the following challenges:
As there is no target variable, the human interpretation is very important. Hence, the process
of undirected data mining cannot be fully automated.
The measures for clustering are more qualitative than the ones associated with directed
techniques. There are no simple statistical measures such as the CCR or the R2 value for summarizing
the goodness of the results. Instead, undirected mining uses descriptive statistics and visualization for
summarizing the results.
Clustering differs from classification, although clustering can lead to classification.
Classification is preferred when the classes are pre-defined or known in advance, whereas in
clustering we derive the classes on the basis of similarity between objects. Objects in one class are
dissimilar to the objects in another class.
Some of the applications or uses of clustering technique in data mining are as follows:
1. Clustering is used widely in applications such as market research, where the prime objectives are
understanding customer behaviour in order to improve the customer base and to provide better
facilities. Sometimes it is even used to study the time patterns at which there is a great amount of rush
as compared to other time slots.
2. Clustering is also used to find the similar patterns or more popularly known as pattern
recognition. Being able to identify patterns and make scientific reasoning for the same helps in better
decision making.
3. Clustering is also useful in classification of documents on the web. It is particularly useful in
designing of search engines and search engine optimization.
4. Clustering is frequently used by the bankers to identify clusters that are most likely to be
fraudulent. It is particularly useful in case of predicting credit card frauds.
5. Clustering in data mining acts as a tool to gain insight into the distribution of data to observe
characteristics of each cluster.
Initially, all the data points or objects in the database belong to one cluster; you can say that they are
all one group, even though they do not yet share any similarities. In the initial view, all data points are
shown using the same symbol. Even so, a simple look at these points is enough to identify the
similarities and the dissimilarities between them.
The most basic type of clustering would be to divide the data points into two clusters. One consisting
of all data points on the left side and the other consisting of all data points on the right side.
The two clusters can in turn be divided into sub-clusters on the basis of similarity or relationships
among the data points. Both the left and right clusters can be further divided into two clusters each:
four data points on the lower side of the left cluster and four data points on the upper side of the right
cluster are dissimilar to the other data points in the same cluster, and hence can be assigned to
separate clusters. Careful analysis of the resulting clusters can lead to the formation of two more
clusters: the two clusters of eight data points each, one on the left and the other on the right, can each
be divided into two clusters.
It is not just the shape of the data elements, but also the density, or the difference between data points
within the same cluster and across clusters, that makes the clusters feel quite obvious. How well a
given data set can be divided into clusters depends not only on the clustering algorithm, but also on
how well the data points are related or unrelated to each other. Once the data points have been
partitioned into groups or clusters, the next step is to assign labels to the newly formed clusters.
Clustering is a far less expensive technique as compared to the classification technique, and it is much
more adaptive to changes.
Check your progress/ Self assessment questions- 1
Q1. What is the difference between classification and clustering?
____________________________________________________________________
____________________________________________________________________
Q2. Clustering is an example of directed data mining. (TRUE / FALSE ).
____________________________________________________________________
____________________________________________________________________
Q3. Clustering is used in the field of pattern recognition. (TRUE / FALSE ).
____________________________________________________________________
11.3 Cluster Types
1. Well-Separated Cluster: It consists of objects or set of points such that, one object or point in a
cluster is nearer to other objects in the same cluster as compared to any other object from other
clusters.
2. Center-based Cluster: In this type of clustering method, a set of randomly selected points or
objects are considered to be the center of each cluster. Each point is assigned to the cluster whose
center is closest to that point as compared to the center of other clusters.
3. Nearest-neighbour based Cluster: A cluster in which each point is closer (more similar) to at least
one other point in the same cluster than to any point in a different cluster.
4. Density-based Cluster: A density-based cluster is a dense region of points separated from other
dense regions by regions of low density. This definition is used when the data points are highly
inconsistent and a number of outliers exist, so that well-separated clusters cannot be defined.
11.4 Types of Data in Cluster Analysis
Now it is time to study the types of data used in cluster analysis and how they are used for analysis. Data
to be clustered may represent income, sales, documents, countries, and so on. Typical clustering
algorithms operate on either of the following two data types: a data matrix, which stores objects
described by their variables, or a dissimilarity matrix, which stores the pairwise dissimilarities
between objects.
Check your progress/ Self assessment questions- 2
Q4. A ______________________ consists of objects where one object in a cluster is nearer to other
objects in the same cluster as compared to any other object from other clusters.
Q5. A density based cluster is one in which each point is similar to at least one other point in the same
cluster. (TRUE / FALSE ).
____________________________________________________________________
11.4.1 Interval-Scaled Variables
Interval-scaled variables refer to continuous measurements of a roughly linear scale. Some of the
examples of interval-scaled variables are height and weight, latitude and longitude coordinates, marks,
etc. Change in measurement unit can affect the results of clustering analysis. For instance, changing
measurement units from absolute marks to grades, or from Celsius to Fahrenheit for temperature, may
lead to a very different clustering structure. Changing to smaller units generally lead to a larger range
of values for that variable, and thus a larger effect on the resulting clustering structure.
Standardization is used to avoid dependence on the choice of measurement units. It attempts to give
all variables an equal weight. Sometimes there may be a need to give more weight to a certain set of
variables as compared to others. For example, if you are selecting players for spot number 5 and 6 in
20-20 cricket, you may prefer to give more weight to the strike rate than the average of the batsman.
How do you achieve standardization for a variable? The idea is to convert the original
measurements to unit-less variables. Given measurements for a variable f, this can be performed as
follows.
1. Calculate the mean absolute deviation, sf:
sf = (1/n) (|x1f - mf| + |x2f - mf| + … + |xnf - mf|)         (11.1)
where x1f, …, xnf are n measurements of f, and mf is the mean value of f, i.e.
mf = (1/n) (x1f + x2f + … + xnf)
2. Calculate the standardized measurement, or z-score:
zif = (xif - mf) / sf         (11.2)
The mean absolute deviation, sf, handles outliers much better than the standard deviation, σf. For mean
absolute deviation, the deviations from the mean (i.e., |xi f -mf |) are not squared. It helps to reduce the
effect of outliers. Advantage of mean absolute deviation over other measures is that the z-scores of
outliers do not become too small and the outliers remain detectable.
Standardization is not useful for all application types and the choice of standardization should be left
to the user, i.e. whether to implement it or not. Standardization is also known as normalization.
Dissimilarity or similarity between objects described by interval-scaled variables is computed
based on the distance between each pair of objects. One example of a distance measure is the
Euclidean distance, which is defined as
d(i, j) = sqrt((xi1 - xj1)² + (xi2 - xj2)² + … + (xin - xjn)²)         (11.3)
where i = (xi1, xi2, …, xin) and j = (xj1, xj2, …, xjn) are two n-dimensional data objects.
Euclidean distance satisfies the following mathematical requirements of a distance function:
1. d(i, j) >= 0: the distance is a nonnegative number.
2. d(i, i) = 0: the distance of an object to itself is 0.
3. d(i, j) = d(j, i): the distance is a symmetric function.
4. d(i, j) <= d(i, h) + d(h, j): the triangular inequality holds.
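Both ideas, standardization with the mean absolute deviation and the Euclidean distance, can be sketched in a few lines of plain Python; the height values below are hypothetical:

```python
def mean_absolute_deviation(values):
    # s_f = (1/n) * sum(|x_if - m_f|), with m_f the mean of the values
    m = sum(values) / len(values)
    return sum(abs(v - m) for v in values) / len(values)

def z_scores(values):
    # z_if = (x_if - m_f) / s_f
    m = sum(values) / len(values)
    s = mean_absolute_deviation(values)
    return [(v - m) / s for v in values]

def euclidean(i, j):
    # d(i, j) = sqrt(sum_k (x_ik - x_jk)^2)
    return sum((p - q) ** 2 for p, q in zip(i, j)) ** 0.5

heights = [170, 160, 180, 150]        # hypothetical measurements
print(z_scores(heights))              # [0.5, -0.5, 1.5, -1.5]
print(euclidean((1, 2), (4, 6)))      # 5.0
```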
11.4.2 Binary Variables
A binary variable is used to represent two constant states: 0 or 1, where 0 means absent, and 1 means
present. For example, given a variable married, 1 means that the person is married and 0 means the
person is not married. Binary variables cannot be treated as interval-scaled variables as it can lead to
misleading clustering results.
Computing the dissimilarity using binary variables involve computing a dissimilarity matrix from the
given binary data. A 2-by-2 contingency table can be used to represent the dissimilarity matrix, where
q refers to the number of variables = 1 for both i and j, r refers to the number of variables = 1 for i and
0 for j, s refers to the number of variables = 0 for i and 1 for j, and t refers to the number of variables
= 0 for both objects i and j. The total number of variables is p, where p = q + r + s +t.
A binary variable is symmetric if both of its states are equally valuable and carry the same weight.
For example, the gender variable collected during registration is a symmetric binary variable.
Dissimilarity that is based on symmetric binary variables is called symmetric binary dissimilarity, and
it can be used to assess the dissimilarity between objects i and j:
d(i, j) = (r + s) / (q + r + s + t)         (11.4)
A binary variable is asymmetric if the outcomes of the two states are not equally important, such as
the positive and negative outcomes of a disease test. You should code the most important outcome,
which is usually the rarest one. For example 1 means the test is positive and 0 means the test is
negative. For two asymmetric binary variables, the agreement of two 1s (a positive match) is then
considered more significant than that of two 0s (a negative match). The dissimilarity based on such
variables is called asymmetric binary dissimilarity, where the number of negative matches, t, is
considered unimportant and thus is ignored in the computation:
d(i, j) = (r + s) / (q + r + s)         (11.5)
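Both measures can be sketched directly from the contingency counts q, r, s and t. The two test-result vectors below are hypothetical, with 1 coding the (rarer) positive outcome:

```python
def binary_dissimilarity(i, j, asymmetric=False):
    # q: 1/1 matches, r: 1/0, s: 0/1, t: 0/0 matches
    q = sum(1 for a, b in zip(i, j) if (a, b) == (1, 1))
    r = sum(1 for a, b in zip(i, j) if (a, b) == (1, 0))
    s = sum(1 for a, b in zip(i, j) if (a, b) == (0, 1))
    t = sum(1 for a, b in zip(i, j) if (a, b) == (0, 0))
    if asymmetric:
        # negative matches t are ignored: d = (r + s) / (q + r + s)
        return (r + s) / (q + r + s)
    return (r + s) / (q + r + s + t)  # symmetric case

patient_1 = [1, 0, 1, 0, 0, 0]  # hypothetical test results
patient_2 = [1, 0, 1, 0, 1, 0]
print(binary_dissimilarity(patient_1, patient_2))                   # 1/6
print(binary_dissimilarity(patient_1, patient_2, asymmetric=True))  # 1/3
```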
Check your progress/ Self assessment questions- 3
Q6. List some of the examples of interval-scaled variables.
____________________________________________________________________
____________________________________________________________________
Q7. _____________ is used to avoid dependence on the choice of measurement units in case of
interval-scaled variables.
Q8. Distance between each pair of objects can be computed using a distance measure
called__________________.
Q9. Which of the following is a type of cluster?
a. Well-separated cluster
b. Centre based cluster
c. Density based cluster
d. All the above
Q10. Which type of variable refers to continuous measurements of a roughly linear scale?
a. Categorical variable
b. Ordinal variable
c. Inter scaled variable
d. Binary variable
11.4.3 Categorical Variables
Categorical variable is also known as nominal variable and it is a generalization of the binary variable
as it can take on more than two states. For example, grade variable can be used to represent states
like A, B, C, D and F.
Let M be the number of states for a categorical variable. Letters, symbols, or integers can be used to
represent the states. The dissimilarity between two objects using categorical variables can be
computed based on the ratio of mismatches:
d(i, j) = (p - m) / p         (11.6)
where i and j refer to 2 objects, m represents the number of matches and p represents the total number
of variables. A match means that objects i and j represent same state for a variable. Weights are used
to increase the effect of m or to assign greater weight to the matches in variables with larger number
of states.
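An unweighted version of the mismatch ratio can be sketched as follows; the two objects and their three categorical variables are hypothetical:

```python
def categorical_dissimilarity(i, j):
    # d(i, j) = (p - m) / p, with m the number of matching states
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)
    return (p - m) / p

# Hypothetical objects described by grade, colour and size.
obj1 = ("A", "red", "large")
obj2 = ("A", "blue", "large")
print(categorical_dissimilarity(obj1, obj2))  # (3 - 2) / 3
```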
11.4.4 Ordinal Variables
An ordinal variable is similar to a categorical variable, except that the M states of ordinal value are
ordered in a sequence. For example, designations can be represented using sequential order, such as
Registrar, Deputy Registrar, Assistant Registrar, etc. A continuous ordinal variable is like a set of
continuous data of an unknown scale. Ordinal variables can also be obtained by performing
discretization of interval-scaled quantities by splitting the value range into a finite number of classes.
The values of an ordinal variable can be mapped to ranks.
Computing the dissimilarity between objects for ordinal variables is similar to that for interval-scaled
variables. The computation of dissimilarity with respect to a variable f involves the following steps:
1. The value of f for the ith object is xif, and f has Mf ordered states, representing the ranking 1, …, Mf.
Replace each xif by its corresponding rank rif in {1, …, Mf}.
2. Map the range of each variable onto [0.0, 1.0] so that each variable has equal weight. This can be
achieved by replacing the rank rif by:
zif = (rif - 1) / (Mf - 1)         (11.7)
3. Compute the dissimilarity using the Euclidean distance measure for interval-scaled variables,
using zif to represent the f value for the ith object.
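Steps 1 and 2 can be sketched as follows. The list of designation states is taken from the example above and is assumed to be ordered from lowest to highest rank:

```python
def ordinal_to_interval(value, ordered_states):
    # Step 1: replace the value by its rank r_if in 1..M_f.
    r = ordered_states.index(value) + 1
    M = len(ordered_states)
    # Step 2: map the rank onto [0.0, 1.0]: z_if = (r_if - 1) / (M_f - 1)
    return (r - 1) / (M - 1)

states = ["Assistant Registrar", "Deputy Registrar", "Registrar"]
print(ordinal_to_interval("Deputy Registrar", states))  # 0.5
```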
11.4.5 Ratio-Scaled Variables
A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential
scale, approximately following the formula
Pe^(Qt) or Pe^(-Qt)         (11.8)
where P and Q are positive constants, and t represents time. Examples include the decay of a
deadly virus such as Ebola, or the growth of a bacteria population.
Computation of dissimilarity between objects for ratio-scaled variables can be achieved using any
of the following three methods:
1. Treat ratio-scaled variables like interval-scaled variables.
2. Apply a logarithmic transformation to a ratio-scaled variable f having value xif for
object i by using the formula yif = log(xif). The yif values can then be treated as interval-valued.
3. Treat xif as continuous ordinal data and treat the ranks as interval-valued.
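The second method, logarithmic transformation, can be sketched as follows. The bacteria counts are hypothetical, and base-10 logarithms are an arbitrary choice; any base serves:

```python
import math

def log_transform(values):
    # y_if = log(x_if); the results can then be treated as interval-valued
    return [math.log10(v) for v in values]

counts = [100, 1000, 10000, 100000]  # hypothetical bacteria counts
print(log_transform(counts))
```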
11.5 Summary
Cluster analysis is used to group together objects based on their relationships. Clustering comes under
undirected data mining. Classification is preferred when the classes are pre-defined or known in
advance, whereas in clustering we derive classes on the basis of similarity between objects. Some of
the applications where clustering is used include market research, pattern recognition, data analysis,
image processing, biology, processing of documents on the web, etc. Some of the cluster types include
well-separated clusters, center-based clusters, nearest-neighbour based clusters, density-based clusters,
etc. Interval-scaled variables refer to continuous measurements of a roughly linear scale. Some of the
examples of interval-scaled variables are height and weight, latitude and longitude coordinates, marks,
etc. A binary variable is used to represent two constant states: 0 or 1, where 0 means absent, and 1
means present. Computing the dissimilarity using binary variables involve computing a dissimilarity
matrix from the given binary data. Categorical variable is also known as nominal variable and it is
ageneralization of the binary variable as it can take on more than two states. An ordinal variable is
similar to a categorical variable, except that the M states of ordinal value are ordered in a sequence. A
ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential scale.
11.6 Glossary
Clustering- Clustering is used to group together objects based on the intra-cluster similarities and
inter-cluster dissimilarities between the objects.
Classification- It refers to assignment of objects to one of the pre-defined classes.
Standardization- Also known as normalization is used to avoid dependence on the choice of
measurement units in case of interval-scaled variables.
Euclidean distance- It refers to a distance measure andis used to compute the distance between each
pair of objects.
11.7 Answers to check your progress/self assessment questions
1. Classification is preferred when the classes are pre-defined or known in advance, whereas
clustering is used to derive classes on the basis of similarity between objects.
2. FALSE.
3. TRUE.
4. Well-Separated Cluster.
5. FALSE.
6. Some of the examples of interval-scaled variables are height and weight, latitude and longitude
coordinates, marks, etc.
7. Standardization.
8. Euclidean distance.
9. d.
10. c.
11.8 References/ Suggested Readings
1. Data Mining: Concepts and Techniques by J. Han and M. Kamber Publisher
Morgan Kaufmann Publishers
2. Advanced Data warehouse Design (from conventional to spatial and temporal applications) by
Elzbieta Malinowski and Esteban Zimányi Publisher Springer
3. Modern Data Warehousing, Mining and Visualization by George M Marakas,
Publisher Pearson.
11.9 Model Questions
1. Explain different cluster types along with figures.
2. Define cluster and list some of the applications of clusters.
3. What is the difference between clustering and classification?
4. Define standardization.
5. What are categorical or nominal variables? How can you compute the dissimilarity between objects
using categorical variables?
Lesson- 12 Clustering Methods
Structure
12.0 Objective
12.1 Introduction
12.2 Partitioned Clustering
12.2.1 K-means clustering
12.2.2 K-medoid clustering
12.3 Hierarchical clustering
12.3.1 Agglomerative clustering
12.4 Density-Based Methods
12.4.1 DBSCAN
12.5 Summary
12.6 Glossary
12.7 Answers to check your progress/self assessment questions
12.8 References/ Suggested Readings
12.9 Model Questions
12.0 Objective
After Studying this lesson, students will be able to:
1. Explain different methods of clustering.
2. Describe k-mean and k-medoid partitioning methods of clustering.
3. Explain hierarchical methods of clustering called agglomerative clustering.
4. Discuss the density based clustering method called DBSCAN.
5. Write algorithms for all 3 types of clustering methods.
12.1 Introduction
Now that you are aware of what clustering is and of the types of data used in cluster analysis, in this
lesson you will learn the major clustering methods used in data mining. The methods are grouped by
the clustering approach they follow: partitioning-based methods, hierarchical clustering methods and
density-based methods. The lesson contains a brief explanation of each method along with its
algorithm. I am sure you will enjoy reading this lesson.
12.2 Partitioned Clustering
Let us suppose you want to form k clusters for a given data set consisting of n data points or objects; a
partitioning algorithm arranges the data points into k partitions (or fewer than k partitions in case k > n).
One partition means one cluster. The partition clusters are formed on the basis of a threshold value that
defines the similarity or dissimilarity between the data points of one cluster and the data points of other
clusters, i.e. it classifies the data into k groups, which together satisfy the following requirements:
1. Each group must contain at least one object.
2. Each object must belong to exactly one group.
A partitioning method creates an initial k-partitioning and with each iteration attempts to improve
the partitioning by moving objects from one group to another. Grouping of objects in the same cluster
is based on how close or related the objects are to each other.
In this section we will discuss the most basic and most popular data partitioning technique based on
centroid of cluster called K-means clustering.
12.2.1 K-means clustering
K-mean clustering is based on the idea that a center (centroid) can represent a cluster. Centroid is
computed as a mean/ median of points within a cluster. Cluster similarity is measured in regard to the
mean value of the objects in a cluster, which can be viewed as the cluster’s centroid or center of
gravity. A centroid does not need to correspond to an actual data point. The resulting clusters have
high intra-cluster similarity and low inter-cluster similarity. The k-means algorithm proceeds by
randomly selecting k objects, each of which initially represents a cluster mean or center. Each
remaining object is assigned to the cluster whose mean is at the shortest distance from the object. The
algorithm then computes the new mean for each cluster.
For example, consider the following image with circles representing the data points and cross
representing two initial randomly chosen centroids.
Each data point should be assigned to the centroid that is nearest to it; this division of the data points
produces the first two clusters. Visually, this may not yet be the best classification. Compute the
average of the data points in each of the two clusters to obtain two new centroids, and again assign
each data point to the centroid nearest to it. Recompute the two centroids once more and assign the
data points to them; the result is the final clustering or classification. Why call it final? Because if you
compute the mean of the data points in both clusters again, the centroids no longer change.
K-means Clustering Algorithm
1. Select any K points to form initial centroids.
2. Assign each data point to centroid nearest to it.
3. Re-compute mean centroid for all clusters using the data points for each cluster.
4. Jump to step 2, if there is a change in the centroids.
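As an illustration, the four steps above can be sketched in plain Python. This is a minimal sketch, not a production implementation: it assumes 2-D points stored as tuples, uses squared Euclidean distance, and all function and variable names are chosen for the example.

```python
import random

def kmeans(points, k, max_iter=100):
    # Step 1: select any k points as the initial centroids.
    centroids = random.sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign each data point to the centroid nearest to it.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                      + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        # Step 3: re-compute each centroid as the mean of its cluster.
        new_centroids = []
        for i, cl in enumerate(clusters):
            if cl:
                new_centroids.append((sum(p[0] for p in cl) / len(cl),
                                      sum(p[1] for p in cl) / len(cl)))
            else:
                new_centroids.append(centroids[i])  # keep centroid of an empty cluster
        # Step 4: stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

Note that the computed centroids are in general not actual data points, exactly as described above.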
12.2.2 K-medoid clustering
The k-means algorithm is sensitive to outliers, as objects with extremely large values may
substantially distort the distribution of data. In this section, you will learn an algorithm that
diminishes such sensitivity. Instead of taking the mean value of the objects in a cluster as a
reference point, an actual object is picked to represent each cluster. The representative points for
all clusters are called medoids. Each remaining non-selected object is clustered with the
representative object to which it is the most similar. The partitioning is then performed based on
the principle of minimizing the sum of the dissimilarities between each object and its corresponding
reference point.
K-medoid Clustering Algorithm
1. K candidate points are initially selected as medoids that are expected to be the best central points for
a cluster.
2. Assign each non-selected data point to the medoid closest to it.
3. Compute the distance of the non-selected points from their closest candidate medoids and sum this
distance over all points. The configuration with the lowest cost is selected as the new configuration.
If the new configuration differs from the old one, jump to step 2.
4. End.
A change in a medoid leads to the reassignment of data points. A data point may be reassigned to the
new medoid, or to some other medoid that is now at minimum distance, or not reassigned at all. Let
us consider a few cases of reassignment. DPj is the representative point or medoid to which point P is
currently assigned, DPi is another medoid, and DPrandom is the non-selected data point that will
replace DPj as a medoid.
Case 1: Initially DPj and DPi are two medoids, and data point P is closer to DPj and hence is assigned
to DPj. If we replace the medoid DPj with DPrandom, and the distance of P to DPi is less than its
distance to the new medoid DPrandom, the data point P is reassigned to DPi.
Case 2: If, after the replacement, the data point P is closer to the new medoid DPrandom than to DPi,
it is reassigned to DPrandom.
Case 3: Initially the data point P is closer to DPi and is assigned to DPi. After the replacement of DPj
with DPrandom, the data point P is closer to DPrandom than to DPi, so it is reassigned to the
new medoid DPrandom.
There is one more case in which there is no reassignment. The readers are encouraged to
draw that case themselves.
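As a minimal illustration, the idea of picking actual data points as medoids can be sketched in Python. To keep the sketch short, it replaces the iterative swap search described above with an exhaustive search over all candidate medoid sets, which is feasible only for very small data sets; Manhattan distance, 2-D tuple points, and all names are assumptions of the example.

```python
from itertools import combinations

def kmedoids(points, k):
    """Exhaustive k-medoids sketch: try every set of k actual data points
    as medoids and keep the set with the lowest total dissimilarity."""
    def dist(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])  # Manhattan distance
    def cost(medoids):
        # sum of dissimilarities between each point and its closest medoid
        return sum(min(dist(p, m) for m in medoids) for p in points)
    best = min(combinations(points, k), key=cost)
    # assign each point to the medoid to which it is the most similar
    clusters = {m: [] for m in best}
    for p in points:
        closest = min(best, key=lambda m: dist(p, m))
        clusters[closest].append(p)
    return best, clusters
```

Unlike the k-means sketch, the returned medoids are always actual data points, which is what makes the method less sensitive to outliers.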
Check your progress/ Self assessment questions- 1
Q1. Partitioning algorithm classifies the data into k groups that satisfies the following requirements:
___________________________________________________________________________
___________________________________________________________________________
Q2. In case of k-mean clustering, the centroid is computed as a ___________ of points within a
cluster.
Q3. Centroid in k-mean clustering corresponds to an actual data point.
a. TRUE
b. FALSE
Q4. What is the difference between k-mean and k-medoid clustering methods?
___________________________________________________________________________
___________________________________________________________________________
12.3 Hierarchical clustering
A hierarchical method creates a hierarchical decomposition of the given set of data objects. Data
points are grouped together into a tree-like structure of clusters, with clusters of individual points at the
bottom of the tree and a single all-inclusive cluster at the top. There are two techniques to
form this cluster hierarchy, known as agglomerative and divisive, based on how the hierarchical
decomposition is formed.
a) Agglomerative: It is a bottom-up strategy that starts by taking data objects as atomic clusters. It
successively merges the objects or groups that are close to one another, until all of the groups are
merged into one (the topmost level of the hierarchy), or until a termination condition holds.
b) Divisive: It is a top-down strategy that starts by taking one all-inclusive cluster and all objects
belonging to the same cluster. In each successive iteration, a cluster is split up into smaller clusters,
until eventually each object is in one cluster, or until a termination condition holds.
A hierarchical method suffers in terms of flexibility: once a step (merge or split) is done, it can never be
undone. There are two approaches to improving the quality of hierarchical clustering:
(1) Perform careful analysis of object “linkages” at each hierarchical partitioning.
(2) Integrate hierarchical agglomeration with other approaches by first using a hierarchical
agglomerative algorithm to group objects into microclusters, and then performing macroclustering on
the microclusters using another clustering method such as iterative relocation.
12.3.1 Agglomerative clustering
Initially, each data point forms an atomic cluster of its own. We then join the two clusters with
minimum distance; in this example, the clusters with minimum distance are A and B. We proceed further,
repeatedly joining the pair of clusters with minimum distance, and continue this process until we reach a
single all-inclusive cluster.
Figure 12.1 Hierarchical view of this data clustering process
Agglomerative based Hierarchical Clustering Algorithm
1. Compute the proximity matrix.
2. Merge the 2 clusters with minimum distance.
3. Update the proximity matrix to reflect the proximity between the newly formed cluster and the
remaining clusters.
4. If more than one cluster remains, go to step 2.
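A single-linkage version of these steps can be sketched in Python. This is a minimal sketch that recomputes cluster distances on the fly instead of maintaining an explicit proximity matrix; the 2-D tuple representation and all names are assumptions of the example.

```python
def agglomerative(points, num_clusters):
    # Step 1: every data point starts as an atomic cluster.
    clusters = [[p] for p in points]

    def linkage(c1, c2):
        # single link: squared distance between the closest pair of points
        return min((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
                   for a in c1 for b in c2)

    while len(clusters) > num_clusters:
        # Step 2: find and merge the two clusters with minimum distance.
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
        # Step 3 is implicit here: linkage() always sees the current clusters.
    return clusters
```

Stopping at `num_clusters` plays the role of the termination condition mentioned earlier; letting the loop run to a single cluster would build the full hierarchy.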
The divisive strategy is exactly the opposite: you can arrive back at single atomic clusters by
just reversing the order of steps that we discussed in the last example.
Divisive hierarchical clustering algorithm
1. Compute the proximity graph first and then the minimum spanning tree for the same.
2. Create new clusters by removing the link corresponding to the largest distance.
3. Jump to step 2 until only atomic clusters remain.
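This minimum-spanning-tree formulation can be sketched in Python. It is a minimal sketch assuming 2-D tuple points: it builds the tree with Prim's algorithm, cuts the longest links to obtain the requested number of clusters, and reads the clusters off as connected components; all names are chosen for the example.

```python
def divisive_mst(points, num_clusters):
    n = len(points)
    def d(i, j):
        return (points[i][0] - points[j][0]) ** 2 + (points[i][1] - points[j][1]) ** 2
    # Step 1: minimum spanning tree over the proximity graph (Prim's algorithm).
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        i, j = min(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda ij: d(*ij))
        in_tree.add(j)
        edges.append((d(i, j), i, j))
    # Step 2: remove the links with the largest distances (num_clusters - 1 cuts).
    edges.sort()
    keep = edges[:n - num_clusters]
    # Step 3: the remaining connected components are the clusters (union-find).
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for _, i, j in keep:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(points[i])
    return list(groups.values())
```

Cutting all n − 1 links instead of just the longest ones would continue the division down to single atomic clusters, matching step 3 of the algorithm above.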
Check your progress/ Self assessment questions- 2
Q5. In case of hierarchical clustering, data points are grouped together into a ________ like structure
of clusters.
Q6. Two techniques to form cluster hierarchy are known as ______________ and____________.
Q7. Agglomerative clustering is a top-down strategy that starts by taking one all-inclusive cluster and
all objects belonging to the same cluster.
a. TRUE
b. FALSE
12.4 Density-Based Methods
Density-based clustering methods are good at discovering clusters with arbitrary shapes. They
consider clusters as dense regions of objects in the data space that are separated by regions of low
density. DBSCAN is one popular density-based clustering method; it grows clusters according to a
density-based connectivity analysis. OPTICS extends DBSCAN to produce a cluster ordering obtained
from a wide range of parameter settings.
12.4.1 DBSCAN
It stands for Density-Based Spatial Clustering of Applications with Noise and is a density-
based clustering algorithm. The algorithm is used to discover clusters of arbitrary shape in
spatial databases with noise by growing regions with sufficiently high density into clusters. A
cluster using DBSCAN may be defined as a maximal set of density-connected points.
Consider the following definitions:
The neighbourhood within a radius ε of a given object is called the ε-neighbourhood of the object.
The object is called a core object if the ε-neighbourhood of the object contains at least a minimum
number, MinPts, of objects.
Density reachability is the transitive closure of direct density reachability, and this relationship is
asymmetric. Only core objects are mutually density reachable. Density connectivity, however, is a
symmetric relation.
Given a set of objects, D, we say that an object p is directly density-reachable from object q if p is
within the ε-neighbourhood of q, and q is a core object.
An object p is density-reachable from object q with respect to ε and MinPts in a set of objects, D, if
there is a chain of objects p1, …, pn, where p1 = q and pn = p, such that pi+1 is directly density-reachable
from pi with respect to ε and MinPts, for 1 ≤ i < n, pi ∈ D.
An object p is density-connected to object q with respect to ε and MinPts in a set of objects, D, if
there is an object o ∈ D such that both p and q are density-reachable from o with respect to ε and
MinPts.
A density-based cluster is a set of density-connected objects that is maximal with respect to density-
reachability. DBSCAN searches for clusters by checking the ε-neighbourhood of each point in the
database. If the ε-neighbourhood of a point p contains at least MinPts points, a new cluster with p as a
core object is created. DBSCAN then iteratively collects directly density-reachable objects from these
core objects, which may involve the merging of a few density-reachable clusters. The process
terminates when no new point can be added to any cluster.
Figure 12.2: Density reachability and density connectivity in density-based clustering.
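The search just described can be sketched in Python. This is a minimal sketch assuming 2-D tuple points; the label −1 marks noise, and all names are chosen for the example.

```python
def dbscan(points, eps, min_pts):
    labels = {p: None for p in points}  # None = unvisited, -1 = noise

    def neighbours(p):
        # the ε-neighbourhood of p (includes p itself)
        return [q for q in points
                if (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= eps ** 2]

    cluster_id = 0
    for p in points:
        if labels[p] is not None:
            continue
        seeds = neighbours(p)
        if len(seeds) < min_pts:
            labels[p] = -1              # not a core object: tentatively noise
            continue
        labels[p] = cluster_id          # start a new cluster at core object p
        queue = [q for q in seeds if q != p]
        while queue:                    # collect density-reachable objects
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster_id  # former noise becomes a border point
                continue
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            q_neigh = neighbours(q)
            if len(q_neigh) >= min_pts: # q is also a core object: expand further
                queue.extend(q_neigh)
        cluster_id += 1
    return labels
```

Because clusters grow only through core objects, arbitrarily shaped dense regions end up in one cluster, while isolated points keep the noise label −1.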
Check your progress/ Self assessment questions- 3
Q8. ____________________is one popular method of clustering based on density.
Q9. Density-based clustering methods are good at discovering clusters with arbitrary shapes. ( TRUE
/ FALSE ).
Q10. DBSCAN stands for:
___________________________________________________________________________
12.5 Summary
To form k clusters for a given data set consisting of n data points or objects, a partitioning algorithm
arranges the data points into k partitions (or fewer than k partitions, in case k > n). The partition clusters
are formed on the basis of a threshold value that defines the similarity or dissimilarity between the data
points of one cluster and the data points of other clusters. A partitioning method creates an initial k
partitioning and with each iteration it attempts to improve the partitioning by moving objects from one
group to another. The centroid in k-means clustering is computed as the mean/median of the points within a
cluster. A centroid need not correspond to an actual data point. Data points within each cluster
have high intra-cluster similarity and low inter-cluster similarity. The k-means algorithm is
sensitive to outliers. K-medoid clustering does not take the mean value of the data points as a
centroid; instead, an actual data point is picked to represent each cluster. The representative points for all
clusters are called medoids. The remaining non-selected data points are then clustered with the
representative data point to which they are the most similar. Data points in hierarchical clustering are
grouped together into a tree-like structure of clusters. Agglomerative clustering is a bottom-up strategy
that starts by taking data objects as atomic clusters; these objects are then merged together to form
different clusters based on their similarities. Divisive clustering is a top-down strategy that starts by
taking one all-inclusive cluster containing all objects. Clusters are then split into
smaller clusters. Density-based clustering methods are good at discovering clusters with arbitrary
shapes. DBSCAN is one popular density-based clustering method; it discovers clusters of arbitrary
shape in spatial databases with noise by growing regions with sufficiently high density into clusters.
12.6 Glossary
DBSCAN- DBSCAN algorithm is used to discover clusters of arbitrary shape in spatial databases
with noise by growing regions with sufficiently high density into clusters.
Centroid- The centroid is the center of gravity of a given cluster. Objects are assigned to a cluster
based on their similarity to its centroid.
Mean- It refers to the average of all objects or elements in a cluster.
Median- It refers to an element or object lying at the midpoint of a frequency distribution of observed
values or quantities.
Tree- A tree data structure is used to represent the hierarchical relationship between objects. The
object at the top of the tree is called the ROOT and objects at the bottom of the tree are called LEAVES.
12.7 Answers to check your progress/self assessment questions
1.
a. Each group must contain at least one object.
b. Each object must belong to exactly one group.
2. Mean/ median.
3. b.
4. K-mean clustering takes the mean/ median value of the objects in a cluster as the centroid, whereas
in case of k-medoid clustering actual objects are selected as medoids to represent the clusters.
5. Tree
6. Agglomerative, divisive.
7. b.
8. DBSCAN
9. TRUE.
10. Density-Based Spatial Clustering of Applications with Noise
12.8 References/ Suggested Readings
1. Data Mining: Concepts and Techniques by J. Han and M. Kamber, Morgan Kaufmann Publishers.
2. Advanced Data Warehouse Design (From Conventional to Spatial and Temporal Applications) by
Elzbieta Malinowski and Esteban Zimányi, Springer.
3. Modern Data Warehousing, Mining and Visualization by George M. Marakas, Pearson.
12.9 Model Questions
1. Write the algorithm for agglomerative based hierarchical clustering.
2. What is the difference between k-mean and k-medoid clustering?
3. Explain the working of k-mean clustering with the help of an example.
4. Write the algorithm for k-medoid clustering.
5. Define the concept of density reachability.
6. Define centroid.
7. What is the termination condition of k-mean clustering method?
8. What is the difference between agglomerative and divisive clustering methods?