Sub Code : CS2032 Sub Name: Data Warehousing and Data Mining UNIT-I PART A DATA WAREHOUSING


1. Define the term 'Data Warehouse'.
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.

Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse there will be only a single way of identifying a product.

Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older, from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.

Non-volatile: Once data is in the data warehouse, it will not change; historical data in a data warehouse should never be altered.

2. Write down the applications of data warehousing.
• IBM Netezza
• Oracle ExaData
• Kognitio 360
• Teradata

3. When is a data mart appropriate?
A data mart is a repository of data designed to serve a particular community of knowledge workers. The goal of a data mart is to meet the particular demands of a specific group of users within the organization, such as human resource management (HRM). Generally, an organization's data marts are subsets of the organization's data warehouse.

4. List out the functionalities of metadata.
• Business metadata: holds data ownership information, business definitions and changing policies.
• Technical metadata (structural metadata): includes database system names, table and column names and sizes, data types and allowed values, as well as structural information such as primary and foreign key attributes and indices.
• Operational metadata (descriptive metadata): includes currency of data and data lineage. Currency of data means whether the data is active, archived or purged. Lineage of data means the history of data migration and the transformations applied to it.

5. What are the nine decisions in the design of a data warehouse?
1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre-calculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. The need to track slowly changing dimensions
9. Deciding the query priorities and query models

6. List out the two different types of reporting tools.
• Production reporting tools, used to generate regular operational reports.
• Desktop report writers, inexpensive desktop tools designed for end users.

7. Why is data mining used in all organizations?

8. What are the technical issues to be considered when designing and implementing a data warehouse environment?
A number of technical issues must be considered when designing a data warehouse environment. These issues include:
• The hardware platform that will house the data warehouse
• The DBMS that supports the warehouse data
• The communication infrastructure that connects the data marts, operational systems and end users
• The hardware and software to support the metadata repository
• The systems management framework that enables administration of the entire environment
• Implementation considerations

9. List out some examples of access tools.
• Data query and reporting tools
• Application development tools
• Executive information system (EIS) tools
• OLAP tools
• Data mining tools

10. What are the advantages of data warehousing?
1. Integrating data from multiple sources
2. Performing new types of analyses
3. Reducing the cost to access historical data
Other benefits may include:
1. Standardizing data across the organization (a "single version of the truth")
2. Improving turnaround time for analysis and reporting
3. Sharing data and allowing others to easily access data
4. Supporting ad hoc reporting and inquiry
5. Reducing the development burden on IS/IT
6. Removing informational processing load from transaction-oriented databases

11. Give the difference between horizontal and vertical parallelism.
Horizontal parallelism: the database is partitioned across multiple disks, and parallel processing occurs within a specific task that is performed concurrently on different processors against different sets of data.
Vertical parallelism: occurs among different tasks. All query components such as scan, join and sort are executed in parallel in a pipelined fashion; in other words, the output from one task becomes the input to another task.

12. Draw a neat diagram for the Distributed memory shared disk architecture.

13. Define star schema.
The multidimensional view of data expressed using relational database semantics is provided by the database schema design called the star schema. The basic premise of the star schema is that information can be classified into two groups:
• Facts
• Dimensions

14. What are the reasons for the very good performance of Sybase IQ technology?

15. What are the steps to be followed to load external sources into the data warehouse?
• Collect and analyze business requirements
• Create a data model and a physical design
• Define data sources
• Choose the database technology and platform
• Extract the data from the operational databases, transform it, clean it up and load it into the warehouse
• Choose database access and reporting tools
• Choose database connectivity software
• Choose data analysis and presentation software
• Update the data warehouse

16. What is a virtual warehouse?
A virtual data warehouse provides a compact view of the data inventory. It contains metadata and uses middleware to build connections to different data sources. Virtual warehouses can be fast, as they allow users to filter the most important pieces of data from different legacy applications.

17. Draw the standard framework for metadata interchange.

18. List out the five main groups of access tools.
• Data query and reporting tools
• Application development tools
• Executive information system (EIS) tools
• OLAP tools
• Data mining tools

19. Define data visualization.
Data visualization is the study of the visual representation of data, meaning "information that has been abstracted in some schematic form, including attributes or variables for the units of information".

20. What are the various forms of data preprocessing?
• Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
• Data integration: integration of multiple databases, data cubes, or files.
• Data transformation: normalization and aggregation.
• Data reduction: reducing the volume of data while producing the same or similar analytical results.
• Data discretization: part of data reduction; replacing numerical attributes with nominal ones.
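Two of these forms, cleaning (filling in missing values) and transformation (min-max normalization), can be sketched in pure Python. The data and column name here are hypothetical, purely for illustration:

```python
# Hypothetical toy example of two preprocessing steps listed above:
# data cleaning (fill missing values with the attribute mean) and
# data transformation (min-max normalization onto [0, 1]).

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Rescale values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [25, None, 40, 55, None, 70]   # None marks a missing value
cleaned = fill_missing_with_mean(ages)
scaled = min_max_normalize(cleaned)
print(cleaned)   # missing entries replaced by the mean (47.5)
print(scaled)    # all values rescaled to [0, 1]
```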

21. How is a data warehouse different from a database? How are they similar?
• A database is used for online transaction processing (OLTP), but can also be used for other purposes such as data warehousing. A data warehouse is used for online analytical processing (OLAP); it reads historical data to support users' business decisions.
• In a database, the tables and joins are complex since they are normalized for the RDBMS. This reduces redundant data and saves storage space. In a data warehouse, the tables and joins are simple since they are de-normalized; this is done to reduce the response time for analytical queries.
• Relational modeling techniques are used for RDBMS database design, whereas dimensional modeling techniques are used for data warehouse design.
• A database is optimized for write operations, while a data warehouse is optimized for read operations.
• In a database, performance is low for analysis queries, while a data warehouse gives high performance for analytical queries.
• A data warehouse is a step ahead of a database: it includes a database in its structure.

22. What is data transformation? Give an example.
The transform stage applies a series of rules or functions to the data extracted from the source to derive the data for loading into the end target. Transformations may be required to meet the business and technical needs of the target server or data warehouse. For example, a "gender" attribute coded as "M"/"F" in one source and "0"/"1" in another is transformed into a single consistent coding.

23. With an example, explain what metadata is.
Metadata is one of the important keys to the success of a data warehousing and business intelligence effort. Metadata is simply defined as data about data: data that is used to describe other data. For example, the index of a book serves as metadata for the contents of the book.

24. What is a data mart?
A data mart contains a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific selected subjects. For example, a marketing data mart may confine its subjects to customer, item and sales.

PART-B

1. Enumerate the building blocks of a data warehouse. Explain the importance of metadata in a data warehouse environment. [16]
2. Explain various methods of data cleaning in detail. [8]
3. Diagrammatically illustrate and discuss the data warehousing architecture, briefly explaining the components of a data warehouse. [16]
4. (i) Distinguish between data warehousing and data mining. [8]
   (ii) Describe in detail data extraction and cleanup. [8]
5. Write short notes on:
   (i) Transformation [8]
   (ii) Metadata [8]
6. List and discuss the steps involved in mapping the data warehouse to a multiprocessor architecture. [16]
7. Discuss in detail bitmapped indexing. [16]
8. Explain in detail different vendor solutions. [16]

UNIT-II BUSINESS ANALYSIS

PART A

1. Difference between OLTP and OLAP.
OLTP = Online Transaction Processing (operational system); OLAP = Online Analytical Processing (data warehouse).

Source of data
  OLTP: Operational data; OLTP systems are the original source of the data.
  OLAP: Consolidated data; OLAP data comes from the various OLTP databases.

Purpose of data
  OLTP: To control and run fundamental business tasks.
  OLAP: To help with planning, problem solving, and decision support.

What the data reveals
  OLTP: A snapshot of ongoing business processes.
  OLAP: Multi-dimensional views of various kinds of business activities.

Inserts and updates
  OLTP: Short and fast inserts and updates initiated by end users.
  OLAP: Periodic long-running batch jobs refresh the data.

Queries
  OLTP: Relatively standardized and simple queries returning relatively few records.
  OLAP: Often complex queries involving aggregations.

Processing speed
  OLTP: Typically very fast.
  OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.

Space requirements
  OLTP: Can be relatively small if historical data is archived.
  OLAP: Larger, due to the existence of aggregation structures and history data; requires more indexes than OLTP.

Database design
  OLTP: Highly normalized, with many tables.
  OLAP: Typically de-normalized, with fewer tables; uses star and/or snowflake schemas.

Backup and recovery
  OLTP: Backed up religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability.
  OLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.

2. Classify OLAP tools.
• MOLAP
• ROLAP
• HOLAP (MQE: Managed Query Environment)

3. What is meant by OLAP?
OLAP stands for Online Analytical Processing. It uses database tables (fact and dimension tables) to enable multidimensional viewing, analysis and querying of large amounts of data. For example, OLAP technology can provide management with fast answers to complex queries on their operational data, or enable them to analyze their company's historical data for trends and patterns.

4. Difference between OLAP and OLTP.

5. Define concept hierarchy.
A concept hierarchy for a given numeric attribute defines a discretization of the attribute. Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior). Although detail is lost by such generalization, the generalized data becomes more meaningful and easier to interpret.

6. List out the five categories of decision support systems.
• Communication-driven DSS
• Data-driven DSS
• Document-driven DSS
• Knowledge-driven DSS
• Model-driven DSS

7. Define Cognos Impromptu.
Impromptu is an interactive database reporting tool. It allows power users to query data without programming knowledge. When using the Impromptu tool, no data is written or changed in the database; it is only capable of reading the data.

8. List out any 5 OLAP guidelines.
• Multidimensional conceptual view: OLAP should provide an appropriate multidimensional business model that suits the business problems and requirements.
• Transparency: the OLAP tool should provide transparency to the input data for the users.
• Accessibility: the OLAP tool should access only the data required for the analysis.
• Consistent reporting performance: the size of the database should not affect the performance in any way.
• Client/server architecture: the OLAP tool should use a client/server architecture to ensure better performance and flexibility.

9. Distinguish between multidimensional and multi-relational OLAP.

10. Define ROLAP.
ROLAP relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement. The data is stored in relational tables.
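The "slicing as a WHERE clause" idea can be illustrated with Python's built-in sqlite3 module; the fact table, its columns, and the data here are hypothetical:

```python
# A minimal sketch of ROLAP slicing: fixing one dimension value
# amounts to adding a WHERE clause to the SQL. Table and column
# names are hypothetical; uses the standard-library sqlite3 module.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [("North", 2023, 100.0), ("South", 2023, 250.0),
                 ("North", 2024, 120.0), ("South", 2024, 90.0)])

# 'Slice' on the region dimension: fix region = 'North', then
# aggregate the remaining dimension (year).
rows = con.execute(
    "SELECT year, SUM(amount) FROM sales "
    "WHERE region = 'North' GROUP BY year ORDER BY year").fetchall()
print(rows)   # one aggregated row per year for the North slice
```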

11. Draw a neat diagram for the web processing model.

12. Define MQE.
MQE is otherwise called HOLAP. HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. It stores only the indexes and aggregations in multidimensional form, while the rest of the data is stored in the relational database.

13. Draw a neat sketch for the three-tiered client/server architecture.

14. List out the application areas in which organizations build a query and reporting environment for the data warehouse.
• Financial services
• Banking services
• Consumer goods
• Retail sectors
• Controlled manufacturing

15. Distinguish between the window painter and the data window painter.

16. Define ADF, SGF and DEF.
ADF is based on a distributed object computing framework and is used to define the user interface and application logic.

17. What is the function of the PowerPlay administrator?
PowerPlay can interoperate with a wide variety of third-party software tools, databases and applications. Its data is stored in multidimensional data sets called PowerCubes.

PART-B

1. Discuss the typical OLAP operations with an example. [6]
2. List and discuss the basic features provided by reporting and query tools used for business analysis. [16]
3. Describe in detail Cognos Impromptu. [16]
4. Explain OLAP in detail. [16]
5. With relevant examples, discuss multidimensional online analytical processing and multi-relational online analytical processing. [16]
6. Discuss the OLAP tools and the Internet. [16]
7. (i) Explain the multidimensional data model. [10]
   (ii) Discuss how computations can be performed efficiently on data cubes. [6]

UNIT-III DATA MINING

PART A

1. Define data.

2. State why data preprocessing is an important issue for data warehousing and data mining.
Data preprocessing is used to deal with incomplete, noisy and inconsistent data:
• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data.
• Noisy: containing errors or outliers.
• Inconsistent: containing discrepancies in codes or names.

3. What is the need for discretization in data mining?
Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies. Discretization handles numeric data by putting values into buckets so that there are a limited number of possible states. The buckets themselves are treated as ordered, discrete values. Both numeric and string columns can be discretized.
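The bucketing described above can be sketched in a few lines with the standard-library bisect module; the boundaries and labels here are hypothetical:

```python
# A small sketch of discretization: numeric values are replaced by
# ordered buckets. Boundaries and labels are hypothetical.
import bisect

def discretize(value, boundaries, labels):
    """Map a numeric value to the bucket it falls in; len(labels)
    must be len(boundaries) + 1."""
    return labels[bisect.bisect_left(boundaries, value)]

boundaries = [30, 60]                          # hypothetical cut points
labels = ["young", "middle-aged", "senior"]    # one label per bucket

print([discretize(a, boundaries, labels) for a in [22, 45, 63]])
```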

4. What are the various forms of data preprocessing?
• Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
• Data integration: integration of multiple databases, data cubes, or files.
• Data transformation: normalization and aggregation.
• Data reduction: obtains a reduced representation in volume that produces the same or similar analytical results.
• Data discretization: part of data reduction, but of particular importance, especially for numerical data.

5. What is a concept hierarchy? Give an example.
A concept hierarchy for a given numeric attribute defines a discretization of the attribute. Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).

6. What are the various forms of data preprocessing?

7. Mention the various tasks to be accomplished as part of data pre-processing.
Tasks to be accomplished:
• Handle incomplete data
• Handle noisy data
• Handle inconsistent data

8. Define data mining.
Data mining refers to extracting or "mining" knowledge from large amounts of data. There are many other terms related to data mining, such as knowledge mining, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Many people treat data mining as a synonym for another popularly used term, "Knowledge Discovery in Databases", or KDD.

9. List out any four data mining tools.
• RapidMiner
• Orange
• GNU Octave
• SenticNet API
• Weka

10. What do data mining functionalities include?
Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. In general, data mining tasks can be classified into two categories:
• Descriptive
• Predictive
Descriptive mining tasks characterize the general properties of the data in the database. Predictive mining tasks perform inference on the current data in order to make predictions.

11. Define patterns.
Pattern mining concentrates on identifying rules that describe specific patterns within the data. Market-basket analysis, which identifies items that typically occur together in purchase transactions, was one of the first applications of data mining. For example, supermarkets use market-basket analysis to identify items that are often purchased together.

PART-B

1. (i) Explain the various primitives for specifying a data mining task. [10]
   (ii) Describe the various descriptive statistical measures for data mining. [6]
2. Discuss the different types of data and the data mining functionalities. [16]
3. (i) Describe in detail the interestingness of patterns. [10]
   (ii) Explain in detail the data mining task primitives. [6]
4. (i) Discuss the different issues of data mining. [6]
   (ii) Explain in detail data preprocessing. [10]
5. How are data mining systems classified? Discuss each classification with an example. [16]
6. How can a data mining system be integrated with a data warehouse? Discuss with an example. [16]

UNIT-IV ASSOCIATION RULE AND CLASSIFICATION

Part A

1. What is meant by market basket analysis?
Market basket analysis solves problems that contain well-defined items that can be grouped together in potentially interesting ways. It involves:
• Choosing the right items
• Generating rules
• Identifying useful rules that are unknown, valid and actionable

2. What is the use of multilevel association rules?
The goal of multiple-level association analysis is to find the hidden information in or between levels of abstraction.

3. What is meant by pruning in decision tree induction?
Pruning is a technique in machine learning that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances. The dual goal of pruning is reduced complexity of the final classifier as well as better predictive accuracy, through the reduction of overfitting and the removal of sections of a classifier that may be based on noisy or erroneous data.

4. Write the two measures of association rules.
Support(A => B) = P(A ∪ B)
Confidence(A => B) = P(B | A)
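Both measures can be estimated directly from a transaction list; the transactions below are hypothetical:

```python
# Support(A => B): fraction of transactions containing A ∪ B.
# Confidence(A => B): P(B | A), estimated over the transactions.
# Hypothetical market-basket transactions for illustration.

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))        # 2 of 4 transactions -> 0.5
print(confidence({"bread"}, {"milk"}))   # 0.5 / 0.75 = 2/3
```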

5. With an example, explain correlation analysis.
Correlation analysis provides an alternative framework for finding interesting relationships, or for improving the understanding of the meaning of some association rules (the lift of an association rule). Two itemsets A and B are independent (the occurrence of itemset A is independent of the occurrence of itemset B) iff
P(A ∩ B) = P(A) · P(B)
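A numeric sketch of this independence test, with hypothetical probabilities; the ratio P(A ∩ B) / (P(A) · P(B)) is the lift of the rule:

```python
# Lift(A => B) = P(A ∩ B) / (P(A) · P(B)); lift = 1 means A and B
# are independent, lift > 1 means positive correlation.
# The probabilities below are hypothetical.

p_a, p_b, p_ab = 0.6, 0.5, 0.4

lift = p_ab / (p_a * p_b)
print(lift)                            # 0.4 / 0.3 -> lift > 1
print(abs(p_ab - p_a * p_b) < 1e-12)   # independence check (False here)
```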

6. Define conditional pattern base.
A conditional pattern base is the set of prefix paths in the FP-tree that co-occur with a given item.

7. List out the major strengths of the decision tree method.
• Decision trees are able to generate understandable rules.
• Decision trees perform classification without requiring much computation.
• Decision trees are able to handle both continuous and categorical variables.
• Decision trees provide a clear indication of which fields are most important for prediction or classification.

8. In classification trees, what are surrogate splits, and how are they used?
• Discriminant-based univariate splits
• Discriminant-based linear combination splits
• C&RT-style exhaustive search for univariate splits

9. What assumptions does the Naïve Bayes classifier make that motivate its name?

10. What is the frequent itemset property?
Every subset of a frequent itemset is also frequent. Also known as the Apriori property or downward closure property, this rule essentially says that we do not need to count an itemset if any of its subsets is not frequent. This is made possible by the anti-monotone property of the support measure: the support for an itemset never exceeds the support for any of its subsets.
If we divide the entire database into several partitions, then an itemset can be frequent only if it is frequent in at least one partition. Bear in mind that the support of an itemset is a percentage; if this minimum percentage requirement is not met in at least one individual partition, it will not be met for the whole database. This property enables us to apply divide-and-conquer algorithms.
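The downward closure property is exactly what makes candidate pruning in Apriori work. A compact sketch, using hypothetical transactions and a hypothetical minimum support count:

```python
# A minimal Apriori sketch: only candidates all of whose (k-1)-subsets
# were frequent are counted. Transactions and min_count are hypothetical.
from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"},
                {"B", "C"}, {"A", "B", "C"}]
min_count = 3   # minimum support as an absolute transaction count

def frequent_itemsets(transactions, min_count):
    items = sorted({i for t in transactions for i in t})
    # L1: frequent 1-itemsets
    level = {frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= min_count}
    result = set(level)
    k = 2
    while level:
        # Join step: build k-item candidates from frequent (k-1)-itemsets
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (downward closure): drop candidates that have an
        # infrequent (k-1)-subset, without counting them at all.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
        level = {c for c in candidates
                 if sum(c <= t for t in transactions) >= min_count}
        result |= level
        k += 1
    return result

freq = frequent_itemsets(transactions, min_count)
print(sorted(map(sorted, freq)))   # all 1- and 2-itemsets; {A,B,C} is not frequent
```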

11. List out the major strengths of decision tree induction.

12. Write the two measures of association rules.

13. What are the application areas of association rules?
• Finding patterns in biological databases
• Medical diagnosis
• Fraud detection
• Census data

14. What is tree pruning in decision tree induction?
Tree pruning is a technique in machine learning that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances. The dual goal of pruning is reduced complexity of the final classifier as well as better predictive accuracy, through the reduction of overfitting and the removal of sections of a classifier that may be based on noisy or erroneous data. Tree pruning approaches:
• Prepruning: the tree is pruned by halting its construction early.
• Postpruning: removes subtrees from a fully grown tree.

15. What are multidimensional association rules?
Association rules that involve two or more dimensions or predicates.
• Interdimension association rule: a multidimensional association rule with no repeated predicate or dimension.
• Hybrid-dimension association rule: a multidimensional association rule with multiple occurrences of some predicates or dimensions.

16. What are the Apriori properties used in the Apriori algorithm?
• Detection of frequent itemsets
• Generation of rules

17. How is prediction different from classification?

18. What is a support vector machine?

19. What are the means to improve the performance of an association rule mining algorithm?

20. State the advantages of the decision tree approach over other approaches for performing classification.

21. How are association rules mined from large databases?
• Step I: Find all frequent itemsets.
• Step II: Generate strong association rules from the frequent itemsets.

PART-B

1. Decision tree induction is a popular classification method. Taking one typical decision tree induction algorithm, briefly outline the method of decision tree classification. [16]

2. Consider the following training dataset and the original decision tree induction algorithm (ID3). Risk is the class label attribute. The Height values have already been discretized into disjoint ranges. Calculate the information gain if Gender is chosen as the test attribute. Calculate the information gain if Height is chosen as the test attribute. Draw the final decision tree (without any pruning) for the training dataset. Generate all the "IF-THEN" rules from the decision tree. [16]

Gender  Height      Risk
F       (1.5, 1.6)  Low
M       (1.9, 2.0)  High
F       (1.8, 1.9)  Medium
F       (1.8, 1.9)  Medium
F       (1.6, 1.7)  Low
M       (1.8, 1.9)  Medium
F       (1.5, 1.6)  Low
M       (1.6, 1.7)  Low
M       (2.0, 8)    High
M       (2.0, 8)    High
F       (1.7, 1.8)  Medium
M       (1.9, 2.0)  Medium
F       (1.8, 1.9)  Medium
F       (1.7, 1.8)  Medium
F       (1.7, 1.8)  Medium
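As a cross-check for the information-gain part of this question, the Gender split can be computed with a short pure-Python sketch; only the Gender and Risk columns of the dataset above are used:

```python
# Entropy and information gain for the Gender attribute of the
# training data above (Height is omitted here for brevity).
from collections import Counter
from math import log2

data = [("F", "Low"), ("M", "High"), ("F", "Medium"), ("F", "Medium"),
        ("F", "Low"), ("M", "Medium"), ("F", "Low"), ("M", "Low"),
        ("M", "High"), ("M", "High"), ("F", "Medium"), ("M", "Medium"),
        ("F", "Medium"), ("F", "Medium"), ("F", "Medium")]

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

base = entropy([risk for _, risk in data])

# Expected entropy after splitting on Gender, weighted by subset size
groups = {}
for gender, risk in data:
    groups.setdefault(gender, []).append(risk)
split = sum(len(v) / len(data) * entropy(v) for v in groups.values())

gain = base - split
print(round(gain, 3))   # information gain for Gender ≈ 0.322
```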

(a) Given the following transactional database:

TID  Items
1    C, B, H
2    B, F, S
3    A, F, G
4    C, B, H
5    B, F, G
6    B, E, O

(i) We want to mine all the frequent itemsets in the data using the Apriori algorithm. Assume the minimum support level is 30%. (You need to give the sets of frequent itemsets L1, L2, … and candidate itemsets C1, C2, ….) [9]
(ii) Find all the association rules that involve only B, C, H (on either the left- or right-hand side of the rule). The minimum confidence is 70%. [7]

3. Describe the multi-dimensional association rule, giving a suitable example. [16]
4. (a) Explain the algorithm for constructing a decision tree from training samples. [12]
   (b) Explain Bayes' theorem. [4]
6. Develop an algorithm for classification using Bayesian classification. Illustrate the algorithm with a relevant example. [16]
7. Discuss the approaches for mining multilevel association rules from transactional databases. Give a relevant example. [16]
8. Write and explain the algorithm for mining frequent itemsets without candidate generation. Give a relevant example. [16]
9. How is attribute-oriented induction implemented? Explain in detail. [16]
10. Discuss in detail Bayesian classification. [8]
11. A database has four transactions. Let min_sup = 60% and min_conf = 80%.

TID   DATE      ITEMS_BOUGHT
T100  10/15/07  {K, A, B}
T200  10/15/07  {D, A, C, E, B}
T300  10/19/07  {C, A, B, E}
T400  10/22/07  {B, A, D}

Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the efficiency of the two mining processes. [16]

UNIT-V CLUSTERING AND APPLICATIONS AND TRENDS IN DATA MINING

PART A

1. What are the requirements of clustering?
The basic requirements of cluster analysis are:
• Dealing with different types of attributes
• Dealing with noisy data
• Constraints on clustering
• Dealing with arbitrary shapes
• High dimensionality
• Ordering of input data
• Interpretability and usability
• Determining input parameters
• Scalability

2. Define spatial data mining.
Spatial data mining is the extraction of undiscovered and implied spatial information. Spatial data is data that is associated with a location. It is used in several fields such as geography, geology and medical imaging.

3. What is text mining?
Text mining is the extraction of meaningful information from large amounts of free-format textual data. It is useful in artificial intelligence and pattern matching, and is also known as knowledge discovery from text, or content analysis.

4. Distinguish between classification and clustering.

Training data
  Classification: We have a training set containing data that have been previously categorized. Based on this training set, the algorithm finds the category that new data points belong to.
  Clustering: We do not know the characteristics of similarity of the data in advance. Using statistical concepts, we split the dataset into sub-datasets such that the sub-datasets contain "similar" data.

Learning type
  Classification: Since a training set exists, we describe this technique as supervised learning.
  Clustering: Since no training set is used, we describe this technique as unsupervised learning.

Example
  Classification: We use a training dataset that categorized customers who have churned. Based on this training set, we can classify whether a customer will churn or not.
  Clustering: We use a dataset of customers and split them into sub-datasets of customers with "similar" characteristics. This information can then be used to market a product to the specific segment of customers identified by the clustering algorithm.

5. Define a spatial database.
A spatial database is a database that is optimized to store and query data representing objects defined in a geometric space. Most spatial databases allow the representation of simple geometric objects such as points, lines and polygons.

6. List out any two commercial data mining tools.
• DBMiner
• GeoMiner
• MultiMediaMiner
• WebLogMiner

7. What is the objective function of the k-means algorithm?
The k-means algorithm is implemented in 4 steps:
1. Partition the objects into k non-empty subsets.
2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to step 2; stop when there are no more new assignments (or when the fractional drop in SSE or MSE is less than a threshold).
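The four steps can be sketched in a few lines of pure Python. This is a deliberately simplified 1-D version with hypothetical data and naive first-k seeding, not a production implementation:

```python
# Minimal 1-D k-means: partition, recompute centroids, reassign,
# repeat until the assignments stop changing. Data is hypothetical.
def kmeans(points, k, max_iter=100):
    centroids = points[:k]          # step 1: initial seeds (first k points)
    assignment = None
    for _ in range(max_iter):
        # step 3: assign each point to the nearest centroid
        new_assignment = [min(range(k), key=lambda c: abs(p - centroids[c]))
                          for p in points]
        if new_assignment == assignment:   # step 4: stop when stable
            break
        assignment = new_assignment
        # step 2: recompute each centroid as the mean of its cluster
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, assignment

points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centroids, assignment = kmeans(points, k=2)
print(sorted(centroids))   # two cluster centers, near 1.5 and 10.5
```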

8. Mention the advantages of Hierarchical clustering.

10. Define binary variables. What are the two types of binary variables?
Binary variables have two states, 0 and 1: when the state is 0 the variable is absent, and when the state is 1 the variable is present. There are two types of binary variables, symmetric and asymmetric. Symmetric binary variables are those that have the same state values and weights; asymmetric binary variables are those that do not have the same state values and weights.

11. What is web usage mining?
Web usage mining is a technique to process information available on the web and search for useful data: to discover web pages, text documents, multimedia files, images, and other types of resources on the web. It is used in several fields such as e-commerce, information filtering, fraud detection, and education and research.

12. What are interval-scaled variables?
Interval-scaled variables are continuous measurements on a linear scale, for example height and weight, weather temperature, or the coordinates of a cluster. Distances between these measurements can be calculated using the Euclidean distance or the Minkowski distance.
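Both distance measures fit in a few lines; Euclidean distance is the Minkowski distance with p = 2 (the height/weight pairs below are hypothetical):

```python
# Minkowski and Euclidean distances for interval-scaled variables.
# The measurement pairs are hypothetical.
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def euclidean(x, y):
    """Euclidean distance: the Minkowski distance with p = 2."""
    return minkowski(x, y, 2)

height_weight_1 = (170.0, 65.0)
height_weight_2 = (174.0, 68.0)
print(euclidean(height_weight_1, height_weight_2))     # sqrt(16 + 9) = 5.0
print(minkowski(height_weight_1, height_weight_2, 1))  # Manhattan: 4 + 3 = 7.0
```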

13. What are the applications of spatial databases?

14. Define clustering.
Clustering is the process of grouping physical or conceptual data objects into clusters.

15. What is cluster analysis?
Cluster analysis is the process of analyzing the various clusters to organize the different objects into meaningful and descriptive groups.

16. What are the two data structures in cluster analysis?
• Data matrix (two-mode): n objects, p variables.
• Dissimilarity matrix (one-mode): dissimilarities between all pairs of the n objects.

17. What is an outlier? Give an example.
Very often there exist data objects that do not comply with the general behavior or model of the data. Such data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers.

18. What is audio data mining?
Audio data mining uses audio signals to indicate patterns in the data or features of data mining results. Patterns are transformed into sound and music, so that interesting or unusual patterns can be identified by listening to pitches, rhythms, tunes and melody.

19. List two applications of data mining.
• DNA analysis
• Financial data analysis
• Retail industry
• Telecommunication industry
• Market analysis
• Banking industry
• Health care analysis

20. What are the fields in which clustering techniques are used?
• Clustering is used in biology to develop new plant and animal taxonomies.
• Clustering is used in business to enable marketers to develop new distinct groups of their customers and characterize each customer group on the basis of purchasing patterns.
• Clustering is used in the identification of groups of automobile insurance policy customers.
• Clustering is used in the identification of groups of houses in a city on the basis of house type, cost and geographical location.
• Clustering is used to classify documents on the web for information discovery.

PART-B

1. BIRCH and CLARANS are two interesting clustering algorithms that perform effective clustering on large data sets.
   (i) Outline how BIRCH performs clustering on large data sets. [10]
   (ii) Compare and outline the major differences between the two scalable clustering algorithms BIRCH and CLARANS. [6]
2. Write a short note on the web mining taxonomy. Explain the different activities of text mining.
3. Discuss and elaborate on the current trends in data mining. [6+5+5]
4. Discuss spatial databases and text databases. [16]
5. What is a multimedia database? Explain the methods of mining multimedia databases. [16]
6. Explain the following clustering methods in detail:
   (a) BIRCH
   (b) CURE [16]
7. Discuss in detail any four data mining applications. [16]
8. Write short notes on:
   (i) Partitioning methods [8]
   (ii) Outlier analysis [8]
9. Describe k-means clustering with an example. [16]
10. Describe in detail hierarchical methods.
11. With a relevant example, discuss constraint-based cluster analysis. [16]