Sub Code : CS2032 Sub Name: Data Warehousing and Data Mining
UNIT-I PART A
DATA WAREHOUSING
1. Define the term 'Data Warehouse'.
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse there will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So historical data in a data warehouse should never be altered.
2. Write down the applications of data warehousing.
• IBM Netezza
• Oracle Exadata
• Kognitio 360
• Teradata
3. When is a data mart appropriate?
A data mart is a repository of data that is designed to serve a particular community of knowledge workers. The goal of a data mart is to meet the particular demands of a specific group of users within the organization, such as human resource management (HRM). Generally, an organization's data marts are subsets of the organization's data warehouse.
4. List out the functionality of metadata.
Business Metadata - This metadata has the data ownership information, business definition and changing policies.
Technical Metadata (structural metadata) - Technical metadata includes database system names, table and column names and sizes, data types and allowed values. Technical metadata also includes structural information such as primary and foreign key attributes and indices.
Operational Metadata (descriptive metadata) - This metadata includes currency of data and data lineage. Currency of data means whether data is active, archived or purged. Lineage of data means the history of data migrated and the transformations applied on it.
5. What are the nine decisions in the design of a data warehouse?
1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre-calculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. The need to track slowly changing dimensions
9. Deciding the query priorities and query models
6. List out the two different types of reporting tools.
• Production reporting tools, used to generate regular operational reports.
• Desktop report writers, inexpensive desktop tools designed for end users.
7. Why is data mining used in all organizations?
8. What are the technical issues to be considered when designing and implementing a data warehouse environment?
A number of technical issues are to be considered when designing a data warehouse environment. These issues include:
• The hardware platform that would house the data warehouse
• The DBMS that supports the warehouse data
• The communication infrastructure that connects data marts, operational systems and end users
• The hardware and software to support the metadata repository
• The systems management framework that enables administration of the entire environment
• Implementation considerations
9. List out some of the examples of access tools.
• Data query and reporting tools
• Application development tools
• Executive information system (EIS) tools
• OLAP tools
• Data mining tools
10. What are the advantages of data warehousing?
1. Integrating data from multiple sources
2. Performing new types of analyses
3. Reducing the cost to access historical data
Other benefits may include:
1. Standardizing data across the organization, a "single version of the truth"
2. Improving turnaround time for analysis and reporting
3. Sharing data and allowing others to easily access data
4. Supporting ad hoc reporting and inquiry
5. Reducing the development burden on IS/IT
6. Removing the informational processing load from transaction-oriented databases
11. Give the difference between horizontal and vertical parallelism.
Horizontal parallelism: The database is partitioned across multiple disks, and parallel processing occurs within a specific task that is performed concurrently on different processors against different sets of data.
Vertical parallelism: This occurs among different tasks. All query components such as scan, join and sort are executed in parallel in a pipelined fashion. In other words, the output from one task becomes an input to another task.
12. Draw a neat diagram for the Distributed memory shared disk architecture.
13. Define star schema.
The multidimensional view of data that is expressed using relational database semantics is provided by the database schema design called the star schema. The basic premise of the star schema is that information can be classified into two groups:
• Facts
• Dimensions
14. What are the reasons for the very good performance achieved by Sybase IQ technology?
15. What are the steps to be followed to store the external source into the data warehouse?
• Collect and analyze business requirements
• Create a data model and a physical design
• Define data sources
• Choose the database technology and platform
• Extract the data from the operational database, transform it, clean it up and load it into the warehouse
• Choose database access and reporting tools
• Choose database connectivity software
• Choose data analysis and presentation software
• Update the data warehouse
16. What is a virtual warehouse?
A virtual data warehouse provides a compact view of the data inventory. It contains metadata and uses middleware to build connections to different data sources. Virtual warehouses can be fast because they allow users to filter the most important pieces of data from different legacy applications.
17. Draw the standard framework for metadata interchange.
18. List out the five main groups of access tools.
• Data query and reporting tools
• Application development tools
• Executive information system (EIS) tools
• OLAP tools
• Data mining tools
19. Define data visualization.
Data visualization is the study of the visual representation of data, meaning "information that has been abstracted in some schematic form, including attributes or variables for the units of information".
20. What are the various forms of data preprocessing?
• Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
• Data integration: using multiple databases, data cubes, or files.
• Data transformation: normalization and aggregation.
• Data reduction: reducing the volume but producing the same or similar analytical results.
• Data discretization: part of data reduction, replacing numerical attributes with nominal ones.
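Two of the forms above, cleaning and transformation, can be sketched in a few lines. This is an illustrative sketch, not part of the original answer: it fills missing values with the mean (cleaning) and applies min-max normalization (transformation) to a toy attribute.

```python
def fill_missing_with_mean(values):
    """Data cleaning: replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Data transformation: scale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

ages = [23, None, 45, 30, 52]
cleaned = fill_missing_with_mean(ages)   # the None becomes the mean, 37.5
normalized = min_max_normalize(cleaned)  # all values scaled into [0, 1]
```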
21. How is a data warehouse different from a database? How are they similar?
• A database is used for Online Transaction Processing (OLTP) but can be used for other purposes such as data warehousing. A data warehouse is used for Online Analytical Processing (OLAP); it reads historical data to support users' business decisions.
• In a database, the tables and joins are complex since they are normalized for the RDBMS. This reduces redundant data and saves storage space. In a data warehouse, the tables and joins are simple since they are de-normalized. This is done to reduce the response time for analytical queries.
• Relational modeling techniques are used for RDBMS database design, whereas dimensional modeling techniques are used for data warehouse design.
• A database is optimized for write operations, while a data warehouse is optimized for read operations.
• In a database, performance is low for analysis queries, while a data warehouse gives high performance for analytical queries.
• A data warehouse is a step ahead of a database: it includes a database in its structure.
22. What is data transformation? Give an example.
The transform stage applies a series of rules or functions to the data extracted from the source to derive the data for loading into the end target. Transformations may be required to meet the business and technical needs of the target data warehouse. For example, coded values may be translated (such as mapping 'M' and 'F' to 'Male' and 'Female'), or a new calculated value may be derived (such as sale amount = quantity x unit price).
23. With an example explain what metadata is.
Metadata is one of the important keys to the success of the data warehousing and business intelligence effort. Metadata is simply defined as data about data: data that are used to represent other data are known as metadata. For example, the index of a book serves as metadata for the contents of the book.
24. What is a data mart?
A data mart contains a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific selected subjects. For example, a marketing data mart may confine its subjects to customer, item and sales.
PART-B
1. Enumerate the building blocks of data warehouse. Explain the importance of metadata in a data warehouse environment.
[16]
2. Explain various methods of data cleaning in detail. [8]
3. Diagrammatically illustrate and discuss the data warehousing architecture, briefly explaining the components of a data warehouse. [16]
4. (i) Distinguish between Data warehousing and data mining. [8] (ii)Describe in detail about data extraction, cleanup
[8]
5. Write short notes on (i)Transformation [8] (ii)Metadata [8]
6. List and discuss the steps involved in mapping the data warehouse to a multiprocessor architecture. [16]
7. Discuss in detail about Bitmapped Indexing [16] 8. Explain in detail about different Vendor Solutions. [16]
UNIT-II BUSINESS ANALYSIS
PART A
1. Difference between OLAP and OLTP.

Source of data
• OLTP system (Online Transaction Processing, operational system): Operational data; OLTPs are the original source of the data.
• OLAP system (Online Analytical Processing, data warehouse): Consolidated data; OLAP data comes from the various OLTP databases.

Purpose of data
• OLTP: To control and run fundamental business tasks.
• OLAP: To help with planning, problem solving, and decision support.

What the data reveals
• OLTP: A snapshot of ongoing business processes.
• OLAP: Multi-dimensional views of various kinds of business activities.

Inserts and updates
• OLTP: Short and fast inserts and updates initiated by end users.
• OLAP: Periodic long-running batch jobs refresh the data.

Queries
• OLTP: Relatively standardized and simple queries returning relatively few records.
• OLAP: Often complex queries involving aggregations.

Processing speed
• OLTP: Typically very fast.
• OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.

Space requirements
• OLTP: Can be relatively small if historical data is archived.
• OLAP: Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP.

Database design
• OLTP: Highly normalized with many tables.
• OLAP: Typically de-normalized with fewer tables; use of star and/or snowflake schemas.

Backup and recovery
• OLTP: Back up religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability.
• OLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.
2. Classify OLAP tools.
• MOLAP
• ROLAP
• HOLAP (MQE: Managed Query Environment)
3. What is meant by OLAP? OLAP stands for Online Analytical Processing. It uses
database tables (fact and dimension tables) to enable multidimensional viewing, analysis and querying of large
amounts of data. E.g. OLAP technology could provide management with fast answers to complex queries on their
operational data or enable them to analyze their company's historical data for trends and patterns.
4. Difference between OLAP & OLTP.
5. Define concept hierarchy.
A concept hierarchy for a given numeric attribute defines a discretization of the attribute. Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior). Although detail is lost by such generalization, the generalized data becomes more meaningful and easier to interpret.
6. List out the five categories of decision support systems.
• Communication-driven DSS
• Data-driven DSS
• Document-driven DSS
• Knowledge-driven DSS
• Model-driven DSS
7. Define Cognos Impromptu.
Impromptu is an interactive database reporting tool. It allows power users to query data without programming knowledge. When using the Impromptu tool, no data is written or changed in the database; it is only capable of reading the data.
8. List out any 5 OLAP guidelines.
• Multidimensional conceptual view: The OLAP tool should provide an appropriate multidimensional business model that suits the business problems and requirements.
• Transparency: The OLAP tool should provide transparency of the input data to the users.
• Accessibility: The OLAP tool should access only the data required for the analysis needed.
• Consistent reporting performance: The size of the database should not affect reporting performance in any way.
• Client/server architecture: The OLAP tool should use a client/server architecture to ensure better performance and flexibility.
9. Distinguish between multidimensional and multi-relational OLAP.
10. Define ROLAP.
ROLAP relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement. The data is stored in relational tables.
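The idea that a slice is just an extra WHERE predicate can be shown with a tiny relational example. This sketch is not from the source; the `sales` table and its columns are invented purely for illustration, using Python's built-in sqlite3 module.

```python
import sqlite3

# A toy "fact table" in an in-memory relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("East", 2023, 100.0), ("West", 2023, 80.0), ("East", 2024, 120.0),
])

# Aggregate over the whole table: total sales by region.
full = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()

# Slicing on the 'year' dimension is the same query plus a WHERE clause.
slice_2023 = conn.execute(
    "SELECT region, SUM(amount) FROM sales "
    "WHERE year = 2023 GROUP BY region").fetchall()
```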
11. Draw a neat diagram for the web processing model.
12. Define MQE.
MQE is otherwise called HOLAP. HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. It stores only the indexes and aggregations in multidimensional form, while the rest of the data is stored in the relational database.
13. Draw a neat sketch of the three-tiered client/server architecture.
14. List out the applications that organizations use to build a query and reporting environment for the data warehouse.
• Financial services
• Banking services
• Consumer goods
• Retail sectors
• Controlled manufacturing
15. Distinguish between window painter and data window painter.
16. Define ADF, SGF and DEF.
ADF is based on a distributed object computing framework and is used to define the user interface and application logic.
17. What is the function of the PowerPlay administrator?
PowerPlay can interoperate with a wide variety of third-party software tools, databases and applications. Its data is stored in multidimensional data sets called PowerCubes.
PART-B
1. Discuss the typical OLAP operations with an example. [6] 2. List and discuss the basic features that are provided by
reporting and query tools used for business analysis. [16]
3. Describe in detail about Cognos Impromptu [16] 4. Explain about OLAP in detail. [16] 5. With relevant examples discuss multidimensional online
analytical processing and multi-relational online analytical processing. [16]
6. Discuss about the OLAP tools and the Internet [16] 7. (i)Explain Multidimensional Data model. [10]
(ii)Discuss how computations can be performed efficiently on data cubes. [6]
UNIT-III DATA MINING
PART A
1. Define data.
2. State why data preprocessing is an important issue for data warehousing and data mining.
Data preprocessing is used to handle incomplete, noisy and inconsistent data:
• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• Noisy: containing errors or outliers
• Inconsistent: containing discrepancies in codes or names
3. What is the need for discretization in data mining?
Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies. Discretization handles numeric values by putting them into buckets so that there are a limited number of possible states. The buckets themselves are treated as ordered and discrete values. Both numeric and string columns can be discretized.
4. What are the various forms of data preprocessing?
• Data cleaning - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration - Integration of multiple databases, data cubes, or files
• Data transformation - Normalization and aggregation
• Data reduction - Obtains a reduced representation in volume that produces the same or similar analytical results
• Data discretization - Part of data reduction but with particular importance, especially for numerical data
5. What is a concept hierarchy? Give an example.
A concept hierarchy for a given numeric attribute defines a discretization of the attribute. Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
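The age example above can be sketched as a small generalization function. This is an illustrative sketch, not from the source, and the cut-off points (35 and 60) are arbitrary assumptions.

```python
def age_concept(age):
    """Generalize a numeric age up the concept hierarchy
    (numeric value -> young / middle-aged / senior)."""
    if age < 35:            # assumed boundary for "young"
        return "young"
    elif age < 60:          # assumed boundary for "middle-aged"
        return "middle-aged"
    return "senior"

ages = [22, 41, 67, 58, 30]
generalized = [age_concept(a) for a in ages]
# detail is lost, but the values are easier to interpret
```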
6. What are the various forms of data preprocessing?
7. Mention the various tasks to be accomplished as part of data pre-processing.
The tasks to be accomplished are to handle:
• Incomplete data
• Noisy data
• Inconsistent data
8. Define data mining.
Data mining refers to extracting or "mining" knowledge from large amounts of data. There are many other terms related to data mining, such as knowledge mining, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Many people treat data mining as a synonym for another popularly used term, "Knowledge Discovery in Databases", or KDD.
9. List out any four data mining tools.
• RapidMiner
• Orange
• GNU Octave
• SenticNet API
• Weka
10. What do data mining functionalities include?
Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. In general, data mining tasks can be classified into two categories:
• Descriptive
• Predictive
Descriptive mining tasks characterize the general properties of the data in the database. Predictive mining tasks perform inference on the current data in order to make predictions.
11. Define patterns. Pattern mining concentrates on identifying rules that
describe specific patterns within the data. Market-basket analysis, which identifies items that typically occur
together in purchase transactions, was one of the first applications of data mining. For example, supermarkets used
market-basket analysis to identify items that were often purchased together.
PART-B
1. (i) Explain the various primitives for specifying a data mining task. [10]
(ii) Describe the various descriptive statistical measures for data mining. [6]
2. Discuss about the different types of data and functionalities. [16]
3. (i) Describe in detail about the interestingness of patterns. [10]
(ii) Explain in detail about data mining task primitives. [6]
4. (i) Discuss about the different issues of data mining. [6]
(ii) Explain in detail about data preprocessing. [10]
5. How are data mining systems classified? Discuss each classification with an example. [16]
6. How can a data mining system be integrated with a data warehouse? Discuss with an example. [16]
UNIT-IV ASSOCIATION RULE AND CLASSIFICATION
Part A
1. What is meant by market basket analysis?
Market basket analysis solves problems that contain well-defined items that can be grouped together in potentially interesting ways. It involves:
• Choosing the right items
• Generating rules
• Identifying useful rules that are unknown, valid and actionable
2. What is the use of multilevel association rules?
The goal of multiple-level association analysis is to find the hidden information in or between levels of abstraction.
3. What is meant by pruning in decision tree induction?
Pruning is a technique in machine learning that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances. The dual goal of pruning is reduced complexity of the final classifier as well as better predictive accuracy, through the reduction of overfitting and the removal of sections of a classifier that may be based on noisy or erroneous data.
4. Write the two measures of association rules.
• Support(A => B) = P(A U B)
• Confidence(A => B) = P(B | A)
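The two measures can be computed directly from a transaction list. This is a minimal sketch, not from the source; the toy transactions are invented. Support(A => B) is the fraction of transactions containing both A and B, and confidence is P(B | A).

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

def confidence(antecedent, consequent):
    """Conditional probability P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

s = support({"bread", "milk"})       # both items appear in 2 of 4 transactions
c = confidence({"bread"}, {"milk"})  # of 3 bread transactions, 2 contain milk
```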
5. With an example explain correlation analysis.
Correlation analysis provides an alternative framework for finding interesting relationships, or for improving the understanding of the meaning of some association rules (the lift of an association rule). Two itemsets A and B are independent (the occurrence of A is independent of the occurrence of itemset B) iff
P(A ∩ B) = P(A) . P(B)
6. Define conditional pattern base.
A conditional pattern base is the set of prefix paths in the FP-tree that co-occur with some item.
7. List out the major strengths of the decision tree method.
• Decision trees are able to generate understandable rules.
• Decision trees perform classification without requiring much computation.
• Decision trees are able to handle both continuous and categorical variables.
• Decision trees provide a clear indication of which fields are most important for prediction or classification.
8. In classification trees, what are the surrogate splits, and how are they used?
• Discriminant-based univariate splits
• Discriminant-based linear combination splits
• C&RT-style exhaustive search for univariate splits
9. The Naïve Bayes classifier makes what assumptions that motivate its name?
It assumes that the attributes are conditionally independent of one another given the class label; this "naïve" independence assumption is what gives the classifier its name.
10. What is the frequent itemset property?
Every subset of a frequent itemset is also frequent. Also known as the Apriori property or downward closure property, this rule essentially says that we do not need to find the count of an itemset if all its subsets are not frequent. This is made possible because of the anti-monotone property of the support measure: the support for an itemset never exceeds the support for its subsets.
If we divide the entire database into several partitions, then an itemset can be frequent only if it is frequent in at least one partition. Bear in mind that the support of an itemset is actually a percentage, and if this minimum percentage requirement is not met in at least one individual partition, it will not be met for the whole database. This property enables us to apply divide-and-conquer types of algorithms.
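The downward closure property above is exactly what Apriori uses to prune candidates. This is an illustrative sketch, not from the source: a candidate k-itemset is kept only if every one of its (k-1)-subsets is already known to be frequent.

```python
from itertools import combinations

def prune_candidates(candidates, frequent_smaller):
    """Keep only candidates whose every (k-1)-subset is frequent
    (the Apriori / downward closure pruning step)."""
    kept = []
    for cand in candidates:
        subsets = combinations(sorted(cand), len(cand) - 1)
        if all(frozenset(s) in frequent_smaller for s in subsets):
            kept.append(cand)
    return kept

# Suppose the frequent 2-itemsets are {A,B}, {B,C} and {A,C}.
frequent_2 = {frozenset(p) for p in [("A", "B"), ("B", "C"), ("A", "C")]}
candidates_3 = [frozenset(("A", "B", "C")), frozenset(("A", "B", "D"))]

kept = prune_candidates(candidates_3, frequent_2)
# {A,B,C} survives; {A,B,D} is pruned because {A,D} is not frequent
```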
11. List out the major strengths of decision tree induction.
12. Write the two measures of association rules.
13. What are the application areas of association rules?
• Finding patterns in biological databases
• Medical diagnosis
• Fraud detection
• Census data
14. What is tree pruning in decision tree induction?
Tree pruning is a technique in machine learning that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances. The dual goal of pruning is reduced complexity of the final classifier as well as better predictive accuracy, through the reduction of overfitting and the removal of sections of a classifier that may be based on noisy or erroneous data. Tree pruning approaches are listed below:
• Prepruning - The tree is pruned by halting its construction early.
• Postpruning - This approach removes subtrees from a fully grown tree.
15. What are multidimensional association rules?
Association rules that involve two or more dimensions or predicates.
• Interdimension association rule: a multidimensional association rule with no repeated predicate or dimension.
• Hybrid-dimension association rule: a multidimensional association rule with multiple occurrences of some predicates or dimensions.
16. What are the Apriori properties used in the Apriori algorithm?
• Detection of frequent itemsets
• Generation of rules
17. How is prediction different from classification?
18. What is a support vector machine? 19. What are the means to improve the performance of
association rule mining algorithm? 20. State the advantages of the decision tree approach over
other approaches for performing classification. 21. How are association rules mined from large databases?
• Step I: Find all frequent itemsets.
• Step II: Generate strong association rules from the frequent itemsets.
PART-B
1. Decision tree induction is a popular classification method. Taking one typical decision tree induction
algorithm, briefly outline the method of decision tree classification. [16]
2. Consider the following training dataset and the original decision tree induction algorithm (ID3). Risk is the class
label attribute. The Height values have already been discretized into disjoint ranges. Calculate the information
gain if Gender is chosen as the test attribute. Calculate the information gain if Height is chosen as the test attribute.
Draw the final decision tree (without any pruning) for the training dataset. Generate all the “IF-THEN rules from the decision tree.
Gender  Height      Risk
F       (1.5, 1.6)  Low
M       (1.9, 2.0)  High
F       (1.8, 1.9)  Medium
F       (1.8, 1.9)  Medium
F       (1.6, 1.7)  Low
M       (1.8, 1.9)  Medium
F       (1.5, 1.6)  Low
M       (1.6, 1.7)  Low
M       (2.0, 8)    High
M       (2.0, 8)    High
F       (1.7, 1.8)  Medium
M       (1.9, 2.0)  Medium
F       (1.8, 1.9)  Medium
F       (1.7, 1.8)  Medium
F       (1.7, 1.8)  Medium
[16]
(a) Given the following transactional database:
TID  Items
1    C, B, H
2    B, F, S
3    A, F, G
4    C, B, H
5    B, F, G
6    B, E, O
(i) We want to mine all the frequent itemsets in the data using the Apriori algorithm.
Assume the minimum support level is 30%. (You need to give the set of frequent item sets in L1, L2,… candidate item sets
in C1, C2, ...) [9]
(ii) Find all the association rules that involve only B, C, H (in either the left or right hand side of the rule). The minimum confidence is 70%. [7]
3. Describe the multi-dimensional association rule, giving a suitable example. [16]
4. (a)Explain the algorithm for constructing a decision tree from training samples [12]
(b)Explain Bayes theorem. [4] 6. Develop an algorithm for classification using Bayesian
classification. Illustrate the algorithm with a relevant example. [16]
7. Discuss the approaches for mining multi level association rules from the transactional databases. Give relevant
example. [16] 8. Write and explain the algorithm for mining frequent item
sets without candidate generation. Give relevant example.[16]
9. How is attribute oriented induction implemented? Explain in detail. [16]
10. Discuss in detail about Bayesian classification [8] 11. A database has four transactions. Let min sup=60% and min
conf=80%.
TID   DATE      ITEMS_BOUGHT
T100  10/15/07  {K, A, B}
T200  10/15/07  {D, A, C, E, B}
T300  10/19/07  {C, A, B, E}
T400  10/22/07  {B, A, D}
Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the efficiency of the two mining processes. [16]
UNIT-V CLUSTERING AND APPLICATIONS AND TRENDS IN DATA MINING
PART A
1. What are the requirements of clustering?
The basic requirements of cluster analysis are:
• Dealing with different types of attributes
• Dealing with noisy data
• Constraints on clustering
• Dealing with arbitrary shapes
• High dimensionality
• Ordering of input data
• Interpretability and usability
• Determining input parameters
• Scalability
2. Define spatial data mining.
Spatial data mining is the extraction of undiscovered and implied spatial information. Spatial data is data that is associated with a location. It is used in several fields such as geography, geology and medical imaging.
3. What is text mining?
Text mining is the extraction of meaningful information from large amounts of free-format textual data. It is useful in artificial intelligence and pattern matching, and is also known as text data mining, knowledge discovery from text, or content analysis.
4. Distinguish between classification and clustering.

Classification:
• We have a training set containing data that have been previously categorized.
• Based on this training set, the algorithm finds the category that new data points belong to.
• Since a training set exists, this technique is described as supervised learning.
• Example: we use a training dataset that categorized customers who have churned. Based on this training set, we can classify whether a new customer will churn or not.

Clustering:
• We do not know the characteristics of similarity of the data in advance.
• Using statistical concepts, we split the dataset into sub-datasets such that the sub-datasets have "similar" data.
• Since no training set is used, this technique is described as unsupervised learning.
• Example: we use a dataset of customers and split them into sub-datasets of customers with "similar" characteristics. This information can then be used to market a product to a specific segment of customers identified by the clustering algorithm.
5. Define a Spatial database. A spatial database is a database that is optimized to store and query data that represents objects defined in a
geometric space. Most spatial databases allow representing simple geometric objects such as points, lines and polygons.
6. List out any two commercial data mining tools.
• DBMiner
• GeoMiner
• MultiMedia Miner
• WeblogMiner
7. What is the objective function of the k-means algorithm?
The objective function is the sum of squared errors (SSE) between each object and the centroid of its cluster, which the algorithm tries to minimize. The k-means algorithm is implemented in 4 steps:
1. Partition the objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to step 2; stop when there are no more new assignments (or the fractional drop in SSE or MSE is less than a threshold).
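The four steps above can be sketched compactly for one-dimensional points. This is an illustrative sketch, not from the source; seeding with the first k points is one simple assumption for step 1, and the stopping rule here is "no centroid changed".

```python
def kmeans(points, k, iters=100):
    centroids = points[:k]                       # step 1: initial seeds (assumed)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                         # step 3: assign to nearest seed
            i = min(range(k), key=lambda c: (p - centroids[c]) ** 2)
            clusters[i].append(p)
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]  # step 2: recompute centroids
        if new == centroids:                     # step 4: stop when stable
            break
        centroids = new
    return centroids, clusters

centroids, clusters = kmeans([1.0, 1.1, 0.9, 8.0, 8.2, 7.8], k=2)
# the points separate into a cluster near 1 and a cluster near 8
```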
8. Mention the advantages of Hierarchical clustering.
10. Define binary variables. What are the two types of binary variables?
Binary variables have two states, 0 and 1: when the state is 0 the variable is absent, and when the state is 1 the variable is present. There are two types of binary variables, symmetric and asymmetric. Symmetric binary variables are those whose states carry the same value and weight; asymmetric binary variables are those whose states do not carry the same value and weight.
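The distinction above matters when measuring dissimilarity. This sketch is not from the source: for symmetric variables the simple matching distance counts both 0-0 and 1-1 agreements, while for asymmetric variables the Jaccard-style distance ignores 0-0 matches, since the 0 state carries little information.

```python
def symmetric_dissim(a, b):
    """Simple matching distance: fraction of positions that disagree."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches / len(a)

def asymmetric_dissim(a, b):
    """Jaccard distance: ignore positions where both variables are 0."""
    mismatches = sum(x != y for x, y in zip(a, b))
    relevant = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return mismatches / relevant

a = [1, 0, 1, 0, 0]
b = [1, 1, 0, 0, 0]
d_sym = symmetric_dissim(a, b)    # 2 disagreements out of 5 positions
d_asym = asymmetric_dissim(a, b)  # 2 disagreements out of 3 relevant positions
```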
11. What is web usage mining?
Web usage mining is a technique to process information available on the web and search for useful data: to discover web pages, text documents, multimedia files, images, and other types of resources from the web. It is used in several fields such as e-commerce, information filtering, fraud detection, and education and research.
12. What are interval-scaled variables?
Interval-scaled variables are continuous measurements on a linear scale, for example height and weight, weather temperature, or coordinates of any cluster. Distances between these measurements can be calculated using the Euclidean distance or the Minkowski distance.
13. What are the applications of spatial databases?
14. Define Clustering? Clustering is a process of grouping the physical or conceptual data object into clusters.
15. What is cluster analysis?
Cluster analysis is the process of analyzing the various clusters to organize the different objects into meaningful and descriptive groups.
16. What are the two data structures in cluster analysis?
• Data matrix (two modes): n objects, p variables.
• Dissimilarity matrix (one mode): dissimilarities between all pairs of the n objects.
17. What is an outlier? Give an example.
Very often, there exist data objects that do not comply with the general behavior or model of the data. Such data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers. For example, a credit-card transaction far larger than a customer's usual purchases may be an outlier.
18. What is audio data mining?
Audio data mining uses audio signals to indicate patterns in data or the features of data mining results. Patterns are transformed into sound and music, so that interesting or unusual patterns can be identified by listening to pitches, rhythms, tune and melody.
19. List two applications of data mining.
• DNA analysis
• Financial data analysis
• Retail industry
• Telecommunication industry
• Market analysis
• Banking industry
• Health care analysis
20. What are the fields in which clustering techniques are used?
• Clustering is used in biology to develop new plant and animal taxonomies.
• Clustering is used in business to enable marketers to develop new distinct groups of their customers and characterize the customer groups on the basis of purchasing.
• Clustering is used in the identification of groups of automobile insurance policy customers.
• Clustering is used in the identification of groups of houses in a city on the basis of house type, cost and geographical location.
• Clustering is used to classify documents on the web for information discovery.
PART-B
1. BIRCH and CLARANS are two interesting clustering algorithms that perform effective clustering in large data
sets. (i) Outline how BIRCH performs clustering in large data
sets. [10] (ii) Compare and outline the major differences of the two scalable clustering algorithms BIRCH and CLARANS.
[6] 2. Write a short note on web mining taxonomy. Explain the
different activities of text mining. 3. Discuss and elaborate the current trends in data mining.
[6+5+5] 4. Discuss spatial data bases and Text databases [16] 5. What is a multimedia database? Explain the methods of
mining multimedia databases? [16]
6. Explain the following clustering methods in detail:
(a) BIRCH (b) CURE [16]
7. Discuss in detail about any four data mining
applications. [16] 8. Write short notes on
(i) Partitioning methods [8] (ii) Outlier analysis [8] 9. Describe K means clustering with an example. [16]
10. Describe in detail about Hierarchical methods. 11. With relevant example discuss constraint based cluster
analysis. [16]