Visual data mining modeling techniques for the visualization of mining outcomes

Journal of Visual Languages and Computing 14 (2003) 543–589

www.elsevier.com/locate/jvlc

Visual data mining modeling techniques for the visualization of mining outcomes

Ioannis Kopanakis, Babis Theodoulidis*

CRIM—Center of Research in Information Management, Department of Computation, UMIST,

PO Box 88, Sackville Street, Manchester M60 1QD, UK

Received 2 August 2002; received in revised form 7 March 2003; accepted 9 June 2003

Abstract

The visual senses for humans have a unique status, offering a very broadband channel for information flow. Visual approaches to analysis and mining attempt to take advantage of our abilities to perceive pattern and structure in visual form and to make sense of, or interpret, what we see. Visual Data Mining techniques have proven to be of high value in exploratory data analysis and they also have a high potential for mining large databases. In this work, we try to investigate and expand the area of visual data mining by proposing new visual data mining techniques for the visualization of mining outcomes.

© 2003 Elsevier Ltd. All rights reserved.

Keywords: Visual data mining; Databases; Association rules; Classification

1. Data mining

The process of searching and analyzing large amounts of data is called “data mining”. Large collections of data are potential lodes of valuable information but, as in real mining, the search and extraction can be a difficult and exhausting process [1].

Data mining is a knowledge discovery process of extracting previously unknown, actionable information from very large databases. In detail, it is the non-trivial extraction of implicit, previously unknown and potentially useful information from data. In other words, it is the search for relationships and global patterns that exist in large databases but are “hidden” among the vast amounts of data. These relationships represent valuable knowledge about the database and the objects in the world [2].

*Corresponding author. Tel.: +44-161-200-3309; fax: +44-161-200-3324.

E-mail addresses: [email protected] (I. Kopanakis), [email protected] (B. Theodoulidis).

URL: http://www.crim.co.umist.ac.uk.

1045-926X/$ - see front matter © 2003 Elsevier Ltd. All rights reserved.

doi:10.1016/j.jvlc.2003.06.002

1.1. Data mining life cycle

We view the life cycle of the data mining operation as a three-stage process of preparing the data for mining, deriving the model, and using the knowledge obtained from the data [3] (Fig. 1).

The data preparation stage deals with improving the data quality and summarizing the data to facilitate the analysis and discovery process. Data mining can be done on either operational databases or a data warehouse, which is usually a summary database of the various businesses of an enterprise. The quality of the data in the data warehouse is constantly monitored by data analysts. Because the source databases are heterogeneous and enforce non-standard policies on data quality, the warehouse data is usually cleaned or standardized via data scrubbing.

The model derivation stage focuses on choosing learning samples, testing samples and learning algorithms. Due to the large volume of available data, data mining may be done on subsets of the data from the data warehouse. An appropriate data sample is selected from the data in the warehouse and is checked for descriptiveness. This process may have to iterate a few times before a suitable sample set can be selected. The selected sample dataset forms the training data for the data-mining algorithm. The data-mining process is viewed in our framework as the derivation of an appropriate knowledge model of the patterns in the data that are interesting to the user. The algorithm for model derivation, together with the guidance provided by the user, will generally produce several models of the information contained in the data. The data-mining algorithms use guidance from the analyst to decide various parameters of the model being learned from the data, such as its accuracy and prevalence, and to control the computational complexity of the learning process.

Fig. 1. Information flow in data mining life cycle (Data Preparation Stage, Model Derivation Stage, Validation Stage).

Among all the models generated, the users may select only a few interesting models to be included in their applications. The usage and maintenance phase is concerned with monitoring of database updates and continued validation of patterns learned in the past. Even though the learning process may have user guidance, not all the knowledge models generated will have business applications. Only the interesting models are selected and applied in performing business tasks. Another important task in this stage of the life cycle is to continuously monitor the validity of the knowledge models in the context of changes to data in the warehouse. When the population in the warehouse shifts significantly, the previously learned models will no longer be applicable, and new models will have to be derived. We may also be able to learn new models incrementally from the new data.

1.2. Visual data mining

Visual data mining can be related to all the previously described sub-modules into which we have partitioned the overall data-mining task. The goal should be to provide a synthesis of visualization and data mining that enhances the effectiveness of the overall data mining process. Since this synthesis is rather new, there is very little work that covers both aspects.

Visual data mining involves the invention of visual representations that could be applied in all three data-mining life cycle stages, namely the data preparation, model derivation and validation stages. That concept also indicates the partitioning of visual data mining into three fields, each one targeted at producing visual representations that will enhance information and knowledge flow throughout each data mining module.

Visual data mining in the field of data preparation could be defined as the attempt to enhance or carry out some of the pre-processing module's tasks in a visual manner. That in general involves the visual manipulation of raw data according to the requirements posed by the following data mining stage of model derivation. By the term visual manipulation we mean the ability to handle, by the use of visualization techniques, the problems usually met in this stage, such as missing data fields, data transformations, sampling and pruning, and data discrepancies and inconsistencies. Such capabilities enable us to formulate accurate hypotheses and objectives in KDD, and to select carefully only the relevant and useful data to be sampled and extracted during the data pre-processing.

Visual data mining in the field of model derivation implies the specification, by visual means, of the model construction performed at this stage. Selection of the training data set and model, definition of its parameters, training process specification and outcomes storage are the general tasks of this stage. Beyond that, in our view, a visual overview of the whole model derivation module should also be provided. That actually implies evaluation, monitoring and guidance of this data-mining module. Evaluation includes the validation of training samples, test samples, and learned models against the data in the database, plus the appropriateness of data and learning algorithms for specific data-mining situations. Monitoring includes activities such as tracking the progress of the data-mining algorithms, evaluating the continued relevance of learned patterns in the context of database updates, etc. Guidance includes activities such as user-initiated biasing or altering of inputs, learned patterns and other system decisions.

Visualizing a model should allow a user to discuss and explain the logic behind the model with colleagues, customers, and other users. Getting buy-in on the logic or rationale is part of building users' trust in the modelling results [4]. If the user can understand what has been discovered, he/she will trust it and put it into use. A model that can be understood is a model that can be trusted. Unfortunately, users are often forced to trade off accuracy of a model for understandability. Advanced visualization techniques can greatly expand the range of models that can be understood by domain experts, thereby easing the accuracy/understandability trade-off [5].

Visual data mining in the validation stage could be defined as the graphical presentation of data, whether the data is base data, summary data, or mined outcomes extracted from data. This is a type of visual data analysis, where the analytic component is offloaded to human perception [6]. That implies that the basic objective of visual data mining in the validation stage is to represent as much as possible of the information hidden in the Data Space in our Visualization Space, in a way that lets the user acquire as much information/knowledge as possible from that representation. That attempt involves a mapping from the amount of information available to the amount of information that can be visualized by our visual data mining techniques [7].

These notions simply define the aim of any proposed visual data mining model: to produce information-rich visualization outcomes that are easily perceived by humans. Two main factors are emphasized by that statement. On the one hand, the visualization model should present as much information as possible and, on the other hand, this representation should be done in such a way that the knowledge engineer can easily acquire that knowledge. The difficulty in producing new visualization models lies in balancing those two factors while also aiming to increase the magnitude and the quality of the knowledge extracted.

1.2.1. Importance of visual data mining

In 1854, while searching for ideas to bring a cholera epidemic raging in London under control, Dr. J. Snow drew dots on a map of the neighbourhood at the locations of the recorded deaths. The map also showed the positions of the drinking-water wells. The concentration of deaths near just one of the wells was visually striking. He had the handle of the suspect well changed and the epidemic stopped! Apparently, the disease was being transmitted by contact with the handle. This true story is widely considered an early success of visualization [8].

Nowadays, visualization could be the link between the two most powerful information-processing systems: humans and the modern computer. Humans, unfortunately, have many limitations. In particular, we are quite limited in our ability to handle scale and are easily overwhelmed by the volumes of data that are now routinely collected. Data mining, an automated process, is a natural reduction technique that could complement human capabilities. Combining these two approaches for knowledge discovery is clearly a great idea [9]. Visualization could combine our tremendous pattern-recognition ability with the problem-solving data mining processes. Transforming and presenting problems visually could provide new insights and pave the way to their solutions [8].

The idea is to bridge the differences in approaches and encourage research to help produce new methodologies. Data mining is primarily centred on computational number-crunching techniques, with minimal user involvement as the machine attempts to extract various features of the data. Data visualization, at the other extreme, has emphasized user interactions and manipulations of graphical data representations for visual feature recognition and understanding. Each approach has its advantages and its major weaknesses. Whereas algorithms working in isolation can miss out on the “wisdom” that is readily available from human knowledge of the problem and the data, strictly manually guided approaches can easily cause users to lose their way in high-dimensional spaces. Between data mining and data visualization, the statistical camp employs user-applied numerical methods with standard graphical displays for visual interpretation. It is our belief that the joint efforts of those disciplines can provide breakthroughs in the most difficult analysis problems, along with helping overcome hurdles within individual fields [9–11].

For all those reasons, we strongly believe that the contribution of visual data mining could be of essential importance in making the knowledge engineer part of the data mining process and in taking advantage of the human perceptual system. Our definition of visual data mining gives us the flexibility to apply that concept to all three stages of the data-mining life cycle.

1.3. Summary

In this work we are mostly interested in producing visual representations applied to the validation stage. Our aim is to assist the knowledge engineer in acquiring enhanced knowledge and extracting valuable inferences through the visualization of outcomes produced by data mining processes.

Having investigated in Section 2 the problematic issues encountered during the exploitation of data mining outcomes, we continue by introducing our modelling techniques. Three models are presented for the visualization of association rules, each one providing a different perspective and level of detail over the visualized set of rules. Those models complement a suite of visualization techniques which interactively combine powers and lead to the derivation of inferences, as presented in the two evaluation scenarios (Sections 3.6.1 and 3.6.2).

Following our research track, we introduce in Section 4 two visualization techniques for the representation of relevance analysis outcomes. As presented in the corresponding case studies, these techniques allow the visualization of large relevance analysis outcomes, enhancing the derivation of temporal inferences.

Finally, for visualizing classification outcomes, we introduce the 3D class-preserving projection technique and its application in Section 5.1.1. The introduction of this technique makes it possible to produce 3D class-preserving projections that best discriminate among four class centroids.

2. Research focus

The most commonly used means of visualization is the relatively small computer screen, where the total number of data items that can be mapped at one time is limited [7,12]. The same restriction also holds for other representational means, such as printed views or virtual worlds, where for additional reasons we have considerable representational limitations. Those restrictions become even tighter if we consider the pace of growth that characterizes today's datasets. Taking those facts into consideration, we need to find allies in our attempt to gain insight into our data. Those allies are the data mining algorithms and their knowledge-extracting powers.

Our point of view suggests performing visual data mining not just on the raw data but also on the outcomes produced by the data mining algorithms. Our approach treats the visual mining of the results produced by the algorithms as a secondary “fine-tuned” step or information abstraction step [13,14]. This seems quite useful because, in most cases, the results of data mining algorithms (association rules, relevance analysis outcomes, classifications, etc.) are in a form that is difficult to understand for humans, who are accustomed to perceiving information through their visual senses.

Currently, it is a challenging task for designers of visual data mining environments to find the strategies, methods and corresponding tools to visualize a particular type of information. The graphical presentation should be simple enough to be easily understood, but complete enough to reveal all the information present in the model. This is a difficult balance because simplicity usually trades off against completeness [4]. It is not obvious how to effectively visualize the results of mining large amounts of data in N dimensions, where N can be as large as 1200 [9]. Visualization has a number of dimensions to be measured and is highly dependent on the user, the task, and the structure of the data. It is difficult to tease these apart to identify an optimal method [15].

Each type of data mining outcome has its specific characteristics regarding its comprehension. Investigating the specific problems that the knowledge engineer encounters in his/her attempt to exploit the mining outcomes will help us to be more precise about our research focus and the resulting visualization suggestions, as well as to justify our interest in their application.

2.1. Association rules

Mining for association rules, as a central task of data mining, has been studied extensively by many researchers. Much of the existing research, however, is focused on how to generate rules efficiently. Limited work has been done on how to help the user understand and use the discovered rules. In real-life applications, though, the knowledge engineer first wants to have a good understanding of a set of rules before trusting them and using the mining outcomes. Investigation and comprehension of rules is a critical prerequisite for their application.

Those issues become even more pressing if we consider several other outstanding problems in the field of mining for association rules. In brief, the first difficulty is the “large resulting rule set” problem. A rule-mining algorithm can easily generate a large number of rules that cannot be handled by human users. Some methods, such as rule pruning and interesting rules selection, have been proposed to deal with this problem. The second issue is the “hard to understand” problem. Rules produced by rule generation algorithms are often hard for human users to understand. The third problem is the “rule behaviour” problem. In real-life environments, data change over time. Rules mined in the past may not be valid in the future. As a consequence, each rule has a behaviour history over time [14].

In general, the mining of associations in a database creates rules that syntactically could be expressed as

IF List<Condition> THEN List<Result>    Support (%)    Confidence (%)

<Condition> ⇒ lower limit < numerical attribute <= upper limit
              numerical attribute <= upper limit
              lower limit < numerical attribute
              categorical attribute IN {sub-set list}

<Result> ⇒ lower limit < numerical attribute <= upper limit
           categorical attribute = categorical value

To clarify the syntactic formalization, List<Condition> is a set of conditions upon the values of several attributes. Each condition could be either a numerical attribute in a range with the upper, the lower or both limits specified, or a categorical attribute taking values from a finite set. Similarly, but less flexibly, List<Result> could be any combination of specific conditions. Those conditions, though, could only be either an equality condition on one categorical attribute or a numerical attribute in a range with both upper and lower limits specified.

Simplifying the mathematical definitions of support and confidence, we can say that the support of each rule is the percentage of tuples from the whole database that satisfy the rule's left-hand side clause (IF expression), and the confidence is the percentage of tuples from that set which also satisfy the rule's right-hand side clause (THEN expression).

Aiming for a general and concise mathematical formalization, the notation that we suggest for the purposes of our study is

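$$\text{IF } [\,\ldots (nVal_{i1} \le nAttr_i \le nVal_{i2}) \ldots (cAttr_j \text{ IN } \{cVal_{j1}, \ldots, cVal_{jN}\})\,] \ \text{THEN } [\,\ldots (nVal_{k1} \le nAttr_k \le nVal_{k2}) \ldots (cAttr_m = cVal_m)\,] \quad s\%,\ c\%$$

which can be further generalized to

$$\text{IF } \{\ldots,\ Attr_i \text{ IN } [Val_{i1}, Val_{i2}, \ldots],\ \ldots\} \ \text{THEN } \{\ldots,\ Attr_k \text{ IN } [Val_{k1}, Val_{k2}, \ldots],\ \ldots\} \quad s\%,\ c\%.$$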

The prefixes “n” and “c” denote numerical and categorical attributes and values. Both the IF and THEN clauses are finite sets of conditions, as is any set of categorical values in a categorical sub-expression. Numerical ranges with no lower or upper limit specified can be produced by setting the corresponding $nVal_{i1}$ or $nVal_{i2}$ value to $-\infty$ or $+\infty$, respectively. Similarly, sub-expressions of equality arise when $nVal_{i1}$ and $nVal_{i2}$ are equal:

$$\{nVal_{i1} = nVal_{i2},\ (nVal_{i1} \le nAttr_i \le nVal_{i2})\} \Rightarrow nAttr_i = nVal_{i1}.$$

The generalized notation can also produce all the possible syntactic forms of an association rule by assigning the appropriate numerical or categorical values to $Val_{i1}$ or $Val_{i2}$.
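The notation of this section can be sketched in code. The following is a minimal illustration, not the authors' implementation: the class names `NumCond`, `CatCond` and `Rule`, the function `support_confidence` and the sample rows are all hypothetical. It follows the formal notation (both limits inclusive, unspecified limits set to minus or plus infinity) and the simplified definitions of support and confidence given in the text, where support counts the tuples that satisfy the IF clause.

```python
import math
from dataclasses import dataclass


@dataclass
class NumCond:
    """nVal1 <= attr <= nVal2; an unspecified limit is -inf or +inf."""
    attr: str
    low: float = -math.inf
    high: float = math.inf

    def holds(self, row):
        return self.low <= row[self.attr] <= self.high


@dataclass
class CatCond:
    """attr IN {values}; a one-element set expresses equality."""
    attr: str
    values: frozenset

    def holds(self, row):
        return row[self.attr] in self.values


@dataclass
class Rule:
    if_part: list    # conditions of the IF clause
    then_part: list  # conditions of the THEN clause


def support_confidence(rule, rows):
    """Percentages as described in the text (assumption: not the usual
    joint-support definition). Support is the share of all rows matching
    the IF clause; confidence is the share of those rows that also match
    the THEN clause."""
    lhs = [r for r in rows if all(c.holds(r) for c in rule.if_part)]
    both = [r for r in lhs if all(c.holds(r) for c in rule.then_part)]
    support = 100.0 * len(lhs) / len(rows) if rows else 0.0
    confidence = 100.0 * len(both) / len(lhs) if lhs else 0.0
    return support, confidence


# Hypothetical toy data, for illustration only.
rows = [
    {"age": 25, "segment": "A"},
    {"age": 35, "segment": "B"},
    {"age": 45, "segment": "B"},
    {"age": 55, "segment": "A"},
]
rule = Rule(
    if_part=[NumCond("age", low=30)],  # upper limit unspecified, so +inf
    then_part=[CatCond("segment", frozenset({"B"}))],
)
support, confidence = support_confidence(rule, rows)
```

Representing an unspecified limit as an infinite value keeps every numerical condition in the same two-sided form, which is the uniformity the generalized notation relies on.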

2.2. Relevance analysis

Relevance analysis knowledge is qualitative and it is quite useful when mining large databases that hold information about many objects. The output produced by relevance analysis, though, is not as complex as in the case of association rules. That makes the knowledge engineer's effort easier. Several techniques have been utilized for the visualization of relevance analysis outcomes, each with its advantages, drawbacks and limitations, the most important limitation being the decreasing quality of the representation as the number of relevant attributes increases. When the number of attributes is large, the resulting visualization loses its basic characteristics and becomes a fuzzy mapping, confusing the analyst. Most of those techniques have also been targeted at the visualization of relevance analysis outcomes at a specific time point.

Our aim, therefore, was to produce a representation that is as simple as the content of relevance analysis itself and that behaves robustly as the number of examined relevant attributes increases. It should be capable of dealing with a large number of attributes without being overwhelmed by fuzzy characteristics. Additionally, we paid particular attention to the temporal aspect of the relevance analysis outcomes. On the one hand, that is because of the importance of the time factor and, on the other hand, because of the little research and applied work that has been done in this field.

When we examine the relevance of a specific attribute to a set of other fields, the output that is produced by relevance analysis techniques is, or can be transformed into, the form:

Examined Attribute    Relevant to Target Attribute    Uncertainty Coefficient
ExmAttr               List<TrgAttr>                   List<C%>

In other words, the uncertainty coefficient for each target attribute is obtained with respect to the examined attribute, producing a list of relevance factors, one for each target attribute. That syntactic representation could be transformed into the following mathematical formalisation:

$$ExmAttr : (TrgAttr_1, c_1\%) \ldots (TrgAttr_N, c_N\%).$$
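As a rough illustration of how output in this form might be produced, the sketch below computes an entropy-based uncertainty coefficient (Theil's U) for each target attribute given the examined attribute. This is an assumption: the text does not state which formula or direction of conditioning is used, and the function names and sample rows are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (natural logarithm) of a sequence of categorical values."""
    n = len(values)
    return -sum((k / n) * math.log(k / n) for k in Counter(values).values())


def uncertainty_coefficient(target, examined):
    """Theil's U(target | examined) = (H(target) - H(target | examined)) / H(target)."""
    h_target = entropy(target)
    if h_target == 0.0:
        return 1.0  # assumption: a constant target counts as fully determined
    n = len(target)
    groups = {}
    for t, e in zip(target, examined):
        groups.setdefault(e, []).append(t)
    # Conditional entropy: entropy of the target within each examined value,
    # weighted by how often that examined value occurs.
    h_cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return (h_target - h_cond) / h_target


def relevance(rows, exm_attr, trg_attrs):
    """Return the list (TrgAttr_1, c_1%), ..., (TrgAttr_N, c_N%) for ExmAttr."""
    examined = [r[exm_attr] for r in rows]
    return [
        (t, 100.0 * uncertainty_coefficient([r[t] for r in rows], examined))
        for t in trg_attrs
    ]


# Hypothetical toy data: "plan" is fully determined by "region",
# while "churn" is independent of it.
rows = [
    {"region": "N", "plan": "gold", "churn": "no"},
    {"region": "N", "plan": "gold", "churn": "yes"},
    {"region": "S", "plan": "basic", "churn": "no"},
    {"region": "S", "plan": "basic", "churn": "yes"},
]
result = relevance(rows, "region", ["plan", "churn"])
```

In this toy data the coefficient for `plan` is 100% and the coefficient for `churn` is 0%, the two extremes of the scale.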

2.3. Classification

Classification is a primary method for machine learning and data mining. It is used either as a stand-alone tool to get insight into the distribution of a data set, e.g. to focus further analysis and data processing, or as a pre-processing step for other algorithms operating on the detected clusters. The main questions that the knowledge engineer usually has in his/her attempt to understand the classification outcomes are:

* How well separated are the different classes?
* What classes are similar or dissimilar to each other?
* What kind of surface separates various classes (i.e. are the classes linearly separable)?
* How coherent or well formed is a given class?

Those questions are difficult to answer by applying conventional statistical methods to the raw data produced by the classification algorithm. Unless the user is supported by a visual representation that acts as his/her navigational tool in the n-dimensional classified world, drawing inferences will be a tedious task. Our main aim, therefore, should be to visually represent and understand the spatial relationships between various classes in order to answer questions such as those above.

Answers to these questions can enable the data analyst to infer inter-class relationships that may not be part of the given classification and, additionally, to gauge the quality of the classification and of the feature space. Discovery of interesting class relationships in such a visual examination can help in the design of a better classifier and also lead to enhanced feature selection. Such an analysis would be useful while training a classifier in a “pre-classification” phase, or in evaluating the quality of clusters in a “post-clustering” phase.

In general, a simple mathematical formalization that we could utilize to represent that a tuple $t_i$, with attributes $t_{i1}, t_{i2}, \ldots, t_{iN}$, belongs to class $c_i$ is as follows:

$$t_{i1} = v_{i1},\ t_{i2} = v_{i2},\ \ldots,\ t_{iN} = v_{iN},\ t_{i,N+1} = c_i.$$

The $t_{i,N+1}$ field is the classifying attribute and each $v_{ij}$, $j = 1, \ldots, N$, is the numerical or categorical value of the corresponding field.
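To make the tuple formalization concrete, the sketch below stores each classified tuple as $(v_{i1}, \ldots, v_{iN}, c_i)$ and computes class centroids and their pairwise distances as a crude numerical hint about class separation. It assumes purely numerical attributes and Euclidean distance, is not the projection technique proposed in this work, and uses hypothetical function names and data.

```python
import math
from collections import defaultdict


def class_centroids(tuples):
    """tuples: iterable of (v_1, ..., v_N, c) with numerical v_j.
    Returns a dict mapping each class label c to its centroid."""
    groups = defaultdict(list)
    for *values, label in tuples:  # the last field is the classifying attribute
        groups[label].append(values)
    return {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in groups.items()
    }


def centroid_distances(centroids):
    """Euclidean distance between every pair of class centroids."""
    labels = sorted(centroids)
    return {
        (a, b): math.dist(centroids[a], centroids[b])
        for i, a in enumerate(labels)
        for b in labels[i + 1:]
    }


# Hypothetical two-attribute data set with two classes.
data = [
    (0.0, 0.0, "c1"), (2.0, 0.0, "c1"),
    (4.0, 3.0, "c2"), (6.0, 3.0, "c2"),
]
centroids = class_centroids(data)
distances = centroid_distances(centroids)
```

Such distances only summarize separation by a single number per class pair; they say nothing about the shape of the separating surface or the coherence of a class, which is why the text argues for a visual representation.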

3. Visual representation of association rules

In this section, we propose three visual data mining models for the visualization of outcomes produced when mining for association rules. The proposed visualization techniques are based on abstract modelling concepts applied in the field of data mining, which justifies referring to them as visual data mining models. Addressing the problems of this field, our attempt is targeted at helping the knowledge extractor to visually analyze and understand a single rule or a set of rules, along with their changing behaviours. We believe that utilizing representations such as charts, scatter-plots and simple graphs is one promising way to enhance the visual discovery of knowledge, as long as our representations remain simple. Extensive work has been done on those types of representation, which could constitute our basic modelling infrastructure.

3.1. Bar chart form visual data mining model for association rules

Bar chart graphs have been utilized in numerous scientific and commercial applications, and their usefulness is well established. The ease with which they reveal the notions represented makes them easily perceived by analysts accustomed to such graphs, and a good starting point for our research effort. Based on the principal ideas of bar chart representations, we came up with the bar chart form visual data-mining model for association rules.

The core idea of our modelling proposal is that each of a rule's sub-expressions should be visualized by a bar in the chart. By the term sub-expression we designate any condition that forms part of the overall expression of a rule's left- or right-hand side clause. Based on the previously defined mathematical formalization, a sub-expression could have one of the following syntactic forms: $(nVal_{i1} \le nAttr_i \le nVal_{i2})$ or $(cAttr_j \text{ IN } \{cVal_{j1}, \ldots, cVal_{jN_j}\})$, depending on the sub-expression's attribute type (numerical or categorical). The items composing each sub-expression are represented by the graphical characteristics of the bar. The bar's length, depth, colour and position are features that are utilized to assist and enhance our representation.

An interesting observation about our modelling proposal is that the visual representation is constructed in a way that parallels the rule's fundamental components. Just as the assembly of sub-expressions constitutes a rule, the set of bars representing those sub-expressions forms our bar chart model, visualizing the overall rule. To define our model in detail, we continue in the following sub-sections by examining specific cases depending on the type of the rule to be visualized.

3.1.1. Bar chart form model and numerical sub-expressions

Introducing our model gradually, we first examine rules whose left- and right-hand side clauses are composed of numerical sub-expressions. For such rules, our modelling scheme maps each sub-expression's characteristics to the length and position of the corresponding bar in the chart. The horizontal axis (X-axis) of the chart carries the names of the numerical attributes participating in the rule. An empty place separates the bars of the rule's left-hand side clause from those of its right-hand side clause. The ordering of the sub-expressions, and consequently of the bars, may be specified by the user or by an automatic method (e.g. ordering by the number of tuples in the data set that satisfy each sub-expression's condition). The vertical axis is enumerated and scaled to cover the condition limits of all the sub-expressions in the rule.

According to our modelling suggestion, a sub-expression's numerical range is represented by the range that its bar occupies on the vertical axis. In detail, the bar's starting point is the lower limit of the sub-expression's condition and, analogously, the bar's end point is the upper limit of the condition. At the boundaries of the vertical axis we have also assigned the infinite values −∞ and +∞. This approach gives us the flexibility of visualizing in a uniform manner sub-expression ranges with either their upper or lower limit unspecified. For sub-expressions of equality (nAttr_i = nVal_i), the bar degenerates to a line segment. An example of our visual data-mining modelling proposal regarding association rules with numerical sub-expressions is presented in Fig. 2.

Expanding the model further, we suggest the utilization of colour. Instead of single-coloured bars whose position alone represents the related sub-expression's characteristics, we could enhance perception by a smooth, distinctive colouring of the bars that designates the extent of the corresponding sub-expression's range. This can be achieved by mapping, through normalization, the range of each sub-expression's condition to the colour range of its bar. Colour interpolation filters are considered appropriate for that aim. From our research, we suggest the PBC colour scale (colour scale for Perception-Based Classification) [7], based on the HSI colour model. An example representation according to the proposed visual data-mining model, utilizing the features described, is presented in Fig. 23.
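As an illustration only, the mapping of a numerical sub-expression to its bar can be sketched in Python as follows. The class NumericSubExpr, the function names and the pad parameter (the distance beyond the finite axis range at which the −∞ and +∞ positions are drawn) are assumptions of this sketch and not part of the model's specification; the colour normalization is one possible reading of the mapping described above.

```python
import math
from dataclasses import dataclass


@dataclass
class NumericSubExpr:
    """A numerical sub-expression lower <= attribute <= upper (hypothetical type)."""
    attribute: str
    lower: float = -math.inf   # unspecified lower limit
    upper: float = math.inf    # unspecified upper limit


def bar_extent(sub, axis_min, axis_max, pad):
    """Return (start, end) of the bar on the vertical axis.

    The finite part of the axis spans [axis_min, axis_max]; the positions
    axis_min - pad and axis_max + pad stand for -inf and +inf (assumption).
    """
    start = axis_min - pad if math.isinf(sub.lower) else sub.lower
    end = axis_max + pad if math.isinf(sub.upper) else sub.upper
    return start, end


def colour_span(sub, attr_min, attr_max):
    """Normalize the condition's range to [0, 1] positions on a colour scale."""
    width = attr_max - attr_min
    lo = 0.0 if math.isinf(sub.lower) else (sub.lower - attr_min) / width
    hi = 1.0 if math.isinf(sub.upper) else (sub.upper - attr_min) / width
    return max(0.0, lo), min(1.0, hi)
```

An equality sub-expression has lower equal to upper, so bar_extent returns a zero-length extent, which corresponds to the line segment mentioned above.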

3.1.2. Including categorical sub-expressions in the bar chart form model

To make our modelling suggestion more flexible, we pose no restrictions on the rule's type. To include categorical attributes in the model, we suggest that the bar of each categorical sub-expression be divided into smaller, differently coloured segment areas, each one representing a categorical value. Assigning a single colour to each categorical value aids both in discriminating the values and in identifying identical ones, independently of the sub-expression to which they belong. As a categorical value might participate in different sub-expressions, the single-value colouring method enhances our ability to identify its presence wherever we trace similarly coloured segment areas. This modelling and coding option was preferred because it extends the representation uniformly to include categorical sub-expressions.

Fig. 2. Bar chart form model and numerical sub-expressions.

The representation chosen in our visual data-mining model, together with the colouring scheme, produces a bar chart form where numerical and categorical bars are easily distinguishable. Although both rest on the same underlying representational notions, numerical and categorical sub-expressions can easily be told apart, as shown in the example of Fig. 3. The size of the categorical segment areas was chosen to be the same for all categorical values, as there was no indication for choosing differently. Options for specific case representations, such as mapping high–medium–low categorical values to bright–normal–dark colours or to big–medium–small segment sizes, are also supported in our model, although they are highly dependent on each case study.

The length of the bar corresponding to a categorical sub-expression is proportional to the number of categorical values in that sub-expression. This harmonized notion of proportional mapping is preserved in the representation of both numerical and categorical sub-expressions. Thus, we could argue that in all cases the size of the bar is proportional to the sub-expression's "norm", indicating the condition's range or the number of categorical values.

3.1.3. Support and confidence in bar chart form model

A rule's support and confidence factors are of vital importance, as they indicate the strength of the rule in the mined data set. We suggest two options for the representation of this information. In the first option, the two factors are presented in the form of bars: two additional bars at the rightmost place of the chart, separated by an empty column from the bars representing sub-expressions, indicate the rule's support and confidence respectively.

In the second option, aiming at a compact representation of a rule, we map the rule's support and confidence onto the background of the corresponding visualization view. A coloured background combined with a filling pattern provides the flexibility of a different mapping: the background colour indicates the level of support and the chosen filling pattern the confidence. Again, colour and pattern mapping conventions were investigated and chosen depending on what would best reveal the underlying information. As in Fig. 24, we have chosen to map the rule's support proportionally to the brightness of the background's grey scale and the confidence to the density (thickness) of the grid-line pattern.

Fig. 3. Bar chart form model and categorical sub-expressions.

According to this modelling option, a rule with high strength factors has a bright background with an intense overlapping filling pattern. Such an approach strongly influences the overall representation, as well as the way the user perceives the coded information. Another advantage of this alternative representation is that it decreases the screen area required for the visualization. This modelling option can be particularly desirable when multiple rules are visualized on a single screen, as the compact form is considered more suitable. Specific examples and case studies are provided in the evaluation Section 3.6.
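The background coding of the compact option can be sketched as a pair of linear mappings. This is a minimal Python sketch under our own assumptions: support and confidence are taken as fractions in [0, 1], the grey level is an 8-bit value, and max_lines (the number of grid lines drawn at full confidence) is a hypothetical parameter; the text above fixes only the direction of the mapping, not these scales.

```python
def background_style(support, confidence, max_lines=20):
    """Map a rule's strength factors to background properties.

    support    -> brightness of the grey background (0 = black, 255 = white)
    confidence -> density of the grid-line pattern (number of lines drawn)
    Both linear scales are assumptions of this sketch.
    """
    if not (0.0 <= support <= 1.0 and 0.0 <= confidence <= 1.0):
        raise ValueError("support and confidence must lie in [0, 1]")
    return {
        "grey_level": round(255 * support),
        "grid_lines": round(max_lines * confidence),
    }
```

A rule with high support and confidence thus receives a bright background with a dense grid pattern, as described above.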

3.1.4. Bar chart form model and evolution over time

We are particularly interested in taking the time aspect into consideration in the construction of the proposed models. That aim stems from the importance of the time factor both in the mining procedure and in the knowledge extracted. When the data set is time-stamped, it is in the interest of the knowledge extractor to take that additional information into consideration and to extract time-stamped inferences as well. Understanding of those outcomes can be enhanced by modelling techniques that deal adequately with the temporal factor. As the importance of temporal data mining is increasing, visual data mining techniques have to conform to that trend and, to be complete, should be able to represent the evolution over time of the temporal outcomes produced.

Based on those notions, we follow two approaches for the visualization of time-stamped association rules: either animation over the existing static model, or the construction of a 3D world with multiple replicas of the static model. Each of those static models corresponds to a specific time point, depending on its position on the time axis (usually the depth axis).

In the first case, as time passes, the animated growth or shrinkage of the bars and their changing position represent the change and evolution over time of that specific rule. Additionally, all other features utilized in our representation (colour, depth, segment areas, etc.) evolve accordingly, following the changes over time of the corresponding rule. That animated representation, performed at a speed that the user considers appropriate for his/her knowledge extraction task, provides insight into the time factor of the outcomes visualized, revealing the underlying temporal knowledge.

In the second approach, as we navigate in the 3D space, future or past states of the rule are revealed with respect to our current time point. In detail, the plane defined by the X and Z axes of our coordinate system, as well as any other plane parallel to it, provides the basic environment for a single rule's representation at a specific time point. The plane parallel to the X- and Z-axes that intersects the time axis at a specific time point is our projection framework for visualizing the corresponding rule at that time point.

Both models are considered adequate for the mining of time-stamped association rules. The animated bar chart form model provides a simple single-screen representation with minor requirements on the user's behalf. The knowledge extractor only has to adjust the animation's speed according to the desired depth of insight and to the learning rate that he/she is capable of following. In an analogous way, the 3D bar chart form model provides the same potential for mining a rule's history. By zooming in and out of the 3D representation constructed, we can acquire either an abstract overview or a detailed inspection of the rule's evolution.

3.2. Visualizing a set of association rules

Our research objective is to produce visual representations of a set of rules in a single view according to the model defined. Up to this point, we have aimed at providing insight into a single rule through its visualization, in order to make the underlying knowledge understood, trusted and usable. In real-life problems, though, we usually deal with a set of association rules.

That new perspective poses additional requirements. The knowledge extractor in such cases seeks inferences regarding either the whole set of rules or a subset of it. That information should be available for extraction by focusing our attention either on the overall representation or on a sub-area of it. Those concepts led us to the construction of a compact view into which the concise representation of each rule is integrated. Each chart representation, considerably reduced to its basic characteristics, reveals the abstract form of the rule. With that guideline in mind, our attempt was to represent as many rules as possible in a single view, while maintaining the visualization principles that we already had.

Two main issues arise from such an attempt. On the one hand, we should define the reduced representation of a rule and, on the other hand, the placement of those abstract representations. Our intention is to produce a smart placement of concise rule representations that will enhance human perception abilities and the mining effort.

3.3. Similarity arrangement of association rules

The problem of defining a smart placement of rules in a compact visualization is not a simple one. Relevant issues have been investigated in the field regarding the ordering and placement of attributes in multidimensional data visualizations, as in the case of attribute ordering in the parallel coordinates model [8], but no general solution exists, as the answers proposed are quite case dependent. In our attempt to provide a solid suggestion for the problem specified and to produce an advanced placement of association rules, we first define the similarity factor between two rules. Based on that measure we find the similarity factor among all rules to be visualized, and then we indicate a smart placement pattern for those rules in order to meet the newly posed requirements.

Moving one step at a time, we first define the attributes participation table. That is an n × m matrix, where n is the number of rules to be visualized and m the number of distinct attributes participating in any sub-expression of that set. Each rule is assigned to a row and each distinct attribute name to a column. The table element p_ij, if equal to one, designates the participation of attribute j in the ith rule (p_ij is equal to 0 otherwise):

\[
\begin{bmatrix} p_{11} & \cdots \\ \vdots & p_{nm} \end{bmatrix},
\quad \text{where } p_{ij} =
\begin{cases}
1, & attr_j \in rule_i, \\
0, & attr_j \notin rule_i.
\end{cases}
\]
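The construction of the attributes participation table can be sketched as follows. In this Python sketch each rule is represented simply as the set of attribute names appearing in its sub-expressions; that representation, the function name and the alphabetical column order are assumptions made for illustration.

```python
def participation_table(rules):
    """Build the n x m attributes participation table.

    rules: list of sets, each holding the attribute names that take part
           in any sub-expression of one rule (assumed representation).
    Returns (attributes, table), where attributes lists the m column
    names and table[i][j] is 1 if attributes[j] participates in rule i,
    0 otherwise.
    """
    # Columns: all distinct attributes, in alphabetical order (assumption).
    attributes = sorted(set().union(*rules))
    table = [[1 if attr in rule else 0 for attr in attributes]
             for rule in rules]
    return attributes, table
```

For two rules over the attributes {age, income} and {income, city}, the columns are age, city, income and the rows are [1, 0, 1] and [0, 1, 1].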

Before defining the similarity factor, for clarity we first specify the term dimension (≡ ||…||) of a rule. As each attribute participating in a rule adds a dimension to our data space, we define the dimension of an association rule as the number of attributes participating in that rule, in either its left- or right-hand side clause (e.g. ||if (a # …)(b # …)(c # …) then (d # …)|| = ||if (k # …)(l # …)(m # …)(n # …)|| = 4). Based on the attributes participation table, we may now define the rules similarity table as

\[
\begin{bmatrix} s_{11} & \cdots \\ \vdots & s_{nn} \end{bmatrix},
\quad \text{where } s_{ij} = s_{ji} =
\frac{2\sum_{k=1}^{m} p_{ik}\,p_{jk}}{\lVert r_i \rVert + \lVert r_j \rVert}.
\]
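The similarity table follows directly from the participation table. A minimal Python sketch is given below; it assumes that the participation table is supplied as a list of 0/1 rows and that every rule has at least one attribute, so that the denominator is never zero. The function name is ours.

```python
def similarity_table(p):
    """Compute the n x n rules similarity table from participation table p.

    p[i][k] is 1 if attribute k participates in rule i, else 0.
    s[i][j] = 2 * (number of shared attributes) / (||r_i|| + ||r_j||),
    where ||r_i||, the dimension of rule i, is the sum of row i.
    Assumes every rule has at least one attribute.
    """
    dims = [sum(row) for row in p]
    n = len(p)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            shared = sum(a * b for a, b in zip(p[i], p[j]))
            s[i][j] = 2 * shared / (dims[i] + dims[j])
    return s
```

Two rules over identical attributes obtain the value 1, and two rules with no attribute in common obtain 0.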

In this table, all rules to be visualized are assigned to both columns and rows in the same ordering, which is arbitrary and irrelevant to the following steps. Table element s_ij specifies the similarity factor between rules i and j. Dividing by the sum of the two rules' dimensions makes the measurement invariant to the sizes of the rules compared, providing an accurate indication of similarity. Values close to 1 indicate high similarity among rules, and small values, close to zero, low similarity. The resulting table is square and symmetric.

Having constructed the similarity table, we may proceed to defining the rules placement. Our intention is to define a concise view where each rule is surrounded by its most similar rules of the set to be visualized, in order to produce sub-areas in the representation that might be of interest to the knowledge engineer. The strategy that we follow is based on the similarity table. We first suggest a linear ordering of the rules, which is later transformed into a 2D rules placement by the utilization of a smart filling pattern. For our case of representation, that produces neighbourhoods of similar rules. This step-by-step approach maintains the flexibility to propose analogous suggestions for similar problems, as the same notions under a different filling pattern could produce a 1D or 3D rules placement, providing a concise view even for different modelling suggestions (Fig. 4).

The attempt starts by finding a linear sequence of rules. Our suggestion is that the ordering procedure should define a sequence where the two most similar rules come first, then the rule most similar to the second one follows, and so on. In other words, in that linear placement each rule should have on its right the rule most similar to it, among those not yet placed, according to the similarity measurement defined. The starting point of the sequence is the two rules that have the highest similarity factor. Having found the highest value in the table, we continue by searching the similarity table for the rule most similar to the second one, again comparing the factors among them. In case of equal similarity values, resulting in an ordering conflict, we also take into consideration the support and confidence of each rule, giving priority to the highest factor.

Continuing with the arrangement of rules in the visualization area, the placement that we suggest is a recursive pattern based on a generic scheme. It is a simple back-and-forth arrangement, proceeding perpendicular to the diagonal axis of a grid form. First, the elements are placed from down-left to up-right until we meet the borders of the grid form, then below, backwards from up-right to down-left, then again forward as previously, and so on. As we proceed towards the centre of the grid, the length of the route increases, and it decreases again after passing it.

The visualization area is a square grid form of dimension d (d ∈ ℕ). The size of the grid is the smallest one capable of containing all rule representations. In other words, the condition (d − 1)² < n ≤ d² should hold, where d is the grid's dimension and n the size of the set to be visualized. This placement, as will be noticed in our case studies, provides a semantically meaningful arrangement of closely related rules according to our similarity criterion, with nice clustering properties. That kind of trace walks through our visualization area in a frugal way, producing clusters of similar rules, accordingly mapped to the visualization area.

By such an arrangement we direct the visual mining attempt to sections of the compact representation that we have arranged to consist of similar rules. Instead of trying to visually locate patterns of interest in the whole view, which would be like searching for a needle in a haystack, the knowledge extractor searches for patterns in the pre-constructed clusters. Following this approach we have managed to construct neighbourhoods of mining interest, where the user is able to query for inferences.

A detailed illustration of the characteristics of this approach, applied to real-life case studies, is given in Section 5. In Fig. 5 we give an indicative example, in order to present an early view of the method's behaviour. In this example, a visualization of rules derived from the echocardiogram [16] data set, we can easily distinguish the three clusters of similar rules at the upper-left, middle and down-right sub-areas of the view. The reduced representation of the rules in this example was chosen to be just the core of the modelling suggestion, with no legend or axis labelling. By customizing the properties of the view, we may reveal or conceal that information.
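The linear ordering step described in this section can be read as a greedy chain over the similarity table. The Python sketch below is one such reading and rests on several assumptions of ours: the starting pair is listed with the lower rule index first, ties in similarity are broken by comparing (support, confidence) pairs, and any remaining tie goes to the lower rule index.

```python
def linear_order(sim, strength):
    """Order rule indices so that each rule is followed by its most similar
    rule among those not yet placed.

    sim:      n x n similarity table (list of lists).
    strength: list of (support, confidence) pairs, used only to break
              ties in similarity (assumed tie-breaking rule).
    """
    n = len(sim)
    if n < 2:
        return list(range(n))
    # Start with the pair of distinct rules having the highest similarity.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: sim[pair[0]][pair[1]],
    )
    order = [first, second]
    remaining = [r for r in range(n) if r not in (first, second)]
    while remaining:
        last = order[-1]
        # Highest similarity to the last placed rule; ties go to the
        # stronger rule, then to the lower index.
        nxt = max(remaining, key=lambda r: (sim[last][r], strength[r], -r))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

The resulting sequence of rule indices is the input to the 2D filling pattern.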


Fig. 4. Smart 2D placement.
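The grid size condition and the back-and-forth diagonal filling pattern can be sketched as follows. The Python code is our interpretation of the placement described in this section: it assumes that the walk starts at the upper-left cell and ends at the down-right cell, and it addresses cells as (row, column) pairs with row 0 at the top. The function names are hypothetical.

```python
import math


def grid_dimension(n):
    """Smallest d with (d - 1)**2 < n <= d**2, for n >= 1 rules."""
    if n < 1:
        raise ValueError("at least one rule is required")
    return math.isqrt(n - 1) + 1


def zigzag_cells(d):
    """Cells of a d x d grid visited along successive anti-diagonals,
    alternating direction (down-left to up-right, then back)."""
    cells = []
    for k in range(2 * d - 1):
        # Anti-diagonal row + col == k, listed from up-right to down-left.
        diagonal = [(row, k - row) for row in range(d) if 0 <= k - row < d]
        if k % 2 == 0:
            diagonal.reverse()  # walk from down-left to up-right
        cells.extend(diagonal)
    return cells


def place_rules(order):
    """Place an ordered sequence of rules on the smallest square grid."""
    d = grid_dimension(len(order))
    grid = [[None] * d for _ in range(d)]
    for rule, (row, col) in zip(order, zigzag_cells(d)):
        grid[row][col] = rule
    return grid
```

Consecutive rules of the linear sequence always land in adjacent cells, which is what keeps similar rules in the same neighbourhood of the view.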


3.4. Grid form visual data mining model for association rules

In our attempt to provide a different perspective on the visualization of a set of rules, we came up with the grid form visual data-mining model. Our intention was to provide a view of a set of rules under a common denominator, in order to make their comparison easier. For that reason the model was oriented from the very beginning towards visualizing a collection of rules and providing an overall, comparable view of that set. That of course has both advantages and disadvantages on matters such as the level of detail or the number of rules visualized.

The approach that we have followed in order to create an abstract representation of a set of association rules is based on a cluster of cells forming a grid. Named after its basic framework, the grid form visual data-mining model has each column corresponding to an attribute and each row representing a rule. The crossing area of an attribute's column and a rule's row, which forms a cell of the grid, might correspond to a sub-expression. That is specified by the existence of a bar in the cell, as well as by its colouring scheme.

Fig. 5. Bar chart form model and rules placement.

In detail, the existence of a bar in a cell denotes the participation of the corresponding field in a sub-expression of the rule related to that row. The condition's characteristics are coded in the properties of the bar. A coding approach similar to that of the bar chart form model has been adopted for each bar in order to represent the detailed condition information, as can be seen in Fig. 6. Similarly, in order to be able to represent infinite sub-expressions, we have assigned the infinite values −∞ and +∞ to the left and right boundaries of each cell. Additionally, close to the boundaries of each cell we have assigned the normalized minimum and maximum values of that attribute. Again, the length and position of the bar, as well as colour interpolation filters, are utilized for coding the ranges of numerical sub-expressions. In the case of categorical sub-expressions, coloured bar segments corresponding to categorical values code the specific details. In the example of Fig. 6, the bars within the corresponding grid cells indicate a numerical sub-expression with both upper and lower limits specified and a categorical sub-expression with three categorical values.

As we have already mentioned, the concept of rules ordering according to our similarity criterion may also be applied in the case of the grid form model. Following the same procedure, we construct the rules similarity table, from which we extract the sequence of rules. In the final step, though, as the pattern placement requirements are one-dimensional, the constructed rules ordering also defines the placement of the rules. In other words, the rules ordered according to the similarity criterion, defining the corresponding linear sequence, are mapped to the vertical axis of the grid form. An example of a set of rules ordered according to our similarity criterion and visualized based on the bar chart form model is presented in Fig. 25.

Addressing matters such as the evolution over time, we could suggest either the time-oriented approach of animation or the construction of 3D worlds. In the first case, as we have already seen, the animated shrinkage and growth of the bars, as well as their changing position and colour, reveal the evolution over time of the rules visualized. In the latter case, we follow an approach to the construction of the 3D world that differs from the one used in the previous model, as an alternative tactic is considered more suitable.

Fig. 6. Colour mapping on the grid form model.

According to the alternative suggestion, each grid form is placed so that it forms one side of a perspective wall in 3D. Each side of the wall represents the set of rules visualized at a specific time point. The n sides of the perspective wall reveal the overall evolution over time of the set of rules in the time period defined by the chosen n time points. By rotating the perspective wall clockwise or anti-clockwise, the past and future behaviour of the set of rules is revealed with respect to our current time point. An indicative example is presented in Fig. 7.

Reducing the dimensions of the perspective wall, the flexible and adaptive properties of this model make possible its application to the visualization of the evolution over time of a single rule. In a 2D grid form, all rows can hold the representation of a single rule at several time points. In this approach, the vertical axis of our coordinate system is time-stamped, defining the time point related to each row of the grid. The representation of the rule at each time point is visualized in the corresponding row of the grid. Such an alternative perspective provides the time-oriented context of a rule in a single view. Knowledge regarding the time-oriented behaviour of the rule with respect to each specific attribute is revealed when we focus on the columns of the grid, moving our inspection vertically.

Fig. 7. Grid form model: perspective wall.

The grid form visualization model is considered adequate for producing compact, concise and abstract representations of a set of association rules. Abstract knowledge can be extracted by looking for colour and sequence-of-cube patterns or any other distinguishable features in the view constructed. Furthermore, the overall construction, along with the placement, makes the grouping of attributes easier. As a consequence, their enhanced examination results in extracting detailed inferences over either static or time-oriented sets of rules. We believe that this model provides an alternative perspective on a set of rules, bringing us one step closer to creating a suite of models for the visualization of association rules, as desired for most visual data mining tasks.
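The coding of a numerical sub-expression inside a grid cell, as described earlier in this section, can be sketched as a mapping to horizontal positions within the cell. In this Python sketch the cell spans the interval [0, 1], its two boundaries stand for −∞ and +∞, and the attribute's minimum and maximum are placed at a distance margin from the boundaries. The function name, the margin parameter and the linear scale are assumptions of the sketch.

```python
import math


def cell_bar(lower, upper, attr_min, attr_max, margin=0.1):
    """Return (left, right) positions of a bar inside a unit-width cell.

    0.0 and 1.0 are the cell boundaries, standing for -inf and +inf.
    attr_min and attr_max are drawn at margin and 1 - margin (assumption).
    """
    def position(value):
        if value == -math.inf:
            return 0.0
        if value == math.inf:
            return 1.0
        fraction = (value - attr_min) / (attr_max - attr_min)
        return margin + fraction * (1 - 2 * margin)

    return position(lower), position(upper)
```

A sub-expression with an unspecified upper limit therefore produces a bar that reaches the right boundary of its cell.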

3.5. Parallel coordinates visual data mining model for association rules

Inventing visual data mining models is actually conceiving new mappingtechniques from the multidimensional space to a lower dimensional space evenin the case of association rules. As each attribute participating in a rule isactually adding an additional dimension to our data space, we try to map eachassociation rule existing in Rn to a lower dimensional space, which could berepresented in a display. The main issue that evolves with such an attempt is not onlyhow we could display data having many more than two variables but also how wecould represent the underlying information and knowledge of a rule in ourvisualization. That makes our task even more demanding. In the visualization ofmultivariate data sets, lots of ingenious methodologies visually encoding multi-variate points sets were developed. Many of them though, are laborious with highrepresentational complexity limiting the number of variables that can be handledand lose valuable information.Our approach is based on the parallel coordinates system. In geometry

parallelism, which does not require the notion of angle, rather than orthogonalityis the more fundamental concept. This, coupled with the fact that orthogonality‘‘uses-up’’ the plane very fast, was the inspiration for parallel coordinates [8].Following those notions, as long as the fundamental concepts of parallel coordinatessystem, we came up with the parallel coordinates visual data mining modelfor association rules. In the original idea of the parallel coordinates system thegoal was the visualization of multidimensional geometry and multivariateproblems without loss of information. Furthermore, in our case, our attempt isalso to find a mapping capable of transforming association rules in such a formthat they could be visualized according to the principal ideas of parallelcoordinates.Analogously to the original parallel coordinates system, we map the attributes

participating in any rule’s sub-expression to the equidistant axes, which are parallelto one of the display axes. Each rule viewed as a point in the n-dimensional space ispresented as a polygonal line, intersecting each of the axes at that point, whichcorresponds to the characteristics of the considered sub-expression. Every time wefollow the trace of a rule in this representation, inspecting a crossing point among thecorresponding segment line and a parallel axis, is like gradually examining anadditional rule’s sub-expression. Several issues rise with those sort statements. Weshould first clarify what we mean by the term sub-expression’s corresponding pointand how the segment line and parallel coordinate crossing point reveals sub-expression’s characteristics. In other words, we should define the mapping procedure

ARTICLE IN PRESSI. Kopanakis, B. Theodoulidis / Journal of Visual Languages and Computing 14 (2003) 543–589562

that we follow in the representation of each sub-expression both for categorical andnumerical attributes.Furthermore, when visualizing a set of rules, we should also examine how we treat

the cases where an attribute does not participate in a rule, yet it has a correspondingaxis in our visualization. That occurs when an attribute appears in at least one sub-expression of another rule that is also included in the data set to be visualized. In theoriginal parallel coordinates model we assume the all n-dimensional points havespecific values along all dimensions. That does not conform in our case, as ourattempt is to define a model flexible enough to represent a set of rules, which are notnecessarily composed from exactly the same attributes.Last, but not less important and one of the difficult problems in using parallel

coordinates is to somehow define the axes ordering and find an axes permutationwhich is ‘‘good’’ for our representation. Addressing such an issue should result in amodel that is invariant from rules ordering in the data set, as long as to the sub-expressions ordering in a rule. Furthermore we want a robust model to produce thesame representation for the same set of rules and a smart ordering which willenhance the knowledge acquisition.Having posed those specific requirements and starting with the numerical sub-

expressions we could propose the crossing points of those cases to be the mean valueof the numerical range. That is of course when the range has both the lower andupper limits specified. In the case of infinite ranges, with either upper or lower limitsunspecified, we define that the crossing point should be the finite limit of the range.The discrimination on the type of the sub-expression will be based on the markingtype of the crossing point. In detail, when we are visualizing a sub-expression withboth ranges specified a horizontal small segment line will mark that case. The lengthof that marking line will be analogous to the range indicating its width. In the othercases, with infinite sub-expressions, the marking point will be an arrow pointingfrom the starting point of the range to the infinite value (either N or +N)depending on the case. That formalization will be giving additional hints regardingthe detailed characteristics of the numerical sub-expression.Regarding categorical sub-expressions we present two different approaches

depending on the characteristics of the categorical attribute. Categorical attributeshave a finite set of values either self-characterized, designating their ordering (i.e.high, medium, low), or providing general clustering information (i.e. married, single,divorced, widowed), usually with a larger number of distinct values. Self-characterized categorical attributes usually have an obvious assignment on an axis.In the later case though, the representation depends on the random assignment ofnumerical values to the distinct categorical values. Moreover, there is nostraightforward way to indicate the participation of more than one categoricalvalues in a single sub-expression (i.e. attr IN {A,B}) according to our modellingapproach as we cannot have more than one crossing point between a segment lineand a parallel axis.For cases where the categorical attribute does not have an obvious or predefined

ordering, or when in a sub-expression we have more than one categorical valuesassigned to an attribute, we propose the utilization of the ‘‘dimensional expansion’’

ARTICLE IN PRESSI. Kopanakis, B. Theodoulidis / Journal of Visual Languages and Computing 14 (2003) 543–589 563

technique ( flattening) [17]. According to this technique each categorical attributeshould be expanded to as many new dimensions as the number of the categoricalvalues that it can take. In other words, we have as many parallel axes for acategorical attribute as its number of the distinct categorical values that is can take.For each one of those axes, the crossing point with a segment line at the highest pointindicates the participation of the corresponding categorical value at that rule’s sub-expression. In the opposite case, when the crossing point is at the minimum of anaxis, we represent the non-use of that value.The number of dimensions of the data space, and as a consequence the number of

the parallel axes, has been increased by the application of this method (Fig. 8).However, parallel coordinates visualization technique can better illustrate thecategorical nature of the sub-expressions expanded in that way. On the representa-tion of general categorical attributes with many distinct values and particularly insub-expressions with many categorical values, the utilization of this technique isconsidered essential. Mining for clusters and patterns of rules can be significantlyenhanced by the utilization of the dimensional expansion method, as we can have adetailed view refined even in the level of a single attribute.Regarding attributes that do not participate in a rule yet they have a

corresponding axis, our modelling suggestion is that the polygonal line should have a faint colour in the area of the corresponding attribute, with no connection marker between the segment line and the parallel axis. That approach results in a clear representation, indicating that there is no relevance between that rule and the corresponding numerical or categorical attribute.

As for the ordering of the attributes, and as a consequence the parallel
coordinates sequence, we suggest that the attributes should first be divided into two categories depending on their type (numerical and categorical) (Fig. 9). Each category, ordered according to the frequency of each attribute in the set of rules, contributes to the overall ordering. This simplified choice is justified by the fact that we do not want to add more computational complexity, as this solution was found to be acceptable for our case.

The brightness of the colour and the thickness of the segment line corresponding
to the specific rule have been utilized for the encoding of the support and confidence factors. In Fig. 26 we present an indicative example of rules derived from the echocardiogram [16] data set, visualized with the parallel coordinates model. The common behaviour that some rules share is quite evident.
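The attribute ordering described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `order_attributes`, the choice to place numerical attributes before categorical ones, the descending frequency order, and the alphabetical tie-break are all assumptions, and the example rules are hypothetical.

```python
from collections import Counter


def order_attributes(rules, attribute_types):
    """Order axes: numerical attributes first, then categorical ones,
    each group sorted by how often the attribute occurs in the rules."""
    # Count in how many rules each attribute participates.
    frequency = Counter(attr for rule in rules for attr in set(rule))

    def ordered_group(kind):
        names = [a for a, t in attribute_types.items() if t == kind]
        # Most frequent first; ties broken alphabetically (an assumption).
        return sorted(names, key=lambda a: (-frequency[a], a))

    return ordered_group("numerical") + ordered_group("categorical")


# Hypothetical rules, each given as the attributes used in its sub-expressions.
rules = [
    ["age", "education"],
    ["age", "capital_gain", "education"],
    ["age", "class"],
]
attribute_types = {
    "capital_gain": "numerical",
    "age": "numerical",
    "class": "categorical",
    "education": "categorical",
}
axis_order = order_attributes(rules, attribute_types)
```

The cost is a single pass over the rules plus two small sorts, which is in line with the stated wish to avoid additional computational complexity.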

Name    Income          Name    IncomeHigh  IncomeMedium  IncomeLow
Name1   High            Name1   1           0             0
Name2   Medium          Name2   0           1             0
Name3   Low       →     Name3   0           0             1
Name4   High            Name4   1           0             0
Name5   Low             Name5   0           0             1

Fig. 8. Dimensional expansion of a data set.
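The expansion shown in Fig. 8 can be reproduced with a few lines of code. The sketch below is only an illustration of the idea: the function name `dimensional_expansion` and the dictionary-based row format are assumptions, not part of the original tool.

```python
def dimensional_expansion(rows, attribute):
    """Expand one categorical attribute into one 0/1 column per distinct value."""
    # Distinct values in order of first appearance (e.g. High, Medium, Low).
    values = list(dict.fromkeys(row[attribute] for row in rows))
    expanded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != attribute}
        for value in values:
            # 1 marks the axis maximum (value used), 0 the axis minimum (not used).
            new_row[f"{attribute}{value}"] = 1 if row[attribute] == value else 0
        expanded.append(new_row)
    return expanded


# The data set of Fig. 8.
data = [
    {"Name": "Name1", "Income": "High"},
    {"Name": "Name2", "Income": "Medium"},
    {"Name": "Name3", "Income": "Low"},
    {"Name": "Name4", "Income": "High"},
    {"Name": "Name5", "Income": "Low"},
]
flat = dimensional_expansion(data, "Income")
```

Each new column corresponds to one additional parallel axis, so an attribute with k distinct values contributes k axes to the representation.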

3.6. Evaluation of the visual data mining models for association rules

The three techniques presented make up a suite of visual data mining models for association rules. By providing different perspectives over a set of association rules, they support the visual data mining effort by making possible the visual extraction of knowledge in a drill-down manner. By using the appropriate model, we may extract detailed or general inferences from the refined or abstract representations. We consider that, moving from the concise bar chart form model to the parallel coordinates model and finally to the grid form model, we construct abstract, middle-level and detailed representations respectively. We believe that the contribution of their combined application is greater than the sum of their contributions as individual techniques.

3.6.1. Dermatology case study

For our case study we have chosen the field of medicine, and more specifically the mining for association rules in the dermatology data set of the UCI Repository of machine learning databases [16]. The differential diagnosis of erythemato-squamous diseases is a real problem in dermatology. They all share the clinical features of erythema and scaling, with very few differences. The diseases in this group are psoriasis, seboreic dermatitis, lichen planus, pityriasis rosea, cronic dermatitis, and pityriasis rubra pilaris. Usually a biopsy is necessary for the diagnosis, but unfortunately these diseases share many histopathological features as well. Another difficulty for the differential diagnosis is that a disease may show the features of another disease at the beginning stage and may have its characteristic features at the following stages.

Fig. 9. Parallel coordinates model.

For the construction of this data set patients were first evaluated clinically with 12
features. Afterwards, skin samples were taken for the evaluation of 22 histopathological features. The values of the histopathological features are determined by an analysis of the samples under a microscope. The data set contains 34 attributes, 33 of which are linear valued and one of which is nominal. The number of instances is 366. Over this data set we performed mining for association rules with the data mining tool Envisioner [18]. The resulting rules were then provided as input to our models.

To begin with, after the mining for association rules we selected those rules that
had a middle or low level of support with a high confidence factor. That decision was made because rules with high support and confidence were obvious to the experts of the field. That fact led us to try to mine for new knowledge in the stack of rules that is commonly not analyzed and is usually left out of the mining process. That is due to the laborious effort that those rules require in order to be comprehended, as their large number and variations are quite confusing. That is the main reason why it is difficult for the knowledge engineer to perceive their knowledge and derive combined inferences.

For the set of rules derived from the dermatology data set and the sub-set of
rules selected, we applied the bar chart form model along with the similarity criterion and the smart 2D placement. The resulting representation is shown in Fig. 10. As can be seen from that view, if we omit several rules at the lower-right part of the visualization area, there are distinctive similarities among the remaining rules. The less relevant rules of the subset were positioned by our placement algorithm at the lower-right part of the visualization area, allowing the construction of four clusters of rules with distinctive similarities in the remaining area. In the figure we have marked the four similar sub-sets of rules. With the interactive functionality provided we can select those rules and proceed to the following step of our quest for interesting inferences.

Having inspected the neighbourhoods of interesting rules we finally selected the A,
B and C clusters as marked in Fig. 10. Those rules had sub-expressions of the "elongation of the rete ridges" attribute with ranges of small values.

In Fig. 11 the selected sub-set of rules has been visualized according to the parallel
coordinates model. For this representation it was considered necessary to slightly modify the values in order to avoid intense overlapping, along with assigning zero values to the non-participating attributes. By interaction we would have been able to distinguish among the rules, but for their printed presentation we considered their slight rearrangement more appropriate in order to clearly comment upon the constructed view. This method of slightly disturbing the position of overlapped glyphs in a representation is a commonly used technique in the field of visualization.

By a simple investigation of their representation we can easily distinguish the
common patterns that those three clusters of rules share. Following the trace of those
rules, in order to observe their behaviour, we can notice that the first five fields are only capable of partitioning those rules into two clusters. After the "munro microabcess" attribute enters the scene, the rules are partitioned into three clusters. That conforms to our knowledge that those diseases share common clinical and histopathological features. Interactively investigating this representation, we noticed that the two classes ("psoriasis" and "seboreic dermatitis") can be distinguished based on three fields. The combined overview of all the rules indicates that middle values of "band-like infiltrate" (approximately 1.5) with low values of "munro microabcess" (approximately 0) and middle values of "PNL infiltrate" (approximately 1.5) indicate the classification of those cases as "seboreic dermatitis". Examining the remaining sub-set of rules, we find no contradiction to the previous conclusion, as cases with low values of "band-like infiltrate" (approximately 0), middle values of "munro microabcess" (approximately 1.5) and low values of "PNL infiltrate" (approximately 0) are classified as "psoriasis".

Although those inferences seem to be interesting, we are not yet able to query
their strength, as their extraction was based on a representation with a middle level of detail. One thing that we can say with confidence is that there are distinctive patterns of similarity among those rules, with interesting inferences likely to be derived once we manage to combine their knowledge. Additionally, the attributes "band-like infiltrate", "munro microabcess" and "PNL infiltrate" can probably play a pivotal role in the discrimination and classification of the cases.

Fig. 10. Evaluation of the bar chart form model: dermatology set of rules.

In order to verify our conclusions and derive detailed inferences, we utilize the grid
form model. At this step of our visual data mining attempt, this model is considered more suitable for such tasks, as it is capable of representing a rule in the refined detail of a sub-expression. The flattening technique has been applied to the categorical attribute "Class", as we wanted to highlight the distinctive classification of the rules visualized.

The investigation of Fig. 12 indicates that our inferences should be enriched. The
selection of the "munro microabcess" attribute is not the best choice. A more accurate and stronger inference could be made if, instead of the "munro microabcess" factor, we consider the "elongation of the rete ridges" attribute. The core idea of our inference could be stated as follows: cases with a higher than middle indication of "PNL infiltrate" (higher than 1.5) are classified as "Seboreic dermatitis". "Psoriasis" is indicated in the cases where we have higher than middle values of "elongation of the rete ridges" and a small indication of "Spongiosis". Both inferences hold true for cases with not high values of "elongation of the rete ridges", as that was our initial criterion for the selection of the clustered rules.

Fig. 11. Evaluation of the parallel coordinates model—dermatology selected set of rules.

As a final step, the knowledge extractor has the option to thoroughly examine the rules according to the detailed bar chart form model (Fig. 27). In that view all the components constituting a rule are precisely mapped in their representation. By scrolling down the view we can examine the ordered sequence of rules as produced according to their similarity criterion. That detailed view lets the knowledge extractor apply their expertise to verify the derived inferences once more and make the final conclusive remarks.

The steps that we have followed throughout the evaluation scenario presented the
core notions of our work. The detailed advantages of each model could be illustrated in even more demanding case studies where back and forth visual data mining steps might be considered necessary. That would result in a higher level of collaboration among the visualization models and between the models and the expert.

3.6.2. Visual mining of association rules—adult case study

In this section we present a new case study where the proposed models were applied in order to investigate their capabilities to a greater extent. We try to demonstrate their ability to adapt to different scenarios. This indicative case study, along with the previously presented scenario, provides a solid view of the perspective and potential of those models.

Fig. 12. Evaluation of the grid form model—dermatology selected set of rules.

To begin with, we selected the adult data set [16]. This data set has previously been used in several classification experiments. The task is to predict whether a given adult makes less or more than $50,000 a year based on attributes such as education, hours of work per week, etc. The 48,842 instances have six continuous and eight categorical attributes. This data set was selected because of its interesting properties for our purposes and in order to demonstrate the behaviour of our models with categorical attributes.

The large set of rules generated by Envisioner [18] was a mixture of numerical and
categorical values. Many of the rules seemed to be obvious or of no interest if analysed separately. Those factors, along with the large size of the rule set, made it difficult to analyse the rules and to investigate them in combination in order to reach interesting and general inferences which would hold true with confidence. We would like to find out how our models cope with such cases, and whether they are capable of enhancing the combined examination of association rules of any kind (numerical or categorical), allowing the extraction of interesting conclusions.

As in the previous scenario, we start our visual mining attempt with the utilization
of the concise bar chart form model. The sub-set of rules selected was derived by investigating associations between the class of the adult (income greater or less than $50,000) and the capital gain, the education and the age of that adult. We restricted the mining for association rules to those attributes as we were mostly interested in a clear demonstration of our ideas. Following in our case studies an approach analogous to the one described, we concluded that the models are capable of scaling to larger cases and sets of rules. That expansion, though, requires an analogous increase of involvement on the user's behalf, which was actually one of our main goals.

In Fig. 28 we present the constructed representation of the set, according to the
abstract bar chart form model. Again the smart 2D placement of ordered rules based on their similarity has been utilized. The categorical attribute education has been expanded, with the bars in the 2nd, 3rd and 4th columns of each chart designating education of "Bachelors", "High School graduate" and "College" respectively. The 1st and 5th columns have been assigned to "capital gain" and "age". The bright blue and the black bars of the two rightmost columns of the chart (7th and 8th columns, as presented in the corresponding colour illustration) indicate income higher and lower than $50K respectively.

Due to the fact that those rules had quite many common sub-expressions, the
clusters of similar rules are not distinguishable on first examination. Actually, they seem to form a single cluster of similar rules. Patterns of similar sub-expressions, though, can be noticed. Distinguishable rules with a bright background colour, indicating high levels of support and confidence, attract our interest.

Investigating the constructed representation we could notice that large levels of
"capital gain" (1st column) are commonly associated with education at the level of bachelor (2nd blue bar) and result in the greater than 50K (G50K) class (7th bright blue bar). Moreover, the 3rd rule of the first row and the 3rd of the first column indicate with high support and confidence that adults with none or low capital gain
and education of either high school or college are more likely to be classified in the lower than 50K (L50K) class (8th black bar).

When the selected set of rules is visualized according to the parallel coordinates
model, the representation of Fig. 13 is constructed. A more detailed perspective is acquired by this view, where patterns of common sub-expressions among the rules are easily distinguishable. Again, the categorical attributes have been expanded in order to have a detailed view of each rule's composing factors. The dimensional expansion technique, which was first introduced for the purposes of this model, highlights interesting patterns of common sub-expressions through the detailed expansion of determinant categorical attributes.

To begin with, the assumptions made based on the bar chart form model could be
further verified in this view. If we examine the corresponding representations of the rules investigated in the previous model, we can conclude that the hypotheses made are valid. Furthermore, the examination of the group of segment lines with high capital gain refines the previous inference regarding the age of the adults. We can argue that older adults with high capital gain and a bachelor education are the group most likely to earn more than $50K. That additional conclusion was based on the bright colours of the thick segment lines with high marked values on the capital gain parallel axis.

Fig. 13. Parallel coordinates model—adult data set of rules.

Finally, following the direction of our mining attempt, we examine the constructed view of the selected set of rules based on the grid form model, as shown in Fig. 14. Several new and complementary observations came up from the investigation of this visual data mining model.

In rules (R5–R6; R11–R12; R21–R22) high levels of capital gain are associated with
education at the level of bachelor. Moreover, in some of these cases, with high support and confidence factors, the same assumption holds true for adults of greater than average age. These cases, though, as expected, are more likely to fall in the class of lower than $50K, as it is not that easy to earn more than $50K per year. On the other hand, those adults are the only group that has a significant level of support and confidence for earning more than $50K.

The investigation of this model led us at first in the direction of extracting a new
assumption which later became a conclusion through its verification. Noticing the alternating pattern constructed in our view by the classification bars (ClG50K, ClL50K) and the analogous pattern followed by the confidence bars, we focused our interest on those rules. Examining rules (R1, R2) we can assume that an adult with education of some college is more likely to belong to the L50K class. The same assumption led us to the inspection of rules (R3, R4), (R7, R8), (R9, R10), (R13, R14), (R15, R16), (R118, R117), (R120, R119), which are associations of adults of various ages, education of college or high school and small capital gain.

Fig. 14. Grid form model: adult data set of rules.

The combined inference made by all those rules is complementary to the one
already derived and suggests with high confidence factors that adults, irrespective of their age, without a bachelor education and with small capital gain are more likely to belong to the L50K class. The two derived conclusions complement each other and support our belief regarding their validity. Expressed in a different way, they both lead to the same notion, derived though by the visual mining of two different sets of association rules.

The inference, although expected, illustrates our approach and presents the abilities
and the behaviour of our models. The ability to interact among the models and between the models and the user, along with leading the knowledge engineer to a drill-down visual mining approach, makes possible the combined derivation of conclusions regarding the association rules represented.

4. Visualizing relevance analysis outcomes

Following the track of our research interest, we proceed in this section to the definition of visual data mining models for the representation of outcomes produced by relevance analysis tasks.

4.1. Solar plexus visual data mining model for relevance analysis

Our aim was to suggest a flexible, information-rich and robust visualization that would clearly reveal the relevancies among the attributes under examination. In that context, we suggest a model where each attribute is represented by a circle. In the centre of our model we have the attribute whose relevance we inspect. The target attributes are placed in a circular form around the examined attribute in equal-size arcs. The closer a circle is to the centre, the more relevant the corresponding target attribute is to the main attribute. Additional mapping indications of the relevance factor are the thickness of the connection arrows as well as the radius and the brightness of the sphere. A thick connection arrow between the examined attribute and a target attribute, together with a large sphere with a bright colour, indicates a strong relevance between the corresponding attributes. An indicative example is presented in Fig. 15.

If the number of relevant attributes under examination is larger than a specific
threshold, which was found to be ten in our testing procedures, then we should consider the expansion of our model into three dimensions. Otherwise, the previous model is not capable of clearly presenting the underlying relevancies due to overlapping or dense placement of the spheres. The leading idea for this attempt is that the most relevant attributes should be placed in the foreground of our world. These notions imply the ordering of the attributes according to their relevance factor along with their smoothed placement in the dimension of depth in order to produce an overall smart representation of the spheres in the
3D world constructed. Preserving our original idea that the distance between the examined attribute and each of the target attributes should be analogous to the relevance factor, the resulting placement is the 3D-snail placement of the spheres, as shown in Fig. 16.

Taking into consideration the temporal character that the relevance outcomes
might have, we have included the time aspect in our model. For the representation of the relevancies among the examined attributes at each specific time point we follow the previously defined model either in its 2D or 3D form. The series of time-stamped sub-sets of relevance analysis outcomes produces an animated representation where the evolution of each sphere's position, radius and colour reveals its temporal behaviour.
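The 2D placement of the solar plexus model can be sketched as below. This is a hedged illustration only: the function name `solar_plexus_layout`, the assumption that relevance factors are normalized to [0, 1], and the linear mapping from relevance to distance (with the hypothetical parameters `max_radius` and `min_radius`) are not taken from the original work.

```python
import math


def solar_plexus_layout(relevance, max_radius=1.0, min_radius=0.2):
    """Return {attribute: (x, y)} for target attributes placed around
    the examined attribute, which sits at the origin (0, 0)."""
    n = len(relevance)
    positions = {}
    for i, (name, factor) in enumerate(relevance.items()):
        # Equal-size arcs: attribute i gets the angle 2*pi*i/n.
        angle = 2 * math.pi * i / n
        # Assumed linear mapping: relevance 1 gives the smallest distance,
        # relevance 0 the largest one.
        distance = max_radius - factor * (max_radius - min_radius)
        positions[name] = (distance * math.cos(angle), distance * math.sin(angle))
    return positions


# Hypothetical relevance factors of three target attributes.
layout = solar_plexus_layout({"a": 0.9, "b": 0.5, "c": 0.1})
```

For the animated, time-stamped variant the same function could be called once per time point, with the renderer interpolating between successive positions.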

4.2. Time-oriented parallel coordinates visual data mining model for relevance analysis

When particular attention needs to be paid to the time aspect of the relevance analysis task, we suggest the time-oriented parallel coordinates visualization model. The aim is to represent the evolution of the relevance factor between the examined and the target attributes as time passes. In such cases, our modelling is based on the underlying ideas of the parallel coordinates technique and the segment line-plot.

Our perspective suggests that the trace of the segment line in the plot, which
corresponds to a target attribute, should reveal the evolution over time of its relevance factor. That implies that each crossing point over a parallel axis should represent the relevance factor at that specific time point. That is achieved by adopting several representational assumptions. The x-axis has been time-stamped with a parallel axis at each time point. Over each parallel axis we indicate the relevance factor by a marking point, which is the crossing point for the segment line. Expanding those notions to a set of target attributes, we may construct the representation shown in Fig. 17.

Fig. 15. Solar plexus model.

Such a representation gives us the opportunity to have an overall notion of the
history of the relevance analysis task in a single view. Additionally, the simplicity of the overall model poses no computational or representational complexities, and its representation manner, familiar to analysts, results in an easily perceived flow of information and knowledge. Detected patterns in the representation indicate similarities in the evolution of the corresponding relevant attributes and support the derivation of temporal inferences.
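As a rough illustration of the time-oriented model, the sketch below turns a relevance history into one polyline per target attribute, with one parallel axis per time point. The function names `relevance_polylines` and `is_increasing`, the `axis_spacing` parameter and the sample values are assumptions made for the example.

```python
def relevance_polylines(history, axis_spacing=1.0):
    """history maps each target attribute to its relevance factor at
    successive time points; the result maps it to (x, y) crossing points."""
    return {
        name: [(t * axis_spacing, value) for t, value in enumerate(series)]
        for name, series in history.items()
    }


def is_increasing(series):
    """True when the relevance factor never decreases over time."""
    return all(later >= earlier for earlier, later in zip(series, series[1:]))


# Hypothetical relevance factors over three time points.
history = {
    "capital_run_length_average": [0.2, 0.4, 0.6],
    "word_freq_free": [0.5, 0.3, 0.5],
}
polylines = relevance_polylines(history)
```

Drawing each list of points as a connected segment line gives the single-view history described above; a helper such as `is_increasing` is one simple way to flag the trends that the analyst would otherwise read off the plot.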

4.3. Evaluation of the visual data mining models for relevance analysis

For the evaluation of the visual data mining model for relevance analysis outcomes we selected the Spambase data set [16]. This database was constructed by collecting spam and non-spam e-mails and processing them in order to derive factors such as the frequency of specific words and characters. Performing relevance analysis with Envisioner [18], we visualized the derived outcomes with the proposed model. The resulting representation is shown in Fig. 29.

Fig. 16. Solar plexus model 3D-snail placement.

Although the number of examined and visualized attributes is relatively large, there
is a clear indication of the most relevant attributes. By navigating in the constructed 3D world we can focus on the attributes of high, medium or low relevance and examine their exact relevance factor.

A hypothetical time-oriented scenario of the case study presented was constructed
for the evaluation of the time-oriented parallel coordinates model. We have formed a set of time-stamped relevance analysis outcomes indicating the time evolution of the relevance factors of the Spambase data set over a period of twelve months. As can be seen from the representation of Fig. 18, factors such as the average length of uninterrupted sequences of capital letters, the length of the longest uninterrupted sequence of capital letters, the total number of capital letters in the e-mail and the frequency of the exclamation mark character show an increasing trend of relevance over time in determining whether an e-mail is spam or not. On the other hand, factors such as the frequency of the words "free" and "money" and the frequency of the "dollar" character show a periodic behaviour, highly influencing the categorization of an e-mail during the summer months.

Fig. 17. Time-oriented parallel coordinates model for relevance analysis.

5. Visualizing classification outcomes

In our attempt to graphically reveal the knowledge extracted by a classifier, we have mainly based our research effort on the underlying ideas of the geometric projection techniques [19]. From their study we have concluded that those methods seem to be the most promising framework on which to base our attempt to meet the requirements posed in the area of visualizing classification outcomes.

According to the mathematical formalization that we have adopted, a tuple $t_i$
categorized in the class $c_i$ could be expressed as: $t_{i1} = v_{i1}, t_{i2} = v_{i2}, \ldots, t_{iN} = v_{iN}, t_{i,N+1} = c_i$, with the last field being the classification attribute. The data may naturally occur in this form, or may be constructed by a data mining classification algorithm. In our modelling case we consider numerical values as field values. If we want to include all types of attributes, an evident (high=1, medium=0, low=-1) or synthetic (single=0, married=1, widowed=3) mapping of categorical to numerical values is essential.

Among the several geometric projection techniques that we have studied, the most
interesting methodology was the Class-Preserving Projection Algorithm [20], due to its robust behaviour and its middle level of computational complexity. Those issues are essential to consider when applying techniques in the field of data mining, where the limitations posed by the commonly large sets of classified data and the computational power provided by modern computers are of critical importance.

Fig. 18. Evaluation of the time-oriented parallel coordinates model.

The main characteristic of classified data embedded in high-dimensional Euclidean
space is that proximity in $R^n$ implies similarity. During the mapping procedures, class-preserving projection techniques carry the properties of the classified data in the $R^n$ space over to the projection plane, in order to construct corresponding representations from which accurate inferences can be extracted. Our research study on those techniques produced a new geometric projection technique that extends the existing methods in the area of visualizing classified data. That new technique, named the 3D Class-Preserving Projection technique, projects from the $R^n$ to the $R^3$ space and is capable of preserving the class distances (discriminating) among a larger number of classes.

5.1. 3D class-preserving projection technique

In this section we introduce class-preserving projections of multidimensional data. The main advantage of those projections is that they maintain the high-dimensional class structure through linear projections, which can be displayed on a computer screen. The challenge is in the choice of those planes and the associated projections. Considering the problem of visualizing high-dimensional data that have been categorized into various classes, our goal is to choose those projections that best preserve inter-class and intra-class distances in order to extract inferences regarding their relationships.

In our attempt to extend the existing projection techniques we worked on the
definition of a projection scheme that would result in the construction of a 3D world. In detail, our attempt was to define a projection scheme that maps from the $R^n$ to the $R^3$ space while preserving the properties of the data in the high-dimensional space, as well as the discrimination among the classes. In short, it should be a 3D class-preserving projection technique.

That attempt stemmed from our belief that the freedom provided by the additional
dimension in the projection world would result in the construction of an information-rich representation. The loss of information when mapping from the $R^n$ to the $R^3$ space is smaller than when mapping to the $R^2$ space. Moreover, the freedom provided in 3D representations would enhance the knowledge engineer's explanatory attempts. Allowing navigation in the constructed 3D world brings the knowledge engineer into direct contact with the projection of the classified data. Those beliefs guided our research effort towards the definition of the 3D Class-Preserving Projection Technique, which is presented in this sub-section.

The main idea in this method is that in order to project onto the 3D space we
should define our orthonormal projection vectors based on four points. If we choose those four points to be the class-means of the classes of interest, we maximize the inter-class distances among those four classes in our projection. Such an approach provides the flexibility of distinguishing among four classes instead of three, as well as moving into the 3D projection space.

We consider the case where the data is divided into four classes. Let $x_1, x_2, \ldots, x_N$ be all the N-dimensional data points, and let $m_1, m_2, m_3, m_4$ denote the corresponding class-centroids. Let $w_1$, $w_2$ and $w_3$ be an orthonormal basis of the candidate 3D world of projection. Analogously to the previous cases of the 2D projections, the point $x_i$ gets projected to $(w_1^T x_i, w_2^T x_i, w_3^T x_i)$ and consequently, the means $m_j$ get mapped to $(w_1^T m_j, w_2^T m_j, w_3^T m_j)$, $j = 1, 2, 3, 4$.

Again, one way to obtain good separation of the projected classes is to maximize the difference between the projected means. This may be achieved by choosing vectors $w_1, w_2, w_3 \in R^n$ such that the objective function

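$$
\begin{aligned}
C(w_1, w_2, w_3) = \sum_{i=1}^{3} \bigl\{ &|w_i^T (m_2 - m_1)|^2 + |w_i^T (m_3 - m_1)|^2 + |w_i^T (m_4 - m_1)|^2 \\
&+ |w_i^T (m_3 - m_2)|^2 + |w_i^T (m_4 - m_2)|^2 + |w_i^T (m_4 - m_3)|^2 \bigr\}
\end{aligned}
$$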

is maximized. The above may be rewritten as

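$$
\begin{aligned}
C(w_1, w_2, w_3) &= \sum_{i=1}^{3} w_i^T \bigl\{ (m_2 - m_1)(m_2 - m_1)^T + \cdots + (m_4 - m_3)(m_4 - m_3)^T \bigr\} w_i \\
&= w_1^T S_B w_1 + w_2^T S_B w_2 + w_3^T S_B w_3 \\
&= \operatorname{trace}(W^T S_B W),
\end{aligned}
$$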

where

W ¼ ½w1;w2;w3�; wTi wi ¼ 1; wT

i wj ¼ 0; iaj; i; j ¼ 1; 2; 3 and

SB ¼ ðm2 m1Þðm2 m1ÞT þ?þ ðm4 m3Þðm4 m3Þ

T:
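The equivalence of the two forms of the objective can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation: the helper names (`between_class_scatter`, `objective_direct`, `objective_quadratic`) and the five-dimensional class means are hypothetical, chosen only for illustration. It builds $S_B$ from the six pairwise mean differences and evaluates the objective both as a sum of squared projected differences and as the sum of the three quadratic forms $w_i^T S_B w_i$.

```python
from itertools import combinations


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sub(u, v):
    """Component-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]


def between_class_scatter(means):
    """S_B = sum over all pairs j < k of (m_k - m_j)(m_k - m_j)^T."""
    n = len(means[0])
    s_b = [[0.0] * n for _ in range(n)]
    for m_j, m_k in combinations(means, 2):
        d = sub(m_k, m_j)
        for r in range(n):
            for c in range(n):
                s_b[r][c] += d[r] * d[c]
    return s_b


def objective_direct(basis, means):
    """Objective as the sum of squared projected mean differences."""
    return sum(dot(w, sub(m_k, m_j)) ** 2
               for w in basis
               for m_j, m_k in combinations(means, 2))


def objective_quadratic(basis, s_b):
    """Objective as the sum over i of w_i^T S_B w_i."""
    return sum(dot(w, [dot(row, w) for row in s_b]) for w in basis)


# Hypothetical class means in a 5-dimensional space (illustrative values only).
means = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 3.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 4.0, 1.0, 1.0],
]
# Any orthonormal triple serves to check the identity; here the first three axes.
basis = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
]

s_b = between_class_scatter(means)
c_direct = objective_direct(basis, means)
c_quadratic = objective_quadratic(basis, s_b)
print(c_direct, c_quadratic)
```

Both evaluations agree for any orthonormal triple, which is what allows the maximization to be stated purely in terms of $S_B$.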

The positive semi-definite matrix $S_B$ can be interpreted as the inter-class or between-class scatter matrix. Note that $S_B$ has rank $\le 3$, since $(m_3 - m_2) \in \operatorname{span}\{(m_2 - m_1), (m_3 - m_1)\}$, $(m_4 - m_2) \in \operatorname{span}\{(m_4 - m_1), (m_2 - m_1)\}$ and $(m_4 - m_3) \in \operatorname{span}\{(m_4 - m_1), (m_3 - m_1)\}$.

It is clear that the search for the maximizing $w_1$, $w_2$ and $w_3$ can be restricted to the column (or row) space of $S_B$. But as we noted above, this space is at most of dimension 3. Thus, in general, the optimal $w_1$, $w_2$ and $w_3$ must form an orthonormal basis spanning the space determined by the vectors $(m_2 - m_1)$, $(m_3 - m_1)$ and $(m_4 - m_1)$. In the degenerate case when $S_B$ is of rank two (i.e. when the four class means are coplanar), $w_1$ and $w_2$ should form an orthonormal basis of the plane spanned by these difference vectors, while $w_3$ can be chosen to be any unit vector orthogonal to that plane.
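The construction described above can be sketched as follows. This is an illustrative sketch in plain Python under stated assumptions, not the authors' implementation: it obtains the orthonormal basis by Gram-Schmidt orthogonalization of the mean differences (a choice made here because the text only requires that the basis span their space), and it pads the basis with coordinate axes in the degenerate case. The function names `projection_basis` and `project`, the tolerance `tol`, and the sample means are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def projection_basis(means, tol=1e-10):
    """Orthonormal w1, w2, w3 spanning m2-m1, m3-m1, m4-m1 (assumes n >= 3).

    Uses Gram-Schmidt; if the differences span fewer than three dimensions,
    coordinate axes are orthogonalized against the basis to fill it up.
    """
    m1 = means[0]
    n = len(m1)
    candidates = [[a - b for a, b in zip(m, m1)] for m in means[1:]]
    # Fallback directions for the degenerate (rank < 3) case.
    candidates += [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    basis = []
    for v in candidates:
        for w in basis:
            coeff = dot(v, w)
            v = [a - coeff * b for a, b in zip(v, w)]
        norm = math.sqrt(dot(v, v))
        if norm > tol:  # skip vectors already inside the current span
            basis.append([a / norm for a in v])
        if len(basis) == 3:
            break
    return basis


def project(point, basis):
    """Map an n-dimensional point to (w1^T x, w2^T x, w3^T x)."""
    return tuple(dot(w, point) for w in basis)


# Hypothetical class means in a 5-dimensional space (illustrative values only).
means = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 3.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 4.0, 1.0, 1.0],
]
basis = projection_basis(means)
projected_means = [project(m, basis) for m in means]

# Degenerate example: four coplanar means, so the differences span a plane.
coplanar = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
]
degenerate_basis = projection_basis(coplanar)
```

Because every mean difference lies in the span of the basis, the pairwise distances between the projected means equal those in the original space; ordinary data points are projected with the same `project` call and may lose the components that lie outside that span.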

5.1.1. Evaluation of the 3D class-preserving projection technique

To provide evidence of the advantages of extending the existing 2D projection techniques to three dimensions, we present in this section several evaluation case studies of the newly introduced 3D class-preserving projection technique.


5.1.1.1. Letter image recognition case study. To begin with, we demonstrate the ability of this technique to discriminate among a larger number of classes. In Fig. 30 we have visualized the classes A, B, C and D of the letter image recognition data set, which are represented in the colour figure by the red, green, blue and mauve spheres, respectively. In the constructed 3D representation, class A is clearly distinct from the adjacent arrangement of classes B, C and D. As expected, the similar curves of those characters resulted in the neighbouring placement of the corresponding classes in the n-dimensional space, which was preserved in the projection to our 3D world. Furthermore, because the letters B and D share a similar straight line, the representations of classes B and D lie even closer together. These constituent basic shapes governed the placement of the corresponding classes.

Our first impression is that we have managed to define a technique that projects to three dimensions while preserving the properties of the classified data in the high-dimensional space. In other words, we have managed, theoretically and practically, to construct a 3D class-preserving projection technique. For the practical demonstration of our arguments we continue with supplementary case studies, where we apply our technique in the medical field.

5.1.1.2. Dermatology case study. In this case study we again attempt to gain insight into the dermatology data set, this time from a different perspective: that of visually mining the classified data in the high-dimensional space. Instead of analysing the association rules mined, we investigate the properties of the classes in the 34-dimensional dermatology data space. To derive interesting inferences, we focus on composing accurate assumptions regarding the properties of those classes and their relationships. Those assumptions will be translated into corresponding conclusions regarding the dermatology diseases examined, which will then be suggested to the experts of the field for their final evaluation.

As we have already stated, the differential diagnosis of erythemato-squamous diseases is a real problem in dermatology. They all share the clinical features of erythema and scaling, with very few differences. The diseases, and as a result the classes, in this group are chronic dermatitis, lichen planus, pityriasis rosea, pityriasis rubra pilaris, psoriasis and seborrhoeic dermatitis.

The tool for our navigation in the high-dimensional classified world is the 3D class-preserving projection technique. With this technique we can target our interest on the discrimination among four classes. In Fig. 19, we selected the pityriasis rosea, pityriasis rubra pilaris, psoriasis and seborrhoeic dermatitis classes, which are represented by the red, green, blue and mauve spheres in the colour version and by the medium grey, bright grey, black and dark grey spheres in the grey version, respectively. The remaining cases, which belong to the diseases not selected, are represented by white spheres.

Fig. 19. 3D class-preserving projection technique: dermatology (CL 3rd, 4th, 5th, 6th).

From the constructed representation we notice that the selected classes are quite coherent and clearly distinguishable. Each of them, though, has several cases where the distinction is unclear. Those cases are mixed up in the centre of the view with instances of the other classes. In other words, in each of the selected classes there are a number of cases with a clear indication of the disease to which they belong, and others that are quite confusing.

With that conclusion, we have affirmed something that we already knew from the domain information provided for the data set: the diagnosis of erythemato-squamous diseases is difficult due to the clinical features that they have in common. The conclusion derived is therefore accurate, which also suggests the accurate behaviour of our 3D class-preserving projection technique. That was achieved by the preservation of the properties of the classified high-dimensional data during the mapping procedure of our technique.

According to the 3D class-preserving projection technique we have the option to select any combination of four among the six classes of our data set. The resulting representation will then preserve the best discrimination properties among the selected classes of interest. Selecting thus the classes of chronic dermatitis, lichen planus, pityriasis rubra pilaris and psoriasis, the representation of Fig. 20 is constructed according to our model.

Fig. 20. 3D class-preserving projection technique: dermatology (CL 1st, 2nd, 4th, 5th).

The selection of those diseases was motivated by our attempt to find distinctive classes, which would provide directions for enhancing the diagnosis procedure. We kept the pityriasis rubra pilaris and psoriasis classes in our investigation, as they already showed indications of a clear discrimination between them. From the remaining diseases not yet investigated we selected the other two. In the resulting representation we notice the improved discrimination among the classes visualized and the complete detachment of the pityriasis rubra pilaris class.

These observations made feasible our attempt to provide directions to the experts of the field that will enhance the diagnosis procedure. Our observations, in the context of the medical field, could be summarized as follows: if we have already excluded for a patient the cases of seborrhoeic dermatitis and pityriasis rosea, the diagnosis among the remaining erythemato-squamous diseases is not that complex, as the discrimination among these diseases is quite clear. Furthermore, for the patients belonging to the pityriasis rubra pilaris class the diagnosis is expected to be derived with confidence (if all the necessary clinical examinations have been performed), as there is a clear discrimination of this disease.

In Figs. 21 and 22 we continue to investigate the relationships among the diseases by selecting in each case the corresponding classes of interest. In Fig. 21 chronic dermatitis, pityriasis rubra pilaris, psoriasis and seborrhoeic dermatitis have been selected, and in Fig. 22 lichen planus, pityriasis rosea, psoriasis and seborrhoeic dermatitis. In both examples the classes are represented by the red, green, blue and mauve spheres in the colour version and by medium grey, bright grey, black and dark grey in the grey version, respectively.

Fig. 21. 3D class-preserving projection technique: dermatology (CL 1st, 4th, 5th, 6th).

Fig. 22. 3D class-preserving projection technique: dermatology (CL 2nd, 3rd, 5th, 6th).

Visual mining of the constructed representations leads to inferences analogous to the previous ones, which support our derived arguments and strengthen our belief regarding the usefulness of the technique introduced. As our navigational tool, the 3D class-preserving projection technique enhanced our attempt to gain insight into the high-dimensional data space, accurately highlighting properties of the classified data (Figs. 23–30).

Fig. 23. Bar chart form model: enhancing colour.

Fig. 24. Bar chart form model: support & confidence.

Fig. 25. Grid form model.

Fig. 26. Parallel coordinates model: example.

Fig. 27. Evaluation of the detailed bar chart form model: dermatology selected set of rules.

Fig. 28. Bar chart form model: adult data set of rules.

Fig. 29. Evaluation of the solar plexus model.

Fig. 30. 3D class-preserving projection technique: letter image recognition.

6. Conclusions and future work

With the proposed visual data mining models, our effort focused on the invention of visual representations of the outcomes produced by common data mining processes. In order to equip the knowledge engineer with a tool for gaining insight into the mined knowledge, we tried to present as much of the extracted information as possible in a humanly perceivable way. Additionally, our basic guideline was that the construction of each visual representation, as well as the definition of the underlying visual data mining model, should harmoniously map the constituent elements and notions of the corresponding type of mining outcome.

The models proposed have distinctive advantages, addressing the commonly tedious issues that the knowledge engineer handles during the exploitation of the mining outcomes. Furthermore, their ability to work in combination and to enhance the information flow between them and the user brings us one step closer to making the human part of the data mining process, in order to exploit the human's unmatched abilities of perception.

Our future work is mainly targeted at the extensive evaluation of those models, along with the development of the advanced visualization features not yet implemented in VizMiner [21]. Furthermore, the expansion of those techniques to address issues such as the ordering of attributes, the mapping of categorical attributes and robust model behaviour for large sets of outcomes is among our future plans.

Acknowledgements

Thanks to Dr. Yannis Zorgios for his thoughts and comments on this work and to Dr. Areti Sfrintzeri for her expertise in the medical evaluation scenarios.

References

[1] D.A. Keim, H.-P. Kriegel, Using visualization to support data mining of large existing databases, Proceedings of the IEEE Visualization '93 Workshop, San Jose, CA, in: Lecture Notes in Computer Science, Vol. 871, Springer, Berlin, 1994, pp. 210–229.

[2] W. Frawley, G. Piatetsky-Shapiro, C. Matheus, Knowledge discovery in databases: an overview, AI Magazine (1992) 13, 213–228.

[3] M. Ganesh, E.-H. Han, V. Kumar, Visual data mining: framework and algorithmic development, Department of Computer and Information Sciences, University of Minnesota, Minneapolis, 1996.

[4] K.H. Thearling, B.G. Becker, D. Decoste, W. Mawby, M. Pilote, D. Sommerfield, Visualizing data mining models, in: Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, Los Altos, CA, 2001, pp. 205–222.

[5] W.L. Johnston, Model visualization, in: Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, Los Altos, CA, 2001, pp. 223–227.

[6] D.A. Keim, J.P. Lee, B.M. Thuraisingham, C.M. Wittenbrink, Database issues for data visualization: supporting interactive database exploration, Proceedings of the Workshop on Database Issues for Data Visualization, Atlanta, GA, 1995, in: Lecture Notes in Computer Science, Springer, Berlin, 1996, pp. 12–25.

[7] D.A. Keim, H.-P. Kriegel, Issues in visualizing large databases, Proceedings of the Third IFIP 2.6 Working Conference on Visual Database Systems, Lausanne, Switzerland, in: Visual Database Systems 3, Chapman & Hall, London, 1995, pp. 203–214.

[8] A. Inselberg, Data mining, visualization of high dimensional data, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2001), Proceedings of the Workshop on Visual Data Mining, San Francisco, USA, 2001, pp. 65–81.

[9] U.M. Fayyad, G.G. Grinstein, Introduction, in: Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, Los Altos, CA, 2001, pp. 1–17.

[10] D.A. Keim, Visual data mining, tutorial, International Conference on Very Large Databases (VLDB '97), Athens, Greece, 1997.

[11] D. Law, Y. Foong, A visualization-driven approach for strategic knowledge discovery, in: Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, Los Altos, CA, 2001, pp. 182–190.

[12] D.A. Keim, H.-P. Kriegel, Possibilities and limits in visualizing large amounts of multidimensional data, in: Perceptual Issues in Visualization, Springer, Berlin, 1995, pp. 203–214.

[13] P. Docherty, A. Beck, A visual metaphor for knowledge discovery. An integrated approach to visualizing the task, data and results, in: Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, Los Altos, CA, 2001, pp. 191–203.

[14] K. Zhao, B. Liu, Visual analysis of the behavior of discovered rules, in: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2001), Proceedings of the Workshop on Visual Data Mining, San Francisco, USA, pp. 59–64.

[15] G.G. Grinstein, P. Hoffman, R.M. Pickett, Benchmark development for the evaluation of visualization for data mining, in: Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, Los Altos, CA, 2001, pp. 129–176.

[16] C.L. Blake, C.J. Merz, UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html], Department of Information and Computer Science, University of California, Irvine, CA.

[17] P.E. Hoffman, G.G. Grinstein, A survey of visualizations for high-dimensional data mining, in: Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, Los Altos, CA, 2001, pp. 47–82.

[18] G. Koundourakis, EnVisioner: a data mining framework based on decision trees, Ph.D. Thesis, Department of Computation, University of Manchester Institute of Science and Technology (UMIST), Manchester, UK.

[19] I.S. Dhillon, D.S. Modha, W.S. Spangler, Visualizing class structure of multidimensional data, in: Proceedings of the 30th Symposium on the Interface: Computing Science and Statistics, Interface Foundation of North America, Vol. 30, Minneapolis, May, 1998, pp. 488–493.

[20] I.S. Dhillon, D.S. Modha, W.S. Spangler, Class Visualization of High-Dimensional Data with Application, IBM Almaden Research Center, San Jose, 1999.

[21] I. Kopanakis, B. Theodoulidis, Visual Data Mining and Modelling Techniques, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2001), Proceedings of the Workshop on Visual Data Mining, San Francisco, USA, 2001, pp. 114–128.
