CRISP-DM as a framework for discovering knowledge in small and medium sized enterprises' data

BULETINUL STIINTIFIC al Universitatii “Politehnica” din Timisoara, ROMANIA,

Seria AUTOMATICA si CALCULATOARE

SCIENTIFIC BULLETIN of “Politehnica” University of Timisoara, ROMANIA,

Transactions on AUTOMATIC CONTROL and COMPUTER SCIENCE, Vol. xx (yy), Fasc. z, 20vv, ISSN 1224-600X


CRISP-DM as a Framework for Discovering Knowledge

in Small and Medium Sized Enterprises’ Data

Z. Bošnjak*, O. Grljević**, S. Bošnjak***

* Department of Business Information Systems, University of Novi Sad, Faculty of Economics, Segedinski put 9-11, 24000 Subotica,

Serbia

Phone: (381) (0)24- 628-045, Fax: (381) (0)24 -546-486, E-Mail: [email protected], WWW:

http://www.ef.uns.ac.rs/osoblje2008/personalwebs/bosnjak_zita.htm

** Department of Business Information Systems, University of Novi Sad, Faculty of Economics, Segedinski put 9-11, 24000 Subotica,

Serbia

Phone: (381) (0)24- 628-166, Fax: (381) (0)24 -546-486, E-Mail: [email protected],

*** Department of Business Information Systems, University of Novi Sad, Faculty of Economics, Segedinski put 9-11, 24000 Subotica,

Serbia

Phone: (381) (0)24- 628-004, Fax: (381) (0)24 -546-486, E-Mail: [email protected], WWW:

http://www.ef.uns.ac.rs/osoblje2008/personalwebs/bosnjak_sasa.htm

Abstract – Discovering knowledge in vast amounts of data has become a promising area, but it is also an intricate, uncertain and time-consuming process. The complexity of a data collection, the variations in data quality and their impact on the discovery process, and the applicability of the results all call for extensive research and accumulated experience in order to overcome the difficulties that can jeopardize the knowledge discovery in data (KDD) process as a whole. In this article we describe the limitations and challenges of discovering knowledge that we experienced while analyzing data on small and medium sized enterprises (SMEs).

Keywords: data mining, CRISP-DM methodology, data

analysis models, data exploration.

I. INTRODUCTION

Data mining methods and techniques have become an important research area, because their inclusion in a data analysis process can reveal hidden relations, behavioral patterns, entity profiles, and similar regularities in data stored in large databases or warehouses. Knowledge discovered by intelligent methods could hardly be acquired by traditional means, such as statistical analysis, data queries or other analytical methods, because the amount of collected data is huge and the analyst has only a vague idea of what knowledge the data may contain. Therefore, knowledge discovery in data (KDD), and data mining (DM) as its integral part, are indispensable for data analysts. However, this holds only for quality input data, or for data that can be brought to that level through preprocessing. Data quality refers to the accuracy and completeness of the data being analyzed. In [14] the author states that “Data quality is a multifaceted issue that represents one of the biggest challenges for data mining.” It has been recognized that the success of a whole KDD process depends on the inputs provided.

In this paper, we describe part of our efforts to take advantage of intelligent data analysis and discover knowledge hidden in data on small and medium sized enterprises (SMEs), with the aim of supporting the development of the SME sector. We also describe the challenges we faced in data mining because of poor data quality, the consequent limitations on the applicability of the revealed knowledge, and the lessons learned.

II. DISCOVERING KNOWLEDGE THROUGH CRISP-DM METHODOLOGY

The cross-industry standard process for data mining (CRISP-DM) is an iterative process model that was developed in 1996 and has been the most favored methodology ever since [1]. The research described in this article was conducted using this methodology.

A. The CRISP-DM Methodology Life Cycle

CRISP-DM comprises the following tasks: (a) business understanding; (b) data understanding; (c) data preparation; (d) modeling; (e) evaluation; and (f) deployment. Generally, these tasks follow one another as consecutive phases, but within this main flow many iterative cycles can be observed, because the output of each phase influences the next methodological step. After the data understanding phase, for example, the analyst often has to return to business understanding and reconsider the aims of and reasons for KDD. Similarly, after the modeling phase it is not unusual that the data must be preprocessed again in order to improve the derived data models, or even to develop additional ones. Furthermore, the findings of the evaluation phase may require a fresh start from the first task of the CRISP-DM methodology, business understanding, if the models do not support the KDD goals defined beforehand.
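The life cycle described above can be sketched as a small state machine. This is only an illustration of the three feedback loops named in the text; the `satisfied` flag and the rule that a phase without a named loop is simply repeated are assumptions made for the sketch, not part of the CRISP-DM specification.

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Backward transitions mentioned in the text: where the analyst returns
# when the output of a phase is not good enough to proceed.
FEEDBACK = {
    Phase.DATA_UNDERSTANDING: Phase.BUSINESS_UNDERSTANDING,
    Phase.MODELING: Phase.DATA_PREPARATION,
    Phase.EVALUATION: Phase.BUSINESS_UNDERSTANDING,
}


def next_phase(current, satisfied):
    """Return the phase that follows `current`, or None after deployment.

    `satisfied` is a hypothetical flag: True if the phase output allows
    the process to move on, False if the analyst has to step back.
    """
    if not satisfied:
        # Assumption: a phase with no named feedback loop is repeated.
        return FEEDBACK.get(current, current)
    if current is Phase.DEPLOYMENT:
        return None
    return Phase(current.value + 1)
```

For instance, an unsatisfactory modeling result sends the analyst back to data preparation, while a satisfactory evaluation leads on to deployment.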

B. Understanding the Goals of Knowledge Discovery in Data

The role of the first step of CRISP-DM is to establish the reasons for, and the goals of, a KDD process. Many authors have reported on diverse areas in both the private and public sectors in which DM is used. In [14] it is said that “Industries such as banking, insurance, medicine, and retailing commonly use data mining to reduce costs, enhance research, and increase sales.” Other sources, such as [15] and [20], report on data mining applications as a means of detecting fraud and waste, as well as for purposes such as measuring and improving program performance or searching for trends in data. The research described in [13], [16], and [17] focuses on the analysis of large and complex datasets derived from business and industrial activities, known as “enterprise data”. Enterprise data mining has been used to design more cost-effective strategies for optimizing some performance measure (reducing production time, improving quality, eliminating waste, maximizing profit, etc.). In the KDD process described in this article, intelligent data analysis was conducted in order to reveal hidden, potentially useful knowledge in a data collection of 2365 records, each containing data on one SME in the province of Vojvodina, with more than 100 attributes. The obtained KDD results should be the starting point for devising proactive actions that foster the development of the SME sector and prevent unfavorable courses of action, not only for individual business entities but for the industry as a whole.

C. Knowledge Acquired Through Data Understanding and Data Preparation Phases

The data on SMEs were collected in 2006, by means of

distributed questionnaires, provided by four Regional

Agencies for the Development of Small and Medium Sized

Enterprises and Entrepreneurship, from the Province of

Vojvodina. The questions in the questionnaire were

grouped by relevant topics into eight groups: general data,

business data, technical aspects and technology,

administrative and legislative conditions, financial aspects,

market conditions and distribution, human resources, and

business connectivity.

The second phase of the CRISP-DM methodology, data understanding, consists of data formatting, description, exploration and data quality verification. The data that formed the basis for our analysis were originally stored in MS Access format and were then transformed into input suitable for DataEngine, a software tool for intelligent data analysis. Reference [12] describes this tool in detail, while the data preprocessing activities are described in more detail in [2]. In the data understanding phase, all source data records were still regarded as equally important and all attributes were subject to further processing. Table I shows the source data format: the first part of the record holds the general enterprise data, while the second part holds the answers that the enterprises’ representatives gave in the questionnaire, such as the type of ownership, number of owners, the director’s level of education, work experience and gender, number of male/female employees, percentage of enterprise capacity exploited, main problems of doing business (lack of funds, regulations, disharmony of standards, lack of market information, etc.), total capital, profit, ROI, domestic and international market shares, and the like. After the data were loaded into DataEngine, the values yes and no were replaced by 1 and 0, respectively. In subsequent steps of the KDD process, the single data set was divided into subsets, and different DM methods and techniques were used to analyze them, as described below.
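The yes/no recoding step can be illustrated with a few lines of Python. This is a minimal sketch, not the DataEngine procedure itself; the function name `recode_yes_no` and the list-based record layout are assumptions made for the example.

```python
def recode_yes_no(record):
    """Replace 'Yes'/'No' answers by 1/0 and leave other values untouched."""
    mapping = {"yes": 1, "no": 0}
    recoded = []
    for value in record:
        if isinstance(value, str):
            # Case and surrounding whitespace are ignored when matching.
            recoded.append(mapping.get(value.strip().lower(), value))
        else:
            recoded.append(value)
    return recoded
```

Applied to a fragment of a source record such as `[2, "Yes", "No", 35]`, the function leaves the numeric fields as they are and turns the two answers into 1 and 0.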

TABLE I. Source data format

1 1 08779767 102697394 1 1 80381
2 1 4 15 35 1 1 1 2 2 35 2 2 0 Yes Yes No No No No No No 2 Yes No
No No 1 Yes No No No No 1 1 2 Yes Yes No No No No No No No Yes No 2 No No No No No No Yes No No Yes Yes No No

Fig. 1. (a) Distribution of gender: 0 - unanswered, 1 - males, 2 - females; (b) Distribution of education degrees: 8 - PhD, 7 - University, 6 - High school, 5 - Secondary school, 4 - III grades of secondary school, 3 - elementary school, 1 - other, 0 - unanswered

Data exploration is another task in CRISP-DM. It should help the analyst state initial hypotheses about relations in the data, their dispersion, and their impact on each other. Simple database queries and visualization techniques can help to formulate statements related to the defined DM goals and to reveal interesting patterns. During the exploration of the SME data, some interesting facts were observed: there were almost three times fewer female than male directors in SMEs in Vojvodina (24.06% vs. 64.27%), as shown in Fig. 1(a), and the overall education level of directors was disappointing, the predominant level being three grades of secondary school (rank 4 on a scale from 3 to 8), as shown in Fig. 1(b). While such insight helped us understand the SME sector we were dealing with, it had to be treated with caution because of the large number of missing values. Fig. 1(a) shows that more than 11% of SMEs did not answer the question about the director’s gender, while Fig. 1(b) exhibits an even greater lack of information: 27.78% of the values for the director’s education degree were missing. In the data cleansing phase, missing values were either replaced by a neutral value, or the records with missing data were excluded from further analysis in the KDD process. For example, if an SME did not answer the question “What is the main difficulty the enterprise was coping with in everyday business?”, we assumed that there were no difficulties worth mentioning and replaced the answer with zero. Similarly, only 40% of the analyzed SMEs were willing to answer the questions about capacity utilization, so the further data analysis was conducted on 840 records. Even fewer enterprises, 31.54%, provided answers regarding their investments, and the input data set for the modeling phase was reduced accordingly.
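The two cleansing strategies mentioned above, replacing a missing answer by a neutral value and excluding incomplete records, can be sketched as follows. The field names and the use of `None` as the marker for an unanswered question are assumptions made for this illustration; they do not reproduce the actual coding of the questionnaire data.

```python
def fill_missing(records, field, neutral=0):
    """Return copies of the records with a missing `field` set to `neutral`."""
    return [
        {**rec, field: neutral if rec.get(field) is None else rec[field]}
        for rec in records
    ]


def drop_missing(records, field):
    """Keep only the records in which `field` was actually answered."""
    return [rec for rec in records if rec.get(field) is not None]


# Hypothetical records: one SME skipped the "main difficulty" question,
# another one skipped the capacity utilization question.
smes = [
    {"id": 1, "main_difficulty": None, "capacity_used": 80},
    {"id": 2, "main_difficulty": 3, "capacity_used": None},
]
filled = fill_missing(smes, "main_difficulty")
answered = drop_missing(smes, "capacity_used")
```

The first strategy keeps the sample size but encodes an assumption about what silence means; the second keeps only observed values at the cost of a smaller input set, which is the trade-off described in the paragraph above.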

Besides giving better insight into the data collection, visualization techniques were also very useful for detecting erroneous data, especially when such data appeared as “outliers”. Fig. 2 shows one such example: instead of the business industry code 52240, the value 552240 had been entered in the 966th record. The layouts before and after the data correction are shown in Fig. 2(a) and Fig. 2(b), respectively.
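In our study such errors were spotted visually. As a hedged complement, the same kind of mistyped value can also be flagged numerically, for instance with a rule based on the median absolute deviation (MAD). The rule, the threshold of 3.5 and the sample codes other than 52240 and 552240 are assumptions chosen for this sketch.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds `threshold`.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median of the absolute deviations from the median.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median: this rule cannot rank deviations.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Illustrative industry codes; the fifth one repeats a digit by mistake.
codes = [52240, 15810, 45210, 74120, 552240, 28520, 52110]
suspicious = flag_outliers(codes)
```

A flagged index only marks a candidate for inspection; whether the value is really an error, and what the correct value is, still has to be decided by the analyst.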

Fig. 2. Visualization techniques support “outlier” detection: (a) Layout before data correction; (b) Layout after data correction

Visualization techniques were also useful in the data exploration phase of the CRISP-DM methodology, for verifying or rejecting the initially stated hypotheses on data dependences. A high correlation (r = 0.766442) was discovered between the investment activities of enterprises in the last five years and their adoption of standardization policies. This finding was no surprise and is easily explained: standardization often imposes serious changes in the way of doing business, which demand additional investments. On the other hand, the hypothesized correlation between equipment maturity and lack of investment in new technologies had to be rejected (r = -0.20295). This contradicted our expectation that SMEs with outdated equipment had invested less in the last five years than SMEs with contemporary equipment.
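The coefficient r reported above is assumed here to be the Pearson correlation coefficient; the text does not name the measure explicitly. Under that assumption, a minimal sketch of its computation for two equally long attribute columns is:

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)
```

Values close to 1 or -1 indicate a strong linear relation, while values near 0, such as the -0.20295 above, give little support to a hypothesized linear dependence.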

D. Challenges of Revealing Knowledge through Modeling

The modeling phase of the CRISP-DM methodology includes the application of different DM and knowledge discovery methods, each with a wide range of tunable parameters. These methods can be grouped into categories according to the algorithms they use: artificial neural networks (ANNs), cluster analysis, decision trees, mining of association rules, genetic algorithms, Bayes networks, rule induction, etc. It is well known that no single method outperforms all the others all the time. References [13] and [18] provide some answers to the question of which method to choose for a particular application.


In the research described in this paper, data modeling was supported by the DataEngine software tool for intelligent data analysis. DataEngine combines statistical methods with neural network technology (both supervised and unsupervised learning models) and fuzzy technology, so as to provide the best data mining capabilities for a specific KDD task.

Our investigations focused on devising discriminators between successful and less successful enterprises, describing the general profile of businesses that were likely to fail in achieving their goals, selecting the attributes with high predictive power for forecasting future business gains or losses, etc. The first task is a classification problem: each SME, represented by a corresponding tuple in the database, had to be mapped to one of the predefined, nonoverlapping classes that partition the entire database. The second task is a typical clustering task, in which data in the database are grouped by finding similarities between them according to their characteristics. By partitioning or segmenting the database into clusters, we hoped to gain a more general view of SMEs in Vojvodina. The third task is prediction. In our view, prediction is used both to predict class labels, as in classification, and to estimate the value that an attribute of a given sample is going to have. Using DataEngine, we created data models for clustering with the fuzzy c-means algorithm and Kohonen neural networks, and for classification and prediction with multilayer feed-forward (MLP) neural networks.
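For readers unfamiliar with fuzzy c-means, the following is a minimal textbook-style sketch of the algorithm in plain Python. It is not the DataEngine implementation; the fuzzifier m = 2, the stopping tolerance and the deterministic choice of initial centers from evenly spaced data points are assumptions made to keep the example short and reproducible.

```python
import math


def fuzzy_c_means(points, c, m=2.0, iterations=100, tol=1e-6):
    """Cluster `points` into `c` fuzzy clusters.

    Returns (centers, memberships), where memberships[i][j] is the degree
    to which point i belongs to cluster j; each row sums to 1.
    """
    n, dim = len(points), len(points[0])
    if not 1 <= c <= n or m <= 1.0:
        raise ValueError("need 1 <= c <= number of points and m > 1")
    # Assumption: initial centers are evenly spaced data points.
    centers = [list(points[(j * n) // c]) for j in range(c)]
    memberships = [[0.0] * c for _ in range(n)]
    power = 2.0 / (m - 1.0)
    for _ in range(iterations):
        # Membership update from the distances to the current centers.
        for i, p in enumerate(points):
            dists = [math.dist(p, v) for v in centers]
            if any(d == 0.0 for d in dists):
                # A point lying on a center belongs fully to that center.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                memberships[i] = [h / sum(hits) for h in hits]
            else:
                memberships[i] = [
                    1.0 / sum((dists[j] / dists[k]) ** power for k in range(c))
                    for j in range(c)
                ]
        # Center update: mean of the points weighted by membership ** m.
        new_centers = []
        for j in range(c):
            weights = [memberships[i][j] ** m for i in range(n)]
            total = sum(weights)
            new_centers.append([
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ])
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships
```

On two well separated groups of points the memberships approach 0 or 1, whereas data without any group structure tend to yield diffuse memberships and cluster centers that lie close together.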

As data mining is known to be a time-consuming, laborious endeavor, with no guarantee that interesting and potentially useful patterns will be revealed, we were prepared for a trial-and-error approach to data analysis and for the creation of a large number of data models. Some models resulted in interesting findings, such as the clustering model that divided the SMEs according to the main problems they were facing in everyday business operations (lack of available funds, complex administrative and legislative regulations, disharmony with standards, insufficient market information, insufficient information on technologies, unavailability of qualified work force, and human resources development). This clustering model grouped the enterprises into four clusters, as shown in Fig. 3. One of the clusters (marked as cluster IV) comprises SMEs that had not recognized any of the above listed threats as a serious one to their business operations (rank 2), while all other enterprises had recognized the lack of funds as a serious threat. Surprisingly, the membership of SMEs in domestic or foreign business associations, as well as their involvement in industry clusters¹, had no influence on overcoming the problem of insufficient funds, despite the fact that both business associations and industry clusters should be established primarily for this reason.

¹ A distinction should be made between the term “cluster” in the sense of a data mining output and the term “cluster” in the sense of a group of similar business entities that share a common business goal. Therefore, in this paper the latter is referred to as an “industry cluster”.

Fig. 3. DataEngine clustering model regarding the main problems in business operations

During the modeling phase of CRISP-DM, conducted on the SME data, we found that not only successful data models, but also the inability to develop a meaningful data model, could provide valuable information. We failed to develop an MLP classification model that could classify SMEs, based on seven input attributes related to difficulties in everyday business operations, into one of three predefined classes:

1 – SMEs with outdated equipment

2 – SMEs with moderately outdated equipment

3 – SMEs with “new generation” equipment.

From this failure we learned that the defined output variable was not dependent on the selected input attributes in any way. After a number of trials with different MLP configurations (with one or two hidden layers, with two to fifteen neurons in each, using Sigmoid, Tanh, Mod.Sigmoid, Sim.Sigmoid and Parabola transfer functions) and various learning methods offered in the DataEngine tool (Resilient Propagation, Super SAB, Back Propagation, Quick Propagation, with and without momentum and weight decay, with sequential and random presentation order), we concluded that the hidden relation we had hoped to find with these specific models did not exist.
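To give a sense of how large this trial-and-error space is, the options listed above can be enumerated as a grid. The sketch below only counts combinations; it does not train any network and does not use DataEngine. Treating momentum and weight decay as two independent on/off switches, and allowing the two hidden layers to differ in size, are assumptions made for the illustration. We tried a number of these configurations, not the whole grid.

```python
from itertools import product

NEURONS = range(2, 16)  # two to fifteen neurons per hidden layer
# One hidden layer, or two hidden layers of possibly different sizes.
LAYOUTS = [(n,) for n in NEURONS] + list(product(NEURONS, NEURONS))
TRANSFER = ["Sigmoid", "Tanh", "Mod.Sigmoid", "Sim.Sigmoid", "Parabola"]
LEARNING = ["Resilient Propagation", "Super SAB",
            "Back Propagation", "Quick Propagation"]


def mlp_configurations():
    """Yield every combination of the options as a dictionary."""
    grid = product(LAYOUTS, TRANSFER, LEARNING,
                   (False, True), (False, True), ("sequential", "random"))
    for layout, transfer, learning, momentum, decay, order in grid:
        yield {
            "hidden_layers": layout,
            "transfer": transfer,
            "learning": learning,
            "momentum": momentum,
            "weight_decay": decay,
            "presentation": order,
        }


configs = list(mlp_configurations())
```

Under these assumptions the grid holds 210 layer layouts times 160 combinations of the remaining options, which explains why an exhaustive search was not attempted.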

Some data analysis models were impossible to build because of manifestations of poor data quality that became obvious only during the modeling phase of the CRISP-DM methodology. One of these was the clustering model we tried to build to investigate the similarities and dissimilarities between SMEs concerning legal and administrative limitations to the development of business operations. Only when the fuzzy c-means algorithm derived c completely equal cluster centers for all c ∈ {2, 3, …, 20} did it become obvious that the data were erroneous. Fig. 4 visualizes the identical structure of all cluster centers for c = 4. The reason for this unexpected finding is that, instead of ranking the eight conditions listed in the questionnaire from the least important (mark 1) to the most important (mark 8), the majority of SMEs found this requirement too demanding and introduced an evaluation method of their own, which varied from all ranks equal to 1 to all ranks equal to 8. These “innovative” rankings can be seen in Fig. 5, which shows the values of the eight relevant attributes for the first twenty records in the data set. The records containing correct rankings are highlighted, so it can easily be seen that erroneous records outnumber correct ones. This is, unfortunately, the case in the whole database.

Fig. 4. Fuzzy c-means algorithm calculated an identical structure of cluster centers
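A check of this kind can be automated before modeling starts. The sketch below assumes that a correct answer uses each mark from 1 to 8 exactly once, which is our reading of the ranking requirement; the function names and the sample answers are hypothetical.

```python
def is_valid_ranking(ranks, n=8):
    """True if `ranks` uses every mark from 1 to n exactly once."""
    return sorted(ranks) == list(range(1, n + 1))


def split_by_validity(records, n=8):
    """Separate records with a proper ranking from all the others."""
    valid = [r for r in records if is_valid_ranking(r, n)]
    invalid = [r for r in records if not is_valid_ranking(r, n)]
    return valid, invalid


# Hypothetical answers: a proper ranking, then two "innovative" ones.
answers = [
    [3, 1, 8, 2, 5, 4, 7, 6],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [8, 8, 8, 8, 8, 8, 8, 8],
]
valid, invalid = split_by_validity(answers)
```

Running such a check during data understanding would have exposed the problem before any clustering model was attempted.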

III. CONCLUSION

In this article, the intelligent analysis of data on small and medium sized enterprises is described from the viewpoint of the hidden relations revealed in the data, which could support the development of the SME sector. Despite several limitations to the KDD process that resulted from variations in the quality of the analyzed data, we found some very interesting and unexpected relations between particular subsets of attributes describing SMEs in Vojvodina, a few of which are mentioned in this paper. Additional discovered patterns are described in [11].

Crucial for the success of the described KDD process were the first phases of the CRISP-DM methodology, comprising data formatting, description, exploration, and data quality verification, because the data source was partially erroneous and of poor quality. These initial CRISP-DM steps are described in more detail in [2], while this paper presents some experiences and lessons learned from data exploration.

The fact that many questions in the questionnaire used for data collection were left unanswered, and that even among the provided answers many were incorrect, led us to the general conclusion that the goal and importance of the KDD endeavor were not recognized or fully understood by the SMEs’ representatives.

Fig. 5. Erroneous data had to be excluded from further data analysis

The results of the complex preparation process were appropriate inputs for data modeling by intelligent data analysis techniques, such as statistical, fuzzy, and neural techniques. In the modeling phase of CRISP-DM we used DataEngine, a software package suited to demanding analyses of more than 2000 records with over 100 attributes of SMEs. During data modeling many useful findings were derived, but some additional irregularities in data subsets were observed as well. Such data had to be disregarded, so the available data collection was significantly reduced. To avoid excluding large amounts of input data from intelligent analysis, more attention should be paid to the data acquisition process. Our general conclusion is that the KDD process would be less troublesome if the data analyst involved in the later phases of the CRISP-DM methodology could supervise the data collection and acquisition process.

ACKNOWLEDGMENTS

This paper is the result of research on the project titled “Comparative Advantages of Intelligent Data Analysis Methods in Strengthening the Sector of Small and Medium Sized Enterprises”, No. 114-451-01092/2008-01, funded by the Ministry of Sciences and Technology Development of the Province of Vojvodina, Republic of Serbia.

REFERENCES

[1] P. Chapman (NCR), J. Clinton (SPSS), R. Kerber (NCR), T. Khabaza (SPSS), T. Reinartz (DaimlerChrysler), C. Shearer (SPSS), and R. Wirth (DaimlerChrysler), CRISP-DM 1.0 Step-by-Step Data Mining Guide, SPSS, http://www.crisp-dm.org/CRISPWP-0800.pdf, 2000.

[2] O. Grljević, and Z. Bošnjak, “Primena CRISP-DM metodologije u analizi podataka o malim i srednjim preduzećima” (“CRISP-DM methodology Utilization in Preprocessing Small and Medium Sized Enterprises Data”), Book of proceedings of XXXV Symposium on OR, SYM-OP-IS 2008, ISBN: 978-86-7395-248-2, pp. 275-279, 2008.

[3] I.H. Witten, and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Elsevier Inc., 2005.

[4] D. Pyle, Data Preparation for Data Mining, Morgan Kaufmann Publishers Inc., 1999.

[5] I. Bratko, M. Kubat, and R.S. Michalski, Machine Learning and Data Mining: Methods and Applications, John Wiley & Sons Inc., 1998.

[6] K.J. Cios, L.A. Kurgan, R.W. Swiniarski, and W. Pedrycz, Data Mining: A Knowledge Discovery Approach, Springer Science + Business Media LLC, 2007.

[7] C.M. Ennett, M. Frize, and C.R. Walker, “Influence of Missing Values on Artificial Neural Network Performance”, http://www.sce.carleton.ca/faculty/frize/MIRG_2001/Ennet_medinfo2001.pdf, 2001.

[8] H. Iwamya, and B. Kermanashahi, Sensitivity Analysis and Artificial Neural Network used for Long-term Forecasting of 9 Japanese Power Utilities, Department of Electronics & Information Engineering, Tokyo University of Agriculture and Technology, Tokyo, 2000.

[9] J. Han, and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, 2001.

[10] K.A. Smith, and J.N.D. Gupta, Neural Networks in Business: Techniques and Applications, IRM Press, London, England, 2002.

[11] Z. Bošnjak, and O. Grljević, “Data mining as a mean for devising actions for development of the sector of small and medium sized enterprises”, unpublished.

[12] DataEngine - Users Guide, MIT, Germany, 1998.

[13] T. Warren Liao, and E. Triantaphyllou, Recent Advances in Data Mining of Enterprise Data: Algorithms and Applications, Vol. 6, Series on Computers and Operations Research, ISBN 978-981-277-985-4, 2008.

[14] J. W. Seifert, “Data Mining: An Overview”, CRS Report for Congress, http://www.fas.org/irp/crs/RL31798.pdf, December 16, 2004.

[15] G. Cahlink, “Data Mining Taps the Trends”, Government Executive Magazine, http://www.govexec.com/tech/articles/1000managetech.html, October 1, 2000.

[16] H. R. Nemati, and C. D. Barko, Organizational Data Mining: Leveraging Enterprise Data Resources for Optimal Performance, Idea Group Inc (IGI), ISBN 1591402220, 9781591402220, 2003.

[17] P.G. Harrison, and C.M. Llado, “Performance Evaluation of a Distributed Enterprise Data Mining System Source”, Lecture Notes In Computer Science, Vol. 1786, Springer-Verlag, London, pp. 117-131, 2000.

[18] Z. Bošnjak, and S. Bošnjak, “Expert System Support in Data Mining Method Selection”, Book of Abstracts, pp. 154, 20th European Conference on Operations Research, Rhodes, Greece, July 4-7, 2004.

[19] Z. Bošnjak, S. Bošnjak, and M. Stojković, “Application of Fuzzy Clustering for Searching Trends in Data - The Public Transport Company in Subotica Case Study”, Proceedings of EUROFUSE 2005 (The Ninth Meeting of the EURO Working Group on Fuzzy Sets “Fuzzy for Better”), ISBN 86-7172-022-5, pp. 26-35, Belgrade, Serbia, June 15-18, 2005.

Manuscript received June AA, 2007; revised September

BB, 2007; accepted for publication December CC, 2007.
