BULETINUL STIINTIFIC al Universitatii “Politehnica” din Timisoara, ROMANIA,
Seria AUTOMATICA si CALCULATOARE
SCIENTIFIC BULLETIN of “Politehnica” University of Timisoara, ROMANIA,
Transactions on AUTOMATIC CONTROL and COMPUTER SCIENCE, Vol. xx (yy), Fasc. z, 20vv, ISSN 1224-600X
CRISP-DM as a Framework for Discovering Knowledge
in Small and Medium Sized Enterprises’ Data
Z. Bošnjak*, O. Grljević**, S. Bošnjak***
* Department of Business Information Systems, University of Novi Sad, Faculty of Economics, Segedinski put 9-11, 24000 Subotica,
Serbia
Phone: (381) (0)24- 628-045, Fax: (381) (0)24 -546-486, E-Mail: [email protected], WWW:
http://www.ef.uns.ac.rs/osoblje2008/personalwebs/bosnjak_zita.htm
** Department of Business Information Systems, University of Novi Sad, Faculty of Economics, Segedinski put 9-11, 24000 Subotica,
Serbia
Phone: (381) (0)24- 628-166, Fax: (381) (0)24 -546-486, E-Mail: [email protected],
*** Department of Business Information Systems, University of Novi Sad, Faculty of Economics, Segedinski put 9-11, 24000 Subotica,
Serbia
Phone: (381) (0)24- 628-004, Fax: (381) (0)24 -546-486, E-Mail: [email protected], WWW:
http://www.ef.uns.ac.rs/osoblje2008/personalwebs/bosnjak_sasa.htm
Abstract – Discovering knowledge in vast amounts of data
is a promising research area, but it is also an intricate,
uncertain and time-consuming process. The complexity of a
data collection, the variations in data quality and their
impact on the discovery process, as well as the
applicability of the results, call for extensive research
and accumulated experience to overcome the difficulties
that can jeopardize the knowledge discovery in data (KDD)
process as a whole. In this article we describe the
limitations and challenges of discovering knowledge that we
experienced while analyzing small and medium sized
enterprises’ (SMEs) data.
Keywords: data mining, CRISP-DM methodology, data
analysis models, data exploration.
I. INTRODUCTION
Data mining methods and techniques have become an
important research area, because their inclusion into a data
analysis process can reveal hidden relations, behavioral
patterns, entity profiles, and similar regularities in data
stored in large databases or warehouses. Knowledge
discovered by intelligent methods could hardly be acquired
by traditional means, such as statistical analysis, data
queries or other analytical methods, because of the huge
amounts of collected data and because the analyst has only
a vague idea of what knowledge the data may contain.
Therefore, knowledge discovery in data (KDD), and data
mining (DM) as its integral part, are indispensable for
data analysts. However, this holds only for input data of
good quality, or for data that can be brought to such
quality by preprocessing. Data quality refers to the
accuracy and completeness of the data being analyzed.
In [14] the author states that: “Data
quality is a multifaceted issue that represents one of the
biggest challenges for data mining.” It has been recognized
that the success of a whole KDD process depends on
provided inputs.
In this paper, we describe part of our efforts to take
advantage of intelligent data analysis and discover some of
the knowledge hidden in small and medium sized enterprises’
(SMEs) data, with the aim of supporting the development of
the SME sector. Furthermore, we describe the challenges we
faced in data mining related to poor data quality, the
consequent limitations in the applicability of the revealed
knowledge, and the lessons learned.
II. DISCOVERING KNOWLEDGE THROUGH CRISP-DM METHODOLOGY
The cross-industry standard process for data mining
(CRISP-DM) is an iterative process model, developed in
1996, that has been the most favored methodology ever
since [1].
The research described in this article was conducted using
this methodology.
A. The CRISP-DM Methodology Life Cycle
CRISP-DM comprises the following tasks: (a) business
understanding; (b) data understanding; (c) data preparation;
(d) modeling; (e) evaluation; and (f) deployment.
Generally, these tasks follow each other as subsequent
phases, but within this main stream, many iterative cycles
can be observed. This can be explained by the fact that the
output of each phase influences the next methodological
step. For instance, after the data understanding phase, the
analyst often has to return to business understanding and
reconsider the aims and reasons for KDD. Similarly, after
the modeling phase, it is not unusual that further data
preprocessing is needed in order to improve the derived
data models or even to develop additional ones.
Furthermore, the findings of the evaluation phase may
require a fresh start from the first task of the CRISP-DM
methodology, business understanding, if the models do not
support the KDD goals defined beforehand.
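The iterative life cycle described above can be sketched as a small transition table. This is only an illustration: the phase names and the three feedback loops come from the text, while the function next_phase and its satisfied flag are hypothetical names introduced here.

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Feedback loops named in the text: the phase to return to when the
# output of the current phase is not satisfactory.
FEEDBACK = {
    Phase.DATA_UNDERSTANDING: Phase.BUSINESS_UNDERSTANDING,
    Phase.MODELING: Phase.DATA_PREPARATION,
    Phase.EVALUATION: Phase.BUSINESS_UNDERSTANDING,
}


def next_phase(current: Phase, satisfied: bool) -> Phase:
    """Return the phase that follows `current` (hypothetical helper).

    A satisfactory output moves the process forward; otherwise the
    feedback loop is followed if one is defined, or the phase repeats.
    """
    if satisfied:
        # Deployment is the last phase, so it maps to itself.
        return Phase(min(current.value + 1, Phase.DEPLOYMENT.value))
    return FEEDBACK.get(current, current)
```

For example, next_phase(Phase.MODELING, satisfied=False) returns Phase.DATA_PREPARATION, which mirrors the return to preprocessing mentioned above.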
B. Understanding the Goals of Knowledge Discovery in Data
The role of the first step of CRISP-DM is to establish the
reasons for, and the goals of, a KDD process. Many authors
have reported on the diverse areas in both the private
and public sectors in which DM is used. In [14] it is said
that “Industries such as banking, insurance, medicine, and
retailing commonly use data mining to reduce costs,
enhance research, and increase sales.” Other sources, like
(15) and (20), report on data mining applications as a
means to detect fraud and waste, as well as for purposes
such as measuring and improving program performance or
searching for trends in data. The research described in
[13], [16], and [17] focuses on the analysis of large and complex
datasets derived from business and industrial activities,
known as “enterprise data”. Enterprise data mining has
been used to design a more cost-effective strategy for
optimizing some type of performance measure (reducing
production time, improving quality, eliminating wastes,
maximizing profit, etc.). In the KDD process described in
this article, intelligent data analysis was conducted in
order to reveal hidden, potentially useful knowledge in a
data collection of 2365 records with more than 100
attributes, each record describing one SME in the province
of Vojvodina. The KDD results obtained should be the
starting point for devising proactive measures that foster
the development of the SME sector and prevent unfavorable
developments, not only for individual business entities but
for the industry as a whole.
C. Knowledge Acquired Through Data Understanding and
Data Preparation Phases
The data on SMEs were collected in 2006, by means of
distributed questionnaires, provided by four Regional
Agencies for the Development of Small and Medium Sized
Enterprises and Entrepreneurship, from the Province of
Vojvodina. The questions in the questionnaire were
grouped by relevant topics into eight groups: general data,
business data, technical aspects and technology,
administrative and legislative conditions, financial aspects,
market conditions and distribution, human resources, and
business connectivity.
The second phase of the CRISP-DM methodology, data
understanding, consists of data formatting, description,
exploration and data quality verification. The data that
formed the basis of our analysis were originally stored in
MS Access format and were then transformed into input
suitable for DataEngine, a software tool for intelligent
data analysis. Reference [12] describes this tool in
detail, while the data preprocessing activities are
described in more detail in [2]. In the data understanding
phase, all source data records were still regarded as
equally important and all attributes were subject to
further processing. Table I shows the source data format,
where the first part of the record corresponds to the
general enterprise data, while the second part corresponds
to the answers the enterprises’ representatives provided in
the questionnaire, such as the type of ownership, number of
owners, the director’s level of education, work experience
and gender, number of male/female employees, percentage of
enterprise capacity exploited, main problems of doing
business (lack of funds, regulations, disharmony of
standards, lack of market information, etc.), total
capital, profit, ROI, domestic and international market
shares, and the like. After the data were loaded into
DataEngine, the values yes and no were replaced by 1 and 0,
respectively. In subsequent steps of the KDD process, the
single data set was divided into subsets, and different DM
methods and techniques were used for their analysis, as
described below.
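The replacement of yes/no answers by 1 and 0 can be illustrated with a minimal sketch. It is not the procedure used inside DataEngine; the function name encode_answers is hypothetical, and the sample string is a short fragment of the record shown in Table I. It is assumed here that every token other than Yes/No in the answer part is an integer code; identifier fields with leading zeros (such as 08779767) would have to be kept as strings instead.

```python
def encode_answers(record: str) -> list:
    """Replace 'Yes'/'No' tokens by 1/0 and convert numeric tokens to int.

    Assumption: `record` contains only the questionnaire answers, i.e.
    whitespace-separated integer codes and Yes/No values.
    """
    mapping = {"yes": 1, "no": 0}
    encoded = []
    for token in record.split():
        key = token.lower()
        if key in mapping:
            encoded.append(mapping[key])
        else:
            encoded.append(int(token))
    return encoded


sample = "35 2 2 0 Yes Yes No No"
print(encode_answers(sample))  # [35, 2, 2, 0, 1, 1, 0, 0]
```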
Data exploration is another task in CRISP-DM. It should
help the analyst to state some initial hypotheses about
relations in the data, their dispersion and their mutual
impact. Simple database queries and visualization
techniques can help to formulate statements related to the
defined DM goals and to reveal interesting patterns. During
the exploration of the SME data, some interesting facts
were observed, such as that there were almost three times
fewer female than male directors in SMEs in Vojvodina
(24.06% vs. 64.27%), as shown in Fig. 1(a), or that the
overall education level of directors was disappointing,
with III grades of secondary school predominating (on the
scale from 3 to 8, this was rank 4), as shown
in Fig. 1(b). While such insight into the sector of SMEs
TABLE I. Source data format
1 1 08779767 102697394 1 1 80381
2 1 4 15 35 1 1 1 2 2 35 2 2 0 Yes Yes No No No No No No 2 Yes No
No No 1 Yes No No No No 1 1 2 Yes Yes No No No No No No No Yes No 2 No No No No No No Yes No No Yes Yes No No
Fig. 1. (a) Distribution of gender: 0-unanswered, 1-males, 2-females;
(b) Distribution of education degrees: 8-PhD, 7-University, 6-High
school, 5-Secondary school, 4-III grades of secondary school,
3-elementary school, 1-other, 0-unanswered
dealing with, it had to be treated with caution, because of
the large number of missing values. Fig. 1(a) shows that
more than 11% of SMEs did not answer the question about the
director’s gender, while Fig. 1(b) exhibits an even greater
lack of information, as 27.78% of values for the director’s
education degree were missing. In the data cleansing phase,
missing values were either replaced by some neutral value,
or the records with missing data were excluded from further
analysis in the KDD process. For example, if an SME did not
answer the question “What is the main difficulty the
enterprise was coping with in everyday business?”, we
assumed that there were no difficulties worth mentioning
and replaced the answer with a zero value. Only 40% of the
analyzed SMEs were willing to answer the questions about
capacity utilization, and therefore the further data
analysis was conducted on 840 records. Even fewer
enterprises, 31.54%, provided answers regarding their
investments, and therefore the input data set for the
modeling phase was reduced accordingly.
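The two cleansing strategies mentioned above, replacement by a neutral value and exclusion of incomplete records, can be sketched as follows. This is an illustration under stated assumptions, not the procedure applied in DataEngine: records are represented as dictionaries, a missing answer is represented by None, and the attribute names in the usage note are invented for the example.

```python
MISSING = None  # assumed marker for an unanswered question


def fill_neutral(records, attribute, neutral=0):
    """Return copies of the records with missing `attribute` values replaced."""
    return [
        {**r, attribute: neutral} if r.get(attribute) is MISSING else dict(r)
        for r in records
    ]


def drop_missing(records, attribute):
    """Keep only the records in which `attribute` was answered."""
    return [dict(r) for r in records if r.get(attribute) is not MISSING]


def missing_share(records, attribute):
    """Fraction of records in which `attribute` is missing."""
    if not records:
        return 0.0
    return sum(r.get(attribute) is MISSING for r in records) / len(records)
```

With two toy records, one of which leaves main_difficulty unanswered, fill_neutral(records, "main_difficulty") sets that value to 0, while drop_missing(records, "capacity") removes every record without a capacity answer. Checking missing_share first shows how much of the data set each strategy would affect.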
Besides providing better insight into the data collection,
visualization techniques were also very useful in detecting
erroneous data, especially when such data occurred as
“outliers”. Fig. 2 shows one such example, where instead of
the code 52240 for business industry, the value 552240 was
entered in the 966th record. The layouts before and after
the data correction are shown in Fig. 2(a) and Fig. 2(b),
respectively.
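The outlier in this example was found visually. A numerical check in the same spirit can be sketched as follows, assuming a rule based on the median absolute deviation (MAD); this rule, the threshold of 10 and the function name flag_outliers are assumptions made for illustration and are not taken from DataEngine. Only the codes 52240 and 552240 come from the text; the other values in the usage note are invented.

```python
from statistics import median


def flag_outliers(values, threshold=10.0):
    """Return indices of values that lie far from the median.

    Distance is measured in units of the median absolute deviation
    (MAD), which is not distorted by a single extreme value.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half of the values are identical; flag the rest.
        return [i for i, v in enumerate(values) if v != med]
    return [i for i, v in enumerate(values) if abs(v - med) / mad > threshold]
```

For the toy list [52240, 52110, 51190, 52240, 552240, 50200, 52480] the function returns [4], the position of the mistyped code.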
Fig. 2. Visualization techniques support “outlier” detection: (a) Layout before data correction; (b) Layout after data correction
Visualization techniques were also useful in the data
exploration phase of the CRISP-DM methodology, for verifying
or rejecting initially stated hypotheses on data dependences.
It was discovered that there was a high correlation
(r = 0.766442) between the investment activities of
enterprises in the last five years and their adoption of
standardization policies. This finding was no surprise and
could easily be explained, as standardization often imposes
serious changes in doing business, demanding additional
investments. On the other hand, the hypothesized
correlation between equipment maturity and lack of
investments in new technologies had to be rejected
(r = -0.20295). This was contrary to our expectation that
SMEs with outdated equipment had invested less in the last
five years than SMEs with contemporary equipment.
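The text does not state which correlation coefficient was computed. Assuming it is the Pearson product-moment coefficient, the calculation can be sketched as follows; the function name pearson_r is introduced here for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

Values of r close to 1 or -1 indicate a strong linear relation, and values close to 0 a weak one, which is how the two coefficients reported above were read.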
D. Challenges of Revealing Knowledge through Modeling
The modeling phase of the CRISP-DM methodology includes
the application of different DM and knowledge discovery
methods, each with a wide range of tunable parameters.
These methods can be grouped into categories depending on
the algorithms used, such as artificial neural networks
(ANNs), cluster analysis, decision trees, mining of
association rules, genetic algorithms, Bayes networks, rule
induction, etc. It is well known that no method dominates
the others all the time. References [13] and [18] provide
some answers to the question of which method to choose for
a particular application.
In the research described in this paper, data modeling was
supported by the DataEngine software tool for intelligent
data analysis. DataEngine combines statistical methods
with neural network technology (both supervised and
unsupervised learning models) and fuzzy technology, in
order to provide the data mining capabilities best suited
to a specific KDD task.
Our investigations focused mainly on three tasks: devising
discriminators between successful and less successful
enterprises; describing the general profile of businesses
that were likely to fail in achieving their goals; and
selecting the attributes with high predictive power for
forecasting future business gains or losses. The first task
is a classification problem. Each SME, represented by a
corresponding tuple in the database, had to be mapped to
one of the predefined, nonoverlapping classes that
partition the entire database. The second task is a typical
clustering task, where data in the database are grouped by
finding similarities between them according to their
characteristics. By partitioning, or segmenting, the
database into clusters, we hoped to gain a more general
view of SMEs in Vojvodina. The third task is prediction. In
our view, prediction is used both to predict class labels,
as in classification, and to estimate the value of an
attribute that a given sample is going to have. Using
DataEngine, we created data models for clustering with the
fuzzy c-means algorithm and Kohonen neural networks, and
for classification and prediction with multilayer
feed-forward neural networks (multilayer perceptrons, MLP).
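For readers unfamiliar with fuzzy c-means, the algorithm can be sketched in a few lines. This is a textbook formulation written for illustration, not the DataEngine implementation; the fuzzifier m = 2, the random initialization, the stopping tolerance and the function name fuzzy_c_means are assumptions.

```python
import random


def fuzzy_c_means(points, c, m=2.0, iterations=100, tol=1e-6, seed=0):
    """Cluster `points` (lists of floats) into `c` fuzzy clusters.

    Returns (centers, memberships); memberships[j][i] is the degree to
    which point j belongs to cluster i, and each row sums to 1.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, normalized so that each row sums to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(iterations):
        # Update each center as the membership-weighted mean of the points.
        centers = []
        for i in range(c):
            weights = [u[j][i] ** m for j in range(n)]
            total = sum(weights)
            centers.append([
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ])
        # Update memberships from the distances to the new centers.
        shift = 0.0
        for j, p in enumerate(points):
            dist = [
                max(sum((p[d] - ctr[d]) ** 2 for d in range(dim)) ** 0.5, 1e-12)
                for ctr in centers
            ]
            new_row = [
                1.0 / sum((dist[i] / dist[k]) ** (2.0 / (m - 1.0))
                          for k in range(c))
                for i in range(c)
            ]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[j])))
            u[j] = new_row
        if shift < tol:
            break
    return centers, u
```

Unlike a crisp partition, every record receives a membership degree in every cluster, so an SME that lies between two groups is described by two moderate memberships instead of being forced into one class.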
As data mining is known to be a time-consuming, laborious
endeavor, with no guarantee that interesting and
potentially useful patterns will be revealed, we were
prepared for a trial-and-error approach to data analysis
and for the creation of a large number of data models. Some
models produced interesting findings, such as the
clustering model that divided the SMEs according to the
main problems they were facing in everyday business
operations (lack of available funds, complex administrative
and legislative regulations, disharmony with standards,
insufficient market information, insufficient information
on technologies, unavailability of qualified work force,
and human resources development). This clustering model
grouped the enterprises into four clusters, as shown in
Fig. 3. It can be seen that one of the clusters (marked as
cluster IV) comprises SMEs that had not recognized any of
the above listed threats as a serious one to their business
operations (rank 2), while all other enterprises had
recognized the lack of funds as a serious threat. A
surprising finding was that the membership of SMEs in
domestic/foreign business associations, as well as their
involvement in industry clusters¹, had no influence on
overcoming the problem of insufficient funds, despite the
fact that both business associations and industry clusters
should be established primarily for this reason.
¹ A distinction should be made between the term “cluster” in the sense of
a data mining output and the term “cluster” in the sense of a group of
similar business entities that share a common business goal. Therefore,
in this paper the latter is replaced by the term “industry cluster”.
Fig. 3. DataEngine clustering model regarding the main problems in business operations
During the modeling phase of CRISP-DM, conducted on the SME
data, we saw that not only successful data models but also
the inability to develop a meaningful data model could
provide valuable information. We failed to develop an MLP
classification model that could classify SMEs, based on
seven input attributes related to difficulties in everyday
business operations, into one of the three predefined
classes:
1 – SMEs with outdated equipment
2 – SMEs with moderately outdated equipment
3 – SMEs with “new generation” equipment.
From this failure we learned that the defined output
variable was not dependent on the selected input attributes
in any way. Namely, after a number of trials with different
MLP configurations (with one or two hidden layers, with two
to fifteen neurons in each, using Sigmoid, Tanh,
Mod.Sigmoid, Sim.Sigmoid and Parabola transfer functions)
and various learning methods offered in the DataEngine tool
(Resilient Propagation, Super SAB, Back Propagation, Quick
Propagation, with and without momentum and weight decay,
with sequential and random presentation order), we
concluded that the hidden relation we had hoped to find by
these specific models did not exist.
Some data analysis models were impossible to build because
of manifestations of poor data quality that became obvious
only during the modeling phase of the CRISP-DM methodology.
One of these was the clustering model we tried to build for
investigating the similarities and dissimilarities between
SMEs concerning legal and administrative limitations to the
development of business operations.
Only when the fuzzy c-means algorithm derived c completely
identical cluster centers for every c ∈ {2, 3, …, 20} did it
become obvious that the data were erroneous. Fig. 4
visualizes the identical structure of all cluster centers
for c = 4. The reason for this unexpected finding is that,
instead of ranking the eight conditions listed in the
questionnaire from the least important (mark 1) to the most
important (mark 8), the majority of SMEs found this
requirement too demanding and introduced an evaluation
method of their own, which varied from all ranks equal to 1
to all ranks equal to 8. These “innovative” rankings can be
seen in Fig. 5, which shows the values attached to the
eight relevant attributes of the first twenty records in
the data set. The records containing correct rankings are
highlighted, so it can easily be noticed that the erroneous
records outnumber the correct ones. This is, unfortunately,
the case in the whole database.
Fig. 4. Fuzzy c-means algorithm calculated an identical structure of cluster
centers
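A check of this kind could have been run before modeling. The sketch below assumes that a correct answer is a strict ranking, in which each of the marks 1 to 8 is used exactly once; the questionnaire rules are not given in that detail, so this is an assumption, and the function names are hypothetical.

```python
def is_valid_ranking(marks, n=8):
    """True if `marks` uses each of the ranks 1..n exactly once."""
    return sorted(marks) == list(range(1, n + 1))


def split_by_validity(records, n=8):
    """Separate records with a valid ranking from all others."""
    valid = [r for r in records if is_valid_ranking(r, n)]
    invalid = [r for r in records if not is_valid_ranking(r, n)]
    return valid, invalid
```

A record such as [3, 1, 8, 2, 7, 4, 6, 5] passes the check, whereas the "all ranks equal to 1" and "all ranks equal to 8" answers described above are rejected. Reporting the share of rejected records during data understanding would reveal the problem before any clustering model is built.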
III. CONCLUSION
In this article, the intelligent analysis of small and
medium sized enterprises’ data is described from the
viewpoint of the hidden relations revealed in the data,
which could support the development of the SME sector.
Despite several limitations to the KDD process resulting
from variations in the quality of the analyzed data, we
found some very interesting and unexpected relations
between particular subsets of attributes describing SMEs in
Vojvodina, a few of which are mentioned in this paper.
Additional discovered patterns are described in [11].
Crucial for the success of the described KDD process were
the first phases of the CRISP-DM methodology, comprising
data formatting, description, exploration, and data quality
verification, because the data source was partially
erroneous and of poor quality. These initial CRISP-DM steps
are described in more detail in [2], while this paper
presents some experiences and lessons learned from data
exploration.
The fact that many questions in the questionnaire used for
data collection were left unanswered, and that even among
the provided answers many were incorrect, led us to the
general conclusion that the goal and importance of the KDD
endeavor were not recognized or fully understood by the SME
representatives.
Fig. 5. Erroneous data had to be excluded from further data analysis
The results of the complex preparation process were
appropriate inputs for data modeling by intelligent data
analysis techniques, such as statistical, fuzzy, and neural
techniques. In the modeling phase of CRISP-DM, we used
DataEngine, a software package suitable for demanding
analyses of more than 2000 records with over 100 attributes
of SMEs. During data modeling, many useful findings were
derived, but some additional irregularities in data subsets
were observed as well. Such data had to be disregarded, so
the available data collection was significantly reduced. To
avoid excluding large amounts of input data from
intelligent analysis, more attention should be paid to the
data acquisition process. Our general conclusion is that
the KDD process would be less troublesome if the data
analyst involved in the later phases of the CRISP-DM
methodology could supervise the data collection and
acquisition process.
ACKNOWLEDGMENTS
This paper is the result of research on the project titled
“Comparative Advantages of Intelligent Data Analysis
Methods in Strengthening the Sector of Small and Medium
Sized Enterprises”, No. 114-451-01092/2008-01, funded by
the Ministry of Sciences and Technology Development of
Province of Vojvodina, Republic of Serbia.
REFERENCES
[1] P. Chapman (NCR), J. Clinton (SPSS), R. Kerber (NCR), T.
Khabaza (SPSS), T. Reinartz (DaimlerChrysler), C. Shearer (SPSS),
and R. Wirth (DaimlerChrysler), CRISP-DM 1.0 Step-by-Step Data
Mining Guide, SPSS, http://www.crisp-dm.org/CRISPWP-0800.pdf, 2000.
[2] O. Grljević, and Z. Bošnjak, “Primena CRISP-DM metodologije u
analizi podataka o malim i srednjim preduzećima” (“CRISP-DM
Methodology Utilization in Preprocessing Small and Medium Sized
Enterprises Data”), Book of proceedings of XXXV Symposium on
OR, SYM-OP-IS 2008, ISBN: 978-86-7395-248-2, pp. 275-279,
2008.
[3] I.H. Witten, and E. Frank, Data Mining: Practical Machine Learning
Tools and Techniques, Elsevier Inc., 2005.
[4] D. Pyle, Data Preparation for Data Mining, Morgan Kaufmann
Publishers Inc., 1999.
[5] I. Bratko, M. Kubat, and R.S. Michalski, Machine Learning and
Data Mining: Methods and Applications, John Wiley & Sons Inc.,
1998.
[6] K.J. Cios, L.A. Kurgan, R.W. Swiniarski, and W. Pedrycz, Data
Mining: A Knowledge Discovery Approach, Springer Science +
Business Media LLC, 2007.
[7] M.E. Colleen, F.C. Monique, and W. Robin, “Influence of Missing
Values on Artificial Neural Network Performance”,
http://www.sce.carleton.ca/faculty/frize/MIRG_2001/Ennet_medinfo2001.pdf, 2001.
[8] H. Iwamya, and B. Kermanashahi, Sensitivity Analysis and
Artificial Neural Network used for Long-term Forecasting of 9
Japanese Power Utilities, Department of Electronics & Information
Engineering Tokyo University of Agriculture and Technology,
Tokyo, 2000.
[9] J. Han, and M. Kamber, Data Mining: Concepts and Techniques,
Morgan Kaufmann Publishers, San Francisco, 2001.
[10] K.A. Smith, and J.N.D. Gupta, Neural Networks in Business:
Techniques and Applications, IRM Press, London, England, 2002.
[11] Z. Bošnjak, and O. Grljević, “Data mining as a mean for devising
actions for development of the sector of small and medium sized
enterprises”, unpublished.
[12] DataEngine - Users Guide, MIT, Germany, 1998.
[13] T. Warren Liao, and E. Triantaphyllou, Recent Advances in Data
Mining of Enterprise Data: Algorithms and Applications, Vol. 6,
Series on Computers and Operations Research, ISBN 978-981-277-985-4, 2008.
[14] J. W. Seifert, “Data Mining: An Overview”, CRS Report for
Congress, http://www.fas.org/irp/crs/RL31798.pdf, December 16,
2004.
[15] G. Cahlink, “Data Mining Taps the Trends”, Government Executive
Magazine, http://www.govexec.com/tech/articles/1000managetech.html,
October 1, 2000.
[16] H. R. Nemati, and C. D. Barko, Organizational Data Mining:
Leveraging Enterprise Data Resources for Optimal Performance,
Idea Group Inc (IGI), ISBN 1591402220, 9781591402220, 2003.
[17] P.G. Harrison, and C.M. Llado, “Performance Evaluation of a
Distributed Enterprise Data Mining System Source”, Lecture Notes
In Computer Science, Vol. 1786, Springer-Verlag, London, pp. 117-131, 2000.
[18] Z. Bošnjak, and S. Bošnjak, “Expert System Support in Data Mining
Method Selection”, Book of Abstracts, pp. 154, 20th European
Conference on Operations Research, Rhodes, Greece, July 4-7,
2004.
[19] Z. Bošnjak, S. Bošnjak, and M. Stojković, “Application of Fuzzy
Clustering for Searching Trends in Data - The Public Transport
Company in Subotica Case Study”, Proceedings of EUROFUSE
2005, ISBN 86-7172-022-5, pp. 26-35, The Ninth Meeting of the
EURO Working Group on Fuzzy Sets “Fuzzy for Better”, June 15-18,
Belgrade, Serbia, 2005.
Manuscript received June AA, 2007; revised September
BB, 2007; accepted for publication December CC, 2007.