Mining Educational Data : A Perspective Review on Data Mining Suites


Wan Aezwani Wan Abu Bakar
Department of Computer Science
Faculty of Science & Technology, University Malaysia Terengganu
21030 Kuala Terengganu, Terengganu.
[email protected]

Masita Abd Jalil
Department of Computer Science
Faculty of Science & Technology, University Malaysia Terengganu
21030 Kuala Terengganu, Terengganu.
[email protected]

Mohd Yazid Md. Saman
Department of Computer Science
Faculty of Science & Technology, University Malaysia Terengganu
21030 Kuala Terengganu, Terengganu.
[email protected]

ABSTRACT

Building an application of data mining algorithms requires powerful software and programming tools. As the number of available tools continues to grow, choosing the most prominent and suitable tool becomes increasingly difficult. This paper attempts to support the decision-making process in choosing a tool by discussing the historical development and presenting existing open-source and state-of-the-art data mining tools. The data mining tools presented here are WEKA (Waikato Environment for Knowledge Analysis) and RapidMiner (RM). These tools are among the five most popular DM tools; they are also offered commercially, with free licenses for academic use. Furthermore, we propose criteria for tool categorization based on different user groups, data structures, data mining tasks and methods, visualization and interaction styles, import and export options for data and models, platforms, and license policies. These criteria are then used to classify data mining tools into nine different types. The typical characteristics of these types are explained and a selection of the most important tools is categorized.

Keywords : Educational Data Mining (EDM), Data Mining Suites (DMS), Data mining tools, free and open source licenses.

I. INTRODUCTION

Data mining has a long history, with strong roots in statistics, artificial intelligence, machine learning, and database research [1,2]. Today, a large number of standard data mining methods are available [3,4]. From a historical perspective, these methods have different roots. One early group of methods was adopted from classical statistics: the focus was changed from the proof of known hypotheses to the generation of new hypotheses. Examples include methods from Bayesian decision theory, regression theory, and principal component analysis. Another group of methods stemmed from artificial intelligence, such as decision trees, rule-based systems, and others. The term ‘machine learning’ includes methods such as support vector machines and artificial neural networks. There are several different and sometimes overlapping categorizations; for example, fuzzy logic, artificial neural networks, and evolutionary algorithms are summarized as computational intelligence [5].

The typical life cycle of new data mining methods begins with theoretical papers based on in-house software prototypes, followed by public or on-demand software distribution of successful algorithms as research prototypes [6]. Then, either special commercial or open-source packages containing a family of similar algorithms are developed, or the algorithms are integrated into existing open-source or commercial packages. Many companies have tried to promote their own stand-alone packages, but only a few have reached notable market shares. The life cycle of some data mining tools is remarkably short. Typical reasons include internal marketing decisions and acquisitions of specialized companies by larger ones, leading to a renaming and integration of product lines.

Open-source libraries have also become very popular since the 1990s. The most prominent example is the Waikato Environment for Knowledge Analysis (WEKA) [7]. WEKA started in 1994 as a C++ library, with its first public release in 1996. In 1999, it was completely rebuilt as a Java package; since that time, it has been regularly updated. In addition, WEKA components have been integrated into many other open-source tools such as Pentaho, RapidMiner, and KNIME.

II. DATA MINING TECHNIQUES AND TOOLS

Data mining is the process of automatically discovering useful information in large data repositories. Data mining techniques are deployed to scour large databases in order to find novel and useful patterns that might otherwise remain unknown [8]. Data mining, also called knowledge discovery in databases, is, in computer science, the process of discovering interesting and useful patterns and relationships in large volumes of data [9]. According to [10], the main challenges of data mining (DM) are:

To deal with huge amounts of data located at different sites, which can exceed the terabyte limit

To partition and distribute the data for parallel processing to achieve acceptable time and space performance

To mine the knowledge in a fast and efficient manner so that it is usable and up to date.


Educational Data Mining (EDM) is an emerging discipline concerned with developing methods for exploring the unique types of data that come from educational settings, and using those methods to better understand students and the settings in which they learn [11]. Historically, EDM is a relatively new scientific discipline. Although researchers have been recording and analyzing data from educational software for a long time, only recently has EDM been established as a field in its own right, through a conference series (the International Conference on Educational Data Mining, started in 2008) and a scientific journal, the Journal of Educational Data Mining (JEDM), whose first issue was published in 2009 [12]. EDM borrows from and extends related fields such as machine learning (the study of computer programs that learn from and improve with empirical data), text mining (approaches to finding patterns in natural language text) and statistics. A key area of EDM is mining computer logs of student performance [13]. Another key area is mining enrollment data [14]. Key uses of EDM include predicting student performance and studying learning in order to recommend improvements to current educational practices. EDM can be considered one of the learning sciences, as well as an area of data mining. The typical steps in EDM are data acquisition, data preprocessing (i.e. data cleaning), data mining and result validation.
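
The four steps named above (data acquisition, preprocessing, mining and validation) can be illustrated with a minimal Python sketch. It is not taken from WEKA, RapidMiner or any cited work: the records, the attribute meanings and the helper names (`clean`, `train_one_rule`, `accuracy`) are hypothetical, and the mining step is a deliberately simple single-attribute rule learner in the spirit of OneR, chosen only to keep the example short.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (quiz_attempts, forum_activity, outcome).
# None marks a missing value. All values are invented for illustration.
RAW_RECORDS = [
    ("high", "active", "pass"),
    ("high", "passive", "pass"),
    ("low", "passive", "fail"),
    ("low", "active", "fail"),
    ("high", None, "pass"),      # incomplete record, dropped during cleaning
    ("low", "passive", "fail"),
    ("high", "active", "pass"),
    ("low", "active", "pass"),
]

def clean(records):
    """Preprocessing: drop records that contain a missing value."""
    return [r for r in records if None not in r]

def train_one_rule(records, feature_index):
    """Mining: map each value of one attribute to its most frequent outcome."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[feature_index]][record[-1]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def accuracy(rule, records, feature_index):
    """Validation: share of records whose outcome the rule predicts correctly."""
    hits = sum(1 for r in records if rule.get(r[feature_index]) == r[-1])
    return hits / len(records)

data = clean(RAW_RECORDS)             # acquisition is simulated by RAW_RECORDS
train, test = data[:5], data[5:]      # simple holdout split
rule = train_one_rule(train, feature_index=0)
print(rule, accuracy(rule, test, feature_index=0))
```

In a suite such as WEKA or RapidMiner, each of these functions would correspond to a preconfigured filter, learner or evaluation component rather than hand-written code.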

Once DM tasks are applied to data from an educational setting, the work is referred to as EDM. EDM tasks do not really differ from those in DM. They consist of the following DM tasks, sometimes called DM techniques [10]:

Prediction: classification, regression and density estimation

Clustering

Relationship mining: association rule mining, correlation mining, sequential pattern mining and causal data mining

Distillation of data for human judgment

Discovery with models

There are several types of data mining software tools, such as Data Mining Suites (DMS), Mathematical Packages (MATs), Integration Packages (INTs), Data Mining Libraries (LIBs), Specialties (SPECs) and Solutions (SOLs). The only type discussed in this paper is DMS.

DMS focus largely on data mining and include numerous methods [6]. They support feature tables and time series, while additional tools for text mining are sometimes available. The application focus is wide and not restricted to a special application field, such as business applications; however, coupling to business solutions, import and export of models, reporting, and a variety of different platforms are nonetheless supported. In addition, the producers provide services for adapting the tools to the workflows and data structures of the customer. DMS are mostly commercial and rather expensive, but some open-source tools such as RapidMiner exist. Typical examples include IBM SPSS Modeler, SAS Enterprise Miner, Alice d'Isoft, DataEngine, DataDetective, GhostMiner, Knowledge Studio, KXEN, NAG Data Mining Components, Partek Discovery Suite, STATISTICA, and TIBCO Spotfire.

The most popular type of open-source license is the GNU General Public License of the Free Software Foundation (GNU-GPL or GPL) [15]. It permits free redistribution, integration in other packages, and modification of the software as long as all subsequent users receive the same level of freedom (so-called ‘copyleft’). This restriction guarantees that all software containing GNU-GPL components must be licensed under GNU-GPL. Weaker forms are licenses that are free for academic use, but not for business users. Mixed forms of licenses occur especially if open-source software is used to expand commercial tools such as Matlab (Matrix Laboratory). Table 1 lists popular data mining suites that are free and open source.

Table 1 : Data Mining Suites which are free and open source [1]

TOOL                    TYPE       LINK
D2K                     DMS        alg.ncsa.uiuc.edu
Gnome DataMine Tools    DMS        www.togaware.com/datamining/gdatamine
RapidMiner              DMS        rapid-i.com/content/view
Weka                    DMS, LIB   sourceforge.net/projects/weka

III. CRITERIA FOR COMPARING DATA MINING SOFTWARE

This section introduces the criteria used to compare data mining software [6]. These criteria are based on user groups, data structures, data mining tasks and methods, import and export options, and license models. Many different data mining tools are available, which fit the needs of quite different user groups:

Business applications: This group uses data mining as a tool for solving commercially relevant business applications such as customer relationship management, fraud detection, and so on. This field is mainly covered by a variety of commercial tools providing support for databases with large datasets, and deep integration in the company's workflow.

Applied research: A user group that applies data mining to research problems, for example, in technology and the life sciences. Here, users are mainly interested in tools with well-proven methods, a graphical user interface (GUI), and interfaces to domain-related data formats or databases.

Algorithm development: This group develops new data mining algorithms and requires tools both to integrate its own methods and to compare them with existing methods. The necessary tools should contain many competing algorithms.

Education: For education at universities, data mining tools should be very intuitive, with a comfortable interactive user interface, and inexpensive. In addition, they should allow the integration of in-house methods during programming seminars.

IV. DATA MINING DATASETS

The ease with which data and models can be imported and exported among different software tools plays a crucial role in the functionality of data mining tools [6]. First, the data are normally generated by and hosted on different sources such as databases or software associated with measurement devices. In business applications, interfaces to databases such as Oracle or any database supporting the Structured Query Language (SQL) standard are the most common means of importing data. Because almost all other non-data-mining tools support export as text or Excel files, formats such as Comma Separated Values (CSV) are frequently used to import data into data mining tools. In addition, almost all software tools have proprietary binary or textual file and exchange formats for data and models, e.g., the Attribute-Relation File Format (ARFF), which is the WEKA standard.
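
To make the relation between CSV and ARFF concrete, the following Python sketch converts a small CSV text into an ARFF text. It is an illustration only, not the converter shipped with WEKA. The function names `csv_to_arff` and `is_number`, the relation name and the sample data are hypothetical, and the sketch assumes simple values: no missing entries, no spaces or commas inside values, and only numeric or nominal attributes.

```python
import csv
import io

def is_number(text):
    """Return True if the text can be parsed as a floating-point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def csv_to_arff(csv_text, relation="students"):
    """Convert CSV text (first row = attribute names) into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        if all(is_number(v) for v in values):
            # Columns whose values all parse as numbers become numeric.
            lines.append("@attribute %s numeric" % name)
        else:
            # Anything else is declared nominal with its observed values.
            nominal = ",".join(sorted(set(values)))
            lines.append("@attribute %s {%s}" % (name, nominal))
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"

# Hypothetical sample: study hours and final outcome for two students.
sample = "hours,grade\n3.5,pass\n1,fail\n"
arff_text = csv_to_arff(sample)
print(arff_text)
```

The header section (`@relation`, `@attribute`) declares the attribute types that a CSV file leaves implicit, which is the main piece of information added when moving data from a spreadsheet export into ARFF.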

V. COMPARATIVE REVIEWS OF WEKA AND RAPIDMINER

The version of Weka reviewed here is Weka 3.7.9. Weka is a collection of open-source machine learning (ML) algorithms comprising 49 data pre-processing tools, 76 classification and regression algorithms, 8 clustering algorithms, 15 attribute/subset evaluators and 10 search algorithms for feature selection, and 3 association rule algorithms [16]. It was created by researchers at the University of Waikato in New Zealand and is open-source software written in Java and issued under the GNU General Public License. Weka is also the name of a bird found only on the islands of New Zealand, which serves as the software's icon [17].


Figure 1 : Weka Graphical User Interface (GUI)

Rapid-I provides software, solutions, and services in the fields of predictive analytics, data mining, and text mining. The version discussed here is RapidMiner 5.3.005. The software is developed by the Rapid-I group from Dortmund, Germany, and is licensed under the AGPL version 3. The company concentrates on automatic intelligent analyses on a large scale, i.e. for large amounts of structured data such as database systems and unstructured data such as texts. As an open-source data mining specialist, Rapid-I enables other companies to use leading-edge technologies for data mining and business intelligence. The discovery and leverage of unused business intelligence from existing data enables better-informed decisions and allows for process optimisation. Rapid-I serves its customers globally, with offices in Germany and the United States, and more than 30 partners on all continents support data analysis projects that use Rapid-I software products. RapidMiner can be downloaded from http://rapid-i.com/content/view/398/243/lang,en/ [18] and is installed by running the self-extracting file. Figures 2 and 3 show the RapidMiner GUI and the RapidMiner Operators and Repositories view.


Figure 2 : About RapidMiner

Figure 3 : Operators and Repositories View

Weka and RapidMiner each have their own strengths and limitations. Table 2 summarizes the differences between these two major DMS in terms of the features they provide.

Table 2 : Differences Between WEKA and RapidMiner

FEATURE AVAILABILITY, given for WEKA and for RAPIDMINER

Power and flexibility
WEKA: Weka's Experimenter is easy to use, but it is not flexible enough to meet real-world process requirements.
RAPIDMINER: RapidMiner provides many more analysis steps (operators) than Weka and many more possibilities to combine them.

Scalability
WEKA: An update manager is available in Weka and can be used according to the user's needs.
RAPIDMINER: RM can be updated through RapidMiner Extensions and can also be extended with components from Weka itself.

Visualization
WEKA: Visualization is mostly text based.
RAPIDMINER: Visualization is varied: tables, graphs, annotations, and text.

Data Format Availability
WEKA: ARFF, XRFF, CSV, C4.5, Database, JSON, LibSVM, Matlab, SerializedInstances, SVMLight, TextDirectory, and URL
RAPIDMINER: ARFF, XRFF, CSV, C4.5, Database, Excel, Excel with format, XML, SAS, Access, AML, SPSS, Stata, Sparse, Dbase, BibTex, DasyLab, and URL

Preprocessing
WEKA: Offers a variety of methods for preprocessing and data extraction.
RAPIDMINER: Offers a variety of methods for preprocessing and data transformation, which can additionally be tested dynamically according to the user's preferences.

Classification
WEKA: Bayes, functions, lazy, meta, misc, rules (DT, JRip, OneR, PART, ZeroR), trees (DecisionStump, J48, LMT, RandomForest, RandomTree, REPTree)
RAPIDMINER: Lazy Modelling (Default Model, k-NN), Bayesian Modelling (Naive Bayes, Naive Bayes (Kernel)), Tree Induction (DT, ID3, CHAID, DecisionStump, RandomTree, RandomForest), Rule Induction (SingleRuleInduction, SubgroupDiscovery, TreeToRules), Neural Net Training (Perceptron, NeuralNet, AutoMLP)

Clustering
WEKA: CobWeb, Expectation-Maximisation (EM), FarthestFirst, FilteredClusterer, HierarchicalClusterer, MakeDensityBasedClusterer, SimpleKMeans
RAPIDMINER: k-Means, k-Means (Kernel), k-Means (Fast), k-Medoids, DBSCAN, EM, Support Vector Clustering, RandomClustering, AgglomerativeClustering, TopDownClustering, FlattenClustering, ExtractCluster Prototype

Association
WEKA: Offers 3 association analysis algorithms, i.e. Apriori, FP-Growth and FilteredAssociator.
RAPIDMINER: Offers only the FP-Growth algorithm, but can be extended through the Weka extension with operators such as W-Apriori and W-FP-Growth.
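
As an illustration of what the association algorithms named in Table 2 compute, the following Python sketch finds frequent itemsets with a basic Apriori-style level-wise search. It is a teaching sketch under stated assumptions, not the implementation used in Weka or RapidMiner: the function name `apriori`, the enrollment transactions and the support threshold are all hypothetical, and rule generation from the frequent itemsets is omitted.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}   # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that reach the minimum support.
        level = {}
        for candidate in current:
            s = support(candidate)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level
                          for s in combinations(c, k - 1))}
    return frequent

# Hypothetical course enrollments, one set of courses per student.
enrollments = [
    {"math", "physics"},
    {"math", "physics", "chem"},
    {"math", "chem"},
    {"physics"},
]
result = apriori(enrollments, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what distinguishes Apriori from brute-force enumeration: a candidate is counted against the data only if all of its smaller subsets were already found to be frequent.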

VI. CONCLUSION AND FUTURE WORKS

End users need simple-to-use tools that efficiently solve their business problems. Typical end users are, for example, marketers, engineers or managers [19]. Many advanced tools for data mining are available either as open-source or commercial software. They cover a wide range of software products, from comfortable problem-independent data mining suites, to business-centered data warehouses with integrated data mining capabilities, to early research prototypes for newly developed methods. Recent tools are able to handle large datasets with single features, time series, and even unstructured data such as texts; however, there is still a lack of powerful and generalized mining tools for multidimensional datasets such as images and videos. Thus, new algorithms have to be developed and then integrated into the available tools. DM tools are a good starting point for experiments that simulate the algorithm chosen for a research project. Overall, the tools help in understanding general concepts and in testing and implementing our research projects.

ACKNOWLEDGMENT

We wish to thank all faculty members for supporting our work by reviewing it for spelling errors and consistency, and for their meaningful comments and suggestions.


REFERENCES

[1] Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P. (1996) From Data Mining to Knowledge Discovery in Databases, AI Magazine Vol. 17 No. 3.

[2] Smyth, P. (2000). Data mining: data analysis on a grand scale?. Statistical Methods in Medical Research, 9(4), 309-327.

[3] Han, J., & Kamber, M. (2006). Data mining: concepts and techniques. Morgan Kaufmann.

[4] Hastie, T., Tibshirani, R., Friedman, J., & Franklin, J. (2005). The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2), 83-85.

[5] Engelbrecht, A. P. (2007). Computational intelligence: an introduction. Wiley.

[6] Data Mining Tools (Copyright © 1999-2013 John Wiley & Sons), retrieved on 21/3/2013 from onlinelibrary.wiley.com

[7] Frank, E., Hall, M., Holmes, G., Kirkby, R., Pfahringer, B., Witten, I. H., & Trigg, L. (2010). Weka-a machine learning workbench for data mining. Data Mining and Knowledge Discovery Handbook, 1269-1277.

[8] Tan, P.-N., Steinbach, M. and Kumar, V. (2006) Introduction To Data Mining, Addison Wesley, ISBN : 0-321-32136-7.

[9] Data Mining (© 2013 Encyclopedia Britannica). Retrieved on 20/6/2013 from http://global.britannica.com/EBchecked/topic/1056150/data-mining

[10] Paidi, A. N. (2012). “Data Mining: Future Trends and Applications”, International Journal of Modern Engineering Research (IJMER), Vol. 2, Issue 6, pp. 4657-4663.

[11] Educational Data Mining (2010). Retrieved on 3/1/2013 from http://www.educationaldatamining.org/.


[12] Scheuer, O., & McLaren, B. M. (2011). Educational data mining. The Encyclopedia of the Sciences of Learning. New York, NY: Springer.

[13] Baker, R. S. J. D., & Yacef, K. (2009). The state of educational data mining in 2009: A review and future visions. Journal of Educational Data Mining, 1(1), 3-17.

[14] Romero, C., Ventura, S. and Garcia, E. (2008). "Data Mining in Course Management Systems: MOODLE Case Study and Tutorial". Computers & Education. 51(1): 368–384.

[15] Free software foundation (2004-2012) retrieved on 21/3/2013 from http://www.fsf.org

[16] Weka Tutorials, retrieved on 21/3/2013 from www.cs.ubbcluj.ro/~gabis/ml/MLSoftware/WekaTutorial.ppt

[17] An introduction to Weka, retrieved on 21/3/2013 from www.se.cuhk.edu.hk/~hcheng/WEKA.ppt

[18] RapidMiner, retrieved on 01/04/2013 from http://rapid-i.com/content/view/398/243/lang,en/

[19] Goebel, M. and Gruenwald, L. (1999) A Survey of Data Mining and Knowledge Discovery Software Tools, ACM SIGKDD Explorations, Vol. 1, Issue 1, pp. 20-33.

BIOGRAPHY


Wan Aezwani Bt Wan Abu Bakar is currently pursuing her PhD in Computer Science at Universiti Malaysia Terengganu (UMT), Terengganu. She received her Master of Science (Computer Science) from Universiti Teknologi Malaysia (UTM) Skudai, Johor, having earlier completed her bachelor's degree in the same field at Universiti Putra Malaysia (UPM) Serdang, Selangor. Her master’s research was formerly on


Masita @ Masila Abdul Jalil received her B.Eng (Hons) in Computer System Engineering from the University of Warwick, UK in 1997. After graduating, she joined CELCOM (M), one of the leading telecommunication providers in Malaysia, as a system engineer. She later pursued her master's study in Engineering Business Management at the same university before joining Universiti Malaysia Terengganu (UMT) as a lecturer in 2001. In 2012, she obtained her PhD in Information Technology from

Md Yazid Mohd Saman is a Professor of Computer Science in the field of Parallel and Distributed Systems. He is currently attached to the Department of Computer Science, Fakulti Sains & Teknologi, Universiti Malaysia Terengganu. He has also taught several undergraduate and postgraduate courses such as Computer Programming, Data Structures, Operating Systems, System Analysis & Design, Discrete Structures, Computer Networks, and Parallel and Distributed Computing. His research interests include Software Development, Distributed & Parallel