ANALYZING STUDENT PERFORMANCE IN DISTANCE LEARNING WITH GENETIC ALGORITHMS AND DECISION TREES


Analyzing student performance in distance learning with genetic algorithms and

decision trees

Dimitris Kalles, [email protected]

Hellenic Open University, Sachtouri 23, 262 22, Patras, Greece

Christos Pierrakeas, [email protected]

Hellenic Open University, Sachtouri 23, 262 22, Patras, Greece

Abbreviated title

Analyzing student performance in distance learning


Abstract

Students who enrol in the undergraduate programme in informatics at the Hellenic Open

University (HOU) demonstrate significant difficulties in advancing beyond the introductory

courses. We have embarked on an effort to analyse their academic performance throughout the

academic year, as measured by the homework assignments, and attempt to derive short rules

that explain and predict success or failure in the final exams. In this paper we review previous

approaches, compare them with genetic-algorithm-based induction of decision trees, and argue that our approach has the potential to develop into an alert tool.

Introduction

The Hellenic Open University (HOU) was founded in 1992. Its primary goal is to offer

university-level education using distance learning methods and to develop the appropriate

material and teaching methods to achieve this goal. The HOU targets a large number of adult

candidate students and offers both undergraduate and postgraduate studies. It should be noted

that, currently, the HOU is the only university in Greece offering distance-learning opportunities to students. The courses of the HOU were initially designed and offered following the distance learning methodology of the British Open University. The first courses were offered by the HOU in 1998 and currently (2004) nearly 20,000 students attend the HOU, up from about 10,000 a year and a half earlier.

The undergraduate programme in informatics is heavily populated, with more than 2,000

enrolled students. About half of them currently attend junior courses on mathematics,

software engineering, programming, databases, operating systems and data structures.

Currently, we face two key educational problems. The first is that these courses are heavy on

mathematics and adult students have not had many opportunities to sharpen their

mathematical skills since high-school graduation (which may have occurred at least 5 years, and typically about 10-15 years, before enrolling at HOU). The second is that the lack of a structured academic experience, notwithstanding the exposure to mathematics, has meant that general learning skills and attitudes may also be lacking.

These problems result in substantial failure rates at the introductory courses. Such failures

usually mean that students do not progress smoothly towards the senior courses, thus skewing

the academic resources of the HOU system towards filtering the input rather than polishing

the output, from a quantitative point of view. Even though this may be perfectly acceptable

from an educational, political and administrative point of view, we must analyse and strive to

understand the mechanism and the reasons of failure. This could significantly enhance the

ability of HOU to fine-tune its tutoring and admission policies without compromising the

academic rigour that is expected of a large university.

In this work we build upon an earlier study at HOU (Kotsiantis et al., 2004), which applied

various machine learning techniques to the problem of predicting failure. We acknowledge its findings,

confirm their validity from a qualitative point of view, and strive to produce consistently

shorter models, with rudimentary setting of parameters for model derivation, based on a

combination of machine learning approaches, namely decision trees and genetic algorithms.

This paper is structured in four subsequent sections. In the next section, we briefly review the

problem of predicting student performance at large and we focus on adult students learning at a distance, with emphasis on how HOU has been addressing this issue. In

the same section we also review our algorithmic approach, that of combining genetic

algorithms and decision trees. We then move on to the next section by summarizing previous

findings in the same operational context and by providing new experimental results that

validate our proposed algorithmic approach, based on the benchmark data. We then move to

conduct more extended but more focused experiments on a range of other (similar) data sets,

to gauge the (application) scalability of our approach across several data sets, spanning a number of years and a number of different study modules. Finally, we conclude by

discussing promising findings, limitations and issues of large-scale operational deployment.


1 Background

1.1 Summarizing the application domain

The aim of establishing Open Universities around the world is to cover the educational and re-educational needs of the workforce by providing a high level of studies. The educational philosophy of the activities developed differs from that of conventional educational systems; it intends to promote “life long education” and to provide adults with “a second educational chance” (Keegan, 1993). The method used is known as “distance learning” education.

The ability to predict a student’s performance could be useful in a number of ways associated with university-level distance learning, especially in preventing student dropout. Students drop

out quite often in institutions providing education using open and distance learning methods

(Cardon & Christensen, 1998) due to a number of factors. Examples of such factors are the

degree of difficulty of the selected courses and subject areas, the false estimation of “free”

time that could be dedicated to studies, health, family and occupation problems, etc.

In open and distance learning, dropout rates are markedly higher than those in conventional universities. Moreover, they vary depending on the education system adopted by the institution providing distance learning, as well as on the selected subject of studies. Recently, the Open Learning journal (2004) published a volume on student retention in open and distance learning, in which similarities and differences across systems are discussed, highlighting issues of institutions, subjects and geographic areas.

Previous work at HOU on the subject of student dropout in the field of informatics showed

that the total percentage of dropouts reaches 28.4%, compared to 71.6% who continue (Xenos

et al., 2002). It is important to mention that the great majority of students dropped out after

failing to deliver the first one or two written assignments.

With regard to the reasons provided by the students for dropping out, more than half of them

(who dropped out) claimed that they were not able to correctly estimate the time that they


would have to devote to their professional activity and as a result, the time dedicated to their

education decreased unexpectedly. The second reason offered by approximately one out of

four students was their feeling that their knowledge was not sufficient for university-level

studies (other reasons, such as family or health issues, were also quoted). It is, thus, reasonable

to assert that predicting a student’s performance can enable a tutor to take early remedial

measures by providing more focused coaching, especially in issues such as priority setting

and time management.

1.2 Summarizing the AI technologies used in the application domain

Decision trees can be considered as rule representations that, besides being accurate, can

produce comprehensible output, which can also be evaluated from a qualitative point of view

(Mitchell, 1997). Decision trees are usually produced by analyzing the structure of examples

(training instances), which are usually given in a tabular form.

For example, a decision tree like the one shown in Table 1 conveys the information that a mediocre grade at an assignment, turned in at about the middle (in the time-line) of the module, is an indicator of possible failure at the exams, whereas a non-mediocre grade refers the alert to the last assignment. An excerpt of a training set that could have produced the above tree is shown in Table 2.

Table 1: A sample decision tree.

Assgn2 in [3..6]?
  YES -> FAIL
  NO  -> Assgn4 < 3?
          YES -> FAIL
          NO  -> PASS

Table 2: A sample decision tree training set.

Assgn1   Assgn2   Assgn3   Assgn4   Exam
…        …        …        …        …
4.6      7.1      3.8      9.1      PASS
9.1      5.1      4.6      3.8      FAIL
7.6      7.1      5.8      6.1      PASS
…        …        …        …        …
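To make the tree concrete, the following minimal sketch (in Python, our illustration language; it is not code from the study) applies the Table 1 tree to records like those in Table 2.

def classify(assgn2, assgn4):
    # Mirror Table 1: a mediocre mid-course grade predicts failure;
    # otherwise the alert is referred to the last assignment.
    if 3 <= assgn2 <= 6:
        return "FAIL"
    return "FAIL" if assgn4 < 3 else "PASS"

# The three complete rows of the Table 2 excerpt.
rows = [(4.6, 7.1, 3.8, 9.1, "PASS"),
        (9.1, 5.1, 4.6, 3.8, "FAIL"),
        (7.6, 7.1, 5.8, 6.1, "PASS")]
for a1, a2, a3, a4, exam in rows:
    print(classify(a2, a4), "expected:", exam)

All three rows are classified in agreement with their Exam column, as one would expect of an excerpt consistent with the tree.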

Genetic algorithms (Booker et al., 1990; Mitchell, 1997) can directly evolve binary decision

trees that explain and/or predict the success/failure patterns of junior undergraduate students.

To do so, we evolve populations of tree representations according to a fitness function that

allows for fine-tuning decision tree size vs. accuracy on the training set. At each time-point (in genetic algorithm parlance: a generation) a certain number of decision trees (a population) is

generated and sorted according to some criterion (fitness). Based on that ordering, certain

transformations (genetic operators) are performed on some members of the population to

produce a new population that will be subject to the same cycle until a predefined number of

generations is reached (or no further improvement is detected).
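The cycle just described can be rendered as a short generic loop. The sketch below is ours (elitist survival of the top half and one cross-over per child are simplifying assumptions), not the actual GATREE implementation.

import random

def evolve(population, fitness, crossover, mutate,
           generations=100, mutation_prob=0.01):
    for _ in range(generations):
        # Sort the current generation by fitness, best members first.
        population.sort(key=fitness, reverse=True)
        survivors = population[: len(population) // 2]
        children = []
        while len(survivors) + len(children) < len(population):
            a, b = random.sample(survivors, 2)
            child = crossover(a, b)              # swap subtrees between parents
            if random.random() < mutation_prob:
                child = mutate(child)            # perturb a node or a leaf
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)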

These concepts form the basis of the GATREE System (Papagelis & Kalles, 2001), which has

been built on top of the GAlib library (Wall, 1996).

The genetic operators on the tree representations are relatively straightforward. A mutation

may modify the test attribute at a node or the class label at a leaf. A cross-over may substitute

whole parts of a decision tree by parts of another decision tree. Examples of both operators

are shown in Table 3.

Table 3: GATREE genetic operators (from Papagelis & Kalles (2001)).

[Figure not recoverable from the source. The original showed, for mutation, a mutated leaf receiving a new class and a mutated node receiving a new test value; for cross-over, a chosen node in each of two parent trees having their subtrees swapped.]
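Since the original figures do not survive in this transcript, the toy sketch below shows what the two operators do to tree structures; the representation is our own assumption, not GATREE's internals.

import copy
import random

ATTRIBUTES = ["Assgn1", "Assgn2", "Assgn3", "Assgn4"]
CLASSES = ["PASS", "FAIL"]

class Node:
    def __init__(self, attr=None, threshold=None, yes=None, no=None, label=None):
        self.attr, self.threshold = attr, threshold   # internal-node test
        self.yes, self.no = yes, no                   # YES / NO branches
        self.label = label                            # set only at leaves

def random_node(tree):
    # Collect every node by walking the tree, then pick one uniformly.
    nodes, stack = [], [tree]
    while stack:
        n = stack.pop()
        nodes.append(n)
        if n.label is None:
            stack += [n.yes, n.no]
    return random.choice(nodes)

def mutate(tree):
    t = copy.deepcopy(tree)
    n = random_node(t)
    if n.label is not None:
        n.label = random.choice(CLASSES)          # mutated leaf: new class
    else:
        n.attr = random.choice(ATTRIBUTES)        # mutated node: new test
        n.threshold = round(random.uniform(0, 10), 1)
    return t

def crossover(a, b):
    # Replace a randomly chosen subtree of a copy of `a`
    # with a randomly chosen subtree copied from `b`.
    child = copy.deepcopy(a)
    donor = copy.deepcopy(random_node(b))
    random_node(child).__dict__.update(donor.__dict__)
    return child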

The GATREE fitness function is

fitness(Tree_i) = CorrectClassified_i² × x / (size_i² + x)

The first part of the product above is (the square of) the actual number of training instances that a decision tree (a member of a population) classifies correctly. The second part of the product (the size

factor) includes a factor x which has to be set to an arbitrarily big number. Thus, when the size

of the tree is small, the size factor is near one, while it decreases when the tree grows big.

This way, the payoff is greater for smaller trees. The size factor can be altered to match


individual needs, but this must be exercised with care since we never know whether a target

concept can be represented with a decision tree of a specific size.
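Under this reading of the formula, the fitness computation is a one-liner; the value chosen for x below is purely illustrative.

def gatree_fitness(correctly_classified, tree_size, x=1000.0):
    # Size factor: close to 1 while tree_size**2 is small relative to x,
    # decaying as the tree grows, so smaller trees earn a greater payoff.
    size_factor = x / (tree_size ** 2 + x)
    return (correctly_classified ** 2) * size_factor

print(gatree_fitness(280, 7))    # small tree: payoff barely reduced
print(gatree_fitness(280, 99))   # big tree: payoff heavily penalized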

2 Experimental comparison in depth

Before proceeding we elaborate on some background that sets the context for the experiments

and the data¹.

The basic educational unit at HOU is the module, which runs for about ten months and is the

equivalent of about 3-4 conventional university semester courses. A student may register with

up to three modules per year. For each module, a student is expected to attend five plenary

class meetings throughout the academic year (a class contains about thirty students). Each

meeting is about four hours long and may be structured around tutor presentations, group-work and review of assigned homework. Furthermore, each student must turn in some written

assignments (typically four or six), which contribute towards the final grade, before sitting a

written exam. Assignments are graded on a 0-10 scale (with an average of 5 securing a pass) and

are prepared on the assumption that a student will normally devote 1-2 hours per day to

studying and about 10-15 hours to complete an assignment to a satisfactory level.

The “Informatics” programme is composed of 12 modules and leads to a Bachelor Degree.

The vast majority (up to 98%) of registered students in the “Informatics” programme, upon

being admitted at HOU, selects the module “Introduction to Informatics” (INF10), which

allows us to use INF10 as a test-bed for experimentation.

Key demographic characteristics of students (such as age, sex, residence etc), their marks in

written assignments and their presence or absence in plenary meetings constituted the initial

training set for a supervised learning algorithm, to predict whether a student would eventually

pass or fail a specific module.

¹ The data sets are available free of charge, for academic research only.


2.1 Summarizing the previous findings

Initial experimentation applied several machine learning techniques to predict student

performance with reference to the final examination. These techniques were decision trees,

neural networks, Naïve Bayes, instance-based learning (see Mitchell (1997) for an introduction),

logistic regression (Perlich et al., 2004) and support vector machines (Burges, 1998).

The experiments took place in two distinct phases. During the first phase (training phase) the

algorithms were trained using the students’ data collected from the academic year 2000-1.

The training phase was divided into 9 consecutive steps. The 1st step included the demographic

data and the resulting class (pass or fail); the 2nd step was enhanced with data from the first

plenary meeting; the 3rd step was enhanced with data from the first written assignment; the 4th

step with data from the second plenary meeting, until the 9th step that included all attributes

(demographic data, data from all four plenary meetings and the marks of all four written

assignments). Assignment (numeric) data were manually quantized.
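The nine cumulative attribute sets can be sketched as follows; the attribute names are illustrative placeholders, not the exact field names of the HOU data.

DEMOGRAPHICS = ["age", "sex", "residence"]             # step 1
CHRONOLOGICAL = ["Meet1", "Assgn1", "Meet2", "Assgn2",  # added one per step
                 "Meet3", "Assgn3", "Meet4", "Assgn4"]

# steps[0] holds demographics only; steps[8] holds all attributes.
steps = [DEMOGRAPHICS + CHRONOLOGICAL[:i] for i in range(9)]
for k, attrs in enumerate(steps, start=1):
    print(f"step {k}: {attrs} -> class (pass/fail)")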

Subsequently, ten groups of data for the new academic year (2001-2) were collected from 10

tutors and the HOU registry. Each one of these 10 groups was used to measure the prediction

accuracy within these groups (testing phase). The testing phase also took place in 9 steps.

During the 1st step, the demographic data of the new academic year were used to predict the

class (pass or fail) of each student. During the 2nd step these demographic data along with the

data from the first plenary meeting were used in order to predict the class of each student. The

remaining steps were completed in the same way as described above for the training phase.

The overall accuracy was computed in a rather unconventional way. First, a model was

created based on the academic year 2000-1 data (from all tutors). Then, this model was tested

on 10 data sets of academic year 2001-2. The results were averaged for all combinations of

the nine (9) steps and the ten (10) test data sets. This was, then, repeated for each classifier

(type). The overall accuracy of the algorithms used is shown in Table 4.

All accuracies (reported in this paper) are measured in %.


Table 4: Accuracy results of initial experimentation.

Classifier            Overall Accuracy
C4.5                  69.99
BP                    72.26
Naïve Bayes           72.48
3-NN                  66.93
Logistic regression   72.32
SMO                   72.17

Indeed, it was shown that learning algorithms could enable tutors to pinpoint students at risk

of failure with satisfactory accuracy; however, demographics seem to play a rather limited

role with respect to the rest of the available information (actual academic performance).

Furthermore, the analysis of the experiments and the comparison of the six algorithms demonstrated sufficient evidence that the Naïve Bayes algorithm is well suited to predicting students’ performance, also given its ease of implementation.

2.2 Setting out the reference base

In our work, the latest version of WEKA (Witten & Frank, 2000) was used to bring the

experimental results up-to-date, in the sense that more conventional reporting techniques were

used. For the sake of simplicity and completeness we conducted a 10-fold cross-validation

experiment with the classifiers shown in Table 5, and reported the average accuracy as well as

the accuracy on the training set. We also report the model size based on the full training data

set (where applicable, i.e. only for symbolic classifiers). Leaves were counted as nodes and

the trees were binary.

All decision tree model sizes (reported in this paper) are measured in nodes.
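For readers who wish to approximate this protocol outside WEKA, a minimal sketch follows; scikit-learn is our stand-in here, whereas the experiments reported below used WEKA itself.

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def evaluate(X, y):
    clf = DecisionTreeClassifier(random_state=0)
    # Average accuracy over a 10-fold cross-validation ...
    cv_acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
    # ... plus the (re-classification) accuracy on the full training set.
    train_acc = clf.fit(X, y).score(X, y)
    return 100 * cv_acc, 100 * train_acc   # reported in %, as in the tables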


Table 5: Summary results for conventional classifiers.

Classifier            Average Accuracy   Accuracy on Training Set   Model Size
J48                   65.41              82.27                      57
J48 (with REP)        69.48              75.00                      33
J48 (with bagging)    69.48              86.92                      N/A
MLP                   66.86              93.02                      N/A
Logistic regression   67.44              75.58                      N/A
Naïve Bayes           63.08              65.12                      N/A
3-NN                  63.66              75.58                      N/A

J48 is a WEKA implementation of the C4.5 algorithm (Quinlan, 1993). REP stands for

reduced-error-pruning, a technique for pruning decision trees (Quinlan, 1993). Bagging is a

technique used to reduce variance in classifiers (Breiman, 1996). MLP is a standard neural

network implementation (Rumelhart et al., 1994). 3-NN stands for 3-Nearest-Neighbour. N/A stands for “not applicable”; for example, in MLP, the model size is specified by the model builder and is not something that evolves during model building (of course, the model size can affect other quality parameters, such as speed and accuracy, but we have selected a default WEKA-supplied value because this is beyond the scope of our experimentation).

The previous findings were useful indeed². Based on our tutoring experience, success in the initial written assignments is a strong indicator of successful follow-up. This is also corroborated by the fact that such success motivates the student towards a more whole-hearted approach to studying, based on positive reinforcement. The usual risk here is that there might be a temptation (on the tutor’s part) to artificially increase early grades; this is a risky approach that may hide weaknesses and could convey the wrong message to the student.

² Predictably, the results in Table 4 and in Table 5 are not comparable.

To conduct experimentation with reference to the above (baseline) results, we have used the

GATREE system (Papagelis & Kalles, 2001). Its basic version allows for experimenting with

up to 150 generations and up to 150 members per generation. The results are shown in Table

6. (Average accuracy is reported based on 10-fold cross-validation.)

Table 6: Summary results for basic GATREE decision trees.

Classifier (gen/pop)   Average Accuracy   Accuracy on Training Set   Model Size
250/250                77.06              87.50                      33/37
250/100                80.00              86.63                      39
100/250                78.53              86.05                      15/17
100/100                76.76              84.59                      19

To ensure the validity of the experimentation we used the same data sets as in the original experimentation, including demographic data and quantized data. For all experiments we used

the default settings for the genetic algorithm operations (cross-over probability at 0.99,

mutation probability at 0.01, error rate at 0.95 and replacement rate at 0.25).

Where an X/Y number is reported as model size the interpretation is as follows: the genetic

algorithm produced (at the end) a best tree of Y nodes, which was then post-processed by

hand to eliminate redundant nodes (which sneak into the structure due to its pseudo-random

evolution and our decision to not yet constrain the genetic operators) resulting in a tree of X

nodes.
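This post-processing can be approximated mechanically. The sketch below is our guess at the simplest case of the by-hand elimination: it collapses, bottom-up, any test whose two branches reduce to the same class (other kinds of redundancy, such as repeated tests along a path, would need extra rules).

def prune_redundant(node):
    # A leaf is {"label": ...}; an internal node is
    # {"test": ..., "yes": subtree, "no": subtree}.
    if "label" in node:
        return node
    node["yes"] = prune_redundant(node["yes"])
    node["no"] = prune_redundant(node["no"])
    if ("label" in node["yes"] and "label" in node["no"]
            and node["yes"]["label"] == node["no"]["label"]):
        return {"label": node["yes"]["label"]}   # the test decides nothing
    return node

tree = {"test": "Assgn2 in [3..6]",
        "yes": {"label": "FAIL"},
        "no": {"test": "Assgn4 < 3",
               "yes": {"label": "PASS"},
               "no": {"label": "PASS"}}}
print(prune_redundant(tree))   # the redundant inner test collapses to a PASS leaf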


The initial findings suggest that, when compared to conventional decision-tree classifiers, our

approach produces significantly more accurate trees. The tree size is comparable and the

running time is longer, but due to its overall short duration, the exercise can be considered a

success.

2.3 A quantitative comparison: model accuracy, model size and model derivation cost

The real issue is whether we can use our approach to achieve flexible trade-offs between

speed, accuracy and simplicity. The experiments presented in this section serve to answer this

question affirmatively and unequivocally.

We first investigate whether increasing the number of generations and the population size has

an effect on the quality of the final solution. We can hope that as we pay more in time, we

might be able to evolve more accurate or more compact trees. We used the unlimited-use

version of the GATREE system for these experiments. The results are shown in Table 7.

Table 7: Summary results for extended-run GATREE decision trees.

Classifier (gen/pop)   Average Accuracy   Accuracy on Training Set   Model Size
100/1000               75.59              86.92                      41
250/1000               79.12              88.95                      43/49
1000/100               74.71              88.66                      27
1000/250               79.12              88.95                      25
1000/1000              80.00              92.15                      51

A casual comparison suggests that the extra effort may not be warranted, since the

improvements do not seem to be related to either the number of generations or the size of the population. However, it does seem that the relatively high reclassification accuracy in the 1000/1000 case (as measured on the training set) is a strong symptom of over-fitting, as

witnessed by the tree size.

While it would probably be a substantial improvement in itself to claim here that we have a

better technique for predicting drop-out, the underlying realities of the experience of the tutors,

and the difficulty of setting up a mechanism to collect and process data across all courses of

the university (meaning the deployment of tens of tests per course), suggest that we take a

closer qualitative look at the findings.

There are two directions that we should look at. The first direction has to do with the

comprehensibility of the induced decision trees themselves, as compared to the initial

findings: we would like to be able to confirm or refute the finding that demographics play a

limited role in the drop-out rate and that the most important indicators have to do with

performance at the initial written assignments. The second direction has to do with

GATREE’s ability to trade-off size with accuracy in the evolution mechanism: we would like

to see whether this ability can facilitate our plans to eventually use such findings on an

operational scale (covering tens of thousands of students and hundreds of tutors, the latter

with diverse IT literacy levels, every academic year).

Note that a trade-off between size and accuracy is explicitly stipulated in the fitness function

of GATREE (Papagelis & Kalles, 2001). That the trade-off is reasonable, as set in the default

parameters, can be argued by referring to the experimental results as reported in Table 5,

Table 6 and Table 7. Note that the gap between the re-classification error and the cross-

validation error is significantly smaller in GATREE induced trees (slightly more than 10%, at

worst). Equally important, that gap does not seem to particularly depend on the model size. In

WEKA induced classifiers, one must either explicitly use a post-processing technique (J48

with REP) or settle for a mediocre performance (naïve Bayes) to see a small gap.

To further gauge trade-offs we used GATREE in two extreme operating modes. In size-biased

and accuracy-biased modes, we respectively shift the fitness function towards either very

small or very accurate trees. Since we have observed that the gap between re-classification


error and cross-validation error is (relatively) small for GATREE, we only report re-classification error results in Table 8 and in Table 9. In Table 10, however, we also report the cross-validation error for the largest-scale experiment.

Table 8: Summary results for extended-run size-biased GATREE decision trees.

Classifier (gen/pop)   Accuracy on Training Set   Model Size
100/100                79.36                      3
100/250                83.14                      5
100/1000               84.30                      7
250/100                83.14                      5
250/250                83.14                      5
250/1000               84.30                      7
1000/100               83.14                      5
1000/250               83.14                      5
1000/1000              84.30                      7

Table 9: Summary results for extended-run accuracy-biased GATREE decision trees.

Classifier (gen/pop)   Accuracy on Training Set   Model Size
100/100                84.30                      17
100/250                85.47                      29/33
100/1000               87.79                      127/133
250/100                86.05                      33
250/250                88.95                      107/123
250/1000               90.41                      135/165
1000/100               89.24                      51
1000/250               91.57                      67/73
1000/1000              93.31                      93/115

Table 10: Cross-validation results for gen/pop:1000/1000 GATREE decision trees.

Classifier        Average Accuracy   Accuracy on Training Set   Model Size
Size-biased       78.82              84.30                      7
Accuracy-biased   76.82              93.31                      93/115

A few comments are due, taking into account all results and, in particular, Table 5, Table 6, and Table 10. GATREE can indeed produce small and accurate decision trees quite early, and still

be able to improve over them with time. This is an overall positive evaluation of the approach.

However, with reference to Table 10, one can most clearly observe the problem of over-fitting, as manifested by the

discrepancy of re-classification error and the cross-validation error. This gap can clearly be

associated with the model size of the two extreme approaches.

The final round of experiments in trade-offs utilizes a particular feature of GATREE, namely

the (monotonic) adjustment of bias towards either accuracy-biased or size-biased decision

trees. In this set of experiments we start at one end of the spectrum (for example, accuracy-

biased) and throughout the evolution we shift our bias towards the other spectrum (in that

case, size-biased).
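One way to realize such a shifting bias, assuming the fitness form given in section 1.2, is to let the size-factor constant x decay across generations; the linear schedule and both endpoint values below are our assumptions, not GATREE's documented parameterization.

def scheduled_fitness(correct, size, generation, total_generations,
                      x_start=1e6, x_end=100.0):
    # t runs from 0 (first generation) to 1 (last generation).
    t = generation / total_generations
    # Large x barely penalizes size (accuracy-biased); small x penalizes
    # it heavily (size-biased). This schedule is the "as" run; swapping
    # x_start and x_end gives the "sa" run.
    x = (1 - t) * x_start + t * x_end
    return (correct ** 2) * x / (size ** 2 + x)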

There is an obvious intuitive appeal to both approaches. In principle, starting with small trees

could allow a small structure to be optimized for accuracy. Also, in principle, starting with

accuracy-biased trees could mean that evolution could lead to the trimming (pruning) of

superfluous tree parts.

We report the results in Table 11 (note that sa refers to size being preferred at the start of the

evolution and accuracy being preferred towards the end, and as should be interpreted

likewise). We simply note that the results confirm earlier findings regarding the gap between re-classification error and the cross-validation error. We also note that the unimpressive gap between the tree sizes of the static-bias and dynamic-bias approaches may suggest that an uneducated uniform policy on bias modification is not warranted.


Table 11: Cross-validation results for dynamically biased GATREE decision trees.

Classifier (gen/pop)   Average Accuracy   Accuracy on Training Set   Model Size
100/100 sa             78.53              83.43                      45/57
100/100 as             77.65              85.47                      17/19
250/250 sa             80.00              88.66                      113/135
250/250 as             79.71              88.37                      47/53
1000/1000 sa           78.53              95.06                      129/141
1000/1000 as           77.06              92.73                      125/141

2.4 A qualitative comparison: model consistency

In this section we document how the GATREE-induced trees confirm³ earlier findings about

the (lack of) importance of student demographics, with reference to the findings summarized

in section 2.1 above.

Based also on the description of the HOU context, as set out earlier in this paper, in Table 12

we report the generated trees for the size-biased variants. We use the notation AssgnX to refer

to the grade obtained at the Xth assignment and the notation MeetY to refer to the presence or

absence at the Yth plenary class meeting. In the trees as rendered here, the branches of each test are labelled YES and NO explicitly.

We now move to explain the findings, based on our tutoring experience. For a start, we note that all results make sense. This is not counter-intuitive; indeed, we know that many hypotheses may be compatible with a target concept and all of them may be of value (after all, this has given rise to ensemble classifiers).

³ The reader should interpret this extra step as an effort to replicate previous results using a new technique. The previous sections dealt with model accuracies; we now discuss the actual models.

Let us first consider the small-scale experiment (top row in Table 12). By referring to Table 8,

we note that its classifying ability is clearly the least favourable of all alternatives and can be

compared to a single decision stump. In essence, our interpretation is that if a student achieves at least 6 in a sophisticated assignment (as the 3rd one usually is), a pass can normally be expected.

The medium-scale experiment is more interesting (middle row in Table 12); note that it also

suggests that a relative stability exists in the induced model despite the fluctuations in

generations and populations. Its interpretation is as follows: a mediocre grade at an

assignment, turned in at about the middle (in the time-line) of the course, is an indicator of

possible failure at the exams, whereas a non-mediocre grade refers the alert to the last

assignment. Note that this makes sense: a student either demonstrates a sustained performance

of some value (thus achieving at least 3 at the last assignment), or the student underperforms,

in which case a failure is imminent.

The large-scale experiment (bottom row in Table 12) is also interesting. Note that it contains

tests on the key attributes tested above in the smaller-scale experiments but it also interleaves

a test on the participation in a plenary class. We believe that this may be disguised over-

fitting: on one hand it generates a slightly better re-classification rate (see Table 8), but on the

other hand it tests on a reasonable attribute: the third plenary meeting (and other meetings

onward) usually commands attendance from the committed students, whereas the rest usually

consider turning in assignments without attending the plenary classes. The tree reported here,

however, suggested that there may be a class of students who are confident, skip the meeting

and eventually pass (so they are justifiably confident).

Note that, quite predictably, demographics are absent throughout. This is a key confirmation.


Table 12: Model results for size-biased GATREE decision trees.

(gen/pop) 100/100:

Assgn3 > 6?
  YES -> PASS
  NO  -> FAIL

(gen/pop) 250/100, 100/250, 250/250:

Assgn2 in [3..6]?
  YES -> FAIL
  NO  -> Assgn4 < 3?
          YES -> FAIL
          NO  -> PASS

(gen/pop) 1000/1000:

Assgn4 < 3?
  YES -> FAIL
  NO  -> Meet3 = NO?
          YES -> Assgn2 in [3..6]?
                  YES -> FAIL
                  NO  -> PASS
          NO  -> PASS

3 Experimental comparison in breadth

We now move to tests on new data sets. To do so we use eight data sets (one of which we

have been using all along until now).

These data-sets refer to students taking three distinct courses, over three consecutive

academic years. All three courses are first-year courses, where failures are very common (reaching 50%). Courses are named according to an HOU naming convention but for the sake

of conciseness and clarity we will use the following convention:

X_Y refers to a course having a code-name X and running for the academic year Y..Y+1

For example, 10_2000 refers to the introductory course on Informatics (code INF10, as earlier

described) offered at the academic year 2000-1. This is the data set that we have been

experimenting with throughout section 2.

We drop the pre-processing step of quantizing the numeric attributes and experiment with raw

numbers instead (we also believe that manually selected quantization policies will not scale

well into the university-level prediction system that we aim at). Also, all data sets have been

stripped of all demographic data and all meeting data. Furthermore, GATREE results are

based on size-only bias settings.

We first report the results for data-set 10_2000, by comparing them directly with earlier

findings to also gauge the results of quantization. (Earlier findings are shown in small font, in

parentheses.)


Table 13: Summary results for data-set 10_2000.⁴

Classifier            Average Accuracy   Accuracy on Training Set   Model Size
J48                   (65.41) 81.73      (82.27) 89.96              (57) 35
J48 (with REP)        (69.48) 83.73      (75.00) 82.73              (33) 3
J48 (with bagging)    (69.48) 82.93      (86.92) 90.76              N/A
MLP                   (66.86) 82.13      (93.02) 84.14              N/A
Logistic regression   (67.44) 83.73      (75.58) 83.53              N/A
Naïve Bayes           (63.08) 84.14      (65.12) 84.54              N/A
3-NN                  (63.66) 81.93      (75.58) 88.15              N/A
GATREE 250/250        (…..) 83.27        (83.14) 84.34              (5) 3
GATREE 1000/1000      (78.82) 83.27      (84.30) 84.34              (7) 3

The most crucial observation is that GATREE induced trees provide good accuracy

estimation, even without the cross-validation testing phase. This is, generally, a trait that seems to be shared by all classifiers; note, however, that GATREE has been generating close-

match estimates even with the quantized formats. This is an indication that GATREE can

produce quality results even in the presence of noise, something that confirms the findings

reported by Papagelis & Kalles (2001) in other data-sets (benchmark and synthetic ones).

We now move to report our results for the rest of the data sets (and omit the 1000/1000

GATREE test).

⁴ Data in parentheses were drawn from Table 5 and Table 10.


Table 14: Summary results for data-set 10_2001.

Classifier            Average Accuracy   Accuracy on Training Set   Model Size
J48                   79.13              87.82                      55
J48 (with REP)        79.98              84.80                      25
J48 (with bagging)    80.94              88.54                      N/A
MLP                   80.22              82.63                      N/A
Logistic regression   82.63              82.39                      N/A
Naïve Bayes           80.94              80.70                      N/A
3-NN                  81.66              89.02                      N/A
GATREE 250/250        78.78              80.83                      5

Having confirmed that we can produce good small trees, we now summarize the results for

the current course (code 10) in Table 15. We show the derived models and defer comparisons

to Table 18.


Table 15: GATREE induced trees for data-sets 10_2001 and 10_2002.

10_2001:

Assgn3 <= 5?
  YES -> Assgn2 <= 7?
          YES -> FAIL
          NO  -> PASS
  NO  -> PASS

10_2002:

[Figure only partially recoverable: a tree testing Assgn4 <= 8, Assgn3 <= 4 and Assgn2 <= 6, with PASS and FAIL leaves.]

In Table 16, we summarize the results for course code 11. Note that, due to the genetic evolution, some redundancy may exist (testing on the second assignment, in the middle row).


Table 16: GATREE induced trees for data-sets 11_2000, 11_2001 and 11_2002.

11_2000:

Assgn3 <= 5?
  YES -> Assgn2 <= 7?
          YES -> FAIL
          NO  -> PASS
  NO  -> PASS

11_2001 and 11_2002:

[Figures only partially recoverable: one tree tests Assgn3 <= 5, Assgn2 < 3, Assgn4 < 4 and Assgn1 <= 6; the other tests the second assignment twice (Assgn2 <= 3 and Assgn2 <= 5); all leaves are PASS or FAIL.]


In Table 17, we summarize the results for course code 12. Note that, due to the genetic evolution, some unexpected findings may exist (testing on whether the sixth assignment has been submitted at all, in the top row).

Table 17: GATREE induced trees for data-sets 12_2001 and 12_2002.

12_2001:

Assgn4 < 5?
  YES -> FAIL
  NO  -> Assgn2 < 7?
          YES -> Assgn6 <= 0?
                  YES -> FAIL
                  NO  -> PASS
          NO  -> PASS

12_2002:

Assgn6 < 2?
  YES -> FAIL
  NO  -> Assgn1 < 3?
          YES -> FAIL
          NO  -> PASS

In Table 18, for GATREE models and J48-REP models, we summarize the results for accuracy and size. Accuracies are based on 10-fold cross-validation. Model size is based on the model reported by induction on the full training set.

Table 18: Summary results, GATREE vs. J48-REP.

Data-set   GATREE Accuracy   J48-REP Accuracy   GATREE Size   J48-REP Size
10_2000    83.27             83.73              3             3
10_2001    78.78             79.98              5             25
10_2002    77.53             79.58              7             17
11_2000    92.50             92.02              5             3
11_2001    77.95             79.19              7             7
11_2002    74.90             70.65              7             9
12_2001    67.04             70.65              7             7
12_2002    74.31             74.70              5             9

The above findings are aligned with our reporting in the previous paragraphs. GATREE

models are very short and, often, more accurate than their conventionally induced

counterparts (even when the latter employ pruning). There do exist a couple of data sets

where the accuracy gap exceeds 2% (in favour of J48-REP) but we believe the consistent

conciseness of the GATREE model coupled with its straightforward derivation and the ability

to pay in time in order to “buy” more models (by simply increasing the number of generations

and the population size) to be key quality differentiators when considering the use of such a

model at an operational scale, either to explain or to predict failure. We elaborate on those

directions below.

4 Discussion

We believe that a small, accurate model is a very important tool in the hands of a tutor, to

assist in the task of continuously monitoring a student’s performance with reference to the


possibility of passing the final exam. A small model is easier to communicate among peers,

easier to compare with competing ones and can have wider applicability.

There are, however, some key issues that have to be reflected in a future research agenda, and

nearly all of them are of an applied-research nature.

There is, first of all, the question of whether one should measure the performance of a model

based on subsequent years’ statistics and not on cross-validation testing (on the same data

set). This is not simply an exercise in generalization capability since some sort of data

engineering will be inevitable (for example, if a year’s performance model is based on four

assignments, yet the target year is based on six assignments).

A further refinement refers to our approach being used as an “early warning system”, which

suggests that we must bias the models to prefer tests on the “earlier” rather than the “later” assignments. This would alert tutors well in advance of the final exam; however, this would be important only if tutoring is mostly based on monitoring performance as opposed to also developing a personal relationship. As HOU expects tutors to develop a trust relationship with

students, alert methods that are simply monitoring-based may be of limited use. It is thus

extremely important that the models be easy to derive so that tutors get to familiarize

themselves with the technology.

A further direction refers to the applicability of the modeling approach to advanced courses,

i.e. dealing with students approaching the end of their studies. Since our student population is

significantly skewed towards junior courses, and since one can expect that drop-out is

minimized after initial successes, we speculate that predicting pass/fail in advanced courses

will be more difficult and less important from an educational point of view.

Another research direction refers to the approach being tested across various disciplines or in

inter-disciplinary programs. In the opening of our paper we emphasized that (the lack of)

learning and math skills may be key obstacles to success. In the humanities, lack of math


skills is not an issue (at least, this is conventional wisdom). So, the applicability of our

approach remains to be seen.

However, perhaps the most difficult and challenging research question we need to ask is

whether our approach measures the performance of students or of the system in general. For

this to be answered we have to attempt embedding it in the everyday practice of the

university, from both an academic and an administrative point of view. The underlying

problem is rather simple: one can choose between micro-viewing our models as a comprehensive alert system for students and macro-viewing them as an intuitive alert

system for the organization and running of the courses. The latter could suggest, for example,

a periodic review of our policy on assignments, whereas the former could mean more one-

to-one tutoring. HOU tutors do develop personal relationships with their students to address

the communication and support issues that arise in distance learning; we thus believe that, if widely adopted, our approach would slowly drift away from the student alert functionality

towards the system alert functionality.

5 References

Booker, L.B., Goldberg, D.E., & Holland, J.H. (1990). Classifier systems and genetic

algorithms. In Machine learning: paradigms and methods (pp. 235-282). Cambridge, MA:

MIT Press.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24:2, 123-140.

Burges, C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining

and Knowledge Discovery, 2, 1-47.

Cardon, P., & Christensen, K. (1998). Technology-based programs and drop-out prevention.

The Journal of Technology Studies, http://scholar.lib.vt.edu/ejournals/JOTS/.

Keegan, D. (1993). Theoretical Principles of Distance Education. London: Routledge.


Kotsiantis, S., Pierrakeas, C., & Pintelas, P. (2004). Predicting students’ performance in

distance learning using Machine Learning techniques. Applied Artificial Intelligence, 18:5,

411-426.

Mitchell, T. (1997). Machine Learning. McGraw Hill.

Open Learning (2004). Special issue on “Student retention in open and distance learning”.

19:1, http://www.tandf.co.uk/journals/titles/02680513.asp.

Papagelis, A., & Kalles, D. (2001). Breeding decision trees using evolutionary techniques. In

Proceedings of the International Conference on Machine Learning, Williamstown,

Massachusetts.

Perlich, C., Provost, F. & Simonoff, J.S. (2004). Tree induction vs. logistic regression: a

learning-curve analysis. Journal of Machine Learning Research, 4:2, 211-255.

Quinlan, J.R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan

Kaufmann.

Rumelhart, D., Widrow, B., & Lehr, M. (1994). The basic ideas in neural networks.

Communications of the ACM, 37:3, 87-92.

Wall, M. (1996). GAlib: A C++ Library of Genetic Algorithm Components. M.I.T.

http://lancet.mit.edu/ga/.

Witten, I., & Frank, E. (2000). Data mining: practical machine learning tools and techniques

with Java implementations. San Mateo, CA: Morgan Kaufmann.

Xenos, M., Pierrakeas, Ch. & Pintelas, P. (2002). A survey on student dropout rates and

dropout causes concerning the students in the Course of Informatics of the Hellenic Open

University. Computers & Education, 39, 361-377.
