Analyzing student performance in distance learning with genetic algorithms and
decision trees
Dimitris Kalles, [email protected]
Hellenic Open University, Sachtouri 23, 262 22, Patras, Greece
Christos Pierrakeas, [email protected]
Hellenic Open University, Sachtouri 23, 262 22, Patras, Greece
Abbreviated title
Analyzing student performance in distance learning
Abstract
Students who enrol in the undergraduate program on informatics at the Hellenic Open
University (HOU) demonstrate significant difficulties in advancing beyond the introductory
courses. We have embarked on an effort to analyse their academic performance throughout the
academic year, as measured by the homework assignments, and to derive short rules
that explain and predict success or failure in the final exams. In this paper we review previous
approaches, compare them with genetic-algorithm-based induction of decision trees and argue
why our approach has the potential to develop into an alert tool.
Introduction
The Hellenic Open University (HOU) was founded in 1992. Its primary goal is to offer
university-level education using distance learning methods and to develop the appropriate
material and teaching methods to achieve this goal. The HOU targets a large number of adult
candidate students and offers both undergraduate and postgraduate studies. It should be noted
that, currently, the HOU is the only University in Greece offering distance-learning
opportunities to students. The courses of the HOU were initially designed and offered
following the distance learning methodology of the British Open University. The first courses
were offered by the HOU in 1998 and currently (2004) nearly 20,000 students attend the
HOU, up from about 10,000 a year and a half ago.
The undergraduate programme in informatics is heavily populated, with more than 2,000
enrolled students. About half of them currently attend junior courses on mathematics,
software engineering, programming, databases, operating systems and data structures.
Currently, we face two key educational problems. The first is that these courses are heavy on
mathematics and adult students have not had many opportunities to sharpen their
mathematical skills since high-school graduation (which occurred at least 5 years, and
typically 10-15 years, before enrolling at HOU). The second is that the lack of a
structured academic experience, notwithstanding the exposure to mathematics, has also meant
that general learning skills and attitudes may be lacking.
These problems result in substantial failure rates at the introductory courses. Such failures
usually mean that students do not progress smoothly towards the senior courses, thus skewing
the academic resources of the HOU system towards filtering the input rather than polishing
the output, from a quantitative point of view. Even though this may be perfectly acceptable
from an educational, political and administrative point of view, we must analyse and strive to
understand the mechanism and the reasons for failure. This could significantly enhance the
ability of HOU to fine-tune its tutoring and admission policies without compromising the
academic rigour that is expected of a large university.
In this work we build upon an earlier study at HOU (Kotsiantis et al., 2004), which applied
various machine learning techniques to the problem of failing. We acknowledge its findings,
confirm their validity from a qualitative point of view, and strive to produce consistently
shorter models, with rudimentary setting of parameters for model derivation, based on a
combination of machine learning approaches, namely decision trees and genetic algorithms.
This paper is structured in four subsequent sections. In the next section, we briefly review the
problem of predicting student performance at large, focus it on adult students learning at a
distance, and describe how HOU has been addressing this issue. In the same section we also
review our algorithmic approach, that of combining genetic algorithms and decision trees. We
then summarize previous findings in the same operational context and provide new
experimental results that validate our proposed algorithmic approach on the benchmark data.
Next, we conduct more extended but more focused experiments on a range of other (similar)
data sets, to gauge the scalability of our approach across several data sets spanning a number
of years and a number of different study modules. Finally, we conclude by discussing
promising findings, limitations and issues of large-scale operational deployment.
1 Background
1.1 Summarizing the application domain
Open Universities around the world were established to cover the educational and
re-education needs of the workforce by providing university-level studies. The educational
philosophy of the activities developed differs from that of conventional educational systems:
it intends to promote “life long education” and to provide adults with “a second educational
chance” (Keegan, 1993). The method used is known as “distance learning” education.
The ability to predict a student’s performance could be useful in a number of ways associated
with university-level distance learning, especially preventing students’ dropout. Students drop
out quite often in institutions providing education using open and distance learning methods
(Cardon & Christensen, 1998) due to a number of factors. Examples of such factors are the
degree of difficulty of the selected courses and subject areas, the false estimation of “free”
time that could be dedicated to studies, health, family and occupation problems, etc.
In open and distance learning, dropout rates are definitely higher than those in conventional
universities. Moreover, they vary depending on the education system adopted by the
institution providing distance learning, as well as on the selected subject of studies. Recently,
the Open Learning journal (2004) published a volume on student retention in open
and distance learning, where similarities and differences across systems are discussed,
highlighting issues of institutions, subjects and geographic areas.
Previous work at HOU on the subject of student dropout in the field of informatics showed
that the total percentage of dropouts reaches 28.4%, compared to 71.6% who continue (Xenos
et al., 2002). It is important to mention that the great majority of students dropped out after
failing to deliver the first one or two written assignments.
With regard to the reasons provided by the students for dropping out, more than half of them
(who dropped out) claimed that they were not able to correctly estimate the time that they
would have to devote to their professional activity and as a result, the time dedicated to their
education decreased unexpectedly. The second reason offered by approximately one out of
four students was their feeling that their knowledge was not sufficient for university-level
studies (other reasons, such as family or health issues were also quoted). It is, thus, reasonable
to assert that predicting a student’s performance can enable a tutor to take early remedial
measures by providing more focused coaching, especially in issues such as priority setting
and time management.
1.2 Summarizing the AI technologies used in the application domain
Decision trees can be considered as rule representations that, besides being accurate, can
produce comprehensible output, which can be also evaluated from a qualitative point of view
(Mitchell, 1997). Decision trees are usually produced by analyzing the structure of examples
(training instances), which are usually given in a tabular form.
For example, the decision tree shown in Table 1 conveys the information that
a mediocre grade at an assignment, turned in at about the middle (in the time-line) of the
module, is an indicator of possible failure at the exams, whereas a non-mediocre grade refers
the alert to the last assignment. An excerpt of a training set that could have produced the
above tree is shown in Table 2.
Table 1: A sample decision tree.
Assgn2 in [3..6]?  YES -> FAIL
                   NO  -> Assgn4 < 3?  YES -> FAIL,  NO -> PASS
Table 2: A sample decision tree training set.
Assgn1 Assgn2 Assgn3 Assgn4 Exam
… … … … …
4.6 7.1 3.8 9.1 PASS
9.1 5.1 4.6 3.8 FAIL
7.6 7.1 5.8 6.1 PASS
… … … … …
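To make the induction step concrete, the following is a minimal, self-contained sketch of learning a single decision test from such a table. It is not the actual GATREE or WEKA code; the rows (including the fourth, added one) and the FAIL-below-threshold rule form are illustrative assumptions.

```python
# Minimal sketch: inducing a one-level decision "stump" from a tabular
# training set like Table 2. Grade rows are illustrative, not real data.

ROWS = [  # (Assgn1, Assgn2, Assgn3, Assgn4, Exam)
    (4.6, 7.1, 3.8, 9.1, "PASS"),
    (9.1, 5.1, 4.6, 3.8, "FAIL"),
    (7.6, 7.1, 5.8, 6.1, "PASS"),
    (3.0, 4.0, 2.5, 2.0, "FAIL"),  # invented extra row
]

def best_stump(rows):
    """Return (attribute index, threshold, accuracy) of the best single
    test of the form 'attr < threshold -> FAIL else PASS'."""
    best = (None, None, 0.0)
    for attr in range(4):
        for thr in sorted({r[attr] for r in rows}):
            hits = sum(
                1 for r in rows
                if ("FAIL" if r[attr] < thr else "PASS") == r[4]
            )
            acc = hits / len(rows)
            if acc > best[2]:
                best = (attr, thr, acc)
    return best

attr, thr, acc = best_stump(ROWS)
print(f"Assgn{attr + 1} < {thr} -> FAIL (training accuracy {acc:.2f})")
```

A full decision-tree learner applies this kind of test selection recursively to the subsets produced by each split.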
Genetic algorithms (Booker et al., 1990; Mitchell, 1997) can directly evolve binary decision
trees that explain and/or predict the success/failure patterns of junior undergraduate students.
To do so, we evolve populations of tree representations according to a fitness function that
allows for fine-tuning decision tree size vs. accuracy on the training set. At each time-point
(in genetic algorithms dialect: generation) a certain number of decision trees (population) is
generated and sorted according to some criterion (fitness). Based on that ordering, certain
transformations (genetic operators) are performed on some members of the population to
produce a new population that will be subject to the same cycle until a predefined number of
generations is reached (or no further improvement is detected).
These concepts form the basis of the GATREE System (Papagelis & Kalles, 2001), which has
been built on top of the GALIB library (Wall, 1996).
The genetic operators on the tree representations are relatively straightforward. A mutation
may modify the test attribute at a node or the class label at a leaf. A cross-over may substitute
whole parts of a decision tree by parts of another decision tree. Examples of both operators
are shown in Table 3.
Table 3: GATREE genetic operators (from Papagelis & Kalles (2001)).
[Embedded figures, garbled in this transcript: the mutation example shows a mutated leaf
receiving a new class and a mutated node receiving a new test value; the cross-over example
shows chosen nodes in two parent trees exchanging their subtrees.]
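The two operators can be illustrated on a toy nested-tuple tree representation. This is an illustrative sketch, not GATREE's actual implementation; the attribute tests and class labels are invented for the example.

```python
import random

# Toy representation: a node is (test, yes_subtree, no_subtree), a leaf
# is a class label string. Tests and labels below are illustrative.

TESTS = ["Assgn2 in [3..6]", "Assgn3 > 6", "Assgn4 < 3"]
LABELS = ["PASS", "FAIL"]

def mutate(tree, rng):
    """Mutation: give a leaf a new class label, or a node a new test
    (possibly recursing into one of its branches)."""
    if isinstance(tree, str):
        return rng.choice(LABELS)            # mutated leaf: new class
    test, yes, no = tree
    which = rng.randrange(3)
    if which == 0:
        return (rng.choice(TESTS), yes, no)  # mutated node: new test
    if which == 1:
        return (test, mutate(yes, rng), no)
    return (test, yes, mutate(no, rng))

def crossover(a, b, rng):
    """Cross-over (one direction): graft tree b in place of one branch
    of tree a, substituting a whole part of a by a part of b."""
    if isinstance(a, str):
        return b
    test, yes, no = a
    if rng.random() < 0.5:
        return (test, b, no)
    return (test, yes, b)

rng = random.Random(0)
t1 = ("Assgn2 in [3..6]", "FAIL", ("Assgn4 < 3", "FAIL", "PASS"))
t2 = ("Assgn3 > 6", "PASS", "FAIL")
print(mutate(t1, rng))
print(crossover(t1, t2, rng))
```

In a full implementation the grafting point in both parents would be chosen at random depth; the sketch only replaces a top-level branch to keep the mechanics visible.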
The GATREE fitness function is

fitness(Tree_i) = CorrectClassified_i * x^2 / (size_i^2 + x^2)
The first part of the product above is the actual number of training instances that a decision
tree (a member of a population) classifies correctly. The second part of the product (the size
factor) includes a factor x which has to be set to an arbitrarily big number. Thus, when the size
of the tree is small, the size factor is near one, while it decreases when the tree grows big.
This way, the payoff is greater for smaller trees. The size factor can be altered to match
individual needs, but this must be exercised with care since we never know whether a target
concept can be represented with a decision tree of a specific size.
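The behaviour of the size factor is easy to verify numerically. A minimal sketch, assuming the fitness form described above; the value of x and the tree statistics are illustrative, not GATREE defaults.

```python
def gatree_fitness(correct, size, x=1000):
    """fitness = correct * x^2 / (size^2 + x^2): close to 'correct' for
    small trees, shrinking as the tree size grows relative to x."""
    return correct * x**2 / (size**2 + x**2)

# Payoff favours the smaller of two equally accurate trees.
small = gatree_fitness(correct=300, size=15)
large = gatree_fitness(correct=300, size=400)
print(small, large)
```

Raising x weakens the size penalty (accuracy-biased search); lowering it strengthens the penalty (size-biased search), which is the trade-off exploited in the experiments below.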
2 Experimental comparison in depth
Before proceeding we elaborate on some background that sets the context for the experiments
and the data1.
The basic educational unit at HOU is the module, which runs for about ten months and is the
equivalent of about 3-4 conventional university semester courses. A student may register with
up to three modules per year. For each module, a student is expected to attend five plenary
class meetings throughout the academic year (a class contains about thirty students). Each
meeting is about four hours long and may be structured along tutor presentations, group-work
and review of assigned homework. Furthermore, each student must turn in some written
assignments (typically four or six), which contribute towards the final grade, before sitting a
written exam. Assignments are graded on a 0-10 scale (with an average 5 securing a pass) and
are prepared on the assumption that a student will normally devote 1-2 hours per day to
studying and about 10-15 hours to complete an assignment to a satisfactory level.
The “Informatics” programme is composed of 12 modules and leads to a Bachelor Degree.
The vast majority (up to 98%) of registered students in the “Informatics” programme, upon
being admitted at HOU, selects the module “Introduction to Informatics” (INF10), which
allows us to use INF10 as a test-bed for experimentation.
Key demographic characteristics of students (such as age, sex, residence etc), their marks in
written assignments and their presence or absence in plenary meetings constituted the initial
training set for a supervised learning algorithm, to predict whether a student would eventually
pass or fail a specific module.
1 The data sets are freely available for academic research.
2.1 Summarizing the previous findings
Initial experimentation employed several machine learning techniques to predict student
performance with reference to the final examination. These techniques were decision trees,
neural networks, Naïve Bayes, instance-based learning (see Mitchell (1997) for an
introduction), logistic regression (Perlich et al., 2004) and support vector machines (Burges, 1998).
The experiments took place in two distinct phases. During the first phase (training phase) the
algorithms were trained using the students’ data collected from the academic year 2000-1.
The training phase was divided into 9 consecutive steps. The 1st step included the demographic
data and the resulting class (pass or fail); the 2nd step was enhanced with data from the first
plenary meeting; the 3rd step was enhanced with data from the first written assignment; the 4th
step with data from the second plenary meeting, until the 9th step that included all attributes
(demographic data, data from all four plenary meetings and the marks of all four written
assignments). Assignment (numeric) data were manually quantized.
Subsequently, ten groups of data for the new academic year (2001-2) were collected from 10
tutors and the HOU registry. Each one of these 10 groups was used to measure the prediction
accuracy within these groups (testing phase). The testing phase also took place in 9 steps.
During the 1st step, the demographic data of the new academic year were used to predict the
class (pass or fail) of each student. During the 2nd step these demographic data along with the
data from the first plenary meeting were used in order to predict the class of each student. The
remaining steps were completed in the same way as described above for the training phase.
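The step-wise accumulation of attributes can be sketched as follows; the attribute names are illustrative stand-ins for the actual demographic, meeting and assignment fields.

```python
# Sketch of the 9-step attribute accumulation described above.

DEMOGRAPHICS = ["age", "sex", "residence"]
INTERLEAVED = ["Meet1", "Assgn1", "Meet2", "Assgn2",
               "Meet3", "Assgn3", "Meet4", "Assgn4"]

def attributes_at_step(step):
    """Step 1 = demographics only; each later step adds the next
    plenary meeting or written assignment, up to step 9 with all
    attributes available."""
    assert 1 <= step <= 9
    return DEMOGRAPHICS + INTERLEAVED[:step - 1]

print(attributes_at_step(1))   # demographics only
print(attributes_at_step(9))   # all demographic, meeting, assignment data
```

Each step thus corresponds to a point in the academic year at which a prediction could realistically be issued.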
The overall accuracy was computed in a rather unconventional way. First, a model was
created based on the academic year 2000-1 data (from all tutors). Then, this model was tested
on 10 data sets of academic year 2001-2. The results were averaged for all combinations of
the nine (9) steps and the ten (10) test data sets. This was, then, repeated for each classifier
(type). The overall accuracy of the algorithms used is shown in Table 4.
All accuracies (reported in this paper) are measured in %.
Table 4: Accuracy results of initial experimentation.
Classifier Overall Accuracy
C4.5 69.99
BP 72.26
Naïve Bayes 72.48
3-NN 66.93
Logistic regression 72.32
SMO 72.17
Indeed, it was shown that learning algorithms could enable tutors to pinpoint students at risk
of failure with satisfactory accuracy; however, demographics seem to play a rather limited
role with respect to the rest of the available information (actual academic performance).
Furthermore, the analysis of the experiments and the comparison of the six algorithms
provided sufficient evidence that the Naïve Bayes algorithm was appropriate for
predicting students’ performance, also given its ease of implementation.
2.2 Setting out the reference base
In our work, the latest version of WEKA (Witten & Frank, 2000) was used to bring the
experimental results up-to-date, in the sense that more conventional reporting techniques were
used. For the sake of simplicity and completeness we conducted a 10-fold cross-validation
experiment with the classifiers shown in Table 5, and reported the average accuracy as well as
the accuracy on the training set. We also report the model size based on the full training data
set (where applicable, i.e. only for symbolic classifiers). Leaves were counted as nodes and
the trees were binary.
All decision tree model sizes (reported in this paper) are measured in nodes.
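For readers unfamiliar with the protocol, 10-fold cross-validation splits the data into ten parts and averages the accuracy of models trained on nine parts and tested on the held-out tenth. A minimal stdlib-only sketch, with a trivial majority-class classifier standing in for the WEKA implementations actually used:

```python
import random

# Sketch of the 10-fold cross-validation protocol behind Table 5.
# The classifier is a stand-in, not J48/MLP/etc.

def ten_fold_accuracy(rows, fit, predict, folds=10, seed=0):
    """Average held-out accuracy over 'folds' train/test splits."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    chunks = [rows[i::folds] for i in range(folds)]
    accs = []
    for i, test in enumerate(chunks):
        train = [r for j, c in enumerate(chunks) if j != i for r in c]
        model = fit(train)
        hits = sum(1 for r in test if predict(model, r) == r[-1])
        accs.append(hits / len(test))
    return sum(accs) / folds

def fit(train):
    """Majority-class baseline: remember the commonest training label."""
    labels = [r[-1] for r in train]
    return max(set(labels), key=labels.count)

def predict(model, row):
    return model

# Toy data: (grade, outcome) pairs, purely illustrative.
data = [(g, "PASS" if g >= 5 else "FAIL") for g in range(10)] * 5
print(ten_fold_accuracy(data, fit, predict))
```

Reporting both this average and the accuracy on the full training set, as the tables below do, exposes the gap that signals over-fitting.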
Table 5: Summary results for conventional classifiers.
Classifier Average Accuracy Accuracy on Training Set Model Size
J48 65.41 82.27 57
J48 (with REP) 69.48 75.00 33
J48 (with bagging) 69.48 86.92 N/A
MLP 66.86 93.02 N/A
Logistic regression 67.44 75.58 N/A
Naïve Bayes 63.08 65.12 N/A
3-NN 63.66 75.58 N/A
J48 is a WEKA implementation of the C4.5 algorithm (Quinlan, 1993). REP stands for
reduced-error-pruning, a technique for pruning decision trees (Quinlan, 1993). Bagging is a
technique used to reduce variance in classifiers (Breiman, 1996). MLP is a standard neural
network implementation (Rumelhart et al., 1994). 3-NN stands for 3-Nearest-Neighbour. N/A
stands for “not applicable”, for example, in MLP, the model size is specified by the model
builder and is not something that can be evolved during model building (of course, the model
size can have an effect on other quality parameters, such as speed, accuracy, etc. but we have
selected a default WEKA-supplied value because this is beyond the scope of our
experimentation).
The previous findings were useful indeed2. Based on our tutoring experience, success in the
initial written assignments is a strong indicator of successful follow-up. This is also
corroborated by the fact that such success motivates the student towards a more
whole-hearted approach to studying, based on the positive reinforcement. The usual risk here
is that there might be a temptation (on the tutor’s part) to artificially increase early grades;
this is a risky approach that may hide weaknesses and could convey the wrong message to the
student.
2 Predictably, the results in Table 4 and in Table 5 are not comparable.
To conduct experimentation with reference to the above (baseline) results, we have used the
GATREE system (Papagelis & Kalles, 2001). Its basic version allows for experimenting with
up to 150 generations and up to 150 members per generation. The results are shown in Table
6. (Average accuracy is reported based on 10-fold cross-validation.)
Table 6: Summary results for basic GATREE decision trees.
Classifier Average Accuracy Accuracy on Training Set Model Size
(gen/pop) 250/250 77.06 87.50 33/37
(gen/pop) 250/100 80.00 86.63 39
(gen/pop) 100/250 78.53 86.05 15/17
(gen/pop) 100/100 76.76 84.59 19
To ensure the validity of the experimentation we used the same data sets as the original
experimentation, including demographic data and quantized data. For all experiments we used
the default settings for the genetic algorithm operations (cross-over probability 0.99,
mutation probability 0.01, error rate 0.95 and replacement rate 0.25).
Where an X/Y number is reported as model size the interpretation is as follows: the genetic
algorithm produced (at the end) a best tree of Y nodes, which was then post-processed by
hand to eliminate redundant nodes (which sneak into the structure due to its pseudo-random
evolution and our decision to not yet constrain the genetic operators) resulting in a tree of X
nodes.
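The hand elimination of redundant nodes can be made concrete with a small sketch. This illustrates one kind of redundancy only, on nested-tuple trees, and is not GATREE's actual post-processing.

```python
# One kind of redundancy that pseudo-random evolution can leave behind:
# a test whose two branches lead to the same outcome. A node is
# (test, yes_subtree, no_subtree); a leaf is a class label.

def node_count(tree):
    """Count nodes with leaves included, as in the reported model sizes."""
    if isinstance(tree, str):
        return 1
    _, yes, no = tree
    return 1 + node_count(yes) + node_count(no)

def simplify(tree):
    """Collapse any node whose two simplified branches are identical."""
    if isinstance(tree, str):
        return tree
    test, yes, no = tree
    yes, no = simplify(yes), simplify(no)
    return yes if yes == no else (test, yes, no)

redundant = ("Assgn2 in [3..6]", ("Assgn4 < 3", "FAIL", "FAIL"), "PASS")
print(node_count(redundant), node_count(simplify(redundant)))  # → 5 3
```

A Y-node tree reduced this way to X nodes corresponds to the X/Y sizes reported in the tables.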
The initial findings suggest that, when compared to conventional decision-tree classifiers, our
approach produces significantly more accurate trees. The tree size is comparable and the
running time is longer, but due to its overall short duration, the exercise can be considered a
success.
2.3 A quantitative comparison: model accuracy, model size and model derivation cost
The real issue is whether we can use our approach to achieve flexible trade-offs between
speed, accuracy and simplicity. The experiments presented in this section serve to answer this
question affirmatively and unequivocally.
We first investigate whether increasing the number of generations and the population size has
an effect on the quality of the final solution. We can hope that as we pay more in time, we
might be able to evolve more accurate or more compact trees. We used the unlimited-use
version of the GATREE system for these experiments. The results are shown in Table 7.
Table 7: Summary results for extended-run GATREE decision trees.
Classifier Average Accuracy Accuracy on Training Set Model Size
(gen/pop) 100/1000 75.59 86.92 41
(gen/pop) 250/1000 79.12 88.95 43/49
(gen/pop) 1000/100 74.71 88.66 27
(gen/pop) 1000/250 79.12 88.95 25
(gen/pop) 1000/1000 80.00 92.15 51
A casual comparison suggests that the extra effort may not be warranted, since the
improvements do not seem to be related with either the number of generations or the size of
the population. However, it does seem that the relatively big reclassification accuracy in the
1000/1000 case (as measured on the training set) is a strong symptom of over-fitting, as
witnessed by the tree size.
While it would probably be a substantial improvement in itself to claim here that we have a
better technique for predicting drop-out, the underlying realities of the tutors’ experience,
and the difficulty of setting up a mechanism to collect and process data across all courses of
the university (meaning the deployment of tens of tests per course), suggest that we take a
closer qualitative look at the findings.
There are two directions that we should look at. The first direction has to do with the
comprehensibility of the induced decision trees themselves, as compared to the initial
findings: we would like to be able to confirm or refute the finding that demographics play a
limited role in the drop-out rate and that the most important indicators have to do with
performance at the initial written assignments. The second direction has to do with
GATREE’s ability to trade-off size with accuracy in the evolution mechanism: we would like
to see whether this ability can facilitate our planning to maybe use such findings on an
operational scale (covering tens of thousands of students and hundreds of tutors, the latter
with diverse IT literacy levels, every academic year).
Note that a trade-off between size and accuracy is explicitly stipulated in the fitness function
of GATREE (Papagelis & Kalles, 2001). That the trade-off is reasonable, as set in the default
parameters, can be argued by referring to the experimental results as reported in Table 5,
Table 6 and Table 7. Note that the gap between the re-classification error and the cross-
validation error is significantly smaller in GATREE induced trees (slightly more than 10%, at
worst). Equally important, that gap does not seem to particularly depend on the model size. In
WEKA induced classifiers, one must either explicitly use a post-processing technique (J48
with REP) or settle for a mediocre performance (naïve Bayes) to see a small gap.
To further gauge trade-offs we used GATREE in two extreme operating modes. In size-biased
and accuracy-biased modes, we respectively shift the fitness function towards either very
small or very accurate trees. Since we have observed that the gap between re-classification
error and cross-validation error is (relatively) small for GATREE, we only report
re-classification error results in Table 8 and in Table 9. In Table 10, however, we also report
the cross-validation error for the largest-scale experiment.
Table 8: Summary results for extended-run size-biased GATREE decision trees.
Classifier Accuracy on Training Set Model Size
(gen/pop) 100/100 79.36 3
(gen/pop) 100/250 83.14 5
(gen/pop) 100/1000 84.30 7
(gen/pop) 250/100 83.14 5
(gen/pop) 250/250 83.14 5
(gen/pop) 250/1000 84.30 7
(gen/pop) 1000/100 83.14 5
(gen/pop) 1000/250 83.14 5
(gen/pop) 1000/1000 84.30 7
Table 9: Summary results for extended-run accuracy-biased GATREE decision trees.
Classifier Accuracy on Training Set Model Size
(gen/pop) 100/100 84.30 17
(gen/pop) 100/250 85.47 29/33
(gen/pop) 100/1000 87.79 127/133
(gen/pop) 250/100 86.05 33
(gen/pop) 250/250 88.95 107/123
(gen/pop) 250/1000 90.41 135/165
(gen/pop) 1000/100 89.24 51
(gen/pop) 1000/250 91.57 67/73
(gen/pop) 1000/1000 93.31 93/115
Table 10: Cross-validation results for gen/pop:1000/1000 GATREE decision trees.
Classifier Average Accuracy Accuracy on Training Set Model Size
Size-biased 78.82 84.30 7
Accuracy-biased 76.82 93.31 93/115
A few comments are due by taking into account all results and, in particular, Table 5, Table 6
and Table 10. GATREE can indeed produce small and accurate decision trees quite early, and still
be able to improve over them with time. This is an overall positive evaluation of the approach.
However, with reference to
Table 10, one can most clearly observe the problem of over-fitting, as manifested by the
discrepancy of re-classification error and the cross-validation error. This gap can clearly be
associated with the model size of the two extreme approaches.
The final round of experiments in trade-offs utilizes a particular feature of GATREE, namely
the (monotonic) adjustment of bias towards either accuracy-biased or size-biased decision
trees. In this set of experiments we start at one end of the spectrum (for example, accuracy-
biased) and throughout the evolution we shift our bias towards the other spectrum (in that
case, size-biased).
There is an obvious intuitive appeal to both approaches. In principle, starting with small trees
could allow a small structure to be optimized for accuracy. Also, in principle, starting with
accuracy-biased trees could mean that evolution could lead to the trimming (pruning) of
superfluous tree parts.
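One simple way to realize such a monotonic shift is to interpolate the size constant x of the fitness function across generations. This is a sketch of the idea, not GATREE's actual mechanism; the endpoint values of x and the tree statistics are illustrative assumptions.

```python
# Sketch of the "as" schedule: accuracy-biased start (large x, size
# barely penalized) shifting monotonically to a size-biased end (small x).

def x_at_generation(gen, total_gens, x_start=10_000, x_end=10):
    """Linearly interpolate x from a large (accuracy-biased) value to a
    small (size-biased) one as the evolution proceeds."""
    return x_start + (gen / total_gens) * (x_end - x_start)

def fitness(correct, size, x):
    return correct * x**2 / (size**2 + x**2)

big_accurate = (310, 120)   # (correctly classified, tree size)
small_rough = (280, 7)

# Early generations prefer the big accurate tree; late ones flip.
for gen in (0, 1000):
    x = x_at_generation(gen, 1000)
    print(gen, fitness(*big_accurate, x) > fitness(*small_rough, x))  # → 0 True, then 1000 False
```

An "sa" run would simply swap the endpoint values, starting size-biased and relaxing the penalty over time.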
We report the results in Table 11 (note that sa refers to size being preferred at the start of the
evolution and accuracy being preferred towards the end, and as should be interpreted
likewise). We note that the results confirm earlier findings regarding the gap
between re-classification error and the cross-validation error. We also note that the
unimpressive difference between the tree sizes of the static-bias and dynamic-bias approaches
suggests that an uninformed uniform policy on bias modification is not warranted.
Table 11: Cross-validation results for dynamic-bias GATREE decision trees.
Classifier Average Accuracy Accuracy on Training Set Model Size
(gen/pop) 100/100 sa 78.53 83.43 45/57
(gen/pop) 100/100 as 77.65 85.47 17/19
(gen/pop) 250/250 sa 80.00 88.66 113/135
(gen/pop) 250/250 as 79.71 88.37 47/53
(gen/pop) 1000/1000 sa 78.53 95.06 129/141
(gen/pop) 1000/1000 as 77.06 92.73 125/141
2.4 A qualitative comparison: model consistency
In this section we document how the GATREE-induced trees confirm3 earlier findings about
the (lack of) importance of student demographics, with reference to the findings summarized
in section 2.1 above.
Based also on the description of the HOU context, as set out earlier in this paper, in Table 12
we report the generated trees for the size-biased variants. We use the notation AssgnX to refer
to the grade obtained at the Xth assignment and the notation MeetY to refer to the presence or
absence at the Yth plenary class meeting. In the embedded figures, solid lines refer to a YES
answer, whereas dotted lines refer to a NO answer.
We now move to explain the findings, based on our tutoring experience. For a start, we note
that all results make sense. This is not counter-intuitive; indeed, we know that many
hypotheses may be compatible with a target concept and all of them may be of value (after all,
this has given rise to ensemble classifiers).
3 The reader should interpret this extra step as an effort to replicate previous results using a new
technique. The previous sections dealt with model accuracies; we now discuss the actual models.
Let us first consider the small-scale experiment (top row in Table 12). By referring to Table 8,
we note that its classifying ability is clearly the least favourable of all alternatives and can be
compared to a single decision stump. In essence, the interpretation is that if a student achieves
at least 6 in a sophisticated assignment (as the 3rd one usually is), a pass can normally be expected.
The medium-scale experiment is more interesting (middle row in Table 12); note that it also
suggests that a relative stability exists in the induced model despite the fluctuations in
generations and populations. Its interpretation is as follows: a mediocre grade at an
assignment, turned in at about the middle (in the time-line) of the course, is an indicator of
possible failure at the exams, whereas a non-mediocre grade refers the alert to the last
assignment. Note that this makes sense: a student either demonstrates a sustained performance
of some value (thus achieving at least 3 at the last assignment), or the student underperforms,
in which case a failure is imminent.
The large-scale experiment (bottom row in Table 12) is also interesting. Note that it contains
tests on the key attributes tested above in the smaller-scale experiments but it also interleaves
a test on the participation in a plenary class. We believe that this may be disguised over-
fitting: on one hand it generates a slightly better re-classification rate (see Table 8), but on the
other hand it tests on a reasonable attribute: the third plenary meeting (and other meetings
onward) usually commands attendance from the committed students, whereas the rest usually
consider turning in assignments without attending the plenary classes. The tree reported here,
however, suggested that there may be a class of students who are confident, skip the meeting
and eventually pass (so they are justifiably confident).
Note that, quite predictably, demographics are absent throughout. This is a key confirmation.
Page 19 of 30
Table 12: Model results for size-biased GATREE decision trees.

(gen/pop) 100/100:
    Assgn3 > 6?  YES -> PASS,  NO -> FAIL

(gen/pop) 250/100, 100/250, 250/250:
    Assgn2 in [3..6]?  YES -> FAIL
                       NO  -> Assgn4 < 3?  YES -> FAIL,  NO -> PASS

(gen/pop) 1000/1000:
    Assgn4 < 3?  YES -> FAIL
                 NO  -> Meet3 = NO?  YES -> PASS
                                     NO  -> Assgn2 in [3..6]?  YES -> FAIL,  NO -> PASS
FAIL PASS
3 Experimental comparison in breadth
We now move to tests on new data sets. To do so we use eight data sets (one of which we
have been using all along until now).
These data-sets refer to students taking three distinct courses, over three consecutive
academic years. All three courses are at the first year, where failures are very common
(reaching 50%). Courses are named according to an HOU naming convention but for the sake
of conciseness and clarity we will use the following convention:
X_Y refers to a course having a code-name X and running for the academic year Y..Y+1
For example, 10_2000 refers to the introductory course on Informatics (code INF10, as earlier
described) offered at the academic year 2000-1. This is the data set that we have been
experimenting with throughout section 2.
We drop the pre-processing step of quantizing the numeric attributes and experiment with raw
numbers instead (we also believe that manually selected quantization policies will not scale
well into the university-level prediction system that we aim at). Also, all data sets have been
stripped of all demographic data and all meeting data. Furthermore, GATREE results are
based on size-only bias settings.
We first report the results for data-set 10_2000, by comparing them directly with earlier
findings to also gauge the results of quantization. (Earlier findings are shown in small font, in
parentheses.)
Table 13: Summary results for data-set 10_2000. [4]
Classifier | Average Accuracy | Accuracy on Training Set | Model Size
J48 | (65.41) 81.73 | (82.27) 89.96 | (57) 35
J48 (with REP) | (69.48) 83.73 | (75.00) 82.73 | (33) 3
J48 (with bagging) | (69.48) 82.93 | (86.92) 90.76 | N/A
MLP | (66.86) 82.13 | (93.02) 84.14 | N/A
Logistic regression | (67.44) 83.73 | (75.58) 83.53 | N/A
Naïve Bayes | (63.08) 84.14 | (65.12) 84.54 | N/A
3-NN | (63.66) 81.93 | (75.58) 88.15 | N/A
GATREE 250/250 | (…..) 83.27 | (83.14) 84.34 | (5) 3
GATREE 1000/1000 | (78.82) 83.27 | (84.30) 84.34 | (7) 3
The most crucial observation is that the training-set accuracy of GATREE-induced trees closely tracks their cross-validated accuracy. This trait seems to be shared by all classifiers here; note, however, that GATREE produced such close-match estimates even with the quantized formats. This is an indication that GATREE can produce quality results even in the presence of noise, confirming the findings reported by Papagelis & Kalles (2001) on other data sets (benchmark and synthetic ones).
We now report our results for the rest of the data sets (omitting the 1000/1000 GATREE test).
[4] Data in parentheses was drawn from Table 5 and Table 10.
Table 14: Summary results for data-set 10_2001.
Classifier | Average Accuracy | Accuracy on Training Set | Model Size
J48 | 79.13 | 87.82 | 55
J48 (with REP) | 79.98 | 84.80 | 25
J48 (with bagging) | 80.94 | 88.54 | N/A
MLP | 80.22 | 82.63 | N/A
Logistic regression | 82.63 | 82.39 | N/A
Naïve Bayes | 80.94 | 80.70 | N/A
3-NN | 81.66 | 89.02 | N/A
GATREE 250/250 | 78.78 | 80.83 | 5
Having confirmed that we can produce good small trees, we now summarize the results for
the current course (code 10) in Table 15. We show the derived models and defer comparisons
to Table 18.
Table 15: GATREE induced trees for data-sets 10_2001 and 10_2002.
[The rendered trees are only partially recoverable. The 10_2001 tree tests Assgn3 <= 5 and Assgn2 <= 7; the 10_2002 tree tests Assgn4 <= 8, Assgn2 <= 6 and Assgn3 <= 4; every branch terminates in a PASS or FAIL leaf.]
In Table 16, we summarize the results for course code 11. Note that, due to the genetic evolution, some redundancy may exist (testing on the second assignment, in the middle row).
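Trees of this size translate directly into a handful of if/else rules. As a sketch, a two-test tree like those above can be written as the function below; the branch orientation is an assumption chosen for illustration, not the exact GATREE output:

```python
def predict(assgn3, assgn2):
    """Hypothetical reading of a two-node GATREE-style tree:
    the root tests Assgn3 <= 5 and the subtree tests Assgn2 <= 7.
    Branch orientation is assumed for illustration only."""
    if assgn3 <= 5:
        # Low grade on the third assignment: fall through to Assgn2.
        return "FAIL" if assgn2 <= 7 else "PASS"
    return "PASS"

print(predict(assgn3=4, assgn2=6))  # a student below both thresholds
```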
Table 16: GATREE induced trees for data-sets 11_2000, 11_2001 and 11_2002.
[The rendered trees are only partially recoverable. The 11_2000 tree tests Assgn3 <= 5 and Assgn2 <= 7; the 11_2001 tree tests Assgn3 <= 5, Assgn2 < 3, Assgn4 < 4 and Assgn1 <= 6; the 11_2002 tree tests Assgn2 <= 3 and Assgn2 <= 5; every branch terminates in a PASS or FAIL leaf.]
In Table 17, we summarize the results for course code 12. Note that, due to the genetic evolution, some unexpected findings may appear (testing on whether the sixth assignment has been submitted at all, in the top row).
Table 17: GATREE induced trees for data-sets 12_2001 and 12_2002.
[The rendered trees are only partially recoverable. The 12_2001 tree tests Assgn4 < 5, Assgn2 < 7 and Assgn6 <= 0; the 12_2002 tree tests Assgn6 < 2 and Assgn1 < 3; every branch terminates in a PASS or FAIL leaf.]
In Table 18, we summarize the accuracy and size results for the GATREE and J48-REP models. Accuracies are based on 10-fold cross-validation; model size is that of the model induced on the full training set.
Table 18: Summary results, GATREE vs. J48-REP.
Data-set | GATREE Accuracy | J48-REP Accuracy | GATREE Size | J48-REP Size
10_2000 | 83.27 | 83.73 | 3 | 3
10_2001 | 78.78 | 79.98 | 5 | 25
10_2002 | 77.53 | 79.58 | 7 | 17
11_2000 | 92.50 | 92.02 | 5 | 3
11_2001 | 77.95 | 79.19 | 7 | 7
11_2002 | 74.90 | 70.65 | 7 | 9
12_2001 | 67.04 | 70.65 | 7 | 7
12_2002 | 74.31 | 74.70 | 5 | 9
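Cross-validated accuracies of this kind are straightforward to reproduce in outline. The following is a generic 10-fold cross-validation sketch; the `train`/`classify` hooks and the majority-class learner are illustrative placeholders, not the actual GATREE or J48 interfaces:

```python
import random

def kfold_accuracy(data, train, classify, k=10, seed=0):
    """Average accuracy over k folds; `data` is a list of (features, label)."""
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    accs = []
    for i in range(k):
        test = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(training)
        correct = sum(classify(model, feats) == label for feats, label in test)
        accs.append(correct / len(test))
    return sum(accs) / k

# Stand-in learner: always predict the majority class of the training set.
def train_majority(training):
    labels = [label for _, label in training]
    return max(set(labels), key=labels.count)

def classify_majority(model, feats):
    return model

# Toy data: 100 students, PASS when the (single) grade exceeds 5.
data = [((g,), "PASS" if g > 5 else "FAIL") for g in range(10)] * 10
print(kfold_accuracy(data, train_majority, classify_majority))
```

Any real classifier can be plugged in by supplying its own `train` and `classify` functions.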
The above findings are aligned with our reporting in the previous paragraphs. GATREE models are very short and often more accurate than their conventionally induced counterparts, even when the latter employ pruning. There are a couple of data sets where the accuracy gap exceeds 2% in favour of J48-REP, but we believe that the consistent conciseness of the GATREE models, their straightforward derivation, and the ability to pay in time in order to "buy" more models (by simply increasing the number of generations and the population size) are key quality differentiators when considering the use of such a model at an operational scale, whether to explain or to predict failure. We elaborate on these directions below.
4 Discussion
We believe that a small, accurate model is a very important tool in the hands of a tutor, assisting in the task of continuously monitoring a student's performance with reference to the possibility of passing the final exam. A small model is easier to communicate among peers, easier to compare with competing models, and can have wider applicability.
There are, however, some key issues that have to be reflected in a future research agenda, and
nearly all of them are of an applied-research nature.
There is, first of all, the question of whether one should measure the performance of a model on subsequent years' statistics rather than by cross-validation testing on the same data set. This is not simply an exercise in generalization capability, since some sort of data engineering will be inevitable (for example, if a year's performance model is based on four assignments, yet the target year is based on six assignments).
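As an example of the kind of data engineering we mean (purely illustrative; the mapping below is our own assumption, not a validated procedure), one could align a six-assignment year with a four-assignment model by averaging grades over normalized positions in the year:

```python
def remap(grades, target=4):
    """Map a list of assignment grades onto `target` slots by averaging
    the grades whose relative position falls in each slot.
    Assumes len(grades) >= target, so that no slot is left empty."""
    n = len(grades)
    slots = [[] for _ in range(target)]
    for i, g in enumerate(grades):
        slots[min(i * target // n, target - 1)].append(g)
    return [sum(s) / len(s) for s in slots]

print(remap([6, 7, 4, 5, 8, 9]))  # six assignments -> four slots
```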
A further refinement refers to our approach being used as an "early warning system", which suggests that we must bias the models to prefer tests on the "earlier" rather than the "later" assignments. This would alert tutors well in advance of the final exam; however, it would only be important if tutoring were mostly based on monitoring performance, as opposed to also developing a personal relationship. As the HOU expects tutors to develop a trust relationship with students, alert methods that are merely monitoring-based may be of limited use. It is thus extremely important that the models be easy to derive, so that tutors get to familiarize themselves with the technology.
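One simple way to realize such a bias (a sketch under our own assumptions; GATREE's actual fitness function differs) is to penalize a candidate tree for every test it places on a late assignment:

```python
def biased_fitness(accuracy, tested_assignments, penalty=0.02):
    """Hypothetical 'early warning' fitness: start from the tree's
    accuracy and subtract a penalty that grows with the index of each
    assignment the tree tests (Assgn1 is cheap, Assgn4 is expensive)."""
    return accuracy - penalty * sum(i - 1 for i in tested_assignments)

# Two equally accurate trees: the one testing only early
# assignments keeps more of its fitness than the late one.
early = biased_fitness(0.80, tested_assignments=[1, 2])
late = biased_fitness(0.80, tested_assignments=[3, 4])
print(early, late)
```

The penalty weight is a free parameter; set to zero it recovers a plain accuracy-driven search.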
A further direction refers to the applicability of the modeling approach to advanced courses, i.e. courses taken by students approaching the end of their studies. Since our student population is significantly skewed towards junior courses, and since one can expect that drop-out is minimized after initial successes, we speculate that predicting pass/fail in advanced courses will be more difficult and less important from an educational point of view.
Another research direction refers to the approach being tested across various disciplines or in inter-disciplinary programs. In the opening of our paper we emphasized that (the lack of) learning and math skills may be key obstacles to success. In the humanities, lack of math skills is not an issue (at least, this is conventional wisdom), so the applicability of our approach there remains to be seen.
However, perhaps the most difficult and challenging research question we need to ask is whether our approach measures the performance of the students or of the system in general. For this to be answered we have to attempt embedding it in the everyday practice of the university, from both an academic and an administrative point of view. The underlying problem is rather simple: one can choose between micro-viewing our models as a comprehensive alert system for students and macro-viewing them as an intuitive alert system for the organization and running of the courses. The latter could suggest, for example, a periodic review of our policy on assignments, whereas the former could mean more one-to-one tutoring. HOU tutors do develop personal relationships with their students to address the communication and support issues that arise in distance learning; we thus believe that, if widely adopted, our approach would slowly drift away from the student-alert functionality towards the system-alert functionality.
5 References
Booker, L.B., Goldberg, D.E., & Holland, J.H. (1990). Classifier systems and genetic
algorithms. In, Machine learning: paradigms and methods (pp. 235-282). Cambridge, MA:
MIT Press.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24:2, 123-140.
Burges, C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining
and Knowledge Discovery, 2, 1-47.
Cardon, P., & Christensen, K. (1998). Technology-based programs and drop-out prevention.
The Journal of Technology Studies, http://scholar.lib.vt.edu/ejournals/JOTS/.
Keegan, D. (1993). Theoretical Principles of Distance Education. Routledge, London.
Kotsiantis, S., Pierrakeas, C., & Pintelas, P. (2004). Predicting students’ performance in
distance learning using Machine Learning techniques. Applied Artificial Intelligence, 18:5,
411-426.
Mitchell, T. (1997). Machine Learning. McGraw Hill.
Open Learning (2004). Special issue on "Student retention in open and distance learning".
19:1, http://www.tandf.co.uk/journals/titles/02680513.asp.
Papagelis, A., & Kalles, D. (2001). Breeding decision trees using evolutionary techniques. In
Proceedings of the International Conference on Machine Learning, Williamstown,
Massachusetts.
Perlich, C., Provost, F. & Simonoff, J.S. (2004). Tree induction vs. logistic regression: a
learning-curve analysis. Journal of Machine Learning Research, 4:2, 211-255.
Quinlan, J.R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan
Kaufmann.
Rumelhart, D., Widrow, B., & Lehr, M. (1994). The basic ideas in neural networks.
Communications of the ACM, 37:3, 87-92.
Wall, M. (1996). GAlib: A C++ Library of Genetic Algorithm Components. M.I.T.
http://lancet.mit.edu/ga/.
Witten, I., & Frank, E. (2000). Data mining: practical machine learning tools and techniques
with Java implementations. San Mateo, CA: Morgan Kaufmann.
Xenos, M., Pierrakeas, Ch. & Pintelas, P. (2002). A survey on student dropout rates and
dropout causes concerning the students in the Course of Informatics of the Hellenic Open
University. Computers & Education, 39, 361-377.