Veda: an on-line generative testing system


Education and Information Technologies, 2(3), 1997, pp. 219-234

Veda - An On-line Generative Testing System

KSR Anjaneyulu, R Chandrasekar and S Ramani
National Centre for Software Technology
Gulmohar Cross Road No. 9, Juhu
Bombay, India 400 049

Email: {anji, mickey, ramani}@saathi.ncst.ernet.in

Abstract

Veda is a generative testing system which supports the design and administration of tests, on-line and off-line. Veda provides tools for test designers to create question generators.

Veda authenticates the candidate’s identity, administers a test, evaluates the candidate’s answers and keeps a record of the candidate’s scores. The system also provides facilities for item analysis. There are provisions for computing statistical indices for individual questions, based on data from tests administered. These indices can be used for item evaluation and improvement.

An interesting aspect of Veda’s design is the set of security features built into it. Cryptographic techniques have been used to create a system offering a high degree of protection against leakage, theft or unauthorized modification.

Veda was developed using the programming language C, under the UNIX operating system, and is in heavy use.

Keywords: Generative, Testing

1 Introduction

Testing has a crucial role in on-line instruction. Testing permits the reliable sensing of the level of comprehension of a candidate. Within the area of testing, our work focusses on generative techniques [3]. It demonstrates that the technology to support the widespread use of generative tests has come of age.

Our emphasis has been on tests involving multiple questions, used independently of any tutoring system. The tests can be anywhere from a few minutes to a few hours long, and are generally administered and graded on-line by the system.

Similar tests, administered in a more conventional style, play an important role in education at all levels today. Teachers require tools to help them design and construct these tests. Sophisticated computer-based tools have become increasingly available for this purpose.

Computer assisted testing techniques have been used primarily for assisting teachers to construct tests, and analyse them off-line. One significant use of computers in this area has been in question banking; questions are kept in an on-line database [1,2] and accessed using selected attributes. Question banking systems provide for producing scrambled tests, randomizing the order of questions and randomizing the order of alternatives in multiple-choice questions. They also provide for formatting and printing of question papers.

However, it is possible to use computers more effectively, utilizing their interactive capabilities to achieve a lot more. Computers can be, and should be, used to administer on-line tests, evaluate the answers of the candidate and award marks.

Some of the advantages of automated testing are listed below:

a) Testing can take place throughout the year
b) Professional effort (for test design, evaluation, etc.) is not required every time the test is conducted
c) Automated scoring is possible without restricting oneself to multiple choice questions
d) Test results of candidates can be made available immediately
e) Such tests do not require much change in existing practices
f) Useful statistics can be gathered on-line, and used to improve the tests

In addition, the use of generative techniques provides the following advantages:

a) If a candidate has to be tested repeatedly on a concept, generative mechanisms provide variety. They avoid the monotony and pointlessness of giving him the same questions again and again.

b) The test continues to be useful even after it has been exposed to several thousand candidates.

Further, on-line testing opens up a whole new spectrum of possibilities. It is possible to test a candidate in situations which cannot normally be tested in a traditional test. For example, there could be a simulation of an experiment and the candidate could be tested on his understanding of that experiment. It is also possible to do what cannot normally be done in a traditional test. For example, a candidate can be given a second chance to answer a question (possibly with lower credit) if he is wrong on the first attempt. The computer can also advise a person to go to the next question when he has spent too much time on the current one, or offer him a hint at that time.

Veda is a generative on-line testing system developed at the National Centre for Software Technology (NCST). It has been primarily used to test candidates who apply for admission to NCST’s courses in Software Technology. Veda is also used in many courses being taught at NCST, covering subjects such as Pascal, UNIX, AI Programming, and Expert Systems.

In this paper, we describe Veda, its design, implementation and use. We initially present the motivation for the work and then show how tests are created using Veda. We then describe the overall design of Veda. Later sections deal with security features offered by Veda, the provision to compute statistical indices for item analysis, our experience with on-line testing, the limitations of Veda, possible extensions and finally our conclusions.

2 A Typical Application of Veda

Veda has been used extensively to test about a thousand candidates a year, over many years. These candidates take an entrance examination for NCST’s course in Software Technology. Out of these, about 90 are finally selected for admission. The entrance examination is split into two parts. The first part is a test of general aptitude that serves as a qualifying examination for the second part. The second part is a test of basic computer science and programming knowledge. This part is to be taken after the applicant reads recommended books on Computer Science and Pascal.

On-line tests based on Veda are available to cover both parts. Candidates are tested on the following domains: quantitative reasoning, concepts relating to high school maths, visuo-spatial reasoning, logical reasoning and verbal ability.

Part II, covering basic computer science and programming knowledge, was automated later.

3 Creating Generative Questions in Veda

Most questions in Veda are of the generative type. This section deals with the structure of generative questions and the process of their creation. Later sections give a description of the Veda system.

Consider the following question in geography:

What is the capital of India?

This question can be generalised to make it generative. The word ‘India’ can be made into a variable (a ‘slot’) so that the same generator can generate questions about capitals of different countries, thus:

What is the capital of <country>?

Individual instances are generated by substituting proper values in place of the slot <country>. This generator tests for the same type of information, i.e., the capital city of a country. It can generate different question instances by drawing a filler for the slot <country> from a given set of countries. The set [India, Pakistan, Sri Lanka, Bangladesh] can be one such set. The answer will depend on the filler selected for the slot <country>. In general, there can be more than one slot in a question template. In such a situation, the interdependence among slots could be complex.
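As an illustration, a single-slot generator of this kind might be sketched in C as follows. The structure `capital_entry`, the table `CAPITALS` and the function `make_capital_question` are hypothetical names chosen for this sketch; they are not part of the tool set provided by Veda.

```c
#include <stdio.h>

/* Hypothetical choice set for the slot <country>. Each filler carries
   the answer that goes with it. */
struct capital_entry {
    const char *country;
    const char *capital;
};

static const struct capital_entry CAPITALS[] = {
    { "India",      "New Delhi" },
    { "Pakistan",   "Islamabad" },
    { "Sri Lanka",  "Sri Jayawardenepura Kotte" },
    { "Bangladesh", "Dhaka" },
};

#define N_CAPITALS (sizeof CAPITALS / sizeof CAPITALS[0])

/* Fill the template "What is the capital of <country>?" with filler
   number idx and write the question text into buf.
   Returns the expected answer, or NULL if idx is out of range or the
   buffer is too small. */
const char *make_capital_question(size_t idx, char *buf, size_t buflen)
{
    if (idx >= N_CAPITALS)
        return NULL;
    int n = snprintf(buf, buflen, "What is the capital of %s?",
                     CAPITALS[idx].country);
    if (n < 0 || (size_t)n >= buflen)
        return NULL;
    return CAPITALS[idx].capital;
}
```

A caller would pick `idx` at random among the four fillers, present the text in `buf`, and compare the candidate's response with the returned answer.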

Now, let us consider a slightly more complex example. This example also illustrates how a generator evolves out of an instance. Consider a typical question on polygons.

The shape of a field is a square. Its side measures 100 metres. What is the area of the field?

This question can be made generative by identifying the parts that can be changed. For example the word ‘square’ can be changed to ‘rectangle’, ‘rhombus’ etc. So by making that part of the question a slot, we get a template that can generate a question on any quadrilateral. The template will be as follows:

The shape of a field is a <quadrilateral>. Its side measures 100 metres. What is the area of the field?

Now we can specify the set of quadrilaterals [square, rectangle, rhombus, ...] as the choice set for the slot <quadrilateral>. But the question is not yet complete. If the filler for the slot <quadrilateral> is anything other than square, the area cannot be computed from the information we have so far. This is because the description gives only one dimension and other quadrilaterals like rectangle can be specified fully only with two dimensions. The question, therefore, has to give more information depending on the filler for the slot <quadrilateral>. We therefore have to make the description of the quadrilateral a new slot <description> which will take different fillers depending on the filler for <quadrilateral>. The template now becomes:

The shape of a field is <quadrilateral>. <description>. What is the area of the field?

The filler for the slot <description> depends on the filler selected for the slot <quadrilateral>. For each choice of the filler for <quadrilateral>, the user has to specify the filler (a sub-template here) for the slot <description>.

If <quadrilateral> is ‘square’, then
<description> should be ‘The diagonal of the <quadrilateral> is <diagonal>’

If <quadrilateral> is ‘rectangle’, then
<description> should be ‘It has a diagonal measuring <diagonal> and a side measuring <side>’

If <quadrilateral> is ‘rhombus’, then
<description> should be ‘The diagonals of the <quadrilateral> are <diag1> and <diag2>’

If <quadrilateral> is ‘trapezium’, then
<description> should be ‘The two parallel sides are <side1> and <side2> and the distance between them is <dist>’

Now the template is complete and it can be used to generate questions on different quadrilaterals. We can make the template more general by making the answer that has to be computed also a variable (say <feature>) instead of being only area. Now the template becomes:

The shape of a field is <quadrilateral>. <description>. What is the <feature> of the field?

Then we can specify the set [area, perimeter, diagonal] as the choice set for slot <feature>. We have to be careful in specifying the choice set for <feature> because the filler selected for <feature> should not be a value given in the description, i.e., the filler for the slot <description>. For example, if the filler for <description> for a square gives the length of the diagonal, then the filler for <feature> should not be the length of the diagonal. It can either be area or perimeter. One way to ensure this is to select two mutually exclusive sets for <feature> and <description>. That means none of the parameters in the choice set for <feature> should be given in any of the fillers for the slots in the sub-template for <description>. But this may be too strong a restriction. For example, when the filler for <quadrilateral> is ‘rectangle’ and the filler for <description> contains only two sides of a rectangle, then the filler for <feature> can be anything from the set [diagonal, area, perimeter] depending on the filler for <description>. We can generate meaningful problems by ensuring that the features selected to describe the given object are not the ones about which a question is asked. The filler for <feature> can be anything from the set [area, perimeter, diagonal], provided that the filler selected for <description> does not have that particular value. This can be achieved by using an additional constraint:

when <quadrilateral> is ‘rhombus’ then <feature> not equal to ‘diagonal’

Thus, the process of creating generative questions involves the following stages:

a) Creating question templates with markers to identify variable parts (i.e. slots),
b) Specifying a set of possible fillers (choice set) for each slot, and the relationship among slots, and
c) Specifying the answer, as a procedure.
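The three stages can be sketched in C for the quadrilateral example. This is only an illustration under stated assumptions: the type and function names are hypothetical, the two-sides description for the rectangle is an invented filler, and the constraint is encoded as a bit mask of the features that a description already states, which is one possible representation and not necessarily the one used in Veda.

```c
#include <stddef.h>

/* Choice set for the slot <feature>, encoded as bit flags. */
enum feature { AREA = 1, PERIMETER = 2, DIAGONAL = 4 };

/* One filler for <description>, tied to a filler for <quadrilateral>. */
struct description {
    const char *quadrilateral;
    const char *text;     /* sub-template for <description> */
    unsigned    given;    /* features the description already states */
};

static const struct description DESCRIPTIONS[] = {
    { "square",    "The diagonal of the square is <diagonal>", DIAGONAL },
    { "rectangle", "Its sides measure <side1> and <side2>",     0u },
    { "rhombus",   "The diagonals of the rhombus are <diag1> and <diag2>",
      DIAGONAL },
};

#define N_DESCRIPTIONS (sizeof DESCRIPTIONS / sizeof DESCRIPTIONS[0])

/* Return description number i, or NULL if i is out of range. */
const struct description *description_for(size_t i)
{
    return i < N_DESCRIPTIONS ? &DESCRIPTIONS[i] : NULL;
}

/* Relationship among slots: a feature may be asked for only if the
   chosen description does not already give it. */
int feature_allowed(const struct description *d, enum feature f)
{
    return (d->given & (unsigned)f) == 0;
}

/* Answer procedure for the square case, where the diagonal is given.
   Returns -1.0 for a feature excluded by the constraint. */
double square_answer(double diagonal, enum feature f)
{
    static const double SQRT2 = 1.4142135623730951;
    switch (f) {
    case AREA:      return diagonal * diagonal / 2.0;
    case PERIMETER: return 2.0 * SQRT2 * diagonal;
    default:        return -1.0;
    }
}
```

With this encoding, the constraint "when <quadrilateral> is ‘rhombus’ then <feature> not equal to ‘diagonal’" follows from the `given` mask, while the rectangle described by two sides leaves all three features available.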

4 Creating Tests

Veda supports test designers in creating three types of questions, namely:

a) Question Generators,
b) Non-generative Questions and
c) Comprehension Questions, in which several questions are based on a common paragraph

All three types of questions are automatically graded.

The system provides a set of tools that can be used to develop question generators, making this task relatively easy. The designer of the generator uses his understanding of the structure of a type of question, and the generative mechanisms available, to create questions. See Figure 1 for an example. The non-generative questions mentioned in (b) and (c) above, on the other hand, are simply stored as structured text and retrieved when required.

--- Figure 1 comes here ---

Questions can be grouped into subject domains, such as quantitative, verbal, etc. These domains can be ordered in any sequence for presentation to the candidates. The domains are, therefore, the areas or topics on which the candidate is examined.

For each question, the test designer is allowed to specify the maximum time recommended for the candidate to answer the question. During test administration, on expiry of this time, the candidate is given a warning and advised to go to the next question. On pre-tested questions, this time could be set to the time within which 90% of the candidates from the test group who got the question right were able to answer it. The statistics collected by the system enable the computation of this time, called T90, for each question.
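A minimal sketch of how T90 might be computed from recorded answer times is given below. It assumes the nearest-rank definition of a percentile and takes as input only the times of candidates who answered the question correctly; the function name `t90` and this choice of percentile method are assumptions, since the text does not say how Veda computes the value.

```c
#include <stdlib.h>

/* Ascending comparison of two doubles for qsort. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* times[] holds the answer times (in seconds) of the candidates who
   got the question right. The array is sorted in place.
   Returns the smallest recorded time within which at least 90% of
   these candidates answered, or -1.0 if n is 0. */
double t90(double *times, size_t n)
{
    if (n == 0)
        return -1.0;
    qsort(times, n, sizeof times[0], cmp_double);
    size_t rank = (9 * n + 9) / 10;   /* ceil(0.9 * n) in integers */
    return times[rank - 1];
}
```

For ten recorded times of 10, 20, ..., 100 seconds, the rank is 9 and the function returns 90.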

Question generators are ‘C’ programs, which call generative tools in the form of a set of ‘C’ functions. After these programs are created, they can be easily integrated with the Veda system by just updating entries in the system files. Non-generative questions can also be added easily to the system’s databases using interactive facilities provided by the system.

Though the system is primarily meant for on-line testing, it can also be used to generate tests which can be duplicated and administered off-line as conventional written tests.

5 Test Administration

At the time of the test, the invigilator who looks after the system issues the candidate a roll number and a session password. The invigilator then starts a terminal session, entering his (the supervisor’s) password. The candidate then starts the test session using his session password and roll number. A combination of the two passwords is used to decrypt questions in the system.

At the beginning of the test, the candidate is given a familiarization session. This session, which lasts about five minutes, allows the candidate to get used to the terminal keyboard. It also presents sample questions which familiarise him with the format of the test. The candidate is then given the real test.

For each test, a qualifying mark can be fixed. When a question is presented, the number of additional marks that have to be obtained to qualify is displayed. The number of questions remaining in the test is also displayed, along with an indication of the remaining time. If a candidate gets a question wrong, he is allowed another attempt at it. If he tries again and his answer is correct, he is awarded half of the marks for the question. A candidate’s test can be terminated when it is clear that he does not have a chance of qualifying (i.e. if the number of marks yet to be earned is greater than the number of questions left). A candidate who obtains the qualifying mark can also have his test terminated (see footnote 1).
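The scoring and termination rules in this paragraph can be sketched as follows. The names are hypothetical, and the check assumes, as the parenthetical rule above implies, that each remaining question can contribute at most one mark.

```c
/* Possible outcomes of the progress check after each question. */
enum verdict { CONTINUE_TEST, STOP_CANNOT_QUALIFY, STOP_QUALIFIED };

/* Marks for one question: full marks if right on the first attempt,
   half marks if right on the second attempt, nothing otherwise.
   attempt_correct is 1 or 2, or 0 if neither attempt was right. */
double marks_awarded(double full_marks, int attempt_correct)
{
    if (attempt_correct == 1)
        return full_marks;
    if (attempt_correct == 2)
        return full_marks / 2.0;
    return 0.0;
}

/* Decide whether the test may be terminated early.
   Assumption: every remaining question is worth at most one mark. */
enum verdict check_progress(double marks_so_far, double qualifying_mark,
                            int questions_left)
{
    double still_needed = qualifying_mark - marks_so_far;
    if (still_needed <= 0.0)
        return STOP_QUALIFIED;
    if (still_needed > (double)questions_left)
        return STOP_CANNOT_QUALIFY;
    return CONTINUE_TEST;
}
```

A scheduler that is configured to run every test to completion would simply ignore the two stop verdicts.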

The total duration of the test can be set to any value desired by the test designer; no time limit is imposed on individual questions. The warning given at the end of the time specified for individual questions is only by way of advice, and a candidate may decide to ignore it.

At the end of the test, the system elicits and stores information about the candidate, for example his basic qualification, the field in which he has obtained his degree, and other information that may be relevant to those administering the test.

The system automates most of the routine administrative functions that have to be carried out in conducting a test. Except for an invigilator who needs no significant knowledge of the system, no one else is required to look after the day-to-day administration of tests. There are routines to produce hall tickets (with passwords) for the candidates, print out the mark lists of the candidates at the end of the day, and produce a daily report on the performance of candidates.

6 Veda: A General Overview

The general structure of Veda is shown in Figure 2. There are two system files that have to be maintained by the test designer. These are the ‘structure’ file and the ‘timeandmarks’ file. The ‘structure’ file specifies the domains included in the test, the order in which they are to be covered and the number of questions that have to be administered from each domain. The ‘timeandmarks’ file specifies the duration of the test, the total number of questions that should be administered to the candidate and the qualifying marks for the test.

--- Figure 2 comes here ---

The ‘init’ program validates a candidate and gives him a familiarization session. It determines the order of questions in a domain and initiates three schedulers, one for each category of questions. The init program passes a ‘master key’ (see footnote 2) and the order of the questions to the schedulers. The schedulers then present the questions to the candidate. In the case of the Question Generator Scheduler, a ‘question generator’ program is executed by an additional process to create the question. The schedulers are also responsible for maintaining relevant information about each question.

Lists of question generators are kept in files, each list corresponding to the domains in which a person is being tested. Databases of short answer questions, each database corresponding to a different domain, and lists of comprehension passages are kept in separate files.

For every candidate who takes a test, two files are created: a unique ‘test’ file and a unique ‘log’ file. The schedulers write the results for each question posed to the candidate in the ‘test’ file. The information about the candidate (name, address, qualification etc.) elicited at the end of the test is also stored in this file. The ‘log’ file is used to record (in encrypted form) all questions that are given to a candidate and all the answers given by the candidate. If required, this file can be examined by the technical supervisors (the people who are concerned with the overall management of the test), for example, to deal with a complaint about the system. There is considerable security in the entire process, including the process of examining the log.

Footnote 1: The test designer can arrange to let a test run to completion, presenting all questions to each candidate, or have it terminated as mentioned above.

Footnote 2: This is required for decryption of the question. See Section 8 for details.

7 UNIX Process Structure

To describe how the question generators in the system work, it is necessary to first outline the structure of processes in UNIX that execute the generators in Veda.

A ‘process’ is a program in execution. UNIX allows a user to have more than one process running at a time. It also provides mechanisms for one process to communicate with another. This feature is used in question generators in Veda.

The process that runs initially in Veda is the ‘init’ program, which then passes control to the Question Generator Scheduler. This scheduler determines which question generators are to be run, and then ‘forks’ into two processes. The scheduler now becomes a parent process and the other process is the child process. The child process overlays a question generator on itself and executes it (see Figure 3). The parent process waits until the child completes its task and then continues.

--- Figure 3 comes here ---

The parent has to pass the child a ‘key’ which we call the ‘master key’. This key is used to decrypt constituent items of text of the question to be produced, since these items are stored in files in encrypted form. The key is passed to the child process through a Unix ‘pipe’. Pipes in Unix allow processes that belong to the same hierarchy to communicate with each other. The child process then generates a question, takes the candidate’s response and evaluates his answer. After it has finished its work, it passes back information (such as the marks obtained by the candidate for the question, time taken to answer the question etc.) through the pipe to the parent.

This important feature allows the test designer to design generators as independent modules, and easily integrate them with the system. Each generator is a ‘C’ program that can be developed in total isolation (using the tools provided by Veda) and then later integrated with the system.
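The parent and child interaction described in this section can be sketched with the standard POSIX calls `pipe`, `fork` and `waitpid`. Several points are assumptions made for the sketch: it uses two pipes, one per direction; the child calls a stand-in function `run_generator` where Veda would overlay a separate generator program on the child; and the `result` structure and its fixed values are invented for illustration.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical record passed back from the child to the parent. */
struct result { int marks; int seconds; };

/* Stand-in for a question generator. It only checks that a key
   arrived and returns fixed, made-up values. */
static struct result run_generator(const char *key)
{
    struct result r = { key[0] != '\0' ? 1 : 0, 42 };
    return r;
}

/* Fork a child, send it the master key through one pipe and read the
   result back through another. Keys longer than 63 bytes are cut off
   by the child in this sketch. Returns 0 on success, -1 on failure. */
int run_child(const char *master_key, struct result *out)
{
    int to_child[2], to_parent[2];
    if (pipe(to_child) == -1)
        return -1;
    if (pipe(to_parent) == -1) {
        close(to_child[0]); close(to_child[1]);
        return -1;
    }
    pid_t pid = fork();
    if (pid == -1) {
        close(to_child[0]); close(to_child[1]);
        close(to_parent[0]); close(to_parent[1]);
        return -1;
    }
    if (pid == 0) {                        /* child process */
        char key[64] = {0};
        close(to_child[1]); close(to_parent[0]);
        if (read(to_child[0], key, sizeof key - 1) < 0)
            _exit(1);
        struct result r = run_generator(key);
        if (write(to_parent[1], &r, sizeof r) != (ssize_t)sizeof r)
            _exit(1);
        _exit(0);
    }
    /* parent process */
    close(to_child[0]); close(to_parent[1]);
    size_t len = strlen(master_key);
    int ok = write(to_child[1], master_key, len) == (ssize_t)len;
    close(to_child[1]);                    /* child sees end of key */
    ok = ok && read(to_parent[0], out, sizeof *out) == (ssize_t)sizeof *out;
    close(to_parent[0]);
    int status = 0;
    if (waitpid(pid, &status, 0) == -1)    /* wait for child to finish */
        return -1;
    return (ok && WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Because the parent only exchanges a key and a small result record with the child, each generator can be written and tested on its own, which is the modularity the paragraph above points out.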

8 Security Aspects

Since the whole test is automated, it is important to protect the files related to the test. Any unauthorised access to questions from Veda’s database must be rendered impossible. It is not enough to rely on Unix file security, since the privileged user ‘root’ (the system administrator) can override file security mechanisms.

Veda uses encryption as its main protection mechanism. The scheme we use is safe, even if there are security lapses by those controlling the computer. All files in the system are encrypted, thus making it impossible for intruders to get to know the questions on the system.

We envisage that three technical supervisors will look after the overall system. At least two of them are required to type in their passwords to encrypt (or re-encrypt) all files in the system. When the files in the system are encrypted, pairs of administrative and session passwords are produced. A different administrative password is generated for each day for which the test is being conducted. A session password is required for each candidate appearing for the test. Only a combination of the right administrative password and session password will allow a test to be taken.

When a candidate comes to take the test, the administrative supervisor types in his password (the administrative password needs to be typed in only once a day). The candidate also types in his password, independently. These passwords are combined to get the master key.

The program that determines the master key is part of the init program, which chooses the questions to be displayed to the candidate. The text of the question generators is encrypted before compilation. That is, only the alpha-numeric strings are encrypted and not the executable code. When a ‘question generator’ is invoked as a child process, the parent process sends the master key to the child, which uses it in decryption. Decryption of the question text is done in memory. Thus no plain text files are ever available in the filing system. Without the proper master key, the ‘question generating’ program that is invoked will not be able to carry out its task.
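The idea of combining two passwords into a master key and decrypting question text only in memory can be illustrated with a deliberately simple toy. The paper does not describe how the passwords are combined or which cipher is used, so both functions below are assumptions made purely for illustration: the key is a plain concatenation and the cipher is a repeating XOR, which is not secure and should not be taken as Veda's scheme.

```c
#include <stddef.h>
#include <string.h>

/* TOY key derivation (assumption): join the two passwords as
   "admin:session". Returns 0 on success, or -1 if a password is empty
   or the key buffer is too small. */
int make_master_key(const char *admin_pw, const char *session_pw,
                    char *key, size_t keylen)
{
    size_t a = strlen(admin_pw), s = strlen(session_pw);
    if (a == 0 || s == 0 || a + s + 2 > keylen)
        return -1;
    memcpy(key, admin_pw, a);
    key[a] = ':';
    memcpy(key + a + 1, session_pw, s + 1);   /* includes the NUL */
    return 0;
}

/* TOY cipher (assumption): XOR the buffer with the repeating key.
   Applying it twice with the same key restores the original bytes.
   The text is transformed in place, so no plain text is written to
   any file. */
void xor_in_place(unsigned char *buf, size_t n, const char *key)
{
    size_t klen = strlen(key);
    if (klen == 0)
        return;
    for (size_t i = 0; i < n; i++)
        buf[i] ^= (unsigned char)key[i % klen];
}
```

The sketch does show the property stated above: text transformed with one master key is restored only when the same pair of passwords is supplied again.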

Notice the multiple levels of security. At the top, no technical supervisor can compromise any information without the connivance of at least one other technical supervisor. The system is protected against misuse by the administrative supervisor or the candidate, since both the session password and administrative password are required to run this program. The master key may be changed at periodic intervals for added security.

9 Statistical Analysis of Test Scores

Some questions that test designers have to ask themselves are:

a) How reliable are our tests? Are measurements repeatable?
b) Do they measure what they should?
c) What can we do to improve our tests?
d) How good is our test in predicting real-life performance after the test?

Answers to the questions above may be obtained if we start with the observation: ‘Just as questions test people, candidates taking the test, test questions’. Test designers often have an ‘intuitive’ feel of how good a question is. Statistics is a powerful tool that allows test designers to get objective and quantitative measures of the quality of questions. However, teachers will not, in general, use statistical tools unless the necessary computations are automated and incorporated in testing software.

In the following sections, we introduce some indices that we have used in our work that have helped us to improve our tests. Veda incorporates utilities that take test files of candidates as input and produce these indices.

9.1 Predictive Value (PV)

Consider the classification of candidates into four categories with respect to

- performance in the test as a whole, and
- performance in answering a specific, single question correctly.

The categories are:

a) Candidates who have not qualified in the test and have answered the particular question wrong
b) Candidates who have not qualified in the test and have answered the question right
c) Candidates who have qualified in the test and have answered the question wrong, and
d) Candidates who have qualified in the test and have answered the question right.

With such a classification for each question, we would get a table like the one shown in Figure 4, where a, b, c and d denote the number of candidates in each category, e, f, g and h are row-wise and column-wise totals respectively, and k = a + b + c + d. Note that k is the total number of candidates who have attempted the question.

--- Figure 4 comes here ---

What do these numbers tell us? Suppose a + d = k. What does this mean? It really means that everyone who qualified in the test got the question right and everyone who did not qualify in the test failed to answer the question. Such a question classifies all candidates perfectly, and so is a very good one. On the other hand, if b + c = k, it is clear that the question is very bad.

We now define the Predictive Value (PV) of a question as the number of ‘properly’ classified candidates divided by the total number of candidates who attempted the question. In the case where b + c = k, PV = 0; where a + d = k, PV = 1. Based on the previous argument and these values of PV, we can infer that a question with a high PV is a good one and one with a low PV is a bad one.
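Taking the ‘properly’ classified candidates to be those counted in a and d, as the two limiting cases above indicate, the definition amounts to PV = (a + d) / k. A minimal C sketch follows; the function name and the convention of returning -1.0 when nobody has attempted the question are assumptions.

```c
/* Predictive Value of a question from the four category counts:
   a: not qualified, answered wrong   b: not qualified, answered right
   c: qualified, answered wrong       d: qualified, answered right
   Returns (a + d) / k with k = a + b + c + d, or -1.0 when k is 0. */
double predictive_value(unsigned a, unsigned b, unsigned c, unsigned d)
{
    unsigned k = a + b + c + d;
    if (k == 0)
        return -1.0;
    return (double)(a + d) / (double)k;
}
```

For counts a = 30, b = 10, c = 10 and d = 50, the question classifies 80 of 100 candidates properly, giving a PV of 0.8.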

9.2 Level of Difficulty (LD)

The Level of Difficulty (LD) gives a test designer a simple measure of how difficult a question is. The LD for a question is the ratio defined in Figure 5.

--- Figure 5 comes here ---

A question with an LD of 0.0 is very easy, since everyone has got it right. On the other hand, a question with an LD of 1.0 is very difficult; no one has got it right. Both kinds of question are undesirable in a test, as neither of them distinguishes one candidate from another. In other words, such questions do not contribute anything to the ‘discriminatory’ power of the test and so do not deserve inclusion in the test. A related question of relevance is: What should be the LDs of questions in a test? Is it better to have questions with an average level of difficulty? The answer seems to be yes; but the ‘average’ level of difficulty will have to be determined based on the level at which a test is meant to discriminate.
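Figure 5 is not reproduced here, so the exact ratio is not available. One ratio that is consistent with the two end points described above (0.0 when everyone is right, 1.0 when no one is right) is the fraction of candidates who answered the question wrong. The sketch below assumes that definition; the function name is hypothetical.

```c
/* Level of Difficulty under the assumption LD = wrong / attempted.
   Returns -1.0 if nobody attempted the question or if the counts are
   inconsistent. */
double level_of_difficulty(unsigned wrong, unsigned attempted)
{
    if (attempted == 0 || wrong > attempted)
        return -1.0;
    return (double)wrong / (double)attempted;
}
```

In terms of the categories of Section 9.1, `wrong` would be a + c and `attempted` would be k.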

9.3 Correlation Coefficient (CC)

If we correlate the scores obtained on a particular question with the total scores of the candidates for the test, the correlation coefficient will give us a measure of the ‘goodness’ of the question. A low correlation coefficient indicates that candidates who have got that question right have not done well in the test as a whole. On the other hand, a high correlation coefficient indicates that candidates who have got the question right have done well in the test. Obviously, the second question is a better question than the first.

--- Figure 6 comes here ---

The CC also provides information about the classification of a question. Consider a test consisting of questions in three domains, say quantitative reasoning, high school maths and visuo-spatial reasoning. The classification of questions is decided by the test designer when he designs questions. Questions usually test abilities in more than one domain, as the domains involved are not mutually exclusive. They may not strictly belong to a particular domain, though they are classified in that domain. If we carry out a correlation analysis of the question score to the domain scores, we may find that the correlation with one domain X is greater than the correlation with the other domain Y in which it was originally classified. We might then decide that the question has been wrongly classified and decide to change the classification to X, based on the results obtained. This reclassification helps us to have more accurate tests for the future.

9.4 Chi-Square Coefficient

The chi-square test employed by us in Veda evaluates the matrix of a, b, c, d given in Figure 4 to determine the statistical soundness of available evidence (see Figure 7). Unlike all other indices mentioned in this paper, the chi-square coefficient is dependent on the size of the test-taking population. The LD measure, for instance, may not change significantly as we go from information on 10 candidates to information on 100 candidates. But the chi-square coefficient does change significantly.
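Figure 7 is not reproduced here. As an assumption, the sketch below uses the standard Pearson chi-square statistic for a 2 x 2 table without continuity correction, k(ad - bc)^2 divided by the product of the two row totals and the two column totals; the formula in Figure 7 may differ in detail.

```c
/* Pearson chi-square statistic for the 2 x 2 table of counts a, b, c, d
   (no continuity correction). Returns 0.0 when any row or column total
   is zero, since the statistic is undefined in that case. */
double chi_square_2x2(double a, double b, double c, double d)
{
    double k = a + b + c + d;
    double totals = (a + b) * (c + d) * (a + c) * (b + d);
    if (totals == 0.0)
        return 0.0;
    double diff = a * d - b * c;
    return k * diff * diff / totals;
}
```

The dependence on population size is easy to see with this formula: the counts (3, 1, 1, 3) give 2.0, while the same proportions with ten times as many candidates, (30, 10, 10, 30), give 20.0.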

- - - Figure 7 comes here - - -

9.5 Equivalence of tests

A major feature of automated on-line testing is that each test that is administered is generated when needed. This is in contrast to conventional testing where a copy of one test is administered to a whole batch of candidates. In on-line testing, a specified number of questions can be selected from a large question bank and administered to each candidate at the time of the test. This has an obvious advantage: a candidate can get little benefit by consulting someone who has taken the test earlier. The protection of information about the test is further enhanced if the questions are generated rather than retrieved from passive storage. No candidate can then pass on specific numbers, names etc., as the answers expected, to other candidates.

However, this creates a new problem. If the test given to one candidate varies from that given to another, how do we compare the performance of one candidate with that of another? The problem can be posed in the following form, making reasonable assumptions and simplifications.

If there is a question bank of M questions and each candidate is given N questions randomly picked out of these M, how do you ensure that tests given to different candidates are equivalent?

Statistics provides the answer to this important question, if we accept a simple model described below. Let us assume that questions are drawn from a large population of questions with a mean level of difficulty, averaged over all questions, of LDp, and a standard deviation over the whole population of questions, SDp. Each test (set of questions) drawn from this population is a sample. We can, therefore, estimate the likely variation in the mean level of difficulty of the test, as we go from the test given to one candidate (consisting of N questions randomly selected from the total population of questions) to the test given to another candidate.

An estimate of this, in the form of probable error of the level of difficulty of the test (in comparison to the LDp) is

SDp/square-root(N)

where N is the number of questions in the test.

As an indication of variability, in one of the series of tests we conducted we found that:

LDp = 0.43, SDp = 0.21 and Probable error in the level of difficulty of a test is equal to 0.05.

The convention used here involved normalizing the maximum score to 1.0. Therefore variability in LD is of the order of 5%. This figure is well within the variability seen in conventional testing.

The variability in LD from one test to another will be much less if a test covers several domains, and a fixed number of questions are selected from each domain. This will amount to stratified random sampling and will result in a smaller variation of LD from test to test.
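A minimal sketch of such stratified selection is given below. It assumes each question in the bank carries a domain number from 0 to num_domains - 1. The struct, the function name and the `MAX_BANK` limit are hypothetical and are not taken from Veda.

```c
#include <stdlib.h>

#define MAX_BANK 256   /* illustrative upper bound on the bank size */

struct question {
    int id;       /* identifier of the question or generator */
    int domain;   /* domain index, 0 .. num_domains - 1 */
};

/* Pick per_domain distinct questions at random from every domain.
 * Writes the chosen ids to selected_ids (room for num_domains *
 * per_domain entries) and returns how many were written, or -1 if the
 * bank is too large or some domain has too few questions.
 * rand() % n is slightly biased; that is acceptable for a sketch. */
int select_stratified(const struct question *bank, int bank_size,
                      int num_domains, int per_domain, int *selected_ids)
{
    int count = 0;

    if (bank_size > MAX_BANK)
        return -1;

    for (int d = 0; d < num_domains; d++) {
        int pool[MAX_BANK];
        int pool_size = 0;

        /* Collect the bank positions of all questions in domain d. */
        for (int i = 0; i < bank_size; i++)
            if (bank[i].domain == d)
                pool[pool_size++] = i;

        if (pool_size < per_domain)
            return -1;

        /* Partial Fisher-Yates shuffle: the first per_domain entries
         * of pool become a random sample without replacement. */
        for (int j = 0; j < per_domain; j++) {
            int r = j + rand() % (pool_size - j);
            int tmp = pool[j];
            pool[j] = pool[r];
            pool[r] = tmp;
            selected_ids[count++] = bank[pool[j]].id;
        }
    }
    return count;
}
```

Because every test draws the same number of questions from each domain, differences in difficulty between domains no longer contribute to the variation of LD from test to test.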

Based on this feedback, some minor changes were carried out on the system. Veda was later extensively tested in NCST in informal use and then put to regular use, to screen applicants to the one-year part-time course. About 600 to 1000 applicants have taken on-line tests generated by the system every year, over three years.

Summary of Experience and Findings

a) It was found that a familiarization session carrying no marks, prior to the actual test, was extremely useful. Most of the candidates did not have problems using a terminal. This is probably because they had to do very little typing.

b) The candidates appreciated their results being declared immediately on completion of the test.

c) All questions in the test were of the short answer type, involving a number, a word or a few words being typed in. Such questions eliminate the errors due to guessing that are common with multiple-choice questions. To avoid ambiguities, it was found necessary to provide hints along with the questions. These hints have to help the candidate uniquely determine the answer (see example in Figure 8).

- - - Figure 8 comes here - - -

d) Tests were initially administered on a HP9040 running HP-UX. Later they were administered on the VAX8600 running Ultrix. The system was found to be stable, and performed well even when the host system was fully loaded.

e) The major overhead of the system arises from the ‘fork’ operation involved in running a question generator. Each question generator creates an additional process on the system. On the VAX8600, with 16 megabytes of memory, and about 20 other users, we were able to run 20 tests at a time with no problems whatsoever. This was adequate for our needs, and is probably much lower than the limit set by the capabilities of the computer. On any good general purpose computer running UNIX, every terminal available should be able to run a test, at any given time.

10 Limitations and Possible Extensions

Veda, like any other software system, has its own limitations. Some of these were envisaged beforehand. We have overcome some of these limitations now. Some existing limitations and possible extensions are given below:

a) A person cannot skip a question and come back to it later. The facility to skip questions can be provided by storing the generated question along with the answer in a file (in encrypted form). We can then allow candidates to refer to questions skipped previously, and to attempt to answer them.

b) The generators had to be programmed in C. We have developed an Authoring System for Veda [4]. This system takes a high level specification corresponding to a generative question and creates the C program that Veda requires. Thus Veda has been made easier to use, by the creation of this authoring system, as the test designer need not have programming experience.

c) At present, we are using the statistical indices mentioned in section 9 to help the designer determine the goodness of a question. Ideally one should have these indices stored along with the questions. These indices could then be used to select the questions for a test. For example, a test designer may want only questions with a level of difficulty (LD) greater than 0.5. By using this index, the designer could have better control on the difficulty of the questions in the test. Currently we do not have a provision to store these parameters with the questions.
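The extension proposed in (c) could look like the sketch below. Since Veda does not currently store indices with the questions, everything here is hypothetical: the record layout, the field names and the function `select_by_difficulty` are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical record: a question together with its stored index. */
struct stored_question {
    int id;      /* identifier of the question or generator */
    double ld;   /* level of difficulty, between 0.0 and 1.0 */
};

/* Copy into out_ids the ids of all questions whose level of difficulty
 * is strictly greater than min_ld. out_ids must have room for n ids.
 * Returns the number of ids written. */
size_t select_by_difficulty(const struct stored_question *bank, size_t n,
                            double min_ld, int *out_ids)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (bank[i].ld > min_ld)
            out_ids[count++] = bank[i].id;
    return count;
}
```

Calling this with `min_ld` set to 0.5 corresponds to the designer's request in the example above; the other indices of section 9 could be stored and filtered in the same way.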

11 Applications for Veda

Applications that we envisage for Veda include the following:

a) Veda can be used with a Computer Aided Instruction system. The CAI system could teach a candidate concepts in a certain subject. Veda could then test them to make sure they have understood what has been taught. Veda could be extended to provide references to remedial material that should be looked into, pin-pointing concepts that the candidate has not understood.

b) Veda provides a mastery learning mode which we use quite frequently in our courses. The intention in this mode is not to award marks, but to allow the candidate to have multiple attempts at a test and determine what concepts he needs to revise. Veda keeps track of questions he has got wrong in an attempt and gives only those questions in subsequent attempts.

c) Veda can be used for Distance Learning applications. Course participants could use remote access to a Veda system, over a data communication network. Veda could play a valuable role in evaluating student progress and in giving feedback.

d) Veda can be used for highly centralized testing, as in an Open University or a National Testing Service.

12 Conclusions

Testing is an integral part of instruction. In general, teachers are not able to test candidates as often as they would like to. This is because of the effort involved in designing and conducting tests. On-line testing seems to be one way of relieving the burden of a teacher. With on-line testing, it would be possible for candidates to take a test almost anytime, in any subject, and get feedback on their ability and knowledge.

Considering the limited resources available in a developing country, testing can play an invaluable role. Improvements in evaluation of student performance, using computer based testing, can be more cost-effective than automating the bulk of the instructional process.


Acknowledgements

The work described in this paper was carried out as part of the Knowledge Based Computer Systems Project funded by the Department of Electronics, Government of India, with assistance from the United Nations Development Programme.

Several colleagues at NCST have contributed to the efficient administration of tests, and to the use of Veda in testing. We are grateful to all of them for their cooperation.

References

1. R Chandrasekar, SB Chikarmane, PM Desai, SD Laud and R Ramanan. Design and Implementation of an Instructional Data Base, in: Proceedings of EDINFO-82, International Symposium on Education in Informatics (Computer Society of India, Madras, 1992).

2. G Lippey (ed). Computer-Assisted Test Construction (Educational Technology Publications, New Jersey, 1974).

3. S Ramani and A Newell. On the Generation of Problems, Technical Report, Dept of Computer Science, Carnegie Mellon University, November 1973.

4. P Srinivas, KN Prakash, KSR Anjaneyulu and S Ramani. An Authoring System for Generative Testing, in: V Rajaraman (ed), Proceedings of the Knowledge Based Computer Systems Conference - KBCS ’88 (Computer Society of India, Bangalore, 1988) 64-71.


The area of a {pentagonal | hexagonal | octagonal} field is {A} square metres. If all its sides are equal in length, how long is a side of the field?

Constraints: Side S is in the range 6.5 to 10.5, with step 1. Number of sides N = 5, 6 or 8.

Algorithm for Generation:
    Choose S and N randomly, within the constraints specified.
    A = (N * S * S) / (4 * tan(180/N))
    Substitute the computed quantity for A in the question.
    Answer = S

Note: In this example, there is no need to ‘compute’ the correct answer, as you generate the question starting with the correct answer. However, other questions may require the answers to be computed, as a function of the attributes (slots).

Figure 1: Example of a Question Generator

Components shown in the figure:

    Structure file
    Init Program
    Time and marks file

    Question Generators Scheduler       List 1 ... List K of QGs
    Short Questions Scheduler           Database 1 ... Database L of SQs
    Comprehension Questions Scheduler   List 1 ... List M of CQs

Figure 2: Veda - A General Overview


     _____________________
    | Question Generator  |
    |     Scheduler       |
    |  (Parent Process)   |
    |_____________________|
              |
              |------------------   create another process and
              |                 |   execute the generator
      wait until the            |
      child finishes     _______|_________
              |         |    Question     |
              |         |    Generator    |
              |         | (Child Process) |
              |         |_________________|
              |                 |
              |                 |   terminate
              |------------------
              |
      store results in
      candidate’s file

Figure 3: Execution of Question Generators
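The parent and child interaction of Figure 3 can be sketched with the POSIX calls `fork`, `execv` and `waitpid`. The function name and arguments are hypothetical. As a simplification that is not taken from Veda, the child's standard output is redirected straight into the candidate's file, whereas in Figure 3 the parent stores the results after the child terminates.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run one question generator as a child process and wait for it.
 * Assumption (not from the source): the generator's standard output is
 * appended to candidate_file by redirection in the child.
 * Returns the generator's exit status, or -1 on error or if the child
 * did not exit normally. */
int run_generator(const char *generator_path, char *const argv[],
                  const char *candidate_file)
{
    pid_t pid = fork();              /* create another process */
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: redirect output, then execute the generator. */
        int fd = open(candidate_file, O_WRONLY | O_CREAT | O_APPEND, 0600);
        if (fd < 0)
            _exit(127);
        if (dup2(fd, STDOUT_FILENO) < 0)
            _exit(127);
        close(fd);
        execv(generator_path, argv);
        _exit(127);                  /* reached only if execv failed */
    }

    /* Parent: wait until the child finishes. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Each call costs one additional process for the lifetime of the generator, which is the overhead discussed in the summary of experience.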

How well does the question discriminate?

                     Got Ques    Got Ques
                     Wrong       Right       Totals
                    _______________________
    Not Qualified  |           |           |
    in Test        |     a     |     b     |    e
                   |___________|___________|
    Qualified      |           |           |
    in Test        |     c     |     d     |    f
                   |___________|___________|

    Totals               g           h          k

PV = (a + d)/(a + b + c + d) = (a + d)/k

Figure 4: Predictive Value


How Difficult is a Question?

           No. of candidates who failed
           to answer the question correctly     a + c
    LD  =  --------------------------------  =  -----
           Total no. of candidates                k

Figure 5: Level of Difficulty
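The two indices of Figures 4 and 5 follow directly from the four counts. A minimal sketch, with hypothetical struct and function names:

```c
/* Counts from the table of Figure 4:
 * a = not qualified, question wrong;  b = not qualified, question right;
 * c = qualified, question wrong;      d = qualified, question right. */
struct item_counts {
    int a, b, c, d;
};

/* Predictive value: PV = (a + d) / k, where k = a + b + c + d.
 * Returns 0.0 when there are no candidates. */
double predictive_value(struct item_counts t)
{
    int k = t.a + t.b + t.c + t.d;
    return k == 0 ? 0.0 : (double)(t.a + t.d) / k;
}

/* Level of difficulty: LD = (a + c) / k, the fraction of candidates
 * who failed to answer the question correctly.
 * Returns 0.0 when there are no candidates. */
double level_of_difficulty(struct item_counts t)
{
    int k = t.a + t.b + t.c + t.d;
    return k == 0 ? 0.0 : (double)(t.a + t.c) / k;
}
```

For made-up counts a = 30, b = 10, c = 5, d = 55, these give PV = 0.85 and LD = 0.35.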

Does the question fit in the test?
Is it properly classified?

Consider

a) Correlation of Question score with total score for that domain

b) Correlation of Question score with total test score.

Figure 6: Correlation Coefficient

$\chi^2 = \frac{(bc - ad)^2 \, k}{e \, f \, g \, h}$

Figure 7: Chi-Square Test

Complete the following analogy:

satellite : orbit :: rocket : _ _ _ _ _ _ _ _ _ _ (10 letters)

Hint: t _ _ _ _ _ _ _ _ _

* The hint tells the candidate that the answer starts with the letter ‘t’, and is a word ten letters long

Figure 8: Example of a Short-Answer Question
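A hint of the form shown in Figure 8 can be derived mechanically from the stored answer. The sketch below uses a hypothetical function, `make_hint`. The usage note assumes that the intended answer to the analogy is "trajectory", which the figure does not state.

```c
#include <stddef.h>
#include <string.h>

/* Build a hint consisting of the first letter of the answer followed
 * by one " _" for every remaining letter, e.g. "cat" gives "c _ _".
 * buf must hold at least 2 * strlen(answer) bytes.
 * Returns 0 on success, -1 if the answer is empty or buf is too small. */
int make_hint(const char *answer, char *buf, size_t size)
{
    size_t len = strlen(answer);
    size_t pos = 0;

    if (len == 0 || size < 2 * len)
        return -1;

    buf[pos++] = answer[0];
    for (size_t i = 1; i < len; i++) {
        buf[pos++] = ' ';
        buf[pos++] = '_';
    }
    buf[pos] = '\0';
    return 0;
}
```

Under that assumption, `make_hint("trajectory", buf, sizeof buf)` produces the hint line of Figure 8, and `strlen` of the answer supplies the letter count shown beside the blank.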
