
Journal of Personnel Evaluation in Education 10:83-92, 1996 © 1996 Kluwer Academic Publishers, Boston - Manufactured in The Netherlands

Instructor Evaluation Ratings: A Longitudinal Analysis

LARRY H. LUDLOW, Ph.D. Boston College, School of Education, 140 Commonwealth Avenue, Chestnut Hill, MA, 02167-3813

It is a fact of life in the academic environment that tenure, promotion, and yearly salary decisions are based, in part, upon student evaluation ratings of faculty classroom performance. From one institution of higher education to the next, evaluation instruments take on various forms and formats as they attempt to gather performance information about individual faculty members. Likewise, from one institution to the next, the degree to which administrators, typically deans, incorporate this information into some overall conclusion about the worth of a faculty member to the institution is often vague at best. The point is that when high-stakes decisions are based upon one year's data, or even some perception of a pattern over a number of years, there may be factors beyond the control of the instructor that have influenced the ratings received in a given year.

The literature is voluminous with respect to student ratings of instructors (Millman & Darling-Hammond, 1990; Theall, Franklin, & Ludlow, 1990). What is missing, however, is any discussion at the micro level that attempts to understand the dynamics leading to the ratings received by a single instructor over an extended period of teaching.

The purpose of this paper is to present a methodology that is capable of modeling longitudinally the performance ratings of an instructor as a function of typical classroom environment variables. The significance of this approach is that it enables an evaluation of an instructor to be taken in a longitudinal context, as opposed to a single year's snapshot of performance. Such a purpose and methodology are also consistent with The Personnel Evaluation Standards (The Joint Committee on Standards for Educational Evaluation, 1988).

METHOD

Sample

The evaluation data cover a ten-year period ending with the spring semester of 1994. Within this period, the student evaluations of 49 courses were available. The evaluations were administered at the end of each semester and were answered anonymously. A total of 1189 students submitted evaluations. The courses ranged from freshman Child Development to doctoral-level final-semester Psychometric Theory.

Over this period the faculty member held the ranks of instructor, assistant professor, and associate professor. The teaching specialization was that of applied statistics in a Department of Education.

Variables

It is widely believed among statisticians (even if not necessarily widely published) that the low evaluations that often result from required introductory statistics courses are more a function of course content that is alien to the student than of the skill or competence of the instructor. In fact, such evaluations may result in the instructor being advised that the ratings are below the school average, yet the same instructor may have superb ratings when evaluations from specialty courses are considered. How, then, should the teaching component of an academic's profession be equitably determined by the administration?

It would seem that some long-term analysis of teaching performance would be useful. For example, do ratings improve over time? Are ratings related to course content? A function of class size? A function of the amount of time a student spends on the material?

From classroom evaluations beginning in the fall of 1983 and ending in the spring of 1994, the following data were gathered: Course number (030 to 960); Year course taught (84 to 94); percent Excellence rating of instructor (5% to 100%); class Size (4 to 54); and percent Time spent on the class by students relative to other classes (the percent of students reporting "much more" or "more" time: 0% to 100%).

The Course variable was further broken down by the number of times it had been taught. For example, the first presentation of ED468 (Introductory Statistics) was coded as 468.01. This particular class was taught most frequently and its last entry was 468.11. The purpose for such coding was to enable an analysis of any particular course over time. The data were entered into the database sequentially as the course was taught.

In essence, Course is a proxy variable for the sophistication of the course content. Size serves as a proxy for the effect of administrative policy. Time serves as a proxy for the student effort associated with a given class. Year serves as a proxy for the instructor's experience gained over time. These four variables will be considered the independent variables. Excell serves as the dependent variable.
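
As a purely illustrative sketch (not the original SYSTAT file), the data layout and the course-by-offering coding described above might look like the following; all rows and variable names here are invented for exposition.

```python
import pandas as pd

# Hypothetical rows illustrating the coding scheme described above:
# "course" carries the offering index as decimals, so 468.01 is the
# first offering of ED468 and 468.11 the eleventh.
df = pd.DataFrame(
    [
        # course, year, excell (% rated excellent), size, time (% more time)
        (468.01, 84, 30, 54, 45),
        (468.02, 85, 35, 40, 50),
        (669.04, 89, 25, 31, 80),
        (960.01, 93, 90,  6, 95),
    ],
    columns=["course", "year", "excell", "size", "time"],
)
df["base_course"] = df["course"].astype(int)   # e.g., 468
df["offering"] = (df["course"] % 1).round(2)   # e.g., 0.01 ... 0.11
print(df)
```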

RESULTS

The Pearson correlations among the five variables are reported in Table 1. A simple scan of this table reveals some fascinating relationships.

Table 1. Zero-order Correlations Among Evaluation Variables (N = 49).

          Excell   Course   Size    Year
Course      .64
Size       -.57     -.63
Year        .17      .54    -.12
Time        .27      .65    -.31     .55
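
A Table 1-style matrix can be reproduced with modern tools, as in the sketch below. The data frame is synthetic (the original 49-course file is not published), so only the computation pattern, not the values, should be compared with Table 1.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 49-course evaluation file.
rng = np.random.default_rng(0)
n = 49
df = pd.DataFrame({
    "course": rng.integers(30, 961, n),   # course number
    "year":   rng.integers(84, 95, n),    # year taught
    "size":   rng.integers(4, 55, n),     # class size
    "time":   rng.integers(0, 101, n),    # % reporting more time spent
})
df["excell"] = (20 + 0.05 * df["course"] - 0.5 * df["size"]
                + rng.normal(0, 15, n)).clip(5, 100)

# Zero-order Pearson correlations among the five variables
corr = df[["excell", "course", "size", "year", "time"]].corr()
print(corr.round(2))
```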

For example, the correlation between Excell and Time spent on the course is r = .27. The amount of time spent by the student on the course was positively related to the instructor's excellence ratings. This seems unexpected given that these were predominantly statistics courses that were being evaluated. It does make more sense, however, when it is seen that the relation between Time and Course is r = .65 and the relation between Course and Excell is r = .64. The workload increased as the courses became more technical, and the ratings rose in those more technical courses.

The relation between Excell and Size (r = -.57) is consistent with the instructor's initial expectation of a negative relation. This finding is not surprising given that the relation between Size and Course is r = -.63. Smaller classes seem to have led to better learning environments (e.g., greater student/instructor interaction and individualized instruction) and higher ratings.

The relationship between Excell and Year (r = .17) is interesting in the context of the initial problem that prompted this analysis. As the years went by, the instructor's ratings increased, even though the relation between Year and Time (r = .55) indicates that the instructor increased the workload. Obviously, other simple conclusions may be extracted from this table.

For our analysis purposes, these correlations do indicate that the evaluation ratings are a function of typical classroom environment variables. The next analyses address consequences of these findings. Specifically, we may ask questions of the following form: "What effect does increasing class size have upon the excellence ratings?"

The first analysis consisted of a simple ordinary least squares regression (Pedhazur, 1982) of Excell upon Course. The resulting equation was

Excell = 3.391 + .062 * Course (t_b = 5.27, F = 27.776, p < .001).
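
A minimal sketch of this simple regression with modern tools follows, again assuming a synthetic stand-in for the unpublished data file; the fitted coefficients will not match the published equation exactly.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data built to resemble the published relation.
rng = np.random.default_rng(1)
df = pd.DataFrame({"course": rng.integers(30, 961, 49)})
df["excell"] = (3.4 + 0.062 * df["course"]
                + rng.normal(0, 18, 49)).clip(5, 100)

fit = smf.ols("excell ~ course", data=df).fit()
print(fit.params)               # intercept and slope
print(fit.tvalues["course"])    # t statistic for the slope (t_b above)
print(fit.fvalue, fit.f_pvalue) # overall F test
```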

This relation is seen in Figure 1. In addition to the scatterplot, the regression line and the .95 confidence intervals are graphed. Clearly, as Course number (or level of technical content) increased, so did the percent of excellence ratings. Note, however, a most interesting point in the plot. At about Course number 800, one rating appears surprisingly lower than expected. This course was numbered 765 (Achievement Test Construction). Further investigation revealed that this course is no longer numbered 765 but instead has been changed to 462.

Figure 1. Excellence rating by course number (765). [Scatterplot of excellence rating (EX) against course number, with the regression line and lower/upper .95 confidence interval bands (LLINE/ULINE); r = .61, b = .06.]

When the data file was modified to reflect this change, the solution, too, changed. This particular data point is referred to as an influential observation (Belsley, Kuh, and Welsch, 1980) because of its overall effect upon the analysis. Other evidence of its influence upon the solution was available through the magnitude of the studentized residual (the largest negative of the 49 residuals) and the magnitude of Cook's D, again the largest among the underestimates of the predicted excellence rating (Cook and Weisberg, 1982).
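
As an illustrative aside (not the original SYSTAT computation), the influence diagnostics named above, studentized residuals and Cook's D, can be sketched as follows. The data are synthetic, with one point deliberately mis-coded to mimic the 765-for-462 course-number error.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the 49-course file, plus one mis-coded point.
rng = np.random.default_rng(2)
df = pd.DataFrame({"course": rng.integers(30, 961, 48).astype(float)})
df["excell"] = (2 + 0.066 * df["course"] + rng.normal(0, 15, 48)).clip(5, 100)
df.loc[len(df)] = {"course": 765.0, "excell": 20.0}   # the suspect rating

fit = smf.ols("excell ~ course", data=df).fit()
infl = fit.get_influence()
stud = infl.resid_studentized_external    # studentized residuals
cooks = infl.cooks_distance[0]            # Cook's D values
print("largest |studentized residual| at row", np.abs(stud).argmax())
print("largest Cook's D at row", cooks.argmax())

# Correct the course number and refit, as the text describes.
df.loc[df["course"] == 765.0, "course"] = 462.0
print(smf.ols("excell ~ course", data=df).fit().params)
```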

Figure 2 reflects the modification to the course number from 765 to 462. No longer does there appear to be a surprisingly low rating for any of the upper-level courses. The regression solution now becomes

Excell = 2.017 + .066 * Course (t_b = 5.639, F = 31.8, p < .001).

In essence, the regression coefficient indicates that a 100-unit change in Course number resulted in an approximate increase of 7 percentage points in excellence ratings. All the following analyses are based on the modified data.

Figure 2. Excellence rating by course number (462). [Scatterplot of excellence rating against course number after the recoding, with the regression line and .95 confidence intervals; r = .635, b = .066.]

There is another interesting point in Figure 2. At about course number 666 in the plot, three points fall below the regression line. The lowest of these points is actually course ED669 (Psychometrics). This point represents the fourth time this course had been taught.

The other three ratings for this course are the three highest ratings at this location. This could be determined because the Course variable was also coded according to the number of times it had been taught. Obviously, something peculiar happened during this fourth presentation of the course. Further investigation revealed that the class size had nearly doubled from previous offerings. This was because, for the first time, master's-level students were admitted. They had requested to be allowed to take the course because it was only offered every other year, and they did not want to miss it.

Unfortunately, they generally did not have the requisite statistical background. This was reflected in their evaluations: there were numerous complaints about the intensive statistical detail in the lectures. Accordingly, they did not rate the course very highly. In addition to this problem, the weather in New England that winter and early spring was terrible. A number of lectures were canceled due to the weather and could not be made up. Numerous complaints were noted about the smaller than usual number of lectures. These, too, resulted in lower than usual course ratings.

At this point another analysis was conducted. Given the relationship between Course level and Excell ratings, it was hypothesized that if the data were sorted by Course and the Durbin-Watson statistic d were computed, there might be an autocorrelation effect (Chatterjee and Price, 1977). Specifically, this would mean that low-level courses would tend to have higher predicted ratings than were actually recorded, while the upper-level courses would tend to have lower predicted ratings than were observed.

The data were sorted on Course and the same regression of Excell upon Course was conducted. The d was computed (d = 1.16) upon the resulting residuals and compared against tabled values for n = 50 and alpha = .05. The critical values were d_L = 1.50 and d_U = 1.59. Since d < d_L, the null hypothesis of no autocorrelation was rejected. The lag-1 correlation was .414 with a standard error of .143. An inspection of the residuals revealed a clear serial effect: lower-level course ratings were less than predicted (particularly for the Introductory Statistics sections) and the upper-level course ratings were higher than predicted under a linear regression model. There was no significant autocorrelation effect upon the original unsorted data (r = -.155, se = .143).
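
An illustrative sketch of this check follows; the data are synthetic, and the critical values are simply hard-coded from published Durbin-Watson tables rather than computed.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import durbin_watson

# Synthetic stand-in for the 49-course file.
rng = np.random.default_rng(3)
df = pd.DataFrame({"course": rng.integers(30, 961, 49)})
df["excell"] = (2 + 0.066 * df["course"] + rng.normal(0, 15, 49)).clip(5, 100)

# Sort by course level, regress, and test the residuals for serial
# correlation, as in the analysis above.
df = df.sort_values("course")
resid = smf.ols("excell ~ course", data=df).fit().resid
d = durbin_watson(resid)
lag1 = resid.autocorr(lag=1)
print(f"d = {d:.2f}, lag-1 r = {lag1:.3f}")
# Compare d against tabled critical values (n = 50, k = 1, alpha = .05:
# d_L ~ 1.50, d_U ~ 1.59); d < d_L rejects "no positive autocorrelation".
```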

The next regression analysis (upon the unsorted data) was of Excell upon Size. The solution was

Excell = 55.299 - .967 * Size (t_b = -4.773, F = 22.786, p < .001).

Each additional student enrolled in a class resulted in a decrease of about one excellence-rating percentage point. Figure 3 represents this relationship. It is clear that smaller courses tended to have higher ratings, although the largest class of all had a surprisingly high rating. Further investigation revealed that this class was the first Introductory Statistics course taught by the instructor, and that its content and format were considerably different from the content taught and format followed by previous instructors.

Figure 3. Excellence rating by class size. [Scatterplot of excellence rating against class size, with the regression line and .95 confidence intervals; r = -.571, b = -.967.]

The next analysis regressed Excell upon Time. The solution was

Excell = 20.945 + .186 * Time (t_b = 1.885, F = 3.552, p < .07).


Although not statistically significant, the regression coefficient suggests that an increase of about 6 percentage points in the share of students reporting more time spent on these courses corresponded to one additional excellence-rating percentage point.

Finally, Excell was regressed upon Year. The solution was

Excell = -88.65 + 1.35 * Year (t_b = 1.215, F = 1.48, p < .25).

Although not statistically significant, the regression coefficient suggests a one-point rise in excellence ratings for each additional year taught.

Given the results of the simple regressions and the fact that the predictors are themselves intercorrelated, a multiple regression was conducted. This regression followed a specific theoretical model. Rather than employ some variation of a stepwise procedure, a forced-entry procedure was employed. The order in which the variables were entered into the model was designed to test those effects in the order by which they were most closely identified with controllable classroom practices. Consequently, Course was entered first, followed by Size, Time, and Year.

The overall four-predictor solution was statistically significant (F = 10.466, p < .001), but only one predictor of the four actually accounted for significant variation (Course: t_b = 3.56, p < .001). The next step, therefore, was to remove extraneous error variation one variable at a time to determine if the effects of multicollinearity were suppressing the significant effect of any of the other variables.

All variables were forced into a single solution. Then, using an interactive, backward deletion procedure (SYSTAT, 1992), variables were removed from the solution one at a time. The first variable removed was Year. At that point the variation accounted for by Size nearly reached statistical significance (p = .089). Time was the next variable to be removed. At this point Size achieved statistical significance. The residual distributions were checked through probability plots and histograms, and no noticeably unexpected patterns were seen (Chambers, Cleveland, Kleiner, and Tukey, 1983). The final solution became

Excell = 22.099 + .047 * Course - .48 * Size (t_b1 = 3.23, p = .002; t_b2 = -2.01, p = .05; F = 18.9, p < .001; se of estimate = 17.8; r² = .45).
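
A rough sketch of the backward-deletion step follows, under the assumption of a data file like the synthetic ones above. This mimics, but is not, the interactive SYSTAT procedure; the p-to-remove threshold of .05 is an assumption, since the original procedure left that judgment to the analyst.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data with the four predictors used above.
rng = np.random.default_rng(4)
n = 49
df = pd.DataFrame({
    "course": rng.integers(30, 961, n),
    "size":   rng.integers(4, 55, n),
    "year":   rng.integers(84, 95, n),
    "time":   rng.integers(0, 101, n),
})
df["excell"] = (22 + 0.047 * df["course"] - 0.48 * df["size"]
                + rng.normal(0, 17, n)).clip(5, 100)

terms = ["course", "size", "time", "year"]   # entry order from the text
while True:
    fit = smf.ols("excell ~ " + " + ".join(terms), data=df).fit()
    worst = fit.pvalues.drop("Intercept").idxmax()
    if fit.pvalues[worst] <= 0.05 or len(terms) == 1:
        break
    terms.remove(worst)                      # backward deletion, one at a time
print(terms)
print(fit.params.round(3))
print(f"r2 = {fit.rsquared:.2f}, adjusted r2 = {fit.rsquared_adj:.2f}")
```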

Although the coefficients differ slightly from the simple regression results, the conclusions remain the same. In fact, given the variation in the previous plots, the magnitude of the r² (.45) for the two predictors is remarkable (adjusted r² = .43). As course level increased, so did the excellence ratings. As the size of the class increased, the excellence ratings decreased. These results are reasonable in light of the fact that higher-level courses usually attract the more committed and technically proficient students, and these courses are generally small.

The joint relationship among these three variables is presented in the spike plot (Shepard, Romney, and Nerlove, 1972) shown in Figure 4. The vertical z-axis represents the excellence ratings, the y-axis is the course number, and the x-axis is the size of the class.

Figure 4. Excellence rating * class size * course number. [Three-dimensional spike plot; axes as described above.]

A careful inspection of the three-dimensional plot reveals that the higher-level courses were associated with higher excellence ratings, and these courses tended to be the ones with the smaller class sizes.
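
A Figure 4-style spike plot can be sketched with standard plotting tools, again on synthetic stand-in data (the original figure was produced with the tools cited above).

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data for the 49 courses.
rng = np.random.default_rng(5)
size = rng.integers(4, 55, 49)
course = rng.integers(30, 961, 49)
excell = np.clip(20 + 0.05 * course - 0.5 * size + rng.normal(0, 15, 49), 5, 100)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
for x, y, z in zip(size, course, excell):
    ax.plot([x, x], [y, y], [0, z], color="gray", linewidth=0.8)  # the spike
    ax.scatter(x, y, z, color="black", s=8)                       # its tip
ax.set_xlabel("Class Size")
ax.set_ylabel("Course Number")
ax.set_zlabel("Excellence Rating")
plt.show()
```

A Figure 5-style response surface could be drawn from the same arrays with ax.plot_trisurf(size, course, excell).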

Another way of looking at the relationship among these three variables is to construct a response surface plot (Box, Hunter, and Hunter, 1978). This type of plot reveals the extent to which the relationship among the variables is a smooth linear function. As can be seen in Figure 5, the relationship is punctuated by spikes representing unexpectedly high ratings for some of the mid-range courses. Further investigation revealed that these spikes tended to be second-semester Intermediate Statistics sections. This course was not required of all students, unlike the first-semester Introductory Statistics course required of all doctoral (and most master's) students. In essence, both the students and the instructor came to interact on a more comfortable basis. It can also be seen that some of the 600-level courses yielded unexpectedly high ratings.

Figure 5. Excellence rating * class size * course number. [Response-surface plot over class size and course number; spikes mark the unexpectedly high ratings discussed above.]

DISCUSSION

The interpretations of the results of these analyses yield some insights into the dynamics that may operate within the professional career of a member of the teaching faculty. Specifically, basing high-stakes decisions upon one year's performance alone may produce an incomplete profile for determining the contribution that an individual makes to a program. Many instructors have no control over class size. In addition, there is often very little an instructor can do in one semester to change what may be a lifelong negative attitude some students have about a particular topic. In the present case, attitudes about mathematics, statistics, and the use of computers for more than just word processing would all be included. If initial attitudes, uncontrollable class size, and inadequate training do have a negative effect upon faculty ratings, then these conditions should be factored into any performance rating.

Finally, the statistical methods employed in this article are not restricted to the specifics of this particular study; they are generalizable in the sense that they may be appropriate for any other set of longitudinal data. In fact, these methods provide an actual opportunity to predict expected and unexpected evaluation ratings.
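
To make that closing point concrete, one hedged sketch of "predicting expected and unexpected ratings" is to flag observations falling outside a 95% prediction interval from the final two-predictor model. The data are synthetic, and the flagging rule is an assumption of this sketch, not a procedure from the article.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data for the two-predictor model above.
rng = np.random.default_rng(6)
n = 49
df = pd.DataFrame({
    "course": rng.integers(30, 961, n),
    "size":   rng.integers(4, 55, n),
})
df["excell"] = (22 + 0.047 * df["course"] - 0.48 * df["size"]
                + rng.normal(0, 17, n)).clip(5, 100)

fit = smf.ols("excell ~ course + size", data=df).fit()
pred = fit.get_prediction(df).summary_frame(alpha=0.05)
unexpected = ((df["excell"] < pred["obs_ci_lower"]) |
              (df["excell"] > pred["obs_ci_upper"]))
print(df[unexpected])   # ratings that deserve a closer look
```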

REFERENCES

Belsley, D.A., Kuh, E., and Welsch, R.E. (1980). Regression diagnostics. New York: John Wiley and Sons.

Box, G.E.P., Hunter, W.G., and Hunter, J.S. (1978). Statistics for experimenters. New York: John Wiley and Sons.

Chambers, J.M., Cleveland, W.S., Kleiner, B., and Tukey, P.A. (1983). Graphical methods for data analysis. Boston: Duxbury Press.

Chatterjee, S., and Price, B. (1977). Regression analysis by example. New York: John Wiley and Sons.

Cook, R.D., and Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman and Hall.

Millman, J., and Darling-Hammond, L. (Eds.) (1990). The new handbook of teacher evaluation. Newbury Park: SAGE.

Pedhazur, E.J. (1982). Multiple regression in behavioral research. New York: Holt, Rinehart, and Winston.

Shepard, R.N., Romney, A.K., and Nerlove, S.B. (Eds.) (1972). Multidimensional scaling: Volume 1. New York: Seminar Press.

SYSTAT: Statistics, Version 5.2 Edition. (1992). Evanston, IL: SYSTAT, Inc.

The Joint Committee on Standards for Educational Evaluation. (1988). The Personnel Evaluation Standards. Newbury Park: SAGE.

Theall, M., Franklin, J., and Ludlow, L. (1990). Attributions or retributions: Student ratings and the perceived causes of performance. Paper presented at the annual meeting of the American Educational Research Association, Boston, MA.