SIGCSE 2005 DC Application

Titus Winters (titus@cs.ucr.edu)

Department of Computer Science & Engineering, Surge Building Room 281
University of California, Riverside, CA 92521
Phone: (909) 262-0385
Keywords: accreditation, data mining, curriculum assessment
Personal Homepage: http://www.cs.ucr.edu/~titus
Advisor: Tom Payne (thp at cs ucr edu)

Applications of Data Mining in Student Assessment

Introduction

The main thrust of the recent accreditation requirements from ABET, the Accreditation Board for Engineering and Technology, is that accredited engineering programs must have in place, and demonstrate use of, a "continuous-improvement process." A continuous-improvement process means one thing: feedback. The output of the educational program must be taken into account in the early stages of each cycle of the system, and adjustments made in an attempt to increase the quality of the next batch of outputs. Data gathering and analysis are therefore of prime importance in making this improvement process as effective as possible.

The Plan at UCR

Many institutions use surveys and course grades as "evidence" of the educational process. In contrast, the CS&E Department at UCR seeks to apply data mining and machine learning to measure instructional effectiveness. The goal is to gather information on every student in every course, down to the individual question level. According to ABET [3], a student's course grade fails to capture the student's knowledge in enough detail because too much information is lost in aggregating the individual assessments. Grading is already done question by question; we aim to keep that level of detail and prevent the information loss caused by aggregation.

Goals of the Research

Concretely, we are developing and deploying an information system to track student performance through our program down to the question level. Once this information is in place we can turn to data-mining algorithms, both well-known ones and ones developed specifically for this problem. We can then answer a variety of questions, the most important for accreditation being "What percentage of our students understand topics X, Y, and Z?" In general we hope to do much more than this: detecting good and bad test questions from student performance, and perhaps even identifying the topics an individual student is struggling with so that tutoring time can be spent more efficiently. Deployment and instructor adoption of this system will also allow for more effective use of IT, including digital student portfolios, continuously growing question banks, automated cheat-checking, and more.

Background Theory

One of the most useful concepts in this analysis is Item Response Theory (IRT) [1]. IRT is the theoretical foundation behind tests such as the GRE Computer Adaptive Test [6], which interactively "homes in" on a score by giving harder questions in response to correct answers and easier questions in response to incorrect answers. The concepts behind IRT are simple: each question (item) corresponds to a particular topic, and for that topic each student has a numeric ability score. Since these scores cannot be measured directly, they are known as latent ability scores. Each question is assumed to have a "characteristic curve" that plots the probability of answering it correctly as a function of the student's ability; a logistic (sigmoid) curve is often used. Such a curve has two major parameters: difficulty (the ability at which a student has a 50% chance of answering correctly) and discrimination (roughly, the slope of the curve at that 50% mark). A question with perfect discrimination is one that no student with ability below the difficulty answers correctly and no student with ability above the difficulty answers incorrectly. Obviously, there are few perfect questions.
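    In the common two-parameter logistic form, the probability of a correct answer is 1 / (1 + exp(-a(theta - b))), where theta is the latent ability, b the difficulty, and a the discrimination. A minimal sketch of this curve follows; the parameter values are invented purely for illustration.

        import math

        def p_correct(theta, difficulty, discrimination):
            # Two-parameter logistic item characteristic curve: probability
            # that a student with latent ability `theta` answers correctly.
            return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

        # An item with difficulty 0.0: at theta == 0.0 the probability is 50%,
        # and a larger discrimination makes the curve steeper around that point.
        for theta in (-2, -1, 0, 1, 2):
            print(theta, round(p_correct(theta, 0.0, 1.0), 2),
                  round(p_correct(theta, 0.0, 5.0), 2))
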
    The concepts of latent ability, difficulty, and discrimination pervade this work, although we are setting out to solve a harder problem than IRT does: we neither label question topics nor perform extensive experimentation to determine item parameters. The GRE CAT calibrates the parameters of new questions by having students take an "extra" test section. That is not an option for us without greatly altering instructional practice, so we need to identify question topics and student latent abilities simultaneously, with minimal training data.

Current Status

The first work on this project was done in the Spring of 2003. Since then we have deployed tools to aid in the grading of student work: automated testing of programming assignments, automatic grading of multiple-choice forms, and tools that mimic traditional "red-pen" grading of electronic submissions. We have also put together a relational database for this information and begun feeding grade data into it. Full course information is available for several courses at this time, and we expect 30%-40% of instructors to participate this quarter, our first quarter of official deployment.
    For the past few months we have been investigating which data-mining algorithms work best with the data we are gathering. Some things are relatively easy and have already been implemented. The biggest success so far has been determining, from current course grades and question scores, which questions on a test were "good", which were "bad", which had the wrong answer key, which were inadequately covered, and so on. The IRT concepts of difficulty and discrimination are the key here.
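    This flagging can be illustrated with classical item statistics. The sketch below assumes a students-by-questions score matrix; the data layout and the use of the mean score and item/rest-score correlation are illustrative stand-ins, not our production analysis.

        import numpy as np

        def item_analysis(scores):
            # `scores` is a hypothetical students x questions matrix with
            # entries in [0, 1]. Difficulty is estimated as the mean score
            # on each question; discrimination as the correlation between a
            # question's scores and each student's total on the remaining
            # questions (a negative value suggests a bad item or wrong key).
            scores = np.asarray(scores, dtype=float)
            difficulty = scores.mean(axis=0)
            discrimination = np.empty(scores.shape[1])
            for j in range(scores.shape[1]):
                rest = scores.sum(axis=1) - scores[:, j]
                discrimination[j] = np.corrcoef(scores[:, j], rest)[0, 1]
            return difficulty, discrimination
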
    Other approaches have been tried but have not yet yielded good results. Most notable is that few (if any) current classification/prediction techniques in machine learning (such as [4]) can produce a simple and accurate model for predicting student grades from question scores. For example, given the 48 questions used on the first 5 CS1 quizzes this quarter and the scores of the 125 students who took all 5 quizzes, the best algorithms we have tested can predict overall lecture grades (the letter grade for quizzes and tests) with only 65% accuracy. It appears that there is too much noise in individual scores, which is not surprising: a given student may study for one quiz, party the night before a second, and come to class sick for a third. Idealized notions of latent traits aside, education happens in a far-from-idealized real world. This noise has similarly hampered our attempts to extract topic information from the question-score data.
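    The experiment described above has roughly the following shape. Scikit-learn's boosting classifier and synthetic placeholder data stand in here for the actual algorithms of [4] and our real score data, so the printed accuracy is meaningless; only the structure of the experiment is being shown.

        import numpy as np
        from sklearn.ensemble import AdaBoostClassifier
        from sklearn.model_selection import cross_val_score

        # Synthetic placeholder data with the same shape as the real experiment:
        # 125 students, 48 quiz questions, overall letter grades as labels.
        rng = np.random.default_rng(0)
        X = rng.random((125, 48))                # per-question scores in [0, 1]
        y = rng.choice(list("ABCDF"), size=125)  # lecture letter grades

        # Cross-validated accuracy of a boosted classifier on the placeholder data.
        clf = AdaBoostClassifier(n_estimators=100, random_state=0)
        print("accuracy:", cross_val_score(clf, X, y, cv=5).mean())
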
    However, hope is by no means lost. We are currently investigating two approaches that may overcome these obstacles. Once we have meaningful estimates of either question topics or student abilities, calculating the other is a relatively simple expectation-maximization-style process.
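    A toy version of that alternation, assuming a students-by-questions score matrix and a small fixed number of topics (both hypothetical), might look like the following.

        import numpy as np

        def abilities_from_topics(scores, topics, n_topics):
            # Given a students x questions score matrix and a topic label for
            # each question, estimate ability as the mean score per topic.
            # Assumes every topic has at least one question.
            topics = np.asarray(topics)
            return np.column_stack([scores[:, topics == t].mean(axis=1)
                                    for t in range(n_topics)])

        def topics_from_abilities(scores, ability):
            # Given ability estimates, reassign each question to the topic
            # whose ability column best correlates with the question's scores.
            n_topics = ability.shape[1]
            return np.array([
                int(np.argmax([np.corrcoef(scores[:, j], ability[:, t])[0, 1]
                               for t in range(n_topics)]))
                for j in range(scores.shape[1])])

        # Alternating these two estimates is the EM-style loop referred to above.
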
    The first technique initially ignores the scores entirely: as part of archiving scores, we also require participating instructors to submit the text of the questions behind those scores. This gives us a nearly noise-free source of data: the question text itself. Text-similarity measures can be applied to the questions, alone or in conjunction with the question scores, and the results fed to standard clustering algorithms [2]. Once we know which topics each question relates to, we can estimate student abilities on those topics from their performance on those questions.
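    The question texts below are invented examples; the sketch uses TF-IDF similarity with the spectral clustering of [2], though other text-similarity measures and clustering algorithms are equally plausible.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.cluster import SpectralClustering

        # Hypothetical question texts standing in for the archived questions.
        questions = [
            "Write a loop that sums the elements of an array.",
            "What does the following for loop print?",
            "Declare a pointer to an integer and assign it an address.",
            "Explain the difference between a pointer and a reference.",
        ]

        # TF-IDF vectors are length-normalized, so their dot products are
        # cosine similarities; spectral clustering then groups the questions.
        tfidf = TfidfVectorizer(stop_words="english").fit_transform(questions)
        similarity = (tfidf @ tfidf.T).toarray()
        labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                                    random_state=0).fit_predict(similarity)
        print(labels)  # e.g. loop questions in one cluster, pointer questions in the other
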
    The second technique is both more complex and more exciting: collaborative filtering [5]. This is the technique used by retailers like Amazon.com to recommend books based on previous ratings. In that domain, each customer has some level of preference for different classes of books, and each book has some correlation with each of these idealized classes. Neither can be measured directly, but they are highly related: someone who loves sci-fi novels is more likely to give a high rating to a story about colonizing Mars than to a story about 18th-century England. The classes are not mutually exclusive: some people like both historical fiction and sci-fi. As with our data, there is some noise in these ratings, since a huge number of factors can influence a rating. While that noise may be less pronounced than the noise in our score data, we have a compensating advantage: our "ratings" are nearly complete, since nearly every student in a course takes every test and quiz.
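    A minimal sketch of the idea follows, using a plain truncated SVD on a complete score matrix as a stand-in for the relational probabilistic models of [5]; the number of factors and the centering step are illustrative choices.

        import numpy as np

        def latent_factors(scores, n_factors=3):
            # Toy collaborative-filtering step: factor the (nearly complete)
            # students x questions score matrix into low-rank student and
            # question factors with a truncated SVD. Students play the role
            # of customers, questions the role of books, and the factors the
            # role of the idealized "classes" described above.
            centered = scores - scores.mean(axis=0)
            U, S, Vt = np.linalg.svd(centered, full_matrices=False)
            student_factors = U[:, :n_factors] * S[:n_factors]
            question_factors = Vt[:n_factors].T
            return student_factors, question_factors

        # Reconstructing scores from the factors smooths per-quiz noise, and
        # the question factors give a rough grouping of questions by the
        # latent traits that drive student performance.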

Open Issues

There are a number of issues that remain very uncertain at this stage of development; some are more concerning than others.

Current Stage in My Program of Study

I advanced to PhD Candidacy in the Spring of 2004. I hope to defend my dissertation in Spring 2006.

What I Hope to Gain from Participating in the Doctoral Consortium

Since there is no formal Computer Science Education program at UCR, my focus has often been guided by the instructional problems we face at any given time. My hope is that by attending the Doctoral Consortium I will gain exposure to more academic and formal approaches to CS education, and also publicize the work we are undertaking at UCR. Last year's DC was one of the most valuable experiences I have had as a graduate student. I hope to return this year having moved this project from the initial planning stages into the more advanced areas of data analysis.

Bibliographic References

  1. Baker, Frank B. Fundamentals of Item Response Theory. ERIC Clearinghouse on Assessment and Evaluation, 2001.
  2. Ng, Andrew Y., Jordan, Michael I., and Weiss, Yair. On Spectral Clustering: Analysis and an Algorithm. NIPS 2001.
  3. Rogers, Gloria. Do Grades Make the Grade for Program Assessment? ABET Quarterly News Source, Fall/Winter 2003.
  4. Schapire, Robert E. The Boosting Approach to Machine Learning: An Overview. MSRI Workshop on Nonlinear Estimation and Classification, 2002.
  5. Taskar, Ben, Segal, Eran, and Koller, Daphne. Probabilistic Clustering in Relational Data. Seventeenth International Joint Conference on Artificial Intelligence (IJCAI), August 2001.
  6. Van der Linden, Wim J., and Glas, Cees A. W. (Eds.). Computerized Adaptive Testing: Theory and Practice. Kluwer Academic Publishers, 2000.