University of California at Riverside, 92521

Phone: 909 262 0385

Keywords: accreditation, data mining, curriculum assessment

Personal Homepage: http://www.cs.ucr.edu/~titus

Advisor: Tom Payne (thp at cs ucr edu)

The concepts of latent ability, difficulty, and discrimination pervade this work, although we are setting out to solve a harder problem than IRT does: we neither label question topics nor perform the extensive experimentation needed to determine item parameters. The GRE CAT does this by having students take an "extra" test section to calibrate the parameters of new questions. That is not an option here without greatly altering instructional techniques, so we must simultaneously identify question topics and student latent abilities with minimal training data.
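To make these IRT concepts concrete, here is the standard two-parameter logistic (2PL) item response function in textbook form (see [1]); this is illustrative, not code from our system:

```python
import math

def p_correct(theta, a, b):
    """Two-parameter logistic (2PL) IRT model: the probability that a
    student with latent ability theta answers correctly an item with
    discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# An easy, weakly discriminating item vs. a hard, sharply
# discriminating one, evaluated for an average student (theta = 0).
easy = p_correct(0.0, a=0.5, b=-1.0)   # above 0.5
hard = p_correct(0.0, a=2.0, b=1.0)    # well below 0.5
```

Calibrating `a` and `b` for each item is exactly the step that normally requires the extensive experimentation mentioned above.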

For the past few months we have been investigating which data mining algorithms work best with the data we are gathering. Some analyses are relatively easy and have already been implemented. The biggest success so far has been determining, from current course grades and question scores, which questions on a test were "good", "bad", had the wrong answer key, were inadequately covered, and so on. The IRT concepts of difficulty and discrimination are the key here.
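As a sketch of the kind of item analysis involved (the code and data are illustrative, not our implementation), difficulty can be estimated as the fraction of students answering correctly, and discrimination as the correlation between an item and students' total scores:

```python
import math

def _pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0

def item_stats(scores):
    """Classical item analysis on a 0/1 score matrix (rows = students,
    columns = questions).  Difficulty is the fraction of students who
    answered correctly; discrimination is the correlation between the
    item and each student's total score.  A near-zero or negative
    discrimination flags a "bad" question or a wrong answer key."""
    totals = [sum(row) for row in scores]
    stats = []
    for j in range(len(scores[0])):
        item = [row[j] for row in scores]
        stats.append((sum(item) / len(item), _pearson(item, totals)))
    return stats

# Questions 0 and 1 track overall performance; question 2 is answered
# "backwards" -- the signature of a wrong answer key.
scores = [[1, 1, 0],
          [1, 1, 0],
          [1, 0, 0],
          [0, 0, 1],
          [0, 0, 1]]
stats = item_stats(scores)
```

Here question 2's negative discrimination is what would flag it for review.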

Other things have been experimented with but have not yet yielded good results. Most notably, few (if any) current classification/prediction techniques in machine learning (such as [4]) can produce a simple and accurate model for predicting student grades from their question scores. For example, given the set of 48 questions used on the first 5 CS1 quizzes this quarter and the scores of the 125 students who took all 5 quizzes, the best algorithms we have tested can predict overall lecture grades (letter grade for quizzes and tests) with only 65% accuracy. It appears that there is too much noise in individual scores, which is not surprising: a given student may study for one quiz, party the night before a second, come to class sick for a third, and so on. Idealized notions of latent traits aside, education happens in a far-from-idealized real world. This noise has similarly hampered our attempts to extract topic information from the question score data.
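For illustration only, here is a toy nearest-neighbour grade predictor of the general kind we have been testing; the score vectors and grades are invented, not our course data:

```python
def predict_grade(new_scores, history):
    """Predict a letter grade for a new quiz-score vector by finding
    the most similar past student (squared Euclidean distance) and
    returning that student's grade.  `history` is a list of
    (score_vector, letter_grade) pairs."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = min(history, key=lambda pair: dist(pair[0], new_scores))
    return best[1]

# Invented example data: three past students, three quiz scores each.
history = [((10, 9, 8), "A"), ((7, 6, 8), "B"), ((4, 5, 3), "C")]
```

With noisy individual scores, even this kind of model can latch onto the wrong neighbour, which is consistent with the 65% accuracy ceiling described above.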

However, hope is by no means lost. We are currently investigating two approaches that can overcome these obstacles.

The first approach initially ignores the scores entirely: as part of the process of archiving scores, we also require participating instructors to submit the text of the questions behind those scores. This provides a nearly noise-free source of data: the question text itself. Text-similarity measures can then be applied to the questions, either alone or in conjunction with question scores, and the results fed to standard clustering algorithms [2]. Given knowledge of which topics each question relates to, we can then estimate student abilities on those topics based on their performance on those questions.
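A minimal sketch of the text-similarity step, assuming a bag-of-words representation with cosine similarity (the question texts shown are invented):

```python
import math
from collections import Counter

def cosine(q1, q2):
    """Bag-of-words cosine similarity between two question texts --
    one of several text-similarity measures whose pairwise values
    could serve as the affinity matrix for a clustering algorithm
    such as spectral clustering [2]."""
    v1, v2 = Counter(q1.lower().split()), Counter(q2.lower().split())
    dot = sum(v1[w] * v2[w] for w in v1)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

questions = [
    "what is printed by this for loop",
    "trace the output of this while loop",
    "draw the inheritance diagram for these classes",
]
# Pairwise similarities form the matrix handed to clustering.
sims = [[cosine(a, b) for b in questions] for a in questions]
```

Even this crude measure rates the two loop-tracing questions as more alike than either is to the inheritance question.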

The second technique is both more complex and more exciting: collaborative filtering [5]. This is the technique used by retailers like Amazon.com to recommend books based on previous ratings. In that domain, each customer has some level of preference for different classes of books, and each book has some correlation with each of these idealized classes. Neither can be measured directly, but the two are closely related: someone who loves sci-fi novels is more likely to give a high rating to a story about colonizing Mars than to a story about 18th-century England. Classes are not mutually exclusive: some people like both historical fiction and sci-fi novels. As in our data, these ratings carry some noise, since a huge number of factors can influence a rating. While this noise may be less pronounced than in our score data, we have a comparable advantage in having nearly complete "ratings", since nearly every student in a course takes each test and quiz.
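A minimal sketch of the collaborative-filtering idea applied to score data (the data is invented, and the inverse-distance similarity is chosen for brevity rather than taken from [5]):

```python
def predict_score(target, others, item):
    """User-based collaborative filtering: predict the target
    student's score on `item` as a similarity-weighted average of
    other students' scores on that item.  Students are dicts mapping
    question ids to scores; similarity is computed over the
    questions two students share (excluding `item` itself)."""
    def sim(a, b):
        shared = [(a[k], b[k]) for k in a if k in b and k != item]
        if not shared:
            return 0.0
        d = sum((x - y) ** 2 for x, y in shared)
        return 1.0 / (1.0 + d)
    num = den = 0.0
    for other in others:
        if item in other:
            w = sim(target, other)
            num += w * other[item]
            den += w
    return num / den if den else None

alice = {"q1": 8, "q2": 9}                      # "q3" is missing
others = [{"q1": 8, "q2": 9, "q3": 7},          # very similar student
          {"q1": 2, "q2": 3, "q3": 1}]          # very different student
pred = predict_score(alice, others, "q3")       # pulled toward 7
```

In our setting the "items" are questions and the "ratings" are scores, and the near-completeness of the score matrix is what makes this approach attractive.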

- Can the "noise" in score data be mitigated enough to get good topic or student ability estimates? A solution to either problem makes the other fairly straightforward.
- How many "seed" topic identifications are necessary to get good results? That is, what is the smallest *n* such that if we **know** *n* questions from each topic for each course, we can accurately extract the topic groups?
- How do we merge results across course offerings and quarters? Students take other courses, and questions are reused. That information should simplify future analysis.
- Has our "minimal impact" software deployment been successful?

1. Baker, Frank B. *Fundamentals of Item Response Theory.* ERIC Clearinghouse on Assessment and Evaluation, 2001.
2. Ng, A., Jordan, M., Weiss, Y. On Spectral Clustering: Analysis and an Algorithm. NIPS 2001.
3. Rogers, Gloria. Do Grades Make the Grade for Program Assessment? *ABET Quarterly News Source*, Fall/Winter 2003.
4. Schapire, Robert E. The Boosting Approach to Machine Learning: An Overview. MSRI Workshop on Nonlinear Estimation and Classification, 2002.
5. Taskar, B., Segal, E., Koller, D. Probabilistic Clustering in Relational Data. Seventeenth International Joint Conference on Artificial Intelligence, August 2001.
6. Van der Linden, W. J., Glas, C. A. W. (eds.) *Computerized Adaptive Testing: Theory and Practice.* Kluwer Academic Publishers, July 2000.