SIGCSE 2005 DC Application
Titus Winters (titus@cs.ucr.edu)
Department of Computer Science & Engineering
Surge Building Room 281
University of California, Riverside, CA 92521
Phone: 909 262 0385
Keywords: accreditation, data mining, curriculum assessment
Personal Homepage:
http://www.cs.ucr.edu/~titus
Advisor: Tom Payne (thp at cs ucr edu)
Applications of Data Mining in Student Assessment
Introduction
The main thrust of the recent accreditation requirements from ABET, the Accreditation Board for
Engineering and Technology, is that accredited engineering programs
must have in place, and demonstrate use of, a "continuous-improvement
process." A continuous-improvement process means one thing: feedback.
The output of the educational program must be taken into account in
the early stages of each cycle, and adjustments made in
an attempt to increase the quality of the next batch of outputs. To
maximize the effectiveness of our improvement process, data
gathering and analysis are therefore of prime importance.
The Plan at UCR
Many institutions are using surveys and course grades as "evidence" of
the educational process. In contrast, the CS&E Department at UCR
seeks to apply data mining and machine learning to measure
instructional effectiveness. The goal is to gather information on
every student in every course, down to the individual question level.
According to ABET [3], a student's course grade fails to capture the
student's knowledge in enough detail because too much information is
lost when individual assessments are aggregated. Grading must be
done at the question level; we are trying to keep that level of detail
and prevent the information loss caused by aggregation.
Goals of the Research
Concretely, we are developing and deploying an information system to
track student performance through our program down to the question
level. Once this information is in place we can turn to data-mining
algorithms, both well-known ones and ones developed specifically for
this problem. We can then answer a variety of questions, most
importantly for accreditation, "What percentage of our students
understand topics X, Y, and Z?" In general we hope to do much more
than this: detecting good and bad test questions based on student
performance, and perhaps even identifying the topics a student is
struggling with so that tutoring time can be used more
efficiently. Deployment and instructor adoption
of this system will allow for more effective use of IT, including
generating digital student portfolios, continuously growing question
banks, automated cheat-checking, and more.
Background Theory
One of the most useful concepts in this analysis is that of Item
Response Theory (IRT)[1]. IRT is the theoretical foundation behind such
tests as the GRE Computer Adaptive Test[6] which interactively
"homes in" on a score for students by giving harder questions in
response to correct answers and easier questions in response to
incorrect answers. The concepts behind IRT are simple: each question
(item) corresponds to a particular topic. For that topic, each
student has a particular numeric ability score. Since these scores
cannot be directly measured, they are known as latent
ability scores. Each question is assumed to have a
"characteristic curve" which plots the probability of getting the
question correct as a function of the ability of the student. A
logistic or sigmoid curve is often used for this. In such a curve
there are two major parameters: difficulty (the ability for which a
student would have a 50% chance of getting the question correct), and
discrimination (roughly corresponding to the slope of the curve at
that 50% mark). A question with perfect discrimination is one that no
student with an ability below the difficulty will answer
correctly and no student with an ability above the difficulty
will answer incorrectly. Obviously, few questions are
perfect.
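To make these parameters concrete, here is a small Python sketch of the
two-parameter logistic characteristic curve described above; the function
and variable names are mine, chosen purely for illustration:

    import numpy as np

    def characteristic_curve(ability, difficulty, discrimination):
        """Two-parameter logistic item characteristic curve: the probability
        of a correct answer as a function of the student's latent ability."""
        return 1.0 / (1.0 + np.exp(-discrimination * (ability - difficulty)))

    # At ability == difficulty the probability is 0.5; a larger discrimination
    # makes the curve steeper (more separating) around that point.
    abilities = np.linspace(-3, 3, 7)
    print(characteristic_curve(abilities, difficulty=0.0, discrimination=2.0))
    print(characteristic_curve(abilities, difficulty=0.0, discrimination=0.5))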
The concepts of latent ability, difficulty, and
discrimination pervade this work, although we are setting out to solve
a harder problem than IRT solves by not labeling question topics and
not performing extensive experimentation to determine item
parameters. The GRE CAT calibrates the parameters of new questions by
having students take an "extra" test section. Without greatly altering
our instructional techniques, that is not an option for us, so we need
to simultaneously identify question topics and student latent
abilities with minimal training data.
Current Status
The first work on this project was done in the Spring of 2003. Since
then we have deployed tools to aid in the grading of student work:
automated testing of programming assignments, automatic grading
of multiple-choice forms, and tools for mimicking traditional
"red-pen" grading of electronic submissions. We have also put
together a relational database for this information and begun the
process of feeding grade data into it. Full course information for
several courses is available at this time, and we expect 30% - 40% of
instructors to comply this quarter, our first quarter of official
deployment.
For the past few months we have been investigating which
data mining algorithms work best with the data we are gathering. Some
things are relatively easy to do and have already been implemented.
So far the biggest success has been determining from current course
grades and question scores which questions on a test were "good" or
"bad", had the wrong answer key, were inadequately covered, and so on. The
IRT concepts of difficulty and discrimination are the key
here.
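As an illustration of how these two concepts can flag suspect items, the
following sketch (a deliberate simplification, not necessarily the exact
procedure we run) estimates a question's difficulty as its proportion
correct and its discrimination as the correlation between the item and the
rest of the test; near-zero or negative discrimination is the signal for a
bad question or a wrong answer key:

    import numpy as np

    def item_diagnostics(scores):
        """Rough diagnostics from a (students x questions) 0/1 score matrix."""
        scores = np.asarray(scores, dtype=float)
        n_items = scores.shape[1]
        difficulty = scores.mean(axis=0)      # proportion answering correctly
        discrimination = np.zeros(n_items)
        for j in range(n_items):
            rest = np.delete(scores, j, axis=1).sum(axis=1)
            if scores[:, j].std() > 0 and rest.std() > 0:
                # Correlation of item j with the total on all other items.
                discrimination[j] = np.corrcoef(scores[:, j], rest)[0, 1]
        return difficulty, discrimination

    # Items with discrimination near zero or negative are candidates for
    # review: possibly ambiguous, inadequately covered, or mis-keyed.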
Other approaches have been tried but have not yet yielded
good results. Most notably, few (if any) current classification and
prediction techniques in machine learning (such as boosting [4]) can
produce a simple and accurate model for predicting student grades
from their question scores. For
example, given the set of 48 questions used on the first 5 CS1 quizzes
this quarter and the scores of the 125 students that took all 5
quizzes, the best algorithms we have tested can only predict overall
lecture grades (letter grade for quizzes and tests) with 65% accuracy.
It appears that there is too much noise in individual scores, which is
not surprising: a given student may study for one quiz, party the
night before a second quiz, come to class sick for a third, etc.
Idealized notions of latent traits aside, education happens in a
far-from-idealized real world. This noise has similarly hampered our
attempts to extract topic information from the question score
data.
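For concreteness, a prediction experiment of this kind might look like the
sketch below; scikit-learn's gradient boosting is used here only as a
stand-in for the algorithms we have actually tried, and the data is
randomly generated to show the shape of the experiment rather than the
real scores:

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    # Placeholder data in the shape of the real experiment: 125 students,
    # 48 question scores, and a letter grade per student.
    rng = np.random.default_rng(0)
    X = rng.random((125, 48))
    y = rng.choice(list("ABCDF"), size=125)

    # A boosted classifier in the spirit of [4]; on the real data the
    # cross-validated accuracy tops out at roughly 65%.
    model = GradientBoostingClassifier()
    print(cross_val_score(model, X, y, cv=5).mean())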
However, hope is by no means lost. We are currently
investigating two approaches that can overcome these obstacles. Once
we have meaningful estimates of either the question topics or the
student abilities, calculating the other is a relatively simple
expectation-maximization process.
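As a toy illustration of that alternation, assuming (purely for
simplicity) that each question belongs to exactly one topic and that
ability is just the proportion of correct answers rather than the full IRT
model:

    import numpy as np

    def estimate_abilities(scores, topics, n_topics):
        """Given 0/1 scores (students x questions) and a topic label per
        question, estimate ability as proportion correct per topic."""
        abilities = np.zeros((scores.shape[0], n_topics))
        for t in range(n_topics):
            cols = np.flatnonzero(topics == t)
            if cols.size:
                abilities[:, t] = scores[:, cols].mean(axis=1)
        return abilities

    def estimate_topics(scores, abilities):
        """Given abilities, assign each question to the topic whose ability
        column best correlates with that question's scores."""
        topics = np.zeros(scores.shape[1], dtype=int)
        for j in range(scores.shape[1]):
            corr = [np.corrcoef(scores[:, j], abilities[:, t])[0, 1]
                    for t in range(abilities.shape[1])]
            topics[j] = int(np.nanargmax(corr))
        return topics

    # Alternating these two estimates from a rough initial guess is the
    # expectation-maximization-style loop referred to above.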
The first technique initially ignores the
scores entirely: as part of the process of archiving scores, we are
also requiring participating instructors to submit the text of the
questions being scored. This provides us with a nearly noise-free
source of data: the question text itself. Text-similarity
measurements can then be applied to the questions, either alone or in
conjunction with question scores, and the results fed to standard
clustering algorithms [2]. Given knowledge of which topics each question relates
to, we can then estimate student abilities on those topics based on
their performance on those questions.
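A minimal sketch of this idea, pairing TF-IDF text similarity with
spectral clustering [2]; the question text here is made up, and the real
choice of features and clustering algorithm is still open:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import SpectralClustering
    from sklearn.metrics.pairwise import cosine_similarity

    # Placeholder question text; in practice this comes from archived exams.
    questions = [
        "Write a for loop that sums the elements of an array.",
        "Trace the following while loop and give its output.",
        "Declare a pointer to an integer and assign it an address.",
        "Explain the difference between a pointer and a reference.",
    ]

    tfidf = TfidfVectorizer(stop_words="english").fit_transform(questions)
    affinity = cosine_similarity(tfidf)

    # Group the questions into candidate topics using the text-similarity
    # matrix as the affinity for spectral clustering.
    labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                                random_state=0).fit_predict(affinity)
    print(labels)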
The second technique is both more complex and more
exciting: collaborative filtering [5]. This is the technique used by
retailers like Amazon.com to recommend books based on previous
ratings. In that domain, each customer has certain levels of preference
for different classes of books, and each book has some correlation to
each of these idealized classes. Neither of these can be directly
measured, but they are highly related: someone who loves sci-fi
novels is more likely to give a high rating to a story about
colonizing Mars than to a story about 18th-century England. Classes
are not mutually exclusive: some people like both historical fiction
and sci-fi novels. As with our data, there is some noise in
these ratings, since a huge number of factors can influence a rating.
While this "noise" may be less pronounced than that in our score data,
we have a compensating advantage in having nearly complete "ratings",
since nearly every student in a course takes each test and quiz.
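One common way to realize collaborative filtering on a (students x
questions) score matrix is a low-rank factorization in which the latent
factors play the role of the idealized classes described above; the sketch
below is illustrative rather than the method we have settled on:

    import numpy as np

    def factorize(scores, n_factors=3, n_iter=1000, lr=0.005, reg=0.1):
        """Factor a (students x questions) score matrix into student factors
        S and question factors Q so that S @ Q.T approximates the scores."""
        rng = np.random.default_rng(0)
        n_students, n_items = scores.shape
        S = rng.normal(scale=0.1, size=(n_students, n_factors))
        Q = rng.normal(scale=0.1, size=(n_items, n_factors))
        for _ in range(n_iter):
            err = scores - S @ Q.T            # residual on the score matrix
            S += lr * (err @ Q - reg * S)     # gradient step on squared error
            Q += lr * (err.T @ S - reg * Q)
        return S, Q

    # The rows of S act like latent per-topic ability scores and the rows of
    # Q like each question's loading on those topics; because nearly every
    # student answers nearly every question, the matrix is close to complete.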
Open Issues
There are a number of issues that are very uncertain at this stage of
development. Some are more concerning than others, but a brief list at
this stage includes:
- Can the "noise" in score data be mitigated enough to get good
topic or student ability estimates? A solution to either problem
makes the other fairly straightforward.
- How many "seed" topic identifications are necessary to get good
results? That is, what is the smallest n such that if we
know n questions from each topic for each course we can
accurately extract the topic groups?
- How do we merge results across course offerings and quarters?
Students take other courses, and questions are reused. That
information should simplify future analysis.
- Has our "minimal impact" software deployment been successful?
Current Stage in my Program of Study
I advanced to PhD Candidacy in the Spring of 2004. I hope to defend
my dissertation in Spring 2006.
What I Hope to Gain from Participating in the Doctoral Consortium
Since there is no formal Computer Science Education program at UCR, my
focus has often been guided by the instructional problems that we are
facing at any given time. My hope is that by attending the Doctoral
Consortium I will gain exposure to more academic and formal approaches
to CSE, and will also be able to publicize the work we are undertaking
at UCR. Last year's DC was one of the most valuable experiences I
have had as a graduate student. I hope to return this year having
moved this project from the initial planning stages into the more
advanced areas of data analysis.
Bibliographic References
[1] Baker, Frank B. Fundamentals of Item Response Theory. ERIC Clearinghouse on Assessment and Evaluation, 2001.
[2] Ng, Jordan, and Weiss. On Spectral Clustering: Analysis and an Algorithm. NIPS, 2001.
[3] Rogers, Gloria. Do Grades Make the Grade for Program Assessment. ABET Quarterly New Source, Fall/Winter 2003.
[4] Schapire, Robert E. The Boosting Approach to Machine Learning: An Overview. MSRI Workshop on Nonlinear Estimation and Classification, 2002.
[5] Taskar, B., Segal, E., and Koller, D. Probabilistic Clustering in Relational Data. Seventeenth International Joint Conference on Artificial Intelligence, August 2001.
[6] Van der Linden and Glas, editors. Computerized Adaptive Testing: Theory and Practice. Kluwer Academic Publishers, 2000.