SIGCSE 2004 DC Application
Titus Winters (titus@cs.ucr.edu)
Department of Computer Science & Engineering
Surge Building Room 281
University of California at Riverside, 92521
Phone: 909 262 0385
Keywords: Automated grading, database, KDD, curriculum assessment
Personal Homepage:
http://www.cs.ucr.edu/~titus
Advisor: Tom Payne (thp at cs ucr edu)
The Archive
Introduction
The main thrust of the recent accreditation requirements from ABET, the Accreditation Board for
Engineering and Technology, is that accredited engineering
programs must have in place, and demonstrate use of, a
"continuous-improvement process." Much like the "Total Quality
Management" phenomenon that swept through industrial process
development in the 1990s, a continuous-improvement process means one
thing: feedback. The output from the system, in this case the
educational program, must be taken into account in the early stages of
each cycle of the system and adjustments made in an attempt to
increase the quality of the next batch of outputs.
The Plan at UCR
Many institutions (plan to) use surveys and course grades as
"evidence" of the educational process. In contrast, the CS&E
Department at UCR seeks to apply tools from data
mining, knowledge discovery, and machine learning to the problem of
measuring instructional effectiveness. The goal is to gather detailed
information on the assessment of all of our students in every course,
down to the individual question level. In our opinion, a student's
course grade, or even aggregate score on an assignment or exam, fails
to capture the student's knowledge as it pertains to our educational
program objectives. When taken individually, a question that involves
linked lists gives us some information about whether a given student
understands linked lists, while the aggregate score on an exam
that tests various data structures can only imply a student's overall
level of understanding.
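The point is easy to see in miniature. The sketch below (illustrative Python, not our actual data format; the names and scores are hypothetical) shows two students with identical exam totals, where only the question-level view reveals that one of them has not understood linked lists at all.

    # Illustrative only: per-question scores, keyed by the topic each question tests.
    exam = {
        "alice": {"linked-list-1": 0, "linked-list-2": 0, "tree-1": 5, "hash-1": 5},
        "bob":   {"linked-list-1": 3, "linked-list-2": 2, "tree-1": 3, "hash-1": 2},
    }

    for student, answers in exam.items():
        total = sum(answers.values())
        linked_lists = sum(v for q, v in answers.items() if q.startswith("linked-list"))
        # Both totals are 10, but only the question-level view exposes the gap.
        print(student, "total:", total, "linked-list points:", linked_lists)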
Archiving this information, for all questions, for all students, in
all courses in our department is a massive undertaking. However, once
the data is captured, we anticipate the use of Bayesian networks to
extract the relationships between each individual unit of assessment
and our program objectives. Once those relationships are extracted,
and given a student's performance history on units of assessment
pertaining to objective O, we can estimate, on a [0, 1] scale, the
degree to which that student has reached the objective. Aggregated
across our entire student population, this yields a number that in
some sense represents how well our students know that material. If
that estimate is too low, appropriate action must be taken.
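As a concrete illustration of the kind of estimate we have in mind, the sketch below uses a simple Beta-Bernoulli update rather than a full Bayesian network (the function and data are hypothetical Python, not part of the Archive): it turns a student's pass/fail history on the assessment items linked to an objective into a value in [0, 1], and then averages that value across the student body.

    # A minimal sketch, not the planned Bayesian network itself: a Beta-Bernoulli
    # estimate of how far a student has reached an objective, given 0/1 outcomes
    # on the assessment items linked to that objective.
    def objective_estimate(outcomes, prior_success=1.0, prior_failure=1.0):
        """Return an estimate in [0, 1] from a list of 0/1 item outcomes."""
        successes = sum(outcomes)
        return (prior_success + successes) / (prior_success + prior_failure + len(outcomes))

    # Averaged over the student body, this gives one number per objective that
    # can be tracked from one offering of the program to the next.
    students = {"alice": [1, 1, 0, 1], "bob": [0, 1, 0, 0]}
    program_level = sum(objective_estimate(h) for h in students.values()) / len(students)
    print(round(program_level, 2))   # 0.5 for this hypothetical data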
Goals of the Research
Our primary goal is to increase the effectiveness of our educational
program. Our hope is that we can use the process described above to
bring our instructional efforts out of the realm of qualitative,
irreproducible events, and into the realm of numerically verifiable,
quantitative experiment. (Our approach has been described as akin to
the difference between Aristotelian and Newtonian dynamics.) KDD on
educational data has been an active area of research for some time ([1]).
Current Status
Obviously, a project of this scope requires extensive groundwork.
Additionally, much of that groundwork is necessary in order to overcome
the distaste that research faculty have for time-consuming
modifications to their instructional techniques. I hope to enable and
encourage faculty to record data at this fine granularity by providing a
set of tools that reduce the time that instructional tasks,
such as assessment/feedback and test preparation, take, while gathering
data as a side effect.
Agar
My primary task is currently the development of Agar, a framework for
automating grading. Agar is unique and interesting because of two
main features: the tool framework and the comment system.
Agar was developed with the intent to be as general as possible, with
the idea that if the functional tests provided with Agar are
found to be insufficient, a grader or instructor should be able to
develop new tests for the assignment in question with relative
ease. To this end, rather than following the standard
high-performance route for plugin development using dynamically
linked libraries, Agar takes a much simpler approach. Tools
are written to respond to the "--help" argument with a list of
command-line parameters, which Agar then parses
to create a dynamically generated dialog box within the GUI,
allowing configuration of the functional tool in question.
Additionally, tools must respond to "--name" with a human-readable
name describing the tool (for example, "C++ Driver Tests" or "Detect
Line Wraps"). Finally, when the test is executed on the appropriate
input (either the compiled executable, if applicable, or the source
files for the assignment), the tool must exit with return code 0 for
success and 1 for failure, in which case the contents of the tool's
standard output stream will be appended to the student's results.
Other features, such as conditional execution of tests and dynamic
submission identification, add ease of use, but the main power of Agar
for functionality testing is this tool interface.
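To make the interface concrete, the sketch below shows what a minimal conforming tool might look like (hypothetical Python; the line-length check and the --max-columns parameter are illustrative, only the --help/--name/exit-code behavior is the protocol described above): it answers --name and --help, prints its feedback on standard output, and signals failure with exit code 1.

    #!/usr/bin/env python
    # Illustrative sketch of a tool conforming to Agar's interface; the check it
    # performs (flagging overlong source lines) is an example only.
    import sys

    args = sys.argv[1:]
    if "--name" in args:
        print("Detect Line Wraps")       # human-readable name shown in the GUI
        sys.exit(0)
    if "--help" in args:
        print("--max-columns N  maximum allowed line length (default 80)")
        sys.exit(0)

    max_cols, paths, i = 80, [], 0
    while i < len(args):
        if args[i] == "--max-columns":
            max_cols, i = int(args[i + 1]), i + 2
        else:
            paths.append(args[i])
            i += 1

    failed = False
    for path in paths:                   # source files passed in by Agar
        with open(path) as source:
            for number, line in enumerate(source, 1):
                if len(line.rstrip("\n")) > max_cols:
                    # Anything printed here is appended to the student's results.
                    print("%s:%d exceeds %d columns" % (path, number, max_cols))
                    failed = True

    sys.exit(1 if failed else 0)         # 0 = success, 1 = failure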
While automated grading is by no means a new idea ([4], [6], [7]),
I feel that Agar represents a significant advance over previous
attempts in that it has also shown greatly reduced effort required to
grade non-programming material such as written work and quizzes. The
automated testing features of Agar make it ideal for grading
programming work of all kinds, but the real benefit comes from the
time-savings for human graders in providing detailed
feedback. Additionally, Agar is intended to be fully open source, so
anything that isn't handled by the tool interface can be added to the
system internally.
Any grader can attest that different students make the same mistakes. A
simple CS1/CS2 example would be forgetting to write a base case for a
recursive function on a quiz. Using the Agar framework, the first
time a human grader finds such a problem, they create a new Comment,
assign a point value (positive for a bonus, zero to just write a
comment, and negative for a penalty) to the Comment, and write out a
note to the student. A drag-and-drop system within the Agar interface
then allows that Comment to be assigned to any other student who is
found to have made the same mistake. Further, since Comments are assigned
by reference, the point value or feedback can be changed later, and all
submissions that received that Comment will automatically be
updated.
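The by-reference behavior is simple to illustrate (hypothetical Python, not Agar's internal data model): submissions hold references to shared Comment objects, so a later change to a Comment's point value or text is reflected in every submission it was attached to.

    class Comment:
        def __init__(self, points, text):
            self.points = points         # positive bonus, zero note-only, negative penalty
            self.text = text

    class Submission:
        def __init__(self, base_score):
            self.base_score = base_score
            self.comments = []           # references to shared Comment objects

        def score(self):
            return self.base_score + sum(c.points for c in self.comments)

    missing_base_case = Comment(-2, "Recursive function has no base case.")
    alice, bob = Submission(10), Submission(10)
    alice.comments.append(missing_base_case)
    bob.comments.append(missing_base_case)

    missing_base_case.points = -3        # later adjustment propagates to both
    print(alice.score(), bob.score())    # 7 7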
This comment system allows for much greater feedback to be generated
for each student in a much shorter amount of time. For C++ homework
in our lower-division courses, a two-week project for a class of 60-70
students now takes about 4-6 hours to grade, record scores, and
generate and send out detailed feedback to students. Previously, lower
quality feedback and less accurate grading took upward of 10-15
hours. Similarly, 12 written problems for the same course were graded
and commented on in just under 5 hours, or slightly less than 5
minutes per student. Students in this course have expressed how
helpful they find it to get their work returned to them within 1-2
days of turning it in, with detailed comments and feedback emailed
to them while they still remember what the assignment involved.
Graders are similarly pleased in that they get to do more in less
time, and no longer have any bookkeeping to do, since Agar automatically
exports its results to a course grade book in the form of a
spreadsheet, with options to export to the campus BlackBoard system coming soon.
PACE: Program for Accelerated Creation of Exams
The second prong of our attack against instructional inertia is the
development of a tool for building question banks for exams. This has
been done many times before, including commercial efforts such as Respondus, but we need a
tool that will interface with our Archive, store student results
courtesy of Agar, and allow us a bit more freedom in how we manipulate
the question data. PACE is that tool. PACE is in its early testing
stages, but we believe that by generating question banks for each
course, instructors (especially in courses that change instructors
often) will be more inclined to teach similar material,
since the effort of assessing that material will already have been done
and provided for them by the previous instructor.
Interim Conclusions
Fine-grained information is always useful. It reveals immediately
that on most exams there are some questions that are poorly worded,
that some topics were not as well understood by the class as the
instructor might think, and possibly even that some questions were
graded with an incorrect or incomplete answer in mind.
We have also discovered that computer assisted grading, using tools
such as Agar, can greatly reduce the amount of time necessary
both to perform basic grading and evaluation and to provide detailed
feedback to the students. We have cut the time requirements to grade
programming homework by 50-60%, while increasing the detail of our
records, the quality of feedback to the students, and the consistency
of the grading and clerical reporting. More interestingly, non-programming
homework can still be graded with significant
time savings using a tool like Agar. We hope that Agar and other
user-interface tools will inspire instructors to perform the detailed score
recording that would be too tedious to do by hand, while saving them
time overall.
Open Issues
There are a number of issues that are very uncertain at this stage of
development, and a frighteningly large number of them could be
"deal-breakers" with regard to the final completion of the project as
currently envisioned. A brief list of these concerns includes:
- How can we most efficiently gather problem-level data? Can we
make assessment tools easy enough to use that professors, many of whom are
stubbornly set in their ways and don't want to "waste" time on
clerical aspects of instruction, will be willing to adopt them?
- Can Bayesian analysis be performed effectively on a matrix of
2,500-25,000 cells?
- Would unsupervised learning, such as simple clustering, provide
better information about which program objectives an item of
assessment pertains to?
- Is an individual student's performance so "noisy" that it will
invalidate the final data mining? Students are difficult
experimental subjects on an individual level, since individually they
have illnesses, personal issues, other classes, late-night parties,
and so on.
Current Stage in my Program of Study
I am currently finishing up my coursework and preparing for my Oral
Examination. I intend to advance to Candidacy for a PhD in March of
2004.
What I Hope to Gain from Participating in the Doctoral Consortium
Since there is no formal Computer Science Education program at UCR, my
focus has often been guided by the instructional problems that we are
facing at any given time. My hope is that by attending the Doctoral
Consortium I will gain exposure to more academic and formal approaches
to CSE, and also be able to publicize the work that we are undertaking here
at UCR. I feel that the DC will be a wonderful opportunity for me to
begin networking with other researchers in CSE, which is extremely
important as my work is mostly being performed in isolation.
Bibliographic References
- ABET
- BlackBoard
- The New Automated Grader Master Page
- On Automated Grading of Programming Assignments in an Academic Institution
- Respondus
- Lass, et al. Tools and Techniques for Large Scale Grading using Web-based
  Commercial Off-The-Shelf Software. SIGCSE Conference on Innovation and
  Technology in CSE, 2003.
- Using KDD To Analyze the Impact of Curriculum Revisions in a Brazilian
  University