University of California at Riverside, 92521

Phone: 909 262 0385

Keywords: accreditation, data mining, curriculum assessment

Personal Homepage: http://www.cs.ucr.edu/~titus

Advisor: Tom Payne (thp at cs ucr edu)

- Were linked-lists covered more this semester than in the previous offering?
- Did students get more binary search questions correct this semester than in the previous offering?
- Which topics this semester had the lowest class averages?
- Which topic does student *X* have the lowest average on?

The focus of this work is therefore to identify data-mining algorithms that produce question groupings similar to those a human would produce.

Given these groups, we extract all pairs of questions that are validly grouped together. For each algorithm we evaluate, we determine group membership according to that algorithm and again form the pairs of questions that are grouped together. The validity of the algorithm is then measured by the overlap between these two sets of pairs: the "correct" pairs and the generated pairs. Additionally, many algorithms attach a "certainty" to each question; by varying a certainty threshold, we obtain a parametric plot of accuracy.
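The overlap measurement described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the original evaluation code; function names and the label-dictionary representation are our own.

```python
from itertools import combinations

def same_group_pairs(labels):
    """All unordered pairs of questions assigned to the same group.

    `labels` maps question id -> group label.
    """
    pairs = set()
    for a, b in combinations(sorted(labels), 2):
        if labels[a] == labels[b]:
            pairs.add((a, b))
    return pairs

def pair_overlap(true_labels, pred_labels):
    """Precision and recall of an algorithm's same-group pairs
    against the 'correct' (human-produced) pairs."""
    truth = same_group_pairs(true_labels)
    pred = same_group_pairs(pred_labels)
    correct = truth & pred
    precision = len(correct) / len(pred) if pred else 1.0
    recall = len(correct) / len(truth) if truth else 1.0
    return precision, recall
```

For example, if the human grouping is {q1, q2} and {q3, q4} but an algorithm groups q3 with q1 and q2, only one of its three generated pairs is correct, giving precision 1/3 and recall 1/2.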

To visualize the correctness of each algorithm, we generate Precision vs. Recall graphs. As we vary the certainty parameter, starting with only the questions that are most certain, we get larger and larger sets of question pairings. For each setting of the certainty parameter, we evaluate the precision and recall of the generated pairings against the correct set.
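One way to generate those Precision vs. Recall points is to rank the generated pairs by certainty and sweep a threshold from most to least certain. The sketch below is an assumed implementation of this sweep, not the original code:

```python
def pr_curve(true_pairs, scored_pairs):
    """One (precision, recall) point per certainty threshold.

    `true_pairs`: set of correct question pairs.
    `scored_pairs`: list of (pair, certainty) produced by an algorithm.
    """
    ranked = sorted(scored_pairs, key=lambda x: -x[1])  # most certain first
    points = []
    kept = set()
    for pair, _certainty in ranked:
        kept.add(pair)                       # lower the threshold by one pair
        correct = len(kept & true_pairs)
        points.append((correct / len(kept),          # precision
                       correct / len(true_pairs)))   # recall
    return points
```

Plotting these points traces the parametric accuracy curve: recall grows as the threshold loosens, typically at the cost of precision.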

With the assumption that the flaw lay in the lack of topic structure in the data, rather than in an inability of numerous published algorithms to perform, we gathered score data from many people on questions with known topics. We used two 40-question quizzes, one drawn from Trivial Pursuit questions and one drawn from questions out of SAT Subject Test study guides. Both quizzes had 4 topics with 10 questions each. The Trivial Pursuit quiz tested Science and Nature, Sports and Leisure, Arts and Entertainment, and Geography; the SAT quiz tested Math, Biology, World History, and French. Both were organized as short-answer tests, where the correct answers were two words or less. A string-matching algorithm was used to compare given answers against the list of known good answers to discount misspellings as a source of error, and all of the "incorrect" answers were checked by hand to ensure that they were in fact incorrect. These quizzes were available online over a period of about a week. The Trivia quiz was completed by 467 participants; the Academic quiz, by 297.
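The original text does not specify which string-matching algorithm was used; a minimal stand-in using the standard library's `difflib` similarity ratio (with an assumed, illustrative threshold) looks like this:

```python
import difflib

def matches_any(answer, accepted, threshold=0.8):
    """True if a respondent's answer is close enough to any accepted answer.

    The 0.8 similarity threshold and the SequenceMatcher measure are
    illustrative assumptions; the paper's actual algorithm is unspecified.
    """
    ans = answer.strip().lower()
    return any(
        difflib.SequenceMatcher(None, ans, good.strip().lower()).ratio() >= threshold
        for good in accepted
    )
```

Under such a scheme a transposition like "Einstien" still matches "Einstein", so spelling slips are not scored as wrong answers, while unrelated answers fall well below the threshold.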

The results are fairly conclusive. The Academic data clearly confirms that some topics draw on a deeper, more integrated level of understanding, while others rest on isolated bits of knowledge. The Math and French questions from the Academic quiz can be easily separated by most of our algorithms, with accuracy greater than 85%. Even when the number of "students" in the dataset is reduced to normal class sizes of 20 to 30, there is sufficient structure in the Math and French data to separate these two subjects. On the full four-topic data there is more noise, but the results are still significantly better than random chance: depending on the algorithm used for the grouping, precision remains in the 50-60% range.
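As a toy illustration of the kind of separation involved, the following sketch (our own simplification, not one of the evaluated algorithms) splits questions into two groups using the sign pattern of the leading eigenvector of the question-question covariance matrix, in the spirit of spectral methods:

```python
import numpy as np

def split_two_topics(scores, iters=200):
    """Split questions (columns of a students x questions 0/1 score matrix)
    into two groups via the leading eigenvector of the centered covariance
    matrix -- a bare-bones stand-in for published spectral approaches."""
    X = scores - scores.mean(axis=0)   # center each question's scores
    C = X.T @ X                        # question-question covariance
    np.fill_diagonal(C, 0)             # ignore self-similarity
    v = np.arange(1, C.shape[1] + 1, dtype=float)  # asymmetric start vector
    for _ in range(iters):             # power iteration -> leading eigenvector
        v = C @ v
        v /= np.linalg.norm(v)
    return (v > 0).astype(int)         # sign of each entry gives the group
```

On data with strong topic structure (e.g., students who know Math but not French, and vice versa), within-topic covariances are positive and cross-topic covariances negative, so the leading eigenvector's signs recover the two topics; on trivia-like data with no such structure, the split is essentially arbitrary.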

In contrast, the data from the Trivia test behaves poorly. Success in answering trivia questions does not require deep knowledge of the topic, only isolated fact retrieval. The trivia data, even with nearly 500 "students" in the dataset, is effectively inseparable: most candidate algorithms produce results no better than random chance, similar to what we observed on the original class data.

- Student knowledge *in class*, being tested on material only recently learned, has not yet been integrated into the whole of their learning.
- Topic is not the major factor in whether a student can answer a question correctly.
- Topic only matters in situations where a topic was unstudied or completely forgotten (like Math or French).
- The trivia data is somehow a fluke.

- ABET - Accreditation Board for Engineering and Technology
- Barnes, T. *The Q-Matrix Method of Fault Tolerant Teaching in Knowledge Assessment and Data Mining.* PhD thesis, North Carolina State University, 2003.
- Gentle, J. E. "Singular Value Factorization." Section 3.2.7 in *Numerical Linear Algebra for Applications in Statistics*. Berlin: Springer-Verlag, pp. 102-103, 1998.
- Lee, D., Seung, H. S. Algorithms for Non-Negative Matrix Factorization. In Advances in Neural Information Processing Systems, v. 13, 2001.
- Ng, A., Jordan, M., Weiss, Y. On Spectral Clustering: Analysis and an Algorithm. In Advances in Neural Information Processing Systems, v. 14, 2002.
- Spearman, C. *General Intelligence, Objectively Determined and Measured*. American Journal of Psychology, v. 15, pp. 201-293, 1904.
- Winters, T., Payne, T. What Do Students Know? In Proceedings of the First International Computing Education Research Workshop (ICER '05), 2005.