Hierarchical Clustering of Multiple Choice Questions

Process

Input

Answers (a-f) to 48 questions from 5 quizzes in CS1 (UCR CS10) for ~220 students.

Processing

(The processing scripts and raw data used in this analysis are available here.)
  1. First I removed the students who didn't take every quiz. This left 125 students.
  2. Next I calculated the mutual information between each pair of questions as a similarity measure (higher values mean greater relation). This yielded a 48x48 matrix of M.I. values. (A sketch of this computation appears after this list.)
  3. Then I performed hierarchical clustering on these values:
    1. Initialize each question as its own singleton cluster.
    2. Repeat n-1 = 47 times:
      1. Find the two clusters a, b that have the highest similarity (M.I.).
      2. Merge those clusters by removing their rows/columns from the matrix and adding a new row and column whose ith entry is min(a[i], b[i]). The similarity of the new cluster to each remaining cluster is thus given by its _least_ similar member: complete linkage. (See the second sketch after this list.)
  4. I then output the resulting hierarchical clustering in the format used by DDraw, a Java package for drawing dendrograms (it can also perform hierarchical clustering in 2-D Euclidean spaces).
  5. The resulting tree is shown below. Terminal nodes of the form tiqj represent the jth question on quiz i. The text of those questions follows the tree.
  6. Two nodes that are connected in this tree (such as t3q4 and t2q6, at the top) are MORE SIMILAR if the horizontal distance between them is SMALL. Thus, t3q8 and t3q1 are slightly more similar than t3q4 and t2q6.
  7. The distances between nodes are plotted as 1/(M.I. + 0.1), since DDraw wants small distances to mean similar and large distances to mean dissimilar. These plotted distances are therefore in no way linear in the M.I. values.
  8. t5q9 and t5q10 are the best match, with an M.I. of 0.6. (Meaning if you know a student's answer to one of them, you know 0.6 bits about their answer to the other. Since there are 5 choices, a uniformly random answer carries log2(5) ≈ 2.32 bits, so a perfect correlation would be ~2.3 bits.)
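
To make step 2 concrete, here is a minimal sketch in Python of the pairwise M.I. computation. It assumes the quiz data has already been loaded into a num_students x 48 integer array of encoded answer choices; the names (answers, mutual_information, mi_matrix) are hypothetical, not taken from the original scripts.

    import numpy as np
    from collections import Counter

    def mutual_information(x, y):
        # Empirical mutual information (in bits) between two answer columns.
        n = len(x)
        px, py = Counter(x), Counter(y)
        pxy = Counter(zip(x, y))
        mi = 0.0
        for (a, b), count in pxy.items():
            p_ab = count / n
            mi += p_ab * np.log2(p_ab / ((px[a] / n) * (py[b] / n)))
        return mi

    def mi_matrix(answers):
        # answers: (num_students x num_questions) array of encoded choices.
        q = answers.shape[1]
        m = np.zeros((q, q))
        for i in range(q):
            for j in range(q):
                m[i, j] = mutual_information(answers[:, i], answers[:, j])
        return m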
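
Similarly, here is a sketch of the complete-linkage loop from step 3, run directly on the similarity matrix (so the best pair is the argmax, and merged rows are combined with min). This is a reconstruction under the same assumptions, not the original code.

    def complete_linkage(sim):
        # sim: square matrix of pairwise M.I. values (higher = more similar).
        sim = sim.astype(float).copy()
        np.fill_diagonal(sim, -np.inf)      # never merge a cluster with itself
        clusters = [[i] for i in range(sim.shape[0])]
        merges = []
        while len(clusters) > 1:
            # Find the two most similar clusters.
            i, j = np.unravel_index(np.argmax(sim), sim.shape)
            i, j = min(i, j), max(i, j)
            merges.append((clusters[i], clusters[j], sim[i, j]))
            # The merged cluster's similarity to every other cluster is the
            # minimum of its members' similarities: complete linkage.
            merged_row = np.minimum(sim[i], sim[j])
            sim[i, :] = merged_row
            sim[:, i] = merged_row
            sim = np.delete(np.delete(sim, j, axis=0), j, axis=1)
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]
        return merges

The merge similarities recorded here can then be mapped to the DDraw distances of step 7 as 1/(mi + 0.1).
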
[Figure: 48-node dendrogram]