Generalized Linear Discriminant Analysis
Conventional linear discriminant analysis (LDA) requires that the
within-class scatter matrix Sw be nonsingular. In many applications,
however, such as cancer classification with gene expression profiling,
face recognition, and web document classification, Sw is singular
because of the small sample size problem, i.e., the number of samples is
smaller than the dimensionality of the data. To solve this problem, we
propose generalized linear discriminant analysis (GLDA), a general,
direct, and complete solution for optimizing a modified Fisher criterion.
Unlike conventional LDA, GLDA does not assume that Sw is nonsingular and
therefore overcomes the small sample size problem. This is achieved by
carefully investigating the properties of the scatter matrices. GLDA is
mathematically well-founded and coincides with conventional LDA when
Sw is nonsingular. To accommodate very high-dimensional datasets, a fast
algorithm for GLDA is also developed. Extensive experiments on cancer
classification show that our method outperforms widely used classifiers
such as support vector machines, random forests, and k-nearest neighbors,
especially on datasets with many classes and very high dimensionality.
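To see why Sw becomes singular in the small sample size setting, note that Sw is a sum of outer products of centered samples, so its rank is at most n minus the number of classes, which is below the dimensionality p whenever n < p. The following Python check illustrates this (a sketch for illustration only; the paper's implementation is in R):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 100                      # 10 samples, 100 features: n < p
X = rng.normal(size=(n, p))
y = np.repeat([0, 1], 5)            # two classes of 5 samples each

# within-class scatter: sum of centered outer products per class
Sw = np.zeros((p, p))
for c in np.unique(y):
    Xc = X[y == c] - X[y == c].mean(axis=0)
    Sw += Xc.T @ Xc

# rank(Sw) <= n - (number of classes) = 8 < p, so Sw is singular
rank = np.linalg.matrix_rank(Sw)
```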
Here is an R implementation of GLDA.
The function glda.predict(train.x, train.y, test.x) trains the model on
the training data <train.x, train.y> and predicts on the test data
test.x, where train.x is a matrix whose rows are samples, train.y is a
vector of the corresponding class labels, and test.x is the test data
matrix. The function returns a vector of predicted labels for the test
data.
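The R source itself is not reproduced here. As a rough illustration of the interface described above, the following Python sketch mimics glda.predict using the Moore-Penrose pseudoinverse of Sw to handle singularity; this pseudoinverse approach is an assumption for illustration and may differ in detail from the paper's GLDA derivation:

```python
import numpy as np

def glda_predict(train_x, train_y, test_x):
    """Sketch of a GLDA-style classifier (not the paper's exact algorithm).

    Builds scatter matrices, takes discriminant directions from the
    eigenvectors of pinv(Sw) @ Sb (pseudoinverse handles singular Sw),
    then classifies test samples by the nearest class centroid in the
    projected space.
    """
    classes = np.unique(train_y)
    p = train_x.shape[1]
    mean_all = train_x.mean(axis=0)

    # within-class (Sw) and between-class (Sb) scatter matrices
    Sw = np.zeros((p, p))
    Sb = np.zeros((p, p))
    for c in classes:
        Xc = train_x[train_y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        d = (mc - mean_all).reshape(-1, 1)
        Sb += len(Xc) * (d @ d.T)

    # discriminant directions: leading eigenvectors of pinv(Sw) @ Sb
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-vals.real)
    W = vecs[:, order[: len(classes) - 1]].real

    # nearest-centroid classification in the projected space
    Z = train_x @ W
    centroids = np.array([Z[train_y == c].mean(axis=0) for c in classes])
    Zt = test_x @ W
    dists = ((Zt[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(dists, axis=1)]
```

As in the R interface, rows of train_x and test_x are samples and train_y holds the corresponding labels; the return value is the vector of predicted labels for test_x.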
Please send comments and questions to Haifeng Li.