The least squares twin support vector machine (LSTSVM) is a popular kernel-based SVM formulation for binary classification tasks. LSTSVM learns linear/nonlinear classification boundaries efficiently, since training requires only the solution of two systems of linear equations. LSTSVM has been applied to text categorization with a simple bag-of-words representation and conventional feature selection. The disadvantage of this approach is that discarding the lowest-ranked features loses information and is therefore likely to hurt classification performance. However, since LSTSVM training involves solving linear systems whose size equals the input space dimension, it is extremely important to keep the input dimension small, and hence not all features can be retained for training. There is thus a need to learn a dense concept that combines many features, rather than discarding them, into a compact representation. Distributional clustering of words is an efficient alternative to traditional feature selection measures: unlike feature selection measures, which discard low-ranked features, it generates an extremely compact representation of text documents in a word-cluster space. It has been shown that, despite the reduced dimensionality, SVM classification performance with this new representation is on par with or better than with the traditional bag-of-words. In this paper, we propose a new text categorization system that combines distributional clustering of words for document representation with a linear LSTSVM for document classification. We verify its effectiveness through experiments on two benchmark text corpora, WebKB and SRAA, comparing its results with SVMlight-based classification in a similar setting. © 2015 IEEE.
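The claim that LSTSVM training reduces to two linear systems can be sketched as follows. This is a minimal NumPy illustration of the standard linear LSTSVM formulation (two nonparallel hyperplanes, each fit by one regularized least-squares solve), not the authors' implementation; the function names, the synthetic data, and the parameter defaults `c1 = c2 = 1.0` are assumptions for illustration.

```python
import numpy as np

def lstsvm_fit(A, B, c1=1.0, c2=1.0):
    """Fit two nonparallel hyperplanes by solving two linear systems.

    A: (m1, n) samples of class +1; B: (m2, n) samples of class -1.
    c1, c2: regularization trade-off parameters.
    Returns ((w1, b1), (w2, b2)) for the two hyperplanes.
    """
    # Augment each data matrix with a column of ones to absorb the bias term.
    E = np.hstack([A, np.ones((A.shape[0], 1))])
    F = np.hstack([B, np.ones((B.shape[0], 1))])
    e1 = np.ones(A.shape[0])
    e2 = np.ones(B.shape[0])
    # System 1: hyperplane close to class +1 and far from class -1.
    z1 = -np.linalg.solve(F.T @ F + (1.0 / c1) * (E.T @ E), F.T @ e2)
    # System 2: hyperplane close to class -1 and far from class +1.
    z2 = np.linalg.solve(E.T @ E + (1.0 / c2) * (F.T @ F), E.T @ e1)
    return (z1[:-1], z1[-1]), (z2[:-1], z2[-1])

def lstsvm_predict(X, plane1, plane2):
    """Assign each row of X to the class of the nearer hyperplane."""
    (w1, b1), (w2, b2) = plane1, plane2
    d1 = np.abs(X @ w1 + b1) / np.linalg.norm(w1)
    d2 = np.abs(X @ w2 + b2) / np.linalg.norm(w2)
    return np.where(d1 <= d2, 1, -1)
```

Each solve involves an (n+1)-by-(n+1) matrix, where n is the input dimension, which is why the abstract stresses keeping the document representation compact before training.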