Improving Text Classification Performance with the Analysis of Lexical Dependencies and Class-based Feature Selection
In this thesis, we present a comprehensive analysis of feature extraction and feature selection techniques for the text classification problem, aiming at more successful results with much smaller feature vector sizes. For feature extraction, 36 different lexical dependencies are included in the feature vector and analyzed independently as an extension to the standard bag-of-words approach. The feature selection analysis is twofold. In the first stage, the pruning implementation is analyzed and optimal pruning levels are extracted with respect to dataset properties and feature variations (words, dependencies, and combinations of the leading dependencies). In the second stage, we compare the performance of corpus-based and class-based approaches to feature selection coverage and then extend the pruning implementation with optimized class-based feature selection. For the final and most advanced test, we combine the optimal use of the leading dependencies for each experimented dataset with the two-stage (corpus- and class-based) feature selection approach.
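The two-stage selection described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the pruning threshold, the chi-square scoring function, and the per-class top-k cutoff are all assumptions made for the example, since the abstract does not specify them.

```python
from collections import Counter

def prune(docs, min_df=2):
    # Pruning stage (first stage): drop terms that occur in fewer
    # than `min_df` documents across the whole corpus.
    df = Counter(t for doc in docs for t in set(doc))
    return {t for t, c in df.items() if c >= min_df}

def class_based_select(docs, labels, vocab, k=2):
    # Class-based stage (second stage): score each surviving term
    # separately for every class (chi-square is assumed here) and
    # keep the union of the top-k terms of each class, so every
    # class contributes features to the final vector.
    n = len(docs)
    selected = set()
    for cls in set(labels):
        scores = {}
        for t in vocab:
            a = sum(1 for d, y in zip(docs, labels) if t in d and y == cls)
            b = sum(1 for d, y in zip(docs, labels) if t in d and y != cls)
            c = sum(1 for d, y in zip(docs, labels) if t not in d and y == cls)
            e = n - a - b - c
            denom = (a + b) * (c + e) * (a + c) * (b + e)
            scores[t] = n * (a * e - b * c) ** 2 / denom if denom else 0.0
        selected |= {t for t, _ in
                     sorted(scores.items(), key=lambda x: -x[1])[:k]}
    return selected
```

A corpus-based alternative would rank all terms on a single global score list; the class-based variant above instead guarantees each class its own quota of features.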
For performance evaluation, we use state-of-the-art measures for text classification problems: two different success score metrics and three different significance tests. With respect to these measures, the results reveal that each extension of the methods yields a corresponding significant improvement. The most advanced method, which combines the leading dependencies with optimal pruning levels and an optimal number of class-based features, mostly outperforms the other methods in terms of success rates with reasonable feature sizes. To the best of our knowledge, this is the first study that makes such a detailed analysis of extracting individual dependencies and employing feature selection with a two-stage selection approach in text classification and, more generally, in the text domain.
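The abstract does not name its two success score metrics; a common pairing in text classification evaluation is micro- and macro-averaged F1, sketched below. Treating these as the assumed metrics, the example shows how the two averages differ: micro-F1 pools counts over all classes, while macro-F1 weights every class equally.

```python
def f1_scores(y_true, y_pred):
    # Per-class true positives, false positives, and false negatives.
    classes = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro-F1: unweighted mean of per-class F1 scores.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    # Micro-F1: F1 computed from counts summed over all classes.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro
```

When class sizes are skewed, the gap between the two averages indicates how much a classifier's success depends on the large classes, which is one reason evaluations report both.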