Şerafettin Taşcı, M.S. thesis, 2008

Thesis Title

An Evaluation of Existing and New Feature Selection Metrics in Automatic Text Categorization


Abstract

In recent years, the number of documents available in electronic form, such as electronic books, digital libraries and email messages, has increased rapidly. As a result, the task of organizing and manipulating these resources has gained importance and become more difficult. Automatic text categorization is widely used for organizing and manipulating such documents. However, since the data in text categorization are very high-dimensional, feature selection is crucial to make the task more efficient and precise.

In this study, we make an extensive evaluation of the feature selection metrics used in text categorization under local and global policies. For the experiments, we use seven datasets which vary in size, complexity and skewness. We use SVM as the classifier and tf-idf for term weighting. We observe that, for almost all metrics and datasets, the local policy performs better when the number of keywords is low and the global policy performs better as the number of keywords increases.
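
As an illustration of the two policies, the following is a minimal sketch and not the implementation used in the thesis. It assumes the common reading of the terms: a local policy keeps a separate top-k keyword list for each class, while a global policy merges the class-based scores of each term (here by taking the maximum) and keeps a single top-k list. The scoring metric (a signed, one-sided variant of chi-square often called the correlation coefficient), the function names and the toy documents are all assumptions made for the example.

```python
import math


def correlation_coefficient(n11, n10, n01, n00):
    """Signed chi-square variant for a term/class 2x2 table.

    n11: docs in the class containing the term, n10: docs outside the class
    containing it, n01: docs in the class without it, n00: the rest.
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return math.sqrt(n) * (n11 * n00 - n10 * n01) / math.sqrt(denom)


def term_class_scores(docs, labels):
    """Return {class: {term: score}}; each doc is a set of terms."""
    vocab = sorted({t for d in docs for t in d})
    scores = {}
    for c in sorted(set(labels)):
        scores[c] = {}
        for t in vocab:
            n11 = sum(1 for d, y in zip(docs, labels) if t in d and y == c)
            n10 = sum(1 for d, y in zip(docs, labels) if t in d and y != c)
            n01 = sum(1 for d, y in zip(docs, labels) if t not in d and y == c)
            n00 = sum(1 for d, y in zip(docs, labels) if t not in d and y != c)
            scores[c][t] = correlation_coefficient(n11, n10, n01, n00)
    return scores


def top_k(term_scores, k):
    """Highest-scoring k terms; ties are broken alphabetically."""
    ranked = sorted(term_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]


def select_local(scores, k):
    """Local policy: one keyword list per class."""
    return {c: top_k(s, k) for c, s in scores.items()}


def select_global(scores, k):
    """Global policy: one keyword list, using each term's max class score."""
    best = {}
    for s in scores.values():
        for t, v in s.items():
            best[t] = max(best.get(t, float("-inf")), v)
    return top_k(best, k)


# Toy corpus (hypothetical data, for illustration only).
docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "party"}, {"vote", "law"}]
labels = ["sport", "sport", "politics", "politics"]
scores = term_class_scores(docs, labels)
local_keywords = select_local(scores, 1)
global_keywords = select_global(scores, 2)
```

On this toy corpus the local policy keeps "goal" for the sport class and "vote" for the politics class, and the global policy keeps the single list "goal", "vote". Other ways of merging class-based scores, such as an average, fit the same structure.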

In addition to evaluating the existing feature selection metrics, we propose new metrics which show high success rates, especially with a low number of keywords. Moreover, we propose a keyword selection framework called Adaptive Keyword Selection (AKS). It is based on selecting a different number of keywords for each class, and it improves performance significantly on skewed datasets.
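
The abstract does not describe how AKS decides the number of keywords for each class, so the following is only a hypothetical sketch of the general idea of per-class keyword counts, not the AKS procedure itself. It assumes that each class already has a ranked keyword list and that some validation score is available through a caller-supplied `evaluate` callback; the function name, the callback and the toy scoring rule are all invented for the example.

```python
def choose_keyword_counts(ranked_terms, candidate_sizes, evaluate):
    """Pick, for each class, the keyword count with the best score.

    ranked_terms: {class: [term, ...]}, most informative term first.
    candidate_sizes: keyword counts to try for every class.
    evaluate: hypothetical callback (class, keywords) -> score,
              where a higher score is better.
    """
    chosen = {}
    for c, terms in ranked_terms.items():
        best_k, best_score = None, float("-inf")
        for k in sorted(candidate_sizes):
            score = evaluate(c, terms[:k])
            if score > best_score:  # on ties, the smaller k is kept
                best_k, best_score = k, score
        chosen[c] = best_k
    return chosen


# Toy setup (hypothetical): a class with one useful keyword and a class
# with three useful keywords; every selected keyword costs a small penalty.
ranked = {"rare": ["a", "b", "c", "d"], "common": ["w", "x", "y", "z"]}
useful = {"rare": {"a"}, "common": {"w", "x", "y"}}


def toy_score(c, keywords):
    hits = sum(1 for t in keywords if t in useful[c])
    return hits - 0.1 * len(keywords)


counts = choose_keyword_counts(ranked, [1, 2, 3, 4], toy_score)
```

With this toy scoring rule the class with a single useful keyword ends up with one keyword and the other class with three, which is the kind of per-class variation a fixed keyword count cannot express.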
 
 
Boğaziçi University Department of Computer Engineering