
M.S. Thesis
Arzucan Özgür, 2004

Thesis Title

Supervised and Unsupervised Machine Learning Techniques for Text Document Categorization


Automatic organization of documents has become an important research issue since the explosion of digital and online text information. Two main machine learning approaches address this task: the supervised approach, in which pre-defined category labels are assigned to documents based on the likelihood suggested by a training set of labeled documents; and the unsupervised approach, which requires no human intervention or labeled documents at any point in the process.

In this study we compare and evaluate the leading supervised and unsupervised techniques for document organization, using several standard performance measures and five standard document corpora. Among the unsupervised techniques we evaluated, k-means and bisecting k-means perform best in terms of both time complexity and the quality of the clusters produced. Among the supervised techniques, support vector machines achieve the highest performance, while naive Bayes performs the worst. Finally, we compare the supervised and unsupervised techniques in terms of the quality of the clusters they produce. Contrary to our expectations, k-means and bisecting k-means, although unsupervised, produce clusters of higher quality than the supervised naive Bayes technique. Furthermore, the overall similarities of the clustering solutions obtained by the unsupervised techniques are higher than those of the supervised techniques. We suggest that this may be due to outliers in the training set, and we propose using unsupervised techniques to support the tasks of pre-defining the categories and labeling the documents in the training set.
Boğaziçi University Department of Computer Engineering