Developing a Concept Extraction System for Turkish
In recent years, due to growing vast amount of available electronic media and data, the necessity of analyzing electronic documents automatically is increased. In order to assess if a document contains valuable information or not, concepts, key phrases or main idea of the document have to be known. There are some studies on extracting key phrases or main ideas of documents for Turkish. However, to the best of our knowledge, there is no concept extraction system for Turkish although there are some studies for foreign languages.
In this thesis, a concept extraction system is proposed for Turkish. Since Turkish characters do not fit with the computer language and Turkish is an agglutinative and complex language a pre-processing step is needed. After pre-processing step, only nouns of corpus, which are cleared from their inflectional morphemes, are used because most concepts are defined by nouns or noun phrases. In order to define documents with concepts, clustering nouns is considered to be useful. By applying some statistical methods and NLP methods, documents are identified by concepts. Several tests are done on the corpus that is tested in the bases of words, clusters, and concepts. As a result, the system generates concepts with 51 per cent success, but unfortunately it generates more concepts than it should be. Since concepts are abstract entities, in other words they do not have to be written in the texts as they appear, assigning concepts is a very difficult issue. Moreover, if we take into account the complexity of the Turkish language this result can be seen as quite satisfactory.