Computer Processing of Turkish: Morphological and Lexical Investigation
This thesis addresses the morphological analysis of Turkish. Turkish is an agglutinative language, and as a result its morphology is complex and includes many exceptional cases. Most recent research on Turkish morphology has been limited to a partial treatment of the language, concentrating mainly on the explanation and representation of the basic rules. The main objective of this thesis is to bring the full morphological structure of Turkish to light and to build its computer representation. Without such an analysis, syntactic or semantic parsing of the language is not feasible.
In this study, we divide the analysis of morphology into two interrelated parts: morphophonemic analysis and morphotactic analysis. We investigate and define the morphological structure for each, and then combine the two in the Augmented Transition Network (ATN) formalism, which gives a formal representation of Turkish morphological structure. This representation forms a basis for language applications for Turkish. Among these applications, we design and implement a morphological parser and a spelling checker that incorporates a spelling corrector component.
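The split between morphophonemics (how a suffix is realized on the surface) and morphotactics (which suffixes may follow which) can be illustrated with a minimal Python sketch. It is not the ATN of this thesis: it is a plain transition network without registers or tests, and the four-word lexicon, the state names, the tags PL and LOC, and the functions `realize` and `parse` are all assumptions made for illustration. Only two well-known rules are modeled: two-way vowel harmony (A becomes a or e) and the d/t alternation after voiceless consonants (D becomes d or t).

```python
# Toy illustration only: a plain transition network, not the thesis's ATN.
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")
VOICELESS = set("fstkçşhp")

# Hypothetical mini-lexicon of noun roots.
ROOTS = {"ev", "kitap", "göz", "okul"}

# Morphotactics: state -> list of (abstract suffix, next state, tag).
# "A" and "D" are archiphonemes resolved by realize().
ARCS = {
    "NOUN": [("lAr", "PLURAL", "PL"), ("DA", "CASE", "LOC")],
    "PLURAL": [("DA", "CASE", "LOC")],
    "CASE": [],
}


def last_vowel(stem):
    """Return the last vowel of stem, or None if it has no vowel."""
    for ch in reversed(stem):
        if ch in BACK_VOWELS or ch in FRONT_VOWELS:
            return ch
    return None


def realize(stem, suffix):
    """Morphophonemics: turn an abstract suffix into its surface form."""
    vowel = last_vowel(stem)
    out = []
    for ch in suffix:
        if ch == "A":  # two-way vowel harmony
            out.append("a" if vowel in BACK_VOWELS else "e")
        elif ch == "D":  # d/t alternation after a voiceless consonant
            out.append("t" if stem[-1] in VOICELESS else "d")
        else:
            out.append(ch)
    return "".join(out)


def _walk(prefix, word, state, labels, results):
    """Follow arcs from state; every state is accepting in this toy network."""
    if prefix == word:
        results.append("+".join(labels))
        return
    for suffix, next_state, tag in ARCS[state]:
        candidate = prefix + realize(prefix, suffix)
        if word.startswith(candidate):
            _walk(candidate, word, next_state, labels + [tag], results)


def parse(word):
    """Return all analyses of word as 'root+TAG+...' strings."""
    results = []
    for i in range(1, len(word) + 1):
        root = word[:i]
        if root in ROOTS:
            _walk(root, word, "NOUN", [root], results)
    return results


print(parse("evlerde"))   # ev + plural + locative
print(parse("kitapta"))   # kitap + locative, with D realized as t
print(parse("evlar"))     # violates vowel harmony, so no analysis
```

A word with no analysis, such as the last example, is the kind of input a spelling checker built on such a parser would flag.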
We perform a statistical analysis of Turkish based on this morphological representation and the implemented programs. The analysis has two parts: lexical and morphological analysis, and corpus analysis. The first uses information about the structural parts of the language; the second deals with the daily usage of the language. For the second, we compile a corpus and run the spelling checker program on it.
The work accomplished in this research consists of the following:
An Augmented Transition Network (ATN) formalism is introduced for Turkish morphology, covering all of the categories and suffixes: 14 categories and about 200 suffixes.
A root lexicon of about 21,500 words and a proper noun lexicon of about 11,500 words are formed in parallel to the ATN formalism.
A parser and a spelling checker (including a spelling corrector) are implemented for Turkish to test the completeness (coverage) and the efficiency of the formalism.
A test environment comprising these elements is produced to study and test morphological properties of Turkish.
The lexicon is analyzed to obtain statistics on the structural and usage patterns of Turkish morphology.
A corpus of about 2,200,000 words, currently the largest corpus of Turkish, is compiled.
This corpus is analyzed to obtain statistical properties of Turkish.
Keywords: Computational linguistics, Natural language processing, Morphological analysis, Turkish, Augmented transition networks, Spelling checking, Corpus