Haşim Sak, 2004

Thesis Title

A Corpus-Based Concatenative Speech Synthesis System for Turkish


Speech synthesis (text-to-speech) is the process of converting written text into machine-generated synthetic speech. Concatenative speech synthesis systems render speech by concatenating pre-recorded speech units. Corpus-based methods (unit selection) select the units to concatenate from a large inventory. This thesis is part of an effort to design and develop an intelligible and natural-sounding corpus-based concatenative speech synthesis system for Turkish. The implemented system contains a relatively simple front-end comprising text analysis, phonetic analysis, and optional use of transplanted prosody. The unit selection algorithm is based on the commonly used Viterbi decoding of the best path through the network of candidate units. The back-end generates the speech waveform using harmonic coding of speech and an overlap-and-add mechanism. In this work, different unit sizes, namely syllables, phones, and half-phones, have been experimented with. Methods for speech corpus design and recording script preparation are explained. A speech model based on harmonic coding of speech has been developed for speech representation and waveform generation. The harmonic coding has enabled us to reduce the unit inventory size by a factor of three. A Viterbi decoding algorithm that uses spectral discontinuity and prosodic mismatch as objective cost measures has been implemented. A Turkish phoneme set has been designed. Text-to-phoneme conversion for Turkish has been worked on, and a pronunciation lexicon of root words has been constructed. A simple text normalization module has been implemented. The importance of prosody in unit selection has been studied by comparing unit selection with transplanted prosody against unit selection with no synthetic prosody modeling. Subjective tests have been carried out to evaluate the quality of the synthesized speech. The final Turkish speech synthesis system achieved a MOS-like score of 4.2 in the listening tests.
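
The Viterbi search over the unit network can be illustrated with a small sketch. The abstract does not give the cost formulas, so the following Python code is an assumption-laden illustration rather than the thesis implementation: the function name `viterbi_unit_selection`, the `target_cost` callback (standing in for prosodic mismatch), and the `join_cost` callback (standing in for spectral discontinuity) are all hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

U = TypeVar("U")  # a candidate unit; its representation is left to the caller


def viterbi_unit_selection(
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[int, U], float],
    join_cost: Callable[[U, U], float],
) -> List[U]:
    """Pick one unit per target position, minimizing total target + join cost.

    candidates[t] lists the inventory units that could realize position t.
    target_cost(t, u) is a hypothetical stand-in for prosodic mismatch.
    join_cost(p, u) is a hypothetical stand-in for spectral discontinuity.
    """
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every target position needs at least one candidate")

    # best[t][j]: cheapest accumulated cost of a path ending at candidates[t][j]
    best = [[target_cost(0, u) for u in candidates[0]]]
    # back[t][j]: index of the predecessor on that cheapest path
    back = [[-1] * len(candidates[0])]

    for t in range(1, len(candidates)):
        row_cost, row_back = [], []
        for u in candidates[t]:
            # Accumulated cost of reaching u through each previous candidate.
            costs = [
                best[t - 1][i] + join_cost(p, u)
                for i, p in enumerate(candidates[t - 1])
            ]
            i_min = min(range(len(costs)), key=costs.__getitem__)
            row_cost.append(costs[i_min] + target_cost(t, u))
            row_back.append(i_min)
        best.append(row_cost)
        back.append(row_back)

    # Start from the cheapest final state and follow the back-pointers.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path: List[U] = []
    for t in range(len(candidates) - 1, -1, -1):
        path.append(candidates[t][j])
        j = back[t][j]
    path.reverse()
    return path
```

As a usage example, each unit could be a tuple such as `(pitch_hz, left_edge, right_edge)`, with `target_cost` measuring the distance of `pitch_hz` from a desired pitch and `join_cost` measuring the mismatch between the `right_edge` of one unit and the `left_edge` of the next. These features are toy placeholders. The dynamic programming keeps the search linear in the number of target positions and quadratic in the number of candidates per position, instead of enumerating every possible unit sequence.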
