Researchers at Edinburgh University have developed a keyword-based pronunciation lexicon for use in text-to-speech synthesis (voice-building or run-time synthesis) and in speech recognition systems.
The lexicon, called Combilex, is available in three versions: Received Pronunciation English, General American, and Scottish English. Each lexicon contains around 145,000 entries, including the 20,000 most frequent words and many proper names, and provides a variety of linguistic information alongside detailed pronunciations.
Combilex is distributed as an ASCII text file with one entry per line. Full, manually annotated orthographic-phonemic correspondences are included, allowing accurate grapheme-to-phoneme rules to be derived.
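To illustrate how aligned spelling-to-sound correspondences can feed grapheme-to-phoneme rules, here is a minimal sketch. The entries, the phoneme symbols and the function names are all hypothetical; they do not reproduce the actual Combilex notation or the researchers' method. The sketch simply counts how often each letter group maps to each phoneme and keeps the most frequent mapping as a candidate rule.

```python
from collections import Counter, defaultdict

# Hypothetical aligned entries: each word is a list of (letters, phoneme) pairs.
# The symbols are illustrative and are NOT the real Combilex notation.
ALIGNED = {
    "ship": [("sh", "S"), ("i", "I"), ("p", "p")],
    "shop": [("sh", "S"), ("o", "Q"), ("p", "p")],
    "chip": [("ch", "tS"), ("i", "I"), ("p", "p")],
    "pin": [("p", "p"), ("i", "I"), ("n", "n")],
    "pine": [("p", "p"), ("i", "aI"), ("n", "n"), ("e", "")],
}


def count_correspondences(aligned):
    """Count how often each letter group is aligned with each phoneme."""
    counts = defaultdict(Counter)
    for pairs in aligned.values():
        for letters, phoneme in pairs:
            counts[letters][phoneme] += 1
    return counts


def most_likely_rules(counts):
    """Pick the most frequent phoneme for each letter group as a candidate rule."""
    return {letters: c.most_common(1)[0][0] for letters, c in counts.items()}


counts = count_correspondences(ALIGNED)
rules = most_likely_rules(counts)
```

A practical rule learner would also condition on neighbouring letters (for example, the final "e" in "pine" changing the vowel), which this context-free sketch ignores.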
The system contains a rich specification for each word, covering pronunciation (with variants), part-of-speech tags, morphological boundaries, full correspondence between orthography and pronunciation, and semantic information where available.
The researchers say Combilex provides greater than 86 per cent accuracy and is accent-independent.
The system is implemented as a database, allowing compact representation of word forms, their morphological derivations, compounds and cross-references. Transcriptions include a phonemic-orthographic link, which supports the development of letter-to-sound rules for out-of-vocabulary words. The transcriptions use a meta-symbol set, which may be converted by rule into appropriate forms for various accents.
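The idea of converting a meta-symbol transcription by rule into accent-specific forms can be sketched as follows. The meta-symbols (BATH, R_CODA), the accent codes, the output symbols and the function name are assumptions made for illustration only; the real Combilex symbol set and conversion rules are not described in this article.

```python
# Hypothetical per-accent rules mapping meta-symbols to surface phonemes.
# An empty string means the meta-symbol is not pronounced in that accent.
ACCENT_RULES = {
    "rp": {"BATH": "A:", "R_CODA": ""},    # Received Pronunciation: non-rhotic
    "gam": {"BATH": "{", "R_CODA": "r"},   # General American: rhotic
    "sco": {"BATH": "a", "R_CODA": "r"},   # Scottish English: rhotic
}


def realise(meta_transcription, accent):
    """Convert a list of meta-symbols into accent-specific phonemes."""
    rules = ACCENT_RULES[accent]
    surface = []
    for symbol in meta_transcription:
        # Symbols without a rule are ordinary phonemes and pass through unchanged.
        phoneme = rules.get(symbol, symbol)
        if phoneme:
            surface.append(phoneme)
    return surface


bath = ["b", "BATH", "T"]
tiger = ["t", "aI", "g", "@", "R_CODA"]
rp_bath = realise(bath, "rp")
gam_bath = realise(bath, "gam")
rp_tiger = realise(tiger, "rp")
sco_tiger = realise(tiger, "sco")
```

With this layout a single stored transcription serves every accent, and supporting another accent would only require adding one more rule table.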
Edinburgh University is seeking interest from commercial organisations to license this technology on a non-exclusive basis.
For more information, see the project’s page at: http://www.university-technology.com/details/combilex---a-keyword-based-pronunciation-lexicon