Automatic phonetic transcription of large speech corpora: a comparative study

Title: Automatic phonetic transcription of large speech corpora: a comparative study
Publication Type: Presentation
Year of Publication: 2006
Conference Name: Summer Meeting on Corpus-based Research
Authors: Van Bael, Christophe
Publisher: Nederlandse Vereniging voor Fonetische Wetenschappen
Conference Location: Nijmegen, The Netherlands

In a recent study, we investigated whether automatic transcription procedures can approximate manually verified phonetic transcriptions typically delivered with contemporary large speech corpora. Ten automatic procedures were used to generate a broad phonetic transcription of well-prepared speech (read-aloud texts) and spontaneous speech (telephone dialogues) from the Spoken Dutch Corpus. The resulting transcriptions were compared to manually verified phonetic transcriptions from the same corpus.
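The comparison of an automatic transcription with a manually verified one can be sketched as a phone-level alignment. The sketch below is only an illustration under stated assumptions: the abstract does not say which measure was used, so the edit-distance alignment, the function name `phone_disagreement`, and the percentage definition (edits divided by the number of reference phones) are all hypothetical choices, not the study's method.

```python
def phone_disagreement(reference, hypothesis):
    """Align two phone sequences and count the discrepancies.

    Hypothetical helper: returns (substitutions, deletions, insertions,
    percent disagreement relative to the reference length).
    """
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = (total edits, subs, dels, ins) for reference[:i] vs hypothesis[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, ins = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                match = (t, s, d, ins)          # identical phones, no cost
            else:
                match = (t + 1, s + 1, d, ins)  # substitution
            t, s, d, ins = cost[i - 1][j]
            delete = (t + 1, s, d + 1, ins)     # reference phone missing
            t, s, d, ins = cost[i][j - 1]
            insert = (t + 1, s, d, ins + 1)     # extra phone in hypothesis
            # Tuples compare on the total edit count first.
            cost[i][j] = min(match, delete, insert)
    total, subs, dels, inss = cost[n][m]
    percent = 100.0 * total / n if n else 0.0
    return subs, dels, inss, percent


# Toy symbols, not data from the Spoken Dutch Corpus.
print(phone_disagreement(list("abcd"), list("axc")))  # (1, 1, 0, 50.0)
```

Keeping substitutions, deletions and insertions separate makes it possible to look at the nature of the discrepancies as well as their number.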

We found that signal-based procedures could not approximate the manually verified phonetic transcriptions. A knowledge-based procedure did not give optimal results either. Quite surprisingly, the best-performing procedure was one in which a canonical transcription was modelled towards the target transcription by means of decision trees and a small sample of manually verified phonetic transcriptions. The number and the nature of the remaining discrepancies were comparable to the inter-labeller disagreements reported in the literature. This implies that future corpus designers should consider automatic transcription procedures as a valid and cheap alternative to expensive human experts.