Maria Teleki, Xiangjue Dong, Soohwan Kim, James Caverlee
Texas A&M University
{mariateleki, xj.dong, cocomox26, caverlee}@tamu.edu
Use the sidebar on the left (or the menu at the top left corner of the page) to navigate to the full annotation guidelines and the disfluent token distributions.
In this work, we evaluate the disfluency capabilities of two automatic speech recognition systems, Google ASR and WhisperX, through a study of 10 human-annotated podcast episodes and a larger set of 82,601 podcast episodes. We employ a state-of-the-art disfluency annotation model to perform a fine-grained analysis of the disfluencies in both the scripted and non-scripted podcasts. On the set of 10 podcasts, we find that while WhisperX tends to perform better overall, Google ASR achieves better WIL (Word Information Lost) and BLEU scores on non-scripted podcasts. We also find that Google ASR's transcripts tend to be closer to the ground-truth count of edited-type disfluent nodes, while WhisperX's transcripts are closer for interjection-type disfluent nodes. The same pattern holds in the larger set. These findings have implications for the choice of an ASR model when building a larger system: the choice should depend on the distribution of disfluent nodes present in the data.
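For readers who want to run a comparison like this on their own transcripts, the sketch below shows one way to compute WIL and BLEU between a ground-truth transcript and ASR hypotheses, and to count EDITED- and INTJ-labeled nodes in a Switchboard-style parse. It is a minimal illustration, not the paper's evaluation code: the transcripts and the bracketed parse are placeholders, and jiwer, sacrebleu, and nltk are our choice of tooling, not necessarily what the authors used.

```python
# Minimal sketch, assuming jiwer, sacrebleu, and nltk are installed.
# The transcripts and the parse below are illustrative placeholders,
# not data from the paper.
import jiwer
from sacrebleu.metrics import BLEU
from nltk.tree import Tree

reference = "i uh i think the show was great"  # hypothetical ground truth
hypotheses = {
    "google_asr": "i i think the show was great",   # drops the filler "uh"
    "whisperx": "i uh i think the show was great",  # keeps the disfluencies
}

bleu = BLEU(effective_order=True)  # effective_order is recommended for sentence-level BLEU
for name, hyp in hypotheses.items():
    wil = jiwer.wil(reference, hyp)                # word information lost: lower is better
    score = bleu.sentence_score(hyp, [reference])  # BLEU: higher is better
    print(f"{name}: WIL={wil:.3f} BLEU={score.score:.1f}")

# Count disfluent nodes in a Switchboard-style parse, where reparanda
# are labeled EDITED and filler words INTJ.
parse = Tree.fromstring(
    "(S (EDITED (NP (PRP I))) (INTJ (UH uh)) (NP (PRP I)) "
    "(VP (VBP think) (SBAR (S (NP (DT the) (NN show)) "
    "(VP (VBD was) (ADJP (JJ great)))))))"
)
for label in ("EDITED", "INTJ"):
    n = sum(1 for t in parse.subtrees() if t.label() == label)
    print(f"{label}: {n}")  # EDITED: 1, INTJ: 1
```

Note that in the paper the EDITED/INTJ annotations come from a disfluency annotation model applied to the transcripts rather than from gold parses, so the counting step would run over that model's output.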
@inproceedings{teleki24_interspeech,
  title     = {Comparing ASR Systems in the Context of Speech Disfluencies},
  author    = {Maria Teleki and Xiangjue Dong and Soohwan Kim and James Caverlee},
  year      = {2024},
  booktitle = {Interspeech 2024},
  pages     = {4548--4552},
  doi       = {10.21437/Interspeech.2024-1270},
}