Time and Tokens: Benchmarking End-to-End Speech Dysfluency Detection

Abstract. Speech dysfluency modeling is the task of detecting dysfluencies in speech, such as repetition, block, insertion, replacement, and deletion. Most recent work treats it as a time-based object detection problem. In this work, we revisit the problem from a new perspective: we tokenize dysfluencies and cast detection as a token-based automatic speech recognition (ASR) problem. We propose rule-based speech and text dysfluency simulators, develop the VCTK-Token dataset, and then build a Whisper-like seq2seq architecture to establish a new benchmark with decent performance. We also systematically compare the proposed token-based methods with time-based methods and propose a unified benchmark to facilitate future research. We open-source these resources for the broader scientific community.



Artificial Simulated Speech

Samples from the VCTK-TTS/VCTK-Token test set.

1. Reference Text: When the sunlight strikes raindrops in the air they act as a prism and form a rainbow.
   Reference Text (phonemes): W EH N DH AH S AH N L AY T S T R AY K S R EY N D R AA P S IH N DH AH EH R DH EY AE K T AE Z AH P R IH Z AH M AH N D F AO R M AH R EY N B OW
   Time-based Method: Type: Repetition, Start: 4.51s, End: 5.51s
   Token-based Method: W EH N DH AH S AH N L AY T S T R AY K S R EY N D R AA P S IH N DH AH EH R DH EY AE K T AE Z AH P R IH Z AH M AH N D F [REP] AO R M AH R EY N B OW

2. Reference Text: Please call Stella.
   Reference Text (phonemes): P L IY Z K AO L S T EH L AH
   Time-based Method: Type: Prolongation, Start: 0.18s, End: 0.77s
   Token-based Method: P L IY [PRO] Z K AO L S T EH L AH

3. Reference Text: Ask her to bring these things with her from the store.
   Reference Text (phonemes): AE S K HH ER T UW B R IH NG DH IY Z TH IH NG Z W IH DH HH ER F R AH M DH AH S T AO R
   Time-based Method: Type: Deletion, Start: 2.65s, End: 2.65s
   Token-based Method: AE S K HH ER T UW B R IH NG DH IY Z TH IH NG Z W IH DH HH ER F R AH M [DEL] DH AH S T AO R

4. Reference Text: Ask her to bring these things with her from the store.
   Reference Text (phonemes): AE S K HH ER T UW B R IH NG DH IY Z TH IH NG Z W IH DH HH ER F R AH M DH AH S T AO R
   Time-based Method: Type: Substitution, Start: 3.36s, End: 3.64s
   Token-based Method: AE S K HH ER T UW B R IH NG DH IY Z TH IH NG Z W IH DH HH ER F R AH M DH AH S T AO [SUB] R

5. Reference Text: Friday night was a pretty good night.
   Reference Text (phonemes): F R AY D IY N AY T W AA Z AH P R IH T IY G UH D N AY T
   Time-based Method: Type: Repetition, Start: 1.42s, End: 3.14s
   Token-based Method: Friday night was a pretty [REP] good night

6. Reference Text: Another High Street retailer was not so lucky.
   Reference Text (phonemes): AH N AH DH ER HH AY S T R IY T R IY T EY L ER W AA Z N AA T S OW L AH K IY
   Time-based Method: Type: Pause, Start: 0.87s, End: 2.84s
   Token-based Method: Another High [PAU] Street retailer was not so lucky

7. Reference Text: The challenge for the industry is to capture and retain that talent.
   Reference Text (phonemes): DH AH CH AE L AH N JH F AO R DH AH IH N D AH S T R IY IH Z T UW K AE P CH ER AH N D R IH T EY N DH AE T T AE L AH N T
   Time-based Method: Type: Deletion, Start: 0.72s, End: 0.72s
   Token-based Method: The challenge for [DEL] the industry is to capture and retain that talent

8. Reference Text: The new clubhouse was packed with people watching the game on TV.
   Reference Text (phonemes): DH AH N UW K L AH B HH AW S W AA Z P AE K T W IH DH P IY P AH L W AA CH IH NG DH AH G EY M AA N T IY V IY
   Time-based Method: /
   Token-based Method: The new clubhouse was [INS] packed with people watching the game on TV
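
The token-based outputs in these samples place a bracketed dysfluency token inside the phoneme or word sequence. A minimal Python sketch of how such an output could be split into ordinary units and dysfluency events is given here. It assumes whitespace-separated units and only the six tokens that appear in the samples; the function name `parse_token_output` is hypothetical and is not taken from the released code. The sketch reports only where a token sits in the sequence and makes no claim about which neighboring unit the token refers to.

```python
# Dysfluency tokens exactly as they appear in the samples on this page.
DYSFLUENCY_TOKENS = {"[REP]", "[PRO]", "[DEL]", "[SUB]", "[PAU]", "[INS]"}


def parse_token_output(sequence):
    """Split a token-based output into ordinary units and dysfluency events.

    Returns (units, events). Each event is (label, position), where position
    is the number of ordinary units that precede the dysfluency token.
    Hypothetical helper for illustration only.
    """
    units = []
    events = []
    for item in sequence.split():
        if item in DYSFLUENCY_TOKENS:
            # Record the label without brackets and where it was found.
            events.append((item.strip("[]"), len(units)))
        else:
            units.append(item)
    return units, events


units, events = parse_token_output("P L IY [PRO] Z K AO L S T EH L AH")
print(events)  # [('PRO', 3)]
```

The same function works on word-level outputs such as "Friday night was a pretty [REP] good night", because it relies only on whitespace splitting.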

PPA Speech

Primary Progressive Aphasia (PPA) is characterized by progressive impairments in speech and language. It is categorized into three main variants, each with distinct clinical features:

(1) Semantic Variant PPA (svPPA). Main feature: loss of word meaning.
(2) Logopenic Variant PPA (lvPPA). Main feature: impaired word retrieval and sentence repetition.
(3) Nonfluent/Agrammatic Variant PPA (nfvPPA). Main feature: difficulty forming grammatically correct sentences and impaired speech production.

We provide some examples for nfvPPA.

1. Reference Text: A long beard clings to his chin.
   Reference Text (phonemes): AH L AO NG B IH R D K L IH NG Z T UW HH IH Z CH IH N
   Time-based Method: Type: Repetition, Start: 0.62s, End: 1.35s
   Token-based Method: AH [REP] L AO NG B IH R D K L IH NG Z T UW HH IH Z CH IH N

2. Reference Text: Yet he still thinks as swiftly as ever.
   Reference Text (phonemes): Y EH T HH IY S T IH L TH IH NG K S AE Z S W IH F T L IY AE Z EH V ER
   Time-based Method: Type: Deletion, Start: 0.94s, End: 0.94s
   Token-based Method: Yet he still [DEL] thinks as swiftly as ever

3. Reference Text: A long beard clings to his chin.
   Reference Text (phonemes): AH L AO NG B IH R D K L IH NG Z T UW HH IH Z CH IH N
   Time-based Method: Type: Pause, Start: 1.26s, End: 1.98s
   Token-based Method: A long [PAU] beard clings to his chin

4. Reference Text: We have often urged him to work more and smoke less.
   Reference Text (phonemes): W IY HH AE V AO F T AH N ER JH D HH IH M T UW W ER K M AO R AH N D S M OW K L EH S
   Time-based Method: /
   Token-based Method: We have often urged him to [INS] work more and smoke less
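
The time-based outputs in these samples consist of a type plus start and end times in seconds. As a rough illustration, the following Python sketch reads such an annotation into a small record and computes its duration. The class `TimeAnnotation` and the function `parse_time_annotation` are hypothetical names chosen for this example, and the text layout it parses is simply the "Type / Start / End" form shown on this page, not a documented file format.

```python
import re
from dataclasses import dataclass


@dataclass
class TimeAnnotation:
    """One time-based dysfluency annotation (hypothetical record type)."""
    label: str
    start: float  # seconds
    end: float    # seconds

    @property
    def duration(self):
        return self.end - self.start


def parse_time_annotation(text):
    """Extract Type, Start, and End from text shaped like the samples here."""
    label = re.search(r"Type:\s*(\w+)", text)
    start = re.search(r"Start:\s*([\d.]+)s", text)
    end = re.search(r"End:\s*([\d.]+)s", text)
    if not (label and start and end):
        raise ValueError("no complete time-based annotation found")
    return TimeAnnotation(label.group(1), float(start.group(1)), float(end.group(1)))


rep = parse_time_annotation("Type: Repetition, Start: 0.62s, End: 1.35s")
print(rep.label, round(rep.duration, 2))  # Repetition 0.73
```

In the samples on this page, Deletion annotations have identical start and end times, so their duration under this sketch is zero.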

Benchmark

We open-source the simulated datasets we created and the trained models as benchmarks.

1. Time-based Method


Dataset

Code and Model


2. Token-based Method


Dataset

Code and Model