Abstract. Speech dysfluency modeling is the task of detecting dysfluencies in speech, such as repetition, block, insertion, replacement, and deletion. Most recent work treats this as a time-based object detection problem. In this work, we revisit the task from a new perspective: we tokenize dysfluencies and model detection as a token-based automatic speech recognition (ASR) problem. We propose rule-based speech and text dysfluency simulators and develop VCTK-Token, and then develop a Whisper-like seq2seq architecture to build a new benchmark with decent performance. We also systematically compare the proposed token-based methods with time-based methods, and propose a unified benchmark to facilitate future research. We open-source these resources for the broader scientific community.
Samples from the VCTK-TTS/VCTK-Token test set.
| Audio | Reference Text | Time-based Method | Token-based Method |
|---|---|---|---|
| | When the sunlight strikes raindrops in the air they act as a prism and form a rainbow. W EH N DH AH S AH N L AY T S T R AY K S R EY N D R AA P S IH N DH AH EH R DH EY AE K T AE Z AH P R IH Z AH M AH N D F AO R M AH R EY N B OW | Type: Repetition, Start: 4.51s, End: 5.51s | W EH N DH AH S AH N L AY T S T R AY K S R EY N D R AA P S IH N DH AH EH R DH EY AE K T AE Z AH P R IH Z AH M AH N D F [REP] AO R M AH R EY N B OW |
| | Please call Stella. P L IY Z K AO L S T EH L AH | Type: Prolongation, Start: 0.18s, End: 0.77s | P L IY [PRO] Z K AO L S T EH L AH |
| | Ask her to bring these things with her from the store. AE S K HH ER T UW B R IH NG DH IY Z TH IH NG Z W IH DH HH ER F R AH M DH AH S T AO R | Type: Deletion, Start: 2.65s, End: 2.65s | AE S K HH ER T UW B R IH NG DH IY Z TH IH NG Z W IH DH HH ER F R AH M [DEL] DH AH S T AO R |
| | Ask her to bring these things with her from the store. AE S K HH ER T UW B R IH NG DH IY Z TH IH NG Z W IH DH HH ER F R AH M DH AH S T AO R | Type: Substitution, Start: 3.36s, End: 3.64s | AE S K HH ER T UW B R IH NG DH IY Z TH IH NG Z W IH DH HH ER F R AH M DH AH S T AO [SUB] R |
| | Friday night was a pretty good night. F R AY D IY N AY T W AA Z AH P R IH T IY G UH D N AY T | Type: Repetition, Start: 1.42s, End: 3.14s | Friday night was a pretty [REP] good night |
| | Another High Street retailer was not so lucky. AH N AH DH ER HH AY S T R IY T R IY T EY L ER W AA Z N AA T S OW L AH K IY | Type: Pause, Start: 0.87s, End: 2.84s | Another High [PAU] Street retailer was not so lucky |
| | The challenge for the industry is to capture and retain that talent. DH AH CH AE L AH N JH F AO R DH AH IH N D AH S T R IY IH Z T UW K AE P CH ER AH N D R IH T EY N DH AE T T AE L AH N T | Type: Deletion, Start: 0.72s, End: 0.72s | The challenge for [DEL] the industry is to capture and retain that talent |
| | The new clubhouse was packed with people watching the game on TV. DH AH N UW K L AH B HH AW S W AA Z P AE K T W IH DH P IY P AH L W AA CH IH NG DH AH G EY M AA N T IY V IY | / | The new clubhouse was [INS] packed with people watching the game on TV |
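
The two annotation styles in the table above can be contrasted in a few lines of Python. This is a minimal sketch, not the released code: the tag inventory is limited to the bracketed tags that appear in the samples, and the names `TimeAnnotation`, `TokenAnnotation`, and `parse_token_sequence` are hypothetical. A time-based label is a type with start and end times, whereas a token-based label is a tag located by its position in the phoneme or word sequence.

```python
import re
from dataclasses import dataclass

# Tags observed in the sample tables; treating this as the full inventory is an assumption.
DYSFLUENCY_TAGS = {"REP", "PRO", "DEL", "SUB", "PAU", "INS"}
TAG_PATTERN = re.compile(r"^\[([A-Z]+)\]$")


@dataclass
class TimeAnnotation:
    """Time-based label: a dysfluency type with start/end times in seconds."""
    kind: str
    start: float
    end: float


@dataclass
class TokenAnnotation:
    """Token-based label: a tag plus the index of the token it follows."""
    tag: str
    after_index: int  # index into the tag-free sequence; -1 if the tag comes first


def parse_token_sequence(sequence: str):
    """Split a tagged sequence into tag-free tokens and a list of tag positions."""
    clean, annotations = [], []
    for token in sequence.split():
        match = TAG_PATTERN.match(token)
        if match and match.group(1) in DYSFLUENCY_TAGS:
            # Record where the tag sits relative to the tokens seen so far.
            annotations.append(TokenAnnotation(match.group(1), len(clean) - 1))
        else:
            clean.append(token)
    return clean, annotations


# The "Please call Stella" sample in both representations.
time_label = TimeAnnotation("Prolongation", 0.18, 0.77)
tokens, token_labels = parse_token_sequence("P L IY [PRO] Z K AO L S T EH L AH")
print(time_label)
print(tokens)
print(token_labels)
```

Here the position is purely structural (the tag follows the third phoneme, `IY`); whether a tag marks the preceding or the following unit as dysfluent is not stated on this page, so the sketch makes no claim about it.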
Primary Progressive Aphasia (PPA) is characterized by progressive impairments in speech and language. It has three main variants, each with distinct clinical features: (1) semantic variant PPA (svPPA), whose main feature is loss of word meaning; (2) logopenic variant PPA (lvPPA), whose main feature is impaired word retrieval and sentence repetition; and (3) nonfluent/agrammatic variant PPA (nfvPPA), whose main feature is difficulty forming grammatically correct sentences together with impaired speech production. We provide some nfvPPA examples below.
| Audio | Reference Text | Time-based Method | Token-based Method |
|---|---|---|---|
| | A long beard clings to his chin. AH L AO NG B IH R D K L IH NG Z T UW HH IH Z CH IH N | Type: Repetition, Start: 0.62s, End: 1.35s | AH [REP] L AO NG B IH R D K L IH NG Z T UW HH IH Z CH IH N |
| | Yet he still thinks as swiftly as ever. Y EH T HH IY S T IH L TH IH NG K S AE Z S W IH F T L IY AE Z EH V ER | Type: Deletion, Start: 0.94s, End: 0.94s | Yet he still [DEL] thinks as swiftly as ever |
| | A long beard clings to his chin. AH L AO NG B IH R D K L IH NG Z T UW HH IH Z CH IH N | Type: Pause, Start: 1.26s, End: 1.98s | A long [PAU] beard clings to his chin |
| | We have often urged him to work more and smoke less. W IY HH AE V AO F T AH N ER JH D HH IH M T UW W ER K M AO R AH N D S M OW K L EH S | / | We have often urged him to [INS] work more and smoke less |
Here we open-source the simulated dataset and the trained models as benchmarks.
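
To make the target format of a token-annotated sample concrete, the sketch below places a single dysfluency tag between two tokens of a clean sequence. It is a toy illustration only: this page does not describe the rules used by the actual simulators, and the function name `simulate_text_dysfluency`, the tag list, and the uniform random choice of position are all assumptions. It also leaves the surrounding text unchanged, which a real simulator may not do.

```python
import random

# Tags seen in the samples on this page; assumed, not an official inventory.
TAGS = ("REP", "PRO", "DEL", "SUB", "PAU", "INS")


def simulate_text_dysfluency(tokens, tag, rng):
    """Return a copy of `tokens` with `[tag]` inserted between two tokens.

    Hypothetical helper: the position is drawn uniformly from the interior
    gaps, so the tag is never the first or last element.
    """
    if tag not in TAGS:
        raise ValueError(f"unknown tag: {tag}")
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to place a tag between them")
    position = rng.randrange(1, len(tokens))  # gap index in 1..len(tokens)-1
    return tokens[:position] + [f"[{tag}]"] + tokens[position:]


rng = random.Random(0)  # fixed seed so repeated runs give the same sample
words = "Please call Stella".split()
print(" ".join(simulate_text_dysfluency(words, "PAU", rng)))
```

The same function works on phoneme sequences, since it only assumes a list of string tokens.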