Self-supervised Speech Representations Still Struggle with African American Vernacular English
Kalvin Chang, Yi-Hui Chou, Jiatong Shi, Hsuan-Ming Chen, Nicole Holliday, Odette Scharenborg, David R. Mortensen
Interspeech 2024
[arXiv]
We demonstrate that state-of-the-art self-supervised speech models systematically underperform on African American Vernacular English (AAVE), provide a detailed analysis of these biases, and highlight the need for more inclusive speech technologies.
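As a rough illustration of the kind of evaluation involved, the sketch below compares ASR word error rate (WER) across dialect groups with the jiwer library; the groups, transcripts, and hypotheses are made-up placeholders, not data or code from the paper.

```python
# Illustrative only: compare ASR word error rate (WER) across dialect groups.
# The transcripts below are placeholders, not data from the paper.
from jiwer import wer

references = {
    "AAVE": ["she been working there for years", "he finna leave"],
    "Mainstream US English": ["she has been working there for years", "he is about to leave"],
}
hypotheses = {
    "AAVE": ["she bean working there for years", "he fine a leave"],
    "Mainstream US English": ["she has been working there for years", "he is about to leave"],
}

for group, refs in references.items():
    print(f"{group}: WER = {wer(refs, hypotheses[group]):.2%}")
```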
MidiBERT-Piano: Large-scale Pre-training for Symbolic Music Understanding
Yi-Hui Chou, I-Chun Chen, Chin-Jui Chang, Joann Ching, Yi-Hsuan Yang
Journal of Creative Music Systems 2024
[arXiv]
[Code]
[Slides]
[Talk]
We built MidiBERT-Piano, a 12-layer Transformer pre-trained on thousands of piano MIDI pieces, which achieves state-of-the-art performance on melody extraction, style classification, and emotion recognition with minimal fine-tuning.
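For readers unfamiliar with BERT-style pre-training on symbolic music, here is a minimal masked-language-model sketch over MIDI event tokens using Hugging Face Transformers; the vocabulary size, tokenization, and mask token id are placeholders, not MidiBERT-Piano's actual configuration (see the linked code for that).

```python
# Minimal sketch of BERT-style masked pre-training on MIDI event tokens.
# Vocabulary size, mask token id, and the random "MIDI events" are placeholders.
import torch
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=800,           # size of the MIDI event vocabulary (placeholder)
    hidden_size=768,
    num_hidden_layers=12,     # 12-layer Transformer, as in the paper
    num_attention_heads=12,
    max_position_embeddings=512,
)
model = BertForMaskedLM(config)

# A fake batch of tokenized MIDI sequences (batch size 4, length 512).
input_ids = torch.randint(5, config.vocab_size, (4, 512))
labels = input_ids.clone()

# Mask 15% of positions; only masked positions contribute to the loss.
mask = torch.rand(input_ids.shape) < 0.15
input_ids[mask] = 1           # assume token id 1 is [MASK] (placeholder)
labels[~mask] = -100          # ignore unmasked positions in the loss

loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()
print(f"masked-LM loss: {loss.item():.3f}")
```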
Evaluating Self-supervised Speech Models on a Taiwanese Hokkien Corpus
Yi-Hui Chou, Kalvin Chang, et al.
IEEE Automatic Speech Recognition and Understanding (ASRU) Workshop 2023
[arXiv]
[Talk]
We developed a Taiwanese Hokkien dataset for the ML-SUPERB benchmark and evaluated self-supervised speech models on it, finding that smaller models can outperform larger ones and that linguistic alignment between the pretraining data and the target language is a critical factor.
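ML-SUPERB-style evaluation typically freezes the pretrained model and trains a lightweight probe on its layer-wise representations. The sketch below extracts those representations from an example wav2vec 2.0 checkpoint; the model name and the random waveform are placeholders, not the models or data benchmarked in the paper.

```python
# Illustrative: extract frame-level features from a frozen self-supervised
# speech model for a lightweight downstream probe. Model and input are placeholders.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("facebook/wav2vec2-base")  # example SSL model
model.eval()

waveform = torch.randn(1, 16000)  # one second of fake 16 kHz audio

with torch.no_grad():
    outputs = model(waveform, output_hidden_states=True)

# One hidden-state tensor per layer; probes are often trained on a learned
# weighted sum of these layers rather than just the final one.
for i, h in enumerate(outputs.hidden_states):
    print(f"layer {i}: {tuple(h.shape)}")
```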
Listener Model for the PhotoBook Referential Game with CLIPScores as Reference Chain
Shih-Lun Wu, Yi-Hui Chou, and Liangze Li
ACL 2023
[arXiv]
[Code]
[Slides]
[Talk]
We developed a reference-chain-free listener model for the PhotoBook collaborative dialogue game, combining a DeBERTa Transformer with CLIPScore features to predict shared images from full-dialogue context. Our approach achieved over 77% accuracy on unseen image sets, outperforming previous models by more than 17 percentage points.
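CLIPScore (Hessel et al., 2021) scores image-text compatibility as a scaled, clipped cosine similarity between CLIP embeddings. The sketch below computes it with Hugging Face's CLIP implementation; the blank image and caption are placeholders, not PhotoBook data or our actual feature pipeline.

```python
# Illustrative: CLIPScore = 2.5 * max(cos(image_emb, text_emb), 0).
# The blank image and caption are placeholders, not PhotoBook data.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))          # placeholder image
caption = "two dogs playing on a beach"       # placeholder dialogue utterance

inputs = processor(text=[caption], images=[image], return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

cos = torch.nn.functional.cosine_similarity(image_emb, text_emb)
clipscore = 2.5 * torch.clamp(cos, min=0)
print(f"CLIPScore: {clipscore.item():.3f}")
```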
Don't speak too fast: The impact of data bias on self-supervised speech models
Yen Meng, Yi-Hui Chou, Andy T. Liu, Hung-yi Lee
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2022, AAAI 2022 SAS workshop
[arXiv]
We investigated how biases in pre-training data (gender, content, and prosody) affect self-supervised speech models. We found that the models are relatively robust to gender imbalance, show little sensitivity to content variation, but clearly prefer slower speech rates, underscoring the importance of balanced and representative pre-training data for robust speech model performance.
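One way to probe sensitivity to speaking rate is to time-stretch utterances and compare downstream performance on the variants. The sketch below generates such variants with librosa; the file path and stretch factors are placeholders, not the paper's actual perturbation setup.

```python
# Illustrative: create slowed and sped-up versions of an utterance to probe how
# a speech model behaves across speaking rates. Path and rates are placeholders.
import librosa
import soundfile as sf

waveform, sr = librosa.load("utterance.wav", sr=16000)  # placeholder path

for rate in (0.8, 1.0, 1.2):  # <1.0 slows speech down, >1.0 speeds it up
    stretched = librosa.effects.time_stretch(waveform, rate=rate)
    sf.write(f"utterance_rate_{rate}.wav", stretched, sr)
    # Each variant would then be fed to the pretrained model and its
    # downstream task performance compared across rates.
```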