Self-supervised Speech Representations Still Struggle with African American Vernacular English

Kalvin Chang, Yi-Hui Chou, Jiatong Shi, Hsuan-Ming Chen, Nicole Holliday, Odette Scharenborg, David R. Mortensen

Interspeech 2024

We demonstrate that state-of-the-art (SOTA) speech models systematically underperform on African American Vernacular English (AAVE), providing a detailed analysis of these biases and highlighting the need for more inclusive speech technologies.

MidiBERT-Piano: Large-scale Pre-training for Symbolic Music Understanding

Yi-Hui Chou, I-Chun Chen, Chin-Jui Chang, Joann Ching, Yi-Hsuan Yang

Journal of Creative Music Systems 2024

We built MidiBERT-Piano, a 12-layer Transformer pre-trained on thousands of piano MIDI pieces, achieving SOTA performance on melody extraction, style classification, and emotion recognition with minimal fine-tuning.

Evaluating Self-supervised Speech Models on a Taiwanese Hokkien Corpus

Yi-Hui Chou, Kalvin Chang, et al.

IEEE Automatic Speech Recognition and Understanding (ASRU) Workshop 2023

We developed a Taiwanese Hokkien dataset for the ML-SUPERB benchmark and evaluated self-supervised speech models on it, revealing that smaller models can outperform larger ones and that linguistic alignment between the pretraining data and the target language is a critical factor.

Listener Model for the PhotoBook Referential Game with CLIPScores as Reference Chain

Shih-Lun Wu, Yi-Hui Chou, and Liangze Li

ACL 2023

We developed a reference chain-free listener model for the PhotoBook collaborative dialogue game, leveraging a DeBERTa Transformer and CLIPScore features to predict shared images from full-dialogue context. Our approach achieved over 77% accuracy on unseen image sets, outperforming previous models by more than 17 percentage points.

Don't speak too fast: The impact of data bias on self-supervised speech models

Yen Meng, Yi-Hui Chou, Andy T. Liu, Hung-yi Lee

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2022; AAAI 2022 SAS Workshop

We investigated how biases in pre-training data (gender, content, and prosody) affect self-supervised speech models. Our findings revealed that models tolerate gender imbalance, show minimal sensitivity to content variation, and prefer slower speech rates, highlighting the importance of balanced and representative training data for robust speech model performance.