Evaluating Lyrics Alignment under Source Separated Conditions

Jiawen Huang, Emmanouil Benetos

Primary Subject: Early Research

Abstract:

This work investigates the performance and robustness of lyrics alignment models under varying vocal conditions. While recent models perform well on either mixtures or separated vocals, real-world applications require robustness without assuming access to a specific separation method. We first extend a previously proposed alignment model to the multilingual setting by expanding the phoneme vocabulary and training on multilingual data. Evaluation on the Multi-Lang Jamendo dataset shows strong performance across multiple languages. To support more detailed evaluation, we introduce a word-level timestamp extension of the MUSDB18 test set. Using this resource, we compare model performance on clean vocal stems and vocals processed by three widely used source separation tools. The results reveal substantial variation depending on the separation method, with alignment accuracy generally improving as vocal quality increases. These findings highlight the limitations of current models under realistic audio conditions and the importance of evaluating alignment systems in diverse scenarios. All resources are publicly released to support further research.
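
As an illustration of the comparison described above, the following is a minimal sketch of how word-level alignment accuracy could be scored across vocal conditions. The abstract does not state which metrics are used, so the two measures here (mean absolute onset error and the fraction of word onsets within a tolerance) are assumptions drawn from common practice in lyrics alignment evaluation, and the 0.3 s tolerance is likewise assumed. The function names, condition labels, and timestamps are hypothetical and are not taken from the paper or its released resources.

```python
from statistics import mean


def onset_errors(reference, predicted):
    """Return absolute onset errors (seconds) between paired word onsets."""
    if not reference:
        raise ValueError("at least one word onset is required")
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same number of words")
    return [abs(r - p) for r, p in zip(reference, predicted)]


def summarize(reference, predicted, tolerance=0.3):
    """Summarize alignment quality for one condition.

    tolerance is an assumed threshold in seconds, not a value from the paper.
    """
    errors = onset_errors(reference, predicted)
    return {
        "mean_abs_error": mean(errors),
        "fraction_within_tolerance": sum(e <= tolerance for e in errors) / len(errors),
    }


# Hypothetical word onsets (seconds) for one song; not data from the paper.
reference = [0.50, 1.20, 2.00, 3.10]
predictions = {
    "clean_stem": [0.52, 1.25, 2.05, 3.00],
    "separator_a": [0.60, 1.70, 2.10, 3.90],
}

# One summary per vocal condition, scored against the same reference.
report = {
    condition: summarize(reference, onsets)
    for condition, onsets in predictions.items()
}

for condition, scores in report.items():
    print(condition, scores)
```

Scoring every condition against the same reference timestamps keeps the comparison between clean stems and separated vocals like for like, so differences in the summary reflect the audio condition rather than the annotation.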