P4-2: SLAP: Siamese Language-Audio Pretraining without negative samples for Music Understanding
Julien Guinot, Alain Riou, Elio Quinton, George Fazekas
Subjects: Multimodality ; Representations of music ; Indexing and querying ; Open Review ; Awards Nominee ; MIR tasks ; Lyrics and other textual data ; MIR fundamentals and methodology ; Musical features and properties
Presented In-person
10-minute long-format presentation
Joint embedding spaces have significantly advanced music understanding and generation by linking text and audio through multimodal contrastive learning. However, these approaches face substantial memory requirements, as they rely on large batch sizes to make effective use of negative samples. Furthermore, multimodal joint embedding spaces suffer from a modality gap, wherein embeddings from different modalities lie on distinct manifolds of the embedding space.
To address these challenges, we propose Siamese Language-Audio Pretraining (SLAP), a novel multimodal pretraining framework that learns powerful representations without negative samples. SLAP adapts the Bootstrap Your Own Latent (BYOL) paradigm to multimodal audio-text training, promoting scalability in training multimodal embedding spaces.
We illustrate the ability of our model to learn meaningful relationships between music and text: specifically, we show that SLAP outperforms CLAP on tasks such as text-music retrieval and zero-shot classification. We also observe competitive downstream performance on several MIR tasks (genre and instrument classification, auto-tagging), even against larger or supervised models.
Additionally, our approach has attractive properties, such as a quantifiably reduced modality gap and improved robustness of retrieval performance to batch-size variations. Finally, its novel formulation unlocks large-scale training on a single GPU through gradient accumulation.
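To make the gradient-accumulation point concrete: a BYOL-style loss decomposes over individual audio-text pairs, so gradients can be summed across micro-batches without changing the objective, whereas InfoNCE couples all pairs in a batch. Below is a minimal, self-contained PyTorch sketch of this idea; the tiny linear layers stand in for the actual encoders (HTS-AT, RoBERTa), and none of the names come from the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative stand-ins for SLAP's components (assumption, not the paper's API).
online_encoder = nn.Linear(128, 64)   # online branch: receives gradients
predictor      = nn.Linear(64, 64)    # asymmetric predictor head
target_encoder = nn.Linear(128, 64)   # target branch: EMA copy, no gradients

optimizer = torch.optim.AdamW(
    list(online_encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

effective_batch, micro_batch = 256, 32
n_micro = effective_batch // micro_batch

optimizer.zero_grad()
for _ in range(n_micro):
    audio = torch.randn(micro_batch, 128)   # dummy audio features
    text  = torch.randn(micro_batch, 128)   # dummy text features

    pred = predictor(online_encoder(audio))
    with torch.no_grad():                   # stop-gradient on the target
        tgt = target_encoder(text)

    # A per-pair regression loss decomposes over samples, so accumulating
    # micro-batch gradients reproduces one large-batch step. InfoNCE would
    # not: its denominator couples the whole batch, which is why contrastive
    # models like CLAP must hold the full batch in memory.
    loss = (2 - 2 * F.cosine_similarity(pred, tgt, dim=-1)).mean() / n_micro
    loss.backward()
optimizer.step()
```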
Q2 ( I am an expert on the topic of the paper.)
Agree
Q3 ( The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work.)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Disagree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Disagree
Q15 (Please explain your assessment of reusable insights in the paper.)
While the proposed method is relevant to the specific task, I am not sure the paper itself provides many insights that go beyond that specific scope. This is a great application of a method from an existing computer vision paper (BYOL), but I don't think it provides many more insights than the original paper does.
Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)
Basically a mix of multimodal contrastive learning and BYOL methods, adapted to the music-text modalities.
Q17 (This paper is of award-winning quality.)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)
Strong accept
Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
This is a very interesting paper proposing to adapt the method from BYOL to the text-music multimodal case. This was worth trying, and the results seem to indicate that there is value in this method.
My main comment is that the paper is at times difficult to fully understand. For instance, the concepts of “online” and “target” representations are key to understanding the method, yet they are not explained in the paper. The reader must go back to the original BYOL paper to understand these concepts. Similarly, understanding why a “stop-gradient” step is included requires reading the original paper. I would therefore recommend providing summarized explanations of all key concepts, in particular those of online and target representations.
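For readers in the same position, here is the gist of those concepts as a generic hedged sketch (not the authors' code; in SLAP the two branches would presumably be the audio and text encoders rather than two augmented views). The "online" network is trained by gradient descent; the "target" network is an exponential moving average (EMA) of the online weights; the stop-gradient ensures the loss backpropagates only through the online branch, which, together with the predictor, prevents the trivial collapsed solution.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

online = nn.Linear(128, 64)          # "online" network: trained by SGD
target = copy.deepcopy(online)       # "target" network: EMA of the online one
for p in target.parameters():
    p.requires_grad = False          # never updated by backprop
predictor = nn.Linear(64, 64)        # asymmetry that helps avoid collapse

def ema_update(tau=0.996):
    # EMA ("momentum") update: the target slowly trails the online weights,
    # providing stable regression goals instead of negative samples.
    with torch.no_grad():
        for po, pt in zip(online.parameters(), target.parameters()):
            pt.mul_(tau).add_(po, alpha=1.0 - tau)

x1, x2 = torch.randn(8, 128), torch.randn(8, 128)  # two views / modalities
pred = predictor(online(x1))        # online branch: gradients flow here
with torch.no_grad():               # "stop-gradient": target gets none
    tgt = target(x2)
loss = (2 - 2 * F.cosine_similarity(pred, tgt, dim=-1)).mean()
loss.backward()                     # updates online + predictor only
ema_update()                        # then move the target a small step
```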
Also, one noticeable difference with BYOL is the absence of data augmentation, which seems to be a core aspect of BYOL. It would be interesting to elaborate in this paper on why data augmentation is not used (beyond the computational gain), and ideally to measure the impact of using versus not using it (although I understand this is probably a tough ask for 1-2 weeks of work; it could, e.g., be mentioned as future work).
Which versions of the GTZAN and MTAT datasets are used exactly? Recent literature has favored the same processed versions (fault-filtered, top 50 tags, etc.), corrected from the originals (https://github.com/jongpillee/music_dataset_split). Is this the case here, or do you use the original datasets? (The latter would be difficult to justify, given the ample literature on the shortcomings of these original datasets.)
An important paper on the topic of text-music multimodal contrastive learning is missing: Enriched Music Representations With Multiple Cross-Modal Contrastive Learning (2021) https://ieeexplore.ieee.org/abstract/document/9395210
The closest existing work is cited in refs [40] to [43], and it would be interesting for the reader to find a more detailed description of how the proposed method differs from these works.
Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)
Strong accept
Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))
This is a relevant paper for ISMIR, and there is a consensus among the 4 reviewers on its suitability for presentation at the conference. Note that there are also a few recommendations that would certainly further improve the paper. Please do go through all reviews and consider all these recommendations.
Q2 ( I am an expert on the topic of the paper.)
Strongly agree
Q3 (The title and abstract reflect the content of the paper.)
Strongly agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Strongly agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly disagree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Strongly disagree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Strongly agree
Q15 (Please explain your assessment of reusable insights in the paper.)
Let's look forward to the latest open-source CLAP, but for music!
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
EMA might be better than contrastive learning for fusing embedding spaces when aligning different modalities.
Q17 (Would you recommend this paper for an award?)
Yes
Q18 ( If yes, please explain why it should be awarded.)
The paper is well written, and their experiments include every detail, which impresses me a lot.
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Strong accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
This paper proposes using EMA to learn a joint embedding space for the text and audio modalities. With the EMA mechanism, the need for negative samples is removed and computation costs are reduced. It is a clever way to fuse two modalities, since the method does not explicitly push one embedding away from another, which may encourage the embedding spaces of the two modalities to fuse as much as possible. The authors discuss the embedding-space gap in their experiments, which substantiates this benefit. The paper is well written, and the figure looks intuitive and attractive. Good job on the idea and the detailed experimental results! I look forward to your open-source model and code!
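Since the modality gap comes up here: a common way to quantify it, following Liang et al. ("Mind the Gap", 2022), is the distance between the centroids of the L2-normalized embeddings of each modality (the paper's exact metric may differ). A minimal sketch, with a hypothetical function name and embedding matrices:

```python
import torch
import torch.nn.functional as F

def modality_gap(audio_emb: torch.Tensor, text_emb: torch.Tensor) -> float:
    """Distance between modality centroids on the unit hypersphere.

    audio_emb, text_emb: (N, D) embeddings of paired audio/text items.
    A large value means the two modalities occupy separate regions
    (cones) of the embedding space; 0 means fully fused centroids.
    """
    a = F.normalize(audio_emb, dim=-1).mean(dim=0)
    t = F.normalize(text_emb, dim=-1).mean(dim=0)
    return (a - t).norm().item()

# Hypothetical usage:
# gap = modality_gap(model.embed_audio(clips), model.embed_text(captions))
```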
Additionally, I have a question. I notice that your probing attributes do not include key or chord progression, which I think a good general audio–text embedding space might fail to recognize but a good music–text embedding space should. Would you consider incorporating more music-specific designs or inductive biases into your model training to see whether your method can fully handle a music–text embedding space?
Other comments:
1. I suggest merging all the paragraphs of the abstract into a single paragraph.
2. See the numbered-list format in past ISMIR papers and revise the list in your introduction accordingly.
3. Add punctuation at the end of each equation.
Q2 ( I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q5 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))
The paper provides an excellent literature review, referencing recent and foundational works in multimodal learning (e.g., CLAP, BYOL, MusCALL) while clearly outlining their limitations, especially the gaps that lead to the proposed method. Furthermore, the inclusion of recent methods like SigLIP and ReCLAP indicates that the authors are well informed about current advancements in the field.
My only concern is that the authors claim SLAP is an adaptation of BYOL, but the paper does not carefully describe that architecture. A reader who does not know BYOL cannot understand its relationship to the proposed method.
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Strongly agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Strongly agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Strongly Agree (Very novel topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Strongly agree
Q15 (Please explain your assessment of reusable insights in the paper.)
The paper offers reusable insights by proposing a self-supervised training method for multimodal tasks and systematically addressing key limitations in recent literature (e.g., modality gap, scalability). Each identified challenge, and its solution, serves as a foundation for future work, from negative-free training to improved cross-modal alignment.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
SLAP introduces a scalable, negative-free method for aligning music and text representations, outperforming contrastive approaches like CLAP while reducing the modality gap and improving robustness.
Q17 (Would you recommend this paper for an award?)
Yes
Q18 ( If yes, please explain why it should be awarded.)
To the best of my knowledge, this is the first successful attempt at creating a learning strategy for audio (music) and text that does not rely on negative samples. It is very relevant due to the limitations it surpasses, and it is a clever adaptation of recent methods proposed in other domains. This relevant contribution is described in a well-written paper (with little room for improvement) and supported by a significant experimental evaluation.
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Strongly agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Strong accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
This paper presents a novel audio-text representation learning method, effectively addressing key limitations of contrastive approaches like CLAP. Eliminating negative samples through EMA encoders and asymmetric predictors is particularly innovative, while the comprehensive experiments (covering retrieval, zero-shot tasks, and downstream probing) strongly validate the method's advantages. The analysis of modality-gap reduction and batch-size robustness provides valuable insights for the field. The writing is generally clear, although the figures' quality/resolution could be improved.
Suggestions for improvements (that do not make the paper less relevant):
- Exploring additional encoder architectures beyond HTS-AT/RoBERTa to demonstrate generalizability;
- Including explicit computational-efficiency comparisons with CLAP;
- Better describing BYOL;
- Clarifying some technical aspects, particularly the sensitivity to λ values in the loss weighting and the domain shift observed on MusicCaps (a hedged sketch of the λ weighting follows this list).
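On the λ point: presumably the total objective is a weighted sum of the two prediction directions. The sketch below is a guess at what such a weighting could look like, with entirely hypothetical names and shapes, not the paper's actual loss; a sensitivity sweep would train once per λ value.

```python
import torch
import torch.nn.functional as F

def weighted_bidirectional_loss(pred_a2t, tgt_t, pred_t2a, tgt_a, lam=0.5):
    """Hypothetical weighted bidirectional objective (assumption, not SLAP's code).

    pred_a2t: online audio branch predicting the text target tgt_t;
    pred_t2a: online text branch predicting the audio target tgt_a.
    lam weights the audio-to-text direction; lam=0.5 treats both directions
    equally (reportedly the best setting in the paper's Figure 5).
    """
    loss_a2t = (2 - 2 * F.cosine_similarity(pred_a2t, tgt_t, dim=-1)).mean()
    loss_t2a = (2 - 2 * F.cosine_similarity(pred_t2a, tgt_a, dim=-1)).mean()
    return lam * loss_a2t + (1 - lam) * loss_t2a

# Sensitivity sweep (one training run per value):
# for lam in (0.1, 0.25, 0.5, 0.75, 0.9): ...
```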
Q2 ( I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Strongly agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
No
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Strongly agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Agree (Novel topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
This approach brings in the ability to train multimodal embeddings without including batchwise pairs and provides for interesting future work for other tasks and modalities too.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
The authors propose a new architecture, SLAP, for training multimodal models without a contrastive loss. Contrastive losses introduce a modality gap and are GPU-compute heavy because they require relatively large batch sizes. The authors demonstrate that their technique overcomes both of these limitations, with improved retrieval performance and comparable or better classification and tagging results.
Q17 (Would you recommend this paper for an award?)
No
Q18 ( If yes, please explain why it should be awarded.)
While the approach seems novel, it is hard to say if this method will extend to other problems.
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Strong accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
The authors propose a new architecture, SLAP, for training multimodal models without a contrastive loss. Contrastive losses introduce a modality gap and are GPU-compute heavy because they require relatively large batch sizes. The authors demonstrate that their technique overcomes both of these limitations, with improved retrieval performance and comparable or better classification and tagging results.
The paper's approach tackles two important challenges in contrastive learning and maintains or improves task performance at retrieval, classification and tagging. This approach brings in the ability to train multimodal embeddings without including batchwise pairs and provides for interesting future work for other tasks and modalities too.
The paper is generally well written and easy to read; however, a few clarifications (listed below) would help. Some rigour in statistical testing of the results would also be preferable.
The authors have done good testing on different aspects of their model; however, I would have liked the effect of EMA, if any, to be quantified. The numbers in Table 3 are almost identical (and likely not statistically different): did EMA really do anything? On the other hand, not having L_A and L_B leads to model collapse. This piece needs more clarity in exposition, if not experimentation.
Specific comments
- What is meant by "online context encoder" in line 168? In particular, what do the authors imply by the adjective "online"?
- I am unable to follow which axes the exponential moving average operates on. In particular, it seems to take both the raw audio (which has a time dimension) and the embedding space, where the time dimension is removed (?).
- Line 178: it is not clear what \bar{z} is; it takes some time to figure out from the diagram too.
- Table 2: statistical testing between the SLAP and CLAP models, individually for pretrained and non-pretrained, would be preferable, given that the numbers are often quite close (e.g., 5.7 vs 5.3 for CLAP Recall@1).
- I cannot follow the lower two rows of Table 2, especially the lower CLAP versus the CLAP in the top section. Is the CLAP model in the top section of Table 2 also reproduced from public GitHub repositories? A clarification would help.
- It would be good to do statistical testing for the two rows in Table 3. The authors note in line 270 that both are viable, but statistical testing would establish it; for simplicity, even a comparison of means would be helpful (a sketch of one such test follows these comments).
- Similar comments apply for comparing SLAP and CLAP in Tables 4 and 5.
- One aspect is not clear: why is downstream probing done on the head before z and not after? Also, the train-test splits for the downstream probing have not been specified.
- Figure 5 seems to indicate that the best result is at \lambda = 0.5. Some insight into this would be interesting: although loss weights in multi-loss frameworks are often chosen to be equal, it is notable when tuning also leads to that value.
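To make the statistical-testing suggestion above concrete: for retrieval metrics like Recall@K, a paired bootstrap over queries is a simple option that needs only per-query hit indicators from both systems. A minimal sketch (function name and interface are illustrative, not from the paper):

```python
import numpy as np

def paired_bootstrap_recall(hits_a: np.ndarray, hits_b: np.ndarray,
                            n_boot: int = 10_000, seed: int = 0) -> float:
    """Paired bootstrap test for the difference in Recall@K of two systems.

    hits_a, hits_b: binary arrays (one entry per query) indicating whether
    each system retrieved the correct item in its top K. Returns the
    fraction of resamples where system A does NOT beat system B, i.e. an
    approximate one-sided p-value for the claim "A > B".
    """
    rng = np.random.default_rng(seed)
    n = len(hits_a)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample queries with replacement
        diffs[i] = hits_a[idx].mean() - hits_b[idx].mean()
    return float((diffs <= 0).mean())
```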
Minor comments
- Though they can be inferred, lines 164-165 should include the definitions of T_A and N.
- Line 336: MIPS is not expanded.