P4-3: PianoBind: A Multimodal Joint Embedding Model for Pop-piano Music

Hayeon Bang, Eunjin Choi, Seungheon Doh, Juhan Nam

Subjects: Music retrieval systems ; Representations of music ; Applications ; Open Review ; Knowledge-driven approaches to MIR

Presented In-person

4-minute short-format presentation

Abstract:

Solo piano music, despite being a single-instrument medium, possesses significant expressive capabilities, conveying rich semantic information across genres, moods, and styles. However, current general-purpose music representation models, predominantly trained on large-scale datasets, often struggle to capture subtle semantic distinctions within homogeneous solo piano music. Furthermore, existing piano-specific representation models are typically unimodal, failing to capture the inherently multimodal nature of piano music, expressed through audio, symbolic, and textual modalities. To address these limitations, we propose PianoBind, a piano-specific multimodal joint embedding model. We systematically investigate strategies for multi-source training and modality utilization within a joint embedding framework optimized for capturing fine-grained semantic distinctions in (1) small-scale and (2) homogeneous piano datasets. Our experimental results demonstrate that PianoBind learns multimodal representations that effectively capture subtle nuances of piano music, achieving superior text-to-music retrieval performance on in-domain and out-of-domain piano datasets compared to general-purpose music joint embedding models. Moreover, our design choices offer reusable insights for multimodal representation learning with homogeneous datasets beyond piano music.

Meta Review:

Q2 ( I am an expert on the topic of the paper.)

Disagree

Q3 ( The title and abstract reflect the content of the paper.)

Disagree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Strongly agree

Q5 ( Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))

The abstract states that their "design choices offer reusable insights," but I did not see a clear discussion of insights beyond the observation that using more modalities can be helpful in an embedding problem.

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Disagree

Q10 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))

While the work overall appears to be sound, there is one sentence that gives me pause. In Section 5.1.1, the authors state that "the results demonstrate that the pre-training and fine-tuning approach consistently outperforms the combined training across all modality configurations and metrics." This is not true, as shown in Table 1: combined training does better for Audio on the in-domain set at R@1, and for Symbolic on the out-of-domain set at R@5 and R@10. In fact, when I saw these results in the table, I was hoping the authors would address these cases and hypothesize why the results there run against our intuition.

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Disagree

Q15 (Please explain your assessment of reusable insights in the paper.)

The abstract states that their "design choices offer reusable insights," but I did not see an explicit discussion of insights beyond the observation that using more modalities can be helpful in an embedding problem.

I believe the paper may obliquely offer the reusable insights 1) that more modalities improve results and 2) that pre-training plus fine-tuning also improves results, but both of these feel intuitive.

Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)

An embedding system that uses three modalities instead of two, and draws on both generalized and task-specific training sets, will provide more meaningful embeddings than other systems.

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Weak accept

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper presents PianoBind, a model that leverages symbolic (i.e., MIDI), audio, and text modalities to create an effective embedding space for pop-piano music. I appreciated that the authors spent considerable time in the introduction and related work section making the careful point that piano music, while produced by a single instrument, is more nuanced than other single-instrument media, offering complex polyphonic pieces. This point is an important one, especially for those who would treat piano music as "just a single instrument."

The paper has many strengths, especially in how the authors contextualize their proposed method within existing work. The authors also offer a clear vision for how the PianoBind system addresses weaknesses in previously deployed systems. The depth and quantity of their evaluation studies are also impressive.

There are a few missed opportunities in this work, namely in the presentation of the work itself. As noted above, the authors could be more explicit about their reusable insights.

I also appreciated that the authors have made a demo and made their code available to the community.

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Accept

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

The reviewers found many strengths in this work, including the well-articulated motivation, well-supported claims, and clear writing. Reviewers 1 and 2, as well as the meta-reviewer, were moved by the specialized nature of the work, which reflects the multimodal nature of piano music.

The strengths and weaknesses of the evaluation methods for this work are discussed across the reviews. The difference in opinion between Reviewer 3 and the rest of the reviewers could potentially be due to the presentation of the work (and the accompanying evaluation section). Both Reviewers 1 and 2 offer suggestions and questions about the evaluation section that could help the authors adjust their paper. In thinking about the details in Review 1, I would recommend that the authors be more forthcoming about the limitations of their data, but in the context of the task they are addressing.

There are a number of further details in the individual reviews on the above points as well as other suggestions to strengthen the paper.

Review 1:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

While this paper addresses a specific genre/instrumentation, the authors try to create a blueprint for training similar systems on any other specific genre with limited expert data. The insights into training strategy and loss formulation should apply outside of pop piano music.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Text-based retrieval of highly-specialized types of music can be improved by training multimodal models on domain-specific datasets.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper explores training a multi-modal representation learning model for text-based music retrieval in a specific context, Pop piano music. The authors claim that SoTA models trained over large and diverse datasets of music can miss the subtleties that distinguish performances of piano music.

This claim is supported by the results in Section 5, as retrieval performance is best with the model trained on a piano-specific dataset. Additionally, there is a significant difference in retrieval performance between the two training strategies, with pre-training and fine-tuning the clear winner. Not all researchers have access to industrial-sized datasets for their specific domain, but these results suggest that collecting a smaller, expertly annotated dataset to fine-tune a model pre-trained on a larger, noisier one could be a successful strategy.

One thing that could be added to the results section is an example of a query from the evaluation set paired with results from PianoBind and one of the broadly trained CLaMP models. It could be useful to see what sorts of qualitative improvements PianoBind provides. Are there words that are rare outside of pop-piano music for which PianoBind has learned what content they refer to? I think some examples would help give readers confidence that this research could apply to their sub-genre as well.

For the final training objective, I am curious if the authors tried anything other than equal weighting of audio-text and MIDI-text contrastive losses. It makes intuitive sense to give these equal importance, but it would have been interesting to see an ablation study to explore this.

This paper provides some interesting insights for researchers working with highly specialized types of music. The authors make reasonable design choices and compare their results to SoTA models in this problem space. I think the main drawbacks here are the limited size of the evaluation datasets and the lack of evidence to support the effectiveness of this strategy on other genres. Nonetheless, I think this would be a useful contribution to the conference and I recommend acceptance.

Review 2:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Disagree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The study affirms that multimodal learning, specifically integrating audio, symbolic (MIDI), and textual modalities, results in more discriminative semantic representations for solo piano music than unimodal/bimodal approaches. This highlights the importance of modeling the piano performance across multiple modalities.

The authors demonstrate that leveraging both large-scale, weakly aligned data and small-scale, expert-annotated data is most effective when applied in a staged manner—i.e., through initial pre-training on noisy data followed by fine-tuning on high-quality annotations. In contrast, naive combined training without separation of phases leads to performance degradation, highlighting the sensitivity of multimodal contrastive objectives to label noise.

Empirical results show that trimodal embedding models consistently outperform their bimodal counterparts (audio–text and symbolic–text) across both in-domain and out-of-domain retrieval tasks. This supports the conclusion that each modality contributes complementary information, and their joint alignment is critical for capturing fine-grained semantic nuances in homogeneous music domains such as solo piano.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Integrating audio, symbolic, and text modalities in a joint embedding model for piano music enables more accurate retrieval of nuanced genre, mood, and semantic characteristics than general-purpose approaches.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

Summary

This paper presents PianoBind, a trimodal joint embedding model specifically designed for solo piano music. It effectively integrates audio, symbolic (MIDI), and textual modalities to capture nuanced semantic attributes such as genre, mood, and style. PianoBind is trained using a multi-source strategy that combines large-scale weakly aligned data with small-scale expert-annotated data through pre-training and fine-tuning processes. Experiments conducted on both in-domain (PIAST-AT) and out-of-domain (EMOPIA-Caps) datasets demonstrate that PianoBind significantly outperforms general-purpose models in text-to-music retrieval, with trimodal learning proving to be more effective than bimodal approaches.

Strengths

  • The paper addresses a timely and relevant problem in music representation: capturing fine-grained semantics in solo piano music through multi-modal modeling.

  • The evaluation framework is sound, with both in-domain and out-of-domain tests, and clearly demonstrates the advantages of the proposed trimodal approach.

  • Although focused on piano, the methods—particularly the staged multi-source training strategy—are broadly applicable to other homogeneous or low-resource musical genres.

Limitations

  • Line 333 mentions involvement of a “human music expert” in refining GPT-generated captions, but lacks detail on their qualifications or the verification protocol used, reducing transparency in evaluation.

Overall Assessment

This paper presents a well-motivated and technically sound contribution to music information retrieval, particularly within the under-explored space of domain-specific multi-modal modeling for solo piano. While some reproducibility and methodological transparency concerns remain, the model design, training strategy, and results offer valuable and transferable insights for MIR researchers working with low-resource or stylistically narrow domains.

Review 3:

Q2 ( I am an expert on the topic of the paper.)

Strongly agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Disagree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Disagree

Q15 (Please explain your assessment of reusable insights in the paper.)

To be honest, the paper seems to use a pretty standard multimodal learning approach, and the experiments don't really dig deep enough. So, while it works for this specific piano task, it might not offer major new insights that the whole MIR field can run with.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Yet another music contrastive learning pretraining model

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong reject

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The paper reads well and the logic is clear. The main strengths are its structure and clarity. However, the core idea feels quite familiar, especially with models like CLaMP already doing similar multimodal things. The bigger issue is the experimental setup for retrieval: testing on just a couple hundred tracks makes the task seem too easy and doesn't really convince me that the model would work well in a more realistic scenario.