P4-6: CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following

Yinghao MA, Siyou Li, Juntao Yu, Emmanouil Benetos, Akira Maezawa

Subjects: Open Review; Evaluation methodology; Musical features and properties; MIR fundamentals and methodology; Evaluation, datasets, and reproducibility; Representations of music; Multimodality

Presented In-person

4-minute short-format presentation

Abstract:

Recent advances in audio-text large language models (LLMs) have opened new possibilities for music understanding and generation. However, existing benchmarks are limited in scope, often relying on simplified tasks or multi-choice evaluations that fail to reflect the complexity of real-world music analysis. We reinterpret a broad range of traditional MIR annotations as instruction-following formats and introduce CMI-Bench, a comprehensive music instruction-following benchmark designed to evaluate audio-text LLMs on a diverse set of music information retrieval (MIR) tasks. These include genre classification, emotion regression, emotion tagging, instrument classification, pitch estimation, key detection, lyrics transcription, melody extraction, vocal technique recognition, instrument performance technique detection, music tagging, music captioning, and (down)beat tracking — reflecting core challenges in MIR research. Unlike previous benchmarks, CMI-Bench adopts standardized evaluation metrics consistent with previous state-of-the-art MIR models, ensuring direct comparability with supervised approaches. We provide an evaluation toolkit supporting all open-source audio-textual LLMs, including LTU, Qwen-audio, SALMONN, MusiLingo, etc. Experimental results reveal significant performance gaps between LLMs and supervised models, along with their cultural, chronological, and gender biases, highlighting the potential and limitations of current models in addressing MIR tasks. CMI-Bench establishes a unified foundation for evaluating music instruction following, driving progress in music-aware LLMs.

Meta Review:

Q2 (I am an expert on the topic of the paper.)

Strongly agree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The evaluation of multiple music-related LLMs on a broad set of tasks helps to understand the capabilities of such models, which are still far from optimal.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

A new dataset useful for finetuning and evaluating music-related LLMs, built by reformulating many MIR datasets into an instruction-following form.

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Weak accept

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper presents a new dataset created from several existing MIR datasets by reformulating the tasks in an instruction form, suitable for finetuning and evaluating multimodal music-related LLMs. The authors then evaluate a number of available models on the dataset. The paper is well written and structured, and its main contribution, the unification of MIR tasks into a standardized instruction-tuning benchmark, is timely and highly relevant. The catalog of MIR tasks and LLMs surveyed is particularly useful to the community, and the benchmark aligns well with both NLP and MIR research interests. That said, there are areas that could be improved or clarified to strengthen the paper's long-term utility.

- The evaluation of zero-shot learning with LLMs depends heavily on the prompt used, which in this case is determined by the authors. Many of the models evaluated may have been trained with a different set of instructions, which makes it difficult to fully trust the results reported in the paper. In addition, there is no ablation study or set of prompt variants that would help the reader trust the prompts chosen by the authors. The dataset may be more useful for finetuning newer models that follow the defined instructions.
- The paper refers to CMI-Bench as a benchmark, but it lacks a clear leaderboard or scoring protocol that would encourage external adoption. A discussion of future integration with platforms like HuggingFace leaderboards would help clarify its long-term role.

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Accept

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

The topic is highly relevant to the ISMIR community, especially considering the growing interest in applying LLMs to musical domains. The paper is well written and presents a substantial and clearly motivated contribution. Reformulating MIR tasks for instruction-based evaluation is a timely and important idea that responds to the way LLMs are increasingly used in practice.

Reviewers appreciated the breadth of the evaluation, the inclusion of multiple models and tasks, and the open release of the benchmark. The analysis of genre and cultural bias is also a welcome addition, but can be improved following the recommendations of reviewer #3.

One central issue is the lack of prompt ablation or prompt robustness analysis. Since the zero-shot results depend heavily on prompt wording, it’s difficult to assess whether the poor performance observed in some tasks reflects model limitations or suboptimal prompt design. Including even a small prompt variation experiment would have helped to clarify this.

Another limitation is the lack of a formal leaderboard structure or evaluation protocol. While the dataset is positioned as a benchmark, it would benefit from clearer guidance to encourage adoption—such as standard scoring procedures or integration with leaderboard platforms.

Some reviewers also found that the results are under-analyzed, especially given the number of metrics and tasks. The discussion of failures (e.g., hallucinations, invalid outputs) is often brief or qualitative. More quantitative data—for instance, the rate of invalid responses—would make the analysis more useful. In addition, the comparison with previous studies that showed better LLM performance on music tasks needs to be better contextualized.

Finally, a deeper reflection on the reliability of the underlying datasets, especially for subjective tasks like emotion annotation, would strengthen the benchmark's credibility. In several cases, it is unclear whether model “errors” are due to actual model failures or limitations in the data itself.

Despite these limitations, this paper makes a valuable contribution by providing the community with a reusable and extensible framework to evaluate LLMs on music-related tasks. While the methodology is still in early stages, the benchmark can serve as a foundation for future work, and will likely stimulate further discussion and experimentation in this space.

I recommend acceptance, with the hope that the authors can expand on some of the open questions and strengthen the benchmark for broader adoption.

Review 1:

Q2 (I am an expert on the topic of the paper.)

Disagree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Disagree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The paper highlights that, following the authors’ chosen methodology, LLMs do not outperform MIR task-specific SOTA models. While this result may offer a valuable insight, it remains unclear to me whether it stems from the intrinsic capabilities of LLMs or from the specific reformulation strategies adopted by the authors for the different tasks.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Following the authors’ chosen methodology, LLMs do not outperform MIR task-specific SOTA models.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak reject

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The paper is relatively well written and understandable. The amount of work presented is commendable, with 11 models and 20 different metrics applied to 14 different MIR tasks. Despite this impressive breadth, the paper remains readable and well structured. Section 2, although dense, offers a comprehensive state of the art. However, there are a number of typos (listed at the end of this review), and most of the figures and tables are difficult to read (I assume due to space constraints that led to font size reduction).

The authors propose using LLMs for a wide range of MIR tasks by reformulating task annotations into an instruction-following prompt paradigm to leverage the capacities of LLMs. Ultimately, "all models in our study fall significantly short of the performance achieved by task-specific supervised systems when evaluated using standard MIR metrics."

To me, the paper exhibits two major flaws:

1) The impact of the specific reformulations chosen in the paper on the results. If LLMs underperform so noticeably compared to task-specific SOTA models, is it solely due to the inherent limitations of LLMs and the explanations offered in the paper? Could different prompts have yielded better performance? A more systematic analysis of prompt engineering choices, or a prompt ablation study, would have been highly valuable to isolate the source of performance gaps.

2) The comparison with previous attempts at music-related instruction-following tasks. Section 5.1.1 cites other papers in which LLMs achieved excellent results—why do these discrepancies arise here?

The filtering of outputs produced by the selected LLMs (Section 4 Experiments) would have benefited from a more detailed analysis. For downbeat tracking, the authors note: "We filter non-numeric outputs". How frequently do such outputs occur? The model was expected to produce a list of tuples only. Similarly, for melody extraction: "We discard invalid tuples (e.g., missing pitches, or improperly formatted entries, etc.)." How often do models fail to produce valid outputs? How frequent are hallucinations? To what extent does post-processing affect the final results?

Section 5 Results is unfortunately hindered by the number of tasks being addressed simultaneously. Table 3 is not sufficiently referenced in the text. Given the number of metrics reported, it would be helpful for Table 3 to include arrows or annotations indicating which metrics are better when lower or higher. Subsection 5.1.3, "All Models Perform Poorly on DSing Transcription", fails to provide insight into why performance is so low.

Section 5.2, Culture and Gender Bias, raises interesting issues but suffers from several weaknesses. The accordion is not an orchestral instrument. Regarding "Performance drops significantly on bongo and harmonica - commonly associated with world, folk": is this not simply because such instruments are underrepresented in the dataset? Is folk music really that rare in genre datasets?

The distinctions drawn are also inconsistent: "Western genres (e.g., 80s, 90s)" vs. "music traditions (e.g., Medieval, 60s)": why are the 80s and 90s considered genres but the 60s a "tradition"? Is chanson considered world music? This section lacks both detail and quantitative results, although the topics discussed are undoubtedly of high relevance to the field.

I would like to emphasize that my decision to recommend a weak reject is in no way due to the presence of negative results. On the contrary, negative or underwhelming results are important and valuable. The work presented is substantial and of genuine interest.

However, the paper lacks fine-grained analysis of the results and shows little critical perspective on the design of the prompts—an issue that, in my view, is insufficiently addressed in the paper, except briefly in Section 5.1.4.

Minor remarks: Figure 1, Figure 2, Table 1, and Table 2 are nearly illegible.

l.141: "sequential or sequential tasks" -> repetition

l.174/175: "seuqen-tiall" -> typo

l.232: "tupiles" -> should be "tuples"

Table 2: Checkmark and cross symbols are visually confusing

l.326: "Trainingset" -> spacing issue

l.330: "generalization.Qwen2-Audio" -> missing space

l.365: "While, different" formulation is strange

l.424+: "Audio-Flamingo’s performance on Bossanova and Chanson drops severely, respectively." -> "respectively" is misused here

Review 2:

Q2 (I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

I really appreciate that this paper quantifies the biases and weaknesses of current audio-based music LLMs. I am a fan of work that picks apart the state of the art and studies what many of us have noticed intuitively: that there are genre biases (based on dataset representation), that tagging would help production, and that it is difficult to tease individual components out of the audio signal being used.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

The paper introduces an evaluation technique that extracts various types of information from the outputs of different models, and uses this extracted data to find how well the various models perform.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

I think this paper will generate lots of discussion, and its topic is central to current conversations about music LLMs.

Review 3:

Q2 (I am an expert on the topic of the paper.)

Disagree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q5 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))

I'm not very familiar with this field, but do see sections relevant to the themes of LLMs, instruction-following, and the context of music.

I see the utility of drawing together the range of tasks, and the contribution of gathering the datasets to be used for benchmarking. But I will also say that I miss a discussion, or at least some further details, of the benchmark datasets. Paper guidelines of course only allow for relatively short papers at ISMIR, and the authors do include a description of what the datasets essentially are. However, I feel some details about their strengths and shortcomings are not only useful but necessary, as benchmarks are essentially measurement instruments.

Welty, C., Aroyo, L. M., & Paritosh, P. K. (2019). A metrological framework for evaluating crowd‐powered instruments. In HCOMP-2019: AAAI Conference on Human Computation.

Key components of these datasets are often unreported, making it difficult to understand what the data represent: Geiger, R. S., Cope, D., Ip, J., Lotosh, M., Shah, A., Weng, J., & Tang, R. (2021). “Garbage in, garbage out” revisited: What do machine learning application papers report about human-labeled training data?. Quantitative Science Studies, 2(3), 795-827.

I particularly found myself wondering about the emotion annotation benchmark, as I expect this data is rather 'noisy', in the sense that I think humans will vary in the way that they annotate the data. Thus, where the LLMs may appear to perform poorly when it comes to annotating arousal and valence of music, it remains unclear how much humans would agree on the arousal and valence in a given moment of a musical piece. Some discussion on the reliability of the measurements, and other details of the benchmark datasets may help clarify how to interpret the results later on.

It may just be that some benchmarks are 'noisy', and that we as a community should be aware of this, so that we can go about collecting better data for your benchmark sets. Thus, I think details like this are crucial to share with us.

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q10 (Please justify the previous choice (Required if "Strongly Disagree" or "Disagree" is chosen, otherwise write "n/a"))

I'll acknowledge that I'm not completely familiar with all the tasks. However, there were aspects of this work that I feel show quality and that do not require task-specific knowledge.

The first is an assessment of at least some forms of bias in the benchmarks. Although I expect this, and further feel it ought to be generally expected in academic work, I do not always see it, which is problematic. Given that our field is music, and there are certainly demonstrable tendencies and biases here, observing that the authors put some thought into assessing bias is useful.

I further appreciate that the authors do not overinterpret the tools that they use, i.e. they do not treat the performance of LLMs as definitive indications of human-like qualities indicative of intelligence. Rather, the authors focus on observable quantities, which I find scientifically appropriate.

Lastly, I also appreciate the efforts to assess generalizability.

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Strongly agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The authors collect several MIR tasks into a single framework that can be used to evaluate LLMs. The output is immediately reusable. Further the authors document how the benchmark was composed sufficiently clearly that others may create similar benchmarks if they wish.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Authors share a benchmark composed of common MIR tasks, adapted to allow for the evaluation of LLMs.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Strongly agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The increasing popularity of LLMs, and their recent adaptation to various music-related tasks, certainly makes this a timely topic. As popular as LLMs are, a consistent challenge is figuring out how to measure their performance. Thus, contributing to the evaluation of LLMs for MIR tasks makes this topic even more relevant at ISMIR.

I find the approach of instruction-following to be an interesting method for developing the benchmark, in particular when the output is a number within a range on a scale. I do miss an assessment of the test-retest reliability of LLM responses, however. As they are stochastic by definition, I would be curious to also see how consistent they are. There are energy costs to this, of course, but one might interpret the results of a top-performing model differently if its output varies substantially when given the same input multiple times. As I also expect that this variance will differ by LLM and task, I feel it would add substantial resolution to the results. Of course, I acknowledge the limited space and the necessity of the page-long table to show results.

The work further allows for an initial view on the state of LLMs applied to MIR related tasks. There are some clear successes and failures, which I expect will be of interest to our community.

I further appreciate the lack of overinterpretation of results that is so common in LLM work.