P3-1: Are you really listening? Boosting Perceptual Awareness in Music-QA Benchmarks
Yongyi Zang, Sean O'Brien, Taylor Berg-Kirkpatrick, Julian McAuley, Zachary Novack
Subjects: Evaluation methodology ; Evaluation, datasets, and reproducibility ; Evaluation metrics ; Open Review ; Awards Nominee ; Novel datasets and use cases
Presented In-person
10-minute long-format presentation
Large Audio Language Models (LALMs), where pretrained text LLMs are finetuned with audio input, have made remarkable progress in music understanding. However, current evaluation methodologies exhibit critical limitations: on the leading Music Question Answering benchmark, MuchoMusic, text-only LLMs without audio perception capabilities achieve surprisingly high accuracy of up to 56.4%, much higher than chance. Furthermore, when presented with random Gaussian noise instead of actual audio, LALMs still perform significantly above chance. These findings suggest existing benchmarks predominantly assess reasoning abilities rather than audio perception. To overcome this challenge, we present RUListening, a framework that enhances perceptual evaluation in Music-QA benchmarks. We introduce the Perceptual Index (PI), a quantitative metric that measures a question's reliance on audio perception by analyzing log probability distributions from text-only language models. Using this metric, we generate synthetic, challenging distractors to create QA pairs that necessitate genuine audio perception. When applied to MuchoMusic, our filtered dataset successfully forces models to rely on perceptual information—text-only LLMs perform at chance levels, while LALMs similarly deteriorate when audio inputs are replaced with noise. These results validate our framework's effectiveness in creating benchmarks that more accurately evaluate audio perception capabilities.
Q2 ( I am an expert on the topic of the paper.)
Agree
Q3 ( The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work.)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Strongly agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Strongly agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Strongly agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Agree (Novel topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Strongly agree
Q15 (Please explain your assessment of reusable insights in the paper.)
The paper highlights the limitations of current Q/A benchmark datasets used for evaluating large audio-language models and provides a methodology to improve them. Current datasets, such as MuChoMusic, are prone to allowing correct answers to be guessed based solely on the text input modality. The paper proposes a method for automatically creating distractor answers that are more challenging for text-only models.
Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)
A method to increase the difficulty of Q/A benchmarks for large audio-language models to avoid their overreliance on the text modality.
Q17 (This paper is of award-winning quality.)
Yes
Q18 ( If yes, please explain why it should be awarded.)
The paper presents a methodology and a new dataset that is capable of advancing the evaluation of music understanding by large audio-text models, which is a hot topic in MIR and audio AI research.
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Strongly agree
Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)
Strong accept
Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
The paper presents a methodology and a new dataset that is capable of advancing the evaluation of music understanding by large audio-text models, which is a hot topic in MIR and audio AI research.
Large audio-text models are a hot topic, and evaluating their music understanding capabilities is of high importance. The paper continues recent efforts to create QA benchmark datasets, building on the MuChoMusic dataset proposed at ISMIR 2024. The authors highlight its key drawback, demonstrating how state-of-the-art (SOTA) text-only LLMs can achieve relatively high performance on this benchmark by exploiting their prior knowledge to effectively disregard part of the distractor answers.
The paper proposes a method to measure the difficulty of distractor answers in the benchmark in terms of the actual necessity of audio inputs (perceptual index). Using LLMs, the authors propose a method to generate synthetic distractor answers that are more challenging, eliminating the text bias present in the original MuChoMusic dataset. The authors contribute the resulting dataset as a new benchmark.
Overall, the approach is sound, the paper is well-written, and it represents an important contribution to the field.
Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)
Strong accept
Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))
All reviewers agree on the significant relevance and value of the contributions in the paper for advancing LALM benchmarks. The paper is well-written and presents a sound methodology, and the reviewers concur on a strong recommendation for acceptance. There are specific questions from reviewers that should be addressed to further improve the paper. We expect relevant changes to be implemented in the camera-ready submission.
Q2 ( I am an expert on the topic of the paper.)
Disagree
Q3 (The title and abstract reflect the content of the paper.)
Strongly agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Strongly agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
No
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Strongly agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Strongly agree
Q15 (Please explain your assessment of reusable insights in the paper.)
The paper makes an important observation that both unimodal and multimodal text(-audio) models reason over the text question to arrive at an answer. The authors also propose a system for measuring and making questions more perceptually challenging. These insights are certainly going to be very useful to the community and will inspire work focusing on improving the perceptual capabilities of multimodal language models.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
Current text and audio-text models can reason over text questions to derive an answer without considering the audio; thus, a methodology is proposed to create more perceptually challenging questions.
Q17 (Would you recommend this paper for an award?)
Yes
Q18 ( If yes, please explain why it should be awarded.)
This is a critical observation and a good, comprehensive solution to it, that includes an extensive evaluation. I believe it will be a critical paper for those working on improving audio-text models.
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Strongly agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Strong accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
This paper presents a compelling and much-needed investigation into the perceptual awareness of audio language models within music question-answering benchmarks. The authors convincingly demonstrate that MuchoMusic can often be solved by text-only models relying on reasoning and prior knowledge rather than genuine audio perception. The proposed RUListening framework, which introduces a perceptual index metric to quantify a question's reliance on audio and a method to generate more challenging, perceptually demanding distractors, is a valuable and well-reasoned approach that can be very useful to the community.
The experiments are well designed, and an extensive selection of models, including very large ones, is tested. The results support the central hypothesis, showing a significant reduction in text-only model performance and increased sensitivity of audio language models to the audio input. The authors also qualitatively investigate possible issues with MuchoMusic, both in how it might encourage text reasoning and in how some questions seem to be invalid. While the authors mention the reliance on MuchoMusic as a limitation given these issues, their method is applicable to other datasets, so it remains valuable.
The paper is generally very well written, particularly in the explanation of the motivation, methods, and results. Personally, I found the writing style in the introduction a bit distracting compared to the otherwise scientific and precise style of the rest of the paper. The language could be simplified, in both complexity and "grandiosity". I understand part of the intensity/confidence stems from the unconventional decision to open with some results to prove the motivation (which, while unexpected, I felt by the end was a good choice). Part of it is also the opening quote, which I really don't think adds anything meaningful to the paper, and the opening sentence, which is long and convoluted. Simpler language and a clearer explanation of the situation would make the paper much easier to read: it could be stated more plainly that the text models are simply responding to the text question without the audio. Insisting that they cannot perceive and are "deaf" prolonged my confusion about whether you were, for example, investigating providing some text encoding of a waveform to text models. A clearer explanation of the task (and dataset) would also benefit readers new to the area.
Still, as mentioned, I very much enjoyed reading the rest of the paper and think it’s a very strong and valuable contribution.
A paper that I think is relevant and could be discussed in related work is “I can listen but cannot read: An evaluation of two-tower multimodal systems for instrument recognition” from last year’s ISMIR.
I am not sure the ISMIR template is strictly followed. There are larger than usual page margins, but I can’t tell if there’s just added padding or if the column sizes are affected. References could be cleaner (links, capitalization, consistency).
Q2 ( I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Strongly disagree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q5 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))
In Section 2.2, I missed a mention of MMAU (A Massive Multi-Task Audio Understanding and Reasoning Benchmark). It is broader than MuChoMusic (i.e., also tests speech and environmental audio), but it is more comprehensive in the difficulty levels it is testing.
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Strongly agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Strongly agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Agree (Novel topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Strongly agree
Q15 (Please explain your assessment of reusable insights in the paper.)
The insights and the Perceptual Index proposed by the authors can be extended to other MusicQA benchmarks, such as MMAU. This can also be used when new datasets are created to ensure they will properly evaluate audio perception capabilities of Audio LLMs.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
The paper proposes a method to improve current benchmarks by creating distractors that force LALMs to use audio perception instead of only reasoning, thus evaluating the real capabilities of music understanding.
Q17 (Would you recommend this paper for an award?)
Yes
Q18 ( If yes, please explain why it should be awarded.)
As the number of LALMs increases, we need reliable ways of evaluating these models. Multiple-choice questions are the most scalable way we know to date, but different works have shown that LLMs achieve performance comparable to LALMs even without access to the audio. This work takes an important step toward refining one benchmark and making it more suitable for multimodal models. I think the insights here can and should be reused in new work with LALMs.
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Strongly agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Strong accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
I recommend acceptance of this paper.
Strengths:
* The paper is easy to read and follow. Figures are appropriate and well discussed, in my opinion.
* The authors conduct extensive experiments to validate their Perceptual Index.
* Results indicate that the new questions make it harder for LLMs to achieve performance comparable to LALMs, which is what we should see in Music-QA benchmarks.

Weaknesses: I don't see any major weaknesses in this paper.
Questions:
* L186-L190: Is the prompt also changed for the LALMs?
* L304-L316: How many distractors were generated per question? What prompt was used to generate them? Was there any comparison with other LLMs for generating such distractors?
* L329-L335: It would be interesting to see some examples of high and low CLAP semantic similarity scores.
* L346-L348: Are you constraining the model output to a single token? If not, how do you control for the case in which the model does not follow the output format? For example, instead of answering with the correct option, the model might write a very long answer.
* L443-L450: Do the authors see any relationship between response length and accuracy? In my experience, the MuChoMusic evaluation criteria search for common terms between the answer words and the model output; I am not sure what impact this has on the evaluation.
- Appendix B: Maybe I am missing something, but I don't understand the role of the audio description in the model inspection. Was the description provided to the models together with the question? If yes, why?
Minor comments:
* MuchoMusic -> MuChoMusic
* L203, L227, L405, L444: remove the space before the \footnote command.
* Is there a specific reason why the authors decided to refer to LLMs as text-only LMs but to audio LLMs as LALMs?
Q2 ( I am an expert on the topic of the paper.)
Strongly agree
Q3 (The title and abstract reflect the content of the paper.)
Strongly agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Strongly agree
Q5 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))
The paper clearly outlines the Music Question Answering evaluation and multimodal perception datasets available up to the state of the art.
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Strongly agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Strongly agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Strongly agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Agree (Novel topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Strongly agree
Q15 (Please explain your assessment of reusable insights in the paper.)
One of the biggest gaps in Music Information Research is the lack of adequately and rigorously filtered datasets. Evaluation datasets are usually just scraped or crowd-sourced, and little to no effort is made to properly understand whether they provide a good proxy for the question they are trying to answer. This paper clearly paves the way to start asking: "In what way does this evaluation dataset help me answer the question? How close to objective truth is it, and therefore how much trust can I put in the results obtained?"
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
Multimodal perception would greatly benefit from surrogate tasks and generative models (specifically Large Language Models) to come up with questions that rely on all of the modalities, preventing any single modality from being 'enough'.
Q17 (Would you recommend this paper for an award?)
Yes
Q18 ( If yes, please explain why it should be awarded.)
As stated in the reusable-insights part of my review, I think this paper might lead to a whole sub-domain of critiquing and evaluating the current state of the art in evaluation. We clearly lack properly proof-read evaluation datasets, and it will be good to evaluate automatic methods that at least improve their validity, if not establish the 'ground truth'.
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Strongly agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Strong accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
This paper focuses on evaluating the validity of multimodal system evaluation protocols, and more specifically MuchoMusic. In that sense, the authors ask: are Multimodal Question Answering datasets truly testing the multimodal capabilities of such systems? To what extent are the reasoning capabilities of language models enough to arrive at the right answer without regard to the audio signal?
Apart from that, in lines <216-221> the authors present a statement as true, although as far as I can tell no valid evaluation has tested it. They should therefore rephrase it to explicitly say that it is a hypothesis that models possess the "world-prior" information for music from an existing corpus.
In line <262> the definition of the total probability in the parentheses is incorrect: the total probability is defined as the product of the likelihood (which is in the formula) and the prior of each answer (to which we don't have access). It is also implied that a softmax has been applied, which should be clearly stated, because in multiclass classification this is not required. The final answer is the most plausible one, i.e., the one with the highest likelihood; the conditional probabilities of the answers given the question are therefore not bound to sum to one. As such, more details must be included, or the softmax approximation of the total probability should be stated explicitly.
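To make the normalization point concrete, here is a minimal sketch (with illustrative numbers, not the paper's actual formulation) of how raw per-answer log-likelihoods, which need not sum to one, can be turned into a proper distribution via a softmax:

```python
import math

def answer_probabilities(log_likelihoods):
    """Softmax over per-answer log-likelihoods, so the result sums to 1."""
    m = max(log_likelihoods)  # subtract the max for numerical stability
    exps = [math.exp(ll - m) for ll in log_likelihoods]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative per-answer log-likelihoods from a text-only LM
raw = [-2.1, -0.3, -4.0, -1.5]
probs = answer_probabilities(raw)
# probs now sum to 1; the argmax (answer index 1) is unchanged by the softmax
```

Whether the paper applies exactly this normalization is what should be stated explicitly in the text.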
The paragraph between lines <284-302> is hard to follow in its argument for choosing the perceptual index over entropy. First, evidence of correlation between PI and entropy is needed, and the argument in lines <292-295> should be rephrased to explicitly state that this is a hypothesis rather than a given. A noun is also missing from the sentence in lines <296-302>; presumably "confidently wrong" and correct answers are meant. Furthermore, the claim that high-PI answers require perceptual information while high-entropy ones do not is not justified. To establish this, experiments with entropy-based filtering are needed, comparing performance and similarity distributions after PI-based versus entropy-based filtering. Otherwise, I would rephrase the paragraph to explicitly state that this is a hypothesis left for future work.
In line <308>, "context packages" is not a known term and therefore needs an explicit definition. Changing the phrasing from 'with question...' to 'as a list of question...' would make it clearer that a package is the set of triplets. Line <319> has a typo. I would also change the x-axis label of Figure 3 to explicitly state the similarity that has been chosen, namely cosine similarity.
There is evidence that CLAP does not properly understand musical terms: a paper on instrument recognition at last year's ISMIR showed that CLAP's text encoder failed to semantically capture musical relationships between instruments, which is the direct opposite of the phenomenon suggested by the example given. Also, the distributions themselves do not provide context. A better alternative would be to plot the distributions of similarities between the correct answer and the distractors in MuchoMusic, the generated distractors, and even random ones. I believe there would be no significant difference between random distractors and the chosen ones. With that, an aggregation function (such as the overlapping index) could be calculated to test whether there is any large difference between the distributions. In addition, in these distributions the similarity between distractors and the correct answer approaches 1 in some cases; those cases should be considered, or it should be explicitly stated how they are handled.
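As a sketch of the aggregation I have in mind, the overlapping index of two similarity samples can be estimated from shared-bin histograms (all sample values below are invented for illustration, not taken from the paper):

```python
def overlap_index(a, b, bins=20, lo=0.0, hi=1.0):
    """Overlapping coefficient of two samples via shared-bin histograms.
    Returns a value in [0, 1]; 1 means the binned distributions coincide."""
    width = (hi - lo) / bins

    def hist(xs):
        h = [0] * bins
        for x in xs:
            i = min(int((x - lo) / width), bins - 1)  # clamp x == hi
            h[i] += 1
        return [c / len(xs) for c in h]  # normalize to proportions

    ha, hb = hist(a), hist(b)
    return sum(min(pa, pb) for pa, pb in zip(ha, hb))

# Hypothetical cosine-similarity samples for two distractor sets
chosen = [0.62, 0.70, 0.68, 0.75, 0.66, 0.71]
random_d = [0.60, 0.72, 0.65, 0.74, 0.69, 0.67]
overlap = overlap_index(chosen, random_d)
```

A high overlap between the chosen and random distractors would support my suspicion that the similarity distributions barely differ.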
In Figure 4b, there are many outliers with very small perceptual index, and Pearson's coefficient is known to be sensitive to them. A lasso analysis should be performed given the visualization; the R factor would be almost 0 in that case (with a threshold of 0.3). This does not happen with 4a, so that correlation value is mostly trustworthy. It would also be good to include the mean accuracy for both datasets, to signify the difference between the two cases.
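A toy illustration of the sensitivity I am referring to (all values invented): a handful of extreme low-value points can inflate Pearson's r on an otherwise weakly correlated cloud, which is why the outliers in 4b worry me:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# A tight, weakly correlated cloud...
xs = [0.1 * i for i in range(10)]
ys = [0.5, 0.4, 0.55, 0.45, 0.5, 0.6, 0.48, 0.52, 0.47, 0.53]
r_clean = pearson(xs, ys)

# ...plus two extreme low-value outliers, which drag r toward 1
r_out = pearson(xs + [-3.0, -3.2], ys + [-2.0, -2.1])
```

Reporting a robust alternative alongside Pearson's r would make the claim for 4b more convincing.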
In line <393>, the hardness values are not properly defined, and the argument in lines <394-397> lacks grounding, as this is precisely the hypothesis under investigation; up to this point, audio is only partially correlated with PI. Also, Figure 4 should include the same plots for MuchoMusic without the generated distractors, to allow a proper comparison between the two. I suspect the change in the music PR correlation would be much the same.
Apart from that, the paper was an interesting read, very precise, and it definitely addresses a large gap in music-language multimodal comprehension. I think it is a work properly addressing a main gap in our domain and community: the lack of rigorously defined evaluation sets. It also successfully addresses a way of estimating the informativeness of specific question-answer pairs and the problematic nature of evaluating multimodal systems! As a remark, more experiments should be performed with fewer models on random distractors, on different subsets from several steps of the RUListening framework (with variable distance from the optimal D*), and even testing entropy-based filtering. That would further solidify that the approach is successful and that PI is the right surrogate for finding the right set of distractors. With minimal phrasing changes, I think this paper is a nice addition to the ISMIR conference!