P2-5: AN EVALUATION STRATEGY FOR LOCAL KEY ESTIMATION: EXPLOITING CROSS-VERSION CONSISTENCY

Yiwei Ding, Yannik Venohr, Christof Weiss

Subjects: Evaluation methodology ; Evaluation, datasets, and reproducibility ; Evaluation metrics ; Harmony, chords and tonality ; Open Review ; MIR tasks ; Musical features and properties

Presented In-person

4-minute short-format presentation

Abstract:

Local key estimation (LKE) is an important yet challenging task in music information retrieval since it involves a high level of musical abstraction, which entails ambiguity and low inter-annotator agreement. Relying on limited (small) datasets with a single annotation may introduce not only dataset bias but also annotator bias. To address such problems, we propose in this paper a novel, annotation-free evaluation strategy for LKE. To this end, we exploit datasets where multiple versions of the same musical work are available. We investigate the models' consistency across versions, expecting an effective and robust model to output similar predictions on different versions of the same work. In our experiments, we study the behavior of the proposed cross-version consistency measure using different models and datasets as examples, indicating a strong correlation between cross-version consistency and the models' effectiveness on in-domain data as well as their generalization to out-of-domain data. Our further studies show that, while being correlated to common evaluation metrics, cross-version consistency also captures different aspects of model behavior, thus serving as an additional figure of merit for evaluating LKE models.

Meta Review:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 ( The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Disagree

Q15 (Please explain your assessment of reusable insights in the paper.)

While the proposed metric can and probably should be adopted in future papers working on local key estimation, it teaches little about the type of errors being made, nor does it provide a straightforward way to lead to improved estimation.

Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)

When multiple versions of a composition are available, the proposed metric can give insight into the local key estimation of an algorithm without relying on annotations.

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Weak reject

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

Summary

A new metric for local key estimation is proposed, which is based on the consistency of the estimated local key across different versions of a piece of music and therefore does not rely on annotations. Experiments with multiple deep-learning-based systems show it to be correlated with existing, annotation-based recall metrics. There are differences between the proposed and existing metrics, though: for instance, the ranking between models is not necessarily preserved. However, it is not clear what the proposed metric captures that the existing ones don't, and vice versa.

Positives

  • A well written text, with a clear structure and a good didactic approach.
  • A variety of models and datasets used.

Negatives

  • Narrow focus on classical music. The method relies on the availability of multiple interpretations of the same piece of music, in a way that is common in classical music. It would be interesting to see if the proposed method transcends this narrow field and would also work with popular or jazz music. The exact definition of, and difference between, versions would be crucial there. Is a remaster of a pop song sufficiently different? Does the metric still work with covers? Can this be used with repeated improvisations in jazz music?
  • Creating models of varying quality by saving checkpoints every 10 epochs is the easiest, but likely not the best, way to do this because of the obvious correlation between subsequent checkpoints. It would be better to train small architecture variants, for example by reducing the number of neurons per layer or the number of layers.
  • While the clear explanation and increasing difficulty of the experiments is much appreciated, the proposed contribution feels relatively small to spread out over 6 pages. RQ3 is the one that matters in the end, the preceding RQs are part of the metric's development, but of lesser consequence.
  • It's unclear if the new metric will have a significant impact on the field. The CVC can/will be reported in future papers on LKE, but will new methods be proposed/accepted only based on CVC performance on unlabelled data? Either further insights into the differences between CVC and recall measures or demonstration of its applicability beyond classical music should be added to ensure impact of the work.

Overall

A well executed and presented paper that is easy to read and understand. The proposed metric is interesting, but the narrow focus on classical music and the lack of insight into the differences between the proposed and existing metrics diminish the value of its contribution.

Presentation

  • The correlation matrices presented in Fig. 6 would be better presented as triangular matrices to avoid unnecessary duplication of data and visual distraction.
  • l. 197: "mesaure" should be "measure"
  • l. 237: "architecure" should be "architecture"
  • l. 403: "shorteset" should be "shortest"

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Weak reject

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

All reviewers agree that the ideas presented in this submission are very interesting and have great potential. The excellent presentation and writing style are also much appreciated. Some questions over the interpretation and applicability of the proposed metric remain, however, which would ideally be addressed in future iterations of this work. Do have a look at the individual reviews for more details.

Review 1:

Q2 ( I am an expert on the topic of the paper.)

Disagree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Disagree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Disagree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Strongly Disagree (Well-explored topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The metric proposed is quite simple to implement. It only requires a DTW library (and the datasets). It has an obvious impact when developing novel key estimation methods in unlabelled datasets.
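
To make the "only requires a DTW library" point concrete, the alignment-and-agreement idea could be sketched roughly as follows. This is a toy illustration with invented function names and a Euclidean frame distance; the paper's actual CVC definitions (e.g. the TVD and EMD variants) differ.

```python
import numpy as np

def dtw_path(cost):
    """Compute a DTW alignment path from a pairwise cost matrix
    via standard dynamic programming with an inf-padded border."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # Backtrack from the end of both sequences to the start.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)],
                   key=lambda s: acc[s])
    return path[::-1]

def cross_version_consistency(p, q):
    """Align two versions' frame-wise key predictions, each of shape
    (frames, 24), then measure agreement of the argmax keys on the path."""
    cost = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    path = dtw_path(cost)
    return float(np.mean([p[i].argmax() == q[j].argmax() for i, j in path]))
```

Identical prediction sequences yield a consistency of 1.0, and any disagreement along the warping path lowers the score.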

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Aligning local key estimations in different song performances correlates with recall (not sure if micro or macro).

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

Summary

This paper presents a simple yet effective idea: in the absence of ground truth labels, align multiple datasets of the same songs, and treat agreement between predictions as a proxy for recall.

Strengths

  • Clearly defined problem and straightforward solution
  • Comprehensive evaluation
  • Incorporates a music-knowledge-based scoring variation

Weaknesses

  • The proposed metric correlates with recall, but the rationale for prioritizing recall is not well justified
  • Lack of analysis involving other standard metrics, such as precision or F1-score
  • Unclear whether the reported recall is micro or macro. Given that the authors align it with accuracy, it should be micro.

Detailed Comments

The authors should elaborate on the choice of recall as the primary evaluation metric. From the manuscript, it appears they are using micro recall, which corresponds to overall accuracy in single-label settings such as this one.

However, accuracy (or micro recall) is known to be biased toward the most frequent class. In highly imbalanced settings, a naive classifier that always predicts the most common label can achieve a high recall. It would be valuable to understand whether the proposed method accounts for this bias in any way.

Moreover, a discussion on the trade-offs and limitations of relying solely on recall would strengthen the paper’s evaluation.
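
The majority-class concern above can be illustrated with a small toy example of my own (not from the paper): with exactly one label per frame, micro-averaged recall reduces to plain accuracy, so a classifier that always outputs the most frequent key already scores well on imbalanced data.

```python
from collections import Counter

def micro_recall(y_true, y_pred):
    # Micro recall pools every frame: TP / (TP + FN) over all instances.
    # With a single label per frame, each frame is either a TP or an FN,
    # so this is identical to plain accuracy.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Imbalanced toy ground truth: 80% of frames carry the same key.
y_true = [0] * 80 + [5] * 15 + [7] * 5

# A naive baseline that always predicts the majority key.
majority = Counter(y_true).most_common(1)[0][0]
y_naive = [majority] * len(y_true)

print(micro_recall(y_true, y_naive))  # → 0.8 despite ignoring the input entirely
```

Macro recall, by contrast, would average per-key recalls and expose such a baseline immediately.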

Analysis

I independently downloaded the dataset and used a quick script (with help from ChatGPT) to verify that the concerns raised above were not critical in this particular case. Nevertheless, the lack of discussion on metric selection remains a weakness, and I encourage the authors to address it.

https://chatgpt.com/share/681fe25f-dec4-8006-b7d2-8dd3d1f048c5

Review 2:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q5 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))

While using alignment for evaluation in music is quite unique and related works are mentioned, there could have been a bit more background on the use of generalization as an evaluation metric.

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

Annotation-free evaluation approaches in general have a lot of potential in MIR given the difficulty and cost of obtaining expert annotations. I think the cross-version alignment trick proposed here could inspire new ideas in self-supervised representation learning and evaluation in music.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

The performance consistency of a local key estimation model across versions of the same western classical music track correlates with its local key estimation performance

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Strongly agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The paper cleverly exploits the fact that different performances of a western classical music piece should have the same local key annotations (given the same annotator) to define an annotation-free local key estimation evaluation metric (CVC) describing the cross-version model performance consistency. The key to making this happen is time-aligning the different performances of the same piece.

The paper overall is well-written, with each concept and method being clearly explained. The music theory introduced and explained is sound. I would recommend that the fact that the method is applicable to western classical music be disclosed in the abstract, given that absolutely aligned reperformances with the same structural section ordering/repetitions are not standard outside of this context, at least for existing datasets.

As mentioned before, I think the potential of annotation-free evaluation is big, and it has certainly driven progress in some areas of MIR. This particular trick, while not especially complex, is clever. One could think of it simply as using domain-shift generalization as a proxy for model evaluation, but this application is particularly well suited, given that its "domain shift" almost perfectly preserves non-performance-related factors of variation/content.

Overall, I like the experiments conducted, particularly the idea of using the different checkpoints as a proxy for expected model performance. However, my main point of criticism is that, while inter-annotator agreement is emphasized as a problem in these scenarios (and thus also as a problem for recall/MIREX, given that they are based on annotations), the correlation experiments are still run against them. If there is annotator bias in the annotations used, then correlating with recall based on those annotations is obviously limited and does not address the issue. Obviously I don't think this is an easy problem to solve, but the way it was introduced made me expect that an experiment would be conducted to address it. I think it would be useful to 1. acknowledge the limitations of using these metrics as a basis of comparison in the experiment design, given the annotation limitations introduced, and 2. mention more valid uses of this metric beyond an absolute measure of LKE performance (which is currently also hampered by the cross-model CVC inconsistencies) - one such use would be as a supervision signal.

My other criticism is that, while the Figure 6 experiment is well designed, the conclusion that the CVC variants constitute a novel figure of merit is a somewhat generous interpretation of the weaker correlation, particularly because of the small differences between the original and EMD variants compared to MIREX. I would have liked to see these results investigated further. Given the overall potential of the method, I would have preferred that more space be allocated to a discussion of possible directions of improvement (particularly more musically/perceptually informed approaches), limitations, and future work.

Overall, I think this is a good paper with scientific merit, and some changes within the scope of the camera ready can increase its soundness and impact, though some mentioned limitations would remain.

Review 3:

Q2 ( I am an expert on the topic of the paper.)

Disagree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q5 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))

"Agree" is in favour of the authors, I am not too sure.

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Disagree

Q10 (Please justify the previous choice (Required if "Strongly Disagree" or "Disagree" is chosen, otherwise write "n/a"))

For what is in the paper, there is no issue. However, the correlation values do not indicate the necessity of the CVC. Since the authors have clarified that the CVC is not a replacement for current metrics, but complements them, the applications of CVC must be brought out. What additional utility does it bring?

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

In MIR, quantitative, reproducible measurements not requiring detailed human annotation are always welcome. This paper proposes one such quantification. Inter-annotation consistency also applies in the larger context of MIR, and not only in LKE.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Cross version consistency is a good candidate for evaluating LKE models.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak reject

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The problem and the idea are both very interesting. However, the main question is whether the results are convincing enough. In Figure 6, MIREX and recall agree much better with each other than do CVC_TVD and CVC_EMD, which in turn agree better than any cross combinations like Recall/MIREX versus CVC_*.

On the other hand, the authors have clarified that the CVC is not a replacement for current metrics, but complements them. In that case, the applications of CVC must be brought out. What additional utility does it bring?

Some other points I noticed: Figure 2 is not referred to anywhere in the text; presumably it illustrates the description in Section 3. In Section 3, L is not defined; hopefully it is the number of gray segments in Figure 2. Similarly, M and N can presumably be mapped to the figure. Please clarify in the text, at the introduction of p and q, that the predictions are probabilities over the 24 keys.

I'm not an expert in the subject, but "frame" in the context of audio suggests 10 to 60 ms. The frame length is not stated anywhere in the paper. It is hard to imagine a local key for every frame, unless there is a window of a few seconds ending at that frame.