P4-15: From Discord to Harmony: Decomposed Consonance-based Training for Improved Audio Chord Estimation

Andrea Poltronieri, Xavier Serra, Martín Rocamora

Subjects: Machine learning/artificial intelligence for music ; Music transcription and annotation ; Harmony, chords and tonality ; Open Review ; Knowledge-driven approaches to MIR ; MIR tasks ; Computational music theory and musicology ; Musical features and properties

Presented In-person

4-minute short-format presentation

Abstract:

Audio Chord Estimation (ACE) holds a pivotal role in music information research, having garnered attention for over two decades due to its relevance for music transcription and analysis. Despite notable advancements, challenges persist in the task, particularly concerning unique characteristics of harmonic content, which have resulted in existing systems' performances reaching a glass ceiling. These challenges include annotator subjectivity, where varying interpretations among annotators lead to inconsistencies, and class imbalance within chord datasets, where certain chord classes are over-represented compared to others, posing difficulties in model training and evaluation. As a first contribution, this paper presents an evaluation of inter-annotator agreement in chord annotations, using metrics that extend beyond traditional binary measures. In addition, we propose a consonance-informed distance metric that reflects the perceptual similarity between harmonic annotations. Our analysis suggests that consonance-based distance metrics more effectively capture musically meaningful agreement between annotations. Expanding on these findings, we introduce a novel ACE conformer-based model that integrates consonance concepts into the model through consonance-based label smoothing. The proposed model also addresses class imbalance by separately estimating root, bass, and all note activations, enabling the reconstruction of chord labels from decomposed outputs.

Meta Review:

Q2 ( I am an expert on the topic of the paper.)

Strongly agree

Q3 ( The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Strongly agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The perceptually motivated approach this work takes could instigate similar research in other problems where hard labels seemingly present a performance ceiling.

Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)

A conformer-based model using label smoothing and a consonance-based evaluation metric shows strong promise for ACE.

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Strongly agree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Weak accept

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The presented work uses non-binary distance measures to model inter-annotator agreements together with consonance-based label smoothing as a means to train an audio chord estimation (ACE) model. This is interesting and multidimensional work with compelling results that will receive interest from the research community.

The authors introduce a perceptually informed distance metric, in the context of Western music, that models the agreement between annotators in a musically meaningful way. They first define Mechanical Distance, which quantifies the magnitude of deviation of each incorrect note from the target chord. By combining it with consonance, they arrive at the Mechanical-Consonance metric, which weighs each semitone deviation according to its perceptual consonance value. This metric achieves a higher inter-annotator agreement score and has better discriminative ability with respect to random data.
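To make the idea concrete, the following is a minimal sketch of how a consonance-weighted semitone distance of this kind could be computed. It is not the authors' formulation: the nearest-note matching rule, the function names, and the values in `CONSONANCE` are all assumptions chosen for illustration, and the resulting distance is deliberately simple (it is not symmetric in its two arguments).

```python
# Hypothetical consonance score per interval class (0..6 semitones),
# 1.0 = most consonant. Illustrative values only, not taken from the paper.
CONSONANCE = [1.0, 0.1, 0.3, 0.6, 0.7, 0.9, 0.0]


def circular_semitones(a, b):
    """Smallest distance in semitones between two pitch classes (0..6)."""
    d = abs(a - b) % 12
    return min(d, 12 - d)


def mechanical_distance(est, ref):
    """Sum, over estimated pitch classes, of the semitone distance
    to the nearest reference pitch class (assumed matching rule)."""
    return sum(min(circular_semitones(p, q) for q in ref) for p in est)


def mech_cons_distance(est, ref):
    """Same deviations as above, each scaled by the dissonance
    (1 - consonance) of the deviation interval."""
    total = 0.0
    for p in est:
        d = min(circular_semitones(p, q) for q in ref)
        total += d * (1.0 - CONSONANCE[d])
    return total


if __name__ == "__main__":
    c_major = {0, 4, 7}
    a_minor = {9, 0, 4}
    c_sharp_major = {1, 5, 8}
    print(mechanical_distance(c_major, a_minor), mech_cons_distance(c_major, a_minor))
    print(mechanical_distance(c_major, c_sharp_major), mech_cons_distance(c_major, c_sharp_major))
```

Under these assumed weights, C major is closer to A minor (two shared notes) than to C# major (every note off by a semitone), which is the kind of graded agreement a binary match cannot express.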

The proposed model contains an acoustic front-end followed by a conformer model, which combines CNN and transformer models to capture local and global dependencies. Three fully-connected layers are then used to predict bass, root, and chord pitch-class content, from all of which a symbolic prediction is made after label smoothing, where more consonant intervals receive higher similarity scores.
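As an illustration of the smoothing idea, the sketch below builds soft targets over the 12 pitch classes for a single root label, once with standard uniform smoothing and once with the smoothing mass distributed according to interval consonance. The `CONSONANCE` values, the `eps` parameter, and the function names are assumptions for this sketch; the paper's actual consonance vector and normalisation may differ.

```python
import numpy as np

# Hypothetical consonance score for each interval (0..11 semitones above the
# target pitch class). Illustrative values only, not taken from the paper.
CONSONANCE = np.array([1.0, 0.1, 0.3, 0.6, 0.7, 0.9, 0.0, 0.9, 0.7, 0.6, 0.3, 0.1])


def uniform_smoothing(target, eps=0.1, n_classes=12):
    """Standard label smoothing: eps is spread evenly over non-target classes."""
    soft = np.full(n_classes, eps / (n_classes - 1))
    soft[target] = 1.0 - eps
    return soft


def consonance_smoothing(target, eps=0.1):
    """Spread eps over non-target classes in proportion to the consonance
    of their interval with the target pitch class."""
    # After rolling, weights[i] is the consonance of interval (i - target) mod 12.
    weights = np.roll(CONSONANCE, target).astype(float)
    weights[target] = 0.0
    soft = eps * weights / weights.sum()
    soft[target] = 1.0 - eps
    return soft


if __name__ == "__main__":
    np.set_printoptions(precision=3, suppress=True)
    print(uniform_smoothing(0))     # target C, all other classes equal
    print(consonance_smoothing(0))  # target C, G and F get more mass than C#
```

Either soft target can be used with a cross-entropy loss that accepts class probabilities; the only difference is which confusions are treated as less costly.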

Evaluation on standard datasets reveals better performance than the BTC model. The authors also provide a simple analysis of the penultimate layer activations with and without consonance smoothing. All in all, the competitive results show the viability of the approach, and this work will set a significant benchmark for ACE going forward.

The paper presents a novel approach and provides valuable insights into how to leverage the level of agreement of annotated data and label smoothing in a musically meaningful way toward improving ACE. It is well-written for the most part, with strong motivation and literature review. However, since the paper tackles many different issues, space seems to be tight. It would be beneficial for the authors to make another pass in order to make space for implementation details and include some additional results, as suggested by the reviewers, to improve the presentation. Please go through the references and format them properly (e.g. [35], [38]).

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Accept

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

All reviewers have provided positive feedback about this work and are in agreement that the ISMIR community will benefit from publication of the paper. It contains useful insights into chord estimation and will be of interest to researchers working in this field. Please carefully address the issues raised by the reviewers (especially those from reviewers #2 and #3) in the final form of the paper prior to submission for publication. Since there is limited space, you will need to be creative in compressing the existing content to insert the new informational and clarifying text that we are asking for. Please remember to go through your references to ensure consistency and adherence to the required format.

Review 1:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

No

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

This paper investigates subjectivity, or inter-rater disagreement, in audio chord estimation, and proposes a new metric, Mech-Cons, to better capture chord relationships beyond the binary metric. This metric can be used in future research on chords and harmony.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

This paper improves audio chord estimation by proposing a new Mech-Cons metric that captures chord relationships better than the traditional binary one, and then incorporates it as consonance-based smoothing for better estimation performance.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper improves audio chord estimation by proposing a new Mech-Cons metric that captures chord relationships better than the traditional binary one, and then incorporates it as consonance-based smoothing for better estimation performance.

Although the improvements of the model architecture seem incremental, I really appreciate the discussion of harmonic subjectivity/ambiguity by investigating inter-rater agreement in depth, which results in a new Mech-Cons metric to better reflect the chord relationships beyond the binary metric. This metric can be used in other future research regarding chords and harmony.

The formatting of the references can be improved. For example, the page number for [4] is missing, and some entries provide online links while others don't. I suggest using a unified reference style. Furthermore, the important information about the dataset split in Section 4 is missing; it should be added and specified in the final version. The performance of the BTC model does not coincide with that reported in the original paper, so I assume the authors ran BTC inference on their own test set. Please make sure this is the case (and also make sure the authors' test set does not overlap with any training data of the original BTC), and provide corresponding explanations in the text.

Overall, this is a valid piece of work on audio chord estimation. I will accept this paper.

Review 2:

Q2 ( I am an expert on the topic of the paper.)

Strongly agree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Disagree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Strongly agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Strongly Disagree (Well-explored topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

Authors proposed a new evaluation metric, a new model and a new training technique. All of them are worth studying in future work.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

The authors propose a consonance-based evaluation metric for ACE and also propose a model trained with consonance-based label smoothing that outperforms the strong baseline.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The paper considers two important challenges in automatic chord estimation (ACE): class imbalance and annotation subjectivity. The paper proposes 1) (to tackle the subjectivity) the Mechanical-Consonance metric (Mech-Cons) for evaluation, which takes the consonance relationship between pitches into consideration, 2) a Conformer-based ACE model that predicts root, bass and pitch and then decodes this information into chord symbols, and 3) (to tackle the imbalance) consonance-based label smoothing (LS) that penalizes musically meaningful errors less. Results show that the proposed Mech-Cons metric achieves a higher inter-annotator agreement score, which means it is able to capture the ambiguity between different chords. The proposed model outperforms BTC, a strong ACE baseline, and the consonance-based LS further improves the performance, both in terms of standard mir_eval metrics and the proposed Mech-Cons.

Strengths
- The Mech-Cons metric that the authors propose shows higher inter-rater agreement than other metrics, especially the standard MIREX metric, which means that Mech-Cons better captures humans’ subjectivity in chord annotations.
- The proposed model is novel. While the authors do not point this out or study it further, the process of predicting bass, root and pitch classes and then decoding them to chord labels is new. Note that it is different from [12], where there is a trainable linear layer to predict the chord label from this information.
- The proposed consonance-based smoothing improves the performance of the system, both in terms of standard MIREX scores and Mech-Cons.
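For readers unfamiliar with this kind of decomposed output, the sketch below shows one way a non-trainable decoder could turn root, bass and pitch-class predictions into a chord label. It is only an illustration under stated assumptions: the small `QUALITIES` vocabulary, the cosine template matching, the label syntax (bass written as a note name), and the function name `decode_chord` are all hypothetical and not taken from the paper.

```python
import numpy as np

PITCH_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Small illustrative chord vocabulary: intervals in semitones above the root.
QUALITIES = {
    "maj": (0, 4, 7),
    "min": (0, 3, 7),
    "maj7": (0, 4, 7, 11),
    "min7": (0, 3, 7, 10),
    "7": (0, 4, 7, 10),
}


def decode_chord(root_probs, bass_probs, pc_activations):
    """Rule-based decoding: argmax root and bass, then pick the quality whose
    binary template is most similar (cosine) to the root-relative activations."""
    root = int(np.argmax(root_probs))
    bass = int(np.argmax(bass_probs))
    # Rotate so that index 0 corresponds to the predicted root.
    relative = np.roll(np.asarray(pc_activations, dtype=float), -root)

    best_quality, best_score = None, -np.inf
    for quality, intervals in QUALITIES.items():
        template = np.zeros(12)
        template[list(intervals)] = 1.0
        denom = np.linalg.norm(template) * np.linalg.norm(relative) + 1e-9
        score = float(template @ relative) / denom
        if score > best_score:
            best_quality, best_score = quality, score

    label = f"{PITCH_NAMES[root]}:{best_quality}"
    if bass != root:
        label += f"/{PITCH_NAMES[bass]}"  # simplified inversion notation
    return label


if __name__ == "__main__":
    one_hot = lambda i: np.eye(12)[i]
    pcs = np.zeros(12)
    pcs[[0, 4, 7]] = 1.0
    print(decode_chord(one_hot(0), one_hot(4), pcs))
```

Because the decoder has no trainable parameters, rare chord qualities do not need their own output class, which is how a decomposition of this kind can sidestep class imbalance at the label level.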

Weaknesses and Questions
- I think it is good work, but the presentation can be improved to make it clearer. I understand that, due to the page limit, there are many things that the authors don’t have space to explain in detail (e.g. TbT, mechanical distance and Section 4.2, specified below), but the lack of explanation adds to the difficulty of reading.
- Table 1 should include binary agreement since it is something the authors explicitly mention as a point of comparison. Here, TbT and Mech are proposed in [10] and are (I believe) less familiar to the community than the MIREX score. However, the authors put minimal effort into explaining what TbT and (in particular) Mech are. Mech serves as the basis of the proposed Mech-Cons, so this significantly adds to the difficulty of reading, e.g. why the consonance vector can be used as a parameter.
- Section 4.1 and Table 3 are the main results for the model-wise contributions (2 and 3 in the summary above). However, this part moves too fast, especially for the label-smoothing technique. To show that LS improves the model’s performance, we also need to see BTC with linear LS and consonance LS; otherwise it is reasonable to doubt whether LS works only with the conformer model. Also, it is intuitive that consonance works best with Mech-Cons because they share similar targets. Would the authors agree with that?
- The authors let the model predict bass, root and pitches and then use a separate decoder to convert them into chord labels. This is done by neither [12] nor BTC. It is still a fair comparison if the baseline BTC is also implemented in this way, but the effect of such an uncommon implementation is not investigated.
- Section 4.2 seems a bit incomprehensive due to the limited space. The authors say that “Consonance-based smoothing promotes equidistance…” (Lines 483-485). However, this is not correct. [11] claims that LS encourages equidistance because [11] uses the standard LS (linear, in this paper’s words), but since the authors use a different, delicately designed, consonance-based LS, we should expect something different. To make the whole of Section 4.2 stronger, we need to compare the representations of No LS, linear LS and consonance LS, and not only the semitone-based “C-C#-D” but probably also the fifth-based “C-G-D”.

Overall, I think this paper covers a lot of content, perhaps too much, including the evaluation metric, the move from BTC to Conformer, and training with label smoothing, each of which is probably worth a separate paper, especially the evaluation metrics. I greatly appreciate that the authors, based on their comprehensive literature review, have put everything together and proposed a strong model. However, squeezing everything into one paper means some discussions have to stay at a superficial level and bring limited insights. I would recommend a weak accept. If the paper is accepted, I would suggest that the authors consider the weaknesses and questions mentioned above and focus on only the essential parts of their work (e.g. remove lines 223-240 and give more explanation of the basic mechanical distance).

Minor corrections:
- Put Figure 1 on the same page as Section 3.2.
- Some of the references are not properly formatted. For example, [15] is published in ICASSP. Many papers are from ISMIR but are formatted differently, e.g. [4], [5], [6] and [14]. If the paper is accepted, the authors should clean up the references.

Review 3:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q5 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))

A recently published ACE paper using Conformer architectures is highly relevant and could be discussed as part of the related work.

M.W. Akram, S. Dettori, V. Colla, and G. C. Buttazzo, “Chordformer: A conformer-based architecture for large-vocabulary audio chord recognition,” CoRR, vol. abs/2502.11840, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2502.11840

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The perceptually informed smoothing techniques can enhance model generalization and harmonic sensitivity, offering a musically grounded approach that can be extended to other MIR tasks.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

The paper demonstrates that incorporating consonance-based label smoothing into a conformer-based model enhances audio chord estimation by better capturing harmonic relationships.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper proposes a novel conformer-based model for Audio Chord Estimation (ACE), enhanced with a consonance-based label smoothing technique. It aims to address persistent challenges in ACE by leveraging perceptually informed metrics and smoothing strategies grounded in music theory.

Strengths
• Clarity and Structure: The paper is well-written, with a clear structure that makes it easy to follow. The motivation is well-established, and the progression from problem identification to solution and evaluation is logically organized.

• Innovative Use of Consonance-based Smoothing: The introduction of consonance-based label smoothing is both novel and musically meaningful. The authors make a compelling case for how this method aligns better with human perception than traditional binary or uniform smoothing techniques.

Weaknesses and Suggestions
1. Lack of Detail on Data Splits: The paper does not explain how the training, validation, and test sets were divided. This information is critical for reproducibility and should be clearly stated, including whether any cross-validation was performed or how the validation set was selected.
2. Since the reported performance improvements are relatively marginal in some metrics, it would strengthen the empirical claims if standard deviations were included.
3. Unclear BTC Evaluation Procedure: The paper compares the proposed model to the BTC model [24] in Table 3, but it is not clear whether the BTC results were reproduced using the original model weights, retrained under the same setup, or reimplemented. Clarifying this is essential for interpreting the fairness and reliability of the comparison.
4. While BTC is a known method, it differs from the proposed model in both architecture (bidirectional Transformer vs. Conformer) and decoding strategy (BTC lacks the root/bass/pitch-based decoding described in Section 3.4). A more suitable baseline would be the model in [23], which includes a decoding strategy and is publicly available. Comparing against [23] could better isolate the contributions of the conformer architecture versus the decoding mechanism.
5. Although the Mechanical-Consonance metric is well-motivated, it would be beneficial to also test label smoothing using the Tone-by-Tone metric in Table 3. This would provide further empirical grounding for the choice of Mechanical-Consonance smoothing.
6. Justification for Decoding Choice: It would be helpful if the authors provided a rationale for adopting the decoding approach from [12] instead of the one from [23], especially since [23] argues that their decoding method offers advantages and is more expressive.
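Regarding points 1 and 2, the following minimal sketch shows the kind of protocol that would address them: a seeded, track-level split that can be reported exactly, and a mean and standard deviation over repeated runs. The function names, the split fractions, and the use of track identifiers are assumptions for illustration; they do not describe what the authors actually did.

```python
import random
import statistics


def split_tracks(track_ids, val_frac=0.1, test_frac=0.1, seed=0):
    """Deterministic track-level split: all frames of a track end up in
    exactly one of train / validation / test."""
    ids = sorted(set(track_ids))           # fixed order before shuffling
    random.Random(seed).shuffle(ids)       # local RNG, reproducible per seed
    n_test = round(len(ids) * test_frac)
    n_val = round(len(ids) * val_frac)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test


def summarize(scores):
    """Mean and sample standard deviation of one metric over several runs."""
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    # Hypothetical track identifiers, for demonstration only.
    tracks = [f"track_{i:03d}" for i in range(50)]
    train, val, test = split_tracks(tracks, seed=42)
    print(len(train), len(val), len(test))
```

Reporting the seed and fractions (or simply publishing the resulting track lists) is enough for others to reproduce the split, and `summarize` applied to per-seed scores gives the spread needed to judge whether small differences between smoothing variants are meaningful.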

Minor Issues and Typos
• Line 335 / 382: There is inconsistency in English style: “normalisation” (UK) in Line 335 and “normalize” (US) in Line 382. Please standardize the English style throughout the paper.
• Table 3: The best result on the Tetrads metric is achieved by the model with Linear smoothing, but the bold text is mistakenly applied to the Consonance model. Please correct the formatting.

I appreciate the potential of this work and commend the authors for addressing an important challenge in audio chord estimation using a musically meaningful approach. However, due to several unresolved issues in the experimental setup and evaluation, I believe the work would benefit from further clarification and refinement before publication.