P5-12: Investigating an Overfitting and Degeneration Phenomenon in Self-Supervised Multi-Pitch Estimation

Frank Cwitkowitz, Zhiyao Duan

Subjects: Machine learning/artificial intelligence for music; Music transcription and annotation; Representations of music; Open Review; Knowledge-driven approaches to MIR; MIR tasks; Music signal processing; MIR fundamentals and methodology; Musical features and properties

Presented In-person

4-minute short-format presentation

Abstract:

Multi-Pitch Estimation (MPE) continues to be a sought-after capability of Music Information Retrieval (MIR) systems, and is critical for many applications and downstream tasks involving pitch, including music transcription. However, existing methods are largely based on supervised learning, and there are significant challenges in collecting annotated data for the task. Recently, self-supervised techniques exploiting intrinsic properties of pitch and harmonic signals have shown promise for both monophonic and polyphonic pitch estimation, but these still remain inferior to supervised methods. In this work, we extend the classic supervised MPE paradigm by incorporating several self-supervised objectives based on pitch-invariant and pitch-equivariant properties. This joint training results in a substantial improvement under closed training conditions, which naturally suggests that applying the same objectives to a broader collection of data will yield further improvements. However, in doing so we uncover a phenomenon whereby our model simultaneously overfits to the supervised data while degenerating on data used only for self-supervision. We demonstrate and investigate this phenomenon and offer insights into the underlying problem.
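The abstract refers to objectives based on pitch-invariant and pitch-equivariant properties. As a rough, generic illustration of the pitch-equivariant idea (a sketch under assumed names and shapes, not the authors' exact formulation), a consistency loss can require that shifting a log-frequency input by k bins shifts the model's salience output by the same k bins:

```python
# Generic sketch of a pitch-equivariance consistency objective (illustrative
# only; the paper defines its own objectives). Assumes `model` maps a
# log-frequency spectrogram to a salience map with values in [0, 1].
import torch
import torch.nn.functional as F

def equivariance_loss(model, X, k):
    """X: (batch, freq_bins, frames); k: pitch shift in frequency bins."""
    Y = model(X)                                        # salience on original input
    X_shifted = torch.roll(X, shifts=k, dims=1)         # transposed input
    Y_shifted = model(X_shifted)                        # salience on shifted input
    target = torch.roll(Y, shifts=k, dims=1).detach()   # expected shifted salience
    # Note: torch.roll wraps around the frequency axis; a real implementation
    # would pad and crop instead of wrapping.
    return F.binary_cross_entropy(Y_shifted, target)
```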

Meta Review:

Q2 (I am an expert on the topic of the paper.)

Strongly agree

Q3 ( The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Disagree

Q10 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))

Please see my comments about $\mathcal{L}_{eg}$ in the detailed discussion.

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Disagree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Strongly Disagree (Well-explored topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The paper implements some ways to incorporate invariance and equivariance principles for data augmentation. These may be useful for other applications in music and audio processing.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Data augmentation methods can help deep models; the paper proposes a self-supervision method that still needs more work.

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Weak reject

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

Contributions:
- The paper presents learning methods for multi-pitch estimation.
- The paper uses transformations such as pitch shifting and time stretching to augment the supervised data.

Limitations: The loss function $\mathcal{L}_{eg}$ in eq. (4) leads to a degradation of performance. I would expect a deeper analysis of this loss function, e.g., plotting the target $\tilde X$ alongside the ground truth $\tilde Y$ to see if it really captures what we want the model to learn. To me, it seems that $\tilde X$ must be smeared out, leading to a trivial solution such as a uniform distribution.

This is an important contribution of the paper and hence must be studied properly. Without this analysis, the contributions seem too little to accept the paper. I would recommend resubmitting the work after this analysis and the subsequent improvements.
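For concreteness, here is a minimal sketch of the diagnostic I have in mind (variable names and shapes are assumptions on my part, not taken from the paper): plot the self-supervised target next to the ground truth and quantify how close the target is to a uniform distribution.

```python
# Hypothetical diagnostic: visualize X_tilde next to Y_tilde, and measure
# whether X_tilde is smeared toward a (near-)uniform distribution.
import numpy as np
import matplotlib.pyplot as plt

def plot_targets(X_tilde, Y_tilde):
    """X_tilde, Y_tilde: (freq_bins, frames) arrays with values in [0, 1]."""
    fig, axes = plt.subplots(1, 2, sharey=True, figsize=(10, 4))
    axes[0].imshow(X_tilde, origin="lower", aspect="auto")
    axes[0].set_title("Self-supervised target $\\tilde{X}$")
    axes[1].imshow(Y_tilde, origin="lower", aspect="auto")
    axes[1].set_title("Ground truth $\\tilde{Y}$")
    fig.tight_layout()
    plt.show()

def uniformity(X_tilde, eps=1e-8):
    """Mean per-frame entropy relative to the uniform maximum (1.0 = uniform)."""
    p = X_tilde / (X_tilde.sum(axis=0, keepdims=True) + eps)  # normalize frames
    entropy = -(p * np.log(p + eps)).sum(axis=0)
    return float(entropy.mean() / np.log(X_tilde.shape[0]))
```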

Other comments:
- Line 150: what is $u[k]$? How is it applied to the input spectrograms? Please state this in the paper.
- In eqs. (3) and (4), $\hat Y$, the estimated output, should be replaced with $\tilde Y$, the target output; $\mathcal{F}(t(X))$ already denotes the estimated output.
- There are minor typos, such as "other other" in line 342.

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Accept

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

All reviewers appreciate the paper and the efforts that went into it. There are some critical comments and some suggestions for further investigations that the authors may look into.

Review 1:

Q2 (I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

Self-supervised learning can negatively affect generalization of a transcription model.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Self-supervised learning can negatively affect generalization of a transcription model.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The paper presents a method that uses self-supervised learning objectives within a supervised and semi-supervised approach to music transcription. It shows an increase in performance when used in a supervised setting, and a surprising loss of generalization when used in a semi-supervised setting. The evaluation is thorough, and although no "solution" to the problem is given, the obtained insights are useful.

I suggest accepting the paper; it is well written. Of course, more experiments and possible solutions would be welcome.

Review 2:

Q2 (I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

Well-documented experiments on systematically increasing the amount of self-supervision in the training process.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

In MPE, and with this combination of losses, self-supervised learning degrades performance.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

In this paper, the authors present a series of experiments that combine supervised and self-supervised learning paradigms for training a multi-pitch estimation model. A key finding is that the model's performance deteriorates when additional unlabeled data from other datasets—used to enable self-supervised learning—is introduced. The authors report the effects of this behavior in detail.

Overall, the paper is very well written, with each section building logically on the previous one. It also provides a useful summary of best practices in data preparation and loss function design, making it a self-contained and informative read for those seeking insight into current state-of-the-art approaches.

A particularly strong aspect of the paper is its systematic and well-documented experimental setup. Table 1 presents a range of strong baselines, followed by results from the proposed model using various combinations of supervised and self-supervised loss functions on datasets that contain appropriate labels.

However, when additional unlabeled data is introduced for self-supervised training, model performance declines. The discussion in Section 4 is especially insightful, as it attempts to diagnose the reasons behind this drop. The authors suggest that when datasets closely related to the supervised data are added for self-supervision, the model performance is negatively impacted—an effect supported by the experimental results.

One point that remains unclear is whether the authors believe self-supervision still holds promise for this task. Could alternative datasets better complement the supervised data and address its limitations? I also wondered about the behavior of the loss functions when combining labeled and unlabeled samples. Are supervised losses deactivated for unlabeled samples? Furthermore, do the supervised and self-supervised losses operate on the same numerical scale, or does one dominate the other? It would be helpful if the authors could report the typical ranges of these losses—e.g., mean supervised loss vs. mean self-supervised loss for labeled and unlabeled data—though this might be more appropriate for future work.
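To make the question concrete, here is a minimal sketch of the kind of bookkeeping meant above (all names are hypothetical; the authors' actual training loop may differ): the supervised term is masked out for unlabeled samples, and the raw magnitude of each term is logged so that scale mismatches become visible.

```python
# Hypothetical loss combination with per-term logging (not the paper's code).
import torch
import torch.nn.functional as F

def combined_loss(pred, target, is_labeled, ssl_loss, lambda_ssl=1.0):
    """pred, target: (batch, freq_bins, frames) salience in [0, 1];
    is_labeled: (batch,) boolean mask; ssl_loss: precomputed scalar tensor."""
    if is_labeled.any():
        # Supervised BCE is computed on the labeled samples only.
        sup_loss = F.binary_cross_entropy(pred[is_labeled], target[is_labeled])
    else:
        sup_loss = pred.new_zeros(())  # supervised term deactivated
    # Log raw magnitudes so one term's scale does not silently dominate.
    print(f"supervised: {sup_loss.item():.4f}  "
          f"self-supervised: {ssl_loss.item():.4f}")
    return sup_loss + lambda_ssl * ssl_loss
```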

I strongly recommend accepting this paper. It is a valuable contribution and is likely to spark productive discussions at ISMIR 2025.

Review 3:

Q2 (I am an expert on the topic of the paper.)

Strongly agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Disagree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The paper performs a series of experiments that will be helpful for those experimenting with self-supervised multi-pitch prediction paradigms. As someone who has also done research in this field, I would have appreciated knowing that someone had found similar problems before.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Multi-pitch estimation may be less trivial than it seems.

Q17 (Would you recommend this paper for an award?)

Yes

Q18 (If yes, please explain why it should be awarded.)

This paper proposes a set of self-supervision objectives for multi-pitch estimation (MPE), a MIR task that has long been studied.

Although many self-supervision approaches have been used for other MIR tasks (tagging, tempo extraction, single-pitch estimation), there was no such method for MPE.

Although the solution proposed here is not totally satisfactory for MPE (or at least not under the MPE metrics defined until now), it is still a contribution, and an interesting one.

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Strongly agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper builds upon the self-supervised objectives proposed in the preprint [1] and performs a series of experiments.

It compares how the self-supervised objectives interact with the usual supervision and provides interesting insights into the resulting behavior. One could imagine that the self-supervised objectives would help improve the model on data for which no labels are present, addressing the difficulty of annotating large amounts of data for MPE. However, worse results are observed.

The selection of datasets seems sufficient, although the MAESTRO and Slakh datasets are not included, which is surprising given how much data is available in them. But again, the data selection seems to contain enough variety of domains.

The insights provided through the different experiments look very interesting as well, demonstrating that there is still a need for further work in this domain.

I think that something is missing, however, during evaluation. As acknowledged by the authors, pitch is something perceptual. While the labels employed for training usually consist of 1s and 0s, the truth is that the self-supervised objectives do not enforce these extreme choices. Moreover, if one thinks of the signal's evolution over time, one could expect the pitches of certain instruments to decay, contrasting with the rigid 1s and 0s found in the labels. Given that a threshold is applied to the model's output, the evaluation may penalize the model if it is actually starting to predict such smooth changes over time. I would suggest providing one extra evaluation (see the sketch below):
- For those frames with active pitch: are the active pitches from the labels within the top-K active pitches of the model's prediction? In this way, another type of evaluation that does not rely on choosing an appropriate threshold could be employed.
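A minimal sketch of such a top-K check (variable names and shapes are my assumptions):

```python
# Hypothetical threshold-free evaluation: for each frame with active pitches in
# the labels, measure what fraction of them fall within the top-K salience bins.
import numpy as np

def topk_frame_recall(salience, labels, k=5):
    """salience: (freq_bins, frames) raw model outputs; labels: binary, same shape."""
    scores = []
    for t in range(labels.shape[1]):
        active = np.flatnonzero(labels[:, t])
        if active.size == 0:
            continue  # skip frames with no active pitch
        topk = np.argpartition(salience[:, t], -k)[-k:]  # indices of K largest bins
        scores.append(np.isin(active, topk).mean())
    return float(np.mean(scores)) if scores else float("nan")
```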

Other comments:
- Line 69: What kind of refinement? They seem to me to be exactly as in [1].
- There is no formal definition of degeneration.
- "teaches the model to degenerate on a specific distribution" sounds like a very weird phrase to me.

[1] F. Cwitkowitz and Z. Duan, “Toward fully self-supervised multi-pitch estimation,” arXiv preprint arXiv:2402.15569, 2024.