P4-12: Count The Notes: Histogram-Based Supervision for Automatic Music Transcription
Jonathan Yaffe, Ben Maman, Meinard Müller, Amit Bermano
Subjects: Machine learning/artificial intelligence for music ; Music transcription and annotation ; Annotation protocols ; Evaluation, datasets, and reproducibility ; Open Review ; Alignment, synchronization, and score following ; Knowledge-driven approaches to MIR ; MIR tasks
Presented In-person
4-minute short-format presentation
Automatic Music Transcription (AMT) converts audio recordings into symbolic musical representations. Training deep neural networks (DNNs) for AMT typically requires strongly aligned training pairs with precise frame-level annotations. Since creating such datasets is costly and impractical for many musical contexts, weakly aligned approaches using segment-level annotations have gained traction. However, existing methods often rely on Dynamic Time Warping (DTW) or soft alignment loss functions, both of which still require local semantic correspondences, making them error-prone and computationally expensive. In this article, we introduce CountEM, a novel AMT framework that eliminates the need for explicit local alignment by leveraging note event histograms as supervision, enabling lighter computations and greater flexibility. Using an Expectation-Maximization (EM) approach, CountEM iteratively refines predictions based solely on note occurrence counts, significantly reducing annotation efforts while maintaining high transcription accuracy. Experiments on piano, guitar, and multi-instrument datasets demonstrate that CountEM matches or surpasses existing weakly supervised methods, improving AMT's robustness, scalability, and efficiency. Our project page is available at https://yoni-yaffe.github.io/count-the-notes.
Q2 (I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work.)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Disagree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Disagree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
Histograms are a viable alternative to sequence information in AMT training supervision.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
Histograms provide sufficient information as input to a theoretically sound training process.
Q17 (This paper is of award-winning quality.)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)
Weak accept
Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
This work builds entirely on NoteEM (Reference 5). The NoteEM work presents certain limitations of using weakly aligned targets, such as repeated cadenzas, subtle nuances such as trills (which can easily be rendered differently from the score), and chords or arpeggios whose notes can arrive with delays and in a different order. The current work aims to relax the assumption of undisturbed event order by using an even weaker form of supervision via a note histogram (i.e., note counting only). The approach obtains an improvement over NoteEM on the GuitarSet and GAPS datasets. It would have helped if the authors had discussed the precise contexts where the predictions differed, thereby providing insights into real-world score-audio mismatches. The paper presents results on simulated noise added to the MAESTRO dataset, but it is not clear what real-world musical phenomena are captured by the noisy histograms generated in this manner. Overall, the descriptions and explanations fall short of ideal.
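To illustrate the kind of scenario that such a discussion could cover, the following minimal sketch shows why a note histogram is insensitive to event order. It is not taken from the paper: the function name `onset_histogram`, the `(onset_seconds, midi_pitch)` event format, and the example notes are all assumptions made for illustration.

```python
from collections import Counter

def onset_histogram(notes, start, end):
    """Count note onsets per pitch inside the window [start, end)."""
    return Counter(pitch for onset, pitch in notes if start <= onset < end)

# Hypothetical (onset_seconds, midi_pitch) events: a rolled chord plus one repeat.
score = [(0.00, 60), (0.05, 64), (0.10, 67), (0.50, 60)]
# Same notes, but the performer rolls the chord in a different order.
performance = [(0.02, 67), (0.06, 60), (0.11, 64), (0.55, 60)]

# Order-based supervision sees two different pitch sequences ...
score_order = [pitch for _, pitch in sorted(score)]
performance_order = [pitch for _, pitch in sorted(performance)]
same_order = score_order == performance_order

# ... while count-based supervision sees identical targets.
same_histogram = (
    onset_histogram(score, 0.0, 1.0) == onset_histogram(performance, 0.0, 1.0)
)
print(same_order, same_histogram)  # False True
```

The same invariance also means that a histogram cannot distinguish a correct rendition from one with reordered notes, which is one reason a discussion of concrete mismatch cases would be informative.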
Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)
Weak accept
Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))
Summarising the reviews post discussion phase, all reviewers agree that while the approach of histogram supervision is relevant and interesting, more discussion is needed to justify the substantial novelty of the contribution over the NoteEM paper. Clearer discussion is needed on how histogram supervision accounts for real-world score-audio mismatches with examples of specific scenarios where it helps over the DTW method.
Q2 (I am an expert on the topic of the paper.)
Strongly agree
Q3 (The title and abstract reflect the content of the paper.)
Disagree
Q4 (The paper discusses, cites and compares with all relevant related work)
Disagree
Q5 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))
The addressed problem is not well defined. This paper focuses on multi-pitch estimation, not music transcription, which can also include other information and labels such as instrument timbres, time signature, etc.
Second, the study is limited to supervised multiple-f0 estimation methods and only proposes to improve the training of the parameters.
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Disagree
Q10 (Please justify the previous choice (Required if "Strongly Disagree" or "Disagree" is chosen, otherwise write "n/a"))
The mathematical notations and definitions in the paper are not always clearly introduced. In particular, readers are often left to infer the meaning of symbols directly from Algorithm 1 or context, rather than from explicit definitions in the text. For example, it is not immediately obvious that $a_i$ represents an entire audio input (e.g., a time-domain or feature representation of a full audio segment), rather than a single sample at time $i$. Similarly, $h_i$ refers to the corresponding note histogram for that segment, but this is only implicitly stated.
Furthermore, the initialization process in Algorithm 1 lacks clarity. It is not fully explained how the initial values for $Y_i$ and $d_i^{\text{hist}}$ are set or used in the first iteration.
A clearer description of the algorithm's inputs, outputs, and initialization steps would improve the paper's readability and reproducibility.
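As an indication of the level of explicitness that would help, the sketch below spells out one possible reading of the inputs and outputs of a count-constrained labeling step. It is an assumption for illustration, not the paper's Algorithm 1: the function `pseudo_labels` is hypothetical, the model's per-frame onset probabilities stand in for the network output on an audio segment $a_i$, the dictionary `hist` plays the role of $h_i$, and the returned frame indices play the role of a pseudo-label $Y_i$.

```python
def pseudo_labels(onset_probs, histogram):
    """Keep, for each pitch, the `count` most probable frames as onsets.

    onset_probs: dict pitch -> list of per-frame onset probabilities
    histogram:   dict pitch -> number of onsets annotated for the segment
    Returns dict pitch -> sorted list of selected frame indices.
    """
    labels = {}
    for pitch, probs in onset_probs.items():
        count = histogram.get(pitch, 0)  # pitches absent from the histogram get no onsets
        ranked = sorted(range(len(probs)), key=lambda t: probs[t], reverse=True)
        labels[pitch] = sorted(ranked[:count])
    return labels

# Hypothetical model output for a five-frame segment and two pitches.
probs = {60: [0.1, 0.9, 0.2, 0.7, 0.1], 64: [0.3, 0.1, 0.6, 0.2, 0.1]}
hist = {60: 2, 64: 1}
labels = pseudo_labels(probs, hist)
print(labels)  # {60: [1, 3], 64: [2]}
```

This naive top-k selection ignores that adjacent high-probability frames may belong to the same onset; a real implementation would presumably pick local peaks instead, which is exactly the kind of detail the text should state.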
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Disagree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Strongly Disagree (Well-explored topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
CountEM can be integrated with existing AMT methods, potentially enhancing their performance.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
This paper introduces a new fine-tuning approach to improve supervised AMT methods using the histogram of the pitches.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Weak accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
This paper introduces a novel approach for improving automatic music transcription (AMT). To this end, it uses a pitch-histogram criterion, called CountEM, which does not require temporal alignment and can fine-tune the estimated parameters of the model. While the idea is conceptually simple, it demonstrates potential usefulness and provides insights that could benefit and complement existing AMT techniques.
Strengths: -The proposed method is simple yet effective, and it shows the potential of the approach without relying on explicit temporal alignment.
-The framework offers reusable insights and can be integrated with existing AMT methods, potentially enhancing their performance.
-The authors provide listening examples and a demo, which help illustrate the practical relevance and perceptual quality of the results.
Weaknesses: -The proposed contribution appears incremental and closely related to NoteEM [5]. The paper should better emphasize the originality of its contributions and clearly delineate what is novel beyond existing work.
-Maybe the CountEM process could be replaced by a regularization term within a standard training pipeline, rather than requiring a dedicated EM algorithm. This alternative should be discussed and justified in more depth.
-The experimental results are limited. The study includes only two baseline methods (Hawthorne et al. [1,2] and Kong et al. [3]), while more recent and state-of-the-art methods such as Transkun [Yujia Yan & Zhiyao Duan, ISMIR 2024] are not considered. A broader comparison would help clarify the benefits of the proposed approach.
-The CountEM method may introduce errors when the histogram prior assumption is not met. This potential limitation should be addressed through targeted experiments and discussed to better understand the trade-offs.
-The paper would benefit from improved organization and clearer exposition, particularly through a more formal problem definition and consistent use of mathematical notation.
-No implementation code is provided, and there is no discussion of the computational complexity or runtime performance of the proposed method, which limits its reproducibility and practical evaluation.
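The regularization alternative raised in the weaknesses above could, for instance, take the form of a count-matching penalty added to a standard training loss. The sketch below is only one possible instantiation and is not taken from the paper; the function `count_penalty` and the example values are hypothetical, and it assumes that each onset is active in exactly one frame, so that the sum of per-frame onset probabilities approximates the expected onset count.

```python
def count_penalty(onset_probs, histogram):
    """Squared gap between expected and annotated onset counts, summed over pitches.

    onset_probs: dict pitch -> list of per-frame onset probabilities
    histogram:   dict pitch -> annotated onset count for the segment
    """
    pitches = set(onset_probs) | set(histogram)
    total = 0.0
    for pitch in pitches:
        expected = sum(onset_probs.get(pitch, []))  # expected count under the assumption above
        total += (expected - histogram.get(pitch, 0)) ** 2
    return total

# Hypothetical predictions: pitch 60 matches its count, pitch 64 is under-predicted.
probs = {60: [0.5, 0.5, 1.0], 64: [0.25, 0.25]}
penalty = count_penalty(probs, {60: 2, 64: 1})
print(penalty)  # 0.25
```

Unlike hard pseudo-labels, such a term is differentiable in the probabilities, but it does not say which frames should carry the onsets; a comparison of the two options would clarify whether the dedicated EM procedure is needed.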
Recommendation: Although the paper is borderline in terms of novelty and experimental depth, I recommend a weak accept, as the core idea is potentially useful and could inspire further developments within the ISMIR community.
Q2 (I am an expert on the topic of the paper.)
Strongly agree
Q3 (The title and abstract reflect the content of the paper.)
Strongly agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Strongly agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Strongly agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
This work proposed a new idea for fine-tuning the transcription model without DTW refinement.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
CountEM introduces a novel AMT framework that uses histogram-based supervision to achieve improved transcription with reduced annotation effort and computational complexity.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Weak accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
Summary: The paper introduces CountEM, a novel framework for Automatic Music Transcription (AMT) that leverages histogram-based supervision to convert audio recordings into symbolic musical representations. Unlike traditional AMT methods that rely on strongly aligned frame-level annotations or weakly aligned methods using Dynamic Time Warping (DTW) [1], CountEM uses note event histograms (counts of note onsets within time windows) as supervision. This approach eliminates the need for precise temporal alignment, reducing annotation effort and computational complexity. CountEM employs an Expectation-Maximization (EM) algorithm to iteratively refine predictions, starting with a model pre-trained on synthetic data. The framework is evaluated on piano (MAESTRO), guitar (GuitarSet), and multi-instrument (MusicNet, URMP) datasets, demonstrating performance improvements for both single-instrument and multi-instrument transcription.
Review Comment: Recent literature has explored fine-tuning or semi-supervised retraining of transcription models pre-trained on datasets like MAESTRO. This paper offers a novel insight by proposing that DTW-based alignment can be replaced with histogram-based coarse peak picking to provide weak labels for fine-tuning. The experimental design is thorough, with detailed descriptions of the datasets and window sizes used. Notably, the noisy histogram setting is particularly compelling, as it simulates real-world transcription scenarios.
I suggest the following revisions to strengthen the paper:
-Figure 1 and the corresponding paragraphs in Section 2 are difficult to follow. Terms such as “Ordering Inaccuracy,” “Translation Inaccuracy,” and “Timing Inaccuracy” are not clearly defined or referenced in the main text of Section 2. Additionally, the “[1*4] histogram supervision” is unclear; it took time to realize this refers to the count of pitches within a time window. The authors should provide a clearer explanation of Figure 1, as it appears central to the paper’s contribution.
-The highest scores in each table for each setting should be bolded to improve readability and emphasize the best-performing configurations.
-The paper should compare the training time of DTW-based methods versus histogram-based supervision.
-In some settings, methods from [1, 2] outperform CountEM, so the authors should emphasize the specific advantages of histogram-based supervision, such as reduced training time or improved performance in certain scenarios. The benefits of this approach need to be more clearly articulated, especially since fine-tuning is known to improve transcription performance based on prior literature.
Overall, the paper proposes an innovative approach to fine-tuning transcription models, but its performance does not consistently surpass previous works. The advantages of histogram-based supervision should be further highlighted to strengthen the contribution. Based on the current content, I recommend a weak accept for this work.
[1] B. Maman and A. H. Bermano, “Unaligned supervision for automatic music transcription in the wild,” in Proceedings of the International Conference on Machine Learning (ICML), Baltimore, Maryland, USA, 2022, pp. 14918–14934.
[2] X. Riley, Z. Guo, D. Edwards, and S. Dixon, “GAPS: A large and diverse classical guitar dataset and benchmark transcription model,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), San Francisco, USA, 2024.
Q2 (I am an expert on the topic of the paper.)
Disagree
Q3 (The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Disagree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
The presented method for weakly supervised AMT could inspire a wide range of MIR fields that depend on alignment. Take, e.g., phoneme-level lyrics alignment, which also suffers from a severe lack of precisely aligned data. It could be very interesting to see these methods applied to other alignment-dependent tasks.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
Note event histograms can be used for supervising expectation-maximization to achieve state-of-the-art results in weakly-supervised automatic music transcription.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Weak accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
The core text is well written: the structure is logical; the phrasing is understandable; the grammar is good; and the punctuation makes it easy to distinguish between individual ideas, so the work is easily digestible. I appreciate how the authors introduce only relevant information and supply it with contextual introductions, in case the reader is unaware of how it is relevant. Additionally, they add several examples to put ideas in context. The figures are good, with Figure 1 being self-explanatory in conveying the overall idea of the paper. The experiments are presented in a convincing manner.
As I understand, the code is not going to be public, and if that is the case, I have a feeling that section 2 might be a little difficult to follow for reproduction. Algorithm 1 could have some more thorough explanation in my opinion, as the presented math and how it fits together takes some time to understand. Considering the argument on the inefficiency of DTW, it would seem obvious to include a discussion on the efficiency increase.
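For context on the efficiency argument, classic DTW fills a table with one cell per pair of elements from the two sequences, so its cost grows with the product of their lengths. The sketch below is a textbook DTW on two short numeric sequences, not the alignment used in the paper or in NoteEM; the function name `dtw_cost`, the absolute-difference local cost, and the example sequences are assumptions chosen for illustration.

```python
def dtw_cost(a, b):
    """Classic DTW between two numeric sequences with absolute-difference cost.

    Returns the total alignment cost and the number of table cells filled.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor paths.
            table[i][j] = step + min(
                table[i - 1][j], table[i][j - 1], table[i - 1][j - 1]
            )
    return table[n][m], n * m

# Hypothetical sequences of lengths 4 and 2, so 8 cells are filled.
cost, cells = dtw_cost([1, 2, 3, 3], [1, 3])
print(cost, cells)  # 1.0 8
```

A discussion in the paper could contrast this quadratic table with the cost of the histogram-based labeling step, ideally backed by measured training times.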