Understanding Performance Limitations in Automatic Drum Transcription

Philipp Weyers; Christian Uhle; Meinard Müller; Matthias Lang

Abstract:

Recent advancements in Automatic Drum Transcription (ADT) have improved overall transcription performance. However, state-of-the-art (SOTA) models still struggle with certain drum classes, particularly toms and cymbals, and the specific factors limiting their performance remain unclear. This paper addresses this gap by leveraging the Separate-Tracks-Annotate-Resynthesize Drums (STAR Drums) dataset to create multiple dataset versions that systematically eliminate potential performance constraints. We conduct experiments using three common ADT deep neural network (DNN) architectures to identify and quantify these limitations. For drum transcription in the presence of melodic instruments (DTM), the primary limiting factor is interference from melodic instruments and singing. Aside from this, performance improves by approximately five percent when training and testing use the same single drum kit, only strong onsets are present, or notes are not played simultaneously. For drum transcription of drum-only recordings (DTD), nearly error-free transcription is achieved when simultaneous onsets are removed. This confirms that overlapping drum hits are the main performance constraint. By identifying key ADT challenges, we provide insights to enhance SOTA models and improve overall transcription accuracy.

Meta Review:

Q2 (I am an expert on the topic of the paper.)

Strongly agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The insights into training/test drum kit mismatch, simultaneous onset challenges, and weak onset impact are highly reusable for future ADT system design. The methodology could serve as a template for analyzing bottlenecks in other MIR tasks.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Performance limitations in ADT systems can be systematically analyzed and mitigated by controlling drum kit variation, onset overlap, and signal complexity using variants of the STAR Drums dataset.

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Weak accept

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper offers a detailed and well-structured investigation of the factors that limit the performance of current ADT systems. By systematically altering the STAR Drums dataset along key dimensions—number of kits, presence of weak/simultaneous onsets, and accompaniment context—it succeeds in quantifying the contribution of each constraint to the overall performance ceiling.

Strengths:
- Methodologically rigorous and clearly explained.
- Explores three model architectures, increasing generality of results.
- Performance metrics are consistent and grounded in prior work.

Suggestions for improvement:
- Some tables (especially Table 3) are dense and could benefit from clearer formatting or graphical summaries.
- It would be useful to discuss the potential implications of these findings for real-world ADT applications (e.g., how these bottlenecks affect user-facing systems).
- The discussion on model failure modes could be deepened with a few more qualitative examples.

Overall, this is a valuable contribution that helps the MIR community better understand how and why ADT systems fail—and where gains can still be made.

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Weak accept

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

This paper conducts a thorough investigation into the performance limitations of automatic drum transcription (ADT) systems by using a well-designed experimental setup based on variants of the STAR Drums dataset. Rather than introducing a novel model, the work takes a diagnostic approach, aiming to isolate and measure the contribution of various known challenges—such as drum kit variability, overlapping onsets, and melodic interference—on ADT performance.

The core value of this paper lies in its methodological clarity and diagnostic utility. Its key contribution is not architectural or algorithmic, but rather analytical: a reusable blueprint for benchmarking ADT bottlenecks. Multiple reviewers praised the clarity of the research questions and the strength of the experimental controls. The paper offers a useful “performance budget” that can help guide future efforts toward the most impactful areas of improvement.

That said, some limitations temper the impact:

Heavy reliance on re-synthesized data raises questions of external validity and real-world applicability.
The evaluation on real recordings is modest in scale and genre diversity.
The paper stops short of proposing or testing solutions to the identified issues, which reduces its immediate applicability to system development.

This paper offers a valuable and reusable diagnostic framework for understanding performance limits in ADT systems. While it lacks immediate practical solutions and broader validation, its analytical contributions may justify its inclusion in the ISMIR program.

Review 1:

Q2 (I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The general approach to separate, transcribe, and resynthesize could be applicable to other instruments/tasks in MIR. Additionally, the idea of constructing different variants to isolate the impact of different limiting factors could be transferrable as well.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

The presence of melodic instruments is still the main limiting factor in ADT, and simultaneous onsets or weak onsets are smaller but still relevant limiting factors.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper investigates the limiting factors of ADT through a systematic evaluation using synthetic data. Specifically, different variants of STAR Drums are synthesized by removing different limiting factors, which allows for the assessment of their individual impacts. Through the analysis of the evaluation results, this paper addresses the research questions with quantitative outcomes, shedding light on the future direction of ADT research. The paper is well-structured and nicely written. The literature survey is comprehensive, and the research questions are well-motivated.

One aspect I really appreciate about this work is the simplicity of the idea. By controlling variables such as number of drum kits, strong/weak onsets, and the presence of simultaneous sounds, one can study their impacts in a consistent environment with clear results. The core idea is straightforward, and the execution appears to be thorough. Generally speaking, I find the answers to the research questions convincing and well-supported.

The main frustration while reading this paper, however, lies in the presentation of the evaluation results (i.e., Table 3). With three CNN architectures and two scenarios (i.e., DTD and DTM), there are many numbers in the same table without labels. The discussion section attempts to guide readers through this table by referring to the changes in numbers; however, it is still very difficult to parse. It took me a while to memorize the order of things, and I still found myself going back and forth just to ensure I was reading the correct numbers. The occasional typo in the numbers (see minor comments below) also makes it harder to follow. In my opinion, this paper could really benefit from some simplification or aggregation in the evaluation section in order to increase clarity.

To summarize, I think this paper provides interesting insights for both DTD and DTM based on a simple yet effective idea. The evaluation is thorough, and the discussion is generally satisfactory. Overall, I believe this paper is a good contribution to the ADT community, and my recommendation is a weak accept. For further improvements, I would definitely encourage the authors to reconsider the presentation of the results.

=============
Minor comments:

Line 61, “STAR Drums” → missing reference to [6]?

Line 116-117, “... with global F-measures above 0.8 …” → on which evaluation set? Any reference?

Line 125, “... especially cymbals are often played with alternating weak and strong onsets…” → HiHat and Ride cymbals, to be more specific

Line 284, “... using a peak picking with” → a peak picking algorithm?

Line 331 → should be 0.79 to 0.73?

Line 334, “... 0.73 to 0.78 and 0.78 to 0.82…” → it is a bit confusing since the previous paragraph was talking about 10Kits. It took me a while to realize this is comparing 20Kits to 1Kits.

Line 350-352, “... suggests that even a relatively low number of virtual drum kits is sufficient to develop well-generalizing models for DTM” → this is a very dangerous statement, especially given the size of MDB Drums. Maybe the drum sounds in MDB are not diverse enough? Maybe the 10Kits somehow match the drum sounds in the MDB samples? In any case, I am a bit skeptical about this statement.

Line 407, “... and decreases for the other two architectures for DTM.” → why? Isn’t 20KitsNoSim an easier test set compared to 20Kits? Any insight/explanation?

Review 2:

Q2 (I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The tests provide a reasonable ballpark quantification of the benefits that can be expected should any of the three investigated challenges in drum transcription be addressed.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

The "Separate-Tracks-Annotate-Resynthesize Drums" (STAR Drums) dataset is a convenient framework for investigating and quantifying the challenges in the task of automatic drum transcription.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

Strengths
- The idea, the purpose, and the setup of this study are sound and clearly presented.
- The work draws from and builds nicely on the previous efforts made by the MIR community.
- The results are clearly interpreted.

Weaknesses
- It would have been great if the dataset used for testing had been kept entirely segregated, i.e., avoiding using it even for validation.
- The testing is only performed on tracks that are re-synthesized. It will be interesting to see how results generalize to "actual" real music.

[minor] One of the conclusions is formulated badly: "Ensuring that drum sounds between training and testing are identical ...". It is clearly impossible to "ensure" that. The conclusion should be formulated as "If the drum sounds in testing are identical to ..., then ...".
[minor] Some more details about the Star datasets would be appreciated. For instance, genre coverage.

Questions
- It is argued that drum source separation could help address the challenge of overlapping onsets. How? How is that task different (easier) than ADT? If source separation is effective, then the same techniques could be applied to ADT, couldn't they?
- Does it make sense to average the F-measure of experiments with different numbers of T/F positives and T/F negatives? Could/should the F-measure be weighted?
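The averaging question can be made concrete with a small sketch. It contrasts an unweighted mean of per-experiment F-measures (macro average) with an F-measure computed from pooled counts (micro average), which implicitly weights each experiment by its number of events. The counts are hypothetical placeholders, not values from the paper, and returning 0.0 for an empty experiment is an assumed convention. Note that the F-measure uses only true positives, false positives, and false negatives; true negatives do not enter it.

```python
from typing import List, Tuple

# (true positives, false positives, false negatives) for one experiment
Counts = Tuple[int, int, int]


def f_measure(tp: int, fp: int, fn: int) -> float:
    """F1 score from raw counts; 0.0 when there are no events (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f(counts: List[Counts]) -> float:
    """Unweighted mean of per-experiment F-measures."""
    return sum(f_measure(*c) for c in counts) / len(counts)


def micro_f(counts: List[Counts]) -> float:
    """F-measure of the pooled counts, so larger experiments weigh more."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f_measure(tp, fp, fn)


# Hypothetical counts: one experiment with many onsets, one with very few.
experiments = [(90, 10, 10), (1, 1, 1)]
macro = macro_f(experiments)
micro = micro_f(experiments)
print(f"macro-averaged F: {macro:.3f}")
print(f"micro-averaged F: {micro:.3f}")
```

With these placeholder counts, the experiment containing only a handful of onsets pulls the macro average down far more than the micro average, which is the effect the question points at.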

Review 3:

Q2 (I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Disagree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The authors create five controllable variants of the STAR Drums dataset, each disabling one suspected bottleneck. This “switch-off-one-factor” blueprint can be copied in any transcription or detection task. By progressively simplifying data we can measure each factor’s cost before deciding where to invest modelling effort.
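The "switch-off-one-factor" comparison can be sketched in a few lines: each simplified variant is scored against the same baseline, and the differences are ranked to show which factor costs the most. The function name and all scores below are hypothetical placeholders chosen for illustration, not results reported in the paper; only the variant labels are taken from this review.

```python
from typing import Dict


def factor_gains(baseline: float, variants: Dict[str, float]) -> Dict[str, float]:
    """Gain in F-measure of each simplified variant over the baseline."""
    return {name: round(score - baseline, 3) for name, score in variants.items()}


# Placeholder scores (NOT from the paper): baseline vs. variants with one factor removed.
baseline_f = 0.70
variant_f = {"NoWeak": 0.74, "NoSim": 0.76, "1Kit": 0.75}

gains = factor_gains(baseline_f, variant_f)
# Rank the factors by how much accuracy is recovered when each is switched off.
ranked = sorted(gains.items(), key=lambda item: item[1], reverse=True)
for name, gain in ranked:
    print(f"{name}: +{gain:.2f}")
```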

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Melodic masking and overlapping drum hits are the dominant error sources in automatic drum transcription, so solving these two problems yields the greatest accuracy gains.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak reject

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

Strengths
(1) Diagnostic clarity: Instead of yet another SOTA system, the paper isolates four hypothesized bottlenecks in ADT and quantifies their individual cost (e.g., melodic masking and simultaneous hits), providing a “performance budget” for future work.
(2) Well-controlled dataset design: The five STAR-Drums variants (20 Kits, 10 Kits, 1 Kit, NoWeak, NoSim) form a clean, switch-off-one-factor benchmark that others can replicate or extend with minimal effort.

Weaknesses
(1) External validity: Results rely largely on re-synthesised audio; the single real-recording test set (MDB, ≈22 min) is stylistically narrow, leaving domain-gap questions unresolved.
(2) While the paper presents an analysis of limiting factors affecting ADT performance, it does not offer concrete solutions to the identified bottlenecks, thereby reducing its immediate applicability and practical impact in real-world scenarios.

Justification: The work offers a valuable analysis toolbox that will help the community allocate effort and funding more intelligently. However, the reliance on synthetic data, modest breadth of external evaluation, and absence of modelling advances limit immediate practical impact. Balancing methodological novelty against these drawbacks yields a rating of weak reject.

P5-10: Understanding Performance Limitations in Automatic Drum Transcription

Philipp Weyers, Christian Uhle, Meinard Müller, Matthias Lang

Presented In-person

4-minute short-format presentation