Exploring System Adaptations for Minimum Latency Real-Time Piano Transcription

Patricia Hu; Silvan Peter; Jan Schlüter; Gerhard Widmer

Abstract:

Advances in neural network design and the availability of large-scale labeled datasets have driven major improvements in piano transcription. Existing approaches target either offline applications, with no restrictions on computational demands, or online transcription, with delays of 128–320 ms. However, most real-time musical applications require latencies below 30 ms. In this work, we investigate whether and how the current state-of-the-art online transcription model can be adapted for real-time piano transcription. Specifically, we eliminate all non-causal processing, and reduce computational load through shared computations across core model components and variations in model size. Additionally, we explore different pre- and postprocessing strategies, and related label encoding schemes, and discuss their suitability for real-time transcription. Evaluating the adaptations on the MAESTRO dataset, we find a drop in transcription accuracy due to strictly causal processing as well as a tradeoff between the preprocessing latency and prediction accuracy. We release our system as a baseline to support researchers in designing models towards minimum latency real-time transcription.

Meta Review:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 ( The title and abstract reflect the content of the paper.)

Disagree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Disagree

Q5 ( Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))

Some older works exist on real-time melody estimation:
- M. Goto, A real-time music scene description system: Predominant-f0 estimation for detecting melody and bass lines in real-world audio signals, 2004
- V. Arora and L. Behera, On-Line Melody Extraction From Polyphonic Audio Using Harmonic Cluster Tracking, 2013

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Disagree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Strongly Disagree (Well-explored topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The methods used to build a real-time system could also be applied to other audio or music processing tasks.

Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)

The paper explores modifying the STFT window and the model architecture to achieve low latency in real-time piano transcription.

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Weak reject

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

Strengths:
- The paper aims at converting an offline piano transcription system to a real-time one with low latency.
- Low latency can be achieved by changing the STFT window to a Tukey window, and by network modifications such as removing velocity conditioning and sharing some computations.
- Extensive experimentation.

Weaknesses:
- The takeaways from the paper seem limited.
- The performance improvement appears marginal at larger tolerances. E.g., in Table 4, causal-AMT is much worse than mobile-AMT at 30 ms and 50 ms tolerances.
- I wonder, practically, how useful are the F1 scores reported for 10 ms tolerance in Table 4, even for causal-AMT? The paper should include a discussion around the practical uses of the best settings suggested in the paper.
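To make the tolerance question above concrete, the following is a minimal sketch of how an onset F1 score depends on the matching tolerance. It is a simplified illustration, not the evaluation code used in the paper: the function name `onset_f1` and the example onset times are hypothetical, pitch is ignored, and matching is a simple greedy pass over sorted onsets.

```python
def onset_f1(reference, estimated, tolerance):
    """Onset F1 with greedy one-to-one matching; all times in seconds.

    Simplification: pitch is ignored, so this only models a single key.
    """
    ref = sorted(reference)
    est = sorted(estimated)
    matched = 0
    i = j = 0
    while i < len(ref) and j < len(est):
        diff = est[j] - ref[i]
        if abs(diff) <= tolerance:
            matched += 1      # pair these two onsets and move on
            i += 1
            j += 1
        elif diff < 0:
            j += 1            # estimate is too early for this reference
        else:
            i += 1            # reference has no estimate close enough
    if matched == 0:
        return 0.0
    precision = matched / len(est)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical onsets: estimates lag the reference by 20 ms, 40 ms and 0 ms.
reference = [0.50, 1.00, 1.50]
estimated = [0.52, 1.04, 1.50]
for tol in (0.01, 0.03, 0.05):
    print(f"tolerance {tol * 1000:.0f} ms -> F1 = {onset_f1(reference, estimated, tol):.3f}")
```

In this toy case the score rises from 1/3 to 2/3 to 1 as the tolerance grows from 10 ms to 50 ms, which is why a discussion of what a 10 ms tolerance means in practice would help readers interpret Table 4.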

Title:
- The title says "Exploring network adaptations ..." but there are modifications outside the network, such as in the STFT window. Maybe the title could be "Exploring adaptations ..." or "Exploring system adaptations ...".

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Accept

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

All reviewers appreciate the work. The practical utility of the topic and the detailed analysis are admirable. There are some critical comments that the authors may take note of.

Review 1:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Disagree

Q15 (Please explain your assessment of reusable insights in the paper.)

The paper has some insights, such as the obvious (and already known) one that if you want accuracy, especially at lower frequencies, you have to relax some latency requirements. However, the windowing function shaped to enforce causality is interesting.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Real-time piano transcription is feasible, but the most significant tradeoff is between the preprocessing latency and prediction accuracy.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The paper is well written and clear. The non-symmetric windowing function, used to limit the amount of "future" information and steer the analysis towards causality, is a nice feature. However, a certain amount of "future" information is still needed if you want good accuracy. I honestly think that, especially for lower frequencies, this limitation will be hard to overcome, regardless of the method used! Reducing the model size to reduce complexity is another good point, and doing it without sacrificing performance is always a challenge. The evaluation covers different aspects; however, it would be beneficial to see a comparison with other methods, for example the cited [6,7,14,15], on the same test set.
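The low-frequency limitation mentioned above can be illustrated with a back-of-the-envelope calculation. The sketch below assumes the common rule of thumb that separating two sinusoids spaced Δf apart needs an analysis window on the order of 1/Δf seconds; the exact factor depends on the window shape, and a model can also exploit higher partials, so these numbers are indicative only. The helper names are hypothetical.

```python
def midi_to_hz(midi_note):
    """Equal-tempered pitch with A4 (MIDI 69) tuned to 440 Hz."""
    return 440.0 * 2.0 ** ((midi_note - 69) / 12)


def min_window_seconds(midi_note):
    """Rough window length needed to separate a note from its upper semitone.

    Assumption: resolution of about 1/T for a window of T seconds.
    """
    spacing_hz = midi_to_hz(midi_note + 1) - midi_to_hz(midi_note)
    return 1.0 / spacing_hz


# Lowest piano key (A0, MIDI 21) versus A4 (MIDI 69).
for name, note in (("A0", 21), ("A4", 69)):
    print(f"{name}: {midi_to_hz(note):.2f} Hz, "
          f"window of roughly {min_window_seconds(note) * 1000:.0f} ms")
```

Under this rule of thumb, the fundamentals of the lowest piano keys would need several hundred milliseconds of audio to be told apart, far beyond a 30 ms latency budget, whereas mid-range notes fit within a few tens of milliseconds.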

Review 2:

Q2 ( I am an expert on the topic of the paper.)

Strongly agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The authors tried many directions and report both those that succeeded and those that failed, with detailed evaluations and ablations of every step. This makes their work more useful to others working on this problem.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Adapting Mobile-AMT for lower latency and causality, with experiments on MAESTRO.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper is an important step in a direction that is often overlooked in academia, yet is important for practical applications of AMT where low latency is required.

Strengths:
* Numerous experiments and detailed ablations
* Insightful remarks on the baseline model that show deep understanding of the architecture
* Interesting exploration of an under-explored topic that is important for practical applications

Weaknesses:
* Actual processing time is not measured or discussed. Is the resulting model actually fast enough to run in 10 ms on consumer hardware? (I doubt it.) Have the authors tested this? Otherwise, the choice of 10 ms frames should be revisited, since it's not practical.
* Tables are hard to read; results could be presented in a clearer way, or some of the evaluations could be omitted to keep the more important ones.
* The asymmetric window is chosen due to improved results without discussing its effect on the STFT (I can assume it leads to increased spectral leakage due to a larger main lobe and side lobes).
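The effect of an asymmetric window on the STFT, raised in the last point above, could be inspected with a short experiment along the following lines. This is a minimal sketch under stated assumptions: the window here is built from two Hann halves of different lengths, which is not necessarily the window used in the paper, and the sizes (2048 samples in total, 256 for the falling part) and function names are hypothetical.

```python
import numpy as np


def asymmetric_hann(n_total, n_fall):
    """Long rising Hann half followed by a short falling Hann half."""
    n_rise = n_total - n_fall
    rise = np.hanning(2 * n_rise)[:n_rise]   # slow fade-in over older samples
    fall = np.hanning(2 * n_fall)[n_fall:]   # fast fade-out over the newest samples
    return np.concatenate([rise, fall])


def mainlobe_width_bins(window, pad=16):
    """-3 dB main-lobe width, in bins of an unpadded FFT of the same length."""
    spectrum = np.abs(np.fft.rfft(window, n=pad * len(window)))
    spectrum /= spectrum[0]                               # normalize to the DC peak
    below = np.nonzero(spectrum < 10 ** (-3 / 20))[0]     # first drop below -3 dB
    return 2 * below[0] / pad                             # two-sided width


symmetric = np.hanning(2048)
asymmetric = asymmetric_hann(2048, 256)
print("peak position (samples):", symmetric.argmax(), asymmetric.argmax())
print("-3 dB width (bins):",
      mainlobe_width_bins(symmetric), mainlobe_width_bins(asymmetric))
```

The peak position shows how far the window's emphasis moves towards the newest samples, which is what reduces latency, while the main-lobe width gives a first indication of the cost in frequency resolution. A fuller analysis would also compare side-lobe levels.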

Additional comments:

Line 36: “latter two” doesn’t seem to match the order of references. Please specify which ones.
Line 107: I believe Kwon et al. also show an experiment with a latency of 128 ms.
Line 153: Interesting insight about the 10 s latency due to Squeeze and Excitation.
Line 230: Using a weighted loss to deal with class imbalance is a well-known technique, used for example in AMT in the ‘Basic Pitch’ paper by Rachel Bittner et al. It’d be nice to cite that or a more general reference.
Line 304: Is ‘asymmetric Tukey window’ really a correct term here? The Tukey window is a cosine-tapered rectangular window, but here there is no rectangular element. I suppose ‘asymmetric Hann window’ could be a better name.
Lines 361-365: A3-5 and A6-8 are not detailed. Either omit some results from the table, or explain what the experiments are.
Line 379: I’m not sure this logic makes sense to me. Why does robustness to a lower tolerance threshold imply the model is more promising? Why do the authors have a strict tolerance requirement? If the main requirement is for lower latency, the requirement for accurate time-localization can/should be relaxed.
Section 4.4: Which post-processing method is used for Mobile-AMT? Is it evaluated with the same heuristics as Causal-AMT, or with the original heuristics?
Section 4.4: There is no sufficient explanation for why Mobile-AMT performs poorly with strict timing tolerance. This is a surprising result which should be explained.
Section 5: The first two paragraphs repeat the latency discussion from the introduction. Most of the given latency numbers are not relevant when discussing what the desired/required latency for an AMT system is.
Line 468: “...we reduce model size” - what is the difference in terms of FLOPs and/or number of parameters, compared to the baseline?
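For readers unfamiliar with the weighted-loss technique referred to in the comment on Line 230, a generic form is sketched below. It is an illustration of positively weighted binary cross-entropy only, not the loss formulation used in the paper or in the cited work; the function name and the `pos_weight` value are hypothetical.

```python
import numpy as np


def weighted_bce(probs, targets, pos_weight):
    """Binary cross-entropy with extra weight on the (rare) positive class."""
    probs = np.clip(probs, 1e-7, 1 - 1e-7)          # avoid log(0)
    loss = -(pos_weight * targets * np.log(probs)
             + (1 - targets) * np.log(1 - probs))
    return float(loss.mean())


# Hypothetical frame: one active onset among four keys, all predicted at 0.1.
targets = np.array([1.0, 0.0, 0.0, 0.0])
probs = np.array([0.1, 0.1, 0.1, 0.1])
print("unweighted:", weighted_bce(probs, targets, pos_weight=1.0))
print("weighted:  ", weighted_bce(probs, targets, pos_weight=5.0))
```

With `pos_weight` above 1, a missed onset costs more than a false alarm, which counteracts the fact that onset frames are far rarer than silent ones.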

Review 3:

Q2 ( I am an expert on the topic of the paper.)

Strongly agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

By eliminating all non-causal processing components, online models have the potential to become real-time models with low latency.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

An online piano transcription model can be upgraded into a real-time transcription model by replacing non-causal components with a causal version, and also applying proper pre- and postprocessing methods.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper presents a detailed case study of an existing online piano transcription framework, Mobile-AMT, and proposes thorough updates to transform it into a fully causal version capable of real-time processing with a latency below 30 ms.

The analysis and experiment design show the solid amount of work the authors have put into this research. The authors have analyzed the full pipeline of Mobile-AMT, from the pre-processing window and model architecture to the onset identification post-processing algorithms. A corresponding solution is proposed for each issue in the previous framework, followed by experimental results for proper justification. Also, all the modifications follow the overall principle of "causality only". Well structured.

According to Table 4, the proposed model's performance doesn't change much under different tolerances. A more detailed analysis of this phenomenon would help us understand the pros and cons of the proposed model architecture.

It is also worth trying to loosen the n_s to adapt for longer tolerance in Table 4 for a more well-rounded comparison. Although Table 2 has already shown the preliminary results of different window settings, a comparison of the fully trained models can serve as the final performance baseline to show how your model balances latency and accuracy, which will benefit follow-up work.

Minor comments: Line 419: duplicated "note"

P1-10: Exploring System Adaptations for Minimum Latency Real-Time Piano Transcription

Patricia Hu, Silvan Peter, Jan Schlüter, Gerhard Widmer

Presented In-person

4-minute short-format presentation