Improving BERT for symbolic music understanding using token denoising and pianoroll prediction

Jun-You Wang; Li Su

Abstract:

We propose a pre-trained BERT-like model for symbolic music understanding that achieves competitive performance across a wide range of downstream tasks. To achieve this target, we design two novel pre-training objectives, namely token correction and pianoroll prediction. First, we sample a portion of note tokens and corrupt them with a limited amount of noise, and then train the model to denoise the corrupted tokens; second, we also train the model to predict bar-level and local pianoroll-derived representations from the corrupted note tokens. We argue that these objectives guide the model to better learn specific musical knowledge such as pitch intervals. For evaluation, we propose a benchmark that incorporates 12 downstream tasks ranging from chord estimation to symbolic genre classification. Results confirm the effectiveness of the proposed pre-training objectives on downstream tasks.

Meta Review:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 ( The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

Musically informed pre-text tasks (pre-training objectives) improve a model's learning of high-level music representations from symbolic data.

Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)

Testing on a mixture of established and novel symbolic music understanding downstream tasks, the authors show that musically informed pre-training objectives outperform vanilla masked-language modeling in learning meaningful music representations.

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Weak accept

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This is the initial review for the paper "Improving BERT for symbolic music understanding using token denoising and pianoroll prediction", submitted to ISMIR 2025 in Daejeon. The paper approaches representation learning from symbolic music data in various formats (score, score-derived MIDI, performance-derived MIDI) using Transformer-based language modeling. As their main contribution, the authors propose to replace the vanilla masked language modeling (MLM) pre-training with two more musically informed pre-training objectives: correcting slightly corrupted tokens and predicting piano-roll-like pitch and chroma representations. Moreover, the authors extend an existing set of downstream tasks in symbolic music understanding with various other tasks, using this set of tasks for a comprehensive and systematic evaluation that shows the model's efficacy in the context of baselines and other model variants.

In general, this is an interesting idea. While not revolutionary and, in some respects, an incremental adaptation of previous models, the clear description, good motivation and, in particular, the comprehensive and systematic evaluation make it an insightful contribution to ISMIR, which I recommend for acceptance. The only substantial criticism I have is the strict assumption of a 4/4 time signature. This crucially limits the model's applicability and contradicts the musically informed approach (moreover, this important information is mentioned much too late in the paper).

Overall, the paper is well-structured and the writing is clear. However, there are a number of imprecise and even wrong statements/definitions that have to be corrected. Moreover, some important information is given at a later stage of the paper but should be mentioned earlier to guide the reader in the right direction. I list these problems in the following:

line (l.) 11: "predict the [...] piano roll" - from what (single note representation)???
l. 51: Why are 5 tasks not comprehensive enough? Is 12 comprehensive?
l. 64: "infer pitch and chroma distribution from the input note sequence" - this seems trivial as described here. Does it involve the correction of corrupted notes? If so, this should be made clear.
l. 92: "Symbolic music is an abstract form of music" - This is not correct. There is no "symbolic music", maybe "symbolic music data" or "representation". Moreover, these are representations and not "forms of music" - please correct!
l. 97: "Both MIDI and sheet music are forms of symbolic music" - This is also not correct. In particular, sheet music refers to the written artifact, which could be physical paper or (as data) simple pixel graphics, which are NOT symbolic (i.e. machine-readable by explicitly encoding musical information). Please correct this.
l. 145: "raw DB information [by retrieving] tick information (ticks per beat)" - this is unclear. Doesn't it rather require knowledge of the ticks per measure? If this is derived by assuming a constant time signature (4/4), this assumption needs to be mentioned beforehand.
l. 164 ff: What is a "1/4 beat" compared to a "32nd" note? Are beats assumed to be equal to quarter notes? This is a crude simplification of symbolic music data, and crucially limits model performance! Moreover, please stick to one semantic description (beats or note durations) and explain why onset positions and durations are modeled in different resolutions!
l. 187: "15% in practice" - 15% of what?
l. 223: "sampling from all tokens" - from all possible tokens?
l. 255: "tatum-level prediction": confusing, better "local prediction"!
l. 264: "one bar contains 16 tatums" - Why??? Oversimplification!
l. 331: "Piano performer style classification" - up to this point, the reader assumes only score-like music in the data. It should be mentioned much earlier that the model works with performance-like data (e.g. performance MIDI) as well!
l. 434f.: "while excluding all no-4/4 time signatures" - This information comes way too late! Also, this is a crude oversimplification that crucially limits the model's usefulness!
l. 464: "almost the downstream tasks" - almost all the downstream tasks?
l. 502: "outperforms SOTAs in six tasks" - unscientific statement!
l. 515: "Due to page limitation..." - this is not an excuse! Please adapt the paper writing to fit more of these results.
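The grid questions raised for l. 145, l. 164 ff and l. 264 can be made concrete with a little arithmetic. The sketch below is not the authors' preprocessing. It assumes that "ticks per beat" means ticks per quarter note (as in the Standard MIDI File header) and that a tatum is a 16th note; the function names are hypothetical.

```python
from fractions import Fraction


def ticks_per_bar(ticks_per_quarter, numerator, denominator):
    """Bar length in ticks for a time signature numerator/denominator.

    Assumes ticks_per_quarter is the MIDI resolution in ticks per quarter note.
    """
    quarters_per_bar = Fraction(4 * numerator, denominator)
    return quarters_per_bar * ticks_per_quarter


def sixteenths_per_bar(numerator, denominator):
    """Number of 16th-note grid positions (assumed tatums) in one bar."""
    return Fraction(16 * numerator, denominator)


for sig in [(4, 4), (3, 4), (6, 8)]:
    print(sig, ticks_per_bar(480, *sig), sixteenths_per_bar(*sig))
```

Under these assumptions, ticks per beat alone determine the bar length only once the time signature is fixed, and "one bar contains 16 tatums" holds for 4/4 but not for 3/4 or 6/8, where a bar has 12 sixteenth-note positions.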

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Strong accept

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

Summarizing the main aspects from the four reviews, this paper constitutes a valuable and interesting contribution to ISMIR. All reviewers agree on this positive assessment and emphasize the interesting strategy and the insightful multi-task benchmark.

There are two substantial issues of criticism (apart from several minor writing problems), which should be addressed:
* There is a strict assumption of a 4/4 time signature, which substantially limits the model's applicability and contradicts the musically informed approach. Please mention this earlier and bring arguments for this choice.
* The claims from the experiments are too strong. In particular, the effect of the proposed training strategies (token denoising & piano roll prediction) is rather small, while other tweaks have a stronger effect. Please discuss this more carefully and cautiously.

Overall, we congratulate the authors on this interesting submission and look forward to seeing this paper at ISMIR!

Review 1:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Disagree

Q15 (Please explain your assessment of reusable insights in the paper.)

The authors conduct only a small ablation study that doesn't reveal the role of many hyper-parameters.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Token denoising and pianoroll prediction are effective pre-training strategies for transformer-based symbolic music understanding

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper introduces M2BERT, a model for understanding symbolic music (notes). It learns by 1) fixing "broken" notes and 2) predicting piano rolls. It also presents the SMC benchmark (12 tests) to evaluate music understanding, aiming to improve learning beyond old methods. Results are limited by over-simplification of the problem (e.g. only 4/4 timings, only MIDI), and by missing tests and resources, but code is shared, aiding community work. Overall, a valuable ISMIR contribution.

Here is the list of notes that highlight my criticisms.

PROPOSED METHOD
Getting beat start (DB) info is poorly explained (ala "not perfect but works"), hurting trust in input quality and repeatability.
Rounding note timing for tokens might lose rhythm details, bad for time-sensitive music.
Tokenization for scores vs. performances isn't clearly different, despite their timing variations.
Choice of how much to "break" notes for fixing (corruption levels) isn't well justified or tested for optimality.
THE SMC BENCHMARK
Tests mix score and performance data without a clear strategy. This makes it hard to know what the model learns from each distinct music type.
For beat-finding tests, forcing all music to one tempo/rhythm (4/4) is unrealistic and calls the validity of the tests in real-world scenarios into question.
EXPERIMENT SETUP
Excluding non-4/4 music from the "Reduced" learning set limits rhythm understanding. No reason given.
The "Full" dataset model (for SOTA comparison) didn't train long enough (25 epochs, still improving), making results unreliable.
30% note corruption and specific breakage levels lack clear justification or ablation study.
RESULTS
Conclusions from the "Full" dataset (including SOTA claims) are weak due to insufficient training.
Impact of music changes for beat-finding tests isn't discussed enough.
Longer music piece test (ablation) only on two tasks; too limited to generalize.

Review 2:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Disagree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Disagree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly disagree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

In my opinion, the most relevant insight of this work is showing that the popular approach of using a [MASK] token is inferior to using token noising bounded to a small range of values. Interestingly, the authors also show that the former approach is consistently better than using pure random noise.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

This work builds on existing work on symbolic music understanding (MidiBERT) and proposes updating its architecture and token masking strategy and adding pianoroll prediction, showing enhanced performance in 12 different downstream tasks.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

In my opinion, the two strongest points of this work are:

1) To build an extensive downstream evaluation framework with 12 different tracks.
2) To perform an exhaustive ablation of all the proposed contributions, showing mostly consistent improvements in most of the proposed tasks.

As a downside, I find it a bit misleading that, while the paper's title and abstract suggest that the token denoising and the pianoroll prediction are the main contributions of the paper, it can be seen in the table that these are typically not the modifications causing the highest performance boost (except for tasks VE and OT). Most of the time, the highest performance boost is due to the improved architecture (ModernBERT, row 2) or the extended training dataset (row 6). While I believe that ISMIR's policy doesn't allow for a title change, I encourage the authors to highlight the impact of these modifications on the model performance.

Specific comments

Lines 192-197: I disagree with the authors' statement suggesting that the proposed model accounts for domain-specific knowledge. While this is not proven by the proposed experiments, a simpler explanation is that providing noise that is closer to the in-distribution data (instead of a MASK token or pure random noise) makes for a harder task to solve, so it results in more robust representations, which benefit the downstream tasks.
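To make the three corruption variants contrasted here concrete, a minimal sketch follows. It is not the authors' implementation. The `MASK` id, the `max_shift` bound, the vocabulary size and the function name are hypothetical, and the 30% rate is taken only from the figure quoted in Review 1.

```python
import random

MASK = -1  # hypothetical id standing in for a [MASK] token


def corrupt(tokens, strategy, rate=0.3, vocab_size=128, max_shift=2, seed=0):
    """Return a corrupted copy of `tokens` and the corrupted positions."""
    rng = random.Random(seed)
    out = list(tokens)
    n = max(1, round(rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n))
    shifts = [s for s in range(-max_shift, max_shift + 1) if s != 0]
    for i in positions:
        if strategy == "mask":
            # Vanilla MLM: the original value is hidden entirely.
            out[i] = MASK
        elif strategy == "random":
            # Pure random noise: any vocabulary entry (may equal the original).
            out[i] = rng.randrange(vocab_size)
        elif strategy == "bounded":
            # Bounded noise: stay within max_shift of the original value.
            # Clamping at the vocabulary edges can leave a token unchanged.
            out[i] = min(vocab_size - 1, max(0, tokens[i] + rng.choice(shifts)))
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return out, positions


pitches = [60, 62, 64, 65, 67, 69, 71, 72, 74, 76]
noisy, where = corrupt(pitches, "bounded")
print(where, [noisy[i] - pitches[i] for i in where])
```

With bounded noise, a corrupted token remains a plausible neighbour of the original, so the model cannot spot it from its value alone, which is consistent with the "harder task" reading offered above.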

Review 3:

Q2 ( I am an expert on the topic of the paper.)

Strongly agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

Beyond the 12 downstream tasks demonstrated in the paper, the approach appears applicable to various other symbolic Music Information Retrieval (MIR) tasks.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

This paper enhances Midi-BERT with a new backbone, a music-specific token corruption method, and piano-roll/chroma generation pre-tasks, achieving better performance in most of the 12 downstream tasks.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper improves Midi-BERT by introducing a new backbone architecture, a musically informed token corruption method, and piano-roll/chroma generation pre-tasks, achieving superior performance in most of the 12 downstream tasks.

Overall, the writing is clear and accessible, and the model demonstrates clear performance gains compared to Midi-BERT with similar model size and dataset.

Raised question:

From my understanding, the relationship between pitch in CP-tokens and pitch in piano-roll representations is nearly a 1:1 mapping. Unless piano-roll prediction includes an additional spatial (e.g., octave-related) mechanism, its training effect seems limited, which makes me partially disagree with the authors. In contrast, chroma representations could enable learning of octave relationships (e.g., C1-C2) not captured in CP-tokens. Thus, I wish Table 1 included an ablation study showing the individual contributions of chroma and piano-roll to better understand their respective impacts.
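The distinction drawn in this question can be illustrated with a small sketch. It assumes MIDI pitch numbers 0-127 with C4 = 60 (so C1 = 24 and C2 = 36) and uses binary vectors; the helper names are hypothetical, and the paper's actual targets may be defined differently.

```python
def pitch_vector(pitches):
    """128-dim multi-hot vector: one slot per MIDI pitch (piano-roll column)."""
    vec = [0] * 128
    for p in pitches:
        vec[p] = 1
    return vec


def chroma_vector(pitches):
    """12-dim multi-hot vector: pitches folded onto pitch classes."""
    vec = [0] * 12
    for p in pitches:
        vec[p % 12] = 1
    return vec


C1, C2 = 24, 36  # assumes the C4 = 60 naming convention

# The pitch vector is recoverable slot by slot from the pitch tokens.
recovered = [i for i, v in enumerate(pitch_vector([C1, C2])) if v]
print(recovered)

# The chroma vector treats notes an octave apart as the same class.
print(chroma_vector([C1]) == chroma_vector([C2]))
```

In this simplified view, the pitch target is a relabeling of the pitch tokens, whereas the chroma target requires mapping different tokens to the same class, which is the octave relationship the question refers to.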

P4-9: Improving BERT for symbolic music understanding using token denoising and pianoroll prediction

Jun-You Wang, Li Su

Presented In-person

4-minute short-format presentation
