Abstract:

Current methods for Music Structure Analysis (MSA) focus primarily on audio data. While symbolic music can be synthesized into audio and analyzed using existing MSA techniques, such an approach does not exploit symbolic music's rich explicit representation of pitch, timing, and instrumentation. A key subproblem of MSA is section boundary detection: determining whether a given point in time marks the transition between musical sections. In this paper, we study automatic section boundary detection for symbolic music. First, we introduce a human-annotated MIDI dataset for section boundary detection, consisting of metadata from 6134 MIDI files that we manually curated from the Lakh MIDI dataset. Second, we train a deep learning model to classify the presence of section boundaries within a fixed-length musical window. Our data representation involves a novel encoding scheme based on synthesized overtones to encode arbitrary MIDI instrumentations into 3-channel piano rolls. Our model achieves an F1 score of 0.77, improving over the analogous audio-based supervised learning approach and the unsupervised block-matching segmentation (CBM) audio approach by 0.22 and 0.31, respectively. We release our dataset, code, and models.

Meta Review:

Q2 (I am an expert on the topic of the paper.)

Strongly agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The finding that the Lakh dataset contains "markers" that have not been studied, but that might relate to music structure, is a great discovery that could benefit models in this field.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

The Lakh dataset may have more information in it than we yet know how to use. Also, predicting boundaries in MIDI scores is possible using standard techniques.

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Weak accept

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The article makes two major contributions that I think should be presented at ISMIR:

  • It introduces a new dataset of boundary annotations in over 6000 MIDI files.
  • It adapts the boundary prediction algorithm of Grill and Schlüter 2015 (GS15) to a MIDI context, and studies some parts of the adapted model in an ablation study.

However, describing a new dataset and a new algorithm in the same work leaves less room to do either of them justice. For the dataset, we learn mostly about the preparation of the data and less about the contents. For the algorithm, we understand the method well but not the design choices made, and there are many missed opportunities in the evaluation.

Extended comments:

The motivation for this work and the rationale for the method they present are made clear in the Introduction and Related Work sections. The dataset — which could have been reported in a publication on its own — is described clearly in Section 4, and is a delightful discovery of the authors. On the other hand, since the dataset is only part of the paper, there is no room to present and discuss an illustrated example, or to discuss basic statistics of the dataset (e.g., number of artists; variety of genres; average segment duration; etc.). How "diverse" datasets are is discussed twice in the paper (line 93, line 238), so the lack of detail here is surprising.

The explanation of the method (Section 3) was clear, although the rationale for the overtone encoding feature was not clear to me. Many of the choices made in designing it are neither discussed nor defended. Why 3 overtones? Why randomise their frequency and velocity? Why not linear decay? What exponent of decay was used and why? Was it the same for each overtone? Why or why not?
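To make the design questions above concrete, here is a minimal sketch of the kinds of choices involved. It is not the authors' encoding: the function names, the rounding of harmonics to the nearest semitone, and the two decay laws (including the `rate` parameter) are assumptions chosen only for illustration.

```python
import math

def overtone_pitches(midi_pitch, n_overtones=3):
    """MIDI pitches of harmonics 2..n_overtones+1, rounded to the nearest semitone.

    Harmonic k lies 12 * log2(k) semitones above the fundamental.
    """
    return [midi_pitch + round(12 * math.log2(k)) for k in range(2, n_overtones + 2)]

def decayed_velocities(velocity, n_overtones=3, mode="exponential", rate=0.5):
    """Velocity assigned to each overtone under a hypothetical decay law."""
    out = []
    for i in range(1, n_overtones + 1):
        if mode == "exponential":
            # Each successive overtone is scaled by a constant factor.
            out.append(velocity * rate ** i)
        else:
            # Linear: velocity falls by equal steps and would reach zero
            # one step past the last overtone.
            out.append(velocity * max(0.0, 1 - i / (n_overtones + 1)))
    return out

print(overtone_pitches(60))                       # overtones of middle C
print(decayed_velocities(100, mode="exponential"))
print(decayed_velocities(100, mode="linear"))
```

Even in this toy version, the number of overtones, the decay law, and its rate are all free parameters, which is why a short justification of each in the paper would help.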

I found Section 5 harder to understand. The ablation study (Section 5.2) was the most interesting part of this section. The evaluation (Section 5.3) was disorienting, since it discusses the results of the baseline algorithms before they are explained (in line 381 and line 395). The subsection on audio-based approaches (Section 5.4) describes algorithm designs, unofficial iteration, new aspects of prior work (like HPSS), and evaluation strategies, all within a single subsection of the "Experiment" section. I recommend moving the explanation of the analogous audio method earlier, possibly to Section 3. Also, it would be valuable to perform another ablation study on the analogous audio method and report the results.

Other comments:

  • The authors mention that RWC is more diverse than SALAMI, but the SALAMI set is fairly diverse, with lots of jazz, classical and "world" music.
  • The word "our" in the section titles "Our Method" and "Our Dataset" is not needed.
  • How was the harmonic overtone series encoding inspired by the "Attention is All You Need" paper [27]?
  • The analogous audio method and the CBM system are listed as "baselines". Aren't they just competing systems?
  • Given that converting MIDI to audio is possible, why not compare the proposal with other audio-based systems like those discussed in [8]?
  • The phrase "measure endpoints" (line 255) leads to a garden-path sentence. They could also be called "bar lines".
  • When positive examples are oversampled by a factor of 2 (lines 318–9), does this mean each positive example is viewed twice, or that some negative examples are ignored? This could be clarified in lines 340–4.
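The two readings of "oversampled by a factor of 2" in the last comment above can be sketched as follows. This is only an illustration of the ambiguity, not the authors' sampling code; the function names and the `keep_fraction` parameter are hypothetical.

```python
import random

def oversample_positives(examples, factor=2):
    """Reading A: every positive example appears `factor` times; negatives are kept once."""
    out = []
    for x, label in examples:
        out.extend([(x, label)] * (factor if label == 1 else 1))
    return out

def undersample_negatives(examples, keep_fraction=0.5, seed=0):
    """Reading B: positives are kept once; each negative is kept with probability keep_fraction."""
    rng = random.Random(seed)
    return [(x, y) for x, y in examples if y == 1 or rng.random() < keep_fraction]

# Toy data: 2 positive and 6 negative windows.
examples = [(i, 1 if i % 4 == 0 else 0) for i in range(8)]
print(len(oversample_positives(examples)))   # 2 positives doubled + 6 negatives
print(len(undersample_negatives(examples)))  # 2 positives + a random subset of negatives
```

Both readings change the positive-to-negative ratio, but only the first shows the model each positive example more than once per epoch, so stating which one was used matters for reproducibility.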

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Accept

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

The reviewers agreed that the two main contributions, the dataset and the boundary detection algorithm, were valuable. Outside of that, reviews were mixed: initial reviews ranged from weak reject to strong accept, and each reviewer's constructive comments touched on different parts of the paper.

Our average recommendation is to accept the paper. If it is accepted, the authors will find valuable suggestions among all of the reviews on how to improve each section of their paper, and on what aspects of the work should be defended better.

Review 1:

Q2 (I am an expert on the topic of the paper.)

Strongly agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Strongly agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Strongly Agree (Very novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The dataset is highly useful for future research in structure analysis.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

This paper presents a new model for section boundary detection for symbolic music and a new dataset.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper presents a new model for section boundary detection for symbolic music. Besides the model implementation and data representation method, the paper also introduces a dataset built by filtering songs from LMD. The dataset preparation process itself is highly novel, carefully and manually curated, and described in detail. As far as I know, no other symbolic structure dataset of this size and quality exists. I also checked the validity of some annotations myself, and I can say that the dataset is highly useful for future work and is a big bonus for the paper.

The methodology of the paper is also highly novel. One question: I think the whole of Section 3.1 aims to represent the symbolic music in an audio-like format. In this case, a pretrained audio model might be helpful, since the downstream data format is close to the pretraining format. But in Section 3.2 the authors say the model is pretrained on ImageNet instead of audio, which is highly unexpected (but also gives interesting results). I wonder why the authors chose this method and whether they have tried pretraining on audio and fine-tuning on symbolic music.

The experiments are generally sound and the results are very promising. One limitation is that results on other datasets (like RWC Pop, which has audio-aligned MIDI scores plus structure annotations) are not reported. I suggest modifying or replacing Table 3 with results on other datasets.

Review 2:

Q2 (I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Disagree

Q5 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))

The paper cites relevant work, but there is a lot to improve in how the citations are referred to in the manuscript. When referring to a citation, it would be useful to say a bit more about what exactly in the cited paper is being referred to. Some examples: 59: please summarize the main observations and conclusion relating to the figure; without this, the manuscript is not self-contained. 119-120: please give more information about the cited works. 225-226: please briefly describe that method.

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly disagree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Disagree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Disagree

Q15 (Please explain your assessment of reusable insights in the paper.)

A concrete example is the conclusion in 393-397, which would have been more informative if it explained where each of the models stands out compared to the rest, such that their combination yields improvements even though the component models themselves sometimes perform inadequately. It is not clear what the range of styles is in the training or test splits, or whether some files are clear outliers.

More generally, however, the paper would benefit from substantial writing improvements (more details are provided in the main review text field). If that is addressed in future iterations, along with further qualitative analysis of the results, the paper would make a much stronger contribution.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

When symbolic data is available, the proposed CNN approach for music structure boundary detection performs better than both an analogous CNN approach in audio, and the typically used unsupervised methods.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak reject

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper concerns section boundary detection for symbolic music, which can be considered as an early step in music structure analysis. Although most prior work for this task uses audio data, the authors see merit in improving methods that solve this problem in the symbolic domain because: a. According to the authors, audio approaches do not best exploit representations of pitch, timing and instrumentation, so it is better to conduct structure analysis of symbolic data in symbolic form rather than to synthesize the files and conduct the analysis in audio. b. They are further motivated due to the potential impact of structural boundary information on quality of the symbolic music generation results, but conducting an experiment to verify this aspect is left as future work.

To achieve these goals, the authors:

  a) Rely on existing metadata in LMD to create a new annotated dataset for section boundary detection. The authors manually verify the annotations by visual inspection.
  b) Train several models to classify the presence of section boundaries within fixed-length windows.
     i. Create 4 training setups and compare their results, plus the result of an ensemble of them all together.
     ii. One of their training setups involves encoding MIDI instrumentations into a 3-channel piano roll, based on overtone relationships.
  c) Compare with an audio baseline by synthesizing their MIDI evaluation set and conducting a parameter search.

They find that their method improves on the audio baseline.

Overall, the work is interesting and has the potential to become a more solid contribution, but the writing structure and the organization of ideas (not the language) need substantial improvement.

For example, a large part of the introduction is devoted to the potential positive impact of symbolic section boundary detection on symbolic music generation systems, whereas verifying this is left as future work. I agree that this aspect is an important motivation, but it takes up too much space in the introduction given that it is not a core part of the work presented. Perhaps just a hint of this should be in the introduction, and much of it can be moved to the discussion section.

Then, the related work starts by mentioning the best-performing system and the related tolerance, without first explaining how boundary detection is even evaluated. Even before that, it would be more readable to explain to readers (who might not yet be very familiar with structure analysis or boundary detection) the difference between hierarchical and non-hierarchical approaches, and between supervised and unsupervised approaches.
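For readers unfamiliar with the evaluation, boundary detection is commonly scored with a hit-rate F-measure: an estimated boundary counts as a hit if it falls within a tolerance of an unmatched reference boundary. The sketch below is a simplified illustration under that assumption, not the paper's evaluation code; the function name and the example numbers are hypothetical.

```python
def boundary_f_measure(reference, estimated, tolerance):
    """Precision, recall, and F-measure for boundaries matched within +/- tolerance.

    Each reference boundary can be matched to at most one estimate and vice versa.
    All arguments share the same unit (for example seconds or bars).
    """
    ref, est = sorted(reference), sorted(estimated)
    i = j = hits = 0
    # Two-pointer sweep: match the earliest unmatched pair that is close enough.
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= tolerance:
            hits += 1
            i += 1
            j += 1
        elif est[j] < ref[i]:
            j += 1  # this estimate is too early to match any remaining reference
        else:
            i += 1  # this reference boundary was missed
    precision = hits / len(est) if est else 0.0
    recall = hits / len(ref) if ref else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f

# Two of three estimates fall within 0.5 of a reference boundary.
print(boundary_f_measure([10, 20, 30], [10.3, 25, 30.4], tolerance=0.5))
```

Because the score depends strongly on the tolerance, a reported F-measure is only interpretable once the tolerance has been stated, which is the ordering issue raised above.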

Another point which is important but not clearly articulated in the introduction and the abstract is the source of the data; the fact that the annotations are based on metadata in LMD that was verified by the authors, and what the motivation was to search through LMD in the first place (which seems to be mentioned in 87 - 103), should be clearer at the start.

Other comments:

  • 104 - 110 seems to be yet another motivation for solving this task in the symbolic domain. Perhaps it is better suited to the introduction, or could simply be left out of the related work section.
  • 119 - 120: the manuscript does not discuss related work thoroughly enough.
  • 245: what is meant by ‘appeared to be a valid segmentation’? I believe this was done by comparing the ratios and so on, but perhaps something more rigorous is needed.
  • 373 - 374: please explain what the output is and how the peak picking method works.

Although I believe that there could be more insights than what is currently expressed in the paper just by an improvement of the writing structure and an extension to the analysis, I have chosen to reject the paper because in its current form I don't think it is ready to be published yet.

Review 3:

Q2 (I am an expert on the topic of the paper.)

Disagree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

As the authors stated, determining segment boundaries in symbolic music has potential in music generation.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

The task of section boundary detection is considerably more straightforward when applied to symbolic data compared to audio, due to the reduced complexity and higher structural clarity inherent in symbolic representations.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The authors present an approach for section-boundary detection in symbolic music. In general, the paper is well detailed and clear in the approach. I have some minor comments/questions:

  • The authors, inspired by audio spectrograms, add a series of overtones to each note. This is explained with an example in lines 162-172, but was this consistent for all the training data? Were different amounts and combinations of overtones tested?
  • Similarly, please detail the linear decay applied to the overtones.
  • Disregarding boundaries that fall within 16 bars of the first and last note seems to exclude a considerable number of bars. Admittedly, many MSA papers avoid giving details about edge cases, so it is appreciated that the authors of this paper are open about this. However, I see an excellent opportunity to show this effect in either the ablation study or in a separate experiment. Identifying section boundaries "in the middle of a song" is useful, but it also seems to omit an important aspect of structure analysis. The authors briefly mention that this issue "can be addressed in future work", which is understandable. Nevertheless, including these edge cases in the accuracy metrics could help to assess the results of the proposed method.
  • Lines 369-371: if no peak picking was chosen as a first approach and several consecutive frames exceed the threshold, which one determines the start/end of the section?
  • Lines 428-431: what do the authors mean by oversampling positive examples and undersampling negative examples? Is this target smearing and weighting?
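The question about lines 369-371 can be illustrated with a small sketch. It contrasts plain thresholding, which can flag several consecutive frames for a single boundary, with one simple peak-picking rule that keeps only the highest-scoring frame in each run. The rule, the function names, and the probability values are assumptions for illustration and are not taken from the paper.

```python
def threshold_frames(probs, threshold):
    """Indices of all frames whose boundary probability reaches the threshold."""
    return [i for i, p in enumerate(probs) if p >= threshold]

def pick_peaks(probs, threshold):
    """Keep one frame per run of above-threshold frames: the one with the highest probability.

    Ties within a run are resolved in favour of the earliest frame.
    """
    peaks = []
    run_best = None  # index of the best frame in the current run, if any
    for i, p in enumerate(probs):
        if p >= threshold:
            if run_best is None or p > probs[run_best]:
                run_best = i
        elif run_best is not None:
            peaks.append(run_best)  # the run just ended
            run_best = None
    if run_best is not None:
        peaks.append(run_best)      # a run that extends to the final frame
    return peaks

probs = [0.1, 0.6, 0.9, 0.7, 0.2, 0.1, 0.8, 0.3]
print(threshold_frames(probs, 0.5))  # several adjacent frames around one boundary
print(pick_peaks(probs, 0.5))        # one frame per boundary candidate
```

Without a rule of this kind, each frame in a run would count as a separate predicted boundary, which inflates false positives under a hit-rate metric.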