P6-13: Simple and Effective Semantic Song Segmentation

Filip Korzeniowski, Richard Vogl

Subjects: Open Review; Novel datasets and use cases; Musical features and properties; Evaluation, datasets, and reproducibility; Structure, segmentation, and form

Presented In-person

4-minute short-format presentation

Abstract:

We propose a simple, yet effective approach to semantic song segmentation. Our model is a convolutional neural network trained to jointly predict frame-wise boundary activation functions and segment label probabilities. The input features consist of a log-magnitude log-frequency spectrogram and self-similarity lag matrices, combining modern deep learning approaches with hand-crafted features.
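
For readers unfamiliar with the lag-based input, here is a minimal editorial sketch (not the authors' implementation; the frame features and maximum lag are hypothetical choices) of how a self-similarity lag matrix can be derived from frame-wise features:

```python
import numpy as np

def self_similarity_lag(features: np.ndarray, max_lag: int) -> np.ndarray:
    """Toy sketch: `features` has shape (n_frames, n_dims).
    Returns an (n_frames, max_lag) matrix whose entry [t, l] is the cosine
    similarity between frame t and frame t - (l + 1); early frames are zero-padded."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    n_frames = f.shape[0]
    lag = np.zeros((n_frames, max_lag), dtype=np.float32)
    for l in range(1, max_lag + 1):
        # Similarity of each frame with the frame `l` steps earlier.
        lag[l:, l - 1] = np.sum(f[l:] * f[:-l], axis=1)
    return lag
```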

To evaluate our approach, we first examine commonly used datasets and find substantial overlap (up to 22%) between training and testing sets (SALAMI vs. RWC-Pop). As this overlap invalidates meaningful comparisons, we propose using the previously unexplored McGill Billboard dataset for testing. We carefully eliminate duplicate entries between McGill Billboard and other datasets through both audio fingerprinting and string-matching of song titles and artist names. Using the resulting set of 719 tracks, we demonstrate the effectiveness of our approach.
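
As a rough illustration of the string-matching part of such deduplication (the audio-fingerprinting part is not sketched, and the normalisation and 0.9 threshold below are illustrative assumptions rather than the authors' actual procedure), a standard-library version might look like this:

```python
import difflib

def normalise(s: str) -> str:
    # Lowercase and drop punctuation so that minor spelling variants still match.
    return " ".join("".join(c.lower() for c in s if c.isalnum() or c.isspace()).split())

def likely_duplicate(a: tuple[str, str], b: tuple[str, str], threshold: float = 0.9) -> bool:
    """a and b are (artist, title) pairs; flag them as duplicates when both
    normalised fields are highly similar."""
    artist_sim = difflib.SequenceMatcher(None, normalise(a[0]), normalise(b[0])).ratio()
    title_sim = difflib.SequenceMatcher(None, normalise(a[1]), normalise(b[1])).ratio()
    return artist_sim >= threshold and title_sim >= threshold

print(likely_duplicate(("The Beatles", "Help!"), ("The Beatles", "Help")))  # True
```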

Meta Review:

Q2 ( I am an expert on the topic of the paper.)

Strongly agree

Q3 ( The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Disagree

Q5 ( Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))

Much previous work is cited. However, the connection between this work and previous work is not clear. Sections 1 and 2 do not hint at the need for future work on MSA. A brief contrast with this work is given in lines 143–153, which states a desire to diverge from "current trends in deep learning". If this is the main motivation for the design of the proposed model, I think an explanation of these trends — why they are popular, what are some examples, and how the trajectory of these trends has soared or fizzled in other MIR tasks — is deserved.

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q10 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chose, otherwise write "n/a"))

The explanations are all very clear. The critiques of earlier evaluations and recommendations for future ones are clear, although I am not sure that the evaluation conducted here resolves all the issues. While the comparison using the McGill Billboard dataset seems fair (Table 4), the comparison in Table 3 seems unfair, given that the proposed algorithm was tested using cross-validation, whereas the competing algorithms are all tested in a cross-dataset scheme. It is true that the "train-test overlap ... could lead to inflated results" (line 419–20); but it also seems true that training within a dataset could inflate results compared to a cross-dataset scenario.

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Disagree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Strongly Disagree (Well-explored topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The paper sets a good example for evaluation in MSA in several respects: it emphasises the importance of using 'trimmed' annotations (i.e., not including trivial 'begin' and 'end' tokens in the evaluation) and it points out that the overlapping datasets compromise cross-dataset evaluation. These are valuable recommendations that may be known by others in the field (mir_eval has a 'trimmed' setting, and the overlap between SALAMI, RWC and Isophonics is intentional) but are still worth committing to the proceedings.

Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)

MSA evaluation should be conducted correctly and reported carefully, and cross-dataset performance of MSA is poorer than within-dataset performance.

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Weak accept

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper proposes a new algorithm for music structure analysis (MSA) with a simple architecture, combining different implementation tricks from previous work. The paper describes some previous MSA evaluations as sloppy and proposes a more rigorous approach; the authors also identify a new dataset for MSA evaluation: the McGill Billboard dataset (MBD).

The paper is clear and well written. The high level of detail in the algorithm description will be appreciated by anyone wanting to replicate this work. The tips for making sure evaluations are correct and rigorous are important and valid critiques of previous work, and the MBD is a good dataset to include in future work. I think this paper merits acceptance based on this.

That said, I think each of the 3 main contributions (algorithm, evaluation, choice of dataset) from the authors could be made clearer in an updated version.

Regarding the first contribution: I trust that the new pipeline was built correctly, and the evaluation with the MBD suggests it is a new state of the art. But in the explanation of the proposed design (Section 3), I did not find reusable insights. This is because the connection to the existing literature is unclear, so for each chosen element in the pipeline, I am not sure what the other options were, or why any particular option was chosen. An ablation study would have been interesting, since there are a few points where it is clear that preliminary studies were used to fine-tune the process (e.g., line 202: "we found that this does not further improve performance", and the footnote at line 243: "we found in preliminary experiments that individual probabilities for each segment label work better in practice").

One motivation for the proposed design is clear from the title: simplicity. The introduction says the proposal is "a simple, yet effective approach". If the simplicity of the algorithm is part of its value, why is it not clearly contrasted with the complexity of previous work? Also, by what standard is it "simple"? The level of detail in Section 3 suggests it is a sophisticated algorithm relying on many clever submodules and implementation tricks. I would have enjoyed a discussion of how other systems and training methods are needlessly complex, in what way the proposed method is simple, and why simplicity is desirable for the task at hand. For example, simplicity may come from a clear musical intuition underlying a method. This paper does not discuss the musicality of the problem.

Regarding the 2nd contribution, the paper sets a good example for evaluation in MSA in several respects: it emphasises the importance of using 'trimmed' annotations (i.e., not including trivial 'begin' and 'end' tokens in the evaluation) and it points out that the overlapping datasets compromise cross-dataset evaluation. These are valuable insights, worth repeating, but they are not strictly new: mir_eval has a 'trimmed' setting because it is known among MSA researchers that this makes a big impact. Different authors make different choices about the parameter, but I hope that within any single paper the authors make consistent choices, so that they are comparing apples to apples. And regarding the overlap between datasets: this overlap is known, and it is by design! The original SALAMI paper mentions that data from RWC and Isophonics was deliberately included for comparison. Still, it is good to point out that this affects the value of the evaluations, and this overlap is evidently not common knowledge.
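
For concreteness, mir_eval's boundary-detection metric exposes this as a `trim` keyword; the toy intervals below are invented, but they show how the same estimate is scored with and without ignoring the trivial first and last boundaries.

```python
import numpy as np
import mir_eval

# Toy reference and estimated segmentations as (start, end) intervals in seconds.
ref = np.array([[0.0, 10.0], [10.0, 25.0], [25.0, 40.0]])
est = np.array([[0.0, 10.2], [10.2, 26.0], [26.0, 40.0]])

# trim=False counts the boundaries at 0 s and at the track end as hits;
# trim=True discards them, which typically lowers the reported hit rate.
full = mir_eval.segment.detection(ref, est, window=0.5, trim=False)
trimmed = mir_eval.segment.detection(ref, est, window=0.5, trim=True)
print(full, trimmed)  # (precision, recall, F-measure) for each setting
```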

However, while the comparison using the McGill Billboard dataset seems fair (Table 4), the comparison in Table 3 seems unfair, given that the proposed algorithm was tested using cross-validation, whereas the competing algorithms are all tested in a cross-dataset scheme (albeit with small amounts of overlap between test and train sets). The paper argues earlier (Section 4.1), persuasively, that cross-dataset evaluation may lead to underperformance, since the datasets differ greatly. It sounds like training on all the datasets might lead to more robust methods. Given this, why devote Table 3 to comparing the proposed method (CV-8) with so many methods trained in a cross-dataset (CD) way? Given the CD results from other works, the apples-to-apples comparison would be to also train the proposed system in the (unrecommended) CD way and report these results. This would seem more fair, and also would not detract from SOTA findings shown in Table 3 (top section: Harmonix) and Table 4.

Put another way: in an evaluation, I am most interested in the comparison of methods, not the absolute performance achieved. So, comparing apples to apples (on a dataset with 3% leakage) is better than comparing apples to oranges (on datasets with no leakage).
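
To make the two protocols being compared concrete, here is a schematic sketch with made-up track identifiers (an editorial illustration, not the paper's experimental code): the "CV-8" setting draws train and test folds from the same corpus, while the cross-dataset ("CD") setting trains on one corpus and tests on a disjoint, deduplicated one.

```python
from sklearn.model_selection import KFold

# Made-up track identifiers standing in for two corpora.
harmonix = [f"hmx_{i:02d}" for i in range(16)]
billboard = [f"mbd_{i:02d}" for i in range(16)]

# Within-dataset 8-fold cross-validation: each test fold comes from the
# same corpus that supplied the training folds.
for train_idx, test_idx in KFold(n_splits=8, shuffle=True, random_state=0).split(harmonix):
    train_tracks = [harmonix[i] for i in train_idx]
    test_tracks = [harmonix[i] for i in test_idx]
    # train and evaluate on these splits

# Cross-dataset setting: train on one corpus, test on another,
# after duplicates between the two have been removed.
cd_train, cd_test = harmonix, billboard
```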

Finally, regarding that 3rd contribution: given that this is the first use of MBD for evaluating structure, perhaps the paper should say a bit more about the creation of this dataset: how were the structure labels annotated, and how does it differ from SALAMI and Harmonix? This information is available from the respective papers about each, but since the contribution is highlighted in the introduction, I expected more commentary on it in the results section.

Post-discussion comment: the paper does not discuss the provenance of the audio data. How was the audio accessed? If the MBD audio was not provided by the original dataset owners, how was the alignment between audio and annotations verified?

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Accept

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

The initial ratings that the reviewers gave for this paper ranged from strong reject to strong accept, with an average rating slightly favouring acceptance.

Some aspects that the reviewers agreed on:

1. The model architecture is interesting and the explanation of the method was clear (R1, R3, MR).
2. The discussion of evaluation integrity in music structure analysis (MSA) is insightful (R1, R2, R3, MR).
3. The evaluation makes unfair comparisons between algorithms, and does not perform any ablations on the proposed method (R2, MR, and R1 post-discussion).
4. The title states that the method is "simple" (and the text says "lightweight"), but it is not clear by what standard this is claimed.

Outside of this, the reviewers had a variety of suggestions for how to improve the paper. In particular, R2 wrote that the tone of the paper seemed unfairly dismissive of previous work. The way that point 3 undercuts point 2 (in my list above) may contribute to this.

Overall, we lean towards accepting this paper for the clear contributions outlined above. However, we strongly recommend that the authors revise their work to give clearer justifications for the choices made in the algorithm design, training procedure and evaluation, and to ensure that no claims or critiques in the paper are overstated.

Review 1:

Q2 ( I am an expert on the topic of the paper.)

Strongly agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q5 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))

The article indeed discusses and cites the most relevant papers.

The mir_eval package [Raf, 2014] is not referenced, and should be.

I think that paper [ME, 2014] should be cited among the "historical" models of MSA, because it is still one of the best-performing and most influential non-deep-learning models, but I leave that to the authors to decide.

The authors mention that "no [downbeat] alignment is evident in SALAMI" in Section 4.1, page 4. The authors can refer to [MCB, 2023] for a comparison between scores with downbeat-aligned and non-aligned annotations (Table 9), which shows very few differences, suggesting that most of the annotations are in fact downbeat-aligned.

[Raf, 2014] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. W. Ellis, "mir_eval: A Transparent Implementation of Common MIR Metrics," Proceedings of the 15th International Conference on Music Information Retrieval, 2014.

[ME, 2014] B. McFee and D. Ellis, "Analyzing Song Structure with Spectral Clustering," Proceedings of the 15th International Conference on Music Information Retrieval, 2014.

[MCB, 2023] A. Marmoret, J. E. Cohen, and F. Bimbot, "Barwise Music Structure Analysis with the Correlation Block-Matching Segmentation Algorithm," Transactions of the International Society for Music Information Retrieval, 6(1), 2023.

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q10 (Please justify the previous choice (Required if "Strongly Disagree" or "Disagree" is chosen, otherwise write "n/a"))

In addition to proposing a novel, properly evaluated system, the authors highlight existing problems in evaluating such systems: overlap between train and test datasets, inconsistency in "trimming" boundaries, and the use of different annotations for the Beatles dataset (NB: the same point could be made for RWC Pop). I particularly liked the authors' effort to benchmark their model as transparently as possible to ensure a fair comparison. It surely took time, hence the emphasis.

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Disagree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

Firstly, it seems that the authors do not plan to release their model as open-source. I want to encourage the authors to do so to help future researchers and ensure reproducibility. Secondly, the authors tried to keep their models as "simple" as possible. I do not agree with the "lightweight" statement (see general comment), but I agree that the architecture is quite simple and does not resort to complicated tricks. On a side note, I would conversely say that the training procedure is quite complicated and seems ad hoc but is, in fact, very detailed. Thirdly, the experiments are well-detailed and allow for reproduction. Globally, the model and the experiments are quite well-detailed.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

This paper presents a novel deep learning model for semantic structural segmentation, associated with more transparent experiments than the literature.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

  • General comment: The paper introduces a convolutional neural network (CNN)-based approach for "semantic song segmentation," which utilizes hand-crafted features as input (namely a log mel-like representation and self-similarity lag matrices). The authors address key challenges in music structure analysis (MSA), such as dataset overlap and inconsistent evaluation metrics, and propose a robust framework for segmenting songs into "semantic" sections, i.e., functionally labeled sections (intro, chorus, verse, etc.). The paper is well-written, and the problem is clearly motivated, making it a valuable contribution to the field.

  • Strengths:

  • The paper is well-written, with clear explanations of the methodology, datasets, and evaluation metrics.
  • The authors address key issues in MSA evaluation, providing a more reliable benchmark for future research.
  • The authors propose a novel model with SOTA performance.
  • The model in itself is quite simple, with relatively few different elements.

  • Weaknesses:

  • The training routine is very complicated and seems very ad hoc.
  • The model is said to be "lightweight", but it does not seem lightweight to me: 9 convolutional layers as front-ends (3 layers per front-end), followed by 11 blocks composed of both convolutional and dense layers, and 2 final dense layers. This may be few compared to the recent standard in the literature, but I feel it is far more than in the compared models, e.g. (using references from the paper, and not limited to) [9, 13, 14, 15, 16]. NB: I deliberately restricted this list to convolutional models, because comparing convolutional and attention layers does not seem fair in terms of parameters vs. data required for training.

  • Comments:

  • The authors propose to process music audio signals using log-frequency log-magnitude spectrograms. How do the log-frequency triangular filters relate to mel filters?
  • I do not understand why the authors state that "60 channels [are] reduced to 30 using 1x1 convolution" (lines 217-218, Section 3.2): shouldn't a 1x1 convolution result in outputs of the same shape? (See the sketch after this list.)
  • Lines 284-288: Maybe I misunderstood, but, to the best of my understanding, when several annotations are available, the authors use the same data several times in training, once per annotation. It feels to me that this compromises the principle of learning, where the same data should be associated with only one ground truth (how would error backpropagation make sense otherwise?).
  • Lines 318-319: The authors state that their model should work with a wide variety of music but evaluate it on (mostly popular) Western music. In my opinion, a "wide variety of music" should include far more than the proposed datasets. I suggest the authors tone down or rephrase that sentence.
  • The authors use only the 0.5 s tolerance, while the standard is to report both the 0.5 s and 3 s tolerances. Why this choice, and why not use both tolerances?
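
On the 1x1-convolution question above, the following minimal PyTorch sketch (an editorial illustration with made-up tensor sizes, not the paper's code) shows that a 1x1 convolution keeps the time-frequency shape while changing the channel count, which is presumably what the quoted passage refers to:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 60, 81, 500)                      # (batch, channels=60, freq, time)
reduce = nn.Conv2d(in_channels=60, out_channels=30, kernel_size=1)
print(reduce(x).shape)                               # torch.Size([1, 30, 81, 500])
```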

Overall, I found the paper sound, and would recommend it for publication.

Review 2:

Q2 ( I am an expert on the topic of the paper.)

Strongly agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly disagree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q5 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))

I strongly disagree that the title adequately reflects the content of the paper. The title, “Simple and Effective Semantic Song Segmentation,” is somewhat misleading. A substantial portion of the paper focuses on addressing dataset issues, inconsistencies in experimental protocols, and critiques of prior work, which are topics that are not reflected in the title. These discussions are valuable but deserve explicit mention to better represent the paper’s broader scope. Additionally, while the title suggests a technically straightforward solution, the proposed method involves several non-trivial design choices that are neither thoroughly justified nor systematically ablated. The title should better reflect the paper’s emphasis on experimental rigor and evaluation methodology, rather than suggesting simplicity in technical design.

The use of the term “semantic” in the context of song segmentation is unclear. It is not evident whether “semantic song segmentation” is a standard term in prior literature.

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly disagree

Q10 (Please justify the previous choice (Required if "Strongly Disagree" or "Disagree" is chosen, otherwise write "n/a"))

If the paper's primary aim (according to the title) is to propose a method for music structure analysis (MSA), it should include a more in-depth discussion and justification of the technical design choices. The core components, such as the model architecture and input representations, are not novel. I am particularly interested in understanding how the TCN-based model achieves such strong performance. Is there empirical or theoretical evidence suggesting that TCNs are especially well-suited for MSA, or for small datasets in particular? In other fields like NLP and speech processing, Transformer-based architectures have generally outperformed TCNs.

If one of the central arguments is that experimental protocols are a major determinant of performance, it would be valuable to include a Transformer-based baseline (which is relatively straightforward to implement today) and evaluate it using standard input features, common training strategies, and simple post-processing on your corrected datasets. Given the context discussed above, I remain unconvinced that the proposed "simple" method is sufficiently "effective" to justify its claim. Are the improvements primarily due to architectural adaptations such as self-similarity lag features? Or do they stem from training strategies like stochastic weight averaging, advanced optimization techniques and learning rate scheduling, or optimized post-processing techniques? Other than that, the receptive field size of the model and its ability to leverage longer input sequences to capture long-range dependencies are not clearly specified, yet these could play a major role in performance. These factors might introduce hidden advantages over existing methods, and their impact is not adequately addressed in the paper.
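
Since the receptive field is raised above as a possibly hidden advantage, here is a generic back-of-the-envelope calculation for a stack of dilated convolutions (the kernel size and dilation schedule are hypothetical; the paper's exact configuration is not restated in this review):

```python
def receptive_field(kernel_size: int, dilations: list[int]) -> int:
    """Receptive field (in frames) of stacked 1-D dilated convolutions:
    each layer with dilation d adds (kernel_size - 1) * d frames of context."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Example: 11 blocks with doubling dilations and kernel size 3 (illustrative only).
print(receptive_field(3, [2 ** i for i in range(11)]))  # 4095 frames
```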

To convincingly support the claim that the method is both "simple" and "effective," more extensive comparative studies and ablation experiments are necessary. These would help isolate which aspects of the approach contribute most to its success and whether it offers meaningful advantages over existing methods, especially given the paper’s emphasis on experimental rigor.

Another major important issue concerns the comparison presented in Table 3 between 8-fold cross-validation results and cross-dataset evaluations. While the paper itself argues that different datasets exhibit substantial variation in audio content and annotation guidelines (a key challenge in MSA), it remains unclear whether such a comparison is meaningful or informative. For instance, to my understanding, the "instrumental" sections in RWC-Pop were labeled as "bridge" due to guideline differences, which could significantly affect the reported results. Rather than providing insight into the model’s ability to generalize, this comparison primarily reveals the model’s tendency to overfit to specific datasets, particularly smaller ones.

Moreover, the paper does not clarify how much performance gain is attributable to using 8-fold cross-validation as opposed to 4-fold on Harmonix in Table 3. While it is generally expected that more training data in 8-fold CV would improve performance, this makes direct comparison with 4-fold CV results problematic. At minimum, a controlled analysis is needed to disentangle the effects of fold count and dataset size on the reported outcomes.

Robustness across datasets is a critical aspect of MSA, particularly given the limited availability of annotated data. It would be more valuable to investigate whether training on mixed datasets degrades performance on individual target domains, and to explore strategies for developing models that can generalize despite inconsistencies in annotation standards. Such analysis would contribute to making better use of the cumulative efforts in dataset creation and help advance the field toward more generalizable and scalable solutions.

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Strongly disagree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly disagree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Strongly Disagree (Well-explored topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Disagree

Q15 (Please explain your assessment of reusable insights in the paper.)

As mentioned earlier, I remain unconvinced that the TCN architecture, in isolation and without the influence of other contributing factors, offers better generalization than existing models in this context. While direct comparisons between TCNs and Transformers for MSA are scarce, related work in beat tracking (a similarly time-dependent task) suggests otherwise. For example, "Beat Transformer: Demixed Beat and Downbeat Tracking with Dilated Self-Attention" demonstrates that Transformers outperform TCNs under comparable experimental conditions.

Given that no new Transformer-based experiments are presented here, the most convincing way to support the method’s effectiveness and facilitate reuse would be to release the training code and experimental setup. However, the paper provides no indication of plans to do so. Additionally, as noted above, several potentially influential factors, such as receptive field size, training techniques, or post-processing strategies, are not clearly ablated. Sharing and analyzing these would offer valuable insights to the community.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

The paper highlights evaluation inconsistencies in music structure analysis and introduces a TCN-based model with strong results, but lacks in-depth analysis to clarify which factors drive its performance.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong reject

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

Overall, the major scope of this work remains unclear. On one hand, the proposed method lacks sufficient ablation studies to identify which components contribute to its performance gains. On the other hand, if the paper aims to guide the field toward more rigorous evaluation practices, it would benefit from broader comparisons with existing approaches under the proposed controlled environment to generate deeper insights.

I do see value in the paper’s emphasis on evaluation methodology, which is timely and useful for the community. I encourage the authors to consider reshaping the paper’s focus toward evaluation and reproducibility, and to resubmit it to a more appropriate track at ISMIR or relevant conferences. For broader impact, it would be highly beneficial to open-source the experimental protocols, allowing future work to build upon and benchmark against a common, well-defined setup.

Another concern relates to the writing style of the paper. While it is both valid and important to highlight limitations or issues in previous evaluations, presenting these observations in a more constructive and collaborative tone would help foster a more positive and productive dialogue within the research community. Acknowledging the inherent challenges of working with dataset limitations in prior work could further strengthen the paper’s message, framing its critique as part of a collective effort to advance reproducibility and methodological rigor in the field.

Other Concerns:

- Table 1, Convolutional Front-End: It is unclear how the dimensionality is reduced from 81 to 1 using three layers of (3, 1) max-pooling. Based on the pooling configuration, this setup appears to reduce the dimension by a factor of 27 (i.e., 3 × 3 × 3), not 81 (see the sketch after this list).
- Post-Processing: The contribution of the optimized post-processing method to overall performance remains unclear. Post-processing is often dataset-dependent and, in some cases, can affect evaluation scores by more than 10%. Quantifying its impact would help readers better understand the relative contributions of the model and the post-processing step.
- Table 2: It is not clear whether the addition of the "solo" label affects performance. Labels such as "impro", "interlude" and "guitars" may implicitly include solo sections.
- Section 5.1, Line 409: The statement that "the effect of using additional training data is not straightforward to assess" is somewhat unclear. If the benefit of extra training data is uncertain, it raises the question of why mixing datasets was used in this paper.
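
The pooling arithmetic questioned in the first item can be checked with a short PyTorch snippet (tensor sizes are illustrative): three (3, 1) max-pooling layers reduce the frequency axis by a factor of 27, so 81 bins come out as 3, not 1.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 81, 100)       # (batch, channels, freq=81, time)
pool = nn.MaxPool2d(kernel_size=(3, 1))
for _ in range(3):
    x = pool(x)
print(x.shape)                        # torch.Size([1, 16, 3, 100])
```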

Review 3:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The authors identify and criticize evaluation inaccuracies of the MSA literature, which sheds light on the effectiveness of some of these approaches.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Efficiently finding structure in popular music does not require large models or self-supervised approaches.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Strongly agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The authors present a simple architecture for music structure analysis. The paper is well written and its methodology is explained in detail. Although the approach does not present any novel idea in terms of feature implementation, model architecture, or post-processing, it offers an elegant and efficient solution to a well-known problem. The previous work is appropriately described and referenced. In my view, the primary strength of this paper lies in its critical examination of the evaluation methodologies used in previous studies.

A minor criticism of this work is the absence of the pairwise frame clustering (PFC) metric in the reported results. Given that including this metric would require minimal additional effort, I recommend adding it to provide a more transparent evaluation.