P4-7: Lose the Frames: Event-Based Metrics for Efficient Music Structure Analysis Evaluations

Qingyang Xi, Brian McFee

Subjects: Open Review ; Musical features and properties ; Reproducibility ; Evaluation, datasets, and reproducibility ; Structure, segmentation, and form ; Evaluation metrics

Presented In-person

4-minute short-format presentation

Abstract:

Many evaluation metrics in Music Information Retrieval (MIR) rely on uniform time sampling of phenomena that unfold over time. While uniform sampling is suitable for continuously varying concepts such as pitch or dynamic envelope, it is suboptimal for inherently discrete or piecewise constant events, such as labeled segments. Current Music Structure Analysis (MSA) metrics for label evaluation are all implemented with time sampling, which can be inexact and inefficient. In this work, we propose event-based implementations of the three most widely used MSA metrics. Our approach yields evaluations that are more accurate, more computationally efficient, and more reproducible, streamlining MSA research workflows.
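
To make the abstract's "event-based" idea concrete for readers of this page, one way the pairwise clustering metric can be written in closed form over label-overlap durations rather than frame samples is sketched below. This is an illustrative formulation only, not necessarily the exact one used in the paper; the symbols $m_{ab}$, $P$, $R$, and $F$ are introduced here for the sketch.

    % Illustrative continuous formulation (not necessarily the paper's):
    % m_{ab} = total duration on which the reference assigns label a
    %          while the estimate assigns label b.
    \[
      P = \frac{\sum_{a,b} m_{ab}^{2}}{\sum_{b}\bigl(\sum_{a} m_{ab}\bigr)^{2}},
      \qquad
      R = \frac{\sum_{a,b} m_{ab}^{2}}{\sum_{a}\bigl(\sum_{b} m_{ab}\bigr)^{2}},
      \qquad
      F = \frac{2PR}{P + R}.
    \]
    % The score depends only on the overlap durations m_{ab}, which can be
    % computed exactly from the segment boundaries, with no frame grid.

Because these quantities depend only on finitely many overlap durations, they can be computed exactly from the annotated boundaries instead of being approximated on a sampling grid.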

Meta Review:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 ( The title and abstract reflect the content of the paper.)

Disagree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Disagree

Q5 ( Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))

The related work discussion seems somewhat unbalanced. I was surprised to see the reference to Holzapfel et al. in there, while questions of sustainability and responsibility seem useful extra features rather than the true experimental core of the work (for this same reason, I feel the claim of 'more responsible' evaluation in the title is a bit overblown, and the title would be more concise and on-point without it). At the same time, the related work section relates only sparsely to work on music structure analysis, which is the actual task of focus in the work. A more proper introduction of this task and of common evaluation strategies or challenges would therefore have been appropriate; instead, the section discusses another task (SED) that would also suffer from frame-based evaluation issues. Generalizability to different tasks is not the core focus of the paper at this point, however, so that part of the discussion would fit better in a future work section. Also note some inconsistencies in wording: here, 'music segmentation' is referred to as a task, whereas elsewhere in the article this seems to be referred to as 'music structure analysis'.

If the authors ever wish to make a broader argument on works in the MIR field that raise issues with audio (in)stability or ambiguity of signal handling, several broader relevant references include:

  • Urbano et al., 'What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Features?', ISMIR 2014
  • Sturm, 'A Simple Method to Determine if a Music Information Retrieval System is a “Horse”', TMM 2014
  • McFee et al., 'Open-Source Practices for Music Signal Processing Research: Recommendations for Transparent, Sustainable, and Reproducible Audio Research', IEEE SPM 2019
  • Liem & Mostert, 'Can’t trust the feeling? How open data reveals unexpected behavior of high-level music descriptors', ISMIR 2020

I do not insist these works are cited now, but they may be useful as part of a larger review on current evaluation issues and the importance of implementation transparency.

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Disagree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The paper raises awareness of numerical errors/instabilities in sampled audio evaluation for music structure analysis. I can imagine this is something that may inspire others working with and evaluating on audio data, and the authors already give some first directions on this.

Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)

Exact implementations of music structure analysis evaluation measures are more efficient to compute, while giving more robust evaluation results.

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Weak accept

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper proposes taking a continuous perspective on several commonly used metrics in music structure analysis evaluation, and compares the proposed approach and implementation against the corresponding implementations in mir_eval, which perform frame-based sampling.

At points, the authors seem somewhat inconsistent in wording; apart from the earlier-mentioned 'music segmentation' vs. 'music structure analysis', there are mixed references to 'uniform sampling' and 'frame-based sampling' where these appear to refer to the same concept.

Overall, the authors do come up with an original idea and empirically show that their method is more computationally efficient and more robust. I must say that the instability introduced by frame-based methods seems to yield a reasonably consistent and correctable bias, which in terms of time offset seems quite manageable.

Furthermore, I wonder whether the 'exact evaluation' claim may be a bit grand from a bigger-picture perspective. As is known for the SALAMI data, when multiple annotators annotated the same songs, they were not always fully in agreement on how they annotated. Furthermore, each annotator physically had to interact with Sonic Visualiser, which may additionally introduce latency or inaccuracy in the annotations. The original SALAMI paper (reference [16]) proposes to consider annotated boundaries a match if they fall within a broader time window, proposing 0.5 and 3 seconds there. This is much coarser than the worst deviations found for the frame-based mir_eval implementation at a 2-second frame size. As such, I wonder whether it makes sense to try to match a human annotation extremely precisely, as there will be some degree of instability in the ground-truthing process. Thus, the suggestion that frame-based implementations are problematically unstable may be a less convincing argument given the nature of the annotation process (although the work is still convincing in its computational efficiency).

As a final tip: when submitting an implementation/code base anonymously, consider using the Anonymous GitHub service, which can host such repositories for peer review purposes.

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Accept

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

Reviewers are in consensus that this paper presents a nice idea. While some questions remain on how large the contribution and its benefits really will be, the insights are reusable and offer a new and refreshing perspective on evaluation. As such, while the average of the reviews would lean towards a weak accept, the work clearly seems above the acceptance bar, and I therefore recommend accepting the paper.

Review 1:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Disagree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Disagree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Strongly Disagree (Well-explored topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The metrics themselves are already widely adopted; the contribution can be seen as providing more efficient, drop-in replacements. Quite reusable.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

We should stop being lazy and derive exact integrals for semantic segmentation scores. It is faster than numerical integration.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

Summary

In this paper, the authors examine three evaluation metrics for semantic segmentation: Pairwise Frame Clustering, V-measure, and L-measure. These metrics are commonly implemented using discrete sampling methods in libraries such as mir_eval.

The main contribution of the paper is to introduce continuous formulations of these metrics, leveraging the fact that each relies on piecewise linear functions, which are straightforward to integrate analytically. When the number of segments is smaller than the number of uniform samples used in the discrete approximation, the proposed continuous versions offer improved runtime performance.
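
As a rough illustration of what such a continuous formulation can look like for the pairwise clustering score, here is a sketch following the closed form given after the abstract above. It is not the authors' implementation; the function names overlap_durations and pairwise_event_based are hypothetical, the interval arrays follow mir_eval's (start, end) row convention, and the double loop over segments is kept deliberately naive.

    import numpy as np

    def overlap_durations(ref_intervals, ref_labels, est_intervals, est_labels):
        """Total overlap duration for every (reference label, estimated label) pair.

        Intervals are (n, 2) arrays of [start, end] times; both annotations are
        assumed to cover the same time span.
        """
        ref_names = sorted(set(ref_labels))
        est_names = sorted(set(est_labels))
        m = np.zeros((len(ref_names), len(est_names)))
        for (r_start, r_end), r_label in zip(ref_intervals, ref_labels):
            for (e_start, e_end), e_label in zip(est_intervals, est_labels):
                overlap = min(r_end, e_end) - max(r_start, e_start)
                if overlap > 0:
                    m[ref_names.index(r_label), est_names.index(e_label)] += overlap
        return m

    def pairwise_event_based(ref_intervals, ref_labels, est_intervals, est_labels):
        """Exact (event-based) pairwise clustering precision, recall, and F-measure."""
        m = overlap_durations(ref_intervals, ref_labels, est_intervals, est_labels)
        agreement = np.sum(m ** 2)                           # pairs labeled alike in both
        precision = agreement / np.sum(m.sum(axis=0) ** 2)   # vs. pairs alike in the estimate
        recall = agreement / np.sum(m.sum(axis=1) ** 2)      # vs. pairs alike in the reference
        f_measure = 2 * precision * recall / (precision + recall)
        return precision, recall, f_measure

No frame size appears anywhere in this sketch, which is what removes the sampling error and the grid dependence discussed by the reviewers.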

Overall, being drop-in replacements, the metrics are reusable.

Strengths

  • Very simple idea: drop-in replacements of common scores
  • Paper is easy to follow
  • Three scores are considered

Weaknesses

  • The runtime is already around 10^-1. Great, we are seeing a 3x improvement, but over something that is already very small.
  • No evaluation of how much exact and sampled scores deviate is presented.

Reasons for my score

Overall, this paper is quite nice and an easy read. However, the gains are very small and not that well justified. From a statistical perspective, I would like to see two or three things:

  1. How does this new implementation correlate with the original ones? This should be close to 1, i.e., very correlated (a rough sketch of such a check appears after this list).

  2. This approach does appear to be more accurate, so what were the approximation errors of the original implementations? These errors may be derived from the errors of trapezoidal integration; after all, this is what the original scores do, in a sense.

  3. Now that we have exact formulations of the scores, can we explore the fact that these are averages and derive confidence bands/statistical tests?
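
A rough sketch of the correlation check suggested in point 1, assuming a corpus of annotation pairs in mir_eval's interval/label format. The function name score_correlation is hypothetical, and pairwise_event_based refers to the illustrative exact implementation sketched under the summary above; any exact implementation could be substituted.

    import numpy as np
    import mir_eval

    def score_correlation(tracks, frame_size=0.1):
        """Correlate frame-sampled and exact pairwise F-measures over a corpus.

        `tracks` is a list of (ref_intervals, ref_labels, est_intervals, est_labels)
        tuples; `pairwise_event_based` is the illustrative exact implementation
        sketched earlier on this page.
        """
        sampled, exact = [], []
        for ref_int, ref_lab, est_int, est_lab in tracks:
            _, _, f_sampled = mir_eval.segment.pairwise(
                ref_int, ref_lab, est_int, est_lab, frame_size=frame_size)
            _, _, f_exact = pairwise_event_based(ref_int, ref_lab, est_int, est_lab)
            sampled.append(f_sampled)
            exact.append(f_exact)
        sampled, exact = np.array(sampled), np.array(exact)
        # Pearson correlation plus the largest per-track disagreement
        return np.corrcoef(sampled, exact)[0, 1], np.max(np.abs(sampled - exact))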

Overall, I'm leaning towards acceptance given that nothing is strictly wrong with the paper. However, I am of the opinion that, as presented, the gains of the new approach are still small.

Review 2:

Q2 ( I am an expert on the topic of the paper.)

Disagree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The most useful reusable insight in this paper is about the robustness of existing implementations of metrics and how sensitive they are to parameters such as frame size.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

This paper takes existing music structural segmentation metrics and reimplements them in a faster, more robust manner.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper produces more precise implementations of popular structure analysis metrics that are used in contemporary literature. Primarily, the focus is on PFC, NCE, V-Measure and L-measure, all of which exist in the mir_eval toolkit. The results show an impressive improvement in speed over the toolkit implementations.

Looking at the mathematical definitions, the paper raises a few interesting ideas that require some clarification. The mathematical explanation of PFC makes sense, and the notion that areas of rectangles can be computed easily is valid and tracks quite well. Here, for the benefit of the reader, I think you need to introduce more clearly what the roles of $u$ and $v$ actually are.

For NCE, V-measure and L-measure I referred to the definitions from the McFee journal article and the original V-Measure paper. It's especially interesting to look at these metrics as attempting to improve upon the framework of F-measures in the context of pairwise comparisons. Optimization using a table to speed up the process is clever!

For the L-measure section, you specify the definition of a $d$-level label mapping without ever referencing what the $d$ might indicate here (is it depth?).

I think much of this paper is reasonably well written, if a little terse, because it places a decent amount of burden on the reader to be intimately familiar with these metrics to begin with (which I was not). I'd like to see two improvements:

  1. Figure descriptors could better explain what the plots represent. As it is right now, each just reads like a "here are the things in the plot" type of caption. It is difficult to understand what the plots might be indicating, especially for someone who might not be familiar with segmentation metrics.

  2. The derivation of the big-O improvements or the complexity of some of these algorithms could be a little more explicit, if nothing else with some additional citations.

I think the discussion on frame sizes is perhaps the most interesting bit of this paper. It is rather curious that the original metrics appear to be sensitive to frame size differences while the proposed ones aren't to the same degree. This might be interesting to evaluate further through more experiments.
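
One cheap way to probe this frame-size sensitivity is to re-score a single track at several frame sizes with mir_eval's frame-sampled pairwise metric. The intervals and labels below are invented purely for illustration.

    import numpy as np
    import mir_eval

    # Toy reference and estimate covering the same 100-second span,
    # with boundaries that do not fall on the sampling grid.
    ref_intervals = np.array([[0.0, 30.0], [30.0, 70.0], [70.0, 100.0]])
    ref_labels = ['A', 'B', 'A']
    est_intervals = np.array([[0.0, 28.5], [28.5, 71.2], [71.2, 100.0]])
    est_labels = ['a', 'b', 'a']

    # In a frame-sampled implementation the score can drift with the grid,
    # whereas an event-based implementation returns one grid-independent value.
    for frame_size in (0.1, 0.5, 1.0, 2.0):
        _, _, f = mir_eval.segment.pairwise(ref_intervals, ref_labels,
                                            est_intervals, est_labels,
                                            frame_size=frame_size)
        print(f"frame_size = {frame_size:.1f} s -> pairwise F = {f:.4f}")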

Beyond this, I think it would be interesting to see if it is possible to apply these metrics to more datasets beyond SALAMI. I understand that SALAMI is the de facto standard for segmentation evaluation, so this is not something that necessarily harms the paper, but it is something I think would be worth exploring in future work.

This sort of improvement on existing metrics is certainly important and exciting to read about. I think it's a good contribution that deserves to be accepted - if nothing else, because it prompts conversations about the metrics we use and how we might: a) improve their efficiency and b) critically evaluate their behaviors in context.

Review 3:

Q2 ( I am an expert on the topic of the paper.)

Disagree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Disagree

Q15 (Please explain your assessment of reusable insights in the paper.)

The only insight of this work is that "it is possible to replace sample-based evaluation metrics with continuous ones", but this seems to be a well-known idea (though this paper does contribute to the MIR community by implementing this idea in the evaluation of music structure analysis).

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Novel algorithms that efficiently implement evaluation metrics for music structure analysis.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper discusses a set of novel algorithms that implement three evaluation metrics for music structure analysis efficiently, by replacing the traditional sample-based algorithms with continuous algorithms.

The contribution of this paper is clear: a faster algorithm that aligns well with the original purpose of the three evaluation metrics. The efficiency of the proposed algorithms is also verified both theoretically (complexity analysis) and empirically (the actual runtime reported in Fig. 3).

In my opinion, although the contribution of this work seems marginal (just some faster reimplementations of existing evaluation metrics), in practice, it could foster large-scale evaluation of music structure analysis. Therefore, I tend to accept this paper.

Minor writing issues:

  • Would be great to explicitly mention which one is the ground-truth and which one is the prediction: S or \hat{S}?
  • What does the x-axis of Fig. 4 mean? I don't really get it.