P2-1: Reformulating Soft Dynamic Time Warping: Insights into Target Artifacts and Prediction Quality

Johannes Zeitler, Meinard Müller

Subjects: Machine learning/artificial intelligence for music; Open Review; Alignment, synchronization, and score following; Awards Nominee; Knowledge-driven approaches to MIR; MIR tasks; Music signal processing; MIR fundamentals and methodology

Presented In-person

10-minute long-format presentation

Abstract:

Training deep neural networks for music information retrieval (MIR) often relies on strongly aligned data, where each frame has a precisely annotated target label. To reduce this dependency, soft dynamic time warping (SDTW) enables training with weakly aligned data by replacing hard decisions with weighted sums, allowing for gradient-based learning while aligning feature sequences to shorter, often binary, target sequences. However, SDTW introduces gradient artifacts that can cause blurring and degrade predictions, impacting the learning process. In this work, we analyze the sources and effects of these artifacts and propose a reformulation of SDTW that expresses its gradient in terms of an equivalent strongly aligned target representation. This reformulation provides an intuitive interpretation of learned representations and insights into the impact of SDTW hyperparameters on the prediction quality. Using multi-pitch estimation as a case study, we systematically investigate these modified targets and demonstrate their potential for improving training stability, interpretability, and alignment quality in MIR tasks.

Meta Review:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 ( The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

SDTW is quite a common way of dealing with weak labels in a number of tasks, so this investigation into its properties provides reusable insights for anyone who wants to use SDTW (and potentially also other, similar techniques).

Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)

Parametrisation of SDTW has a direct and very noticeable impact on the predictions; thus, parameters should be chosen carefully.

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Strong accept

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The paper presents a reformulation of the Soft Dynamic Time Warping (SDTW) loss used in neural network training in weakly supervised settings. By expressing the SDTW gradient in terms of modified targets with standard element-wise loss functions like MSE and BCE, the authors offer enhanced interpretability and practical insights into SDTW-based learning.

Main Strengths:

  • Theoretical Contribution: provides a reformulation of the SDTW gradient into interpretable, modified target representations. Demonstrates equivalence between SDTW training and standard loss training with these modified targets.
  • Interpretability: the reformulated gradient enables a closer look into what neural networks learn using SDTW. The modified targets enable intuitive qualitative evaluation through visualisation.
  • Practical Relevance: offers practical guidelines for initialising and training DNNs with SDTW in weakly supervised tasks. The controlled experiments on multi-pitch estimation are illustrative and offer valuable insight into SDTW hyperparameter effects.
  • Clarity: the provided figures in particular are informative and aid comprehension of abstract concepts.

Main Weaknesses:

  • Limited evaluation and ablation: evaluation is limited to one task (multi-pitch estimation) and ablation is limited to the effect of one parameter (γ). As the paper's focus is on introducing the reformulation, showing its equivalence, and demonstrating its usefulness in the analysis of SDTW, in my mind this is fine. Nonetheless, it would be very helpful to include further case studies plus a more thorough ablation study (e.g. what about the step weights in SDTW?), maybe in an extended / future paper.
  • Reproducibility: some implementation details are described, but no code is provided, potentially hindering replication.

Summary:

The paper introduces a very useful way to get insights into the properties of the SDTW loss. The manuscript is very well written and technical concepts are conveyed with clarity, aided by well-chosen examples and illustrations. Overall, this is a very strong contribution.

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Strong accept

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

The paper presents a novel reformulation of the Soft Dynamic Time Warping (SDTW) loss, making the training process more interpretable by representing gradients as modified targets. The reformulation enables visualization and qualitative inspection of what the model is optimizing, providing insight into training dynamics.

All reviewers agree that this is a well-written and highly insightful paper. For the final version, please take into account the individual reviewer comments. Additionally, we encourage you to consider publishing your code to support reproducibility and facilitate future research.

Review 1:

Q2 ( I am an expert on the topic of the paper.)

Strongly agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Strongly agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The training process of neural networks using the soft dynamic time warping (SDTW) loss can be better understood by reformulating weak targets into strongly aligned targets.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

This paper presents a novel approach to monitoring the training process of the soft dynamic time warping (SDTW) loss by reformulating weak targets into strongly aligned targets.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper presents a novel approach to monitoring the training dynamics of the soft dynamic time warping (SDTW) loss by reformulating its gradients into an equivalent representation based on modified targets. These targets are strongly aligned with the network output. This reformulation enables the visualization of which targets the model is optimizing towards at each training iteration.

The authors effectively demonstrate the utility of this method in the multi-pitch estimation (MPE) task, particularly in analyzing the impact of the soft-min temperature hyperparameter on the prediction of short and long notes. The synthetic example presented in Section 4 is thoughtfully designed and is further validated through real-world examples in Section 5.

STRENGTHS:

  • The paper is clearly written and well-organized, with a coherent flow from theoretical formulation to synthetic demonstrations and real-world applications.

WEAKNESSES:

  • The proposed technique is limited to monitoring the training process rather than influencing or improving it. While the modified target visualization reveals what the model is optimizing toward, it does not provide a mechanism to steer training if the optimization path is suboptimal.
  • The recommendations (except "2. Target inspection") in the last paragraph of Section 5 have already been suggested in prior work (see reference [4]) and are not novel here.

MINOR COMMENT:

  • Since the soft-minimum operation is not novel to this paper, the original source should be properly credited, e.g. by referring to [9] when introducing Equation (4).

Review 2:

Q2 ( I am an expert on the topic of the paper.)

Disagree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Strongly agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The paper shows that training a DNN with a weak target is equivalent to training with a specific modified target. By explicitly realizing this modified target, the tradeoffs made via the softmin temperature hyperparameter (which may affect both the convergence of the labeling model and the accuracy of the predictions) become more evident. This makes the training process more interpretable. One wonders which other deep learning training procedures might be amenable to such a treatment.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

By visualizing the modified target in DNN training with a weak target (for example, multiple pitch detection), you can see if you are training an overly blurred representation.

Q17 (Would you recommend this paper for an award?)

Yes

Q18 ( If yes, please explain why it should be awarded.)

This paper adds insight into the training procedure for weak target DNNs, relevant for multiple pitch detection and possibly other MIR tasks. The paper is well written and explicit about both the reformulation of SDTW and the experiments exploring tradeoffs involving the softmin hyperparameter.

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

In this paper, the authors reformulate the soft dynamic time warping (SDTW) targets used to train DNNs with so-called weak targets. This appears to be useful in multiple pitch estimation, but there are probably many other applications in music processing. The reformulation leads to computable modified targets that show the effect of the softmin hyperparameter, as well as a deeper understanding of the weak target training itself.

The paper is clearly written. Equations are clearly presented for the derivation at the correct level of detail. The work is properly motivated and antecedents are cited. In the penultimate section, concrete suggestions are given for weak target training.

The paper leads to a potentially more interpretable weak target training, as well as a deeper understanding of weak target training.

Review 3:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Disagree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

Even for soft alignment, it is possible to visualize a single alignment plan that is akin to an expected value of all alignments. I could see such an insight applying also in learning pipelines that use optimal transport.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

It is possible to interpret a softDTW alignment plan as the alignment of an input sequence to an "expected value" output, which is helpful for debugging and for understanding the effect of the regularization parameter in softDTW.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This is mainly a theoretical paper on understanding SDTW under some common loss functions. The authors show that, mathematically, when computing the gradient of SDTW with respect to one of the vectors in the input sequence, x_n, both the MSE and BCE point-to-point alignment losses move x_n closer to "y_n^{mod}". y_n^{mod} is a convex combination of all target frames y, and it can be thought of as the expected value of what the soft alignment plan aligns x_n to (in the case of gamma = 0, a single hard alignment plan, y_n^{mod} is exactly the target point that x_n is aligned to). This interpretation is very useful during training, because one can sonify y^{mod} and listen to how similar it sounds to x in order to judge whether training is progressing. Likewise, it gives a way to examine the quality of the final result. Furthermore, this formulation elucidates a tradeoff of the smoothing parameter gamma in SDTW: a larger gamma improves smoothness and differentiability, but it blurs transient events more. As a result, the authors conclude by suggesting to start with a high gamma to stabilize training, and then to lower it as much as possible based on periodic manual inspection of y^{mod} on some examples.
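The modified-target construction described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: it assumes a soft alignment matrix A (rows normalized to sum to 1, as the gradient-as-pseudo-probability interpretation suggests) and toy binary pitch targets; the function name and shapes are my own.

```python
import numpy as np

def modified_target(A, Y):
    """Row-wise expectation of target frames under a soft alignment.

    A: (N, M) soft alignment weights; A[n, m] is the pseudo-probability
       that prediction frame n aligns to target frame m (rows sum to 1).
    Y: (M, K) weak target sequence (e.g. binary pitch activations).
    Returns: (N, K) strongly aligned modified target y_mod, where each
    row is a convex combination of the target frames.
    """
    return A @ Y

# Toy example: 4 prediction frames, 2 target frames, 3 pitch classes.
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])
A = np.array([[1.0, 0.0],   # frame 0 aligns fully to target frame 0
              [0.7, 0.3],   # blurred transition (a larger gamma
              [0.3, 0.7],   #   would spread these weights further)
              [0.0, 1.0]])  # frame 3 aligns fully to target frame 1
y_mod = modified_target(A, Y)
```

With gamma = 0 each row of A would be one-hot and y_mod would simply copy target frames; the intermediate rows here show the blurring of transients that the review discusses.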

Overall, I really like this paper: the mathematical analysis is simple but deep, and it goes a long way toward bringing black-box, less-interpretable soft alignments back into something that can be sonified and examined. My main constructive point is about the evaluation: while it is helpful to see a specific example in Figure 5 for multi-pitch tracking, it would also be nice to see test loss on the multi-pitch examples for different fixed gammas, as well as for a "gamma schedule" following the heuristics the authors suggest in the conclusion. For future work, it would also be very cool to hear sonifications for more complex cross-domain alignment problems like version identification, and possibly even cross-modal alignments like text to audio! I know it isn't the point of the paper, but I feel like multi-pitch tracking doesn't totally show off the potential of this idea. Beyond that, a minor comment: since the paper relies on it so heavily, it might be helpful to give the reader some more intuition about why the gradient is like a pseudo-probability matrix. But again, this is overall very cool work. If it gets in, it would be great to provide some sonifications as supplementary material!