P2-10: dPLP: A Differentiable Version of Predominant Local Pulse Estimation

Ching-Yu Chiu, Sebastian Strahl, Meinard Müller

Subjects: Open Review ; Music signal processing ; Knowledge-driven approaches to MIR ; Musical features and properties ; MIR fundamentals and methodology ; Rhythm, beat, tempo ; Machine learning/artificial intelligence for music

Presented In-person

4-minute short-format presentation

Abstract:

Predominant Local Pulse (PLP) estimation is a key technique in rhythmic analysis of music recordings, designed to identify the most salient pulse in an audio signal while adapting to local tempo variations. Unlike global tempo estimation, which assumes a fixed tempo, PLP dynamically adjusts to changes in tempo and rhythm, making it particularly effective as a post-processing strategy to enhance the locally periodic structure of a given input novelty or activity function. Traditional PLP estimation relies on a max operation to select the most prominent periodicity, limiting its use in differentiable learning frameworks. In this paper, we introduce dPLP, a differentiable version of PLP estimation that replaces the max operation when selecting a locally optimal periodicity kernel with a softmax-based weighting scheme. This modification ensures good gradient flow, allowing PLP to be seamlessly integrated into deep learning pipelines as an intermediate layer or as part of the loss function. We provide technical insights into its differentiable formulation and present experiments comparing it to the original non-differentiable PLP approach. Additionally, case studies in beat tracking highlight the advantages of dPLP in improving periodicity-aware representations within neural network architectures.

Meta Review:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 ( The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

Predominant local pulse (PLP) can be integrated into a DNN.

Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)

A differentiable version of predominant local pulse (PLP) can be integrated for end-to-end DNN training in periodicity-aware music analysis tasks.

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Weak accept

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper proposes dPLP, a differentiable version of the predominant local pulse (PLP) estimator, in the spirit of differentiable digital signal processing (DDSP). By replacing the max operation with the softmax function, dPLP can be integrated seamlessly into a deep learning pipeline for periodicity-aware music analysis tasks such as beat tracking, enabling end-to-end training and better performance.

Pros:
- Has the potential to improve the performance of periodicity-aware DNNs with dPLP.
- Enables end-to-end training of rhythm analysis systems without isolated post-processing.
- Provides a model-based, interpretable, and flexible module for periodicity enhancement.

Cons:
- The generalizability and usefulness across diverse musical genres have not been fully investigated.
- Peak picking post-processing is still required and has a strong impact on the performance.
- Performance gains might be partly due to increased model size in integrated architectures.
- dPLP may have a bias towards faster tempi.

This is a well-motivated work that could potentially have a large impact on the MIR community. Periodicity is one of the most basic properties of music. The simplicity of the proposed dPLP is a key strength because it is easy to implement and integrate into a wide variety of periodicity-aware music analysis tasks. One of the main concerns is that the softmax temperature parameter was fixed to 1 and its influence has not been investigated. The sensitivity of dPLP (how much dPLP can differ from PLP) is therefore unclear.
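To make the temperature concern concrete, the following is a minimal sketch, not taken from the paper, of how a temperature-scaled softmax moves between a near-argmax selection and a nearly uniform weighting. The function name `softmax_t` and the magnitude values in `mags` are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax_t(x, temperature=1.0):
    """Temperature-scaled softmax over a 1-D array (numerically stabilised)."""
    z = (np.asarray(x, dtype=float) - np.max(x)) / temperature
    e = np.exp(z)
    return e / e.sum()

# Hypothetical tempogram magnitudes at a single frame (illustrative values only).
mags = np.array([0.2, 1.0, 3.0, 2.5, 0.4])

for t in (0.01, 0.1, 1.0, 10.0):
    # Small temperatures concentrate the weight on the argmax entry,
    # large temperatures spread it almost evenly over all entries.
    print(t, np.round(softmax_t(mags, t), 3))
```

With a small temperature the weighting behaves like the original max selection, whereas a large temperature blends many kernels, so a sweep of this kind would be one way to quantify how far dPLP can deviate from PLP.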

Please insert a table showing the configurations of the methods listed in Table 1.

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Strong accept

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

This work proposes a differentiable version of predominant local pulse (PLP), a classical signal processing method used in various tasks. The reviewers confirmed the novelty and effectiveness of the proposed method and positively supported the acceptance of the paper.

Review 1:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The paper introduces a differentiable version of Predominant Local Pulse (PLP) estimation, enabling rhythmic salience modeling to be integrated into end-to-end trainable neural pipelines—something not achievable with classical PLP due to its non-differentiable argmax operation. The authors achieve this by replacing the argmax-based periodicity selection with a softmax-weighted sinusoidal kernel summation, allowing gradient flow through the entire PLP computation process. This formulation preserves the temporal and spectral interpretability of PLP while making it compatible with modern learning architectures. Moreover, the authors propose a modular architecture that combines a lightweight spectral flux estimator, the proposed dPLP module (with multi-kernel support), and a learnable fusion layer, demonstrating how differentiable rhythmic priors can be injected into audio-based models. Through a series of ablation studies and controlled comparisons (M1–M3), the paper illustrates how dPLP improves rhythmic precision and robustness in beat tracking and shows potential as a reusable building block for downstream tasks such as downbeat tracking, meter estimation, groove modeling, and rhythm-informed music generation.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

dPLP makes rhythmic salience estimation differentiable, enabling end-to-end trainable models that integrate interpretable rhythmic structure into modern neural music systems.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The paper proposes a differentiable version of Predominant Local Pulse (PLP) estimation, enabling rhythmic salience modeling to be fully integrated into end-to-end neural pipelines—something not achievable with classical PLP due to its non-differentiable nature. This work advances rhythmic analysis in MIR by making both onset detection and periodicity modeling trainable. Specifically, the authors implement a trainable spectral flux module and a softmax-based PLP mechanism, forming a lightweight yet effective architecture for beat tracking. The modular design is clearly motivated, and the experiments—especially the ablation comparisons (M1–M3) and visualizations (Figure 3)—make it easy for readers to understand how each component contributes to performance improvements. This makes the work not only technically sound but also highly reusable for future rhythm-aware models.

Although I believe the paper stands well on its own, I would like to share a few points that could be clarified or expanded, along with some broader suggestions that may go beyond the current scope.

For the first part – things that can be clarified or strengthened:

(1.) Choice of softmax for differentiable PLP: The authors replace argmax with softmax for periodicity selection, which is a valid and widely used approach. However, I recommend discussing why softmax was chosen over other common differentiable relaxations (e.g., sparsemax, entmax, Gumbel-softmax), especially since this choice is central to dPLP’s formulation and performance characteristics.

(2.) Related work on differentiable argmax replacements: Since the core mechanism of dPLP relies on replacing the non-differentiable argmax with a softmax relaxation, I suggest referencing relevant literature on differentiable approximations to argmax, such as Gumbel-softmax, sparsemax, or entmax. Even a brief citation and comparison would help ground the technique in broader machine learning literature and clarify whether softmax was chosen for theoretical or empirical reasons.

(3.) Related work on differentiable DSP for analysis: I suggest citing Kim et al., “Self-supervised Pitch Detection by Inverse Audio Synthesis” [1], which also applies a differentiable DSP-inspired module for an analysis task (pitch tracking). This would help contextualize dPLP within the emerging body of work focused on DDSP for analysis, rather than generation.

(4.) Evaluation on unstable tempo genres (recommended extension): The paper mentions that PLP is well-suited for expressive or non-steady tempo music (e.g., jazz, classical), but does not evaluate dPLP on such material. Including an ablation or case study in this context would provide stronger evidence of dPLP’s robustness and generality.

(5.) Limited downstream task evaluation: While the beat tracking results are convincing, it would strengthen the work to apply dPLP to at least one additional rhythmic analysis task, such as downbeat tracking or meter estimation. This would better demonstrate its general-purpose utility.

(6.) High-level musical interpretation: Beyond improving quantitative scores, what does dPLP reveal about musical rhythm? A brief discussion on how the soft salience output reflects rhythmic structure, listener perception, or interpretability—especially in expressive or syncopated music—would enhance the conceptual depth of the contribution.
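As an illustration of points (1.) and (2.), the sketch below contrasts softmax with sparsemax, one of the alternative relaxations mentioned above. It uses the standard sort-based simplex projection; the input values are hypothetical, and nothing here reflects an experiment from the paper.

```python
import numpy as np

def softmax(z):
    """Dense relaxation: every entry receives a strictly positive weight."""
    e = np.exp(np.asarray(z, dtype=float) - np.max(z))
    return e / e.sum()

def sparsemax(z):
    """Euclidean projection onto the probability simplex (sparse weights)."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1.0 + k * z_sorted > cumsum      # entries that stay non-zero
    k_max = k[support][-1]                     # size of the support
    tau = (cumsum[k_max - 1] - 1.0) / k_max    # threshold subtracted from z
    return np.maximum(z - tau, 0.0)

# Hypothetical tempogram magnitudes at a single frame (illustrative values only).
mags = np.array([0.2, 1.0, 3.0, 2.5, 0.4])
print(np.round(softmax(mags), 3))
print(sparsemax(mags))
```

In this toy case sparsemax keeps only the two strongest periodicities, while softmax assigns some weight to every candidate. Whether that sparsity helps or hurts gradient flow in dPLP is exactly the kind of question the suggested discussion could address.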

For the second part – suggestions beyond the current scope (optional but valuable):

(1.) Self-supervised training potential: Given that dPLP produces a continuous rhythmic salience representation, it has strong potential for self-supervised learning, such as contrastive learning based on rhythmic consistency or alignment-based pretext tasks. I encourage the authors to consider this direction, particularly for under-annotated genres or low-resource scenarios.

(2.) Comparison with SOTA and integration into modern architectures: While the paper intentionally focuses on a lightweight, modular architecture, I believe it would be valuable to explore how dPLP performs when integrated into larger, modern neural systems, such as CRNNs or Transformers for multi-task learning. For example, dPLP could be tested as a component within All-In-One [2], a recent state-of-the-art metrical and functional structure analysis model. This would help clarify whether dPLP’s inductive bias provides additive benefit beyond controlled settings.

Overall, I believe the paper makes a meaningful contribution by enabling differentiable rhythmic structure modeling and validating its effectiveness through well-structured experiments. It opens up several promising research directions and should be included in ISMIR this year. As the field continues to explore rhythm-aware learning systems, this work represents a timely and well-executed step forward.

References:

[1] Self-supervised Pitch Detection by Inverse Audio Synthesis. Kim et al., NeurIPS 2021. https://openreview.net/forum?id=RlVTYWhsky7

[2] All-In-One Metrical and Functional Structure Analysis With Neighborhood Attentions on Demixed Audio. Chou et al., arXiv 2023. https://arxiv.org/abs/2307.16425

Review 2:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Disagree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

End-to-end differentiability is critical for rhythmic tasks, even when the dPLP isn't fully optimised (i.e., with empirical hyperparameters), or some approximations or relaxations are used (e.g., softmax). This implies that there's still room for improvement and paves the way for similar work in the future.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Experiments show that the proposed dPLP provides useful gradients that help learn better frontend features for end-to-end beat tracking.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The paper proposes a differentiable version of the classic PLP method (dPLP) for beat tracking. The method is simple: it applies a softmax function to the tempogram along the tempo dimension. The resulting weights are used to compute a weighted sum of all the possible sinusoidal kernels at each frame. A small-scale but carefully designed proof-of-concept experiment shows the benefits of dPLP, supported by some further analysis. However, the paper doesn't give many details on the proposed differentiable spectral flux module, which I found quite important. The computational benefit of log-scale BPMs is not fully explained. In addition, the paper could be more impactful by scaling up the model and benchmarking it against more recent beat tracking methods.
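The mechanism summarised above can be sketched as follows. This is a minimal NumPy illustration under my own assumptions, not the authors' implementation: the softmax is applied to the tempogram magnitude, the window is a Hann window, and the names `soft_plp`, `freqs`, `win_len`, and `temperature` are hypothetical. A trainable version would use an autodiff framework rather than plain NumPy.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def soft_plp(novelty, freqs, win_len=101, temperature=1.0):
    """Softmax-weighted PLP sketch; freqs are given in cycles per frame."""
    assert win_len % 2 == 1, "odd window length keeps the window centred"
    novelty = np.asarray(novelty, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    n_frames, half = len(novelty), win_len // 2
    window = np.hanning(win_len)
    padded = np.pad(novelty, half)                  # zero-pad both ends
    acc = np.zeros(n_frames + 2 * half)
    for n in range(n_frames):
        m = np.arange(n - half, n + half + 1)       # absolute frame indices
        arg = 2.0 * np.pi * np.outer(freqs, m)      # shape (K, win_len)
        segment = padded[n:n + win_len] * window
        coeffs = np.exp(-1j * arg) @ segment        # local Fourier coefficients
        # Soft selection: weights over all K candidates instead of one argmax.
        weights = softmax(np.abs(coeffs) / temperature)
        kernels = np.cos(arg + np.angle(coeffs)[:, None])
        acc[n:n + win_len] += window * (weights @ kernels)   # overlap-add
    return np.maximum(acc[half:half + n_frames], 0.0)        # half-wave rectify

# Toy input: one impulse every 10 frames, i.e. 0.1 cycles per frame.
novelty = np.zeros(400)
novelty[::10] = 1.0
freqs = np.linspace(0.05, 0.20, 16)
plp = soft_plp(novelty, freqs)
```

On this toy impulse train the resulting curve peaks at the impulse positions, as the original PLP would, but every candidate frequency contributes to the output and would therefore receive a gradient.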

The question that bothers me most concerns Section 2.3, where the authors mention that log-scale BPMs can reduce computational cost. Which part of the computation is reduced? This is not clear from the text. The same claim is made again in Section 4.1. Moreover, the connection between sensitivity to relative tempo changes and log-scale tempos is unclear to me. A reference is needed.
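One possible reading of the claim, which the paper would need to confirm, is that a log-spaced grid covers the same tempo range with fewer bins, so fewer kernels are evaluated per frame. The sketch below compares the two grids; the tempo range, the 1 BPM linear step, and the 24 bins per octave are assumptions made for illustration and are not values from the paper.

```python
import numpy as np

# Hypothetical tempo range and resolutions (not taken from the paper).
bpm_min, bpm_max = 30.0, 480.0
bins_per_octave = 24

# Linear grid with 1 BPM spacing.
linear_grid = np.arange(bpm_min, bpm_max + 1.0, 1.0)

# Log grid: a constant ratio between neighbouring tempo bins.
n_octaves = np.log2(bpm_max / bpm_min)
n_steps = int(round(n_octaves * bins_per_octave))
log_grid = bpm_min * 2.0 ** (np.arange(n_steps + 1) / bins_per_octave)

print(len(linear_grid), len(log_grid))
print(log_grid[1] / log_grid[0])   # the same ratio holds for every neighbouring pair
```

Under these assumptions the log grid needs roughly a fifth of the bins. Each step also corresponds to the same relative tempo change (about 2.9 percent here), which may be what the authors mean by sensitivity to relative tempo changes.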

The other suggestion is that the authors should provide more details on the differentiable spectral flux module, such as what kind of parameterisation is going on inside the module, the size of the kernel, etc. I would also like to see what kind of convolution kernels are learnt at the end and compare them with the first-order differentiation kernel.
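For reference, the fixed baseline I have in mind for such a comparison is classic spectral flux written as a temporal convolution with the first-order differentiation kernel, followed by half-wave rectification. The sketch below is my own illustration; the function name `spectral_flux` and the constant `FIRST_ORDER` are hypothetical, and the paper's trainable module may be parameterised differently.

```python
import numpy as np

# First-order differentiation kernel: y[n] = x[n] - x[n-1].
FIRST_ORDER = np.array([1.0, -1.0])

def spectral_flux(spec, kernel=FIRST_ORDER):
    """Spectral flux of a (freq_bins, frames) magnitude spectrogram.

    The kernel is applied along time in every frequency band; a trainable
    module could learn the kernel values instead of fixing them.
    """
    spec = np.asarray(spec, dtype=float)
    n_frames = spec.shape[1]
    diff = np.stack([np.convolve(row, kernel, mode="full")[:n_frames]
                     for row in spec])
    diff[:, :len(kernel) - 1] = 0.0        # frames without full history
    # Half-wave rectification max(x, 0), then sum over frequency.
    return np.maximum(diff, 0.0).sum(axis=0)

spec = np.array([[0.0, 1.0, 1.0, 3.0, 0.0],
                 [2.0, 2.0, 5.0, 1.0, 1.0]])
flux = spectral_flux(spec)
```

Plotting the learnt kernels next to `FIRST_ORDER` would show directly whether training recovers something close to a plain difference or a longer, smoother filter.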

I understand that M2-S is the same as SFX-T, but it would still be better if the authors could mention this specifically and highlight their equality in Table 1. I also recommend adding the scores of recent beat tracking methods to Table 1 for comparison.

I appreciate the authors' efforts to make their work scientifically sound. Below are some personal suggestions on the notation:
- Section 2.1: "This window, ...., and zero outside." can be omitted or simplified.
- Are \omega and \phi(n, \tau) in the range [0, 1)? I infer this from the following text. It's better to clarify it since people usually assume the range [0, 2\pi).
- I feel \mathbb{R}_+ is more common than \mathbb{R}_{>0}.
- The authors could consider writing equations 5 and 8 as something like max(*, 0), which is much easier to understand, but I understand the current notation is from the FMP textbook. It's just a personal preference.
- Section 4.2: "standalone PLPs": did you mean softmax PLPs?

Please avoid putting long texts in footnotes (specifically, footnote 5). The authors can shorten the explanation or move it to the main text.

Although the authors said they will try more advanced architectures in the future, the work would have been better if they had done this or provided some initial results. Still, great work overall.

Review 3:

Q2 ( I am an expert on the topic of the paper.)

Strongly agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q5 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))

Authors could consider referencing Chiu et al. (2023) and Foscarin et al. (2024) (references [10] and [13]) again in Section 3.3 as both are good examples of recent works shifting away from DBN-based post-processing.

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The proposed modification to the PLP function, which makes it differentiable and allows it to be incorporated in deep-learning-based systems, provides new possibilities for explainable beat-tracking algorithms.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

The differentiable PLP enables integration in deep-learning-based pipelines.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The paper presents a differentiable implementation of the predominant local pulse estimation function (dPLP) that can be used in deep-learning-based methods. For this, it replaces the argmax operation that selects the optimal periodicity kernel with a softmax-based weighting scheme that merges multiple sinusoidal kernels.

The paper is very clear and well-written, and I have only minor suggestions and comments, which are presented in the following:

Introduction

  • Instead of describing "tempo" as "the speed of those beats", it would be perhaps more precise to say it is "the rate at which those beats occur".
  • The reference for footnote 1 in the text comes before the full stop, whereas in other instances references are added after it (as they probably should be).

Section 3.2.1

  • The trainable spectral flux function is a very interesting idea! However, the paper does not expand much on it (I understand it is not the main object). For instance, it is not clear how the differentiation and the training of the convolution kernel work. I understand this could be remedied by the git repo, but if space is available, it could be interesting to include some more information.
  • I had to reread the last paragraph in this section a few times, thinking that this training had to somehow involve the dPLP. I believe I only really got it after reaching Section 4. The issue is that it is not usual to apply peak-picking directly to an onset detection function in order to find beats. I don't think there is a reference for this, since beat tracking methods usually transform the onset detection function into some form of periodicity representation. If there is such a reference, maybe it could be added here. If not, please consider rephrasing it, making it clear that despite not being orthodox, this will be performed as a kind of baseline.
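To be explicit about what I understand by applying peak-picking directly to an activation or novelty curve, a minimal sketch is given below. The function name `pick_peaks`, the threshold, the minimum distance, and the frame rate are all hypothetical; the paper's actual peak-picking procedure may differ.

```python
import numpy as np

def pick_peaks(activation, threshold=0.5, min_distance=1):
    """Indices of local maxima that reach the threshold.

    Peaks closer than min_distance frames to the previously accepted
    peak are skipped (a simple greedy rule).
    """
    act = np.asarray(activation, dtype=float)
    peaks = []
    for n in range(1, len(act) - 1):
        is_local_max = act[n] > act[n - 1] and act[n] >= act[n + 1]
        if is_local_max and act[n] >= threshold:
            if not peaks or n - peaks[-1] >= min_distance:
                peaks.append(n)
    return np.array(peaks, dtype=int)

# Toy activation curve (illustrative values only).
activation = [0.0, 0.2, 0.9, 0.3, 0.1, 0.6, 0.7, 0.2, 0.0]
frames = pick_peaks(activation, threshold=0.5)
frame_rate = 100.0                      # assumed frames per second
beat_times = frames / frame_rate
```

Applied to a raw spectral flux curve, such a procedure returns onsets rather than beats, which is why I suggest stating explicitly that it serves as a baseline.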

Section 3.2.3

  • "To assess the complementarity between Delta_S and the PLP curves" should be written as "To assess the complementarity between Delta_S and the dPLP curves" (dPLP).

Section 4.1

  • I am curious about beat tracking results on a few configurations that could serve as baselines as well and were not included in Table 1. First is the combination of SFX-I and A- or S-. Second, a version of M2, where S = SFX-I, instead of SFX-T. These would also allow one to see the impact of training the fuser module over the GTZAN dataset with a non-optimal spectral flux module. If there is space, it could be interesting to add this discussion.

Section 4.2

  • "and L-correct metrics (from below 0.220 to 0.360)" should be written as "and L-correct metrics (from below 0.220 to above 0.360)" (above).

Section 4.3

  • When discussing the differences between fuser false-positives, it is argued that these can be attributed to the "differentiability of dPLP (M1 vs. M2)". However, if I understood correctly, in both M1 and M2, the dPLP is used, which is differentiable. The difference is that in M2 it is not backpropagated.
  • "In contrast, when the novelty functions from S modules do not align with the PLP curves (e.g., green regions), M1-F and M2-F, which have access to the dPLP outputs, avoid making a false-positive error" -> this is an interesting comparison, but doesn't seem correct. First, I assume the "PLP curves" are SFX-T and SFX-I, since M*-S will be discussed in the following lines. Second, there is barely any activation in SFX-T in the green region, only in SFX-I. Third, M1 doesn't use either SFX-T or SFX-I, since S is also trained. Perhaps this discussion should have been done using M1-S instead (where the activation is more pronounced)?