Abstract:

Neural networks have become the dominant technique for accurate pitch and periodicity estimation. Although a lot of research has gone into improving network architectures and training paradigms, most approaches operate directly on the raw audio waveform or on general-purpose time-frequency representations. We investigate the use of Sawtooth-Inspired Pitch Estimation (SWIPE) kernels as an audio frontend and find that these hand-crafted, task-specific features can make neural pitch estimators more accurate, more robust to noise, and more parameter-efficient. We evaluate supervised and self-supervised state-of-the-art architectures on common datasets and show that the SWIPE audio frontend allows for reducing the network size by an order of magnitude without performance degradation. Additionally, we show that the SWIPE algorithm on its own is much more accurate than commonly reported, outperforming state-of-the-art self-supervised neural pitch estimators.

Meta Review:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 ( The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

Combining DSP and DNN techniques works well for pitch estimation.

Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)

With a careful implementation, SWIPE can still be a front-runner.

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Weak accept

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper proposes to use Sawtooth-Inspired Pitch Estimation (SWIPE) kernels, a classical DSP method, as a front-end for neural pitch estimation (e.g., Pitch Estimation with Self-supervised Transposition-equivariant Objective (PESTO)), improving accuracy, robustness, and efficiency. It also demonstrates that a careful implementation of the SWIPE algorithm significantly outperforms state-of-the-art self-supervised neural pitch estimators, showing that its potential has been underestimated.

Pros:
- Finds a way of drawing out the full potential of SWIPE, leading to SOTA performance.
- Has the potential to improve a wide variety of transcription tasks.
- Allows significantly smaller neural networks without losing performance.
- Offers a flexible latency-accuracy tradeoff adjustable at inference time.

Cons:
- The paper is not strong in terms of technical novelty.
- Applicability to music recordings beyond solo performances has not been investigated.

This is an interesting “bridging old and new” paper. The proposed combination of classical DSP and modern deep learning would be worth sharing among the community.

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Weak accept

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

This paper revisits SWIPE, a classical DSP method, and shows its usefulness in combination with deep learning models. The reviewers valued the idea of drawing out the full potential of a traditional pitch estimator. The evaluation section, which is not fully convincing as it stands, would be significantly improved by including a comparison with other conventional SWIPE implementations.

Review 1:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Disagree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

Think twice before you call classic audio signal processing algorithms obsolete. They might still perform pretty well and can be combined with neural architectures.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

The original SWIPE algorithm (when implemented right) might perform better than self-supervised neural methods for single pitch estimation.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper is about single-pitch estimation using a hybrid approach that takes inspiration from the DSP-based SWIPE method and combines it with neural architectures. Although single-pitch estimation is a long-standing challenge in audio signal processing, it is still worthwhile to investigate further, especially w.r.t. efficiency, robustness, and cross-domain performance (e.g., speech vs. musical instruments).

The authors start by giving a brief overview of the SWIPE algorithm, followed by an introduction to supervised (e.g., FCNF0++) and self-supervised (e.g., PESTO) neural approaches for single-pitch estimation.

The authors continue with details about their slightly modified SWIPE baseline implementation as well as its combination with neural backends, in which the SWIPE kernels and scores serve as the essential front-end components.

In their experimental section, they compare several configurations of the different approaches. For training, they use a combination of the MDB-stem-synth (musical instruments) and PTDB-TUG (speech) datasets. They also evaluate on held-out test splits as well as the completely unseen MIR-1K (singing voice) dataset. They include two DSP baselines, PYIN and SWIPE (modified as described in their paper).

In the results section, the authors report the usual pitch-related scores like raw pitch accuracy (RPA), voiced/unvoiced F-measure, and overall accuracy (OA). In most of the settings, their hybrid approach outperforms FCNF0++ and PESTO as state-of-the-art examples of supervised and self-supervised neural pitch estimators/trackers. Notably, their SWIPE re-implementation surpasses PESTO when trained and tested on the MDB-stem-synth dataset (considering only RPA).

The experiments are rounded off by further explorations of noise robustness and the trade-off between window size and performance. The latter is especially important for low-latency scenarios.

All in all, I applaud the authors for revisiting SWIPE and combining it with recent state-of-the-art neural backends. This is a line of research that I personally would like to see explored more often, and the reported results seem to support it. However, the write-up of the paper has several flaws that almost made me reject it. The authors need to improve on the following aspects:

1) The authors need to state more explicitly what SWIPE kernels and SWIPE kernel correlation scores (sometimes only called SWIPE scores here) are. Which is which can be deduced if one is either familiar with SWIPE or reads through the whole paper, but the reader would have a much easier time if a running example were introduced in the beginning, together with a visualization.

2) Abbreviations for the various configurations are all over the place (compare Tables 1 and 2 vs. 3 and 4). It would be less confusing to the reader if the authors came up with a more systematic naming scheme (e.g., DSP-based, supervised, or self-supervised, followed by the signal representation and backend) and explained it once before jumping into the experiments.

3) Some choices of the algorithmic configurations do not seem well justified. For example, why does the fully supervised model have a Toeplitz-style fully connected layer if it makes no use of the self-supervised training paradigm? Why were the frequency resolutions chosen so differently between the Mel and CQT representations?

4) The train and test configurations differ quite a lot between the different experiments. The authors justify this by their aim to reproduce results from earlier papers, but the setup seems somewhat arbitrarily chosen. This is especially true when considering the different modalities of the datasets: musical instruments, speech, and singing voice.

5) My personal experience with the PTDB-TUG dataset tells me that there are some annotation errors in the reference pitch tracks. In particular, there is sometimes wrong voiced/unvoiced information that assigns no pitch even though a clear fundamental and harmonics are visible in the spectrogram. I fear that any training and evaluation involving this dataset has to be treated carefully; in the worst case, it could even call the main findings into question. Why did the authors not use other available datasets with pitch annotations?

6) The literature references must be more carefully curated and formatted. For example: a) reference [7] should refer to the ICASSP paper instead of the arXiv pre-print: https://ieeexplore.ieee.org/document/8461329; b) reference [15] contains faulty characters in the paper title.

Review 2:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Disagree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The most reusable part for me is to revisit our SWIPE implementations and exchange the ERB scale for the Mel scale.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Using SWIPE as a frontend to a neural network can significantly reduce the number of parameters while keeping robustness to noise.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak reject

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper presents a series of experiments aimed at coupling the self-supervised PESTO model with classical pitch estimators such as PYIN and SWIPE, with particular emphasis on SWIPE. The overarching goal is to enable pitch estimation on devices with limited computational resources, such as mobile phones.

A positive aspect of the paper is the authors’ effort to revisit and adapt traditional pitch estimators to better suit their specific needs. Notably, they report that replacing ERB bands with Mel bands increases SWIPE’s performance on music-related data. However, this potentially interesting observation remains unsubstantiated, as the paper does not include experimental evidence to support it.

Overall, I feel the paper is not yet ready for publication. The experimental results, while touching on relevant aspects, do not convincingly support the claims made. First, the authors argue that their implementation of SWIPE outperforms other versions using ERB bands. However, the comparison lacks rigor—there is no clear evidence that the baseline implementations are comparable in terms of evaluation setup, metrics, or configuration. Reproducing published results before introducing modifications would lend much more credibility to the analysis.

Second, the reported evaluation metrics are presented only at a high level, with average values across test datasets. Especially when the observed improvements are small, deeper analysis is crucial to assess the reliability and significance of the results. For instance, a plot of RPA over frequency or visualizations of the Toeplitz matrix in the case of the PESTO-SWIPE-tiny model would provide valuable insight. In this regard, a more detailed error analysis would have been more informative than the brief timing and robustness evaluations included in Sections 5.4 and 5.5.

On a minor note, the notation used in the mathematical sections is somewhat confusing. For example, it is unclear why formula (1) is presented in continuous notation, or why "i" is used to represent pitch in Sections 2.2 and 2.3. Additionally, whether y = f_{theta}(x) is the same as y-tilde is not clearly explained (see the caption before Section 2.3.2).

I hope the authors find this feedback constructive. I encourage them to consider narrowing the scope of the paper and developing a single aspect in greater depth, with stronger empirical validation and clearer exposition. My recommendation, at this stage, is to reject the paper.

Review 3:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Disagree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The paper rehashes and recycles parts from previous works, but using signal-processing knowledge in neural networks seems like a good idea, as does leveraging self-supervised learning.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Signal processing knowledge improves performance of NN. Self-supervised learning can curb the need for more ground truth to improve performance.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This work proposes to use SWIPE kernels in a self-supervised training framework for pitch estimation. Overall, the text is well organized, well structured, and well written. In general, the presentation is clear and the decisions are well justified. The introduction is clear, with a clear motivation, clear state of the art (SOTA), and clear contribution. The experimental setup is appropriate to present clear evidence of the claims. Finally, the references are appropriate, citing mostly peer-reviewed works and only including non-peer-reviewed references when necessary.

All that said, I have a few suggestions to improve the manuscript.

Major suggestions:

First, the “self-supervised” aspect of the work seems to lie at the core of the contribution, but it does not appear in the title. I think the title should include “self-supervised” and use “fundamental frequency estimation” instead of “pitch estimation” (see below).

Section 5.2 says (lines 308-310): “In both the supervised and self-supervised setting, we train all models for 50 epochs using a batch size of 256 and the Adam optimizer with an initial learning rate of 10^-4.” Lines 387-389, and also 358-360, seem to contradict this. Please clarify.

Table 1 made me wonder: can OA > RPA? The definitions in 4.3 seem to indicate that OA <= RPA, because RPA includes all voiced frames while OA only considers correctly classified voiced frames (a smaller number than all voiced frames). I am assuming that these calculations only change the numerator and that the denominator is all voiced frames for both. Formulas in 4.3 would help clarify the confusion.
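For what it is worth, under the definitions commonly used in melody/pitch evaluation (mir_eval-style, which may or may not match the paper's), OA can indeed exceed RPA: OA's denominator covers all frames, and its numerator also credits correctly detected unvoiced frames. A minimal sketch under that assumption:

```python
import numpy as np

def rpa(ref_voiced, est_pitch_ok):
    # Raw Pitch Accuracy: fraction of *reference-voiced* frames whose
    # estimated pitch is within tolerance.
    return np.sum(ref_voiced & est_pitch_ok) / np.sum(ref_voiced)

def oa(ref_voiced, est_voiced, est_pitch_ok):
    # Overall Accuracy: over *all* frames, a frame counts as correct if it is
    # voiced with correct pitch and voicing, or correctly detected as unvoiced.
    correct = (ref_voiced & est_voiced & est_pitch_ok) | (~ref_voiced & ~est_voiced)
    return np.mean(correct)

# Toy example: 4 voiced frames (3 with correct pitch), 6 unvoiced frames,
# all voicing decisions correct.
ref_voiced   = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=bool)
est_voiced   = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=bool)
est_pitch_ok = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0], dtype=bool)

print(rpa(ref_voiced, est_pitch_ok))             # 0.75 (3 of 4 voiced frames)
print(oa(ref_voiced, est_voiced, est_pitch_ok))  # 0.9  (9 of 10 frames) -> OA > RPA
```

So whether OA can exceed RPA hinges on whether OA's denominator is all frames or only voiced frames, which is exactly why explicit formulas in Section 4.3 would resolve the question.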

Minor details:

(lines 25-26) I’d either change the phrasing of this particular sentence or avoid the term “pitch estimation” altogether. Rephrase as “fundamental frequency is considered the signal counterpart of pitch” or “fundamental frequency correlates with pitch”. “Fundamental frequency estimation” avoids most of the problems because it names what is actually being estimated.

Personally, I’d lose the bullet-point lists in favor of regular paragraphs in Sections 1 and 4.3.