Enhancing Neural Audio Fingerprint Robustness to Audio Degradation for Music Identification

Recep Oguz Araz; Guillem Cortès-Sebastià; Emilio Molina; Joan Serra; Xavier Serra; Yuki Mitsufuji; Dmitry Bogdanov

Abstract:

Audio fingerprinting (AFP) allows the identification of unknown audio content by extracting compact representations, termed audio fingerprints, that are designed to remain robust against common audio degradations. Neural AFP methods often employ metric learning, where representation quality is influenced by the nature of the supervision and the utilized loss function. However, recent work unrealistically simulates real-life audio degradation during training, resulting in sub-optimal supervision. Additionally, although several modern metric learning approaches have been proposed, current neural AFP methods continue to rely on the NT‑Xent loss without exploring the recent advances or classical alternatives. In this work, we propose a series of best practices to enhance the self-supervision by leveraging musical signal properties and realistic room acoustics. We then present the first systematic evaluation of various metric learning approaches in the context of AFP, demonstrating that a self‑supervised adaptation of the triplet loss yields superior performance. Our results also reveal that training with multiple positive samples per anchor has critically different effects across loss functions. Our approach is built upon these insights and achieves state-of-the-art performance on both a large, synthetically degraded dataset and a real-world dataset recorded using microphones in diverse music venues.

Meta Review:

Q2 (I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Disagree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The paper explores different ideas -- often developed in computer vision applications -- in the context of music fingerprinting. The results are sometimes surprising, which should make our community reconsider some assumptions about best practices.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

This paper significantly improves the performance of neural music fingerprinting techniques through systematic exploration of loss functions, data handling, and hyperparameter selection.

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Strong accept

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

Some of the paper’s strengths:
+ One significant contribution of the paper is a larger, more comprehensive, and more realistic benchmark for evaluating music fingerprinting. The paper significantly expands the database size compared to previous works, considers a wider range of additive background noise, room impulse responses, and microphone responses, and corrects erroneous (or at least poorly chosen) data pre-processing techniques. This will be a valuable resource to the community moving forward.
+ The results show large gains over previously proposed methods (NAFP, GraFP). The cumulative effect of careful attention to the data preparation, training configuration, and architecture design led to very large improvements in overall performance.
+ The paper is very thorough and systematic in presenting a detailed analysis of the effect of many different design choices: the degradation dataset used, the selection of examples in each batch, the effect of impulse responses & reverberation, the loss function, the number of anchors per batch, etc. The experimental results are well organized and offer insights into how much each of these parts of the pipeline matters.
+ The writing is very clear, well organized, and easy to follow.

Some of the paper’s weaknesses:
- The title is somewhat misleading in two senses: (a) it suggests that the paper proposes a new fingerprinting method, whereas this paper is more of an analysis paper, and (b) the results don’t really study the scalability aspect other than running all experiments on a larger dataset. To claim that a method is more scalable, I would expect there to be some comparison of the size of the fingerprint databases, experimental runtimes, comparison of results vs database size, etc. I think a title like “Improved Neural Music Fingerprinting” or “Improving the Robustness of Neural Music Fingerprinting” would be more accurate.
- The paper is missing an explanation of what NAFP actually does (section 2.2). Once it computes the log (or power) mel spectrogram, how does it convert it into a real-valued fingerprint? What architecture does it use? This could be just a few brief sentences, but I think it is important for completeness.
- There is little novelty in the way of new ideas, though the experimental results and insights are novel.

Other feedback:
- Are both the synthetic and industrial datasets being released to the community? It was not clear to me, and often industrial datasets are not released due to copyright issues.
- In Table 7 and the corresponding discussion in section 5, it was not explained very clearly what “exact match” and “near match” meant.
- L481: Could you explain what “completing in a reasonable time” means? This is hard to interpret because there is no discussion in this paper of runtimes.

Overall, I appreciate the thoroughness of the experimental results and feel like it will be a valuable contribution to the community.

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Accept

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

The reviewers discussed the strengths and weaknesses of the paper, which we summarize below.

Strengths
• The paper proposes a more comprehensive and realistic benchmark for evaluation of music fingerprinting systems. In constructing this benchmark, the authors address multiple weaknesses in previous evaluation protocols, including insufficient diversity of room acoustics and background noise, and unrealistic audio degradations. This is a valuable contribution to facilitate future research on this topic.
• The experimental results are very systematic and thorough. The authors clearly document the steps in improving the baseline fingerprinting system. The ablation studies are done systematically and provide useful insights to other practitioners on practical issues like loss functions, batch construction, data preprocessing (filtering out higher frequencies), etc.
• The experimental results offer clear recommendations to researchers on best practices. Careful selection of these practices led to significant improvements in overall performance of the fingerprinting system. Of particular note, some of the findings contradict some of the recommendations in the Computer Vision community (from which several of the techniques originated), so these insights are particularly helpful to the MIR community.
• The paper provides a critical assessment of earlier work and benchmarks. The review of related work is comprehensive and current.
• The authors plan to share their model and code with the community, which will facilitate open and reproducible research.

Weaknesses
• The biggest criticism raised by reviewers was that the “marketing” of the paper was misleading. For example, the title makes it sound like a new fingerprinting method is being proposed, whereas in actuality the paper is making lots of small improvements to an existing approach. Another reviewer felt that the naming of the sections is misleading, and recommends renaming section 3 to make it clear that it describes the baseline system. The emphasis on “scalable” in the title and abstract is somewhat misleading since scalability is not really studied (other than evaluating on a larger dataset). The reviewers ask the authors to make it clear in the title and abstract that this paper is about improving an existing system. This is itself a valuable contribution that does not need to be “oversold”.
• Some parts of the writing could be improved. One reviewer suggests that a figure (or at least a brief textual summary) giving an overview of the system and training/evaluation pipelines would be very helpful to the reader. The introduction could be improved by mentioning earlier unsupervised approaches (like Shazam) and convincing the reader that the newer “neural” fingerprinting approaches have been shown to be better. A brief description of the NAFP method and architecture would be helpful for completeness.
• One reviewer points out that the training was done on fma_medium and evaluation was done on a part of fma_full. This may indicate overlap between songs in the training and evaluation datasets. This should be clarified if there is no overlap, or acknowledged if there is overlap.
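The train/evaluation overlap question raised above could be checked with a simple set intersection. The following is only a minimal sketch: it assumes that tracks carry an identifier shared across the two subsets, and the IDs, the function name `find_overlap`, and the variable names are hypothetical rather than taken from the paper.

```python
def find_overlap(train_ids, eval_ids):
    """Return the sorted track IDs that occur in both collections."""
    return sorted(set(train_ids) & set(eval_ids))


# Hypothetical track IDs, for illustration only.
train_ids = [101, 102, 103, 104]
eval_ids = [103, 104, 105]

overlap = find_overlap(train_ids, eval_ids)

# One possible remedy: drop overlapping tracks from the evaluation set.
overlap_set = set(overlap)
eval_clean = [t for t in eval_ids if t not in overlap_set]
```

An empty `overlap` would support the claim that the two sets are disjoint; a non-empty one would need to be either filtered out or acknowledged.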

The paper has no major novelty in terms of new ideas, but the reviewers nonetheless agree that the paper presents a very systematic and thorough set of experimental results that suggests best practices and can help guide other researchers in the field.

Review 1:

Q2 (I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

Best practices (concerning loss functions, hyper-parameters, degradations) are given based on systematic experiments. Code and model parameters will be published and provide a good starting point for other researchers.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

To make Music Fingerprinting work best, use triplet loss and focus on a careful and realistic design of the degradation pipeline.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Review 2:

Q2 (I am an expert on the topic of the paper.)

Strongly agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly disagree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Disagree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Strongly Disagree (Well-explored topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The authors experimented with different ways to improve the performance of an existing fingerprinting system. Those could possibly inform other researchers when designing their own system.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

The authors proposed to improve an existing machine learning-based fingerprinting approach in a number of ways, showing increases in performance.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak reject

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

General comments:
- I feel that the title, abstract, and listed contributions are a bit misleading. Essentially, the authors are proposing various ways to improve an existing method. While incremental works are not proscribed, I would clarify this from the very beginning. Besides that point, I am a bit on the fence, as this work feels more like a technical report to me, but the reported increases in performance are significant. I would have perhaps framed this work differently and re-organized some of the parts. See my detailed comments below.

Detailed comments:

I feel that the introduction could be improved. You kind of directly start talking about machine learning-based systems without mentioning the successful unsupervised methods which came first, such as scalable peak-based approaches like in the Shazam algorithm. It would be nice to first convince the reader that these new "neural" fingerprinting systems are "better," or at least more promising.
More accurately, triplet loss is a function, not an approach.
"Representation quality is improved by each anchor sample in a batch having multiple positive samples." Is this claim backed up by reference [15]?
I would briefly explain what the NT-Xent loss is.
"A&U [17] proposes two metrics that good representations should obtain..." I am not sure I understand this. Could you perhaps rephrase?
You mentioned the number of anchors and positives per anchor in a batch. At this point, I am unclear if you are going to use negatives as well.
While higher sampling frequencies may not be necessary for music identification, they could definitely help, as you are getting more information which can possibly survive noise, for example. And I don't think that it's uncommon for audio to be transmitted at higher rates (8k is fairly low).
I am not sure that you really need Table 1. You are basically proposing cheaper parameters compared to other systems.
I would just explicitly mention the problems of NAFP rather than giving a link to an open GitHub issue.
What does it mean that some query tracks were represented many times? Are you saying that some queries were used multiple times and some not at all in the final evaluation? How come??
Instead of "industrial evaluation," I would rather talk about "real-world evaluation."

4. - I feel the organization is a bit confusing. Perhaps Sections 3, 4, and 5 could be re-organized better, combining some parts?

Review 3:

Q2 (I am an expert on the topic of the paper.)

Strongly disagree

Q3 (The title and abstract reflect the content of the paper.)

Strongly disagree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly disagree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly disagree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly disagree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly disagree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The paper provides actionable and well-validated best practices for training neural AFP models. Insights into batching, IR treatment, loss function tuning, and metric learning are broadly useful. The discussion of false negatives in batch construction and the benefit of lowering the frequency threshold are especially relevant for practitioners in MIR and related domains.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

A carefully tuned neural fingerprinting system using realistic degradations and triplet loss significantly outperforms previous approaches, setting a new standard for audio/music fingerprinting.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

Dear authors, thank you for this well-executed paper.

The paper demonstrates strong scientific quality and clarity, combining thoughtful engineering with rigorous experimentation. The related work is comprehensive and current, and the methodological contributions are clearly motivated. Your evaluation setup addresses multiple weaknesses of previous works, and your ablation studies are among the most useful I’ve seen in neural AFP literature.

Your "best practices" approach is highly pragmatic and well-executed. Each step, from refining the degradation pipeline, to resolving false negatives in batches, to restoring low frequencies, is clearly justified and shown to contribute meaningfully to the final performance. The improvements in Table 2 are especially illustrative. I also appreciate the decision to lower the frequency threshold from 300 Hz to 160 Hz, which makes practical sense for music recorded in noisy real-world environments.

Your exploration of metric learning losses is another strong point. I found the analysis of NT-Xent’s degradation with more than one positive per anchor particularly insightful. As you suggest, this behavior likely stems from the softmax denominator not being able to simultaneously assign high similarity to multiple positives. It would be interesting to explore modified loss formulations that either decouple positives or replace the softmax entirely. This direction could lead to more robust representations in AFP and beyond.
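The softmax argument above can be illustrated numerically. The sketch below is not the paper's implementation: it assumes that, when an anchor has several positives, each positive is scored with a softmax whose denominator still contains the other positives. The function name `nt_xent_anchor_loss`, the similarity values, and the temperature are illustrative assumptions.

```python
import numpy as np


def nt_xent_anchor_loss(pos_sims, neg_sims, tau=0.05):
    """Mean NT-Xent loss over the positives of a single anchor.

    Assumption: every positive competes in a softmax over ALL other
    samples, so the remaining positives sit in the denominator.
    """
    pos = np.asarray(pos_sims, dtype=float) / tau
    neg = np.asarray(neg_sims, dtype=float) / tau
    logits = np.concatenate([pos, neg])
    # Numerically stable log-sum-exp for the shared denominator.
    m = logits.max()
    log_denominator = m + np.log(np.exp(logits - m).sum())
    return float(np.mean(log_denominator - pos))


negatives = [-1.0] * 10  # negatives pushed as far away as possible

# One perfectly aligned positive: the loss can reach (almost) zero.
one_positive = nt_xent_anchor_loss([1.0], negatives)

# Two perfectly aligned positives: they split the softmax mass,
# so the loss cannot drop below about log(2).
two_positives = nt_xent_anchor_loss([1.0, 1.0], negatives)
```

Under this assumption, even an ideal embedding leaves a residual loss of roughly log(P) for P positives per anchor, which is consistent with the explanation given above. A margin-based loss such as the triplet loss has no such shared normalizer.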

A few suggestions for improvement:
* Including pitch and tempo changes in your degradation pipeline would significantly strengthen your robustness claims, as these are common in music identification use cases.
* Consider adding Top-3 or Top-5 hit rate in future work; these are often relevant in practical systems where near matches matter.
* Provide analysis of false positives, i.e. when the right result is not the first one found. What would be the main reason(s) to see such false positive matches?
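As a rough illustration of the Top-k suggestion, the hit rate could be computed as below. This is a minimal sketch assuming that the retrieval stage returns a ranked list of candidate track IDs per query; the function name `top_k_hit_rate` and the toy data are hypothetical.

```python
def top_k_hit_rate(ranked_ids, true_ids, k):
    """Fraction of queries whose true track is among the first k results."""
    hits = [true in ranked[:k] for ranked, true in zip(ranked_ids, true_ids)]
    return sum(hits) / len(hits)


# Toy retrieval output: one ranked candidate list per query.
ranked = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
truth = ["a", "e", "z"]  # the third query's true track is never retrieved

top1 = top_k_hit_rate(ranked, truth, 1)
top3 = top_k_hit_rate(ranked, truth, 3)
```

Queries that hit at Top-3 but miss at Top-1 (the second query here) are also the natural cases to inspect for the false-positive analysis requested above.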

Finally, it’s excellent that you plan to release the code, models, and curated data. This greatly boosts the paper's impact and will support reproducibility and further research in the field.

P4-4: Enhancing Neural Audio Fingerprint Robustness to Audio Degradation for Music Identification

Recep Oguz Araz, Guillem Cortès-Sebastià, Emilio Molina, Joan Serra, Xavier Serra, Yuki Mitsufuji, Dmitry Bogdanov

Presented In-person

4-minute short-format presentation