Abstract:

Automatic sample identification (ASID) - the detection and identification of portions of audio recordings that have been reused in new musical works - is an essential but challenging task in the field of audio query-based retrieval. While a related task, audio fingerprinting, has made significant progress in accurately retrieving musical content under "real world" (noisy, reverberant) conditions, ASID systems struggle to identify samples that have undergone musical modifications. Thus, a system robust to common music production transformations such as time-stretching, pitch-shifting, effects processing, and underlying or overlaying music is an important open challenge. In this work, we propose a lightweight and scalable encoding architecture employing a Graph Neural Network within a contrastive learning framework. Our model uses only 9% of the trainable parameters compared to the current state-of-the-art system while achieving comparable performance, reaching a mean average precision (mAP) of 44.2%. To enhance retrieval quality, we introduce a two-stage approach consisting of an initial coarse similarity search for candidate selection, followed by a cross-attention classifier that rejects irrelevant matches and refines the ranking of retrieved candidates - an essential capability absent in prior models. In addition, as queries in real-world applications are often short in duration, we benchmark our system for short queries using new fine-grained annotations for the Sample100 dataset, which we publish as part of this work.
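
As a rough illustration of the two-stage retrieval the abstract describes (coarse similarity search, then a classifier that rejects and re-ranks candidates), here is a minimal sketch. It is not the authors' implementation: `toy_classifier` is a hypothetical stand-in for the cross-attention classifier, and the embeddings, `k`, and `threshold` values are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def coarse_search(query, database, k):
    """Stage 1: return the ids of the k references most similar to the query."""
    ranked = sorted(database.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [ref_id for ref_id, _ in ranked[:k]]

def refine(query, candidates, database, classifier, threshold):
    """Stage 2: score candidates, reject those below threshold, re-rank the rest."""
    scored = [(ref_id, classifier(query, database[ref_id])) for ref_id in candidates]
    kept = [(ref_id, score) for ref_id, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

def toy_classifier(query, reference):
    # Hypothetical placeholder for a learned classifier:
    # maps cosine similarity from [-1, 1] to a score in [0, 1].
    return 0.5 * (cosine(query, reference) + 1.0)

# Illustrative embeddings (assumed, not taken from the paper).
database = {
    "ref_a": [1.0, 0.0, 0.0],
    "ref_b": [0.9, 0.1, 0.0],
    "ref_c": [0.0, 1.0, 0.0],
    "ref_d": [-1.0, 0.0, 0.0],
}
query = [1.0, 0.05, 0.0]

candidates = coarse_search(query, database, k=3)
matches = refine(query, candidates, database, toy_classifier, threshold=0.9)
```

Here the coarse stage keeps three candidates and the second stage rejects `ref_c`, which is what distinguishes this setup from a plain nearest-neighbour lookup that always returns its top k.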

Meta Review:

Q2 (I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Disagree

Q5 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))

It only compares with one previous system that doesn't seem to be peer-reviewed. Several references are listed incompletely.

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Disagree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Disagree

Q15 (Please explain your assessment of reusable insights in the paper.)

It is unclear to me whether the system presented in the paper actually improves on a previous (not peer-reviewed) method that claims to be state of the art. The lack of comparison with other baselines and the closeness of the mAP scores call into question whether the presented system is a meaningful improvement.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Better sample detection inspired by fingerprinting.

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Weak reject

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The authors present a system of sample detection based on a GNN with a classifier. The topic of sample detection is somewhat underexplored, so I appreciate the work on it. The main contributions are improved annotations for an existing dataset, a meaningful way of augmenting the training data, and the application of a GNN plus a classifier for retrieving the candidates.

I was a bit surprised by the general premise that a fingerprinting approach works for sample detection - it would be interesting to see a more detailed analysis on mixing levels and detection accuracy.

My main concern is the lack of comparison with previous systems (the only one is from a non-peer-reviewed study) and that the results seem to show no improvement (mAP .442 vs .441) over this previous system. What makes me more skeptical is that the results for the larger comparison system are omitted from Table 3; only the results for the smaller, inferior system are presented for comparison.
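
For context on the metric being compared, the following is a minimal sketch of how mean average precision is commonly computed for ranked retrieval. The paper's exact evaluation protocol may differ, and the query ids and rankings below are made up for illustration.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@rank taken at each rank where a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant items that are never retrieved contribute zero.
    return precision_sum / len(relevant)

def mean_average_precision(rankings, ground_truth):
    """Mean of per-query average precision over all queries in `rankings`."""
    aps = [average_precision(rankings[q], ground_truth[q]) for q in rankings]
    return sum(aps) / len(aps)

# Illustrative data (assumed): two queries with their ranked results.
rankings = {"q1": ["a", "x", "b"], "q2": ["y", "c"]}
ground_truth = {"q1": ["a", "b"], "q2": ["c"]}

ap_q1 = average_precision(rankings["q1"], ground_truth["q1"])  # (1/1 + 2/3) / 2
ap_q2 = average_precision(rankings["q2"], ground_truth["q2"])  # (1/2) / 1
map_score = mean_average_precision(rankings, ground_truth)
```

Because each query contributes equally to the mean, a change in the rank of a single relevant item can shift the overall score noticeably on a small test set, which is one reason a difference in the third decimal place is hard to interpret without further analysis.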

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Weak accept

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

The authors present an approach for sample identification with GNNs that seems to perform on par with or outperform the state of the art at considerably lower complexity. Sample identification is a fascinating yet generally underexplored task, and the detailed annotations added to an existing dataset are an important contribution to the field. In addition, the augmentation strategy used during training is a neat idea for this task.

The main weak point of the paper is that a direct comparison against the state of the art is missing, likely because of resource constraints, which prevents a decisive conclusion. The description of the methodology could also be improved in some parts.

The discussion mainly focused on whether the existing contributions of the paper outweigh the shortcomings of the evaluation, making this a borderline decision.

Review 1:

Q2 (I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The paper provides insights applicable to audio retrieval tasks beyond sample identification. The two-stage retrieval architecture, use of GNNs, and the refined annotations in Sample100 will benefit other researchers. The cross-attention classifier design and fine-grained evaluation pipeline can be reused or adapted in similar domains.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

A lightweight GNN-based sample identification system is proposed that matches SOTA performance and adds retrieval refinement through cross-attention, supported by an improved dataset and clear evaluation.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

Dear authors, thank you for your clear and well-structured contribution.

The paper presents a lightweight GNN-based model for automatic sample identification (ASID), combined with a retrieval refinement step using cross-attention and a detailed evaluation pipeline.

The task itself (sample identification under realistic transformations) is highly relevant and practically valuable, particularly for industry use cases like rights management. The approach of combining approximate nearest-neighbor search with a learned ranking function is well-motivated.

The extension of the Sample100 dataset with fine-grained annotations is one of the strongest contributions of the paper, and I greatly appreciate that both the dataset and code are shared for reproducibility. This will support follow-up research in this area. Your analysis of different sample types (beat vs. riff) and time-stretching levels (Table 4) provides helpful insights and reflects a thoughtful evaluation design.

There are a few points where the paper could be clarified:

  • The equation for mean-pooling (eq 4) seems a bit redundant, as the operation is trivial and well known.
  • It might be beneficial to discuss and analyze false positives, to see if there is a pattern, e.g. similar beats or riffs.

Overall, this is a useful and timely contribution with a clear structure, solid methodology, and strong practical grounding. I recommend acceptance.

Review 2:

Q2 (I am an expert on the topic of the paper.)

Disagree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q5 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))

"n/a"

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly disagree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q10 (Please justify the previous choice (Required if "Strongly Disagree" or "Disagree" is chosen, otherwise write "n/a"))

"n/a"

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Disagree

Q15 (Please explain your assessment of reusable insights in the paper.)

The use of source separation to synthesize query-reference pairs for automatic sample identification is clever.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

A new two-stage framework based on a graph neural network for automatic sample identification.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak reject

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

Please find below my review

====== REVIEW BEGINS HERE ======

The paper proposes a new architecture for the ASID task. The architecture is composed of a lightweight GNN as the encoder and a multi-head attention-based classifier. The authors also open-source the code and release extended annotations of the Sample100 dataset to benefit the community working on related research. Overall, the authors made some interesting design choices and provide good analysis of query length and sample characteristics. However, I have several comments about the paper:

First, the clarity of the methodology section. In my opinion, the methodology is hard to follow because of some unnecessary equations and notation. For example, in Section 3.1, I think the notation for the waveform y or the fixed duration t_seg does not add clarity; a plain-text description would be clearer. Similar issues exist elsewhere: equation (2) is a standard GNN operation, but the authors do not state any new information beyond it. The same applies to equations (3) to (5). Some of this material describes standard operations or does not need an equation to convey the idea. This considerably affects the clarity of the paper.

Second, the clarity of Figure 1. It is hard to understand the concept from Figure 1 alone. Although the figure can be understood after reading the methodology section, I think the authors can further improve its clarity. For example, the dotted arrow from “database of reference songs” to “list of candidates” was unclear to me at first glance. Also, why do the three query embeddings merge into the correct match, and how are they merged? This information can only be understood after reading the paper. Similar issues apply to the multi-head cross-attention part. The embedding matrices NM_q and NM_r appear abruptly. Does the correct match in stage (A) denote the same thing as in stage (B)? Do A and B represent sequential processing, or what is the relationship between them?

Third, a potentially misleading contribution in the evaluation section. In Table 2, the authors compare the proposed method with the SOTA model and further evaluate performance with different batch sizes. However, since the authors already use a SimCLR-based learning framework, it is not surprising that increasing the batch size improves performance. The paragraph from line 440 to 448 may easily mislead readers who are unfamiliar with SimCLR into interpreting this as a contribution of this paper.
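
To make the batch-size point concrete, here is a minimal sketch of an NT-Xent style contrastive loss as used in SimCLR-like training. It assumes L2-normalised embeddings stored so that items 2i and 2i+1 form a positive pair; the temperature value and the convention that "batch size" counts pairs are assumptions, not details taken from the paper.

```python
import math

def nt_xent(embeddings, temperature=0.1):
    """NT-Xent loss over 2N unit vectors; items 2i and 2i+1 are positive pairs."""
    n = len(embeddings)

    def sim(u, v):
        # Dot product of unit vectors (cosine similarity), scaled by temperature.
        return sum(a * b for a, b in zip(u, v)) / temperature

    total = 0.0
    for i in range(n):
        pos = i ^ 1  # index of the partner in the same pair (0<->1, 2<->3, ...)
        logits = [sim(embeddings[i], embeddings[j]) for j in range(n) if j != i]
        log_denominator = math.log(sum(math.exp(x) for x in logits))
        total += log_denominator - sim(embeddings[i], embeddings[pos])
    return total / n

def negatives_per_anchor(num_pairs):
    """Each anchor is contrasted against both views of every other pair."""
    return 2 * (num_pairs - 1)

# Toy check: pairs that agree give a low loss, mismatched pairs a high one.
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
misaligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
loss_aligned = nt_xent(aligned)
loss_misaligned = nt_xent(misaligned)
```

Under this formulation the number of in-batch negatives grows linearly with the batch, for example `negatives_per_anchor(1024)` versus `negatives_per_anchor(256)`, so a gain from larger batches is an expected property of the framework rather than of any particular encoder.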

In the end, my decision for this paper is weak reject. Although the authors demonstrate some very good starting points for the project and definitely contribute to the MIR community, I expect the authors to further improve the clarity of the paper.

Some minor comments about the clarity or possible missing references:

  • line 171: k-nearest neighbour graph → k-nearest neighbors graph
  • line 149 & line 160: “mel-spectrogram” and “spectrogram” share the same notation(?). I think the spectrogram in line 160 should also be “mel-spectrogram”, not “spectrogram”.
  • line 393: missing a reference for the Adam optimizer
  • line 235: it would be better to mention K=4 in this paper.

====== REVIEW ENDS HERE ======

Review 3:

Q2 (I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

Not only does this paper provide a new SOTA mAP result for performance on the Sample100 dataset, as well as advocate for the efficacy of such GNN-based models (in a CNN-dominated space), it also augments the Sample100 dataset with extended annotations, including more fine-grained temporal annotations, as well as annotations for repeat sample occurrences (as opposed to just the first occurrence per reference). This should do wonders for others in the community working in this space.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

This paper focuses on the under-explored area of music sample identification, leveraging the power of lightweight GNNs that have recently proven effective in fingerprinting, applying a second-stage classifier for refinement, combined with a novel augmentation strategy for query segments, resulting in a new SOTA baseline for one of the few open datasets available in the space, not to mention augmenting said dataset for further use by the community.

Q17 (Would you recommend this paper for an award?)

Yes

Q18 (If yes, please explain why it should be awarded.)

I would recommend the paper for an award because it is well-written, thorough in implementation details, builds upon previous work in the space, offers a new SOTA result, and offers an augmentation of one of the few datasets available in the space. It also demonstrates good and modern use of several areas of MIR, including fingerprinting, audio GNNs, multi-head attention (that "cool new thing"), beat tracking, source separation, time stretching, pitch-shifting, etc. Very impressive.

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Strongly agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

I think this paper is exceptional. I really like that it was able to build upon the successes of GNNs in audio fingerprinting where CNNs have had such popularity recently, and established a new SOTA baseline. The explanation of the methodology was clear and easy to follow, and the results speak for themselves. The augmentation to the existing dataset was really a cherry on top.

I do have a few criticisms/questions/comments though -

The existing SOTA does feel a bit slighted in that it wasn't re-implemented for this paper to compare against the latest results. Not having implemented this existing baseline myself, I found myself wondering, exactly how far outside of the realm of practical are the computational needs to implement a ResNet50-IBN architecture as compared to the ResNet18-IBN model implemented in its place? Was the A100 GPU not suitable? I do understand that the replacement model used a number of parameters on the order of magnitude of the model proposed by this paper, which is part of the allure, but it doesn't feel like an entirely fair comparison to say you beat the benchmark. Maybe the existing SOTA architecture would have benefited from the Sample100 dataset augmentations added to this paper and performed even better? It is hard to know without trying.

It was neat to see the ablation study showing the effectiveness of the GNN approach of this paper with and without the MHCA piece, as well as with varying batch sizes. I do wonder though why you stopped at batch size 1024 once you narrowly stepped ahead of the reported SOTA mAP score - why not try one or two even larger batch sizes to see if you could knock it out of the park? Were the returns already diminishing? Were the computational needs bordering on impractical? Perhaps some sort of memory pressure or something? I just found myself wondering if there was some additional runway here.

Regarding the augmentation of the reference segments - were those augmented reference segments persisted and re-used, or were new augmentations created on-the-fly for every single batch?

Also, the augmentation step described in equation (8) seems a bit wild to me in general - does it make sense to apply ALL FOUR transformations (time-offset, gain variation, pitch-shifting, time-stretching) to every single reference segment? Might it make sense to only apply one or a few, here or there? Also, to make the augmentations seem more plausible, might it make more sense to look a bit more in depth at the nature of augmentations used in Sample100 to try and closer-replicate more realistic transforms? For example, time-stretching a segment in the uniformly sampled range of 70-150% and then mixing with the remaining stems seems a bit unrealistic of real music sampling. Wouldn't a beat-matched time-stretch, even at a higher or lower tempo octave of the reference drums stem be more plausible?
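
The two alternatives raised above can be sketched roughly as follows: applying each transformation independently with some probability instead of always applying all four, and restricting time-stretch to beat-matched ratios (including half and double tempo). This is a hypothetical illustration only. The 0.7 to 1.5 stretch range comes from the 70-150% figure mentioned above, interpreted here as a tempo ratio; every other range, the probability `p_apply`, and the function names are assumptions.

```python
import random

def sample_augmentation(rng, p_apply=0.5):
    """Decide independently per transform whether to apply it; None means skipped."""
    params = {
        "time_offset_s": None,
        "gain_db": None,
        "pitch_semitones": None,
        "stretch_ratio": None,
    }
    if rng.random() < p_apply:
        params["time_offset_s"] = rng.uniform(-0.25, 0.25)  # assumed range
    if rng.random() < p_apply:
        params["gain_db"] = rng.uniform(-10.0, 0.0)         # assumed range
    if rng.random() < p_apply:
        params["pitch_semitones"] = rng.randint(-3, 3)      # assumed range
    if rng.random() < p_apply:
        params["stretch_ratio"] = rng.uniform(0.7, 1.5)     # 70-150% as a tempo ratio
    return params

def beat_matched_ratios(sample_bpm, target_bpm, low=0.7, high=1.5):
    """Tempo ratios that align the sample with the target tempo or its
    half/double ("tempo octaves"), kept only if inside [low, high]."""
    candidates = [target_bpm * multiple / sample_bpm for multiple in (0.5, 1.0, 2.0)]
    return [ratio for ratio in candidates if low <= ratio <= high]

rng = random.Random(0)            # seeded for repeatability
params = sample_augmentation(rng)
ratios = beat_matched_ratios(sample_bpm=100.0, target_bpm=80.0)
```

With a 100 BPM sample and an 80 BPM target, only the direct match falls inside the allowed range, so the stretch would be drawn from a small discrete set rather than uniformly from the whole interval.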

I thought Table 3 was quite helpful in showing model performance at varying query lengths, particularly that the proposed model excels with longer query lengths, but what about shorter query lengths? Using a minimum length of 5s feels even a bit long, but I could be wrong. This also got me wondering about the distribution of actual sample lengths present in the dataset.

On the note of dataset transparency, I also liked that Table 4 sort of presented the proportion of samples in the dataset that exhibited certain characteristics (and the success of the model in these specific areas), but it also left me kind of seeking additional information. Specifically, it broke down time stretches into >5% and <5% buckets, but this got me curious - how much more than 5%? The time-stretching augmentation in this paper varies from 70-150% - does that cover the ranges found here? Also, the "1-note" type of sample class present in the table with no mAP result doesn't seem to be explained anywhere. Are there any other sample types in the dataset?

Thanks again for your paper, I really enjoyed reading it.