P7-10: Joint Object Detection and Sound Source Separation
Sunyoo Kim, Yunjeong Choi, Doyeon Lee, Seoyoung Lee, Eunyi Lyou, Seungju Kim, Junhyug Noh, Joonseok Lee
Subjects: Multimodality; Sound source separation; Open Review; MIR tasks; MIR fundamentals and methodology
Presented In-person
4-minute short-format presentation
We propose See2Hear (S2H), a framework that jointly learns audio-visual representations for object detection and sound source separation from videos. Existing methods do not fully exploit the synergy between the detection and separation tasks, often relying on disjointly pre-trained visual encoders. S2H integrates both tasks in an end-to-end trainable unified structure using transformer-based architectures. A naive combination of the two tasks, however, results in suboptimal performance. To resolve this issue, we propose a dynamic filtering mechanism that selects relevant object queries from the object detector. We conduct extensive experiments to verify that our approach achieves state-of-the-art performance in audio source separation on the MUSIC and MUSIC-21 datasets, while maintaining competitive object detection performance. Ablation studies confirm that joint training of detection and separation is mutually beneficial for both tasks.
Q2 ( I am an expert on the topic of the paper.)
Strongly agree
Q3 ( The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work.)
Strongly disagree
Q5 ( Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))
The related work section, and more generally the whole paper, refers mostly to work published at ML and computer vision venues and only scarcely to work published in audio/music conferences and journals. The reference list is very long (and many entries are not pertinent), yet a significant number of relevant references are clearly missing, for instance work by Sanjeel Parekh et al., Zhiyao Duan et al., Cynthia Liem et al., and others.
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Disagree
Q10 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))
The evaluation is not convincing. Details on the datasets used are missing, which makes it hard to assess the difficulty of the task: how many concurrent sources are played? What is the initial SDR of a naive separator that simply outputs the mixture? How would a pure state-of-the-art audio-only separator perform? The demo example provided is not explained; it is difficult to understand what is being seen and heard, so it is not a convincing demo.
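To make the requested baseline concrete, here is a minimal sketch (assuming mir_eval's bss_eval implementation and ground-truth stems as an (n_sources, n_samples) array; the function name is illustrative) of the "naive separator that outputs the mix" SDR asked for above:

```python
import numpy as np
import mir_eval  # pip install mir_eval

def naive_mix_sdr(sources):
    """SDR obtained by outputting the unprocessed mixture for every source.

    sources: np.ndarray of shape (n_sources, n_samples) with the ground-truth stems.
    Returns the mean SDR of this do-nothing baseline.
    """
    mixture = sources.sum(axis=0)
    estimates = np.tile(mixture, (sources.shape[0], 1))  # same mix used as every estimate
    sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(sources, estimates)
    return sdr.mean()
```

Reporting this number per dataset would let readers judge how much of the reported SDR is actual separation gain.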
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Disagree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Strongly disagree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Disagree
Q15 (Please explain your assessment of reusable insights in the paper.)
The paper builds on previous work [14] and does not introduce new concepts. Besides, reproducibility is rather low: no code is published and the datasets used seem to be only partially accessible.
Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)
A joint object detection and audio source separation model trained in an end-to-end fashion.
Q17 (This paper is of award-winning quality.)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Disagree
Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)
Strong reject
Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
This paper addresses the relevant problem of audio-visual music source separation and is generally well written. However, it suffers from several significant limitations.
Literature Positioning: The related work review is skewed toward computer vision literature, with insufficient coverage of prior work in the audio/music source separation domain.
Experimental Validation: The experiments lack rigor. Dataset details are minimal, the task complexity is not discussed, and the number of concurrent sources is unclear. There is no comparison with established audio-only music source separation baselines, and the demo is neither clearly described nor convincing.
Novelty: The proposed method appears to be a minor extension of existing work, possibly by the same authors.
Reproducibility: The absence of code and the only partial availability of the datasets significantly hinder reproducibility. The implementation details provided in the paper are not sufficient to easily reproduce the work.
Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)
Weak accept
Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))
The reviewers had quite different opinions. After discussion, however, it was agreed that a substantial number of the weaknesses pointed out by the reviewers are minor and could be addressed in the final version, and that the paper has merits that justify acceptance.
Q2 ( I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Disagree
Q5 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))
- References: one of the core contributions of this paper is a joint end-to-end audio-visual model. To my knowledge this may be novel for music models (assuming [3] was published later), but it is already prior art in audio-visual speech separation, such as [1] and [2]. I would ask the authors to add a short discussion with references to speech separation models in the appropriate sections.
- [1] Samuel Pegg, Kai Li, Xiaolin Hu, TDFNet: An Efficient Audio-Visual Speech Separation Model with Top-down Fusion
- [2] Kai Li, Runxuan Yang, Fuchun Sun, Xiaolin Hu, IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation
- [3] Yinfeng Yu, Shiyu Sun, DGFNet: End-to-End Audio-Visual Source Separation Based on Dynamic Gating Fusion
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Strongly agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Strongly agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
The authors make clear that combining different pretrained models can be done more efficiently when the outputs of some of them are filtered. This is not a typical approach seen in other papers.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
joint visual and audio branches help to separate instruments
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Disagree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Weak accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
The paper introduces See2Hear (S2H), an end-to-end transformer-based framework that jointly performs object detection and sound source separation in videos. Unlike prior methods that treat these tasks separately, S2H integrates them using shared visual and auditory representations. The authors design a dynamic filtering mechanism to prune irrelevant object queries before cross-modal fusion, enhancing both efficiency and accuracy. Evaluated on the MUSIC and MUSIC-21 datasets, S2H seems to achieve convincing results in sound separation while jointly training detection. Ablation studies validate the benefit of this joint learning.
Core Contributions
- Unified End-to-End Framework:
Proposes a joint architecture that simultaneously performs object detection and sound source separation using shared transformer encoders and decoders.
- Dynamic Query Filtering Mechanism:
Introduces a method to filter out low-confidence or redundant object detections (bounding boxes) before fusion, improving both sound separation quality and detection precision.
- Transformer-Based Audio-Visual Fusion:
Utilizes cross-attention between visual object queries and audio tokens, allowing fine-grained association between objects and their sounds (a generic sketch of this kind of fusion is given after this list).
- State-of-the-Art Results:
Achieves superior SDR, SIR, and SAR performance over existing baselines (e.g., iQuery, Sound-of-Pixels) on MUSIC and MUSIC-21 datasets.
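For readers less familiar with the cross-attention fusion summarized above, here is a minimal sketch (dimensions, module choice, and class name are my assumptions, not the authors' implementation) of audio tokens attending to the detector's object queries:

```python
import torch
import torch.nn as nn

class AudioVisualCrossAttention(nn.Module):
    """Audio tokens attend to detector object queries (generic sketch, not the S2H code)."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_tokens: torch.Tensor, object_queries: torch.Tensor) -> torch.Tensor:
        # audio_tokens:   (batch, n_audio_tokens, dim)
        # object_queries: (batch, n_kept_queries, dim), e.g. the queries kept after filtering
        fused, _ = self.attn(query=audio_tokens, key=object_queries, value=object_queries)
        return self.norm(audio_tokens + fused)  # residual connection around the fusion
```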
Concerns & Suggestions
- Dependence on pre-trained models: The model relies on pseudo-ground-truth bounding boxes generated by an external detector (Detic). For the audio part, the model relies on pre-trained weights from the AST model (likely pre-trained on the noisy AudioSet), and for the core visual classifier it relies on a pre-trained DETR model. This weakens the claims of full end-to-end learning and could limit applicability in less structured domains. As the authors mention, they retrained "all baselines on the same set of currently available videos to ensure a fair comparison", but it would have been even fairer to take the pre-trained models out of the equation, either by training all models from scratch or by using the same original datasets (such as AudioSet in this case).
- I am not deeply familiar with the MUSIC/MUSIC-21 datasets, so I find it difficult to understand the objective of the task, specifically whether permutation plays a role: are two instruments of the same kind (e.g., violin + viola) ever mixed? In that case, wouldn't the loss function have to be permutation invariant to produce meaningful results? I encourage the authors to discuss whether that is the case, or whether the query-based visual branch makes PIT-style losses unnecessary (see the PIT sketch after this list).
- Limited Evaluation Scope: The evaluation is restricted to musical instruments in controlled settings (MUSIC/MUSIC-21). It remains unclear how well the approach generalizes to more complex scenes, diverse object categories, or real-world noise. It is also not clear which ablation, if any, completely disables the video branch. As I understand it, this kind of ablation is common in audio-visual speech separation to demonstrate how well separation from audio alone would work. If this is not easily possible in this framework, I would at least suggest an ablation experiment in which the bounding boxes are perturbed with random noise or the input image itself is corrupted.
- Audio separation baseline missing: In the same direction as above, I would strongly suggest adding an audio-only baseline separation model so that the reader can understand what the audio-visual model contributes beyond a purely audio approach. This baseline should be trained on the same data to allow a fair comparison.
- Computational Complexity & Scalability: The transformer-based architecture and fusion module may be computationally heavy, especially in multi-object or high-resolution settings. Runtime and memory usage analysis would strengthen the paper.
- Lack of clarity on audio analysis and synthesis: the paper describes in paragraph 5.1 how the masking is carried out in practice. However, I found it difficult to understand how exactly the mask is computed: is the mask complex-valued or real-valued? Which STFT parameters were used? How does the model deal with higher sampling rates such as 44.1 kHz if it was only trained on 11 kHz inputs? How were the video frames selected, given that the hop size / sampling rate of video and audio differ? Was interpolation used? (See the masking sketch after this list for the level of detail I have in mind.)
- Temporal positional encoding of video frames: the paper mentions that, for each sample, 3 video frames are sampled. From each of the 3 frames, a number of bounding boxes is estimated and features are inferred. If this is correct, I wonder how the bounding boxes were tracked across the 3 frames and how the model can utilize temporal video information. Imagining a violinist moving their bow, 3 frames might already be enough to encode temporal information. I therefore wonder whether (and if not, why not) temporal positional encoding was used for the video frame index.
- Lack of Qualitative Detection Results: While separation results are illustrated, visual detection outcomes are not qualitatively analyzed or benchmarked beyond mAP/mIoU, making it hard to assess detection reliability.
- Please use a spell checker
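On the permutation-invariance question raised above (same-class sources such as violin + viola): a minimal sketch, under the assumption of exactly two sources and magnitude-spectrogram targets (the function name and loss choice are illustrative), of the PIT-style loss that would be needed if the visual queries did not already fix the output ordering:

```python
import torch

def pit_l1_loss(est: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    """Permutation-invariant L1 loss for exactly two sources.

    est, ref: (batch, 2, freq, time) estimated / reference magnitude spectrograms.
    The loss of the better of the two possible source orderings is used.
    """
    loss_keep = (est - ref).abs().mean(dim=(1, 2, 3))                 # (e1, e2) vs (r1, r2)
    loss_swap = (est - ref.flip(dims=[1])).abs().mean(dim=(1, 2, 3))  # (e1, e2) vs (r2, r1)
    return torch.minimum(loss_keep, loss_swap).mean()
```

If each separation output is instead tied to a specific detected object, the ordering is determined by the conditioning and a plain (non-PIT) loss suffices; clarifying which case applies would address this concern.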
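On the masking question above: a minimal sketch of a real-valued ratio mask with mixture-phase resynthesis, where every parameter (11 kHz audio, 1022-point STFT, hop size 256) is an illustrative assumption rather than what the paper necessarily uses. This is the level of detail Section 5.1 should state explicitly (mask type, STFT parameters, resampling strategy for 44.1 kHz inputs):

```python
import torch

def ideal_ratio_mask(source_wav: torch.Tensor, mix_wav: torch.Tensor,
                     n_fft: int = 1022, hop: int = 256):
    """Real-valued ratio mask on magnitude spectrograms (all parameters illustrative)."""
    window = torch.hann_window(n_fft)
    S = torch.stft(source_wav, n_fft, hop_length=hop, window=window, return_complex=True)
    M = torch.stft(mix_wav, n_fft, hop_length=hop, window=window, return_complex=True)
    mask = (S.abs() / M.abs().clamp(min=1e-8)).clamp(max=1.0)
    # Resynthesis reuses the mixture phase:
    est = torch.istft(mask * M, n_fft, hop_length=hop, window=window,
                      length=mix_wav.shape[-1])
    return mask, est
```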
Q2 ( I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Strongly agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Strongly disagree
Q5 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))
The following paper is closely related and should be at least mentioned or even compared with. Rahman, Tanzila, and Leonid Sigal. "Weakly-supervised audio-visual sound source detection and separation." 2021 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2021.
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q10 (Please justify the previous choice (Required if "Strongly Disagree" or "Disagree" is chosen, otherwise write "n/a"))
"n/a"
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Disagree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
This paper shows how audio and visual tokens could be jointly trained. This example inspires me on how to tackle the problem of facial expression detection and speaker diarization at the same time.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
This paper shows an example of how audio and visual tokens could be jointly used to train in a single model, allowing gradients from both tasks, object detection and source separation, to update the shared representation space.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Strong accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
The main strength of this paper is that it shows how audio and visual tokens can be jointly used to train a single model, allowing gradients from both tasks, object detection and source separation, to update the shared representation space; such an update is not an easy task. The main weaknesses or limitations are:
- Although two publicly available datasets are used and no more challenging dataset (e.g., with multiple instruments in the same video) is available, evaluating the proposed method on such a challenging dataset is necessary; to advocate the proposed method, such a dataset would need to be created first.
- The authors should state that the source code will be released soon, as such joint training is not easy to reproduce.
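To make the "gradients from both tasks update the shared representation" point above concrete, here is a minimal sketch (module names and the loss weighting are assumptions, not the paper's code) of a shared backbone feeding two task heads trained on a summed loss:

```python
import torch.nn as nn

class JointDetectionSeparation(nn.Module):
    """Shared visual backbone feeding a detection head and a separation head (sketch only)."""

    def __init__(self, backbone: nn.Module, detection_head: nn.Module, separation_head: nn.Module):
        super().__init__()
        self.backbone = backbone
        self.detection_head = detection_head
        self.separation_head = separation_head

    def forward(self, frames, mixture_spec):
        feats = self.backbone(frames)                        # shared representation
        det_out = self.detection_head(feats)                 # boxes / classes
        sep_out = self.separation_head(feats, mixture_spec)  # separation masks
        return det_out, sep_out

# A single backward pass on the summed loss sends gradients from both tasks
# into the shared backbone:
#   loss = det_loss(det_out, det_targets) + lambda_sep * sep_loss(sep_out, ref_masks)
#   loss.backward(); optimizer.step()
```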
Q2 ( I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Strongly agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Agree (Novel topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
Although this type of joint training has been done previously in other tasks, such as audio-visual sound event detection and localization, this approach is novel and re-usable for music sound source separation. The losses presented are also great insights for the evolving multi-modal learning field.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
The article presents a framework to jointly learn audio-visual representations for audio and video, showing that end-to-end learning leads to improvements in audio source separation.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Weak accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
The article presents a new framework to perform audio source separation via audio-visual learning. The presentation is clear and the experimental analysis is strong. Some areas would benefit from further exploration and clarification, which would make the article stronger.
Remarks to address:
Line 226 defines \theta but it does not mention what this threshold is.
In 263 the embedding fusion of audio and video is mentioned. Figure 1 also describes it. The dimensions of the video embedding O_o are not the same as the dimensions of S_out. This has to be fixed for clarity.
Please cite the standard protocol for the data splitting on the MUSIC dataset (line 326).
Data processing: It is not clear what the video sampling rate is (the standard is 30 fps). If 3 frames per video are sampled, that is roughly a window of 100 milliseconds, whereas 6 seconds of audio are sampled; this indicates that the audio and video are not time-aligned, which needs to be further explained (a small alignment sketch is given after these remarks). The choice of an 11 kHz audio sampling rate also seems arbitrary and needs to be explained in detail, as it is critical for experiment reproducibility.
Results Section: The ablations presented are very clear. However, one missing ablation is what happens when the visual branch receives no bounding-box information at all. In contrast to having no b-box filtering, which means a lot of visual information is merged into the network, what happens if there is no b-box information whatsoever, leaving only audio and class information to be processed by the network? This would also clarify the benefit of using bounding boxes.
The supplementary video needs to be more self-explanatory. It is not clear when each sound source should be separated. A timeline of the sound events and the task would make this clearer.
Is the code to reproduce experiments going to be open sourced?
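Regarding the data-processing remark above, a small sketch (30 fps, 11 kHz audio, hop size 256 are illustrative assumptions, not values taken from the paper) of the frame-to-spectrogram mapping that should be made explicit:

```python
def frame_to_stft_column(frame_idx: int, fps: float = 30.0, sr: int = 11025, hop: int = 256) -> int:
    """Map a video frame index to the nearest STFT column (all parameters illustrative)."""
    t = frame_idx / fps          # time of the video frame in seconds
    return round(t * sr / hop)   # nearest spectrogram column

# Whether the 3 sampled frames are consecutive (~100 ms of video) or spread over
# the 6 s clip determines which spectrogram columns they align with; the paper
# should state which strategy is used and whether any interpolation is applied.
```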