Abstract:

Binaural audio remains underexplored within the music information retrieval community. Motivated by the rising popularity of virtual and augmented reality experiences as well as potential applications to accessibility, we investigate how well existing music source separation (MSS) models perform on binaural audio. Although these models process two-channel inputs, it is unclear how effectively they retain spatial information. In this work, we evaluate how several popular MSS models preserve spatial information on both standard stereo and novel binaural datasets. Our binaural data is synthesized using stems from MUSDB18-HQ and open-source head-related transfer functions by positioning instrument sources randomly along the horizontal plane. We then assess the spatial quality of the separated stems using signal processing and interaural cue-based metrics. Our results show that stereo MSS models fail to preserve the spatial information critical for maintaining the immersive quality of binaural audio, and that the degradation depends on model architecture as well as the target instrument. Finally, we highlight valuable opportunities for future work at the intersection of MSS and immersive audio.

Meta Review:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 ( The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

This manuscript opens a new path in the well-explored territory of source separation, incorporating binaural localisation. It combines this novelty with some well-known algorithms in the area. Researchers in the area should be able to incorporate the insights easily into their own work. The work also includes a new dataset to support further research.

Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)

Binaural source separation is an interesting problem domain for which traditional source-separation algorithms are not yet fully adequate.

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Strong accept

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This manuscript is a solid investigation of a problem of growing importance: traditional source separation techniques don’t work that well for binaural audio. The localisation binaural audio provides is essential to VR and AR, and the authors also nicely motivate other use cases, for example, benefitting people with certain types of hearing impairment. There has not yet been much work in the field, and this manuscript sets a baseline for how well current approaches work (or don’t) in this context.

The work includes a synthesised dataset based on standard corpora for the field, incorporating localisation effects. This dataset is a contribution in and of itself, and will surely benefit researchers in source separation trying to innovate in binaural audio.

The discussion of evaluation metrics in the paper is readable and convincingly identifies the key challenges and limitations of using signal-based metrics for this purpose. A human evaluation would be the logical next step, but such an evaluation is clearly out of the scope of what an ISMIR-length paper could achieve on top of the other contributions in the manuscript.

The authors test several well-known source-separation models for their analysis. The manuscript gives a good high-level explanation of these models, although it lacks a sufficient motivation for why these algorithms were chosen (and not others). For completeness and value to the community, it would have been better if the authors could have added one or two other complementary techniques. I suspect this would be too much for the camera-ready version, but if the authors have time, it would be worthwhile.

The results are clear and highlight specific limitations of traditional source-separation approaches in binaural contexts. The manuscript tries to cover a broad ground in little space, and given that constraint, it strikes a good balance between showing detail and explaining possible causes for the results.

Overall, the manuscript presents a kind of ‘negative’ result, in the sense that current algorithms are not as successful as one would hope for this task. But this negative result is also the authors’ motivation. The manuscript establishes a clear baseline for the community, includes helpful reasoning for why current algorithms are not up to the task, and suggests practical paths for future work. It definitely has a place at this year’s ISMIR.

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Weak reject

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

This paper prompted a lengthy discussion among the reviewers, and in particular, there were questions about the formulae used for some of the computations, particularly SDR. It was not possible to resolve all of these questions from tracing the references. If these metrics needed to be recomputed, it could substantially alter the results.

The full reviews have more details, and several reviewers also highlighted the positive contributions in the paper – hence the final recommendation of only a weak reject. The authors should be encouraged to double-check their formulae and the references for them, make them more explicit in the manuscript, and consider resubmitting here or elsewhere.

Review 1:

Q2 ( I am an expert on the topic of the paper.)

Strongly agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Strongly agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The authors provide a new version of the MUSDB18 dataset, which is very valuable for the community.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

We should care more about spatial audio in music separation; the best system doesn't perform best on spatial audio.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The paper presents a good investigation into how well current music source separation (MSS) models preserve spatial information when running on binaural audio. The authors provide a novel binaural dataset derived from MUSDB18-HQ and evaluate multiple popular MSS models using both standard and spatial metrics. The results are clear and relevant, especially as interest in immersive audio continues to grow.

That said, I have two main points for improvement, both of which I believe are straightforward to address and would strengthen the manuscript:

  1. Clarify Practical Relevance Beyond the Atmos production Workflow

In its current form, the paper connects the problem to the growing importance of spatial audio. However, in music production for immersive formats such as Dolby Atmos, stems are typically already separated before spatialization is applied. As such, the relevance of performing source separation on already-binaural material may seem limited in this context. To be fair, the authors don’t claim that this is a relevant application. Nonetheless, I suggest that the authors briefly clarify this distinction in the introduction. A more compelling case for the utility of binaural MSS could be made by highlighting real-world examples where binaural mixes exist outside the production pipeline, such as:

  • ambient or field recordings
  • live concert captures recorded with binaural microphones
  • consumer content or binaural podcasts where multitrack spatial mixes are unavailable

This adjustment would help ground the work in practical scenarios where such technology is indeed needed.

  2. Provide Guidance on Improving Binaural MSS Performance

The paper effectively demonstrates that spatial cues degrade under current MSS models, but it stops short of offering guidance on how researchers could address this issue in future work. I encourage the authors to provide concrete and constructive suggestions, such as:

  • Data augmentations: Leverage tools such as github:facebookresearch/BinauralSpeechSynthesis to simulate a variety of binaural conditions during training.
  • HRTF diversity: Introduce variations in HRTF profiles to help models generalize across different listener anatomies.
  • Model modifications: Explore modifications to encoder-decoder architectures or loss functions that account for interaural level/time differences (ILD/ITD), or design models that explicitly process spatial features. Is there a specific time-delay that a model should be invariant to?
  • Include a simple baseline: A version of one of the tested models (e.g., Demucs or OpenUnmix) retrained on the Binaural-MUSDB dataset would be a valuable comparison and help validate the feasibility of training on binauralized data directly. I understand that this might be out-of-scope for the paper but it would significantly improve the usefulness of the work.

Adding even a short section outlining these strategies would provide practical value to readers and encourage further progress in this promising research direction.
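To make the interaural-cue loss suggestion above concrete, here is a minimal hypothetical sketch; the function names and the plain energy-ratio ILD are my own assumptions for illustration, not anything proposed in the paper:

```python
import numpy as np

def ild_db(stem, eps=1e-8):
    # Interaural level difference in dB for a (2, T) binaural stem;
    # eps guards the log against silent channels.
    left, right = stem
    return 10.0 * np.log10((np.sum(left**2) + eps) / (np.sum(right**2) + eps))

def ild_loss(estimate, reference):
    # Auxiliary loss term: absolute ILD error (dB) between the
    # separated estimate and the ground-truth binaural stem.
    return abs(ild_db(estimate) - ild_db(reference))
```

An ITD term could be added analogously (e.g., penalizing the difference in GCC-PHAT peak lags), with the whole term weighted against the usual waveform or spectrogram loss.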

Overall, I appreciate the novelty of the work and the thorough analysis. Addressing the points above would significantly improve both the paper’s clarity and its applicability.

Review 2:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The paper gives insights into whether MSS models preserve spatial information or not.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

In general, MSS models are not fully able to preserve spatial information.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The authors investigate the performance of state-of-the-art music source separation (MSS) models on binaural audio. Using synthetic binaural mixtures created from MUSDB18-HQ stems and head-related transfer functions, they assess how well these models preserve spatial information. Their evaluation reveals that standard stereo MSS models often fail to maintain critical spatial cues, with degradation varying across model architectures and instrument types, highlighting the need for future research at the intersection of MSS and immersive audio.

I find the work interesting, clear, and well-written. I have some comments that I would like the authors to address in order to improve the quality of the manuscript.

  • Introduction: I believe it is also important to mention (e.g., in the paragraph starting at line 116) and cite the SDX Workshop, from which nearly all of the state-of-the-art models originate.
  • Introduction, line 154: I find it cumbersome to see a paragraph beginning with a citation. I suggest modifying it to, e.g., “The work in [30]” or “Reference [30]”.
  • Sec. 3, line 196: You reference Fig. 2 (and Fig. 3 too, later on) before Fig. 1. I think it is better to swap places.
  • Figure 2: You should swap -90° with 90° in order to match the positive direction of \theta.
  • Figure 3: Are the thetas still in the range [-90°, 90°]? If you randomly drew values in this range, why does the distribution not look uniform? Is it because the sampling is not completely random, as you discarded angles to prevent sources from overlapping in space? Please comment on this.
  • Sec. 4.1: I think it is better to make the equations part of the text, thus avoiding referencing them as if they were figures. You can still have them in display mode; I am suggesting you treat them as part of your logical flow. Then, each variable should be formally defined; e.g., there is no definition of s, \hat{s}, x, N, k, etc.
  • Sec. 4.2: Why did you consider those models, and not more recent models such as BSRoformer?
  • Sec. 5: the models that you considered are no longer state-of-the-art. It would be more precise to refer to them as “older” state-of-the-art models.
  • Tables 1 and 2: the captions of tables are typically placed above them. Again, Table 2 is referenced before Table 1; you should swap them.
  • Sec. 5.5: The work would benefit from further information about the perceptual test, and, at least, a plot/table, otherwise I do not see how this section could be useful.
  • If I did not get it wrong, you selected a single HRTF to perform the analysis. In order to have more reliable results, I think you should have selected more than one. Can you, at least, comment on the generalizability of the results? How do you think the choice of HRTF conditions the results?
  • Finally, why did you choose a binaural realization? Probably, the most complete way of evaluating the spatial characteristics of MSS models would be to compute inter-channel level differences of the nth-order ambisonics encoding, and perhaps compute the DOA to verify whether the angle is maintained after demixing. Please comment on this.
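As a hypothetical sketch of how such a check of spatial cues after demixing could start, the ITD between the two channels of a separated stem can be estimated with GCC-PHAT; the parameter choices here (e.g., the 1 ms lag limit) are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def gcc_phat_itd(left, right, fs, max_tau=1e-3, eps=1e-12):
    # Estimate the interaural time difference (seconds) via GCC-PHAT.
    # Positive return value: the right channel lags the left
    # (i.e., the source is towards the left ear).
    n = len(left) + len(right)               # zero-pad for linear correlation
    cross = np.fft.rfft(left, n=n) * np.conj(np.fft.rfft(right, n=n))
    cross /= np.abs(cross) + eps             # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = int(fs * max_tau)            # restrict to plausible ITDs
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = np.argmax(np.abs(cc)) - max_shift
    return -lag / fs
```

Comparing this estimate on the ground-truth stem against the separated stem gives a ΔITD; the same idea extends to a DOA check under a known HRTF.
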

Review 3:

Q2 ( I am an expert on the topic of the paper.)

Strongly agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Disagree

Q10 (Please justify the previous choice (Required if "Strongly Disagree" or "Disagree" is chosen, otherwise write "n/a"))

There were potential issues with the metric calculation that were not discussed in the paper.

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Disagree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Disagree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Disagree

Q15 (Please explain your assessment of reusable insights in the paper.)

Although the subject is very worthy of investigation, the work in its current form is not thorough enough to allow meaningful formation of new knowledge.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

We need a better evaluation suite for the spatial quality of source separation outputs.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong reject

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

In this work, the authors present an analysis of the spatial errors introduced by music source separation models. In particular, the authors consider 3 models: Demucs v4, Spleeter, and Open-Unmix. The authors also used ITD calculated via GCC-PHAT, ILD, as well as SSR and SRR from Watcharasupat and Lerch to support their analysis. While the subject is indeed worthy of investigation, the paper in its current form is, in my opinion, not quite thorough enough to warrant its publication as a full ISMIR paper. I would encourage the authors to continue pursuing this line of analysis and resubmit in the future (or perhaps as LBD).

  • L31-35: It is perhaps very important to establish that the main distinction between binaural and stereo audio is the method of reproduction. Interaural cues can also be recreated (somewhat) in stereo, but without an explicit assumption that headphones are used.

  • L89-91: Practically all source separation work is based on signal processing anyway. Perhaps the authors mean model-driven (as opposed to data-driven)?

  • L98: If anything, I would argue that deep learning undid quite a fair bit of progress on real-time source separation. What DL did offer is the ability to perform SS on single-channel or non-array signals, in more complex environments, and generally with much higher fidelity than ICA/NMF-era systems.

  • L121-124, L277-279, L323-324: It appears that the authors are somewhat confused by SDR (which is admittedly a very confusing metric; see Le Roux et al., "SDR - half-baked or well done?", in ICASSP 2019 for more details). In fact, there are two versions of SDR, and it is unclear which version the authors are referring to. The "correct" version for MSS is the one that is basically the same as SNR. The one that is typically misused, however, does not penalize everything --- rather, it forgives significant timbral impairments that can be captured within a 512-tap filter. It is also very important to note that even with a consistent SDR/SNR definition, there are many other factors that can (potentially wildly) affect the calculations:

(1) Was the computation done on the full track or chunk-wise? (a) If full-track, was the reported summary the median or the mean? (b) If chunk-wise, was the reported summary the median of medians, median of means, mean of medians, or mean of means? Typically it is the nanmedian of nanmedians.

(2) Were there significant regions of silence in the reference track? There isn't a particular standardized way of dealing with this yet, but it can affect any SNR-like metric.

(3) How were the arguments of the log stabilized? Was the "epsilon" only in the denominator, only in the numerator, or in both? Same question for Eq. 5 between L256/257. Specifically, how was hard-panning handled in Eq. 5?

  • L197-199: The choice of location sampling has to be justified. Also, Fig. 3 looks very non-uniform. I understand random sampling can do that, but this is very non-uniform.
  • Also, was each song only assigned one set of locations?

  • L259-266, L323-324: It is perhaps better if the decompositions for both SDR and SSR/SRR are written out explicitly, given the nature of this work.

  • Fig. 4 has to be separated by stem. It is unclear whether this pattern holds across all stems or just one.

  • Table 1: The GCC-PHAT parameters have to be stated. The bass ITD looks very far off. This could be an issue with either SSR/SRR or GCC-PHAT and has to be more thoroughly checked and discussed.

  • Table 1: How was Overall $\Delta\text{ITD}$ calculated?

  • The supplementary material cannot be easily judged since no reference tracks from the ground truth are provided.

  • It is somewhat of a missed opportunity not to compare more models. While I understand the 3 chosen have easily accessible open-source implementations, it would be more interesting to perhaps also consider other model archetypes of similar "leagues". For example, one could compare hybrid (Demucs v3 or 4) vs time-domain (Demucs v2) vs learnt-basis (ConvTasNet) vs full-band TF-domain (ByteSep) vs subband TF-domain (Bandsplit RNN). Most of these have official open-source implementations, and those without do have a few unofficial implementations or related systems with official implementations.
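
For what it's worth, one way the SDR ambiguities raised above could be pinned down is for the authors to state their convention explicitly in code. The following is a hypothetical sketch of one such convention (SNR-style definition, epsilon in both numerator and denominator, silent reference chunks excluded via NaN, nanmedian aggregation); it is not a reconstruction of what the paper actually did:

```python
import numpy as np

def snr_style_sdr(reference, estimate, eps=1e-8):
    # SNR-style SDR in dB; eps stabilizes both log arguments.
    num = np.sum(reference**2) + eps
    den = np.sum((reference - estimate)**2) + eps
    return 10.0 * np.log10(num / den)

def chunked_sdr(reference, estimate, fs, chunk_s=1.0, silence_db=-60.0):
    # Chunk-wise SDR; chunks with a (near-)silent reference become NaN
    # and are skipped by the nanmedian aggregation.
    n = int(fs * chunk_s)
    scores = []
    for start in range(0, len(reference) - n + 1, n):
        ref = reference[start:start + n]
        est = estimate[start:start + n]
        level_db = 10.0 * np.log10(np.mean(ref**2) + 1e-12)
        scores.append(np.nan if level_db < silence_db else snr_style_sdr(ref, est))
    return float(np.nanmedian(scores))
```

Every choice above (chunk length, silence threshold, placement of eps, nanmedian vs mean) changes the reported number, which is exactly why these details need to be stated.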