P7-11: User-Guided Generative Source Separation

Yutong Wen, Minje Kim, Paris Smaragdis

Subjects: Sound source separation; Open Review; MIR tasks

Presented In-person

4-minute short-format presentation

Abstract:

Music source separation (MSS) aims to extract individual instrument sources from their mixture. While most existing methods focus on the widely adopted four-stem separation setup (vocals, bass, drums, and other instruments), this approach lacks the flexibility needed for real-world applications. To address this, we propose GuideSep, a diffusion-based MSS model capable of instrument-agnostic separation beyond the four-stem setup. GuideSep is conditioned on multiple inputs: a waveform mimicry condition, which can be easily provided by humming or playing the target melody, and mel-spectrogram domain masks, which offer additional guidance for separation. Unlike prior approaches that relied on fixed class labels or sound queries, our conditioning scheme, coupled with the generative approach, provides greater flexibility and applicability. Additionally, we design a mask-prediction baseline using the same model architecture to systematically compare predictive and generative approaches. Our objective and subjective evaluations demonstrate that GuideSep achieves high-quality separation while enabling more versatile instrument extraction, highlighting the potential of user participation in the diffusion-based generative process for MSS. Our code and demo page are available at https://yutongwen.github.io/GuideSep/.

Meta Review:

Q2 ( I am an expert on the topic of the paper.)

Strongly agree

Q3 ( The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The analysis of differing behaviour between generative and discriminative/predictive source separation approaches can be used to guide future work

Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)

Diffusion-based music source separation can perform well when controlled by a mimicry audio condition of the target instrument and/or from a spectrogram mask locating the target instrument

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Weak accept

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The paper successfully lays the groundwork for music source separation methods that can be controlled by users in a more intuitive way, with the mimicry condition being a novel conditioning method.

The paper is overall very well written and scientifically rigorous. The ablation studies show that both proposed conditioning methods contribute to separation performance and that the diffusion-based generative approach performs better than a mask-based one (although I am not convinced this finding is entirely novel, so I encourage further literature review and, if needed, rephrasing the paper's novelty claims in that regard).

There are a couple of limitations, which are not discussed enough in the paper:
- The humming is assumed to be closely time-aligned to the target source, by way of how the system is trained. This means that users need to hum along to the music, and humming without listening along to the mixture would likely fail. The paper introduces some data augmentation, but distorts timing only slightly, so more variation here would be needed.
- The audio input duration is quite short (4s), and it's unclear how the model performs with longer inputs. This also seems to be a limiting factor in case more time-warped humming inputs are to be supported.
- Authors find including a synthesis task (generate from mimicry condition) in training helps, which is surprising, but don't provide an ablation result for that.
- The paper does not train on the actual task (real humming inputs) due to unavailable data, and only evaluates in a very synthetic setting, making it difficult to assess how useful the model actually is. For future research, it would be crucial to collect some data for this novel task.

Small corrections:
- L275 refers to the mimicry condition suddenly as melody - would be better to use one name consistently throughout the paper
- L277 delete “in”
- L431 - the diffusion-based approach in particular?
- L433 - could you find a more precise word than “clean”? Do you mean free of artifacts?

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Weak accept

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

Review summary

This paper introduces a generative source separation approach that allows user guidance via two modalities: time-synchronized humming and mel-spectrogram masks. The method is implemented using a diffusion-based model and aims to make the separation process more interactive and intuitive. The reviewers generally found the work to be original, well-executed, and relevant to the ISMIR community.

Two reviewers rated the paper as Strong Accept, citing clear exposition, a well-motivated problem, and a promising approach with reproducible implementation. A third reviewer gave a Weak Accept, while a fourth initially gave a Strong Reject due to concerns about the evaluation setup, lack of real user input or UI, and fairness of model comparisons. However, this reviewer later acknowledged the paper’s novelty and agreed to revise their stance to a "weak accept", provided the subjective listening test results are more rigorously evaluated.

The core idea of guided source separation through user input is novel and an interesting foundation for further research. While the evaluation and experimental setup have weaknesses, some of these can be addressed as part of a camera-ready version by the authors. Additionally, the contribution of source separation by humming using a diffusion model is convincing.

Recommendation: Weak accept

Note to the Authors: To improve the final version of the paper, please address the following key points:

Clarify the Subjective Evaluation: Provide more details on MUSHRA test design, participant screening, and whether remixing affected perceptual quality judgments.

Acknowledge Evaluation Limitations: Clearly state the limitations of your comparisons between generative and predictive models, including the lack of computational parity.

Discuss Practicality of Inputs: Expand on how mel-mask inputs would be obtained in real-world scenarios, and briefly discuss UI considerations or potential deployment contexts.

Improve Consistency and Clarity: Fix terminology inconsistencies (e.g., “mimicry” vs. “melody”) and minor typos identified by reviewers.

Review 1:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Disagree

Q15 (Please explain your assessment of reusable insights in the paper.)

The paper is very interesting and informative, but does not provide insights beyond its scope.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Diffusion-based source separation by mimicry works surprisingly well.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The paper "User-guided generative source separation" introduces GuideSep, a source separation system conditioned on mel-spectrogram masks and mimicry input. The work is very well structured and written, and it clearly contextualizes and communicates its contributions.

L. 252: Typo "noramlized" -> "normalized"
Table 1: Typo "psuedo" -> "pseudo"

Review 2:

Q2 ( I am an expert on the topic of the paper.)

Strongly agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Strongly agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The authors provide open-source code, supporting reusability. The method combines ideas from diffusion models, guided generation, and dropout sampling—offering components that can be adapted to other MIR tasks. Drawing inspiration from adjacent domains further strengthens the paper’s potential impact.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

This paper introduces a user-guided music source separation method that enables users to guide the separation process using waveform mimicry and rough masks on mel spectrograms. These inputs help enforce constraints on specific regions of interest during inference, offering an intuitive and flexible mechanism for interactive separation.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Strongly agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

General Comments

The proposed approach is novel in its user-guided controllability and its integration of diffusion models with guided priors. The work is both timely and relevant, addressing a growing interest in interactive music source separation. The accompanying audio examples are convincing and engaging. However, several conceptual, methodological, and evaluative aspects of the paper would benefit from further clarification and development.

Detailed Comments

L.177: The statement “Extraction using non-polyphonic instruments: We restrict the condition melody to be monophonic to reflect real-world limitations of many instruments” is somewhat confusing. The authors should clarify what is meant by “real-world limitations,” particularly since many real-world instruments (e.g., piano, guitar, harp) are inherently polyphonic.

L.243–L.252 (Notation clarity): The mathematical notation could be improved for precision and readability. Specifically:
* The variable c is overloaded: it is used both as a complex spectrogram and as an instrument label.
* c_mix refers to the STFT domain, while c_mask is defined in the mel-spectrogram domain. As these are different time-frequency representations, the distinction should be made more explicit.
* In the subsequent section, the authors mention a projection of the mel-axis using a 1-hidden-layer neural network. It would help readers if the dimensions of each matrix were provided in this earlier section, along with a note that architectural details will follow.
* Mel vs. STFT Masking: The decision to apply masking in the mel domain instead of the STFT domain is not discussed. This design choice could have significant implications (e.g., resolution trade-offs, perceptual smoothness), and the rationale should be clarified.
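The resolution trade-off between mel-domain and STFT-domain masks can be illustrated with a small sketch. It uses the common HTK-style mel formula and compares the width of the lowest and highest mel bands to the width of one STFT bin. The 16 kHz sampling rate is the one mentioned elsewhere in these reviews; the 80 mel bands and the FFT size of 1024 are assumptions made only for illustration and are not taken from the paper.

```python
import math

def hz_to_mel(f_hz):
    """HTK-style Hz-to-mel conversion."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(n_mels, f_min, f_max):
    """Return n_mels + 2 band-edge frequencies in Hz, equally spaced in mel."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    return [mel_to_hz(lo + (hi - lo) * i / (n_mels + 1))
            for i in range(n_mels + 2)]

# Assumed settings (hypothetical): 80 mel bands, FFT size 1024, 16 kHz audio.
edges = mel_band_edges(n_mels=80, f_min=0.0, f_max=8000.0)
stft_bin_hz = 16000 / 1024

# Each triangular mel band spans two consecutive edge intervals.
first_width = edges[2] - edges[0]
last_width = edges[-1] - edges[-3]
print("lowest band spans about", round(first_width / stft_bin_hz, 1), "STFT bins")
print("highest band spans about", round(last_width / stft_bin_hz, 1), "STFT bins")
```

Under these assumptions, a mask drawn on the mel axis is much coarser at high frequencies than at low frequencies, which is the kind of trade-off the comment asks the authors to discuss.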

L.288 (Terminology): The acronym ODE should be defined as Ordinary Differential Equation when first introduced, for readers unfamiliar with this terminology.

L.297–L.299 (Evaluation metrics): The evaluation focuses on objective metrics like SDR, but perceptual quality is critical in music separation tasks. The authors are encouraged to include perceptual metrics or refer to established evaluation frameworks. For a comparative overview of metrics, see: M. Torcoli, T. Kastner and J. Herre, "Objective Measures of Perceptual Audio Quality Reviewed: An Evaluation of Their Application Domain Dependence," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1530-1541, 2021, doi: 10.1109/TASLP.2021.3069302.
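For reference, the basic signal-to-distortion ratio can be sketched as follows. This is a minimal plain-Python illustration of the textbook definition, SDR = 10 log10(||s||^2 / ||s - s_hat||^2); the function name `sdr_db` and the toy signals are assumptions, and the SDR variant used in the paper may differ (for example, it may be scale-invariant).

```python
import math

def sdr_db(reference, estimate, eps=1e-12):
    """Signal-to-distortion ratio in dB between two equal-length signals."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps guards against division by zero and log of zero.
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))

# Toy example (hypothetical data): a sine wave and a copy scaled by 0.9.
reference = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
estimate = [0.9 * r for r in reference]

# The error is 10% of the reference amplitude, so the SDR is about 20 dB.
print(round(sdr_db(reference, estimate), 2))
```

A purely energy-based number like this says nothing about how artifacts are perceived, which is why the comment above asks for perceptual measures alongside it.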

Section 3.1.1 (Dataset selection): If the model is designed to be source-agnostic, it would be valuable to test it on a wider range of musical contexts, especially multitrack classical music datasets. This would provide insights into how well the model generalizes across diverse instrumentation and structural complexity. For example, datasets like PHENICX-SMM or PCD could be considered:
* M. Schedl, D. Hauger, M. Tkalčič, M. Melenhorst, and C. C. S. Liem, “A dataset of multimedia material about classical music: PHENICX-SMM,” in Proc. Int. Workshop on Content-Based Multimedia Indexing (CBMI), Bucharest, Romania, 2016, pp. 1–4, doi: 10.1109/CBMI.2016.7500240.
* Y. Özer, S. Schwär, V. Arifi-Müller, J. Lawrence, E. Sen, and M. Müller, “Piano Concerto Dataset (PCD): A multitrack dataset of piano concertos,” Transactions of the International Society for Music Information Retrieval, vol. 6, no. 1, pp. 75–88, 2023.

Section 3.1.2 (Dropout strategies): The dropout-based sampling strategy is conceptually interesting and appears effective. However, an ablation study comparing performance with and without dropout would strengthen the empirical justification for this approach.

Section 4.1 (Subjective listening tests):
* The MUSHRA methodology recommends post-screening of participants: a listener should be excluded from the listening test if they assign the hidden reference a score lower than 90 for more than 15% of the test items. In Figure 3a, the average score for the ground truth (GT) signal is exactly 90, raising questions about listener behavior and post-screening. It is unclear whether such screening was applied.
* The paper does not report whether the subjective differences are statistically significant.
* The experience level of the 13 participants is not discussed, and 13 listeners is a relatively small sample for subjective testing.
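The post-screening step discussed above can be sketched in a few lines. The sketch assumes the criterion from ITU-R BS.1534-3 (exclude a listener who rates the hidden reference below 90 for more than 15% of the test items); the function name `screen_listeners` and all ratings are hypothetical and are not data from the paper.

```python
def screen_listeners(reference_scores, min_score=90, max_fraction=0.15):
    """Split listeners into kept and excluded lists.

    reference_scores maps a listener id to the scores that listener gave
    to the hidden reference, one score per test item.
    """
    kept, excluded = [], []
    for listener, scores in reference_scores.items():
        low = sum(1 for s in scores if s < min_score)
        if low / len(scores) > max_fraction:
            excluded.append(listener)
        else:
            kept.append(listener)
    return kept, excluded

# Hypothetical ratings of the hidden reference, 10 test items per listener.
ratings = {
    "A": [100, 98, 95, 100, 92, 100, 97, 99, 100, 96],  # 0 of 10 below 90
    "B": [100, 85, 95, 100, 92, 100, 97, 99, 100, 96],  # 1 of 10 below 90
    "C": [80, 85, 95, 70, 92, 100, 88, 99, 100, 96],    # 4 of 10 below 90
}
kept, excluded = screen_listeners(ratings)
print(kept, excluded)  # ['A', 'B'] ['C']
```

If a screening of this kind had been applied, a mean hidden-reference score of exactly 90 would be unlikely, which is why the reviewers ask whether it was.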

Review 3:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Disagree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly disagree

Q10 (Please justify the previous choice (Required if "Strongly Disagree" or "Disagree" is chosen, otherwise write "n/a"))

  • The comparison between the predictive and generative models, which is the paper’s main contribution, is not computationally fair. For a meaningful comparison, the generative model should be evaluated under a comparable inference cost, such as with a single-step sampling method.
  • Although the authors highlight user-guided controllability in the title and abstract, the paper provides no description of the user interface system. Moreover, the proposed UI does not appear in the subjective experiment.
  • The subjective test results are questionable. The ground truth (hidden reference) in the MUSHRA evaluation scored around 90, which is unusually low and suggests potential issues with experiment design or listener instruction. This undermines the validity of the test results.
  • Figure 2 includes a misleading visual: the "mimicry" input is depicted incorrectly. It should be illustrated in the spectrogram domain.
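The computational-parity concern in the first bullet can be made concrete by counting network forward passes. The sketch below is a toy: `CountingModel` is a hypothetical stand-in for a neural network, and the scalar vector field is chosen only so that the example runs; it is not the paper's model or sampler. A predictive model costs one forward pass per input, whereas an N-step Euler ODE sampler costs N.

```python
class CountingModel:
    """Stand-in for a neural network; counts forward passes."""

    def __init__(self, target):
        self.target = target
        self.calls = 0

    def velocity(self, x, t):
        # Toy vector field that moves x toward the target as t goes 0 -> 1.
        self.calls += 1
        return (self.target - x) / max(1.0 - t, 1e-6)

def euler_sample(model, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * model.velocity(x, i * dt)
    return x

for steps in (1, 8, 32):
    model = CountingModel(target=1.0)
    x = euler_sample(model, x0=0.0, num_steps=steps)
    print(f"{steps} sampling steps -> {model.calls} forward passes")
```

A cost-matched comparison would therefore either run the generative model with a single step or give the predictive baseline a comparable compute budget.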

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Disagree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly disagree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Disagree

Q15 (Please explain your assessment of reusable insights in the paper.)

Diffusion-based generative models, despite requiring higher inference complexity, can achieve superior separation performance compared to predictive models with identical architectures.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

The paper proposes a generative music source separation model leveraging user-provided mimicry and mel-spectrogram masks for flexible instrument extraction.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong reject

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The proposed method employs mimicry (melodically similar audio) and mel-masks as query inputs for source separation. Using humming and mel-masks as user-generated queries is an interesting and differentiated approach compared to existing methods that rely on simpler queries, such as textual labels (e.g. instrument classes) or audio examples with similar timbres. Nevertheless, the paper is rejected due to the following concerns:

Reasons for Rejection:
1. Limited Practicality and Lack of UI Details: The proposed approach appears less practical compared to existing methods because the required user inputs are relatively complex. Given such complexity, a carefully designed user interface (UI) would be essential; however, the paper does not provide any detailed description or demonstration of the proposed UI.
2. Fundamental Limitations for Polyphonic and Transient Sources: The mimicry condition assumes a monophonic melody to simulate humming, imposing inherent limitations. This assumption restricts the model’s applicability to polyphonic or harmonically complex sources (e.g., unison or polyphonic instruments). Moreover, it fundamentally lacks suitability for transient-rich sounds like drums.
3. Limited Contribution Considering Computational Cost and Audio Quality: Although the proposed method is not intended for real-time processing, it still operates at a relatively low sampling rate of 16 kHz. This limitation, combined with the comparatively low subjective audio quality demonstrated in the provided demos (particularly in comparison to existing commercial software), significantly restricts the overall contribution and potential impact of the paper.
4. Unfair Computational Comparison: A major contribution highlighted in the paper is the systematic comparison between predictive and generative models. However, this comparison is not fair from a computational standpoint. To fairly evaluate performance, the generative model should use single-step inference or at least be evaluated under equivalent computational costs compared to the predictive model.
5. Low Reliability of Subjective Evaluation: The MUSHRA test produced questionable results, with the ground truth (hidden reference) scoring approximately 90 points. Such a low score strongly suggests problems with experiment design, listener instruction, or overall experiment reliability, undermining confidence in all reported subjective evaluation outcomes.

Minor Comment: When performing data augmentation for mel-spectral masks, randomization of frequency and time offsets should also be considered for robustness.
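One way to read the suggested offset augmentation is sketched below: a binary mask on a (mel bin, frame) grid is shifted by a random number of bins and frames, with cells that leave the grid dropped and vacated cells set to zero. The function names, the shift ranges, and the list-of-lists mask layout are all assumptions for illustration; the paper's augmentation pipeline is not described in this review.

```python
import random

def shift_mask(mask, freq_offset, time_offset):
    """Shift a binary mask (rows = mel bins, columns = frames).

    Cells shifted outside the grid are dropped; vacated cells become 0.
    """
    n_bins, n_frames = len(mask), len(mask[0])
    shifted = [[0] * n_frames for _ in range(n_bins)]
    for f in range(n_bins):
        for t in range(n_frames):
            nf, nt = f + freq_offset, t + time_offset
            if mask[f][t] and 0 <= nf < n_bins and 0 <= nt < n_frames:
                shifted[nf][nt] = mask[f][t]
    return shifted

def random_offset_augment(mask, max_freq_shift=2, max_time_shift=4, rng=random):
    """Apply a random frequency and time offset (hypothetical ranges)."""
    df = rng.randint(-max_freq_shift, max_freq_shift)
    dt = rng.randint(-max_time_shift, max_time_shift)
    return shift_mask(mask, df, dt)

mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
print(shift_mask(mask, 1, 1))  # [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1]]
print(random_offset_augment(mask, rng=random.Random(0)))
```

Training on masks perturbed in this way would expose the model to user-drawn masks that are not perfectly aligned with the target source.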