P3-5: LiLAC: A Lightweight Latent ControlNet for Musical Audio Generation
Tom Baker, Javier Nistal
Subjects: Melody and motives ; Music generation ; Generative Tasks ; Music and audio synthesis ; Harmony, chords and tonality ; Open Review ; MIR tasks ; Musical features and properties
Presented In-person
4-minute short-format presentation
Text-to-audio diffusion models produce high-quality and diverse music but many, if not most, of the SOTA models lack the fine-grained, time-varying controls essential for music production. ControlNet enables attaching external controls to a pre-trained generative model by cloning and fine-tuning its encoder on new conditionings. However, this approach incurs a large memory footprint and restricts users to a fixed set of controls. We propose a lightweight, modular architecture that considerably reduces parameter count while matching ControlNet in audio quality and condition adherence. Our method offers greater flexibility and significantly lower memory usage, enabling more efficient training and deployment of independent controls. We conduct extensive objective and subjective evaluations and provide numerous audio examples on the accompanying website.
Q2 ( I am an expert on the topic of the paper.)
Strongly agree
Q3 ( The title and abstract reflect the content of the paper.)
Strongly agree
Q4 (The paper discusses, cites and compares with all relevant related work.)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Strongly agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Strongly agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
The use of identity-initialized and zero-initialized conv layers for fine-tuning seems interesting.
Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)
We can use lightweight conv layers to fine-tune a text-to-music generation model to learn new conditions.
Q17 (This paper is of award-winning quality.)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)
Weak accept
Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
This paper presents a lightweight alternative to the ControlNet approach for fine-tuning a text-to-music generation model to handle new conditions. Figure 1 clearly illustrates the idea, which involves learning identity-init and zero-init convolutional layers instead of using cloned encoder blocks. The authors implemented their method using Diff-a-Riff (432M parameters) as the backbone and tested variants with learnable parameters ranging from 32M to 64M, all smaller than the ControlNet baseline (165M). Both objective and subjective evaluations show that the proposed method matches the performance of the ControlNet baseline, though it does not surpass it.
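To make the two initialisations mentioned above concrete, here is a minimal, generic sketch in plain Python. It is not the paper's architecture: the single-channel convolution, the kernel size, the toy values, and the additive combination of the two outputs are all assumptions made for illustration only. It shows why an identity-initialised layer passes features through unchanged and why a zero-initialised layer contributes nothing at the start of fine-tuning.

```python
# Generic illustration only; NOT the LiLAC architecture.
# All names and values below are hypothetical.

def conv1d_same(x, kernel):
    """Single-channel 1-D convolution with zero padding (odd kernel size)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]

def identity_kernel(size):
    """Kernel that is 1 at the centre tap and 0 elsewhere."""
    kernel = [0.0] * size
    kernel[size // 2] = 1.0
    return kernel

def zero_kernel(size):
    """Kernel with all taps set to 0."""
    return [0.0] * size

features = [0.5, -1.0, 2.0, 0.25]   # hypothetical frozen-backbone features
condition = [1.0, 0.0, 1.0, 0.0]    # hypothetical control signal

identity_out = conv1d_same(features, identity_kernel(3))  # equals features
zero_out = conv1d_same(condition, zero_kernel(3))         # all zeros

# Assumed combination rule (a plain sum): at initialisation the control
# path adds nothing, so the backbone's behaviour is preserved.
combined = [f + c for f, c in zip(identity_out, zero_out)]
```

Under these assumptions, `combined` equals `features` before any training step, which is the usual motivation for starting adapter layers from identity or zero weights.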
Strengths:
* Fresh and interesting use of zero-init and identity-init convolutional layers for fine-tuning.
* Achieves similar performance to the Music ControlNet baseline while using about 1/4 to 1/2 of the learnable parameters.
* Solid set of experiments, with nice use of APA and MUSHRA evaluations.
Weaknesses:
* The backbone, Diff-a-Riff, is not open source.
* The reduction in trainable parameters is not substantial.
* It’s unclear if the idea of identity-init convolution is novel, as the authors don’t clarify this.
* No comparison with a relevant recent work [11] (Hou et al., ICASSP 2025), though this is understandable since [11] is new.
* Insufficient explanation of how their method differs from an important prior work, ControlNet-XS [13].
* Limited description of how their approach differs methodologically from another key prior work, Sketch2Sound [36], although the authors did implement Sketch2Sound (labeled LiLAC* in Table 1).
Minor issues:
* References are not consistently formatted.
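As a quick check on the parameter figures quoted in this review (432M backbone, 165M for the ControlNet baseline, 32M to 64M for the proposed variants), the ratios can be worked out directly. This is plain arithmetic on the numbers stated above; the variable names are illustrative.

```python
# Parameter counts in millions, as quoted in the review above.
backbone_m = 432           # Diff-a-Riff backbone
controlnet_m = 165         # ControlNet baseline (learnable parameters)
lilac_range_m = (32, 64)   # smallest and largest proposed variants

# Learnable parameters of each variant relative to the ControlNet baseline.
ratios = [n / controlnet_m for n in lilac_range_m]

for n, r in zip(lilac_range_m, ratios):
    print(f"{n}M / {controlnet_m}M = {r:.2f}")
```

The two ratios round to 0.19 and 0.39, somewhat below the rounded "1/4 to 1/2" given in the strengths list.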
Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)
Accept
Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))
The reviewers are generally positive about this submission, noting issues like the problematic abstract opening and several points needing clarification. I encourage the authors to use the reviewers' feedback constructively to enhance the paper's quality while preparing for the camera-ready version.
Q2 ( I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Strongly agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Strongly agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Strongly agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Agree (Novel topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
The proposed control strategy is new and may be reused for similar or even vastly different applications.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
A lightweight strategy for controlling an existing latent diffusion model for accompaniment generation.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Weak accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
This paper proposes a lightweight, modular variation of the method proposed in ControlNet [9] and evaluates the approach using an example music generation model. The evaluation focusses on time-varying control over music generation, something lacking in current systems. The approach is validated through objective and subjective evaluations and demonstrated with audio examples online.
General remark:
Overall I feel that the method is explained clearly and motivated sufficiently. Experimental results are a bit limited. In Table 1 all methods achieve APA == 1, independently of the fact that the MSE values are quite different. I wonder whether the APA metric is very helpful here.
Suggested modifications:
Table 2 appears a bit confusing. Section 5.3.4 explains the motivation of the experiment.
Lines 306-309: architecture is more susceptible to CLAP leakage, where over-specified control signals (e.g., chroma) can dominate or obscure CLAP’s condition.
I am not sure I understand this. My question would be: what would one want? If you have two conflicting control signals (here CLAP and chroma), then there is a design problem. I do not think it is generally better if CLAP or chroma wins. So the arrows in Table 2 do not seem justified, or at least I don't see why one would favor one over the other. One could for example say that the chroma feature is more specific and therefore it should overrule the more general specification (CLAP). It appears Table 2 reflects an understanding that is the other way around. It would be helpful to understand why.
I am also a bit disturbed by the discussion of the SCA results in the paragraph at lines 509-516. There we find that, compared to SAQ, the differences for SCA are clearer. However, in Table 2 it is the other way around: the differences are more pronounced for SAQ (max diff 4.2) than for SCA (max diff 2.7).
If I understand correctly the results displayed in Table 4, column SCA, it appears that the control efficiency is still somewhat weak.
Q2 ( I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Strongly agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Strongly agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Strongly agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Disagree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
This paper presents a post-hoc conditioning method that is lightweight to train and can be used with any type of pre-trained diffusion-based model.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
This paper presents a light-weight and flexible method to incorporate control in pre-trained generative models.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Strong accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
This paper builds on ControlNet to introduce controllability to pre-trained generative music models. To reduce the number of parameters, the authors introduce lightweight convolutions in the adaptor branch. Through the experiments, the authors examine (1) whether the quality of generation is affected by their method (it is not), (2) the effects of conflicting conditioning (conflicts between post-hoc conditions, i.e. chroma, and pre-existing conditions, i.e. CLAP embeddings), (3) the effect of the specificity of conditions, i.e. chroma and chord, and (4) the quality of the post-hoc conditions, i.e. chords and chroma.
I think the paper is well organized and does a good job of presenting a compelling argument for their light-weight controllable architecture. No mention has been made of the code being made available and I hope this could be done to help the community build on it. I also appreciate the extensive samples page. Below I have listed a few points I would like clarification on:
- Fig 1: I think the figure does a good job of communicating the idea, but I was confused about a couple of things. (1) In the right subfigure, what is the zero convolution (in the center of the figure, along the y-axis) doing, and how is its output integrated with the output from the identity convolution? I also don’t see this zero convolution in the equation defined in line 168. (2) I think it would be helpful to have a legend for the colours indicating clearly which parts are trained and which aren’t.
- Regarding conditioning:
2 a. Training details, CFG on conditioning: Lines 229 - 234: My understanding is that CFG is being used on the conditions introduced in this paper, i.e. chroma and chords when training the adapter branch. The pre-trained model takes in CLAP embedding and the audio context as conditioning signals as well. Is CFG being used for these conditions as well during the training of the adapter branch? If so, this should be made clear. It is unclear to me what the “new c” refers to in line 232 as “c” is defined as “the condition c” (chroma and chord) that is introduced to the adapter branch in line 191.
2 b. Table 1: My understanding is that the audio context is not used for any of the models in Table 1 except for the Diff-a-Riff + Context model. Is this correct? My comments depend on whether this assumption is correct:
If this is the case, then I’m a little confused about the relevance of APA, since the model is not seeing the audio prompt at all. Perhaps the goal here is to show that, even without the audio prompt, the generated samples adhere to the context audio with just a chromagram input. If so, it should be explicitly stated. If my assumption is incorrect, then lines 384 - 386 seem to attribute alignment with the context to the chroma conditioning, which wouldn’t make sense if the context is also seen as conditioning.
Additionally, I think cMSE being reported for the Diff-a-Riff models is similarly a little confusing. While I see the value of a standard metric across all models, I believe that if the models are not seeing the chroma conditioning during inference, it should be explicitly stated to avoid confusion.
2 c. Chroma MSE as a metric: I wonder if a per-frame chroma overlap would be a more reliable metric in the context of this work, similar to that used in Music ControlNet [1]. I think MSE introduces biases by penalizing notes that are closer less than notes that are further away, which could be problematic.
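To illustrate the two kinds of metric contrasted in 2 c, the sketch below computes a chroma MSE and a per-frame overlap on toy data. The one-hot chroma frames are hypothetical, and "per-frame overlap" is interpreted here as agreement of the strongest pitch class per frame. That interpretation is an assumption and may not match the exact metric used in Music ControlNet [1] or the cMSE definition in the paper.

```python
# Toy comparison of two chroma-adherence metrics.
# Data and metric definitions are illustrative assumptions.

def chroma_mse(ref, gen):
    """Mean squared error over all frames and all 12 pitch-class bins."""
    total = sum(
        (r - g) ** 2
        for ref_frame, gen_frame in zip(ref, gen)
        for r, g in zip(ref_frame, gen_frame)
    )
    return total / (len(ref) * 12)

def frame_overlap(ref, gen):
    """Fraction of frames whose strongest pitch class matches (assumed metric)."""
    hits = sum(
        max(range(12), key=ref_frame.__getitem__)
        == max(range(12), key=gen_frame.__getitem__)
        for ref_frame, gen_frame in zip(ref, gen)
    )
    return hits / len(ref)

def one_hot(pitch_class):
    """12-bin chroma frame with a single active pitch class."""
    return [1.0 if i == pitch_class else 0.0 for i in range(12)]

reference = [one_hot(0), one_hot(4), one_hot(7)]   # C, E, G
generated = [one_hot(0), one_hot(4), one_hot(8)]   # last frame is off

mse = chroma_mse(reference, generated)
overlap = frame_overlap(reference, generated)
```

Here two of the three frames agree, so the overlap is 2/3, while the MSE averages the two mismatched bins over all 36 bins. How each metric behaves on real, non-binary chromagrams would need to be checked on the paper's data.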
- LiLAC*: LiLAC* is said to be similar to Sketch2Sound [2]. However, Sketch2Sound fine-tunes the backbone after adding the input condition. Is that being done for this baseline? If so, that should be explicitly stated, since this is not the case for the other LiLAC models.
- Samples page: I would be very interested to listen to samples from the misaligned conditions, which I don’t believe are on the samples website right now, especially examples that highlight the behavior stated in lines 436-442.
- Minor corrections: In section 3.2.1, the term ‘adaptor branch’ is never explicitly linked to G_l (..). It is referred to as ‘its cloned counterpart’; I would recommend making the link explicit in the text. There is also a space missing between "Table" and "2" in line 411.
[1] Wu, Shih-Lun, et al. "Music ControlNet: Multiple time-varying controls for music generation." IEEE/ACM Transactions on Audio, Speech, and Language Processing 32 (2024): 2692-2703.
[2] García, Hugo Flores, et al. "Sketch2Sound: Controllable audio generation via time-varying signals and sonic imitations." ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025.
Q2 ( I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Disagree
Q5 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))
Some works on inference-time optimization for controllability are missing, e.g., DITTO [1] and ST-ITO [2].
[1] Novack, Zachary, Julian McAuley, Taylor Berg-Kirkpatrick, and Nicholas J. Bryan. "DITTO: Diffusion inference-time T-optimization for music generation." ICML 2024.
[2] Steinmetz, Christian J., Shubhr Singh, Marco Comunità, Ilias Ibnyahya, Shanxin Yuan, Emmanouil Benetos, and Joshua D. Reiss. "ST-ITO: Controlling audio effects for style transfer with inference-time optimization." ISMIR 2024.
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Disagree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Disagree
Q10 (Please justify the previous choice (Required if "Strongly Disagree" or "Disagree" is chosen, otherwise write "n/a"))
The memory usage advantage is a main claim of this work. However, Figure 1 appears to show that the model consumes more memory than ControlNet due to more "total" parameters, despite having fewer "trainable" parameters. Some experiments/analyses should be done to strengthen this claim on memory savings.
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Disagree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
It's interesting to see that dedicating fewer parameters for fine-grained controls seemingly reduces "bleeding" into other forms of control when those controls are conflictive (Table 2).
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
Fine-grained controls for text-to-audio models can be achieved with far fewer parameters than ControlNet approaches.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Disagree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Weak reject
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
I. Strengths
I'm highly supportive of the direction that this work explores, i.e., to shave off the resources/parameters needed to achieve fine-grained control. It is especially important for musical applications, as different users might have drastically different types of controls they'd like to achieve. The lightweight nature of these methods can meaningfully lower the barrier to achieving plug-and-play controls for various use cases.
I also appreciate the exploration on conflicting controls (Sec 5.3.4 and Table 2) since these use cases are potentially central to musicians' creative inquiries, e.g., making playing techniques / phrases (controlled via fine-grained signals) that are technically impossible with some instruments (controlled via text / CLAP). And, it's nice to see that LiLAC seems to have an edge over the heavier-weight ControlNet.
Besides, some architectural ablations and a listening study are both conducted, which are commendable. However, I think overall the work is less than ready for publication in its current state.
II. Weaknesses
(W1) Missing analyses on memory/efficiency improvements. While the advantage on memory is a repeated claim in the manuscript, there isn't any comment or experiment supporting this claim. Also, Figure 1 suggests that the proposed design actually has more components, and hence more total parameters, than ControlNet -- if I understand correctly, this would worsen inference-time speed and memory footprint (although it is likely still advantageous at training thanks to fewer trainable parameters, hence fewer optimizer states).
Some speed/memory stats compared to ControlNet should be reported. Besides that, the authors could consider other ways where the proposed architecture could shine more efficiency-wise -- perhaps it's combining multiple fine-grained controls, since all controls can share the same encoder backbone, or demonstrating improved sample efficiency (i.e., the required amount of training data to make controls work) which implies better applicability in cases where controls are costly to obtain (e.g., require hand labeling).
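The remark in (W1) about optimizer states can be made concrete with a back-of-the-envelope estimate. The sketch below takes the trainable-parameter counts quoted elsewhere in these reviews (165M for the ControlNet baseline, 32M to 64M for the proposed variants) and assumes an Adam-style optimizer with two fp32 state tensors per trainable parameter. The optimizer choice and precision are assumptions, not details confirmed by the reviews, and the estimate ignores weights, gradients, and activations.

```python
# Rough optimizer-state estimate; optimizer and precision are ASSUMED.
BYTES_FP32 = 4
ADAM_STATES_PER_PARAM = 2  # first and second moment estimates

def adam_state_gb(trainable_params):
    """Optimizer-state memory in GB (10^9 bytes) for the given parameter count."""
    return trainable_params * ADAM_STATES_PER_PARAM * BYTES_FP32 / 1e9

# Trainable-parameter counts as quoted in the reviews.
trainable = {
    "ControlNet baseline": 165e6,
    "Proposed (largest variant)": 64e6,
    "Proposed (smallest variant)": 32e6,
}

for name, n in trainable.items():
    print(f"{name}: {adam_state_gb(n):.2f} GB of optimizer state")
```

Such an estimate only covers the training-time saving the reviewer concedes. It says nothing about inference-time memory or speed, which is where measured statistics are requested.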
(W2) Limited exploration on controls and output space. This work primarily explored single-instrument outputs and harmonic controls (chroma and chords), which I feel is a little narrow, especially considering that prior works like Music ControlNet and DITTO have tackled multi-instrument audio and a wider range of controls (dynamics, rhythm, structure, etc.).
(W3) Insufficient motivation & demonstration on additional experiments. It's great to see experiments that discuss the interactions/conflicts between heterogeneous controls which might contain overlapping information (Tables 2 & 3). Yet, I think more could be done to better ground these explorations in applications, for example by motivating why "less bleeding" is important (and in what use cases) and/or providing generated samples to show that the differences are qualitatively substantial. I have these comments because, from the metric gaps in Tables 2 and 3, it's difficult to judge how much better LiLAC is than ControlNet, as these metrics are, after all, proxies for the actual desiderata, which in turn depend on specific application goals.
(W4) Problematic abstract opening. I would strongly advise revising the first sentence in the abstract ("Text-to-audio ... lack fine-grained controls") as this is not true anymore. Plenty of citations in this manuscript also refute this statement.