Adding temporal musical controls on top of pretrained generative models

Sarah Nabi; Nils Demerlé; Geoffroy Peeters; Frederic Bevilacqua; Philippe Esling

Abstract:

Recent advances in deep generative modeling have enabled high-quality models for musical audio synthesis. However, these approaches remain difficult to control, are confined to simple, static attributes and, most importantly, entail retraining a different computationally heavy architecture for each new control. This is inefficient and impractical as it requires substantial computational resources. In this paper, we propose a novel approach that adds time-varying musical controls on top of any pretrained generative model with an exposed latent space (e.g. neural audio codecs), without retraining or finetuning. Our method supports both discrete and continuous attributes by adapting a rectified flow approach with a latent diffusion transformer. We learn an invertible mapping between pretrained latent variables and a new space disentangling explicit control attributes and style variables that capture the remaining factors of variation. This enables both extracting features from an input and editing those features to generate transformed audio samples. Finally, this also introduces the ability to perform synthesis directly from the audio descriptors. We validate our method with four datasets ranging from different musical instruments to full music recordings, on which we outperform state-of-the-art task-specific baselines in terms of both generation quality and accuracy of the control by inferring transferred attributes. Our code is available on the supporting webpage.
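The invertible mapping mentioned in the abstract can be illustrated with a toy flow. This is a minimal sketch under stated assumptions, not the authors' implementation: `velocity` is a hypothetical fixed linear field standing in for the trained latent diffusion transformer, and `integrate` and `rectified_flow_target` are illustrative helper names. It shows that integrating an ODE forward maps latents to a new space, that integrating it backward approximately recovers them, and how the usual rectified-flow regression target is built from a straight-line interpolation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4
# Hypothetical stand-in for a trained velocity network: a fixed linear field.
A = 0.3 * rng.standard_normal((DIM, DIM))


def velocity(x, t):
    """Toy velocity field v(x, t); a real model would be a neural network."""
    return x @ A.T


def integrate(x, t_start, t_end, steps=200):
    """Integrate dx/dt = v(x, t) from t_start to t_end with midpoint steps."""
    dt = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        k1 = velocity(x, t)
        k2 = velocity(x + 0.5 * dt * k1, t + 0.5 * dt)
        x = x + dt * k2
        t += dt
    return x


def rectified_flow_target(x0, x1, t):
    """Straight-line interpolant x_t and the constant velocity target x1 - x0."""
    x_t = (1.0 - t) * x0 + t * x1
    return x_t, x1 - x0


# Forward pass: latents -> new space. Backward pass: new space -> latents.
latents = rng.standard_normal((8, DIM))
mapped = integrate(latents, 0.0, 1.0)
recovered = integrate(mapped, 1.0, 0.0)
roundtrip_error = float(np.max(np.abs(recovered - latents)))
print("max round-trip error:", roundtrip_error)
```

The round trip is only approximately exact because the ODE is solved numerically; more steps or a higher-order solver reduce the error.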

Meta Review:

Q2 (I am an expert on the topic of the paper.)

Disagree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

Similar generative models could be built using the proposed approach, by using different datasets and defining custom control variables. The results also give some insight into which musical aspects could serve as a good set of user control variables (sufficiently disentangled, interpretable, etc.) when designing generation systems.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Time-varying musical controls can be added to a pre-trained generative model with latent variables, without needing to train the generative model, by training an invertible mapping between latent and control space

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Strong accept

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper pursues a valuable research direction: as pre-trained models become more and more difficult to train or fine-tune, current methods for adding musical controls to them become less feasible.

The proposed method, while not entirely novel as it is mostly adapted from PluGeN, does entail a small extension for time-varying controls and makes a convincing case (theoretically and experimentally) for why that extension is needed.

It delivers a good set of experiments, featuring numerous baselines, different types of tasks (retrieval, editing, generation) as well as application domains (monophonic, polyphonic single instrument, full music). Using MSE as evaluation metric for the melody extraction task, while fulfilling its job in the context of the paper, is unusual, so adding some clarification on why the usual F1-based scores (e.g. Overall Accuracy) are not used would be helpful.
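To make the metric question concrete, the sketch below contrasts a frame-wise MSE with a simplified overall-accuracy score on the same pitch tracks. It is an illustration only: `overall_accuracy` is a reduced version of the usual melody-extraction definition (correct voicing, and pitch within 50 cents on voiced frames), not the evaluation code used in the paper or in any library, and the convention that 0 Hz marks an unvoiced frame is an assumption.

```python
import numpy as np


def overall_accuracy(ref_hz, est_hz, tol_cents=50.0):
    """Fraction of frames with correct voicing and, if voiced, pitch within tol.

    Assumption: a value of 0 Hz marks an unvoiced frame.
    """
    ref_hz = np.asarray(ref_hz, dtype=float)
    est_hz = np.asarray(est_hz, dtype=float)
    ref_voiced = ref_hz > 0
    est_voiced = est_hz > 0
    both = ref_voiced & est_voiced
    pitch_ok = np.zeros(ref_hz.shape, dtype=bool)
    cents = 1200.0 * np.abs(np.log2(est_hz[both] / ref_hz[both]))
    pitch_ok[both] = cents <= tol_cents
    correct = pitch_ok | (~ref_voiced & ~est_voiced)
    return float(np.mean(correct))


def voiced_mse_semitones(ref_hz, est_hz):
    """Mean squared pitch error in semitones over frames voiced in both tracks."""
    ref_hz = np.asarray(ref_hz, dtype=float)
    est_hz = np.asarray(est_hz, dtype=float)
    both = (ref_hz > 0) & (est_hz > 0)
    if not np.any(both):
        return float("nan")
    diff = 12.0 * np.log2(est_hz[both] / ref_hz[both])
    return float(np.mean(diff ** 2))


# Four frames: exact match, octave error, unvoiced match, slightly sharp.
ref = [220.0, 220.0, 0.0, 440.0]
est = [220.0, 440.0, 0.0, 445.0]
print("overall accuracy:", overall_accuracy(ref, est))
print("MSE (semitones^2):", voiced_mse_semitones(ref, est))
```

A single octave error dominates the MSE, while the accuracy score counts it as one wrong frame, which is one reason the two metrics can rank systems differently.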

Another point that needs a bit more clarification is the choice of control variables. It seems they were carefully chosen, but that might limit applicability to other tasks, as it is not clear how they were selected (e.g. just four tags for tagging, but all basic pitch features for melody control). Do control variables need to be very independent of each other, and what happens if they are not? Are there any guidelines for selection?
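One simple way to probe the independence question is to inspect pairwise correlations between candidate control attributes before training. The sketch below is a hypothetical diagnostic, not something described in the paper: the attribute names, the synthetic data and the 0.8 threshold are all assumptions, and Pearson correlation only captures linear dependence.

```python
import numpy as np


def attribute_correlation(attributes):
    """Return attribute names and their Pearson correlation matrix.

    attributes: dict mapping a name to a 1-D array of frame-wise values.
    """
    names = list(attributes)
    data = np.stack([np.asarray(attributes[n], dtype=float) for n in names])
    return names, np.corrcoef(data)


def flag_dependent_pairs(names, corr, threshold=0.8):
    """List attribute pairs whose absolute correlation reaches the threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


# Synthetic example: "brightness" is built to follow "pitch" closely,
# while "loudness" is drawn independently.
rng = np.random.default_rng(0)
pitch = rng.standard_normal(2000)
controls = {
    "pitch": pitch,
    "brightness": pitch + 0.1 * rng.standard_normal(2000),
    "loudness": rng.standard_normal(2000),
}
names, corr = attribute_correlation(controls)
dependent = flag_dependent_pairs(names, corr)
print(dependent)
```

Pairs flagged this way are candidates for merging, dropping, or at least discussing in the text.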

The paper is well written and flows well overall. One potential issue is that I could not fully understand how the SDEdit baseline works, as in, how the edited version is created once an input example is mapped to its corresponding noise vector.

Minor issues:
L37 - Reference needed for this claim
L51 - Reference would be helpful here

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Accept

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

Summary of Reviews

This paper proposes a method to introduce time-varying control over pretrained generative models by learning an invertible mapping between the latent space and a disentangled control space. The approach avoids retraining the generative model and is demonstrated across multiple tasks including editing, retrieval, and conditional synthesis.

Three reviewers gave strong accept recommendations, and one a weak accept. The reviewers agreed that the paper is clearly written, scientifically sound, and relevant to the ISMIR community. While the model and core ideas build on prior work (notably PluGeN), the extension to temporal control is seen as a meaningful and well-executed contribution. The breadth of evaluation is also a strong point.

Some concerns were raised regarding missing ablation studies, unclear implementation details for continuous controls, and assumptions about the independence of control variables. However, these issues can be considered relatively minor, as they can be addressed by improving the writing for a camera-ready version.

Final recommendation

This is a well-executed paper that makes a timely and practical contribution to controllable music generation. While its methodological novelty is relatively incremental, the practical impact is significant. I recommend acceptance, with the expectation that minor issues be addressed in the final version.

Review 1:

Q2 (I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

This paper ambitiously presents a framework for mapping from a latent space to a control space of the same dimensionality while disentangling control attributes. The paper continues to apply the general framework to three representative tasks with mixed results that are enticing to read. One can obviously try to apply the idea in a different musical context with any control attributes that are specific to any genre of music.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

A rectified-flow-based mapping method enables musically meaningful control of otherwise highly unexplainable audio-codec latent space vectors, thanks to the method's power in disentanglement.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Strongly agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The paper is well written, with good organization and story-telling ability, which allows a reader who is relatively unaware of SOTA progress to follow what has been done. As mentioned earlier in this review, I believe the proposed framework addresses an important general question and has potential for a wide range of applications. Below, I list a few comments for the authors to consider:

I suppose the notion of "time" in Eq. (2) is different from the usual definition of time in music or digital signal processing. Though this is not a new idea in machine learning and should be clear from the context, the authors may want to point out the difference explicitly (perhaps with a footnote) to avoid causing any potential confusion. Same suggestion for "frame-rate" near the end of page 3.
Somehow, line 163 contains multiple lines -- but it appears that [0,M_k]^K should be changed to [0, M_k-1]^K if M_k is the number of classes for each attribute.
Figure 2: thank you for a very nice illustration.
Line 307 and 357: conditionnal -> conditional
Regarding the anonymous demo page, here are some comments: (a) In the "pitch" plots, what does the control variable represent? At first I thought the plots are pitch contours, but then some results do not look right. (b) for audio editing: I feel that the proposed method is indeed better than AFTER, but there is definitely room for further improvement in the future in terms of the synthesized sound clarity. (c) Conditional synthesis: the present results are quite thought-provoking. The female singing example especially piques my interest since the results indeed sound like somebody is humming with free pronunciation.
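The indexing remark about line 163 can be checked with a small example. This sketch assumes, as the comment above does, that M_k denotes the number of classes of attribute k; `one_hot` is an illustrative helper, not code from the paper.

```python
import numpy as np


def one_hot(labels, num_classes):
    """One-hot encode integer labels that must lie in [0, num_classes - 1]."""
    labels = np.asarray(labels)
    if labels.min() < 0 or labels.max() > num_classes - 1:
        raise ValueError(
            f"labels must lie in [0, {num_classes - 1}] for {num_classes} classes"
        )
    return np.eye(num_classes)[labels]


M_k = 4  # number of classes for one attribute (assumed meaning of M_k)
encoded = one_hot([0, 2, 3], M_k)
print(encoded.shape)

# The label M_k itself is out of range when there are M_k classes.
try:
    one_hot([M_k], M_k)
except ValueError as err:
    print("rejected:", err)
```

With M_k classes indexed from zero, the largest valid label is M_k - 1, which is the range the correction proposes.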

Review 2:

Q2 (I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Strongly agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Invertible mapping between an uncontrollable pre-trained latent space and a controllable attribute + style space.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The problem statement is clear. The authors attempt to add time-varying controls on top of existing generative models without requiring retraining.
With SDE-Edit and PluGeN, the baselines are sufficient.
The theoretical part largely inherits the ideas from PluGeN, with the main novelty being the addition of time-varying control and the use of rectified flow. A few points to consider:

3.1 The core of adding time-varying control primarily lies in the alignment step. It would be helpful to describe the alignment process in more detail. For example, do you tweak the parameters of the libraries to ensure consistent hop sizes? Do you apply high-level descriptors using a single value for the entire song across the temporal axis?

3.2 You may want to explain how you compute \sigma_i. Furthermore, how are a_max and a_min defined? Are they per-batch, per-song, per-dataset, or heuristically defined global values?

3.3 Please write Equation (10) more rigorously. For example: L_\theta = min_\theta E_{t \sim [0, 1]} (||...||_2^2)
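Points 3.1 and 3.2 can be made concrete with a short sketch of one possible answer. Everything here is an assumption for illustration: the frame rates, the helper names, and in particular the choice of dataset-level a_min and a_max are not confirmed by the paper. The code resamples a frame-wise descriptor onto the latent time grid, repeats a song-level value across time, and rescales a continuous attribute with global bounds.

```python
import numpy as np


def align_to_latent_frames(values, src_rate, num_latent_frames, latent_rate,
                           discrete=False):
    """Resample a frame-wise descriptor onto the latent time grid.

    Continuous descriptors are linearly interpolated; discrete ones take the
    nearest source frame so that no in-between class values are created.
    """
    values = np.asarray(values, dtype=float)
    src_times = np.arange(len(values)) / src_rate
    dst_times = np.arange(num_latent_frames) / latent_rate
    if discrete:
        idx = np.round(dst_times * src_rate).astype(int)
        return values[np.clip(idx, 0, len(values) - 1)]
    return np.interp(dst_times, src_times, values)


def broadcast_global(value, num_latent_frames):
    """Repeat a single song-level value (e.g. a tag) across all latent frames."""
    return np.full(num_latent_frames, value)


def dataset_min_max(tracks):
    """Compute a_min and a_max over every track of the dataset (assumed scope)."""
    a_min = min(float(np.min(t)) for t in tracks)
    a_max = max(float(np.max(t)) for t in tracks)
    return a_min, a_max


def normalize(values, a_min, a_max):
    """Rescale values to [0, 1] using the given global bounds."""
    return (np.asarray(values, dtype=float) - a_min) / (a_max - a_min)


# Hypothetical rates: descriptor at 100 frames/s, latents at 50 frames/s.
loudness = [0.0, 1.0, 2.0, 3.0]
aligned = align_to_latent_frames(loudness, 100.0, 2, 50.0)
tag_track = broadcast_global(1, 2)
a_min, a_max = dataset_min_max([[1.0, 5.0], [-2.0, 3.0]])
scaled = normalize([-2.0, 5.0], a_min, a_max)
print(aligned, tag_track, a_min, a_max, scaled)
```

Per-batch or per-song bounds would make the same raw value map to different normalized controls across examples, which is why the scope of a_min and a_max matters for editing.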

The experiments can potentially be improved, in particular:

4.1 There are no ablation studies. Given the introduction of both time-varying control and rectified flow as extensions of the PluGeN baseline, these two components should be ablated separately before presenting the full comparison in Tables 1 and 2.

4.2 It may also be beneficial to organize the comparison tables in a way that is consistent with Section 4.2.

Review 3:

Q2 (I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Strongly agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

This paper proposes a method to incorporate controllability of generation into any pre-trained generative model with an exposed latent space. This can be easily adapted to any relevant model.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

This paper proposes a method to condition pre-trained generative models by learning a mapping between entangled latent codes from the models into a controllable, interpretable and disentangled space.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Strongly agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper presents a solution to control existing pre-trained generative models by manipulating their latent codes. By learning a mapping between the latent space and a new space where the dimensions correspond to various user defined conditions, the authors show that generations exhibit increased controllability while maintaining fidelity of generation. Through 3 tasks, reconstruction, editing and conditional synthesis, they show that their model performs reasonably better than competitive baselines on multiple datasets.

I think the paper does a very convincing job of presenting the quality of their model with well thought out experiments and discussion. The method is light-weight and impactful due to its ability to work on any exposed latent space. There will also be a release of code which is much appreciated. I only have some small suggestions which I list below:

1. Handling of continuous variables a_c (section 3.1): Eq 7 multiplies the normal distribution function for values of i ranging from 0 to M_k. These values are discrete class values that the attribute a_k can take (interpreted from the paragraph between lines 162 - 163). However, there is no mention of how this equation can be adapted for continuous attributes, even though this is an important part of the experiments and contributions presented in this paper.
2. Algorithm 1: a_min and a_max haven’t been defined anywhere. I think it should be clarified that these values are calculated across the dataset and not on a per-sample basis.
3. Independence of control variables: The PluGeN framework assumes that the control signals are independent. However, the control signals defined in this paper aren’t necessarily independent: for instance, the instrument label could be correlated with the pitch distribution, octave, sharpness and so on. Perhaps these correlations aren’t strong enough to significantly affect performance. Either way, I believe this should be addressed in the text.
4. Difference between table 1 and table 2: Is the difference just that the models in table 2 have additional continuous descriptors? I’m surprised by how the onset F1 score and instrument accuracy values degrade just by adding another control variable. Is there an intuition for why this is the case? It would also be interesting to hear the difference between these samples.
5. Listening tests: It would be helpful and more convincing to have preference scores from actual humans, especially for higher level features like emotions. It is difficult to simply trust a quantitative method for such conditions.
6. Samples page (similar to previous point): I think the samples page is very well organized and is impressive. I would love to see examples of the emotion based generations there as well. Since these features are pretty high-level, I think it is important both to ground them in human preference and to highlight examples of them.
7. Minor corrections:
a. Line 273: synthetis -> synthesis
b. Line 307, 5.1.3 heading: conditionnal -> conditional

P7-6: Adding temporal musical controls on top of pretrained generative models

Sarah Nabi, Nils Demerlé, Geoffroy Peeters, Frederic Bevilacqua, Philippe Esling

Presented In-person

4-minute short-format presentation