Abstract:

Music mastering style transfer aims to model and apply the mastering characteristics of a reference track to a target track, simulating the professional mastering process. However, existing methods apply fixed processing based on a reference track, limiting users' ability to fine-tune the results to match their artistic intent. In this paper, we introduce the ITO-Master framework, a reference-based mastering style transfer system that integrates Inference-Time Optimization (ITO) to enable finer user control over the mastering process. By optimizing the reference embedding during inference, our approach allows users to refine the output dynamically, making micro-level adjustments to achieve more precise mastering results. We explore both black-box and white-box methods for modeling mastering processors and demonstrate that ITO improves mastering performance across different styles. Through objective evaluation, subjective listening tests, and qualitative analysis using text-based conditioning with CLAP embeddings, we validate that ITO enhances mastering style similarity while offering increased adaptability. Our framework provides an effective and user-controllable solution for mastering style transfer, allowing users to refine their results beyond the initial style transfer.

Meta Review:

Q2 (I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

This paper focuses fairly narrowly on reference-based automatic music mastering, but it offers some insights that may apply to work on ITO more broadly, in particular the idea of differentiably optimizing z_ref instead of running black-box optimization over the parameters of audio FX chains.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Inference-time optimization is a promising family of methods for automatic music mastering based on a reference track.

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Weak reject

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper proposes a novel strategy for automatic music mastering based on a reference track. Following past work on controlling audio effects, the authors explore a self-supervised strategy for training FX chain encoders, and inference-time optimization for iteratively refining the automatic mastering process based on the reference. The authors perform extensive quantitative and qualitative evaluation on their proposed approach.

Overall, this is an interesting paper demonstrating promising results on a difficult task. It will be of interest to several sub-communities of researchers within ISMIR working on automatic mastering, inference-time optimization, and differentiable signal processing. However, there is a key issue around the motivation for exploring differentiable synthesis that limits the ability to judge the promise of the proposed method in the broader landscape of recent MIR research. Moreover, there are some areas for improvement around missing baselines, the quantitative evaluation protocol, minor methodological novelty, and the presentation of incomplete follow-up work.

Motivation for differentiable synthesis. L58 says "white box methods ... are often constrained by the simplicity of their differentiable processors, which may not fully replicate the complex tools in professional mastering". However, this paper then goes on to explore slightly more sophisticated differentiable primitives, which are likely always going to have a lower performance ceiling than professional chains. An apples-to-apples comparison against an off-the-shelf black box optimization framework (e.g. ST-ITO) on the parameters of a professional (non-differentiable) mastering toolchain is essential here to justify the decision to go with a differentiable approach.

Additional baselines. It would be very helpful to see a few additional simple baselines: (1) randomized z_ref for black box model, (2) randomized z_ref for white box model, and (3) randomized FX parameters for white box model. As it stands, it's unclear to what extent the benefits of the proposed approach are coming from the audio manipulation primitives vs. the encoding / optimization procedures.

Quantitative evaluation protocol. In the white box setting, why not just directly evaluate the ability of the model to exactly reconstruct the original FX parameters? I.e., just report the error between the ground truth FX parameters and the estimated ones.
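To make the suggestion concrete, a minimal sketch of such a parameter-space metric follows. The parameter names and ranges are hypothetical placeholders, not the paper's actual FX chain. Each error is divided by the parameter's range so that values in different units are comparable.

```python
from typing import Dict, Tuple

# Hypothetical parameter ranges (assumption, not taken from the paper);
# they only serve to put every parameter on a common 0-1 scale.
PARAM_RANGES: Dict[str, Tuple[float, float]] = {
    "eq_gain_db": (-12.0, 12.0),
    "comp_threshold_db": (-60.0, 0.0),
    "limiter_ceiling_db": (-6.0, 0.0),
}


def normalized_param_error(
    truth: Dict[str, float],
    estimate: Dict[str, float],
    ranges: Dict[str, Tuple[float, float]] = PARAM_RANGES,
) -> Dict[str, float]:
    """Per-parameter absolute error, divided by that parameter's range."""
    errors = {}
    for name, (lo, hi) in ranges.items():
        errors[name] = abs(truth[name] - estimate[name]) / (hi - lo)
    return errors


def mean_param_error(
    truth: Dict[str, float],
    estimate: Dict[str, float],
    ranges: Dict[str, Tuple[float, float]] = PARAM_RANGES,
) -> float:
    """Average of the range-normalized errors over all parameters."""
    errors = normalized_param_error(truth, estimate, ranges)
    return sum(errors.values()) / len(errors)


# Illustrative values only: ground-truth chain vs. estimated chain.
truth = {"eq_gain_db": 3.0, "comp_threshold_db": -18.0, "limiter_ceiling_db": -1.0}
estimate = {"eq_gain_db": 0.0, "comp_threshold_db": -24.0, "limiter_ceiling_db": -1.0}

per_param = normalized_param_error(truth, estimate)
overall = mean_param_error(truth, estimate)
```

A metric like this only applies where ground-truth parameters exist, i.e. in the synthetic white-box setting. Different parameter settings can also sound alike, so it would complement the audio-domain metrics rather than replace them.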

Minor methodological novelty. The idea of differentiable optimization of z_ref is interesting, though ultimately somewhat limited in its novelty and in its applicability to non-differentiable settings. Moreover, it is unclear whether this result would hold for other audio production or effects-matching scenarios. This isn't a huge issue, but as it stands it is unclear how reusable this insight is outside the specific context of this paper.

Incomplete follow-up work. Section 5.3 is somewhat interesting but currently unjustified (no experiments or evaluation). Also, why is it important to match a text prompt if an audio reference can be provided? I would much rather see that extra page devoted to a more thorough investigation or analysis of the non-text-conditioned setting.

Errata: L161: according to Figure 1, this should be x_in = f_norm(f_1(A)); the text writes the composition the other way around.

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Weak reject

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

Overall, the recommendation for this paper is weak reject. R1, R2, and I (MR) leaned negative on the work in our initial reviews, while R3 leaned strongly positive. During the discussion, R1 reiterated concerns from their review about the evaluation, real-world practicality, and limited methodological innovation, and R2 reiterated a lack of clarity in the writing. R3 eventually adjusted their score to weakly positive in light of criticisms raised by other reviewers.

Summary of strengths / weaknesses from reviews:

Strengths: well-motivated (R3), diverse set of experiments and baselines (R3)

Weaknesses: arbitrary choice of reference tracks (R1), concerns about generalization from synthetic setup to real-world setting (R1), writing clarity (R2), concerns with practical usefulness (R3/MR), insufficient evaluation protocol (MR)

Review 1:

Q2 (I am an expert on the topic of the paper.)

Disagree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Disagree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Disagree

Q15 (Please explain your assessment of reusable insights in the paper.)

The main contribution of this paper is the application of inference-time optimization (ITO) to automatic mastering, but the paper just refers to existing work for explaining how this is done, rather than sharing findings or insights on how to get this method to work or how to think about it intuitively. If I were applying ITO to some new problem, I don't think I would be able to get much insight on how to approach that from this paper; I'd be better served by going straight to the references.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

This paper applies inference time optimization to automated mastering with the goal of improving model adaptability to a reference track.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak reject

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

As with many MIR tasks, evaluating "similarity" between a mastered track and a reference master is tricky. The devil is usually in the details when it comes to bridging the gap between evaluation metrics in a paper and real-world applicability or impact. For example, what kind of reference tracks are conceivable choices for real-world mastering engineers to use? In some of the audio samples on the demo page, the choices of references seem difficult to justify as useful choices given the stylistic/genre differences. Perhaps a bigger concern is that the mastering FX chains that are being modeled in the data are themselves random rather than specifically targeted because they were used intentionally for mastering. This seems to me to create a significant gap between the actual task at hand (matching a dataset of randomized audio FX chains) and the real-world musical activity used to justify the technical work (mastering music with reference tracks).

While the technical method proposed in the paper does seem to perform comparatively well next to the baselines, I don't think that the application of the method and reporting of these metrics alone are enough to justify publishing this paper in ISMIR. If this is primarily a paper about music mastering, I think that the proxy-mastering task would need to be refined or justified further; otherwise, I could imagine this being reframed to focus more on the ITO method.

Review 2:

Q2 (I am an expert on the topic of the paper.)

Disagree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Disagree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Disagree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

If I understand correctly, inference-time optimization (ITO) combines direct optimization of a reference with transfer network processing. This could be helpful in providing flexible control to style transfer systems.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

A mastering-by-example system is created using components of transfer modeling and a differentiable effects chain.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak reject

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper describes a mastering-by-example system, created using the following components: a transfer network called the Mastering Style Converter, a differentiable effects chain, retraining of the reference encoder, and ITO, which we take to mean direct optimization of the embedding target for the transfer network.

An evaluation is described in which white-box and black-box approaches were compared, including retraining of the reference encoder and the proposed ITO procedure. The comparisons are done using a suite of objective metrics, as well as with a subjective test. A further test was conducted using embeddings from text prompts.

In this work, a complex new system was trained and many experiments were conducted. However, I had some problems understanding the main ideas behind ITO and its motivation. I feel that ITO itself was not clearly explained, either in the introduction or in Section 2.2. A clear statement of what ITO is, combined with a reference, would be helpful. Similarly, a technical definition of FX normalization, and why it is used, would be helpful.

After reading the references I eventually settled that ITO must mean something like in [16], in which ITO means optimizing a latent noise reference through a diffusion process (which additionally requires gradient checkpointing through iterative steps of the algorithm). But the use of ITO here is more like an analogy, because instead of optimizing latent noise, we are optimizing the reference embedding. Furthermore, since diffusion doesn’t seem to be involved, implementation of the gradients is more straightforward than [16].
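The reading of ITO described above can be sketched in a few lines. This is an illustration under strong assumptions, not the authors' implementation: a fixed toy linear map stands in for the trained transfer network, a target feature vector stands in for the reference, and plain gradient descent with a hand-written gradient replaces whatever optimizer and loss the paper uses. Only the embedding is updated; the stand-in network stays frozen.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def convert(z: Vector, weights: Matrix) -> Vector:
    """Toy frozen 'converter': maps embedding z to output features (W @ z)."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in weights]


def loss(z: Vector, weights: Matrix, target: Vector) -> float:
    """Squared error between the converter's output features and the target."""
    out = convert(z, weights)
    return sum((o - t) ** 2 for o, t in zip(out, target))


def grad(z: Vector, weights: Matrix, target: Vector) -> Vector:
    """Analytic gradient of the loss with respect to z: 2 * W^T (W z - target)."""
    residual = [o - t for o, t in zip(convert(z, weights), target)]
    return [
        2.0 * sum(weights[i][j] * residual[i] for i in range(len(weights)))
        for j in range(len(z))
    ]


def ito(z_init: Vector, weights: Matrix, target: Vector,
        steps: int = 200, lr: float = 0.05) -> Vector:
    """Refine only the embedding; the converter weights are never updated."""
    z = list(z_init)
    for _ in range(steps):
        g = grad(z, weights, target)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
    return z


# Hypothetical toy values: a 2x2 frozen map and a 2-dimensional feature target.
W = [[1.0, 0.5], [0.0, 2.0]]
target_features = [1.0, -1.0]
z0 = [0.0, 0.0]  # stands in for the encoder's initial reference embedding

z_star = ito(z0, W, target_features)
loss_before = loss(z0, W, target_features)
loss_after = loss(z_star, W, target_features)
```

In this toy setting the optimization has no iterative sampling loop to backpropagate through, which is consistent with the point that the gradients here are simpler to implement than in a diffusion-based setup.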

In general, I would have appreciated greater clarity in the text and direct mathematical expression (with citations), to better understand the contributions of what seems like a unique system.

Additional questions: Are there any regularizers preventing z_ref from going directly to the new reference? Section 3.4 (Inference-Time Optimization on Reference Embedding) mentions using the Audio-Feature loss from [13] but gives no further details.

In the evaluation, ITO was only performed as an additional step when the reference encoder Phi was retrained. As it seemed to show little marginal benefit for some cases, was it possible to test using a fixed reference encoder?

Review 3:

Q2 (I am an expert on the topic of the paper.)

Strongly agree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q5 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))

Failed to cite: J. Koo, M. Martínez-Ramírez, W-H. Liao, G. Fabbro, M. Mancusi, and Y. Mitsufuji, “ITO-Master: Inference-Time Optimization for Music Mastering Style Transfer”, in Proc. of the 25th Int. Society for Music Information Retrieval Conf., San Francisco, United States, 2024.

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The paper proposes a method for optimising the reference encoder at inference time, which could be extended to fine-tuning for other tasks associated with audio effects, such as music mixing, audio effect style transfer (in the general sense), and synthesizer style transfer. Further, the finding that optimisation on effect parameters performs poorly when the reference has differing content is a useful insight for other style transfer applications.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Style transfer, especially in post-production tasks, can benefit from case-specific fine-tuning or optimisation at inference time, applied after a general deep learning model. This can enable the system to produce improved results for the given example.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The paper proposes a novel approach to improving post-production style transfer using inference-time optimization. It presents a range of experiments with and without optimization, introduces strong baselines and novel metrics, and offers valuable insights into optimization techniques involving reference encoders and effect parameters. Overall, I really enjoyed reading the paper.

Strengths:

The use of inference-time optimization for post-production style transfer is both novel and well-motivated. A diverse set of experiments and baselines helps contextualize the effectiveness of the proposed approach. The comparison of various optimization techniques (reference encoder and effect parameters) provides a useful reference for future research.

Criticisms and Suggestions:

Real-world usability vs. reference alignment: While the proposed method aims to align the output with a given reference, I found that masters from E2Emastering and Matchering subjectively sounded more usable in real-world scenarios, even though they were less aligned with the reference. Future work should consider not only similarity to the reference but also the production value and usability of the output. Reference-based mastering is a curatorial task; if the reference is poorly chosen, the result may be undesirable.

Uncited similarity in Figure 1: Figure 1 appears quite similar to the one in this ISMIR 2024 LBD paper (https://ismir2024program.ismir.net/lbd_446.html), but this is not cited. Please include the citation or clarify the relationship between the figures.

Fx-normalization claims (Lines 151–152): The paper states that Fx-normalization improves model performance, but no supporting metrics or references are provided. Please clarify or provide evidence. Notably, Fx-normalization is typically used in mixing tasks involving wet stems; if this paper builds on that prior work, relevant citations should be included.

Clarification on distortion removal (Line 159): Please elaborate on the reasoning behind the need to remove all distortion.

Percentages in Line 196: The methodology behind the percentage calculations is unclear. Please explain how these values were derived.

Subjective listening tests (Line 414): Based on my listening experience, I preferred the outputs from E2Emastering and Matchering in terms of usability. However, I understand that the objective of your listening tests may have been reference similarity, not user preference. I would encourage future evaluations to include subjective preference ratings in addition to similarity, especially since mastering involves aesthetic and perceptual judgments.

Section 5.3: The paper evaluated only one song. Although the figures show how the CLAP embedding can drive the system to produce different masters, it is unclear whether those masters sound usable, since no audio examples were shared. Further, this was not evaluated using subjective listening tests.

Additional Comment: The introduction of new evaluation metrics for post-production style transfer is commendable, especially given the difficulty of objective evaluation in this domain. However, incorporating user preference and real-world usability into the evaluation pipeline will strengthen the practical relevance of this work.