P5-14: Fx-Encoder++: Extracting Instrument-Wise Audio Effects Representations from Mixtures
Yen-Tung Yeh, Junghyun Koo, Marco Martínez-Ramírez, Wei-Hsiang Liao, Yi-Hsuan Yang, Yuki Mitsufuji
Subjects: Open Review ; Knowledge-driven approaches to MIR ; Musical features and properties ; Applications ; Music composition, performance, and production ; Representations of music
Presented In-person
4-minute short-format presentation
General-purpose audio representations have proven effective across diverse music information retrieval applications, yet their utility in intelligent music production remains limited by insufficient understanding of audio effects (Fx). Although previous approaches have emphasized audio effects analysis at the mixture level, this focus falls short for tasks demanding instrument-wise audio effects understanding, such as automatic mixing. In this work, we present Fx-Encoder++, a novel model designed to extract instrument-wise audio effects representations from music mixtures. Our approach leverages a contrastive learning framework and introduces an ``extractor'' mechanism that, when provided with instrument queries (audio or text), transforms mixture-level audio effect embeddings into instrument-wise audio effect embeddings. We evaluated our model across retrieval and audio effect parameter matching tasks, testing its performance across a diverse range of instruments. The results demonstrate that Fx-Encoder++ outperforms previous approaches at the mixture level and show a novel ability to extract effects representations instrument-wise, addressing a critical capability gap in intelligent music production systems.
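For context, the query-based ``extractor'' described in the abstract could be realized roughly as follows. This is a hypothetical sketch: the use of cross-attention pooling, the module names, and the dimensions are assumptions for illustration, not the authors' actual architecture.

```python
# Hypothetical sketch of a query-conditioned "extractor" (not the authors' code):
# an instrument query embedding (e.g., from CLAP) attends over mixture-level
# Fx features and pools them into an instrument-wise Fx embedding.
import torch
import torch.nn as nn

class QueryExtractor(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, mix_tokens: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # mix_tokens: (batch, time, dim) mixture-level Fx features
        # query:      (batch, dim) CLAP embedding of an audio or text query
        q = query.unsqueeze(1)                          # (batch, 1, dim)
        pooled, _ = self.attn(q, mix_tokens, mix_tokens)
        return self.proj(pooled.squeeze(1))             # (batch, dim) instrument-wise Fx embedding

# Example with random tensors:
extractor = QueryExtractor()
inst_fx = extractor(torch.randn(2, 128, 512), torch.randn(2, 512))  # -> shape (2, 512)
```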
Q2 ( I am an expert on the topic of the paper.)
Agree
Q3 ( The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work.)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Disagree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
The paper has some reusable insights, e.g., that the overall approach of combining mixture-level and instrument-level (through querying) effects representation learning can yield better results at both levels and handle complex effects chains well, but suffers in the presence of novel timbral characteristics.
Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)
Contrastive FX representation learning augmented with query-based extraction yields improved representations at both the instrument and mixture level.
Q17 (This paper is of award-winning quality.)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Disagree
Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)
Weak accept
Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
Summary
This paper presents an evolution of FX-Encoder that extracts audio effects representations both from mixtures and from queried instruments within those mixtures. The authors evaluate the model on effect parameter retrieval as well as on effect parameter estimation via inference-time optimization, and find that the proposed method outperforms prior work in most scenarios.
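For context, effect parameter estimation via inference-time optimization is typically set up along the following lines; this is a generic sketch in which apply_fx and fx_encoder are hypothetical placeholders (a differentiable effects chain and a frozen Fx embedding model), not the authors' implementation.

```python
# Generic inference-time optimization for effect parameter matching (not the
# authors' implementation). apply_fx and fx_encoder are hypothetical placeholders:
# a differentiable Fx chain and a (frozen) Fx embedding model, respectively.
import torch

def match_fx_params(dry, ref_embedding, apply_fx, fx_encoder, n_params=8,
                    steps=200, lr=1e-2):
    params = torch.zeros(n_params, requires_grad=True)    # unconstrained parameters
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        wet = apply_fx(dry, torch.sigmoid(params))        # render with current settings
        sim = torch.cosine_similarity(fx_encoder(wet), ref_embedding, dim=-1).mean()
        loss = 1.0 - sim                                  # pull embedding toward the reference
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(params).detach()                 # parameters normalized to [0, 1]
```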
Comments
Overall, this paper presents an advancement of audio effects representation learning models. While prior models operated at either the mixture or instrument level, this work proposes a solution that works at both levels. The proposed methods in the paper seem sound and perform better than prior work in almost all scenarios, but there is still much room for improvement. Unfortunately, the paper has minimal error analysis, providing little insight into how such models can be improved in the future. Furthermore, many details about the evaluation are missing, including those about the evaluation dataset as well as a whole table of results related to Section 5.2. Such omissions reduce both the reproducibility of the paper and its reusable insights.
Specific Comments
Section 4.1 - More detail is needed about the construction of the evaluation dataset, e.g., how the effects and parameters are sampled. How were the mixtures constructed (are there ever multiple instances of the same class)? Can there be single instruments? Is the effect parameter evaluation set distinct from the training set?
Section 5.1 - The statement 'Notably, even when using high-quality source separation, we observe a clear gap between the "Target Instrument" and "MSS(m)"' seems to be incorrect. ST-ITO w/ MSS(m) performs better than target extraction.
Section 5.2 - The table referred to in this section seems to be completely absent. The results are not present.
Section 6 - What is 'VA modeling'? This has not been defined in this paper.
Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)
Accept
Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))
The majority of reviewers agreed that this paper is addressing an important topic and is generally well-written. Some noted that the evaluation could be stronger, e.g., with an ablation study, but the majority agreed that this should be included in the ISMIR program. That said, reviewers noted aspects of the writing that could be improved before a final submission. For example, while reviewers could envision the application and utility of this work, they noted that the authors should more clearly communicate the motivation and potential applications of the work, particularly regarding the instrument-specific embeddings. Furthermore, reviewers said that authors should provide a clearer interpretation of the instrument-specific embedding evaluation.
Q2 ( I am an expert on the topic of the paper.)
Strongly agree
Q3 (The title and abstract reflect the content of the paper.)
Strongly agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Strongly agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Disagree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Strongly agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Strongly agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Strongly agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Disagree
Q15 (Please explain your assessment of reusable insights in the paper.)
- The core idea of learning instrument-specific effect embeddings directly from mixtures.
- The use of SimCLR-based contrastive learning, combined with hand-crafted hard negatives and effect-normalization strategies, is a promising technique that may be reusable in other representation learning tasks involving fine-grained audio transformations (a minimal sketch follows this list).
- The architectural strategy of jointly optimizing for both mixture-level and instrument-level objectives might inspire similar dual-view training regimes in other multimodal or layered signal contexts.
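For reference, a SimCLR-style (NT-Xent) contrastive loss with additional hand-crafted hard negatives, as mentioned in the second point above, can be sketched as follows. This is a generic, single-direction illustration; the shapes and the way hard negatives are appended to the candidate pool are assumptions, not the paper's exact formulation.

```python
# Minimal NT-Xent (SimCLR-style) loss with extra hand-crafted hard negatives
# appended to the candidate pool. Generic illustration only; not the paper's loss.
import torch
import torch.nn.functional as F

def nt_xent(z_a, z_b, z_hard=None, tau: float = 0.1):
    # z_a, z_b: (N, D) embeddings of two views sharing the same Fx chain
    # z_hard:   (M, D) optional hard negatives (e.g., same content, different Fx)
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    pool = z_b if z_hard is None else torch.cat([z_b, F.normalize(z_hard, dim=-1)])
    logits = z_a @ pool.t() / tau           # (N, N [+ M]) similarities over temperature
    targets = torch.arange(z_a.size(0))     # positive for row i is column i (its paired view)
    return F.cross_entropy(logits, targets)

# Example with random embeddings:
loss = nt_xent(torch.randn(8, 512), torch.randn(8, 512), z_hard=torch.randn(16, 512))
```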
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
This paper proposes FX-Encoder++, a model for extracting instrument-specific effect embeddings directly from music mixtures without source separation.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Disagree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Weak reject
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
This paper presents FX-Encoder++, an innovative model addressing a key challenge in music technology: extracting instrument-specific audio effect representations directly from full music mixtures without the need for source separation. The method leverages a pretrained CLAP encoder alongside an “extractor” mechanism, enabling both audio and text queries. A carefully designed contrastive learning framework underpins the training process: FX-Normalization, consistent instrument composition, hand-crafted hard negatives, and the dual objective ensure robust learning. However, I have several concerns and questions regarding the manuscript:
Motivation and Problem Definition
The main concern is: why is it important to extract instrument-specific embeddings from mixtures?
The authors state in lines [39–45]: “Applications in this domain require..., how they shape the overall sound of a complete mixture (‘mixture level’) and how they transform individual instruments within that mixture (‘instrument level’).” This statement reads more as a definition constructed for the purpose of this work than as a widely accepted requirement of FX-specific representation learning. Based on my understanding, the goal of FX-specific representation learning is to learn embeddings that are specific to audio effect transformations (rather than invariant to them, as in general-purpose embeddings). Therefore, the motivation for why a versatile understanding of both mixture-level and single-instrument content is necessary should be clarified and better grounded in prior literature. From the citations in the introduction [16, 17], it seems that only guitar FX classification requires instrument-wise embeddings, and that task is not directly evaluated in this work (nor does it extract embeddings from a mixture). In summary, how is an instrument embedding extracted from a mixture actually used in music production?
Instrument-Specific Conditioning
- Line [275]: How is conditioning performed via the MLP layer, and how does the model attend to effect-related features? Are the mixture-level embedding and the CLAP query embedding concatenated? Is Adaptive Layer Normalization used? What specific operations are involved?
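For reference, adaptive layer normalization (AdaLN) conditioning, one of the mechanisms asked about above, generally takes the following form. This is a generic sketch for illustration only, not the paper's conditioning mechanism.

```python
# Generic adaptive layer normalization (AdaLN) conditioning, shown only to
# illustrate the mechanism asked about above; not the paper's conditioning.
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) features; cond: (batch, 1, cond_dim) query embedding
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale) + shift

# Example: condition mixture-level features on a query embedding.
x = torch.randn(2, 128, 512)
cond = torch.randn(2, 1, 512)
print(AdaLN(512, 512)(x, cond).shape)  # torch.Size([2, 128, 512])
```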
Writing and Presentation
- Line [160]: The phrase “is generate” is inappropriate here. Since your model is neither a generative model nor one involving stochastic sampling, please revise this terminology.
- Figure 1: The figure and caption are confusing. It is unclear whether “splitting source tracks” refers to time-based segmentation or instrument-wise source separation, and the inconsistent icons (e.g., guitar, drums, mixture) add to the confusion.
- Lines [239–240]: CLAP supports both audio and text queries during training and inference, so why do these lines mention only audio? The main advantages of using CLAP are that (1) it can handle multimodal input (audio and text) and (2) it works even with zero-shot (unseen) words; otherwise, it is no different from using a random text embedding (a one-hot instrument class embedding) or any audio embedding. It would be good to state the purpose of using the CLAP encoder.
Evaluation
- Lines [361–362]: Are there any overlaps between the 8 instrument queries and MoisesDB? Please clarify whether this evaluation setup is in-domain or out-of-domain.
- Line [372]: What are the queries and targets in the mixture-level retrieval evaluation? The phrase “testing effect identification in complete mixtures or isolated recordings” is ambiguous: are these the query types or the retrieval targets?
Minor Comments
- Line [357]: The task of query-by-audio retrieval should be supported by references on audio retrieval [ref1] or audio fingerprinting [ref2], not solely CLAP; CLAP primarily addresses text-to-audio retrieval. (If I were to comment on the metrics rather than the retrieval task here, I would note that Recall and Precision are not proposed in the CLAP paper.) [ref1] Disentangled multidimensional metric learning for music similarity, ICASSP 2020. [ref2] Neural Audio Fingerprint for High-specific Audio Retrieval based on Contrastive Learning, ICASSP 2021.
- Line [376]: The phrase “directly from mixtures” is vague and should be rephrased for clarity.
I believe this paper presents impressive experiments and result tables, and it is a valuable contribution. However, due to (1) the insufficient discussion on the necessity of instrument-specific audio effects and (2) the relatively weak writing quality in the methods and evaluation sections, I recommend a weak reject.
To strengthen the work, I suggest the following reframing of the paper:
1. Clearly articulate how fine-grained, instrument-wise information can improve the quality of mixture embeddings.
2. Consider adding an instrument-specific music production task as a downstream application.
Q2 ( I am an expert on the topic of the paper.)
Disagree
Q3 (The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Disagree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
By avoiding music source separation algorithms (e.g., by training with both processed and dry tracks), it is possible to obtain better models specialized in extracting instrument-specific audio effect representations from mixtures.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
The proposed processing pipeline allows for transforming mixture-level embeddings into instrument-specific embeddings leveraging CLAP.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Strong accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
The authors propose Fx-Encoder++, a novel model that extracts instrument-wise audio effects representations from music mixtures using a contrastive learning framework and an extractor mechanism guided by audio or text queries. They demonstrate that their approach outperforms previous methods on retrieval and parameter matching tasks, effectively advancing intelligent music production through improved understanding of audio effects at the instrument level.
I find the work very interesting and well-written, however there are some points that are not clear. I have some comments that I would like the authors to address in order to improve the quality of the manuscript.
- Introduction, line 55: What do you mean by “However, they focus only on modeling the aggregate result, rather than identifying how effects have been applied to each instrument”? Do you mean they model only the full Fx-chain and not each single Fx?
- Sec. 3.1, line 177: “iff” -> “if”.
- Equation (2): please, define the operator \sim{.} and N.
- Sec. 3.1, line 200: What do you mean by “negative” and “positive” pairs?
- Sec. 3.1, line 211: Is there a requirement for the effects to be differentiable?
- Sec. 3.1, line 290: Can you say more about the scheduling used to introduce the instrument-level loss? Is \lambda_{inst} increased linearly, or does it follow a cosine ramp? Can you also report the results obtained without such scheduling, or when swapping the paradigm? (A purely illustrative sketch of the two ramp options follows this list.)
- I do not know if I missed it, but can you give information on what the audio query is supposed to look like?
- Sec. 5.1, line 421: please define USS. You should also explicitly define q_{text} and q_{audio}. To be consistent and precise, also define MSS.
- Sec. 5.2, line 453: I cannot find any definition of L_d, which you use to quantify how well the methods match the effect parameters. Please add more information.
- I strongly suggest adding a GitHub page with audio examples to further clarify the performance of the models.
- Can you also provide applications of such instrument-specific embeddings?
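As a purely illustrative aid to the scheduling question above (line 290), the two ramp schedules mentioned could look like this; the function name, maximum value, and usage line are placeholders, not details taken from the paper.

```python
# Illustrative loss-weight ramps for lambda_inst (linear vs. cosine); placeholders only.
import math

def lambda_inst(step: int, total_steps: int, mode: str = "linear", lam_max: float = 1.0) -> float:
    t = min(step / max(total_steps, 1), 1.0)
    if mode == "linear":
        return lam_max * t                                    # linear ramp from 0 to lam_max
    return lam_max * 0.5 * (1.0 - math.cos(math.pi * t))      # cosine ramp from 0 to lam_max

# Hypothetical usage inside a training loop:
# total_loss = loss_mix + lambda_inst(step, warmup_steps, mode="cosine") * loss_inst
```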
Q2 ( I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Strongly agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Strongly agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Agree (Novel topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Strongly agree
Q15 (Please explain your assessment of reusable insights in the paper.)
The learnable extractor proposed in this paper is a useful mechanism for the task of extracting instrument-wise audio effect representations from mixtures.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
This paper proposes Fx-Encoder++, which enables the extraction of audio effect representations for each instrument from mixtures.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Strong accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
This paper tackles the important and novel problem of extracting instrument-wise audio effect representations from mixtures, an emerging challenge in the field of intelligent music production, through an original approach that includes mechanisms such as the extractor module. The demonstrated high performance on the audio effect retrieval task marks a great advancement in the field. Particularly noteworthy is the finding illustrated in Figure 2, where retrieval performance improves as the number of effects increases, a compelling result. Although there are some limitations, such as parameter matching performance and understanding of single effects, these are challenges that could be addressed in future work. Given the novelty and potential impact of the proposed approach, I consider this paper a valuable contribution to the ISMIR community.
That said, one point of concern is the lack of ablation studies for the proposed method. As a result, it is unclear which techniques contribute to the observed performance improvements. For instance, in the contrastive learning setup, techniques like Fx-Normalization and Hand-Crafted Hard Negative Samples are employed, but the individual effects of these components on retrieval and classification performance have not been evaluated. Investigating these aspects is important to validate the soundness of the method. Therefore, I do not believe this paper is suitable for an award recommendation.