P6-12: Assessing the Alignment of Audio Representations With Timbre Similarity Ratings

Haokun Tian, Stefan Lattner, Charalampos Saitis

Subjects: Evaluation methodology ; Representations of music ; Evaluation, datasets, and reproducibility ; Similarity metrics ; Open Review ; MIR tasks ; Timbre, instrumentation, and singing voice ; Musical features and properties

Presented In-person

4-minute short-format presentation

Abstract:

Psychoacoustical so-called "timbre spaces" map perceptual similarity ratings of instrument sounds onto low-dimensional embeddings via multidimensional scaling, but suffer from scalability issues and are incapable of generalization. Recent results from audio (music and speech) quality assessment as well as image similarity have shown that deep learning is able to produce embeddings that align well with human perception while being largely free from these constraints. Although the existing human-rated timbre similarity data is not large enough to train deep neural networks (2,614 pairwise ratings on 334 audio samples), it can serve as test-only data for audio models. In this paper, we introduce metrics to assess the alignment of diverse audio representations with human judgments of timbre similarity by comparing both the absolute values and the rankings of embedding distances to human similarity ratings. Our evaluation involves three signal-processing-based representations, twelve representations extracted from pre-trained models, and three representations extracted from a novel sound matching model. Among them, the style embeddings inspired by image style transfer, extracted from the CLAP model and the sound matching model, remarkably outperform the others, showing their potential in modeling timbre similarity.

Meta Review:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 ( The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

See below

Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)

This paper addresses the problem of evaluating how well audio representations, both handcrafted and learned, align with human perceptions of timbre similarity.

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Weak accept

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper addresses the problem of evaluating how well audio representations, both handcrafted and learned, align with human perceptions of timbre similarity. Building on the concept of timbre spaces known from psychoacoustic studies, the authors present a well-structured evaluation framework and metrics to benchmark a wide range of audio models and datasets.

The study shows that handcrafted features like MFCCs remain competitive, and although recent models offer only modest improvements, the insights gained are relevant and timely. The analysis helps clarify the current landscape of timbre-aware audio embeddings.

That said, the paper could benefit from clearer explanations of certain design choices, particularly in Section 4.2, where implementation details often take the focus away from the underlying motivation. The discussion section would also be stronger if it elaborated on why some models, such as CLAP or MFCC, perform well and what this implies for future model development. Finally, while the figures are informative, more guidance through their main takeaways would improve interpretability.

Overall, this is a well-executed and thoughtful contribution. While the gains from recent models are modest, the work lays an important foundation for evaluating timbre-relevant audio representations and is likely to stimulate further research in perceptual model evaluation.

Further comments:

  • Line 10: "the existing ‘timbre space’ data" is unclear. Which data do you refer to?
  • Line 23: Specify which group "CLAP-based models" belong to. (One of the pretrained models)
  • Line 77: "past 21 timbre space datasets" remains unclear at this point. Are these all the datasets you could identify in previous work? -> Ah, it is explained later (Line 130 ff.)
  • Line 87 ff. (Related Work): You may also mention the article by Abesser et al. (How Robust are Audio Embeddings for Polyphonic Event Tagging? IEEE/ACM TASLP, 2023), which takes a similar approach by analyzing the embedding spaces of two non-trainable audio representations alongside several deep audio embeddings in the context of sound classification.
  • Line 207: "an pair" should be "a pair"
  • Section 4.1.1: Could be shortened as the strategy is straightforward. However, clipping may remove relevant (e.g. decay) information, and zero padding could introduce confounding factors (e.g., correlations between sample length and specific instruments). Maybe comment on this.
  • Line 224-229: This appears straightforward and the text could be made more compact.
  • Line 252-272: Even though standard, a brief explanation of the individual metrics would be helpful; especially in clarifying what each captures. In particular, it would be valuable to explain how the metrics complement each other and why examining them together provides a more complete assessment.
  • Line 279-282: Introducing the three distance functions here may cause confusion, as the metrics mentioned do not align with those used in Section 4.1.2 (e.g., l^1 vs. l^2 norms).
  • Line 287: Give a reference for the "Vital" synthesizer -> Ah, comes later ... Not clear if the last character is a letter or a number.
  • Line 300: Write out number "7" into "seven"
  • Line 312: Write out number ...
  • Line 325-326: The purpose of the eight regression outputs and the two classification heads is unclear. Please clarify what each is intended to represent and how they are used during training and evaluation.
  • Section 4.2: I find Section 4.2 a bit hard to follow as it emphasizes what is done rather than why. The motivation behind key choices, in particular the prediction targets, remains unclear.
  • Line 365: There are various ways to define multi-scale spectral losses, which critically shape the behavior of the loss function. For further discussion, see Schwär et al., "Multi-Scale Spectral Loss Revisited", IEEE SPL, 2023.
  • Figure 2: It would be helpful to include comments on which models achieved the best scores and why - specifically, what architectural or methodological factors contributed to their strong performance.
  • Line 420: See comment to Line 365
  • Section 5: The discussion of results feels quite compact and ends rather abruptly. It would be helpful to guide the reader more clearly through the key insights from Figures 2 and 3, explaining what each figure illustrates and how it supports the conclusions.
  • Section 5: It would be particularly interesting to analyze how the results depend on the individual dataset and scenario. This could be effectively visualized using a heatmap that reports performance for selected models and a fixed metric (e.g. MAE).
  • Line 479: "Ddsp" should be "DDSP" (encapsulate {DDSP} in bibtex)
  • Line 564: "Rwc" should be "RWC" (encapsulate {RWC} in bibtex)

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Weak accept

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

The reviewers agree that this paper offers interesting contributions to the topic of timbre similarity ratings, which is both timely and relevant for the ISMIR community. However, there are also concerns regarding the novelty of the approach and the lack of discussion around certain design choices and experimental results.

Strengths:
  • The paper is well written, clear, and easy to follow.
  • The overall message is communicated effectively.
  • The sound matching method demonstrates good performance.
  • The work lays a valuable foundation for evaluating timbre-relevant audio representations.
  • The work is a solid effort to build a unified evaluation framework.
  • The provided Python package is a plus for reproducibility.

Weaknesses:
  • The novelty of the proposed approach is limited.
  • A more detailed per-dataset analysis would help build stronger intuition around the framework.
  • While the figures are informative, additional guidance on their key takeaways would improve interpretability.
  • The paper focuses on the "how" but provides limited insight into the "why"; clearer explanations of specific design choices would strengthen the work.
  • The phrase "alignment of audio" may mislead readers to think the paper focuses on the task of audio alignment.

Despite some weaknesses, we recommend a "weak accept". The paper establishes an important foundation for evaluating timbre-relevant audio representations and is likely to encourage further research in perceptual model evaluation.

Review 1:

Q2 ( I am an expert on the topic of the paper.)

Disagree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Disagree

Q15 (Please explain your assessment of reusable insights in the paper.)

Besides the proposed sound matching model, the paper is mainly an evaluation of how well existing models and classical signal processing algorithms correlate with manually annotated scores of timbre similarity.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

CLAP-based models and the style embeddings from the proposed sound matching model achieve marginal gains over alternatives in the task of timbre similarity ratings.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The paper is well written and clear; however, the phrase "alignment of audio" at first led me to think that the paper was about the task of audio alignment. That conviction was pretty hard to remove from my head! Besides that, I think that the paper is well organized and generally well written. The proposed sound matching method shows good performance; however, in my opinion, the novelty of the approach is limited. The evaluation is well conducted and the metrics used seem to be well suited for the evaluation of such models (despite the fact that, as stated in the paper, three of them show a high degree of correlation, suggesting that two out of three might be redundant). To me, the lacking part is the novelty, but I would suggest a weak accept for mainly two reasons: 1) I'm not an expert on this task, so my judgment of the novelty might not be accurate, and 2) the evaluation compares several methods, including some classical signal processing algorithms, uses different metrics, and gives good insight into the state of the art in this task.

Review 2:

Q2 ( I am an expert on the topic of the paper.)

Disagree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Disagree

Q15 (Please explain your assessment of reusable insights in the paper.)

The evaluation framework and the sound matching model are both task-specific.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

MFCCs are still relevant for capturing timbre similarity.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper presents an evaluation framework for assessing audio representations and their alignments with subjective timbre similarity ratings. The proposed framework leverages 21 existing datasets and implements several metrics in a unified manner. Additionally, a new model (i.e., sound matching), trained via self-supervised learning with synthetic data, is proposed and evaluated in the same framework. The evaluation results suggest that the proposed model and CLAP are both strong contenders compared to other models, and the widely used MFCCs still remain competitive. For the most part, the paper is easy to follow. Occasionally, I found the brief explanations within the brackets to be a little excessive and counterproductive. Nevertheless, I think the overall message of the paper is clear.

The paper consists of two main components, namely the evaluation framework and the sound matching model. Each of these components, in my opinion, could potentially be a strong contribution on its own, assuming sufficient explanations and insights are provided. When combined together, however, the authors inevitably have to leave out some details in order to make room for all the topics. Unfortunately, the current arrangement makes both components feel somewhat incomplete. For instance, in addition to the aggregated results across 21 datasets, I was hoping to see a more detailed per-dataset analysis, which could provide insights into the variance of these datasets and build a stronger intuition about this framework. Instead, I feel the discussion is shortened or limited in order to cover the sound matching model. Similarly, I feel the section on the sound matching model is too brief and many of the details are still missing (e.g., more information about the parameters in the data generation pipeline and the construction of style embeddings). As a result, neither of these two components is well-explained, in my humble opinion.

Another potential concern is around the depth of the provided information. In the evaluation framework, the authors tried to cover a variety of audio length handling methods, distance functions, and alignment scores. As such, the paper focuses on explaining “how” to approach these steps and is relatively light on “why” these choices were made. For example, it is unclear why there are four rank-based scores and only one error-based score. How can they complement each other? Are there considerations specific to the assessment of timbre similarity? Without further insights and explanations, the choices made in this framework may seem a bit arbitrary and are no more than a collection of standard metrics. As commented by the authors, some of the scores are highly correlated, which implies redundancy in these choices.

Despite the above-mentioned concerns, I appreciate the effort of building a unified evaluation framework, and I find the idea of training the sound matching model in the context of timbre similarity intriguing. I believe this work could benefit from another round of revision and polishing, and my initial recommendation was borderline leaning towards a “weak reject”. However, after taking other reviewers' opinions into consideration during the discussion phase, I decided to adjust my recommendation to weak accept.

Minor comments:

Line 175, “... there are cases when this is only one” → I suppose you meant to say “there are cases when there is only one viable option”?

Line 227-229, “Pairs outside the diagonal blocks … are not considered” → if that is the case, why not compute metrics per dataset and aggregate the metrics? (as opposed to building a dissimilarity matrix and only consider the diagonal blocks)

Line 251, “final rank-based alignment scores … averaging across all rows” → since some of these scores are correlation coefficients, I wonder if applying Fisher’s z-transform before averaging would be more appropriate? (see [1] for example)
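
To make the suggestion concrete, here is a minimal sketch of averaging correlation coefficients in Fisher z-space rather than directly. The function name, the clipping constant `eps`, and the example values are hypothetical and not taken from the paper. Note also that the transform is derived for Pearson correlations, so applying it to rank correlations should be treated as an approximation.

```python
from math import atanh, tanh


def mean_correlation_fisher(correlations, eps=1e-7):
    """Average correlation coefficients in Fisher z-space, then map back to r."""
    # Clip so that rows with r = +/-1 do not produce atanh(+/-1) = +/-inf.
    clipped = [max(-1 + eps, min(1 - eps, r)) for r in correlations]
    z_mean = sum(atanh(r) for r in clipped) / len(clipped)
    return tanh(z_mean)


# Hypothetical per-row correlation coefficients (not results from the paper).
row_scores = [0.2, 0.5, 0.9]

naive_mean = sum(row_scores) / len(row_scores)
fisher_mean = mean_correlation_fisher(row_scores)
```

For these values the naive mean is about 0.53, while the Fisher-averaged value is about 0.63. The gap grows as the individual coefficients become more spread out, so it may be worth reporting whether the choice affects the ranking of models.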

Line 260, “... given margin condition …” → how do you decide the value of margin?
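
To illustrate why the margin value matters, the sketch below computes a triplet agreement score under one common definition, which is an assumption here and may differ from the paper's: a triplet (anchor, near, far) is kept only if the human dissimilarities differ by more than the margin, and it counts as agreed if the model orders the two items the same way. All names and matrices are hypothetical.

```python
from itertools import permutations


def triplet_agreement(human, model, margin=0.0):
    """Return (agreement, n_kept) for triplets that humans separate by > margin.

    human, model: square dissimilarity matrices (lists of lists) over the
    same items. Assumed definition, not necessarily the paper's.
    """
    n_items = len(human)
    kept = agreed = 0
    for anchor, near, far in permutations(range(n_items), 3):
        # Keep the triplet only if humans rate `far` clearly farther than `near`.
        if human[anchor][far] - human[anchor][near] > margin:
            kept += 1
            if model[anchor][far] > model[anchor][near]:
                agreed += 1
    score = agreed / kept if kept else float("nan")
    return score, kept


# Hypothetical 3x3 dissimilarity matrices (not data from the paper).
human = [[0.0, 0.1, 0.5],
         [0.1, 0.0, 0.6],
         [0.5, 0.6, 0.0]]
model = [[0.0, 0.3, 0.2],
         [0.3, 0.0, 0.4],
         [0.2, 0.4, 0.0]]

score_no_margin, kept_no_margin = triplet_agreement(human, model)
score_margin, kept_margin = triplet_agreement(human, model, margin=0.2)
```

With no margin, three triplets are kept and two agree; with a margin of 0.2, only two are kept and one agrees. Both the score and the number of evaluated triplets therefore depend on the margin, which is why its selection should be justified.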

Line 349, “... (B, C, H, W) …” → no introduction of these variables?

Line 406, “... style embeddings converge quickly, retaining high alignment …” → it is hard to interpret the results here… IMHO, a correlation coefficient of 0.4 may not be considered high in other research fields. It would be helpful if the authors could help the readers set the right expectations. Also, it is a bit odd to present the test results per training epoch. In a way, it could be interpreted as maximizing the test results during training, which makes the comparison to other models somewhat unfair.

Line 409, “... Kendall, Spearman, and triplet metrics are highly correlated …” → based on visual examination?

Line 421, “... showing spectral distances can be problematic for pitch.” → I do not understand this statement, for I thought each dataset has been pitch-normalized according to Table 1?

Inconsistency in the reference section: I noticed some minor inconsistencies in citation format that could be easily fixed. For instance, some of the conference papers were cited as “in Proceedings of …” and some were not. Also, for the same journal (i.e., JASA), some entries are in all upper case and some are in lower case. I would recommend another pass in the next iteration.

[1] Alexander, Ralph A. "A note on averaging correlations." Bulletin of the Psychonomic Society 28.4 (1990): 335-336. (https://link.springer.com/article/10.3758/BF03334037)

Review 3:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

CLAP embeddings can be used in timbre-related MIR tasks since the (dis)similarity results have a strong correlation with human perception.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Representation learning models provide embeddings that have strong correlations with human perception when comparing the timbre of audio files.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

In this paper, the authors evaluate how well some audio representations align with human perception of how similar the timbres of tracks are. The audio representations range from hand-crafted features such as MFCCs to the latest deep audio understanding models such as CLAP. The motivation is well put, as current timbre space analysis is very limited and audio representations might be useful in mimicking human perception. Due to the lack of sufficient data, the authors use pre-trained models without fine-tuning or training from scratch, simply to evaluate the alignment.

The evaluation is thorough, using a diverse set of representations and datasets. Well-defined and interpretable metrics are used.

The provided Python package is definitely a plus for reproducibility.

I have only minor comments:

  • Style embeddings show promising results, but they are only provided for the proposed similarity model. It would have been beneficial to provide results for other models such as CLAP as well, since CLAP already provides good results itself.
  • In Table 1, there are several “Same” entries in the Type of sounds. It is hard to track what each one is the same as.
  • Although CLAP and style embeddings have better results in general, the improvement over MFCCs is marginal. The manuscript would benefit from a deeper analysis of this fact.
  • Typo: line 163 - datsets -> datasets