Beyond Genre: Diagnosing Bias in Music Embeddings Using Concept Activation Vectors

Roman Gebhardt; Arne Kuhle; Eylül Bektur

Abstract:

Music representation models are widely used for tasks such as tagging, retrieval, and music understanding. Yet, their potential to encode cultural bias remains underexplored. In this paper, we apply Concept Activation Vectors (CAVs) to investigate whether non-musical singer attributes - such as gender and language - influence genre representations in unintended ways. We analyze four state-of-the-art models (MERT, Whisper, MuQ, MuQ-MuLan) using the STraDa dataset, carefully balancing training sets to control for genre confounds. Our results reveal significant model-specific biases, aligning with disparities reported in MIR and music sociology. Furthermore, we propose a post-hoc debiasing strategy using concept vector manipulation, demonstrating its effectiveness in mitigating these biases. These findings highlight the need for bias-aware model design and show that conceptualized interpretability methods offer practical tools for diagnosing and mitigating representational bias in MIR.

Meta Review:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 ( The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The manuscript does a good job outlining the linearity of representation among some widely-used music embeddings with respect to several basic musical concepts. The general approach would also be applicable to other models and concept ontologies.

Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)

Concept activation vectors (CAVs) can be used to find linear hyperplanes separating some high-level musical concepts (gender, language, and genre) in modern neural embeddings.

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Weak accept

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This manuscript explores the effectiveness of concept activation vectors (CAVs) in a selection of widely-used neural music embeddings to show whether certain high-level musical concepts are represented linearly. Overall, it is an interesting glimpse into the state of music embeddings today, and it would surely generate some discussion and interest at ISMIR. The manuscript does suffer, however, from several methodological limitations that reduce its impact. While I personally lean toward accepting the work, it could have been motivated more strongly.

The key limitation is inherent in the technique: concept activation vectors can only identify linear relationships. In a somewhat confusing footnote (3), the authors mention that any linear classifier would be suitable while also mentioning the possibility of adding hidden layers. This should be fixed for the camera-ready: adding any hidden layers would almost surely generate a non-linear classifier, and the authors' model in Equation 1 already encompasses all possible linear ones. I think that the authors mean that any differentiable classifier would be applicable (and indeed, given that §5 goes on to avoid using derivatives, perhaps any classifier would in fact be applicable).

But enforcing linearity rather sharply limits the usefulness of the technique for the authors' purposes: there is no particular reason to believe that the concepts considered would be linearly separable. Where the authors find positive results, these can certainly be seen as evidence for the presence of a particular concept in a neural model. Where the authors do not, it may be that the concept is even strongly present, just non-linearly. The authors should clarify this limitation in the camera-ready version.
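To illustrate this limitation, here is a minimal sketch on synthetic data. It is not the authors' setup: the 2-D "embeddings", the XOR-style concept, and the helper `linear_probe_accuracy` are all hypothetical. The concept is fully determined by the embedding, yet a linear probe stays near chance until a non-linear feature is added.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D "embeddings" (an assumption for illustration only).
# The concept is the XOR of the coordinate signs, so it is fully encoded
# in the embedding but not linearly separable.
X = rng.uniform(-1.0, 1.0, size=(2000, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)


def linear_probe_accuracy(features, labels):
    """Fit a least-squares linear classifier (with bias); return training accuracy."""
    A = np.hstack([features, np.ones((len(features), 1))])
    w, *_ = np.linalg.lstsq(A, labels, rcond=None)
    return float(np.mean(np.sign(A @ w) == labels))


# Linear probe on the raw embedding: expected to stay close to chance.
acc_linear = linear_probe_accuracy(X, y)

# Same probe after adding the product feature x0 * x1, which makes the
# concept linearly separable in the augmented space.
X_nl = np.hstack([X, (X[:, 0] * X[:, 1])[:, None]])
acc_nonlinear = linear_probe_accuracy(X_nl, y)

print(f"linear probe: {acc_linear:.2f}, with product feature: {acc_nonlinear:.2f}")
```

A low score from the linear probe therefore says little about whether the concept is absent; it only says that the concept is not linearly readable.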

I am also not fully convinced by the bias-reducing approaches, neither for generating the dataset nor for vector manipulation. In both cases, the approaches seem too ad-hoc to be fully sound or widely scalable.

Fundamentally, I see this manuscript as a proof of a methodological concept, and as such, the particular choice of concepts considered is reasonable enough. That said, I still would have found the manuscript more interesting with a richer set of concepts. We don't really need AI to make a gender assessment of a vocalist, and genre is notoriously difficult to formulate as a well-posed scientific classification. If the authors tested any other concepts, it would be wonderful to add some information about them to the camera-ready.

Finally, as a small point, it is not clear from the manuscript whether the authors used all embedding layers for the CAV training or chose specific layers.

In the spirit of moving science forward, however, I still think there is benefit to sharing this work as is with ISMIR. The work is well written, and extending the authors' approach to incorporate non-linearity or other musical concepts would be straightforward. Some of the conclusions do rely on potentially biased concept interactions, but the authors have made a good-faith, if ad-hoc, effort to reduce this bias. The results as presented do show that some current models can already convincingly incorporate certain musical concepts.

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Weak accept

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

The reviewers agreed that the approach in this manuscript is interesting, although collectively, the group struggled to understand all of the methodological detail. The most important issue in the discussion was the issue of balance (or lack thereof) in the dataset. In the case of imbalance, there seems to be a substantial risk that the results as presented are primarily a reflection of the imbalances rather than the desired message. In addition to the other comments from reviewers, the authors should pay particular attention to clarifying balance or imbalance in the dataset and how it affects interpretation of the results.

If, after closer inspection, the authors find that class imbalance is a serious enough problem to have thrown off the entire analysis, it would be most appropriate to withdraw the paper. For the purposes of the review, however, the group decided to give the authors the benefit of the doubt.

Review 1:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Disagree

Q5 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))

References

I would propose to include Mishra et al. (2017) when talking about alternative interpretability approaches (l110), due to their relevance in the field of MIR.
While Sec. 2.1 does elaborate why concept-based explanations might be preferable to alternative approaches, concept-based approaches beyond (T)CAV are not mentioned at all - e.g., Ghorbani et al. (2019), or even intrinsically interpretable systems from Koh et al. (2020) or Chen et al. (2020). This would give a better overview of the available methods (and it could still be easily argued why TCAV is the best method for the application in this paper).
While in Sec. 2.2 the most similar work is described to be [5], one could argue that other approaches utilising TCAV to evaluate bias are at least as relevant - even if they are not necessarily in MIR. A discussion of the differences between these approaches, and of whether this work goes beyond the adaptation to a new domain, would be a valuable addition to the related work.
l200: Is there a reference for this?

Saumitra Mishra, Bob L. Sturm, Simon Dixon: Local Interpretable Model-Agnostic Explanations for Music Content Analysis. ISMIR 2017: 537-543

Amirata Ghorbani, James Wexler, James Y. Zou, Been Kim: Towards Automatic Concept-based Explanations. NeurIPS 2019: 9273-9282

Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, Percy Liang: Concept Bottleneck Models. ICML 2020: 5338-5348

Zhi Chen, Yijie Bei, Cynthia Rudin: Concept whitening for interpretable image recognition. Nat. Mach. Intell. 2(12): 772-782 (2020)

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Disagree

Q10 (Please justify the previous choice (Required if "Strongly Disagree" or "Disagree" is chosen, otherwise write "n/a"))

Scholarly / scientific quality

Sec. 1: The introduction lacks a brief statement as to why genre representations in particular were chosen as the target of these investigations (e.g., because there are known skews in the data that could potentially affect genre representation, despite the fact that they should not necessarily have an impact).
Sec. 3: It should be clarified briefly that these systems can be found online and where, as well as that they are used in a pre-trained fashion (i.e., that it is not part of the work to retrain them from scratch).
Sec. 4: It should also be added where the genre information comes from for STraDa, i.e., whether this is part of the metadata or needed to be acquired from Deezer or the like. Furthermore, it would be valuable to add information on how the additional tracks were manually annotated, i.e., how the genre etc. was obtained.
Sec. 4+5: The biggest weakness of this work is the description of how CAVs are computed in Sec. 4 and 5. The fact that CAVs are obtained by training a classifier that differentiates between 'positive' and 'randomly selected non-positive samples' (l266) is never really explained, so the entire setup of nine binary classification tasks (where one differentiates between male / female, so there's no random data involved?), the splitting at language-genre-gender combinations and the balancing in respective sets is all rather hard to grasp (and I also do not understand what 'limiting the number of samples across the joint distribution ...' (l269) means). Instead, I would propose to start by telling the reader that for training the binary linear classifiers necessary for TCAV we need corresponding positive and negative (random) samples for each concept we want to model (and also clarifying whether this means that the negative samples for gender are one gender, or whether two separate classifiers are trained for this). Then, it could be clarified that in order to hopefully correctly model the targeted concepts (and not just imbalances or other concepts), other aspects need to be varied and balanced, and how this is achieved (e.g., by fixing one concept, and varying and balancing the others). Finally, it should be stated what the training and test sets are used for in this context. The lack of explanation on how CAVs are computed exactly (and on which data), makes the methodology of this paper so hard to understand, that also the following results are difficult to interpret (as I was simply not sure what exactly is tested; I thought for the longest time, that CAVs in this work model combinations of language-genre-gender). This aspect would therefore be crucial to rework to ensure a good understanding of the proposed method.
Sec. 5: The second main concern I have is whether the proposed method actually captures what it is supposed to capture. If we assume that the CAVs model the concepts as desired, e.g., female/male, and we compute p_cav for test-samples, then I would intuitively assume that we get a lot of p_cav values > 0 for female vocals (because the CAV should point towards a subspace where a lot of female-singer music is located). Similarly, we would get a lot of p_cav values < 0 for songs of male singers, if the CAV captures that subspace as well. In other words, if the test set is not balanced regarding the (bias) concept we are checking for, how can we ensure that this is not just due to the data imbalance? While Sec. 4+5 describe that the training sets are carefully crafted not to simply reflect imbalances, was this similarly done for the test-set in these scenarios, or was this possibility accounted for differently? Could we avoid this by computing a vector representing a certain genre, and computing its alignment (e.g., in terms of the angle between the two) with the CAV representing a potential bias factor? While in l340 it is carefully phrased that stronger alignments could be 'potential biases', the discussion of the results in Sec. 6 does talk about the positive / negative biases the models have, and I am not sure whether this is actually what is measured here. I think the notion introduced in Sec. 5.2 could better capture underlying biases than what is discussed in 5.1, e.g., if male/female vocals consistently rank higher in terms of p_cav values for a certain genre, this might indicate the possibility of certain biases.
Sec. 5.2/6.3: While this traversal is another interesting approach to learn something about the meaningfulness of the learned CAVs (similar to what is done in Section 4.1.1 in [7]), I am not entirely sure how this could be used as a strategy for debiasing (even post-hoc). The system itself is not changed, and neither are the embeddings - only the 'ranking' retrieved via TCAV, yet this ranking will likely not be used for the majority of applications. How could this be utilised to effectively debias the system (post-hoc)? This needs to be clarified, or rephrased to state that it is another check whether CAVs actually represent the concepts they are targeting.
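
The alignment check suggested in the Sec. 5 comment (angle between a genre vector and a bias CAV) could look roughly like the sketch below. Everything here is an assumption for illustration: the toy embeddings, the mean-difference genre direction, and the use of a unit axis as a stand-in for a gender CAV are not taken from the paper.

```python
import numpy as np


def cosine_alignment(u, v):
    """Cosine of the angle between two direction vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


rng = np.random.default_rng(0)
dim, n = 16, 500
gender_axis = np.eye(dim)[0]  # stand-in for a unit-norm gender CAV (hypothetical)
genre_axis = np.eye(dim)[1]   # direction along which the toy genre differs

# Toy embeddings: tracks of the genre are shifted along the genre axis and,
# to simulate an entangled representation, also partly along the gender axis.
genre_tracks = rng.normal(size=(n, dim)) + 2.0 * genre_axis + 1.0 * gender_axis
other_tracks = rng.normal(size=(n, dim))

# Hypothetical genre vector: difference of the two class means.
genre_direction = genre_tracks.mean(axis=0) - other_tracks.mean(axis=0)

cos = cosine_alignment(genre_direction, gender_axis)
angle_deg = float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
print(f"cosine = {cos:.2f}, angle = {angle_deg:.1f} degrees")
```

Because this measure compares two directions rather than counting signs of p_cav over test samples, it does not depend on how many male-led or female-led tracks the test set happens to contain.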

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

Reusable insights

This work showcases an interesting usage for the TCAV method, where instead of probing a system for human understandable concepts important for a particular prediction, concepts that could present potential biases are investigated. This is an interesting idea, and one that could be reused in different settings or for different MIR models, to investigate the implicit biases that are modelled (this could even be extended easily for classifiers).

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Concept activation vectors (representing human-understandable concepts in the embedding space of DNNs) can be used to investigate potential biases in music representation models (and potentially be used to mitigate these biases).

Q17 (Would you recommend this paper for an award?)

No

Q18 ( If yes, please explain why it should be awarded.)

-

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak reject

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

Summary: This is a well-written paper looking into an interesting application of the TCAV method. The topic is relevant to (and would benefit) the ISMIR community, and has the potential for reusable insights or generating discourse, as this could be used in the future to investigate biases of music representation models (which are gaining popularity). Unfortunately, the description of how CAVs are computed exactly (i.e., their data setup) should be clarified, as the current state makes the presented results rather hard to interpret, and I am unsure whether the proposed adaptation of the TCAV score actually reflects what is desired. The debiasing method would also need further elaboration as to how it could actually help in debiasing a system (post-hoc).

Novelty: While TCAV has been previously used within MIR, and beyond this field as a way of detecting biases in systems, to the best of my knowledge applying TCAV to detect biases in MIR systems is novel.

Reproducibility: The dataset modifications to account for underrepresented genre-gender combinations are provided via additional material, and the code is also said to be released upon acceptance, aiding the reproducibility of this work. The only difficulty for reproduction purposes might be the difficult-to-understand setup of training/test sets for deriving CAVs.

Pioneering proposals: While the proposed work exhibits a level of novelty, this application to a new field does not really provide any pioneering proposals (as TCAV has previously been used to detect biases in systems).

Readability and paper organisation:
- The title reflects the content of the paper, and the abstract is written well (the only minor issues I have are discussed in detail in the scholarly and scientific comments).
- References should be touched up and made consistent (e.g., all caps Journal names, Proc. vs. Proceedings, no venue for [12], all lower case acronyms...)
- The paper is very well-written and nicely structured. Some minor remarks about rephrasings are listed below.
- Sec. 1: It was not immediately clear to me what 'audio representations' are referring to (i.e., internal representations vs. pre-processed audio), maybe this could be clarified by adding something like 'internal' or 'audio representations learned by a system' when talking about them first (l40).
- Sec. 2.1: The section jumps around a bit between (T)CAV and alternative approaches, which could be restructured slightly (this would probably allow for some space to briefly mention other concept-based approaches as discussed in 'References').
- l165-176: To make the differences between [5] and this work clearer, 1) the 'datasets' (l169) should be clarified (e.g., to separate two different (?) datasets), and 2) the sentence from l169-l176 should be split up, where the content of the second half should follow the initial description of [5] (e.g., ... to separate two datasets. This method addresses the domain.... . In contrast, our CAV-based method... ).
- Figure 1: As this figure consists of two figures, I would make them subfigures (this also allows for easier referencing); the titles of the individual model plots could be improved by using only the model name also used in the paper, e.g., MERT instead of mert_v1_95m; the legend should be made a bit bigger and clearer, I am unsure what the TCAV dist. (distance? distribution?) is, and it might be easier to understand if the three colours are explained separately; also, the results (i.e., passing or lack thereof) of the significance tests should be indicated somehow, or is that reflected in the colours as well? Finally, it should be stated somewhere why there are fewer genres depicted in the lower part of the figure, is it because language-genre examples were missing?
- The symmetry of the concepts 'male' and 'female' goes beyond intuition (l497) - at least depending on how the corresponding CAVs are computed. If the corresponding CAV of 'male' is just -CAV of female (which I assume could be the case if this is derived via a binary classifier as suggested in l250), then (4) and (5) should indeed be equivalent (which then could raise the question as to why the results in Figure 2 are different at all, numeric instabilities?)
- l485: The 'sorting' should be explained once more at this point (e.g., where we compute the TCAV values for ... and sort ... according to...)
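
The symmetry argument about the 'male' and 'female' CAVs can be checked with a small sketch. It assumes, as a hypothesis only, that both CAVs come from the same binary logistic-regression probe with the labels swapped; the toy data and the helper `fit_logistic_probe` are not from the paper.

```python
import numpy as np


def fit_logistic_probe(X, y, lr=0.1, steps=500):
    """Logistic-regression probe trained by plain gradient descent from zero init."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * float(np.mean(p - y))
    return w, b


rng = np.random.default_rng(0)
dim, n = 8, 200
# Toy embeddings for two overlapping groups (an assumption for illustration).
female = rng.normal(loc=0.5, size=(n, dim))
male = rng.normal(loc=-0.5, size=(n, dim))
X = np.vstack([female, male])
is_female = np.concatenate([np.ones(n), np.zeros(n)])

# Same data, same optimiser, only the positive class is swapped.
w_f, b_f = fit_logistic_probe(X, is_female)
w_m, b_m = fit_logistic_probe(X, 1.0 - is_female)

cav_female = w_f / np.linalg.norm(w_f)
cav_male = w_m / np.linalg.norm(w_m)

print("max |cav_male + cav_female| =", float(np.max(np.abs(cav_male + cav_female))))
```

Under this assumption the two CAVs are exact negatives, so every projection flips sign and the two traversals would be mirror images. Differing results would then point to separately trained classifiers (for example, each concept against its own random negatives) or to different training subsets.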

Potential to generate discourse: I could see this work having the potential to generate discourse, as it might be an interesting approach to look into biases within MIR models. However, the work might need some touch ups and more concrete ideas of how the debiasing could be realised.

Relevance of the topic to ISMIR: Both the bias in musical data and systems and the interpretation of musical embedding systems are relevant to the ISMIR community. As the work on interpretable deep learning is limited in the MIR community, work like this is even more valuable.

Minor/detailed remarks:

l23: more recently
l31: linebreak
l47: maybe: unexplored
l154: as -> via (as neurons etc. are not biases themselves)
l159: remove '.'; sources [29-31] should preferably be referenced immediately after the according concept (e.g., counterfactual attention learning [30]), same for references in l164
l170: undesirable biases as concepts
l241: footnote should be after '.'
l258: To construct the training and test sets required for the computation of CAVs (as these have not been defined yet)
l311: this acronym was already introduced (a few times), can be used as is
formula (3): 'I' needs to be defined
l322: Some gradient information can certainly be extracted from an embedding system, just not the needed one (e.g., w.r.t. a target class) - this needs to be corrected
l359: rank higher in terms of ('rank' is not really used a lot in this context, so it is not entirely clear what that means)
l301, l390: linearly encoded -> not necessarily, it can be linearly separated well enough from embeddings of random samples; maybe: indicating that the CAVs should represent the concepts they are targeting reasonably well, as the concept and random activations can be linearly separated.
l356: briefly clarify 'balancing constraints' here -> e.g., without balancing the distribution of other factors like gender or language in the training set, and ...
l393: indicated by the 95% confidence intervals not spanning across the 0.5 mark (or similar?)
l451: meaning (?) it is still above chance but (?) should be interpreted...

Review 2:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The authors explore the use of an explanatory system for understanding AI to see whether music embeddings incorporate structural bias (such as gender-based or lyric-language-based) into their classifications, such as for genre. The ideas of their explanatory system might have broader application, and that'd be neat.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Systems for doing embedding of music may implicitly incorporate properties of performers into their models, and that might matter for making other predictions based on those embeddings; this can be ameliorated by being aware of these implicit biases.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This is an interesting, if frustrating, paper using some interpretable-AI tricks to identify when embeddings of music implicitly are learning concepts about their data that oughtn't be relevant to classification, such as classifying genre based on the perceived gender of a performer in a song.

I will say that I don't understand some of the goals of this study: the claim is essentially that, "the only thing that should be different if you substitute out a male singer for a female singer is the one variable about gender of performer". But that's obviously not right, despite the claim, "we expect female-led and male-led English language jazz to be musically comparable, despite differences in vocal timbre." Why would this be true? The instrumental blend would be different! You might have different backing performers. They might emphasize different octaves in their accompaniment. Etc. Like, I get what they're trying to say, but there's almost a claim that the k.d. lang version of "Hallelujah" should be just as easy to classify as the Leonard Cohen version, and ... why would that be true?

So I admit I was a little suspicious, and I'm still a little suspicious: the fix is literally just "take a convex combination of the classifier for genre and the classifier for the lead singer being female and use that, and not just the uncorrected classifier for genre."
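
As a minimal sketch of the convex-combination idea described above, the following toy example blends a genre direction with a "female lead singer" direction and re-ranks tracks by their projection onto the result. Everything here is an illustrative assumption: the names `blend_cavs`, `rank_tracks`, and `alpha`, the unit normalization, and the two-dimensional toy embeddings are not taken from the paper.

```python
import numpy as np

def blend_cavs(genre_cav, concept_cav, alpha):
    """Convex combination (1 - alpha) * genre + alpha * concept.

    Both directions are unit-normalized first (an assumption of this sketch).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    g = genre_cav / np.linalg.norm(genre_cav)
    c = concept_cav / np.linalg.norm(concept_cav)
    return (1.0 - alpha) * g + alpha * c

def rank_tracks(embeddings, direction):
    """Return track indices sorted by descending projection onto `direction`."""
    scores = embeddings @ direction
    return np.argsort(-scores)

# Hypothetical toy embeddings: axis 0 ~ genre, axis 1 ~ female lead singer.
genre_cav = np.array([1.0, 0.0])
female_cav = np.array([0.0, 1.0])
tracks = np.array([
    [0.9, -0.5],  # track 0: strongest genre score, male-led side
    [0.8,  0.5],  # track 1: slightly weaker genre score, female-led side
    [0.1,  0.9],  # track 2: weak genre score, female-led side
])

plain = rank_tracks(tracks, blend_cavs(genre_cav, female_cav, 0.0))
blended = rank_tracks(tracks, blend_cavs(genre_cav, female_cav, 0.3))
print(plain, blended)
```

In this toy setting, a modest `alpha` moves the female-led track 1 ahead of track 0 while track 2, which barely matches the genre, stays last. It also shows the concern raised here: tracks with a weak or ambiguous projection onto the gender direction gain little from the correction.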

This overall approach does "work", in that if we emphasize the classifier for gender, the songs that are rated as most clearly hip-hop also start to include ones with female singers. But it's a little frustrating, in that it suggests that the hardest-to-classify-for-gender songs will probably get tossed out quickest (like, e.g., k.d. lang). I'd love a much more detailed study of what actually is returned than just the tiny bit that's in 6.3.

All told, it's an interesting overall idea, and I'd watch a talk on this paper.

Review 3:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Disagree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Disagree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Strongly Agree (Very novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The paper proposes a method to diagnose and adjust the bias in music embeddings, which can be applied to many modern models today.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

The paper tries to diagnose and adjust the bias in music embeddings.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak reject

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

In this paper, the authors propose a CAV-based method to investigate the bias present in pretrained models (embeddings). The method trains a linear classifier for each concept and explores whether certain genres of music tend to be classified as positive or negative (or, in the authors’ words, whether they align with the concept). The authors investigate four SOTA music embeddings and reveal strong gender and language biases in these embeddings. Based on these findings, the authors propose a strategy to adjust for the bias, which yields a less-biased CAV.

Strengths
- The paper is clearly motivated. It reveals and tries to address an important issue in the era of large foundation models. It is crucial to be aware of gender, cultural, and any other potential biases introduced in these data-based models.
- The choice of embeddings spans a wide variety of pre-training strategies, and the resulting model bias reflects the bias introduced in pre-training.
- The careful curation of datasets reflects a rigorous experimental design that minimizes the bias introduced by the dataset.
- The results are clear and easy to explain, and they show the clear bias present in the embeddings.
- The proposed de-biasing strategy is simple yet effective.

Weaknesses and Questions
- In Section 5.1, the authors mention that there is no downstream classifier and therefore “no gradient information can be extracted”, but genre classification seems to be a straightforward downstream task from which logits and gradients could be extracted, so that [7] could be fully adapted.
- The description of how TCAV is done is confusing. CAV has different meanings: sometimes it means “concept activation vectors” and sometimes it denotes a specific vector, as in Eq. (2). The vector CAV is the same as the weight vector in Eq. (1). These concepts are mixed together, which makes the section hard to read.
- (Important) Following the previous point, the process described in Section 5.1 suggests that the TCAV score is actually the percentage of test samples per genre that are classified as positive by the classifier in Eq. (1), and since each genre has a balanced test set, the ideal percentage is 50%. However, this is not pointed out, and the authors opt for a more abstract and complicated explanation, namely the idea of “how well two concepts align”, which comes from [7] (but the authors’ implementation deviates significantly from [7]). Also, I suggest creating a new name, since this is no longer the TCAV introduced in [7].
- Lines 348-349: the authors mention the statistical test but do not present the results in Section 6. The statistical test with Bonferroni correction would fail to reject the null hypothesis in many cases because there is at least one genre where there is no bias and Bonferroni correction requires that there is significance in all sub-tests.
- Sections 5.2 and 6.3 also read as a bit confusing to me. The authors mention the ranking of different tracks. I suppose this is the ranking of the p_{CAV} scores as in Eq. (2) (the likelihood of the song being a hip-hop song)? This should be explicitly described.
- A minor question: while the authors have discussed the difference between the proposed de-biasing strategy and [5], I think both of them assume linearity. Therefore, I am curious how they would perform differently. The discussion in Section 2.2 only points out the difference in motivation (dataset bias in [5] and demographic bias in this paper). What I see is that [5] debiases the embedding and this paper debiases the CAV (the classifier), but this is not pointed out in the paper.
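
The reading of the TCAV score given in the "(Important)" point above can be sketched as follows: the score is the fraction of a genre's test embeddings that the linear concept classifier labels positive, and it is compared against the 50% expected for a balanced test set. This is a minimal illustration of that interpretation only; the function name `fraction_positive`, the toy weights `w` and `b`, and the embeddings are assumptions, not the authors' implementation.

```python
import numpy as np

def fraction_positive(embeddings, w, b):
    """Fraction of embeddings on the positive side of the hyperplane w.x + b = 0."""
    logits = embeddings @ w + b
    return float(np.mean(logits > 0))

# Hypothetical linear concept classifier (e.g. "female lead singer").
w = np.array([1.0, 0.0])
b = 0.0

# Hypothetical balanced test set for one genre: two female-led, two male-led tracks.
genre_test = np.array([
    [ 0.5, 0.1],
    [ 0.2, 0.3],
    [ 0.1, 0.7],
    [-0.4, 0.2],
])

score = fraction_positive(genre_test, w, b)
deviation = score - 0.5  # 0 would mean no measurable lean for this genre
print(score, deviation)
```

Here three of the four tracks fall on the positive side, so the score is 0.75 even though the test set is balanced, which is the kind of deviation from 50% that would be reported as bias under this reading.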

While I think the paper addresses the crucial problem of implicit bias in music embeddings, and the authors have done extensive and careful studies to show interesting and meaningful results, the description of the methods is really unclear (see weaknesses and questions). Since the methods deviate significantly from the TCAV reference [7], and therefore no reference describes them, it is crucial to state every implementation detail clearly. Therefore, I cannot recommend accepting the paper as it is. While my main criticism is about presentation, which can be addressed in the camera-ready version, I expect that substantial rather than minor revision is needed for a good, clearly written paper, so I have to recommend a weak reject even though I like all the results and discussions.

To improve the paper, I would suggest addressing the presentation problems mentioned above. Although I did not mention it in the weaknesses, the process of dataset construction is also a bit hard to read. Sections 4 and 5 take great effort to understand because of the lack of clarity. Visualization (both of the dataset and the methods, as in [7]) could help. Also, without explaining the original TCAV in detail, mentioning “gradient information” and why the bias term is required can be confusing. The authors could opt for either including the details of [7] or focusing on their own implementation.

Minor corrections
- Eq. (1) suggests linear regression, but what is done here is (I suppose) logistic regression.
- Many references are not properly formatted. For example, [12] and [16] don’t show proceedings names; [7] is published in ICML, not in NeurIPS; “MuChoMusic” in [6] should keep its upper-case letters; [6] and [14] are both from ISMIR and should be in the same format (including the abbreviation in [14]).

P3-4: Beyond Genre: Diagnosing Bias in Music Embeddings Using Concept Activation Vectors

Roman Gebhardt, Arne Kuhle, Eylül Bektur

Presented In-person

4-minute short-format presentation

Minor/detailed remarks: