P1-1: GlobalMood: A cross-cultural benchmark for music emotion recognition
Harin Lee, Elif Celen, Peter Harrison, Manuel Anglada-Tort, Pol van Rijn, Minsu Park, Marc Schönwiesner, Nori Jacoby
Subjects: MIR fundamentals and methodology; Musical affect, emotion and mood; Music transcription and annotation; Annotation protocols; Metadata, tags, linked data, and semantic web; Evaluation, datasets, and reproducibility; Open Review; Awards Nominee; MIR tasks; Musical features and properties; Novel datasets and use cases
Presented In-person
10-minute long-format presentation
Human annotations of mood in music are essential for music generation and recommender systems. However, existing datasets predominantly focus on Western songs with terms derived from English, which may limit generalizability across diverse linguistic and cultural backgrounds. We introduce 'GlobalMood', a novel cross-cultural benchmark dataset comprising 1,180 songs sampled from 59 countries, with large-scale annotations collected from 2,519 individuals across five culturally and linguistically distinct locations: U.S., France, Mexico, S. Korea, and Egypt. Rather than imposing predefined emotion and mood categories, we implement a bottom-up, participant-driven approach to organically elicit culturally specific music-related emotion terms. We then recruit another pool of human participants to collect 988,925 ratings for these culture-specific descriptors. Our analysis confirms the presence of a valence-arousal structure shared across cultures, yet also reveals significant divergences in how certain emotion terms (despite being dictionary equivalents) are perceived cross-culturally. State-of-the-art multimodal models benefit substantially from fine-tuning on our cross-culturally balanced dataset, particularly in non-English contexts. Broadly, our findings inform the ongoing debate on the universality versus cultural specificity of emotional descriptors, and our methodology can contribute to other multimodal and cross-lingual research.
Q2 ( I am an expert on the topic of the paper.)
Strongly agree
Q3 ( The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work.)
Agree
Q5 ( Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))
Yes, although an overview of other datasets that recruit raters from different cultures (e.g., MERP), or a survey of such datasets, could be provided.
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Disagree
Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)
The paper presents a new dataset for emotion labelling with a global focus. It is always great to see more data available.
Q17 (This paper is of award-winning quality.)
No
Q18 ( If yes, please explain why it should be awarded.)
The paper presents a new dataset for emotion labelling with a global focus. It is always great to see more data available.
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Disagree
Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)
Strong accept
Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
The paper presents a new dataset for emotion labelling with a global focus. It is always great to see more data available. The LLM evaluation approach adopted is also interesting.
The paper is well written and provides a great new resource.
The authors may want to have a look at this recent MER dataset survey paper as well: https://arxiv.org/abs/2406.08809
Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)
Strong accept
Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))
All reviewers agree this is a valuable contribution to the community.
Q2 ( I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Strongly agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Strongly agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Agree (Novel topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Strongly agree
Q15 (Please explain your assessment of reusable insights in the paper.)
The presented methodology for staged tag collection and subsequent rating can be applied to a multitude of MIR tasks and beyond, and should be extended to further countries. The presented dataset can be used as a benchmark for future algorithms.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
Text-translated tag names do not necessarily correspond to meaning-equivalent music mood concepts across languages or cultures.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Strongly agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Strong accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
The paper "GlobalMood: A cross-cultural benchmark for music emotion recognition" presents an open dataset containing music mood tags and their ratings from five locations, together with the methods for sourcing and refinement.
As the authors elaborate, mood tags may vary across different cultures, and even translation-equivalent terms may have different meanings depending on the cultural context.
The authors present several use cases of their data, providing an analysis of term equivalence versus translation equivalence across locations using a subspace mapping of tag representations collected from different locations across the music corpus. They furthermore evaluate the agreement of the human data with multimodal audio LLMs.
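As a minimal illustration of the idea (my sketch, not the authors' exact procedure): if each tag is represented by its vector of mean ratings over the shared songs, the term-equivalence versus translation-equivalence question can be probed by correlating the song-level profiles of dictionary-equivalent tags across locations. The tag pair, variable names, and random placeholder ratings below are hypothetical.

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_songs = 1180  # size of the song corpus (placeholder; the per-location rating design is the authors')

# Hypothetical per-song mean-rating profiles for one dictionary-equivalent tag pair.
profiles = {
    ("US", "happy"): rng.random(n_songs),   # placeholder for real mean ratings
    ("KR", "행복한"): rng.random(n_songs),   # Korean dictionary equivalent of "happy"
}

# If dictionary equivalence implied meaning equivalence, these profiles should
# correlate strongly; a low correlation would signal cultural divergence.
rho, _ = spearmanr(profiles[("US", "happy")], profiles[("KR", "행복한")])
print(f"Cross-location profile correlation for 'happy': {rho:.2f}")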
The paper is well structured and well written, and systematically describes the contributions. The figures successfully visualise the data, results, and collection method.
The description of the annotation method, in my opinion a key contribution, could use some clarification, as details of the collection chain are somewhat spread across several sections (end of Section 3.2, 4.1, 4.2). Regarding Figure 1: it would be great to add, either to the figure or to the related text, a note that the tagging chain (A) is (if I understand correctly) performed by annotators from the same country and location when building a specific location's tagging and emotional tag repertoire, so that the tagging chain remains culture-homogeneous. This could also be clarified in the text around "two parallel chains per country". Line 161, "elicits mood terms across languages", could likewise be rephrased to e.g. "elicits culture-specific mood terms in local languages" if the above assumption is correct. If, on the contrary, participants in Figure 1A are from multiple locations, there would be doubts about mutual understanding along the chain.
The analysis in Section 4.2.2 is particularly promising in showing the limits of "dictionary translation" of tags, as is the analysis of audio LLMs for individual locations. Regarding line 412, it would be interesting to know how the capacity and training data of the Gemini audio models changed.
The specific experiments on improving CLAP for Arabic tags showcase very concrete avenues to improve tagging performance in existing methods.
Depending on whether the vision is to extend the data and benchmark to further locations, I would ask the authors to reconsider the naming ("global") of the dataset/paper, as the five locations used are still somewhat limited (notwithstanding that this is a great novelty and an improvement in the global spread of locations). Also consider the mention of "globally balanced" in line 162.
I recommend the paper to be presented at ISMIR 2025.
Notes: I appreciated the sharing of the data, which helped me confirm my understanding. It seems that the CSV lines with Arabic tags are ordered differently and are inconsistent with the CSV column headers in both files (global mean and raw ratings): the Arabic tags always appear in the outer (last) column, whereas the other languages place the tag in the 3rd column, e.g.:
country,videoID,tag,mean_rating,sd_rating,n_ratings
EG,L7A9gIIYE8U,الاستمتاع,3.5,1.0801234497346435,10
vs
KR,Y2cyFXBo9o4,활기찬,3.1818181818181817,0.8738628975053029,11
vs
MX,IWLcPqj3poM,amor,2.3636363636363638,1.5015143870590968,11
Smaller notes:
Line 93 ("our results demonstrate"): this could go into the conclusion.
I noted the anthropomorphising terminology of "human-like capabilities for understanding" for Gemini (line 412), and wonder whether this is warranted and within the scope of the presented evidence. Just a paragraph below (line 419), the correlation with human ratings is considered comparable in performance to pre-existing specific mood-estimation algorithms.
Q2 ( I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Strongly agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Strongly agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Strongly agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Strongly agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Strongly agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Agree (Novel topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Strongly agree
Q15 (Please explain your assessment of reusable insights in the paper.)
The paper provides culturally grounded mood terms that can serve as a valuable reference for future studies involving music listening tasks, particularly when tailoring emotional descriptors to specific linguistic and cultural contexts. These terms may also be integrated into NLP models to effectively support emotion-related applications in research and industry, for example, as prompts for music recommendation systems or affective tagging interfaces.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
This paper reveals cross-cultural similarities and differences in emotional terms used in music listening, based on a large-scale dataset of human annotations.
Q17 (Would you recommend this paper for an award?)
Yes
Q18 ( If yes, please explain why it should be awarded.)
I strongly recommend this paper for award recognition, given its novelty in conducting a large-scale cross-cultural human study with a scientifically rigorous design. It provides valuable insights through a bottom-up approach, carefully addressing labor-intensive annotation and analysis across multiple phases. In addition to validating findings with human data, the paper further enhances its contribution by comparing human annotations with outputs from large language models.
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Strong accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
This paper explores the culture-specific usage of music-related mood terms through a large-scale annotation study. The authors present a clearly defined research question and a carefully designed methodology to investigate it. I strongly recommend this paper for acceptance based on the following strengths:
- The cross-cultural design—using data from five culturally diverse countries—helps overcome the common WEIRD (Western, Educated, Industrialized, Rich, and Democratic) bias in human-subject research and strengthens the generalizability of findings in emotion studies, where cultural variability in mood perception presents a significant methodological and interpretive challenge. The scale and scope of this annotation effort are particularly impressive, given the logistical and cultural complexity involved in such experiments.
- A wide range of annotations in different languages broadens the applicability of Music Emotion Recognition (MER) systems and contributes to NLP research across linguistic and cultural boundaries. By capturing how mood terms are interpreted within specific cultural contexts, this work lays important groundwork for developing personalized or culturally adaptive emotion recognition systems.
- A robust experimental design—comprising two stages followed by model evaluation—supports the reliability of the results. Careful stage-specific song selection and a bottom-up procedure for mood-term extraction ensure ecological validity. Additionally, the comparison of human annotations with two computational models strengthens the interpretive framework of the study.
Suggestions for improvement:
- In Section 4.2.2, where the authors compare within-country and cross-country agreement, no baseline or statistical comparison is provided for interpreting the reported coefficients. Given this, it may be premature to conclude that the emotion term 'happy' shows "a considerable gap between cross- and within-country agreement" (Lines 377–378).
- I suggest clarifying the phrasing in Lines 361–365 on two fronts. First, if I understand correctly, the term "mean correlations across language pairs" refers to inter-country agreement computed as the average of pairwise correlation coefficients. If so, a brief explanation would improve clarity. Second, I would like to ask whether it is appropriate to interpret within-country agreement—calculated using the Spearman–Brown formula—as a proxy for measurement error. If that is the intended interpretation, it may be helpful to explicitly frame it as such to guide the reader's understanding (see the formulas sketched below).
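For concreteness, this is how I read the two quantities; it is a sketch of my understanding, not necessarily the authors' exact estimators. With \(\mathbf{x}^{(i)}\) denoting location \(i\)'s vector of mean ratings over songs for a given term, \(L\) the number of locations, and \(r_{\text{half}}\) the correlation between the mean ratings of two random halves of raters from the same location:

\[
\bar{r}_{\text{cross}} = \frac{2}{L(L-1)} \sum_{i<j} r\!\left(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\right),
\qquad
r_{\text{within}} = \frac{2\, r_{\text{half}}}{1 + r_{\text{half}}},
\]

where the second expression is the Spearman–Brown split-half correction. If \(r_{\text{within}}\) is intended as a noise ceiling against which the cross-country values are compared, stating that explicitly would guide the reader, as suggested above.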
Q2 ( I am an expert on the topic of the paper.)
Strongly agree
Q3 (The title and abstract reflect the content of the paper.)
Strongly agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Strongly agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Strongly agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Agree (Novel topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Strongly agree
Q15 (Please explain your assessment of reusable insights in the paper.)
The bottom-up approach to annotation with free-text descriptors in the native language is highly relevant and a significant contribution to the task of MER.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
This paper introduces a cross-cultural dataset for the task of music emotion recognition. The amount of annotators and the cross-cultural approach make this an important contribution to the task of MER.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Strongly agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Strong accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
The paper contributes a novel dataset that takes culture into account and allows free-text annotation. The bottom-up approach to annotation with free-text descriptors in the native language is highly relevant and a significant contribution to the task of MER. Moreover, using LLMs to analyze native-language features is interesting and relevant.
Major comments:
- Although I completely agree with the authors regarding the issue of applying English descriptors to music, some work has been done to validate emotion models in other languages, particularly for GEMS; see Strauss2024.
- General inter-rater statistics could be added to Section 4.2.1; Krippendorff's alpha or ICC could give a general notion of how agreement behaves within the same language as compared to across languages (a minimal sketch is given after these major comments).
- One of the major issues of cross-cultural work is that evaluating textual information from an unknown language can be challenging. Figure 2 shows mood terms such as "latino" with high arousal and positive valence, or "foreign" with perhaps low arousal and negative valence? Although this is briefly mentioned in the discussion section, perhaps more could be written about the difficulty of making such translations (see next comment).
- The finding in Section 4.2.2 is very interesting. I'm not sure if I understand this correctly, but having a rating for terms like "latino" or "foreign" would already bias the calculation from the MDS. If the mean rating per term is introduced for the 1,180 songs, would there be a circular logic to this? Perhaps I'm not understanding this section correctly and it only needs to be clarified further.
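For concreteness, a minimal sketch of the kind of within-language agreement statistic suggested above, using the open-source krippendorff Python package; the matrix layout (raters as rows, rated items as columns, NaN for missing ratings) and the toy numbers are my assumptions, not the authors' setup.

import numpy as np
import krippendorff  # pip install krippendorff

# Hypothetical ratings from one location: rows are raters, columns are
# (song, tag) items on a 1-5 scale, np.nan where an item was not rated.
ratings_one_location = np.array([
    [3.0, 4.0, np.nan, 2.0],
    [3.0, 5.0, 1.0, 2.0],
    [np.nan, 4.0, 1.0, 3.0],
])

# Inter-rater agreement within this location, treating the scale as ordinal.
alpha = krippendorff.alpha(reliability_data=ratings_one_location,
                           level_of_measurement="ordinal")
print(f"Within-location Krippendorff's alpha (toy data): {alpha:.2f}")

Computed per location, and analogously on pooled cross-location data, this would give the within- versus across-language comparison suggested above.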
Minor comments:
- Figure 2 has some sliders that I would assume are in Spanish. Are they "alegría", "paz" and "nostalgia"?
- Section 3.2 is a bit unclear in L229. Can you clarify how you selected a subset of 180?
- L299 refers to mitigating a priming effect, which might not be clear to the general MIR reader. Please clarify a bit further.
References:
@article{Strauss2024,
  title   = {{The Emotion-to-Music Mapping Atlas (EMMA): A systematically organized online database of emotionally evocative music excerpts}},
  author  = {Strauss, Hannah and Vigl, Julia and Jacobsen, Peer-Ole and Bayer, Martin and Talamini, Francesca and Vigl, Wolfgang and Zangerle, Eva and Zentner, Marcel},
  year    = 2024,
  month   = jan,
  journal = {Behavior Research Methods},
  volume  = 56,
  number  = 4,
  pages   = {3560--3577},
  doi     = {10.3758/s13428-024-02336-0},
  issn    = {1554-3528},
  url     = {http://dx.doi.org/10.3758/s13428-024-02336-0}
}