P3-13: Expotion: Facial Expression and Motion Control for Multimodal Music Generation
Fathinah Izzati, Xinyue Li, Gus Xia
Subjects: Open Review ; Creativity ; Creative practice involving MIR or generative technology ; Generative Tasks ; Novel datasets and use cases ; Multimodality ; MIR tasks ; MIR fundamentals and methodology ; Evaluation, datasets, and reproducibility ; Music generation ; Music and audio synthesis ; Alignment, synchronization, and score following
Presented In-person
4-minute short-format presentation
We propose Expotion (Facial Expression and Motion Control for Multimodal Music Generation), a generative model leveraging multimodal visual controls—specifically, human facial expressions and upper-body motion—as well as text prompts to produce expressive and temporally accurate music. We adopt parameter-efficient fine-tuning (PEFT) on the pretrained text-to-music generation model, enabling fine-grained adaptation to the multimodal controls using a small dataset. To ensure precise synchronization between video and music, we introduce a temporal smoothing strategy to align multiple modalities. Experiments demonstrate that integrating visual features alongside textual descriptions enhances the overall quality of generated music in terms of musicality, creativity, beat-tempo consistency, temporal alignment with the video, and text adherence, surpassing both proposed baselines and existing state-of-the-art video-to-music generation models. Additionally, we introduce a novel dataset consisting of 7 hours of synchronized video recordings capturing expressive facial and upper-body gestures aligned with corresponding music, providing significant potential for future research in multimodal and interactive music generation. Code, demo and dataset are available at https://github.com/xinyueli2896/Expotion.git
Q2 ( I am an expert on the topic of the paper.)
Agree
Q3 ( The title and abstract reflect the content of the paper.)
Strongly agree
Q4 (The paper discusses, cites and compares with all relevant related work.)
Disagree
Q5 ( Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))
The authors claimed that "this work is the first to leverage synchronized expressive gestures and facial expressions for music generation." However, a quick Google Scholar search brought up many papers:
- Roberto Valenti, Alejandro Jaimes and Nicu Sebe, "Sonify Your Face: Facial Expressions for Sound Generation," MM, 2010.
- Jiang Huang, Xianglin Huang, Lifang Yang, and Zhulin Tao, "D2MNet for music generation joint driven by facial expressions and dance movements," Array, 2024.
- Alexis Clay, Nadine Couture, Elodie Decarsin, Myriam Desainte-Catherine, Pierre-Henri Vulliard, and Joseph Larralde, "Movement to emotions to music: using whole body emotional expression as an interaction for electronic music generation," NIME, 2012.
- Vishesh P, Pavan A, Samarth G Vasist, Sindhu Rao, and K. S. Srinivas, "Movement to emotions to music: using whole body emotional expression as an interaction for electronic music generation," I2CT, 2022.
- Jiang Huang, Xianglin Huang, Lifang Yang, and Zhulin Tao, "A Continuous Emotional Music Generation System Based on Facial Expressions," ICID, 2022.
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Strongly agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Strongly agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Strongly agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Agree (Novel topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Strongly agree
Q15 (Please explain your assessment of reusable insights in the paper.)
The proposed adaptation and fine-tuning methodology for equipping pretrained music generation models with additional multimodal controls can likely be reused in future work to enable other control signals.
Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)
By adapting and fine-tuning a pretrained music generation model, we can allow a user to control a music generation system through visual gestures and facial expressions.
Q17 (This paper is of award-winning quality.)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Strongly agree
Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)
Weak accept
Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
Summary
This paper proposes a novel interactive music generation system that can be controlled by visual gestures and facial expressions. The authors propose to adapt and fine-tune an existing music generation system to take in new multimodal control signals. The authors compile a new dataset consisting of 7 hours of video recordings with synchronized responsive facial expressions and upper-body movements. With the proposed dataset, this paper shows the effectiveness of the proposed method through objective and subjective evaluations.
Strengths
- The paper is clearly written and easy to follow.
- The paper addresses a potentially impactful research direction towards interactive music generation.
- The provided qualitative examples clearly show the effectiveness of the proposed method.
- The authors conducted extensive evaluations through both objective metrics and a subjective survey. The ablation studies are also well-designed, and the results are clearly presented.
- The proposed dataset will be a great contribution to the community if made publicly available. However, the authors did not discuss a data release plan in the paper.
Weaknesses
- The authors fail to discuss connections to existing non-deep-learning interactive generative music systems. Much prior work can be found in our neighboring communities such as NIME, ICMC, and AIMC.
- The subjective evaluation results are presented without error bars. While the best-performing models achieve much higher scores than the baseline model, it is hard to draw strong conclusions or make significance claims without error bars.
- The performance of the model is not strong. The best tempo error is still 28 bpm.
- While the authors claimed that "our multimodal controls complement each other in creating better music" (Line 118-120), the roles of the visuals and texts remain unclear to me.
Justification of the Overall Evaluation
This paper represents a significant step towards interactive music generation. The paper is clearly written and most claims are supported with experimental results. However, the results are not strong, and the effectiveness of the proposed model in supporting meaningful interaction through body movements, facial expressions, and text prompts together remains unclear. I am thus recommending a weak accept.
Detailed Comments and Suggestions
- (Line 118-120) "In contrast, our multimodal controls complement each other in creating better music." -> This is an unsupported claim. While something similar has been discussed near the end of Section 5.2, I don't think we can arrive in this conclusion with the presented results.
- (Section 4.1: Dataset) Will the dataset be released? If not, how do you ensure reproducibility? If releasing the raw videos is challenging, perhaps the authors can only release the extracted features, e.g., extracted facial landmark and joint positions.
- (Line 270-272) "We recruited volunteers to record their facial expressions and upper body movements while listening to 30-second audio clips." -> Reactive facial expressions and upper body movements can differ from those intended to be used as inputs to control a music generation system. Some brief discussion would help clarify this.
- (Section 4.3: Baselines) How did you synthesize the generated MIDI files? Please clarify this.
- (Line 378-380) "This configuration performs best overall, with the lowest FAD and KL scores and the highest IS Score," -> The KL score is not the lowest.
- (Section 5.1 and Table 1) Is a higher IS score better? Shouldn't the optimal IS score be that of the ground truth? If so, please remove the arrow in Table 1 and restate some arguments in Section 5.1.
- (Figure 3) The axis labels are barely visible. Also, some numbers are close. Error bars would help us see whether these differences are significant; a minimal sketch of how such error bars could be computed is given below.
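For reference, here is a minimal sketch of the kind of error bars meant above, assuming per-participant ratings are available. All names and numbers are invented for illustration.

```python
# Minimal sketch: mean and 95% confidence interval for per-participant Likert
# ratings of one system. All names and numbers are invented for illustration.
import numpy as np
from scipy import stats

def mean_ci(scores, confidence=0.95):
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    sem = stats.sem(scores)  # standard error of the mean
    half_width = sem * stats.t.ppf((1 + confidence) / 2, len(scores) - 1)
    return mean, half_width

baseline_musicality = [3, 4, 3, 2, 4, 3, 3, 4]   # hypothetical ratings, 1-5 scale
proposed_musicality = [4, 5, 4, 4, 5, 3, 4, 5]
for name, scores in [("baseline", baseline_musicality), ("proposed", proposed_musicality)]:
    m, hw = mean_ci(scores)
    print(f"{name}: {m:.2f} +/- {hw:.2f}")
```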
Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)
Weak accept
Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))
We have mixed recommendations from the reviewers, with 3 weak rejects and 2 weak accepts. The negatives come mostly from concerns about the not-so-strong performance of the proposed method and the lack of some experiment and implementation details. However, given how novel the task and methodology are, I think the weaker performance does not disqualify this paper from acceptance. I believe this paper will generate many fruitful discussions in our community and inspire much follow-up work in this direction. I am thus recommending a weak accept for this paper.
If the paper is accepted, the authors must carefully read all the reviews and try their best to address the concerns raised by the reviewers. Specifically, here are the required revisions in the camera-ready version:
- Discuss related work on approaches that are not based on deep learning.
- Discuss the limitations of the proposed method, especially the 28 bpm tempo error and the not-so-effective facial expression controls.
- Provide more details about the listening test setup and add error bars to the results.
Here is a summary of the reviews:
Strengths
- (R3, R6, MR) Clearly-written and easy to follow.
- (R3, R6) The novelty is significant.
- (MR) The proposed dataset will be a great contribution to the community.
Weaknesses
- (R2, R3, R6) The tempo control is not very effective, with an average error of 28 bpm.
- (R5, MR) Missing error bars for the subjective evaluation.
- (R1) Missing details about the subjective test.
- (R1) Dependence of the model on text prompts is not addressed.
- (R2) Facial expressions have little effect on generated samples.
- (R3) Limited objective evaluation results.
- (MR) Lacking literature review of, and connections to, non-deep-learning interactive generative music systems.
Q2 ( I am an expert on the topic of the paper.)
Strongly agree
Q3 (The title and abstract reflect the content of the paper.)
Disagree
Q4 (The paper discusses, cites and compares with all relevant related work)
Disagree
Q5 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))
The method and the extent to which motion and facial expression are used to control music generation are unclear, both in the paper and in the title. The title implies a primary role for facial and body movements, while the model seems to rely largely on the text prompts, supplemented by the gestures.
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Disagree
Q10 (Please justify the previous choice (Required if "Strongly Disagree" or "Disagree" is chosen, otherwise write "n/a"))
While the project and the presented results are clearly valuable, it is difficult to accurately gauge the scholarly quality due to vague and missing content. The number and demographics of the participant raters are unclear, the paper contains many sweeping statements that are unsupported by statistics or references, the BPM error is undefined, etc.
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Strongly disagree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Agree (Novel topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Disagree
Q15 (Please explain your assessment of reusable insights in the paper.)
The lack of supporting evidence makes it difficult to extract usable insights.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
The paper presents EXPOTION, a generative model that uses facial expressions and upper-body motion along with text prompts as multimodal controls to produce expressive and temporally accurate music.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Strongly agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Weak reject
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
The paper presents a novel integration of facial expressions and upper-body movements with text prompts to generate music outputs. The authors propose a parameter-efficient fine-tuning strategy on a pre-trained text-to-music model (MusicGen), and introduce a new 7-hour dataset of expressive audiovisual recordings. Overall, this is an ambitious and timely contribution to the emerging field of expressive, multimodal generative models for music.
While both the novel dataset and the generative model are noteworthy, the paper lacks clarity and detail in several places. The dataset is relatively small for training and validating a generative model, and aspects of the audiovisual recording selection and processing are unclear. The comparison to existing baseline models is vague in the introduction, and a second model is introduced in the results that was not signposted in earlier sections. The use of subjective measurements is also unclear, with minimal information about rater selection, the instructions given, and inter-rater comparisons. The recordings used movements that were not spontaneous and may have been exaggerated or lacking in realism/generalisability.
The dependence of the model on text prompts (generic or generative) is not addressed, giving the impression that the visual inputs have a greater impact on and control over the outputs than may be the case. There does not seem to have been a control paradigm without text inputs to compare the effect of the video information.
The paper would benefit greatly from restructuring and clearer detail, as well as a fuller consideration of the existing literature. That said, the authors do present a unique and working model for music generation using multimodal controls.
Comments:
Lines 78-87: this reads more like discussion than introduction, selling the approach before the justification from the literature has been given.
Line 83, "current state of the art": VidMuse is implied as the best/only existing system. A broader consideration of the literature might be useful here. Also, should Video2Music be named here (see line 313 below)?
Lines 119-120: "subjective" - how was this distinction made?
Lines 184-187: "first resampling the video to 80 fps and then, since MARLIN processes 16 frames simultaneously, obtaining the facial expression features in a frame rate of 5 fps by setting the stride to 16" - what was the original recorded fps? A stride of 16 what?
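For what it is worth, my reading of this passage (an assumption on my part, since the paper does not spell out the units) is that the stride is measured in frames, so the stated 5 fps feature rate falls out of simple arithmetic:

```python
# Illustrative arithmetic only, assuming the stride of 16 is counted in frames
# and MARLIN windows are therefore non-overlapping.
fps = 80                 # video resampled to 80 frames per second
frames_per_window = 16   # frames MARLIN consumes per feature vector
stride = 16              # window hop, in frames

feature_rate = fps / stride
print(feature_rate)      # 5.0 feature vectors per second, i.e., the stated 5 fps
```

If that reading is correct, stating the original recording fps and the unit of the stride explicitly would remove the ambiguity.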
Lines 274-275: the movements and expressions were not spontaneous, and were probably exaggerated?
Line 285: text prompts in what way? When there were no vocals in the audio, what were the text prompts? Were there any control participants without prompts?
Line 308: the opening sentence is a bit sweeping - it needs to be clear that the authors are referring specifically to music generation. Models like this exist in speech research - also, is the statement true?
Line 313: should Video2Music be mentioned as the current state-of-the-art system in line 83 above?
Line 349, "Participants rated": how many raters were there, and where were they recruited from? Did they have any guidance / what instructions were they given? What was the range of musical backgrounds?
Table 1: the reasoning for the bolded values is unclear; the bolded value is not always the best value for that measure (e.g., there is a lower KL value than the one bolded).
Lines 404-413: these sentences are confusing.
Lines 422-425: this would be better placed in the methods, or where the participant raters were first mentioned.
Line 449, "perceptually more engaging": is there supporting evidence for this? Currently this is a bit of a sweeping statement - no statistics are shown to support extrapolating from a subjective observation. It would also help to state what the "more" is compared to.
Section 5.3 (Ablation Studies): the table needs to be referred to in the text.
Line 472: what is the BPM error related to / defined by? Was a BPM given? If not, might the creativity of the model not also lie in its deviation from a standard rhythm?
Line 483, "notable improvements": this is a bit vague; it would help to specify what the comparison is here.
Lines 485-488, "excels": can this be supported inferentially?
Q2 ( I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Disagree
Q15 (Please explain your assessment of reusable insights in the paper.)
This paper explores the effect of adding conditioning to an existing generative model. This is a standard idea at this point.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
Adding video conditioning to MusicGen, where the video is of a listener pretending to conduct or miming playing an instrument, has some effect on generated samples.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Disagree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Weak reject
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
This paper describes a new control mechanism for music generation intended to capture a user's intentions from video of their upper body movements, "similar to what conductors do" as the authors put it. Control is provided via a partially fine-tuned MusicGen model that takes video feature embeddings in addition to the standard text prompt. The authors also experiment with adding features representing users' facial expressions, but these turn out to have little effect on generated samples. To support this work, the authors create a dataset of paired audio and video recordings of participants performing rehearsed movements to commercial library music soundtracks. From the examples shared on the paper website, participants do indeed often mimic what they imagine a conductor might do. In other cases they mime playing a musical instrument.
The experiments are clearly reported, but a couple of the results are striking and call for more explanation. The most basic control that I would expect to be added by conditioning on video clips from this dataset would be the ability to control tempo by giving a beat. The authors evaluate this with an Average Tempo Error metric. This metric captures the average difference in bpm between the tempo of the reference audio provided to a participant and the estimated tempo of a generated sample conditioned on video of their movements. According to the results in Table 1, this difference is 35 bpm for MusicGen with no video conditioning. For a control of this kind to be useful, this error needs to be substantially reduced, ideally to close to zero. However even in the best configuration reported in Table 1, the average tempo error is still 28 bpm. Why is this? Can you suggest a method to improve on this within the same general approach?
While video features do not appear to provide meaningful control over tempo, the authors claim in Section 5.2 that video conditioning leads to samples that are more "coherent, expressive, and perceptually more engaging", and the subjective evaluation results in Figure 3 report a very significant increase in average Likert scale scores for the "Musicality" and "Creativity" of generated samples relative to vanilla MusicGen. If I've understood correctly, the samples being evaluated are only 10 seconds long. Is that correct? If so, can you be clearer about how coherence and musical creativity can be usefully measured in such a short snippet of audio? What I actually hear in the examples on your demo page is that samples from your models are more rhythmically varied, and often have more prominent rhythmic sounds, than the samples from the MusicGen baseline.
Overall this is a well organised piece of research, and the conditioning you add is clearly starting to do something. But I think it would be a much better paper if you allowed more time to iterate on your results, until tempo control works as expected and you are able to understand more precisely how video conditioning affects generated audio samples.
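To make the tempo-error discussion above concrete, here is a minimal sketch of an Average Tempo Error metric as I understand it; the use of librosa's tempo estimator and all names here are my own assumptions, not the authors' implementation.

```python
# Hedged sketch of an "Average Tempo Error" metric: the mean absolute difference
# (in BPM) between each clip's reference tempo and the tempo estimated from the
# corresponding generated sample. Not the authors' implementation.
import numpy as np
import librosa

def average_tempo_error(generated_paths, reference_bpms, sr=32000):
    errors = []
    for path, ref_bpm in zip(generated_paths, reference_bpms):
        y, _ = librosa.load(path, sr=sr, mono=True)
        # Global tempo estimate; newer librosa versions expose this as
        # librosa.feature.rhythm.tempo.
        est_bpm = float(librosa.beat.tempo(y=y, sr=sr)[0])
        errors.append(abs(est_bpm - ref_bpm))
    return float(np.mean(errors))
```

One possible confound worth checking is octave errors in the tempo estimator (half/double-tempo confusions), which could inflate an average error of this kind even when the perceived beat is roughly right.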
Q2 ( I am an expert on the topic of the paper.)
Disagree
Q3 (The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Strongly agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Strongly agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Agree (Novel topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
This paper shows that leveraging facial expressions, upper-body motion, and text prompts can effectively improve semantic and temporal coherence in music generation while using only lightweight fine-tuning. The introduced Expotion system outperforms prior baselines and existing state-of-the-art video-to-music generation models.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
Using facial expressions and body movements for generative music synthesis leads to generated music that tracks both the emotions and the timing of the input better than existing video-to-music systems.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Weak accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
This paper proposes a generative model that produces expressive and temporally accurate music from facial expressions, upper-body motion, and text prompts. The multimodal contributions are clear and concise.
The abstract is very clear and concise.
The paper does a great job building on existing work.
Is there a reason why you didn't test for statistical significance in the subjective evaluation section? The creativity, video-audio consistency, and musicality aspects of the face and baseline models seem very close. It would be interesting to see if there really is a significant difference in the ratings of these models.
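For example, since each participant presumably rated both systems, a paired non-parametric test could be run per aspect. A minimal sketch, with ratings and model names invented purely for illustration:

```python
# Minimal sketch of a paired significance test on Likert ratings.
# "face_model" and "baseline" and the numbers are hypothetical, not the paper's data.
from scipy.stats import wilcoxon

face_model = [4, 3, 5, 5, 3, 4, 5, 3, 5, 4]   # per-rater musicality scores
baseline   = [3, 3, 4, 3, 2, 3, 4, 3, 4, 2]

stat, p_value = wilcoxon(face_model, baseline)  # paired, non-parametric
print(f"Wilcoxon W = {stat:.1f}, p = {p_value:.3f}")
```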
l. 456 typo "no significnat different".
The 2k fine-tuning steps aren't mentioned anywhere in the paper except for the abstract. I'm not 100% sure now, but wouldn't it add up to roughly 5k steps based on 130 clips and the information from Section 4.2?
Is there a reason you used VideoCLIP instead of an AV-sync metric (e.g., CCA-based)?
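For reference, the kind of CCA-based AV-sync metric I had in mind would look roughly like the following sketch; the feature shapes and function names are assumptions on my part.

```python
# Hypothetical illustration of a CCA-based AV-sync score: fit CCA on temporally
# aligned per-frame video and audio features and report the mean correlation of
# the canonical components.
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_sync_score(video_feats, audio_feats, n_components=4):
    # video_feats: (T, d_v), audio_feats: (T, d_a), aligned to the same frame rate
    cca = CCA(n_components=n_components)
    v_c, a_c = cca.fit_transform(video_feats, audio_feats)
    corrs = [np.corrcoef(v_c[:, i], a_c[:, i])[0, 1] for i in range(n_components)]
    return float(np.mean(corrs))

# Toy usage with random features (250 frames, 16-dim video, 32-dim audio).
rng = np.random.default_rng(0)
print(cca_sync_score(rng.normal(size=(250, 16)), rng.normal(size=(250, 32))))
```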
Q2 ( I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Strongly agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Disagree
Q10 (Please justify the previous choice (Required if "Strongly Disagree" or "Disagree" is chosen, otherwise write "n/a"))
The objective metrics are not sufficiently convincing to me.
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Strongly agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Strongly Agree (Very novel topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Disagree
Q15 (Please explain your assessment of reusable insights in the paper.)
As described in the paper, the task is novel and currently has no open-source models available. Will the authors release the code and dataset in the future? I found no commitment in the paper regarding code or dataset release, which is essential for reproducibility.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
The authors proposed a new joint-modality model for music generation with controllable motion and facial expressions.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Weak reject
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
Overall, the paper is well-written. The fine-tuning methodology itself is convincing. The novelty is significant. The goal is to bridge the facial and music domains, which is a bold attempt.
There are two main concerns:
- The design of the objective metrics. Unlike the CLAP score for text and audio, there seems to be no established facial-music consistency score. The authors therefore resort to VideoCLIP and SALMONN to measure facial-music consistency, and that is the only metric that measures the stated goal. For me, the CLAP and FAD scores cannot provide any effective measurement toward that goal. I am not very confident about this measure, and the paper does not provide more details. Such key metrics should be explained in more detail in Section 4.4.1 (a hypothetical sketch of the kind of score I have in mind is given after this list).
- Since there are no established facial-music embeddings or scoring methods analogous to CLAP for text and audio, the authors propose an alternative metric for rhythm. However, the lowest reported value is around 28.07 BPM, which I consider a significant error. In musical contexts, even a BPM deviation greater than 10 would already seem hard to accept.
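To illustrate the first concern: at its core, this kind of cross-modal consistency score is something like a cosine similarity between clip-level embeddings. A hypothetical sketch, not the authors' actual VideoCLIP/SALMONN pipeline:

```python
# Hypothetical sketch of an embedding-based audio-visual consistency score:
# cosine similarity between clip-level video and audio embeddings projected into
# a shared space. The embeddings are placeholders supplied by the caller.
import numpy as np

def consistency_score(video_embedding, audio_embedding):
    v = np.asarray(video_embedding, dtype=float)
    a = np.asarray(audio_embedding, dtype=float)
    v = v / (np.linalg.norm(v) + 1e-8)
    a = a / (np.linalg.norm(a) + 1e-8)
    return float(np.dot(v, a))  # in [-1, 1]; higher = more consistent
```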
The objective metric design is my main doubt about the paper; the remaining parts are well presented.