P2-13: A Survey on Vision-to-Music Generation: Methods, Datasets, Evaluation, and Challenges

Zhaokai Wang, Chenxi Bao, Le Zhuo, Jingrui Han, Yang Yue, Yihong Tang, Victor Shea-Jay Huang, Yue Liao

Subjects: Multimodality ; Music generation ; Generative Tasks ; Music and audio synthesis ; Applications ; Open Review ; Music videos, multimodal music systems ; MIR tasks ; MIR fundamentals and methodology

Presented In-person

4-minute short-format presentation

Abstract:

Vision-to-music generation, including video-to-music and image-to-music tasks, is a significant branch of multimodal artificial intelligence with broad applications such as film scoring and short video creation. However, research in vision-to-music is still in its preliminary stage due to its complex internal structure and the difficulty of modeling dynamic relationships with video. Existing surveys focus on general music generation without comprehensive discussion of vision-to-music. In this paper, we systematically review the research progress in the field of vision-to-music generation. We first analyze the technical characteristics and core challenges for three input types (general videos, human movement videos, and images) as well as two output types (symbolic music and audio music). We then summarize the existing methodologies from the architecture perspective. A detailed review of common datasets and evaluation metrics is provided. Finally, we discuss current challenges and future directions. We hope our survey can inspire further innovation in vision-to-music generation and the broader field of multimodal generation in academic research and industrial applications.

Meta Review:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 ( The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

A well-written, (mostly) complete survey of vision-to-music generation works covering methods, datasets, evaluation, and challenges. The work notes that video-to-music generation works are still in early stages and have yet to receive a complete treatment in the academic literature (I agree). This work will help advance that progress and is worth a read for anyone looking to start working on the topic.

Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)

A well-written, (mostly) complete survey of vision-to-music generation works covering methods, datasets, evaluation, and challenges.

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Strong accept

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

Summary: In this work, the authors present a survey of vision-to-music generation works covering the topics of methods, datasets, evaluation, and challenges. Multiple timelines of representative works and categorizations of methods, datasets, evaluation methods, and the like are provided, as well as example architecture diagrams. The references section is also extensive.

Major/minor comments: Overall, this is a very nice survey and an informative work. The work is well organized at a high level and has well-written sentence structure at a low level. The survey topics of methods, datasets, evaluation, and challenges are well covered, along with the extensive references section. Regarding issues for improvement, I would suggest:

  • Clarification on the language of rhythmic videos. In the intro, you break down the topic into three main areas: 1) general videos, 2) human movement videos, and 3) images. Here, "human movement videos" is narrower than later parts of the work that discuss rhythmic videos, of which human movement videos are a subset. It could be useful to broaden the focus from human movement videos to videos with rhythmic motion, treating human movement videos as a subset.

  • When commenting on page 2 and elsewhere that "audio music lacks controllability, and the generated music is typically shorter (usually under 20 seconds) due to sampling rate limitations," I would argue this is generally untrue. Over the last few years, there have been several works showing extensive controllability for audio-domain music generation (e.g., Music ControlNet) as well as long-form generation (Stable Audio). The controllability of symbolic-domain music is also focused more on typical note-level control, however.

The issues above are minor and easily addressable.

Grammatical comments: • Abstract: “Vision-to-music Generation” -> “Vision-to-music generation”

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Weak reject

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

Summary: In this work, the authors present a survey of vision-to-music generation works covering the topics of methods, datasets, evaluation, and challenges. Multiple timelines of representative works and categorizations of methods, datasets, evaluation methods, and the like are provided, as well as example architecture diagrams. The references section is also extensive.

Initial Scores: 2 strong accepts, 2 strong rejects

Metareview: Overall, initial reviews were very mixed because of the format of a survey paper. Everyone agrees the paper has strong points:

  • R1: "excellent tables with collected systems, datasets, metrics and the structured information like durations, music length, and so on"

  • R2: "very ambitious article that tackles the state of the art on vision-to-music generation, looking at models, datasets and evaluation"

  • R2: "timely, clear, and comprehensive contribution to an emerging multimodal field and will be of immediate use to both academic and applied communities. It is presented in a way that is very useful and very clear."

  • R3: "datasets and metrics tables are helpful views of the research landscape presented in a well-formatted and highly usable format for future researchers."

Areas for improvement:

  • R1: "I'm not really sure what the contribution of the paper is."

  • R1: "The identified challenges and their importance would be a lot more convincing if they were contextualized with impact"

  • R3: "However, outside of compiling information, the paper does not engage with the material very deeply and is therefore not original or impactful."

  • R3: "These insights are very surface level."

Discussion: The discussion focused on issues w.r.t. technical correctness and scope, which were mistakenly flagged by the meta-reviewer as minor issues before the discussion. R1 and R2 confirmed concerns about some of the technical descriptions, as well as the idea that this could become a wonderful (potentially long-form) paper if refined a little more.

Recommendation: Reject (Weak)

Review 1:

Q2 ( I am an expert on the topic of the paper.)

Disagree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Strongly disagree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

Great collection of systems, datasets, metrics, with a historical timeline, for image/video-to-audio synthesis

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

Not sure, actually. "There's a lot of related work within the same domain, and they're all somewhat similar"

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak reject

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

I've had a difficult time with this paper. On the one hand, I greatly appreciate the excellent tables with collected systems, datasets, metrics and the structured information like durations, music length, and so on. On the other hand, I'm not really sure what the contribution of the paper is. The identified challenges and their importance would be a lot more convincing if they were contextualized with impact. If the intended contribution of the paper is to attempt to establish a shared taxonomy for vision-to-music generation I would like more details on the process for that.

I think this would become a wonderful review article if it receives an additional pass with details ironed out, and some sections merged/split/rearranged. Additionally, some smaller technical details seem wrong to me, or at least misleading, which isn't up to the standard I would like to see for a published survey paper that people would rely on.

Some potential improvements:

  • 344: What is Fréchet Distance in this context? Fréchet Inception Distance (FID) or something else? I strongly recommend not referring only to general measures of probability distribution similarity. Typically we mean KL-divergence given some specific, suitable feature representation, and not just time-domain audio or MIDI byte sequences.

  • Why is FAD under "music-only" metrics while CLAP is not, especially considering 380-381?

  • 381: KL isn’t trained

  • "CLAP score" needs clarification. Do you mean cosine similarity between text and audio embeddings? How is that a vision-music correspondence? Or do you mean adapting CLAP to CLIP (e.g., Wav2CLIP) to get a matching still-frame (image) encoder as well?

  • I wonder if a better term for “vision-to-music generation” is “soundtracking” or “soundtrack generation”. Perhaps too limiting in use case, but just a loose thought.

  • It would be nice to avoid "generative music" as a term due to its historical meanings; I would rather see "music generation" in the ethics statement.

  • The claim at 148-149 deserves a backing reference (or should be removed).

  • I would like to see a distinction in Figure 2 between input types for music videos (video created for the song) and soundtracking (songs selected for the video).

  • Add compute resources to Table 1 (e.g., GPU hours needed to train the final system).

Review 2:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Strongly agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

As a review on a novel area, it should be useful as a blueprint for any new research.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

A thorough review on vision-to-music generation, that overviews an extensive collection of models, datasets and evaluation strategies.

Q17 (Would you recommend this paper for an award?)

Yes

Q18 ( If yes, please explain why it should be awarded.)

A very complete and timely review on a topic that has some promise for the near future.

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Strongly agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This is a very ambitious article that tackles the state of the art on vision-to-music generation, looking at models, datasets, and evaluation. This paper makes a timely, clear, and comprehensive contribution to an emerging multimodal field and will be of immediate use to both academic and applied communities. It is presented in a way that is very useful and very clear. While structure and spacing were clearly well thought out to allow the necessary content to fit, the model listing could have been extended with additional useful information on open-source status, existence of demo UIs, pretrained weights, etc. There could also have been a quantitative summary of trends (e.g., yearly evolution of the number of models, model types, dataset sizes, etc.), possibly in graphical or tabular form, but again this had to be a compromise given the available space. There are some claims that are too definitive ("No previous surveys have focused on vision-to-music generation" without "to the best of our knowledge"; "CLAP is the dominant method" instead of "CLAP is currently one of the most widely used models"), but given the overall soundness of the paper they become admissible.

Review 3:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Strongly disagree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Strongly Disagree (Well-explored topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

Having datasets, methods, and metrics compiled together for a specific task is useful. To be frank though, although these insights are reusable I don't feel that they will be valuable for very long. This research area is evolving so rapidly that the methods in this survey will likely be obsolete in a matter of months.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

This survey gives a broad view of some current methods, data, and metrics used to approach the task of video to music.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong reject

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The datasets and metrics tables are helpful views of the research landscape, presented in a well-formatted and highly usable format for future researchers. However, outside of compiling information, the paper does not engage with the material very deeply and is therefore not original or impactful. I expect a good survey to point out trends, identify open questions, and reflect on the big picture of where we started and where we're going. The extent of these insights is:

1. There is a lack of standardized datasets and benchmarks.

2. Customization and controllability are important for practical use.

3. A promising direction is to combine symbolic and audio methods.

These insights are very surface level. You have laid out a list of evaluation metrics used in the literature; are there gaps? You analyze three input types: general videos, human movement videos, and images. Why are these three the focus? What are the other input types? L402 states: "Exploring how to align these technologies with applications offers significant commercial opportunities". For example? The ethical statement in Section 8 simply says "we think the ethics of this work should be considered" but does no critical thinking to further the consideration.

The paper also makes a number of claims that are unsubstantiated opinions:

L132: "we will mainly focus on general videos and images, while paying relatively less attention to human movement videos. This is because their semantic association with music is not strong" and "Their application scenarios are also relatively limited."

To me, it's clear that the semantic association of dance videos with music is strong, and there are many application scenarios.

L278: "the diversity of content and styles in [music videos] may be limited"

To me, this is not the case. Music videos are incredibly diverse in content and style.