TOMI: Transforming and Organizing Music Ideas for Multi-Track Compositions with Full-Song Structure

Qi He; Gus Xia; Ziyu Wang

Abstract:

Hierarchical planning is a powerful approach to model long sequences structurally. Aside from considering hierarchies in the temporal structure of music, this paper explores an even more important aspect: concept hierarchy, which involves generating music ideas, transforming them, and ultimately organizing them—across musical time and space—into a complete composition. To this end, we introduce TOMI (Transforming and Organizing Music Ideas) as a novel approach in deep music generation and develop a TOMI-based model via instruction-tuned foundation LLM. Formally, we represent a multi-track composition process via a sparse, four-dimensional space characterized by clips (short audio or MIDI segments), sections (temporal positions), tracks (instrument layers), and transformations (elaboration methods). Our model is capable of generating multi-track electronic music with full-song structure, and we further integrate the TOMI-based model with the REAPER digital audio workstation, enabling interactive human-AI co-creation. Experimental results demonstrate that our approach produces higher-quality electronic music with stronger structural coherence compared to baselines.
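As a reading aid for the abstract's "sparse, four-dimensional space", here is a minimal sketch, assuming each composition link can be stored as one occupied (clip, section, track, transformation) point. The class names `Link` and `Song` and the example clip names are hypothetical and not taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Link:
    """One occupied point in the (clip, section, track, transformation) space."""
    clip: str
    section: str
    track: str
    transformation: str


@dataclass
class Song:
    """A composition stored sparsely: only the occupied points are kept."""
    links: set[Link] = field(default_factory=set)

    def add(self, clip: str, section: str, track: str, transformation: str) -> None:
        self.links.add(Link(clip, section, track, transformation))

    def in_section(self, section: str) -> list[Link]:
        # All links placed in one section, ordered by track then clip.
        return sorted((l for l in self.links if l.section == section),
                      key=lambda l: (l.track, l.clip))

    def density(self) -> float:
        # Occupied points divided by the size of the full 4-D grid
        # spanned by the values that actually occur on each axis.
        cells = 1
        for name in ("clip", "section", "track", "transformation"):
            cells *= len({getattr(l, name) for l in self.links})
        return len(self.links) / cells if cells else 0.0


# Hypothetical example data.
song = Song()
song.add("kick_loop", "intro", "drums", "drum")
song.add("kick_loop", "chorus", "drums", "drum")
song.add("pad_chords", "chorus", "keys", "general")
song.add("riser", "intro", "fx", "fx")
```

Reusing `kick_loop` in two sections shows how one idea can appear at several points in the space, while `density()` makes the sparsity explicit: most grid cells stay empty.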

Meta Review:

Q2 (I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Disagree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

This paper could provide inspiration to exploit in-context learning with LLMs, not only in the context of music creation but also for other tasks.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

The proposed framework structures music generation around clips, sections, tracks, and transformations, and uses an LLM (with in-context learning) to generate multi-track electronic music.

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Weak accept

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper introduces TOMI, a novel approach to music generation that models the concept hierarchy in music composition. The system structures music generation around clips, sections, tracks, and transformations, using an LLM to manipulate this structure and generate complete multi-track electronic music. The system is integrated with the REAPER digital audio workstation.

This paper has several interesting aspects. The TOMI framework offers a new way to conceptualize and represent music composition, explicitly modeling the hierarchical relationships between musical ideas, their transformations, and their organization in time and instrument layers. This is a significant contribution, as it moves beyond simply modeling the temporal sequence of musical events and attempts to capture the underlying conceptual structure of a composition. The use of in-context learning to guide the LLM in generating the parameters of the TOMI structure is a creative and promising approach. This allows the system to leverage the LLM's language understanding capabilities to generate musically meaningful structures. DAW integration offers practical implications and supports human-AI co-creation, even if it’s not a major scientific contribution. The paper presents a comprehensive evaluation, including both objective metrics for structural consistency and a subjective listening test to assess the perceived quality of the generated music. The inclusion of ablation studies is also valuable, as they isolate the impact of different components of the system. The paper is generally well written, clearly presented, and well structured. The figures and tables, as well as the demo website, are informative and contribute to the reader's understanding of the proposed approach.

And now some of the weaknesses. Using GPT-4o with in-context learning is a creative and interesting choice, but it would be interesting to know how much the generations rely on the examples given in the prompt, and also how creative the generated compositions are. What degree of variation does the system allow, given similar “prompts”?

The TOMI framework facilitates iterative experimentation with song arrangement and instrumentation due to its explicit representation of sections and tracks. However, the system's reliance on pre-existing musical clips limits the user's ability to iterate on the development of core musical ideas (melodies, harmonies, rhythms) within the system. This could be a significant constraint for users who want to combine manual editing with the use of AI in order to explore musical motifs and variations during the compositional process.

Some design and naming choices could be clearer. For example, the creation of a drum sequence (e.g. kick + snare) takes place within a Drum Transformation Node, although this is basically a composition/arrangement task. Similarly, an “Fx transform” only decides whether there is a riser or faller at the end/beginning of a section, so it seems more about arrangement than transformation.

According to Section 3.2: “We initiate the sample retrieval process to get the actual clip materials. Then, we set a global tempo and key to unify the keys and tempos of clips.” While this is a valid approach, it could be beneficial to inform the sample retrieval process with the key and tempo so as to avoid large pitch-shifting or time-stretching factors. Also, the sample retrieval doesn’t really seem to be informed by the arrangement or style. For example, the bridge section in the video on the example webpage introduces a violin, which is unexpected for the style of the composition, and it is in audio format, so the virtual instrument couldn’t be replaced.

The paper also mentions that “it can also extract music stems, such as bass, chord, and melody, from the source MIDI to augment the data.” How is this done? It would be good to have a reference.

To conclude, there are some weaknesses in the article, but there are many positive aspects, and one could foresee interesting extensions of this work, including the control of FX and use in other musical (and potentially non-musical) tasks that involve hierarchy and transformations.
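To make the suggestion about key- and tempo-informed retrieval concrete, here is a minimal sketch, assuming each candidate clip carries a tempo in BPM and a tonic pitch class. The `Candidate` type, the thresholds `max_stretch` and `max_shift`, and the example clips are hypothetical; this is not the authors' retrieval method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    bpm: float
    key_pc: int  # tonic pitch class, 0 = C ... 11 = B


def stretch_ratio(clip_bpm: float, target_bpm: float) -> float:
    """Playback-rate factor needed to bring the clip to the target tempo."""
    return target_bpm / clip_bpm


def semitone_shift(clip_pc: int, target_pc: int) -> int:
    """Smallest signed transposition, in the range [-6, 5] semitones."""
    return (target_pc - clip_pc + 6) % 12 - 6


def rank(cands, target_bpm, target_pc, max_stretch=0.15, max_shift=3):
    """Drop clips needing large corrections; order the rest by least correction."""
    kept = []
    for c in cands:
        ratio = stretch_ratio(c.bpm, target_bpm)
        shift = semitone_shift(c.key_pc, target_pc)
        if abs(ratio - 1.0) <= max_stretch and abs(shift) <= max_shift:
            kept.append((abs(shift), abs(ratio - 1.0), c.name))
    return [name for _, _, name in sorted(kept)]


# Hypothetical clip library; the global target is 128 BPM in A (pitch class 9).
library = [
    Candidate("bass_a", 126.0, 9),
    Candidate("lead_c", 128.0, 0),
    Candidate("pad_eb", 90.0, 3),
    Candidate("pluck_e", 130.0, 4),
]
ranked = rank(library, target_bpm=128.0, target_pc=9)
```

With these assumed thresholds, `pad_eb` is rejected for needing a stretch of roughly 42% and `pluck_e` for needing a five-semitone shift, which is the kind of correction the comment above is concerned about.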

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Accept

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

The paper "TOMI: Transforming and Organizing Music Ideas for Multi-Track Compositions with Full-Song Structure" introduces TOMI, a novel system for multi-track music generation. It proposes a hierarchical taxonomy to describe electronic music songs, utilized with a Large Language Model (LLM) for in-context learning to guide the transformation and organization of existing music clips. The system is integrated with a Digital Audio Workstation (DAW) to facilitate human-AI co-creation. The authors evaluate TOMI through objective metrics and a crowdsourced listening test, comparing it against rule-based methods and other generative approaches.

Strengths

Novel hierarchical taxonomy: The paper introduces a well-structured hierarchical taxonomy for electronic music, explicitly modeling relationships between ideas, transformations, and organization across time and instruments.
Innovative LLM integration: Using an LLM for in-context learning to guide the TOMI structure is a creative approach.
Practical utility & DAW integration: The system's DAW integration demonstrates practical implications for human-AI co-creation and production workflows.
Comprehensive evaluation: The paper presents a thorough evaluation, including objective metrics, subjective listening tests, and ablation studies.
Strong performance: TOMI-LLM is shown to generate more coherent music than baselines, suggesting the LLM learns musical structure, with non-LLM components handling lower-level details effectively (R4).
Readability and presentation: The paper is well-written, clearly presented, and logically structured, aided by informative figures and a demo website.
Reusable insights: The TOMI data structure is extensible, offering potential applications beyond the paper's scope (R1, R4, Meta).

Summary of weaknesses

Limited content generation: The system primarily organizes existing clips, which limits its ability to generate truly novel musical content (R2, Meta).
Reproducibility: Details needed to fully reproduce the LLM's in-context learning behavior are unclear (R1, Meta).
Incomplete related work (R4).
Musical coherence: There are concerns about harmonic consistency within tracks and about how MIDI instrument presets are assigned (R2).
Design and logic clarity: Some framework design choices and aspects of the clip retrieval process need clearer justification (R1, Meta).
Missing technical details regarding the extraction of stems and some quantitative metrics (e.g. track count/duration) (R2).
Minor presentation flaws (phrasing issues and inconsistent reference formatting).

Here are some specific recommendations for the camera-ready version. To enhance the paper's clarity, depth, and impact with feasible effort, consider these changes (and consider each of the individual reviews as well for more detail):

Clarify LLM prompts: Provide examples of complete TOMI prompt structures and standalone LLM prompts on the companion website (R1).
Address clip search: Clarify how clip search operates (R1, Meta).
Cite relevant work: Include and discuss the "SymPAC: Scalable Symbolic Music Generation With Prompts And Constraints (ISMIR 2024)" paper in the related work (R4).
Acknowledge the need for better solutions for MIDI instrument assignment than random selection (R2).
Clarify design choices: Explain the reasoning behind naming conventions (e.g., "Drum Transformation Node," "Fx transform") (Meta).
Add missing technical details: Briefly explain stems extraction from MIDI. Provide metrics or discussions on the number of tracks and duration TOMI can generate (R2).
Refine language and references (see each of the reviews).

To conclude, the paper presents a significant contribution with the TOMI framework and its innovative integration of LLMs for structured music composition, including valuable DAW integration. Despite some minor concerns, the overall strengths and potential for future work are substantial. We recommend acceptance, believing the suggested revisions will further enhance the paper's clarity, depth, and impact.

Review 1:

Q2 (I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Disagree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

A taxonomy for algorithmically creating a DAW session, possibly even useful beyond integrating with an LLM as presented in the paper.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

A hierarchical taxonomy for describing a DAW session is useful for integrating with an LLM to generate music.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper presents a hierarchical taxonomy to describe an electronic music song, which can then be used for in-context learning when provided to an LLM. In general, the writing is clear, and the experiments are clearly described and thorough, including a crowdsourced listening test.

Strengths:
- Nice hierarchical taxonomy
- Good experimental validation

Weaknesses:
- The companion website gives some prompt details, but I'm still unsure how in-context learning happens. How many examples are provided to the LLM? A full prompt that is actually fed to the LLM would be helpful here.
- It wasn't clear to me what a user actually provides the LLM; again, full LLM prompts should be provided on the companion website.
- I was unclear how clip search happens.

Review 2:

Q2 (I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

This paper proposes a novel approach that leverages existing music clips and organizes them to create a coherent music track, a technique commonly used in electronic music production but not yet applied in the domain of automatic music generation.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

An electronic music generation method that employs a large language model to guide the transformation and organization of existing music clips.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The authors propose a new architecture to accomplish full-song-level music generation by leveraging the idea of transforming and organizing existing music clips, and utilize an LLM to guide the generation process. (Although, in my opinion, merely organizing existing music clips from the dataset without creating new content can hardly be defined as "composition.") The audio samples provided are of high quality, and the authors also integrate the system with a DAW, enabling human-AI co-creation, which I believe is highly meaningful. The methodology is explained in detail, and they compare their method with other approaches in the literature using both objective metrics and subjective tests.

The topics discussed in this paper are of interest to the ISMIR community, and the writing is clear and thorough. I recommend it for publication.

For weaknesses, there are some points that need to be addressed in future work:

The harmonic coherence within the musical content of each track is not confirmed.
For MIDI tracks, randomly assigning instrument presets is not a suitable choice. I believe a better solution should be explored.
You emphasize that you achieve multi-track compositions with full-song-level structure. Can you provide a metric to evaluate how many tracks and what duration your method can generate?

Review 3:

Q2 (I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Disagree

Q5 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))

The paper [1] below also implements a system that generates multi-track compositions with full-song structure via section-based prompts. While the authors of [1] work only in the symbolic domain and their implementation is very different, their section-based organization and prompting strategy is similar enough to that in this paper that it should be cited.

[1] SymPAC: Scalable Symbolic Music Generation With Prompts And Constraints (ISMIR 2024)

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Strongly agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The TOMI data structure is reusable and extensible.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

TOMI is a novel hierarchical modeling system that can be used by an LLM (and/or a person) to organize musical ideas into full songs.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Strongly agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

Strengths: This is a well-written paper that proposes a system (TOMI) for transforming and organizing musical ideas into complete compositions. The system is capable of handling both audio and MIDI clips, and organizes them into songs with a process that I imagine is roughly similar to the actual process of many electronic music producers.

TOMI has several potential uses beyond what is explored in the paper. For instance, the system can work with an artist's self-developed database of musical ideas. The system might also be useful for companies like Splice (I could imagine a "help me make a song" link on their homepage). The system could also be used as an interface between an artist and their DAW, with an LLM filling in part of the TOMI data structure and the artist filling in the rest. The artist could swap in or out specific clips that are chosen by the system within this interface, and have the system build (or re-build) the song from the modified data.

The system integration with a DAW is nice, and shows that once the TOMI data structure is complete, the song it encodes can be automatically built inside a DAW for further tweaking by the end user.

The paper finds that an LLM using the full TOMI system generates more coherent music than (a) the LLM using the system without composition links, (b) a rule-based method using the full TOMI system, and (c) MusicGen. These findings, in particular finding (b), are quite interesting to me. Based on these findings and the strengths above, I recommend strong acceptance of this paper. In my experience, current LLMs struggle with music theory, but (b) suggests that they have learned something about the structure of music that is useful for creating new works, and the non-LLM portion of the TOMI system effectively handles the lower-level detail portions of the music-making process that LLMs are weaker at.

While it is clear from the demos that the current implementation of TOMI is not capable of producing music at the level of humans by itself, it is a good step forward, it outperforms reasonable baselines, and it can be used to augment human creativity.

Weaknesses/specific suggestions for improvement:

-In the companion website, I would like to see an example or two of a complete TOMI prompt structure. Right now there are just ...'s in many places. The authors promise to open-source their code after acceptance, and while I'm sure the prompts would be in that code, I don't think readers should have to go digging to find them.

-I would also like to see an example or two of the standalone LLM prompts to ensure that the baseline comparison is reasonable.

-Are >, -, and = the only three symbols allowed in general transforms? Please make this clear somewhere on the companion website.

-It seems that general and drum transforms are limited to 16th note patterns in this specific implementation. Again, just make this clear somewhere.

-Line 148: "defined by a set of features" The clips have a set of features. They are not "defined" by these features.

-Line 155: "they can query the databases" Maybe change this to "we can query the databases," since the LLM supplies f but doesn't directly assist in the query.

-Line 204: "is reused twice" -> "is used twice"

-Lines 260-264: I hope this portion of the code will be released.

-Line 272: "If no matches are found, the clip and its associated composition links are discarded" I'm a little worried about this, as it could cause important parts of the composition to be missing. Would it be reasonable to modify your error-catching script to catch cases like this and ask the LLM to try again?

-Line 358: What exactly does "size" mean here?

-The references need to be cleaned up. (Many citations point to the arXiv rather than the official published versions of papers.)
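The retry idea raised under Line 272 could look roughly like the sketch below. It assumes the clip query is a tuple of feature tags and that a caller-supplied `revise` function stands in for re-prompting the LLM. All names here (`retrieve_with_retry`, `search`, `relax`, `DB`) are hypothetical and do not describe the authors' code.

```python
from typing import Callable, Optional, Sequence, Tuple

Query = Tuple[str, ...]


def retrieve_with_retry(
    query: Query,
    search: Callable[[Query], Sequence[str]],
    revise: Callable[[Query], Query],
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first matching clip, revising the query after each miss.

    None means every attempt failed; only then would the caller discard
    the clip and its composition links.
    """
    for _ in range(max_attempts):
        if not query:  # nothing left to search for
            break
        matches = search(query)
        if matches:
            return matches[0]
        query = revise(query)
    return None


# Hypothetical clip database mapping clip names to feature tags.
DB = {"warm_pad": {"pad", "warm"}, "sub_bass": {"bass", "sub"}}


def search(query: Query) -> Sequence[str]:
    # A clip matches when it has every requested tag.
    return sorted(name for name, tags in DB.items() if set(query) <= tags)


def relax(query: Query) -> Query:
    # Stand-in for asking the LLM to try again: drop the last tag.
    return query[:-1]


found = retrieve_with_retry(("pad", "warm", "shimmering"), search, relax)
missing = retrieve_with_retry(("violin",), search, relax)
```

Here the over-specific pad query succeeds on the second attempt, while the violin query exhausts its tags and falls through to the discard path.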

P3-11: TOMI: Transforming and Organizing Music Ideas for Multi-Track Compositions with Full-Song Structure

Qi He, Gus Xia, Ziyu Wang

Presented In-person

4-minute short-format presentation