P6-1: MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling
Jingjing Tang, Xin Wang, Zhe Zhang, Junichi Yamagishi, Geraint Wiggins, George Fazekas
Subjects: Machine learning/artificial intelligence for music ; Generative Tasks ; Music and audio synthesis ; Open Review ; Expression and performative aspects of music ; Knowledge-driven approaches to MIR ; MIR tasks ; Awards Nominee ; Music synthesis and transformation ; Musical features and properties
Presented In-person
10-minute long-format presentation
Generating expressive audio performances from music scores requires models to capture both instrument acoustics and human interpretation. Traditional music performance synthesis pipelines follow a two-stage approach, first generating expressive performance MIDI from a score, then synthesising the MIDI into audio. However, the synthesis models often struggle to generalise across diverse MIDI sources, musical styles, and recording environments. To address these challenges, we propose MIDI-VALLE, a neural codec language model adapted from the VALLE framework, which was originally designed for zero-shot personalised text-to-speech (TTS) synthesis. For performance MIDI-to-audio synthesis, we improve the architecture to condition on a reference audio performance and its corresponding MIDI. Unlike previous TTS-based systems that rely on piano rolls, MIDI-VALLE encodes both MIDI and audio as discrete tokens, facilitating a more consistent and robust modelling of piano performances. Furthermore, the model’s generalisation ability is enhanced by training on an extensive and diverse piano performance dataset. Evaluation results show that MIDI-VALLE significantly outperforms a state-of-the-art baseline, achieving over 75% lower Fréchet Audio Distance on the ATEPP and Maestro datasets. In the listening test, MIDI-VALLE received 202 votes compared to 58 for the baseline, demonstrating improved synthesis quality and generalisation across diverse performance MIDI inputs.
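The Fréchet Audio Distance quoted above compares the Gaussian statistics (mean and covariance) of embedding sets extracted from generated versus reference audio. As a rough illustration only — not the paper's implementation, and using randomly generated stand-in arrays rather than the audio-classifier embeddings (e.g. VGGish) that FAD normally uses — the metric can be sketched as:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between two Gaussians N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * sqrtm(sigma1 @ sigma2))."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

def fad_from_embeddings(emb_a, emb_b):
    """FAD between two sets of embeddings, shape (n_samples, dim) each."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    return frechet_distance(mu_a, cov_a, mu_b, cov_b)
```

A set compared against itself gives a distance near zero, while any shift in the embedding distribution increases it, which is why lower FAD indicates generations closer to the reference recordings.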
Q2 ( I am an expert on the topic of the paper.)
Agree
Q3 ( The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work.)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Agree (Novel topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
code and demos
Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)
VALLE's methodology, adapted to MIDI-to-audio synthesis for expressive performance rendering.
Q17 (This paper is of award-winning quality.)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)
Strong accept
Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
This paper presents MIDI-VALLE, a neural codec language model for expressive piano performance synthesis, inspired by VALLE from the speech domain. The authors introduce Piano-Encodec, a piano-specific audio tokenizer, and propose a discrete tokenization pipeline for both MIDI and audio, enabling high-fidelity synthesis from symbolic input with either MIDI or audio prompts. The paper is solidly grounded in prior work and demonstrates clear technical contributions, especially in bridging symbolic and acoustic domains via discrete representations.
That said, some limitations are noted. The model primarily targets classical piano and struggles with jazz/generalization; its evaluation lacks comparison to stronger baselines like Pianoteq; and claims around style prompting and codebook interpretability would benefit from more evidence. Importantly, while the architecture supports audio-prompt-based generation, the potential for style transfer (as explored in VALLE) is not fully demonstrated or evaluated. Highlighting this as a future direction—e.g., transferring expressive characteristics from a performer’s prompt to unseen MIDI—would significantly strengthen the work’s broader impact.
Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)
Accept
Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))
This paper combines a custom-trained audio codec with a discrete token-based MIDI-to-audio generation pipeline. The work is technically sound, well-motivated, and includes solid subjective evaluation and audio results demonstrating high-quality piano synthesis.
All reviewers agree on acceptance. For the final version, please address the weaknesses noted by the reviewers, especially the to-do list from R3.
Q2 ( I am an expert on the topic of the paper.)
Strongly agree
Q3 (The title and abstract reflect the content of the paper.)
Strongly agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
The paper provides valuable reusable insights, especially through its analogy between phoneme processing in speech modeling and expressive piano modeling. This comparison offers a foundation for capturing complex musical nuances in a manner similar to how speech models handle linguistic details.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
In the context of expressive piano synthesis, the authors question the use of traditional MIDI representations (piano rolls) and spectrograms, proposing instead the use of transcribed MIDI with more features, and learned tokens (Audio and MIDI) to improve generalization and alignment.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Strong accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
Strengths:
- Flexible framework: MIDI-VALLE presents an adaptable framework that can be applied to various music synthesis and transcription tasks, such as music score prediction, audio-to-MIDI transcription, and music generation beyond classical piano.
- The model introduces a novel method of treating MIDI as discrete tokens, avoiding traditional piano rolls and spectrograms, which improves generalization and synthesis quality.
- MIDI-VALLE’s training on transcribed MIDI data enhances its ability to generalize to recorded data without the need for fine-tuning, which is beneficial for real-world applications where recorded data is often limited.
- The model demonstrates improved synthesis quality (audio examples are provided), outperforming state-of-the-art baselines such as M2A on multiple datasets, with better preservation of timbral and ambient features.

Weaknesses:
- MIDI-VALLE struggles to generalize beyond classical music, particularly with genres that involve richer harmonic content, syncopation, and subtle expressive variations.
- The remaining FAD gap between MIDI-VALLE generations and the ground truth may be due to the noisier outputs of the non-autoregressive model.
- The distinction between MIDI-VALLE’s MIDI tokenization and the Octuple MIDI method is unclear, particularly without a clear reference to the original model. Also, claims about note-wise encoding and reduced complexity lack sufficient explanation or comparison with the original model.
- It would have been interesting to provide audio examples to support this claim: “The first codebook captures primary acoustic features, such as pitch, note duration, and timbre, while the subsequent codebooks focus on finer details of these features.”
Q2 ( I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Strongly agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Strongly agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Strongly agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Strongly agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Strongly agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Agree (Novel topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
- another example of TTS technology applied to music synthesis
- discretization can help improve the synthesis performance but comes with the heavy requirement of large datasets
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
MIDI-VALLE adapts the VALLE neural codec language model and uses discrete tokenization for MIDI and audio to achieve significantly improved, state-of-the-art expressive piano synthesis quality and generalization on classical music compared to previous methods.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Disagree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Strong accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
Strengths:
- Clear writing and analysis
- Superior synthesis quality on classical music
- Robust tokenization approach
- Enhanced system compatibility

Weaknesses of the model:
- Struggles with jazz; the authors acknowledge this and provide appropriate justifications
- Prompt alignment sensitivity, also discussed in the article
Accept (confidence 4/5): This paper adapts the VALLE framework with discrete tokenization for MIDI and audio to achieve state-of-the-art expressive piano synthesis, showing substantial improvements over the M2A baseline, particularly on classical repertoire, via better FAD scores, listening-test preferences, generalization to recorded MIDI, and prompt-acoustic adaptability. While limitations remain in jazz performance, prompt alignment, and pedal synthesis (all acknowledged for future work), the strong technical contributions and clear results on the primary task justify acceptance.
Q2 ( I am an expert on the topic of the paper.)
Strongly agree
Q3 (The title and abstract reflect the content of the paper.)
Strongly agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
- The presented results can contribute to the development of audio-domain synthesis of expressive music performances, including end-to-end systems that transform scores into performance audio.
- Piano-Encodec, if released, can be used for any piano-related audio synthesis task.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
A large transformer-based model for expressive piano performance synthesis from MIDI with acoustic style prompting and control.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Strong accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
The paper presents MIDI-VALLE, a transformer-based model for expressive performance synthesis based on the architecture of VALL-E, a model for text-to-speech synthesis. The work introduces a fine-tuned audio codec for piano music, Piano-Encodec, and a MIDI-to-audio generative model with additional audio and MIDI acoustic prompts for conditioning. The results are convincing and well presented.
The main strengths of the paper are:
1. Theoretically sound and solid paper. Clear motivation and application of existing approaches from the related domain of speech synthesis to expressive music performance synthesis.
2. The choice of model design is validated and all steps are explained in detail. The choice for audio tokenization is also validated and contrasted with the piano roll representation.
3. The trained Piano-Encodec shows good reconstruction quality and can be used for any other research related to piano audio generation. To the best of my knowledge, there are no open-source audio codecs specifically tuned for piano music.
4. The subjective evaluation and the demo samples on the website are convincing and show the effectiveness of the designed approach for performance synthesis.
The main weaknesses of the work are the computational complexity of the model and the suboptimal subjective evaluation:
1. The task of MIDI-to-audio inference does not involve generating an expressive performance from scratch, and thus is easier than end-to-end expressive performance rendering and audio synthesis. While the model design is sound, a 12-layer transformer may be overkill for this problem. The autoregressive part of the first inference stage makes inference slower than alternatives. An ablation on the model size and on replacing tokens with mel spectrograms would be interesting.
2. The work does not contribute much to the architectural design of audio codecs and synthesis. It is a successful adaptation of existing methods to the task of MIDI-to-audio synthesis. The choice of EnCodec may be somewhat outdated given that more advanced audio codecs exist in terms of compression and number of tokens, for example DAC [1], an improved version of EnCodec, or WavTokenizer [2], which uses a single codebook.
3. MIDI-VALLE is only compared with the M2A model [3]. In the M2A paper, however, the model loses against MIDI files synthesized with Pianoteq. This raises the question: is MIDI-VALLE better than Pianoteq synthesis? A direct comparison would strengthen the paper. In addition, the work on diffusion-based performance conditioning for preserving acoustics and style in audio synthesis could be used for a comparison [4].
Some questions and comments that may be addressed in the final version of the paper:
1. In Section 3.1.1, is there any scientific evidence that the first codebook models pitches, durations, and timbre? It follows intuitively, but without confirmation, e.g. by training only on the selected codebooks, it is an unconfirmed statement.
2. In Table 1, why is the vocabulary size for speech 512 when each RVQ has a codebook of size 2048 (Section 4.2)?
3. In Section 3.3, does this mean that some audio-to-MIDI transcription is required for the MIDI prompt, when initially we only have audio for inference?
4. In Section 4.1, pedals are excluded, but do durations encode raw or sustained MIDI durations?
5. In Section 4.2, why are only 60 hours of the entire ATEPP dataset used for codec tuning?
6. Does the model skip or repeat notes in the middle of the sequence? For example, VALL-E is known to struggle with word skips/repeats for non-trivial sentences due to attention failures in autoregressive token modeling. It would be interesting to observe the attention maps for the trained transformer model.
7. The model is trained on 15-20s snippets. Can it be used to synthesize a full-length MIDI performance? How well will the acoustic conditions be preserved?
Minor:
1. The abstract contrasts a two-step approach, and the wording implies that the paper solves its challenges, but the paper solves only the second step.
2. In Section 3.2, the formal definition does not distinguish between the MIDI prompt and the target MIDI. If $x$ is the target MIDI, then the MIDI prompt should also be defined.
3. In Section 3.2, is it correct that the AR model does not work with the acoustic prompt? From Figure 1, this information is not trivial.
4. In Section 6, split the Results section into several subsections for better readability.
5. Line 415: "taht" -> "that"
Overall, this is a solid paper that should be accepted for presentation at the conference.
References:
[1] Kumar, Rithesh, et al. "High-Fidelity Audio Compression with Improved RVQGAN." Advances in Neural Information Processing Systems, 2023.
[2] Ji, Shengpeng, et al. "WavTokenizer: An Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling." ICLR, 2025.
[3] Tang, Jingjing, et al. "Towards an Integrated Approach for Expressive Piano Performance Synthesis from Music Scores." ICASSP, 2025.
[4] Maman, Ben, et al. "Performance Conditioning for Diffusion-Based Multi-Instrument Music Synthesis." ICASSP, 2024.