Abstract:

The multimodal nature of music performance has driven increasing interest in data beyond the audio domain within the music information retrieval (MIR) community. This paper introduces PianoVAM, a comprehensive piano performance dataset that includes videos, audio, MIDI, hand landmarks, fingering labels, and rich metadata. The dataset was recorded using a Disklavier piano, capturing audio and MIDI from amateur pianists during their daily practice sessions, alongside synchronized top-view videos in realistic and varied performance conditions. Hand landmarks and fingering labels were extracted using a pretrained hand pose estimation model and a semi-automated fingering annotation algorithm. We discuss the challenges encountered during data collection and the alignment process across different modalities. Additionally, we describe our fingering annotation method based on hand landmarks extracted from videos. Finally, we present benchmarking results for both audio-only and audio-visual piano transcription using the PianoVAM dataset and discuss additional potential applications.

Meta Review:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 ( The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The multimodal data, documentation, and benchmark analyses of the PianoVAM dataset will facilitate future research in piano transcription and related topics.

Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)

The authors contribute a well-constructed and well-documented multimodal piano performance dataset along with benchmark analyses.

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Strong accept

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This is the meta-reviewer's independent review for paper 87: PianoVAM: A Multimodal Piano Performance Dataset

This paper introduces the PianoVAM dataset, which is a multimodal dataset of piano performances. The paper describes how data were acquired and preprocessed, attributes of the dataset, as well as the fingering annotation algorithm and benchmark results between PianoVAM and MAESTRO and between audio-only and audio-video transcription.

I commend the authors for this comprehensive and well-written paper. The dataset serves as a useful contribution to the MIR field; the paper was thorough and information-dense, while also being easy to follow. The Introduction effectively guides the reader through the topics of multimodal MIR and piano transcription to arrive at the present work, and the tables and figures do an excellent job of summarizing a lot of information in understandable ways. I also thank the authors for the ethics statement in the paper, and for providing the anonymized repo and example video for review. I have two points of main feedback and other minor suggestions for the authors.

Main feedback:

  • The positioning of PianoVAM-Finger within the PianoVAM dataset in the paper was a little confusing. Based on visiting the repo, it seems the fingering information is indeed part of the larger dataset, but PianoVAM-Finger is first mentioned in Section 2.3 and Table 2, and its inclusion or separation wasn't totally clear to me. The authors could clarify this, for example, by mentioning PianoVAM-Finger by name earlier (e.g., in the Introduction) and stating outright that it's part of PianoVAM.

  • Please state the license of the dataset in the paper (at the start of Section 3, for example).

Other minor suggestions for the authors:

  • Section 2.1: There are additional audio-visual datasets not covered here; the authors could point out that those covered are just examples and not all available datasets of this kind.

  • The audio loudness normalization procedure in Section 3.2.2 makes sense. But can the non-normalized versions also be released as part of the dataset? Some users might wish to work with the original recordings or implement their own normalization procedure. If only the normalized audio is released, it is not clear how a user could get back to the original versions.

  • Section 4, paragraph 1: Can the authors elaborate a bit on how the range of works/composers and performer skill levels compares to other published datasets?

  • Section 4, paragraph 2: The difference in sustain pedal usage between MAESTRO and PianoVAM is notable, especially since pitch and velocity distributions were well matched across the datasets. Can the authors speculate as to why PianoVAM has such higher use of the sustain pedal? (E.g., performer skill, emphasis on informal practice, pieces performed, hardware?)

  • Section 5: Can more information on the usage of the GUI-based fingering annotation tool be provided? Was it used only for the annotations reported in Section 5.2.1? Can more information be provided on the number of annotators, their qualifications, and the extent and nature of the annotations?

  • Minor typos/wording issues: line 228 "This sections"; line 252, the sentence beginning "Floating hand"; line 450 "which makes player to prepare".

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Accept

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

This is the meta-review for paper 87: PianoVAM: A Multimodal Piano Performance Dataset. The paper received independent reviews from the meta-reviewer as well as 4 additional reviewers, all of whom were broadly positive in their assessments of the paper.

The reviews highlight many positive aspects of the paper, including the clarity of writing, the paper being well-structured and densely informative, the value of the main contribution (the multimodal dataset), and its clear relevance to MIR. The reviewers have also provided suggestions for improving the paper. Examples include missing details, a need for more explanation of the limited improvement in onset detection from adding visual information, potential shortcomings of focusing on practice sessions, opportunities for more reflection, and a missing license statement. The authors are encouraged to read all of the reviews and incorporate the reviewer feedback.

Review 1:

Q2 ( I am an expert on the topic of the paper.)

Strongly agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Strongly agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Strongly Agree (Very novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The paper highlights the importance of and growing interest in multimodal performance data. PianoVAM combines realistic and varied performance recordings with advances in data processing to provide a novel landmark estimation model and fingering detection algorithm, with comprehensive, high quality outputs.

PianoVAM contributes to the growing body of multimodal datasets in MIR that have been shown to improve model robustness and performance in various performance analysis tasks. The paper also offers valuable insights for future multimodal dataset generation.

These insights show the value of PianoVAM in advancing MIR research and its potential to enable new approaches in understanding and analyzing piano performances.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

The paper provides a novel and comprehensive multimodal dataset of piano fingering techniques with wide-ranging potential impact.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Strongly agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper presents a novel and impactful dataset for multimodal piano performance analysis. The paper is exceptionally well written, with clear relation to existing literature and potential impact.

Thank you for providing the supplementary sources; they were extremely clear and informative. The use of completely open-source systems for maximum accessibility and usability is also commendable.

The dataset looks to hold great potential for future research, training and tuition.

Minor comments: Lines 357-378 could benefit from clarification. The paper could also benefit from some reflection or suggestions for improvement regarding potential inaccuracies in fingering detection due to the reliance on automated methods, particularly for complex, high-tempo recordings.

Review 2:

Q2 ( I am an expert on the topic of the paper.)

Disagree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q5 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))

One relevant related work not discussed is a recently released audiovisual dataset for the analysis of Carnatic music:

Shankar A, Plaja-Roglans G, Nuttall T, Rocamora M, Serra X. Saraga Audiovisual: a large multimodal open data collection for the analysis of Carnatic music. Paper presented at: 25th International Society for Music Information Retrieval Conference (ISMIR 2024); 2024 November 10-14; San Francisco, USA.

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Disagree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Disagree

Q15 (Please explain your assessment of reusable insights in the paper.)

Although (if accepted) the dataset will be very useful for the research community, especially those interested in performance assessment from a pedagogical perspective, it is not clear what has been concluded from the transcription results or the dataset creation methods (e.g., the finger labelling module) apart from showing that PianoVAM is indeed useful.

In the context of a dataset paper, it could have been interesting to discuss some pitfalls or main lessons learnt when collecting this type of multimodal data at a large scale.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

PianoVAM is a multimodal (MIDI from a Disklavier, audio, video) piano rehearsal dataset of amateur performers covering a range of self-reported skill levels with a varied repertoire. This multimodality affords the possibility of creating fingering annotations with demonstrable reliability, and enables its use for audio-visual piano transcription.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper presents the PianoVAM dataset, a multimodal collection of piano performances with comprehensive annotations including audio, video, and note-level information. While the dataset offers valuable diversity in performers and repertoire, there is room for improvement with respect to clarity, methodology description, and technical explanations.

Strengths: The dataset has a high potential to support piano performance research, whether pedagogical or for tasks such as transcription and piano fingering prediction.

Weaknesses:

  • Further information about the distribution of practice sessions per player per piece would significantly enhance the dataset description. Also, more detailed demographic information about the players' training would be valuable.

  • Floating Hands Detection: The description in Figure 3 would benefit substantially from a clearer diagram to improve readability. Perhaps a two-column diagram with clearer directional arrows would better organize the steps into clear phases, and would improve the readability of Section 5 overall.

  • Terminology Misalignment: The term "audiovisual piano transcription" suggests that video is an integral part of the prediction architecture, whereas it appears to be applied as a post-processing step to audio-only transcription models. This should be clarified.

  • The paper mentions measuring "integrated loudness" using the pyloudnorm package without explaining what integrated loudness is or describing the package. This requires clarification and a citation to the pyloudnorm preprint.

  • More specific information about which MediaPipe functionality was used would help ensure the annotations remain understandable even if MediaPipe is discontinued.

  • "Approximate synchronization" between video and MIDI (line 154) needs explanation. Why was a Sakoe-Chiba band of 2.5 seconds necessary instead of applying a simple offset for alignment, especially since both sources were recorded simultaneously?

  • It's unclear whether variations in recording conditions and loudness are intentional features or unintended inconsistencies. The introduction suggests these might be features, but it's not specified whether such variations were documented for each recording setting or whether they were unintentional and meant to be removed.

  • MIDI Velocity Adjustment: Has any adjustment been made to MIDI velocity values in response to loudness normalization? This could significantly affect predicted velocity values in transcription tasks.
  • Lines 233-234 use the term "missed" notes without specifying whether these are incorrect detections or notes omitted from processing.

Minor issues:

  • Table 4: Please indicate that these metrics are for piano transcription.

  • Table 2: The "Data Type" field lists "MIDI, Score"; please clarify whether this refers to a "MIDI score" or to separate performance MIDI and XML (or other format) score files.

Addressing these issues would greatly enhance the paper's accessibility and utility for future research.

Review 3:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The authors have created a unique and rich dataset that will enable further research in piano music analysis.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

The paper introduces a multimodal piano performance dataset with rich audio-visual annotations.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper introduces PianoVAM, a new multimodal piano performance dataset. The dataset is a valuable contribution to the piano music transcription task.

The dataset includes synchronized video, audio, MIDI, hand landmarks, and fingering labels. Such data is indeed rare and has the potential to significantly advance research in detailed piano transcription and performance analysis.

The author also provides a clear description of the semi-automated fingering annotation algorithm and the data acquisition process. The challenges and limitations of the methods are also discussed, which increases the transparency and reproducibility of the work.

My last concern is the limited improvement from visual information. The improvement in onset detection with the inclusion of visual information is smaller than I expected. It would be beneficial if the authors could provide more analysis of why the improvement isn't more substantial.

In any case, PianoVAM is a valuable contribution to the field. The authors have created a unique and rich dataset that will enable further research in piano music analysis.

Review 4:

Q2 ( I am an expert on the topic of the paper.)

Strongly disagree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

This paper offers several reusable insights for the MIR community, particularly in the synchronization of audio, MIDI, and performance video, and in the annotation of fine-grained piano fingering. The alignment pipeline is clearly documented and adaptable, while the fingering annotation method addresses a practical gap in expressive performance modeling. Together, these contributions support a wide range of multimodal MIR tasks, from gesture analysis to strategies for improving transcription robustness under real-world conditions.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

The paper introduces a richly multimodal dataset of natural piano practice sessions that advances MIR research by enabling robust audio-visual transcription and fine-grained performance analysis, including fingering.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper presents PianoVAM, a multimodal dataset of amateur piano practice sessions that includes synchronized audio, MIDI, video, hand landmarks, fingering labels, and metadata. Designed to support tasks like audio-visual transcription and performance analysis, it addresses key gaps in existing datasets (e.g., lack of fingering data). The authors detail the data acquisition process, propose a semi-automated fingering annotation pipeline, and provide baseline transcription benchmarks. The dataset fills a notable gap in current resources by enabling research on expressive performance modeling, transcription robustness, and is made publicly accessible alongside useful tools and documentation.

While I am not an expert in multimodal piano performance analysis, I found the dataset and its scope compelling. That said, it would be helpful to better understand how the focus on practice-session recordings in this dataset might influence certain downstream tasks. For example, because practice performances can include irregularities like hesitations, mistakes, or deviations from the score, this may affect the suitability of the dataset for tasks such as expressive timing modeling or score-to-performance alignment that typically assume more deliberate, structured performances.