Abstract:

Smoothly transitioning between chords on the guitar can be a major challenge for beginners, especially when they rely on just a few common chord diagrams. Yet many chords can be played in multiple ways (i.e., voicings), which can facilitate more comfortable hand movements on the fretboard. To address this, we present the FretboardFlow dataset, featuring 97 songs recorded with a hexaphonic pickup to capture multiple chord voicings as performed by expert guitarists. Our dataset builds upon the GuitarSet pipeline, incorporating a Python translation of Prätzlich et al.'s KAMIR algorithm for interference reduction to produce automated hexaphonic transcriptions. It thereby captures harmonic structure and performance-driven voicing choices that implicitly reflect muscle memory and ergonomic habits, providing a rich resource for analyzing real-world chord transitions. To predict the most convenient voicing within progressions, we propose a dual-model approach integrating both chord and voicing history, and loss functions well-suited to the flexible nature of voicings. Our research expands on prior chord prediction work by incorporating expert-recorded voicing variations of the same progressions and introducing a novel machine learning approach to fretboard navigation. We publicly release this dataset as a living resource to support data-driven exploration of context-aware guitar instructions.

Meta Review:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 ( The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Agree

Q5 ( Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))

Avoid references in languages other than English, unless they absolutely contribute something no other source can. In this case, a relatively random news article about Chordify seems cited instead of a simple link to their homepage.

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

No

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Disagree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Disagree

Q15 (Please explain your assessment of reusable insights in the paper.)

Few insights are derived from the raw numeric data.

Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)

New data is available for fretboard voicing prediction, accompanied by experiments with an alternative architecture.

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Weak reject

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

Summary

The submission presents a new collection of fretboard voicing data, obtained from hexaphonic recordings. Experiments on the prediction of chord voicings are conducted with dual-branch models that process chord symbols and temporally preceding fretboard data separately; the branch outputs are then merged by a linear layer. The branch networks use either Bi-LSTMs or DeepGRUs, with the latter generally performing better according to a number of metrics.

Positives

  • New data to work with is always welcome.
  • The proposed DeepGRU architecture is interesting, and the results are promising.

Negatives

  • A main issue is that the evaluation is strongly numeric, without providing any insight into what this means for the actual real-world problem. What does a typical prediction look like? What types of errors are being made? Is the output perceived to be good enough in practice or completely unworkable? An application-oriented or human-centric addition to the evaluation would be very insightful for this kind of problem without clear ground truth.
  • The dataset could be better curated. Now it is dominated by a single person who recorded more than the rest combined. None of the challenges/opportunities coming from multiple annotators are currently explored. Only hard songs are recorded by multiple people, and the effect of difficulty, number of variations per song per player, etc. is not examined.
  • The whole discussion of "proper scoring rule" does not seem to lead to an approach that is different from earlier work. It seems a justification for a non-existent problem.
  • The term "history length" suddenly appears on line 421 and plays a prominent role in the experiments, but is not properly explained. I interpret it as a strict cut-off of the input to the recurrent layers, but I see no obvious reason why that is necessary, and it seems to contradict the justification for using recurrent layers. At minimum, a comparison with using the complete history would be needed.
  • Both datasets are only used in isolation with cross-validation, whereas there is an opportunity to do cross-dataset evaluation.
  • Representing a fretboard as a binary matrix, where the fact that only the highest fretted note on a string produces sound is not explicitly encoded, seems suboptimal compared to an integer vector representation. This is at least something worth exploring.
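To make the last point concrete, here is a minimal sketch contrasting the two encodings. The string-by-fret binary layout with an open-string column, the `-1` code for a muted string, and all function names are assumptions made for illustration; they are not taken from the paper.

```python
from typing import List

N_STRINGS = 6
N_FRETS = 5   # hypothetical: columns for frets 0..4 only, to keep the example small
MUTED = -1    # hypothetical code for a string that is not played


def matrix_to_vector(matrix: List[List[int]]) -> List[int]:
    """Collapse a binary string-by-fret matrix to one fret number per string.

    If several cells are active on one string, only the highest fret sounds,
    so that is the one kept.
    """
    vector = []
    for row in matrix:
        active = [fret for fret, on in enumerate(row) if on]
        vector.append(max(active) if active else MUTED)
    return vector


def vector_to_matrix(vector: List[int]) -> List[List[int]]:
    """Expand an integer vector to a binary matrix (at most one cell per string)."""
    matrix = [[0] * N_FRETS for _ in range(N_STRINGS)]
    for string, fret in enumerate(vector):
        if fret != MUTED:
            matrix[string][fret] = 1
    return matrix


# Open C major shape, low E string first: x 3 2 0 1 0
c_major = [MUTED, 3, 2, 0, 1, 0]
as_matrix = vector_to_matrix(c_major)

# A binary prediction can activate two frets on the same string...
ambiguous = [row[:] for row in as_matrix]
ambiguous[1][1] = 1  # frets 1 and 3 both active on the A string

# ...yet both matrices describe the same sounding chord.
same_chord = matrix_to_vector(ambiguous) == matrix_to_vector(as_matrix)
```

In this toy setup the integer vector cannot express the redundant state at all, whereas the binary matrix leaves it to the model to learn that such states are equivalent.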

Overall

Given that the dataset is described as a living dataset, and extended analysis and experiments would be welcome, the current submission feels very much like work in progress. Addressing the raised points would lead to a very valuable contribution to the field, but this year's ISMIR might be too soon for that.

Additional comments

  • The supplementary material should have been submitted as a separate file, not as part of the main paper.
  • Confusing usage of both "Amsterdam Playability Dataset" and "Billboard Playability Dataset" to refer to what seems to be the same thing.
  • The references could use cleaning up, e.g. [11] is missing its publication venue and there's a typo in [23] ("toplay").
  • Mentioning first the assigning (l. 224) and then the selecting (l. 238) of the songs is unnecessarily confusing. I interpret it as a free selection out of a predefined subset, but this should be explained once instead of splitting this information over multiple paragraphs.

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Weak accept

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

All reviewers appreciate the work put into collecting the data for this submission and making it publicly available. The experiments show promise, though they would benefit from more human interpretation and insights. We recommend looking at the individual reviews to address the points made there. Do keep expanding this resource!

Review 1:

Q2 ( I am an expert on the topic of the paper.)

Disagree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q10 (Please justify the previous choice (Required if "Strongly Disagree" or "Disagree" is chosen, otherwise write "n/a"))

While there are no theoretical contributions, the proposed machine learning based approach is sound and seems well adapted to the task being addressed.

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

A dataset was carefully collected and edited to ensure reproducibility and further research. Code to reproduce the experiments will also be published.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

While MIR research focuses on chord prediction for pop music guitar, one may also try to predict "expert" voicings that experienced players might prefer for playability, by training on the newly collected FretboardFlow dataset of chord voicing progressions.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

Overall, this is a very nicely written paper, and the very first time I have seen the task of chord voicing estimation being formalized. While this is obviously a very niche topic, with little possibility of extending the learnings from this paper to other MIR tasks, I believe it does qualify as a valid and novel research area.

The dataset collected is, in itself, an impressive contribution. As a guitar player myself, I do understand the value in widening the possibilities of chord voicings to facilitate interpretation, and I look forward to analyzing the various renderings of the same piece that were collected.

The experimental part on chord voicing estimation through machine learning techniques is a bit more confusing. It's not clear from Table 1 (let alone from Table S1) which elements of the proposed method are most effective for the task. The metrics seem to be hard to optimize jointly, and while the authors choose to highlight the test loss, I would intuitively have thought that ease of transition and playability should be most important here.

The discussion on the MSE being a "proper scoring rule" is not very convincing (tbh it reads a bit like an a posteriori justification of why it should be the metric we trust the most). I am no expert on the Brier score, but I believe it is mostly applicable in a binary setup. In a multi-label case such as the task at hand, and considering the rather limited number of samples and the large imbalances, I wouldn't put too much trust in it being an unbiased estimator of actual expert annotation probabilities.
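For reference, a minimal sketch of what the MSE computes when read as a Brier score averaged over independent binary fretboard cells. The toy targets and predictions are made up, and `brier_score` is an illustrative helper, not the paper's evaluation code.

```python
def brier_score(probs, targets):
    """Mean squared error between predicted probabilities and 0/1 targets."""
    if len(probs) != len(targets):
        raise ValueError("probs and targets must have the same length")
    return sum((p - t) ** 2 for p, t in zip(probs, targets)) / len(probs)


# Hypothetical flattened fretboard cells: 1 = pressed, 0 = not pressed.
targets = [0, 0, 0, 0, 0, 0, 0, 1]

confident_right = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
always_empty = [0.0] * 8
base_rate = [0.125] * 8  # predicts the overall frequency of pressed cells everywhere

scores = {
    "confident_right": brier_score(confident_right, targets),  # 0.0
    "always_empty": brier_score(always_empty, targets),        # 0.125
    "base_rate": brier_score(base_rate, targets),              # 0.109375
}
```

Because seven of the eight toy cells are empty, predicting an empty fretboard everywhere already scores 0.125, close to the 0.109375 of the base-rate predictor, which illustrates how class imbalance can compress differences in the averaged score.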

We're also lacking, in some respects, an understanding of the type of predictions that these various systems make. There are several occurrences in the paper where the authors state that changing from open chords to barre chords can be perceived as suboptimal. Are there some configurations that effectively limit such transitions more than others? Maybe a quick qualitative analysis of some of the results could have been insightful.

details

Abstract

  • The last sentence (l14) is odd and lacks a verb.
  • "data-driven exploration of personalized guitar instruction.": I'm a bit puzzled by "personalized", as I see no evidence for this in the paper.

Introduction

  • Quick note on Ultimate Guitar: users do still have comments and ratings to help decide which versions they might find more suitable.
  • The transition on l50 is a bit weird; what is the relation to the previous paragraph?
  • "up to five voicing variations for each of 35 songs," -> "for 35 songs"

Section 5: could a simpler approach that optimizes for hand movement (e.g., a Wasserstein distance on the fretboard) also be tested here?
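One possible reading of this suggestion, as a minimal sketch: treat the fretted positions of each voicing as an empirical distribution along the neck and use the one-dimensional Wasserstein-1 distance between consecutive voicings as a hand-movement cost. This ignores strings and fingering entirely, and the function name and example shapes are assumptions for illustration only.

```python
def wasserstein_1d(xs, ys):
    """W1 distance between two equal-weight empirical distributions on a line.

    Computed as the area between the two empirical CDFs.
    """
    if not xs or not ys:
        raise ValueError("both voicings need at least one fretted note")
    points = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_x = sum(1 for x in xs if x <= left) / len(xs)
        cdf_y = sum(1 for y in ys if y <= left) / len(ys)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total


# Fret numbers of the fretted (non-open, non-muted) notes of three C major shapes.
open_c = [3, 2, 1]                  # x 3 2 0 1 0
a_shape_c = [3, 5, 5, 5, 3]         # x 3 5 5 5 3 (barre at fret 3)
e_shape_c = [8, 10, 10, 9, 8, 8]    # 8 10 10 9 8 8 (barre at fret 8)

cost_near = wasserstein_1d(open_c, a_shape_c)
cost_far = wasserstein_1d(open_c, e_shape_c)
```

A baseline of this kind would pick, for each chord, the candidate voicing with the lowest cost relative to the previous one; here the jump to the eighth-fret barre costs more than the move to the third-fret barre.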

l366: this bidirectionality mirroring the player's choices seems reasonable, but is it also due to the lower performance of the causal models?

l400: "would achieved" -> "would be achieved"

Bibliography

  • The link chosen for Chordify (2) is a Dutch blog post celebrating their 10-year anniversary; maybe a Wikipedia link would be more appropriate.
  • References 11, 12, 18 are incomplete.

Review 2:

Q2 ( I am an expert on the topic of the paper.)

Strongly agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Strongly agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The paper shows how the community could benefit from datasets featuring more varied (guitar) chord voicings

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

A custom dataset with varied chord voicings and a dual-model architecture make it possible to suggest voicings that are consistent through time.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

Overall Comments

A lot of work has gone into this paper, from building a new dataset to comparing it to an existing one, as well as using state-of-the-art methods and introducing new ones. As detailed below, my two main remarks are on the lack of information regarding the recorded guitarists (an Ethical Statement would be particularly relevant too) and the tendency to compare datasets only on ratios even though they are so different in size. Apart from that, the paper is well written and could surely help future research in the field.

Refinements required

  • l14-15: how can one capture muscle memory through microphone recordings?
  • l151-156: I agree that it's important to have varied voicings, but is it really lacking in DadaGP?
  • l208: It's great to release the dataset! What will the licence be?
  • l236: making a web interface is a lot of work, you could show it!
  • l249-259: I think the participants' presentation should arrive earlier. We also lack a lot of information about them, like how they were recruited (was it approved by an IRB?), were they paid, what's their musical background, etc. It would also be interesting discussing why the number of recordings is so unbalanced between participants.
  • l290: rhythm* data quantised? And what is a quarter-measure interval? Is it a quarter note? How does it work in time signatures other than 4/4?
  • l301-312: the criticisms made towards DadaGP are a bit fallacious. I don't think it really makes sense to compare ratios between the two datasets without discussing the fact that DadaGP is at least ten times larger.
  • Figure 2: Same comment
  • l366-367 & 369-370: the claim that a guitarist thinks bidirectionally should be backed up/explained (or removed)
  • Table 1 has a lot of values; maybe only a subset is required in the main text, especially if not everything is discussed.
  • Results analysis: I feel the discussion section lacks a qualitative analysis of the predictions to better understand why the proposed models generally perform worse on string-fret F1 even though the loss is better.

Typesetting and Language issues

  • l4: the use of single is not very clear, should be rephrased
  • l158: natural rather than naturalistic?
  • Some references are incomplete: at least [11], [12], [18]

Review 3:

Q2 ( I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

This work provides a new recorded dataset of multiple chord voicing takes and a dual‑history LSTM/GRU model. Both the data and methodology are reusable if released upon acceptance.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

The work introduces a valuable guitar multi‑voicing chord dataset and a dual‑context neural model to predict guitar‑chord voicings that minimize physical transition cost, going beyond existing one-chord-at-a-time systems.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Weak accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The paper can offer valuable insight and a valuable dataset to the ISMIR community. The contribution is above the acceptance threshold and will help future work on performance‑aware voicing, auto‑arrangement, and guitar pedagogy.

Strengths: Novel Dataset. While similar datasets exist, the introduction of the FretboardFlow dataset addresses a clear gap. It offers a well-structured and resourceful dataset for guitar chord voicings and progressions. With potential future extensions, it can serve as a strong resource for modeling realistic chord transitions.

Practical Application. The dual-model architecture, which integrates chord-symbol sequences and voicing history, is well-motivated and offers a data-driven solution to challenges in chord position selection.

Citations: The paper cites the most relevant prior work in the area and actively addresses potential gaps the reader might have, usually backed by proper citation.

Suggestions: The two paths are concatenated and then passed through a linear layer. Fusion or attention variants could be tried in future work.
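As a rough illustration of the difference, the sketch below contrasts concatenation followed by a linear layer with one simple gated-fusion variant. It is written in plain Python with random, untrained weights; the branch size, the function names, and the choice of a sigmoid gate are all assumptions, not the paper's architecture.

```python
import math
import random


def linear(weights, bias, x):
    """Dense layer: weights is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def concat_fusion(h_chord, h_voicing, weights, bias):
    """Concatenate both branch outputs, then apply one linear layer."""
    return linear(weights, bias, h_chord + h_voicing)


def gated_fusion(h_chord, h_voicing, gate_weights, gate_bias):
    """Per-dimension gate that mixes the two branches.

    gate = sigmoid(W [h_chord; h_voicing] + b)
    out  = gate * h_chord + (1 - gate) * h_voicing
    """
    logits = linear(gate_weights, gate_bias, h_chord + h_voicing)
    gate = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [g * c + (1.0 - g) * v for g, c, v in zip(gate, h_chord, h_voicing)]


rng = random.Random(0)
dim = 4  # hypothetical size of each branch output
h_chord = [rng.uniform(-1, 1) for _ in range(dim)]
h_voicing = [rng.uniform(-1, 1) for _ in range(dim)]
w = [[rng.uniform(-0.5, 0.5) for _ in range(2 * dim)] for _ in range(dim)]
b = [0.0] * dim

out_concat = concat_fusion(h_chord, h_voicing, w, b)
out_gated = gated_fusion(h_chord, h_voicing, w, b)
```

With the gate, each output dimension is a convex combination of the two branch values, so the model can learn when to rely on chord context and when on voicing history; whether this helps on the task is an empirical question.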

The comparison between the DadaGP dataset and FretboardFlow isn't fully direct. You would need a separately collected test set with well-defined structure to properly evaluate and present a table of losses on that set. Claims that DadaGP scores better due to its simpler format are reasonable, but they require experimental backing. The augmentation shifting you're applying may already address many of the issues DadaGP has with uniformity and chord variation. You could try testing the models trained on FretboardFlow on a smaller subset of DadaGP, and the other way around, to confirm your claims.

For completeness you need to explicitly list all types of augmentation you are using, rather than only referencing related work that applies similar methods. (One augmentation you could possibly consider in future work: for chords that span 5 or more strings, you could randomly sample 3 notes, and then for the next chord, sample the 3 closest notes. This will maybe create the variations that are missing from DadaGP and also add more variations to your dataset).
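The augmentation suggested in parentheses could look roughly like the sketch below. Voicings are assumed to be lists of six fret numbers (low E string first) with `None` for muted strings, and "closest" is interpreted as the smallest Manhattan distance in (string, fret) space to any previously kept note; both choices are assumptions for illustration.

```python
import random

MUTED = None  # hypothetical marker for a string that is not played


def played_notes(voicing):
    """Return (string_index, fret) pairs for all sounding strings."""
    return [(s, f) for s, f in enumerate(voicing) if f is not MUTED]


def sample_subset(voicing, k=3, rng=random):
    """Randomly keep k notes of a voicing that uses 5 or more strings."""
    notes = played_notes(voicing)
    if len(notes) < 5:
        return notes  # leave smaller voicings untouched
    return sorted(rng.sample(notes, k))


def closest_subset(prev_notes, next_voicing, k=3):
    """Keep the k notes of the next chord nearest to the previously kept notes."""
    def distance(note):
        return min(abs(note[0] - p[0]) + abs(note[1] - p[1]) for p in prev_notes)

    notes = played_notes(next_voicing)
    return sorted(sorted(notes, key=distance)[:k])


rng = random.Random(0)
g_major = [3, 2, 0, 0, 0, 3]        # open G
c_major = [MUTED, 3, 2, 0, 1, 0]    # open C

first = sample_subset(g_major, k=3, rng=rng)
second = closest_subset(first, c_major, k=3)
```

Each call yields a reduced three-note version of the progression; repeating it with different seeds would produce several variants of the same chord sequence.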

Human validity. The loss metric is automated; a small perceptual/user study with guitarists would strengthen the claims.