Keyboard Temperament Estimation from Symbolic Data: A Case Study on Bach's Well-Tempered Clavier

Peter Van Kranenburg; Gerben Bisschop

Abstract:

In this paper we address the task of keyboard temperament estimation from symbolic data. The aim is to find a keyboard temperament that minimizes the deviations from pure intervals, given a corpus of music. The problem of finding a suitable temperament has been studied for centuries, and many solutions have been proposed. By taking a data-driven approach, we contribute a new method to this field. We define a loss function that measures the deviation from pure intervals, with a reward for exactly pure intervals. Three optimization methods are explored: Basin Hopping, Differential Evolution, and Dual Annealing. We validate our method with synthetic data and by comparing it with c. 1,500 existing temperaments, including equal temperament. Our method improves on every existing temperament. As a case study, we apply the method to Bach's Well-Tempered Clavier. Our findings show interesting correspondences with existing proposals in the musicological literature.
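To make the idea of a corpus-weighted deviation loss concrete, the following Python sketch scores a temperament against an interval inventory. It is a hypothetical illustration, not the authors' loss: the names `PURE_RATIOS`, `weighted_loss`, and the toy inventory are assumptions, each interval class has a single pure target, and the reward term for exactly pure intervals is omitted. Interval sizes are handled in cents so that they add.

```python
import math

# Hypothetical sketch, NOT the paper's loss function.
# A temperament is a list of 12 pitch-class positions in cents (C = 0).

# Assumed pure targets per interval class (semitone count -> frequency ratios).
PURE_RATIOS = {
    3: [6 / 5],   # minor third
    4: [5 / 4],   # major third
    5: [4 / 3],   # perfect fourth
    7: [3 / 2],   # perfect fifth
    8: [8 / 5],   # minor sixth
    9: [5 / 3],   # major sixth
}

def cents(ratio):
    """Convert a frequency ratio to cents (additive: cents(a*b) = cents(a) + cents(b))."""
    return 1200.0 * math.log2(ratio)

def interval_size(pos, i, j):
    """Size in cents of the ascending interval from pitch class i to pitch class j."""
    return (pos[j] - pos[i]) % 1200.0

def weighted_loss(pos, inventory):
    """Sum of weight * |deviation from the nearest pure target| over an inventory.

    inventory maps (i, j) pitch-class pairs to a weight, e.g. a corpus count.
    """
    total = 0.0
    for (i, j), weight in inventory.items():
        targets = [cents(r) for r in PURE_RATIOS[(j - i) % 12]]
        size = interval_size(pos, i, j)
        total += weight * min(abs(size - t) for t in targets)
    return total

equal = [100.0 * k for k in range(12)]
# Toy inventory: C-E (major third) and C-G (fifth) with hypothetical weights.
inventory = {(0, 4): 3.0, (0, 7): 5.0}
print(round(weighted_loss(equal, inventory), 3))  # prints 50.834
```

A function of this kind could, for instance, be minimized over the eleven free pitch-class positions (C fixed at 0) with SciPy's `differential_evolution`, `basinhopping`, or `dual_annealing`, which implement the three optimizers named in the abstract; how the paper parameterizes and constrains its search is not stated here.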

Meta Review:

Q2 (I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work.)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q10 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))

The study is robust, reproducible, and technically rigorous. The authors validate their method with synthetic data and systematically evaluate across multiple target sets, optimization strategies, and historical corpora (including the full WTC).

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Strongly agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Strongly Agree (Very novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The framework is generalizable beyond the WTC and could be applied to other corpora, styles, or even tuning systems outside Western music. The loss function design, optimization approach, and corpus weighting mechanisms are highly reusable.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

This paper introduces a method to estimate keyboard temperaments from symbolic corpora using weighted interval loss functions and optimization, revealing compelling links to historical temperaments.

Q17 (This paper is of award-winning quality.)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Strongly agree

Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)

Strong accept

Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

This paper presents a novel and rigorously developed framework for estimating keyboard temperaments from symbolic musical corpora, using optimization over weighted interval inventories and comparing results with hundreds of historical temperaments. The use of Bach’s WTC as a case study is both musically and historically relevant. The authors strike a good balance between technical modelling, algorithmic rigor, and interpretive reflection.

Strengths:
- Innovative problem formulation that bridges MIR and musicology.
- Well-designed and theoretically grounded loss and reward functions.
- Thorough validation and comparative analysis with historical temperaments.
- Rich engagement with secondary literature and music theory.

Suggestions for improvement:
- Clarify parameter settings (e.g., weightings of intervals, β in exponential decay) and their sensitivity.
- Provide a higher-level summary of technical sections for non-specialist MIR readers.
- Consider sharing code or a simplified demo to aid reproducibility and adoption.

Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)

Strong accept

Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))

This paper presents a novel and technically robust framework for estimating keyboard temperaments from symbolic musical corpora, using a corpus-weighted loss function defined over mistuned intervals. The approach is applied to Bach’s Well-Tempered Clavier and evaluated through a comparison with over 200 historical temperaments. The paper stands out for its synthesis of computational musicology, symbolic MIR, optimization techniques, and historical interpretation.

The reviewers unanimously rated the paper “Strong Accept,” praising its methodological rigor, clarity, and musical significance. Several aspects of the work deserve special mention:
• Originality: The paper proposes a compelling new task, temperament estimation from symbolic corpora, which is grounded in centuries of musicological inquiry yet has not previously been explored computationally in this way.
• Scientific quality: The loss functions, evaluation metrics, and optimization strategies are carefully chosen and well explained. The results are both reproducible and interpretable.
• Impact: While this is not a mainstream MIR task, it bridges computational methods with questions of historical performance practice and musical structure, making it relevant to a wide range of ISMIR attendees.

Suggestions for camera-ready improvements, echoed across several reviews, include:
• Clarifying parameter choices (e.g., α, β, and snapping functions) and assessing their sensitivity.
• Reframing the novelty claims to avoid overstating the task as entirely new, and better emphasizing the contribution as a well-motivated formalization of an existing musicological problem.
• Improving reproducibility by sharing code and, if possible, sound examples (e.g., synthesized outputs using inferred tunings).
• Addressing minor presentation issues, such as labeling, figure reordering, and typographic adjustments.
These are all minor refinements that, once addressed, will further enhance the clarity and accessibility of this valuable work.

This paper introduces an innovative and rigorous method for historical temperament estimation, grounded in music theory and symbolic MIR. It makes a unique and reusable contribution to the ISMIR community and merits inclusion in the conference.

Review 1:

Q2 (I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q5 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))

-

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Strongly agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q10 (Please justify the previous choice (Required if "Strongly Disagree" or "Disagree" is chosen, otherwise write "n/a"))

-

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Disagree (Standard topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Disagree

Q15 (Please explain your assessment of reusable insights in the paper.)

The proposed method is quite useful both for studying historic performance practice (as done in the paper with respect to the WTC) as well as for practical applications in music performance, but it doesn't really go beyond that. The theory that links temperaments to other domains is relatively well-explored and this paper doesn't add much to it.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

A criterion for (weighted) "mistunedness" of intervals is combined with optimization methods to compare how well different temperaments fit a given corpus of pieces, and to compare historic temperaments to optimal solutions.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

The authors propose a family of criteria to determine how well a 12-tone temperament fits a given corpus, based on (a) a set of possible target sizes for each interval (corresponding to simple frequency ratios) and (b) the harmonic and melodic intervals encountered in the corpus, weighted by metric position. They apply a range of optimization methods to find good solutions and compare them to a large set of historic temperaments. They find that, under the given criteria, better temperaments than the historically discussed ones can usually be found, but that the observed patterns nevertheless reflect the historic discourse.

The paper is very straightforward and well-written. The proposed method is simple and sensible, produces interesting results, and can be applied to other historic collections, both for scholarly investigation and for performance. I would consider the novelty rather low, since the problem of optimal temperaments has been widely discussed (as the paper shows); however, there is clearly a contribution here. I therefore happily recommend accepting the paper.

My main complaint about the paper is that the framing of the contribution is a bit off. For example, I would advise against phrases such as "introducing a new task", since the underlying problem (finding a good temperament for a certain style of music or a given set of pieces) has been discussed in some form for quite some time, as the authors themselves demonstrate. Similarly, the authors speak of their method "outperforming" existing temperaments. I would argue that the main contribution is the loss function, not the optimization methods, which are applied out of the box and could be substituted with other non-linear optimization methods. In that respect, arguing that the proposed method works best because its results score best with respect to the proposed loss turns the argument on its head (it would be easy to define an entirely unsuitable loss for which the optimized solutions would naturally score better than existing temperaments). The actual usefulness of the proposed criterion is (a) derived from sensible underlying assumptions and (b) demonstrated by obtaining interesting solutions that are similar to the historic ones.

I suspect that this focus on "performance" results from perceived norms for evaluation at ISMIR, which I see as a bit of a structural problem. In this case, I think that the theoretical discussion of the model assumptions and the results make for the better argument, and I would advise the authors to adapt their framing in this direction.

Minor remarks:
- l. 167: using the chordify method to determine harmonic intervals is a bit problematic, since the weight that a particular harmonic interval receives depends on what happens in a third voice. E.g., a quarter-note interval between voices 1 and 2 is counted once if no other voice moves, but twice or more if another voice moves during this interval. Counting each occurring harmonic interval exactly once or weighting it by its duration would both have been sensible choices, but this method is a bit odd.
- l. 219: for clarity, mention that interval sizes here are given in cents, not as frequency ratios (for additivity).
- l. 290: capitalize "validation".
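The counting issue raised in the first remark can be illustrated with a small Python sketch. It assumes a hypothetical three-voice texture stored as (onset, duration, MIDI pitch) tuples and a simple slicing routine that mimics a chordify-style reduction; it is not music21's actual `chordify` implementation, and it ignores the metric weighting that the paper applies.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical three-voice texture: each note is (onset, duration, MIDI pitch),
# with times in quarter notes. Voices 1 and 2 hold a major sixth (C4-A4) for
# one quarter note while voice 3 moves in two eighth notes (F3, G3).
voices = [
    [(Fraction(0), Fraction(1), 69)],
    [(Fraction(0), Fraction(1), 60)],
    [(Fraction(0), Fraction(1, 2), 53), (Fraction(1, 2), Fraction(1, 2), 55)],
]

def slices(voices):
    """Cut the texture at every onset and offset (a chordify-style reduction).

    Returns (start, end, sounding_pitches) for each slice.
    """
    times = sorted({t for v in voices for on, dur, _ in v for t in (on, on + dur)})
    result = []
    for start, end in zip(times, times[1:]):
        sounding = [p for v in voices for on, dur, p in v if on <= start and end <= on + dur]
        result.append((start, end, sounding))
    return result

def harmonic_intervals(pitches):
    """Pairwise intervals in a slice, as pitch-class distances in semitones (mod 12)."""
    ps = sorted(pitches)
    return [(ps[b] - ps[a]) % 12 for a in range(len(ps)) for b in range(a + 1, len(ps))]

def count_per_slice(voices):
    """Count each interval once per slice, so extra slices inflate its weight."""
    counts = Counter()
    for _, _, sounding in slices(voices):
        counts.update(harmonic_intervals(sounding))
    return counts

def count_by_duration(voices):
    """Weight each interval by how long it sounds, independent of other voices."""
    weights = Counter()
    for start, end, sounding in slices(voices):
        for ivl in harmonic_intervals(sounding):
            weights[ivl] += end - start
    return weights

print(count_per_slice(voices)[9])    # prints 2: the sixth is counted in both slices
print(count_by_duration(voices)[9])  # prints 1: it sounds for one quarter note in total
```

Per-slice counting gives the sixth a weight of 2 only because the third voice moves, whereas duration weighting assigns it one quarter note regardless. Counting each distinct sounding interval exactly once would be the other option the reviewer mentions.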

Review 2:

Q2 (I am an expert on the topic of the paper.)

Disagree

Q3 (The title and abstract reflect the content of the paper.)

Agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Agree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Agree

Q15 (Please explain your assessment of reusable insights in the paper.)

The authors present a computational framework for estimating temperament for keyboard works.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

The authors present a computational framework for estimating temperament for keyboard works.

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Disagree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

Novelty: The paper introduces a new problem, optimizing keyboard temperament based on input music (whether synthetic or real), which is interesting and worth studying.

Scientific rigor/merit: High. Methodologically strong paper, mathematically precise. It presents finding a temperament as an optimization problem, which seems computationally and mathematically practical. The presentation of the problem and the proposed solution is clear. The loss function seems grounded in reality. Evaluation through three different optimization algorithms seems fine, especially since they all get similar results. All facets are adequately described. I thought the discussion of and engagement with historical musicological debates was good, although it might go over some readers' heads. Minor problems: the loss-function parameters (alpha, beta, etc.) could be tuned more carefully. Right now they are set to magic numbers, which isn't great, although it seems to work well enough.

Clarity/readability: High. Logical flow and organization; the math is clear and precise. There is some musical jargon that we don't see much of at ISMIR (wolf fifths, circular temperament, etc.), but that can be chalked up to the fact that ISMIR doesn't receive much work on temperament. The paper could use a figure or more explanation of the "snapping" effect of R_pure.
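One way to picture the "snapping" effect is a toy case with two intervals that cannot both be pure. The Python sketch below assumes, hypothetically, a reward of the form gamma * exp(-|d| / beta) subtracted per interval, where d is the deviation from pure in cents; the paper's actual R_pure, and the values of beta and gamma used here, may well differ.

```python
import math

def deviation_loss(devs):
    """Plain penalty: sum of absolute deviations (cents) from pure."""
    return sum(abs(d) for d in devs)

def loss_with_reward(devs, beta=0.5, gamma=2.0):
    """Hypothetical: penalty minus a bonus gamma * exp(-|d| / beta) per interval.

    The bonus is largest (gamma) only when an interval is exactly pure.
    """
    return sum(abs(d) - gamma * math.exp(-abs(d) / beta) for d in devs)

CONFLICT = 10.0  # hypothetical: the two intervals cannot both be pure

def devs_for(x):
    """Parameter x (cents) moves one interval away from pure and the other toward it."""
    return [x, CONFLICT - x]

grid = [k * 0.5 for k in range(21)]  # x = 0.0, 0.5, ..., 10.0
plain = [deviation_loss(devs_for(x)) for x in grid]

print(max(plain) - min(plain))  # prints 0.0: the plain loss is flat over the range
best = min(grid, key=lambda x: loss_with_reward(devs_for(x)))
print(best)  # prints 0.0 (x = 10.0 ties): one interval snaps to exactly pure
```

With only the absolute-deviation penalty, every compromise in the range costs the same; the reward term turns the endpoints, where one interval is exactly pure, into strict minima, which is the behaviour a figure of R_pure could show.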

Relevance: While this work is novel, scientifically rigorous, and clear, it may not be relevant to the majority of ISMIR attendees, but I do not think it should be penalized for that. It's data-driven work, which is important, as is combining computational musicology with performance practice.

Review 3:

Q2 (I am an expert on the topic of the paper.)

Agree

Q3 (The title and abstract reflect the content of the paper.)

Strongly agree

Q4 (The paper discusses, cites and compares with all relevant related work)

Strongly agree

Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)

Strongly agree

Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)

Yes

Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)

Agree

Q9 (Scholarly/scientific quality: The content is scientifically correct.)

Strongly agree

Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)

Strongly agree

Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)

Disagree

Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)

Agree (Novel topic, task, or application)

Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)

Strongly agree

Q15 (Please explain your assessment of reusable insights in the paper.)

Formulating musicological questions as optimization problems, when possible, can be a fruitful approach.

Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)

How can one estimate the optimal tuning for a given corpus?!

Q17 (Would you recommend this paper for an award?)

No

Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)

Agree

Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)

Strong accept

Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)

Interesting work, aiming to infer the best temperament for a set of pieces, favoring so-called "pure" intervals. The experiments are solid and engaging. It's unfortunate that neither the code nor synthesized versions of some WTC pieces using the inferred optimal intervals are available; such recordings would allow for a subjective evaluation of whether the resulting tunings are indeed more compelling. A brief note discussing the practical applicability of this approach in real-world settings would be valuable: how feasible would it be to tune instruments according to the obtained temperaments? Of course, this would be much easier on a synthesizer, but that contrasts with historically informed performance practice. It would be nearly impossible, for instance, to implement such a tuning on a pipe organ.

I was intrigued that the optimization appears to be performed, if I understood correctly, on the intervals listed in Table 1. I'm curious what the results would be if the optimization were unconstrained, requiring only that the sequence of p's be increasing. Would we converge toward equal temperament when using a large corpus of pieces? Section 4 presents an interesting musicological discussion! I confess I considered recommending 'weak accept' due to the lack of code, but I believe this does not diminish the contribution of the work, so I decided on 'strong accept'.

Additional comments:

Lines 164–165: "but we can decide to omit intervals": why would that be desirable?

Table 1: Please provide a source for the “acceptable” intervals listed.

Section 2.6 title should start with a capital letter.

Figure 2: Consider reordering the pitch classes starting from C, as done in the keyboard diagram in Figure 1.

P5-1: Keyboard Temperament Estimation from Symbolic Data: A Case Study on Bach's Well-Tempered Clavier

Peter Van Kranenburg, Gerben Bisschop

Presented In-person

10-minute long-format presentation