P1-11: Matchmaker: An Open-Source Library for Real-Time Piano Score Following and Systematic Evaluation
Jiyun Park, Carlos Eduardo Cancino-Chacón, Suhit Chiruthapudi, Juhan Nam
Subjects: Evaluation methodology ; Real-time considerations ; Generative Tasks ; Reproducibility ; Evaluation, datasets, and reproducibility ; Open Review ; Alignment, synchronization, and score following ; Expression and performative aspects of music ; MIR tasks ; Musical features and properties
Presented In-person
4-minute short-format presentation
Real-time music alignment, also known as score following, is a fundamental MIR task with a long history and is essential for many interactive applications. Despite its importance, there has not been a unified open framework for comparing models, largely due to the inherent complexity of real-time processing and the language- or system-dependent implementations. In addition, low compatibility with the existing MIR environment has made it difficult to develop benchmarks using large datasets available in recent years. While new studies based on established methods (e.g., dynamic programming, probabilistic models) have emerged, most evaluations compare models only within the same family or on small sets of test data. This paper introduces Matchmaker, an open-source Python library for real-time music alignment that is easy to use and compatible with modern MIR libraries. Using this, we systematically compare methods along two dimensions: music representations and alignment methods. We evaluated our approach on a large test set of solo piano music from the (n)ASAP, Batik, and Vienna4x22 datasets with a comprehensive set of metrics to ensure robust assessment. Our work aims to establish a benchmark framework for score-following research while providing a practical tool that developers can easily integrate into their applications.
Q2 (I am an expert on the topic of the paper.)
Strongly agree
Q3 (The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work.)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
The package makes score following algorithms easily available to the community, which in turn allows for a deeper analysis and understanding of these algorithms, as well as baselines for future experiments.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
A Python package for real-time score following algorithms.
Q17 (This paper is of award-winning quality.)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Disagree
Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)
Weak accept
Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
This paper presents an open-source Python package designed for real-time audio-based score following. The authors provide and evaluate several algorithms and features on three public piano datasets.
Main Strengths:
- Main contribution: an open framework for real-time score following is provided, comparable to available offline tools like the Sync Toolbox or Match. This makes score following algorithms easily available for everyone.
- The provided algorithms work well on piano music. Additionally, the design of the package suggests that it should be easy to add further algorithms in the future.
- Standardising the evaluation of these algorithms is important, so providing the evaluation measures directly with this Python package is very valuable for future research.
Main Weaknesses:
- Limited Scope: Only three algorithms are included and evaluated (two OLTW and one HMM), excluding some well-known older approaches (like Antescofo) as well as more recent approaches based on learned features. One problem is that most of the approaches are not publicly available, but these would be worthwhile additions to this package (and to the comparison of approaches in the paper). Additionally, the algorithms are only evaluated on piano music, while in general they should work on any kind of music. It would be very useful to include at least some results on other genres.
Further Comments:
- It would be interesting to look into the relationship between quantitative evaluations and qualitative feedback to identify which evaluation measures show a high correlation with the perceived quality. This is probably highly dependent on the application (automatic accompaniment vs. visualisations vs. ...), but it is something I am missing in the literature and it would provide some grounding of the evaluation results.
- I was a bit confused by ASAP vs. (n)ASAP: in the table it is named ASAP, while in the text it is referred to as (n)ASAP. It would be good to unify/clarify this.
- I do not fully understand Table 4. How are these delays measured? Is this the time it takes to compute one step of the algorithm? That would be highly dependent on the implementation details of each algorithm. This makes sense in the context of the package (on certain hardware), but it is not a general guideline regarding the properties of the algorithms (see the sketch after this list). Also, how is the MAE for the different feature types computed? This only makes sense in combination with an alignment algorithm. Or is this some average over all algorithms?
- Regarding features, there is also another version of the LSE features in Arzt, Widmer, Dixon: "Adaptive distance normalization for real-time music tracking" (EUSIPCO 2012), where they were combined with chroma features, which led to further improvements.
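Regarding the delay question above: if "alignment delay" means per-step computation time, I would expect a measurement along the lines of the following minimal sketch (the `tracker.step()` interface here is hypothetical, not necessarily the package's actual API):

```python
import time

def measure_step_delays(tracker, frames):
    """Mean and max wall-clock time per alignment step."""
    delays = []
    for frame in frames:
        start = time.perf_counter()
        tracker.step(frame)  # hypothetical per-frame update of the score follower
        delays.append(time.perf_counter() - start)
    return sum(delays) / len(delays), max(delays)
```

Numbers obtained this way characterise a particular implementation on particular hardware rather than intrinsic properties of the algorithms, which is the distinction I would like the paper to make explicit.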
Typos, Grammar, Style:
- Line 113: othen -> often
- Line 142: The usage of the package is mainly divided into two scenarios -> The package supports two main usage scenarios
- Line 145: with default setting -> with the default setting
- Line 220: We only included performance -> We only included performances
- Line 239: which we name it log-spectral energy -> which we name log-spectral energy
- Line 430: ourperforms -> outperforms
- Figure 6: OTLWArzt -> OLTWArzt
Summary:
This paper presents a much-needed, easily usable Python package for real-time score following. While there is room for improvement (e.g., providing more algorithms and considering more music genres, not only classical piano music), it is a valuable contribution to the field of MIR.
Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)
Weak accept
Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))
This paper presents an open-source Python framework for the evaluation and benchmarking of real-time audio-based score following algorithms. The framework includes implementations of multiple alignment algorithms, standardised evaluation metrics, and public datasets. Reviewers broadly agree that this is a useful contribution that addresses a significant gap in reproducibility and standardisation in the score following community.
However, there are a number of concerns with this manuscript, the main one being the strong focus on piano music. It would be good to see other music genres included, also to ensure that the focus on a specific genre does not lead to design decisions that make the framework hard to use in more general cases.
Further concerns include a limited discussion of real-world problems of score following and their evaluation (e.g. trills, skips, repeats), a lack of deeper error analysis (why does the HMM underperform?), some needed clarifications (e.g. how is the delay measured?), and the writing style (incl. typos). Please see the individual reviews for details.
Overall, despite these concerns, this is a useful contribution which fits well into ISMIR. Before a potential publication, the authors are strongly advised to improve the writing of this paper and to try to address/discuss some of the concerns discussed in the reviews.
Q2 (I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Disagree
Q15 (Please explain your assessment of reusable insights in the paper.)
The tests as reported in this manuscript are presently limited to piano performance datasets.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
A unified software platform with web demo has been developed for evaluating and comparing the performance of score-following methods.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Weak reject
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
I believe the team has thoroughly examined state-of-the-art score-following methods, and the manuscript demonstrates their passion not only to make fair comparisons but also to make the job easier for fellow researchers and developers. However, I recommend borderline reject due to the writing style. At times, the style of this manuscript is more like a combination of a review article (quick mention of 58 citations) and a user manual (Sec. 3). It contains a lot of useful information, but sometimes this makes it hard for a reader to discern the emphasis and identify where the novelty truly lies. I think the paper might generate discourse as a demo, but I am not sure if it is ready to be accepted as a technical paper. Below are some additional comments and inquiries for the authors to consider.
- In Sec. 2.2 we see some empty citations "[]".
- I suppose that AE (absolute error) would remove an important aspect of score following -- whether the follower is lagging behind or rushing forward. Since I am not fully aware of the current progress on this topic, I am curious whether anybody has discussed this and used median and average error with a sign (+/-) to supplement the information lost by taking the absolute value (see the concrete sketch at the end of these comments).
- In Sec. 5.1: please define $\theta_e$ -- I suppose it is the tolerance of inaccuracy?
- In lines 316-323, the symbols $t_i$, $t_j$ are performance "time" but $t_k$ is performance "beat". Does that mean $t_k$ is in units of beats while $t_i$ and $t_j$ are in seconds? I am puzzled, and thus the equation looks vague to me.
- Line 390: I cannot see where the "horizontal segments" are. Perhaps they can be manually marked to enhance visibility?
- Sec. 8 is great and I very much look forward to playing with the Web demo.
- The performance of the HMM looks far below the other methods in Tables II and III, and the authors revealed that it might have been under-evaluated (Line 434). This might weaken the overall acceptability of the manuscript, but I agree that this is worth future investigation.
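To make the signed-error suggestion above concrete (the notation here is my own, not the paper's): with estimated event times $\hat{t}_i$ and reference times $t_i$ over $N$ events, one could report

$$e_i = \hat{t}_i - t_i, \qquad \mathrm{ME} = \frac{1}{N}\sum_{i=1}^{N} e_i, \qquad \mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \lvert e_i \rvert,$$

where a positive ME indicates the follower is lagging behind and a negative ME that it is rushing ahead; the MAE discards exactly this sign information.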
Q2 (I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
This framework can serve as a baseline platform for evaluating new score-following algorithms or integrating new feature extraction modules or alignment backends. Researchers can plug in new methods and test them against the same standardized datasets and metrics.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
The paper provides a unified, open-source Python framework for evaluating and benchmarking score-following algorithms.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Disagree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Weak accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
The paper provides a unified, open-source Python framework for evaluating and benchmarking score-following algorithms. This is important, since the lack of such a framework has hindered progress in the field due to issues with reproducibility, comparability, and generalizability of research findings. This paper makes a valuable contribution to the field of music information retrieval, specifically in the area of real-time audio score following.
The strengths of the paper lie in:
- Addressing a Real Need: The paper tackles the problem of fragmented implementations and the difficulty of comparing score-following methods, which is a significant bottleneck in the field.
- Open-Source Framework: The development and release of an open-source Python package provide a valuable tool for the MIR community.
- Systematic Evaluation: The authors conduct a systematic evaluation of different music representations and alignment methods using multiple datasets. The use of diverse datasets (ASAP, Batik, and Vienna4x22) and comprehensive evaluation metrics strengthens the validity of their findings.
Despite the strengths, I have several concerns and questions:
The paper focuses on alignment accuracy but does not consider how systems handle large deviations like repeats, skips, or performer errors. These are common in live piano practice. Could the framework take these factors into account beyond local timing errors?
Table 4 reports extremely low alignment delay values (e.g., 0.07 ms for OLTWArzt). However, the performance MIDI files from ASAP only have 3 ms resolution. Can the authors clarify how latency was measured and whether reporting sub-millisecond values is meaningful under these resolution limits?
While the focus is real-time alignment, offline score following remains important for batch evaluation or annotation purposes. Does the current framework support an offline evaluation mode using non-causal alignment (e.g., full-sequence DTW)? If not, could it be extended for that purpose?
Q2 (I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
This paper provides a valuable contribution to real-time score-following alignment evaluation. By providing a unified framework, it could benefit the scientific community in the field. The paper also states that the source code will be released upon acceptance, which would be critical for wide adoption of the tool.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
The paper presents an open-source framework for score following evaluation.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Disagree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Weak accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
The paper proposes a new framework for score-following evaluation. This tool could be of great use to the scientific community. A well-maintained open-source tool could also evolve for wider use, integrating new metrics and model benchmarks as new research comes to the field. While the article is well-written and proposes a novel evaluation framework, there are some areas for improvement:
- The experimental analysis is clear and covers a decent number of baselines. However, there is no clear explanation of the errors seen in models like the HMM. The experimental analysis would benefit from more examples of error analysis, which would help explain why some methods struggle more than others.
- In scores, there is often musical ornamentation, such as trills. Trills are a good example of ornaments that add notes to the musical performance that do not appear in the score. This area needs to be discussed, as it could improve the error handling of the proposed metrics.
- Does the framework support other instruments and polyphonic performances? It sounds like the tool applies to other instruments as well. If that is the case, the experimental analysis would benefit from a discussion of other instruments.
- Missing citations in lines 102 and 107.