P5-13: Sheet Music Benchmark: Standardized Optical Music Recognition Evaluation
Juan C. Martinez-Sevilla, Joan Cerveto-Serrano, Noelia Luna-Barahona, Greg Chapman, Craig Sapp, David Rizo, Jorge Calvo-Zaragoza
Subjects: Evaluation methodology ; Machine learning/artificial intelligence for music ; Music transcription and annotation ; Evaluation, datasets, and reproducibility ; Evaluation metrics ; Open Review ; Optical music recognition ; MIR tasks ; Knowledge-driven approaches to MIR ; Novel datasets and use cases
Presented In-person
4-minute short-format presentation
In this work, we introduce the Sheet Music Benchmark (SMB), a dataset of six hundred and eighty-five pages specifically designed to benchmark Optical Music Recognition (OMR) research. SMB encompasses a diverse array of musical textures, including monophony, pianoform, quartet, and others, all encoded in Common Western Modern Notation using the Humdrum **kern format. Alongside SMB, we introduce the OMR Normalized Edit Distance (OMR-NED), a new metric tailored explicitly for evaluating OMR performance. OMR-NED builds upon the widely-used Symbol Error Rate (SER), offering a fine-grained and detailed error analysis that covers individual musical elements such as note heads, beams, pitches, accidentals, and other critical notation features. The resulting numeric score provided by OMR-NED facilitates clear comparisons, enabling researchers and end-users alike to identify optimal OMR approaches. Our work thus addresses a long-standing gap in OMR evaluation, and we support our contributions with baseline experiments using standardized SMB dataset splits for training and assessing state-of-the-art methods.
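For intuition only, a minimal sketch of the kind of fine-grained, per-category error tally the abstract describes. This is not the paper's OMR-NED implementation; the token format and the category rules are illustrative assumptions:

from collections import Counter
from difflib import SequenceMatcher

def category(token):
    # Assign an illustrative category to a hypothetical music token.
    if token.startswith("pitch:"):
        return "pitch"
    if token.startswith("acc:"):
        return "accidental"
    if token.startswith("beam:"):
        return "beam"
    return "other"

def per_category_errors(ref, hyp):
    # Tally edit operations per symbol category from a token alignment.
    errors = Counter()
    for op, i1, i2, j1, j2 in SequenceMatcher(None, ref, hyp).get_opcodes():
        if op in ("replace", "delete"):
            errors.update(category(t) for t in ref[i1:i2])
        if op in ("replace", "insert"):
            errors.update(category(t) for t in hyp[j1:j2])
    return errors

ref = ["pitch:c4", "acc:#", "beam:start", "pitch:d4", "beam:end"]
hyp = ["pitch:c4", "beam:start", "pitch:e4", "beam:end"]
print(per_category_errors(ref, hyp))
# Counter({'pitch': 2, 'accidental': 1}): the substituted pitch is
# counted on both sides of the alignment, the missing accidental once.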
Q2 ( I am an expert on the topic of the paper.)
Strongly agree
Q3 ( The title and abstract reflect the content of the paper.)
Strongly agree
Q4 (The paper discusses, cites and compares with all relevant related work.)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Strongly agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Agree (Novel topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)
The paper introduces a high-quality benchmark dataset (SMB) and a detailed evaluation metric (OMR-NED) that together offer standardized, fine-grained assessment of modern OMR systems.
Q17 (This paper is of award-winning quality.)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)
Strong accept
Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
This paper fills an important gap in OMR, offering a much-needed benchmark dataset (SMB) and a novel evaluation metric (OMR-NED) that provides fine-grained error analysis across musical symbol categories. The construction and annotation of the dataset are rigorous, and the metric builds meaningfully on existing tools like MusicDiff. The baseline experiments, while not yielding high accuracy, are appropriate and indicative of the dataset's complexity.
Strengths:
- Timely and highly relevant contributions to the MIR/OMR community
- Solid technical methodology for both dataset and metric
- Clear writing and thoughtful organization
Suggestions for improvement:
- Consider adding a summary table comparing SMB to previous datasets (e.g., size, scope, textures, and coverage).
- Provide visual examples or schema for OMR-NED categories and scoring to aid understanding.
- Ensure that links to dataset/code are well-documented and accessible post-acceptance.
Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)
Strong accept
Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))
This paper introduces a significant and much-needed contribution to the Optical Music Recognition (OMR) research community through the development of a standardized benchmark dataset (SMB) and a novel evaluation metric (OMR-NED). The work is timely, addressing a longstanding gap in reproducible, fine-grained evaluation for OMR systems, and has clear potential to become a foundational reference for future research in the field.
Across the board, reviewers acknowledge the technical soundness, relevance, and clarity of the work. The dataset construction and annotation processes are described as thorough, and the OMR-NED metric is appreciated for its granularity and thoughtful design.
Some reviewers expressed concerns about details that could be improved in the camera-ready version, including:
- Clarifying the licensing terms of the dataset.
- Providing per-category evaluation scores to highlight the variability in task difficulty.
- Offering more clarity on the tokenization process and how OMR-NED treats substitutions.
- Expanding the explanation of differences between SER and OMR-NED and motivating the design choices in the latter.
- Verifying the accuracy of labels in the dataset (e.g., use of “monophonic”).
These are constructive and actionable suggestions, none of which undermine the overall scholarly contribution of the work. Rather, they indicate areas for refinement that will enhance the impact and usability of the benchmark.
Importantly, the paper meets ISMIR’s criteria for reproducibility and openness, and it presents reusable infrastructure—dataset, metric, and tools—that will serve the community for years to come. It is rare to see a benchmark paper that is both technically solid and practically impactful in the way this one is.
The paper represents an exemplary effort in infrastructure building for MIR, especially in a field where robust benchmarks have been lacking. It is likely to catalyze future work, foster comparability between methods, and stimulate further discourse on evaluation methodologies in OMR.
Q2 ( I am an expert on the topic of the paper.)
Disagree
Q3 (The title and abstract reflect the content of the paper.)
Strongly agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Strongly agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Strongly agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Strongly agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Disagree
Q15 (Please explain your assessment of reusable insights in the paper.)
It's a standard dataset paper, with well-written descriptions and statistics of the dataset followed by the suggested evaluation metrics and a baseline model.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
We now have a diverse dataset of sheet music images and annotations for optical music recognition evaluation, ready for real-world applications.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Strong accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
The paper introduces a new, large, and diverse benchmark for the optical music recognition (OMR) task, with detailed dataset construction, metric definition, and baseline results. The key contributions are:
- The dataset is large and diverse, and the paper provides a thorough statistical analysis.
- The annotation and post-processing pipeline is well described, helping users understand the characteristics and limitations of the dataset.
- A newly proposed standard metric (OMR-NED) for the benchmark and improved tools for computing it, with discussion of the pros and cons of each.
Minor limitations that could be addressed for camera-ready:
- The paper doesn't seem to clarify the license for the data, other than a mention of public-domain uploads.
- In addition to the dataset, a reference implementation of OMR-NED and SER could facilitate easier and more reproducible comparisons on the benchmark.
- While having a competitive baseline is not the primary goal of a dataset paper, and the low performance is expected without a vision model as the authors note, OMR-NED often being over 90% is somewhat discouraging. Using a pretrained encoder (one of the publicly available vision models, or pre-training on GrandStaff) would have produced more competitive baseline metrics, which would help readers understand how difficult the benchmark is and how soon it might become saturated and no longer useful.
- Per-category baseline metrics would also have been good to include, essentially a version of Table 3 for the models/data in Table 2. This would indicate the expected difficulty of each category and encourage researchers using this benchmark to report per-category details.
Q2 ( I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
The paper introduces a benchmark for evaluating OMR systems.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
The paper introduces a benchmark for evaluating OMR systems.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Weak accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
The paper introduces a dataset and an evaluation measure to benchmark OMR systems. As such, it is a valuable contribution to the community, as there is a lack of annotated evaluation datasets in OMR. The description of the dataset is suitable; however, the descriptions of the proposed evaluation metric and the baseline experiments lack detail. To improve the paper, I suggest the authors:
- Describe the limitations of the chosen **kern format: it is not a fully-fledged music notation format such as MusicXML, so you should describe what it can and cannot represent.
- Provide more details on how the proposed evaluation metric is calculated, as it is not completely clear.
- The baseline system's error rates are so high that I wonder whether there is any use in including the results: an error rate of over 90% means that almost all symbols are wrong, so is the output almost random? If so, there is little point in including this evaluation; if not, you should make this clearer in the paper.
- Elaborate more deeply on the differences between SER and the proposed measure, since they can differ by over 50% (a toy illustration of one possible source of such gaps follows below).
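A toy illustration of that last point, assuming SER is a plain symbol-level Levenshtein distance normalized by reference length (the paper's exact definitions are not restated here). Whether a substitution counts as one error or as an insertion plus a deletion, and what the distance is normalized by, can by itself move the score substantially:

def edit_distance(ref, hyp, sub_cost=1):
    # Levenshtein distance with a configurable substitution cost.
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i
    for j in range(1, n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else sub_cost
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # match / substitution
    return dp[m][n]

ref = ["4c", "4d", "4e", "4f"]         # hypothetical reference tokens
hyp = ["4c", "4d#", "4e", "4g", "4a"]  # hypothetical system output

ser = edit_distance(ref, hyp) / len(ref)                              # 3/4 = 0.75
two_op = edit_distance(ref, hyp, sub_cost=2) / (len(ref) + len(hyp))  # 5/9 = 0.56
print(f"SER-style: {ser:.2f}  substitution-as-two-ops: {two_op:.2f}")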
Q2 ( I am an expert on the topic of the paper.)
Strongly agree
Q3 (The title and abstract reflect the content of the paper.)
Strongly agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Strongly agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Disagree
Q15 (Please explain your assessment of reusable insights in the paper.)
The paper proposes a reusable dataset and evaluation metric, but they are not particularly insightful.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
A new dataset for OMR of scanned music taken from KernScores and labelled, along with an evaluation metric that uses MusicDiff for fine-grained comparison.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Disagree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Weak accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
The proposed dataset and evaluation metric are a meaningful contribution to OMR research. Overall the paper is written clearly, but some sections are lacking, and the experiments are not entirely convincing.
Strengths:
* New dataset for OMR of scanned classical Western music (685 pages)
* Standardization of kern notation for consistent tokenization
* Proposed OMR-NED evaluation gives information on type of error
Weaknesses:
* Discrepancy between the experiments and the stated dataset purpose: due to the small size of the dataset, training from scratch gave bad results. The authors could have fine-tuned a model trained on more data, or used the dataset as a benchmark for existing models, which is what they proposed this dataset for in the first place.
* Not clear whether the authors release all their code (Humdrum standardization, MusicDiff modifications, evaluation metric calculation). Just making sure.
* In OMR-NED, does it really make sense to treat a substitution as insertion + deletion, counting it as two errors?
* The tokenization scheme is not stated (is every full 'word' a token? see the sketch below), yet dataset statistics are reported on tokens rather than e.g. notes.
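To illustrate why the tokenization scheme matters for the reported statistics, a hypothetical splitter for **kern note tokens follows; the regex and the component names are illustrative assumptions, not the paper's scheme:

import re

# Hypothetical decomposition of a **kern note token into sub-symbols:
# duration (digits plus dots), pitch letters, accidentals (#, -, n),
# and any remaining markings (beams L/J, stems / and \, etc.).
KERN_NOTE = re.compile(
    r"(?P<duration>\d+\.*)"
    r"(?P<pitch>[a-gA-G]+)"
    r"(?P<accidental>[#\-n]*)"
    r"(?P<other>.*)"
)

def subtokens(token):
    # Split one **kern 'word' into sub-symbol tokens.
    m = KERN_NOTE.match(token)
    if m is None:
        return [token]  # rests, barlines, etc. are kept whole here
    return [part for part in m.groups() if part]

words = ["8.cc#/L", "16d-\\J", "4E"]  # 3 whole-word tokens...
print([subtokens(w) for w in words])  # ...but 10 sub-symbol tokens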
Additional comments:
* Table 2: Interestingly, OMR-NED and SER give a different ordering of the rows. For example, looking at the ekern results only, Monophony got the worst results under OMR-NED but placed second under SER. How do the authors explain this discrepancy? Can they somehow show that OMR-NED, which counts substitutions differently, matches human judgement better than SER?
* Table 2: How come Monophony (which is supposedly the easiest task) got worse scores than Pianoform, Quartet, and Other?
* I downloaded the dataset and noticed that many scores labeled as Monophonic (in the mono_scores folder) are in fact not monophonic (they have chords/voices). Did the authors (mis-)use the term monophonic throughout the paper to mean 'single staff'?
* Figure 4: The lyrics seem made up and don't match the music. This is not a very good illustration of OMR-NED.
* The authors mention dataset splits in the abstract, but where are they? I see only a 'train' split.
* Line 211: 'particellas' is not a standard term in English; the term is 'parts'.