P6-11: Human vs. Machine: Comparing Selection Strategies in Active Learning for Optical Music Recognition
Juan Pedro Martinez-Esteso, Alejandro Galan-Cuenca, Carlos Pérez-Sancho, Francisco J. Castellanos, Antonio Javier Gallego
Subjects: Optical music recognition ; Human-computer interaction ; Pattern matching and detection ; Open Review ; Knowledge-driven approaches to MIR ; Human-centered MIR ; MIR tasks ; Machine learning/artificial intelligence for music
Presented In-person
4-minute short-format presentation
Optical Music Recognition (OMR) systems rely on accurate layout analysis (LA) to segment different information layers in music score images. While deep learning approaches have improved performance, they remain heavily dependent on large amounts of annotated data. In this work, we propose the integration of a Few-Shot Learning (FSL) architecture into an active learning framework for LA. This enables interactive and iterative training, allowing the model to progressively improve from minimal annotated data. We evaluate how this approach enhances recognition accuracy and reduces annotation effort, and we study the impact of different sample selection criteria within this framework, comparing data selected by five expert annotators against four automated strategies: random, sequential, ink density-based, and entropy-based. Experiments across three diverse music score datasets show that entropy-based selection consistently outperforms human choices, achieving an F1-score of 81.1% with only 8 labeled patches, while humans required at least 16 to reach similar performance. Our method improves over existing FSL approaches by up to 21.6% and substantially reduces annotation time. These results suggest that automated strategies can offer more efficient alternatives to human selection in OMR annotation workflows.
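For orientation, the iterative scheme the abstract describes reduces to a standard pool-based active-learning loop. The following minimal Python sketch is illustrative only: the names (model, oracle, select) are hypothetical stand-ins for the paper's FSL model and selection strategies, not the authors' implementation.

```python
def active_learning_loop(model, labeled, pool, oracle, select, rounds, k):
    """Pool-based active learning (illustrative sketch, hypothetical names).

    labeled: list of (patch, mask) pairs; pool: list of unlabeled patches;
    oracle: callable returning the annotation for a patch;
    select: strategy returning indices of the k patches to query.
    """
    for _ in range(rounds):
        model.fit(labeled)                 # (re)train on the current labels
        idx = set(select(model, pool, k))  # e.g. entropy-based ranking
        labeled.extend((pool[i], oracle(pool[i])) for i in idx)
        pool = [p for i, p in enumerate(pool) if i not in idx]
    return model
```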
Q2 ( I am an expert on the topic of the paper.)
Disagree
Q3 ( The title and abstract reflect the content of the paper.)
Strongly agree
Q4 (The paper discusses, cites and compares with all relevant related work.)
Disagree
Q5 ( Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))
The reference for "entropy-based" selection is too loose; citing only a general book on information theory is insufficient.
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Disagree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
The experiments comparing different selection methods are informative for future OMR research, as are the baseline measurements of human annotation time.
Q16 ( Write ONE line (in your own words) with the main take-home message from the paper.)
The choice of sample selection method in active learning matters for layout analysis in OMR.
Q17 (This paper is of award-winning quality.)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Disagree
Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)
Weak accept
Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
The paper is generally well written and well structured; everything from the motivation to the organization of the experiments is clear. Provided that certain details are clarified, the paper offers informative experimental results on how different selection methods in active learning affect a few-shot learning training scheme for layout analysis in OMR.
Details that need to be provided and/or clarified:
- How does the entropy-based selection method work? How is the entropy calculated? Either provide a more specific/precise reference or give the technical definitions in the paper (a sketch of the kind of definition I have in mind follows below).
- The instructions/criteria given to the human annotators.
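For reference, a minimal sketch of what such a definition could look like, assuming the model produces per-pixel class probabilities; the function names and the mean-entropy patch scoring are my assumptions, not necessarily the authors' implementation:

```python
import numpy as np

def patch_entropy(probs: np.ndarray, eps: float = 1e-12) -> float:
    """Mean per-pixel Shannon entropy of a patch.

    probs: (H, W, C) per-pixel class probabilities, assumed to sum to 1
    over the C channels.
    """
    p = np.clip(probs, eps, 1.0)
    return float(-(p * np.log(p)).sum(axis=-1).mean())

def select_most_uncertain(patch_probs: list, k: int) -> list:
    """Indices of the k patches whose predictions are most uncertain."""
    scores = [patch_entropy(p) for p in patch_probs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
```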
Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)
Accept
Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))
Main revisions suggested by the reviewers:
- Clarify how the entropy is calculated for the entropy-based method.
- Make clear what information or criteria the human annotators used when selecting patches. For instance, did they have access to the current model's segmentation results to identify areas with errors, or were they instructed to select diverse data? Understanding the human baseline is important.
Please also address the comments by all reviewers.
Q2 ( I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q5 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))
n/a
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
A clever automated strategy can even outperform human selection.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
Few-shot learning + active learning + a well-chosen selection method can substantially improve the accuracy of OMR.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Disagree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Weak accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
This paper proposes the adaptation of a Few-Shot Learning (FSL) architecture to an active learning setting for layout analysis (LA). Specifically, the work explores several patch selection methods, demonstrating that some automated approaches can outperform human selection as iterations progress.
The base model builds on the few-shot learning framework previously proposed by Castellanos et al. (ISMIR 2023). The authors enhance this model through the use of active learning strategies and achieve up to a 21.6% performance improvement using different patch selection techniques. One of the key contributions of this paper is the comparative analysis between human-driven and automated sample selection strategies, showing that a well-designed selection method can significantly impact the overall performance of the OMR (Optical Music Recognition) pipeline.
The paper is clearly written, and the experiments are well explained. However, I recommend that the authors provide more detailed explanations of the patch selection strategies, especially the entropy-based method, which ultimately achieved the best results. Additionally, the paper specifies a fixed patch size of 256×256 pixels. It would be helpful to include a discussion about how different patch sizes or shapes might affect the performance or selection quality.
Although this work does not introduce a novel architecture or fundamentally new topic, it is significant in demonstrating how the integration of active learning strategies and intelligent sample selection can enhance the effectiveness of OMR systems. I believe the insights and results presented in this paper are valuable and merit presentation at the conference.
Q2 ( I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Strongly agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Strongly agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Agree (Novel topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
The paper addresses annotation scarcity well and proposes an approach that differs from previously introduced models.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
We don't need (so many) annotators anymore :)
Q17 (Would you recommend this paper for an award?)
No
Q18 ( If yes, please explain why it should be awarded.)
/
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Weak accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
The paper is well-motivated, addressing a realistic problem in OMR: annotation scarcity. It is clearly written and easy to follow.
There are two main contributions: the proposed adapted approach, and the experimental setup comparing human vs. algorithmic annotators, including the tracked time spent. In addition, the finding that entropy-based selection surpasses human annotation in both efficiency and performance is somewhat surprising, but it is supported by the provided data. From a more philosophical (or even economic) perspective, the paper opens a significant question about systems replacing human annotators, although several challenges remain before this goal is achieved in the form of an automated annotation pipeline. Nevertheless, the proposed approach is a sufficiently significant contribution for the paper to be considered for publication.
Q2 ( I am an expert on the topic of the paper.)
Disagree
Q3 (The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Strongly agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Agree (Novel topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Disagree
Q15 (Please explain your assessment of reusable insights in the paper.)
Applying active learning to optical music recognition, the paper shows that automated selection strategies are more efficient than manual human selection, making better use of a limited annotation budget. This may be the case for other tasks as well.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
The paper investigates different active learning query strategies in the context of deep learning-based layout analysis, observing that entropy-based selection achieves the best results when the annotation budget is limited.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Weak accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
The paper applies active learning for deep learning-based layout analysis. Starting with little labeled data, an initial model is trained. Afterwards, further annotations can be requested from an "oracle", where the regions to be annotated are either selected by a human or using automated strategies. The model is then re-trained with the extended labeled data and this process is repeated. The central question addressed in this paper is whether the data to be annotated should be selected by humans or using automated strategies.
The paper is well-written and generally easy to follow. However, a few details regarding the selection process remain unclear to me:
1) Entropy-based selection: As far as I understand, the model outputs four values between 0 and 1 for each pixel, corresponding to the four considered layers. Is the entropy calculated for each layer separately, or is it calculated after jointly normalizing the four values? (A sketch of the two readings follows after this list.)
2) Human selection: Based on which information or criteria do the human annotators select patches to be annotated? Do they have access to the segmentation of the current model so that they can specifically select a region with many errors? Are they instructed to select a diverse set of labeled data? It would be important for this to be made clear.
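For concreteness, a minimal sketch of the two possible readings, assuming the four per-pixel outputs are independent sigmoid activations; the function names are mine, not the authors':

```python
import numpy as np

def per_layer_entropy(p: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Binary entropy per pixel and layer, treating each of the 4 outputs
    as an independent Bernoulli probability. p has shape (H, W, 4)."""
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))  # shape (H, W, 4)

def joint_entropy(p: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Entropy after jointly normalizing the 4 values into a single
    distribution per pixel. Returns shape (H, W)."""
    q = p / np.clip(p.sum(axis=-1, keepdims=True), eps, None)
    q = np.clip(q, eps, 1.0)
    return -(q * np.log(q)).sum(axis=-1)
```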
Some minor issues:
- l. 114: The introduction of sigma seems unnecessary, as the performance-based stopping criterion is not used in this paper.
- l. 131ff: It is unclear how sequential selection works; is the goal to distribute the patches uniformly across all images? How are patches selected within an image? While this can be read up on in the given reference, the paper would benefit from a bit more detail.
- l. 159ff: It sounds as if the patch order were predefined for entropy-based selection.
- l. 316ff: I would suggest indicating annotation time in person-hours, which would allow for a better understanding of the annotation effort.
Overall, I suggest to accept the paper, as it appears to be one of the first applications of active learning to optical music recognition, showing the potential of this research direction.