P7-7: Quantize & Factorize: A fast yet effective unsupervised audio representation without deep learning
Jaehun Kim, Matthew C. McCallum, Andreas F. Ehmann
Subjects: Open Review; Music signal processing; Knowledge-driven approaches to MIR; Musical features and properties; Automatic classification; MIR fundamentals and methodology; MIR tasks; Representations of music; Machine learning/artificial intelligence for music
Presented In-person
4-minute short-format presentation
Foundation models have become increasingly prevalent in tackling Music Information Retrieval (MIR) tasks. Although they can be a powerful tool for understanding music, the computation required for the training and inference of these models continues to grow as they become more complex. Specialized acceleration, such as Graphics Processing Units (GPUs), has become necessary for operating these models, as they are mostly based on large Deep Learning (DL) architectures. Furthermore, it is difficult for users to interpret them due to their black-box nature. In this work, we propose Quantizers and Factorizers for Music embeddings (QFM), a fast, unsupervised audio representation for music understanding backed by a wide range of rich MIR features and efficient feature learners. Experimental results show that QFM models perform within the range of results achieved by recent open-source DL models on all evaluated tasks, with competitive results on a subset. This is surprising given the significantly smaller computational requirements of QFM models for training and inference.
Q2 (I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Strongly agree
Q4 (The paper discusses, cites and compares with all relevant related work.)
Strongly agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Strongly agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
No
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated “Strongly Agree” and “Agree” can be highlighted, but please do not penalize papers rated “Disagree” or “Strongly Disagree”. Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
By showing that multiple downstream tasks can work well on a set of selected MIR features, the paper gives insights into which features could be selected for MIR tasks. It can also inspire ideas on how to make existing DL-based representation learning methods more efficient.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
A music representation based on MIR features (instead of deep learning) can provide competitive downstream task performance, while offering faster training and inference.
Q17 (This paper is of award-winning quality.)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation (to be completed before the discussion phase): Please first evaluate before the discussion phase. Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines.)
Weak accept
Q21 (Main review and comments for the authors (to be completed before the discussion phase). Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
This paper proposes an unsupervised music representation for which training and inference are faster than for current DL-based methods, while offering similar performance. It also claims to provide a generic architecture in which feature engineering and ML can be used side by side in future MIR work.
The paper is well written, the proposed system is sensible and reasonably novel (parts of the system existed before, but to my knowledge they have not been used in this context), and the experiments are conducted rigorously, largely supporting the paper's claims described above. The efficiency claims would be better supported by an analysis of training and inference cost (in real-world currency, or FLOPs). Also, GPU acceleration is not used, and the benefit of the proposed system depends on how costly GPU usage is compared to the authors' CPU setup: GPU acceleration would substantially reduce training and inference time (and perhaps cost) for the approaches the paper compares against, but likely not for the proposed system.
As a more minor point, the paper also suggests that non-DL representations like the proposed one could be more interpretable, but unfortunately it does not include any analysis of this.
Minor remarks:
- Some links in the references are broken, and some entries are incomplete (e.g., missing authors).
- Figure 1 could be interpreted as factorization being applied to each audio feature in each chunk independently, but in Section 3.2 WMF is applied to the whole dataset; this should be clarified.
- L33: "handful of works" implies citing more than one paper.
- L164: I assume the audio features' mean and standard deviation is meant, not the audio chunk waveform itself?
Q22 (Final recommendation (to be completed after the discussion phase) Please give a final recommendation after the discussion phase. In the final recommendation, please do not simply average the scores of the reviewers. Note that the number of recommendation options for reviewers is different from the number of options here. We encourage you to take a stand, and preferably avoid “weak accepts” or “weak rejects” if possible.)
Weak accept
Q23 (Meta-review and final comments for authors (to be completed after the discussion phase))
Summary of Reviews and Discussion
This paper proposes an unsupervised music representation learning pipeline that combines classical MIR features with quantization and matrix factorization, offering a computationally efficient alternative to deep learning-based models. Reviewers generally found the approach well-motivated, the writing clear, and the empirical results compelling, particularly given the simplicity of the method.
However, all reviewers noted a lack of methodological detail, particularly in the description of the Quantization-Factorization module. Clarifications on hyperparameters, model architecture, and mathematical formulation are needed, and several figures need to be clarified.
All reviewers recommended "weak accept".
Final Recommendation
I recommend accepting this paper to ISMIR. Despite the noted lack of clarity, the paper offers a timely and thoughtful contribution to the ISMIR community. Its practical relevance, solid experimental results, and potential to broaden the conversation around music representation learning justify its inclusion. The authors are encouraged to enhance methodological clarity in the camera-ready version.
Q2 (I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Disagree
Q5 (Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a"))
There are other DL baselines that are easy to compare with but are not included in the paper. See [1].
[1] Yuan, R., Ma, Y., Li, Y., Zhang, G., Chen, X., Yin, H., ... & Fu, J. (2023). Marble: Music audio representation benchmark for universal evaluation. Advances in Neural Information Processing Systems, 36, 39626-39647.
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Disagree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Disagree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Disagree (Standard topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Strongly agree
Q15 (Please explain your assessment of reusable insights in the paper.)
The non-DL-based music representation learning method may help future work on downstream tasks and foundation models. It may even help DL-based methods by providing insights into the pretraining task.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
This paper provides an unsupervised audio representation learning method based on a combined pipeline of a quantization and a factorization module.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Weak accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
This paper aims to improve music foundation models without the use of deep neural networks, and shows improvements over some deep-learning-based methods like CLMR. Foundation models without deep learning are a highly under-researched topic. Even though the results are not state-of-the-art, they are still very impressive and provide helpful insights to the community.
The model choice (quantization + factorization) is also reasonable, since it retains compact sequential information, although I do wish the authors could explain the design choices more; see comment #5.
My main concern is the description of the method. The authors skip all details of the QF model, leaving only citations, which makes this paper very hard to understand even with a traditional machine learning background. The paper could describe all methods in detail, potentially with formulas, to reduce confusion (currently there are none), and shrink the length of the experiments (Figs. 2 and 3 occupy unnecessarily much space).
Other comments:
- Fig. 1: What is a ZScore? Also, KMeans should read "KMeans/GMM," and the QF_total node should connect to all feature_{1...N} instead of to each local QF block.
- In Sec. 3.2: there are lots of important hyperparameters in the NGram and WMF models, but they are never described in the paper.
- Line 164: The description of G1 says it calculates "...each audio chunk's mean and standard deviation." I assume the authors mean the features of each audio chunk, not the raw audio content.
- How did you train the model? What are the training hyperparameters? I assume that the KMeans/GMM, NGram, and WMF components all require training and are trained sequentially using different algorithms; more clarification is needed (a sketch of the presumed sequence is given after this list).
- It would be better if the authors explained the model choices in the introduction: (1) Why quantization? (2) Why NGram+WMF?
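To make the training question concrete, here is a minimal sketch of the sequential fitting presumed above. Everything in it (feature dimensionality, codebook size, the choice of KMeans over GMM) is an illustrative assumption, not the paper's actual configuration:

    # Minimal sketch of the presumed sequential training; all names, sizes
    # and the KMeans-over-GMM choice are illustrative assumptions.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    frames = rng.standard_normal((5000, 12))   # stand-in frame-level features

    # Stage 1: fit the quantizer (the paper reportedly allows KMeans or GMM).
    quantizer = KMeans(n_clusters=64, n_init=4, random_state=0).fit(frames)

    # Stage 2: discretize and accumulate n-gram (here: unigram) statistics.
    codes = quantizer.predict(frames)
    counts = np.bincount(codes, minlength=64)

    # Stage 3: a weighted matrix factorizer would then be fit on the
    # (chunks x codes) count matrix; its solver and hyperparameters are
    # exactly what the paper leaves unspecified.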
Overall, given the results, I think the paper is a clear accept, but more refinement is required.
Q2 (I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Strongly agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Strongly agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Strongly agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Agree (Novel topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
This paper shows that shallow feature learners are still beneficial for several downstream MIR tasks. Novel ways for leveraging such shallow feature learners could be a very interesting direction of research.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
Shallow feature learners combined with clever feature aggregation, manipulation and filtering are still beneficial for several downstream MIR tasks.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Weak accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
This paper presents a method to leverage shallow feature learners (read: classical MIR features) to compete with large foundation models. The authors present a Quantization-Factorization module which operates on groups of feature sets to produce "embeddings" for music in an unsupervised manner. The resulting embeddings can be used across several downstream MIR tasks, achieving comparable (though slightly worse) metrics compared to significantly larger and more computationally expensive foundation models for music.
Strengths:
- Well-motivated and clearly written.
- Shows that there is still room to leverage shallow feature learners in the context of MIR.
- Evaluation and ablation studies are well presented.
Weaknesses: The main drawback of the paper is that the method section needs to be described a bit better; there are some technical aspects that are not clear to the reader.
- The temporal resolution of the feature vectors seems inconsistent. It would seem that for every 9-second chunk, each feature set has a frame rate of ~43 Hz (22050 / 512), i.e., a ~23 ms resolution. However, that doesn't seem to be the case for the Patches feature set (which is randomly sampled across the entire 9-second chunk).
- The process of converting the codes to the unigram matrix, as well as the WMF step, could be explained a bit better. The authors should consider adding some mathematical notation to help the reader better understand the different steps within the QF module, along with how the dimensionality of the computed features / final embeddings progresses through the different sub-modules (starting from an audio chunk of, say, length N).
- Since WMF is such a core part of the proposed method, the authors should provide some background and description of the method; simply referring to prior work is not sufficient in this case (a sketch of a standard WMF objective is given after this list, for reference).
- If space is a concern, the experimental setup (downstream datasets and metrics) can be compressed by moving the details to an appendix / supplementary material.
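For reference, and purely as an assumption about what "WMF" denotes here (the paper is said to cite it without description), a standard weighted matrix factorization objective over an M x K unigram-count matrix X is

    \min_{U,V} \sum_{i=1}^{M} \sum_{j=1}^{K} w_{ij} \,\bigl(x_{ij} - \mathbf{u}_i^{\top}\mathbf{v}_j\bigr)^2 \;+\; \lambda\,\bigl(\lVert U \rVert_F^2 + \lVert V \rVert_F^2\bigr),

where u_i in R^d is the learned embedding of chunk i, v_j that of code j, and the weights w_{ij} typically up-weight observed (nonzero) counts; the problem is commonly solved with alternating least squares. The paper's exact weighting scheme may of course differ, which is precisely the background the authors should supply.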
There are also a couple of questions related to the experiments that the authors should clarify:
- It is not clear why the final PCA and QF_total modules were excluded from the ablation experiments in Section 5.3.1.
- One thing that seems to be missing in the ablation study is the influence of the G1 modules. While the authors report the metrics using just the G1 module (as a baseline), it would be really interesting to report the results of the QFM models with the G1 modules removed.
Other minor comments:
- Line 127: Typo: "a quantization" -> either "a quantization module" or "the quantization module"
- Line 164: Instead of "audio chunk" it should be "feature sets'"
- Line 210: Should be (micro, nano)
- Line 243-244: It is a little weird for the default chunk time to be set based on the default setup of the Tempogram features. Isn't that configurable?
- Line 295-296: 60 (and 200) vectors are sampled from all the vectors in a chunk. It might be useful to add the total number from which these are sampled
- Line 433: "when computation requirements dictate, a primary embedding": it is not clear what "primary embedding" means here.
- There seems to be a missing reference to the footnote below Line 474
Q2 (I am an expert on the topic of the paper.)
Agree
Q3 (The title and abstract reflect the content of the paper.)
Agree
Q4 (The paper discusses, cites and compares with all relevant related work)
Agree
Q6 (Readability and paper organization: The writing and language are clear and structured in a logical manner.)
Agree
Q7 (The paper adheres to ISMIR 2025 submission guidelines (uses the ISMIR 2025 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments.)
Yes
Q8 (Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains.)
Agree
Q9 (Scholarly/scientific quality: The content is scientifically correct.)
Agree
Q11 (Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences.)
Agree
Q12 (The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used.)
Agree
Q13 (Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.)
Agree (Novel topic, task, or application)
Q14 (Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.)
Agree
Q15 (Please explain your assessment of reusable insights in the paper.)
In this work the authors explore alternatives to deep learning for learning music representations. Even if the novelty of the paper is limited, I think the results of this investigation might be useful on one side in re-assessing the performance of music feature fusion, and on the other in reflecting on the benchmarks used by recent state-of-the-art deep-learning foundation models.
Q16 (Write ONE line (in your own words) with the main take-home message from the paper.)
Quantisation and factorisation of low-level music features is a possible alternative to deep-learning foundation models when evaluated on standard benchmarks.
Q17 (Would you recommend this paper for an award?)
No
Q19 (Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community.)
Strongly agree
Q20 (Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines)
Weak accept
Q21 (Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.)
In this paper, the authors compare recent deep-learning foundation models with a fusion method for classic music features based on quantisation and factorisation. A set of features are first separately quantised using KMeans and then factorised using Weighted Matrix Factorisation. The resulting embeddings are then concatenated with the mean and standard deviation of the original features. Finally, feature fusion is achieved by concatenating the single embeddings and applying PCA. The model is trained on the FMA and Million Song Dataset and evaluated on standard downstream tasks through probing. The authors show that for some downstream tasks, this approach produces results comparable to much more complex deep-learning-based foundation models.

The paper is well-written and easy to follow. The proposed method is described in detail and the experimental setup is technically correct. However, I would recommend that the authors improve the readability of Figures 2 and 3, or consider reporting the results in two tables. Strictly speaking, the paper's novelty is somewhat limited, but I believe it could stimulate discussion in the community. Generally, I think the authors missed an opportunity to reflect on why such a simple approach (and sometimes even the considered baseline) can compete with much more complex deep-learning foundation models, especially for some tasks. Such an analysis might highlight potential weaknesses of the popular probing-based evaluation or of the considered datasets, and lead to potential improvements in the evaluation of music foundation models in general.
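To make the pipeline summarized above concrete, here is a minimal, self-contained sketch. Every name, size, and hyperparameter is an illustrative assumption rather than the paper's configuration, and the plain ALS solver is one common way to fit WMF, not necessarily the authors':

    # Hypothetical end-to-end sketch of the reviewed pipeline: per-feature
    # KMeans quantization -> per-chunk unigram counts -> weighted matrix
    # factorization (plain ALS) -> concat with per-chunk mean/std -> PCA.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)

    def unigram_counts(codes, n_codes):
        """Per-chunk histogram of codebook usage: (n_chunks, n_codes)."""
        return np.stack([np.bincount(c, minlength=n_codes) for c in codes])

    def wmf_als(X, dim=32, alpha=10.0, lam=0.1, iters=10):
        """Weighted matrix factorization via alternating least squares.
        Nonzero cells get weight 1 + alpha, zero cells weight 1."""
        M, K = X.shape
        U = 0.01 * rng.standard_normal((M, dim))
        V = 0.01 * rng.standard_normal((K, dim))
        W = 1.0 + alpha * (X > 0)
        for _ in range(iters):
            # Update U given V, then V given U (both row-wise ridge solves).
            for A, B, T, Wt in ((U, V, X, W), (V, U, X.T, W.T)):
                for i in range(A.shape[0]):
                    Bw = B * Wt[i][:, None]
                    G = Bw.T @ B + lam * np.eye(dim)
                    A[i] = np.linalg.solve(G, Bw.T @ T[i])
        return U  # chunk embeddings

    # Toy data: 200 chunks, each a (frames x dims) matrix of one feature set.
    chunks = [rng.standard_normal((40, 12)) for _ in range(200)]

    km = KMeans(n_clusters=64, n_init=4, random_state=0)
    km.fit(np.concatenate(chunks))                 # codebook over all frames
    codes = [km.predict(c) for c in chunks]        # quantize each chunk
    X = unigram_counts(codes, 64).astype(float)    # (200, 64) count matrix

    emb = wmf_als(X)                               # (200, 32) WMF embeddings
    stats = np.stack([np.r_[c.mean(0), c.std(0)] for c in chunks])  # mean/std
    fused = PCA(n_components=16).fit_transform(np.hstack([emb, stats]))
    print(fused.shape)                             # (200, 16)

In the actual system, each feature set presumably gets its own quantizer-factorizer pair, with the per-feature embeddings concatenated before the final PCA, as described above.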