Exploring State-Space-Model Based Language Model in Music Generation

Wei-Jaw Lee; Fang-Chih Hsieh; Xuanjun Chen; Fang-Duo Tsai; Yi-Hsuan Yang

Abstract:

The recent surge in State Space Models (SSMs), particularly the emergence of Mamba, has established them as strong alternatives or complementary modules to Transformers across diverse domains. In this work, we explore the potential of Mamba-based architectures for music generation. We adopt discrete Residual Vector Quantization (RVQ) tokens as the modeling representation and empirically find that a single-layer codebook suffices to capture the majority of semantic information in music. Motivated by this observation, we focus on modeling a single-codebook representation and adapt SiMBA, originally designed as a Mamba-based encoder, to function as a decoder for sequence modeling. We compare its performance against a standard Transformer-based decoder. Our results suggest that SiMBA achieves faster convergence and generates outputs closer to the ground truth under limited-resource settings, highlighting the promise of SSMs for efficient and expressive music generation. Audio examples are available at https://lonian6.github.io/web-exploring-ssm/.
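
To make the single-codebook idea concrete, the following is a minimal sketch of residual vector quantization, not the codec used in this work. It assumes random, unlearned codebooks with shrinking scale (hypothetical; a real neural codec learns them), and the function name rvq_encode, the shapes, and the layer count are illustrative choices. Each layer quantizes the residual left by the previous one, and the first layer's indices form the one-token-per-frame sequence that a decoder such as SiMBA or a Transformer would model.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize frames x of shape (n, d) with a list of codebooks, each (k, d).

    Returns the token indices per layer, shape (n_layers, n), and the
    reconstruction obtained by summing the selected codewords.
    """
    residual = x.copy()
    recon = np.zeros_like(x)
    tokens = []
    for cb in codebooks:
        # Squared distance from every residual vector to every codeword.
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)      # nearest codeword per frame
        tokens.append(idx)
        recon += cb[idx]                # accumulate this layer's contribution
        residual -= cb[idx]             # the next layer quantizes what is left
    return np.stack(tokens), recon

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))           # 256 frames of 8-dim latents (made up)
# Assumption: random codebooks, 64 entries each, scale halved per layer.
codebooks = [rng.normal(size=(64, 8)) * (0.5 ** i) for i in range(4)]

tokens, recon = rvq_encode(x, codebooks)
first_layer_tokens = tokens[0]          # the single-codebook token sequence
```

Modeling only first_layer_tokens reduces the task to ordinary next-token prediction over one discrete sequence, which is the setting the abstract describes; the deeper layers would only refine the reconstruction.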