HAFM
HAFM: Hierarchical Autoregressive Foundation Model for Music Accompaniment Generation
Authors: Jian Zhu, Jianwei Cui, Shihao Chen, Yubang Zhang, Yunlong Xue, Cheng Luo, Jun Sun.
This repository contains the code and data for HAFM, the Hierarchical Autoregressive Foundation Model.
1. Abstract
We present HAFM, a system that generates instrumental music audio to accompany input vocals. Given an isolated singing voice, HAFM produces a coherent instrumental accompaniment that can be mixed directly with the input to create complete music. We propose three key innovations over prior work: (1) a dual-rate codec tokenization scheme using HuBERT semantic tokens at 50 Hz for vocals and EnCodec acoustic tokens at 75 Hz for instrumentals, enabling time-aligned yet rate-independent modeling; (2) a three-stage hierarchical autoregressive architecture (semantic → coarse acoustic → fine acoustic) with interleaved multi-codebook prediction and classifier-free guidance; and (3) modern Transformer design choices, including QK-norm, GEGLU activations, RMSNorm, and T5-style relative position bias, for improved training stability and sequence generalization. Experiments on MUSDB18 demonstrate that HAFM achieves a Fréchet Audio Distance (FAD) of 2.08 on isolated vocal inputs, outperforming retrieval baselines and matching prior state-of-the-art systems with fewer parameters. The source code is available at https://github.com/HackerHyper/HAFM.
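To make the dual-rate scheme concrete, the sketch below works through the frame arithmetic implied by the two token rates. Only the 50 Hz and 75 Hz figures come from the abstract; the function names and the overlap-based alignment rule are assumptions made for illustration and are not taken from the HAFM code.

```python
import math
from fractions import Fraction
from typing import Tuple

SEMANTIC_RATE_HZ = 50  # HuBERT semantic tokens for vocals (from the abstract)
ACOUSTIC_RATE_HZ = 75  # EnCodec acoustic tokens for instrumentals (from the abstract)


def num_tokens(duration_s: float, rate_hz: int) -> int:
    """Number of whole token frames that fit in duration_s seconds."""
    return int(duration_s * rate_hz)


def acoustic_span_for_semantic(i: int) -> Tuple[int, int]:
    """Half-open range [start, end) of acoustic frames that overlap
    semantic frame i in time (hypothetical alignment rule)."""
    ratio = Fraction(ACOUSTIC_RATE_HZ, SEMANTIC_RATE_HZ)  # 3/2 acoustic frames per semantic frame
    start = math.floor(i * ratio)
    end = math.ceil((i + 1) * ratio)
    return start, end


# A 10-second clip yields 500 semantic tokens and 750 acoustic frames.
print(num_tokens(10.0, SEMANTIC_RATE_HZ), num_tokens(10.0, ACOUSTIC_RATE_HZ))
# Every 2 semantic frames line up with exactly 3 acoustic frames.
print([acoustic_span_for_semantic(i) for i in range(4)])
```

Because the ratio is 3:2, the two streams share a common boundary every 40 ms, which is one way the sequences can stay time-aligned while being modeled at different rates.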
2. Architecture
3. Inference
python infer_simple.py \
--vocal_path vocal.wav \
--output_path output.wav \
--config configs/ar.yaml
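To process several vocal stems in one go, the command above can be wrapped in a small script. This is a minimal sketch, not part of the repository: it uses only the three flags shown above, and the helper names, the directory layout, and the `_accompaniment.wav` output suffix are hypothetical.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_command(vocal: Path, out_dir: Path, config: str = "configs/ar.yaml") -> List[str]:
    """Build the infer_simple.py invocation for one vocal file.

    The output file name (<stem>_accompaniment.wav) is an assumption
    made for this example.
    """
    output = out_dir / f"{vocal.stem}_accompaniment.wav"
    return [
        sys.executable, "infer_simple.py",
        "--vocal_path", str(vocal),
        "--output_path", str(output),
        "--config", config,
    ]


def run_batch(vocal_dir: Path, out_dir: Path) -> None:
    """Run inference on every .wav file in vocal_dir, one process per file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for vocal in sorted(vocal_dir.glob("*.wav")):
        # check=True stops the batch at the first failing file.
        subprocess.run(build_command(vocal, out_dir), check=True)
```

Calling `run_batch(Path("vocals"), Path("outputs"))` from the repository root would then generate one accompaniment per input file, assuming `infer_simple.py` is in the current working directory.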
If you have any problems, contact me via qijian.zhu@outlook.com.
