
HAFM

HAFM: Hierarchical Autoregressive Foundation Model for Music Accompaniment Generation

Authors: Jian Zhu, Jianwei Cui, Shihao Chen, Yubang Zhang, Yunlong Xue, Cheng Luo, Jun Sun.

This repository contains the code and data for HAFM, the Hierarchical Autoregressive Foundation Model.

1. Abstract

We present HAFM, a system that generates instrumental music audio to accompany input vocals. Given isolated singing voice, HAFM produces a coherent instrumental accompaniment that can be directly mixed with the input to create complete music. We propose three key innovations over prior work: (1) a dual-rate codec tokenization scheme using HuBERT semantic tokens at 50 Hz for vocals and EnCodec acoustic tokens at 75 Hz for instrumentals, enabling time-aligned yet rate-independent modeling; (2) a three-stage hierarchical autoregressive architecture (semantic → coarse acoustic → fine acoustic) with interleaved multi-codebook prediction and classifier-free guidance; and (3) modern Transformer design choices including QK-norm, GEGLU activations, RMSNorm, and T5-style relative position bias for improved training stability and sequence generalization. Experiments on MUSDB18 demonstrate that HAFM achieves a Fréchet Audio Distance (FAD) of 2.08 on isolated vocal inputs, outperforming retrieval baselines and matching prior state-of-the-art systems with fewer parameters. The source code is available at https://github.com/HackerHyper/HAFM.
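To make the dual-rate scheme more concrete, here is a minimal sketch of how 50 Hz vocal frames and 75 Hz instrumental frames line up in time. Only the two rates come from the abstract; the helper names (`num_frames`, `aligned_instrumental_span`) and the alignment rule (an instrumental frame belongs to the vocal frame in which it starts) are assumptions made for illustration and are not taken from the HAFM code.

```python
import math
from fractions import Fraction

VOCAL_RATE_HZ = 50  # HuBERT semantic tokens per second (from the abstract)
INSTR_RATE_HZ = 75  # EnCodec acoustic frames per second (from the abstract)


def num_frames(duration_s: float, rate_hz: int) -> int:
    """Number of whole token frames that fit in duration_s seconds."""
    return int(duration_s * rate_hz)


def aligned_instrumental_span(vocal_idx: int) -> range:
    """Instrumental frame indices whose start time lies inside vocal frame vocal_idx.

    Hypothetical alignment rule: vocal frame k covers [k/50, (k+1)/50) seconds,
    and instrumental frame j starts at j/75 seconds.
    """
    ratio = Fraction(INSTR_RATE_HZ, VOCAL_RATE_HZ)  # exactly 3/2
    start = math.ceil(vocal_idx * ratio)
    stop = math.ceil((vocal_idx + 1) * ratio)
    return range(start, stop)


if __name__ == "__main__":
    # A 10-second clip yields different sequence lengths for the two streams.
    print(num_frames(10.0, VOCAL_RATE_HZ), num_frames(10.0, INSTR_RATE_HZ))
    # Every 2 vocal frames span exactly 3 instrumental frames.
    for k in range(4):
        print(k, list(aligned_instrumental_span(k)))
```

Under this rule the two streams stay synchronized in time even though their sequence lengths differ by a factor of 1.5.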

2. Architecture

[Figure: HAFM architecture]

3. Inference

python infer_simple.py \
  --vocal_path  vocal.wav \
  --output_path output.wav \
  --config      configs/ar.yaml
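
To process a folder of vocal stems, the command above can be wrapped in a short script. This is a sketch only: it assumes `infer_simple.py` is in the current working directory and accepts exactly the three flags shown above. The function names and the `_accompaniment.wav` output suffix are hypothetical and are not part of the repository.

```python
import subprocess
import sys
from pathlib import Path


def build_infer_command(vocal_path, output_path, config="configs/ar.yaml"):
    """Build the argument list for one infer_simple.py call (flags as documented above)."""
    return [
        sys.executable, "infer_simple.py",
        "--vocal_path", str(vocal_path),
        "--output_path", str(output_path),
        "--config", str(config),
    ]


def run_batch(vocal_dir, output_dir, config="configs/ar.yaml"):
    """Run inference once per .wav file in vocal_dir (hypothetical helper)."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for vocal in sorted(Path(vocal_dir).glob("*.wav")):
        # Hypothetical naming scheme for the generated accompaniment.
        out = output_dir / f"{vocal.stem}_accompaniment.wav"
        # check=True stops the batch on the first failing run.
        subprocess.run(build_infer_command(vocal, out, config), check=True)


if __name__ == "__main__":
    run_batch("vocals", "outputs")
```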

If you run into any problems, contact me at qijian.zhu@outlook.com.
