BandTok

BandTok is a 2D audio tokenizer that represents music as a time-frequency image. This repository provides the BandTok tokenizer and a language model trained on BandTok tokens for generating 10-second music clips at 44.1 kHz from text prompts.

Install

pip install -r requirements.txt

The BandTok decoder depends on NVIDIA BigVGAN. To install it explicitly, clone the repository under /bandtok:

cd /bandtok
git clone https://github.com/NVIDIA/BigVGAN

The package downloads its files from the Hugging Face Hub: config.yaml, bandtoklm.safetensors, and the tokenizer-only checkpoint bandtok.safetensors.

One-Command Music Generation

python examples/infer.py --repo_id xlbhzz/bandtok --prompt "A happy Latin song" --output output.wav

To run a local smoke test from this repository directory before uploading:

python examples/local_test_infer.py --prompt "A happy Latin song" --output local_test_output.wav
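
To generate clips for several prompts, the examples/infer.py command can be driven from a small script. The following is a minimal sketch, not part of the repository: the helper names build_infer_command and run_prompts, the outputs folder, and the clip_000.wav naming are hypothetical. It uses only the --repo_id, --prompt, and --output flags shown above and assumes it is run from the repository root.

```python
import subprocess
import sys
from pathlib import Path


def build_infer_command(prompt, output, repo_id="xlbhzz/bandtok"):
    """Return the argument list for one examples/infer.py run."""
    return [
        sys.executable, "examples/infer.py",
        "--repo_id", repo_id,
        "--prompt", prompt,
        "--output", str(output),
    ]


def run_prompts(prompts, out_dir="outputs"):
    """Run infer.py once per prompt (hypothetical helper, not a BandTok API)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for index, prompt in enumerate(prompts):
        # File naming is an illustrative convention only.
        command = build_infer_command(prompt, out_dir / f"clip_{index:03d}.wav")
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run(command, check=True)
```

Calling run_prompts(["A happy Latin song"]) would then write outputs/clip_000.wav, provided the script succeeds.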

Tokenizer Reconstruction Inference

Use the tokenizer-only checkpoint to encode an audio file into BandTok tokens and decode it back to waveform audio:

python examples/tokenizer_infer.py --repo_id . --input input.wav --output reconstructed.wav

For a directory, the script mirrors the input folder structure under the output directory:

python examples/tokenizer_infer.py --repo_id . --input /path/to/audios --output tokenizer_reconstructions
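
The mirroring behavior can be pictured with a short sketch. This is not the script's actual implementation: the function name mirrored_outputs, the set of recognized audio extensions, and the choice to write every reconstruction as .wav are assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of input extensions; the real script may accept others.
AUDIO_EXTENSIONS = {".wav", ".flac", ".mp3"}


def mirrored_outputs(input_dir, output_dir):
    """Pair each audio file under input_dir with a mirrored output path."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    pairs = []
    for src in sorted(input_dir.rglob("*")):
        if src.is_file() and src.suffix.lower() in AUDIO_EXTENSIONS:
            # Keep the path relative to the input root, so subfolders
            # reappear unchanged under the output root.
            dst = (output_dir / src.relative_to(input_dir)).with_suffix(".wav")
            pairs.append((src, dst))
    return pairs
```

Under these assumptions, /path/to/audios/rock/a.flac would map to tokenizer_reconstructions/rock/a.wav.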

You can also save the encoded tokens:

python examples/tokenizer_infer.py --repo_id . --input input.wav --output reconstructed.wav --save-tokens input_tokens.pt

Python API

from bandtok import BandTokPipeline

pipe = BandTokPipeline.from_pretrained("xlbhzz/bandtok", device="cuda")
audio = pipe.generate("A happy Latin song", duration=10.0)
pipe.save(audio, "output.wav")
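
For several prompts, the same two calls can be wrapped in a loop. The sketch below relies only on pipe.generate and pipe.save as used above; the helper name generate_batch and the clip_000.wav file naming are hypothetical and not part of the bandtok package.

```python
from pathlib import Path


def generate_batch(pipe, prompts, out_dir, duration=10.0):
    """Generate one clip per prompt and return the written paths.

    `pipe` is any object exposing generate(prompt, duration=...) and
    save(audio, path), as BandTokPipeline does in the snippet above.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, prompt in enumerate(prompts):
        audio = pipe.generate(prompt, duration=duration)
        path = out_dir / f"clip_{index:03d}.wav"  # hypothetical naming scheme
        pipe.save(audio, str(path))
        paths.append(path)
    return paths
```

Loading the pipeline once and reusing it for every prompt avoids paying the checkpoint loading cost per clip.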

Tokenizer-only usage:

from bandtok import BandTokTokenizer

tokenizer = BandTokTokenizer.from_pretrained("xlbhzz/bandtok", device="cuda")
tokens = tokenizer.encode("input.wav")
audio = tokenizer.decode(tokens)

Troubleshooting

  • BigVGAN import error: run git clone https://github.com/NVIDIA/BigVGAN under /bandtok.
  • T5 download errors: the prompt encoder uses t5-base; make sure the Hugging Face Hub is reachable, or pre-cache the model.
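
For the BigVGAN case, a quick check before running inference can confirm that the clone is where it is expected. This is a minimal sketch: the function name check_bigvgan is hypothetical, and it assumes the clone sits in a folder named BigVGAN (the default name git clone produces) directly under the directory you pass in. How BandTok itself locates BigVGAN is not covered here.

```python
from pathlib import Path


def check_bigvgan(root="."):
    """Return the BigVGAN checkout under root, or raise with a hint."""
    repo = Path(root) / "BigVGAN"
    if not repo.is_dir():
        raise FileNotFoundError(
            f"{repo} not found; run: "
            "git clone https://github.com/NVIDIA/BigVGAN"
        )
    return repo
```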

Citation

If you find this work useful, please cite:

@inproceedings{cheng2026modeling,
  title     = {Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation},
  author    = {Cheng, Yuqing and Ma, Xingyu and Yu, Guochen and Gu, Xiaotao},
  booktitle = {IEEE ICME 2026 Challenge Papers},
  year      = {2026}
}