stereo2spatial-v1

stereo2spatial-v1 is a diffusion transformer (DiT) model that converts mono or stereo audio into 12-channel 7.1.4 spatial audio at 48 kHz.

The model is intended to be used with the stereo2spatial codebase. The bundle contains the model weights, runtime config, and bundled EAR-VAE assets needed for inference.

Model Summary

  • Architecture: SpatialDiT
  • Sample rate: 48000 Hz
  • Latent FPS: 50
  • Output layout: 7.1.4
  • Output channels: 12
  • Channel order: FL, FR, FC, LFE, BL, BR, SL, SR, TFL, TFR, TBL, TBR
  • Hidden size: 1024
  • Layers: 12
  • Heads: 16
  • Latent dim: 64
  • Memory tokens: 32
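
The figures above imply some simple arithmetic that is handy when working with the outputs: at 48000 samples per second and 50 latent frames per second, one latent frame spans 960 audio samples. The sketch below restates those values and the channel order in Python. The names `ModelSummary`, `latent_frames`, and `channel_index` are hypothetical and not part of the stereo2spatial codebase, and rounding a partial final frame up is an assumption.

```python
from dataclasses import dataclass

# Channel order copied from the Model Summary (index = channel position in the output).
CHANNEL_ORDER = ["FL", "FR", "FC", "LFE", "BL", "BR",
                 "SL", "SR", "TFL", "TFR", "TBL", "TBR"]


@dataclass(frozen=True)
class ModelSummary:
    """Hypothetical container for the published summary values."""
    sample_rate: int = 48000
    latent_fps: int = 50
    output_channels: int = 12

    @property
    def samples_per_latent_frame(self) -> int:
        # 48000 / 50 = 960 audio samples per latent frame
        return self.sample_rate // self.latent_fps

    def latent_frames(self, num_samples: int) -> int:
        # Ceiling division: assumes a partial final frame still needs one latent frame.
        return -(-num_samples // self.samples_per_latent_frame)


def channel_index(name: str) -> int:
    """Return the zero-based position of a channel label such as 'LFE'."""
    return CHANNEL_ORDER.index(name)


summary = ModelSummary()
print(summary.samples_per_latent_frame)                  # 960
print(summary.latent_frames(10 * summary.sample_rate))   # 500 frames for 10 seconds
print(channel_index("LFE"))                              # 3
```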

Training Summary

This v1 release was trained for 440,000 total steps:

  • Stage 1, part 1: 200,000 steps without GAN
  • Stage 1, part 2: 200,000 additional steps with GAN enabled
  • Stage 2: 40,000 steps with GAN enabled
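
As a quick check of the schedule above, the three phases can be written as data and summed. This is only an illustrative sketch: the `STAGES` table and `stage_at` helper are hypothetical, and treating the phases as consecutive with zero-based global steps is an assumption.

```python
from itertools import accumulate

# (label, steps, GAN enabled) as listed in the Training Summary.
STAGES = [
    ("Stage 1, part 1", 200_000, False),
    ("Stage 1, part 2", 200_000, True),
    ("Stage 2", 40_000, True),
]


def total_steps(stages=STAGES) -> int:
    """Sum the step counts of all phases."""
    return sum(steps for _, steps, _ in stages)


def stage_at(step: int, stages=STAGES):
    """Return (label, gan_enabled) for a zero-based global step.

    Assumes the phases ran back to back in the order listed.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    ends = accumulate(steps for _, steps, _ in stages)
    for (label, _, gan), end in zip(stages, ends):
        if step < end:
            return label, gan
    raise ValueError("step is beyond the end of training")


print(total_steps())        # 440000
print(stage_at(250_000))    # ('Stage 1, part 2', True)
```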

Intended Use

This model is intended for:

  • research and experimentation in stereo-to-spatial generation
  • local inference workflows that render mono/stereo audio to 7.1.4
  • prototyping multichannel music and immersive-audio pipelines

This model is not a drop-in replacement for professional mastering, QC, or broadcast authoring workflows.

Limitations

  • The model is trained for a 7.1.4 output layout; do not expect other layouts to work without retraining or exporting a different target-channel setup.
  • Results are input-dependent and may introduce artifacts, unstable imaging, or balance issues on difficult material.
  • Each output passes through a VAE that was trained on stereo content (not on individual channels from spatial tracks), so results may sometimes sound subpar. At some point I may fine-tune the VAE on per-channel outputs to improve quality without retraining this model.

Quick Start

From a local checkout of the stereo2spatial code repository:

python -m venv .venv
. .venv/Scripts/activate  # Windows Git Bash; Linux/macOS: . .venv/bin/activate; Windows PowerShell: .\.venv\Scripts\Activate.ps1
pip install -e .
python -m pip install -U "huggingface_hub[cli]"
hf download francislabounty/stereo2spatial-v1 --local-dir checkpoints/stereo2spatial-v1
python infer.py --checkpoint checkpoints/stereo2spatial-v1 --input-audio path/to/input.wav --output-audio path/to/output_spatial.wav --device cuda --show-progress

The recommended usage is to point --checkpoint at the downloaded bundle directory. The inference CLI will:

  • read config.json
  • load weights from model.safetensors
  • auto-discover the bundled EAR-VAE files under vae/

Example Inference Command

python infer.py --checkpoint checkpoints/stereo2spatial-v1 --input-audio path/to/input.wav --output-audio path/to/output_spatial.wav --device cuda --show-progress --report-json outputs/report.json
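
For scripted or batch use, the same command can be assembled in Python and passed to `subprocess.run`. This is a sketch under stated assumptions: `build_infer_command` is a hypothetical helper, it uses only the flags shown in this document, and it assumes the script is run from the repository root where `infer.py` lives.

```python
import shlex


def build_infer_command(checkpoint, input_audio, output_audio,
                        device="cuda", show_progress=True, report_json=None):
    """Build the infer.py argument list using only flags documented above."""
    cmd = [
        "python", "infer.py",
        "--checkpoint", str(checkpoint),
        "--input-audio", str(input_audio),
        "--output-audio", str(output_audio),
        "--device", device,
    ]
    if show_progress:
        cmd.append("--show-progress")
    if report_json is not None:
        cmd += ["--report-json", str(report_json)]
    return cmd


cmd = build_infer_command(
    "checkpoints/stereo2spatial-v1",
    "path/to/input.wav",
    "path/to/output_spatial.wav",
    report_json="outputs/report.json",
)
print(shlex.join(cmd))
# To actually run it from the repository root:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.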

Useful flags:

  • --device cpu to run on CPU
  • --solver auto|heun|euler|unipc|... to change the latent solver
  • --normalize-peak to normalize the rendered WAV before writing
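
To illustrate what peak normalization generally means, here is a small sketch. It is not the CLI's implementation: the target level and the choice of a single gain shared by all channels are assumptions made for the example, and `normalize_peak` here is a hypothetical function operating on plain Python lists.

```python
def normalize_peak(channels, target_peak=1.0):
    """Scale all channels by one shared gain so the loudest sample hits target_peak.

    channels: list of per-channel sample lists (floats in roughly [-1.0, 1.0]).
    A shared gain (an assumption here) keeps the balance between channels intact.
    """
    peak = max((abs(s) for ch in channels for s in ch), default=0.0)
    if peak == 0.0:
        # Silent input: nothing to scale, return a copy unchanged.
        return [list(ch) for ch in channels]
    gain = target_peak / peak
    return [[s * gain for s in ch] for ch in channels]


quiet = [[0.5, -0.25], [0.1, 0.0]]
print(normalize_peak(quiet))  # [[1.0, -0.5], [0.2, 0.0]]
```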

License

This model is released under the Apache 2.0 license.

Model size: 0.4B params (Safetensors, tensor type F32)