stereo2spatial-v1

stereo2spatial-v1 is a flow-matching DiT (diffusion transformer) model that converts mono or stereo audio into 12-channel 7.1.4 spatial audio at 48 kHz.

The model is intended to be used with the stereo2spatial codebase. The bundle contains the model weights, the runtime config, and the EAR-VAE assets needed for inference.

Model Summary

Architecture: SpatialDiT
Sample rate: 48000
Latent FPS: 50
Output layout: 7.1.4
Output channels: 12
Channel order: FL, FR, FC, LFE, BL, BR, SL, SR, TFL, TFR, TBL, TBR
Hidden size: 1024
Layers: 12
Heads: 16
Latent dim: 64
Memory tokens: 32
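
The numbers above imply a simple relationship between audio samples and latent frames. The following is a minimal sketch of that arithmetic, not code from the stereo2spatial repository: the helper names are hypothetical, and how the real pipeline pads or truncates a partial final frame is an assumption (this sketch rounds up).

```python
import math

# Values copied from the model summary.
SAMPLE_RATE = 48_000
LATENT_FPS = 50
CHANNEL_ORDER = ["FL", "FR", "FC", "LFE", "BL", "BR",
                 "SL", "SR", "TFL", "TFR", "TBL", "TBR"]

# Audio samples covered by one latent frame, assuming a fixed integer hop.
SAMPLES_PER_LATENT_FRAME = SAMPLE_RATE // LATENT_FPS


def latent_frames_for(duration_seconds: float) -> int:
    """Approximate latent frame count for a clip (hypothetical helper, rounds up)."""
    return math.ceil(duration_seconds * LATENT_FPS)


def channel_index(name: str) -> int:
    """Position of a channel label in the 12-channel output order."""
    return CHANNEL_ORDER.index(name)
```

For example, `channel_index("LFE")` gives the zero-based position of the LFE channel in the output file, which is useful when pulling a single channel out of the rendered WAV.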

Training Summary

This v1 release was trained for 440,000 total steps:

Stage 1, part 1: 200,000 steps without GAN
Stage 1, part 2: 200,000 additional steps with GAN enabled
Stage 2: 40,000 steps with GAN enabled

Intended Use

This model is intended for:

research and experimentation in stereo-to-spatial generation
local inference workflows that render mono/stereo audio to 7.1.4
prototyping multichannel music and immersive-audio pipelines

This model is not a drop-in replacement for professional mastering, QC, or broadcast authoring workflows.

Limitations

The model is trained for a 7.1.4 output layout; do not expect other layouts to work without retraining or exporting a different target-channel setup.
Results are input-dependent and may introduce artifacts, unstable imaging, or balance issues on difficult material.
Each output goes through a VAE that was trained on stereo content rather than on individual channels from spatial tracks, so results can sometimes sound subpar. At some point I may fine-tune the VAE on per-channel outputs to improve quality without having to retrain this model.
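
Because balance issues are possible, it can help to look at per-channel levels after rendering. The following is a rough sketch of such a check, not part of the stereo2spatial codebase: the function name is hypothetical, and it assumes float samples with full scale at 1.0, arranged as one 12-value row per audio frame in the channel order listed in the model summary.

```python
import math
from typing import Dict, Iterable, Sequence

CHANNEL_ORDER = ["FL", "FR", "FC", "LFE", "BL", "BR",
                 "SL", "SR", "TFL", "TFR", "TBL", "TBR"]


def channel_peaks_dbfs(frames: Iterable[Sequence[float]]) -> Dict[str, float]:
    """Return the peak level of each channel in dBFS (hypothetical helper).

    A silent channel is reported as negative infinity.
    """
    peaks = [0.0] * len(CHANNEL_ORDER)
    for row in frames:
        if len(row) != len(CHANNEL_ORDER):
            raise ValueError(f"expected {len(CHANNEL_ORDER)} channels, got {len(row)}")
        for i, sample in enumerate(row):
            peaks[i] = max(peaks[i], abs(float(sample)))
    return {
        name: (20.0 * math.log10(peak) if peak > 0.0 else float("-inf"))
        for name, peak in zip(CHANNEL_ORDER, peaks)
    }
```

A channel that peaks far above or below its neighbours on ordinary material is a hint to listen to that render more closely. Peak level is only a coarse indicator and does not replace listening.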

Quick Start

From a local checkout of the stereo2spatial code repository:

python -m venv .venv
. .venv/Scripts/activate  # Git Bash on Windows. Linux/macOS: . .venv/bin/activate  Windows PowerShell: .\.venv\Scripts\Activate.ps1
pip install -e .
python -m pip install -U "huggingface_hub[cli]"
hf download francislabounty/stereo2spatial-v1 --local-dir checkpoints/stereo2spatial-v1
python infer.py --checkpoint checkpoints/stereo2spatial-v1 --input-audio path/to/input.wav --output-audio path/to/output_spatial.wav --device cuda --show-progress

The recommended usage is to point --checkpoint at the downloaded bundle directory. The inference CLI will:

read config.json
load weights from model.safetensors
auto-discover the bundled EAR-VAE files under vae/
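
Before running inference, it can be handy to confirm that the download produced the layout the CLI expects. The following is a minimal sketch of such a check. The function name is hypothetical, the file names are the ones listed above, and the exact contents of vae/ are not specified here, so the sketch only checks that the directory exists and is not empty.

```python
from pathlib import Path
from typing import List


def missing_bundle_items(bundle_dir: str) -> List[str]:
    """List expected bundle entries that are absent (hypothetical helper)."""
    root = Path(bundle_dir)
    missing = []
    if not (root / "config.json").is_file():
        missing.append("config.json")
    if not (root / "model.safetensors").is_file():
        missing.append("model.safetensors")
    vae_dir = root / "vae"
    # Assumption: any non-empty vae/ directory counts as present.
    if not vae_dir.is_dir() or not any(vae_dir.iterdir()):
        missing.append("vae/")
    return missing
```

Calling `missing_bundle_items("checkpoints/stereo2spatial-v1")` after the download step should return an empty list if the bundle is complete under these assumptions.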

Example Inference Command

python infer.py --checkpoint checkpoints/stereo2spatial-v1 --input-audio path/to/input.wav --output-audio path/to/output_spatial.wav --device cuda --show-progress --report-json outputs/report.json

Useful flags:

--device cpu to run on CPU
--solver auto|heun|euler|unipc|... to change the latent solver
--normalize-peak to normalize the rendered WAV before writing
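
For scripted or batch use, the same command can be assembled from Python. This is a sketch under assumptions rather than an API of the stereo2spatial codebase: the function names are hypothetical, only flags mentioned in this card are used, and --normalize-peak is assumed to be a switch that takes no value.

```python
import subprocess
import sys
from typing import List, Optional


def build_infer_command(
    checkpoint: str,
    input_audio: str,
    output_audio: str,
    device: str = "cuda",
    solver: Optional[str] = None,
    normalize_peak: bool = False,
    report_json: Optional[str] = None,
) -> List[str]:
    """Assemble the infer.py argument list (hypothetical helper)."""
    cmd = [
        sys.executable, "infer.py",
        "--checkpoint", checkpoint,
        "--input-audio", input_audio,
        "--output-audio", output_audio,
        "--device", device,
        "--show-progress",
    ]
    if solver is not None:
        cmd += ["--solver", solver]
    if report_json is not None:
        cmd += ["--report-json", report_json]
    if normalize_peak:
        # Assumption: this flag is a boolean switch with no argument.
        cmd.append("--normalize-peak")
    return cmd


def run_inference(**kwargs) -> None:
    """Run infer.py from the repository root and raise if it exits non-zero."""
    subprocess.run(build_infer_command(**kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces. `run_inference` has to be called from the root of the stereo2spatial checkout so that infer.py resolves.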

License

This model is released under the Apache 2.0 license.

Safetensors

Model size: 0.4B params
Tensor type: F32
