Text-Guided Audio Spatializer

A text-guided spatial audio model that converts mono audio into 3D spatial audio, rendered as binaural stereo, based on natural-language descriptions of source position and room acoustics.

Model Description

This model takes mono audio and a text description (e.g., "front-left, level, near, medium room, medium reverb") and generates spatial audio encoded as First-Order Ambisonics (FOA), which can then be converted to binaural stereo for headphone listening.

Architecture: Transformer-based model with cross-attention between audio features and text embeddings.
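The card does not specify layer sizes, head count, or which modality supplies the queries. As a minimal sketch, assuming audio frames act as queries and text embeddings supply the keys and values, single-head scaled dot-product cross-attention works as shown below. The learned query/key/value projections and multiple heads that a real Transformer layer would use are omitted, and `cross_attention` is a hypothetical name, not part of the `spatializer` package.

```python
import math


def cross_attention(audio_frames, text_tokens):
    """Single-head scaled dot-product cross-attention (no learned projections).

    audio_frames: list of T query vectors of dimension d, one per audio frame.
    text_tokens:  list of N vectors of dimension d, used as both keys and values.
    Returns T vectors, each a softmax-weighted mix of the text vectors.
    """
    d = len(text_tokens[0])
    scale = 1.0 / math.sqrt(d)
    outputs = []
    for q in audio_frames:
        # Scaled dot-product similarity between this frame and every text token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in text_tokens]
        # Numerically stable softmax over the text tokens.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors (here, the text embeddings themselves).
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, text_tokens)) for j in range(d)]
        )
    return outputs


# Two 2-D "audio frames" attending over three 2-D "text token" embeddings.
frames = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(cross_attention(frames, tokens))
```

Each audio frame thus receives a text-conditioned vector, which is one way spatial instructions can influence every time step of the output.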

Training Data: Synthetic spatial audio generated using room impulse responses and directional encoding.
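The card does not describe the data-generation pipeline in detail. The sketch below shows one standard form of directional encoding, a plane-wave FOA encoder, together with a direct-form convolution for applying a room impulse response. It assumes the AmbiX convention (ACN channel order W, Y, Z, X with SN3D normalization, azimuth counter-clockwise from the front), which the card does not confirm; `encode_foa` and `convolve` are hypothetical helpers.

```python
import math


def encode_foa(mono, azimuth_deg, elevation_deg):
    """Encode a mono signal as a plane wave in AmbiX FOA (W, Y, Z, X; SN3D).

    Azimuth is measured counter-clockwise from the front (+90 = left);
    elevation is positive upward. Returns four channels as lists.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    gains = (
        1.0,                          # W: omnidirectional
        math.sin(az) * math.cos(el),  # Y: left/right
        math.sin(el),                 # Z: up/down
        math.cos(az) * math.cos(el),  # X: front/back
    )
    return [[g * s for s in mono] for g in gains]


def convolve(signal, impulse_response):
    """Direct-form linear convolution; output length is len(signal) + len(ir) - 1."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


# Toy example: a short click, a hypothetical three-tap "room", encoded at +45 degrees.
mono = [1.0, 0.0, 0.0]
rir = [1.0, 0.0, 0.3]
foa = encode_foa(convolve(mono, rir), 45, 0)
```

Under this convention, azimuth +45° corresponds to front-left if the eight direction labels are spaced 45° apart (an assumption). Encoding an already-reverberant mono signal makes every reflection arrive from the source direction; a more realistic generator would convolve with multichannel (FOA) impulse responses, and the card does not state which approach was used.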

Sample Rate: 24 kHz

Usage

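import torch
import soundfile as sf
from scipy.signal import resample_poly
from spatializer.models.crossattn_transformer import CrossAttnSpatializer
from spatializer.utils.foa import foa_to_stereo_simple

SAMPLE_RATE = 24000  # the model operates at 24 kHz

# Load model (load_from_checkpoint is the standard PyTorch Lightning classmethod)
model = CrossAttnSpatializer.load_from_checkpoint("epoch=14-step=342.ckpt", map_location="cpu")
model.eval()

# Load audio; soundfile returns shape (samples,) for mono, (samples, channels) otherwise
audio, sr = sf.read("input.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # downmix to mono
if sr != SAMPLE_RATE:
    # spatialize() receives no sample rate, so resample to 24 kHz beforehand
    audio = resample_poly(audio, SAMPLE_RATE, sr)

# Spatialize with text
text = "front-left, level, near, medium room, medium reverb"
with torch.no_grad():
    foa_output = model.spatialize(audio, text)

# Convert FOA to binaural stereo
binaural = foa_to_stereo_simple(foa_output)
if torch.is_tensor(binaural):
    binaural = binaural.cpu().numpy()  # soundfile needs a NumPy array

# Save output; .T assumes shape (2, samples), since soundfile expects (samples, channels)
sf.write("output_binaural.wav", binaural.T, SAMPLE_RATE)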
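The implementation of `foa_to_stereo_simple` is not documented here. For illustration, a common simple FOA-to-stereo decode uses two virtual cardioid microphones pointing left and right. The sketch below assumes AmbiX FOA input (W, Y, Z, X; SN3D) and returns a `(2, samples)` layout, matching the transpose in the usage code. It is a plain stereo decode, not an HRTF-based binaural render, and `foa_to_stereo_cardioid` is a hypothetical name.

```python
def foa_to_stereo_cardioid(foa):
    """Decode AmbiX FOA (W, Y, Z, X; SN3D) to stereo with two virtual cardioids.

    A cardioid aimed at azimuth theta is 0.5 * (W + cos(theta) * X + sin(theta) * Y);
    with theta = +90 (left) and -90 (right) the X and Z terms drop out.
    Returns [left, right], i.e. shape (2, samples).
    """
    w, y, _z, _x = foa
    left = [0.5 * (wi + yi) for wi, yi in zip(w, y)]
    right = [0.5 * (wi - yi) for wi, yi in zip(w, y)]
    return [left, right]


# A one-sample source encoded hard left (W = Y = 1, Z = X = 0).
print(foa_to_stereo_cardioid([[1.0], [1.0], [0.0], [0.0]]))
# prints: [[1.0], [0.0]]
```

A source at the left lands entirely in the left channel, while a frontal source (Y = 0) is split equally between both channels; front/back and elevation cues are lost, which is one reason an HRTF-based decoder is preferred for true binaural playback.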

Spatial Parameters

The model recognizes the following values for each spatial parameter:

Direction: front, front-left, left, back-left, back, back-right, right, front-right
Elevation: down, level, up
Distance: near, mid, far
Room Size: small, medium, large
Reverb: dry, medium, wet
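The usage example orders the fields as direction, elevation, distance, room size, and reverb, appending "room" and "reverb" to the last two. A small validating helper, hypothetical and not part of the `spatializer` package, can catch typos before inference. Note that distance uses "mid" while room size and reverb use "medium". Whether the model tolerates other field orders or free-form wording is not stated.

```python
# Allowed values per field, taken from the Spatial Parameters list.
DIRECTIONS = ("front", "front-left", "left", "back-left",
              "back", "back-right", "right", "front-right")
ELEVATIONS = ("down", "level", "up")
DISTANCES = ("near", "mid", "far")
ROOM_SIZES = ("small", "medium", "large")
REVERBS = ("dry", "medium", "wet")


def build_prompt(direction, elevation, distance, room_size, reverb):
    """Return a prompt in the field order used by the usage example.

    Raises ValueError if any value is outside the documented vocabulary.
    """
    checks = (
        ("direction", direction, DIRECTIONS),
        ("elevation", elevation, ELEVATIONS),
        ("distance", distance, DISTANCES),
        ("room size", room_size, ROOM_SIZES),
        ("reverb", reverb, REVERBS),
    )
    for name, value, allowed in checks:
        if value not in allowed:
            raise ValueError(f"unknown {name} {value!r}; expected one of {allowed}")
    return f"{direction}, {elevation}, {distance}, {room_size} room, {reverb} reverb"


print(build_prompt("front-left", "level", "near", "medium", "medium"))
# prints: front-left, level, near, medium room, medium reverb
```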

Limitations

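The model operates at 24 kHz; input audio at other sample rates must be resampled to 24 kHz
Works best with mono source material
Headphones are required to hear the binaural output as intended
Because the model was trained on synthetic data, it may not capture all the acoustic nuances of real recordings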

Training Details

Framework: PyTorch Lightning
Optimizer: AdamW
Epochs: 15
Checkpoint: epoch=14-step=342.ckpt (version 3)

Citation

@misc{helix-spatializer-2025,
  title={Text-Guided Audio Spatializer},
  author={Your Name},
  year={2025}
}
