Text-Guided Audio Spatializer

A text-guided spatial audio model that converts mono audio into 3D spatial audio, encoded as First-Order Ambisonics and decodable to binaural stereo, based on natural-language descriptions.

Model Description

The model takes mono audio and a text description (e.g., "front-left, level, near, medium room, medium reverb") and generates spatial audio encoded as First-Order Ambisonics (FOA), which can then be converted to binaural stereo for headphone listening.

Architecture: Transformer-based model with cross-attention between audio features and text embeddings.

Training Data: Synthetic spatial audio generated using room impulse responses and directional encoding.
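
As an illustration of what "directional encoding" can mean for FOA, the sketch below pans a mono signal to a given azimuth and elevation. It assumes the AmbiX convention (channel order W, Y, Z, X with SN3D gains); the model card does not state which convention the training pipeline uses, and the function name encode_foa is hypothetical. Room impulse responses are not modeled here.

```python
import numpy as np


def encode_foa(mono, azimuth_deg, elevation_deg):
    """Pan a mono signal into 4-channel FOA.

    Assumption: AmbiX layout (ACN order W, Y, Z, X; SN3D normalization).
    Azimuth 0 is straight ahead and increases toward the left;
    elevation 0 is level and increases upward.
    """
    az = np.deg2rad(azimuth_deg)
    el = np.deg2rad(elevation_deg)
    mono = np.asarray(mono, dtype=np.float64)
    gains = np.array([
        1.0,                      # W: omnidirectional component
        np.sin(az) * np.cos(el),  # Y: left-right component
        np.sin(el),               # Z: up-down component
        np.cos(az) * np.cos(el),  # X: front-back component
    ])
    # Result has shape (4, num_samples): one row per FOA channel.
    return gains[:, None] * mono[None, :]


# Example: a short 440 Hz tone at 24 kHz placed 45 degrees to the left, at ear level.
t = np.arange(2400) / 24000.0
tone = np.sin(2.0 * np.pi * 440.0 * t)
foa_front_left = encode_foa(tone, azimuth_deg=45.0, elevation_deg=0.0)
```

Mapping a label such as "front-left" to 45 degrees is itself an assumption made for this example.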

Sample Rate: 24 kHz

Usage

import torch
import soundfile as sf
from spatializer.models.crossattn_transformer import CrossAttnSpatializer
from spatializer.utils.foa import foa_to_stereo_simple

# Load the trained checkpoint and switch to inference mode
model = CrossAttnSpatializer.load_from_checkpoint("epoch=14-step=342.ckpt")
model.eval()

# Load mono audio; sr is the file's own sample rate
audio, sr = sf.read("input.wav")

# Spatialize with a text description of the target position and room
text = "front-left, level, near, medium room, medium reverb"
with torch.no_grad():
    foa_output = model.spatialize(audio, text)

# Convert FOA to binaural stereo
binaural = foa_to_stereo_simple(foa_output)

# Save output at the model's 24 kHz sample rate
sf.write("output_binaural.wav", binaural.T, 24000)
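
For readers unfamiliar with FOA decoding, here is a minimal sketch of one common way to fold FOA down to two channels: a pair of virtual cardioid microphones aimed left and right. This is not necessarily what foa_to_stereo_simple does. The function name foa_to_stereo_cardioid is hypothetical, and the input layout (shape (4, n), channels W, Y, Z, X, SN3D gains) is an assumption. It produces plain stereo rather than an HRTF-based binaural render.

```python
import numpy as np


def foa_to_stereo_cardioid(foa):
    """Decode FOA to stereo with two virtual cardioids at +/-90 degrees.

    Assumption: foa has shape (4, n) in AmbiX order (W, Y, Z, X) with SN3D gains.
    Returns an array of shape (2, n): row 0 is left, row 1 is right.
    """
    foa = np.asarray(foa, dtype=np.float64)
    w, y = foa[0], foa[1]
    left = 0.5 * (w + y)   # cardioid pointing left picks up W plus the Y component
    right = 0.5 * (w - y)  # cardioid pointing right picks up W minus the Y component
    return np.stack([left, right])
```

With this decode, a source encoded hard left appears only in the left channel, and a source straight ahead is split equally between both channels.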

Spatial Parameters

The model understands the following spatial parameters:

  • Direction: front, front-left, left, back-left, back, back-right, right, front-right
  • Elevation: down, level, up
  • Distance: near, mid, far
  • Room Size: small, medium, large
  • Reverb: dry, medium, wet
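
The parameter lists above can be combined into a prompt programmatically. The sketch below validates each value and joins them in the order used by the example description in the Usage section. The helper name build_prompt and the exact string format ("<size> room", "<amount> reverb") are inferred from that single example and are assumptions, not a documented API.

```python
DIRECTIONS = ("front", "front-left", "left", "back-left",
              "back", "back-right", "right", "front-right")
ELEVATIONS = ("down", "level", "up")
DISTANCES = ("near", "mid", "far")
ROOM_SIZES = ("small", "medium", "large")
REVERBS = ("dry", "medium", "wet")


def build_prompt(direction, elevation, distance, room_size, reverb):
    """Build a text description from the five spatial parameters.

    Raises ValueError if any value is outside the listed vocabulary.
    """
    checks = (
        ("direction", direction, DIRECTIONS),
        ("elevation", elevation, ELEVATIONS),
        ("distance", distance, DISTANCES),
        ("room_size", room_size, ROOM_SIZES),
        ("reverb", reverb, REVERBS),
    )
    for name, value, allowed in checks:
        if value not in allowed:
            raise ValueError(f"{name} must be one of {allowed}, got {value!r}")
    # Format assumed from the example prompt in the Usage section.
    return f"{direction}, {elevation}, {distance}, {room_size} room, {reverb} reverb"


prompt = build_prompt("back-right", "up", "far", "large", "wet")
```

Validating up front keeps misspelled values (for example "middle" instead of "mid") from reaching the model silently.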

Limitations

  • Input audio is resampled to 24 kHz
  • Best results are obtained with mono source material
  • Headphones are required for a proper spatial audio experience
  • The model is trained on synthetic data, so it may not capture all real-world acoustic nuances
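
If you prefer to control the mono downmix and resampling yourself before calling the model, the sketch below shows one way to do it with NumPy and SciPy. The helper name prepare_input is hypothetical, and the model's own preprocessing may differ. It assumes audio shaped (frames,) or (frames, channels), as returned by soundfile.read.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 24000  # sample rate stated in this model card


def prepare_input(audio, sr, target_sr=TARGET_SR):
    """Downmix to mono and resample to target_sr.

    audio: array of shape (frames,) or (frames, channels).
    sr: integer sample rate of the input.
    """
    audio = np.asarray(audio, dtype=np.float64)
    if audio.ndim == 2:
        # Average the channels to get a mono signal.
        audio = audio.mean(axis=1)
    if sr != target_sr:
        # Polyphase resampling by the reduced ratio target_sr / sr.
        g = gcd(sr, target_sr)
        audio = resample_poly(audio, target_sr // g, sr // g)
    return audio
```

For example, one second of 48 kHz stereo, shape (48000, 2), comes back as a mono array of 24000 samples.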

Training Details

  • Framework: PyTorch Lightning
  • Optimizer: AdamW
  • Epochs: 15
  • Checkpoint: epoch=14-step=342.ckpt (version 3)

Citation

@misc{helix-spatializer-2025,
  title={Text-Guided Audio Spatializer},
  author={Your Name},
  year={2025}
}