---

license: apache-2.0
tags:
- audio
- spatial-audio
- 3d-audio
- ambisonics
- text-to-spatial
library_name: pytorch
---


# Text-Guided Audio Spatializer

A text-guided spatial audio model that converts mono audio into 3D spatial audio (First-Order Ambisonics, renderable to binaural stereo) based on natural language descriptions.

## Model Description

This model takes mono audio and text descriptions (e.g., "front-left, level, near, medium room, medium reverb") and generates First-Order Ambisonics (FOA) encoded spatial audio, which can be converted to binaural stereo for headphone listening.

**Architecture**: Transformer-based model with cross-attention between audio features and text embeddings.

**Training Data**: Synthetic spatial audio generated using room impulse responses and directional encoding.

**Sample Rate**: 24kHz

## Usage

```python
import torch
import soundfile as sf

from spatializer.models.crossattn_transformer import CrossAttnSpatializer
from spatializer.utils.foa import foa_to_stereo_simple

# Load model
model = CrossAttnSpatializer.load_from_checkpoint("epoch=14-step=342.ckpt")
model.eval()

# Load audio (the model expects 24 kHz mono input)
audio, sr = sf.read("input.wav")

# Spatialize with text
text = "front-left, level, near, medium room, medium reverb"
with torch.no_grad():
    foa_output = model.spatialize(audio, text)

# Convert FOA to binaural stereo
binaural = foa_to_stereo_simple(foa_output)

# Save output
sf.write("output_binaural.wav", binaural.T, 24000)
```
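The repository's `foa_to_stereo_simple` handles the FOA-to-stereo decode. As an illustrative sketch only (the repo's actual implementation may differ), a minimal mid-side style decode combines the omnidirectional W channel with the left-right figure-of-eight Y channel, assuming ACN channel ordering (W, Y, Z, X):

```python
import numpy as np

def foa_to_stereo_midside(foa):
    """Illustrative FOA -> stereo decode (NOT the repo's foa_to_stereo_simple).

    Assumes foa has shape (4, num_samples) in ACN order: W, Y, Z, X.
    Left gets W plus the leftward component of Y; right gets W minus it.
    """
    w, y = foa[0], foa[1]
    left = w + y
    right = w - y
    return np.stack([left, right])  # shape (2, num_samples)
```

A source panned hard left raises Y, boosting the left channel and attenuating the right; a purely omnidirectional signal (Y = 0) decodes identically to both ears.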

## Spatial Parameters

The model understands the following spatial parameters:

- **Direction**: front, front-left, left, back-left, back, back-right, right, front-right
- **Elevation**: down, level, up
- **Distance**: near, mid, far
- **Room Size**: small, medium, large
- **Reverb**: dry, medium, wet
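These five parameters combine into a prompt in the order shown in the usage example: direction, elevation, distance, room size, reverb. A small hypothetical helper (`make_prompt` is not part of the repository) that validates each field before joining them:

```python
# Vocabulary taken from the parameter lists above.
DIRECTIONS = {"front", "front-left", "left", "back-left",
              "back", "back-right", "right", "front-right"}
ELEVATIONS = {"down", "level", "up"}
DISTANCES = {"near", "mid", "far"}
ROOM_SIZES = {"small", "medium", "large"}
REVERBS = {"dry", "medium", "wet"}

def make_prompt(direction, elevation="level", distance="mid",
                room_size="medium", reverb="medium"):
    """Build a spatialization prompt like the one in the usage example."""
    for value, allowed in [(direction, DIRECTIONS), (elevation, ELEVATIONS),
                           (distance, DISTANCES), (room_size, ROOM_SIZES),
                           (reverb, REVERBS)]:
        if value not in allowed:
            raise ValueError(f"unsupported spatial value: {value!r}")
    return f"{direction}, {elevation}, {distance}, {room_size} room, {reverb} reverb"
```

For example, `make_prompt("front-left", distance="near")` reproduces the prompt used in the usage snippet above.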

## Limitations

- Input audio is resampled to 24kHz
- Best results with mono source material
- Requires headphones for proper spatial audio experience
- The model was trained on synthetic data and may not capture all acoustic nuances
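Because the model runs at 24 kHz, audio at other sample rates should be resampled before inference. A minimal sketch using `scipy.signal.resample_poly` (whether the model handles this internally is an assumption left open by this card; the helper name is hypothetical):

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def resample_to_24k(audio, sr, target_sr=24000):
    """Polyphase-resample a mono signal to the model's 24 kHz rate."""
    if sr == target_sr:
        return audio
    # Reduce the rational rate ratio so the polyphase filter stays small.
    g = gcd(target_sr, sr)
    return resample_poly(audio, target_sr // g, sr // g)
```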

## Training Details

- **Framework**: PyTorch Lightning
- **Optimizer**: AdamW
- **Epochs**: 15
- **Checkpoint**: epoch=14-step=342.ckpt (version 3)

## Citation

```
@misc{helix-spatializer-2025,
  title={Text-Guided Audio Spatializer},
  author={Your Name},
  year={2025}
}
```