soundsol committed
Commit 8772ff6 · verified · 1 Parent(s): b0e62cb

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +85 -0
README.md ADDED
@@ -0,0 +1,85 @@
---
license: apache-2.0
tags:
- audio
- spatial-audio
- 3d-audio
- ambisonics
- text-to-spatial
library_name: pytorch
---

# Text-Guided Audio Spatializer

A text-guided spatial audio model that converts mono audio into 3D-spatialized binaural audio based on natural language descriptions.

## Model Description

This model takes mono audio and a text description (e.g., "front-left, level, near, medium room, medium reverb") and generates First-Order Ambisonics (FOA) encoded spatial audio, which can be converted to binaural stereo for headphone listening.

**Architecture**: Transformer-based model with cross-attention between audio features and text embeddings; a sketch of this pattern follows this section.

**Training Data**: Synthetic spatial audio generated using room impulse responses and directional encoding.

**Sample Rate**: 24 kHz
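
The cross-attention pattern named above can be sketched as follows. This is a minimal illustration, assuming audio frame features act as queries over text token embeddings; the class name and dimensions are hypothetical, not the actual internals of `CrossAttnSpatializer`:

```python
import torch
import torch.nn as nn

class TextConditionedBlock(nn.Module):
    """One transformer block where audio frames attend to text tokens.

    Hypothetical sketch only, not the repository's actual module.
    """
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # audio: (batch, frames, d_model); text: (batch, tokens, d_model)
        a = self.norm1(audio)
        audio = audio + self.self_attn(a, a, a)[0]         # audio self-attention
        a = self.norm2(audio)
        audio = audio + self.cross_attn(a, text, text)[0]  # audio attends to text
        return audio + self.ff(self.norm3(audio))
```

Each block lets every audio frame consult the text tokens, so the spatial description can steer the generated FOA encoding frame by frame.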

## Usage

```python
import torch
import soundfile as sf
from spatializer.models.crossattn_transformer import CrossAttnSpatializer
from spatializer.utils.foa import foa_to_stereo_simple

# Load the trained checkpoint
model = CrossAttnSpatializer.load_from_checkpoint("epoch=14-step=342.ckpt")
model.eval()

# Load mono input audio (the model expects 24 kHz; resample first if needed)
audio, sr = sf.read("input.wav")

# Spatialize with a text description of the desired placement
text = "front-left, level, near, medium room, medium reverb"
with torch.no_grad():
    foa_output = model.spatialize(audio, text)

# Convert FOA to binaural stereo for headphone playback
binaural = foa_to_stereo_simple(foa_output)

# Save the two-channel result
sf.write("output_binaural.wav", binaural.T, 24000)
```
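
The `foa_to_stereo_simple` call above decodes the four FOA channels down to two. The repository's implementation is not shown here; as a rough, hypothetical illustration of what a basic decode can look like (assuming ACN channel ordering `[W, Y, Z, X]`), a mid/side-style fold-down is:

```python
import numpy as np

def foa_to_stereo_sketch(foa: np.ndarray) -> np.ndarray:
    """Crude FOA -> stereo fold-down; not the package's actual decoder.

    Assumes foa has shape (4, num_samples) in ACN order [W, Y, Z, X],
    where W is the omni channel and Y points to the listener's left.
    """
    w, y = foa[0], foa[1]
    left = w + y    # boosts sounds placed to the left
    right = w - y   # boosts sounds placed to the right
    return np.stack([left, right])  # shape (2, num_samples)
```

A fold-down like this discards the height (Z) and front/back (X) cues; proper binaural rendering instead convolves the FOA channels with HRTF-based filters.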

## Spatial Parameters

The model understands the following spatial parameters, which combine into prompts like the one in the usage example (see the sketch after this list):

- **Direction**: front, front-left, left, back-left, back, back-right, right, front-right
- **Elevation**: down, level, up
- **Distance**: near, mid, far
- **Room Size**: small, medium, large
- **Reverb**: dry, medium, wet
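
These five fields follow the comma-separated order used in the usage example. As a small, hypothetical convenience (not part of the `spatializer` package), a prompt can be assembled like this:

```python
def make_prompt(direction: str, elevation: str, distance: str,
                room_size: str, reverb: str) -> str:
    # Follows the "direction, elevation, distance, <size> room, <amount> reverb"
    # pattern from the usage example above.
    return f"{direction}, {elevation}, {distance}, {room_size} room, {reverb} reverb"

print(make_prompt("back-right", "up", "far", "large", "wet"))
# -> back-right, up, far, large room, wet reverb
```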

## Limitations

- Input audio is resampled to 24 kHz
- Works best with mono source material
- Requires headphones for a proper spatial audio experience
- Trained on synthetic data, so it may not capture all real-world acoustic nuances

## Training Details

- **Framework**: PyTorch Lightning
- **Optimizer**: AdamW
- **Epochs**: 15
- **Checkpoint**: epoch=14-step=342.ckpt (version 3)
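
For context, a checkpoint named `epoch=14-step=342.ckpt` is what PyTorch Lightning produces after 15 epochs (numbered 0-14). A minimal, hypothetical sketch of this training setup, with an assumed batch layout, loss, and learning rate:

```python
import pytorch_lightning as pl
import torch
import torch.nn.functional as F

class SpatializerTraining(pl.LightningModule):
    """Illustrative wrapper only; not the repository's actual training code."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def training_step(self, batch, batch_idx):
        mono, text, target_foa = batch           # assumed batch layout
        pred_foa = self.model(mono, text)
        loss = F.mse_loss(pred_foa, target_foa)  # assumed reconstruction loss
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        # AdamW, as listed above; the learning rate is an assumption
        return torch.optim.AdamW(self.parameters(), lr=1e-4)

# trainer = pl.Trainer(max_epochs=15)
# trainer.fit(SpatializerTraining(model), train_dataloader)
```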

## Citation

```bibtex
@misc{helix-spatializer-2025,
  title={Text-Guided Audio Spatializer},
  author={Your Name},
  year={2025}
}
```