# Helix v0.7 - Mono to Spatial FOA

Helix is a neural spatial audio model that converts mono audio into four-channel First-Order Ambisonics (FOA), with source placement controlled by text-based spatial descriptions.

🎧 Try the demo on Spaces (coming soon)
## Model Description

- Architecture: UNet1D with FiLM conditioning, 2.4M parameters (see the FiLM sketch after this list)
- Input: Mono audio (24 kHz) + spatial text description
- Output: 4-channel FOA (W, X, Y, Z) at 24 kHz
- Training: 50 epochs on 2,490 commercial-safe paired samples
- Validation loss: 1.26 (an 88% reduction from initialization)
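For readers unfamiliar with FiLM (feature-wise linear modulation), the sketch below shows how a conditioning vector can scale and shift 1D feature maps inside a UNet. The layer shapes, names, and the 128-dim text embedding are illustrative assumptions, not Helix's actual implementation.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scale and shift feature maps using
    parameters predicted from a conditioning vector. Illustrative sketch
    only; not the actual Helix layer."""

    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        # One linear layer predicts both gamma (scale) and beta (shift)
        self.proj = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) 1D feature maps
        # cond: (batch, cond_dim) embedding of the spatial text description
        gamma, beta = self.proj(cond).chunk(2, dim=-1)
        return gamma.unsqueeze(-1) * x + beta.unsqueeze(-1)

# Example: modulate a 64-channel feature map with a 128-dim embedding
film = FiLM(cond_dim=128, num_channels=64)
features = torch.randn(2, 64, 1024)  # (batch, channels, time)
cond = torch.randn(2, 128)           # hypothetical text embedding
out = film(features, cond)           # same shape as `features`
```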
## Capabilities

- ✅ Convert mono/stereo to spatial FOA
- ✅ Text-guided positioning (8 directions × 3 elevations × 3 distances)
- ✅ Real-time capable (CPU inference)
- ✅ 100% commercial-safe training data
- ✅ Compatible with VR/AR and spatial audio workflows
## Usage

### Quick Start

```python
import torch
import soundfile as sf

from helix_model import HelixModel

# Load the pretrained model
model = HelixModel.from_pretrained("your-username/helix-v0.7")

# Load audio; the model expects mono at 24 kHz
audio, sr = sf.read("your_audio.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # downmix stereo to mono

# Spatialize with a text-described position
foa_output = model.spatialize(
    audio,
    direction="left",
    elevation="level",
    distance="mid",
)

# Save the 4-channel FOA output (soundfile expects frames x channels)
sf.write("output_foa.wav", foa_output.T, 24000)
```
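The sample rate is fixed at 24 kHz (see Performance below), so audio at other rates should be resampled first. A minimal sketch using librosa, which is an assumption here rather than a Helix dependency:

```python
import librosa

# Resample to the model's fixed 24 kHz rate if needed
if sr != 24000:
    audio = librosa.resample(audio, orig_sr=sr, target_sr=24000)
    sr = 24000
```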
### Spatial Parameters

Direction (8 options): `front`, `front-left`, `left`, `back-left`, `back`, `back-right`, `right`, `front-right`

Elevation (3 options): `down` (-30°), `level` (0°), `up` (+30°)

Distance (3 options): `near` (1 m), `mid` (2.5 m), `far` (5 m)
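Together the presets cover 8 × 3 × 3 = 72 discrete positions. For illustration, one plausible mapping from labels to geometry; the azimuth convention Helix uses internally is an assumption here, only the elevation and distance values are stated above:

```python
# Hypothetical label-to-geometry mapping; azimuth convention is an assumption
DIRECTIONS = {  # azimuth in degrees, counterclockwise from front
    "front": 0.0, "front-left": 45.0, "left": 90.0, "back-left": 135.0,
    "back": 180.0, "back-right": -135.0, "right": -90.0, "front-right": -45.0,
}
ELEVATIONS = {"down": -30.0, "level": 0.0, "up": 30.0}  # degrees
DISTANCES = {"near": 1.0, "mid": 2.5, "far": 5.0}       # meters
```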
### Advanced Usage

```python
# Batch processing: spatialize several clips in one call
# (audio_list is a list of mono 24 kHz arrays)
foa_outputs = model.spatialize_batch(
    audio_list,
    positions=[
        ("front", "level", "near"),
        ("left", "up", "far"),
        ("back", "down", "mid"),
    ],
)

# Custom positions (azimuth and elevation in degrees, distance in meters)
foa = model.spatialize_custom(
    audio,
    azimuth=45.0,
    elevation=15.0,
    distance=2.0,
)
```
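`spatialize_custom` also makes it possible to animate a source by spatializing short segments at successive azimuths and concatenating the results. A rough sketch with crossfading omitted for brevity; boundary clicks are possible, and this usage is an assumption rather than a documented feature:

```python
import numpy as np

# Sweep a source from front (0°) to left (90°) across the clip
num_segments = 10
segments = np.array_split(audio, num_segments)
azimuths = np.linspace(0.0, 90.0, num_segments)

foa_segments = [
    model.spatialize_custom(seg, azimuth=az, elevation=0.0, distance=2.0)
    for seg, az in zip(segments, azimuths)
]
foa_moving = np.concatenate(foa_segments, axis=-1)  # (4, total_samples)
```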
## FOA Output Format
The model outputs First-Order Ambisonics (SN3D normalization):
- Channel 0 (W): Omnidirectional (pressure/mono sum)
- Channel 1 (X): Front-back axis
- Channel 2 (Y): Left-right axis
- Channel 3 (Z): Up-down axis
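For intuition, classical first-order encoding pans a mono signal with direction-dependent gains; under SN3D the gains are W = 1, X = cos θ cos φ, Y = sin θ cos φ, Z = sin φ for azimuth θ and elevation φ. The sketch below implements that reference panner; Helix itself is a learned model, not this fixed encoder.

```python
import numpy as np

def encode_foa_sn3d(mono: np.ndarray, azimuth_deg: float, elevation_deg: float) -> np.ndarray:
    """Encode a mono signal to FOA (W, X, Y, Z) with SN3D gains.
    Reference panner for intuition only."""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    gains = np.array([
        1.0,                      # W: omnidirectional
        np.cos(az) * np.cos(el),  # X: front-back
        np.sin(az) * np.cos(el),  # Y: left-right
        np.sin(el),               # Z: up-down
    ])
    return gains[:, None] * mono[None, :]  # shape (4, samples)
```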
### Decoding FOA

```python
from helix_utils import foa_to_binaural

# Decode to binaural for headphone playback
binaural = foa_to_binaural(foa_output)
sf.write("binaural.wav", binaural.T, 24000)
```

Alternatively, use the FOA file in spatial audio tools:

- Reaper (with ambisonic plugins)
- Pro Tools (with Dolby Atmos)
- Unity/Unreal (native FOA support)
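If no binaural renderer is at hand, a crude stereo preview can be made with two virtual cardioid microphones pointed left and right. This discards height and front-back cues, and the gains below are conventional assumptions, not part of helix_utils:

```python
import numpy as np

def foa_to_stereo_preview(foa: np.ndarray) -> np.ndarray:
    """Virtual cardioid mics at +/-90°; preview only, loses X and Z cues."""
    w, x, y, z = foa          # (4, samples), SN3D channel order as above
    left = 0.5 * (w + y)      # cardioid pointing left
    right = 0.5 * (w - y)     # cardioid pointing right
    return np.stack([left, right])  # (2, samples)
```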
## Training Data
Trained on 100% commercial-safe data:
| Source | License | Samples | Type |
|---|---|---|---|
| FMA | CC BY 4.0 | 2,000 | Music |
| Common Voice | CC0 | 490 | Speech |
| pyroomacoustics | MIT | Generated | RIRs |
Total: 2,490 paired mono→FOA examples
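The pyroomacoustics row refers to simulated room impulse responses (RIRs) used in building the paired examples. A minimal sketch of RIR simulation with pyroomacoustics; the room size, absorption, and positions are arbitrary examples, not Helix's training configuration:

```python
import pyroomacoustics as pra

# Simulate a shoebox room and compute the source-to-mic impulse response
room = pra.ShoeBox(
    [6.0, 4.0, 3.0],              # room dimensions in meters
    fs=24000,                     # match the model's sample rate
    materials=pra.Material(0.3),  # uniform wall absorption
    max_order=10,                 # image-source reflection order
)
room.add_source([2.0, 2.0, 1.5])
room.add_microphone([4.0, 2.0, 1.5])
room.compute_rir()
rir = room.rir[0][0]  # RIR for mic 0, source 0
```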
## Performance
- Inference Speed: ~0.15s for 4-second audio (CPU)
- Model Size: 9.6 MB
- Memory: ~500 MB RAM
- Sample Rate: 24 kHz fixed
- Max Duration: 30 seconds (longer audio will be chunked)
## Limitations

⚠️ Current limitations:

- Fixed 24 kHz sample rate
- Best for speech and music (limited coverage of sound effects)
- Trained on limited spatial diversity
- No real-time streaming (processes full clips)
## Citation

```bibtex
@misc{helix2024,
  title={Helix: Text-Guided Mono to Spatial Audio},
  author={Your Name},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/your-username/helix-v0.7}}
}
```
## License

- Model: CC BY 4.0 (commercial use allowed with attribution)
- Training data: all commercial-safe (CC0, CC BY 4.0, MIT)
## Model Card Contact
For questions or issues, please open an issue on GitHub or contact [your-email].
## Example Applications

- 🎵 Music Production: Pan instruments in 3D space
- 🎙️ Podcasts: Position speakers spatially
- 🎮 Game Audio: Convert mono assets to spatial
- 🎬 Film: Quick spatial audio prototyping
- 🥽 VR/AR: Immersive audio experiences
## Updates
v0.7 (Oct 2024):
- Initial release
- 2.4M parameter UNet1D
- Text-guided spatial positioning
- 100% commercial-safe training
Coming in v0.8:
- 20K training samples (8x more data)
- Better speech quality (Common Voice)
- Sound effects support (Freesound)
- Real spatial recordings (TAU dataset)
- Lower validation loss target: <1.0
Built with PyTorch Lightning | Gradio demo available