# Helix v0.7 - Mono to Spatial FOA

Helix is a neural spatial audio model that converts mono audio into four-channel First-Order Ambisonics (FOA), with source placement controlled by text-based spatial descriptions.

🎧 Try the demo on Spaces (coming soon)
## Model Description

- Architecture: UNet1D with FiLM conditioning, 2.4M parameters (see the FiLM sketch after this list)
- Input: Mono audio (24 kHz) + spatial text description
- Output: 4-channel FOA (W, X, Y, Z) at 24 kHz
- Training: 50 epochs on 2,490 commercial-safe paired samples
- Validation loss: 1.26 (an 88% reduction from initialization)
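For readers unfamiliar with FiLM (feature-wise linear modulation), the sketch below shows how a conditioning vector can scale and shift 1D feature maps inside a UNet. The layer shapes, names, and the 128-dim text embedding are illustrative assumptions, not Helix's actual implementation.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scale and shift feature maps using
    parameters predicted from a conditioning vector. Illustrative sketch
    only; not the actual Helix layer."""

    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        # One linear layer predicts both gamma (scale) and beta (shift)
        self.proj = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) 1D feature maps
        # cond: (batch, cond_dim) embedding of the spatial text description
        gamma, beta = self.proj(cond).chunk(2, dim=-1)
        return gamma.unsqueeze(-1) * x + beta.unsqueeze(-1)

# Example: modulate a 64-channel feature map with a 128-dim embedding
film = FiLM(cond_dim=128, num_channels=64)
features = torch.randn(2, 64, 1024)  # (batch, channels, time)
cond = torch.randn(2, 128)           # hypothetical text embedding
out = film(features, cond)           # same shape as `features`
```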
## Capabilities

- ✅ Convert mono/stereo to spatial FOA
- ✅ Text-guided positioning (8 directions × 3 elevations × 3 distances)
- ✅ Real-time capable (CPU inference)
- ✅ 100% commercial-safe training data
- ✅ Compatible with VR/AR and spatial audio workflows
## Usage

### Quick Start

```python
import torch
import soundfile as sf

from helix_model import HelixModel

# Load the pretrained model
model = HelixModel.from_pretrained("your-username/helix-v0.7")

# Load audio; the model expects mono at 24 kHz
audio, sr = sf.read("your_audio.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # downmix stereo to mono

# Spatialize with a text-described position
foa_output = model.spatialize(
    audio,
    direction="left",
    elevation="level",
    distance="mid",
)

# Save the 4-channel FOA output (soundfile expects frames x channels)
sf.write("output_foa.wav", foa_output.T, 24000)
```
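The sample rate is fixed at 24 kHz (see Performance below), so audio at other rates should be resampled first. A minimal sketch using librosa, which is an assumption here rather than a Helix dependency:

```python
import librosa

# Resample to the model's fixed 24 kHz rate if needed
if sr != 24000:
    audio = librosa.resample(audio, orig_sr=sr, target_sr=24000)
    sr = 24000
```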
### Spatial Parameters

Direction (8 options): `front`, `front-left`, `left`, `back-left`, `back`, `back-right`, `right`, `front-right`

Elevation (3 options): `down` (-30°), `level` (0°), `up` (+30°)

Distance (3 options): `near` (1 m), `mid` (2.5 m), `far` (5 m)
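Together the presets cover 8 × 3 × 3 = 72 discrete positions. For illustration, one plausible mapping from labels to geometry; the azimuth convention Helix uses internally is an assumption here, only the elevation and distance values are stated above:

```python
# Hypothetical label-to-geometry mapping; azimuth convention is an assumption
DIRECTIONS = {  # azimuth in degrees, counterclockwise from front
    "front": 0.0, "front-left": 45.0, "left": 90.0, "back-left": 135.0,
    "back": 180.0, "back-right": -135.0, "right": -90.0, "front-right": -45.0,
}
ELEVATIONS = {"down": -30.0, "level": 0.0, "up": 30.0}  # degrees
DISTANCES = {"near": 1.0, "mid": 2.5, "far": 5.0}       # meters
```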
### Advanced Usage

```python
# Batch processing: spatialize several clips in one call
# (audio_list is a list of mono 24 kHz arrays)
foa_outputs = model.spatialize_batch(
    audio_list,
    positions=[
        ("front", "level", "near"),
        ("left", "up", "far"),
        ("back", "down", "mid"),
    ],
)

# Custom positions (azimuth and elevation in degrees, distance in meters)
foa = model.spatialize_custom(
    audio,
    azimuth=45.0,
    elevation=15.0,
    distance=2.0,
)
```
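`spatialize_custom` also makes it possible to animate a source by spatializing short segments at successive azimuths and concatenating the results. A rough sketch with crossfading omitted for brevity; boundary clicks are possible, and this usage is an assumption rather than a documented feature:

```python
import numpy as np

# Sweep a source from front (0°) to left (90°) across the clip
num_segments = 10
segments = np.array_split(audio, num_segments)
azimuths = np.linspace(0.0, 90.0, num_segments)

foa_segments = [
    model.spatialize_custom(seg, azimuth=az, elevation=0.0, distance=2.0)
    for seg, az in zip(segments, azimuths)
]
foa_moving = np.concatenate(foa_segments, axis=-1)  # (4, total_samples)
```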
## FOA Output Format
The model outputs First-Order Ambisonics (SN3D normalization):
- Channel 0 (W): Omnidirectional (pressure/mono sum)
- Channel 1 (X): Front-back axis
- Channel 2 (Y): Left-right axis
- Channel 3 (Z): Up-down axis
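For intuition, classical first-order encoding pans a mono signal with direction-dependent gains; under SN3D the gains are W = 1, X = cos θ cos φ, Y = sin θ cos φ, Z = sin φ for azimuth θ and elevation φ. The sketch below implements that reference panner; Helix itself is a learned model, not this fixed encoder.

```python
import numpy as np

def encode_foa_sn3d(mono: np.ndarray, azimuth_deg: float, elevation_deg: float) -> np.ndarray:
    """Encode a mono signal to FOA (W, X, Y, Z) with SN3D gains.
    Reference panner for intuition only."""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    gains = np.array([
        1.0,                      # W: omnidirectional
        np.cos(az) * np.cos(el),  # X: front-back
        np.sin(az) * np.cos(el),  # Y: left-right
        np.sin(el),               # Z: up-down
    ])
    return gains[:, None] * mono[None, :]  # shape (4, samples)
```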
### Decoding FOA

```python
from helix_utils import foa_to_binaural

# Decode to binaural for headphone playback
binaural = foa_to_binaural(foa_output)
sf.write("binaural.wav", binaural.T, 24000)
```

Alternatively, use the FOA file in spatial audio tools:

- Reaper (with ambisonic plugins)
- Pro Tools (with Dolby Atmos)
- Unity/Unreal (native FOA support)
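If no binaural renderer is at hand, a crude stereo preview can be made with two virtual cardioid microphones pointed left and right. This discards height and front-back cues, and the gains below are conventional assumptions, not part of helix_utils:

```python
import numpy as np

def foa_to_stereo_preview(foa: np.ndarray) -> np.ndarray:
    """Virtual cardioid mics at +/-90°; preview only, loses X and Z cues."""
    w, x, y, z = foa          # (4, samples), SN3D channel order as above
    left = 0.5 * (w + y)      # cardioid pointing left
    right = 0.5 * (w - y)     # cardioid pointing right
    return np.stack([left, right])  # (2, samples)
```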
## Training Data
Trained on 100% commercial-safe data:
| Source | License | Samples | Type |
|---|---|---|---|
| FMA | CC BY 4.0 | 2,000 | Music |
| Common Voice | CC0 | 490 | Speech |
| pyroomacoustics | MIT | Generated | RIRs |
Total: 2,490 paired mono→FOA examples
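The pyroomacoustics row refers to simulated room impulse responses (RIRs) used in building the paired examples. A minimal sketch of RIR simulation with pyroomacoustics; the room size, absorption, and positions are arbitrary examples, not Helix's training configuration:

```python
import pyroomacoustics as pra

# Simulate a shoebox room and compute the source-to-mic impulse response
room = pra.ShoeBox(
    [6.0, 4.0, 3.0],              # room dimensions in meters
    fs=24000,                     # match the model's sample rate
    materials=pra.Material(0.3),  # uniform wall absorption
    max_order=10,                 # image-source reflection order
)
room.add_source([2.0, 2.0, 1.5])
room.add_microphone([4.0, 2.0, 1.5])
room.compute_rir()
rir = room.rir[0][0]  # RIR for mic 0, source 0
```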
## Performance
- Inference Speed: ~0.15s for 4-second audio (CPU)
- Model Size: 9.6 MB
- Memory: ~500 MB RAM
- Sample Rate: 24 kHz fixed
- Max Duration: 30 seconds (longer audio will be chunked)
## Limitations

⚠️ Current limitations:

- Fixed 24 kHz sample rate
- Best for speech and music (limited coverage of sound effects)
- Trained on limited spatial diversity
- No real-time streaming (processes full clips)
## Citation

```bibtex
@misc{helix2024,
  title={Helix: Text-Guided Mono to Spatial Audio},
  author={Your Name},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/your-username/helix-v0.7}}
}
```
## License

- Model: CC BY 4.0 (commercial use allowed with attribution)
- Training data: all commercial-safe (CC0, CC BY 4.0, MIT)
## Model Card Contact
For questions or issues, please open an issue on GitHub or contact [your-email].
## Example Applications

- 🎵 Music Production: Pan instruments in 3D space
- 🎙️ Podcasts: Position speakers spatially
- 🎮 Game Audio: Convert mono assets to spatial
- 🎬 Film: Quick spatial audio prototyping
- 🥽 VR/AR: Immersive audio experiences
## Updates
v0.7 (Oct 2024):
- Initial release
- 2.4M parameter UNet1D
- Text-guided spatial positioning
- 100% commercial-safe training
Coming in v0.8:
- 20K training samples (8x more data)
- Better speech quality (Common Voice)
- Sound effects support (Freesound)
- Real spatial recordings (TAU dataset)
- Lower validation loss target: <1.0
Built with PyTorch Lightning | Gradio demo available