# SonicBot

Audio generation and processing inference package based on the Higgs audio model architecture.
## Package Contents

This package provides complete inference capabilities for Higgs audio models:

**Core Model Architecture** (`boson_multimodal/model/higgs_audio/`)
- Dual-channel audio generation model
- Transformer encoder and decoder
- Audio feature projector
- Delay pattern support
- Multi-codebook audio generation

**Audio Processing** (`boson_multimodal/audio_processing/`)
- Higgs Audio Tokenizer (DAC-based)
- Semantic encoder/decoder
- Descript Audio Codec (DAC)
- Vector Quantization (VQ)

**Data Processing** (`boson_multimodal/data_collator/`, `boson_multimodal/dataset/`)
- HiggsAudioSampleCollator (batch processing)
- ChatMLDatasetSample (dialogue data structures)
- Multi-channel audio token handling

**Inference Scripts**
- `infer_single_channel.py` - single-channel audio inference
- `infer_dual_channel.py` - dual-channel audio generation
## Directory Structure

```
higgs_audio_inference/
├── boson_multimodal/                  # Core library
│   ├── __init__.py
│   ├── constants.py                   # Token definitions
│   ├── data_types.py                  # ChatML data structures
│   ├── audio_processing/              # Audio tokenizer + vocoder
│   │   ├── higgs_audio_tokenizer.py
│   │   ├── semantic_module.py
│   │   ├── descriptaudiocodec/        # DAC codec
│   │   └── quantization/              # Vector quantization
│   ├── data_collator/                 # Data batch processing
│   │   └── higgs_audio_collator.py
│   ├── dataset/                       # Dataset utilities
│   │   └── chatml_dataset.py
│   └── model/
│       └── higgs_audio/               # Core model
│           ├── modeling_higgs_audio.py       # Model implementation
│           ├── configuration_higgs_audio.py  # Configuration classes
│           ├── audio_head.py                 # Decoder projector
│           ├── utils.py                      # Utility functions
│           ├── common.py                     # Base classes
│           ├── custom_modules.py             # Custom layers
│           └── cuda_graph_runner.py          # CUDA optimization
├── infer_single_channel.py            # Single-channel inference script
├── infer_dual_channel.py              # Dual-channel inference script
├── INFERENCE_GUIDE.md                 # Detailed inference guide
├── requirements.txt                   # Dependencies
├── pyproject.toml                     # Project configuration
└── README.md                          # This file
```
## Quick Start

### 1. Installation

Install dependencies:

```shell
pip install -r requirements.txt
```

**Core Dependencies:**
- PyTorch >= 2.0
- Transformers >= 4.45.1, < 4.47.0
- descript-audio-codec
- librosa, torchaudio
- safetensors
### 2. Prepare Resources

Ensure you have the following:

**Model Checkpoint:**

```
path/to/checkpoint/
├── config.json
├── model.safetensors
└── ...
```

**Tokenizer:** auto-downloaded from the HuggingFace Hub
- Default: `bosonai/higgs-audio-v2-tokenizer`

**Test Data (optional):** tokenized dataset

```
dataset/tokenized_data/
├── val_manifest.jsonl
└── tokens/
```
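The manifest is a JSON-Lines file: one JSON object per line. A minimal sketch of reading it, assuming only that each line is valid JSON (the `id` field below is illustrative, not the actual manifest schema):

```python
import json
from pathlib import Path

def load_manifest(path):
    """Read a JSON-Lines manifest: one JSON object per non-empty line."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records

# Demo: write a tiny two-line manifest and read it back.
demo = Path("val_manifest.jsonl")
demo.write_text('{"id": "sample_0"}\n{"id": "sample_1"}\n')
records = load_manifest(demo)
print(len(records))  # 2
```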
### 3. Run Inference

#### Single-Channel Inference

For single-channel audio processing:

```shell
python infer_single_channel.py \
    --checkpoint path/to/checkpoint \
    --dataset-dir path/to/dataset \
    --num-samples 5 \
    --output-dir outputs/results \
    --device cuda \
    --channel-index 0
```

#### Dual-Channel Inference

For dual-channel audio generation (conversational AI):

```shell
python infer_dual_channel.py \
    --checkpoint path/to/checkpoint \
    --dataset-dir path/to/dataset \
    --num-samples 5 \
    --output-dir outputs/results \
    --device cuda \
    --max-frames 500
```

**Key Parameters:**
- `--checkpoint`: path to the model checkpoint directory
- `--dataset-dir`: path to the tokenized dataset directory (containing `val_manifest.jsonl`)
- `--num-samples`: number of validation samples to process
- `--output-dir`: output directory for generated audio files
- `--device`: device to use (`cuda` or `cpu`)
- `--max-frames`: maximum number of audio frames to generate (for speed control)
- `--tokenizer`: tokenizer repo (default: `bosonai/higgs-audio-v2-tokenizer`)
- `--channel-index`: (single-channel only) channel to extract (0 or 1)
## Using as a Python Module

Import and use in your Python code:

```python
from boson_multimodal.model.higgs_audio import (
    HiggsAudioModel,
    HiggsAudioConfig,
)
from boson_multimodal.audio_processing import load_higgs_audio_tokenizer
from boson_multimodal.data_collator import HiggsAudioSampleCollator

# Load model
config = HiggsAudioConfig.from_pretrained("path/to/checkpoint")
model = HiggsAudioModel(config).to("cuda")

# Load tokenizer
tokenizer = load_higgs_audio_tokenizer("bosonai/higgs-audio-v2-tokenizer")

# Create collator
collator = HiggsAudioSampleCollator(
    audio_in_token_id=128015,
    audio_out_token_id=128016,
    audio_stream_bos_id=1024,
    audio_stream_eos_id=1025,
    audio_num_codebooks=8,
    interleave_audio_channels=True,
    audio_token_frame_hz=50,
)

# Run inference (see inference scripts for details)
```
## Configuration

### Model Configuration

Key parameters in `config.json` (the `//` comments below are annotations only; strict JSON does not allow comments):

```jsonc
{
  "audio_num_codebooks": 8,          // Number of audio codebooks
  "audio_codebook_size": 1024,       // Size of each codebook
  "audio_token_frame_hz": 50,        // Frame rate (50 fps)
  "interleave_audio_channels": true, // Interleave dual channels
  "use_delay_pattern": false,        // Whether to use a delay pattern
  "audio_dual_ffn_layers": [...]     // Dual FFN layer configuration
}
```
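A minimal sketch of loading and sanity-checking these fields with the standard library; the inline JSON mirrors the values above (minus the comment annotations and the elided `audio_dual_ffn_layers`):

```python
import json

# Illustrative excerpt of a checkpoint's config.json, using the
# values documented above.
raw = """{
  "audio_num_codebooks": 8,
  "audio_codebook_size": 1024,
  "audio_token_frame_hz": 50,
  "interleave_audio_channels": true,
  "use_delay_pattern": false
}"""

config = json.loads(raw)

# Sanity-check the fields the inference scripts rely on.
assert config["audio_num_codebooks"] == 8
assert config["audio_codebook_size"] == 1024
print(config["audio_token_frame_hz"])  # 50
```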
### Token Specifications

- Audio-in token: 128015 (`<|AUDIO|>`)
- Audio-out token: 128016 (`<|AUDIO_OUT|>`)
- Audio stream BOS: 1024
- Audio stream EOS: 1025
- Pad token: 0 or 128001
- Text vocab size: ~128,000 (LLaMA-based)
- Audio vocab size: 1024 (per codebook)
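Note that the stream BOS/EOS values sit just past the regular codebook range (0..1023). A small sketch of these constants, with a hypothetical helper illustrating the layout:

```python
# Token ID layout from the specification above.
AUDIO_IN_TOKEN_ID = 128015    # <|AUDIO|>
AUDIO_OUT_TOKEN_ID = 128016   # <|AUDIO_OUT|>
AUDIO_STREAM_BOS_ID = 1024
AUDIO_STREAM_EOS_ID = 1025
AUDIO_CODEBOOK_SIZE = 1024    # regular audio codes are 0..1023

def is_special_audio_code(code: int) -> bool:
    """BOS/EOS are out-of-band: they lie outside the codebook range."""
    return code in (AUDIO_STREAM_BOS_ID, AUDIO_STREAM_EOS_ID)

print(is_special_audio_code(1024))  # True
print(is_special_audio_code(512))   # False
```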
## Inference Outputs

The inference scripts generate:

**Audio Files** (WAV format)
- Sample rate: 16000 Hz
- Single-channel: `output_generated.wav`, `input_groundtruth.wav`
- Dual-channel: `channel0_input.wav`, `channel1_generated.wav`, `channel1_groundtruth.wav`

**Evaluation Metrics** (console + JSON)
- RMSE (root mean squared error)
- MAE (mean absolute error)
- SNR (signal-to-noise ratio)
- Correlation coefficient

**Metrics JSON**
- Per-sample metrics
- Average metrics across all samples
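For reference, the four metrics above have standard definitions. This is a sketch using those textbook formulas, not the scripts' actual implementation:

```python
import numpy as np

def audio_metrics(reference, estimate):
    """Standard per-sample metrics: RMSE, MAE, SNR (dB), correlation."""
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    err = ref - est
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    # SNR in dB: reference signal power over error power.
    snr = 10 * np.log10(np.sum(ref ** 2) / np.sum(err ** 2))
    corr = np.corrcoef(ref, est)[0, 1]
    return {"rmse": rmse, "mae": mae, "snr_db": snr, "corr": corr}

t = np.linspace(0, 1, 16000)          # one second at 16 kHz
clean = np.sin(2 * np.pi * 440 * t)   # 440 Hz tone
noisy = clean + 0.01 * np.random.default_rng(0).standard_normal(t.size)
m = audio_metrics(clean, noisy)
print(round(m["snr_db"]))  # roughly 37 dB for this noise level
```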
## Choosing the Right Script

Use `infer_single_channel.py` when:
- Processing mono audio
- Performing audio enhancement tasks
- Reconstructing audio from tokens
- Handling single-speaker scenarios
- Extracting one channel from stereo

Use `infer_dual_channel.py` when:
- Building conversational AI (dialogue generation)
- Modeling turn-taking scenarios
- Processing stereo audio
- Building multi-speaker systems
- Generating responses conditioned on input
## Troubleshooting

### Issue: Module not found

Error: `ModuleNotFoundError: No module named 'boson_multimodal'`

Solution: ensure you are in the correct directory, or add the package to the Python path:

```python
import sys
sys.path.insert(0, '/path/to/higgs_audio_inference')
```

### Issue: CUDA out of memory

Error: `RuntimeError: CUDA out of memory`

Solution:
- Reduce the `--max-frames` parameter
- Reduce `--num-samples`
- Use CPU mode: `--device cpu`

### Issue: Tokenizer download failed

Error: cannot download the tokenizer from the HuggingFace Hub

Solution:
- Check your network connection
- Use a mirror: `export HF_ENDPOINT=https://hf-mirror.com`
- Download the tokenizer manually and specify a local path: `--tokenizer /path/to/local/tokenizer`

### Issue: Token shape mismatch

Error: `"Expected token tensor with shape..."`

Solution:
- Single-channel: ensure tokens have shape `[8, frames]`; use `--channel-index` if needed
- Dual-channel: ensure tokens have shape `[2, 8, frames]`
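The shape rules above can be sketched as a small validation helper; `normalize_tokens` is a hypothetical name illustrating the checks, not a function from the package:

```python
import numpy as np

NUM_CODEBOOKS = 8

def normalize_tokens(tokens, channel_index=0):
    """Return a [NUM_CODEBOOKS, frames] array for single-channel use.

    Accepts either [8, frames] (already mono) or [2, 8, frames]
    (stereo: one channel is selected via channel_index).
    """
    tokens = np.asarray(tokens)
    if tokens.ndim == 2 and tokens.shape[0] == NUM_CODEBOOKS:
        return tokens
    if tokens.ndim == 3 and tokens.shape[:2] == (2, NUM_CODEBOOKS):
        return tokens[channel_index]
    raise ValueError(
        f"Expected [8, frames] or [2, 8, frames], got {tokens.shape}"
    )

mono = normalize_tokens(np.zeros((2, 8, 100)), channel_index=1)
print(mono.shape)  # (8, 100)
```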
## Documentation

- **Main README**: this file - package overview and quick start
- **Inference Guide**: `INFERENCE_GUIDE.md` - detailed inference documentation
- **Training Reference**: `DUAL_CHANNEL_TRAINING_README.md` - training documentation
## Common Questions

**Q: Can this be published as a pip package?**

A: Yes. The package includes `pyproject.toml`, so you can build and install it:

```shell
pip install build
python -m build
pip install dist/higgs_audio_inference-*.whl
```

**Q: What's the model size?**

A:
- Code: ~3800 lines of core code, plus dependencies
- Model weights: depends on the checkpoint (typically hundreds of MB to a few GB)

**Q: Which PyTorch versions are supported?**

A: PyTorch >= 2.0 (2.1+ recommended), with CUDA 11.8+ or 12.1+.

**Q: How do I use this in my project?**

A: Two ways:
- Command line: `python higgs_audio_inference/infer_*.py ...`
- Python import: see "Using as a Python Module" above
## Tips

- **Start small**: test with `--num-samples 1` and `--max-frames 100` first
- **Use CUDA**: CPU inference is 10-50x slower
- **Monitor memory**: reduce `--max-frames` if OOM errors occur
- **Check outputs**: listen to the generated audio to verify quality
- **Read the guide**: see `INFERENCE_GUIDE.md` for comprehensive documentation
## Acknowledgments

This research was supported by Bitdeer AI of Bitdeer Technologies Group through the provision of GPU resources and AI cloud services.