
SonicBot

Audio generation and processing inference package based on the Higgs audio model architecture.

πŸ“¦ Package Contents

This package provides complete inference capabilities for Higgs audio models:

  • Core Model Architecture (boson_multimodal/model/higgs_audio/)

    • Dual-channel audio generation model
    • Transformer encoder and decoder
    • Audio feature projector
    • Delay pattern support
    • Multi-codebook audio generation
  • Audio Processing (boson_multimodal/audio_processing/)

    • Higgs Audio Tokenizer (DAC-based)
    • Semantic encoder/decoder
    • Descript Audio Codec (DAC)
    • Vector Quantization (VQ)
  • Data Processing (boson_multimodal/data_collator/, boson_multimodal/dataset/)

    • HiggsAudioSampleCollator (batch processing)
    • ChatMLDatasetSample (dialogue data structures)
    • Multi-channel audio token handling
  • Inference Scripts

    • infer_single_channel.py - Single-channel audio inference
    • infer_dual_channel.py - Dual-channel audio generation

πŸ“ Directory Structure

higgs_audio_inference/
β”œβ”€β”€ boson_multimodal/              # Core library
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ constants.py               # Token definitions
β”‚   β”œβ”€β”€ data_types.py              # ChatML data structures
β”‚   β”œβ”€β”€ audio_processing/          # Audio tokenizer + vocoder
β”‚   β”‚   β”œβ”€β”€ higgs_audio_tokenizer.py
β”‚   β”‚   β”œβ”€β”€ semantic_module.py
β”‚   β”‚   β”œβ”€β”€ descriptaudiocodec/    # DAC codec
β”‚   β”‚   └── quantization/          # Vector quantization
β”‚   β”œβ”€β”€ data_collator/             # Data batch processing
β”‚   β”‚   └── higgs_audio_collator.py
β”‚   β”œβ”€β”€ dataset/                   # Dataset utilities
β”‚   β”‚   └── chatml_dataset.py
β”‚   └── model/
β”‚       └── higgs_audio/           # Core model
β”‚           β”œβ”€β”€ modeling_higgs_audio.py      # Model implementation
β”‚           β”œβ”€β”€ configuration_higgs_audio.py # Configuration classes
β”‚           β”œβ”€β”€ audio_head.py                # Decoder projector
β”‚           β”œβ”€β”€ utils.py                     # Utility functions
β”‚           β”œβ”€β”€ common.py                    # Base classes
β”‚           β”œβ”€β”€ custom_modules.py            # Custom layers
β”‚           └── cuda_graph_runner.py         # CUDA optimization
β”œβ”€β”€ infer_single_channel.py        # Single-channel inference script
β”œβ”€β”€ infer_dual_channel.py          # Dual-channel inference script
β”œβ”€β”€ INFERENCE_GUIDE.md             # Detailed inference guide
β”œβ”€β”€ requirements.txt               # Dependencies
β”œβ”€β”€ pyproject.toml                 # Project configuration
└── README.md                      # This file

πŸš€ Quick Start

1. Installation

Install dependencies:

pip install -r requirements.txt

Core Dependencies:

  • PyTorch >= 2.0
  • Transformers >= 4.45.1, < 4.47.0
  • descript-audio-codec
  • librosa, torchaudio
  • safetensors
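The Transformers pin is a range, not a single version, so it is easy to drift out of it. As a minimal sketch (the helper names here are illustrative, not part of this package), the constraint can be checked with a plain version comparison:

```python
# Sketch: check the documented Transformers constraint (>= 4.45.1, < 4.47.0).
# Helper names are illustrative; they are not part of boson_multimodal.

def version_tuple(v: str) -> tuple:
    """Parse a dotted version string like '4.46.2' into a comparable tuple."""
    return tuple(int(part) for part in v.split(".") if part.isdigit())

def transformers_version_ok(v: str) -> bool:
    """Return True if v satisfies >= 4.45.1 and < 4.47.0."""
    return version_tuple("4.45.1") <= version_tuple(v) < version_tuple("4.47.0")
```

In practice you would pass in `transformers.__version__` (or `importlib.metadata.version("transformers")`) rather than a literal string.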

2. Prepare Resources

Ensure you have the following:

  1. Model Checkpoint:

    path/to/checkpoint/
    β”œβ”€β”€ config.json
    β”œβ”€β”€ model.safetensors
    └── ...
    
  2. Tokenizer: Auto-downloaded from HuggingFace Hub

    • Default: bosonai/higgs-audio-v2-tokenizer
  3. Test Data (optional): Tokenized dataset

    dataset/tokenized_data/
    β”œβ”€β”€ val_manifest.jsonl
    └── tokens/
    
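Before launching a long inference run, it can save time to verify that the layout above is in place. A minimal sketch (the function name and paths are illustrative, not part of this package):

```python
# Sketch: sanity-check the resource layout described above before running
# inference. check_resources is illustrative, not part of boson_multimodal.
from pathlib import Path

def check_resources(checkpoint_dir: str, dataset_dir: str) -> list:
    """Return the required files/dirs that are missing (empty list = all present)."""
    required = [
        Path(checkpoint_dir) / "config.json",
        Path(checkpoint_dir) / "model.safetensors",
        Path(dataset_dir) / "val_manifest.jsonl",
        Path(dataset_dir) / "tokens",
    ]
    return [str(p) for p in required if not p.exists()]
```

If the returned list is non-empty, fix those paths before invoking the inference scripts.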

3. Run Inference

Single-Channel Inference

For single-channel audio processing:

python infer_single_channel.py \
    --checkpoint path/to/checkpoint \
    --dataset-dir path/to/dataset \
    --num-samples 5 \
    --output-dir outputs/results \
    --device cuda \
    --channel-index 0

Dual-Channel Inference

For dual-channel audio generation (conversational AI):

python infer_dual_channel.py \
    --checkpoint path/to/checkpoint \
    --dataset-dir path/to/dataset \
    --num-samples 5 \
    --output-dir outputs/results \
    --device cuda \
    --max-frames 500

Key Parameters:

  • --checkpoint: Path to model checkpoint directory
  • --dataset-dir: Path to tokenized dataset directory (containing val_manifest.jsonl)
  • --num-samples: Number of validation samples to process
  • --output-dir: Output directory for generated audio files
  • --device: Device to use (cuda or cpu)
  • --max-frames: Maximum number of audio frames to generate (caps generation length; lower values run faster)
  • --tokenizer: Tokenizer repo (default: bosonai/higgs-audio-v2-tokenizer)
  • --channel-index: (Single-channel only) Channel to extract (0 or 1)

πŸ’‘ Using as a Python Module

Import and use in your Python code:

from boson_multimodal.model.higgs_audio import (
    HiggsAudioModel,
    HiggsAudioConfig
)
from boson_multimodal.audio_processing import (
    load_higgs_audio_tokenizer
)
from boson_multimodal.data_collator import (
    HiggsAudioSampleCollator
)

# Load model
config = HiggsAudioConfig.from_pretrained("path/to/checkpoint")
model = HiggsAudioModel(config).to("cuda")

# Load tokenizer
tokenizer = load_higgs_audio_tokenizer("bosonai/higgs-audio-v2-tokenizer")

# Create collator
collator = HiggsAudioSampleCollator(
    audio_in_token_id=128015,
    audio_out_token_id=128016,
    audio_stream_bos_id=1024,
    audio_stream_eos_id=1025,
    audio_num_codebooks=8,
    interleave_audio_channels=True,
    audio_token_frame_hz=50
)

# Run inference (see inference scripts for details)

πŸ”§ Configuration

Model Configuration

Key parameters in config.json:

{
  "audio_num_codebooks": 8,          // Number of audio codebooks
  "audio_codebook_size": 1024,       // Size of each codebook
  "audio_token_frame_hz": 50,        // Frame rate (50 fps)
  "interleave_audio_channels": true, // Interleave dual channels
  "use_delay_pattern": false,        // Whether to use delay pattern
  "audio_dual_ffn_layers": [...]     // Dual FFN layer configuration
}
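These fields can be read straight from the checkpoint's config.json. A small sketch, assuming the defaults match the values documented above (the helper name is illustrative):

```python
# Sketch: inspect the audio-related fields of a checkpoint's config.json.
# read_audio_config is illustrative; defaults mirror the values shown above.
import json

def read_audio_config(path: str) -> dict:
    """Extract the key audio parameters, falling back to documented defaults."""
    with open(path) as f:
        cfg = json.load(f)
    return {
        "audio_num_codebooks": cfg.get("audio_num_codebooks", 8),
        "audio_codebook_size": cfg.get("audio_codebook_size", 1024),
        "audio_token_frame_hz": cfg.get("audio_token_frame_hz", 50),
        "interleave_audio_channels": cfg.get("interleave_audio_channels", True),
        "use_delay_pattern": cfg.get("use_delay_pattern", False),
    }
```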

Token Specifications

  • Audio-in token: 128015 (<|AUDIO|>)
  • Audio-out token: 128016 (<|AUDIO_OUT|>)
  • Audio stream BOS: 1024
  • Audio stream EOS: 1025
  • Pad token: 0 or 128001
  • Text vocab size: ~128000 (LLaMA-based)
  • Audio vocab size: 1024 (per codebook)

🎯 Inference Outputs

The inference scripts generate:

  1. Audio Files (WAV format)

    • Sample rate: 16000 Hz
    • Single-channel: output_generated.wav, input_groundtruth.wav
    • Dual-channel: channel0_input.wav, channel1_generated.wav, channel1_groundtruth.wav
  2. Evaluation Metrics (console + JSON)

    • RMSE (Root Mean Squared Error)
    • MAE (Mean Absolute Error)
    • SNR (Signal-to-Noise Ratio)
    • Correlation coefficient
  3. Metrics JSON

    • Per-sample metrics
    • Average metrics across all samples
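As a rough sketch of what these metrics mean (not the scripts' actual implementation), the four values can be computed with NumPy on a pair of aligned waveforms:

```python
# Sketch: RMSE, MAE, SNR (dB), and Pearson correlation for two aligned
# waveforms. Illustrative only; the scripts' exact implementation may differ.
import numpy as np

def waveform_metrics(generated: np.ndarray, reference: np.ndarray) -> dict:
    """Compare a generated waveform against its ground-truth reference."""
    err = generated - reference
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    # SNR: reference power over error power; epsilon guards a perfect match.
    snr = float(10 * np.log10(np.sum(reference ** 2) / (np.sum(err ** 2) + 1e-12)))
    corr = float(np.corrcoef(generated, reference)[0, 1])
    return {"rmse": rmse, "mae": mae, "snr_db": snr, "correlation": corr}
```

Higher SNR and correlation (and lower RMSE/MAE) indicate closer agreement with the ground truth.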

πŸ“Š Choosing the Right Script

Use infer_single_channel.py when:

  • βœ… Processing mono audio
  • βœ… Audio enhancement tasks
  • βœ… Audio reconstruction from tokens
  • βœ… Single-speaker scenarios
  • βœ… Extracting one channel from stereo

Use infer_dual_channel.py when:

  • βœ… Conversational AI (dialogue generation)
  • βœ… Turn-taking scenarios
  • βœ… Stereo audio processing
  • βœ… Multi-speaker systems
  • βœ… Generating responses conditioned on input

πŸ” Troubleshooting

Issue: Module not found

Error: ModuleNotFoundError: No module named 'boson_multimodal'

Solution: Ensure you're in the correct directory or add to Python path:

import sys
sys.path.insert(0, '/path/to/higgs_audio_inference')

Issue: CUDA out of memory

Error: RuntimeError: CUDA out of memory

Solution:

  • Reduce --max-frames parameter
  • Reduce --num-samples
  • Use CPU mode: --device cpu

Issue: Tokenizer download failed

Error: Cannot download tokenizer from HuggingFace Hub

Solution:

  • Check network connection
  • Use proxy: export HF_ENDPOINT=https://hf-mirror.com
  • Download tokenizer manually and specify local path: --tokenizer /path/to/local/tokenizer

Issue: Token shape mismatch

Error: "Expected token tensor with shape..."

Solution:

  • Single-channel: Ensure tokens are [8, frames], use --channel-index if needed
  • Dual-channel: Ensure tokens are [2, 8, frames]
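A quick way to catch this before it reaches the model is to normalize token arrays up front. A minimal sketch using NumPy (the helper name is illustrative; shapes follow the README's [8, frames] / [2, 8, frames] convention):

```python
# Sketch: normalize token arrays to the shapes the scripts expect.
# extract_channel is illustrative, not part of boson_multimodal.
import numpy as np

def extract_channel(tokens: np.ndarray, channel_index: int = 0) -> np.ndarray:
    """Accept [8, frames] or [2, 8, frames]; return an [8, frames] array."""
    if tokens.ndim == 3 and tokens.shape[0] == 2:
        # Dual-channel input: pick one channel, as --channel-index does.
        tokens = tokens[channel_index]
    if tokens.ndim != 2 or tokens.shape[0] != 8:
        raise ValueError(
            f"Expected token tensor with shape [8, frames], got {tokens.shape}"
        )
    return tokens
```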

πŸ“š Documentation

  • Main README: This file - Package overview and quick start
  • Inference Guide: INFERENCE_GUIDE.md - Detailed inference documentation
  • Training Reference: DUAL_CHANNEL_TRAINING_README.md - Training documentation

πŸ› Common Questions

Q: Can this be published as a pip package?

A: Yes. The package includes pyproject.toml. You can build and install:

pip install build
python -m build
pip install dist/higgs_audio_inference-*.whl

Q: What's the model size?

A:

  • Code: ~3800 lines of core code + dependencies
  • Model weights: Depends on checkpoint (typically hundreds of MB to a few GB)

Q: Which PyTorch versions are supported?

A: PyTorch >= 2.0, recommended 2.1+. CUDA 11.8+ or 12.1+.

Q: How do I use this in my project?

A: Two ways:

  1. Command-line: python higgs_audio_inference/infer_*.py ...
  2. Python import: See "Using as a Python Module" section above

πŸ’‘ Tips

  1. Start small: Test with --num-samples 1 and --max-frames 100 first
  2. Use CUDA: CPU inference is 10-50x slower
  3. Monitor memory: Reduce --max-frames if OOM errors occur
  4. Check outputs: Listen to generated audio to verify quality
  5. Read the guide: See INFERENCE_GUIDE.md for comprehensive documentation

Acknowledgments

This research was supported by Bitdeer AI of Bitdeer Technologies Group through provision of GPU resources and AI cloud services.