# SonicBot

Audio generation and processing inference package based on the Higgs audio model architecture.
## 📦 Package Contents

This package provides complete inference capabilities for Higgs audio models:

- **Core Model Architecture** (`boson_multimodal/model/higgs_audio/`)
  - Dual-channel audio generation model
  - Transformer encoder and decoder
  - Audio feature projector
  - Delay pattern support
  - Multi-codebook audio generation
- **Audio Processing** (`boson_multimodal/audio_processing/`)
  - Higgs Audio Tokenizer (DAC-based)
  - Semantic encoder/decoder
  - Descript Audio Codec (DAC)
  - Vector Quantization (VQ)
- **Data Processing** (`boson_multimodal/data_collator/`, `boson_multimodal/dataset/`)
  - HiggsAudioSampleCollator (batch processing)
  - ChatMLDatasetSample (dialogue data structures)
  - Multi-channel audio token handling
- **Inference Scripts**
  - `infer_single_channel.py` - single-channel audio inference
  - `infer_dual_channel.py` - dual-channel audio generation
## 📁 Directory Structure

```
higgs_audio_inference/
├── boson_multimodal/                 # Core library
│   ├── __init__.py
│   ├── constants.py                  # Token definitions
│   ├── data_types.py                 # ChatML data structures
│   ├── audio_processing/             # Audio tokenizer + vocoder
│   │   ├── higgs_audio_tokenizer.py
│   │   ├── semantic_module.py
│   │   ├── descriptaudiocodec/       # DAC codec
│   │   └── quantization/             # Vector quantization
│   ├── data_collator/                # Data batch processing
│   │   └── higgs_audio_collator.py
│   ├── dataset/                      # Dataset utilities
│   │   └── chatml_dataset.py
│   └── model/
│       └── higgs_audio/              # Core model
│           ├── modeling_higgs_audio.py       # Model implementation
│           ├── configuration_higgs_audio.py  # Configuration classes
│           ├── audio_head.py                 # Decoder projector
│           ├── utils.py                      # Utility functions
│           ├── common.py                     # Base classes
│           ├── custom_modules.py             # Custom layers
│           └── cuda_graph_runner.py          # CUDA optimization
├── infer_single_channel.py           # Single-channel inference script
├── infer_dual_channel.py             # Dual-channel inference script
├── INFERENCE_GUIDE.md                # Detailed inference guide
├── requirements.txt                  # Dependencies
├── pyproject.toml                    # Project configuration
└── README.md                         # This file
```
## 🚀 Quick Start

### 1. Installation

Install dependencies:

```bash
pip install -r requirements.txt
```

**Core Dependencies**:

- PyTorch >= 2.0
- Transformers >= 4.45.1, < 4.47.0
- descript-audio-codec
- librosa, torchaudio
- safetensors
### 2. Prepare Resources

Ensure you have the following:

1. **Model Checkpoint**:
   ```
   path/to/checkpoint/
   ├── config.json
   ├── model.safetensors
   └── ...
   ```
2. **Tokenizer**: auto-downloaded from the HuggingFace Hub
   - Default: `bosonai/higgs-audio-v2-tokenizer`
3. **Test Data** (optional): tokenized dataset
   ```
   dataset/tokenized_data/
   ├── val_manifest.jsonl
   └── tokens/
   ```
### 3. Run Inference

#### Single-Channel Inference

For single-channel audio processing:

```bash
python infer_single_channel.py \
    --checkpoint path/to/checkpoint \
    --dataset-dir path/to/dataset \
    --num-samples 5 \
    --output-dir outputs/results \
    --device cuda \
    --channel-index 0
```

#### Dual-Channel Inference

For dual-channel audio generation (conversational AI):

```bash
python infer_dual_channel.py \
    --checkpoint path/to/checkpoint \
    --dataset-dir path/to/dataset \
    --num-samples 5 \
    --output-dir outputs/results \
    --device cuda \
    --max-frames 500
```

**Key Parameters**:

- `--checkpoint`: Path to the model checkpoint directory
- `--dataset-dir`: Path to the tokenized dataset directory (containing `val_manifest.jsonl`)
- `--num-samples`: Number of validation samples to process
- `--output-dir`: Output directory for generated audio files
- `--device`: Device to use (`cuda` or `cpu`)
- `--max-frames`: Maximum number of audio frames to generate (for speed control)
- `--tokenizer`: Tokenizer repo (default: `bosonai/higgs-audio-v2-tokenizer`)
- `--channel-index`: *(single-channel only)* channel to extract (0 or 1)
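Because the model generates 50 token frames per second, `--max-frames` caps the output duration directly. A quick conversion (the frame rate comes from the model configuration; the helper itself is just an illustration):

```python
AUDIO_TOKEN_FRAME_HZ = 50  # frame rate from the model config

def frames_to_seconds(num_frames: int, frame_hz: int = AUDIO_TOKEN_FRAME_HZ) -> float:
    """Duration of audio covered by a given number of token frames."""
    return num_frames / frame_hz

# --max-frames 500 corresponds to 10 seconds of generated audio
print(frames_to_seconds(500))  # → 10.0
```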
## 💡 Using as a Python Module

Import and use the package from your own Python code:

```python
from boson_multimodal.model.higgs_audio import (
    HiggsAudioModel,
    HiggsAudioConfig,
)
from boson_multimodal.audio_processing import load_higgs_audio_tokenizer
from boson_multimodal.data_collator import HiggsAudioSampleCollator

# Load the model together with its weights (instantiating from a config
# alone would leave the parameters randomly initialized)
model = HiggsAudioModel.from_pretrained("path/to/checkpoint").to("cuda")
model.eval()

# Load the audio tokenizer
tokenizer = load_higgs_audio_tokenizer("bosonai/higgs-audio-v2-tokenizer")

# Create the batch collator (token IDs match the Token Specifications section)
collator = HiggsAudioSampleCollator(
    audio_in_token_id=128015,
    audio_out_token_id=128016,
    audio_stream_bos_id=1024,
    audio_stream_eos_id=1025,
    audio_num_codebooks=8,
    interleave_audio_channels=True,
    audio_token_frame_hz=50,
)

# Run inference (see the inference scripts for details)
```
## 🔧 Configuration

### Model Configuration

Key parameters in `config.json` (comments are shown for explanation only; plain JSON does not allow them):

```jsonc
{
  "audio_num_codebooks": 8,           // Number of audio codebooks
  "audio_codebook_size": 1024,        // Size of each codebook
  "audio_token_frame_hz": 50,         // Frame rate (50 frames per second)
  "interleave_audio_channels": true,  // Interleave dual channels
  "use_delay_pattern": false,         // Whether to use the delay pattern
  "audio_dual_ffn_layers": [...]      // Dual-FFN layer configuration
}
```

### Token Specifications

- **Audio-in token**: 128015 (`<|AUDIO|>`)
- **Audio-out token**: 128016 (`<|AUDIO_OUT|>`)
- **Audio stream BOS**: 1024
- **Audio stream EOS**: 1025
- **Pad token**: 0 or 128001
- **Text vocab size**: ~128,000 (LLaMA-based)
- **Audio vocab size**: 1024 (per codebook)
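To make the stream tokens concrete, here is a minimal sketch (an illustration, not the package's API) of how a sequence of codebook indices is framed with the BOS/EOS stream markers listed above:

```python
AUDIO_STREAM_BOS_ID = 1024  # marks the start of an audio stream
AUDIO_STREAM_EOS_ID = 1025  # marks the end of an audio stream

def wrap_audio_stream(frame_codes: list[int]) -> list[int]:
    """Frame raw codebook indices (0..1023) with stream BOS/EOS markers."""
    assert all(0 <= c < 1024 for c in frame_codes), "codes must fit the codebook"
    return [AUDIO_STREAM_BOS_ID, *frame_codes, AUDIO_STREAM_EOS_ID]

print(wrap_audio_stream([3, 17, 512]))  # → [1024, 3, 17, 512, 1025]
```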
## 🎯 Inference Outputs

The inference scripts generate:

1. **Audio Files** (WAV format)
   - Sample rate: 16000 Hz
   - Single-channel: `output_generated.wav`, `input_groundtruth.wav`
   - Dual-channel: `channel0_input.wav`, `channel1_generated.wav`, `channel1_groundtruth.wav`
2. **Evaluation Metrics** (console + JSON)
   - RMSE (Root Mean Squared Error)
   - MAE (Mean Absolute Error)
   - SNR (Signal-to-Noise Ratio)
   - Correlation coefficient
3. **Metrics JSON**
   - Per-sample metrics
   - Average metrics across all samples
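For reference, these metrics can be computed from a generated waveform and its ground truth roughly as follows (an illustrative numpy sketch; the scripts' exact implementation may differ):

```python
import numpy as np

def waveform_metrics(generated: np.ndarray, reference: np.ndarray) -> dict:
    """Compare two equal-length mono waveforms."""
    err = generated - reference
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    # SNR in dB: reference signal power over error power (epsilon avoids /0)
    snr = float(10 * np.log10(np.sum(reference ** 2) / (np.sum(err ** 2) + 1e-12)))
    corr = float(np.corrcoef(generated, reference)[0, 1])
    return {"rmse": rmse, "mae": mae, "snr_db": snr, "correlation": corr}
```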
## 📊 Choosing the Right Script

### Use `infer_single_channel.py` when:

- ✅ Processing mono audio
- ✅ Audio enhancement tasks
- ✅ Reconstructing audio from tokens
- ✅ Single-speaker scenarios
- ✅ Extracting one channel from stereo

### Use `infer_dual_channel.py` when:

- ✅ Conversational AI (dialogue generation)
- ✅ Turn-taking scenarios
- ✅ Stereo audio processing
- ✅ Multi-speaker systems
- ✅ Generating responses conditioned on an input channel
## 🔍 Troubleshooting

### Issue: Module not found

**Error**: `ModuleNotFoundError: No module named 'boson_multimodal'`

**Solution**: Run from the package root, or add it to the Python path:

```python
import sys
sys.path.insert(0, '/path/to/higgs_audio_inference')
```

### Issue: CUDA out of memory

**Error**: `RuntimeError: CUDA out of memory`

**Solution**:
- Reduce `--max-frames`
- Reduce `--num-samples`
- Fall back to CPU: `--device cpu`

### Issue: Tokenizer download failed

**Error**: The tokenizer cannot be downloaded from the HuggingFace Hub

**Solution**:
- Check your network connection
- Use a mirror: `export HF_ENDPOINT=https://hf-mirror.com`
- Download the tokenizer manually and pass a local path: `--tokenizer /path/to/local/tokenizer`

### Issue: Token shape mismatch

**Error**: "Expected token tensor with shape..."

**Solution**:
- **Single-channel**: tokens must have shape `[8, frames]`; use `--channel-index` to select a channel if needed
- **Dual-channel**: tokens must have shape `[2, 8, frames]`
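If you have dual-channel tokens but need the single-channel `[8, frames]` layout, the fix is a simple index on the channel axis. A sketch with numpy (illustrative only; the scripts operate on their own tensor types):

```python
import numpy as np

def extract_channel(tokens: np.ndarray, channel_index: int) -> np.ndarray:
    """Select one channel from dual-channel tokens of shape [2, 8, frames]."""
    if tokens.shape[:2] != (2, 8):
        raise ValueError(f"expected shape [2, 8, frames], got {tokens.shape}")
    return tokens[channel_index]  # shape [8, frames]

dual = np.zeros((2, 8, 100), dtype=np.int64)
mono = extract_channel(dual, channel_index=0)
print(mono.shape)  # → (8, 100)
```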
## 📚 Documentation

- **Main README**: this file - package overview and quick start
- **Inference Guide**: `INFERENCE_GUIDE.md` - detailed inference documentation
- **Training Reference**: `DUAL_CHANNEL_TRAINING_README.md` - training documentation
## 💬 Common Questions

**Q: Can this be published as a pip package?**

A: Yes. The package includes `pyproject.toml`, so you can build and install it:

```bash
pip install build
python -m build
pip install dist/higgs_audio_inference-*.whl
```

**Q: What's the model size?**

A:
- Code: ~3,800 lines of core code, plus dependencies
- Model weights: depends on the checkpoint (typically hundreds of MB to a few GB)

**Q: Which PyTorch versions are supported?**

A: PyTorch >= 2.0 (2.1+ recommended), with CUDA 11.8+ or 12.1+.

**Q: How do I use this in my project?**

A: Two ways:
1. Command line: `python higgs_audio_inference/infer_*.py ...`
2. Python import: see the "Using as a Python Module" section above
## 💡 Tips

1. **Start small**: test with `--num-samples 1` and `--max-frames 100` first
2. **Use CUDA**: CPU inference is 10-50x slower
3. **Monitor memory**: reduce `--max-frames` if OOM errors occur
4. **Check outputs**: listen to the generated audio to verify quality
5. **Read the guide**: see `INFERENCE_GUIDE.md` for comprehensive documentation
## Acknowledgments

<div align="left">
  <a href="https://www.bitdeer.com/">
    <img src="https://pub-ad90b2169561455ea151c5176b67b638.r2.dev/2025/11/bitdeerai-logo-horizontal.svg" alt="Bitdeer" width="250"/>
  </a>
</div>

This research was supported by **[Bitdeer AI](https://www.bitdeer.ai/)** of Bitdeer Technologies Group through the provision of GPU resources and AI cloud services.