# SonicBot
An inference package for audio generation and processing, based on the Higgs audio model architecture.
## 📦 Package Contents
This package provides complete inference capabilities for Higgs audio models:
- **Core Model Architecture** (`boson_multimodal/model/higgs_audio/`)
  - Dual-channel audio generation model
  - Transformer encoder and decoder
  - Audio feature projector
  - Delay pattern support
  - Multi-codebook audio generation
- **Audio Processing** (`boson_multimodal/audio_processing/`)
  - Higgs Audio Tokenizer (DAC-based)
  - Semantic encoder/decoder
  - Descript Audio Codec (DAC)
  - Vector Quantization (VQ)
- **Data Processing** (`boson_multimodal/data_collator/`, `boson_multimodal/dataset/`)
  - HiggsAudioSampleCollator (batch processing)
  - ChatMLDatasetSample (dialogue data structures)
  - Multi-channel audio token handling
- **Inference Scripts**
  - `infer_single_channel.py` - Single-channel audio inference
  - `infer_dual_channel.py` - Dual-channel audio generation
## 📁 Directory Structure
```
higgs_audio_inference/
├── boson_multimodal/                  # Core library
│   ├── __init__.py
│   ├── constants.py                   # Token definitions
│   ├── data_types.py                  # ChatML data structures
│   ├── audio_processing/              # Audio tokenizer + vocoder
│   │   ├── higgs_audio_tokenizer.py
│   │   ├── semantic_module.py
│   │   ├── descriptaudiocodec/        # DAC codec
│   │   └── quantization/              # Vector quantization
│   ├── data_collator/                 # Data batch processing
│   │   └── higgs_audio_collator.py
│   ├── dataset/                       # Dataset utilities
│   │   └── chatml_dataset.py
│   └── model/
│       └── higgs_audio/               # Core model
│           ├── modeling_higgs_audio.py       # Model implementation
│           ├── configuration_higgs_audio.py  # Configuration classes
│           ├── audio_head.py                 # Decoder projector
│           ├── utils.py                      # Utility functions
│           ├── common.py                     # Base classes
│           ├── custom_modules.py             # Custom layers
│           └── cuda_graph_runner.py          # CUDA optimization
├── infer_single_channel.py            # Single-channel inference script
├── infer_dual_channel.py              # Dual-channel inference script
├── INFERENCE_GUIDE.md                 # Detailed inference guide
├── requirements.txt                   # Dependencies
├── pyproject.toml                     # Project configuration
└── README.md                          # This file
```
## 🚀 Quick Start
### 1. Installation
Install dependencies:
```bash
pip install -r requirements.txt
```
**Core Dependencies**:
- PyTorch >= 2.0
- Transformers >= 4.45.1, < 4.47.0
- descript-audio-codec
- librosa, torchaudio
- safetensors
### 2. Prepare Resources
Ensure you have the following:
1. **Model Checkpoint**:
   ```
   path/to/checkpoint/
   ├── config.json
   ├── model.safetensors
   └── ...
   ```
2. **Tokenizer**: Auto-downloaded from the Hugging Face Hub
   - Default: `bosonai/higgs-audio-v2-tokenizer`
3. **Test Data** (optional): Tokenized dataset
   ```
   dataset/tokenized_data/
   ├── val_manifest.jsonl
   └── tokens/
   ```
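The manifest is a JSONL file with one JSON record per validation sample. A minimal reader might look like the following sketch (the fields inside each record depend on how your data was tokenized, so none are assumed here):

```python
import json
from pathlib import Path

def read_manifest(dataset_dir: str):
    """Yield one parsed record per non-empty line of val_manifest.jsonl."""
    manifest = Path(dataset_dir) / "val_manifest.jsonl"
    with manifest.open() as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```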
### 3. Run Inference
#### Single-Channel Inference
For single-channel audio processing:
```bash
python infer_single_channel.py \
--checkpoint path/to/checkpoint \
--dataset-dir path/to/dataset \
--num-samples 5 \
--output-dir outputs/results \
--device cuda \
--channel-index 0
```
#### Dual-Channel Inference
For dual-channel audio generation (conversational AI):
```bash
python infer_dual_channel.py \
--checkpoint path/to/checkpoint \
--dataset-dir path/to/dataset \
--num-samples 5 \
--output-dir outputs/results \
--device cuda \
--max-frames 500
```
**Key Parameters**:
- `--checkpoint`: Path to model checkpoint directory
- `--dataset-dir`: Path to tokenized dataset directory (containing `val_manifest.jsonl`)
- `--num-samples`: Number of validation samples to process
- `--output-dir`: Output directory for generated audio files
- `--device`: Device to use (`cuda` or `cpu`)
- `--max-frames`: Maximum number of audio frames to generate (caps output length and runtime)
- `--tokenizer`: Tokenizer repo (default: `bosonai/higgs-audio-v2-tokenizer`)
- `--channel-index`: *(Single-channel only)* Channel to extract (0 or 1)
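Because the model emits tokens at 50 frames per second (the `audio_token_frame_hz` value described below), `--max-frames` translates directly into an audio-duration budget. A small helper to convert between the two (illustrative, not part of the package):

```python
# 50 frames/s matches the audio_token_frame_hz default in the model config.
FRAME_HZ = 50

def frames_to_seconds(num_frames: int, frame_hz: int = FRAME_HZ) -> float:
    """Duration of audio covered by `num_frames` codec frames."""
    return num_frames / frame_hz

def seconds_to_frames(seconds: float, frame_hz: int = FRAME_HZ) -> int:
    """Frame budget needed to generate `seconds` of audio."""
    return int(round(seconds * frame_hz))

print(frames_to_seconds(500))  # --max-frames 500 -> 10.0 seconds
```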
## 💡 Using as a Python Module
Import and use in your Python code:
```python
from boson_multimodal.model.higgs_audio import (
HiggsAudioModel,
HiggsAudioConfig
)
from boson_multimodal.audio_processing import (
load_higgs_audio_tokenizer
)
from boson_multimodal.data_collator import (
HiggsAudioSampleCollator
)
# Load model
config = HiggsAudioConfig.from_pretrained("path/to/checkpoint")
model = HiggsAudioModel(config).to("cuda")
# Load tokenizer
tokenizer = load_higgs_audio_tokenizer("bosonai/higgs-audio-v2-tokenizer")
# Create collator
collator = HiggsAudioSampleCollator(
audio_in_token_id=128015,
audio_out_token_id=128016,
audio_stream_bos_id=1024,
audio_stream_eos_id=1025,
audio_num_codebooks=8,
interleave_audio_channels=True,
audio_token_frame_hz=50
)
# Run inference (see inference scripts for details)
```
## 🔧 Configuration
### Model Configuration
Key parameters in `config.json`:
```json
{
"audio_num_codebooks": 8, // Number of audio codebooks
"audio_codebook_size": 1024, // Size of each codebook
"audio_token_frame_hz": 50, // Frame rate (50 fps)
"interleave_audio_channels": true, // Interleave dual channels
"use_delay_pattern": false, // Whether to use delay pattern
"audio_dual_ffn_layers": [...] // Dual FFN layer configuration
}
```
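These values also pin down the token bitrate: with 8 codebooks of 1024 entries (10 bits per code) at 50 frames per second, the token stream costs 8 × 10 × 50 = 4000 bits/s per audio channel. As a quick arithmetic check:

```python
import math

# Values taken from the config.json example above.
audio_num_codebooks = 8
audio_codebook_size = 1024
audio_token_frame_hz = 50

bits_per_frame = audio_num_codebooks * math.log2(audio_codebook_size)  # 80 bits
bitrate_bps = bits_per_frame * audio_token_frame_hz
print(bitrate_bps)  # 4000.0 bits/s (4 kbps) per channel
```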
### Token Specifications
- **Audio-in token**: 128015 (`<|AUDIO|>`)
- **Audio-out token**: 128016 (`<|AUDIO_OUT|>`)
- **Audio stream BOS**: 1024
- **Audio stream EOS**: 1025
- **Pad token**: 0 or 128001
- **Text vocab size**: ~128000 (LLaMA-based)
- **Audio vocab size**: 1024 (per codebook)
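For reference, the IDs above collected as constants (the names here are illustrative; the package defines its own in `boson_multimodal/constants.py`), plus a small validity check that exploits the fact that the stream BOS/EOS IDs sit just past the 0–1023 codebook range:

```python
# Token IDs as listed in the specification above.
AUDIO_IN_TOKEN_ID = 128015    # <|AUDIO|>
AUDIO_OUT_TOKEN_ID = 128016   # <|AUDIO_OUT|>
AUDIO_STREAM_BOS_ID = 1024
AUDIO_STREAM_EOS_ID = 1025
AUDIO_CODEBOOK_SIZE = 1024    # valid audio codes are 0..1023
NUM_CODEBOOKS = 8

def is_audio_code(token_id: int) -> bool:
    """True if `token_id` is a regular audio code rather than BOS/EOS."""
    return 0 <= token_id < AUDIO_CODEBOOK_SIZE
```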
## 🎯 Inference Outputs
The inference scripts generate:
1. **Audio Files** (WAV format)
- Sample rate: 16000 Hz
- Single-channel: `output_generated.wav`, `input_groundtruth.wav`
- Dual-channel: `channel0_input.wav`, `channel1_generated.wav`, `channel1_groundtruth.wav`
2. **Evaluation Metrics** (console + JSON)
- RMSE (Root Mean Squared Error)
- MAE (Mean Absolute Error)
- SNR (Signal-to-Noise Ratio)
- Correlation coefficient
3. **Metrics JSON**
- Per-sample metrics
- Average metrics across all samples
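These metrics can be computed from raw waveforms as in the sketch below (the inference scripts may differ in details such as length alignment or the epsilon used to guard against division by zero):

```python
import numpy as np

def audio_metrics(generated: np.ndarray, reference: np.ndarray) -> dict:
    """Compute RMSE, MAE, SNR (dB), and Pearson correlation of two waveforms."""
    n = min(len(generated), len(reference))  # align lengths before comparing
    g = generated[:n].astype(np.float64)
    r = reference[:n].astype(np.float64)
    err = g - r
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    # SNR in dB: reference power over error power
    snr = float(10.0 * np.log10(np.sum(r ** 2) / (np.sum(err ** 2) + 1e-12)))
    corr = float(np.corrcoef(g, r)[0, 1])
    return {"rmse": rmse, "mae": mae, "snr_db": snr, "correlation": corr}
```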
## 📊 Choosing the Right Script
### Use `infer_single_channel.py` when:
- ✅ Processing mono audio
- ✅ Audio enhancement tasks
- ✅ Audio reconstruction from tokens
- ✅ Single-speaker scenarios
- ✅ Extracting one channel from stereo
### Use `infer_dual_channel.py` when:
- ✅ Conversational AI (dialogue generation)
- ✅ Turn-taking scenarios
- ✅ Stereo audio processing
- ✅ Multi-speaker systems
- ✅ Generating responses conditioned on input
## 🐛 Troubleshooting
### Issue: Module not found
**Error**: `ModuleNotFoundError: No module named 'boson_multimodal'`
**Solution**: Ensure you're in the correct directory or add to Python path:
```python
import sys
sys.path.insert(0, '/path/to/higgs_audio_inference')
```
### Issue: CUDA out of memory
**Error**: `RuntimeError: CUDA out of memory`
**Solution**:
- Reduce `--max-frames` parameter
- Reduce `--num-samples`
- Use CPU mode: `--device cpu`
### Issue: Tokenizer download failed
**Error**: Cannot download tokenizer from HuggingFace Hub
**Solution**:
- Check network connection
- Use proxy: `export HF_ENDPOINT=https://hf-mirror.com`
- Download tokenizer manually and specify local path: `--tokenizer /path/to/local/tokenizer`
### Issue: Token shape mismatch
**Error**: "Expected token tensor with shape..."
**Solution**:
- **Single-channel**: Ensure tokens are `[8, frames]`, use `--channel-index` if needed
- **Dual-channel**: Ensure tokens are `[2, 8, frames]`
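The expected layouts can be sketched as follows; `extract_channel` is a hypothetical helper illustrating what `--channel-index` selects, not part of the package:

```python
import numpy as np

def extract_channel(tokens: np.ndarray, channel_index: int) -> np.ndarray:
    """Pull one channel out of dual-channel tokens shaped [2, codebooks, frames]."""
    if tokens.ndim != 3 or tokens.shape[0] != 2:
        raise ValueError(f"expected [2, codebooks, frames], got {tokens.shape}")
    return tokens[channel_index]

dual = np.zeros((2, 8, 500), dtype=np.int64)  # dummy dual-channel tokens
single = extract_channel(dual, 0)             # what the single-channel script expects
print(single.shape)  # (8, 500)
```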
## 📚 Documentation
- **Main README**: This file - Package overview and quick start
- **Inference Guide**: `INFERENCE_GUIDE.md` - Detailed inference documentation
- **Training Reference**: `DUAL_CHANNEL_TRAINING_README.md` - Training documentation
## 🙋 Common Questions
**Q: Can this be published as a pip package?**
A: Yes. The package includes `pyproject.toml`. You can build and install:
```bash
pip install build
python -m build
pip install dist/higgs_audio_inference-*.whl
```
**Q: What's the model size?**
A:
- Code: ~3,800 lines of core code, plus dependencies
- Model weights: depends on the checkpoint (typically hundreds of MB to a few GB)
**Q: Which PyTorch versions are supported?**
A: PyTorch >= 2.0, recommended 2.1+. CUDA 11.8+ or 12.1+.
**Q: How do I use this in my project?**
A: Two ways:
1. Command-line: `python higgs_audio_inference/infer_*.py ...`
2. Python import: See "Using as a Python Module" section above
## 💡 Tips
1. **Start small**: Test with `--num-samples 1` and `--max-frames 100` first
2. **Use CUDA**: CPU inference is 10-50x slower
3. **Monitor memory**: Reduce `--max-frames` if OOM errors occur
4. **Check outputs**: Listen to generated audio to verify quality
5. **Read the guide**: See `INFERENCE_GUIDE.md` for comprehensive documentation
## Acknowledgments
<div align="left">
<a href="https://www.bitdeer.com/">
<img src="https://pub-ad90b2169561455ea151c5176b67b638.r2.dev/2025/11/bitdeerai-logo-horizontal.svg" alt="Bitdeer" width="250"/>
</a>
</div>
This research was supported by **[Bitdeer AI](https://www.bitdeer.ai/)** of Bitdeer Technologies Group through provision of GPU resources and AI cloud services.