aigc-x/SonicBot

Safetensors

higgs_audio


Chaos96 committed on Nov 11, 2025

Commit

93334fc

1 Parent(s): 71463c1

Add Acknowledgments


README.md +333 -3


@@ -1,3 +1,333 @@
----
-license: apache-2.0
----

+# SonicBot
+Audio generation and processing inference package based on the Higgs audio model architecture.
+## 📦 Package Contents
+This package provides complete inference capabilities for Higgs audio models:
+- **Core Model Architecture** (`boson_multimodal/model/higgs_audio/`)
+  - Dual-channel audio generation model
+  - Transformer encoder and decoder
+  - Audio feature projector
+  - Delay pattern support
+  - Multi-codebook audio generation
+- **Audio Processing** (`boson_multimodal/audio_processing/`)
+  - Higgs Audio Tokenizer (DAC-based)
+  - Semantic encoder/decoder
+  - Descript Audio Codec (DAC)
+  - Vector Quantization (VQ)
+- **Data Processing** (`boson_multimodal/data_collator/`, `boson_multimodal/dataset/`)
+  - HiggsAudioSampleCollator (batch processing)
+  - ChatMLDatasetSample (dialogue data structures)
+  - Multi-channel audio token handling
+- **Inference Scripts**
+  - `infer_single_channel.py` - Single-channel audio inference
+  - `infer_dual_channel.py` - Dual-channel audio generation
+## 📁 Directory Structure
+```
+higgs_audio_inference/
+├── boson_multimodal/              # Core library
+│   ├── __init__.py
+│   ├── constants.py               # Token definitions
+│   ├── data_types.py              # ChatML data structures
+│   ├── audio_processing/          # Audio tokenizer + vocoder
+│   │   ├── higgs_audio_tokenizer.py
+│   │   ├── semantic_module.py
+│   │   ├── descriptaudiocodec/    # DAC codec
+│   │   └── quantization/          # Vector quantization
+│   ├── data_collator/             # Data batch processing
+│   │   └── higgs_audio_collator.py
+│   ├── dataset/                   # Dataset utilities
+│   │   └── chatml_dataset.py
+│   └── model/
+│       └── higgs_audio/           # Core model
+│           ├── modeling_higgs_audio.py      # Model implementation
+│           ├── configuration_higgs_audio.py # Configuration classes
+│           ├── audio_head.py                # Decoder projector
+│           ├── utils.py                     # Utility functions
+│           ├── common.py                    # Base classes
+│           ├── custom_modules.py            # Custom layers
+│           └── cuda_graph_runner.py         # CUDA optimization
+├── infer_single_channel.py        # Single-channel inference script
+├── infer_dual_channel.py          # Dual-channel inference script
+├── INFERENCE_GUIDE.md             # Detailed inference guide
+├── requirements.txt               # Dependencies
+├── pyproject.toml                 # Project configuration
+└── README.md                      # This file
+```
+## 🚀 Quick Start
+### 1. Installation
+Install dependencies:
+```bash
+pip install -r requirements.txt
+```
+**Core Dependencies**:
+- PyTorch >= 2.0
+- Transformers >= 4.45.1, < 4.47.0
+- descript-audio-codec
+- librosa, torchaudio
+- safetensors
+### 2. Prepare Resources
+Ensure you have the following:
+1. **Model Checkpoint**:
+   ```
+   path/to/checkpoint/
+   ├── config.json
+   ├── model.safetensors
+   └── ...
+   ```
+2. **Tokenizer**: Auto-downloaded from HuggingFace Hub
+   - Default: `bosonai/higgs-audio-v2-tokenizer`
+3. **Test Data** (optional): Tokenized dataset
+   ```
+   dataset/tokenized_data/
+   ├── val_manifest.jsonl
+   └── tokens/
+   ```
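Before running inference, it can help to confirm that these files are where the scripts will look. The helper below is a minimal sketch and not part of the package: `missing_resources` is a hypothetical name, and it checks only the paths named above (`config.json`, `model.safetensors`, `val_manifest.jsonl`, `tokens/`).

```python
from pathlib import Path


def missing_resources(checkpoint_dir, dataset_dir=None):
    """Return the documented paths that are absent (hypothetical helper)."""
    checkpoint = Path(checkpoint_dir)
    # Files shown in the checkpoint layout above.
    expected = [checkpoint / "config.json", checkpoint / "model.safetensors"]
    if dataset_dir is not None:
        dataset = Path(dataset_dir)
        # Files shown in the optional tokenized dataset layout above.
        expected += [dataset / "val_manifest.jsonl", dataset / "tokens"]
    return [str(path) for path in expected if not path.exists()]


if __name__ == "__main__":
    for path in missing_resources("path/to/checkpoint", "dataset/tokenized_data"):
        print(f"missing: {path}")
```

A checkpoint that stores its weights under a different file name (for example, sharded weights) would be reported as missing `model.safetensors`, so treat the result as a hint rather than a hard failure.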
+### 3. Run Inference
+#### Single-Channel Inference
+For single-channel audio processing:
+```bash
+python infer_single_channel.py \
+    --checkpoint path/to/checkpoint \
+    --dataset-dir path/to/dataset \
+    --num-samples 5 \
+    --output-dir outputs/results \
+    --device cuda \
+    --channel-index 0
+```
+#### Dual-Channel Inference
+For dual-channel audio generation (conversational AI):
+```bash
+python infer_dual_channel.py \
+    --checkpoint path/to/checkpoint \
+    --dataset-dir path/to/dataset \
+    --num-samples 5 \
+    --output-dir outputs/results \
+    --device cuda \
+    --max-frames 500
+```
+**Key Parameters**:
+- `--checkpoint`: Path to model checkpoint directory
+- `--dataset-dir`: Path to tokenized dataset directory (containing `val_manifest.jsonl`)
+- `--num-samples`: Number of validation samples to process
+- `--output-dir`: Output directory for generated audio files
+- `--device`: Device to use (`cuda` or `cpu`)
+- `--max-frames`: Maximum audio frames to generate (for speed control)
+- `--tokenizer`: Tokenizer repo (default: `bosonai/higgs-audio-v2-tokenizer`)
+- `--channel-index`: *(Single-channel only)* Channel to extract (0 or 1)
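For readers who want to wrap or reproduce this command-line interface, the flags above could be declared roughly as follows. This is a hypothetical reconstruction with `argparse`, not the scripts' actual source: `build_parser` is an invented name, and every default except the documented `--tokenizer` value is an assumption.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags documented above."""
    parser = argparse.ArgumentParser(description="Inference flags (sketch)")
    parser.add_argument("--checkpoint", required=True,
                        help="Path to model checkpoint directory")
    parser.add_argument("--dataset-dir", required=True,
                        help="Tokenized dataset directory with val_manifest.jsonl")
    # The defaults below are assumptions made for this sketch.
    parser.add_argument("--num-samples", type=int, default=1,
                        help="Number of validation samples to process")
    parser.add_argument("--output-dir", default="outputs/results",
                        help="Output directory for generated audio files")
    parser.add_argument("--device", choices=["cuda", "cpu"], default="cuda")
    parser.add_argument("--max-frames", type=int, default=None,
                        help="Maximum audio frames to generate")
    # Documented default tokenizer repo.
    parser.add_argument("--tokenizer", default="bosonai/higgs-audio-v2-tokenizer")
    parser.add_argument("--channel-index", type=int, choices=[0, 1], default=0,
                        help="Single-channel only: channel to extract")
    return parser
```

Calling `build_parser().parse_args([...])` with the arguments from the commands above yields a namespace with attributes such as `max_frames` and `channel_index`.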
+## 💡 Using as a Python Module
+Import and use in your Python code:
+```python
+from boson_multimodal.model.higgs_audio import (
+    HiggsAudioModel,
+    HiggsAudioConfig
+)
+from boson_multimodal.audio_processing import (
+    load_higgs_audio_tokenizer
+)
+from boson_multimodal.data_collator import (
+    HiggsAudioSampleCollator
+)
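+# Load the model with its checkpoint weights. HiggsAudioModel(config) alone would
+# leave the weights randomly initialized. This assumes HiggsAudioModel subclasses
+# transformers.PreTrainedModel, as in the upstream Higgs Audio code.
+config = HiggsAudioConfig.from_pretrained("path/to/checkpoint")
+model = HiggsAudioModel.from_pretrained("path/to/checkpoint", config=config).to("cuda")
+model.eval()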
+# Load tokenizer
+tokenizer = load_higgs_audio_tokenizer("bosonai/higgs-audio-v2-tokenizer")
+# Create collator
+collator = HiggsAudioSampleCollator(
+    audio_in_token_id=128015,
+    audio_out_token_id=128016,
+    audio_stream_bos_id=1024,
+    audio_stream_eos_id=1025,
+    audio_num_codebooks=8,
+    interleave_audio_channels=True,
+    audio_token_frame_hz=50
+)
+# Run inference (see inference scripts for details)
+```
+## 🔧 Configuration
+### Model Configuration
+Key parameters in `config.json`:
+```json
+{
+  "audio_num_codebooks": 8,          // Number of audio codebooks
+  "audio_codebook_size": 1024,       // Size of each codebook
+  "audio_token_frame_hz": 50,        // Frame rate (50 fps)
+  "interleave_audio_channels": true, // Interleave dual channels
+  "use_delay_pattern": false,        // Whether to use delay pattern
+  "audio_dual_ffn_layers": [...]     // Dual FFN layer configuration
+}
+```
+### Token Specifications
+- **Audio-in token**: 128015 (`<|AUDIO|>`)
+- **Audio-out token**: 128016 (`<|AUDIO_OUT|>`)
+- **Audio stream BOS**: 1024
+- **Audio stream EOS**: 1025
+- **Pad token**: 0 or 128001
+- **Text vocab size**: ~128000 (LLaMA-based)
+- **Audio vocab size**: 1024 (per codebook)
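The numbers above fit together with a little arithmetic. The sketch below uses only the values listed in this section; the function names are illustrative, and the rule that valid audio IDs are the 1024 codebook entries plus the two stream markers is an inference from those values rather than something the package states.

```python
AUDIO_NUM_CODEBOOKS = 8
AUDIO_CODEBOOK_SIZE = 1024
AUDIO_STREAM_BOS_ID = 1024
AUDIO_STREAM_EOS_ID = 1025
AUDIO_TOKEN_FRAME_HZ = 50


def frames_to_seconds(num_frames, frame_hz=AUDIO_TOKEN_FRAME_HZ):
    """Convert a frame count (e.g. --max-frames) to audio duration in seconds."""
    return num_frames / frame_hz


def tokens_per_second(num_codebooks=AUDIO_NUM_CODEBOOKS,
                      frame_hz=AUDIO_TOKEN_FRAME_HZ):
    """Each frame carries one token per codebook, per channel."""
    return num_codebooks * frame_hz


def is_valid_audio_token(token_id):
    """Assumed rule: codebook entries 0..1023, plus the BOS and EOS markers."""
    return (0 <= token_id < AUDIO_CODEBOOK_SIZE
            or token_id in (AUDIO_STREAM_BOS_ID, AUDIO_STREAM_EOS_ID))
```

Under these values, `--max-frames 500` corresponds to 10 seconds of audio, and one channel produces 400 audio tokens per second.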
+## 🎯 Inference Outputs
+The inference scripts generate:
+1. **Audio Files** (WAV format)
+   - Sample rate: 16000 Hz
+   - Single-channel: `output_generated.wav`, `input_groundtruth.wav`
+   - Dual-channel: `channel0_input.wav`, `channel1_generated.wav`, `channel1_groundtruth.wav`
+2. **Evaluation Metrics** (console + JSON)
+   - RMSE (Root Mean Squared Error)
+   - MAE (Mean Absolute Error)
+   - SNR (Signal-to-Noise Ratio)
+   - Correlation coefficient
+3. **Metrics JSON**
+   - Per-sample metrics
+   - Average metrics across all samples
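To make the four metrics concrete, here is a minimal pure-Python sketch using their standard textbook definitions. The inference scripts may compute them differently (for example, with a different alignment or normalization), and `waveform_metrics` is a hypothetical name.

```python
import math


def waveform_metrics(reference, generated):
    """Compare two waveforms sample by sample (standard definitions, assumed)."""
    n = min(len(reference), len(generated))
    if n == 0:
        raise ValueError("both waveforms must contain at least one sample")
    ref, gen = reference[:n], generated[:n]  # truncate to the common length

    errors = [g - r for r, g in zip(ref, gen)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n

    # SNR in dB: reference power over error power.
    signal_power = sum(r * r for r in ref) / n
    if mse == 0:
        snr_db = float("inf")
    elif signal_power == 0:
        snr_db = float("-inf")
    else:
        snr_db = 10 * math.log10(signal_power / mse)

    # Pearson correlation coefficient; defined as 0.0 here for a constant signal.
    mean_r, mean_g = sum(ref) / n, sum(gen) / n
    cov = sum((r - mean_r) * (g - mean_g) for r, g in zip(ref, gen))
    var_r = sum((r - mean_r) ** 2 for r in ref)
    var_g = sum((g - mean_g) ** 2 for g in gen)
    corr = cov / math.sqrt(var_r * var_g) if var_r > 0 and var_g > 0 else 0.0

    return {"rmse": math.sqrt(mse), "mae": mae, "snr_db": snr_db, "correlation": corr}
```

Lower RMSE and MAE and higher SNR and correlation indicate a closer match to the ground truth. These are sample-level measures, so they are sensitive to small time shifts that may be inaudible.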
+## 📊 Choosing the Right Script
+### Use `infer_single_channel.py` when:
+- ✅ Processing mono audio
+- ✅ Audio enhancement tasks
+- ✅ Audio reconstruction from tokens
+- ✅ Single-speaker scenarios
+- ✅ Extracting one channel from stereo
+### Use `infer_dual_channel.py` when:
+- ✅ Conversational AI (dialogue generation)
+- ✅ Turn-taking scenarios
+- ✅ Stereo audio processing
+- ✅ Multi-speaker systems
+- ✅ Generating responses conditioned on input
+## 🔍 Troubleshooting
+### Issue: Module not found
+**Error**: `ModuleNotFoundError: No module named 'boson_multimodal'`
+**Solution**: Run the script from the `higgs_audio_inference/` directory, or add that directory to the Python path:
+```python
+import sys
+sys.path.insert(0, '/path/to/higgs_audio_inference')
+```
+### Issue: CUDA out of memory
+**Error**: `RuntimeError: CUDA out of memory`
+**Solution**:
+- Reduce `--max-frames` parameter
+- Reduce `--num-samples`
+- Use CPU mode: `--device cpu`
+### Issue: Tokenizer download failed
+**Error**: Cannot download tokenizer from HuggingFace Hub
+**Solution**:
+- Check network connection
+- Use a mirror endpoint: `export HF_ENDPOINT=https://hf-mirror.com`
+- Download tokenizer manually and specify local path: `--tokenizer /path/to/local/tokenizer`
+### Issue: Token shape mismatch
+**Error**: "Expected token tensor with shape..."
+**Solution**:
+- **Single-channel**: Ensure tokens are `[8, frames]`, use `--channel-index` if needed
+- **Dual-channel**: Ensure tokens are `[2, 8, frames]`
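A quick way to catch this before inference is to check the token shape up front. The function below is an illustrative sketch, not the scripts' own validation: `check_token_shape` is a hypothetical name, and the codebook count of 8 comes from the configuration described earlier.

```python
NUM_CODEBOOKS = 8  # audio_num_codebooks from the model configuration


def check_token_shape(shape, dual_channel):
    """Return the frame count, or raise ValueError if the layout is wrong."""
    shape = tuple(shape)
    layout = "[2, 8, frames]" if dual_channel else "[8, frames]"
    expected_rank = 3 if dual_channel else 2
    if len(shape) != expected_rank:
        raise ValueError(f"expected {layout}, got {list(shape)}")
    if dual_channel and shape[0] != 2:
        raise ValueError(f"expected 2 channels, got {shape[0]}")
    if shape[-2] != NUM_CODEBOOKS:
        raise ValueError(f"expected {NUM_CODEBOOKS} codebooks, got {shape[-2]}")
    if shape[-1] < 1:
        raise ValueError("token tensor has no frames")
    return shape[-1]
```

With a PyTorch tensor `tokens`, this would be called as `check_token_shape(tokens.shape, dual_channel=True)`. Indexing a dual-channel tensor as `tokens[0]` or `tokens[1]` gives the `[8, frames]` layout, which is presumably what `--channel-index` selects.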
+## 📚 Documentation
+- **Main README**: This file - Package overview and quick start
+- **Inference Guide**: `INFERENCE_GUIDE.md` - Detailed inference documentation
+- **Training Reference**: `DUAL_CHANNEL_TRAINING_README.md` - Training documentation
+## 🐛 Common Questions
+**Q: Can this be published as a pip package?**
+A: Yes. The package includes `pyproject.toml`. You can build and install:
+```bash
+pip install build
+python -m build
+pip install dist/higgs_audio_inference-*.whl
+```
+**Q: What's the model size?**
+A:
+- Code: ~3800 lines of core code + dependencies
+- Model weights: Depends on checkpoint (typically hundreds of MB to a few GB)
+**Q: Which PyTorch versions are supported?**
+A: PyTorch >= 2.0 is required; 2.1+ is recommended. For GPU inference, use CUDA 11.8+ or 12.1+.
+**Q: How do I use this in my project?**
+A: Two ways:
+1. Command-line: `python higgs_audio_inference/infer_*.py ...`
+2. Python import: See "Using as a Python Module" section above
+## 💡 Tips
+1. **Start small**: Test with `--num-samples 1` and `--max-frames 100` first
+2. **Use CUDA**: CPU inference is 10-50x slower
+3. **Monitor memory**: Reduce `--max-frames` if OOM errors occur
+4. **Check outputs**: Listen to generated audio to verify quality
+5. **Read the guide**: See `INFERENCE_GUIDE.md` for comprehensive documentation
+## Acknowledgments
+<div align="left">
+  <a href="https://www.bitdeer.com/">
+    <img src="https://pub-ad90b2169561455ea151c5176b67b638.r2.dev/2025/11/fb1fe1d18e52cf4625313b8849645e30.svg" alt="Bitdeer" width="200"/>
+  </a>
+</div>
+This research was supported by the **Bitdeer AI Team** of [Bitdeer Technologies Group](https://www.bitdeer.com/) through the provision of GPU resources and AI cloud services.