# SonicBot

Audio generation and processing inference package based on the Higgs audio model architecture.

## 📦 Package Contents

This package provides complete inference capabilities for Higgs audio models:

- **Core Model Architecture** (`boson_multimodal/model/higgs_audio/`)
  - Dual-channel audio generation model
  - Transformer encoder and decoder
  - Audio feature projector
  - Delay pattern support
  - Multi-codebook audio generation
- **Audio Processing** (`boson_multimodal/audio_processing/`)
  - Higgs Audio Tokenizer (DAC-based)
  - Semantic encoder/decoder
  - Descript Audio Codec (DAC)
  - Vector Quantization (VQ)
- **Data Processing** (`boson_multimodal/data_collator/`, `boson_multimodal/dataset/`)
  - HiggsAudioSampleCollator (batch processing)
  - ChatMLDatasetSample (dialogue data structures)
  - Multi-channel audio token handling
- **Inference Scripts**
  - `infer_single_channel.py` - Single-channel audio inference
  - `infer_dual_channel.py` - Dual-channel audio generation

## 📁 Directory Structure

```
higgs_audio_inference/
├── boson_multimodal/                        # Core library
│   ├── __init__.py
│   ├── constants.py                         # Token definitions
│   ├── data_types.py                        # ChatML data structures
│   ├── audio_processing/                    # Audio tokenizer + vocoder
│   │   ├── higgs_audio_tokenizer.py
│   │   ├── semantic_module.py
│   │   ├── descriptaudiocodec/              # DAC codec
│   │   └── quantization/                    # Vector quantization
│   ├── data_collator/                       # Data batch processing
│   │   └── higgs_audio_collator.py
│   ├── dataset/                             # Dataset utilities
│   │   └── chatml_dataset.py
│   └── model/
│       └── higgs_audio/                     # Core model
│           ├── modeling_higgs_audio.py      # Model implementation
│           ├── configuration_higgs_audio.py # Configuration classes
│           ├── audio_head.py                # Decoder projector
│           ├── utils.py                     # Utility functions
│           ├── common.py                    # Base classes
│           ├── custom_modules.py            # Custom layers
│           └── cuda_graph_runner.py         # CUDA optimization
├── infer_single_channel.py                  # Single-channel inference script
├── infer_dual_channel.py                    # Dual-channel inference script
├── INFERENCE_GUIDE.md                       # Detailed inference guide
├── requirements.txt                         # Dependencies
├── pyproject.toml                           # Project configuration
└── README.md                                # This file
```

## 🚀 Quick Start

### 1. Installation

Install dependencies:

```bash
pip install -r requirements.txt
```

**Core Dependencies**:

- PyTorch >= 2.0
- Transformers >= 4.45.1, < 4.47.0
- descript-audio-codec
- librosa, torchaudio
- safetensors

### 2. Prepare Resources

Ensure you have the following:

1. **Model Checkpoint**:

   ```
   path/to/checkpoint/
   ├── config.json
   ├── model.safetensors
   └── ...
   ```

2. **Tokenizer**: Auto-downloaded from the HuggingFace Hub
   - Default: `bosonai/higgs-audio-v2-tokenizer`
   - For offline use, pre-fetch it as shown in the sketch after this list

3. **Test Data** (optional): Tokenized dataset

   ```
   dataset/tokenized_data/
   ├── val_manifest.jsonl
   └── tokens/
   ```
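If the tokenizer must be available without network access at run time, it can be pre-fetched with `huggingface_hub` (installed alongside Transformers). A minimal sketch, assuming the default tokenizer repo:

```python
from huggingface_hub import snapshot_download

# Pre-fetch the default tokenizer repo so later runs can resolve it locally.
local_dir = snapshot_download(repo_id="bosonai/higgs-audio-v2-tokenizer")

# Pass the cached path to the inference scripts via --tokenizer.
print(f"Tokenizer available at: {local_dir}")
```

The printed path can then be passed as `--tokenizer /path/to/local/tokenizer`, matching the troubleshooting note further below.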
### 3. Run Inference

#### Single-Channel Inference

For single-channel audio processing:

```bash
python infer_single_channel.py \
    --checkpoint path/to/checkpoint \
    --dataset-dir path/to/dataset \
    --num-samples 5 \
    --output-dir outputs/results \
    --device cuda \
    --channel-index 0
```

#### Dual-Channel Inference

For dual-channel audio generation (conversational AI):

```bash
python infer_dual_channel.py \
    --checkpoint path/to/checkpoint \
    --dataset-dir path/to/dataset \
    --num-samples 5 \
    --output-dir outputs/results \
    --device cuda \
    --max-frames 500
```

**Key Parameters**:

- `--checkpoint`: Path to the model checkpoint directory
- `--dataset-dir`: Path to the tokenized dataset directory (containing `val_manifest.jsonl`)
- `--num-samples`: Number of validation samples to process
- `--output-dir`: Output directory for generated audio files
- `--device`: Device to use (`cuda` or `cpu`)
- `--max-frames`: Maximum number of audio frames to generate (for speed control)
- `--tokenizer`: Tokenizer repo (default: `bosonai/higgs-audio-v2-tokenizer`)
- `--channel-index`: *(single-channel only)* Channel to extract (0 or 1)

## 💡 Using as a Python Module

Import and use the package in your own Python code:

```python
from boson_multimodal.model.higgs_audio import (
    HiggsAudioModel,
    HiggsAudioConfig,
)
from boson_multimodal.audio_processing import load_higgs_audio_tokenizer
from boson_multimodal.data_collator import HiggsAudioSampleCollator

# Load model
config = HiggsAudioConfig.from_pretrained("path/to/checkpoint")
model = HiggsAudioModel(config).to("cuda")

# Load tokenizer
tokenizer = load_higgs_audio_tokenizer("bosonai/higgs-audio-v2-tokenizer")

# Create collator
collator = HiggsAudioSampleCollator(
    audio_in_token_id=128015,
    audio_out_token_id=128016,
    audio_stream_bos_id=1024,
    audio_stream_eos_id=1025,
    audio_num_codebooks=8,
    interleave_audio_channels=True,
    audio_token_frame_hz=50,
)

# Run inference (see the inference scripts for details)
```

## 🔧 Configuration

### Model Configuration

Key parameters in `config.json` (comments added for clarity; not valid in actual JSON):

```json
{
  "audio_num_codebooks": 8,           // Number of audio codebooks
  "audio_codebook_size": 1024,        // Size of each codebook
  "audio_token_frame_hz": 50,         // Frame rate (50 fps)
  "interleave_audio_channels": true,  // Interleave dual channels
  "use_delay_pattern": false,         // Whether to use the delay pattern
  "audio_dual_ffn_layers": [...]      // Dual-FFN layer configuration
}
```

### Token Specifications

- **Audio-in token**: 128015 (`<|AUDIO|>`)
- **Audio-out token**: 128016 (`<|AUDIO_OUT|>`)
- **Audio stream BOS**: 1024
- **Audio stream EOS**: 1025
- **Pad token**: 0 or 128001
- **Text vocab size**: ~128000 (LLaMA-based)
- **Audio vocab size**: 1024 (per codebook)

## 🎯 Inference Outputs

The inference scripts generate:

1. **Audio Files** (WAV format)
   - Sample rate: 16000 Hz
   - Single-channel: `output_generated.wav`, `input_groundtruth.wav`
   - Dual-channel: `channel0_input.wav`, `channel1_generated.wav`, `channel1_groundtruth.wav`

2. **Evaluation Metrics** (console + JSON; see the sketch after this list)
   - RMSE (Root Mean Squared Error)
   - MAE (Mean Absolute Error)
   - SNR (Signal-to-Noise Ratio)
   - Correlation coefficient

3. **Metrics JSON**
   - Per-sample metrics
   - Average metrics across all samples
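For reference, a minimal NumPy sketch of how these four metrics can be computed between a generated waveform and its ground truth; the function name is illustrative, and the actual implementation lives in the inference scripts:

```python
import numpy as np

def waveform_metrics(generated: np.ndarray, reference: np.ndarray) -> dict:
    """Compute RMSE, MAE, SNR (dB), and correlation between two mono waveforms."""
    # Align lengths before comparing.
    n = min(len(generated), len(reference))
    gen, ref = generated[:n], reference[:n]

    residual = gen - ref
    rmse = float(np.sqrt(np.mean(residual**2)))
    mae = float(np.mean(np.abs(residual)))
    # SNR in dB: reference power over residual power (eps avoids division by zero).
    eps = 1e-12
    snr = float(10 * np.log10(np.sum(ref**2) / (np.sum(residual**2) + eps)))
    corr = float(np.corrcoef(gen, ref)[0, 1])
    return {"rmse": rmse, "mae": mae, "snr_db": snr, "correlation": corr}
```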
## 📊 Choosing the Right Script

### Use `infer_single_channel.py` when:

- ✅ Processing mono audio
- ✅ Audio enhancement tasks
- ✅ Audio reconstruction from tokens
- ✅ Single-speaker scenarios
- ✅ Extracting one channel from stereo

### Use `infer_dual_channel.py` when:

- ✅ Conversational AI (dialogue generation)
- ✅ Turn-taking scenarios
- ✅ Stereo audio processing
- ✅ Multi-speaker systems
- ✅ Generating responses conditioned on input

## 🔍 Troubleshooting

### Issue: Module not found

**Error**: `ModuleNotFoundError: No module named 'boson_multimodal'`

**Solution**: Ensure you are in the correct directory, or add the package to the Python path:

```python
import sys
sys.path.insert(0, '/path/to/higgs_audio_inference')
```

### Issue: CUDA out of memory

**Error**: `RuntimeError: CUDA out of memory`

**Solution**:

- Reduce the `--max-frames` parameter
- Reduce `--num-samples`
- Use CPU mode: `--device cpu`

### Issue: Tokenizer download failed

**Error**: Cannot download the tokenizer from the HuggingFace Hub

**Solution**:

- Check your network connection
- Use a mirror: `export HF_ENDPOINT=https://hf-mirror.com`
- Download the tokenizer manually and specify a local path: `--tokenizer /path/to/local/tokenizer`

### Issue: Token shape mismatch

**Error**: "Expected token tensor with shape..."

**Solution** (see the shape-check sketch before the Acknowledgments):

- **Single-channel**: Ensure tokens have shape `[8, frames]`; use `--channel-index` if needed
- **Dual-channel**: Ensure tokens have shape `[2, 8, frames]`

## 📚 Documentation

- **Main README**: This file - package overview and quick start
- **Inference Guide**: `INFERENCE_GUIDE.md` - detailed inference documentation
- **Training Reference**: `DUAL_CHANNEL_TRAINING_README.md` - training documentation

## 🐛 Common Questions

**Q: Can this be published as a pip package?**

A: Yes. The package includes `pyproject.toml`, so you can build and install it:

```bash
pip install build
python -m build
pip install dist/higgs_audio_inference-*.whl
```

**Q: What's the model size?**

A:
- Code: ~3800 lines of core code, plus dependencies
- Model weights: depends on the checkpoint (typically hundreds of MB to a few GB)

**Q: Which PyTorch versions are supported?**

A: PyTorch >= 2.0 (2.1+ recommended), with CUDA 11.8+ or 12.1+.

**Q: How do I use this in my project?**

A: Two ways:

1. Command line: `python higgs_audio_inference/infer_*.py ...`
2. Python import: see the "Using as a Python Module" section above

## 💡 Tips

1. **Start small**: Test with `--num-samples 1` and `--max-frames 100` first
2. **Use CUDA**: CPU inference is 10-50x slower
3. **Monitor memory**: Reduce `--max-frames` if OOM errors occur
4. **Check outputs**: Listen to the generated audio to verify quality
5. **Read the guide**: See `INFERENCE_GUIDE.md` for comprehensive documentation
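Following up on the token-shape issue from Troubleshooting, a minimal sketch of the two expected layouts (assuming PyTorch tensors; `to_single_channel` is an illustrative helper, not part of the package):

```python
import torch

def to_single_channel(tokens: torch.Tensor, channel_index: int = 0) -> torch.Tensor:
    """Normalize token tensors to the single-channel layout [8, frames].

    Dual-channel tokens are stored as [2, 8, frames]; selecting one channel
    mirrors what --channel-index does in infer_single_channel.py.
    """
    if tokens.ndim == 3 and tokens.shape[0] == 2:   # dual-channel: [2, 8, frames]
        return tokens[channel_index]                # -> [8, frames]
    if tokens.ndim == 2 and tokens.shape[0] == 8:   # already single-channel
        return tokens
    raise ValueError(f"Unexpected token shape: {tuple(tokens.shape)}")
```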
## Acknowledgments

This research was supported by **[Bitdeer AI](https://www.bitdeer.ai/)** of Bitdeer Technologies Group through the provision of GPU resources and AI cloud services.