# SonicBot

An inference package for audio generation and processing, based on the Higgs audio model architecture.

## πŸ“¦ Package Contents

This package provides complete inference capabilities for Higgs audio models:

- **Core Model Architecture** (`boson_multimodal/model/higgs_audio/`)
  - Dual-channel audio generation model
  - Transformer encoder and decoder
  - Audio feature projector
  - Delay pattern support
  - Multi-codebook audio generation

- **Audio Processing** (`boson_multimodal/audio_processing/`)
  - Higgs Audio Tokenizer (DAC-based)
  - Semantic encoder/decoder
  - Descript Audio Codec (DAC)
  - Vector Quantization (VQ)

- **Data Processing** (`boson_multimodal/data_collator/`, `boson_multimodal/dataset/`)
  - HiggsAudioSampleCollator (batch processing)
  - ChatMLDatasetSample (dialogue data structures)
  - Multi-channel audio token handling

- **Inference Scripts**
  - `infer_single_channel.py` - Single-channel audio inference
  - `infer_dual_channel.py` - Dual-channel audio generation

## πŸ“ Directory Structure

```
higgs_audio_inference/
β”œβ”€β”€ boson_multimodal/              # Core library
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ constants.py               # Token definitions
β”‚   β”œβ”€β”€ data_types.py              # ChatML data structures
β”‚   β”œβ”€β”€ audio_processing/          # Audio tokenizer + vocoder
β”‚   β”‚   β”œβ”€β”€ higgs_audio_tokenizer.py
β”‚   β”‚   β”œβ”€β”€ semantic_module.py
β”‚   β”‚   β”œβ”€β”€ descriptaudiocodec/    # DAC codec
β”‚   β”‚   └── quantization/          # Vector quantization
β”‚   β”œβ”€β”€ data_collator/             # Data batch processing
β”‚   β”‚   └── higgs_audio_collator.py
β”‚   β”œβ”€β”€ dataset/                   # Dataset utilities
β”‚   β”‚   └── chatml_dataset.py
β”‚   └── model/
β”‚       └── higgs_audio/           # Core model
β”‚           β”œβ”€β”€ modeling_higgs_audio.py      # Model implementation
β”‚           β”œβ”€β”€ configuration_higgs_audio.py # Configuration classes
β”‚           β”œβ”€β”€ audio_head.py                # Decoder projector
β”‚           β”œβ”€β”€ utils.py                     # Utility functions
β”‚           β”œβ”€β”€ common.py                    # Base classes
β”‚           β”œβ”€β”€ custom_modules.py            # Custom layers
β”‚           └── cuda_graph_runner.py         # CUDA optimization
β”œβ”€β”€ infer_single_channel.py        # Single-channel inference script
β”œβ”€β”€ infer_dual_channel.py          # Dual-channel inference script
β”œβ”€β”€ INFERENCE_GUIDE.md             # Detailed inference guide
β”œβ”€β”€ requirements.txt               # Dependencies
β”œβ”€β”€ pyproject.toml                 # Project configuration
└── README.md                      # This file
```

## πŸš€ Quick Start

### 1. Installation

Install dependencies:

```bash
pip install -r requirements.txt
```

**Core Dependencies**:
- PyTorch >= 2.0
- Transformers >= 4.45.1, < 4.47.0
- descript-audio-codec
- librosa, torchaudio
- safetensors

### 2. Prepare Resources

Ensure you have the following:

1. **Model Checkpoint**:
   ```
   path/to/checkpoint/
   β”œβ”€β”€ config.json
   β”œβ”€β”€ model.safetensors
   └── ...
   ```

2. **Tokenizer**: Auto-downloaded from HuggingFace Hub
   - Default: `bosonai/higgs-audio-v2-tokenizer`

3. **Test Data** (optional): Tokenized dataset
   ```
   dataset/tokenized_data/
   β”œβ”€β”€ val_manifest.jsonl
   └── tokens/
   ```

### 3. Run Inference

#### Single-Channel Inference

For single-channel audio processing:

```bash
python infer_single_channel.py \
    --checkpoint path/to/checkpoint \
    --dataset-dir path/to/dataset \
    --num-samples 5 \
    --output-dir outputs/results \
    --device cuda \
    --channel-index 0
```

#### Dual-Channel Inference

For dual-channel audio generation (conversational AI):

```bash
python infer_dual_channel.py \
    --checkpoint path/to/checkpoint \
    --dataset-dir path/to/dataset \
    --num-samples 5 \
    --output-dir outputs/results \
    --device cuda \
    --max-frames 500
```

**Key Parameters**:
- `--checkpoint`: Path to model checkpoint directory
- `--dataset-dir`: Path to tokenized dataset directory (containing `val_manifest.jsonl`)
- `--num-samples`: Number of validation samples to process
- `--output-dir`: Output directory for generated audio files
- `--device`: Device to use (`cuda` or `cpu`)
- `--max-frames`: Maximum audio frames to generate (for speed control)
- `--tokenizer`: Tokenizer repo (default: `bosonai/higgs-audio-v2-tokenizer`)
- `--channel-index`: *(Single-channel only)* Channel to extract (0 or 1)

## πŸ’‘ Using as a Python Module

Import and use in your Python code:

```python
from boson_multimodal.model.higgs_audio import (
    HiggsAudioModel,
    HiggsAudioConfig,
)
from boson_multimodal.audio_processing import load_higgs_audio_tokenizer
from boson_multimodal.data_collator import HiggsAudioSampleCollator

# Load model
config = HiggsAudioConfig.from_pretrained("path/to/checkpoint")
model = HiggsAudioModel(config).to("cuda")

# Load tokenizer
tokenizer = load_higgs_audio_tokenizer("bosonai/higgs-audio-v2-tokenizer")

# Create collator
collator = HiggsAudioSampleCollator(
    audio_in_token_id=128015,
    audio_out_token_id=128016,
    audio_stream_bos_id=1024,
    audio_stream_eos_id=1025,
    audio_num_codebooks=8,
    interleave_audio_channels=True,
    audio_token_frame_hz=50,
)

# Run inference (see inference scripts for details)
```

## πŸ”§ Configuration

### Model Configuration

Key parameters in `config.json`:

```json
{
  "audio_num_codebooks": 8,          // Number of audio codebooks
  "audio_codebook_size": 1024,       // Size of each codebook
  "audio_token_frame_hz": 50,        // Frame rate (50 fps)
  "interleave_audio_channels": true, // Interleave dual channels
  "use_delay_pattern": false,        // Whether to use delay pattern
  "audio_dual_ffn_layers": [...]     // Dual FFN layer configuration
}
```
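Note that the `//` annotations above are for illustration only; strict JSON does not allow comments, so a real `config.json` contains just the key/value pairs. A minimal sketch of loading and sanity-checking such a config (the helper name and the choice of required keys are assumptions, not part of this package's API):

```python
import json

# Keys taken from the example config above; the actual scripts may
# depend on additional fields.
REQUIRED_KEYS = ("audio_num_codebooks", "audio_codebook_size", "audio_token_frame_hz")

def validate_audio_config(cfg):
    """Raise KeyError if an expected field is missing; return cfg unchanged."""
    missing = [k for k in REQUIRED_KEYS if k not in cfg]
    if missing:
        raise KeyError(f"config.json is missing keys: {missing}")
    return cfg

# Typical use:
#   with open("path/to/checkpoint/config.json") as f:
#       cfg = validate_audio_config(json.load(f))
```

Failing fast on a malformed checkpoint config gives a clearer error than a shape mismatch deep inside the model.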

### Token Specifications

- **Audio-in token**: 128015 (`<|AUDIO|>`)
- **Audio-out token**: 128016 (`<|AUDIO_OUT|>`)
- **Audio stream BOS**: 1024
- **Audio stream EOS**: 1025
- **Pad token**: 0 or 128001
- **Text vocab size**: ~128000 (LLaMA-based)
- **Audio vocab size**: 1024 (per codebook)
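The token IDs above can be mirrored as module-level constants when writing custom processing code. A sketch (the actual `boson_multimodal/constants.py` may organize these differently; `is_audio_placeholder` is a hypothetical helper):

```python
# Token IDs from the table above.
AUDIO_IN_TOKEN_ID = 128015    # <|AUDIO|>
AUDIO_OUT_TOKEN_ID = 128016   # <|AUDIO_OUT|>
AUDIO_STREAM_BOS_ID = 1024
AUDIO_STREAM_EOS_ID = 1025
AUDIO_CODEBOOK_SIZE = 1024    # per-codebook audio vocab size

def is_audio_placeholder(token_id):
    """True if token_id marks an audio in/out position in the text stream."""
    return token_id in (AUDIO_IN_TOKEN_ID, AUDIO_OUT_TOKEN_ID)
```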

## 🎯 Inference Outputs

The inference scripts generate:

1. **Audio Files** (WAV format)
   - Sample rate: 16000 Hz
   - Single-channel: `output_generated.wav`, `input_groundtruth.wav`
   - Dual-channel: `channel0_input.wav`, `channel1_generated.wav`, `channel1_groundtruth.wav`

2. **Evaluation Metrics** (console + JSON)
   - RMSE (Root Mean Squared Error)
   - MAE (Mean Absolute Error)
   - SNR (Signal-to-Noise Ratio)
   - Correlation coefficient

3. **Metrics JSON**
   - Per-sample metrics
   - Average metrics across all samples
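The metrics above are standard waveform comparisons and can be reproduced with NumPy. A minimal sketch, assuming both waveforms are already aligned to the same length and sample rate (the function name is illustrative, not this package's API):

```python
import numpy as np

def waveform_metrics(generated, reference, eps=1e-12):
    """RMSE, MAE, SNR (dB), and Pearson correlation between two waveforms."""
    generated = np.asarray(generated, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    err = generated - reference
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    # SNR: reference signal power over error power, in decibels.
    snr = float(10.0 * np.log10(np.sum(reference ** 2) / (np.sum(err ** 2) + eps)))
    corr = float(np.corrcoef(generated, reference)[0, 1])
    return {"rmse": rmse, "mae": mae, "snr_db": snr, "correlation": corr}
```

A perfect reconstruction gives RMSE/MAE of 0 and correlation of 1; higher SNR is better.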

## πŸ“Š Choosing the Right Script

### Use `infer_single_channel.py` when:
- βœ… Processing mono audio
- βœ… Audio enhancement tasks
- βœ… Audio reconstruction from tokens
- βœ… Single-speaker scenarios
- βœ… Extracting one channel from stereo

### Use `infer_dual_channel.py` when:
- βœ… Conversational AI (dialogue generation)
- βœ… Turn-taking scenarios
- βœ… Stereo audio processing
- βœ… Multi-speaker systems
- βœ… Generating responses conditioned on input

## πŸ” Troubleshooting

### Issue: Module not found

**Error**: `ModuleNotFoundError: No module named 'boson_multimodal'`

**Solution**: Ensure you're in the correct directory or add to Python path:

```python
import sys
sys.path.insert(0, '/path/to/higgs_audio_inference')
```

### Issue: CUDA out of memory

**Error**: `RuntimeError: CUDA out of memory`

**Solution**:
- Reduce `--max-frames` parameter
- Reduce `--num-samples`
- Use CPU mode: `--device cpu`

### Issue: Tokenizer download failed

**Error**: Cannot download tokenizer from HuggingFace Hub

**Solution**:
- Check network connection
- Use proxy: `export HF_ENDPOINT=https://hf-mirror.com`
- Download tokenizer manually and specify local path: `--tokenizer /path/to/local/tokenizer`

### Issue: Token shape mismatch

**Error**: "Expected token tensor with shape..."

**Solution**:
- **Single-channel**: Ensure tokens are `[8, frames]`, use `--channel-index` if needed
- **Dual-channel**: Ensure tokens are `[2, 8, frames]`
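The shape relationship above can be checked directly: selecting one channel of a dual-channel token array yields the single-channel shape, which mirrors what `--channel-index` does. A minimal sketch with placeholder data:

```python
import numpy as np

# Hypothetical token tensors illustrating the expected layouts.
num_codebooks, frames = 8, 120
dual = np.zeros((2, num_codebooks, frames), dtype=np.int64)  # dual-channel: [2, 8, frames]

# Extract one channel for single-channel inference -> [8, frames].
channel_index = 0
single = dual[channel_index]
assert single.shape == (num_codebooks, frames)
```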

## πŸ“š Documentation

- **Main README**: This file - Package overview and quick start
- **Inference Guide**: `INFERENCE_GUIDE.md` - Detailed inference documentation
- **Training Reference**: `DUAL_CHANNEL_TRAINING_README.md` - Training documentation

## πŸ› Common Questions

**Q: Can this be published as a pip package?**

A: Yes. The package includes `pyproject.toml`. You can build and install:
```bash
pip install build
python -m build
pip install dist/higgs_audio_inference-*.whl
```

**Q: What's the model size?**

A:
- Code: roughly 3,800 lines of core code, plus dependencies
- Model weights: depend on the checkpoint (typically hundreds of MB to a few GB)

**Q: Which PyTorch versions are supported?**

A: PyTorch >= 2.0, recommended 2.1+. CUDA 11.8+ or 12.1+.

**Q: How do I use this in my project?**

A: Two ways:
1. Command-line: `python higgs_audio_inference/infer_*.py ...`
2. Python import: See "Using as a Python Module" section above

## πŸ’‘ Tips

1. **Start small**: Test with `--num-samples 1` and `--max-frames 100` first
2. **Use CUDA**: CPU inference is 10-50x slower
3. **Monitor memory**: Reduce `--max-frames` if OOM errors occur
4. **Check outputs**: Listen to generated audio to verify quality
5. **Read the guide**: See `INFERENCE_GUIDE.md` for comprehensive documentation



## Acknowledgments

<div align="left">
  <a href="https://www.bitdeer.com/">
    <img src="https://pub-ad90b2169561455ea151c5176b67b638.r2.dev/2025/11/bitdeerai-logo-horizontal.svg" alt="Bitdeer" width="250"/>
  </a>
</div>


This research was supported by **[Bitdeer AI](https://www.bitdeer.ai/)** of Bitdeer Technologies Group through provision of GPU resources and AI cloud services.