Chaos96 committed
Commit 93334fc · 1 Parent(s): 71463c1

Add Acknowledgments

Files changed (1)
  1. README.md +333 -3

README.md CHANGED
@@ -1,3 +1,333 @@
- ---
- license: apache-2.0
- ---
+ # SonicBot
+
+ An inference package for audio generation and processing, based on the Higgs audio model architecture.
+
+ ## 📦 Package Contents
+
+ This package provides everything needed to run inference with Higgs audio models:
+
+ - **Core Model Architecture** (`boson_multimodal/model/higgs_audio/`)
+   - Dual-channel audio generation model
+   - Transformer encoder and decoder
+   - Audio feature projector
+   - Delay pattern support
+   - Multi-codebook audio generation
+
+ - **Audio Processing** (`boson_multimodal/audio_processing/`)
+   - Higgs Audio Tokenizer (DAC-based)
+   - Semantic encoder/decoder
+   - Descript Audio Codec (DAC)
+   - Vector Quantization (VQ)
+
+ - **Data Processing** (`boson_multimodal/data_collator/`, `boson_multimodal/dataset/`)
+   - `HiggsAudioSampleCollator` (batch processing)
+   - `ChatMLDatasetSample` (dialogue data structures)
+   - Multi-channel audio token handling
+
+ - **Inference Scripts**
+   - `infer_single_channel.py` - Single-channel audio inference
+   - `infer_dual_channel.py` - Dual-channel audio generation
+
+ ## 📁 Directory Structure
+
+ ```
+ higgs_audio_inference/
+ ├── boson_multimodal/                 # Core library
+ │   ├── __init__.py
+ │   ├── constants.py                  # Token definitions
+ │   ├── data_types.py                 # ChatML data structures
+ │   ├── audio_processing/             # Audio tokenizer + vocoder
+ │   │   ├── higgs_audio_tokenizer.py
+ │   │   ├── semantic_module.py
+ │   │   ├── descriptaudiocodec/       # DAC codec
+ │   │   └── quantization/             # Vector quantization
+ │   ├── data_collator/                # Data batch processing
+ │   │   └── higgs_audio_collator.py
+ │   ├── dataset/                      # Dataset utilities
+ │   │   └── chatml_dataset.py
+ │   └── model/
+ │       └── higgs_audio/              # Core model
+ │           ├── modeling_higgs_audio.py       # Model implementation
+ │           ├── configuration_higgs_audio.py  # Configuration classes
+ │           ├── audio_head.py                 # Decoder projector
+ │           ├── utils.py                      # Utility functions
+ │           ├── common.py                     # Base classes
+ │           ├── custom_modules.py             # Custom layers
+ │           └── cuda_graph_runner.py          # CUDA optimization
+ ├── infer_single_channel.py           # Single-channel inference script
+ ├── infer_dual_channel.py             # Dual-channel inference script
+ ├── INFERENCE_GUIDE.md                # Detailed inference guide
+ ├── requirements.txt                  # Dependencies
+ ├── pyproject.toml                    # Project configuration
+ └── README.md                         # This file
+ ```
+
+ ## 🚀 Quick Start
+
+ ### 1. Installation
+
+ Install dependencies:
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ **Core Dependencies**:
+ - PyTorch >= 2.0
+ - Transformers >= 4.45.1, < 4.47.0
+ - descript-audio-codec
+ - librosa, torchaudio
+ - safetensors
+
+ ### 2. Prepare Resources
+
+ Ensure you have the following:
+
+ 1. **Model Checkpoint**:
+    ```
+    path/to/checkpoint/
+    ├── config.json
+    ├── model.safetensors
+    └── ...
+    ```
+
+ 2. **Tokenizer**: Auto-downloaded from the Hugging Face Hub
+    - Default: `bosonai/higgs-audio-v2-tokenizer`
+
+ 3. **Test Data** (optional): Tokenized dataset
+    ```
+    dataset/tokenized_data/
+    ├── val_manifest.jsonl
+    └── tokens/
+    ```
+
+ ### 3. Run Inference
+
+ #### Single-Channel Inference
+
+ For single-channel audio processing:
+
+ ```bash
+ python infer_single_channel.py \
+     --checkpoint path/to/checkpoint \
+     --dataset-dir path/to/dataset \
+     --num-samples 5 \
+     --output-dir outputs/results \
+     --device cuda \
+     --channel-index 0
+ ```
+
+ #### Dual-Channel Inference
+
+ For dual-channel audio generation (conversational AI):
+
+ ```bash
+ python infer_dual_channel.py \
+     --checkpoint path/to/checkpoint \
+     --dataset-dir path/to/dataset \
+     --num-samples 5 \
+     --output-dir outputs/results \
+     --device cuda \
+     --max-frames 500
+ ```
+
+ **Key Parameters**:
+ - `--checkpoint`: Path to the model checkpoint directory
+ - `--dataset-dir`: Path to the tokenized dataset directory (containing `val_manifest.jsonl`)
+ - `--num-samples`: Number of validation samples to process
+ - `--output-dir`: Output directory for generated audio files
+ - `--device`: Device to use (`cuda` or `cpu`)
+ - `--max-frames`: Maximum number of audio frames to generate (lower values run faster and use less memory)
+ - `--tokenizer`: Tokenizer repo (default: `bosonai/higgs-audio-v2-tokenizer`)
+ - `--channel-index`: *(Single-channel only)* Channel to extract (0 or 1)
+
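When choosing a `--max-frames` value, it can help to convert between frames and seconds of audio. The following is a minimal sketch, assuming the 50 Hz `audio_token_frame_hz` listed under Model Configuration applies to the checkpoint in use. The helper names `frames_to_seconds` and `seconds_to_frames` are illustrative and are not part of this package.

```python
import math

# Assumption: frame rate taken from "audio_token_frame_hz" in config.json.
FRAME_HZ = 50


def frames_to_seconds(frames: int, frame_hz: int = FRAME_HZ) -> float:
    """Duration in seconds covered by a number of audio token frames."""
    return frames / frame_hz


def seconds_to_frames(seconds: float, frame_hz: int = FRAME_HZ) -> int:
    """Smallest number of frames that covers the requested duration."""
    return math.ceil(seconds * frame_hz)


print(frames_to_seconds(500))  # 10.0
print(seconds_to_frames(2.5))  # 125
```

Under this assumption, the `--max-frames 500` used in the dual-channel command corresponds to about 10 seconds of generated audio.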
+ ## 💡 Using as a Python Module
+
+ Import and use in your Python code:
+
+ ```python
+ from boson_multimodal.model.higgs_audio import (
+     HiggsAudioModel,
+     HiggsAudioConfig,
+ )
+ from boson_multimodal.audio_processing import (
+     load_higgs_audio_tokenizer,
+ )
+ from boson_multimodal.data_collator import (
+     HiggsAudioSampleCollator,
+ )
+
+ # Build the model from the checkpoint's config.
+ # Note: constructing from a config alone leaves the weights randomly
+ # initialized; load the checkpoint weights before running inference.
+ config = HiggsAudioConfig.from_pretrained("path/to/checkpoint")
+ model = HiggsAudioModel(config).to("cuda")
+
+ # Load tokenizer
+ tokenizer = load_higgs_audio_tokenizer("bosonai/higgs-audio-v2-tokenizer")
+
+ # Create collator
+ collator = HiggsAudioSampleCollator(
+     audio_in_token_id=128015,
+     audio_out_token_id=128016,
+     audio_stream_bos_id=1024,
+     audio_stream_eos_id=1025,
+     audio_num_codebooks=8,
+     interleave_audio_channels=True,
+     audio_token_frame_hz=50,
+ )
+
+ # Run inference (see inference scripts for details)
+ ```
+
+ ## 🔧 Configuration
+
+ ### Model Configuration
+
+ Key parameters in `config.json` (the `//` comments are explanatory only and are not valid in an actual JSON file):
+
+ ```json
+ {
+   "audio_num_codebooks": 8,           // Number of audio codebooks
+   "audio_codebook_size": 1024,        // Size of each codebook
+   "audio_token_frame_hz": 50,         // Frame rate (50 fps)
+   "interleave_audio_channels": true,  // Interleave dual channels
+   "use_delay_pattern": false,         // Whether to use delay pattern
+   "audio_dual_ffn_layers": [...]      // Dual FFN layer configuration
+ }
+ ```
+
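Before running inference, it can be useful to confirm that a checkpoint's `config.json` carries the audio settings listed above. The following is a minimal sketch using only the standard library. It assumes these keys sit at the top level of `config.json`, which the listing suggests but does not guarantee; `summarize_config` and `EXPECTED_KEYS` are hypothetical names, not part of this package.

```python
import json
from pathlib import Path

# Keys taken from the listing above; a real config.json may contain more.
EXPECTED_KEYS = (
    "audio_num_codebooks",
    "audio_codebook_size",
    "audio_token_frame_hz",
    "interleave_audio_channels",
    "use_delay_pattern",
)


def summarize_config(checkpoint_dir):
    """Return the audio-related settings from <checkpoint_dir>/config.json."""
    config_path = Path(checkpoint_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Assumption: the keys are top-level entries of the JSON object.
    missing = [key for key in EXPECTED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config.json is missing keys: {missing}")
    return {key: config[key] for key in EXPECTED_KEYS}
```

A call such as `summarize_config("path/to/checkpoint")` returns a small dictionary that can be printed or compared against the values expected by the collator.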
198
+ ### Token Specifications
199
+
200
+ - **Audio-in token**: 128015 (`<|AUDIO|>`)
201
+ - **Audio-out token**: 128016 (`<|AUDIO_OUT|>`)
202
+ - **Audio stream BOS**: 1024
203
+ - **Audio stream EOS**: 1025
204
+ - **Pad token**: 0 or 128001
205
+ - **Text vocab size**: ~128000 (LLaMA-based)
206
+ - **Audio vocab size**: 1024 (per codebook)
207
+
208
+ ## 🎯 Inference Outputs
209
+
210
+ The inference scripts generate:
211
+
212
+ 1. **Audio Files** (WAV format)
213
+ - Sample rate: 16000 Hz
214
+ - Single-channel: `output_generated.wav`, `input_groundtruth.wav`
215
+ - Dual-channel: `channel0_input.wav`, `channel1_generated.wav`, `channel1_groundtruth.wav`
216
+
217
+ 2. **Evaluation Metrics** (console + JSON)
218
+ - RMSE (Root Mean Squared Error)
219
+ - MAE (Mean Absolute Error)
220
+ - SNR (Signal-to-Noise Ratio)
221
+ - Correlation coefficient
222
+
223
+ 3. **Metrics JSON**
224
+ - Per-sample metrics
225
+ - Average metrics across all samples
226
+
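The four evaluation metrics listed above can be illustrated with their standard textbook definitions. This is a pure-Python sketch, not the code used by the inference scripts, whose exact definitions are not shown here. It assumes the ground-truth waveform is the signal, the difference between generated and ground-truth samples is the noise, and the longer waveform is truncated to the shorter one. `compare_waveforms` is a hypothetical name.

```python
import math


def compare_waveforms(reference, generated):
    """Compare two sample sequences with RMSE, MAE, SNR (dB) and Pearson correlation."""
    n = min(len(reference), len(generated))
    if n == 0:
        raise ValueError("both waveforms must contain at least one sample")
    ref, gen = list(reference[:n]), list(generated[:n])

    errors = [g - r for r, g in zip(ref, gen)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n

    # SNR: power of the reference divided by power of the error, in decibels.
    signal_power = sum(r * r for r in ref) / n
    if mse == 0:
        snr_db = math.inf
    elif signal_power == 0:
        snr_db = -math.inf
    else:
        snr_db = 10 * math.log10(signal_power / mse)

    # Pearson correlation; undefined (NaN) when either waveform is constant.
    mean_r, mean_g = sum(ref) / n, sum(gen) / n
    cov = sum((r - mean_r) * (g - mean_g) for r, g in zip(ref, gen))
    var_r = sum((r - mean_r) ** 2 for r in ref)
    var_g = sum((g - mean_g) ** 2 for g in gen)
    denom = math.sqrt(var_r * var_g)
    correlation = cov / denom if denom > 0 else math.nan

    return {"rmse": math.sqrt(mse), "mae": mae, "snr_db": snr_db, "correlation": correlation}
```

Averaging the returned dictionaries over all samples gives the kind of per-sample and aggregate summary described for the metrics JSON.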
+ ## 📊 Choosing the Right Script
+
+ ### Use `infer_single_channel.py` when:
+ - ✅ Processing mono audio
+ - ✅ Audio enhancement tasks
+ - ✅ Audio reconstruction from tokens
+ - ✅ Single-speaker scenarios
+ - ✅ Extracting one channel from stereo
+
+ ### Use `infer_dual_channel.py` when:
+ - ✅ Conversational AI (dialogue generation)
+ - ✅ Turn-taking scenarios
+ - ✅ Stereo audio processing
+ - ✅ Multi-speaker systems
+ - ✅ Generating responses conditioned on input
+
+ ## 🔍 Troubleshooting
+
+ ### Issue: Module not found
+
+ **Error**: `ModuleNotFoundError: No module named 'boson_multimodal'`
+
+ **Solution**: Run your code from the package root (`higgs_audio_inference/`), or add that directory to the Python path:
+
+ ```python
+ import sys
+ sys.path.insert(0, '/path/to/higgs_audio_inference')
+ ```
+
+ ### Issue: CUDA out of memory
+
+ **Error**: `RuntimeError: CUDA out of memory`
+
+ **Solution**:
+ - Reduce the `--max-frames` parameter
+ - Reduce `--num-samples`
+ - Use CPU mode: `--device cpu`
+
+ ### Issue: Tokenizer download failed
+
+ **Error**: Cannot download tokenizer from the Hugging Face Hub
+
+ **Solution**:
+ - Check your network connection
+ - Use a mirror endpoint: `export HF_ENDPOINT=https://hf-mirror.com`
+ - Download the tokenizer manually and specify a local path: `--tokenizer /path/to/local/tokenizer`
+
+ ### Issue: Token shape mismatch
+
+ **Error**: "Expected token tensor with shape..."
+
+ **Solution**:
+ - **Single-channel**: Ensure tokens have shape `[8, frames]`; use `--channel-index` to select a channel if needed
+ - **Dual-channel**: Ensure tokens have shape `[2, 8, frames]`
+
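The two expected layouts can be checked before launching a script. The sketch below works on a plain shape tuple, so it runs without PyTorch; with a tensor you would pass `tuple(tokens.shape)`. It assumes 8 codebooks, as in the configuration above, and `classify_token_shape` is a hypothetical helper, not part of this package.

```python
def classify_token_shape(shape, num_codebooks=8):
    """Return "single" for [codebooks, frames] or "dual" for [2, codebooks, frames]."""
    shape = tuple(shape)
    if len(shape) == 2 and shape[0] == num_codebooks:
        return "single"
    if len(shape) == 3 and shape[0] == 2 and shape[1] == num_codebooks:
        return "dual"
    raise ValueError(
        f"Unexpected token shape {shape}: expected [{num_codebooks}, frames] "
        f"or [2, {num_codebooks}, frames]"
    )


print(classify_token_shape((8, 500)))     # single
print(classify_token_shape((2, 8, 500)))  # dual
```

A "dual" result points to `infer_dual_channel.py`, or to `infer_single_channel.py` together with `--channel-index` when only one channel is needed.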
+ ## 📚 Documentation
+
+ - **Main README**: This file - Package overview and quick start
+ - **Inference Guide**: `INFERENCE_GUIDE.md` - Detailed inference documentation
+ - **Training Reference**: `DUAL_CHANNEL_TRAINING_README.md` - Training documentation
+
+ ## 🐛 Common Questions
+
+ **Q: Can this be published as a pip package?**
+
+ A: Yes. The package includes a `pyproject.toml`, so you can build and install it:
+ ```bash
+ pip install build
+ python -m build
+ pip install dist/higgs_audio_inference-*.whl
+ ```
+
+ **Q: What's the model size?**
+
+ A:
+ - Code: ~3800 lines of core code + dependencies
+ - Model weights: Depends on the checkpoint (typically hundreds of MB to a few GB)
+
+ **Q: Which PyTorch versions are supported?**
+
+ A: PyTorch >= 2.0 (2.1+ recommended), with CUDA 11.8+ or 12.1+.
+
+ **Q: How do I use this in my project?**
+
+ A: Two ways:
+ 1. Command-line: `python higgs_audio_inference/infer_*.py ...`
+ 2. Python import: See the "Using as a Python Module" section above
+
+ ## 💡 Tips
+
+ 1. **Start small**: Test with `--num-samples 1` and `--max-frames 100` first
+ 2. **Use CUDA**: CPU inference is 10-50x slower
+ 3. **Monitor memory**: Reduce `--max-frames` if OOM errors occur
+ 4. **Check outputs**: Listen to generated audio to verify quality
+ 5. **Read the guide**: See `INFERENCE_GUIDE.md` for comprehensive documentation
+
+ ## Acknowledgments
+
+ <div align="left">
+   <a href="https://www.bitdeer.com/">
+     <img src="https://pub-ad90b2169561455ea151c5176b67b638.r2.dev/2025/11/fb1fe1d18e52cf4625313b8849645e30.svg" alt="Bitdeer" width="200"/>
+   </a>
+ </div>
+
+ This research was supported by the **Bitdeer AI Team** of [Bitdeer Technologies Group](https://www.bitdeer.com/) through the provision of GPU resources and AI cloud services.