|
|
--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
pipeline_tag: audio-to-audio |
|
|
--- |
|
|
|
|
|
# QuarkAudio-HCodec-1.5: A Unified Discrete Audio Tokenizer with Adaptive Frame Rate for High-Fidelity, Multitask Audio Generation
|
|
|
|
|
<p align="center"> |
|
|
<a href="https://arxiv.org/pdf/2512.20151"> |
|
|
<img src="https://img.shields.io/badge/Paper-ArXiv-red.svg" alt="Paper"> |
|
|
</a> |
|
|
<a href="https://github.com/alibaba/unified-audio/tree/main/QuarkAudio-UniSE"> |
|
|
<img src="https://img.shields.io/badge/GitHub-Code-green.svg" alt="GitHub"> |
|
|
</a> |
|
|
<a href="https://github.com/alibaba/unified-audio/tree/main/QuarkAudio-HCodec/HCodec-1.5/"> |
|
|
<img src="https://img.shields.io/badge/Model-Hugging%20Face-yellow.svg" alt="Hugging Face"> |
|
|
</a> |
|
|
<a href="https://www.modelscope.cn/models/QuarkAudio/QuarkAudio-HCodec/"> |
|
|
<img src="https://img.shields.io/badge/Model-%20%E9%AD%94%E6%90%AD-orange.svg" alt="ModelScope"> |
|
|
</a> |
|
|
</p> |
|
|
|
|
|
<p align="center"> |
|
|
<a href="https://arxiv.org/pdf/2512.20151"><img src="HCodec-1.5.png" width="70%" /></a> |
|
|
</p> |
|
|
|
|
|
|
|
|
## 🎯 Quick Start: Run Inference in 3 Minutes
|
|
### 1. Installation

1. Install dependencies from `requirements.txt` via pip.

2. Download the pretrained weights from Hugging Face 🤗: [QuarkAudio/HCodec-1.5-adaptive](https://huggingface.co/QuarkAudio/HCodec-1.5-adaptive) and save them to `./checkpoints/`.

3. Confirm that the `ckpt_path` in `conf/config_adaptive_v3.yaml` is valid.
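If you prefer the command line, the weights can also be fetched with the Hugging Face CLI. This assumes `huggingface_hub` with its CLI extra is installed (`pip install -U "huggingface_hub[cli]"`); the target directory matches the `./checkpoints/` path expected by the config:

```shell
# Download the pretrained HCodec-1.5 weights into ./checkpoints/
huggingface-cli download QuarkAudio/HCodec-1.5-adaptive --local-dir ./checkpoints/
```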
|
|
|
|
|
|
|
|
### 2. Clone Repository |
|
|
|
|
|
```bash |
|
|
git clone https://github.com/alibaba/unified-audio.git
cd unified-audio/QuarkAudio-HCodec
|
|
``` |
|
|
|
|
|
### 3. Create a Conda environment and install dependencies |
|
|
|
|
|
```bash |
|
|
conda create -n unise python=3.10
conda activate unise
pip install -r requirements.txt
|
|
``` |
|
|
|
|
|
### 4. Tokenizer
|
|
|
|
|
```bash |
|
|
python audio_tokenizer.py
|
|
``` |
|
|
|
|
|
### 5. Optional configuration

+ Customize the adaptive frame rate options for testing
|
|
|
|
|
```yaml |
|
|
# hyperparameter configuration in conf/config_adaptive_v3.yaml

training: false  # keep false when testing
use_similarity_alignment: true
use_dynamic_similarity_threshold: false
infer_using_dynamic_threshold: true  # takes effect only when manual_threshold is null
similarity_threshold: 0.7
similarity_threshold_lower: 0.7
similarity_threshold_upper: 1.0  # valid range of the dynamic threshold when infer_using_dynamic_threshold is enabled
max_tokens_per_group: 8
manual_threshold: 0.6  # set to a fixed value to evaluate a specific threshold
|
|
``` |
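To build intuition for `similarity_threshold` and `max_tokens_per_group`, here is a minimal, self-contained sketch of similarity-based frame grouping. This is not the repository's actual API: the function `group_frames` and the greedy strategy are illustrative assumptions showing how a cosine-similarity threshold and a group-size cap can decide how many consecutive frame embeddings merge into one token group.

```python
import numpy as np

def group_frames(frames, threshold=0.7, max_group=8):
    """Greedily merge consecutive frames into groups.

    A frame joins the current group while its cosine similarity to the
    group's first frame stays >= `threshold`; each group holds at most
    `max_group` frames (cf. max_tokens_per_group).
    """
    groups = []
    start = 0
    for i in range(1, len(frames) + 1):
        if i == len(frames):
            groups.append(frames[start:i])
            break
        a, b = frames[start], frames[i]
        # Cosine similarity between the group anchor and the candidate frame.
        sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        if sim < threshold or i - start >= max_group:
            groups.append(frames[start:i])
            start = i
    return groups

# Two nearly identical frames collapse into one group; the dissimilar
# third frame starts a new one.
frames = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]])
print(len(group_frames(frames, threshold=0.7, max_group=8)))  # 2
```

Raising the threshold (or lowering `max_group`) yields more, smaller groups and therefore a higher effective frame rate; lowering it merges more frames per token.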
|
|
|
|
|
## 🙏 Acknowledgement

We would like to thank the following projects for their great work:
|
|
|
|
|
- The adaptive mechanism implementation is based on the work from [FlexiCodec](https://github.com/amphionspace/FlexiCodec) and [VARSTok](https://github.com/FunAudioLLM/FunResearch/tree/main/VARSTok). |
|
|
- The Transformer implementation is based on the work from [Mimi Codec](https://github.com/kyutai-labs/moshi).