---
license: apache-2.0
language:
- en
pipeline_tag: audio-to-audio
---
# QuarkAudio-HCodec-1.5: A Unified Discrete Audio Tokenizer with Adaptive Frame Rate for High-Fidelity, Multitask Audio Generation
<p align="center">
<a href="https://arxiv.org/pdf/2512.20151">
<img src="https://img.shields.io/badge/Paper-ArXiv-red.svg" alt="Paper">
</a>
<a href="https://github.com/alibaba/unified-audio/tree/main/QuarkAudio-UniSE">
<img src="https://img.shields.io/badge/GitHub-Code-green.svg" alt="GitHub">
</a>
<a href="https://github.com/alibaba/unified-audio/tree/main/QuarkAudio-HCodec/HCodec-1.5/">
<img src="https://img.shields.io/badge/Model-Hugging%20Face-yellow.svg" alt="Hugging Face">
</a>
<a href="https://www.modelscope.cn/models/QuarkAudio/QuarkAudio-HCodec/">
<img src="https://img.shields.io/badge/Model-%20%E9%AD%94%E6%90%AD-orange.svg" alt="ModelScope">
</a>
</p>
<p align="center">
<a href="https://arxiv.org/pdf/2512.20151"><img src="HCodec-1.5.png" width="70%" /></a>
</p>
## 🎯 Quick Start: Run Inference in 3 Minutes
### 1. Installation
1. Install the dependencies from `requirements.txt` via pip (steps 2–3 below)
2. Download the pretrained weights from Hugging Face 🤗: [QuarkAudio/HCodec-1.5-adaptive](https://huggingface.co/QuarkAudio/HCodec-1.5-adaptive) and save them to `./checkpoints/` (see the download sketch after this list)
3. Confirm that the `ckpt_path` in `conf/config_adaptive_v3.yaml` points to the downloaded weights
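If you prefer to script the download, `huggingface_hub` can fetch the whole model repository in one call (a minimal sketch; it assumes `huggingface_hub` is installed, e.g. via `pip install huggingface_hub`):
```python
# Download the pretrained weights into ./checkpoints/ with huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="QuarkAudio/HCodec-1.5-adaptive",
    local_dir="./checkpoints",
)
```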
### 2. Clone Repository
```bash
git clone https://github.com/alibaba/unified-audio.git
cd QuarkAudio-HCodec
```
### 3. Create a Conda environment and install dependencies
```bash
conda create -n unise python=3.10
conda activate unise
pip install -r requirements.txt
```
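A quick sanity check that the environment is usable (this assumes `torch` and `torchaudio` are among the pinned dependencies in `requirements.txt`):
```python
# Report versions of the core audio/ML dependencies and CUDA availability.
import torch
import torchaudio

print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
```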
### 4. Tokenizer
```bash
python audio_tokenizer.py
```
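`audio_tokenizer.py` reads its options from `conf/config_adaptive_v3.yaml`. If you want to call the tokenizer from your own code, the round trip looks roughly like the sketch below; note that `HCodec`, `from_config`, `encode`, and `decode` are hypothetical names used only for illustration, so check `audio_tokenizer.py` for the actual entry points:
```python
# Illustrative encode/decode round trip. `HCodec`, `from_config`, `encode`,
# and `decode` are HYPOTHETICAL names -- consult audio_tokenizer.py for the
# real API exposed by this repository.
import torchaudio
from audio_tokenizer import HCodec  # hypothetical import

model = HCodec.from_config("conf/config_adaptive_v3.yaml")  # hypothetical loader
wav, sr = torchaudio.load("input.wav")  # (channels, samples)

tokens = model.encode(wav)    # discrete tokens at an adaptive frame rate
recon = model.decode(tokens)  # waveform reconstructed from the tokens
torchaudio.save("output.wav", recon, sr)
```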
### 5. Optional configuration
+ Customize the adaptive frame rate options for testing (a conceptual sketch of the grouping logic follows the configuration block below)
```yaml
# hyperparameter configuration in conf/config_adaptive_v3.yaml
training: false # keep false during inference/testing
use_similarity_alignment: true
use_dynamic_similarity_threshold: false
infer_using_dynamic_threshold: true # takes effect only when manual_threshold is null
similarity_threshold: 0.7
similarity_threshold_lower: 0.7
similarity_threshold_upper: 1.0 # valid interval of the dynamic threshold when infer_using_dynamic_threshold is enabled
max_tokens_per_group: 8
manual_threshold: 0.6 # set to a fixed value to evaluate a specific threshold (overrides the dynamic threshold)
```
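To make the threshold semantics concrete, the sketch below shows the typical similarity-based grouping used by adaptive-frame-rate codecs: adjacent frames are merged into one token group while their cosine similarity exceeds the threshold, capped at `max_tokens_per_group` frames, so stationary audio spends fewer tokens than transient audio. This is a conceptual illustration, not the repository's actual implementation:
```python
# Conceptual sketch of similarity-based frame grouping (NOT the repo's code).
import torch
import torch.nn.functional as F

def group_frames(frames: torch.Tensor,
                 similarity_threshold: float = 0.7,
                 max_tokens_per_group: int = 8) -> list[list[int]]:
    """frames: (T, D) frame embeddings; returns groups of frame indices."""
    groups = [[0]]
    for t in range(1, frames.shape[0]):
        # Cosine similarity between adjacent frames decides whether to merge.
        sim = F.cosine_similarity(frames[t - 1], frames[t], dim=0).item()
        if sim > similarity_threshold and len(groups[-1]) < max_tokens_per_group:
            groups[-1].append(t)   # similar enough: extend the current group
        else:
            groups.append([t])     # dissimilar or group full: start a new group
    return groups

# Example: 50 random frames of dimension 256 (random frames rarely merge).
print(group_frames(torch.randn(50, 256)))
```
With `infer_using_dynamic_threshold` enabled, the threshold itself would be chosen within [`similarity_threshold_lower`, `similarity_threshold_upper`] instead of being fixed.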
## 😘 Acknowledgement
We would like to thank the following projects for their great work:
- The adaptive mechanism implementation is based on [FlexiCodec](https://github.com/amphionspace/FlexiCodec) and [VARSTok](https://github.com/FunAudioLLM/FunResearch/tree/main/VARSTok).
- The Transformer implementation is based on [Mimi Codec](https://github.com/kyutai-labs/moshi).