# X-Codec-2.0
Paper: LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis
**Update (2025-02-13):** Add [Llasa finetune instruction](https://github.com/zhenye234/LLaSA_training/tree/main/finetune).
**Update (2025-02-07):** Our paper has been released!
## Direct Use on Hugging Face
**Codec**: [xcodec2](https://huggingface.co/HKUST-Audio/xcodec2) (use `xcodec2==0.1.5` for codec inference and Llasa fine-tuning; I removed unnecessary dependencies and it works in my testing, though other issues may still surface. If you prefer maximum stability, use `xcodec2==0.1.3`, which exactly matches the environment used during codec training.)
**Llasa-collections**: [Llasa-collections](https://huggingface.co/collections/HKUSTAudio/llasa-679b87dbd06ac556cc0e0f44)
## Features
- **Single Vector Quantization**
- 65,536-entry codebook using Finite Scalar Quantization, achieving 99% codebook usage (comparable to text tokenizers; LLaMA 3's vocabulary is 128,256 tokens)
- 50 tokens per second (a single token stream at 50 Hz)
- **Multilingual Speech Semantic Support**
- Uses Wav2Vec2-BERT, a semantic encoder pre-trained on 4.5M hours of unlabeled audio data covering more than 143 languages.
- Codec trained on 150k hours of multilingual speech data, including Emilia (En/Zh/De/Fr/Ja/Ko) and MLS (En/Fr/De/Nl/Es/It/Pt/Pl).
- **High-Quality Speech Reconstruction**
- Transformer + Vocos Decoder
- BigCodec encoder
- Spec discriminator with FFT sizes {78, 126, 206, 334, 542, 876, 1418, 2296}, tailored for the transformer decoder. [Details here](https://openreview.net/pdf?id=4YpMrGfldX)
- Achieves UTMOS 4.13, WER 2.47 (hubert-large-ls960-ft), speaker similarity 0.82 (wavlm_large_finetune), STOI 0.92, PESQ-NB 3.05, and PESQ-WB 2.44 on LibriSpeech test-clean reconstruction (ground truth: WER 1.96, UTMOS 4.09)
- Supports 16 kHz speech only
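
Because the 65,536-way code comes from Finite Scalar Quantization rather than a learned codebook, each code index is just a mixed-radix combination of per-dimension levels, so every index is reachable by construction. A toy sketch, assuming purely for illustration 8 latent dimensions with 4 levels each (4^8 = 65536; the actual level configuration used here may differ):

```python
# Toy FSQ indexing. LEVELS is a hypothetical configuration whose product
# gives a 65,536-entry codebook; the codec's real levels may differ.
LEVELS = [4] * 8

def fsq_index(levels_per_dim):
    """Pack per-dimension level choices into one codebook index (mixed radix)."""
    idx = 0
    for q, base in zip(levels_per_dim, LEVELS):
        idx = idx * base + q
    return idx

def fsq_unindex(idx):
    """Unpack a codebook index back into per-dimension level choices."""
    out = []
    for base in reversed(LEVELS):
        out.append(idx % base)
        idx //= base
    return out[::-1]

codebook_size = 1
for base in LEVELS:
    codebook_size *= base

print(codebook_size)                                     # 65536
print(fsq_unindex(fsq_index([1, 3, 0, 2, 2, 0, 3, 1])))  # [1, 3, 0, 2, 2, 0, 3, 1]
```

Since no index can go unused by construction, this is consistent with the near-100% codebook usage reported above.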
## Command-line Usage
## Setup
The code is tested with Python 3.9.
Follow these steps to set up your environment:
1. Clone this repo
2. `conda create --name xcodec2 python=3.9`
3. `conda activate xcodec2`
4. `pip install -r requirements.txt`
5. [Download the pretrained checkpoint here](https://huggingface.co/HKUST-Audio/xcodec2/blob/main/ckpt/epoch%3D4-step%3D1400000.ckpt)
## Inference
```bash
python inference.py
```
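
`inference.py` handles single-file reconstruction. For programmatic use, a sketch along the lines of the usage shown on the Hugging Face model card; verify the exact API against the `xcodec2` version you install, and treat the file names as placeholders:

```python
import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model

model = XCodec2Model.from_pretrained("HKUST-Audio/xcodec2")
model.eval()

wav, sr = sf.read("test.wav")  # expects 16 kHz mono speech
wav_tensor = torch.from_numpy(wav).float().unsqueeze(0)

with torch.no_grad():
    vq_code = model.encode_code(input_waveform=wav_tensor)  # ~50 tokens per second
    recon = model.decode_code(vq_code)

# Adjust the indexing below to the returned tensor shape of your version.
sf.write("reconstructed.wav", recon[0, 0, :].cpu().numpy(), 16000)
```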
## Train
To train X-Codec-2.0, first prepare your data:
1. Make a file list by:
```bash
python get_tsv.py
```
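
The exact TSV format `get_tsv.py` emits isn't documented here; conceptually, it enumerates your audio files into a file list. A generic sketch of that step (the single-column layout is an assumption):

```python
# Generic file-list builder, a stand-in for what get_tsv.py does:
# walk an audio root and write one wav path per row to a TSV.
import csv
from pathlib import Path

def write_filelist(audio_root: str, out_tsv: str) -> int:
    """Write every .wav path under audio_root to out_tsv; return the count."""
    files = sorted(str(p) for p in Path(audio_root).rglob("*.wav"))
    with open(out_tsv, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        for path in files:
            writer.writerow([path])
    return len(files)
```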
2. Train X-Codec-2.0 with the default settings:
```bash
python train.py log_dir=/path/to/log_dir
```
## Large-Scale Training, Batch Inference, and Code Extraction
Batch inference:
```bash
python inference_save_code.py
```
Training:
```bash
sbatch train_slurm.sh
```
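
`train_slurm.sh` is cluster-specific. A hypothetical minimal shape is sketched below; every directive is a placeholder to adapt, and none of these values come from this repo:

```shell
#!/bin/bash
#SBATCH --job-name=xcodec2
#SBATCH --nodes=1
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=16
#SBATCH --time=72:00:00
# Launch the same entry point as single-node training.
python train.py log_dir=/path/to/log_dir
```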
Code extraction:
```bash
sbatch large_scale_save_code.sh
```
Codes are saved to the output folder, mirroring the subfolder structure of the audio files.
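
That path mapping can be sketched as follows (the `.npy` extension here is an assumption about the saved code format, not something stated by the repo):

```python
# Mirror an audio file's relative path under the output root.
from pathlib import Path

def code_path(audio_root: str, out_root: str, audio_file: str) -> Path:
    rel = Path(audio_file).relative_to(audio_root)
    return Path(out_root) / rel.with_suffix(".npy")  # extension is an assumption

print(code_path("/data/audio", "/data/codes", "/data/audio/spk1/chap1/utt1.wav"))
# /data/codes/spk1/chap1/utt1.npy
```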
## Acknowledgement
Special thanks to the authors of BigCodec; our codebase is largely adapted from [BigCodec](https://github.com/Aria-K-Alethia/BigCodec).
**Note:** The training code is not fully tested; I have only verified that training starts.