Respair committed on
Commit aebfb20 · verified · 1 Parent(s): 59b7eeb

Update README.md

Files changed (1)
  1. README.md +1 -88
README.md CHANGED
@@ -1,88 +1 @@
- [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2502.04128)
-
- # X-Codec-2.0
- Paper: LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis
-
- **Update (2025-02-13):** Added the [Llasa finetune instructions](https://github.com/zhenye234/LLaSA_training/tree/main/finetune).
-
- **Update (2025-02-07):** Our paper has been released!
-
-
- ## Direct use on Hugging Face
-
- **Codec**: [xcodec2](https://huggingface.co/HKUST-Audio/xcodec2) (Use `xcodec2==0.1.5` for codec inference and Llasa fine-tuning. I have removed unnecessary dependencies and it works in my testing, but other problems may still arise. If you prefer more stability, I recommend `xcodec2==0.1.3`, which exactly matches my codec training.)
-
-
- **Llasa collections**: [Llasa-collections](https://huggingface.co/collections/HKUSTAudio/llasa-679b87dbd06ac556cc0e0f44)
-
- ## Features
-
- - **Single Vector Quantization**
- - 65,536-entry codebook using Finite Scalar Quantization, achieving 99% codebook usage (comparable to text tokenizers; LLaMA 3 uses 128,256)
- - 50×1 tokens per second
-
- - **Multilingual Speech Semantic Support**
- - Uses Wav2Vec2-BERT, a semantic encoder pre-trained on 4.5M hours of unlabeled audio data covering more than 143 languages.
- - Codec trained on 150k hours of multilingual speech data, including Emilia (En/Zh/De/Fr/Ja/Ko) and MLS (En/Fr/De/Nl/Es/It/Pt/Pl).
-
- - **High-Quality Speech Reconstruction**
- - Transformer + Vocos decoder
- - BigCodec encoder
- - Spec discriminator with FFT sizes {78, 126, 206, 334, 542, 876, 1418, 2296}, tailored for the transformer decoder. [Details here](https://openreview.net/pdf?id=4YpMrGfldX)
- - Achieves UTMOS 4.13, WER 2.47 (hubert-large-ls960-ft), SIM 0.82 (wavlm_large_finetune), STOI 0.92, PESQ-NB 3.05, and PESQ-WB 2.44 on LibriSpeech test-clean reconstruction (ground truth: WER 1.96, UTMOS 4.09)
- - Supports 16 kHz speech only
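The headline numbers above can be sanity-checked with a little arithmetic. Note that the FSQ level decomposition below is hypothetical, chosen only so that the per-dimension levels multiply out to the stated 65,536-entry codebook; the paper's actual levels may differ.

```python
# At 16 kHz with 50 tokens per second, each token covers a fixed hop of samples.
sample_rate = 16_000
tokens_per_second = 50
hop_length = sample_rate // tokens_per_second  # 320 samples per token

# Finite Scalar Quantization builds its effective codebook as the product of
# per-dimension levels. These levels are an assumed decomposition (4^8 = 65,536).
levels = [4, 4, 4, 4, 4, 4, 4, 4]
codebook_size = 1
for level in levels:
    codebook_size *= level

print(hop_length)     # 320
print(codebook_size)  # 65536
```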
-
-
- ## Command-line Usage
- ## Setup
- The code is tested with Python 3.9.
-
- Follow these steps to set up your environment:
- 1. Clone this repo
- 2. `conda create --name xcodec2 python=3.9`
- 3. `conda activate xcodec2`
- 4. `pip install -r requirements.txt`
- 5. [Download the pretrained checkpoint here](https://huggingface.co/HKUST-Audio/xcodec2/blob/main/ckpt/epoch%3D4-step%3D1400000.ckpt)
-
-
- ## Inference
- ```bash
- python inference.py
- ```
-
- ## Train
- To train X-Codec-2.0, first prepare your data:
-
- 1. Make a file list:
- ```bash
- python get_tsv.py
- ```
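The internals of `get_tsv.py` are not shown in this README, but a file-list step like this one typically walks an audio directory and writes one path per line. A minimal sketch of that idea, with a hypothetical helper name (`make_file_list`) and the assumption that inputs are `.wav` files:

```python
# Sketch only: not the repo's actual get_tsv.py.
from pathlib import Path

def make_file_list(audio_dir: str, tsv_path: str) -> list:
    """Collect .wav files recursively and write one absolute path per line."""
    paths = sorted(str(p.resolve()) for p in Path(audio_dir).rglob("*.wav"))
    Path(tsv_path).write_text("\n".join(paths) + "\n")
    return paths
```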
-
- 2. Train X-Codec-2.0 with the default settings:
-
- ```bash
- python train.py log_dir=/path/to/log_dir
- ```
-
- ## Large-scale training, batch inference, and large-scale code extraction
-
- Batch inference:
- ```bash
- python inference_save_code.py
- ```
- Training:
- ```bash
- sbatch train_slurm.sh
- ```
-
- Code extraction:
- ```bash
- sbatch large_scale_save_code.sh
- ```
-
- Codes are saved to the output folder, mirroring the subfolder structure of the audio files.
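The "same subfolder structure" mapping described above can be illustrated as follows; the helper name and the `.npy` output extension are assumptions for the sketch, not details taken from the repo:

```python
# Illustrative only: map an input audio path to an output code path that
# mirrors the input's subfolder structure under a different root.
import os

def code_output_path(audio_path: str, audio_root: str, output_root: str) -> str:
    rel = os.path.relpath(audio_path, audio_root)    # e.g. "spk1/utt1.wav"
    base, _ = os.path.splitext(rel)                  # drop the audio extension
    return os.path.join(output_root, base + ".npy")  # assumed code-file extension

print(code_output_path("/data/spk1/utt1.wav", "/data", "/out"))
# /out/spk1/utt1.npy
```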
-
-
-
- ## Acknowledgement
- I would like to extend special thanks to the authors of BigCodec, since our codebase is largely borrowed from [BigCodec](https://github.com/Aria-K-Alethia/BigCodec).

+ It is not fully tested; I only know that training starts.