---
license: cc-by-nc-4.0
tags:
- audio-to-audio
pipeline_tag: audio-to-audio
---
[arXiv:2502.04128](https://arxiv.org/abs/2502.04128)
**Update (2025-02-13):** Added [LLaSA fine-tuning instructions](https://github.com/zhenye234/LLaSA_training/tree/main/finetune).

**Update (2025-02-07):** Our paper has been released!
## Paper

- LLaSA: Scaling Train-Time and Inference-Time Compute for LLaMA-based Speech Synthesis
- Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model (AAAI 2025, XCodec 1.0)
# Getting Started with XCodec2 on Hugging Face

XCodec2 is a speech tokenizer with the following key features:

1. **Single vector quantization** (one codebook per frame)
2. **50 tokens per second**
3. **Multilingual speech semantic support and high-quality speech reconstruction**
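As a quick illustration of the token rate: with a single codebook at 50 tokens per second, a clip's token count is simply its duration times 50, with no per-codebook multiplier. A small sketch (the helper name is ours, not part of the package):

```python
# Back-of-envelope token budget: XCodec2 emits 50 tokens per second from a
# single codebook, so the count is duration * 50 with no codebook multiplier.
def xcodec2_token_count(duration_s: float, tokens_per_second: int = 50) -> int:
    # "xcodec2_token_count" is an illustrative helper, not a package API.
    return round(duration_s * tokens_per_second)

print(xcodec2_token_count(10.0))  # 10 s of speech -> 500 tokens
```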
To use `xcodec2`, first install it:

```bash
conda create -n xcodec2 python=3.9
conda activate xcodec2
pip install xcodec2
```

Use `xcodec2==0.1.5` for codec inference and LLaSA fine-tuning; unnecessary dependencies have been removed and it works in our testing, though other issues may still arise. If you prefer more stability, use `xcodec2==0.1.3`, which exactly matches the environment used during codec training.
Then run inference:

```python
import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "HKUSTAudio/xcodec2"
model = XCodec2Model.from_pretrained(model_path)
model.eval().cuda()

wav, sr = sf.read("test.wav")  # must be 16 kHz mono speech
wav_tensor = torch.from_numpy(wav).float().unsqueeze(0)  # shape: (1, T)

with torch.no_grad():
    # Only 16 kHz speech is supported.
    # Only a single input is supported; for batch inference,
    # see the repository linked below.
    vq_code = model.encode_code(input_waveform=wav_tensor)
    print("Code:", vq_code)

    recon_wav = model.decode_code(vq_code).cpu()  # shape: (1, 1, T')

sf.write("reconstructed.wav", recon_wav[0, 0, :].numpy(), sr)
print("Done! Check reconstructed.wav")
```
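Since the model only accepts 16 kHz mono audio, files at other sample rates or with multiple channels need conversion before encoding. A minimal, dependency-free sketch using linear interpolation (the helper names are ours; in practice, prefer a proper resampler such as torchaudio or soxr):

```python
import numpy as np

def to_mono(wav: np.ndarray) -> np.ndarray:
    """Average channels if soundfile returned a (T, C) multichannel array."""
    return wav.mean(axis=1) if wav.ndim == 2 else wav

def resample_linear(wav: np.ndarray, sr_in: int, sr_out: int = 16000) -> np.ndarray:
    """Naive linear-interpolation resampler, for illustration only."""
    if sr_in == sr_out:
        return wav
    n_out = int(round(len(wav) * sr_out / sr_in))
    t_in = np.arange(len(wav)) / sr_in    # input sample times in seconds
    t_out = np.arange(n_out) / sr_out     # output sample times in seconds
    return np.interp(t_out, t_in, wav).astype(np.float32)

wav48 = to_mono(np.zeros((48000, 2), dtype=np.float32))  # 1 s stereo at 48 kHz
wav16 = resample_linear(wav48, 48000)
print(wav16.shape)  # (16000,)
```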
To train your own XCodec2, run batch inference, or perform large-scale code extraction, the code is released [here](https://github.com/zhenye234/X-Codec-2.0).