---
license: cc-by-nc-4.0
tags:
- audio-to-audio
pipeline_tag: audio-to-audio
---
[arXiv:2502.04128](https://arxiv.org/abs/2502.04128)
**Update (2025-02-13):** Added [LLaSA fine-tuning instructions](https://github.com/zhenye234/LLaSA_training/tree/main/finetune).

**Update (2025-02-07):** Our paper has been released!
## Paper

- LLaSA: Scaling Train-Time and Inference-Time Compute for LLaMA-based Speech Synthesis
- Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model (AAAI 2025, XCodec 1.0)
# Getting Started with XCodec2 on Hugging Face

XCodec2 is a speech tokenizer with the following key features:

1. **Single vector quantization** (one codebook per frame)
2. **50 tokens per second**
3. **Multilingual speech semantic support and high-quality speech reconstruction**
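As a quick illustration of the token rate: with a single codebook at 50 tokens per second, a clip's token count is simply its duration times 50, with no per-codebook multiplier. A small sketch (the helper name is ours, not part of the package):

```python
# Back-of-envelope token budget: XCodec2 emits 50 tokens per second from a
# single codebook, so the count is duration * 50 with no codebook multiplier.
def xcodec2_token_count(duration_s: float, tokens_per_second: int = 50) -> int:
    # "xcodec2_token_count" is an illustrative helper, not a package API.
    return round(duration_s * tokens_per_second)

print(xcodec2_token_count(10.0))  # 10 s of speech -> 500 tokens
```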
To use `xcodec2`, first install it:

```bash
conda create -n xcodec2 python=3.9
conda activate xcodec2
pip install xcodec2
```

Use `xcodec2==0.1.5` for codec inference and LLaSA fine-tuning; unnecessary dependencies have been removed and it works in our testing, though other issues may still arise. If you prefer more stability, use `xcodec2==0.1.3`, which exactly matches the environment used during codec training.
Then run inference:

```python
import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "HKUSTAudio/xcodec2"
model = XCodec2Model.from_pretrained(model_path)
model.eval().cuda()

wav, sr = sf.read("test.wav")  # must be 16 kHz mono speech
wav_tensor = torch.from_numpy(wav).float().unsqueeze(0)  # shape: (1, T)

with torch.no_grad():
    # Only 16 kHz speech is supported.
    # Only a single input is supported; for batch inference,
    # see the repository linked below.
    vq_code = model.encode_code(input_waveform=wav_tensor)
    print("Code:", vq_code)

    recon_wav = model.decode_code(vq_code).cpu()  # shape: (1, 1, T')

sf.write("reconstructed.wav", recon_wav[0, 0, :].numpy(), sr)
print("Done! Check reconstructed.wav")
```
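Since the model only accepts 16 kHz mono audio, files at other sample rates or with multiple channels need conversion before encoding. A minimal, dependency-free sketch using linear interpolation (the helper names are ours; in practice, prefer a proper resampler such as torchaudio or soxr):

```python
import numpy as np

def to_mono(wav: np.ndarray) -> np.ndarray:
    """Average channels if soundfile returned a (T, C) multichannel array."""
    return wav.mean(axis=1) if wav.ndim == 2 else wav

def resample_linear(wav: np.ndarray, sr_in: int, sr_out: int = 16000) -> np.ndarray:
    """Naive linear-interpolation resampler, for illustration only."""
    if sr_in == sr_out:
        return wav
    n_out = int(round(len(wav) * sr_out / sr_in))
    t_in = np.arange(len(wav)) / sr_in    # input sample times in seconds
    t_out = np.arange(n_out) / sr_out     # output sample times in seconds
    return np.interp(t_out, t_in, wav).astype(np.float32)

wav48 = to_mono(np.zeros((48000, 2), dtype=np.float32))  # 1 s stereo at 48 kHz
wav16 = resample_linear(wav48, 48000)
print(wav16.shape)  # (16000,)
```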
To train your own XCodec2, run batch inference, or perform large-scale code extraction, the code is released [here](https://github.com/zhenye234/X-Codec-2.0).