| library_name: transformers | |
| license: cc-by-nc-4.0 | |
| pipeline_tag: feature-extraction | |
| tags: | |
| - audio | |
| - codec | |
| # X-Codec2 (Transformers-native) | |
| The X-Codec2 model was proposed in [Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis](https://huggingface.co/papers/2502.04128). | |
| X-Codec2 is a neural audio codec designed to improve speech synthesis and general audio generation for large language model (LLM) pipelines. It extends the original X-Codec by refining how semantic and acoustic information is integrated and tokenized, enabling efficient and high-fidelity audio representation. | |
| About its architecture: | |
| - **Unified Semantic-Acoustic Tokenization**: X-Codec2 fuses outputs from a semantic encoder (e.g., Wav2Vec2-BERT) and an acoustic encoder into a single embedding, capturing both high-level meaning (e.g., text content, emotion) and low-level audio details (e.g., timbre). | |
| - **Single-Stage Finite Scalar Quantization (FSQ)**: Unlike the multi-layer residual VQ used in most approaches (e.g., DAC, EnCodec, X-Codec, Mimi), X-Codec2 uses a single layer of Finite Scalar Quantization (FSQ) for stability and compatibility with causal, autoregressive LLMs. | |
| - **Transformer-Friendly Design**: The 1D token structure of X-Codec2 naturally aligns with the autoregressive modeling in LLMs like LLaMA, improving training efficiency and downstream compatibility. | |
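To make the single-layer quantization idea concrete, here is a minimal pure-Python sketch of FSQ (finite scalar quantization, as the technique is named in the literature). It is a simplified illustration and not the X-Codec2 implementation: the `LEVELS` values, the function names, and the tanh-based bounding are assumptions chosen for readability. Each latent dimension is squashed to a bounded range and rounded to one of a few levels, and the per-dimension digits are combined into one integer, so every frame maps to a single token id.

```python
import math

# Hypothetical number of quantization levels per latent dimension
# (illustrative only; not the values used by X-Codec2).
LEVELS = [3, 5, 4]


def fsq_quantize(z, levels=LEVELS):
    """Map a latent vector to one digit per dimension (0 .. L-1)."""
    digits = []
    for value, num_levels in zip(z, levels):
        bounded = math.tanh(value)  # squash to (-1, 1)
        scaled = (bounded + 1.0) / 2.0 * (num_levels - 1)  # now in (0, L-1)
        digits.append(int(round(scaled)))
    return digits


def digits_to_index(digits, levels=LEVELS):
    """Combine per-dimension digits into one token id (mixed radix)."""
    index, base = 0, 1
    for digit, num_levels in zip(digits, levels):
        index += digit * base
        base *= num_levels
    return index


def index_to_digits(index, levels=LEVELS):
    """Invert digits_to_index."""
    digits = []
    for num_levels in levels:
        digits.append(index % num_levels)
        index //= num_levels
    return digits


def digits_to_values(digits, levels=LEVELS):
    """Map digits back to quantized latent values in [-1, 1]."""
    return [2.0 * d / (n - 1) - 1.0 for d, n in zip(digits, levels)]


codebook_size = math.prod(LEVELS)  # implicit codebook: 3 * 5 * 4 = 60 entries
token = digits_to_index(fsq_quantize([0.0, 10.0, -10.0]))
print(codebook_size, token)  # 60 13
```

Because the codebook is implicit (the product of the per-dimension levels), there is no learned codebook to look up, and one integer per frame is all an autoregressive LLM needs to predict.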
| This model was contributed by [Eric Bezzam](https://huggingface.co/bezzam) and [Steven Zheng](https://huggingface.co/Steveeeeeeen). | |
| The original modeling code is available [here](https://huggingface.co/HKUSTAudio/xcodec2/blob/main/modeling_xcodec2.py), and the original training code is [here](https://github.com/zhenye234/X-Codec-2.0). | |
| ## Setup | |
| X-Codec2 is supported natively in 🤗 Transformers. Until it is part of an official Transformers release, install from source: | |
| ```bash | |
| pip install git+https://github.com/huggingface/transformers | |
| ``` | |
| ## Usage example | |
| Here is a quick example of how to encode and decode audio with this model: | |
| ```python | |
| from datasets import Audio, load_dataset | |
| from transformers import AutoFeatureExtractor, AutoModel | |
| model_id = "HKUSTAudio/xcodec2-hf" | |
| model = AutoModel.from_pretrained(model_id, device_map="auto") | |
| feature_extractor = AutoFeatureExtractor.from_pretrained(model_id) | |
| dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") | |
| dataset = dataset.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate)) | |
| audio = dataset[0]["audio"]["array"] | |
| inputs = feature_extractor(audio=audio, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt").to( | |
|     model.device, model.dtype | |
| ) | |
| print("Input waveform shape:", inputs["input_values"].shape) | |
| # Input waveform shape: torch.Size([1, 1, 93760]) | |
| # Encode, then decode | |
| audio_codes = model.encode(**inputs).audio_codes | |
| print("Audio codes shape:", audio_codes.shape) | |
| # Audio codes shape: torch.Size([1, 1, 293]) | |
| audio_values = model.decode(audio_codes).audio_values | |
| print("Audio values shape:", audio_values.shape) | |
| # Audio values shape: torch.Size([1, 1, 93760]) | |
| # Equivalently, you can do encoding and decoding in one step | |
| model_output = model(**inputs) | |
| audio_codes = model_output.audio_codes | |
| audio_values = model_output.audio_values | |
| ``` | |
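The shapes printed in the example imply a fixed ratio between waveform samples and codes. The short sketch below works out that arithmetic from the numbers shown there. The 16 kHz sampling rate is an assumption that this card does not state (check `feature_extractor.sampling_rate`), and the variable names are illustrative.

```python
num_samples = 93760  # last dimension of the input waveform shape in the example
num_codes = 293      # last dimension of the audio codes shape in the example

samples_per_code = num_samples // num_codes
assert samples_per_code * num_codes == num_samples  # the ratio is exact here
print("Samples per code:", samples_per_code)  # Samples per code: 320

assumed_sampling_rate = 16_000  # assumption, not stated in this card
codes_per_second = assumed_sampling_rate / samples_per_code
print("Codes per second:", codes_per_second)  # Codes per second: 50.0
print("Clip duration (s):", num_samples / assumed_sampling_rate)
```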
| ### Batch processing | |
| Unlike the original [release](https://huggingface.co/HKUSTAudio/xcodec2), this implementation supports batched inputs. | |
| ```python | |
| from datasets import Audio, load_dataset | |
| from transformers import AutoFeatureExtractor, AutoModel | |
| batch_size = 2 | |
| model_id = "HKUSTAudio/xcodec2-hf" | |
| model = AutoModel.from_pretrained(model_id, device_map="auto") | |
| feature_extractor = AutoFeatureExtractor.from_pretrained(model_id) | |
| dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") | |
| dataset = dataset.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate)) | |
| audios = [dataset[i]["audio"]["array"] for i in range(batch_size)] | |
| inputs = feature_extractor(audio=audios, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt").to( | |
|     model.device, model.dtype | |
| ) | |
| print("Input waveform shape:", inputs["input_values"].shape) | |
| # Input waveform shape: torch.Size([2, 1, 93760]) | |
| # Encode, then decode | |
| encoder_output = model.encode(**inputs) | |
| audio_codes = encoder_output.audio_codes | |
| print("Audio codes shape:", audio_codes.shape) | |
| # Audio codes shape: torch.Size([2, 1, 293]) | |
| audio_values = model.decode(audio_codes).audio_values | |
| print("Audio values shape:", audio_values.shape) | |
| # Audio values shape: torch.Size([2, 1, 93760]) | |
| # Equivalently, you can do encoding and decoding in one step | |
| model_output = model(**inputs) | |
| audio_codes = model_output.audio_codes | |
| audio_values = model_output.audio_values | |
| ``` | |
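Batching clips of different lengths requires padding them to a common length. The sketch below illustrates the idea in plain Python with toy lists. It is not the feature extractor's actual code: the `hop_length` of 320 (inferred from 93760 samples mapping to 293 codes), the function name `pad_batch`, and the mask convention are all assumptions.

```python
def pad_batch(waveforms, hop_length=320):
    """Zero-pad waveforms to a shared length that is a multiple of hop_length.

    Returns the padded waveforms and a mask (1 = real sample, 0 = padding).
    """
    longest = max(len(w) for w in waveforms)
    # Round up so the padded length maps to a whole number of codes.
    target = -(-longest // hop_length) * hop_length
    padded = [list(w) + [0.0] * (target - len(w)) for w in waveforms]
    mask = [[1] * len(w) + [0] * (target - len(w)) for w in waveforms]
    return padded, mask


# Toy clips with hypothetical lengths (values are placeholders, not audio).
clips = [[0.1] * 700, [0.2] * 500]
padded, mask = pad_batch(clips)
print(len(padded[0]), len(padded[1]))  # 960 960
print(len(padded[0]) // 320)  # 3 codes per clip under the assumed hop length

# The mask records each original length, which allows trimming after decoding.
original_lengths = [sum(m) for m in mask]
print(original_lengths)  # [700, 500]
```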
| ### Speed-up with `torch.compile` | |
| You can speed up inference with [`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html). The first few calls will be slower due to compilation overhead, but subsequent calls will be faster. | |
| On an A100, we observed a speed-up of about 1.35x at a batch size of 4 ([script](https://gist.github.com/ebezzam/3b79481b5d48d8e35c4ecc582aee0cb3#file-benchmark_torch_compile-py)). | |
| ```python | |
| import torch | |
| from datasets import Audio, load_dataset | |
| from transformers import AutoFeatureExtractor, AutoModel | |
| batch_size = 4 | |
| model_id = "HKUSTAudio/xcodec2-hf" | |
| model = AutoModel.from_pretrained(model_id, device_map="auto") | |
| feature_extractor = AutoFeatureExtractor.from_pretrained(model_id) | |
| dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") | |
| dataset = dataset.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate)) | |
| audios = [dataset[i]["audio"]["array"] for i in range(batch_size)] | |
| inputs = feature_extractor( | |
|     audio=audios, sampling_rate=feature_extractor.sampling_rate, padding=True, return_tensors="pt" | |
| ).to(model.device, model.dtype) | |
| compiled_model = torch.compile(model, fullgraph=True) | |
| # Warmup (includes compilation on first call) | |
| for _ in range(10): | |
|     with torch.inference_mode(): | |
|         _ = compiled_model(**inputs) | |
| with torch.inference_mode(): | |
|     output = compiled_model(**inputs) | |
| print("Audio values shape:", output.audio_values.shape) | |
| ``` |