---
license: cc-by-nc-4.0
tags:
- audio
- codec
- speech
- xcodec2
- text-to-speech
- multilingual
language:
- en
- ja
- zh
- bn
- fr
- de
- ko
---

# 🗣️ XCodec2 Trained on 100K Hours of Multilingual Data

This is a retrained version of the XCodec2 neural audio codec by HKUSTAudio, trained on 100,000 hours of multilingual speech spanning seven languages. The model enables efficient speech compression and reconstruction for low-bandwidth, high-quality audio applications. Its discrete token outputs are well suited to LLM-based TTS, audio language models, multimodal models, and speech-to-speech systems, making it a versatile choice for multilingual, real-world speech processing.

---

## 📌 Overview

- **Model Architecture:** [XCodec2](https://huggingface.co/HKUSTAudio/xcodec2)
- **Sampling Rate:** 16 kHz
- **Token Rate:** 50 tokens/second
- **Developed By:** [Verbex.ai (Hishab Technologies Ltd.)](https://verbex.ai)
- **Primary Use Case:** High-quality speech reconstruction and intermediate TTS representations
- **Training Time:** 11 days (8× H100 80GB)
- **Epochs:** 1

---

## 🧪 Installation & Usage

This model requires `xcodec2`.
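For planning sequence lengths in downstream LLM/TTS pipelines, the 50 tokens/second rate above translates directly into token-count and bitrate budgets. A minimal sketch follows; note that the 65,536-entry single codebook (16 bits per token) is an assumption carried over from the upstream XCodec2 design and is not stated in this card.

```python
import math

TOKEN_RATE = 50      # tokens per second of audio (from the overview above)
BITS_PER_TOKEN = 16  # log2(65536); assumed single-codebook size


def num_tokens(seconds: float) -> int:
    """Number of discrete codes produced for a clip of the given duration."""
    return math.ceil(TOKEN_RATE * seconds)


def bitrate_bps() -> int:
    """Effective bitrate of the token stream under the assumed codebook size."""
    return TOKEN_RATE * BITS_PER_TOKEN


print(num_tokens(10.0))  # 500 tokens for 10 s of audio
print(bitrate_bps())     # 800 bps
```

At roughly 800 bps, a one-minute utterance fits in about 6 KB of codes, which is what makes these tokens practical as an intermediate representation for autoregressive TTS.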
We recommend using a minimal setup:

```bash
# Create environment
conda create -n xcodec2 python=3.9
conda activate xcodec2

# Install dependencies
pip install xcodec2==0.1.5
pip install numpy==1.26.4
```

### Example Usage

```python
import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "hishab/titu-xcodec2"  # Replace with the actual Hugging Face path
model = XCodec2Model.from_pretrained(model_path)
model.eval().cuda()

# Load and preprocess the waveform
wav, sr = sf.read("test_bn.wav")
if wav.ndim > 1:  # Mix multi-channel audio down to mono first
    wav = wav.mean(axis=1)
if sr != 16000:   # The model expects 16 kHz input
    import librosa
    wav = librosa.resample(wav, orig_sr=sr, target_sr=16000)
    sr = 16000
wav_tensor = torch.from_numpy(wav).float().unsqueeze(0)  # Shape: (1, T)

# Encode and decode
with torch.no_grad():
    vq_code = model.encode_code(input_waveform=wav_tensor)
    print("Code:", vq_code)
    recon_wav = model.decode_code(vq_code).cpu()  # Shape: (1, 1, T')

# Save the output
sf.write("reconstructed_bn.wav", recon_wav[0, 0].numpy(), sr)
print("Done! Check reconstructed_bn.wav")
```

---

## 🌍 Multilingual Training Dataset

| Language  | Dataset(s)                          | Hours (K) |
|-----------|-------------------------------------|-----------|
| Japanese  | EmiliaYODAS + Verbex JA TTS Dataset | 31.41     |
| English   | EmiliaYODAS                         | 25.69     |
| Chinese   | EmiliaYODAS                         | 12.50     |
| Bangla    | Verbex Bengali TTS Dataset          | 11.58     |
| French    | EmiliaYODAS + MLangLibrispeech      | 8.40      |
| German    | EmiliaYODAS + MLangLibrispeech      | 5.42      |
| Korean    | EmiliaYODAS                         | 5.00      |
| **Total** | —                                   | **100**   |

---

## 📊 Reconstruction Evaluation

Reconstruction metrics are computed over 100 samples each for English, Japanese, and Bangla using this retrained model (`XCodec2 (Ours)`) alongside baselines (XCodec, SNAC, NVIDIA NeMo).
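As a rough illustration of the signal-level comparison involved, the sketch below computes waveform MSE and mel-cepstral distortion (MCD) from frame-aligned cepstra. This is a simplified sketch, not the evaluation script actually used; the helper names and input shapes are illustrative assumptions.

```python
import numpy as np


def waveform_mse(ref: np.ndarray, rec: np.ndarray) -> float:
    """Mean squared error between time-aligned waveforms."""
    n = min(len(ref), len(rec))  # trim to the shorter signal
    return float(np.mean((ref[:n] - rec[:n]) ** 2))


def mcd_db(ref_cep: np.ndarray, rec_cep: np.ndarray) -> float:
    """Mel-cepstral distortion in dB from frame-aligned cepstra.

    Both inputs have shape (n_coeffs, n_frames); the energy
    coefficient c0 is assumed already excluded, as is conventional.
    """
    t = min(ref_cep.shape[1], rec_cep.shape[1])
    diff = ref_cep[:, :t] - rec_cep[:, :t]
    # Standard MCD scaling: (10 / ln 10) * sqrt(2), applied to the
    # per-frame Euclidean distance, then averaged over frames.
    scale = (10.0 / np.log(10.0)) * np.sqrt(2.0)
    return float(scale * np.mean(np.sqrt(np.sum(diff ** 2, axis=0))))
```

In practice the cepstra come from a mel-cepstral analyzer (e.g. WORLD/SPTK), usually with dynamic time warping to align frames; the simple trimming above assumes already-aligned outputs.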
**Evaluation Test Sets:**
- English: 100 examples (Emilia dataset)
- Japanese: 100 examples (Emilia dataset)
- Bangla: 100 examples (Verbex in-house TTS dataset)

| Model | Lang | MCD ↓ | MSE ↓ | SpeechBERTScore ↑ | SpeechBLEU ↑ | SpeechTokenDist ↑ |
|-------------------|------|--------|--------|-------------|--------|-------------|
| **XCodec** | BN | 2.823 | 0.003 | 0.939 | 0.500 | 0.816 |
| | EN | 3.166 | 0.012 | 0.962 | 0.660 | 0.856 |
| | JA | 3.021 | 0.010 | 0.948 | 0.582 | 0.838 |
| **Overall** | | 3.003 | 0.008 | 0.949 | 0.581 | 0.837 |
| **XCodec2 (Ours)** | BN | 2.712 | 0.003 | 0.940 | 0.508 | 0.817 |
| | EN | 3.206 | 0.014 | 0.957 | 0.644 | 0.851 |
| | JA | 3.022 | 0.012 | 0.946 | 0.573 | 0.838 |
| **Overall** | | 2.980 | 0.010 | 0.948 | 0.575 | 0.835 |
| **hubertsiuzdak/snac_24khz** | BN | 3.104 | 0.002 | 0.911 | 0.442 | 0.785 |
| | EN | 3.983 | 0.014 | 0.912 | 0.541 | 0.797 |
| | JA | 3.512 | 0.009 | 0.903 | 0.472 | 0.761 |
| **Overall** | | 3.533 | 0.008 | 0.909 | 0.485 | 0.781 |
| **nvidia/low-frame-rate-speech-codec-22khz** | BN | 2.247 | 0.000 | 0.957 | 0.589 | 0.863 |
| | EN | 2.867 | 0.007 | 0.969 | 0.707 | 0.872 |
| | JA | 2.677 | 0.003 | 0.955 | 0.614 | 0.853 |
| **Overall** | | 2.597 | 0.003 | 0.960 | 0.636 | 0.863 |

*SpeechBERTScore, SpeechBLEU, and SpeechTokenDist are computed using [DiscreteSpeechMetrics](https://github.com/Takaaki-Saeki/DiscreteSpeechMetrics).*

---

## ✅ Intended Use

This model is suitable for:
- Speech tokenization in TTS pipelines
- Low-bitrate speech compression
- Code-based speech synthesis and generation tasks
- Multimodal LLM, audio LM, and speech-to-speech modeling

---

## 🚫 Limitations

- Licensed for **non-commercial use only**

---

## 📄 License

This model is licensed under **Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0)**. Commercial usage is **not allowed**.
- SPDX Identifier: `CC-BY-NC-4.0`
- License Details: [https://creativecommons.org/licenses/by-nc/4.0](https://creativecommons.org/licenses/by-nc/4.0)

---

## 📬 Contact

For research collaborations, feedback, or commercial licensing inquiries, please reach out to:

**🌐 Website:** [https://verbex.ai](https://verbex.ai)

---