---
license: cc-by-nc-4.0
tags:
- audio
- codec
- speech
- xcodec2
- text-to-speech
- multilingual
language:
- en
- ja
- zh
- bn
- fr
- de
- ko
---

# 🗣️ XCodec2 Trained on 100K Hours of Multilingual Data

This is a retrained version of the XCodec2 neural audio codec by HKUSTAudio, trained on 100,000 hours of multilingual speech spanning seven languages. The model enables efficient speech compression and reconstruction for low-bandwidth, high-quality audio applications. Its discrete token outputs are well suited to LLM-based TTS, audio language models, multimodal models, and speech-to-speech systems, making it a versatile choice for multilingual, real-world speech processing.

---

## 📌 Overview

- **Model Architecture:** [XCodec2](https://huggingface.co/HKUSTAudio/xcodec2)
- **Sampling Rate:** 16 kHz
- **Token Rate:** 50 tokens/second
- **Developed By:** [Verbex.ai (Hishab Technologies Ltd.)](https://verbex.ai)
- **Primary Use Case:** High-quality speech reconstruction and intermediate TTS representations
- **Training Time:** 11 days (8× H100 80GB)
- **Epochs:** 1

---

## 🧪 Installation & Usage

This model requires `xcodec2`.
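For planning sequence lengths in downstream LLM/TTS pipelines, the 50 tokens/second rate above translates directly into token-count and bitrate budgets. A minimal sketch follows; note that the 65,536-entry single codebook (16 bits per token) is an assumption carried over from the upstream XCodec2 design and is not stated in this card.

```python
import math

TOKEN_RATE = 50      # tokens per second of audio (from the overview above)
BITS_PER_TOKEN = 16  # log2(65536); assumed single-codebook size


def num_tokens(seconds: float) -> int:
    """Number of discrete codes produced for a clip of the given duration."""
    return math.ceil(TOKEN_RATE * seconds)


def bitrate_bps() -> int:
    """Effective bitrate of the token stream under the assumed codebook size."""
    return TOKEN_RATE * BITS_PER_TOKEN


print(num_tokens(10.0))  # 500 tokens for 10 s of audio
print(bitrate_bps())     # 800 bps
```

At roughly 800 bps, a one-minute utterance fits in about 6 KB of codes, which is what makes these tokens practical as an intermediate representation for autoregressive TTS.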
We recommend using a minimal setup:

```bash
# Create environment
conda create -n xcodec2 python=3.9
conda activate xcodec2

# Install dependencies
pip install xcodec2==0.1.5
pip install numpy==1.26.4
```

### Example Usage

```python
import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "hishab/titu-xcodec2"  # Replace with the actual Hugging Face path
model = XCodec2Model.from_pretrained(model_path)
model.eval().cuda()

# Load and preprocess the waveform
wav, sr = sf.read("test_bn.wav")
if wav.ndim > 1:  # Mix multi-channel audio down to mono first
    wav = wav.mean(axis=1)
if sr != 16000:   # The model expects 16 kHz input
    import librosa
    wav = librosa.resample(wav, orig_sr=sr, target_sr=16000)
    sr = 16000
wav_tensor = torch.from_numpy(wav).float().unsqueeze(0)  # Shape: (1, T)

# Encode and decode
with torch.no_grad():
    vq_code = model.encode_code(input_waveform=wav_tensor)
    print("Code:", vq_code)
    recon_wav = model.decode_code(vq_code).cpu()  # Shape: (1, 1, T')

# Save the output
sf.write("reconstructed_bn.wav", recon_wav[0, 0].numpy(), sr)
print("Done! Check reconstructed_bn.wav")
```

---

## 🌍 Multilingual Training Dataset

| Language  | Dataset(s)                          | Hours (K) |
|-----------|-------------------------------------|-----------|
| Japanese  | EmiliaYODAS + Verbex JA TTS Dataset | 31.41     |
| English   | EmiliaYODAS                         | 25.69     |
| Chinese   | EmiliaYODAS                         | 12.50     |
| Bangla    | Verbex Bengali TTS Dataset          | 11.58     |
| French    | EmiliaYODAS + MLangLibrispeech      | 8.40      |
| German    | EmiliaYODAS + MLangLibrispeech      | 5.42      |
| Korean    | EmiliaYODAS                         | 5.00      |
| **Total** | —                                   | **100**   |

---

## 📊 Reconstruction Evaluation

Reconstruction metrics are computed over 100 samples each for English, Japanese, and Bangla using this retrained model (`XCodec2 (Ours)`) alongside baselines (XCodec, SNAC, NVIDIA NeMo).
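As a rough illustration of the signal-level comparison involved, the sketch below computes waveform MSE and mel-cepstral distortion (MCD) from frame-aligned cepstra. This is a simplified sketch, not the evaluation script actually used; the helper names and input shapes are illustrative assumptions.

```python
import numpy as np


def waveform_mse(ref: np.ndarray, rec: np.ndarray) -> float:
    """Mean squared error between time-aligned waveforms."""
    n = min(len(ref), len(rec))  # trim to the shorter signal
    return float(np.mean((ref[:n] - rec[:n]) ** 2))


def mcd_db(ref_cep: np.ndarray, rec_cep: np.ndarray) -> float:
    """Mel-cepstral distortion in dB from frame-aligned cepstra.

    Both inputs have shape (n_coeffs, n_frames); the energy
    coefficient c0 is assumed already excluded, as is conventional.
    """
    t = min(ref_cep.shape[1], rec_cep.shape[1])
    diff = ref_cep[:, :t] - rec_cep[:, :t]
    # Standard MCD scaling: (10 / ln 10) * sqrt(2), applied to the
    # per-frame Euclidean distance, then averaged over frames.
    scale = (10.0 / np.log(10.0)) * np.sqrt(2.0)
    return float(scale * np.mean(np.sqrt(np.sum(diff ** 2, axis=0))))
```

In practice the cepstra come from a mel-cepstral analyzer (e.g. WORLD/SPTK), usually with dynamic time warping to align frames; the simple trimming above assumes already-aligned outputs.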
**Evaluation Test Sets:**
- English: 100 examples (Emilia dataset)
- Japanese: 100 examples (Emilia dataset)
- Bangla: 100 examples (Verbex in-house TTS dataset)

| Model | Lang | MCD ↓ | MSE ↓ | SpeechBERTScore ↑ | SpeechBLEU ↑ | SpeechTokenDist ↑ |
|-------------------|------|--------|--------|-------------|--------|-------------|
| **XCodec** | BN | 2.823 | 0.003 | 0.939 | 0.500 | 0.816 |
| | EN | 3.166 | 0.012 | 0.962 | 0.660 | 0.856 |
| | JA | 3.021 | 0.010 | 0.948 | 0.582 | 0.838 |
| **Overall** | | 3.003 | 0.008 | 0.949 | 0.581 | 0.837 |
| **XCodec2 (Ours)** | BN | 2.712 | 0.003 | 0.940 | 0.508 | 0.817 |
| | EN | 3.206 | 0.014 | 0.957 | 0.644 | 0.851 |
| | JA | 3.022 | 0.012 | 0.946 | 0.573 | 0.838 |
| **Overall** | | 2.980 | 0.010 | 0.948 | 0.575 | 0.835 |
| **hubertsiuzdak/snac_24khz** | BN | 3.104 | 0.002 | 0.911 | 0.442 | 0.785 |
| | EN | 3.983 | 0.014 | 0.912 | 0.541 | 0.797 |
| | JA | 3.512 | 0.009 | 0.903 | 0.472 | 0.761 |
| **Overall** | | 3.533 | 0.008 | 0.909 | 0.485 | 0.781 |
| **nvidia/low-frame-rate-speech-codec-22khz** | BN | 2.247 | 0.000 | 0.957 | 0.589 | 0.863 |
| | EN | 2.867 | 0.007 | 0.969 | 0.707 | 0.872 |
| | JA | 2.677 | 0.003 | 0.955 | 0.614 | 0.853 |
| **Overall** | | 2.597 | 0.003 | 0.960 | 0.636 | 0.863 |

*SpeechBERTScore, SpeechBLEU, and SpeechTokenDist are computed using [DiscreteSpeechMetrics](https://github.com/Takaaki-Saeki/DiscreteSpeechMetrics).*

---

## ✅ Intended Use

This model is suitable for:
- Speech tokenization in TTS pipelines
- Low-bitrate speech compression
- Code-based speech synthesis and generation tasks
- Multimodal LLM, audio LM, and speech-to-speech modeling

---

## 🚫 Limitations

- Licensed for **non-commercial use only**

---

## 📄 License

This model is licensed under **Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0)**. Commercial usage is **not allowed**.
- SPDX Identifier: `CC-BY-NC-4.0`
- License Details: [https://creativecommons.org/licenses/by-nc/4.0](https://creativecommons.org/licenses/by-nc/4.0)

---

## 📬 Contact

For research collaborations, feedback, or commercial licensing inquiries, please reach out to:

**🌐 Website:** [https://verbex.ai](https://verbex.ai)

---