---
base_model:
  - HKUSTAudio/xcodec2
datasets:
  - malaysia-ai/common_voice_17_0
  - mesolitica/Malaysian-STT-Whisper-Stage2
  - malaysia-ai/Multilingual-TTS
  - mesolitica/Malaysian-Emilia-v2
library_name: transformers
pipeline_tag: audio-to-audio
---

# xcodec2-25TPS-24k

This repository contains the improved X-Codec-2.0 model described in the paper *Improving X-Codec-2.0 for Multi-Lingual Speech: 25 Hz Latent Rate and 24 kHz Sampling*.


This model improves https://huggingface.co/HKUSTAudio/xcodec2 from 50 tokens per second (TPS) to 25 TPS and upscales the output to a 24 kHz sample rate.

Training logs are available on WandB at https://wandb.ai/huseinzol05/xcodec2-24k-25tps, and we also pushed all intermediate checkpoints.
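To make the rate change concrete, here is a quick back-of-envelope calculation using only the figures quoted above (25 TPS latent rate, 16 kHz input as in the encode example, 24 kHz output); these numbers are derived from the README, not read from the model code:

```python
# Back-of-envelope bookkeeping for the 25 TPS / 24 kHz setup.
input_sr = 16_000        # encoder input sample rate (Hz)
output_sr = 24_000       # decoder output sample rate (Hz)
tokens_per_second = 25   # latent rate, halved from the original 50 TPS

# how many audio samples each latent token covers on each side
samples_per_token_in = input_sr // tokens_per_second    # 640 input samples per token
samples_per_token_out = output_sr // tokens_per_second  # 960 output samples per token

# a 10-second clip is represented by 250 tokens, versus 500 at 50 TPS
tokens_for_10s = 10 * tokens_per_second
```

Halving the token rate means downstream language models over these codes see sequences half as long for the same audio duration.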

## Dataset

1. https://huggingface.co/datasets/malaysia-ai/common_voice_17_0, train set only.
2. https://huggingface.co/datasets/mesolitica/Malaysian-STT-Whisper-Stage2, excluding noise and audioset_0.5s.
3. https://huggingface.co/datasets/malaysia-ai/Multilingual-TTS, at commit 2421a13e07226d96ac7009d5327d96a84672768c, excluding cml-tts and libritts_r_filtered.
4. https://huggingface.co/datasets/mesolitica/Malaysian-Emilia-v2, sg_podcast and malaysian_podcast only.

## How to use

1. Clone the repository,

```bash
git clone https://github.com/Scicom-AI-Enterprise-Organization/X-Codec-2.0-25TPS-24k
cd X-Codec-2.0-25TPS-24k
```

2. Load the model,

```python
from modeling_xcodec2 import XCodec2Model

model = XCodec2Model.from_pretrained("Scicom-intl/xcodec2-25TPS-24k")
```

3. Encode,

```python
import librosa
import torch

# load audio at the encoder's expected 16 kHz sample rate
y, sr = librosa.load('259041.mp3', sr=16000)
wav_tensor = torch.from_numpy(y).float().unsqueeze(0)  # shape (1, T)
codes = model.encode_code(wav_tensor)
```

4. Decode,

```python
import IPython.display as ipd

# decode the tokens back to audio; the output sample rate is 24 kHz
ipd.Audio(model.decode_code(codes)[0, 0].cpu(), rate=24000)
```

## Source code

Source code at https://github.com/Scicom-AI-Enterprise-Organization/X-Codec-2.0-25TPS-24k