metadata
license: cc-by-4.0
language:
  - en
  - zh
base_model:
  - Qwen/Qwen2.5-1.5B-Instruct
  - SparkAudio/Spark-TTS-0.5B
  - zai-org/glm-4-voice-tokenizer
pipeline_tag: audio-to-audio
metrics:
  - bleu
library_name: transformers

Model Card for UniSS

Model Details

Model Description

UniSS is a unified single-stage speech-to-speech translation (S2ST) framework that achieves high translation fidelity and speech quality while preserving timbre, emotion, and duration consistency. UniSS currently supports English and Chinese.

Model Sources

Quick Start

  1. Install the environment and get the code
conda create -n uniss python=3.10.16
conda activate uniss
git clone https://github.com/cmots/UniSS.git
cd UniSS
pip install -r requirements.txt
# If you are in mainland China, you can set the mirror as follows:
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
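
After installation, it can help to confirm that the packages the example script imports are actually visible to the active environment. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the package list is simply taken from the imports in the example script in this README. It assumes `uniss` is a local package that resolves only when Python is started from inside the cloned `UniSS` directory.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found.

    Hypothetical helper for illustration; not part of the UniSS code base.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages imported by the example script in this README.
# "uniss" is assumed to be importable only from the repository root.
REQUIRED = ["torch", "transformers", "soundfile", "uniss"]
```

Calling `missing_packages(REQUIRED)` from the repository root returns an empty list when everything is importable; any names it does return point at what still needs to be installed.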
  2. Download the weights

The UniSS weights are hosted on Hugging Face.

The model must be downloaded manually. You can use the provided script:

python download_weight.py

or clone the repository with git (skip this step if you already used the Python script):

mkdir -p pretrained_models

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install

git clone https://huggingface.co/cmots/UniSS pretrained_models/UniSS
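
If git-lfs is not set up when cloning, large files are typically left as small text pointer files instead of the real weights, and loading the model fails later. The following is a minimal sketch, not part of the UniSS repository, that scans a local directory for such leftover pointers. The function name `find_lfs_pointers` is hypothetical; the header string is the first line of a Git LFS pointer file.

```python
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(root):
    """Return files under `root` that are still Git LFS pointers.

    Hypothetical helper for illustration; not part of the UniSS code base.
    """
    root = Path(root)
    pointers = []
    for path in sorted(root.rglob("*")):
        # Skip directories and anything inside the .git metadata folder.
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        with path.open("rb") as f:
            head = f.read(len(LFS_HEADER))
        if head == LFS_HEADER:
            pointers.append(path)
    return pointers
```

For the clone above, `find_lfs_pointers("pretrained_models/UniSS")` returning an empty list suggests the weight files were fetched in full; any paths it returns are files that still need to be pulled with git-lfs.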
  3. Run the code
import soundfile
from uniss import UniSSTokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from uniss import process_input, process_output

# 1. Set the device, wav path, model path
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

wav_path = "prompt_audio.wav"
model_path = "pretrained_models/UniSS"

# 2. Set the mode and target language
mode = 'Quality'    # 'Quality' or 'Performance'
tgt_lang = "<|eng|>"    # for English output
# tgt_lang = "<|cmn|>"  # for Chinese output

# 3. load the model, text tokenizer, and speech tokenizer
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
tokenizer = AutoTokenizer.from_pretrained(model_path)

speech_tokenizer = UniSSTokenizer.from_pretrained(model_path, device=device)

# 4. extract speech tokens
glm4_tokens, bicodec_tokens = speech_tokenizer.tokenize(wav_path)


# 5. process the input
input_text = process_input(glm4_tokens, bicodec_tokens, mode, tgt_lang)
input_token_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)

# 6. translate the speech
output = model.generate(
    input_token_ids,
    max_new_tokens=1500,
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.1
)

# 7. decode the output
output_text = tokenizer.batch_decode(output, skip_special_tokens=True)

# 8. process the output
audio, translation, transcription = process_output(output_text[0], input_text, speech_tokenizer, mode, device)

# 9. save and show the results
soundfile.write("output_audio.wav", audio, 16000)

if mode == 'Quality':
    print("Transcription:\n", transcription)
print("Translation:\n", translation)
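
UniSS is described above as preserving duration consistency, so a quick sanity check after step 9 is to compare the length of the input and output audio. The sketch below is an illustration using only Python's standard `wave` module. It assumes both files are uncompressed PCM WAV files that `wave` can read (other formats would need a library such as `soundfile` instead), and the helper names `wav_duration_seconds` and `duration_ratio` are hypothetical.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def duration_ratio(source_path, output_path):
    """Return output duration divided by source duration.

    A value close to 1.0 means the translated speech is about as long
    as the source speech. Hypothetical helper, not part of UniSS.
    """
    source_seconds = wav_duration_seconds(source_path)
    if source_seconds == 0:
        raise ValueError("source audio is empty")
    return wav_duration_seconds(output_path) / source_seconds
```

With the file names from the script above, `duration_ratio("prompt_audio.wav", "output_audio.wav")` gives the ratio; what counts as an acceptable deviation from 1.0 depends on the use case.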

More examples and details are available in our GitHub repo.

Citation

If you find our paper and code useful in your research, please consider liking the model and citing the paper.

@misc{cheng2025uniss_s2st,
      title={UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice}, 
      author={Sitong Cheng and Weizhen Bian and Xinsheng Wang and Ruibin Yuan and Jianyi Chen and Shunshun Yin and Yike Guo and Wei Xue},
      year={2025},
      eprint={2509.21144},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2509.21144}, 
}