---
license: cc-by-4.0
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
- SparkAudio/Spark-TTS-0.5B
- zai-org/glm-4-voice-tokenizer
pipeline_tag: audio-to-audio
metrics:
- bleu
library_name: transformers
---

# Model Card for UniSS

## Model Details

### Model Description

UniSS is a unified single-stage speech-to-speech translation (S2ST) framework that achieves high translation fidelity and speech quality while preserving timbre, emotion, and duration consistency.

UniSS currently supports English and Chinese.

### Model Sources

- **Repository:** https://github.com/cmots/UniSS
- **Paper:** https://arxiv.org/pdf/2509.21144
- **Demo:** https://cmots.github.io/uniss-demo

## Quick Start

1. Install the environment and get the code

```bash
conda create -n uniss python=3.10.16
conda activate uniss
git clone https://github.com/cmots/UniSS.git
cd UniSS
pip install -r requirements.txt
# If you are in mainland China, you can use a mirror instead:
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
```

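Before moving on, it can help to confirm that the packages imported later in the Quick Start script are visible to the active environment. The following is a minimal sketch, not part of the UniSS codebase: the helper name `missing_modules` is hypothetical, and it assumes `uniss` is importable when Python is started from inside the cloned repository directory.

```python
import importlib.util
import sys

# Modules imported by the Quick Start script below. `uniss` is assumed to be
# importable from the cloned repository directory (an assumption, not verified).
REQUIRED_MODULES = ("torch", "transformers", "soundfile", "uniss")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be located on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python:", sys.version.split()[0])
    missing = missing_modules()
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

If any module is reported as missing, re-running `pip install -r requirements.txt` inside the activated `uniss` environment is the first thing to try.
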
2. Download the weights

The UniSS weights are hosted on [HuggingFace](https://huggingface.co/cmots/UniSS).

The model must be downloaded manually. You can use the provided script:

```bash
python download_weight.py
```

Or clone the weights with git (skip this if you have already downloaded them with the Python script):

```bash
mkdir -p pretrained_models

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install

git clone https://huggingface.co/cmots/UniSS pretrained_models/UniSS
```

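As a rough alternative to the two download routes above, the weights can also be fetched from Python with `huggingface_hub.snapshot_download`. This is a sketch under stated assumptions rather than the repository's own `download_weight.py`: the helper names `missing_weight_files` and `ensure_weights` are hypothetical, and checking for `config.json` is only an assumed minimal sign that the download completed.

```python
from pathlib import Path


def missing_weight_files(model_dir, required=("config.json",)):
    """Return the names in `required` that are absent from `model_dir`.

    `config.json` is an assumed minimal marker of a usable checkpoint;
    extend `required` if you know which other files the model needs.
    """
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


def ensure_weights(model_dir="pretrained_models/UniSS", repo_id="cmots/UniSS"):
    """Download the repository snapshot only if the local copy looks incomplete."""
    if missing_weight_files(model_dir):
        # Imported lazily so the check above works without huggingface_hub.
        from huggingface_hub import snapshot_download

        snapshot_download(repo_id=repo_id, local_dir=model_dir)
    return Path(model_dir)
```

Calling `ensure_weights()` once before loading the model leaves an existing `pretrained_models/UniSS` directory untouched and downloads the snapshot otherwise.
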
3. Run the code

```python
import soundfile
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

from uniss import UniSSTokenizer, process_input, process_output

# 1. Set the device, wav path, and model path
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

wav_path = "prompt_audio.wav"
model_path = "pretrained_models/UniSS"

# 2. Set the mode and target language
mode = 'Quality'  # 'Quality' or 'Performance'
tgt_lang = "<|eng|>"  # for English output
# tgt_lang = "<|cmn|>"  # for Chinese output

# 3. Load the model, text tokenizer, and speech tokenizer
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
tokenizer = AutoTokenizer.from_pretrained(model_path)

speech_tokenizer = UniSSTokenizer.from_pretrained(model_path, device=device)

# 4. Extract speech tokens
glm4_tokens, bicodec_tokens = speech_tokenizer.tokenize(wav_path)

# 5. Process the input
input_text = process_input(glm4_tokens, bicodec_tokens, mode, tgt_lang)
input_token_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)

# 6. Translate the speech
output = model.generate(
    input_token_ids,
    max_new_tokens=1500,
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.1
)

# 7. Decode the output
output_text = tokenizer.batch_decode(output, skip_special_tokens=True)

# 8. Process the output
audio, translation, transcription = process_output(output_text[0], input_text, speech_tokenizer, mode, device)

# 9. Save and show the results
soundfile.write("output_audio.wav", audio, 16000)

if mode == 'Quality':
    print("Transcription:\n", transcription)
print("Translation:\n", translation)
```

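The `mode` and `tgt_lang` settings in the script above are plain strings, so a mistyped value may only surface later in the pipeline. The sketch below shows one way to validate them up front. The helper names `resolve_target_language` and `validate_mode` and the short codes `"en"` and `"zh"` are illustrative assumptions; only the token strings and the two mode names come from the example above.

```python
# Target-language tokens exactly as they appear in the Quick Start script.
# The short keys "en" and "zh" are a convenience chosen for this sketch.
TARGET_LANGUAGE_TOKENS = {
    "en": "<|eng|>",  # English output
    "zh": "<|cmn|>",  # Chinese output
}

# Mode names listed in the Quick Start script.
SUPPORTED_MODES = ("Quality", "Performance")


def resolve_target_language(code):
    """Map a short language code to the target-language token string."""
    key = code.strip().lower()
    if key not in TARGET_LANGUAGE_TOKENS:
        supported = ", ".join(sorted(TARGET_LANGUAGE_TOKENS))
        raise ValueError(f"Unsupported target language {code!r}; expected one of: {supported}")
    return TARGET_LANGUAGE_TOKENS[key]


def validate_mode(mode):
    """Return `mode` unchanged if it is one of the supported mode names."""
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"Unsupported mode {mode!r}; expected one of: {', '.join(SUPPORTED_MODES)}")
    return mode
```

With these helpers, the settings step could read `mode = validate_mode('Quality')` and `tgt_lang = resolve_target_language("en")`, leaving the rest of the script unchanged.
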
More examples and details are available in [our GitHub repo](https://github.com/cmots/UniSS).

## Citation

If you find our paper and code useful in your research, please consider giving us a like and citing our paper.

```bibtex
@misc{cheng2025uniss_s2st,
    title={UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice},
    author={Sitong Cheng and Weizhen Bian and Xinsheng Wang and Ruibin Yuan and Jianyi Chen and Shunshun Yin and Yike Guo and Wei Xue},
    year={2025},
    eprint={2509.21144},
    archivePrefix={arXiv},
    primaryClass={cs.SD},
    url={https://arxiv.org/abs/2509.21144},
}
```