---
license: cc-by-4.0
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
- SparkAudio/Spark-TTS-0.5B
- zai-org/glm-4-voice-tokenizer
pipeline_tag: audio-to-audio
metrics:
- bleu
library_name: transformers
---
# Model Card for UniSS
## Model Details
### Model Description
UniSS is a unified single-stage speech-to-speech translation (S2ST) framework that achieves high translation fidelity and speech quality, while preserving timbre, emotion, and duration consistency.
UniSS currently supports English and Chinese.
### Model Sources
- **Repository:** https://github.com/cmots/UniSS
- **Paper:** https://arxiv.org/pdf/2509.21144
- **Demo:** https://cmots.github.io/uniss-demo
## Quick Start
1. Set up the environment and get the code
```bash
conda create -n uniss python=3.10.16
conda activate uniss
git clone https://github.com/cmots/UniSS.git
cd UniSS
pip install -r requirements.txt
# If you are in mainland China, you can set the mirror as follows:
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
```
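After installation, a quick sanity check (a minimal sketch; it assumes PyTorch was installed by `requirements.txt`) confirms that PyTorch is importable and whether a CUDA GPU is visible:

```python
import torch

# Sanity check: report the installed PyTorch version and GPU visibility.
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

# The inference example below falls back to CPU when no GPU is present.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device:", device)
```

Inference works on CPU but will be much slower than on a GPU.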
2. Download the weights
The UniSS weights are hosted on [HuggingFace](https://huggingface.co/cmots/UniSS).
The model must be downloaded manually; you can use the provided script:
```bash
python download_weight.py
```
or clone it with git (skip this step if you have already downloaded the weights via the Python script):
```bash
mkdir -p pretrained_models
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/cmots/UniSS pretrained_models/UniSS
```
3. Run the code
```python
import soundfile
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from uniss import UniSSTokenizer, process_input, process_output
# 1. Set the device, wav path, model path
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
wav_path = "prompt_audio.wav"
model_path = "pretrained_models/UniSS"
# 2. Set the mode and target language
mode = 'Quality' # 'Quality' or 'Performance'
tgt_lang = "<|eng|>" # for English output
# tgt_lang = "<|cmn|>" # for Chinese output
# 3. load the model, text tokenizer, and speech tokenizer
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
tokenizer = AutoTokenizer.from_pretrained(model_path)
speech_tokenizer = UniSSTokenizer.from_pretrained(model_path, device=device)
# 4. extract speech tokens
glm4_tokens, bicodec_tokens = speech_tokenizer.tokenize(wav_path)
# 5. process the input
input_text = process_input(glm4_tokens, bicodec_tokens, mode, tgt_lang)
input_token_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
# 6. translate the speech
output = model.generate(
    input_token_ids,
    max_new_tokens=1500,
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.1,
)
# 7. decode the output
output_text = tokenizer.batch_decode(output, skip_special_tokens=True)
# 8. process the output
audio, translation, transcription = process_output(output_text[0], input_text, speech_tokenizer, mode, device)
# 9. save and show the results
soundfile.write("output_audio.wav", audio, 16000)
if mode == 'Quality':
    print("Transcription:\n", transcription)
    print("Translation:\n", translation)
```
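The target language is selected by prepending a tag such as `<|eng|>` or `<|cmn|>`, the two tags shown in the example above. If you drive this from user input, a small lookup keeps the tags in one place; this is a hypothetical helper of ours, not part of the UniSS API:

```python
# Target-language tags used by UniSS, as shown in the example above.
LANG_TAGS = {"english": "<|eng|>", "chinese": "<|cmn|>"}

def target_tag(language: str) -> str:
    """Map a human-readable language name to the UniSS target-language tag."""
    try:
        return LANG_TAGS[language.lower()]
    except KeyError:
        raise ValueError(
            f"Unsupported language {language!r}; choose from {sorted(LANG_TAGS)}"
        ) from None

print(target_tag("English"))
```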
More examples and details are available in [our GitHub repo](https://github.com/cmots/UniSS).
## Citation
If you find our paper and code useful in your research, please consider giving the model a like and citing our work:
```bibtex
@misc{cheng2025uniss_s2st,
  title={UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice},
  author={Sitong Cheng and Weizhen Bian and Xinsheng Wang and Ruibin Yuan and Jianyi Chen and Shunshun Yin and Yike Guo and Wei Xue},
  year={2025},
  eprint={2509.21144},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2509.21144},
}
``` |