---
license: cc-by-4.0
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
- SparkAudio/Spark-TTS-0.5B
- zai-org/glm-4-voice-tokenizer
pipeline_tag: audio-to-audio
metrics:
- bleu
library_name: transformers
---
# Model Card for UniSS

## Model Details

### Model Description

UniSS is a unified single-stage speech-to-speech translation (S2ST) framework that achieves high translation fidelity and speech quality while preserving timbre, emotion, and duration consistency.
UniSS currently supports English and Chinese.

### Model Sources

- **Repository:** https://github.com/cmots/UniSS
- **Paper:** https://arxiv.org/pdf/2509.21144
- **Demo:** https://cmots.github.io/uniss-demo

## Quick Start
1. Set up the environment and get the code
```bash
conda create -n uniss python=3.10.16
conda activate uniss
git clone https://github.com/cmots/UniSS.git
cd UniSS
pip install -r requirements.txt
# If you are in mainland China, you can set the mirror as follows:
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
```
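Before moving on, you can sanity-check the activated environment. This is an optional check of my own (not part of the UniSS setup) confirming the interpreter matches the Python version the instructions target:

```python
import sys

def check_python_version(required=(3, 10)):
    """Return True if the running interpreter is at least `required`."""
    return sys.version_info[:2] >= required

if __name__ == "__main__":
    print("Python version OK:", check_python_version())
```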
2. Download the weights

The UniSS weights are hosted on [HuggingFace](https://huggingface.co/cmots/UniSS).

You have to download the model manually; you can do so with the provided script:
```bash
python download_weight.py
```

or via git clone (skip this step if you have already downloaded the weights with the Python script):
```bash
mkdir -p pretrained_models

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install

git clone https://huggingface.co/cmots/UniSS pretrained_models/UniSS
```
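Whichever route you take, it can help to verify that the checkpoint directory actually landed before running inference. This is a hypothetical convenience check (the directory name follows the commands above; it is not a script shipped with the repo):

```python
from pathlib import Path

def weights_ready(model_dir="pretrained_models/UniSS"):
    """Return True if the model directory exists and contains at least one entry."""
    p = Path(model_dir)
    return p.is_dir() and any(p.iterdir())

if __name__ == "__main__":
    print("Weights ready:", weights_ready())
```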
3. Run the code
```python
import soundfile
from uniss import UniSSTokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from uniss import process_input, process_output

# 1. Set the device, wav path, model path
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

wav_path = "prompt_audio.wav"
model_path = "pretrained_models/UniSS"

# 2. Set the mode and target language
mode = 'Quality'    # 'Quality' or 'Performance'
tgt_lang = "<|eng|>"    # for English output
# tgt_lang = "<|cmn|>"  # for Chinese output

# 3. Load the model, text tokenizer, and speech tokenizer
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
tokenizer = AutoTokenizer.from_pretrained(model_path)

speech_tokenizer = UniSSTokenizer.from_pretrained(model_path, device=device)

# 4. Extract speech tokens
glm4_tokens, bicodec_tokens = speech_tokenizer.tokenize(wav_path)

# 5. Process the input
input_text = process_input(glm4_tokens, bicodec_tokens, mode, tgt_lang)
input_token_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)

# 6. Translate the speech
output = model.generate(
    input_token_ids,
    max_new_tokens=1500,
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.1
)

# 7. Decode the output
output_text = tokenizer.batch_decode(output, skip_special_tokens=True)

# 8. Process the output
audio, translation, transcription = process_output(output_text[0], input_text, speech_tokenizer, mode, device)

# 9. Save and show the results
soundfile.write("output_audio.wav", audio, 16000)

if mode == 'Quality':
    print("Transcription:\n", transcription)
print("Translation:\n", translation)
```
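The output is written as 16 kHz audio. As an optional check of my own (standard library only, independent of UniSS), you can read the saved file back and confirm its sample rate and duration:

```python
import contextlib
import wave

def wav_info(path):
    """Return (sample_rate, channels, duration_seconds) for a PCM WAV file."""
    with contextlib.closing(wave.open(path, "rb")) as f:
        rate = f.getframerate()
        return rate, f.getnchannels(), f.getnframes() / rate
```

For example, `wav_info("output_audio.wav")` should report a sample rate of 16000 for the file saved above.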

More examples and details are available in [our GitHub repo](https://github.com/cmots/UniSS).

## Citation
If you find our paper and code useful in your research, please consider giving it a like and a citation.
```bibtex
@misc{cheng2025uniss_s2st,
      title={UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice}, 
      author={Sitong Cheng and Weizhen Bian and Xinsheng Wang and Ruibin Yuan and Jianyi Chen and Shunshun Yin and Yike Guo and Wei Xue},
      year={2025},
      eprint={2509.21144},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2509.21144}, 
}
```