Commit 055afce by cmots (verified) · Parent: 66fbe03

Update README.md (1 file changed: README.md, +73 −1)
---
language:
- zh
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
- SparkAudio/Spark-TTS-0.5B
- zai-org/glm-4-voice-tokenizer
pipeline_tag: audio-to-audio
metrics:
- bleu
library_name: transformers
---
# Model Card for UniSS

## Model Details

### Model Description

UniSS is a unified single-stage speech-to-speech translation (S2ST) framework that achieves high translation fidelity and speech quality while preserving timbre, emotion, and duration consistency.
UniSS currently supports English and Chinese.

### Model Sources

- **Repository:** https://github.com/cmots/UniSS
- **Paper:**
- **Demo:** https://cmots.github.io/uniss.github.io

## Quick Start
1. Install the environment:
```bash
conda create -n uniss python=3.10.16
conda activate uniss
pip install uniss
```
2. Run the code:
```python
import soundfile
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from uniss import UniSSTokenizer, process_input, process_output

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

wav_path = "prompt_audio.wav"
model_path = "cmots/UniSS"

# load the model, text tokenizer, and speech tokenizer
model = AutoModelForCausalLM.from_pretrained(model_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path)
speech_tokenizer = UniSSTokenizer.from_pretrained(model_path)

# extract speech tokens from the prompt audio
glm4_tokens, bicodec_tokens = speech_tokenizer.tokenize(wav_path)

# target language: English
tgt_lang = "<|eng|>"

# build the model input from the speech tokens
input_text = process_input(glm4_tokens, bicodec_tokens, "Quality", tgt_lang)
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

# translate the speech
output = model.generate(
    input_ids,
    max_new_tokens=100,
    num_beams=1,
    early_stopping=True,
)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)

# decode the generated tokens into audio, translated text, and source transcription
audio, translation, transcription = process_output(output_text, input_text, speech_tokenizer, "Quality", device)

soundfile.write("output_audio.wav", audio, 16000)
print(translation)
print(transcription)
```

## Citation
```bibtex

```