Commit 055afce by cmots (verified) · Parent: 66fbe03

Update README.md (1 file changed: README.md, +73 −1)
---
language:
- zh
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
- SparkAudio/Spark-TTS-0.5B
- zai-org/glm-4-voice-tokenizer
pipeline_tag: audio-to-audio
metrics:
- bleu
library_name: transformers
---
# Model Card for UniSS

## Model Details

### Model Description

UniSS is a unified single-stage speech-to-speech translation (S2ST) framework that achieves high translation fidelity and speech quality while preserving timbre, emotion, and duration consistency.
UniSS currently supports English and Chinese.

### Model Sources

- **Repository:** https://github.com/cmots/UniSS
- **Paper:**
- **Demo:** https://cmots.github.io/uniss.github.io

## Quick Start
1. Install the environment:
```bash
conda create -n uniss python=3.10.16
conda activate uniss
pip install uniss
```
2. Run the code:
```python
import soundfile
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from uniss import UniSSTokenizer, process_input, process_output

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

wav_path = "prompt_audio.wav"
model_path = "cmots/UniSS"

# load the model, text tokenizer, and speech tokenizer
model = AutoModelForCausalLM.from_pretrained(model_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path)
speech_tokenizer = UniSSTokenizer.from_pretrained(model_path)

# extract speech tokens from the prompt audio
glm4_tokens, bicodec_tokens = speech_tokenizer.tokenize(wav_path)

# target language: English
tgt_lang = "<|eng|>"

# build the model input from the speech tokens
input_text = process_input(glm4_tokens, bicodec_tokens, "Quality", tgt_lang)
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

# translate the speech
output = model.generate(
    input_ids,
    max_new_tokens=100,
    num_beams=1,
    early_stopping=True,
)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)

# decode the generated tokens into audio, translated text, and source transcription
audio, translation, transcription = process_output(output_text, input_text, speech_tokenizer, "Quality", device)

soundfile.write("output_audio.wav", audio, 16000)
print(translation)
print(transcription)
```

## Citation
```bibtex

```