# Inference

The pretrained model checkpoints can be reached at [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS) and [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), or will be automatically downloaded when running inference scripts.

**More checkpoints contributed by the community can be found in [SHARED.md](SHARED.md), supporting more languages.**

Currently a **single generation supports up to 30s**, which is the **total length** of both the prompt and the output audio. However, you can provide `infer_cli` and `infer_gradio` with longer text; they will automatically chunk the generation. Long reference audio will be **clipped to ~15s**.

To avoid possible inference failures, make sure you have read through the following instructions.

- Use reference audio shorter than 15s and leave some silence (e.g. 1s) at the end. Otherwise there is a risk of truncation in the middle of a word, leading to suboptimal generation.
- Uppercase letters will be uttered letter by letter, so use lowercase letters for normal words.
- Add some spaces (blank: " ") or punctuation (e.g. "," ".") to explicitly introduce pauses.
- Preprocess numbers into Chinese characters if you want them read in Chinese; otherwise they will be read in English.
- If the generation output is blank (pure silence), check that ffmpeg is installed (various tutorials are available online: blogs, videos, etc.).
- Try turning off `use_ema` if using an early-stage finetuned checkpoint (one trained for only a few updates).
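The first tip above can be automated. Below is a minimal standard-library sketch (the helper name `trim_ref_audio` and the file paths are hypothetical, not part of F5-TTS) that clips a reference WAV to at most 15s and appends 1s of trailing silence:

```python
import wave

def trim_ref_audio(in_path, out_path, max_seconds=15, pad_seconds=1):
    """Trim a WAV file to at most `max_seconds` and append `pad_seconds` of silence."""
    with wave.open(in_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        # Keep at most max_seconds of audio
        frames = src.readframes(min(src.getnframes(), max_seconds * rate))
    # Silence is zero bytes, sized by sample width and channel count
    silence = b"\x00" * (pad_seconds * rate * params.sampwidth * params.nchannels)
    with wave.open(out_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(frames + silence)
```

For other container formats (mp3, flac), an ffmpeg-based pipeline would be needed instead.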


## Gradio App

Currently supported features:

- Basic TTS with Chunk Inference
- Multi-Style / Multi-Speaker Generation
- Voice Chat powered by Qwen2.5-3B-Instruct

The CLI command `f5-tts_infer-gradio` is equivalent to `python src/f5_tts/infer/infer_gradio.py`, which launches a Gradio app (web interface) for inference.

The script will load model checkpoints from Hugging Face. You can also manually download files and update the path passed to `load_model()` in `infer_gradio.py`. Only the TTS model is loaded at first; the ASR model is loaded for transcription if `ref_text` is not provided, and the LLM is loaded when Voice Chat is used.

It can also be used as a component of a larger application:
```python
import gradio as gr
from f5_tts.infer.infer_gradio import app

with gr.Blocks() as main_app:
    gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")

    # ... other Gradio components

    app.render()

main_app.launch()
```


## CLI Inference

The CLI command `f5-tts_infer-cli` is equivalent to `python src/f5_tts/infer/infer_cli.py`, a command-line tool for inference.

The script will load model checkpoints from Hugging Face. You can also manually download files and use `--ckpt_file` to specify the model you want to load, or directly update the path in `infer_cli.py`.

To use a custom `vocab.txt`, pass it with `--vocab_file`.

Basically, you can run inference with flags:
```bash
# Leaving --ref_text "" will have the ASR model transcribe the reference audio (extra GPU memory usage)
f5-tts_infer-cli \
    --model "F5-TTS" \
    --ref_audio "ref_audio.wav" \
    --ref_text "The content, subtitle or transcription of reference audio." \
    --gen_text "Some text you want TTS model generate for you."

# Choose vocoder
f5-tts_infer-cli --vocoder_name bigvgan --load_vocoder_from_local --ckpt_file <YOUR_CKPT_PATH, eg:ckpts/F5TTS_Base_bigvgan/model_1250000.pt>
f5-tts_infer-cli --vocoder_name vocos --load_vocoder_from_local --ckpt_file <YOUR_CKPT_PATH, eg:ckpts/F5TTS_Base/model_1200000.safetensors>
```

A `.toml` file allows more flexible usage:

```bash
f5-tts_infer-cli -c custom.toml
```

For example, you can use a `.toml` file to pass in variables; refer to `src/f5_tts/infer/examples/basic/basic.toml`:

```toml
# F5-TTS | E2-TTS
model = "F5-TTS"
ref_audio = "infer/examples/basic/basic_ref_en.wav"
# If an empty "", transcribes the reference audio automatically.
ref_text = "Some call me nature, others call me mother nature."
gen_text = "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring."
# File with text to generate. Ignores the text above.
gen_file = ""
remove_silence = false
output_dir = "tests"
```

You can also leverage a `.toml` file to do multi-style generation; refer to `src/f5_tts/infer/examples/multi/story.toml`:

```toml
# F5-TTS | E2-TTS
model = "F5-TTS"
ref_audio = "infer/examples/multi/main.flac"
# If an empty "", transcribes the reference audio automatically.
ref_text = ""
gen_text = ""
# File with text to generate. Ignores the text above.
gen_file = "infer/examples/multi/story.txt"
remove_silence = true
output_dir = "tests"

[voices.town]
ref_audio = "infer/examples/multi/town.flac"
ref_text = ""

[voices.country]
ref_audio = "infer/examples/multi/country.flac"
ref_text = ""
```
Mark the speaking voice with `[main]`, `[town]`, or `[country]` wherever you want to switch voices; refer to `src/f5_tts/infer/examples/multi/story.txt`.
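For illustration, a hypothetical generation file in this style (the text below is invented, not the contents of the actual `story.txt`) might look like:

```
[main] The traveler paused at the crossroads, unsure which way to go.
[town] "Come this way," called a voice from the busy market square.
[country] "No, out here," answered another, over the quiet fields.
[main] And so the traveler had to choose.
```

Each bracketed marker switches subsequent text to the voice defined under the matching `[voices.*]` section of the `.toml` file, with `[main]` referring to the top-level reference audio.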

## Speech Editing

To test speech editing capabilities, use the following command:

```bash
python src/f5_tts/infer/speech_edit.py
```

## Socket Realtime Client

To communicate with the socket server, first run:

```bash
python src/f5_tts/socket_server.py
```

<details>
<summary>Then create a client to communicate</summary>

```python
import socket
import numpy as np
import asyncio
import pyaudio

async def listen_to_voice(text, server_ip='localhost', server_port=9999):
    client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client_socket.connect((server_ip, server_port))

    async def play_audio_stream():
        buffer = b''
        p = pyaudio.PyAudio()
        stream = p.open(format=pyaudio.paFloat32,
                        channels=1,
                        rate=24000,  # Ensure this matches the server's sampling rate
                        output=True,
                        frames_per_buffer=2048)

        try:
            while True:
                chunk = await asyncio.get_event_loop().run_in_executor(None, client_socket.recv, 1024)
                if not chunk:  # End of stream
                    break
                if b"END_OF_AUDIO" in chunk:
                    buffer += chunk.replace(b"END_OF_AUDIO", b"")
                    if buffer:
                        audio_array = np.frombuffer(buffer, dtype=np.float32).copy()  # Make a writable copy
                        stream.write(audio_array.tobytes())
                    break
                buffer += chunk
                if len(buffer) >= 4096:
                    audio_array = np.frombuffer(buffer[:4096], dtype=np.float32).copy()  # Make a writable copy
                    stream.write(audio_array.tobytes())
                    buffer = buffer[4096:]
        finally:
            stream.stop_stream()
            stream.close()
            p.terminate()

    try:
        # Send only the text to the server
        await asyncio.get_event_loop().run_in_executor(None, client_socket.sendall, text.encode('utf-8'))
        await play_audio_stream()
        print("Audio playback finished.")

    except Exception as e:
        print(f"Error in listen_to_voice: {e}")

    finally:
        client_socket.close()

# Example usage: Replace this with your actual server IP and port
async def main():
    await listen_to_voice("my name is jenny..", server_ip='localhost', server_port=9998)

# Run the main async function
asyncio.run(main())
```

</details>