# Inference

The pretrained model checkpoints can be reached at [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS) and [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), or will be automatically downloaded when running inference scripts.

**More checkpoints contributed by the community can be found in [SHARED.md](SHARED.md), supporting more languages.**

Currently a **single generation supports up to 30s**, which is the **total length** of both the prompt and the output audio. However, you can provide `infer_cli` and `infer_gradio` with longer text; they will automatically chunk the generation. Long reference audio will be **clipped to ~15s**.

To avoid possible inference failures, make sure you have read through the following instructions.

- Use reference audio shorter than 15s and leave some silence (e.g. 1s) at the end. Otherwise there is a risk of truncation in the middle of a word, leading to suboptimal generation.
- Uppercase letters will be uttered letter by letter, so use lowercase letters for normal words.
- Add some spaces (blank: " ") or punctuation (e.g. "," ".") to explicitly introduce pauses.
- Preprocess numbers into Chinese characters if you want them read in Chinese; otherwise they will be read in English.
- If the generation output is blank (pure silence), check that ffmpeg is installed (various tutorials are available online: blogs, videos, etc.).
- Try turning off `use_ema` if using an early-stage finetuned checkpoint (one trained for only a few updates).
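The first tip above can be automated. Below is a minimal standard-library sketch (the helper name `trim_ref_audio` and the file paths are hypothetical, not part of F5-TTS) that clips a reference WAV to at most 15s and appends 1s of trailing silence:

```python
import wave

def trim_ref_audio(in_path, out_path, max_seconds=15, pad_seconds=1):
    """Trim a WAV file to at most `max_seconds` and append `pad_seconds` of silence."""
    with wave.open(in_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        # Keep at most max_seconds of audio
        frames = src.readframes(min(src.getnframes(), max_seconds * rate))
    # Silence is zero bytes, sized by sample width and channel count
    silence = b"\x00" * (pad_seconds * rate * params.sampwidth * params.nchannels)
    with wave.open(out_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(frames + silence)
```

For other container formats (mp3, flac), an ffmpeg-based pipeline would be needed instead.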


## Gradio App

Currently supported features:

- Basic TTS with Chunk Inference
- Multi-Style / Multi-Speaker Generation
- Voice Chat powered by Qwen2.5-3B-Instruct

The CLI command `f5-tts_infer-gradio` is equivalent to `python src/f5_tts/infer/infer_gradio.py`, which launches a Gradio app (web interface) for inference.

The script will load model checkpoints from Hugging Face. You can also manually download files and update the path passed to `load_model()` in `infer_gradio.py`. Only the TTS model is loaded at first; the ASR model is loaded for transcription if `ref_text` is not provided, and the LLM is loaded when Voice Chat is used.

It can also be used as a component of a larger application:
```python
import gradio as gr
from f5_tts.infer.infer_gradio import app

with gr.Blocks() as main_app:
    gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")

    # ... other Gradio components

    app.render()

main_app.launch()
```


## CLI Inference

The CLI command `f5-tts_infer-cli` is equivalent to `python src/f5_tts/infer/infer_cli.py`, a command-line tool for inference.

The script will load model checkpoints from Hugging Face. You can also manually download files and use `--ckpt_file` to specify the model you want to load, or directly update the path in `infer_cli.py`.

To use a custom `vocab.txt`, pass it with `--vocab_file`.

Basically, you can run inference with flags:
```bash
# Leaving --ref_text "" will have the ASR model transcribe the reference audio (extra GPU memory usage)
f5-tts_infer-cli \
    --model "F5-TTS" \
    --ref_audio "ref_audio.wav" \
    --ref_text "The content, subtitle or transcription of reference audio." \
    --gen_text "Some text you want TTS model generate for you."

# Choose vocoder
f5-tts_infer-cli --vocoder_name bigvgan --load_vocoder_from_local --ckpt_file <YOUR_CKPT_PATH, eg:ckpts/F5TTS_Base_bigvgan/model_1250000.pt>
f5-tts_infer-cli --vocoder_name vocos --load_vocoder_from_local --ckpt_file <YOUR_CKPT_PATH, eg:ckpts/F5TTS_Base/model_1200000.safetensors>
```

A `.toml` file allows more flexible usage:

```bash
f5-tts_infer-cli -c custom.toml
```

For example, you can use a `.toml` file to pass in variables; refer to `src/f5_tts/infer/examples/basic/basic.toml`:

```toml
# F5-TTS | E2-TTS
model = "F5-TTS"
ref_audio = "infer/examples/basic/basic_ref_en.wav"
# If an empty "", transcribes the reference audio automatically.
ref_text = "Some call me nature, others call me mother nature."
gen_text = "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring."
# File with text to generate. Ignores the text above.
gen_file = ""
remove_silence = false
output_dir = "tests"
```

You can also leverage a `.toml` file to do multi-style generation; refer to `src/f5_tts/infer/examples/multi/story.toml`:

```toml
# F5-TTS | E2-TTS
model = "F5-TTS"
ref_audio = "infer/examples/multi/main.flac"
# If an empty "", transcribes the reference audio automatically.
ref_text = ""
gen_text = ""
# File with text to generate. Ignores the text above.
gen_file = "infer/examples/multi/story.txt"
remove_silence = true
output_dir = "tests"

[voices.town]
ref_audio = "infer/examples/multi/town.flac"
ref_text = ""

[voices.country]
ref_audio = "infer/examples/multi/country.flac"
ref_text = ""
```
Mark the speaking voice with `[main]`, `[town]`, or `[country]` wherever you want to switch voices; refer to `src/f5_tts/infer/examples/multi/story.txt`.
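For illustration, a hypothetical generation file in this style (the text below is invented, not the contents of the actual `story.txt`) might look like:

```
[main] The traveler paused at the crossroads, unsure which way to go.
[town] "Come this way," called a voice from the busy market square.
[country] "No, out here," answered another, over the quiet fields.
[main] And so the traveler had to choose.
```

Each bracketed marker switches subsequent text to the voice defined under the matching `[voices.*]` section of the `.toml` file, with `[main]` referring to the top-level reference audio.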

## Speech Editing

To test speech editing capabilities, use the following command:

```bash
python src/f5_tts/infer/speech_edit.py
```

## Socket Realtime Client

To communicate with the socket server, first run:

```bash
python src/f5_tts/socket_server.py
```

<details>
<summary>Then create a client to communicate</summary>

```python
import socket
import numpy as np
import asyncio
import pyaudio

async def listen_to_voice(text, server_ip='localhost', server_port=9999):
    client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client_socket.connect((server_ip, server_port))

    async def play_audio_stream():
        buffer = b''
        p = pyaudio.PyAudio()
        stream = p.open(format=pyaudio.paFloat32,
                        channels=1,
                        rate=24000,  # Ensure this matches the server's sampling rate
                        output=True,
                        frames_per_buffer=2048)

        try:
            while True:
                chunk = await asyncio.get_event_loop().run_in_executor(None, client_socket.recv, 1024)
                if not chunk:  # End of stream
                    break
                if b"END_OF_AUDIO" in chunk:
                    buffer += chunk.replace(b"END_OF_AUDIO", b"")
                    if buffer:
                        audio_array = np.frombuffer(buffer, dtype=np.float32).copy()  # Make a writable copy
                        stream.write(audio_array.tobytes())
                    break
                buffer += chunk
                if len(buffer) >= 4096:
                    audio_array = np.frombuffer(buffer[:4096], dtype=np.float32).copy()  # Make a writable copy
                    stream.write(audio_array.tobytes())
                    buffer = buffer[4096:]
        finally:
            stream.stop_stream()
            stream.close()
            p.terminate()

    try:
        # Send only the text to the server
        await asyncio.get_event_loop().run_in_executor(None, client_socket.sendall, text.encode('utf-8'))
        await play_audio_stream()
        print("Audio playback finished.")

    except Exception as e:
        print(f"Error in listen_to_voice: {e}")

    finally:
        client_socket.close()

# Example usage: Replace this with your actual server IP and port
async def main():
    await listen_to_voice("my name is jenny..", server_ip='localhost', server_port=9998)

# Run the main async function
asyncio.run(main())
```

</details>