	# Inference

	The pretrained model checkpoints are available at [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS) and [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), or they will be downloaded automatically when you run the inference scripts.

	More checkpoints contributed by the community, supporting more languages, are listed in [SHARED.md](SHARED.md).

	A single generation currently supports up to 30s in total, counting both the prompt and the output audio. However, you can give `infer_cli` and `infer_gradio` longer text, and they will automatically generate it in chunks. Long reference audio will be clipped to ~15s.
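
To illustrate the idea of chunked generation, here is a minimal sketch that packs whole sentences into pieces under a character budget. The names `split_into_chunks` and `max_chars`, and the default budget of 135 characters, are assumptions made for illustration; this is not the splitter that `infer_cli` or `infer_gradio` use internally.

```python
import re


def split_into_chunks(text: str, max_chars: int = 135) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    Hypothetical helper for illustration only. A single sentence longer than
    max_chars is kept whole rather than cut in the middle of a word.
    """
    # Split after Latin sentence punctuation followed by whitespace,
    # or directly after CJK sentence punctuation.
    pieces = re.split(r"(?<=[.!?;])\s+|(?<=[。！？；])", text)
    sentences = [p.strip() for p in pieces if p.strip()]

    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk could then be synthesized separately and the audio concatenated. A real budget would depend on the reference audio length, since prompt and output share the same 30s limit.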

	To avoid possible inference failures, make sure you have read through the following instructions.

	- Use reference audio shorter than 15s and leave some silence (e.g. 1s) at the end. Otherwise there is a risk of truncating in the middle of a word, leading to suboptimal generation.
	- Uppercase letters are uttered letter by letter, so use lowercase letters for normal words.
	- Add spaces (blank: " ") or punctuation (e.g. "," ".") to explicitly introduce pauses.
	- Convert numbers to Chinese characters if you want them read in Chinese; otherwise they are read in English.
	- If the generation output is blank (pure silence), check that ffmpeg is installed (various tutorials are available online: blogs, videos, etc.).
	- Try turning off `use_ema` if using an early-stage finetuned checkpoint (one that has gone through only a few updates).
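
Two items from the checklist above can be checked before running inference. The sketch below is an illustration only: `lowercase_shouted_words` and `ffmpeg_available` are hypothetical helper names and are not part of F5-TTS.

```python
import shutil


def lowercase_shouted_words(text: str, keep: frozenset = frozenset()) -> str:
    """Lowercase fully uppercase words so they are read as words, not spelled out.

    Words listed in keep (for example acronyms that should be spelled out)
    are left untouched. Matching is exact, so "GPU," with a trailing comma
    is not protected by keep={"GPU"}.
    """
    words = []
    for word in text.split(" "):
        if len(word) > 1 and word.isupper() and word not in keep:
            word = word.lower()
        words.append(word)
    return " ".join(words)


def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


print(lowercase_shouted_words("HELLO world from the GPU", keep=frozenset({"GPU"})))
print("ffmpeg found:", ffmpeg_available())
```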


	## Gradio App

	Currently supported features:

	- Basic TTS with Chunk Inference
	- Multi-Style / Multi-Speaker Generation
	- Voice Chat powered by Qwen2.5-3B-Instruct
	- [Custom inference with more language support](src/f5_tts/infer/SHARED.md)

	The CLI command `f5-tts_infer-gradio` is equivalent to `python src/f5_tts/infer/infer_gradio.py`, which launches a Gradio app (web interface) for inference.

	The script loads model checkpoints from Hugging Face. You can also download the files manually and update the path passed to `load_model()` in `infer_gradio.py`. Only the TTS models are loaded at first; the ASR model is loaded for transcription if `ref_text` is not provided, and the LLM is loaded if you use Voice Chat.

	More flag options:

	```bash
	# Automatically launch the interface in the default web browser
	f5-tts_infer-gradio --inbrowser

	# Set the root path of the application, if it's not served from the root ("/") of the domain
	# For example, if the application is served at "https://example.com/myapp"
	f5-tts_infer-gradio --root_path "/myapp"
	```

	It can also be used as a component in a larger application:
	```python
	import gradio as gr
	from f5_tts.infer.infer_gradio import app

	with gr.Blocks() as main_app:
	    gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")

	    # ... other Gradio components

	    app.render()

	main_app.launch()
	```


	## CLI Inference

	The CLI command `f5-tts_infer-cli` is equivalent to `python src/f5_tts/infer/infer_cli.py`, a command-line tool for inference.

	The script loads model checkpoints from Hugging Face. You can also download the files manually and use `--ckpt_file` to specify the model you want to load, or update the path directly in `infer_cli.py`.

	To use a different vocabulary, pass your `vocab.txt` file with `--vocab_file`.

	You can run inference with flags:
	```bash
	# Leaving --ref_text "" makes the ASR model transcribe the reference audio (extra GPU memory usage)
	f5-tts_infer-cli \
	    --model "F5-TTS" \
	    --ref_audio "ref_audio.wav" \
	    --ref_text "The content, subtitle or transcription of reference audio." \
	    --gen_text "Some text you want the TTS model to generate for you."

	# Choose Vocoder
	f5-tts_infer-cli --vocoder_name bigvgan --load_vocoder_from_local --ckpt_file <YOUR_CKPT_PATH, eg:ckpts/F5TTS_Base_bigvgan/model_1250000.pt>
	f5-tts_infer-cli --vocoder_name vocos --load_vocoder_from_local --ckpt_file <YOUR_CKPT_PATH, eg:ckpts/F5TTS_Base/model_1200000.safetensors>

	# More instructions
	f5-tts_infer-cli --help
	```
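
If you drive the CLI from a Python script, the same call can be assembled as an argument list. This is a minimal sketch that uses only the flags shown above; `build_infer_command` is a hypothetical helper name, not part of the package.

```python
import shlex


def build_infer_command(ref_audio: str, gen_text: str, ref_text: str = "", model: str = "F5-TTS") -> list[str]:
    """Build the argument list for f5-tts_infer-cli from the documented flags."""
    return [
        "f5-tts_infer-cli",
        "--model", model,
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,  # an empty string lets the ASR model transcribe
        "--gen_text", gen_text,
    ]


cmd = build_infer_command("ref_audio.wav", "Some text you want the TTS model to generate.")
print(shlex.join(cmd))  # shell-quoted form, handy for logging
# To actually run it: subprocess.run(cmd, check=True)
```

Passing a list instead of a single shell string avoids quoting problems when the text contains quotes or other shell metacharacters.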

	A `.toml` file allows more flexible usage.

	```bash
	f5-tts_infer-cli -c custom.toml
	```

	For example, you can use a `.toml` file to pass in variables; see `src/f5_tts/infer/examples/basic/basic.toml`:

	```toml
	# F5-TTS | E2-TTS
	model = "F5-TTS"
	ref_audio = "infer/examples/basic/basic_ref_en.wav"
	# If empty (""), the reference audio is transcribed automatically.
	ref_text = "Some call me nature, others call me mother nature."
	gen_text = "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring."
	# File with text to generate. Ignores the text above.
	gen_file = ""
	remove_silence = false
	output_dir = "tests"
	```

	You can also use a `.toml` file for multi-style generation; see `src/f5_tts/infer/examples/multi/story.toml`.

	```toml
	# F5-TTS | E2-TTS
	model = "F5-TTS"
	ref_audio = "infer/examples/multi/main.flac"
	# If empty (""), the reference audio is transcribed automatically.
	ref_text = ""
	gen_text = ""
	# File with text to generate. Ignores the text above.
	gen_file = "infer/examples/multi/story.txt"
	remove_silence = true
	output_dir = "tests"

	[voices.town]
	ref_audio = "infer/examples/multi/town.flac"
	ref_text = ""

	[voices.country]
	ref_audio = "infer/examples/multi/country.flac"
	ref_text = ""
	```
	Mark the text with `[main]`, `[town]`, or `[country]` wherever you want to switch voice; see `src/f5_tts/infer/examples/multi/story.txt`.
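
To make the marker convention concrete, here is a minimal sketch of how text tagged this way could be split into per-voice passages. `split_by_voice` and the sample story line are made up for illustration; the actual parsing in `infer_cli.py` may differ.

```python
import re


def split_by_voice(text: str, default_voice: str = "main") -> list[tuple[str, str]]:
    """Split text into (voice, passage) pairs at markers such as [town]."""
    segments = []
    voice = default_voice
    # re.split with a capture group yields: text, voice, text, voice, text, ...
    parts = re.split(r"\[(\w+)\]", text)
    for index, part in enumerate(parts):
        if index % 2 == 1:
            voice = part  # odd positions hold the captured voice name
        elif part.strip():
            segments.append((voice, part.strip()))
    return segments


story = "[main] Once upon a time. [town] Hello! [country] Howdy."
for voice, passage in split_by_voice(story):
    print(f"{voice}: {passage}")
```

Text that appears before the first marker is assigned to `default_voice`, which is an assumption of this sketch.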

	## Speech Editing

	To test speech editing capabilities, use the following command:

	```bash
	python src/f5_tts/infer/speech_edit.py
	```

	## Socket Realtime Client

	To communicate with the socket server, first start it:
	```bash
	python src/f5_tts/socket_server.py
	```

	<details>
	<summary>Then create a client to communicate</summary>

	```bash
	# If PyAudio is not installed
	sudo apt-get install portaudio19-dev
	pip install pyaudio
	```

	```python
	# Create the socket_client.py
	import socket
	import asyncio
	import pyaudio
	import numpy as np
	import logging
	import time

	logging.basicConfig(level=logging.INFO)
	logger = logging.getLogger(__name__)


	async def listen_to_F5TTS(text, server_ip="localhost", server_port=9998):
	    client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
	    await asyncio.get_event_loop().run_in_executor(None, client_socket.connect, (server_ip, int(server_port)))

	    start_time = time.time()
	    first_chunk_time = None

	    async def play_audio_stream():
	        nonlocal first_chunk_time
	        p = pyaudio.PyAudio()
	        stream = p.open(format=pyaudio.paFloat32, channels=1, rate=24000, output=True, frames_per_buffer=2048)

	        try:
	            while True:
	                data = await asyncio.get_event_loop().run_in_executor(None, client_socket.recv, 8192)
	                if not data:
	                    break
	                if data == b"END":
	                    logger.info("End of audio received.")
	                    break

	                audio_array = np.frombuffer(data, dtype=np.float32)
	                stream.write(audio_array.tobytes())

	                if first_chunk_time is None:
	                    first_chunk_time = time.time()

	        finally:
	            stream.stop_stream()
	            stream.close()
	            p.terminate()

	        logger.info(f"Total time taken: {time.time() - start_time:.4f} seconds")

	    try:
	        data_to_send = f"{text}".encode("utf-8")
	        await asyncio.get_event_loop().run_in_executor(None, client_socket.sendall, data_to_send)
	        await play_audio_stream()

	    except Exception as e:
	        logger.error(f"Error in listen_to_F5TTS: {e}")

	    finally:
	        client_socket.close()


	if __name__ == "__main__":
	    text_to_send = "As a Reader assistant, I'm familiar with new technology. which are key to its improved performance in terms of both training speed and inference efficiency. Let's break down the components"

	    asyncio.run(listen_to_F5TTS(text_to_send))
	```
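
TCP does not preserve message boundaries, so a `recv` call in the client above may return a number of bytes that is not a multiple of 4, or may deliver the `END` marker glued to the last audio bytes. The following sketch buffers reads and releases only whole float32 samples. `ChunkAssembler` is a hypothetical helper, and it assumes the server sends the 3-byte marker `END` right after the last complete sample; a 3-byte tail of real audio that happens to equal `END` would be misread, so treat this as a heuristic rather than a robust protocol.

```python
import struct

SAMPLE_WIDTH = 4     # bytes per float32 sample (paFloat32 in the client above)
END_MARKER = b"END"  # end-of-stream marker the client above checks for


class ChunkAssembler:
    """Buffer raw socket reads and release only whole float32 samples."""

    def __init__(self):
        self._pending = b""
        self.finished = False

    def feed(self, data: bytes) -> bytes:
        """Add one recv() result and return the bytes that are safe to play now."""
        if self.finished:
            return b""
        self._pending += data
        usable = len(self._pending) - len(self._pending) % SAMPLE_WIDTH
        tail = self._pending[usable:]
        if tail == END_MARKER:
            # Assumption: a 3-byte tail equal to the marker means end of stream.
            self.finished = True
            tail = b""
        ready, self._pending = self._pending[:usable], tail
        return ready


# Usage: three float32 samples arriving in two unevenly sized reads.
samples = struct.pack("<3f", 0.1, 0.2, 0.3)
assembler = ChunkAssembler()
first = assembler.feed(samples[:5])                 # one whole sample plus 1 stray byte
second = assembler.feed(samples[5:] + END_MARKER)   # the rest, with the marker attached
```

In the client, the bytes returned by `feed` could be passed to `stream.write`, with the loop ending once `finished` is set.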

	</details>