# Inference

The pretrained model checkpoints can be reached at [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS) and [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), or will be automatically downloaded when running inference scripts.

**More checkpoints contributed by the community, supporting more languages, can be found in [SHARED.md](SHARED.md).**

A single generation currently supports up to **30s**, which is the **total length** including both prompt and output audio. However, you can provide `infer_cli` and `infer_gradio` with longer text; they will automatically do chunk generation. Long reference audio will be **clipped to ~15s**.

To avoid possible inference failures, make sure you have read through the following instructions.

- Use reference audio <15s and leave some silence (e.g. 1s) at the end. Otherwise there is a risk of truncating in the middle of a word, leading to suboptimal generation.
- Uppercase letters will be uttered letter by letter, so use lowercase letters for normal words.
- Add some spaces (blank: " ") or punctuation (e.g. "," ".") to explicitly introduce pauses.
- Preprocess numbers to Chinese characters if you want them read in Chinese; otherwise they will be read in English.
- If the generation output is blank (pure silence), check your ffmpeg installation (various tutorials online, blogs, videos, etc.).
- Try turning off `use_ema` if using an early-stage finetuned checkpoint (one that has gone through just a few updates).

## Gradio App

Currently supported features:

- Basic TTS with Chunk Inference
- Multi-Style / Multi-Speaker Generation
- Voice Chat powered by Qwen2.5-3B-Instruct
- [Custom inference with more language support](src/f5_tts/infer/SHARED.md)

The CLI command `f5-tts_infer-gradio` is equivalent to `python src/f5_tts/infer/infer_gradio.py`, which launches a Gradio app (web interface) for inference.

The script will load model checkpoints from Hugging Face.
You can also manually download files and update the path passed to `load_model()` in `infer_gradio.py`. Only the TTS model is loaded at first; the ASR model is loaded to do transcription if `ref_text` is not provided, and the LLM is loaded if Voice Chat is used.

More flag options:

```bash
# Automatically launch the interface in the default web browser
f5-tts_infer-gradio --inbrowser

# Set the root path of the application, if it's not served from the root ("/") of the domain
# For example, if the application is served at "https://example.com/myapp"
f5-tts_infer-gradio --root_path "/myapp"
```

It can also be used as a component of a larger application:

```python
import gradio as gr
from f5_tts.infer.infer_gradio import app

with gr.Blocks() as main_app:
    gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")

    # ... other Gradio components

    app.render()

main_app.launch()
```

## CLI Inference

The CLI command `f5-tts_infer-cli` is equivalent to `python src/f5_tts/infer/infer_cli.py`, which is a command line tool for inference.

The script will load model checkpoints from Hugging Face. You can also manually download files and use `--ckpt_file` to specify the model you want to load, or directly update it in `infer_cli.py`.

To change `vocab.txt`, use `--vocab_file` to provide your own `vocab.txt` file.

Basically, you can run inference with flags:

```bash
# Leaving --ref_text "" will have the ASR model transcribe (extra GPU memory usage)
f5-tts_infer-cli \
--model "F5-TTS" \
--ref_audio "ref_audio.wav" \
--ref_text "The content, subtitle or transcription of reference audio." \
--gen_text "Some text you want TTS model generate for you."

# Choose Vocoder (--ckpt_file expects the path to your checkpoint)
f5-tts_infer-cli --vocoder_name bigvgan --load_vocoder_from_local --ckpt_file
f5-tts_infer-cli --vocoder_name vocos --load_vocoder_from_local --ckpt_file

# More instructions
f5-tts_infer-cli --help
```

A `.toml` file allows more flexible usage.
```bash
f5-tts_infer-cli -c custom.toml
```

For example, you can use a `.toml` file to pass in variables; refer to `src/f5_tts/infer/examples/basic/basic.toml`:

```toml
# F5-TTS | E2-TTS
model = "F5-TTS"
ref_audio = "infer/examples/basic/basic_ref_en.wav"
# If an empty "", transcribes the reference audio automatically.
ref_text = "Some call me nature, others call me mother nature."
gen_text = "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring."
# File with text to generate. Ignores the text above.
gen_file = ""
remove_silence = false
output_dir = "tests"
```

You can also use a `.toml` file to do multi-style generation; refer to `src/f5_tts/infer/examples/multi/story.toml`.

```toml
# F5-TTS | E2-TTS
model = "F5-TTS"
ref_audio = "infer/examples/multi/main.flac"
# If an empty "", transcribes the reference audio automatically.
ref_text = ""
gen_text = ""
# File with text to generate. Ignores the text above.
gen_file = "infer/examples/multi/story.txt"
remove_silence = true
output_dir = "tests"

[voices.town]
ref_audio = "infer/examples/multi/town.flac"
ref_text = ""

[voices.country]
ref_audio = "infer/examples/multi/country.flac"
ref_text = ""
```

Mark the voice with `[main]`, `[town]`, or `[country]` whenever you want to change voice; refer to `src/f5_tts/infer/examples/multi/story.txt`.

## Speech Editing

To test speech editing capabilities, use the following command:

```bash
python src/f5_tts/infer/speech_edit.py
```

## Socket Realtime Client

To communicate with the socket server, first run it:

```bash
python src/f5_tts/socket_server.py
```
Then create a client to communicate with it:

```bash
# If PyAudio not installed
sudo apt-get install portaudio19-dev
pip install pyaudio
```

```python
# Create the socket_client.py
import socket
import asyncio
import pyaudio
import numpy as np
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


async def listen_to_F5TTS(text, server_ip="localhost", server_port=9998):
    client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Blocking socket calls are run in the default executor so the event loop stays free
    await asyncio.get_event_loop().run_in_executor(None, client_socket.connect, (server_ip, int(server_port)))

    start_time = time.time()
    first_chunk_time = None

    async def play_audio_stream():
        nonlocal first_chunk_time
        p = pyaudio.PyAudio()
        # Mono float32 output stream at 24 kHz
        stream = p.open(format=pyaudio.paFloat32, channels=1, rate=24000, output=True, frames_per_buffer=2048)

        try:
            while True:
                data = await asyncio.get_event_loop().run_in_executor(None, client_socket.recv, 8192)
                if not data:
                    break
                if data == b"END":
                    logger.info("End of audio received.")
                    break

                audio_array = np.frombuffer(data, dtype=np.float32)
                stream.write(audio_array.tobytes())

                # Record when the first audio chunk arrived
                if first_chunk_time is None:
                    first_chunk_time = time.time()

        finally:
            stream.stop_stream()
            stream.close()
            p.terminate()

        logger.info(f"Total time taken: {time.time() - start_time:.4f} seconds")

    try:
        data_to_send = f"{text}".encode("utf-8")
        await asyncio.get_event_loop().run_in_executor(None, client_socket.sendall, data_to_send)
        await play_audio_stream()

    except Exception as e:
        logger.error(f"Error in listen_to_F5TTS: {e}")

    finally:
        client_socket.close()


if __name__ == "__main__":
    text_to_send = "As a Reader assistant, I'm familiar with new technology. which are key to its improved performance in terms of both training speed and inference efficiency. Let's break down the components"

    asyncio.run(listen_to_F5TTS(text_to_send))
```
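
One caveat with the client above: TCP is a byte stream, so `recv` may return a chunk whose length is not a multiple of 4 bytes (in which case `np.frombuffer(..., dtype=np.float32)` raises `ValueError`), and the `END` marker may arrive appended to audio bytes instead of on its own. The following is a minimal sketch of a buffering helper, assuming the server sends whole float32 samples and finishes with the bytes `END`, as the `data == b"END"` check suggests. `AudioStreamBuffer`, `feed`, and `finished` are hypothetical names for illustration, not part of F5-TTS.

```python
import numpy as np

END_MARKER = b"END"  # assumed end-of-stream marker, taken from the client above
SAMPLE_BYTES = np.dtype(np.float32).itemsize  # 4 bytes per float32 sample


class AudioStreamBuffer:
    """Hypothetical helper: reassembles float32 samples from arbitrary TCP chunks."""

    def __init__(self):
        self._pending = b""  # bytes held back until a full sample is available
        self.finished = False  # set once the end marker has been seen

    def feed(self, chunk: bytes) -> bytes:
        """Add a received chunk and return the bytes that are safe to play now."""
        data = self._pending + chunk
        # Heuristic: a trailing marker that follows whole samples ends the stream.
        # Audio bytes could in principle mimic the marker, so this is not foolproof.
        body_len = len(data) - len(END_MARKER)
        if data.endswith(END_MARKER) and body_len % SAMPLE_BYTES == 0:
            data = data[:body_len]
            self.finished = True
        # Only release whole samples; keep the remainder for the next chunk.
        usable = len(data) - (len(data) % SAMPLE_BYTES)
        self._pending = data[usable:]
        return data[:usable]
```

In the receive loop, this could replace the direct `np.frombuffer(data, ...)` call: pass each `recv` result to `feed`, write the returned bytes to the PyAudio stream when they are non-empty, and leave the loop once `finished` is true.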