Inference

The pretrained model checkpoints are available at 🤗 Hugging Face and 🤖 Model Scope, and will be automatically downloaded when running inference scripts.

More checkpoints, contributed by the community and supporting more languages, can be found in SHARED.md.

A single generation currently supports up to 30s of audio, which is the total length including both the prompt and the output audio. However, you can provide infer_cli and infer_gradio with longer text, and they will automatically perform chunk generation. Long reference audio will be clipped to ~15s.
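The chunking behavior can be pictured as splitting the text on sentence punctuation under a character budget. The sketch below only illustrates the idea; the splitting rule and the max_chars value are assumptions, not the actual implementation in infer_cli:

```python
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text into chunks of at most max_chars characters,
    breaking at sentence punctuation so no sentence is cut mid-way
    (illustrative only)."""
    sentences = re.split(r"(?<=[.;:,?!])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then synthesized separately and the results are concatenated.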

To avoid possible inference failures, make sure you have read through the following instructions.

  • Use reference audio shorter than 15s and leave some silence (e.g. 1s) at the end. Otherwise there is a risk of truncation in the middle of a word, leading to suboptimal generation.
  • Uppercase letters are uttered letter by letter, so use lowercase letters for normal words.
  • Add spaces (blank: " ") or punctuation (e.g. "," ".") to explicitly introduce pauses.
  • Preprocess numbers into Chinese characters if you want them read in Chinese; otherwise they are read in English.
  • If the generation output is blank (pure silence), check that ffmpeg is installed (various tutorials are available online: blogs, videos, etc.).
  • Try turning off use_ema if using an early-stage finetuned checkpoint (one trained for only a few updates).
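For the number-preprocessing tip above, a minimal digit-by-digit mapping can be sketched as follows; full normalization (e.g. 123 → 一百二十三, dates, amounts) would need a dedicated library, which is an assumption beyond what this repo ships:

```python
# Minimal digit-by-digit mapping; illustrative only.
DIGIT_TO_CN = dict(zip("0123456789", "零一二三四五六七八九"))

def digits_to_chinese(text: str) -> str:
    """Replace each ASCII digit with its Chinese numeral character,
    leaving all other characters untouched."""
    return "".join(DIGIT_TO_CN.get(ch, ch) for ch in text)
```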

Gradio App

Currently supported features include basic TTS with chunk inference, multi-style / multi-speaker generation, and voice chat.

The cli command f5-tts_infer-gradio is equivalent to python src/f5_tts/infer/infer_gradio.py; it launches a Gradio app (web interface) for inference.

The script will load model checkpoints from Hugging Face. You can also manually download files and update the path passed to load_model() in infer_gradio.py. Only the TTS model is loaded at first; an ASR model is loaded for transcription if ref_text is not provided, and an LLM model is loaded if Voice Chat is used.

More flag options:

# Automatically launch the interface in the default web browser
f5-tts_infer-gradio --inbrowser

# Set the root path of the application, if it's not served from the root ("/") of the domain
# For example, if the application is served at "https://example.com/myapp"
f5-tts_infer-gradio --root_path "/myapp"

It can also be used as a component of a larger application:

import gradio as gr
from f5_tts.infer.infer_gradio import app

with gr.Blocks() as main_app:
    gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")

    # ... other Gradio components

    app.render()

main_app.launch()

CLI Inference

The cli command f5-tts_infer-cli is equivalent to python src/f5_tts/infer/infer_cli.py, a command line tool for inference.

The script will load model checkpoints from Hugging Face. You can also manually download files and use --ckpt_file to specify the model to load, or update the path directly in infer_cli.py.

To use a custom vocab.txt, pass it with --vocab_file.

Basic inference with flags:

# Leaving --ref_text "" will have the ASR model transcribe (extra GPU memory usage)
f5-tts_infer-cli \
--model "F5-TTS" \
--ref_audio "ref_audio.wav" \
--ref_text "The content, subtitle or transcription of reference audio." \
--gen_text "Some text you want the TTS model to generate for you."

# Choose Vocoder
f5-tts_infer-cli --vocoder_name bigvgan --load_vocoder_from_local --ckpt_file <YOUR_CKPT_PATH, eg:ckpts/F5TTS_Base_bigvgan/model_1250000.pt>
f5-tts_infer-cli --vocoder_name vocos --load_vocoder_from_local --ckpt_file <YOUR_CKPT_PATH, eg:ckpts/F5TTS_Base/model_1200000.safetensors>

# More instructions
f5-tts_infer-cli --help

A .toml file allows for more flexible usage:

f5-tts_infer-cli -c custom.toml

For example, you can use a .toml file to pass in variables; refer to src/f5_tts/infer/examples/basic/basic.toml:

# F5-TTS | E2-TTS
model = "F5-TTS"
ref_audio = "infer/examples/basic/basic_ref_en.wav"
# If an empty "", transcribes the reference audio automatically.
ref_text = "Some call me nature, others call me mother nature."
gen_text = "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring."
# File with text to generate. Ignores the text above.
gen_file = ""
remove_silence = false
output_dir = "tests"

You can also leverage a .toml file to do multi-style generation; refer to src/f5_tts/infer/examples/multi/story.toml.

# F5-TTS | E2-TTS
model = "F5-TTS"
ref_audio = "infer/examples/multi/main.flac"
# If an empty "", transcribes the reference audio automatically.
ref_text = ""
gen_text = ""
# File with text to generate. Ignores the text above.
gen_file = "infer/examples/multi/story.txt"
remove_silence = true
output_dir = "tests"

[voices.town]
ref_audio = "infer/examples/multi/town.flac"
ref_text = ""

[voices.country]
ref_audio = "infer/examples/multi/country.flac"
ref_text = ""

Mark the text with [main], [town], or [country] wherever you want to switch voices; refer to src/f5_tts/infer/examples/multi/story.txt.
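For illustration, a story.txt along these lines (hypothetical content; see the actual file in the repo) switches voices inline:

[main] A traveler set out at dawn.
[town] Welcome to our town, stranger!
[main] He thanked them and continued on his way.
[country] Out here, the fields stretch for miles.

Each marker switches generation to the corresponding voice defined in the .toml file, with [main] referring to the top-level reference audio.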

Speech Editing

To test speech editing capabilities, use the following command:

python src/f5_tts/infer/speech_edit.py

Socket Realtime Client

To communicate with the socket server, first run:

python src/f5_tts/socket_server.py

Then create a client to communicate. If PyAudio is not installed:

sudo apt-get install portaudio19-dev
pip install pyaudio

Create socket_client.py:

import socket
import asyncio
import pyaudio
import numpy as np
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


async def listen_to_F5TTS(text, server_ip="localhost", server_port=9998):
    client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    await asyncio.get_event_loop().run_in_executor(None, client_socket.connect, (server_ip, int(server_port)))

    start_time = time.time()
    first_chunk_time = None

    async def play_audio_stream():
        nonlocal first_chunk_time
        p = pyaudio.PyAudio()
        stream = p.open(format=pyaudio.paFloat32, channels=1, rate=24000, output=True, frames_per_buffer=2048)

        try:
            while True:
                data = await asyncio.get_event_loop().run_in_executor(None, client_socket.recv, 8192)
                if not data:
                    break
                if data == b"END":
                    logger.info("End of audio received.")
                    break

                audio_array = np.frombuffer(data, dtype=np.float32)
                stream.write(audio_array.tobytes())

                if first_chunk_time is None:
                    first_chunk_time = time.time()

        finally:
            stream.stop_stream()
            stream.close()
            p.terminate()

        logger.info(f"Total time taken: {time.time() - start_time:.4f} seconds")

    try:
        data_to_send = text.encode("utf-8")
        await asyncio.get_event_loop().run_in_executor(None, client_socket.sendall, data_to_send)
        await play_audio_stream()

    except Exception as e:
        logger.error(f"Error in listen_to_F5TTS: {e}")

    finally:
        client_socket.close()


if __name__ == "__main__":
    text_to_send = "As a Reader assistant, I'm familiar with new technology, which is key to improved performance in terms of both training speed and inference efficiency. Let's break down the components."

    asyncio.run(listen_to_F5TTS(text_to_send))