Inference

The pretrained model checkpoints are available at 🤗 Hugging Face and 🤖 Model Scope, and will be automatically downloaded when running inference scripts.

More checkpoints, contributed by the community and supporting more languages, can be found in SHARED.md.

A single generation currently supports up to 30s of audio, which is the total length including both the prompt and the output audio. However, you can provide infer_cli and infer_gradio with longer text, and they will automatically perform chunk generation. Long reference audio will be clipped to ~15s.
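The chunking behavior can be pictured as splitting the text on sentence punctuation under a character budget. The sketch below only illustrates the idea; the splitting rule and the max_chars value are assumptions, not the actual implementation in infer_cli:

```python
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text into chunks of at most max_chars characters,
    breaking at sentence punctuation so no sentence is cut mid-way
    (illustrative only)."""
    sentences = re.split(r"(?<=[.;:,?!])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then synthesized separately and the results are concatenated.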

To avoid possible inference failures, make sure you have read through the following instructions.

  • Use reference audio shorter than 15s and leave some silence (e.g. 1s) at the end. Otherwise there is a risk of truncation in the middle of a word, leading to suboptimal generation.
  • Uppercase letters are uttered letter by letter, so use lowercase letters for normal words.
  • Add spaces (blank: " ") or punctuation (e.g. "," ".") to explicitly introduce pauses.
  • Preprocess numbers into Chinese characters if you want them read in Chinese; otherwise they are read in English.
  • If the generation output is blank (pure silence), check that ffmpeg is installed (various tutorials are available online: blogs, videos, etc.).
  • Try turning off use_ema if using an early-stage finetuned checkpoint (one trained for only a few updates).
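For the number-preprocessing tip above, a minimal digit-by-digit mapping can be sketched as follows; full normalization (e.g. 123 → 一百二十三, dates, amounts) would need a dedicated library, which is an assumption beyond what this repo ships:

```python
# Minimal digit-by-digit mapping; illustrative only.
DIGIT_TO_CN = dict(zip("0123456789", "零一二三四五六七八九"))

def digits_to_chinese(text: str) -> str:
    """Replace each ASCII digit with its Chinese numeral character,
    leaving all other characters untouched."""
    return "".join(DIGIT_TO_CN.get(ch, ch) for ch in text)
```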

Gradio App

Currently supported features include basic TTS with chunk inference, multi-style / multi-speaker generation, and voice chat.

The cli command f5-tts_infer-gradio is equivalent to python src/f5_tts/infer/infer_gradio.py; it launches a Gradio app (web interface) for inference.

The script will load model checkpoints from Hugging Face. You can also manually download files and update the path passed to load_model() in infer_gradio.py. Only the TTS model is loaded at first; an ASR model is loaded for transcription if ref_text is not provided, and an LLM model is loaded if Voice Chat is used.

More flag options:

# Automatically launch the interface in the default web browser
f5-tts_infer-gradio --inbrowser

# Set the root path of the application, if it's not served from the root ("/") of the domain
# For example, if the application is served at "https://example.com/myapp"
f5-tts_infer-gradio --root_path "/myapp"

It can also be used as a component of a larger application:

import gradio as gr
from f5_tts.infer.infer_gradio import app

with gr.Blocks() as main_app:
    gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")

    # ... other Gradio components

    app.render()

main_app.launch()

CLI Inference

The cli command f5-tts_infer-cli is equivalent to python src/f5_tts/infer/infer_cli.py, a command line tool for inference.

The script will load model checkpoints from Hugging Face. You can also manually download files and use --ckpt_file to specify the model to load, or update the path directly in infer_cli.py.

To use a custom vocab.txt, pass it with --vocab_file.

Basic inference with flags:

# Leaving --ref_text "" will have the ASR model transcribe (extra GPU memory usage)
f5-tts_infer-cli \
--model "F5-TTS" \
--ref_audio "ref_audio.wav" \
--ref_text "The content, subtitle or transcription of reference audio." \
--gen_text "Some text you want the TTS model to generate for you."

# Choose Vocoder
f5-tts_infer-cli --vocoder_name bigvgan --load_vocoder_from_local --ckpt_file <YOUR_CKPT_PATH, eg:ckpts/F5TTS_Base_bigvgan/model_1250000.pt>
f5-tts_infer-cli --vocoder_name vocos --load_vocoder_from_local --ckpt_file <YOUR_CKPT_PATH, eg:ckpts/F5TTS_Base/model_1200000.safetensors>

# More instructions
f5-tts_infer-cli --help

A .toml file allows for more flexible usage:

f5-tts_infer-cli -c custom.toml

For example, you can use a .toml file to pass in variables; refer to src/f5_tts/infer/examples/basic/basic.toml:

# F5-TTS | E2-TTS
model = "F5-TTS"
ref_audio = "infer/examples/basic/basic_ref_en.wav"
# If an empty "", transcribes the reference audio automatically.
ref_text = "Some call me nature, others call me mother nature."
gen_text = "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring."
# File with text to generate. Ignores the text above.
gen_file = ""
remove_silence = false
output_dir = "tests"

You can also leverage a .toml file to do multi-style generation; refer to src/f5_tts/infer/examples/multi/story.toml.

# F5-TTS | E2-TTS
model = "F5-TTS"
ref_audio = "infer/examples/multi/main.flac"
# If an empty "", transcribes the reference audio automatically.
ref_text = ""
gen_text = ""
# File with text to generate. Ignores the text above.
gen_file = "infer/examples/multi/story.txt"
remove_silence = true
output_dir = "tests"

[voices.town]
ref_audio = "infer/examples/multi/town.flac"
ref_text = ""

[voices.country]
ref_audio = "infer/examples/multi/country.flac"
ref_text = ""

Mark the text with [main], [town], or [country] wherever you want to switch voices; refer to src/f5_tts/infer/examples/multi/story.txt.
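For illustration, a story.txt along these lines (hypothetical content; see the actual file in the repo) switches voices inline:

[main] A traveler set out at dawn.
[town] Welcome to our town, stranger!
[main] He thanked them and continued on his way.
[country] Out here, the fields stretch for miles.

Each marker switches generation to the corresponding voice defined in the .toml file, with [main] referring to the top-level reference audio.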

Speech Editing

To test speech editing capabilities, use the following command:

python src/f5_tts/infer/speech_edit.py

Socket Realtime Client

To communicate with the socket server, first run:

python src/f5_tts/socket_server.py

Then create a client to communicate. If PyAudio is not installed:

sudo apt-get install portaudio19-dev
pip install pyaudio

Create socket_client.py:

import socket
import asyncio
import pyaudio
import numpy as np
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


async def listen_to_F5TTS(text, server_ip="localhost", server_port=9998):
    client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    await asyncio.get_event_loop().run_in_executor(None, client_socket.connect, (server_ip, int(server_port)))

    start_time = time.time()
    first_chunk_time = None

    async def play_audio_stream():
        nonlocal first_chunk_time
        p = pyaudio.PyAudio()
        stream = p.open(format=pyaudio.paFloat32, channels=1, rate=24000, output=True, frames_per_buffer=2048)

        try:
            while True:
                data = await asyncio.get_event_loop().run_in_executor(None, client_socket.recv, 8192)
                if not data:
                    break
                if data == b"END":
                    logger.info("End of audio received.")
                    break

                audio_array = np.frombuffer(data, dtype=np.float32)
                stream.write(audio_array.tobytes())

                if first_chunk_time is None:
                    first_chunk_time = time.time()

        finally:
            stream.stop_stream()
            stream.close()
            p.terminate()

        logger.info(f"Total time taken: {time.time() - start_time:.4f} seconds")

    try:
        data_to_send = text.encode("utf-8")
        await asyncio.get_event_loop().run_in_executor(None, client_socket.sendall, data_to_send)
        await play_audio_stream()

    except Exception as e:
        logger.error(f"Error in listen_to_F5TTS: {e}")

    finally:
        client_socket.close()


if __name__ == "__main__":
    text_to_send = "As a Reader assistant, I'm familiar with new technology, which is key to improved performance in terms of both training speed and inference efficiency. Let's break down the components."

    asyncio.run(listen_to_F5TTS(text_to_send))