Spaces:

danhtran2mind
/

Vi-F5-TTS

Running

App Files Files Community

Vi-F5-TTS / docs /inference /inference_doc.md

danhtran2mind

Upload 244 files

3f9cba0 verified 6 months ago

preview code

raw

history blame contribute delete

4.84 kB

	# Inference Arguments

	The following table describes the command-line arguments available for the `infer-cli.py` script, which is used for text-to-speech (TTS) inference with advanced batch processing capabilities. These arguments allow users to override settings defined in the configuration file (`basic.toml` by default).

	\| Argument \| Description \| Type \| Default Value \| Notes \|
	\|----------\|-------------\|------\|---------------\|-------\|
	\| `-c`, `--config` \| Path to the configuration file. \| `str` \| `f5_tts/infer/examples/basic/basic.toml` \| Specifies the TOML configuration file to use. \|
	\| `-m`, `--model` \| Model name to use for inference. \| `str` \| `F5TTS_v1_Base` (from config) \| Options: `F5TTS_v1_Base`, `F5TTS_Base`, `E2TTS_Base`, etc. \|
	\| `-mc`, `--model_cfg` \| Path to the model's YAML configuration file. \| `str` \| `configs/<model>.yaml` (from config) \| Defines model-specific settings. \|
	\| `-p`, `--ckpt_file` \| Path to the model checkpoint file (.pt). \| `str` \| (from config) \| Leave blank to use default checkpoint. \|
	\| `-v`, `--vocab_file` \| Path to the vocabulary file (.txt). \| `str` \| (from config) \| Leave blank to use default vocabulary. \|
	\| `-r`, `--ref_audio` \| Path to the reference audio file. \| `str` \| `infer/examples/basic/basic_ref_en.wav` (from config) \| Used as a reference for voice synthesis. \|
	\| `-s`, `--ref_text` \| Transcript or subtitle for the reference audio. \| `str` \| `Some call me nature, others call me mother nature.` (from config) \| Text corresponding to the reference audio. \|
	\| `-t`, `--gen_text` \| Text to synthesize into speech. \| `str` \| `Here we generate something just for test.` (from config) \| Ignored if `--gen_file` is provided. \|
	\| `-f`, `--gen_file` \| Path to a file containing text to synthesize. \| `str` \| (from config) \| Overrides `--gen_text` if specified. \|
	\| `-o`, `--output_dir` \| Path to the output directory. \| `str` \| `tests` (from config) \| Directory where generated audio files are saved. \|
	\| `-w`, `--output_file` \| Name of the output audio file. \| `str` \| `infer_cli_<timestamp>.wav` (from config) \| Timestamp format: `%Y%m%d_%H%M%S`. \|
	\| `--save_chunk` \| Save individual audio chunks during inference. \| `bool` \| `False` (from config) \| If enabled, saves chunks to `<output_dir>/<output_file>_chunks/`. \|
	\| `--no_legacy_text` \| Disable lossy ASCII transliteration for Unicode text in file names. \| `bool` \| `False` (from config) \| If disabled, uses Unicode in file names; warns if used with `--save_chunk`. \|
	\| `--remove_silence` \| Remove long silences from the generated audio. \| `bool` \| `False` (from config) \| Applies silence removal post-processing. \|
	\| `--load_vocoder_from_local` \| Load vocoder from a local directory. \| `bool` \| `False` (from config) \| Uses `../checkpoints/vocos-mel-24khz` or similar if enabled. \|
	\| `--vocoder_name` \| Name of the vocoder to use. \| `str` \| (from config, defaults to `mel_spec_type`) \| Options: `vocos`, `bigvgan`. \|
	\| `--target_rms` \| Target loudness normalization value for output speech. \| `float` \| (from config, defaults to `target_rms`) \| Adjusts audio loudness. \|
	\| `--cross_fade_duration` \| Duration of cross-fade between audio segments (seconds). \| `float` \| (from config, defaults to `cross_fade_duration`) \| Smooths transitions between segments. \|
	\| `--nfe_step` \| Number of function evaluation (denoising) steps. \| `int` \| (from config, defaults to `nfe_step`) \| Controls inference quality. \|
	\| `--cfg_strength` \| Classifier-free guidance strength. \| `float` \| (from config, defaults to `cfg_strength`) \| Influences generation quality. \|
	\| `--sway_sampling_coef` \| Sway sampling coefficient. \| `float` \| (from config, defaults to `sway_sampling_coef`) \| Affects sampling behavior. \|
	\| `--speed` \| Speed of the generated audio. \| `float` \| (from config, defaults to `speed`) \| Adjusts playback speed. \|
	\| `--fix_duration` \| Fixed total duration for reference and generated audio (seconds). \| `float` \| (from config, defaults to `fix_duration`) \| Enforces a specific duration. \|
	\| `--device` \| Device to run inference on. \| `str` \| (from config, defaults to `device`) \| E.g., `cpu`, `cuda`. \|

	## Notes
	- Arguments without default values in the script (e.g., `--model`, `--ref_audio`) inherit defaults from the configuration file.
	- The `--no_legacy_text` flag is implemented as `store_false`, so enabling it sets `use_legacy_text` to `False`.
	- If `--gen_file` is provided, it overrides `--gen_text`.
	- The script supports multiple voices defined in the config file under the `voices` key, with a fallback to a `main` voice.
	- The output audio is saved as a WAV file, and optional chunked audio segments can be saved if `--save_chunk` is enabled.
	- The script uses `cached_path` for downloading model checkpoints from Hugging Face if no local checkpoint is specified.

	# Inference Arguments

	The following table describes the command-line arguments available for the `infer-cli.py` script, which is used for text-to-speech (TTS) inference with advanced batch processing capabilities. These arguments allow users to override settings defined in the configuration file (`basic.toml` by default).

	\| Argument \| Description \| Type \| Default Value \| Notes \|
	\|----------\|-------------\|------\|---------------\|-------\|
	\| `-c`, `--config` \| Path to the configuration file. \| `str` \| `f5_tts/infer/examples/basic/basic.toml` \| Specifies the TOML configuration file to use. \|
	\| `-m`, `--model` \| Model name to use for inference. \| `str` \| `F5TTS_v1_Base` (from config) \| Options: `F5TTS_v1_Base`, `F5TTS_Base`, `E2TTS_Base`, etc. \|
	\| `-mc`, `--model_cfg` \| Path to the model's YAML configuration file. \| `str` \| `configs/<model>.yaml` (from config) \| Defines model-specific settings. \|
	\| `-p`, `--ckpt_file` \| Path to the model checkpoint file (.pt). \| `str` \| (from config) \| Leave blank to use default checkpoint. \|
	\| `-v`, `--vocab_file` \| Path to the vocabulary file (.txt). \| `str` \| (from config) \| Leave blank to use default vocabulary. \|
	\| `-r`, `--ref_audio` \| Path to the reference audio file. \| `str` \| `infer/examples/basic/basic_ref_en.wav` (from config) \| Used as a reference for voice synthesis. \|
	\| `-s`, `--ref_text` \| Transcript or subtitle for the reference audio. \| `str` \| `Some call me nature, others call me mother nature.` (from config) \| Text corresponding to the reference audio. \|
	\| `-t`, `--gen_text` \| Text to synthesize into speech. \| `str` \| `Here we generate something just for test.` (from config) \| Ignored if `--gen_file` is provided. \|
	\| `-f`, `--gen_file` \| Path to a file containing text to synthesize. \| `str` \| (from config) \| Overrides `--gen_text` if specified. \|
	\| `-o`, `--output_dir` \| Path to the output directory. \| `str` \| `tests` (from config) \| Directory where generated audio files are saved. \|
	\| `-w`, `--output_file` \| Name of the output audio file. \| `str` \| `infer_cli_<timestamp>.wav` (from config) \| Timestamp format: `%Y%m%d_%H%M%S`. \|
	\| `--save_chunk` \| Save individual audio chunks during inference. \| `bool` \| `False` (from config) \| If enabled, saves chunks to `<output_dir>/<output_file>_chunks/`. \|
	\| `--no_legacy_text` \| Disable lossy ASCII transliteration for Unicode text in file names. \| `bool` \| `False` (from config) \| If disabled, uses Unicode in file names; warns if used with `--save_chunk`. \|
	\| `--remove_silence` \| Remove long silences from the generated audio. \| `bool` \| `False` (from config) \| Applies silence removal post-processing. \|
	\| `--load_vocoder_from_local` \| Load vocoder from a local directory. \| `bool` \| `False` (from config) \| Uses `../checkpoints/vocos-mel-24khz` or similar if enabled. \|
	\| `--vocoder_name` \| Name of the vocoder to use. \| `str` \| (from config, defaults to `mel_spec_type`) \| Options: `vocos`, `bigvgan`. \|
	\| `--target_rms` \| Target loudness normalization value for output speech. \| `float` \| (from config, defaults to `target_rms`) \| Adjusts audio loudness. \|
	\| `--cross_fade_duration` \| Duration of cross-fade between audio segments (seconds). \| `float` \| (from config, defaults to `cross_fade_duration`) \| Smooths transitions between segments. \|
	\| `--nfe_step` \| Number of function evaluation (denoising) steps. \| `int` \| (from config, defaults to `nfe_step`) \| Controls inference quality. \|
	\| `--cfg_strength` \| Classifier-free guidance strength. \| `float` \| (from config, defaults to `cfg_strength`) \| Influences generation quality. \|
	\| `--sway_sampling_coef` \| Sway sampling coefficient. \| `float` \| (from config, defaults to `sway_sampling_coef`) \| Affects sampling behavior. \|
	\| `--speed` \| Speed of the generated audio. \| `float` \| (from config, defaults to `speed`) \| Adjusts playback speed. \|
	\| `--fix_duration` \| Fixed total duration for reference and generated audio (seconds). \| `float` \| (from config, defaults to `fix_duration`) \| Enforces a specific duration. \|
	\| `--device` \| Device to run inference on. \| `str` \| (from config, defaults to `device`) \| E.g., `cpu`, `cuda`. \|

	## Notes
	- Arguments without default values in the script (e.g., `--model`, `--ref_audio`) inherit defaults from the configuration file.
	- The `--no_legacy_text` flag is implemented as `store_false`, so enabling it sets `use_legacy_text` to `False`.
	- If `--gen_file` is provided, it overrides `--gen_text`.
	- The script supports multiple voices defined in the config file under the `voices` key, with a fallback to a `main` voice.
	- The output audio is saved as a WAV file, and optional chunked audio segments can be saved if `--save_chunk` is enabled.
	- The script uses `cached_path` for downloading model checkpoints from Hugging Face if no local checkpoint is specified.