---
license: mit
language:
- en
pipeline_tag: text-to-speech
---

# wfloat-tts

`wfloat-tts` is a lightweight multi-speaker English VITS text-to-speech model with speaker, emotion, and intensity control.

This repo includes:

- `model.safetensors`: inference weights
- `config.json`: model config and token mapping
- `src/wfloat_tts/`: a small Python inference helper

The repo is set up for standalone inference from the released model files. You do not need the original training codebase to synthesize speech with it.

## Sample Outputs

### `mad_scientist_woman` surprise

- Audio: [samples/08_mad_scientist_woman_surprise_080.wav](https://huggingface.co/Wfloat/wfloat-tts/resolve/main/samples/08_mad_scientist_woman_surprise_080.wav)
- Input text: "No, no, that's not possible. The formula should have crystallized, but it adapted instead. Do you realize what that means for the rest of my work?"
- `sid`: `7`
- `emotion`: `surprise`
- `intensity`: `0.8`

<audio controls>
<source src="https://huggingface.co/Wfloat/wfloat-tts/resolve/main/samples/08_mad_scientist_woman_surprise_080.wav" type="audio/wav">
</audio>

### `fun_hero_woman` joy

- Audio: [samples/04_fun_hero_woman_joy_070.wav](https://huggingface.co/Wfloat/wfloat-tts/resolve/main/samples/04_fun_hero_woman_joy_070.wav)
- Input text: "Come on, keep up! The crowd is cheering."
- `sid`: `3`
- `emotion`: `joy`
- `intensity`: `0.7`

<audio controls>
<source src="https://huggingface.co/Wfloat/wfloat-tts/resolve/main/samples/04_fun_hero_woman_joy_070.wav" type="audio/wav">
</audio>

### `strong_hero_man` anger

- Audio: [samples/05_strong_hero_man_anger_080.wav](https://huggingface.co/Wfloat/wfloat-tts/resolve/main/samples/05_strong_hero_man_anger_080.wav)
- Input text: "Enough. You had your warning, and you kept pushing innocent people around. Take one more step, and I end this."
- `sid`: `4`
- `emotion`: `anger`
- `intensity`: `0.8`

<audio controls>
<source src="https://huggingface.co/Wfloat/wfloat-tts/resolve/main/samples/05_strong_hero_man_anger_080.wav" type="audio/wav">
</audio>

Find more examples in the [samples folder](https://huggingface.co/Wfloat/wfloat-tts/tree/main/samples).

## Inputs

The intended inference inputs are:

- `text`: the utterance to synthesize
- `sid`: numeric speaker id
- `emotion`: emotion label
- `intensity`: value from `0.0` to `1.0`

You do not need to pass raw control symbols. The Python helper converts `emotion` and `intensity` into the control tokens the model was trained on.

## Install

```bash
pip install -e .
pip install "piper-phonemize==1.3.0" -f https://k2-fsa.github.io/icefall/piper_phonemize
```

Runtime dependencies:

- `torch`
- `numpy`
- `safetensors`
- `piper-phonemize`

`piper-phonemize` is installed separately because the currently recommended wheels are hosted at:

- https://k2-fsa.github.io/icefall/piper_phonemize

## Python Example

```python
from wfloat_tts import load_generator, write_wave

generator = load_generator(
    checkpoint_path="model.safetensors",
    config_path="config.json",
)

audio = generator.generate(
    text="Hey there, how are you today?",
    sid=11,
    emotion="neutral",
    intensity=0.5,
)

write_wave("out.wav", audio.samples, audio.sample_rate)
```

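For reference, a minimal stand-in for a WAV writer with the same shape as `write_wave` can be built from the standard-library `wave` module. This is a sketch under the assumption that `audio.samples` is a mono float array in `[-1.0, 1.0]`; it is not the package's actual implementation:

```python
import wave

import numpy as np


def write_wave_sketch(path: str, samples: np.ndarray, sample_rate: int) -> None:
    """Write mono float samples in [-1.0, 1.0] to `path` as 16-bit PCM."""
    # Clip to the valid range, then scale to the int16 range.
    pcm = (np.clip(samples, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)            # mono
        wf.setsampwidth(2)            # 16-bit samples
        wf.setframerate(sample_rate)  # e.g. 22050
        wf.writeframes(pcm.tobytes())
```

If the shipped `write_wave` expects a different dtype or channel layout, defer to its docstring.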
## How It Is Conditioned

This model was trained to condition on:

- speaker id
- one emotion control token
- one intensity control token

The reference inference path processes a full utterance, appends one emotion token and one intensity token for the whole utterance, and runs synthesis over that full sequence.

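In code, this conditioning amounts to appending two extra token ids to the phonemized utterance before synthesis. The ids below are invented for illustration only; the real ids come from the token mapping in `config.json`:

```python
# Illustrative only: the actual ids are defined by config.json's token mapping.
EMOTION_IDS = {"neutral": 100, "joy": 101, "anger": 102}  # hypothetical ids
INTENSITY_BASE_ID = 200                                   # hypothetical base id


def condition_sequence(phoneme_ids: list[int], emotion: str, level: int) -> list[int]:
    """Append one emotion token and one intensity token for the whole utterance."""
    return phoneme_ids + [EMOTION_IDS[emotion], INTENSITY_BASE_ID + level]


print(condition_sequence([5, 9, 12], "joy", 7))  # [5, 9, 12, 101, 207]
```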
## Speaker IDs

Use numeric `sid` values:

| Speaker | SID |
| --- | ---: |
| `skilled_hero_man` | 0 |
| `skilled_hero_woman` | 1 |
| `fun_hero_man` | 2 |
| `fun_hero_woman` | 3 |
| `strong_hero_man` | 4 |
| `strong_hero_woman` | 5 |
| `mad_scientist_man` | 6 |
| `mad_scientist_woman` | 7 |
| `clever_villain_man` | 8 |
| `clever_villain_woman` | 9 |
| `narrator_man` | 10 |
| `narrator_woman` | 11 |
| `wise_elder_man` | 12 |
| `wise_elder_woman` | 13 |
| `outgoing_anime_man` | 14 |
| `outgoing_anime_woman` | 15 |
| `scary_villain_man` | 16 |
| `scary_villain_woman` | 17 |
| `news_reporter_man` | 18 |
| `news_reporter_woman` | 19 |

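If you would rather refer to speakers by name, the table above can be transcribed into a small lookup (this dict is just a convenience for your own scripts, not part of the shipped package):

```python
# Name -> sid mapping, transcribed from the speaker table above.
SPEAKER_IDS = {
    "skilled_hero_man": 0, "skilled_hero_woman": 1,
    "fun_hero_man": 2, "fun_hero_woman": 3,
    "strong_hero_man": 4, "strong_hero_woman": 5,
    "mad_scientist_man": 6, "mad_scientist_woman": 7,
    "clever_villain_man": 8, "clever_villain_woman": 9,
    "narrator_man": 10, "narrator_woman": 11,
    "wise_elder_man": 12, "wise_elder_woman": 13,
    "outgoing_anime_man": 14, "outgoing_anime_woman": 15,
    "scary_villain_man": 16, "scary_villain_woman": 17,
    "news_reporter_man": 18, "news_reporter_woman": 19,
}

print(SPEAKER_IDS["narrator_woman"])  # 11
```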
## Emotions

Supported emotion labels:

- `neutral`
- `joy`
- `sadness`
- `anger`
- `fear`
- `surprise`
- `dismissive`
- `confusion`

`intensity` is clamped to the range `[0.0, 1.0]` and mapped to one of ten discrete intensity levels.

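For intuition, the clamp-then-quantize step can be sketched as follows. The exact rounding used by the shipped processor is an assumption here; only the clamping range and the ten-level output are stated by this card:

```python
def quantize_intensity(intensity: float, levels: int = 10) -> int:
    """Clamp to [0.0, 1.0], then map to a discrete level in [0, levels - 1]."""
    clamped = min(max(intensity, 0.0), 1.0)
    # Uniform bucketing; the top edge (1.0) folds into the highest level.
    return min(int(clamped * levels), levels - 1)


print(quantize_intensity(0.8))  # 8
print(quantize_intensity(1.7))  # 9 (clamped to 1.0 first)
```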
## Notes

- `model.safetensors` is the main inference artifact in this repo.
- `config.json` includes the token mapping needed by the processor.
- The current release uses a multi-speaker model with 20 speakers.