---
license: mit
language:
- en
pipeline_tag: text-to-speech
---

# wfloat-tts

`wfloat-tts` is a lightweight multi-speaker English VITS text-to-speech model with speaker, emotion, and intensity control.

This repo includes:

- `model.safetensors`: inference weights
- `config.json`: model config and token mapping
- `src/wfloat_tts/`: a small Python inference helper

The repo is set up for standalone inference from the released model files. You do not need the original training codebase to synthesize speech with it.

## Sample Outputs

### `mad_scientist_woman` surprise

- Audio: [samples/08_mad_scientist_woman_surprise_080.wav](https://huggingface.co/Wfloat/wfloat-tts/resolve/main/samples/08_mad_scientist_woman_surprise_080.wav)
- Input text: "No, no, that's not possible. The formula should have crystallized, but it adapted instead. Do you realize what that means for the rest of my work?"
- `sid`: `7`
- `emotion`: `surprise`
- `intensity`: `0.8`

<audio controls>
<source src="https://huggingface.co/Wfloat/wfloat-tts/resolve/main/samples/08_mad_scientist_woman_surprise_080.wav" type="audio/wav">
</audio>

### `fun_hero_woman` joy

- Audio: [samples/04_fun_hero_woman_joy_070.wav](https://huggingface.co/Wfloat/wfloat-tts/resolve/main/samples/04_fun_hero_woman_joy_070.wav)
- Input text: "Come on, keep up! The crowd is cheering."
- `sid`: `3`
- `emotion`: `joy`
- `intensity`: `0.7`

<audio controls>
<source src="https://huggingface.co/Wfloat/wfloat-tts/resolve/main/samples/04_fun_hero_woman_joy_070.wav" type="audio/wav">
</audio>

### `strong_hero_man` anger

- Audio: [samples/05_strong_hero_man_anger_080.wav](https://huggingface.co/Wfloat/wfloat-tts/resolve/main/samples/05_strong_hero_man_anger_080.wav)
- Input text: "Enough. You had your warning, and you kept pushing innocent people around. Take one more step, and I end this."
- `sid`: `4`
- `emotion`: `anger`
- `intensity`: `0.8`

<audio controls>
<source src="https://huggingface.co/Wfloat/wfloat-tts/resolve/main/samples/05_strong_hero_man_anger_080.wav" type="audio/wav">
</audio>

Find more examples in the [samples folder](https://huggingface.co/Wfloat/wfloat-tts/tree/main/samples).

## Inputs

The intended inference inputs are:

- `text`: the utterance to synthesize
- `sid`: numeric speaker id
- `emotion`: emotion label
- `intensity`: value from `0.0` to `1.0`

You do not need to pass raw control symbols. The Python helper converts `emotion` and `intensity` into the control tokens the model was trained on.

## Install

```bash
pip install -e .
pip install "piper-phonemize==1.3.0" -f https://k2-fsa.github.io/icefall/piper_phonemize
```

Runtime dependencies:

- `torch`
- `numpy`
- `safetensors`
- `piper-phonemize`

`piper-phonemize` is installed separately because the currently recommended wheels are hosted at:

- https://k2-fsa.github.io/icefall/piper_phonemize

## Python Example

```python
from wfloat_tts import load_generator, write_wave

generator = load_generator(
    checkpoint_path="model.safetensors",
    config_path="config.json",
)

audio = generator.generate(
    text="Hey there, how are you today?",
    sid=11,
    emotion="neutral",
    intensity=0.5,
)

write_wave("out.wav", audio.samples, audio.sample_rate)
```

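For reference, a minimal stand-in for a WAV writer with the same shape as `write_wave` can be built from the standard-library `wave` module. This is a sketch under the assumption that `audio.samples` is a mono float array in `[-1.0, 1.0]`; it is not the package's actual implementation:

```python
import wave

import numpy as np


def write_wave_sketch(path: str, samples: np.ndarray, sample_rate: int) -> None:
    """Write mono float samples in [-1.0, 1.0] to `path` as 16-bit PCM."""
    # Clip to the valid range, then scale to the int16 range.
    pcm = (np.clip(samples, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)            # mono
        wf.setsampwidth(2)            # 16-bit samples
        wf.setframerate(sample_rate)  # e.g. 22050
        wf.writeframes(pcm.tobytes())
```

If the shipped `write_wave` expects a different dtype or channel layout, defer to its docstring.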
## How It Is Conditioned

This model was trained to condition on:

- speaker id
- one emotion control token
- one intensity control token

The reference inference path processes a full utterance, appends one emotion token and one intensity token for the whole utterance, and runs synthesis over that full sequence.

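In code, this conditioning amounts to appending two extra token ids to the phonemized utterance before synthesis. The ids below are invented for illustration only; the real ids come from the token mapping in `config.json`:

```python
# Illustrative only: the actual ids are defined by config.json's token mapping.
EMOTION_IDS = {"neutral": 100, "joy": 101, "anger": 102}  # hypothetical ids
INTENSITY_BASE_ID = 200                                   # hypothetical base id


def condition_sequence(phoneme_ids: list[int], emotion: str, level: int) -> list[int]:
    """Append one emotion token and one intensity token for the whole utterance."""
    return phoneme_ids + [EMOTION_IDS[emotion], INTENSITY_BASE_ID + level]


print(condition_sequence([5, 9, 12], "joy", 7))  # [5, 9, 12, 101, 207]
```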
## Speaker IDs

Use numeric `sid` values:

| Speaker | SID |
| --- | ---: |
| `skilled_hero_man` | 0 |
| `skilled_hero_woman` | 1 |
| `fun_hero_man` | 2 |
| `fun_hero_woman` | 3 |
| `strong_hero_man` | 4 |
| `strong_hero_woman` | 5 |
| `mad_scientist_man` | 6 |
| `mad_scientist_woman` | 7 |
| `clever_villain_man` | 8 |
| `clever_villain_woman` | 9 |
| `narrator_man` | 10 |
| `narrator_woman` | 11 |
| `wise_elder_man` | 12 |
| `wise_elder_woman` | 13 |
| `outgoing_anime_man` | 14 |
| `outgoing_anime_woman` | 15 |
| `scary_villain_man` | 16 |
| `scary_villain_woman` | 17 |
| `news_reporter_man` | 18 |
| `news_reporter_woman` | 19 |

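If you would rather refer to speakers by name, the table above can be transcribed into a small lookup (this dict is just a convenience for your own scripts, not part of the shipped package):

```python
# Name -> sid mapping, transcribed from the speaker table above.
SPEAKER_IDS = {
    "skilled_hero_man": 0, "skilled_hero_woman": 1,
    "fun_hero_man": 2, "fun_hero_woman": 3,
    "strong_hero_man": 4, "strong_hero_woman": 5,
    "mad_scientist_man": 6, "mad_scientist_woman": 7,
    "clever_villain_man": 8, "clever_villain_woman": 9,
    "narrator_man": 10, "narrator_woman": 11,
    "wise_elder_man": 12, "wise_elder_woman": 13,
    "outgoing_anime_man": 14, "outgoing_anime_woman": 15,
    "scary_villain_man": 16, "scary_villain_woman": 17,
    "news_reporter_man": 18, "news_reporter_woman": 19,
}

print(SPEAKER_IDS["narrator_woman"])  # 11
```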
## Emotions

Supported emotion labels:

- `neutral`
- `joy`
- `sadness`
- `anger`
- `fear`
- `surprise`
- `dismissive`
- `confusion`

`intensity` is clamped to the range `[0.0, 1.0]` and mapped to one of ten discrete intensity levels.

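For intuition, the clamp-then-quantize step can be sketched as follows. The exact rounding used by the shipped processor is an assumption here; only the clamping range and the ten-level output are stated by this card:

```python
def quantize_intensity(intensity: float, levels: int = 10) -> int:
    """Clamp to [0.0, 1.0], then map to a discrete level in [0, levels - 1]."""
    clamped = min(max(intensity, 0.0), 1.0)
    # Uniform bucketing; the top edge (1.0) folds into the highest level.
    return min(int(clamped * levels), levels - 1)


print(quantize_intensity(0.8))  # 8
print(quantize_intensity(1.7))  # 9 (clamped to 1.0 first)
```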
## Notes

- `model.safetensors` is the main inference artifact in this repo.
- `config.json` includes the token mapping needed by the processor.
- The current release uses a multi-speaker model with 20 speakers.