---
license: mit
language:
- en
pipeline_tag: text-to-speech
---
# wfloat-tts
`wfloat-tts` is a lightweight multi-speaker English VITS text-to-speech model with speaker, emotion, and intensity control.
This repo includes:
- `model.safetensors`: inference weights
- `config.json`: model config and token mapping
- `src/wfloat_tts/`: a small Python inference helper
The repo is set up for standalone inference from the released model files. You do not need the original training codebase to synthesize speech with it.
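If you prefer to fetch the files programmatically instead of cloning the repo, the standard `huggingface_hub` client works; a minimal sketch (assuming `huggingface_hub` is installed):

```python
# Download the inference artifacts from the Hub without cloning the repo.
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download("Wfloat/wfloat-tts", "model.safetensors")
config_path = hf_hub_download("Wfloat/wfloat-tts", "config.json")
print(checkpoint_path, config_path)
```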
## Sample Outputs
### `mad_scientist_woman` surprise
- Audio: [samples/08_mad_scientist_woman_surprise_080.wav](https://huggingface.co/Wfloat/wfloat-tts/resolve/main/samples/08_mad_scientist_woman_surprise_080.wav)
- Input text: "No, no, that's not possible. The formula should have crystallized, but it adapted instead. Do you realize what that means for the rest of my work?"
- `sid`: `7`
- `emotion`: `surprise`
- `intensity`: `0.8`
<audio controls>
<source src="https://huggingface.co/Wfloat/wfloat-tts/resolve/main/samples/08_mad_scientist_woman_surprise_080.wav" type="audio/wav">
</audio>
### `fun_hero_woman` joy
- Audio: [samples/04_fun_hero_woman_joy_070.wav](https://huggingface.co/Wfloat/wfloat-tts/resolve/main/samples/04_fun_hero_woman_joy_070.wav)
- Input text: "Come on, keep up! The crowd is cheering."
- `sid`: `3`
- `emotion`: `joy`
- `intensity`: `0.7`
<audio controls>
<source src="https://huggingface.co/Wfloat/wfloat-tts/resolve/main/samples/04_fun_hero_woman_joy_070.wav" type="audio/wav">
</audio>
### `strong_hero_man` anger
- Audio: [samples/05_strong_hero_man_anger_080.wav](https://huggingface.co/Wfloat/wfloat-tts/resolve/main/samples/05_strong_hero_man_anger_080.wav)
- Input text: "Enough. You had your warning, and you kept pushing innocent people around. Take one more step, and I end this."
- `sid`: `4`
- `emotion`: `anger`
- `intensity`: `0.8`
<audio controls>
<source src="https://huggingface.co/Wfloat/wfloat-tts/resolve/main/samples/05_strong_hero_man_anger_080.wav" type="audio/wav">
</audio>
Find more examples in the [samples folder](https://huggingface.co/Wfloat/wfloat-tts/tree/main/samples).
## Inputs
The intended inference inputs are:
- `text`: the utterance to synthesize
- `sid`: numeric speaker id
- `emotion`: emotion label
- `intensity`: value from `0.0` to `1.0`
You do not need to pass raw control symbols. The Python helper converts `emotion` and `intensity` into the control tokens the model was trained on.
## Install
```bash
pip install -e .
pip install "piper-phonemize==1.3.0" -f https://k2-fsa.github.io/icefall/piper_phonemize
```
Runtime dependencies:
- `torch`
- `numpy`
- `safetensors`
- `piper-phonemize`
`piper-phonemize` is installed separately because the current recommended wheels are hosted here:
- https://k2-fsa.github.io/icefall/piper_phonemize
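A quick way to confirm the wheel installed correctly is to phonemize a sentence (assuming the usual `piper_phonemize` Python binding; the exact return format may differ by version):

```python
# Smoke test: phonemize a short sentence with an espeak-ng voice.
from piper_phonemize import phonemize_espeak

phonemes = phonemize_espeak("Hello world.", "en-us")
print(phonemes)  # a list of per-sentence phoneme lists
```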
## Python Example
```python
from wfloat_tts import load_generator, write_wave

generator = load_generator(
    checkpoint_path="model.safetensors",
    config_path="config.json",
)

audio = generator.generate(
    text="Hey there, how are you today?",
    sid=11,
    emotion="neutral",
    intensity=0.5,
)

write_wave("out.wav", audio.samples, audio.sample_rate)
```
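Using the same helper API, one straightforward extension is to render several emotion variants of one line for comparison (the labels match the emotion list below; the loop itself is just an illustration):

```python
# Render the same line with several emotion variants.
for emotion in ["neutral", "joy", "anger"]:
    audio = generator.generate(
        text="Hey there, how are you today?",
        sid=11,
        emotion=emotion,
        intensity=0.7,
    )
    write_wave(f"out_{emotion}.wav", audio.samples, audio.sample_rate)
```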
## How It Is Conditioned
This model was trained to condition on:
- speaker id
- one emotion control token
- one intensity control token
The reference inference path phonemizes the full utterance, appends a single emotion token and a single intensity token for the whole utterance, and synthesizes the resulting sequence in one pass.
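Conceptually, the conditioning works like the sketch below. This is an illustration of the token layout, not the helper's actual internals; the real token ids come from the mapping in `config.json`:

```python
# Hypothetical illustration of the conditioning layout: phoneme ids for
# the utterance, then one emotion token and one intensity token appended
# for the whole sequence.
phoneme_ids = [12, 41, 7, 33]  # placeholder phoneme ids
emotion_token = 205            # e.g. token id for "joy" (hypothetical)
intensity_token = 217          # e.g. token id for level 7/10 (hypothetical)

model_input = phoneme_ids + [emotion_token, intensity_token]
```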
## Speaker IDs
Use numeric `sid` values:
| Speaker | SID |
| --- | ---: |
| `skilled_hero_man` | 0 |
| `skilled_hero_woman` | 1 |
| `fun_hero_man` | 2 |
| `fun_hero_woman` | 3 |
| `strong_hero_man` | 4 |
| `strong_hero_woman` | 5 |
| `mad_scientist_man` | 6 |
| `mad_scientist_woman` | 7 |
| `clever_villain_man` | 8 |
| `clever_villain_woman` | 9 |
| `narrator_man` | 10 |
| `narrator_woman` | 11 |
| `wise_elder_man` | 12 |
| `wise_elder_woman` | 13 |
| `outgoing_anime_man` | 14 |
| `outgoing_anime_woman` | 15 |
| `scary_villain_man` | 16 |
| `scary_villain_woman` | 17 |
| `news_reporter_man` | 18 |
| `news_reporter_woman` | 19 |
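For readability in application code, it can help to keep this table around as a name-to-id mapping. This dict is not part of the helper; it simply mirrors the table above:

```python
# Name-to-sid mapping mirroring the speaker table above.
SPEAKER_IDS = {
    "skilled_hero_man": 0, "skilled_hero_woman": 1,
    "fun_hero_man": 2, "fun_hero_woman": 3,
    "strong_hero_man": 4, "strong_hero_woman": 5,
    "mad_scientist_man": 6, "mad_scientist_woman": 7,
    "clever_villain_man": 8, "clever_villain_woman": 9,
    "narrator_man": 10, "narrator_woman": 11,
    "wise_elder_man": 12, "wise_elder_woman": 13,
    "outgoing_anime_man": 14, "outgoing_anime_woman": 15,
    "scary_villain_man": 16, "scary_villain_woman": 17,
    "news_reporter_man": 18, "news_reporter_woman": 19,
}

audio = generator.generate(
    text="Breaking news tonight.",
    sid=SPEAKER_IDS["news_reporter_woman"],
    emotion="neutral",
    intensity=0.5,
)
```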
## Emotions
Supported emotion labels:
- `neutral`
- `joy`
- `sadness`
- `anger`
- `fear`
- `surprise`
- `dismissive`
- `confusion`
`intensity` is clamped to the range `[0.0, 1.0]` and mapped to one of ten discrete intensity levels.
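One plausible way the clamping and bucketing could look (the helper's actual quantization is internal; this sketch just assumes ten evenly spaced levels):

```python
# Hypothetical sketch: clamp intensity to [0.0, 1.0] and map it to one
# of ten discrete levels (0..9), assuming even spacing.
def intensity_level(intensity: float, num_levels: int = 10) -> int:
    clamped = min(max(intensity, 0.0), 1.0)
    return min(int(clamped * num_levels), num_levels - 1)

assert intensity_level(0.8) == 8
assert intensity_level(1.0) == 9  # top of the range stays in-bounds
```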
## Notes
- `model.safetensors` is the main inference artifact in this repo.
- `config.json` includes the token mapping needed by the processor.
- The current release uses a multi-speaker model with 20 speakers.