Type a sentence. Pick a voice. Turn the dial.
At 0 you hear it spoken cleanly. At 4 you hear it as wordless glossolalia: invented words that obey English sound-rules but mean nothing, in the same voice. The middle of the dial is the point. At 2 the sentence is half-dissolved, recognizable but slipping, not a clean cut between speech and noise.
The dial is a learned scalar conditioner. A small network maps the dial position to a vector that is added to F5-TTS's time embedding (the same AdaLN pathway the model uses for the diffusion timestep), and it is co-trained with a LoRA. The naive version, appending a "tongues N" token to the prompt, failed: F5-TTS has no language-model front end, so it read the level word aloud and intelligibility moved the wrong way (Spearman -0.70). Because the conditioning is a non-text scalar, the model cannot speak it, and the LoRA only has to learn one thing: the per-level audio transformation.
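A minimal sketch of the idea, not the Space's actual code: a two-layer network turns the dial value into a vector that is summed into an existing time embedding. The names `DialConditioner` and `condition_time_embedding`, the hidden width, the tanh nonlinearity, and the zero-initialised output layer are all assumptions made for illustration. NumPy stands in for the training framework, so nothing here touches F5-TTS itself.

```python
import numpy as np


class DialConditioner:
    """Hypothetical small MLP: dial position (0..max_level) -> embedding offset."""

    def __init__(self, dim, hidden=64, max_level=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.max_level = max_level
        self.w1 = rng.normal(0.0, 0.02, size=(1, hidden))
        self.b1 = np.zeros(hidden)
        # Assumption: zero-init the output layer so the untrained conditioner
        # adds nothing and the base model's behaviour is unchanged at step 0.
        self.w2 = np.zeros((hidden, dim))
        self.b2 = np.zeros(dim)

    def __call__(self, dial):
        # Normalise the dial to [0, 1] and treat it as a 1-feature input.
        x = np.asarray(dial, dtype=float).reshape(-1, 1) / self.max_level
        h = np.tanh(x @ self.w1 + self.b1)
        return h @ self.w2 + self.b2  # shape: (batch, dim)


def condition_time_embedding(time_emb, dial, conditioner):
    """Add the dial vector to the time embedding, which then feeds AdaLN as usual."""
    return time_emb + conditioner(dial)


if __name__ == "__main__":
    cond = DialConditioner(dim=8)
    t_emb = np.ones((2, 8))                      # stand-in for a real time embedding
    out = condition_time_embedding(t_emb, [0.0, 4.0], cond)
    print(out.shape)
```

Because the dial never passes through the text encoder, there is no token for the model to read aloud. The offset only modulates the layers that the time embedding already drives.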
Live: turn the dial, then hit "play this dial". Or hit "dissolve" to hear the whole 0 to 4 sweep crossfaded into one take.
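One way the dissolve sweep could be assembled, sketched under strong assumptions: the five per-level renderings are already time-aligned mono arrays of equal length, and the crossfade is a simple linear blend between neighbouring levels as playback advances. The function name `dissolve` is hypothetical, and the Space may use a different fade curve or alignment step.

```python
import numpy as np


def dissolve(takes):
    """Crossfade equal-length takes (level 0 first) into a single sweep.

    takes: array-like of shape (n_levels, n_samples).
    At the start the output is pure level 0, at the end pure top level,
    and in between it blends linearly between the two nearest levels.
    """
    takes = np.asarray(takes, dtype=float)
    n_levels, n_samples = takes.shape

    # Dial position for every output sample, sweeping 0 -> n_levels - 1.
    pos = np.linspace(0.0, n_levels - 1, n_samples)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, n_levels - 1)
    frac = pos - lo

    idx = np.arange(n_samples)
    return (1.0 - frac) * takes[lo, idx] + frac * takes[hi, idx]


if __name__ == "__main__":
    # Toy input: each "take" is a constant signal equal to its level.
    toy = [[float(level)] * 9 for level in range(5)]
    print(dissolve(toy))
```

With the toy input the output rises smoothly from 0 to 4, which is the behaviour the sweep is after: no hard cut between adjacent levels.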
Why it is worth a look
No shipped product, open or closed, gives you a typed-input, graded, voice-locked slide into glossolalia. Emotion and prosody sliders (Hume, ElevenLabs) move other axes and optimise for intelligibility. The closest research (dysarthric-speech clones, discrete lyric-swap edits) solves a different problem. The originality here is the interaction, not the model: a continuous, learned intelligibility axis on one token.
It is also not a DSP trick. Reverb, formant-shift, and vocoders act uniformly on audio that already exists. They cannot read a sentence you just invented and erode specific words into different but plausible ones while holding syllable count, stress, and voice. Only a model trained for it can, which is why this needed a fine-tune.
Two modes
Tongues: true glossolalia. The dial conditions the LoRA to slur the sentence into invented, pronounceable pseudo-words. "she sells seashells by the seashore" becomes something like "she'll sell sicials by the sohar" at the middle of the dial, and wordless tongues at the top.
Ghost: mondegreen. Real English words are swapped for similar-sounding real words ("seashells" to "seagulls"), the misheard-lyric effect. More words change as the dial rises. This is pareidolia, not glossolalia, and is labeled as such.