Space: build-small-hackathon/glossolalia

Apply for a GPU community grant: Personal project

by akshan-main - opened 17 days ago

Type a sentence. Pick a voice. Turn the dial.

At 0 you hear it spoken cleanly. At 4 you hear it as wordless glossolalia: invented words that obey English sound-rules but mean nothing, in the same voice. The middle of the dial is the point. At 2 the sentence is half-dissolved, recognizable but slipping, not a clean cut between speech and noise.

The dial is a learned scalar conditioner. A small network maps the dial position to a vector that is added to F5-TTS's time embedding (the same AdaLN pathway the model uses for the diffusion timestep), and it is co-trained with a LoRA. The naive version (appending a "tongues N" token to the prompt) failed: F5-TTS has no language-model front end, so it read the level word aloud, and intelligibility moved the wrong way (Spearman -0.70). Making the conditioning a non-text scalar means the model cannot speak it, and the LoRA only has to learn one thing: the per-level audio transformation.
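To make the idea concrete, here is a minimal, dependency-free sketch of a scalar conditioner, not the Space's actual code. The names (DialConditioner, condition_time_embedding), the layer sizes, the tanh hidden layer, and the zero-initialised output layer are all assumptions for illustration. The real version would be a trainable module inside F5-TTS, not plain Python lists.

```python
import math
import random

MAX_LEVEL = 4.0  # top of the dial, as described in the post


class DialConditioner:
    """Tiny MLP: dial level (scalar) -> vector with the time embedding's size.

    Hypothetical sketch. Sizes and initialisation are assumptions.
    """

    def __init__(self, embed_dim, hidden_dim=16, seed=0):
        rng = random.Random(seed)
        # First layer: scalar input -> hidden units.
        self.w1 = [rng.gauss(0.0, 0.5) for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        # Output layer starts at zero (an assumption), so an untrained
        # conditioner leaves the base model's behaviour unchanged.
        self.w2 = [[0.0] * hidden_dim for _ in range(embed_dim)]
        self.b2 = [0.0] * embed_dim

    def __call__(self, level):
        # Clamp to the dial range and normalise to [0, 1].
        x = min(max(level, 0.0), MAX_LEVEL) / MAX_LEVEL
        hidden = [math.tanh(w * x + b) for w, b in zip(self.w1, self.b1)]
        return [
            sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(self.w2, self.b2)
        ]


def condition_time_embedding(time_emb, dial_vec):
    """Add the dial vector to the timestep embedding, element by element."""
    return [t + d for t, d in zip(time_emb, dial_vec)]


conditioner = DialConditioner(embed_dim=8)
time_emb = [0.5] * 8  # stand-in for a real timestep embedding
conditioned = condition_time_embedding(time_emb, conditioner(2.0))
```

Because the level enters as a number rather than as text, there is nothing for the model to read aloud. Fractional positions such as 2.5 are valid inputs to the same function.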

Live: turn the dial, then hit "play this dial". Or hit "dissolve" to hear the whole 0 to 4 sweep crossfaded into one take.
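One plausible way to build a crossfaded sweep is to join per-level clips with a short linear fade at each seam. The sketch below assumes that reading. The function names, the linear fade shape, and the use of plain lists of float samples are illustrative choices, and how the Space actually slices and joins its renders is not stated here.

```python
def crossfade(a, b, overlap):
    """Join two clips, fading a out and b in over `overlap` samples."""
    overlap = max(0, min(overlap, len(a), len(b)))
    if overlap == 0:
        return list(a) + list(b)
    head = list(a[: len(a) - overlap])
    tail = list(b[overlap:])
    mixed = []
    for i in range(overlap):
        # Weight rises linearly from just above 0 to just below 1.
        w = (i + 1) / (overlap + 1)
        mixed.append((1.0 - w) * a[len(a) - overlap + i] + w * b[i])
    return head + mixed + tail


def dissolve(clips, overlap):
    """Chain clips (for example one per dial level, 0 to 4) into one take."""
    out = list(clips[0])
    for clip in clips[1:]:
        out = crossfade(out, clip, overlap)
    return out


# Stand-in clips: constant signals whose value is the dial level.
clips = [[float(level)] * 4 for level in range(5)]
sweep = dissolve(clips, overlap=2)
```

Each seam shortens the result by `overlap` samples, so the overlap length trades smoothness against how much of each level stays audible on its own.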

Why it is worth a look

No shipped product, open or closed, gives you a typed-input, graded, voice-locked slide into glossolalia. Emotion and prosody sliders (Hume, ElevenLabs) move other axes and optimise for intelligibility. The closest research (dysarthric-speech clones, discrete lyric-swap edits) solves a different problem. The originality here is the interaction, not the model: a continuous, learned intelligibility axis on one token.

It is also not a DSP trick. Reverb, formant-shift, and vocoders act uniformly on audio that already exists. They cannot read a sentence you just invented and erode specific words into different but plausible ones while holding syllable count, stress, and voice. Only a model trained for it can, which is why this needed a fine-tune.

Two modes

Tongues: true glossolalia. The dial conditions the LoRA to slur the sentence into invented, pronounceable pseudo-words. "she sells seashells by the seashore" becomes something like "she'll sell sicials by the sohar" at the middle of the dial, and wordless tongues at the top.
Ghost: mondegreen. Real English words are swapped for similar-sounding real words ("seashells" to "seagulls"), the misheard-lyric effect. More words change as the dial rises. This is pareidolia, not glossolalia, and is labeled as such.
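The Ghost behaviour ("more words change as the dial rises") can be illustrated at the text level with a toy lookup. This is only a sketch of the target behaviour, not how the Space produces it. The table is hypothetical apart from the seashells to seagulls pair, and the left-to-right swap order and the rounding rule are assumptions.

```python
import math

MAX_LEVEL = 4  # top of the dial

# Hypothetical sound-alike table. Only "seashells" -> "seagulls"
# comes from the post; the other pairs are made up for illustration.
SOUND_ALIKES = {
    "sells": "cells",
    "seashells": "seagulls",
    "seashore": "seesaw",
}


def ghost_text(sentence, level):
    """Swap a share of the swappable words that grows with the dial level."""
    words = sentence.split()
    swappable = [i for i, w in enumerate(words) if w.lower() in SOUND_ALIKES]
    level = min(max(level, 0), MAX_LEVEL)
    # Level 0 swaps nothing, the top level swaps every swappable word.
    n_swaps = math.ceil(len(swappable) * level / MAX_LEVEL)
    for i in swappable[:n_swaps]:  # assumed order: left to right
        words[i] = SOUND_ALIKES[words[i].lower()]
    return " ".join(words)


for level in range(MAX_LEVEL + 1):
    print(level, ghost_text("she sells seashells by the seashore", level))
```

Every output word is still a real English word at every level, which is the property that separates Ghost from Tongues.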
